Anomaly Detection using Deep One-Class Classifier Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018
Previous Approach I
Anomaly Detection and Localization Using GAN and One-Class Classifier
Reference: Satellite Image Forgery Detection and Localization Using GAN and One-Class Classifier, https://arxiv.org/abs/1802.04881
Anomaly Detection
• Detect observations that deviate from normal → one-class classification, or one-class description.
• Here: a generative adversarial network (GAN) or an auto-encoder is used to map normal images to features, and a one-class support vector machine (SVM) determines their distribution. For a query image, we then check whether it lies inside the determined distribution.
Problem formulation
• When an unseen or unfamiliar object that is not in the training images appears, its region is marked with a binary mask as in the figure.
(Figure: trained image and its mask; query image with an unfamiliar object and its mask.)
Method ๐ด ๐‘’ X h ๐ด ๐‘‘ ๐‘‹ X min ๐บ max ๐ท ๐‘‰(๐ท, ๐บ) = ๐ธ ๐‘‹~๐‘ ๐‘‘๐‘Ž๐‘ก๐‘Ž log ๐ท ๐‘‹ + log(1 โˆ’ ๐ท ๐บ ๐‘‹ ) ๐‘‹ = ๐บ ๐‘‹ = ๐ด ๐‘‘ โ„Ž = ๐ด ๐‘‘ ๐ด ๐‘’(๐‘‹) โ€ข Auto-encoder๋ฅผ ์ด์šฉํ•˜์—ฌ image๋กœ๋ถ€ํ„ฐ feature(h) ๊ตฌํ•˜๊ณ  ์ด๋ฅผ ๋‹ค์‹œ ๋ณต์›. ๋ณต์›๋œ image์™€ ์› image๋ฅผ ์ด์šฉํ•˜์—ฌ GAN์„ ํ›ˆ๋ จ ๏ƒจ Auto- encoder ๋ณด๋‹ค ์•ฝ๊ฐ„์˜ ์„ฑ๋Šฅํ–ฅ์ƒ โ€ข ์ •์ƒ image์— ๋Œ€ํ•œ latent space์˜ distribution์„ ์ฐพ์•„ ๋ƒ„.
Method One class Classifier
Method
- Features from normal patches (i.e., red dots) cluster together, whereas features from abnormal patches (i.e., blue dots) are more distant.
- A query image is fed into the encoder of the trained auto-encoder to compute its latent vector.
- We then decide whether the computed latent vector falls inside the cluster of normal images → here a one-class SVM with radial basis functions (Gaussian kernel, a parametric model of the cluster) is used, as sketched below.
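A minimal sketch of this decision step with scikit-learn's OneClassSVM (RBF kernel); the latent vectors here are random stand-ins for the encoder outputs, and the hyper-parameter values are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
h_train = rng.normal(size=(500, 2048))   # stand-in for latent vectors of normal patches
h_query = rng.normal(size=(10, 2048))    # stand-in for latent vectors of query patches

# RBF (Gaussian) kernel one-class SVM fitted on normal features only
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
oc_svm.fit(h_train)

# +1 -> inside the learned normal distribution, -1 -> anomaly (e.g. a forged patch)
print(oc_svm.predict(h_query))
```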
Non-linear SVM classifier using the RBF (radial-basis function) kernel
We solve the problem of classifying nonlinearly separable patterns in a hybrid manner involving two stages:
• First: transform a given set of nonlinearly separable patterns into a new set for which, under certain conditions, the likelihood of the transformed patterns becoming linearly separable is high.
• Second: complete the solution of the classification problem using stochastic gradient descent.
Linear SVM (Support Vector Machines)
To define an optimal hyperplane we need to maximize the width of the margin, i.e. minimize ‖w‖. We find w and b by solving the corresponding objective function using quadratic programming. (Figure: separating hyperplane, margin, and support vectors.)
Non-Linear SVM (Support Vector Machines): kernel trick
• The simplest way to separate two groups of data is with a straight line (1 dimension), a flat plane (2 dimensions) or an N-dimensional hyperplane.
• However, there are situations where a nonlinear region can separate the groups more efficiently.
• The kernel function transforms the data into a higher-dimensional feature space to make it possible to perform the linear separation.
Non-Linear SVM (Support Vector Machines)
To simplify the classification task, we map from the input space to a feature space; a non-linear SVM classifier using the RBF (radial-basis function) kernel is adopted. The kernel is an inner product in feature space (a measure of similarity).
Key Idea of Kernel Methods
$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
Key Idea of Kernel Methods
Normal condition (cluster bound): $\exp\!\left\{-\frac{(x_1-c_1)^2 + (x_2-c_2)^2}{2\sigma^2}\right\} \geq \text{Threshold}$, with $0 < \text{Threshold} \ll 1$,
which is equivalent to $(x_1-c_1)^2 + (x_2-c_2)^2 \leq r^2$, i.e. $K_1 + K_2 \leq r^2$.
(Figure: circle of radius r centered at $(c_1, c_2)$ in the $(x_1, x_2)$ plane.)
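A small numeric check of this equivalence between thresholding the Gaussian kernel and bounding the squared distance (center, width and threshold values are illustrative):

```python
import numpy as np

c = np.array([1.0, 2.0])       # cluster center (c1, c2), illustrative
sigma, threshold = 1.5, 0.1    # kernel width and threshold in (0, 1)

def inside_cluster(x):
    # Gaussian-kernel similarity to the center
    k = np.exp(-np.sum((x - c) ** 2) / (2 * sigma ** 2))
    return k >= threshold      # equivalent to ||x - c||^2 <= r^2

# the implied radius: r^2 = -2 * sigma^2 * ln(threshold)
r2 = -2 * sigma ** 2 * np.log(threshold)
x = np.array([1.5, 2.5])
print(inside_cluster(x), np.sum((x - c) ** 2) <= r2)   # the two tests agree
```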
RBFN architecture
(Figure: input layer $x_1 \dots x_n$, hidden layer of RBFs with no input weights, output layer with weights $w_1 \dots w_M$ producing f(x).)
Each of the n components of the input vector x feeds forward to M basis functions whose outputs are linearly combined with weights w into the network output f(x). The output layer performs a simple weighted sum. If the RBFN is used for regression, this output is fine; however, if pattern classification is required, a hard-limiter or sigmoid function can be placed on the output neurons to give 0/1 output values.
Input data set: $X = \{x_1, x_2, \dots, x_N\}$
RBFN architecture
• For Gaussian basis functions:
$s(x_p) = w_0 + \sum_{i=1}^{M} w_i \Phi_i(x_p) = w_0 + \sum_{i=1}^{M} w_i \exp\left\{ -\sum_{j=1}^{n} \frac{(x_{pj} - c_{ij})^2}{2\sigma_{ij}^2} \right\}$
• Assume the variance $\sigma$ across each dimension is equal:
$s(x_p) = w_0 + \sum_{i=1}^{M} w_i \exp\left\{ -\frac{1}{2\sigma_i^2} \sum_{j=1}^{n} (x_{pj} - c_{ij})^2 \right\}$
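A minimal numpy sketch of this forward pass, assuming the centers $c_i$, per-unit widths $\sigma_i$ and weights w are already known (all values below are illustrative):

```python
import numpy as np

def rbfn_forward(x, centers, sigmas, weights):
    """Evaluate s(x) = w0 + sum_i w_i * exp(-||x - c_i||^2 / (2 sigma_i^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)          # squared distances to each center
    phi = np.exp(-d2 / (2.0 * sigmas ** 2))          # Gaussian basis function outputs
    return weights[0] + phi @ weights[1:]            # bias + weighted sum

# illustrative parameters: M = 3 RBF units in a 2-D input space
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
sigmas  = np.array([0.5, 0.7, 0.6])
weights = np.array([0.1, 1.0, -0.5, 0.8])            # [w0, w1, w2, w3]
print(rbfn_forward(np.array([0.9, 1.1]), centers, sigmas, weights))
```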
RBFN for classification
(Figure: two weighted-sum output nodes, one for Category 1 and one for Category 2.)
RBFN Learning
• Design decision
  • number of hidden neurons
    • max of neurons = number of input patterns
    • more neurons ⇒ more complex, smaller tolerance
• Parameters to be learnt
  • centers
  • radii
    • A hidden neuron is more sensitive to data points near its center. This sensitivity may be tuned by adjusting the radius.
    • smaller radius ⇒ fits training data better (overfitting)
    • larger radius ⇒ less sensitivity, less overfitting, smaller network, faster execution
  • weights between hidden and output layers
RBFN Learning
The question now is: how do we train the RBF network? In other words, how do we find:
• the number and the parameters of the hidden units (the basis functions), using unlabeled data (unsupervised learning) → K-means clustering algorithm;
• the weights between the hidden layer and the output layer → recursive least-squares estimation algorithm.
RBFN Learning
(Figure: pipeline $x_p$ → K-means → centers $c_i$ → K-nearest neighbor → widths $\sigma_i$ → basis functions (design matrix A) → linear regression → weights w.)
RBFN Learning
• Use the K-means algorithm to find the centers $c_i$.
K-means Algorithm
Step 1: K initial cluster centers are chosen randomly from the samples to form K groups.
Step 2: Each sample is assigned to the group whose mean is closest to that sample.
Step 3: Adjust the mean of each group to take account of its new points.
Step 4: Repeat steps 2 and 3 until the distance between the old means and the new means of all clusters is smaller than a predefined tolerance.
Outcome: there are K clusters whose means represent the centroid of each cluster.
Advantages: (1) a fast and simple algorithm; (2) it reduces the effect of noisy samples. A minimal sketch follows below.
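A compact numpy sketch of the steps above (random initialization from the samples; the tolerance and toy data are illustrative):

```python
import numpy as np

def kmeans(X, K, tol=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial cluster centers randomly from the samples
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to the group whose mean is closest
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: adjust the mean of each group to account for its points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        # Step 4: stop when the means move less than the predefined tolerance
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# usage on toy 2-D data
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
centers, labels = kmeans(X, K=2)
```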
• Use the K-nearest-neighbour rule to find the function width $\sigma$:
$\sigma_i = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \| c_k - c_i \|^2}$, where $c_k$ is the k-th nearest neighbour of $c_i$.
• The objective is to cover the training points so that a smooth fit of the training samples can be achieved.
• RBF learning by gradient descent
Let $\Phi_i(x_p) = \exp\left( -\sum_{j=1}^{n} \frac{(x_{pj} - c_{ij})^2}{2\sigma_{ij}^2} \right)$ and $e(x_p) = d(x_p) - s(x_p)$, and define $E = \frac{1}{2} \sum_{p=1}^{N} e(x_p)^2$ (N: number of samples in the batch).
We then apply $\frac{\partial E}{\partial w_{i}}$, $\frac{\partial E}{\partial \sigma_{ij}}$ and $\frac{\partial E}{\partial c_{ij}}$.
• RBF learning by gradient descent: we have the corresponding update equations (given in the figure).
Gaussian Mixture Models and Expectation-Maximization Algorithm
Normal Distribution (1D Gaussian)
$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$, with mean $\mu$ and standard deviation $\sigma$.
2D Gaussians
$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$
• d = 2
• x = random data point (2D vector)
• $\mu$ = mean value (2D vector)
• $\Sigma$ = covariance matrix (2x2 matrix)
• The same equation holds for a 3D Gaussian.
2D Gaussians
(Figure: example 2D Gaussian densities for different means $\mu$ and covariance matrices $\Sigma$.)
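A short sketch evaluating this density with scipy and checking it against the formula directly (the mean and covariance are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                       # illustrative 2-D mean
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                  # illustrative covariance matrix

rv = multivariate_normal(mean=mu, cov=Sigma)
x = np.array([0.5, 1.2])
print(rv.pdf(x))                                # density f(x | mu, Sigma)

# the same value computed directly from the formula
d = len(mu)
diff = x - mu
pdf = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
      np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
print(pdf)
```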
Exploring Covariance Matrix
$x_i = (w_i, h_i)$: random vector, $\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T = \begin{pmatrix} \sigma_w^2 & \mathrm{cov}(w,h) \\ \mathrm{cov}(h,w) & \sigma_h^2 \end{pmatrix}$
• $\Sigma$ is symmetric
• $\Sigma$ has an eigendecomposition (SVD): $\Sigma = V \, D \, V^T$, with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$
Covariance Matrix Geometry
$\Sigma = V D V^T$, with principal axes $a = \sqrt{\lambda_1}\, v_1$ and $b = \sqrt{\lambda_2}\, v_2$. (Figure: ellipse with semi-axes a and b.)
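A small numpy sketch of this geometry on synthetic 2-D data: the sample covariance is decomposed and the ellipse semi-axes a and b are read off the eigenvalues and eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=1000)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)          # sample covariance matrix

# eigendecomposition: Sigma = V diag(lam) V^T, lam sorted in decreasing order
lam, V = np.linalg.eigh(Sigma)
order = lam.argsort()[::-1]
lam, V = lam[order], V[:, order]

a = np.sqrt(lam[0]) * V[:, 0]                   # major semi-axis of the ellipse
b = np.sqrt(lam[1]) * V[:, 1]                   # minor semi-axis
print(lam, a, b)
```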
3D Gaussians
$x_i = (r_i, g_i, b_i)$: random vector, $\Sigma = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)(x_i - \mu)^T = \begin{pmatrix} \sigma_r^2 & \mathrm{cov}(r,g) & \mathrm{cov}(r,b) \\ \mathrm{cov}(g,r) & \sigma_g^2 & \mathrm{cov}(g,b) \\ \mathrm{cov}(b,r) & \mathrm{cov}(b,g) & \sigma_b^2 \end{pmatrix}$
GMMs – Gaussian Mixture Models
• Suppose we have 1000 data points in 2D space (w, h). (Figure: scatter plot over W and H.)
GMMs – Gaussian Mixture Models
• Assume each data point is normally distributed.
• Obviously, there are 5 sets of underlying Gaussians. (Figure: the same scatter plot with five visible clusters.)
The GMM assumption
• There are K components (Gaussians).
• Each component k is specified by three parameters: weight, mean, covariance matrix.
• The total density function is:
$f_\Theta(x) = \sum_{j=1}^{K} \alpha_j \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma_j)}} \exp\left( -\frac{1}{2}(x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j) \right)$
$\Theta = \{ \alpha_j, \mu_j, \Sigma_j \}_{j=1}^{K}$, with weights $\alpha_j \geq 0 \;\; \forall j$ and $\sum_{j=1}^{K} \alpha_j = 1$.
The EM algorithm (Dempster, Laird and Rubin, 1977)
(Figure: raw data, fitted GMM with K = 6 components with means $\mu_i$ and covariances $\Sigma_i$, and the total density function.)
EM Basics
• Objective: given N data points, find the maximum-likelihood estimate of $\Theta$: $\Theta = \arg\max_\Theta f(x_1, \dots, x_N \mid \Theta)$
• Algorithm:
  1. Guess an initial $\Theta$.
  2. Perform the E step (expectation): based on $\Theta$, associate each data point with a specific Gaussian.
  3. Perform the M step (maximization): based on the data-point clustering, maximize over $\Theta$.
  4. Repeat 2-3 until convergence (~tens of iterations).
EM Details
• E-Step (estimate the probability that point t is associated with Gaussian j):
$w_{t,j} = \frac{\alpha_j f(x_t \mid \mu_j, \Sigma_j)}{\sum_{i=1}^{K} \alpha_i f(x_t \mid \mu_i, \Sigma_i)}, \quad j = 1,\dots,K, \;\; t = 1,\dots,N$
• M-Step (estimate the new parameters):
$\alpha_j^{new} = \frac{1}{N}\sum_{t=1}^{N} w_{t,j}, \qquad \mu_j^{new} = \frac{\sum_{t=1}^{N} w_{t,j}\, x_t}{\sum_{t=1}^{N} w_{t,j}}, \qquad \Sigma_j^{new} = \frac{\sum_{t=1}^{N} w_{t,j}\, (x_t - \mu_j^{new})(x_t - \mu_j^{new})^T}{\sum_{t=1}^{N} w_{t,j}}$
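A compact numpy/scipy sketch of these E and M steps (the initialization, the choice of K and the small regularization added to the covariances are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(K, 1.0 / K)                       # mixture weights
    mu = X[rng.choice(N, K, replace=False)]           # initial means taken from the data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities w[t, j]
        dens = np.column_stack([alpha[j] * multivariate_normal(mu[j], Sigma[j]).pdf(X)
                                for j in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances
        Nj = w.sum(axis=0)
        alpha = Nj / N
        mu = (w.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma
```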
EM Example
(Figures: successive EM iterations on the example data; for Gaussian j and data point t, the blue shading shows the responsibility $w_{t,j}$.)
RBF networks vs. MLP
• Learning speed: RBF networks very fast / MLP very slow
• Convergence: RBF networks almost guaranteed / MLP not guaranteed
• Response time: RBF networks slow / MLP fast
• Memory requirement: RBF networks very large / MLP small
• Hardware implementation: RBF networks IBM ZISC036 (www-5.ibm.com/fr/cdlab/zisc.html), Nestor Ni1000 / MLP Voice Direct 364 (www.sensoryinc.com)
• Generalization: RBF networks usually better / MLP usually poorer
• Hyper-parameters: RBF networks ? / MLP initial values are given!
Simulation
• The color image under analysis is split into patches (either overlapping or not) of size 64x64 pixels.
• An adversarially trained auto-encoder encodes each patch into a low-dimensional representation called the feature vector h (a 2,048-dimensional vector).
• A one-class SVM fed with h is used to detect forged patches as anomalies with respect to the feature distribution learned from normal patches.
• Once all patches are classified, a label mask for the entire image is obtained by grouping together all the patch labels (see the sketch below).
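A high-level sketch of this patch-based pipeline, assuming an `encoder` callable (the trained auto-encoder's encoder) and a fitted `oc_svm` such as the one above; only the 64x64 patch size comes from the slide, the rest is illustrative:

```python
import numpy as np

PATCH = 64                                   # patch size from the paper (64x64 pixels)

def image_to_patches(img, stride=PATCH):
    """Split an HxWx3 image into 64x64 patches and remember their positions."""
    patches, coords = [], []
    for y in range(0, img.shape[0] - PATCH + 1, stride):
        for x in range(0, img.shape[1] - PATCH + 1, stride):
            patches.append(img[y:y + PATCH, x:x + PATCH])
            coords.append((y, x))
    return np.stack(patches), coords

def label_mask(img, encoder, oc_svm):
    """Classify every patch and assemble the patch labels into a full-image mask."""
    patches, coords = image_to_patches(img)
    h = encoder(patches)                     # feature vectors (e.g. 2,048-dim per patch)
    labels = oc_svm.predict(h)               # +1 normal, -1 anomalous (forged)
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    for (y, x), lab in zip(coords, labels):
        if lab == -1:
            mask[y:y + PATCH, x:x + PATCH] = 1
    return mask
```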
Simulation: performance evaluated by the size of the object to be detected
• Small: the object is smaller than the patch size (approximately 32 pixels per side).
• Medium: the object is comparable to the patch size (approximately 64 pixels per side).
• Large: the object is larger than the patch size (approximately 128 pixels per side).
Simulation
(Figure: Query Images I and II with unfamiliar objects and their ground-truth masks I and II.)
Previous Approach II
Unsupervised Anomaly Detection with GANs to Guide Marker Discovery, https://arxiv.org/abs/1703.05921
TensorFlow implementation by Doyup Lee (POSTECH): https://github.com/LeeDoYup/AnoGAN
This work uses a GAN model trained only on normal data, as in the figure below, to decide for query data not only whether it is normal, but also, when it is abnormal, to localize the abnormal region.
1. ์ •์ƒ data๋ฅผ ์ด์šฉํ•˜์—ฌ Generator & Discriminator์˜ ํ›ˆ๋ จ - Deep convolutional generative adversarial network์„ ์ด์šฉํ•˜์—ฌ latent space(z)๋กœ ๋ถ€ํ„ฐ Generator๋ฅผ ์ด์šฉํ•˜์—ฌ ์ƒ์„ฑ๋œ image์™€ Real image๋ฅผ ๊ตฌ๋ณ„ํ•˜๋„๋ก Discriminator๋ฅผ ํ›ˆ๋ จ ๏ƒจ ์ •์ƒ data์˜ latent space(z) ๋ถ„ํฌ๋ฅผ ํ•™์Šต 2. ๋น„์ •์ƒ data์—ฌ๋ถ€์™€ ๋น„์ •์ƒ ์˜์—ญ ํŒŒ์•… - ํ›ˆ๋ จ๋œ Generator & Discriminator์˜ parameter๋ฅผ ๊ณ ์ •ํ•œ ์ฑ„ Query image์— ๋Œ€ํ•œ latent space(z)๋กœ์˜ mapping ์ž‘์—…์„ ์ˆ˜ํ–‰ ํ›ˆ๋ จ๋œ ์ •์ƒ data์˜ ๊ฒฝ์šฐ, ๊ธฐํ•™์Šต๋œ ์ •์ƒ data์˜ latent space(z) ๋กœ mapping์ด ๋˜์ง€๋งŒ, ๋น„์ •์ƒ data์˜ ๊ฒฝ์šฐ ๋ฒ—์–ด๋‚จ ๏ƒจ cost function์˜ ์˜ค์ฐจ๊ฐ€ ๋ฐœ์ƒ Anomaly Detection์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด 2๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง
1. Modeling normal data with a GAN: the generative model (distribution) of normal data is learned with a GAN.
Normal data $I_m$, with m = 1, 2, ..., M, where $I_m \in \mathbb{R}^{a \times b}$. At randomly chosen positions, K 2-D image patches of size c x c are extracted: $x = x_{k,m}$ with k = 1, 2, ..., K.
D and G are simultaneously optimized through the following two-player minimax game with value function V(G, D).
The discriminator is trained to maximize the probability of assigning the "real" label to real training examples and the "fake" label to samples from $p_g$.
2. Query data์˜ latent space Mapping Query image x๊ฐ€ ์ฃผ์–ด์งˆ ๊ฒฝ์šฐ, ์ด์™€ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ ๊ฐ€์ƒ image์ธ G(z) ์— ํ•ด๋‹นํ•˜๋Š” latent space์ƒ์˜ ์  z์„ ์ฐพ๋Š”๋‹ค. x ์™€ G(z)์˜ ์œ ์‚ฌ์—ฌ๋ถ€๋Š” query image๊ฐ€ generator์˜ ํ›ˆ๋ จ์‹œ ์‚ฌ์šฉ๋œ ์ •์ƒ data์˜ ๋ถ„ํฌ ๐‘ ๐‘”๋ฅผ ์–ด๋А ์ •๋„ ๋”ฐ๋ฅด๋А๋ƒ์— ์˜ํ•ด ๊ฒฐ์ • z์„ ์ฐพ๊ธฐ ์œ„ํ•˜์—ฌ , latent space distribution Z์—์„œ ๋žœ๋คํ•˜๊ฒŒ ์ƒ˜ํ”Œ๋œ z1 ์„ ๊ธฐํ›ˆ๋ จ๋œ generator์— ์ž…๋ ฅํ•˜์—ฌ ์–ป์€ ์ถœ๋ ฅ G(z1)์™€ x์˜ ์ฐจ(loss ftโ€™n)๋ฅผ ์ตœ ์†Œํ™”ํ•˜๋„๋ก backpropagation์„ ํ†ตํ•˜์—ฌ latent space์˜ ์ z2๋กœ update
z ์ •์ƒ image์˜ Latent space(z)๊ฐ€ 1์ฐจ์›์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜๊ณ  Z์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ถ„ํฌ๋กœ ๊ฐ€์ •ํ•˜๋ฉด ๐œ‡ ๐‘ง z๐œ‡ ๐‘ง Query image์— ๋Œ€ํ•œ latent space(z) mapping์€ i) ์ž„์˜์˜ ๊ฐ’ ๐‘ง1์—์„œ ์‹œ์ž‘ํ•˜์—ฌ loss ftโ€™n์„ ์ตœ์†Œํ™”ํ•˜๋„๋ก update ii) ์ฃผ์–ด์ง„ ฮ“๋ฒˆ์งธ iteration ํ›„ ๐‘งฮ“์ด allowable range์•ˆ์— ๋“ค์–ด์™”๋Š”์ง€ ์—ฌ๋ถ€์— ๋•Œ๋ผ ์ •์ƒ, ๋น„์ •์ƒ์„ ๊ตฌ๋ถ„ ๐‘ง1 ๐‘ง2 ๐‘งฮ“ Allowable range
Definition of the loss function for mapping the query image
• Overall loss, or anomaly score.
• The anomaly score consists of two parts:
  • Residual loss: visual similarity.
  • Discrimination loss: enforces the generated image to lie on the manifold.
Improved discrimination loss based on feature matching
• f(·): the output of an intermediate layer of the discriminator.
• It represents some statistics of an input image.
This approach uses the trained discriminator not as a classifier but as a feature extractor.
3. Anomaly Detection
Anomaly score A(x): how well the query image x conforms to the normal images.
R(x): residual loss after Γ backpropagation steps.
D(x): discrimination loss after Γ backpropagation steps.
Abnormal image: A(x) is large. Normal image: A(x) is small.
$x_R = x - G(z_\Gamma)$. The residual error highlights the abnormal regions within the image.
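A minimal PyTorch sketch of this mapping and scoring, assuming a trained generator `G` and a discriminator feature extractor `f` (an intermediate layer output); the weighting A(x) = (1-λ)·R(x) + λ·D(x) follows the AnoGAN paper, and the step count and λ match the values quoted in the experiments below (500 steps, λ = 0.1):

```python
import torch

def map_to_latent(x, G, f, z_dim=100, n_steps=500, lam=0.1, lr=0.05):
    """Find z that makes G(z) most similar to query x; return anomaly score and residual map."""
    z = torch.randn(1, z_dim, requires_grad=True)            # start from a random z1
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):                                  # Gamma backpropagation steps
        opt.zero_grad()
        x_hat = G(z)
        residual = torch.sum(torch.abs(x - x_hat))            # R(x): visual similarity
        discrimination = torch.sum(torch.abs(f(x) - f(x_hat)))  # D(x): feature matching
        loss = (1 - lam) * residual + lam * discrimination
        loss.backward()
        opt.step()
    with torch.no_grad():
        x_hat = G(z)
        score = (1 - lam) * torch.sum(torch.abs(x - x_hat)) + \
                lam * torch.sum(torch.abs(f(x) - f(x_hat)))   # A(x)
        residual_map = torch.abs(x - x_hat)                   # x_R highlights abnormal regions
    return score.item(), residual_map
```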
4. Experiments
The experiments use optical coherence tomography (OCT) images, which image the retinal layers in 3D.
• Data, Data Selection and Preprocessing
i) Training sets:
- 2D image patches extracted from 270 clinical OCT volumes of healthy subjects.
- The gray values were normalized to the range -1 to 1.
- In total, 1,000,000 2D training patches with a resolution of 64x64 pixels were extracted at randomly sampled positions.
ii) Testing sets:
- Patches were extracted from 10 additional healthy cases and 10 pathological cases, which contained retinal fluid.
- The test set consisted of 8,192 image patches in total and comprised normal and pathological samples.
iii) Model description
- Adopted the DCGAN architecture, which resulted in stable GAN training on images of size 64x64 pixels.
- Utilized intermediate representations with 512-256-128-64 channels (instead of 1024-512-256-128).
- Discrimination loss: feature representations of the last convolution layer of the discriminator were used.
- Training was performed for 20 epochs using the Adam optimizer.
- Ran 500 backpropagation steps for the mapping of new images to the latent space.
- Used λ = 0.1 in the loss function.
5. Experiments
i) Generative capability of the DCGAN
(Figure: given image, generated image, residual overlay, and pixel-level annotations of retinal fluid for normal and anomalous images.)
ii) Detection performance
(Figures: ROC curves; distribution of the residual score (c) and of the discrimination score (d).)
In latent space, the distributions of normal data (training data and the normal portion of the test data) are similar, but they clearly differ from the abnormal portion of the test data.
Problems in Previous Approach
- Can't control the shape and boundary of the cluster.
- Can't control the ambiguous points at the boundary.
→ Let's find a way to control the shape of the cluster and the ambiguous points at the boundary.
Support Vector Data Description (SVDD)
SVDD is the smallest enclosing ball problem, and its variants are:
• the minimum enclosing ball problem with errors;
• the minimum enclosing ball problem in an RKHS (Reproducing Kernel Hilbert Space);
• the two-class support vector data description (SVDD).
SOLUTIONS FOR SOLVING DATA DESCRIPTION
• One class is the target class, and all other data is outlier data.
• Create a spherically shaped boundary around the complete target set.
• To minimize the chance of accepting outliers, the volume of this description is minimized.
• Outlier sensitivity can be controlled by changing the ball-shaped boundary into a more flexible boundary.
• Example outliers can be included in the training procedure to find a more efficient description.
1. The minimum enclosing ball problem [Tax and Duin, 2004]
(Figure: enclosing ball with its center and radius R.)
2. The minimum enclosing ball problem with errors
2. The minimum enclosing ball problem with errors
NORMAL DATA DESCRIPTION
- We assume vectors x are column vectors.
- We have a training set {x_i}, i = 1, ..., N, for which we want to obtain a description.
- We further assume that the data shows variance in all feature directions.
• The sphere is characterized by center a and radius R > 0.
• We minimize the volume of the sphere by minimizing R², and demand that the sphere contains all training objects x_i.
• To allow the possibility of outliers in the training set, the squared distance from x_i to the center a is not required to be strictly smaller than R², but larger distances should be penalized.
- Minimization problem: $F(R, a) = R^2 + C\sum_i \xi_i$ with constraints $\|x_i - a\|^2 \leq R^2 + \xi_i$, $\xi_i \geq 0$.
2. The minimum enclosing ball problem with errors
NORMAL DATA DESCRIPTION
Lagrange function:
$L(R, a, \alpha_i, \gamma_i, \xi_i) = R^2 + C\sum_i \xi_i - \sum_i \alpha_i \left\{ R^2 + \xi_i - (\|x_i\|^2 - 2\, a \cdot x_i + \|a\|^2) \right\} - \sum_i \gamma_i \xi_i$
L should be minimized with respect to R, a, $\xi_i$ and maximized with respect to $\alpha_i$ and $\gamma_i$, subject to $0 \leq \alpha_i \leq C$.
2. The minimum enclosing ball problem with errors
NORMAL DATA DESCRIPTION
There are 3 cases for the support vectors.
The hypersphere's center can be determined as $a = \sum_i \alpha_i x_i$.
The hypersphere's radius can be determined by selecting an arbitrary support vector $x_s$ on the boundary:
$R^2 = \|x_s - a\|^2 = x_s \cdot x_s - 2\sum_i \alpha_i (x_i \cdot x_s) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j)$
2. The minimum enclosing ball problem with errors
TESTING A NEW DATA POINT $x_k$
To test whether a new data point $x_k$ is within the sphere, its distance to the center of the sphere has to be calculated. A test point $x_k$ is normal when this distance is smaller than the radius: $\|x_k - a\|^2 \leq R^2$.
2. The minimum enclosing ball problem with errors Please refer to Python Code for SVDD : https://wikidocs.net/3431
2. The minimum enclosing ball problem with errors
SVDD with negative examples
- When negative examples (objects which should be rejected) are available, they can be incorporated in the training to improve the description.
- In contrast with the training (target) examples, which should be within the sphere, the negative examples should be outside it.
→ Minimization problem and constraints: given in the figure.
3. The minimum enclosing ball problem in a RKHS
• Minimum enclosing ball problem with errors.
• The inner product can be substituted by a general kernel function such as the Gaussian kernel: $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / s^2\right)$.
Test of a new point $x_k$:
$\|x_k - a\|^2 = K(x_k, x_k) - 2\sum_i \alpha_i K(x_i, x_k) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \leq R^2$, subject to $0 \leq \alpha_i \leq C$.
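A small numpy sketch of this kernel-distance test, assuming the support vectors `X_sv`, their coefficients `alpha`, the radius `R` and the kernel width `s` have already been obtained from the SVDD optimization:

```python
import numpy as np

def gaussian_kernel(a, b, s=2.0):
    return np.exp(-np.sum((a - b) ** 2) / s ** 2)

def svdd_distance2(x_new, X_sv, alpha, s=2.0):
    """Squared kernel distance of x_new to the SVDD center a = sum_i alpha_i Phi(x_i)."""
    k_xx = gaussian_kernel(x_new, x_new, s)                       # = 1 for the RBF kernel
    k_xi = np.array([gaussian_kernel(x_i, x_new, s) for x_i in X_sv])
    k_ij = np.array([[gaussian_kernel(xi, xj, s) for xj in X_sv] for xi in X_sv])
    return k_xx - 2 * alpha @ k_xi + alpha @ k_ij @ alpha

def is_normal(x_new, X_sv, alpha, R, s=2.0):
    return svdd_distance2(x_new, X_sv, alpha, s) <= R ** 2        # inside the ball -> normal
```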
3. The minimum enclosing ball problem in a RKHS
- A test object is accepted when its kernel distance to the center is at most R (the inequality above).
- For small values of s, all objects become support vectors; for very large s, the solution approximates the original spherically shaped solution.
- Decreasing the parameter C constrains the values of $\alpha_i$ more, and more objects become support vectors.
- Also, with decreasing C the error on the target class increases, but the covered volume of the data description decreases.
4. The two class Support vector data description (SVDD)
The two class SVDD vs. one class SVDD
Deep Support Vector Data Description (Deep SVDD)
Deep SVDD learns a neural network transformation φ(· ; W) with weights W from the input space X ⊆ R^d to an output space F ⊆ R^p that attempts to map most of the data's network representations into a hypersphere of minimum volume characterized by center c and radius R. Mappings of normal examples fall within the hypersphere, whereas mappings of anomalies fall outside it.
Deep Support Vector Data Description (Deep SVDD)
Given some training data on X, we define the soft-boundary Deep SVDD objective as follows:
- The first term: minimizing R² minimizes the volume of the hypersphere.
- The second term is a penalty for points lying outside the sphere after being passed through the network, i.e. whose distance to the center is greater than the radius R.
- The last term is a regularizer on the network parameters W.
Deep Support Vector Data Description (Deep SVDD)
To achieve this, the network must extract the common factors of variation of the data. As a result, normal examples are closely mapped to the center c, whereas anomalous examples are mapped further away from the center or outside of the hypersphere. Through this we obtain a compact description of the normal class.
(Figure: normal data mapped inside the hypersphere, anomalous data mapped outside it.)
One-Class Deep SVDD objective
One-Class Deep SVDD simply employs a quadratic loss for penalizing the distance of every network representation to c; it contracts the sphere by minimizing the mean distance of all data representations to the center. A minimal sketch of both variants follows below.
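A minimal PyTorch sketch of the two Deep SVDD losses and the anomaly score; the network `phi` and center `c` are assumed given, ν is the soft-boundary trade-off parameter from the paper, and the weight-decay regularizer on W is left to the optimizer:

```python
import torch

def one_class_deep_svdd_loss(phi, x, c):
    """Quadratic loss: mean squared distance of the representations phi(x) to the center c."""
    z = phi(x)                                            # network representations, shape (B, p)
    return torch.mean(torch.sum((z - c) ** 2, dim=1))

def soft_boundary_loss(phi, x, c, R, nu=0.1):
    """Soft-boundary variant: R^2 plus a penalty for points falling outside the sphere."""
    dist2 = torch.sum((phi(x) - c) ** 2, dim=1)
    return R ** 2 + torch.mean(torch.clamp(dist2 - R ** 2, min=0)) / nu

def anomaly_score(phi, x, c):
    """Score s(x): squared distance of the representation to the hypersphere center."""
    with torch.no_grad():
        return torch.sum((phi(x) - c) ** 2, dim=1)
```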
Anomaly Score
For a given test point x ∈ X, an anomaly score s can be defined for both variants of Deep SVDD by the distance of the point's representation to the center of the hypersphere.
(Figure: anomaly-score distributions for normal and anomalous data under the conventional approach vs. Deep SVDD.)
One-class classification on MNIST and CIFAR-10: network architectures
Each convolutional module consists of a convolutional layer followed by leaky ReLU activations and 2x2 max-pooling.
On MNIST: a CNN with two modules, 8 (5x5x1) filters followed by 4 (5x5x1) filters, and a final dense layer of 32 units.
On CIFAR-10: a CNN with three modules, 32 (5x5x3) filters, 64 (5x5x3) filters, and 128 (5x5x3) filters, followed by a final dense layer of 128 units.
A batch size of 200 is used, and the weight decay hyper-parameter is set to λ = 10^-6.
One-class classification on MNIST and CIFAR-10: data setup
Both MNIST and CIFAR-10 have ten different classes, from which we create ten one-class classification setups. In each setup, one of the classes is the normal class and samples from the remaining classes are used to represent anomalies. We train only with training-set examples from the respective normal class, giving training-set sizes of n ≈ 6,000 for MNIST and n = 5,000 for CIFAR-10. Both test sets have 10,000 samples, including samples from the nine anomalous classes for each setup. All images are pre-processed with global contrast normalization using the L1 norm and finally rescaled to [0, 1] via min-max scaling.
One-class classification on MNIST and CIFAR-10 Average AUCs in % with StdDevs (over 10 seeds) per method and one-class experiment on MNIST and CIFAR-10
Anomaly Detection using One-Class Neural Networks arXiv:1802.06360v1 Code : https://github.com/raghavchalapathy/oc-nn
We want to make a neural network like this!
Model architecture of Auto-encoder and the proposed one-class neural networks
One-Class Support Vector Machine
The objective is to find a hyperplane (with normal w) and its distance r from the origin such that the decision function is positive on subset A and negative on everything outside A, while maximizing the distance from the hyperplane to the origin.
(Figure: hypersphere, hyperplane, subset A, distance r, negative region.)
One-Class Support Vector Machine
In order to obtain w and r, we need to solve the corresponding optimization problem, where w is the normal vector perpendicular to the hyperplane and r is the distance of the hyperplane from the origin.
(Figure annotation: distance of the feature vector from the origin.)
One-Class NN
Using a simple feed-forward network with one hidden layer having a linear or sigmoid activation g(·) and one output node, the OC-NN objective can be formulated analogously, where w are the weights from the hidden layer to the output node, V is the weight matrix from the input to the hidden units, and X_n is an input vector. A sketch of this objective follows below.
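A short numpy sketch of this objective as stated in the OC-NN paper (arXiv:1802.06360); the hinge form, the ν parameter and the sigmoid default are assumptions taken from that paper, and the decision rule is sign(⟨w, g(Vx)⟩ − r):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ocnn_objective(w, V, r, X, nu=0.1, g=sigmoid):
    """0.5*||w||^2 + 0.5*||V||_F^2 + (1/nu) * mean(max(0, r - y_hat)) - r,
    with y_hat_n = <w, g(V x_n)> for each input x_n."""
    y_hat = g(X @ V.T) @ w                 # network output for every input row of X
    hinge = np.maximum(0.0, r - y_hat)     # penalize outputs that fall below the margin r
    return 0.5 * w @ w + 0.5 * np.sum(V * V) + hinge.mean() / nu - r

def ocnn_decision(w, V, r, X, g=sigmoid):
    return np.sign(g(X @ V.T) @ w - r)     # +1 -> normal, -1 -> anomalous
```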
Discriminative Feature Learning
A Discriminative Feature Learning
For generic object, scene, or action recognition, the deeply learned features need to be not only separable but also discriminative.
A Discriminative Feature Learning
• Previously, only the softmax loss has been considered in classification problems.
  ✓ SOFTMAX LOSS: encourages the separability of features.
• The discriminative feature learning approach considers a center loss as well.
  ✓ CENTER LOSS: simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
→ JOINT SUPERVISION: minimizes the intra-class variations while keeping the features of different classes separable (a sketch follows below).
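A minimal PyTorch sketch of this joint supervision, assuming deep features `feat`, classifier `logits`, integer `labels` and a learnable matrix of class centers; the sizes and the λ value are illustrative:

```python
import torch
import torch.nn.functional as F

num_classes, feat_dim, lam = 10, 2, 0.1                 # illustrative sizes and lambda
centers = torch.zeros(num_classes, feat_dim, requires_grad=True)  # learnable class centers
# (optimize `centers` jointly with the network parameters, e.g. by adding it to the optimizer)

def joint_loss(feat, logits, labels):
    softmax_loss = F.cross_entropy(logits, labels)       # encourages separability
    center_loss = 0.5 * ((feat - centers[labels]) ** 2).sum(dim=1).mean()  # intra-class compactness
    return softmax_loss + lam * center_loss
```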
A Discriminative Feature Learning
Detailed Discussion on Center Loss
• Easy to implement. The gradient and update equation are easy to derive, and the resulting CNN model is trainable.
• Easy to train. Centers are updated based on mini-batches with an adjustable learning rate.
• Easy to input. Center loss enjoys the same requirement as the softmax loss and needs no complex sample mining and recombination, which is unavoidable with the contrastive loss and triplet loss.
• Easy to converge. Converges faster than with the softmax loss alone.
A Discriminative Feature Learning
• With only the softmax loss (λ = 0), the deeply learned features are separable, but not discriminative (significant intra-class variations).
• With a proper λ, the discriminative power of the deep features can be significantly enhanced, which is crucial for classification problems.
