From Autoencoder to Variational Autoencoder Hao Dong Peking University 1
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) 2 From Autoencoder to Variational Autoencoder Feature Representation Distribution Representation Video: https://www.youtube.com/watch
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) 3
4 Vanilla Autoencoder β€’ What is it? Reconstruct high-dimensional data using a neural network model with a narrow bottleneck layer. The bottleneck layer captures the compressed latent coding, so the nice by-product is dimension reduction. The low-dimensional representation can be used as the representation of the data in various applications, e.g., image retrieval, data compression … [Diagram: $x \rightarrow z \rightarrow \hat{x}$ with reconstruction loss $\mathcal{L}$]
Latent code: the compressed low-dimensional representation of the input data 5 Vanilla Autoencoder β€’ How does it work? [Diagram: $x \rightarrow z \rightarrow \hat{x}$ with loss $\mathcal{L}$] encoder: X β†’ Z, decoder/generator: Z β†’ X. Input and Reconstructed Input: ideally the input and the reconstruction are identical. The encoder network performs dimension reduction, just like PCA.
6 Vanilla Autoencoder β€’ Training [Diagram: input layer $x_1 \dots x_6$, hidden layer $a_1 \dots a_4$, output layer $\hat{x}_1 \dots \hat{x}_6$; Encoder, Decoder] Given $M$ data samples β€’ The number of hidden units is usually smaller than the number of inputs β€’ Dimension reduction --- Representation learning. The distance between the input and its reconstruction can be measured by the Mean Squared Error (MSE): $\mathcal{L} = \frac{1}{M}\sum_{i=1}^{M} \lVert x_i - G(E(x_i)) \rVert^2$, where each $x_i$ has $n$ variables, $E$ is the encoder and $G$ is the decoder. β€’ The autoencoder tries to learn an approximation to the identity function, so that the input is compressed into the low-dimensional features, discovering interesting structure in the data.
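To make the training procedure concrete, here is a minimal sketch in PyTorch; the library choice, layer sizes, and optimizer settings are illustrative assumptions, not taken from the slides.

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        # Encoder E: x -> z (narrow bottleneck for dimension reduction)
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # Decoder G: z -> x_hat (reconstruction)
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = VanillaAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(x):                 # x: (batch, 784), values in [0, 1]
    optimizer.zero_grad()
    x_hat = model(x)
    loss = mse(x_hat, x)           # L = (1/M) sum ||x - G(E(x))||^2
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, `model.encoder(x)` gives the extracted features used for inference on the next slide.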
7 Vanilla Autoencoder β€’ Testing/Inference [Diagram: input layer $x_1 \dots x_6$, hidden layer $a_1 \dots a_4$ = extracted features] β€’ Autoencoder is an unsupervised learning method if we consider the latent code as the β€œoutput”. β€’ Autoencoder is also a self-supervised (self-taught) learning method, a type of supervised learning where the training labels are determined by the input data. β€’ Word2Vec (from the RNN lecture) is another unsupervised, self-taught learning example. [Figure: Autoencoder for the MNIST dataset (28Γ—28Γ—1, 784 pixels): 𝒙 β†’ Encoder β†’ 𝒙̂]
8 Vanilla Autoencoder β€’ Example: β€’ Compress MNIST (28Γ—28Γ—1) to a latent code with only 2 variables (lossy compression)
9 Vanilla Autoencoder β€’ Power of Latent Representation β€’ t-SNE visualization on MNIST: PCA vs. Autoencoder PCA Autoencoder (Winner) 2006 Science paper by Hinton and Salakhutdinov
10 Vanilla Autoencoder β€’ Discussion β€’ The hidden layer is overcomplete if it is larger than the input layer
11 Vanilla Autoencoder β€’ Discussion β€’ The hidden layer is overcomplete if it is larger than the input layer β€’ No compression β€’ No guarantee that the hidden units extract meaningful features
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) 12
13 Denoising Autoencoder (DAE) β€’ Why? β€’ Avoid overfitting β€’ Learn robust representations
14 Denoising Autoencoder β€’ Architecture [Diagram: the input $x_1 \dots x_6$ is randomly corrupted before being fed to the encoder; hidden layer $a_1 \dots a_4$; the network reconstructs the clean $\hat{x}_1 \dots \hat{x}_6$] Apply dropout between the input and the first hidden layer β€’ Improves robustness
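A sketch of the corruption step described above, reusing the hypothetical `VanillaAutoencoder` from the earlier sketch; the corruption rate of 0.3 is an arbitrary example value.

```python
import torch

def denoising_step(model, optimizer, mse, x, drop_prob=0.3):
    # Corrupt the input (dropout-style masking), but reconstruct the CLEAN x
    mask = (torch.rand_like(x) > drop_prob).float()
    x_noisy = x * mask
    optimizer.zero_grad()
    x_hat = model(x_noisy)
    loss = mse(x_hat, x)       # the target is the uncorrupted input
    loss.backward()
    optimizer.step()
    return loss.item()
```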
15 Denoising Autoencoder β€’ Feature Visualization Visualizing the learned features [Diagram: the weight vector of one hidden neuron $a_j$ is reshaped into an image patch] One neuron == one feature extractor (reshape the weight vector β†’ image)
16 Denoising Autoencoder β€’ Denoising Autoencoder & Dropout The denoising autoencoder was proposed in 2008, 4 years before the dropout paper (Hinton, et al. 2012). The denoising autoencoder can be seen as applying dropout between the input and the first layer. It can also be seen as one type of data augmentation on the input.
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) 17
18 Sparse Autoencoder β€’ Why? β€’ Even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure by imposing other constraints on the network. β€’ In particular, if we impose a β€œsparsity” constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large. [Diagram: Sigmoid hidden units $a_1 \dots a_4$ with activations 0.02 β€œinactive”, 0.97 β€œactive”, 0.01 β€œinactive”, 0.98 β€œactive”]
19 Sparse Autoencoder β€’ Recap: KL Divergence Smaller == Closer
20 Sparse Autoencoder β€’ Sparsity Regularization [Diagram: Sigmoid hidden units $a_1 \dots a_4$ with activations 0.02 β€œinactive”, 0.97 β€œactive”, 0.01 β€œinactive”, 0.98 β€œactive”] Given $M$ data samples (the batch size) and the Sigmoid activation function, the active ratio of a neuron $a_j$ is $\hat{\rho}_j = \frac{1}{M}\sum_{i=1}^{M} a_j^{(i)}$. To make the output β€œsparse”, we would like to enforce the constraint $\hat{\rho}_j = \rho$, where $\rho$ is a β€œsparsity parameter”, such as 0.2 (20% of the neurons active). The penalty term, where $s$ is the number of activation outputs, is $\mathcal{L}_\rho = \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s} \left( \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right)$. The total loss: $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \lambda \mathcal{L}_\rho$. The number of hidden units can be greater than the number of input variables.
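A sketch of the KL sparsity penalty defined above, assuming Sigmoid hidden activations; `rho = 0.2` and the small `eps` for numerical stability are example choices.

```python
import torch

def sparsity_penalty(activations, rho=0.2, eps=1e-8):
    # activations: (M, s) sigmoid outputs of the hidden layer for a batch of M samples
    rho_hat = activations.mean(dim=0)                      # average activation per hidden unit
    kl = rho * torch.log(rho / (rho_hat + eps)) + \
         (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    return kl.sum()                                        # sum over the s hidden units

# total loss: L_total = L_MSE + lam * sparsity_penalty(a)
```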
21 Sparse Autoencoder β€’ Sparsity Regularization Smaller $\rho$ == more sparse [Figure: MNIST reconstructions: Input 𝒙, Autoencoder 𝒙̂, Sparse Autoencoder 𝒙̂]
22 Sparse Autoencoder β€’ Different regularization losses on the hidden activation output:
Method 1: Sigmoid hidden activation, Sigmoid reconstruction activation, loss $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_\rho$
Method 2: ReLU hidden activation, Softplus reconstruction activation, loss $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \lVert \boldsymbol{a} \rVert_1$ (since ReLU activations are non-negative, this is simply the sum of the hidden activations)
23 Sparse Autoencoder β€’ Sparse Autoencoder vs. Denoising Autoencoder Feature Extractors of Sparse Autoencoder Feature Extractors of Denoising Autoencoder
24 Sparse Autoencoder β€’ Autoencoder vs. Denoising Autoencoder vs. Sparse Autoencoder [Figure: MNIST reconstructions: Input 𝒙, Autoencoder 𝒙̂, Sparse Autoencoder 𝒙̂, Denoising Autoencoder 𝒙̂]
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) 25
26 Contractive Autoencoder β€’ Why? β€’ The Denoising Autoencoder and the Sparse Autoencoder overcome the overcomplete problem via constraints on the input and hidden layers. β€’ Could we add an explicit term in the loss to avoid uninteresting features? We want features that ONLY reflect variations observed in the training set. https://www.youtube.com/watch?v=79sYlJ8Cvlc
27 Contractive Autoencoder β€’ How β€’ Penalize the representation for being too sensitive to the input β€’ Improve the robustness to small perturbations β€’ Measure the sensitivity by the Frobenius norm of the Jacobian matrix of the encoder activations
π‘₯ = 𝑓 𝑧 𝑧 = 𝑧! 𝑧" π‘₯ = π‘₯! π‘₯" 𝐽# = ⁄ πœ•π‘₯! πœ•π‘§! ⁄ πœ•π‘₯! πœ•π‘§" ⁄ πœ•π‘₯" πœ•π‘§! ⁄ πœ•π‘₯" πœ•π‘§" 𝐽#!" = ⁄ πœ•π‘§! πœ•π‘₯! ⁄ πœ•π‘§! πœ•π‘₯" ⁄ πœ•π‘§" πœ•π‘₯! ⁄ πœ•π‘§" πœ•π‘₯" 𝑧 = 𝑓$! π‘₯ 𝑧! + 𝑧" 2𝑧! = 𝑓 𝑧! 𝑧" 𝐽# = 1 1 2 0 π‘₯! π‘₯" = π‘₯"/2 π‘₯! βˆ’ π‘₯"/2 = 𝑓$! π‘₯! π‘₯" 𝑧! 𝑧" = 𝐽#!" = 0 1/2 1 βˆ’1/2 input output 𝐽#𝐽#!" = 𝐼 28 Contractive Autoencoder β€’ Recap: Jocobian Matrix
29 Contractive Autoencoder β€’ Jacobian Matrix
30 Contractive Autoencoder β€’ New Loss: reconstruction term + new regularization term (the squared Frobenius norm of the encoder Jacobian)
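A sketch of the new regularization term, assuming a single sigmoid encoder layer $h = \mathrm{sigmoid}(Wx + b)$ so that the Jacobian has a closed form; with a deeper encoder one would compute the Jacobian with autograd instead.

```python
import torch

def contractive_penalty(h, W):
    # h: (batch, s) sigmoid hidden activations, W: (s, d) encoder weight matrix.
    # For h = sigmoid(W x + b): dh_j/dx_i = h_j (1 - h_j) * W_ji, so
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    dh = (h * (1 - h)) ** 2                  # (batch, s)
    w_sq = (W ** 2).sum(dim=1)               # (s,)
    return (dh * w_sq).mean(dim=0).sum()     # average over the batch, sum over units

# total loss: L_total = L_MSE + lam * contractive_penalty(h, encoder_weight)
```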
31 Contractive Autoencoder β€’ vs. Denoising Autoencoder β€’ Advantages β€’ CAE can better model the distribution of raw data β€’ Disadvantages β€’ DAE is easier to implement β€’ CAE needs second-order optimization (conjugate gradient, LBFGS)
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) 32
33 Stacked Autoencoder β€’ Start from an Autoencoder: Learn Features From the Input [Diagram: input $x_1 \dots x_6$ β†’ hidden 1 ($a^1_1 \dots a^1_4$) β†’ output $\hat{x}_1 \dots \hat{x}_6$; Encoder, Decoder] The feature extractor for the input data. Red indicates trainable weights; black indicates fixed/non-trainable weights. Unsupervised.
34 Stacked Autoencoder β€’ 2nd Stage: Learn 2nd-Level Features From the 1st-Level Features [Diagram: input β†’ hidden 1 β†’ hidden 2 ($a^2_1 \dots a^2_4$) β†’ output; Encoder, Encoder, Decoder] The feature extractor built on top of the first feature extractor. Red indicates trainable weights; black indicates fixed/non-trainable weights. Unsupervised.
35 Stacked Autoencoder β€’ 3rd Stage: Learn 3rd-Level Features From the 2nd-Level Features [Diagram: input β†’ hidden 1 β†’ hidden 2 β†’ hidden 3 ($a^3_1 \dots a^3_4$) β†’ output; three Encoders, Decoder] The feature extractor built on top of the second feature extractor. Red indicates trainable weights; black indicates fixed/non-trainable weights. Unsupervised.
36 Stacked Autoencoder β€’ 4th Stage: Learn 4th-Level Features From the 3rd-Level Features [Diagram: input β†’ hidden 1 β†’ hidden 2 β†’ hidden 3 β†’ hidden 4 ($a^4_1 \dots a^4_4$) β†’ output; four Encoders, Decoder] The feature extractor built on top of the third feature extractor. Red indicates trainable weights; black indicates fixed/non-trainable weights. Unsupervised.
37 Stacked Autoencoder β€’ Use the Learned Feature Extractors for Downstream Tasks [Diagram: input β†’ hidden 1 β†’ hidden 2 β†’ hidden 3 β†’ hidden 4 β†’ classifier output $a^5_1$] Learn to classify the input data by using the labels and the high-level features. Red indicates trainable weights; black indicates fixed/non-trainable weights. Supervised.
38 Stacked Autoencoder β€’ Fine-tuning [Diagram: the full stack, input β†’ hidden 1–4 β†’ classifier output, with all weights trainable] Fine-tune the entire model for classification. Supervised.
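A sketch of the greedy layer-wise procedure from the last few slides; the layer sizes, number of epochs, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, out_dim, epochs=10, lr=1e-3):
    """Train one autoencoder stage on `data` and return the frozen encoder."""
    enc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(out_dim, in_dim), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        for x in data:                       # data: iterable of (batch, in_dim) tensors
            opt.zero_grad()
            loss = nn.functional.mse_loss(dec(enc(x)), x)
            loss.backward()
            opt.step()
    for p in enc.parameters():               # freeze: only later stages stay trainable
        p.requires_grad = False
    return enc

# Stage-wise: the features of stage k become the "input" of stage k+1, e.g.
# enc1 = pretrain_layer(raw_batches, 784, 256)
# feat1 = [enc1(x) for x in raw_batches]
# enc2 = pretrain_layer(feat1, 256, 64)
# Finally, stack enc1, enc2, ... with a classifier head and fine-tune end-to-end.
```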
39 Stacked Autoencoder β€’ Discussion β€’ Advantages β€’ … β€’ Disadvantages β€’ …
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) β€’ From Neural Network Perspective β€’ From Probability Model Perspective 40
41 Before we start β€’ Question β€’ Are the previous autoencoders generative models? β€’ Recap: We want to learn a probability distribution $p(x)$ over $x$ o Generation (sampling): $\mathbf{x}_{new} \sim p(x)$ (NO: the compressed latent codes of autoencoders do not follow a prior distribution, so an autoencoder cannot learn to represent the data distribution) o Density Estimation: $p(\mathbf{x})$ should be high if $\mathbf{x}$ looks like real data (NO) o Unsupervised Representation Learning: discovering the underlying structure of the data distribution (e.g., ears, nose, eyes …) (YES: autoencoders learn feature representations)
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) β€’ From Neural Network Perspective β€’ From Probability Model Perspective 42
43 Variational Autoencoder β€’ How to perform generation (sampling)? [Diagram: autoencoder with input layer $x_1 \dots x_6$, hidden layer $z_1 \dots z_4$, output layer $\hat{x}_1 \dots \hat{x}_6$; and a decoder-only path where $z_1 \dots z_4$ is sampled from $N(0, 1)$] Can the hidden output follow a prior distribution, e.g., the Normal distribution? The Decoder (Generator) maps $N(0, 1)$ to the data space: $p(X) = \sum_{Z} p(X|Z)\, p(Z)$. Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
44 Variational Autoencoder β€’ Quick Overview [Diagram: bidirectional mapping between the Data Space 𝒙 and the Latent Space $N(0,1)$: inference (encoder) $q(z|x)$, generation (decoder) $p(x|z)$; losses $\mathcal{L}_{MSE}$ and $\mathcal{L}_{KL}$] $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_{KL}$ Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
45 Variational Autoencoder β€’ The neural net perspective β€’ A variational autoencoder consists of an encoder, a decoder, and a loss function Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
46 Variational Autoencoder β€’ Encoder, Decoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
47 Variational Autoencoder β€’ Loss function Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 [Figure: the loss has a reconstruction term, which can be represented by MSE, and a regularization term]
48 Variational Autoencoder β€’ Why KL(Q||P), not KL(P||Q)? β€’ Which direction of the KL divergence to use? β€’ Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability: the left panel β€’ VAE requires an approximation that rarely places high probability anywhere that the true distribution places low probability: the right panel
49 Variational Autoencoder β€’ Reparameterization Trick Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 [Diagram: input $x_1 \dots x_6$ β†’ hidden $h_1 \dots h_6$ β†’ predicted means $\mu_1 \dots \mu_4$ and predicted standard deviations $\delta_1 \dots \delta_4$ β†’ resampled $z_1 \dots z_4$ with $z_i \sim N(\mu_i, \delta_i)$ β†’ reconstruction $\hat{x}_1 \dots \hat{x}_6$] 1. Encode the input 2. Predict the means 3. Predict the standard deviations 4. Use the predicted means and standard deviations to sample new latent variables individually 5. Reconstruct the input. The latent variables are independent.
50 Variational Autoencoder β€’ Reparameterization Trick β€’ Sampling z ~ N(ΞΌ, Οƒ) is not differentiable with respect to ΞΌ and Οƒ β€’ To make sampling z differentiable: β€’ z = ΞΌ + Οƒ Β· Ο΅, where Ο΅ ~ N(0, 1) Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
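A sketch of the trick, assuming the encoder predicts the mean and the log-variance (a common parameterization; the slides predict the standard deviation directly).

```python
import torch

def reparameterize(mu, logvar):
    # z ~ N(mu, sigma^2) rewritten as z = mu + sigma * eps with eps ~ N(0, 1),
    # so gradients flow through mu and logvar while eps carries the randomness.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```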
51 Variational Autoencoder β€’ Reparameterization Trick Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
52 Variational Autoencoder β€’ Loss function Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
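Putting the pieces together, a sketch of the full loss $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_{KL}$ from the quick-overview slide; the network sizes and the equal weighting of the two terms are assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)       # predict the means
        self.to_logvar = nn.Linear(hidden, latent)   # predict the log-variances
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)          # reparameterization trick
        x_hat = self.decoder(z)
        recon = nn.functional.mse_loss(x_hat, x, reduction="sum") / x.size(0)
        kl = 0.5 * (-logvar + mu.pow(2) + logvar.exp() - 1).sum(dim=1).mean()
        return recon + kl                             # L_total = L_MSE + L_KL
```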
53 Variational Autoencoder β€’ Where is β€˜variational’? Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
β€’ Vanilla Autoencoder β€’ Denoising Autoencoder β€’ Sparse Autoencoder β€’ Contractive Autoencoder β€’ Stacked Autoencoder β€’ Variational Autoencoder (VAE) β€’ From Neural Network Perspective β€’ From Probability Model Perspective 54
55 Variational Autoencoder β€’ Problem Definition Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 Goal: Given $X = \{x_1, x_2, x_3, \dots, x_n\}$, find $p(X)$ to represent $X$. How: It is difficult to directly model $p(X)$, so alternatively we can write $p(X) = \int_Z p(X|Z)\, p(Z)\, dZ$, where $p(Z) = N(0,1)$ is a prior/known distribution, i.e., sample $X$ from $Z$.
56 Variational Autoencoder β€’ The probability model perspective β€’ $p(X)$ is hard to model β€’ Alternatively, we learn the joint distribution of $X$ and $Z$: $p(X) = \int_Z p(X|Z)\, p(Z)\, dZ = \int_Z p(X, Z)\, dZ$, where $p(X, Z) = p(Z)\, p(X|Z)$. Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
57 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Assumption
58 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Assumption
59 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Monte Carlo? β€’ n might need to be extremely large before we have an accurate estimation of P(X)
60 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Monte Carlo? β€’ Pixel difference is different from perceptual difference
61 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Monte Carlo? β€’ VAE alters the sampling procedure
62 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Recap: Variational Inference β€’ VI turns inference into optimization: approximate the ideal posterior $p(z|x) = \frac{p(x, z)}{p(x)} \propto p(x, z)$
63 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Variational Inference β€’ VI turns inference into optimization parameter distribution
64 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Setting up the objective β€’ Maximize $p(X)$ β€’ Set $Q(z)$ to be an arbitrary distribution. Bayes' rule: $p(z|X) = \frac{p(X|z)\, p(z)}{p(X)}$. Goal: maximize $\log p(X)$
65 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Setting up the objective [Figure: the objective, annotated with: encoder, ideal posterior (difficult to compute), reconstruction/decoder, KLD] Goal: maximize $\log p(X)$; the goal becomes optimizing the tractable terms: $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_{KL}$, with $q(z|x)$ for inference and $p(x|z)$ for generation
66 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Setting up the objective: ELBO [Figure: annotations: ideal posterior, encoder, βˆ’ELBO] $p(z|X) = \frac{p(X, z)}{p(X)}$
67 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Setting up the objective : ELBO
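The objective set up on the last few slides can be written out as follows; this is a standard reconstruction of the ELBO decomposition, added here because the slide equations are shown only as images.

```latex
\[
\log p(X)
  = \underbrace{\mathbb{E}_{z \sim q(z|X)}\big[\log p(X|z)\big]}_{\text{reconstruction}}
  - \underbrace{\mathrm{KL}\big(q(z|X)\,\|\,p(z)\big)}_{\text{regularization}}
  + \underbrace{\mathrm{KL}\big(q(z|X)\,\|\,p(z|X)\big)}_{\ge 0,\ \text{intractable}}
\]
```

Since the last term is non-negative and intractable, maximizing the first two terms (the ELBO) maximizes a lower bound on $\log p(X)$; with a Gaussian decoder the reconstruction term reduces to MSE, so the training loss $\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_{KL}$ is the negative ELBO up to constants.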
68 Variational Autoencoder β€’ Recap: The KL Divergence Loss Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
$$
\begin{aligned}
KL(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0,1))
&= \int \mathcal{N}(\mu, \sigma^2) \log \frac{\mathcal{N}(\mu, \sigma^2)}{\mathcal{N}(0,1)}\, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \log \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}}\, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \log\!\left( \frac{1}{\sqrt{\sigma^2}}\, e^{\frac{x^2}{2} - \frac{(x-\mu)^2}{2\sigma^2}} \right) dx \\
&= \frac{1}{2} \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[ -\log\sigma^2 + x^2 - \frac{(x-\mu)^2}{\sigma^2} \right] dx
\end{aligned}
$$
69 Variational Autoencoder β€’ Recap: The KL Divergence Loss Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
$$
KL(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0,1))
= \frac{1}{2} \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[ -\log\sigma^2 + x^2 - \frac{(x-\mu)^2}{\sigma^2} \right] dx
= \frac{1}{2} \left( -\log\sigma^2 + \mu^2 + \sigma^2 - 1 \right)
$$
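In code, the closed form above becomes a short function (log-variance parameterization, matching the earlier reparameterization sketch):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (-log sigma^2 + mu^2 + sigma^2 - 1),
    # summed over the latent dimensions and averaged over the batch.
    kl = 0.5 * (-logvar + mu.pow(2) + logvar.exp() - 1)
    return kl.sum(dim=1).mean()
```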
70 Variational Autoencoder β€’ Recap: The KL Divergence Loss Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
71 Variational Autoencoder Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 β€’ Optimizing the objective [Figure: the objective over the dataset, annotated with: encoder, ideal posterior, reconstruction, KLD]
72 Variational Autoencoder β€’ VAE is a Generative Model Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013 $p(Z|X)$ is not $N(0,1)$. Can we feed $N(0,1)$ into the decoder for sampling? YES: the goal of the KL term is to push $p(Z|X)$ towards $N(0,1)$
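Sampling new data after training then reduces to drawing $z$ from the prior and decoding it; `decoder` here stands for the trained VAE decoder network (a sketch).

```python
import torch

@torch.no_grad()
def sample(decoder, n=16, latent_dim=4):
    z = torch.randn(n, latent_dim)   # z ~ N(0, I), the prior
    return decoder(z)                # the decoder maps the prior to the data space
```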
73 Variational Autoencoder β€’ VAE vs. Autoencoder β€’ VAE : distribution representation, p(z|x) is a distribution β€’ AE: feature representation, h = E(x) is deterministic Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
74 Variational Autoencoder β€’ Challenges β€’ Low quality images β€’ … Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
75 Summary: Take Home Message β€’ Autoencoders learn data representations in an unsupervised/self-supervised way. β€’ Autoencoders learn data representations but cannot model the data distribution $p(X)$. β€’ Unlike the vanilla autoencoder, in the sparse autoencoder the number of hidden units can be greater than the number of input variables. β€’ VAE β€’ … β€’ … β€’ … β€’ … β€’ … β€’ …
Thanks 76
