Encoder-Decoder Autoencoders for Computer Vision
Encoder-Decoder • Encoder-decoder frameworks: • An encoder network extracts key features of the input data • A decoder network takes extracted feature data as its input • Used in a variety of deep learning models • In most applications output of neural network is different from its input • Ex. in image segmentation models: accurately label pixels by their semantic class • Encoder network extracts feature data from input image to determine semantic classification of different pixels • Using that feature map and pixel-wise classifications, decoder network constructs segmentation masks for each object or region in the image • Trained via supervised learning against a “ground truth” dataset of labeled images
Autoencoder • Though all autoencoder models include both an encoder and a decoder, not all encoder-decoder models are autoencoders • Autoencoders - specific subset of encoder-decoder architectures trained via unsupervised learning to reconstruct their own input data • Do not rely on labeled training data • Trained to discover hidden patterns in unlabeled data • Have a ground truth to measure their output against - the original input itself • Considered “self-supervised learning”– hence, autoencoder
Autoencoders (AE) • A powerful tool used in machine learning for: • Feature extraction • Data compression • Image reconstruction • Used for unsupervised learning tasks • An AE model has the ability to automatically learn complex features from input data • Popular method for improving accuracy of classification and prediction tasks
Autoencoders • Autoencoders are neural networks • Can learn to compress and reconstruct input data, such as images, using a hidden layer of neurons • Learn data encodings in an unsupervised manner • Consists of two parts: • Encoder: takes input data and compresses it into a lower-dimensional representation called latent space • Decoder: reconstructs input data from latent space representation • In an optimal scenario, autoencoder performs as close to perfect reconstruction as possible
[Diagram: input X → ENCODER → latent code h → DECODER → output X′]
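A minimal sketch of this encoder-decoder structure in PyTorch; the layer sizes, activations, and 784-dimensional (flattened 28×28) input are illustrative assumptions, not from the slides:

```python
# Minimal fully-connected autoencoder sketch (PyTorch).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=784, latent_dim=32):
        super().__init__()
        # Encoder: compress input X into latent representation h
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct X' from h
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_features), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        h = self.encoder(x)     # X -> h
        return self.decoder(h)  # h -> X'

x = torch.rand(16, 784)         # batch of flattened 28x28 images
x_hat = AutoEncoder()(x)
print(x_hat.shape)              # torch.Size([16, 784])
```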
AE in Computer Vision • Input is an image and output is a reconstructed image • Input image typically represented as a matrix of pixel values • Can be of any size, but is typically normalized to improve performance
Encoder • Encoder: Compresses input image into a lower-dimensional representation, known as latent space ("bottleneck" or "code") • Encoder is: • Series of convolutional layers • Followed by pooling modules or simple linear layers that extract different levels of features from input image • Each layer: • Applies a set of filters to input image • Outputs a feature map that highlights specific patterns and structures in image
Encoder • Input volume I = {I_1, …, I_D}, with depth D • Convolution layer composed of q convolution filters {F_1^(1), …, F_q^(1)} • Convolution of the input volume with the filters produces q activation maps: z_m = O_m = a(I ∗ F_m^(1) + b_m^(1)), m = 1, …, q • Every convolution is wrapped by a non-linear function a; b_m is the bias for the mth feature map • Produced activation maps are the encoding of input I in a low-dimensional space • Convolution reduces the output's spatial extent, so it is not possible to reconstruct a volume with the same spatial extent as the input without padding • Pad the input such that dim(I) = dim(decode(encode(I))) (a sketch follows below) https://pgaleone.eu/neural-networks/2016/11/24/convolutional-autoencoders/
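A sketch of such a convolutional encoder in PyTorch; the channel counts, strides, and 64×64 RGB input are assumptions for illustration:

```python
# Convolutional encoder sketch: each Conv2d applies q learned filters F^(1)
# followed by a nonlinearity a (ReLU here); strided convs shrink spatial extent.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 3 -> 16 maps
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 16 -> 32 maps
    nn.ReLU(),
)

I = torch.rand(1, 3, 64, 64)  # input volume with depth D = 3
z = encoder(I)                # activation maps: the latent encoding
print(z.shape)                # torch.Size([1, 32, 16, 16])
```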
Bottleneck • Bottleneck / Latent Representation: Output of encoder is a compressed representation of input image in latent space • Captures most important features of input image • Typically, a smaller dimensional representation of input image • Restricts flow of information to decoder from encoder - allowing only most vital information to pass through • Prevents neural network from memorizing input and overfitting data • Smaller the code, lower the risk of overfitting • If input data denoted as x, then latent space representation s = E(x)
Decoder • Decoder: reconstructs input image from latent representation • Usually implemented with transposed convolutions for images • Gradually increases size of feature maps until final output is same size as input • Every layer applies a set of filters that up-sample feature maps • Output compared with ground truth • If output of decoder is o, then o = D(s) = D(E(x))
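A matching decoder sketch using transposed convolutions; the shapes assume the 64×64 RGB example from the encoder sketch above:

```python
# Decoder sketch mirroring the encoder: ConvTranspose2d layers up-sample
# the feature maps back to the input's spatial extent.
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                       padding=1, output_padding=1),  # 16x16 -> 32x32
    nn.ReLU(),
    nn.ConvTranspose2d(16, 3, kernel_size=3, stride=2,
                       padding=1, output_padding=1),  # 32x32 -> 64x64
    nn.Sigmoid(),
)

s = torch.rand(1, 32, 16, 16)  # latent representation s = E(x)
o = decoder(s)                 # o = D(s), same size as the input
print(o.shape)                 # torch.Size([1, 3, 64, 64])
```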
Decoder • Output of decoder is a reconstructed image similar to input image • Reconstructed image may not be identical to input image • The latent features learned along the way can be used for tasks such as image classification, object detection, and image retrieval • [Image: AE converting input to a Monet-style painting] https://www.analyticsvidhya.com/blog/2021/01/auto-encoders-for-computer-vision-an-endless-world-of-possibilities/
Decoder • q feature maps z_{m=1,…,q} (latent representations) produced by the encoder are used as input to the decoder to reconstruct the image Ĩ • Hyper-parameters of the decoding convolution are fixed by the encoding architecture • Filter volume F^(2) sized to produce the same spatial extent as I • Number of filters to learn: D • Reconstructed image Ĩ is the result of convolution between the feature maps Z and F^(2): Ĩ = a(Z ∗ F^(2) + b^(2)) • Loss function: L(I, Ĩ) = ||I − Ĩ||₂²
Loss Function and Reconstruction Loss • Loss functions - critical role in training autoencoders and determining their performance • Most commonly used is reconstruction loss, viz. mean squared error • Used to measure difference between model input and output • Reconstruction loss used to update weights of network during backpropagation to minimize difference between input and output • Goal: achieve low reconstruction loss • Low loss ⇒ model can effectively capture salient features of input data and reconstruct it accurately (training-loop sketch below)
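A sketch of training with MSE reconstruction loss; the model, random stand-in data, and optimizer settings are placeholder assumptions:

```python
# Training loop sketch: MSE between input and output, minimized by
# backpropagation. `model` could be any of the AE sketches above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for step in range(100):
    x = torch.rand(16, 784)     # stand-in for a real data batch
    x_hat = model(x)
    loss = criterion(x_hat, x)  # reconstruction loss L(x, x_hat)
    optimizer.zero_grad()
    loss.backward()             # gradients of loss w.r.t. weights
    optimizer.step()            # update weights to reduce loss
```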
Dimensionality Reduction • Dimensionality reduction - process of reducing number of dimensions in encoded representation of input data • AE can learn to perform dimensionality reduction: • Training encoder to map input data to a lower-dimensional latent space • Decoder trained to reconstruct original input data from latent space representation • Size of latent space typically much smaller than size of input data - allowing for efficient storage and computation of data • Through dimensionality reduction, AE can also help to remove noise and irrelevant features • Useful for improving performance of downstream tasks such as data classification or clustering
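A sketch of using the trained encoder alone for dimensionality reduction, here feeding the latent codes to k-means clustering as one example downstream task; the untrained stand-in encoder and cluster count are assumptions:

```python
# After training, keep only the encoder: map data into the low-dimensional
# latent space and run a downstream algorithm on the codes.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

encoder = nn.Sequential(nn.Linear(784, 32))   # trained encoder stand-in
X = torch.rand(500, 784)
with torch.no_grad():
    codes = encoder(X).numpy()                # 784-d -> 32-d latent codes
labels = KMeans(n_clusters=10, n_init=10).fit_predict(codes)
```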
Hyperparameters • Code size: Size of bottleneck determines how much the data is to be compressed • Adjustments to code size are one way to counter overfitting or underfitting • Number of layers: Depth measured by number of layers in encoder and decoder • More depth provides greater complexity • Less depth provides greater processing speed • Number of nodes per layer: • Generally, number of nodes decreases with each encoder layer, reaches minimum at bottleneck, and increases with each decoder layer • Number of neurons may vary per nature of input data – ex., an autoencoder dealing with large images would require more neurons than one dealing with smaller images
Training the AE • Input x_i is the ith image of m samples, each having n features • h_i = g(x_i W + b) • x̂_i = f(h_i W* + c) • Objective function: min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂_ij − x_ij)² • Dimensions: x_i ∈ ℝ^{1×n}, W ∈ ℝ^{n×k}, h_i ∈ ℝ^{1×k}, X ∈ ℝ^{m×n} • [Diagram: x_i → W → h → W* → x̂_i] (NumPy sketch below)
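A NumPy sketch of this forward pass and objective, assuming g = f = sigmoid and the row-vector convention above, with biases initialized to zero:

```python
# Forward pass: h_i = g(x_i W + b), x_hat_i = f(h_i W* + c).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n, k = 100, 784, 32              # samples, features, latent size
X = np.random.rand(m, n)            # X in R^{m x n}
W  = np.random.randn(n, k) * 0.01   # W in R^{n x k}
Ws = np.random.randn(k, n) * 0.01   # W* in R^{k x n}
b, c = np.zeros(k), np.zeros(n)

H = sigmoid(X @ W + b)              # rows are h_i in R^{1 x k}
X_hat = sigmoid(H @ Ws + c)         # reconstructions
loss = np.mean(np.sum((X_hat - X) ** 2, axis=1))  # (1/m) sum_ij (x_hat - x)^2
```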
Training the AE • Let z₁ = x_i W + b, h₁ = g(z₁), z₂ = h₁ W* + c, x̂_i = f(z₂) • Compute: ∂L/∂W* = (∂L/∂x̂_i) · (∂x̂_i/∂z₂) · (∂z₂/∂W*) • ∂L/∂W = (∂L/∂x̂_i) · (∂x̂_i/∂z₂) · (∂z₂/∂h₁) · (∂h₁/∂z₁) · (∂z₁/∂W) • where ∂L/∂x̂_i = 2(x̂_i − x_i) • [Diagram: x_i → z₁ → h₁ → z₂ → x̂_i] (sketch below)
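A self-contained NumPy sketch of these gradients via the chain rule; biases are omitted for brevity and the learning rate is an assumption:

```python
# Manual backprop for the two-layer AE, following the chain rule above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n, k = 100, 784, 32
X = np.random.rand(m, n)
W  = np.random.randn(n, k) * 0.01        # encoder weights
Ws = np.random.randn(k, n) * 0.01        # decoder weights W*
H = sigmoid(X @ W)                       # h = g(z1), z1 = X W
X_hat = sigmoid(H @ Ws)                  # x_hat = f(z2), z2 = H W*

dL_dXhat = 2.0 * (X_hat - X) / m         # dL/dx_hat = 2(x_hat - x), averaged
dL_dz2 = dL_dXhat * X_hat * (1 - X_hat)  # sigmoid derivative f'(z2)
dL_dWs = H.T @ dL_dz2                    # dL/dW* via dz2/dW*

dL_dH  = dL_dz2 @ Ws.T                   # back through z2 = H W*
dL_dz1 = dL_dH * H * (1 - H)             # sigmoid derivative g'(z1)
dL_dW  = X.T @ dL_dz1                    # dL/dW via the full chain rule

lr = 0.1                                  # learning rate (assumption)
W, Ws = W - lr * dL_dW, Ws - lr * dL_dWs  # one gradient-descent step
```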
Types of Autoencoders
Undercomplete Autoencoder • Takes an image and tries to predict same image as output • Reconstructs image from compressed bottleneck region • Used primarily for dimensionality reduction • Hidden layers contain fewer nodes than input and output layers • Capacity of its bottleneck is fixed • Constrain number of nodes present in hidden layer(s) of network • Limit amount of information that can flow through the network • Model can learn most important attributes of input data and how to best reconstruct original input from an "encoded" state
Contractive autoencoder • Designed to learn a compressed representation of input data while being resistant to small perturbations in the input • Achieved by adding a regularization term to the training objective • This term penalizes the network for changing its output with respect to small changes in the input • Loss: L̂(θ) = L(I, Ĩ) + Ω(θ) • Regularization term: Ω(θ) = ||J_x(h)||²_F, the squared Frobenius norm of the Jacobian of h w.r.t. x • J_x(h) ∈ ℝ^{n×k}, x ∈ ℝ^{1×n}, h ∈ ℝ^{1×k} • [Diagram: x → W → h → W* → x̂]
Contractive autoencoder • J_x(h) is the matrix of partial derivatives with (j, l) entry ∂h_l/∂x_j • The (j, l) entry of the Jacobian captures the variation in the output of the lth neuron with a small variation in the jth input • ||J_x(h)||²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)² • Ideally, this should be 0 to minimize the loss
Contractive autoencoder • Consider ∂h₁/∂x₁. What does it mean if ∂h₁/∂x₁ = 0? • h₁ is not sensitive to variations in x₁ • But we want the neurons to capture the important information ⇒ want h₁ to change with x₁ • This contradicts the goal of minimizing L(θ), which requires h to capture variations in the input • The two contradicting objectives ensure that h is sensitive to only the very important variations in the input • L(θ): capture important variations in data • Ω(θ): do not capture variations in data • Tradeoff: capture only very important variations in data (penalty sketch below)
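A sketch of the contractive loss for a one-layer sigmoid encoder, where ∂h_l/∂x_j = h_l(1 − h_l)·W_jl gives the Frobenius norm in closed form; the penalty weight 0.1 is an assumption:

```python
# Contractive penalty: for h = sigmoid(x W + b),
# ||J_x(h)||_F^2 = sum_l (h_l(1-h_l))^2 * sum_j W_jl^2.
import torch
import torch.nn as nn

enc = nn.Linear(784, 32)
dec = nn.Linear(32, 784)

x = torch.rand(16, 784)
h = torch.sigmoid(enc(x))                  # h in R^{batch x k}
x_hat = torch.sigmoid(dec(h))

recon = ((x_hat - x) ** 2).sum(dim=1).mean()
dh = h * (1 - h)                           # h_l(1 - h_l), per sample
w_sq = (enc.weight ** 2).sum(dim=1)        # sum_j W_jl^2, shape (k,)
contractive = (dh ** 2 @ w_sq).mean()      # ||J_x(h)||_F^2, batch average
loss = recon + 0.1 * contractive           # L_hat(theta) = L + lam * Omega
```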
Sparse Autoencoder • Encoder network trained to produce sparse encoding vectors - have many zero values • Does not require reduction in number of nodes at hidden layer • Create bottleneck by reducing number of nodes that can be activated at the same time • Forces network to identify only most important features of input data • Sensitize individual hidden layer nodes toward specific attributes of input • Forced to selectively activate regions of network depending on input data
• Opacity of a node corresponds with level of activation • Individual nodes of a trained model which activate are data-dependent • Different inputs will result in activations of different nodes through network
Sparse Autoencoder • Hidden neuron with sigmoid activation: • Has values between 0 and 1 • Activated neuron – output close to 1 • Inactive neuron – output close to 0 • Sparse autoencoder tries to ensure that a neuron is inactive most of the time • Average activation of the neuron is close to 0 • If neuron l is sparse (mostly inactive), then ρ̂_l → 0, where ρ̂_l = (1/m) Σ_{i=1}^{m} h(x_i)_l • [Diagram: x_i → W → h → W* → x̂_i]
Sparse Autoencoder • Sparse autoencoder uses a sparsity parameter ρ • Typically close to 0 (e.g. 0.005) • Tries to enforce the constraint ρ̂_l = ρ • Acts as regularization against over-learning by the neuron: whenever the neuron is active, it really captures some relevant information • One possible solution is to add the following to the objective function: Ω(θ) = Σ_{l=1}^{k} [ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l))] • L̂(θ) = L(θ) + Ω(θ)
Sparse Autoencoder • Ω(θ) = Σ_{l=1}^{k} [ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l))] has its minimum value when ρ̂_l = ρ • How to compute ∂Ω(θ)/∂W? Expand: Ω(θ) = Σ_{l=1}^{k} [ρ log ρ − ρ log ρ̂_l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂_l)] • By the chain rule: ∂Ω(θ)/∂W = (∂Ω(θ)/∂ρ̂_l) · (∂ρ̂_l/∂W) • ∂Ω(θ)/∂ρ̂_l = −ρ/ρ̂_l + (1 − ρ)/(1 − ρ̂_l) • ∂ρ̂_l/∂W = x_iᵀ g′(x_i W + b), averaged over the m samples (sketch below)
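A sketch of the sparsity penalty added to the reconstruction loss; autograd handles the derivative above, and the values of ρ and the penalty weight are assumptions:

```python
# KL-divergence sparsity penalty between target sparsity rho and the
# average activation rho_hat of each sigmoid hidden unit.
import torch
import torch.nn as nn

enc = nn.Linear(784, 64)
dec = nn.Linear(64, 784)

x = torch.rand(100, 784)
h = torch.sigmoid(enc(x))                 # activations in (0, 1)
x_hat = torch.sigmoid(dec(h))

rho = 0.005
rho_hat = h.mean(dim=0)                   # rho_hat_l = (1/m) sum_i h(x_i)_l
kl = (rho * torch.log(rho / rho_hat)
      + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
loss = ((x_hat - x) ** 2).sum(dim=1).mean() + 1.0 * kl  # L + Omega
```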
Denoising autoencoder • Designed to learn to reconstruct an input from a corrupted version of input • Corrupted input created by adding noise to original input • Network trained to remove noise and reconstruct original input
Denoising autoencoder • Corrupts input data using a probabilistic process P(x̃_ij | x_ij) before feeding it to the network • Model will have to learn to reconstruct the corrupted x_ij correctly by relying on its interactions with other elements of x_i • Not able to develop a mapping which memorizes the training data, because input and target output are no longer the same • Model learns a vector field for mapping input data towards a lower-dimensional manifold • We have effectively "canceled out" the added noise • [Diagram: x_i → P(x̃_ij | x_ij) → x̃_i → W → h → W* → x̂_i] (sketch below)
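A sketch of one denoising setup, using additive Gaussian noise as the corruption process P(x̃ | x); the noise level and model are assumptions:

```python
# Denoising AE step: corrupt the input, but compute the loss against
# the CLEAN original, so the network learns to remove the noise.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))
criterion = nn.MSELoss()

x = torch.rand(16, 784)                   # clean input x
x_tilde = x + 0.2 * torch.randn_like(x)   # corrupted input x_tilde
x_hat = model(x_tilde)                    # reconstruct from corruption
loss = criterion(x_hat, x)                # target is the clean input
```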
Applications of Autoencoders
Classification • [Diagram: a trained AE's encoder maps input x_i to latent code h; a classifier head on h outputs class predictions] (sketch below)
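A sketch matching the diagram: reuse a pretrained encoder's latent features and attach a small classifier head; the shapes and class count are assumptions:

```python
# Classification on top of AE features: latent code h -> class logits.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())  # pretrained AE encoder
classifier = nn.Linear(32, 10)                          # 10 classes

x = torch.rand(16, 784)
with torch.no_grad():        # optionally freeze the encoder
    h = encoder(x)
logits = classifier(h)       # class scores from latent features
```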
Super-resolution • Super-resolution - increase the resolution of a low-resolution image • Can also be achieved by upsampling the image and using bilinear interpolation to fill in new pixel values, but the generated image will be blurry, as interpolation cannot increase the amount of information in the image • Can instead teach an AE to predict the pixel values of the high-resolution image (sketch below)
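A super-resolution sketch: the network input is a low-resolution image and the training target is the high-resolution original; the 2× scale factor and architecture are assumptions:

```python
# Super-resolution AE step: predict high-res pixels from a low-res input.
import torch
import torch.nn as nn
import torch.nn.functional as F

sr_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),  # 2x up-sampling
    nn.Sigmoid(),
)

hi = torch.rand(1, 3, 64, 64)             # high-res ground truth
lo = F.interpolate(hi, scale_factor=0.5)  # low-res input (32x32)
loss = F.mse_loss(sr_net(lo), hi)         # predict hi from lo
```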
Applications • Segmentation • Instead of using the same input as the expected output, provide a segmentation mask as the expected output • Performs tasks like semantic segmentation • Ex. the U-Net autoencoder architecture is commonly used for semantic segmentation of biomedical images (sketch below)
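A minimal segmentation sketch: the same encoder-decoder shape, but the target is a per-pixel class mask and the loss is pixel-wise cross-entropy; the class count and shapes are assumptions, and a real U-Net adds skip connections:

```python
# Segmentation AE step: image in, per-pixel class logits out.
import torch
import torch.nn as nn
import torch.nn.functional as F

seg_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),     # encode
    nn.ConvTranspose2d(16, 4, 4, stride=2, padding=1),       # 4 classes
)

img = torch.rand(2, 3, 64, 64)
mask = torch.randint(0, 4, (2, 64, 64))   # ground-truth class labels
logits = seg_net(img)                     # (2, 4, 64, 64)
loss = F.cross_entropy(logits, mask)      # pixel-wise cross-entropy
```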
Applications • Image Inpainting - can fill in missing or corrupted parts of an image by learning underlying structure and patterns in data
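An inpainting sketch: mask out part of each input and train the AE to reconstruct the complete image; the mask region and model are assumptions:

```python
# Inpainting AE step: fill in a hidden region of the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(16, 784)            # flattened 28x28 images
mask = torch.ones_like(x)
mask[:, 300:400] = 0               # hide a band of pixels
x_masked = x * mask                # "corrupted" image with a hole
loss = F.mse_loss(model(x_masked), x)  # reconstruct the FULL image
```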
Example code • Google Colab
