Deep Learning
Deep learning is a branch of machine learning that helps computers interact with complex data. By mimicking the neural networks of the human brain, it enables computers to autonomously uncover patterns and make informed decisions from vast amounts of unstructured data.
In a deep neural network, the input layer receives data, which then passes through hidden layers that transform it using nonlinear functions. The final output layer generates the model's prediction.
For more details on neural networks refer to this article: What is a Neural Network?
Fully Connected Deep Neural Network
Machine Learning | Deep Learning
Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.
Takes less time to train the model. | Takes more time to train the model.
Less complex and easier to interpret the results. | More complex; works like a black box, so interpretations of the results are not easy.
The journey of deep learning began with the perceptron, a single-layer neural network introduced in the 1950s. While innovative, perceptrons could only solve linearly separable problems and hence failed at more complex tasks like the XOR problem.
1. Feedforward neural networks (FNNs) are the simplest type of ANN, where data flows in one direction from input to output. They are used for basic tasks like classification.
2. Convolutional Neural Networks (CNNs) are specialized for processing grid-like data,
such as images. CNNs use convolutional layers to detect spatial hierarchies, making
them ideal for computer vision tasks.
3. Recurrent Neural Networks (RNNs) are able to process sequential data, such as time
series and natural language. RNNs have loops to retain information over time, enabling
applications like language modeling and speech recognition. Variants like LSTMs and
GRUs address vanishing gradient issues.
4. Autoencoders are unsupervised networks that learn efficient data encodings. They compress input data into a latent representation and reconstruct it, which is useful for dimensionality reduction and anomaly detection; a minimal sketch follows below.
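As a quick illustration of the last idea, here is a minimal Keras autoencoder sketch; the layer sizes and the 784-dimensional input are illustrative assumptions, not values from this article.

import tensorflow as tf
from tensorflow.keras import layers, models

# Encoder: compress a 784-dim input into a 32-dim latent code
encoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(32, activation='relu'),
])

# Decoder: reconstruct the input from the latent code
decoder = models.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(784, activation='sigmoid'),
])

autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')  # target is the input itself

Training with autoencoder.fit(x, x, ...) and then comparing per-sample reconstruction error is one common way to flag anomalies.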
1. Computer vision
In computer vision, deep learning models enable machines to identify and understand visual
data. Some of the main applications of deep learning in computer vision include:
Object detection and recognition: Deep learning models are used to identify and
locate objects within images and videos, making it possible for machines to perform
tasks such as self-driving cars, surveillance and robotics.
Image classification: Deep learning models can be used to classify images into
categories such as animals, plants and buildings. This is used in applications such as
medical imaging, quality control and image retrieval.
Image segmentation: Deep learning models can be used to segment images into different regions, making it possible to identify specific features within images.
2. Natural language processing (NLP)
In NLP, deep learning models enable machines to understand and generate human language. Some of the main applications of deep learning in NLP include:
Automatic Text Generation: Deep learning models can learn from a corpus of text, and new text such as summaries or essays can be generated automatically using these trained models.
Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.
Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text,
making it possible to determine whether the text is positive, negative or neutral.
Speech recognition: Deep learning models can recognize and transcribe spoken words,
making it possible to perform tasks such as speech-to-text conversion, voice search
and voice-controlled devices.
3. Reinforcement learning
Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess and Atari.
Robotics: Deep reinforcement learning models can be used to train robots to perform
complex tasks such as grasping objects, navigation and manipulation.
Control systems: Deep reinforcement learning models can be used to control complex
systems such as power grids, traffic management and supply chain optimization.
3. Scalability: Deep Learning models can scale to handle large and complex datasets and
can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can
handle various types of data such as images, text and speech.
Deep learning has made significant advancements in various fields but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough training data is a major practical concern.
4. Interpretability: Deep learning models are complex and work like a black box, making their results very difficult to interpret.
5. Overfitting: When a model is trained repeatedly, it can become too specialized to the training data, leading to overfitting and poor performance on new data.
As we continue to push the boundaries of computational power and dataset sizes, the
potential applications of deep learning are limitless. Deep Learning promises to reshape our
future where machines can learn, adapt and solve complex problems at a scale and speed
previously unimaginable.
Convolutional Neural Networks (CNNs) are deep learning models designed to process data with a grid-like topology, such as images. They form the foundation of most modern computer vision applications, detecting features within visual data.
2. Pooling Layers: They downsample the spatial dimensions of the input, reducing the
computational complexity and the number of parameters in the network. Max pooling
is a common pooling operation where we select a maximum value from a group of
neighboring pixels.
4. Fully Connected Layers: These layers are responsible for making predictions based on
the high-level features learned by the previous layers. They connect every neuron in
one layer to every neuron in the next layer.
1. Input Image: CNN receives an input image which is preprocessed to ensure uniformity
in size and format.
2. Convolutional Layers: Filters are applied to the input image to extract features like
edges, textures and shapes.
3. Pooling Layers: The feature maps generated by the convolutional layers are
downsampled to reduce dimensionality.
4. Fully Connected Layers: The downsampled feature maps are passed through fully
connected layers to produce the final output, such as a classification label.
5. Output: The CNN outputs a prediction, such as the class of the image.
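A minimal Keras sketch of this pipeline is shown below; the image size, filter counts and class count are illustrative assumptions, not values from the text.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),               # 1. preprocessed input image
    layers.Conv2D(32, (3, 3), activation='relu'),  # 2. convolution extracts features
    layers.MaxPooling2D((2, 2)),                   # 3. pooling downsamples feature maps
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),           # 4. fully connected layers
    layers.Dense(10, activation='softmax'),        # 5. output: class probabilities
])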
CNNs are trained using a supervised learning approach. This means that the CNN is given a set
of labeled training images. The CNN learns to map the input images to their correct labels.
1. Data Preparation: The training images are preprocessed to ensure that they are all in
the same format and size.
2. Loss Function: A loss function is used to measure how well the CNN is performing on
the training data. The loss function is typically calculated by taking the difference
between the predicted labels and the actual labels of the training images.
3. Optimizer: An optimizer is used to update the weights of the CNN in order to minimize
the loss function.
4. Backpropagation: Backpropagation is a technique used to calculate the gradients of
the loss function with respect to the weights of the CNN. The gradients are then used
to update the weights of the CNN using the optimizer.
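Continuing the sketch above, the loss function, optimizer and backpropagation steps are all wired together by compile and fit in Keras; the dataset variables here are placeholders.

# Cross-entropy compares predicted and actual labels (the loss function);
# Adam applies the gradients that backpropagation computes (the optimizer).
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# x_train, y_train are assumed to be preprocessed images and integer labels
model.fit(x_train, y_train, epochs=10, batch_size=32)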
The efficiency of a CNN can be evaluated using a variety of criteria. Among the most popular metrics are:
Accuracy: Accuracy is the percentage of test images that the CNN correctly classifies.
Precision: Precision is the percentage of test images that the CNN predicts as a
particular class and that are actually of that class.
Recall: Recall is the percentage of test images that are of a particular class and that the
CNN predicts as that class.
F1 Score: The F1 Score is a harmonic mean of precision and recall. It is a good metric
for evaluating the performance of a CNN on classes that are imbalanced.
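Assuming true and predicted labels are available as arrays, these metrics can be computed with scikit-learn, for example:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true and y_pred are small placeholder arrays of class labels
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))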
Diabetic retinopathy is a severe eye condition caused by damage to the retina's blood vessels
due to prolonged diabetes. It is a leading cause of blindness among adults aged 20 to 64. CNNs
have been successfully used to detect diabetic retinopathy by analyzing retinal images. By training on labeled datasets of healthy and affected retina images, CNNs can accurately identify signs of the disease, helping with early diagnosis and treatment.
1. LeNet
LeNet developed by Yann LeCun and his colleagues in the late 1990s was one of the first
successful CNNs designed for handwritten digit recognition. It laid the foundation for modern
CNNs and achieved high accuracy on the MNIST dataset which contains 70,000 images of
handwritten digits (0-9).
2. AlexNet
AlexNet is a CNN architecture that was developed by Alex Krizhevsky, Ilya Sutskever and
Geoffrey Hinton in 2012. It was the first CNN to win the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) a major image recognition competition. It consists of several
layers of convolutional and pooling layers followed by fully connected layers. The architecture
includes five convolutional layers, three pooling layers and three fully connected layers.
3. ResNet
ResNets (Residual Networks) are designed for image recognition and processing tasks. They are renowned for their ability to train very deep networks without performance degradation, making them highly effective for complex tasks. They introduce skip connections that allow the network to learn residual functions, making deep architectures easier to train.
4. GoogleNet
GoogleNet, also known as InceptionNet, is renowned for achieving high accuracy in image classification while using fewer parameters and computational resources than other state-of-the-art CNNs. Its core component, the Inception module, allows the network to learn features at different scales simultaneously, enhancing performance.
5. VGG
VGG was developed by the Visual Geometry Group at Oxford. It uses small 3x3 convolutional filters stacked in multiple layers, creating a deep and uniform structure. Popular variants like VGG-16 and VGG-19 achieved state-of-the-art performance on the ImageNet dataset, demonstrating the power of depth in CNNs.
Applications of CNN
Object detection: CNNs can be used to detect objects in images, such as people, cars and buildings. They can also localize objects, identifying where an object appears in an image.
Image segmentation: CNNs can be used to segment images, identifying and labeling different objects in an image. This is useful for applications such as medical imaging and robotics.
Video analysis: CNNs can be used to analyze videos, for example tracking objects or detecting events. This is useful for applications such as video surveillance and traffic monitoring.
Advantages of CNN
High Accuracy: They can achieve high accuracy in various image recognition tasks.
Disadvantages of CNN
Complexity: CNNs can be complex and difficult to train, especially for large datasets.
Data Requirements: They need large amounts of labeled data for training.
Recurrent Neural Networks (RNNs) differ from regular neural networks in how they process
information. While standard neural networks pass information in one direction, i.e. from input to output, RNNs feed information back into the network at each step.
Imagine reading a sentence and trying to predict the next word: you don't rely only on the current word but also remember the words that came before. RNNs work similarly by "remembering" past information and passing the output from one step as input to the next, i.e. they consider all the earlier words to choose the most likely next word. This memory of previous steps helps the network understand context and make better predictions.
1. Recurrent Neurons
The fundamental processing unit in RNN is a Recurrent Unit. They hold a hidden state that
maintains information about previous inputs in a sequence. Recurrent units can "remember"
information from prior steps by feeding back their hidden state, allowing them to capture
dependencies across time.
Recurrent Neuron
2. RNN Unfolding
RNN unfolding or unrolling is the process of expanding the recurrent structure over time steps.
During unfolding each step of the sequence is represented as a separate layer in a series
illustrating how information flows across each time step.
This unrolling enables backpropagation through time (BPTT), a learning process where errors are propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn dependencies within sequential data.
RNN Unfolding
RNNs share similarities in input and output structures with other deep learning architectures but differ significantly in how information flows from input to output. Unlike traditional deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across time steps, allowing them to remember information over sequences.
In RNNs, the hidden state h_i is calculated for every input x_i to retain sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
h = σ(U · X + W · h_{t−1} + B)
Here:
σ is the activation function, U and W are weight matrices, X is the input and B is the bias.
2. Output Calculation:
Y = O(V · h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the weight and bias.
3. Overall Function:
Y = f(X, h, W, U, V, B, C)
This function defines the entire RNN operation, where the state matrix S holds each element s_i representing the network's state at each time step i.
Recurrent Neural Architecture
At each time step RNNs process units with a fixed activation function. These units have an
internal hidden state that acts as memory that retains information from previous time steps.
This memory allows the network to store past knowledge and adapt based on new inputs.
The current hidden state h_t depends on the previous state h_{t−1} and the current input x_t, and is calculated using the following relations:
1. State Update:
h_t = f(h_{t−1}, x_t)
where f is the recurrent function.
2. Activation: applying the tanh activation function gives
h_t = tanh(W_hh · h_{t−1} + W_xh · x_t)
Here, W_hh is the weight matrix for the recurrent neuron and W_xh is the weight matrix for the input neuron.
3. Output Calculation:
y_t = W_hy · h_t
where y_t is the output and W_hy is the weight at the output layer.
These parameters are updated using backpropagation. However, since RNNs work on sequential data, an adapted form of backpropagation known as backpropagation through time is used.
Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to update the network's parameters. The loss function L(θ) depends on the final hidden state h_3, and each hidden state relies on preceding ones, forming a sequential dependency chain: h_3 depends on h_2, which depends on h_1.
In BPTT, gradients are backpropagated through each time step. This is essential for updating network parameters based on temporal dependencies.
1. Simplified Gradient Calculation:
∂L(θ)/∂W = (∂L(θ)/∂h_3) · (∂h_3/∂W)
2. Handling Dependencies in Layers: Each hidden state is updated based on its dependencies:
h_3 = σ(W · h_2 + b)
The gradient is then calculated for each state, considering dependencies from previous hidden
states.
3. Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into
explicit and implicit parts summing up the indirect paths from each hidden state to the
weights.
∂h_3/∂W = ∂⁺h_3/∂W + (∂h_3/∂h_2) · (∂⁺h_2/∂W)
where ∂⁺ denotes the explicit (direct) derivative that treats earlier hidden states as constants.
4. Final Gradient Expression: The final derivative of the loss function with respect to the
weight matrix W is computed:
∂L(θ)/∂W = (∂L(θ)/∂h_3) · Σ_{k=1}^{3} (∂h_3/∂h_k) · (∂h_k/∂W)
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input and a
single output. It is used for straightforward classification tasks such as binary classification
where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce multiple outputs over
time. This is useful in tasks where one input triggers a sequence of predictions (outputs). For
example in image captioning a single image can be used as input to generate a sequence of
words as a caption.
One to Many RNN
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type
is useful when the overall context of the input sequence is needed to make one prediction. In
sentiment analysis the model receives a sequence of words (like a sentence) and produces a
single output like positive, negative or neutral.
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs. In a language translation task, a sequence of words in one language is given as input and a corresponding sequence in another language is generated as output.
Many to Many RNN
There are several variations of RNNs, each designed to address specific challenges or optimize
for certain tasks:
1. Vanilla RNN
This simplest form of RNN consists of a single hidden layer where weights are shared across
time steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by
the vanishing gradient problem, which hampers long-sequence learning.
2. Bidirectional RNNs
Bidirectional RNNs process inputs in both forward and backward directions, capturing both
past and future context for each time step. This architecture is ideal for tasks where the entire
sequence is available, such as named entity recognition and question answering.
3. Long Short-Term Memory Networks (LSTMs)
Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome the
vanishing gradient problem. Each LSTM cell has three gates:
Input Gate: Controls how much new information should be added to the cell state.
Forget Gate: Decides which information should be removed from the cell state.
Output Gate: Regulates what information should be output at the current step. This
selective memory enables LSTMs to handle long-term dependencies, making them
ideal for tasks where earlier context is critical.
4. Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into a
single update gate and streamlining the output mechanism. This design is computationally
efficient, often performing similarly to LSTMs and is useful in tasks where simplicity and faster
training are beneficial.
Recurrent Neural Networks (RNNs) address this by incorporating loops that allow information from previous steps to be fed back into the network. This feedback enables RNNs to remember prior inputs, making them ideal for tasks where context is important.
In this section, we create a character-based text generator using Recurrent Neural Network
(RNN) in TensorFlow and Keras. We'll implement an RNN that learns patterns from a text
sequence to generate new text character-by-character.
We start by importing essential libraries for data handling and building the neural network.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
We define the input text and identify unique characters in the text which we’ll encode for our
model.
text = "hello geeks hello world"  # placeholder corpus; the article's original text is not shown
chars = sorted(list(set(text)))
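The two lookup tables used later in the code are not shown above; they follow directly from chars and would look like this:

# Map each character to an integer index and back
char_to_index = {ch: i for i, ch in enumerate(chars)}
index_to_char = {i: ch for i, ch in enumerate(chars)}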
To train the RNN, we need sequences of fixed length (seq_length) and the character following
each sequence as the label.
seq_length = 3
sequences = []
labels = []
for i in range(len(text) - seq_length):
    seq = text[i:i + seq_length]
    label = text[i + seq_length]
    sequences.append([char_to_index[ch] for ch in seq])
    labels.append(char_to_index[label])
X = np.array(sequences)
y = np.array(labels)
We create a simple RNN model with a hidden layer of 50 units and a Dense output layer
with softmax activation.
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
We compile the model using the categorical_crossentropy loss and train it for 100 epochs.
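The compile-and-train code itself is missing above; a plausible completion consistent with the stated loss and epoch count (the one-hot encoding step is an assumption required by the model's input shape) is:

# One-hot encode inputs and labels to match categorical_crossentropy
X_one_hot = tf.one_hot(X, len(chars))
y_one_hot = tf.one_hot(y, len(chars))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_one_hot, y_one_hot, epochs=100, verbose=0)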
After training we use a starting sequence to generate new text character by character.
start_seq = "hel"  # placeholder seed of length seq_length
generated_text = start_seq
for i in range(50):
    x = np.array([[char_to_index[ch] for ch in generated_text[-seq_length:]]])
    x_one_hot = tf.one_hot(x, len(chars))
    prediction = model.predict(x_one_hot, verbose=0)
    next_index = np.argmax(prediction)
    next_char = index_to_char[next_index]
    generated_text += next_char
print("Generated Text:")
print(generated_text)
Output:
Predicting the next word
Sequential Memory: RNNs retain information from previous inputs making them ideal
for time-series predictions where past data is crucial.
While RNNs excel at handling sequential data, they face two main training challenges: the vanishing gradient problem and the exploding gradient problem.
These challenges can hinder the performance of standard RNNs on complex, long-sequence
tasks.
Natural Language Processing (NLP): RNNs are fundamental in NLP tasks like language
modeling, sentiment analysis and machine translation.
Speech Recognition: RNNs capture temporal patterns in speech data, aiding in speech-
to-text and other audio-related applications.
Image and Video Processing: When combined with convolutional layers, RNNs help
analyze video sequences, facial expressions and gesture recognition.
Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network
(RNN) designed by Hochreiter and Schmidhuber. LSTMs can capture long-term dependencies in
sequential data making them ideal for tasks like language translation, speech recognition and
time series forecasting.
Unlike traditional RNNs which use a single hidden state passed through time LSTMs introduce a
memory cell that holds information over extended periods addressing the challenge of
learning long-term dependencies.
Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a
hidden state that captures information from previous time steps. However they often face
challenges in learning long-term dependencies, where information from distant time steps becomes crucial for making accurate predictions about the current state. This problem is known as the vanishing gradient or exploding gradient problem.
Vanishing Gradient: When training a model over time, the gradients which help the
model learn can shrink as they pass through many steps. This makes it hard for the
model to learn long-term patterns since earlier information becomes almost irrelevant.
Exploding Gradient: Sometimes gradients can grow too large causing instability. This
makes it difficult for the model to learn properly as the updates to the model become
erratic and unpredictable.
Both of these issues make it challenging for standard RNNs to effectively capture long-term
dependencies in sequential data.
LSTM Architecture
LSTM architecture involves the memory cell, which is controlled by three gates:
1. Input gate: Controls what new information is added to the memory cell.
2. Forget gate: Determines what information is removed from the memory cell.
3. Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the
network which allows them to learn long-term dependencies. The network has a hidden state
which is like its short-term memory. This memory is updated using the current input, the
previous hidden state and the current state of the memory cell.
Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different
memory blocks called cells.
LSTM Model
Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates -
1. Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (the input at the particular time step) and h_{t−1} (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function which gives an output between 0 and 1. If for a particular cell state the output is 0, the piece of information is forgotten and for output 1, the information is retained for future use.
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
Where:
[h_{t−1}, x_t] denotes the concatenation of the current input and the previous hidden state.
2. Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs h_{t−1} and x_t. Then, a vector is created using the tanh function, giving an output from −1 to +1, which contains all the possible values from h_{t−1} and x_t. Finally, the values of the vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
We multiply the previous cell state by f_t, effectively filtering out the information we decided to ignore earlier. Then we add i_t ⊙ Ĉ_t, which represents the new candidate values scaled by how much we decided to update each state value:
C_t = f_t ⊙ C_{t−1} + i_t ⊙ Ĉ_t
where ⊙ denotes element-wise multiplication.
3. Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, filtering the values to be remembered using the inputs h_{t−1} and x_t. Finally, the values of the vector and the regulated values are multiplied and sent as the output of the current cell and as input to the next cell. The equations for the output gate are:
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
Output Gate
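As a sketch, the three gate equations above translate almost line for line into NumPy. The weight and bias shapes are illustrative assumptions (hidden size n, input size m, each W of shape n x (n + m)); this is not a production implementation.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate: what to discard
    i_t = sigmoid(W_i @ z + b_i)         # input gate: what to add
    C_hat = np.tanh(W_c @ z + b_c)       # candidate cell values
    C_t = f_t * C_prev + i_t * C_hat     # new cell state
    o_t = sigmoid(W_o @ z + b_o)         # output gate: what to expose
    h_t = o_t * np.tanh(C_t)             # new hidden state
    return h_t, C_t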
Applications of LSTM
Language Modeling: Used in tasks like language modeling, machine translation and
text summarization. These networks learn the dependencies between words in a
sentence to generate coherent and grammatically correct sentences.
Time Series Forecasting: Used for predicting stock prices, weather and energy
consumption. They learn patterns in time series data to predict future events.
Anomaly Detection: Used for detecting fraud or network intrusions. These networks
can identify patterns in data that deviate drastically and flag them as potential
anomalies.
Video Analysis: Applied in tasks such as object detection, activity recognition and
action classification. When combined with Convolutional Neural Networks (CNNs) they
help analyze video data and extract useful information.
Generative Adversarial Networks (GANs) help machines create new, realistic data by learning from existing examples. They were introduced by Ian Goodfellow and his team in 2014 and have transformed how computers generate images, videos, music and more. Unlike traditional models that only recognize or classify data, GANs take a creative approach, generating entirely new content that closely resembles real-world data. This ability has helped various fields such as art, gaming, healthcare and data science. In this article, we will see more about GANs and their core concepts.
Architecture of GANs
GANs consist of two main models that work together to create realistic synthetic data which
are as follows:
1. Generator Model
The generator is a deep neural network that takes random noise as input to generate realistic
data samples like images or text. It learns the underlying data patterns by adjusting its internal
parameters during training through backpropagation. Its objective is to produce samples that
the discriminator classifies as real.
J_G = −(1/m) Σ_{i=1}^{m} log D(G(z_i))
where m is the number of generated samples, z_i is a noise vector and D(G(z_i)) is the discriminator's estimated probability that the generated sample is real.
The generator aims to maximize D(G(z_i)), meaning it wants the discriminator to classify its fake data as real (probability close to 1).
2. Discriminator Model
The discriminator acts as a binary classifier, distinguishing between real and generated data. It learns to improve its classification ability through training, refining its parameters to detect fake samples more accurately. When dealing with image data, the discriminator uses convolutional layers or other relevant architectures, which help extract features and enhance the model's ability.
J_D = −(1/m) Σ_{i=1}^{m} log D(x_i) − (1/m) Σ_{i=1}^{m} log(1 − D(G(z_i)))
J_D measures how well the discriminator classifies real and fake samples.
The discriminator wants to correctly classify real data as real (maximize log D(x_i)) and fake data as fake (maximize log(1 − D(G(z_i)))).
MinMax Loss
GANs are trained using a MinMax Loss between the generator and discriminator:
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
where p_data(x) is the distribution of real data and p_z(z) is the distribution of the generator's input noise.
The generator tries to minimize this loss (to fool the discriminator) and the discriminator tries
to maximize it (to detect fakes accurately).
GANs train by having two networks, the Generator (G) and the Discriminator (D), compete and improve together. Here's the step-by-step process:
1. Generator's First Move
G takes a random noise vector as input and transforms it into a synthetic data sample, attempting to imitate the real data.
2. Discriminator's Turn
D's job is to analyze each input and find whether it's real data or something G cooked up. It
outputs a probability score between 0 and 1. A score of 1 shows the data is likely real and 0
suggests it's fake.
3. Adversarial Learning
If the discriminator correctly classifies real and fake data it gets better at its job.
If the generator fools the discriminator by creating realistic fake data, it receives a
positive update and the discriminator is penalized for making a wrong decision.
4. Generator's Improvement
Each time the discriminator mistakes fake data for real, the generator learns from this
success.
Through many iterations, the generator improves and creates more convincing fake
samples.
5. Discriminator's Adaptation
The discriminator also learns continuously by updating itself to better spot fake data.
6. Training Progression
Eventually the discriminator struggles to distinguish real from fake, showing that the GAN has reached a well-trained state.
At this point, the generator can produce high-quality synthetic data that can be used
for different applications.
Types of GANs
There are several types of GANs each designed for different purposes. Here are some
important types:
1. Vanilla GAN
A known limitation is unstable training: the generator and discriminator may not improve smoothly.
2. Conditional GAN (CGAN)
Conditional GANs (CGANs) add an additional conditional parameter to guide the generation process. Instead of generating data randomly, they allow the model to produce specific types of outputs.
Working of CGANs:
A conditional variable (y) is fed into both the generator and the discriminator.
This ensures that the generator creates data corresponding to the given condition (e.g
generating images of specific objects).
The discriminator also receives the labels to help distinguish between real and fake
data.
Example: Instead of generating any random image, CGAN can generate a specific object like a
dog or a cat based on the label.
3. Deep Convolutional GAN (DCGAN)
Deep Convolutional GANs (DCGANs) are among the most popular types of GANs used for image generation. Key design choices include:
Max pooling layers are replaced with strided convolutions, making the model more efficient.
Fully connected layers are removed, which allows for better spatial understanding of
images.
4. Laplacian Pyramid GAN (LAPGAN)
Working of LAPGAN:
This process allows the image to gradually refine details and helps in reducing noise
and improving clarity.
Due to its ability to generate highly detailed images, LAPGAN is considered a superior approach
for photorealistic image generation.
5. Super-Resolution GAN (SRGAN)
Working of SRGAN:
Enhances low-resolution images by adding finer details, making them appear sharper and more realistic.
Helps to reduce common image upscaling errors such as blurriness and pixelation.
Generative Adversarial Networks (GANs) can generate realistic images by learning from existing
image datasets. Here we will be implementing a GAN trained on the CIFAR-10 dataset using
PyTorch.
We will be using Pytorch, Torchvision, Matplotlib and Numpy libraries for this. Set the device to
GPU if available otherwise use CPU.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
We use PyTorch’s transforms to convert images to tensors and normalize pixel values between
-1 and 1 for better training stability.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
Download and load the CIFAR-10 dataset with defined transformations. Use a DataLoader to
process the dataset in mini-batches of size 32 and shuffle the data.
train_dataset = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(train_dataset,
                                         batch_size=32, shuffle=True)
beta1, beta2: Beta parameters for the Adam optimizer (e.g. 0.5, 0.999)
num_epochs: Number of times the entire dataset will be processed (e.g. 10)
latent_dim = 100
lr = 0.0002
beta1 = 0.5
beta2 = 0.999
num_epochs = 10
Create a neural network that converts random noise into images. It uses upsampling with convolutional layers, batch normalization and ReLU activations. The final layer uses Tanh activation to scale outputs to the range [-1, 1].
nn.Linear(latent_dim, 128 * 8 * 8): Defines a fully connected layer that projects the
noise vector into a higher dimensional feature space.
class Generator(nn.Module):
    def __init__(self, latent_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128, momentum=0.78),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64, momentum=0.78),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Tanh()
        )

    def forward(self, z):
        img = self.model(z)
        return img
Create a binary classifier network that distinguishes real from fake images. Use convolutional
layers, batch normalization, dropout, LeakyReLU activation and a Sigmoid output layer to give a
probability between 0 and 1.
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ZeroPad2d((0, 1, 0, 1)),
            nn.BatchNorm2d(64, momentum=0.82),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128, momentum=0.82),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256, momentum=0.8),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(256 * 5 * 5, 1),
            nn.Sigmoid()
        )

    def forward(self, img):
        validity = self.model(img)
        return validity
Generator and Discriminator are initialized on the available device (GPU or CPU).
Binary Cross-Entropy (BCE) Loss is chosen as the loss function.
Adam optimizers are defined separately for the generator and discriminator with
specified learning rates and betas.
generator = Generator(latent_dim).to(device)
discriminator = Discriminator().to(device)
adversarial_loss = nn.BCELoss()
optimizer_G = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))
optimizer_D = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))
For each batch, we first train the discriminator on real images and on fake images generated by the generator, then update the generator so that its samples better fool the discriminator. After each epoch, we generate and display sample images created by the generator for visual inspection.
for epoch in range(num_epochs):
    for i, (real_images, _) in enumerate(dataloader):
        real_images = real_images.to(device)
        batch_size = real_images.size(0)
        valid = torch.ones(batch_size, 1, device=device)
        fake = torch.zeros(batch_size, 1, device=device)

        # Train the discriminator on real and fake images
        optimizer_D.zero_grad()
        z = torch.randn(batch_size, latent_dim, device=device)
        fake_images = generator(z)
        real_loss = adversarial_loss(discriminator(real_images), valid)
        fake_loss = adversarial_loss(discriminator(fake_images.detach()), fake)
        d_loss = (real_loss + fake_loss) / 2
        d_loss.backward()
        optimizer_D.step()

        # Train the generator to fool the discriminator
        optimizer_G.zero_grad()
        gen_images = generator(z)
        g_loss = adversarial_loss(discriminator(gen_images), valid)
        g_loss.backward()
        optimizer_G.step()

        if (i + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}] Batch {i+1}/{len(dataloader)} "
                  f"D loss: {d_loss.item():.4f} G loss: {g_loss.item():.4f}")

    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            z = torch.randn(16, latent_dim, device=device)
            generated = generator(z).detach().cpu()
            grid = torchvision.utils.make_grid(generated, nrow=4, normalize=True)
            plt.imshow(np.transpose(grid.numpy(), (1, 2, 0)))
            plt.axis("off")
            plt.show()
Output:
Training Output
By following these steps we successfully implemented and trained a GAN that learns to
generate realistic CIFAR-10 images through adversarial training.
Applications of GANs
1. Image Synthesis & Generation: GANs generate realistic images, avatars and high-resolution visuals by learning patterns from training data. They are used in art, gaming and AI-driven design.
Advantages of GANs
1. Synthetic Data Generation: GANs produce new, synthetic data resembling real data distributions, which is useful for augmentation, anomaly detection and creative tasks.
2. High-Quality Results: They can generate photorealistic images, videos, music and
other media with high quality.
3. Unsupervised Learning: They don't require labeled data, which makes them effective in scenarios where labeling is expensive or difficult.
4. Versatility: They can be applied across many tasks including image synthesis, text-to-
image generation, style transfer, anomaly detection and more.
GANs are evolving and shaping the future of artificial intelligence. As the technology improves,
we can expect even more innovative applications that will change how we create, work and
interact with digital content.
In Artificial Neural Networks (ANNs), data flows from the input layer to the output layer
through one or more hidden layers. Each layer consists of neurons that receive input, process
it, and pass the output to the next layer. The layers work together to extract features,
transform data, and make predictions.
Input Layer
Hidden Layers
Output Layer
Each layer is composed of nodes (neurons) that are interconnected. The layers work together
to process data through a series of transformations.
ANN Layers
1. Input Layer
Input layer is the first layer in an ANN and is responsible for receiving the raw input data. This
layer's neurons correspond to the features in the input data. For example, in image processing,
each neuron might represent a pixel value. The input layer doesn't perform any computations
but passes the data to the next layer.
Key Points:
Example: For an image, the input layer would have neurons for each pixel value.
Input Layer in ANN
2. Hidden Layers
Hidden Layers are the intermediate layers between the input and output layers. They perform
most of the computations required by the network. Hidden layers can vary in number and size,
depending on the complexity of the task.
Each hidden layer applies a set of weights and biases to the input data, followed by an
activation function to introduce non-linearity.
3. Output Layer
Output Layer is the final layer in an ANN. It produces the output predictions. The number of
neurons in this layer corresponds to the number of classes in a classification problem or the
number of outputs in a regression problem.
The activation function used in the output layer depends on the type of problem:
For a better understanding of activation functions, refer to the article: Activation Functions in Neural Networks.
Till now we have covered the basic layers: input, hidden, and output. Let’s now dive into the
specific types of hidden layers.
1. Dense (Fully Connected) Layer
Dense (Fully Connected) Layer is the most common type of hidden layer in an ANN. Every
neuron in a dense layer is connected to every neuron in the previous and subsequent layers.
This layer performs a weighted sum of inputs and applies an activation function to introduce
non-linearity. The activation function (like ReLU, Sigmoid, or Tanh) helps the network learn
complex patterns.
2. Convolutional Layer
Convolutional layers are used in Convolutional Neural Networks (CNNs) for image processing
tasks. They apply convolution operations to the input, capturing spatial hierarchies in the data.
Convolutional layers use filters to scan across the input and generate feature maps. This helps
in detecting edges, textures, and other visual features.
3. Recurrent Layer
Recurrent layers are used in Recurrent Neural Networks (RNNs) for sequence data like time
series or natural language. They have connections that loop back, allowing information to
persist across time steps. This makes them suitable for tasks where context and temporal
dependencies are important.
4. Dropout Layer
Dropout layers are a regularization technique used to prevent overfitting. They randomly drop
a fraction of the neurons during training, which forces the network to learn more robust
features and reduces dependency on specific neurons. During training, each neuron is retained
with a probability p.
5. Pooling Layer
Pooling Layer is used to reduce the spatial dimensions of the data, thereby decreasing the
computational load and controlling overfitting. Common types of pooling include Max Pooling
and Average Pooling.
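To make these roles concrete, here is a hedged Keras sketch stacking the layer types above (the sizes are illustrative assumptions; a recurrent layer is omitted since it applies to sequence inputs rather than images):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # input layer
    layers.Conv2D(16, (3, 3), activation='relu'),  # convolutional layer: feature maps
    layers.MaxPooling2D((2, 2)),                   # pooling layer: downsample
    layers.Flatten(),
    layers.Dense(64, activation='relu'),           # dense (fully connected) layer
    layers.Dropout(0.5),                           # dropout: regularization during training
    layers.Dense(10, activation='softmax'),        # output layer
])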
Understanding the different types of layers in an ANN is essential for designing effective neural
networks. Each layer has a specific role, from receiving input data to learning complex patterns
and producing predictions. By combining these layers, we can build powerful models capable
of solving a wide range of tasks.
While building a neural network, one key decision is selecting the Activation Function for both
the hidden layer and the output layer. It is a mathematical function applied to the output of a
neuron. It introduces non-linearity into the model, allowing the network to learn and
represent complex patterns in the data. Without this non-linearity feature a neural network
would behave like a linear regression model no matter how many layers it has.
Activation function decides whether a neuron should be activated by calculating the weighted
sum of inputs and adding a bias term. This helps the model make complex decisions and
predictions by introducing non-linearities to the output of each neuron.
Before diving into the activation function, you should have prior knowledge of the following
topics: Neural Networks, Backpropagation
Non-linearity means that the relationship between input and output is not a straight line. In
simple terms, the output does not change proportionally with the input. A common choice is the ReLU function, defined as σ(x) = max(0, x).
Imagine you want to classify apples and bananas based on their shape and color.
If we use a linear function it can only separate them using a straight line.
But real-world data is often more complex like overlapping colors, different lighting,
etc.
By adding a non-linear activation function like ReLU, Sigmoid or Tanh the network can
create curved decision boundaries to separate them correctly.
Effect of Non-Linearity
The inclusion of the ReLU activation function σ allows h_1 to introduce a non-linear decision boundary in the input space. This non-linearity enables the network to learn more complex patterns that are not possible with a purely linear model, such as:
Increasing the capacity of the network to form multiple decision boundaries based on
the combination of weights and biases.
Why is Non-Linearity Important in Neural Networks?
Neural networks consist of neurons that operate using weights, biases and activation
functions.
In the learning process these weights and biases are updated based on the error produced at
the output—a process known as backpropagation. Activation functions enable
backpropagation by providing gradients that are essential for updating the weights and biases.
Without non-linearity even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions help neural networks to model highly complex data
distributions and solve advanced deep learning tasks. Adding non-linear activation functions
introduce flexibility and enable the network to learn more complex and abstract patterns from
data.
To illustrate the need for non-linearity in neural networks with a specific example, let's consider a network with two input nodes (i_1 and i_2), a single hidden layer containing neurons h_1 and h_2 and an output neuron (out).
We will use w_1, ..., w_4 as the weights connecting the inputs to the hidden neurons and w_5, w_6 as the weights connecting the hidden neurons to the output. We'll also include biases (b_1 and b_2 for the hidden neurons and a bias for the output neuron) to complete the model.
The inputs to the hidden neurons h_1 and h_2 are calculated as weighted sums of the inputs plus a bias:
h_1 = i_1·w_1 + i_2·w_3 + b_1
h_2 = i_1·w_2 + i_2·w_4 + b_2
The output neuron is then a weighted sum of the hidden neurons' outputs plus a bias:
output = h_1·w_5 + h_2·w_6 + bias
In order to add non-linearity, we will be using the sigmoid activation function in the output layer:
σ(x) = 1 / (1 + e^{−x})
This gives the final output of the network after applying the sigmoid activation function in
output layers, introducing the desired non-linearity.
Linear Activation Function resembles a straight line defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions the output is a linear combination of the input.
The linear activation function is used in just one place: the output layer.
Using linear activation across all layers limits the network's ability to learn complex patterns.
Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
Linear Activation Function or Identity Function returns the input as the output
1. Sigmoid Function
It allows neural networks to handle and model complex patterns that linear equations
cannot.
The output ranges between 0 and 1, hence useful for binary classification.
The function exhibits a steep gradient when x values are between -2 and 2. This
sensitivity means that small changes in input x can cause significant changes in output
y which is critical during the training process.
Sigmoid or Logistic Activation Function Graph
Tanh function (hyperbolic tangent function) is a shifted version of the sigmoid, allowing it to
stretch across the y-axis. It is defined as:
f(x) = tanh(x) = 2 / (1 + e^{−2x}) − 1
Equivalently, it can be written in terms of the sigmoid: tanh(x) = 2 × sigmoid(2x) − 1
Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered
output, facilitating easier learning for subsequent layers.
Tanh Activation Function
Value Range: [0, ∞), meaning the function only outputs non-negative values.
Advantage over other Activation: ReLU is less computationally expensive than tanh
and sigmoid because it involves simpler mathematical operations. At a time only a few
neurons are activated making the network sparse making it efficient and easy for
computation.
ReLU Activation Function
1. Softmax Function
The Softmax function ensures that each class is assigned a probability, helping to
identify which class the input belongs to.
Softmax Activation Function
2. SoftPlus Function
The Softplus function is defined as f(x) = ln(1 + e^x). This ensures that the output is always positive and differentiable at all points, which is an advantage over the traditional ReLU function.
Range: The function outputs values in the range (0, ∞), similar to ReLU, but without the hard zero threshold that ReLU has.
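Each of the functions discussed above is a one-liner in NumPy; a small sketch for side-by-side comparison:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))     # range (0, 1)

def tanh(x):
    return np.tanh(x)               # range (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)         # range [0, inf)

def softplus(x):
    return np.log1p(np.exp(x))      # range (0, inf), smooth everywhere

def softmax(x):
    e = np.exp(x - np.max(x))       # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities that sum to 1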
The choice of activation function has a direct impact on the performance of a neural network
in several ways:
1. Convergence Speed: Functions like ReLU allow faster training by avoiding the vanishing
gradient problem while Sigmoid and Tanh can slow down convergence in deep
networks.
2. Gradient Flow: Activation functions like ReLU ensure better gradient flow, helping
deeper layers learn effectively. In contrast Sigmoid can lead to small gradients,
hindering learning in deep layers.
3. Model Complexity: Activation functions like Softmax allow the model to handle
complex multi-class problems, whereas simpler functions like ReLU or Leaky ReLU are
used for basic layers.
Activation functions are the backbone of neural networks enabling them to capture non-linear
relationships in data. From classic functions like Sigmoid and Tanh to modern variants like ReLU
and Swish, each has its place in different types of neural networks. The key is to understand
their behavior and choose the right one based on your model’s needs.
Feedforward Neural Network (FNN) is a type of artificial neural network in which information
flows in a single direction—from the input layer through hidden layers to the output layer—
without loops or feedback. It is mainly used for pattern recognition tasks like image and speech
classification.
For example, in a credit scoring system, banks use an FNN that analyzes users' financial profiles, such as income, credit history and spending habits, to determine their creditworthiness. Each piece of information flows through the network's layers, where various calculations are made to produce a final score.
Feedforward Neural Networks have a structured layered design where data flows sequentially
through each layer.
1. Input Layer: The input layer consists of neurons that receive the input data. Each
neuron in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and output
layers. These layers are responsible for learning the complex patterns in the data. Each
neuron in a hidden layer applies a weighted sum of inputs followed by a non-linear
activation function.
3. Output Layer: The output layer provides the final output of the network. The number
of neurons in this layer corresponds to the number of classes in a classification
problem or the number of outputs in a regression problem.
Each connection between neurons in these layers has an associated weight that is adjusted
during the training process to minimize the error in predictions.
Activation Functions
Activation functions introduce non-linearity into the network enabling it to learn and model
complex data patterns.
Tanh: tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
ReLU: ReLU(x) = max(0, x)
Training a Feedforward Neural Network involves adjusting the weights of the neurons to
minimize the error between the predicted output and the actual output. This process is
typically performed using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation the input data passes through the
network and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
Forward Propagation
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. Common variants of gradient
descent include:
Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
Stochastic Gradient Descent (SGD): Updates weights for each training example
individually.
Mini-batch Gradient Descent: Updates weights after computing the gradient over a small batch of training examples (a schematic sketch follows below).
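The three variants differ only in how much data feeds each weight update; a schematic NumPy sketch (the linear-model gradient here is a placeholder assumption) is:

import numpy as np

def grad(w, X, y):
    # Placeholder: gradient of MSE loss for a linear model y ~ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

def mini_batch_gd(w, X, y, lr=0.01, batch_size=32, epochs=10):
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)                # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad(w, X[batch], y[batch])  # update per mini-batch
    return w

# Batch gradient descent is batch_size = n; SGD is batch_size = 1.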
Accuracy: The proportion of correctly classified instances out of the total instances.
Precision: The ratio of true positive predictions to the total predicted positives.
Recall: The ratio of true positive predictions to the total actual positives.
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
This code demonstrates the process of building, training and evaluating a neural network
model using TensorFlow and Keras to classify handwritten digits from the MNIST dataset.
The model architecture is defined using the Sequential API, consisting of:
a Flatten layer that converts each 28x28 image into a 1D vector,
a Dense hidden layer with 128 neurons and ReLU activation and
a final Dense layer with 10 neurons and softmax activation to output probabilities for each digit class.
Model is compiled with the Adam optimizer, SparseCategoricalCrossentropy loss function and
SparseCategoricalAccuracy metric and then trained for 5 epochs on the training data.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

model.compile(optimizer=Adam(),
              loss=SparseCategoricalCrossentropy(),
              metrics=[SparseCategoricalAccuracy()])
model.fit(x_train, y_train, epochs=5)
Output:
By understanding their architecture, activation functions, and training process, one can
appreciate the capabilities and limitations of these networks. Continuous advancements in
optimization techniques and activation functions have made feedforward networks more
efficient and effective, contributing to the broader field of artificial intelligence.
Back Propagation, also known as "Backward Propagation of Errors", is a method used to train neural networks. Its goal is to reduce the difference between the model's predicted output and the actual output by adjusting the weights and biases in the network.
It works iteratively to adjust weights and bias to minimize the cost function. In each epoch the
model adapts these parameters by reducing loss by following the error gradient. It often uses
optimization algorithms like gradient descent or stochastic gradient descent. The algorithm
computes the gradient using the chain rule from calculus allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.
Back Propagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to
each weight using the chain rule making it possible to update weights efficiently.
2. Scalability: The Back Propagation algorithm scales well to networks with multiple
layers and complex architectures making deep learning feasible.
3. Automated Learning: With Back Propagation the learning process becomes automated
and the model can adjust itself to optimize its performance.
The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
1. Forward Pass
In the forward pass, the input data is fed into the input layer. These inputs, combined with their respective weights, are passed to hidden layers. For example, in a network with two hidden layers (h1 and h2), the output from h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted inputs.
Each hidden layer computes the weighted sum (`a`) of the inputs then applies an activation
function like ReLU (Rectified Linear Unit) to obtain the output (`o`). The output is passed to
the next layer where an activation function such as softmax converts the weighted outputs
into probabilities for classification.
2. Backward Pass
In the backward pass the error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases. One common method for error calculation is the Mean Squared Error (MSE), given by:
MSE = (y_target − y_predicted)^2
Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be
adjusted to minimize the error in the next iteration. The backward pass continues layer by layer
ensuring that the network learns and improves its performance. The activation function
through its derivative plays a crucial role in computing these gradients during Back
Propagation.
Let’s walk through an example of Back Propagation in machine learning. Assume the neurons
use the sigmoid activation function for the forward and backward pass. The target output is 0.5
and the learning rate is 1.
Example (1) of backpropagation sum
Forward Propagation
1. Initial Calculation
a_j = Σ (w_{i,j} · x_i)
Where,
a_j is the weighted sum of all the inputs and weights at each node
w_{i,j} represents the weight between the i-th input and the j-th neuron
O (output): After applying the activation function to a, we get the output of the neuron:
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the
model.
y_j = 1 / (1 + e^{−a_j})
To find the outputs of y_3, y_4 and y_5:
3. Computing Outputs
At the h1 node:
a_1 = (w_{1,1} · x_1) + (w_{2,1} · x_2) = (0.2 × 0.35) + (0.2 × 0.7) = 0.21
Once we have calculated the a_1 value, we can find the y_3 value:
y_j = F(a_j) = 1 / (1 + e^{−a_1})
y_3 = F(0.21) = 1 / (1 + e^{−0.21})
y_3 = 0.56
Similarly, at the h2 node:
a_2 = (w_{1,2} · x_1) + (w_{2,2} · x_2) = (0.3 × 0.35) + (0.3 × 0.7) = 0.315
y_4 = F(0.315) = 1 / (1 + e^{−0.315})
At the output node:
a_3 = (w_{1,3} · y_3) + (w_{2,3} · y_4) = (0.3 × 0.57) + (0.9 × 0.59) = 0.702
y_5 = F(0.702) = 1 / (1 + e^{−0.702}) = 0.67
Values of y_3, y_4 and y_5
4. Error Calculation
Our actual output is 0.5 but we obtained 0.67. To calculate the error we can use the below formula:
Error_j = y_target − y_5 = 0.5 − 0.67 = −0.17
Back Propagation
1. Calculating Gradients
Δw_{ij} = η × δ_j × O_j
Where:
δ_j is the error term of the neuron, O_j is the output feeding the connection and η is the learning rate (1 in this example).
For the output neuron y_5:
δ_5 = y_5(1 − y_5)(y_target − y_5)
= 0.67(1 − 0.67)(−0.17) = −0.0376
For h1:
δ_3 = y_3(1 − y_3)(w_{1,3} × δ_5)
= 0.56(1 − 0.56)(0.3 × −0.0376) = −0.0027
For h2:
δ_4 = y_4(1 − y_4)(w_{2,3} × δ_5)
= 0.59(1 − 0.59)(0.9 × −0.0376) = −0.0819
4. Weight Updates
Δw_{2,3} = 1 × (−0.0376) × 0.59 = −0.022184
New weight:
w_{2,3}(new) = −0.022184 + 0.9 = 0.877816
Δw_{1,1} = 1 × (−0.0027) × 0.35 = 0.000945
New weight:
w_{1,1}(new) = 0.000945 + 0.2 = 0.200945
Similarly, the remaining weights are updated:
w_{1,2}(new) = 0.273225
w_{1,3}(new) = 0.086615
w_{2,1}(new) = 0.269445
w_{2,2}(new) = 0.18534
Repeating the forward pass with the updated weights gives:
y_3 = 0.57
y_4 = 0.56
y_5 = 0.61
Since y5=0.61y5=0.61 is still not the target output the process of calculating the error and
backpropagating continues until the desired output is reached.
This process demonstrates how Back Propagation iteratively updates weights by minimizing
errors until the network accurately predicts the output.
Error=ytarget−y5Error=ytarget−y5
=0.5−0.61=−0.11=0.5−0.61=−0.11
This process is said to be continued until the actual output is gained by the neural network.
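To sanity-check the arithmetic above, here is a minimal Python sketch of the output-layer update; the variable names are ours, and it reproduces the worked values (the rounded $\delta_5 = -0.0376$ is used, as in the example):

# sigmoid outputs from the example, target 0.5, learning rate 1
y5, target, eta = 0.67, 0.5, 1.0
delta5 = round(y5 * (1 - y5) * (target - y5), 4)  # -0.0376

y4, w23 = 0.59, 0.9
dw23 = eta * delta5 * y4        # -0.022184
w23_new = w23 + dw23            # 0.877816
print(delta5, dw23, w23_new)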
This code demonstrates how Back Propagation is used in a neural network to solve the XOR problem. We define a neural network with an input layer of 2 inputs, a hidden layer of 4 neurons and an output layer of 1 neuron, using the sigmoid function as the activation function.
1. Defining the Network
self.input_size = input_size: stores the size of the input layer
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # randomly initialize weights and zero-initialize biases
        self.weights_input_hidden = np.random.randn(
            self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(
            self.hidden_size, self.output_size)
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # x is assumed to already be a sigmoid output
        return x * (1 - x)

    def feedforward(self, X):
        self.hidden_activation = np.dot(
            X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)
        self.output_activation = np.dot(
            self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)
        return self.predicted_output
In the backward pass (Back Propagation), the errors between the predicted and actual outputs are computed. The gradients are calculated using the derivative of the sigmoid function, and the weights and biases are updated accordingly.
output_delta = output_error * self.sigmoid_derivative(self.predicted_output): calculates the delta for the output layer
hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output): calculates the delta for the hidden layer

    # method of the NeuralNetwork class
    def backward(self, X, y, learning_rate):
        # error and delta at the output layer
        output_error = y - self.predicted_output
        output_delta = output_error * \
            self.sigmoid_derivative(self.predicted_output)

        # propagate the error back to the hidden layer
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * \
            self.sigmoid_derivative(self.hidden_output)

        # update weights and biases in the direction that reduces the error
        self.weights_hidden_output += np.dot(self.hidden_output.T,
                                             output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0,
                                   keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0,
                                   keepdims=True) * learning_rate
4. Training the Network
The network is trained over 10,000 epochs using the Back Propagation algorithm with a learning rate of 0.1, progressively reducing the error.
loss = np.mean(np.square(y - output)): calculates the mean squared error (MSE) loss

    # method of the NeuralNetwork class
    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backward(X, y, learning_rate)
            if epoch % 4000 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss: {loss}")

# XOR inputs and expected outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# create and train the network, then print its predictions
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)
output = nn.feedforward(X)
print(output)
Output:
The output shows the training progress of a neural network over 10,000 epochs.
Initially the loss was high (0.2713) but it gradually decreased as the network learned
reaching a low value of 0.0066 by epoch 8000.
The final predictions are close to the expected XOR outputs: approximately 0 for [0, 0]
and [1, 1] and approximately 1 for [0, 1] and [1, 0] indicating that the network
successfully learned to approximate the XOR function.
Among the advantages of Back Propagation:
Scalability: The algorithm scales efficiently with larger datasets and more complex networks, making it ideal for large-scale tasks.
Back Propagation also has limitations:
1. Vanishing Gradient Problem: In deep networks the gradients can become very small during Back Propagation, making it difficult for the network to learn. This is common when using activation functions like sigmoid or tanh (see the sketch after this list).
2. Exploding Gradients: The gradients can also become excessively large causing the
network to diverge during training.
3. Overfitting: If the network is too complex it might memorize the training data instead
of learning general patterns.
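The vanishing gradient point is easy to demonstrate numerically; this small sketch (values are illustrative) multiplies the sigmoid derivative across layers, as the chain rule does:

import numpy as np

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# the sigmoid derivative is at most 0.25, so the chain rule multiplies
# many small factors together in a deep network
grad = 1.0
for layer in range(20):
    grad *= sigmoid_derivative(0.0)  # 0.25, the best possible case
print(grad)  # about 9.1e-13, the gradient has effectively vanished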
NLP is used by many applications that use language, such as text translation, voice recognition,
text summarization and chatbots. You may have used some of these applications yourself, such
as voice-operated GPS systems, digital assistants, speech-to-text software and customer
service bots. NLP also helps businesses improve their efficiency, productivity and performance
by simplifying complex tasks that involve language.
NLP Techniques
NLP encompasses a wide array of techniques aimed at enabling computers to process and understand human language. These tasks can be categorized into several broad areas, each addressing different aspects of language processing. Here are some of the key NLP techniques:
1. Text Preprocessing
Stopword Removal: Removing common words (like "and", "the", "is") that may not carry significant meaning (see the sketch after this list).
2. Syntax and Parsing
Constituency Parsing: Breaking down a sentence into its constituent parts or phrases (e.g., noun phrases, verb phrases).
3. Semantic Analysis
Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of people, organizations, locations, dates, etc.
Coreference Resolution: Identifying when different words refer to the same entity in a
text (e.g., "he" refers to "John").
4. Information Extraction
Entity Extraction: Identifying specific entities and their relationships within the text.
6. Language Generation
Machine Translation: Translating text from one language to another.
7. Speech Processing
8. Question Answering
Retrieval-Based QA: Finding and returning the most relevant text passage in response
to a query.
9. Dialogue Systems
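As referenced in the Stopword Removal technique above, here is a minimal NLTK sketch of that step; the sentence is made up for illustration:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example sentence and the stopwords are removed from it"
words = word_tokenize(text)

# keep only the words that are not in NLTK's English stopword list
filtered = [w for w in words if w.lower() not in stopwords.words('english')]
print(filtered)  # ['example', 'sentence', 'stopwords', 'removed']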
How NLP Works
1. Data Collection and Storage
Data Collection: Gathering text data from various sources such as websites, books,
social media or proprietary databases.
Data Storage: Storing the collected text data in a structured format, such as a database
or a collection of documents.
2. Text Preprocessing
Preprocessing is crucial to clean and prepare the raw text data for analysis. Common
preprocessing steps include:
Stemming and Lemmatization: Reducing words to their base or root forms. Stemming
cuts off suffixes, while lemmatization considers the context and converts words to their
meaningful base form.
3. Text Representation
Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and
word order but keeping track of word frequency.
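A short sketch of the Bag of Words idea, using sklearn's CountVectorizer on two made-up documents:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(bow.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]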
4. Feature Extraction
Extracting meaningful features from the text data that can be used for various NLP tasks.
N-grams: Capturing sequences of N words to preserve some context and word order.
Syntactic Features: Using parts of speech tags, syntactic dependencies and parse trees.
5. Model Selection and Training
Selecting and training a machine learning or deep learning model to perform specific NLP tasks.
Supervised Learning: Using labeled data to train models like Support Vector Machines
(SVM), Random Forests or deep learning models like Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs).
6. Model Deployment and Inference
Deploying the trained model and using it to make predictions or extract insights from new text data.
Text Classification: Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).
Named Entity Recognition (NER): Identifying and classifying entities in the text.
7. Evaluation
Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-score and others, as in the sketch below.
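A minimal sklearn sketch of these metrics; the labels are made up (imagine a binary spam classifier, 1 = spam):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1-score :", f1_score(y_true, y_pred))         # 0.75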
There are a variety of technologies related to natural language processing (NLP) that are used
to analyze and understand human language. Some of the most common include:
2. Natural Language Toolkits (NLTK) and other libraries: NLTK is a popular open-source
library in Python that provides tools for NLP tasks such as tokenization, stemming and
part-of-speech tagging. Other popular libraries include spaCy, OpenNLP and CoreNLP.
3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as
dependency parsing and constituency parsing.
4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written
text into spoken words, while STT systems convert spoken words into written text.
5. Named Entity Recognition (NER) systems: NER systems identify and extract named
entities such as people, places and organizations from the text.
7. Machine Translation: NLP is used for language translation from one language to
another through a computer.
8. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans
through auditory or textual methods.
Algorithmic Trading: Algorithmic trading is used for predicting stock market conditions.
Using NLP, this technology examines news headlines about companies and stocks and
attempts to comprehend their meaning in order to determine if you should buy, sell or
hold certain stocks.
Question Answering: NLP can be seen in action in Google Search or Siri. A major use of NLP is to make search engines understand the meaning of what we are asking and generate natural language in return to give us the answers.
Future Scope
Chatbots and Virtual Assistants: NLP enables chatbots to quickly understand and
respond to user queries, providing 24/7 assistance across text or voice interactions.
Invisible User Interfaces (UI): With NLP, devices like Amazon Echo allow for seamless
communication through voice or text, making technology more accessible without
traditional interfaces.
Smarter Search: NLP is improving search by allowing users to ask questions in natural
language, as seen with Google Drive's recent update, making it easier to find
documents.
Multilingual NLP: Expanding NLP to support more languages, including regional and
minority languages, broadens accessibility.
Future Enhancements: NLP is evolving with the use of Deep Neural Networks (DNNs) to make
human-machine interactions more natural. Future advancements include improved semantics
for word understanding and broader language support, enabling accurate translations and
better NLP models for languages not yet supported.
| Natural Language Processing (NLP) | Natural Language Understanding (NLU) | Natural Language Generation (NLG) |
| --- | --- | --- |
| It was first started by Alan Turing to make the machine understand the context of any document rather than treating it as simple words. | This explores the ways which enable computers to grasp the instructions provided by users in human languages like English, Hindi etc. | This enables computers to produce output after understanding the input given by the user in natural languages like English, Hindi etc. |
| Applications of NLP are smart assistance, language translation, text analysis etc. | Applications of NLU are speech recognition, sentiment analysis, spam filtering etc. | Applications of NLG are chatbots, voice assistants etc. |
| It makes use of sensors for input and uses different layers for processing data and then provides the output. | Sensors and processors are used to take input and process the information. | After understanding and processing, actuators are used to provide output. |
| It utilizes different strategies to understand the natural language and give feedback accordingly. | It involves different analysis phases. | It has different generation phases. |
As discussed earlier, NLTK is Python's API library for performing an array of tasks in human language. It can perform a variety of operations on textual data, such as classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc.
Installation:
NLTK can be installed using pip (pip install nltk). After installation, download the NLTK data packages by running the following code:
import nltk
nltk.download('all')
Now, having installed NLTK successfully in our system, let's perform some basic operations on
text data using NLTK.
Tokenization
Tokenization refers to breaking down text into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look at the two major kinds of tokenization that NLTK provides:
Word Tokenization
Sentence Tokenization
Example:
"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP"
will be sentence-tokenized as
['I study Machine Learning on GeeksforGeeks.', "Currently, I'm studying NLP."]

from nltk.tokenize import word_tokenize, sent_tokenize

sent = "I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP."
print(word_tokenize(sent))
print(sent_tokenize(sent))
Output:
E.g. the words 'play', 'plays', 'played' and 'playing' convey the same action, hence we can map them all to their base form, i.e. 'play'. This mapping to a base form is called canonicalization, and there are two widely used canonicalization techniques: Stemming and Lemmatization.
Stemming
Stemming generates the base word from the inflected word by removing the affixes of the
word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be
noted that stemmers might not always result in semantically meaningful base words.
Stemmers are faster and computationally less expensive than lemmatizers.
In the following code, we will stem words using the Porter Stemmer, one of the most widely used stemmers:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))
Output:
play
play
play
play
We can see that all the variations of the word 'play' have been reduced to the same word -
'play'. In this case, the output is a meaningful word, 'play'. However, this is not always the case.
Let us take an example.
porter = PorterStemmer()
print(porter.stem("Communication"))
Output:
commun
The stemmer reduces the word 'communication' to a base word 'commun' which is
meaningless in itself.
Lemmatization
Lemmatization involves grouping together the inflected forms of the same word. This way, we can reach the base form of any word, which will be meaningful in nature. The base form here is called the lemma. Please note that these groups are stored in the lemmatizer; there is no removal of affixes as in the case of a stemmer.
Example:
'play', 'plays', 'played', and 'playing' have 'play' as the lemma.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))
Output:
play
play
play
play
Please note that in lemmatizers, we need to pass the Part of Speech of the word along with the
word as a function argument.
Also, lemmatizers always result in meaningful base words. Let us take the same example as we
took in the case for stemmers.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("Communication", 'v'))
Output:
Communication
Part of Speech (POS) tagging refers to assigning each word of a sentence to its part of speech.
It is significant as it helps to give a better syntactic overview of a sentence.
Example:
"GeeksforGeeks is a Computer Science platform."
Let's see how NLTK's POS tagger will tag this sentence.

from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
print(tags)
Output:
[('GeeksforGeeks', 'NNP'),
('is', 'VBZ'),
('a', 'DT'),
('Computer', 'NNP'),
('Science', 'NNP'),
('platform', 'NN'),
('.', '.')]
Conclusion
In conclusion, the Natural Language Toolkit (NLTK) is a powerful Python library that offers a wide range of tools for Natural Language Processing (NLP). From fundamental tasks like text pre-processing to more advanced operations such as semantic reasoning, NLTK provides a versatile API that caters to the diverse needs of language-related tasks.
Unlike simple word frequency, TF-IDF balances common and rare words to highlight the most
meaningful terms.
How TF-IDF Works?
TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency
(IDF).
Term Frequency (TF): Measures how often a word appears in a document. A higher frequency suggests greater importance. If a term appears frequently in a document, it is likely relevant to the document's content. Formula:
$TF(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$
Limitations of TF Alone:
TF does not account for the global importance of a term across the entire corpus.
Common words like "the" or "and" may have high TF scores but are not meaningful in
distinguishing documents.
Inverse Document Frequency (IDF): Reduces the weight of common words across multiple documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific. Formula:
$IDF(t, D) = \log\left(\frac{\text{total number of documents } N}{\text{number of documents containing } t}\right)$
The logarithm is used to dampen the effect of very large or very small values, ensuring
the IDF score scales appropriately.
It also helps balance the impact of terms that appear in extremely few or extremely
many documents.
Limitations of IDF Alone:
IDF does not consider how often a term appears within a specific document.
A term might be rare across the corpus (high IDF) but irrelevant in a specific document
(low TF).
To better grasp how TF-IDF works, let's walk through a detailed example. Imagine we have a corpus (a collection of documents) with three documents:
Document 1: "The cat sat on the mat."
Document 2: "The dog played in the park."
Document 3: "Cats and dogs are great pets."
Our goal is to calculate the TF-IDF score for specific terms in these documents. Let's focus on the word "cat" and see how TF-IDF evaluates its importance.
Step 1: Compute Term Frequency (TF)
For Document 1: "cat" appears once among 6 terms, so TF(cat, d1) = 1/6 ≈ 0.167.
For Document 2: "cat" does not appear, so TF(cat, d2) = 0.
For Document 3: "cat" appears once (as "cats"); the total number of terms in Document 3 is 6 ("cats", "and", "dogs", "are", "great", "pets"), so TF(cat, d3) = 1/6 ≈ 0.167.
In Document 1 and Document 3, the word "cat" has the same TF score. This means it appears with the same relative frequency in both documents.
In Document 2, the TF score is 0 because the word "cat" does not appear.
Step 2: Compute Inverse Document Frequency (IDF)
Number of documents containing the term "cat": 2 (Document 1 and Document 3).
So, $IDF(cat, D) = \log_{10}\left(\frac{3}{2}\right) \approx 0.176$
The IDF score for "cat" is relatively low. This indicates that the word "cat" is not very rare in
the corpus—it appears in 2 out of 3 documents. If a term appeared in only 1 document, its IDF
score would be higher, indicating greater uniqueness.
Step 3: Compute TF-IDF
$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$
The TF-IDF score for "cat" is $0.167 \times 0.176 \approx 0.029$ in Document 1 and Document 3, and 0 in Document 2. The score reflects both the frequency of the term in the document (TF) and its rarity across the corpus (IDF).
A higher TF-IDF score means the term is more important in that specific document.
1. Identifying Important Terms: TF-IDF helps us understand that "cat" is somewhat important
in Document 1 and Document 3 but irrelevant in Document 2.
If we were building a search engine, this score would help rank Document 1 and Document 3
higher for a query like "cat".
2. Filtering Common Words: Words like "the" or "and" would have high TF scores but very low
IDF scores because they appear in almost all documents. Their TF-IDF scores would be close to
0, indicating they are not meaningful.
3. Highlighting Unique Terms: If a term like "mat" appeared only in Document 1, it would have
a higher IDF score, making its TF-IDF score more significant in that document.
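As a sanity check, here is a minimal pure-Python sketch that reproduces the numbers in the worked example above; the variable names are ours, and "cats" in Document 3 is counted as "cat", as in the example:

import math

# TF: "cat" appears once among the 6 terms of Document 1 and once
# among the 6 terms of Document 3
tf_d1 = 1 / 6
tf_d3 = 1 / 6

# IDF: 2 of the 3 documents contain the term
idf = math.log10(3 / 2)       # ~0.176

print(round(tf_d1 * idf, 3))  # 0.029 for Document 1
print(round(tf_d3 * idf, 3))  # 0.029 for Document 3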
In Python, tf-idf values can be computed using the TfidfVectorizer() method in the sklearn module.
Syntax:
sklearn.feature_extraction.text.TfidfVectorizer(input)
Parameters:
input: either 'content' (the default, raw strings), 'file' (file objects) or 'filename' (paths of files to read).
Attributes:
idf_: It returns the inverse document frequency vector of the document passed as a parameter.
Returns:
fit_transform(): a sparse matrix of shape (n_samples, n_features) containing the tf-idf weighted document-term matrix.
Step-by-step Approach:
Import modules.
Collect strings from documents and create a corpus having a collection of strings from
the documents d0, d1, and d2.
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'
string = [d0, d1, d2]  # corpus

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)
print('\nidf values:')
print(tfidf.idf_)
Output:
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
print('\ntf-idf value:')
print(result)
# in matrix form
print(result.toarray())
Output:
The result variable consists of the unique words as well as their tf-idf values. It can be elaborated using the table below:

| Document | Word | Document Index | Word Index | tf-idf value |
| --- | --- | --- | --- | --- |
| d0 | for | 0 | 0 | 0.549 |
| d0 | geeks | 0 | 1 | 0.8355 |
| d1 | geeks | 1 | 1 | 1.000 |
| d2 | r2j | 2 | 2 | 1.000 |
Below are some examples which depict how to compute tf-idf values of words from a
corpus:
Example 1: The step-by-step program above computes the tf-idf values for the corpus of d0, d1 and d2.
Example 2: Here, tf-idf values are computed from a corpus having unique values.

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'geek1'
d1 = 'geek2'
d2 = 'geek3'
d3 = 'geek4'
string = [d0, d1, d2, d3]  # corpus

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)
Output:
Example 3: In this program, tf-idf values are computed from a corpus having similar documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents (a corpus of similar documents)
d0 = 'Geeks for geeks'
d1 = 'Geeks for geeks'
string = [d0, d1]  # corpus

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)
Output:
Example 4: Below is a program in which we calculate the tf-idf value of a single word, geeks, repeated multiple times in multiple documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# assign corpus: 'geeks' repeated across documents
d0 = 'geeks geeks geeks'
d1 = 'geeks geeks'
d2 = 'geeks'
string = [d0, d1, d2]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
print(result)
Output:
N-gram can be defined as the contiguous sequence of n items from a given sample
of text or speech. The items can be letters, words, or base pairs according to the application.
The N-grams typically are collected from a text or speech corpus (A long text dataset).
For instance, N-grams can be unigrams like ("This", "article", "is", "on", "NLP") or bigrams
("This article", "article is", "is on", "on NLP").
An N-gram language model predicts the probability of a given N-gram within any sequence of
words in a language. A well-crafted N-gram model can effectively predict the next word in a
sentence, which is essentially determining the value of p(w∣h), where h is the history or context
and w is the word to predict.
Let’s explore how to predict the next word in a sentence. We need to calculate p(w|h), where
w is the candidate for the next word. Consider the sentence 'This article is on...'.If we want to
calculate the probability of the next word being "NLP", the probability can be expressed as:
p("NLP"∣"This","article","is","on")p("NLP"∣"This","article","is","on")
To generalize, the conditional probability of the fifth word given the first four can be written as:
$p(w_5 \mid w_1, w_2, w_3, w_4)$, or more generally, $p(w_n \mid w_1, w_2, \ldots, w_{n-1})$
Applying the chain rule, built from the definition of conditional probability:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(A \cap B) = P(A \mid B)\,P(B)$
This yields:
$P(w_1, w_2, w_3, \ldots, w_n) = \prod_i P(w_i \mid w_1, w_2, \ldots, w_{i-1})$
By applying the Markov assumption, which proposes that the future state depends only on the current state and not on the sequence of events that preceded it, we simplify the formula:
$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})$
For a unigram model (k = 0), each word is treated as independent:
$P(w_1, w_2, \ldots, w_n) \approx \prod_i P(w_i)$
For a bigram model (k = 1), each word depends only on the previous word:
$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
The following program builds a trigram model on the Reuters corpus and uses it to predict the most likely next word (the query pair 'the cost' is an illustrative choice):

import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

nltk.download('reuters')
nltk.download('punkt')

# tokenized words from the Reuters corpus
words = reuters.words()

# Create trigrams
tri_grams = list(trigrams(words))

# count how often each third word follows a pair of words
model = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# convert the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words
    using the trained trigram model.
    """
    next_word = model.get((w1, w2))
    if next_word:
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return None

# Example usage
print("Next Word:", predict_next_word('the', 'cost'))

Output:
Next Word: of
Entropy measures the average uncertainty of a distribution:
$H(p) = \sum_x p(x) \cdot (-\log(p(x)))$
For a language model evaluated on a test sequence of n words, the cross-entropy is estimated as:
$H(p) = \frac{1}{n} \sum_{i=1}^{n} \left(-\log_2 p(w_i \mid w_1^{i-1})\right)$
The cross-entropy is always greater than or equal to the entropy, i.e. the model uncertainty can be no less than the true uncertainty.
Perplexity is defined in terms of the cross-entropy:
$\text{Perplexity} = 2^{\text{Cross-Entropy}}$
Equivalently, the perplexity is the probability of the test set assigned by the language model, normalized by the number of words:
$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$
For Example:
Let's take the sentence 'Natural Language Processing'. For predicting the first word, suppose the candidate words have the following probabilities:
The 0.4
Processing 0.3
Natural 0.12
Language 0.18
Now we know the probability of getting the first word as 'Natural'. But what's the probability of getting the next word, 'Language', after the word 'Natural'?
The 0.05
Processing 0.3
Natural 0.15
Language 0.5
After getting the probability of generating the words 'Natural Language', what's the probability of getting 'Processing'?
The 0.1
Processing 0.7
Natural 0.1
Language 0.1
The probability the model assigns to the sentence is $0.12 \times 0.5 \times 0.7 = 0.042$. Normalizing by the number of words (N = 3):
$PP(W) = 0.042^{-\frac{1}{3}} \approx 2.876$
$\text{Entropy} = \log_2(2.876) = 1.524$
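A quick Python check of this arithmetic (variable names are ours):

import math

p_sentence = 0.12 * 0.5 * 0.7        # P('Natural Language Processing') = 0.042
perplexity = p_sentence ** (-1 / 3)  # normalized over the 3 words
entropy = math.log2(perplexity)

print(perplexity)  # ~2.877, the 2.876 above up to rounding
print(entropy)     # ~1.525, the 1.524 above up to rounding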
Shortcomings:
To get a better context of the text, we need higher values of n, but this will also
increase computational overhead.
The increasing value of n in n-gram can also lead to sparsity.