Deep Learning

Deep Learning is transforming the way machines understand, learn and interact with complex data. By mimicking the neural networks of the human brain, deep learning enables computers to autonomously uncover patterns and make informed decisions from vast amounts of unstructured data.


How Deep Learning Works?

A neural network consists of layers of interconnected nodes, or neurons, that collaborate to process input data. In a fully connected deep neural network, data flows through multiple layers where each neuron performs nonlinear transformations, allowing the model to learn intricate representations of the data.

In a deep neural network, the input layer receives data, which passes through hidden layers that transform the data using nonlinear functions. The final output layer generates the model's prediction.

For more details on neural networks refer to this article: What is a Neural Network?

Fully Connected Deep Neural Network

Difference between Machine Learning and Deep Learning


Machine learning and deep learning are both subsets of artificial intelligence, but there are many similarities and differences between them.


Machine Learning vs Deep Learning:

 Approach: Machine learning applies statistical algorithms to learn the hidden patterns and relationships in the dataset, while deep learning uses artificial neural network architectures to learn them.

 Data requirements: Machine learning can work on a smaller amount of data, while deep learning requires a larger volume of data than machine learning.

 Task suitability: Machine learning is better for low-label tasks, while deep learning is better for complex tasks like image processing and natural language processing.

 Training time: Machine learning takes less time to train the model, while deep learning takes more time.

 Feature engineering: In machine learning, a model is created from relevant features that are manually extracted from images to detect an object in the image; in deep learning, relevant features are automatically extracted from images in an end-to-end learning process.

 Interpretability: Machine learning models are less complex and their results are easy to interpret; deep learning models are more complex, work like a black box and their results are not easy to interpret.

 Hardware: Machine learning can work on a CPU or requires less computing power; deep learning requires a high-performance computer with a GPU.

Evolution of Neural Architectures

The journey of deep learning began with the perceptron, a single-layer neural network
introduced in the 1950s. While innovative, perceptrons could only solve linearly separable
problems hence failing at more complex tasks like the XOR problem.

This limitation led to the development of Multi-Layer Perceptrons (MLPs), which introduced hidden layers and non-linear activation functions. MLPs trained using backpropagation could model complex, non-linear relationships, marking a significant leap in neural network capabilities. This evolution from perceptrons to MLPs laid the groundwork for advanced architectures like CNNs and RNNs, showcasing the power of layered structures in solving real-world problems.

Types of neural networks

1. Feedforward neural networks (FNNs) are the simplest type of ANN, where data flows in one direction from input to output. They are used for basic tasks like classification.

2. Convolutional Neural Networks (CNNs) are specialized for processing grid-like data,
such as images. CNNs use convolutional layers to detect spatial hierarchies, making
them ideal for computer vision tasks.

3. Recurrent Neural Networks (RNNs) are able to process sequential data, such as time
series and natural language. RNNs have loops to retain information over time, enabling
applications like language modeling and speech recognition. Variants like LSTMs and
GRUs address vanishing gradient issues.

4. Generative Adversarial Networks (GANs) consist of two networks—a generator and a


discriminator—that compete to create realistic data. GANs are widely used for image
generation, style transfer and data augmentation.

5. Autoencoders are unsupervised networks that learn efficient data encodings. They
compress input data into a latent representation and reconstruct it, useful for
dimensionality reduction and anomaly detection.

6. Transformer Networks have revolutionized NLP with self-attention mechanisms. Transformers excel at tasks like translation, text generation and sentiment analysis, powering models like GPT and BERT.

Deep Learning Applications

1. Computer vision

In computer vision, deep learning models enable machines to identify and understand visual
data. Some of the main applications of deep learning in computer vision include:

 Object detection and recognition: Deep learning models are used to identify and locate objects within images and videos, enabling applications such as self-driving cars, surveillance and robotics.

 Image classification: Deep learning models can be used to classify images into
categories such as animals, plants and buildings. This is used in applications such as
medical imaging, quality control and image retrieval.

 Image segmentation: Deep learning models can be used for image segmentation into
different regions, making it possible to identify specific features within images.

2. Natural language processing (NLP)

In NLP, deep learning models enable machines to understand and generate human language.
Some of the main applications of deep learning in NLP include:
 Automatic Text Generation: Deep learning models can learn from a corpus of text, and new text like summaries and essays can be automatically generated using these trained models.

 Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.

 Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text,
making it possible to determine whether the text is positive, negative or neutral.

 Speech recognition: Deep learning models can recognize and transcribe spoken words,
making it possible to perform tasks such as speech-to-text conversion, voice search
and voice-controlled devices.

3. Reinforcement learning

In reinforcement learning, deep learning is used to train agents to take actions in an environment so as to maximize a reward. Some of the main applications of deep learning in reinforcement learning include:

 Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess and Atari.

 Robotics: Deep reinforcement learning models can be used to train robots to perform
complex tasks such as grasping objects, navigation and manipulation.

 Control systems: Deep reinforcement learning models can be used to control complex
systems such as power grids, traffic management and supply chain optimization.

Advantages of Deep Learning

1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in


various tasks such as image recognition and natural language processing.

2. Automated feature engineering: Deep Learning algorithms can automatically discover


and learn relevant features from data without the need for manual feature
engineering.

3. Scalability: Deep Learning models can scale to handle large and complex datasets and
can learn from massive amounts of data.

4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can
handle various types of data such as images, text and speech.

5. Continual improvement: Deep Learning models can continually improve their


performance as more data becomes available.

Disadvantages of Deep Learning

Deep learning has made significant advancements in various fields but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough training data can be a major concern.

2. Computational Resources: Training deep learning models is computationally expensive and requires specialized hardware like GPUs and TPUs.

3. Time-consuming: Depending on the computational resources, training, especially on sequential data, can take a very long time, sometimes days or even months.

4. Interpretability: Deep learning models are complex and work like a black box, making their results very difficult to interpret.

5. Overfitting: When a model is trained repeatedly, it can become too specialized for the training data, leading to overfitting and poor performance on new data.

As we continue to push the boundaries of computational power and dataset sizes, the
potential applications of deep learning are limitless. Deep Learning promises to reshape our
future where machines can learn, adapt and solve complex problems at a scale and speed
previously unimaginable.

Convolutional Neural Networks (CNNs) are deep learning models designed to process data
with a grid-like topology such as images. They are the foundation for most modern computer
vision applications to detect features within visual data.


Key Components of a Convolutional Neural Network

1. Convolutional Layers: These layers apply convolutional operations to input images


using filters or kernels to detect features such as edges, textures and more complex
patterns. Convolutional operations help preserve the spatial relationships between
pixels.

2. Pooling Layers: They downsample the spatial dimensions of the input, reducing the
computational complexity and the number of parameters in the network. Max pooling
is a common pooling operation where we select a maximum value from a group of
neighboring pixels.

3. Activation Functions: They introduce non-linearity to the model by allowing it to learn


more complex relationships in the data.

4. Fully Connected Layers: These layers are responsible for making predictions based on
the high-level features learned by the previous layers. They connect every neuron in
one layer to every neuron in the next layer.

How CNNs Work?

1. Input Image: CNN receives an input image which is preprocessed to ensure uniformity
in size and format.
2. Convolutional Layers: Filters are applied to the input image to extract features like
edges, textures and shapes.

3. Pooling Layers: The feature maps generated by the convolutional layers are
downsampled to reduce dimensionality.

4. Fully Connected Layers: The downsampled feature maps are passed through fully
connected layers to produce the final output, such as a classification label.

5. Output: The CNN outputs a prediction, such as the class of the image.
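The pipeline above maps almost directly onto code. Below is a minimal sketch of such a network in Keras (the same framework used for the RNN example later in this document); the 32x32 RGB input shape, the layer sizes and the 10-class output are illustrative assumptions, not fixed requirements.

import tensorflow as tf
from tensorflow.keras import layers, models

# A small CNN following the input -> convolution -> pooling -> dense pipeline.
# The input shape (32, 32, 3) and the 10 output classes are assumptions.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # 1. preprocessed input image
    layers.Conv2D(32, (3, 3), activation='relu'),  # 2. feature extraction
    layers.MaxPooling2D((2, 2)),                   # 3. downsampling
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),           # 4. fully connected layer
    layers.Dense(10, activation='softmax')         # 5. class probabilities
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()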

Working of CNN Models

How to Train a Convolutional Neural Network?

CNNs are trained using a supervised learning approach. This means that the CNN is given a set
of labeled training images. The CNN learns to map the input images to their correct labels.

The training process for a CNN involves the following steps:

1. Data Preparation: The training images are preprocessed to ensure that they are all in
the same format and size.

2. Loss Function: A loss function is used to measure how well the CNN is performing on
the training data. The loss function is typically calculated by taking the difference
between the predicted labels and the actual labels of the training images.

3. Optimizer: An optimizer is used to update the weights of the CNN in order to minimize
the loss function.
4. Backpropagation: Backpropagation is a technique used to calculate the gradients of
the loss function with respect to the weights of the CNN. The gradients are then used
to update the weights of the CNN using the optimizer.

How to Evaluate CNN Models

The performance of a CNN can be evaluated using a variety of criteria. Among the most popular metrics are:

 Accuracy: Accuracy is the percentage of test images that the CNN correctly classifies.

 Precision: Precision is the percentage of test images that the CNN predicts as a
particular class and that are actually of that class.

 Recall: Recall is the percentage of test images that are of a particular class and that the
CNN predicts as that class.

 F1 Score: The F1 Score is a harmonic mean of precision and recall. It is a good metric
for evaluating the performance of a CNN on classes that are imbalanced.
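As a concrete illustration, these metrics can be computed with scikit-learn once predictions are available; the y_true and y_pred arrays below are hypothetical labels, not the output of any model from this article.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and CNN predictions for a 3-class problem.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

print("Accuracy: ", accuracy_score(y_true, y_pred))
# 'macro' averaging treats every class equally, which suits imbalanced data.
print("Precision:", precision_score(y_true, y_pred, average='macro'))
print("Recall:   ", recall_score(y_true, y_pred, average='macro'))
print("F1 Score: ", f1_score(y_true, y_pred, average='macro'))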

Case Study of CNN for Diabetic retinopathy

Diabetic retinopathy is a severe eye condition caused by damage to the retina's blood vessels
due to prolonged diabetes. It is a leading cause of blindness among adults aged 20 to 64. CNNs
have been used successfully to detect diabetic retinopathy by analyzing retinal images. By training on labeled datasets of healthy and affected retina images, CNNs can accurately identify signs of the disease, helping in early diagnosis and treatment.

Different Types of CNN Models

1. LeNet

LeNet, developed by Yann LeCun and his colleagues in the late 1990s, was one of the first successful CNNs designed for handwritten digit recognition. It laid the foundation for modern CNNs and achieved high accuracy on the MNIST dataset, which contains 70,000 images of handwritten digits (0-9).

2. AlexNet

AlexNet is a CNN architecture that was developed by Alex Krizhevsky, Ilya Sutskever and
Geoffrey Hinton in 2012. It was the first CNN to win the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) a major image recognition competition. It consists of several
layers of convolutional and pooling layers followed by fully connected layers. The architecture
includes five convolutional layers, three pooling layers and three fully connected layers.

3. ResNet

ResNets (Residual Networks) are designed for image recognition and processing tasks. They are renowned for their ability to train very deep networks without degradation in performance, making them highly effective for complex tasks. They introduce skip connections that allow the network to learn residual functions, making deep architectures easier to train.

4. GoogLeNet
GoogLeNet, also known as InceptionNet, is renowned for achieving high accuracy in image classification while using fewer parameters and computational resources than other state-of-the-art CNNs. Its core component, the Inception module, allows the network to learn features at different scales simultaneously, enhancing performance.

5. VGG

VGG was developed by the Visual Geometry Group at Oxford. It uses small 3x3 convolutional filters stacked in multiple layers, creating a deep and uniform structure. Popular variants like VGG-16 and VGG-19 achieved state-of-the-art performance on the ImageNet dataset, demonstrating the power of depth in CNNs.

Applications of CNN

 Image classification: CNNs are the state-of-the-art models for image


classification. They can be used to classify images into different categories such as cats
and dogs.

 Object detection: CNNs can be used to detect objects in images such as people, cars and buildings. They can also be used to localize objects, meaning they can identify the location of an object in an image.

 Image segmentation: CNNs can be used to segment images, meaning they can identify and label different objects in an image. This is useful for applications such as medical imaging and robotics.

 Video analysis: CNNs can be used to analyze videos, for example tracking objects in a video or detecting events in a video. This is useful for applications such as video surveillance and traffic monitoring.

Advantages of CNN

 High Accuracy: They can achieve high accuracy in various image recognition tasks.

 Efficiency: They are efficient, especially when implemented on GPUs.

 Robustness: They are robust to noise and variations in input data.

 Adaptability: They can be adapted to different tasks by modifying their architecture.

Disadvantages of CNN

 Complexity: They can be complex and difficult to train, especially for large datasets.

 Resource-Intensive: They require significant computational resources for training and deployment.

 Data Requirements: They need large amounts of labeled data for training.

 Interpretability: They can be difficult to interpret, making it challenging to understand their predictions.

Recurrent Neural Networks (RNNs) differ from regular neural networks in how they process information. While standard neural networks pass information in one direction, i.e. from input to output, RNNs feed information back into the network at each step.


Let's understand RNNs with an example:

Imagine reading a sentence and trying to predict the next word: you don't rely only on the current word but also remember the words that came before. RNNs work similarly, "remembering" past information and passing the output from one step as input to the next, i.e. they consider all the earlier words to choose the most likely next word. This memory of previous steps helps the network understand context and make better predictions.

Key Components of RNNs

There are mainly two components of RNNs that we will discuss.

1. Recurrent Neurons

The fundamental processing unit in an RNN is the Recurrent Unit. Recurrent units hold a hidden state that maintains information about previous inputs in a sequence. They can "remember" information from prior steps by feeding back their hidden state, allowing them to capture dependencies across time.

Recurrent Neuron

2. RNN Unfolding

RNN unfolding, or unrolling, is the process of expanding the recurrent structure over time steps. During unfolding, each step of the sequence is represented as a separate layer in a series, illustrating how information flows across each time step.

This unrolling enables backpropagation through time (BPTT), a learning process where errors are propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn dependencies within sequential data.
RNN Unfolding

Recurrent Neural Network Architecture

RNNs share similarities in input and output structures with other deep learning architectures but differ significantly in how information flows from input to output. Unlike traditional deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across time steps, allowing them to remember information over sequences.

In RNNs the hidden state h_i is calculated for every input x_i to retain sequential dependencies. The computations follow these core formulas:

1. Hidden State Calculation:

h = σ(U·X + W·h_{t−1} + B)

Here:

 h represents the current hidden state.

 U and W are weight matrices.

 B is the bias.

2. Output Calculation:

Y = O(V·h + C)

The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the weight and bias.

3. Overall Function:

Y = f(X, h, W, U, V, B, C)

This function defines the entire RNN operation, where the state matrix S holds each element s_i representing the network's state at each time step i.
Recurrent Neural Architecture

How does RNN work?

At each time step RNNs process units with a fixed activation function. These units have an
internal hidden state that acts as memory that retains information from previous time steps.
This memory allows the network to store past knowledge and adapt based on new inputs.

Updating the Hidden State in RNNs

The current hidden state h_t depends on the previous state h_{t−1} and the current input x_t, and is calculated using the following relations:

1. State Update:

h_t = f(h_{t−1}, x_t)

where:

 h_t is the current state

 h_{t−1} is the previous state

 x_t is the input at the current time step

2. Activation Function Application:

h_t = tanh(W_hh·h_{t−1} + W_xh·x_t)

Here, W_hh is the weight matrix for the recurrent neuron and W_xh is the weight matrix for the input neuron.

3. Output Calculation:

y_t = W_hy·h_t

where y_t is the output and W_hy is the weight at the output layer.
These parameters are updated using backpropagation. However, since RNNs work on sequential data, we use an updated version of backpropagation known as backpropagation through time.
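A minimal NumPy sketch of these update equations is shown below; the sizes (3 input features, 4 hidden units, 2 outputs) and the random weights are arbitrary assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2      # illustrative sizes

W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
W_hy = rng.normal(size=(output_size, hidden_size))  # hidden-to-output weights

h = np.zeros(hidden_size)                           # initial hidden state h_0
sequence = [rng.normal(size=input_size) for _ in range(5)]

for x_t in sequence:
    h = np.tanh(W_hh @ h + W_xh @ x_t)  # state update: h_t = tanh(W_hh·h_{t−1} + W_xh·x_t)
    y_t = W_hy @ h                      # output: y_t = W_hy·h_t

print("final hidden state:", h)
print("last output:", y_t)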

Backpropagation Through Time (BPTT) in RNNs

Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to update the network's parameters. The loss function L(θ) depends on the final hidden state h_3, and each hidden state relies on the preceding ones, forming a sequential dependency chain:

h_3 depends on h_2, h_2 depends on h_1, and h_1 depends on h_0.

Backpropagation Through Time (BPTT) In RNN

In BPTT, gradients are backpropagated through each time step. This is essential for updating
network parameters based on temporal dependencies.

1. Simplified Gradient Calculation:

∂L(θ)/∂W = (∂L(θ)/∂h_3) · (∂h_3/∂W)

2. Handling Dependencies in Layers: Each hidden state is updated based on its dependencies:

h_3 = σ(W·h_2 + b)
The gradient is then calculated for each state, considering dependencies from previous hidden
states.

3. Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into
explicit and implicit parts summing up the indirect paths from each hidden state to the
weights.

∂h_3/∂W = ∂⁺h_3/∂W + (∂h_3/∂h_2) · (∂⁺h_2/∂W), where ∂⁺ denotes the explicit (direct) part of the derivative.

4. Final Gradient Expression: The final derivative of the loss function with respect to the
weight matrix W is computed:

∂L(θ)/∂W = (∂L(θ)/∂h_3) · Σ_{k=1}^{3} (∂h_3/∂h_k) · (∂h_k/∂W)

This iterative process is the essence of backpropagation through time.

Types Of Recurrent Neural Networks

There are four types of RNNs based on the number of inputs and outputs in the network:

1. One-to-One RNN

This is the simplest type of neural network architecture where there is a single input and a
single output. It is used for straightforward classification tasks such as binary classification
where no sequential data is involved.

One to One RNN

2. One-to-Many RNN

In a One-to-Many RNN the network processes a single input to produce multiple outputs over
time. This is useful in tasks where one input triggers a sequence of predictions (outputs). For
example in image captioning a single image can be used as input to generate a sequence of
words as a caption.
One to Many RNN

3. Many-to-One RNN

The Many-to-One RNN receives a sequence of inputs and generates a single output. This type
is useful when the overall context of the input sequence is needed to make one prediction. In
sentiment analysis the model receives a sequence of words (like a sentence) and produces a
single output like positive, negative or neutral.

Many to One RNN

4. Many-to-Many RNN

The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of
outputs. In language translation task a sequence of words in one language is given as input and
a corresponding sequence in another language is generated as output.
Many to Many RNN
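These input/output patterns map onto Keras roughly as follows; this is a sketch under assumed shapes (10 time steps, 8 features per step), where the return_sequences flag controls whether the recurrent layer emits a single output or one output per time step.

import tensorflow as tf
from tensorflow.keras import layers

# Many-to-one: read a whole sequence, emit a single prediction (e.g. sentiment).
many_to_one = tf.keras.Sequential([
    layers.Input(shape=(10, 8)),                  # assumed: 10 steps, 8 features
    layers.SimpleRNN(16),                         # returns only the final state
    layers.Dense(1, activation='sigmoid')
])

# Many-to-many: emit one output per time step (e.g. sequence labeling).
many_to_many = tf.keras.Sequential([
    layers.Input(shape=(10, 8)),
    layers.SimpleRNN(16, return_sequences=True),  # returns all hidden states
    layers.Dense(5, activation='softmax')
])

print(many_to_one.output_shape, many_to_many.output_shape)  # (None, 1) vs (None, 10, 5)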

Variants of Recurrent Neural Networks (RNNs)

There are several variations of RNNs, each designed to address specific challenges or optimize
for certain tasks:

1. Vanilla RNN

This simplest form of RNN consists of a single hidden layer where weights are shared across
time steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by
the vanishing gradient problem, which hampers long-sequence learning.

2. Bidirectional RNNs

Bidirectional RNNs process inputs in both forward and backward directions, capturing both
past and future context for each time step. This architecture is ideal for tasks where the entire
sequence is available, such as named entity recognition and question answering.

3. Long Short-Term Memory Networks (LSTMs)

Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome the
vanishing gradient problem. Each LSTM cell has three gates:

 Input Gate: Controls how much new information should be added to the cell state.

 Forget Gate: Decides what past information should be discarded.

 Output Gate: Regulates what information should be output at the current step. This
selective memory enables LSTMs to handle long-term dependencies, making them
ideal for tasks where earlier context is critical.

4. Gated Recurrent Units (GRUs)

Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into a
single update gate and streamlining the output mechanism. This design is computationally
efficient, often performing similarly to LSTMs and is useful in tasks where simplicity and faster
training are beneficial.
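In most frameworks the two variants are drop-in replacements for each other; the Keras sketch below (with assumed sequence shape and layer sizes) builds the same architecture with an LSTM and with a GRU so their parameter counts can be compared.

import tensorflow as tf
from tensorflow.keras import layers

def make_model(cell):
    # Identical architecture; only the recurrent cell type differs.
    return tf.keras.Sequential([
        layers.Input(shape=(20, 8)),   # assumed: 20 time steps, 8 features
        cell(32),                      # LSTM or GRU with 32 units
        layers.Dense(1)
    ])

lstm_model = make_model(layers.LSTM)
gru_model = make_model(layers.GRU)

# GRUs carry fewer parameters per unit than LSTMs, hence faster training.
print(lstm_model.count_params(), gru_model.count_params())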

How RNN Differs from Feedforward Neural Networks?


Feedforward Neural Networks (FNNs) process data in one direction from input to output
without retaining information from previous inputs. This makes them suitable for tasks with
independent inputs like image classification. However FNNs struggle with sequential data since
they lack memory.

Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information from previous steps to be fed back into the network. This feedback enables RNNs to remember prior inputs, making them ideal for tasks where context is important.

Recurrent Vs Feedforward networks

Implementing a Text Generator Using Recurrent Neural Networks (RNNs)

In this section, we create a character-based text generator using a Recurrent Neural Network (RNN) in TensorFlow and Keras. We'll implement an RNN that learns patterns from a text sequence to generate new text character-by-character.

1. Importing Necessary Libraries

We start by importing essential libraries for data handling and building the neural network.

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

2. Defining the Input Text and Preparing the Character Set

We define the input text and identify unique characters in the text which we’ll encode for our
model.

text = "This is GeeksforGeeks a software training institute"

chars = sorted(list(set(text)))

char_to_index = {char: i for i, char in enumerate(chars)}


index_to_char = {i: char for i, char in enumerate(chars)}

3. Creating Sequences and Labels

To train the RNN, we need sequences of fixed length (seq_length) and the character following
each sequence as the label.

seq_length = 3
sequences = []
labels = []

for i in range(len(text) - seq_length):
    seq = text[i:i + seq_length]
    label = text[i + seq_length]
    sequences.append([char_to_index[char] for char in seq])
    labels.append(char_to_index[label])

X = np.array(sequences)
y = np.array(labels)

4. Converting Sequences and Labels to One-Hot Encoding

For training we convert X and y into one-hot encoded tensors.

X_one_hot = tf.one_hot(X, len(chars))
y_one_hot = tf.one_hot(y, len(chars))

5. Building the RNN Model

We create a simple RNN model with a hidden layer of 50 units and a Dense output layer
with softmax activation.

model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))
model.add(Dense(len(chars), activation='softmax'))

6. Compiling and Training the Model

We compile the model using the categorical_crossentropy loss and train it for 100 epochs.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_one_hot, y_one_hot, epochs=100)


Output:

Training the RNN model

7. Generating New Text Using the Trained Model

After training we use a starting sequence to generate new text character by character.

start_seq = "This is G"

generated_text = start_seq

for i in range(50):

x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])

x_one_hot = tf.one_hot(x, len(chars))

prediction = model.predict(x_one_hot)

next_index = np.argmax(prediction)

next_char = index_to_char[next_index]

generated_text += next_char

print("Generated Text:")

print(generated_text)

Output:
Predicting the next word

Advantages of Recurrent Neural Networks

 Sequential Memory: RNNs retain information from previous inputs making them ideal
for time-series predictions where past data is crucial.

 Enhanced Pixel Neighborhoods: RNNs can be combined with convolutional layers to capture extended pixel neighborhoods, improving performance in image and video data processing.

Limitations of Recurrent Neural Networks (RNNs)

While RNNs excel at handling sequential data, they face two main training challenges: the vanishing gradient and the exploding gradient problem.

1. Vanishing Gradient: During backpropagation, gradients diminish as they pass through each time step, leading to minimal weight updates. This limits the RNN's ability to learn long-term dependencies, which is crucial for tasks like language translation.

2. Exploding Gradient: Sometimes gradients grow uncontrollably, causing excessively large weight updates that destabilize training.

These challenges can hinder the performance of standard RNNs on complex, long-sequence
tasks.

Applications of Recurrent Neural Networks

RNNs are used in various applications where data is sequential or time-based:

 Time-Series Prediction: RNNs excel in forecasting tasks, such as stock market


predictions and weather forecasting.

 Natural Language Processing (NLP): RNNs are fundamental in NLP tasks like language
modeling, sentiment analysis and machine translation.

 Speech Recognition: RNNs capture temporal patterns in speech data, aiding in speech-
to-text and other audio-related applications.

 Image and Video Processing: When combined with convolutional layers, RNNs help
analyze video sequences, facial expressions and gesture recognition.

What is LSTM - Long Short Term Memory?



Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network
(RNN) designed by Hochreiter and Schmidhuber. LSTMs can capture long-term dependencies in
sequential data making them ideal for tasks like language translation, speech recognition and
time series forecasting.

Unlike traditional RNNs, which use a single hidden state passed through time, LSTMs introduce a memory cell that holds information over extended periods, addressing the challenge of learning long-term dependencies.


Problem with Long-Term Dependencies in RNN

Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a
hidden state that captures information from previous time steps. However, they often face challenges in learning long-term dependencies, where information from distant time steps becomes crucial for making accurate predictions at the current step. This problem is known as the vanishing gradient or exploding gradient problem.

 Vanishing Gradient: When training a model over time, the gradients which help the
model learn can shrink as they pass through many steps. This makes it hard for the
model to learn long-term patterns since earlier information becomes almost irrelevant.

 Exploding Gradient: Sometimes gradients can grow too large causing instability. This
makes it difficult for the model to learn properly as the updates to the model become
erratic and unpredictable.

Both of these issues make it challenging for standard RNNs to effectively capture long-term
dependencies in sequential data.

LSTM Architecture

The LSTM architecture revolves around a memory cell which is controlled by three gates:

1. Input gate: Controls what information is added to the memory cell.

2. Forget gate: Determines what information is removed from the memory cell.

3. Output gate: Controls what information is output from the memory cell.

This allows LSTM networks to selectively retain or discard information as it flows through the
network which allows them to learn long-term dependencies. The network has a hidden state
which is like its short-term memory. This memory is updated using the current input, the
previous hidden state and the current state of the memory cell.

Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different
memory blocks called cells.

LSTM Model

Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates -

1. Forget Gate

The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (input at the particular time) and h_{t−1} (previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function which gives an output between 0 and 1. If for a particular cell state the output is 0, the piece of information is forgotten, and for an output of 1, the information is retained for future use.

The equation for the forget gate is:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

Where:

 W_f represents the weight matrix associated with the forget gate.

 [h_{t−1}, x_t] denotes the concatenation of the current input and the previous hidden state.

 b_f is the bias associated with the forget gate.

 σ is the sigmoid activation function.


Forget Gate

2. Input gate

The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs h_{t−1} and x_t. Then, a vector is created using the tanh function, which gives an output from -1 to +1 containing all the possible values from h_{t−1} and x_t. Finally, the values of the vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

Ĉ_t = tanh(W_c · [h_{t−1}, x_t] + b_c)

We multiply the previous cell state by f_t, effectively filtering out the information we had decided to ignore earlier. Then we add i_t ⊙ Ĉ_t, which represents the new candidate values scaled by how much we decided to update each state value:

C_t = f_t ⊙ C_{t−1} + i_t ⊙ Ĉ_t

where

 ⊙ denotes element-wise multiplication

 tanh is the activation function


Input Gate

3. Output gate

The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function on the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs h_{t−1} and x_t. Finally, the values of the vector and the regulated values are multiplied and sent as the output and as input to the next cell. The equation for the output gate is:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
Output Gate
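Putting the three gates together, one step of an LSTM cell can be sketched in NumPy as below. The sizes and random weights are illustrative assumptions, and the final line h_t = o_t ⊙ tanh(C_t) is the standard way the output gate produces the new hidden state, consistent with the gate equations above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4          # illustrative sizes
concat = hidden_size + input_size       # size of [h_{t−1}, x_t]

# One weight matrix and bias per gate, plus one pair for the candidate state.
W_f, b_f = rng.normal(size=(hidden_size, concat)), np.zeros(hidden_size)
W_i, b_i = rng.normal(size=(hidden_size, concat)), np.zeros(hidden_size)
W_c, b_c = rng.normal(size=(hidden_size, concat)), np.zeros(hidden_size)
W_o, b_o = rng.normal(size=(hidden_size, concat)), np.zeros(hidden_size)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])   # [h_{t−1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    C_hat = np.tanh(W_c @ z + b_c)      # candidate values
    C_t = f_t * C_prev + i_t * C_hat    # new cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)            # new hidden state (standard formulation)
    return h_t, C_t

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
h, C = lstm_step(rng.normal(size=input_size), h, C)
print("h_t:", h)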

Applications of LSTM

Some of the well-known applications of LSTMs include:

 Language Modeling: Used in tasks like language modeling, machine translation and
text summarization. These networks learn the dependencies between words in a
sentence to generate coherent and grammatically correct sentences.

 Speech Recognition: Used in transcribing speech to text and recognizing spoken


commands. By learning speech patterns they can match spoken words to
corresponding text.

 Time Series Forecasting: Used for predicting stock prices, weather and energy
consumption. They learn patterns in time series data to predict future events.

 Anomaly Detection: Used for detecting fraud or network intrusions. These networks
can identify patterns in data that deviate drastically and flag them as potential
anomalies.

 Recommender Systems: In recommendation tasks like suggesting movies, music and


books. They learn user behavior patterns to provide personalized suggestions.

 Video Analysis: Applied in tasks such as object detection, activity recognition and
action classification. When combined with Convolutional Neural Networks (CNNs) they
help analyze video data and extract useful information.

Generative Adversarial Networks (GANs) help machines create new, realistic data by learning from existing examples. They were introduced by Ian Goodfellow and his team in 2014 and have transformed how computers generate images, videos, music and more. Unlike traditional models that only recognize or classify data, GANs take a creative approach by generating entirely new content that closely resembles real-world data. This ability has helped various fields such as art, gaming, healthcare and data science. In this article, we will see more about GANs and their core concepts.


Architecture of GANs

GANs consist of two main models that work together to create realistic synthetic data which
are as follows:

1. Generator Model

The generator is a deep neural network that takes random noise as input to generate realistic
data samples like images or text. It learns the underlying data patterns by adjusting its internal
parameters during training through backpropagation. Its objective is to produce samples that
the discriminator classifies as real.

Generator Loss Function: The generator tries to minimize this loss:

J_G = −(1/m) Σ_{i=1}^{m} log D(G(z_i))

where

 J_G measures how well the generator is fooling the discriminator.

 G(z_i) is the generated sample from random noise z_i.

 D(G(z_i)) is the discriminator's estimated probability that the generated sample is real.

The generator aims to maximize D(G(z_i)), meaning it wants the discriminator to classify its fake data as real (probability close to 1).

2. Discriminator Model

The discriminator acts as a binary classifier, distinguishing between real and generated data. It learns to improve its classification ability through training, refining its parameters to detect fake samples more accurately. When dealing with image data, the discriminator uses convolutional layers or other relevant architectures which help extract features and enhance the model's ability.

Discriminator Loss Function: The discriminator tries to minimize this loss:

J_D = −(1/m) Σ_{i=1}^{m} log D(x_i) − (1/m) Σ_{i=1}^{m} log(1 − D(G(z_i)))

 J_D measures how well the discriminator classifies real and fake samples.

 x_i is a real data sample.

 G(z_i) is a fake sample from the generator.

 D(x_i) is the discriminator's probability that x_i is real.

 D(G(z_i)) is the discriminator's probability that the fake sample is real.

The discriminator wants to correctly classify real data as real (maximizing log D(x_i)) and fake data as fake (maximizing log(1 − D(G(z_i)))).

MinMax Loss

GANs are trained using a MinMax Loss between the generator and discriminator:

min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

where,

 G is the generator network and D is the discriminator network.

 p_data(x) is the true data distribution.

 p_z(z) is the distribution of the random noise (usually normal or uniform).

 D(x) is the discriminator's estimate that real data is real.

 D(G(z)) is the discriminator's estimate that generated data is real.

The generator tries to minimize this loss (to fool the discriminator) and the discriminator tries
to maximize it (to detect fakes accurately).

How does a GAN work?

GANs train by having two networks, the Generator (G) and the Discriminator (D), compete and improve together. Here's the step-by-step process:

1. Generator's First Move


The generator starts with a random noise vector like random numbers. It uses this noise as a
starting point to create a fake data sample such as a generated image. The generator’s internal
layers transform this noise into something that looks like real data.

2. Discriminator's Turn

The discriminator receives two types of data:

 Real samples from the actual training dataset.

 Fake samples created by the generator.

D's job is to analyze each input and determine whether it's real data or something G created. It outputs a probability score between 0 and 1, where a score of 1 indicates the data is likely real and 0 suggests it's fake.

3. Adversarial Learning

 If the discriminator correctly classifies real and fake data it gets better at its job.

 If the generator fools the discriminator by creating realistic fake data, it receives a
positive update and the discriminator is penalized for making a wrong decision.

4. Generator's Improvement

 Each time the discriminator mistakes fake data for real, the generator learns from this
success.

 Through many iterations, the generator improves and creates more convincing fake
samples.

5. Discriminator's Adaptation

 The discriminator also learns continuously by updating itself to better spot fake data.

 This constant back-and-forth makes both networks stronger over time.

6. Training Progression

 As training continues, the generator becomes highly proficient at producing realistic


data.

 Eventually the discriminator struggles to distinguish real from fake, showing that the GAN has reached a well-trained state.

 At this point, the generator can produce high-quality synthetic data that can be used
for different applications.

Types of GANs

There are several types of GANs each designed for different purposes. Here are some
important types:

1. Vanilla GAN

Vanilla GAN is the simplest type of GAN. It consists of:

 A generator and a discriminator, both built using multi-layer perceptrons (MLPs).

 A model that optimizes its mathematical formulation using stochastic gradient descent (SGD).

While foundational, Vanilla GANs can face problems like:

 Mode collapse: The generator produces limited types of outputs repeatedly.

 Unstable training: The generator and discriminator may not improve smoothly.

2. Conditional GAN (CGAN)

Conditional GANs (CGANs) add an additional conditional parameter to guide the generation process. Instead of generating data randomly, they allow the model to produce specific types of outputs.

Working of CGANs:

 A conditional variable (y) is fed into both the generator and the discriminator.

 This ensures that the generator creates data corresponding to the given condition (e.g
generating images of specific objects).

 The discriminator also receives the labels to help distinguish between real and fake
data.

Example: Instead of generating any random image, CGAN can generate a specific object like a
dog or a cat based on the label.

3. Deep Convolutional GAN (DCGAN)

Deep Convolutional GANs (DCGANs) are among the most popular types of GANs used for
image generation.

They are important because they:

 Use Convolutional Neural Networks (CNNs) instead of simple multi-layer perceptrons (MLPs).

 Replace max pooling layers with strided convolutions, making the model more efficient.

 Remove fully connected layers, which allows for better spatial understanding of images.

DCGANs are successful because they generate high-quality, realistic images.

4. Laplacian Pyramid GAN (LAPGAN)

Laplacian Pyramid GAN (LAPGAN) is designed to generate ultra-high-quality images by using a


multi-resolution approach.

Working of LAPGAN:

 Uses multiple generator-discriminator pairs at different levels of the Laplacian pyramid.

 Images are first downsampled at each layer of the pyramid and then upscaled again using Conditional GANs (CGANs).

 This process allows the image to gradually refine details, helping to reduce noise and improve clarity.

Due to its ability to generate highly detailed images, LAPGAN is considered a superior approach
for photorealistic image generation.

5. Super Resolution GAN (SRGAN)

Super-Resolution GAN (SRGAN) is designed to increase the resolution of low-quality images


while preserving details.

Working of SRGAN:

 Uses a deep neural network combined with an adversarial loss function.

 Enhances low-resolution images by adding finer details, making them appear sharper and more realistic.

 Helps reduce common image upscaling errors such as blurriness and pixelation.

Implementation of Generative Adversarial Network (GAN) using PyTorch

Generative Adversarial Networks (GANs) can generate realistic images by learning from existing
image datasets. Here we will be implementing a GAN trained on the CIFAR-10 dataset using
PyTorch.

Step 1: Importing Required Libraries

We will be using Pytorch, Torchvision, Matplotlib and Numpy libraries for this. Set the device to
GPU if available otherwise use CPU.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Step 2: Defining Image Transformations

We use PyTorch’s transforms to convert images to tensors and normalize pixel values between
-1 and 1 for better training stability.

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

Step 3: Loading the CIFAR-10 Dataset

Download and load the CIFAR-10 dataset with defined transformations. Use a DataLoader to
process the dataset in mini-batches of size 32 and shuffle the data.

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)


Step 4: Defining GAN Hyperparameters

Set important training parameters:

 latent_dim: Dimensionality of the noise vector.

 lr: Learning rate of the optimizer.

 beta1, beta2: Beta parameters for the Adam optimizer (e.g. 0.5, 0.999).

 num_epochs: Number of times the entire dataset will be processed (e.g. 10).

latent_dim = 100
lr = 0.0002
beta1 = 0.5
beta2 = 0.999
num_epochs = 10

Step 5: Building the Generator

Create a neural network that converts random noise into images. Use transpose convolutional
layers, batch normalization and ReLU activations. The final layer uses Tanh activation to scale
outputs to the range [-1, 1].

 nn.Linear(latent_dim, 128 * 8 * 8): Defines a fully connected layer that projects the
noise vector into a higher dimensional feature space.

 nn.Upsample(scale_factor=2): Doubles the spatial resolution of the feature maps by


upsampling.

 nn.Conv2d(128, 128, kernel_size=3, padding=1): Applies a convolutional layer keeping


the number of channels the same to refine features.
class Generator(nn.Module):

    def __init__(self, latent_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),   # project noise to 8x8 feature maps
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.Upsample(scale_factor=2),          # 8x8 -> 16x16
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128, momentum=0.78),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),          # 16x16 -> 32x32
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64, momentum=0.78),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Tanh()                             # scale outputs to [-1, 1]
        )

    def forward(self, z):
        img = self.model(z)
        return img

Step 6: Building the Discriminator

Create a binary classifier network that distinguishes real from fake images. Use convolutional
layers, batch normalization, dropout, LeakyReLU activation and a Sigmoid output layer to give a
probability between 0 and 1.

 nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1): Second convolutional layer


increasing channels to 64, downsampling further.

 nn.BatchNorm2d(256, momentum=0.8): Batch normalization for 256 feature maps


with momentum 0.8.

class Discriminator(nn.Module):

    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 16x16 -> 8x8
            nn.ZeroPad2d((0, 1, 0, 1)),                              # 8x8 -> 9x9
            nn.BatchNorm2d(64, momentum=0.82),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 9x9 -> 5x5
            nn.BatchNorm2d(128, momentum=0.82),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256, momentum=0.8),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(256 * 5 * 5, 1),   # flatten 5x5 feature maps into one logit
            nn.Sigmoid()                 # probability that the image is real
        )

    def forward(self, img):
        validity = self.model(img)
        return validity

Step 7: Initializing GAN Components

 Generator and Discriminator are initialized on the available device (GPU or CPU).
 Binary Cross-Entropy (BCE) Loss is chosen as the loss function.

 Adam optimizers are defined separately for the generator and discriminator with
specified learning rates and betas.

generator = Generator(latent_dim).to(device)

discriminator = Discriminator().to(device)

adversarial_loss = nn.BCELoss()

optimizer_G = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))

optimizer_D = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))

Step 8: Training the GAN

Train for the set number of epochs:

1. For each batch train the discriminator on real images and fake images generated by the
generator.

2. Then train the generator to fool the discriminator.

3. Calculate and backpropagate the respective losses.

4. Print loss values every 100 batches for progress tracking.

5. Every 10 epochs, generate and display sample images created by the generator for visual inspection.

 valid = torch.ones(real_images.size(0), 1, device=device): Create a tensor of ones


representing real labels for the discriminator.

 fake = torch.zeros(real_images.size(0), 1, device=device): Create a tensor of zeros


representing fake labels for the discriminator.

 z = torch.randn(real_images.size(0), latent_dim, device=device): Generate random


noise vectors as input for the generator.

 g_loss = adversarial_loss(discriminator(gen_images), valid): Calculate generator loss


based on the discriminator classifying fake images as real.

 grid = torchvision.utils.make_grid(generated, nrow=4, normalize=True): Arrange


generated images into a grid for display, normalizing pixel values.

for epoch in range(num_epochs):
    for i, batch in enumerate(dataloader):
        real_images = batch[0].to(device)

        valid = torch.ones(real_images.size(0), 1, device=device)
        fake = torch.zeros(real_images.size(0), 1, device=device)

        # Train the discriminator on real and fake images.
        optimizer_D.zero_grad()
        z = torch.randn(real_images.size(0), latent_dim, device=device)
        fake_images = generator(z)

        real_loss = adversarial_loss(discriminator(real_images), valid)
        fake_loss = adversarial_loss(discriminator(fake_images.detach()), fake)
        d_loss = (real_loss + fake_loss) / 2

        d_loss.backward()
        optimizer_D.step()

        # Train the generator to fool the discriminator.
        optimizer_G.zero_grad()
        gen_images = generator(z)
        g_loss = adversarial_loss(discriminator(gen_images), valid)
        g_loss.backward()
        optimizer_G.step()

        if (i + 1) % 100 == 0:
            print(
                f"Epoch [{epoch+1}/{num_epochs}] Batch {i+1}/{len(dataloader)} "
                f"Discriminator Loss: {d_loss.item():.4f} "
                f"Generator Loss: {g_loss.item():.4f}"
            )

    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            z = torch.randn(16, latent_dim, device=device)
            generated = generator(z).detach().cpu()
            grid = torchvision.utils.make_grid(generated, nrow=4, normalize=True)
            plt.imshow(np.transpose(grid, (1, 2, 0)))
            plt.axis("off")
            plt.show()

Output:
Training Output

By following these steps we successfully implemented and trained a GAN that learns to
generate realistic CIFAR-10 images through adversarial training.

Application Of Generative Adversarial Networks (GANs)

1. Image Synthesis & Generation: GANs generate realistic images, avatars and high-
resolution visuals by learning patterns from training data. They are used in art, gaming
and AI-driven design.

2. Image-to-Image Translation: They can transform images between domains while


preserving key features. Examples include converting day images to night, sketches to
realistic images or changing artistic styles.

3. Text-to-Image Synthesis: They create visuals from textual descriptions, helping applications in AI-generated art, automated design and content creation.

4. Data Augmentation: They generate synthetic data to improve machine learning models, helping to make them more robust and generalizable in fields with limited labeled data.

5. High-Resolution Image Enhancement: They upscale low-resolution images, improving clarity for applications like medical imaging, satellite imagery and video enhancement.
Advantages of GAN

Let's look at the various advantages of GANs:

1. Synthetic Data Generation: GANs produce new, synthetic data resembling real data
distributions which is useful for augmentation, anomaly detection and creative tasks.

2. High-Quality Results: They can generate photorealistic images, videos, music and
other media with high quality.

3. Unsupervised Learning: They don't require labeled data, which makes them effective in scenarios where labeling is expensive or difficult.

4. Versatility: They can be applied across many tasks including image synthesis, text-to-
image generation, style transfer, anomaly detection and more.

GANs are evolving and shaping the future of artificial intelligence. As the technology improves,
we can expect even more innovative applications that will change how we create, work and
interact with digital content.

In Artificial Neural Networks (ANNs), data flows from the input layer to the output layer
through one or more hidden layers. Each layer consists of neurons that receive input, process
it, and pass the output to the next layer. The layers work together to extract features,
transform data, and make predictions.

An Artificial Neural Network (ANN) consists of three primary types of layers:

 Input Layer

 Hidden Layers

 Output Layer

Each layer is composed of nodes (neurons) that are interconnected. The layers work together
to process data through a series of transformations.
ANN Layers

Basic Layers in ANN

1. Input Layer

Input layer is the first layer in an ANN and is responsible for receiving the raw input data. This
layer's neurons correspond to the features in the input data. For example, in image processing,
each neuron might represent a pixel value. The input layer doesn't perform any computations
but passes the data to the next layer.

Key Points:

 Role: Receives raw data.

 Function: Passes data to the hidden layers.

 Example: For an image, the input layer would have neurons for each pixel value.
Input Layer in ANN

2. Hidden Layers

Hidden Layers are the intermediate layers between the input and output layers. They perform
most of the computations required by the network. Hidden layers can vary in number and size,
depending on the complexity of the task.

Each hidden layer applies a set of weights and biases to the input data, followed by an
activation function to introduce non-linearity.
3. Output Layer

Output Layer is the final layer in an ANN. It produces the output predictions. The number of
neurons in this layer corresponds to the number of classes in a classification problem or the
number of outputs in a regression problem.

The activation function used in the output layer depends on the type of problem:

 Softmax for multi-class classification

 Sigmoid for binary classification

 Linear for regression

For better understanding of the activation functions, Refer to the article - Activation functions
in Neural Networks

Types of Hidden Layers in Artificial Neural Networks

So far we have covered the basic layers: input, hidden and output. Let's now dive into the specific types of hidden layers.

1. Dense (Fully Connected) Layer

Dense (Fully Connected) Layer is the most common type of hidden layer in an ANN. Every
neuron in a dense layer is connected to every neuron in the previous and subsequent layers.
This layer performs a weighted sum of inputs and applies an activation function to introduce
non-linearity. The activation function (like ReLU, Sigmoid, or Tanh) helps the network learn
complex patterns.

 Role: Learns representations from input data.

 Function: Performs weighted sum and activation.


Dense (Fully Connected Layer)

2. Convolutional Layer

Convolutional layers are used in Convolutional Neural Networks (CNNs) for image processing
tasks. They apply convolution operations to the input, capturing spatial hierarchies in the data.
Convolutional layers use filters to scan across the input and generate feature maps. This helps
in detecting edges, textures, and other visual features.

 Role: Extracts spatial features from images.

 Function: Applies convolution using filters.


Convolutional Layer

3. Recurrent Layer

Recurrent layers are used in Recurrent Neural Networks (RNNs) for sequence data like time
series or natural language. They have connections that loop back, allowing information to
persist across time steps. This makes them suitable for tasks where context and temporal
dependencies are important.

 Role: Processes sequential data with temporal dependencies.

 Function: Maintains state across time steps.


Recurrent Layer
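A minimal sketch of a recurrent layer in Keras follows; the sequence length of 30, the 8 features per step and the 16 units are assumptions made for illustration:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, SimpleRNN, Dense

# Sequences of 30 time steps, each step carrying 8 features (assumed sizes)
model = Sequential([
    Input(shape=(30, 8)),
    SimpleRNN(16),                  # hidden state persists across time steps
    Dense(1, activation='sigmoid')  # e.g. one label per sequence
])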

4. Dropout Layer

Dropout layers are a regularization technique used to prevent overfitting. They randomly drop
a fraction of the neurons during training, which forces the network to learn more robust
features and reduces dependency on specific neurons. During training, each neuron is retained
with a probability p.

 Role: Prevents overfitting.

 Function: Randomly drops neurons during training.


Dropout Layer

5. Pooling Layer

Pooling Layer is used to reduce the spatial dimensions of the data, thereby decreasing the
computational load and controlling overfitting. Common types of pooling include Max Pooling
and Average Pooling.

Use Cases: Dimensionality reduction in CNNs


Pooling Layer

6. Batch Normalization Layer

A Batch Normalization Layer normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. This helps in accelerating the training process and improving the performance of the network.

Use Cases: Stabilizing and speeding up training


Batch Normalization
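To see how these hidden-layer types combine in practice, here is a small sketch of a CNN-style stack in Keras. The filter count, pooling size, dropout rate of 0.5 and the 28x28 grayscale input are illustrative assumptions:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     BatchNormalization, Dropout,
                                     Flatten, Dense)

model = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(32, (3, 3), activation='relu'),  # convolutional layer extracts spatial features
    BatchNormalization(),                   # normalizes activations to stabilize training
    MaxPooling2D((2, 2)),                   # pooling layer reduces spatial dimensions
    Flatten(),
    Dense(64, activation='relu'),           # dense (fully connected) layer
    Dropout(0.5),                           # randomly drops neurons during training
    Dense(10, activation='softmax')
])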

Understanding the different types of layers in an ANN is essential for designing effective neural
networks. Each layer has a specific role, from receiving input data to learning complex patterns
and producing predictions. By combining these layers, we can build powerful models capable
of solving a wide range of tasks.

Activation functions in Neural Networks

Last Updated : 03 Jun, 2025

While building a neural network, one key decision is selecting the Activation Function for both
the hidden layer and the output layer. It is a mathematical function applied to the output of a
neuron. It introduces non-linearity into the model, allowing the network to learn and
represent complex patterns in the data. Without this non-linearity feature a neural network
would behave like a linear regression model no matter how many layers it has.

Activation function decides whether a neuron should be activated by calculating the weighted
sum of inputs and adding a bias term. This helps the model make complex decisions and
predictions by introducing non-linearities to the output of each neuron.
Before diving into the activation function, you should have prior knowledge of the following
topics: Neural Networks, Backpropagation

Activation Functions in neural Networks

Introducing Non-Linearity in Neural Network

Non-linearity means that the relationship between input and output is not a straight line. In
simple terms the output does not change proportionally with the input. A common choice is
the ReLU function, defined as σ(x) = max(0, x).

Imagine you want to classify apples and bananas based on their shape and color.

 If we use a linear function it can only separate them using a straight line.

 But real-world data is often more complex like overlapping colors, different lighting,
etc.

 By adding a non-linear activation function like ReLU, Sigmoid or Tanh the network can
create curved decision boundaries to separate them correctly.

Effect of Non-Linearity

The inclusion of the ReLU activation function σ allows h1 to introduce a non-linear decision boundary in the input space. This non-linearity enables the network to learn more complex patterns that are not possible with a purely linear model, such as:

 Modeling functions that are not linearly separable.

 Increasing the capacity of the network to form multiple decision boundaries based on
the combination of weights and biases.
Why is Non-Linearity Important in Neural Networks?

Neural networks consist of neurons that operate using weights, biases and activation
functions.

In the learning process these weights and biases are updated based on the error produced at
the output—a process known as backpropagation. Activation functions enable
backpropagation by providing gradients that are essential for updating the weights and biases.

Without non-linearity even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions help neural networks to model highly complex data
distributions and solve advanced deep learning tasks. Adding non-linear activation functions
introduce flexibility and enable the network to learn more complex and abstract patterns from
data.

Mathematical Proof of Need of Non-Linearity in Neural Networks

To illustrate the need for non-linearity in neural networks with a specific example, let's consider a network with two input nodes (i1 and i2), a single hidden layer containing neurons h1 and h2, and an output neuron (out).

We will use w1, w2, w3, w4 as weights connecting the inputs to the hidden neurons and w5, w6 as the weights connecting the hidden neurons to the output. We'll also include biases (b1, b2 for the hidden neurons and a bias for the output neuron) to complete the model.

1. Input Layer: Two inputs i1 and i2.

2. Hidden Layer: Two neurons h1 and h2.

3. Output Layer: One output neuron.

The input to each hidden neuron is calculated as a weighted sum of the inputs plus a bias:

h1 = i1·w1 + i2·w3 + b1

h2 = i1·w2 + i2·w4 + b2

The output neuron is then a weighted sum of the hidden neurons' outputs plus a bias:

output = h1·w5 + h2·w6 + bias

Here, h1, h2 and output are all linear expressions.

In order to add non-linearity, we will use the sigmoid activation function in the output layer:

σ(x) = 1 / (1 + e^(-x))

final output = σ(h1·w5 + h2·w6 + bias) = 1 / (1 + e^(-(h1·w5 + h2·w6 + bias)))

This gives the final output of the network after applying the sigmoid activation function in the output layer, introducing the desired non-linearity.
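This argument can also be checked numerically: without an activation, two layers collapse into a single linear map, while inserting the sigmoid breaks the collapse. The weight and input values in this sketch are arbitrary, chosen only for illustration:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.35, 0.7])      # inputs i1, i2 (arbitrary values)
W1 = np.array([[0.2, 0.3],
               [0.2, 0.3]])    # input-to-hidden weights (arbitrary)
w2 = np.array([0.3, 0.9])      # hidden-to-output weights (arbitrary)

# Two linear layers collapse into a single linear layer
deep_linear = (x @ W1) @ w2
single_linear = x @ (W1 @ w2)
print(np.isclose(deep_linear, single_linear))  # True: the extra layer adds nothing

# With sigmoid between the layers the outputs differ
non_linear = sigmoid(x @ W1) @ w2
print(deep_linear, non_linear)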

Types of Activation Functions in Deep Learning

1. Linear Activation Function

Linear Activation Function resembles a straight line defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions the output is a linear combination of the input.

 The range of the output spans from (−∞, +∞).

 The linear activation function is typically used in just one place: the output layer.

 Using linear activations across all layers limits the network's ability to learn complex patterns.

Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
Linear Activation Function or Identity Function returns the input as the output

2. Non-Linear Activation Functions

1. Sigmoid Function

Sigmoid Activation Function is characterized by its 'S' shape. It is mathematically defined as A = 1 / (1 + e^(-x)). This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.

 It allows neural networks to handle and model complex patterns that linear equations
cannot.

 The output ranges between 0 and 1, hence useful for binary classification.

 The function exhibits a steep gradient when x values are between -2 and 2. This
sensitivity means that small changes in input x can cause significant changes in output
y which is critical during the training process.
Sigmoid or Logistic Activation Function Graph

2. Tanh Activation Function

Tanh function (hyperbolic tangent function) is a shifted version of the sigmoid, allowing it to stretch across the y-axis. It is defined as:

f(x) = tanh(x) = 2 / (1 + e^(-2x)) − 1

Alternatively, it can be expressed using the sigmoid function:

tanh(x) = 2 × sigmoid(2x) − 1

 Value Range: Outputs values from -1 to +1.

 Non-linear: Enables modeling of complex data patterns.

 Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered
output, facilitating easier learning for subsequent layers.
Tanh Activation Function

3. ReLU (Rectified Linear Unit) Function

ReLU activation is defined by A(x) = max(0, x); this means that if the input x is positive, ReLU returns x, and if the input is negative, it returns 0.

 Value Range: [0, ∞), meaning the function only outputs non-negative values.

 Nature: It is a non-linear activation function, allowing neural networks to learn complex patterns and making backpropagation more efficient.

 Advantage over other Activations: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and efficient to compute.
ReLU Activation Function

3. Exponential Linear Units

1. Softmax Function

Softmax function is designed to handle multi-class classification problems. It transforms raw output scores from a neural network into probabilities. It works by squashing the output values of each class into the range 0 to 1 while ensuring that the sum of all probabilities equals 1.

 Softmax is a non-linear activation function.

 The Softmax function ensures that each class is assigned a probability, helping to
identify which class the input belongs to.
Softmax Activation Function

2. SoftPlus Function

Softplus function is defined mathematically as: A(x) = log(1 + e^x).

This equation ensures that the output is always positive and differentiable at all points which is
an advantage over the traditional ReLU function.

 Nature: The Softplus function is non-linear.

 Range: The function outputs values in the range (0, ∞), similar to ReLU, but without the hard zero threshold that ReLU has.

 Smoothness: Softplus is a smooth, continuous function, meaning it avoids the sharp discontinuities of ReLU, which can sometimes lead to problems during optimization.

Softplus Activation Function
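For reference, each of the activation functions above is a one-liner in NumPy. This sketch simply evaluates them on a few sample values to show their output ranges:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # outputs in (0, 1)

def relu(x):
    return np.maximum(0, x)              # clips negatives to 0

def softplus(x):
    return np.log(1 + np.exp(x))         # smooth, always positive

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # shift by max for numerical stability
    return e / e.sum()                   # probabilities summing to 1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(np.tanh(x))                        # outputs in (-1, 1)
print(relu(x))
print(softplus(x))
print(softmax(x), softmax(x).sum())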

Impact of Activation Functions on Model Performance

The choice of activation function has a direct impact on the performance of a neural network
in several ways:

1. Convergence Speed: Functions like ReLU allow faster training by avoiding the vanishing
gradient problem while Sigmoid and Tanh can slow down convergence in deep
networks.

2. Gradient Flow: Activation functions like ReLU ensure better gradient flow, helping
deeper layers learn effectively. In contrast Sigmoid can lead to small gradients,
hindering learning in deep layers.

3. Model Complexity: Activation functions like Softmax allow the model to handle
complex multi-class problems, whereas simpler functions like ReLU or Leaky ReLU are
used for basic layers.

Activation functions are the backbone of neural networks enabling them to capture non-linear
relationships in data. From classic functions like Sigmoid and Tanh to modern variants like ReLU
and Swish, each has its place in different types of neural networks. The key is to understand
their behavior and choose the right one based on your model’s needs.

Feedforward Neural Network (FNN) is a type of artificial neural network in which information
flows in a single direction—from the input layer through hidden layers to the output layer—
without loops or feedback. It is mainly used for pattern recognition tasks like image and speech
classification.

For example, in a credit scoring system, banks use an FNN that analyzes users' financial profiles, such as income, credit history and spending habits, to determine their creditworthiness.

Each piece of information flows through the network’s layers where various calculations are
made to produce a final score.

Structure of a Feedforward Neural Network

Feedforward Neural Networks have a structured layered design where data flows sequentially
through each layer.

1. Input Layer: The input layer consists of neurons that receive the input data. Each
neuron in the input layer represents a feature of the input data.

2. Hidden Layers: One or more hidden layers are placed between the input and output
layers. These layers are responsible for learning the complex patterns in the data. Each
neuron in a hidden layer applies a weighted sum of inputs followed by a non-linear
activation function.

3. Output Layer: The output layer provides the final output of the network. The number
of neurons in this layer corresponds to the number of classes in a classification
problem or the number of outputs in a regression problem.

Each connection between neurons in these layers has an associated weight that is adjusted
during the training process to minimize the error in predictions.

Feed Forward Neural Network

Activation Functions

Activation functions introduce non-linearity into the network enabling it to learn and model
complex data patterns.

Common activation functions include:


 Sigmoid: σ(x) = 1 / (1 + e^(-x))

 Tanh: tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

 ReLU: ReLU(x) = max(0, x)

Training a Feedforward Neural Network

Training a Feedforward Neural Network involves adjusting the weights of the neurons to
minimize the error between the predicted output and the actual output. This process is
typically performed using backpropagation and gradient descent.

1. Forward Propagation: During forward propagation the input data passes through the
network and the output is calculated.

2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.

3. Backpropagation: In backpropagation the error is propagated back through the network to update the weights. The gradient of the loss function with respect to each weight is calculated and the weights are adjusted using gradient descent.

Forward Propagation

Gradient Descent

Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. Common variants of gradient
descent include:

 Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
 Stochastic Gradient Descent (SGD): Updates weights for each training example
individually.

 Mini-batch Gradient Descent: Updates weights after computing the gradient over a small batch of training examples.
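The three variants differ only in how much data feeds each update. Below is a minimal NumPy sketch of mini-batch gradient descent on a linear regression loss; the synthetic data, batch size of 16 and learning rate of 0.1 are arbitrary choices (setting the batch size to the dataset size gives batch gradient descent, and setting it to 1 gives SGD):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr, batch_size = 0.1, 16
for epoch in range(50):
    idx = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient of the MSE loss
        w -= lr * grad                         # step along the negative gradient
print(w)                                       # should be close to true_w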

Evaluation of Feedforward neural network

Evaluating the performance of the trained model involves several metrics:

 Accuracy: The proportion of correctly classified instances out of the total instances.

 Precision: The ratio of true positive predictions to the total predicted positives.

 Recall: The ratio of true positive predictions to the actual positives.

 F1 Score: The harmonic mean of precision and recall, providing a balance between the
two.

 Confusion Matrix: A table used to describe the performance of a classification model, showing the true positives, true negatives, false positives and false negatives.
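All of these metrics are available in scikit-learn; the labels below are made up purely to show the calls:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class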

Code Implementation of Feedforward neural network

This code demonstrates the process of building, training and evaluating a neural network
model using TensorFlow and Keras to classify handwritten digits from the MNIST dataset.

The model architecture is defined using the Sequential API consisting of:

 a Flatten layer to convert the 2D image input into a 1D array

 a Dense layer with 128 neurons and ReLU activation

 a final Dense layer with 10 neurons and softmax activation to output probabilities for
each digit class.

Model is compiled with the Adam optimizer, SparseCategoricalCrossentropy loss function and
SparseCategoricalAccuracy metric and then trained for 5 epochs on the training data.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

# Load and prepare the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(),
              loss=SparseCategoricalCrossentropy(),
              metrics=[SparseCategoricalAccuracy()])

# Train the model
model.fit(x_train, y_train, epochs=5)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'\nTest accuracy: {test_acc}')

Output:

Test accuracy: 0.9767000079154968

By understanding their architecture, activation functions, and training process, one can
appreciate the capabilities and limitations of these networks. Continuous advancements in
optimization techniques and activation functions have made feedforward networks more
efficient and effective, contributing to the broader field of artificial intelligence.

Back Propagation, also known as "Backward Propagation of Errors", is a method used to train neural networks. Its goal is to reduce the difference between the model's predicted output and the actual output by adjusting the weights and biases in the network.

It works iteratively to adjust weights and biases to minimize the cost function. In each epoch the model adapts these parameters, reducing loss by following the error gradient. It often uses optimization algorithms like gradient descent or stochastic gradient descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to effectively navigate complex layers in the neural network to minimize the cost function.

Fig(a) A simple illustration of how the backpropagation works by adjustments of weights

Back Propagation plays a critical role in how neural networks improve over time. Here's why:

1. Efficient Weight Update: It computes the gradient of the loss function with respect to
each weight using the chain rule making it possible to update weights efficiently.

2. Scalability: The Back Propagation algorithm scales well to networks with multiple
layers and complex architectures making deep learning feasible.

3. Automated Learning: With Back Propagation the learning process becomes automated
and the model can adjust itself to optimize its performance.

Working of Back Propagation Algorithm

The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.

1. Forward Pass Work

In forward pass the input data is fed into the input layer. These inputs combined with their
respective weights are passed to hidden layers. For example in a network with two hidden
layers (h1 and h2) the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.

Each hidden layer computes the weighted sum (`a`) of the inputs then applies an activation
function like ReLU (Rectified Linear Unit) to obtain the output (`o`). The output is passed to
the next layer where an activation function such as softmax converts the weighted outputs
into probabilities for classification.

The forward pass using weights and biases

2. Backward Pass

In the backward pass the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method
for error calculation is the Mean Squared Error (MSE) given by:

MSE = (Predicted Output − Actual Output)^2

Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be
adjusted to minimize the error in the next iteration. The backward pass continues layer by layer
ensuring that the network learns and improves its performance. The activation function
through its derivative plays a crucial role in computing these gradients during Back
Propagation.

Example of Back Propagation in Machine Learning

Let’s walk through an example of Back Propagation in machine learning. Assume the neurons
use the sigmoid activation function for the forward and backward pass. The target output is 0.5
and the learning rate is 1.
Example (1) of backpropagation sum

Forward Propagation

1. Initial Calculation

The weighted sum at each node is calculated using:

aj = ∑ (wi,j × xi)

Where,

 aj is the weighted sum of all the inputs and weights at each node

 wi,j represents the weight between the ith input and the jth neuron

 xi represents the value of the ith input

o (output): After applying the activation function to aj, we get the output of the neuron:

oj = activation function(aj)

2. Sigmoid Function

The sigmoid function returns a value between 0 and 1, introducing non-linearity into the
model.

yj = 1 / (1 + e^(-aj))
To find the outputs of y3, y4 and y5

3. Computing Outputs

At h1 node

a1 = (w1,1 × x1) + (w2,1 × x2) = (0.2 × 0.35) + (0.2 × 0.7) = 0.21

Once we have calculated the a1 value, we can proceed to find the y3 value:

y3 = F(a1) = 1 / (1 + e^(-0.21)) = 0.56

Similarly, find the values of y4 at h2 and y5 at O3:

a2 = (w1,2 × x1) + (w2,2 × x2) = (0.3 × 0.35) + (0.3 × 0.7) = 0.315

y4 = F(0.315) = 1 / (1 + e^(-0.315))

a3 = (w1,3 × y3) + (w2,3 × y4) = (0.3 × 0.57) + (0.9 × 0.59) = 0.702

y5 = F(0.702) = 1 / (1 + e^(-0.702)) = 0.67
Values of y3, y4 and y5

4. Error Calculation

Our actual output is 0.5 but we obtained 0.67. To calculate the error we can use the below
formula:

Errorj = ytarget − y5 = 0.5 − 0.67 = −0.17

Using this error value we will be backpropagating.

Back Propagation

1. Calculating Gradients

The change in each weight is calculated as:

Δwij = η × δj × Oj

Where:

 δj is the error term for each unit,

 η is the learning rate.

2. Output Unit Error

For O3:

δ5 = y5(1 − y5)(ytarget − y5) = 0.67 × (1 − 0.67) × (−0.17) = −0.0376

3. Hidden Unit Error

For h1:

δ3 = y3(1 − y3)(w1,3 × δ5) = 0.56 × (1 − 0.56) × (0.3 × −0.0376) = −0.0027

For h2:

δ4 = y4(1 − y4)(w2,3 × δ5) = 0.59 × (1 − 0.59) × (0.9 × −0.0376) = −0.0082

4. Weight Updates

For the weights from the hidden to the output layer:

Δw2,3 = 1 × (−0.0376) × 0.59 = −0.022184

New weight:

w2,3(new) = 0.9 + (−0.022184) = 0.877816

For the weights from the input to the hidden layer:

Δw1,1 = 1 × (−0.0027) × 0.35 = −0.000945

New weight:

w1,1(new) = 0.2 + (−0.000945) = 0.199055

Similarly, the other weights are updated:

 w1,2(new) = 0.273225

 w1,3(new) = 0.086615

 w2,1(new) = 0.269445

 w2,2(new) = 0.18534

The updated weights are illustrated below


Through backward pass the weights are updated

After updating the weights the forward pass is repeated yielding:

 y3 = 0.57

 y4 = 0.56

 y5 = 0.61

Since y5 = 0.61 is still not the target output, the process of calculating the error and backpropagating continues until the desired output is reached.

This process demonstrates how Back Propagation iteratively updates weights by minimizing errors until the network accurately predicts the output.

Error = ytarget − y5 = 0.5 − 0.61 = −0.11

This process continues until the desired output is produced by the neural network.

Back Propagation Implementation in Python for XOR Problem

This code demonstrates how Back Propagation is used in a neural network to solve the XOR
problem. The neural network consists of:

1. Defining Neural Network

We define a neural network as Input layer with 2 inputs, Hidden layer with 4 neurons, Output
layer with 1 output neuron and use Sigmoid function as activation function.
 self.input_size = input_size: stores the size of the input layer

 self.hidden_size = hidden_size: stores the size of the hidden layer

 self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size): initializes weights for the input-to-hidden layer

 self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size): initializes weights for the hidden-to-output layer

 self.bias_hidden = np.zeros((1, self.hidden_size)): initializes the bias for the hidden layer

 self.bias_output = np.zeros((1, self.output_size)): initializes the bias for the output layer

import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # weights are initialized randomly; biases start at zero
        self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size)
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # x is expected to already be a sigmoid output
        return x * (1 - x)

2. Defining Feed Forward Network

In the forward pass, inputs are passed through the network, activating the hidden and output layers using the sigmoid function.

 self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden: calculates the activation for the hidden layer

 self.hidden_output = self.sigmoid(self.hidden_activation): applies the activation function to the hidden layer

 self.output_activation = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output: calculates the activation for the output layer

 self.predicted_output = self.sigmoid(self.output_activation): applies the activation function to the output layer

    # continuation of the NeuralNetwork class
    def feedforward(self, X):
        self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)
        self.output_activation = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)
        return self.predicted_output

3. Defining Backward Network

In Backward pass or Back Propagation the errors between the predicted and actual outputs are
computed. The gradients are calculated using the derivative of the sigmoid function and
weights and biases are updated accordingly.

 output_error = y - self.predicted_output: calculates the error at the output layer

 output_delta = output_error * self.sigmoid_derivative(self.predicted_output): calculates the delta for the output layer

 hidden_error = np.dot(output_delta, self.weights_hidden_output.T): calculates the error at the hidden layer

 hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output): calculates the delta for the hidden layer

 self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate: updates weights between hidden and output layers

 self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate: updates weights between input and hidden layers

    # continuation of the NeuralNetwork class
    def backward(self, X, y, learning_rate):
        output_error = y - self.predicted_output
        output_delta = output_error * self.sigmoid_derivative(self.predicted_output)

        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)

        # gradient-based updates for weights and biases
        self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

4. Training Network

The network is trained over 10,000 epochs using the Back Propagation algorithm with a learning rate of 0.1, progressively reducing the error.

 output = self.feedforward(X): computes the output for the current inputs

 self.backward(X, y, learning_rate): updates weights and biases using Back Propagation

 loss = np.mean(np.square(y - output)): calculates the mean squared error (MSE) loss

    # continuation of the NeuralNetwork class
    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backward(X, y, learning_rate)
            if epoch % 4000 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss:{loss}")

5. Testing Neural Network


 X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]): defines the input data

 y = np.array([[0], [1], [1], [0]]): defines the target values

 nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1): initializes the neural network

 nn.train(X, y, epochs=10000, learning_rate=0.1): trains the network

 output = nn.feedforward(X): gets the final predictions after training

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)

output = nn.feedforward(X)
print("Predictions after training:")
print(output)

Output:

Trained Model

 The output shows the training progress of a neural network over 10,000 epochs.
Initially the loss was high (0.2713) but it gradually decreased as the network learned
reaching a low value of 0.0066 by epoch 8000.

 The final predictions are close to the expected XOR outputs: approximately 0 for [0, 0]
and [1, 1] and approximately 1 for [0, 1] and [1, 0] indicating that the network
successfully learned to approximate the XOR function.

Advantages of Back Propagation for Neural Network Training

The key benefits of using the Back Propagation algorithm are:

1. Ease of Implementation: Back Propagation is beginner-friendly, requiring no prior neural network knowledge, and simplifies programming by adjusting weights with error derivatives.

2. Simplicity and Flexibility: Its straightforward design suits a range of tasks, from basic feedforward to complex convolutional or recurrent networks.

3. Efficiency: Back Propagation accelerates learning by directly updating weights based on error, especially in deep networks.

4. Generalization: It helps models generalize well to new data, improving prediction accuracy on unseen examples.

5. Scalability: The algorithm scales efficiently with larger datasets and more complex networks, making it ideal for large-scale tasks.

Challenges with Back Propagation

While Back Propagation is useful it does face some challenges:

1. Vanishing Gradient Problem: In deep networks the gradients can become very small
during Back Propagation making it difficult for the network to learn. This is common
when using activation functions like sigmoid or tanh.

2. Exploding Gradients: The gradients can also become excessively large causing the
network to diverge during training.

3. Overfitting: If the network is too complex it might memorize the training data instead
of learning general patterns.

Natural Language Processing (NLP) is a field that combines computer science, artificial intelligence and language studies. It helps computers understand, process and create human language in a way that makes sense and is useful. With the growing amount of text data from social media, websites and other sources, NLP is becoming a key tool to gain insights and automate tasks like analyzing text or translating languages.

Natural Language Processing

Table of Content

 NLP Techniques

 How Natural Language Processing (NLP) Works

 Technologies related to Natural Language Processing

 Applications of Natural Language Processing (NLP)

 Future Scope

NLP is used by many applications that use language, such as text translation, voice recognition,
text summarization and chatbots. You may have used some of these applications yourself, such
as voice-operated GPS systems, digital assistants, speech-to-text software and customer
service bots. NLP also helps businesses improve their efficiency, productivity and performance
by simplifying complex tasks that involve language.
NLP Techniques

NLP encompasses a wide array of techniques aimed at enabling computers to process and understand human language. These tasks can be categorized into several broad areas, each addressing a different aspect of language processing. Here are some of the key NLP techniques:

1. Text Processing and Preprocessing

 Tokenization: Dividing text into smaller units, such as words or sentences.

 Stemming and Lemmatization: Reducing words to their base or root forms.

 Stopword Removal: Removing common words (like "and", "the", "is") that may not
carry significant meaning.

 Text Normalization: Standardizing text, including case normalization, removing punctuation and correcting spelling errors.

2. Syntax and Parsing

 Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence (e.g., noun, verb, adjective).

 Dependency Parsing: Analyzing the grammatical structure of a sentence to identify relationships between words.

 Constituency Parsing: Breaking down a sentence into its constituent parts or phrases
(e.g., noun phrases, verb phrases).

3. Semantic Analysis

 Named Entity Recognition (NER): Identifying and classifying entities in text, such as
names of people organizations, locations, dates, etc.

 Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given context.

 Coreference Resolution: Identifying when different words refer to the same entity in a
text (e.g., "he" refers to "John").

4. Information Extraction

 Entity Extraction: Identifying specific entities and their relationships within the text.

 Relation Extraction: Identifying and categorizing the relationships between entities in a text.

5. Text Classification in NLP

 Sentiment Analysis: Determining the sentiment or emotional tone expressed in a text (e.g., positive, negative, neutral).

 Topic Modeling: Identifying topics or themes within a large collection of documents.

 Spam Detection: Classifying text as spam or not spam.

6. Language Generation
 Machine Translation: Translating text from one language to another.

 Text Summarization: Producing a concise summary of a larger text.

 Text Generation: Automatically generating coherent and contextually relevant text.

7. Speech Processing

 Speech Recognition: Converting spoken language into text.

 Text-to-Speech (TTS) Synthesis: Converting written text into spoken language.

8. Question Answering

 Retrieval-Based QA: Finding and returning the most relevant text passage in response
to a query.

 Generative QA: Generating an answer based on the information available in a text corpus.

9. Dialogue Systems

 Chatbots and Virtual Assistants: Enabling systems to engage in conversations with users, providing responses and performing tasks based on user input.

10. Sentiment and Emotion Analysis in NLP

 Emotion Detection: Identifying and categorizing emotions expressed in text.

 Opinion Mining: Analyzing opinions or reviews to understand public sentiment toward products, services or topics.

How Natural Language Processing (NLP) Works

NLP Working

Working in natural language processing (NLP) typically involves using computational techniques to analyze and understand human language. This can include tasks such as language understanding, language generation and language interaction.

1. Text Input and Data Collection

 Data Collection: Gathering text data from various sources such as websites, books,
social media or proprietary databases.

 Data Storage: Storing the collected text data in a structured format, such as a database
or a collection of documents.

2. Text Preprocessing

Preprocessing is crucial to clean and prepare the raw text data for analysis. Common
preprocessing steps include:

 Tokenization: Splitting text into smaller units like words or sentences.

 Lowercasing: Converting all text to lowercase to ensure uniformity.


 Stopword Removal: Removing common words that do not contribute significant
meaning, such as "and," "the," "is."

 Punctuation Removal: Removing punctuation marks.

 Stemming and Lemmatization: Reducing words to their base or root forms. Stemming
cuts off suffixes, while lemmatization considers the context and converts words to their
meaningful base form.

 Text Normalization: Standardizing text format, including correcting spelling errors, expanding contractions and handling special characters.

3. Text Representation

 Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and
word order but keeping track of word frequency.

 Term Frequency-Inverse Document Frequency (TF-IDF): A statistic that reflects the importance of a word in a document relative to a collection of documents.

 Word Embeddings: Using dense vector representations of words where semantically similar words are closer together in the vector space (e.g., Word2Vec, GloVe).

4. Feature Extraction

Extracting meaningful features from the text data that can be used for various NLP tasks.

 N-grams: Capturing sequences of N words to preserve some context and word order.

 Syntactic Features: Using parts of speech tags, syntactic dependencies and parse trees.

 Semantic Features: Leveraging word embeddings and other representations to capture word meaning and context.
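As a small illustration of the bag-of-words and n-gram ideas described above, scikit-learn's CountVectorizer produces both; the two toy sentences are made up for this sketch:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of Words: counts per word, ignoring order
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# Unigrams plus bigrams: ngram_range preserves some word order
ngrams = CountVectorizer(ngram_range=(1, 2))
ngrams.fit(docs)
print(ngrams.get_feature_names_out())  # includes pairs like 'cat sat'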

5. Model Selection and Training

Selecting and training a machine learning or deep learning model to perform specific NLP
tasks.

 Supervised Learning: Using labeled data to train models like Support Vector Machines
(SVM), Random Forests or deep learning models like Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs).

 Unsupervised Learning: Applying techniques like clustering or topic modeling (e.g., Latent Dirichlet Allocation) on unlabeled data.

 Pre-trained Models: Utilizing pre-trained language models such as BERT, GPT or transformer-based models that have been trained on large corpora.

6. Model Deployment and Inference

Deploying the trained model and using it to make predictions or extract insights from new text
data.

 Text Classification: Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).
 Named Entity Recognition (NER): Identifying and classifying entities in the text.

 Machine Translation: Translating text from one language to another.

 Question Answering: Providing answers to questions based on the context provided by text data.

7. Evaluation and Optimization

Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision,
recall, F1-score and others.

 Hyperparameter Tuning: Adjusting model parameters to improve performance.

 Error Analysis: Analyzing errors to understand model weaknesses and improve robustness.

Technologies related to Natural Language Processing

There are a variety of technologies related to natural language processing (NLP) that are used
to analyze and understand human language. Some of the most common include:

1. Machine learning: NLP relies heavily on machine learning techniques such as supervised and unsupervised learning, deep learning and reinforcement learning to train models to understand and generate human language.

2. Natural Language Toolkits (NLTK) and other libraries: NLTK is a popular open-source
library in Python that provides tools for NLP tasks such as tokenization, stemming and
part-of-speech tagging. Other popular libraries include spaCy, OpenNLP and CoreNLP.

3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as
dependency parsing and constituency parsing.

4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written
text into spoken words, while STT systems convert spoken words into written text.

5. Named Entity Recognition (NER) systems: NER systems identify and extract named
entities such as people, places and organizations from the text.

6. Sentiment Analysis: A technique to understand the emotions or opinions expressed in a piece of text, using approaches such as lexicon-based, machine learning-based and deep learning-based methods.

7. Machine Translation: NLP is used for language translation from one language to
another through a computer.

8. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans
through auditory or textual methods.

9. AI Software: NLP is used in question-answering software for knowledge representation, analytical reasoning as well as information retrieval.

Applications of Natural Language Processing (NLP)


 Spam Filters: One of the most irritating things about email is spam. Gmail uses natural
language processing (NLP) to discern which emails are legitimate and which are spam.
These spam filters look at the text in all the emails you receive and try to figure out
what it means to see if it's spam or not.

 Algorithmic Trading: Algorithmic trading is used for predicting stock market conditions.
Using NLP, this technology examines news headlines about companies and stocks and
attempts to comprehend their meaning in order to determine if you should buy, sell or
hold certain stocks.

 Questions Answering: NLP can be seen in action by using Google Search or Siri
Services. A major use of NLP is to make search engines understand the meaning of
what we are asking and generate natural language in return to give us the answers.

 Summarizing Information: On the internet, there is a lot of information and much of it comes in the form of long documents or articles. NLP is used to decipher the meaning of the data and then provide shorter summaries so that humans can comprehend it more quickly.

Future Scope

NLP is shaping the future of technology in several ways:

 Chatbots and Virtual Assistants: NLP enables chatbots to quickly understand and
respond to user queries, providing 24/7 assistance across text or voice interactions.

 Invisible User Interfaces (UI): With NLP, devices like Amazon Echo allow for seamless
communication through voice or text, making technology more accessible without
traditional interfaces.

 Smarter Search: NLP is improving search by allowing users to ask questions in natural
language, as seen with Google Drive's recent update, making it easier to find
documents.

 Multilingual NLP: Expanding NLP to support more languages, including regional and
minority languages, broadens accessibility.

Future Enhancements: NLP is evolving with the use of Deep Neural Networks (DNNs) to make
human-machine interactions more natural. Future advancements include improved semantics
for word understanding and broader language support, enabling accurate translations and
better NLP models for languages not yet supported.

Natural Language Processing (NLP) vs Natural Language Understanding (NLU) vs Natural Language Generation (NLG)

1.
 NLP: It was first started by Alan Turing to make the machine understand the context of any document rather than treating it as simple words.
 NLU: This explores the ways which enable the computers to grasp the instructions provided by users in human languages like English, Hindi etc.
 NLG: This enables the computers to produce output after understanding the input given by the user in natural languages like English, Hindi etc.

2.
 NLP: It came into existence around 1950.
 NLU: This concept began around 1866.
 NLG: It came into existence around 1960.

3.
 NLP: It has 5 phases: lexical analysis, syntax analysis, semantic analysis, discourse integration and pragmatic analysis.
 NLU: It has 3 phases: first paraphrasing the input information, second text conversion to other languages and third drawing inferences from the given information.
 NLG: It also has 3 phases: first understanding the information, second formulating ways to provide output and third achieving the realization of giving output in natural languages.

4.
 NLP: Applications of NLP are smart assistance, language translation, text analysis etc.
 NLU: Applications of NLU are speech recognition, sentiment analysis, spam filtering etc.
 NLG: Applications of NLG are chatbots, voice assistants etc.

5.
 NLP: It makes use of sensors for input and uses different layers for processing data and then provides the output.
 NLU: Sensors and processors are used to take input and process the information.
 NLG: After understanding and processing, actuators are used to provide output.

6.
 NLP: It converts instructions from natural language to computer language and then the computer returns the information again in natural language after processing.
 NLU: It converts the unstructured data provided by the user to structured or meaningful information.
 NLG: It generates structured data for the user.

7.
 NLP: It involves different analysis phases.
 NLU: It utilizes different strategies to understand the natural language and give feedback accordingly.
 NLG: It has different generation phases.

8.
 NLP: It first converts the natural language to machine language for understanding.
 NLU: It utilizes a learning mechanism to provide efficient results.
 NLG: It formulates the plan for text utterance.

Natural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing tasks. From rudimentary tasks such as text pre-processing to tasks like vectorized representation of text, NLTK's API has covered everything. In this article, we will accustom ourselves to the basics of NLTK and perform some crucial NLP tasks: Tokenization, Stemming, Lemmatization and POS Tagging.

Table of Content

 What is the Natural Language Toolkit (NLTK)?

 Tokenization

 Stemming and Lemmatization

 Stemming

 Lemmatization

 Part of Speech Tagging

What is the Natural Language Toolkit (NLTK)?

As discussed earlier, NLTK is Python's API library for performing an array of tasks in human language. It can perform a variety of operations on textual data, such as classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc.

Installation:
NLTK can be installed simply using pip or by running the following code.

! pip install nltk


Accessing Additional Resources:
To incorporate the usage of additional resources, such as resources for languages other than English, you can run the following in a Python script. This has to be done only once, when you run NLTK for the first time on your system.

import nltk

nltk.download('all')

Now, having installed NLTK successfully in our system, let's perform some basic operations on
text data using NLTK.

Tokenization

Tokenization refers to breaking down text into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look at the two major kinds of tokenization that NLTK provides:

Word Tokenization

It involves breaking down the text into words.

"I study Machine Learning on GeeksforGeeks." will be word-tokenized as


['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].

Sentence Tokenization

It involves breaking down the text into individual sentences.

Example:
"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP"
will be sentence-tokenized as
['I study Machine Learning on GeeksforGeeks.', 'Currently, I'm studying NLP.']

In Python, both these tokenizations can be implemented in NLTK as follows:

# Tokenization using NLTK
from nltk import word_tokenize, sent_tokenize

sent = "GeeksforGeeks is a great learning platform.\
It is one of the best for Computer Science students."

print(word_tokenize(sent))
print(sent_tokenize(sent))

Output:

['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.', 'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']
['GeeksforGeeks is a great learning platform.', 'It is one of the best for Computer Science students.']

Stemming and Lemmatization


When working with Natural Language, we are not much interested in the form of words -
rather, we are concerned with the meaning that the words intend to convey. Thus, we try to
map every word of the language to its root/base form. This process is called canonicalization.

E.g. The words 'play', 'plays', 'played', and 'playing' convey the same action - hence, we can
map them all to their base form i.e. 'play'.

Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.

Stemming

Stemming generates the base word from the inflected word by removing the affixes of the
word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be
noted that stemmers might not always result in semantically meaningful base words.
Stemmers are faster and computationally less expensive than lemmatizers.

In the following code, we will be stemming words using Porter Stemmer - one of the most
widely used stemmers:

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

Output:

play
play
play
play

We can see that all the variations of the word 'play' have been reduced to the same word -
'play'. In this case, the output is a meaningful word, 'play'. However, this is not always the case.
Let us take an example.


from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

print(porter.stem("Communication"))
Output:

commun

The stemmer reduces the word 'communication' to a base word 'commun' which is
meaningless in itself.

Lemmatization

Lemmatization involves grouping together the inflected forms of the same word. This way, we can reach the base form of any word, and that base form will be meaningful in nature. The base form here is called the lemma. Note that these groups are stored in the lemmatizer; there is no removal of affixes as in the case of a stemmer.

Lemmatizers are slower and computationally more expensive than stemmers.

Example:
'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In Python, lemmatization can be implemented in NLTK as follows:

from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:

play
play
play
play

Please note that in lemmatizers, we need to pass the Part of Speech of the word along with the
word as a function argument.

Also, lemmatizers always result in meaningful base words. Let us take the same example as we
took in the case for stemmers.

from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("Communication", 'v'))

Output:
Communication

Part of Speech Tagging

Part of Speech (POS) tagging refers to assigning each word of a sentence to its part of speech.
It is significant as it helps to give a better syntactic overview of a sentence.

Example:
"GeeksforGeeks is a Computer Science platform."
Let's see how NLTK's POS tagger will tag this sentence.

In Python, POS tagging can be implemented in NLTK as follows:

from nltk import pos_tag
from nltk import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)

tags = pos_tag(tokenized_text)
print(tags)

Output:

[('GeeksforGeeks', 'NNP'),
('is', 'VBZ'),
('a', 'DT'),
('Computer', 'NNP'),
('Science', 'NNP'),
('platform', 'NN'),
('.', '.')]

Conclusion

In conclusion, the Natural Language Toolkit (NLTK) is a powerful Python library that offers a wide range of tools for Natural Language Processing (NLP). From fundamental tasks like text pre-processing to more advanced operations such as semantic reasoning, NLTK provides a versatile API that caters to the diverse needs of language-related tasks.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural language processing and information retrieval to evaluate the importance of a word in a document relative to a collection of documents (corpus).

Unlike simple word frequency, TF-IDF balances common and rare words to highlight the most
meaningful terms.
How TF-IDF Works?

TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency
(IDF).

Term Frequency (TF): Measures how often a word appears in a document. A higher frequency suggests greater importance. If a term appears frequently in a document, it is likely relevant to the document's content. Formula:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Limitations of TF Alone:

 TF does not account for the global importance of a term across the entire corpus.

 Common words like "the" or "and" may have high TF scores but are not meaningful in
distinguishing documents.

Inverse Document Frequency (IDF): Reduces the weight of common words across multiple documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific. Formula:

IDF(t, D) = log(N / number of documents containing term t), where N is the total number of documents in the corpus

 The logarithm is used to dampen the effect of very large or very small values, ensuring
the IDF score scales appropriately.

 It also helps balance the impact of terms that appear in extremely few or extremely
many documents.

Limitations of IDF Alone:

 IDF does not consider how often a term appears within a specific document.

 A term might be rare across the corpus (high IDF) but irrelevant in a specific document
(low TF).

Converting Text into vectors with TF-IDF : Example

To better grasp how TF-IDF works, let’s walk through a detailed example. Imagine we have
a corpus (a collection of documents) with three documents:

1. Document 1: "The cat sat on the mat."

2. Document 2: "The dog played in the park."

3. Document 3: "Cats and dogs are great pets."

Our goal is to calculate the TF-IDF score for specific terms in these documents. Let’s focus on
the word "cat" and see how TF-IDF evaluates its importance.

Step 1: Calculate Term Frequency (TF)

For Document 1:

 The word "cat" appears 1 time.


 The total number of terms in Document 1 is 6 ("the", "cat", "sat", "on", "the", "mat").

 So, TF(cat,Document 1) = 1/6

For Document 2:

 The word "cat" does not appear.

 So, TF(cat,Document 2)=0.

For Document 3:

 The word "cat" appears 1 time (as "cats").

 The total number of terms in Document 3 is 6 ("cats", "and", "dogs", "are", "great",
"pets").

 So, TF(cat,Document 3)=1/6

 In Document 1 and Document 3, the word "cat" has the same TF score. This means it
appears with the same relative frequency in both documents.

 In Document 2, the TF score is 0 because the word "cat" does not appear.

Step 2: Calculate Inverse Document Frequency (IDF)

 Total number of documents in the corpus (D): 3

 Number of documents containing the term "cat": 2 (Document 1 and Document 3).

So, IDF(cat, D) = log(3/2) ≈ 0.176

The IDF score for "cat" is relatively low. This indicates that the word "cat" is not very rare in
the corpus—it appears in 2 out of 3 documents. If a term appeared in only 1 document, its IDF
score would be higher, indicating greater uniqueness.

Step 3: Calculate TF-IDF

The TF-IDF score for "cat" is 0.029 in Document 1 and Document 3, and 0 in Document 2
that reflects both the frequency of the term in the document (TF) and its rarity across the
corpus (IDF).

TF-IDF

A higher TF-IDF score means the term is more important in that specific document.
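The three steps can be reproduced in a few lines of plain Python. This sketch recomputes the scores for "cat" using a base-10 logarithm, which matches the numbers above; matching "cats" by prefix is a crude stand-in for proper lemmatization:

import math

docs = ["The cat sat on the mat.",
        "The dog played in the park.",
        "Cats and dogs are great pets."]
term = "cat"

def tokens(doc):
    # lowercase and strip the trailing period; enough for this toy corpus
    return doc.lower().rstrip('.').split()

def matches(tok):
    return tok.startswith(term)  # counts "cats" as "cat", as the example does

df = sum(any(matches(t) for t in tokens(d)) for d in docs)
idf = math.log10(len(docs) / df)                    # log10(3/2) ≈ 0.176

for i, d in enumerate(docs, start=1):
    toks = tokens(d)
    tf = sum(matches(t) for t in toks) / len(toks)  # e.g. 1/6 for Document 1
    print(f"Document {i}: TF = {tf:.3f}, TF-IDF = {tf * idf:.3f}")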

Why is TF-IDF Useful in This Example?

1. Identifying Important Terms: TF-IDF helps us understand that "cat" is somewhat important
in Document 1 and Document 3 but irrelevant in Document 2.

If we were building a search engine, this score would help rank Document 1 and Document 3
higher for a query like "cat".

2. Filtering Common Words: Words like "the" or "and" would have high TF scores but very low
IDF scores because they appear in almost all documents. Their TF-IDF scores would be close to
0, indicating they are not meaningful.
3. Highlighting Unique Terms: If a term like "mat" appeared only in Document 1, it would have
a higher IDF score, making its TF-IDF score more significant in that document.

Implementing TF-IDF in Sklearn with Python

In Python, tf-idf values can be computed using the TfidfVectorizer class in the sklearn module.

Syntax:

sklearn.feature_extraction.text.TfidfVectorizer(input)

Parameters:

 input: It refers to the document parameter passed; it can be a filename, a file object or the content itself.

Attributes:

 vocabulary_: It returns a dictionary of terms as keys and values as feature indices.

 idf_: It returns the inverse document frequency vector of the document passed as a
parameter.

Returns:

 fit_transform(): It returns a sparse matrix of tf-idf values (one row per document, one column per term).

 get_feature_names_out(): It returns an array of feature names (in scikit-learn versions older than 1.0, this method was called get_feature_names()).

Step-by-step Approach:

 Import modules.

# import required module

from sklearn.feature_extraction.text import TfidfVectorizer

 Collect strings from documents and create a corpus having a collection of strings from
the documents d0, d1, and d2.

# assign documents

d0 = 'Geeks for geeks'

d1 = 'Geeks'

d2 = 'r2j'

# merge documents into a single corpus

string = [d0, d1, d2]

 Get tf-idf values from fit_transform() method.

# create object

tfidf = TfidfVectorizer()
# get tf-idf values

result = tfidf.fit_transform(string)

 Display idf values of the words present in the corpus.

# get idf values

print('\nidf values:')

for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

Output:

 Display tf-idf values along with indexing.

# get indexing

print('\nWord indexes:')

print(tfidf.vocabulary_)

# display tf-idf values

print('\ntf-idf value:')

print(result)

# in matrix form

print('\ntf-idf values in matrix form:')

print(result.toarray())

Output:
The result variable consists of the unique words as well as their tf-idf values, which can be summarized in the table below:

Document    Word     Document Index    Word Index    tf-idf value
d0          for      0                 0             0.549
d0          geeks    0                 1             0.8355
d1          geeks    1                 1             1.000
d2          r2j      2                 2             1.000
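One detail worth noting: TfidfVectorizer L2-normalizes each document's row by default (norm='l2'), which is why the single-word documents d1 and d2 score exactly 1.000. A quick check, assuming the same corpus as above:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()  # norm='l2' is the default
matrix = tfidf.fit_transform(['Geeks for geeks', 'Geeks', 'r2j']).toarray()

# Every row has unit Euclidean length after the default normalization
print(np.linalg.norm(matrix, axis=1))  # -> [1. 1. 1.]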

Below are some examples which depict how to compute tf-idf values of words from a
corpus:

Example 1: Below is the complete program based on the above approach:

# import required module

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents

d0 = 'Geeks for geeks'

d1 = 'Geeks'

d2 = 'r2j'

# merge documents into a single corpus

string = [d0, d1, d2]

# create object

tfidf = TfidfVectorizer()

# get tf-idf values

result = tfidf.fit_transform(string)

# get idf values

print('\nidf values:')

for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')

print(tfidf.vocabulary_)

# display tf-idf values

print('\ntf-idf value:')

print(result)

# in matrix form

print('\ntf-idf values in matrix form:')

print(result.toarray())

Output:

Example 2: Here, tf-idf values are computed for a corpus in which every document consists of a single unique term.

# import required module

from sklearn.feature_extraction.text import TfidfVectorizer


# assign documents

d0 = 'geek1'

d1 = 'geek2'

d2 = 'geek3'

d3 = 'geek4'

# merge documents into a single corpus

string = [d0, d1, d2, d3]

# create object

tfidf = TfidfVectorizer()

# get tf-idf values

result = tfidf.fit_transform(string)

# get indexing

print('\nWord indexes:')

print(tfidf.vocabulary_)

# display tf-idf values

print('\ntf-idf values:')

print(result)

Output:
Example 3: In this program, tf-idf values are computed from a corpus of identical documents.

# import required module

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents

d0 = 'Geeks for geeks!'

d1 = 'Geeks for geeks!'

# merge documents into a single corpus

string = [d0, d1]

# create object

tfidf = TfidfVectorizer()

# get tf-idf values

result = tfidf.fit_transform(string)

# get indexing

print('\nWord indexes:')

print(tfidf.vocabulary_)
# display tf-idf values

print('\ntf-idf values:')

print(result)

Output:

Example 4: Below is a program that calculates the tf-idf values when a single word, geeks, is repeated multiple times across multiple documents.

# import required module

from sklearn.feature_extraction.text import TfidfVectorizer

# assign corpus

string = ['Geeks geeks']*5

# create object

tfidf = TfidfVectorizer()

# get tf-idf values

result = tfidf.fit_transform(string)

# get indexing

print('\nWord indexes:')

print(tfidf.vocabulary_)

# display tf-idf values


print('\ntf-idf values:')

print(result)

Output:
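A quick sanity check on Examples 3 and 4: with scikit-learn's default smoothing (smooth_idf=True), a term that occurs in every document gets an idf of exactly 1.0, so its tf-idf weight is driven purely by within-document frequency and the L2 normalization. A sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(['Geeks geeks'] * 5)

# With smooth_idf=True, idf = ln((1 + n) / (1 + df)) + 1,
# which equals 1.0 whenever a term occurs in all n documents (df == n)
print(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_)))  # {'geeks': 1.0}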

An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (a long text dataset).

For instance, N-grams can be unigrams like ("This", "article", "is", "on", "NLP") or bigrams
("This article", "article is", "is on", "on NLP").

N-gram Language Model

An N-gram language model predicts the probability of a given N-gram within any sequence of
words in a language. A well-crafted N-gram model can effectively predict the next word in a
sentence, which is essentially determining the value of p(w∣h), where h is the history or context
and w is the word to predict.

Let’s explore how to predict the next word in a sentence. We need to calculate p(w|h), where w is the candidate for the next word. Consider the sentence 'This article is on...'. If we want to calculate the probability of the next word being "NLP", the probability can be expressed as:

p("NLP"∣"This","article","is","on")p("NLP"∣"This","article","is","on")

To generalize, the conditional probability of the fifth word given the first four can be written as:

p(w5 | w1, w2, w3, w4), or more generally, p(wn | w1, w2, …, wn−1)

This is calculated using the chain rule of probability:

P(A | B) = P(A ∩ B) / P(B), and therefore P(A ∩ B) = P(A | B) · P(B)

Now generalize this to sequence probability:


P(X1, X2, …, Xn) = P(X1) · P(X2 | X1) · P(X3 | X1, X2) · … · P(Xn | X1, X2, …, Xn−1)

This yields:

P(w1, w2, w3, …, wn) = ∏i P(wi | w1, w2, …, wi−1)

By applying Markov assumptions, which propose that the future state depends only on the
current state and not on the sequence of events that preceded it, we simplify the formula:

P(wi | w1, w2, …, wi−1) ≈ P(wi | wi−k, …, wi−1)

For a unigram model (k=0), this simplifies further to:

P(w1, w2, …, wn) ≈ ∏i P(wi)

And for a bigram model (k=1):

P(wi | w1, w2, …, wi−1) ≈ P(wi | wi−1)
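In practice, these conditional probabilities are estimated from corpus counts. For a bigram model, the standard maximum-likelihood estimate (the usual textbook formulation, not something specific to this article) is:

P(wi | wi−1) = count(wi−1, wi) / count(wi−1)

The trigram implementation below applies the same idea, with a two-word context in place of the single word wi−1.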

Implementing N-Gram Language Modelling in NLTK

# Import necessary libraries
import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')

# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(words))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words
    using the trained trigram model.

    Args:
        w1 (str): The first word.
        w2 (str): The second word.

    Returns:
        str: The predicted next word.
    """
    next_word = model[(w1, w2)]
    if next_word:
        # Choose the most likely next word
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'stock'))


Output:

Next Word: of
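A natural extension, not part of the snippet above, is to chain predictions to generate a short phrase. A hypothetical helper built on predict_next_word might look like this:

# Hypothetical helper: repeatedly feed the last two words back into the
# trigram model to extend the sentence
def generate_text(w1, w2, length=5):
    words_out = [w1, w2]
    for _ in range(length):
        next_word = predict_next_word(words_out[-2], words_out[-1])
        if next_word == "No prediction available":
            break
        words_out.append(next_word)
    return ' '.join(words_out)

print(generate_text('the', 'stock'))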

Metrics for Language Modelling

 Entropy: Entropy, introduced by Claude Shannon, measures the average amount of information conveyed by a probability distribution. Below is the formula for entropy:

H(p) = ∑x p(x) · (−log(p(x)))

H(p) is always greater than or equal to 0.
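For intuition, here is a minimal sketch (not from the original text) that computes the entropy, in bits, of a few simple distributions:

import math

def entropy(p):
    # H(p) = sum of p(x) * (-log2 p(x)), skipping zero probabilities
    return sum(px * -math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))  # ~0.469 bits: a biased coin is more predictable
print(entropy([1.0]))       # 0.0 bits: a certain outcome carries no information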

 Cross-Entropy: It measures how well the trained model represents the test data, where each word wi is predicted from its history w1, …, wi−1. For a sequence of n words:

H(p) = (1/n) ∑i=1..n (−log2(p(wi | w1, …, wi−1)))

The cross-entropy is always greater than or equal to the entropy, i.e., the model's uncertainty can be no less than the true uncertainty.

 Perplexity: Perplexity is a measure of how well a probability distribution predicts a sample. It can be understood as a measure of uncertainty and is calculated as 2 raised to the cross-entropy:

Perplexity = 2^(Cross-Entropy)

Equivalently, perplexity is the inverse probability of the test set assigned by the language model, normalized by the number of words (shown here for a bigram model):

PP(W) = (∏i=1..N 1/P(wi | wi−1))^(1/N)

For Example:

 Let's take the example sentence 'Natural Language Processing'. For predicting the first word, suppose the candidate words have the following probabilities:

word P(word | <start>)

The 0.4

Processing 0.3

Natural 0.12

Language 0.18
 Now we know the probability of the first word being 'Natural' (0.12). Next, what is the probability of the word 'Language' appearing after the word 'Natural'?

word P(word | 'Natural' )

The 0.05

Processing 0.3

Natural 0.15

Language 0.5

 After getting the probability of generating the words 'Natural Language', what is the probability of 'Processing' appearing next?

word P(word | 'Language' )

The 0.1

Processing 0.7

Natural 0.1

Language 0.1

 Now, the perplexity can be calculated as:

PP(W) = (1 / (0.12 × 0.5 × 0.7))^(1/3) ≈ 2.876

 From that we can also recover the cross-entropy (since perplexity = 2^(Cross-Entropy)):

Cross-Entropy = log2(2.876) ≈ 1.524
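Both numbers are easy to verify in a few lines of Python, using the probabilities assumed in the tables above:

import math

# Chained probabilities for "Natural Language Processing"
p_natural = 0.12     # P(Natural | <start>)
p_language = 0.5     # P(Language | Natural)
p_processing = 0.7   # P(Processing | Language)

sequence_prob = p_natural * p_language * p_processing
perplexity = (1 / sequence_prob) ** (1 / 3)  # N = 3 words

print(perplexity)             # ≈ 2.877 (the text rounds this to 2.876)
print(math.log2(perplexity))  # ≈ 1.524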

Shortcomings:

 To capture more context, we need higher values of n, but this also increases the computational overhead.
 Increasing the value of n also leads to data sparsity, since longer n-grams occur less frequently in the training corpus.
