Transformers in Machine Learning


Last Updated : 27 Feb, 2025

Transformers are a neural network architecture used for machine learning tasks, particularly in natural language processing (NLP) and computer vision. In 2017, Vaswani et al. published the paper "Attention Is All You Need", which introduced the transformer architecture. This article explores the architecture, workings and applications of transformers.

Need for Transformers in Machine Learning

The transformer architecture uses self-attention to process an entire sentence at once rather than step by step, which helps overcome the challenges seen in older models like RNNs and LSTMs. Traditional models such as RNNs (Recurrent Neural Networks) suffer from the vanishing gradient problem, which leads to loss of long-term memory. RNNs process text sequentially, meaning they analyze words one at a time.

For example, in the sentence "XYZ went to France in 2019 when there were no cases of COVID and there he met the president of that country", the phrase "that country" refers to "France". However, an RNN would struggle to link "that country" back to "France", since it processes each word in sequence and loses context over long sentences. This limitation prevents RNNs from understanding the full meaning of the sentence.

While adding more memory cells in LSTMs (Long Short-Term Memory networks) helped address the vanishing gradient issue, they still process words one by one. This sequential processing means LSTMs cannot analyze an entire sentence at once.

For instance, the word "point" has different meanings in these two sentences:

"The needle has a sharp point." (point = tip)
"It is not polite to point at people." (point = gesture)

Traditional models struggle with this context dependence, whereas the transformer, through its self-attention mechanism, processes the entire sentence in parallel, addressing these issues and making it significantly more effective at understanding context.

Architecture and Working of Transformers

1. Positional Encoding

Unlike RNNs, transformers lack an inherent understanding of word order since they process data in parallel. To solve this, positional encodings are added to the token embeddings, providing information about the position of each token within a sequence.
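
The original paper uses fixed sinusoidal encodings for this. A minimal NumPy sketch of that scheme (the function name and variable names here are our own):

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# The encodings are simply added to the token embeddings:
# embeddings = token_embeddings + positional_encoding(seq_len, d_model)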

2. Position-wise Feed-Forward Networks

The feed-forward network consists of two linear transformations with a ReLU activation in between. It is applied independently to each position in the sequence.

Mathematically:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

This transformation helps refine the encoded representation at each position.
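
As a concrete illustration, here is a minimal PyTorch sketch of this formula (the class name is our own; the sizes d_model = 512 and d_ff = 2048 are the defaults from the original paper):

import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position independently
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2

    def forward(self, x):   # x: (batch, seq_len, d_model)
        return self.linear2(torch.relu(self.linear1(x)))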

3. Attention Mechanism

The attention mechanism allows transformers to determine which words in a sentence are most relevant to each other. This is done using a scaled dot-product attention approach:

1. Each word in a sequence is mapped to three vectors:

Query (Q)
Key (K)
Value (V)

2. Attention scores are computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of the key vectors.

3. These scores determine how much attention each word should pay to
others.
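
This computation is short enough to write out directly. A minimal PyTorch sketch (masking, which real implementations also support, is omitted here):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # pairwise relevance scores
    weights = F.softmax(scores, dim=-1)                 # how much each word attends to the others
    return weights @ V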
Multi-Head Attention
Instead of using a single attention mechanism, transformers apply multi-head attention, where multiple attention layers run in parallel. This enables the model to capture different types of relationships within the input.
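
Frameworks ship this as a ready-made layer; for example, PyTorch provides nn.MultiheadAttention. A small self-attention usage sketch with a dummy input (8 heads over a 512-dimensional model, as in the original paper):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 512)    # dummy batch: (batch, seq_len, d_model)
out, weights = mha(x, x, x)    # self-attention: query = key = value = x
print(out.shape)               # torch.Size([1, 10, 512])
print(weights.shape)           # torch.Size([1, 10, 10]), averaged over the heads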

4. Encoder-Decoder Architecture

The encoder-decoder structure is key to transformer models. The encoder processes the input sequence into a sequence of contextual vectors, while the decoder converts this representation into an output sequence. Each encoder and decoder layer includes self-attention and feed-forward sublayers. In the decoder, an encoder-decoder attention layer is added to focus on relevant parts of the input.

For example, the French sentence "Je suis étudiant" is translated into "I am a student" in English.

The encoder consists of multiple layers (typically 6 layers). Each layer has two main components:

Self-Attention Mechanism – helps the model understand word relationships.
Feed-Forward Neural Network – further transforms the representation.

The decoder also consists of 6 layers, but with an additional encoder-decoder attention mechanism. This allows the decoder to focus on relevant parts of the input sentence while generating output.
For instance, in the sentence "The cat didn't chase the mouse, because it was not hungry", the word "it" refers to "the cat". The self-attention mechanism helps the model correctly associate "it" with "the cat", ensuring an accurate understanding of sentence structure.
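
PyTorch's built-in nn.Transformer mirrors the stacked structure described above (6 encoder and 6 decoder layers are its defaults). A hedged sketch in which random tensors stand in for real token embeddings:

import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 3, 512)   # embedded source, e.g. "Je suis étudiant"
tgt = torch.randn(1, 4, 512)   # embedded target generated so far, e.g. "I am a student"
out = model(src, tgt)          # decoder output: (1, 4, 512)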

Applications of Transformers
Some of the applications of transformers are:

1. NLP Tasks: Transformers are used for machine translation, text summarization, named entity recognition and sentiment analysis.
2. Speech Recognition: They process audio signals to convert speech into transcribed text.
3. Computer Vision: Transformers are applied to image classification, object detection and image generation.
4. Recommendation Systems: They provide personalized recommendations based on user preferences.
5. Text and Music Generation: Transformers are used for generating text (e.g., articles) and composing music.

Transformers have redefined deep learning across NLP, computer vision and beyond. With advancements like BERT, GPT and Vision Transformers (ViTs), they continue to push the boundaries of AI in language understanding and multimodal learning.