1. Introduction
Transformers are a deep learning architecture introduced in 2017 by Vaswani et al.
in the paper “Attention Is All You Need”.
They revolutionized Natural Language Processing (NLP) by replacing recurrence
(RNNs/LSTMs) with self-attention.
Today, they are the foundation of models such as BERT, GPT, T5, and LLaMA, and are
also used in vision, speech, and multimodal tasks.
2. Key Idea
Instead of processing tokens sequentially (like RNNs), Transformers process them in
parallel.
The core mechanism is Attention, which lets the model decide which parts of the
input are most relevant to each token.
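To make the contrast concrete, here is a minimal NumPy sketch (the shapes and weight names are illustrative, not from the paper): an RNN must walk the sequence one step at a time, while a Transformer layer can transform every token with a single batched matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))   # one token vector per row

# RNN-style: a sequential loop; each step depends on the previous hidden state
W_h = rng.normal(size=(d_model, d_model))
W_x = rng.normal(size=(d_model, d_model))
h = np.zeros(d_model)
for t in range(seq_len):                  # cannot be parallelized across t
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Transformer-style: one matrix multiply touches all tokens at once
W = rng.normal(size=(d_model, d_model))
out = x @ W                               # (seq_len, d_model), fully parallel
```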
3. Transformer Architecture
3.1 Overall Structure
A Transformer has an Encoder–Decoder structure (like in seq2seq models), though
many modern models use only the encoder (e.g., BERT) or only the decoder (e.g.,
GPT).
Encoder: Processes the input sequence and produces contextual embeddings.
Decoder: Generates the output sequence, attending to the encoder outputs and to
previously generated tokens.
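As a quick illustration of the encoder-only vs. decoder-only split, the Hugging Face transformers library exposes both families through one interface (this assumes the library is installed and the checkpoints can be downloaded; it is not part of the original architecture):

```python
from transformers import AutoModel, AutoModelForCausalLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")  # BERT: encoder stack
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")    # GPT-2: decoder stack
```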
3.2 Components
Input Embeddings
Words/tokens are converted into vectors.
Position information is added using positional encoding, since without recurrence
the model has no inherent sense of token order (a sketch follows below).
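Here is a minimal NumPy sketch of the sinusoidal positional encoding used in the original paper; the function name and shapes are my own:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same).

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

# Added to the token embeddings before the first layer:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```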
Self-Attention
Each token looks at other tokens to understand context.
Uses Query (Q), Key (K), and Value (V) matrices.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where d_k is the key dimension.
Captures long-range dependencies efficiently (see the sketch below).
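The following NumPy sketch implements the formula above directly; the weight matrices and token count are illustrative, and in self-attention Q, K, and V are all projections of the same token representations:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                       # weighted average of the values

# Toy self-attention over 4 tokens with d_model = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)  # (4, 8)
```

Each output row is a mixture of all value vectors, so a token can draw on context arbitrarily far away in one step, which is why long-range dependencies are captured efficiently.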