Transformer Notes

Transformers, introduced in 2017, revolutionized NLP by utilizing self-attention instead of recurrence, forming the basis for models like BERT and GPT. They process tokens in parallel, allowing for efficient context understanding through attention mechanisms. The architecture consists of an Encoder-Decoder structure, with modern models often using only the encoder or decoder for specific tasks.


1. Introduction

Transformers are a deep learning architecture introduced in 2017 by Vaswani et al. in the paper “Attention Is All You Need”.

They revolutionized Natural Language Processing (NLP) by replacing recurrence (RNNs/LSTMs) with self-attention.

Today, they are the foundation of models like BERT, GPT, T5, LLaMA, etc., and are
also used in vision, speech, and multimodal tasks.

2. Key Idea

Instead of processing tokens sequentially (like RNNs), Transformers process them in parallel.

The core mechanism is Attention, which lets the model decide which parts of the
input are most relevant to each token.

3. Transformer Architecture
3.1 Overall Structure

A Transformer has an Encoder–Decoder structure (like in seq2seq models), though many modern models use only the encoder (e.g., BERT) or only the decoder (e.g., GPT).

Encoder: Processes input sequence and creates contextual embeddings.

Decoder: Generates output sequence, using encoder outputs + attention over previously generated tokens.
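As a rough illustration (not from the original notes), the sketch below contrasts the two attention patterns this split implies: an encoder (BERT-style) lets every token attend to every token, while a decoder (GPT-style) applies a causal mask so each token attends only to itself and earlier positions. NumPy is assumed and the helper names are made up.

    import numpy as np

    def encoder_attention_mask(seq_len):
        # Encoder: every token may attend to every token (all True).
        return np.ones((seq_len, seq_len), dtype=bool)

    def decoder_attention_mask(seq_len):
        # Decoder (causal): token i may attend only to tokens 0..i (lower triangle).
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    print(encoder_attention_mask(4).astype(int))
    print(decoder_attention_mask(4).astype(int))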

3.2 Components

Input Embeddings

Words/tokens are converted into vectors.

Position information is added using positional encoding (since no recurrence exists).
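A minimal sketch (not in the original notes) of the sinusoidal positional encoding from “Attention Is All You Need”: even dimensions get a sine, odd dimensions a cosine, at wavelengths that grow with the dimension index. The resulting matrix is simply added to the token embeddings. NumPy is assumed.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
        dims = np.arange(d_model)[None, :]               # (1, d_model)
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                 # (seq_len, d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
        return pe

    # Token embeddings (random placeholders here) plus position information:
    embeddings = np.random.randn(10, 16)
    encoder_input = embeddings + positional_encoding(10, 16)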

Self-Attention

Each token looks at other tokens to understand context.

Uses Query (Q), Key (K), and Value (V) matrices.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V, where dₖ is the key dimension.

Captures long-range dependencies efficiently.
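A minimal NumPy sketch of the attention formula above (single head, no masking or batching). In a real Transformer, Q, K, and V are produced by learned linear projections of the token embeddings; here they are fed in directly.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)           # relevance of every token to every token
        weights = softmax(scores, axis=-1)        # each row sums to 1
        return weights @ V                        # weighted sum of value vectors

    # Toy self-attention: 4 tokens of dimension 8, with Q = K = V = the same embeddings.
    X = np.random.randn(4, 8)
    out = scaled_dot_product_attention(X, X, X)
    print(out.shape)                              # (4, 8)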
