
Commit 9766dbc

Update Introduction to LLMs.ipynb
1 parent 1791389 commit 9766dbc

File tree

1 file changed: +101 additions, -31 deletions

Module 9 - GenAI (LLMs and Prompt Engineering)/2. Intro to LLMs/Introduction to LLMs.ipynb

Lines changed: 101 additions & 31 deletions
@@ -8,36 +8,50 @@
88
"# **Introduction to LLMs**\n",
99
"\n",
1010
"## **What's Covered**\n",
11-
"1. A little bit about Transformers\n",
11+
"1. An Era Before Transformers\n",
1212
"2. Attention is all you need\n",
13-
"3. What is Language Modeling?\n",
14-
"4. What are LLMs?\n",
15-
"5. Pre-Training, Transfer Learning and Fine-Tuning\n",
16-
"6. Popular Modern LLMs\n",
13+
"3. A little bit about Transformers\n",
14+
"4. Advantages of Transformers\n",
15+
"5. Disadvantages of Transformers\n",
16+
"6. What is Language Modeling?\n",
17+
"7. What are LLMs?\n",
18+
"8. Pre-Training, Transfer Learning and Fine-Tuning\n",
19+
"9. Popular Modern LLMs\n",
1720
" - BERT\n",
1821
" - GPT\n",
1922
" - T5\n",
2023
" - Domain Specific LLMs\n",
21-
"7. Prompt Engineering\n",
22-
"8. Applications\n",
23-
"9. Quick Summary"
24+
"10. Prompt Engineering\n",
25+
"11. Applications\n",
26+
"12. Quick Summary\n",
27+
"13. What Next? How to use LLMs?"
2428
]
2529
},
2630
{
2731
"cell_type": "markdown",
28-
"id": "686df7bd-ff4d-4a6e-ad8d-c6ad674dc567",
32+
"id": "9c515617-57f8-42f5-b1b1-13406bb8e8e3",
2933
"metadata": {},
3034
"source": [
31-
"## **A little bit about Transformers:**\n",
32-
"<img style=\"float: right;\" width=\"400\" height=\"400\" src=\"data/images/transformer.jpeg\">\n",
35+
"## **An Era Before Transformers**\n",
3336
"\n",
34-
"1. Sequence to Sequence Model.\n",
35-
"2. Has two main components: Encoder and Decoder\n",
36-
"3. An **encoder** which is tasked with taking in raw text, splitting them up into its core components, convert them into vectors and using **self-attention** to understand the context of the text.\n",
37-
"4. A **decoder** excels at generating text by using a modified type of attention (i.e. **cross attention**) to predict the next best token.\n",
38-
"5. Transformers are **trained** to solve a specific NLP task called as **Language Modeling**.\n",
39-
"6. **Why not RNNs? -** Transformer's self attention mechanism allows each word to \"attend to\" all other words in the sequence which enables it to capture long-term dependencies and contextual relationships between words. The goal is to understand each word as it relates to the other tokens in the input text.\n",
40-
"7. **Limitation:** Transformers are still limited to an input context window (i.e. maximum length og text it can process at any given moment)"
37+
"1. **2013 and before:** Various Neural Network Architectures like ANN, CNN and RNN became very popular. They use to work well for tabular data, image data and sequential data like text respectively.\n",
38+
"2. **[(2014) Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)** paper introduced the concept of **Encoder-Decoder Architecture** to solve a seq2seq task, like machine translation.\n",
39+
" - The paper introduces Seq2Seq models, which are neural network architectures designed for mapping input sequences to output sequences. Unlike traditional models that rely on fixed-length input-output mappings, Seq2Seq models can handle variable-length sequences, making them suitable for tasks such as machine translation, summarization, and question answering.\n",
40+
" - The core of the Seq2Seq model is the encoder-decoder architecture. The encoder processes the input sequence while maintaining the hidden state and generates a fixed-length representation, often referred to as a context vector. This context vector encapsulates the representation of the whole sentence.\n",
41+
" - The decoder then uses this representation to generate the output sequence one token at a time.\n",
42+
" - Both encoder and decoder used RNN/LSTM cells due to their ability to capture sequential dependencies.\n",
43+
" - This architecture used to work well with smaller sentence.\n",
44+
" - **The Problem:** While it could handle variable-length input and output sequences, it used to rely on generating a single fixed-length context vector for the entire input sequence, which can lead to information loss, especially for longer sequences.\n",
45+
"3. **[(2015) Neural Machine Translation by Joint Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)** paper introduced the concept of **Attention Mechanism** to solve the above problem.\n",
46+
" - Unlike traditional NMT models that encode the entire source sentence into a fixed-length context vector, the attention mechanism allows the model to focus on different parts of the source sentence dynamically while generating the translation.\n",
47+
" - Attention Mechanism also addressed the problem of learning alignment between input and output sequences, enables the model to weigh the importance of each word in the source sentence differently during translation. By dynamically adjusting the attention weights, the model can focus more on relevant words and ignore irrelevant ones, leading to more accurate translations.\n",
48+
" - At each timestamp of the decoder, the dynamically calculated context vector indicates which timestamps of the encoder sequence are expected to have the most influence on the current decoding step of the decoder.\n",
49+
" - In simple terms, context vector will be the weighted sum of encoders hidden state. And these weights are called as **attention weights**.\n",
50+
" - The attention mechanism has improved, the quality of translation on long input sentences. But it was not able to solve a huge fundamental flaw i.e. sequential training.\n",
51+
" - **The Problem:** Since the architecture relies on LSTM units, a notable challenge arises due to the sequential nature of training. Specifically, only one token can be processed at a time as input to the encoder, leading to slow training times. Consequently, it becomes impractical to train the model efficiently with large datasets. This limitation inhibits the application of techniques like transfer learning, which typically involve leveraging pretrained models on large datasets to improve performance on new tasks. Additionally, fine-tuning, which involves further training pretrained models on task-specific data, is also hindered by the slow training process in this architecture.\n",
52+
" - Now because of the above problem, for any task which we are suppose to solve, we have to train the model from scratch. And it takes a huge amount of time, efforts and data.\n",
53+
" - **Transfer Learning:** Transfer learning involves leveraging knowledge gained from solving one problem and applying it to a different, but related, problem.\n",
54+
" - **Fine-Tuning:** Fine-tuning, on the other hand, refers to the process of taking a pretrained model and further training it on task-specific data to adapt it to a particular problem or domain. This typically involves adjusting the parameters of the pretrained model to better suit the new task while retaining the knowledge learned from the original training."
4155
]
4256
},
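The weighted-sum idea behind the attention mechanism described above can be made concrete with a few lines of NumPy. This is only a minimal sketch, not the full formulation from the paper: the alignment scores are taken as plain dot products between a decoder hidden state and the encoder hidden states, and the variable names (`encoder_states`, `decoder_state`) are illustrative, not from the notebook.

```python
import numpy as np

# Toy encoder hidden states: 4 timesteps, hidden size 3 (illustrative values)
encoder_states = np.array([[0.1, 0.3, 0.2],
                           [0.5, 0.1, 0.4],
                           [0.2, 0.2, 0.9],
                           [0.7, 0.6, 0.1]])

# Current decoder hidden state (one decoding step)
decoder_state = np.array([0.4, 0.2, 0.6])

# 1. Alignment scores: how relevant is each encoder timestep to this step?
scores = encoder_states @ decoder_state          # shape: (4,)

# 2. Attention weights: softmax turns the scores into a distribution
weights = np.exp(scores) / np.exp(scores).sum()  # shape: (4,), sums to 1

# 3. Context vector: weighted sum of the encoder hidden states
context = weights @ encoder_states               # shape: (3,)

print("attention weights:", weights.round(3))
print("context vector:  ", context.round(3))
```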
4357
{
@@ -46,8 +60,72 @@
4660
"metadata": {},
4761
"source": [
4862
"## **Attention is all you need**\n",
49-
"1. **Attention** is a mechanism that assigns different weights to different parts of the input allowing the model to prioritize and emphasize the most important information while performing tasks like translation or summarization.\n",
50-
"2. Attention allows a model to focus on different parts of the input dynamically, leading to improved performance."
63+
"\n",
64+
"**[(2017) Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)** paper introduced by Google which solves the sequential training problem of earlier architecture by removing the need of RNN cells completely.\n",
65+
"1. It has the encoder-decoder architecture.\n",
66+
"2. It relies solely on self-attention mechanisms and feed-forward neural networks.\n",
67+
"3. Recall that **Attention** is a mechanism that assigns different weights to different parts of the input allowing the model to prioritize and emphasize the most important information while performing tasks like translation or summarization. Attention allows a model to focus on different parts of the input dynamically, leading to improved performance.\n",
68+
"4. **Self-Attention Mechanism:** The key innovation of the Transformer is the self-attention mechanism, which allows each word in the input sequence to attend to all other words in the sequence. This enables capturing global dependencies and alleviates the need for recurrent connections.\n",
69+
"5. **Positional Encoding:** To retain positional information of words in the input sequence without using recurrence, the model introduces positional encodings. These encodings are added to the input embeddings to provide information about the position of each word in the sequence.\n",
70+
"6. **Multi-Head Attention:** The Transformer employs multi-head attention mechanisms, where attention is computed multiple times in parallel with different learned linear projections. This allows the model to focus on different parts of the input sequence simultaneously, enhancing its ability to capture diverse patterns.\n",
71+
"7. **Parallelization and Scalability:** By relying on self-attention mechanisms and feed-forward layers, the Transformer architecture facilitates parallelization of computation across different parts of the input sequence. This results in faster training times and better scalability compared to traditional recurrent models."
72+
]
73+
},
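To make points 4-6 concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, softmax(QK^T / sqrt(d_k))V. The random projection matrices stand in for learned weights, and the sizes are arbitrary; multi-head attention simply runs several such heads in parallel and concatenates their outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy input: 5 tokens with embedding size 8 (random stand-ins for real embeddings)
seq_len, d_model, d_k = 5, 8, 8
X = rng.normal(size=(seq_len, d_model))

# Learned projections (random here) map the same input to queries, keys, values
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: every token attends to every other token
attn_weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len), rows sum to 1
output = attn_weights @ V                        # (seq_len, d_k)

print(attn_weights.round(2))
```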
74+
{
75+
"cell_type": "markdown",
76+
"id": "fe8b6f3b-e318-4607-baea-878ecd6fef78",
77+
"metadata": {},
78+
"source": [
79+
"## **A little bit about Transformers**\n",
80+
"\n",
81+
"<img style=\"float: right;\" width=\"400\" height=\"400\" src=\"data/images/transformer.jpeg\">\n",
82+
"\n",
83+
"1. Introduced by Google in the year 2017\n",
84+
"2. Transformer is a Sequence to Sequence Model which was proposed initially to solve the task of Machine Translation\n",
85+
"3. Has two main components: Encoder-Decoder and Attention Mechanism\n",
86+
"4. An **encoder** which is tasked with taking in raw text, splitting them up into its core components, convert them into vectors and using **self-attention** to understand the context of the text.\n",
87+
"5. A **decoder** excels at generating text by using a modified type of attention (i.e. **cross attention**) to predict the next best token.\n",
88+
"6. Transformers revolutionized NLP by enabling highly scalable training. By leveraging parallel computation and efficient self-attention mechanisms, the Transformer architecture allows for training on massive datasets with unprecedented efficiency. This scalability laid the foundation for the concept of **Transfer Learning** in NLP. Subsequent models such as BERT, GPT, and T5 were developed, leveraging pre-trained Transformer-based architectures that could be easily **fine-tuned** for a wide range of NLP tasks, further advancing the field of natural language processing.\n",
89+
"7. Transformers are **trained** to solve a specific NLP task called as **Language Modeling**.\n",
90+
"8. **Why not RNNs? -** RNN units can become a bottleneck due to sequential training. Due to parallel training capabilities and self attention mechanism of transformer, it allows each word to \"attend to\" all the other words in the sequence which enables it to capture long-term dependencies and contextual relationships between words at scale. The goal is to understand each word as it relates to the other tokens in the input text.\n",
91+
"9. **Limitation:** Transformers are still limited to an input context window (i.e. maximum length of text it can process at any given moment)\n",
92+
"10. Timeline\n",
93+
" - Till 2013 - RNN/LSTMs/GRU\n",
94+
" - 2014 - Seq2seq tasks using Encoder-Decoder architecture\n",
95+
" - 2015 - Attention Mechanism\n",
96+
" - 2017 - Transformers\n",
97+
" - 2018 - BERT by Google / GPT by OpenAI\n",
98+
" - 2019 - T5 by Google\n",
99+
" - 2020 - Stable Diffusion / GPT3\n",
100+
" - 2021 - DALL-E / Github Copilot\n",
101+
" - 2022 - ChatGPT"
102+
]
103+
},
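Point 7 above says Transformers are trained on language modeling, i.e. predicting the next token. A quick way to see this in action is the Hugging Face `transformers` library; the sketch below assumes the package is installed and the small `gpt2` checkpoint can be downloaded, and is purely illustrative rather than part of the original notebook.

```python
from transformers import pipeline

# A decoder-only Transformer (GPT-2) trained with causal language modeling:
# it repeatedly predicts the next most likely token given the text so far.
generator = pipeline("text-generation", model="gpt2")

result = generator("Transformers were introduced by Google in", max_new_tokens=20)
print(result[0]["generated_text"])
```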
104+
{
105+
"cell_type": "markdown",
106+
"id": "33e00c8e-37b5-4f8a-9122-d948b9dd5bee",
107+
"metadata": {},
108+
"source": [
109+
"## **Advantages of Transformers**\n",
110+
"1. Parallel Training and Scalable\n",
111+
"2. Transfer Learning\n",
112+
"3. Multimodal Input and Output\n",
113+
"4. Flexible Architecture: Encoder only transformer models like BERT, Decoder only transformer like GPT and Encode-Decoder based model like T5.\n",
114+
"5. Ecosystem: HuggingFace, OpenAI, Cohere, etc..."
115+
]
116+
},
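The "Flexible Architecture" point can be illustrated with the Hugging Face `transformers` Auto classes, which expose all three Transformer variants behind a similar API. A minimal sketch, assuming `transformers` is installed and the checkpoints can be downloaded:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only (BERT): contextual embeddings, well suited to classification/NER
bert = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only (GPT-2): causal language model, well suited to text generation
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder (T5): sequence-to-sequence model, well suited to translation/summarization
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

print(type(bert).__name__, type(gpt2).__name__, type(t5).__name__)
```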
117+
{
118+
"cell_type": "markdown",
119+
"id": "206b0961-1a21-437f-914c-5ae1e5991638",
120+
"metadata": {},
121+
"source": [
122+
"## **Disadvantages of Transformers**\n",
123+
"1. Needs high computational resources like space and GPUs\n",
124+
"2. Huge amount of Data is required to train a model using transformers\n",
125+
"3. Overfitting\n",
126+
"4. Energy Consumptions\n",
127+
"5. Interpretation\n",
128+
"6. Biasness due to data and Ethical Concerns"
51129
]
52130
},
53131
{
@@ -230,9 +308,9 @@
230308
" - Zero-Shot Classification\n",
231309
"2. Given the task, what model(s) work for that task?\n",
232310
"\n",
233-
"**Example:**\n",
234-
"> **Business Problem:** Generate a news feed for an app so that users can scroll through\n",
235-
"> **Mapping to a NLP task:** Given news article, a standard NLP task is to summarize\n",
311+
"**Example:** \n",
312+
"> **Business Problem:** Generate a news feed for an app so that users can scroll through \n",
313+
"> **Mapping to a NLP task:** Given news article, a standard NLP task is to summarize \n",
236314
"\n",
237315
"Now before we get into how to solve problems like above, a quick note on NLP ecosystem:\n",
238316
"\n",
@@ -246,14 +324,6 @@
246324
"| **Spark NLP** | Scale-out, production-grade NLP |\n",
247325
"| **LangChain** | LLM Workflows |"
248326
]
249-
},
250-
{
251-
"cell_type": "code",
252-
"execution_count": null,
253-
"id": "3dfbe827-1705-48c0-8b0e-c17a2cee53a3",
254-
"metadata": {},
255-
"outputs": [],
256-
"source": []
257327
}
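Tying the example above to the ecosystem table: once the business problem is mapped to summarization, the Hugging Face `pipeline` API is often the quickest way to try it out. The sketch below is illustrative only; the checkpoint name and the sample article are assumptions, not taken from the notebook.

```python
from transformers import pipeline

# Summarization is a seq2seq task, so an encoder-decoder checkpoint is used here
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = (
    "The central bank announced on Tuesday that it would hold interest rates steady, "
    "citing cooling inflation and a resilient labour market. Analysts had been split "
    "on whether a further hike was coming before the end of the year."
)

summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```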
258328
],
259329
"metadata": {
