
Commit b0ffd58

Updated Text Representations and LLMs
1 parent 4e621f8 commit b0ffd58

2 files changed: +92 −45 lines


Module 9 - GenAI (LLMs and Prompt Engineering)/1. Text Embeddings/Text Representation (Embeddings).ipynb

Lines changed: 24 additions & 11 deletions
@@ -874,7 +874,7 @@
},
{
"cell_type": "markdown",
- "id": "850957d0-95ea-4740-9e7b-06a4272a5874",
+ "id": "e47ead76-26d8-4020-a0e0-4accdc53696f",
"metadata": {},
"source": [
"## **Various Feature Representation Techniques**\n",
@@ -885,35 +885,48 @@
" - Drawbacks: They are discrete representations, the vector representations are sparse and high-dimensional, and they cannot handle OOV words.\n",
"- **Distributed Representations**\n",
" - Eg: Word Embeddings (Word2Vec, GloVe, fastText), Document Embeddings (Doc2Vec)\n",
- " - Text embeddings are a way to represent words or phrases as vectors in a high-dimensional space based on their contextual meaning within a corpus of text data. The idea is that if two phrases are similar then the vectors that represent those phrases should be close together and vice versa.\n",
- " - Word2Vec Arcitecture - Continuous Bag of Words (CBOW) and SkipGram\n",
+ " - Text embeddings are a way to represent words or phrases as vectors in a high-dimensional space based on their semantic meaning within a corpus of text data. The idea is that if two phrases are similar, then the vectors that represent those phrases should be close together and vice versa.\n",
+ " - Word2Vec Architectures for Training - Continuous Bag of Words (CBOW), SkipGram, and SkipGram with Negative Sampling\n",
" - Word2Vec doesn't have a good way of handling OOV words\n",
" - **Handling the OOV words problem:** One way is to modify the training process to bring in characters and other sub-word linguistic components such as morphological properties (e.g., prefixes, suffixes, word endings, etc.). **FastText** from Facebook follows this approach.\n",
- " - Drawbacks: Above techniques only provide word embeddings. Inorder to get document embedding, we can aggregate the word embeddings to get document embeddings. But, for sentences \"dog bites man\" and \"man bites dog\", both will receive same representation.\n",
" - Doc2Vec: Based on the Paragraph Vector framework. The neural network used to learn Doc2Vec embeddings is very similar to the CBOW and SkipGram architectures of Word2Vec.\n",
" - Doc2Vec Architectures - Distributed Memory (DM) and Distributed Bag of Words (DBOW)\n",
+ " - Drawbacks: The above techniques only provide word embeddings. In order to get a document embedding, we can aggregate the word embeddings. But for the sentences \"dog bites man\" and \"man bites dog\", both will receive the same representation.\n",
+ " - To facilitate the learning of distributed representations for sequential data, a change in training architecture was necessary. This is where **Recurrent Neural Networks (RNNs)** come into play.\n",
+ " - RNNs learn distributed representations of sequential data by processing input sequences one token at a time. RNNs are capable of capturing long-term dependencies in sequential data, making them effective for tasks such as language modeling, machine translation, sentiment analysis, and text generation.\n",
+ " - Variants of RNNs, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), address the vanishing gradient problem and improve the ability of RNNs to capture long-range dependencies in text data.\n",
"- **Universal Language Representation**\n",
- " - Problem in all the above approaches: One word gets one fixed representation\n",
- " - Eg: \"I went to bank to withdraw money\" and \"I sat on the river bank\" both uses the word \"bank\"\n",
- " - In 2018 - Researchers came with **Contextual Word Representations**, which addresses the above problem\n",
+ " - Problems with the above approaches:\n",
+ " - One word gets one fixed representation. Eg: \"I went to the bank to withdraw money\" and \"I sat on the river bank\" both use the word \"bank\"\n",
+ " - Handling long-term dependencies in extremely long sequences\n",
+ " - Computationally expensive to train\n",
+ " - Slow to train due to sequential training\n",
+ " - Solutions:\n",
+ " - In 2017 - the **[Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)** paper introduced by Google solves the \"sequential training\" and \"long-term dependencies\" problems of earlier architectures by removing the need for RNN cells completely.\n",
+ " - In 2018 - Researchers from the University of Washington came up with **[Contextual Word Representations](https://arxiv.org/pdf/1802.05365.pdf)**, which address the above problem of \"one word gets one fixed representation\".\n",
" - **Remember:** Recently, Contextual Word Representations are learned by using the word embeddings we discussed earlier (like Word2Vec) and training on a **language modeling** task using complex neural architectures (like RNNs and Transformers).\n",
" - **Language Modeling:** It is the task of predicting the next likely word in a sequence of words. In its earliest form, it used the idea of n-gram frequencies to estimate the probability of the next word given a history of words.\n",
" - **Key Idea:** Learn embeddings on a generic task like language modeling on a massive corpus and then fine-tune on task-specific data. This is also known as **transfer learning**.\n",
" - **How to decide whether to train our own embeddings or use pre-trained embeddings?** - A good rule of thumb is to compute the vocabulary overlap. If the overlap between the vocabulary of our custom domain and that of the pre-trained word embeddings is significant, pre-trained word embeddings tend to give good results.\n",
" - **One more important factor to consider while deploying models with an embeddings-based feature extraction approach:** - Remember that learned or pre-trained embedding models have to be stored and loaded into memory while using these approaches. If the model itself is bulky, we need to factor this into our deployment needs.\n",
"- **Handcrafted Features**\n",
" - These features have to be designed manually, keeping in mind both the domain knowledge and the ML algorithms used to train the NLP models.\n",
- " - Custom feature engineering is much more difficult to formulate compared to other feature engineering schemes we've seen so far. \n",
- "\n",
- "\n",
+ " - Custom feature engineering is much more difficult to formulate compared to the other feature engineering schemes we've seen so far."
+ ]
+ },
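To ground the Word2Vec / fastText / document-embedding discussion in the cell above, here is a minimal sketch using gensim (an assumption, since the excerpt does not show which library the notebook uses; gensim >= 4.0 API). The toy corpus, hyperparameters, and the hypothetical domain vocabulary in the overlap check are illustrative only.

```python
# Minimal sketch of word embeddings, OOV handling, naive document embeddings,
# and the vocabulary-overlap rule of thumb. Assumes `pip install gensim numpy`.
import numpy as np
from gensim.models import Word2Vec, FastText

corpus = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# Word2Vec: sg=0 -> CBOW, sg=1 -> SkipGram; `negative` enables negative sampling.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1,
               sg=1, negative=5, epochs=50)
print(w2v.wv.most_similar("dog", topn=2))   # nearest neighbours in embedding space
# w2v.wv["puppy"]                           # KeyError: Word2Vec cannot handle OOV words

# fastText builds vectors from character n-grams, so OOV words still get a vector.
ft = FastText(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)
print(ft.wv["puppy"].shape)                 # works even though "puppy" never appeared

# Naive document embedding: average the word vectors. Word order is lost, so
# "dog bites man" and "man bites dog" collapse to the same vector.
doc_vec = np.mean([w2v.wv[w] for w in ["dog", "bites", "man"]], axis=0)

# Rule of thumb before reusing pre-trained embeddings: check vocabulary overlap.
domain_vocab = {"dog", "man", "bites", "scan", "biopsy"}   # hypothetical domain vocabulary
pretrained_vocab = set(w2v.wv.key_to_index)                # stand-in for a pre-trained vocab
overlap = len(domain_vocab & pretrained_vocab) / len(domain_vocab)
print(f"vocabulary overlap: {overlap:.0%}")
```

If the overlap is high, starting from pre-trained vectors is usually the better trade-off; otherwise training domain-specific embeddings (as above) tends to pay off.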
+ {
+ "cell_type": "markdown",
+ "id": "fd17f5d4-615f-4830-b3f3-b7a8060da0c8",
+ "metadata": {},
+ "source": [
"## **What is Language Modeling?**\n",
"1. Language Modeling involves the creation of statistical/deep learning models for predicting the likelihood of a sequence of tokens in a specified vocabulary.\n",
"2. The two types of Language Modeling tasks are: \n",
" a. Autoencoding Task \n",
" b. Autoregressive Task \n",
"3. **Autoregressive Language Models** are trained to predict the next token in a sentence, based on the previous tokens in the phrase. These models correspond to the **decoder** part of the transformer model. A mask is applied to the full sentence so that the attention head can only see the tokens that came before. These models are ideal for text generation. For eg: **GPT**\n",
"4. **Autoencoding Language Models** are trained to reconstruct the original sentence from a corrupted version of the input. These models correspond to the **encoder** part of the transformer model. The full input is passed. No mask is applied. Autoencoding models create a bidirectional representation of the whole sentence. They can be fine-tuned for a variety of tasks, but their main application is sentence classification or token classification. For eg: **BERT**\n",
- "5. **Combination of autoregressive and autoencoding language models** are more versatile and flexible in generating text. It has been shown that the combination models can generate more diverse and creative text in different context compared to pure decode-based autoregressive models due to their ability to capture additional context using the encoder. For eg: **T5**\n"
+ "5. **Combinations of autoregressive and autoencoding language models** are more versatile and flexible in generating text. It has been shown that these combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models, due to their ability to capture additional context using the encoder. For eg: **T5**"
]
},
{
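As a rough illustration of the autoregressive / autoencoding / encoder-decoder split described in the Language Modeling cell above, the sketch below uses Hugging Face transformers pipelines. The specific checkpoints (gpt2, bert-base-uncased, t5-small) and prompts are illustrative assumptions, not necessarily what the notebook itself uses; the models are downloaded on first run.

```python
# Sketch: three language-modeling flavours via Hugging Face pipelines.
# Assumes `pip install transformers torch`.
from transformers import pipeline

# Autoregressive (decoder-only, GPT-style): predict the next tokens left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("Text embeddings represent words as", max_new_tokens=20)[0]["generated_text"])

# Autoencoding (encoder-only, BERT-style): recover a masked token using
# bidirectional context over the full input.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Text embeddings represent words as [MASK].")[0]["token_str"])

# Encoder-decoder (T5-style): an encoder over the whole input feeds an
# autoregressive decoder, combining both ideas.
t2t = pipeline("text2text-generation", model="t5-small")
print(t2t("translate English to German: The bank was closed.")[0]["generated_text"])
```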
