Commit f3322d1

Commit message: updating
1 parent: e9fea6b

File tree

3 files changed: +400, -7 lines changed

LlaMa-2/LlaMa-2-FineTuning.ipynb

Lines changed: 23 additions & 7 deletions
@@ -21,20 +21,36 @@
 "metadata": {},
 "source": [
 "# From source. For running the examples in the repository\n",
-"# Clone the repository and install it with pip:\n",
-"\n",
-"git clone https://github.com/lvwerra/trl.git\n",
-"cd trl/\n",
-"pip install .\n",
+"# Clone the repository and install it with pip:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!git clone https://github.com/lvwerra/trl.git\n",
 "\n",
+"%cd trl/\n",
 "\n",
-"python examples/scripts/sft_trainer.py \\\n",
+"!pip install ."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"```py\n",
+"!python examples/scripts/sft_trainer.py \\\n",
 " --model_name meta-llama/Llama-2-7b-hf \\\n",
 " --dataset_name timdettmers/openassistant-guanaco \\\n",
 " --load_in_4bit \\\n",
 " --use_peft \\\n",
 " --batch_size 4 \\\n",
-" --gradient_accumulation_steps 2"
+" --gradient_accumulation_steps 2\n",
+"\n",
+"```"
 ]
 }
 ],
Lines changed: 323 additions & 0 deletions
@@ -0,0 +1,323 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"🚀 **The all-important compute_metrics() method used in every LLM (Large Language Model) fine-tuning project** 🚀🔥\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---------------------\n",
"\n",
"👉 **How does `tokenizer.batch_decode` really work?**\n",
"\n",
"👉 The `tokenizer.batch_decode` method is a part of the tokenizer object provided by Hugging Face's transformers library. It's used to convert tokenized sequences back into human-readable text, performing the inverse operation of the tokenization process.\n",
"\n",
"👉 When you tokenize a piece of text, the tokenizer converts each word or subword into a numerical representation, known as a token. For example, the sentence \"I love AI\" might be tokenized into something like `[72, 1801, 1789]`. The tokenizer maintains a mapping from these numerical tokens back to the original words or subwords.\n",
"\n",
"👉 The `tokenizer.batch_decode` function takes as input a batch of these token sequences (each sequence is a list of integers) and decodes them into their original textual form. If the `skip_special_tokens` flag is set to `True`, it will ignore special tokens like padding, start of sentence, end of sentence, etc., when converting back to text.\n",
"\n",
"Here is a basic example:\n",
"\n",
"```python\n",
"from transformers import AutoTokenizer\n",
"\n",
"# Initialize a tokenizer\n",
"tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\")\n",
"\n",
"# Suppose we have the following tokenized input\n",
"input_ids = tokenizer.encode(\"Hello, I'm a data scientist.\", return_tensors='pt')\n",
"\n",
"print(f\"Tokenized Input: {input_ids}\")\n",
"# Output: tensor([[ 101, 7592, 1010, 1045, 1005, 1049, 1037, 2951, 7155, 1012, 102]])\n",
"\n",
"# Now, we decode it back to the original text\n",
"decoded_text = tokenizer.batch_decode(input_ids, skip_special_tokens=True)\n",
"\n",
"print(f\"Decoded Text: {decoded_text}\")\n",
"# Output: [\"hello, i'm a data scientist.\"]\n",
"```\n",
"\n",
"In this example, the tokenizer first encoded the input sentence into a sequence of tokens. Then, using the `batch_decode` method, the token sequence was decoded back to the original text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------------\n",
"\n",
"**Why does the code replace every instance of `-100` with `50256`?**\n",
"\n",
"👉 In the given code snippet, `-100` is a special token ID that is often used to represent tokens that should be ignored when computing the loss during model training. These might be extra padding tokens, or tokens that are not part of the target sequence in a sequence-to-sequence model.\n",
"\n",
"👉 During the process of training a language model, not all tokens in the output sequence are always relevant. For instance, we may have padded sequences to a particular length with irrelevant tokens, or we may want to mask certain tokens. In these scenarios, we assign a token ID of `-100` to those tokens we wish the model to ignore when it computes the loss.\n",
"\n",
"👉 However, when it's time to evaluate the model, the `-100` tokens cannot be decoded back into text, since `-100` is not a valid token ID. So, before decoding the sequence, these tokens must be replaced with a valid token ID.\n",
"\n",
"👉 The ID `50256` is chosen here because it corresponds to the padding token in the tokenizer being used. When `tokenizer.batch_decode` is called with the argument `skip_special_tokens=True`, it will automatically ignore these padding tokens, so they don't show up in the final decoded text. This is why any `-100` tokens are replaced with `50256` before decoding the predicted sequences.\n",
"\n",
"Here is a small pseudo-example to illustrate this process:\n",
"\n",
"```python\n",
"preds = [[1, 2, -100, 3], [2, -100, 1, 4]]\n",
"# These are the predicted sequences. -100 represents tokens to be ignored\n",
"\n",
"for idx in range(len(preds)):\n",
"    for idx2 in range(len(preds[idx])):\n",
"        if preds[idx][idx2] == -100:\n",
"            preds[idx][idx2] = 50256\n",
"\n",
"print(preds)\n",
"# Output: [[1, 2, 50256, 3], [2, 50256, 1, 4]]\n",
"# Now all -100 tokens are replaced with 50256 (padding token) and can be properly ignored when decoding\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--------------\n",
"\n",
"**What is the significance of the line below?**\n",
"\n",
"`labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)`\n",
"\n",
"👉 The `np.where` function works like this: `np.where(condition, x, y)`. It checks the `condition` for each element in the array. If the condition is `True`, it selects the corresponding element from `x`; if the condition is `False`, it selects the corresponding element from `y`.\n",
"\n",
"The condition is `labels != pad_tok`. It checks each element in the `labels` array and sees if it's not equal to `pad_tok`. `pad_tok` is the padding token, which is typically used to fill in sequences to a consistent length for batch processing. If the condition is `True` (the label is not a padding token), it keeps the original label. If the condition is `False` (the label is a padding token), it replaces that label with `tokenizer.pad_token_id`.\n",
"\n",
"👉 This operation ensures that all instances of the padding token in the labels are represented with the padding token ID specific to the tokenizer being used. This is important for the subsequent decoding of the labels into text, as the padding tokens will be correctly ignored during the decoding process.\n",
"\n",
"(Standalone sketches of this replacement and of the full decode-and-score pipeline appear after the file listing below.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compute_metrics(eval_preds):\n",
"    preds, labels = eval_preds\n",
"    if isinstance(preds, tuple):\n",
"        preds = preds[0]\n",
"    for idx in range(len(preds)):\n",
"        for idx2 in range(len(preds[idx])):\n",
"            if preds[idx][idx2] == -100:\n",
"                preds[idx][idx2] = 50256\n",
"    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)\n",
"    # Replace -100 in the labels as we can't decode them.\n",
"    labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)\n",
"    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
"\n",
"    # Some simple post-processing\n",
"    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)\n",
"\n",
"    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)\n",
"\n",
"    # The results of the Rouge metric are then multiplied by 100 and rounded to four decimal places.\n",
"    result = {k: round(v * 100, 4) for k, v in result.items()}\n",
"    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]\n",
"    result[\"gen_len\"] = np.mean(prediction_lens)\n",
"    return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"==============\n",
"\n",
"FULL SOURCE CODE\n",
"\n",
"=============="
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Source - https://github.com/artidoro/qlora/issues/157\n",
"import pandas as pd\n",
"import os\n",
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig\n",
"import torch\n",
"from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType, prepare_model_for_kbit_training\n",
"from transformers import DataCollatorForSeq2Seq\n",
"import evaluate\n",
"import nltk\n",
"import numpy as np\n",
"from nltk.tokenize import sent_tokenize\n",
"from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments\n",
"from datasets import Dataset, DatasetDict\n",
"import argparse\n",
"import pickle\n",
"import json\n",
"\n",
"parser = argparse.ArgumentParser(description='Options')\n",
"parser.add_argument('--dataset_dir', default='data', type=str, help=\"folder in which the dataset is stored\")\n",
"parser.add_argument('--output_dir', default=\"lora-instructcodet5p\", type=str, help=\"output directory for the model\")\n",
"parser.add_argument('--results_dir', default=\"results\", type=str, help=\"where the results should be stored\")\n",
"args = parser.parse_args()\n",
"\n",
"nltk.download(\"punkt\")\n",
"tokenized_dataset = DatasetDict.load_from_disk(args.dataset_dir)\n",
"# Metric\n",
"metric = evaluate.load(\"rouge\")\n",
"pad_tok = 50256\n",
"token_id = \"Salesforce/instructcodet5p-16b\"\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(token_id)\n",
"# helper function to postprocess text\n",
"def postprocess_text(preds, labels):\n",
"    preds = [pred.strip() for pred in preds]\n",
"    labels = [label.strip() for label in labels]\n",
"\n",
"    # rougeLSum expects newline after each sentence\n",
"    preds = [\"\\n\".join(sent_tokenize(pred)) for pred in preds]\n",
"    labels = [\"\\n\".join(sent_tokenize(label)) for label in labels]\n",
"\n",
"    return preds, labels\n",
"\n",
"def compute_metrics(eval_preds):\n",
"    preds, labels = eval_preds\n",
"    if isinstance(preds, tuple):\n",
"        preds = preds[0]\n",
"    for idx in range(len(preds)):\n",
"        for idx2 in range(len(preds[idx])):\n",
"            if preds[idx][idx2] == -100:\n",
"                preds[idx][idx2] = 50256\n",
"    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)\n",
"    # Replace -100 in the labels as we can't decode them.\n",
"    labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)\n",
"    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
"\n",
"    # Some simple post-processing\n",
"    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)\n",
"\n",
"    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)\n",
"    result = {k: round(v * 100, 4) for k, v in result.items()}\n",
"    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]\n",
"    result[\"gen_len\"] = np.mean(prediction_lens)\n",
"    return result\n",
"\n",
"def get_dict(predicts):\n",
"    d = {}\n",
"    for num in range(len(tokenized_dataset['test'])):\n",
"        pred = tokenizer.decode([n for n in predicts[0][num] if n!=50256 and n!=-100])[1:]\n",
"        d[num+1] = {'Question': tokenizer.decode([n for n in tokenized_dataset['test'][num]['input_ids'] if n!=50256]),\n",
"                    'Ground truth solution': tokenizer.decode([n for n in tokenized_dataset['test'][num]['labels'] if n!=50256]),\n",
"                    'Prediction': pred if pred else None}\n",
"    return d\n",
"\n",
"def find_all_linear_names(model):\n",
"    cls = torch.nn.Linear\n",
"    lora_module_names = set()\n",
"    for name, module in model.named_modules():\n",
"        if isinstance(module, cls):\n",
"            names = name.split('.')\n",
"            lora_module_names.add(names[0] if len(names) == 1 else names[-1])\n",
"\n",
"    if 'lm_head' in lora_module_names:\n",
"        lora_module_names.remove('lm_head')\n",
"    return list(lora_module_names)\n",
"\n",
"\n",
"def main():\n",
"    device = 'cuda'\n",
"\n",
"    # huggingface hub model id\n",
"    model_id = \"instructcodet5p-16b\"\n",
"    if not os.path.exists(model_id):\n",
"        model_id = token_id\n",
"    bnb_config = BitsAndBytesConfig(\n",
"        load_in_4bit=True,\n",
"        bnb_4bit_use_double_quant=True,\n",
"        bnb_4bit_quant_type=\"nf4\",\n",
"        bnb_4bit_compute_dtype=torch.bfloat16\n",
"    )\n",
"    # load model from the hub\n",
"    model = AutoModelForSeq2SeqLM.from_pretrained(model_id,\n",
"                                                  # torch_dtype=torch.bfloat16,\n",
"                                                  low_cpu_mem_usage=True,\n",
"                                                  trust_remote_code=True, decoder_start_token_id=1, pad_token_id=pad_tok, device_map=\"auto\", quantization_config=bnb_config)\n",
"    modules = find_all_linear_names(model)\n",
"    # Define LoRA Config\n",
"    lora_config = LoraConfig(\n",
"        r=16,\n",
"        lora_alpha=32,\n",
"        target_modules=modules,\n",
"        lora_dropout=0.05,\n",
"        bias=\"none\",\n",
"        task_type=TaskType.SEQ_2_SEQ_LM\n",
"    )\n",
"    model = prepare_model_for_kbit_training(model, False)\n",
"\n",
"    # add LoRA adaptor\n",
"    model = get_peft_model(model, lora_config)\n",
"    model.print_trainable_parameters()\n",
"\n",
"    # we want to ignore tokenizer pad token in the loss\n",
"    label_pad_token_id = pad_tok\n",
"    # Data collator\n",
"    data_collator = DataCollatorForSeq2Seq(\n",
"        tokenizer,\n",
"        model=model,\n",
"        label_pad_token_id=label_pad_token_id,\n",
"        pad_to_multiple_of=8\n",
"    )\n",
"    output_dir = args.output_dir\n",
"\n",
"    training_args = Seq2SeqTrainingArguments(\n",
"        output_dir=output_dir,\n",
"        per_device_train_batch_size=1,\n",
"        # per_device_eval_batch_size=1,\n",
"        predict_with_generate=True,\n",
"        weight_decay=0.05,\n",
"        # warmup_steps=200,\n",
"        fp16=False, # Overflows with fp16\n",
"        learning_rate=1e-4,\n",
"        num_train_epochs=5,\n",
"        logging_dir=f\"{output_dir}/logs\",\n",
"        logging_strategy=\"epoch\",\n",
"        report_to=\"tensorboard\",\n",
"        push_to_hub=False,\n",
"        # generation_max_length=200,\n",
"        optim=\"paged_adamw_8bit\",\n",
"        lr_scheduler_type='constant'\n",
"    )\n",
"\n",
"    # Create Trainer instance\n",
"    trainer = Seq2SeqTrainer(\n",
"        model=model,\n",
"        args=training_args,\n",
"        data_collator=data_collator,\n",
"        train_dataset=tokenized_dataset[\"train\"],\n",
"        # eval_dataset=tokenized_dataset[\"validation\"],\n",
"        # compute_metrics=compute_metrics,\n",
"    )\n",
"\n",
"    # train model\n",
"    train_result = trainer.train()\n",
"\n",
"if __name__ == '__main__':\n",
"    main()"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
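
For readers who want to see the `np.where` line from the notebook above in isolation, here is a minimal numpy-only sketch. It is not part of the commit; the array values and the stand-in `pad_token_id = 0` are made up for illustration, while `pad_tok = 50256` matches the notebook.

```python
import numpy as np

pad_tok = 50256       # ID used to pad label sequences in the notebook
pad_token_id = 0      # illustrative stand-in for tokenizer.pad_token_id

# Two label sequences padded to length 5 with pad_tok
labels = np.array([[4523, 318, 257, pad_tok, pad_tok],
                   [1212, 2746, pad_tok, pad_tok, pad_tok]])

# Keep every element that is not pad_tok; replace the rest with pad_token_id
labels = np.where(labels != pad_tok, labels, pad_token_id)

print(labels)
# [[4523  318  257    0    0]
#  [1212 2746    0    0    0]]
```

The same vectorised one-liner could replace the nested `for` loops over `preds` in `compute_metrics()` (e.g. `preds = np.where(preds != -100, preds, 50256)`), provided `preds` is a NumPy array.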
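
And here is a minimal end-to-end sketch of the decode-and-score pipeline that `compute_metrics()` implements, run on dummy data rather than real model output. It is likewise not part of the commit: the `gpt2` tokenizer (whose `<|endoftext|>` ID happens to be the same `50256` used as `pad_tok` in the notebook), the example sentences, and the fake `-100` padding are assumptions made purely for illustration, and the `sent_tokenize` post-processing step is omitted to avoid the nltk download. It assumes the `transformers`, `evaluate`, `rouge_score`, and `numpy` packages are installed.

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative choice of tokenizer
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
metric = evaluate.load("rouge")
pad_tok = 50256

# Pretend these came out of trainer.predict(): token IDs, with the prediction
# padded out by a few -100 positions that cannot be decoded directly.
preds = tokenizer("the model prints hello world", return_tensors="np").input_ids
labels = tokenizer("the model should print hello world", return_tensors="np").input_ids
preds = np.concatenate([preds, np.full((1, 3), -100)], axis=1)

# Same steps as compute_metrics(): make the sequences decodable, decode, then score.
preds = np.where(preds != -100, preds, pad_tok)
labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
print({k: round(v * 100, 4) for k, v in result.items()})
# Prints ROUGE-1/2/L/Lsum scores scaled to 0-100, as in the notebook's compute_metrics()
```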

0 commit comments
