Shannon Lal

Is your fine-tuned LLM any good?

Over the past few weeks, I've been diving deep into fine-tuning Large Language Models (LLMs) for various applications, with a particular focus on creating detailed summaries from multiple sources. Throughout this process, we've leveraged tools like Weights & Biases to track our learning rates and optimize our results. However, a crucial question remains: how can we objectively determine whether our fine-tuned model is producing high-quality outputs?
While human evaluation is always an option, it's time-consuming and often impractical for large-scale assessments. This is where ROUGE (Recall-Oriented Understudy for Gisting Evaluation) comes into play, offering a quantitative approach to evaluating our fine-tuned models.

ROUGE calculates the overlap between the generated text and the reference text(s) using various methods. The most common ROUGE metrics include:

ROUGE-N: Measures the overlap of n-grams (contiguous sequences of n words) between the generated and reference texts.
ROUGE-L: Computes the longest common subsequence (LCS) between the generated and reference texts.
ROUGE-W: A weighted version of ROUGE-L that gives more importance to consecutive matches.
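To make the n-gram overlap idea concrete, here is a simplified, recall-only sketch of ROUGE-N. The rouge_n_recall function and toy sentences below are purely illustrative; the real metric, as implemented in packages like rouge_score, also reports precision and F-measure and supports stemming.

from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    # Build n-gram counts for a whitespace-tokenized, lowercased text
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Count reference n-grams that also appear in the candidate (clipped counts)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=1))  # ~0.83
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))  # 0.6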

What do the different metrics mean?

ROUGE-1: Overlap of unigrams (single words)
ROUGE-2: Overlap of bigrams (two-word sequences)
ROUGE-L: Longest common subsequence
ROUGE-L sum: Variant of ROUGE-L computed over the entire summary

Each metric is typically reported as a score between 0 and 1, where higher values indicate better performance.
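In practice you rarely compute these by hand: the Hugging Face evaluate library (backed by the rouge_score package) returns all four metrics in a single call. A minimal sketch, using made-up strings:

# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],          # model output
    references=["the cat lay quietly on the mat"],   # reference summary
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}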

Here is some code you can use for evaluating the models. Note that in this example I am loading both models onto a single GPU and comparing their results side by side.

import multiprocessing
import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from evaluate import load

original_model_name = ""
finetuned_model = ""


def load_model(model_name: str, device: str):
    # Load the model in 8-bit so both models fit on a single GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def inference(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def process_task(task_queue, result_queue):
    # Load both models onto the same GPU and score each prompt with ROUGE
    original_model, original_tokenizer = load_model(original_model_name, device="cuda:0")
    fine_tuned_model, fine_tuned_tokenizer = load_model(finetuned_model, device="cuda:0")
    rouge = load("rouge")

    while True:
        task = task_queue.get()
        if task is None:  # Sentinel value: stop the worker
            break
        prompt, reference = task

        start = time.time()
        original_summary = inference(original_model, original_tokenizer, prompt)
        fine_tuned_summary = inference(fine_tuned_model, fine_tuned_tokenizer, prompt)
        print(f"Completed inference in {time.time() - start}")

        original_scores = rouge.compute(predictions=[original_summary], references=[reference])
        fine_tuned_scores = rouge.compute(predictions=[fine_tuned_summary], references=[reference])
        result_queue.put((original_scores, fine_tuned_scores))


def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    prompt = "Your prompt here"
    reference = "Your reference summary here"

    process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
    process.start()

    start = time.time()

    # Run 3 times
    for _ in range(3):
        task_queue.put((prompt, reference))

    results = []
    for _ in range(3):
        result = result_queue.get()
        results.append(result)

    # Signal the process to terminate
    task_queue.put(None)
    process.join()

    end = time.time()
    print(f"Total time: {end - start}")

    # Print ROUGE scores
    for i, (original_scores, fine_tuned_scores) in enumerate(results):
        print(f"Run {i+1}:")
        print("Original model scores:")
        print(original_scores)
        print("Fine-tuned model scores:")
        print(fine_tuned_scores)
        print()


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    main()
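To try this yourself, set original_model_name and finetuned_model to the Hugging Face model IDs (or local paths) of your base and fine-tuned checkpoints, and replace the placeholder prompt and reference with examples from your evaluation set. Loading with load_in_8bit=True assumes the bitsandbytes and accelerate packages are installed; if you don't need quantization, you can drop that flag and load the models in full or half precision instead.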

To illustrate the effectiveness of ROUGE in evaluating fine-tuned models, here are the results from a recent run:
Fine-tuned model:

rouge1: 0.1775
rouge2: 0.0271
rougeL: 0.1148
rougeLsum: 0.1148

Original model:

rouge1: 0.0780
rouge2: 0.0228
rougeL: 0.0543
rougeLsum: 0.0598

As we can see, the fine-tuned model improves on every ROUGE metric compared to the original model, with the largest gains in ROUGE-1 and ROUGE-L. This quantitative assessment provides concrete evidence that our fine-tuning process has indeed enhanced the model's summarization capabilities.
While ROUGE scores shouldn't be the sole criterion for evaluating LLMs, they offer a valuable, objective measure of improvement. By incorporating ROUGE into our evaluation pipeline, we can more efficiently iterate on our fine-tuning process and confidently assess the quality of our models.
