Chatbots are getting scary good — but evaluating them? That’s still a pain. BLEU and ROUGE scores feel like trying to judge a movie by its subtitles. Human evaluation is time-consuming, inconsistent, and honestly… nobody has time for that.
So here’s the question I tackled in this project:
Can we let an LLM evaluate other LLMs?
Spoiler: Yep. And it’s shockingly effective.
The Big Idea: LLM Rating LLM 💡
We built an Auto-Eval system using Google’s Gemini 2.0 Flash model to rate chatbot responses on:
✅ Relevance – Does it actually answer the question?
✅ Helpfulness – Is the answer useful or just fluff?
✅ Clarity – Can a human actually understand it?
✅ Factual Accuracy – Is it hallucinating or nah?
And we didn’t just invent our own data — we pulled real conversations from the OpenAssistant dataset (OASST1). These are crowdsourced human-assistant chats, so it’s the real deal.
Setup: Let’s Get Nerdy ⚙️
Step 1: Load the Dataset
We used Hugging Face’s datasets library to load OpenAssistant’s training data and converted it to a Pandas DataFrame.
from datasets import load_dataset

oasst = load_dataset("OpenAssistant/oasst1")
df = oasst["train"].to_pandas()
Step 2: Extract Prompt-Response Pairs
We filtered English conversations and merged assistant replies with the prompts that triggered them.
df = df[df['role'].isin(['prompter', 'assistant'])][['message_id', 'parent_id', 'text', 'role', 'lang']]
df = df[df['lang'] == 'en']

# Join each message to its parent: the left side is the reply, the right side the message it answers
merged = df.merge(df, left_on="parent_id", right_on="message_id", suffixes=("_reply", "_prompt"))

# Keep only assistant replies paired with the prompter messages that triggered them
merged = merged[(merged['role_reply'] == 'assistant') & (merged['role_prompt'] == 'prompter')]
merged = merged[['text_prompt', 'text_reply']].rename(columns={'text_prompt': 'prompt', 'text_reply': 'response'})
Prompt Engineering + Gemini Setup 🤖
We used a structured evaluation prompt to make Gemini behave like an evaluator and return scores in a fixed JSON format.
Here’s the eval prompt we send:
def build_eval_prompt(prompt, response):
    return f"""
You are an evaluator. Rate this response to a user prompt.

Rate from 1 to 5 on:
- Relevance
- Helpfulness
- Clarity
- Factual accuracy

Return ONLY valid JSON:
{{
  "relevance": X,
  "helpfulness": X,
  "clarity": X,
  "factuality": X
}}

Prompt: {prompt}
Response: {response}
"""
And then we just hit Gemini with that:
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# 'prompt' and 'response' here are one prompt-response pair from `merged`
eval_result = model.generate_content(build_eval_prompt(prompt, response))
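Strictly speaking, that prompt is zero-shot. If you want true few-shot behavior, one option (just a sketch, not part of the original notebook, with a made-up example pair) is to prepend a hand-scored example so Gemini sees what a rating should look like:

# Hypothetical few-shot extension: show Gemini one worked example first.
# The example prompt, response, and scores below are invented for illustration.
FEW_SHOT_EXAMPLE = """
Example
Prompt: What is the capital of France?
Response: The capital of France is Paris.
Scores: {"relevance": 5, "helpfulness": 4, "clarity": 5, "factuality": 5}
"""

def build_few_shot_eval_prompt(prompt, response):
    # Reuse build_eval_prompt from above and slot the example in front of the real pair
    return FEW_SHOT_EXAMPLE + build_eval_prompt(prompt, response)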
Running the Evaluation Loop 🧪
We ran the model on a sample of 15 prompt-response pairs and parsed the scores:
import json

# Sample 15 prompt-response pairs from the merged DataFrame
sampled = merged.sample(15)

ratings = []
for _, row in sampled.iterrows():
    try:
        res = model.generate_content(build_eval_prompt(row['prompt'], row['response']))
        # Pull out just the JSON block, in case Gemini wraps it in extra text
        json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
        score = json.loads(json_block)
        score.update(row)  # keep the original prompt/response alongside the scores
        ratings.append(score)
    except Exception as e:
        print("Error:", e)
Boom. Now we have an LLM scoring LLMs. Matrix-style.
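One practical wrinkle: Gemini occasionally returns malformed JSON or hits a transient API error, and the loop above just prints the error and drops that row. A small retry helper (my own sketch, not from the notebook) keeps more rows alive:

import json
import time

def score_with_retry(prompt, response, retries=3, delay=2):
    # Hypothetical helper: retry on malformed JSON or transient API errors
    for attempt in range(retries):
        try:
            res = model.generate_content(build_eval_prompt(prompt, response))
            json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
            return json.loads(json_block)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay)  # back off before retrying
    return None  # give up after all retries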
Visualizing the Scores 📊
We saved everything to a CSV and used a Seaborn boxplot to get the vibes:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ratings_df = pd.DataFrame(ratings)
ratings_df.to_csv("auto_eval_scores.csv", index=False)  # filename is arbitrary

sns.boxplot(data=ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']])
plt.title("LLM Auto-Eval Score Distribution")
plt.show()
And the results? Pretty solid. Some outliers, but Gemini gave reasonable scores across all four dimensions.
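If you want numbers rather than vibes, a quick pandas summary over the same four columns works too (nothing fancy, just describe() and means on the ratings_df built above):

# Per-dimension summary statistics for the auto-eval scores
metrics = ['relevance', 'helpfulness', 'clarity', 'factuality']
print(ratings_df[metrics].describe().round(2))

# Mean score per dimension, sorted from strongest to weakest
print(ratings_df[metrics].mean().sort_values(ascending=False))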
Takeaways 🔍
✅ This works. Gemini can evaluate chatbot responses consistently.
🎯 It scales. No need to bug your friends to rate 200 replies.
🤖 Model comparisons just got easier. Want to compare GPT vs Claude vs Mistral? Auto-eval it.
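For the curious, here’s a rough sketch of what that comparison could look like. None of this is in the original notebook: auto_eval_models, df_gpt, and df_claude are hypothetical, and it assumes you’ve collected each model’s answers to the same prompts in DataFrames with 'prompt' and 'response' columns.

import pandas as pd

def auto_eval_models(pairs_by_model):
    # pairs_by_model: {"gpt": df_gpt, "claude": df_claude, ...}
    rows = []
    for model_name, pairs in pairs_by_model.items():
        for _, row in pairs.iterrows():
            score = score_with_retry(row['prompt'], row['response'])
            if score:
                score['model'] = model_name
                rows.append(score)
    return pd.DataFrame(rows)

# One comparison table: mean score per model and dimension
# scores = auto_eval_models({"gpt": df_gpt, "claude": df_claude})
# print(scores.groupby('model')[['relevance', 'helpfulness', 'clarity', 'factuality']].mean())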
What’s Next? 🛣️
📈 Add more examples and multiple models for A/B testing.
🤯 Detect hallucinations automatically.
🧑‍⚖️ Compare LLM vs human evaluations — who rates better?
🧪 Try It Yourself
Want to peek under the hood or run it with your own data?
👉 Check out the full notebook on Kaggle
Clone it, tweak it, break it (just don’t blame me 😅).
P.S.: This post was rated 5/5 on clarity by my cat. And 2/5 on factuality by my anxiety.