Abhay

🧠 Auto-Evaluating Chatbots with GenAI: The Pipeline, The Prompts, and The Proof

Chatbots are getting scary good — but evaluating them? That’s still a pain. BLEU and ROUGE scores feel like trying to judge a movie by its subtitles. Human evaluation is time-consuming, inconsistent, and honestly… nobody has time for that.

So here’s the question I tackled in this project:
Can we let an LLM evaluate other LLMs?

Spoiler: Yep. And it’s shockingly effective.


The Big Idea: LLM Rating LLM 💡

We built an Auto-Eval system using Google’s Gemini 2.0 Flash model to rate chatbot responses on:

✅ Relevance – Does it actually answer the question?

✅ Helpfulness – Is the answer useful or just fluff?

✅ Clarity – Can a human actually understand it?

✅ Factual Accuracy – Is it hallucinating or nah?
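
Concretely, each evaluation boils down to a tiny record with one score from 1 to 5 per dimension. A quick sketch of the shape we expect back (the field names mirror the JSON keys in the eval prompt further down):

from typing import TypedDict

class EvalScore(TypedDict):
    relevance: int    # 1-5: does it address the prompt?
    helpfulness: int  # 1-5: is it actually useful?
    clarity: int      # 1-5: is it easy to follow?
    factuality: int   # 1-5: is it free of hallucinations?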

And we didn’t just invent our own data — we pulled real conversations from the OpenAssistant dataset (OASST1). These are crowdsourced human-assistant chats, so it’s the real deal.


Setup: Let’s Get Nerdy ⚙️

Step 1: Load the Dataset

We used Hugging Face’s datasets library to load OpenAssistant’s training data and converted it to a Pandas DataFrame.

from datasets import load_dataset

oasst = load_dataset("OpenAssistant/oasst1")
df = oasst["train"].to_pandas()

Step 2: Extract Prompt-Response Pairs

We filtered English conversations and merged assistant replies with the prompts that triggered them.

df = df[df['role'].isin(['prompter', 'assistant'])][['message_id', 'parent_id', 'text', 'role', 'lang']]
df = df[df['lang'] == 'en']

# Join each message to its parent, then keep only assistant replies to user prompts
merged = df.merge(df, left_on="parent_id", right_on="message_id", suffixes=("_reply", "_prompt"))
merged = merged[merged['role_reply'] == 'assistant']
merged = merged[['text_prompt', 'text_reply']].rename(columns={'text_prompt': 'prompt', 'text_reply': 'response'})

Prompt Engineering + Gemini Setup 🤖

We used a structured, rubric-style prompt to make Gemini behave like an evaluator and return its scores in JSON format.

Here’s the eval prompt we send:

def build_eval_prompt(prompt, response):
    return f"""
You are an evaluator. Rate this response to a user prompt.

Rate from 1 to 5 on:
- Relevance
- Helpfulness
- Clarity
- Factual accuracy

Return ONLY valid JSON:
{{ "relevance": X, "helpfulness": X, "clarity": X, "factuality": X }}

Prompt: {prompt}
Response: {response}
"""

And then we just hit Gemini with that:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Gemini's verdict on a single pair, kept separate from the chatbot response being graded
result = model.generate_content(build_eval_prompt(prompt, response))

Running the Evaluation Loop 🧪

We ran the model on a sample of 15 prompt-response pairs and parsed the scores:

import json

# Grab a random sample of 15 prompt/response pairs to score
sampled = merged.sample(15, random_state=42)

ratings = []
for _, row in sampled.iterrows():
    try:
        res = model.generate_content(build_eval_prompt(row['prompt'], row['response']))
        # Pull out just the JSON block in case Gemini wraps it in extra text
        json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
        score = json.loads(json_block)
        score.update(row)  # keep the original prompt/response alongside the scores
        ratings.append(score)
    except Exception as e:
        print("Error:", e)

Boom. Now we have an LLM scoring LLMs. Matrix-style.


Visualizing the Scores 📊

We saved everything to a CSV and used a Seaborn boxplot to get the vibes:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ratings_df = pd.DataFrame(ratings)  # this is also what we wrote out to CSV

sns.boxplot(data=ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']])
plt.title("LLM Auto-Eval Score Distribution")
plt.show()

And the results? Pretty solid. Some outliers, but Gemini gave reasonable scores across all four dimensions.
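
If you want numbers rather than just the boxplot picture, per-dimension summary stats are a one-liner on the same ratings_df:

# Mean, spread, and quartiles for each evaluation dimension
print(ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']].describe().round(2))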


Takeaways 🔍

✅ This works. Gemini can evaluate chatbot responses consistently.

🎯 It scales. No need to bug your friends to rate 200 replies.

🤖 Model comparisons just got easier. Want to compare GPT vs Claude vs Mistral? Auto-eval it (rough sketch right after this list).
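
Here's roughly what that multi-model comparison could look like. It assumes a generate_reply(model_name, prompt) helper that calls each candidate chatbot; both the helper and the model names below are placeholders, not part of the notebook:

import json
import pandas as pd

candidate_models = ["model_a", "model_b"]  # placeholder model identifiers

rows = []
for name in candidate_models:
    for _, pair in sampled.iterrows():
        reply = generate_reply(name, pair['prompt'])  # hypothetical helper that queries the candidate model
        res = model.generate_content(build_eval_prompt(pair['prompt'], reply))
        json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
        score = json.loads(json_block)
        score['model'] = name
        rows.append(score)

# Mean score per dimension for each candidate model
print(pd.DataFrame(rows).groupby('model')[['relevance', 'helpfulness', 'clarity', 'factuality']].mean())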

What’s Next? 🛣️

📈 Add more examples and multiple models for A/B testing.

🤯 Detect hallucinations automatically.

🧑‍⚖️ Compare LLM vs human evaluations — who rates better? (See the sketch below.)
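
For that last point, a simple starting point is rank correlation between the LLM's scores and human scores on the same pairs. A minimal sketch, assuming you've collected a human_score column (which the current notebook doesn't have):

from scipy.stats import spearmanr

# Hypothetical: ratings_df has a human_score column gathered from human raters
corr, p_value = spearmanr(ratings_df['helpfulness'], ratings_df['human_score'])
print(f"LLM vs human agreement (Spearman): {corr:.2f}, p = {p_value:.3f}")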


🧪 Try It Yourself

Want to peek under the hood or run it with your own data?

👉 Check out the full notebook on Kaggle

Clone it, tweak it, break it (just don’t blame me 😅).

P.S.: This post was rated 5/5 on clarity by my cat. And 2/5 on factuality by my anxiety.
