This repository contains a collection of benchmarks and evaluators for agents built on top of Llama Stack.
```bash
git clone https://github.com/yanxi0830/llama-stack-evals.git
cd llama-stack-evals
pip install -e .
```

- 📓 Check out `notebooks/` for working examples of how to run benchmarks using Llama Stack.
```python
# Example evaluation setup
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

from llama_stack_evals.benchmarks.hotpotqa import HotpotQAEvaluator

# setup client
client = LlamaStackClient(...)

# setup agent config
agent_config = AgentConfig(...)

# setup evaluator
evaluator = HotpotQAEvaluator()

# Run evaluation
results = evaluator.run(agent_config, client)
```

While llama-stack makes it easy to build and deploy LLM agents, evaluating these agents comprehensively can be challenging. The current evaluation APIs have several limitations:
- Complex setup requirements for running evaluations (data preparation, defining scoring functions, benchmark registration)
- Difficulty in using private datasets that can't be shared with servers
- Limited flexibility in implementing custom scoring functions
- Challenges in evaluating agents with custom client tools
- Lack of streamlined solutions for running evaluations against popular benchmarks
llama-stack-evals addresses these challenges by providing a developer-friendly framework for evaluating applications built on top of Llama Stack.
Simplified Evaluation Flow
- Reduce complex evaluation setups to just a few lines of code
- Intuitive APIs that feel natural to Python developers
- Built-in support for popular open benchmarks
Flexibility & Customization
- Evaluate agents with custom client tools
- Use private datasets without uploading to servers
- Implement custom scoring functions easily (see the sketch below)
- Run component-level evaluations (e.g., retrieval tools)
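To make the custom scoring point concrete, here is a minimal sketch of what such a function can look like. It is plain Python with no framework dependencies; how it gets wired into an evaluator (the `scoring_fn=` argument in the comment) is an assumption for illustration, not a documented llama-stack-evals interface.

```python
# A minimal sketch of a custom scoring function (plain Python, no framework
# dependencies). The `scoring_fn=` hook shown at the bottom is hypothetical,
# not a documented llama-stack-evals API.
def exact_match_score(generated_answer: str, expected_answer: str) -> float:
    """Return 1.0 when the normalized answers match, else 0.0."""
    return float(generated_answer.strip().lower() == expected_answer.strip().lower())

# Hypothetical usage with an evaluator that accepts a scoring callback:
# evaluator = HotpotQAEvaluator(scoring_fn=exact_match_score)
```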
Out-of-the-Box Benchmark Examples
- Model Evaluation: release lightweight benchmark numbers against ANY llama-stack SDK-compatible endpoint
- Agent Evaluation:
  - E2E evaluation via AgentConfig (e.g., HotpotQA)
  - Component-level evaluation (e.g., BEIR for retrieval); see the sketch below
- Complex Simulation Support:
  - Tau-Bench (with simulated users)
  - CRAG
  - RAG evaluations with vector DB integration
- Not a Recipe Repository: Provide structured, maintained evaluation tools rather than ad-hoc evaluation scripts
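As a rough illustration of what a component-level retrieval evaluation reports, the sketch below computes recall@k over retrieved document IDs. The function name and data shapes are assumptions for illustration; this is not the llama-stack-evals or BEIR implementation, only the kind of metric such an evaluation produces.

```python
# Illustrative only: the kind of metric a BEIR-style component-level retrieval
# evaluation computes. The data shapes (query ID -> retrieved / relevant doc IDs)
# are assumptions, not the llama-stack-evals implementation.
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 10) -> float:
    """Average fraction of relevant documents found in the top-k results per query."""
    scores = []
    for query_id, docs in retrieved.items():
        gold = relevant.get(query_id, set())
        if not gold:
            continue
        hits = len(set(docs[:k]) & gold)
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0

# Example: one query where 1 of its 2 relevant documents appears in the top-10.
print(recall_at_k({"q1": ["d3", "d7", "d9"]}, {"q1": {"d3", "d42"}}, k=10))  # 0.5
```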
✨ We welcome Pull Requests with improvements or suggestions.
🐛 If you want to flag an issue or propose an improvement, but don't know how to implement it, create a GitHub Issue.