This repository contains a collection of benchmarks and evaluators for agents built on top of Llama Stack.
```bash
git clone https://github.com/yanxi0830/llama-stack-evals.git
cd llama-stack-evals
pip install -e .
```

- 📓 Check out `notebooks/` for working examples of how to run benchmarks using Llama Stack.
```python
# Example evaluation setup
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

from llama_stack_evals.benchmarks.hotpotqa import HotpotQAEvaluator

# setup client
client = LlamaStackClient(...)

# setup agent config
agent_config = AgentConfig(...)

# setup evaluator
evaluator = HotpotQAEvaluator()

# Run evaluation
results = evaluator.run(agent_config, client)
```

While llama-stack makes it easy to build and deploy LLM agents, evaluating these agents comprehensively can be challenging. The current evaluation APIs have several limitations:
- Complex setup requirements for running evaluations (data preparation, defining scoring functions, benchmark registration)
- Difficulty in using private datasets that can't be shared with servers
- Limited flexibility in implementing custom scoring functions
- Challenges in evaluating agents with custom client tools
- Lack of streamlined solutions for running evaluations against popular benchmarks
llama-stack-evals addresses these challenges by providing a developer-friendly framework for evaluating applications built on top of Llama Stack.
Simplified Evaluation Flow
- Reduce complex evaluation setups to just a few lines of code
- Intuitive APIs that feel natural to Python developers
- Built-in support for popular open benchmarks
Flexibility & Customization
- Evaluate agents with custom client tools
- Use private datasets without uploading to servers
- Implement custom scoring functions easily (see the sketch below)
- Run component-level evaluations (e.g., retrieval tools)
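To make the custom scoring point concrete, here is a minimal sketch of what such a function can look like. It is plain Python with no framework dependencies; how it gets wired into an evaluator (the `scoring_fn=` argument in the comment) is an assumption for illustration, not a documented llama-stack-evals interface.

```python
# A minimal sketch of a custom scoring function (plain Python, no framework
# dependencies). The `scoring_fn=` hook shown at the bottom is hypothetical,
# not a documented llama-stack-evals API.
def exact_match_score(generated_answer: str, expected_answer: str) -> float:
    """Return 1.0 when the normalized answers match, else 0.0."""
    return float(generated_answer.strip().lower() == expected_answer.strip().lower())

# Hypothetical usage with an evaluator that accepts a scoring callback:
# evaluator = HotpotQAEvaluator(scoring_fn=exact_match_score)
```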
Out-of-the-Box Benchmark Examples
- Model Evaluation: release lightweight benchmark numbers against ANY llama-stack SDK-compatible endpoint
- Agent Evaluation:
  - E2E evaluation via AgentConfig (e.g., HotpotQA)
  - Component-level evaluation (e.g., BEIR for retrieval); see the sketch below
- Complex Simulation Support:
  - Tau-Bench (with simulated users)
  - CRAG
  - RAG evaluations with vector DB integration
- Not a Recipe Repository: Provide structured, maintained evaluation tools rather than ad-hoc evaluation scripts
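As a rough illustration of what a component-level retrieval evaluation reports, the sketch below computes recall@k over retrieved document IDs. The function name and data shapes are assumptions for illustration; this is not the llama-stack-evals or BEIR implementation, only the kind of metric such an evaluation produces.

```python
# Illustrative only: the kind of metric a BEIR-style component-level retrieval
# evaluation computes. The data shapes (query ID -> retrieved / relevant doc IDs)
# are assumptions, not the llama-stack-evals implementation.
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 10) -> float:
    """Average fraction of relevant documents found in the top-k results per query."""
    scores = []
    for query_id, docs in retrieved.items():
        gold = relevant.get(query_id, set())
        if not gold:
            continue
        hits = len(set(docs[:k]) & gold)
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0

# Example: one query where 1 of its 2 relevant documents appears in the top-10.
print(recall_at_k({"q1": ["d3", "d7", "d9"]}, {"q1": {"d3", "d42"}}, k=10))  # 0.5
```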
✨ We welcome Pull Requests with improvements or suggestions.
🐛 If you want to flag an issue or propose an improvement, but don't know how to implement it, create a GitHub Issue.