Evaluate your agent’s reasoning, tool use, recovery, and quality by driving it through realistic tasks.

Define the test agent

  • Global default:
```python
import mcp_eval
from mcp_agent.agents.agent_spec import AgentSpec

# See [Settings](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/config.py)
mcp_eval.use_agent(
    AgentSpec(name="Fetcher", instruction="You fetch.", server_names=["fetch"])
)
```
  • Per‑test override with with_agent (place above @task):
```python
from mcp_eval.core import with_agent, task
from mcp_agent.agents.agent import Agent

# See [Core](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/core.py)
@with_agent(Agent(name="Custom", instruction="Custom", server_names=["fetch"]))
@task("Custom agent test")
async def test_custom(agent, session):
    resp = await agent.generate_str("Fetch https://example.com")
```
  • Factory for parallel safety:
```python
from mcp_eval.config import use_agent_factory
from mcp_agent.agents.agent import Agent

# See [Settings](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/config.py)
def make_agent():
    return Agent(name="Isolated", instruction="...", server_names=["fetch"])

use_agent_factory(make_agent)
```
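Because the factory builds a fresh Agent for each test, concurrent test runs don't share conversation state.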
More patterns: agent_definition_examples.py.

What to measure

  • Tool behavior: Expect.tools.was_called, called_with, sequence, output_matches
  • Efficiency and iterations: Expect.performance.max_iterations, Expect.path.efficiency
  • Quality: Expect.judge.llm, Expect.judge.multi_criteria
  • Performance: response times, concurrency (see metrics)
```python
# Efficiency and iteration bounds
await session.assert_that(Expect.performance.max_iterations(3))

# Tool behavior and outputs
await session.assert_that(Expect.tools.was_called("fetch"))
await session.assert_that(
    Expect.tools.output_matches("fetch", {"isError": False}, match_type="partial")
)

# Path and sequence
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(
    Expect.path.efficiency(expected_tool_sequence=["fetch"], allow_extra_steps=1)
)
```
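The quality matchers listed above plug into the same assert_that flow. A minimal sketch of an LLM-judge check (the rubric and min_score keyword names are assumptions; check the Expect catalog for the exact signature):

```python
# Quality: have an LLM judge score the response against a rubric.
# Keyword names below are assumptions, not confirmed API.
await session.assert_that(
    Expect.judge.llm(
        rubric="The response accurately summarizes the fetched page",
        min_score=0.8,
    ),
    response=resp,
)
```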

Styles for agent evals

Inspecting spans and metrics

```python
metrics = session.get_metrics()
span_tree = session.get_span_tree()
```
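From there you can drill into what the run recorded. A minimal sketch, assuming the metrics object exposes per-tool call records and the span tree exposes child spans (the attribute names here are assumptions):

```python
# Attribute names below are assumptions about the metrics/span shapes.
for call in metrics.tool_calls:      # one record per tool invocation
    print(call.name, call.duration_ms)

for child in span_tree.children:     # top-level spans of the run
    print(child.name)
```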