Assertions use a single entrypoint and an expressive catalog. Prefer structural checks for stability, and combine them with judges only when necessary.
Use one entrypoint:
await session.assert_that(Expect.content.contains("Example"), response=resp) 
Catalog source: catalog.py

Immediate vs deferred

  • Immediate: content and judge assertions, evaluated as soon as you pass a response (see the sketch below)
  • Deferred: tools, performance, and path assertions, evaluated against the final session metrics
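Example (a minimal sketch, reusing the fetch scenario from the Content examples below):
# Immediate: evaluated right away because a response is provided
resp = await agent.generate_str("Fetch https://example.com")
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
# Deferred: evaluated against final session metrics when the session ends
await session.assert_that(Expect.tools.was_called("fetch"))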

Content

  • Expect.content.contains(text, case_sensitive=False)
  • Expect.content.not_contains(text, case_sensitive=False)
  • Expect.content.regex(pattern, case_sensitive=False)
Examples:
resp = await agent.generate_str("Fetch https://example.com")
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
await session.assert_that(Expect.content.not_contains("Stack trace"), response=resp)
await session.assert_that(Expect.content.regex(r"Example\s+Domain"), response=resp)

Tools

  • Expect.tools.was_called(name, min_times=1)
  • Expect.tools.called_with(name, {args})
  • Expect.tools.count(name, expected_count)
  • Expect.tools.success_rate(min_rate, tool_name=None)
  • Expect.tools.failed(name)
  • Expect.tools.output_matches(name, expected, field_path?, match_type?, case_sensitive?, call_index?)
  • Expect.tools.sequence([names], allow_other_calls=False)
Notes:
  • field_path supports nested dict/list access, e.g. content[0].text or content.0.text
Examples:
# Verify a fetch occurred with the expected output pattern in the first content block
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"use.*examples",
        match_type="regex",
        case_sensitive=False,
        field_path="content[0].text",
    ),
    name="fetch_output_match",
)

# Verify sequence and counts
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(Expect.tools.count("fetch", 1))
await session.assert_that(Expect.tools.success_rate(1.0, tool_name="fetch"))
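
The remaining tool assertions follow the same pattern. A minimal sketch (the {"url": ...} argument and the failing tool name are illustrative assumptions, not taken from the examples above):
# Assert the tool ran at least once and was invoked with specific arguments
await session.assert_that(Expect.tools.was_called("fetch", min_times=1))
await session.assert_that(Expect.tools.called_with("fetch", {"url": "https://example.com"}))
# Assert that a tool expected to error actually failed (hypothetical tool name)
await session.assert_that(Expect.tools.failed("broken_tool"))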

Performance

  • Expect.performance.max_iterations(n)
  • Expect.performance.response_time_under(ms)
Example:
await session.assert_that(Expect.performance.max_iterations(3))
await session.assert_that(Expect.performance.response_time_under(10_000))

Judge

  • Expect.judge.llm(rubric, min_score=0.8, include_input=False, require_reasoning=True)
  • Expect.judge.multi_criteria(criteria, aggregate_method="weighted", require_all_pass=False, include_confidence=True, use_cot=True, model=None)
Examples:
# Single-criterion rubric
judge = Expect.judge.llm(
    rubric="Response should identify JSON format and summarize main points",
    min_score=0.8,
    include_input=True,
)
await session.assert_that(judge, response=resp, name="quality_check")

# Multi-criteria
from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(name="accuracy", description="Factual correctness", weight=2.0, min_score=0.8),
    EvaluationCriterion(name="completeness", description="Covers key points", weight=1.5, min_score=0.7),
]
judge_mc = Expect.judge.multi_criteria(criteria, aggregate_method="weighted", use_cot=True)
await session.assert_that(judge_mc, response=resp, name="multi_criteria")

Path

  • Expect.path.efficiency(optimal_steps?, expected_tool_sequence?, allow_extra_steps=0, penalize_backtracking=True, penalize_repeated_tools=True, tool_usage_limits?, default_tool_limit=1)
Evaluators source: evaluators/

Examples

# Enforce exact order and single use of tools
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["validate", "process", "save"],
        tool_usage_limits={"validate": 1, "process": 1, "save": 1},
        allow_extra_steps=0,
        penalize_backtracking=True,
    ),
    name="golden_path",
)

# Allow a retry of a fragile step without failing the whole path
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["fetch", "parse"],
        tool_usage_limits={"fetch": 2, "parse": 1},
        allow_extra_steps=1,
        penalize_repeated_tools=False,
    ),
    name="robust_path",
)

Tips

  • Prefer structural checks (e.g., output_matches) for tool outputs where possible; they are more stable than judge-based checks
  • Use name in assert_that(..., name="...") to label checks in reports
  • Combine judge + structural checks for high confidence
Combine a minimal judge (e.g., rubric with min_score=0.8) with one or two structural checks (like output_matches) for resilient tests.
Deferred assertions are evaluated when the session ends, or when when="end" is passed explicitly. If an assertion depends on final metrics (e.g., success rate), defer it.
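
Example (a sketch combining these tips; it assumes resp and the fetch tool from the earlier examples, and that assert_that accepts when="end" as described above):
# Structural check for stability (immediate, given the response)
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
# Minimal judge for semantic quality
await session.assert_that(
    Expect.judge.llm(rubric="Accurately summarizes the fetched page", min_score=0.8),
    response=resp,
    name="summary_quality",
)
# Metric-dependent check, explicitly deferred to session end
await session.assert_that(Expect.tools.success_rate(1.0, tool_name="fetch"), when="end")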