Assertions use a single entrypoint and an expressive catalog. Prefer structural checks for stability, and combine them with judges only when necessary.
Use one entrypoint:

```python
await session.assert_that(Expect.content.contains("Example"), response=resp)
```
Catalog source: catalog.py

- Immediate (evaluated against a given response): content and judge
- Deferred (require final metrics): tools, performance, path
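A minimal sketch of the split, reusing the agent and session objects from the examples below (the prompt text is illustrative):

```python
# Immediate: checked right away against the response you pass in.
resp = await agent.generate_str("Fetch https://example.com")
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)

# Deferred: needs the session's final metrics, so it settles when the session ends.
await session.assert_that(Expect.tools.was_called("fetch"))
await session.assert_that(Expect.performance.max_iterations(3))
```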
Content
- Expect.content.contains(text, case_sensitive=False)
- Expect.content.not_contains(text, case_sensitive=False)
- Expect.content.regex(pattern, case_sensitive=False)
Examples:

```python
resp = await agent.generate_str("Fetch https://example.com")
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
await session.assert_that(Expect.content.not_contains("Stack trace"), response=resp)
await session.assert_that(Expect.content.regex(r"Example\s+Domain"), response=resp)
```
Tools

- Expect.tools.was_called(name, min_times=1)
- Expect.tools.called_with(name, {args})
- Expect.tools.count(name, expected_count)
- Expect.tools.success_rate(min_rate, tool_name=None)
- Expect.tools.failed(name)
- Expect.tools.output_matches(name, expected, field_path?, match_type?, case_sensitive?, call_index?)
- Expect.tools.sequence([names], allow_other_calls=False)
Note: field_path supports nested dict/list access, e.g. content[0].text or content.0.text
Examples:

```python
# Verify a fetch occurred with the expected output pattern in the first content block
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"use.*examples",
        match_type="regex",
        case_sensitive=False,
        field_path="content[0].text",
    ),
    name="fetch_output_match",
)

# Verify sequence and counts
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(Expect.tools.count("fetch", 1))
await session.assert_that(Expect.tools.success_rate(1.0, tool_name="fetch"))
```
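The dotted-index form of field_path from the note above works the same way; a minimal variant of the first check, with only the path notation swapped (the expected pattern is illustrative):

```python
# Same check as above, addressing the first content block with dotted indices.
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"use.*examples",
        match_type="regex",
        field_path="content.0.text",
    ),
    name="fetch_output_match_dotted",
)
```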
Performance

- Expect.performance.max_iterations(n)
- Expect.performance.response_time_under(ms)
Example:

```python
await session.assert_that(Expect.performance.max_iterations(3))
await session.assert_that(Expect.performance.response_time_under(10_000))
```
Judge
- Expect.judge.llm(rubric, min_score=0.8, include_input=False, require_reasoning=True)
- Expect.judge.multi_criteria(criteria, aggregate_method="weighted", require_all_pass=False, include_confidence=True, use_cot=True, model=None)
Examples:

```python
# Single-criterion rubric
judge = Expect.judge.llm(
    rubric="Response should identify JSON format and summarize main points",
    min_score=0.8,
    include_input=True,
)
await session.assert_that(judge, response=resp, name="quality_check")

# Multi-criteria
from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(name="accuracy", description="Factual correctness", weight=2.0, min_score=0.8),
    EvaluationCriterion(name="completeness", description="Covers key points", weight=1.5, min_score=0.7),
]
judge_mc = Expect.judge.multi_criteria(criteria, aggregate_method="weighted", use_cot=True)
await session.assert_that(judge_mc, response=resp, name="multi_criteria")
```
Path
- Expect.path.efficiency(optimal_steps?, expected_tool_sequence?, allow_extra_steps=0, penalize_backtracking=True, penalize_repeated_tools=True, tool_usage_limits?, default_tool_limit=1)
Evaluators source: evaluators/

Examples:
```python
# Enforce exact order and single use of tools
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["validate", "process", "save"],
        tool_usage_limits={"validate": 1, "process": 1, "save": 1},
        allow_extra_steps=0,
        penalize_backtracking=True,
    ),
    name="golden_path",
)

# Allow a retry of a fragile step without failing the whole path
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["fetch", "parse"],
        tool_usage_limits={"fetch": 2, "parse": 1},
        allow_extra_steps=1,
        penalize_repeated_tools=False,
    ),
    name="robust_path",
)
```
Tips
- Prefer structural checks (output_matches) for tool outputs when possible for stability
- Use name in assert_that(..., name="...") to label checks in reports
- Combine judge + structural checks for high confidence
Combine a minimal judge (e.g., rubric with min_score=0.8) with one or two structural checks (like output_matches) for resilient tests.
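A sketch of that combination, reusing the fetch example from above (the rubric text and check names are illustrative):

```python
# Structural checks: cheap and deterministic.
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"Example\s+Domain",
        match_type="regex",
        field_path="content[0].text",
    ),
    name="fetch_structural",
)

# One minimal judge for overall quality on top of the structural checks.
judge = Expect.judge.llm(
    rubric="Accurately summarizes the fetched page",
    min_score=0.8,
)
await session.assert_that(judge, response=resp, name="fetch_quality")
```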
Deferred assertions are evaluated at session end (or when when="end" is specified). If an assertion depends on final metrics (e.g., success rate), defer it.
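For example, a success-rate check only makes sense over the full tool-call history (a sketch; the when= keyword follows the usage noted above):

```python
# Deferred explicitly: evaluated once final metrics are available at session end.
await session.assert_that(Expect.tools.success_rate(0.9, tool_name="fetch"), when="end")
```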