The Session API is the heart of mcp-eval testing. It manages your agent’s lifecycle, collects metrics, runs assertions, and produces comprehensive test results.
Quick start
The simplest way to create a test session:

```python
from mcp_eval.session import test_session
from mcp_eval.catalog import Expect

async with test_session("my-test") as agent:
    # Agent is ready with MCP servers connected
    response = await agent.generate_str("Fetch https://example.com")

    # Run assertions
    await agent.assert_that(
        Expect.content.contains("Example Domain"),
        response=response
    )
```
Core concepts
TestSession
The orchestrator that manages everything (a sketch of how these surface in a test follows this list):
- Lifecycle management: Starts/stops agents and MCP servers
- Tool discovery: Automatically finds and registers MCP tools
- Metrics collection: Tracks all interactions via OTEL
- Assertion execution: Runs evaluators at the right time
- Report generation: Produces test artifacts
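A rough sketch of how those responsibilities show up in a single test; the `"fetch"` tool name and URL are illustrative placeholders, and the accessors used here (`get_metrics()`, `assert_that()`) are covered in detail later on this page:

```python
from mcp_eval.session import test_session
from mcp_eval.catalog import Expect

async with test_session("lifecycle-demo") as agent:
    # Lifecycle + tool discovery: entering the context starts the agent,
    # connects the configured MCP servers, and registers their tools
    response = await agent.generate_str("Fetch https://example.com")

    # Metrics collection: interactions so far have been recorded via OTEL
    metrics = agent.session.get_metrics()
    print(f"Tool calls so far: {len(metrics.tool_calls)}")

    # Assertion execution: deferred checks like this run when the session ends
    await agent.session.assert_that(
        Expect.tools.was_called("fetch"),  # "fetch" is an assumed tool name
        name="used_fetch",
    )

# Report generation: on exit the session finalizes results and artifacts
```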
TestAgent
A thin, friendly wrapper around your LLM agent (a short sketch follows this list):
- Simple interface: Just generate() and assert_that()
- Automatic tracking: All interactions are recorded
- Context preservation: Maintains conversation state
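A minimal sketch tying those three points together; the prompts and expected string are illustrative:

```python
async with test_session("agent-demo") as agent:
    # Simple interface: just generate and assert
    first = await agent.generate_str("My name is Alice")
    second = await agent.generate_str("What's my name?")  # context preserved

    await agent.assert_that(
        Expect.content.contains("Alice"),  # illustrative expectation
        response=second,
        name="remembers_name",
    )

    # Automatic tracking: the wrapped session already recorded both turns
    print(len(agent.session.get_metrics().tool_calls))
```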
Creating sessions
Basic session creation
```python
# Using context manager (recommended)
async with test_session("test-name") as agent:
    # Your test code here
    pass

# Manual lifecycle (advanced)
session = TestSession(test_name="test-name")
agent = await session.__aenter__()
try:
    # Your test code
    ...
finally:
    await session.__aexit__(None, None, None)
    session.cleanup()
```
Session with custom configuration
```python
from mcp_eval.session import test_session
from mcp_agent.agents.agent_spec import AgentSpec

spec = AgentSpec(
    name="custom",
    instruction="You are a helpful test assistant",
    server_names=["my_server"],
)

async with test_session("custom-test", agent=spec) as agent:
    # Your test code
    pass
```
Agent interactions
Generating responses
```python
# Simple string generation
response = await agent.generate_str("What is 2+2?")
print(response)  # "The answer is 4"

# A full response object may be available depending on provider;
# prefer generate_str for portability
```
Multi-turn conversations
```python
# Sessions maintain context
response1 = await agent.generate_str("My name is Alice")
response2 = await agent.generate_str("What's my name?")
# response2 will correctly identify "Alice"
```
Assertions in depth
```python
# Immediate: evaluated right away (content, judge)
await session.assert_that(
    Expect.content.contains("success"),
    response=response,  # Required for immediate
    name="has_success"
)

# Deferred: evaluated at session end (tools, performance, path)
await session.assert_that(
    Expect.tools.was_called("calculator"),
    name="used_calculator"  # No response needed
)

# Force deferred evaluation at end
await session.assert_that(
    Expect.content.contains("final"),
    response=response,
    when="end"  # Defer even content checks
)
```
Assertion timing control
```python
# Evaluate specific assertions immediately
result = await session.evaluate_now_async(
    Expect.performance.response_time_under(5000),
    response=response,
    name="quick_response"
)
if not result.passed:
    print(f"Too slow: {result.details}")
    # Take corrective action

# Batch evaluate multiple assertions
results = await session.evaluate_now_async(
    Expect.tools.success_rate(0.95),
    Expect.performance.max_iterations(3)
)
```
Named assertions for better reporting
```python
# Always name your assertions for clarity
await session.assert_that(
    Expect.content.regex(r"\d+ items? found"),
    response=response,
    name="item_count_format"  # Appears in reports
)
```
Metrics and results
Accessing metrics during tests
```python
# Get current metrics
metrics = session.get_metrics()
print(f"Tool calls: {len(metrics.tool_calls)}")
print(f"Total tokens: {metrics.total_tokens}")
print(f"Duration so far: {metrics.total_duration_ms}ms")
print(f"Estimated cost: ${metrics.total_cost_usd:.4f}")

# Detailed tool information
for call in metrics.tool_calls:
    print(f"Tool: {call.name}")
    print(f"Duration: {call.duration_ms}ms")
    print(f"Success: {call.success}")
    if not call.success:
        print(f"Error: {call.error}")
```
Getting test results
```python
# Check if all assertions passed
if session.all_passed():
    print("✅ All tests passed!")
else:
    print("❌ Some tests failed")

# Get detailed results
results = session.get_results()
for result in results:
    print(f"Assertion: {result.name}")
    print(f"Passed: {result.passed}")
    if not result.passed:
        print(f"Reason: {result.details}")

# Get pass/fail summary
summary = session.get_summary()
print(f"Passed: {summary['passed']}/{summary['total']}")
print(f"Pass rate: {summary['pass_rate']:.1%}")
```
Duration tracking
```python
# Get test duration
duration_ms = session.get_duration_ms()
print(f"Test took {duration_ms/1000:.2f} seconds")

# Track specific operations
from time import time

start = time()
response = await agent.generate_str("Complex task")
operation_time = (time() - start) * 1000
if operation_time > 5000:
    print(f"Warning: Operation took {operation_time:.0f}ms")
```
OpenTelemetry traces
Accessing trace data
```python
# Get structured span tree
span_tree = session.get_span_tree()

def print_spans(span, indent=0):
    prefix = "  " * indent
    print(f"{prefix}{span.name}: {span.duration_ms}ms")
    for child in span.children:
        print_spans(child, indent + 1)

print_spans(span_tree)

# Ensure traces are written to disk
await session._ensure_traces_flushed()
```
Custom span attributes
```python
# Add custom attributes to the current span
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("custom_operation") as span:
    span.set_attribute("user_id", "123")
    span.set_attribute("operation_type", "validation")
    response = await agent.generate_str("Validate user input")
```
Artifacts and reporting
Session artifacts
```python
# Sessions automatically save artifacts
session = await TestSession.create(
    test_name="my-test",
    output_dir="test-reports",  # Custom output location
    save_artifacts=True  # Enable artifact saving
)

# After test completion, find artifacts at:
# test-reports/my-test_[timestamp]/
# ├── trace.jsonl         # OTEL traces
# ├── results.json        # Test results
# ├── metrics.json        # Performance metrics
# └── conversation.json   # Full conversation log
```
Programmatic report generation
```python
# Generate reports programmatically
from mcp_eval.reports import ReportGenerator

generator = ReportGenerator(session)

# Generate different formats
await generator.save_json("results.json")
await generator.save_markdown("results.md")
await generator.save_html("results.html")

# Get report data for custom processing
report_data = generator.get_report_data()
print(f"Test: {report_data['test_name']}")
print(f"Duration: {report_data['duration_ms']}ms")
print(f"Passed: {report_data['passed']}/{report_data['total']}")
```
Advanced patterns
Custom session hooks
```python
class CustomSession(TestSession):
    async def on_tool_call(self, tool_name: str, args: dict):
        """Hook called before each tool execution."""
        print(f"About to call {tool_name} with {args}")

        # Validate tool usage
        if tool_name == "dangerous_tool":
            raise ValueError("Dangerous tool not allowed in tests")

    async def on_assertion_complete(self, result):
        """Hook called after each assertion."""
        if not result.passed:
            # Log failures to external system
            await self.log_to_monitoring(result)
```
Session state management
```python
# Store custom state in session
session.state["test_user_id"] = "user_123"
session.state["test_context"] = {"environment": "staging"}

# Access state in assertions or hooks
user_id = session.state.get("test_user_id")
```
Parallel session execution
```python
import asyncio

async def run_test(test_name: str, prompt: str):
    async with test_session(test_name) as agent:
        response = await agent.generate_str(prompt)
        await agent.assert_that(
            Expect.content.contains("success"),
            response=response
        )
        return agent.session.all_passed()

# Run multiple tests in parallel
results = await asyncio.gather(
    run_test("test1", "Task 1"),
    run_test("test2", "Task 2"),
    run_test("test3", "Task 3")
)
print(f"All passed: {all(results)}")
```
Best practices
Use context managers: Always use async with test_session() to ensure proper cleanup, even if tests fail.
Name your assertions: Always provide descriptive names for assertions. This makes debugging much easier when reviewing test reports.
Monitor metrics: Check metrics during long-running tests to catch performance issues early.
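For instance, a minimal sketch of a mid-test checkpoint using the metrics accessors shown above; the 30-second budget is an arbitrary illustration, not a library default:

```python
# Inside a long-running test, check the running totals periodically
metrics = session.get_metrics()
if metrics.total_duration_ms > 30_000:  # assumed budget for this test
    print(
        f"Warning: {metrics.total_duration_ms:.0f}ms elapsed "
        f"across {len(metrics.tool_calls)} tool calls"
    )
```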
Error handling
```python
try:
    async with test_session("error-test") as agent:
        response = await agent.generate_str("Test prompt")
        await agent.assert_that(
            Expect.content.contains("expected"),
            response=response
        )
except TimeoutError:
    print("Test timed out - increase timeout_seconds")
except AssertionError as e:
    print(f"Assertion failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

# Session cleanup is still guaranteed
```
See also