Span-Based Evaluation
Evaluate AI system behavior by analyzing OpenTelemetry spans captured during execution.
Requires Logfire
Span-based evaluation requires logfire to be installed and configured:
```bash
pip install 'pydantic-evals[logfire]'
```
Overview
Span-based evaluation enables you to evaluate how your AI system executes, not just what it produces. This is essential for complex agents where ensuring the desired behavior depends on the execution path taken, not just the final output.
Why Span-Based Evaluation?
Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient—if the output is correct, the task succeeded. But for complex multi-step agents, the process matters as much as the result:
- A correct answer reached incorrectly - An agent might produce the right output by accident (e.g., guessing, using cached data when it should have searched, calling the wrong tools but getting lucky)
- Verification of required behaviors - You need to ensure specific tools were called, certain code paths executed, or particular patterns followed
- Performance and efficiency - The agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
- Safety and compliance - Critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, and guardrails weren't bypassed
Real-World Scenarios
Span-based evaluation is particularly valuable for:
- RAG systems - Verify documents were retrieved and reranked before generation, not just that the answer included citations
- Multi-agent coordination - Ensure the orchestrator delegated to the right specialist agents in the correct order
- Tool-calling agents - Confirm specific tools were used (or avoided), and in the expected sequence
- Debugging and regression testing - Catch behavioral regressions where outputs remain correct but the internal logic deteriorates
- Production alignment - Ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring
How It Works
When you configure logfire (logfire.configure()), Pydantic Evals captures all OpenTelemetry spans generated during task execution. You can then write evaluators that assert conditions on:
- Which tools were called - e.g. HasMatchingSpan(query={'name_contains': 'search_tool'})
- Code paths executed - Verify specific functions ran or particular branches were taken
- Timing characteristics - Check that operations complete within SLA bounds
- Error conditions - Detect retries, fallbacks, or specific failure modes
- Execution structure - Verify parent-child relationships, delegation patterns, or execution order
This creates a fundamentally different evaluation paradigm: you're testing behavioral contracts, not just input-output relationships.
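The spans being evaluated are simply the ones your code emits via OpenTelemetry instrumentation while the task runs. As a minimal sketch (the function and span names here are illustrative, not part of any API), a task might be instrumented like this:

```python
import logfire


def search_tool(query: str) -> list[str]:
    # Hypothetical tool: the span name is what span queries match against
    with logfire.span('search_tool'):
        return ['doc-1', 'doc-2']


def answer_question(question: str) -> str:
    # Hypothetical task: every span opened while it runs is captured
    with logfire.span('answer_question'):
        docs = search_tool(question)
        return f'answer based on {len(docs)} documents'
```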
Basic Usage
```python
import logfire

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

# Configure logfire to capture spans
logfire.configure(send_to_logfire='if-token-present')

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check that database was queried
        HasMatchingSpan(
            query={'name_contains': 'database_query'},
            evaluation_name='used_database',
        ),
    ],
)
```
HasMatchingSpan Evaluator
The HasMatchingSpan evaluator checks if any span matches a query:
```python
from pydantic_evals.evaluators import HasMatchingSpan

HasMatchingSpan(
    query={'name_contains': 'test'},
    evaluation_name='span_check',
)
```
Returns: bool - True if any span matches the query
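To show where this evaluator fits end to end, here is a minimal sketch of running it against an instrumented task (the task, span name, and case input are illustrative):

```python
import logfire

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

logfire.configure(send_to_logfire='if-token-present')


async def answer_question(question: str) -> str:
    # Hypothetical task: emits a span that the evaluator below matches
    with logfire.span('search_tool'):
        ...
    return 'an answer'


dataset = Dataset(
    cases=[Case(inputs='What is span-based evaluation?')],
    evaluators=[
        HasMatchingSpan(
            query={'name_contains': 'search_tool'},
            evaluation_name='used_search',
        ),
    ],
)

report = dataset.evaluate_sync(answer_question)
report.print()
```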
SpanQuery Reference
A SpanQuery is a dictionary with query conditions:
Name Conditions
Match spans by name:
```python
# Exact name match
{'name_equals': 'search_database'}

# Contains substring
{'name_contains': 'tool_call'}

# Regex pattern
{'name_matches_regex': r'llm_call_\d+'}
```
Attribute Conditions
Match spans with specific attributes:
```python
# Has specific attribute values
{'has_attributes': {'operation': 'search', 'status': 'success'}}

# Has attribute keys (any value)
{'has_attribute_keys': ['user_id', 'request_id']}
```
Duration Conditions
Match based on execution time:
```python
from datetime import timedelta

# Minimum duration
{'min_duration': 1.0}  # seconds
{'min_duration': timedelta(seconds=1)}

# Maximum duration
{'max_duration': 5.0}  # seconds
{'max_duration': timedelta(seconds=5)}

# Range
{'min_duration': 0.5, 'max_duration': 2.0}
```
Logical Operators
Combine conditions:
```python
# NOT
{'not_': {'name_contains': 'error'}}

# AND (all must match)
{'and_': [
    {'name_contains': 'tool'},
    {'max_duration': 1.0},
]}

# OR (at least one must match)
{'or_': [
    {'name_equals': 'search'},
    {'name_equals': 'query'},
]}
```
Child/Descendant Conditions
Query relationships between spans:
```python
# Count direct children
{'min_child_count': 1}
{'max_child_count': 5}

# Some child matches query
{'some_child_has': {'name_contains': 'retry'}}

# All children match query
{'all_children_have': {'max_duration': 0.5}}

# No children match query
{'no_child_has': {'has_attributes': {'error': True}}}

# Descendant queries (recursive)
{'min_descendant_count': 5}
{'some_descendant_has': {'name_contains': 'api_call'}}
```
Ancestor/Depth Conditions
Query span hierarchy:
```python
# Depth (root spans have depth 0)
{'min_depth': 1}  # Not a root span
{'max_depth': 2}  # At most 2 levels deep

# Ancestor queries
{'some_ancestor_has': {'name_equals': 'agent_run'}}
{'all_ancestors_have': {'max_duration': 10.0}}
{'no_ancestor_has': {'has_attributes': {'error': True}}}
```
Stop Recursing
Control recursive queries:
```python
{
    'some_descendant_has': {'name_contains': 'expensive'},
    'stop_recursing_when': {'name_equals': 'boundary'},
}
# Only search descendants until hitting a span named 'boundary'
```
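As the duration range example above suggests, multiple top-level conditions can also be combined in a single query dictionary, in which case a matching span must satisfy all of them. A small sketch (assuming this conjunctive behavior; the explicit and_ operator is equivalent and often clearer):

```python
# All conditions in one query must hold for the same span
{
    'name_contains': 'database',
    'has_attributes': {'operation': 'search'},
    'max_duration': 0.5,
}
```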
Practical Examples
Verify Tool Usage
Check that specific tools were called:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Must call search tool
        HasMatchingSpan(
            query={'name_contains': 'search_tool'},
            evaluation_name='used_search',
        ),
        # Must NOT call dangerous tool
        HasMatchingSpan(
            query={'not_': {'name_contains': 'delete_database'}},
            evaluation_name='safe_execution',
        ),
    ],
)
```
Check Multiple Tools
Verify that each step of a multi-step workflow ran:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    HasMatchingSpan(
        query={'name_contains': 'retrieve_context'},
        evaluation_name='retrieved_context',
    ),
    HasMatchingSpan(
        query={'name_contains': 'generate_response'},
        evaluation_name='generated_response',
    ),
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'cite'},
            {'has_attribute_keys': ['source_id']},
        ]},
        evaluation_name='added_citations',
    ),
]
```
Performance Assertions
Ensure operations meet latency requirements:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    # Database queries should be fast
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'database'},
            {'max_duration': 0.1},  # 100ms max
        ]},
        evaluation_name='fast_db_queries',
    ),
    # Overall should complete quickly
    HasMatchingSpan(
        query={'and_': [
            {'name_equals': 'task_execution'},
            {'max_duration': 2.0},
        ]},
        evaluation_name='within_sla',
    ),
]
```
Error Detection
Check for error conditions:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    # No errors occurred
    HasMatchingSpan(
        query={'not_': {'has_attributes': {'error': True}}},
        evaluation_name='no_errors',
    ),
    # Retries happened
    HasMatchingSpan(
        query={'name_contains': 'retry'},
        evaluation_name='had_retries',
    ),
    # Fallback was used
    HasMatchingSpan(
        query={'name_contains': 'fallback_model'},
        evaluation_name='used_fallback',
    ),
]
```
Complex Behavioral Checks
Verify sophisticated behavior patterns:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    # Agent delegated to sub-agent
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'agent'},
            {'some_child_has': {'name_contains': 'delegate'}},
        ]},
        evaluation_name='used_delegation',
    ),
    # Made multiple LLM calls with retries
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'llm_call'},
            {'some_descendant_has': {'name_contains': 'retry'}},
            {'min_descendant_count': 3},
        ]},
        evaluation_name='retry_pattern',
    ),
]
```
Custom Evaluators with SpanTree
For more complex span analysis, write custom evaluators:
```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class CustomSpanCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | int]:
        span_tree = ctx.span_tree

        # Find specific spans
        llm_spans = span_tree.find(lambda node: 'llm' in node.name)
        tool_spans = span_tree.find(lambda node: 'tool' in node.name)

        # Calculate metrics
        total_llm_time = sum(
            span.duration.total_seconds() for span in llm_spans
        )

        return {
            'used_llm': len(llm_spans) > 0,
            'used_tools': len(tool_spans) > 0,
            'tool_count': len(tool_spans),
            'llm_fast': total_llm_time < 2.0,
        }
```
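A custom span evaluator is attached to a dataset in the same way as the built-in ones; a brief sketch (the case contents and span query are illustrative):

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        CustomSpanCheck(),  # the custom evaluator defined above
        HasMatchingSpan(
            query={'name_contains': 'search_tool'},
            evaluation_name='used_search',
        ),
    ],
)
```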
SpanTree API
The SpanTree provides methods for span analysis:
```python
from pydantic_evals.otel import SpanTree


# Example API (requires span_tree from context)
def example_api(span_tree: SpanTree) -> None:
    span_tree.find(lambda n: True)  # Find all matching nodes
    span_tree.any({'name_contains': 'test'})  # Check if any span matches
    span_tree.all({'name_contains': 'test'})  # Check if all spans match
    span_tree.count({'name_contains': 'test'})  # Count matching spans

    # Iteration
    for node in span_tree:
        print(node.name, node.duration, node.attributes)
```
SpanNode Properties
Each SpanNode has:
```python
from pydantic_evals.otel import SpanNode


# Example properties (requires node from context)
def example_properties(node: SpanNode) -> None:
    _ = node.name  # Span name
    _ = node.duration  # timedelta
    _ = node.attributes  # dict[str, AttributeValue]
    _ = node.start_timestamp  # datetime
    _ = node.end_timestamp  # datetime
    _ = node.children  # list[SpanNode]
    _ = node.descendants  # list[SpanNode] (recursive)
    _ = node.ancestors  # list[SpanNode]
    _ = node.parent  # SpanNode | None
```
Debugging Span Queries
View Spans in Logfire
If you're sending data to Logfire, you can view all spans in the web UI to understand the trace structure.
Print Span Tree
```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class DebugSpans(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        for node in ctx.span_tree:
            print(f"{' ' * len(node.ancestors)}{node.name} ({node.duration})")
        return True
```
Query Testing
Test queries incrementally:
```python
from pydantic_evals.evaluators import HasMatchingSpan

# Start simple
query = {'name_contains': 'tool'}

# Add conditions gradually
query = {'and_': [
    {'name_contains': 'tool'},
    {'max_duration': 1.0},
]}

# Test in evaluator
HasMatchingSpan(query=query, evaluation_name='test')
```
Use Cases
RAG System Verification
Verify retrieval-augmented generation workflow:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    # Retrieved documents
    HasMatchingSpan(
        query={'name_contains': 'vector_search'},
        evaluation_name='retrieved_docs',
    ),
    # Reranked results
    HasMatchingSpan(
        query={'name_contains': 'rerank'},
        evaluation_name='reranked_results',
    ),
    # Generated with context
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'generate'},
            {'has_attribute_keys': ['context_ids']},
        ]},
        evaluation_name='used_context',
    ),
]
```
Multi-Agent Systems
Verify agent coordination:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    # Master agent ran
    HasMatchingSpan(
        query={'name_equals': 'master_agent'},
        evaluation_name='master_ran',
    ),
    # Delegated to specialist
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'specialist_agent'},
            {'some_ancestor_has': {'name_equals': 'master_agent'}},
        ]},
        evaluation_name='delegated_correctly',
    ),
    # No circular delegation
    HasMatchingSpan(
        query={'not_': {'and_': [
            {'name_contains': 'agent'},
            {'some_descendant_has': {'name_contains': 'agent'}},
            {'some_ancestor_has': {'name_contains': 'agent'}},
        ]}},
        evaluation_name='no_circular_delegation',
    ),
]
```
Tool Usage Patterns
Verify intelligent tool selection:
```python
from pydantic_evals.evaluators import HasMatchingSpan

evaluators = [
    # Used search before answering
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'search'},
            {'some_ancestor_has': {'name_contains': 'answer'}},
        ]},
        evaluation_name='searched_before_answering',
    ),
    # Limited tool calls (no loops)
    HasMatchingSpan(
        query={'and_': [
            {'name_contains': 'tool'},
            {'max_child_count': 5},
        ]},
        evaluation_name='reasonable_tool_usage',
    ),
]
```
Best Practices
- Start Simple: Begin with basic name queries, add complexity as needed
- Use Descriptive Names: Name your spans well in your application code
- Test Queries: Verify queries work before running full evaluations
- Combine with Other Evaluators: Use span checks alongside output validation
- Document Expectations: Comment why specific spans should/shouldn't exist
Next Steps
- Logfire Integration - Set up Logfire for span capture
- Custom Evaluators - Write advanced span analysis
- Built-in Evaluators - Other evaluator types