Open
Labels
api, enhancement, go, telemetry, vmcp
Description
Implement health monitoring and circuit breaker patterns to detect and handle backend failures gracefully.
Scope:
- Periodic health checks for backend MCP servers
- Backend health status tracking and reporting
- Circuit breaker implementation for failing backends
- Automatic backend removal/restoration based on health
- Health status reflected in vMCP status and capabilities
Key Components
1. Backend Health Checks
- Periodic health check requests to each backend
- Configurable check interval (default: 30s)
- Track consecutive failures
- Mark backend unhealthy after threshold (default: 3 failures)
- Health check endpoint: MCP `ping` or `tools/list`
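The failure counting described above can be sketched as follows. This is a minimal illustration, not the ToolHive implementation: `Tracker`, `BackendHealth`, `Record`, and `ErrCheckFailed` are hypothetical names, and the sketch assumes a backend keeps its previous state until the consecutive-failure threshold is reached.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// ErrCheckFailed is a placeholder for a failed ping or tools/list call (hypothetical).
var ErrCheckFailed = errors.New("health check failed")

// HealthState mirrors the three states named in the issue.
type HealthState string

const (
	StateUnknown   HealthState = "unknown"
	StateHealthy   HealthState = "healthy"
	StateUnhealthy HealthState = "unhealthy"
)

// BackendHealth is a hypothetical per-backend record.
type BackendHealth struct {
	State               HealthState
	ConsecutiveFailures int
	LastSuccess         time.Time
	LastError           string
}

// Tracker counts consecutive failures per backend and marks a backend
// unhealthy once the threshold is reached.
type Tracker struct {
	mu        sync.Mutex
	threshold int
	backends  map[string]*BackendHealth
}

func NewTracker(threshold int) *Tracker {
	return &Tracker{threshold: threshold, backends: make(map[string]*BackendHealth)}
}

// Record applies one health check result and returns the resulting state.
func (t *Tracker) Record(backend string, err error) HealthState {
	t.mu.Lock()
	defer t.mu.Unlock()
	h, ok := t.backends[backend]
	if !ok {
		h = &BackendHealth{State: StateUnknown}
		t.backends[backend] = h
	}
	if err == nil {
		// Any success resets the counter and restores the backend.
		h.State = StateHealthy
		h.ConsecutiveFailures = 0
		h.LastSuccess = time.Now()
		h.LastError = ""
		return h.State
	}
	h.ConsecutiveFailures++
	h.LastError = err.Error()
	if h.ConsecutiveFailures >= t.threshold {
		h.State = StateUnhealthy
	}
	return h.State
}

// Get returns a copy of the record, or an unknown record for unseen backends.
func (t *Tracker) Get(backend string) BackendHealth {
	t.mu.Lock()
	defer t.mu.Unlock()
	if h, ok := t.backends[backend]; ok {
		return *h
	}
	return BackendHealth{State: StateUnknown}
}
```

With the default threshold of 3, `NewTracker(3)` would leave a backend in its previous state after two failed checks and mark it unhealthy on the third.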
2. Health Status Tracking
- Per-backend health state: `healthy`, `unhealthy`, `unknown`
- Last successful health check timestamp
- Failure count and error messages
- Health status exposed in vMCP status/metrics
3. Circuit Breaker
- Three states: `closed` (normal), `open` (failing), `half-open` (testing recovery)
- Configurable failure threshold (default: 5 failures)
- Configurable timeout for open state (default: 60s)
- Automatic transition to half-open for recovery testing
- Track circuit breaker state per backend
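As a rough sketch of the three-state machine above, assuming one breaker instance per backend: `Breaker`, `Allow`, `Report`, and `ErrCircuitOpen` are illustrative names rather than an existing API. A library such as `sony/gobreaker` would replace most of this code. For brevity the sketch lets every request through while half-open, where a production breaker would usually limit the number of probes.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// BreakerState is one of the three circuit breaker states.
type BreakerState int

const (
	Closed   BreakerState = iota // normal operation
	Open                         // failing, requests rejected
	HalfOpen                     // testing recovery
)

// ErrCircuitOpen is returned while the breaker rejects requests (hypothetical name).
var ErrCircuitOpen = errors.New("circuit breaker is open")

// Breaker is a minimal per-backend circuit breaker.
type Breaker struct {
	mu          sync.Mutex
	state       BreakerState
	failures    int
	threshold   int           // failures before opening (issue default: 5)
	openTimeout time.Duration // time spent open before probing (issue default: 60s)
	openedAt    time.Time
}

func NewBreaker(threshold int, openTimeout time.Duration) *Breaker {
	return &Breaker{threshold: threshold, openTimeout: openTimeout}
}

// Allow reports whether a request may proceed. Once the open timeout has
// elapsed it moves the breaker from open to half-open.
func (b *Breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open {
		if time.Since(b.openedAt) < b.openTimeout {
			return ErrCircuitOpen
		}
		b.state = HalfOpen
	}
	return nil
}

// Report records the outcome of a request that Allow let through.
func (b *Breaker) Report(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		// Any success closes the breaker and clears the failure count.
		b.state = Closed
		b.failures = 0
		return
	}
	b.failures++
	// A failed probe in half-open reopens immediately.
	if b.state == HalfOpen || b.failures >= b.threshold {
		b.state = Open
		b.openedAt = time.Now()
	}
}

// State returns the current state.
func (b *Breaker) State() BreakerState {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}
```

A caller would wrap each backend request as `Allow`, then the request, then `Report(err)`, so that an open breaker short-circuits before any network call is made.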
4. Backend Availability Management
- Remove unhealthy backend tools from aggregated capabilities
- Return error when routing to unavailable backend
- Automatically restore backend when health recovers
- Log backend state transitions (healthy ↔ unhealthy)
- Emit metrics for monitoring systems
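One way to picture the capability filtering and routing checks above is the sketch below. `Tool`, `AvailableTools`, and `Route` are hypothetical names, and the sketch assumes that a backend missing from the health map (state unknown) is treated as unavailable.

```go
package main

import (
	"errors"
	"sort"
)

// Hypothetical sentinel errors for the routing layer.
var (
	ErrUnknownTool        = errors.New("unknown tool")
	ErrBackendUnavailable = errors.New("backend unavailable")
)

// Tool ties an aggregated tool name to the backend that serves it.
type Tool struct {
	Name    string
	Backend string
}

// AvailableTools returns only the tools whose backend is currently healthy,
// sorted by name so the aggregated capability list is stable.
func AvailableTools(all []Tool, healthy map[string]bool) []Tool {
	out := make([]Tool, 0, len(all))
	for _, t := range all {
		if healthy[t.Backend] {
			out = append(out, t)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// Route resolves a tool name to its backend and rejects the call when that
// backend is not healthy.
func Route(name string, all []Tool, healthy map[string]bool) (string, error) {
	for _, t := range all {
		if t.Name != name {
			continue
		}
		if !healthy[t.Backend] {
			return "", ErrBackendUnavailable
		}
		return t.Backend, nil
	}
	return "", ErrUnknownTool
}
```

Because both functions read the same health map, restoring a backend only requires flipping its entry back to `true`: its tools reappear in the aggregated list and routing succeeds again.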
5. Failure Modes
- fail mode (default): Fail entire request if backend unavailable
- best_effort mode: Return partial results, include errors for failed backends
- Configurable per vMCP instance
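The difference between the two modes can be sketched as below. `FailureMode`, `BackendResult`, `Aggregate`, and `Combine` are illustrative names, and the sketch assumes that any mode other than `best_effort` falls back to the default `fail` behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBackendDown is a placeholder backend error (hypothetical).
var ErrBackendDown = errors.New("backend down")

// FailureMode selects how partial backend failures are handled.
type FailureMode string

const (
	ModeFail       FailureMode = "fail"
	ModeBestEffort FailureMode = "best_effort"
)

// BackendResult is the outcome of querying one backend.
type BackendResult struct {
	Backend string
	Tools   []string
	Err     error
}

// Aggregate is the combined response. Errors maps backend name to message
// and is only populated in best_effort mode.
type Aggregate struct {
	Tools  []string
	Errors map[string]string
}

// Combine merges per-backend results according to the failure mode.
func Combine(mode FailureMode, results []BackendResult) (Aggregate, error) {
	agg := Aggregate{Errors: make(map[string]string)}
	for _, r := range results {
		if r.Err != nil {
			if mode != ModeBestEffort {
				// fail mode: one unavailable backend fails the whole request.
				return Aggregate{}, fmt.Errorf("backend %q unavailable: %w", r.Backend, r.Err)
			}
			// best_effort mode: keep going and report the error alongside results.
			agg.Errors[r.Backend] = r.Err.Error()
			continue
		}
		agg.Tools = append(agg.Tools, r.Tools...)
	}
	return agg, nil
}
```

In `best_effort` mode the caller receives both the partial tool list and a per-backend error map, so clients can tell a complete response from a degraded one.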
6. Integration Points
- Update aggregated capabilities when backend health changes
- Routing layer checks health before forwarding
- Status reporting includes backend health summary
- Metrics exported for Prometheus/observability
Implementation Notes
- Health checks run in background goroutines
- Use context for cancellation on shutdown
- Circuit breaker prevents cascading failures
- Health status should be cached to avoid repeated checks
- Transition events should be logged and emitted as metrics
- Follow existing ToolHive observability patterns in `pkg/telemetry/`
- Consider using `sony/gobreaker` or similar library
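The background-goroutine and context-cancellation notes above might look roughly like this. `Monitor`, `NewMonitor`, `Start`, and `Stop` are hypothetical names. To keep the sketch short, the check function takes only the backend name, where a real check would likely also accept a `context.Context` with a timeout.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// Monitor runs one health check loop per backend in background goroutines.
type Monitor struct {
	interval time.Duration // must be > 0 (issue default: 30s)
	check    func(backend string) error
	onResult func(backend string, err error)
	cancel   context.CancelFunc
	wg       sync.WaitGroup
}

func NewMonitor(interval time.Duration, check func(backend string) error,
	onResult func(backend string, err error)) *Monitor {
	return &Monitor{interval: interval, check: check, onResult: onResult}
}

// Start launches the check loops. Each backend is checked once immediately
// and then on every tick.
func (m *Monitor) Start(backends []string) {
	ctx, cancel := context.WithCancel(context.Background())
	m.cancel = cancel
	for _, b := range backends {
		m.wg.Add(1)
		go m.loop(ctx, b)
	}
}

func (m *Monitor) loop(ctx context.Context, backend string) {
	defer m.wg.Done()
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()
	for {
		m.onResult(backend, m.check(backend))
		select {
		case <-ctx.Done():
			return // shutdown requested
		case <-ticker.C:
		}
	}
}

// Stop cancels the loops and waits for all goroutines to exit.
func (m *Monitor) Stop() {
	if m.cancel != nil {
		m.cancel()
	}
	m.wg.Wait()
}
```

The `onResult` callback is called concurrently from each loop, so it is the natural place to update cached health state, log transitions, and emit metrics, provided it is safe for concurrent use.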
Reference
- vMCP Design Proposal - Backend Unavailability, Partial Failures, Circuit Breaker sections
Acceptance Criteria
- Periodic health check implementation
- Configurable health check interval
- Backend health state tracking (healthy/unhealthy/unknown)
- Consecutive failure counting
- Unhealthy threshold configuration
- Circuit breaker implementation (closed/open/half-open states)
- Configurable failure threshold for circuit breaker
- Configurable timeout for open state
- Automatic transition to half-open for recovery testing
- Remove unhealthy backend tools from capabilities
- Error response when routing to unavailable backend
- Automatic restoration when backend recovers
- Health state transition logging
- Partial failure mode support (fail vs best_effort)
- Metrics emission for observability
- Unit tests for health check logic
- Unit tests for circuit breaker state machine
- Integration tests with flaky mock backends
- E2E tests with backend failure and recovery scenarios