
Add health monitoring and circuit breaker to vMCP server #3036

@yrobla

Description

Implement health monitoring and circuit breaker patterns to detect and handle backend failures gracefully.

Scope:

  • Periodic health checks for backend MCP servers
  • Backend health status tracking and reporting
  • Circuit breaker implementation for failing backends
  • Automatic backend removal/restoration based on health
  • Health status reflected in vMCP status and capabilities

Key Components

1. Backend Health Checks

  • Periodic health check requests to each backend
  • Configurable check interval (default: 30s)
  • Track consecutive failures
  • Mark backend unhealthy after threshold (default: 3 failures)
  • Health check endpoint: MCP ping or tools/list
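
A minimal sketch of how the periodic check loop could look, assuming one background goroutine per backend; the `HealthChecker` type, its fields, and the `ping` callback are illustrative, not existing ToolHive identifiers:

```go
// Hypothetical sketch: one background health-check loop per backend MCP server.
package health

import (
	"context"
	"log/slog"
	"sync/atomic"
	"time"
)

type HealthChecker struct {
	interval    time.Duration                   // check interval, default 30s
	maxFailures int64                           // unhealthy threshold, default 3
	consecutive atomic.Int64                    // consecutive failure counter
	ping        func(ctx context.Context) error // probe, e.g. MCP ping or tools/list
	onUnhealthy func()                          // mark backend unhealthy, update capabilities
	onHealthy   func()                          // restore backend when it recovers
}

// Run probes the backend until the context is cancelled (e.g. on shutdown).
func (h *HealthChecker) Run(ctx context.Context) {
	ticker := time.NewTicker(h.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := h.ping(ctx); err != nil {
				n := h.consecutive.Add(1)
				slog.Warn("backend health check failed", "failures", n, "error", err)
				if n == h.maxFailures {
					h.onUnhealthy() // threshold crossed: mark unhealthy once
				}
				continue
			}
			if h.consecutive.Swap(0) >= h.maxFailures {
				h.onHealthy() // backend recovered after being marked unhealthy
			}
		}
	}
}
```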

2. Health Status Tracking

  • Per-backend health state: healthy, unhealthy, unknown
  • Last successful health check timestamp
  • Failure count and error messages
  • Health status exposed in vMCP status/metrics
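
One possible shape for the tracked state exposed via status/metrics; the type and field names (`Tracker`, `BackendHealth`) are illustrative:

```go
// Illustrative per-backend health record and thread-safe tracker.
package health

import (
	"sync"
	"time"
)

type Status string

const (
	StatusUnknown   Status = "unknown" // no check has completed yet
	StatusHealthy   Status = "healthy"
	StatusUnhealthy Status = "unhealthy"
)

// BackendHealth is a snapshot of one backend, suitable for status/metrics output.
type BackendHealth struct {
	Status       Status    `json:"status"`
	LastSuccess  time.Time `json:"last_success"`
	FailureCount int       `json:"failure_count"`
	LastError    string    `json:"last_error,omitempty"`
}

// Tracker caches health snapshots keyed by backend name, so callers do not
// trigger repeated checks.
type Tracker struct {
	mu       sync.RWMutex
	backends map[string]BackendHealth
}

func (t *Tracker) Set(name string, h BackendHealth) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.backends == nil {
		t.backends = make(map[string]BackendHealth)
	}
	t.backends[name] = h
}

func (t *Tracker) Get(name string) (BackendHealth, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	h, ok := t.backends[name]
	return h, ok
}
```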

3. Circuit Breaker

  • Three states: closed (normal), open (failing), half-open (testing recovery)
  • Configurable failure threshold (default: 5 failures)
  • Configurable timeout for open state (default: 60s)
  • Automatic transition to half-open for recovery testing
  • Track circuit breaker state per backend
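
A sketch of per-backend breakers using sony/gobreaker (v1 API), which the implementation notes suggest considering; the wiring into vMCP's routing layer is hypothetical:

```go
// Sketch using sony/gobreaker with the defaults listed in this issue.
package health

import (
	"time"

	"github.com/sony/gobreaker"
)

// newBackendBreaker builds one breaker per backend.
func newBackendBreaker(backendName string) *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    backendName,
		Timeout: 60 * time.Second, // open -> half-open after 60s
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.ConsecutiveFailures >= 5 // open after 5 consecutive failures
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			// Log the transition and emit a metric (closed/open/half-open).
		},
	})
}

// callBackend forwards a request through the breaker; while the breaker is open,
// Execute returns gobreaker.ErrOpenState without touching the backend.
func callBackend(cb *gobreaker.CircuitBreaker, do func() (any, error)) (any, error) {
	return cb.Execute(do)
}
```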

4. Backend Availability Management

  • Remove unhealthy backend tools from aggregated capabilities
  • Return error when routing to unavailable backend
  • Automatically restore backend when health recovers
  • Log backend state transitions (healthy ↔ unhealthy)
  • Emit metrics for monitoring systems
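
A rough sketch of how capability filtering and routing checks could consult the tracker from the earlier sketch; the `Tool` type and function names stand in for vMCP's real aggregation types:

```go
// Hypothetical capability filtering and routing guard, reusing the Tracker above.
package health

import "fmt"

type Tool struct {
	Name    string
	Backend string // backend MCP server that owns this tool
}

// FilterHealthy keeps only tools whose backend is currently healthy, so tools
// from unhealthy backends drop out of the aggregated capabilities.
func FilterHealthy(tools []Tool, t *Tracker) []Tool {
	out := make([]Tool, 0, len(tools))
	for _, tool := range tools {
		if h, ok := t.Get(tool.Backend); ok && h.Status == StatusHealthy {
			out = append(out, tool)
		}
	}
	return out
}

// CheckRoutable returns an error when a request targets an unavailable backend.
func CheckRoutable(backend string, t *Tracker) error {
	if h, ok := t.Get(backend); !ok || h.Status != StatusHealthy {
		return fmt.Errorf("backend %q is currently unavailable", backend)
	}
	return nil
}
```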

5. Failure Modes

  • fail mode (default): Fail the entire request if a backend is unavailable
  • best_effort mode: Return partial results and include errors for the failed backends
  • Configurable per vMCP instance
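
A sketch of how the two modes could be applied when aggregating fan-out results; `FailureMode`, `Result`, and `Aggregate` are placeholder names:

```go
// Illustrative handling of the two failure modes.
package health

import (
	"errors"
	"fmt"
)

type FailureMode string

const (
	FailureModeFail       FailureMode = "fail"        // default: any backend failure fails the request
	FailureModeBestEffort FailureMode = "best_effort" // partial results plus per-backend errors
)

// Result is one backend's outcome for a fanned-out request.
type Result struct {
	Backend string
	Data    any
	Err     error
}

// Aggregate applies the configured failure mode to the collected results.
func Aggregate(mode FailureMode, results []Result) ([]Result, error) {
	var errs []error
	ok := make([]Result, 0, len(results))
	for _, r := range results {
		if r.Err != nil {
			errs = append(errs, fmt.Errorf("backend %s: %w", r.Backend, r.Err))
			continue
		}
		ok = append(ok, r)
	}
	if len(errs) > 0 && mode == FailureModeFail {
		return nil, errors.Join(errs...) // fail the entire request
	}
	// best_effort: return partial results; the caller reports errs alongside them.
	return ok, errors.Join(errs...)
}
```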

6. Integration Points

  • Update aggregated capabilities when backend health changes
  • Routing layer checks health before forwarding
  • Status reporting includes backend health summary
  • Metrics exported for Prometheus/observability
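
A sketch of metric emission using prometheus/client_golang; the metric names here are invented, and the real code should follow the existing conventions in pkg/telemetry/ (which may use a different telemetry stack):

```go
// Hypothetical health metrics for Prometheus scraping.
package health

import "github.com/prometheus/client_golang/prometheus"

var (
	backendHealthy = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "vmcp_backend_healthy", // hypothetical metric name
		Help: "1 if the backend is currently healthy, 0 otherwise.",
	}, []string{"backend"})

	healthCheckFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "vmcp_backend_health_check_failures_total", // hypothetical metric name
		Help: "Total number of failed health checks per backend.",
	}, []string{"backend"})
)

func init() {
	prometheus.MustRegister(backendHealthy, healthCheckFailures)
}

// recordTransition updates the gauge whenever a backend changes health state.
func recordTransition(backend string, healthy bool) {
	v := 0.0
	if healthy {
		v = 1
	}
	backendHealthy.WithLabelValues(backend).Set(v)
}
```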

Implementation Notes

  • Health checks run in background goroutines
  • Use context for cancellation on shutdown
  • Circuit breaker prevents cascading failures
  • Health status should be cached to avoid repeated checks
  • Transition events should be logged and emitted as metrics
  • Follow existing ToolHive observability patterns in pkg/telemetry/
  • Consider using sony/gobreaker or a similar library
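
For reference, the knobs above could be grouped into a single config struct; the field names, YAML keys, and the `FailureMode` type (from the failure-modes sketch) are illustrative only:

```go
// One possible configuration shape with the defaults from this issue.
package health

import "time"

type Config struct {
	CheckInterval      time.Duration `yaml:"check_interval"`       // default 30s
	UnhealthyThreshold int           `yaml:"unhealthy_threshold"`  // default 3 consecutive failures
	BreakerThreshold   int           `yaml:"breaker_threshold"`    // default 5 failures to open
	BreakerOpenTimeout time.Duration `yaml:"breaker_open_timeout"` // default 60s before half-open
	FailureMode        FailureMode   `yaml:"failure_mode"`         // "fail" (default) or "best_effort"
}

// DefaultConfig mirrors the defaults listed in this issue.
func DefaultConfig() Config {
	return Config{
		CheckInterval:      30 * time.Second,
		UnhealthyThreshold: 3,
		BreakerThreshold:   5,
		BreakerOpenTimeout: 60 * time.Second,
		FailureMode:        FailureModeFail,
	}
}
```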

Reference

Acceptance Criteria

  • Periodic health check implementation
  • Configurable health check interval
  • Backend health state tracking (healthy/unhealthy/unknown)
  • Consecutive failure counting
  • Unhealthy threshold configuration
  • Circuit breaker implementation (closed/open/half-open states)
  • Configurable failure threshold for circuit breaker
  • Configurable timeout for open state
  • Automatic transition to half-open for recovery testing
  • Remove unhealthy backend tools from capabilities
  • Error response when routing to unavailable backend
  • Automatic restoration when backend recovers
  • Health state transition logging
  • Partial failure mode support (fail vs best_effort)
  • Metrics emission for observability
  • Unit tests for health check logic
  • Unit tests for circuit breaker state machine
  • Integration tests with flaky mock backends
  • E2E tests with backend failure and recovery scenarios
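
A sketch of what a state-machine unit test might look like, driving a gobreaker instance from closed to open to half-open with a flaky in-memory "backend"; timings and assertions are illustrative:

```go
// Hypothetical unit-test sketch for the circuit breaker state machine.
package health

import (
	"errors"
	"testing"
	"time"

	"github.com/sony/gobreaker"
)

func TestBreakerOpensAndRecovers(t *testing.T) {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    "flaky-backend",
		Timeout: 50 * time.Millisecond, // short open timeout to keep the test fast
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.ConsecutiveFailures >= 5
		},
	})

	// Five consecutive failures should trip the breaker open.
	for i := 0; i < 5; i++ {
		_, _ = cb.Execute(func() (any, error) { return nil, errors.New("backend down") })
	}
	if cb.State() != gobreaker.StateOpen {
		t.Fatalf("expected open state, got %v", cb.State())
	}

	// After the open timeout the next call runs in half-open; a success closes it.
	time.Sleep(60 * time.Millisecond)
	if _, err := cb.Execute(func() (any, error) { return "ok", nil }); err != nil {
		t.Fatalf("expected recovery probe to succeed, got %v", err)
	}
	if cb.State() != gobreaker.StateClosed {
		t.Fatalf("expected closed state, got %v", cb.State())
	}
}
```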

Labels

api, enhancement, go, telemetry, vmcp
