
Add health monitoring and circuit breaker to vMCP server #3036

@yrobla

Description

Implement health monitoring and circuit breaker patterns to detect and handle backend failures gracefully.

Scope:

  • Periodic health checks for backend MCP servers
  • Backend health status tracking and reporting
  • Circuit breaker implementation for failing backends
  • Automatic backend removal/restoration based on health
  • Health status reflected in vMCP status and capabilities

Key Components

1. Backend Health Checks

  • Periodic health check requests to each backend
  • Configurable check interval (default: 30s)
  • Track consecutive failures
  • Mark backend unhealthy after threshold (default: 3 failures)
  • Health check endpoint: MCP ping or tools/list
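
A minimal sketch of how the periodic check loop could look, assuming one background goroutine per backend; the `HealthChecker` type, its fields, and the `ping` callback are illustrative, not existing ToolHive identifiers:

```go
// Hypothetical sketch: one background health-check loop per backend MCP server.
package health

import (
	"context"
	"log/slog"
	"sync/atomic"
	"time"
)

type HealthChecker struct {
	interval    time.Duration                   // check interval, default 30s
	maxFailures int64                           // unhealthy threshold, default 3
	consecutive atomic.Int64                    // consecutive failure counter
	ping        func(ctx context.Context) error // probe, e.g. MCP ping or tools/list
	onUnhealthy func()                          // mark backend unhealthy, update capabilities
	onHealthy   func()                          // restore backend when it recovers
}

// Run probes the backend until the context is cancelled (e.g. on shutdown).
func (h *HealthChecker) Run(ctx context.Context) {
	ticker := time.NewTicker(h.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := h.ping(ctx); err != nil {
				n := h.consecutive.Add(1)
				slog.Warn("backend health check failed", "failures", n, "error", err)
				if n == h.maxFailures {
					h.onUnhealthy() // threshold crossed: mark unhealthy once
				}
				continue
			}
			if h.consecutive.Swap(0) >= h.maxFailures {
				h.onHealthy() // backend recovered after being marked unhealthy
			}
		}
	}
}
```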

2. Health Status Tracking

  • Per-backend health state: healthy, unhealthy, unknown
  • Last successful health check timestamp
  • Failure count and error messages
  • Health status exposed in vMCP status/metrics
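
One possible shape for the tracked state exposed via status/metrics; the type and field names (`Tracker`, `BackendHealth`) are illustrative:

```go
// Illustrative per-backend health record and thread-safe tracker.
package health

import (
	"sync"
	"time"
)

type Status string

const (
	StatusUnknown   Status = "unknown" // no check has completed yet
	StatusHealthy   Status = "healthy"
	StatusUnhealthy Status = "unhealthy"
)

// BackendHealth is a snapshot of one backend, suitable for status/metrics output.
type BackendHealth struct {
	Status       Status    `json:"status"`
	LastSuccess  time.Time `json:"last_success"`
	FailureCount int       `json:"failure_count"`
	LastError    string    `json:"last_error,omitempty"`
}

// Tracker caches health snapshots keyed by backend name, so callers do not
// trigger repeated checks.
type Tracker struct {
	mu       sync.RWMutex
	backends map[string]BackendHealth
}

func (t *Tracker) Set(name string, h BackendHealth) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.backends == nil {
		t.backends = make(map[string]BackendHealth)
	}
	t.backends[name] = h
}

func (t *Tracker) Get(name string) (BackendHealth, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	h, ok := t.backends[name]
	return h, ok
}
```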

3. Circuit Breaker

  • Three states: closed (normal), open (failing), half-open (testing recovery)
  • Configurable failure threshold (default: 5 failures)
  • Configurable timeout for open state (default: 60s)
  • Automatic transition to half-open for recovery testing
  • Track circuit breaker state per backend
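
A sketch of per-backend breakers using sony/gobreaker (v1 API), which the implementation notes suggest considering; the wiring into vMCP's routing layer is hypothetical:

```go
// Sketch using sony/gobreaker with the defaults listed in this issue.
package health

import (
	"time"

	"github.com/sony/gobreaker"
)

// newBackendBreaker builds one breaker per backend.
func newBackendBreaker(backendName string) *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    backendName,
		Timeout: 60 * time.Second, // open -> half-open after 60s
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.ConsecutiveFailures >= 5 // open after 5 consecutive failures
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			// Log the transition and emit a metric (closed/open/half-open).
		},
	})
}

// callBackend forwards a request through the breaker; while the breaker is open,
// Execute returns gobreaker.ErrOpenState without touching the backend.
func callBackend(cb *gobreaker.CircuitBreaker, do func() (any, error)) (any, error) {
	return cb.Execute(do)
}
```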

4. Backend Availability Management

  • Remove unhealthy backend tools from aggregated capabilities
  • Return error when routing to unavailable backend
  • Automatically restore backend when health recovers
  • Log backend state transitions (healthy ↔ unhealthy)
  • Emit metrics for monitoring systems
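
A rough sketch of how capability filtering and routing checks could consult the tracker from the earlier sketch; the `Tool` type and function names stand in for vMCP's real aggregation types:

```go
// Hypothetical capability filtering and routing guard, reusing the Tracker above.
package health

import "fmt"

type Tool struct {
	Name    string
	Backend string // backend MCP server that owns this tool
}

// FilterHealthy keeps only tools whose backend is currently healthy, so tools
// from unhealthy backends drop out of the aggregated capabilities.
func FilterHealthy(tools []Tool, t *Tracker) []Tool {
	out := make([]Tool, 0, len(tools))
	for _, tool := range tools {
		if h, ok := t.Get(tool.Backend); ok && h.Status == StatusHealthy {
			out = append(out, tool)
		}
	}
	return out
}

// CheckRoutable returns an error when a request targets an unavailable backend.
func CheckRoutable(backend string, t *Tracker) error {
	if h, ok := t.Get(backend); !ok || h.Status != StatusHealthy {
		return fmt.Errorf("backend %q is currently unavailable", backend)
	}
	return nil
}
```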

5. Failure Modes

  • fail mode (default): Fail the entire request if a backend is unavailable
  • best_effort mode: Return partial results and include errors for the failed backends
  • Configurable per vMCP instance
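
A sketch of how the two modes could be applied when aggregating fan-out results; `FailureMode`, `Result`, and `Aggregate` are placeholder names:

```go
// Illustrative handling of the two failure modes.
package health

import (
	"errors"
	"fmt"
)

type FailureMode string

const (
	FailureModeFail       FailureMode = "fail"        // default: any backend failure fails the request
	FailureModeBestEffort FailureMode = "best_effort" // partial results plus per-backend errors
)

// Result is one backend's outcome for a fanned-out request.
type Result struct {
	Backend string
	Data    any
	Err     error
}

// Aggregate applies the configured failure mode to the collected results.
func Aggregate(mode FailureMode, results []Result) ([]Result, error) {
	var errs []error
	ok := make([]Result, 0, len(results))
	for _, r := range results {
		if r.Err != nil {
			errs = append(errs, fmt.Errorf("backend %s: %w", r.Backend, r.Err))
			continue
		}
		ok = append(ok, r)
	}
	if len(errs) > 0 && mode == FailureModeFail {
		return nil, errors.Join(errs...) // fail the entire request
	}
	// best_effort: return partial results; the caller reports errs alongside them.
	return ok, errors.Join(errs...)
}
```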

6. Integration Points

  • Update aggregated capabilities when backend health changes
  • Routing layer checks health before forwarding
  • Status reporting includes backend health summary
  • Metrics exported for Prometheus/observability
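
A sketch of metric emission using prometheus/client_golang; the metric names here are invented, and the real code should follow the existing conventions in pkg/telemetry/ (which may use a different telemetry stack):

```go
// Hypothetical health metrics for Prometheus scraping.
package health

import "github.com/prometheus/client_golang/prometheus"

var (
	backendHealthy = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "vmcp_backend_healthy", // hypothetical metric name
		Help: "1 if the backend is currently healthy, 0 otherwise.",
	}, []string{"backend"})

	healthCheckFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "vmcp_backend_health_check_failures_total", // hypothetical metric name
		Help: "Total number of failed health checks per backend.",
	}, []string{"backend"})
)

func init() {
	prometheus.MustRegister(backendHealthy, healthCheckFailures)
}

// recordTransition updates the gauge whenever a backend changes health state.
func recordTransition(backend string, healthy bool) {
	v := 0.0
	if healthy {
		v = 1
	}
	backendHealthy.WithLabelValues(backend).Set(v)
}
```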

Implementation Notes

  • Health checks run in background goroutines
  • Use context for cancellation on shutdown
  • Circuit breaker prevents cascading failures
  • Health status should be cached to avoid repeated checks
  • Transition events should be logged and emitted as metrics
  • Follow existing ToolHive observability patterns in pkg/telemetry/
  • Consider using sony/gobreaker or a similar library
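
For reference, the knobs above could be grouped into a single config struct; the field names, YAML keys, and the `FailureMode` type (from the failure-modes sketch) are illustrative only:

```go
// One possible configuration shape with the defaults from this issue.
package health

import "time"

type Config struct {
	CheckInterval      time.Duration `yaml:"check_interval"`       // default 30s
	UnhealthyThreshold int           `yaml:"unhealthy_threshold"`  // default 3 consecutive failures
	BreakerThreshold   int           `yaml:"breaker_threshold"`    // default 5 failures to open
	BreakerOpenTimeout time.Duration `yaml:"breaker_open_timeout"` // default 60s before half-open
	FailureMode        FailureMode   `yaml:"failure_mode"`         // "fail" (default) or "best_effort"
}

// DefaultConfig mirrors the defaults listed in this issue.
func DefaultConfig() Config {
	return Config{
		CheckInterval:      30 * time.Second,
		UnhealthyThreshold: 3,
		BreakerThreshold:   5,
		BreakerOpenTimeout: 60 * time.Second,
		FailureMode:        FailureModeFail,
	}
}
```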

Reference

Acceptance Criteria

  • Periodic health check implementation
  • Configurable health check interval
  • Backend health state tracking (healthy/unhealthy/unknown)
  • Consecutive failure counting
  • Unhealthy threshold configuration
  • Circuit breaker implementation (closed/open/half-open states)
  • Configurable failure threshold for circuit breaker
  • Configurable timeout for open state
  • Automatic transition to half-open for recovery testing
  • Remove unhealthy backend tools from capabilities
  • Error response when routing to unavailable backend
  • Automatic restoration when backend recovers
  • Health state transition logging
  • Partial failure mode support (fail vs best_effort)
  • Metrics emission for observability
  • Unit tests for health check logic
  • Unit tests for circuit breaker state machine
  • Integration tests with flaky mock backends
  • E2E tests with backend failure and recovery scenarios
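
A sketch of what a state-machine unit test might look like, driving a gobreaker instance from closed to open to half-open with a flaky in-memory "backend"; timings and assertions are illustrative:

```go
// Hypothetical unit-test sketch for the circuit breaker state machine.
package health

import (
	"errors"
	"testing"
	"time"

	"github.com/sony/gobreaker"
)

func TestBreakerOpensAndRecovers(t *testing.T) {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    "flaky-backend",
		Timeout: 50 * time.Millisecond, // short open timeout to keep the test fast
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.ConsecutiveFailures >= 5
		},
	})

	// Five consecutive failures should trip the breaker open.
	for i := 0; i < 5; i++ {
		_, _ = cb.Execute(func() (any, error) { return nil, errors.New("backend down") })
	}
	if cb.State() != gobreaker.StateOpen {
		t.Fatalf("expected open state, got %v", cb.State())
	}

	// After the open timeout the next call runs in half-open; a success closes it.
	time.Sleep(60 * time.Millisecond)
	if _, err := cb.Execute(func() (any, error) { return "ok", nil }); err != nil {
		t.Fatalf("expected recovery probe to succeed, got %v", err)
	}
	if cb.State() != gobreaker.StateClosed {
		t.Fatalf("expected closed state, got %v", cb.State())
	}
}
```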

Labels

api, enhancement, go, telemetry, vmcp
