@peterkc (Contributor) commented Dec 12, 2025

Summary

Add Prometheus gauge metrics to monitor active and queued requests per deployment when using `max_parallel_requests` limiting.

  • New: `TrackedSemaphore` wrapper that tracks queue depth without accessing private `asyncio.Semaphore` internals
  • New: `litellm_deployment_active_requests` gauge metric
  • New: `litellm_deployment_queued_requests` gauge metric
  • New: `Router.get_deployment_queue_stats()` method

Problem

LiteLLM Router uses semaphores for max_parallel_requests limiting but provides no visibility into queue depth. Operators cannot monitor:

  • How many requests are actively being processed per deployment
  • How many requests are waiting in queue per deployment
  • Whether deployments are saturated

Solution

TrackedSemaphore

A wrapper around asyncio.Semaphore that explicitly tracks active and queued counts:

```python
import asyncio

class TrackedSemaphore:
    def __init__(self, value: int):
        self._semaphore = asyncio.Semaphore(value)
        self._active = 0  # requests currently holding a permit
        self._queued = 0  # requests waiting for a permit

    async def acquire(self) -> None:
        self._queued += 1
        try:
            await self._semaphore.acquire()
        finally:
            self._queued -= 1  # decremented even if the waiting task is cancelled
        self._active += 1

    def release(self) -> None:
        self._active -= 1
        self._semaphore.release()
```
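A minimal usage sketch (the constructor argument and the direct reads of `_active`/`_queued` are illustrative; the PR surfaces these counts through `Router.get_deployment_queue_stats()`):

```python
async def call_deployment(sem: TrackedSemaphore):
    await sem.acquire()
    try:
        await asyncio.sleep(0.1)  # stand-in for the actual LLM API call
    finally:
        sem.release()

async def main():
    sem = TrackedSemaphore(2)  # e.g. max_parallel_requests = 2
    tasks = [asyncio.create_task(call_deployment(sem)) for _ in range(5)]
    await asyncio.sleep(0)  # let every task start and hit the semaphore
    print(sem._active, sem._queued)  # -> 2 3: two permits held, three waiting
    await asyncio.gather(*tasks)

asyncio.run(main())
```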

Prometheus Metrics

```promql
# Alert on queue buildup: any queued requests means the permit pool is full
litellm_deployment_queued_requests{model="gpt-4", model_group="production"} > 0

# Monitor active requests across a model group
litellm_deployment_active_requests{model_group="production"}
```
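A hedged sketch of how such gauges could be defined and refreshed with `prometheus_client` (the label set follows the queries above; the stats shape and the `export_queue_stats` helper are assumptions, not the PR's actual wiring):

```python
from prometheus_client import Gauge

ACTIVE_REQUESTS = Gauge(
    "litellm_deployment_active_requests",
    "Requests currently holding a max_parallel_requests permit",
    ["model", "model_group"],
)
QUEUED_REQUESTS = Gauge(
    "litellm_deployment_queued_requests",
    "Requests waiting for a max_parallel_requests permit",
    ["model", "model_group"],
)

def export_queue_stats(stats: dict) -> None:
    # Assumed shape: {("gpt-4", "production"): {"active": 2, "queued": 3}, ...}
    for (model, model_group), counts in stats.items():
        ACTIVE_REQUESTS.labels(model=model, model_group=model_group).set(counts["active"])
        QUEUED_REQUESTS.labels(model=model, model_group=model_group).set(counts["queued"])
```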

Test Plan

  • Unit tests for TrackedSemaphore (16 tests; one representative case is sketched after this list)
  • Unit tests for Prometheus metric definitions (9 tests)
  • Performance tests validating minimal overhead (7 tests)
  • E2E tests for metrics endpoint (5 tests, requires running proxy)
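
One representative TrackedSemaphore case might look like this (a sketch using pytest-asyncio; the actual tests in the PR may differ):

```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_tracked_semaphore_counts_queued_waiters():
    sem = TrackedSemaphore(1)

    await sem.acquire()                      # take the only permit
    waiter = asyncio.create_task(sem.acquire())
    await asyncio.sleep(0)                   # let the waiter block on the semaphore

    assert (sem._active, sem._queued) == (1, 1)

    sem.release()                            # permit passes to the waiter
    await waiter
    assert (sem._active, sem._queued) == (1, 0)
    sem.release()
```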

Performance

| Scenario | Overhead |
| --- | --- |
| Microbenchmark (no I/O) | ~200% |
| Real-world (with I/O) | 0.2% |

The microbenchmark overhead is negligible in practice: LLM API calls take 100 ms–10 s, while the added counter bookkeeping costs microseconds per acquire/release.
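
A minimal sketch of the kind of microbenchmark behind the ~200% figure (comparing a bare `asyncio.Semaphore` against `TrackedSemaphore` with no I/O in the critical section; names and numbers are illustrative):

```python
import asyncio
import time

async def bench(sem, n: int = 100_000) -> float:
    # Acquire/release in a tight loop with no I/O, so only the
    # semaphore (and counter) overhead is measured.
    start = time.perf_counter()
    for _ in range(n):
        await sem.acquire()
        sem.release()
    return time.perf_counter() - start

async def main():
    plain = await bench(asyncio.Semaphore(10))
    tracked = await bench(TrackedSemaphore(10))
    print(f"plain semaphore: {plain:.3f}s")
    print(f"tracked:         {tracked:.3f}s ({tracked / plain - 1:+.0%})")

asyncio.run(main())
```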

Closes #17764

vercel bot commented Dec 12, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| litellm | ✅ Ready | Preview | Comment | Dec 12, 2025 6:05am |
@krrishdholakia (Contributor) commented:
@AlexsanderHamir can you review this please? seems like there might be some perf impact
