[Serve][LLM] Add Pause/Resume Control Plane API for Ray Serve LLM #59523
Conversation
…trl-plane-mixins-v2 Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Code Review
This pull request introduces a pause/resume control plane API, which is a valuable addition for RL training workflows. The refactoring of existing control plane endpoints into modular mixins (SleepableIngressMixin, PausableIngressMixin, CacheManagerIngressMixin) is a significant improvement, making the code cleaner and more extensible. The test coverage, including unit, isolation, and a new end-to-end release test, is thorough and well-executed. I've found one high-severity issue related to a blocking call in an async context that could impact server performance, which I've detailed in a specific comment. Overall, this is a high-quality contribution.
Review thread on python/ray/llm/_internal/serve/core/ingress/mixins/broadcastable.py (outdated, resolved).
…ause-resume-2 Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
/gemini review
Code Review
This pull request introduces a pause/resume control plane API for Ray Serve LLM, which is a valuable addition for workflows like reinforcement learning training. The implementation is well-structured, following existing patterns for control plane endpoints. The changes are spread across the ingress, server, and engine layers, and are accompanied by both unit tests and a new end-to-end release test.
My review focuses on improving the robustness of the new release test by replacing fixed-time sleeps with polling, which will help prevent flakiness and unnecessary delays. The core logic of the feature itself appears solid.
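For illustration, a minimal polling helper of the kind the review suggests; the helper name, timeout, and interval are invented for this sketch and are not from the test itself:

```python
import time
from typing import Callable

def wait_until(predicate: Callable[[], bool],
               timeout_s: float = 30.0,
               interval_s: float = 0.5) -> None:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise TimeoutError("condition not met within timeout")

# e.g. wait_until(lambda: client.get("/is_paused").json()["is_paused"])
```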
eicherseiji left a comment
LGTM
```python
        IsPausedResponse with is_paused boolean.
        """
        results = await self._broadcast_to_replicas(model, "is_paused")
        is_paused_result = any(results) if results else False
```
Some debug logs about the results could be helpful?
Discussed offline. I think we can add it if we need to debug; I am not sure how to tune the verbosity in a nice way.
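For concreteness, a sketch of what such debug logging could look like if it were ever added; the logger setup and method shape are assumptions, and `_broadcast_to_replicas` is the helper from the snippet above:

```python
import logging

logger = logging.getLogger(__name__)

class PausableIngressMixin:  # name from this PR; body is illustrative only
    async def is_paused(self, model: str) -> bool:
        results = await self._broadcast_to_replicas(model, "is_paused")
        # Log raw per-replica results at debug level, per the suggestion above.
        logger.debug("is_paused replica results for %s: %s", model, results)
        # Treat the deployment as paused if any replica reports paused.
        return any(results) if results else False
```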
…y-project#59523) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Summary
This PR adds pause/resume control plane endpoints to DevIngress, enabling RL training workflows to temporarily halt generation without offloading model weights from GPU memory.
Must merge before this PR:
Motivation
During RL training, there's often a need to pause inference for weight updates. While sleep/wakeup already exists for memory management (it offloads weights to CPU), pause/resume provides a lightweight alternative that halts generation while keeping model weights resident in GPU memory.
Changes
1. New Control Plane Endpoints
Added three new HTTP endpoints via PausableIngressMixin (see the usage sketch after this list):
- POST /pause - Pause generation on all replicas
- POST /resume - Resume generation after pause
- GET /is_paused - Check pause status
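As a rough illustration of calling these endpoints from a client: the paths and option names come from this PR, but the base URL, model id, and exact request shape (JSON body vs. query params) are assumptions for the sketch.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed Serve ingress address
MODEL = "my-model"                  # hypothetical served model id

# Pause generation on all replicas; option names mirror the PR description.
requests.post(
    f"{BASE_URL}/pause",
    json={"model": MODEL, "wait_for_inflight_requests": True, "clear_cache": False},
).raise_for_status()

# Check aggregate pause status across replicas.
print(requests.get(f"{BASE_URL}/is_paused", params={"model": MODEL}).json())

# Resume generation once weights have been updated.
requests.post(f"{BASE_URL}/resume", json={"model": MODEL}).raise_for_status()
```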
2. vLLM Integration
Integrated with vLLM's pause_generation() / resume_generation() / is_paused() API, with configurable options (see the sketch after this list):
- wait_for_inflight_requests: Wait for in-flight requests to complete vs. abort immediately
- clear_cache: Whether to clear the KV cache on pause
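To make the option semantics concrete, here is a minimal stand-in for the pause options; this dataclass is illustrative only, not the PR's actual request schema, and the defaults are guesses:

```python
from dataclasses import dataclass

@dataclass
class PauseOptions:
    # Option names come from the PR description; defaults here are assumed.
    wait_for_inflight_requests: bool = True  # drain in-flight requests vs. abort them
    clear_cache: bool = False                # drop KV cache contents on pause

# Example: abort in-flight requests and free the cache before a weight update.
opts = PauseOptions(wait_for_inflight_requests=False, clear_cache=True)
```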
Testing
- End-to-end release test (test_llm_serve_pause_resume.py)