Conversation

@kouroshHakha (Contributor) commented Dec 17, 2025

Summary

This PR adds pause/resume control plane endpoints to DevIngress, enabling RL training workflows to temporarily halt generation without offloading model weights from GPU memory.

Must merge before this PR:

Motivation

During RL training, there's often a need to pause inference for weight updates. While sleep/wakeup exists for memory management (offloads weights to CPU), pause/resume provides a lightweight alternative that:

  • Halts generation immediately
  • Keeps model weights in GPU memory
  • Enables faster resume since no weight reloading is needed

Changes

1. New Control Plane Endpoints

Added three new HTTP endpoints via PausableIngressMixin:

  • POST /pause - Pause generation on all replicas
  • POST /resume - Resume generation after pause
  • GET /is_paused - Check pause status

2. vLLM Integration

Integrated with vLLM's pause_generation() / resume_generation() / is_paused() API with configurable options:

  • wait_for_inflight_requests: Wait for in-flight requests to complete vs abort immediately
  • clear_cache: Whether to clear KV cache on pause
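
A pause request carrying both options might look like the following sketch; the field names mirror the option names above, but the exact JSON schema is an assumption:

```python
import json


def pause_payload(wait_for_inflight_requests: bool = True,
                  clear_cache: bool = False) -> str:
    """Serialize pause options as a JSON body for POST /pause."""
    return json.dumps({
        # True: drain in-flight requests before pausing; False: abort them.
        "wait_for_inflight_requests": wait_for_inflight_requests,
        # Whether to drop the KV cache when pausing.
        "clear_cache": clear_cache,
    })


# Example: abort in-flight requests and clear the KV cache on pause.
body = pause_payload(wait_for_inflight_requests=False, clear_cache=True)
```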

Testing

  • Unit tests for pause/resume cycle and model isolation
  • End-to-end release test (test_llm_serve_pause_resume.py)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha added the "go" label (add ONLY when ready to merge, run all tests) Dec 17, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a pause/resume control plane API, which is a valuable addition for RL training workflows. The refactoring of existing control plane endpoints into modular mixins (SleepableIngressMixin, PausableIngressMixin, CacheManagerIngressMixin) is a significant improvement, making the code cleaner and more extensible. The test coverage, including unit, isolation, and a new end-to-end release test, is thorough and well-executed. I've found one high-severity issue related to a blocking call in an async context that could impact server performance, which I've detailed in a specific comment. Overall, this is a high-quality contribution.

@kouroshHakha (Contributor, Author) commented:

/gemini review

@kouroshHakha marked this pull request as ready for review December 18, 2025 01:24
@kouroshHakha requested a review from a team as a code owner December 18, 2025 01:24
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a pause/resume control plane API for Ray Serve LLM, which is a valuable addition for workflows like reinforcement learning training. The implementation is well-structured, following existing patterns for control plane endpoints. The changes are spread across the ingress, server, and engine layers, and are accompanied by both unit tests and a new end-to-end release test.

My review focuses on improving the robustness of the new release test by replacing fixed-time sleeps with polling, which will help prevent flakiness and unnecessary delays. The core logic of the feature itself appears solid.

@ray-gardener bot added the "serve" (Ray Serve Related Issue) and "llm" labels Dec 18, 2025
@eicherseiji (Contributor) left a comment

LGTM

```python
    IsPausedResponse with is_paused boolean.
    """
    results = await self._broadcast_to_replicas(model, "is_paused")
    is_paused_result = any(results) if results else False
```
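
The aggregation in the snippet reports the service as paused if any replica is paused, and defaults to not paused when there are no results; a standalone sketch of that rule:

```python
def aggregate_is_paused(replica_results: list) -> bool:
    """Service-level pause status: paused if any replica reports paused."""
    # Mirrors the `any(results) if results else False` expression in the diff.
    return any(replica_results) if replica_results else False
```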
@eicherseiji (Contributor) commented Dec 18, 2025
Some debug logs about the results could be helpful?

@kouroshHakha (Contributor, Author) replied:
Discussed offline. I think we can add it later if we need to debug; I'm not sure how to tune the verbosity in a nice way.

@kouroshHakha merged commit 311eae2 into ray-project:master Dec 18, 2025
6 checks passed
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
…y-project#59523) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>