[Bugfix] fix CUDA illegal memory access when sleep mode is triggered during request processing #28721
Conversation
Code Review
This pull request introduces a fix for a CUDA illegal memory access error that occurs when sleep mode is triggered during request processing. The fix uses shared memory to signal the sleep state to the scheduler. My review focuses on improving the robustness of the shared memory implementation by addressing a race condition, preventing resource leaks, and handling exceptions more safely.
```python
def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
    try:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
    except FileNotFoundError:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)

    if shm is not None:
        shm.buf[0:4] = value.to_bytes(4, 'little')
        shm.close()
```
This implementation has a race condition when creating the shared memory, which can lead to an unhandled FileExistsError. It also doesn't guarantee that shm.close() is called if an error occurs. Furthermore, the shared memory block is never unlinked, which can cause resource leaks and lead to stale state if the server restarts.
A more robust implementation should handle the race condition atomically, ensure resource closure with a try...finally block, and let the application manage the shared memory lifecycle (e.g., in the lifespan context manager): create the block on startup and unlink it on shutdown.
Suggested change:

```diff
 def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
-    try:
-        shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
-    except FileNotFoundError:
-        shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)
-    if shm is not None:
-        shm.buf[0:4] = value.to_bytes(4, 'little')
-        shm.close()
+    shm = None
+    try:
+        try:
+            shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)
+        except FileExistsError:
+            shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
+        shm.buf[0:4] = value.to_bytes(4, 'little')
+    finally:
+        if shm is not None:
+            shm.close()
```
```python
try:
    shm = shared_memory.SharedMemory(name=shared_memory_name)
    sleep_signal = int.from_bytes(shm.buf[0:4], 'little')
    shm.close()
except:
    pass
```
Using a bare except: is dangerous as it can hide unexpected errors, such as issues with buffer access or data conversion. It's better to catch only the specific exception you expect, in this case FileNotFoundError. Additionally, using a try...finally block ensures that shm.close() is called even if an error occurs while reading from the shared memory, preventing resource leaks.
Suggested change:

```diff
-try:
-    shm = shared_memory.SharedMemory(name=shared_memory_name)
-    sleep_signal = int.from_bytes(shm.buf[0:4], 'little')
-    shm.close()
-except:
-    pass
+shm = None
+try:
+    shm = shared_memory.SharedMemory(name=shared_memory_name)
+    sleep_signal = int.from_bytes(shm.buf[0:4], 'little')
+except FileNotFoundError:
+    pass
+finally:
+    if shm is not None:
+        shm.close()
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
from multiprocessing import shared_memory

def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
    try:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
    except FileNotFoundError:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)

    if shm is not None:
        shm.buf[0:4] = value.to_bytes(4, 'little')
        shm.close()
```
Shared sleep flag persists across process restarts
The new set_sleep_signal creates a named shared memory block but never unlinks or resets it. If the server crashes or terminates after /sleep sets the flag to 1, the OS keeps /sleep_signal around with that value. Any new vLLM instance started on the same host will open the existing block and read 1, causing check_stop to immediately mark all requests as FINISHED_STOPPED even though the engine has not actually gone to sleep. This makes a fresh server unusable until someone issues /wake_up or manually deletes the shared memory segment. The shared flag should be cleaned up or defaulted to 0 during startup.
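The failure mode Codex describes is easy to reproduce in isolation: one process writes 1 and exits without unlinking, and a fresh attach still observes the stale flag. A minimal sketch (the segment name `demo_stale_flag` is made up for this demo):

```python
from multiprocessing import shared_memory

NAME = "demo_stale_flag"  # hypothetical name for this demo

# "Old server": create the segment, set the flag, close (but never unlink).
writer = shared_memory.SharedMemory(name=NAME, create=True, size=4)
writer.buf[0:4] = (1).to_bytes(4, "little")
writer.close()

# "New server": attaches to the leftover segment and reads the stale 1,
# which would make check_stop finish every request immediately.
reader = shared_memory.SharedMemory(name=NAME)
stale = int.from_bytes(reader.buf[0:4], "little")
print(stale)  # -> 1
reader.close()
reader.unlink()  # cleanup for the demo
```

Resetting the flag to 0 (or unlinking any existing segment) during startup avoids this entirely.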
CC @22quinn
This PR fixes a bug where the scheduler continues to process requests after sleep mode has freed GPU memory, causing `CUDA error: an illegal memory access was encountered` and crashing the engine. Fixes #28714.
Below is the test content. More test cases will be added later (e.g., triggering sleep while the model is processing multiple requests, or while the scheduler is using batch_queue), along with related code fixes.
Test Plan
Test Result
After the fix, the model can successfully sleep and wake up.