[BugFix] : Fix Batched DeepGemm Experts #19515
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a bug in the batched DeepGEMM MoE implementation that caused an assert failure during serving under specific configurations involving data parallelism. The fix primarily involves correcting how the number of data parallel ranks and the number of experts are used in workspace size calculations for batched MoE kernels and ensuring the necessary DeepGEMM kernel is imported.
Highlights
- Bug Fix: Batched DeepGEMM MoE Assert: Fixes an assert failure that occurred when using batched DeepGEMM experts with data parallelism enabled, specifically related to incorrect workspace size calculations.
- Workspace Calculation Correction: Corrected the calculation of `num_dp` used in workspace shape determination for the batched MoE implementations (`batched_deep_gemm_moe.py` and `fused_batched_moe.py`) from `self.world_size // self.dp_size` to `self.dp_size` (see the sketch below).
- Dynamic Expert Count for Workspace: Modified `modular_kernel.py` to use the local number of experts (`w1.size(0)`) for workspace calculations when using batched fused experts, instead of the global number of experts, which is more appropriate since batched experts handle alignment differently.
- DeepGEMM Kernel Import: Ensured the DeepGEMM quantization kernel module is explicitly imported in `fp8_utils.py` to make the `w8a8_block_fp8_matmul_deepgemm` operation available.
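The sketch below illustrates the `num_dp` and expert-count changes described above. It is a simplified, illustrative version rather than the exact vLLM code; class, method, and attribute names other than `dp_size`, `world_size`, and `max_num_tokens` are assumptions.

```python
# Simplified sketch of the workspace-sizing change (illustrative, not vLLM's exact code).

class BatchedExpertsSketch:
    def __init__(self, max_num_tokens: int, world_size: int, dp_size: int):
        self.max_num_tokens = max_num_tokens
        self.world_size = world_size
        self.dp_size = dp_size

    def workspace_shape(self, num_local_experts: int,
                        hidden_size: int) -> tuple[int, int, int]:
        # Before the fix: num_dp = self.world_size // self.dp_size
        # After the fix:  num_dp = self.dp_size
        num_dp = self.dp_size
        # Batched experts allocate one slab of (max_num_tokens * num_dp) rows per
        # *local* expert, so the leading dimension uses the local expert count.
        return (num_local_experts, self.max_num_tokens * num_dp, hidden_size)
```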
Code Review
This pull request aims to fix an assertion failure in Batched DeepGEMM Experts. The changes primarily involve:
- Correcting the calculation of `num_dp` (data parallel degree) used for workspace sizing in `BatchedDeepGemmExperts` and `BatchedExperts` to use `self.dp_size` directly.
- Optimizing the workspace allocation in `FusedMoEModularKernel` by using the local expert count instead of the global expert count for batched expert implementations like `BatchedDeepGemmExperts`.
- Ensuring the necessary DeepGEMM custom ops are registered by adding a local import in `fp8_utils.py` (see the registration sketch below).
- Removing a redundant initialization of `global_num_experts`.
These changes appear to correctly address the issue and improve workspace sizing. The PR description could be enhanced by filling in the 'Purpose' section to provide more context on the problem being solved beyond the linked assert.
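The registration point above follows the usual import-for-side-effects pattern: a custom op only becomes visible under `torch.ops` once the module that registers it has been imported. A generic sketch of that pattern follows; the `demo` namespace and op name are hypothetical, not vLLM's actual identifiers.

```python
import torch
from torch.library import custom_op


# Registration happens when this module is imported. Callers that resolve the op
# through torch.ops must therefore make sure this module gets imported first,
# even if nothing in it is referenced directly.
@custom_op("demo::block_fp8_matmul", mutates_args=())
def block_fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a @ b


# Elsewhere in the codebase:
#   import this_module  # noqa: F401  -- side effect: registers demo::block_fp8_matmul
#   out = torch.ops.demo.block_fp8_matmul(a, b)
```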
This condition isn't really right. Ideally, we should add a method on the experts that returns a format enum or something.
I'm fine with this as a stopgap fix, though.
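A rough sketch of the kind of API being suggested; the enum and method names here are hypothetical, not an existing vLLM interface:

```python
from enum import Enum, auto


class ExpertActivationFormat(Enum):
    """Hypothetical descriptor for how an experts implementation lays out tokens."""
    STANDARD = auto()  # contiguous tokens, aligned via moe_align_block_size
    BATCHED = auto()   # fixed-size per-local-expert batches, no alignment pass


class BatchedDeepGemmExpertsSketch:
    def activation_format(self) -> ExpertActivationFormat:
        # The modular kernel could branch on this instead of an isinstance check.
        return ExpertActivationFormat.BATCHED
```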
Fixed it by passing in `num_local_experts` and `num_global_experts` to `workspace_shapes()` and letting the function decide what to use.
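Roughly, the updated interface looks like the sketch below. The real method in vLLM carries additional shape arguments, and the attribute names here (`use_batched_format`, `dp_size`) are illustrative assumptions rather than the actual signature.

```python
class ExpertsWorkspaceSketch:
    """Illustrative only: parameter and attribute names are assumptions."""

    def __init__(self, dp_size: int, use_batched_format: bool):
        self.dp_size = dp_size
        self.use_batched_format = use_batched_format

    def workspace_shapes(self, max_tokens: int, hidden_size: int,
                         num_local_experts: int, num_global_experts: int):
        # Each implementation decides which expert count matters for its layout:
        # batched experts size workspaces per local expert, while implementations
        # that rely on moe_align_block_size keep using the global count.
        num_experts = (num_local_experts if self.use_batched_format
                       else num_global_experts)
        return (num_experts, max_tokens * self.dp_size, hidden_size)
```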
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Force-pushed from 5341324 to 001ac2e
houseroad left a comment
Can we create an issue to track the real fix?
A short-term fix is probably okay for the short term.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Created an issue to track this here: #19590
Purpose
Bug:
Triggers the assert at https://github.com/deepseek-ai/DeepGEMM/blob/8dfa3298274bfe6b242f6f8a3e6f3eff2707dd9f/deep_gemm/jit_kernels/m_grouped_gemm.py#L142

Cause:
The assert expects the `num_experts` dimensions to match.
`Qwen3-30B-A3B-FP8` has 128 experts. With DP=2 TP=1, each GPU rank gets 64 experts. We compute the output shape as `(num_global_experts, max_num_tokens * num_dp, hidden_size)`, but it should be `(num_local_experts, max_num_tokens * num_dp, hidden_size)`; i.e., we compute the shape as `(128, *, *)` when it should be `(64, *, *)`. We have traditionally used `num_global_experts` in the workspace calculation because of how `expert_maps` are handled in `moe_align_block_size`.
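A small worked example with the numbers above; the `max_num_tokens` and `hidden_size` values are made up for illustration and are not from the PR.

```python
num_global_experts = 128          # Qwen3-30B-A3B-FP8
dp_size = 2                       # DP=2, TP=1
num_local_experts = num_global_experts // dp_size   # 64 experts per rank

max_num_tokens = 1024             # illustrative value
hidden_size = 2048                # illustrative value

# Shape that was computed (per *global* expert):
computed = (num_global_experts, max_num_tokens * dp_size, hidden_size)   # (128, 2048, 2048)
# Shape the grouped GEMM actually expects (per *local* expert):
expected = (num_local_experts, max_num_tokens * dp_size, hidden_size)    # (64, 2048, 2048)

# The mismatch in the leading (num_experts) dimension is what trips DeepGEMM's assert.
assert computed[0] != expected[0]
```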
Fix:
The fix (for now) is to recognize that the batched experts don't need `moe_align_block_size` and to use `num_local_experts` directly in the `workspace_shapes` computation. The real fix would be to make `moe_align_block_size` factor in `num_local_experts` from the start.
Test Plan
Tested locally on H100 with an lm_eval command.
Test Result
(Optional) Documentation Update