
Conversation

@varun-sundar-rabindranath (Contributor) commented Jun 11, 2025

Purpose

Bug:

VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --port 9010 --no-enable-prefix-caching 

Triggers an assertion failure at https://github.com/deepseek-ai/DeepGEMM/blob/8dfa3298274bfe6b242f6f8a3e6f3eff2707dd9f/deep_gemm/jit_kernels/m_grouped_gemm.py#L142

Cause:

The assert expects the num_experts dimensions of the grouped GEMM inputs and output to agree.
Qwen3-30B-A3B-FP8 has 128 experts. With DP=2 and TP=1, each GPU rank gets 64 experts.
We compute the output shape as (num_global_experts, max_num_tokens * num_dp, hidden_size), but it should be (num_local_experts, max_num_tokens * num_dp, hidden_size); i.e., we compute the shape as (128, *, *) when it should be (64, *, *).

We have traditionally used num_global_experts in the workspace calculation because of how expert_maps are handled in moe_align_block_size.

Fix:

The fix (for now) is to recognize that the batched experts don't need moe_align_block_size, and to use num_local_experts directly in the workspace_shapes computation.

The real fix would be to make moe_align_block_size factor in num_local_experts from the start.
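
As a minimal sketch of the shape fix (names mirror the description above, not the exact vLLM code):

```python
# Repro configuration: Qwen/Qwen3-30B-A3B-FP8 with DP=2, TP=1 and
# expert parallelism enabled, as in the serve command above.
num_global_experts = 128
ep_size = 2                                         # experts sharded over 2 ranks
num_local_experts = num_global_experts // ep_size   # 64 experts per rank

max_num_tokens, num_dp, hidden_size = 256, 2, 2048  # placeholder sizes

# Buggy workspace shape -- sized for all experts: (128, *, *)
buggy_shape = (num_global_experts, max_num_tokens * num_dp, hidden_size)

# Fixed workspace shape -- batched experts skip moe_align_block_size, so the
# grouped GEMM only ever sees this rank's experts: (64, *, *)
fixed_shape = (num_local_experts, max_num_tokens * num_dp, hidden_size)
```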

Test Plan

Tested locally on an H100 with the following lm_eval command:

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100 

Test Result

| Tasks | Version | Filter           | n-shot | Metric      |   | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ |  0.82 | ± | 0.0386 |
|       |         | strict-match     |      5 | exact_match | ↑ |  0.93 | ± | 0.0256 |


@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment


Summary of Changes

Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the batched DeepGEMM MoE implementation that caused an assert failure during serving under specific configurations involving data parallelism. The fix primarily involves correcting how the number of data parallel ranks and the number of experts are used in workspace size calculations for batched MoE kernels and ensuring the necessary DeepGEMM kernel is imported.

Highlights

  • Bug Fix: Batched DeepGEMM MoE Assert: Fixes an assert failure that occurred when using batched DeepGEMM experts with data parallelism enabled, specifically related to incorrect workspace size calculations.
  • Workspace Calculation Correction: Corrected the calculation of num_dp used in workspace shape determination for batched MoE implementations (batched_deep_gemm_moe.py and fused_batched_moe.py) from self.world_size // self.dp_size to self.dp_size (see the sketch after this list).
  • Dynamic Expert Count for Workspace: Modified modular_kernel.py to use the local number of experts (w1.size(0)) for workspace calculations when using batched fused experts, instead of the global number of experts, which is more appropriate as batched experts handle alignment differently.
  • DeepGEMM Kernel Import: Ensured the DeepGEMM quantization kernel module is explicitly imported in fp8_utils.py to make the w8a8_block_fp8_matmul_deepgemm operation available.
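
For illustration, the num_dp correction in the second highlight is a one-line change along these lines (attribute names follow the summary above; the enclosing classes in batched_deep_gemm_moe.py and fused_batched_moe.py are elided, so it is shown here as a standalone helper):

```python
def num_dp_for_workspace(world_size: int, dp_size: int) -> int:
    # Before the fix: returned world_size // dp_size, which is not the
    # data-parallel degree in general.
    # After the fix: the data-parallel size is used directly when sizing
    # the (num_experts, max_num_tokens * num_dp, hidden_size) workspace.
    return dp_size
```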

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request aims to fix an assertion failure in Batched DeepGEMM Experts. The changes primarily involve:

  1. Correcting the calculation of num_dp (data parallel degree) used for workspace sizing in BatchedDeepGemmExperts and BatchedExperts to use self.dp_size directly.
  2. Optimizing the workspace allocation in FusedMoEModularKernel by using local expert count instead of global expert count for batched expert implementations like BatchedDeepGemmExperts.
  3. Ensuring necessary DeepGEMM custom ops are registered by adding a local import in fp8_utils.py (see the sketch below).
  4. Removing a redundant initialization of global_num_experts.

These changes appear to correctly address the issue and improve workspace sizing. The PR description could be enhanced by filling in the 'Purpose' section to provide more context on the problem being solved beyond the linked assert.
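
For illustration, the local import in item 3 follows the common pattern of importing a module purely for its op-registration side effect. The module name below is a hypothetical stand-in; only the pattern is the point:

```python
import importlib


def _ensure_deep_gemm_ops_registered() -> None:
    # Hypothetical module name -- the real one is imported inside vLLM's
    # fp8_utils.py. Importing it runs its top-level custom-op registration,
    # making ops such as w8a8_block_fp8_matmul_deepgemm resolvable.
    importlib.import_module("vllm_deep_gemm_quant_kernels")
```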

A Collaborator left a comment:

This condition isn't really right. Ideally, we should add a method on the experts that returns a format enum or something.

I'm fine with this as a stopgap fix, though.
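
A hypothetical sketch of that suggestion (all names illustrative, not the actual vLLM code):

```python
from enum import Enum, auto


class ExpertsFormat(Enum):
    STANDARD = auto()  # relies on moe_align_block_size; global expert count
    BATCHED = auto()   # batched layout; local expert count suffices


class BatchedDeepGemmExperts:
    @property
    def experts_format(self) -> ExpertsFormat:
        # The modular kernel would dispatch on this instead of an
        # isinstance-style condition.
        return ExpertsFormat.BATCHED
```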

@varun-sundar-rabindranath (Contributor, Author) replied:

Fixed it by passing num_local_experts and num_global_experts into workspace_shapes() and letting the function decide which to use.
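
Illustratively, the revised interface might look like the sketch below; the class and parameter names are assumptions, and only the idea of passing both counts and letting each implementation choose comes from the comment above:

```python
class BatchedDeepGemmExperts:
    def workspace_shapes(self, max_num_tokens: int, num_dp: int,
                         hidden_size: int, num_local_experts: int,
                         num_global_experts: int) -> tuple[int, int, int]:
        # Batched experts bypass moe_align_block_size, so only the experts
        # resident on this rank need workspace rows.
        return (num_local_experts, max_num_tokens * num_dp, hidden_size)


class StandardExperts:
    def workspace_shapes(self, max_num_tokens: int, num_dp: int,
                         hidden_size: int, num_local_experts: int,
                         num_global_experts: int) -> tuple[int, int, int]:
        # Implementations built around moe_align_block_size keep the global
        # expert count, preserving the pre-existing behavior.
        return (num_global_experts, max_num_tokens * num_dp, hidden_size)
```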

@mergify (bot) commented Jun 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Varun Sundar Rabindranath added 2 commits June 12, 2025 08:04
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
@varun-sundar-rabindranath force-pushed the varun/fix-batched-fused-experts branch from 5341324 to 001ac2e on June 12, 2025 15:06
@mergify (bot) removed the needs-rebase label Jun 12, 2025
@houseroad (Collaborator) left a comment

Can we create an issue to track the real fix?

The short-term fix is probably okay for now.

@houseroad added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 12, 2025
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
@varun-sundar-rabindranath (Contributor, Author) commented Jun 13, 2025

Created an issue to track this: #19590

@mgoin merged commit e3b1266 into vllm-project:main on Jun 13, 2025
67 checks passed

Labels

ready (ONLY add when PR is ready to merge/full CI is needed)