
Conversation

@varun-sundar-rabindranath (Contributor) commented Jun 11, 2025

Purpose

Bug:

VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --port 9010 --no-enable-prefix-caching 

Triggers an assertion failure at https://github.com/deepseek-ai/DeepGEMM/blob/8dfa3298274bfe6b242f6f8a3e6f3eff2707dd9f/deep_gemm/jit_kernels/m_grouped_gemm.py#L142

Cause:

The assert expects the num_experts dimensions of the grouped GEMM inputs and output to agree.
Qwen3-30B-A3B-FP8 has 128 experts. With DP=2 and TP=1, each GPU rank gets 64 experts.
We compute the output shape as (num_global_experts, max_num_tokens * num_dp, hidden_size), but it should be (num_local_experts, max_num_tokens * num_dp, hidden_size); i.e., we compute the shape as (128, *, *) when it should be (64, *, *).

We have traditionally used num_global_experts in the workspace calculation because of how expert_maps are handled in moe_align_block_size.

Fix:

The fix (for now) is to recognize that the batched experts don't need moe_align_block_size, and to use num_local_experts directly in the workspace_shapes computation.

The real fix would be to make moe_align_block_size factor in num_local_experts from the start.
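
As a minimal sketch of the shape fix (names mirror the description above, not the exact vLLM code):

```python
# Repro configuration: Qwen/Qwen3-30B-A3B-FP8 with DP=2, TP=1 and
# expert parallelism enabled, as in the serve command above.
num_global_experts = 128
ep_size = 2                                         # experts sharded over 2 ranks
num_local_experts = num_global_experts // ep_size   # 64 experts per rank

max_num_tokens, num_dp, hidden_size = 256, 2, 2048  # placeholder sizes

# Buggy workspace shape -- sized for all experts: (128, *, *)
buggy_shape = (num_global_experts, max_num_tokens * num_dp, hidden_size)

# Fixed workspace shape -- batched experts skip moe_align_block_size, so the
# grouped GEMM only ever sees this rank's experts: (64, *, *)
fixed_shape = (num_local_experts, max_num_tokens * num_dp, hidden_size)
```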

Test Plan

Tested locally on an H100 with the following lm_eval command:

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100 

Test Result

| Tasks | Version | Filter           | n-shot | Metric      |   | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ |  0.82 | ± | 0.0386 |
|       |         | strict-match     |      5 | exact_match | ↑ |  0.93 | ± | 0.0256 |


@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment


Summary of Changes

Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the batched DeepGEMM MoE implementation that caused an assert failure during serving under specific configurations involving data parallelism. The fix primarily involves correcting how the number of data parallel ranks and the number of experts are used in workspace size calculations for batched MoE kernels and ensuring the necessary DeepGEMM kernel is imported.

Highlights

  • Bug Fix: Batched DeepGEMM MoE Assert: Fixes an assert failure that occurred when using batched DeepGEMM experts with data parallelism enabled, specifically related to incorrect workspace size calculations.
  • Workspace Calculation Correction: Corrected the calculation of num_dp used in workspace shape determination for batched MoE implementations (batched_deep_gemm_moe.py and fused_batched_moe.py) from self.world_size // self.dp_size to self.dp_size (see the sketch after this list).
  • Dynamic Expert Count for Workspace: Modified modular_kernel.py to use the local number of experts (w1.size(0)) for workspace calculations when using batched fused experts, instead of the global number of experts, which is more appropriate as batched experts handle alignment differently.
  • DeepGEMM Kernel Import: Ensured the DeepGEMM quantization kernel module is explicitly imported in fp8_utils.py to make the w8a8_block_fp8_matmul_deepgemm operation available.
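
For illustration, the num_dp correction in the second highlight is a one-line change along these lines (attribute names follow the summary above; the enclosing classes in batched_deep_gemm_moe.py and fused_batched_moe.py are elided, so it is shown here as a standalone helper):

```python
def num_dp_for_workspace(world_size: int, dp_size: int) -> int:
    # Before the fix: returned world_size // dp_size, which is not the
    # data-parallel degree in general.
    # After the fix: the data-parallel size is used directly when sizing
    # the (num_experts, max_num_tokens * num_dp, hidden_size) workspace.
    return dp_size
```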

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request aims to fix an assertion failure in Batched DeepGEMM Experts. The changes primarily involve:

  1. Correcting the calculation of num_dp (data parallel degree) used for workspace sizing in BatchedDeepGemmExperts and BatchedExperts to use self.dp_size directly.
  2. Optimizing the workspace allocation in FusedMoEModularKernel by using local expert count instead of global expert count for batched expert implementations like BatchedDeepGemmExperts.
  3. Ensuring necessary DeepGEMM custom ops are registered by adding a local import in fp8_utils.py (see the sketch below).
  4. Removing a redundant initialization of global_num_experts.

These changes appear to correctly address the issue and improve workspace sizing. The PR description could be enhanced by filling in the 'Purpose' section to provide more context on the problem being solved beyond the linked assert.
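
For illustration, the local import in item 3 follows the common pattern of importing a module purely for its op-registration side effect. The module name below is a hypothetical stand-in; only the pattern is the point:

```python
import importlib


def _ensure_deep_gemm_ops_registered() -> None:
    # Hypothetical module name -- the real one is imported inside vLLM's
    # fp8_utils.py. Importing it runs its top-level custom-op registration,
    # making ops such as w8a8_block_fp8_matmul_deepgemm resolvable.
    importlib.import_module("vllm_deep_gemm_quant_kernels")
```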

A Collaborator left a comment:

This condition isn't really right. Ideally, we should add a method on the experts that returns a format enum or something.

I'm fine with this as a stopgap fix, though.
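
A hypothetical sketch of that suggestion (all names illustrative, not the actual vLLM code):

```python
from enum import Enum, auto


class ExpertsFormat(Enum):
    STANDARD = auto()  # relies on moe_align_block_size; global expert count
    BATCHED = auto()   # batched layout; local expert count suffices


class BatchedDeepGemmExperts:
    @property
    def experts_format(self) -> ExpertsFormat:
        # The modular kernel would dispatch on this instead of an
        # isinstance-style condition.
        return ExpertsFormat.BATCHED
```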

@varun-sundar-rabindranath (Contributor, Author) replied:

Fixed it by passing num_local_experts and num_global_experts into workspace_shapes() and letting the function decide which to use.
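
Illustratively, the revised interface might look like the sketch below; the class and parameter names are assumptions, and only the idea of passing both counts and letting each implementation choose comes from the comment above:

```python
class BatchedDeepGemmExperts:
    def workspace_shapes(self, max_num_tokens: int, num_dp: int,
                         hidden_size: int, num_local_experts: int,
                         num_global_experts: int) -> tuple[int, int, int]:
        # Batched experts bypass moe_align_block_size, so only the experts
        # resident on this rank need workspace rows.
        return (num_local_experts, max_num_tokens * num_dp, hidden_size)


class StandardExperts:
    def workspace_shapes(self, max_num_tokens: int, num_dp: int,
                         hidden_size: int, num_local_experts: int,
                         num_global_experts: int) -> tuple[int, int, int]:
        # Implementations built around moe_align_block_size keep the global
        # expert count, preserving the pre-existing behavior.
        return (num_global_experts, max_num_tokens * num_dp, hidden_size)
```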

@mergify (bot) commented Jun 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Varun Sundar Rabindranath added 2 commits June 12, 2025 08:04
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
@varun-sundar-rabindranath force-pushed the varun/fix-batched-fused-experts branch from 5341324 to 001ac2e on June 12, 2025 15:06
@mergify (bot) removed the needs-rebase label Jun 12, 2025
@houseroad (Collaborator) left a comment

Can we create an issue to track the real fix?

The short-term fix is probably okay for now.

@houseroad added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 12, 2025
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
@varun-sundar-rabindranath (Contributor, Author) commented Jun 13, 2025

Created an issue to track this: #19590

@mgoin merged commit e3b1266 into vllm-project:main on Jun 13, 2025
67 checks passed

Labels

ready (ONLY add when PR is ready to merge/full CI is needed)