Grid construction based on num_active_lora and support CUDA graph capture across various num_active_lora #30984
base: main
Conversation
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Code Review
This pull request introduces an optimization for LoRA by making the CUDA grid size dependent on the number of active LoRAs rather than the maximum possible number. This is achieved by passing num_active_loras through various layers down to the Triton kernel. The changes are logical and well-implemented. I've identified one area for improvement concerning code duplication, which could impact maintainability and correctness.
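To make the mechanism concrete, here is a minimal sketch of grid construction sized by the active-LoRA count; the function and parameter names are illustrative assumptions, not the PR's actual code:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as used for tile counts."""
    return (a + b - 1) // b

def lora_grid(EM: int, N: int, num_slices: int, num_active_loras: int,
              block_m: int = 64, block_n: int = 64) -> tuple[int, int, int]:
    # Axis 0: tiles covering the output matrix.
    # Axis 1: one program per stacked LoRA weight slice.
    # Axis 2: previously sized by max-loras (lora_a_stacked[0].shape[0]),
    #         which launched idle programs; now sized by the active count.
    return (
        cdiv(EM, block_m) * cdiv(N, block_n),
        num_slices,
        num_active_loras,
    )

# With 8 LoRA slots configured but only 2 adapters active in the batch,
# axis 2 shrinks from 8 to 2 program instances per launch.
print(lora_grid(EM=4096, N=1024, num_slices=1, num_active_loras=2))
```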
💡 Codex Review
Here are some automated review suggestions for this pull request.
    * triton.cdiv(EM, META["BLOCK_SIZE_M"])
    * triton.cdiv(N, META["BLOCK_SIZE_N"]),
    len(lora_a_stacked),
    lora_a_stacked[0].shape[0],
    num_active_loras,
Fused MoE grid skips LoRAs when batch mixes LoRA and no-LoRA tokens
The fused MoE shrink kernel now launches axis=2 programs based on num_active_loras, which is computed in lora_kernel_metadata.prepare_tensors as the count of non-negative IDs. When a batch mixes LoRA and non-LoRA tokens, active_lora_ids is sorted as [-1, <ids…>], so num_active_loras excludes the leading -1 but the grid here iterates only the first num_active_loras entries. The first iteration hits -1 and early-exits, and later positive IDs are never processed, so LoRA contributions are silently dropped for mixed batches. This miscomputes outputs for any MoE batch that contains both LoRA and non-LoRA tokens.
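To see the failure mode concretely, here is a small Python model of the indexing described above; the data layout and the suggested offset fix are assumptions drawn from the review text, not the kernel's actual code:

```python
# A mixed batch: the no-LoRA sentinel (-1) sorts ahead of the real IDs.
active_lora_ids = [-1, 3, 7]
num_active_loras = sum(i >= 0 for i in active_lora_ids)  # == 2

# The grid launches program IDs 0..num_active_loras-1 on axis 2, and each
# program reads active_lora_ids[pid]:
for pid in range(num_active_loras):
    lora_id = active_lora_ids[pid]
    if lora_id < 0:
        continue  # the kernel early-exits on the sentinel, so pid 0 does no work
    print(f"pid {pid} processes LoRA {lora_id}")

# Output: only LoRA 3 is processed; LoRA 7 at index 2 is never visited.
# One possible fix is to skip past the sentinel entries when indexing, e.g.
# active_lora_ids[pid + len(active_lora_ids) - num_active_loras].
```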
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Could you please provide the performance comparison?
Commits pulled in by the rebase:
- …ncy (vllm-project#30700) Signed-off-by: Nathan Price <nathan@abridge.com>
- Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
- Signed-off-by: SungMinCho <tjdals4565@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
- … With FP32 (vllm-project#30811) Signed-off-by: Micah Williamson <micah.williamson@amd.com>
- …project#27274) Signed-off-by: NickLucche <nlucches@redhat.com>
- Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
- Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
- Signed-off-by: Bowen Bao <bowenbao@amd.com>
- …roject#30903) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
- Signed-off-by: zhxchen17 <zhxchen17@fb.com>
- …roject#30788) Signed-off-by: zzhx1 <zzh_201018@outlook.com>
- Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Li, Jiang <bigpyj64@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- …8X.json` (vllm-project#29553) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
- Signed-off-by: Divakar Verma <divakar.verma@amd.com>
- …uest to store/load (vllm-project#30814) Signed-off-by: ApostaC <yihua98@uchicago.edu>
Documentation preview: https://vllm--30984.org.readthedocs.build/en/30984/
Hi @yugong333, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
This pull request has merge conflicts that must be resolved before it can be merged.
Looks like you messed up the rebase and tagged everyone; can you create a new PR?
Purpose
Reduce the overhead of launching idle kernel programs caused by sizing the grid by max-loras rather than by the number of active LoRAs.
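The title also mentions CUDA graph capture across varying num_active_lora. Since a captured graph freezes kernel launch parameters, a grid that depends on num_active_loras plausibly needs one capture per distinct value. The sketch below illustrates that idea under that assumption; the per-count bucketing and the run_step callable are hypothetical, not the PR's implementation:

```python
import torch

graphs: dict[int, torch.cuda.CUDAGraph] = {}

def capture(num_active_loras: int, run_step) -> None:
    """Capture one CUDA graph per num_active_loras value, since the grid
    size baked into the captured launches depends on it."""
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        run_step(num_active_loras)  # launches kernels with a grid sized by this count
    graphs[num_active_loras] = g

def replay(num_active_loras: int) -> None:
    # Replay the pre-captured graph matching the batch's active-LoRA count.
    graphs[num_active_loras].replay()
```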
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.