
Conversation

@yugong333 (Contributor) commented Dec 18, 2025

Purpose

Reduce the overhead of idle kernel launches caused by sizing the launch grid with max-loras instead of the number of active LoRAs.
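
In sketch form: the Triton launch grid previously sized its LoRA axis by the configured maximum number of adapter slots, so idle slots still launched programs that exited immediately. A minimal, hypothetical illustration of the before/after (the concrete sizes and the max_loras/num_active_loras bindings are stand-ins, not the exact vLLM code):

    import triton

    EM, N = 4096, 1024     # illustrative token/feature dimensions
    max_loras = 32         # configured adapter slot capacity
    num_active_loras = 2   # adapters actually used by the current batch

    # Before: the grid's LoRA axis is sized by capacity, so idle slots
    # still launch kernel programs that early-exit.
    grid_before = lambda META: (
        triton.cdiv(EM, META["BLOCK_SIZE_M"]) * triton.cdiv(N, META["BLOCK_SIZE_N"]),
        max_loras,
    )

    # After: the axis is sized by the active count, skipping idle launches.
    grid_after = lambda META: (
        triton.cdiv(EM, META["BLOCK_SIZE_M"]) * triton.cdiv(N, META["BLOCK_SIZE_N"]),
        num_active_loras,
    )

    meta = {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64}
    print(grid_before(meta))  # (1024, 32)
    print(grid_after(meta))   # (1024, 2)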

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
@gemini-code-assist bot left a comment

Code Review

This pull request introduces an optimization for LoRA by making the CUDA grid size dependent on the number of active LoRAs rather than the maximum possible number. This is achieved by passing num_active_loras through various layers down to the Triton kernel. The changes are logical and well-implemented. I've identified one area for improvement concerning code duplication, which could impact maintainability and correctness.
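
For context, a hedged sketch of how an active-adapter count can be derived from per-token LoRA IDs, where -1 marks tokens without an adapter (the function name and tensor layout are illustrative assumptions, not the actual vLLM metadata code):

    import torch

    def count_active_loras(token_lora_ids: torch.Tensor) -> int:
        """Count distinct adapters in use; -1 is the no-LoRA sentinel."""
        unique_ids = torch.unique(token_lora_ids)  # returns sorted unique values
        return int((unique_ids >= 0).sum().item())

    # A batch mixing no-LoRA tokens (-1) with adapters 0 and 2:
    ids = torch.tensor([-1, 0, 0, 2, -1])
    print(count_active_loras(ids))  # 2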

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 254 to +257:

    * triton.cdiv(EM, META["BLOCK_SIZE_M"])
    * triton.cdiv(N, META["BLOCK_SIZE_N"]),
    len(lora_a_stacked),
    lora_a_stacked[0].shape[0],
    num_active_loras,

P1: Fused MoE grid skips LoRAs when batch mixes LoRA and no-LoRA tokens

The fused MoE shrink kernel now launches axis=2 programs based on num_active_loras, which lora_kernel_metadata.prepare_tensors computes as the count of non-negative IDs. When a batch mixes LoRA and non-LoRA tokens, active_lora_ids is sorted as [-1, <ids…>], so num_active_loras excludes the leading -1 while the grid still iterates only the first num_active_loras entries. The first program hits -1 and early-exits, later positive IDs are never processed, and LoRA contributions are silently dropped. This miscomputes outputs for any MoE batch that contains both LoRA and non-LoRA tokens.
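
A tiny pure-Python repro of the slot/count mismatch described above (variable names mirror the description; this is an illustration, not the kernel code):

    # The sorted ID list puts the no-LoRA sentinel -1 in slot 0.
    active_lora_ids = sorted([2, -1, 0])                     # [-1, 0, 2]
    num_active_loras = sum(i >= 0 for i in active_lora_ids)  # 2

    processed = []
    for slot in range(num_active_loras):  # grid covers slots 0 and 1 only
        lora_id = active_lora_ids[slot]
        if lora_id < 0:
            continue                      # kernel early-exits on the sentinel
        processed.append(lora_id)

    # Only adapter 0 runs; adapter 2 sits in slot 2, outside the grid,
    # so its contribution is silently dropped.
    print(processed)  # [0]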

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
@jeejeelee (Collaborator) commented

Could you please provide the performance comparison?

atalman and others added 18 commits December 22, 2025 19:28

  • Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
  • Signed-off-by: SungMinCho <tjdals4565@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
  • … With FP32 (vllm-project#30811) Signed-off-by: Micah Williamson <micah.williamson@amd.com>
  • Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
  • Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
  • Signed-off-by: zhxchen17 <zhxchen17@fb.com>
  • Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Li, Jiang <bigpyj64@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  • …8X.json` (vllm-project#29553) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
  • Signed-off-by: Divakar Verma <divakar.verma@amd.com>
  • …uest to store/load (vllm-project#30814) Signed-off-by: ApostaC <yihua98@uchicago.edu>
mergify bot added the documentation, ci/build, deepseek, frontend, llama, multi-modality, new-model, performance, qwen, gpt-oss, rocm, and cpu labels on Dec 22, 2025

mergify bot added the tpu, tool-calling, and kv-connector labels on Dec 22, 2025
mergify bot commented Dec 22, 2025

Hi @yugong333, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing? mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

    # For mypy (substitute "3.10" with the failing version if needed)
    pre-commit run --hook-stage manual mypy-3.10

    # For markdownlint
    pre-commit run --hook-stage manual markdownlint
mergify bot commented Dec 22, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @yugong333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Dec 22, 2025
@DarkLight1337 (Member) commented Dec 23, 2025

Looks like you messed up the rebase and tagged everyone; can you create a new PR?

Labels

ci/build, cpu, deepseek, documentation, frontend, gpt-oss, kv-connector, llama, multi-modality, needs-rebase, new-model, nvidia, performance, qwen, rocm, structured-output, tool-calling, tpu, v1