Grid construction based on num_active_lora and support CUDA graph capture across various num_active_lora #30984
base: main
Conversation
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Code Review
This pull request introduces an optimization for LoRA by making the CUDA grid size dependent on the number of active LoRAs rather than the maximum possible number. This is achieved by passing num_active_loras through various layers down to the Triton kernel. The changes are logical and well-implemented. I've identified one area for improvement concerning code duplication, which could impact maintainability and correctness.
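To make the mechanism concrete, here is a minimal sketch of grid construction sized by the active-LoRA count; the function and parameter names are illustrative assumptions, not the PR's actual code:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as used for tile counts."""
    return (a + b - 1) // b

def lora_grid(EM: int, N: int, num_slices: int, num_active_loras: int,
              block_m: int = 64, block_n: int = 64) -> tuple[int, int, int]:
    # Axis 0: tiles covering the output matrix.
    # Axis 1: one program per stacked LoRA weight slice.
    # Axis 2: previously sized by max-loras (lora_a_stacked[0].shape[0]),
    #         which launched idle programs; now sized by the active count.
    return (
        cdiv(EM, block_m) * cdiv(N, block_n),
        num_slices,
        num_active_loras,
    )

# With 8 LoRA slots configured but only 2 adapters active in the batch,
# axis 2 shrinks from 8 to 2 program instances per launch.
print(lora_grid(EM=4096, N=1024, num_slices=1, num_active_loras=2))
```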
💡 Codex Review
Here are some automated review suggestions for this pull request.
    * triton.cdiv(EM, META["BLOCK_SIZE_M"])
    * triton.cdiv(N, META["BLOCK_SIZE_N"]),
    len(lora_a_stacked),
    lora_a_stacked[0].shape[0],
    num_active_loras,
Fused MoE grid skips LoRAs when batch mixes LoRA and no-LoRA tokens
The fused MoE shrink kernel now launches axis=2 programs based on num_active_loras, which is computed in lora_kernel_metadata.prepare_tensors as the count of non-negative IDs. When a batch mixes LoRA and non-LoRA tokens, active_lora_ids is sorted as [-1, <ids…>], so num_active_loras excludes the leading -1 but the grid here iterates only the first num_active_loras entries. The first iteration hits -1 and early-exits, and later positive IDs are never processed, so LoRA contributions are silently dropped for mixed batches. This miscomputes outputs for any MoE batch that contains both LoRA and non-LoRA tokens.
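To see the failure mode concretely, here is a small Python model of the indexing described above; the data layout and the suggested offset fix are assumptions drawn from the review text, not the kernel's actual code:

```python
# A mixed batch: the no-LoRA sentinel (-1) sorts ahead of the real IDs.
active_lora_ids = [-1, 3, 7]
num_active_loras = sum(i >= 0 for i in active_lora_ids)  # == 2

# The grid launches program IDs 0..num_active_loras-1 on axis 2, and each
# program reads active_lora_ids[pid]:
for pid in range(num_active_loras):
    lora_id = active_lora_ids[pid]
    if lora_id < 0:
        continue  # the kernel early-exits on the sentinel, so pid 0 does no work
    print(f"pid {pid} processes LoRA {lora_id}")

# Output: only LoRA 3 is processed; LoRA 7 at index 2 is never visited.
# One possible fix is to skip past the sentinel entries when indexing, e.g.
# active_lora_ids[pid + len(active_lora_ids) - num_active_loras].
```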
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Could you please provide the performance comparison?
Commits pulled in by the rebase:
- …ncy (vllm-project#30700) Signed-off-by: Nathan Price <nathan@abridge.com>
- Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
- Signed-off-by: SungMinCho <tjdals4565@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
- … With FP32 (vllm-project#30811) Signed-off-by: Micah Williamson <micah.williamson@amd.com>
- …project#27274) Signed-off-by: NickLucche <nlucches@redhat.com>
- Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
- Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
- Signed-off-by: Bowen Bao <bowenbao@amd.com>
- …roject#30903) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
- Signed-off-by: zhxchen17 <zhxchen17@fb.com>
- …roject#30788) Signed-off-by: zzhx1 <zzh_201018@outlook.com>
- Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Li, Jiang <bigpyj64@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- …8X.json` (vllm-project#29553) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
- Signed-off-by: Divakar Verma <divakar.verma@amd.com>
- …uest to store/load (vllm-project#30814) Signed-off-by: ApostaC <yihua98@uchicago.edu>
Documentation preview: https://vllm--30984.org.readthedocs.build/en/30984/
Hi @yugong333, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
This pull request has merge conflicts that must be resolved before it can be merged.
Looks like you messed up the rebase and tagged everyone; can you create a new PR?
Purpose
Reduce the overhead of launching idle kernel programs caused by sizing the grid by max-loras rather than by the number of active LoRAs.
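The title also mentions CUDA graph capture across varying num_active_lora. Since a captured graph freezes kernel launch parameters, a grid that depends on num_active_loras plausibly needs one capture per distinct value. The sketch below illustrates that idea under that assumption; the per-count bucketing and the run_step callable are hypothetical, not the PR's implementation:

```python
import torch

graphs: dict[int, torch.cuda.CUDAGraph] = {}

def capture(num_active_loras: int, run_step) -> None:
    """Capture one CUDA graph per num_active_loras value, since the grid
    size baked into the captured launches depends on it."""
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        run_step(num_active_loras)  # launches kernels with a grid sized by this count
    graphs[num_active_loras] = g

def replay(num_active_loras: int) -> None:
    # Replay the pre-captured graph matching the batch's active-LoRA count.
    graphs[num_active_loras].replay()
```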
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.