[Kernel] Add cuda kernel for gpt_oss activation #22538
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces a new CUDA kernel for the swiglu_oai activation function, which is used in gpt_oss models. The changes span the CUDA kernel implementation, PyTorch bindings, a Python wrapper class, and corresponding tests. My review has identified a critical issue in the Python implementation where the data layout assumption is inconsistent with the CUDA kernel, and a high-severity bug in the test logic that inadvertently breaks existing tests for another activation function.
```python
gate, up = x[..., ::2], x[..., 1::2]
gate = gate.clamp(min=None, max=self.limit)
up = up.clamp(min=-self.limit, max=self.limit)
glu = gate * torch.sigmoid(gate * self.alpha)
gated_output = (up + 1) * glu
return gated_output
```
The forward_native implementation for SwiGLUOAI assumes an interleaved layout for gate and up tensors (x[..., ::2], x[..., 1::2]). However, the corresponding CUDA kernel clamp_swiglu_kernel_with_params expects a concatenated layout (first half is gate, second half is up), similar to other activation functions in this file like SiluAndMul. This inconsistency will lead to incorrect results when using the native PyTorch path and will cause the tests to fail. The implementation should be updated to match the CUDA kernel's expectation.
Suggested change:

```diff
-gate, up = x[..., ::2], x[..., 1::2]
-gate = gate.clamp(min=None, max=self.limit)
-up = up.clamp(min=-self.limit, max=self.limit)
-glu = gate * torch.sigmoid(gate * self.alpha)
-gated_output = (up + 1) * glu
-return gated_output
+d = x.shape[-1] // 2
+gate, up = x[..., :d], x[..., d:]
+gate = gate.clamp(min=None, max=self.limit)
+up = up.clamp(min=-self.limit, max=self.limit)
+glu = gate * torch.sigmoid(gate * self.alpha)
+return (up + 1) * glu
```
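To make the layout mismatch concrete, here is a small standalone check (toy shapes, not code from the PR) showing that the interleaved and concatenated slicings read different elements, which is why the native path and the kernel must agree on one convention:

```python
import torch

# Toy example: pretend the last dimension is 2 * d with d = 4.
x = torch.arange(8.0)

# Interleaved convention (what forward_native currently assumes).
interleaved_gate, interleaved_up = x[::2], x[1::2]   # [0, 2, 4, 6], [1, 3, 5, 7]

# Concatenated convention (what the CUDA kernel expects, like SiluAndMul).
d = x.shape[-1] // 2
concat_gate, concat_up = x[:d], x[d:]                # [0, 1, 2, 3], [4, 5, 6, 7]

# The two conventions pick different elements, so mixing them gives wrong results.
assert not torch.equal(interleaved_gate, concat_gate)
```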
Yes, you are right, but we should fix the kernel rather than this part
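For reference, here is a minimal loop-form sketch (not the PR's actual kernel, and with placeholder defaults for alpha and limit) of the per-element mapping an interleaved-layout kernel would have to implement: output element i reads gate from x[..., 2*i] and up from x[..., 2*i + 1].

```python
import torch

def swiglu_oai_interleaved_ref(x: torch.Tensor,
                               alpha: float = 1.702,   # placeholder default
                               limit: float = 7.0      # placeholder default
                               ) -> torch.Tensor:
    """Loop form of the interleaved layout; one iteration ~ one kernel lane."""
    d = x.shape[-1] // 2
    out = torch.empty(*x.shape[:-1], d, dtype=x.dtype, device=x.device)
    for i in range(d):
        gate = x[..., 2 * i].clamp(max=limit)
        up = x[..., 2 * i + 1].clamp(min=-limit, max=limit)
        out[..., i] = (up + 1) * gate * torch.sigmoid(gate * alpha)
    return out
```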
| if activation == "fatrelu": | ||
| opcheck(fn, (out, x, threshold)) | ||
| if activation == "swiglu_oai": | ||
| opcheck(fn, (out, x, layer.alpha, layer.limit)) | ||
| else: | ||
| opcheck(fn, (out, x)) |
The conditional logic for opcheck is incorrect. By adding a new if statement instead of an elif, you've broken the logic for the fatrelu activation. When activation == "fatrelu", opcheck is now called twice: once correctly, and a second time inside the final else block with the wrong number of arguments, which will cause the test to fail. This should be a single if/elif/else chain.
| if activation == "fatrelu": | |
| opcheck(fn, (out, x, threshold)) | |
| if activation == "swiglu_oai": | |
| opcheck(fn, (out, x, layer.alpha, layer.limit)) | |
| else: | |
| opcheck(fn, (out, x)) | |
| if activation == "fatrelu": | |
| opcheck(fn, (out, x, threshold)) | |
| elif activation == "swiglu_oai": | |
| opcheck(fn, (out, x, layer.alpha, layer.limit)) | |
| else: | |
| opcheck(fn, (out, x)) |
Good catches.
@WoosukKwon @zyongye It looks like we can use the Triton kernel; is it still necessary to add a CUDA kernel?
That's right. The Triton kernel should support this, but I guess it can only be used on SM80+?
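As an illustration of the trade-off being discussed, here is a rough dispatch sketch (not code from this PR; forward_native/forward_cuda follow the usual CustomOp naming, everything else is assumed) that would use a fused kernel only on SM80+ and fall back to the native PyTorch path otherwise:

```python
import torch

def swiglu_oai_dispatch(layer, x: torch.Tensor) -> torch.Tensor:
    # Illustrative only: pick the fused kernel on SM80+ GPUs, otherwise use
    # the reference PyTorch implementation.
    if x.is_cuda and torch.cuda.get_device_capability(x.device) >= (8, 0):
        return layer.forward_cuda(x)   # fused kernel path (assumed method)
    return layer.forward_native(x)     # reference PyTorch path
```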
```python
@CustomOp.register("swiglu_oai")
class SwiGLUOAI(CustomOp):
```
Seems like this should be a SwigluOAIAndMul. Don't forget to put it in the registry at the bottom.
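A rough sketch of what this rename could look like, with illustrative names (the actual class name and registry key in activation.py may differ):

```python
# Hypothetical rename; the final name and key used in the PR may differ.
@CustomOp.register("swigluoai_and_mul")
class SwigluOAIAndMul(CustomOp):
    ...

# ...plus an entry in the activation registry at the bottom of the file,
# next to the existing entries such as SiluAndMul, e.g.:
#   "swigluoai": lambda: SwigluOAIAndMul(),
```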
mgoin left a comment
LGTM, just a few nits
mgoin left a comment
LGTM, thanks!
Could you change the name here as well?

Done in 8471138.
@jeejeelee please re-open; I ran into an issue after the merge, see #22948 (comment).

That's the merge-order case; I'll try to resolve it.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Test Plan
lm eval
```
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 lm_eval --model vllm --model_args "pretrained=unsloth/gpt-oss-20b-BF16,max_model_len=32768" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
```

benchmark
Test Result (A800)
This PR
The main branch
@mgoin The GSM8K score is too low. We also tested with the transformers backend and found that the scores were similar.