[NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend #21458
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀 |
Code Review
This pull request introduces a new FlashInfer backend for per-tensor scaled FP8 Mixture of Experts (MoE), which shows promising performance improvements on SM100 architectures. The changes include adding a new custom operator, refactoring some utility functions into a shared module, and updating the quantization layers to use this new backend.
The code is generally well-structured, and the refactoring of utility functions into flashinfer_utils.py is a good step towards better code organization.
However, there are a couple of areas that could be improved for better maintainability and potentially better performance:
- There is significant code duplication in the logic that invokes the new MoE kernel from both Fp8MoEMethod and ModelOptFp8MoEMethod. This should be refactored into a shared helper function.
- The tile_tokens_dim parameter for the new kernel is hardcoded, which might not be optimal for all workloads and differs from the dynamic approach used in the existing block-scale kernel (see the sketch after this list).

Addressing these points will enhance the quality and robustness of the new backend.
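For illustration only, a hedged sketch of what a dynamic choice could look like; the helper name and the [8, 64] clamp range here are assumptions, not the actual block-scale kernel code:

```python
# Hypothetical sketch: derive tile_tokens_dim from the workload instead of hardcoding it.
# Size the tile to the expected tokens per expert, rounded up to a power of two and
# clamped to an assumed [8, 64] range.
def suggested_tile_tokens_dim(num_tokens: int, top_k: int, num_experts: int) -> int:
    tokens_per_expert = max(1, (num_tokens * top_k) // num_experts)
    tile = 1 << (tokens_per_expert - 1).bit_length()  # next power of two
    return min(max(tile, 8), 64)
```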
There appears to be significant code duplication here. The logic inside this if self.flashinfer_moe_enabled: block is nearly identical to the logic in vllm/model_executor/layers/quantization/fp8.py (lines 993-1016).
Duplicating this code block makes future maintenance harder, as changes would need to be applied in two places.
To improve maintainability, I suggest refactoring this shared logic into a common helper function. This function could be placed in a utility module, perhaps vllm/model_executor/layers/quantization/utils/flashinfer_utils.py, and called from both Fp8MoEMethod.apply and ModelOptFp8MoEMethod.apply.
For example, you could create a helper like this:
```python
# In a shared utility file
def apply_flashinfer_per_tensor_scale_fp8(
    layer: torch.nn.Module,
    x: torch.Tensor,
    router_logits: torch.Tensor,
    e_score_correction_bias: Optional[torch.Tensor],
    top_k: int,
    num_expert_group: Optional[int],
    topk_group: Optional[int],
    global_num_experts: int,
    apply_router_weight_on_input: bool,
) -> torch.Tensor:
    return torch.ops.vllm.flashinfer_fused_moe_per_tensor_scale_fp8(
        routing_logits=router_logits,
        routing_bias=e_score_correction_bias,
        hidden_states=x,
        input_scale=layer.w13_input_scale,
        gemm1_weights=layer.w13_weight,
        gemm1_weights_scale=layer.w13_weight_scale,
        gemm2_weights=layer.w2_weight,
        gemm2_weights_scale=layer.w2_weight_scale,
        activation_scale=layer.w2_input_scale,
        num_experts=global_num_experts,
        top_k=top_k,
        num_expert_group=num_expert_group,
        topk_group=topk_group,
        intermediate_size=layer.intermediate_size_per_partition,
        local_expert_offset=layer.ep_rank * layer.local_num_experts,
        local_num_experts=layer.local_num_experts,
        use_routing_scales_on_input=apply_router_weight_on_input,
    )
```

This would centralize the logic and make the code cleaner and easier to maintain.
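For context, a hedged sketch of how both call sites might then use this helper; the argument names follow the sketch above and may not match the actual apply() signatures exactly:

```python
# Hypothetical call site shared by Fp8MoEMethod.apply and ModelOptFp8MoEMethod.apply;
# self.flashinfer_moe_enabled refers to the flag mentioned in the review comment above.
if self.flashinfer_moe_enabled:
    return apply_flashinfer_per_tensor_scale_fp8(
        layer=layer,
        x=x,
        router_logits=router_logits,
        e_score_correction_bias=e_score_correction_bias,
        top_k=top_k,
        num_expert_group=num_expert_group,
        topk_group=topk_group,
        global_num_experts=global_num_experts,
        apply_router_weight_on_input=apply_router_weight_on_input,
    )
```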
I would like this utility to be implemented to help avoid drift between the two code paths.
I am a little worried about this line breaking CUDA graph capture because we are creating a new tensor on the fly. Should we create this zero bias in the caller instead? Or maybe ask FlashInfer to support routing_bias=None so that we don't need to pass in a fake bias.
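For illustration, one hedged way to avoid the per-call allocation would be to create the zero bias once, e.g. during weight processing; the attribute name zero_routing_bias, the helper names, and the dtype are assumptions:

```python
import torch

# Hypothetical sketch: allocate the fake routing bias once, outside the captured region,
# so no new tensor is created during CUDA graph capture.
def prepare_zero_routing_bias(layer: torch.nn.Module, global_num_experts: int,
                              device: torch.device) -> None:
    layer.zero_routing_bias = torch.zeros(
        global_num_experts, dtype=torch.bfloat16, device=device)

# At call time, reuse the pre-allocated tensor instead of building one on the fly.
def pick_routing_bias(layer: torch.nn.Module, e_score_correction_bias):
    return (e_score_correction_bias if e_score_correction_bias is not None
            else layer.zero_routing_bias)
```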
not a blocking issue for now. We will fix this later if we really see it becoming an issue.
Fair point, I think asking FlashInfer to support routing_bias=None is probably better.
FlashInfer has fixed this in 0.2.9rc2. Do you think this is a blocker? If not, I prefer that we merge this PR first and then file another PR after we have upgraded to FlashInfer v0.2.9rc2.
However, if you think this is a blocker, we can wait for the FlashInfer v0.2.9rc2 upgrade, which should happen very soon.
Should we use RoutingMethodType.Llama4 instead of a hard-coded "3"?
not a blocking issue, just code style
It is better, but the issue is that if a different version of flashinfer is installed (or flashinfer isn't installed at all) we'll get an import error. I thought about doing this conversion inside the function, after we know that the correct version of flashinfer is installed, wdyt?
or we can define our class to mimic FlashInfer's class?
```python
from enum import IntEnum

has_flashinfer = False
try:
    import flashinfer
    from flashinfer.fused_moe import RoutingMethodType
    has_flashinfer = True
except ImportError:
    pass

class FlashInferRoutingMethodType(IntEnum):
    # Default: Softmax -> TopK
    Default = RoutingMethodType.Default if has_flashinfer else 0
    # Renormalize: TopK -> Softmax
    Renormalize = RoutingMethodType.Renormalize if has_flashinfer else 1
    # DeepSeekV3: Sigmoid -> RoutingBiasAdd -> Top2 in group -> Top4 groups -> Top8 experts from the Top4 groups
    DeepSeekV3 = RoutingMethodType.DeepSeekV3 if has_flashinfer else 2
    # Llama4: Top1 -> Sigmoid
    Llama4 = RoutingMethodType.Llama4 if has_flashinfer else 3
    # Qwen3: Softmax -> TopK -> Renormalize
    RenormalizeNaive = RoutingMethodType.RenormalizeNaive if has_flashinfer else 4
    # Unspecified
    Unspecified = RoutingMethodType.Unspecified if has_flashinfer else 5
```
This is not critical and can be handled in later PRs
I think we can make this class lazily imported and remove the default arg, so we only need to import it once in the function.
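A hedged sketch of that lazy-import approach, reusing the import path from the snippet above; the helper name and the integer fallbacks are assumptions:

```python
def get_flashinfer_routing_method_type(name: str) -> int:
    # Import RoutingMethodType only when the FlashInfer path is actually taken,
    # so that vLLM does not require flashinfer at module-import time.
    try:
        from flashinfer.fused_moe import RoutingMethodType
        return int(getattr(RoutingMethodType, name))
    except ImportError:
        # Fallback integer codes mirroring the enum ordering shown above.
        return {"Default": 0, "Renormalize": 1, "DeepSeekV3": 2,
                "Llama4": 3, "RenormalizeNaive": 4, "Unspecified": 5}[name]
```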
| Depends on #21485 |
Force-pushed from c3e365c to 872160e. Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Regarding num_expert_group = num_expert_group if num_expert_group is not None else 1: this should fall back to 0, not 1, when num_expert_group is None.
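A minimal sketch of the suggested change, keeping the quoted variable name:

```python
# Fall back to 0 (no expert grouping) rather than 1 when num_expert_group is None.
num_expert_group = num_expert_group if num_expert_group is not None else 0
```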
Force-pushed from 872160e to fdf635b.

```python
    local_num_experts: int,
    use_routing_scales_on_input: bool,
    routed_scaling_factor: float = 1.0,
    routing_method_type: int = 3  # Llama4-styled routing method
```
I'm not a fan of defaulting this parameter if it is going to dictate model support. For instance, in the current usage of this function the parameter isn't set, but there is no check that the model actually needs Llama 4 routing, i.e. it would be silently incorrect for a Mixtral with the same quant.
@amirkl94 Maybe let's remove the default value for routing_method_type and make this arg a required argument?
And from llama4.py we should pass this into fused_moe.py?
llama4 already does this by defining its own custom routing function and passing that into FusedMoE
vllm/vllm/model_executor/models/llama4.py, line 79 in e18f085:

```python
custom_routing_function=Llama4MoE.custom_routing_function,
```
I suppose you could just check if custom_routing_function == Llama4MoE.custom_routing_function.
I think I can't check custom_routing_function == Llama4MoE.custom_routing_function, unless you meant in llama4.py?
Should I just make this parameter optional and pass it only from llama4 and if it's not passed I'll default to the non-flashinfer implementation?
I think what @mgoin meant is that in modelopt.py: https://github.com/vllm-project/vllm/blob/185bdd608d24418fae365238b9eb500f8c778241/vllm/model_executor/layers/quantization/modelopt.py#L458
the layer object is just an instance of FusedMoE, so you can dispatch routing_method using:
```python
if layer.routing_method == Llama4MoE.custom_routing_function:
    routing_method = 3
```
@mgoin is this what you meant?
Yes this was what I meant. Obviously not optimal, but should be okay
@mgoin Currently, FlashInfer's per-tensor FP8 MoE only supports Llama4 routing, so I told @amirkl94 to assert that layer.routing_method == Llama4MoE.custom_routing_function holds; if it does not, an exception is raised.
This is done so that in the future, if anyone wants to use FlashInfer per-tensor FP8 MoE for another model, it will fail loudly and tell the user why that is not supported. My philosophy is: a loud failure is better than silent corruption.
Could you check if the current implementation is acceptable to you? Thanks!
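A hedged sketch of the kind of guard being described; the attribute name follows the thread above and may differ from the actual FusedMoE layer attribute:

```python
# Hypothetical guard: fail loudly when a non-Llama4 routing method reaches the
# FlashInfer per-tensor FP8 MoE path, which currently supports only Llama4 routing.
if layer.routing_method != Llama4MoE.custom_routing_function:
    raise NotImplementedError(
        "FlashInfer per-tensor scale FP8 MoE currently supports only "
        "Llama4-style routing (RoutingMethodType.Llama4).")
```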
Force-pushed from 185bdd6 to 6582abc.
| The pipeline failure doesn't seem to be caused by this PR. |
Signed-off-by: mgoin <mgoin64@gmail.com>
| I fixed some issues with the PR and validated accuracy + performance. I see about a 10% throughput improvement on gsm8k on 1xB200. Will do a final review now. |
| The failure is: Doesn't seem to be related to this PR |
| @mgoin The CI errors seem to be unrelated to my PR, as I saw they're happening on other branches as well: https://github.com/vllm-project/vllm/pull/21747/commits. |
| Yes, this is what I've found too. I've requested force merge, thank you. |
Purpose
This PR introduces a new backend for per-tensor scaled FP8 MoE from FlashInfer. This backend gives a performance improvement as described below.
Accuracy tests
Ran manual lm_eval on gsm8k, using the following command:

Results:
Perf tests
Tested on a 1xB200 GPU, using the latency benchmark:
Results: