Conversation

@charlifu
Contributor

This PR adds a few fusion passes for AITER to fuse layernorm + fp8 block quant and silu + fp8 block quant.

Signed-off-by: charlifu <charlifu@amd.com>
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces new fusion passes for ROCm AITer, specifically for layernorm + fp8 block quant and silu + fp8 block quant. This is achieved by adding a new pattern AiterSiluMulFp8BlockQuantPattern and registering a new custom operator. Additionally, the changes in fp8_utils.py extend AITer support to non-MI300 series GPUs by providing a Triton-based fallback, which is a great enhancement.

My main feedback is on a performance concern in fp8_utils.py where an import is performed inside a performance-critical function. I've suggested a refactoring to move the import to the module level to avoid repeated overhead.

Comment on lines 64 to 69
```python
    # MI300's fp8nuz should be enough to detect if we call ck vs triton
    if current_platform.is_fp8_fnuz():
        from aiter import gemm_a8w8_blockscale
    else:
        from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
    return gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
```
Contributor

high

Importing inside a function that is on a hot path, like this custom op implementation, can introduce performance overhead. It's best practice to move imports to the module level to ensure they are only executed once.

I'd recommend defining a module-level variable that holds the correct gemm_a8w8_blockscale function based on the platform, and then using that variable within this function. This avoids repeated import lookups.

For example, you could add the following logic at the module level (e.g., near the top of the file):

```python
_gemm_a8w8_blockscale = None
if current_platform.is_rocm():
    try:
        # MI300's fp8nuz should be enough to detect if we call ck vs triton
        if current_platform.is_fp8_fnuz():
            from aiter import gemm_a8w8_blockscale
        else:
            from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
        _gemm_a8w8_blockscale = gemm_a8w8_blockscale
    except ImportError:
        # aiter is not installed, which is fine.
        # The error will be raised when the op is actually used.
        pass
```

And then this function's body can be simplified as suggested.

Suggested change
```diff
-    # MI300's fp8nuz should be enough to detect if we call ck vs triton
-    if current_platform.is_fp8_fnuz():
-        from aiter import gemm_a8w8_blockscale
-    else:
-        from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
-    return gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
+    if _gemm_a8w8_blockscale is None:
+        raise ImportError(
+            "Aiter backend for gemm_a8w8_blockscale not available. "
+            "Please install aiter.")
+    return _gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
```
Collaborator

Let's do this dispatch outside, yeah.

@charlifu
Contributor Author

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@ProExpertProg
Collaborator

I'm currently overhauling custom op matching in #24604. We also recently added a torch implementation of group quant; could you compare its performance with AITER's? Could you also compare the perf of the fused AITER kernel against the fused torch.compile kernel for rmsnorm+quant? Happy to help out with instructions, but overall:
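For the rmsnorm+quant comparison, a minimal harness along these lines would do (a sketch in plain PyTorch; the shapes, eps, and fp8 dtype are illustrative, and MI300 would use the fnuz variant):

```python
import torch
from torch.utils import benchmark

def rmsnorm_quant(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Reference semantics: RMSNorm followed by per-tensor fp8 quantization.
    rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    y = x * rms * weight
    scale = y.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
    return (y / scale).to(torch.float8_e4m3fn), scale

x = torch.randn(4096, 8192, dtype=torch.bfloat16, device="cuda")
w = torch.ones(8192, dtype=torch.bfloat16, device="cuda")
compiled = torch.compile(rmsnorm_quant)
compiled(x, w)  # warm-up: trigger compilation before timing

for label, fn in (("eager", rmsnorm_quant), ("torch.compile", compiled)):
    timer = benchmark.Timer(stmt="fn(x, w)",
                            globals={"fn": fn, "x": x, "w": w})
    print(label, timer.timeit(100))
```

The same Timer loop can wrap the AITER kernel call in place of `fn` to get the third data point.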

```diff
     SiluMulFp8StaticQuantPattern,
-    SiluMulNvfp4QuantPattern)
+    SiluMulNvfp4QuantPattern,
+    AiterSiluMulFp8BlockQuantPattern)
```
Collaborator

This symbol definition is conditional on is_rocm_aiter_linear_enabled(); any run will fail here if it is not enabled.

Contributor

Should be fixed now: cd059b9

```python
    return x_fp8, out_bs


direct_register_custom_op(
    op_name="rocm_aiter_act_mul_and_fp8_group_quant",
```
Collaborator

@tjtanaa Sep 28, 2025

Can you check whether the latest AITER allows you to skip direct_register_custom_op? I remember most ops should now work without calling direct_register_custom_op on the vLLM side, since the registration is already done in the AITER repository. Moreover, removing the direct_register_custom_op wrappers reduces CPU overhead, which can be costly.

Please take a look at the benchmarking results in ROCm#717 (the second and third cases), which show that removing direct_register_custom_op on the vLLM side improves perf.
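For context, the wrapper pattern being discussed looks roughly like this (a sketch: the AITER function signature, output dtypes, and group size are assumptions; only the op name and the aiter.ops.triton.fused_fp8_quant module are named in this thread):

```python
import torch
from vllm.utils import direct_register_custom_op

def _act_mul_fp8_group_quant_impl(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Forward to the AITER kernel; call signature assumed for illustration.
    from aiter.ops.triton.fused_fp8_quant import act_mul_and_fp8_group_quant
    return act_mul_and_fp8_group_quant(x)

def _act_mul_fp8_group_quant_fake(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Fake impl: gives torch.compile the output shapes/dtypes without running
    # the kernel. silu_and_mul halves the last dim; 128 is an assumed group size.
    d = x.shape[-1] // 2
    out = torch.empty(*x.shape[:-1], d, dtype=torch.float8_e4m3fnuz, device=x.device)
    scales = torch.empty(*x.shape[:-1], d // 128, dtype=torch.float32, device=x.device)
    return out, scales

# This per-op registration on the vLLM side is the overhead in question; if a
# future AITER release registers the op itself, vLLM could invoke it through
# torch.ops directly and drop this wrapper.
direct_register_custom_op(
    op_name="rocm_aiter_act_mul_and_fp8_group_quant",
    op_func=_act_mul_fp8_group_quant_impl,
    mutates_args=[],
    fake_impl=_act_mul_fp8_group_quant_fake,
)
```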

Contributor

Hey, thanks for the feedback. Is there a version of aiter that has aiter.ops.triton.fused_fp8_quant and also ships the op registration you mentioned? I wasn't able to figure out how to call act_mul_and_fp8_group_quant without calling direct_register_custom_op first. I'd be happy to investigate further if you can point me in the right direction; otherwise, we can always come back and remove these direct_register_custom_op calls later if needed.

Collaborator

We can come back to this in a later PR, as the 355_wip aiter commit does not have that feature.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify

mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025
Signed-off-by: charlifu <charlifu@amd.com>
@mergify mergify bot removed the needs-rebase label Oct 8, 2025
Collaborator

@ProExpertProg left a comment

Will take a look sometime next week, just placing a temp hold while #24604 gets merged

charlifu and others added 8 commits October 21, 2025 21:15
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Collaborator

@ProExpertProg left a comment

I think we should add the AITER ops to MatcherRMSNorm/MatcherQuantFP8/MatcherSiluMul/... instead of creating separate patterns for the AITER ops, so that we don't need to duplicate these for every pass (think allreduce-rms-quant fusion, all of the rope fusions, etc.)
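The idea, sketched with invented names (not vLLM's actual matcher API): the matcher owns the list of equivalent op spellings, and each fusion pass instantiates its pattern once per spelling, so a new backend is wired up in exactly one place.

```python
from collections.abc import Callable
from dataclasses import dataclass, field

@dataclass
class MatcherSiluMul:
    # Interchangeable implementations of silu_and_mul: the native/default op
    # plus backend-specific custom ops such as the AITER one.
    impls: list[Callable] = field(default_factory=list)

    def register(self, impl: Callable) -> None:
        self.impls.append(impl)

def build_silu_mul_quant_patterns(matcher: MatcherSiluMul) -> list[tuple[str, Callable]]:
    # A fusion pass asks the matcher for every spelling, so registering the
    # AITER op once also covers allreduce-rms-quant, the rope fusions, etc.
    return [("silu_mul+quant", impl) for impl in matcher.impls]

matcher = MatcherSiluMul()
matcher.register(lambda x: x)  # stand-in: native silu_and_mul
matcher.register(lambda x: x)  # stand-in: AITER silu_and_mul custom op
assert len(build_silu_mul_quant_patterns(matcher)) == 2
```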

Comment on lines 62 to 63
```python
def empty_bf16(self, *args, **kws):
    return torch.empty(*args, dtype=torch.bfloat16, device=self.device, **kws)
```
Collaborator

I think this one should just use empty, as that uses the model dtype.

Contributor Author

Done

```python
if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
    from .matcher_utils import MatcherAiterFusedMulAdd

    class AiterMulAddFusionPattern:
```
Collaborator

Could you add a comment describing what kind of fusion is done in this pass?

Contributor Author

pass removed.

```python
    return

def pattern(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor):
    mul_add = self.fused_mul_add_matcher.forward_native(x, a, b)
```
Collaborator

This is abusing the matcher abstraction: it's meant to be a reusable matcher for a simple op, not to represent fused/unfused impls.

Contributor Author

pass removed.

```python
    ):
        return self.fused_mul_add_matcher.forward_custom(x, a, b)
    else:
        return self.fused_mul_add_matcher.forward_native(x, a, b)
```
Collaborator

Why is this conditional inside the replacement? The tracing will make this always pick a single branch depending on the graph inputs. I think it would be clearer to add an extra_check parameter to the register_replacement function
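Roughly what moving the condition into extra_check looks like with inductor's pattern matcher (a sketch; the mul+add pattern and the fused_mul_add_enabled flag are illustrative stand-ins for this PR's actual pattern and condition):

```python
import torch
from torch._inductor.pattern_matcher import (Match, PatternMatcherPass,
                                             fwd_only, register_replacement)

matcher_pass = PatternMatcherPass()
fused_mul_add_enabled = True  # stand-in for the runtime condition

def pattern(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor):
    return x * a + b

def replacement(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor):
    # Unconditional: once the pattern matches, the fused op is always emitted.
    return torch.addcmul(b, x, a)

def extra_check(match: Match) -> bool:
    # The branch lives here, evaluated at match time, instead of being
    # traced (and frozen) into the replacement graph.
    return fused_mul_add_enabled

register_replacement(pattern, replacement,
                     [torch.empty(8), torch.empty(8), torch.empty(8)],
                     fwd_only, matcher_pass, extra_check=extra_check)
```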

Contributor Author

pass removed.


```python
    return SiluAndMul.forward_native(x)


if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
```
Collaborator

All of this can live with the fusion pass

Contributor Author

pass removed

```python
if self.pass_config.enable_fusion:
    self.passes += [RMSNormQuantFusionPass(config)]
    self.passes += [ActivationQuantFusionPass(config)]
    self.passes += [MulAddFusionPass(config)]
```
Collaborator

New flag and new file, please.

Contributor Author

pass removed.

```python
)


class MulAddFusionPass(VllmPatternMatcherPass):
```
Collaborator

New file, please.

Contributor Author

pass removed

Comment on lines 141 to 143
```python
at1 = auto_functionalized(
    SILU_MUL_OP, result=result_silu_mul, input=input
)
```
Collaborator

Please use the MatcherSiluMul

@charlifu
Contributor Author

charlifu commented Nov 6, 2025

We found that mul+add fusion is not helping performance, so we are removing this pass.

DeepSeek-R1:

| | no fusion | no mul+add | all fusion |
| --- | --- | --- | --- |
| bs1, in 64, out 512 | 7.8252 | 7.4551 | 7.5100 |
| bs4, in 64, out 512 | 8.3053 | 8.037 | 8.19201 |
| bs8, in 64, out 512 | 8.156 | 7.9659 | 8.2399 |
charlifu and others added 2 commits November 6, 2025 16:47
Signed-off-by: charlifu <charlifu@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
@mergify

mergify bot commented Nov 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 6, 2025
Signed-off-by: charlifu <charlifu@amd.com>
@mergify mergify bot removed the needs-rebase label Nov 7, 2025
Signed-off-by: charlifu <charlifu@amd.com>
@mergify

mergify bot commented Nov 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 10, 2025
Signed-off-by: charlifu <charlifu@amd.com>

Labels

needs-rebase, rocm (Related to AMD ROCm)

6 participants