@lshpku (Contributor) commented Jul 22, 2025

PR Category

Operator Mechanism

PR Types

Performance

Description

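Optimize the performance of paddle.incubate.nn.functional.fused_transpose_split_quant. The changes are:

  1. Change the block size from (32, 32) to (32, 16), so that each SM can hold 2 blocks at the same time, which increases parallelism
  2. Reorder the expert_idx computation so that it can overlap with the loads
  3. Adjust the types of some indices to make them more rigorous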
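
The effect of the smaller block in item 1 can be illustrated with a rough occupancy estimate. This is only a sketch and not code from the PR: the per-SM limits below (2048 threads, 65536 registers, 32 blocks) and the figure of 64 registers per thread are assumptions chosen for illustration, the names `SMLimits` and `resident_blocks_per_sm` are hypothetical, and the estimate ignores shared memory and register allocation granularity. The actual limiting resource for this kernel is not stated in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed per-SM limits; check the documentation of the target GPU.
    max_threads: int = 2048
    max_registers: int = 65536
    max_blocks: int = 32


def resident_blocks_per_sm(block_dim, regs_per_thread, limits=SMLimits()):
    """Estimate how many blocks of shape block_dim fit on one SM at once."""
    threads = block_dim[0] * block_dim[1]
    by_threads = limits.max_threads // threads
    by_registers = limits.max_registers // (threads * regs_per_thread)
    return min(by_threads, by_registers, limits.max_blocks)


REGS_PER_THREAD = 64  # hypothetical register usage, not measured from the kernel

old = resident_blocks_per_sm((32, 32), REGS_PER_THREAD)  # 1024 threads per block
new = resident_blocks_per_sm((32, 16), REGS_PER_THREAD)  # 512 threads per block
print(old, new)
```

Under these assumptions a (32, 32) block uses the whole register file by itself, while two (32, 16) blocks fit side by side, so one block can issue loads while the other is stalled.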

After the optimization, bandwidth utilization on H-series GPUs increases from 46% to 80%.
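
For readers unfamiliar with the metric, bandwidth utilization is commonly computed as achieved memory throughput divided by the device's peak throughput. The helper below is a minimal sketch of that calculation; the function name and all numbers in the example are hypothetical and are not measurements from this PR.

```python
def bandwidth_utilization(bytes_read, bytes_written, elapsed_s, peak_bytes_per_s):
    """Return achieved memory throughput as a fraction of peak throughput."""
    achieved = (bytes_read + bytes_written) / elapsed_s
    return achieved / peak_bytes_per_s


# Hypothetical example: 0.4 GB read and 0.4 GB written in 1 ms
# on a device with an assumed peak of 1 TB/s.
util = bandwidth_utilization(4e8, 4e8, 1e-3, 1e12)
print(f"{util:.0%}")
```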

Pcard-85711

paddle-bot bot commented Jul 22, 2025

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

@phlrain phlrain self-requested a review July 23, 2025 05:05
@phlrain phlrain merged commit 9c0d34c into PaddlePaddle:develop Jul 23, 2025
55 checks passed