@lshpku lshpku commented Jun 25, 2025

PR Category

Operator Mechanism

PR Types

Performance

Description

Optimize the fused_stack_transpose_quant op in the following four ways:

  1. Remove the memcpy that sent the pointers of x to the device, and pass them as kernel function arguments instead, consistent with the stack op in the PHI operator library; nsys profiling confirms that no memcpy occurs this way. The underlying stack component supports any number of inputs: with ≤64 inputs the pointers are passed as arguments, and with >64 it falls back to memcpy. Stress testing found no correctness issues.
  2. Merge fused_stack_quant and fused_stack_transpose_quant into a single file. They were originally one file sharing the same API (distinguished by a transpose parameter), but were split into two when migrated from the NLP repository; merging them back makes future maintenance easier.
  3. Tune the block and loop sizes to improve performance on H-series GPUs. The first implementation was done on an A100, and with little H-series experience at the time I overlooked that H-series GPUs have a tighter register budget than A-series; after tuning, there is no longer register spilling on H GPUs and performance improved accordingly.
  4. Fix a bug where the kernel forgot to return early for 0-size inputs.
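The pointer-passing mechanism in point 1 can be sketched on the host side as follows. This is a minimal illustrative sketch, not Paddle's actual implementation: all names (`PtrPack`, `PackInputs`, `kMaxByParam`) are hypothetical, and the real code passes the packed struct by value as a CUDA kernel argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity pointer pack, small enough to be passed by value as a
// kernel argument (hypothetical struct; CUDA caps total kernel-argument size,
// which is why the capacity is bounded).
constexpr size_t kMaxByParam = 64;

template <typename T>
struct PtrPack {
  const T* ptrs[kMaxByParam];
  size_t n;
};

// kByParam: pointers travel inside the kernel-argument struct (no memcpy).
// kByMemcpy: too many inputs; the pointer table must be copied to device.
enum class Strategy { kByParam, kByMemcpy };

template <typename T>
Strategy PackInputs(const std::vector<const T*>& xs, PtrPack<T>* pack) {
  if (xs.size() <= kMaxByParam) {
    pack->n = xs.size();
    for (size_t i = 0; i < xs.size(); ++i) pack->ptrs[i] = xs[i];
    return Strategy::kByParam;  // no host-to-device memcpy needed
  }
  return Strategy::kByMemcpy;  // fallback path for >64 inputs
}
```

The by-value path avoids a host-to-device transfer and its launch-latency cost entirely; the fallback keeps the op correct for arbitrarily many inputs.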

H-series GPU performance: 91% of peak bandwidth without transpose, 87% with transpose. For the same size, the performance difference across different input counts is within 1%, showing that the pointer-passing mechanism is efficient.
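Bandwidth-utilization figures like these are conventionally computed as bytes moved over elapsed time, relative to the device's peak memory bandwidth. A small sketch (the helper name and any peak value plugged in are illustrative, not taken from this PR's measurements):

```cpp
#include <cassert>

// Fraction of peak memory bandwidth achieved by a kernel.
// bytes:     total bytes read plus written by the kernel
// seconds:   measured kernel execution time
// peak_gbps: device peak memory bandwidth in GB/s (from the datasheet)
double BandwidthUtilization(double bytes, double seconds, double peak_gbps) {
  double achieved_gbps = bytes / seconds / 1e9;  // GB/s actually moved
  return achieved_gbps / peak_gbps;              // e.g. 0.91 for "91%"
}
```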

Pcard-85711

paddle-bot bot commented Jun 25, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@lshpku lshpku force-pushed the improve-stack-quant branch from f219c66 to 8c3de66 on June 26, 2025 05:31
@lshpku lshpku force-pushed the improve-stack-quant branch from 8c3de66 to 3d2961a on June 26, 2025 07:18
@phlrain phlrain self-requested a review June 27, 2025 14:17
@phlrain phlrain merged commit b483197 into PaddlePaddle:develop Jun 27, 2025
49 checks passed
