@lshpku lshpku commented Jun 25, 2025

PR Category

Operator Mechanism

PR Types

Performance

Description

Optimize the fused_stack_transpose_quant op in the following four ways:

  1. Remove the memcpy that sent the pointers of x to the device, and pass them as kernel function arguments instead, consistent with the stack op in the PHI operator library; nsys profiling confirms that no memcpy occurs this way. The underlying stack component supports any number of inputs: with ≤64 inputs the pointers are passed as arguments, and with >64 it falls back to memcpy. Stress testing found no correctness issues.
  2. Merge fused_stack_quant and fused_stack_transpose_quant into a single file. They were originally one file sharing the same API (distinguished by a transpose parameter), but were split into two when migrated from the NLP repository; merging them back makes future maintenance easier.
  3. Tune the block and loop sizes to improve performance on H-series GPUs. The first implementation was done on an A100, and with little H-series experience at the time I overlooked that H-series GPUs have a tighter register budget than A-series; after tuning, there is no longer register spilling on H GPUs and performance improved accordingly.
  4. Fix a bug where the kernel forgot to return early for 0-size inputs.
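The pointer-passing mechanism in point 1 can be sketched on the host side as follows. This is a minimal illustrative sketch, not Paddle's actual implementation: all names (`PtrPack`, `PackInputs`, `kMaxByParam`) are hypothetical, and the real code passes the packed struct by value as a CUDA kernel argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity pointer pack, small enough to be passed by value as a
// kernel argument (hypothetical struct; CUDA caps total kernel-argument size,
// which is why the capacity is bounded).
constexpr size_t kMaxByParam = 64;

template <typename T>
struct PtrPack {
  const T* ptrs[kMaxByParam];
  size_t n;
};

// kByParam: pointers travel inside the kernel-argument struct (no memcpy).
// kByMemcpy: too many inputs; the pointer table must be copied to device.
enum class Strategy { kByParam, kByMemcpy };

template <typename T>
Strategy PackInputs(const std::vector<const T*>& xs, PtrPack<T>* pack) {
  if (xs.size() <= kMaxByParam) {
    pack->n = xs.size();
    for (size_t i = 0; i < xs.size(); ++i) pack->ptrs[i] = xs[i];
    return Strategy::kByParam;  // no host-to-device memcpy needed
  }
  return Strategy::kByMemcpy;  // fallback path for >64 inputs
}
```

The by-value path avoids a host-to-device transfer and its launch-latency cost entirely; the fallback keeps the op correct for arbitrarily many inputs.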

H-series GPU performance: 91% of peak bandwidth without transpose, 87% with transpose. For the same size, the performance difference across different input counts is within 1%, showing that the pointer-passing mechanism is efficient.
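Bandwidth-utilization figures like these are conventionally computed as bytes moved over elapsed time, relative to the device's peak memory bandwidth. A small sketch (the helper name and any peak value plugged in are illustrative, not taken from this PR's measurements):

```cpp
#include <cassert>

// Fraction of peak memory bandwidth achieved by a kernel.
// bytes:     total bytes read plus written by the kernel
// seconds:   measured kernel execution time
// peak_gbps: device peak memory bandwidth in GB/s (from the datasheet)
double BandwidthUtilization(double bytes, double seconds, double peak_gbps) {
  double achieved_gbps = bytes / seconds / 1e9;  // GB/s actually moved
  return achieved_gbps / peak_gbps;              // e.g. 0.91 for "91%"
}
```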

Pcard-85711

paddle-bot bot commented Jun 25, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@lshpku lshpku force-pushed the improve-stack-quant branch from f219c66 to 8c3de66 on June 26, 2025 05:31
@lshpku lshpku force-pushed the improve-stack-quant branch from 8c3de66 to 3d2961a on June 26, 2025 07:18
@phlrain phlrain self-requested a review June 27, 2025 14:17
@phlrain phlrain merged commit b483197 into PaddlePaddle:develop Jun 27, 2025
49 checks passed
