@zhanghonggeng
Contributor

@zhanghonggeng zhanghonggeng commented Jul 10, 2025

PR Category

Performance Optimization

PR Types

Improvements

Description

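Background: taking input shape (108, 64, 12288), axis 0, and an index of length input_shape[axis] as an example, the gather backward pass is 60% slower than torch's, so this PR optimizes gather_grad.
It implements a GPUScatterAdd kernel to replace GPUScatterAssign. The GPUScatterAdd kernel supports strides: the stride computation turns the in-kernel index calculation into a base address plus an offset, which simplifies the complex index arithmetic inside the kernel and gives a 60% performance improvement on the case above.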
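
The base-address-plus-offset idea can be sketched on the CPU as follows. This is only an illustration in plain Python, not the actual GPUScatterAdd kernel: the function name `gather_grad_scatter_add` and the `outer` / `axis_size` / `inner` decomposition of the shape are assumptions made for the example. Accumulating with `+=` (rather than assigning) is what lets repeated indices each contribute to the gradient.

```python
def gather_grad_scatter_add(grad_out, index, outer, axis_size, inner):
    """CPU sketch of gather backward as a scatter-add (hypothetical helper).

    grad_out is a flat list laid out as (outer, len(index), inner).
    The result is a flat list laid out as (outer, axis_size, inner).
    """
    k = len(index)
    grad_x = [0.0] * (outer * axis_size * inner)
    for o in range(outer):
        for i, idx in enumerate(index):
            # Base addresses of the source and destination slices.
            src_base = (o * k + i) * inner
            dst_base = (o * axis_size + idx) * inner
            # Within a slice only a linear offset is needed.
            for offset in range(inner):
                grad_x[dst_base + offset] += grad_out[src_base + offset]
    return grad_x


# Three gradient rows scattered into four rows; index 1 appears twice.
grad_x = gather_grad_scatter_add(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=[1, 3, 1],
    outer=1, axis_size=4, inner=2)
```

In this sketch `len(index)` does not have to equal `axis_size`; rows that are never indexed simply keep a zero gradient.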

The corresponding slice case uses input Tensor([108,64,12288],"float32") and index Tensor([2,4,6],"int64").

  1. In getitem, when index_size is 1, the gather+reshape kernel is chosen as a fast path. fp32 forward GPU score: 0.97 -> 0.68; backward GPU score: 2.73 -> 1.21.
  2. In gather backward, the GPUScatterAdd kernel supports the case where index.numel() != x.dims()[axis_v].
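
As a rough illustration of the fast path in item 1, indexing with a single index tensor along axis 0 can be expressed as a gather of whole rows followed by a reshape. The sketch below is plain Python on a flat buffer, with a scaled-down shape standing in for [108, 64, 12288]; `gather_axis0` is a hypothetical helper, and reading "index_size is 1" as "one index tensor" is an assumption, not something the PR states.

```python
def gather_axis0(x_flat, shape, index):
    """Gather rows along axis 0 from a flat, row-major buffer (sketch).

    Returns the gathered flat data and the shape to reshape it to.
    """
    row_size = 1
    for dim in shape[1:]:
        row_size *= dim
    out = []
    for idx in index:
        base = idx * row_size  # first element of the selected row
        out.extend(x_flat[base:base + row_size])
    out_shape = (len(index),) + tuple(shape[1:])
    return out, out_shape


# Scaled-down stand-in for the slice case: 8 rows of 2 elements each.
x_flat = list(range(16))
out, out_shape = gather_axis0(x_flat, (8, 2), [2, 4, 6])
```

Here the index has 3 entries while the gathered axis has 8, which is the index.numel() != x.dims()[axis_v] situation that the backward kernel in item 2 has to handle.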

pcard-67164

@paddle-bot

paddle-bot bot commented Jul 10, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@zhanghonggeng zhanghonggeng changed the title from "[slice]list_tensor_gather test" to "[slice]support different shape case for GPUScatterAdd op" on Jul 14, 2025
@zhanghonggeng
Contributor Author

/re-run all-failed

1 similar comment

@xiaoguoguo626807 xiaoguoguo626807 merged commit 77166d2 into PaddlePaddle:develop Jul 15, 2025
83 of 86 checks passed
3 participants