@cangtianhuang
Contributor

@cangtianhuang cangtianhuang commented Aug 6, 2025

PR Category

Operator Mechanism

PR Types

Performance

Description

Fixes the performance regression that the precision fix in #74081 introduced for some models: https://console.cloud.baidu-int.com/devops/icafe/issue/DLTP-92332/show

The fix consists of:

  1. Revert to the ThrustCumsumKernel fast path
  2. Add fp16 and bf16 type support to ThrustCumsumKernel

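In the earlier tests, the numerical precision of the Thrust library was misjudged. In the new tests, on the edge case of a very large 1D tensor (that is, a single giant row), Thrust performs flawlessly, whereas BlockScanKernel runs with grid_size == 1, degenerates into serial execution, and becomes significantly slower.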
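To make the grid_size == 1 case concrete, here is a minimal toy model in Python. It is not Paddle's kernel code: the function name, the one-block-per-row mapping, and the tile size of 1024 are assumptions chosen only to illustrate why a single giant row leaves no block-level parallelism.

```python
import math


def sequential_steps_per_block(num_rows: int, row_len: int, tile: int = 1024):
    """Toy model of a row-per-block scan launch (hypothetical, not Paddle code).

    Assumption: one thread block is assigned to each row, and a block walks
    its row tile by tile, carrying a running prefix from one tile to the next.
    Returns (grid_size, number of tiles each block processes one after another).
    """
    grid_size = num_rows  # one block per row
    tiles_per_block = math.ceil(row_len / tile)
    return grid_size, tiles_per_block


# Same order of element count, two shapes.
n = 200_000_000
print(sequential_steps_per_block(1, n))              # single giant row
print(sequential_steps_per_block(n // 1024, 1024))   # many short rows
```

Under these assumptions, the single-row shape yields one block that must step through every tile in order, while the multi-row shape yields many blocks that each handle one tile.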

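The following compares the precision (relative to torch) and the speed of the paddle.cumsum API through the BlockScanKernel branch and the ThrustCumsumKernel branch, for element counts from 200 thousand to 2 billion: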

[Two attached images with the comparison results]

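The results show that, for 1D tensors, the Thrust library is significantly better than the current BlockScanKernel implementation in both precision and speed. The current BlockScanKernel implementation is designed mainly for multi-row data, with each block processing a different row in parallel.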
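A comparison of this kind (error against a reference, plus wall-clock time) can be sketched in pure Python as follows. This is not the benchmark used for the figures above: `cumsum_f32` is a hypothetical stand-in for the implementation under test, and the float64 `cumsum_ref` stands in for the torch result.

```python
import itertools
import struct
import time


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cumsum_f32(values):
    """Sequential cumulative sum, rounding the accumulator to float32 each step."""
    out, acc = [], 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
        out.append(acc)
    return out


def cumsum_ref(values):
    """Float64 reference; stands in for the torch result in this sketch."""
    return list(itertools.accumulate(values))


def max_abs_error(result, reference):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(a - b) for a, b in zip(result, reference))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


data = [0.1] * 100_000
ref = cumsum_ref(data)
out, seconds = timed(cumsum_f32, data)
print(f"max abs error: {max_abs_error(out, ref):.6g}, time: {seconds:.4f}s")
```

For a real measurement, the two stand-ins would be replaced by the framework calls being compared, and GPU timings would need device synchronization before reading the clock.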

Pcard-85711

@paddle-bot

paddle-bot bot commented Aug 6, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@cangtianhuang cangtianhuang changed the title [PHI] Fix BlockPrefixCallbackOp [PHI] Fix paddle.cumsum calculation speed Aug 10, 2025
@lshpku lshpku merged commit 9db2cad into PaddlePaddle:develop Aug 12, 2025
68 of 69 checks passed
maxiaolong001 pushed a commit to maxiaolong001/Paddle that referenced this pull request Aug 12, 2025
* fix ThrustCumsumKernel
* refine
* refine ThrustCumsumKernel
* fix
* update ThrustCumsumKernel
* fix logcumsumexp in ThrustCumsumKernel
@cangtianhuang cangtianhuang deleted the fix-cumsum branch September 4, 2025 08:12