[PHI] Fix paddle.cumsum calculation speed #74442
Merged
PR Category
Operator Mechanism
PR Types
Performance
Description
Fixes the performance regression that the precision fix in #74081 introduced for some models: https://console.cloud.baidu-int.com/devops/icafe/issue/DLTP-92332/show

The fix:

- Dispatch to the ThrustCumsumKernel fast path for the 1D case.
- Add fp16 and bf16 type support to ThrustCumsumKernel.

The earlier tests misjudged the numerical precision of the Thrust library. In the new tests, for the edge case of a very large 1D tensor (i.e. a single giant row), the Thrust library performs flawlessly, whereas BlockScanKernel degenerates to serial execution because grid_size == 1, so its speed drops sharply.

Below is a comparison of the precision (against torch) and speed of the paddle.cumsum API when it goes through the BlockScanKernel branch versus the ThrustCumsumKernel branch, for element counts ranging from 200 thousand to 2 billion:

The results show that for 1D tensors the Thrust library clearly outperforms the current BlockScanKernel implementation in both precision and speed. The current BlockScanKernel implementation is designed mainly for multi-row data, where each Block processes a different data row in parallel.
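As a rough illustration of the dispatch idea described above, here is a minimal sketch. It is not Paddle's actual PHI source: the function name CumsumOnGpuSketch and the commented-out BlockScanRowKernel launch are placeholders, and only a plain element type such as float is assumed. The single-row (1D) case goes through a device-wide thrust::inclusive_scan, which stays parallel no matter how long the row is, while the multi-row case would launch one block per row in the style of BlockScanKernel.

```cpp
// Hypothetical sketch of the dispatch; names are placeholders, not Paddle identifiers.
#include <cuda_runtime.h>

#include <thrust/device_ptr.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>

template <typename T>
void CumsumOnGpuSketch(const T* in, T* out, int64_t rows, int64_t row_width,
                       cudaStream_t stream) {
  if (rows == 1) {
    // Single giant row (the 1D case): a device-wide Thrust scan keeps the
    // whole GPU busy regardless of row length.
    thrust::inclusive_scan(thrust::cuda::par.on(stream),
                           thrust::device_pointer_cast(in),
                           thrust::device_pointer_cast(in) + row_width,
                           thrust::device_pointer_cast(out));
  } else {
    // Many rows: one block per row, so rows are scanned in parallel with
    // each other (the regime a BlockScanKernel-style kernel targets).
    // With rows == 1 this launch would use grid_size == 1 and serialize,
    // which is exactly what the fast path above avoids.
    // BlockScanRowKernel<<<rows, 256, 0, stream>>>(in, out, row_width);
  }
}
```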
Pcard-85711