[PHI][BIT] Fix logcumsumexp for big tensor #73380
Merged
PR Category
Operator Mechanism
PR Types
Bug fixes
Description
logcumsumexp shares ScanKernel with cumsum. After PR #72562 was merged, cumsum passes the big-tensor tests, but some logcumsumexp cases still fail with "Erroneous arithmetic operation".
Instrumentation shows the failure is caused by `int` types: after converting the two `int` variables to `int64_t`, the error no longer appears (a sketch of the overflow is below). However, the float32 configuration then throws paddle::memory::allocation::BadAlloc; after switching all data types to float16, the paddle-only tests pass:

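A minimal sketch (hypothetical shape values, not the actual kernel code) of why 32-bit index arithmetic breaks for big tensors: once the element count exceeds 2^31 - 1, the offset product computed in `int` wraps and drives an out-of-bounds access, while `int64_t` keeps it in range.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical shape with more than 2^31 elements.
  int outer = 70000, mid = 70000, inner = 1;
  // Signed overflow is technically undefined behavior; in practice the
  // product wraps to a negative value and the resulting offset is invalid.
  int bad = outer * mid * inner;
  // Promoting to int64_t before multiplying keeps the offset correct.
  int64_t good = static_cast<int64_t>(outer) * mid * inner;
  std::printf("int: %d  int64_t: %lld\n", bad, static_cast<long long>(good));
  return 0;
}
```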
Regarding the paddle::memory::allocation::BadAlloc error: investigation shows that ScanKernel allocates a tmp_data buffer, the same size as the output, for intermediate steps such as transposes and row reversals. This means the kernel consumes roughly twice the GPU memory during execution, whereas torch reads the target data by direct indexing, which may be more general in some cases (see the sketch after this paragraph).
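A minimal sketch (hypothetical helpers, not Paddle's or torch's actual code) contrasting the two approaches on a 1-D reversal: materializing a reversed copy costs an extra output-sized buffer, while computing the mirrored index on the fly reads the same data with no additional allocation.

```cpp
#include <cstdint>
#include <vector>

// Buffer-based: materialize the reversed copy first (extra O(n) memory,
// analogous to the tmp_data buffer used for transpose/reverse).
std::vector<float> ReverseWithBuffer(const std::vector<float>& x) {
  return std::vector<float>(x.rbegin(), x.rend());
}

// Index-based: read element i of the reversed view directly, no buffer.
inline float ReversedAt(const float* x, int64_t n, int64_t i) {
  return x[n - 1 - i];
}
```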
LogAddExp was rewritten in terms of log1p to make it more numerically stable, but big tensors still show severe precision issues that require further fixes.
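A minimal sketch of a log1p-based LogAddExp functor (illustrative only, not the exact kernel code): using logaddexp(a, b) = max(a, b) + log1p(exp(-|a - b|)) avoids overflow in exp() and loses less precision than log(exp(a) + exp(b)) when a and b are far apart.

```cpp
#include <algorithm>
#include <cmath>

template <typename T>
struct LogAddExp {
  T operator()(const T& a, const T& b) const {
    T mx = std::max(a, b);
    T mn = std::min(a, b);
    // exp(mn - mx) <= 1, so exp() cannot overflow; log1p is accurate
    // near zero, which is where the correction term lives.
    return mx + std::log1p(std::exp(mn - mx));
  }
};
```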