
Conversation

Contributor
@xingmingyyj commented Jul 16, 2025

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

  • Fix an out-of-bounds memory access caused by int overflow (see the indexing sketch after the error log below)
  • With large batches there is still an accuracy diff; no obvious bug has been located:
[accuracy error] backward paddle.incubate.nn.functional.fused_bias_dropout_residual_layer_norm(x=Tensor([270000000, 2, 4],"float32"), residual=Tensor([270000000, 2, 4],"float32"), bias=None, ln_scale=Tensor([4],"float32"), ln_bias=None, dropout_rate=0.0, ln_epsilon=1e-05, training=True, mode="upscale_in_train", name=None, ) Not equal to tolerance rtol=0.01, atol=0.01 Tensor-likes are not close! Mismatched elements: 3 / 2160000000 (0.0%) Greatest absolute difference: 0.03557777404785156 at index (122354108, 0, 3) (up to 0.01 allowed) Greatest relative difference: 0.1643351912498474 at index (64014755, 1, 3) (up to 0.01 allowed) ACTUAL: (shape=torch.Size([270000000, 2, 4]), dtype=torch.float32) 
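A minimal sketch of the overflow pattern behind the first fix (hypothetical host-side code, not the actual kernel): for x = Tensor([270000000, 2, 4]) the element count is 270000000 × 2 × 4 = 2,160,000,000, which exceeds INT32_MAX (2,147,483,647), so any element index held in a 32-bit int wraps to a negative value and indexing goes out of bounds; computing indices in 64 bits avoids this.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical illustration of the bug class this PR fixes; not Paddle's kernel code.
int main() {
  const int64_t rows = 270000000;  // dim 0 of the failing case
  const int64_t cols = 2 * 4;      // dims 1 and 2 flattened

  const int64_t num_elements = rows * cols;  // 2,160,000,000 > INT32_MAX
  // If a kernel computes or stores such an index in 32 bits, it wraps
  // (modulo 2^32 on typical targets) to a negative value:
  const int32_t wrapped = static_cast<int32_t>(num_elements);

  std::printf("64-bit element count: %lld\n", (long long)num_elements);
  std::printf("same value as int32:  %d  (negative => out-of-bounds access)\n", wrapped);
  return 0;
}
```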

Some testing revealed large implementation differences between Paddle's and Torch's layer norm.
Paddle implementation: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/kernels/funcs/layer_norm_impl.cu.h#L445
Torch implementation: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/layer_norm_kernel.cu#L196
Paddle computes the variance as $$D(x) = E(x^2) - (E(x))^2$$ whereas Torch uses Welford's algorithm, which is more numerically stable (see the sketch after this explanation). In addition, Paddle and Torch differ considerably in how they compute dx:
Paddle implementation:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/kernels/funcs/layer_norm_impl.cu.h#L1735
Torch implementation:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/layer_norm_kernel.cu#L348
With a small feature_size, these differences cause the float32 computation to produce the accuracy diff shown above.
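To illustrate the variance difference, here is a minimal host-side C++ sketch (plain CPU code, not the actual Paddle or Torch CUDA kernels) contrasting the $$D(x) = E(x^2) - (E(x))^2$$ formula with Welford's algorithm in float32:

```cpp
#include <cstdio>
#include <vector>

// One-pass E(x^2) - E(x)^2 variance: subtracting two large, nearly equal
// float32 quantities cancels catastrophically when the mean is large
// relative to the spread.
float naive_variance(const std::vector<float>& x) {
  float sum = 0.f, sum_sq = 0.f;
  for (float v : x) {
    sum += v;
    sum_sq += v * v;
  }
  const float mean = sum / x.size();
  return sum_sq / x.size() - mean * mean;
}

// Welford's online algorithm: updates the mean and the sum of squared
// deviations incrementally, never forming the large E(x^2) intermediate.
float welford_variance(const std::vector<float>& x) {
  float mean = 0.f, m2 = 0.f;
  int n = 0;
  for (float v : x) {
    ++n;
    const float delta = v - mean;
    mean += delta / n;
    m2 += delta * (v - mean);  // uses the updated mean
  }
  return m2 / n;
}

int main() {
  // Large mean, tiny spread: the worst case for the one-pass formula.
  const std::vector<float> x = {10000.0f, 10000.1f, 10000.2f, 10000.3f};
  std::printf("naive:   %f\n", naive_variance(x));    // badly off in float32
  std::printf("welford: %f\n", welford_variance(x));  // close to 0.0125
  return 0;
}
```

For reference on the dx side, both backward kernels target the same mathematical quantity: with the normalized input x̂_i = (x_i - μ)/σ and g_i = γ_i · dL/dy_i, the standard LayerNorm input gradient over a feature dimension of size N is $$dx_i = \frac{1}{\sigma}\Big(g_i - \frac{1}{N}\sum_j g_j - \frac{\hat{x}_i}{N}\sum_j g_j \hat{x}_j\Big)$$ so a plausible source of the remaining diff is simply how the two row-wise sums are accumulated in float32, consistent with the error appearing only at small feature_size.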

[accuracy error] backward paddle.incubate.nn.functional.fused_bias_dropout_residual_layer_norm(x=Tensor([200000000, 1, 4],"float32"), residual=Tensor([200000000, 1, 4],"float32"), bias=None, ln_scale=Tensor([4],"float32"), ln_bias=None, dropout_rate=0.0, ln_epsilon=1e-05, training=True, mode="upscale_in_train", name=None, ) Not equal to tolerance rtol=0.01, atol=0.01 Tensor-likes are not close! Mismatched elements: 1 / 800000000 (0.0%) Greatest absolute difference: 0.02027149498462677 at index (7029107, 0, 3) (up to 0.01 allowed) Greatest relative difference: 0.20506636798381805 at index (7029107, 0, 3) (up to 0.01 allowed) ACTUAL: (shape=torch.Size([200000000, 1, 4]), dtype=torch.float32) tensor([[[ 0.4957, -0.1273, -0.2506, -0.1178]], 

Under float64 the result is correct:

[Pass] paddle.incubate.nn.functional.fused_bias_dropout_residual_layer_norm(x=Tensor([200000000, 1, 4],"float64"), residual=Tensor([200000000, 1, 4],"float64"), bias=None, ln_scale=Tensor([4],"float64"), ln_bias=None, dropout_rate=0.0, ln_epsilon=1e-05, training=True, mode="upscale_in_train", name=None, ) 

Pcard-73263

@paddle-bot bot commented Jul 16, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-ci-bot bot commented Jul 24, 2025

Sorry to inform you that d9090b0's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

Contributor Author
@xingmingyyj commented

/re-run all-failed

Contributor
@wanghuancoder left a comment

LGTM

@lshpku merged commit 97f18d5 into PaddlePaddle:develop Jul 29, 2025
86 of 88 checks passed
@xingmingyyj deleted the fused_bias_dropout_residual_layer_norm branch July 30, 2025 02:26