Contributor

@HeyDavid633 HeyDavid633 commented Aug 1, 2025

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

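Main fixes:

  1. Raise an error when label.numel() is a big Tensor
  2. Raise an error when label.numel() causes the MemoryBuffer allocation size to overflow
    The original case that triggered the error is in [Big Tensor] Fix big tensor problem for paddle.nn.functional.class_center_sample PFCCLab/PaddleAPITest#488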

class_center_sample is implemented by calling a third-party library.

  • It is documented as paddle.nn.functional.class_center_sample; the API takes the input parameters (label, num_classes, num_samples, group=None)
  • label is a 1-D Tensor of dtype int32 or int64, and each element lies in [0, num_classes)
  • The input-data constraints in PaddleAPITest/tester/api_config/config_analyzer.py already restrict the parameters semantically, i.e. 0 <= label[i] < num_classes

Raising an error when label.numel() is a big Tensor

  1. In the paddleonly test, the errors seen at typical values are not consistent. The details are as follows:

| label.numel() | 4294967295 | 2294967295 | 2094967295 | 294967295 |
| --- | --- | --- | --- | --- |
| Data note | Just below the uint32 maximum | Above the int32 maximum | Below the int32 maximum | Far below the int32 maximum |
| torch (Accuracy = True) | [torch error] num_inp 4294967295 is too big to for CUB | [torch error] num_inp 2294967295 is too big to for CUB | [paddle error] (PreconditionNotMet) The meta data must be valid when call the mutable data function. The code could not be pinpointed; the error message points to /Paddle/paddle/phi/core/dense_tensor.cc | [Pass] Execution takes about 20 mins on an A100-SXM4-80GB |
| paddleonly | [paddle error] (External) CUDA error(1), invalid argument. Can be traced to /Paddle/paddle/phi/kernels/gpu/class_center_sample_kernel.cu | [paddle error] (PreconditionNotMet) The meta data must be valid when call the mutable data function. The code could not be pinpointed; the error message points to /Paddle/paddle/phi/core/dense_tensor.cc | [paddle error] (PreconditionNotMet) The meta data must be valid when call the mutable data function. The code could not be pinpointed; the error message points to /Paddle/paddle/phi/core/dense_tensor.cc | |

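Further inspection, starting from Paddle/paddle/phi/kernels/gpu/class_center_sample_kernel.cu, traced the actual implementation to Paddle/third_party/cub/cub/cub.cuh. This third-party library does not support the case where label.numel() is a big Tensor.
Therefore class_center_sample_kernel.cu now uses PADDLE_ENFORCE_LE to raise an error for big Tensors.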
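
The kind of guard described above can be sketched in standalone C++. This is only an illustration, not the merged Paddle code: `CheckNumelFitsCub` is a hypothetical helper, `std::invalid_argument` stands in for `PADDLE_ENFORCE_LE`, and the limit of INT32_MAX is an assumption based on the "too big for CUB" errors reported above the int32 range.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the PADDLE_ENFORCE_LE check described in the PR.
// Assumption: the CUB calls index with a 32-bit int, so the element count
// must not exceed INT32_MAX. The limit used in the actual fix may differ.
inline void CheckNumelFitsCub(int64_t numel) {
  constexpr int64_t kMaxNumel = std::numeric_limits<int32_t>::max();
  if (numel > kMaxNumel) {
    throw std::invalid_argument(
        "label.numel() = " + std::to_string(numel) +
        " exceeds the assumed CUB limit of " + std::to_string(kMaxNumel));
  }
}
```

Calling `CheckNumelFitsCub(4294967295LL)` throws under this assumption, while `CheckNumelFitsCub(294967295)` returns normally, so the failure surfaces as a clear message before any CUB call is made.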

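Because the paddleonly test still fails when label.numel() is a small Tensor, I used bisection to find the boundary value of label.numel() at which
the (PreconditionNotMet) error starts to appear. Checked across multiple cases, this value depends only on label.numel() and not on the other two inputs. The details are as follows: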

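| label.numel() | 356493280 | 356493279 | 356493278 |
| --- | --- | --- | --- |
| Data note | Fails with (PreconditionNotMet) The meta data must be valid when call the mutable data function. | Error threshold; every value above it fails | Passes correctly; every value below it passes |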
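
The bisection search mentioned above can be sketched as follows. This is a generic illustration rather than the script used for the PR: `FindFirstFailing` and the `fails` predicate are hypothetical names, and the sketch assumes the failure is monotone in label.numel() (once a size fails, every larger size fails too).

```cpp
#include <cstdint>
#include <functional>

// Returns the smallest n in (lo, hi] for which fails(n) is true.
// Assumed preconditions: fails(lo) is false, fails(hi) is true, and
// fails is monotone (false for small n, true from some point onward).
inline int64_t FindFirstFailing(int64_t lo, int64_t hi,
                                const std::function<bool(int64_t)>& fails) {
  // Loop invariant: fails(lo) == false and fails(hi) == true.
  while (hi - lo > 1) {
    const int64_t mid = lo + (hi - lo) / 2;  // avoids overflow of lo + hi
    if (fails(mid)) {
      hi = mid;  // first failing size is at or below mid
    } else {
      lo = mid;  // first failing size is above mid
    }
  }
  return hi;
}
```

In this setting `fails(n)` would run the operator with label.numel() equal to `n` and report whether the error appears, starting from one size known to pass and one known to fail. The search needs only about 31 runs to cover a range of two billion sizes.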

After instrumenting the code, the problem was located at

MemoryBuffer<T, Context> memory_buffer = MemoryBuffer<T, Context>(num_buffer_ele, num_temp_ele, nranks, dev_ctx);

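When MemoryBuffer is instantiated, it reallocates space via buffer.Resize({4 * num_buffer_ele + 3 * (nranks + 1) + num_temp_ele});. When label.numel() is large, this value overflows. PADDLE_ENFORCE_GT is therefore used before this point to check the size, so that it cannot be negative.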
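
The overflow in the Resize expression can be illustrated with a small standalone sketch. It assumes the operands are 32-bit `int` in the kernel (the PR does not state their types), and `BufferNumel` and `FitsInInt32` are hypothetical helpers, not Paddle functions. The size is evaluated in 64-bit arithmetic so that it can be compared against the int32 range without wrapping.

```cpp
#include <cstdint>
#include <limits>

// Evaluates the Resize expression from the PR in 64-bit arithmetic,
// so the true value is available even when it exceeds the int32 range.
inline int64_t BufferNumel(int64_t num_buffer_ele, int64_t nranks,
                           int64_t num_temp_ele) {
  return 4 * num_buffer_ele + 3 * (nranks + 1) + num_temp_ele;
}

// True when the size is positive and representable as a 32-bit int.
// Assumption: a size outside this range is what shows up as a negative
// (overflowed) value in the kernel.
inline bool FitsInInt32(int64_t n) {
  return n > 0 && n <= std::numeric_limits<int32_t>::max();
}
```

For example, `BufferNumel(1000, 1, 100)` is 4106 and fits, whereas with `num_buffer_ele = 600000000` the term `4 * num_buffer_ele` alone is 2400000000, which is above the int32 maximum of 2147483647, so `FitsInInt32` returns false.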


paddle-bot bot commented Aug 1, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

wanghuancoder previously approved these changes Aug 1, 2025
Contributor

@wanghuancoder wanghuancoder left a comment

LGTM

@HeyDavid633
Contributor Author

/re-run all-failed

@wanghuancoder wanghuancoder merged commit b4a021f into PaddlePaddle:develop Aug 4, 2025
72 of 75 checks passed

Labels

contributor External developers

3 participants