PR Category
Operator Mechanism
PR Types
Bug fixes
Description
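Main fix:
The original case that triggered the error can be found in [Big Tensor] Fix big tensor problem for paddle.nn.functional.class_center_sample PFCCLab/PaddleAPITest#488
class_center_sample is implemented by calling a third-party library and takes the parameters
(label, num_classes, num_samples, group=None). The input-data constraints in PaddleAPITest/tester/api_config/config_analyzer.py already restrict the parameters semantically, i.e. 0 <= label[i] < num_classes. The fix here is to raise an error when label.numel() is a big Tensor.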
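On further inspection, starting from
Paddle/paddle/phi/kernels/gpu/class_center_sample_kernel.cu, the concrete implementation was traced to Paddle/third_party/cub/cub/cub.cuh. This third-party library does not support the case where label.numel() is a big Tensor. Therefore
class_center_sample_kernel.cu now uses PADDLE_ENFORCE_LE to raise an error for big Tensors.
After instrumenting the code, the problem was located at the instantiation of MemoryBuffer, where
buffer.Resize({4 * num_buffer_ele + 3 * (nranks + 1) + num_temp_ele}); reallocates the space. When label.numel() is large, this value overflows. A PADDLE_ENFORCE_GT check on the size is therefore added before this call so that the size cannot be negative.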
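
The overflow in the Resize expression can be illustrated outside Paddle. The following is a minimal sketch, assuming the size operands are 32-bit int values (the description does not state their types). CheckedBufferSize is a hypothetical helper and not part of Paddle; the PR itself performs the check with PADDLE_ENFORCE_GT inside the kernel.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not a Paddle API): evaluates the Resize expression
// in 64-bit arithmetic and rejects results that a 32-bit int cannot hold.
// Assumption: the kernel stores these counts as int.
inline int CheckedBufferSize(int64_t num_buffer_ele, int64_t nranks,
                             int64_t num_temp_ele) {
  // Same formula as in buffer.Resize({...}), widened to avoid wraparound.
  const int64_t total = 4 * num_buffer_ele + 3 * (nranks + 1) + num_temp_ele;
  if (total <= 0 || total > std::numeric_limits<int>::max()) {
    throw std::overflow_error("buffer size does not fit in a 32-bit int");
  }
  return static_cast<int>(total);
}
```

With num_buffer_ele = 600000000, the term 4 * num_buffer_ele alone is 2400000000, which exceeds the 32-bit maximum of 2147483647. Computed in a 32-bit int, such a value typically shows up as a negative size, which is what a greater-than-zero check catches.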