
Conversation


@GITD245 GITD245 commented Jul 4, 2025

PR Category

Auto Parallel

PR Types

Bug fixes

Description

In inplace operations, the API takes its arguments by reference, which means the API's input parameter and output parameter point to the same address. Since SetGradOutMeta is called after the API call, the values it records at that point have already been modified by the API and are incorrect, rather than the values that were actually passed into the API.

Outside auto parallel, SetGradOutMeta only records the meta-related attributes of the input x; in auto parallel, SetGradOutMeta additionally records x's dist_attr and dims as already modified by the API. These incorrect values cause the node to derive a wrong dist_attr during automatic inference in the backward pass.

This PR fixes the issue by saving a copy of the correct dist_attr and dims before calling the API, and passing these saved values to SetGradOutMeta. As shown in the figure below, green marks correct operations, red marks incorrect operations, and dashed lines mark the operations introduced by this PR.

TODO: since this change should be applied to all inplace operations and therefore has a wide impact, it is currently enabled only for reshape_.

pcard-86802

[Figure: drawio diagram of the fix. Green: correct operations; red: incorrect operations; dashed: operations added by this PR.]

dygraph_functions.cc before the change:

```cpp
// Forward API Call
auto& api_result = paddle::experimental::reshape_(x, shape);
// Log memory information
paddle::memory::LogDeviceMemoryStats(egr::Controller::Instance().GetExpectedPlace(), "reshape_");
// Check NaN and Inf if needed
if (FLAGS_check_nan_inf) {
  egr::CheckTensorHasNanOrInf("reshape_", api_result);
}
// Get Outputs
auto& out = api_result;
// Get Output AutoGradMeta
egr::AutogradMeta* out_autograd_meta = egr::EagerUtils::autograd_meta(&out);
// Check Inplace if needed
egr::EagerUtils::CheckInplace(x, x_autograd_meta, require_any_grad);
// Bump Inplace Version
x.bump_inplace_version();
VLOG(3) << "Tensor(" << x.name() << ") uses Inplace Strategy.";
// Set grad_node after API call
if (require_any_grad) {
  egr::EagerUtils::PassStopGradient(false, out_autograd_meta);
  // SetGradOutMeta & SetEdges
  grad_node->SetGradOutMeta(x, 0);
  // SetOutRank & SetHistory & SetGradInMeta
  if (out_autograd_meta) {
    egr::EagerUtils::SetOutRankWithSlot(out_autograd_meta, 0);
  }
  if (out_autograd_meta) {
    egr::EagerUtils::SetHistory(out_autograd_meta, grad_node);
  }
  grad_node->SetGradInMeta(out, 0);
  // Set TensorWrappers for Forward Outputs if needed
}
```

dygraph_functions.cc after the change:

```cpp
// Forward API Call
phi::distributed::TensorDistAttr x_dist_attr;
phi::DDim x_dims;
if (x.is_dist_tensor() && x.impl()) {
  auto* x_dist_tensor = static_cast<phi::distributed::DistTensor*>(x.impl().get());
  x_dist_attr = x_dist_tensor->dist_attr();
  x_dims = x_dist_tensor->dims();
}
auto& api_result = paddle::experimental::reshape_(x, shape);
// Log memory information
paddle::memory::LogDeviceMemoryStats(egr::Controller::Instance().GetExpectedPlace(), "reshape_");
// Check NaN and Inf if needed
if (FLAGS_check_nan_inf) {
  egr::CheckTensorHasNanOrInf("reshape_", api_result);
}
// Get Outputs
auto& out = api_result;
// Get Output AutoGradMeta
egr::AutogradMeta* out_autograd_meta = egr::EagerUtils::autograd_meta(&out);
// Check Inplace if needed
egr::EagerUtils::CheckInplace(x, x_autograd_meta, require_any_grad);
// Bump Inplace Version
x.bump_inplace_version();
VLOG(3) << "Tensor(" << x.name() << ") uses Inplace Strategy.";
// Set grad_node after API call
if (require_any_grad) {
  egr::EagerUtils::PassStopGradient(false, out_autograd_meta);
  // SetGradOutMeta & SetEdges
  if (x.is_dist_tensor() && x.impl()) {
    grad_node->SetGradOutMeta(x, 0, x_dist_attr, x_dims);
  } else {
    grad_node->SetGradOutMeta(x, 0);
  }
  // SetOutRank & SetHistory & SetGradInMeta
  if (out_autograd_meta) {
    egr::EagerUtils::SetOutRankWithSlot(out_autograd_meta, 0);
  }
  if (out_autograd_meta) {
    egr::EagerUtils::SetHistory(out_autograd_meta, grad_node);
  }
  grad_node->SetGradInMeta(out, 0);
  // Set TensorWrappers for Forward Outputs if needed
}
```
@GITD245 GITD245 changed the title [Auto Paralle] FIx inplaced ops save wrong dist_attr for backward in auto parallel [Auto Parallel] FIx inplaced ops save wrong dist_attr for backward in auto parallel Jul 4, 2025

codecov-commenter commented Jul 4, 2025

Codecov Report

Attention: Patch coverage is 60.97561% with 16 lines in your changes missing coverage. Please review.

Please upload report for BASE (develop@57d1e08). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
paddle/fluid/eager/grad_node_info.cc | 60.97% | 16 Missing ⚠️

❌ Your patch status has failed because the patch coverage (60.97%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             develop   #73836   +/-  ##
==========================================
  Coverage           ?   60.97%
==========================================
  Files              ?        1
  Lines              ?       41
  Branches           ?        0
==========================================
  Hits               ?       25
  Misses             ?       16
  Partials           ?        0
```

☔ View full report in Codecov by Sentry.


paddle-bot bot commented Jul 7, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


@liym27 liym27 left a comment


Suggest adding a unit test to verify this logic, and passing the coverage check.


GITD245 commented Jul 8, 2025

> Suggest adding a unit test to verify this logic, and passing the coverage check.

Done


GITD245 commented Jul 9, 2025

/re-run all-failed


@liym27 liym27 left a comment


LGTM

@liym27 liym27 merged commit 4c0a9e9 into PaddlePaddle:develop Jul 9, 2025
74 of 78 checks passed
@GITD245 GITD245 deleted the inplaced branch July 9, 2025 09:29
