pp_save_load_amp #73749

zty-king · 2025-07-01T15:04:09Z

PR Category

Auto Parallel

PR Types

Bug fixes

Description

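Align save/load precision for pipeline parallelism (pp) in dynamic semi-auto parallel mode. Warm start currently has two problems that need to be fixed, described below: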

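Problem 1: warm start raises an illegal memory access error

The direct cause of the error is:

On warm start with float16 precision (bf16 behaves the same way), the loss and the gradients are scaled during the backward pass to prevent underflow. At parameter update time, i.e. the opt stage, the data must be unscaled back, which calls the unscale_method method in auto_parallel.api. In that method, any param whose grad is not None is added to the list to be processed. For gradients that do not belong to the current rank, however, the grad is defined but not initialized, so no memory has been allocated for it. Accessing those grads raises an illegal memory access error.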
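The failure mode described above can be illustrated with a minimal, framework-free sketch. It is not the code from this PR: `Grad`, `Param`, `is_initialized`, and `collect_grads_to_unscale` are hypothetical stand-ins, and `values=None` is only an assumed way to model a gradient that is defined but has no memory allocated. The point it shows is that a `grad is not None` check alone is not enough, and that the filter also needs to confirm the gradient has storage on this rank.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Grad:
    """Hypothetical stand-in for a gradient tensor."""
    # None models "defined but no memory allocated on this rank".
    values: Optional[List[float]]

    def is_initialized(self) -> bool:
        return self.values is not None


@dataclass
class Param:
    """Hypothetical stand-in for a model parameter."""
    name: str
    grad: Optional[Grad] = None


def collect_grads_to_unscale(params):
    """Keep only gradients that exist and actually have storage."""
    return [
        p.grad
        for p in params
        if p.grad is not None and p.grad.is_initialized()
    ]


def unscale(params, loss_scale):
    """Divide each locally stored gradient by the loss scale, in place."""
    inv_scale = 1.0 / loss_scale
    for grad in collect_grads_to_unscale(params):
        grad.values = [v * inv_scale for v in grad.values]


params = [
    Param("stage0.weight", Grad([1024.0, -2048.0])),  # owned by this rank
    Param("stage1.weight", Grad(None)),  # belongs to another pipeline stage
    Param("frozen.bias", None),  # no gradient at all
]
unscale(params, loss_scale=1024.0)
```

With only the `p.grad is not None` check, the second parameter would reach the loop and the sketch would fail when iterating over `None`, which plays the role of the illegal memory access in the real run.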

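Problem 2: parameters are not loaded correctly on warm start (no error is raised; training is simply initialized from the original model parameters, and the checkpoint is never actually loaded)

The direct cause is:

On save, state_dict was not used. The model parameters were saved directly, and the optimizer parameters were not saved at all. As a result, the saved checkpoint contains no optimizer information, and the key names do not match at load time: saving via state_dict produces keys with the model. and optimizer. prefixes, whereas saving the parameters directly produces keys without them.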
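The key mismatch can be sketched with plain dictionaries. This is an illustration under assumptions, not PaddleNLP's actual `_save` logic: `build_state_dict`, `matched_keys`, and the parameter names are hypothetical, and only the `model.` and `optimizer.` prefixes come from the description above.

```python
def build_state_dict(model_params, optimizer_state):
    """Merge both groups into one flat dict with distinguishing prefixes."""
    state = {f"model.{name}": value for name, value in model_params.items()}
    state.update(
        {f"optimizer.{name}": value for name, value in optimizer_state.items()}
    )
    return state


def matched_keys(checkpoint, expected):
    """Return the keys that the loader can actually find in the checkpoint."""
    return sorted(set(checkpoint) & set(expected))


# Hypothetical parameter and optimizer-state names, for illustration only.
model_params = {"linear.weight": [0.1, 0.2]}
optimizer_state = {"linear.weight.moment1": [0.0, 0.0]}

# Keys the loader looks for when it restores from a state_dict.
expected_keys = build_state_dict(model_params, optimizer_state)

# Saving the raw model parameters: no prefixes, no optimizer entries.
raw_checkpoint = dict(model_params)

# Saving through the prefixed state dict instead.
full_checkpoint = build_state_dict(model_params, optimizer_state)

print(matched_keys(raw_checkpoint, expected_keys))
print(matched_keys(full_checkpoint, expected_keys))
```

In this sketch the raw checkpoint matches none of the expected keys, so a loader that silently skips missing keys would leave the model at its initial values without reporting anything, which is consistent with the symptom described for Problem 2.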

Problem 1 is fixed by this PR. Problem 2 is fixed in PaddleNLP by the following PR: Fix the _save function so that it can save the optimizer parameters. PaddleNLP#10789

codecov-commenter · 2025-07-01T18:53:40Z

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (develop@57474c7). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/paddle/distributed/auto_parallel/api.py | 0.00% | 2 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

    @@            Coverage Diff             @@
    ##             develop   #73749   +/-   ##
    ==========================================
      Coverage           ?    0.00%
    ==========================================
      Files              ?        1
      Lines              ?        2
      Branches           ?        0
    ==========================================
      Hits               ?        0
      Misses             ?        2
      Partials           ?        0

paddle-bot · 2025-07-02T02:14:02Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

xuxinyi389 · 2025-07-03T12:28:41Z

LGTM. The unit tests for warm start are written in PaddleNLP.

zty-king · 2025-07-04T02:08:23Z

/re-run all-failed

pkuzyc

LGTM

pp_save_load_amp

467599f

paddle-bot bot added the contributor External developers label Jul 1, 2025

zty-king added 2 commits July 3, 2025 02:27

Modify the logic of nothing to do

b279891

Delete the logic of nothing to do

d3beb22

XieYunshen added the skip-ci: coverage label Jul 4, 2025

pkuzyc approved these changes Jul 4, 2025

pkuzyc merged commit 26ef5eb into PaddlePaddle:develop Jul 4, 2025
149 of 163 checks passed
