@zty-king
Contributor

@zty-king zty-king commented Jul 1, 2025

PR Category

Auto Parallel

PR Types

Bug fixes

Description

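Aligns save_load precision for pipeline parallelism (pp) in dynamic-graph semi-auto parallel mode. Warm start currently has two problems that need fixing, described below: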

  • Problem 1: warm start fails with an illegal memory access:

image

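The direct cause of the error is:

On warm start under float16 precision (bf16 behaves the same way), the loss and gradients are scaled during the backward pass to prevent underflow. At parameter-update time, i.e. in the opt stage, they must be unscaled back, which calls the unscale_method method in auto_parallel.api. In that method, whenever paramgrad is not None it is added to the processing list. Gradients that do not belong to the current rank are, at that point, defined but uninitialized, so no memory has been allocated for them, and accessing these grads raises an illegal memory access error.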
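The failure mode above can be illustrated with a minimal sketch in plain Python. It does not use Paddle and is not the code changed in this PR: `FakeGrad`, `FakeParam`, `is_initialized`, `collect_grads_to_unscale`, and `unscale` are hypothetical stand-ins that only model the idea of skipping gradients that are defined but have no allocated storage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeGrad:
    """Hypothetical stand-in for a gradient tensor (not a Paddle class)."""
    # None models "defined but uninitialized": no memory has been allocated.
    values: Optional[List[float]]

    def is_initialized(self) -> bool:
        return self.values is not None


@dataclass
class FakeParam:
    """Hypothetical stand-in for a parameter that may or may not have a grad."""
    name: str
    grad: Optional[FakeGrad]


def collect_grads_to_unscale(params: List[FakeParam]) -> List[FakeGrad]:
    """Keep only gradients that exist AND own allocated storage."""
    return [
        p.grad
        for p in params
        if p.grad is not None and p.grad.is_initialized()
    ]


def unscale(grads: List[FakeGrad], loss_scale: float) -> None:
    """Divide every collected gradient in place by the loss scale."""
    for g in grads:
        g.values = [v / loss_scale for v in g.values]


params = [
    FakeParam("local_w", FakeGrad([1024.0, 2048.0])),  # owned by this rank
    FakeParam("other_rank_w", FakeGrad(None)),         # defined, not allocated
    FakeParam("frozen_w", None),                       # no gradient at all
]

grads = collect_grads_to_unscale(params)
unscale(grads, loss_scale=1024.0)
```

Checking only `grad is not None` would also select `other_rank_w`, and `unscale` would then touch storage that was never allocated. The extra initialization check is what keeps that gradient out of the list.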

image

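  • Problem 2: parameters are not loaded correctly on warm start (no error is raised; training simply initializes from the original model parameters, and the checkpoint is never actually loaded)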

image

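The direct cause is:

At save time, state_dict was not used; the model parameters were saved directly and the optimizer parameters were not saved. As a result, the saved checkpoint contains no optimizer parameter information, and the key names do not match at load time: saving via state_dict yields keys with the model. and optimizer. prefixes, whereas saving the parameters directly does not.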
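The key mismatch can be sketched in plain Python as follows. This is an illustration only, not the PR's code: the helper names `build_checkpoint_state` and `missing_keys` and the example keys such as `linear.weight` and `linear.weight_moment1` are hypothetical. Only the `model.` and `optimizer.` prefixes come from the description above.

```python
from typing import Any, Dict, List


def build_checkpoint_state(
    model_state: Dict[str, Any], optimizer_state: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge model and optimizer state under prefixed keys."""
    merged: Dict[str, Any] = {}
    for key, value in model_state.items():
        merged[f"model.{key}"] = value
    for key, value in optimizer_state.items():
        merged[f"optimizer.{key}"] = value
    return merged


def missing_keys(expected: Dict[str, Any], saved: Dict[str, Any]) -> List[str]:
    """Return the keys the loader expects but the checkpoint lacks."""
    return sorted(set(expected) - set(saved))


# Hypothetical state for a single-layer model.
model_state = {"linear.weight": [0.1, 0.2]}
optimizer_state = {"linear.weight_moment1": [0.0, 0.0]}

# What the loader expects: prefixed keys for both model and optimizer.
expected = build_checkpoint_state(model_state, optimizer_state)

# Saving the bare model parameters: no prefixes, no optimizer entries.
saved_bare = dict(model_state)

# Saving through the combined, prefixed state.
saved_prefixed = build_checkpoint_state(model_state, optimizer_state)

missing_bare = missing_keys(expected, saved_bare)
missing_prefixed = missing_keys(expected, saved_prefixed)
```

With the bare save, every expected key is missing, so a loader that matches by name finds nothing to restore and the freshly initialized values stay in place. With the prefixed save, nothing is missing.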

image

image

@paddle-bot paddle-bot bot added the contributor External developers label Jul 1, 2025
@codecov-commenter

codecov-commenter commented Jul 1, 2025

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (develop@57474c7). Learn more about missing BASE report.

Files with missing lines Patch % Lines
python/paddle/distributed/auto_parallel/api.py 0.00% 2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #73749   +/-   ##
==========================================
  Coverage           ?    0.00%
==========================================
  Files              ?        1
  Lines              ?        2
  Branches           ?        0
==========================================
  Hits               ?        0
  Misses             ?        2
  Partials           ?        0

@paddle-bot

paddle-bot bot commented Jul 2, 2025

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the results of the automated CI tests first. See the Paddle CI Manual for details.

@xuxinyi389
Contributor

xuxinyi389 commented Jul 3, 2025

LGTM. The warm-start unit test is written in paddlenlp.

@zty-king
Contributor Author

zty-king commented Jul 4, 2025

/re-run all-failed

Contributor

@pkuzyc pkuzyc left a comment

LGTM

@pkuzyc pkuzyc merged commit 26ef5eb into PaddlePaddle:develop Jul 4, 2025
149 of 163 checks passed
5 participants