[XPU] support unified ckpt function #9312

cqulilujia · 2024-10-24T09:04:37Z

PR types

Function optimization

PR changes

Others

Description

Support unified ckpt function on XPU

paddle-bot · 2024-10-24T09:04:43Z

Thanks for your contribution!

codecov · 2024-10-24T09:38:03Z

Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Project coverage is 52.92%. Comparing base (7551730) to head (9cdabb6).
Report is 263 commits behind head on develop.

Files with missing lines	Patch %	Lines
paddlenlp/trainer/trainer.py	0.00%	5 Missing ⚠️
paddlenlp/trainer/plugins/unified_checkpoint.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9312      +/-   ##
===========================================
+ Coverage    52.80%   52.92%   +0.11%
===========================================
  Files          660      660
  Lines       106869   106875       +6
===========================================
+ Hits         56434    56564     +130
+ Misses       50435    50311     -124

DesmonDay · 2024-10-24T09:45:37Z

paddlenlp/trainer/plugins/unified_checkpoint.py

+ if paddle.is_compiled_with_xpu():
+     # XPU does not support all_reduce prod now, in XPU, bool is treated as int8,
+     # so temporarily use reduce_min instead
+     dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)


Then let's just change it to dist.all_reduce(local_resume, op=dist.ReduceOp.MIN) everywhere; there is no need to special-case XPU versus GPU.
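
The suggestion to use `ReduceOp.MIN` on every backend relies on MIN and PROD giving the same answer for flag-like values. The following is a minimal sketch in plain Python, not Paddle code, that assumes `local_resume` holds a single 0/1 flag per rank (an assumption, the thread does not show its definition). The helper names `all_reduce_prod`, `all_reduce_min`, and `reductions_agree` are hypothetical and only simulate the reductions over a list of per-rank values.

```python
from functools import reduce
from itertools import product
import operator


def all_reduce_prod(flags):
    """Simulate a PROD all-reduce over one 0/1 flag per rank."""
    return reduce(operator.mul, flags, 1)


def all_reduce_min(flags):
    """Simulate a MIN all-reduce over one 0/1 flag per rank."""
    return min(flags)


def reductions_agree(world_size):
    """Check every 0/1 assignment across `world_size` simulated ranks."""
    return all(
        all_reduce_prod(flags) == all_reduce_min(flags) == int(all(flags))
        for flags in product((0, 1), repeat=world_size)
    )


if __name__ == "__main__":
    # Exhaustively compares both reductions for 4 simulated ranks.
    print(reductions_agree(4))
```

Under that assumption both reductions act as a logical AND across ranks, so a single code path can serve XPU and GPU alike. The equivalence does not hold for values outside {0, 1}.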

DesmonDay

LGTM

zhiqiu

LGTM

ZHUI · 2024-10-25T06:55:50Z

paddlenlp/trainer/trainer.py

+ if not len(checkpoint_rng_state["cuda"]) == core.get_xpu_device_count():
+     raise ValueError("Length of xpu state list should be equal to the xpu device count")
+ for i in range(core.get_xpu_device_count()):
+     core.default_xpu_generator(i).set_state(checkpoint_rng_state["cuda"][i])


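Does XPU need to be handled differently here from the custom device handling below? This seems better suited to a change on the framework side.

On the framework side, XPU is a device type at the same level as GPU and CPU, whereas custom devices (such as Hygon, Ascend, etc.) are grouped together.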
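
The quoted `trainer.py` change follows a validate-then-restore pattern: check that the checkpoint holds one RNG state per device, then hand each state to that device's generator. The sketch below illustrates the pattern in plain Python without Paddle. `FakeGenerator` and `restore_device_rng_states` are hypothetical names introduced for illustration and are not part of Paddle's or PaddleNLP's API.

```python
class FakeGenerator:
    """Hypothetical stand-in for a per-device random generator."""

    def __init__(self):
        self.state = None

    def set_state(self, state):
        self.state = state


def restore_device_rng_states(saved_states, generators):
    """Restore one saved RNG state per device, rejecting a mismatched checkpoint."""
    if len(saved_states) != len(generators):
        raise ValueError(
            f"Expected {len(generators)} RNG states (one per device), "
            f"got {len(saved_states)}"
        )
    # Pair each device generator with the state saved at the same index.
    for generator, state in zip(generators, saved_states):
        generator.set_state(state)


if __name__ == "__main__":
    gens = [FakeGenerator() for _ in range(2)]
    restore_device_rng_states(["state-0", "state-1"], gens)
    print([g.state for g in gens])
```

Checking the length before touching any generator means a checkpoint saved with a different device count fails fast instead of leaving some devices restored and others not. Note that the quoted diff reads the XPU states from the `"cuda"` key of `checkpoint_rng_state`.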

paddle-bot bot added the XPU label Oct 24, 2024

DesmonDay reviewed Oct 24, 2024

cqulilujia force-pushed the unified branch from c0c6f34 to cfc24b1 Compare October 24, 2024 10:43

DesmonDay previously approved these changes Oct 24, 2024

cqulilujia dismissed DesmonDay’s stale review via 2ec9a6d October 24, 2024 10:48

cqulilujia force-pushed the unified branch from cfc24b1 to 2ec9a6d Compare October 24, 2024 10:48

[XPU], support unified ckpt function

9cdabb6

cqulilujia force-pushed the unified branch from 2ec9a6d to 9cdabb6 Compare October 24, 2024 10:49

zhiqiu approved these changes Oct 24, 2024

cqulilujia changed the title ~~[XPU], support unified ckpt function~~ [XPU] support unified ckpt function Oct 25, 2024

wawltor merged commit b237ba7 into PaddlePaddle:develop Oct 25, 2024
2 of 4 checks passed

ZHUI reviewed Oct 25, 2024

5 participants