@DesmonDay
Contributor

PR types

Function optimization

PR changes

Others

Description

Support sharding_comm_overlap.

@paddle-bot

paddle-bot bot commented Nov 7, 2024

Thanks for your contribution!

self.optimizer = fleet.distributed_optimizer(self.optimizer)

if self.args.enable_sharding_comm_overlap:
    model.register_sharding_comm_overlap_hook(self.optimizer)
Contributor Author

@ZHUI, could you check whether this switch should be turned on specifically for UC (unified checkpoint)?

Contributor

Yes, it should, so that the impact on other strategies is kept to a minimum.

Contributor Author

I added a condition: the hook is now registered only when split_param is enabled.
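
The gating agreed on in this thread can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's actual code: `Args`, `FakeModel`, `maybe_register_overlap_hook`, and the `split_param` attribute name are hypothetical stand-ins. Only `enable_sharding_comm_overlap` and `register_sharding_comm_overlap_hook` appear in the snippet under review.

```python
from dataclasses import dataclass


@dataclass
class Args:
    # Hypothetical stand-in for the trainer arguments.
    enable_sharding_comm_overlap: bool = False
    split_param: bool = False  # assumed attribute name for the split_param setting


class FakeModel:
    # Hypothetical stand-in that only records hook registrations.
    def __init__(self):
        self.hooked_optimizers = []

    def register_sharding_comm_overlap_hook(self, optimizer):
        self.hooked_optimizers.append(optimizer)


def maybe_register_overlap_hook(model, optimizer, args):
    """Register the overlap hook only when both switches are on."""
    if args.enable_sharding_comm_overlap and args.split_param:
        model.register_sharding_comm_overlap_hook(optimizer)
        return True
    return False
```

Requiring both flags means that runs which do not use split_param never take the new code path, which matches the reviewer's request to limit the effect on other strategies.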

@codecov

codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 19.23077% with 21 lines in your changes missing coverage. Please review.

Project coverage is 52.94%. Comparing base (2838e80) to head (dc5e369).
Report is 227 commits behind head on develop.

Files with missing lines | Patch % | Lines
paddlenlp/trainer/trainer.py | 0.00% | 5 Missing ⚠️
...r/unified_checkpoint/sharding_split_param_utils.py | 37.50% | 5 Missing ⚠️
paddlenlp/trainer/trainer_utils.py | 20.00% | 4 Missing ⚠️
paddlenlp/trainer/training_args.py | 0.00% | 2 Missing ⚠️
...nlp/trainer/unified_checkpoint/check_completion.py | 0.00% | 2 Missing ⚠️
paddlenlp/trainer/unified_checkpoint/utils.py | 33.33% | 2 Missing ⚠️
paddlenlp/trainer/unified_checkpoint/load_local.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9392      +/-   ##
===========================================
- Coverage    52.96%   52.94%   -0.02%
===========================================
  Files          676      676
  Lines       107827   107836       +9
===========================================
- Hits         57109    57099      -10
- Misses       50718    50737      +19


@lugimzzz force-pushed the develop branch 3 times, most recently from 3310514 to aecf9f1, on November 8, 2024 03:15
model = self.model_wrapped
opt_state_dict = self.unified_checkpoint_handler.load_unified_optimizer(
-    model=self.model,
+    model=model,
Contributor

Hmm, could the names in the state_dict end up wrapped in an extra prefix layer, for example model.model.embedding?

Contributor Author

That case has been handled.
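
The thread does not say how the prefix case was handled. Purely as an illustration of the concern raised above, one generic way to normalize names that gained an extra wrapper prefix is sketched below. `strip_prefix` is a hypothetical helper, and the prefix `"model."` is an assumption taken from the reviewer's example, not from the PR.

```python
def strip_prefix(state_dict, prefix):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state_dict.items()
    }


# Hypothetical names: a wrapped model might expose "model.model.embedding"
# where the unwrapped model exposes "model.embedding".
wrapped_names = {"model.model.embedding": 1, "lm_head": 2}
normalized = strip_prefix(wrapped_names, "model.")
```

Only one leading occurrence of the prefix is removed per key, so names that legitimately begin with `model.` in the unwrapped model would need a more careful rule than this sketch applies.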

    opt_state_dict = None
else:
    model = self.model
    if hasattr(self.args, "enable_sharding_comm_overlap") and self.args.enable_sharding_comm_overlap:
Contributor

Hmm, I think you could instead add a separate config, something like uc_with_pp_sharding_comm_overlap, and use it on its own for your internal case.

That way it would not be tied to enable_sharding_comm_overlap.

Contributor Author

That would make things more complicated, and it would be bad if the check logic conflicted with, or fell out of sync with, training_args.py.
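
For reference, the guard quoted at the top of this thread (`hasattr(self.args, ...) and self.args....`) can be written more compactly with `getattr` and a default. The sketch below assumes the body of that `if` switches to `self.model_wrapped`, which the earlier snippet suggests but the page does not show in full. `select_model` and the `SimpleNamespace` trainers are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace


def select_model(trainer):
    """Pick the wrapped model only when the overlap flag exists and is truthy."""
    if getattr(trainer.args, "enable_sharding_comm_overlap", False):
        return trainer.model_wrapped
    return trainer.model


# Hypothetical trainers: one whose args carry the flag, one whose args lack it.
with_flag = SimpleNamespace(
    args=SimpleNamespace(enable_sharding_comm_overlap=True),
    model="plain",
    model_wrapped="wrapped",
)
without_flag = SimpleNamespace(args=SimpleNamespace(), model="plain", model_wrapped="wrapped")
```

Keeping a single flag, as the author argues, means this check and the one in training_args.py read the same setting and cannot drift apart.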

@CLAassistant

CLAassistant commented Nov 12, 2024

CLA assistant check
All committers have signed the CLA.

Contributor

@ZHUI ZHUI left a comment

LGTM

@ZHUI ZHUI merged commit 921fc44 into PaddlePaddle:develop Nov 14, 2024
9 of 12 checks passed
ZHUI pushed a commit to DesmonDay/PaddleNLP that referenced this pull request Nov 14, 2024
DesmonDay added a commit that referenced this pull request Nov 14, 2024
* [Unified Checkpoint] Support sharding_comm_overlap (#9392)