
Conversation

@wangxicoding (Contributor) commented Sep 28, 2021

PR types

Bug fixes

PR changes

Others

Describe

1. In hybrid parallelism, the non-distributed parameters under mp (model parallelism) must stay identical across all mp ranks. This PR fixes the bug that mp's non-distributed parameters were never broadcast (see the first sketch below).
2. When optimize_offload or (optimize_cast + optimize_sharding) is enabled, parameters are marked non-persistable, which makes program.clone fail. Since this problem currently only arises in hybrid parallelism, this PR fixes it by re-creating the non-persistable parameters as ordinary vars (see the second sketch below).
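A minimal sketch of the first fix, assuming Paddle's c_broadcast and c_sync_comm_stream collective ops; the helper name broadcast_non_dist_params and the is_mp_distributed predicate are hypothetical, not the PR's actual code:

def broadcast_non_dist_params(block, params, mp_ring_id, is_mp_distributed):
    """Broadcast every replicated (non-distributed) parameter from mp rank 0
    so all mp ranks start from identical values."""
    for param in params:
        if is_mp_distributed(param):
            # sharded params legitimately differ across mp ranks; skip them
            continue
        block.append_op(
            type='c_broadcast',
            inputs={'X': param},
            outputs={'Out': param},
            attrs={'ring_id': mp_ring_id, 'root': 0})
    # wait for the broadcasts before any later cast/memcpy ops consume params
    block.append_op(
        type='c_sync_comm_stream',
        inputs={'X': params},
        outputs={'Out': params},
        attrs={'ring_id': mp_ring_id})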
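And a sketch of the second fix, under the assumption that Block.var, Block._remove_var, and Block.create_var behave as in Paddle's fluid framework; recreate_param_as_var is a hypothetical helper name:

def recreate_param_as_var(block, param_name):
    """Re-register a non-persistable Parameter as an ordinary Variable so
    program.clone no longer fails on it."""
    param = block.var(param_name)
    if param.persistable:
        return  # persistable params clone fine; nothing to do
    shape, dtype = param.shape, param.dtype
    block._remove_var(param_name)   # drop the Parameter entry
    block.create_var(               # re-create it as a plain var
        name=param_name,
        shape=shape,
        dtype=dtype,
        persistable=False)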

@paddle-bot-old commented

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

rings.append(self.dp_ring_id)

# need to sync non-distributed params in the mp group
if self.mp_ring_id is not None:
    rings.append(self.mp_ring_id)
Contributor:

Wouldn't it be better to put this somewhere else? Why is the mp initialization sync implemented inside the offload logic?

Contributor Author:

Because the parameters need to be broadcast first, and the cast/memcpy ops inserted only after that; otherwise the fp16 parameters and the offload variables would end up inconsistent across devices. (See the sketch below.)
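A sketch of the ordering constraint described above. The helper and variable names are assumptions, and the dst_place_type value is the mapping I believe Paddle's memcpy op uses for pinned memory:

def init_param_for_offload(block, param, fp16_param, offload_var, mp_ring_id):
    # 1) make the fp32 parameter identical on every mp rank first
    block.append_op(
        type='c_broadcast',
        inputs={'X': param},
        outputs={'Out': param},
        attrs={'ring_id': mp_ring_id, 'root': 0})
    # 2) only then derive the fp16 working copy from the synced value
    block.append_op(
        type='cast',
        inputs={'X': param},
        outputs={'Out': fp16_param},
        attrs={'in_dtype': param.dtype, 'out_dtype': fp16_param.dtype})
    # 3) and stage the synced value to the offload destination
    block.append_op(
        type='memcpy',
        inputs={'X': param},
        outputs={'Out': offload_var},
        attrs={'dst_place_type': 2})  # assumed: 2 == CUDAPinnedPlace

Inserting the cast/memcpy ops before the broadcast would have each rank cast its own unsynchronized initial values, so the fp16 and offload copies would diverge.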

Contributor Author:

Alternatively, we could write a dedicated piece of logic directly in _initialization_broadcast to handle the requirement that offload and optimize_cast need the parameters broadcast first. That might be a bit more trouble, but from a modularity standpoint it would indeed be better. (A hypothetical sketch follows.)
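A hypothetical sketch of that refactor; the method placement and the _non_distributed_params helper are assumed from the discussion, not taken from the PR:

def _initialization_broadcast(self, block):
    rings = [self.dp_ring_id]
    # also sync non-distributed params across the mp group, so that
    # offload/optimize_cast can insert their cast/memcpy ops afterwards
    if self.mp_ring_id is not None:
        rings.append(self.mp_ring_id)
    for param in self._non_distributed_params(block):  # hypothetical helper
        for ring_id in rings:
            block.append_op(
                type='c_broadcast',
                inputs={'X': param},
                outputs={'Out': param},
                attrs={'ring_id': ring_id, 'root': 0})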

Contributor Author:

I'll put together dedicated logic to handle this requirement later.

@sandyhouse left a comment

LGTM

@gongweibao gongweibao merged commit bec9fc9 into PaddlePaddle:develop Sep 29, 2021
@wangxicoding wangxicoding deleted the fix_mp_no_dist_param_bcast branch September 29, 2021 03:52