
Conversation

@xymyeah (Contributor) commented Dec 18, 2021

PR types

New features

PR changes

Others

Describe

[Auto Parallel] add gradient merge pass
Refer to https://github.com/xymyeah/gradient_merge_precision_alignment for the precision alignment results.
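For context, gradient merge (also called gradient accumulation) sums the gradients of several micro-batches and applies a single optimizer update, which is the behavior this pass rewrites the program to produce. Below is a minimal, framework-agnostic sketch of the idea only, using hypothetical names (k_steps, merged_grads, apply_update); it is not the pass implementation.

```python
import numpy as np

# Minimal sketch of the gradient-merge idea (hypothetical names; not the pass code).
# Gradients from k_steps micro-batches are accumulated into per-parameter buffers,
# and the optimizer update is applied only once every k_steps iterations.
k_steps = 4
merged_grads = {}  # param name -> accumulated gradient buffer


def accumulate(step, named_grads):
    """Add this micro-step's gradients into the merge buffers.

    Returns True when k_steps gradients have been merged and the
    optimizer update should run.
    """
    for name, grad in named_grads.items():
        buf = merged_grads.setdefault(name, np.zeros_like(grad))
        buf += grad
    return (step + 1) % k_steps == 0


def apply_update(params, lr=0.01, avg=True):
    """Apply one SGD-style update with the merged gradients, then reset the buffers."""
    for name, param in params.items():
        grad = merged_grads[name] / k_steps if avg else merged_grads[name]
        param -= lr * grad
        merged_grads[name][:] = 0.0
```

Judging from the diff fragments below (gradient_merge_var, cond_var_name), the pass appears to realize the same effect in the static program: per-parameter gradient-merge variables accumulate the per-step gradients, and the optimizer ops run inside a conditional block that fires every k steps.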

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -0,0 +1,349 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor

Please rename gradient_merge.py to auto_parallel_gradient_merge.py since this pass may not work for other code.

Contributor Author

done

return optimize_ops_desc


def _remove_op_role_var(param, grad):
Contributor

What is the purpose of _remove_op_role_var?

@xymyeah (Contributor Author) commented Dec 29, 2021

In the non-auto-parallel case, multi-card training uses the op_role_var attribute to record the Vars that need communication (i.e., it marks which gradients must be communicated). After adding gradient merge, the op_role_var recorded on the original ops is no longer correct and must be removed; correspondingly, the optimizer ops need the appropriate op_role_var added so that the merged gradients can be allreduced later.

Contributor

Adding allreduce based on op_role_var is ParallelExecutor (PE) logic, and auto parallel does not go through PE: each dist op decides on its own whether gradient synchronization is needed and in which dp world to synchronize (there may be multiple dp worlds). op_role_var has no effect in auto parallel (it cannot distinguish between multiple dp worlds), so there is no need to add it at all.

Contributor Author

The op_role_var attribute has been removed from the ops in the program.
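For illustration, removing the stale attribute typically means dropping op_role_var from the op that produced each gradient. The following is a minimal sketch of what _remove_op_role_var might look like, assuming Paddle's internal op attribute helpers (core.op_proto_and_checker_maker, op.has_attr, op._remove_attr) and that grad.op points at the generating op; it may differ from the exact code merged in this pass.

```python
from paddle.fluid import core


def _remove_op_role_var(param, grad):
    # Sketch: drop the stale op_role_var attribute from the op that produced
    # this gradient. After gradient merge, the (param, grad) pairs recorded in
    # op_role_var no longer match the actual communication pattern.
    op_maker = core.op_proto_and_checker_maker
    op = grad.op
    if op is not None and op.has_attr(op_maker.kOpRoleVarAttrName()):
        op._remove_attr(op_maker.kOpRoleVarAttrName())
```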

@aoyulong previously approved these changes Dec 28, 2021
_add_gm_op_role_var(new_grad_op, param, gradient_merge_var,
cond_var_name)
new_params_grads.append([param, gradient_merge_var])
return new_params_grads, param_to_gradient_merge
Contributor

It is better to rename new_params_grads to new_params_to_grads, in the same style as param_to_gradient_merge, to explicitly indicate a dict.

Contributor Author

done

@JZ-LIANG previously approved these changes Dec 30, 2021
@JZ-LIANG (Contributor) left a comment

LGTM

@xymyeah changed the title from "[Auto Parallel] add gradient merge pass" to "[Auto Parallel] Add general gradient merge pass to support auot parallel" on Dec 30, 2021
@xymyeah changed the title from "[Auto Parallel] Add general gradient merge pass to support auot parallel" to "[Auto Parallel] Add general gradient merge pass to support auto parallel" on Dec 31, 2021
@JZ-LIANG merged commit 89ce6db into PaddlePaddle:develop on Dec 31, 2021