[AutoParallel]Fix get_group method of processmesh #73099
Conversation
| Your PR was submitted successfully. Thank you for contributing to this open-source project! |
| Sorry to inform you that f056bd8's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually. |
… fix_get_group
Codecov Report
❌ Your patch status has failed because the patch coverage (57.14%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
Coverage diff for #73099 against develop: Coverage 57.14%, Files 1, Lines 14, Branches 0, Hits 8, Misses 6, Partials 0.
|
| ) | ||
| | ||
| return parallel_group_map[dim_name]() | ||
| existing_group = None |
This variable is redundant. In the `if set(group.ranks) == set(self._process_ids)` branch, just return `group` directly.
Done
| | ||
| | ||
| if __name__ == "__main__": | ||
| test_dp_parallel() |
The fleet_test_xx files are all very similar. Could they be merged into a single file?
Done
| /re-run all-failed |
LGTM











PR Category
Auto Parallel
PR Types
Bug fixes
Description
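In practice, the get_group method of ProcessMesh creates communication groups repeatedly, which can exhaust GPU memory or cause unexpected errors during communication. Care is therefore needed when converting a ProcessMesh into a communication group for use in dynamic-graph manual-parallel code: if a group covering the same mesh as the one mesh.get_group is called on already exists, get_group should return that existing group instead of creating a new one.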
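The reuse rule described above can be sketched in plain Python. This is only an illustration, not the code in this PR: `FakeGroup` and `GroupRegistry` are hypothetical stand-ins for a communication group and the framework's table of existing groups, and only the rank-set comparison (`set(group.ranks) == set(process_ids)`) is taken from the review discussion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeGroup:
    """Hypothetical stand-in for a communication group."""
    id: int
    ranks: List[int]


class GroupRegistry:
    """Hypothetical stand-in for the table of already-created groups."""

    def __init__(self) -> None:
        self._groups: Dict[int, FakeGroup] = {}
        self.created = 0  # counts how many groups were actually created

    def new_group(self, ranks: List[int]) -> FakeGroup:
        # Always creates a fresh group (the costly operation to avoid).
        group = FakeGroup(id=len(self._groups), ranks=list(ranks))
        self._groups[group.id] = group
        self.created += 1
        return group

    def get_group(self, process_ids: List[int]) -> FakeGroup:
        # Reuse an existing group that covers exactly the same ranks.
        for group in self._groups.values():
            if set(group.ranks) == set(process_ids):
                return group
        # No match found: only now create a new group.
        return self.new_group(process_ids)


registry = GroupRegistry()
first = registry.get_group([0, 1, 2, 3])
second = registry.get_group([3, 2, 1, 0])  # same ranks, different order
other = registry.get_group([0, 1])         # different ranks
```

In this sketch `first` and `second` are the same object, so repeated calls for the same mesh do not allocate additional groups. Comparing rank sets ignores rank order, which is an assumption of the sketch; whether ordering matters for a real group depends on the framework.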