Conversation

@alanwaketan (Collaborator)
Summary:
This adds manual all-reduce support to SPMD; it currently supports only a single input tensor. Array support can be handled in the Python layer instead.

Test Plan:
python ./test/spmd/test_xla_sharding.py -v -k test_spmd_all_reduce

@JackCaoG (Collaborator) left a comment

Approve to unblock, but I think we should fix the tensor method name.

Inline review comment on:

    XLATensorPtr all_reduce(const XLATensorPtr& input, AllReduceType reduce_type,
@JackCaoG (Collaborator):

Can you call it all_reduce_no_token? The only difference in the signature is that it does not take pin_layout, but the main difference in the op is that it does not set the token. It is better to reflect that in the name.

@alanwaketan (Collaborator, Author):

Sure. I can follow up with that.

@JackCaoG (Collaborator)

For array support, do you plan to call all_reduce multiple times? In our C++ implementation, I think we group tensors by dtype and call all_reduce once per dtype.

@alanwaketan (Collaborator, Author)

> For array support, do you plan to call all_reduce multiple times? In our C++ implementation, I think we group tensors by dtype and call all_reduce once per dtype.

I don't think that's necessary. I'm thinking the compiler should be smart enough to fuse all-reduces when fusion is needed.
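For context, the dtype-grouping strategy Jack describes can be sketched in plain Python. This is a hedged illustration, not the actual torch_xla implementation: `fake_all_reduce` is a hypothetical stand-in for the real collective, and the point is simply that bucketing by dtype turns N per-tensor calls into one call per dtype group.

```python
from collections import defaultdict

def fake_all_reduce(group):
    # Hypothetical placeholder for a real cross-replica collective
    # that reduces a list of same-dtype tensors in one call.
    return group

def all_reduce_grouped(named_tensors):
    """Bucket (dtype, tensor) pairs by dtype and issue one simulated
    collective per bucket. Returns the number of collective calls."""
    buckets = defaultdict(list)
    for dtype, tensor in named_tensors:
        buckets[dtype].append(tensor)
    calls = 0
    for dtype, group in buckets.items():
        fake_all_reduce(group)  # one collective per dtype, not per tensor
        calls += 1
    return calls

# Three f32 tensors and one s64 tensor -> 2 collective calls, not 4.
inputs = [("f32", [1.0]), ("f32", [2.0]), ("s64", [3]), ("f32", [4.0])]
```

The Python-layer alternative discussed above would instead call all_reduce once per tensor and rely on the XLA compiler to fuse the resulting collectives.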

@alanwaketan (Collaborator, Author)

Thanks Jack for approving.

@alanwaketan alanwaketan merged commit 0df5c29 into master Jun 26, 2024
@alanwaketan alanwaketan deleted the alanwaketan/spmd_all_reduce branch June 26, 2024 18:27