Conversation

@alanwaketan
Collaborator

Summary:
For all the cc ops, we use a token to introduce control dependencies among them so that they are executed in order. This token is cached in the Python layer, and this pull request moves it to C++, since the upcoming pytorch/pytorch#93173 won't carry the token from Python to C++.

Test Plan:
CI.
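
For reference, here is a minimal sketch of what caching the token in C++ and exposing it to Python might look like. This is an illustration only: the real token wraps an XLA IR value rather than the placeholder struct below, and everything except the g_all_reduce_token, _get_all_reduce_token, and _set_all_reduce_token names (which appear in this PR's diff) is an assumption.

#include <cstdint>
#include <mutex>

#include <pybind11/pybind11.h>

namespace py = pybind11;

// Placeholder payload; the real token holds an XLA IR value.
struct Token {
  int64_t id = 0;
};

struct AllReduceToken {
  Token value;
  std::mutex lock;
};

// Process-wide cache replacing the Python-side devctx.all_reduce_token.
AllReduceToken g_all_reduce_token;

PYBIND11_MODULE(_token_sketch, m) {
  py::class_<Token>(m, "Token")
      .def(py::init<>())
      .def_readwrite("id", &Token::id);
  m.def("_get_all_reduce_token", []() {
    std::lock_guard<std::mutex> guard(g_all_reduce_token.lock);
    return g_all_reduce_token.value;  // returned by copy
  });
  m.def("_set_all_reduce_token", [](const Token& token) {
    std::lock_guard<std::mutex> guard(g_all_reduce_token.lock);
    g_all_reduce_token.value = token;
  });
}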

@alanwaketan alanwaketan self-assigned this Apr 20, 2023
@alanwaketan alanwaketan marked this pull request as draft April 20, 2023 01:08
@alanwaketan alanwaketan changed the base branch from master to alanwaketan/cc April 20, 2023 01:08
@alanwaketan alanwaketan changed the base branch from alanwaketan/cc to master April 20, 2023 01:09
@alanwaketan alanwaketan marked this pull request as ready for review April 20, 2023 07:04
@alanwaketan alanwaketan requested a review from JackCaoG April 20, 2023 07:04
@alanwaketan
Collaborator Author

Okay, GPU CI is happy without test_zero1.py. Let's skip that and I will follow up next week.

result = torch_xla._XLAC._xla_all_gather(value, token, dim, shard_count,
                                         groups or [], pin_layout)
devctx.all_reduce_token = result[1]
torch_xla._XLAC._set_all_reduce_token(result[1])
Collaborator

Hmm, I guess in the long term, if we don't even set the token in Python, there is no need to return the token to the Python layer? We can look into this later; maybe it is still cleaner this way.

Collaborator Author

Yeah, I'm just too lazy to make the changes to all the cc ops to remove the token from the Python layer.
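
(For illustration, a rough sketch of that longer-term shape, where the binding consumes and refreshes the cached token itself so nothing token-related crosses into Python. All of the types and helpers below, Token, Tensor, AllGather, are placeholders rather than the real torch_xla code.)

#include <cstdint>
#include <mutex>
#include <utility>

// Placeholders standing in for the real XLA IR token, at::Tensor, and cc op.
struct Token { int64_t id = 0; };
struct Tensor {};

std::mutex g_lock;
Token g_token;

// Pretend cc op: returns the result plus the next token in the ordering chain.
std::pair<Tensor, Token> AllGather(const Tensor& value, const Token& token) {
  return {value, Token{token.id + 1}};
}

// If the binding reads and updates the cached token itself, the Python wrapper
// no longer needs to receive, store, or pass the token at all.
Tensor AllGatherBinding(const Tensor& value) {
  std::lock_guard<std::mutex> guard(g_lock);
  auto [result, new_token] = AllGather(value, g_token);
  g_token = new_token;
  return result;
}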

std::mutex lock;
};

AllReduceToken g_all_reduce_token;
Collaborator

Shouldn't we have one token per device? Under PJRT in the v3 case, each process will have 2 threads and 1 device per thread. In that case, those two threads should not share the same token.

Collaborator Author

Good point. Completely forgot the V3 case...
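
(Sketching one possible per-device shape, purely illustrative; the keying by device string and the class below are assumptions, not this PR's code.)

#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Placeholder payload; the real token holds an XLA IR value.
struct Token {
  int64_t id = 0;
};

// One token per device, so the two per-process threads under PJRT on v3
// (one device each) never share ordering state.
class PerDeviceTokens {
 public:
  Token Get(const std::string& device) {
    std::lock_guard<std::mutex> guard(lock_);
    return tokens_[device];  // default-constructed on first access
  }

  void Set(const std::string& device, const Token& token) {
    std::lock_guard<std::mutex> guard(lock_);
    tokens_[device] = token;
  }

 private:
  std::mutex lock_;
  std::map<std::string, Token> tokens_;
};

PerDeviceTokens g_all_reduce_tokens;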

      [](at::Tensor& self, const at::Tensor& source) -> at::Tensor& {
        return XLANativeFunctions::set_(self, source);
      });
  m.def("_get_all_reduce_token",
Collaborator

I think it is better to call it _get_cc_token or _get_xla_token, although it is currently only used for all_reduce. We can also do this after we convert the second cc op to use the C++ token.

Collaborator Author

I'm just following the tradition in the Python layer, where it's named all_reduce_token. haha.

Collaborator

@JackCaoG JackCaoG left a comment

Mostly LGTM. @pratnali @amithrm FYI, we are moving the token to C++ so it can be traced by dynamo.

@alanwaketan
Collaborator Author

Thanks Jack for approving the change.

@alanwaketan alanwaketan merged commit 44c2fa0 into master Apr 21, 2023
@pratnali

Thanks for the feedback everyone.
