[Distributed] Switch all_reduce to use the new functional collective op #6887
Conversation
Let me know if you need any help setting up the XLA environment.
Hey @alanwaketan, I'd appreciate your help on this. Is it possible to set up the dev env without having access to docker?
Right...
Have you followed https://github.com/pytorch/xla/blob/master/CONTRIBUTING.md? What error do you get? I wonder if maybe you can just develop and test this on the CPU machine, with no need to access a TPU.
Yeah, I tried to follow it, but building from source requires docker, which is not available in my dev env :(
Oh, so you cannot use docker or you cannot access our docker? |
Cannot use docker. It's a restriction of my corporate dev env :( |
Maybe you can leave the development to our team then. I don't know how difficult it is to use a non-docker env... @JackCaoG Do you know?
PyTorch has implemented a new set of functional collective ops and is planning to remove the old ops. Migrating all_reduce to use the new op. See context in pytorch/pytorch#93173 (comment)
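For readers unfamiliar with the two op sets, here is a minimal, hedged sketch of the Python wrapper for the new functional collectives. This is not code from this PR; the process-group setup and names are illustrative only.

```python
# Minimal sketch of the new functional collective wrapper (illustrative, not
# code from this PR). Assumes torch.distributed has already been initialized
# via init_process_group.
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def all_reduce_sum(t: torch.Tensor) -> torch.Tensor:
    # The wrapper routes to the new native op namespace
    # (torch.ops._c10d_functional.all_reduce) instead of the legacy
    # torch.ops.c10d_functional.all_reduce that PyTorch plans to remove.
    reduced = funcol.all_reduce(t, reduceOp="sum", group=dist.group.WORLD)
    # Functional collectives do not mutate `t`; the result comes back as a new
    # tensor (an AsyncCollectiveTensor that syncs lazily on first use).
    return reduced
```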
Sounds good. Thanks for the help @alanwaketan! I just wanted to make sure there wasn't a gap for torch-xla to adopt the new API.
@alanwaketan - following your advice, the CI is now green. Note this PR doesn't address the TODO re. generating groups. It merely switches to the new API while staying consistent with the old behavior. Happy to continue discussing how to plumb through group information. Let me know what you think.
Thanks, Yifu. We don't need group information in XLA in general, so we can follow up on that later. Let me trigger the TPU CI. Once everything is green, I will just merge the PR.
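As a side note on the group discussion above, here is a small illustrative sketch (not from this PR) of what passing group information through the new API could look like. The sub-group construction is hypothetical; the point is simply that the `group` argument of the new wrapper accepts a ProcessGroup directly.

```python
# Illustrative only: plumbing group information through the new functional
# collective mostly means forwarding a ProcessGroup (or similar) as `group`.
# The sub-group built below is made up for the example.
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def all_reduce_in_subgroup(t: torch.Tensor, ranks: list[int]) -> torch.Tensor:
    subgroup = dist.new_group(ranks=ranks)  # hypothetical sub-group of the world
    return funcol.all_reduce(t, reduceOp="sum", group=subgroup)
```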
LGTM.
Oh, we cannot run TPU CI on a fork, I guess... Never mind. I will just merge it. The head CI will take care of the rest.
… legacy funcol After pytorch/xla#6887, torch-xla now also uses the all_reduce from native funcol. So we can remove this logic. [ghstack-poisoned]
…-xla to use legacy funcol" After pytorch/xla#6887, torch-xla now also uses the all_reduce from native funcol. So we can remove this logic. [ghstack-poisoned]
…hat forces torch-xla to use legacy funcol" After pytorch/xla#6887, torch-xla now also uses the all_reduce from native funcol. So we can remove this logic. [ghstack-poisoned]
… legacy funcol (#123776) After pytorch/xla#6887, torch-xla now also uses the all_reduce from native funcol. So we can remove this logic. Pull Request resolved: #123776 Approved by: https://github.com/wanchaol
… legacy funcol (pytorch#123776) After pytorch/xla#6887, torch-xla now also uses the all_reduce from native funcol. So we can remove this logic. Pull Request resolved: pytorch#123776 Approved by: https://github.com/wanchaol
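For context on the follow-up commits above: pytorch#123776 removes a special case in PyTorch that kept torch-xla on the legacy funcol ops once this PR landed. The sketch below is a hypothetical illustration of that kind of guard, not the actual code that was removed.

```python
# Hypothetical illustration only -- not the code removed in pytorch#123776.
# It shows the kind of guard that kept torch-xla on the legacy c10d_functional
# ops until pytorch/xla#6887 made the native ops usable.
import sys

def _use_native_funcol() -> bool:  # hypothetical helper name
    # Detecting torch_xla previously forced the legacy funcol path.
    if "torch_xla" in sys.modules:
        return False
    return True
```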