Conversation

zpcore commented Aug 15, 2024

Add dynamo/non-dynamo support for torch.distributed.all_reduce and torch.distributed.all_gather_into_tensor.

Motivation: We want to deprecate the collective ops in xla_model.py and be consistent with torch.distributed.

Issue: dist.all_reduce doesn't work with the dynamo openxla backend at this time.
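
A minimal usage sketch of what this enables (not code from the PR; it assumes a process group was initialized with torch.distributed.init_process_group("xla") and the script is launched once per device):

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr


def all_reduce_then_gather(t):
  # In-place sum across all replicas.
  dist.all_reduce(t, op=dist.ReduceOp.SUM)
  # Gather the reduced tensor from every replica into one flat output.
  out = torch.empty(t.shape[0] * xr.world_size(), dtype=t.dtype, device=t.device)
  dist.all_gather_into_tensor(out, t)
  return out


t = torch.ones(4, device=xm.xla_device())

# Non-dynamo (lazy tensor) path.
eager_out = all_reduce_then_gather(t.clone())

# Dynamo path with the openxla backend.
compiled = torch.compile(all_reduce_then_gather, backend="openxla")
dynamo_out = compiled(t.clone())
```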

zpcore commented Aug 16, 2024

Hi @JackCaoG, I commented out assert met.metric_data("ExecuteTime")[0] == 1 in test_traceable_collectives.py since the value changed to 3 for the all_gather ops due to upstream changes. Shall we enable the test first so we can run it?
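
For reference, that check uses the torch_xla metrics API roughly like this (a sketch, not the exact test code):

```python
import torch_xla.debug.metrics as met

# metric_data(name) returns (num_samples, accumulator, samples); the first
# element counts how many times the metric was recorded, i.e. how many device
# executions ran.
execute_count = met.metric_data("ExecuteTime")[0]
assert execute_count == 1  # now observed as 3 for the all_gather ops
```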

@zpcore zpcore added the tpuci label Aug 16, 2024
@zpcore zpcore marked this pull request as ready for review August 16, 2024 20:02
@zpcore zpcore requested a review from lsy323 August 16, 2024 20:02
@zpcore zpcore changed the title prototype of all_gather related distributed ops Support dist.all_gather related distributed ops Aug 16, 2024
@zpcore zpcore changed the title Support dist.all_gather related distributed ops Support dist.all_gather related collective ops Aug 16, 2024
zpcore commented Aug 16, 2024

I have no idea how this works:

xm.all_reduce(reduce_type, tensors, groups=self._mesh, pin_layout=False)
return _ret_work(tensors)

xm.all_reduce should return a tensor instead of modifying the inputs argument.

I will probably give up supporting dist.all_reduce in this PR.

Update: it turns out that for output_tensor = dist.all_reduce(input_tensor, ...), both input_tensor and output_tensor get updated. We have to use input_tensor as the final result for the non-dynamo path.
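
For reference, a minimal sketch of the two xm.all_reduce calling conventions as I understand them (illustrative tensor names, single device shown):

```python
import torch
import torch_xla.core.xla_model as xm

t = torch.ones(4, device=xm.xla_device())

# Single-tensor form: returns a new reduced tensor; t is expected to be left unchanged.
reduced = xm.all_reduce(xm.REDUCE_SUM, t)

# List form (the one used in xla_backend.py): reduces the tensors in place and
# returns the same list, so the "output" aliases the input tensors.
tensors = [t.clone()]
xm.all_reduce(xm.REDUCE_SUM, tensors, pin_layout=False)
result = tensors[0]  # the reduced value lives in the input tensor
```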

@lsy323 lsy323 removed their request for review August 20, 2024 16:26
@zpcore zpcore requested a review from will-cromar August 20, 2024 22:51
@zpcore zpcore requested a review from will-cromar August 21, 2024 23:12
@zpcore zpcore merged commit f9a706e into master Aug 27, 2024
@zpcore zpcore deleted the piz/cop branch August 27, 2024 16:58
rpsilva-aws commented

> Motivation: We want to deprecate the collective ops in xla_model.py and be consistent with torch.distributed.

@zpcore This makes sense, and we would like to help close that gap. Do you have a more detailed motivation and/or an open issue for migrating the remaining collective ops? I tried to find one, but wanted to check in first before creating a new one. cc: @miladm @tengyifei
