Conversation

zpcore commented Aug 15, 2024

Add dynamo/non-dynamo support for torch.distributed.all_reduce and torch.distributed.all_gather_into_tensor.

Motivation: We want to deprecate the collective ops in xla_model.py and be consistent with torch.distributed.

Issue: dist.all_reduce doesn't work with the dynamo openxla backend at this time.
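
A minimal usage sketch of what this enables (not code from the PR; it assumes a process group was initialized with torch.distributed.init_process_group("xla") and the script is launched once per device):

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr


def all_reduce_then_gather(t):
  # In-place sum across all replicas.
  dist.all_reduce(t, op=dist.ReduceOp.SUM)
  # Gather the reduced tensor from every replica into one flat output.
  out = torch.empty(t.shape[0] * xr.world_size(), dtype=t.dtype, device=t.device)
  dist.all_gather_into_tensor(out, t)
  return out


t = torch.ones(4, device=xm.xla_device())

# Non-dynamo (lazy tensor) path.
eager_out = all_reduce_then_gather(t.clone())

# Dynamo path with the openxla backend.
compiled = torch.compile(all_reduce_then_gather, backend="openxla")
dynamo_out = compiled(t.clone())
```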

zpcore commented Aug 16, 2024

Hi @JackCaoG, I commented out assert met.metric_data("ExecuteTime")[0] == 1 in test_traceable_collectives.py since the value changed to 3 for the all_gather ops due to upstream changes. Shall we enable the test first so we can run it?
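
For reference, that check uses the torch_xla metrics API roughly like this (a sketch, not the exact test code):

```python
import torch_xla.debug.metrics as met

# metric_data(name) returns (num_samples, accumulator, samples); the first
# element counts how many times the metric was recorded, i.e. how many device
# executions ran.
execute_count = met.metric_data("ExecuteTime")[0]
assert execute_count == 1  # now observed as 3 for the all_gather ops
```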

@zpcore zpcore added the tpuci label Aug 16, 2024
@zpcore zpcore marked this pull request as ready for review August 16, 2024 20:02
@zpcore zpcore requested a review from lsy323 August 16, 2024 20:02
@zpcore zpcore changed the title prototype of all_gather related distributed ops Support dist.all_gather related distributed ops Aug 16, 2024
@zpcore zpcore changed the title Support dist.all_gather related distributed ops Support dist.all_gather related collective ops Aug 16, 2024
zpcore commented Aug 16, 2024

I have no idea how this works:

xm.all_reduce(reduce_type, tensors, groups=self._mesh, pin_layout=False)
return _ret_work(tensors)

xm.all_reduce should return a tensor instead of modifying the inputs argument.

I will probably give up supporting dist.all_reduce in this PR.

Update: it turns out that for output_tensor = dist.all_reduce(input_tensor, ...), both input_tensor and output_tensor get updated. We have to use input_tensor as the final result for the non-dynamo path.
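
For reference, a minimal sketch of the two xm.all_reduce calling conventions as I understand them (illustrative tensor names, single device shown):

```python
import torch
import torch_xla.core.xla_model as xm

t = torch.ones(4, device=xm.xla_device())

# Single-tensor form: returns a new reduced tensor; t is expected to be left unchanged.
reduced = xm.all_reduce(xm.REDUCE_SUM, t)

# List form (the one used in xla_backend.py): reduces the tensors in place and
# returns the same list, so the "output" aliases the input tensors.
tensors = [t.clone()]
xm.all_reduce(xm.REDUCE_SUM, tensors, pin_layout=False)
result = tensors[0]  # the reduced value lives in the input tensor
```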

@lsy323 lsy323 removed their request for review August 20, 2024 16:26
@zpcore zpcore requested a review from will-cromar August 20, 2024 22:51
@zpcore zpcore requested a review from will-cromar August 21, 2024 23:12
@zpcore zpcore merged commit f9a706e into master Aug 27, 2024
@zpcore zpcore deleted the piz/cop branch August 27, 2024 16:58
rpsilva-aws commented

> Motivation: We want to deprecate the collective ops in xla_model.py and be consistent with torch.distributed.

@zpcore This makes sense, and we would like to help close that gap. Do you have a more detailed motivation and/or an open issue for migrating the remaining collective ops? I tried to find one, but wanted to check in first before creating a new one. cc: @miladm @tengyifei
