@jonb377
Collaborator

@jonb377 jonb377 commented Feb 5, 2024

As pointed out in #6465, our documentation is missing discussion of how to initialize the process group in SPMD execution mode.

A process group is required for distributed checkpointing and can be used with various other torch.distributed APIs. In SPMD, we don't allow process groups on the XLA backend, since the compiler is responsible for controlling the on-device collectives.
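As a rough illustration of the point above, here is a minimal sketch of creating a process group on a CPU backend instead of the XLA backend. It assumes the `gloo` backend and a single-process, file-based rendezvous; the helper name `init_cpu_process_group` and the rendezvous path are hypothetical and are not taken from this PR. A real multi-host SPMD job would need a rendezvous method that all hosts can reach.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def init_cpu_process_group(init_method: str, rank: int = 0, world_size: int = 1) -> None:
    """Hypothetical helper: create a gloo (CPU) process group if none exists yet."""
    if dist.is_initialized():
        return
    # Use a CPU backend rather than the XLA backend, since in SPMD mode the
    # compiler owns the on-device collectives.
    dist.init_process_group(
        "gloo", init_method=init_method, rank=rank, world_size=world_size
    )


# Single-process demo (assumption): a file-based rendezvous in a fresh temp dir.
rendezvous_file = os.path.join(tempfile.mkdtemp(), "pg_rendezvous")
init_cpu_process_group(f"file://{rendezvous_file}")

# Host-side collectives now run through gloo on CPU tensors.
value = torch.ones(1)
dist.all_reduce(value)  # default op is SUM; unchanged when world_size == 1
```

Once the group exists, torch.distributed APIs that need one, such as distributed checkpointing, can use it. Call `dist.destroy_process_group()` when the job is finished.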

@jonb377 jonb377 requested a review from yeounoh February 5, 2024 19:10
@jonb377 jonb377 self-assigned this Feb 5, 2024
@yeounoh
Contributor

yeounoh commented Feb 5, 2024

cc @vanbasten23, you might have already done this, but did we add the SPMD + GPU documentation/section as well?

Contributor

@yeounoh yeounoh left a comment

LGTM thanks!

@jonb377 jonb377 merged commit 732a1c7 into master Feb 5, 2024
@jonb377 jonb377 deleted the jonbolin/pg branch February 5, 2024 21:22
@vanbasten23
Collaborator

cc @vanbasten23, you might have already done this, but did we add the SPMD + GPU documentation/section as well?

Good call. I haven't done that yet, but let me add some tomorrow.

amithrm pushed a commit to amithrm/xla that referenced this pull request Mar 1, 2024

3 participants