@vanbasten23 vanbasten23 commented Oct 25, 2023

Currently the test (PJRT_DEVICE=GPU torchrun --nproc_per_node=4 --nnodes=1 --node_rank=1 --rdzv_endpoint="10.164.0.13:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1) fails with error:

Traceback (most recent call last):
  File "pytorch/xla/test/test_train_mp_imagenet.py", line 378, in <module>
    _mp_fn(FLAGS)
TypeError: _mp_fn() missing 1 required positional argument: 'flags'

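This PR fixes it by passing the local rank as the first argument to _mp_fn when the script is launched with torchrun.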
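For readers who want to see the failure in isolation, here is a minimal sketch that reproduces the same TypeError without torch_xla. It assumes `_mp_fn` takes `(index, flags)`, which is consistent with the error message above but is not copied from the test file; the function body and the `FLAGS` dictionary are hypothetical stand-ins.

```python
# Stand-in for the test's entry point. The (index, flags) signature is an
# assumption inferred from the error message, not copied from the test file.
def _mp_fn(index, flags):
    return index, flags


# Hypothetical stand-in for the parsed command-line flags.
FLAGS = {"batch_size": 128, "num_epochs": 1}

# Passing only FLAGS binds it to `index` and leaves `flags` unfilled,
# which raises the TypeError reported above.
try:
    _mp_fn(FLAGS)
    message = None
except TypeError as err:
    message = str(err)

print(message)

# Supplying a rank as the first argument satisfies the signature.
result = _mp_fn(0, FLAGS)
```

The single-argument call fails because `FLAGS` is bound to the first positional parameter, so Python reports the second parameter, `flags`, as missing.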

@vanbasten23 vanbasten23 requested a review from jonb377 October 25, 2023 00:58
@jonb377 jonb377 left a comment

LGTM, thanks Xiongfei

 if __name__ == '__main__':
   if dist.is_torchelastic_launched():
-    _mp_fn(FLAGS)
+    _mp_fn(xu.getenv_as(xenv.LOCAL_RANK, int), FLAGS)
This is set by torchrun, right? Just confirming my understanding.

Yes
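As a rough illustration of the exchange above: torchrun exports a `LOCAL_RANK` environment variable to each worker process it starts, and the fix reads that value as an integer. The sketch below uses only the standard library; `local_rank_from_env` is a hypothetical helper written for this example and is not the `xu.getenv_as` implementation.

```python
import os


def local_rank_from_env(env=os.environ, default=0):
    """Return LOCAL_RANK as an int, or `default` when it is not set.

    Hypothetical helper for illustration only; it mimics reading an
    integer-typed environment variable and is not torch_xla code.
    """
    raw = env.get("LOCAL_RANK")
    return default if raw is None else int(raw)


# Simulated environments, so the example does not depend on torchrun.
under_torchrun = {"LOCAL_RANK": "3"}
plain_python = {}

rank_a = local_rank_from_env(under_torchrun)
rank_b = local_rank_from_env(plain_python)
print(rank_a, rank_b)
```

Under a real torchrun launch, calling `local_rank_from_env()` with no arguments would read the value from the process environment.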

@vanbasten23

Thanks for the review!

@vanbasten23 vanbasten23 merged commit 294610a into master Oct 25, 2023
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
pytorch#5729) * Fix the missing parameter error when running mp_imagenet with torchrun * made it local rank
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
pytorch#5729) * Fix the missing parameter error when running mp_imagenet with torchrun * made it local rank
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
#5729) * Fix the missing parameter error when running mp_imagenet with torchrun * made it local rank
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
#5729) * Fix the missing parameter error when running mp_imagenet with torchrun * made it local rank