Need to reenable ZeRO1 for GPU to enable coverage for reduce-scatter/all-gather

🐛 Bug

The ZeRO1 test, test/test_zero1.py, has been disabled for GPU since version 2.1 (#4912). We should reenable it for GPU so that reduce-scatter and all-gather get test coverage there.
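For context, the two collectives this test would cover can be sketched in plain Python. This is a single-process illustration of their semantics only, assuming sum reduction and equal-sized shards. The helper names `reduce_scatter` and `all_gather` are hypothetical and are not the torch_xla API.

```python
from typing import List


def reduce_scatter(per_rank: List[List[float]]) -> List[List[float]]:
    """Sum the inputs elementwise across ranks, then hand rank r the r-th shard."""
    world = len(per_rank)
    length = len(per_rank[0])
    # Assumption of this sketch: the vector splits evenly across ranks.
    assert length % world == 0, "sketch assumes equal-sized shards"
    shard = length // world
    reduced = [sum(vals) for vals in zip(*per_rank)]
    return [reduced[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(shards: List[List[float]]) -> List[List[float]]:
    """Concatenate every rank's shard; each rank receives the full result."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


# Two simulated ranks, each holding a full-length gradient vector.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
shards = reduce_scatter(grads)  # each rank owns one slice of the summed gradient
gathered = all_gather(shards)   # every rank ends up with the full summed vector
```

As I understand ZeRO-1, each rank would update only its own shard of the optimizer state between these two steps, which is why the test exercises both collectives.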

When I tried this with torch/xla version 2.2 (sha 7c46e4c), the test reported OK but the process then hit a segmentation fault:

----------------------------------------------------------------------
Ran 1 test in 1.428s

OK
Segmentation fault (core dumped)

To Reproduce

Steps to reproduce the behavior:

Build torch/xla as in https://github.com/pytorch/xla/blob/master/docs/gpu.md
Edit test/test_zero1.py and remove/comment-out the line that starts with

@unittest.skipIf(pjrt.device_type() == 'GPU', "TODO(alanwaketan): Fix it for the token change.")

Run the test

GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_zero1.py
GPU_NUM_DEVICES=2 PJRT_DEVICE=CUDA python test/test_zero1.py
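Because the run prints OK and only then crashes, the exit status is what reveals the failure. Below is a minimal sketch of a wrapper that runs a command and flags death by signal. It assumes a POSIX system, where Python's `subprocess` reports a signal-terminated child as a negative return code. `classify` and `run_and_classify` are hypothetical helpers, not part of the torch/xla test suite.

```python
import os
import signal
import subprocess
import sys
from typing import Dict, List, Optional


def classify(returncode: int) -> str:
    """Map a subprocess return code to a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"failed with exit code {returncode}"


def run_and_classify(cmd: List[str],
                     extra_env: Optional[Dict[str, str]] = None) -> str:
    """Run cmd with extra environment variables and classify how it exited."""
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(cmd, env=env)
    return classify(proc.returncode)
```

For instance, calling `run_and_classify([sys.executable, "test/test_zero1.py"], {"GPU_NUM_DEVICES": "2", "PJRT_DEVICE": "CUDA"})` should report a signal status rather than "ok" if the process crashes as described here, even though the unittest summary itself says OK.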

Expected behavior

The test runs and passes on GPUs without a segmentation fault.

Environment

Reproducible on XLA backend [CPU/TPU]: GPU/CUDA
torch_xla version: 2.1

Need to reenable ZeRO1 for GPU to enable coverage for reduce-scatter/all-gather #6260
