
Conversation

@vanbasten23 (Collaborator) commented Nov 1, 2023

#5675 discovered that the runtime has replaced the PJRT_DEVICE value GPU with CUDA and ROCM. This PR updates our GPU documentation accordingly.

Test via:

  • GPU_NUM_DEVICES=1 PJRT_DEVICE=GPU python pytorch/xla/test/test_train_mp_imagenet.py --fake_data
  • PJRT_DEVICE=CUDA torchrun --nnodes 1 --nproc-per-node 4 pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
  • GPU CI
@zpcore (Member) commented Nov 1, 2023

Thanks for the update. Do we also want to remove GPU from the device list here: https://github.com/pytorch/xla/blob/1f0e972a6dfc5ed3205b23d0fccac8ebb128584f/torch_xla/core/xla_model.py#L92C1-L92C1? Several other places in the xla_model.py file still mention the GPU choice.

@JackCaoG (Collaborator) commented Nov 1, 2023

Hey @vanbasten23, can we make sure PJRT_DEVICE=GPU continues to work, and add a warning message telling users to set it to CUDA?

@vanbasten23 (Collaborator, Author)

> Thanks for the update. Do we also want to remove GPU from the device list here: https://github.com/pytorch/xla/blob/1f0e972a6dfc5ed3205b23d0fccac8ebb128584f/torch_xla/core/xla_model.py#L92C1-L92C1? Several other places in the xla_model.py file still mention the GPU choice.

Yeah, that's a good point. I plan to remove it in a follow-up PR.

@vanbasten23 (Collaborator, Author) commented Nov 2, 2023

> Hey @vanbasten23, can we make sure PJRT_DEVICE=GPU continues to work, and add a warning message telling users to set it to CUDA?

FWIW, it's already causing some problems: with our nightly build (11/1/23), I see:

```
root@xiowei-gpu-1:/# PJRT_DEVICE=GPU python
Python 3.8.18 (default, Oct 12 2023, 10:35:13) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch_xla.core.xla_model as xm
>>> import torch, torch_xla
>>> xm.get_xla_supported_devices(devkind='GPU')  # it returns nothing
```

By contrast, r2.1 returns ['xla:0']. Let me see if there is an easy fix.

@zpcore (Member) commented Nov 2, 2023

> > Hey @vanbasten23, can we make sure PJRT_DEVICE=GPU continues to work, and add a warning message telling users to set it to CUDA?
>
> FWIW, it's already causing some problems: with our nightly build (11/1/23), I see:
>
> ```
> root@xiowei-gpu-1:/# PJRT_DEVICE=GPU python
> Python 3.8.18 (default, Oct 12 2023, 10:35:13) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import torch_xla.core.xla_model as xm
> >>> import torch, torch_xla
> >>> xm.get_xla_supported_devices(devkind='GPU')  # it returns nothing
> ```
>
> By contrast, r2.1 returns ['xla:0']. Let me see if there is an easy fix.

Yes, that's what we saw when running on GPU recently.
As a temporary fix to keep running on GPU, we added `devkind = "CUDA" if devkind == "GPU" else devkind` before the `xla_devices = _DEVICES.value` line.
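
For readers following along, here is a minimal standalone sketch of that alias. The function name `normalize_devkind` and the warning text are hypothetical; the real workaround sits inside `get_xla_supported_devices` in torch_xla/core/xla_model.py:

```python
import warnings

def normalize_devkind(devkind):
    """Map the deprecated 'GPU' device kind to 'CUDA', warning the caller.

    Standalone illustration of the workaround quoted above; it is not the
    actual torch_xla code path.
    """
    if devkind == "GPU":
        warnings.warn(
            "Device kind 'GPU' is deprecated; use 'CUDA' (or 'ROCM') instead.",
            DeprecationWarning)
        return "CUDA"
    return devkind

assert normalize_devkind("GPU") == "CUDA"   # deprecated alias still resolves
assert normalize_devkind("CUDA") == "CUDA"  # new spelling passes through
```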

@vanbasten23 (Collaborator, Author)

> can we make sure PJRT_DEVICE=GPU continues to work, and add a warning message telling users to set it to CUDA?

Hi @JackCaoG, I've added warning messages and fixed the currently known bug.
Regarding "make sure PJRT_DEVICE=GPU continues to work": we're already in a mixed state where GPU CI uses CUDA but our public documentation uses GPU. IMO, the only way to truly guarantee that is to run GPU CI with PJRT_DEVICE=GPU, but that sounds like too much overhead. How about we just make a best effort?

@JackCaoG (Collaborator) commented Nov 2, 2023

@vanbasten23 just have one test run with PJRT_DEVICE=GPU (explicitly set the env var when running that test, or something similar) and have the rest run with PJRT_DEVICE=CUDA (the default).
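
A sketch of what that could look like as a standalone smoke test. The harness below is hypothetical, not the actual GPU CI config; the script path is the one used in this PR's test commands:

```python
import os
import subprocess
import unittest

class PjrtDeviceAliasSmokeTest(unittest.TestCase):
    """Run one job with the legacy PJRT_DEVICE=GPU spelling.

    Everything else in CI would keep the PJRT_DEVICE=CUDA default.
    """

    def test_legacy_gpu_spelling_still_runs(self):
        # Override only the device-selection variables for this one run.
        env = dict(os.environ, PJRT_DEVICE="GPU", GPU_NUM_DEVICES="1")
        result = subprocess.run(
            ["python", "pytorch/xla/test/test_train_mp_imagenet.py",
             "--fake_data", "--num_epochs=1"],
            env=env)
        self.assertEqual(result.returncode, 0)

if __name__ == "__main__":
    unittest.main()
```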

@vanbasten23 (Collaborator, Author) commented Nov 3, 2023

The GPU CI test TestDynamicShapeModels is failing:

```
test_backward_pass_with_dynamic_input (__main__.TestDynamicShapeModels) ...
2023-11-03 17:22:04.313597: E external/xla/xla/status_macros.cc:54] INTERNAL: RET_CHECK failure (external/xla/xla/service/dynamic_padder.cc:1989) op_support != OpDynamismSupport::kNoSupport
Dynamic input unexpectedly found for unsupported instruction: %add = f32[<=10,1]{1,0} add(f32[<=10,1]{1,0} %broadcast.2, f32[10,1]{1,0} %exponential)
```

It's likely due to the recent pin update; a probable culprit is openxla/xla@33bcc66.

The test has not actually run since the last pin update: its skip guard `not xm.get_xla_supported_devices("GPU")` now silently holds (the GPU device kind no longer matches anything), so the test has always been skipped.

Luckily, another CL is trying to revert openxla/xla@33bcc66. I'll skip this test on GPU for now and only let it run on TPU (where it succeeds).
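
For illustration, the skip could be expressed along these lines. Gating on the PJRT_DEVICE env var is an assumption here; the actual test may consult torch_xla's runtime API instead:

```python
import os
import unittest

class TestDynamicShapeModels(unittest.TestCase):

    @unittest.skipIf(
        os.environ.get("PJRT_DEVICE") in ("CUDA", "GPU", "ROCM"),
        "Broken by the dynamic_padder change in the last pin update; "
        "run on TPU only until openxla/xla@33bcc66 is reverted.")
    def test_backward_pass_with_dynamic_input(self):
        ...  # forward/backward pass over a dynamically shaped input
```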

@vanbasten23 (Collaborator, Author)

Thanks for the review!

vanbasten23 merged commit f01cdb6 into master on Nov 4, 2023.
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024

Each of these carried the same change (…h#5754):
* update doc to use PJRT_DEVICE=CUDA instead of PJRT_DEVICE=GPU
* add warning message.
* fix comment and test failure.
* skip dynamic shape model test on cuda.
