PyTorch/XLA usability progress tracking #7739

@zpcore

Description

This issue tracks progress on improving the usability of PyTorch/XLA.

Action items for APIs

Below are the APIs we plan to clean up to improve usability. The list is mostly based on the doc from @will-cromar.

| APIs | Actions | Complete Date | Related issues / Design / PR |
|---|---|---|---|
| xla_model.parse_xla_device() | internalize | 2024-07-18 | #7675 |
| torch_xla.launch() | introduce new API | 2024-07-12 | Improve multiprocess with torch_xla.launch() #7648 |
| xla_model.xrt_world_size() | deprecate in favor of runtime.world_size(); remove defval argument | 2024-07-24 | #7679 |
| xla_model.get_ordinal() | deprecate in favor of runtime.global_ordinal() | 2024-07-24 | #7679 |
| xla_model.get_local_ordinal() | deprecate in favor of runtime.local_ordinal() | 2024-07-24 | #7679 |
| using_pjrt() | deprecate | | #7730 |
| requires_pjrt() | deprecate | | #7730 |
| xla_real_devices | deprecate | | |
| xla_device_hw | deprecate | | |
| xla_replication_devices | replace | | |
| set_replication | replace | | |
| unlazy | delete | | |
| RateTracker | internalize | | |
| ToXlaTensorArena | internalize | | |
| check_view_sharing | delete | | |
| add_step_closure | replace | | |
| mark_step | replace | | #6751 |
| reduce_gradients | internalize | | |
| optimizer_step | replace? | | |
| save | replace/upstream | | |
| xla_rendezvous | deprecate in favor of torch.distributed | | |
| rendezvous | deprecate in favor of torch.distributed | | |
| do_on_ordinals | delete | | |
| mesh_reduce | replace and implement torch.distributed.all_gather_object | | |
| set_rng_state | delete | | |
| get_rng_state | delete | | |
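
For the "deprecate in favor of" rows, one possible shape for the transition is an alias that warns and then forwards to the new function. This is only a sketch: `deprecated_alias` is a hypothetical helper and `world_size` here is a stand-in, not the actual torch_xla implementation.

```python
import functools
import warnings


def deprecated_alias(old_name, new_func):
    """Return a wrapper that warns about `old_name`, then calls `new_func`."""
    new_name = new_func.__qualname__

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return new_func(*args, **kwargs)

    return wrapper


def world_size():
    """Stand-in for the new runtime API (assumption, always returns 1)."""
    return 1


# The old name keeps working but emits a DeprecationWarning on each call.
xrt_world_size = deprecated_alias("xrt_world_size", world_size)
```

Forwarding through a wrapper keeps the old and new names behaviorally identical during the deprecation window, so removing the alias later is a one-line change.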

Action items for integration between PyTorch/XLA and cloud infra

Distributed Training

How to handle distributed checkpointing

  • How to resume distributed training
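
As a starting point for the resume question, a rough sketch of picking the newest fully written checkpoint is below. The `step_<N>` directory layout and the `COMPLETE` marker file are assumptions made for illustration, not an existing PyTorch/XLA convention.

```python
import re
from pathlib import Path

# Hypothetical layout: <root>/step_<N>/ with a COMPLETE marker file that is
# written only after every shard of that step has been saved.
_STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_complete_step(ckpt_root):
    """Return the highest step number with a COMPLETE marker, or None."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return None
    steps = []
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if match and child.is_dir() and (child / "COMPLETE").exists():
            steps.append(int(match.group(1)))
    return max(steps, default=None)
```

Skipping directories without the marker means a job preempted in the middle of a save resumes from the previous complete step instead of a partial one.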

Integrating with GCP (especially preemptible resources) makes usability worse

  • GCP persistent disk operation and storage for large models/datasets with sharding
  • Retrieve logs from all workers, including the failing worker.
  • Stability: there is a high chance that one of the workers will stop working

Debug and logging

  • Node can fail without any notifications
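
One way to surface silent node failures is a heartbeat check on the coordinator side. The sketch below is illustrative only: `HeartbeatMonitor` is a hypothetical class, and how heartbeats travel between hosts is left out.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per worker and report workers that went quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen = {}

    def beat(self, worker_id):
        """Record that `worker_id` is alive right now."""
        self._last_seen[worker_id] = self._clock()

    def stale_workers(self):
        """Return sorted ids of workers silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            worker
            for worker, seen in self._last_seen.items()
            if now - seen > self._timeout_s
        )
```

A coordinator could poll `stale_workers()` periodically and raise an alert or collect that worker's logs when the list is non-empty.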

Tutorial support

  • Check why the FSDP/SPMD tutorial is not working.
  • Incorrect use of FSDP is hard to spot.

Freshness of documentation

  • Not only the documentation on our site, but also third-party documentation such as Hugging Face's

Debugging

  • Support notifications for dynamic graphs

Metadata

Labels

usability: Bugs/features related to improving the usability of PyTorch/XLA