PyTorch/XLA usability progress tracking #7739

@will-cromar

Description

Tracks the status of ongoing work to improve PyTorch/XLA usability.

Action items for APIs

Below are the APIs we plan to clean up to improve usability. The list is mostly based on the doc from @will-cromar.

APIs	Actions	Complete Date	Related issues	Related Design	PR
xla_model.parse_xla_device()	internalize	2024-07-18			#7675
torch_xla.launch()	introduce new API	2024-07-12		Improve multiprocess with torch_xla.launch()	#7648
xla_model.xrt_world_size()	deprecate with runtime.world_size() , remove `defval` argument	2024-07-24			#7679
xla_model.get_ordinal()	deprecate with runtime.global_ordinal()	2024-07-24			#7679
xla_model.get_local_ordinal()	deprecate with runtime.local_ordinal()	2024-07-24			#7679
using_pjrt()	deprecate		#7730
requires_pjrt()	deprecate		#7730
xla_real_devices	deprecate
xla_device_hw	deprecate
xla_replication_devices	replace
set_replication	replace
unlazy	delete
RateTracker	internalize
ToXlaTensorArena	internalize
check_view_sharing	delete
add_step_closure	replace
mark_step	replace		#6751
reduce_gradients	internalize
optimizer_step	replace?
save	replace/upstream
xla_rendezvous	deprecate with torch.distributed
rendezvous	deprecate with torch.distributed
do_on_ordinals	delete
mesh_reduce	replace and implement torch.distributed.all_gather_object
set_rng_state	delete
get_rng_state	delete
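Several rows above follow a "deprecate with X" pattern, where the old entry point keeps working for a while but warns and forwards to its replacement. A minimal sketch of that pattern is shown here, assuming a plain decorator-based shim. The `deprecated` helper and the stand-in `world_size` function are hypothetical names for illustration and are not the PyTorch/XLA implementation.

```python
import functools
import warnings


def deprecated(replacement_name, replacement):
    """Build a decorator that warns, then forwards the call to `replacement`."""

    def decorator(old_fn):
        @functools.wraps(old_fn)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{old_fn.__name__}() is deprecated; "
                f"use {replacement_name}() instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return replacement(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical stand-in for the new runtime-level API.
def world_size():
    return 4


@deprecated("runtime.world_size", world_size)
def xrt_world_size():
    """Old entry point; its body never runs because the wrapper forwards."""
```

Calling `xrt_world_size()` here returns the value from `world_size()` and emits a `DeprecationWarning` that names the replacement. Handling a removed argument such as `defval` would need extra logic that this sketch does not show.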

Action items for integration between PyTorch/XLA and cloud infra

Distributed Training

How to handle distributed checkpoint

How to resume distributed training
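As one possible starting point for the two questions above, the sketch below shows a generic save-and-resume loop: checkpoints are written atomically, and on restart the job picks up from the newest complete one. It is an illustration under stated assumptions, not a PyTorch/XLA API. The file naming scheme (`step_<n>.json`), the JSON state format, and all function names are hypothetical, and real distributed checkpoints would also need per-shard coordination that is not covered here.

```python
import json
import os
import re
import tempfile

# Hypothetical naming scheme: one file per saved step.
CKPT_PATTERN = re.compile(r"^step_(\d+)\.json$")


def save_checkpoint(ckpt_dir, step, state):
    """Write state to a temp file, then rename it, so a preempted
    write never leaves a partial file under the final name."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
    return final_path


def latest_checkpoint(ckpt_dir):
    """Return (step, path) for the newest complete checkpoint, or None."""
    if not os.path.isdir(ckpt_dir):
        return None
    found = []
    for name in os.listdir(ckpt_dir):
        match = CKPT_PATTERN.match(name)
        if match:  # leftover *.tmp files are ignored
            found.append((int(match.group(1)), os.path.join(ckpt_dir, name)))
    return max(found) if found else None


def resume_step(ckpt_dir):
    """Return (next_step, state); start from step 0 if nothing was saved."""
    latest = latest_checkpoint(ckpt_dir)
    if latest is None:
        return 0, {}
    step, path = latest
    with open(path) as f:
        return step + 1, json.load(f)
```

Comparing steps numerically rather than by file name avoids picking `step_9` over `step_10` after a restart.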

Integration with GCP (especially preemptible resources) hurts usability

Persistent disk operations and storage on GCP for large models/datasets with sharding
Retrieve logs from all workers, including the failing worker.
Stability: high chance that one of the workers will stop working

Debug and logging

Nodes can fail without any notification
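One common way to surface silent node failures is a heartbeat check: each worker reports in periodically, and a monitor flags any worker that has been quiet for too long. The sketch below illustrates the idea only. The `HeartbeatMonitor` class and its methods are hypothetical and not part of PyTorch/XLA, and it leaves open how heartbeats are transported between hosts.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time each worker reported in."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}

    def beat(self, worker_id):
        """Record a heartbeat from `worker_id` at the current time."""
        self._last_seen[worker_id] = self._clock()

    def stale_workers(self):
        """Return the sorted IDs of workers silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            worker
            for worker, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

A coordinator could poll `stale_workers()` on a timer and raise an alert or collect logs for any worker it returns.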

Tutorial support

Check why the FSDP/SPMD tutorial is not working.
Incorrect use of FSDP is hard to spot.

Freshness of documentation

Not only the documentation on our site, but also third-party documentation such as Hugging Face

Debugging

Support notifications for dynamic graphs
