Open
Labels
usability: Bugs/features related to improving the usability of PyTorch/XLA
Description
Tracks the progress of usability improvements for PyTorch/XLA.
Action items for APIs
Below are the APIs we plan to clean up to improve usability. The list is mostly based on the doc from @will-cromar.
| APIs | Actions | Complete Date | Related issues | Related Design | PR |
|---|---|---|---|---|---|
| xla_model.parse_xla_device() | internalize | 2024-07-18 | #7675 | ||
| torch_xla.launch() | introduce new API | 2024-07-12 | Improve multiprocess with torch_xla.launch() | #7648 | |
| xla_model.xrt_world_size() | deprecate in favor of runtime.world_size(), remove defval argument | 2024-07-24 | #7679 | ||
| xla_model.get_ordinal() | deprecate in favor of runtime.global_ordinal() | 2024-07-24 | #7679 | ||
| xla_model.get_local_ordinal() | deprecate in favor of runtime.local_ordinal() | 2024-07-24 | #7679 | ||
| using_pjrt() | deprecate | | #7730 | ||
| requires_pjrt() | deprecate | | #7730 | ||
| xla_real_devices | deprecate | ||||
| xla_device_hw | deprecate | ||||
| xla_replication_devices | replace | ||||
| set_replication | replace | ||||
| unlazy | delete | ||||
| RateTracker | internalize | ||||
| ToXlaTensorArena | internalize | ||||
| check_view_sharing | delete | ||||
| add_step_closure | replace | ||||
| mark_step | replace | | #6751 | ||
| reduce_gradients | internalize | ||||
| optimizer_step | replace? | ||||
| save | replace/upstream | ||||
| xla_rendezvous | deprecate in favor of torch.distributed | ||||
| rendezvous | deprecate in favor of torch.distributed | ||||
| do_on_ordinals | delete | ||||
| mesh_reduce | replace and implement torch.distributed.all_gather_object | ||||
| set_rng_state | delete | ||||
| get_rng_state | delete |
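
Several rows above use the action "deprecate in favor of" a newer function. As a rough illustration of what that action can look like in code, here is a minimal sketch in which the old name stays callable, emits a `DeprecationWarning`, and forwards to the replacement. The helper `deprecated_alias` and the local `world_size` stub are hypothetical stand-ins written for this example; they are not the actual PyTorch/XLA implementation and do not query a real runtime.

```python
import functools
import warnings


def deprecated_alias(new_fn, old_name, new_name):
    """Hypothetical helper: wrap new_fn so calls warn, then forward to it."""

    @functools.wraps(new_fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return new_fn(*args, **kwargs)

    return wrapper


# Local stand-in for the replacement API. It returns a fixed value and
# does not talk to any device runtime.
def world_size():
    return 1


# The old name is kept as a thin alias during the deprecation window.
xrt_world_size = deprecated_alias(
    world_size, "xla_model.xrt_world_size()", "runtime.world_size()")

if __name__ == "__main__":
    warnings.simplefilter("always")
    print(xrt_world_size())
```

Using `stacklevel=2` makes the warning point at the user's call site rather than at the wrapper, which tends to make migration easier because the line that needs changing is the one reported.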
Action items for integration between PyTorch/XLA and cloud infra
Distributed Training
How to handle distributed checkpoint
- How to resume distributed training
Integration with GCP (especially preemptible resources) makes usability worse
- Persistent disk GCP operations and storage for large models/datasets with sharding
- Retrieve logs from all workers, including the failing worker.
- Stability: there is a high chance that one of the workers will stop working
Debug and logging
- Nodes can fail without any notification
Tutorial support
- Check why the FSDP/SPMD tutorial is not working.
- Incorrect use of FSDP is hard to spot.
Freshness of documentation
- Not only the documentation on our site, but also third-party documentation such as Hugging Face's
Debugging
- Support notification of dynamic graphs