quick refactor on _get_group_assignment #5318

JackCaoG · 2023-07-18T18:21:19Z

No description provided.

* Update inline style code to multiline (#5291)
* Fix typo in _test.yml (#5172) s/metadtaa/metadata/
* [SPMD][Virtual Device]All tensors should be in SPMD:0 C++ device (#5284)
* Move all tensors to SPMD:0 C++ device under spmd context
* fix load shards
* fix test_mark_sharding_2d by not creating placeholder for virtual device
* fix the waitdeviceop for spmd case
* Fix test_shard_hashing
* fix spmd device casting issue
* remove hacks in test_xla_virtual_device.py
* add test for new virtual device usage
* fix review comments
* fix IsTpuDevice
* linter
* Revert pr #2682 (#5215)
* Make README more actionable (#5262)
* Make README more actionable
* move profiling guide link
* text wrapping
* [SPMD] Use xs.Mesh in test_2d_tensor_3d_mesh (#5295)
* use mesh in test_2d_tensor_3d_mesh
* remove attributes patch
* [SPMD] Add FSDP sharding for test_train_spmd_linear_model.py (#5299)
  Summary: This diff adds FSDP sharding for test_train_spmd_linear_model.py.
  Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_train_spmd_linear_model.py --sharding fsdp
* [SPMD] Avoid recompilations in xs.mark_sharding() (#5300)
  Summary: This pull requests fixes the recompilation issue in xs.mark_sharding(). xtensor->GetXlaData() will compile the program if xtensor is an IR in order to get the BackendData. I believe this is not intended given the error message below suggests only data type xtensors are supported.
  Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py
* [SPMD] Support mark_sharding on IRs (#5301)
  Summary: This pull requests fixes the recompilation issue in xs.mark_sharding(). xtensor->GetXlaData() will compile the program if xtensor is an IR in order to get the BackendData. I believe this is not intended given the error message below suggests only data type xtensors are supported.
  Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py
* [SPMD] Allow dumping post optimizations hlo (#5302)
  Summary: This pull request partial reverts the change in #5266 to re-enble dumping post optimizations hlo.
  Test Plan: XLA_USE_SPMD=1 PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py -v -k test_xla_sharded_hlo_dump_post_optimizations
* Add `_sharded_cpu_state_dict` for distributed checkpointing (#5288)
* initiak commit
* Add test workflow for `xrt` branch (#5241)
* Add test workflow for `xrt` branch
* Only run for PRs targeting XRT branch
* Add function to generate stablehlo based callable from pytorch model (#5216)
* Add function to generate stablehlo based callable from pytorch model
  Added function `torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`. This function will take a pytorch Module and convert it into stablehlo bytecode.
* Only run the main CI workflow on PRs targeting master and release branches (#5244)
* Only run main CI for master and release branches.
* Disabling XRT tests on main CI
* AMP for TPUs v3 (#5161)
* remove duplicate autocast_test (#5246)
* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)
* Install `expecttest` in xla_test_job.yaml (#5252)
* Add IAM roles for cloudbuild_editors (#5251)
* [Functionalization] Remove view in view_symint (#5231)
* [Functionalization] Remove view in view_symint
  Summary: This pull request removes views in tensor_method::view_symint.
  Test Plan: XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
* Fix linters
* fixed the test
* ran the linter
---------
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
* Delete XRT from the main branch (#5240)
* Delete XRT from the main branch
* Remove dead import
* formatting
* Remove disable_xrt build option
* Fix runtime init
* Revert "Remove disable_xrt build option"
  This reverts commit ba312e7.
* Add disable XRT option back
* formatting
* Prune mesh service
* Remove obsolete test
* Remove other run server script
* Remove XRT config
* Update PJRT default device test
* Add a file I forgot to save
* if using_pjrt -> @requires_pjrt
* Remove irrelevant test case
* Remove XRT env vars
* fix md link
* formatting
* Remove extra `requires_pjrt`
* merge conflicts
* Add other autocast back
* Add nightly build for cuda 12 (#5253)
* Fix the linter command in the CI (#5254)
* fix linter command
* ran linter
* Jack cao g/fix spmd buff is null (#5256)
* Fix that non-tensor scalar can't be handled by virtual device
* add test
* comment
* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)
* Skip calling as_strided in empty_strided_symint.
* only return empty_symint conditionally.
* add a comment
* Add XRT nightly builds (#5261)
* Add XRT nightly builds
* remove space
* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)
  PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09
* Add ToString method for both PjrtData and PjrtShardedData (#5265)
* Add ToString method for both PjrtData and PjrtShardedData
* on cpu same config will become replicated, dont't check actual op sharding type
* Update Sharded graph HLO dumping (#5266)
* Enable PjRt Client Compilation with StableHLO (#5233)
* Enable xla PjRt client compilation with StableHLO
* add XLA_STABLEHLO_COMPILE to configuration.yaml
* fix merge conflict
* dummy commit to trigger ci
* Revert "dummy commit to trigger ci"
  This reverts commit f7aec23.
* Disable Bazel remote cache for forked PR (#5259)
* disable bazel remote cache if gcloud key is empty
* remove remote cache from setup.py
* experiment with debug msg
* fix flag
* add more logs
* skip remote chache if credential file is empty
* add comment
* add logs
* add check in test and coverage script
* fix condition in coverage test
* advance branch pr
* allow remote cache if gloud file isn't specified explicitly
* remove dummy comment
* Suppress debug symbols in OpenXLA code (#5269)
* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)
* Make TPU detection more robust (#5271)
* Clean bazel stuff on distutils clean. (#5274)
* Clean bazel stuff on distutils clean
* Fix python formatting
* Delete unused .so file, and .lds files (#5275)
* [OpenXLA] Delete unused .so file and .lds files
* Fix the error when export_torch_model is given a non-tensor (#5277)
  However the generated StableHLO graph still hardcodes the non-tensor value. this is not correct, will fix later.
* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)
* Always do build_ext in python setup.py develop (#5273)
  Bazel should figure out that _XLAC.so is current or not, and trigger rebuild if any cpp files changed.
* Remove or improve several hardcoded TPU test conditions (#5272)
* Remove or improve several hardcoded TPU test conditions
* Fix test condition
* Add `runtime.host_index` (#5283)
* Make it an error if calling sizes() on a dynamic tensor. (#4998)
* Err if calling sizes() on dynamic tensor
* try to set has_symbolic_sizes_strides_
* resolve merge conflict
* enable CONTINUE_ON_ERROR
* fixed the python test test_SizeEq_should_not_compile_for_identical_symints
* fix test_index_types
* set CONTINUE_ON_ERROR to true
* remove some unwanted code.
* add a print
* directly set has_symbolic_sizes_strides_ = true
* make some fixes.
* fix empty_strided_symint
* ran linter
* change error type in the test.
* fix comments
* ran linter
* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)
* Fix the error where mark_step does not materalize tensors on SPMD:0
* typo
* fix test_non_tensor_scalar
* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)
* Set torch._dynamo.config.automatic_dynamic_shapes to False
* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape
* run linter
* wrap only if sharding type is non-replicated
* Handle non-tensors
* run linter
* Call wrap_if_sharded first
* Add exception in test for unsharded tensor
* fix test
* Use torch.Tensor instead of torch.tensor
* use .cpu() only for tensors
---------
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
* Supoort unordered sharding spec correctly (#5305)
* Supoort non-ordered sharding spec correctly
* use permute instead of transpose
* use dim > 2 to suit TPU v3(otherwise can't be divide evenly)
* Support unordered sharding spec for partial replication (#5316)
* Suport unordered sharding spec for partial replication
* add 4d test
* handle 2d tensor with 2d mesh case
* refactoring
* Fix mismatched GPU docker image in the doc. (#5319)
* quick refactor on _get_group_assignment (#5318)
* Add tf independent serialization (#5308)
  Create a serialization format for StableHLO graphs and weights without tf.saved_model Need to not use tensorflow because tensorflow is no longer dependency of pytorch/xla. Information saved are enough to reconstruct the tf.saved_model for serving. Information stored:
* metadata on which tensor maps which input position
* StableHLO version number
* metadata on which tensor corresponds to user input or parameter
* metadata on shape and dtype of each tensor.
* Tensors themselves are saved as numpy arrays using np.save.
* Disable coverage for now (#5321)
* Enable Some input output aliasing under SPMD (#5320)
* Use `_sharded_cpu_state_dict` functionality to Write Items for SPMD Save Planner (#5315)
* initial commit
* add suggested changes
* add unit test
* fix test
* fix test
* add suggested changes
* remove is_sharded_tensor check
* check if device type is xla in `wrap_if_sharded`
* change order
* update resolve_data and add more tests
* run linter
* use subtest
* formatting fixes
* run linter
* handle single tensor for method send_to_device_single (#5317)
* handle single tensor for method send_to_device_single
* fix broadcast parameter
---------
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Yash Shah <55116947+yashs97@users.noreply.github.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
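
The "Add tf independent serialization (#5308)" entry in the commit message above lists what the format stores: input positions, a StableHLO version number, whether each tensor is a user input or a parameter, shape and dtype, and the tensors themselves written with `np.save`. A minimal sketch of that idea follows, assuming a JSON metadata file next to one `.npy` file per tensor. The function names, file names, and JSON keys are hypothetical and are not taken from the torch_xla implementation.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_bundle(out_dir, tensors, input_positions, user_inputs, stablehlo_version):
    """Write each tensor with np.save and describe it in metadata.json.

    Hypothetical layout for illustration only; not the torch_xla on-disk format.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    meta = {"stablehlo_version": stablehlo_version, "tensors": []}
    for name, array in tensors.items():
        np.save(out_dir / f"{name}.npy", array)
        meta["tensors"].append({
            "name": name,
            "input_position": input_positions[name],
            "kind": "user_input" if name in user_inputs else "parameter",
            "shape": list(array.shape),
            "dtype": str(array.dtype),
        })
    (out_dir / "metadata.json").write_text(json.dumps(meta, indent=2))


def load_bundle(in_dir):
    """Read metadata.json and load every tensor it lists."""
    in_dir = Path(in_dir)
    meta = json.loads((in_dir / "metadata.json").read_text())
    tensors = {
        entry["name"]: np.load(in_dir / f"{entry['name']}.npy")
        for entry in meta["tensors"]
    }
    return meta, tensors
```

Keeping the metadata in a plain text file separate from the arrays means a consumer can inspect shapes, dtypes, and input ordering without loading any weights.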

quick refactor on _get_group_assignment

4fac7ce

JackCaoG requested a review from jonb377 July 18, 2023 18:21

jonb377 approved these changes Jul 18, 2023

JackCaoG merged commit a8f723b into master Jul 18, 2023

khatwanimohit pushed a commit that referenced this pull request Jul 20, 2023

quick refactor on _get_group_assignment (#5318)

9e32619

JackCaoG added a commit that referenced this pull request Jul 21, 2023

quick refactor on _get_group_assignment (#5318)

37b8518

3 participants