Support PreemptionSyncManager in XlaCoordinator #5733
Conversation
LGTM!
```cpp
}

void DistributedRuntime::ActivatePreemptionSyncManager() {
  if (preemption_sync_manager_ == nullptr) {
```
Is there any harm in initializing the PreemptionSyncManager when you initialize the xla::DistributedRuntimeService and Client? In general, I try to avoid cases where you "partially" construct an object and leave room for bugs later (like calling ReachedSyncPoint before ActivatePreemptionSyncManager).
I was hesitant to do that, since it will register a SIGTERM handler, which will cause any intentional SIGTERMs to be ignored. Open to revisiting; let me know which approach you think makes more sense!
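For context, here is a minimal sketch of the lazy-activation pattern under discussion, based on the snippet quoted above. The dist_runtime_client_ member, the GetCoordinationServiceAgent call, and the check macros are assumptions about how the pieces fit together, not a quote of the actual change:

```cpp
// Sketch only: activation is deferred to an explicit call so the SIGTERM
// handler is registered only when the caller opts in.
void DistributedRuntime::ActivatePreemptionSyncManager() {
  if (preemption_sync_manager_ == nullptr) {
    preemption_sync_manager_ = tsl::CreatePreemptionSyncManager();
    // Assumption: the coordination service agent is obtained from the
    // distributed runtime client and handed to Initialize().
    auto agent = dist_runtime_client_->GetCoordinationServiceAgent();
    XLA_CHECK_OK(agent.status());
    XLA_CHECK_OK(preemption_sync_manager_->Initialize(agent.value()));
  }
}
```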
```cpp
// DistributedRuntime serves as the point of entry for all operations which
// require the XLA distributed runtime, such as preemption coordination.
class DistributedRuntime {
```
I really dislike the naming choice for the upstream xla::DistributedRuntime, since it's not actually a distributed runtime. Since this class is becoming more than just a wrapper around xla::DistributedRuntimeService and xla::DistributedRuntimeClient, what do you think of changing the name to something more intuitive? e.g. XlaCoordinator
Totally agree, XlaCoordinator it is! I'll update the pybinds as well.
I left this as-is for now to keep the change minimal; we can revisit in the upcoming refactor.
```cpp
// The PreemptionSyncManager must be activated within the DistributedRuntime.
// Returns true when the input step has been identified as a sync point, and
// false otherwise.
bool ReachedSyncPoint(int step);
```
Do you think it makes more sense to expose the tsl::PreemptionSyncManager directly as we do with the xla::DistributedRuntimeClient? Or do we want to restrict access to the underlying object?
I considered that, but if the PreemptionSyncManager outlives the DistributedRuntimeClient, the program will segfault... 😢 I figured it's better to keep it hidden to avoid that edge case.
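To illustrate the trade-off, the wrapper could delegate to the owned manager instead of handing out a pointer to it. Only the ReachedSyncPoint declaration comes from the diff; the body below is an assumed sketch:

```cpp
// Sketch: callers never hold a raw PreemptionSyncManager pointer, so it
// cannot outlive the DistributedRuntimeClient it was initialized with.
bool DistributedRuntime::ReachedSyncPoint(int step) {
  XLA_CHECK(preemption_sync_manager_ != nullptr)
      << "ActivatePreemptionSyncManager must be called before ReachedSyncPoint";
  return preemption_sync_manager_->ReachedSyncPoint(step);
}
```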
Thanks for the review @vanbasten23 and @will-cromar! I'll update to address the feedback.
Oh, one more thing: could you help check whether we have a test verifying that the distributed runtime service is always torn down? I'd imagine if we comment out the line …
@vanbasten23 @will-cromar I've updated to have the ComputationClient own the XlaCoordinator. Please take a second look when you get a chance!
Overall LGTM.
FYI, our style guide cautions against forward declarations of entities in other projects, even if it saves compile time: https://google.github.io/styleguide/cppguide.html#Forward_Declarations
If you forward-declared the DistributedRuntime classes to unravel e.g. a macro conflict or circular dependency, please leave a comment explaining why.
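For example, a comment along these lines would satisfy the style-guide caveat (the namespace nesting and wording here are hypothetical):

```cpp
// Forward declaration instead of #include: the XlaCoordinator header
// transitively pulls in PJRT headers whose logging macros conflict with ours.
namespace torch_xla {
namespace runtime {
class XlaCoordinator;
}  // namespace runtime
}  // namespace torch_xla
```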
```cpp
virtual void WaitDeviceOps(const std::vector<std::string>& devices) = 0;

// Check whether the XlaCoordinator has been initialized.
virtual bool CoordinatorInitialized() const = 0;
```
Do these need to be virtual? It looks like the implementations below don't depend on the underlying runtime client.
The XlaCoordinator depends on PJRT, so I kept it separate. Though I guess that's not a strong justification...
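A rough sketch of the split being discussed, with the interface declaring the hook and a PJRT-backed client implementing it; the PjRtComputationClient name and coordinator_ member are assumptions for illustration:

```cpp
#include <memory>

// Placeholder for the real class declared elsewhere in the runtime.
class XlaCoordinator {};

// Base interface declares the coordinator hook as pure virtual, as quoted
// in the diff above.
class ComputationClient {
 public:
  virtual ~ComputationClient() = default;
  // Check whether the XlaCoordinator has been initialized.
  virtual bool CoordinatorInitialized() const = 0;
};

// Assumed name for the PJRT-backed implementation that owns the coordinator.
class PjRtComputationClient : public ComputationClient {
 public:
  bool CoordinatorInitialized() const override {
    return coordinator_ != nullptr;
  }

 private:
  std::unique_ptr<XlaCoordinator> coordinator_;
};
```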
```cpp
// Forward declare XlaCoordinator to avoid logging macro redefinition from the
// transitively included PJRT header.
// TODO(jonbolin): We need a way to ensure the right macros are included
```
logging macros are cursed
I guess I had a hard time imagining how this is going to be incorporated into the checkpoint manager. @jonb377 Can you point me to some examples?
@alanwaketan CheckpointManager will initialize the PreemptionSyncManager on construction and call into …
@alanwaketan See 5fdce13 for the intended usage.
I see, that makes sense.
LGTM.
* Support PreemptionSyncManager in DistributedRuntime
* Refactor to be owned by ComputationClient
* Clean up logging macro issue handling
To support autocheckpointing upon preemption, we need to access a PreemptionSyncManager to identify sync points when a preemption has occurred.
This change additionally refactors the DistributedRuntime to be owned by the ComputationClient, since in the GPU case the ComputationClient has a direct dependency on the DistributedRuntimeClient.
This change adds the PreemptionSyncManager to the new XlaCoordinator class. The PreemptionSyncManager has the side effect of registering a SIGTERM handler, so it is not enabled by default.
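Putting the pieces from the review together, the coordinator's public surface would look roughly like the header-style sketch below. The method names come from the diffs quoted above; the constructor signature and member names are assumptions:

```cpp
#include <memory>
#include <string>

namespace tsl {
class PreemptionSyncManager;
}
namespace xla {
class DistributedRuntimeClient;
class DistributedRuntimeService;
}

// Sketch of the coordinator: it wraps the upstream distributed runtime
// service/client and lazily owns a PreemptionSyncManager, which is only
// created on request because activation registers a SIGTERM handler.
class XlaCoordinator {
 public:
  // Constructor parameters are an assumption for illustration.
  XlaCoordinator(int global_rank, int world_size, std::string master_addr,
                 std::string port);
  ~XlaCoordinator();

  // Creates the PreemptionSyncManager and registers the SIGTERM handler.
  void ActivatePreemptionSyncManager();

  // Returns true when `step` has been identified as a sync point; requires
  // ActivatePreemptionSyncManager to have been called first.
  bool ReachedSyncPoint(int step);

 private:
  std::unique_ptr<xla::DistributedRuntimeService> dist_runtime_service_;
  std::shared_ptr<xla::DistributedRuntimeClient> dist_runtime_client_;
  std::unique_ptr<tsl::PreemptionSyncManager> preemption_sync_manager_;
};
```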