@jonb377
Collaborator

@jonb377 jonb377 commented Nov 1, 2023

When a preemption is detected, the CheckpointManager should automatically take a checkpoint. This change enables that behavior by querying the new _sync_point_reached APIs, introduced in #5733, in each should_save call.
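
For readers skimming the thread, the idea can be sketched roughly as follows. This is a toy illustration, not the code in this PR: `SketchCheckpointManager`, `save_interval`, `simulate`, and the injected `sync_point_reached` callable (including its one-argument signature) are all hypothetical stand-ins, and only the name `should_save` comes from the description above.

```python
from typing import Callable, List


class SketchCheckpointManager:
    """Toy stand-in for a checkpoint manager (hypothetical, not the real class)."""

    def __init__(self, save_interval: int,
                 sync_point_reached: Callable[[int], bool]) -> None:
        if save_interval <= 0:
            raise ValueError("save_interval must be positive")
        self.save_interval = save_interval
        # Hypothetical hook: returns True once a preemption has been
        # detected and all hosts agree to checkpoint at this step.
        self._sync_point_reached = sync_point_reached

    def should_save(self, step: int) -> bool:
        # Query the preemption hook on every call so a preemption notice
        # is never missed between regularly scheduled checkpoints.
        preempted = self._sync_point_reached(step)
        return preempted or step % self.save_interval == 0


def simulate(total_steps: int, save_interval: int, preempt_at: int) -> List[int]:
    """Return the steps at which the sketch manager would checkpoint."""
    manager = SketchCheckpointManager(
        save_interval=save_interval,
        # Assumption: the hook fires exactly at the preempted step.
        sync_point_reached=lambda step: step == preempt_at,
    )
    return [step for step in range(total_steps) if manager.should_save(step)]
```

With an interval of 5 and a simulated preemption at step 7, `simulate(10, 5, 7)` would checkpoint at the two scheduled steps plus the preempted one. Injecting the hook as a callable is only a convenience for the sketch, so the behavior can be exercised without a real runtime.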

@jonb377 jonb377 requested a review from alanwaketan November 1, 2023 19:06
@jonb377 jonb377 self-assigned this Nov 1, 2023
@jonb377 jonb377 force-pushed the jonbolin/autochkpt branch from cf4861f to d4c757a Compare November 6, 2023 03:59
@jonb377 jonb377 requested a review from yeounoh November 7, 2023 01:28
Contributor

@yeounoh yeounoh left a comment

LGTM

Collaborator

@alanwaketan alanwaketan left a comment

LGTM.

@jonb377
Collaborator Author

jonb377 commented Nov 7, 2023

Thanks @alanwaketan and @yeounoh! I'll merge after TPU CI.

@jonb377 jonb377 merged commit 4664380 into master Nov 8, 2023
@jonb377 jonb377 deleted the jonbolin/autochkpt branch November 8, 2023 04:50
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024