- Notifications
You must be signed in to change notification settings - Fork 25.7k
Correctly update SLM stats with master shutdown #134152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes SLM (Snapshot Lifecycle Management) stats accuracy by correctly handling successful snapshots that were not recorded due to master node shutdown. Previously, SLM would incorrectly mark successful snapshots as failures when recovering from master shutdown.
Key changes:
- Pre-load completed snapshot info from repository to determine actual snapshot status (success/failure)
- Update SLM's cluster state update job to use actual snapshot status for accurate stats calculation
- Add comprehensive tests for master shutdown scenarios
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| SnapshotLifecycleTask.java | Core logic changes to fetch and use actual snapshot status when updating SLM stats |
| SnapshotLifecycleTaskTests.java | Updated test methods to accommodate new snapshot info parameter and added test helper methods |
| SLMSnapshotBlockingIntegTests.java | Added integration tests for master shutdown scenarios with successful and deleted snapshots |
| RegisteredPolicySnapshots.java | Updated comment to reflect new behavior of using actual snapshot status |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Outdated Show resolved Hide resolved
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Show resolved Hide resolved
…apshotLifecycleTask.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| findCompletedRegisteredSnapshotInfo(currentProjectState, policyId, client, new ActionListener<>() { | ||
| @Override | ||
| public void onResponse(List<SnapshotInfo> snapshotInfo) { | ||
| submitUnbatchedTask( | ||
| clusterService, | ||
| "slm-record-success-" + policyId, | ||
| WriteJobStatus.success(projectId, policyId, snapshotId, snapshotStartTime, timestamp, snapshotInfo) | ||
| ); | ||
| } | ||
| | ||
| @Override | ||
| public void onFailure(Exception e) { | ||
| logger.warn(() -> format("failed to retrieve stale registered snapshots for job [%s]", jobId), e); | ||
| // still record the successful snapshot | ||
| submitUnbatchedTask( | ||
| clusterService, | ||
| "slm-record-success-" + policyId, | ||
| WriteJobStatus.success( | ||
| projectId, | ||
| policyId, | ||
| snapshotId, | ||
| snapshotStartTime, | ||
| timestamp, | ||
| Collections.emptyList() | ||
| ) | ||
| ); | ||
| } | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a little bit of duplication, maybe there's a more elegant way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the duplication is too bad here, duplicating it once is fine.
| Pinging @elastic/es-data-management (Team:Data Management) |
| Hi @samxbr, I've created a changelog YAML for you. |
dakrone left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this Sam! I left a few minor comments, nothing major.
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Outdated Show resolved Hide resolved
| findCompletedRegisteredSnapshotInfo(currentProjectState, policyId, client, new ActionListener<>() { | ||
| @Override | ||
| public void onResponse(List<SnapshotInfo> snapshotInfo) { | ||
| submitUnbatchedTask( | ||
| clusterService, | ||
| "slm-record-success-" + policyId, | ||
| WriteJobStatus.success(projectId, policyId, snapshotId, snapshotStartTime, timestamp, snapshotInfo) | ||
| ); | ||
| } | ||
| | ||
| @Override | ||
| public void onFailure(Exception e) { | ||
| logger.warn(() -> format("failed to retrieve stale registered snapshots for job [%s]", jobId), e); | ||
| // still record the successful snapshot | ||
| submitUnbatchedTask( | ||
| clusterService, | ||
| "slm-record-success-" + policyId, | ||
| WriteJobStatus.success( | ||
| projectId, | ||
| policyId, | ||
| snapshotId, | ||
| snapshotStartTime, | ||
| timestamp, | ||
| Collections.emptyList() | ||
| ) | ||
| ); | ||
| } | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the duplication is too bad here, duplicating it once is fine.
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Show resolved Hide resolved
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Outdated Show resolved Hide resolved
…apshotLifecycleTask.java Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
This change makes SLM stats more accurate by pre-loading the completed snapshot info from repository, and pass them to SLM's cluster state update job. The snapshot info is then used to calculate the correct stats for previous SLM snapshots.
Background
SLM runs on master node, it case of the master node being shut down in the middle of a SLM run, the current SLM execution will fail. Depending on the timing, if the master node is shut down after a SLM triggered snapshot was completed but before SLM updates cluster state, the state update will be lost. To recover the lost update, SLM keeps track of a list of registered snapshots so it can detect the above case and retroactively update the SLM state. However SLM did not check the status of the snapshot, it assumes the snapshot failed, resulting SLM incorrectly recording successful snapshots as failure. This change corrects the stats in such case by taking snapshot status (success/failure) into account.