Skip to content

Conversation

@samxbr
Copy link
Contributor

@samxbr samxbr commented Sep 4, 2025

This change makes SLM stats more accurate by pre-loading the completed snapshot info from repository, and pass them to SLM's cluster state update job. The snapshot info is then used to calculate the correct stats for previous SLM snapshots.

Background

SLM runs on master node, it case of the master node being shut down in the middle of a SLM run, the current SLM execution will fail. Depending on the timing, if the master node is shut down after a SLM triggered snapshot was completed but before SLM updates cluster state, the state update will be lost. To recover the lost update, SLM keeps track of a list of registered snapshots so it can detect the above case and retroactively update the SLM state. However SLM did not check the status of the snapshot, it assumes the snapshot failed, resulting SLM incorrectly recording successful snapshots as failure. This change corrects the stats in such case by taking snapshot status (success/failure) into account.

@samxbr samxbr added :Data Management/ILM+SLM Index and Snapshot lifecycle management v9.1.0 v9.0.7 v9.1.4 auto-backport Automatically create backport pull requests when merged and removed v9.1.0 auto-backport Automatically create backport pull requests when merged v9.1.4 v9.0.7 labels Sep 4, 2025
@samxbr samxbr requested a review from Copilot September 4, 2025 16:51
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes SLM (Snapshot Lifecycle Management) stats accuracy by correctly handling successful snapshots that were not recorded due to master node shutdown. Previously, SLM would incorrectly mark successful snapshots as failures when recovering from master shutdown.

Key changes:

  • Pre-load completed snapshot info from repository to determine actual snapshot status (success/failure)
  • Update SLM's cluster state update job to use actual snapshot status for accurate stats calculation
  • Add comprehensive tests for master shutdown scenarios

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
SnapshotLifecycleTask.java Core logic changes to fetch and use actual snapshot status when updating SLM stats
SnapshotLifecycleTaskTests.java Updated test methods to accommodate new snapshot info parameter and added test helper methods
SLMSnapshotBlockingIntegTests.java Added integration tests for master shutdown scenarios with successful and deleted snapshots
RegisteredPolicySnapshots.java Updated comment to reflect new behavior of using actual snapshot status

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

…apshotLifecycleTask.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@samxbr samxbr requested a review from dakrone September 4, 2025 16:59
Comment on lines +217 to +244
findCompletedRegisteredSnapshotInfo(currentProjectState, policyId, client, new ActionListener<>() {
@Override
public void onResponse(List<SnapshotInfo> snapshotInfo) {
submitUnbatchedTask(
clusterService,
"slm-record-success-" + policyId,
WriteJobStatus.success(projectId, policyId, snapshotId, snapshotStartTime, timestamp, snapshotInfo)
);
}

@Override
public void onFailure(Exception e) {
logger.warn(() -> format("failed to retrieve stale registered snapshots for job [%s]", jobId), e);
// still record the successful snapshot
submitUnbatchedTask(
clusterService,
"slm-record-success-" + policyId,
WriteJobStatus.success(
projectId,
policyId,
snapshotId,
snapshotStartTime,
timestamp,
Collections.emptyList()
)
);
}
});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a little bit of duplication, maybe there's a more elegant way to do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the duplication is too bad here, duplicating it once is fine.

@samxbr samxbr marked this pull request as ready for review September 4, 2025 17:01
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@samxbr samxbr added the >bug label Sep 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @samxbr, I've created a changelog YAML for you.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this Sam! I left a few minor comments, nothing major.

Comment on lines +217 to +244
findCompletedRegisteredSnapshotInfo(currentProjectState, policyId, client, new ActionListener<>() {
@Override
public void onResponse(List<SnapshotInfo> snapshotInfo) {
submitUnbatchedTask(
clusterService,
"slm-record-success-" + policyId,
WriteJobStatus.success(projectId, policyId, snapshotId, snapshotStartTime, timestamp, snapshotInfo)
);
}

@Override
public void onFailure(Exception e) {
logger.warn(() -> format("failed to retrieve stale registered snapshots for job [%s]", jobId), e);
// still record the successful snapshot
submitUnbatchedTask(
clusterService,
"slm-record-success-" + policyId,
WriteJobStatus.success(
projectId,
policyId,
snapshotId,
snapshotStartTime,
timestamp,
Collections.emptyList()
)
);
}
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the duplication is too bad here, duplicating it once is fine.

@samxbr samxbr added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Sep 15, 2025
@samxbr samxbr enabled auto-merge (squash) September 15, 2025 20:45
@samxbr samxbr merged commit aea67ba into elastic:main Sep 16, 2025
34 checks passed
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v9.2.0

3 participants