Correctly update SLM stats with master shutdown #134152

samxbr · 2025-09-04T15:29:41Z

This change makes SLM stats more accurate by pre-loading the completed snapshot info from the repository and passing it to SLM's cluster state update job. The snapshot info is then used to calculate the correct stats for previous SLM snapshots.

Background

SLM runs on the master node. If the master node is shut down in the middle of an SLM run, the current SLM execution fails. Depending on the timing, if the master node is shut down after an SLM-triggered snapshot has completed but before SLM updates the cluster state, the state update is lost. To recover the lost update, SLM keeps track of a list of registered snapshots so that it can detect this case and retroactively update the SLM state. However, SLM did not check the status of the snapshot; it assumed the snapshot had failed, so successful snapshots were incorrectly recorded as failures. This change corrects the stats in that case by taking the snapshot status (success/failure) into account.
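The reconciliation described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `SlmStatsSketch`, `State`, `Delta`, and `reconcile` are hypothetical names, and the decision to count a registered snapshot with no retrievable info as "unknown" is an assumption made only for the sketch.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the stats reconciliation; not the Elasticsearch implementation. */
final class SlmStatsSketch {

    /** Simplified final state of a snapshot (assumption: only two outcomes matter here). */
    enum State { SUCCESS, FAILED }

    /** Counters to add to a policy's stats after reconciling stale registered snapshots. */
    static final class Delta {
        final int succeeded;
        final int failed;
        final int unknown;

        Delta(int succeeded, int failed, int unknown) {
            this.succeeded = succeeded;
            this.failed = failed;
            this.unknown = unknown;
        }
    }

    /**
     * Walks the registered snapshots of a policy and classifies every one except
     * the current snapshot by the state pre-loaded from the repository,
     * instead of assuming that every stale registered snapshot failed.
     */
    static Delta reconcile(List<String> registered, String current, Map<String, State> completed) {
        int succeeded = 0;
        int failed = 0;
        int unknown = 0;
        for (String name : registered) {
            if (name.equals(current)) {
                continue; // the current snapshot is recorded through the normal path
            }
            State state = completed.get(name);
            if (state == null) {
                unknown++; // no info available (assumption: tracked separately in this sketch)
            } else if (state == State.SUCCESS) {
                succeeded++;
            } else {
                failed++;
            }
        }
        return new Delta(succeeded, failed, unknown);
    }

    public static void main(String[] args) {
        Delta delta = reconcile(
            List.of("snap-1", "snap-2", "snap-3", "snap-4"),
            "snap-4",
            Map.of("snap-1", State.SUCCESS, "snap-2", State.FAILED)
        );
        System.out.println(delta.succeeded + " " + delta.failed + " " + delta.unknown);
    }
}
```

The point of the sketch is only that the outcome is looked up per snapshot before the stats are updated, so a snapshot that completed just before the master went down is counted as a success.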

Copilot

Pull Request Overview

This PR fixes SLM (Snapshot Lifecycle Management) stats accuracy by correctly handling successful snapshots that were not recorded due to master node shutdown. Previously, SLM would incorrectly mark successful snapshots as failures when recovering from master shutdown.

Key changes:

Pre-load completed snapshot info from repository to determine actual snapshot status (success/failure)
Update SLM's cluster state update job to use actual snapshot status for accurate stats calculation
Add comprehensive tests for master shutdown scenarios

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
SnapshotLifecycleTask.java	Core logic changes to fetch and use actual snapshot status when updating SLM stats
SnapshotLifecycleTaskTests.java	Updated test methods to accommodate new snapshot info parameter and added test helper methods
SLMSnapshotBlockingIntegTests.java	Added integration tests for master shutdown scenarios with successful and deleted snapshots
RegisteredPolicySnapshots.java	Updated comment to reflect new behavior of using actual snapshot status

samxbr · 2025-09-04T17:01:18Z

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java

+ findCompletedRegisteredSnapshotInfo(currentProjectState, policyId, client, new ActionListener<>() {
+     @Override
+     public void onResponse(List<SnapshotInfo> snapshotInfo) {
+         submitUnbatchedTask(
+             clusterService,
+             "slm-record-success-" + policyId,
+             WriteJobStatus.success(projectId, policyId, snapshotId, snapshotStartTime, timestamp, snapshotInfo)
+         );
+     }
+
+     @Override
+     public void onFailure(Exception e) {
+         logger.warn(() -> format("failed to retrieve stale registered snapshots for job [%s]", jobId), e);
+         // still record the successful snapshot
+         submitUnbatchedTask(
+             clusterService,
+             "slm-record-success-" + policyId,
+             WriteJobStatus.success(
+                 projectId,
+                 policyId,
+                 snapshotId,
+                 snapshotStartTime,
+                 timestamp,
+                 Collections.emptyList()
+             )
+         );
+     }
+ });


There's a little bit of duplication, maybe there's a more elegant way to do this?

I don't think the duplication is too bad here, duplicating it once is fine.
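For reference, one way the two branches could share a single code path is a listener whose failure handler logs the error and then runs the same action with a fallback value. The sketch below is self-contained and purely illustrative: `Listener` is a hypothetical stand-in for the real listener type, and `withFallback` is an invented helper, not an Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative only; none of these names exist in Elasticsearch. */
final class RecordSuccessSketch {

    /** Minimal stand-in for an asynchronous response listener. */
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    /**
     * Builds a listener that runs the same action on both paths: with the real
     * response on success, and with a fallback value (after reporting the
     * error) on failure.
     */
    static <T> Listener<T> withFallback(T fallback, Consumer<Exception> onError, Consumer<T> action) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                action.accept(response);
            }

            @Override
            public void onFailure(Exception e) {
                onError.accept(e);
                action.accept(fallback);
            }
        };
    }

    public static void main(String[] args) {
        List<List<String>> recorded = new ArrayList<>();
        Listener<List<String>> listener = withFallback(
            Collections.<String>emptyList(),
            e -> System.out.println("lookup failed: " + e.getMessage()),
            recorded::add // stands in for submitting the cluster state update task
        );
        listener.onResponse(List.of("snap-1"));
        listener.onFailure(new RuntimeException("repository unavailable"));
        System.out.println(recorded);
    }
}
```

Whether the extra indirection is worth it for a single repetition is a judgment call, which is consistent with the conclusion in this thread that duplicating the call once is fine.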

elasticsearchmachine · 2025-09-04T17:02:06Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-09-04T17:40:24Z

Hi @samxbr, I've created a changelog YAML for you.

dakrone

LGTM, thanks for working on this Sam! I left a few minor comments, nothing major.

dakrone · 2025-09-11T20:59:27Z

Correctly update SLM stats with master shutdown

77ae8e4

elasticsearchmachine added the v9.2.0 label Sep 4, 2025

samxbr added 2 commits September 4, 2025 11:48

clean up unused methods and comments

e84b2f7

minor clean up

d53f07b

samxbr added :Data Management/ILM+SLM Index and Snapshot lifecycle management v9.1.0 v9.0.7 v9.1.4 auto-backport Automatically create backport pull requests when merged and removed v9.1.0 auto-backport Automatically create backport pull requests when merged v9.1.4 v9.0.7 labels Sep 4, 2025

samxbr requested a review from Copilot September 4, 2025 16:51

Copilot AI reviewed Sep 4, 2025

Update x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/Sn…

d6341e7

…apshotLifecycleTask.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

samxbr requested a review from dakrone September 4, 2025 16:59

samxbr commented Sep 4, 2025

samxbr marked this pull request as ready for review September 4, 2025 17:01

elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 4, 2025

samxbr added the >bug label Sep 4, 2025

samxbr added 2 commits September 4, 2025 13:40

Update docs/changelog/134152.yaml

1837d8e

Merge branch 'main' into fix/slm-master-shutdown

2baa8b1

dakrone approved these changes Sep 11, 2025

samxbr and others added 4 commits September 13, 2025 20:05

Merge branch 'main' into fix/slm-master-shutdown

b21bb9c

Update x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/Sn…

d3755a6

…apshotLifecycleTask.java Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

Address comment

5665e0f

Merge branch 'main' into fix/slm-master-shutdown

01906b4

samxbr added 2 commits September 15, 2025 16:35

minor rename

3e32728

Merge branch 'main' into fix/slm-master-shutdown

a06e06a

samxbr added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Sep 15, 2025

samxbr enabled auto-merge (squash) September 15, 2025 20:45

Merge branch 'main' into fix/slm-master-shutdown

a91f752

samxbr merged commit aea67ba into elastic:main Sep 16, 2025
34 checks passed

mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025

Correctly update SLM stats in case of master shutdown (elastic#134152)

489c712

gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025

Correctly update SLM stats in case of master shutdown (elastic#134152)

942ab85

samxbr mentioned this pull request Sep 26, 2025

Add origin to client in SLM task #135484

Merged
