

@HiDAl HiDAl commented May 15, 2023

This PR introduces new rules to the ILM Health Indicator. The new rules check all indices managed by an ILM policy, based on the action and step each index is currently in. For instance, if an index has been in the rollover > wait-for-active-shards combination for more than 24 hours, or the wait-for-active-shards step has been retried more than 100 times, the indicator will turn yellow. Below is an example of the indicator's impacts, diagnoses, and details:

{
  "cluster_name": "runTask",
  "indicators": {
    "ilm": {
      "status": "yellow",
      "symptom": "3 indices have stayed on the same action longer than expected.",
      "details": {
        "stagnating_indices_per_action": {
          "downsample": 0,
          "allocate": 0,
          "shrink": 0,
          "searchable_snapshot": 0,
          "rollover": 3,
          "forcemerge": 0,
          "delete": 0,
          "migrate": 0
        },
        "policies": 19,
        "stagnating_indices": 3,
        "ilm_status": "RUNNING"
      },
      "impacts": [
        {
          "id": "elasticsearch:health:ilm:impact:stagnating_index",
          "severity": 3,
          "description": "Automatic index lifecycle and data retention management cannot make progress on one or more indices. The performance and stability of the indices and/or the cluster could be impacted.",
          "impact_areas": [ "deployment_management" ]
        }
      ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:ilm:diagnosis:stagnating_action:rollover",
          "cause": "Some indices have been stagnated on the action [rollover] longer than the expected time.",
          "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "ilm_policies": [ "ilm-history-ilm-policy", "some-policy-2", "some-policy" ],
            "indices": [
              ".ds-test-001-2023.05.16-000007",
              ".ds-ilm-history-5-2023.05.16-000001",
              ".ds-policy2-1-2023.05.16-000007"
            ]
          }
        }
      ]
    }
  }
}

Right now all the rules share the same configuration (> 24 hours in the same action && > 100 errors). This can easily be changed and is open to review.
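Conceptually, each rule combines a time budget for the action with a retry budget for the current step. A minimal sketch of that logic (the class and field names here are illustrative, not the PR's actual Java types):

```python
from dataclasses import dataclass

# Defaults as described above; both thresholds are hypothetical constants.
MAX_SECONDS_IN_ACTION = 24 * 60 * 60  # > 24 hours in the same action
MAX_STEP_RETRIES = 100                # > 100 retries of the current step

@dataclass
class IndexLifecycleState:
    action: str
    seconds_in_action: int
    step_retries: int

def is_stagnating(state: IndexLifecycleState) -> bool:
    # An index is flagged as stagnating when either budget is exceeded.
    return (state.seconds_in_action > MAX_SECONDS_IN_ACTION
            or state.step_retries > MAX_STEP_RETRIES)

print(is_stagnating(IndexLifecycleState("rollover", 3600, 2)))       # False
print(is_stagnating(IndexLifecycleState("rollover", 30 * 3600, 0)))  # True
```

In the example output above, three rollover indices exceed this budget, so the indicator reports yellow.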

@HiDAl HiDAl requested review from dakrone and gmarouli May 15, 2023 08:01
@elasticsearchmachine elasticsearchmachine added v8.9.0 needs:triage Requires assignment of a team area label labels May 15, 2023
@HiDAl HiDAl added >feature :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels May 15, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @HiDAl, I've created a changelog YAML for you.


HiDAl commented May 15, 2023

@elasticsearchmachine run elasticsearch-ci/part-2


HiDAl commented May 15, 2023

@elasticsearchmachine run elasticsearch-ci/bwc


HiDAl commented May 15, 2023

@elasticsearchmachine run elasticsearch-ci/part-2


HiDAl commented Jun 5, 2023

@elasticmachine run elasticsearch-ci/part-2


HiDAl commented Jun 6, 2023

@elasticmachine run elasticsearch-ci/bwc


HiDAl commented Jun 6, 2023

@andreidan I've tackled all your comments; the only part left is reviewing the actual rules. If you want, we can go over them online during the day. Thanks!


HiDAl commented Jun 6, 2023

@elasticmachine run elasticsearch-ci/bwc



HiDAl commented Jun 7, 2023

@elasticmachine update branch


HiDAl commented Jun 7, 2023

@elasticmachine run elasticsearch-ci/part-2


HiDAl commented Jun 8, 2023

@elasticmachine run elasticsearch-ci/part-2


@andreidan andreidan left a comment


Thanks for iterating on this Pablo

This is shaping up really nicely! Just a few more minor comments.

NAME,
"stagnating_action:" + entry.getKey(),
"Some indices have been stagnated on the action [" + entry.getKey() + "] longer than the expected time.",
"Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
Contributor


Suggested change
"Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
"Check the current status of the Index Lifecycle Management for every affected index using the [GET /_ilm/explain] API.",

I wonder if we should make it explicit that the API is called on the index itself

Maybe [GET /affected_index_name/_ilm/explain] API. Please replace the affected_index_name in the API with the actual index name.

What do you think?
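For illustration, the explain call the suggestion refers to would look like this, using one of the affected indices from the example output above in place of the placeholder:

```console
GET /.ds-test-001-2023.05.16-000007/_ilm/explain
```

The response reports the phase, action, and step the index is currently in, which is exactly what the diagnosis asks the user to check.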

}
}

record ActionRule(String action, TimeValue maxTimeOn) implements RuleConfig {
Contributor


Shall we document these rules and how they work?

)
);

private static final TimeValue ONE_DAY = TimeValue.timeValueDays(1);
Contributor


Shall we have this configurable via a setting?

Similar to the ones we use in the coordination diagnostics service - https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java#L138
and document the new one here

We should still default to 24 hours, but it'll give us a lever to make the health API stop returning YELLOW if we need to.

Author


sounds cool. Will do it. btw the documentation will be in a different PR

Author


Well, all these rules are statically created, so there's no access to the ClusterService at that point, which would imply deeper changes. I'd rather do it in a different PR because this one is bloated already.

Contributor


Sounds good, thanks Pablo

),
//
DeleteAction.NAME,
actionRule(DeleteAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(DeleteStep.NAME, ONE_DAY)),
Contributor


Delete only needs a step rule (no action rule)


@andreidan andreidan left a comment


Thanks for iterating on this Pablo.

Some final comments on the rules but otherwise LGTM


static final Map<String, RuleConfig> RULES_BY_ACTION_CONFIG = Map.of(
RolloverAction.NAME,
actionRule(RolloverAction.NAME).stepRules(stepRule(WaitForActiveShardsStep.NAME, ONE_DAY)),
Contributor


Ah, just one more thing, sorry for the late note here: I think we'd also like two step rules here for when check-rollover-ready or attempt-rollover (both async steps) are retried a number of times (100?).

This would help us signal that an index doesn't have the necessary alias set up properly (check-rollover-ready would fail and attempt retries) and that attempt-rollover is not succeeding for some reason.

actionRule(DeleteAction.NAME).stepRules(stepRule(DeleteStep.NAME, ONE_DAY)),
//
ShrinkAction.NAME,
actionRule(ShrinkAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForNoFollowersStep.NAME, ONE_DAY)),
Contributor


Note (maybe in a comment) that we're adding this rule on the WaitForNoFollowersStep, which runs on the leader node. The follower cluster will have an unfollow action injected before the Shrink and SearchableSnapshot actions; since the index should no longer be receiving writes, the unfollow on the follower cluster should ensure that the WaitForNoFollowersStep on the leader doesn't take too long.

We currently seem not to inject Unfollow before Downsample, which I think is a bug, so please remove this WaitForNoFollowersStep rule from the Downsample action (and open a bug to inject Unfollow before Downsample).

Comment on lines 131 to 132
DownsampleAction.NAME,
actionRule(DownsampleAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForNoFollowersStep.NAME, ONE_DAY))
Contributor


We should remove this rule and open a bug to inject UNFOLLOW before DOWNSAMPLE on the follower cluster (and after that re-add this rule here)

actionRule(AllocateAction.NAME).maxTimeOnAction(ONE_DAY).noStepRules(),
//
ForceMergeAction.NAME,
actionRule(ForceMergeAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForIndexColorStep.NAME, ONE_DAY)),
Contributor


This rule means:
forcemerge takes more than a day & the wait-for-index-colour step takes more than a day

Note that the index is closed/reopen only if the codec is changed in the forcemerge action (which is not very common AFAIK)

I think the rule here should be

forcemerge takes more than a day && (the wait-for-index-colour step takes more than a day || the forcemerge step takes more than a day or is retried more than 100 times || the segment-count step takes more than a day or is retried more than 100 times)

What do you think?

Also @dakrone - does this make sense to you as well?
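The composite rule proposed above can be sketched as follows; this is a hedged Python model with hypothetical names and data shapes, not the actual Java implementation:

```python
# Action-level time budget AND at least one step-level condition must hold.
ONE_DAY = 24 * 60 * 60
MAX_RETRIES = 100

def step_stuck(seconds_in_step: int, retries: int) -> bool:
    # A step is stuck if it exceeds its time budget or its retry budget.
    return seconds_in_step > ONE_DAY or retries > MAX_RETRIES

def forcemerge_stagnating(seconds_in_action: int, steps: dict) -> bool:
    """steps maps a step name to a (seconds_in_step, retries) tuple."""
    return (seconds_in_action > ONE_DAY
            and any(step_stuck(s, r) for s, r in steps.values()))

steps = {
    "wait-for-index-colour": (120, 0),
    "forcemerge": (2 * ONE_DAY, 0),  # the forcemerge step itself is stuck
    "segment-count": (60, 0),
}
print(forcemerge_stagnating(3 * ONE_DAY, steps))  # True
```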

@HiDAl HiDAl merged commit d471146 into elastic:main Jun 8, 2023
@HiDAl HiDAl deleted the ilm-indicator branch June 8, 2023 14:30

HiDAl commented Jun 8, 2023

relates #93859

Comment on lines +111 to +112
stepRule(WaitForRolloverReadyStep.NAME, ONE_DAY),
stepRule(RolloverStep.NAME, ONE_DAY)

@andreidan andreidan Jun 9, 2023


I believe we should track the number of retries for these ones (as they're async steps that move back into the step from ERROR when they're retried).

Please update them to track max 100 retries (#96092 (comment)).

We should not track time in WaitForRolloverReadyStep as we could be there for days and that's normal


@HiDAl HiDAl Jun 12, 2023


The number of retries is tracked by default in the StepRules. Should we make it optional?

Contributor


Ah, yes I see that.

We should not track time in WaitForRolloverReadyStep as we could be there for days and that's normal

I think having it mandatory AND implicit is a bit confusing, especially for steps where we may not want to track time at all, like the one above (where we'd ONLY want to track the number of retries).

Should we make it mandatory to specify at least one of maxTimeOn or the number of retries, with the option to specify both (all options explicit in the constructors)?
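A small sketch of that constructor contract, requiring at least one explicit threshold while allowing both (a hypothetical Python model, not the PR's Java code):

```python
from dataclasses import dataclass
from typing import Optional

# A step rule must name at least one threshold explicitly, and may name both.
@dataclass(frozen=True)
class StepRule:
    step: str
    max_time_on_seconds: Optional[int] = None
    max_retries: Optional[int] = None

    def __post_init__(self) -> None:
        if self.max_time_on_seconds is None and self.max_retries is None:
            raise ValueError("specify max_time_on and/or max_retries")

# WaitForRolloverReadyStep: only track retries, since sitting in this step
# for days can be perfectly normal.
rollover_ready = StepRule("check-rollover-ready", max_retries=100)
```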

Author


Tackled in #96855.

Comment on lines +145 to +146
stepRule(ForceMergeStep.NAME, ONE_DAY),
stepRule(SegmentCountStep.NAME, ONE_DAY)
Contributor


Same as above.

I believe we should track the number of retries for these ones (as they're async steps that move back into the step from ERROR when they're retried).

Please update them to also track max 100 retries

Author


Tackled in #96855.
