Enhance ILM Health Indicator #96092
Conversation
| Pinging @elastic/es-data-management (Team:Data Management) |
| Hi @HiDAl, I've created a changelog YAML for you. |
| @elasticsearchmachine run elasticsearch-ci/part-2 |
| @elasticsearchmachine run elasticsearch-ci/bwc |
| @elasticsearchmachine run elasticsearch-ci/part-2 |
| @elasticmachine run elasticsearch-ci/part-2 |
| @elasticmachine run elasticsearch-ci/bwc |
| @andreidan I've tackled all your comments; the only missing part is reviewing the actual rules. If you want, we can check them online during the day. Thanks! |
| @elasticmachine run elasticsearch-ci/bwc |
| bwc tests have been constantly failing since last Friday https://github.com/elastic/elasticsearch/issues?q=is%3Aissue+is%3Aopen+UpgradeClusterClientYamlTestSuiteIT |
| @elasticmachine update branch |
| @elasticmachine run elasticsearch-ci/part-2 |
Thanks for iterating on this, Pablo.
This is shaping up really nicely! Just a few more minor comments.
    NAME,
    "stagnating_action:" + entry.getKey(),
    "Some indices have been stagnated on the action [" + entry.getKey() + "] longer than the expected time.",
    "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
Suggested change:

    - "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
    + "Check the current status of the Index Lifecycle Management for every affected index using the [GET /_ilm/explain] API.",
I wonder if we should make it explicit that the API is called on the index itself
Maybe [GET /affected_index_name/_ilm/explain] API. Please replace the affected_index_name in the API with the actual index name.
What do you think?
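For reference, the per-index form of the call this suggestion points at would look like the following in Console (the index name here is just an example, taken from the sample output in the PR description):

```
GET /.ds-test-001-2023.05.16-000007/_ilm/explain
```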
        }
    }

    record ActionRule(String action, TimeValue maxTimeOn) implements RuleConfig {
Shall we document these rules and how they work?
        )
    );

    private static final TimeValue ONE_DAY = TimeValue.timeValueDays(1);
Shall we have this configurable via a setting?
Similar to the ones we use in the coordination diagnostics service - https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java#L138
and document the new one here
We should still default to 24 hours, but it'll give us a lever to make the health API stop returning YELLOW if we need to.
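In Elasticsearch this would typically be a `Setting.timeSetting` registered with the cluster settings, as in the coordination diagnostics service linked above. As a stdlib-only model of the behaviour being asked for (the setting key and class name here are invented, not what the PR would use):

```java
import java.time.Duration;
import java.util.Map;

public class MaxTimeOnActionSketch {
    // Default mirrors the PR's ONE_DAY constant.
    static final Duration DEFAULT = Duration.ofHours(24);
    // Hypothetical setting key; the real name would be chosen when the setting is added.
    static final String KEY = "health.ilm.max_time_on_action";

    // Resolve the threshold from "cluster settings", falling back to the 24h default.
    static Duration maxTimeOnAction(Map<String, Duration> clusterSettings) {
        return clusterSettings.getOrDefault(KEY, DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println(maxTimeOnAction(Map.of()));                          // PT24H
        System.out.println(maxTimeOnAction(Map.of(KEY, Duration.ofHours(48)))); // PT48H
    }
}
```

The lever the reviewer describes is exactly the second case: an operator overrides the key and the indicator stops (or starts) turning YELLOW without a code change.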
Sounds cool, will do it. BTW, the documentation will be in a different PR.
Well, all these rules are statically created, so there's no access to the ClusterService at that point, which implies deeper changes. I'd rather do it in a different PR because this one is already bloated.
Sounds good, thanks Pablo
    ),
    //
    DeleteAction.NAME,
    actionRule(DeleteAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(DeleteStep.NAME, ONE_DAY)),
Delete only needs a step rule (no action rule)
Thanks for iterating on this Pablo.
Some final comments on the rules but otherwise LGTM
    static final Map<String, RuleConfig> RULES_BY_ACTION_CONFIG = Map.of(
        RolloverAction.NAME,
        actionRule(RolloverAction.NAME).stepRules(stepRule(WaitForActiveShardsStep.NAME, ONE_DAY)),
Ah, just one more thing, sorry for the late note here: I think we'd also like two step rules here for when check-rollover-ready or attempt-rollover (these are async steps) are retried a number of times (100?).
This would help us signal that an index doesn't have the necessary alias set up properly (check-rollover-ready would fail and attempt retries) and that attempt-rollover is not successful for some reason.
    actionRule(DeleteAction.NAME).stepRules(stepRule(DeleteStep.NAME, ONE_DAY)),
    //
    ShrinkAction.NAME,
    actionRule(ShrinkAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForNoFollowersStep.NAME, ONE_DAY)),
Note (maybe in a comment) that we're adding this rule on the WaitForNoFollowersStep, which runs on the leader node, because the follower cluster will have an unfollow action injected before the Shrink and SearchableSnapshot actions. The index should not be receiving writes anymore, so the unfollow on the follower cluster should make sure that the WaitForNoFollowersStep on the leader does not take too long.

We currently seem to not inject Unfollow before Downsample, which I think is a bug for now, so please remove this WaitForNoFollowersStep rule from the Downsample action (and open a bug to fix it and inject Unfollow before Downsample).
    DownsampleAction.NAME,
    actionRule(DownsampleAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForNoFollowersStep.NAME, ONE_DAY))
We should remove this rule and open a bug to inject UNFOLLOW before DOWNSAMPLE on the follower cluster (and after that re-add this rule here)
    actionRule(AllocateAction.NAME).maxTimeOnAction(ONE_DAY).noStepRules(),
    //
    ForceMergeAction.NAME,
    actionRule(ForceMergeAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForIndexColorStep.NAME, ONE_DAY)),
This rule means:

    forcemerge takes more than a day && the wait-for-index-colour step takes more than a day

Note that the index is closed/reopened only if the codec is changed in the forcemerge action (which is not very common, AFAIK). I think the rule here should be:

    forcemerge takes more than a day
    && (  the wait-for-index-colour step takes more than a day
       || the forcemerge step takes more than a day or is retried more than 100 times
       || the segment-count step takes more than a day or is retried more than 100 times )

What do you think?
Also @dakrone - does this make sense to you as well?
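The boolean shape of the proposed rule can be sketched with a minimal, self-contained model. The step names and thresholds (one day, 100 retries) come from the comments above; the class and method names are invented for illustration and are not the PR's actual API:

```java
import java.time.Duration;

public class ForceMergeRuleSketch {
    static final Duration ONE_DAY = Duration.ofDays(1);
    static final int MAX_RETRIES = 100;

    // A step is stuck if it exceeded the time budget OR was retried too often.
    static boolean stepStuck(Duration timeOnStep, int retries) {
        return timeOnStep.compareTo(ONE_DAY) > 0 || retries > MAX_RETRIES;
    }

    // The forcemerge action is stagnating only if the action itself exceeded a day
    // AND the index sits on one of the relevant steps AND that step is stuck.
    static boolean forceMergeStagnating(Duration timeOnAction,
                                        String currentStep,
                                        Duration timeOnStep,
                                        int retries) {
        boolean relevantStep = currentStep.equals("wait-for-index-colour")
            || currentStep.equals("forcemerge")
            || currentStep.equals("segment-count");
        return timeOnAction.compareTo(ONE_DAY) > 0
            && relevantStep
            && stepStuck(timeOnStep, retries);
    }

    public static void main(String[] args) {
        // 36h on the action, forcemerge step retried 150 times -> stagnating
        System.out.println(forceMergeStagnating(Duration.ofHours(36), "forcemerge", Duration.ofHours(2), 150)); // true
        // 36h on the action but the step is young and rarely retried -> fine
        System.out.println(forceMergeStagnating(Duration.ofHours(36), "forcemerge", Duration.ofHours(2), 3));   // false
    }
}
```

The key point is that a long-running forcemerge action alone does not trip the rule; one of its steps must also be stuck, either by time or by retry count.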
Resolved review thread on x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IlmHealthIndicatorService.java
| relates #93859 |
    stepRule(WaitForRolloverReadyStep.NAME, ONE_DAY),
    stepRule(RolloverStep.NAME, ONE_DAY)
I believe we should track the number of retries for these ones (as they're async steps that move into the step from ERROR when they're retried).
Please update them to track max 100 retries (#96092 (comment)).
We should not track time in WaitForRolloverReadyStep, as we could be there for days and that's normal.
The number of retries is tracked by default in the StepRules. Should we make the number of retries optional?
Ah, yes, I see that.

> We should not track time in WaitForRolloverReadyStep as we could be there for days and that's normal

I think having it mandatory AND implicit is a bit confusing, especially for things where we maybe don't want to track time, like the one above (where we'd ONLY want to track the number of retries).
Should we make it mandatory to specify one of maxTimeOn or the number of retries (with the option to specify both as well, all options explicit in the constructors)?
Tackled in #96855
    stepRule(ForceMergeStep.NAME, ONE_DAY),
    stepRule(SegmentCountStep.NAME, ONE_DAY)
Same as above. I believe we should track the number of retries for these ones (as they're async steps that move into the step from ERROR when they're retried).
Please update them to also track max 100 retries.
Tackled in #96855
This PR introduces new rules to the ILM Health Indicator. The new rules check all the indices managed by an ILM policy according to the current action and step they are on at any time. For instance, if an index has been on the `rollover` > `wait-for-active-shards` combo for more than 24 hours, or the step `wait-for-active-shards` has been retried more than 100 times, the indicator will turn yellow. Below you can see an example of the indicator impact, diagnoses and details:

    {
      "cluster_name": "runTask",
      "indicators": {
        "ilm": {
          "status": "yellow",
          "symptom": "3 indices have stayed on the same action longer than expected.",
          "details": {
            "stagnating_indices_per_action": {
              "downsample": 0,
              "allocate": 0,
              "shrink": 0,
              "searchable_snapshot": 0,
              "rollover": 3,
              "forcemerge": 0,
              "delete": 0,
              "migrate": 0
            },
            "policies": 19,
            "stagnating_indices": 3,
            "ilm_status": "RUNNING"
          },
          "impacts": [
            {
              "id": "elasticsearch:health:ilm:impact:stagnating_index",
              "severity": 3,
              "description": "Automatic index lifecycle and data retention management cannot make progress on one or more indices. The performance and stability of the indices and/or the cluster could be impacted.",
              "impact_areas": ["deployment_management"]
            }
          ],
          "diagnosis": [
            {
              "id": "elasticsearch:health:ilm:diagnosis:stagnating_action:rollover",
              "cause": "Some indices have been stagnated on the action [rollover] longer than the expected time.",
              "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
              "help_url": "https://ela.st/ilm-explain",
              "affected_resources": {
                "ilm_policies": ["ilm-history-ilm-policy", "some-policy-2", "some-policy"],
                "indices": [
                  ".ds-test-001-2023.05.16-000007",
                  ".ds-ilm-history-5-2023.05.16-000001",
                  ".ds-policy2-1-2023.05.16-000007"
                ]
              }
            }
          ]
        }
      }
    }

Right now all the rules have the same configuration (> 24 hours in the same action && > 100 errors). This can easily be changed and is open to review.
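The output above is what the health API returns; assuming a cluster with this change applied, the ILM indicator can be fetched on its own with a Console request like:

```
GET _health_report/ilm
```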