

@HiDAl HiDAl commented May 15, 2023

This PR introduces new rules to the ILM Health Indicator. The new rules check all indices managed by an ILM policy, based on the action and step each index is currently in. For instance, if an index has been in the rollover > wait-for-active-shards combination for more than 24 hours, or the wait-for-active-shards step has been retried more than 100 times, the indicator will turn yellow. Below is an example of the indicator's impacts, diagnoses, and details:

{
  "cluster_name": "runTask",
  "indicators": {
    "ilm": {
      "status": "yellow",
      "symptom": "3 indices have stayed on the same action longer than expected.",
      "details": {
        "stagnating_indices_per_action": {
          "downsample": 0,
          "allocate": 0,
          "shrink": 0,
          "searchable_snapshot": 0,
          "rollover": 3,
          "forcemerge": 0,
          "delete": 0,
          "migrate": 0
        },
        "policies": 19,
        "stagnating_indices": 3,
        "ilm_status": "RUNNING"
      },
      "impacts": [
        {
          "id": "elasticsearch:health:ilm:impact:stagnating_index",
          "severity": 3,
          "description": "Automatic index lifecycle and data retention management cannot make progress on one or more indices. The performance and stability of the indices and/or the cluster could be impacted.",
          "impact_areas": [ "deployment_management" ]
        }
      ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:ilm:diagnosis:stagnating_action:rollover",
          "cause": "Some indices have been stagnated on the action [rollover] longer than the expected time.",
          "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "ilm_policies": [ "ilm-history-ilm-policy", "some-policy-2", "some-policy" ],
            "indices": [
              ".ds-test-001-2023.05.16-000007",
              ".ds-ilm-history-5-2023.05.16-000001",
              ".ds-policy2-1-2023.05.16-000007"
            ]
          }
        }
      ]
    }
  }
}

Right now all the rules share the same configuration (> 24 hours in the same action && > 100 errors). This can easily be changed and is open to review.
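Conceptually, each rule combines a time budget for the action with a retry budget for the current step. A minimal sketch of that logic (the class and field names here are illustrative, not the PR's actual Java types):

```python
from dataclasses import dataclass

# Defaults as described above; both thresholds are hypothetical constants.
MAX_SECONDS_IN_ACTION = 24 * 60 * 60  # > 24 hours in the same action
MAX_STEP_RETRIES = 100                # > 100 retries of the current step

@dataclass
class IndexLifecycleState:
    action: str
    seconds_in_action: int
    step_retries: int

def is_stagnating(state: IndexLifecycleState) -> bool:
    # An index is flagged as stagnating when either budget is exceeded.
    return (state.seconds_in_action > MAX_SECONDS_IN_ACTION
            or state.step_retries > MAX_STEP_RETRIES)

print(is_stagnating(IndexLifecycleState("rollover", 3600, 2)))       # False
print(is_stagnating(IndexLifecycleState("rollover", 30 * 3600, 0)))  # True
```

In the example output above, three rollover indices exceed this budget, so the indicator reports yellow.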

@HiDAl HiDAl requested review from dakrone and gmarouli May 15, 2023 08:01
@elasticsearchmachine elasticsearchmachine added v8.9.0 needs:triage Requires assignment of a team area label labels May 15, 2023
@HiDAl HiDAl added >feature :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels May 15, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @HiDAl, I've created a changelog YAML for you.


HiDAl commented May 15, 2023

@elasticsearchmachine run elasticsearch-ci/part-2


HiDAl commented May 15, 2023

@elasticsearchmachine run elasticsearch-ci/bwc


HiDAl commented May 15, 2023

@elasticsearchmachine run elasticsearch-ci/part-2


HiDAl commented Jun 5, 2023

@elasticmachine run elasticsearch-ci/part-2


HiDAl commented Jun 6, 2023

@elasticmachine run elasticsearch-ci/bwc


HiDAl commented Jun 6, 2023

@andreidan I've tackled all your comments; the only part left is reviewing the actual rules. If you want, we can go over them online during the day. Thanks!


HiDAl commented Jun 6, 2023

@elasticmachine run elasticsearch-ci/bwc



HiDAl commented Jun 7, 2023

@elasticmachine update branch


HiDAl commented Jun 7, 2023

@elasticmachine run elasticsearch-ci/part-2


HiDAl commented Jun 8, 2023

@elasticmachine run elasticsearch-ci/part-2


@andreidan andreidan left a comment


Thanks for iterating on this Pablo

This is shaping up really nicely! Just a few more minor comments.

NAME,
"stagnating_action:" + entry.getKey(),
"Some indices have been stagnated on the action [" + entry.getKey() + "] longer than the expected time.",
"Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
Contributor


Suggested change
"Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
"Check the current status of the Index Lifecycle Management for every affected index using the [GET /_ilm/explain] API.",

I wonder if we should make it explicit that the API is called on the index itself

Maybe [GET /affected_index_name/_ilm/explain] API. Please replace the affected_index_name in the API with the actual index name.

What do you think?
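For illustration, the explain call the suggestion refers to would look like this, using one of the affected indices from the example output above in place of the placeholder:

```console
GET /.ds-test-001-2023.05.16-000007/_ilm/explain
```

The response reports the phase, action, and step the index is currently in, which is exactly what the diagnosis asks the user to check.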

}
}

record ActionRule(String action, TimeValue maxTimeOn) implements RuleConfig {
Contributor


Shall we document these rules and how they work?

)
);

private static final TimeValue ONE_DAY = TimeValue.timeValueDays(1);
Contributor


Shall we have this configurable via a setting?

Similar to the ones we use in the coordination diagnostics service - https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java#L138
and document the new one here

We should still default to 24 hours, but it'll give us a lever to make the health API stop returning YELLOW if we need to.

Author


sounds cool. Will do it. btw the documentation will be in a different PR

Author


Well, all these rules are statically created, so there's no access to the ClusterService at that point, which would imply deeper changes. I'd rather do it in a different PR because this one is bloated already.

Contributor


Sounds good, thanks Pablo

),
//
DeleteAction.NAME,
actionRule(DeleteAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(DeleteStep.NAME, ONE_DAY)),
Contributor


Delete only needs a step rule (no action rule)


@andreidan andreidan left a comment


Thanks for iterating on this Pablo.

Some final comments on the rules but otherwise LGTM


static final Map<String, RuleConfig> RULES_BY_ACTION_CONFIG = Map.of(
RolloverAction.NAME,
actionRule(RolloverAction.NAME).stepRules(stepRule(WaitForActiveShardsStep.NAME, ONE_DAY)),
Contributor


Ah, just one more thing, sorry for the late note here: I think we'd also like two step rules here for when check-rollover-ready or attempt-rollover (both async steps) are retried a number of times (100?).

This would help us signal that an index doesn't have the necessary alias set up properly (check-rollover-ready would fail and attempt retries) and that attempt-rollover is not succeeding for some reason.

actionRule(DeleteAction.NAME).stepRules(stepRule(DeleteStep.NAME, ONE_DAY)),
//
ShrinkAction.NAME,
actionRule(ShrinkAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForNoFollowersStep.NAME, ONE_DAY)),
Contributor


Note (maybe in a comment) that we're adding this rule on the WaitForNoFollowersStep, which runs on the leader node. The follower cluster will have an unfollow action injected before the Shrink and SearchableSnapshot actions; since the index should no longer be receiving writes, the unfollow on the follower cluster should ensure that the WaitForNoFollowersStep on the leader doesn't take too long.

We currently seem not to inject Unfollow before Downsample, which I think is a bug, so please remove this WaitForNoFollowersStep rule from the Downsample action (and open a bug to inject Unfollow before Downsample).

Comment on lines 131 to 132
DownsampleAction.NAME,
actionRule(DownsampleAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForNoFollowersStep.NAME, ONE_DAY))
Contributor


We should remove this rule and open a bug to inject UNFOLLOW before DOWNSAMPLE on the follower cluster (and after that re-add this rule here)

actionRule(AllocateAction.NAME).maxTimeOnAction(ONE_DAY).noStepRules(),
//
ForceMergeAction.NAME,
actionRule(ForceMergeAction.NAME).maxTimeOnAction(ONE_DAY).stepRules(stepRule(WaitForIndexColorStep.NAME, ONE_DAY)),
Contributor


This rule means:
forcemerge takes more than a day & the wait-for-index-colour step takes more than a day

Note that the index is closed/reopen only if the codec is changed in the forcemerge action (which is not very common AFAIK)

I think the rule here should be

forcemerge takes more than a day && (the wait-for-index-colour step takes more than a day || the forcemerge step takes more than a day or is retried more than 100 times || the segment-count step takes more than a day or is retried more than 100 times)

What do you think?

Also @dakrone - does this make sense to you as well?
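The composite rule proposed above can be sketched as follows; this is a hedged Python model with hypothetical names and data shapes, not the actual Java implementation:

```python
# Action-level time budget AND at least one step-level condition must hold.
ONE_DAY = 24 * 60 * 60
MAX_RETRIES = 100

def step_stuck(seconds_in_step: int, retries: int) -> bool:
    # A step is stuck if it exceeds its time budget or its retry budget.
    return seconds_in_step > ONE_DAY or retries > MAX_RETRIES

def forcemerge_stagnating(seconds_in_action: int, steps: dict) -> bool:
    """steps maps a step name to a (seconds_in_step, retries) tuple."""
    return (seconds_in_action > ONE_DAY
            and any(step_stuck(s, r) for s, r in steps.values()))

steps = {
    "wait-for-index-colour": (120, 0),
    "forcemerge": (2 * ONE_DAY, 0),  # the forcemerge step itself is stuck
    "segment-count": (60, 0),
}
print(forcemerge_stagnating(3 * ONE_DAY, steps))  # True
```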

@HiDAl HiDAl merged commit d471146 into elastic:main Jun 8, 2023
@HiDAl HiDAl deleted the ilm-indicator branch June 8, 2023 14:30

HiDAl commented Jun 8, 2023

relates #93859

Comment on lines +111 to +112
stepRule(WaitForRolloverReadyStep.NAME, ONE_DAY),
stepRule(RolloverStep.NAME, ONE_DAY)

@andreidan andreidan Jun 9, 2023


I believe we should track the number of retries for these ones (as they're async steps that move back into the step from ERROR when they're retried).

Please update them to track max 100 retries (#96092 (comment)).

We should not track time in WaitForRolloverReadyStep as we could be there for days and that's normal


@HiDAl HiDAl Jun 12, 2023


The number of retries is tracked by default in the StepRules. Should we make it optional?

Contributor


Ah, yes I see that.

We should not track time in WaitForRolloverReadyStep as we could be there for days and that's normal

I think having it mandatory AND implicit is a bit confusing, especially for steps where we may not want to track time at all, like the one above (where we'd ONLY want to track the number of retries).

Should we make it mandatory to specify at least one of maxTimeOn or the number of retries, with the option to specify both (all options explicit in the constructors)?
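A small sketch of that constructor contract, requiring at least one explicit threshold while allowing both (a hypothetical Python model, not the PR's Java code):

```python
from dataclasses import dataclass
from typing import Optional

# A step rule must name at least one threshold explicitly, and may name both.
@dataclass(frozen=True)
class StepRule:
    step: str
    max_time_on_seconds: Optional[int] = None
    max_retries: Optional[int] = None

    def __post_init__(self) -> None:
        if self.max_time_on_seconds is None and self.max_retries is None:
            raise ValueError("specify max_time_on and/or max_retries")

# WaitForRolloverReadyStep: only track retries, since sitting in this step
# for days can be perfectly normal.
rollover_ready = StepRule("check-rollover-ready", max_retries=100)
```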

Author


Tackled in #96855.

Comment on lines +145 to +146
stepRule(ForceMergeStep.NAME, ONE_DAY),
stepRule(SegmentCountStep.NAME, ONE_DAY)
Contributor


Same as above.

I believe we should track the number of retries for these ones (as they're async steps that move back into the step from ERROR when they're retried).

Please update them to also track max 100 retries

Author


Tackled in #96855.
