Do not treat replica as unassigned if primary recently created and unassigned time is below a threshold. #112066

parkertimmins · 2024-08-21T16:07:06Z

Changes the way we calculate if all replicas are unassigned when primary is recently created.
Related to #107794 which did not treat replica as unassigned if primary was not yet active.

This change extends the period during which replicas are not treated as unassigned to include a buffer period after the primary has become active. The buffer period is controlled by the setting health.shards_availability.replica_unassigned_buffer_time. It is only used in serverless; on stateful the behavior remains the same: new unassigned replicas are not treated as unassigned only while the primary is not yet active.
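
For reviewers, here is a minimal sketch of the condition described above. It is not the code in ShardsAvailabilityHealthIndicatorService: the class name, method name, and parameters are hypothetical stand-ins for values the real service would read from the cluster state and shard routing, and it assumes the method is only called for a replica that is currently unassigned.

```java
import java.time.Duration;

// Illustrative only: names and signature are assumptions, not the Elasticsearch API.
final class ReplicaBufferSketch {

    /**
     * Decides whether an unassigned replica should count as unassigned
     * for the purposes of the availability health calculation.
     *
     * @param primaryRecentlyCreated   true if the primary belongs to a newly created shard
     * @param primaryActive            true once the primary has become active
     * @param replicaUnassignedAtMillis time at which the replica became unassigned
     * @param nowMillis                current time, passed in rather than read from a clock
     * @param bufferTime               value of health.shards_availability.replica_unassigned_buffer_time
     */
    static boolean countsAsUnassigned(
        boolean primaryRecentlyCreated,
        boolean primaryActive,
        long replicaUnassignedAtMillis,
        long nowMillis,
        Duration bufferTime
    ) {
        if (!primaryRecentlyCreated) {
            // Not a new primary: no special handling applies.
            return true;
        }
        if (!primaryActive) {
            // Behavior from #107794: ignore the replica while the primary is not yet active.
            return false;
        }
        // New in this change: keep ignoring the replica until it has been
        // unassigned for at least the buffer period.
        long unassignedForMillis = nowMillis - replicaUnassignedAtMillis;
        return unassignedForMillis >= bufferTime.toMillis();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Duration buffer = Duration.ofSeconds(3);
        // Replica unassigned for 1s with an active, recently created primary.
        System.out.println(countsAsUnassigned(true, true, now - 1_000L, now, buffer));
        // Same replica with a zero buffer, which matches the stateful behavior.
        System.out.println(countsAsUnassigned(true, true, now - 1_000L, now, Duration.ZERO));
    }
}
```

With a buffer of zero the last comparison is always true, so the replica counts as unassigned as soon as the primary is active, which is the unchanged stateful behavior.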

elasticsearchmachine · 2024-08-21T16:07:31Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2024-08-21T16:07:54Z

Hi @parkertimmins, I've created a changelog YAML for you.

dakrone

LGTM once the Serverless piece is merged and won't cause CI failures. I left one comment but nothing major, thanks Parker!

...lasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java

…lastic#112066) Changes the way we calculate if all replicas are unassigned when primary is recently created. This change will only be used in serverless, not in stateful. When a primary is new, if the primary is active but the replica has been unassigned for less than a buffer time period, do not treat it as unassigned. The time period is controlled through the health.shards_availability.replica_unassigned_buffer_time setting.

Increase the default value of health.shards_availability.replica_unassigned_buffer_time to 5 seconds. This value is used in the identification of unavailable shards on serverless. Increasing the value to 5s keeps more shards from going red transiently, while still being low enough to go red quickly if there is an actual availability issue. Related to #112066

shard_availablity yellow if unassigned less than threshold

2b8bcf5

parkertimmins added >enhancement :Data Management/Health labels Aug 21, 2024

parkertimmins requested a review from dakrone August 21, 2024 16:07

elasticsearchmachine added v8.16.0 Team:Data Management labels Aug 21, 2024

Update docs/changelog/112066.yaml

fa16a83

dakrone added the test-update-serverless label Aug 21, 2024

parkertimmins added 2 commits August 21, 2024 14:31

Change replica_unassigned_buffer_time default to 3s

974fd73

Fix existing test by using buffer=0s

79e2da3

dakrone requested changes Aug 21, 2024

parkertimmins added 4 commits August 21, 2024 22:17

Updates from review

46e422e

- pass now millis instead of clock - update setting with addSettingsUpdateConsumer - simplify cutoff condition logic

remove dead code

b33b8a0

more spotless

a8a96ea

Fix broken test

e2678f5

- need to add settings to mocked ClusterService

parkertimmins commented Aug 22, 2024

...lasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java (outdated, resolved)

parkertimmins commented Aug 22, 2024

...lasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java (outdated, resolved)

parkertimmins requested a review from dakrone August 22, 2024 18:07

dakrone approved these changes Aug 26, 2024

...lasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java (outdated, resolved)

parkertimmins added 2 commits August 28, 2024 10:46

Small review improvments

bd33e8f

Merge branch 'main' into shard-availability-replica-time-buffer

ac52985

parkertimmins merged commit b776cf6 into elastic:main Aug 28, 2024
16 checks passed

parkertimmins deleted the shard-availability-replica-time-buffer branch August 28, 2024 19:18

parkertimmins mentioned this pull request Sep 12, 2024

Increase replica_unassigned_buffer_time default from 3s to 5s #112834

Merged
