
Conversation

@original-brownbear (Contributor)

Fixes listeners hanging while waiting for a cluster state update in which the restore state change can be observed. Using a ClusterStateListener here meant that if the restore had already been removed from the cluster state by the time the listener was added, and no subsequent state changes happened, the listener would not be invoked and would hang until the next state change.

The observer infrastructure is safer here, since it checks the current state against the predicate before starting to wait, in a concurrency-safe way.

closes #96425

Note: this is still a little buggy in that sometimes a null RestoreInfo may incorrectly be returned. This is not a new bug though, and it is actually less likely with this change than it used to be. We could fix it with more involved logic eventually, but for now this resolves the hanging APIs and the linked test failure.
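
For illustration, a rough sketch of the two approaches follows (this is not the code in this PR; the wrapper class, the method, and the restoreIsComplete helper are made up for the sketch). A plain ClusterStateListener only fires on subsequent cluster state updates, while ClusterStateObserver#waitForNextChange also takes the predicate, so the wait need not depend on one more update arriving; whether the already-sampled state is checked is exactly the point discussed in the review below.

import java.util.function.Predicate;

import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.core.TimeValue;

// Sketch only; RestoreWaitSketch and restoreIsComplete are hypothetical, not part of this PR.
class RestoreWaitSketch {

    void awaitRestoreCompletion(ClusterService clusterService, String uuid, Logger logger, ThreadContext threadContext) {
        Predicate<ClusterState> restoreCompleted = state -> restoreIsComplete(state, uuid);

        // Old approach (ClusterStateListener, not shown): the listener is only invoked on
        // *subsequent* cluster state updates, so if the restore entry was already removed
        // and no further update arrives, nothing fires and the API hangs.

        // New approach: hand the predicate to the observer as well, so completion does not
        // rely solely on another cluster state update.
        new ClusterStateObserver(clusterService, null, logger, threadContext).waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                // resolve the restore API listener here
            }

            @Override
            public void onClusterServiceClose() {
                // fail the restore API listener here
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                // unused: no timeout was passed to the observer
            }
        }, restoreCompleted);
    }

    private boolean restoreIsComplete(ClusterState state, String uuid) {
        return false; // placeholder; the real predicate inspects the restore entries in the state
    }
}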

@original-brownbear added the >bug, :Distributed Coordination/Snapshot/Restore, and v8.9.0 labels on Jun 7, 2023
@elasticsearchmachine added the Team:Distributed (Obsolete) label on Jun 7, 2023
@elasticsearchmachine (Collaborator)

Hi @original-brownbear, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen (Contributor) left a comment


I think this nearly works, provided I read the observer code correctly.

I wonder if we can add a test that reproduces this more reliably than the original CI report?

assert prevEntry.state().completed() : "expected completed snapshot state but was " + prevEntry.state();
final String uuid = response.getUuid();
final DiscoveryNode localNode = clusterService.localNode();
new ClusterStateObserver(clusterService, null, logger, threadContext).waitForNextChange(new RestoreListener(listener, localNode) {
Contributor


Does this not have the same issue as the original listener version, since the observer grabs the current version and we thus rely on one more cluster state update?

I think we can fix it by passing in clusterService.state().version()-1 to the observer in the constructor.

Contributor Author


But unlike the listener, the observer also checks the current state against the predicate. So we'll never wait here if the state already fulfils the predicate => this should be safe in practice, I believe (this is also borne out by the way I used the observer in a lot of test infrastructure when waiting for snapshots to complete and such).

Contributor Author


Alright @henningandersen, as discussed elsewhere, I pushed e1bbe9a now, which more safely checks the current state using the pattern we also use when working around this issue in tests.
See org.elasticsearch.test.ClusterServiceUtils#awaitClusterState.

Maybe, instead of adding a tricky test here, an alternative would be a follow-up that extracts the check-before-wait logic so it can be used in a couple of spots, and adds a test there? (Promise I'd get to it this week ;)) For reference, the rough shape of that pattern is sketched below.
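
The check-before-wait pattern referred to above has roughly this shape (an approximation, not the actual source of ClusterServiceUtils#awaitClusterState; imports as in the earlier sketch, plus java.util.concurrent.CountDownLatch and TimeUnit):

// Approximate shape of the check-before-wait pattern; not the real test utility.
static void awaitState(ClusterService clusterService, ThreadContext threadContext, Logger logger, Predicate<ClusterState> statePredicate)
    throws InterruptedException {
    final CountDownLatch done = new CountDownLatch(1);
    final ClusterStateObserver observer = new ClusterStateObserver(clusterService, null, logger, threadContext);
    // Key point: test the state the observer has already sampled *before* waiting, so an
    // already-satisfied predicate never blocks on a future cluster state update.
    if (statePredicate.test(observer.setAndGetObservedState())) {
        done.countDown();
    } else {
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                done.countDown();
            }

            @Override
            public void onClusterServiceClose() {
                done.countDown(); // real code would surface this as a failure instead
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                // unused: no timeout was set on the observer
            }
        }, statePredicate);
    }
    done.await(30L, TimeUnit.SECONDS);
}

Extracting exactly this shape into a shared helper is what the proposed follow-up would do.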

@henningandersen (Contributor) left a comment


LGTM.

@original-brownbear (Contributor Author)

Thanks Henning!

@original-brownbear merged commit 6d5b6a0 into elastic:main on Jun 12, 2023
@original-brownbear deleted the 96425 branch on June 12, 2023 at 15:13