Refactor RestoreClusterStateListener to use ClusterStateObserver #96662
Conversation
Fixes listeners hanging while waiting for a cluster state update so that restore state changes can be observed. Using a `ClusterStateListener` here meant that if the restore had already been removed from the cluster state by the time the listener was added, and no subsequent state change happened, the listener would never be invoked and would hang until the next state change. The observer infrastructure is safer here: it checks the current state against a predicate before starting to wait, in a concurrency-safe way. Closes elastic#96425
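The race described above can be sketched with a small self-contained model (all names here are hypothetical; this is not the actual `ClusterStateObserver` API): a plain listener only fires on *subsequent* state changes, while the observer-style wait checks the current state up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal, hypothetical model of the race: listeners only fire on later
// state changes, so a condition that already holds is never noticed.
class MiniClusterService {
    private List<String> state = new ArrayList<>(); // restores still in progress
    private final List<Runnable> listeners = new ArrayList<>();

    void addListener(Runnable listener) {
        listeners.add(listener);
    }

    void publish(List<String> newState) {
        state = newState;
        for (Runnable l : new ArrayList<>(listeners)) {
            l.run();
        }
    }

    // Observer-style wait: check the current state *before* registering, so a
    // predicate that already holds fires immediately instead of hanging.
    void waitForState(Predicate<List<String>> predicate, Runnable onMatch) {
        if (predicate.test(state)) { // the crucial up-front check
            onMatch.run();
            return;
        }
        addListener(() -> {
            if (predicate.test(state)) {
                onMatch.run();
            }
        });
    }
}

public class ObserverSketch {
    public static void main(String[] args) {
        MiniClusterService service = new MiniClusterService();
        // The restore entry was already removed before we started waiting; a
        // plain listener would now hang until some unrelated state change.
        boolean[] notified = new boolean[1];
        service.waitForState(s -> !s.contains("restore-1"), () -> notified[0] = true);
        System.out.println(notified[0]); // prints "true" with no further update needed
    }
}
```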
Hi @original-brownbear, I've created a changelog YAML for you.
Pinging @elastic/es-distributed (Team:Distributed)
I think this nearly works, provided I read the observer code correctly.
I wonder if we can add a test that reproduces this more reliably than the original CI report?
```java
assert prevEntry.state().completed() : "expected completed snapshot state but was " + prevEntry.state();
final String uuid = response.getUuid();
final DiscoveryNode localNode = clusterService.localNode();
new ClusterStateObserver(clusterService, null, logger, threadContext).waitForNextChange(new RestoreListener(listener, localNode) {
```
Does this not have the same issue as the original listener version, since the observer grabs the current version and we thus rely on one more cluster state update?
I think we can fix it by passing `clusterService.state().version() - 1` to the observer in the constructor.
But unlike the listener, the observer also checks the current state against the predicate, so we will never wait here if the state already fulfils the predicate => this should be safe in practice, I believe (it is also verified by the way I used the observer in a lot of test infrastructure when waiting for snapshots to complete and the like).
Alright @henningandersen as discussed elsewhere, I pushed e1bbe9a now which more safely checks the current state using the pattern we also use when working around this issue in tests.
See org.elasticsearch.test.ClusterServiceUtils#awaitClusterState.
Maybe instead of adding a tricky test here, a follow-up that extracts the check-before-wait logic so it can be used in a couple of spots, and adds a test there, would be an alternative? (I promise I'd get to it this week ;))
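The extracted check-before-wait helper hinted at above might look roughly like this (a hypothetical standalone sketch, not the actual `ClusterServiceUtils` code): check the predicate first, register, then re-check to close the register/update race.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch of an extracted check-before-wait helper, in the spirit
// of org.elasticsearch.test.ClusterServiceUtils#awaitClusterState: test the
// predicate against the current state first, only then wait for updates.
class StateWaiter<T> {
    private volatile T current;
    private volatile Runnable pending; // at most one waiter, for simplicity

    StateWaiter(T initial) {
        current = initial;
    }

    void update(T newState) {
        current = newState;
        Runnable p = pending;
        if (p != null) {
            p.run();
        }
    }

    boolean await(Predicate<T> predicate, long timeout, TimeUnit unit) throws InterruptedException {
        if (predicate.test(current)) {
            return true; // condition already holds: never start waiting
        }
        CountDownLatch latch = new CountDownLatch(1);
        pending = () -> {
            if (predicate.test(current)) {
                latch.countDown();
            }
        };
        if (predicate.test(current)) {
            return true; // state changed between the first check and registration
        }
        return latch.await(timeout, unit);
    }
}
```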
LGTM.
Thanks Henning!
Note: this is still a little buggy in that sometimes a `null` `RestoreInfo` may be returned incorrectly. This is not a new bug, though, and it is actually less likely with this change than it used to be. We could fix this with more involved logic eventually, but for now this resolves the hanging APIs and the linked test failure.