Reset desired balance when a node is marked for shutdown #106998
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
I think this is ok, but I'm worried that we might miss something by considering a state as "processed" at this point. Could we instead incorporate the current shutdown markers into DesiredBalanceInput, so that we process them at the same time as the rest of the input?
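To make the idea concrete, here is a minimal, self-contained sketch of what carrying the shutdown markers inside the computation input could look like. The record shape, names, and `requiresReset` method are hypothetical simplifications, not the actual `DesiredBalanceInput` class:

```java
import java.util.Set;

// Hypothetical, simplified stand-in for the idea of putting the shutdown markers
// into the immutable input snapshot handed to the balance computation, instead of
// reading them from clusterService.state() on the submitting thread.
record BalanceInputSketch(long index, Set<String> shutdownNodeIds /* plus routing state, etc. */) {

    // The computation thread can then decide whether to reset using the very same
    // snapshot it balances against, so "processed" and "computed" cannot diverge.
    boolean requiresReset(Set<String> previouslyProcessedShutdowns) {
        return previouslyProcessedShutdowns.equals(shutdownNodeIds) == false;
    }
}
```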
assert MasterService.assertMasterUpdateOrTestThread() : Thread.currentThread().getName();
assert allocation.ignoreDisable() == false;

processNodeShutdowns(clusterService.state());
Hmm I think this should be within the computation. We haven't really "processed" the new state until we've done that.
private synchronized void processNodeShutdowns(ClusterState clusterState) {
    // Clean up processed shutdowns that are removed from the cluster metadata
    processedNodeShutdowns.removeIf(nodeId -> clusterState.metadata().nodeShutdowns().contains(nodeId) == false);
If we remove a shutdown marker from a node while the node is still in the cluster, I think that needs a reset too.
// If we remove a shutdown marker from a node, but it is still in the cluster, we'd need a reset.
boolean reset = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
// Clean up processed shutdowns that are removed from the cluster metadata
processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
I could merge these two loops, but I find it more readable this way (and the set size should always be small).
NIT: could be simplified with processedNodeShutdowns.retainAll(nodeShutdowns.keySet());
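For illustration, a tiny standalone example (placeholder names and types, not the actual Elasticsearch fields) showing that `retainAll` expresses the same clean-up as the `removeIf` lambda:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RetainAllSketch {
    public static void main(String[] args) {
        Set<String> processedNodeShutdowns = new HashSet<>(Set.of("node-1", "node-2", "node-3"));
        Map<String, String> nodeShutdowns = Map.of("node-2", "REMOVE"); // only node-2 still has a marker

        // Equivalent to: processedNodeShutdowns.removeIf(id -> nodeShutdowns.containsKey(id) == false);
        processedNodeShutdowns.retainAll(nodeShutdowns.keySet());

        System.out.println(processedNodeShutdowns); // prints [node-2]
    }
}
```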
Set<String> newShutdowns = new HashSet<>();
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART
        && processedNodeShutdowns.contains(shutdown.getKey()) == false) {
        newShutdowns.add(shutdown.getKey());
    }
}
Might be simplified:
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
        reset |= processedNodeShutdowns.add(shutdown.getKey());
    }
}
This way we do not need to allocate an additional collection.
I think in practice what we've ended up with effectively does a reset on almost all shutdown metadata changes, so I've made that explicit in 8bf3430. WDYT @idegtiarenko
    .setType(shutdownType)
    .setStartedAtMillis(randomNonNegativeLong());
if (shutdownType.equals(Type.REPLACE)) {
    singleShutdownMetadataBuilder.setTargetNodeName(randomIdentifier());
Should we set an existing node name here?
Not sure if this is important in this test.
For this test, I think it doesn't really matter.
final var clusterStateBuilder = ClusterState.builder(clusterState)
    .metadata(Metadata.builder(clusterState.metadata()).putCustom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY));
final var nodeRemovedFromCluster = randomBoolean();
I would like to suggest unconditionally doing an empty reroute before the node is removed. This way the test is more linear and simpler to read.
Why is that important? We have a reroute at L720, does that count?
It seems that the test is structured as follows now:

final var nodeRemovedFromCluster = randomBoolean();
if (nodeRemovedFromCluster) {
    // remove node
}
// conditionally assert reset and reroute
if (nodeRemovedFromCluster == false) {
    // remove node
    // assert reset and reroute
}
I think the test would be a little simpler if instead we did:

if (sometimes) {
    // do empty reroute
    // assert no reset
}
// remove node
// assert reset and reroute

or even did an empty reroute on every test execution.
I've simplified the last part of the test, and also clarified some names. Hope it helps! See e83bef4.
Left a couple of suggestions, but overall 👍
@pxsalehi is away at the moment, so I've adopted this branch for now and made some of the suggested changes.
private void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns().getAllNodeIds();
    if (nodeShutdowns.equals(processedNodeShutdowns) == false) {
I do not think we need to reset when removing a shutdown entry for a node that has already been removed, although doing so would not harm anything.
Do we need to reset for the restart case?
@elasticmachine update branch
Thanks @idegtiarenko and @DaveCTurner for continuing this. It seems this is good to merge. But before merging, I have one question: the simplification done in #106998 (comment) would also reset the desired balance for a RESTART-type shutdown, which seems unnecessary.
Sorry @pxsalehi, I started working on this and then got waylaid by other stuff. I'm not 100% sure we want to do as many resets as I've proposed. I mean, maybe we do, but don't take it as set in stone.
@DaveCTurner the extra resets seem unnecessary and look like a trade-off for simpler logic there. With the simplification we are now resetting the desired balance also when 1) the shutdown metadata is of type RESTART, and 2) a node's shutdown metadata is removed and the node is also removed (previously we would only reset if the shutdown metadata were removed but the node stayed in the cluster, see #106998 (comment)). Both of these seem unnecessary. Hence my question, since I am not super familiar with the desired balance computation: if the extra cost of these resets is not trivial, then I'd rather go back to the more selective way of resetting than the proposed one. Otherwise, I'm good with it. Any thoughts?
So for the two cases mentioned above, I don't think we need to reroute. Also, after talking with Ievgen, it seems that if we're not in a reconciled state, each reset would trigger the computation again; e.g. during a rolling restart this could lead to unnecessary resets. It's just a few lines of code more, so I've changed it to be more selective. Please give this another check. Thanks.
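Pulling the pieces of this thread together, here is a minimal sketch of the "more selective" reset decision, assuming simplified placeholder types and names rather than the real Elasticsearch classes:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SelectiveResetSketch {

    // Placeholder for SingleNodeShutdownMetadata.Type.
    enum ShutdownType { REMOVE, REPLACE, SIGTERM, RESTART }

    private final Set<String> processedNodeShutdowns = new HashSet<>();

    /**
     * @param shutdownMarkers current shutdown markers keyed by node id
     * @param clusterNodeIds  ids of the nodes currently in the cluster
     * @return whether the desired balance should be reset
     */
    boolean processNodeShutdowns(Map<String, ShutdownType> shutdownMarkers, Set<String> clusterNodeIds) {
        // Case 1: a previously processed marker was removed while the node is still in the cluster.
        boolean reset = processedNodeShutdowns.stream()
            .anyMatch(nodeId -> shutdownMarkers.containsKey(nodeId) == false && clusterNodeIds.contains(nodeId));

        // Forget markers that no longer exist in the cluster metadata.
        processedNodeShutdowns.retainAll(shutdownMarkers.keySet());

        // Case 2: a new non-RESTART marker appeared. RESTART markers do not trigger a reset,
        // and neither does removing a marker together with its node.
        for (var entry : shutdownMarkers.entrySet()) {
            if (entry.getValue() != ShutdownType.RESTART) {
                reset |= processedNodeShutdowns.add(entry.getKey());
            }
        }
        return reset;
    }
}
```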
…19968) We prevent retries of allocations/relocations once they have seen index.allocation.max_retries failed attempts (default 5). In #108987, we added resetting of the allocation failure counters when a node joins the cluster. As discussed in the linked discussion, it makes sense to extend this reset to relocations and also to consider node shutdown events. With this change we reset both allocation and relocation failures when a new node joins the cluster or a shutdown metadata entry is applied. The subset of shutdown events that we consider, and how we track them, is more or less copied from what was done for #106998. To me the logic seemed to make sense here too. Closes ES-10492
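As a toy illustration of the retry gate described above (not Elasticsearch code; the class, method names, and reset trigger are simplifications of the behaviour in the linked PRs):

```java
final class RetryGateSketch {
    static final int MAX_RETRIES = 5; // default of index.allocation.max_retries

    private int failedAllocations = 0;

    boolean canRetry() {
        // Once the counter reaches the limit, the shard stops being retried...
        return failedAllocations < MAX_RETRIES;
    }

    void onAllocationFailed() {
        failedAllocations++;
    }

    void onNodeJoinedOrShutdownMetadataApplied() {
        // ...unless an event that might change the outcome resets the counter.
        failedAllocations = 0;
    }
}
```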
This relies only on the shutdown metadata and not on the actual nodes leaving the cluster. It keeps some state tracking which shutdown metadata entries exist in the cluster and have been processed. The assumption here is that these are properly applied and removed, so that the state maintained inside the allocator remains correct.
Closes ES-7916