
Conversation

@pxsalehi (Member) commented Apr 2, 2024

This relies only on shutdown metadata, not on the actual nodes leaving the cluster. It keeps some state tracking which shutdown metadata exist in the cluster and have been processed. The assumption is that these markers are properly applied and removed, so that the state maintained inside the Allocator stays correct.

Closes ES-7916
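In outline (a minimal sketch, not the exact committed code; resetDesiredBalance() is a placeholder for the allocator's reset hook), the allocator tracks which shutdown markers it has already acted on and resets the desired balance when a new non-RESTART marker appears:

private final Set<String> processedNodeShutdowns = new HashSet<>();

private synchronized void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
    // Forget shutdowns whose metadata has been removed from the cluster state.
    processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
    // A shutdown marker we have not processed yet (ignoring RESTARTs) invalidates the
    // current desired balance, since shards must be moved off the departing node.
    boolean reset = false;
    for (var shutdown : nodeShutdowns.getAll().entrySet()) {
        if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
            reset |= processedNodeShutdowns.add(shutdown.getKey());
        }
    }
    if (reset) {
        resetDesiredBalance(); // placeholder for the allocator's reset hook
    }
}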

@pxsalehi added the >non-issue and :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) labels Apr 2, 2024
@pxsalehi marked this pull request as ready for review April 3, 2024 17:06
@elasticsearchmachine added the Team:Distributed (Obsolete) label (Meta label for distributed team, obsolete; replaced by Distributed Indexing/Coordination) Apr 3, 2024

@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor) left a comment:

I think this is ok, but I'm worried that we might miss something by considering a state as "processed" at this point. Could we instead incorporate the current shutdown markers into DesiredBalanceInput, so that we process them at the same time as the rest of the input?

assert MasterService.assertMasterUpdateOrTestThread() : Thread.currentThread().getName();
assert allocation.ignoreDisable() == false;

processNodeShutdowns(clusterService.state());

Contributor:

Hmm I think this should be within the computation. We haven't really "processed" the new state until we've done that.


private synchronized void processNodeShutdowns(ClusterState clusterState) {
    // Clean up processed shutdowns that are removed from the cluster metadata
    processedNodeShutdowns.removeIf(nodeId -> clusterState.metadata().nodeShutdowns().contains(nodeId) == false);

Contributor:

If we remove a shutdown marker from a node while the node is still in the cluster, I think that needs a reset too.

Comment on lines +206 to +210
// If we remove a shutdown marker from a node, but it is still in the cluster, we'd need a reset.
boolean reset = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
// Clean up processed shutdowns that are removed from the cluster metadata
processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);

Member Author:

I could merge these two loops, but I find it more readable this way (and the set size should always be small).

Contributor:

NIT: could be simplified with processedNodeShutdowns.retainAll(nodeShutdowns.keySet());

@pxsalehi requested a review from DaveCTurner April 5, 2024 09:52
Comment on lines 212 to 218
Set<String> newShutdowns = new HashSet<>();
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART
        && processedNodeShutdowns.contains(shutdown.getKey()) == false) {
        newShutdowns.add(shutdown.getKey());
    }
}

Contributor:

Might be simplified:

Suggested change, replacing:

Set<String> newShutdowns = new HashSet<>();
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART
        && processedNodeShutdowns.contains(shutdown.getKey()) == false) {
        newShutdowns.add(shutdown.getKey());
    }
}

with:

for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
        reset |= processedNodeShutdowns.add(shutdown.getKey());
    }
}

This way we do not need to allocate an additional collection.

Contributor:

I think in practice what we've ended up with effectively does a reset on almost all shutdown metadata changes, so I've made that explicit in 8bf3430. WDYT @idegtiarenko
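
For illustration, a hedged sketch of what making that reset explicit on any change to the set of shutdown node IDs could look like (resetDesiredBalance() is a placeholder name, not necessarily the committed code):

private void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns().getAllNodeIds();
    if (nodeShutdowns.equals(processedNodeShutdowns) == false) {
        resetDesiredBalance(); // placeholder for the allocator's reset hook
        processedNodeShutdowns.clear();
        processedNodeShutdowns.addAll(nodeShutdowns);
    }
}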

    .setType(shutdownType)
    .setStartedAtMillis(randomNonNegativeLong());
if (shutdownType.equals(Type.REPLACE)) {
    singleShutdownMetadataBuilder.setTargetNodeName(randomIdentifier());

@idegtiarenko (Contributor) commented Apr 9, 2024:

Should we set an existing node name here?
Not sure if this is important in this test.

Member Author:

For this test, I think it doesn't really matter.


final var clusterStateBuilder = ClusterState.builder(clusterState)
    .metadata(Metadata.builder(clusterState.metadata()).putCustom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY));
final var nodeRemovedFromCluster = randomBoolean();

Contributor:

I would like to suggest unconditionally doing an empty reroute before the node is removed.
This way the test is more linear and simpler to read.

Member Author:

Why is that important? We have a reroute at L720, does that count?

Contributor:

It seems that the test is structured as follows now:

final var nodeRemovedFromCluster = randomBoolean();
if (nodeRemovedFromCluster) {
    // remove node
}
// conditionally assert reset and reroute
if (nodeRemovedFromCluster == false) {
    // remove node
    // assert reset and reroute
}

I think the test would be a little simpler if instead we did:

if (sometimes) {
    // do empty reroute
    // assert no reset
}
// remove node
// assert reset and reroute

or even did an empty reroute on every test execution.

Member Author:

I've simplified the last part of the test, and also clarified some names. Hope it helps! See e83bef4.

idegtiarenko previously approved these changes Apr 9, 2024

@idegtiarenko (Contributor) left a comment:

Left a couple of suggestions, but overall 👍

@DaveCTurner dismissed idegtiarenko’s stale review April 9, 2024 09:28

Worth another look first

@DaveCTurner (Contributor) commented:

@pxsalehi is away at the moment, so I've adopted this branch for now and made some of the suggested changes.


private void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns().getAllNodeIds();
    if (nodeShutdowns.equals(processedNodeShutdowns) == false) {

Contributor:

I do not think we need to reset when removing the shutdown entry for a node that has already been removed, although doing that reset would not harm anything either.

Do we need to reset for the restart case?

@pxsalehi (Member Author) commented:

@elasticmachine update branch

@pxsalehi (Member Author) commented:

Thanks @idegtiarenko and @DaveCTurner for continuing this. It seems this is good to merge. But before merging, I have one question: the simplification done in #106998 (comment) would also reset the desired balance for a RESTART, which seems unnecessary. Is that harmless in terms of the extra work/computation it will cause?

@DaveCTurner (Contributor) commented:

Sorry @pxsalehi, I started working on this and then got waylaid by other stuff. I'm not 100% sure we want to do as many resets as I've proposed. I mean, maybe we do, but don't take it as set in stone.

@pxsalehi (Member Author) commented:

@DaveCTurner the extra resets seem unnecessary and look like a trade-off for having simpler logic there. With the simplification we now also reset the desired balance when 1) the shutdown metadata is of type RESTART, and 2) a node's shutdown metadata is removed and the node is also removed (previously we would only reset if the shutdown metadata were removed but the node stayed in the cluster, see #106998 (comment)). Both of these seem unnecessary. Hence my question, since I am not super familiar with the desired balance computation: if the extra cost of these resets is not trivial, I'd rather go back to the more selective way of resetting than the proposed one. Otherwise, I'm good with it. Any thoughts?

@pxsalehi (Member Author) left a comment:

So for the two cases mentioned above, I don't think we need to reroute. Also, after talking with Ievgen, it seems that if we're not in a reconciled state, each reset would trigger the computation again; e.g. during a rolling restart this could lead to unnecessary resets. It's just a few more lines of code, so I've changed it to be more selective. Please give this another check. Thanks.
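
A hedged sketch of the more selective variant described above, building on the processedNodeShutdowns set from the earlier sketches (resetDesiredBalance() remains a placeholder; see the merged change for the actual code): reset only when a marker is removed while its node is still in the cluster, or when a new non-RESTART marker appears.

private synchronized void processNodeShutdowns(ClusterState clusterState) {
    final var nodes = clusterState.nodes();
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
    // A marker removed while its node stays in the cluster means the node is no longer
    // leaving, so the desired balance computed around its departure is stale.
    boolean reset = processedNodeShutdowns.stream()
        .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
    // Clean up processed shutdowns that were removed from the cluster metadata.
    processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
    // A newly observed non-RESTART shutdown also invalidates the current desired balance.
    for (var shutdown : nodeShutdowns.getAll().entrySet()) {
        if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
            reset |= processedNodeShutdowns.add(shutdown.getKey());
        }
    }
    if (reset) {
        resetDesiredBalance(); // placeholder for the allocator's reset hook
    }
}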


@pxsalehi requested review from DaveCTurner and idegtiarenko and removed request for DaveCTurner April 30, 2024 09:16
@pxsalehi added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label May 2, 2024
@elasticsearchmachine merged commit 0a410b5 into elastic:main May 2, 2024
@pxsalehi deleted the ps240402-resetDesiredBalanceOnNodeShutdownMarker branch May 2, 2024 09:56
pxsalehi added a commit that referenced this pull request Jan 28, 2025
…19968) We prevent retries of allocations/relocations once they see index.allocation.max_retries failed attempts (default 5). In #108987, we added resetting the allocation failure counters when a node joins the cluster. As discussed in the linked discussion, it would make sense to extend this reset to relocations and to also consider node shutdown events. With this change we reset both allocation and relocation failures when a new node joins the cluster or shutdown metadata is applied. The subset of shutdown events that we consider and how we track them is more or less copied from what was done for #106998. To me the logic seemed to make sense here too. Closes ES-10492

Labels

auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
:Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
>non-issue
Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
v8.15.0

5 participants