
Conversation

@pxsalehi (Member) commented Apr 2, 2024

This relies only on shutdown metadata, not on the actual nodes leaving the cluster. It keeps some state tracking which shutdown metadata exist in the cluster and have been processed. The assumption is that these markers are properly applied and removed, so that the state maintained inside the Allocator stays correct.

Closes ES-7916
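In outline (a minimal sketch, not the exact committed code; resetDesiredBalance() is a placeholder for the allocator's reset hook), the allocator tracks which shutdown markers it has already acted on and resets the desired balance when a new non-RESTART marker appears:

private final Set<String> processedNodeShutdowns = new HashSet<>();

private synchronized void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
    // Forget shutdowns whose metadata has been removed from the cluster state.
    processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
    // A shutdown marker we have not processed yet (ignoring RESTARTs) invalidates the
    // current desired balance, since shards must be moved off the departing node.
    boolean reset = false;
    for (var shutdown : nodeShutdowns.getAll().entrySet()) {
        if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
            reset |= processedNodeShutdowns.add(shutdown.getKey());
        }
    }
    if (reset) {
        resetDesiredBalance(); // placeholder for the allocator's reset hook
    }
}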

@pxsalehi added the >non-issue and :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) labels Apr 2, 2024
@pxsalehi marked this pull request as ready for review April 3, 2024 17:06
@elasticsearchmachine added the Team:Distributed (Obsolete) label (Meta label for distributed team, obsolete; replaced by Distributed Indexing/Coordination) Apr 3, 2024

@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor) left a comment:

I think this is ok, but I'm worried that we might miss something by considering a state as "processed" at this point. Could we instead incorporate the current shutdown markers into DesiredBalanceInput, so that we process them at the same time as the rest of the input?

assert MasterService.assertMasterUpdateOrTestThread() : Thread.currentThread().getName();
assert allocation.ignoreDisable() == false;

processNodeShutdowns(clusterService.state());

Contributor:

Hmm I think this should be within the computation. We haven't really "processed" the new state until we've done that.


private synchronized void processNodeShutdowns(ClusterState clusterState) {
    // Clean up processed shutdowns that are removed from the cluster metadata
    processedNodeShutdowns.removeIf(nodeId -> clusterState.metadata().nodeShutdowns().contains(nodeId) == false);

Contributor:

If we remove a shutdown marker from a node while the node is still in the cluster, I think that needs a reset too.

Comment on lines +206 to +210
// If we remove a shutdown marker from a node, but it is still in the cluster, we'd need a reset.
boolean reset = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
// Clean up processed shutdowns that are removed from the cluster metadata
processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);

Member Author:

I could merge these two loops, but I find it more readable this way (and the set size should always be small).

Contributor:

NIT: could be simplified with processedNodeShutdowns.retainAll(nodeShutdowns.keySet());

@pxsalehi requested a review from DaveCTurner April 5, 2024 09:52
Comment on lines 212 to 218
Set<String> newShutdowns = new HashSet<>();
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART
        && processedNodeShutdowns.contains(shutdown.getKey()) == false) {
        newShutdowns.add(shutdown.getKey());
    }
}

Contributor:

Might be simplified:

Suggested change, replacing:

Set<String> newShutdowns = new HashSet<>();
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART
        && processedNodeShutdowns.contains(shutdown.getKey()) == false) {
        newShutdowns.add(shutdown.getKey());
    }
}

with:

for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
        reset |= processedNodeShutdowns.add(shutdown.getKey());
    }
}

This way we do not need to allocate an additional collection.

Contributor:

I think in practice what we've ended up with effectively does a reset on almost all shutdown metadata changes, so I've made that explicit in 8bf3430. WDYT @idegtiarenko
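
For illustration, a hedged sketch of what making that reset explicit on any change to the set of shutdown node IDs could look like (resetDesiredBalance() is a placeholder name, not necessarily the committed code):

private void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns().getAllNodeIds();
    if (nodeShutdowns.equals(processedNodeShutdowns) == false) {
        resetDesiredBalance(); // placeholder for the allocator's reset hook
        processedNodeShutdowns.clear();
        processedNodeShutdowns.addAll(nodeShutdowns);
    }
}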

    .setType(shutdownType)
    .setStartedAtMillis(randomNonNegativeLong());
if (shutdownType.equals(Type.REPLACE)) {
    singleShutdownMetadataBuilder.setTargetNodeName(randomIdentifier());

@idegtiarenko (Contributor) commented Apr 9, 2024:

Should we set an existing node name here?
Not sure if this is important in this test.

Member Author:

For this test, I think it doesn't really matter.


final var clusterStateBuilder = ClusterState.builder(clusterState)
    .metadata(Metadata.builder(clusterState.metadata()).putCustom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY));
final var nodeRemovedFromCluster = randomBoolean();

Contributor:

I would like to suggest unconditionally doing an empty reroute before the node is removed.
This way the test is more linear and simpler to read.

Member Author:

Why is that important? We have a reroute at L720, does that count?

Contributor:

It seems that the test is structured as follows now:

final var nodeRemovedFromCluster = randomBoolean();
if (nodeRemovedFromCluster) {
    // remove node
}
// conditionally assert reset and reroute
if (nodeRemovedFromCluster == false) {
    // remove node
    // assert reset and reroute
}

I think the test would be a little simpler if instead we did:

if (sometimes) {
    // do empty reroute
    // assert no reset
}
// remove node
// assert reset and reroute

or even did an empty reroute on every test execution.

Member Author:

I've simplified the last part of the test, and also clarified some names. Hope it helps! See e83bef4.

idegtiarenko previously approved these changes Apr 9, 2024

@idegtiarenko (Contributor) left a comment:

Left a couple of suggestions, but overall 👍

@DaveCTurner dismissed idegtiarenko’s stale review April 9, 2024 09:28

Worth another look first

@DaveCTurner (Contributor) commented:

@pxsalehi is away at the moment, so I've adopted this branch for now and made some of the suggested changes.


private void processNodeShutdowns(ClusterState clusterState) {
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns().getAllNodeIds();
    if (nodeShutdowns.equals(processedNodeShutdowns) == false) {

Contributor:

I do not think we need to reset when removing the shutdown entry for a node that has already been removed, although doing that reset would not harm anything either.

Do we need to reset for the restart case?

@pxsalehi (Member Author) commented:

@elasticmachine update branch

@pxsalehi (Member Author) commented:

Thanks @idegtiarenko and @DaveCTurner for continuing this. It seems this is good to merge. But before merging, I have one question: the simplification done in #106998 (comment) would also reset the desired balance for a RESTART, which seems unnecessary. Is that harmless in terms of the extra work/computation it will cause?

@DaveCTurner (Contributor) commented:

Sorry @pxsalehi, I started working on this and then got waylaid by other stuff. I'm not 100% sure we want to do as many resets as I've proposed. I mean, maybe we do, but don't take it as set in stone.

@pxsalehi (Member Author) commented:

@DaveCTurner the extra resets seem unnecessary and look like a trade-off for having simpler logic there. With the simplification we now also reset the desired balance when 1) the shutdown metadata is of type RESTART, and 2) a node's shutdown metadata is removed and the node is also removed (previously we would only reset if the shutdown metadata were removed but the node stayed in the cluster, see #106998 (comment)). Both of these seem unnecessary. Hence my question, since I am not super familiar with the desired balance computation: if the extra cost of these resets is not trivial, I'd rather go back to the more selective way of resetting than the proposed one. Otherwise, I'm good with it. Any thoughts?

@pxsalehi (Member Author) left a comment:

So for the two cases mentioned above, I don't think we need to reroute. Also, after talking with Ievgen, it seems that if we're not in a reconciled state, each reset would trigger the computation again; e.g. during a rolling restart this could lead to unnecessary resets. It's just a few more lines of code, so I've changed it to be more selective. Please give this another check. Thanks.
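
A hedged sketch of the more selective variant described above, building on the processedNodeShutdowns set from the earlier sketches (resetDesiredBalance() remains a placeholder; see the merged change for the actual code): reset only when a marker is removed while its node is still in the cluster, or when a new non-RESTART marker appears.

private synchronized void processNodeShutdowns(ClusterState clusterState) {
    final var nodes = clusterState.nodes();
    final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
    // A marker removed while its node stays in the cluster means the node is no longer
    // leaving, so the desired balance computed around its departure is stale.
    boolean reset = processedNodeShutdowns.stream()
        .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
    // Clean up processed shutdowns that were removed from the cluster metadata.
    processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
    // A newly observed non-RESTART shutdown also invalidates the current desired balance.
    for (var shutdown : nodeShutdowns.getAll().entrySet()) {
        if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
            reset |= processedNodeShutdowns.add(shutdown.getKey());
        }
    }
    if (reset) {
        resetDesiredBalance(); // placeholder for the allocator's reset hook
    }
}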


@pxsalehi requested review from DaveCTurner and idegtiarenko and removed request for DaveCTurner April 30, 2024 09:16
@pxsalehi added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label May 2, 2024
@elasticsearchmachine merged commit 0a410b5 into elastic:main May 2, 2024
@pxsalehi deleted the ps240402-resetDesiredBalanceOnNodeShutdownMarker branch May 2, 2024 09:56
pxsalehi added a commit that referenced this pull request Jan 28, 2025
…19968) We prevent retries of allocations/relocations once they see index.allocation.max_retries failed attempts (default 5). In #108987, we added resetting the allocation failure counters when a node joins the cluster. As discussed in the linked discussion, it would make sense to extend this reset to relocations and to also consider node shutdown events. With this change we reset both allocation and relocation failures when a new node joins the cluster or shutdown metadata is applied. The subset of shutdown events that we consider and how we track them is more or less copied from what was done for #106998. To me the logic seemed to make sense here too. Closes ES-10492

Labels

auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
:Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
>non-issue
Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
v8.15.0

5 participants