Skip to content

Conversation

@idegtiarenko
Copy link
Contributor

@idegtiarenko idegtiarenko commented May 8, 2023

As pointed in #93993 (comment) it is possible that desired balance is computed before delayed allocations are expired but reconciled after that. As a result delay markers are removed but shards are not allocated as their desired locations are not computed yet and follow up allocation is not scheduled as all markers might be expired and removed.

This change attempts to fix this bug by injecting time into reconciliation representing the time of the latest balance computed.

Close: #93993
Close: #93470

@idegtiarenko idegtiarenko added >bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.9.0 labels May 8, 2023
@idegtiarenko idegtiarenko requested review from DaveCTurner and arteam May 8, 2023 15:51
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Copy link
Collaborator

Hi @idegtiarenko, I've created a changelog YAML for you.

* @param assignments a set of the (persistent) node IDs to which each {@link ShardId} should be allocated
*/
public record DesiredBalance(long lastConvergedIndex, Map<ShardId, ShardAssignment> assignments) {
public record DesiredBalance(long lastConvergedIndex, long currentTimeNano, Map<ShardId, ShardAssignment> assignments) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time is captured here

@FunctionalInterface
public interface DesiredBalanceReconcilerAction {
ClusterState apply(ClusterState clusterState, Consumer<RoutingAllocation> routingAllocationAction);
ClusterState apply(ClusterState clusterState, long currentTimeNano, Consumer<RoutingAllocation> routingAllocationAction);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And passed to the reconciler via this interface. The rest are mostly changes to accommodate new parameter.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather we fixed this as I suggested in #93993 (comment): remove the delay markers only in a full reroute and not in a reconcile.

@idegtiarenko idegtiarenko changed the title Attempt to fix delay allocation Fix delay allocation May 9, 2023
@idegtiarenko
Copy link
Contributor Author

Updated to use a flag. Though I am not sure if it is a good idea. May be we should have a dedicated reconcile method on allocation service?

@idegtiarenko idegtiarenko requested a review from DaveCTurner May 16, 2023 13:15
Comment on lines 531 to 533
if (routingAllocationAction.removeDelayMarkers()) {
removeDelayMarkers(allocation);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather have another method on RoutingAllocationAction which consumes the RoutingAllocation (and which by default removes the delay markers) instead of one which just returns a boolean.

}

@FunctionalInterface
public interface RoutingAllocationAction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit/suggestion - "Action" is overloaded and this isn't really an action, it's more like a strategy. How about RerouteStrategy?

@idegtiarenko idegtiarenko requested a review from DaveCTurner May 16, 2023 14:55
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@idegtiarenko idegtiarenko merged commit e6cf790 into elastic:main May 17, 2023
@idegtiarenko idegtiarenko deleted the fix_delay_allocation branch May 17, 2023 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.9.0

3 participants