Skip to content

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Oct 20, 2025

For allocation simulation, allocation decisions should not be THROTTLE for moveShards and balance. Allocate unassigned shards should not see THROTTLE unless there is prior shard assignment. This PR makes this statement explicit by fixing test allocation decider and HasFrozenAllocationDecider as well as adding relevant assertions.

Relates: ES-12955

Other than HasFrozenCacheAllocationDecider, allocation decision should not be THROTTLE unless there is prior shard assignment/movement. This PR makes this statement explicit by adding relevant assertions and fixing test allocation decider. Relates: ES-12955
@ywangd ywangd added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v9.3.0 labels Oct 20, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member Author

ywangd commented Oct 20, 2025

Initially, I wanted to ensure that no THROTTLE can be returned at all unless there is prior shard movement. But HasFrozenCacheAllocationDecider is not compatible with the idea. We also have existing code explicitly checking for it. So it seems that we have to live with this exception. As a result, I added assertions and fixed RandomAllocationDecider to make this exception more explicit. I am open to suggestions.

Comment on lines 712 to 713
// 2. Or, a partial searchable snapshot index due to HasFrozenCacheAllocationDecider
assert allocation.isSimulating() == false || indexMetadata.isPartialSearchableSnapshot()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could check that the THROTTLE decision is actually from the HasFrozenCacheAllocationDecider. But it's not really feasible. Instead, we can check it's a frozen index. Not really ideal, but seems to be the best we can do.

Copy link
Contributor

@nicktindall nicktindall Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of putting knowledge of specific deciders in here, as discussed, I wonder if HasFrozenAllocationDecider is abusing THROTTLE anyway. My reasoning was:

In the canAllocateUnassigned logic we'll wait for the outcome of a lower-weight THROTTLE when we have higher-weight YES so that would mean this decider can defer the allocation of a partial searchable snapshot while a non-frozen node was initialising doesn't it?
I would think we'd return NO until the setting was loaded?

It seems like returning THROTTLED while we wait to find out if the node can host frozen shards could mean we defer allocating the shard to another node that we KNOW can host frozen shards, while waiting to find out if an initialising node is a frozen node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we defer allocating the shard to another node that we KNOW can host frozen shards, while waiting to find out if an initialising node is a frozen node.

That is true. I also wonder whether it is intentional since otherwise the shards will be assigned to an existing node only to relocate later momentarily. That said, after reading through code around HasFrozenAllocationDecider, I believe it is OK to return NO since it should not change the beahviour. That is because we always call allocateExistingUnassignedShards before calling into the allocator. This method uses SearchableSnapshotAllocator to allocate unassigne shards first. It also checks cache status and reject allocation when cache state is still being fetched. It in turns places the shard in the ignored list and BalancedShardAllocator ignores them as well. In short, the allocation decision for paritially mounted shards are made earlier in a different place so that it should not matter for the decider to return NO.

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this change LGTM pending the question of whether we can make HasFrozenCacheAllocationDecider return NO while we're initialising the node so that we can remove the special knowledge of it from here?

Copy link
Contributor

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this task is to gain confidence that the THROTTLE decision is not encountered, but right now this patch doesn't provide that confidence. This isn't in the hot-path, so I don't think we need to rush the change. I'd rather sort out the Decider before proceeding.

@ywangd ywangd marked this pull request as draft October 20, 2025 23:12
@ywangd ywangd marked this pull request as ready for review October 21, 2025 07:17
@ywangd ywangd changed the title Tighten when THROTTLE decision can be returned Tighten on when THROTTLE decision can be returned Oct 21, 2025
@ywangd
Copy link
Member Author

ywangd commented Oct 21, 2025

@nicktindall @DiannaHohensee The PR is ready for another look. In summary, I have taken Nick's suggestion for returning NO from HasFrozenAllocationDecider.

I spent quite some time reading through relevant code for searchable snapshot indices allocation. I believe it is safe to do so.

  1. Unassigned searchable snapshot shards are allocated elsewhere, see also this comment.
  2. For relocations, throttle is effectively a NO and unlike unassigned shards, there is no urgency for them.
Comment on lines 86 to 99
default -> STILL_FETCHING;
case FETCHING -> allocation.isSimulating() ? NO_STILL_FETCHING : STILL_FETCHING;
case UNKNOWN -> UNKNOWN_NODE;
Copy link
Member Author

@ywangd ywangd Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HasFrozenAllocationDecider does not differentiate between fetching node state (FETCHING) and no-such-node (UNKNOWN). The later can only happen when a node leaves the cluster which seems more justified to say NO. In any case, they do not really impact SearchableSnapshotAllocator which is what matters for allocating unassigned searchable snapshot shards.
Btw, I can also undo the split if it is considered unrelated.

Comment on lines 488 to 498
if (hasChanges == false && info.lastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_THROTTLED) {
// Unassigned ignored shards must be based on the provided set of ignoredShards
assert ignoredShards.contains(discardAllocationStatus(shard))
|| ignoredShards.stream().filter(ShardRouting::primary).anyMatch(primary -> primary.shardId().equals(shard.shardId()))
: "ignored shard "
+ shard
+ " unexpectedly has THROTTLE status and no counterpart in the provided ignoredShards set "
+ ignoredShards;
// Simulation could not progress due to missing information in any of the deciders.
// Currently, this could happen if `HasFrozenCacheAllocationDecider` is still fetching the data.
// Progress would be made after the followup reroute call.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about HasFrozenCacheAllocationDecider had me concerned for some time. Adding this assertion to make it clear that throttled status can only come from DesiredBalanceInput#ignoredShards, i.e. they are not produced by BalancedShardAllocator.

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only comment is I wonder if we should do the change to NO for "still fetching" and the additional assertions as separate PRs, but happy with whatever you decide there.

@ywangd
Copy link
Member Author

ywangd commented Oct 28, 2025

@nicktindall I refactor the change for UNKNOWN state to a separate PR #137232. Let me know if this matches what you had in mind.

Copy link
Contributor

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments about nested assertions. In general, can we move the assertions out of the conditional paths and assert that not encountering THROTTLE during simulation is generally true, as opposed to conditionally? Unless I've misunderstood the necessity?

+ allocation.isSimulating()
+ ") when balancing index ["
+ index
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the assert in a conditional branch makes this seem like certain paths are okay for THROTTLE. Suppose we were in a simulation and somehow got more than 1 relocating shard at once: that would be unsafe, right? Is there any reason we can't safely always have the assert in place? Like,

assert allocation.isSimulating() == false || routingNodes.getRelocatingShardCount() <= 1 : "...."; if (routingNodes.getRelocatingShardCount() > 0) { shardBalanced = true; } if (completeEarlyOnShardAssignmentChange && shardBalanced) { .... 
+ allocation.isSimulating()
+ ") with no prior assignment when allocating unassigned shard ["
+ shard
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting this in a conditional, can we assert that either isSimulating==false || decision is not throttle up on line 1272? Or better yet, right after the decideAllocateUnassigned call, outside another layer of if-else condition?

}

// A THROTTLE allocation decision can happen when not simulating
assert moveDecision.getAllocationDecision() != AllocationDecision.THROTTLED || allocation.isSimulating() == false
Copy link
Contributor

@DiannaHohensee DiannaHohensee Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be done right after the decideMove call? There's an early return path above on which we'd miss this assert.

elasticsearchmachine pushed a commit that referenced this pull request Oct 30, 2025
…137232) Since SearchableSnapshotAllocator updates frozen cache service before allocation, the unknown state for a node can only happen when the node leaves the cluster. In this case, it makes more sense to reject the allocation instead of throttling since the node may not come back. This is also more consistent with the behaviour of SearchableSnapshotAllocator which does not consider any node that is not currently in the cluster (effectively a No). If the node does come back, it will go through again the cache info fetching steps. Relates: #136794
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Nov 3, 2025
…lastic#137232) Since SearchableSnapshotAllocator updates frozen cache service before allocation, the unknown state for a node can only happen when the node leaves the cluster. In this case, it makes more sense to reject the allocation instead of throttling since the node may not come back. This is also more consistent with the behaviour of SearchableSnapshotAllocator which does not consider any node that is not currently in the cluster (effectively a No). If the node does come back, it will go through again the cache info fetching steps. Relates: elastic#136794
@ywangd
Copy link
Member Author

ywangd commented Nov 5, 2025

@DiannaHohensee Sorry for the late reply here. Got distracted by other work and then PTO. I pushed c62be8e based on your feedback for asserting in a more general path. Please let me know if the latest change works better for you.

@ywangd ywangd requested a review from DiannaHohensee November 5, 2025 09:05
Comment on lines 99 to 100
// TODO: considering returning NO as well for non-simulation https://elasticco.atlassian.net/browse/ES-13378
case FETCHING -> allocation.isSimulating() ? NO_STILL_FETCHING : STILL_FETCHING;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed over the other PR. I am keeping throttle for non-simulation case for the time being. I added a link to the JIRA ticket for considering its removal.

Copy link
Contributor

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I left some additional recommendations. Thanks for moving up the asserts to apply more generally.

lowIdx = 0;
highIdx = relevantNodes - 1;

assert allocation.isSimulating() == false || routingNodes.getRelocatingShardCount() > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have asked this in an earlier review round, but shouldn't we expect routingNodes.getRelocatingShardCount() to be exactly 1 when isSimulating is true? IIUC, tryRelocateShard will increment routingNodes.getRelocatingShardCount() when there's a YES answer but leave it unchanged for THROTTLE, per this code

canAllocateOrRebalance == Type.YES
/* only allocate on the cluster if we are not throttled */
? routingNodes.relocateShard(shard, minNode.getNodeId(), shardSize, "rebalance", allocation.changes()).v1()
: shard.relocate(minNode.getNodeId(), shardSize)

Combined with the one shard move and then exit logic added recently, the assert would then be

simulation is false
OR
relocatingShardCount == 1 (and implicitly simulation is true)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can tighten it further see 6cdfdc1

// A THROTTLE allocation decision can happen when not simulating
assert moveDecision.isDecisionTaken() == false
|| moveDecision.getAllocationDecision() != AllocationDecision.THROTTLED
|| allocation.isSimulating() == false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt nit: might flip isSimulating above != THROTTLED check, seems easier to read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure see 1726061

// 2. Or, there is shard assigned before this one
assert allocationDecision.getAllocationStatus() != AllocationStatus.DECIDERS_THROTTLED
|| allocation.isSimulating() == false
|| shardAssignmentChanged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unassigned replica shards can returning THROTTLE because only the primary shard path is THROTTLE free during simulation. At least until I get ES-12942 complete. I take it that's the problem here?

Since we have more or less ignored the allocateUnassigned code path for not-preferred, feel free to skip this assert: I'm not sure what benefit it adds right now. Or you can add a TODO referencing ES-12942 and I expect I should be able to tighten this up. We'll revisit allocateUnassigned for not-preferred in ES-13279, too.

Copy link
Member Author

@ywangd ywangd Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unassigned replica shards can returning THROTTLE because only the primary shard path is THROTTLE free during simulation. At least until I get ES-12942 complete. I take it that's the problem here?

Yes ThrottlingAllocationDecider can still return throttling decision for unassigne replica shards. But ES-12942 is a separate issue. It's about early return in DesiredBalanceComputer instead of BalancedShardAllocator. It currently short-circuits only for primary shard assignement. Hence, changing ThrottlingAllocationDecider returning YES for replica shards is not really relevant since replica shards can already be assigned today but we don't publish the balance quickly enough after they are assigned.

In summary, let me raise a follow-up to fix ThrottlingAllocationDecider so that it does not return throttling in simulation. It's all part of the current JIRA ES-12955

case NO_CACHE -> NO_FROZEN_CACHE;
case FAILED -> UNKNOWN_FROZEN_CACHE;
case FETCHING -> STILL_FETCHING;
// TODO: considering returning NO as well for non-simulation https://elasticco.atlassian.net/browse/ES-13378
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public repo: should we shorten to ES-13378 since the link won't work for external users?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Pushed 228c35d

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Nov 6, 2025
@elasticsearchmachine elasticsearchmachine merged commit 29f2d0c into elastic:main Nov 6, 2025
34 checks passed
@ywangd ywangd deleted the ES-12955-no-throttle-before-movement branch November 6, 2025 03:52
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 6, 2025
For allocation simulation, allocation decisions should not be THROTTLE for moveShards and balance. Allocate unassigned shards should not see THROTTLE unless there is prior shard assignment. This PR makes this statement explicit by fixing test allocation decider and HasFrozenAllocationDecider as well as adding relevant assertions. Relates: ES-12955
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json * upstream/main: Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691 Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690 [LTR] Fix feature display order when using explain. (elastic#137671) Remove extra RemoteClusterService instances in unit test (elastic#137647) Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669) Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233) Update bundled JDK to 25.0.1 (elastic#137640) resolve indices for prefixed _all expressions (elastic#137330) ESQL: Add TopN support for exponential histograms (elastic#137313) allows field caps to be cross project (elastic#137530) ESQL: Add exponential histogram percentile function (elastic#137553) Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636) Tighten on when THROTTLE decision can be returned (elastic#136794) Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655 Add a test for two little known conditional processor paths (elastic#137645) Extract a common ORIGIN constant (elastic#137612) Remove early phase failure in batched (elastic#136889) Returning correct index mode from get data streams api (elastic#137646) [ML] Manage AD results indices (elastic#136065)
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json * upstream/main: Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691 Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690 [LTR] Fix feature display order when using explain. (elastic#137671) Remove extra RemoteClusterService instances in unit test (elastic#137647) Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669) Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233) Update bundled JDK to 25.0.1 (elastic#137640) resolve indices for prefixed _all expressions (elastic#137330) ESQL: Add TopN support for exponential histograms (elastic#137313) allows field caps to be cross project (elastic#137530) ESQL: Add exponential histogram percentile function (elastic#137553) Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636) Tighten on when THROTTLE decision can be returned (elastic#136794) Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655 Add a test for two little known conditional processor paths (elastic#137645) Extract a common ORIGIN constant (elastic#137612) Remove early phase failure in batched (elastic#136889) Returning correct index mode from get data streams api (elastic#137646) [ML] Manage AD results indices (elastic#136065)
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
For allocation simulation, allocation decisions should not be THROTTLE for moveShards and balance. Allocate unassigned shards should not see THROTTLE unless there is prior shard assignment. This PR makes this statement explicit by fixing test allocation decider and HasFrozenAllocationDecider as well as adding relevant assertions. Relates: ES-12955
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

4 participants