HasFrozenCacheAllocationDecider returns No decision for unknown node #137232

ywangd · 2025-10-28T07:14:25Z

Since SearchableSnapshotAllocator updates the frozen cache service before allocation, the unknown state for a node can only occur when the node has left the cluster. In this case, it makes more sense to reject the allocation than to throttle it, since the node may not come back. This is also more consistent with the behaviour of SearchableSnapshotAllocator, which does not consider any node that is not currently in the cluster (effectively a No). If the node does come back, it will go through the cache info fetching steps again.

Relates: #136794
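
To make the intended mapping concrete, here is a minimal Java sketch of the per-state decision described above. It is not the actual decider code: the `NodeState` and `Decision` enums are simplified stand-ins, `HAS_CACHE` and `UNKNOWN` are assumed state names (only `NO_CACHE`, `FAILED` and `FETCHING` appear in the diff of this PR), and the No result for `FAILED` is an assumption.

```java
// Hypothetical, self-contained sketch; the names only mirror the ones used in this PR.
final class FrozenCacheDecisionSketch {

    // Assumed node states. Only NO_CACHE, FAILED and FETCHING are visible in the diff.
    enum NodeState { HAS_CACHE, NO_CACHE, FAILED, FETCHING, UNKNOWN }

    // Simplified stand-in for an allocation decision type.
    enum Decision { YES, NO, THROTTLE }

    // One explicit case per state, so adding a new state forces a compile error
    // here instead of silently falling into a catch-all default branch.
    static Decision decide(NodeState state) {
        return switch (state) {
            case HAS_CACHE -> Decision.YES;
            case NO_CACHE -> Decision.NO;
            case FAILED -> Decision.NO;         // assumption: a failed fetch rejects the allocation
            case FETCHING -> Decision.THROTTLE; // cache info is still being fetched
            case UNKNOWN -> Decision.NO;        // node has left the cluster and may not return
        };
    }

    public static void main(String[] args) {
        for (NodeState state : NodeState.values()) {
            System.out.println(state + " -> " + decide(state));
        }
    }
}
```

The point of the explicit `UNKNOWN` case is that a node the cache service has no record of is rejected outright, while a node whose cache info is still being fetched keeps the throttling behaviour.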

elasticsearchmachine · 2025-10-28T07:14:49Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-10-28T07:14:49Z

Hi @ywangd, I've created a changelog YAML for you.

ywangd · 2025-10-28T07:31:13Z

Labelling this as non-issue because:

Unassigned frozen shards are handled by SearchableSnapshotAllocator and hence should see no impact.
Assigned frozen shards can now see No instead of Throttle for moveShards and balance, where Throttle is effectively a No, i.e. it does not move shards on the cluster. With the desired balance allocator and the early return mechanism, Throttle becomes even less relevant.

DiannaHohensee · 2025-10-28T22:20:59Z

...sticsearch/xpack/searchablesnapshots/allocation/decider/HasFrozenCacheAllocationDecider.java

 case NO_CACHE -> NO_FROZEN_CACHE;
 case FAILED -> UNKNOWN_FROZEN_CACHE;
- default -> STILL_FETCHING;
+ case FETCHING -> STILL_FETCHING;


I've only briefly looked at this code area, but I'm wondering why we'd use THROTTLE instead of NO. I would expect the FrozenCacheService to make a reroute when it receives the update, but it doesn't look like that happens. So THROTTLE looks like a hack. Could we make that reroute happen in a new FrozenCacheService#clusterChanged?

Or perhaps not a clusterChanged, but whenever the node's settings become known -- I didn't actually follow the code that far.

In the other PR (#136794), I indeed changed it to No when it is called in simulation. That should give us what we want. I left it unchanged for non-simulation since that is not important to us and less change.

As commented on the other PR, FrozenCacheInfoService#updateNodes is called by SearchableSnapshotAllocator on the master thread as part of a reroute call, which effectively does what you are suggesting. If there is any node still fetching cache state, unassigned frozen shards are ignored, which means BalancedShardAllocator ignores them as well, i.e., whether it returns No or Throttle does not matter for unassigned shards in simulation. For started frozen shards, BalancedShardAllocator can see a Throttle at balancing time with the current code. That's where the other PR comes in, i.e. it changes the decision to No, which works effectively the same as Throttle in simulation, especially now that we have the early return mechanism. So it is a non-issue change.

I hope this makes sense.

> I left it unchanged for non-simulation since that is not important to us and less change.

So it's safe to make it always NO, but you're trying to minimize the change?

I answered this elsewhere. Reposting here for easy consumption:

I looked into the simulation part more carefully, so I am pretty sure it is a non-issue (to change to No). I believe it should be safe in both cases. But less change is my preference since the non-simulation case is unrelated.

Btw, if the split between UNKNOWN and FETCHING is considered unrelated and adds more confusion than it's worth, I can also undo it so that the change will just be

default -> allocation.isSimulating() ? NO : STILL_FETCHING;

as suggested in the other PR. Just let me know.
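
For comparison, the one-line alternative above could look roughly like the following Java sketch. This is an illustration only: `NodeState` and `Decision` are simplified stand-ins, `HAS_CACHE` and `UNKNOWN` are assumed state names, and a plain `boolean simulating` parameter stands in for `allocation.isSimulating()`.

```java
// Hypothetical sketch of the catch-all variant; not the actual decider code.
final class SimulationAwareDecisionSketch {

    // Assumed node states. Only NO_CACHE, FAILED and FETCHING are visible in the diff.
    enum NodeState { HAS_CACHE, NO_CACHE, FAILED, FETCHING, UNKNOWN }

    // Simplified stand-in for an allocation decision type.
    enum Decision { YES, NO, THROTTLE }

    // `simulating` stands in for allocation.isSimulating().
    static Decision decide(NodeState state, boolean simulating) {
        return switch (state) {
            case HAS_CACHE -> Decision.YES;
            case NO_CACHE, FAILED -> Decision.NO; // assumption: both reject the allocation
            // FETCHING and UNKNOWN share one branch: No in simulation, Throttle otherwise.
            default -> simulating ? Decision.NO : Decision.THROTTLE;
        };
    }

    public static void main(String[] args) {
        for (NodeState state : NodeState.values()) {
            System.out.println(state + " simulating=" + decide(state, true)
                + " non-simulating=" + decide(state, false));
        }
    }
}
```

In this variant the UNKNOWN and FETCHING states stay indistinguishable, which is the trade-off against the explicit split: the change is smaller, but a Throttle result remains for any non-simulating caller.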

The split / explicit switches per enum is all good to me, I'm more hesitant about not fully changing over to NO. I think a remaining throttle case leaves the code in a confusing state: it's technical debt.

I'm good with pushing the change as is.

Thanks for the discussion! I will be merging this PR since it's about the split and we are OK with it. We can continue the discussion about the remaining Throttle in the other PR.

nicktindall

LGTM

DiannaHohensee

LGTM. I'd like to fully remove THROTTLE later, so I filed https://elasticco.atlassian.net/browse/ES-13378

ywangd requested a review from nicktindall October 28, 2025 07:14

ywangd added the >enhancement, :Distributed Coordination/Allocation and v9.3.0 labels Oct 28, 2025

elasticsearchmachine added the Team:Distributed Coordination label Oct 28, 2025

Update docs/changelog/137232.yaml

40e5aec

ywangd mentioned this pull request Oct 28, 2025

Tighten on when THROTTLE decision can be returned #136794

Merged

[CI] Auto commit changes from spotless

0f6de10

ywangd added >non-issue and removed >enhancement labels Oct 28, 2025

Delete docs/changelog/137232.yaml

bcf8269

DiannaHohensee reviewed Oct 28, 2025

ywangd added 2 commits October 29, 2025 09:53

Merge branch 'main' into no-decision-for-unknown-node

b89d4a7

Merge branch 'main' into no-decision-for-unknown-node

28b3c0f

nicktindall approved these changes Oct 29, 2025

DiannaHohensee approved these changes Oct 29, 2025

Merge branch 'main' into no-decision-for-unknown-node

0c6b70f

ywangd added the auto-merge-without-approval label Oct 30, 2025

elasticsearchmachine merged commit 6c33c94 into elastic:main Oct 30, 2025
34 checks passed

ywangd deleted the no-decision-for-unknown-node branch October 30, 2025 06:07
