- Notifications
You must be signed in to change notification settings - Fork 25.6k
HasFrozenCacheAllocationDecider returns No decision for unknown node #137232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HasFrozenCacheAllocationDecider returns No decision for unknown node #137232
Conversation
Since SearchableSnapshotAllocator updates frozen cache service before allocation, the unknown state for a node can only happen when the node leaves the cluster. In this case, it makes more sense to reject the allocation instead of throttling since the node may not come back. This is also more consistent with the behaviour of SearchableSnapshotAllocator which does not consider any node that is not currently in the cluster (effectively a No).
| Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
| Hi @ywangd, I've created a changelog YAML for you. |
| Labelling this as non-issue because:
|
| case NO_CACHE -> NO_FROZEN_CACHE; | ||
| case FAILED -> UNKNOWN_FROZEN_CACHE; | ||
| default -> STILL_FETCHING; | ||
| case FETCHING -> STILL_FETCHING; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only briefly looked at this code area, but I'm wondering why we'd use THROTTLE instead of NO. I would expect the FrozenCacheService to make a reroute when it receives the update, but it doesn't look like that happens. So THROTTLE looks like a hack. Could we make that reroute happen in a new FrozenCacheService#clusterChanged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps not a clusterChanged, but whenever the node's settings become known -- I didn't actually follow the code that far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the other PR (#136794), I indeed changed it to No when it is called in simulation. That should give us what we want. I left it unchanged for non-simulation since that is not important to us and less change.
As commented on the other PR, FrozenCacheInfoService#updateNodes is called by SearchableSnapshotAllocator on the master thread as part of a reroute call which effectively does what you are suggesting. If there is any node still fetching cache state, unassigned frozen shards are ignored which means BalancedShardAllocator ignores them as well, i.e., whether it returns No or Throttle does not matter for unassigned shards in simulation. For started frozen shards, BalancedShardAllocator can see a Throttle at balacing time with the current code. That's where the other PR comes in, i.e. it changes the decision to No which works effectively the same as Throttle in simluation especially now that we have the early return mechanism. So it is a non-issue change.
I hope this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it unchanged for non-simulation since that is not important to us and less change.
So it's safe to make it always NO, but you're trying to minimize the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I answered in elsewhere. Reposting here for easy consumption
I looked into the simulation part more carefully so that I am pretty sure it is a non-issue (to change to N)). I believe it should be safe in both cases. But less change is my preference since non-simulation case is unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, if the split between UNKNOWN and FETCHING is considered unrelated and adding confusion than its worth, I can also undo it so that the change will just be
default -> allocation.isSimulating() ? NO : STILL_FETCHING;as suggested over the other PR. Just let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The split / explicit switches per enum is all good to me, I'm more hesitant about not fully changing over to NO. I think a remaining throttle case leaves the code in a confusing state: it's technical debt.
I'm good with pushing the change as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the discussion! I will be merging this PR since it's about the split and we are OK with it. We can continue the discussion about remaining Throttle over the other PR.
nicktindall left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
DiannaHohensee left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'd like to fully remove THROTTLE later, so I filed https://elasticco.atlassian.net/browse/ES-13378
| case NO_CACHE -> NO_FROZEN_CACHE; | ||
| case FAILED -> UNKNOWN_FROZEN_CACHE; | ||
| default -> STILL_FETCHING; | ||
| case FETCHING -> STILL_FETCHING; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The split / explicit switches per enum is all good to me, I'm more hesitant about not fully changing over to NO. I think a remaining throttle case leaves the code in a confusing state: it's technical debt.
I'm good with pushing the change as is.
…lastic#137232) Since SearchableSnapshotAllocator updates frozen cache service before allocation, the unknown state for a node can only happen when the node leaves the cluster. In this case, it makes more sense to reject the allocation instead of throttling since the node may not come back. This is also more consistent with the behaviour of SearchableSnapshotAllocator which does not consider any node that is not currently in the cluster (effectively a No). If the node does come back, it will go through again the cache info fetching steps. Relates: elastic#136794
Since SearchableSnapshotAllocator updates frozen cache service before allocation, the unknown state for a node can only happen when the node leaves the cluster. In this case, it makes more sense to reject the allocation instead of throttling since the node may not come back. This is also more consistent with the behaviour of
SearchableSnapshotAllocator which does not consider any node that is not currently in the cluster (effectively a No). If the node does come back, it will go through again the cache info fetching steps.
Relates: #136794