- Notifications
You must be signed in to change notification settings - Fork 25.7k
Make TransportLocalClusterStateAction wait for cluster to unblock #117230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This will make `TransportLocalClusterStateAction` wait for a new state that is not blocked. This means we need a timeout (again). For consistency's sake, we're reusing the REST param `master_timeout` for this timeout as well. The only class that was using `TransportLocalClusterStateAction` was `TransportGetAliasesAction`, so its request needed to accept a timeout again as well.
e6c4cfd to b4f4cf5 Compare | | ||
| include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=expand-wildcards] | ||
| | ||
| include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=master-timeout] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if these docs and rest-api-spec changes are out of scope or not for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ok to me, tho note that we have never before supported ?master_timeout on these APIs, this isn't reinstating a parameter that previously was removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I realized that, but I figured it made sense to add them if they're parameters we're accepting.
| exports org.elasticsearch.action.support.master; | ||
| exports org.elasticsearch.action.support.master.info; | ||
| exports org.elasticsearch.action.support.nodes; | ||
| exports org.elasticsearch.action.support.local; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved TransportLocalClusterStateAction to its own local package - together with the new LocalClusterStateRequest.
| public GetAliasesRequest(String... aliases) { | ||
| this.aliases = aliases; | ||
| this.originalAliases = aliases; | ||
| this(MasterNodeRequest.TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT, aliases); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It felt out of scope for this PR to refactor all these constructor callers to provide an explicit master timeout - that will have to be done in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather we did those changes first to set things up for this PR to go through without having to add back this trappy parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would we do those changes first if there is currently - on main - no timeout to set in these constructors? Are you suggesting adding a constructor on main that takes a timeout but doesn't actually do anything with it and swapping that constructor with the ones on this branch afterward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, hm, good point. Ok, let's do it this way round.
server/src/main/java/org/elasticsearch/action/support/local/LocalClusterStateRequest.java Outdated Show resolved Hide resolved
| Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing) |
| Hi @nielsbauman, I've created a changelog YAML for you. |
| @elasticmachine update branch |
server/src/main/java/org/elasticsearch/action/support/local/LocalClusterStateRequest.java Outdated Show resolved Hide resolved
...r/src/main/java/org/elasticsearch/action/support/local/TransportLocalClusterStateAction.java Outdated Show resolved Hide resolved
| ); | ||
| listener.onFailure(new ElasticsearchTimeoutException("timed out while waiting for cluster to unblock", exception)); | ||
| } | ||
| }, clusterState -> isTaskCancelled(task) || checkBlock(request, clusterState) == null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that if the task is cancelled while waiting for an unblocking cluster state then it'll wait for the next cluster state update before completing, which could be arbitrarily far in the future. Could we use org.elasticsearch.tasks.CancellableTask#addListener to react more promptly to cancellation instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea 👍. It looks like we're not doing that in TransportMasterNodeAction either. Am I missing something or should we add it there too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh we probably should. But that's a good point, let's leave this with the same behaviour as in TransportMasterNodeAction and we can come back to this change later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #117971 for this.
| public GetAliasesRequest(String... aliases) { | ||
| this.aliases = aliases; | ||
| this.originalAliases = aliases; | ||
| this(MasterNodeRequest.TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT, aliases); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather we did those changes first to set things up for this PR to go through without having to add back this trappy parameter.
| | ||
| include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=expand-wildcards] | ||
| | ||
| include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=master-timeout] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ok to me, tho note that we have never before supported ?master_timeout on these APIs, this isn't reinstating a parameter that previously was removed.
DaveCTurner left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few superficial things
docs/changelog/117230.yaml Outdated
| @@ -0,0 +1,5 @@ | |||
| pr: 117230 | |||
| summary: Make `TransportLocalClusterStateAction` wait for cluster to unblock | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could describe this change in a more user-facing way: we're fixing that the various aliases APIs don't wait for a block to be removed. The fact that we're doing this by changing TransportLocalClusterStateAction is not really relevant to the release notes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the summary to Make various alias retrieval APIs wait for cluster to unblock. Let me know if you have any suggestions and/or if you think we should elaborate with a description.
server/src/main/java/org/elasticsearch/action/admin/indices/alias/get/GetAliasesRequest.java Outdated Show resolved Hide resolved
server/src/main/java/org/elasticsearch/action/admin/indices/alias/get/GetAliasesRequest.java Outdated Show resolved Hide resolved
server/src/main/java/org/elasticsearch/action/support/local/LocalClusterStateRequest.java Outdated Show resolved Hide resolved
.../test/java/org/elasticsearch/action/support/local/TransportLocalClusterStateActionTests.java Outdated Show resolved Hide resolved
.../test/java/org/elasticsearch/action/support/local/TransportLocalClusterStateActionTests.java Outdated Show resolved Hide resolved
| | ||
| public void testRetryAfterBlock() throws ExecutionException, InterruptedException { | ||
| var request = new Request(); | ||
| ClusterBlock block = new ClusterBlock(1, "", true, true, false, randomFrom(RestStatus.values()), ClusterBlockLevel.ALL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cluster block ID 1 is special - it means the STATE_NOT_RECOVERED_BLOCK. Could we randomize this to indicate that the ID doesn't matter to these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced it with randomInt(). I saw some places using randomInt() and some with a lower and upper bound. Let me know if you think bounding the random int is preferable.
.../test/java/org/elasticsearch/action/support/local/TransportLocalClusterStateActionTests.java Outdated Show resolved Hide resolved
DaveCTurner left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This will make
TransportLocalClusterStateActionwait for a new statethat is not blocked. This means we need a timeout (again). For
consistency's sake, we're reusing the REST param
master_timeoutforthis timeout as well.
The only class that was using
TransportLocalClusterStateActionwasTransportGetAliasesAction, so its request needed to accept a timeoutagain as well.