- Notifications
You must be signed in to change notification settings - Fork 25.5k
Implement runtime skip_unavailable=true #121240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @smalyshev, I've created a changelog YAML for you. |
27d76b6
to d3c8ef5
Compare Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass review. Several questions left and a few suggestions for changes, some optional. Once I understand the answers to the questions I left I'll do another round of review.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java Show resolved Hide resolved
.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/TransportEsqlQueryAction.java Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java Outdated Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/RemoteListenerGroup.java Show resolved Hide resolved
} | ||
}; | ||
// Cancel the group on sink failure | ||
exchangeRequestListener = createCancellingListener("exchange sink failure", computeListener.acquireAvoid(), finishGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking to learn - why does the exchangeRequestListener call acquireAvoid while the clusterRequestListener calls acquireCompute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
acquireCompute
collects profiles, which as I understand will be coming through clusterRequestListener reqs. At least as far as I understand that code, it's a bit hairy so I may be wrong.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/RemoteListenerGroup.java Show resolved Hide resolved
} | ||
| ||
// Non-disconnect remote failures still fail the request | ||
public void testRemoteFailureSkipUnavailable() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: change name to include "True", testRemoteFailureWithSkipUnavailableTrue
. Otherwise you have to read a bit of code to figure out the setting.
...l/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClustersQueryIT.java Show resolved Hide resolved
assertThat(executionInfo.isPartial(), equalTo(true)); | ||
EsqlExecutionInfo.Cluster cluster = executionInfo.getCluster(REMOTE_CLUSTER); | ||
| ||
assertThat(cluster.getStatus(), equalTo(EsqlExecutionInfo.Cluster.Status.PARTIAL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm confused. I thought cancelling an ES|QL would result in a top level failure, like a 500 status. Is that not right? It returns partial data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canceling the remote shouldn't - at least according to our slack conversation. Is it not correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. Thanks @smalyshev.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/RemoteListenerGroup.java Show resolved Hide resolved
ComputeListener computeListener, | ||
String clusterAlias, | ||
EsqlExecutionInfo executionInfo, | ||
ActionListener<Void> delegate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to pass both ComputeListener and this extra listener?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, this listener is used to complete the ComputeListener is the group is finished, but also is used in the exception clause of ExchangeService.openExchange
- so I am not sure how to get rid of it.
} | ||
}; | ||
// Cancel the group on sink failure | ||
exchangeRequestListener = createCancellingListener("exchange sink failure", computeListener.acquireAvoid(), finishGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, if a compute from a remote cluster fails and skip_unavailable=false, or the failure is not remote_unavailable_exceptions, we will not start canceling computes on other clusters, including the local cluster, until the cancellation of the failing cluster complete - which may take time. Can we maintain this behavior? But it's perfectly fine not to maintain it if it's complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to do this - do you have any suggestions?
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ClusterComputeHandler.java Outdated Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java Outdated Show resolved Hide resolved
...nternalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClustersCancellationIT.java Outdated Show resolved Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next round review. Can approve on next round after another fix or two. Looks good.
) { | ||
return delegate.delegateResponse((l, e) -> { | ||
if (EsqlCCSUtils.shouldIgnoreRuntimeError(executionInfo, clusterAlias, e)) { | ||
EsqlCCSUtils.markClusterWithFinalStateAndNoShards(executionInfo, clusterAlias, status, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you don't need the EsqlCCSUtils.
qualifier on the two method calls above.
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/session/EsqlCCSUtilsTests.java Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ClusterComputeHandler.java Show resolved Hide resolved
3b3ae1f
to ff431f9
Compare There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - nice work!
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
* Implement runtime skip_unavailable=true (#121240) * Implement runtime skip_unavailable=true (cherry picked from commit 2fbec77) # Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
* Implement runtime skip_unavailable=true
createLocalIndex() was added in elastic#121240 but is unused now after testCloseSkipUnavailable() was removed in elastic#123054.
createLocalIndex() was added in elastic#121240 but is unused now after testCloseSkipUnavailable() was removed in elastic#123054.
When
skip_unavailable=true
, if remote returns disconnect error or cancellation error at runtime, the cluster is marked as "partial", but the query overall does not fail. If the error happens before the exchange is established, the cluster is marked as "skipped".