Skip to content

Conversation

smalyshev
Copy link
Contributor

@smalyshev smalyshev commented Jan 29, 2025

When skip_unavailable=true, if remote returns disconnect error or cancellation error at runtime, the cluster is marked as "partial", but the query overall does not fail. If the error happens before the exchange is established, the cluster is marked as "skipped".

@elasticsearchmachine
Copy link
Collaborator

Hi @smalyshev, I've created a changelog YAML for you.

@smalyshev smalyshev requested review from dnhatn and quux00 January 30, 2025 23:08
@smalyshev smalyshev marked this pull request as ready for review January 30, 2025 23:08
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 30, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass review. Several questions left and a few suggestions for changes, some optional. Once I understand the answers to the questions I left I'll do another round of review.

}
};
// Cancel the group on sink failure
exchangeRequestListener = createCancellingListener("exchange sink failure", computeListener.acquireAvoid(), finishGroup);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking to learn - why does the exchangeRequestListener call acquireAvoid while the clusterRequestListener calls acquireCompute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acquireCompute collects profiles, which as I understand will be coming through clusterRequestListener reqs. At least as far as I understand that code, it's a bit hairy so I may be wrong.

}

// Non-disconnect remote failures still fail the request
public void testRemoteFailureSkipUnavailable() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change name to include "True", testRemoteFailureWithSkipUnavailableTrue. Otherwise you have to read a bit of code to figure out the setting.

assertThat(executionInfo.isPartial(), equalTo(true));
EsqlExecutionInfo.Cluster cluster = executionInfo.getCluster(REMOTE_CLUSTER);

assertThat(cluster.getStatus(), equalTo(EsqlExecutionInfo.Cluster.Status.PARTIAL));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm confused. I thought cancelling an ES|QL would result in a top level failure, like a 500 status. Is that not right? It returns partial data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canceling the remote shouldn't - at least according to our slack conversation. Is it not correct?

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. Thanks @smalyshev.

ComputeListener computeListener,
String clusterAlias,
EsqlExecutionInfo executionInfo,
ActionListener<Void> delegate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass both ComputeListener and this extra listener?

Copy link
Contributor Author

@smalyshev smalyshev Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, this listener is used to complete the ComputeListener is the group is finished, but also is used in the exception clause of ExchangeService.openExchange - so I am not sure how to get rid of it.

}
};
// Cancel the group on sink failure
exchangeRequestListener = createCancellingListener("exchange sink failure", computeListener.acquireAvoid(), finishGroup);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, if a compute from a remote cluster fails and skip_unavailable=false, or the failure is not remote_unavailable_exceptions, we will not start canceling computes on other clusters, including the local cluster, until the cancellation of the failing cluster complete - which may take time. Can we maintain this behavior? But it's perfectly fine not to maintain it if it's complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to do this - do you have any suggestions?

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next round review. Can approve on next round after another fix or two. Looks good.

) {
return delegate.delegateResponse((l, e) -> {
if (EsqlCCSUtils.shouldIgnoreRuntimeError(executionInfo, clusterAlias, e)) {
EsqlCCSUtils.markClusterWithFinalStateAndNoShards(executionInfo, clusterAlias, status, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't need the EsqlCCSUtils. qualifier on the two method calls above.

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - nice work!

@smalyshev smalyshev added the auto-backport Automatically create backport pull requests when merged label Feb 3, 2025
@smalyshev smalyshev merged commit 2fbec77 into elastic:main Feb 3, 2025
17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 121240

@smalyshev
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

elasticsearchmachine pushed a commit that referenced this pull request Feb 4, 2025
* Implement runtime skip_unavailable=true (#121240) * Implement runtime skip_unavailable=true (cherry picked from commit 2fbec77) # Conflicts: #	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Feb 4, 2025
* Implement runtime skip_unavailable=true
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request Aug 14, 2025
createLocalIndex() was added in elastic#121240 but is unused now after testCloseSkipUnavailable() was removed in elastic#123054.
JeremyDahlgren added a commit that referenced this pull request Aug 14, 2025
createLocalIndex() was added in #121240 but is unused now after testCloseSkipUnavailable() was removed in #123054.
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Aug 15, 2025
createLocalIndex() was added in elastic#121240 but is unused now after testCloseSkipUnavailable() was removed in elastic#123054.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.19.0 v9.1.0

4 participants