- Notifications
You must be signed in to change notification settings - Fork 25.6k
Support CCS minimize round trips in async search #96012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support CCS minimize round trips in async search #96012
Conversation
| Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me so far, I am curious on why it's marked WIP ;) maybe we should add some indication in the get async search API response that some remote clusters are minimizing roundtrips?
qa/multi-cluster-search/src/test/java/org/elasticsearch/search/CCSDuelIT.java Outdated Show resolved Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchProgressListener.java Outdated Show resolved Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchProgressListener.java Outdated Show resolved Hide resolved
1477f42 to 82930fd Compare 193e2af to eeb6bfc Compare | With these changes, I have removed the notifyMinimizeRoundtrips/onMinimizeRoundtrips call to the SearchProgressListener in favor of modifying the The Here's an example of the progression in the POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true { // ... search body for a slow search not shown } // after _async_seach submit { "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=", "is_partial": true, "is_running": true, "start_time_in_millis": 1684516160099, "expiration_time_in_millis": 1684948160099, "response": { "took": 806, "timed_out": false, "terminated_early": false, "num_reduce_phases": 0, "_shards": { "total": 3, // only shows local shards until remote searches finish (current behavior) "successful": 0, // no shards completed "skipped": 0, "failed": 0 }, "_clusters": { "total": 3, "successful": 0, // no clusters completed (new behavior) "skipped": 0 }, "hits": { "total": { "value": 0, "relation": "gte" }, "max_score": null, "hits": [] } } } // while running (local cluster has finished; two remote cluster searches still running GET /_async_search/FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI { "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=", "is_partial": true, "is_running": true, "start_time_in_millis": 1684516160099, "expiration_time_in_millis": 1684948160099, "response": { "took": 5973, "timed_out": false, "terminated_early": false, "_shards": { "total": 3, "successful": 3, // local shards completed "skipped": 0, "failed": 0 }, "_clusters": { "total": 3, "successful": 1, // local cluster has completed (new behavior) "skipped": 0 }, "hits": { "total": { "value": 142, "relation": "eq" }, "max_score": null, "hits": [] } } } // when all clusters completed GET /_async_search/FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI { "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=", "is_partial": false, "is_running": false, "start_time_in_millis": 1684516160099, "expiration_time_in_millis": 1684948160099, "response": { "took": 10350, "timed_out": false, "num_reduce_phases": 4, "_shards": { "total": 9, "successful": 9, "skipped": 0, "failed": 0 }, "_clusters": { "total": 3, "successful": 3, "skipped": 0 }, "hits": { "total": { "value": 432, "relation": "eq" }, "max_score": 17.110811, "hits": [ ...Note that this is NOT the behavior with ccs_minimize_roundtrips=false. There there _clusters output is always the final output where total == successful + skipped: POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=false { // ... search body for a slow search not shown } ... "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=", "is_partial": true, "is_running": true, "start_time_in_millis": 1684516160099, "expiration_time_in_millis": 1684948160099, "response": { ... "_clusters": { "total": 3, "successful": 3, // shows final state even if no shards/clusters have completed (current behavior) "skipped": 0 }, |
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java Outdated Show resolved Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchResponse.java Outdated Show resolved Hide resolved
eeb6bfc to aaa9b4e Compare | I tested a case where the remote clusters are down: // case where skip_unavailable: true is NOT set for a remote cluster and it is down: POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true ... // response: { "id": "FjJJZG9yWUhOU3VLaTJzNXZMalZaTkEaaVhraVNDT2VTenFoUDlod21hYmlKZzo5MTE=", "is_partial": true, "is_running": false, "start_time_in_millis": 1684767109136, "expiration_time_in_millis": 1685199109136, "response": { "took": 10927, "timed_out": false, "terminated_early": false, "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 }, "_clusters": { "total": 3, "successful": 1, "skipped": 0 }, "hits": { "total": { "value": 167, "relation": "eq" }, "max_score": null, "hits": [] } }, "error": { "type": "status_exception", "reason": "error while executing search", "caused_by": { "type": "connect_transport_exception", "reason": "[][127.0.0.1:9301] connect_exception", "caused_by": { "type": "uncategorized_execution_exception", "reason": "Failed execution", "caused_by": { "type": "execution_exception", "reason": "execution_exception: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301", "caused_by": { "type": "i_o_exception", "reason": "Connection refused: /127.0.0.1:9301", "caused_by": { "type": "i_o_exception", "reason": "Connection refused" } } } } } } } ---- // where skip_unavailable:true POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true ... // response: { "id": "FnE0SjZrb2lOVFR5ejFCN3dtX0NHbVEbaVhraVNDT2VTenFoUDlod21hYmlKZzoxMDIz", "is_partial": false, "is_running": false, "start_time_in_millis": 1684767263845, "expiration_time_in_millis": 1685199263845, "response": { "took": 11943, "timed_out": false, "num_reduce_phases": 3, "_shards": { "total": 6, "successful": 6, "skipped": 0, "failed": 0 }, "_clusters": { "total": 3, "successful": 2, "skipped": 1 // was properly updated at the end (started off as 0) }, "hits": { "total": { "value": 370, "relation": "eq" }, "max_score": 16.595804, "hits": [ { ... |
aaa9b4e to 2793f5d Compare There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a couple of comments, great progress!!
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java Outdated Show resolved Hide resolved
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java Outdated Show resolved Hide resolved
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java Outdated Show resolved Hide resolved
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java Outdated Show resolved Hide resolved
271c63c to fba88c4 Compare There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHIP IT
1cab8d9 to a6460bd Compare | @abdonpijpelink - Would you have time to review the end user docs changes here? You can look at the last commit to see the new changes: a7b7237 |
a6460bd to a7b7237 Compare | I just talked with Luca and we are going to change the default setting in a follow-on PR, so the end user docs changes will go into that new PR so you can wait to review it then. I'll be deleting those commits from this PR in a moment. Sorry for the change. |
a7b7237 to 5c20cf5 Compare There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing! For your follow-up PR: note that the <1> numbers ("callouts") have to restart from 1 for each code snippet, otherwise the docs won't build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "successful": 3, <4> | |
| "skipped": 0, | |
| "failed": 0 | |
| }, | |
| "_clusters": { | |
| "total": 3, | |
| "successful": 1, <5> | |
| "skipped": 0 | |
| }, | |
| "hits": { | |
| "total": { | |
| "value": 167, <6> | |
| "relation": "eq" | |
| }, | |
| "max_score": null, | |
| "hits": [] | |
| } | |
| } | |
| } | |
| -------------------------------------------------- | |
| <4> All the local cluster shards have completed. | |
| <5> The local cluster search has completed, so the "successful" clusters entry | |
| is set to 1. The `_clusters` response section will not be updated for the remote clusters | |
| until all remote searches have finished (either successfully or been skipped). | |
| <6> Number of hits from the local cluster search. Final hits are not | |
| "successful": 3, <1> | |
| "skipped": 0, | |
| "failed": 0 | |
| }, | |
| "_clusters": { | |
| "total": 3, | |
| "successful": 1, <2> | |
| "skipped": 0 | |
| }, | |
| "hits": { | |
| "total": { | |
| "value": 167, <3> | |
| "relation": "eq" | |
| }, | |
| "max_score": null, | |
| "hits": [] | |
| } | |
| } | |
| } | |
| -------------------------------------------------- | |
| <1> All the local cluster shards have completed. | |
| <2> The local cluster search has completed, so the "successful" clusters entry | |
| is set to 1. The `_clusters` response section will not be updated for the remote clusters | |
| until all remote searches have finished (either successfully or been skipped). | |
| <3> Number of hits from the local cluster search. Final hits are not |
This commit makes the smallest set of changes to allow async-search based cross-cluster search to work with the CCS minimize_round_trips feature without changing the internals/architecture of the search action. When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator to orchestrate and return the entire result set to the CCS coordinator as a single response. This is the same functionality provided by synchronous CCS search using minimize_round_trips. However, since this is an async search, it means that the async search coordinator has no visibility into search progress on the remote clusters while they are running the search, thus losing one of the key features of async search. However, this is a good first approach for improving overall search latency for cross cluster searches that query a large number of shards on remote clusters, since Kibana does not currently expose incremental progress of an async search to users.
…ul count in Clusters - for async minimizeRoundtrips usage
MutableSearchResponse now has logic in buildResponse about whether to update the Clusters object to indicate that the local cluster has finished.
…tableSearchResponse
…ProgressListener. Modified SearchResponse.Clusters to optionally track remoteClusters and ccsMinimizeRoundtrips. As the original Clusters object was intended to be immutable, I've added a "frozen" flag that is set when the original constructor is called. CCS AsyncSearch minimizeRoundTrips uses this new Clusters variant. onListShards, with the revised impl and usage of SearchResponse.Clusters, now handles notifying the SearchProgressListener in the AsyncSearchTask about the number of remote clusters, whether local shards are being searched and ccsMinimizeRoundtrips status.
Made the Clusters and localClusterStatusUpdated fields to be updated in a synchronized block. in MutableSearchResponse.SearchClustersInfo to make them safe for multi threaded access.
5c20cf5 to ded776b Compare
This commit makes the smallest set of changes to allow async-search based cross-cluster search
to work with the CCS minimize_round_trips feature without changing the internals/architecture of
the search action.
When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the
primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator
to orchestrate and return the entire result set to the CCS coordinator as a single response.
This is the same functionality provided by synchronous CCS search using minimize_round_trips.
However, since this is an async search, it means that the async search coordinator has no visibility
into search progress on the remote clusters while they are running the search, thus losing one of
the key features of async search. However, this is a good first approach for improving overall search
latency for cross cluster searches that query a large number of shards on remote clusters, since
Kibana does not currently expose incremental progress of an async search to users.