Skip to content

Conversation

ioanatia
Copy link
Contributor

tracked in #115103

actual implementation of the prototype from #118106

This adds a query builder rewrite phase on the coordinator to ES|QL.
Since the MatchQueryBuilder now supports semantic_text (#117839), with the new query builder rewrite phase it becomes very easy to add support for querying semantic_text in ES|QL.

Still needs more tests - for now I added CSV tests (for positive cases) and integration tests to show how we capture and return errors that can happen during the query builder rewrite.

@ioanatia ioanatia marked this pull request as ready for review December 16, 2024 16:00
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Dec 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

assertEquals(404, re.getResponse().getStatusLine().getStatusCode());
}

@Before
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would @BeforeClass / @AfterClass be better in the setup / teardown methods as no data is modified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having a bit of trouble using AfterClass/BeforeClass even if I switch to adminClient():

SemanticMatchIT > classMethod FAILED java.lang.NullPointerException: Cannot invoke "org.elasticsearch.client.RestClient.performRequest(org.elasticsearch.client.Request)" because the return value of "org.elasticsearch.xpack.esql.qa.rest.SemanticMatchTestCase.adminClient()" is null at __randomizedtesting.SeedInfo.seed([9885A2F1F0E88028]:0) at org.elasticsearch.xpack.esql.qa.rest.SemanticMatchTestCase.wipeData(SemanticMatchTestCase.java:105) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1763) at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:909) at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:48) 

I see that most integration tests use @Before and @After, so I will revert to that.

;

_id:keyword | _score:double
2 | 1.2879333961116942E19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if score should be checked via YAML tests:

  • It would allow to check that it's just greater than zero
  • We could check if it's the same score as returned by an equivalent search

No need to do anything on this PR, just thinking out loud 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I am not sure either - for now I followed the pattern we already have for testing with _score.
Happy to change this, but not as part of this PR.
I think it's a bit more complicated than that - if we could just make sure we use the same sharding strategy consistently in the scoring tests, this shouldn't be a problem anymore.
The sharding does not only affect the value of the scores, but as we have seen from some of the failing tests, it can also affect the order of the documents, with some document getting scored higher than other depending on the sharding strategy. So to fix this scoring tests, it is not sufficient to just check that the scores are greater than 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to change this, but not as part of this PR.

++

to fix this scoring tests, it is not sufficient to just check that the scores are greater than 0.

What worked for me to test knn rescoring is to get the results from an actual search and compare them directly to the ones returned from another, both scoring and the results themselves. We can check on some follow up on the strategy

// Check for every possible query data type
for (DataType fieldDataType : fieldDataTypes) {
// TODO: semantic_text is not present in mapping-all-types.json so we skip it for now
if (fieldDataType == DataType.SEMANTIC_TEXT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add it there?

Copy link
Contributor Author

@ioanatia ioanatia Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because I have not fully mapped how mapping-all-types.json is being used - I will follow up on this separately.

callback.accept(plan, listener);
return;
}
QueryRewriteContext ctx = queryRewriteContext(indexNames);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if QueryRewriteContext an intermediate data structure to carry ResolvedIndices to get the inference results? Can we just provide ResolvedIndices instead, without having to create a QueryRewriteContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I am following - we need a QueryRewriteContext, because what Rewritable.rewriteAndFetch does (among other things) is to call the rewrite method from QueryBuilder which needs a QueryRewriteContext:

/**
* Rewrites this query builder into its primitive form. By default this method return the builder itself. If the builder
* did not change the identity reference must be returned otherwise the builder will be rewritten infinitely.
*/
@Override
default QueryBuilder rewrite(QueryRewriteContext queryRewriteContext) throws IOException {
return this;
}

Many implementations of the QueryBuilder interface will then have their own rewrite logic.

Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💯

}

@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
Copy link
Contributor Author

@ioanatia ioanatia Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The KqlFunctionIT tests would fail with:

 java.lang.AssertionError: Unknown NamedWriteable [org.elasticsearch.index.query.QueryBuilder][kql] |   -- | --   | at __randomizedtesting.SeedInfo.seed([7E86655D8AA148E0]:0) |     | at org.elasticsearch.common.io.stream.NamedWriteableRegistry.throwOnUnknownWritable(NamedWriteableRegistry.java:150) |     | at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:126) |     | at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:109) 

this is because we needed to explicitly require the KqlPlugin so that the named writable for the KqlQueryBuilder is registered.
note that this was just a nit for this particular test (and other ES|QL integration tests also override this method), and that single and multi node tests for KQL continue to work with no required changes.

| keep _id
;

_id:keyword
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped the score column for now because I am getting different values for single vs multi node tests 🤷‍♀️ .
It's a separate problem which we'll need to look into.

Example:

 Actual: |  ----------------------------> these are the values generated with single node -- | --   | _id:keyword \| _score:double |     | 2 \| 5.603396578413904E18 |     | 3 \| 2.156063961865257E18 |     | 1 \| 8.411017936759685E17 |     |   |     | Expected: |   -------------------------------> these are the generated values for multi node   | _id:keyword \| _score:double |     | 2 \| 5.603396E18 |     | 3 \| 2.156063E18 |     | 1 \| 8.411017E17 
@ChrisHegarty ChrisHegarty self-requested a review December 18, 2024 12:09
@ioanatia ioanatia merged commit fb6d7db into elastic:main Dec 18, 2024
16 checks passed
@ioanatia ioanatia deleted the esql_query_rewrite branch December 18, 2024 12:11
ioanatia added a commit to ioanatia/elasticsearch that referenced this pull request Dec 18, 2024
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
ioanatia added a commit that referenced this pull request Dec 18, 2024
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2024
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
@jimczi jimczi added the v8.18.0 label Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0

7 participants