- Notifications
You must be signed in to change notification settings - Fork 25.5k
Semantic search with query builder rewrite #118676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java Outdated Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java Show resolved Hide resolved
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
assertEquals(404, re.getResponse().getStatusLine().getStatusCode()); | ||
} | ||
| ||
@Before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would @BeforeClass
/ @AfterClass
be better in the setup / teardown methods as no data is modified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having a bit of trouble using AfterClass
/BeforeClass
even if I switch to adminClient()
:
SemanticMatchIT > classMethod FAILED java.lang.NullPointerException: Cannot invoke "org.elasticsearch.client.RestClient.performRequest(org.elasticsearch.client.Request)" because the return value of "org.elasticsearch.xpack.esql.qa.rest.SemanticMatchTestCase.adminClient()" is null at __randomizedtesting.SeedInfo.seed([9885A2F1F0E88028]:0) at org.elasticsearch.xpack.esql.qa.rest.SemanticMatchTestCase.wipeData(SemanticMatchTestCase.java:105) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1763) at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:909) at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:48)
I see that most integration tests use @Before
and @After
, so I will revert to that.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/match-function.csv-spec Show resolved Hide resolved
; | ||
| ||
_id:keyword | _score:double | ||
2 | 1.2879333961116942E19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if score should be checked via YAML tests:
- It would allow to check that it's just greater than zero
- We could check if it's the same score as returned by an equivalent search
No need to do anything on this PR, just thinking out loud 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I am not sure either - for now I followed the pattern we already have for testing with _score
.
Happy to change this, but not as part of this PR.
I think it's a bit more complicated than that - if we could just make sure we use the same sharding strategy consistently in the scoring tests, this shouldn't be a problem anymore.
The sharding does not only affect the value of the scores, but as we have seen from some of the failing tests, it can also affect the order of the documents, with some document getting scored higher than other depending on the sharding strategy. So to fix this scoring tests, it is not sufficient to just check that the scores are greater than 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to change this, but not as part of this PR.
++
to fix this scoring tests, it is not sufficient to just check that the scores are greater than 0.
What worked for me to test knn rescoring is to get the results from an actual search and compare them directly to the ones returned from another, both scoring and the results themselves. We can check on some follow up on the strategy
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java Show resolved Hide resolved
// Check for every possible query data type | ||
for (DataType fieldDataType : fieldDataTypes) { | ||
// TODO: semantic_text is not present in mapping-all-types.json so we skip it for now | ||
if (fieldDataType == DataType.SEMANTIC_TEXT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because I have not fully mapped how mapping-all-types.json
is being used - I will follow up on this separately.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java Show resolved Hide resolved
callback.accept(plan, listener); | ||
return; | ||
} | ||
QueryRewriteContext ctx = queryRewriteContext(indexNames); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if QueryRewriteContext
an intermediate data structure to carry ResolvedIndices
to get the inference results? Can we just provide ResolvedIndices
instead, without having to create a QueryRewriteContext
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I am following - we need a QueryRewriteContext
, because what Rewritable.rewriteAndFetch
does (among other things) is to call the rewrite
method from QueryBuilder
which needs a QueryRewriteContext
:
elasticsearch/server/src/main/java/org/elasticsearch/index/query/QueryBuilder.java
Lines 62 to 69 in 8134c79
/** | |
* Rewrites this query builder into its primitive form. By default this method return the builder itself. If the builder | |
* did not change the identity reference must be returned otherwise the builder will be rewritten infinitely. | |
*/ | |
@Override | |
default QueryBuilder rewrite(QueryRewriteContext queryRewriteContext) throws IOException { | |
return this; | |
} |
Many implementations of the QueryBuilder
interface will then have their own rewrite logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 💯
} | ||
| ||
@Override | ||
protected Collection<Class<? extends Plugin>> nodePlugins() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The KqlFunctionIT
tests would fail with:
java.lang.AssertionError: Unknown NamedWriteable [org.elasticsearch.index.query.QueryBuilder][kql] | -- | -- | at __randomizedtesting.SeedInfo.seed([7E86655D8AA148E0]:0) | | at org.elasticsearch.common.io.stream.NamedWriteableRegistry.throwOnUnknownWritable(NamedWriteableRegistry.java:150) | | at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:126) | | at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:109)
this is because we needed to explicitly require the KqlPlugin
so that the named writable for the KqlQueryBuilder
is registered.
note that this was just a nit for this particular test (and other ES|QL integration tests also override this method), and that single and multi node tests for KQL continue to work with no required changes.
| keep _id | ||
; | ||
| ||
_id:keyword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped the score column for now because I am getting different values for single vs multi node tests 🤷♀️ .
It's a separate problem which we'll need to look into.
Example:
Actual: | ----------------------------> these are the values generated with single node -- | -- | _id:keyword \| _score:double | | 2 \| 5.603396578413904E18 | | 3 \| 2.156063961865257E18 | | 1 \| 8.411017936759685E17 | | | | Expected: | -------------------------------> these are the generated values for multi node | _id:keyword \| _score:double | | 2 \| 5.603396E18 | | 3 \| 2.156063E18 | | 1 \| 8.411017E17
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
tracked in #115103
actual implementation of the prototype from #118106
This adds a query builder rewrite phase on the coordinator to ES|QL.
Since the
MatchQueryBuilder
now supportssemantic_text
(#117839), with the new query builder rewrite phase it becomes very easy to add support for queryingsemantic_text
in ES|QL.Still needs more tests - for now I added CSV tests (for positive cases) and integration tests to show how we capture and return errors that can happen during the query builder rewrite.