- Notifications
You must be signed in to change notification settings - Fork 25.5k
semantic search in ES|QL #116253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
semantic search in ES|QL #116253
Conversation
name 'x-pack-esql' | ||
description 'The plugin that powers ESQL for Elasticsearch' | ||
classname 'org.elasticsearch.xpack.esql.plugin.EsqlPlugin' | ||
extendedPlugins = ['x-pack-esql-core', 'lang-painless', 'x-pack-ml'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to change build.gradle
for both esql-core and esql because we need SemanticQueryBuilder
.
I had to remove x-pack-ml
because of jar hell.
The problem is that the categorize integration tests no longer work:
REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:qa:server:single-node:javaRestTest" --tests "org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT" -Dtests.method="test {categorize.Categorize ASYNC}" [2024-11-18T19:19:08,448][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][search][T#47]], exiting java.lang.NoClassDefFoundError: org/elasticsearch/xpack/ml/job/categorization/CategorizationAnalyzer | -- | -- | at org.elasticsearch.xpack.esql.expression.function.grouping.Categorize.toEvaluator(Categorize.java:112) | | at org.elasticsearch.xpack.esql.evaluator.EvalMapper.toEvaluator(EvalMapper.java:55) | | at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.planEval(LocalExecutionPlanner.java:410) | | at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.plan(LocalExecutionPlanner.java:197) | | at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.planAggregation(LocalExecutionPlanner.java:242) | | at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.plan(LocalExecutionPlanner.java:189) ... Caused by: java.lang.ClassNotFoundException: org.elasticsearch.xpack.ml.job.categorization.CategorizationAnalyzer | -- | -- | at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445) | | at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:595) | | at java.base/java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:870) | | at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:528)
I don't know how to fix this - if I leave x-pack-ml
here, we get Jar hell when Elasticsearch starts:
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:76) Caused by: java.lang.IllegalStateException: jar hell! class: com.ibm.icu.impl.Assert jar1: /Users/ioana/projects/elasticsearch/distribution/archives/darwin-tar/build/install/elasticsearch-9.0.0-SNAPSHOT/modules/x-pack-ml/icu4j-68.2.jar jar2: /Users/ioana/projects/elasticsearch/distribution/archives/darwin-tar/build/install/elasticsearch-9.0.0-SNAPSHOT/modules/x-pack-esql-core/icu4j-68.2.jar at org.elasticsearch.base@9.0.0-SNAPSHOT/org.elasticsearch.jdk.JarHell.checkClass(JarHell.java:316) at org.elasticsearch.base@9.0.0-SNAPSHOT/org.elasticsearch.jdk.JarHell.checkJarHell(JarHell.java:234) at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.plugins.PluginsUtils.checkBundleJarHell(PluginsUtils.java:319) ... 9 more
inference and ml seem to have many common dependencies and AFAIK esql is the only plugin that seems to need both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solved this by no longer using SemanticQueryBuilder
- this means we no longer need to extend from x-pack-inference
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to make both search and ml work here, we should figure out how to deal with the dependency other than deferring it. @jan-elastic @ChrisHegarty do you have any idea whether it is ok to remove x-pack-ml
here, or how to make it work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is ok to remove x-pack-ml here, or how to make it work?
It's okay to remove it because we added it to x-pack-esql-core
- x-pack-esql-core
now extends x-pack-ml
and x-pack-esql
extends x-pack-esql-core
.
When I simply removed x-pack-ml
and replaced it x-pack-inference
, the CATEGORIZE
tests would start to fail. So I know this should be safe, because I can see CATEGORIZE
works as expected (none of the tests for it fail).
we should figure out how to deal with the dependency other than deferring it.
I agree - I went back and forth on this and this is the best we could come up with:
- we could move
SemanticQueryBuilder
inx-pack-core
- this would mean moving some additional classes likeSemanticTextFieldMapper
, maybe more, I haven't unravelled this one. - however even if we can use
SemanticQueryBuilder
in ES|QL this would only be temporary - the reason for this is that for _search we want the match query to actually supportsemantic_text
(Match query support for semantic_text fields - POC DO NOT MERGE #112166 - a POC and the actual implementation will be done in the following weeks). - if you take a look at the match query POC it looks like
MatchQueryBuilder
hassemantic_text
support -MatchQueryBuilder
is in server so we should be able to use without dependency issues (we already use it).
So my take is that moving SemanticQueryBuilder
does not gain us much if support in MatchQueryBuilder
is coming soon. And I think that once that's coming, we might be able to revert build.gradle
to what it was originally.
I captured that in a bullet point in #115103
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM - great work! 💯
We should get eyes from the AE team
} | ||
| ||
private String embeddingsFieldName() { | ||
return name.concat(".inference.chunks.embeddings"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a pity we can't access this constant from the ML plugin 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prevents us from accessing this constant from the ML plugin? Is it related to the dependency issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not from the ML plugin - they are from the inference plugin:
Lines 58 to 69 in fa9f2bf
static final String TEXT_FIELD = "text"; | |
static final String INFERENCE_FIELD = "inference"; | |
static final String INFERENCE_ID_FIELD = "inference_id"; | |
static final String SEARCH_INFERENCE_ID_FIELD = "search_inference_id"; | |
static final String CHUNKS_FIELD = "chunks"; | |
static final String CHUNKED_EMBEDDINGS_FIELD = "embeddings"; | |
static final String CHUNKED_TEXT_FIELD = "text"; | |
static final String MODEL_SETTINGS_FIELD = "model_settings"; | |
static final String TASK_TYPE_FIELD = "task_type"; | |
static final String DIMENSIONS_FIELD = "dimensions"; | |
static final String SIMILARITY_FIELD = "similarity"; | |
static final String ELEMENT_TYPE_FIELD = "element_type"; |
We can import things from the ML plugin, but once we extend the inference plugin we end up with jar hell.
.../esql-core/src/main/java/org/elasticsearch/xpack/esql/core/querydsl/query/SemanticQuery.java Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/InferenceUtils.java Outdated Show resolved Hide resolved
- match: { columns: [{name: name, type: keyword}] } | ||
| ||
--- | ||
semantic_text declared in mapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this test - it failed for serverless tests on mixed version clusters with Invalid EsField type [SemanticTextEsField]
. However:
-
this test is a kind of a hack - it puts the full internal representation for the semantic_text field which will change, so this test will need change to use the test inference plugin so that the embeddings are internally generated - but then we have the same problem as Add mixed mode BWC testing support for test-specific plugins #115166
-
the amount of CSV tests we have already cover the fact that for ESQL the block loader is capable of only returning the original text content
-
semantic_text
is only available in snapshots - so not available for serverless - the fact that this test fails has no meaning for serverless prod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a yml test for semantic_text is better, if serverless is affected, can this be muted on serverless? Better keep it for stateful if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the removal and muted the test on serverless.
Also keeping track in #115103 to not forget to unmute this test in serverless once the changes here make it in 9.0.0 snapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting these together @ioanatia ! I added some comments, and made a summary of my thoughts so far:
- This is a much more complicated feature compared to
KQL
, as it touches a lot of more areas, and it worth spending more time on it. - I was looking for design docs, however I have no luck finding one that help me clarify the questions in my mind yet. There are a few areas that I think they worth further discussions:
- The workflow of inference resolution in
EsqlSession
, I wonder if any alternative approaches were being considered or discussed. Have we considered implementing inference resolution in a similar way as index or enrich policy resolver? Making Inference resolution independent of them seems more clean if possible, so that we can minimize unnecessary interface changes or changes to tests that are irrelevant to inference resolution. - What is the best way to tell Analyzer that there is remote cluster involved? And how to deal with it within the scope of match?
- How to deal with semantic_text fields with multiple inference ids/results(this is allowed in the SemanticTextEsField definition), is it necessary to deal with them when there is no match function/operator in the query? Resolving inference adds extra cost, it is better to do it when we know that we will need it. It is likely multiple inference ids on the same semantic text field happens when there are multiple indices that match a common index pattern in this case, to make it more complicated, the data type could be different given the same field name, do we know how they behave?
- The workflow of inference resolution in
- More negative and positive tests will be appreciated, as they are the only way to exercise all possible code paths.
name 'x-pack-esql' | ||
description 'The plugin that powers ESQL for Elasticsearch' | ||
classname 'org.elasticsearch.xpack.esql.plugin.EsqlPlugin' | ||
extendedPlugins = ['x-pack-esql-core', 'lang-painless', 'x-pack-ml'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to make both search and ml work here, we should figure out how to deal with the dependency other than deferring it. @jan-elastic @ChrisHegarty do you have any idea whether it is ok to remove x-pack-ml
here, or how to make it work?
...ck/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/type/SemanticTextEsFieldTests.java Outdated Show resolved Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/InferenceUtils.java Outdated Show resolved Hide resolved
FieldAttribute field = (FieldAttribute) match.field(); | ||
SemanticTextEsField esField = (SemanticTextEsField) field.field(); | ||
if (esField.inferenceIds().size() == 1) { | ||
result.add(new SemanticQuery(field.sourceText(), match.query().sourceText(), esField.inferenceIds().iterator().next())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first inferenceId
is used here, match
works with one inferenceId
only, however multiple inferenceId
can be defined in SemanticTextEsField
. it will be appreciated if there are some comments here, and if we don't deal with multiple inference ids, does it make sense to keep just one inference id in SemanticTextEsField
?
In case there are multiple inferenceId
s, do we want to throw an exception here? Similar with what is done in Verifier
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case there are multiple inferenceIds, do we want to throw an exception here? Similar with what is done in Verifier.
I guess it does not hurt to throw an exception - but at this point in the execution, the Verifier
already checked and returned an error if we are dealing with multiple inference ids.
} else { | ||
| ||
// This should never happen, but we handle it here either way | ||
throw new IllegalStateException("Cannot handle inference results"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is IllegalStateException
a 400 or 500?
public Set<String> inferenceIdsForField( | ||
Map<String, IndexMetadata> indexMetadata, | ||
FieldCapabilitiesResponse fieldCapsResponse, | ||
String name | ||
) { | ||
Set<String> inferenceIds = new HashSet<>(); | ||
| ||
for (FieldCapabilitiesIndexResponse ir : fieldCapsResponse.getIndexResponses()) { | ||
String indexName = ir.getIndexName(); | ||
InferenceFieldMetadata inferenceFieldMetadata = indexMetadata.get(indexName).getInferenceFields().get(name); | ||
if (inferenceFieldMetadata != null) { | ||
inferenceIds.add(inferenceFieldMetadata.getInferenceId()); | ||
} | ||
} | ||
return inferenceIds; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider decoupling this from IndexResolver
and make an InferenceResolver
to handle inference related tasks.
10043 | Yishay | Tzvieli | ||
; | ||
| ||
testMatchWithSemanticText |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More positive tests will be appreciated. How does it work with multi-valued fields? More combinations(conjunction) with other functions and operators, before/after stats/eval when it works, for the completeness. There could be surprises, without testing it out, it is hard to tell.
8 | William Faulkner | ||
; | ||
| ||
testMatchWithSemanticText |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as match-function
, more variety of queries will be appreciated.
); | ||
} | ||
| ||
public void testMatchWithSemanticTextWithCCQ() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More negative tests will be appreciated, we have code to detect errors, without having negative tests to cover those code paths, it is hard to tell whether they work as expected. Here are a few scenarios that I can think of:
Match
onsemantic_text
fields with multipleinference_ids
Match
on multiple indices that have a common index pattern, the field names is the same, however they may have different data types -text
,keyword
,semantic_text
QSTR
orKQL
with semantic_text field in the index mapping, they are search functions as well.- Include
semantic_text
inMatchTests
andMatchOperatorTests
.
- match: { columns: [{name: name, type: keyword}] } | ||
| ||
--- | ||
semantic_text declared in mapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a yml test for semantic_text is better, if serverless is affected, can this be muted on serverless? Better keep it for stateful if possible.
…f semantic_text" This reverts commit 8d8efa9.
Thank you @fang-xing-esql for your review and thanks for meeting with me to discuss this.
Looking at ways to simplify this I am thinking to:
If the
I think I also need to add a comment in
Will follow up on this tomorrow. thanks! |
I started to do some experiments with the branch, and came across an issue when we have multiple indices with the same semantic_text field name, but different inference ids, here are the steps to reproduce it.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for fixing the issue with multiple inference ids @ioanatia! I got chance to try more today and add some comments with my findings.
Set<String> indexNames = indexNames(analyzedPlan); | ||
| ||
analyzedPlan.forEachExpressionDown(Match.class, match -> { | ||
if (match.field().dataType() == DataType.SEMANTIC_TEXT && match.field() instanceof FieldAttribute field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this case if it hasn't been covered yet? Schema is here.
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx00 | eval x = textfield | where match(x, \"live\")" } ' { "error" : { "root_cause" : [ { "type" : "null_pointer_exception", "reason" : "Cannot invoke \"org.elasticsearch.inference.InferenceResults.getClass()\" because \"this.inferenceResults\" is null" } ], "type" : "null_pointer_exception", "reason" : "Cannot invoke \"org.elasticsearch.inference.InferenceResults.getClass()\" because \"this.inferenceResults\" is null", "suppressed" : [ { "type" : "exception", "reason" : "1 further exceptions were dropped" }, { "type" : "task_cancelled_exception", "reason" : "parent task was cancelled [cancelled on failure]" } ] }, "status" : 500 } + curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx04 | eval x = textfield | where match(x, \"live\")" } ' textfield | textfield.keyword | x -------------------------------------+-------------------------------------+------------------------------------- live long and prosper 04 text keyword|live long and prosper 04 text keyword|live long and prosper 04 text keyword
) | ||
); | ||
} else { | ||
failures.add( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about it more, I wonder if we should allow multiple inference ids on the same field, if the inference ids come from different indices, and each index has only one inference id on this field. This is more like a usability issue.
If I have multiple indices that have the same semantic text field name, I couldn't find a way to query them in at once. Does it look like a valid use case? Schema is here.
Thanks for fixing this! + curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx00, testidx01 | where match(textfield, \"live\")" } ' { "error" : { "root_cause" : [ { "type" : "verification_exception", "reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield] because it is configured with multiple inference IDs." } ], "type" : "verification_exception", "reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield] because it is configured with multiple inference IDs." }, "status" : 400 } + curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx00, testidx01 | where match(textfield::keyword, \"live\") | eval k = textfield::keyword | keep k" } ' { "error" : { "root_cause" : [ { "type" : "verification_exception", "reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield::keyword], which is not a field from an index mapping" } ], "type" : "verification_exception", "reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield::keyword], which is not a field from an index mapping" }, "status" : 400 }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fang-xing-esql , the problem with multiple inference IDs is relevance. It does not make sense to mix multiple inference IDs from the scoring point of view, as different models will get different scores that can't directly be combined. That's the main reason it is not allowed on the _search
endpoint.
In _search, users can combine different inference ids using different query clauses in _search, that can the be further adjusted via boosting.
Even if it's technically possible to do this from a filtering PoV, it won't make sense to do it once we combine with score.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we take this as a follow up item?
Since this restriction is also present in _search, for all the reasons @carlosdelest mentioned, I thought having it in ES|QL is reasonable for now.
The time where it becomes relevant to lift this restriction is when we introduce RRF, in that case we might want to allow RRF between two subqueries that query the same field on different indices: WHERE where match(textfield, "live") AND _index == "testidx00"
and WHERE where match(textfield, "live") AND _index == "testidx01"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this can be addressed as a follow up.
match.inferenceResults() | ||
); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another scenario that I can think of - there are two indices with the same field name, one is semantic text, the other is a text field for example. If I query each of them individually, I got expected results, however if I query both of them together, I got empty results. It is similar to an issue that I brought up to Carlos on his PR, however, semantic text has a different type of query for now, perhaps it has to be taken care of in a different way.
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx00 | where match(textfield, \"live\")" } ' textfield ------------------------------ live long and prosper 00 elser + curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx04 | where match(textfield, \"live\")" } ' textfield | textfield.keyword -------------------------------------+------------------------------------- live long and prosper 04 text keyword|live long and prosper 04 text keyword + curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d { "query": "from testidx00, testidx04 | where match(textfield::keyword, \"live\") | eval k = textfield::keyword | keep k" } ' k ---------------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find! I added this as a follow up of things to do before we take this out of snapshot in #115103.
closed in favour of #118676 |
tracked in: #115103
This brings the ability to execute semantic queries using the
match
function onsemantic_text
fields.How to test:
POST /_license/start_trial?acknowledge=true
.Machine Learning > Trained Models
you can check the status of the ELSER model:Execution
What we do in _search:
In _search when we execute a semantic query, we have 2 rewrite phases:
live long
- the inference results are the embeddings we will use to querying thesemantic_text
field. Retrieving the embeddings is done on the coordinator, because we do not want to do an inference call on every shard. The inference endpoint might be configured to work with an external, paid service like Azure OpenAI/Amazon Bedrock.For ES|QL we kept the same execution flow:
match
function that has asemantic_text
argument. We collect these semantic query arguments and call the inference service to retrieve the embeddings. We then embed the inference results in theMatch
objects that represent semantic queries(InferenceUtils
).Match
function to a query builder as part of pushing the filters to Lucene, we do the same type of query rewriting that is done in _search depending on whether the inference results are dense or sparse vectors.I I initially wanted to useSemanticQueryBuilder
but that resulted in a dependency hell. NOTE: this is only temporary, very soon we should have support forsemantic_text
inMatchQueryBuilder
, so this logic will be simplified.For licensing:
I did not add any extra licensing checks. After consulting with team, it seems those are not needed.
SemanticQueryBuilder
also does not do any licensing checks. It relies on the inference APIs to have proper licensing handling.Checklist:
semantic
queries in_search
either - but we need to check we don't fail with a funny error, but with a meaningful messagesemantic_text
field that has differentinference_id
s - validations/tests are needed, this is not supported in_search
too