- Notifications
You must be signed in to change notification settings - Fork 25.6k
ESQL - KNN function uses LIMIT for K, transforms to exact search when not pushed down #132944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL - KNN function uses LIMIT for K, transforms to exact search when not pushed down #132944
Conversation
…earch-non-pushed' into non-issue/esql-knn-exact-search-non-pushed # Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/querydsl/query/KnnQuery.java
…act-search-non-pushed # Conflicts: # x-pack/plugin/esql/qa/testFixtures/src/main/resources/knn-function.csv-spec # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/plugin/KnnFunctionIT.java # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerTests.java
…earch-non-pushed' into non-issue/esql-knn-exact-search-non-pushed
🔍 Preview links for changed docs |
assert page.getBlockCount() >= 2 : "Expected at least 2 blocks, got " + page.getBlockCount(); | ||
assert page.getBlock(0).asVector() instanceof DocVector : "Expected a DocVector, got " + page.getBlock(0).asVector(); | ||
assert page.getBlock(1).asVector() instanceof DoubleVector : "Expected a DoubleVector, got " + page.getBlock(1).asVector(); | ||
assert page.getBlockCount() > scoreBlockPosition : "Expected to get a score block in position " + scoreBlockPosition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor unrelated change - removes unnecessary assertions and uses a non-hardcoded position
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
Pinging @elastic/search-relevance (Team:Search - Relevance) |
`boost` | ||
: (float) Floating point number used to decrease or increase the relevance scores of the query.Defaults to 1.0. | ||
| ||
`min_candidates` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking through this deviation from knn query in the _search dsl may impact my work. I had naively started to add visit_percentage
into the set of available options here too but now I'm questioning that a bit given that you are thinking of moving away from num_candidates
. Thoughts on my draft here: https://github.com/elastic/elasticsearch/pull/133753/files#diff-0ec49ad4bdf06d1a122ea4657297fa276019d12b7affc9b07ab61f36e7b77c09
visit_percentage
would for bbq_disk
override num_candidates
and provide essentially more fine-grained control over what users can specify for total explored vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
visit_percentage would for bbq_disk override num_candidates and provide essentially more fine-grained control over what users can specify for total explored vectors.
IIUC, we could provide a min_visit_percentage
here for the same purpose. Would that work?
The main goal is to make sure users are not surprised in case knn needs to be translated to an exact query because it can't be pushed down.
Is this an option we plan to add for knn in general? What will happen in case the underlying index format doesn't support it? Or are we planning to support this for hnsw as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, we could provide a min_visit_percentage here for the same purpose. Would that work?
Let me think about that over the weekend here. I think I understand why you are doing this. So it may just be a matter of whose PR goes first.
Is this an option we plan to add for knn in general? What will happen in case the underlying index format doesn't support it? Or are we planning to support this for hnsw as well?
Like only "disk" related formats which only be "bbq_disk" for now. It will be ignored otherwise. Feedback on that is welcome too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about this some more; I'm going to take any updates to ESQL out of my PR. For one it complicates what's there. And we also just don't need it yet. We can revisit in the future and that at least makes this conversation moot.
Unrelated my gut reaction was this is a bit of a red flag. I struggle a little bit with differing options between _search and ESQL but after reading a bit I understand why this is needed. +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jumping on this thread, I think we'll need some communication around this to users - it's not super intuitive if you're used to working with oldschool KNN search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can do that in the docs and blog post. It's good that num_candidates is no longer a supported option in this knn function to avoid further confusion.
assertTrue(secondKnnFilters.contains(firstOr.right())); | ||
} | ||
| ||
public void testKnnImplicitLimit() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get a test when knn is used before stats? e.g. from test | where knn(...) | stats x = COUNT(*)
.
I know it might seem like this does not make a lot of sense to use STATS after KNN, but it would be good to check what k
do we set.
since k
is set through an optimization rule, it would be great to also have CSV tests with knn used before STATS, RERANK etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Break if it's not the initial limit | ||
breakerReached.set(firstLimit.get()); | ||
firstLimit.set(true); | ||
} else if (plan instanceof TopN || plan instanceof Rerank || plan instanceof Aggregate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens in the case where we reach TopN/Rerank/Aggregate before we reach a LIMIT plan?
will the value of k
be null in this case? does that mean we will later fail to translate the knn function in a knn query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ioanatia That fails as of now - I should have tested for this 😞 .
I can think of two ways of addressing this:
- Enforcing using a LIMIT for knn. Fail the query if there's no implicit / explicit LIMIT.
The following would fail:
| WHERE knn() | STATS c = count(*)
but also, the following would fail:
| WHERE knn() | STATS c = count(*) where _score > 0.5
- Use an exact search when a LIMIT is not enforced. That would help with the examples above, at the cost of doing a search over all matching rows.
I'm favoring 1) as knn makes no sense if not used as part of a TopN, and users should use an explicit exact search for those cases. It's also a restriction that we can eventually lift if we come with other solution.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a shot at implementing 1) in 9d3c85f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for now - we can always address this in another way later 👍
…act-search-non-pushed # Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java
…earch-non-pushed' into non-issue/esql-knn-exact-search-non-pushed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me, some minor questions
`boost` | ||
: (float) Floating point number used to decrease or increase the relevance scores of the query.Defaults to 1.0. | ||
| ||
`min_candidates` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jumping on this thread, I think we'll need some communication around this to users - it's not super intuitive if you're used to working with oldschool KNN search.
public KnnQuery(Source source, String field, float[] query, Map<String, Object> options, List<QueryBuilder> filterQueries) { | ||
public KnnQuery(Source source, String field, float[] query, Integer k, Map<String, Object> options, List<QueryBuilder> filterQueries) { | ||
super(source); | ||
assert k != null && k > 0 : "k must be a positive integer, but was: " + k; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert on max limit too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was more about a sanity check for having k
set before the query translation in ES|QL, as the overall error checking will be done on the QueryBuilder
.
} | ||
| ||
public void testKnnUsesLimitForK() { | ||
assumeTrue("dense_vector capability not available", EsqlCapabilities.Cap.DENSE_VECTOR_FIELD_TYPE.isEnabled()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to check for the fork capability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no FORK involved in the test - the test name is "test Knn Uses Limit For K", not "FORK" 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha that's what I get for reading it quickly 🤦
// Break if it's not the initial limit | ||
breakerReached.set(firstLimit.get()); | ||
firstLimit.set(true); | ||
} else if (plan instanceof TopN || plan instanceof Rerank || plan instanceof Aggregate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for now - we can always address this in another way later 👍
"""), containsString("Knn function must be used with a LIMIT clause")); | ||
} | ||
| ||
public void testKnnWithRerankAmdLimit() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to get one more test - we have an optimization that combines limits and it would be nice to test the combination of the two, e.g. FROM my-index metadata _score | WHERE knn(...) | LIMIT 200 | LIMIT 10
- I expect k
will be 10 here, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it doesn't hurt - added in c4f3da7.
The nice thing about optimization rules is that they can be applied independently and still work together, like in this case 🥳
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
Changes how KNN works for ESQL:
k
is no longer specified as a parameterLIMIT
is used to provide the k parameterknn
can't be pushed down to Lucene, it is transformed into an exact searchnum_candidates
, users can providemin_candidates
. That is the minimum number of candidates to use, as in case the query is transformed into an exact search we can't really guarantee that we will be evaluatingnum_candidates
at the mostmin_candidates
is specified, then it is used as k if it's bigger than the LIMIT applied to KNNExample:
The above example will use k=10, as that is the LIMIT used
The following example specifies the minimum number of candidates as 200:
The following example will use exact nearest neighbors search, as it is used as part of a non-pushable disjunction:
The following example will use exact nearest neighbors search, as it has a non-pushable knn prefilter: