Make knn search a query #98916
Merged
24 commits
- 5a05304 Make knn search a query (mayya-sharipova)
- 5aca3c2 Update docs/changelog/98916.yaml (mayya-sharipova)
- 897266a Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- 8b02234 Correct transport version, remove byteQueryVector (mayya-sharipova)
- 88c3233 Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- 49f2b80 Other adjustments (mayya-sharipova)
- 9b1e7c8 Add aliasFilter to the SearchExecutionContext instead of QueryBuilder (mayya-sharipova)
- cf1c0ae Remove query_vector_builder (mayya-sharipova)
- 9046812 Add query _name to tests (mayya-sharipova)
- 8a4cb9b Add filter alias during doToQuery (mayya-sharipova)
- dccd3c7 Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- d4a5758 Simplify over-wire protocol (mayya-sharipova)
- 00a5d5c Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- 7d2d091 Add nested support for knn query (mayya-sharipova)
- 56b8bbb Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- b3ca311 Add documentation and other queries (mayya-sharipova)
- 21f40b4 Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- 9d825cf Updates to knn-query documentation (mayya-sharipova)
- e0ae1dc Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- 348305f Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- dc0012c Adjust docs (mayya-sharipova)
- 6876c5a Merge remote-tracking branch 'upstream/main' into knn-as-query (mayya-sharipova)
- c4b80ba Merge branch 'main' into knn-as-query (mayya-sharipova)
- 6450681 Fix an error in docs (mayya-sharipova)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
pr: 98916 | ||
summary: Make knn search a query | ||
area: Vector Search | ||
type: feature | ||
issues: [] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,222 @@ | ||
[[query-dsl-knn-query]] | ||
=== Knn query | ||
++++ | ||
<titleabbrev>Knn</titleabbrev> | ||
++++ | ||
| ||
Finds the _k_ nearest vectors to a query vector, as measured by a similarity | ||
metric. The _knn_ query finds nearest vectors through approximate search on indexed | ||
dense_vectors. The preferred way to do approximate kNN search is through the | ||
<<knn-search,top level knn section>> of a search request. The _knn_ query is reserved for | ||
expert cases, where there is a need to combine this query with other queries. | ||
| ||
[[knn-query-ex-request]] | ||
==== Example request | ||
| ||
[source,console] | ||
---- | ||
PUT my-image-index | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"image-vector": { | ||
"type": "dense_vector", | ||
"dims": 3, | ||
"index": true, | ||
"similarity": "l2_norm" | ||
}, | ||
"file-type": { | ||
"type": "keyword" | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
| ||
. Index your data. | ||
+ | ||
[source,console] | ||
---- | ||
POST my-image-index/_bulk?refresh=true | ||
{ "index": { "_id": "1" } } | ||
{ "image-vector": [1, 5, -20], "file-type": "jpg" } | ||
{ "index": { "_id": "2" } } | ||
{ "image-vector": [42, 8, -15], "file-type": "png" } | ||
{ "index": { "_id": "3" } } | ||
{ "image-vector": [15, 11, 23], "file-type": "jpg" } | ||
---- | ||
//TEST[continued] | ||
| ||
. Run the search using the `knn` query, asking for the top 3 nearest vectors. | ||
+ | ||
[source,console] | ||
---- | ||
POST my-image-index/_search | ||
{ | ||
"size" : 3, | ||
"query" : { | ||
"knn": { | ||
"field": "image-vector", | ||
"query_vector": [-5, 9, -12], | ||
"num_candidates": 10 | ||
} | ||
} | ||
} | ||
---- | ||
//TEST[continued] | ||
| ||
NOTE: The `knn` query doesn't have a separate `k` parameter. As with other queries, | ||
`k` is defined by the `size` parameter of the search request. The `knn` query | ||
collects `num_candidates` results from each shard, then merges them to get | ||
the top `size` results. | ||
| ||
| ||
[[knn-query-top-level-parameters]] | ||
==== Top-level parameters for `knn` | ||
| ||
`field`:: | ||
+ | ||
-- | ||
(Required, string) The name of the vector field to search against. Must be a | ||
<<index-vectors-knn-search, `dense_vector` field with indexing enabled>>. | ||
-- | ||
| ||
`query_vector`:: | ||
+ | ||
-- | ||
(Required, array of floats) Query vector. Must have the same number of dimensions | ||
as the vector field you are searching against. | ||
-- | ||
| ||
`num_candidates`:: | ||
+ | ||
-- | ||
(Required, integer) The number of nearest neighbor candidates to consider per shard. | ||
Cannot exceed 10,000. {es} collects `num_candidates` results from each shard, then | ||
merges them to find the top results. Increasing `num_candidates` tends to improve the | ||
accuracy of the final results. | ||
-- | ||
| ||
`filter`:: | ||
+ | ||
-- | ||
(Optional, query object) Query to filter the documents that can match. | ||
The kNN search will return the top documents that also match this filter. | ||
The value can be a single query or a list of queries. If `filter` is not provided, | ||
all documents are allowed to match. | ||
| ||
The filter is a pre-filter, meaning that it is applied **during** the approximate | ||
kNN search to ensure that `num_candidates` matching documents are returned. | ||
-- | ||
| ||
`similarity`:: | ||
+ | ||
-- | ||
(Optional, float) The minimum similarity required for a document to be considered | ||
a match. The similarity value relates to the raw | ||
<<dense-vector-similarity, `similarity`>> used, not to the document score. The matched | ||
documents are then scored according to <<dense-vector-similarity, `similarity`>> | ||
and the provided `boost` is applied. | ||
-- | ||
| ||
`boost`:: | ||
+ | ||
-- | ||
(Optional, float) Floating point number used to multiply the | ||
scores of matched documents. This value cannot be negative. Defaults to `1.0`. | ||
-- | ||
| ||
`_name`:: | ||
+ | ||
-- | ||
(Optional, string) Name field to identify the query. | ||
-- | ||
| ||
[[knn-query-filtering]] | ||
==== Pre-filters and post-filters in knn query | ||
| ||
There are two ways to filter documents that match a kNN query: | ||
| ||
. **pre-filtering** – the filter is applied during the approximate kNN search | ||
to ensure that `k` matching documents are returned. | ||
. **post-filtering** – the filter is applied after the approximate kNN search | ||
completes, which can result in fewer than `k` results, even when there are enough | ||
matching documents. | ||
| ||
Pre-filtering is supported through the `filter` parameter of the `knn` query. | ||
Filters from <<filter-alias,aliases>> are also applied as pre-filters. | ||
| ||
All other filters found in the Query DSL tree are applied as post-filters. | ||
In the following example, the `knn` query finds the top 3 documents with the nearest | ||
vectors (`num_candidates=3`), and the `term` filter is then applied to them as a | ||
post-filter. The final set of documents contains only the single document | ||
that passes the post-filter. | ||
| ||
| ||
[source,console] | ||
---- | ||
POST my-image-index/_search | ||
{ | ||
"size" : 10, | ||
"query" : { | ||
"bool" : { | ||
"must" : { | ||
"knn": { | ||
"field": "image-vector", | ||
"query_vector": [-5, 9, -12], | ||
"num_candidates": 3 | ||
} | ||
}, | ||
"filter" : { | ||
"term" : { "file-type" : "png" } | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
//TEST[continued] | ||
| ||
[[knn-query-with-nested-query]] | ||
==== Knn query inside a nested query | ||
| ||
The `knn` query can be used inside a nested query. The behaviour here is similar | ||
to <<nested-knn-search, top level nested kNN search>>: | ||
| ||
* kNN search over nested dense_vectors diversifies the top results over | ||
the top-level document | ||
* `filter` over the top-level document metadata is supported and acts as a | ||
post-filter | ||
* `filter` over `nested` field metadata is not supported | ||
| ||
A sample query can look like this: | ||
| ||
[source,js] | ||
---- | ||
{ | ||
"query" : { | ||
"nested" : { | ||
"path" : "paragraph", | ||
"query" : { | ||
"knn": { | ||
"query_vector": [ | ||
0.45, | ||
45 | ||
], | ||
"field": "paragraph.vector", | ||
"num_candidates": 2 | ||
} | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
// NOTCONSOLE | ||
| ||
[[knn-query-aggregations]] | ||
==== Knn query with aggregations | ||
The `knn` query calculates aggregations on the `num_candidates` results from each shard. | ||
Thus, the aggregations are computed over up to | ||
`num_candidates * number_of_shards` documents. This is different from | ||
the <<knn-search,top level knn section>> where aggregations are | ||
calculated on the global top k nearest documents. | ||
|
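The pre-filter and post-filter behaviour described in the documentation diff above can be illustrated with a small Python sketch. It is not how {es} implements the query: it assumes a single shard, uses exact brute-force `l2_norm` distances instead of approximate search, and the helper names (`nearest`, `post_filtered`, `pre_filtered`) are hypothetical. It only reuses the three sample documents and the query vector from the example requests.

```python
import math

# Sample documents from the _bulk request in the docs: (_id, image-vector, file-type).
DOCS = [
    ("1", [1, 5, -20], "jpg"),
    ("2", [42, 8, -15], "png"),
    ("3", [15, 11, 23], "jpg"),
]
QUERY_VECTOR = [-5, 9, -12]


def l2_distance(a, b):
    """Euclidean distance, standing in for the l2_norm similarity metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(docs, query_vector, num_candidates):
    """Exact nearest neighbours (assumption: one shard, no approximation)."""
    ranked = sorted(docs, key=lambda d: l2_distance(d[1], query_vector))
    return ranked[:num_candidates]


def post_filtered(docs, query_vector, num_candidates, file_type):
    """Collect candidates first, then drop those that fail the filter."""
    candidates = nearest(docs, query_vector, num_candidates)
    return [doc_id for doc_id, _, ftype in candidates if ftype == file_type]


def pre_filtered(docs, query_vector, num_candidates, file_type):
    """Restrict the search to matching documents, then collect candidates."""
    allowed = [d for d in docs if d[2] == file_type]
    return [doc_id for doc_id, _, _ in nearest(allowed, query_vector, num_candidates)]


if __name__ == "__main__":
    # Mirrors the bool query in the docs: 3 candidates, then a term post-filter.
    print(post_filtered(DOCS, QUERY_VECTOR, 3, "png"))
    # With fewer candidates, the post-filter can remove every hit ...
    print(post_filtered(DOCS, QUERY_VECTOR, 2, "png"))
    # ... while a pre-filter still finds the matching document.
    print(pre_filtered(DOCS, QUERY_VECTOR, 2, "png"))
```

With `num_candidates` set to 3, the post-filtered result contains only document `2`, matching the single-document outcome described in the docs. Lowering `num_candidates` to 2 leaves the post-filtered result empty, whereas the pre-filtered variant still returns document `2`.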
Is this just because it's too difficult to make work?
I think this is OK for now.
I will have to think more about if we should support it or not. IMO, it seems like we should. We are just matching the nearest "k", so it seems to fit OK. But I can also see the argument against (as you laid out).
Additionally, we should have a "more_like_this" query that utilizes knn as well.
Maybe open a GitHub issue to track discussion?
This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.
I understand that semantic similarity has no natural bounds like lexical similarity does. But using this to find queries that are semantically similar to stored or new docs is powerful, especially when you consider hybrid search and similarity filtering.
@benwtrent Let's discuss this offline (about percolator and more_like_this queries) and create GitHub issues if we find them necessary.
I will assume this PR goes ahead without those queries.