Add retrievers using the parser-only approach #105470

jdconrad · 2024-02-13T22:43:21Z

This enhancement adds a new abstraction to the _search API called "retriever." A retriever is something that returns top hits.

Three retrievers are going to be added initially:

standard - This retriever replaces standard query functionality. It contains sub-elements for query, search_after, terminate_after, sort_field, min_score, and collapse.
knn - This retriever replaces a knn search. It contains sub-elements for field, k, num_candidates, query_vector, query_vector_builder, and similarity.
rrf - This retriever replaces rank for rrf. It contains sub-elements for retrievers, rank_constant, and window_size.

All retrievers have a filter element as well. For rrf this is propagated to its child retrievers.

Also included is node feature functionality. (Note the plumbing for this was part of #105417). Each retriever has its own node feature to ensure compatibility with serverless. New retrievers can be easily added by adding a new node feature without having to worry about incompatibilities.

Retrievers are always "extracted" to a SearchSourceBuilder. They are simply an abstraction to better show how elements of the _search API interact together in an intuitive way. They are never serialized and are ephemerally used only during parsing.

Some examples:

GET index/_search { "retriever": { "standard": { "query": { "match": { "message": { "query": "This is a test." } } }, "sort": [ "popularity" ] } }, "size": 5, "fields": ["message"] } GET index/_search { "retriever": { "knn": { "field": "vector", "k": 3, "num_candidates": 10, "query_vector": [1, 2, 3] } }, "fields": ["message"] } GET index/_search { "retriever": { "rrf": { "window_size": 100, "retrievers": [ { "knn": { "field": "vector", "k": 3, "num_candidates": 10, "query_vector": [1, 2, 3] } }, { "standard": { "query": { "match": { "message": { "query": "This is a test." } } } } } ] } }, "size": 5, "fields": ["message"] }

This change adds an RRFRetrieverBuilder using the same strategy as ClassicRetrieverBuilder where we extract the parsed request body into a SearchSourceBuilder.

This adds an initial linear combination retriever using a similar strategy to RRF. This prototype will allow us to cover all existing use cases for standard search, and should make it easier to implement retrievers at more than a superficial level.

This change does the following: * renames the classic retriever to standard retriever * adds support for (pre) filter to the standard retriever * adds yaml tests for the standard retriever that covers all of the parse fields

This change removes all of the serialization code from retrievers since these will only be used as x- content parsers extracting values into the SearchSourceBuilder. This drastically reduces the test coverage area required. This change also removes post filter from the standard retriever for now as a post filter may need to apply to all returned documents in a compound query. Finally, this adds some additional checks for values that should not be shared between global values and retriever values.

This change adds yaml tests for the rrf retriever. This also fixes a bug that will allow knn retrievers to add their knn query as an addition to the current knn queries as opposed to overwriting it. This only applies to compound retrievers. This also adds a gate rrf platinum licensing.

jdconrad · 2024-03-07T21:46:15Z

@javanna @mayya-sharipova Thank you for the first round of reviews. I have responded to all the PR comments and added documentation changes for this. I would strongly prefer to minimize additional changes unless something is either incorrect or there is a significant design flaw given other work depending on this.

javanna

LGTM thanks @jdconrad

javanna · 2024-03-08T10:27:45Z

docs/reference/search/rrf.asciidoc

+[[rrf-using-sub-searches]]
+==== Reciprocal rank fusion using sub searches
+
+RRF using sub searches is no longer supported. Use the


I thought we never released sub_searches support to the public. But it seems like we did, yet it was experimental? I wonder if we should even mention it as a migration path.

I agree! I actually had removed it, but there are cross references from enterprise-search docs that I wasn't aware of until I got errors here during CI. I figured this allows enterprise-search time to move their documentation to use retrievers instead of sub_searches and we can remove them once they have the opportunity to change their references.

I see, can we clarify whether we need to mention the previous way at all?

@javanna I chatted with @kderusso about removing these elements from the documentation. She's going to review the docs in this PR, and we will follow up with getting the cross references changed once this PR is merged.

kderusso

Very nice work Jack! I left a few comments, none of which are blocking. Congrats!

kderusso · 2024-03-11T19:01:30Z

docs/reference/query-dsl/text-expansion-query.asciidoc

+This can also be achieved by using an <<rrf-retriever, `rrf` retriever>>
+with multiple <<standard-retriever, `standard` retrievers>>.


It would be nice if we continued to link to the main RRF page here as well.

Suggested change

This can also be achieved by using an <<rrf-retriever, `rrf` retriever>>

with multiple <<standard-retriever, `standard` retrievers>>.

This can also be achieved using <<rrf>>, through an <<rrf-retriever, `rrf` retriever>>

with multiple <<standard-retriever, `standard` retrievers>>.

kderusso · 2024-03-11T19:07:43Z

docs/reference/search/retriever.asciidoc

+A <<standard-retriever, retriever>> that replaces the functionality of a traditional <<query-dsl, query>>.
+
+`knn`::
+A <<knn-retriever, retriever>> that replaces the functionality of a <<search-api-knn, knn search>>.


I could see this being a little confusing to users, since knn is also a query (e.g. could knn-query be nested under a standard retriever?) I realize this is a weird case, but it might be good to address this somewhere, maybe in a blog post or something? WDYT?

You could use a knn query under a standard retriever. I agree this is probably better left to documentation in a different place, though. I worry that adding some notes about how you can use knn-query as part of a standard retriever will just confuse the issue more currently. I'll give this some thought as a possible follow up.

kderusso · 2024-03-11T19:14:43Z

docs/reference/search/retriever.asciidoc

+`k`::
+(Required, integer)
+
+Number of nearest neighbors to return as top hits. This value must be fewer than


Is it fewer or <=?

Good catch! I added an equal to as well.

kderusso · 2024-03-11T19:21:00Z

docs/reference/search/rrf.asciidoc

 https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf[Reciprocal rank fusion (RRF)]
-is a method for combining multiple result sets with different relevance
-indicators into a single result set. RRF requires no tuning, and the different
+is a method for combining multiple top documents sets with different relevance


Are we actively trying to change our vocabulary from result to top documents? It may be unintuitive or confusing?

I was doing this since the notion of a retriever is an entity that returns top documents. So I wanted to try to be consistent with that. I think I will change this one back to results, though, since it's not in the context of a retriever here.

kderusso · 2024-03-11T19:22:00Z

docs/reference/search/rrf.asciidoc

 }
 }
 ----
 // TEST[skip:example fragment]


Are we planning to add tests once RRF moves out of tech preview?

Ideally, yes. Setup here with documentation is tricky with the other example being tested directly.

kderusso · 2024-03-11T19:23:15Z

docs/reference/search/rrf.asciidoc


-<<request-body-sub-searches, Sub searches>> provides a way to
-combine and rank multiple searches using RRF.
+The `rrf` retriever provides a way to combine and rank multiple


Nice example 👍

kderusso · 2024-03-11T19:24:07Z

docs/reference/search/rrf.asciidoc

-independently of each other. First we run the term query for
-`blue shoes sales` using the standard BM25 scoring algorithm. Then
-we run the text expansion query for `What blue shoes are on sale?`
+In the above example, we execute each of the two `standard` retrievers


This is another example that could be annotated with numbers in the example fragment?

Same comment as above. I didn't find a way where I felt like the readability of the example was improved.

Here's along the lines of what I was thinking:

[source,console] ---- GET example-index/_search { "retriever": { "rrf": { <3> "retrievers": [ { "standard": { <1> "query": { "term": { "text": "blue shoes sale" } } } }, { "standard": { <2> "query": { "text_expansion":{ "ml.tokens":{ "model_id":"my_elser_model", "model_text":"What blue shoes are on sale?" } } } } } ], "window_size": 50, "rank_constant": 20 } } } ---- // TEST[skip:example fragment] In the above example, we execute each of the two `standard` retrievers independently of each other. <1> First we run the `standard` retriever specifying a term query for `blue shoes sales` using the standard BM25 scoring algorithm. <2> Next we run the `standard` retriever specifying a text expansion query for `What blue shoes are on sale?` using our <<semantic-search-elser, ELSER>> scoring algorithm. <3> The `rrf` retriever allows us to combine the two top documents sets generated by completely independent scoring algorithms with equal weighting. Not only does this remove the need to figure out what the appropriate weighting is using linear combination, but RRF is also shown to give improved relevance over either query individually.

If you don't feel that it provides value though, it's definitely not a necessary change.

This is nicer. I will change this. Thank you for the write up here. (It hadn't occurred to me to mark the rrf retriever out of order in this way.)

kderusso · 2024-03-11T19:26:12Z

docs/reference/search/search-your-data/search-application-api.asciidoc

- "ml.inference.{{.}}_expanded.predicted_value": {
- "model_text": "{{query_string}}",
- "model_id": "<elser_model_id>"
+ "retriever": {


One note for search applications - it's possible that people may have defined search applications with templates that use sub_searches (especially since we provided this as an example). Do we want to provide any type of warning to users via search responses that this is deprecated?

Good question! Let me follow up on this. Since rank and sub_searches aren't supported on serverless we can do this as a follow up.

kderusso · 2024-03-11T19:27:09Z

docs/reference/search/search.asciidoc

 [[request-body-rank]]
 `rank`::
 preview:[]
 This param is in technical preview and may change in the future. The syntax will


Should we remove the verbiage about syntax changing prior to GA for these features?

I think this is pretty standard until a feature does become GA just in case we discover any design flaws, so I'd prefer to error on the side of caution here.

kderusso · 2024-03-11T19:32:32Z

server/src/main/java/org/elasticsearch/search/retriever/RetrieverBuilder.java

+ if (categoryClass.equals(RetrieverBuilder.class)) {
+ nestedDepth++;
+
+ if (nestedDepth > 2) {


It would be nice to break this out as a constant.

Absolutely! This needs a bit of a re-design to work with re-rankers, so I'm going to leave it for now. I think depth will be less important than specific ordering in the end.

benwtrent · 2024-04-11T15:54:21Z

@pmpailis @jdconrad we should highlight the new format as its much nicer. I am not sure @giladgal as it is still technical preview.

jdconrad and others added 30 commits October 13, 2023 10:07

Adds base and classic retrievers (#100731)

a0be9df

Merge branch 'main' into retrievers

1407559

Merge branch 'main' into retrievers

987a4ac

Merge remote-tracking branch 'upstream/main' into retrievers

addda00

Introduce kNN retriever (#101166)

69cfca8

Add an RRF Retriever (#101090)

b89b582

This change adds an RRFRetrieverBuilder using the same strategy as ClassicRetrieverBuilder where we extract the parsed request body into a SearchSourceBuilder.

Merge branch 'main' into retrievers

c62d4ad

Merge branch 'main' into retrievers

5b4a214

Merge branch 'main' into retrievers

cb6e5f6

Merge branch 'main' into retrievers

32f97a4

Merge branch 'main' into retrievers

3b0c859

Merge branch 'main' into retrievers

0ceb78a

Merge branch 'main' into retrievers

5553191

Merge branch 'main' into retrievers

9426c04

Merge branch 'main' into retrievers

6a36899

Merge branch 'main' into retrievers

349a966

Merge branch 'main' into retrievers

c95cb61

Merge branch 'main' into retrievers

e6f35e1

Merge branch 'main' into retrievers

8270e05

Merge branch 'main' into retrievers

fcbd7f8

Merge branch 'main' into retrievers

bf746d9

Yaml tests for standard retriever (#104482)

90e2d36

This change does the following: * renames the classic retriever to standard retriever * adds support for (pre) filter to the standard retriever * adds yaml tests for the standard retriever that covers all of the parse fields

Merge branch 'main' into retrievers

706e3d4

Merge remote-tracking branch 'upstream/retrievers' into retrievers

945caba

Merge branch 'main' into retrievers

74ab5e4

Merge branch 'main' into retrievers

948ce3d

Merge branch 'main' into retrievers

4bb0feb

jdconrad added 13 commits February 26, 2024 12:27

Merge branch 'main' into retrievers

72ce090

fix tests

4300470

Merge branch 'main' into retrievers

35bb53e

add unit tests for standard retriever parsing

c6de697

Merge branch 'main' into retrievers

4f8cd27

add unit tests for parsing knn retriever

cdf1b5d

add rrf retriever parsing tests

baf0c0c

fix license header

2aca681

Merge branch 'main' into retrievers

3b7a45e

add outline for retrievers doc with standard retriever parameters

b887f1f

finish parameters reference for retrievers

2c7ff54

remove rank and sub_searches parameters from docs

7990e67

Merge branch 'main' into retrievers

351aec9

fix doc crosslinks and test bwc

2914e37

javanna approved these changes Mar 8, 2024

View reviewed changes

jdconrad added 3 commits March 11, 2024 11:15

Merge branch 'main' into retrievers

663b5a6

add docs for sub_searches back in

6c6a6bc

add docs for rank back in

1e20770

kderusso approved these changes Mar 11, 2024

View reviewed changes

jdconrad added 3 commits March 11, 2024 15:45

response to pr comments

db29499

update link

3a5037b

add annotations

cb77355

jdconrad merged commit 68b0aca into main Mar 12, 2024

jdconrad deleted the retrievers branch March 12, 2024 17:11

benwtrent added the release highlight label Apr 11, 2024

ezimuel mentioned this pull request Apr 15, 2024

Add the retriever API in _search elastic/elasticsearch-specification#2508

Closed

sakurai-youhei mentioned this pull request Aug 20, 2024

RRF Retriever's filter is not documented #112000

Closed

pmpailis mentioned this pull request Oct 15, 2024

Removing tech-preview header and updating documentation for retrievers and RRF #114810

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add retrievers using the parser-only approach #105470

Add retrievers using the parser-only approach #105470

Uh oh!

jdconrad commented Feb 13, 2024

jdconrad commented Mar 7, 2024

javanna left a comment

javanna Mar 8, 2024

jdconrad Mar 8, 2024

javanna Mar 11, 2024

jdconrad Mar 11, 2024

kderusso left a comment

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 12, 2024

jdconrad Mar 12, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

kderusso Mar 11, 2024

jdconrad Mar 11, 2024

benwtrent commented Apr 11, 2024

Labels

7 participants

		This can also be achieved by using an <<rrf-retriever, `rrf` retriever>>
		with multiple <<standard-retriever, `standard` retrievers>>.

Add retrievers using the parser-only approach #105470

Add retrievers using the parser-only approach #105470

Uh oh!

Conversation

jdconrad commented Feb 13, 2024

jdconrad commented Mar 7, 2024

javanna left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kderusso left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benwtrent commented Apr 11, 2024

Labels

7 participants