- Notifications
You must be signed in to change notification settings - Fork 25.6k
Add retrievers using the parser-only approach #105470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change adds an RRFRetrieverBuilder using the same strategy as ClassicRetrieverBuilder where we extract the parsed request body into a SearchSourceBuilder.
This adds an initial linear combination retriever using a similar strategy to RRF. This prototype will allow us to cover all existing use cases for standard search, and should make it easier to implement retrievers at more than a superficial level.
This change does the following: * renames the classic retriever to standard retriever * adds support for (pre) filter to the standard retriever * adds yaml tests for the standard retriever that covers all of the parse fields
This change removes all of the serialization code from retrievers since these will only be used as x- content parsers extracting values into the SearchSourceBuilder. This drastically reduces the test coverage area required. This change also removes post filter from the standard retriever for now as a post filter may need to apply to all returned documents in a compound query. Finally, this adds some additional checks for values that should not be shared between global values and retriever values.
This change adds yaml tests for the rrf retriever. This also fixes a bug that will allow knn retrievers to add their knn query as an addition to the current knn queries as opposed to overwriting it. This only applies to compound retrievers. This also adds a gate rrf platinum licensing.
@javanna @mayya-sharipova Thank you for the first round of reviews. I have responded to all the PR comments and added documentation changes for this. I would strongly prefer to minimize additional changes unless something is either incorrect or there is a significant design flaw given other work depending on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks @jdconrad
[[rrf-using-sub-searches]] | ||
==== Reciprocal rank fusion using sub searches | ||
| ||
RRF using sub searches is no longer supported. Use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we never released sub_searches support to the public. But it seems like we did, yet it was experimental? I wonder if we should even mention it as a migration path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! I actually had removed it, but there are cross references from enterprise-search docs that I wasn't aware of until I got errors here during CI. I figured this allows enterprise-search time to move their documentation to use retrievers instead of sub_searches and we can remove them once they have the opportunity to change their references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, can we clarify whether we need to mention the previous way at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work Jack! I left a few comments, none of which are blocking. Congrats!
This can also be achieved by using an <<rrf-retriever, `rrf` retriever>> | ||
with multiple <<standard-retriever, `standard` retrievers>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we continued to link to the main RRF page here as well.
This can also be achieved by using an <<rrf-retriever, `rrf` retriever>> | |
with multiple <<standard-retriever, `standard` retrievers>>. | |
This can also be achieved using <<rrf>>, through an <<rrf-retriever, `rrf` retriever>> | |
with multiple <<standard-retriever, `standard` retrievers>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
A <<standard-retriever, retriever>> that replaces the functionality of a traditional <<query-dsl, query>>. | ||
| ||
`knn`:: | ||
A <<knn-retriever, retriever>> that replaces the functionality of a <<search-api-knn, knn search>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see this being a little confusing to users, since knn is also a query (e.g. could knn-query be nested under a standard retriever?) I realize this is a weird case, but it might be good to address this somewhere, maybe in a blog post or something? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use a knn query under a standard retriever. I agree this is probably better left to documentation in a different place, though. I worry that adding some notes about how you can use knn-query as part of a standard retriever will just confuse the issue more currently. I'll give this some thought as a possible follow up.
`k`:: | ||
(Required, integer) | ||
+ | ||
Number of nearest neighbors to return as top hits. This value must be fewer than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it fewer or <=?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I added an equal to as well.
docs/reference/search/rrf.asciidoc Outdated
https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf[Reciprocal rank fusion (RRF)] | ||
is a method for combining multiple result sets with different relevance | ||
indicators into a single result set. RRF requires no tuning, and the different | ||
is a method for combining multiple top documents sets with different relevance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we actively trying to change our vocabulary from result
to top documents
? It may be unintuitive or confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was doing this since the notion of a retriever is an entity that returns top documents. So I wanted to try to be consistent with that. I think I will change this one back to results, though, since it's not in the context of a retriever here.
} | ||
} | ||
---- | ||
// TEST[skip:example fragment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning to add tests once RRF moves out of tech preview?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, yes. Setup here with documentation is tricky with the other example being tested directly.
| ||
<<request-body-sub-searches, Sub searches>> provides a way to | ||
combine and rank multiple searches using RRF. | ||
The `rrf` retriever provides a way to combine and rank multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice example 👍
independently of each other. First we run the term query for | ||
`blue shoes sales` using the standard BM25 scoring algorithm. Then | ||
we run the text expansion query for `What blue shoes are on sale?` | ||
In the above example, we execute each of the two `standard` retrievers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another example that could be annotated with numbers in the example fragment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above. I didn't find a way where I felt like the readability of the example was improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's along the lines of what I was thinking:
[source,console] ---- GET example-index/_search { "retriever": { "rrf": { <3> "retrievers": [ { "standard": { <1> "query": { "term": { "text": "blue shoes sale" } } } }, { "standard": { <2> "query": { "text_expansion":{ "ml.tokens":{ "model_id":"my_elser_model", "model_text":"What blue shoes are on sale?" } } } } } ], "window_size": 50, "rank_constant": 20 } } } ---- // TEST[skip:example fragment] In the above example, we execute each of the two `standard` retrievers independently of each other. <1> First we run the `standard` retriever specifying a term query for `blue shoes sales` using the standard BM25 scoring algorithm. <2> Next we run the `standard` retriever specifying a text expansion query for `What blue shoes are on sale?` using our <<semantic-search-elser, ELSER>> scoring algorithm. <3> The `rrf` retriever allows us to combine the two top documents sets generated by completely independent scoring algorithms with equal weighting. Not only does this remove the need to figure out what the appropriate weighting is using linear combination, but RRF is also shown to give improved relevance over either query individually.
If you don't feel that it provides value though, it's definitely not a necessary change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nicer. I will change this. Thank you for the write up here. (It hadn't occurred to me to mark the rrf retriever out of order in this way.)
"ml.inference.{{.}}_expanded.predicted_value": { | ||
"model_text": "{{query_string}}", | ||
"model_id": "<elser_model_id>" | ||
"retriever": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note for search applications - it's possible that people may have defined search applications with templates that use sub_searches (especially since we provided this as an example). Do we want to provide any type of warning to users via search responses that this is deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! Let me follow up on this. Since rank and sub_searches aren't supported on serverless we can do this as a follow up.
[[request-body-rank]] | ||
`rank`:: | ||
preview:[] | ||
This param is in technical preview and may change in the future. The syntax will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the verbiage about syntax changing prior to GA for these features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is pretty standard until a feature does become GA just in case we discover any design flaws, so I'd prefer to error on the side of caution here.
if (categoryClass.equals(RetrieverBuilder.class)) { | ||
nestedDepth++; | ||
| ||
if (nestedDepth > 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to break this out as a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! This needs a bit of a re-design to work with re-rankers, so I'm going to leave it for now. I think depth will be less important than specific ordering in the end.
This enhancement adds a new abstraction to the
_search
API called "retriever." A retriever is something that returns top hits.Three retrievers are going to be added initially:
standard
- This retriever replaces standard query functionality. It contains sub-elements forquery
,search_after
,terminate_after
,sort_field
,min_score
, andcollapse
.knn
- This retriever replaces a knn search. It contains sub-elements forfield
,k
,num_candidates
,query_vector
,query_vector_builder
, andsimilarity
.rrf
- This retriever replaces rank for rrf. It contains sub-elements forretrievers
,rank_constant
, andwindow_size
.All retrievers have a
filter
element as well. Forrrf
this is propagated to its child retrievers.Also included is node feature functionality. (Note the plumbing for this was part of #105417). Each retriever has its own node feature to ensure compatibility with serverless. New retrievers can be easily added by adding a new node feature without having to worry about incompatibilities.
Retrievers are always "extracted" to a
SearchSourceBuilder
. They are simply an abstraction to better show how elements of the_search
API interact together in an intuitive way. They are never serialized and are ephemerally used only during parsing.Some examples: