Skip to content

Conversation

jdconrad
Copy link
Contributor

This enhancement adds a new abstraction to the _search API called "retriever." A retriever is something that returns top hits.

Three retrievers are going to be added initially:

  • standard - This retriever replaces standard query functionality. It contains sub-elements for query, search_after, terminate_after, sort_field, min_score, and collapse.
  • knn - This retriever replaces a knn search. It contains sub-elements for field, k, num_candidates, query_vector, query_vector_builder, and similarity.
  • rrf - This retriever replaces rank for rrf. It contains sub-elements for retrievers, rank_constant, and window_size.

All retrievers have a filter element as well. For rrf this is propagated to its child retrievers.

Also included is node feature functionality. (Note the plumbing for this was part of #105417). Each retriever has its own node feature to ensure compatibility with serverless. New retrievers can be easily added by adding a new node feature without having to worry about incompatibilities.

Retrievers are always "extracted" to a SearchSourceBuilder. They are simply an abstraction to better show how elements of the _search API interact together in an intuitive way. They are never serialized and are ephemerally used only during parsing.

Some examples:

GET index/_search { "retriever": { "standard": { "query": { "match": { "message": { "query": "This is a test." } } }, "sort": [ "popularity" ] } }, "size": 5, "fields": ["message"] } GET index/_search { "retriever": { "knn": { "field": "vector", "k": 3, "num_candidates": 10, "query_vector": [1, 2, 3] } }, "fields": ["message"] } GET index/_search { "retriever": { "rrf": { "window_size": 100, "retrievers": [ { "knn": { "field": "vector", "k": 3, "num_candidates": 10, "query_vector": [1, 2, 3] } }, { "standard": { "query": { "match": { "message": { "query": "This is a test." } } } } } ] } }, "size": 5, "fields": ["message"] } 
jdconrad and others added 30 commits October 13, 2023 10:07
This change adds an RRFRetrieverBuilder using the same strategy as ClassicRetrieverBuilder where we extract the parsed request body into a SearchSourceBuilder.
This adds an initial linear combination retriever using a similar strategy to RRF. This prototype will allow us to cover all existing use cases for standard search, and should make it easier to implement retrievers at more than a superficial level.
This change does the following: * renames the classic retriever to standard retriever * adds support for (pre) filter to the standard retriever * adds yaml tests for the standard retriever that covers all of the parse fields
This change removes all of the serialization code from retrievers since these will only be used as x- content parsers extracting values into the SearchSourceBuilder. This drastically reduces the test coverage area required. This change also removes post filter from the standard retriever for now as a post filter may need to apply to all returned documents in a compound query. Finally, this adds some additional checks for values that should not be shared between global values and retriever values.
This change adds yaml tests for the rrf retriever. This also fixes a bug that will allow knn retrievers to add their knn query as an addition to the current knn queries as opposed to overwriting it. This only applies to compound retrievers. This also adds a gate rrf platinum licensing.
@jdconrad
Copy link
Contributor Author

jdconrad commented Mar 7, 2024

@javanna @mayya-sharipova Thank you for the first round of reviews. I have responded to all the PR comments and added documentation changes for this. I would strongly prefer to minimize additional changes unless something is either incorrect or there is a significant design flaw given other work depending on this.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @jdconrad

[[rrf-using-sub-searches]]
==== Reciprocal rank fusion using sub searches

RRF using sub searches is no longer supported. Use the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we never released sub_searches support to the public. But it seems like we did, yet it was experimental? I wonder if we should even mention it as a migration path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! I actually had removed it, but there are cross references from enterprise-search docs that I wasn't aware of until I got errors here during CI. I figured this allows enterprise-search time to move their documentation to use retrievers instead of sub_searches and we can remove them once they have the opportunity to change their references.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, can we clarify whether we need to mention the previous way at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javanna I chatted with @kderusso about removing these elements from the documentation. She's going to review the docs in this PR, and we will follow up with getting the cross references changed once this PR is merged.

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work Jack! I left a few comments, none of which are blocking. Congrats!

Comment on lines 158 to 159
This can also be achieved by using an <<rrf-retriever, `rrf` retriever>>
with multiple <<standard-retriever, `standard` retrievers>>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we continued to link to the main RRF page here as well.

Suggested change
This can also be achieved by using an <<rrf-retriever, `rrf` retriever>>
with multiple <<standard-retriever, `standard` retrievers>>.
This can also be achieved using <<rrf>>, through an <<rrf-retriever, `rrf` retriever>>
with multiple <<standard-retriever, `standard` retrievers>>.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

A <<standard-retriever, retriever>> that replaces the functionality of a traditional <<query-dsl, query>>.

`knn`::
A <<knn-retriever, retriever>> that replaces the functionality of a <<search-api-knn, knn search>>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see this being a little confusing to users, since knn is also a query (e.g. could knn-query be nested under a standard retriever?) I realize this is a weird case, but it might be good to address this somewhere, maybe in a blog post or something? WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a knn query under a standard retriever. I agree this is probably better left to documentation in a different place, though. I worry that adding some notes about how you can use knn-query as part of a standard retriever will just confuse the issue more currently. I'll give this some thought as a possible follow up.

`k`::
(Required, integer)
+
Number of nearest neighbors to return as top hits. This value must be fewer than
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fewer or <=?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I added an equal to as well.

https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf[Reciprocal rank fusion (RRF)]
is a method for combining multiple result sets with different relevance
indicators into a single result set. RRF requires no tuning, and the different
is a method for combining multiple top documents sets with different relevance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we actively trying to change our vocabulary from result to top documents? It may be unintuitive or confusing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was doing this since the notion of a retriever is an entity that returns top documents. So I wanted to try to be consistent with that. I think I will change this one back to results, though, since it's not in the context of a retriever here.

}
}
----
// TEST[skip:example fragment]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning to add tests once RRF moves out of tech preview?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, yes. Setup here with documentation is tricky with the other example being tested directly.


<<request-body-sub-searches, Sub searches>> provides a way to
combine and rank multiple searches using RRF.
The `rrf` retriever provides a way to combine and rank multiple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice example 👍

independently of each other. First we run the term query for
`blue shoes sales` using the standard BM25 scoring algorithm. Then
we run the text expansion query for `What blue shoes are on sale?`
In the above example, we execute each of the two `standard` retrievers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example that could be annotated with numbers in the example fragment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. I didn't find a way where I felt like the readability of the example was improved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's along the lines of what I was thinking:

[source,console] ---- GET example-index/_search { "retriever": { "rrf": { <3> "retrievers": [ { "standard": { <1> "query": { "term": { "text": "blue shoes sale" } } } }, { "standard": { <2> "query": { "text_expansion":{ "ml.tokens":{ "model_id":"my_elser_model", "model_text":"What blue shoes are on sale?" } } } } } ], "window_size": 50, "rank_constant": 20 } } } ---- // TEST[skip:example fragment] In the above example, we execute each of the two `standard` retrievers independently of each other. <1> First we run the `standard` retriever specifying a term query for `blue shoes sales` using the standard BM25 scoring algorithm. <2> Next we run the `standard` retriever specifying a text expansion query for `What blue shoes are on sale?` using our <<semantic-search-elser, ELSER>> scoring algorithm. <3> The `rrf` retriever allows us to combine the two top documents sets generated by completely independent scoring algorithms with equal weighting. Not only does this remove the need to figure out what the appropriate weighting is using linear combination, but RRF is also shown to give improved relevance over either query individually. 

If you don't feel that it provides value though, it's definitely not a necessary change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nicer. I will change this. Thank you for the write up here. (It hadn't occurred to me to mark the rrf retriever out of order in this way.)

"ml.inference.{{.}}_expanded.predicted_value": {
"model_text": "{{query_string}}",
"model_id": "<elser_model_id>"
"retriever": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note for search applications - it's possible that people may have defined search applications with templates that use sub_searches (especially since we provided this as an example). Do we want to provide any type of warning to users via search responses that this is deprecated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Let me follow up on this. Since rank and sub_searches aren't supported on serverless we can do this as a follow up.

[[request-body-rank]]
`rank`::
preview:[]
This param is in technical preview and may change in the future. The syntax will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the verbiage about syntax changing prior to GA for these features?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty standard until a feature does become GA just in case we discover any design flaws, so I'd prefer to error on the side of caution here.

if (categoryClass.equals(RetrieverBuilder.class)) {
nestedDepth++;

if (nestedDepth > 2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to break this out as a constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! This needs a bit of a re-design to work with re-rankers, so I'm going to leave it for now. I think depth will be less important than specific ordering in the end.

@jdconrad jdconrad merged commit 68b0aca into main Mar 12, 2024
@jdconrad jdconrad deleted the retrievers branch March 12, 2024 17:11
@benwtrent
Copy link
Member

@pmpailis @jdconrad we should highlight the new format as its much nicer. I am not sure @giladgal as it is still technical preview.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement release highlight :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Search Meta label for search team v8.14.0

7 participants