- Notifications
You must be signed in to change notification settings - Fork 25.7k
Support querying multiple indices with the simplified linear retriever #133720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support querying multiple indices with the simplified linear retriever #133720
Conversation
| Hi @ioanatia, I've created a changelog YAML for you. |
| } | ||
| return innerRetrievers; | ||
| // there are no lexical fields that need to be queried, no need to create a retriever | ||
| if (lexicalQueryBuilders.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could be in this situation when:
- No query fields are provided and we need to use the ones from
index.query.default_fieldindex setting and not all indices have the same value for this setting, resulting in different list of fields that need to be queried per index. This case is uncommon. - One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.
From what I tested, the current approach where compose a boolean query and we filter by index names, seems to mitigate both cases.
If we find it too complicated, I can change this to raise an exception for both of these cases, so that we always use a single multi match query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.
Agreed this will be more common, enough so to the point that IMO we need to handle it. I think it's also worth pointing out that we can be in this case when differentNonInferenceFields == true and lexicalQueryBuilders is not empty.
Mikep86 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I reviewed just the core logic for now, not the tests. I have some suggestions about how we may be able to simplify and handle more edge cases at the same time.
...ugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/MultiFieldsInnerRetrieverUtils.java Outdated Show resolved Hide resolved
...ugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/MultiFieldsInnerRetrieverUtils.java Show resolved Hide resolved
| } | ||
| return innerRetrievers; | ||
| // there are no lexical fields that need to be queried, no need to create a retriever | ||
| if (lexicalQueryBuilders.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.
Agreed this will be more common, enough so to the point that IMO we need to handle it. I think it's also worth pointing out that we can be in this case when differentNonInferenceFields == true and lexicalQueryBuilders is not empty.
...ugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/MultiFieldsInnerRetrieverUtils.java Show resolved Hide resolved
| Pinging @elastic/search-relevance (Team:Search - Relevance) |
| assertFalse("the lexical retriever is only asserted once", assertedLexical); | ||
| assertFalse(expectedNonInferenceFields.isEmpty()); | ||
| | ||
| QueryBuilder topDocsQueryBuilder = standardRetrieverBuilder.topDocsQuery(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason why I went through these hoops for the lexical query is because we are comparing boolean queries.
and AFAICS I can't guarantee the order of the should clauses which are held in a List<QueryBuilder>.
when we compare two boolean queries, we check that we have the same list of should clauses:
elasticsearch/server/src/main/java/org/elasticsearch/index/query/BoolQueryBuilder.java
Lines 333 to 340 in 5a0636a
| protected boolean doEquals(BoolQueryBuilder other) { | |
| return Objects.equals(adjustPureNegative, other.adjustPureNegative) | |
| && Objects.equals(minimumShouldMatch, other.minimumShouldMatch) | |
| && Objects.equals(mustClauses, other.mustClauses) | |
| && Objects.equals(shouldClauses, other.shouldClauses) | |
| && Objects.equals(mustNotClauses, other.mustNotClauses) | |
| && Objects.equals(filterClauses, other.filterClauses); | |
| } |
so here I couldn't just use a simple assertEquals check as we had before
Mikep86 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, awesome work 🚀 ! I left some minor comments about tests, they should be very easy to address.
| queryBuilder = new BoolQueryBuilder().must(queryBuilder).filter(new TermsQueryBuilder("_index", indices)); | ||
| } | ||
| return new InnerRetriever(new StandardRetrieverBuilder(queryBuilder), weight, expectedNormalizer); | ||
| }).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check that rankWindowSize is propagated to the rewritten retriever as expected?
| MinMaxScoreNormalizer.INSTANCE, | ||
| DEFAULT_RANK_WINDOW_SIZE, | ||
| new float[0], | ||
| new ScoreNormalizer[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Use indexName instead of a magic string here? Repeat where "test-index" is referenced below.
| new float[0], | ||
| new ScoreNormalizer[0] | ||
| ); | ||
| assertMultiIndexMultiFieldsParamsRewrite( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Use anotherIndexName instead of a magic string here? Repeat where "test-another-index" is referenced below.
This only handles the linear retriever, but the same approach can be used for the RRF retriever.
I just haven't removed the restriction for RRF yet, we can do it in a follow up.
It should be as simple as removing the check we have for RRF and adding tests.
And once we do this change for both retrievers, we can adjust the docs too.
When we rewrite the linear retriever, we form two sub-retrievers:
The lexical query uses multi match at the moment.
For the semantic part, we group by (inferenceId, field_name) and issue a match query for each group.
For each group, if the semantic_text field does not exist in all queried indices, we add an additional filter on the index names.
Some examples:
Let's assume we have index A and index B and both have the same semantic text fields using the same inference IDs:
semantic_field_1andsemantic_field_2.And that we issue a query to query the
lexical_field_*,semantic_field_1,semantic_field_2.The linear retriever will be rewritten to:
Let's now assume that indexB has an extra
semantic_field_3and that we query:["lexical_field_*^3", "semantic_field_1^10", "semantic_field_2^20", "semantic_field_3^30"].The linear retriever will be rewritten to:
If only
semantic_textfield is queried, we don't need to have 2 levels of normalization, so we rewrite to: