Skip to content

Conversation

@ioanatia
Copy link
Contributor

@ioanatia ioanatia commented Aug 28, 2025

This only handles the linear retriever, but the same approach can be used for the RRF retriever.
I just haven't removed the restriction for RRF yet, we can do it in a follow up.
It should be as simple as removing the check we have for RRF and adding tests.

And once we do this change for both retrievers, we can adjust the docs too.

When we rewrite the linear retriever, we form two sub-retrievers:

linear standard retriever with lexical query, normalizer:<specified_normalizer> linear, normalizer:<specified_normalizer> standard retriever normalizer:<specified_normalizer> match on semantic_text field 1, standard retriever normalizer:<specified_normalizer> match on semantic_text field 2 ... 

The lexical query uses multi match at the moment.
For the semantic part, we group by (inferenceId, field_name) and issue a match query for each group.
For each group, if the semantic_text field does not exist in all queried indices, we add an additional filter on the index names.

Some examples:

Let's assume we have index A and index B and both have the same semantic text fields using the same inference IDs: semantic_field_1 and semantic_field_2.
And that we issue a query to query the lexical_field_*, semantic_field_1, semantic_field_2.

The linear retriever will be rewritten to:

linear // Lexical group standard retriever using multi-match on `lexical_field_*` normalizer:minmax multi // Semantic group linear normalizer:minmax standard { match on semantic_field_1}, weight:1 normalizer:minmax standard { match on semantic_field_2}, weight:1 normalizer:minmax 

Let's now assume that indexB has an extra semantic_field_3 and that we query:
["lexical_field_*^3", "semantic_field_1^10", "semantic_field_2^20", "semantic_field_3^30"].

The linear retriever will be rewritten to:

linear // Lexical group standard retriever `lexical_field_*` normalizer:minmax bool: should: multi_match ["lexical_field_*^3", "semantic_field_3^30"] filter on indexA multi_match ["lexical_field_*^3"] filter on indexB // Semantic group linear normalizer:minmax standard { match on semantic_field_1}, weight:1 normalizer:minmax standard { match on semantic_field_2}, weight:1 normalizer:minmax standard { match on semantic_field_3, filter on indexB}, weight:1 normalizer:minmax 

If only semantic_text field is queried, we don't need to have 2 levels of normalization, so we rewrite to:

linear standard retriever for lexical query, normalizer:minmax standard retriever, normalizer:minmax match on semantic_text fieldd 
@ioanatia ioanatia added >enhancement :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Search - Relevance The Search organization Search Relevance team v9.2.0 labels Aug 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @ioanatia, I've created a changelog YAML for you.

}
return innerRetrievers;
// there are no lexical fields that need to be queried, no need to create a retriever
if (lexicalQueryBuilders.isEmpty()) {
Copy link
Contributor Author

@ioanatia ioanatia Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could be in this situation when:

  1. No query fields are provided and we need to use the ones from index.query.default_field index setting and not all indices have the same value for this setting, resulting in different list of fields that need to be queried per index. This case is uncommon.
  2. One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.

From what I tested, the current approach where compose a boolean query and we filter by index names, seems to mitigate both cases.
If we find it too complicated, I can change this to raise an exception for both of these cases, so that we always use a single multi match query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.

Agreed this will be more common, enough so to the point that IMO we need to handle it. I think it's also worth pointing out that we can be in this case when differentNonInferenceFields == true and lexicalQueryBuilders is not empty.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I reviewed just the core logic for now, not the tests. I have some suggestions about how we may be able to simplify and handle more edge cases at the same time.

}
return innerRetrievers;
// there are no lexical fields that need to be queried, no need to create a retriever
if (lexicalQueryBuilders.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.

Agreed this will be more common, enough so to the point that IMO we need to handle it. I think it's also worth pointing out that we can be in this case when differentNonInferenceFields == true and lexicalQueryBuilders is not empty.

@ioanatia ioanatia marked this pull request as ready for review September 9, 2025 10:08
@ioanatia ioanatia requested a review from Mikep86 September 9, 2025 10:08
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

assertFalse("the lexical retriever is only asserted once", assertedLexical);
assertFalse(expectedNonInferenceFields.isEmpty());

QueryBuilder topDocsQueryBuilder = standardRetrieverBuilder.topDocsQuery();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason why I went through these hoops for the lexical query is because we are comparing boolean queries.
and AFAICS I can't guarantee the order of the should clauses which are held in a List<QueryBuilder>.

when we compare two boolean queries, we check that we have the same list of should clauses:

protected boolean doEquals(BoolQueryBuilder other) {
return Objects.equals(adjustPureNegative, other.adjustPureNegative)
&& Objects.equals(minimumShouldMatch, other.minimumShouldMatch)
&& Objects.equals(mustClauses, other.mustClauses)
&& Objects.equals(shouldClauses, other.shouldClauses)
&& Objects.equals(mustNotClauses, other.mustNotClauses)
&& Objects.equals(filterClauses, other.filterClauses);
}

so here I couldn't just use a simple assertEquals check as we had before

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, awesome work 🚀 ! I left some minor comments about tests, they should be very easy to address.

queryBuilder = new BoolQueryBuilder().must(queryBuilder).filter(new TermsQueryBuilder("_index", indices));
}
return new InnerRetriever(new StandardRetrieverBuilder(queryBuilder), weight, expectedNormalizer);
}).collect(Collectors.toSet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check that rankWindowSize is propagated to the rewritten retriever as expected?

MinMaxScoreNormalizer.INSTANCE,
DEFAULT_RANK_WINDOW_SIZE,
new float[0],
new ScoreNormalizer[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Use indexName instead of a magic string here? Repeat where "test-index" is referenced below.

new float[0],
new ScoreNormalizer[0]
);
assertMultiIndexMultiFieldsParamsRewrite(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Use anotherIndexName instead of a magic string here? Repeat where "test-another-index" is referenced below.

@ioanatia ioanatia merged commit 54152ca into elastic:main Sep 11, 2025
34 checks passed
@ioanatia ioanatia deleted the multi_index_support_simplified branch September 11, 2025 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Search - Relevance The Search organization Search Relevance team v9.2.0

3 participants