Skip to content

Conversation

carlosdelest
Copy link
Member

ESQL did not support cosine normalization. Vectors were retrieved directly from doc values without applying denormalization.

This PR fixes that, including test cases for the different vectors similarities.

@carlosdelest carlosdelest added >non-issue :Analytics/ES|QL AKA ESQL Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0 :Search Relevance/ES|QL Search functionality in ES|QL labels Aug 12, 2025
private final DenseVectorFieldMapper.DenseVectorFieldType fieldType;

public DenseVectorBlockLoader(String fieldName, int dimensions) {
public DenseVectorBlockLoader(String fieldName, int dimensions, DenseVectorFieldMapper.DenseVectorFieldType fieldType) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing the DenseVectorFieldType looked better than including other specific attributes needed for normalization (fieldType.name(), IndexVersion). I opted for creating a isNormalized() method in the DenseVectorFieldType instead.

private final FloatVectorValues floatVectorValues;
private final KnnVectorValues.DocIndexIterator iterator;
private final int dimensions;
private abstract static class DenseVectorValuesBlockReader<T extends KnnVectorValues> extends BlockDocValuesReader {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an abstract class is previous work for supporting byte element type.

assert floats.length == dimensions
: "unexpected dimensions for vector value; expected " + dimensions + " but got " + floats.length;

assert magnitudeDocValues.advanceExact(iterator.docID());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use DenormalizedCosineFloatVectorValues instead, but that class depends on the ordinal and not the docId() for retrieving the magnitude.

Using the docId() to retrieve the magnitude works.

public CacheHelper getCoreCacheHelper() {
return r.getCoreCacheHelper();
}
denseVectorFieldType.isNormalized() && denseVectorFieldType.indexed ? r -> new FilterLeafReader(r) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used the isNormalized() method to improve readability - that messed up some indentation

for (String indexType : DENSE_VECTOR_INDEX_TYPES) {
params.add(new Object[] { indexType, true, false });
}
Supplier<ElementType> elementTypeProvider = () -> ElementType.FLOAT;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing for all combinations took > 1 min. I decided to rely on randomization for testing the different combinations.

elasticsearchmachine and others added 5 commits August 12, 2025 12:35
…r-support-normalization' into non-issue/esql-dense-vector-support-normalization # Conflicts: #	x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java
@carlosdelest carlosdelest marked this pull request as ready for review August 12, 2025 17:08
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@carlosdelest
Copy link
Member Author

@nik9000 I'd appreciate your review here, as the doc values reading was tricky. I'd like to ensure I am doing the right thing!

…r-support-normalization' into non-issue/esql-dense-vector-support-normalization
Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall changes look good, but I'll defer to someone more expert in ES|QL and KNN.


@Override
protected void appendDoc(BlockLoader.FloatBuilder builder) throws IOException {
float[] floats = vectorValues.vectorValue(iterator.index());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we refactor checking the dimension count out, since it's shared between both block readers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's a dimension() common method for KnnVectorValues that I wasn't aware of. Done in e98bbcb

public class DenseVectorFieldTypeIT extends AbstractEsqlIntegTestCase {

private static final Set<String> DENSE_VECTOR_INDEX_TYPES = Set.of(
public static final Set<String> ALL_DENSE_VECTOR_INDEX_TYPES = Set.of(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use VectorIndexType.values() here to create the set, rather than hard coding the specific index types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in 0cb587a

private final String indexType;

@ParametersFactory
public static Iterable<Object[]> parameters() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could simplify this a bit, with something like this (untested):

public static Iterable<Object[]> parameters() throws Exception { return ALL_DENSE_VECTOR_INDEX_TYPES.stream() .filter(indexType -> indexType.contains("flat") == false) .map(indexType -> new Object[] { DenseVectorFieldMapper.ElementType.FLOAT, indexType }) .collect(Collectors.toList()); } 
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find good ol' loops better in terms of clarity here. Also, soon enough we'll need to add the bytes as well.

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

carlosdelest and others added 3 commits August 14, 2025 10:17
…vector-support-normalization # Conflicts: #	server/src/main/java/org/elasticsearch/index/mapper/BlockDocValuesReader.java #	x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java #	x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/plugin/KnnFunctionIT.java
@carlosdelest carlosdelest merged commit 7a34391 into elastic:main Aug 14, 2025
33 checks passed
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Aug 14, 2025
* Take into account normalization for dense vector support * Fix cherry pick * [CI] Auto commit changes from spotless * Remove debugging code * Check that we may not have magnitudes at all, or for normalized vectors * Better parameterized test * Refactor dimension check * Refactor index names * Remove comment * Fix test --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 15, 2025
* upstream/main: (278 commits) ESQL - dense vector support cosine normalization (elastic#132721) [ML] Add support for dimensions in google vertex ai request (elastic#132689) ESQL - Add byte element support for dense_vector data type (elastic#131863) ESQL: Fix async operator warnings not always sent when blocking (elastic#132744) Method not needed anymore (elastic#132912) [Test] Excercise shutdown more reliably in snapshot stress IT (elastic#132909) Update Gradle shadow plugin to 9.0.1 (elastic#132637) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/410_named_queries/named_queries_with_score} elastic#132906 Update docker.elastic.co/wolfi/chainguard-base-fips:latest Docker digest to fa6cb69 (elastic#132735) Remove unnecessary calls to fold() (elastic#131870) Use consistent terminology for transport version resources/references (elastic#132882) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search.vectors/40_knn_search_cosine/kNN search only regular query} elastic#132890 Finalize release notes for v9.1.2 release (elastic#132745) Finalize release notes for v9.0.5 release (elastic#132718) Move inner records out of TransportVersionUtils (elastic#132872) Add support for Lookup Join on Multiple Fields (elastic#131559) Bootstrap PR-based benchmarks (elastic#132717) Refactor MetadataIndexTemplateService to use template maps instead of project metadata (elastic#132662) [Gradle] Update nebula ospackage plugin to 12.1.0 (elastic#132640) Mute org.elasticsearch.xpack.esql.CsvTests test {csv-spec:ip.CdirMatchEqualsInsOrs} elastic#132860 ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue :Search Relevance/ES|QL Search functionality in ES|QL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0

4 participants