- Notifications
You must be signed in to change notification settings - Fork 25.5k
ESQL - dense vector support cosine normalization #132721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL - dense vector support cosine normalization #132721
Conversation
private final DenseVectorFieldMapper.DenseVectorFieldType fieldType; | ||
| ||
public DenseVectorBlockLoader(String fieldName, int dimensions) { | ||
public DenseVectorBlockLoader(String fieldName, int dimensions, DenseVectorFieldMapper.DenseVectorFieldType fieldType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing the DenseVectorFieldType looked better than including other specific attributes needed for normalization (fieldType.name(), IndexVersion). I opted for creating a isNormalized()
method in the DenseVectorFieldType instead.
private final FloatVectorValues floatVectorValues; | ||
private final KnnVectorValues.DocIndexIterator iterator; | ||
private final int dimensions; | ||
private abstract static class DenseVectorValuesBlockReader<T extends KnnVectorValues> extends BlockDocValuesReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an abstract class is previous work for supporting byte element type.
assert floats.length == dimensions | ||
: "unexpected dimensions for vector value; expected " + dimensions + " but got " + floats.length; | ||
| ||
assert magnitudeDocValues.advanceExact(iterator.docID()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to use DenormalizedCosineFloatVectorValues
instead, but that class depends on the ordinal and not the docId() for retrieving the magnitude.
Using the docId() to retrieve the magnitude works.
public CacheHelper getCoreCacheHelper() { | ||
return r.getCoreCacheHelper(); | ||
} | ||
denseVectorFieldType.isNormalized() && denseVectorFieldType.indexed ? r -> new FilterLeafReader(r) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used the isNormalized()
method to improve readability - that messed up some indentation
for (String indexType : DENSE_VECTOR_INDEX_TYPES) { | ||
params.add(new Object[] { indexType, true, false }); | ||
} | ||
Supplier<ElementType> elementTypeProvider = () -> ElementType.FLOAT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing for all combinations took > 1 min. I decided to rely on randomization for testing the different combinations.
…r-support-normalization' into non-issue/esql-dense-vector-support-normalization # Conflicts: # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
@nik9000 I'd appreciate your review here, as the doc values reading was tricky. I'd like to ensure I am doing the right thing! |
…r-support-normalization' into non-issue/esql-dense-vector-support-normalization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall changes look good, but I'll defer to someone more expert in ES|QL and KNN.
| ||
@Override | ||
protected void appendDoc(BlockLoader.FloatBuilder builder) throws IOException { | ||
float[] floats = vectorValues.vectorValue(iterator.index()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we refactor checking the dimension count out, since it's shared between both block readers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's a dimension()
common method for KnnVectorValues
that I wasn't aware of. Done in e98bbcb
...n/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java Show resolved Hide resolved
public class DenseVectorFieldTypeIT extends AbstractEsqlIntegTestCase { | ||
| ||
private static final Set<String> DENSE_VECTOR_INDEX_TYPES = Set.of( | ||
public static final Set<String> ALL_DENSE_VECTOR_INDEX_TYPES = Set.of( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use VectorIndexType.values()
here to create the set, rather than hard coding the specific index types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in 0cb587a
private final String indexType; | ||
| ||
@ParametersFactory | ||
public static Iterable<Object[]> parameters() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could simplify this a bit, with something like this (untested):
public static Iterable<Object[]> parameters() throws Exception { return ALL_DENSE_VECTOR_INDEX_TYPES.stream() .filter(indexType -> indexType.contains("flat") == false) .map(indexType -> new Object[] { DenseVectorFieldMapper.ElementType.FLOAT, indexType }) .collect(Collectors.toList()); }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find good ol' loops better in terms of clarity here. Also, soon enough we'll need to add the bytes as well.
…vector-support-normalization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…vector-support-normalization # Conflicts: # server/src/main/java/org/elasticsearch/index/mapper/BlockDocValuesReader.java # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/plugin/KnnFunctionIT.java
* Take into account normalization for dense vector support * Fix cherry pick * [CI] Auto commit changes from spotless * Remove debugging code * Check that we may not have magnitudes at all, or for normalized vectors * Better parameterized test * Refactor dimension check * Refactor index names * Remove comment * Fix test --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
* upstream/main: (278 commits) ESQL - dense vector support cosine normalization (elastic#132721) [ML] Add support for dimensions in google vertex ai request (elastic#132689) ESQL - Add byte element support for dense_vector data type (elastic#131863) ESQL: Fix async operator warnings not always sent when blocking (elastic#132744) Method not needed anymore (elastic#132912) [Test] Excercise shutdown more reliably in snapshot stress IT (elastic#132909) Update Gradle shadow plugin to 9.0.1 (elastic#132637) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/410_named_queries/named_queries_with_score} elastic#132906 Update docker.elastic.co/wolfi/chainguard-base-fips:latest Docker digest to fa6cb69 (elastic#132735) Remove unnecessary calls to fold() (elastic#131870) Use consistent terminology for transport version resources/references (elastic#132882) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search.vectors/40_knn_search_cosine/kNN search only regular query} elastic#132890 Finalize release notes for v9.1.2 release (elastic#132745) Finalize release notes for v9.0.5 release (elastic#132718) Move inner records out of TransportVersionUtils (elastic#132872) Add support for Lookup Join on Multiple Fields (elastic#131559) Bootstrap PR-based benchmarks (elastic#132717) Refactor MetadataIndexTemplateService to use template maps instead of project metadata (elastic#132662) [Gradle] Update nebula ospackage plugin to 12.1.0 (elastic#132640) Mute org.elasticsearch.xpack.esql.CsvTests test {csv-spec:ip.CdirMatchEqualsInsOrs} elastic#132860 ...
ESQL did not support cosine normalization. Vectors were retrieved directly from doc values without applying denormalization.
This PR fixes that, including test cases for the different vectors similarities.