Improve brute force vector search speed by using Lucene functions #96617
Conversation
Pinging @elastic/es-search (Team:Search)

Hi @benwtrent, I've created a changelog YAML for you.
Performance characteristics

I ran a custom rally track (a modification of …). These were run on my M1 Max MacBook.

Default results vs. no Vector Incubator API

This run is the baseline (main) with my change, ignoring the vector incubator improvements. My change even makes this faster!

Default results vs. my change + Vector Incubator API

Numbers speak for themselves ;)
```diff
  ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
  for (int dim = 0; dim < vector.length; dim++) {
-     vector[dim] = byteBuffer.getFloat();
+     vector[dim] = byteBuffer.getFloat((dim + vectorBR.offset) * Float.BYTES);
```
I would expect that vectorBR.offset is not needed here. In fact it would be a problem if it is anything other than 0?
@ChrisHegarty I guess I misunderstood. I thought getFloat(int) took the absolute position within the underlying bytes, not a position relative to the start (dictated by wrapping the bytes with an offset).
My equation is wrong for sure; I think it should be (dim * Float.BYTES) + vectorBR.offset?
This test passes with my fixed equation:
```java
public void testDecoding() {
    float[] inputFloats = new float[] { 1f, 2f, 3f, 4f };
    ByteBuffer byteBuffer = ByteBuffer.allocate(16);
    for (float f : inputFloats) {
        byteBuffer.putFloat(f);
    }
    BytesRef floatBytes = new BytesRef(byteBuffer.array());
    floatBytes.length = 12;
    floatBytes.offset = 4;
    float[] outputFloats = new float[3];
    VectorEncoderDecoder.decodeDenseVector(floatBytes, outputFloats);
    assertArrayEquals(outputFloats, new float[] { 2f, 3f, 4f }, 0f);
}
```
@ChrisHegarty what do you think? I fixed my math. Is this still an issue?
@benwtrent Your changes are good. The wrap with an offset and length is not relevant any more; it could be a plain wrap(byte[]). Since we are now using absolute addressing of the buffer, the complete byte[] is available to us even with wrap(byte[], int, int), which just sets up an initial position and limit.
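To make the wrap semantics concrete, here is a minimal standalone sketch (class name and values are mine, not from the PR) showing that wrap(byte[], int, int) only sets an initial position and limit, while absolute getFloat(int) addresses the full backing array:

```java
import java.nio.ByteBuffer;

public class WrapVsAbsolute {
    public static void main(String[] args) {
        byte[] backing = ByteBuffer.allocate(16)
            .putFloat(1f).putFloat(2f).putFloat(3f).putFloat(4f)
            .array();

        // wrap(byte[], int, int) only sets the initial position (4) and limit (16);
        // the buffer still spans the complete backing array.
        ByteBuffer wrapped = ByteBuffer.wrap(backing, 4, 12);

        // Relative read: starts at position 4, so it returns 2.0.
        System.out.println(wrapped.getFloat());

        // Absolute read: index 0 addresses the start of the backing array,
        // ignoring the current position entirely, so it returns 1.0.
        System.out.println(wrapped.getFloat(0));
    }
}
```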
Microbenchmark for Bytes

Lucene also added SIMD improvements for bytes. The function requires …

The following benchmark was run on an M1 Max (128-bit-wide lanes), over 100 random byte vectors. One set is "short" (96 dims), the other is "medium" (768 dims). If the vector API is NOT available, and we still copy the … However, if the vector API is present, a speedup is achieved. Here are some numbers (lower is better). If we can bypass this internal …

@jpountz && @jdconrad what do you think? Should we attempt to bypass the copy if …?
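As a sketch of the copy being discussed (class and method names are mine, not the PR's): Lucene's VectorUtil.dotProduct takes plain byte[] arguments, so a BytesRef slice with a non-zero offset has to be copied out before the accelerated function can be called.

```java
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.VectorUtil;

class ByteVectorScorer {
    // Copy the BytesRef slice into a standalone byte[] so that the
    // (potentially SIMD-accelerated) VectorUtil.dotProduct can be used.
    static int dotProduct(BytesRef storedVector, byte[] queryVector) {
        byte[] copy = ArrayUtil.copyOfSubArray(
            storedVector.bytes, storedVector.offset, storedVector.offset + storedVector.length);
        return VectorUtil.dotProduct(copy, queryVector);
    }
}
```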
LGTM! One minor comment. The benchmarks look awesome :)
```diff
      return defaultValue;
  }
- return new BinaryDenseVector(value, dims, indexVersion);
+ VectorEncoderDecoder.decodeDenseVector(value, vectorValue);
```
It's possible `get` is used with the same value twice. I wonder if it makes sense to cache it?
I can cache it for sure.
added code to cache
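For reference, a minimal sketch of the caching idea (the class and field names here are hypothetical, not the PR's actual code), assuming the decodeDenseVector helper from this PR; repeated get calls with the same value skip the re-decode:

```java
import org.apache.lucene.util.BytesRef;

class CachingVectorReader {
    private final float[] vectorValue;  // reused decode target
    private BytesRef lastDecoded;       // the value most recently decoded, or null

    CachingVectorReader(int dims) {
        this.vectorValue = new float[dims];
    }

    float[] get(BytesRef value) {
        // Assumes the caller passes the same BytesRef instance for repeated
        // reads; a real implementation might instead track the current doc id.
        if (value != lastDecoded) {
            VectorEncoderDecoder.decodeDenseVector(value, vectorValue);
            lastDecoded = value;
        }
        return vectorValue;
    }
}
```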
…wtrent/elasticsearch into feature/knn-brute-force-improvements
@jdconrad I went ahead and added binary byte support for Lucene-accelerated calculations. The byte array copy adds less than a millisecond of overhead when reading 25,000 values from the same field, but it cuts runtime almost in half when the Vector API is used. This seems justifiable to me. Let me know if you significantly disagree.
run elasticsearch-ci/bwc

@elasticmachine update branch
LGTM with the new changes. I agree with you that the performance trade off for decoding is worth it.
Nice to find more places where these new vectorized methods can be used!
```diff
  ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
  for (int dim = 0; dim < vector.length; dim++) {
-     vector[dim] = byteBuffer.getFloat();
+     vector[dim] = byteBuffer.getFloat((dim * Float.BYTES) + vectorBR.offset);
```
The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?
Also unrelated to your PR, I would expect ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector) to run faster than reading a float at a time.
@jpountz I just asked @mayya-sharipova about this same thing. Using getFloat(int) ignores the current position of the buffer. It may make sense to wrap without the offset and length parameters.
> Also unrelated to your PR, I would expect ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector) to run faster than reading a float at a time.
@jpountz it doesn't. This is 30% faster than making it a FloatBuffer. For some reason, getting by absolute index is faster. @ChrisHegarty might have an intuition as to why.
> The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?
I honestly don't know. I added a test to verify (see: https://github.com/elastic/elasticsearch/pull/96617/files#diff-2d953a23603f9d7ef2f18d9f7bff3960307afc1a763e28fa2c7eca0ee3a65599). That test creates a bytebuffer with a custom length & offset and we get the expected results with my change.
Thanks @jdconrad, you are right indeed! Thanks @benwtrent for adding a test.
@benwtrent Thanks for checking performance. I remember that MikeS found that wrapping as a float buffer was faster when adding DataInput#readFloats, but that was with a direct byte buffer with little-endian byte order; these differences might matter.
Yes, it's the endianness of the byte buffers we're using here: big-endian in this case. It's most efficient to use the native endianness; otherwise byte swapping will occur. Since these are already in the index, is it possible to switch them to little-endian?
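For illustration, switching the byte order is a one-line change at decode time. A sketch (mine, not from the PR) of a little-endian bulk read, assuming the stored bytes were also written little-endian:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LittleEndianDecode {
    // Decode with an explicit byte order. On little-endian hardware (such as
    // the M1 used in these benchmarks) this avoids per-float byte swapping,
    // but it only yields correct values if the bytes were encoded the same way.
    static void decode(byte[] bytes, int offset, float[] vector) {
        ByteBuffer.wrap(bytes, offset, vector.length * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer()
            .get(vector);
    }
}
```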
@jpountz here are some benchmark results:
So, these are average latencies for decoding a float[] from ByteBuffers. Here is the benchmark code (if you can spot any wonkiness): https://gist.github.com/benwtrent/29cc0338cd851c345cace5c486095507
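For readers who don't want to open the gist, the four variants roughly boil down to the following sketch (simplified by me to match the benchmark names below; the exact JMH harness lives in the gist):

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;

class DecodeVariants {
    // Relative reads: each getFloat() advances the buffer's position.
    static void byteBufferIteration(ByteBuffer buf, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
    }

    // Absolute reads: index directly into the buffer, position untouched.
    static void byteBufferDirectRead(ByteBuffer buf, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat(i * Float.BYTES);
        }
    }

    // Same two strategies through a FloatBuffer view.
    static void floatBufferIteration(FloatBuffer buf, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.get();
        }
    }

    static void floatBufferDirectRead(FloatBuffer buf, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.get(i);
        }
    }
}
```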
Direct read (via index into the buffer) is always faster. FloatBuffer vs. ByteBuffer is almost identical for both decoding kinds (iteration vs. direct read).
```
Benchmark                                                          (floatArraySizes)  Mode  Cnt         Score         Error  Units
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                  96  avgt    5   3348858.476 ±   13314.154  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                 768  avgt    5  19960837.116 ±  137071.324  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                2048  avgt    5  49142580.882 ±  216614.762  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                   96  avgt    5   4992161.167 ±   47263.300  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                  768  avgt    5  33762525.212 ±  110609.781  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                 2048  avgt    5  89409177.001 ± 1122561.441  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                 96  avgt    5   3529213.563 ±   24635.601  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                768  avgt    5  18152123.115 ±   49712.077  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead               2048  avgt    5  46610729.883 ± 2462771.340  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                  96  avgt    5   5085876.957 ±   75157.360  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                 768  avgt    5  32359519.955 ±  399297.947  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                2048  avgt    5  86083215.393 ± 1120197.606  ns/op
```

@elasticmachine update branch
run elasticsearch-ci/part-2
@benwtrent Thanks Ben, great speedups!
Lucene has integrated hardware-accelerated vector calculations. Meaning, calculations like dot_product can be much faster when using the Lucene-defined functions.

When a dense_vector is indexed, we already support this. However, when index: false we store float vectors as binary fields in Lucene and decode them ourselves. Meaning, we don't use the underlying Lucene structures or functions.

To take advantage of the large performance boost, this PR refactors the binary vector values in the following way:

related to: #96370
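To illustrate the refactored read path described above, here is a rough sketch (simplified names, not the PR's actual classes such as BinaryDenseVector): decode the stored binary field into a reusable float[], then hand the similarity math to Lucene's accelerated VectorUtil.

```java
import java.nio.ByteBuffer;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.VectorUtil;

class BruteForceFloatScorer {
    private final float[] vector;  // reused across documents to avoid per-doc allocation

    BruteForceFloatScorer(int dims) {
        this.vector = new float[dims];
    }

    float dotProduct(BytesRef storedVector, float[] queryVector) {
        // Decode the binary doc value into floats using absolute reads,
        // mirroring the fixed equation from the diff above.
        ByteBuffer byteBuffer = ByteBuffer.wrap(storedVector.bytes);
        for (int dim = 0; dim < vector.length; dim++) {
            vector[dim] = byteBuffer.getFloat((dim * Float.BYTES) + storedVector.offset);
        }
        // Then let Lucene's (potentially SIMD-accelerated) function do the math.
        return VectorUtil.dotProduct(vector, queryVector);
    }
}
```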