
Conversation

@benwtrent (Member)

Lucene has integrated hardware-accelerated vector calculations, meaning that calculations like dot_product can be much faster when the Lucene-defined functions are used.

When a dense_vector is indexed, we already take advantage of this. However, when index: false, we store float vectors as binary fields in Lucene and decode them ourselves, meaning we don't use the underlying Lucene structures or functions.

To take advantage of the large performance boost, this PR refactors the binary vector values in the following way:

  • Eagerly decode the binary blobs when iterated
  • Call the Lucene defined VectorUtil functions when possible

related to: #96370
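To make the idea concrete, here is a minimal sketch of the approach (my own illustration, not the actual PR code): decode the stored big-endian float blob into a plain float[] once, then hand it to Lucene's VectorUtil, which can use the Panama Vector API when enabled. The class and method names here are illustrative.

```java
import java.nio.ByteBuffer;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.VectorUtil;

// Illustrative sketch: decode the binary blob eagerly, then delegate the math to Lucene.
final class DecodedVectorScorer {
    private final float[] queryVector;
    private final float[] docVector;

    DecodedVectorScorer(float[] queryVector, int dims) {
        this.queryVector = queryVector;
        this.docVector = new float[dims];
    }

    // Decode the stored blob (big-endian floats) into a reusable float[] buffer.
    void decode(BytesRef blob) {
        ByteBuffer buffer = ByteBuffer.wrap(blob.bytes, blob.offset, blob.length);
        for (int dim = 0; dim < docVector.length; dim++) {
            // absolute read: index from the start of the backing array
            docVector[dim] = buffer.getFloat(blob.offset + dim * Float.BYTES);
        }
    }

    // Hardware-accelerated dot product via Lucene's VectorUtil.
    float dotProduct() {
        return VectorUtil.dotProduct(queryVector, docVector);
    }
}
```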

@benwtrent benwtrent added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.9.0 labels Jun 6, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 6, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine (Collaborator)

Hi @benwtrent, I've created a changelog YAML for you.

@benwtrent (Member, Author)

Performance characteristics

I ran a custom rally track (a modification of dense_vector) to test this change, ensuring that the values were not indexed into the HNSW structure and that only script score was used. Here are some results.

These were run on my M1 Max MacBook, preferredBitSize=128 (WHICH IS LAME!!! GIMME AT LEAST 256 APPLE!!!!!)

Default results vs. no Vector Incubator API

This run compares the baseline (main) against my change with the Vector Incubator improvements disabled. Even without them, my change makes this faster!

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---:|---:|---:|---|---:|
| Cumulative indexing time of primary shards | | 1.49652 | 1.58852 | 0.092 | min | +6.15% |
| Min cumulative indexing time across primary shard | | 0.747817 | 0.791517 | 0.0437 | min | +5.84% |
| Median cumulative indexing time across primary shard | | 0.748258 | 0.794258 | 0.046 | min | +6.15% |
| Max cumulative indexing time across primary shard | | 0.7487 | 0.797 | 0.0483 | min | +6.45% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 0.8653 | 0.809883 | -0.05542 | min | -6.40% |
| Cumulative merge count of primary shards | | 4 | 4 | 0 | | 0.00% |
| Min cumulative merge time across primary shard | | 0.431583 | 0.3972 | -0.03438 | min | -7.97% |
| Median cumulative merge time across primary shard | | 0.43265 | 0.404942 | -0.02771 | min | -6.40% |
| Max cumulative merge time across primary shard | | 0.433717 | 0.412683 | -0.02103 | min | -4.85% |
| Cumulative merge throttle time of primary shards | | 0.506683 | 0.496267 | -0.01042 | min | -2.06% |
| Min cumulative merge throttle time across primary shard | | 0.25275 | 0.24225 | -0.0105 | min | -4.15% |
| Median cumulative merge throttle time across primary shard | | 0.253342 | 0.248133 | -0.00521 | min | -2.06% |
| Max cumulative merge throttle time across primary shard | | 0.253933 | 0.254017 | 8e-05 | min | +0.03% |
| Cumulative refresh time of primary shards | | 0.109317 | 0.113283 | 0.00397 | min | +3.63% |
| Cumulative refresh count of primary shards | | 46 | 46 | 0 | | 0.00% |
| Min cumulative refresh time across primary shard | | 0.0539833 | 0.05465 | 0.00067 | min | +1.23% |
| Median cumulative refresh time across primary shard | | 0.0546583 | 0.0566417 | 0.00198 | min | +3.63% |
| Max cumulative refresh time across primary shard | | 0.0553333 | 0.0586333 | 0.0033 | min | +5.96% |
| Cumulative flush time of primary shards | | 0.0907833 | 0.1014 | 0.01062 | min | +11.69% |
| Cumulative flush count of primary shards | | 2 | 2 | 0 | | 0.00% |
| Min cumulative flush time across primary shard | | 0.045 | 0.0496333 | 0.00463 | min | +10.30% |
| Median cumulative flush time across primary shard | | 0.0453917 | 0.0507 | 0.00531 | min | +11.69% |
| Max cumulative flush time across primary shard | | 0.0457833 | 0.0517667 | 0.00598 | min | +13.07% |
| Total Young Gen GC time | | 1.09 | 0.783 | -0.307 | s | -28.17% |
| Total Young Gen GC count | | 91 | 41 | -50 | | -54.95% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Store size | | 1.99315 | 2.04484 | 0.0517 | GB | +2.59% |
| Translog size | | 1.02445e-07 | 1.02445e-07 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 2 | 2 | 0 | | 0.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index-append | 23392.3 | 22829.6 | -562.622 | docs/s | -2.41% |
| Mean Throughput | index-append | 23866.3 | 23267.7 | -598.594 | docs/s | -2.51% |
| Median Throughput | index-append | 23824.8 | 23208.6 | -616.263 | docs/s | -2.59% |
| Max Throughput | index-append | 24283.7 | 23695.1 | -588.604 | docs/s | -2.42% |
| 50th percentile latency | index-append | 170.492 | 177.02 | 6.52729 | ms | +3.83% |
| 90th percentile latency | index-append | 196.944 | 204.761 | 7.81731 | ms | +3.97% |
| 99th percentile latency | index-append | 216.506 | 233.9 | 17.3939 | ms | +8.03% |
| 100th percentile latency | index-append | 537.944 | 366.611 | -171.333 | ms | -31.85% |
| 50th percentile service time | index-append | 170.492 | 177.02 | 6.52729 | ms | +3.83% |
| 90th percentile service time | index-append | 196.944 | 204.761 | 7.81731 | ms | +3.97% |
| 99th percentile service time | index-append | 216.506 | 233.9 | 17.3939 | ms | +8.03% |
| 100th percentile service time | index-append | 537.944 | 366.611 | -171.333 | ms | -31.85% |
| error rate | index-append | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| Mean Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| Median Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| Max Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| 100th percentile latency | refresh-after-index | 1815.22 | 1977.07 | 161.856 | ms | +8.92% |
| 100th percentile service time | refresh-after-index | 1815.22 | 1977.07 | 161.856 | ms | +8.92% |
| error rate | refresh-after-index | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| Mean Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| Median Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| Max Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| 100th percentile latency | refresh-after-update | 9.11138 | 9.25912 | 0.14775 | ms | +1.62% |
| 100th percentile service time | refresh-after-update | 9.11138 | 9.25912 | 0.14775 | ms | +1.62% |
| error rate | refresh-after-update | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| Mean Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| Median Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| Max Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| 100th percentile latency | force-merge | 16802.2 | 15290.3 | -1511.91 | ms | -9.00% |
| 100th percentile service time | force-merge | 16802.2 | 15290.3 | -1511.91 | ms | -9.00% |
| error rate | force-merge | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | script-score-query | 7.0288 | 8.55991 | 1.53111 | ops/s | +21.78% |
| Mean Throughput | script-score-query | 7.55875 | 9.68523 | 2.12648 | ops/s | +28.13% |
| Median Throughput | script-score-query | 7.61418 | 9.8201 | 2.20593 | ops/s | +28.97% |
| Max Throughput | script-score-query | 7.68795 | 10.0394 | 2.35142 | ops/s | +30.59% |
| 50th percentile latency | script-score-query | 128.268 | 96.7239 | -31.5439 | ms | -24.59% |
| 90th percentile latency | script-score-query | 129.359 | 100.52 | -28.8388 | ms | -22.29% |
| 99th percentile latency | script-score-query | 133.435 | 102.186 | -31.2491 | ms | -23.42% |
| 99.9th percentile latency | script-score-query | 146.3 | 114.949 | -31.3514 | ms | -21.43% |
| 100th percentile latency | script-score-query | 149.823 | 116.18 | -33.6427 | ms | -22.46% |
| 50th percentile service time | script-score-query | 128.268 | 96.7239 | -31.5439 | ms | -24.59% |
| 90th percentile service time | script-score-query | 129.359 | 100.52 | -28.8388 | ms | -22.29% |
| 99th percentile service time | script-score-query | 133.435 | 102.186 | -31.2491 | ms | -23.42% |
| 99.9th percentile service time | script-score-query | 146.3 | 114.949 | -31.3514 | ms | -21.43% |
| 100th percentile service time | script-score-query | 149.823 | 116.18 | -33.6427 | ms | -22.46% |
| error rate | script-score-query | 0 | 0 | 0 | % | 0.00% |

Default results vs. my change + Vector Incubator API

Numbers speak for themselves ;)

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---:|---:|---:|---|---:|
| Cumulative indexing time of primary shards | | 1.49652 | 1.65228 | 0.15577 | min | +10.41% |
| Min cumulative indexing time across primary shard | | 0.747817 | 0.826067 | 0.07825 | min | +10.46% |
| Median cumulative indexing time across primary shard | | 0.748258 | 0.826142 | 0.07788 | min | +10.41% |
| Max cumulative indexing time across primary shard | | 0.7487 | 0.826217 | 0.07752 | min | +10.35% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 0.8653 | 0.251483 | -0.61382 | min | -70.94% |
| Cumulative merge count of primary shards | | 4 | 2 | -2 | | -50.00% |
| Min cumulative merge time across primary shard | | 0.431583 | 0.12565 | -0.30593 | min | -70.89% |
| Median cumulative merge time across primary shard | | 0.43265 | 0.125742 | -0.30691 | min | -70.94% |
| Max cumulative merge time across primary shard | | 0.433717 | 0.125833 | -0.30788 | min | -70.99% |
| Cumulative merge throttle time of primary shards | | 0.506683 | 0 | -0.50668 | min | -100.00% |
| Min cumulative merge throttle time across primary shard | | 0.25275 | 0 | -0.25275 | min | -100.00% |
| Median cumulative merge throttle time across primary shard | | 0.253342 | 0 | -0.25334 | min | -100.00% |
| Max cumulative merge throttle time across primary shard | | 0.253933 | 0 | -0.25393 | min | -100.00% |
| Cumulative refresh time of primary shards | | 0.109317 | 0.106517 | -0.0028 | min | -2.56% |
| Cumulative refresh count of primary shards | | 46 | 48 | 2 | | +4.35% |
| Min cumulative refresh time across primary shard | | 0.0539833 | 0.0516167 | -0.00237 | min | -4.38% |
| Median cumulative refresh time across primary shard | | 0.0546583 | 0.0532583 | -0.0014 | min | -2.56% |
| Max cumulative refresh time across primary shard | | 0.0553333 | 0.0549 | -0.00043 | min | -0.78% |
| Cumulative flush time of primary shards | | 0.0907833 | 0.1073 | 0.01652 | min | +18.19% |
| Cumulative flush count of primary shards | | 2 | 2 | 0 | | 0.00% |
| Min cumulative flush time across primary shard | | 0.045 | 0.0526167 | 0.00762 | min | +16.93% |
| Median cumulative flush time across primary shard | | 0.0453917 | 0.05365 | 0.00826 | min | +18.19% |
| Max cumulative flush time across primary shard | | 0.0457833 | 0.0546833 | 0.0089 | min | +19.44% |
| Total Young Gen GC time | | 1.09 | 0.775 | -0.315 | s | -28.90% |
| Total Young Gen GC count | | 91 | 44 | -47 | | -51.65% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Store size | | 1.99315 | 2.01533 | 0.02218 | GB | +1.11% |
| Translog size | | 1.02445e-07 | 1.02445e-07 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 2 | 2 | 0 | | 0.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index-append | 23392.3 | 23195.8 | -196.462 | docs/s | -0.84% |
| Mean Throughput | index-append | 23866.3 | 23333.3 | -532.962 | docs/s | -2.23% |
| Median Throughput | index-append | 23824.8 | 23253.6 | -571.197 | docs/s | -2.40% |
| Max Throughput | index-append | 24283.7 | 23698.7 | -585.011 | docs/s | -2.41% |
| 50th percentile latency | index-append | 170.492 | 188.767 | 18.2744 | ms | +10.72% |
| 90th percentile latency | index-append | 196.944 | 218.767 | 21.8235 | ms | +11.08% |
| 99th percentile latency | index-append | 216.506 | 266.375 | 49.8685 | ms | +23.03% |
| 100th percentile latency | index-append | 537.944 | 551.11 | 13.166 | ms | +2.45% |
| 50th percentile service time | index-append | 170.492 | 188.767 | 18.2744 | ms | +10.72% |
| 90th percentile service time | index-append | 196.944 | 218.767 | 21.8235 | ms | +11.08% |
| 99th percentile service time | index-append | 216.506 | 266.375 | 49.8685 | ms | +23.03% |
| 100th percentile service time | index-append | 537.944 | 551.11 | 13.166 | ms | +2.45% |
| error rate | index-append | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| Mean Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| Median Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| Max Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| 100th percentile latency | refresh-after-index | 1815.22 | 1889.02 | 73.8069 | ms | +4.07% |
| 100th percentile service time | refresh-after-index | 1815.22 | 1889.02 | 73.8069 | ms | +4.07% |
| error rate | refresh-after-index | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| Mean Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| Median Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| Max Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| 100th percentile latency | refresh-after-update | 9.11138 | 7.33983 | -1.77154 | ms | -19.44% |
| 100th percentile service time | refresh-after-update | 9.11138 | 7.33983 | -1.77154 | ms | -19.44% |
| error rate | refresh-after-update | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| Mean Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| Median Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| Max Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| 100th percentile latency | force-merge | 16802.2 | 15539 | -1263.26 | ms | -7.52% |
| 100th percentile service time | force-merge | 16802.2 | 15539 | -1263.26 | ms | -7.52% |
| error rate | force-merge | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | script-score-query | 7.0288 | 11.0674 | 4.03855 | ops/s | +57.46% |
| Mean Throughput | script-score-query | 7.55875 | 12.6622 | 5.10345 | ops/s | +67.52% |
| Median Throughput | script-score-query | 7.61418 | 12.8619 | 5.24775 | ops/s | +68.92% |
| Max Throughput | script-score-query | 7.68795 | 13.2074 | 5.51943 | ops/s | +71.79% |
| 50th percentile latency | script-score-query | 128.268 | 73.065 | -55.2028 | ms | -43.04% |
| 90th percentile latency | script-score-query | 129.359 | 76.7706 | -52.5883 | ms | -40.65% |
| 99th percentile latency | script-score-query | 133.435 | 77.9628 | -55.4722 | ms | -41.57% |
| 99.9th percentile latency | script-score-query | 146.3 | 79.9961 | -66.3044 | ms | -45.32% |
| 100th percentile latency | script-score-query | 149.823 | 87.4413 | -62.3815 | ms | -41.64% |
| 50th percentile service time | script-score-query | 128.268 | 73.065 | -55.2028 | ms | -43.04% |
| 90th percentile service time | script-score-query | 129.359 | 76.7706 | -52.5883 | ms | -40.65% |
| 99th percentile service time | script-score-query | 133.435 | 77.9628 | -55.4722 | ms | -41.57% |
| 99.9th percentile service time | script-score-query | 146.3 | 79.9961 | -66.3044 | ms | -45.32% |
| 100th percentile service time | script-score-query | 149.823 | 87.4413 | -62.3815 | ms | -41.64% |
| error rate | script-score-query | 0 | 0 | 0 | % | 0.00% |
```diff
 ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
 for (int dim = 0; dim < vector.length; dim++) {
-    vector[dim] = byteBuffer.getFloat();
+    vector[dim] = byteBuffer.getFloat((dim + vectorBR.offset) * Float.BYTES);
```
@ChrisHegarty (Contributor), Jun 6, 2023

I would expect that vectorBR.offset is not needed here. In fact it would be a problem if it is anything other than 0?

@benwtrent (Member, Author)

@ChrisHegarty I guess I misunderstand. I thought getFloat was the absolute position within the underlying bytes, not relative to the starting position (dictated by the wrapping of bytes with offset).

@benwtrent (Member, Author)

My equation is wrong for sure, I think it should be (dim * Float.BYTES) + vectorBR.offset?

@benwtrent (Member, Author)

This test passes with my fixed equation:

```java
public void testDecoding() {
    float[] inputFloats = new float[] { 1f, 2f, 3f, 4f };
    ByteBuffer byteBuffer = ByteBuffer.allocate(16);
    for (float f : inputFloats) {
        byteBuffer.putFloat(f);
    }
    BytesRef floatBytes = new BytesRef(byteBuffer.array());
    floatBytes.length = 12;
    floatBytes.offset = 4;
    float[] outputFloats = new float[3];
    VectorEncoderDecoder.decodeDenseVector(floatBytes, outputFloats);
    assertArrayEquals(outputFloats, new float[] { 2f, 3f, 4f }, 0f);
}
```
@benwtrent (Member, Author)

@ChrisHegarty what do you think? I fixed my math. Is this still an issue?

Contributor

@benwtrent Your changes are good. The wrap with an offset and length is not relevant any more; it could be a plain wrap(byte[]). Since we are now using absolute addressing of the buffer, the complete byte[] is available to us even with wrap(byte[], int, int), which just sets up an initial position and limit.
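As a small aside illustrating that point (my own sketch, not code from this PR): wrap(byte[], int, int) only sets the initial position and limit, while absolute getFloat(int) indexes from the start of the backing array, so the offset must not be added twice.

```java
import java.nio.ByteBuffer;

public class AbsoluteAddressingDemo {
    public static void main(String[] args) {
        byte[] backing = new byte[16];
        ByteBuffer.wrap(backing).putFloat(4, 42f); // write 42f at absolute byte index 4

        ByteBuffer plain = ByteBuffer.wrap(backing);         // position 0, limit 16
        ByteBuffer sliced = ByteBuffer.wrap(backing, 4, 12); // position 4, limit 16

        // Both absolute reads address the same byte in the backing array;
        // the offset passed to wrap() is not applied again for absolute access.
        System.out.println(plain.getFloat(4));  // 42.0
        System.out.println(sliced.getFloat(4)); // 42.0
    }
}
```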

@benwtrent (Member, Author)

Microbenchmark for Bytes

Lucene also added SIMD improvements for bytes. The function requires a byte[] and assumes the full length with offset: 0. When index: true, the benefit is obvious. However, index: false poses an interesting conundrum.

A BytesRef is returned from the value iterator, meaning there is some offset and length associated with the underlying bytes reference. I am not 100% sure these are always 0 and byte[]#length. Copying the underlying array ensures it.

The following benchmark was run on an M1 Max with a 128-bit lane width, over 100 random byte vectors. One set is "short" (96 dims), the other "medium" (768 dims).

If the Vector API is NOT available and we still copy the BytesRef, the calculation is slower.

However, if the Vector API is present, a speed-up is achieved.

Here are some numbers (lower is better).

```
ByteBufferDecodingFunctionBenchmark.benchmarkByteCopied      768  avgt  5  13452.402 ±  21.980  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteCopied       96  avgt  5   2350.160 ±   5.545  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteNotCopied   768  avgt  5  11766.417 ±  20.773  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteNotCopied    96  avgt  5   1936.556 ±   3.638  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefault         768  avgt  5  24347.973 ±  24.249  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefault          96  avgt  5   3416.354 ±  65.553  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefaultCopied   768  avgt  5  26117.556 ±  50.439  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefaultCopied    96  avgt  5   3603.092 ±  42.338  ns/op
```

If we can bypass this internal byte[] copy, this change is really nice.

@jpountz && @jdconrad what do you think? Should we attempt to bypass the copy if offset: 0 in the BytesRef?
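For reference, a rough sketch of the kind of branch being discussed, under the assumption that the BytesRef covers exactly the vector bytes (the class and method names are hypothetical): only copy when the offset/length do not already span the whole backing array, and pass the result to Lucene's VectorUtil.dotProduct(byte[], byte[]), which expects full-length arrays.

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.VectorUtil;

final class ByteVectorOps {
    // Illustrative sketch: bypass the defensive copy when the BytesRef already
    // covers the whole backing array; otherwise trim it into a fresh byte[].
    static int byteDotProduct(byte[] queryVector, BytesRef docVector) {
        final byte[] bytes;
        if (docVector.offset == 0 && docVector.length == docVector.bytes.length) {
            bytes = docVector.bytes; // no copy needed
        } else {
            bytes = new byte[docVector.length];
            System.arraycopy(docVector.bytes, docVector.offset, bytes, 0, docVector.length);
        }
        return VectorUtil.dotProduct(queryVector, bytes);
    }
}
```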

@benwtrent (Member, Author)

@jpountz && @jdconrad I cannot bypass the copy when we store in binary blobs, as we append the danged magnitude at the end of the byte array.
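To make that constraint concrete, here is a hypothetical illustration of the layout described above (the real encoding lives in VectorEncoderDecoder and may differ in detail): the float magnitude is appended after the vector bytes, so the backing array is always longer than the vector itself and the full-array Lucene byte functions cannot be fed the raw blob without a trimming copy.

```java
import java.nio.ByteBuffer;

final class ByteVectorLayout {
    // Hypothetical sketch of the stored layout: N vector bytes followed by a
    // 4-byte float magnitude. Any consumer that needs just the vector must
    // either respect offset/length or copy the first N bytes out.
    static byte[] encodeWithMagnitude(byte[] vector, float magnitude) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length + Float.BYTES);
        buffer.put(vector);
        buffer.putFloat(magnitude);
        return buffer.array();
    }
}
```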

@jdconrad (Contributor) left a comment

LGTM! One minor comment. The benchmarks look awesome :)

```java
    return defaultValue;
}
return new BinaryDenseVector(value, dims, indexVersion);
VectorEncoderDecoder.decodeDenseVector(value, vectorValue);
```
Contributor

It's possible get is used with the same value twice. I wonder if it makes sense to cache it?

@benwtrent (Member, Author)

I can cache it for sure.

@benwtrent (Member, Author)

added code to cache
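Something along these lines, purely as a sketch of the caching idea (the field and method names are illustrative, and VectorEncoderDecoder.decodeDenseVector is the helper used elsewhere in this PR): remember which BytesRef was decoded last and skip re-decoding when get is called with the same value again.

```java
import org.apache.lucene.util.BytesRef;

// Illustrative sketch only: cache the most recently decoded vector.
final class CachingVectorDecoder {
    private final float[] decoded;
    private BytesRef lastDecoded;

    CachingVectorDecoder(int dims) {
        this.decoded = new float[dims];
    }

    float[] decode(BytesRef value) {
        if (value != lastDecoded) { // same instance => already decoded into the buffer
            VectorEncoderDecoder.decodeDenseVector(value, decoded);
            lastDecoded = value;
        }
        return decoded;
    }
}
```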

@benwtrent benwtrent requested a review from jdconrad June 7, 2023 12:56
@benwtrent (Member, Author)

@jdconrad I went ahead and added binary byte support for the Lucene-accelerated calculations. The byte array copy adds less than a millisecond of overhead when reading 25,000 values from the same field, but it cuts runtime almost in half when the Vector API is used. This seems justifiable to me.

Let me know if you significantly disagree.

@benwtrent (Member, Author)

run elasticsearch-ci/bwc

@benwtrent (Member, Author)

@elasticmachine update branch

@jdconrad (Contributor) left a comment

LGTM with the new changes. I agree with you that the performance trade off for decoding is worth it.

@jpountz (Contributor) left a comment

Nice to find more places where these new vectorized methods can be used!

```diff
 ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
 for (int dim = 0; dim < vector.length; dim++) {
-    vector[dim] = byteBuffer.getFloat();
+    vector[dim] = byteBuffer.getFloat((dim * Float.BYTES) + vectorBR.offset);
```
Contributor

The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?

Also unrelated to your PR, I would expect `ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector)` to run faster than reading a float at a time.

Contributor

@jpountz I just asked @mayya-sharipova about this same thing. Using getFloat(int) ignores the current position of the buffer. It may make sense to wrap without the offset and length parameters.

@benwtrent (Member, Author)

> Also unrelated to your PR, I would expect `ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector)` to run faster than reading a float at a time.

@jpountz it doesn't. This is 30% faster than making it a FloatBuffer. For some reason, the absolute reads are faster. @ChrisHegarty might have an intuition as to why.

@benwtrent (Member, Author)

> The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?

I honestly don't know. I added a test to verify (see: https://github.com/elastic/elasticsearch/pull/96617/files#diff-2d953a23603f9d7ef2f18d9f7bff3960307afc1a763e28fa2c7eca0ee3a65599). That test creates a BytesRef with a custom length and offset, and we get the expected results with my change.

Contributor

Thanks @jdconrad, you are right indeed! Thanks @benwtrent for adding a test.

@benwtrent Thanks for checking performance. I remember that MikeS found that wrapping as a float buffer was faster when adding DataInput#readFloats, but that was with a direct byte buffer in little-endian byte order; these differences might matter.

Contributor

Yes, it's the endianness of the byte buffers we're using here: big-endian in this case. It's most efficient to use the native endianness; otherwise byte swapping will occur. Since these are already in the index, is it possible to switch them to little-endian?
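A tiny sketch of the point about byte order (my own illustration; the stored vectors are big-endian today, and switching would require re-encoding the index): ByteBuffer defaults to big-endian, so on little-endian hardware every read byte-swaps unless the buffer is switched to the native order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndiannessDemo {
    public static void main(String[] args) {
        // LITTLE_ENDIAN on x86 and Apple silicon
        System.out.println(ByteOrder.nativeOrder());

        byte[] bytes = new byte[4];
        // Default order is BIG_ENDIAN: reads must byte-swap on little-endian CPUs.
        ByteBuffer bigEndian = ByteBuffer.wrap(bytes);
        // Reading in the native order avoids the swap.
        ByteBuffer nativeOrder = ByteBuffer.wrap(bytes).order(ByteOrder.nativeOrder());

        System.out.println(bigEndian.order());   // BIG_ENDIAN
        System.out.println(nativeOrder.order()); // native order
    }
}
```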

@benwtrent (Member, Author)

@jpountz here are some benchmark results:

So, this is the average latency for decoding float[] values from ByteBuffers. Here is the benchmark code (in case you can spot any wonkiness): https://gist.github.com/benwtrent/29cc0338cd851c345cace5c486095507

Direct read (via an index into the buffer) is always faster. FloatBuffer vs. ByteBuffer is almost identical for both decoding styles (iteration vs. direct read).

```
Benchmark                                                          (floatArraySizes)  Mode  Cnt         Score         Error  Units
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                  96  avgt    5   3348858.476 ±   13314.154  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                 768  avgt    5  19960837.116 ±  137071.324  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                2048  avgt    5  49142580.882 ±  216614.762  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                   96  avgt    5   4992161.167 ±   47263.300  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                  768  avgt    5  33762525.212 ±  110609.781  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                 2048  avgt    5  89409177.001 ± 1122561.441  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                 96  avgt    5   3529213.563 ±   24635.601  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                768  avgt    5  18152123.115 ±   49712.077  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead               2048  avgt    5  46610729.883 ± 2462771.340  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                  96  avgt    5   5085876.957 ±   75157.360  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                 768  avgt    5  32359519.955 ±  399297.947  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                2048  avgt    5  86083215.393 ± 1120197.606  ns/op
```
@benwtrent (Member, Author)

@elasticmachine update branch

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 8, 2023
@benwtrent (Member, Author)

run elasticsearch-ci/part-2

@mayya-sharipova (Contributor) left a comment

@benwtrent Thanks Ben, great speedups!

@elasticsearchmachine elasticsearchmachine merged commit 763cd14 into elastic:main Jun 8, 2023
@benwtrent benwtrent deleted the feature/knn-brute-force-improvements branch June 8, 2023 14:41
stu-elastic added a commit that referenced this pull request Jun 8, 2023
…esOpenChannels (#96635) Also reduce all grace periods below 100ms Fixes: #96617