
Conversation

@benwtrent (Member)

Lucene has integrated hardware-accelerated vector calculations, meaning that calculations like dot_product can be much faster when the Lucene-defined functions are used.

When a dense_vector is indexed, we already take advantage of this. However, when index: false, we store float vectors as binary fields in Lucene and decode them ourselves, meaning we don't use the underlying Lucene structures or functions.

To take advantage of the large performance boost, this PR refactors the binary vector values in the following way:

  • Eagerly decode the binary blobs when iterated
  • Call the Lucene defined VectorUtil functions when possible

related to: #96370
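To make the idea concrete, here is a minimal sketch of the approach (my own illustration, not the actual PR code): decode the stored big-endian float blob into a plain float[] once, then hand it to Lucene's VectorUtil, which can use the Panama Vector API when enabled. The class and method names here are illustrative.

```java
import java.nio.ByteBuffer;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.VectorUtil;

// Illustrative sketch: decode the binary blob eagerly, then delegate the math to Lucene.
final class DecodedVectorScorer {
    private final float[] queryVector;
    private final float[] docVector;

    DecodedVectorScorer(float[] queryVector, int dims) {
        this.queryVector = queryVector;
        this.docVector = new float[dims];
    }

    // Decode the stored blob (big-endian floats) into a reusable float[] buffer.
    void decode(BytesRef blob) {
        ByteBuffer buffer = ByteBuffer.wrap(blob.bytes, blob.offset, blob.length);
        for (int dim = 0; dim < docVector.length; dim++) {
            // absolute read: index from the start of the backing array
            docVector[dim] = buffer.getFloat(blob.offset + dim * Float.BYTES);
        }
    }

    // Hardware-accelerated dot product via Lucene's VectorUtil.
    float dotProduct() {
        return VectorUtil.dotProduct(queryVector, docVector);
    }
}
```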

@benwtrent benwtrent added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.9.0 labels Jun 6, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 6, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine (Collaborator)

Hi @benwtrent, I've created a changelog YAML for you.

@benwtrent (Member, Author)

Performance characteristics

I ran a custom rally track (a modification of dense_vector) to test this change, ensuring that the values were not indexed into the HNSW structure and that only script score was used. Here are some results.

These were run on my M1 Max MacBook, preferredBitSize=128 (WHICH IS LAME!!! GIMME AT LEAST 256 APPLE!!!!!)

Default results vs. no Vector Incubator API

This run compares the baseline (main) against my change with the Vector Incubator improvements disabled. Even without them, my change makes this faster!

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---:|---:|---:|---|---:|
| Cumulative indexing time of primary shards | | 1.49652 | 1.58852 | 0.092 | min | +6.15% |
| Min cumulative indexing time across primary shard | | 0.747817 | 0.791517 | 0.0437 | min | +5.84% |
| Median cumulative indexing time across primary shard | | 0.748258 | 0.794258 | 0.046 | min | +6.15% |
| Max cumulative indexing time across primary shard | | 0.7487 | 0.797 | 0.0483 | min | +6.45% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 0.8653 | 0.809883 | -0.05542 | min | -6.40% |
| Cumulative merge count of primary shards | | 4 | 4 | 0 | | 0.00% |
| Min cumulative merge time across primary shard | | 0.431583 | 0.3972 | -0.03438 | min | -7.97% |
| Median cumulative merge time across primary shard | | 0.43265 | 0.404942 | -0.02771 | min | -6.40% |
| Max cumulative merge time across primary shard | | 0.433717 | 0.412683 | -0.02103 | min | -4.85% |
| Cumulative merge throttle time of primary shards | | 0.506683 | 0.496267 | -0.01042 | min | -2.06% |
| Min cumulative merge throttle time across primary shard | | 0.25275 | 0.24225 | -0.0105 | min | -4.15% |
| Median cumulative merge throttle time across primary shard | | 0.253342 | 0.248133 | -0.00521 | min | -2.06% |
| Max cumulative merge throttle time across primary shard | | 0.253933 | 0.254017 | 8e-05 | min | +0.03% |
| Cumulative refresh time of primary shards | | 0.109317 | 0.113283 | 0.00397 | min | +3.63% |
| Cumulative refresh count of primary shards | | 46 | 46 | 0 | | 0.00% |
| Min cumulative refresh time across primary shard | | 0.0539833 | 0.05465 | 0.00067 | min | +1.23% |
| Median cumulative refresh time across primary shard | | 0.0546583 | 0.0566417 | 0.00198 | min | +3.63% |
| Max cumulative refresh time across primary shard | | 0.0553333 | 0.0586333 | 0.0033 | min | +5.96% |
| Cumulative flush time of primary shards | | 0.0907833 | 0.1014 | 0.01062 | min | +11.69% |
| Cumulative flush count of primary shards | | 2 | 2 | 0 | | 0.00% |
| Min cumulative flush time across primary shard | | 0.045 | 0.0496333 | 0.00463 | min | +10.30% |
| Median cumulative flush time across primary shard | | 0.0453917 | 0.0507 | 0.00531 | min | +11.69% |
| Max cumulative flush time across primary shard | | 0.0457833 | 0.0517667 | 0.00598 | min | +13.07% |
| Total Young Gen GC time | | 1.09 | 0.783 | -0.307 | s | -28.17% |
| Total Young Gen GC count | | 91 | 41 | -50 | | -54.95% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Store size | | 1.99315 | 2.04484 | 0.0517 | GB | +2.59% |
| Translog size | | 1.02445e-07 | 1.02445e-07 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 2 | 2 | 0 | | 0.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index-append | 23392.3 | 22829.6 | -562.622 | docs/s | -2.41% |
| Mean Throughput | index-append | 23866.3 | 23267.7 | -598.594 | docs/s | -2.51% |
| Median Throughput | index-append | 23824.8 | 23208.6 | -616.263 | docs/s | -2.59% |
| Max Throughput | index-append | 24283.7 | 23695.1 | -588.604 | docs/s | -2.42% |
| 50th percentile latency | index-append | 170.492 | 177.02 | 6.52729 | ms | +3.83% |
| 90th percentile latency | index-append | 196.944 | 204.761 | 7.81731 | ms | +3.97% |
| 99th percentile latency | index-append | 216.506 | 233.9 | 17.3939 | ms | +8.03% |
| 100th percentile latency | index-append | 537.944 | 366.611 | -171.333 | ms | -31.85% |
| 50th percentile service time | index-append | 170.492 | 177.02 | 6.52729 | ms | +3.83% |
| 90th percentile service time | index-append | 196.944 | 204.761 | 7.81731 | ms | +3.97% |
| 99th percentile service time | index-append | 216.506 | 233.9 | 17.3939 | ms | +8.03% |
| 100th percentile service time | index-append | 537.944 | 366.611 | -171.333 | ms | -31.85% |
| error rate | index-append | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| Mean Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| Median Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| Max Throughput | refresh-after-index | 0.549916 | 0.505059 | -0.04486 | ops/s | -8.16% |
| 100th percentile latency | refresh-after-index | 1815.22 | 1977.07 | 161.856 | ms | +8.92% |
| 100th percentile service time | refresh-after-index | 1815.22 | 1977.07 | 161.856 | ms | +8.92% |
| error rate | refresh-after-index | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| Mean Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| Median Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| Max Throughput | refresh-after-update | 80.8276 | 79.9921 | -0.8355 | ops/s | -1.03% |
| 100th percentile latency | refresh-after-update | 9.11138 | 9.25912 | 0.14775 | ms | +1.62% |
| 100th percentile service time | refresh-after-update | 9.11138 | 9.25912 | 0.14775 | ms | +1.62% |
| error rate | refresh-after-update | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| Mean Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| Median Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| Max Throughput | force-merge | 0.0595047 | 0.0653879 | 0.00588 | ops/s | +9.89% |
| 100th percentile latency | force-merge | 16802.2 | 15290.3 | -1511.91 | ms | -9.00% |
| 100th percentile service time | force-merge | 16802.2 | 15290.3 | -1511.91 | ms | -9.00% |
| error rate | force-merge | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | script-score-query | 7.0288 | 8.55991 | 1.53111 | ops/s | +21.78% |
| Mean Throughput | script-score-query | 7.55875 | 9.68523 | 2.12648 | ops/s | +28.13% |
| Median Throughput | script-score-query | 7.61418 | 9.8201 | 2.20593 | ops/s | +28.97% |
| Max Throughput | script-score-query | 7.68795 | 10.0394 | 2.35142 | ops/s | +30.59% |
| 50th percentile latency | script-score-query | 128.268 | 96.7239 | -31.5439 | ms | -24.59% |
| 90th percentile latency | script-score-query | 129.359 | 100.52 | -28.8388 | ms | -22.29% |
| 99th percentile latency | script-score-query | 133.435 | 102.186 | -31.2491 | ms | -23.42% |
| 99.9th percentile latency | script-score-query | 146.3 | 114.949 | -31.3514 | ms | -21.43% |
| 100th percentile latency | script-score-query | 149.823 | 116.18 | -33.6427 | ms | -22.46% |
| 50th percentile service time | script-score-query | 128.268 | 96.7239 | -31.5439 | ms | -24.59% |
| 90th percentile service time | script-score-query | 129.359 | 100.52 | -28.8388 | ms | -22.29% |
| 99th percentile service time | script-score-query | 133.435 | 102.186 | -31.2491 | ms | -23.42% |
| 99.9th percentile service time | script-score-query | 146.3 | 114.949 | -31.3514 | ms | -21.43% |
| 100th percentile service time | script-score-query | 149.823 | 116.18 | -33.6427 | ms | -22.46% |
| error rate | script-score-query | 0 | 0 | 0 | % | 0.00% |

Default results vs. my change + Vector Incubator API

Numbers speak for themselves ;)

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---:|---:|---:|---|---:|
| Cumulative indexing time of primary shards | | 1.49652 | 1.65228 | 0.15577 | min | +10.41% |
| Min cumulative indexing time across primary shard | | 0.747817 | 0.826067 | 0.07825 | min | +10.46% |
| Median cumulative indexing time across primary shard | | 0.748258 | 0.826142 | 0.07788 | min | +10.41% |
| Max cumulative indexing time across primary shard | | 0.7487 | 0.826217 | 0.07752 | min | +10.35% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 0.8653 | 0.251483 | -0.61382 | min | -70.94% |
| Cumulative merge count of primary shards | | 4 | 2 | -2 | | -50.00% |
| Min cumulative merge time across primary shard | | 0.431583 | 0.12565 | -0.30593 | min | -70.89% |
| Median cumulative merge time across primary shard | | 0.43265 | 0.125742 | -0.30691 | min | -70.94% |
| Max cumulative merge time across primary shard | | 0.433717 | 0.125833 | -0.30788 | min | -70.99% |
| Cumulative merge throttle time of primary shards | | 0.506683 | 0 | -0.50668 | min | -100.00% |
| Min cumulative merge throttle time across primary shard | | 0.25275 | 0 | -0.25275 | min | -100.00% |
| Median cumulative merge throttle time across primary shard | | 0.253342 | 0 | -0.25334 | min | -100.00% |
| Max cumulative merge throttle time across primary shard | | 0.253933 | 0 | -0.25393 | min | -100.00% |
| Cumulative refresh time of primary shards | | 0.109317 | 0.106517 | -0.0028 | min | -2.56% |
| Cumulative refresh count of primary shards | | 46 | 48 | 2 | | +4.35% |
| Min cumulative refresh time across primary shard | | 0.0539833 | 0.0516167 | -0.00237 | min | -4.38% |
| Median cumulative refresh time across primary shard | | 0.0546583 | 0.0532583 | -0.0014 | min | -2.56% |
| Max cumulative refresh time across primary shard | | 0.0553333 | 0.0549 | -0.00043 | min | -0.78% |
| Cumulative flush time of primary shards | | 0.0907833 | 0.1073 | 0.01652 | min | +18.19% |
| Cumulative flush count of primary shards | | 2 | 2 | 0 | | 0.00% |
| Min cumulative flush time across primary shard | | 0.045 | 0.0526167 | 0.00762 | min | +16.93% |
| Median cumulative flush time across primary shard | | 0.0453917 | 0.05365 | 0.00826 | min | +18.19% |
| Max cumulative flush time across primary shard | | 0.0457833 | 0.0546833 | 0.0089 | min | +19.44% |
| Total Young Gen GC time | | 1.09 | 0.775 | -0.315 | s | -28.90% |
| Total Young Gen GC count | | 91 | 44 | -47 | | -51.65% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Store size | | 1.99315 | 2.01533 | 0.02218 | GB | +1.11% |
| Translog size | | 1.02445e-07 | 1.02445e-07 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 2 | 2 | 0 | | 0.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index-append | 23392.3 | 23195.8 | -196.462 | docs/s | -0.84% |
| Mean Throughput | index-append | 23866.3 | 23333.3 | -532.962 | docs/s | -2.23% |
| Median Throughput | index-append | 23824.8 | 23253.6 | -571.197 | docs/s | -2.40% |
| Max Throughput | index-append | 24283.7 | 23698.7 | -585.011 | docs/s | -2.41% |
| 50th percentile latency | index-append | 170.492 | 188.767 | 18.2744 | ms | +10.72% |
| 90th percentile latency | index-append | 196.944 | 218.767 | 21.8235 | ms | +11.08% |
| 99th percentile latency | index-append | 216.506 | 266.375 | 49.8685 | ms | +23.03% |
| 100th percentile latency | index-append | 537.944 | 551.11 | 13.166 | ms | +2.45% |
| 50th percentile service time | index-append | 170.492 | 188.767 | 18.2744 | ms | +10.72% |
| 90th percentile service time | index-append | 196.944 | 218.767 | 21.8235 | ms | +11.08% |
| 99th percentile service time | index-append | 216.506 | 266.375 | 49.8685 | ms | +23.03% |
| 100th percentile service time | index-append | 537.944 | 551.11 | 13.166 | ms | +2.45% |
| error rate | index-append | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| Mean Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| Median Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| Max Throughput | refresh-after-index | 0.549916 | 0.528989 | -0.02093 | ops/s | -3.81% |
| 100th percentile latency | refresh-after-index | 1815.22 | 1889.02 | 73.8069 | ms | +4.07% |
| 100th percentile service time | refresh-after-index | 1815.22 | 1889.02 | 73.8069 | ms | +4.07% |
| error rate | refresh-after-index | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| Mean Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| Median Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| Max Throughput | refresh-after-update | 80.8276 | 101.564 | 20.7368 | ops/s | +25.66% |
| 100th percentile latency | refresh-after-update | 9.11138 | 7.33983 | -1.77154 | ms | -19.44% |
| 100th percentile service time | refresh-after-update | 9.11138 | 7.33983 | -1.77154 | ms | -19.44% |
| error rate | refresh-after-update | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| Mean Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| Median Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| Max Throughput | force-merge | 0.0595047 | 0.0643411 | 0.00484 | ops/s | +8.13% |
| 100th percentile latency | force-merge | 16802.2 | 15539 | -1263.26 | ms | -7.52% |
| 100th percentile service time | force-merge | 16802.2 | 15539 | -1263.26 | ms | -7.52% |
| error rate | force-merge | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | script-score-query | 7.0288 | 11.0674 | 4.03855 | ops/s | +57.46% |
| Mean Throughput | script-score-query | 7.55875 | 12.6622 | 5.10345 | ops/s | +67.52% |
| Median Throughput | script-score-query | 7.61418 | 12.8619 | 5.24775 | ops/s | +68.92% |
| Max Throughput | script-score-query | 7.68795 | 13.2074 | 5.51943 | ops/s | +71.79% |
| 50th percentile latency | script-score-query | 128.268 | 73.065 | -55.2028 | ms | -43.04% |
| 90th percentile latency | script-score-query | 129.359 | 76.7706 | -52.5883 | ms | -40.65% |
| 99th percentile latency | script-score-query | 133.435 | 77.9628 | -55.4722 | ms | -41.57% |
| 99.9th percentile latency | script-score-query | 146.3 | 79.9961 | -66.3044 | ms | -45.32% |
| 100th percentile latency | script-score-query | 149.823 | 87.4413 | -62.3815 | ms | -41.64% |
| 50th percentile service time | script-score-query | 128.268 | 73.065 | -55.2028 | ms | -43.04% |
| 90th percentile service time | script-score-query | 129.359 | 76.7706 | -52.5883 | ms | -40.65% |
| 99th percentile service time | script-score-query | 133.435 | 77.9628 | -55.4722 | ms | -41.57% |
| 99.9th percentile service time | script-score-query | 146.3 | 79.9961 | -66.3044 | ms | -45.32% |
| 100th percentile service time | script-score-query | 149.823 | 87.4413 | -62.3815 | ms | -41.64% |
| error rate | script-score-query | 0 | 0 | 0 | % | 0.00% |
```diff
 ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
 for (int dim = 0; dim < vector.length; dim++) {
-    vector[dim] = byteBuffer.getFloat();
+    vector[dim] = byteBuffer.getFloat((dim + vectorBR.offset) * Float.BYTES);
```
@ChrisHegarty (Contributor), Jun 6, 2023

I would expect that vectorBR.offset is not needed here. In fact it would be a problem if it is anything other than 0?

@benwtrent (Member, Author)

@ChrisHegarty I guess I misunderstand. I thought getFloat was the absolute position within the underlying bytes, not relative to the starting position (dictated by the wrapping of bytes with offset).

@benwtrent (Member, Author)

My equation is wrong for sure, I think it should be (dim * Float.BYTES) + vectorBR.offset?

@benwtrent (Member, Author)

This test passes with my fixed equation:

```java
public void testDecoding() {
    float[] inputFloats = new float[] { 1f, 2f, 3f, 4f };
    ByteBuffer byteBuffer = ByteBuffer.allocate(16);
    for (float f : inputFloats) {
        byteBuffer.putFloat(f);
    }
    BytesRef floatBytes = new BytesRef(byteBuffer.array());
    floatBytes.length = 12;
    floatBytes.offset = 4;
    float[] outputFloats = new float[3];
    VectorEncoderDecoder.decodeDenseVector(floatBytes, outputFloats);
    assertArrayEquals(outputFloats, new float[] { 2f, 3f, 4f }, 0f);
}
```
@benwtrent (Member, Author)

@ChrisHegarty what do you think? I fixed my math. Is this still an issue?

Contributor

@benwtrent Your changes are good. The wrap with an offset and length is not relevant any more; it could be a plain wrap(byte[]). Since we are now using absolute addressing of the buffer, the complete byte[] is available to us even with wrap(byte[], int, int), which just sets up an initial position and limit.
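As a small aside illustrating that point (my own sketch, not code from this PR): wrap(byte[], int, int) only sets the initial position and limit, while absolute getFloat(int) indexes from the start of the backing array, so the offset must not be added twice.

```java
import java.nio.ByteBuffer;

public class AbsoluteAddressingDemo {
    public static void main(String[] args) {
        byte[] backing = new byte[16];
        ByteBuffer.wrap(backing).putFloat(4, 42f); // write 42f at absolute byte index 4

        ByteBuffer plain = ByteBuffer.wrap(backing);         // position 0, limit 16
        ByteBuffer sliced = ByteBuffer.wrap(backing, 4, 12); // position 4, limit 16

        // Both absolute reads address the same byte in the backing array;
        // the offset passed to wrap() is not applied again for absolute access.
        System.out.println(plain.getFloat(4));  // 42.0
        System.out.println(sliced.getFloat(4)); // 42.0
    }
}
```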

@benwtrent (Member, Author)

Microbenchmark for Bytes

Lucene also added SIMD improvements for bytes. The function requires a byte[] and assumes the full length with offset: 0. When index: true, the benefit is obvious. However, index: false poses an interesting conundrum.

A BytesRef is returned from the value iterator, meaning there is some offset and length associated with the underlying bytes reference. I am not 100% sure these are always 0 and byte[]#length. Copying the underlying array ensures it.

The following benchmark was run on an M1 Max with a 128-bit lane width, over 100 random byte vectors. One set is "short" (96 dims), the other "medium" (768 dims).

If the Vector API is NOT available and we still copy the BytesRef, the calculation is slower.

However, if the Vector API is present, a speed-up is achieved.

Here are some numbers (lower is better).

```
ByteBufferDecodingFunctionBenchmark.benchmarkByteCopied      768  avgt  5  13452.402 ±  21.980  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteCopied       96  avgt  5   2350.160 ±   5.545  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteNotCopied   768  avgt  5  11766.417 ±  20.773  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteNotCopied    96  avgt  5   1936.556 ±   3.638  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefault         768  avgt  5  24347.973 ±  24.249  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefault          96  avgt  5   3416.354 ±  65.553  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefaultCopied   768  avgt  5  26117.556 ±  50.439  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefaultCopied    96  avgt  5   3603.092 ±  42.338  ns/op
```

If we can bypass this internal byte[] copy, this change is really nice.

@jpountz && @jdconrad what do you think? Should we attempt to bypass the copy if offset: 0 in the BytesRef?
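For reference, a rough sketch of the kind of branch being discussed, under the assumption that the BytesRef covers exactly the vector bytes (the class and method names are hypothetical): only copy when the offset/length do not already span the whole backing array, and pass the result to Lucene's VectorUtil.dotProduct(byte[], byte[]), which expects full-length arrays.

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.VectorUtil;

final class ByteVectorOps {
    // Illustrative sketch: bypass the defensive copy when the BytesRef already
    // covers the whole backing array; otherwise trim it into a fresh byte[].
    static int byteDotProduct(byte[] queryVector, BytesRef docVector) {
        final byte[] bytes;
        if (docVector.offset == 0 && docVector.length == docVector.bytes.length) {
            bytes = docVector.bytes; // no copy needed
        } else {
            bytes = new byte[docVector.length];
            System.arraycopy(docVector.bytes, docVector.offset, bytes, 0, docVector.length);
        }
        return VectorUtil.dotProduct(queryVector, bytes);
    }
}
```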

@benwtrent (Member, Author)

@jpountz && @jdconrad I cannot bypass the copy when we store in binary blobs, as we append the danged magnitude at the end of the byte array.
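To make that constraint concrete, here is a hypothetical illustration of the layout described above (the real encoding lives in VectorEncoderDecoder and may differ in detail): the float magnitude is appended after the vector bytes, so the backing array is always longer than the vector itself and the full-array Lucene byte functions cannot be fed the raw blob without a trimming copy.

```java
import java.nio.ByteBuffer;

final class ByteVectorLayout {
    // Hypothetical sketch of the stored layout: N vector bytes followed by a
    // 4-byte float magnitude. Any consumer that needs just the vector must
    // either respect offset/length or copy the first N bytes out.
    static byte[] encodeWithMagnitude(byte[] vector, float magnitude) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length + Float.BYTES);
        buffer.put(vector);
        buffer.putFloat(magnitude);
        return buffer.array();
    }
}
```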

@jdconrad (Contributor) left a comment

LGTM! One minor comment. The benchmarks look awesome :)

```java
    return defaultValue;
}
return new BinaryDenseVector(value, dims, indexVersion);
VectorEncoderDecoder.decodeDenseVector(value, vectorValue);
```
Contributor

It's possible get is used with the same value twice. I wonder if it makes sense to cache it?

@benwtrent (Member, Author)

I can cache it for sure.

@benwtrent (Member, Author)

added code to cache
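Something along these lines, purely as a sketch of the caching idea (the field and method names are illustrative, and VectorEncoderDecoder.decodeDenseVector is the helper used elsewhere in this PR): remember which BytesRef was decoded last and skip re-decoding when get is called with the same value again.

```java
import org.apache.lucene.util.BytesRef;

// Illustrative sketch only: cache the most recently decoded vector.
final class CachingVectorDecoder {
    private final float[] decoded;
    private BytesRef lastDecoded;

    CachingVectorDecoder(int dims) {
        this.decoded = new float[dims];
    }

    float[] decode(BytesRef value) {
        if (value != lastDecoded) { // same instance => already decoded into the buffer
            VectorEncoderDecoder.decodeDenseVector(value, decoded);
            lastDecoded = value;
        }
        return decoded;
    }
}
```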

@benwtrent benwtrent requested a review from jdconrad June 7, 2023 12:56
@benwtrent (Member, Author)

@jdconrad I went ahead and added binary byte support for the Lucene-accelerated calculations. The byte array copy adds less than a millisecond of overhead when reading 25,000 values from the same field, but it cuts runtime almost in half when the Vector API is used. This seems justifiable to me.

Let me know if you significantly disagree.

@benwtrent (Member, Author)

run elasticsearch-ci/bwc

@benwtrent (Member, Author)

@elasticmachine update branch

@jdconrad (Contributor) left a comment

LGTM with the new changes. I agree with you that the performance trade off for decoding is worth it.

@jpountz (Contributor) left a comment

Nice to find more places where these new vectorized methods can be used!

```diff
 ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
 for (int dim = 0; dim < vector.length; dim++) {
-    vector[dim] = byteBuffer.getFloat();
+    vector[dim] = byteBuffer.getFloat((dim * Float.BYTES) + vectorBR.offset);
```
Contributor

The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?

Also unrelated to your PR, I would expect `ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector)` to run faster than reading a float at a time.

Contributor

@jpountz I just asked @mayya-sharipova about this same thing. Using getFloat(int) ignores the current position of the buffer. It may make sense to wrap without the offset and length parameters.

@benwtrent (Member, Author)

> Also unrelated to your PR, I would expect `ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector)` to run faster than reading a float at a time.

@jpountz it doesn't. This is 30% faster than making it a FloatBuffer. For some reason, the absolute reads are faster. @ChrisHegarty might have an intuition as to why.

@benwtrent (Member, Author)

> The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?

I honestly don't know. I added a test to verify (see: https://github.com/elastic/elasticsearch/pull/96617/files#diff-2d953a23603f9d7ef2f18d9f7bff3960307afc1a763e28fa2c7eca0ee3a65599). That test creates a BytesRef with a custom length and offset, and we get the expected results with my change.

Contributor

Thanks @jdconrad, you are right indeed! Thanks @benwtrent for adding a test.

@benwtrent Thanks for checking performance. I remember that MikeS found that wrapping as a float buffer was faster when adding DataInput#readFloats, but that was with a direct byte buffer in little-endian byte order; these differences might matter.

Contributor

Yes, it's the endianness of the byte buffers we're using here: big-endian in this case. It's most efficient to use the native endianness; otherwise byte swapping will occur. Since these are already in the index, is it possible to switch them to little-endian?
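A tiny sketch of the point about byte order (my own illustration; the stored vectors are big-endian today, and switching would require re-encoding the index): ByteBuffer defaults to big-endian, so on little-endian hardware every read byte-swaps unless the buffer is switched to the native order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndiannessDemo {
    public static void main(String[] args) {
        // LITTLE_ENDIAN on x86 and Apple silicon
        System.out.println(ByteOrder.nativeOrder());

        byte[] bytes = new byte[4];
        // Default order is BIG_ENDIAN: reads must byte-swap on little-endian CPUs.
        ByteBuffer bigEndian = ByteBuffer.wrap(bytes);
        // Reading in the native order avoids the swap.
        ByteBuffer nativeOrder = ByteBuffer.wrap(bytes).order(ByteOrder.nativeOrder());

        System.out.println(bigEndian.order());   // BIG_ENDIAN
        System.out.println(nativeOrder.order()); // native order
    }
}
```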

@benwtrent (Member, Author)

@jpountz here are some benchmark results:

So, this is the average latency for decoding float[] values from ByteBuffers. Here is the benchmark code (in case you can spot any wonkiness): https://gist.github.com/benwtrent/29cc0338cd851c345cace5c486095507

Direct read (via an index into the buffer) is always faster. FloatBuffer vs. ByteBuffer is almost identical for both decoding styles (iteration vs. direct read).

```
Benchmark                                                          (floatArraySizes)  Mode  Cnt         Score         Error  Units
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                  96  avgt    5   3348858.476 ±   13314.154  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                 768  avgt    5  19960837.116 ±  137071.324  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                2048  avgt    5  49142580.882 ±  216614.762  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                   96  avgt    5   4992161.167 ±   47263.300  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                  768  avgt    5  33762525.212 ±  110609.781  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                 2048  avgt    5  89409177.001 ± 1122561.441  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                 96  avgt    5   3529213.563 ±   24635.601  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                768  avgt    5  18152123.115 ±   49712.077  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead               2048  avgt    5  46610729.883 ± 2462771.340  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                  96  avgt    5   5085876.957 ±   75157.360  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                 768  avgt    5  32359519.955 ±  399297.947  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                2048  avgt    5  86083215.393 ± 1120197.606  ns/op
```
@benwtrent (Member, Author)

@elasticmachine update branch

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 8, 2023
@benwtrent (Member, Author)

run elasticsearch-ci/part-2

@mayya-sharipova (Contributor) left a comment

@benwtrent Thanks Ben, great speedups!

@elasticsearchmachine elasticsearchmachine merged commit 763cd14 into elastic:main Jun 8, 2023
@benwtrent benwtrent deleted the feature/knn-brute-force-improvements branch June 8, 2023 14:41
stu-elastic added a commit that referenced this pull request Jun 8, 2023
…esOpenChannels (#96635) Also reduce all grace periods below 100ms Fixes: #96617