
Conversation

@mccullocht
Contributor

@mccullocht mccullocht commented Sep 29, 2025

Partial implementation of #15155

So far this is not any faster than the alternative. On an AMD RYZEN AI MAX+ 395:

baseline:
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.913   1.635        1.630   0.997        1000000  100   100     32       250        8 bits     6824     0.00      Infinity      0.04            1             3759.67         3677.368      747.681      HNSW

candidate:
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.913   1.671        1.661   0.994        1000000  100   100     32       250        8 bits     6824     0.00      Infinity      0.04            1             3759.67         3677.368      747.681      HNSW

DO NOT MERGE
Performance observations: on an AVX-512 host the profiles are quite different. The original path spends most of its time in dotProductBody512(), followed by Int512Vector.reduceLanes(). The new path spends much more time in reduceLanes(), but also spends more time loading the input vectors -- a 128-bit load from a MemorySegment instead of from a heap array. This could be memory latency, but in that case why doesn't the load into the heap array show up in the profile?
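For orientation, the two paths differ mainly in where each 128-bit chunk of the document vector is loaded from. A minimal sketch of that difference, with illustrative names rather than the actual Lucene code:

```java
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

class LoadPaths {
  static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

  // Baseline-style: the doc vector was first copied into a heap array,
  // then each 128-bit chunk is loaded from that array.
  static ByteVector loadFromHeap(byte[] docCopy, int offset) {
    return ByteVector.fromArray(SPECIES, docCopy, offset);
  }

  // Candidate-style: skip the copy and load each 128-bit chunk directly
  // from the mmapped index data.
  static ByteVector loadOffHeap(MemorySegment index, long offset) {
    return ByteVector.fromMemorySegment(SPECIES, index, offset, ByteOrder.LITTLE_ENDIAN);
  }
}
```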

@benwtrent
Member

@mccullocht maybe we only do the byte part of the comparisons off-heap, then apply the corrections all on heap? I would assume applying corrections is pretty cheap; even if it isn't, applying them in bulk on heap might be pretty fast.
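To make the split concrete, a rough sketch of that shape is below. The correction formula and names here are invented placeholders for illustration, not the actual Lucene104 corrective terms: raw uint8 dot products are computed off-heap in bulk, then a second pass walks heap-resident correction arrays.

```java
// Hypothetical sketch of "bulk off-heap dots, then on-heap corrections".
// applyCorrection() and the correction arrays are illustrative placeholders,
// not the real Lucene104 corrective terms.
final class BulkCorrectionSketch {
  static void scoreBulk(
      int[] rawDots,          // raw uint8 dot products, already computed off-heap
      float[] docCorrections, // per-document correction terms, resident on heap
      float queryCorrection,
      float[] scoresOut) {
    for (int i = 0; i < rawDots.length; i++) {
      // a simple scalar loop over heap arrays; the point is that the
      // correction pass never touches the memory segment
      scoresOut[i] = applyCorrection(rawDots[i], docCorrections[i], queryCorrection);
    }
  }

  // placeholder: stands in for whatever the quantization format actually requires
  private static float applyCorrection(int rawDot, float docCorrection, float queryCorrection) {
    return rawDot * queryCorrection + docCorrection;
  }
}
```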

@mccullocht
Contributor Author

I did move just the vector dot product off-heap and I'm not planning to do anything clever with the corrections. I'm not sure that would pay off anyway -- you'd have to transpose from row view to column view to parallelize that work, and it would be 128-bit on x86 which may not go well.

I was assuming that accessing the corrective terms was hurting performance, but larger JFR stacks point at a more mysterious culprit. This PR spends more time in lane reduction (???) and in 128-bit loads from the memory segment (probably memory latency). For the latter case it's weird that I don't see anything comparable in the baseline, when I know that copying to the heap should be inducing a similar hit.

baseline:

36.66% 12745
  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [Inlined code]

25.90% 9005
  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [JIT compiled code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [JIT compiled code]

8.64% 3005
  jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
  at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]

candidate:

33.93% 11848
  jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
  at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [Inlined code]

23.97% 8369
  jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
  at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [JIT compiled code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [JIT compiled code]

13.33% 4655
  jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
  at jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegment() [Inlined code]
  at jdk.incubator.vector.ByteVector#fromMemorySegment0Template() [Inlined code]
  at jdk.incubator.vector.Byte128Vector#fromMemorySegment0() [Inlined code]
  at jdk.incubator.vector.ByteVector#fromMemorySegment() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader#load() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
@benwtrent
Member

@mccullocht you might find this interesting: #15272

@mccullocht
Contributor Author

I plan to try the simplest thing first and just copy the dot product code for byte[] x MemorySegment to see if that yields an improvement, then go from there. I hoped that the JVM would monomorphize these calls but I guess not.
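For reference, a dedicated byte[] x MemorySegment kernel along those lines might look roughly like the sketch below. The class name, species choices, and the single-step B2I widening are my own simplifications, not this PR's code (Lucene's real dotProductBody512 is more elaborate); the idea is that giving the MemorySegment load its own copy of the loop keeps that call site monomorphic.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class Uint8DotProductSketch {
  private static final VectorSpecies<Byte> BYTE_SPECIES = ByteVector.SPECIES_128;
  private static final VectorSpecies<Integer> INT_SPECIES = IntVector.SPECIES_512;

  // Query bytes live on heap; document bytes live in the mmapped segment.
  static int uint8DotProduct(byte[] q, MemorySegment doc, long docOffset, int dims) {
    IntVector acc = IntVector.zero(INT_SPECIES);
    int i = 0;
    int bound = BYTE_SPECIES.loopBound(dims);
    for (; i < bound; i += BYTE_SPECIES.length()) {
      ByteVector vq = ByteVector.fromArray(BYTE_SPECIES, q, i);
      ByteVector vd =
          ByteVector.fromMemorySegment(BYTE_SPECIES, doc, docOffset + i, ByteOrder.LITTLE_ENDIAN);
      // zero-extend the unsigned bytes to 32-bit lanes before multiplying
      IntVector iq = (IntVector) vq.convertShape(VectorOperators.ZERO_EXTEND_B2I, INT_SPECIES, 0);
      IntVector id = (IntVector) vd.convertShape(VectorOperators.ZERO_EXTEND_B2I, INT_SPECIES, 0);
      acc = acc.add(iq.mul(id));
    }
    int sum = acc.reduceLanes(VectorOperators.ADD); // the reduction the profiles point at
    for (; i < dims; i++) { // scalar tail
      sum += (q[i] & 0xFF) * (doc.get(ValueLayout.JAVA_BYTE, docOffset + i) & 0xFF);
    }
    return sum;
  }
}
```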

@mccullocht
Contributor Author

Ok, I repeated the experiment with a dedicated byte[] x MemorySegment implementation. In the luceneutil benchmarks I'm not suffering from the same inlining/pollution issues. The profiles look the same as before: reduceLanes() suddenly becomes very expensive. I have not tried this on other hardware (e.g. a Mac) yet.
