
Conversation

@mccullocht
Contributor

@mccullocht mccullocht commented Sep 29, 2025

Partial implementation of #15155

So far this is not any faster than the alternative. On an AMD RYZEN AI MAX+ 395:

baseline:
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.913   1.635        1.630   0.997        1000000  100   100     32       250        8 bits     6824     0.00      Infinity      0.04            1             3759.67         3677.368      747.681      HNSW

candidate:
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.913   1.671        1.661   0.994        1000000  100   100     32       250        8 bits     6824     0.00      Infinity      0.04            1             3759.67         3677.368      747.681      HNSW

DO NOT MERGE
Performance observations: on an AVX-512 host the profiles are quite different. The original path spends most of its time in dotProductBody512(), followed by Int512Vector.reduceLanes(). The new path spends much more time in reduceLanes(), but also spends more time loading the input vectors -- a 128-bit load from a MemorySegment instead of from a heap array. This could be memory latency, but in that case why doesn't the load into the heap array show up in the profile?
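For orientation, the two paths differ mainly in where each 128-bit chunk of the document vector is loaded from. A minimal sketch of that difference, with illustrative names rather than the actual Lucene code:

```java
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

class LoadPaths {
  static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

  // Baseline-style: the doc vector was first copied into a heap array,
  // then each 128-bit chunk is loaded from that array.
  static ByteVector loadFromHeap(byte[] docCopy, int offset) {
    return ByteVector.fromArray(SPECIES, docCopy, offset);
  }

  // Candidate-style: skip the copy and load each 128-bit chunk directly
  // from the mmapped index data.
  static ByteVector loadOffHeap(MemorySegment index, long offset) {
    return ByteVector.fromMemorySegment(SPECIES, index, offset, ByteOrder.LITTLE_ENDIAN);
  }
}
```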

@benwtrent
Member

@mccullocht maybe we only do the byte part of the comparisons off-heap, then apply the corrections all on heap? I would assume applying corrections is pretty cheap; even if it isn't, applying them in bulk on heap might be pretty fast.
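To make the split concrete, a rough sketch of that shape is below. The correction formula and names here are invented placeholders for illustration, not the actual Lucene104 corrective terms: raw uint8 dot products are computed off-heap in bulk, then a second pass walks heap-resident correction arrays.

```java
// Hypothetical sketch of "bulk off-heap dots, then on-heap corrections".
// applyCorrection() and the correction arrays are illustrative placeholders,
// not the real Lucene104 corrective terms.
final class BulkCorrectionSketch {
  static void scoreBulk(
      int[] rawDots,          // raw uint8 dot products, already computed off-heap
      float[] docCorrections, // per-document correction terms, resident on heap
      float queryCorrection,
      float[] scoresOut) {
    for (int i = 0; i < rawDots.length; i++) {
      // a simple scalar loop over heap arrays; the point is that the
      // correction pass never touches the memory segment
      scoresOut[i] = applyCorrection(rawDots[i], docCorrections[i], queryCorrection);
    }
  }

  // placeholder: stands in for whatever the quantization format actually requires
  private static float applyCorrection(int rawDot, float docCorrection, float queryCorrection) {
    return rawDot * queryCorrection + docCorrection;
  }
}
```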

@mccullocht
Contributor Author

I did move just the vector dot product off-heap and I'm not planning to do anything clever with the corrections. I'm not sure that would pay off anyway -- you'd have to transpose from row view to column view to parallelize that work, and it would be 128-bit on x86 which may not go well.

I was assuming that accessing the corrective terms was hurting performance, but larger JFR stacks point at a more mysterious culprit. This PR spends more time in lane reduction (???) and in 128-bit loads from the memory segment (probably memory latency). For the latter case it's weird that I don't see anything comparable in the baseline, when I know that copying to the heap should be inducing a similar hit.

baseline:

36.66% 12745
  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [Inlined code]

25.90% 9005
  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [JIT compiled code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [JIT compiled code]

8.64% 3005
  jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
  at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
  at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]

candidate:

33.93% 11848
  jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
  at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [Inlined code]

23.97% 8369
  jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
  at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
  at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
  at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [JIT compiled code]
  at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [JIT compiled code]

13.33% 4655
  jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
  at jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegment() [Inlined code]
  at jdk.incubator.vector.ByteVector#fromMemorySegment0Template() [Inlined code]
  at jdk.incubator.vector.Byte128Vector#fromMemorySegment0() [Inlined code]
  at jdk.incubator.vector.ByteVector#fromMemorySegment() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader#load() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
  at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
  at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
@benwtrent
Member

@mccullocht you might find this interesting: #15272

@mccullocht
Contributor Author

I plan to try the simplest thing first and just copy the dot product code for byte[] x MemorySegment to see if that yields an improvement, then go from there. I hoped that the JVM would monomorphize these calls but I guess not.
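For reference, a dedicated byte[] x MemorySegment kernel along those lines might look roughly like the sketch below. The class name, species choices, and the single-step B2I widening are my own simplifications, not this PR's code (Lucene's real dotProductBody512 is more elaborate); the idea is that giving the MemorySegment load its own copy of the loop keeps that call site monomorphic.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class Uint8DotProductSketch {
  private static final VectorSpecies<Byte> BYTE_SPECIES = ByteVector.SPECIES_128;
  private static final VectorSpecies<Integer> INT_SPECIES = IntVector.SPECIES_512;

  // Query bytes live on heap; document bytes live in the mmapped segment.
  static int uint8DotProduct(byte[] q, MemorySegment doc, long docOffset, int dims) {
    IntVector acc = IntVector.zero(INT_SPECIES);
    int i = 0;
    int bound = BYTE_SPECIES.loopBound(dims);
    for (; i < bound; i += BYTE_SPECIES.length()) {
      ByteVector vq = ByteVector.fromArray(BYTE_SPECIES, q, i);
      ByteVector vd =
          ByteVector.fromMemorySegment(BYTE_SPECIES, doc, docOffset + i, ByteOrder.LITTLE_ENDIAN);
      // zero-extend the unsigned bytes to 32-bit lanes before multiplying
      IntVector iq = (IntVector) vq.convertShape(VectorOperators.ZERO_EXTEND_B2I, INT_SPECIES, 0);
      IntVector id = (IntVector) vd.convertShape(VectorOperators.ZERO_EXTEND_B2I, INT_SPECIES, 0);
      acc = acc.add(iq.mul(id));
    }
    int sum = acc.reduceLanes(VectorOperators.ADD); // the reduction the profiles point at
    for (; i < dims; i++) { // scalar tail
      sum += (q[i] & 0xFF) * (doc.get(ValueLayout.JAVA_BYTE, docOffset + i) & 0xFF);
    }
    return sum;
  }
}
```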

@mccullocht
Contributor Author

Ok, I repeated the experiment with a dedicated byte[] x MemorySegment implementation. In the luceneutil benchmarks I'm not suffering from the same inlining/pollution issues. The profiles look the same as before: reduceLanes() suddenly becomes very expensive. I have not tried this on other hardware (e.g. a Mac) yet.
