
@kaivalnp (Contributor) commented Oct 1, 2025

I was trying to figure out why #14863 adversely affected indexing performance of byte-quantized vectors in nightly benchmarks (see #14863 (comment)), when it was supposed to speed things up by doing vector computations off-heap -- and may have found something interesting!

- Pollute VectorScorerBenchmark with on-heap dot products
- Add benchmark function with compiler directives that work around the pollution
@kaivalnp (Contributor, Author) commented Oct 1, 2025

We use the same underlying function for a variety of combinations of on-heap and off-heap vectors -- which may lead to its call site being polluted, and the compiled code being sub-optimal.

To demonstrate this, I added some type pollution to VectorScorerBenchmark
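To illustrate the idea, here is a hypothetical sketch (assumed shape, not the exact benchmark change) of a method added to the benchmark class: it scores on-heap byte[] vectors through the same entry point that the measured MemorySegment benchmark uses, so the JIT's type profile at that call site sees both loader implementations.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.lucene.util.VectorUtil;

// Hypothetical sketch of the "pollution": exercise the shared dot-product
// entry point with on-heap byte[] inputs so its type profile becomes bimorphic.
static void polluteDotProductCallSite(int size) {
  byte[] a = new byte[size];
  byte[] b = new byte[size];
  ThreadLocalRandom.current().nextBytes(a);
  ThreadLocalRandom.current().nextBytes(b);
  for (int i = 0; i < 10_000; i++) {
    VectorUtil.dotProduct(a, b); // on-heap path; off-heap scoring shares the underlying kernel
  }
}
```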

Benchmark without pollution:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  7.543 ± 0.062  ops/us

Benchmark with pollution:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  6.327 ± 0.057  ops/us
@kaivalnp (Contributor, Author) commented Oct 1, 2025

On printing some JVM inlining internals, I came across entries like:

 @ 35   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::length (6 bytes)   inline (hot)
        callee changed to org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::length (13 bytes)   inline (hot)
        callee changed to org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
         \-> TypeProfile (1028/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$MemorySegmentLoader
        callee changed to org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
         \-> TypeProfile (42488/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$ArrayLoader

...so I turned off JVM method inlining for those internal classes using some compiler directives (so that their callers are compiled separately, into more type-appropriate methods).
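The directives are presumably equivalent to dontinline CompileCommands for the two loader classes -- these are the same flags used in the end-to-end run further below:

```
-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::*
-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::*
```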

Here are the results:

Benchmark                                                          (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                         1024  thrpt   15  6.374 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductMemSegWithVectorDirectives     1024  thrpt   15  7.401 ± 0.005  ops/us

Looks like we can regain most of the performance drop from call site pollution!

@kaivalnp (Contributor, Author) commented Oct 1, 2025

Long-term, we probably want to rewrite PanamaVectorUtilSupport so that these directives aren't needed?

In the short-term, how can we make users aware of this issue / workaround?
For example, the directives probably need to be added to files like this (for OpenSearch) or this (for Elasticsearch).

@kaivalnp (Contributor, Author) commented Oct 2, 2025

Also wanted to note: we fixed a similar JVM inlining issue in #14874 -- and the PanamaVectorUtilSupport$ArrayLoader and PanamaVectorUtilSupport$MemorySegmentLoader classes referenced in this PR were actually added there^

I undid those changes locally (moving away from ByteVectorLoader, wrapping the byte[] in a MemorySegment, and having all functions work on two MemorySegments -- a rough sketch follows after the results), but that degrades performance further:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  1.592 ± 0.065  ops/us

i.e. the above PR was net-net positive, but left some more performance to be reclaimed by sidestepping call site pollution (which we hope to do here)
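For context, a minimal sketch of that reverted shape (assumed, not the exact local change; the scalar loop stands in for the Panama-vectorized kernel): wrap the on-heap array with MemorySegment.ofArray and route everything through a segment-only body.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Assumed shape of the reverted approach: adapt an on-heap byte[] query to a
// MemorySegment-only dot-product kernel by wrapping it in a heap-backed segment.
static int dotProductViaSegments(byte[] query, MemorySegment docVector) {
  MemorySegment querySegment = MemorySegment.ofArray(query);
  int sum = 0;
  for (int i = 0; i < query.length; i++) { // scalar stand-in for the vectorized body
    sum += querySegment.get(ValueLayout.JAVA_BYTE, i) * docVector.get(ValueLayout.JAVA_BYTE, i);
  }
  return sum;
}
```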

@msokolov (Contributor) commented Oct 2, 2025

It's annoying that the JVM only allows this kind of control via command-line args rather than annotations that can be placed in code. In fact, there are such annotations (@ForceInline and @DontInline), but they are not available to lowly users -- only to JVM-internal code.

@mccullocht (Contributor) commented:

It's possible that wrapping byte[] in MemorySegment is just moving the inlining issue around. I tried something similar in my PR for off-heap scoring and got similar results.

It might be best to use a script to generate code for the 3 different input cases.

@mccullocht (Contributor) commented:

I tried adding a ByteVectorLoader implementation that stores both possible representations and switches between them in the loop (think if (segment != null) { <segment stuff> } else { <array stuff> }). This was not successful.
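A rough sketch of that union representation (assumed shape, not the actual patch; the real ByteVectorLoader presumably loads whole SIMD lanes rather than single bytes):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Assumed shape of the "union" loader: hold both representations and branch on
// whichever one is present inside the hot loop.
final class UnionByteVectorLoader {
  private final byte[] array;          // non-null when the vector is on-heap
  private final MemorySegment segment; // non-null when the vector is off-heap

  UnionByteVectorLoader(byte[] array, MemorySegment segment) {
    this.array = array;
    this.segment = segment;
  }

  byte get(int i) {
    return segment != null ? segment.get(ValueLayout.JAVA_BYTE, i) : array[i];
  }
}
```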

With pollution:

Benchmark                                      (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  25.176 ± 2.890  ops/us

Without pollution:

Benchmark                                      (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  32.198 ± 0.077  ops/us

With union vector representation:

Benchmark                                      (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  28.138 ± 0.108  ops/us

I also tried sealing ByteVectorLoader and switching on the class type (rough sketch below). On most runs this is faster than the union vector representation, but on some JVM runs it is ~10x slower :/
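A sketch of that last variant (again an assumed shape, loading single bytes where the real code would load SIMD lanes; the class name is made up):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Assumed shape of the "sealed + switch" experiment: seal the loader hierarchy
// and dispatch on the concrete type with an exhaustive pattern switch.
public class SealedLoaderSketch {
  sealed interface Loader permits ArrayLoader, SegmentLoader {
    byte get(int i);
  }

  record ArrayLoader(byte[] array) implements Loader {
    public byte get(int i) { return array[i]; }
  }

  record SegmentLoader(MemorySegment segment) implements Loader {
    public byte get(int i) { return segment.get(ValueLayout.JAVA_BYTE, i); }
  }

  static byte load(Loader loader, int i) {
    return switch (loader) { // exhaustive over the sealed permits, no default needed
      case ArrayLoader a -> a.get(i);
      case SegmentLoader s -> s.get(i);
    };
  }
}
```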

@kaivalnp (Contributor, Author) commented Oct 6, 2025

Turns out these compiler directives make things worse in actual benchmarks (Cohere vectors, 768d):

baseline

recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.915   1.531        1.523   0.995        200000  100   50      32       250        8 bits     4583     12.41     16113.44      18.41           1             751.21          735.474       149.536      HNSW

candidate (baseline + the following JVM arguments)

-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::*
-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::*
recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.917   9.765        9.745   0.998        200000  100   50      32       250        8 bits     4592     13.03     15346.84      96.83           1             751.29          735.474       149.536      HNSW