
@kaivalnp (Contributor) commented Oct 1, 2025

I was trying to figure out why #14863 adversely affected indexing performance of byte-quantized vectors in nightly benchmarks (see #14863 (comment)), when it was supposed to speed things up by doing vector computations off-heap -- and may have found something interesting!

- Pollute VectorScorerBenchmark with on-heap dot products
- Add benchmark function with compiler directives that work around the pollution
@kaivalnp (Contributor, Author) commented Oct 1, 2025

We use the same underlying function for a variety of combinations of on-heap and off-heap vectors -- which may lead to its call site being polluted, and the compiled code being sub-optimal.

To demonstrate this, I added some type pollution to VectorScorerBenchmark
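To illustrate the idea, here is a hypothetical sketch (assumed shape, not the exact benchmark change) of a method added to the benchmark class: it scores on-heap byte[] vectors through the same entry point that the measured MemorySegment benchmark uses, so the JIT's type profile at that call site sees both loader implementations.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.lucene.util.VectorUtil;

// Hypothetical sketch of the "pollution": exercise the shared dot-product
// entry point with on-heap byte[] inputs so its type profile becomes bimorphic.
static void polluteDotProductCallSite(int size) {
  byte[] a = new byte[size];
  byte[] b = new byte[size];
  ThreadLocalRandom.current().nextBytes(a);
  ThreadLocalRandom.current().nextBytes(b);
  for (int i = 0; i < 10_000; i++) {
    VectorUtil.dotProduct(a, b); // on-heap path; off-heap scoring shares the underlying kernel
  }
}
```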

Benchmark without pollution:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  7.543 ± 0.062  ops/us

Benchmark with pollution:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  6.327 ± 0.057  ops/us
@kaivalnp (Contributor, Author) commented Oct 1, 2025

On printing some JVM inlining internals, I came across entries like:

 @ 35   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::length (6 bytes)   inline (hot)
        callee changed to org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::length (13 bytes)   inline (hot)
        callee changed to org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
         \-> TypeProfile (1028/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$MemorySegmentLoader
        callee changed to org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
         \-> TypeProfile (42488/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$ArrayLoader

...so I turned off JVM method inlining for those internal classes using some compiler directives (so that their callers are compiled separately, into more type-appropriate methods).
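The directives are presumably equivalent to dontinline CompileCommands for the two loader classes -- these are the same flags used in the end-to-end run further below:

```
-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::*
-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::*
```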

Here are the results:

Benchmark                                                          (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                         1024  thrpt   15  6.374 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductMemSegWithVectorDirectives     1024  thrpt   15  7.401 ± 0.005  ops/us

Looks like we can regain most of the performance drop from call site pollution!

@kaivalnp (Contributor, Author) commented Oct 1, 2025

Long-term, we probably want to rewrite PanamaVectorUtilSupport so that these directives aren't needed?

In the short-term, how can we make users aware of this issue / workaround?
For example, the directives probably need to be added to files like this (for OpenSearch) or this (for Elasticsearch).

@kaivalnp (Contributor, Author) commented Oct 2, 2025

Also wanted to note: we fixed a similar JVM inlining issue in #14874 -- and the PanamaVectorUtilSupport$ArrayLoader and PanamaVectorUtilSupport$MemorySegmentLoader classes referenced in this PR were actually added there^

I undid those changes locally (moving away from ByteVectorLoader, wrapping the byte[] in a MemorySegment, and having all functions work on two MemorySegments -- a rough sketch follows after the results), but that degrades performance further:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  1.592 ± 0.065  ops/us

i.e. the above PR was net-net positive, but left some more performance to be reclaimed by sidestepping call site pollution (which we hope to do here)
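For context, a minimal sketch of that reverted shape (assumed, not the exact local change; the scalar loop stands in for the Panama-vectorized kernel): wrap the on-heap array with MemorySegment.ofArray and route everything through a segment-only body.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Assumed shape of the reverted approach: adapt an on-heap byte[] query to a
// MemorySegment-only dot-product kernel by wrapping it in a heap-backed segment.
static int dotProductViaSegments(byte[] query, MemorySegment docVector) {
  MemorySegment querySegment = MemorySegment.ofArray(query);
  int sum = 0;
  for (int i = 0; i < query.length; i++) { // scalar stand-in for the vectorized body
    sum += querySegment.get(ValueLayout.JAVA_BYTE, i) * docVector.get(ValueLayout.JAVA_BYTE, i);
  }
  return sum;
}
```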

@msokolov (Contributor) commented Oct 2, 2025

It's annoying that the JVM only allows this kind of control via command-line args rather than annotations that can be placed in code. In fact, there are such annotations (@ForceInline and @DontInline), but they are not available to lowly users -- only to JVM-internal code.

@mccullocht (Contributor) commented:

It's possible that wrapping byte[] in MemorySegment is just moving the inlining issue around. I tried something similar in my PR for off-heap scoring and got similar results.

It might be best to use a script to generate code for the 3 different input cases.

@mccullocht (Contributor) commented:

I tried adding a ByteVectorLoader implementation that stores both possible representations and switches between them in the loop (think if (segment != null) { <segment stuff> } else { <array stuff> }). This was not successful.
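A rough sketch of that union representation (assumed shape, not the actual patch; the real ByteVectorLoader presumably loads whole SIMD lanes rather than single bytes):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Assumed shape of the "union" loader: hold both representations and branch on
// whichever one is present inside the hot loop.
final class UnionByteVectorLoader {
  private final byte[] array;          // non-null when the vector is on-heap
  private final MemorySegment segment; // non-null when the vector is off-heap

  UnionByteVectorLoader(byte[] array, MemorySegment segment) {
    this.array = array;
    this.segment = segment;
  }

  byte get(int i) {
    return segment != null ? segment.get(ValueLayout.JAVA_BYTE, i) : array[i];
  }
}
```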

With pollution:

Benchmark                                      (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  25.176 ± 2.890  ops/us

Without pollution:

Benchmark                                      (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  32.198 ± 0.077  ops/us

With union vector representation:

Benchmark                                      (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt   15  28.138 ± 0.108  ops/us

I also tried sealing ByteVectorLoader and switching on the class type (rough sketch below). On most runs this is faster than the union vector representation, but on some JVM runs it is ~10x slower :/
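A sketch of that last variant (again an assumed shape, loading single bytes where the real code would load SIMD lanes; the class name is made up):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Assumed shape of the "sealed + switch" experiment: seal the loader hierarchy
// and dispatch on the concrete type with an exhaustive pattern switch.
public class SealedLoaderSketch {
  sealed interface Loader permits ArrayLoader, SegmentLoader {
    byte get(int i);
  }

  record ArrayLoader(byte[] array) implements Loader {
    public byte get(int i) { return array[i]; }
  }

  record SegmentLoader(MemorySegment segment) implements Loader {
    public byte get(int i) { return segment.get(ValueLayout.JAVA_BYTE, i); }
  }

  static byte load(Loader loader, int i) {
    return switch (loader) { // exhaustive over the sealed permits, no default needed
      case ArrayLoader a -> a.get(i);
      case SegmentLoader s -> s.get(i);
    };
  }
}
```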

@kaivalnp (Contributor, Author) commented Oct 6, 2025

Turns out these compiler directives make things worse in actual benchmarks (Cohere vectors, 768d):

baseline

recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.915   1.531        1.523   0.995        200000  100   50      32       250        8 bits     4583     12.41     16113.44      18.41           1             751.21          735.474       149.536      HNSW

candidate (baseline + the following JVM arguments)

-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::*
-XX:CompileCommand=dontinline,org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::*
recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.917   9.765        9.745   0.998        200000  100   50      32       250        8 bits     4592     13.03     15346.84      96.83           1             751.29          735.474       149.536      HNSW