Conversation

@kaivalnp
Contributor

@kaivalnp kaivalnp commented Jun 30, 2025

Description

In #14863, I noticed a regression in computing vector scores when the query vector is on heap (i.e. wrapped using MemorySegment.ofArray) but the doc vector comes from the mmap-ed index input (off-heap)

We do have a JMH benchmark for off-heap scoring performance -- but it only covers the indexing case (using an UpdateableRandomVectorScorer), where both vectors come from the mmap-ed input (off-heap)

I also added benchmark functions to demonstrate the searching case (using a RandomVectorScorer)
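
For reference, a minimal sketch (not the PR's actual code) of the two ways a query byte[] can reach the MemorySegment-based scorer -- the on-heap wrapper is what the searching case exercises on main, and the off-heap copy is the approach explored in this PR (see Arena.ofAuto().allocateFrom later in this thread):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

final class QuerySegmentSketch {
  // On-heap: zero-copy view backed by the Java array.
  static MemorySegment wrapOnHeap(byte[] query) {
    return MemorySegment.ofArray(query);
  }

  // Off-heap: allocate a native segment and copy the query into it once per
  // query; the automatic arena leaves deallocation to the garbage collector.
  static MemorySegment copyOffHeap(byte[] query) {
    return Arena.ofAuto().allocateFrom(JAVA_BYTE, query);
  }
}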

main:

Benchmark                                               (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15   2.299 ± 0.017  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15   7.533 ± 0.107  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15   2.332 ± 0.017  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15   2.069 ± 0.069  ops/us

This PR:

Benchmark                                               (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15   2.295 ± 0.012  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15   7.551 ± 0.027  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15   2.341 ± 0.019  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15   4.241 ± 0.064  ops/us

We see a ~2x improvement in vector scoring throughput! (and on main, the off-heap searching case is actually slower than the on-heap default?)
I'm not sure if this is specific to my machine, or a general issue..

Pasting CPU details in case it is relevant:

Architecture:          aarch64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             1
NUMA node(s):          1
Vendor ID:             ARM
Model:                 1
Stepping:              r1p1
BogoMIPS:              2100.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              1024K
L3 cache:              32768K
NUMA node0 CPU(s):     0-63
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng

and java --version:

openjdk 24.0.1 2025-04-15
OpenJDK Runtime Environment Corretto-24.0.1.9.1 (build 24.0.1+9-FR)
OpenJDK 64-Bit Server VM Corretto-24.0.1.9.1 (build 24.0.1+9-FR, mixed mode, sharing)
@github-actions
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp
Contributor Author

kaivalnp commented Jun 30, 2025

Here's another set of benchmarks on an x86 machine (results are similar to aarch64 results above^)

main:

Benchmark                                               (size)   Mode  Cnt    Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15    2.129 ± 0.006  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15   10.411 ± 0.065  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15    2.130 ± 0.006  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15    2.215 ± 0.071  ops/us

This PR:

Benchmark                                               (size)   Mode  Cnt    Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15    2.133 ± 0.004  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15   10.659 ± 0.106  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15    2.130 ± 0.008  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15    4.317 ± 0.317  ops/us

CPU details:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3043.199
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-47
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

and java --version:

openjdk 24.0.1 2025-04-15
OpenJDK Runtime Environment Corretto-24.0.1.9.1 (build 24.0.1+9-FR)
OpenJDK 64-Bit Server VM Corretto-24.0.1.9.1 (build 24.0.1+9-FR, mixed mode, sharing)
@msokolov
Contributor

wow, interesting, nice find! If you have time, it would also be great to see some knnPerfTest.py results comparing before/after this change, in addition to the microbenchmarking, just to make extra sure

@kaivalnp
Contributor Author

kaivalnp commented Jul 1, 2025

see some knnPerfTest.py results comparing before/after

I tried this on 300d byte vectors generated using:

./gradlew vectors-300

..and the results are strange!


main without jdk.incubator.vector:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.834 0.507 0.498 0.982 100000 100 50 64 250 no 23.20 4310.16 0.00 1 31.91 114.441 114.441 HNSW

main with jdk.incubator.vector:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.835 0.449 0.441 0.982 100000 100 50 64 250 no 15.36 6511.69 0.00 1 31.91 114.441 114.441 HNSW

This PR without jdk.incubator.vector:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.839 0.524 0.516 0.985 100000 100 50 64 250 no 22.71 4402.57 0.00 1 31.93 114.441 114.441 HNSW

This PR with jdk.incubator.vector:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.834 0.832 0.825 0.992 100000 100 50 64 250 no 15.43 6479.20 0.00 1 31.92 114.441 114.441 HNSW

We see ~85% regression in the vectorized implementation after making these changes


I understand there is a cost to allocate a native MemorySegment, copy the query vector, and GC overhead, but does the regression seem a bit too high?

Attaching the flame graph comparison as well..
[screenshot: flame graph comparison]

Any leads to debug would be appreciated!

@kaivalnp
Contributor Author

kaivalnp commented Jul 1, 2025

One possible explanation (from the memory profiler) for dot products being expensive is that the conversion from 8 bytes to integers is expensive:
[screenshot: memory profiler output]
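
For context, a minimal sketch (assuming the 256-bit species path; simplified, not the actual Lucene loop) of why the byte dot product involves that conversion: the Vector API has no 8-bit multiply-accumulate, so each group of byte lanes is widened to int lanes before multiplying.

// Requires --add-modules jdk.incubator.vector; assumes a.length == b.length.
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.VectorOperators;

final class ByteDotProductSketch {
  static int dotProduct(byte[] a, byte[] b) {
    IntVector acc = IntVector.zero(IntVector.SPECIES_256);
    int i = 0;
    int bound = ByteVector.SPECIES_64.loopBound(a.length);
    for (; i < bound; i += ByteVector.SPECIES_64.length()) {
      ByteVector va = ByteVector.fromArray(ByteVector.SPECIES_64, a, i);
      ByteVector vb = ByteVector.fromArray(ByteVector.SPECIES_64, b, i);
      // widen 8 byte lanes to 8 int lanes (this is the byte-to-int conversion)
      Vector<Integer> va32 = va.convertShape(VectorOperators.B2I, IntVector.SPECIES_256, 0);
      Vector<Integer> vb32 = vb.convertShape(VectorOperators.B2I, IntVector.SPECIES_256, 0);
      acc = acc.add(va32.mul(vb32));
    }
    int res = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) { // scalar tail
      res += a[i] * b[i];
    }
    return res;
  }
}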

@msokolov
Contributor

msokolov commented Jul 3, 2025

I would be curious to see if the results hold up for 768d vectors. Maybe 300 not being divisible by 8 is problematic, although I don't believe it. But ... the results look so weird, I just want to try some other random thing and see if it holds up?

@kaivalnp
Contributor Author

kaivalnp commented Jul 3, 2025

if the results hold up for 768d vectors

Thanks @msokolov I quantized the Cohere 768d vectors by:

  1. normalizing
  2. scaling each dimension by 256
  3. clipping between [-128, 127]

i.e. basically replicating this snippet
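
As a rough Java illustration of those three steps (my own reconstruction, not the referenced snippet):

// Hypothetical re-creation of the quantization above: normalize, scale by 256, clip to [-128, 127].
static byte[] quantize(float[] vector) {
  double norm = 0;
  for (float v : vector) {
    norm += v * v;
  }
  norm = Math.sqrt(norm);
  byte[] out = new byte[vector.length];
  for (int i = 0; i < vector.length; i++) {
    long scaled = Math.round((vector[i] / norm) * 256);     // normalize + scale
    out[i] = (byte) Math.max(-128, Math.min(127, scaled));  // clip
  }
  return out;
}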

But something really strange is happening!

I first ran a set of benchmarks using -reindex to create fresh indices and ensure that the indexing times are not adversely affected:

main (run 1)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.940 1.568 1.567 1.000 200000 100 50 64 250 no 63.52 3148.47 23.21 1 152.11 585.938 585.938 HNSW

This PR (run 2)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.939 2.894 2.894 1.000 200000 100 50 64 250 no 37.69 5306.87 12.99 1 152.09 585.938 585.938 HNSW

This felt a bit strange because of a higher latency at query time (conflicting with JMH results) + lower indexing time in this PR -- so I ran a follow-up without -reindex to search from a built index (basically re-using the index created in run 2 for runs 3 and 4):

main (run 3)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.939 1.602 1.601 0.999 200000 100 50 64 250 no 0.00 Infinity 0.12 1 152.09 585.938 585.938 HNSW

This PR (run 4)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.939 1.090 1.090 1.000 200000 100 50 64 250 no 0.00 Infinity 0.12 1 152.09 585.938 585.938 HNSW

The performance in main is unchanged -- but we see significant improvement after moving the query off-heap!
I don't have a good explanation for this, I'll try to replicate on 300d vectors to see if this holds true!

Meanwhile, it'd be great if you could reproduce the benchmarks like I did :)

@kaivalnp
Contributor Author

kaivalnp commented Jul 3, 2025

Okay I ran things slightly differently for 300d vectors. All runs are without -reindex, but I'm deleting the index between runs of main and this PR to create a fresh one

main (run 1, no pre-existing index)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.808 0.635 0.635 1.000 200000 100 50 64 250 no 38.55 5188.07 0.00 1 64.00 228.882 228.882 HNSW

main (run 2, use same index as run 1)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.808 0.635 0.634 0.999 200000 100 50 64 250 no 0.00 Infinity 0.12 1 64.00 228.882 228.882 HNSW

This PR (run 3, no pre-existing index)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.809 0.977 0.978 1.001 200000 100 50 64 250 no 40.11 4985.91 0.00 1 63.95 228.882 228.882 HNSW

This PR (run 4, use same index as run 3)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.809 0.491 0.491 1.000 200000 100 50 64 250 no 0.00 Infinity 0.12 1 63.95 228.882 228.882 HNSW

The results seem consistent with what we see above, the changes in this PR cause a regression if used immediately after indexing, but a speedup if used on an existing index?


Edit: The indexing time did not improve when starting fresh -- so I kept the old index (from run 4) around and ran another benchmark (run 5) on this PR using -reindex:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.809 1.626 1.626 1.000 200000 100 50 64 250 no 24.59 8133.39 0.00 1 64.00 228.882 228.882 HNSW

..we see faster indexing times! (but the slow query-time issue still persists)

I ran a similar setup on main as well -- kept the old index (from run 5) around and ran another benchmark (run 6) using -reindex:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.806 0.477 0.477 0.999 200000 100 50 64 250 no 24.50 8163.60 0.00 1 63.98 228.882 228.882 HNSW

..and both query / indexing times have improved!

Maybe it's something related to OS caching, I'm still confused :)

@msokolov
Contributor

msokolov commented Jul 9, 2025

I was able to reproduce the weird behavior where the first run produces higher latency than subsequent runs. I also noticed netCPU and avgCpuCount were both negative in that first run?? I'm suspecting a bug in KnnGraphTester that may be related to having run the indexing (or maybe having computed the exact knn results) in the same run?

@kaivalnp
Contributor Author

kaivalnp commented Jul 9, 2025

I also noticed netCPU and avgCpuCount were both negative in that first run??

Yeah I noticed this too! I think it was because of computing exact KNN results, I didn't see the same issue on re-indexing..

@msokolov
Contributor

msokolov commented Jul 9, 2025

My weird results (w/ 768d Cohere vectors). In sum, it looks to me as if we have some noise immediately after reindexing. Either this is a measurement artifact in luceneutil or it is a startup transient related to page caching / preloading / madvise / something. But after things settle down this is clearly a big improvement and we should merge, while also trying to understand this measurement problem -- which is pre-existing and should not block this change.

run 1

baseline

first line does reindex -- and then latency and netCPU increase? This has nothing to do with this PR.

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited
0.987 17.861 17.850 0.999 500000 100 50 32 250 no 25663
0.987 25.272 25.256 0.999 500000 100 50 32 250 no 25663
0.987 24.355 24.340 0.999 500000 100 50 32 250 no 25663

candidate

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited
0.987 13.261 13.251 0.999 500000 100 50 32 250 no 25663
0.987 13.366 13.356 0.999 500000 100 50 32 250 no 25663

run 2

candidate (reindex each time)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited
0.983 57.972 57.959 1.000 500000 100 50 32 250 no 25648
0.980 55.219 55.205 1.000 500000 100 50 32 250 no 25802
0.986 93.766 93.752 1.000 500000 100 50 32 250 no 25630

no reindex

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited
0.986 13.170 13.159 0.999 500000 100 50 32 250 no 25630
0.986 13.411 13.400 0.999 500000 100 50 32 250 no 25630
0.986 13.535 13.523 0.999 500000 100 50 32 250 no 25630

baseline (no reindex)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited
0.986 26.048 26.033 0.999 500000 100 50 32 250 no 25630
0.986 25.954 25.938 0.999 500000 100 50 32 250 no 25630
@msokolov
Contributor

msokolov commented Jul 9, 2025

I tried another test:

  1. knnGraphTest.py with -reindex and -stats (no searching)
  2. knnGraphTest.py with no -reindex and -search

With the idea that if there was some kind of OS paging thing it would affect this in the same way as the normal -reindex + -search, but it did not: I saw the latency improvement (around ~13 ms). So this makes me think something happens in the JVM that is doing the indexing that predisposes it to have worse search performance? Some kind of hotspot something or other? Next I want to try looking at profiler, but it is a bit tricky since the index + search profile is going to be muddied by the indexing part

@msokolov
Contributor

I did some deep-diving with the profiler and realized that when indexing, we call these dotProduct methods in a different context, in which all of the vectors are on-heap. I'm surmising that this induces the hotspot compiler to make some choices that end up being bad later, when half of the vectors (the document ones) are off-heap; i.e. while searching. To test this theory I cloned dotProduct(MemorySegment, MemorySegment) and called the new dotProductWTF(MemorySegment, MemorySegment) from dotProduct(byte[], byte[]). This did indeed lead to even greater speedups (another 2x, astoundingly), although strangely, a small but significant drop in recall?? I'm now trying again with single-threaded merges to see if somehow increased lock contention during merging is leading to the recall drop. Hmm, actually I wonder if it's simply that there are a different number of segments being produced.

Separately, this whole line of attack made me wonder if maybe we ought to be storing our vectors differently while indexing ... instead of a List<byte[]> or List<float[]> we might prefer to preallocate a larger array and write to it, advancing a pointer as we go. If we did this we could also maybe preallocate a MemorySegment along with it to use for scoring??. I don't know what impact if any this might have on the whole hotspot discovery, but it is becoming clear to me that the way we manage memory can have a major impact on the performance here in nonobvious ways.
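
A hypothetical sketch of that idea (the FlatVectorBuffer name and shape are made up here, not a Lucene API): append raw vectors into one growing byte[] and expose a heap MemorySegment slice per vector for scoring.

import java.lang.foreign.MemorySegment;
import java.util.Arrays;

// Hypothetical: append-only flat storage for fixed-dimension byte vectors.
final class FlatVectorBuffer {
  private final int dim;
  private byte[] data;
  private int count;

  FlatVectorBuffer(int dim, int initialCapacity) {
    this.dim = dim;
    this.data = new byte[dim * initialCapacity];
  }

  void add(byte[] vector) {
    int offset = count * dim;
    if (offset + dim > data.length) {
      data = Arrays.copyOf(data, Math.max(data.length * 2, offset + dim));
    }
    System.arraycopy(vector, 0, data, offset, dim);
    count++;
  }

  // Heap-backed segment view of one stored vector, usable by a MemorySegment scorer.
  MemorySegment segment(int ord) {
    return MemorySegment.ofArray(data).asSlice((long) ord * dim, dim);
  }
}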

I'm also still somewhat confused about the status of the whole Panama vectorization code. I guess it is still incubating, so we have decided not to require it, and our code paths are conditional on whether it is enabled or not. I think this conditionalization itself may be problematic from a performance perspective, although this is more a suspicion than anything, but I wonder if we could just require users to enable the incubating API at this point? It is in its umpteenth incubation cycle, so it seems like it should be safe.

@msokolov
Contributor

Separately, I tried using the Arena.ofAuto().allocateFrom() construct in the on-heap case that is used during indexing and this made indexing incredibly slow. I guess it is because we force the allocation of many many small memory segments that have to be cleaned up by the garbage collector. Possibly during search, when the memory is off-heap, this is faster because the allocation is more lightweight? In any case it points to the need to have separate code paths for on-heap and off-heap cases, or else more explicit management of the lifecycle of these memory segments, rather than leaving it up to GC as happens with ofAuto().
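
A minimal sketch of the more explicit lifecycle management mentioned above (assuming a per-query confined arena; the SegmentScorer interface is a made-up stand-in, and this is not what the PR currently does):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

final class ConfinedQuerySketch {
  // Hypothetical scorer abstraction, standing in for the real off-heap scorer.
  interface SegmentScorer {
    float score(MemorySegment query);
  }

  static float scoreOnce(byte[] query, SegmentScorer scorer) {
    // Confined arena: the off-heap copy is freed deterministically at the end
    // of the try block instead of being left to the garbage collector (ofAuto()).
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment nativeQuery = arena.allocateFrom(JAVA_BYTE, query);
      return scorer.score(nativeQuery);
    }
  }
}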

@msokolov
Contributor

OK I discovered the loss of recall was due to a silly bug. After fixing that, these are the results I'm seeing with the addition of a separate code path for dotProduct(byte[], byte[]):

candidate (reindex)

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) num_segments
0.979 7.794 7.795 1.000 500000 100 50 32 250 no 25766 497.73 3
0.979 13.193 13.194 1.000 500000 100 50 32 250 no 25766 0.00 3
0.979 13.487 13.489 1.000 500000 100 50 32 250 no 25766 0.00 3

baseline

0.979 24.282 24.280 1.000 500000 100 50 32 250 no 25766 0.00 3
0.979 25.850 25.848 1.000 500000 100 50 32 250 no 25766 0.00 3

baseline (reindex)

0.986 16.265 16.265 1.000 500000 100 50 32 250 no 25702 441.32 3
0.986 24.319 24.316 1.000 500000 100 50 32 250 no 25702 0.00 3
0.986 24.319 24.316 1.000 500000 100 50 32 250 no 25702 0.00 3

candidate

0.986 13.406 13.407 1.000 500000 100 50 32 250 no 25702 0.00 3
0.986 13.088 13.089 1.000 500000 100 50 32 250 no 25702 0.00 3
0.986 13.319 13.320 1.000 500000 100 50 32 250 no 25702 0.00 3

so now the situation is reversed and we see faster searching after indexing in the same JVM.

@msokolov
Contributor

what I did:

@@ -305,7 +306,36 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
   @Override
   public int dotProduct(byte[] a, byte[] b) {
-    return dotProduct(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
+    return dotProductWTF(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
+  }
+
+  static int dotProductWTF(MemorySegment a, MemorySegment b) {
+    assert a.byteSize() == b.byteSize();
+    int i = 0;
+    int res = 0;
+
+    // only vectorize if we'll at least enter the loop a single time, and we have at least 128-bit
+    // vectors (256-bit on intel to dodge performance landmines)
+    if (a.byteSize() >= 16 && PanamaVectorConstants.HAS_FAST_INTEGER_VECTORS) {
+      // compute vectorized dot product consistent with VPDPBUSD instruction
+      if (VECTOR_BITSIZE >= 512) {
+        i += BYTE_SPECIES.loopBound(a.byteSize());
+        res += dotProductBody512(a, b, i);
+      } else if (VECTOR_BITSIZE == 256) {
+        i += BYTE_SPECIES.loopBound(a.byteSize());
+        res += dotProductBody256(a, b, i);
+      } else {
+        // tricky: we don't have SPECIES_32, so we workaround with "overlapping read"
+        i += ByteVector.SPECIES_64.loopBound(a.byteSize() - ByteVector.SPECIES_64.length());
+        res += dotProductBody128(a, b, i);
+      }
+    }
+
+    // scalar tail
+    for (; i < a.byteSize(); i++) {
+      res += b.get(JAVA_BYTE, i) * a.get(JAVA_BYTE, i);
+    }
+    return res;
   }

Where dotProductWTF is just a clone of dotProduct

@msokolov
Contributor

BTW the above results were on ARM / Graviton 2. I also tried on an Intel laptop and got speedups, although not as large, and the weird faster-search-after-indexing behavior also persists here

## candidate
0.985 3.263 3.256 0.998 500000 100 50 32 250 no 560.67 891.80 0 0.00 0.000 0.000 HNSW
0.985 4.255 4.242 0.997 500000 100 50 32 250 no 0.00 Infinity 0 0.00 0.000 0.000 HNSW
0.985 4.084 4.071 0.997 500000 100 50 32 250 no 0.00 Infinity 0 0.00 0.000 0.000 HNSW
## baseline
0.985 5.756 5.744 0.998 500000 100 50 32 250 no 0.00 Infinity 0 0.00 0.000 0.000 HNSW
0.985 5.696 5.684 0.998 500000 100 50 32 250 no 0.00 Infinity 0 0.00 0.000 0.000 HNSW
@vigyasharma
Contributor

Wow, these are very impressive gains! Nice find @kaivalnp.

So the key change is Arena.ofAuto().allocateFrom(JAVA_BYTE, queryVector), which allocates an off-heap MemorySegment for the query vector? Do we have an intuition on why it creates such a dramatic (4x?!) speedup?

My understanding was that off heap document vectors helped by avoiding a copy back into the heap, plus avoiding the cost of reallocation and copy if some of them got garbage collected. But doesn't this change add a copy, by copying the byte[] queryVector from heap to the allocated off-heap segment? Also, since the query vector is only used during the lifetime of the query, I would've thought keeping it on heap should be okay?

...

Separately, I tried using the Arena.ofAuto().allocateFrom() construct in the on-heap case that is used during indexing and this made indexing incredibly slow. I guess it is because we force the allocation of many many small memory segments that have to be cleaned up by the garbage collector.

@msokolov : so indexing is faster if vectors are "on heap", but search is faster if vectors are "off heap".. Or do you think it's mainly because of the ofAuto(), which defers cleanup to the JVM's garbage collector? Can you share the code path for indexing that you modified?

...

Where dotProductWTF is just a clone of dotProduct

I'm confused, if dotProductWTF and dotProduct are exactly identical, why did dotProductWTF fix the 'search after indexing' case?

...
Separately, this PR holds up in real world benchmarks and looks good to merge. There are some interesting spin off issues here that can make indexing faster.

vigyasharma previously approved these changes Jul 10, 2025
@msokolov
Contributor

My understanding was that off heap document vectors helped by avoiding a copy back into the heap, plus avoiding the cost of reallocation and copy if some of them got garbage collected. But doesn't this change add a copy, by copying the byte[] queryVector from heap to the allocated off-heap segment? Also, since the query vector is only used during the lifetime of the query, I would've thought keeping it on heap should be okay?

It is confusing to me too. I think to understand it we need to decompile and look at the instructions that are generated -- after hotspot does its work. Maybe we are bypassing memory barriers that get applied to on-heap arrays? I am really not sure.

I'm confused, if dotProductWTF and dotProduct are exactly identical, why did dotProductWTF fix the 'search after indexing' case?

The idea behind this was to create two separate code paths: one used during indexing (when both arrays are on-heap) and another one used during search, when one array is on-heap and the other is off-heap (memory-mapped from disk). This seems to enable hotspot to separately optimize these two code paths.

There is yet another mystery here, which is: why, after adding this hotspot hackery, do we see even faster performance on the query path when it is preceded by an indexing workflow than we do when it is not (although it's still faster than the baseline).

@vigyasharma
Contributor

This seems to enable hotspot to separately optimize these two code paths.

Ah okay! That makes sense.

There is yet another mystery here, which is: why, after adding this hotspot hackery, do we see even faster performance on the query path when it is preceded by an indexing workflow than we do when it is not (although it's still faster than the baseline).

Could indexing have caused some document vectors to already be loaded in memory? Does it create any off heap vector values?

@msokolov
Contributor

I'm not really comfortable pushing the PR as it is given that it makes searching slower in the benchmark where we reindex first, and I think we should understand the hotspot hack a little better before we push that, because it's really kind of gross and feels like voodoo to me. I'm really not sure I see at all why forking this method makes a difference -- it doesn't really do very much work in its body. OTOH maybe hotspot keeps track of some stack frame context so it is actually influencing the rewrite of methods called by this (the dotProductBody methods)??

@msokolov
Contributor

I did dump the generated machine instructions and while it's hard to really understand what is going on in there, there is clearly a major difference between the code being emitted for dotProductBody256 and dotProductBody256 @ 10 when the hack is in place vs not. So I think the theory that hot spot is optimizing differently in these two cases seems correct

@msokolov
Contributor

msokolov commented Jul 11, 2025

This got me wondering -- what is hotspot doing with the code we have on main? By default, knnPerfTest.py runs 1000 iterations, plus we have a warmup phase that runs the same number of iters as the "main" phase (so a total of 2000 by default).

On Intel CPU with preferredBitSize=256

recall latency(ms) netCPU avgCpuCount nDoc nIters
0.982 5.500 5.100 0.927 500000 10
0.991 3.510 3.450 0.983 500000 100
0.979 5.851 5.834 0.997 500000 1000
0.975 5.891 5.881 0.998 500000 2500

On ARM with preferredBitSize=128

recall latency(ms) netCPU avgCpuCount nDoc nIters
0.986 9.300 9.400 1.011 500000 10
0.989 8.010 8.010 1.000 500000 100
0.986 25.860 25.858 1.000 500000 1000
0.983 7.628 7.628 1.000 500000 2500

So I think the bad performance we measured may just be a hotspot glitch around 1-2000 iterations, and in fact this PR is making things worse.

Now I tested this PR + the weird hack (on ARM)

Results:
recall latency(ms) netCPU avgCpuCount nDoc nIters
0.986 15.900 15.900 1.000 500000 10
0.989 14.270 14.280 1.001 500000 100
0.986 13.219 13.220 1.000 500000 1000
0.983 13.508 13.508 1.000 500000 2500

and finally this PR (no weird hotspot hack):

recall latency(ms) netCPU avgCpuCount nDoc nIters
0.986 16.000 16.000 1.000 500000 10
0.989 14.640 14.650 1.001 500000 100
0.986 13.381 13.382 1.000 500000 1000
0.983 13.794 13.794 1.000 500000 2500

net-net I don't think we should merge this

@kaivalnp
Contributor Author

Thanks for all the deep-dive here @msokolov!

I tried running a set of benchmarks (Cohere, 768d, byte vectors) on niter=100,500,1000,5000,10000,50000 (only search existing index, no reindex) to test how the compiler optimizes over larger runs..

main:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized num_segments
0.968 2.790 2.710 0.971 100000 100 50 64 250 no 1
0.964 2.692 2.674 0.993 100000 100 50 64 250 no 1
0.963 2.685 2.677 0.997 100000 100 50 64 250 no 1
0.963 2.599 2.597 0.999 100000 100 50 64 250 no 1
0.962 2.552 2.550 0.999 100000 100 50 64 250 no 1
0.962 2.590 2.589 1.000 100000 100 50 64 250 no 1

This PR:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized num_segments
0.968 2.050 1.960 0.956 100000 100 50 64 250 no 1
0.964 1.868 1.850 0.990 100000 100 50 64 250 no 1
0.963 1.894 1.885 0.995 100000 100 50 64 250 no 1
0.963 1.770 1.769 0.999 100000 100 50 64 250 no 1
0.962 1.762 1.761 1.000 100000 100 50 64 250 no 1
0.962 1.742 1.741 1.000 100000 100 50 64 250 no 1

There's a possibility that the candidate (i.e. this PR) has an inherent benefit due to being run later in time (so vectors are more likely to be loaded into RAM) -- so I ran baseline (i.e. main) immediately afterwards:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized num_segments
0.968 2.850 2.770 0.972 100000 100 50 64 250 no 1
0.964 2.712 2.694 0.993 100000 100 50 64 250 no 1
0.963 2.640 2.632 0.997 100000 100 50 64 250 no 1
0.963 2.588 2.586 0.999 100000 100 50 64 250 no 1
0.962 2.561 2.559 0.999 100000 100 50 64 250 no 1
0.962 2.550 2.549 1.000 100000 100 50 64 250 no 1

For some reason, the changes in this PR are still better on my machine :/

I think we should understand the hotspot hack a little better before we push that, because it's really kind of gross and feels like voodoo to me

+1, not looking to merge this until we find out why we're seeing a difference in performance (seems counterintuitive as we're doing more work but seeing better latency!) -- when we (1) create a fresh index, (2) reindex, (3) search an existing index, (4) for different parameters, (5) across different machines?

Performance seems tied to the HotSpot compiler -- is there a way to make optimizations more deterministic? (or at least, explicit)

On a related note, benchmark runs have been fluctuating so wildly -- I wonder if we should set larger defaults to get reliable numbers..

@msokolov
Contributor

I'm curious if you have a Graviton, and which generation? If you run without -quiet you can capture where it says preferredBitSize=xxx, and that should also be indicative I think

@vigyasharma
Contributor

Thanks for the analysis @msokolov , agree with not merging until we understand what's happening.

Now I tested this PR + the weird hack (on ARM)
...
and finally this PR (no weird hotspot hack):

Latency looks similar in both runs. Did the hotspot hack stop working?

@vigyasharma vigyasharma self-requested a review July 11, 2025 22:03
@vigyasharma vigyasharma dismissed their stale review July 11, 2025 22:04

Let's figure out why the hotspot hack works and get it working deterministically before merging this.

@vigyasharma
Contributor

Another question: is the weird "search after indexing" regression specific only to this PR? I believe our current hypothesis is that hotspot sees document vectors on heap during indexing and does something weird (code optimization, memory management, ...) which impacts search.

Is this in any way related to the query vector being off-heap (and why?), or is it an existing regression? Just thinking out loud to see what others have observed, I'll also run the benchmark myself to check this.

@kaivalnp
Contributor Author

kaivalnp commented Jul 11, 2025

is the weird "search after indexing" regression specific only to this PR?

It's slightly different here -- I tried the following on main

When a fresh index is created (run 1):

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 2.348 2.345 0.999 100000 100 50 64 250 no 11.56 8651.27 50.78 1 77.48 292.969 292.969 HNSW

Then a run without -reindex (run 2, same index as run 1):

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 2.544 2.542 0.999 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.48 292.969 292.969 HNSW

Then a run with -reindex (run 3, keeping index from run 1 around):

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.963 1.801 1.800 1.000 100000 100 50 64 250 no 11.51 8687.34 50.71 1 77.46 292.969 292.969 HNSW

And finally a run without -reindex (run 4, same index as run 3):

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.963 2.491 2.490 0.999 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.46 292.969 292.969 HNSW

This behavior of seeing a performance improvement after re-indexing (but not in a fresh index) has been consistent for me (see this previous result)

Edit: Sorry I pasted numbers with quantization earlier (was running something else for #14863), now updated to byte vectors + no quantization.. The trend is pretty consistent though!

@ChrisHegarty
Contributor

I'm going to try to help determine what's going on here. There are a number of subtleties around these off-heap scorers.

@kaivalnp
Contributor Author

FYI I was trying to see the compiled code for the dot product function using -XX:CompileCommand=print,*PanamaVectorUtilSupport.dotProductBody256 and saw that performance improved on adding that flag (no reindexing in any run)

This PR without the flag:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.711 1.711 1.000 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.44 292.969 292.969 HNSW

This PR with the flag:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.171 1.172 1.001 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.44 292.969 292.969 HNSW

main without the flag:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 2.503 2.502 0.999 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.44 292.969 292.969 HNSW

main with the flag:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.194 1.192 0.999 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.44 292.969 292.969 HNSW

Perhaps the flag is forcing the function to be optimized by the JVM?
In that case, these numbers should be representative of the latency in a long-running application (where the function has been fully optimized by the compiler)?

If so, I don't see a lot of value in merging this PR (there isn't a large improvement) -- but I'd love to make our benchmarks more robust and report something representative of a long-running application! (these non-deterministic compiler optimizations are too trappy)

@kaivalnp
Contributor Author

To look into compiler inlining, I added -XX:CompileCommand=PrintInlining,*PanamaVectorUtilSupport.* to the startup arguments and saw:

@ 113   jdk.incubator.vector.AbstractSpecies::loopBound (9 bytes)   force inline by annotation
  @ 5   jdk.incubator.vector.VectorIntrinsics::roundDown (23 bytes)   force inline by annotation
@ 125   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody256 (110 bytes)   failed to inline: callee is too large

..in a run without -reindex

However if I performed a run with -reindex, I saw lines like:

@ 113   jdk.incubator.vector.AbstractSpecies::loopBound (9 bytes)   force inline by annotation
  @ 5   jdk.incubator.vector.VectorIntrinsics::roundDown (23 bytes)   force inline by annotation
@ 125   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody256 (110 bytes)   inline (hot)

..which means it is being inlined if the dot product function is hot enough


So, I tried to turn on explicit inlining using -XX:CompileCommand=inline,*PanamaVectorUtilSupport.* in some runs:

recall latency inlining reindex
0.962 2.540 no no
0.962 1.210 yes no
0.962 1.829 no yes
0.962 1.808 yes yes

I was able to squeeze out the extra performance in the run without -reindex, but not when we create an index first. I'll try to dig deeper into why this is happening..

@kaivalnp
Contributor Author

I'll try to dig deeper on why this is happening..

@msokolov I tried what you mentioned above, using the following hack:

..and the performance changed drastically!

main:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.961 1.836 1.835 0.999 100000 100 50 64 250 no 11.65 8582.22 24.70 1 77.47 292.969 292.969 HNSW

after the hack:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.961 1.201 1.200 0.999 100000 100 50 64 250 no 11.67 8569.71 50.90 1 77.47 292.969 292.969 HNSW

Note that search time is ~35% faster, while force-merge time is ~106% slower!

Looks like the JVM is producing optimized branches of code based on the underlying input type(s) of int dotProduct(MemorySegment, MemorySegment) -- and the non-optimized branches suffer a latency regression..

On main, it looks like the indexing call path was the one optimized, while after the hack the search call path was optimized instead
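
A simplified illustration of that theory (scalar loops instead of the real vectorized bodies, and made-up method names): forking the entry points means each copy only ever profiles one mix of segment kinds, so the compiler can specialize them independently.

import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

final class ProfileSplitSketch {
  // Called while indexing: both segments are heap-backed (MemorySegment.ofArray).
  static int dotProductIndexing(MemorySegment a, MemorySegment b) {
    int res = 0;
    for (long i = 0; i < a.byteSize(); i++) {
      res += a.get(JAVA_BYTE, i) * b.get(JAVA_BYTE, i);
    }
    return res;
  }

  // Called while searching: heap-backed query, memory-mapped (native) doc.
  static int dotProductSearching(MemorySegment a, MemorySegment b) {
    int res = 0;
    for (long i = 0; i < a.byteSize(); i++) {
      res += a.get(JAVA_BYTE, i) * b.get(JAVA_BYTE, i);
    }
    return res;
  }
}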

Had the following questions:

  1. Is this also an issue in long-running applications, or just a benchmark issue with luceneutil?
  2. How can we refactor functions around so that the optimal case is always used? Perhaps separate out Panama / Vector API usages?
  3. Also, does the issue disproportionately affect applications where indexing and search happen on the same node? (v/s applications with separate writers / searchers -- and both internally execute their own optimized branches)
@msokolov
Contributor

Looks like the JVM is producing optimized branches of code based on the underlying input type(s) of int dotProduct(MemorySegment, MemorySegment) -- and the non-optimized branches suffer a latency regression..

Yes, this jibes with my understanding. It made me wonder if we could introduce a different signature for the two cases. But it seems crazy, if the Java code is identical for them, that the underlying machine code would be so completely different. And wouldn't it be nice if HotSpot could understand that context may change and generate multiple versions for multiple contexts.

@kaivalnp kaivalnp force-pushed the fix-off-heap-byte-scorer branch from 811a5bf to ac41d29 on July 19, 2025 02:18
@github-actions
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp
Contributor Author

kaivalnp commented Jul 19, 2025

Based on the above findings, we think that the JVM is aggressively optimizing the indexing case where both vectors are off-heap (i.e. in Lucene99MemorySegmentByteVectorScorerSupplier), and then the query path (i.e. in Lucene99MemorySegmentByteVectorScorer), with an on-heap query, is penalized..

I refactored code around a bit to instantiate ByteVector instances more appropriately -- since it looked like wrapping an array with MemorySegment.ofArray and then instantiating a ByteVector using ByteVector.fromMemorySegment was causing some friction^
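
Conceptually, the refactor looks something like the following (a simplified sketch, not the exact interface in the PR): the species-wide load happens directly from whichever backing store the vector has, so the on-heap path never goes through a wrapping MemorySegment.

import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

// Simplified sketch of a loader abstraction: load a ByteVector directly from the
// vector's actual backing store, rather than always wrapping arrays in segments.
interface ByteVectorLoaderSketch {
  ByteVector load(VectorSpecies<Byte> species, int offset);

  static ByteVectorLoaderSketch ofArray(byte[] bytes) {
    return (species, offset) -> ByteVector.fromArray(species, bytes, offset);
  }

  static ByteVectorLoaderSketch ofMemorySegment(MemorySegment segment) {
    return (species, offset) ->
        ByteVector.fromMemorySegment(species, segment, offset, ByteOrder.LITTLE_ENDIAN);
  }
}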

Benchmarks:

main with -reindex:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.832 1.831 1.000 100000 100 50 64 250 no 11.40 8771.93 24.53 1 77.45 292.969 292.969 HNSW

main without -reindex:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.214 1.213 0.999 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.45 292.969 292.969 HNSW

This PR with -reindex:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.140 1.140 0.999 100000 100 50 64 250 no 11.10 9010.63 24.45 1 77.47 292.969 292.969 HNSW

This PR without -reindex:

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.962 1.195 1.194 0.999 100000 100 50 64 250 no 0.00 Infinity 0.12 1 77.47 292.969 292.969 HNSW

Note: I ran the benchmark with -XX:CompileCommand=inline,*PanamaVectorUtilSupport.* to avoid the cold-start problem!

I'm seeing ~38% improvement in the "search after indexing" case -- and no degradation in the "only search" case!

@github-actions github-actions bot added this to the 11.0.0 milestone Jul 19, 2025
@kaivalnp
Contributor Author

kaivalnp commented Jul 22, 2025

I'm going to try to help determine what's going on here. There are a number of subtleties around these off-heap scorers.

@ChrisHegarty do you think this understanding of the performance regression in scoring off-heap byte vectors / fix is correct?

Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM. It's unfortunate that we need to do this, but I agree with the change - and the benchmarks speak for themselves.

@kaivalnp
Contributor Author

Thanks @ChrisHegarty!

If we don't have further concerns / suggestions, can someone help merge this PR?

@msokolov
Contributor

okay, I was slightly worried about the allocations of the new internal interface (ByteVectorLoader) creating lots of garbage, but it seems they are so short-lived they can probably be covered by escape analysis? LGTM, I'll merge.

@msokolov msokolov merged commit b3f4011 into apache:main Jul 29, 2025
8 checks passed
@kaivalnp
Contributor Author

allocations of the new internal interface (ByteVectorLoader) creating lots of garbage

Yeah, they're probably inlined away by the JVM

LGTM, I'll merge.

Thanks!

@kaivalnp
Contributor Author

Opened #15010 to backport this to 10.x

kaivalnp pushed a commit to kaivalnp/luceneutil that referenced this pull request Aug 28, 2025
mikemccand pushed a commit to mikemccand/luceneutil that referenced this pull request Aug 28, 2025
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
@kaivalnp
Contributor Author

I missed this earlier, but we saw an indexing throughput increase in a specific case for this change :)
mikemccand/luceneutil#452
