
Conversation

ZENOTME commented Sep 17, 2025

Solves #3645.

Changes

I noticed that the original circular buffer used by the batch span processor is a lock-free MPSC queue based on CAS. Under intense multithreaded contention, Compare-And-Swap (CAS) scales worse than Fetch-And-Add (FAA), because a failed CAS must loop and retry while an FAA always completes in a single atomic operation. Following the approach of https://github.com/dbittman/waitfree-mpsc-queue, this PR implements a wait-free MPSC (Multiple Producer, Single Consumer) queue using FAA. The original benchmark indicates that this approach scales better.
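
For readers less familiar with the distinction, here is a minimal sketch of the producer-side slot claim in both styles. It is illustrative only; the names (`ClaimSlotCas`, `ClaimSlotFaa`, `head`) are mine, not the PR's code:

```cpp
#include <atomic>
#include <cstdint>

// CAS-based claim: a failed compare_exchange forces a retry loop, so under
// heavy contention producers waste cycles colliding (lock-free, not wait-free).
uint64_t ClaimSlotCas(std::atomic<uint64_t> &head)
{
  uint64_t idx = head.load(std::memory_order_relaxed);
  while (!head.compare_exchange_weak(idx, idx + 1, std::memory_order_acq_rel,
                                     std::memory_order_relaxed))
  {
    // idx now holds the freshly observed value; try again.
  }
  return idx;
}

// FAA-based claim: fetch_add always succeeds in a single atomic RMW, so every
// producer finishes in a bounded number of steps (wait-free).
uint64_t ClaimSlotFaa(std::atomic<uint64_t> &head)
{
  return head.fetch_add(1, std::memory_order_acq_rel);
}
```

Benchmark results: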

Run on (48 X 2593.99 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 256 KiB (x24)
  L3 Unified 30720 KiB (x2)
Load Average: 7.85, 5.70, 4.48
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_BaselineBuffer/1     10178537 ns        51528 ns         1000
BM_BaselineBuffer/2      7408646 ns        69828 ns         1000
BM_BaselineBuffer/4      7684772 ns       127549 ns         1000
BM_BaselineBuffer/8      7222459 ns       278660 ns         1000
BM_BaselineBuffer/16     6716972 ns       603712 ns         1215
BM_LockFreeBuffer/1      3915343 ns        53125 ns         1000
BM_LockFreeBuffer/2      4798406 ns        70581 ns         1000
BM_LockFreeBuffer/4      4562709 ns       128493 ns         1000
BM_LockFreeBuffer/8      4935221 ns       289996 ns         1000
BM_LockFreeBuffer/16     5187913 ns       618856 ns         1081
BM_OptimizedBuffer/1     4256507 ns        49970 ns         1000
BM_OptimizedBuffer/2     3398719 ns        67712 ns         1000
BM_OptimizedBuffer/4     3204749 ns       127378 ns         1000
BM_OptimizedBuffer/8     3230722 ns       296507 ns         1000
BM_OptimizedBuffer/16    3859005 ns       769220 ns         1000

I would like to refine this PR if it's acceptable.

linux-foundation-easycla bot commented Sep 17, 2025

CLA Signed

  • ✅ login: ZENOTME / name: zenotme / (1836bbf)

The committers listed above are authorized under a signed CLA.

@marcalff marcalff added the pr:waiting-on-cla Waiting on CLA label Sep 17, 2025
codecov bot commented Sep 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.08%. Comparing base (aeb5a01) to head (1836bbf).
⚠️ Report is 50 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3644      +/-   ##
==========================================
+ Coverage   90.03%   90.08%   +0.06%
==========================================
  Files         220      220
  Lines        7065     7111      +46
==========================================
+ Hits         6360     6405      +45
- Misses        705      706       +1
Files with missing lines                                 Coverage Δ
...include/opentelemetry/sdk/common/circular_buffer.h    98.15% <ø> (-1.85%) ⬇️
...e/opentelemetry/sdk/common/circular_buffer_range.h    100.00% <100.00%> (ø)

... and 21 files with indirect coverage changes


CircularBufferRange<AtomicUniquePtr<T>> PeekImpl() noexcept
{
  uint64_t current_count = count_.load(std::memory_order_relaxed);

owent (Member) commented:
The only synchronized operation is performed here. Is it possible for this to return the same CircularBufferRange to more than one thread, causing a data race?

ZENOTME (Author) commented:
Sorry, I don't quite follow the point here. Could you describe a more specific scenario?

owent (Member) commented Sep 20, 2025:
When two threads call this function and both complete before any state changes, they may return the same values, and callback(range); in Consume would then be invoked with the same range.

ZENOTME (Author) commented:
Yes, this circular buffer implementation is MPSC and assumes only one consumer thread calls this method. That is how it is used in the batch span processor.
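
To make that contract concrete, here is a hedged sketch of the intended usage shape; `Buffer`, `Add`, and `Consume` are stand-ins mirroring the SDK's circular buffer, not its exact API:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Many producers may call Add concurrently (the "MP" in MPSC), but Consume,
// which calls PeekImpl internally, must only ever run on a single thread.
// Two concurrent consumers would create exactly the race owent described.
template <class Buffer, class Callback>
void MpscPattern(Buffer &buffer, Callback export_batch)
{
  std::atomic<bool> done{false};

  // Producers: e.g. application threads ending spans.
  std::vector<std::thread> producers;
  for (int i = 0; i < 4; ++i)
  {
    producers.emplace_back([&buffer] {
      for (int j = 0; j < 1000; ++j)
      {
        buffer.Add(j);
      }
    });
  }

  // The single consumer: the batch span processor's worker thread.
  std::thread consumer([&] {
    while (!done.load(std::memory_order_acquire))
    {
      buffer.Consume(export_batch);
    }
  });

  for (auto &t : producers)
  {
    t.join();
  }
  done.store(true, std::memory_order_release);
  consumer.join();
}
```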

Contributor commented:
Looks like PeekImpl() could be made const, so the const_cast could be removed too.
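
As a standalone illustration (a toy class, not the PR's code): `std::atomic<T>::load()` is const-qualified, so a member function that only reads atomics can itself be `const`, and callers no longer need a `const_cast`:

```cpp
#include <atomic>
#include <cstdint>

class Counter
{
public:
  // Reading an atomic does not mutate the object, so this can be const.
  uint64_t Peek() const noexcept
  {
    return count_.load(std::memory_order_relaxed);
  }

private:
  std::atomic<uint64_t> count_{0};
};
```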

Reneg973 (Contributor) commented Oct 26, 2025

On a Mac M1 I get:

Run on (10 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 9.93, 6.67, 5.87
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_BaselineBuffer/1      2789978 ns        32254 ns         1000
BM_BaselineBuffer/2      2949209 ns        46430 ns         1000
BM_BaselineBuffer/4      1956026 ns        72750 ns         9651
BM_BaselineBuffer/8      1690755 ns       117763 ns         5736
BM_BaselineBuffer/16     1491443 ns       212598 ns         3275
BM_LockFreeBuffer/1      1223514 ns        32825 ns        10000
BM_LockFreeBuffer/2      1369947 ns        45685 ns        10000
BM_LockFreeBuffer/4      2768243 ns        76035 ns         1000
BM_LockFreeBuffer/8      3852933 ns       135797 ns         1000
BM_LockFreeBuffer/16     4341987 ns       259862 ns         1000
BM_OptimizedBuffer/1     1471849 ns        32878 ns        10000
BM_OptimizedBuffer/2     1138345 ns        45200 ns        10000
BM_OptimizedBuffer/4     1536488 ns        73745 ns         9610
BM_OptimizedBuffer/8     1577325 ns       126947 ns         5565
BM_OptimizedBuffer/16    1833288 ns       237527 ns         2986

So the result is similar to SpinLockMutex: the higher the contention, the less efficient atomics are, and the better std::mutex performs.

Reneg973 (Contributor) commented:

And a compiler fix for libc++ (macOS) is also required, because size_t is of type unsigned long there, so std::min needs an explicit template argument:

 uint64_t max_check = std::min<uint64_t>(current_count, capacity_);
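
The failure is easy to reproduce in a standalone sketch (illustrative, not the PR's code): on LP64 macOS, `size_t` is `unsigned long` while `uint64_t` is `unsigned long long`, so `std::min` cannot deduce a single `T` from mixed argument types.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

int main()
{
  uint64_t current_count = 10;  // unsigned long long on macOS
  std::size_t capacity   = 5;   // unsigned long on LP64 macOS
  // std::min(current_count, capacity);  // error: no matching overload
  uint64_t max_check = std::min<uint64_t>(current_count, capacity);  // OK
  return static_cast<int>(max_check);
}
```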
