Skip to content

Conversation

lalitb
Copy link
Member

@lalitb lalitb commented Feb 21, 2023

This PR attempts to improve the measurement processing throughput - with benchmark as the time taken by SDK to process 1 million measurements with multiple dimensions and high cardinality.

The change is to remove the unnecessary memory copy operations done for each input measurement. The measurement is now copied to OrderedAttributeMap only if it is not already existing in the AttributesHashMap.

Existing flow:
instrument::Add(value, attributes) -> Copy attributes to OrderedAttributeMap -> Calculate hash for attributes in OrderedAttributeMap -> Add attributes to AttributesHashMap if not existing (by comparing hash calculated in step 3) -> Aggregate(value)

New flow:
instrument::Add(value, attributes) -> Calculate hash for attributes in KeyValueIterable -> Copy attributes to OrderedAttributeMap, and then add to AttributesHashMap if not existing (by comparing hash calculated in step 2) -> Aggregate(value)

The processing time for 1 million measurements is reduced from 3.7 secs to 1.9 secs ( on CPU: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz, 20 processors, Memory: 32GB)

The test results are for below scenario:

Number of dimensions/attributes : 3 Cardinality of each dimension : 10 Total possible dimensions: 10 * 10 * 10 = 1000 Total number of measurements : 1 million 

(The benchmark code is added as part of this PR - measurements_benchmark.cc:

Benchmark with this PR changes:

Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 8.23, 3.09, 1.54 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 1988578400 ns 1988543700 ns 1 

Benchmark with existing code:

2023-02-20 22:23:53 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 2.84, 2.51, 1.45 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 3794032100 ns 3794023000 ns 1 

Changes

Please provide a brief description of the changes here.

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed
@lalitb lalitb requested a review from a team February 21, 2023 08:13
@codecov
Copy link

codecov bot commented Feb 21, 2023

Codecov Report

Merging #1993 (7d7755b) into main (649829f) will increase coverage by 0.01%.
The diff coverage is 89.42%.

Additional details and impacted files

Impacted file tree graph

@@ Coverage Diff @@ ## main #1993 +/- ## ========================================== + Coverage 87.32% 87.32% +0.01%  ========================================== Files 166 166 Lines 4673 4723 +50 ========================================== + Hits 4080 4124 +44  - Misses 593 599 +6 
Impacted Files Coverage Δ
...ntelemetry/sdk/metrics/view/attributes_processor.h 90.00% <50.00%> (-10.00%) ⬇️
...clude/opentelemetry/sdk/common/attributemap_hash.h 83.88% <84.22%> (+7.41%) ⬆️
sdk/src/metrics/state/temporal_metric_storage.cc 98.31% <87.50%> (+0.06%) ⬆️
...ntelemetry/sdk/metrics/state/sync_metric_storage.h 86.37% <88.89%> (-1.13%) ⬇️
...entelemetry/sdk/metrics/state/attributes_hashmap.h 95.24% <96.67%> (-0.59%) ⬇️
...telemetry/sdk/metrics/state/async_metric_storage.h 86.49% <100.00%> (+0.38%) ⬆️
@lalitb
Copy link
Member Author

lalitb commented Feb 22, 2023

This is ready for review now. @ThomsonTan , @esigo please help in review this PR. We can enable exemplar once this is merged, as exemplar has it's own performance cost.

@lalitb
Copy link
Member Author

lalitb commented Feb 25, 2023

Added few more changes to reduce the lock contention (by removing AttributeHashMap hash calculation outside of mutex lock). Gained few more seconds, and better scaling with threads:

Number of threads : 1

$ ./measurements_benchmark 2023-02-24 23:09:48 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 0.33, 1.90, 1.20 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 1582235960 ns 104430 ns 10 

Number of threads: 5

$ ./measurements_benchmark 2023-02-24 23:15:19 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 0.20, 0.70, 0.87 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 991755670 ns 306610 ns 10 

Number of threads: 19

$ ./measurements_benchmark 2023-02-24 23:18:25 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 0.65, 0.67, 0.81 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 926599920 ns 878030 ns 10 

So, 1.5 secs to process 1 million measurements from single thread, .98 secs to process 1 million measurements from 5 threads, .92 secs to process 1 million measurements from 19 threads. This is on machine with 20 processors. So, while the library scales well from 1 to 5 threads, it reaches its performance limit beyond that. As valgrind/callgrind cpu profiling result shows, the limit is reached for the synchronized access to std::unordered_map used inside AttributesHashMap across all these threads. I think we can live with current performance for now unless there are any cross-platform lock-free hashmap data structure we can easily used in our code.

@esigo esigo added the size/L Denotes a PR that changes 100-499 lines. label Feb 26, 2023
@lalitb
Copy link
Member Author

lalitb commented Mar 1, 2023

The results are more promising while building the application in Release mode (similar results on Windows and Linux).
So to process 1 million measurements:
Time taken (1 thread) : ~130 ms
Time taken (5 threads): ~147 ms
Time taken (19 threads): ~165 ms

Number of threads : 1

$ ./measurements_benchmark 2023-02-28 23:32:31 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 0.16, 0.98, 0.88 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 133399288 ns 132125 ns 100 

Number of threads: 5

$ ./measurements_benchmark 2023-02-28 23:36:06 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 0.62, 0.71, 0.78 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 147010309 ns 359316 ns 100 

Number of threads: 19

$ ./measurements_benchmark 2023-02-28 23:37:36 Running ./measurements_benchmark Run on (20 X 2808.01 MHz CPU s) CPU Caches: L1 Data 32K (x10) L1 Instruction 32K (x10) L2 Unified 256K (x10) L3 Unified 20480K (x1) Load Average: 0.33, 0.60, 0.73 -------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------- BM_MeasurementsTest 165881775 ns 1205977 ns 100 
Copy link
Member

@esigo esigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for the PR :)
sorry for the delay

@esigo esigo added the ok-to-merge The PR is ok to merge (has two approves or raised by a maintainer/approver and has one approve) label Mar 3, 2023
@lalitb lalitb enabled auto-merge (squash) March 4, 2023 04:11
@lalitb lalitb merged commit da333f8 into open-telemetry:main Mar 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ok-to-merge The PR is ok to merge (has two approves or raised by a maintainer/approver and has one approve) size/L Denotes a PR that changes 100-499 lines.

3 participants