perf-cpp provides direct access to hardware performance counters from your C++ application. The library supports precise event counting and sampling for specific code segments and lets you link sampled data (e.g., memory addresses) with application-specific details (e.g., class instances).
- Count Hardware Events: Seamlessly embed performance statistics (similar to perf stat) into your app and profile specific code segments instead of the entire application. perf-cpp also supports Metrics (e.g., cycles per instruction) and accessing Statistics in Realtime.
- Record Samples: Periodically capture profiling data, such as instruction pointers and memory accesses, directly from your application (similar to perf [mem] record).
- Customizable Event Configuration: Mix built-in events (e.g., cycles, instructions, cache-misses) with CPU-specific ones.
- Practical Examples: Jumpstart your implementation with the diverse collection of examples that demonstrate practical applications of the library.
perf-cpp is designed as a library that can be linked to your application.
```shell
# Clone the repository
git clone https://github.com/jmuehlig/perf-cpp.git

# Switch to the repository folder
cd perf-cpp

# Optional: Switch to the latest stable version
git checkout v0.10.0

# Build the library (in build/)
# Note: -DBUILD_EXAMPLES=1 is optional and only needed if you want to build the examples
cmake . -B build -DBUILD_EXAMPLES=1
cmake --build build

# Optional: Build examples (in build/examples/bin)
cmake --build build --target examples
```

Note
Further information and detailed building instructions (e.g., how to integrate into CMake projects) are available in the Building Guide.
Recording hardware event statistics operates much like perf stat: it counts key events, such as executed instructions, CPU cycles, and cache misses, throughout a code segment's execution.
```cpp
#include <iostream>
#include <perfcpp/event_counter.h>

/// Initialize the counter
auto counters = perf::CounterDefinition{};
auto event_counter = perf::EventCounter{ counters };

/// Specify hardware events to count
event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});

/// Run the workload
event_counter.start();
your_workload(); /// <-- Your code to profile
event_counter.stop();

/// Print the result to the console
const auto result = event_counter.result();
for (const auto [event_name, value] : result)
{
  std::cout << event_name << ": " << value << std::endl;
}
```

Possible output:
```
seconds: 0.0955897
instructions: 5.92087e+07
cycles: 4.70254e+08
cache-misses: 1.35633e+07
```

Note
For additional insights please refer to the guides on Recording Events and Recording Events on Multiple CPUs/Threads. Also, check out the Hardware Events documentation for comprehensive details on both built-in and hardware-specific events.
Recording samples functions much like perf [mem] record: it captures execution snapshots, e.g., the instruction pointer, executing CPU, and timestamp, at regular intervals (here every 4,000th CPU cycle).
```cpp
#include <iostream>
#include <perfcpp/sampler.h>

/// Create the sampler
auto counters = perf::CounterDefinition{};
auto sampler = perf::Sampler{ counters };

/// Specify when a sample is recorded: every 4000th cycle
sampler.trigger("cycles", perf::Period{4000U});

/// Specify what metadata is included into a sample: time, CPU ID, instruction
sampler.values()
  .time(true)
  .cpu_id(true)
  .instruction_pointer(true);

/// Run the workload
sampler.start();
your_workload(); /// <-- Your code to profile
sampler.stop();

/// Print the samples to the console
const auto samples = sampler.result();
for (const auto& sample_record : samples)
{
  const auto time = sample_record.time().value();
  const auto cpu_id = sample_record.cpu_id().value();
  const auto instruction = sample_record.instruction_pointer().value();

  std::cout
    << "Time = " << time << " | CPU = " << cpu_id
    << " | Instruction = 0x" << std::hex << instruction << std::dec
    << std::endl;
}
```

Possible output:
```
Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c
```

Note
For additional details, such as the types of data that can be included in samples, please consult the Sampling Guide. For parallel sampling, see the Sampling on Multiple CPUs/Threads Guide.
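Once samples are recorded, a common next step is to aggregate them, for example to see which instruction pointers were sampled most often. The sketch below assumes the sampled fields have already been copied into a plain struct; `SampleRow` and `hottest_instructions` are hypothetical names for illustration and not perf-cpp types or functions.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

/// Hypothetical plain-data copy of the sampled fields (not a perf-cpp type).
struct SampleRow
{
  std::uint64_t time;
  std::uint32_t cpu_id;
  std::uint64_t instruction_pointer;
};

/// Counts how often each instruction pointer was sampled and returns
/// (instruction pointer, hits) pairs, most frequently sampled first.
/// Ties are ordered by ascending instruction pointer.
std::vector<std::pair<std::uint64_t, std::uint64_t>> hottest_instructions(const std::vector<SampleRow>& rows)
{
  auto hits = std::unordered_map<std::uint64_t, std::uint64_t>{};
  for (const auto& row : rows)
  {
    ++hits[row.instruction_pointer];
  }

  std::vector<std::pair<std::uint64_t, std::uint64_t>> ranked(hits.begin(), hits.end());
  std::sort(ranked.begin(), ranked.end(), [](const auto& left, const auto& right) {
    return left.second != right.second ? left.second > right.second : left.first < right.first;
  });
  return ranked;
}
```

To fill the rows, one option is to push `{time, cpu_id, instruction}` into a `std::vector<SampleRow>` inside the loop that prints the samples. For the four samples in the possible output, the result contains two entries with two hits each.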
We include a comprehensive collection of examples demonstrating the advanced capabilities of perf-cpp, including, for example, counting events in parallel settings and sampling memory accesses.
Tip
All code examples are available in the examples/ folder.
- Full Documentation: Explore detailed guides on every feature of perf-cpp.
- Examples: Learn how to set up different features from code examples.
- Changelog: Stay updated with the latest changes and improvements.
- C++ Standard: Requires support for C++17 features.
- CMake Version: 3.10 or higher.
- Linux Kernel Version: 4.0 or newer (note that some features require a newer kernel).
- perf_event_paranoid setting: Adjust as needed to allow access to performance counters (see the Paranoid Value documentation).
We welcome contributions and feedback to make perf-cpp even better. For feature requests, feedback, or bug reports, please reach out via our issue tracker or submit a pull request.
Alternatively, you can email me: jan.muehlig@tu-dortmund.de.
While perf-cpp is dedicated to providing developers with clear insights into application performance, it is part of a broader ecosystem of tools that facilitate performance analysis. Below is a non-exhaustive list of some other valuable profiling projects:
- PAPI offers access not only to CPU performance counters but also to a variety of other hardware components including GPUs, I/O systems, and more.
- Likwid is a collection of several command line tools for benchmarking, including an extensive wiki.
- PerfEvent provides lightweight access to performance counters, facilitating streamlined performance monitoring.
- Intel's Instrumentation and Tracing Technology allows applications to manage the collection of trace data effectively when used in conjunction with Intel VTune Profiler.
- For those who prefer a more hands-on approach, the perf_event_open system call can be utilized directly without any wrappers.
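For comparison, the direct perf_event_open route mentioned in the last item can be sketched as follows. This is a minimal, Linux-only illustration that counts user-space instructions on the calling thread; `count_instructions` is a hypothetical helper name, error handling is reduced to returning `std::nullopt`, and it is not how perf-cpp is implemented internally.

```cpp
#include <cstdint>
#include <optional>

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/// Hypothetical helper: counts user-space instructions executed by `workload`
/// on the calling thread. Returns std::nullopt if the counter cannot be opened
/// (e.g., due to perf_event_paranoid or missing hardware support) or read.
/// Note: the workload is only executed if the counter could be opened.
template <typename Workload>
std::optional<std::uint64_t> count_instructions(Workload&& workload)
{
  perf_event_attr attr{};
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;
  attr.disabled = 1;       // start disabled, enable explicitly below
  attr.exclude_kernel = 1; // count user space only
  attr.exclude_hv = 1;

  // glibc offers no wrapper, so the system call is invoked directly.
  // pid = 0 and cpu = -1: measure the calling thread on any CPU.
  const auto fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
  if (fd < 0)
  {
    return std::nullopt;
  }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  workload();
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

  // Without a read_format, the file descriptor yields a single 64-bit count.
  std::uint64_t count = 0U;
  const auto bytes = read(fd, &count, sizeof(count));
  close(fd);

  if (bytes != static_cast<decltype(bytes)>(sizeof(count)))
  {
    return std::nullopt;
  }
  return count;
}
```

A call such as `count_instructions([] { your_workload(); })` returns the instruction count when the kernel grants access. Grouping several events, sampling, and multi-threaded use require considerably more setup, which is where a wrapper library becomes useful.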
This is a non-exhaustive list of academic research papers and blog articles (feel free to add to it, including your own work, e.g., via pull request).
- Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis (2017)
- Analyzing memory accesses with modern processors (2020)
- Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison (2023)
- Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE (2024)