Lightweight recording and sampling of performance counters for specific code segments directly from your C++ application.

jmuehlig/perf-cpp

perf-cpp: Effortless Hardware Performance Monitoring for C++ Applications

License: LGPL-3.0 | Linux Kernel >= 4.0 | C++17

Quick Start | How to Build | Documentation | System Requirements

perf-cpp embeds Linux's hardware performance monitoring directly into your code, letting you profile exactly what matters and process the results in your application. Tools like Linux Perf, Intel® VTune™, and AMD uProf are powerful but monitor entire programs – and high-performance applications need surgical precision.

What can perf-cpp do?

Built around Linux's powerful perf subsystem, perf-cpp provides a clean interface for counting and sampling hardware events – without the complexity of low-level APIs.

  • Measure exactly what you want – utilize performance counters to count hardware events, similar to perf stat, but around specific code paths, not an entire binary (documentation).
  • Calculate metrics such as cycles per instruction and cache miss to access ratio based on hardware events and timing (documentation).
  • Read performance counters with low latency, without starting or stopping them, for micro-benchmarks or adaptive tuning (documentation).
  • Record instruction and memory samples, just like perf [mem] record – but from inside your application (documentation).
  • Correlate samples with data structures and symbols to generate per-class access statistics and flame graphs.
  • Mix built-in events (e.g., cycles, instructions, cache misses, ...) with processor-specific counters (documentation).

See various practical examples and the documentation for more details.

Quick Start

Record Hardware Event Statistics

Recording hardware event statistics works much like perf stat: it counts key events, such as executed instructions, CPU cycles, and cache misses, over the execution of a code segment.

    #include <iostream>
    #include <perfcpp/event_counter.h>

    /// Initialize the counter
    auto event_counter = perf::EventCounter{};

    /// Specify hardware events to count
    event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});

    /// Run the workload
    event_counter.start();
    code_to_profile(); /// <-- Statistics are recorded during execution
    event_counter.stop();

    /// Print the result to the console
    const auto result = event_counter.result();
    for (const auto [event_name, value] : result) {
        std::cout << event_name << ": " << value << std::endl;
    }

Possible output:

    seconds: 0.0955897
    instructions: 5.92087e+07
    cycles: 4.70254e+08
    cache-misses: 1.35633e+07
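Raw counts like these are usually easier to interpret as ratios. perf-cpp has its own metric support (see the metrics documentation); the sketch below only illustrates the arithmetic by hand. The helper names `ratio`, `cycles_per_instruction`, and `misses_per_kilo_instruction` are hypothetical and not part of the perf-cpp interface.

```cpp
#include <optional>

/// Hypothetical helper (not part of perf-cpp): divides two counter values.
/// Returns std::nullopt if the denominator is zero, e.g., when an event was not counted.
std::optional<double> ratio(const double numerator, const double denominator)
{
    if (denominator == 0.0) {
        return std::nullopt;
    }
    return numerator / denominator;
}

/// Cycles per instruction (CPI): lower values mean more work is done per cycle.
std::optional<double> cycles_per_instruction(const double cycles, const double instructions)
{
    return ratio(cycles, instructions);
}

/// Cache misses per 1,000 instructions (MPKI).
std::optional<double> misses_per_kilo_instruction(const double cache_misses, const double instructions)
{
    const auto misses_per_instruction = ratio(cache_misses, instructions);
    if (!misses_per_instruction.has_value()) {
        return std::nullopt;
    }
    return misses_per_instruction.value() * 1000.0;
}
```

Applied to the possible output above, `cycles_per_instruction(4.70254e+08, 5.92087e+07)` comes to roughly 7.9 cycles per instruction, and `misses_per_kilo_instruction(1.35633e+07, 5.92087e+07)` to roughly 229 cache misses per 1,000 instructions.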

Note

For additional insights please refer to the guides on recording event statistics and event statistics on multiple CPUs/threads. Also, check out the hardware events documentation for details on both built-in and processor-specific events.

Record Samples

Recording samples works much like perf [mem] record: it captures execution snapshots (e.g., the instruction pointer, the executing CPU, and a timestamp) at regular intervals (here, every 50,000th CPU cycle).

    #include <iostream>
    #include <perfcpp/sampler.h>

    /// Create the sampler
    auto sampler = perf::Sampler{};

    /// Specify when a sample is recorded: every 50,000th cycle
    sampler.trigger("cycles", perf::Period{50000U});

    /// Specify what data is included in a sample: time, CPU ID, instruction
    sampler.values()
        .timestamp(true)
        .cpu_id(true)
        .instruction_pointer(true);

    /// Run the workload
    sampler.start();
    code_to_profile(); /// <-- Samples are recorded during execution
    sampler.stop();

    /// Print the samples to the console
    const auto samples = sampler.result();
    for (const auto& record : samples) {
        const auto timestamp = record.metadata().timestamp().value();
        const auto cpu_id = record.metadata().cpu_id().value();
        const auto instruction = record.instruction_execution().logical_instruction_pointer().value();

        std::cout
            << "Time = " << timestamp << " | CPU = " << cpu_id
            << " | Instruction = 0x" << std::hex << instruction << std::dec
            << std::endl;
    }

Possible output:

    Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
    Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
    Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
    Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c
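A common next step is to aggregate the samples, for example by counting how often each instruction pointer shows up, so that frequently sampled instructions stand out. The following is a minimal sketch that works on plain integers collected from the loop above; `count_by_instruction` is a hypothetical helper, not part of perf-cpp.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

/// Hypothetical helper (not part of perf-cpp): counts how often each sampled
/// instruction pointer occurs and returns (instruction pointer, count) pairs,
/// most frequent first. Ties are ordered by ascending instruction pointer.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
count_by_instruction(const std::vector<std::uint64_t>& instruction_pointers)
{
    /// Count the occurrences of every instruction pointer.
    auto counts = std::unordered_map<std::uint64_t, std::uint64_t>{};
    for (const auto instruction_pointer : instruction_pointers) {
        ++counts[instruction_pointer];
    }

    /// Copy the counts into a vector and sort by descending count.
    auto histogram = std::vector<std::pair<std::uint64_t, std::uint64_t>>(counts.begin(), counts.end());
    std::sort(histogram.begin(), histogram.end(), [](const auto& left, const auto& right) {
        if (left.second != right.second) {
            return left.second > right.second;
        }
        return left.first < right.first;
    });

    return histogram;
}
```

To use it, push each `instruction` value from the printing loop into a `std::vector<std::uint64_t>` and pass that vector to the helper. The first entries of the result are the most frequently sampled instructions.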

Note

For additional details, such as the types of data that can be included in samples, please consult the sampling guide. For parallel sampling, see the guide on sampling on multiple CPUs/threads.

More Examples

We include a collection of examples demonstrating the functionality and interface of perf-cpp in the examples/ directory, including

  • examples for counting hardware events (examples/statistics)
  • and for sampling (examples/sampling).

Building

perf-cpp is designed as a library (static or shared) that can be linked to your application.

    # Clone the repository
    git clone https://github.com/jmuehlig/perf-cpp.git

    # Switch to the repository folder
    cd perf-cpp

    # Optional: Switch to this development version
    git checkout v0.12.4

    # Build the library (in build/)
    # -DBUILD_EXAMPLES=1 compiles all examples (optional)
    # -DBUILD_LIB_SHARED=1 creates the library as a shared one (optional)
    # -DGEN_PROCESSOR_EVENTS=1 generates and compiles a .cpp file that adds events specific to the underlying CPU (optional)
    cmake . -B build -DBUILD_EXAMPLES=1
    cmake --build build

    # Optional: Build examples (in build/examples/bin) if -DBUILD_EXAMPLES=1
    cmake --build build --target examples

Note

Further information and detailed building instructions (e.g., how to integrate into CMake projects) are available in the building guide.

Full Documentation

Further Reading

  • Examples: Learn how to use different features from code examples.
  • Changelog: Stay updated with the latest changes and improvements.

System Requirements

  • Clang / GCC with support for C++17 features.
  • CMake version 3.10 or higher.
  • Linux kernel 4.0 or newer (note that some features require a newer kernel).
  • perf_event_paranoid setting: Adjust as needed to allow access to performance counters (see the Paranoid Value documentation).
  • Python3, if you make use of processor-specific hardware event generation.
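Before creating counters, an application may want to check the current perf_event_paranoid setting and print a helpful hint if access is likely to be restricted. The sketch below reads the value from `/proc/sys/kernel/perf_event_paranoid`, the usual location on Linux. Both function names are hypothetical and not part of perf-cpp.

```cpp
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

/// Hypothetical helper (not part of perf-cpp): parses the textual content of
/// perf_event_paranoid into an integer. Returns std::nullopt if no number is found.
std::optional<int> parse_paranoid_level(const std::string& text)
{
    auto stream = std::istringstream{text};
    int level = 0;
    if (stream >> level) {
        return level;
    }
    return std::nullopt;
}

/// Hypothetical helper: reads the setting from procfs. Returns std::nullopt if
/// the file cannot be opened (e.g., on non-Linux systems or in restricted sandboxes).
std::optional<int> read_paranoid_level(const std::string& path = "/proc/sys/kernel/perf_event_paranoid")
{
    auto file = std::ifstream{path};
    if (!file.is_open()) {
        return std::nullopt;
    }

    auto line = std::string{};
    std::getline(file, line);
    return parse_paranoid_level(line);
}
```

Lower values are more permissive (-1 lifts most restrictions for unprivileged users). Which value a particular feature needs depends on the kernel and distribution, so the Paranoid Value documentation remains the reference.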

Contribute and Contact

We welcome contributions and feedback. For feature requests, feedback, or bug reports, please reach out via our issue tracker or submit a pull request.

Alternatively, you can email me: jan.muehlig@tu-dortmund.de.


Further PMU-related Projects

Below is a non-exhaustive list of some other valuable profiling projects:

  • PAPI offers access not only to CPU performance counters but also to a variety of other hardware components including GPUs, I/O systems, and more.
  • Likwid is a collection of several command line tools for benchmarking, including an extensive wiki.
  • PerfEvent provides lightweight access to performance counters, facilitating streamlined performance monitoring.
  • Intel's Instrumentation and Tracing Technology allows applications to manage the collection of trace data effectively when used in conjunction with Intel VTune Profiler.
  • For those who prefer a more hands-on approach, the perf_event_open system call can be utilized directly without any wrappers.
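For comparison, counting a single event through the raw perf_event_open system call looks roughly like the sketch below, which follows the pattern shown in the system call's man page. It counts user-space instructions for the calling thread only. The function name `count_user_instructions` is hypothetical, and error handling is reduced to returning an empty optional.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>
#include <optional>

/// Hypothetical helper: counts user-space instructions executed by the calling
/// thread while `workload` runs. Returns std::nullopt (without running the
/// workload) if the counter cannot be opened, e.g., because of the
/// perf_event_paranoid setting, a missing PMU, or a sandbox; also returns
/// std::nullopt if the counter cannot be read afterwards.
template <typename Workload>
std::optional<std::uint64_t> count_user_instructions(Workload&& workload)
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;       /// Start disabled; enabled explicitly below.
    attr.exclude_kernel = 1; /// Count user space only.
    attr.exclude_hv = 1;     /// Do not count the hypervisor.

    /// pid = 0 and cpu = -1: the calling thread on any CPU; no group, no flags.
    const auto fd = static_cast<int>(::syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0) {
        return std::nullopt;
    }

    ::ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ::ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ::ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /// Without a read_format, the kernel returns a single 64-bit counter value.
    std::uint64_t count = 0U;
    const auto bytes_read = ::read(fd, &count, sizeof(count));
    ::close(fd);

    if (bytes_read != static_cast<ssize_t>(sizeof(count))) {
        return std::nullopt;
    }
    return count;
}
```

Even this single-counter case needs attribute setup, file-descriptor management, and manual decoding of the result. Groups of counters, multiplexing corrections, and sampling buffers add considerably more code on top.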

Resources about (Perf-) Profiling

This is a non-exhaustive list of academic research papers and blog articles. Feel free to add to it, including your own work, e.g., via pull request.

Academic Papers

Blog Posts