Posted on May 24, 2024 • Edited on Aug 2, 2024

Bringing up BPI-F3 - Part 2.5

This is a sort of intermission.

Getting perf to work up to a point

Apparently the OpenSBI-mediated access to the performance counters does not map the usual cycles and instructions events in a way that works with perf record. I got this board mainly to help with dav1d development efforts, so not having perf support would make it harder to reason about performance.

The best workaround, after a discussion in the forums, is to build the pmu-events so that they include the custom ones, and then rely on the overly precise CPU-specific events instead:

$ perf list | grep cycle
  bus-cycles  [Hardware event]
  cpu-cycles OR cycles  [Hardware event]
  ref-cycles  [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend  [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend  [Hardware event]
  m_mode_cycle  [M-mode cycles]
  rtu_flush_cycle
  s_mode_cycle  [S-mode cycles]
  stalled_cycle_backend  [Stalled cycles backend]
  stalled_cycle_frontend  [Stalled cycles frontend]
  u_mode_cycle  [U-mode cycles]
  vidu_total_cycle
  vidu_vec0_cycle
  vidu_vec1_cycle
  ...

$ perf list | grep inst
  branch-instructions OR branches  [Hardware event]
  instructions  [Hardware event]
  br_inst  [Branch instructions]
  cond_br_inst  [Conditional branch instructions]
  indirect_br_inst  [Indirect branch instructions]
  taken_cond_br_inst  [Taken conditional branch instructions]
  uncond_br_inst  [Unconditional branch instructions]
instruction:
  alu_inst  [ALU (integer) instructions]
  amo_inst  [AMO instructions]
  atomic_inst  [Atomic instructions]
  bus_fence_inst  [Bus FENCE instructions]
  csr_inst  [CSR instructions]
  div_inst  [Division instructions]
  ecall_inst  [ECALL instructions]
  failed_sc_inst  [Failed SC instructions]
  fence_inst  [FENCE instructions]
  fp_div_inst  [Floating-point division instructions]
  fp_inst  [Floating-point instructions]
  fp_load_inst  [Floating-point load instructions]
  fp_store_inst  [Floating-point store instructions]
  load_inst  [Load instructions]
  lr_inst  [LR instructions]
  mult_inst  [Multiplication instructions]
  sc_inst  [SC instructions]
  store_inst  [Store instructions]
  unaligned_load_inst  [Unaligned load instructions]
  unaligned_store_inst  [Unaligned store instructions]
  vector_div_inst  [Vector division instructions]
  vector_inst  [Vector instructions]
  vector_load_inst  [Vector load instructions]
  vector_store_inst  [Vector store instructions]
  id_inst_pipedown  [ID instruction pipedowns]
  id_one_inst_pipedown  [ID one instruction pipedowns]
  issued_inst  [Issued instructions]
  rf_inst_pipedown  [RF instruction pipedowns]
  rf_one_inst_pipedown  [RF one instruction pipedowns]

Building perf

Perf's way of dealing with CPU-specific events is through some machinery called jevents.

It lives in tools/perf/pmu-events and you can trigger it manually with:

./jevents.py riscv arch pmu-events.c

This produces C code from a bunch of JSON files and a CSV map file.
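Before running jevents it can help to check that the custom JSON files are actually under the directory it walks. The following is only a sketch: count_events is a hypothetical helper, not part of perf, and it assumes nothing more than the "EventName" key that the pmu-events JSON files use for each event.

```shell
# count_events DIR: count the "EventName" entries in every *.json file
# found under DIR (hypothetical helper, not part of the perf tree).
count_events() {
    find "$1" -name '*.json' -exec grep -h -o '"EventName"' {} + | wc -l | tr -d ' '
}

# Usage, from tools/perf/pmu-events (the path is where riscv JSON files
# normally live; adjust it to wherever the custom ones were added):
#   count_events arch/riscv
```

A count that does not grow after adding the vendor JSON files suggests they landed in a directory jevents never visits.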

The first time I built the sources I tried to cut the build down by setting most of the NO_{} make variables, and left NO_JEVENTS=1 among them. Luckily I fixed it after noticing that my output differed from what was shown in the forum.

## I assume you have here the custom linux sources
cd /usr/src/pi-linux/tools/perf
## being lazy I disabled about everything instead of installing dependencies, one time I disabled too much.
make -j 8 V=1 VF=1 \
  HOSTCC=riscv64-unknown-linux-gnu-gcc HOSTLD=riscv64-unknown-linux-gnu-ld \
  CC=riscv64-unknown-linux-gnu-gcc CXX=riscv64-unknown-linux-gnu-g++ \
  AR=riscv64-unknown-linux-gnu-ar LD=riscv64-unknown-linux-gnu-ld \
  NM=riscv64-unknown-linux-gnu-nm PKG_CONFIG=riscv64-unknown-linux-gnu-pkg-config \
  prefix=/usr bindir_relative=bin tipdir=share/doc/perf-6.8 \
  'EXTRA_CFLAGS=-O2 -pipe' 'EXTRA_LDFLAGS=-Wl,-O1 -Wl,--as-needed' \
  ARCH=riscv BUILD_BPF_SKEL= BUILD_NONDISTRO=1 JDIR= CORESIGHT= GTK2= \
  feature-gtk2-infobar= \
  NO_AUXTRACE= NO_BACKTRACE= NO_DEMANGLE= NO_JEVENTS=0 NO_JVMTI=1 \
  NO_LIBAUDIT=1 NO_LIBBABELTRACE=1 NO_LIBBIONIC=1 NO_LIBBPF=1 NO_LIBCAP=1 \
  NO_LIBCRYPTO= NO_LIBDW_DWARF_UNWIND= NO_LIBELF= NO_LIBNUMA=1 NO_LIBPERL=1 \
  NO_LIBPFM4=1 NO_LIBPYTHON=1 NO_LIBTRACEEVENT= NO_LIBUNWIND=1 NO_LIBZSTD=1 \
  NO_SDT=1 NO_SLANG=1 NO_LZMA=1 NO_ZLIB= TCMALLOC= WERROR=0 \
  LIBDIR=/usr/libexec/perf-core libdir=/usr/lib64 \
  plugindir=/usr/lib64/perf/plugins \
  -f Makefile.perf install
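Since a build with NO_JEVENTS=1 still produces a working perf, just one without the vendor events, a quick check after installing saves some head-scratching. A minimal sketch, assuming the freshly built perf is the one found in PATH; missing_events is a hypothetical helper that only filters the text of perf list.

```shell
# missing_events EVENT...: read `perf list` output on stdin and print
# every EVENT that does not appear in it as a whole word.
# (Hypothetical helper; it only greps text and never calls perf itself.)
missing_events() {
    listing=$(cat)
    for ev in "$@"; do
        printf '%s\n' "$listing" | grep -qw -- "$ev" || printf '%s\n' "$ev"
    done
}

# Usage: no output means all the named events are known to this perf.
#   perf list | missing_events u_mode_cycle alu_inst vector_inst
```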

Now I have a perf where cycles and instructions still do not work with perf record. I wonder if there is a way at the OpenSBI or kernel level to aggregate events to make them work properly, but I never had to look into perf internals, so I will probably poke at it much later if nobody addresses it first. Anyway,

perf record --group -e u_mode_cycle,m_mode_cycle,s_mode_cycle

produces something close enough for cycles, although u_mode_cycle alone is enough.

For instructions the situation is a bit more annoying:

perf record --group -e alu_inst,amo_inst,atomic_inst,fp_div_inst,fp_inst,fp_load_inst,fp_store_inst,load_inst,lr_inst,mult_inst,sc_inst,store_inst,unaligned_load_inst,unaligned_store_inst

comes close to counting all the scalar instructions, but trying to add vector_div_inst,vector_inst,vector_load_inst,vector_store_inst somehow makes perf record silently stop collecting samples. Adding just 3 more events works though, so I guess I can be happy with u_mode_cycle,alu_inst,atomic_inst,fp_inst,vector_inst at least.
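One way to live with that limit is to split a long event list into smaller groups and record each group in a separate run. This is only a sketch of the idea, not something tested on the board: chunk_events is a hypothetical helper, and the group size that still collects samples is an assumption you would have to find by trial.

```shell
# chunk_events LIST MAX: print the comma-separated LIST as lines holding
# at most MAX events each (hypothetical helper, plain text processing).
chunk_events() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -v max="$2" '
        NF == 0 { next }                       # skip empty entries
        { line = (n % max == 0) ? $0 : line "," $0; n++ }
        n % max == 0 { print line; line = "" } # group is full
        END { if (line != "") print line }     # leftover partial group
    '
}

# Usage (dry run: prints one command per group; the group size of 3 and
# ./my_benchmark are placeholders):
#   chunk_events "alu_inst,fp_inst,load_inst,store_inst,vector_inst" 3 |
#   while read -r group; do
#       echo perf record --group -e "u_mode_cycle,$group" -- ./my_benchmark
#   done
```

Keeping u_mode_cycle in every group gives each run a common reference to compare against.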
