https://godbolt.org/z/MPPnvT5h8
The simple code:

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <span>

// Indexed loop over a raw pointer.
void swap_ptr_impl(int64_t* ptr, size_t len) {
    for (size_t i = 0; i < len; i++) {
        ptr[i] = std::byteswap(ptr[i]);
    }
}

// Pointer-increment loop over the same range.
void swap_ptr2_impl(int64_t* ptr, size_t len) {
    auto end = ptr + len;
    for (; ptr < end; ptr++) {
        *ptr = std::byteswap(*ptr);
    }
}

// Range-for over a span with dynamic extent.
void swap_span_impl(std::span<int64_t> sp) {
    for (auto& x : sp) {
        x = std::byteswap(x);
    }
}

// Range-for over a span with a fixed extent of 1024.
void swap_span_2(std::span<int64_t, 1024> sp) {
    for (auto& x : sp) {
        x = std::byteswap(x);
    }
}
```

`swap_ptr_impl` is 2x slower than the other functions on an i9-14900KF; 2.8x slower is seen on quickbench.
`swap_span_2` (span length known) is also 2x slower.
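For anyone reproducing this, it may help to confirm first that the indexed loop and the pointer-increment loop do exactly the same work, so that any timing gap can only come from code generation. The following is a minimal sketch, not the code from the godbolt link: `bswap64`, `swap_indexed`, `swap_pointer` and `variants_agree` are hypothetical names, and `bswap64` is a hand-written stand-in for `std::byteswap` so the sketch also builds before C++23.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for std::byteswap (assumption: used only so this
// sketch compiles without C++23). Reverses the eight bytes of v.
inline std::int64_t bswap64(std::int64_t v) {
    std::uint64_t x = static_cast<std::uint64_t>(v);
    x = ((x & 0x00000000FFFFFFFFULL) << 32) | ((x & 0xFFFFFFFF00000000ULL) >> 32);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x & 0xFFFF0000FFFF0000ULL) >> 16);
    x = ((x & 0x00FF00FF00FF00FFULL) << 8) | ((x & 0xFF00FF00FF00FF00ULL) >> 8);
    return static_cast<std::int64_t>(x);
}

// Indexed loop, same shape as swap_ptr_impl in the report.
void swap_indexed(std::int64_t* ptr, std::size_t len) {
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = bswap64(ptr[i]);
    }
}

// Pointer-increment loop, same shape as swap_ptr2_impl in the report.
void swap_pointer(std::int64_t* ptr, std::size_t len) {
    std::int64_t* end = ptr + len;
    for (; ptr < end; ptr++) {
        *ptr = bswap64(*ptr);
    }
}

// Returns true when both loops produce identical output and a second
// swap restores the original input.
bool variants_agree(std::size_t len) {
    std::vector<std::int64_t> a(len);
    for (std::size_t i = 0; i < len; i++) {
        a[i] = static_cast<std::int64_t>(i * 0x0102030405060708ULL + 1);
    }
    std::vector<std::int64_t> b = a;
    std::vector<std::int64_t> original = a;

    swap_indexed(a.data(), a.size());
    swap_pointer(b.data(), b.size());
    if (a != b) {
        return false;
    }

    swap_indexed(a.data(), a.size());
    return a == original;
}
```

If `variants_agree(1024)` holds, the two loops are interchangeable as far as results go, which is what makes the 2x gap surprising.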
```
Run on (32 X 3187 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
swap_ptr           400 ns          390 ns      1723077
swap_ptr2          184 ns          180 ns      4072727
swap_span          176 ns          165 ns      4072727
swap_span_2        403 ns          399 ns      1723077
```

With `-fno-vectorize`, the results are reasonable:
```
Run on (32 X 3187 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
swap_ptr           181 ns          184 ns      4072727
swap_ptr2          181 ns          180 ns      3733333
swap_span          173 ns          172 ns      3733333
swap_span_2        175 ns          173 ns      4072727
```

So I assume that there is something wrong in the loop vectorizer. Verified since clang 17.
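The numbers above look like Google Benchmark output. The harness itself is not shown in the report, so here is a rough sketch of how one could time a swap loop with only `std::chrono`, in case a dependency-free reproduction is useful. Everything here is an assumption for illustration: `SwapFn`, `example_kernel` and `average_ns` are hypothetical names, and `__builtin_bswap64` (a GCC/Clang builtin) is used in place of `std::byteswap` so the sketch builds without C++23. A proper benchmarking library will give more reliable numbers.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Signature shared by the pointer-based swap functions.
using SwapFn = void (*)(std::int64_t*, std::size_t);

// Example kernel with the same indexed-loop shape as swap_ptr_impl.
// Assumption: __builtin_bswap64 replaces std::byteswap for pre-C++23 builds.
void example_kernel(std::int64_t* ptr, std::size_t len) {
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = static_cast<std::int64_t>(
            __builtin_bswap64(static_cast<std::uint64_t>(ptr[i])));
    }
}

// Runs fn over a buffer of len elements reps times and returns the
// average wall-clock time per call in nanoseconds. Requires len > 0
// and reps > 0.
double average_ns(SwapFn fn, std::size_t len, int reps) {
    std::vector<std::int64_t> data(len, 0x0102030405060708LL);
    volatile std::int64_t sink = 0;  // keeps the result observable

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        fn(data.data(), data.size());
        sink = data[static_cast<std::size_t>(r) % len];
    }
    auto stop = std::chrono::steady_clock::now();

    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

Calling `average_ns(example_kernel, 1024, 100000)` in a binary built once normally and once with `-fno-vectorize` should show whether the gap reproduces on a given machine; the length 1024 matches the fixed span extent in the report, and the repetition count is arbitrary.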