Loop vectorizer generates inefficient code

Consider this simple code:

void swap_ptr_impl(int64_t* ptr, size_t len) {
    for (size_t i = 0; i < len; i++) {
        ptr[i] = std::byteswap(ptr[i]);
    }
}

void swap_ptr2_impl(int64_t* ptr, size_t len) {
    auto end = ptr + len;
    for (; ptr < end; ptr++) {
        *ptr = std::byteswap(*ptr);
    }
}

void swap_span_impl(std::span<int64_t> sp) {
    for (auto& x : sp) {
        x = std::byteswap(x);
    }
}

void swap_span_2(std::span<int64_t, 1024> sp) {
    for (auto& x : sp) {
        x = std::byteswap(x);
    }
}

On an i9-14900KF, swap_ptr_impl is about 2x slower than swap_ptr2_impl and swap_span_impl. On Quick Bench the gap is 2.8x.
swap_span_2 (span length known at compile time) is also about 2x slower.

Run on (32 X 3187 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
swap_ptr           400 ns          390 ns      1723077
swap_ptr2          184 ns          180 ns      4072727
swap_span          176 ns          165 ns      4072727
swap_span_2        403 ns          399 ns      1723077
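
The benchmark driver itself is not included in this report. As a rough way to reproduce the comparison without a benchmark library, a timing helper along these lines could be used. This is only a sketch: `avg_ns_per_call` is a hypothetical name, and it uses plain `std::chrono` timing, which is an assumption and much cruder than the harness that produced the output above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the original report): calls fn(ptr, len) on
// the buffer `iterations` times and returns the average wall-clock time per
// call in nanoseconds.
template <class Fn>
double avg_ns_per_call(Fn fn, std::vector<std::int64_t>& buf, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        // The buffer is modified in place, so the work stays observable
        // to the caller after the loop.
        fn(buf.data(), buf.size());
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Passing `swap_ptr_impl` and `swap_ptr2_impl` in turn, with a 1024-element buffer to match the fixed-extent span, should be enough to show whether the gap reproduces. The absolute numbers will not match a dedicated benchmark harness.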

With -fno-vectorize, the results are reasonable: all four functions run at about the same speed.

Run on (32 X 3187 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
swap_ptr           181 ns          184 ns      4072727
swap_ptr2          181 ns          180 ns      3733333
swap_span          173 ns          172 ns      3733333
swap_span_2        175 ns          173 ns      4072727

So I assume something is wrong in the loop vectorizer. The problem reproduces with clang 17 and later.
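
For anyone who needs a narrower workaround than -fno-vectorize, clang's loop pragma can disable vectorization for a single loop. The sketch below is an untested assumption, not something measured in this report: `swap_ptr_novec` and `bswap64` are hypothetical names, and `bswap64` falls back to a shift-based swap only so the snippet also builds where `std::byteswap` is unavailable.

```cpp
#include <bit>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: uses std::byteswap when the library provides it
// (C++23), otherwise a portable shift-and-mask byte reversal.
inline std::int64_t bswap64(std::int64_t v) {
#if defined(__cpp_lib_byteswap)
    return std::byteswap(v);
#else
    auto u = static_cast<std::uint64_t>(v);
    u = ((u & 0x00000000FFFFFFFFull) << 32) | (u >> 32);
    u = ((u & 0x0000FFFF0000FFFFull) << 16) | ((u >> 16) & 0x0000FFFF0000FFFFull);
    u = ((u & 0x00FF00FF00FF00FFull) << 8) | ((u >> 8) & 0x00FF00FF00FF00FFull);
    return static_cast<std::int64_t>(u);
#endif
}

// Hypothetical variant of swap_ptr_impl. The pragma asks clang not to
// vectorize this one loop; it is guarded so other compilers skip it.
void swap_ptr_novec(std::int64_t* ptr, std::size_t len) {
    assert(ptr != nullptr || len == 0);
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#endif
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = bswap64(ptr[i]);
    }
}
```

Whether this matches the -fno-vectorize timings would need to be confirmed by rerunning the benchmark with this variant.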

Loop vectorizer generates inefficient code #172217