
Loop vectorizer generates inefficient code #172217

@zijinshanren

https://godbolt.org/z/MPPnvT5h8

A simple reproducer:

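    #include <bit>
    #include <cstddef>
    #include <cstdint>
    #include <span>

    // Indexed loop over a raw pointer.
    void swap_ptr_impl(int64_t* ptr, size_t len) {
        for (size_t i = 0; i < len; i++) {
            ptr[i] = std::byteswap(ptr[i]);
        }
    }

    // Pointer-increment loop over the same range.
    void swap_ptr2_impl(int64_t* ptr, size_t len) {
        auto end = ptr + len;
        for (; ptr < end; ptr++) {
            *ptr = std::byteswap(*ptr);
        }
    }

    // Range-based loop over a span with dynamic extent.
    void swap_span_impl(std::span<int64_t> sp) {
        for (auto& x : sp) {
            x = std::byteswap(x);
        }
    }

    // Range-based loop over a span with a fixed extent of 1024.
    void swap_span_2(std::span<int64_t, 1024> sp) {
        for (auto& x : sp) {
            x = std::byteswap(x);
        }
    }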

On an i9-14900KF, swap_ptr_impl is about 2x slower than swap_ptr2_impl and swap_span_impl. On quickbench the gap is about 2.8x.
swap_span_2 (span length known at compile time) is also about 2x slower.

    Run on (32 X 3187 MHz CPU s)
    CPU Caches:
      L1 Data 48 KiB (x16)
      L1 Instruction 32 KiB (x16)
      L2 Unified 2048 KiB (x16)
      L3 Unified 36864 KiB (x1)
    ------------------------------------------------------
    Benchmark            Time             CPU   Iterations
    ------------------------------------------------------
    swap_ptr           400 ns          390 ns      1723077
    swap_ptr2          184 ns          180 ns      4072727
    swap_span          176 ns          165 ns      4072727
    swap_span_2        403 ns          399 ns      1723077
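For anyone who wants to reproduce timings like these without the original benchmark setup, here is a minimal sketch of a timing harness. It is not the harness used in this report. `ns_per_call` and `swap_indexed` are hypothetical names, `__builtin_bswap64` (a Clang/GCC builtin) stands in for `std::byteswap` so the code builds without C++23, and the empty `asm volatile` statement is a GCC/Clang-specific compiler barrier.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for std::byteswap (C++23) so the sketch builds with older standards.
// __builtin_bswap64 is a Clang/GCC builtin.
inline std::int64_t bswap64(std::int64_t v) {
    return static_cast<std::int64_t>(
        __builtin_bswap64(static_cast<std::uint64_t>(v)));
}

// Indexed-loop variant, mirroring swap_ptr_impl from the report.
void swap_indexed(std::int64_t* ptr, std::size_t len) {
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = bswap64(ptr[i]);
    }
}

// Hypothetical helper: average wall-clock nanoseconds per call of
// fn(data, len) over `reps` calls. `reps` must be greater than zero.
template <class Fn>
double ns_per_call(Fn fn, std::size_t len, int reps) {
    std::vector<std::int64_t> data(len);
    for (std::size_t i = 0; i < len; i++) {
        data[i] = static_cast<std::int64_t>(i);
    }
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        fn(data.data(), len);
        // Compiler barrier so the stores cannot be optimized away.
        asm volatile("" : : "r"(data.data()) : "memory");
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / reps;
}
```

A call such as `ns_per_call(swap_indexed, 1024, 100000)` returns the average time per call. The array length behind the numbers above is not stated in the report; 1024 is only borrowed from the extent of swap_span_2.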

With -fno-vectorize, the results are reasonable: all four functions run at roughly the same speed.

    Run on (32 X 3187 MHz CPU s)
    CPU Caches:
      L1 Data 48 KiB (x16)
      L1 Instruction 32 KiB (x16)
      L2 Unified 2048 KiB (x16)
      L3 Unified 36864 KiB (x1)
    ------------------------------------------------------
    Benchmark            Time             CPU   Iterations
    ------------------------------------------------------
    swap_ptr           181 ns          184 ns      4072727
    swap_ptr2          181 ns          180 ns      3733333
    swap_span          173 ns          172 ns      3733333
    swap_span_2        175 ns          173 ns      4072727
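Since -fno-vectorize applies to the whole translation unit, a narrower experiment is to disable vectorization for one loop with Clang's `#pragma clang loop vectorize(disable)`. The sketch below shows that on the indexed-loop variant. `swap_ptr_novec` is a hypothetical name, and `__builtin_bswap64` again stands in for `std::byteswap` so the code builds without C++23. Whether the pragma gives the same timings as -fno-vectorize was not measured in this report.

```cpp
#include <cstddef>
#include <cstdint>

// Indexed loop as in swap_ptr_impl, with the loop vectorizer disabled
// for this loop only. Compilers other than Clang skip the pragma.
void swap_ptr_novec(std::int64_t* ptr, std::size_t len) {
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#endif
    for (std::size_t i = 0; i < len; i++) {
        // __builtin_bswap64 is a Clang/GCC builtin used in place of std::byteswap.
        ptr[i] = static_cast<std::int64_t>(
            __builtin_bswap64(static_cast<std::uint64_t>(ptr[i])));
    }
}
```

Benchmarking this function next to the unmodified swap_ptr_impl in the same build would help show whether the slowdown comes from the vectorized loop itself, without changing how the other functions are compiled.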

So I assume that something is wrong in the loop vectorizer. The behavior is reproducible since Clang 17.
