[libc] Use UMAXV.4S to reduce bcmp result. #99260

pcc · 2024-07-17T00:51:19Z

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Created using spr 1.3.6-beta.1 [skip ci]

Created using spr 1.3.6-beta.1

llvmbot · 2024-07-17T00:51:49Z

@llvm/pr-subscribers-libc

Author: None (pcc)

Changes

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Full diff: https://github.com/llvm/llvm-project/pull/99260.diff

1 Files Affected:

(modified) libc/src/string/memory_utils/op_aarch64.h (+6-12)

diff --git a/libc/src/string/memory_utils/op_aarch64.h b/libc/src/string/memory_utils/op_aarch64.h index 1090ea2617f09..5c08a6ae48b04 100644 --- a/libc/src/string/memory_utils/op_aarch64.h +++ b/libc/src/string/memory_utils/op_aarch64.h @@ -84,8 +84,7 @@ template <size_t Size> struct Bcmp { uint8x16_t a = vld1q_u8(_p1); uint8x16_t n = vld1q_u8(_p2); uint8x16_t an = veorq_u8(a, n); - uint32x2_t an_reduced = vqmovn_u64(vreinterpretq_u64_u8(an)); - return vmaxv_u32(an_reduced); + return vmaxvq_u32(vreinterpretq_u32_u8(an)); } else if constexpr (Size == 32) { auto _p1 = as_u8(p1); auto _p2 = as_u8(p2); @@ -97,12 +96,9 @@ template <size_t Size> struct Bcmp { uint8x16_t bo = veorq_u8(b, o); // anbo = (a ^ n) | (b ^ o). At least one byte is nonzero if there is // a difference between the two buffers. We reduce this value down to 4 - // bytes in two steps. First, calculate the saturated move value when - // going from 2x64b to 2x32b. Second, compute the max of the 2x32b to get - // a single 32 bit nonzero value if a mismatch occurred. + // bytes using the UMAXV instruction to compute the max across the vector. uint8x16_t anbo = vorrq_u8(an, bo); - uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo)); - return vmaxv_u32(anbo_reduced); + return vmaxvq_u32(vreinterpretq_u32_u8(anbo)); } else if constexpr ((Size % BlockSize) == 0) { for (size_t offset = 0; offset < Size; offset += BlockSize) if (auto value = Bcmp<BlockSize>::block(p1 + offset, p2 + offset)) @@ -129,8 +125,7 @@ template <size_t Size> struct Bcmp { uint8x16_t bo = veorq_u8(b, o); // anbo = (a ^ n) | (b ^ o) uint8x16_t anbo = vorrq_u8(an, bo); - uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo)); - return vmaxv_u32(anbo_reduced); + return vmaxvq_u32(vreinterpretq_u32_u8(anbo)); } else if constexpr (Size == 32) { auto _p1 = as_u8(p1); auto _p2 = as_u8(p2); @@ -150,9 +145,8 @@ template <size_t Size> struct Bcmp { uint8x16_t cpdq = vorrq_u8(cp, dq); // abnocpdq = ((a ^ n) | (b ^ o)) | ((c ^ p) | (d ^ q)). Reduce this to // a nonzero 32 bit value if a mismatch occurred. - uint64x2_t abnocpdq = vreinterpretq_u64_u8(anbo | cpdq); - uint32x2_t abnocpdq_reduced = vqmovn_u64(abnocpdq); - return vmaxv_u32(abnocpdq_reduced); + uint8x16_t abnocpdq = anbo | cpdq; + return vmaxvq_u32(vreinterpretq_u32_u8(abnocpdq)); } else { static_assert(cpp::always_false<decltype(Size)>, "SIZE not implemented"); }

SchrodingerZhu · 2024-08-03T05:40:37Z

Hi,

Thank you for the patch. Unfortunately, I think the proposed change is causing failures in tests:

Ran 5 tests. PASS: 5 FAIL: 0 [4171/5229] Running unit test libc.test.src.stdio.snprintf_test.__unit__ FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.snprintf_test.__unit__.__build__ [==========] Running 2 tests from 1 test suite. [ RUN ] LlvmLibcSNPrintfTest.CutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:23: FAILURE Expected: buff Which is: To be equal to: "A simple string" Which is: A simple string [ FAILED ] LlvmLibcSNPrintfTest.CutOff [ RUN ] LlvmLibcSNPrintfTest.NoCutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:53: FAILURE Expected: buff Which is: To be equal to: "A simple string with no conversions." Which is: A simple string with no conversions. [ FAILED ] LlvmLibcSNPrintfTest.NoCutOff Ran 2 tests. PASS: 0 FAIL: 2 [4172/5229] Running unit test libc.test.src.stdio.vsnprintf_test.__unit__ FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.vsnprintf_test.__unit__.__build__ [==========] Running 2 tests from 1 test suite. [ RUN ] LlvmLibcVSNPrintfTest.CutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:35: FAILURE Expected: buff Which is: To be equal to: "A simple string" Which is: A simple string [ FAILED ] LlvmLibcVSNPrintfTest.CutOff [ RUN ] LlvmLibcVSNPrintfTest.NoCutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:64: FAILURE Expected: buff Which is: To be equal to: "A simple string with no conversions." Which is: A simple string with no conversions. [ FAILED ] LlvmLibcVSNPrintfTest.NoCutOff Ran 2 tests. PASS: 0 FAIL: 2 [4173/5229] Running unit test libc.test.src.stdio.fprintf_test.__unit__ FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.fprintf_test.__unit__.__build__ [==========] Running 1 test from 1 test suite. [ RUN ] LlvmLibcFPrintfTest.WriteToFile /home/schrodinger/development/llvm-project/libc/test/src/stdio/fprintf_test.cpp:65: FAILURE Expected: printf_test::fread(data, 1, sizeof(simple) - 1, file) Which is: 0 To be equal to: sizeof(simple) - 1 Which is: 37 [ FAILED ] LlvmLibcFPrintfTest.WriteToFile Ran 1 tests. PASS: 0 FAIL: 1 [4184/5229] Linking CXX executable libc/test/src/stdio/libc.test.src.stdio.remove_test.__hermetic__.__build__ ninja: build stopped: subcommand failed.

SchrodingerZhu

See previous comment

pcc · 2025-08-15T03:14:07Z

can't reproduce this test failure. I did:

cmake -G Ninja -DLLVM_ENABLE_RUNTIMES=libc -DLLVM_ENABLE_PROJECTS=clang\;lld -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD=AArch64 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm ninja libc ninja -C ./runtimes/runtimes-bins check-libc

all tests passed.

SchrodingerZhu · 2025-08-15T04:08:54Z

I will test it again. Thanks for letting me know.On Thu, Aug 14, 2025 at 23:14, Peter Collingbourne ***@***.***> wrote: Reopened #99260. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are on a team that was mentioned.Message ID: ***@***.***>

pcc · 2025-09-06T00:58:35Z

I will test it again. Thanks for letting me know.On Thu, Aug 14, 2025 at 23:14, Peter Collingbourne @.> wrote: Reopened #99260. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are on a team that was mentioned.Message ID: @.>

Did you have a chance to do this?

overmighty · 2025-09-06T07:28:50Z

FYI, ~2 weeks ago I tried to reproduce your benchmark results on Android and I managed to cross-compile the benchmark from main and run it, but couldn't get it to compile from this PR's branch.

pcc · 2025-09-12T21:02:01Z

Ping

FYI, ~2 weeks ago I tried to reproduce your benchmark results on Android and I managed to cross-compile the benchmark from main and run it, but couldn't get it to compile from this PR's branch.

Yeah, you might need to rebase to main, it rebases cleanly here.

pcc · 2025-10-10T23:48:47Z

Ping

michaelrj-google

Looks good to me

Created using spr 1.3.6-beta.1 [skip ci]

Created using spr 1.3.6-beta.1

We can use UMAXV.4S to reduce the comparison result in a single instruction. This improves performance by roughly 4% on Apple M1: Summary bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 (1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch) Reviewers: michaelrj-google, gchatelet, overmighty, SchrodingerZhu Pull Request: llvm/llvm-project#99260

We can use UMAXV.4S to reduce the comparison result in a single instruction. This improves performance by roughly 4% on Apple M1: Summary bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 (1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch) Reviewers: michaelrj-google, gchatelet, overmighty, SchrodingerZhu Pull Request: llvm#99260

DamonFool and others added 2 commits July 16, 2024 17:51

[𝘀𝗽𝗿] changes to main this commit is based on

6f0bbb0

Created using spr 1.3.6-beta.1 [skip ci]

[𝘀𝗽𝗿] initial version

841c2e8

Created using spr 1.3.6-beta.1

llvmbot added the libc label Jul 17, 2024

pcc requested a review from gchatelet July 17, 2024 00:51

lntue changed the title ~~libc: Use UMAXV.4S to reduce bcmp result.~~ [libc] Use UMAXV.4S to reduce bcmp result. Jul 17, 2024

SchrodingerZhu requested changes Aug 3, 2024

View reviewed changes

pcc closed this Aug 5, 2024

pcc reopened this Aug 15, 2025

lntue requested a review from overmighty August 26, 2025 01:03

michaelrj-google approved these changes Oct 13, 2025

View reviewed changes

pveras and others added 2 commits October 13, 2025 11:20

[𝘀𝗽𝗿] changes introduced through rebase

776b912

Created using spr 1.3.6-beta.1 [skip ci]

Rebase

db12e3b

Created using spr 1.3.6-beta.1

pcc changed the title ~~[libc] Use UMAXV.4S to reduce bcmp result.~~ libc: Use UMAXV.4S to reduce bcmp result. Oct 13, 2025

pcc changed the title ~~libc: Use UMAXV.4S to reduce bcmp result.~~ [libc] Use UMAXV.4S to reduce bcmp result. Oct 13, 2025

pcc changed the base branch from users/pcc/spr/main.libc-use-umaxv4s-to-reduce-bcmp-result to main October 13, 2025 18:21

pcc merged commit 7905ec3 into main Oct 13, 2025
11 of 13 checks passed

pcc deleted the users/pcc/spr/libc-use-umaxv4s-to-reduce-bcmp-result branch October 13, 2025 18:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[libc] Use UMAXV.4S to reduce bcmp result. #99260

[libc] Use UMAXV.4S to reduce bcmp result. #99260

Uh oh!

pcc commented Jul 17, 2024

llvmbot commented Jul 17, 2024

SchrodingerZhu commented Aug 3, 2024

SchrodingerZhu left a comment

pcc commented Aug 15, 2025

SchrodingerZhu commented Aug 15, 2025 via email

pcc commented Sep 6, 2025

overmighty commented Sep 6, 2025

pcc commented Sep 12, 2025

pcc commented Oct 10, 2025

michaelrj-google left a comment

Uh oh!

Labels

8 participants

[libc] Use UMAXV.4S to reduce bcmp result. #99260

[libc] Use UMAXV.4S to reduce bcmp result. #99260

Uh oh!

Conversation

pcc commented Jul 17, 2024

llvmbot commented Jul 17, 2024

SchrodingerZhu commented Aug 3, 2024

SchrodingerZhu left a comment

Choose a reason for hiding this comment

pcc commented Aug 15, 2025

SchrodingerZhu commented Aug 15, 2025 via email

pcc commented Sep 6, 2025

overmighty commented Sep 6, 2025

pcc commented Sep 12, 2025

pcc commented Oct 10, 2025

michaelrj-google left a comment

Choose a reason for hiding this comment

Uh oh!

Labels

8 participants