Skip to content

Conversation

@pcc
Copy link
Contributor

@pcc pcc commented Jul 17, 2024

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

DamonFool and others added 2 commits July 16, 2024 17:51
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
@llvmbot llvmbot added the libc label Jul 17, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 17, 2024

@llvm/pr-subscribers-libc

Author: None (pcc)

Changes

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)


Full diff: https://github.com/llvm/llvm-project/pull/99260.diff

1 Files Affected:

  • (modified) libc/src/string/memory_utils/op_aarch64.h (+6-12)
diff --git a/libc/src/string/memory_utils/op_aarch64.h b/libc/src/string/memory_utils/op_aarch64.h index 1090ea2617f09..5c08a6ae48b04 100644 --- a/libc/src/string/memory_utils/op_aarch64.h +++ b/libc/src/string/memory_utils/op_aarch64.h @@ -84,8 +84,7 @@ template <size_t Size> struct Bcmp { uint8x16_t a = vld1q_u8(_p1); uint8x16_t n = vld1q_u8(_p2); uint8x16_t an = veorq_u8(a, n); - uint32x2_t an_reduced = vqmovn_u64(vreinterpretq_u64_u8(an)); - return vmaxv_u32(an_reduced); + return vmaxvq_u32(vreinterpretq_u32_u8(an)); } else if constexpr (Size == 32) { auto _p1 = as_u8(p1); auto _p2 = as_u8(p2); @@ -97,12 +96,9 @@ template <size_t Size> struct Bcmp { uint8x16_t bo = veorq_u8(b, o); // anbo = (a ^ n) | (b ^ o). At least one byte is nonzero if there is // a difference between the two buffers. We reduce this value down to 4 - // bytes in two steps. First, calculate the saturated move value when - // going from 2x64b to 2x32b. Second, compute the max of the 2x32b to get - // a single 32 bit nonzero value if a mismatch occurred. + // bytes using the UMAXV instruction to compute the max across the vector. uint8x16_t anbo = vorrq_u8(an, bo); - uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo)); - return vmaxv_u32(anbo_reduced); + return vmaxvq_u32(vreinterpretq_u32_u8(anbo)); } else if constexpr ((Size % BlockSize) == 0) { for (size_t offset = 0; offset < Size; offset += BlockSize) if (auto value = Bcmp<BlockSize>::block(p1 + offset, p2 + offset)) @@ -129,8 +125,7 @@ template <size_t Size> struct Bcmp { uint8x16_t bo = veorq_u8(b, o); // anbo = (a ^ n) | (b ^ o) uint8x16_t anbo = vorrq_u8(an, bo); - uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo)); - return vmaxv_u32(anbo_reduced); + return vmaxvq_u32(vreinterpretq_u32_u8(anbo)); } else if constexpr (Size == 32) { auto _p1 = as_u8(p1); auto _p2 = as_u8(p2); @@ -150,9 +145,8 @@ template <size_t Size> struct Bcmp { uint8x16_t cpdq = vorrq_u8(cp, dq); // abnocpdq = ((a ^ n) | (b ^ o)) | ((c ^ p) | (d ^ q)). Reduce this to // a nonzero 32 bit value if a mismatch occurred. - uint64x2_t abnocpdq = vreinterpretq_u64_u8(anbo | cpdq); - uint32x2_t abnocpdq_reduced = vqmovn_u64(abnocpdq); - return vmaxv_u32(abnocpdq_reduced); + uint8x16_t abnocpdq = anbo | cpdq; + return vmaxvq_u32(vreinterpretq_u32_u8(abnocpdq)); } else { static_assert(cpp::always_false<decltype(Size)>, "SIZE not implemented"); } 
@pcc pcc requested a review from gchatelet July 17, 2024 00:51
@lntue lntue changed the title libc: Use UMAXV.4S to reduce bcmp result. [libc] Use UMAXV.4S to reduce bcmp result. Jul 17, 2024
@SchrodingerZhu
Copy link
Contributor

Hi,

Thank you for the patch. Unfortunately, I think the proposed change is causing failures in tests:

Ran 5 tests. PASS: 5 FAIL: 0 [4171/5229] Running unit test libc.test.src.stdio.snprintf_test.__unit__ FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.snprintf_test.__unit__.__build__ [==========] Running 2 tests from 1 test suite. [ RUN ] LlvmLibcSNPrintfTest.CutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:23: FAILURE Expected: buff Which is: To be equal to: "A simple string" Which is: A simple string [ FAILED ] LlvmLibcSNPrintfTest.CutOff [ RUN ] LlvmLibcSNPrintfTest.NoCutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:53: FAILURE Expected: buff Which is: To be equal to: "A simple string with no conversions." Which is: A simple string with no conversions. [ FAILED ] LlvmLibcSNPrintfTest.NoCutOff Ran 2 tests. PASS: 0 FAIL: 2 [4172/5229] Running unit test libc.test.src.stdio.vsnprintf_test.__unit__ FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.vsnprintf_test.__unit__.__build__ [==========] Running 2 tests from 1 test suite. [ RUN ] LlvmLibcVSNPrintfTest.CutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:35: FAILURE Expected: buff Which is: To be equal to: "A simple string" Which is: A simple string [ FAILED ] LlvmLibcVSNPrintfTest.CutOff [ RUN ] LlvmLibcVSNPrintfTest.NoCutOff /home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:64: FAILURE Expected: buff Which is: To be equal to: "A simple string with no conversions." Which is: A simple string with no conversions. [ FAILED ] LlvmLibcVSNPrintfTest.NoCutOff Ran 2 tests. PASS: 0 FAIL: 2 [4173/5229] Running unit test libc.test.src.stdio.fprintf_test.__unit__ FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.fprintf_test.__unit__.__build__ [==========] Running 1 test from 1 test suite. [ RUN ] LlvmLibcFPrintfTest.WriteToFile /home/schrodinger/development/llvm-project/libc/test/src/stdio/fprintf_test.cpp:65: FAILURE Expected: printf_test::fread(data, 1, sizeof(simple) - 1, file) Which is: 0 To be equal to: sizeof(simple) - 1 Which is: 37 [ FAILED ] LlvmLibcFPrintfTest.WriteToFile Ran 1 tests. PASS: 0 FAIL: 1 [4184/5229] Linking CXX executable libc/test/src/stdio/libc.test.src.stdio.remove_test.__hermetic__.__build__ ninja: build stopped: subcommand failed. 
Copy link
Contributor

@SchrodingerZhu SchrodingerZhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment

@pcc pcc closed this Aug 5, 2024
@pcc
Copy link
Contributor Author

pcc commented Aug 15, 2025

can't reproduce this test failure. I did:

cmake -G Ninja -DLLVM_ENABLE_RUNTIMES=libc -DLLVM_ENABLE_PROJECTS=clang\;lld -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD=AArch64 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm ninja libc ninja -C ./runtimes/runtimes-bins check-libc 

all tests passed.

@pcc pcc reopened this Aug 15, 2025
@SchrodingerZhu
Copy link
Contributor

SchrodingerZhu commented Aug 15, 2025 via email

@lntue lntue requested a review from overmighty August 26, 2025 01:03
@pcc
Copy link
Contributor Author

pcc commented Sep 6, 2025

I will test it again. Thanks for letting me know.On Thu, Aug 14, 2025 at 23:14, Peter Collingbourne @.> wrote: Reopened #99260. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are on a team that was mentioned.Message ID: @.>

Did you have a chance to do this?

@overmighty
Copy link
Member

FYI, ~2 weeks ago I tried to reproduce your benchmark results on Android and I managed to cross-compile the benchmark from main and run it, but couldn't get it to compile from this PR's branch.

@pcc
Copy link
Contributor Author

pcc commented Sep 12, 2025

Ping

FYI, ~2 weeks ago I tried to reproduce your benchmark results on Android and I managed to cross-compile the benchmark from main and run it, but couldn't get it to compile from this PR's branch.

Yeah, you might need to rebase to main, it rebases cleanly here.

@pcc
Copy link
Contributor Author

pcc commented Oct 10, 2025

Ping

Copy link
Contributor

@michaelrj-google michaelrj-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

pveras and others added 2 commits October 13, 2025 11:20
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
@pcc pcc changed the title [libc] Use UMAXV.4S to reduce bcmp result. libc: Use UMAXV.4S to reduce bcmp result. Oct 13, 2025
@pcc pcc changed the title libc: Use UMAXV.4S to reduce bcmp result. [libc] Use UMAXV.4S to reduce bcmp result. Oct 13, 2025
@pcc pcc changed the base branch from users/pcc/spr/main.libc-use-umaxv4s-to-reduce-bcmp-result to main October 13, 2025 18:21
@pcc pcc merged commit 7905ec3 into main Oct 13, 2025
11 of 13 checks passed
@pcc pcc deleted the users/pcc/spr/libc-use-umaxv4s-to-reduce-bcmp-result branch October 13, 2025 18:21
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Oct 13, 2025
We can use UMAXV.4S to reduce the comparison result in a single instruction. This improves performance by roughly 4% on Apple M1: Summary bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 (1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch) Reviewers: michaelrj-google, gchatelet, overmighty, SchrodingerZhu Pull Request: llvm/llvm-project#99260
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025
We can use UMAXV.4S to reduce the comparison result in a single instruction. This improves performance by roughly 4% on Apple M1: Summary bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 (1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch) Reviewers: michaelrj-google, gchatelet, overmighty, SchrodingerZhu Pull Request: llvm#99260
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

8 participants