Conversation

@EgorBo (Member) commented Jun 17, 2024

This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not present on the system. It makes XxHash128 faster; see #103555 (comment)

```csharp
public Vector128<long> Foo(Vector128<long> a, Vector128<long> b) => a * b;
```

Current codegen on x64 cpu without AVX512:

```asm
; Method MyBench:Foo
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[long]:Multiply(long,long):long] ;;; not inlined call!
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret
; Total bytes of code: 120
```

New codegen:

```asm
; Method MyBench:Foo
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [r9]
       vpmuludq xmm2, xmm1, xmm0
       vpshufd  xmm1, xmm1, -79
       vpmulld  xmm0, xmm1, xmm0
       vxorps   xmm1, xmm1, xmm1
       vphaddd  xmm0, xmm0, xmm1
       vpshufd  xmm0, xmm0, 115
       vpaddq   xmm0, xmm0, xmm2
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret
; Total bytes of code: 50
```
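For readers following along, here is a rough managed-code sketch of what the emitted sequence computes (the class and method names are mine, and this only illustrates the codegen above, not the JIT's actual expansion). Each 64-bit lane `a*b` is computed mod 2^64 as `aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector128MulSketch
{
    // Sketch of the non-AVX512 expansion; assumes SSE2/SSSE3/SSE4.1 are available.
    static Vector128<ulong> MultiplyNoAvx512(Vector128<ulong> a, Vector128<ulong> b)
    {
        // aLo * bLo as full 64-bit products per lane (vpmuludq)
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());

        // swap the 32-bit halves of each 64-bit lane of b (vpshufd, control 0xB1)
        Vector128<uint> bSwapped = Sse2.Shuffle(b.AsUInt32(), 0b10_11_00_01);

        // aLo*bHi and aHi*bLo, low 32 bits only, in adjacent dwords (vpmulld)
        Vector128<uint> cross = Sse41.MultiplyLow(a.AsUInt32(), bSwapped);

        // sum the adjacent cross products (vphaddd with zero)
        Vector128<int> sums = Ssse3.HorizontalAdd(cross.AsInt32(), Vector128<int>.Zero);

        // place each sum into the high dword of its 64-bit lane (vpshufd, control 0x73)
        Vector128<ulong> high = Sse2.Shuffle(sums.AsUInt32(), 0b01_11_00_11).AsUInt64();

        // low + (cross sums << 32) == the full 64-bit product per lane (vpaddq)
        return Sse2.Add(low, high);
    }
}
```

Because the result is reduced mod 2^64, the same sequence serves both `long` and `ulong` lanes.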
@dotnet-policy-service (Contributor) commented:

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@EgorBo (Member, Author) commented Jun 17, 2024

Note: results should be better when we do it in the JIT, since that enables loop hoisting, CSE, etc. for the MUL.
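As an illustration (not code from the PR), a loop like the one below is where an importer-level expansion pays off: the sub-expressions that depend only on the loop-invariant `scale` operand become visible to the JIT's hoisting and CSE passes, which a library-level software fallback would not get.

```csharp
using System;
using System.Runtime.Intrinsics;

static class HoistingExample
{
    // Hypothetical loop: 'scale' is loop-invariant, so with the multiply
    // expanded in the JIT its invariant sub-expressions can be hoisted/CSE'd.
    static void ScaleInPlace(Span<Vector128<long>> values, Vector128<long> scale)
    {
        for (int i = 0; i < values.Length; i++)
        {
            values[i] *= scale;
        }
    }
}
```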

@neon-sunset (Contributor) commented:

Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.

EgorBo and others added 2 commits June 17, 2024 17:01
…sics/Vector128_1.cs Co-authored-by: Tanner Gooding <tagoo@outlook.com>
dotnet deleted a comment from EgorBot on Jun 20, 2024
dotnet deleted a comment from EgorBot on Jun 20, 2024
@EgorBo (Member, Author) commented Jun 20, 2024

@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

```csharp
using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}
```
@EgorBot commented Jun 20, 2024

Benchmark results on Intel

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ITXSAG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XSORFZ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
```

| Method         | Toolchain | Mean     | Error    | Ratio |
|----------------|-----------|---------:|---------:|------:|
| BenchXxHash128 | Main      | 43.41 μs | 0.087 μs |  1.00 |
| BenchXxHash128 | PR        | 43.33 μs | 0.009 μs |  1.00 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot commented Jun 20, 2024

Benchmark results on AMD

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-SUBLYH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-OPUYDY : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
```

| Method         | Toolchain | Mean     | Error    | Ratio |
|----------------|-----------|---------:|---------:|------:|
| BenchXxHash128 | Main      | 71.20 μs | 0.022 μs |  1.00 |
| BenchXxHash128 | PR        | 43.84 μs | 0.013 μs |  0.62 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot commented Jun 20, 2024

Benchmark results on Arm64

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-EDPWDU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TIALUR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
```

| Method         | Toolchain | Mean     | Error   | Ratio |
|----------------|-----------|---------:|--------:|------:|
| BenchXxHash128 | Main      | 116.9 μs | 0.11 μs |  1.00 |
| BenchXxHash128 | PR        | 116.8 μs | 0.07 μs |  1.00 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBo (Member, Author) commented Jun 21, 2024

/azp list

@azure-pipelines

This comment was marked as resolved.

@EgorBo (Member, Author) commented Jun 21, 2024

/azp run runtime-coreclr jitstress-isas-x86

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).
@EgorBo (Member, Author) commented Jun 21, 2024

@tannergooding PTAL. I'll add arm64 separately; I need to test different impls there.
I've expanded it in the importer, similar to the existing op_Multiply expansions.

Benchmark improvement: #103555 (comment)

@EgorBo requested a review from tannergooding on June 21, 2024 at 12:29
@EgorBo marked this pull request as ready for review on June 24, 2024 at 14:26
Comment on lines +21627 to +21631
```cpp
// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
GenTreeHWIntrinsic* tmp3 =
    gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                             is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                             CORINFO_TYPE_UINT, simdSize);
```
A reviewer (Member) commented:

I know in other places we've started avoiding hadd in favor of shuffle+add; it might be worth seeing whether that's appropriate here too (low priority, non-blocking).
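For illustration only, a shuffle+add variant of the cross-term handling could look like the sketch below in managed terms for the 128-bit case (the names `low`/`cross` follow the sketch earlier in the thread, not the JIT's actual nodes, and this is one possible approach, not a statement of what the JIT should emit). Swapping adjacent dwords, adding, and shifting each qword left by 32 replaces the `phaddd` + `pshufd` pair and avoids the SSSE3 dependency of `phaddd`.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ShuffleAddSketch
{
    // 'cross' = adjacent aLo*bHi / aHi*bLo dwords, 'low' = aLo*bLo qwords,
    // as in the earlier sketch. Illustrative only.
    static Vector128<ulong> AddCrossTerms(Vector128<ulong> low, Vector128<uint> cross)
    {
        Vector128<uint> swapped = Sse2.Shuffle(cross, 0b10_11_00_01);         // [c1, c0, c3, c2]
        Vector128<uint> sums    = Sse2.Add(cross, swapped);                   // [c0+c1, c0+c1, c2+c3, c2+c3]
        Vector128<ulong> high   = Sse2.ShiftLeftLogical(sums.AsUInt64(), 32); // keep each sum in the high dword
        return Sse2.Add(low, high);                                           // low + (sum << 32)
    }
}
```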

@EgorBo (Member, Author) replied:

I tried benchmarking different implementations for it and they were all equally fast, e.g. #99871 (comment).

```cpp
if (TARGET_POINTER_SIZE == 4)
{
    // TODO-XARCH-CQ: We should support long/ulong multiplication
    // TODO-XARCH-CQ: 32bit support
```
A reviewer (Member) commented:

What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?

@EgorBo (Member, Author) replied:

Not sure, to be honest; that check was pre-existing, I only changed the comment.
