Conversation

@EgorBo (Member) commented Jun 17, 2024

This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not present on the system. It makes XxHash128 faster; see #103555 (comment)

```csharp
public Vector128<long> Foo(Vector128<long> a, Vector128<long> b) => a * b;
```

Current codegen on x64 cpu without AVX512:

```asm
; Method MyBench:Foo
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[long]:Multiply(long,long):long] ;;; not inlined call!
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret
; Total bytes of code: 120
```

New codegen:

```asm
; Method MyBench:Foo
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [r9]
       vpmuludq xmm2, xmm1, xmm0
       vpshufd  xmm1, xmm1, -79
       vpmulld  xmm0, xmm1, xmm0
       vxorps   xmm1, xmm1, xmm1
       vphaddd  xmm0, xmm0, xmm1
       vpshufd  xmm0, xmm0, 115
       vpaddq   xmm0, xmm0, xmm2
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret
; Total bytes of code: 50
```
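For readers following along, here is a rough managed-code sketch of what the emitted sequence computes (the class and method names are mine, and this only illustrates the codegen above, not the JIT's actual expansion). Each 64-bit lane `a*b` is computed mod 2^64 as `aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector128MulSketch
{
    // Sketch of the non-AVX512 expansion; assumes SSE2/SSSE3/SSE4.1 are available.
    static Vector128<ulong> MultiplyNoAvx512(Vector128<ulong> a, Vector128<ulong> b)
    {
        // aLo * bLo as full 64-bit products per lane (vpmuludq)
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());

        // swap the 32-bit halves of each 64-bit lane of b (vpshufd, control 0xB1)
        Vector128<uint> bSwapped = Sse2.Shuffle(b.AsUInt32(), 0b10_11_00_01);

        // aLo*bHi and aHi*bLo, low 32 bits only, in adjacent dwords (vpmulld)
        Vector128<uint> cross = Sse41.MultiplyLow(a.AsUInt32(), bSwapped);

        // sum the adjacent cross products (vphaddd with zero)
        Vector128<int> sums = Ssse3.HorizontalAdd(cross.AsInt32(), Vector128<int>.Zero);

        // place each sum into the high dword of its 64-bit lane (vpshufd, control 0x73)
        Vector128<ulong> high = Sse2.Shuffle(sums.AsUInt32(), 0b01_11_00_11).AsUInt64();

        // low + (cross sums << 32) == the full 64-bit product per lane (vpaddq)
        return Sse2.Add(low, high);
    }
}
```

Because the result is reduced mod 2^64, the same sequence serves both `long` and `ulong` lanes.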
@dotnet-policy-service (Contributor) commented:

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@EgorBo (Member, Author) commented Jun 17, 2024

Note: results should be better when we do it in the JIT, since that enables loop hoisting, CSE, etc. for the MUL.
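As an illustration (not code from the PR), a loop like the one below is where an importer-level expansion pays off: the sub-expressions that depend only on the loop-invariant `scale` operand become visible to the JIT's hoisting and CSE passes, which a library-level software fallback would not get.

```csharp
using System;
using System.Runtime.Intrinsics;

static class HoistingExample
{
    // Hypothetical loop: 'scale' is loop-invariant, so with the multiply
    // expanded in the JIT its invariant sub-expressions can be hoisted/CSE'd.
    static void ScaleInPlace(Span<Vector128<long>> values, Vector128<long> scale)
    {
        for (int i = 0; i < values.Length; i++)
        {
            values[i] *= scale;
        }
    }
}
```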

@neon-sunset (Contributor) commented:

Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.

EgorBo and others added 2 commits June 17, 2024 17:01
…sics/Vector128_1.cs Co-authored-by: Tanner Gooding <tagoo@outlook.com>
dotnet deleted a comment from EgorBot on Jun 20, 2024
dotnet deleted a comment from EgorBot on Jun 20, 2024
@EgorBo (Member, Author) commented Jun 20, 2024

@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

```csharp
using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}
```
@EgorBot commented Jun 20, 2024

Benchmark results on Intel

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ITXSAG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XSORFZ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
```

| Method         | Toolchain | Mean     | Error    | Ratio |
|----------------|-----------|---------:|---------:|------:|
| BenchXxHash128 | Main      | 43.41 μs | 0.087 μs |  1.00 |
| BenchXxHash128 | PR        | 43.33 μs | 0.009 μs |  1.00 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot commented Jun 20, 2024

Benchmark results on AMD

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-SUBLYH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-OPUYDY : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
```

| Method         | Toolchain | Mean     | Error    | Ratio |
|----------------|-----------|---------:|---------:|------:|
| BenchXxHash128 | Main      | 71.20 μs | 0.022 μs |  1.00 |
| BenchXxHash128 | PR        | 43.84 μs | 0.013 μs |  0.62 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot commented Jun 20, 2024

Benchmark results on Arm64

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-EDPWDU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TIALUR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
```

| Method         | Toolchain | Mean     | Error   | Ratio |
|----------------|-----------|---------:|--------:|------:|
| BenchXxHash128 | Main      | 116.9 μs | 0.11 μs |  1.00 |
| BenchXxHash128 | PR        | 116.8 μs | 0.07 μs |  1.00 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBo (Member, Author) commented Jun 21, 2024

/azp list

@azure-pipelines

This comment was marked as resolved.

@EgorBo (Member, Author) commented Jun 21, 2024

/azp run runtime-coreclr jitstress-isas-x86

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).
@EgorBo (Member, Author) commented Jun 21, 2024

@tannergooding PTAL. I'll add arm64 separately; I need to test different impls there.
I've expanded it in the importer, similar to the existing op_Multiply expansions.

Benchmark improvement: #103555 (comment)

@EgorBo requested a review from tannergooding on June 21, 2024 at 12:29
@EgorBo marked this pull request as ready for review on June 24, 2024 at 14:26
Comment on lines +21627 to +21631
```cpp
// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
GenTreeHWIntrinsic* tmp3 =
    gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                             is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                             CORINFO_TYPE_UINT, simdSize);
```
A reviewer (Member) commented:

I know in other places we've started avoiding hadd in favor of shuffle+add; it might be worth seeing whether that's appropriate here too (low priority, non-blocking).
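For illustration only, a shuffle+add variant of the cross-term handling could look like the sketch below in managed terms for the 128-bit case (the names `low`/`cross` follow the sketch earlier in the thread, not the JIT's actual nodes, and this is one possible approach, not a statement of what the JIT should emit). Swapping adjacent dwords, adding, and shifting each qword left by 32 replaces the `phaddd` + `pshufd` pair and avoids the SSSE3 dependency of `phaddd`.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ShuffleAddSketch
{
    // 'cross' = adjacent aLo*bHi / aHi*bLo dwords, 'low' = aLo*bLo qwords,
    // as in the earlier sketch. Illustrative only.
    static Vector128<ulong> AddCrossTerms(Vector128<ulong> low, Vector128<uint> cross)
    {
        Vector128<uint> swapped = Sse2.Shuffle(cross, 0b10_11_00_01);         // [c1, c0, c3, c2]
        Vector128<uint> sums    = Sse2.Add(cross, swapped);                   // [c0+c1, c0+c1, c2+c3, c2+c3]
        Vector128<ulong> high   = Sse2.ShiftLeftLogical(sums.AsUInt64(), 32); // keep each sum in the high dword
        return Sse2.Add(low, high);                                           // low + (sum << 32)
    }
}
```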

@EgorBo (Member, Author) replied:

I tried benchmarking different implementations for it and they were all equally fast, e.g. #99871 (comment).

```cpp
if (TARGET_POINTER_SIZE == 4)
{
    // TODO-XARCH-CQ: We should support long/ulong multiplication
    // TODO-XARCH-CQ: 32bit support
```
A reviewer (Member) commented:

What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?

@EgorBo (Member, Author) replied:

Not sure, to be honest; that check was pre-existing, I only changed the comment.
