Skip to content

Conversation

@briansull
Copy link
Contributor

No description provided.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2020
@tannergooding
Copy link
Member

CC. @CarolEidt, @echesakovMSFT

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any HWIntrinsics we don't want CSE'd or any that need special handling (such as the non-temporal load/stores)?

Copy link
Contributor

@CarolEidt CarolEidt Feb 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent thought - I would say this should include:

  • any of the "special" loads (LoadFence, any NonTemporal
  • any store (i.e. HW_Category_MemoryStore and StoreFence)
  • any prefetch
  • MemoryFence

Perhaps we should consider adding a flag for this, but a reasonable proxy in the meantime would be to exclude any HW_Category_MemoryStore, HW_Category_MemoryLoad and HW_Category_Special:

  • There would be few, if any, loads that the JIT could identify as CSE's (local stack loads would generally not show up as actual load intrinsics).
  • The only HW_Category_Special that I found that this would unnecessarily exclude is CompareLessThan
@briansull
Copy link
Contributor Author

One issue that I ran into while working on this feature was that the numArgs for the SIMD and HW_INTRINSICS is not always set to a useful value. Value numbering needs to know the exact number of arguments to expect and some instructions have a "-1" for numArgs or specify 2 when then can accept either 1 or 2.

For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments.
Currently I just bail out when the table specifies a numArgs of -1.

@briansull
Copy link
Contributor Author

briansull commented Feb 7, 2020

For true binary operations, it would be useful to have a "commutative" column that indicates that the operands can be safely swapped.

@tannergooding
Copy link
Member

tannergooding commented Feb 7, 2020

For true binary operations, it would be useful to have a "communitive" column that indicates that the operands can be safely swapped.

You can use static bool IsCommutative(NamedIntrinsic id): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L202. You would get the id via AsHWIntrinsic()->gtHWIntrinsicId

This is likewise hooked up through the standard OperIsCommutative API.

Its also worth noting that there are some intrinsics, like FusedMultiplyAdd, which are commutative but have 3 operands and require special handling to account for this (the operation is a fused a * b + c and so the a and b operands are commutative).

For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments.

This would make lookup during imporation quite a bit more expensive. We already have to check the names, I don't think we want to also have to start checking other parameters.

There are existing helper methods such as static int lookupNumArgs(const GenTreeHWIntrinsic* node): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L135 which will return the correct number of arguments for a given node and will attempt to do it as efficiently as possible.

@briansull
Copy link
Contributor Author

There are existing helper methods such as static int lookupNumArgs(const GenTreeHWIntrinsic* node): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L135 which will return the correct number of arguments for a given node and will attempt to do it as efficiently as possible.

The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc.

enum VNFunc { // Implicitly, elements of genTreeOps here. VNF_Boundary = GT_COUNT, #define ValueNumFuncDef(nm, arity, commute, knownNonNull, sharedStatic) VNF_##nm, #include "valuenumfuncs.h" VNF_COUNT }; 
@tannergooding
Copy link
Member

tannergooding commented Feb 8, 2020

The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc

@CarolEidt, do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries, for example)?

It would make the table larger but I think we could probably cut some of the current lookup cost and still make this manageable (like having a table per column as we've done elsewhere; or tracking the first/last entry for a given ISA).

@CarolEidt
Copy link
Contributor

do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries

It would be unfortunate, but I don't think it would be too prohibitive to split it out by number of args, but I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull?

@briansull
Copy link
Contributor Author

I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull?

Actually the type size isn't part of the the value number, so there are a couple of nodes where Value Numbering adds an additional arg that holds a constant represent the size of the operation. We do that for GT_CAST and I also had to add this for the SIMD Init operation., As one test was CSE-ing two different sized Vector Inits: 3-way and 4-way. So it awkward but it is something that we can already handle.

 case SIMDIntrinsicInit: { // Also encode the resulting type as op2vnp ValueNumPair op2vnp; op2vnp.SetBoth(vnStore->VNForIntCon(INT32(tree->TypeGet()))); excSetPair = op1Xvnp; normalPair = vnStore->VNPairForFunc(tree->TypeGet(), GetVNFuncForNode(tree), op1vnp, op2vnp); tree->gtVNPair = vnStore->VNPWithExc(normalPair, excSetPair); return; } 
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also encode the result type for all of the above SIMD operations.

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks!

@AndyAyersMS
Copy link
Member

I would like to see new tests added here, especially given the rather complex IR surface.

Can you also show the impact on FX codegen and/or tests? I am curious how much coverage we're likely to get organically.

How robust will this be as people add new intrinsics?

@briansull
Copy link
Contributor Author

briansull commented Feb 22, 2020

Beginning PMI PerfScore Diffs for benchstones and benchmarks game D:\jit-diffs\runtime\x64\dasmset_98 Total PerfScoreUnits of diff: -3149.45 (-0.46% of base) diff is an improvement. Top file regressions (PerfScoreUnits): 43.10 : Span\SpanBench\SpanBench.dasm (0.22% of base) Top file improvements (PerfScoreUnits): -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base) -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base) -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base) 4 total files with Perf Score differences (3 improved, 1 regressed), 78 unchanged. Top method regressions (percentages): 43.10 (14.43% of base) : Span\SpanBench\SpanBench.dasm - SpanBench:TestSpanFill(ref,int) (7 methods) 5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this Top method improvements (percentages): -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods) -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods) -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -15.93 (-1.58% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this -0.75 (-0.96% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this -2.25 (-0.49% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera -0.75 (-0.48% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this -0.60 (-0.18% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:DoBench(int,int):ref 24 total methods with Perf Score differences (22 improved, 2 regressed), 1869 unchanged. 
@briansull
Copy link
Contributor Author

briansull commented Feb 22, 2020

Beginning PMI PerfScore Diffs for System.Private.CoreLib.dll Completed PMI PerfScore Diffs for System.Private.CoreLib.dll in 148.60s in D:\jit-diffs\runtime\x64\dasmset_96 Top method regressions (percentages): 11.35 ( 1.83% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int 11.95 ( 1.66% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,ushort,int):int 0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:LessThanAny(Vector`1,Vector`1):bool (6 methods) 0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanAny(Vector`1,Vector`1):bool (6 methods) 1.00 ( 0.12% of base) : System.Private.CoreLib.dasm - SpanHelpers:Contains(byref,ushort,int):bool (2 methods) Top method improvements (percentages): -6.70 (-22.98% of base) : System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this -12.65 (-19.69% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3 -12.65 (-18.67% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3 -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods) -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods) -6.70 (-16.48% of base) : System.Private.CoreLib.dasm - Vector3:Length():float:this -8.25 (-14.44% of base) : System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4 -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods) -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods) -3.00 (-12.50% of base) : System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAny(Vector`1,Vector`1):bool (6 methods) -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAny(Vector`1,Vector`1):bool (6 methods) -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods) -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAll(Vector`1,Vector`1):bool (6 methods) -31.50 (-9.40% of base) : System.Private.CoreLib.dasm - Vector:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (8 methods) -22.50 (-8.74% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods) -3.00 (-8.45% of base) : System.Private.CoreLib.dasm - Vector4:Length():float:this -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Min(Vector`1,Vector`1):Vector`1 (6 methods) -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Max(Vector`1,Vector`1):Vector`1 (6 methods) -14.15 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Min(Vector`1,Vector`1):Vector`1 (6 methods) -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Max(Vector`1,Vector`1):Vector`1 (6 methods) -21.60 (-5.67% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4 -1.00 (-5.00% of base) : System.Private.CoreLib.dasm - Vector2:LengthSquared():float:this -1.50 (-4.93% of base) : System.Private.CoreLib.dasm - VectorMath:Lerp(Vector128`1,Vector128`1,Vector128`1):Vector128`1 -9.90 (-3.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4 -1.00 (-3.17% of base) : System.Private.CoreLib.dasm - Vector2:Length():float:this -17.85 (-2.64% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool -10.02 (-2.07% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4 -2.17 (-1.56% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Default(long,long):long -1.64 (-1.42% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiByte_Default(long,long):long -1.50 (-0.87% of base) : System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long (2 methods) -1.26 (-0.51% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long (2 methods) -0.75 (-0.23% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateBillboard(Vector3,Vector3,Vector3,Vector3):Matrix4x4 -0.85 (-0.19% of base) : System.Private.CoreLib.dasm - SpanHelpers:LastIndexOf(byref,ushort,int):int -0.45 (-0.08% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int 42 total methods with Perf Score differences (36 improved, 6 regressed), 22012 unchanged. 
@briansull
Copy link
Contributor Author

briansull commented Feb 22, 2020

The regression in Span\SpanBench\SpanBench.dasm - SpanBench:TestSpanFill(ref,int) (7 methods)
We make a simd32 into a CSE, which requires an additional two caller saved ymm registers to be spilled in the prolog. And this ymm register is live across a call, which since only half of the register is preserved we have to save and restore half of it at a callsite:

prolog

+ vmovaps qword ptr [rsp+70H], xmm6 + vmovaps qword ptr [rsp+60H], xmm7 

Loop:

+ vextractf128 ymm7, ymm6, 1 call Span`1:Fill(Vector`1):this inc edi cmp edi, esi + vinsertf128 ymm6, ymm6, ymm7, 1 
}

#ifdef FEATURE_SIMD
// SIMD types will cause a SIMD register to be spilled/restored
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "may cause".
I think you should have a single comment that talks about the heuristics for estimating register save/restore/spill costs. First, it is conservative to assume that each SIMD CSE will require an additional register to be saved/restored in prolog/epilog, as there may be disjoint CSE live ranges that can utilize the same register.
Second, this code presumes that the CSEs will be live arcoss a call, which may be true more often than not, but it also assumes that each use will require a restore, though it may be that multiple uses will not span a call.
While it may be fine to be conservative in these ways, I think we should accurately describe those conservative assumptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

{
spillSimdRegInProlog++; // we likely need to spill an extra register to save the upper half

// Increase the cse_use_cost since at use sites we have to generate code to spill/restore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, "we have to" should be "we may have to"

@briansull
Copy link
Contributor Author

briansull commented Feb 25, 2020

Here is an example where value number of SIMD types produced a very nice win:

runtime\src\coreclr\tests\src\JIT\Regression\JitBlue\GitHub_11816\GitHub_11816.cs:

 static void DoSomeWorkWithAStruct(ref Vector<float> source, out Vector<float> result) { StructType u; u.A = new Vector<float>(2) * source; u.B = new Vector<float>(3) * source; u.C = new Vector<float>(4) * source; u.D = new Vector<float>(5) * source; u.E = new Vector<float>(6) * source; u.F = new Vector<float>(7) * source; result = u.A + u.B + u.C + u.D + u.E + u.F; } 

We create the following CSE's to hold source, u.A, u.B, u.C, u.D, U.E and u.F

; V10 cse0 [V10,T10] ( 2, 2 ) simd32 -> mm0 "CSE - aggressive" ; V11 cse1 [V11,T11] ( 2, 2 ) simd32 -> mm2 "CSE - aggressive" ; V12 cse2 [V12,T12] ( 2, 2 ) simd32 -> mm3 "CSE - aggressive" ; V13 cse3 [V13,T13] ( 2, 2 ) simd32 -> mm4 "CSE - aggressive" ; V14 cse4 [V14,T14] ( 2, 2 ) simd32 -> mm5 "CSE - aggressive" ; V15 cse5 [V15,T15] ( 2, 2 ) simd32 -> mm1 "CSE - aggressive" ; V16 cse6 [V16,T03] ( 7, 7 ) simd32 -> mm1 "CSE - aggressive" 

Reducing the code from:

 vmovupd ymmword ptr[rsp+08H], ymm0 vbroadcastss ymm0, ymmword ptr[reloc @RWD04] vmovupd ymm1, ymmword ptr[rcx] vmulps ymm0, ymm1 vmovupd ymmword ptr[rsp+28H], ymm0 vbroadcastss ymm0, ymmword ptr[reloc @RWD08] vmovupd ymm1, ymmword ptr[rcx] vmulps ymm0, ymm1 vmovupd ymmword ptr[rsp+48H], ymm0 vbroadcastss ymm0, ymmword ptr[reloc @RWD12] vmovupd ymm1, ymmword ptr[rcx] vmulps ymm0, ymm1 vmovupd ymmword ptr[rsp+68H], ymm0 vbroadcastss ymm0, ymmword ptr[reloc @RWD16] vmovupd ymm1, ymmword ptr[rcx] vmulps ymm0, ymm1 vmovupd ymmword ptr[rsp+88H], ymm0 vbroadcastss ymm0, ymmword ptr[reloc @RWD20] vmovupd ymm1, ymmword ptr[rcx] vmulps ymm0, ymm1 vmovupd ymmword ptr[rsp+A8H], ymm0 vmovupd ymm0, ymmword ptr[rsp+08H] vmovupd ymm1, ymmword ptr[rsp+28H] vaddps ymm0, ymm1 vmovupd ymm1, ymmword ptr[rsp+48H] vaddps ymm0, ymm1 vmovupd ymm1, ymmword ptr[rsp+68H] vaddps ymm0, ymm1 vmovupd ymm1, ymmword ptr[rsp+88H] vaddps ymm0, ymm1 vmovupd ymm1, ymmword ptr[rsp+A8H] vaddps ymm0, ymm1 vmovupd ymmword ptr[rdx], ymm0 

To:

 vbroadcastss ymm2, ymmword ptr[reloc @RWD04] vmulps ymm2, ymm1 vbroadcastss ymm3, ymmword ptr[reloc @RWD08] vmulps ymm3, ymm1 vbroadcastss ymm4, ymmword ptr[reloc @RWD12] vmulps ymm4, ymm1 vbroadcastss ymm5, ymmword ptr[reloc @RWD16] vmulps ymm5, ymm1 vbroadcastss ymm6, ymmword ptr[reloc @RWD20] vmulps ymm1, ymm6, ymm1 vaddps ymm0, ymm2 vaddps ymm0, ymm3 vaddps ymm0, ymm4 vaddps ymm0, ymm5 vaddps ymm0, ymm1 vmovupd ymmword ptr[rdx], ymm0 

PerfScore:

OLD> ; Total bytes of code 255, prolog size 32, PerfScore 175.50, (MethodHash=2eeeff7d) for method GitHub_11816:DoSomeWorkWithAStruct(byref,byref) NEW> ; Total bytes of code 131, prolog size 12, PerfScore 89.10, (MethodHash=2eeeff7d) for method GitHub_11816:DoSomeWorkWithAStruct(byref,byref) 
 
@briansull
Copy link
Contributor Author

We perform a nice CSE for the a[i] and b[i] operations in this test case:

src\coreclr\tests\src\JIT\SIMD>emacs BitwiseOperations.cs

 static int TestBool() { Random random = new Random(13); byte[] arr1 = GenerateByteArray(64, random); byte[] arr2 = GenerateByteArray(64, random); var a = new System.Numerics.Vector<byte>(arr1); var b = new System.Numerics.Vector<byte>(arr2); var xorR = a ^ b; var andR = a & b; var orR = a | b; int Count = System.Numerics.Vector<byte>.Count; for (int i = 0; i < Count; ++i) { int d = a[i] ^ b[i]; if (xorR[i] != d) { return 0; } d = a[i] & b[i]; if (andR[i] != d) { return 0; } d = a[i] | b[i]; if (orR[i] != d) { return 0; } } return 100; } 
@briansull
Copy link
Contributor Author

@AndyAyersMS I believe that we have a reasonable amount of test coverage here.

I will add a Config variable to allow us to disable this new functionality and revert to the old behavior.

@briansull
Copy link
Contributor Author

briansull commented Feb 26, 2020

We also get this CSE now in:
coreclr\tests\src\JIT\Performance\CodeQuality\SIMD\ConsoleMandel\Abstractions.cs

Real * Imaginary

 [MethodImplAttribute(MethodImplOptions.AggressiveInlining)] public ComplexVecFloat square() { return new ComplexVecFloat(Real * Real - Imaginary * Imaginary, Real * Imaginary + Real * Imaginary); } 
CSE candidate #03, vn=$189 in BB01, [cost= 8, size= 7]: N014 ( 8, 7) CSE #03 (use)[000020] ---XG------- * SIMD simd32 float * <l:$24f, c:$252> N009 ( 3, 2) CSE #01 (use)[000017] *--XG------- +--* IND simd32 <l:$241, c:$250> N008 ( 1, 1) [000016] ------------ | \--* LCL_VAR byref V00 this u:1 Zero Fseq[Real] $80 N013 ( 4, 4) CSE #02 (use)[000019] *--XG------- \--* IND simd32 <l:$246, c:$251> N012 ( 2, 2) [000069] -------N---- \--* ADD byref $280 N010 ( 1, 1) [000018] ------------ +--* LCL_VAR byref V00 this u:1 (last use) $80 N011 ( 1, 1) [000068] ------------ \--* CNS_INT long 32 field offset Fseq[Imaginary] $c1 
G_M53767_IG02: vmovupd ymm0, ymmword ptr[rcx] vmulps ymm1, ymm0, ymm0 vmovupd ymm2, ymmword ptr[rcx+32] vmulps ymm3, ymm2, ymm2 vsubps ymm1, ymm3 < vmulps ymm3, ymm0, ymm2 vmulps ymm0, ymm2 < vaddps ymm0, ymm3, ymm0 > vaddps ymm0, ymm0 vmovupd ymmword ptr[rdx], ymm1 vmovupd ymmword ptr[rdx+32], ymm0 mov rax, rdx <	;; bbWeight=1 PerfScore 32.25 >	;; bbWeight=1 PerfScore 29.25 G_M53767_IG03: vzeroupper ret	;; bbWeight=1 PerfScore 2.00 <; Total bytes of code 52, prolog size 3, PerfScore 41.45, (MethodHash=c9862df9) for method ComplexVecFloat:square():ComplexVecFloat:this >; Total bytes of code 48, prolog size 3, PerfScore 37.95, (MethodHash=c9862df9) for method ComplexVecFloat:square():ComplexVecFloat:this 
 Default 0, ValueNumbering of SIMD nodes and HW Intrinsic nodes enabled If 1, then disable ValueNumbering of SIMD nodes If 2, then disable ValueNumbering of HW Intrinsic nodes If 3, disable both SIMD and HW Intrinsic nodes
@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies Completed PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies in 685.82s Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_108\base d:\jit-diffs\runtime\x64\dasmset_108\diff Analyzing PerfScore diffs... PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit Summary of Perf Score diffs: (Lower is better) Total PerfScoreUnits of diff: -1179.71 (-0.00% of base) diff is an improvement. Top file improvements (PerfScoreUnits): -536.77 : System.Private.CoreLib.dasm (-0.02% of base) -196.30 : Microsoft.Diagnostics.FastSerialization.dasm (-0.07% of base) -113.13 : System.Data.Common.dasm (-0.00% of base) -75.97 : System.Linq.Parallel.dasm (-0.01% of base) -47.00 : System.Text.Encodings.Web.dasm (-0.34% of base) -44.70 : Newtonsoft.Json.dasm (-0.00% of base) -34.90 : System.Collections.dasm (-0.02% of base) -31.99 : System.Collections.Immutable.dasm (-0.01% of base) -20.18 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base) -18.77 : System.Threading.Channels.dasm (-0.03% of base) -13.15 : System.Private.Xml.Linq.dasm (-0.02% of base) Top method regressions (percentages): 11.35 ( 1.83% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int 11.95 ( 1.66% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,ushort,int):int 1.50 ( 0.79% of base) : System.Collections.Immutable.dasm - Node:Add(Vector`1):Node:this 1.05 ( 0.48% of base) : System.Linq.dasm - Lookup`2:GetGrouping(Vector`1,bool):Grouping`2:this 2.03 ( 0.44% of base) : System.Collections.Immutable.dasm - Builder:set_Count(int):this (7 methods) 0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:LessThanAny(Vector`1,Vector`1):bool (6 methods) 0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanAny(Vector`1,Vector`1):bool (6 methods) 1.00 ( 0.12% of base) : System.Private.CoreLib.dasm - SpanHelpers:Contains(byref,ushort,int):bool (2 methods) 0.34 ( 0.08% of base) : Microsoft.Diagnostics.FastSerialization.dasm - GrowableArray`1:set_Count(int):this (7 methods) 1.30 ( 0.07% of base) : System.Collections.Immutable.dasm - ImmutableSortedSet`1:LeafToRootRefill(IEnumerable`1):ImmutableSortedSet`1:this (7 methods) Top method improvements (percentages): -6.70 (-22.98% of base) : System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this -12.65 (-19.69% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3 -12.65 (-18.67% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3 -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods) -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods) -23.00 (-17.73% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateEscapingMask_UnsafeRelaxedJavaScriptEncoder(Vector128`1):Vector128`1 (2 methods) -6.70 (-16.48% of base) : System.Private.CoreLib.dasm - Vector3:Length():float:this -20.10 (-14.71% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateEscapingMask_DefaultJavaScriptEncoderBasicLatin(Vector128`1):Vector128`1 -8.25 (-14.44% of base) : System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4 -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods) -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods) -3.00 (-12.50% of base) : System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAny(Vector`1,Vector`1):bool (6 methods) -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAny(Vector`1,Vector`1):bool (6 methods) -35.20 (-10.49% of base) : System.Collections.Immutable.dasm - ImmutableInterlocked:GetOrAdd(byref,Vector`1,Func`2):long -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods) -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAll(Vector`1,Vector`1):bool (6 methods) -7.50 (-9.77% of base) : System.Collections.dasm - HashSet`1:AddValue(int,int,Vector`1):this -5.90 (-9.56% of base) : System.DirectoryServices.AccountManagement.dasm - PrincipalValueCollection`1:Insert(int,Vector`1):this -31.50 (-9.40% of base) : System.Private.CoreLib.dasm - Vector:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (8 methods) -22.50 (-8.74% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods) -3.00 (-8.45% of base) : System.Private.CoreLib.dasm - Vector4:Length():float:this -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Min(Vector`1,Vector`1):Vector`1 (6 methods) -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Max(Vector`1,Vector`1):Vector`1 (6 methods) -14.15 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane -3.40 (-7.11% of base) : System.Threading.Channels.dasm - Deque`1:DequeueHead():Vector`1:this -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Min(Vector`1,Vector`1):Vector`1 (6 methods) -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Max(Vector`1,Vector`1):Vector`1 (6 methods) -3.50 (-6.31% of base) : System.Threading.Channels.dasm - Deque`1:EnqueueTail(Vector`1):this -2.50 (-6.27% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateAsciiMask(Vector128`1):Vector128`1 -3.10 (-5.96% of base) : System.Collections.dasm - Stack`1:PushWithResize(Vector`1):this -21.60 (-5.67% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4 -3.10 (-5.50% of base) : Microsoft.Diagnostics.FastSerialization.dasm - SegmentedList`1:Add(Vector`1):this 
@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for benchstones and benchmarks game in D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release Completed PMI PerfScore Diffs for benchstones and benchmarks game in D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release in 113.61s Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_109\base d:\jit-diffs\runtime\x64\dasmset_109\diff Total PerfScoreUnits of diff: -3192.55 (-0.47% of base) diff is an improvement. Top file improvements (PerfScoreUnits): -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base) -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base) -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base) 3 total files with Perf Score differences (3 improved, 0 regressed), 79 unchanged. Top method regressions (percentages): 5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this Top method improvements (percentages): -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods) -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods) -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this 
@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\SIMD Completed PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\SIMD in 141.91s Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_110\base d:\jit-diffs\runtime\x64\dasmset_110\diff Analyzing PerfScore diffs... Total PerfScoreUnits of diff: -1789.36 (-0.16% of base) diff is an improvement. Top file regressions (PerfScoreUnits): 79.83 : VectorRelOp_ro\VectorRelOp_ro.dasm (0.68% of base) 9.90 : Matrix4x4_ro\Matrix4x4_ro.dasm (3.23% of base) 0.10 : VectorIntEquals_ro\VectorIntEquals_ro.dasm (0.00% of base) Top file improvements (PerfScoreUnits): -770.70 : VectorMatrix_ro\VectorMatrix_ro.dasm (-0.26% of base) -747.30 : VectorConvert_ro\VectorConvert_ro.dasm (-4.99% of base) -110.30 : Vector3Interop_ro\Vector3Interop_ro.dasm (-4.96% of base) -67.88 : VectorGet_ro\VectorGet_ro.dasm (-1.17% of base) -66.40 : VectorExp_ro\VectorExp_ro.dasm (-1.42% of base) -65.60 : BitwiseOperations_ro\BitwiseOperations_ro.dasm (-8.52% of base) -24.00 : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm (-1.66% of base) -9.30 : VectorMul_ro\VectorMul_ro.dasm (-0.17% of base) -6.10 : LdfldGeneric_ro\LdfldGeneric_ro.dasm (-4.35% of base) -5.18 : VectorInitN_ro\VectorInitN_ro.dasm (-0.21% of base) -3.08 : VectorSqrt_ro\VectorSqrt_ro.dasm (-0.11% of base) -2.60 : StoreElement_ro\StoreElement_ro.dasm (-3.76% of base) -0.75 : Plane_ro\Plane_ro.dasm (-0.38% of base) Top method regressions (percentages): 9.90 ( 3.28% of base) : Matrix4x4_ro\Matrix4x4_ro.dasm - Matrix4x4Test:Matrix4x4CreateScaleCenterTest3():int 31.91 ( 1.95% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(int,int):int 28.11 ( 1.68% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(short,short):int 25.11 ( 1.51% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(long,long):int 0.10 ( 0.15% of base) : VectorIntEquals_ro\VectorIntEquals_ro.dasm - VectorTest:VectorIntEquals():int Top method improvements (percentages): -65.60 (-23.85% of base) : BitwiseOperations_ro\BitwiseOperations_ro.dasm - Program:TestBool():int -65.65 (-18.22% of base) : VectorExp_ro\VectorExp_ro.dasm - VectorExpTest`1:VectorExp(Vector`1,double,double,double):int -42.18 (-17.87% of base) : VectorGet_ro\VectorGet_ro.dasm - VectorGetTest`1:VectorGet(long,int):int -20.45 (-13.18% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Array(ref,int) -20.70 (-12.09% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg(int,Vector3,String,Vector3) -116.80 (-11.82% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertDoubleSingle(Vector`1,Vector`1):int -45.65 (-11.26% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg_Unix2(Vector3,float,float,float,float,float,float,float,Vector3,float,float,Vector3,float) -25.70 (-11.21% of base) : VectorGet_ro\VectorGet_ro.dasm - VectorGetTest`1:VectorGet(double,int):int -92.40 (-11.13% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt16And8(Vector`1,Vector`1):int -90.80 (-10.97% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt32And16(Vector`1,Vector`1):int -89.40 (-10.34% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt64And32(Vector`1,Vector`1):int -93.20 (-10.30% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt16And8(Vector`1,Vector`1):int -91.00 (-10.09% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt32And16(Vector`1,Vector`1):int -89.40 (-9.53% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt64And32(Vector`1,Vector`1):int -42.00 (-8.39% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertSingleInt(Vector`1):int -23.50 (-8.37% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg_Unix(Vector3,float,float,float,float,float,float,float,Vector3,float,float) -42.30 (-7.26% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int -6.10 (-4.84% of base) : LdfldGeneric_ro\LdfldGeneric_ro.dasm - Program:Main(ref):int -12.00 (-4.25% of base) : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm - Program:VectorAndFilter(ref,ref):ref -12.00 (-4.09% of base) : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm - Program:VectorFilter(ref,ref):ref 
@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\Performance\CodeQuality Completed PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\Performance\CodeQuality in 112.49s Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_111\base d:\jit-diffs\runtime\x64\dasmset_111\diff Analyzing PerfScore diffs... Total PerfScoreUnits of diff: -3537.58 (-0.50% of base) diff is an improvement. Top file improvements (PerfScoreUnits): -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base) -345.03 : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm (-1.54% of base) -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base) -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base) 4 total files with Perf Score differences (4 improved, 0 regressed), 80 unchanged. Top method regressions (percentages): 5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this 4.70 ( 1.55% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - VectorMath:Log(Vector256`1):Vector256`1 2.50 ( 0.46% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Camera:Create(VectorPacket256,VectorPacket256):Camera Top method improvements (percentages): -10.25 (-32.59% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - <>c:<.cctor>b__5_2(VectorPacket256):VectorPacket256:this -10.25 (-32.59% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - <>c:<.cctor>b__5_4(VectorPacket256):VectorPacket256:this -9.00 (-18.69% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - VectorPacket256:op_Multiply(Vector256`1,VectorPacket256):VectorPacket256 -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this -35.00 (-6.45% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:CreateDefaultScene():Scene -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods) -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods) -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this -33.93 (-4.32% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:Shade(Intersections,RayPacket256,Scene,int):VectorPacket256:this -239.80 (-3.82% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:GetNaturalColor(Vector256`1,VectorPacket256,VectorPacket256,VectorPacket256,Scene):VectorPacket256:this 
@briansull
Copy link
Contributor Author

We detect and make a new CSE in:

coreclr\tests\src\JIT\Performance\CodeQuality\BenchmarksGame\mandelbrot\mandelbrot-7.cs

The expression:
Zi = Zr * Zi + Zr * Zi + vCiby;
has the sub expression (Zr *Zi) occurring twice, so we make it into a CSE:
Saving about 10% of the loop cost.

 [MethodImpl(MethodImplOptions.AggressiveInlining)] private static unsafe byte GetByte(double* pCrb, double Ciby, int x, int y) { // This currently does not do anything special for 'Count > 2' var res = 0; for (var i = 0; i < 8; i += 2) { var Crbx = Unsafe.Read<Vector<double>>(pCrb + x + i); var Zr = Crbx; var vCiby = new Vector<double>(Ciby); var Zi = vCiby; var b = 0; var j = 49; do { var nZr = Zr * Zr - Zi * Zi + Crbx; Zi = Zr * Zi + Zr * Zi + vCiby; Zr = nZr; 

We also make the CSE when we inline the GetByte method into the lamba function :

 Parallel.For(0, size, y => { var offset = y * lineLength; for (var x = 0; x < lineLength; x++) { data[offset + x] = GetByte(_Crb, Cib[y], x * 8, y); } }); 
@briansull
Copy link
Contributor Author

I verified that setting
set COMPLUS_JitDisableSimdVN=3
will disable my changes.

@briansull briansull merged commit 4ef3cc6 into dotnet:master Feb 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

6 participants