Skip to content

Conversation

@EgorBo
Copy link
Member

@EgorBo EgorBo commented Sep 12, 2020

Contributes to #1342

public static float XorTest(float x) => -x; public static float AbsTest(float x) => MathF.Abs(x);

Current codegen:

; Method Test:XorTest(float):float  C5F877 vzeroupper   C5FA100D0D000000 vmovss xmm1, dword ptr [reloc @RWD00]  C5F857C1 vxorps xmm0, xmm1  C3 ret  RWD00 dd80000000h ; Total bytes of code: 16 ; Method Test:AbsTest(float):float  C5F877 vzeroupper   C5FA100D0D000000 vmovss xmm1, dword ptr [reloc @RWD00]  C5F854C1 vandps xmm0, xmm1  C3 ret  RWD00 dd7FFFFFFFh ; Total bytes of code: 16

New codegen:

; Method XorTest(float):float  C5F877 vzeroupper   C5F8570505000000 vxorps xmm0, xmm0, dword ptr [reloc @RWD00]  C3 ret  RWD00 dd80000000h ; Total bytes of code: 12 ; Method AbsTest(float):float  C5F877 vzeroupper   C5F8540505000000 vandps xmm0, xmm0, dword ptr [reloc @RWD00]  C3 ret  RWD00 dd7FFFFFFFh ; Total bytes of code: 12

godbolt: https://godbolt.org/z/5K5dGr

jit-diff:

C:\prj>jit-diff diff --output C:\prj\jitdiffs -f --core_root C:\prj\runtime-1\artifacts\tests\coreclr\Windows_NT.x64.Release\Tests\Core_Root --base C:\prj\runtime-1\artifacts\bin\coreclr\Windows_NT.x64.Checked_base --diff C:\prj\runtime-1\artifacts\bin\coreclr\Windows_NT.x64.Checked --pmi Beginning PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies - Finished 267/267 Base 267/267 Diff [490.6 sec] Completed PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies in 490.77s Diffs (if any) can be viewed by comparing: C:\prj\jitdiffs\dasmset_10\base C:\prj\jitdiffs\dasmset_10\diff Analyzing CodeSize diffs... Found 14 files with textual diffs. PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for default jit Summary of Code Size diffs: (Lower is better) Total bytes of diff: -1097 (-0.00% of base) diff is an improvement. Top file improvements (bytes): -484 : System.Private.CoreLib.dasm (-0.01% of base) -153 : Microsoft.VisualBasic.Core.dasm (-0.03% of base) -152 : System.Runtime.Numerics.dasm (-0.21% of base) -92 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base) -84 : System.Drawing.Common.dasm (-0.03% of base) -36 : Newtonsoft.Json.dasm (-0.00% of base) -28 : System.Data.Common.dasm (-0.00% of base) -20 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.00% of base) -16 : System.Private.Xml.dasm (-0.00% of base) -8 : FSharp.Core.dasm (-0.00% of base) -8 : System.Linq.Expressions.dasm (-0.00% of base) -8 : System.Private.DataContractSerialization.dasm (-0.00% of base) -4 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base) -4 : System.Net.Mail.dasm (-0.00% of base) 14 total files with Code Size differences (14 improved, 0 regressed), 253 unchanged. Top method improvements (bytes): -41 (-1.90% of base) : System.Private.CoreLib.dasm - Matrix4x4:<Invert>g__SoftwareFallback|65_1(Matrix4x4,byref):bool -36 (-6.19% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateShadow(Vector3,Plane):Matrix4x4 -28 (-1.36% of base) : Microsoft.VisualBasic.Core.dasm - ObjectType:InternalNegObj(Object,IConvertible,int):Object -25 (-7.84% of base) : System.Private.CoreLib.dasm - MathF:IEEERemainder(float,float):float -24 (-1.21% of base) : Microsoft.VisualBasic.Core.dasm - Operators:NegateObject(Object):Object -24 (-5.93% of base) : System.Drawing.Common.dasm - Matrix:RotateAt(float,PointF,int):this -24 (-14.20% of base) : System.Private.CoreLib.dasm - Matrix3x2:op_UnaryNegation(Matrix3x2):Matrix3x2 -21 (-7.00% of base) : System.Private.CoreLib.dasm - Math:IEEERemainder(double,double):double -20 (-1.43% of base) : Microsoft.VisualBasic.Core.dasm - Financial:IRR(byref,double):double -20 (-1.21% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool -20 (-9.39% of base) : System.Runtime.Numerics.dasm - Complex:Asin(Complex):Complex -20 (-3.77% of base) : System.Runtime.Numerics.dasm - Complex:Sqrt(Complex):Complex -16 (-0.57% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - ExpressionEvaluator:EvaluateUnaryExpression(UnaryExpressionSyntax):CConst:this -16 (-9.64% of base) : System.Drawing.Common.dasm - PrinterSettings:CreateMeasurementGraphics(bool):Graphics:this -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:op_UnaryNegation(Quaternion):Quaternion -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:Negate(Quaternion):Quaternion -16 (-13.68% of base) : System.Private.CoreLib.dasm - Vector4:Abs(Vector4):Vector4 -16 (-2.26% of base) : System.Private.CoreLib.dasm - Vector`1:Abs(Vector`1):Vector`1 (6 methods) -16 (-11.51% of base) : System.Runtime.Numerics.dasm - Complex:op_Division(double,Complex):Complex -16 (-7.24% of base) : System.Runtime.Numerics.dasm - Complex:Acos(Complex):Complex Top method improvements (percentages): -4 (-25.00% of base) : FSharp.Core.dasm - -cctor@6135-13:Invoke(double):double:this -4 (-25.00% of base) : FSharp.Core.dasm - -cctor@6136-15:Invoke(float):float:this -4 (-25.00% of base) : System.Private.CoreLib.dasm - MathF:Abs(float):float -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:op_UnaryNegation(Quaternion):Quaternion -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:Negate(Quaternion):Quaternion -8 (-16.33% of base) : System.Runtime.Numerics.dasm - Complex:Negate(Complex):Complex -8 (-16.33% of base) : System.Runtime.Numerics.dasm - Complex:op_UnaryNegation(Complex):Complex -8 (-15.38% of base) : Microsoft.VisualBasic.Core.dasm - Conversion:Fix(double):double -12 (-14.81% of base) : System.Private.CoreLib.dasm - Quaternion:Conjugate(Quaternion):Quaternion -24 (-14.20% of base) : System.Private.CoreLib.dasm - Matrix3x2:op_UnaryNegation(Matrix3x2):Matrix3x2 -16 (-13.68% of base) : System.Private.CoreLib.dasm - Vector4:Abs(Vector4):Vector4 -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<ByIDSortedExclusiveMetric>b__15_0(CallTreeNodeBase,CallTreeNodeBase):int:this -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<.ctor>b__0_0(CallTreeNodeBase,CallTreeNodeBase):int:this -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<.ctor>b__0_1(CallTreeNodeBase,CallTreeNodeBase):int:this -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<GetCallees>b__5_0(CallTreeNode,CallTreeNode):int:this -12 (-12.37% of base) : Newtonsoft.Json.dasm - MathUtils:ApproxEquals(double,double):bool -12 (-11.88% of base) : System.Private.CoreLib.dasm - Vector3:Abs(Vector3):Vector3 -8 (-11.76% of base) : Microsoft.VisualBasic.Core.dasm - Conversion:Fix(float):float -4 (-11.76% of base) : Newtonsoft.Json.dasm - JsonValidatingReader:IsZero(double):bool -16 (-11.51% of base) : System.Runtime.Numerics.dasm - Complex:op_Division(double,Complex):Complex 127 total methods with Code Size differences (127 improved, 0 regressed), 258503 unchanged. Completed analysis in 28.55s 

/cc @tannergooding

Saving 0.0 constant into a temp reg makes sense when it's used more than once but it doesn't work anyway: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunGAzADYRsGBgFls2ABSDhohGgayRDAJ4BKBgF4AfA0QMAVPtUBuLjwD0lhgEtRQ7ABNcDAEQAGAHQe3DfHgA1gwYAO62YDCKAJIM2PgMAHYQojAAbjCJ/pkYthCJtokA5voIhnCqDLgAFhAArgJODMAwDBAADrn4tgBeME0YEAxlqjQAvkA===

@am11
Copy link
Member

am11 commented Sep 12, 2020

godbolt: https://godbolt.org/z/5K5dGr

@EgorBo, do you know why clang is producing vxorps, for double precision (as well as for float); while gcc seems to be doing the right thing, producing vxorpd with double and vxorps for float?

@EgorBo
Copy link
Member Author

EgorBo commented Sep 12, 2020

godbolt: https://godbolt.org/z/5K5dGr

@EgorBo, do you know why clang is producing vxorps, for double precision (as well as for float); while gcc seems to be doing the right thing, producing vxorpd with double and vxorps for float?

I guess there is no difference for xor what to xor - it xors bits https://godbolt.org/z/9TsWq1 🙂
UPD: it looks like there is a small difference for the 256bit xor: https://stackoverflow.com/questions/26942952/difference-between-the-avx-instructions-vxorpd-and-vpxor

@tannergooding
Copy link
Member

do you know why clang is producing vxorps, for double precision (as well as for float); while gcc seems to be doing the right thing, producing vxorpd with double and vxorps for float?

For the legacy encoding, the ps version is generally 1 byte smaller than the pd or integral versions. For VEX they should be the same size.
movaps, movups, movntps, xorps, orps, andps, and shufps are all other instructions that can also be substituted due to being a byte smaller but functionally operating on bits rather than on "32-bit floats".

@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 12, 2020
@sandreenko
Copy link
Contributor

@dotnet/jit-contrib

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

some tests fail, a simplified repro:

using System; using System.Runtime.CompilerServices; class Prog { static void Main() { Console.WriteLine( BitConverter.DoubleToInt64Bits(Egor(0, double.NaN))); } [MethodImpl(MethodImplOptions.NoInlining)] static double Egor(double xmm0, double xmm1) => -xmm1; }

happens only when AVX (VEX) is not available (e.g R2R).
Output:

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object. at Prog.Egor(Double xmm0, Double xmm1) at Prog.Main() 

codegen for Egor:

G_M17335_IG01: ;; bbWeight=1 PerfScore 0.00 G_M17335_IG02:  0F28C1 movaps xmm0, xmm1  0F57051E000000 xorps xmm0, qword ptr [reloc @RWD24] ;; bbWeight=1 PerfScore 2.25 G_M17335_IG03:  C3 ret ;; bbWeight=1 PerfScore 1.00 RWD00 dq0000000000000000h RWD08 dq8000000000000000h RWD16 dq0000000000000000h RWD24 dq8000000000000000h
@tannergooding
Copy link
Member

Shouldn't that be reloc @RWD00, not @RWD24?

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding 🤔 hm.. shouldn't the constant be 0x8000000000000000 (with sign bit on) but the section looks weird indeed.

fixed my code a bit so now it's:

; Assembly listing for method Prog:Egor(double,double):double ; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ;* V00 arg0 [V00 ] ( 0, 0 ) double -> zero-ref ; V01 arg1 [V01,T00] ( 3, 3 ) double -> mm1 ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M17335_IG01: ;; bbWeight=1 PerfScore 0.00 G_M17335_IG02:  0F28C1 movaps xmm0, xmm1  0F57050E000000 xorps xmm0, qword ptr [reloc @RWD08] ;; bbWeight=1 PerfScore 2.25 G_M17335_IG03:  C3 ret ;; bbWeight=1 PerfScore 1.00 RWD00 dq0000000000000000h RWD08 dq8000000000000000h

but it still NREs.
here is the old (working) codegen that works:

; Assembly listing for method Prog:Egor(double,double):double ; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ;* V00 arg0 [V00 ] ( 0, 0 ) double -> zero-ref ; V01 arg1 [V01,T00] ( 3, 3 ) double -> mm1 ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M17335_IG01: ;; bbWeight=1 PerfScore 0.00 G_M17335_IG02:  F20F100510000000 movsd xmm0, qword ptr [reloc @RWD08]  0F57C1 xorps xmm0, xmm1 ;; bbWeight=1 PerfScore 2.33 G_M17335_IG03:  C3 ret ;; bbWeight=1 PerfScore 1.00 RWD00 dq0000000000000000h RWD08 dq8000000000000000h ; Total bytes of code 12, prolog size 0, PerfScore 4.53, instruction count 3 (MethodHash=71e0bc48) for method Prog:Egor(double,double):double ; ============================================================ 
@tannergooding
Copy link
Member

There is no scalar xor operation, just a packed version, so the constant must be 16-bytes with at least element 0 being -0.0 (although we currently set all elements to be -0.0 or -0.0f).

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding I've updated my previous comment - do you see why the second (current) codegen works fine?
(it saves the constant to the targetReg and then does xorps reg/reg instead of xorps reg/mem for the this PR version)

@tannergooding
Copy link
Member

movsd xmm0, qword ptr [reloc @RWD08] reads a scalar double (hence sd), so it will only read 8-bytes and using RWD08 when the constant at that address is 8-bytes is fine.
xorps xmm0, qword ptr [reloc @RWD08] reads a packed single (hence ps), so it reads 16-bytes. You are reading from RWD08 and so it is reading RWD08-RWD24 (where there is no data at RWD16).

The failure, however, isn't due to the overreading (in this case), its because the data isn't aligned. The VEX encoding allows contained memory operands to be unaligned, while the legacy encoding (generally speaking) requires them to be aligned.

Fixing the overreading issue should also fix the alignment issue. You just need to read from RWD00.

@tannergooding
Copy link
Member

Fixing the overreading issue should also fix the alignment issue

  • Assuming you are specifying the constant should be 16-byte aligned when being emitted, there is an option for this (emitAnyCns takes a cnsAlign parameter that is respected).
@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding thank you for the explanation!
not sure my fix for it looks good but at least it works now

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding here is the current codegen:

Double:

static double Test(double xmm0, double xmm1) => -xmm1;
; VEX  C5F877 vzeroupper  C5F0570505000000 vxorps xmm0, xmm1, qword ptr [reloc @RWD00]  C3 ret RWD00 dq 8000000000000000h RWD08 dq 8000000000000000h ; legacy  0F28C1 movaps xmm0, xmm1  0F570516000000 xorps xmm0, qword ptr [reloc @RWD16]  C3 ret RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h RWD16 dq 8000000000000000h RWD24 dq 8000000000000000h

Float:

static float Test(float xmm0, float xmm1) => -xmm1;
; VEX  C5F877 vzeroupper  C5F0570505000000 vxorps xmm0, xmm1, dword ptr [reloc @RWD00]  C3 ret RWD00 dd 80000000h RWD04 dd 80000000h RWD08 dd 80000000h RWD12 dd 80000000h ; legacy  0F28C1 movaps xmm0, xmm1  0F570516000000 xorps xmm0, dword ptr [reloc @RWD16]  C3 ret RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h RWD16 dd 80000000h RWD20 dd 80000000h RWD24 dd 80000000h RWD28 dd 80000000h
@tannergooding
Copy link
Member

👍, that looks correct now. However, I'm unsure why there are 16-bytes of padding before the constant. It looks like it should just be able to start at RWD00 (with that being properly aligned).

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

👍, that looks correct now. However, I'm unsure why there are 16-bytes of padding before the constant. It looks like it should just be able to start at RWD00 (with that being properly aligned).

@tannergooding Fixed! as a bonus it fixes redundant paddings everywhere, e.g:

static double Test(double x) => x * 10 * 10;
 C5F877 vzeroupper   C5FB590515000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD08]  C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24]  C3 ret  RWD00 dq0000000000000000h RWD08 dq4024000000000000h RWD16 dq0000000000000000h RWD24 dq4024000000000000h

to this:

 C5F877 vzeroupper   C5FB59050D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD00]  C5FB59050D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD08]  C3 ret  RWD00 dq4024000000000000h RWD08 dq4024000000000000h

(I guess jit-diff ignores data sections to get the actual impact of this change)

@EgorBo
Copy link
Member Author

EgorBo commented Sep 15, 2020

@dotnet/jit-contrib @tannergooding PTAL, it's ready for final review

Summary:
This PR cleans up genSSE2BitwiseOp and it now emits 3-operands versions (VEX) of xorps/andps instead of movsd+xorps
Also, it fixes alignment for emitter::emitDataGenBeg, it used to emit paddings even if offset is already aligned or zero.
(my System.Private.Corelib.dll(R2R) is now 4kb less in size, jit-diff doesn't take data section into account) it's also responsible for redundant gaps between other types of constants, e.g. vectors.

A sample to cover both problems this PR fixes:

double Test(double x, double y) => -x * 10 * 5;

Current asm:

 C5F877 vzeroupper   C5FB100525000000 vmovsd xmm0, qword ptr [reloc @RWD08] ; <-- can be eliminated for vxorps reg,reg,[mem]  C5F857C1 vxorps xmm0, xmm1  C5FB590529000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24]  C5FB590531000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD40]  C3 ret  RWD00 dq0000000000000000h ; <-- redundant padding RWD08 dq8000000000000000h RWD16 dq0000000000000000h ; <-- redundant padding RWD24 dq4024000000000000h RWD32 dq0000000000000000h ; <-- redundant padding RWD40 dq4014000000000000h

New asm:

G_M60258_IG01:  C5F877 vzeroupper  G_M60258_IG02:  C5F0570515000000 vxorps xmm0, xmm1, qword ptr [reloc @RWD00]  C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD16]  C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24] G_M60258_IG03:  C3 ret  RWD00 db000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h RWD16 dq4024000000000000h RWD24 dq4014000000000000h

New asm for non-VEX cpu:

 0F28C1 movaps xmm0, xmm1  0F570516000000 xorps xmm0, qword ptr [reloc @RWD00]  F20F59051E000000 mulsd xmm0, qword ptr [reloc @RWD16]  F20F59051E000000 mulsd xmm0, qword ptr [reloc @RWD24]  C3 ret  RWD00 db000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h RWD16 dq4024000000000000h RWD24 dq4014000000000000h

RWD00 is a 16bytes pack of masks for xorps (it's also 16bytes aligned)

Co-authored-by: Tanner Gooding <tagoo@outlook.com>
Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks!

@CarolEidt CarolEidt merged commit 271f008 into dotnet:master Oct 2, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

6 participants