- Notifications
You must be signed in to change notification settings - Fork 5.2k
RyuJIT: Optimize -X and MathF.Abs(X) for floats #42164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
214efa9 to 2ec23e6 Compare
@EgorBo, do you know why clang is producing |
I guess there is no difference for |
For the legacy encoding, the |
| @dotnet/jit-contrib |
| some tests fail, a simplified repro: using System; using System.Runtime.CompilerServices; class Prog { static void Main() { Console.WriteLine( BitConverter.DoubleToInt64Bits(Egor(0, double.NaN))); } [MethodImpl(MethodImplOptions.NoInlining)] static double Egor(double xmm0, double xmm1) => -xmm1; }happens only when AVX (VEX) is not available (e.g R2R). codegen for G_M17335_IG01: ;; bbWeight=1 PerfScore 0.00 G_M17335_IG02: 0F28C1 movaps xmm0, xmm1 0F57051E000000 xorps xmm0, qword ptr [reloc @RWD24] ;; bbWeight=1 PerfScore 2.25 G_M17335_IG03: C3 ret ;; bbWeight=1 PerfScore 1.00 RWD00 dq0000000000000000h RWD08 dq8000000000000000h RWD16 dq0000000000000000h RWD24 dq8000000000000000h |
| Shouldn't that be |
| @tannergooding 🤔 hm.. shouldn't the constant be fixed my code a bit so now it's: ; Assembly listing for method Prog:Egor(double,double):double ; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ;* V00 arg0 [V00 ] ( 0, 0 ) double -> zero-ref ; V01 arg1 [V01,T00] ( 3, 3 ) double -> mm1 ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M17335_IG01: ;; bbWeight=1 PerfScore 0.00 G_M17335_IG02: 0F28C1 movaps xmm0, xmm1 0F57050E000000 xorps xmm0, qword ptr [reloc @RWD08] ;; bbWeight=1 PerfScore 2.25 G_M17335_IG03: C3 ret ;; bbWeight=1 PerfScore 1.00 RWD00 dq0000000000000000h RWD08 dq8000000000000000hbut it still NREs. ; Assembly listing for method Prog:Egor(double,double):double ; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ;* V00 arg0 [V00 ] ( 0, 0 ) double -> zero-ref ; V01 arg1 [V01,T00] ( 3, 3 ) double -> mm1 ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M17335_IG01: ;; bbWeight=1 PerfScore 0.00 G_M17335_IG02: F20F100510000000 movsd xmm0, qword ptr [reloc @RWD08] 0F57C1 xorps xmm0, xmm1 ;; bbWeight=1 PerfScore 2.33 G_M17335_IG03: C3 ret ;; bbWeight=1 PerfScore 1.00 RWD00 dq0000000000000000h RWD08 dq8000000000000000h ; Total bytes of code 12, prolog size 0, PerfScore 4.53, instruction count 3 (MethodHash=71e0bc48) for method Prog:Egor(double,double):double ; ============================================================ |
| There is no scalar |
| @tannergooding I've updated my previous comment - do you see why the second (current) codegen works fine? |
|
The failure, however, isn't due to the overreading (in this case), its because the data isn't aligned. The VEX encoding allows contained memory operands to be unaligned, while the legacy encoding (generally speaking) requires them to be aligned. Fixing the overreading issue should also fix the alignment issue. You just need to read from |
|
| @tannergooding thank you for the explanation! |
| @tannergooding here is the current codegen: Double: static double Test(double xmm0, double xmm1) => -xmm1;; VEX C5F877 vzeroupper C5F0570505000000 vxorps xmm0, xmm1, qword ptr [reloc @RWD00] C3 ret RWD00 dq 8000000000000000h RWD08 dq 8000000000000000h ; legacy 0F28C1 movaps xmm0, xmm1 0F570516000000 xorps xmm0, qword ptr [reloc @RWD16] C3 ret RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h RWD16 dq 8000000000000000h RWD24 dq 8000000000000000hFloat: static float Test(float xmm0, float xmm1) => -xmm1;; VEX C5F877 vzeroupper C5F0570505000000 vxorps xmm0, xmm1, dword ptr [reloc @RWD00] C3 ret RWD00 dd 80000000h RWD04 dd 80000000h RWD08 dd 80000000h RWD12 dd 80000000h ; legacy 0F28C1 movaps xmm0, xmm1 0F570516000000 xorps xmm0, dword ptr [reloc @RWD16] C3 ret RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h RWD16 dd 80000000h RWD20 dd 80000000h RWD24 dd 80000000h RWD28 dd 80000000h |
| 👍, that looks correct now. However, I'm unsure why there are 16-bytes of padding before the constant. It looks like it should just be able to start at |
@tannergooding Fixed! as a bonus it fixes redundant paddings everywhere, e.g: static double Test(double x) => x * 10 * 10; C5F877 vzeroupper C5FB590515000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD08] C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24] C3 ret RWD00 dq0000000000000000h RWD08 dq4024000000000000h RWD16 dq0000000000000000h RWD24 dq4024000000000000hto this: C5F877 vzeroupper C5FB59050D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD00] C5FB59050D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD08] C3 ret RWD00 dq4024000000000000h RWD08 dq4024000000000000h(I guess jit-diff ignores data sections to get the actual impact of this change) |
| @dotnet/jit-contrib @tannergooding PTAL, it's ready for final review Summary: A sample to cover both problems this PR fixes: double Test(double x, double y) => -x * 10 * 5;Current asm: C5F877 vzeroupper C5FB100525000000 vmovsd xmm0, qword ptr [reloc @RWD08] ; <-- can be eliminated for vxorps reg,reg,[mem] C5F857C1 vxorps xmm0, xmm1 C5FB590529000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24] C5FB590531000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD40] C3 ret RWD00 dq0000000000000000h ; <-- redundant padding RWD08 dq8000000000000000h RWD16 dq0000000000000000h ; <-- redundant padding RWD24 dq4024000000000000h RWD32 dq0000000000000000h ; <-- redundant padding RWD40 dq4014000000000000hNew asm: G_M60258_IG01: C5F877 vzeroupper G_M60258_IG02: C5F0570515000000 vxorps xmm0, xmm1, qword ptr [reloc @RWD00] C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD16] C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24] G_M60258_IG03: C3 ret RWD00 db000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h RWD16 dq4024000000000000h RWD24 dq4014000000000000hNew asm for non-VEX cpu: 0F28C1 movaps xmm0, xmm1 0F570516000000 xorps xmm0, qword ptr [reloc @RWD00] F20F59051E000000 mulsd xmm0, qword ptr [reloc @RWD16] F20F59051E000000 mulsd xmm0, qword ptr [reloc @RWD24] C3 ret RWD00 db000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h RWD16 dq4024000000000000h RWD24 dq4014000000000000h
|
Co-authored-by: Tanner Gooding <tagoo@outlook.com>
CarolEidt left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks!
Contributes to #1342
Current codegen:
New codegen:
godbolt: https://godbolt.org/z/5K5dGr
jit-diff:
/cc @tannergooding
Saving 0.0 constant into a temp reg makes sense when it's used more than once but it doesn't work anyway: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunGAzADYRsGBgFls2ABSDhohGgayRDAJ4BKBgF4AfA0QMAVPtUBuLjwD0lhgEtRQ7ABNcDAEQAGAHQe3DfHgA1gwYAO62YDCKAJIM2PgMAHYQojAAbjCJ/pkYthCJtokA5voIhnCqDLgAFhAArgJODMAwDBAADrn4tgBeME0YEAxlqjQAvkA===