
Conversation

@EgorBo (Member) commented Feb 23, 2020

This PR optimizes simple conditionals (a BBJ_COND block whose two successors are BBJ_RETURN blocks) into a single branchless BBJ_RETURN block:
[image]

E.g.

int Test1(int a) => a > 0 ? 1 : 0;   /* any cmp or test condition */
int Test2(int a) => a == 42 ? 3 : 2; /* any cmp or test condition */

Current codegen:

; Method Test1(int):int:this
       test     edx, edx
       jg       SHORT G_M18502_IG05
       xor      eax, eax
       ret
G_M18502_IG05:
       mov      eax, 1
       ret
; Total bytes of code: 13

; Method Test2(int):int:this
       cmp      edx, 42
       je       SHORT G_M33413_IG05
       mov      eax, 2
       ret
G_M33413_IG05:
       mov      eax, 3
       ret
; Total bytes of code: 17

New codegen:

; Method Test1(int):int:this
       test     edx, edx
       setg     al
       movzx    rax, al
       ret
; Total bytes of code: 9

; Method Test2(int):int:this
       cmp      edx, 42
       sete     al
       movzx    rax, al
       add      eax, 2
       ret
; Total bytes of code: 13

The new codegen is branchless and more compact. In theory it can be slower in some benchmarks where the same branch is taken over and over, so branch prediction plus speculative execution may perform better, but it seems GCC doesn't really care about that either. If you want, I can add a sort of "IsHotBlock" check (if that makes sense with the current PGO state).
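To make the transformation concrete, here is a C# sketch of what the new Test2 codegen computes (illustrative only; the JIT performs this rewrite on its IR, not on source):

static int Test2Branchless(int a)
{
    int cond = a == 42 ? 1 : 0; // materialized as: sete al; movzx rax, al
    return cond + 2;            // add eax, 2: yields 3 when a == 42, else 2
}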

Jit-diff (-f -pmi):

Found 52 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -3419 (-0.01% of base)
    diff is an improvement.

Top file improvements (bytes):
    -933 : System.Private.CoreLib.dasm (-0.02% of base)
    -396 : Microsoft.CodeAnalysis.CSharp.dasm (-0.01% of base)
    -348 : System.Data.Common.dasm (-0.02% of base)
    -251 : System.Private.Xml.dasm (-0.01% of base)
    -126 : System.Linq.Parallel.dasm (-0.01% of base)
    -121 : System.Linq.Expressions.dasm (-0.02% of base)
    -115 : System.DirectoryServices.dasm (-0.03% of base)
     -86 : System.Reflection.MetadataLoadContext.dasm (-0.05% of base)
     -80 : System.Management.dasm (-0.02% of base)
     -75 : Newtonsoft.Json.dasm (-0.01% of base)
     -70 : System.Threading.Channels.dasm (-0.04% of base)
     -65 : System.Collections.dasm (-0.01% of base)
     -52 : Microsoft.CSharp.dasm (-0.02% of base)
     -50 : System.Net.WebSockets.WebSocketProtocol.dasm (-0.12% of base)
     -44 : System.ComponentModel.TypeConverter.dasm (-0.01% of base)
     -40 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
     -39 : System.Security.Principal.Windows.dasm (-0.08% of base)
     -38 : Microsoft.CodeAnalysis.dasm (-0.00% of base)
     -35 : System.Reflection.Metadata.dasm (-0.01% of base)
     -32 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.00% of base)

52 total files with Code Size differences (52 improved, 0 regressed), 108 unchanged.

Top method improvements (bytes):
    -50 (-2.38% of base) : System.Net.WebSockets.WebSocketProtocol.dasm - <WaitForServerToCloseConnectionAsync>d__64:MoveNext():this
    -36 (-2.81% of base) : System.Private.CoreLib.dasm - System.Globalization.CompareInfo:Compare(System.String,System.String,int):int:this (2 methods)
    -32 (-8.82% of base) : System.Private.CoreLib.dasm - System.Collections.Generic.Comparer`1[Vector`1][System.Numerics.Vector`1[System.Single]]:System.Collections.IComparer.Compare(System.Object,System.Object):int:this
    -22 (-7.97% of base) : System.Data.Common.dasm - System.Data.DataColumn:CompareValueTo(int,System.Object,bool):bool:this
    -22 (-17.60% of base) : System.Private.CoreLib.dasm - System.Globalization.HijriCalendar:GetDaysInMonth(int,int,int):int:this
    -20 (-2.74% of base) : Newtonsoft.Json.dasm - Newtonsoft.Json.Serialization.JsonSerializerInternalReader:SetPropertyValue(Newtonsoft.Json.Serialization.JsonProperty,Newtonsoft.Json.JsonConverter,Newtonsoft.Json.Serialization.JsonContainerContract,Newtonsoft.Json.Serialization.JsonProperty,Newtonsoft.Json.JsonReader,System.Object):bool:this
    -20 (-2.99% of base) : System.Private.CoreLib.dasm - System.DateTimeParse:MatchTimeMark(byref,System.Globalization.DateTimeFormatInfo,byref):bool
    -19 (-4.22% of base) : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:CompareTypes(Microsoft.CSharp.RuntimeBinder.Semantics.TypeArray,Microsoft.CSharp.RuntimeBinder.Semantics.TypeArray):int
    -19 (-7.36% of base) : System.Private.CoreLib.dasm - System.DefaultBinder:FindMostSpecificMethod(System.Reflection.MethodBase,System.Int32[],System.Type,System.Reflection.MethodBase,System.Int32[],System.Type,System.Type[],System.Object[]):int
    -19 (-6.53% of base) : System.Private.CoreLib.dasm - System.Collections.Generic.Comparer`1[Double][System.Double]:System.Collections.IComparer.Compare(System.Object,System.Object):int:this
    -19 (-7.66% of base) : System.Reflection.MetadataLoadContext.dasm - System.DefaultBinder:FindMostSpecificMethod(System.Reflection.MethodBase,System.Int32[],System.Type,System.Reflection.MethodBase,System.Int32[],System.Type,System.Type[],System.Object[]):int
    -18 (-3.48% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.CodeGen.StackOptimizerPass1:IsNestedLocalOfCompoundOperator(Microsoft.CodeAnalysis.CSharp.Symbols.LocalSymbol,Microsoft.CodeAnalysis.CSharp.BoundSequence):bool:this
    -18 (-1.26% of base) : System.Private.CoreLib.dasm - System.Globalization.CompareInfo:IndexOf(System.String,System.String,int,int,int):int:this (2 methods)
    -18 (-1.81% of base) : System.Private.CoreLib.dasm - System.String:LastIndexOf(System.String,int,int,int):int:this
    -18 (-6.25% of base) : System.Private.CoreLib.dasm - System.Runtime.InteropServices.CustomMarshalers.EnumVariantViewOfEnumerator:Next(int,System.Object[],long):int:this
    -18 (-3.99% of base) : System.Private.Xml.dasm - System.Xml.Schema.SchemaCollectionCompiler:IsGroupBaseFromGroupBase(System.Xml.Schema.XmlSchemaGroupBase,System.Xml.Schema.XmlSchemaGroupBase,bool):bool:this
    -16 (-3.77% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:DoOperatorsPair(Microsoft.CodeAnalysis.CSharp.Symbols.MethodSymbol,Microsoft.CodeAnalysis.CSharp.Symbols.MethodSymbol):bool
    -15 (-2.50% of base) : System.Data.Common.dasm - System.Data.Common.DateTimeOffsetStorage:CompareValueTo(int,System.Object):int:this
    -14 (-6.90% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.ConversionsBase:HasImplicitConversionFromDelegate(Microsoft.CodeAnalysis.CSharp.Symbols.TypeSymbol,Microsoft.CodeAnalysis.CSharp.Symbols.TypeSymbol,byref):bool:this
    -14 (-5.34% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.AssemblySymbol:PerformIVTCheck(System.Collections.Immutable.ImmutableArray`1[Byte],Microsoft.CodeAnalysis.AssemblyIdentity):int:this

Top method improvements (percentages):
    -5 (-31.25% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.ThreeStateHelpers:ToThreeState(bool):ubyte
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToInt32(bool):int
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToUInt32(bool):int
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToUInt16(bool):ushort
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToByte(bool):ubyte
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToInt16(bool):short
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToSByte(bool):byte
    -4 (-30.77% of base) : System.Private.Xml.dasm - System.Xml.Xsl.XPathConvert:NotZero(int):int
    -4 (-28.57% of base) : System.Private.CoreLib.dasm - System.Boolean:GetHashCode():int:this
    -4 (-28.57% of base) : System.Private.CoreLib.dasm - System.Threading.SpinWait:get_NextSpinWillYield():bool:this
    -5 (-27.78% of base) : Microsoft.CodeAnalysis.dasm - VersionResourceSerializer:get_FileType():int:this
    -9 (-27.27% of base) : System.Linq.Expressions.dasm - System.Linq.Expressions.Compiler.LambdaCompiler:EmitExpressionStart(System.Linq.Expressions.Expression):int:this
    -4 (-26.67% of base) : Microsoft.CSharp.dasm - BinOpFullSig:isLifted():bool:this
    -4 (-26.67% of base) : Microsoft.CSharp.dasm - UnaOpFullSig:isLifted():bool:this
    -4 (-26.67% of base) : System.ComponentModel.Composition.dasm - System.ComponentModel.Composition.ImportAttribute:System.ComponentModel.Composition.IAttributedImport.get_Cardinality():int:this
    -4 (-26.67% of base) : System.ComponentModel.TypeConverter.dasm - FilterCacheItem:IsValid(System.ComponentModel.Design.ITypeDescriptorFilterService):bool:this
    -4 (-26.67% of base) : System.Data.Common.dasm - System.Data.LookupNode:DependsOn(System.Data.DataColumn):bool:this
    -8 (-26.67% of base) : System.Data.Common.dasm - System.Data.Common.ADP:SrcCompare(System.String,System.String):int
    -8 (-26.67% of base) : System.Data.OleDb.dasm - System.Data.Common.ADP:SrcCompare(System.String,System.String):int
    -8 (-26.67% of base) : System.Diagnostics.TraceSource.dasm - System.Diagnostics.BooleanSwitch:get_Enabled():bool:this

406 total methods with Code Size differences (406 improved, 0 regressed), 304680 unchanged.

This PR handles only Case 1 for now (as a start):
[image]

and the jit-diff already looks promising; I think the GT_ASG ones should produce even bigger diffs.
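For context, a sketch of the GT_ASG-based shape this PR does not handle yet (assignments in the two successor blocks instead of returns; illustrative only):

static int TestAsg(int a)
{
    int x;
    if (a > 0)
        x = 1; // GT_ASG in the "true" block
    else
        x = 0; // GT_ASG in the "false" block
    return x;
}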

Also, see #32632 cc @lpereira
(can I use GT_LEA here to handle even more cases?)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 23, 2020
@EgorBo (Member, Author) commented Feb 24, 2020

A quite popular pattern in the BCL that this PR fixes: sharplab.io (e.g. here)

@Suchiman (Contributor) commented:

I suppose this also improves this?

// Ternary operator returning true/false prevents redundant asm generation:
// https://github.com/dotnet/runtime/issues/4207
return (value == null || 0u >= (uint)value.Length) ? true : false;

@EgorBo (Member, Author) commented Feb 24, 2020

I suppose this also improves this?

// Ternary operator returning true/false prevents redundant asm generation:
// https://github.com/dotnet/runtime/issues/4207
return (value == null || 0u >= (uint)value.Length) ? true : false;

It won't change anything for IsNullOrEmpty: the optimization gives up if the return true/false blocks are used by more than a single basic block; otherwise it would be a size regression.
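To illustrate with the snippet above (the comments are my annotation of the bail-out condition, not code from the PR):

static bool IsNullOrEmpty(string value)
{
    // Both the null check and the length check flow into the same
    // "return true" path, so the return blocks have more than one
    // in-edge (countOfInEdges() != 1) and the transform is skipped.
    return (value == null || 0u >= (uint)value.Length) ? true : false;
}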

@kunalspathak (Contributor) left a comment:

LGTM

@stephentoub (Member) commented:

What's the status of this PR? Should it be merged?

@EgorBo (Member, Author) commented May 5, 2020

@dotnet/jit-contrib What do you think?
This PR is a base for other cases, e.g. GT_ASG, which should be even more common IMO.

@BruceForstall BruceForstall changed the title RuyJIT: Optimize simple comparisons into branchless expressions RyuJIT: Optimize simple comparisons into branchless expressions May 5, 2020
@sandreenko sandreenko self-requested a review May 5, 2020 19:40
@sandreenko (Contributor) left a comment:

The optimization and the implementation look correct. However, I am worried about the compilation-time impact: without tiering I would vote against this; with tiering enabled it could be worth doing.
It is linear in the number of blocks and uses no additional memory, but could you estimate what percentage of methods it will affect once you support all 4 cases? Is it more like 1% or 0.1%?
Can it be combined with the existing logic that optimizes branches like GT_TRUE(1,1)?

#ifndef TARGET_64BIT
        if ((typ == TYP_INT) && (addValue == INT_MIN))
        {
            continue;
Contributor:

Does this protect against the ? INT_MIN : INT_MIN+1 case? What is the problem with it?

// TODO: implement for GT_ASG
(trueBb->bbJumpKind != BBJ_RETURN) || (falseBb->bbJumpKind != BBJ_RETURN) ||
// give up if these blocks are used by someone else
(trueBb->countOfInEdges() != 1) || (falseBb->countOfInEdges() != 1) ||
Contributor:

// give up if these blocks are used by someone else

why?

@EgorBo (Member, Author):

I need to check, but I think there will be regressions (in terms of size, not performance).

Member:

The transformation will be invalid, because those return blocks can be reached under other conditions.

Contributor:

The transformation will be invalid

Why? We don't delete these blocks; we only remove the edges from block to trueBb and falseBb. trueBb and falseBb stay alive for the other preds.

}

// unlink both blocks
fgRemoveRefPred(trueBb, block);
@sandreenko (Contributor) commented May 11, 2020:

Do we just remove the edges? Does unreachable code elimination delete the blocks?

@EgorBo (Member, Author):

Yeah, the blocks are eliminated; I will check at exactly what point.

            continue;
        }

        Statement* firstStmt = block->lastStmt();
Contributor:

Why is the last statement called firstStmt?

@EgorBo (Member, Author):

This is how I skip possible leading GT_ASG statements
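For example (an illustrative sketch, not code from the PR), a block may contain assignment statements ahead of the conditional jump, so starting from lastStmt() targets the JTRUE directly:

static int Test(int a)
{
    int tmp = a * 2;        // a leading GT_ASG statement in the same block
    return tmp > 0 ? 1 : 0; // the block ends with the conditional jump
}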

(trueBb->bbJumpKind != BBJ_RETURN) || (falseBb->bbJumpKind != BBJ_RETURN) ||
// give up if these blocks are used by someone else
(trueBb->countOfInEdges() != 1) || (falseBb->countOfInEdges() != 1) ||
(trueBb->bbFlags & BBF_DONT_REMOVE) || (falseBb->bbFlags & BBF_DONT_REMOVE))
Member:

Suggested change:
-    (trueBb->bbFlags & BBF_DONT_REMOVE) || (falseBb->bbFlags & BBF_DONT_REMOVE))
+    ((trueBb->bbFlags & BBF_DONT_REMOVE) != 0) || ((falseBb->bbFlags & BBF_DONT_REMOVE) != 0))
@erozenfeld (Member) commented:

At the very least we should measure the throughput cost of this, e.g., by using a pin tool and crossgen-ing SPC. It's not a full walk of the IR (it looks at no more than one statement per basic block), so it may not be too costly. Still, I would prefer to know the CQ benefit (and throughput cost) once the missing cases are added. Depending on the throughput results, we may consider folding this optimization into one of the existing phases.

@AndyAyersMS (Member) commented:

Note we also already do return block merging for identical constant return values during fgAddInternal; I wonder if we can extend this to handle merging lexically similar return blocks.

@stephentoub (Member) commented:

@EgorBo, are you still working on this?

@EgorBo (Member, Author) commented Sep 18, 2020

@stephentoub yes, sorry.
I am planning to look at what I can do here tomorrow or on Sunday (I will close this on Monday otherwise).

@EgorBo (Member, Author) commented Sep 20, 2020

@AndyAyersMS @erozenfeld @sandreenko
I'm going to close this for now, since I think it should be done differently:

I think it's time to introduce CMOV-based optimizations, especially because we are now able to collect profile data and so will be able to avoid perf regressions (see this godbolt sample).

So ideally there should be an optimization like the one in this PR that collapses if-else blocks back into ternary operators where needed. For that we can introduce a new node, GT_SELECT (inspired by LLVM, see https://godbolt.org/z/94YjYf), or reuse QMARK+COLON (currently it's not welcome in LIR; lots of asserts), and then codegen would emit either cmov (csel on ARM) or cmp+add for simple cases. Or just use a named intrinsic.
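As a sketch of the idea (illustrative only; GT_SELECT did not exist at this point), the pass would turn a diamond like the one below back into a single select that codegen can lower branchlessly:

static int Max(int a, int b)
{
    // Today: a BBJ_COND plus two BBJ_RETURN blocks, i.e. a real branch.
    // With a select node: one block lowered to cmp + cmov (csel on ARM).
    if (a > b)
        return a;
    return b;
}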

PS: regarding JIT throughput and optOptimizeBools (which walks all statements): optOptimizeBools currently brings only about a -1000 byte diff, making it a questionable optimization, mostly because it gives up on many cases (see sharplab.io); Roslyn is able to handle the simple cases itself: sharplab.io

@EgorBo EgorBo closed this Sep 20, 2020
@EgorBo (Member, Author) commented Sep 20, 2020

HW_INTRINSIC-based implementation for CMOVnn: EgorBo@1271fe5

static int Test1(int x) { return x == 42 ? 1000 : 2000; }
static int Test2(int x, int a, int b) { return x == 42 ? a : b; }
; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
       83F92A               cmp      ecx, 42
       B8E8030000           mov      eax, 0x3E8
       BAD0070000           mov      edx, 0x7D0
       0F45C2               cmovne   eax, edx
G_M56601_IG03:
       C3                   ret
; Total bytes of code: 17

; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
       83F92A               cmp      ecx, 42
       8BC2                 mov      eax, edx
       410F45C0             cmovne   eax, r8d
G_M50938_IG03:
       C3                   ret
; Total bytes of code: 10

Works better with PGO 🙂:
[image]

@AndyAyersMS (Member) commented:

I think we may focus on this during .NET 6 -- as you note, not only do we have to figure out how to represent conditional instructions in the IR, but we also need to have the right heuristics to decide when to use them.

@tannergooding (Member) commented:

HW_INTRINSIC-based implementation

@EgorBo, do you think it's worth opening a proposal for CMOV as a public HWIntrinsic?

@EgorBo (Member, Author) commented Sep 22, 2020

HW_INTRINSIC-based implementation

@EgorBo, do you think it's worth opening a proposal for CMOV as a public HWIntrinsic?

Honestly, not sure; I'd expect the JIT to handle it, since with PGO jumps could be better (https://github.com/xiadz/cmov).
But I recall @damageboy wanted one for his bitonic sort.
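For illustration, one purely hypothetical shape such a proposal might take (the name and signature below are invented here; nothing like this was proposed in the thread):

// Hypothetical API sketch only; not a real .NET API.
// public static int ConditionalMove(bool condition, int ifTrue, int ifFalse);
//
// Usage sketch: int r = ConditionalMove(x == 42, a, b); // cmp + cmovne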

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020