
Conversation

@EgorBo (Member) commented Feb 23, 2020

This PR optimizes simple conditionals (a BBJ_COND block whose two successors are BBJ_RETURN blocks) into a single branchless BBJ_RETURN block:
[image]

E.g.

int Test1(int a) => a > 0 ? 1 : 0;   /* any cmp or test condition */
int Test2(int a) => a == 42 ? 3 : 2; /* any cmp or test condition */

Current codegen:

; Method Test1(int):int:this
       test     edx, edx
       jg       SHORT G_M18502_IG05
       xor      eax, eax
       ret
G_M18502_IG05:
       mov      eax, 1
       ret
; Total bytes of code: 13

; Method Test2(int):int:this
       cmp      edx, 42
       je       SHORT G_M33413_IG05
       mov      eax, 2
       ret
G_M33413_IG05:
       mov      eax, 3
       ret
; Total bytes of code: 17

New codegen:

; Method Test1(int):int:this
       test     edx, edx
       setg     al
       movzx    rax, al
       ret
; Total bytes of code: 9

; Method Test2(int):int:this
       cmp      edx, 42
       sete     al
       movzx    rax, al
       add      eax, 2
       ret
; Total bytes of code: 13

The new codegen is branchless and more compact. In theory it can be slower in some benchmarks where the same branch is taken over and over, so branch prediction plus speculative execution may perform better, but it seems GCC doesn't really care about that either. If you want, I can add a sort of "IsHotBlock" check (if that makes sense with the current PGO state).
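To make the transformation concrete, here is a C# sketch of what the new Test2 codegen computes (illustrative only; the JIT performs this rewrite on its IR, not on source):

static int Test2Branchless(int a)
{
    int cond = a == 42 ? 1 : 0; // materialized as: sete al; movzx rax, al
    return cond + 2;            // add eax, 2: yields 3 when a == 42, else 2
}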

Jit-diff (-f -pmi):

Found 52 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -3419 (-0.01% of base)
    diff is an improvement.

Top file improvements (bytes):
    -933 : System.Private.CoreLib.dasm (-0.02% of base)
    -396 : Microsoft.CodeAnalysis.CSharp.dasm (-0.01% of base)
    -348 : System.Data.Common.dasm (-0.02% of base)
    -251 : System.Private.Xml.dasm (-0.01% of base)
    -126 : System.Linq.Parallel.dasm (-0.01% of base)
    -121 : System.Linq.Expressions.dasm (-0.02% of base)
    -115 : System.DirectoryServices.dasm (-0.03% of base)
     -86 : System.Reflection.MetadataLoadContext.dasm (-0.05% of base)
     -80 : System.Management.dasm (-0.02% of base)
     -75 : Newtonsoft.Json.dasm (-0.01% of base)
     -70 : System.Threading.Channels.dasm (-0.04% of base)
     -65 : System.Collections.dasm (-0.01% of base)
     -52 : Microsoft.CSharp.dasm (-0.02% of base)
     -50 : System.Net.WebSockets.WebSocketProtocol.dasm (-0.12% of base)
     -44 : System.ComponentModel.TypeConverter.dasm (-0.01% of base)
     -40 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
     -39 : System.Security.Principal.Windows.dasm (-0.08% of base)
     -38 : Microsoft.CodeAnalysis.dasm (-0.00% of base)
     -35 : System.Reflection.Metadata.dasm (-0.01% of base)
     -32 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.00% of base)

52 total files with Code Size differences (52 improved, 0 regressed), 108 unchanged.

Top method improvements (bytes):
    -50 (-2.38% of base) : System.Net.WebSockets.WebSocketProtocol.dasm - <WaitForServerToCloseConnectionAsync>d__64:MoveNext():this
    -36 (-2.81% of base) : System.Private.CoreLib.dasm - System.Globalization.CompareInfo:Compare(System.String,System.String,int):int:this (2 methods)
    -32 (-8.82% of base) : System.Private.CoreLib.dasm - System.Collections.Generic.Comparer`1[Vector`1][System.Numerics.Vector`1[System.Single]]:System.Collections.IComparer.Compare(System.Object,System.Object):int:this
    -22 (-7.97% of base) : System.Data.Common.dasm - System.Data.DataColumn:CompareValueTo(int,System.Object,bool):bool:this
    -22 (-17.60% of base) : System.Private.CoreLib.dasm - System.Globalization.HijriCalendar:GetDaysInMonth(int,int,int):int:this
    -20 (-2.74% of base) : Newtonsoft.Json.dasm - Newtonsoft.Json.Serialization.JsonSerializerInternalReader:SetPropertyValue(Newtonsoft.Json.Serialization.JsonProperty,Newtonsoft.Json.JsonConverter,Newtonsoft.Json.Serialization.JsonContainerContract,Newtonsoft.Json.Serialization.JsonProperty,Newtonsoft.Json.JsonReader,System.Object):bool:this
    -20 (-2.99% of base) : System.Private.CoreLib.dasm - System.DateTimeParse:MatchTimeMark(byref,System.Globalization.DateTimeFormatInfo,byref):bool
    -19 (-4.22% of base) : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:CompareTypes(Microsoft.CSharp.RuntimeBinder.Semantics.TypeArray,Microsoft.CSharp.RuntimeBinder.Semantics.TypeArray):int
    -19 (-7.36% of base) : System.Private.CoreLib.dasm - System.DefaultBinder:FindMostSpecificMethod(System.Reflection.MethodBase,System.Int32[],System.Type,System.Reflection.MethodBase,System.Int32[],System.Type,System.Type[],System.Object[]):int
    -19 (-6.53% of base) : System.Private.CoreLib.dasm - System.Collections.Generic.Comparer`1[Double][System.Double]:System.Collections.IComparer.Compare(System.Object,System.Object):int:this
    -19 (-7.66% of base) : System.Reflection.MetadataLoadContext.dasm - System.DefaultBinder:FindMostSpecificMethod(System.Reflection.MethodBase,System.Int32[],System.Type,System.Reflection.MethodBase,System.Int32[],System.Type,System.Type[],System.Object[]):int
    -18 (-3.48% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.CodeGen.StackOptimizerPass1:IsNestedLocalOfCompoundOperator(Microsoft.CodeAnalysis.CSharp.Symbols.LocalSymbol,Microsoft.CodeAnalysis.CSharp.BoundSequence):bool:this
    -18 (-1.26% of base) : System.Private.CoreLib.dasm - System.Globalization.CompareInfo:IndexOf(System.String,System.String,int,int,int):int:this (2 methods)
    -18 (-1.81% of base) : System.Private.CoreLib.dasm - System.String:LastIndexOf(System.String,int,int,int):int:this
    -18 (-6.25% of base) : System.Private.CoreLib.dasm - System.Runtime.InteropServices.CustomMarshalers.EnumVariantViewOfEnumerator:Next(int,System.Object[],long):int:this
    -18 (-3.99% of base) : System.Private.Xml.dasm - System.Xml.Schema.SchemaCollectionCompiler:IsGroupBaseFromGroupBase(System.Xml.Schema.XmlSchemaGroupBase,System.Xml.Schema.XmlSchemaGroupBase,bool):bool:this
    -16 (-3.77% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberContainerTypeSymbol:DoOperatorsPair(Microsoft.CodeAnalysis.CSharp.Symbols.MethodSymbol,Microsoft.CodeAnalysis.CSharp.Symbols.MethodSymbol):bool
    -15 (-2.50% of base) : System.Data.Common.dasm - System.Data.Common.DateTimeOffsetStorage:CompareValueTo(int,System.Object):int:this
    -14 (-6.90% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.ConversionsBase:HasImplicitConversionFromDelegate(Microsoft.CodeAnalysis.CSharp.Symbols.TypeSymbol,Microsoft.CodeAnalysis.CSharp.Symbols.TypeSymbol,byref):bool:this
    -14 (-5.34% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.AssemblySymbol:PerformIVTCheck(System.Collections.Immutable.ImmutableArray`1[Byte],Microsoft.CodeAnalysis.AssemblyIdentity):int:this

Top method improvements (percentages):
    -5 (-31.25% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.ThreeStateHelpers:ToThreeState(bool):ubyte
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToInt32(bool):int
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToUInt32(bool):int
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToUInt16(bool):ushort
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToByte(bool):ubyte
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToInt16(bool):short
    -4 (-30.77% of base) : System.Private.CoreLib.dasm - System.Convert:ToSByte(bool):byte
    -4 (-30.77% of base) : System.Private.Xml.dasm - System.Xml.Xsl.XPathConvert:NotZero(int):int
    -4 (-28.57% of base) : System.Private.CoreLib.dasm - System.Boolean:GetHashCode():int:this
    -4 (-28.57% of base) : System.Private.CoreLib.dasm - System.Threading.SpinWait:get_NextSpinWillYield():bool:this
    -5 (-27.78% of base) : Microsoft.CodeAnalysis.dasm - VersionResourceSerializer:get_FileType():int:this
    -9 (-27.27% of base) : System.Linq.Expressions.dasm - System.Linq.Expressions.Compiler.LambdaCompiler:EmitExpressionStart(System.Linq.Expressions.Expression):int:this
    -4 (-26.67% of base) : Microsoft.CSharp.dasm - BinOpFullSig:isLifted():bool:this
    -4 (-26.67% of base) : Microsoft.CSharp.dasm - UnaOpFullSig:isLifted():bool:this
    -4 (-26.67% of base) : System.ComponentModel.Composition.dasm - System.ComponentModel.Composition.ImportAttribute:System.ComponentModel.Composition.IAttributedImport.get_Cardinality():int:this
    -4 (-26.67% of base) : System.ComponentModel.TypeConverter.dasm - FilterCacheItem:IsValid(System.ComponentModel.Design.ITypeDescriptorFilterService):bool:this
    -4 (-26.67% of base) : System.Data.Common.dasm - System.Data.LookupNode:DependsOn(System.Data.DataColumn):bool:this
    -8 (-26.67% of base) : System.Data.Common.dasm - System.Data.Common.ADP:SrcCompare(System.String,System.String):int
    -8 (-26.67% of base) : System.Data.OleDb.dasm - System.Data.Common.ADP:SrcCompare(System.String,System.String):int
    -8 (-26.67% of base) : System.Diagnostics.TraceSource.dasm - System.Diagnostics.BooleanSwitch:get_Enabled():bool:this

406 total methods with Code Size differences (406 improved, 0 regressed), 304680 unchanged.

This PR handles only Case 1 for now (as a start):
[image]

and the jit-diff already looks promising; I think the GT_ASG ones should produce even bigger diffs.
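For context, a sketch of the GT_ASG-based shape this PR does not handle yet (assignments in the two successor blocks instead of returns; illustrative only):

static int TestAsg(int a)
{
    int x;
    if (a > 0)
        x = 1; // GT_ASG in the "true" block
    else
        x = 0; // GT_ASG in the "false" block
    return x;
}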

Also, see #32632 cc @lpereira
(can I use GT_LEA here to handle even more cases?)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 23, 2020
@EgorBo (Member, Author) commented Feb 24, 2020

A quite popular pattern in the BCL that this PR fixes: sharplab.io (e.g. here)

@Suchiman (Contributor) commented:

I suppose this also improves this?

// Ternary operator returning true/false prevents redundant asm generation:
// https://github.com/dotnet/runtime/issues/4207
return (value == null || 0u >= (uint)value.Length) ? true : false;

@EgorBo (Member, Author) commented Feb 24, 2020

I suppose this also improves this?

// Ternary operator returning true/false prevents redundant asm generation:
// https://github.com/dotnet/runtime/issues/4207
return (value == null || 0u >= (uint)value.Length) ? true : false;

It won't change anything for IsNullOrEmpty: the optimization gives up if the return true/false blocks are used by more than a single basic block; otherwise it would be a size regression.
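To illustrate with the snippet above (the comments are my annotation of the bail-out condition, not code from the PR):

static bool IsNullOrEmpty(string value)
{
    // Both the null check and the length check flow into the same
    // "return true" path, so the return blocks have more than one
    // in-edge (countOfInEdges() != 1) and the transform is skipped.
    return (value == null || 0u >= (uint)value.Length) ? true : false;
}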

@kunalspathak (Contributor) left a comment:

LGTM

@stephentoub (Member) commented:

What's the status of this PR? Should it be merged?

@EgorBo (Member, Author) commented May 5, 2020

@dotnet/jit-contrib What do you think?
This PR is a base for other cases, e.g. GT_ASG, which should be even more common IMO.

@BruceForstall BruceForstall changed the title RuyJIT: Optimize simple comparisons into branchless expressions RyuJIT: Optimize simple comparisons into branchless expressions May 5, 2020
@sandreenko sandreenko self-requested a review May 5, 2020 19:40
@sandreenko (Contributor) left a comment:

The optimization and the implementation look correct. However, I am worried about the compilation-time impact: without tiering I would vote against this; with tiering enabled it could be worth doing.
It is linear in the number of blocks and uses no additional memory, but could you estimate what percentage of methods it will affect once you support all 4 cases? Is it more like 1% or 0.1%?
Can it be combined with the existing logic that optimizes branches like GT_TRUE(1,1)?

#ifndef TARGET_64BIT
        if ((typ == TYP_INT) && (addValue == INT_MIN))
        {
            continue;
Contributor:

Does this protect against the ? INT_MIN : INT_MIN+1 case? What is the problem with it?

// TODO: implement for GT_ASG
(trueBb->bbJumpKind != BBJ_RETURN) || (falseBb->bbJumpKind != BBJ_RETURN) ||
// give up if these blocks are used by someone else
(trueBb->countOfInEdges() != 1) || (falseBb->countOfInEdges() != 1) ||
Contributor:

// give up if these blocks are used by someone else

why?

@EgorBo (Member, Author):

I need to check, but I think there will be regressions (in terms of size, not performance).

Member:

The transformation will be invalid, because those return blocks can be reached under other conditions.

Contributor:

The transformation will be invalid

Why? We don't delete these blocks; we only remove the edges from block to trueBb and falseBb. trueBb and falseBb stay alive for the other preds.

}

// unlink both blocks
fgRemoveRefPred(trueBb, block);
@sandreenko (Contributor) commented May 11, 2020:

Do we just remove the edges? Does unreachable code elimination delete the blocks?

@EgorBo (Member, Author):

Yeah, the blocks are eliminated; I will check at exactly what point.

            continue;
        }

        Statement* firstStmt = block->lastStmt();
Contributor:

Why is the last statement called firstStmt?

@EgorBo (Member, Author):

This is how I skip possible leading GT_ASG statements
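For example (an illustrative sketch, not code from the PR), a block may contain assignment statements ahead of the conditional jump, so starting from lastStmt() targets the JTRUE directly:

static int Test(int a)
{
    int tmp = a * 2;        // a leading GT_ASG statement in the same block
    return tmp > 0 ? 1 : 0; // the block ends with the conditional jump
}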

(trueBb->bbJumpKind != BBJ_RETURN) || (falseBb->bbJumpKind != BBJ_RETURN) ||
// give up if these blocks are used by someone else
(trueBb->countOfInEdges() != 1) || (falseBb->countOfInEdges() != 1) ||
(trueBb->bbFlags & BBF_DONT_REMOVE) || (falseBb->bbFlags & BBF_DONT_REMOVE))
Member:

Suggested change:
-    (trueBb->bbFlags & BBF_DONT_REMOVE) || (falseBb->bbFlags & BBF_DONT_REMOVE))
+    ((trueBb->bbFlags & BBF_DONT_REMOVE) != 0) || ((falseBb->bbFlags & BBF_DONT_REMOVE) != 0))
@erozenfeld (Member) commented:

At the very least we should measure the throughput cost of this, e.g., by using a pin tool and crossgen-ing SPC. It's not a full walk of the IR (it looks at no more than one statement per basic block), so it may not be too costly. Still, I would prefer to know the CQ benefit (and throughput cost) once the missing cases are added. Depending on the throughput results, we may consider folding this optimization into one of the existing phases.

@AndyAyersMS (Member) commented:

Note we also already do return block merging for identical constant return values during fgAddInternal; I wonder if we can extend this to handle merging lexically similar return blocks.

@stephentoub (Member) commented:

@EgorBo, are you still working on this?

@EgorBo (Member, Author) commented Sep 18, 2020

@stephentoub yes, sorry.
I am planning to look at what I can do here tomorrow or on Sunday (I will close this on Monday otherwise).

@EgorBo (Member, Author) commented Sep 20, 2020

@AndyAyersMS @erozenfeld @sandreenko
I'm going to close this for now, since I think it should be done differently:

I think it's time to introduce CMOV-based optimizations, especially because we are now able to collect profile data and so will be able to avoid perf regressions (see this godbolt sample).

So ideally there should be an optimization like the one in this PR that collapses if-else blocks back into ternary operators where needed. For that we can introduce a new node, GT_SELECT (inspired by LLVM, see https://godbolt.org/z/94YjYf), or reuse QMARK+COLON (currently it's not welcome in LIR; lots of asserts), and then codegen would emit either cmov (csel on ARM) or cmp+add for simple cases. Or just use a named intrinsic.
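As a sketch of the idea (illustrative only; GT_SELECT did not exist at this point), the pass would turn a diamond like the one below back into a single select that codegen can lower branchlessly:

static int Max(int a, int b)
{
    // Today: a BBJ_COND plus two BBJ_RETURN blocks, i.e. a real branch.
    // With a select node: one block lowered to cmp + cmov (csel on ARM).
    if (a > b)
        return a;
    return b;
}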

PS: regarding JIT throughput and optOptimizeBools (which walks all statements): optOptimizeBools currently brings only about a -1000 byte diff, making it a questionable optimization, mostly because it gives up on many cases (see sharplab.io); Roslyn is able to handle the simple cases itself: sharplab.io

@EgorBo EgorBo closed this Sep 20, 2020
@EgorBo (Member, Author) commented Sep 20, 2020

HW_INTRINSIC-based implementation for CMOVnn: EgorBo@1271fe5

static int Test1(int x) { return x == 42 ? 1000 : 2000; }
static int Test2(int x, int a, int b) { return x == 42 ? a : b; }
; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
       83F92A               cmp      ecx, 42
       B8E8030000           mov      eax, 0x3E8
       BAD0070000           mov      edx, 0x7D0
       0F45C2               cmovne   eax, edx
G_M56601_IG03:
       C3                   ret
; Total bytes of code: 17

; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
       83F92A               cmp      ecx, 42
       8BC2                 mov      eax, edx
       410F45C0             cmovne   eax, r8d
G_M50938_IG03:
       C3                   ret
; Total bytes of code: 10

Works better with PGO 🙂:
[image]

@AndyAyersMS (Member) commented:

I think we may focus on this during .NET 6 -- as you note, not only do we have to figure out how to represent conditional instructions in the IR, but we also need to have the right heuristics to decide when to use them.

@tannergooding (Member) commented:

HW_INTRINSIC-based implementation

@EgorBo, do you think it's worth opening a proposal for CMOV as a public HWIntrinsic?

@EgorBo (Member, Author) commented Sep 22, 2020

HW_INTRINSIC-based implementation

@EgorBo, do you think it's worth opening a proposal for CMOV as a public HWIntrinsic?

Honestly, not sure; I'd expect the JIT to handle it, since with PGO jumps could be better (https://github.com/xiadz/cmov).
But I recall @damageboy wanted one for his bitonic sort.
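For illustration, one purely hypothetical shape such a proposal might take (the name and signature below are invented here; nothing like this was proposed in the thread):

// Hypothetical API sketch only; not a real .NET API.
// public static int ConditionalMove(bool condition, int ifTrue, int ifFalse);
//
// Usage sketch: int r = ConditionalMove(x == 42, a, b); // cmp + cmovne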

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020