Skip to content

Conversation

@kunalspathak
Copy link
Contributor

@kunalspathak kunalspathak commented Jul 25, 2023

Inline the TLS access for windows/x64 so we avoid calls to get ThreadStaticBase.

Current code gen:

 0000000000001AC4: E8 00 00 00 00 call __GetThreadStaticBase_repro_Program  0000000000001AC9: C7 40 08 0D F0 0D mov dword ptr [rax+8],900DF00Dh ... ... __GetThreadStaticBase_repro_Program:  000000000000104D: 8B 0D 00 00 00 00 mov ecx,dword ptr [_tls_index]  0000000000001053: 65 48 8B 04 25 58 mov rax,qword ptr gs:[58h]  00 00 00  000000000000105C: 48 8B 04 C8 mov rax,qword ptr [rax+rcx*8]  0000000000001060: B9 00 00 00 00 mov ecx,offset tls_InlinedThreadStatics  0000000000001065: 48 01 C1 add rcx,rax  0000000000001068: 48 8B 01 mov rax,qword ptr [rcx]  000000000000106B: 48 85 C0 test rax,rax  000000000000106E: 0F 84 00 00 00 00 je S_P_CoreLib_Internal_Runtime_ThreadStatics__GetInlinedThreadStaticBaseSlow  0000000000001074: C3 ret

New code gen:

 0000000000001AE5: 65 48 8B 04 25 58 mov rax,qword ptr gs:[58h]  00 00 00  0000000000001AEE: 8B 0D 00 00 00 00 mov ecx,dword ptr [_tls_index]  0000000000001AF4: BA 00 00 00 00 mov edx,offset tls_InlinedThreadStatics  0000000000001AF9: 48 03 14 C8 add rdx,qword ptr [rax+rcx*8]  0000000000001AFD: 48 8B CA mov rcx,rdx  0000000000001B00: 48 8B 19 mov rbx,qword ptr [rcx]  0000000000001B03: 48 85 DB test rbx,rbx  0000000000001B06: 74 1D je 0000000000001B25  0000000000001B08: C7 43 08 0D F0 0D mov dword ptr [rbx+8],900DF00Dh

Contributes to #79521

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 25, 2023
@ghost ghost assigned kunalspathak Jul 25, 2023
@ghost
Copy link

ghost commented Jul 25, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Inline the TLS access for windows/x64 so we avoid calls to get ThreadStaticBase.

Current code gen:

 0000000000001AC4: E8 00 00 00 00 call __GetThreadStaticBase_repro_Program  0000000000001AC9: C7 40 08 0D F0 0D mov dword ptr [rax+8],900DF00Dh ... ... __GetThreadStaticBase_repro_Program:  000000000000104D: 8B 0D 00 00 00 00 mov ecx,dword ptr [_tls_index]  0000000000001053: 65 48 8B 04 25 58 mov rax,qword ptr gs:[58h]  00 00 00  000000000000105C: 48 8B 04 C8 mov rax,qword ptr [rax+rcx*8]  0000000000001060: B9 00 00 00 00 mov ecx,offset tls_InlinedThreadStatics  0000000000001065: 48 01 C1 add rcx,rax  0000000000001068: 48 8B 01 mov rax,qword ptr [rcx]  000000000000106B: 48 85 C0 test rax,rax  000000000000106E: 0F 84 00 00 00 00 je S_P_CoreLib_Internal_Runtime_ThreadStatics__GetInlinedThreadStaticBaseSlow  0000000000001074: C3 ret

New code gen:

 0000000000001AE5: 65 48 8B 04 25 58 mov rax,qword ptr gs:[58h]  00 00 00  0000000000001AEE: 8B 0D 00 00 00 00 mov ecx,dword ptr [_tls_index]  0000000000001AF4: BA 00 00 00 00 mov edx,offset tls_InlinedThreadStatics  0000000000001AF9: 48 03 14 C8 add rdx,qword ptr [rax+rcx*8]  0000000000001AFD: 48 8B CA mov rcx,rdx  0000000000001B00: 48 8B 19 mov rbx,qword ptr [rcx]  0000000000001B03: 48 85 DB test rbx,rbx  0000000000001B06: 74 1D je 0000000000001B25  0000000000001B08: C7 43 08 0D F0 0D mov dword ptr [rbx+8],900DF00Dh
Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -
@am11
Copy link
Member

am11 commented Jul 26, 2023

@kunalspathak, some platforms seem to have tls_get_addr in RO section. We've started to get this on illumos-x64 after #87082:

$ docker run --rm -v$(pwd):/runtime -e ROOTFS_DIR=/crossrootfs/x64 \ mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-cross-illumos-20220531132048-f13d79e \ /runtime/build.sh -c Release -os illumos -cross -gcc ... [ 95%] Built target clrjit_win_x64_x64 [ 95%] Built target coreclr_static /crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: ../../../vm/wks/CMakeFiles/cee_wks_core.dir/__/amd64/asmhelpers.S.o: warning: relocation against `__tls_get_addr@@SUNWprivate_1.1' in read-only section `.text'  /crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: ../../../vm/wks/CMakeFiles/cee_wks_core.dir/__/amd64/asmhelpers.S.o: relocation R_X86_64_PC32 against protected symbol `__tls_get_addr@@SUNWprivate_1.1' can not be used when making a shared object; recompile with -fPIC  /crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: final link failed: bad value  collect2: error: ld returned 1 exit status  dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/build.make:974: recipe for target 'dlls/mscoree/coreclr/libcoreclr.so' failed  make[2]: *** [dlls/mscoree/coreclr/libcoreclr.so] Error 1  CMakeFiles/Makefile2:4678: recipe for target 'dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/all' failed  make[1]: *** [dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/all] Error 2  Makefile:135: recipe for target 'all' failed  make: *** [all] Error 2  /runtime/src/coreclr  Failed to build "CoreCLR component".

I haven't had time to go deeper on it, but thought you might wana know about this variation.

@kunalspathak
Copy link
Contributor Author

@kunalspathak, some platforms seem to have tls_get_addr in RO section.

tls_get_addr is needed for linux based OS. This PR just adds support for windows x64 currently.

recompile with -fPIC

That seems familiar. Might have to spend time trying to understand.

@kunalspathak kunalspathak marked this pull request as ready for review July 29, 2023 10:06
@kunalspathak
Copy link
Contributor Author

@dotnet/jit-contrib @VSadov @MichalStrehovsky

@kunalspathak
Copy link
Contributor Author

@MichalStrehovsky - let me know if there are more pipelines that i should trigger to test this out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to deal with the static constructor independently from the field access? Something like;

  • If the type has static constructor, emit regular static constructor trigger. This can be then optimized/deduplicated with static constructor triggers for the same type.
  • Emit thread static access. This has no side-effects, it can be moved around, CSEd, etc.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to deal with the static constructor independently from the field access?

The expansion happens after all the optimizations, CSE, etc. is done. We tried earlier to do in earlier phase, but it was making it hard to hoist certain checks as mentioned in #82973 (comment). Ideally, we need to expand just subset of appropriate portion of tree in various phases to achieve idle code, but might be tricky to get all the cases. Even today, without these changes, we do not hoist the call to helper out of loop because of

else if (s_helperCallProperties.MayRunCctor(helpFunc) &&
((call->gtFlags & GTF_CALL_HOISTABLE) == 0))
{
INDEBUG(failReason = "non-hoistable helper call";)
treeIsHoistable = false;
}

 static void Test(int n) { for (int i = 0; i < n; i++) { x1 = 5; x2 = 5; x3 = 5; y1 = 5; y2 = 5; y3 = 5; } }
G_M000_IG01: ;; offset=0000H  push rsi  push rbx  sub rsp, 40  mov ebx, ecx G_M000_IG02: ;; offset=0008H  lea rax, [(reloc 0x42aa78)]  cmp qword ptr [rax-08H], 0  jne SHORT G_M000_IG06 G_M000_IG03: ;; offset=0016H  xor esi, esi  test ebx, ebx  jle SHORT G_M000_IG05 G_M000_IG04: ;; offset=001CH  call CORINFO_HELP_READYTORUN_THREADSTATIC_BASE  mov dword ptr [rax+08H], 5  mov dword ptr [rax+0CH], 5  mov dword ptr [rax+10H], 5  mov dword ptr [rax+14H], 5  mov dword ptr [rax+18H], 5  mov dword ptr [rax+1CH], 5  inc esi  cmp esi, ebx  jl SHORT G_M000_IG04 G_M000_IG05: ;; offset=0051H  add rsp, 40  pop rbx  pop rsi  ret G_M000_IG06: ;; offset=0058H  call CORINFO_HELP_READYTORUN_NONGCSTATIC_BASE  jmp SHORT G_M000_IG03

I propose to take these changes for .NET 8 because of time constraints and revisit missing gaps in .NET 9.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dotnet/jit-contrib Do you believe that this duplication is worth it for the time being?

.NET 8 because of time constraints and revisit missing gaps in .NET 9.

I think it may be too late to take this change for .NET 8. We are actively trying to shut things down for .NET 8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be too late to take this change for .NET 8.

In that case, I can start working on resolving the duplication problem that you pointed and get this in for .NET 9.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried exploring on the idea of expanding lazy static ctor separately and it could be beneficial in cases where we have both a static init and a TLS, defined and used on same type (see example below). But it still won't eliminate the extra ctor check we introduce by expanding static initializer. If we have the static ctor trigger early, the VN will not
give it same number because under the hood, they call different helper. We could theortically avoid emitting the 2nd invocation of ctor check (0x40000000004203c8 related check below) it during expansion, if we have seen similar ctor invocation earlier, but for that we will have to make sure the previous ctor invocation dominates the block having the 2nd ctor invocation.

static string i = 5.ToString(); [ThreadStatic] static string j = 5.ToString(); void Foo() { Consume(i); Consume(j); }
G_M27646_IG02: ;; offset=0x0005  lea rcx, [(reloc 0x40000000004203c8)]  cmp qword ptr [rcx-0x08], 0  jne SHORT G_M27646_IG06  mov rcx, qword ptr [(reloc 0x40000000004203d0)]  mov rcx, gword ptr [rcx+0x08]  call Program:Consume(System.String)  lea r8, [(reloc 0x40000000004203c8)]  add r8, -8  cmp qword ptr [r8], 0  jne SHORT G_M27646_IG07 ;; size=47 bbWeight=1 PerfScore 14.25 G_M27646_IG03: ;; offset=0x0034  mov rcx, qword ptr GS:[0x0058]  mov eax, dword ptr [(reloc 0x40000000004203d8)]  mov edx, (reloc 0x40000000004203e0)  add rdx, qword ptr [rcx+8*rax]  mov rbx, qword ptr [rdx]  test rbx, rbx  je SHORT G_M27646_IG08 ;; size=32 bbWeight=1 PerfScore 9.50 G_M27646_IG04: ;; offset=0x0054  mov rcx, gword ptr [rbx+0x80]  call Program:Consume(System.String)  nop  ;; size=13 bbWeight=1 PerfScore 3.25 G_M27646_IG05: ;; offset=0x0061  add rsp, 32  pop rbx  ret  ;; size=6 bbWeight=1 PerfScore 1.75 G_M27646_IG06: ;; offset=0x0067  call CORINFO_HELP_READYTORUN_GCSTATIC_BASE  mov rcx, qword ptr [(reloc 0x40000000004203d0)]  mov rcx, gword ptr [rcx+0x08]  call Program:Consume(System.String)  lea r8, [(reloc 0x40000000004203c8)]  add r8, -8  cmp qword ptr [r8], 0  je SHORT G_M27646_IG03 ;; size=38 bbWeight=0 PerfScore 0.00 G_M27646_IG07: ;; offset=0x008D  xor ecx, ecx  mov edx, -1  lea rax, [(reloc 0x40000000004203f0)]  call rax  mov rbx, rax  jmp SHORT G_M27646_IG04 ;; size=21 bbWeight=0 PerfScore 0.00 G_M27646_IG08: ;; offset=0x00A2  mov rcx, rdx  lea rax, [(reloc 0x40000000004203e8)]  call rax  mov rbx, rax  jmp SHORT G_M27646_IG04 ;; size=17 bbWeig

I am not sure how common it is to see such patterns, @VSadov - any thoughts?

In cases such as below, we generate expected code:

 [ThreadStatic] static string i = 5.ToString(); [ThreadStatic] static string j = 5.ToString(); for (int k = 0; k < 10; k++) { Consume(i); Consume(j); }
G_M27646_IG02: ;; offset=0x0006  xor ebx, ebx  lea r8, [(reloc 0x40000000004203e8)]  add r8, -8  cmp qword ptr [r8], 0  jne SHORT G_M27646_IG05  mov rcx, qword ptr GS:[0x0058]  mov eax, dword ptr [(reloc 0x40000000004203c8)]  mov edx, (reloc 0x40000000004203d0)  add rdx, qword ptr [rcx+8*rax]  mov rsi, qword ptr [rdx]  test rsi, rsi  je SHORT G_M27646_IG06  ;; size=51 bbWeight=1 PerfScore 14.50 G_M27646_IG03: ;; offset=0x0039  mov rcx, gword ptr [rsi+0x68]  call Program:Consume(System.String)  mov rcx, gword ptr [rsi+0x78]  call Program:Consume(System.String)  inc ebx  cmp ebx, 10  jl SHORT G_M27646_IG03
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of general patterns, I think we can only assume that threadstatics are relatively less common than other kinds of fields. Also it is not common to have a lot of them in the same type. It is valid and certainly possible to see numerous threadstatics in a type, but it is more common for a type to have just one such field that holds all the threadstatic state for the type.
I'd say - 2 threadstatic fields in a type is likely common enough, 20 - is much less likely.

Mixing threadstatic and regular static fields in the same type is fairly common.

As for the order of access within the same method - static or threadstatic first, I think it would be too subtle if one pattern performs better than another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can only assume that threadstatics are relatively less common than other kinds of fields

On the other hand, I'd guess that code that uses them is often performance sensitive so handling it efficiently for AOT is important.

@VSadov
Copy link
Member

VSadov commented Jan 10, 2024

I tried running a few libraires testcases in a loop with the change, for stress purposes, and once in a while (about every 5-15 iterations) I see occasional crashes that look like GC holes.
That is with -rc Release -lc Release.

Same tests with f2a9ef8 commit from main (I assume that is the baseline for this change) run without failures, at least for 300 iterations, that I waited for.

Sometimes it is not a crash, but a test failure - typically like:

 System.Threading.SynchronizationLockException : Object synchronization method was called from an unsynchronized block of code. at System.Threading.Monitor.Exit(Object) + 0x15b at System.Collections.Concurrent.ConcurrentDictionary`2.ReleaseLocks(Int32) + 0x2d at System.Collections.Concurrent.ConcurrentDictionary`2.get_Count() + 0x34 at System.Collections.Tests.IDictionary_Generic_Tests`2.AddToCollection(ICollection`1 collection, Int32 numberOfItemsToAdd) + 0x151 at System.Collections.Tests.IDictionary_Generic_Tests`2.GenericIDictionaryFactory(Int32 count) + 0x2d at System.Collections.Tests.IDictionary_Generic_Tests`2.IDictionary_Generic_Values_ModifyingTheDictionaryUpdatesTheCollection(Int32 c 

Possibly the same reason, just a different way to fail.

Maybe worth checking if all is ok with GC info?

@kunalspathak
Copy link
Contributor Author

Maybe worth checking if all is ok with GC info?

It was not. I was not considering that the value t_inlinedThreadStaticBase returned needs tracking. Here is the before vs. after fix diff:

image
@VSadov
Copy link
Member

VSadov commented Jan 13, 2024

Maybe worth checking if all is ok with GC info?

It was not. I was not considering that the value t_inlinedThreadStaticBase returned needs tracking. Here is the before vs. after fix diff:

I am running the same subset of the tests that was hitting the GC holes and after 100+ iterations I do not see any failures.
It looks like it was the culprit.

@VSadov
Copy link
Member

VSadov commented Jan 13, 2024

I cannot comment on the shape of JIT changes, but we do get the desired codegen and the run time behavior. Very nice!

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!!

@VSadov
Copy link
Member

VSadov commented Jan 13, 2024

The failures in System.Text.Json.Tests on CoreCLR appear to be #96876 and not related to this change since this is NativeAOT-specific.

Process terminated. Assertion failed. Implementation depends on List<T> always using a T[] and not U[] where U : T. at System.Text.Json.Serialization.Tests.MetadataTests.VerifyWeatherForecastWithPOCOs(WeatherForecastWithPOCOs expected, WeatherForecastWithPOCOs obj) in /_/src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Serialization/MetadataTests/MetadataTests.JsonSerializer.cs:line 84 at System.Text.Json.Serialization.Tests.MetadataTests.RoundTripSerializerOverloads() in /_/src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Serialization/MetadataTests/MetadataTests.JsonSerializer.cs:line 21 at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs:line 38 at System.Text.Json.Serialization.Tests.MetadataTests.RoundTripSerializerOverloads() at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor) at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57 at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass48_0.<<InvokeTestMethodAsync>b__1>d.MoveNext() 
@kunalspathak
Copy link
Contributor Author

@BruceForstall - can you please take another look?

@kunalspathak kunalspathak merged commit 7989f18 into dotnet:main Jan 17, 2024
@kunalspathak kunalspathak deleted the tls_nativeaot_winx64 branch January 17, 2024 20:31
tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
* wip * working model * wip * wip * working * Add helper for tlsIndex * add methods in superpmi * revert some local changes * misc fixes * Stop emitting TLS access code for windows/x64 * fix linux build errors * Do not throw not implemented for windows/x64 * fix the problem where ThreadStaticBase helper was still getting invoked * Revert certain changes from JIT method * Introduce getThreadLocalStaticInfo_ReadyToRun() * Consume getThreadLocalStaticInfo_ReadyToRun() * Remove getTlsRootInfo() and other methods * Revert unneeded changes * missing gtInitCldHnd initialization * save target address * jit format * run thunkgenerator * resolve merge conflicts * fix issues so the TLS is inlined * Rename data structures from *_ReadyToRun to *_NativeAOT * jit format * fix some unit test * fix a bug * fix the weird jump problem * use enclosing type cls handle for VN of static gc/non-gc helper * fix a bug of resetting the flag * useEnclosingTypeOnly from runtime to determine if VN should optimize it * do not use vnf, but only use useEnclosingTypeAsArg0 * Use GT_COMMA to add GCStaticBase call next to TLS call * optimize the cctor call * Remove lazy ctor generation from tls * Update jitinterface to not fetch data for lazy ctor * fix errors after merge * fix test build errors * fix bug in CSE * Use CORINFO_FLG_FIELD_INITCLASS instead of separate flag * Use the INITCLASS flag * Remove useEnclosingTypeOnly * Add NoCtor * Use CORINFO_HELP_READYTORUN_THREADSTATIC_BASE_NOCTOR * Minor cleanup * Renegenrate thunk * Add the SetFalseTarget * fix merge conflict resolution * better handling of GTF_ICON_SECREL_OFFSET better handling of GTF_ICON_SECREL_OFFSET * review feedback * Disable optimization for minopts * Add comments around iiaSecRel * jit format * create emitNewInstrCns() * Expand TLS even if optimization is disabled * Track t_inlinedThreadStaticBase Better tracking `t_inlinedThreadStaticBase` as TYP_REF * jit format
@github-actions github-actions bot locked and limited conversation to collaborators Feb 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

10 participants