Skip to content

Conversation

@kunalspathak
Copy link
Contributor

@kunalspathak kunalspathak commented Mar 4, 2023

Summary

This is an early prototype to add support of inlining ThreadStatic (TS) field access (for now just the primitive types. Today, for fields that are marked with ThreadStatic always has to go through the helper. The helper first gets the current thread, the thread local block, then the ThreadLocalModule for the current moduleIndex and lastly the static data block. This prototype tries to inline such accesses by adding few data structures that acts as a cache.

A thread local cache t_threadStaticBlocks (array of pointers) is added which stores the static blocks corresponding to the given thread.

During codegen, the enclosing type of the TS field that is being access (load or store) is monitored and gets a unique index. This index is the position at which the static data block will be stored in t_threadStaticBlocks cache at runtime. Since the index is assigned during codegen and embedded in the code, it remains same for any thread that executes the code at runtime. Hence any thread that executes the code, will make sure that it gets the relevant static data block from t_threadStaticBlocks cache.

The t_threadStaticBlocks is populated during runtime as well. The first time the field access code is executed, it tries to find the static data block in t_threadStaticBlocks but doesn't find it. It fallbacks to the slow path which is the existing helper call. The helper call has been modified to update the t_threadStaticBlocks with the static data block. Next time when the TS field access code is executed, the entry is found in the cache, and we skip going to the helpers.

To access the t_threadStaticBlocks at runtime, code is generated to access the relevant cache for the current thread by fetching the TLS of current thread, getting the slot for runtime, and then getting the t_threadStaticBlocks present in that slot.

Disassembly

To understand how it works, consider the following C# code.

[MethodImpl(MethodImplOptions.NoInlining)] public int GetThreadStaticInt() => t_threadStaticIntValue; [ThreadStatic] private static int t_threadStaticIntValue = 0;

Today, we always generate a helper call to retrieve the t_threadStaticIntValue field value.

G_M000_IG01: ;; offset=0000H  4883EC28 sub rsp, 40   G_M000_IG02: ;; offset=0004H  48B9D0482999FA7F0000 mov rcx, 0x7FFA992948D0  BA78010000 mov edx, 376  E81894A65F call CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR  8B80A4040000 mov eax, dword ptr [rax+04A4H]   G_M000_IG03: ;; offset=001EH  4883C428 add rsp, 40  C3 ret 

With this prototype, this is what we would generate:

G_M10050_IG01: ;; offset=0000H  4883EC28 sub rsp, 40 G_M10050_IG02: ;; offset=0004H  65488B042558000000 mov rax, qword ptr GS:[0x0058] ; Access the TLS of current thread  488B4030 mov rax, qword ptr [rax+30H] ; Get the runtime TLS slot from TLS[_tls_index]  4883784802 cmp qword ptr [rax+48H], 2 ; See if length of `t_threadStaticBlocks` < typeIndex. Here `typeIndex == 2`.  7F15 jg SHORT G_M10050_IG05 ; If yes, then proceed, else fallback to the helper G_M10050_IG03: ;; offset=0018H  488B4050 mov rax, qword ptr [rax+50H] ; Get the `t_threadStaticBlocks`.  488B4010 mov rax, qword ptr [rax+10H] ; Get the `t_threadStaticBlocks[typeIndex]`.  4885C0 test rax, rax ; Check if a valid entry is present at `t_threadStaticBlocks[typeIndex]`  7408 je SHORT G_M10050_IG05 ; If invalid, then go to the helper  8B4038 mov eax, dword ptr [rax+38H] ; If valid, then get the field value. G_M10050_IG04: ;; offset=0028H  4883C428 add rsp, 40  C3 ret G_M10050_IG05: ;; offset=002DH  48B990296324FA7F0000 mov rcx, 0x7FFA24632990  BA0D000000 mov edx, 13  41B802000000 mov r8d, 2 ; This is a new paramter to the helper that would cache static data block in `t_threadStaticBlocks` at index `2`.  E849D8545F call CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR  8B4038 mov eax, dword ptr [rax+38H]  EBDC jmp SHORT G_M10050_IG04 

Details

TODO

TODO

  • Rename bunch of variables with relavent names
  • Add support for dynamic memory allocation for the t_threadStaticBlocks.
  • Add a new helper for TLS access. Currently the existing helper is being used with an added parameter for the typeIndex.
  • Mark the block containing helper call as rarely run, so it is moved to the bottom of the method.
  • Extract the common code in importer to a method.
  • Performance measurements
  • Hoisting some checks out of loop. Should the expansion happen in importer or some later phase?

Contributes to #79521.

@ghost ghost assigned kunalspathak Mar 4, 2023
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 4, 2023
@ghost
Copy link

ghost commented Mar 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

This is an early prototype to add support of inlining ThreadStatic (TS) field access (for now just the primitive types. Today, for fields that are marked with ThreadStatic always has to go through the helper. The helper first gets the current thread, the thread local block, then the ThreadLocalModule for the current moduleIndex and lastly the static data block. This prototype tries to inline such accesses by adding few data structures that acts as a cache.

A thread local cache t_threadStaticBlocks (array of pointers) is added which stores the static blocks corresponding to the given thread.

During codegen, the enclosing type of the TS field that is being access (load or store) is monitored and gets a unique index. This index is the position at which the static data block will be stored in t_threadStaticBlocks cache at runtime. Since the index is assigned during codegen and embedded in the code, it remains same for any thread that executes the code at runtime. Hence any thread that executes the code, will make sure that it gets the relevant static data block from t_threadStaticBlocks cache.

The t_threadStaticBlocks is populated during runtime as well. The first time the field access code is executed, it tries to find the static data block in t_threadStaticBlocks but doesn't find it. It fallbacks to the slow path which is the existing helper call. The helper call has been modified to update the t_threadStaticBlocks with the static data block. Next time when the TS field access code is executed, the entry is found in the cache, and we skip going to the helpers.

To access the t_threadStaticBlocks at runtime, code is generated to access the relevant cache for the current thread by fetching the TLS of current thread, getting the slot for runtime, and then getting the t_threadStaticBlocks present in that slot.

Disassembly

To understand how it works, consider the following C# code.

[MethodImpl(MethodImplOptions.NoInlining)] public int GetThreadStaticInt() => t_threadStaticIntValue; [ThreadStatic] private static int t_threadStaticIntValue = 0;

Today, we always generate a helper call to retrieve the t_threadStaticIntValue field value.

G_M000_IG01: ;; offset=0000H  4883EC28 sub rsp, 40   G_M000_IG02: ;; offset=0004H  48B9D0482999FA7F0000 mov rcx, 0x7FFA992948D0  BA78010000 mov edx, 376  E81894A65F call CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR  8B80A4040000 mov eax, dword ptr [rax+04A4H]   G_M000_IG03: ;; offset=001EH  4883C428 add rsp, 40  C3 ret 

With this prototype, this is what we would generate:

G_M000_IG01: ;; offset=0000H  56 push rsi  4883EC20 sub rsp, 32   G_M000_IG02: ;; offset=0005H  65488B0C2558000000 mov rcx, qword ptr GS:[0x0058] ; Access the TLS of current thread  488B7130 mov rsi, qword ptr [rcx+30H] ; Get the runtime TLS slot from TLS[_tls_index]  4883BEB801000002 cmp qword ptr [rsi+1B8H], 2 ; See if length of `t_threadStaticBlocks` > typeIndex. Here `typeIndex == 2`.  7E22 jle SHORT G_M000_IG04 ; If yes, then proceed, else fallback to the helper   G_M000_IG03: ;; offset=001CH  48B9D048F522FA7F0000 mov rcx, 0x7FFA22F548D0  BA78010000 mov edx, 376  41B802000000 mov r8d, 2 ; This is a new paramter to the helper that would cache static data block in `t_threadStaticBlocks` at index `2`.  E8FAF8AB5F call CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR  8B80A4040000 mov eax, dword ptr [rax+04A4H]  EB16 jmp SHORT G_M000_IG06   G_M000_IG04: ;; offset=003EH  488B86B0010000 mov rax, qword ptr [rsi+1B0H] ; Get the `t_threadStaticBlocks[typeIndex]`.  488B4010 mov rax, qword ptr [rax+10H] ; Get the `t_threadStaticBlocks[typeIndex]`.  4885C0 test rax, rax ; Check if a valid entry is present at `t_threadStaticBlocks[typeIndex]`.   74CE je SHORT G_M000_IG03 ; If invalid, then go to the helper   G_M000_IG05: ;; offset=004EH  8B80A4040000 mov eax, dword ptr [rax+04A4H] ; If valid, then get the field value.   G_M000_IG06: ;; offset=0054H  4883C420 add rsp, 32  5E pop rsi  C3 ret 

Details

TODO

TODO

  • Rename bunch of variables with relavent names
  • Add support for dynamic memory allocation for the t_threadStaticBlocks.
  • Add a new helper for TLS access. Currently the existing helper is being used with an added parameter for the typeIndex.
  • Mark the block containing helper call as rarely run, so it is moved to the bottom of the method.
  • Performance measurements

Contributes to #79521.

Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -
@tannergooding
Copy link
Member

tannergooding commented Mar 4, 2023

@kunalspathak, can you also give an example of the use of a TLS in a for loop so we can see what's getting hoisted (one time cost) vs what is repeated cost?

For example, what is the codegen given the following:

 [ThreadStatic] public static volatile int t_value; public static int Test(uint count) { int sum = 0; for (uint i = 0; i < count; i++) { sum += t_value; } return sum; }

I'd expect we end up with the large up front block as "hoistable", that is the initial resolution of the TLS base. The inner loop should then remain "small" and effectively just a direct memory access since we'll have already resolved the base/offset of the TLS value for the given thread.

Notably it also looks like there is a "cheap check" and "expensive fallback" to the TLS handling here. The way the blocks are being ordered doesn't look to mark the expensive fallback as "cold" which might negatively impact things, particularly for subsequent executions of the code.

@kunalspathak
Copy link
Contributor Author

Failures are known issues.

Copy link
Contributor

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple hopefully minor questions/comments

@EgorBo
Copy link
Member

EgorBo commented Apr 10, 2023

LGTM with a couple of nits

@kunalspathak
Copy link
Contributor Author

Other failures are unrelated and fixed by #84649

@kunalspathak kunalspathak merged commit 563408a into dotnet:main Apr 11, 2023
@kunalspathak kunalspathak deleted the tls branch April 11, 2023 20:53
@ghost ghost locked as resolved and limited conversation to collaborators May 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI