
Conversation

@SwapnilGaikwad
Contributor

It implements the decode-from-UTF8 algorithm listed here.

@ghost added the needs-area-label label on Apr 3, 2024
@dotnet-policy-service bot added the community-contribution label on Apr 3, 2024
@MihaZupan added the area-System.Memory label and removed the needs-area-label label on Apr 3, 2024
@SwapnilGaikwad
Contributor Author

On an N1 machine, the patch improves performance:

| Method | Toolchain | MemoryRandomization | NumberOfBytes | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Base64Decode | /base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | Default | 1000 | 357.7 ns | 0.18 ns | 0.16 ns | 357.7 ns | 357.3 ns | 357.9 ns | 1.00 | Base |
| Base64Decode | /patch/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | Default | 1000 | 283.7 ns | 0.03 ns | 0.03 ns | 283.7 ns | 283.7 ns | 283.8 ns | 0.79 | Faster |
| Base64DecodeDestinationTooSmall | /base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | True | 1000 | 369.0 ns | 10.47 ns | 12.06 ns | 368.5 ns | 356.9 ns | 383.1 ns | 1.00 | Base |
| Base64DecodeDestinationTooSmall | /patch/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | True | 1000 | 286.4 ns | 3.69 ns | 3.46 ns | 283.9 ns | 283.8 ns | 291.5 ns | 0.78 | Faster |
@SwapnilGaikwad
Contributor Author

@kunalspathak
Contributor

@dotnet/jit-contrib @tannergooding


```csharp
// Invalid values are set to 255 during the look-ups above using 'decLutOne' and 'decLutTwo'.
// Thus the intermediate results 'decOne' and 'decTwo' can be OR-ed to get the final values.
str1 = AdvSimd.Or(decOne1, decTwo1);
```
Contributor

Question for others: when writing a routine that is purely AdvSimd, for basic functions such as `Or`, should we be using AdvSimd throughout, or mixing APIs and using `Vector128`?

Contributor

I agree. We should just use `decOne1 | decTwo1`.

Member

@tannergooding (Apr 8, 2024)

In general, we want to avoid writing routines that are purely for a given architecture or ISA wherever possible.

It is simply not maintainable to write and support implementations for all of:

  • Arm64
    • AdvSimd
    • Sve/Sve2
  • x86/x64
    • Sse2/Sse4.1
    • Avx/Avx2
    • Avx512/Avx10
  • WASM
  • RISC-V
  • LoongArch64
  • etc

So, in the ideal world we are using xplat code and sharing across all target platforms for the vast majority of code we write. The only duplication we should have in the common case is for different vector sizes; that is, currently duplicating code across V128/V256/V512 and Vector, based on the sizes we want to support. Even this is something we're looking at allowing more sharing of in the future via `ISimdVector<TSelf, T>`, so code is truly written once as much as possible.

We then opportunistically light up usage of platform specific behavior only if the additional complexity is justified by the perf gains and ideally only where absolutely required, still using shared xplat code for most of the algorithm.

The xplat APIs should cover most core functionality and the JIT should recognize common patterns and optimize to better platform specific codegen where possible. Where there is some functionality that might be missing, we can propose and expose new APIs that work better across the range of platforms we need to support.
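
As a minimal sketch of the idea (illustrative only, not code from this PR):

```csharp
using System.Runtime.Intrinsics;

static class XplatSketch
{
    // One cross-platform implementation: the JIT lowers '|' to 'orr' on Arm64
    // (AdvSimd) and 'por' on x64 (SSE2), so no per-ISA code paths are needed.
    static Vector128<byte> MergeLookups(Vector128<byte> decOne, Vector128<byte> decTwo)
        => decOne | decTwo;
}
```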

Contributor

@a74nh left a comment

Otherwise LGTM

Contributor

@kunalspathak left a comment

Added a few comments, and could you please share the generated code?

@SwapnilGaikwad
Contributor Author

> Added a few comments, and could you please share the generated code?

Assembly for 'AdvSimdDecode', extracted from 'System.Private.CoreLib.dll':
```asm
b.cc    0xffffa8327378              // b.lo, b.ul, b.last
mov     x6, x27
str     x28, [x29, #272]
movi    v8.16b, #0x3f               // <==== Line#894: DuplicateToVector128()
str     q8, [x29, #256]
ldr     q9, 0xffffa83278b0
str     q9, [x29, #96]
ldr     q10, 0xffffa83278c0
str     q10, [x29, #80]
ldr     q11, 0xffffa83278d0
str     q11, [x29, #64]
ldr     q12, 0xffffa83278e0
str     q12, [x29, #48]
ldr     q13, 0xffffa83278f0
str     q13, [x29, #32]
ldr     q14, 0xffffa8327900
str     q14, [x29, #16]
str     w4, [x29, #344]
mov     w2, w4
str     x6, [x29, #280]
mov     x0, x6
mov     x1, x27
adrp    x11, 0xffffa8e1c000
add     x11, x11, #0xef8
mov     v8.d[0], v9.d[1]
mov     v11.d[0], v10.d[1]
ldr     x13, [x11]
blr     x13
ldr     x6, [x29, #280]
ld4     {v16.16b-v19.16b}, [x6]     // <==== Line#902: LoadVector128x4AndUnzip()/approx. loop start
stp     q16, q17, [x29, #192]
stp     q18, q19, [x29, #224]
ldp     q16, q17, [x29, #192]
ldp     q18, q19, [x29, #224]
ldr     q20, [x29, #256]
uqsub   v21.16b, v16.16b, v20.16b
uqsub   v22.16b, v17.16b, v20.16b
uqsub   v23.16b, v18.16b, v20.16b
uqsub   v24.16b, v19.16b, v20.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v9.d[1], v8.d[0]
mov     v10.d[1], v11.d[0]
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v16.16b, {v25.16b-v28.16b}, v16.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v17.16b, {v27.16b-v30.16b}, v17.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v18.16b, {v25.16b-v28.16b}, v18.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v19.16b, {v27.16b-v30.16b}, v19.16b
ldp     q26, q25, [x29, #48]
ldp     q28, q27, [x29, #16]
tbx     v21.16b, {v25.16b-v28.16b}, v21.16b
tbx     v22.16b, {v25.16b-v28.16b}, v22.16b
tbx     v23.16b, {v25.16b-v28.16b}, v23.16b
tbx     v24.16b, {v25.16b-v28.16b}, v24.16b
orr     v16.16b, v16.16b, v21.16b
orr     v17.16b, v17.16b, v22.16b
orr     v18.16b, v18.16b, v23.16b
orr     v19.16b, v19.16b, v24.16b
cmhi    v21.16b, v16.16b, v20.16b
cmhi    v22.16b, v17.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v18.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v19.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
umaxp   v21.4s, v21.4s, v21.4s
mov     x2, v21.d[0]
cmp     x2, #0x0
b.ne    0xffffa8327350              // b.any
shl     v16.16b, v16.16b, #2
ushr    v21.16b, v17.16b, #4
orr     v8.16b, v16.16b, v21.16b
shl     v16.16b, v17.16b, #4
ushr    v17.16b, v18.16b, #2
orr     v11.16b, v16.16b, v17.16b
shl     v16.16b, v18.16b, #6
orr     v12.16b, v16.16b, v19.16b
mov     w2, w19
ldr     x0, [x29, #272]
mov     x1, x28
adrp    x11, 0xffffa8e1c000
add     x11, x11, #0xf00
mov     v14.d[0], v8.d[1]
mov     v9.d[0], v11.d[1]
mov     v10.d[0], v12.d[1]
ldr     x3, [x11]
blr     x3
mov     v8.d[1], v14.d[0]
mov     v11.d[1], v9.d[0]
mov     v12.d[1], v10.d[0]
ldr     x7, [x29, #272]
mov     v16.16b, v8.16b
mov     v17.16b, v11.16b
mov     v18.16b, v12.16b
st3     {v16.16b-v18.16b}, [x7]
ldr     x6, [x29, #280]
add     x6, x6, #0x40
add     x7, x7, #0x30
ldr     x3, [x29, #288]
cmp     x6, x3
str     x7, [x29, #272]
str     x3, [x29, #288]
ldp     q10, q9, [x29, #80]
b.ls    0xffffa832754c              // b.plast
str     x6, [x29, #280]             // <==== End of loop
ldr     x6, [x29, #280]
mov     x4, x6
ldr     x7, [x29, #272]
mov     x5, x7
ldr     x6, [x29, #312]
cmp     x4, x6
b.eq    0xffffa8327740              // b.none
```
@SwapnilGaikwad
Contributor Author

Hi @kunalspathak, there is no change in the generated assembly after the refactoring.

New assembly sequence:

```asm
b.cc    0xffffa8337378              // b.lo, b.ul, b.last
mov     x6, x27
str     x28, [x29, #272]
movi    v8.16b, #0x3f
str     q8, [x29, #256]
ldr     q9, 0xffffa83378b0
str     q9, [x29, #96]
ldr     q10, 0xffffa83378c0
str     q10, [x29, #80]
ldr     q11, 0xffffa83378d0
str     q11, [x29, #64]
ldr     q12, 0xffffa83378e0
str     q12, [x29, #48]
ldr     q13, 0xffffa83378f0
str     q13, [x29, #32]
ldr     q14, 0xffffa8337900
str     q14, [x29, #16]
str     w4, [x29, #344]
mov     w2, w4
str     x6, [x29, #280]
mov     x0, x6
mov     x1, x27
adrp    x11, 0xffffa8e2c000
add     x11, x11, #0xef8
mov     v8.d[0], v9.d[1]
mov     v11.d[0], v10.d[1]
ldr     x13, [x11]
blr     x13
ldr     x6, [x29, #280]
ld4     {v16.16b-v19.16b}, [x6]
stp     q16, q17, [x29, #192]
stp     q18, q19, [x29, #224]
ldp     q16, q17, [x29, #192]
ldp     q18, q19, [x29, #224]
ldr     q20, [x29, #256]
uqsub   v21.16b, v16.16b, v20.16b
uqsub   v22.16b, v17.16b, v20.16b
uqsub   v23.16b, v18.16b, v20.16b
uqsub   v24.16b, v19.16b, v20.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v9.d[1], v8.d[0]
mov     v10.d[1], v11.d[0]
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v16.16b, {v25.16b-v28.16b}, v16.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v17.16b, {v27.16b-v30.16b}, v17.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v18.16b, {v25.16b-v28.16b}, v18.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v19.16b, {v27.16b-v30.16b}, v19.16b
ldp     q26, q25, [x29, #48]
ldp     q28, q27, [x29, #16]
tbx     v21.16b, {v25.16b-v28.16b}, v21.16b
tbx     v22.16b, {v25.16b-v28.16b}, v22.16b
tbx     v23.16b, {v25.16b-v28.16b}, v23.16b
tbx     v24.16b, {v25.16b-v28.16b}, v24.16b
orr     v16.16b, v16.16b, v21.16b
orr     v17.16b, v17.16b, v22.16b
orr     v18.16b, v18.16b, v23.16b
orr     v19.16b, v19.16b, v24.16b
cmhi    v21.16b, v16.16b, v20.16b
cmhi    v22.16b, v17.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v18.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v19.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
umaxp   v21.4s, v21.4s, v21.4s
mov     x2, v21.d[0]
cmp     x2, #0x0
b.ne    0xffffa8337350              // b.any
shl     v16.16b, v16.16b, #2
ushr    v21.16b, v17.16b, #4
orr     v8.16b, v16.16b, v21.16b
shl     v16.16b, v17.16b, #4
ushr    v17.16b, v18.16b, #2
orr     v11.16b, v16.16b, v17.16b
shl     v16.16b, v18.16b, #6
orr     v12.16b, v16.16b, v19.16b
mov     w2, w19
ldr     x0, [x29, #272]
mov     x1, x28
adrp    x11, 0xffffa8e2c000
add     x11, x11, #0xf00
mov     v14.d[0], v8.d[1]
mov     v9.d[0], v11.d[1]
mov     v10.d[0], v12.d[1]
ldr     x3, [x11]
blr     x3
mov     v8.d[1], v14.d[0]
mov     v11.d[1], v9.d[0]
mov     v12.d[1], v10.d[0]
ldr     x7, [x29, #272]
mov     v16.16b, v8.16b
mov     v17.16b, v11.16b
mov     v18.16b, v12.16b
st3     {v16.16b-v18.16b}, [x7]
ldr     x6, [x29, #280]
add     x6, x6, #0x40
add     x7, x7, #0x30
ldr     x3, [x29, #288]
cmp     x6, x3
str     x7, [x29, #272]
str     x3, [x29, #288]
ldp     q10, q9, [x29, #80]
b.ls    0xffffa833754c              // b.plast
str     x6, [x29, #280]
ldr     x6, [x29, #280]
mov     x4, x6
ldr     x7, [x29, #272]
mov     x5, x7
ldr     x6, [x29, #312]
cmp     x4, x6
b.eq    0xffffa8337740
```
Contributor

@kunalspathak left a comment

LGTM

@kunalspathak merged commit fa1164c into dotnet:main on Apr 14, 2024
Comment on lines +867 to +868
```csharp
Vector128<byte> decLutOne1 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne2 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
```
Member

These could have been `Vector128<byte>.AllBitsSet`.

Comment on lines +867 to +874
```csharp
Vector128<byte> decLutOne1 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne2 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne3 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte();
Vector128<byte> decLutOne4 = Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutTwo1 = Vector128.Create(0x0100FF00, 0x05040302, 0x09080706, 0x0D0C0B0A).AsByte();
Vector128<byte> decLutTwo2 = Vector128.Create(0x11100F0E, 0x15141312, 0x19181716, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutTwo3 = Vector128.Create(0x1B1AFFFF, 0x1F1E1D1C, 0x23222120, 0x27262524).AsByte();
Vector128<byte> decLutTwo4 = Vector128.Create(0x2B2A2928, 0x2F2E2D2C, 0x33323130, 0xFFFFFFFF).AsByte();
```
Member

It's not clear why these all need to be declared up front; it effectively defines unnecessary locals and creates extra work for the JIT.

This could have been `var decLutOne = (Vector128<byte>.AllBitsSet, Vector128<byte>.AllBitsSet, Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(), Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte())` instead, as an example.
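
Fenced out for readability, the suggested form (same table values as the four locals above) is consumed directly by the existing lookup call:

```csharp
// Single tuple declaration replacing the four decLutOne* locals.
var decLutOne = (Vector128<byte>.AllBitsSet,
                 Vector128<byte>.AllBitsSet,
                 Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(),
                 Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte());

// VectorTableLookup accepts the 4-element tuple, as the PR's code already does:
Vector128<byte> decOne1 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str1);
```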

Comment on lines +876 to +890
```csharp
Vector128<byte> decOne1;
Vector128<byte> decOne2;
Vector128<byte> decOne3;
Vector128<byte> decOne4;
Vector128<byte> decTwo1;
Vector128<byte> decTwo2;
Vector128<byte> decTwo3;
Vector128<byte> decTwo4;
Vector128<byte> str1;
Vector128<byte> str2;
Vector128<byte> str3;
Vector128<byte> str4;
Vector128<byte> res1;
Vector128<byte> res2;
Vector128<byte> res3;
```
Member

@tannergooding (Apr 14, 2024)

Similar here: it's not clear why these were defined up front like this. They could have been declared in the loop at their first use site (since they aren't used outside the loop), making the overall code much more readable without needing 15 up-front declarations.
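
For example, a sketch of declaring at first use, with the PR's own names:

```csharp
// Inside the loop body: one deconstructing declaration replaces four of the
// fifteen up-front locals; the remaining temporaries can be declared the same way.
(Vector128<byte> str1, Vector128<byte> str2, Vector128<byte> str3, Vector128<byte> str4) =
    AdvSimd.Arm64.LoadVector128x4AndUnzip(src);
```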


```csharp
byte* src = srcBytes;
byte* dest = destBytes;
Vector128<byte> offset = AdvSimd.DuplicateToVector128((byte)0x3F);
```
Member

`Vector128.Create<byte>(0x3F)` is preferred over `AdvSimd.DuplicateToVector128`.

The former allows more general optimizations to kick in as it does not prescribe a particular instruction be emitted.
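
A side-by-side sketch (both lines broadcast 0x3F to every byte lane):

```csharp
// ISA-specific: prescribes AdvSimd codegen.
Vector128<byte> offsetAdvSimd = AdvSimd.DuplicateToVector128((byte)0x3F);

// xplat: same value, but the JIT is free to pick the best instruction
// (or fold the constant entirely) on any target.
Vector128<byte> offsetXplat = Vector128.Create<byte>(0x3F);
```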

```csharp
do
{
    // Load 64 bytes and de-interleave.
    AssertRead<Vector128<byte>>(src, srcStart, sourceLength);
```
Member

I know this predates the PR, but it's not a great method name. It should probably be something like `AssertReadIsValid` or similar.

```csharp
{
    // Load 64 bytes and de-interleave.
    AssertRead<Vector128<byte>>(src, srcStart, sourceLength);
    (str1, str2, str3, str4) = AdvSimd.Arm64.LoadVector128x4AndUnzip(src);
```
Member

It would be nice if `LoadVector128x4AndUnzip` were something we could pattern-match and implicitly light up.

Contributor Author

Did you mean a wrapper method, say `LoadVectorAndUnzip()`, that would do the right thing based on the return type?

Member

No; rather, as per my comment above (#100589 (comment)), we prefer to write xplat code so logic can be shared across CPU architectures where possible.

To facilitate this, we ideally have the JIT recognize simple patterns and optimize them. `LoadVector128x4`, for example, could be done automatically if the JIT were to recognize 4x sequential `Vector128.Load`.

`LoadVector128x4AndUnzip` could theoretically have the same done if we were to recognize a 4x sequence of `Vector128.Load` plus a general pattern for unzip. Today that would mean recognizing `Shuffle` in particular, but longer term it might be a good reason for an xplat `Vector128.Unzip` (since that just corresponds to Unzip on Arm64 and Unpack on x64).


The driving question tends to be: "does this logic actually have to be platform/ISA specific, or is there a simple pattern we could enable the JIT to recognize instead?" Where simple pattern recognition is possible, it's generally ideal to enable it so that existing user code lights up and gets faster on Arm64 without users needing to do the work manually. It also lets the same code automatically light up for things like AdvSimd vs SVE, rather than being strictly tied to AdvSimd.
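
For instance, a hypothetical xplat sequence the JIT could fuse into a single multi-register load on Arm64 (a sketch of the pattern, not current JIT behavior):

```csharp
// Four sequential 16-byte loads; a pattern-matching JIT could combine these
// into one LoadVector128x4 without the user writing any AdvSimd-specific code.
Vector128<byte> a = Vector128.Load(src);
Vector128<byte> b = Vector128.Load(src + 16);
Vector128<byte> c = Vector128.Load(src + 32);
Vector128<byte> d = Vector128.Load(src + 48);
```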

Contributor Author

Right, thanks for the clarification. I'll keep this in mind for the common patterns in the future.

Comment on lines +942 to +944
```csharp
res1 = AdvSimd.ShiftLeftLogical(str1, 2) | AdvSimd.ShiftRightLogical(str2, 4);
res2 = AdvSimd.ShiftLeftLogical(str2, 4) | AdvSimd.ShiftRightLogical(str3, 2);
res3 = AdvSimd.ShiftLeftLogical(str3, 6) | str4;
```
Member

These could have been of the form `res1 = (str1 << 2) | (str2 >> 4);`
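
Spelled out for all three packing lines (a sketch; for byte lanes `>>` is a logical shift, matching `ShiftRightLogical`):

```csharp
// Repack four 6-bit groups into three bytes using the xplat shift/OR operators.
res1 = (str1 << 2) | (str2 >> 4);
res2 = (str2 << 4) | (str3 >> 2);
res3 = (str3 << 6) | str4;
```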

Comment on lines +930 to +933
```csharp
Vector128<byte> classified = AdvSimd.CompareGreaterThan(str1, offset)
                           | AdvSimd.CompareGreaterThan(str2, offset)
                           | AdvSimd.CompareGreaterThan(str3, offset)
                           | AdvSimd.CompareGreaterThan(str4, offset);
```
Member

These could have been the xplat `Vector128.GreaterThan(str1, offset)`.
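
For example, a sketch of the rewrite using the xplat helper (same semantics as the AdvSimd calls above):

```csharp
// Same validity classification via the cross-platform API; for unsigned byte
// lanes the JIT can still emit 'cmhi' on Arm64.
Vector128<byte> classified = Vector128.GreaterThan(str1, offset)
                           | Vector128.GreaterThan(str2, offset)
                           | Vector128.GreaterThan(str3, offset)
                           | Vector128.GreaterThan(str4, offset);
```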

Contributor Author

I didn't understand the xplat bit. Could you elaborate on this, please?

Member

We have APIs defined on Vector128 that expose functionality common across multiple ISAs (AdvSimd, SVE, etc) and common across multiple architectures (Arm64, x64, WASM, etc).

Using these APIs tends to be preferred over using the platform-specific APIs, particularly when there is a clear 1-to-1 mapping. Thus `x + y` is preferred over `AdvSimd.Add`, and `Vector128.GreaterThan` is preferred over `AdvSimd.CompareGreaterThan`.

Using the xplat (cross platform) APIs gives the JIT more flexibility in the code it generates and allows more portability, while also often improving readability.

Contributor Author

Makes sense, I'll update these calls 👍

Comment on lines +904 to +927
```csharp
// Get indices for second LUT:
decTwo1 = AdvSimd.SubtractSaturate(str1, offset);
decTwo2 = AdvSimd.SubtractSaturate(str2, offset);
decTwo3 = AdvSimd.SubtractSaturate(str3, offset);
decTwo4 = AdvSimd.SubtractSaturate(str4, offset);

// Get values from first LUT. Out-of-range indices are set to 0.
decOne1 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str1);
decOne2 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str2);
decOne3 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str3);
decOne4 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str4);

// Get values from second LUT. Out-of-range indices are unchanged.
decTwo1 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo1, decLutTwo, decTwo1);
decTwo2 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo2, decLutTwo, decTwo2);
decTwo3 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo3, decLutTwo, decTwo3);
decTwo4 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo4, decLutTwo, decTwo4);

// Invalid values are set to 255 during the look-ups above using 'decLutOne' and 'decLutTwo'.
// Thus the intermediate results 'decOne' and 'decTwo' can be OR-ed to get the final values.
str1 = decOne1 | decTwo1;
str2 = decOne2 | decTwo2;
str3 = decOne3 | decTwo3;
str4 = decOne4 | decTwo4;
```
Member

This general algorithm could probably do with a comment walking through what's happening, so the "magic" is better understood, much like the one that exists for `Vector128Decode`.

Contributor Author

Agreed, it's difficult to follow, though it's a bit tricky to describe the full workings of the lookup tables and interleaving. I'll try to make it less opaque.
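
For reference, one possible reading of the lookup scheme (a sketch of the kind of walkthrough being asked for, not the author's final comment):

```csharp
// Each input byte is classified by two 64-entry tables split around 0x3F:
//  - decOne = TBL(decLutOne, byte): TBL yields 0 for indices >= 64, so this only
//    handles bytes 0x00-0x3F ('+' -> 62, '/' -> 63, '0'-'9' -> 52-61, others -> 0xFF).
//  - decTwo = TBX(decLutTwo, SubtractSaturate(byte, 0x3F)): bytes <= 0x3F saturate
//    to index 0 (-> 0), while 'A'-'Z' map to 0-25 and 'a'-'z' to 26-51; TBX leaves
//    out-of-range lanes at their index value, which is >= 0x40 and hence invalid.
// For a valid character, one lookup produces the 6-bit value and the other produces 0,
// so OR-ing them yields the result; any lane left above 0x3F flags invalid input.
```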

@SwapnilGaikwad
Contributor Author

Hi @tannergooding, thanks for your comments. I'll add a follow-up PR incorporating as many suggestions as I can.

@SwapnilGaikwad
Contributor Author

#101620 addresses the above suggestions.

matouskozak pushed a commit to matouskozak/runtime that referenced this pull request on Apr 30, 2024
* Use multi-reg load/store for DecodeFromUTF8
* Address review comments
@github-actions bot locked and limited conversation to collaborators on May 27, 2024

Labels

arch-arm64 · area-System.Memory · community-contribution

6 participants