September 25th, 2023

MSVC Machine-Independent Optimizations in Visual Studio 2022 17.7

Troy Johnson
Principal C++ Compiler Engineering Lead

This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 Preview 3 (released July 11, 2023). Each optimization below includes assembly listings for both X64 and ARM64 to demonstrate its machine-independent nature.

Optimizing Memory Across Block Boundaries

When a small struct is loaded into a register, we can optimize field accesses to extract the correct bits from the register instead of accessing it through memory. Historically in MSVC, this optimization has been limited to memory accesses within the same basic block. We are now able to perform this same optimization across block boundaries in many cases.
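In the example below, std::string_view is essentially a pointer-and-size pair, so the whole object fits in a 128-bit register and each field can be extracted from that register. A rough model of the code shape involved (illustrative only; the real std::string_view and its operator== are more involved):

#include <cstddef>
#include <cstring>

// Rough stand-in for the two-field layout of std::string_view.
struct view_like {
    const char* data;   // loaded from offset 0 in the listings below
    std::size_t size;   // loaded from offset 8 in the listings below
};

bool equal(const view_like& l, const view_like& r) {
    if (l.size != r.size)        // one basic block reads the size fields
        return false;
    // A later block reads the data fields; these reads can now be served
    // from the register copy instead of a spill to the stack.
    return std::memcmp(l.data, r.data, l.size) == 0;
}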

In the example ASM listings below, a load to the stack and a store from the stack are eliminated, resulting in less memory traffic as well as lower stack memory usage.

Example C++ Source Code:

#include <string_view>

bool compare(const std::string_view& l, const std::string_view& r) {
    return l == r;
}

Required Compiler Flags: /O2

X64 ASM:

17.4:

sub        rsp, 56
movups     xmm0, XMMWORD PTR [rcx]
mov        r8, QWORD PTR [rcx+8]
movaps     XMMWORD PTR $T1[rsp], xmm0
cmp        r8, QWORD PTR [rdx+8]
jne        SHORT $LN9@compare
mov        rdx, QWORD PTR [rdx]
mov        rcx, QWORD PTR $T1[rsp]
call       memcmp
test       eax, eax
jne        SHORT $LN9@compare
mov        al, 1
add        rsp, 56
ret        0
$LN9@compare:
xor        al, al
add        rsp, 56
ret        0

17.7:

sub         rsp, 40
mov         r8, QWORD PTR [rcx+8]
movups      xmm1, XMMWORD PTR [rcx]
cmp         r8, QWORD PTR [rdx+8]
jne         SHORT $LN9@compare
mov         rdx, QWORD PTR [rdx]
movq        rcx, xmm1
call        memcmp
test        eax, eax
jne         SHORT $LN9@compare
mov         al, 1
add         rsp, 40
ret         0
$LN9@compare:
xor         al, al
add         rsp, 40
ret         0

 

ARM64 ASM:

17.4:

str         lr,[sp,#-0x10]!
sub         sp,sp,#0x20
ldr         q17,[x1]
ldr         q16,[x0]
umov        x8,v17.d[1]
umov        x2,v16.d[1]
stp         q17,q16,[sp]
cmp         x2,x8
bne         |$LN9@compare|
ldr         x1,[sp]
ldr         x0,[sp,#0x10]
bl          memcmp
cbnz        w0,|$LN9@compare|
mov         w0,#1
add         sp,sp,#0x20
ldr         lr,[sp],#0x10
ret
|$LN9@compare|
mov         w0,#0
add         sp,sp,#0x20
ldr         lr,[sp],#0x10
ret

17.7:

str         lr,[sp,#-0x10]!
ldr         q17,[x1]
ldr         q16,[x0]
umov        x8,v17.d[1]
umov        x2,v16.d[1]
cmp         x2,x8
bne         |$LN9@compare|
fmov        x1,d17
fmov        x0,d16
bl          memcmp
cbnz        w0,|$LN9@compare|
mov         w0,#1
ldr         lr,[sp],#0x10
ret
|$LN9@compare|
mov         w0,#0
ldr         lr,[sp],#0x10
ret

 

Vector Logical and Arithmetic Optimizations

We continue to add patterns for recognizing vector operations that are equivalent to intrinsics or short sequences of intrinsics. An example is recognizing common forms of vector absolute-difference calculations. A long series of subtract, compare, and bitwise instructions can be replaced with specialized instructions, such as vpabsd (vector absolute value) on X64 and sabd (vector signed absolute difference) on ARM64.
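In intrinsics terms, the recognized pattern amounts to a lane-wise |b[i] - c[i]|; roughly the following sketch (illustrative only, assuming SSSE3 intrinsics are available):

#include <immintrin.h>

// Lane-wise |b - c| for four signed 32-bit lanes: roughly what the
// recognized source pattern computes (subtract, then absolute value).
__m128i abs_diff_s32(__m128i b, __m128i c) {
    return _mm_abs_epi32(_mm_sub_epi32(b, c));
}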

Example C++ Source Code:

#include <cstdint>

void s32_1(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = (b[i] - c[i]) > 0 ? (b[i] - c[i]) : (c[i] - b[i]);
    }
}

Required Flags: /O2 /arch:AVX for X64, /O2 for ARM64

X64 ASM:

17.4:

$LL4@s32_1:
movdqu  xmm0, XMMWORD PTR [r11+rax]
add     ecx, 4
movdqu  xmm1, XMMWORD PTR [rax]
lea     rax, QWORD PTR [rax+16]
movdqa  xmm3, xmm0
psubd   xmm3, xmm1
psubd   xmm1, xmm0
movdqa  xmm2, xmm3
pcmpgtd xmm2, xmm4
movdqa  xmm0, xmm2
andps   xmm2, xmm3
andnps  xmm0, xmm1
orps    xmm0, xmm2
movdqu  XMMWORD PTR [r10+rax-16], xmm0
cmp     ecx, edx
jl      SHORT $LL4@s32_1

17.7:

$LL4@s32_1:
vmovdqu    xmm1, XMMWORD PTR [r10+rax]
vpsubd     xmm1, xmm1, XMMWORD PTR [rax]
vpabsd     xmm2, xmm1    ; vector abs
add        edx, 4
vmovdqu    XMMWORD PTR [rbx+rax], xmm2
lea        rax, QWORD PTR [rax+16]
cmp        edx, ecx
jl         SHORT $LL4@s32_1

 

ARM64 ASM:

17.4:

|$LL24@s32_1|
sbfiz       x9,x8,#2,#0x20
ldr         q19,[x9,x2]
add         w8,w8,#4
ldr         q18,[x9,x1]
cmp         w8,w10
sub         v16.4s,v18.4s,v19.4s
cmgt        v17.4s,v16.4s,v21.4s
and         v20.16b,v17.16b,v16.16b
sub         v16.4s,v19.4s,v18.4s
bic         v16.16b,v16.16b,v17.16b
orr         v16.16b,v16.16b,v20.16b
str         q16,[x9,x0]
blt         |$LL24@s32_1|
cmp         w8,w3
bge         |$LN27@s32_1|

17.7:

|$LL24@s32_1|
sbfiz       x9,x8,#2,#0x20
ldr         q17,[x9,x1]
add         w8,w8,#4
ldr         q16,[x9,x2]
cmp         w8,w10
sabd        v16.4s,v17.4s,v16.4s     ; vector abs
str         q16,[x9,x0]
blt         |$LL24@s32_1|

 

Vector Remainder Loops

Each iteration of a vectorized loop computes the same work as multiple iterations of the original scalar loop. Depending on the number of original scalar iterations and the vector width, there may be work left over when the vector loop ends. Although this work can be completed by a copy of the original scalar loop, it may contain sufficient work to warrant further vectorization using a smaller vector width. This optimization is now employed for targets that have a single vector width by partially packing a vector.

In the example below, three loops are emitted for the original loop. The first processes 64 elements per vector iteration by packing 16 8-bit values into a 128-bit vector register and then unrolling the vector loop by four. The second loop processes 8 elements per vector iteration by packing 8 8-bit values into half of a 128-bit vector register. The third loop processes one element per scalar iteration.
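The overall structure corresponds roughly to the hand-written sketch below, using X64 SSE2 intrinsics purely for illustration (the function name and bounds handling are simplified; the compiler's actual output is shown in the listings that follow):

#include <cstdint>
#include <emmintrin.h>   // SSE2

// Hand-written sketch of the three-loop strategy (illustrative only).
void test_sketch(int8_t* __restrict d, const int8_t* __restrict a,
                 const int8_t* __restrict b, int n) {
    int i = 0;
    // 1st vector loop: 16 bytes per register, unrolled by 4 => 64 elements per iteration.
    for (; i + 64 <= n; i += 64) {
        for (int j = 0; j < 64; j += 16) {
            __m128i va = _mm_loadu_si128((const __m128i*)(a + i + j));
            __m128i vb = _mm_loadu_si128((const __m128i*)(b + i + j));
            _mm_storeu_si128((__m128i*)(d + i + j), _mm_andnot_si128(vb, va)); // a & ~b
        }
    }
    // 2nd vector loop: 8 bytes packed into the low half of a 128-bit register.
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadl_epi64((const __m128i*)(a + i));
        __m128i vb = _mm_loadl_epi64((const __m128i*)(b + i));
        _mm_storel_epi64((__m128i*)(d + i), _mm_andnot_si128(vb, va));
    }
    // Scalar loop: one element per iteration.
    for (; i < n; ++i)
        d[i] = a[i] & ~b[i];
}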

Example C++ source code:

#include <cstdint>

void test(int8_t * __restrict d, int8_t * __restrict a, int8_t * __restrict b, int n) {
    for (int i = 0; i < n; i++) {
        d[i] = a[i] & (~b[i]);
    }
}

Required compiler flags: /O2

X64 ASM:

17.4:

$LL4@test:              ; 1st vector loop (64 element iters)
movdqu  xmm0, XMMWORD PTR [rcx-16]
add     edi, 64         ; 00000040H
movdqu  xmm1, XMMWORD PTR [rdx+rcx-16]
movdqu  xmm2, XMMWORD PTR [rdx+rcx]
lea     rcx, QWORD PTR [rcx+64]
pandn   xmm1, xmm3
lea     rax, QWORD PTR [r14+rcx]
pand    xmm1, xmm0
pandn   xmm2, xmm3
movdqu  xmm0, XMMWORD PTR [rcx-64]
movdqu  XMMWORD PTR [r9+rcx-80], xmm1
pand    xmm2, xmm0
movdqu  xmm0, XMMWORD PTR [rcx-48]
movdqu  xmm1, XMMWORD PTR [rdx+rcx-48]
movdqu  XMMWORD PTR [r9+rcx-64], xmm2
pandn   xmm1, xmm3
pand    xmm1, xmm0
movdqu  xmm2, XMMWORD PTR [rdx+rcx-32]
movdqu  xmm0, XMMWORD PTR [rcx-32]
pandn   xmm2, xmm3
pand    xmm2, xmm0
movdqu  XMMWORD PTR [r9+rcx-48], xmm1
movdqu  XMMWORD PTR [r9+rcx-32], xmm2
cmp     rax, rsi
jl      SHORT $LL4@test
mov     r14, QWORD PTR [rsp]
mov     rsi, QWORD PTR [rsp+24]
$LN9@test:
movsxd  rcx, edi
mov     rdx, rbx
mov     rdi, QWORD PTR [rsp+32]
mov     rbx, QWORD PTR [rsp+16]
cmp     rcx, rdx
jge     SHORT $LN3@test
sub     r8, r11
lea     rax, QWORD PTR [rcx+r11]
sub     r10, r11
sub     rdx, rcx
$LL8@test:              ; Scalar loop (1 element iters)
movzx   ecx, BYTE PTR [rax+r8]
lea     rax, QWORD PTR [rax+1]
not     cl
and     cl, BYTE PTR [rax-1]
mov     BYTE PTR [rax+r10-1], cl
sub     rdx, 1
jne     SHORT $LL8@test
$LN3@test:
add     rsp, 8
ret     0

17.7:

$LL4@test:                ; 1st vector loop (64 element iters)
movdqu    xmm0, XMMWORD PTR [rcx-16]
add       edx, 64         ; 00000040H
movdqu    xmm1, XMMWORD PTR [r8+rcx-16]
movdqu    xmm2, XMMWORD PTR [r8+rcx]
lea       rcx, QWORD PTR [rcx+64]
andnps    xmm1, xmm3
lea       rax, QWORD PTR [rsi+rcx]
andps     xmm1, xmm0
andnps    xmm2, xmm3
movdqu    xmm0, XMMWORD PTR [rcx-64]
movdqu    XMMWORD PTR [r9+rcx-80], xmm1
andps     xmm2, xmm0
movdqu    xmm0, XMMWORD PTR [rcx-48]
movdqu    xmm1, XMMWORD PTR [r8+rcx-48]
movdqu    XMMWORD PTR [r9+rcx-64], xmm2
andnps    xmm1, xmm3
andps     xmm1, xmm0
movdqu    xmm2, XMMWORD PTR [r8+rcx-32]
movdqu    xmm0, XMMWORD PTR [rcx-32]
andnps    xmm2, xmm3
andps     xmm2, xmm0
movdqu    XMMWORD PTR [r9+rcx-48], xmm1
movdqu    XMMWORD PTR [r9+rcx-32], xmm2
cmp       rax, rbp
jl        SHORT $LL4@test
mov       rbp, QWORD PTR [rsp+16]
mov       eax, edi
and       al, 63          ; 0000003fH
cmp       al, 8
jb        SHORT $LN27@test
$LN11@test:
mov       ecx, edi
movsxd    rax, edx
and       ecx, -8
movsxd    rsi, ecx
lea       rcx, QWORD PTR [rax+r10]
$LL10@test:               ; 2nd vector loop (8 element iters)
movq      xmm1, QWORD PTR [r8+rcx]
add       edx, 8
movq      xmm0, QWORD PTR [rcx]
andnps    xmm1, xmm3
andps     xmm1, xmm0
movq      QWORD PTR [r9+rcx], xmm1
add       rcx, 8
mov       rax, rcx
sub       rax, r10
cmp       rax, rsi
jl        SHORT $LL10@test
$LN27@test:
mov       rsi, QWORD PTR [rsp+24]
$LN9@test:
movsxd    rcx, edx
mov       rdx, rdi
mov       rdi, QWORD PTR [rsp+32]
cmp       rcx, rdx
jge       SHORT $LN3@test
sub       r11, r10
lea       rax, QWORD PTR [rcx+r10]
sub       rbx, r10
sub       rdx, rcx
npad      6
$LL8@test:                ; Scalar loop (1 element iters)
movzx     ecx, BYTE PTR [rax+r11]
lea       rax, QWORD PTR [rax+1]
not       cl
and       cl, BYTE PTR [rax-1]
mov       BYTE PTR [rax+rbx-1], cl
sub       rdx, 1
jne       SHORT $LL8@test
$LN3@test:
pop       rbx
ret       0

ARM64 ASM:

17.4 (was not vectorized):

|$LL4@test|
ldrsb       w9,[x10,x1]
sub         w3,w3,#1
ldrsb       w8,[x1]
bic         w8,w8,w9
strb        w8,[x11,x1]
add         x1,x1,#1
cbnz        w3,|$LL4@test|
|$LN3@test|
ret

17.7:

|$LL27@test|              ; 1st vector loop (64 element iters)
ldr         q17,[x2,w8 sxtw #0]
sxtw        x11,w8
ldr         q16,[x1,w8 sxtw #0]
add         x10,x11,#0x10
bic         v16.16b,v16.16b,v17.16b
ldr         q17,[x10,x2]
str         q16,[x0,w8 sxtw #0]
ldr         q16,[x10,x1]
add         w8,w8,#0x40
cmp         w8,w9
bic         v16.16b,v16.16b,v17.16b
str         q16,[x10,x0]
add         x10,x11,#0x20
ldr         q17,[x10,x2]
ldr         q16,[x10,x1]
bic         v16.16b,v16.16b,v17.16b
str         q16,[x10,x0]
add         x10,x11,#0x30
ldr         q17,[x10,x2]
ldr         q16,[x10,x1]
bic         v16.16b,v16.16b,v17.16b
str         q16,[x10,x0]
blt         |$LL27@test|
and         w9,w3,#0x3F
cmp         w9,#8
blo         |$LN30@test|
|$LN11@test|
and         w9,w3,#0xFFFFFFF8
|$LL29@test|              ; 2nd vector loop (8 element iters)
ldr         d17,[x2,w8 sxtw #0]
ldr         d16,[x1,w8 sxtw #0]
bic         v16.8b,v16.8b,v17.8b
str         d16,[x0,w8 sxtw #0]
add         w8,w8,#8
cmp         w8,w9
blt         |$LL29@test|
|$LN30@test|
cmp         w8,w3
bge         |$LN32@test|
|$LN21@test|
add         x9,x1,w8,sxtw #0
sub         x13,x2,x1
sub         w8,w3,w8
sub         x10,x0,x1
|$LL31@test|              ; Scalar loop (1 element iters)
ldrsb       w12,[x13,x9]
sub         w8,w8,#1
ldrsb       w11,[x9]
bic         w11,w11,w12
strb        w11,[x10,x9]
add         x9,x9,#1
cbnz        w8,|$LL31@test|
|$LN32@test|
ret

 

Conditional Moves/Selects

MSVC now considers more opportunities to use conditional move instructions to eliminate branches. Once such an opportunity is recognized, whether to use a conditional move or leave the branch intact is subject to legality checks and tuning heuristics. Eliminating a branch is often beneficial because it removes the possibility of the branch being mispredicted and may enable other optimizations that previously would have stopped at the block boundary formed by the branch. Nevertheless, it does convert a control dependence that could be correctly predicted into a data dependence that cannot be removed.
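In source terms, the transformation replaces a branch around an assignment with a selection of one of two values; a trivial sketch (whether a conditional move is actually profitable is left to the heuristics mentioned above):

// Trivial candidate for a conditional move/select rather than a branch.
int select_max(int a, int b) {
    return (a < b) ? b : a;
}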

This optimization applies only when the compiler can prove that it is safe to unconditionally perform a store because the instruction will always store one of two possible values instead of only performing a store when the condition is true. There are numerous reasons why an unconditional store may be unsafe. For example, if the memory location is shared, then another thread may observe a store when previously there was no store to observe. As another example, the condition may have protected a store through a potentially null pointer.
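Two hedged sketches (hypothetical code, not from the post's example) of stores that must stay behind a branch for exactly these reasons:

// A store that only happens when the condition holds; making it
// unconditional could introduce a store another thread can observe.
void update_shared(int* shared, int a) {
    if (*shared < a)
        *shared = a;
}

// The branch guards a store through a potentially null pointer.
void update_through_pointer(int* p, int a) {
    if (p != nullptr)
        *p = a;
}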

The optimization is enabled for X64 as well as ARM64. (ARM64 conditional execution was discussed in a previous blog post.)

Example C++ source code:

int test(int a, int i) {
    int mem[4]{0};
    if (mem[i] < a) {
        mem[i] = a;
    }
    return mem[0];
}

Required compiler flags: /O2 (for the sake of simplifying the example output, /GS- was also passed)

X64 ASM:

17.4:

$LN6:
sub     rsp, 24
movsxd  rax, edx
xorps   xmm0, xmm0
movups  XMMWORD PTR mem$[rsp], xmm0
cmp     DWORD PTR mem$[rsp+rax*4], ecx
jge     SHORT $LN4@test
mov     DWORD PTR mem$[rsp+rax*4], ecx
$LN4@test:
mov     eax, DWORD PTR mem$[rsp]
add     rsp, 24
ret     0

17.7:

$LN5:
sub      rsp, 24
movsxd   rax, edx
xorps    xmm0, xmm0
movups   XMMWORD PTR mem$[rsp], xmm0
lea      rdx, QWORD PTR mem$[rsp]
cmp      DWORD PTR [rdx+rax*4], ecx
lea      rdx, QWORD PTR [rdx+rax*4]
cmovge   ecx, DWORD PTR [rdx]    ; cond
mov      DWORD PTR [rdx], ecx
mov      eax, DWORD PTR mem$[rsp]
add      rsp, 24
ret      0

ARM64 ASM:

17.4:

|$LN5|
sub         sp,sp,#0x10
movi        v16.16b,#0
mov         x9,sp
str         q16,[sp]
ldr         w8,[x9,w1 sxtw #2]
cmp         w8,w0
bge         |$LN2@test|
str         w0,[x9,w1 sxtw #2]
|$LN2@test|
ldr         w0,[sp]
add         sp,sp,#0x10
ret

17.7:

|$LN9|
sub         sp,sp,#0x10
movi        v16.16b,#0
mov         x9,sp
str         q16,[sp]
ldr         w8,[x9,w1 sxtw #2]
cmp         w8,w0
cselge      w8,w8,w0
str         w8,[x9,w1 sxtw #2]
ldr         w0,[sp]
add         sp,sp,#0x10
ret

 

Loop Optimizations for Array Assignments and Copies

The optimizer now recognizes more cases where an assignment or copy to contiguous memory can be raised to memset/memcpy intrinsics and lowered to more efficient instruction sequences. For example, consider the following array assignment in a count-down loop.

Example C++ Source Code:

char c[1024];
for (int n = 1023; n; n--)
    c[n] = 1;

Previously, the generated code was a loop of individual byte assignments. Now, more efficient wider assignments are performed via library calls such as memset or via wide vector stores.
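In this case the loop fills c[1] through c[1023] with the value 1, so its effect is equivalent to a single call along the lines of the sketch below (illustrative; whether a library call or inline wide stores are emitted depends on the target). The ARM64 17.7 listing below makes exactly this memset call, with a length of 0x3FF (1023).

#include <cstring>

// Equivalent effect to the count-down byte loop above (illustrative only).
void fill_like_loop(char (&c)[1024]) {
    std::memset(c + 1, 1, 1023);
}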

X64 ASM:

17.4:

mov         QWORD PTR __$ArrayPad$[rsp], rax
mov         eax, 1023
npad        2
$LL4@copy_while:
mov         BYTE PTR c$[rsp+rax], 1
sub         rax, 1
jne         SHORT $LL4@copy_while

17.7:

vmovaps  zmm0, ZMMWORD PTR __zmm@01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101
vmovq    rdx, xmm0
lea      rax, QWORD PTR c$[rsp+1]
mov      ecx, 7
npad     14
$LL11@copy_while:
vmovups  ZMMWORD PTR [rax], zmm0
vmovups  YMMWORD PTR [rax+64], ymm0
vmovups  XMMWORD PTR [rax+96], xmm0
lea      rax, QWORD PTR [rax+128]
vmovups  XMMWORD PTR [rax-16], xmm0
sub      rcx, 1
jne      SHORT $LL11@copy_while
vmovups  ZMMWORD PTR [rax], zmm0
vmovups  YMMWORD PTR [rax+64], ymm0
vmovups  XMMWORD PTR [rax+96], xmm0
mov      QWORD PTR [rax+112], rdx
mov      DWORD PTR [rax+120], edx
mov      WORD PTR [rax+124], dx
mov      BYTE PTR [rax+126], dl

ARM64 ASM:

17.4:

mov         x8,#0x3FF
mov         x9,sp
mov         w10,#1
|$LL4@copy_while|
strb        w10,[x8,x9]
sub         x8,x8,#1
cbnz        x8,|$LL4@copy_while|

17.7:

add         x0,sp,#1
mov         x2,#0x3FF
mov         w1,#1
bl          memset

Summary

We would like to thank everyone for giving us feedback, and we look forward to hearing more from you. Please share your thoughts and comments with us through Developer Community. You can also reach us on Twitter (@VisualC) or via email at visualcpp@microsoft.com.


10 comments


  • Avery Lee

    Interesting that the ARM64 optimization for vector abs relies on lack of signed integer overflow UB, which VC++ has historically been reluctant to take advantage of. I can't get the X64 version to generate vpabsd, but perhaps the heuristics have changed. It seems like the SSE2 output could be improved (psubd / pcmpgtd / pxor / psubd).

    In the vector remainder example, it's odd that the X64 code generator translates (x & ~b) to inversion via andnps and then andps instead of just one andnps instruction. Furthermore, the optimization seems to be incomplete because it only works if the count is...

  • Adam Folwarczny

    Vector remainder loop – have you considered just reusing the wide iteration for the remainder, by updating the pointer accordingly? Then the scalar version wouldn’t be executed (and a few bytes would be written twice)

  • Chris Green

    Will the last example of widening bytes in fill loops also work with functions like std::fill_n?

    • Troy Johnson · Microsoft employee · Author

      Yes, this applies to std::fill_n as well. You’ll see either a call to memset or inline code that uses wide stores.

  • David Lowndes

    In the first example, you have string_view passed as const reference, which I’ve understood to date as probably being less efficient than just passing by value.
    What are the results if you pass by value?

    • Jan Ringoš

      On Windows, with current miserable X64 ABI-mandated calling convention, it’s irrelevant.
      Anything larger than a register is copied to stack first and then passed by a pointer. Passing string_view by reference is likely faster, as it skips the copy step.
      It also means upgrading to many modern C++ features (optional, span, unique_ptr) is a pessimisation.

      • Chris Green

        Though with vectorcall it will pass small structs in registers

      • Chris Green

        Compilers are going to want to come up with a new calling convention for AMX too. With double the # of GPRs there are a lot of regs available for passing/returning structs

      • Jan Ringoš

        I wish the new one would be a complete overhaul, but knowing the industry and Microsoft, it’s going to be some bare minimal extension :’-(

      • Jan Ringoš · Edited

        Yeah, __vectorcall is something at least, but new modern calling convention is a long overdue.
        Way too many things (from modern C++ features to function using 5+ arguments) incur regular performance penalties because of the current ABI.

        IDK if anyone at Microsoft has investigated what it would entail, but I’ve been collecting random thoughts on such calling convention here on my github in case anyone’s interested.