
Conversation


antoyo commented Jul 27, 2023

This fixes a bug in GCC's garbage collector that occurs when compiling multiple times with -flto enabled.

antoyo merged commit d430b38 into master Jul 28, 2023
antoyo deleted the fix/ggc-bug-for-lto branch October 21, 2023
antoyo pushed a commit that referenced this pull request Nov 21, 2024
We can make use of the integrated rotate step of the XAR instruction to implement most vector integer rotates, as long as we zero out one of the input registers for it. This allows for a lower-latency sequence than the fallback SHL+USRA, especially when we can hoist the zeroing operation away from loops and hot parts. This should be safe to do for 64-bit vectors as well even though the XAR instructions operate on 128-bit values, as the bottom 64-bit result is later accessed through the right subregs.

This strategy is used whenever we have XAR instructions; the logic in aarch64_emit_opt_vec_rotate is adjusted to resort to expand_rotate_as_vec_perm only when it's expected to generate a single REV* instruction or when XAR instructions are not present.

With this patch we can generate for the input:

v4si
G1 (v4si r)
{
    return (r >> 23) | (r << 9);
}

v8qi
G2 (v8qi r)
{
    return (r << 3) | (r >> 5);
}

the assembly for +sve2:

G1:
        movi    v31.4s, 0
        xar     z0.s, z0.s, z31.s, #23
        ret

G2:
        movi    v31.4s, 0
        xar     z0.b, z0.b, z31.b, #5
        ret

instead of the current:

G1:
        shl     v31.4s, v0.4s, 9
        usra    v31.4s, v0.4s, 23
        mov     v0.16b, v31.16b
        ret

G2:
        shl     v31.8b, v0.8b, 3
        usra    v31.8b, v0.8b, 5
        mov     v0.8b, v31.8b
        ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

gcc/
	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
	generation of XAR sequences when possible.

gcc/testsuite/
	* gcc.target/aarch64/rotate_xar_1.c: New test.
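For context: XAR performs a bitwise XOR of its two source vectors and then rotates each element right by the immediate, so feeding it a zeroed second operand leaves a plain rotate. The scalar sketch below only illustrates that identity for one 32-bit lane; the helper names xar_model and g1_lane are made up for this example and are not part of the patch.

#include <stdint.h>

/* Scalar model of one 32-bit lane of XAR: rotate-right of (a ^ b) by imm. */
static uint32_t xar_model(uint32_t a, uint32_t b, unsigned imm)
{
    uint32_t x = a ^ b;
    return (x >> imm) | (x << ((32 - imm) & 31));
}

/* With the second operand zeroed, XAR degenerates to a plain rotate.
   This matches G1 above: (r >> 23) | (r << 9) is a rotate right by 23. */
uint32_t g1_lane(uint32_t r)
{
    return xar_model(r, 0, 23);
}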
