
Conversation

djzin
Contributor

@djzin djzin commented Mar 12, 2017

I would have thought that the mem::swap code didn't need an intermediate variable precisely because the pointers are guaranteed never to alias. And... it doesn't! It seems that llvm will also auto-vectorize this case for large structs, but alas it doesn't seem to have all the aliasing info it needs and so will add redundant checks (and will even not bother with autovectorizing for small types). Looks like a lot of performance could still be gained here, so this might be a good test case for future optimizer improvements.

Here are the current benchmarks for the simd version of mem::swap; the timings are in cycles (code below), measured with 10 iterations. For sizes > 32 that are not a multiple of 8, the old code tends to be ever so slightly faster, but not always. For large struct sizes (> 1024) the new code shows a marked improvement.

* = latest commit
† = subtracted from other measurements

| arr_length | noop† | rust_stdlib | simd_u64x4* | simd_u64x8 |
| --- | --- | --- | --- | --- |
| 8 | 80 | 90 | 90 | 90 |
| 16 | 72 | 177 | 177 | 177 |
| 24 | 32 | 76 | 76 | 76 |
| 32 | 68 | 188 | 112 | 188 |
| 40 | 32 | 80 | 60 | 80 |
| 48 | 32 | 84 | 56 | 84 |
| 56 | 32 | 108 | 72 | 108 |
| 64 | 32 | 108 | 72 | 76 |
| 72 | 80 | 350 | 220 | 230 |
| 80 | 80 | 350 | 220 | 230 |
| 88 | 80 | 420 | 270 | 270 |
| 96 | 80 | 420 | 270 | 270 |
| 104 | 80 | 500 | 320 | 320 |
| 112 | 80 | 490 | 320 | 320 |
| 120 | 72 | 528 | 342 | 342 |
| 128 | 48 | 360 | 234 | 234 |
| 136 | 72 | 987 | 387 | 387 |
| 144 | 80 | 1070 | 420 | 420 |
| 152 | 64 | 856 | 376 | 376 |
| 160 | 68 | 804 | 400 | 400 |
| 168 | 80 | 1060 | 520 | 520 |
| 176 | 80 | 1070 | 520 | 520 |
| 184 | 32 | 464 | 228 | 228 |
| 192 | 32 | 504 | 228 | 228 |
| 200 | 32 | 440 | 248 | 248 |
| 208 | 72 | 987 | 573 | 573 |
| 216 | 80 | 1464 | 220 | 220 |
| 224 | 48 | 852 | 450 | 450 |
| 232 | 72 | 1182 | 666 | 666 |
| 240 | 32 | 428 | 288 | 288 |
| 248 | 32 | 428 | 308 | 308 |
| 256 | 80 | 860 | 770 | 770 |
| 264 | 80 | 1130 | 820 | 820 |
| 272 | 80 | 1340 | 820 | 820 |
| 280 | 80 | 1220 | 870 | 870 |
| 288 | 72 | 1227 | 804 | 804 |
| 296 | 72 | 1356 | 849 | 849 |
@rust-highfive
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@djzin
Contributor Author

djzin commented Mar 12, 2017

Hm, got to be careful with the benchmarks. If llvm sucks (and it might), then what was previously optimized away in for loops, because the old code used the memmove intrinsics that llvm special-cases, might now not be optimized. The relevant code is here:

```rust
#[stable(feature = "rust1", since = "1.0.0")]
impl<A: Step> Iterator for ops::Range<A> where
    for<'a> &'a A: Add<&'a A, Output = A>
{
    type Item = A;

    #[inline]
    fn next(&mut self) -> Option<A> {
        if self.start < self.end {
            let mut n = self.start.add_one();
            // this line may not be optimized as well now for simple integer types
            // note this was the cause of unintentional recursion before ;)
            mem::swap(&mut n, &mut self.start);
            Some(n)
        } else {
            None
        }
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        match Step::steps_between_by_one(&self.start, &self.end) {
            Some(hint) => (hint, Some(hint)),
            None => (0, None)
        }
    }
}
```
@djzin
Contributor Author

djzin commented Mar 12, 2017

Testing confirms that llvm is unable to optimize loops as well when this implementation of `mem::swap` is used. For some reason, even a manual implementation of swap:

```rust
pub fn swap<T: Copy>(x: &mut T, y: &mut T) {
    let t = *x;
    *x = *y;
    *y = t;
}
```

performs more slowly than the llvm intrinsic version. Definitely something to investigate...

@djzin
Contributor Author

djzin commented Mar 12, 2017

OK, so the reason this simple swap function was slower seems to be that `copy_nonoverlapping` tells llvm that the pointers do not alias in cases where it can't work that out on its own. The reason the xor-swap was faster was simply down to cache locality, so I wrote a new function.
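For reference, a minimal sketch of the kind of `copy_nonoverlapping`-based swap being discussed here (roughly the shape of the pre-PR standard-library implementation; not copied from the actual source):

```rust
use std::{mem, ptr};

// Sketch: swap via a whole-`T` temporary and three non-overlapping copies.
// Because `copy_nonoverlapping` promises the regions don't alias, llvm gets
// aliasing information that it can't always derive from plain loads and stores.
pub fn swap<T>(x: &mut T, y: &mut T) {
    unsafe {
        let mut t: T = mem::uninitialized();
        ptr::copy_nonoverlapping(&*x, &mut t, 1);
        ptr::copy_nonoverlapping(&*y, x, 1);
        ptr::copy_nonoverlapping(&t, y, 1);
        mem::forget(t); // don't run `t`'s destructor; `y` now owns that value
    }
}
```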

@djzin
Contributor Author

djzin commented Mar 12, 2017

I am trying to get something close to this: for an array of 2048 chars clang spits out the following:

```cpp
#include <array>

void swap(std::array<char, 2048>& __restrict__ x,
          std::array<char, 2048>& __restrict__ y) {
    std::swap(x, y);
}
```

```sh
clang -std=c++11 swap.cpp -S -O3
```

```asm
# BB#0:
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movups  (%rdi,%rax), %xmm0
        movups  16(%rdi,%rax), %xmm1
        movups  (%rsi,%rax), %xmm2
        movups  16(%rsi,%rax), %xmm3
        movups  %xmm2, (%rdi,%rax)
        movups  %xmm3, 16(%rdi,%rax)
        movups  %xmm0, (%rsi,%rax)
        movups  %xmm1, 16(%rsi,%rax)
        movups  32(%rdi,%rax), %xmm0
        movups  48(%rdi,%rax), %xmm1
        movups  32(%rsi,%rax), %xmm2
        movups  48(%rsi,%rax), %xmm3
        movups  %xmm2, 32(%rdi,%rax)
        movups  %xmm3, 48(%rdi,%rax)
        movups  %xmm0, 32(%rsi,%rax)
        movups  %xmm1, 48(%rsi,%rax)
        addq    $64, %rax
        cmpq    $2048, %rax             # imm = 0x800
        jne     .LBB0_1
# BB#2:
        retq
```

This seems like a nice, efficient swap utilizing 4 registers. As I understand it, movups will be just as efficient as movaps if the pointers are in fact aligned.

In Rust, on the other hand, I get code like the following:

```rust
use std::{mem, ptr};

fn swap<T>(x: &mut T, y: &mut T) {
    unsafe {
        // Give ourselves some scratch space to work with
        let mut t: [u8; 16] = mem::uninitialized();
        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        let t = &mut t as *mut _ as *mut u8;

        // can't use a for loop as the `range` impl calls `mem::swap` recursively
        let len = mem::size_of::<T>() as isize;
        let mut i = 0;
        while i + 16 <= len {
            // Perform the swap 16 bytes at a time, `&mut` pointers never alias
            ptr::copy_nonoverlapping(x.offset(i), t, 16);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), 16);
            ptr::copy_nonoverlapping(t, y.offset(i), 16);
            i += 16;
        }
        if i < len {
            // Swap any remaining bytes
            let rem = (len - i) as usize;
            ptr::copy_nonoverlapping(x.offset(i), t, rem);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
            ptr::copy_nonoverlapping(t, y.offset(i), rem);
        }
    }
}

#[no_mangle]
pub fn swap_array(x: &mut [u8; 2048], y: &mut [u8; 2048]) {
    swap(x, y)
}
```

```sh
rustc swap.rs --crate-type=lib --emit=asm -C opt-level=3
```

```asm
swap_array:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:
        movups  (%rdi,%rax), %xmm0
        movaps  %xmm0, -24(%rsp)
        movups  (%rsi,%rax), %xmm0
        movups  %xmm0, (%rdi,%rax)
        movaps  -24(%rsp), %xmm0
        movups  %xmm0, (%rsi,%rax)
        movups  16(%rdi,%rax), %xmm0
        movaps  %xmm0, -24(%rsp)
        movups  16(%rsi,%rax), %xmm0
        movups  %xmm0, 16(%rdi,%rax)
        movaps  -24(%rsp), %xmm0
        movups  %xmm0, 16(%rsi,%rax)
        movups  32(%rdi,%rax), %xmm0
        movaps  %xmm0, -24(%rsp)
        movups  32(%rsi,%rax), %xmm0
        movups  %xmm0, 32(%rdi,%rax)
        movaps  -24(%rsp), %xmm0
        movups  %xmm0, 32(%rsi,%rax)
        movups  48(%rdi,%rax), %xmm0
        movaps  %xmm0, -24(%rsp)
        movups  48(%rsi,%rax), %xmm0
        movups  %xmm0, 48(%rdi,%rax)
        movaps  -24(%rsp), %xmm0
        movups  %xmm0, 48(%rsi,%rax)
        leaq    64(%rax), %rcx
        addq    $80, %rax
        cmpq    $2049, %rax
        movq    %rcx, %rax
        jl      .LBB0_1
        retq
```

It is for some reason not eliminating the stores to and loads from the stack (the variable `t` stays on the stack and is not eliminated). For reference, I got the same result with clang on an equivalent C program, so it's not Rust's fault. I also tried adding an extra `s` variable to mirror what the C++ code was doing, but to no avail.

@djzin
Contributor Author

djzin commented Mar 12, 2017

I can get it to work by forcing the alignment of `t` using `#[repr(simd)]`, but there wasn't any meaningful performance difference, no matter how I unrolled the loops or what size simd struct I used. So although this code looks much worse, in practice it performs pretty much optimally.

```rust
        }
        if i < len {
            // Swap any remaining bytes
            let rem = (len - i) as usize;
```
Contributor

I don't know if it ever matters, but it occurs to me that calculating `rem` as `len % 16` instead of `len - i` might be more obvious to an optimizer, as knowing what `len - i` is requires reasoning about a loop. Similarly, it would be better to test whether `rem == 0` than to test `i < len`, as it is more trivially determined at compile time.

I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.
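A minimal sketch of the suggested variant, assuming a 16-byte block (the `swap_tail` helper name is made up for illustration):

```rust
use std::ptr;

const BLOCK: usize = 16;

// Sketch of the reviewer's suggestion: derive the tail length from `len`
// alone instead of from the loop counter, so the optimizer doesn't have to
// reason about the loop to know how many bytes remain.
unsafe fn swap_tail(x: *mut u8, y: *mut u8, t: *mut u8, len: usize) {
    let rem = len % BLOCK;            // instead of (len - i)
    if rem != 0 {                     // instead of i < len
        let i = (len - rem) as isize; // start of the tail
        ptr::copy_nonoverlapping(x.offset(i), t, rem);
        ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
        ptr::copy_nonoverlapping(t, y.offset(i), rem);
    }
}
```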

Contributor Author

I tested it; it looks like the generated assembly is the same either way. `len` here is a constant known at compile time, so I guess it just gets completely eliminated.

```diff
 unsafe {
     // Give ourselves some scratch space to work with
-    let mut t: T = uninitialized();
+    let mut t: [u8; 16] = uninitialized();
```
Contributor

Minor point: instead of using a literal 16 here, it might be slightly better to stick a `const SWAP_BLOCK_SIZE: usize = 16;` at the top of the function. This would make it slightly less error-prone to tweak the value later, and it would make it easy to, e.g., pick different sizes for different architectures.
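A small sketch of what that might look like (the constant name is the reviewer's suggestion; the surrounding helper is hypothetical):

```rust
use std::ptr;

// Sketch: name the block size once so it can be tweaked in one place, or
// given per-architecture values with cfg, instead of repeating the literal 16.
const SWAP_BLOCK_SIZE: usize = 16;

unsafe fn swap_block(x: *mut u8, y: *mut u8) {
    let mut t = [0u8; SWAP_BLOCK_SIZE];
    let t = t.as_mut_ptr();
    ptr::copy_nonoverlapping(x, t, SWAP_BLOCK_SIZE);
    ptr::copy_nonoverlapping(y, x, SWAP_BLOCK_SIZE);
    ptr::copy_nonoverlapping(t, y, SWAP_BLOCK_SIZE);
}
```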

Contributor Author

This is a good suggestion, thanks.

Contributor Author

For reference, the actual block size didn't matter too much, but that was just on my machine; 16 may very well not be optimal.

@eddyb
Member

eddyb commented Mar 13, 2017

cc @rust-lang/compiler This seems to make it much harder to ever perform MIR optimizations here.
I'm torn, especially since we could've had those optimizations already, but we don't.

@arthurprs
Contributor

arthurprs commented Mar 14, 2017

This is extremely performance-sensitive stuff (used in HashMap, for example); thanks for working on it.

@alexcrichton added the T-libs-api label (Relevant to the library API team, which will review and decide on the PR/issue) on Mar 14, 2017
@brson
Contributor

brson commented Mar 14, 2017

Please add comments to the source code explaining why this code is so complex.

@arthurprs
Contributor

arthurprs commented Mar 14, 2017

Did you try a 32-byte buffer? That way the compiler could emit code that uses 2 xmm registers in parallel.

Edit: after giving this a deeper look, I think that for it to be really optimal we'll probably need multiple versions depending on the size of T.
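A rough sketch of what such size-based dispatch could look like (hypothetical; the 32-byte threshold and the two code paths are illustrative, not what this PR implements):

```rust
use std::{mem, ptr};

// Hypothetical dispatch on size_of::<T>(): small values go through a plain
// whole-value copy, larger ones through a block-wise byte swap.
pub fn swap<T>(x: &mut T, y: &mut T) {
    unsafe {
        if mem::size_of::<T>() < 32 {
            // Small T: one temporary of type T, three non-overlapping copies.
            let mut t: T = mem::uninitialized();
            ptr::copy_nonoverlapping(&*x, &mut t, 1);
            ptr::copy_nonoverlapping(&*y, x, 1);
            ptr::copy_nonoverlapping(&t, y, 1);
            mem::forget(t);
        } else {
            // Larger T: swap 32 bytes at a time through a scratch buffer.
            let x = x as *mut T as *mut u8;
            let y = y as *mut T as *mut u8;
            let mut t = [0u8; 32];
            let t = t.as_mut_ptr();
            let len = mem::size_of::<T>() as isize;
            let mut i = 0;
            while i + 32 <= len {
                ptr::copy_nonoverlapping(x.offset(i), t, 32);
                ptr::copy_nonoverlapping(y.offset(i), x.offset(i), 32);
                ptr::copy_nonoverlapping(t, y.offset(i), 32);
                i += 32;
            }
            if i < len {
                let rem = (len - i) as usize;
                ptr::copy_nonoverlapping(x.offset(i), t, rem);
                ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
                ptr::copy_nonoverlapping(t, y.offset(i), rem);
            }
        }
    }
}
```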

@djzin
Contributor Author

djzin commented Mar 14, 2017

OK, so I did some more testing and it seems the SIMD version is indeed faster, which makes sense to me. On a struct of size 2048 I get 630 cycles for the SIMD version and 990 cycles for the 16-byte version. It would be nice to find a way to get llvm to use the SIMD instructions without resorting to unstable features, though.

On my machine the u64x4 and u64x8 versions had exactly the same performance profile; the C++ version earlier used only 4 registers, which apparently is optimal. I had a u64x16 version, but it was slower.

Here is the SIMD version for reference:

```rust
fn swap_u64x4<T>(x: &mut T, y: &mut T) {
    unsafe {
        #[allow(non_camel_case_types)]
        #[repr(simd)]
        struct u64x4(u64, u64, u64, u64);

        let mut t: u64x4 = mem::uninitialized();
        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        let t = &mut t as *mut _ as *mut u8;

        let len = mem::size_of::<T>() as isize;
        let block_sz = mem::size_of::<u64x4>();
        let mut i = 0;
        while i + block_sz as isize <= len {
            ptr::copy_nonoverlapping(x.offset(i), t, block_sz);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), block_sz);
            ptr::copy_nonoverlapping(t, y.offset(i), block_sz);
            i += block_sz as isize;
        }
        if i < len {
            let rem = (len - i) as usize;
            ptr::copy_nonoverlapping(x.offset(i), t, rem);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
            ptr::copy_nonoverlapping(t, y.offset(i), rem);
        }
    }
}

#[no_mangle]
pub fn swap_array_u64x4(x: &mut [u8; 2048], y: &mut [u8; 2048]) {
    swap_u64x4(x, y)
}
```

with the generated assembly:

```asm
swap_array_u64x4:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:
        movups  (%rdi,%rax), %xmm0
        movups  16(%rdi,%rax), %xmm1
        movups  (%rsi,%rax), %xmm2
        movups  16(%rsi,%rax), %xmm3
        movups  %xmm3, 16(%rdi,%rax)
        movups  %xmm2, (%rdi,%rax)
        movups  %xmm1, 16(%rsi,%rax)
        movups  %xmm0, (%rsi,%rax)
        movups  32(%rdi,%rax), %xmm0
        movups  48(%rdi,%rax), %xmm1
        movups  32(%rsi,%rax), %xmm2
        movups  48(%rsi,%rax), %xmm3
        movups  %xmm3, 48(%rdi,%rax)
        movups  %xmm2, 32(%rdi,%rax)
        movups  %xmm1, 48(%rsi,%rax)
        movups  %xmm0, 32(%rsi,%rax)
        movups  64(%rdi,%rax), %xmm0
        movups  80(%rdi,%rax), %xmm1
        movups  64(%rsi,%rax), %xmm2
        movups  80(%rsi,%rax), %xmm3
        movups  %xmm3, 80(%rdi,%rax)
        movups  %xmm2, 64(%rdi,%rax)
        movups  %xmm1, 80(%rsi,%rax)
        movups  %xmm0, 64(%rsi,%rax)
        movups  96(%rdi,%rax), %xmm0
        movups  112(%rdi,%rax), %xmm1
        movups  96(%rsi,%rax), %xmm2
        movups  112(%rsi,%rax), %xmm3
        movups  %xmm3, 112(%rdi,%rax)
        movups  %xmm2, 96(%rdi,%rax)
        movups  %xmm1, 112(%rsi,%rax)
        movups  %xmm0, 96(%rsi,%rax)
        movq    %rax, %rcx
        subq    $-128, %rcx
        addq    $160, %rax
        cmpq    $2049, %rax
        movq    %rcx, %rax
        jl      .LBB0_1
        retq
```

The code I am using to time it...

```rust
unsafe fn time(swap: fn(&mut [u8; ARR_LENGTH], &mut [u8; ARR_LENGTH])) -> u64 {
    let mut x: [u8; ARR_LENGTH] = [42; ARR_LENGTH];
    let mut y: [u8; ARR_LENGTH] = [69; ARR_LENGTH];
    asm!("cpuid  cpuid  cpuid  cpuid  cpuid" : : : "rax", "rbx", "rcx", "rdx" : "volatile");
    let mut min_diff = !0;
    let mut max_diff = 0;
    let mut i = 0;
    while i < 80 {
        let before_high: u32;
        let before_low: u32;
        let after_high: u32;
        let after_low: u32;
        asm!("cpuid  rdtsc  movl %edx, $0  movl %eax, $1"
             : "=r" (before_high), "=r" (before_low)
             : : "rax", "rbx", "rcx", "rdx" : "volatile");
        swap(&mut x, &mut y);
        asm!("rdtscp  movl %edx, $0  movl %eax, $1  cpuid"
             : "=r" (after_high), "=r" (after_low)
             : : "rax", "rbx", "rcx", "rdx" : "volatile");
        asm!("" :: "r"(&x), "r"(&y));
        let diff = ((after_high as u64 >> 32) | after_low as u64)
            - ((before_high as u64 >> 32) | before_low as u64);
        if diff < min_diff { min_diff = diff; }
        if diff > max_diff { max_diff = diff; }
        i += 1;
    }
    return min_diff;
}

fn noop(_: &mut [u8; ARR_LENGTH], _: &mut [u8; ARR_LENGTH]) {}

fn main() {
    unsafe {
        let min_timing = time(noop);
        let min_execution_time_16 = time(swap_array_16);
        let min_execution_time_32 = time(swap_array_32);
        let min_execution_time_64 = time(swap_array_64);
        let min_execution_time_u64x4 = time(swap_array_u64x4);
        let min_execution_time_u64x8 = time(swap_array_u64x8);
        let min_execution_time_u64x16 = time(swap_array_u64x16);
        println!("arr length {}", ARR_LENGTH);
        println!("min timing {}", min_timing);
        println!("16_bytes {}", min_execution_time_16 - min_timing);
        println!("32_bytes {}", min_execution_time_32 - min_timing);
        println!("64_bytes {}", min_execution_time_64 - min_timing);
        println!("simd_u64x4 {}", min_execution_time_u64x4 - min_timing);
        println!("simd_u64x8 {}", min_execution_time_u64x8 - min_timing);
        println!("simd_u64x16 {}", min_execution_time_u64x16 - min_timing);
    }
}
```

gives

| arr_length | min_timing | 16_bytes | 32_bytes | 64_bytes | simd_u64x4 | simd_u64x8 | simd_u64x16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 64 | 80 | 10 | 10 | 10 | 10 | 10 | 10 |
| 256 | 80 | 110 | 90 | 90 | 50 | 50 | 60 |
| 517 | 80 | 240 | 230 | 220 | 150 | 160 | 170 |
| 834 | 80 | 380 | 370 | 360 | 240 | 240 | 280 |
| 2048 | 80 | 990 | 1000 | 980 | 630 | 630 | 660 |
| 65536 | 80 | 41730 | 41880 | 41600 | 33570 | 33640 | 33970 |
| 99999 | 80 | 58380 | 58310 | 57990 | 52260 | 51960 | 52460 |
@arthurprs
Contributor

arthurprs commented Mar 15, 2017

I've spent some time on this and there are some cases where the old code is better (T sizes 57, 71, 73, 77, for example).
The difference seems to lie in how the "tail copy" is done: the new code uses general-purpose registers while the old code still manages to do it with the xmm registers. I can't prove it, but it could affect real code negatively.

Thoughts?

@djzin
Contributor Author

djzin commented Mar 19, 2017

@arthurprs I see a similar effect; I don't see that the xmm registers are not being used, but for certain sizes the old code is better. I'm going to see if there's a way to make it faster in every case, but for now here's the full data (note: I timed these using 10 iterations each; counts are in cycles):

```
arr_length min_timing rust_stdlib simd_u64x4 simd_u64x8
1 80 90 90 90
2 72 84 84 84
3 68 144 144 144
4 48 54 54 54
5 72 156 156 156
6 80 160 160 160
7 68 180 180 180
8 80 90 90 90
9 32 72 72 72
10 68 144 144 144
11 32 80 80 80
12 80 160 160 160
13 72 516 516 519
14 32 224 224 224
15 72 516 516 516
16 72 177 177 177
17 72 195 195 195
18 80 210 210 210
19 32 80 80 80
20 32 80 80 80
21 80 560 560 560
22 80 560 560 560
23 80 560 560 560
24 32 76 76 76
25 80 600 600 600
26 80 600 600 600
27 72 555 555 555
28 72 555 555 555
29 80 600 600 600
30 64 480 480 480
31 80 600 600 600
32 68 188 112 188
33 32 84 52 84
34 72 195 120 195
35 68 232 232 232
36 80 240 130 230
37 72 564 249 564
38 68 520 232 520
39 32 244 136 244
40 32 80 60 80
41 80 640 270 640
42 72 591 258 591
43 48 432 228 432
44 80 640 270 640
45 80 640 590 640
46 80 640 600 640
47 40 940 325 940
48 32 84 56 84
49 32 108 112 108
50 72 249 258 249
51 48 204 210 204
52 72 249 258 249
53 72 528 555 528
54 32 228 240 228
55 72 528 555 528
56 32 108 72 108
57 68 520 532 520
58 80 610 620 610
59 32 244 248 244
60 48 384 390 384
61 80 610 620 610
62 72 564 573 564
63 68 520 532 520
64 32 108 72 76
65 32 136 88 92
66 72 324 204 213
67 32 168 148 172
68 80 340 220 230
69 80 620 370 420
70 72 573 342 396
71 72 573 408 462
72 80 350 220 230
73 80 640 370 430
74 80 640 380 430
75 32 256 176 200
76 80 640 380 420
77 32 256 280 300
78 68 548 600 640
79 72 591 648 693
80 80 350 220 230
81 80 420 370 430
82 80 420 380 430
83 68 420 376 428
84 32 168 148 172
85 72 528 657 693
86 80 570 710 750
87 80 650 700 770
88 80 420 270 270
89 32 256 288 308
90 64 520 576 616
91 32 256 288 308
92 80 640 720 770
93 32 256 288 308
94 32 256 288 308
95 72 591 666 711
96 80 420 270 270
97 72 453 297 462
98 32 200 128 200
99 60 427 351 427
100 68 420 272 428
101 32 244 188 320
102 80 610 470 800
103 80 610 540 800
104 80 500 320 320
105 32 256 192 328
106 68 548 400 700
107 72 591 498 756
108 64 512 376 656
109 68 548 676 700
110 80 2070 790 2020
111 32 256 316 328
112 80 490 320 320
113 80 570 480 570
114 32 228 192 228
115 64 672 432 520
116 32 228 192 228
117 72 537 729 738
118 72 537 738 738
119 32 232 316 320
120 72 528 342 342
121 32 236 328 328
122 32 236 328 328
123 68 504 700 700
124 32 236 328 328
125 32 276 376 376
126 32 236 328 328
127 32 236 328 328
128 48 360 234 234
129 72 987 387 387
130 80 1170 420 420
131 64 856 456 496
132 80 1070 420 420
133 32 428 228 252
134 80 1070 570 620
135 80 1070 640 700
136 72 987 387 387
137 64 856 456 496
138 72 1116 528 582
139 32 428 256 280
140 60 801 427 464
141 32 428 356 388
142 56 755 628 685
143 80 1070 890 970
144 80 1070 420 420
145 80 1010 580 630
146 72 906 537 582
147 80 990 650 700
148 68 892 488 532
149 68 856 760 812
150 64 792 712 760
151 24 363 312 339
152 64 856 376 376
153 60 734 690 727
154 32 396 368 388
155 72 906 849 897
156 80 980 920 970
157 80 990 920 970
158 80 1000 920 970
159 80 1080 920 970
160 68 804 400 400
161 32 428 208 280
162 72 1008 480 648
163 72 987 618 711
164 80 1070 520 700
165 80 1070 680 1010
166 80 1070 670 1010
167 80 1070 740 1010
168 80 1060 520 520
169 72 987 618 942
170 32 492 308 472
171 32 428 296 408
172 80 1350 680 1020
173 32 428 400 408
174 32 968 400 916
175 80 1070 1000 1020
176 80 1070 520 520
177 80 1240 670 770
178 72 897 627 711
179 80 970 740 850
180 80 970 670 770
181 80 970 1000 1000
182 80 1210 1000 1000
183 80 970 1000 1000
184 32 464 228 228
185 72 1119 942 942
186 40 1395 560 560
187 68 1044 872 872
188 72 897 942 942
189 32 388 408 408
190 72 897 942 942
191 72 897 942 942
192 32 504 228 228
193 60 877 464 464
194 72 2187 573 573
195 32 428 308 328
196 72 1080 573 573
197 80 2460 770 820
198 60 801 577 614
199 80 1130 840 900
200 32 440 248 248
201 32 428 308 328
202 68 916 668 704
203 72 990 777 831
204 32 428 308 328
205 80 1070 1100 1150
206 80 2400 1100 1150
207 68 916 940 984
208 72 987 573 573
209 80 970 770 820
210 68 832 660 700
211 72 897 777 831
212 32 388 312 332
213 32 388 440 460
214 40 530 600 630
215 40 560 625 650
216 80 1464 220 220
217 72 906 1035 1080
218 60 801 840 877
219 80 889 1027 1072
220 72 906 1035 1080
221 80 1260 1120 1170
222 72 897 1035 1080
223 32 388 448 468
224 48 852 450 450
225 48 714 480 600
226 72 996 666 831
227 68 916 744 832
228 68 916 616 772
229 32 428 348 484
230 68 916 744 1036
231 80 1070 940 1210
232 72 1182 666 666
233 68 916 744 1044
234 80 1070 880 1220
235 72 987 867 1128
236 32 428 348 488
237 32 428 476 488
238 80 1070 1190 1220
239 80 1070 1190 1220
240 32 428 288 288
241 64 776 696 776
242 72 999 813 897
243 60 1770 711 787
244 72 897 804 897
245 56 685 840 847
246 80 970 1190 1210
247 80 1010 1190 1200
248 32 428 308 308
249 80 970 1220 1220
250 80 970 1220 1220
251 72 897 1128 1128
252 80 970 1220 1220
253 32 388 488 488
254 80 970 1220 1220
255 80 889 1120 1120
256 80 860 770 770
257 40 645 470 470
258 72 1053 756 756
259 32 476 388 412
260 72 1053 756 756
261 80 1190 970 1030
262 32 456 388 412
263 72 1053 960 1017
264 80 1130 820 820
265 72 1140 906 951
266 32 456 388 408
267 80 1140 1050 1100
268 32 476 388 412
269 80 1450 1290 1350
270 72 1053 1191 1248
271 72 1053 1200 1248
272 80 1340 820 820
273 80 1140 970 1030
274 40 620 530 565
275 32 456 420 440
276 32 456 392 412
277 32 456 520 540
278 80 1140 1300 1350
279 80 1130 1300 1350
280 80 1220 870 870
281 80 1380 1320 1370
282 68 1180 1132 1172
283 48 864 834 864
284 72 1266 1218 1266
285 72 1275 1218 1266
286 80 3100 1320 1370
287 80 1380 1320 1370
288 72 1227 804 804
289 80 1500 920 1100
290 68 1292 788 940
291 80 1460 1070 1170
292 72 1386 849 1017
293 72 2004 996 1293
294 80 1460 1070 1410
295 32 584 456 560
296 72 1356 849 849
297 72 2031 987 1311
298 32 584 428 568
299 56 1031 805 1002
300 80 1460 1070 1420
```
@djzin
Contributor Author

djzin commented Mar 19, 2017

One thing to note is that sizes like 127 are probably not that common; for instance, if any of the elements of the struct is a u64, the size will be a multiple of 8 (a quick illustration follows the table below). Looking at just the multiples of 8, the new code is faster every time:

| arr_length | min_timing | rust_stdlib | simd_u64x4 | simd_u64x8 |
| --- | --- | --- | --- | --- |
| 8 | 80 | 90 | 90 | 90 |
| 16 | 72 | 177 | 177 | 177 |
| 24 | 32 | 76 | 76 | 76 |
| 32 | 68 | 188 | 112 | 188 |
| 40 | 32 | 80 | 60 | 80 |
| 48 | 32 | 84 | 56 | 84 |
| 56 | 32 | 108 | 72 | 108 |
| 64 | 32 | 108 | 72 | 76 |
| 72 | 80 | 350 | 220 | 230 |
| 80 | 80 | 350 | 220 | 230 |
| 88 | 80 | 420 | 270 | 270 |
| 96 | 80 | 420 | 270 | 270 |
| 104 | 80 | 500 | 320 | 320 |
| 112 | 80 | 490 | 320 | 320 |
| 120 | 72 | 528 | 342 | 342 |
| 128 | 48 | 360 | 234 | 234 |
| 136 | 72 | 987 | 387 | 387 |
| 144 | 80 | 1070 | 420 | 420 |
| 152 | 64 | 856 | 376 | 376 |
| 160 | 68 | 804 | 400 | 400 |
| 168 | 80 | 1060 | 520 | 520 |
| 176 | 80 | 1070 | 520 | 520 |
| 184 | 32 | 464 | 228 | 228 |
| 192 | 32 | 504 | 228 | 228 |
| 200 | 32 | 440 | 248 | 248 |
| 208 | 72 | 987 | 573 | 573 |
| 216 | 80 | 1464 | 220 | 220 |
| 224 | 48 | 852 | 450 | 450 |
| 232 | 72 | 1182 | 666 | 666 |
| 240 | 32 | 428 | 288 | 288 |
| 248 | 32 | 428 | 308 | 308 |
| 256 | 80 | 860 | 770 | 770 |
| 264 | 80 | 1130 | 820 | 820 |
| 272 | 80 | 1340 | 820 | 820 |
| 280 | 80 | 1220 | 870 | 870 |
| 288 | 72 | 1227 | 804 | 804 |
| 296 | 72 | 1356 | 849 | 849 |
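A quick illustration of the alignment point above, assuming a typical 64-bit target where `u64` has 8-byte alignment:

```rust
use std::mem::size_of;

// A struct containing a u64 gets (at least) 8-byte alignment, so padding
// rounds its size up to a multiple of 8 even though the fields sum to 9 bytes.
struct Foo {
    a: u64,
    b: u8,
}

fn main() {
    assert_eq!(size_of::<Foo>(), 16);
    assert_eq!(size_of::<Foo>() % 8, 0);
}
```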
@arthurprs
Contributor

arthurprs commented Mar 19, 2017

My concern is that using general-purpose registers inside very hot loops will throw off the optimizer. I'll try to come up with test cases or use the hashmap code.

@arthurprs
Contributor

OK, I'm probably over-thinking this.

Either way, it's important to check the results on x86 and on some ARM variant.

```diff
 /// because a double `free` is undefined behavior.
 ///
-/// An example is the definition of [`mem::swap`][swap] in this module:
+/// An example is the (old) definition of [`mem::swap`][swap] in this module:
```
Member

I would phrase this differently; maybe "An example is a possible implementation of mem::swap", or something like that.

Contributor

I agree we should not intimate the existence of a former impl.

@sfackler
Member

sfackler commented Apr 9, 2017

Is there more investigation that needs to be done or is this good to go?

@eddyb
Member

eddyb commented Apr 9, 2017

I believe my MIR optimization concerns can be alleviated by moving this optimization into rustc or LLVM later on, so this PR wouldn't necessarily have any lasting negative impact on that (to counter its wins).

@alexcrichton
Member

Nah, that looks legitimate; it seems vectorized ops leaked into the backend, which emscripten couldn't handle. Dunno if that's a bug with us or with emscripten, but a legitimate bug regardless!

@nagisa
Member

nagisa commented Jun 5, 2017

This is a bug in emscripten, the way I see it.

It is not likely that this will get fixed in emscripten any time soon, so the next best fix is a cfg'd version for emscripten with the old code and a FIXME (see the sketch below). We should still report this against emscripten.
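A rough sketch of the kind of cfg split being suggested (the gate, the function name, and both bodies here are illustrative, not necessarily what the PR ended up doing):

```rust
use std::ptr;

// FIXME: workaround for the emscripten backend choking on the vectorized
// path; remove once the upstream emscripten issue is fixed.
#[cfg(target_os = "emscripten")]
unsafe fn swap_nonoverlapping_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // Old, simple path: swap one byte at a time.
    let mut i = 0isize;
    while i < len as isize {
        let b = ptr::read(x.offset(i));
        ptr::write(x.offset(i), ptr::read(y.offset(i)));
        ptr::write(y.offset(i), b);
        i += 1;
    }
}

#[cfg(not(target_os = "emscripten"))]
unsafe fn swap_nonoverlapping_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // New path: block-wise swap through a wide scratch buffer, as in this PR.
    let mut t = [0u8; 32];
    let t = t.as_mut_ptr();
    let mut i = 0isize;
    while i + 32 <= len as isize {
        ptr::copy_nonoverlapping(x.offset(i), t, 32);
        ptr::copy_nonoverlapping(y.offset(i), x.offset(i), 32);
        ptr::copy_nonoverlapping(t, y.offset(i), 32);
        i += 32;
    }
    // Tail: swap any remaining bytes one at a time.
    while i < len as isize {
        let b = ptr::read(x.offset(i));
        ptr::write(x.offset(i), ptr::read(y.offset(i)));
        ptr::write(y.offset(i), b);
        i += 1;
    }
}
```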

@brson
Contributor

brson commented Jun 5, 2017

Thanks for persevering @djzin. Cool patch.

@brson added the relnotes label (Marks issues that should be documented in the release notes of the next release) on Jun 5, 2017
@arielb1 added the S-waiting-on-author label (Status: This is awaiting some action, such as code changes or more information, from the author) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Jun 6, 2017
@djzin
Contributor Author

djzin commented Jun 9, 2017

@sfackler should be fixed now...

@Mark-Simulacrum
Member

@bors r=sfackler

@bors
Collaborator

bors commented Jun 11, 2017

📌 Commit 83f1f11 has been approved by sfackler

@bors
Collaborator

bors commented Jun 11, 2017

⌛ Testing commit 83f1f11 with merge 27650ee...

bors added a commit that referenced this pull request Jun 11, 2017
speed up mem::swap
@Mark-Simulacrum added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests; bors will change the label on completion) and removed the S-waiting-on-author label on Jun 11, 2017
@bors
Collaborator

bors commented Jun 11, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: sfackler
Pushing 27650ee to master...

@bors bors merged commit 83f1f11 into rust-lang:master Jun 11, 2017
bors added a commit that referenced this pull request Jun 28, 2017
Reuse the mem::swap optimizations to speed up slice::rotate

This is most helpful for compound types where LLVM didn't vectorize the loop. Highlight: bench slice::rotate_medium_by727_strings gets 38% faster. Exposes the swapping logic from PR #40454 as `pub unsafe fn ptr::swap_nonoverlapping` under library feature `swap_nonoverlapping` #42818. (The new method seemed plausible, and was the simplest way to share the logic. I'm not attached to it, though, so let me know if a different way would be better.)
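For reference, a small usage sketch of the exposed `ptr::swap_nonoverlapping` (shown here as ordinary code for simplicity; at the time of the follow-up PR it sat behind the `swap_nonoverlapping` feature gate):

```rust
use std::ptr;

fn main() {
    let mut a = [1u32, 2, 3, 4];
    let mut b = [9u32, 8, 7, 6];
    // Safety: `a` and `b` are distinct locals, so the two regions of
    // four u32s each cannot overlap.
    unsafe {
        ptr::swap_nonoverlapping(a.as_mut_ptr(), b.as_mut_ptr(), 4);
    }
    assert_eq!(a, [9, 8, 7, 6]);
    assert_eq!(b, [1, 2, 3, 4]);
}
```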
cuviper added a commit to cuviper/rust that referenced this pull request Jul 11, 2017
This is a workaround for rust-lang#42778, which was git-bisected to rust-lang#40454's optimizations to `mem::swap`, later moved to `ptr` in rust-lang#42819. Natively compiled rustc couldn't even compile stage1 libcore on powerpc64 and s390x, but they work fine without this `repr(simd)`. Since powerpc64le works OK, it seems probably related to being big-endian. The underlying problem is not yet known, but this at least makes those architectures functional again in the meantime. cc @arielb1
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Jul 15, 2017
Disable big-endian simd in swap_nonoverlapping_bytes

This is a workaround for rust-lang#42778, which was git-bisected to rust-lang#40454's optimizations to `mem::swap`, later moved to `ptr` in rust-lang#42819. Natively compiled rustc couldn't even compile stage1 libcore on powerpc64 and s390x, but they work fine without this `repr(simd)`. Since powerpc64le works OK, it seems probably related to being big-endian. The underlying problem is not yet known, but this at least makes those architectures functional again in the meantime. cc @arielb1
