@wdvxdr1123
Contributor

goos: windows
goarch: amd64
pkg: nhooyr.io/websocket
cpu: Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
Benchmark_mask/2/basic-8 425339004 2.795 ns/op 715.66 MB/s
Benchmark_mask/2/nhooyr-8 379937766 3.186 ns/op 627.78 MB/s
Benchmark_mask/2/gorilla-8 392164167 3.071 ns/op 651.24 MB/s
Benchmark_mask/2/gobwas-8 310037222 3.880 ns/op 515.46 MB/s
Benchmark_mask/3/basic-8 321408024 3.806 ns/op 788.32 MB/s
Benchmark_mask/3/nhooyr-8 350726338 3.478 ns/op 862.58 MB/s
Benchmark_mask/3/gorilla-8 332217727 3.634 ns/op 825.43 MB/s
Benchmark_mask/3/gobwas-8 247376214 4.886 ns/op 614.01 MB/s
Benchmark_mask/4/basic-8 261182472 4.582 ns/op 872.91 MB/s
Benchmark_mask/4/nhooyr-8 381830712 3.262 ns/op 1226.05 MB/s
Benchmark_mask/4/gorilla-8 272616304 4.395 ns/op 910.04 MB/s
Benchmark_mask/4/gobwas-8 204574558 5.855 ns/op 683.19 MB/s
Benchmark_mask/8/basic-8 191330037 6.162 ns/op 1298.24 MB/s
Benchmark_mask/8/nhooyr-8 369694992 3.285 ns/op 2435.65 MB/s
Benchmark_mask/8/gorilla-8 175388466 6.743 ns/op 1186.48 MB/s
Benchmark_mask/8/gobwas-8 241719933 4.886 ns/op 1637.45 MB/s
Benchmark_mask/16/basic-8 100000000 10.92 ns/op 1464.83 MB/s
Benchmark_mask/16/nhooyr-8 272565096 4.436 ns/op 3606.98 MB/s
Benchmark_mask/16/gorilla-8 100000000 11.20 ns/op 1428.53 MB/s
Benchmark_mask/16/gobwas-8 221356798 5.405 ns/op 2960.45 MB/s
Benchmark_mask/32/basic-8 61476984 20.40 ns/op 1568.80 MB/s
Benchmark_mask/32/nhooyr-8 238665572 5.050 ns/op 6337.22 MB/s
Benchmark_mask/32/gorilla-8 100000000 12.09 ns/op 2647.28 MB/s
Benchmark_mask/32/gobwas-8 186077235 6.477 ns/op 4940.36 MB/s
Benchmark_mask/128/basic-8 14629720 80.90 ns/op 1582.19 MB/s
Benchmark_mask/128/nhooyr-8 181241968 6.565 ns/op 19497.98 MB/s
Benchmark_mask/128/gorilla-8 68308342 16.76 ns/op 7639.37 MB/s
Benchmark_mask/128/gobwas-8 94582026 12.97 ns/op 9872.11 MB/s
Benchmark_mask/512/basic-8 3921001 305.6 ns/op 1675.55 MB/s
Benchmark_mask/512/nhooyr-8 123102199 9.721 ns/op 52669.11 MB/s
Benchmark_mask/512/gorilla-8 32355914 38.18 ns/op 13411.43 MB/s
Benchmark_mask/512/gobwas-8 31528501 37.80 ns/op 13544.37 MB/s
Benchmark_mask/4096/basic-8 491804 2381 ns/op 1720.39 MB/s
Benchmark_mask/4096/nhooyr-8 26159691 46.98 ns/op 87187.73 MB/s
Benchmark_mask/4096/gorilla-8 4898440 243.6 ns/op 16817.89 MB/s
Benchmark_mask/4096/gobwas-8 4336398 277.2 ns/op 14776.40 MB/s
Benchmark_mask/16384/basic-8 113842 9623 ns/op 1702.66 MB/s
Benchmark_mask/16384/nhooyr-8 8088847 154.5 ns/op 106058.18 MB/s
Benchmark_mask/16384/gorilla-8 1282993 933.6 ns/op 17549.90 MB/s
Benchmark_mask/16384/gobwas-8 997347 1086 ns/op 15093.49 MB/s
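For context on what these rows measure: WebSocket masking XORs every payload byte with a 4-byte key (RFC 6455). The sketch below is not the code from this PR or from any of the benchmarked libraries. It only contrasts a byte-at-a-time loop (presumably close to what the `basic` rows do, which is an assumption) with a pure-Go loop that handles 8 bytes per iteration. The names `maskBasic` and `maskWords` are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// maskBasic XORs b in place with the 4-byte key, one byte at a time,
// starting at key offset pos (0..3). It returns the key offset to use
// for the next chunk of the same frame.
func maskBasic(key [4]byte, pos int, b []byte) int {
	for i := range b {
		b[i] ^= key[pos&3]
		pos++
	}
	return pos & 3
}

// maskWords produces the same result as maskBasic but XORs 8 bytes per
// iteration. Hypothetical helper, not taken from any library.
func maskWords(key [4]byte, pos int, b []byte) int {
	next := (pos + len(b)) & 3

	// Rotate the key so that its lowest byte is key[pos&3].
	k32 := bits.RotateLeft32(binary.LittleEndian.Uint32(key[:]), -8*(pos&3))
	k64 := uint64(k32)<<32 | uint64(k32)

	for len(b) >= 8 {
		binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(b)^k64)
		b = b[8:]
	}

	// Fewer than 8 bytes remain. The key alignment is unchanged because
	// 8 is a multiple of the 4-byte key length.
	var rot [4]byte
	binary.LittleEndian.PutUint32(rot[:], k32)
	for i := range b {
		b[i] ^= rot[i&3]
	}
	return next
}

func main() {
	key := [4]byte{0xde, 0xad, 0xbe, 0xef}
	a := []byte("hello, websocket masking")
	b := append([]byte(nil), a...)
	maskBasic(key, 1, a)
	maskWords(key, 1, b)
	fmt.Println("outputs match:", string(a) == string(b))
}
```

The SIMD versions discussed in this PR follow the same idea with wider registers; the sketch only shows why per-byte cost drops as more bytes are handled per iteration.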

@wdvxdr1123 wdvxdr1123 requested a review from nhooyr as a code owner January 24, 2022 11:26
@wdvxdr1123 wdvxdr1123 changed the title use simd mask for amd64&arm64 use simd masking for amd64&arm64 Jan 24, 2022
@nhooyr nhooyr changed the base branch from master to dev October 13, 2023 09:12
@nhooyr nhooyr added this to the v1.9.0 milestone Oct 13, 2023
@nhooyr nhooyr force-pushed the dev branch 8 times, most recently from e6fb843 to 0caa997 Compare October 19, 2023 11:01
@nhooyr
Contributor

nhooyr commented Oct 19, 2023

I finally got around to reviewing this. I'm not very familiar with writing assembly of any kind. Why use AVX2 instead of AVX-512?

@nhooyr
Contributor

nhooyr commented Oct 19, 2023

Also don't worry about the merge conflicts, I'll fix them myself.

@nhooyr
Contributor

nhooyr commented Oct 19, 2023

Benchmark_mask/2/basic-12 631384161 1.883 ns/op 1061.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/2/nhooyr-12 591894866 2.061 ns/op 970.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/2/gorilla-12 657205106 1.923 ns/op 1040.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/2/gobwas-12 496567813 2.496 ns/op 801.34 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/basic-12 592897168 1.992 ns/op 1506.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/nhooyr-12 507159836 2.197 ns/op 1365.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/gorilla-12 553840022 2.304 ns/op 1302.28 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/gobwas-12 397366413 2.800 ns/op 1071.31 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/basic-12 634193241 1.807 ns/op 2213.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/nhooyr-12 569515338 2.002 ns/op 1998.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/gorilla-12 451382727 2.599 ns/op 1538.81 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/gobwas-12 356507592 3.312 ns/op 1207.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/basic-12 405458120 2.981 ns/op 2683.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/nhooyr-12 586096395 2.124 ns/op 3765.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/gorilla-12 296482132 4.003 ns/op 1998.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/gobwas-12 358996738 3.317 ns/op 2411.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/basic-12 199646600 5.828 ns/op 2745.57 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/nhooyr-12 482739769 2.494 ns/op 6416.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/gorilla-12 166567765 7.225 ns/op 2214.41 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/gobwas-12 297547316 3.989 ns/op 4011.07 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/basic-12 66204484 18.72 ns/op 1709.47 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/nhooyr-12 444971588 2.557 ns/op 12516.90 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/gorilla-12 153725197 7.672 ns/op 4171.01 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/gobwas-12 221328512 5.407 ns/op 5918.17 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/basic-12 21106347 58.03 ns/op 2205.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/nhooyr-12 329196819 3.777 ns/op 33893.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/gorilla-12 100000000 11.08 ns/op 11552.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/gobwas-12 82296996 14.98 ns/op 8546.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/basic-12 5925668 208.8 ns/op 2451.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/nhooyr-12 11774136 101.9 ns/op 5023.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/gorilla-12 43038144 26.93 ns/op 19014.42 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/gobwas-12 23169214 55.74 ns/op 9184.92 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/basic-12 795450 1445 ns/op 2835.39 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/nhooyr-12 9641613 124.3 ns/op 32940.03 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/gorilla-12 8906532 139.6 ns/op 29346.43 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/gobwas-12 2789071 424.5 ns/op 9648.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/basic-12 219685 5795 ns/op 2827.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/nhooyr-12 6135582 196.3 ns/op 83454.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/gorilla-12 2377486 516.0 ns/op 31752.39 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/gobwas-12 723357 1557 ns/op 10523.07 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket/internal/thirdparty 58.195s

For some reason the nhooyr implementation slows down at the 512-byte benchmark. Not sure what's going on there.

@nhooyr
Contributor

nhooyr commented Oct 19, 2023

More clearly:

Benchmark_mask/2/nhooyr-12 590403414 2.028 ns/op 986.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/nhooyr-12 584087539 2.063 ns/op 1453.96 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/nhooyr-12 655971961 1.839 ns/op 2175.33 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/nhooyr-12 642215430 1.905 ns/op 4199.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/nhooyr-12 485812323 2.301 ns/op 6954.78 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/nhooyr-12 501743362 2.351 ns/op 13608.66 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/nhooyr-12 334930033 3.648 ns/op 35090.20 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/nhooyr-12 51036463 99.33 ns/op 5154.74 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/nhooyr-12 11011562 121.7 ns/op 33663.04 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/nhooyr-12 6010369 197.6 ns/op 82904.02 MB/s 0 B/op 0 allocs/op

Super weird.
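The MB/s column is derived from the payload size and the ns/op figure (this is what `go test -bench` reports once a benchmark calls `b.SetBytes`), so the anomaly can be checked by hand. A small sketch of that arithmetic, using numbers from the run above; the helper name `mbPerSec` is made up for illustration:

```go
package main

import "fmt"

// mbPerSec converts a payload size in bytes and a time per operation in
// nanoseconds into megabytes per second, with 1 MB = 1e6 bytes.
func mbPerSec(sizeBytes int, nsPerOp float64) float64 {
	// bytes per nanosecond times 1e9 ns/s, divided by 1e6 bytes/MB.
	return float64(sizeBytes) / nsPerOp * 1e3
}

func main() {
	fmt.Printf("128 B:  %.0f MB/s\n", mbPerSec(128, 3.648))
	fmt.Printf("512 B:  %.0f MB/s\n", mbPerSec(512, 99.33))
	fmt.Printf("4096 B: %.0f MB/s\n", mbPerSec(4096, 121.7))
}
```

The results should land within rounding of the reported column. Note that 4096 bytes take only about 1.2x as long as 512 bytes here (121.7 vs 99.33 ns/op), which suggests a roughly fixed overhead on the larger-size path rather than a per-byte cost.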

@nhooyr
Contributor

nhooyr commented Oct 19, 2023

Disabling AVX2 seems to have fixed it.

Benchmark_mask/2/nhooyr-12 542097008 2.197 ns/op 910.42 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/nhooyr-12 537046092 2.258 ns/op 1328.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/nhooyr-12 516057957 1.957 ns/op 2044.01 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/nhooyr-12 566813392 2.027 ns/op 3946.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/nhooyr-12 456252357 2.465 ns/op 6491.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/nhooyr-12 477971746 2.697 ns/op 11862.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/nhooyr-12 323935191 3.760 ns/op 34040.58 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/nhooyr-12 131543775 8.955 ns/op 57174.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/nhooyr-12 23514272 46.50 ns/op 88092.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/nhooyr-12 6336271 181.9 ns/op 90069.97 MB/s 0 B/op 0 allocs/op
@nhooyr nhooyr force-pushed the patch-simd-mask branch 6 times, most recently from 1e8bf28 to 32d0aa1 Compare October 19, 2023 23:40
@nhooyr
Contributor

nhooyr commented Oct 20, 2023

The amd64 code looks good to me so far, but the arm64 code doesn't seem to produce any speedup, at least under QEMU.

goos: linux
goarch: amd64
pkg: nhooyr.io/websocket
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkFlateWriter-12 3722 326920 ns/op 1200024 B/op 16 allocs/op
BenchmarkFlateReader-12 169479 6926 ns/op 41047 B/op 6 allocs/op
BenchmarkConn/disabledCompress-12 84481 12720 ns/op 40.25 MB/s 518.0 read/op 520.0 written/op 1 B/op 0 allocs/op
BenchmarkConn/compressContextTakeover-12 32448 33822 ns/op 15.14 MB/s 24.00 read/op 36.00 written/op 42 B/op 0 allocs/op
BenchmarkConn/compressNoContext-12 38430 29966 ns/op 17.09 MB/s 41.00 read/op 29.00 written/op 96 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket 6.819s
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/internal/thirdparty
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
Benchmark_mask/amd64/basic/8-12 425723130 2.780 ns/op 2877.27 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/16-12 224227551 5.293 ns/op 3022.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/32-12 100000000 10.19 ns/op 3139.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/128-12 24135116 46.41 ns/op 2757.74 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/256-12 12339093 85.20 ns/op 3004.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/512-12 7325516 163.8 ns/op 3125.51 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/1024-12 3657289 320.5 ns/op 3194.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/2048-12 1887517 638.8 ns/op 3206.18 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/4096-12 934762 1264 ns/op 3241.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/8192-12 395722 2598 ns/op 3153.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/16384-12 236943 5162 ns/op 3173.86 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8-12 505864449 2.316 ns/op 3454.92 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16-12 500031924 2.375 ns/op 6737.54 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/32-12 451944298 2.574 ns/op 12429.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/128-12 306800580 3.938 ns/op 32506.67 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/256-12 197035516 6.612 ns/op 38717.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/512-12 114783908 10.59 ns/op 48332.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/1024-12 59498761 19.20 ns/op 53328.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/2048-12 31537369 36.59 ns/op 55970.07 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/4096-12 15516426 77.49 ns/op 52861.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8192-12 8057901 150.7 ns/op 54358.50 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16384-12 4023576 294.3 ns/op 55666.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8-12 498550161 2.298 ns/op 3481.43 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16-12 508013607 2.505 ns/op 6387.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/32-12 475446944 2.687 ns/op 11909.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/128-12 347085175 3.462 ns/op 36969.76 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/256-12 239742297 5.094 ns/op 50253.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/512-12 132367032 9.429 ns/op 54300.89 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/1024-12 59876775 17.24 ns/op 59387.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/2048-12 43464296 28.10 ns/op 72877.63 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/4096-12 25988770 51.22 ns/op 79973.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8192-12 11870416 97.20 ns/op 84279.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16384-12 6374655 196.1 ns/op 83555.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/8-12 307082148 4.199 ns/op 1905.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/16-12 166534495 7.258 ns/op 2204.54 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/32-12 157286900 7.638 ns/op 4189.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/128-12 121178448 10.14 ns/op 12620.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/256-12 88366356 13.62 ns/op 18791.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/512-12 40303383 26.69 ns/op 19181.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/1024-12 28564507 41.38 ns/op 24744.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/2048-12 14325160 72.32 ns/op 28317.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/4096-12 8834644 130.5 ns/op 31378.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/8192-12 4661844 249.3 ns/op 32856.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/16384-12 2452156 491.8 ns/op 33317.08 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/8-12 372520472 3.229 ns/op 2477.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/16-12 303515722 3.914 ns/op 4088.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/32-12 215681712 5.353 ns/op 5977.97 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/128-12 82971432 15.39 ns/op 8319.67 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/256-12 43254800 30.40 ns/op 8420.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/512-12 20618145 58.86 ns/op 8698.44 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/1024-12 11872770 108.3 ns/op 9453.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/2048-12 6433407 207.7 ns/op 9860.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/4096-12 3156878 403.0 ns/op 10162.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/8192-12 1622864 745.8 ns/op 10984.28 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/16384-12 820447 1490 ns/op 10997.96 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/8-12 585874134 2.147 ns/op 3726.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/16-12 475160053 2.394 ns/op 6684.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/32-12 356494118 3.316 ns/op 9650.56 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/128-12 269125159 4.106 ns/op 31177.06 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/256-12 150355809 7.474 ns/op 34249.82 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/512-12 72345751 14.25 ns/op 35929.65 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/1024-12 41781184 24.17 ns/op 42371.22 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/2048-12 26343178 45.28 ns/op 45225.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/4096-12 13897591 94.29 ns/op 43440.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/8192-12 6824702 185.3 ns/op 44204.41 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/16384-12 3472126 368.5 ns/op 44459.74 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket/internal/thirdparty 102.835s
goos: linux
goarch: arm64
pkg: nhooyr.io/websocket/internal/thirdparty
cpu: 12th Gen Intel(R) Core(TM) i5-1235U @ 1364.583MHz
Benchmark_mask/arm64/basic/8-12 47771958 26.59 ns/op 300.86 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16-12 24547660 52.69 ns/op 303.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/32-12 12533614 92.10 ns/op 347.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/128-12 3555813 346.9 ns/op 368.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/256-12 1811830 673.4 ns/op 380.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/512-12 938022 1335 ns/op 383.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/1024-12 484177 2479 ns/op 413.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/2048-12 211894 5014 ns/op 408.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/4096-12 112736 10130 ns/op 404.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/8192-12 61010 21183 ns/op 386.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16384-12 31218 39141 ns/op 418.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8-12 39843982 28.80 ns/op 277.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16-12 34930447 29.61 ns/op 540.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/32-12 32931360 33.07 ns/op 967.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/128-12 32877277 42.30 ns/op 3025.92 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/256-12 21600469 60.31 ns/op 4244.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/512-12 14673056 94.28 ns/op 5430.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/1024-12 8250734 163.7 ns/op 6256.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/2048-12 3977023 301.1 ns/op 6802.66 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/4096-12 2260831 578.0 ns/op 7086.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8192-12 1121847 1079 ns/op 7594.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16384-12 508933 2095 ns/op 7819.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8-12 34301584 36.89 ns/op 216.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16-12 33929019 37.52 ns/op 426.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/32-12 31671778 41.70 ns/op 767.31 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/128-12 25115096 53.61 ns/op 2387.78 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/256-12 17948512 63.43 ns/op 4036.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/512-12 12472801 104.4 ns/op 4902.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/1024-12 7425166 161.7 ns/op 6334.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/2048-12 3981708 292.6 ns/op 6998.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/4096-12 2086530 563.9 ns/op 7264.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8192-12 1070166 1114 ns/op 7355.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16384-12 504093 2159 ns/op 7588.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8-12 27462318 46.20 ns/op 173.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16-12 23176634 49.10 ns/op 325.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/32-12 22810416 58.54 ns/op 546.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/128-12 12784365 87.69 ns/op 1459.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/256-12 8819766 142.4 ns/op 1797.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/512-12 5834811 225.9 ns/op 2266.71 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/1024-12 3309975 369.7 ns/op 2769.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/2048-12 1758891 763.6 ns/op 2682.02 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/4096-12 742028 1404 ns/op 2917.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8192-12 489636 2739 ns/op 2990.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16384-12 236709 5086 ns/op 3221.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8-12 31763971 34.14 ns/op 234.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16-12 28280493 41.83 ns/op 382.47 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/32-12 23041581 52.92 ns/op 604.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/128-12 10903680 115.3 ns/op 1110.58 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/256-12 6139404 202.1 ns/op 1266.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/512-12 3639919 339.2 ns/op 1509.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/1024-12 1897648 680.3 ns/op 1505.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/2048-12 958771 1223 ns/op 1674.76 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/4096-12 520082 2581 ns/op 1586.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8192-12 243410 4994 ns/op 1640.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16384-12 129097 9468 ns/op 1730.40 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8-12 41615394 27.52 ns/op 290.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16-12 38795175 31.95 ns/op 500.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/32-12 35392299 36.75 ns/op 870.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/128-12 31278990 39.73 ns/op 3221.36 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/256-12 20779035 59.31 ns/op 4316.11 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/512-12 12213514 99.53 ns/op 5144.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/1024-12 7523419 161.6 ns/op 6335.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/2048-12 3721555 330.7 ns/op 6192.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/4096-12 1884742 612.8 ns/op 6683.56 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8192-12 1000591 1199 ns/op 6834.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16384-12 512989 2263 ns/op 7238.41 MB/s 0 B/op 0 allocs/op
PASS

In fact, the arm64 assembly is slower than the pure Go version at most sizes here. Not sure what's going on.

@nhooyr
Contributor

nhooyr commented Oct 20, 2023

Will test on a proper VM too.

@nhooyr nhooyr force-pushed the patch-simd-mask branch 2 times, most recently from 7d0c6f4 to 9f298ec Compare October 20, 2023 14:29
nhooyr added 12 commits October 25, 2023 17:41
json.Encoder is 42% faster than json.Marshal thanks to the memory reuse.
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/wsjson
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkJSON/json.Encoder-12 3517579 340.2 ns/op 24 B/op 1 allocs/op
BenchmarkJSON/json.Marshal-12 2374086 484.3 ns/op 728 B/op 2 allocs/op
Closes coder#409
[qrvnl@dios ~/src/websocket] 130$ go test -bench=. ./wsjson/
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/wsjson
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkJSON/json.Encoder/8-12 14041426 72.59 ns/op 110.21 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/16-12 13936426 86.99 ns/op 183.92 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/32-12 11416401 115.3 ns/op 277.59 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/128-12 4600574 264.7 ns/op 483.55 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/256-12 2710398 433.9 ns/op 590.06 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/512-12 1588930 717.3 ns/op 713.82 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/1024-12 823138 1484 ns/op 689.80 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/2048-12 402823 2875 ns/op 712.32 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/4096-12 213926 5602 ns/op 731.14 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/8192-12 92864 11281 ns/op 726.19 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/16384-12 39318 29203 ns/op 561.04 MB/s 19 B/op 1 allocs/op
BenchmarkJSON/json.Marshal/8-12 10768671 114.5 ns/op 69.89 MB/s 48 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/16-12 10140996 113.9 ns/op 140.51 MB/s 64 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/32-12 9211780 121.6 ns/op 263.06 MB/s 64 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/128-12 4632796 264.2 ns/op 484.53 MB/s 224 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/256-12 2441511 473.5 ns/op 540.65 MB/s 432 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/512-12 1298788 896.2 ns/op 571.27 MB/s 912 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/1024-12 602084 1866 ns/op 548.83 MB/s 1808 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/2048-12 341151 3817 ns/op 536.61 MB/s 3474 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/4096-12 175594 7034 ns/op 582.32 MB/s 6548 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/8192-12 83222 15023 ns/op 545.30 MB/s 13591 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/16384-12 33087 39348 ns/op 416.39 MB/s 27304 B/op 2 allocs/op
PASS
ok nhooyr.io/websocket/wsjson 32.934s
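To illustrate the memory-reuse point from the commit messages above: `json.Marshal` returns a freshly allocated slice on every call, while a `json.Encoder` can write into a buffer whose backing array is kept between calls. The sketch below is not the wsjson code; `message`, `reusingEncoder` and `encodeMarshal` are hypothetical names used only to show the two call patterns side by side.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// message is a stand-in payload type for the example.
type message struct {
	Text string `json:"text"`
}

// reusingEncoder keeps one buffer and reuses its backing array across calls.
type reusingEncoder struct {
	buf bytes.Buffer
}

// encode writes v as JSON into the shared buffer. The returned slice
// aliases that buffer and is only valid until the next call to encode.
func (e *reusingEncoder) encode(v interface{}) ([]byte, error) {
	e.buf.Reset() // keeps the already allocated capacity
	if err := json.NewEncoder(&e.buf).Encode(v); err != nil {
		return nil, err
	}
	// Encode appends a newline that Marshal does not emit.
	return bytes.TrimSuffix(e.buf.Bytes(), []byte("\n")), nil
}

// encodeMarshal allocates a new result slice on every call.
func encodeMarshal(v interface{}) ([]byte, error) {
	return json.Marshal(v)
}

func main() {
	var e reusingEncoder
	for _, text := range []string{"first", "second"} {
		a, err := e.encode(message{Text: text})
		if err != nil {
			panic(err)
		}
		b, err := encodeMarshal(message{Text: text})
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s %s\n", a, b)
	}
}
```

The trade-off is that the caller must consume or copy the encoder's output before the buffer is reused.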
@dixyes

dixyes commented Nov 20, 2023

I guess QEMU's SIMD emulation hurts performance.

On an Aliyun (Alibaba Cloud) Yitian 710 (arm64, ARMv8) 2c4g machine:

root@iZbp1heu8m4uq7gguvddwaZ:~/websocket# cat /proc/cpuinfo
processor : 0
BogoMIPS : 100.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd49
CPU revision : 0
processor : 1
BogoMIPS : 100.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd49
CPU revision : 0
root@iZbp1heu8m4uq7gguvddwaZ:~/websocket# uname -a
Linux iZbp1heu8m4uq7gguvddwaZ 5.10.0-19-arm64 #1 SMP Debian 5.10.149-2 (2022-10-21) aarch64 GNU/Linux
goos: linux
goarch: arm64
pkg: nhooyr.io/websocket/internal/thirdparty
Benchmark_mask/arm64/basic/8-2 206792809 5.802 ns/op 1378.89 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16-2 100000000 10.02 ns/op 1596.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/32-2 58691935 20.34 ns/op 1573.17 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/128-2 14648796 81.91 ns/op 1562.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/256-2 7302968 164.3 ns/op 1558.27 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/512-2 3585920 334.4 ns/op 1530.96 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/1024-2 1807688 663.8 ns/op 1542.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/2048-2 901452 1322 ns/op 1548.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/4096-2 453880 2641 ns/op 1550.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/8192-2 227306 5273 ns/op 1553.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16384-2 113630 10536 ns/op 1555.07 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8-2 372385791 3.200 ns/op 2499.82 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16-2 326266168 3.677 ns/op 4351.15 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/32-2 326263063 3.675 ns/op 8706.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/128-2 193277991 6.178 ns/op 20717.82 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/256-2 120835178 9.939 ns/op 25757.71 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/512-2 67891269 17.58 ns/op 29120.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/1024-2 36238434 33.05 ns/op 30981.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/2048-2 18876517 63.51 ns/op 32244.43 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/4096-2 9632865 124.4 ns/op 32913.56 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8192-2 4862270 246.5 ns/op 33239.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16384-2 2449879 490.7 ns/op 33386.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8-2 320587507 3.747 ns/op 2134.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16-2 298397137 4.016 ns/op 3984.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/32-2 295286755 4.051 ns/op 7899.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/128-2 198758401 6.010 ns/op 21299.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/256-2 148294503 8.101 ns/op 31599.58 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/512-2 99287224 12.21 ns/op 41941.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/1024-2 59101357 20.24 ns/op 50591.08 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/2048-2 32870538 36.43 ns/op 56215.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/4096-2 17392502 68.75 ns/op 59578.86 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8192-2 8991554 133.3 ns/op 61432.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16384-2 4537192 264.3 ns/op 61990.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8-2 166697532 7.199 ns/op 1111.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16-2 95416378 12.50 ns/op 1280.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/32-2 99859288 12.03 ns/op 2659.82 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/128-2 74788264 15.98 ns/op 8008.48 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/256-2 49521510 24.10 ns/op 10620.54 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/512-2 30854259 38.75 ns/op 13213.30 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/1024-2 17709324 67.75 ns/op 15114.36 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/2048-2 9540504 125.6 ns/op 16301.06 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/4096-2 4887254 245.4 ns/op 16689.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8192-2 2506159 477.0 ns/op 17173.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16384-2 1276844 939.9 ns/op 17431.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8-2 239466345 5.011 ns/op 1596.61 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16-2 198722446 6.030 ns/op 2653.50 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/32-2 149454994 8.028 ns/op 3986.12 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/128-2 58453107 20.45 ns/op 6259.12 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/256-2 32118558 37.26 ns/op 6870.96 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/512-2 16886425 70.98 ns/op 7213.33 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/1024-2 8660222 138.4 ns/op 7396.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/2048-2 4389014 273.8 ns/op 7478.89 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/4096-2 2220012 540.4 ns/op 7579.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8192-2 1000000 1070 ns/op 7654.83 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16384-2 561620 2130 ns/op 7691.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8-2 359732443 3.339 ns/op 2395.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16-2 295799040 4.060 ns/op 3941.20 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/32-2 222655515 5.406 ns/op 5918.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/128-2 175895174 6.824 ns/op 18757.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/256-2 100000000 11.33 ns/op 22586.09 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/512-2 59968189 19.72 ns/op 25968.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/1024-2 33116636 36.16 ns/op 28320.44 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/2048-2 17286394 69.43 ns/op 29496.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/4096-2 8810706 136.0 ns/op 30118.04 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8192-2 4461346 268.9 ns/op 30466.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16384-2 2242198 534.8 ns/op 30633.09 MB/s 0 B/op 0 allocs/op
PASS

On an Aliyun (Alibaba Cloud) Ampere Altra (arm64, ARMv8) 2c4g machine:

root@iZbp19nzrw6iywyjtl52srZ:~# cat /proc/cpuinfo
processor : 0
BogoMIPS : 50.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1
processor : 1
BogoMIPS : 50.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1
root@iZbp19nzrw6iywyjtl52srZ:~# uname -a
Linux iZbp19nzrw6iywyjtl52srZ 5.10.0-19-arm64 #1 SMP Debian 5.10.149-2 (2022-10-21) aarch64 GNU/Linux
goos: linux
goarch: arm64
pkg: nhooyr.io/websocket/internal/thirdparty
Benchmark_mask/arm64/basic/8-2 156192206 7.680 ns/op 1041.61 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16-2 87099630 13.69 ns/op 1168.31 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/32-2 43625746 27.15 ns/op 1178.65 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/128-2 11600862 103.4 ns/op 1237.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/256-2 5790669 207.2 ns/op 1235.57 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/512-2 2849724 421.0 ns/op 1216.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/1024-2 1443289 830.9 ns/op 1232.42 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/2048-2 723596 1652 ns/op 1239.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/4096-2 364108 3289 ns/op 1245.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/8192-2 182422 6565 ns/op 1247.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16384-2 91266 13126 ns/op 1248.20 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8-2 179696448 6.678 ns/op 1198.02 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16-2 171135552 7.011 ns/op 2282.01 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/32-2 163356070 7.345 ns/op 4356.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/128-2 100000000 10.21 ns/op 12531.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/256-2 73170615 16.29 ns/op 15715.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/512-2 42342985 28.30 ns/op 18091.30 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/1024-2 22871635 52.36 ns/op 19557.29 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/2048-2 11953033 100.4 ns/op 20390.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/4096-2 6098042 196.7 ns/op 20824.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8192-2 3083127 389.3 ns/op 21045.51 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16384-2 1549681 773.8 ns/op 21172.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8-2 239631874 5.007 ns/op 1597.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16-2 246623696 4.874 ns/op 3282.49 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/32-2 224660503 5.343 ns/op 5989.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/128-2 146873190 8.150 ns/op 15705.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/256-2 100000000 11.35 ns/op 22548.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/512-2 66308772 18.03 ns/op 28401.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/1024-2 38059369 31.39 ns/op 32624.65 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/2048-2 20609492 58.09 ns/op 35258.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/4096-2 10760130 111.5 ns/op 36737.95 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8192-2 5494204 218.4 ns/op 37501.18 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16384-2 2776998 432.0 ns/op 37923.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8-2 126019189 9.511 ns/op 841.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16-2 87176002 13.69 ns/op 1168.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/32-2 79482931 15.03 ns/op 2129.34 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/128-2 51963406 23.05 ns/op 5552.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/256-2 35389480 33.79 ns/op 7576.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/512-2 21703480 55.26 ns/op 9265.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/1024-2 12215022 98.20 ns/op 10427.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/2048-2 6514315 184.3 ns/op 11113.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/4096-2 3286785 365.1 ns/op 11219.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8192-2 1691893 709.5 ns/op 11545.83 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16384-2 855566 1397 ns/op 11726.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8-2 170618257 7.011 ns/op 1141.12 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16-2 138242857 8.621 ns/op 1855.95 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/32-2 100000000 11.73 ns/op 2729.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/128-2 39364594 30.25 ns/op 4230.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/256-2 22373491 54.43 ns/op 4703.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/512-2 11536062 105.1 ns/op 4873.04 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/1024-2 5862846 201.1 ns/op 5091.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/2048-2 2996881 397.3 ns/op 5154.43 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/4096-2 1488253 820.2 ns/op 4993.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8192-2 770410 1599 ns/op 5123.95 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16384-2 373005 3122 ns/op 5248.40 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8-2 224579071 5.341 ns/op 1497.82 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16-2 189027944 6.344 ns/op 2521.98 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/32-2 143714523 8.347 ns/op 3833.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/128-2 100000000 10.35 ns/op 12369.09 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/256-2 69961245 17.03 ns/op 15031.81 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/512-2 39476787 30.39 ns/op 16845.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/1024-2 20975644 57.11 ns/op 17930.66 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/2048-2 10845069 110.5 ns/op 18530.47 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/4096-2 5520278 217.4 ns/op 18844.48 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8192-2 2782627 431.4 ns/op 18989.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16384-2 1397580 858.8 ns/op 19077.29 MB/s 0 B/op 0 allocs/op
PASS

@nhooyr
Contributor

nhooyr commented Nov 20, 2023

Right on, thanks for testing @dixyes

@nightwolfz

Finally got around to reviewing this. I'm not very familiar with writing assembly of any kind. Why use AVX2 instead of AVX-512?

AVX-512 is not widely supported, while AVX2 is everywhere.

I'm just not good enough at assembly. I added tests to confirm that @wdvxdr's implementation works correctly and matches the output of the basic masking loop.
nhooyr added a commit to wdvxdr1123/websocket that referenced this pull request Feb 22, 2024
The standard library does this too. It's unfortunate; I wish they had just exposed it in the standard library. Perhaps we can isolate the specific code we need later.
@nhooyr
Contributor

nhooyr commented Feb 22, 2024

Final results:

goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/internal/thirdparty
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
Benchmark_mask/amd64/basic/8-12 423375534 2.786 ns/op 2871.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/16-12 226554633 5.359 ns/op 2985.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/32-12 117482640 10.19 ns/op 3140.90 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/128-12 26246637 45.81 ns/op 2794.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/256-12 14100849 84.95 ns/op 3013.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/512-12 7287253 165.2 ns/op 3098.76 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/1024-12 3688262 320.3 ns/op 3197.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/2048-12 1888688 638.6 ns/op 3207.04 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/4096-12 939709 1275 ns/op 3212.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/8192-12 416410 2533 ns/op 3233.74 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/16384-12 237880 5075 ns/op 3228.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8-12 516842565 2.323 ns/op 3443.66 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16-12 512148457 2.321 ns/op 6895.02 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/32-12 463799696 2.554 ns/op 12531.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/128-12 305272117 3.889 ns/op 32909.16 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/256-12 186344584 6.533 ns/op 39186.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/512-12 98735030 10.37 ns/op 49364.30 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/1024-12 60532092 20.18 ns/op 50735.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/2048-12 31890501 36.09 ns/op 56745.07 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/4096-12 15045230 79.13 ns/op 51760.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8192-12 7874872 152.5 ns/op 53720.47 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16384-12 3976707 300.0 ns/op 54621.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8-12 565721422 2.087 ns/op 3833.34 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16-12 490515590 2.396 ns/op 6678.41 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/32-12 499705630 2.309 ns/op 13859.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/128-12 349259366 3.673 ns/op 34851.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/256-12 121710386 10.07 ns/op 25427.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/512-12 100000000 12.00 ns/op 42654.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/1024-12 68401042 17.57 ns/op 58296.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/2048-12 38861618 28.96 ns/op 70716.39 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/4096-12 22134694 53.55 ns/op 76483.15 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8192-12 12523645 91.32 ns/op 89702.20 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16384-12 6966129 167.6 ns/op 97768.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/8-12 306537969 3.908 ns/op 2047.33 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/16-12 167440917 7.127 ns/op 2245.06 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/32-12 157346451 7.623 ns/op 4197.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/128-12 100000000 10.17 ns/op 12590.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/256-12 91401891 13.36 ns/op 19161.41 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/512-12 43890088 26.60 ns/op 19246.01 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/1024-12 26414316 41.59 ns/op 24621.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/2048-12 16049217 71.19 ns/op 28766.12 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/4096-12 9171207 129.4 ns/op 31658.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/8192-12 4856886 250.7 ns/op 32674.27 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/16384-12 2488569 485.2 ns/op 33764.34 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/8-12 366741759 3.282 ns/op 2437.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/16-12 303639134 3.906 ns/op 4095.90 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/32-12 223418820 5.406 ns/op 5919.31 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/128-12 89532153 13.94 ns/op 9180.17 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/256-12 39774794 32.82 ns/op 7799.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/512-12 21657115 53.08 ns/op 9646.12 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/1024-12 11203101 97.40 ns/op 10513.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/2048-12 6175005 200.9 ns/op 10194.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/4096-12 3083400 390.6 ns/op 10487.27 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/8192-12 1551018 714.0 ns/op 11473.42 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/16384-12 847084 1428 ns/op 11476.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/8-12 640919714 1.895 ns/op 4220.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/16-12 523854591 2.453 ns/op 6522.16 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/32-12 344619900 3.268 ns/op 9793.04 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/128-12 281670219 4.072 ns/op 31433.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/256-12 164968168 7.219 ns/op 35463.76 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/512-12 82934056 13.82 ns/op 37060.27 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/1024-12 48002257 22.96 ns/op 44599.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/2048-12 29191290 41.93 ns/op 48845.44 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/4096-12 14418003 84.95 ns/op 48215.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/8192-12 7101901 161.0 ns/op 50892.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/16384-12 3655984 353.4 ns/op 46365.54 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket/internal/thirdparty 94.759s

Thanks again @wdvxdr1123 and sorry for the large delay.

@nhooyr nhooyr merged commit 8a54c1b into coder:dev Feb 22, 2024
nhooyr added a commit to alixander/websocket that referenced this pull request Apr 7, 2024