Conversation

@ChALkeR (Contributor) commented Oct 29, 2025

Loops with bodies not depending on previous iterations are easier for compilers to optimize.
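In other words, the hot loops are rewritten from pointer-bumping while loops, where each iteration depends on the pointer updates of the previous one, into index-based for loops whose iterations are independent. A minimal sketch of the pattern (illustrative only, not the PR's actual diff; the real before/after code is in the Godbolt listings later in this thread):

#include <cstdint>

// Before: each iteration advances the pointer, creating a loop-carried
// dependency that makes vectorization harder.
void xor_before(uint8_t *data, uint64_t mask8, uint32_t loop) {
  while (loop--) {
    uint64_t *p8 = (uint64_t *)data;
    *p8 ^= mask8;
    data += 8;
  }
}

// After: iterations are independent, so the compiler is free to unroll
// and vectorize the body.
void xor_after(uint8_t *data, uint64_t mask8, uint32_t loop) {
  uint64_t *p8 = (uint64_t *)data;
  for (uint32_t i = 0; i < loop; i++) p8[i] ^= mask8;
}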

Tested on M3 (please recheck on other platforms)

Before:

{ size: 10 }
  unmask x 25,000,000 ops/sec @ 40ns/op (0ns..185μs)
  unmask x 24,390,244 ops/sec @ 41ns/op (0ns..16ms)
  mask x 18,518,519 ops/sec @ 54ns/op (0ns..1353μs)
  mask x 18,181,818 ops/sec @ 55ns/op (0ns..5ms)
{ size: 128 }
  unmask x 19,607,843 ops/sec @ 51ns/op (0ns..195μs)
  unmask x 19,607,843 ops/sec @ 51ns/op (0ns..545μs)
  mask x 16,949,153 ops/sec @ 59ns/op (0ns..77μs)
  mask x 16,949,153 ops/sec @ 59ns/op (0ns..1437μs)
{ size: 1024 }
  unmask x 5,988,024 ops/sec @ 167ns/op (41ns..8ms)
  unmask x 6,024,096 ops/sec @ 166ns/op (83ns..15ms)
  mask x 9,090,909 ops/sec @ 110ns/op (0ns..64μs)
  mask x 9,174,312 ops/sec @ 109ns/op (41ns..835μs)
{ size: 10239 }
  unmask x 750,751 ops/sec @ 1332ns/op (417ns..147μs)
  unmask x 747,384 ops/sec @ 1338ns/op (417ns..144μs)
  mask x 1,727,116 ops/sec @ 579ns/op (459ns..63μs)
  mask x 1,730,104 ops/sec @ 578ns/op (500ns..70μs)
{ size: 1048576 }
  unmask x 7,260 ops/sec @ 137μs/op (124μs..217μs)
  unmask x 7,096 ops/sec @ 140μs/op (124μs..820μs)
  mask x 16,944 ops/sec @ 59μs/op (55μs..1298μs)
  mask x 17,160 ops/sec @ 58μs/op (55μs..226μs)

After:

{ size: 10 }
  unmask x 25,641,026 ops/sec @ 39ns/op (0ns..77μs)
  unmask x 25,000,000 ops/sec @ 40ns/op (0ns..3ms)
  mask x 19,607,843 ops/sec @ 51ns/op (0ns..819μs)
  mask x 19,230,769 ops/sec @ 52ns/op (0ns..212μs)
{ size: 128 }
  unmask x 23,809,524 ops/sec @ 42ns/op (0ns..137μs)
  unmask x 23,809,524 ops/sec @ 42ns/op (0ns..56μs)
  mask x 19,230,769 ops/sec @ 52ns/op (0ns..87μs)
  mask x 19,230,769 ops/sec @ 52ns/op (0ns..68μs)
{ size: 1024 }
  unmask x 19,607,843 ops/sec @ 51ns/op (0ns..60μs)
  unmask x 20,000,000 ops/sec @ 50ns/op (0ns..97μs)
  mask x 16,129,032 ops/sec @ 62ns/op (0ns..54μs)
  mask x 16,129,032 ops/sec @ 62ns/op (0ns..656μs)
{ size: 10239 }
  unmask x 6,410,256 ops/sec @ 156ns/op (83ns..493μs)
  unmask x 6,535,948 ops/sec @ 153ns/op (42ns..105μs)
  mask x 6,329,114 ops/sec @ 158ns/op (83ns..63μs)
  mask x 6,329,114 ops/sec @ 158ns/op (42ns..72μs)
{ size: 1048576 }
  unmask x 60,831 ops/sec @ 16μs/op (16μs..97μs)
  unmask x 61,058 ops/sec @ 16μs/op (16μs..132μs)
  mask x 70,817 ops/sec @ 14μs/op (12μs..104μs)
  mask x 62,224 ops/sec @ 16μs/op (12μs..98μs)

This has to be retested on something else.

@ChALkeR marked this pull request as draft on October 29, 2025 22:33
@ChALkeR marked this pull request as ready for review on October 29, 2025 22:35
@ChALkeR (Contributor, Author)

Will retest on Linux.

@ChALkeR marked this pull request as draft on October 30, 2025 01:04
@lpinca (Member)

I'm surprised to see such a big difference. I've run similar benchmarks on an Intel Mac, and while I didn't see a 10x improvement, the difference is still huge (about 5x faster).

@lpinca (Member)

Can you please run clang-format with style Google?
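For reference, something along the lines of clang-format -i -style=Google <files> should do it (the exact file list depends on the repository layout).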

@lpinca (Member)

On a Linux VM the difference is not so big, but it is still 2x faster in some cases.

@ChALkeR (Contributor, Author) commented Oct 31, 2025

This might be affected by both clang++/g++ and arm64/x86-64.

The most common case on servers is likely g++ and x86-64 on Linux.
We need to recheck that it didn't degrade there.

Will do that (if no one beats me to it).

@ChALkeR (Contributor, Author) commented Oct 31, 2025

Side note: on Mac, an optimized (not the current) JS implementation of unmask beats the current (pre-PR) native one by 1.5x, but that's only because the native implementation is slow.

@lpinca (Member) commented Oct 31, 2025

> The most common case on servers is likely g++ and x86-64 on Linux.
> We need to recheck that it didn't degrade there.

This is the VM env mentioned above, but it is virtualized.

> Side note: on Mac, an optimized (not the current) JS implementation of unmask beats the current (pre-PR) native one by 1.5x, but that's only because the native implementation is slow.

The JS implementation is used for very small buffers where the cost of calling the native bindings isn't worth the effort, so it does not really matter.

@lpinca (Member)

While reading the changes I also noticed that it is time to use napi_get_value_int64() instead of napi_get_value_uint32() for the offset and length. The Buffer length can now be > 2^31 - 1, but that is a different topic.
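A minimal sketch of what that switch could look like (hypothetical helper, not the PR's actual binding code), assuming the standard N-API signatures:

#include <node_api.h>
#include <cstdint>

// Hypothetical helper: read a byte offset or length that may exceed
// 2^31 - 1. napi_get_value_uint32() is limited to 32 bits, while
// napi_get_value_int64() covers the full JS safe-integer range.
static napi_status ReadIndex(napi_env env, napi_value value, int64_t *out) {
  return napi_get_value_int64(env, value, out);
}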

@ChALkeR (Contributor, Author) commented Oct 31, 2025

@lpinca I was considering replacing the native implementation with JS, as the new JS one was 1.5x faster; this is why it mattered 🙃
But fixing the native implementation instead turned out to be the better performance improvement locally.

> The Buffer length can now be > 2^31 - 1,

That is unfortunately broken in Node.js for now. But yes, it makes sense to support it here.
Will switch to 64-bit, but this likely needs a separate PR that has to land first (for a cleaner benchmark comparison).

And as for perf, I'll get to my x86-64 / Linux machine soon to test this locally.

@ChALkeR (Contributor, Author)

Here are two versions which could be explored in Godbolt:

Before:

#include <cstddef>
#include <cstdint>

struct Args0 {
  uint8_t *source;
  uint8_t *mask;
  uint8_t *destination;
  uint32_t offset;
  uint32_t length;
};

struct Args1 {
  uint8_t *source;
  size_t length;
  uint8_t *mask;
};

void *Mask(Args0 args0) {
  uint8_t *source = args0.source;
  uint8_t *mask = args0.mask;
  uint8_t *destination = args0.destination;
  uint32_t offset = args0.offset;
  uint32_t length = args0.length;

  destination += offset;
  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *destination++ = *source++ ^ mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t *pMask8 = (uint64_t *)maskAlignedArray;

  while (loop--) {
    uint64_t *pFrom8 = (uint64_t *)source;
    uint64_t *pTo8 = (uint64_t *)destination;
    *pTo8 = *pFrom8 ^ *pMask8;
    source += 8;
    destination += 8;
  }

  //
  // Apply mask to remaining data.
  //
  uint8_t *pmaskAlignedArray = maskAlignedArray;

  length %= 8;
  while (length--) {
    *destination++ = *source++ ^ *pmaskAlignedArray++;
  }

  return NULL;
}

void *Unmask(Args1 args1) {
  uint8_t *source = args1.source;
  size_t length = args1.length;
  uint8_t *mask = args1.mask;

  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *source++ ^= mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t *pMask8 = (uint64_t *)maskAlignedArray;

  while (loop--) {
    uint64_t *pSource8 = (uint64_t *)source;
    *pSource8 ^= *pMask8;
    source += 8;
  }

  //
  // Apply mask to remaining data.
  //
  uint8_t *pmaskAlignedArray = maskAlignedArray;

  length %= 8;
  while (length--) {
    *source++ ^= *pmaskAlignedArray++;
  }

  return NULL;
}

After:

#include <cstddef>
#include <cstdint>

struct Args0 {
  uint8_t *source;
  uint8_t *mask;
  uint8_t *destination;
  uint32_t offset;
  uint32_t length;
};

struct Args1 {
  uint8_t *source;
  size_t length;
  uint8_t *mask;
};

void *Mask(Args0 args0) {
  uint8_t *source = args0.source;
  uint8_t *mask = args0.mask;
  uint8_t *destination = args0.destination;
  uint32_t offset = args0.offset;
  uint32_t length = args0.length;

  destination += offset;
  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *destination++ = *source++ ^ mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t mask8 = ((uint64_t *)maskAlignedArray)[0];
  uint64_t *pFrom8 = (uint64_t *)source;
  uint64_t *pTo8 = (uint64_t *)destination;

  for (uint32_t i = 0; i < loop; i++) pTo8[i] = pFrom8[i] ^ mask8;

  source += 8 * loop;
  destination += 8 * loop;

  //
  // Apply mask to remaining data.
  //
  length %= 8;
  for (uint32_t i = 0; i < length; i++) {
    destination[i] = source[i] ^ maskAlignedArray[i];
  }

  return NULL;
}

void *Unmask(Args1 args1) {
  uint8_t *source = args1.source;
  uint8_t *mask = args1.mask;
  size_t length = args1.length;

  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *source++ ^= mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t mask8 = ((uint64_t *)maskAlignedArray)[0];
  uint64_t *pSource8 = (uint64_t *)source;

  for (uint32_t i = 0; i < loop; i++) pSource8[i] ^= mask8;

  source += 8 * loop;

  //
  // Apply mask to remaining data.
  //
  length %= 8;
  for (uint32_t i = 0; i < length; i++) {
    source[i] ^= maskAlignedArray[i];
  }

  return NULL;
}
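A quick sanity check that could be appended to either listing (hypothetical driver, not part of the PR): mask a buffer into a destination, unmask it in place, and verify the round trip.

#include <cstdio>
#include <cstring>

int main() {
  uint8_t mask[4] = {0xde, 0xad, 0xbe, 0xef};
  uint8_t src[1000], dst[1000];
  for (int i = 0; i < 1000; i++) src[i] = (uint8_t)i;

  // Mask src into dst, then unmask dst in place; dst must equal src again.
  Mask(Args0{src, mask, dst, 0, 1000});
  Unmask(Args1{dst, 1000, mask});

  printf("%s\n", memcmp(src, dst, 1000) == 0 ? "round-trip OK" : "MISMATCH");
  return 0;
}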

@ChALkeR (Contributor, Author) commented Nov 1, 2025

g++ on x86_64 with -O3, old:

        mov QWORD PTR [rsp-16], rcx
        test r8d, r8d
        je .L71
        cmp r8d, 1
        je .L77
        mov edi, r8d
        movq xmm1, rcx
        mov rdx, rax
        shr edi
        punpcklqdq xmm1, xmm1
        sal rdi, 4
        add rdi, rax
.L73:
        movdqu xmm0, XMMWORD PTR [rdx]
        add rdx, 16
        pxor xmm0, xmm1
        movups XMMWORD PTR [rdx-16], xmm0
        cmp rdx, rdi
        jne .L73
        test r8b, 1
        je .L74
        movabs rdx, 34359738352
        and rdx, rsi
        add rdx, rax
.L72:
        xor QWORD PTR [rdx], rcx
.L74:

g++ on x86_64 with -O3, new:

        mov r8, QWORD PTR [rsp-24]
        cmp esi, 1
        je .L68
        mov ecx, esi
        movq xmm1, r8
        mov rdx, rax
        shr ecx
        punpcklqdq xmm1, xmm1
        sal rcx, 4
        add rcx, rax
.L64:
        movdqu xmm0, XMMWORD PTR [rdx]
        add rdx, 16
        pxor xmm0, xmm1
        movups XMMWORD PTR [rdx-16], xmm0
        cmp rdx, rcx
        jne .L64
        test sil, 1
        je .L62
        mov edx, esi
        and edx, -2
.L63:
        xor QWORD PTR [rax+rdx*8], r8
.L62:

The main loop is identical.
This shouldn't slow down g++.

@ChALkeR (Contributor, Author)

clang++ on armv8-a with -O3:

Before:

        b.hs .LBB1_26
        tbz w9, #0, .LBB1_26
        mov x9, x8
        b .LBB1_29
.LBB1_26:
        ldr d0, [sp, #8]
        and x13, x11, #0xfffffffc
        add x9, x8, x13, lsl #3
        sub w11, w11, w13
        add x8, x8, #16
        dup v0.2d, v0.d[0]
        mov x14, x13
.LBB1_27:
        ldp q1, q2, [x8, #-16]
        subs x14, x14, #4
        eor v1.16b, v1.16b, v0.16b
        eor v2.16b, v2.16b, v0.16b
        stp q1, q2, [x8, #-16]
        add x8, x8, #32
        b.ne .LBB1_27
        cmp x12, x13
        b.eq .LBB1_30
.LBB1_29:
        ldr x8, [sp, #8]
        ldr x12, [x9]
        subs w11, w11, #1
        eor x8, x12, x8
        str x8, [x9], #8
        b.ne .LBB1_29
.LBB1_30:

After:

        b.hs .LBB1_22
        mov x16, xzr
        b .LBB1_25
.LBB1_22:
        lsr x16, x13, #3
        dup v0.2d, x14
        add x17, x8, #16
        and x16, x16, #0xfffffffc
        mov x18, x16
.LBB1_23:
        ldp q1, q2, [x17, #-16]
        subs x18, x18, #4
        eor v1.16b, v1.16b, v0.16b
        eor v2.16b, v2.16b, v0.16b
        stp q1, q2, [x17, #-16]
        add x17, x17, #32
        b.ne .LBB1_23
        cmp x15, x16
        b.eq .LBB1_27
.LBB1_25:
        add x17, x8, x16, lsl #3
        sub x15, x16, x15
.LBB1_26:
        ldr x16, [x17]
        adds x15, x15, #1
        eor x16, x16, x14
        str x16, [x17], #8
        b.lo .LBB1_26
.LBB1_27:

@lpinca (Member)

Benchmark results on native Windows:

System Information:
  Node.js: v25.2.0
  OS: win32 10.0.26200
  CPU: AMD Ryzen 5 5600G with Radeon Graphics

Benchmark results (20 total):
Plugins enabled: V8NeverOptimizePlugin
├─ mask (10)
│  ├─ old 13.914.890 ops/sec (11 runs sampled) min..max=(71.36ns...72.63ns)
│  └─ new 13.680.920 ops/sec (12 runs sampled) min..max=(72.43ns...73.85ns)
├─ mask (128)
│  ├─ old 11.519.328 ops/sec (12 runs sampled) min..max=(86.11ns...88.11ns)
│  └─ new 12.812.499 ops/sec (10 runs sampled) min..max=(77.46ns...78.46ns)
├─ mask (1024)
│  ├─ old 6.178.851 ops/sec (12 runs sampled) min..max=(160.54ns...162.89ns)
│  └─ new 8.751.573 ops/sec (10 runs sampled) min..max=(112.93ns...114.78ns)
├─ mask (10239)
│  ├─ old 1.031.000 ops/sec (11 runs sampled) min..max=(964.15ns...982.73ns)
│  └─ new 2.432.875 ops/sec (11 runs sampled) min..max=(406.97ns...415.03ns)
├─ mask (1048576)
│  ├─ old 10.292 ops/sec (10 runs sampled) min..max=(96.72us...97.43us)
│  └─ new 27.480 ops/sec (13 runs sampled) min..max=(35.65us...36.81us)
├─ unmask (10)
│  ├─ old 19.327.232 ops/sec (12 runs sampled) min..max=(51.55ns...52.12ns)
│  └─ new 19.315.417 ops/sec (10 runs sampled) min..max=(51.67ns...51.90ns)
├─ unmask (128)
│  ├─ old 18.222.878 ops/sec (11 runs sampled) min..max=(54.33ns...56.30ns)
│  └─ new 18.912.535 ops/sec (11 runs sampled) min..max=(52.58ns...53.30ns)
├─ unmask (1024)
│  ├─ old 9.159.110 ops/sec (12 runs sampled) min..max=(107.66ns...109.96ns)
│  └─ new 15.270.425 ops/sec (10 runs sampled) min..max=(65.24ns...65.91ns)
├─ unmask (10239)
│  ├─ old 1.551.788 ops/sec (11 runs sampled) min..max=(643.32ns...645.68ns)
│  └─ new 4.895.549 ops/sec (11 runs sampled) min..max=(202.91ns...206.37ns)
└─ unmask (1048576)
   ├─ old 15.936 ops/sec (9 runs sampled) min..max=(62.66us...63.06us)
   └─ new 62.088 ops/sec (11 runs sampled) min..max=(15.90us...16.81us)

@lpinca (Member) commented Nov 26, 2025

@ChALkeR apart from #164 (comment) and #164 (comment), is this ready?

@lpinca (Member) commented Dec 10, 2025

@ChALkeR ping.

@lpinca force-pushed the master branch 2 times, most recently from 95371b9 to 557a7af on December 10, 2025 09:19
@lpinca marked this pull request as ready for review on December 10, 2025 10:49
@lpinca merged commit 321fbe4 into websockets:master on Dec 10, 2025
16 checks passed
@ChALkeR (Contributor, Author)

@lpinca sorry for missing this, thanks for fixing and merging!

@ChALkeR deleted the go-brrrr branch on December 18, 2025 13:21