Conversation

@ChALkeR (Contributor) commented Oct 29, 2025

Loops with bodies not depending on previous iterations are easier for compilers to optimize.
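In other words, the hot loops are rewritten from pointer-bumping while loops, where each iteration depends on the pointer updates of the previous one, into index-based for loops whose iterations are independent. A minimal sketch of the pattern (illustrative only, not the PR's actual diff; the real before/after code is in the Godbolt listings later in this thread):

#include <cstdint>

// Before: each iteration advances the pointer, creating a loop-carried
// dependency that makes vectorization harder.
void xor_before(uint8_t *data, uint64_t mask8, uint32_t loop) {
  while (loop--) {
    uint64_t *p8 = (uint64_t *)data;
    *p8 ^= mask8;
    data += 8;
  }
}

// After: iterations are independent, so the compiler is free to unroll
// and vectorize the body.
void xor_after(uint8_t *data, uint64_t mask8, uint32_t loop) {
  uint64_t *p8 = (uint64_t *)data;
  for (uint32_t i = 0; i < loop; i++) p8[i] ^= mask8;
}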

Tested on M3 (please recheck on other platforms)

Before:

{ size: 10 }
  unmask x 25,000,000 ops/sec @ 40ns/op (0ns..185μs)
  unmask x 24,390,244 ops/sec @ 41ns/op (0ns..16ms)
  mask x 18,518,519 ops/sec @ 54ns/op (0ns..1353μs)
  mask x 18,181,818 ops/sec @ 55ns/op (0ns..5ms)
{ size: 128 }
  unmask x 19,607,843 ops/sec @ 51ns/op (0ns..195μs)
  unmask x 19,607,843 ops/sec @ 51ns/op (0ns..545μs)
  mask x 16,949,153 ops/sec @ 59ns/op (0ns..77μs)
  mask x 16,949,153 ops/sec @ 59ns/op (0ns..1437μs)
{ size: 1024 }
  unmask x 5,988,024 ops/sec @ 167ns/op (41ns..8ms)
  unmask x 6,024,096 ops/sec @ 166ns/op (83ns..15ms)
  mask x 9,090,909 ops/sec @ 110ns/op (0ns..64μs)
  mask x 9,174,312 ops/sec @ 109ns/op (41ns..835μs)
{ size: 10239 }
  unmask x 750,751 ops/sec @ 1332ns/op (417ns..147μs)
  unmask x 747,384 ops/sec @ 1338ns/op (417ns..144μs)
  mask x 1,727,116 ops/sec @ 579ns/op (459ns..63μs)
  mask x 1,730,104 ops/sec @ 578ns/op (500ns..70μs)
{ size: 1048576 }
  unmask x 7,260 ops/sec @ 137μs/op (124μs..217μs)
  unmask x 7,096 ops/sec @ 140μs/op (124μs..820μs)
  mask x 16,944 ops/sec @ 59μs/op (55μs..1298μs)
  mask x 17,160 ops/sec @ 58μs/op (55μs..226μs)

After:

{ size: 10 }
  unmask x 25,641,026 ops/sec @ 39ns/op (0ns..77μs)
  unmask x 25,000,000 ops/sec @ 40ns/op (0ns..3ms)
  mask x 19,607,843 ops/sec @ 51ns/op (0ns..819μs)
  mask x 19,230,769 ops/sec @ 52ns/op (0ns..212μs)
{ size: 128 }
  unmask x 23,809,524 ops/sec @ 42ns/op (0ns..137μs)
  unmask x 23,809,524 ops/sec @ 42ns/op (0ns..56μs)
  mask x 19,230,769 ops/sec @ 52ns/op (0ns..87μs)
  mask x 19,230,769 ops/sec @ 52ns/op (0ns..68μs)
{ size: 1024 }
  unmask x 19,607,843 ops/sec @ 51ns/op (0ns..60μs)
  unmask x 20,000,000 ops/sec @ 50ns/op (0ns..97μs)
  mask x 16,129,032 ops/sec @ 62ns/op (0ns..54μs)
  mask x 16,129,032 ops/sec @ 62ns/op (0ns..656μs)
{ size: 10239 }
  unmask x 6,410,256 ops/sec @ 156ns/op (83ns..493μs)
  unmask x 6,535,948 ops/sec @ 153ns/op (42ns..105μs)
  mask x 6,329,114 ops/sec @ 158ns/op (83ns..63μs)
  mask x 6,329,114 ops/sec @ 158ns/op (42ns..72μs)
{ size: 1048576 }
  unmask x 60,831 ops/sec @ 16μs/op (16μs..97μs)
  unmask x 61,058 ops/sec @ 16μs/op (16μs..132μs)
  mask x 70,817 ops/sec @ 14μs/op (12μs..104μs)
  mask x 62,224 ops/sec @ 16μs/op (12μs..98μs)

This has to be retested on something else.

@ChALkeR marked this pull request as draft on October 29, 2025 22:33
@ChALkeR marked this pull request as ready for review on October 29, 2025 22:35
@ChALkeR (Contributor, Author)

Will retest on Linux.

@ChALkeR marked this pull request as draft on October 30, 2025 01:04
@lpinca (Member)

I'm surprised to see such a big difference. I've run similar benchmarks on an Intel Mac, and while I didn't see a 10x improvement, the difference is still huge (about 5x faster).

@lpinca (Member)

Can you please run clang-format with style Google?
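For reference, something along the lines of clang-format -i -style=Google <files> should do it (the exact file list depends on the repository layout).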

@lpinca (Member)

On a Linux VM the difference is not so big, but it is still 2x faster in some cases.

@ChALkeR (Contributor, Author) commented Oct 31, 2025

This might be affected by both clang++/g++ and arm64/x86-64.

The most common case on servers is likely g++ and x86-64 on Linux.
We need to recheck that it didn't degrade there.

Will do that (if no one beats me to it).

@ChALkeR (Contributor, Author) commented Oct 31, 2025

Side note: on Mac, an optimized (not the current) JS implementation of unmask beats the current (pre-PR) native one by 1.5x, but that's only because the native implementation is slow.

@lpinca (Member) commented Oct 31, 2025

> The most common case on servers is likely g++ and x86-64 on Linux.
> We need to recheck that it didn't degrade there.

This is the VM env mentioned above, but it is virtualized.

> Side note: on Mac, an optimized (not the current) JS implementation of unmask beats the current (pre-PR) native one by 1.5x, but that's only because the native implementation is slow.

The JS implementation is used for very small buffers where the cost of calling the native bindings isn't worth the effort, so it does not really matter.

@lpinca (Member)

While reading the changes I also noticed that it is time to use napi_get_value_int64() instead of napi_get_value_uint32() for the offset and length. The Buffer length can now be > 2^31 - 1, but that is a different topic.
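A minimal sketch of what that switch could look like (hypothetical helper, not the PR's actual binding code), assuming the standard N-API signatures:

#include <node_api.h>
#include <cstdint>

// Hypothetical helper: read a byte offset or length that may exceed
// 2^31 - 1. napi_get_value_uint32() is limited to 32 bits, while
// napi_get_value_int64() covers the full JS safe-integer range.
static napi_status ReadIndex(napi_env env, napi_value value, int64_t *out) {
  return napi_get_value_int64(env, value, out);
}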

@ChALkeR (Contributor, Author) commented Oct 31, 2025

@lpinca I was considering replacing the native implementation with JS, as the new JS one was 1.5x faster; this is why it mattered 🙃
But fixing the native implementation instead turned out to be the better performance improvement locally.

> The Buffer length can now be > 2^31 - 1,

That is unfortunately broken in Node.js for now. But yes, it makes sense to support it here.
Will switch to 64-bit, but this likely needs a separate PR that has to land first (for a cleaner benchmark comparison).

And as for perf, I'll get to my x86-64 / Linux machine soon to test this locally.

@ChALkeR (Contributor, Author)

Here are two versions which could be explored in Godbolt:

Before:

#include <cstddef>
#include <cstdint>

struct Args0 {
  uint8_t *source;
  uint8_t *mask;
  uint8_t *destination;
  uint32_t offset;
  uint32_t length;
};

struct Args1 {
  uint8_t *source;
  size_t length;
  uint8_t *mask;
};

void *Mask(Args0 args0) {
  uint8_t *source = args0.source;
  uint8_t *mask = args0.mask;
  uint8_t *destination = args0.destination;
  uint32_t offset = args0.offset;
  uint32_t length = args0.length;

  destination += offset;
  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *destination++ = *source++ ^ mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t *pMask8 = (uint64_t *)maskAlignedArray;

  while (loop--) {
    uint64_t *pFrom8 = (uint64_t *)source;
    uint64_t *pTo8 = (uint64_t *)destination;
    *pTo8 = *pFrom8 ^ *pMask8;
    source += 8;
    destination += 8;
  }

  //
  // Apply mask to remaining data.
  //
  uint8_t *pmaskAlignedArray = maskAlignedArray;

  length %= 8;
  while (length--) {
    *destination++ = *source++ ^ *pmaskAlignedArray++;
  }

  return NULL;
}

void *Unmask(Args1 args1) {
  uint8_t *source = args1.source;
  size_t length = args1.length;
  uint8_t *mask = args1.mask;

  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *source++ ^= mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t *pMask8 = (uint64_t *)maskAlignedArray;

  while (loop--) {
    uint64_t *pSource8 = (uint64_t *)source;
    *pSource8 ^= *pMask8;
    source += 8;
  }

  //
  // Apply mask to remaining data.
  //
  uint8_t *pmaskAlignedArray = maskAlignedArray;

  length %= 8;
  while (length--) {
    *source++ ^= *pmaskAlignedArray++;
  }

  return NULL;
}

After:

#include <cstddef>
#include <cstdint>

struct Args0 {
  uint8_t *source;
  uint8_t *mask;
  uint8_t *destination;
  uint32_t offset;
  uint32_t length;
};

struct Args1 {
  uint8_t *source;
  size_t length;
  uint8_t *mask;
};

void *Mask(Args0 args0) {
  uint8_t *source = args0.source;
  uint8_t *mask = args0.mask;
  uint8_t *destination = args0.destination;
  uint32_t offset = args0.offset;
  uint32_t length = args0.length;

  destination += offset;
  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *destination++ = *source++ ^ mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t mask8 = ((uint64_t *)maskAlignedArray)[0];
  uint64_t *pFrom8 = (uint64_t *)source;
  uint64_t *pTo8 = (uint64_t *)destination;

  for (uint32_t i = 0; i < loop; i++) pTo8[i] = pFrom8[i] ^ mask8;

  source += 8 * loop;
  destination += 8 * loop;

  //
  // Apply mask to remaining data.
  //
  length %= 8;
  for (uint32_t i = 0; i < length; i++) {
    destination[i] = source[i] ^ maskAlignedArray[i];
  }

  return NULL;
}

void *Unmask(Args1 args1) {
  uint8_t *source = args1.source;
  uint8_t *mask = args1.mask;
  size_t length = args1.length;

  uint32_t index = 0;

  //
  // Alignment preamble.
  //
  while (index < length && ((size_t)source % 8)) {
    *source++ ^= mask[index % 4];
    index++;
  }

  length -= index;
  if (!length) return NULL;

  //
  // Realign mask and convert to 64 bit.
  //
  uint8_t maskAlignedArray[8];

  for (uint8_t i = 0; i < 8; i++, index++) {
    maskAlignedArray[i] = mask[index % 4];
  }

  //
  // Apply 64 bit mask in 8 byte chunks.
  //
  uint32_t loop = length / 8;
  uint64_t mask8 = ((uint64_t *)maskAlignedArray)[0];
  uint64_t *pSource8 = (uint64_t *)source;

  for (uint32_t i = 0; i < loop; i++) pSource8[i] ^= mask8;

  source += 8 * loop;

  //
  // Apply mask to remaining data.
  //
  length %= 8;
  for (uint32_t i = 0; i < length; i++) {
    source[i] ^= maskAlignedArray[i];
  }

  return NULL;
}
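A quick sanity check that could be appended to either listing (hypothetical driver, not part of the PR): mask a buffer into a destination, unmask it in place, and verify the round trip.

#include <cstdio>
#include <cstring>

int main() {
  uint8_t mask[4] = {0xde, 0xad, 0xbe, 0xef};
  uint8_t src[1000], dst[1000];
  for (int i = 0; i < 1000; i++) src[i] = (uint8_t)i;

  // Mask src into dst, then unmask dst in place; dst must equal src again.
  Mask(Args0{src, mask, dst, 0, 1000});
  Unmask(Args1{dst, 1000, mask});

  printf("%s\n", memcmp(src, dst, 1000) == 0 ? "round-trip OK" : "MISMATCH");
  return 0;
}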

@ChALkeR (Contributor, Author) commented Nov 1, 2025

g++ on x86_64 with -O3, old:

        mov QWORD PTR [rsp-16], rcx
        test r8d, r8d
        je .L71
        cmp r8d, 1
        je .L77
        mov edi, r8d
        movq xmm1, rcx
        mov rdx, rax
        shr edi
        punpcklqdq xmm1, xmm1
        sal rdi, 4
        add rdi, rax
.L73:
        movdqu xmm0, XMMWORD PTR [rdx]
        add rdx, 16
        pxor xmm0, xmm1
        movups XMMWORD PTR [rdx-16], xmm0
        cmp rdx, rdi
        jne .L73
        test r8b, 1
        je .L74
        movabs rdx, 34359738352
        and rdx, rsi
        add rdx, rax
.L72:
        xor QWORD PTR [rdx], rcx
.L74:

g++ on x86_64 with -O3, new:

        mov r8, QWORD PTR [rsp-24]
        cmp esi, 1
        je .L68
        mov ecx, esi
        movq xmm1, r8
        mov rdx, rax
        shr ecx
        punpcklqdq xmm1, xmm1
        sal rcx, 4
        add rcx, rax
.L64:
        movdqu xmm0, XMMWORD PTR [rdx]
        add rdx, 16
        pxor xmm0, xmm1
        movups XMMWORD PTR [rdx-16], xmm0
        cmp rdx, rcx
        jne .L64
        test sil, 1
        je .L62
        mov edx, esi
        and edx, -2
.L63:
        xor QWORD PTR [rax+rdx*8], r8
.L62:

The main loop is identical.
This shouldn't slow down g++.

@ChALkeR (Contributor, Author)

clang++ on armv8-a with -O3:

Before:

        b.hs .LBB1_26
        tbz w9, #0, .LBB1_26
        mov x9, x8
        b .LBB1_29
.LBB1_26:
        ldr d0, [sp, #8]
        and x13, x11, #0xfffffffc
        add x9, x8, x13, lsl #3
        sub w11, w11, w13
        add x8, x8, #16
        dup v0.2d, v0.d[0]
        mov x14, x13
.LBB1_27:
        ldp q1, q2, [x8, #-16]
        subs x14, x14, #4
        eor v1.16b, v1.16b, v0.16b
        eor v2.16b, v2.16b, v0.16b
        stp q1, q2, [x8, #-16]
        add x8, x8, #32
        b.ne .LBB1_27
        cmp x12, x13
        b.eq .LBB1_30
.LBB1_29:
        ldr x8, [sp, #8]
        ldr x12, [x9]
        subs w11, w11, #1
        eor x8, x12, x8
        str x8, [x9], #8
        b.ne .LBB1_29
.LBB1_30:

After:

        b.hs .LBB1_22
        mov x16, xzr
        b .LBB1_25
.LBB1_22:
        lsr x16, x13, #3
        dup v0.2d, x14
        add x17, x8, #16
        and x16, x16, #0xfffffffc
        mov x18, x16
.LBB1_23:
        ldp q1, q2, [x17, #-16]
        subs x18, x18, #4
        eor v1.16b, v1.16b, v0.16b
        eor v2.16b, v2.16b, v0.16b
        stp q1, q2, [x17, #-16]
        add x17, x17, #32
        b.ne .LBB1_23
        cmp x15, x16
        b.eq .LBB1_27
.LBB1_25:
        add x17, x8, x16, lsl #3
        sub x15, x16, x15
.LBB1_26:
        ldr x16, [x17]
        adds x15, x15, #1
        eor x16, x16, x14
        str x16, [x17], #8
        b.lo .LBB1_26
.LBB1_27:

@lpinca (Member)

Benchmark results on native Windows:

System Information:
  Node.js: v25.2.0
  OS: win32 10.0.26200
  CPU: AMD Ryzen 5 5600G with Radeon Graphics

Benchmark results (20 total):
Plugins enabled: V8NeverOptimizePlugin
├─ mask (10)
│  ├─ old 13.914.890 ops/sec (11 runs sampled) min..max=(71.36ns...72.63ns)
│  └─ new 13.680.920 ops/sec (12 runs sampled) min..max=(72.43ns...73.85ns)
├─ mask (128)
│  ├─ old 11.519.328 ops/sec (12 runs sampled) min..max=(86.11ns...88.11ns)
│  └─ new 12.812.499 ops/sec (10 runs sampled) min..max=(77.46ns...78.46ns)
├─ mask (1024)
│  ├─ old 6.178.851 ops/sec (12 runs sampled) min..max=(160.54ns...162.89ns)
│  └─ new 8.751.573 ops/sec (10 runs sampled) min..max=(112.93ns...114.78ns)
├─ mask (10239)
│  ├─ old 1.031.000 ops/sec (11 runs sampled) min..max=(964.15ns...982.73ns)
│  └─ new 2.432.875 ops/sec (11 runs sampled) min..max=(406.97ns...415.03ns)
├─ mask (1048576)
│  ├─ old 10.292 ops/sec (10 runs sampled) min..max=(96.72us...97.43us)
│  └─ new 27.480 ops/sec (13 runs sampled) min..max=(35.65us...36.81us)
├─ unmask (10)
│  ├─ old 19.327.232 ops/sec (12 runs sampled) min..max=(51.55ns...52.12ns)
│  └─ new 19.315.417 ops/sec (10 runs sampled) min..max=(51.67ns...51.90ns)
├─ unmask (128)
│  ├─ old 18.222.878 ops/sec (11 runs sampled) min..max=(54.33ns...56.30ns)
│  └─ new 18.912.535 ops/sec (11 runs sampled) min..max=(52.58ns...53.30ns)
├─ unmask (1024)
│  ├─ old 9.159.110 ops/sec (12 runs sampled) min..max=(107.66ns...109.96ns)
│  └─ new 15.270.425 ops/sec (10 runs sampled) min..max=(65.24ns...65.91ns)
├─ unmask (10239)
│  ├─ old 1.551.788 ops/sec (11 runs sampled) min..max=(643.32ns...645.68ns)
│  └─ new 4.895.549 ops/sec (11 runs sampled) min..max=(202.91ns...206.37ns)
└─ unmask (1048576)
   ├─ old 15.936 ops/sec (9 runs sampled) min..max=(62.66us...63.06us)
   └─ new 62.088 ops/sec (11 runs sampled) min..max=(15.90us...16.81us)

@lpinca (Member) commented Nov 26, 2025

@ChALkeR apart from #164 (comment) and #164 (comment), is this ready?

@lpinca (Member) commented Dec 10, 2025

@ChALkeR ping.

@lpinca force-pushed the master branch 2 times, most recently from 95371b9 to 557a7af on December 10, 2025 09:19
@lpinca marked this pull request as ready for review on December 10, 2025 10:49
@lpinca merged commit 321fbe4 into websockets:master on Dec 10, 2025
16 checks passed
@ChALkeR (Contributor, Author)

@lpinca sorry for missing this, thanks for fixing and merging!

@ChALkeR deleted the go-brrrr branch on December 18, 2025 13:21