src: improve base64 encoding performance #39701
Conversation
mscdex commented Aug 8, 2021 • edited
nodejs-github-bot commented Aug 8, 2021
benjamingr left a comment
The improvement (sure you tested with -O3 in both cases right?) is somewhat surprising but sure.
mscdex commented Aug 8, 2021 • edited
-O3 is the default in node builds, which I did not touch.
I'm not as surprised considering this is reading more bytes per loop iteration (and in one go) than before, which is not something I'm sure a compiler would necessarily know to do.
benjamingr commented Aug 8, 2021
Yeah, I of course believe you and assumed it was run with -O3 (otherwise I wouldn't have LGTMd). I'm just surprised about:
It sounds like a pretty safe and straightforward form of loop unrolling, which is a fairly simple/standard optimisation
```cpp
// Read in chunks of 8 bytes for as long as possible
while (i < n64) {
  const uint64_t dword = *reinterpret_cast<const uint64_t*>(src + i);
```
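For context, this is the classic byte-at-a-time base64 scheme that the wide-read loop above optimizes: every 3 input bytes become 4 output characters via 6-bit table lookups. The sketch below is an illustrative reference implementation, not the PR's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Reference base64 encoder (illustrative, not the PR's code).
std::string base64_encode(const unsigned char* src, size_t n) {
  static const char tbl[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  size_t i = 0;
  // Each full group of 3 bytes packs into 24 bits, emitted as four
  // 6-bit indices into the lookup table.
  for (; i + 3 <= n; i += 3) {
    uint32_t v = (src[i] << 16) | (src[i + 1] << 8) | src[i + 2];
    out += tbl[(v >> 18) & 63];
    out += tbl[(v >> 12) & 63];
    out += tbl[(v >> 6) & 63];
    out += tbl[v & 63];
  }
  // 1 or 2 trailing bytes are padded with '='.
  if (i < n) {
    uint32_t v = src[i] << 16;
    if (i + 1 < n) v |= src[i + 1] << 8;
    out += tbl[(v >> 18) & 63];
    out += tbl[(v >> 12) & 63];
    out += (i + 1 < n) ? tbl[(v >> 6) & 63] : '=';
    out += '=';
  }
  return out;
}
```

Reading 8 source bytes per iteration amortizes load overhead across more of these table lookups than the byte-wise form allows.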
Some architectures don't support unaligned 64-bit access, so this could end up being somewhat slower on those. But still an optimization on most systems!
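As an aside, a common portable alternative to the cast-and-dereference is to load through `std::memcpy`: compilers typically lower it to a single unaligned load where the architecture permits one, and to byte loads where it doesn't, without the undefined behavior of dereferencing a misaligned `uint64_t*`. A minimal sketch (not part of this PR):

```cpp
#include <cstdint>
#include <cstring>

// Portable 8-byte load that is valid at any alignment; on x86-64 and
// ARMv8 this usually compiles to one plain load instruction.
uint64_t load_u64(const unsigned char* p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```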
That's why I was curious to check on ARM. However, from what I'm seeing, ARMv7 and newer support unaligned reads, with ARMv7 supposedly not supporting them for only 2 instructions?
Have you tried reading 1 or 4 bytes at a time until 8-byte alignment is reached?
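The suggestion could be sketched as a three-phase loop: a byte-wise prologue until the pointer is 8-byte aligned, an aligned 64-bit bulk phase, and a byte-wise tail. The sketch below is hypothetical, not the PR's code; it sums bytes instead of emitting base64 so the loop structure stays visible, and it keeps the PR's `reinterpret_cast` load (now guaranteed aligned).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical alignment-prologue structure (not the PR's code).
uint64_t sum_with_alignment_prologue(const unsigned char* src, size_t n) {
  uint64_t sum = 0;
  size_t i = 0;
  // Prologue: single bytes until src + i sits on an 8-byte boundary.
  while (i < n && (reinterpret_cast<uintptr_t>(src + i) & 7) != 0)
    sum += src[i++];
  // Bulk: aligned 64-bit loads, 8 bytes per iteration.
  for (; i + 8 <= n; i += 8) {
    uint64_t dword = *reinterpret_cast<const uint64_t*>(src + i);
    for (int b = 0; b < 8; ++b)
      sum += (dword >> (b * 8)) & 0xFF;  // process each byte of the word
  }
  // Tail: remaining bytes one at a time.
  while (i < n)
    sum += src[i++];
  return sum;
}
```

The trade-off is extra branching for short inputs, which is likely why it would want benchmarking on its own.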
mscdex commented Aug 11, 2021 • edited
Alright, after more wrangling than I wanted, I managed to get some armv7 binaries compiled and ran the benchmarks on a Pi 2B before and after the changes in this PR. I don't have an armv8 board available to test on.
Mesteery commented Aug 11, 2021
I can run the benchmark on a Pi 4B. Should I compare with master or 16.6.1?
Mesteery commented Aug 11, 2021
Compared with 16.6.1:
fhinkel commented Aug 17, 2021
Can you link the issue where this is too slow in the real world? Not sure a 15% perf improvement on a synthetic benchmark warrants this change.
mscdex commented Aug 17, 2021 • edited
"Synthetic" benchmarks are all we have at the moment, after the benchmarking WG was dechartered/decommissioned. IMO base64 encoding/decoding is not something that I'd exactly call an uncommon task, especially within the realm of web applications. In general, any sensible changes that provide a measurable improvement should be welcomed, especially considering the number of performance hits we continue to take on over time as features get added and as V8 evolves. The improvements all add up. Anyway, as far as base64 encoding goes, I also have an alternative PR here that you may be more interested in. 🤷♂️ |
mscdex commented Jul 10, 2022
Closing in favor of #39775
Benchmark results on a Core i7-3770K:
I was tempted to run the benchmark on ARM out of curiosity, but at the moment I don't have a machine with a new enough build environment (or one where I can easily swap the OS), and trying to cross-compile Node.js with a Linaro aarch64 toolchain is impossible.