Skip to content

Conversation

@gagern
Copy link
Contributor

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)
  • string_decoder
Description of change

There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.

Most importantly, the text method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer lastChar while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear.

Secondly, that text method is now independent of encoding. The encoding-dependent method complete now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.

Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the nextChar buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely partial which indicates the number of bytes currently stored in nextChar.
A possible drawback of this approach is that there is chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us
to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes.

Fixes: #7308

This is an alternative to #7310; merging this makes that obsolete.

@nodejs-github-botnodejs-github-bot added the string_decoder Issues and PRs related to the string_decoder subsystem. label Jun 16, 2016
@gagerngagernforce-pushed the string-decoder-simplify branch from dc94308 to d11e41cCompareJune 16, 2016 14:16
There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage. Most importantly, the `text` method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer `lastChar` while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear. Secondly, that `text` method is now independent of encoding. The encoding-dependent method `complete` now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface. Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the `nextChar` buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely `partial` which indicates the number of bytes currently stored in `nextChar`. A possible drawback of this approach is that there is chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes. Fixes: nodejs#7308
@gagerngagernforce-pushed the string-decoder-simplify branch from d11e41c to 588864dCompareJune 16, 2016 14:23
@gagern
Copy link
ContributorAuthor

I have to concede that my changes appear to come at some performance cost, particularly for the base64 encodings. Comparing 588864d with its direct parent 1a1ff77 I see this benchmark comparison:

string_decoder/string-decoder-create.js encoding="ascii" n="25000000": ..................... 588864d/node: 8513500 1a1ff77/node: 7738100 .... 10.02% string_decoder/string-decoder-create.js encoding="utf8" n="25000000": ..................... 588864d/node: 3286700 1a1ff77/node: 3590100 .... -8.45% string_decoder/string-decoder-create.js encoding="utf-8" n="25000000": ..................... 588864d/node: 3047000 1a1ff77/node: 3234800 .... -5.81% string_decoder/string-decoder-create.js encoding="base64" n="25000000": ..................... 588864d/node: 2107300 1a1ff77/node: 2335200 .... -9.76% string_decoder/string-decoder-create.js encoding="ucs2" n="25000000": ..................... 588864d/node: 2884700 1a1ff77/node: 3018600 .... -4.44% string_decoder/string-decoder-create.js encoding="UTF-8" n="25000000": ..................... 588864d/node: 2066800 1a1ff77/node: 2026100 ..... 2.01% string_decoder/string-decoder-create.js encoding="AscII" n="25000000": ..................... 588864d/node: 3038600 1a1ff77/node: 4011300 ... -24.25% string_decoder/string-decoder-create.js encoding="UTF-16LE" n="25000000": ..................... 588864d/node: 1755500 1a1ff77/node: 1568800 .... 11.91% string_decoder/string-decoder.js encoding="ascii" inlen= "32" chunk= "16" n="2500000": 588864d/node: 1819400 1a1ff77/node: 2257700 ... -19.41% string_decoder/string-decoder.js encoding="ascii" inlen= "32" chunk= "64" n="2500000": 588864d/node: 4687000 1a1ff77/node: 4057000 .... 15.53% string_decoder/string-decoder.js encoding="ascii" inlen= "32" chunk= "256" n="2500000": 588864d/node: 3747300 1a1ff77/node: 4322900 ... -13.31% string_decoder/string-decoder.js encoding="ascii" inlen= "32" chunk="1024" n="2500000": 588864d/node: 4219900 1a1ff77/node: 3180500 .... 32.68% string_decoder/string-decoder.js encoding="ascii" inlen= "128" chunk= "16" n="2500000": 588864d/node: 461860 1a1ff77/node: 487710 .... -5.30% string_decoder/string-decoder.js encoding="ascii" inlen= "128" chunk= "64" n="2500000": 588864d/node: 1668000 1a1ff77/node: 1759200 .... -5.18% string_decoder/string-decoder.js encoding="ascii" inlen= "128" chunk= "256" n="2500000": 588864d/node: 3470300 1a1ff77/node: 3701400 .... -6.24% string_decoder/string-decoder.js encoding="ascii" inlen= "128" chunk="1024" n="2500000": 588864d/node: 3303200 1a1ff77/node: 3674700 ... -10.11% string_decoder/string-decoder.js encoding="ascii" inlen="1024" chunk= "16" n="2500000": 588864d/node: 74869 1a1ff77/node: 59589 .... 25.64% string_decoder/string-decoder.js encoding="ascii" inlen="1024" chunk= "64" n="2500000": 588864d/node: 235220 1a1ff77/node: 292320 ... -19.53% string_decoder/string-decoder.js encoding="ascii" inlen="1024" chunk= "256" n="2500000": 588864d/node: 820890 1a1ff77/node: 791060 ..... 3.77% string_decoder/string-decoder.js encoding="ascii" inlen="1024" chunk="1024" n="2500000": 588864d/node: 1974500 1a1ff77/node: 2066800 .... -4.47% string_decoder/string-decoder.js encoding="ascii" inlen="4096" chunk= "16" n="2500000": 588864d/node: 18091 1a1ff77/node: 18726 .... -3.39% string_decoder/string-decoder.js encoding="ascii" inlen="4096" chunk= "64" n="2500000": 588864d/node: 58023 1a1ff77/node: 57669 ..... 0.61% string_decoder/string-decoder.js encoding="ascii" inlen="4096" chunk= "256" n="2500000": 588864d/node: 226320 1a1ff77/node: 233570 .... -3.10% string_decoder/string-decoder.js encoding="ascii" inlen="4096" chunk="1024" n="2500000": 588864d/node: 496170 1a1ff77/node: 370910 .... 33.77% string_decoder/string-decoder.js encoding="utf8" inlen= "32" chunk= "16" n="2500000": 588864d/node: 1383600 1a1ff77/node: 1307200 ..... 5.85% string_decoder/string-decoder.js encoding="utf8" inlen= "32" chunk= "64" n="2500000": 588864d/node: 1612300 1a1ff77/node: 2009400 ... -19.76% string_decoder/string-decoder.js encoding="utf8" inlen= "32" chunk= "256" n="2500000": 588864d/node: 2146500 1a1ff77/node: 2132100 ..... 0.68% string_decoder/string-decoder.js encoding="utf8" inlen= "32" chunk="1024" n="2500000": 588864d/node: 2156900 1a1ff77/node: 2146100 ..... 0.50% string_decoder/string-decoder.js encoding="utf8" inlen= "128" chunk= "16" n="2500000": 588864d/node: 282740 1a1ff77/node: 421970 ... -33.00% string_decoder/string-decoder.js encoding="utf8" inlen= "128" chunk= "64" n="2500000": 588864d/node: 624260 1a1ff77/node: 812660 ... -23.18% string_decoder/string-decoder.js encoding="utf8" inlen= "128" chunk= "256" n="2500000": 588864d/node: 890720 1a1ff77/node: 927910 .... -4.01% string_decoder/string-decoder.js encoding="utf8" inlen= "128" chunk="1024" n="2500000": 588864d/node: 798830 1a1ff77/node: 1116100 ... -28.43% string_decoder/string-decoder.js encoding="utf8" inlen="1024" chunk= "16" n="2500000": 588864d/node: 41316 1a1ff77/node: 43883 .... -5.85% string_decoder/string-decoder.js encoding="utf8" inlen="1024" chunk= "64" n="2500000": 588864d/node: 104220 1a1ff77/node: 115900 ... -10.08% string_decoder/string-decoder.js encoding="utf8" inlen="1024" chunk= "256" n="2500000": 588864d/node: 121100 1a1ff77/node: 129600 .... -6.56% string_decoder/string-decoder.js encoding="utf8" inlen="1024" chunk="1024" n="2500000": 588864d/node: 130920 1a1ff77/node: 101800 .... 28.60% string_decoder/string-decoder.js encoding="utf8" inlen="4096" chunk= "16" n="2500000": 588864d/node: 9906.4 1a1ff77/node: 10700 .... -7.42% string_decoder/string-decoder.js encoding="utf8" inlen="4096" chunk= "64" n="2500000": 588864d/node: 24386 1a1ff77/node: 23261 ..... 4.84% string_decoder/string-decoder.js encoding="utf8" inlen="4096" chunk= "256" n="2500000": 588864d/node: 31953 1a1ff77/node: 32738 .... -2.40% string_decoder/string-decoder.js encoding="utf8" inlen="4096" chunk="1024" n="2500000": 588864d/node: 25762 1a1ff77/node: 32769 ... -21.38% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "32" chunk= "16" n="2500000": 588864d/node: 236220 1a1ff77/node: 388520 ... -39.20% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "32" chunk= "64" n="2500000": 588864d/node: 718230 1a1ff77/node: 1320800 ... -45.62% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "32" chunk= "256" n="2500000": 588864d/node: 715660 1a1ff77/node: 1671300 ... -57.18% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "32" chunk="1024" n="2500000": 588864d/node: 728550 1a1ff77/node: 1307200 ... -44.27% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "128" chunk= "16" n="2500000": 588864d/node: 67355 1a1ff77/node: 110650 ... -39.13% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "128" chunk= "64" n="2500000": 588864d/node: 192100 1a1ff77/node: 406130 ... -52.70% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "128" chunk= "256" n="2500000": 588864d/node: 635020 1a1ff77/node: 991090 ... -35.93% string_decoder/string-decoder.js encoding="base64-utf8" inlen= "128" chunk="1024" n="2500000": 588864d/node: 604890 1a1ff77/node: 1169200 ... -48.26% string_decoder/string-decoder.js encoding="base64-utf8" inlen="1024" chunk= "16" n="2500000": 588864d/node: 8373.9 1a1ff77/node: 14872 ... -43.69% string_decoder/string-decoder.js encoding="base64-utf8" inlen="1024" chunk= "64" n="2500000": 588864d/node: 28409 1a1ff77/node: 65263 ... -56.47% string_decoder/string-decoder.js encoding="base64-utf8" inlen="1024" chunk= "256" n="2500000": 588864d/node: 114250 1a1ff77/node: 133340 ... -14.32% string_decoder/string-decoder.js encoding="base64-utf8" inlen="1024" chunk="1024" n="2500000": 588864d/node: 215090 1a1ff77/node: 230280 .... -6.60% string_decoder/string-decoder.js encoding="base64-utf8" inlen="4096" chunk= "16" n="2500000": 588864d/node: 2635.2 1a1ff77/node: 3725 ... -29.26% string_decoder/string-decoder.js encoding="base64-utf8" inlen="4096" chunk= "64" n="2500000": 588864d/node: 9046.6 1a1ff77/node: 12260 ... -26.21% string_decoder/string-decoder.js encoding="base64-utf8" inlen="4096" chunk= "256" n="2500000": 588864d/node: 24803 1a1ff77/node: 35052 ... -29.24% string_decoder/string-decoder.js encoding="base64-utf8" inlen="4096" chunk="1024" n="2500000": 588864d/node: 50969 1a1ff77/node: 68712 ... -25.82% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "32" chunk= "16" n="2500000": 588864d/node: 286250 1a1ff77/node: 454890 ... -37.07% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "32" chunk= "64" n="2500000": 588864d/node: 742810 1a1ff77/node: 1451100 ... -48.81% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "32" chunk= "256" n="2500000": 588864d/node: 748130 1a1ff77/node: 1311000 ... -42.94% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "32" chunk="1024" n="2500000": 588864d/node: 986320 1a1ff77/node: 1376700 ... -28.35% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk= "16" n="2500000": 588864d/node: 78080 1a1ff77/node: 162880 ... -52.06% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk= "64" n="2500000": 588864d/node: 244550 1a1ff77/node: 455950 ... -46.36% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk= "256" n="2500000": 588864d/node: 833560 1a1ff77/node: 1261000 ... -33.90% string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk="1024" n="2500000": 588864d/node: 662110 1a1ff77/node: 1075600 ... -38.44% string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk= "16" n="2500000": 588864d/node: 10065 1a1ff77/node: 16471 ... -38.90% string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk= "64" n="2500000": 588864d/node: 34553 1a1ff77/node: 59507 ... -41.93% string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk= "256" n="2500000": 588864d/node: 126240 1a1ff77/node: 169490 ... -25.52% string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk="1024" n="2500000": 588864d/node: 227460 1a1ff77/node: 262340 ... -13.30% string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk= "16" n="2500000": 588864d/node: 2522.3 1a1ff77/node: 4446.2 . -43.27% string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk= "64" n="2500000": 588864d/node: 9335.4 1a1ff77/node: 18852 ... -50.48% string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk= "256" n="2500000": 588864d/node: 27761 1a1ff77/node: 44906 ... -38.18% string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk="1024" n="2500000": 588864d/node: 61087 1a1ff77/node: 71490 ... -14.55% string_decoder/string-decoder.js encoding="utf16le" inlen= "32" chunk= "16" n="2500000": 588864d/node: 2034700 1a1ff77/node: 1962700 ..... 3.66% string_decoder/string-decoder.js encoding="utf16le" inlen= "32" chunk= "64" n="2500000": 588864d/node: 3102000 1a1ff77/node: 3954200 ... -21.55% string_decoder/string-decoder.js encoding="utf16le" inlen= "32" chunk= "256" n="2500000": 588864d/node: 3990800 1a1ff77/node: 3504400 .... 13.88% string_decoder/string-decoder.js encoding="utf16le" inlen= "32" chunk="1024" n="2500000": 588864d/node: 3144100 1a1ff77/node: 4183900 ... -24.85% string_decoder/string-decoder.js encoding="utf16le" inlen= "128" chunk= "16" n="2500000": 588864d/node: 392810 1a1ff77/node: 496990 ... -20.96% string_decoder/string-decoder.js encoding="utf16le" inlen= "128" chunk= "64" n="2500000": 588864d/node: 1709900 1a1ff77/node: 1692800 ..... 1.01% string_decoder/string-decoder.js encoding="utf16le" inlen= "128" chunk= "256" n="2500000": 588864d/node: 2617300 1a1ff77/node: 2468200 ..... 6.04% string_decoder/string-decoder.js encoding="utf16le" inlen= "128" chunk="1024" n="2500000": 588864d/node: 2183500 1a1ff77/node: 2823700 ... -22.67% string_decoder/string-decoder.js encoding="utf16le" inlen="1024" chunk= "16" n="2500000": 588864d/node: 55287 1a1ff77/node: 63105 ... -12.39% string_decoder/string-decoder.js encoding="utf16le" inlen="1024" chunk= "64" n="2500000": 588864d/node: 226390 1a1ff77/node: 177080 .... 27.85% string_decoder/string-decoder.js encoding="utf16le" inlen="1024" chunk= "256" n="2500000": 588864d/node: 458280 1a1ff77/node: 452330 ..... 1.32% string_decoder/string-decoder.js encoding="utf16le" inlen="1024" chunk="1024" n="2500000": 588864d/node: 692320 1a1ff77/node: 671270 ..... 3.14% string_decoder/string-decoder.js encoding="utf16le" inlen="4096" chunk= "16" n="2500000": 588864d/node: 13839 1a1ff77/node: 15688 ... -11.79% string_decoder/string-decoder.js encoding="utf16le" inlen="4096" chunk= "64" n="2500000": 588864d/node: 55990 1a1ff77/node: 51500 ..... 8.72% string_decoder/string-decoder.js encoding="utf16le" inlen="4096" chunk= "256" n="2500000": 588864d/node: 118300 1a1ff77/node: 113680 ..... 4.06% string_decoder/string-decoder.js encoding="utf16le" inlen="4096" chunk="1024" n="2500000": 588864d/node: 176230 1a1ff77/node: 166440 ..... 5.88% 

I guess the degraded base64 performance might be due to extra method invocation from write to text to complete. I don't know how to avoid that without sacrificing the improved maintainability I was aiming for. I'm also not sure how Buffer.copy compares to copying these few bytes one at a time, whether changing that would make sense. I'm open to suggestions on how else performance might be improved.

@mscdex
Copy link
Contributor

The original issue is fixed in 79ef3b6 while maintaining performance (and actually improving performance in split multi-byte character cases). I've also merged the tests you're provided here in 5e8cbd7.

@mscdexmscdex closed this Jun 24, 2016
Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

string_decoderIssues and PRs related to the string_decoder subsystem.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

string_decoder: RangeError introduced by 6.2.1

3 participants

@gagern@mscdex@nodejs-github-bot