Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34.2k
readline: use icu based string width calculation#9040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
jasnell commented Oct 11, 2016
Technically semver-minor because it allows now a single codepoint to be passed in as the argument to |
jasnell commented Oct 12, 2016
bnoordhuis commented Oct 12, 2016
This change would make intl and non-intl builds behave differently, wouldn't it? I think I'd rather have a single semi-broken algorithm than two diverging implementations. As well, the non-intl path will be untested until nodejs/build#419 is resolved. |
jasnell commented Oct 12, 2016
Intl and non-intl builds already behave differently in many ways. This On Wednesday, October 12, 2016, Ben Noordhuis [email protected]
|
bnoordhuis commented Oct 12, 2016
Oh? In what way?
"Everyone is doing it" is never a valid argument. V8 is not under our direct control but this is. |
jasnell commented Oct 12, 2016
Off the top of my head:
There's likely more. It's also not just about a more accurate implementation, performance is also significantly improved with the new algorithm. this PR: vs. latest v6: (the benchmark test benchmark/misc/stringwidth.js is not part of this PR currently, it's local on my machine) |
lib/internal/readline.js Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't need /g?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it uses /g.
lib/internal/readline.js Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should initially assign no-ops with comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how come? that does not seem very practical.
lib/internal/readline.js Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not detectable from process.config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using process.config is unreliable because there are userland packages that override it. That's why process.binding('config') was introduced.
lib/internal/readline.js Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this remain isNaN() ... or rather, Number.isNaN()?
lib/readline.js Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment about these values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the existing comment on line 128 and 129 not enough?
c133999 to 83c7a88Comparejasnell commented Oct 21, 2016 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
New CI: https://ci.nodejs.org/job/node-test-pull-request/4611/ |
jasnell commented Oct 22, 2016
Only unrelated flaky failures in CI. @nodejs/collaborators would appreciate additional review. |
bnoordhuis left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I still don't think this should land until we have a proper non-intl buildbot. So far, almost every intl-related change broke the non-intl build in one way or another.
lib/internal/readline.js Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex could use a comment explaining what it tries to match/capture. Also, four space indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
four space indent would make the line longer than 80 chars, and splitting the regex into multiple lines would make it less readable. I'd prefer to leave the spacing as is.
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: s/ambiguousFull/ambiguous_full/ here and elsewhere. I don't like the name very much, it doesn't really convey what it does.
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why UCHAR_EAST_ASIAN_WIDTH? This needs an explaining comment.
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fall-through is intentional, can you add // fallthrough comments?
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put braces around the body, slightly easier to read.
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: snake_case
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reinterpret_cast shouldn't be necessary, should it?
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces?
jasnell commented Oct 23, 2016
Updated to address the nits. |
jasnell commented Oct 23, 2016
@bnoordhuis ... added a nointl CI job we can run manually: https://ci.nodejs.org/job/node-test-commit-linux-nointl/1/ |
jbergstroem commented Oct 23, 2016
@jasnell: I don't think it should be added to the current job as-is. There's way too many workers involved. Do we need to test this on every linux flavor? I was hoping to have a small subset of workers for build permutation tests. |
Updated. Needs another review from @bnoordhuis
bnoordhuis left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a comment and a style nit.
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you line up the argument?
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TwoByteValue's constructor takes a Local<Value>. It's arguably wrong to cast because there is no check that it's really a string so I'd just let the constructor deal with this.
srl295 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, with a couple of minor comments
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U+200D is a ZWJ (zero width joiner), please use that term
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasoable way of doing this calculation
src/node_i18n.cc Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth referencing something like http://www.unicode.org/reports/tr11/#Ambiguous here
jasnell commented Oct 24, 2016
Updated with the final nits addressed. New CI: https://ci.nodejs.org/job/node-test-pull-request/4655/ |
jasnell commented Oct 25, 2016
had a compile nit pop up on windows... trying again https://ci.nodejs.org/job/node-test-pull-request/4664/ |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: nodejs#8075
jasnell commented Oct 25, 2016
CI is green except for unrelated flaky failures. Landing |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <[email protected]> Reviewed-By: Steven R Loomis <[email protected]>
jasnell commented Oct 25, 2016
Landed in 72547fe |
jasnell commented Oct 25, 2016
Shrinking it down ought to be fine. I'll do that this next week On Sunday, October 23, 2016, Johan Bergström [email protected]
|
bnoordhuis commented Oct 25, 2016
@jasnell g++ spotted a bug: I think you need a U16_NEXT call outside the loop. |
jasnell commented Oct 25, 2016
Hmm.. Ok, away from the laptop at the moment but will look at that shortly |
srl295 commented Oct 25, 2016
@bnoordhuis good catch - otherwise A |
bnoordhuis commented Oct 25, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <[email protected]> Reviewed-By: Steven R Loomis <[email protected]>
MylesBorins commented May 15, 2017
/cc @srl295 currently passing on this being backported to v6.x. Do you think it should be considered? |
jasnell commented May 15, 2017
I'm not @srl295, of course, but I'll chime in to say that I see no pressing reason to backport this |
srl295 commented Aug 31, 2017
no pressing reason |
Checklist
make -j8 test(UNIX), orvcbuild test nosign(Windows) passesAffected core subsystem(s)
readline, internal
Description of change
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster.
This was originally part of a proposal to add a new unicode module, but has been split out.
Refs: #8075