Skip to content

Conversation

@jasnell
Copy link
Member

Checklist
  • make -j8 test (UNIX), or vcbuild test nosign (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

readline, internal

Description of change

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module, but has been split out.

Refs: #8075

@jasnelljasnell added readline Issues and PRs related to the built-in readline module. i18n-api Issues and PRs related to the i18n implementation. labels Oct 11, 2016
@nodejs-github-botnodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. readline Issues and PRs related to the built-in readline module. labels Oct 11, 2016
@jasnelljasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 11, 2016
@jasnell
Copy link
MemberAuthor

Technically semver-minor because it allows now a single codepoint to be passed in as the argument to getStringWidth().

@jasnell
Copy link
MemberAuthor

@bnoordhuis
Copy link
Member

This change would make intl and non-intl builds behave differently, wouldn't it? I think I'd rather have a single semi-broken algorithm than two diverging implementations. As well, the non-intl path will be untested until nodejs/build#419 is resolved.

@jasnell
Copy link
MemberAuthor

Intl and non-intl builds already behave differently in many ways. This
would be no different. Even v8 already acts differently if ICU isn't
present.

On Wednesday, October 12, 2016, Ben Noordhuis [email protected]
wrote:

This change would make intl and non-intl builds behave differently,
wouldn't it? I think I'd rather have a single semi-broken algorithm than
two diverging implementations. As well, the non-intl path will be untested
until nodejs/build#419nodejs/build#419 is
resolved.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#9040 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAa2eTOFILRRLZmfbAkdyGUaprCAOECaks5qzJ_XgaJpZM4KUK7U
.

@bnoordhuis
Copy link
Member

Intl and non-intl builds already behave differently in many ways.

Oh? In what way?

Even v8 already acts differently if ICU isn't present.

"Everyone is doing it" is never a valid argument. V8 is not under our direct control but this is.

@jasnell
Copy link
MemberAuthor

Oh? In what way?

Off the top of my head:

  1. Presence of Intl
  2. Results of String.prototype.normalize (try '\u1E9B\u0323'.normalize('NFD') === '\u1E9B\u0323'.normalize('NFC') in a default build vs. --without-intl build)
  3. Behavior of locale specific methods like toLocaleUpperCase(), localeCompare(), and toLocaleString()
  4. ICU's 10x faster punycode implementation used if ICU is present (recent change)

There's likely more.

It's also not just about a more accurate implementation, performance is also significantly improved with the new algorithm.

this PR:

$ ./node --expose-internals benchmark/misc/stringwidth.js misc/stringwidth.js test="a" millions=5: 6.288441921473474 misc/stringwidth.js test="丁" millions=5: 6.230841486482881 misc/stringwidth.js test="👸🏿" millions=5: 4.1727458183443265 misc/stringwidth.js test="👅" millions=5: 5.148377225269059 misc/stringwidth.js test="\n" millions=5: 6.285715980343315 misc/stringwidth.js test="‎f‏" millions=5: 5.2453819193365865 misc/stringwidth.js test="‎\n∊⃒" millions=5: 4.174633509425163 

vs. latest v6:

$ node --expose-internals benchmark/misc/stringwidth.js misc/stringwidth.js test="a" millions=5: 1.3992433734181065 misc/stringwidth.js test="丁" millions=5: 1.7113687595793867 misc/stringwidth.js test="👸🏿" millions=5: 1.5577253551652364 misc/stringwidth.js test="👅" millions=5: 1.6861030856604042 misc/stringwidth.js test="\n" millions=5: 1.7522551193962668 misc/stringwidth.js test="‎f‏" millions=5: 1.5648816109285903 misc/stringwidth.js test="‎\n∊⃒" millions=5: 1.459896127806548 

(the benchmark test benchmark/misc/stringwidth.js is not part of this PR currently, it's local on my machine)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need /g?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it uses /g.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should initially assign no-ops with comments?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come? that does not seem very practical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not detectable from process.config?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using process.config is unreliable because there are userland packages that override it. That's why process.binding('config') was introduced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this remain isNaN() ... or rather, Number.isNaN()?

lib/readline.js Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment about these values?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the existing comment on line 128 and 129 not enough?

@rvaggrvaggforce-pushed the master branch 2 times, most recently from c133999 to 83c7a88CompareOctober 18, 2016 17:02
@jasnell
Copy link
MemberAuthor

jasnell commented Oct 21, 2016

@jasnell
Copy link
MemberAuthor

Only unrelated flaky failures in CI. @nodejs/collaborators would appreciate additional review.
@trevnorris@addaleax ... PTAL, does this LGTY?

bnoordhuis
bnoordhuis previously requested changes Oct 23, 2016
Copy link
Member

@bnoordhuisbnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I still don't think this should land until we have a proper non-intl buildbot. So far, almost every intl-related change broke the non-intl build in one way or another.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex could use a comment explaining what it tries to match/capture. Also, four space indent.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

four space indent would make the line longer than 80 chars, and splitting the regex into multiple lines would make it less readable. I'd prefer to leave the spacing as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: s/ambiguousFull/ambiguous_full/ here and elsewhere. I don't like the name very much, it doesn't really convey what it does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why UCHAR_EAST_ASIAN_WIDTH? This needs an explaining comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the fall-through is intentional, can you add // fallthrough comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put braces around the body, slightly easier to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: snake_case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reinterpret_cast shouldn't be necessary, should it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braces?

@jasnell
Copy link
MemberAuthor

Updated to address the nits.

@jasnell
Copy link
MemberAuthor

@bnoordhuis ... added a nointl CI job we can run manually: https://ci.nodejs.org/job/node-test-commit-linux-nointl/1/
/cc @jbergstroem ... I'll tweak this a bit later to get it added to the main CI group.

@jbergstroem
Copy link
Member

@jasnell: I don't think it should be added to the current job as-is. There's way too many workers involved. Do we need to test this on every linux flavor? I was hoping to have a small subset of workers for build permutation tests.

@jasnelljasnell dismissed bnoordhuis’s stale reviewOctober 24, 2016 14:18

Updated. Needs another review from @bnoordhuis

Copy link
Member

@bnoordhuisbnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a comment and a style nit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you line up the argument?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TwoByteValue's constructor takes a Local<Value>. It's arguably wrong to cast because there is no check that it's really a string so I'd just let the constructor deal with this.

Copy link
Member

@srl295srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, with a couple of minor comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U+200D is a ZWJ (zero width joiner), please use that term

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasoable way of doing this calculation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth referencing something like http://www.unicode.org/reports/tr11/#Ambiguous here

@jasnell
Copy link
MemberAuthor

Updated with the final nits addressed. New CI: https://ci.nodejs.org/job/node-test-pull-request/4655/

@jasnell
Copy link
MemberAuthor

had a compile nit pop up on windows... trying again https://ci.nodejs.org/job/node-test-pull-request/4664/

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: nodejs#8075
@jasnell
Copy link
MemberAuthor

CI is green except for unrelated flaky failures. Landing

jasnell added a commit that referenced this pull request Oct 25, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <[email protected]> Reviewed-By: Steven R Loomis <[email protected]>
@jasnell
Copy link
MemberAuthor

Landed in 72547fe

@jasnelljasnell closed this Oct 25, 2016
@jasnell
Copy link
MemberAuthor

Shrinking it down ought to be fine. I'll do that this next week

On Sunday, October 23, 2016, Johan Bergström [email protected]
wrote:

@jasnellhttps://github.com/jasnell: I don't think it should be added
to the current job as-is. There's way too many workers involved. Do we need
to test this on every linux flavor? I was hoping to have a small subset of
workers for build permutation tests.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#9040 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAa2eSj8DXEb45zGppZSrbdxdlcyksZkks5q2650gaJpZM4KUK7U
.

@bnoordhuis
Copy link
Member

@jasnell g++ spotted a bug:

../src/node_i18n.cc: In function 'void node::i18n::GetStringWidth(const v8::FunctionCallbackInfo<v8::Value>&)': ../src/node_i18n.cc:543:15: warning: 'c' may be used uninitialized in this function [-Wmaybe-uninitialized] if (!expand_emoji_sequence && ~~~~~~~~~~~~~~~~~~~~~~~~~ n > 0 && p == 0x200d && // 0x200d == ZWJ (zero width joiner) 

I think you need a U16_NEXT call outside the loop.

@jasnell
Copy link
MemberAuthor

Hmm.. Ok, away from the laptop at the moment but will look at that shortly

@srl295
Copy link
Member

@bnoordhuis good catch - otherwise p=<undefined>

A U16_NEXT(str, n, value.length(), c); before the loop will initialize c.
May also obviate the n>0 check within the loop.

@bnoordhuis
Copy link
Member

#9280

evanlucas pushed a commit that referenced this pull request Nov 3, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <[email protected]> Reviewed-By: Steven R Loomis <[email protected]>
@evanlucasevanlucas mentioned this pull request Nov 3, 2016
@MylesBorins
Copy link
Contributor

/cc @srl295 currently passing on this being backported to v6.x. Do you think it should be considered?

@jasnell
Copy link
MemberAuthor

I'm not @srl295, of course, but I'll chime in to say that I see no pressing reason to backport this

@srl295
Copy link
Member

no pressing reason

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++Issues and PRs that require attention from people who are familiar with C++.i18n-apiIssues and PRs related to the i18n implementation.readlineIssues and PRs related to the built-in readline module.semver-minorPRs that contain new features and should be released in the next minor version.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants

@jasnell@bnoordhuis@jbergstroem@srl295@MylesBorins@Fishrock123@gibfahn@nodejs-github-bot