Skip to content

Conversation

@jasnell
Copy link
Member

@jasnell jasnell commented Jun 21, 2016

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)

net, intl

Description of change

ICU has a punycode implementation built in. Use it instead of the javascript implementation because it's much faster.

The punycode module is currently only used in one location in core (in the url module). With this change if ICU is present, the ICU implementation would be used with the punycode impl becoming the fallback. Eventually, it would be great to see if we could deprecate the punycode module entirely, with an eye towards eventually removing it.

This includes an update to the small-icu dataset to support the IDNA implementation. The additional data adds < 1 mb of data.

Benchmarks (bigger numbers are better):

net/punycode.js method="punycode" n="1024" val="افغانستا.icom.museum": 27593.19811 net/punycode.js method="punycode" n="1024" val="الجزائر.icom.museum": 30275.51786 net/punycode.js method="punycode" n="1024" val="österreich.icom.museum": 34711.86323 net/punycode.js method="punycode" n="1024" val="বাংলাদেশ.icom.museum": 26096.37590 net/punycode.js method="punycode" n="1024" val="беларусь.icom.museum": 26819.07514 net/punycode.js method="punycode" n="1024" val="belgië.icom.museum": 39787.58249 net/punycode.js method="punycode" n="1024" val="българия.icom.museum": 26728.08628 net/punycode.js method="punycode" n="1024" val="تشادر.icom.museum": 33340.09143 net/punycode.js method="punycode" n="1024" val="中国.icom.museum": 40001.78289 net/punycode.js method="punycode" n="1024" val="القمر.icom.museum": 34623.28045 net/punycode.js method="punycode" n="1024" val="κυπρος.icom.museum": 32423.57551 net/punycode.js method="punycode" n="1024" val="českárepublika.icom.museum": 30496.93111 net/punycode.js method="punycode" n="1024" val="مصر.icom.museum": 38439.50043 net/punycode.js method="punycode" n="1024" val="ελλάδα.icom.museum": 32494.77561 net/punycode.js method="punycode" n="1024" val="magyarország.icom.museum": 35283.22541 net/punycode.js method="punycode" n="1024" val="ísland.icom.museum": 38816.50008 net/punycode.js method="punycode" n="1024" val="भारत.icom.museum": 37098.76096 net/punycode.js method="punycode" n="1024" val="ايران.icom.museum": 31919.15278 net/punycode.js method="punycode" n="1024" val="éire.icom.museum": 36912.57326 net/punycode.js method="punycode" n="1024" val="איקו״ם.ישראל.museum": 15457.49763 net/punycode.js method="punycode" n="1024" val="日本.icom.museum": 40124.44062 net/punycode.js method="punycode" n="1024" val="الأردن.icom.museum": 30370.73983 net/punycode.js method="icu" n="1024" val="افغانستا.icom.museum": 265435.02052 net/punycode.js method="icu" n="1024" val="الجزائر.icom.museum": 327985.94529 net/punycode.js method="icu" n="1024" val="österreich.icom.museum": 300055.76232 net/punycode.js method="icu" n="1024" val="বাংলাদেশ.icom.museum": 235323.31677 net/punycode.js method="icu" n="1024" val="беларусь.icom.museum": 229452.87589 net/punycode.js method="icu" n="1024" val="belgië.icom.museum": 354082.67000 net/punycode.js method="icu" n="1024" val="българия.icom.museum": 256043.71947 net/punycode.js method="icu" n="1024" val="تشادر.icom.museum": 357039.62547 net/punycode.js method="icu" n="1024" val="中国.icom.museum": 385073.29554 net/punycode.js method="icu" n="1024" val="القمر.icom.museum": 351974.70182 net/punycode.js method="icu" n="1024" val="κυπρος.icom.museum": 297908.24193 net/punycode.js method="icu" n="1024" val="českárepublika.icom.museum": 268331.79121 net/punycode.js method="icu" n="1024" val="مصر.icom.museum": 338442.24685 net/punycode.js method="icu" n="1024" val="ελλάδα.icom.museum": 340001.77306 net/punycode.js method="icu" n="1024" val="magyarország.icom.museum": 301679.21012 net/punycode.js method="icu" n="1024" val="ísland.icom.museum": 370667.99875 net/punycode.js method="icu" n="1024" val="भारत.icom.museum": 205594.87089 net/punycode.js method="icu" n="1024" val="ايران.icom.museum": 302680.25733 net/punycode.js method="icu" n="1024" val="éire.icom.museum": 397259.52996 net/punycode.js method="icu" n="1024" val="איקו״ם.ישראל.museum": 202994.92850 net/punycode.js method="icu" n="1024" val="日本.icom.museum": 407493.58416 net/punycode.js method="icu" n="1024" val="الأردن.icom.museum": 326276.22915 

cc @srl295 @mscdex @ChALkeR @trevnorris

@jasnell jasnell added wip Issues and PRs that are still a work in progress. net Issues and PRs related to the net subsystem. i18n-api Issues and PRs related to the i18n implementation. performance Issues and PRs related to the performance of Node.js. labels Jun 21, 2016
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. url Issues and PRs related to the legacy built-in url module. labels Jun 21, 2016
@srl295
Copy link
Member

srl295 commented Jun 21, 2016

does the binary size change much? i don't think it should.

@jasnell
Copy link
Member Author

jasnell commented Jun 21, 2016

With the code, no. Not sure about the data files. I need to track down
exactly which additional data files are needed
On Jun 21, 2016 2:25 PM, "Steven R. Loomis" notifications@github.com
wrote:

does the binary size change much? i don't think it should.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#7355 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAa2eQuKo2eIYt7PM1-WSnnxx2cc3Skhks5qOFbMgaJpZM4I7LOq
.

@srl295
Copy link
Member

srl295 commented Jun 21, 2016

@jasnell
Copy link
Member Author

jasnell commented Jun 21, 2016

Yeah, likely both. I'll play around with adding those into the small-icu subset and see what happens ;-)

@sindresorhus
Copy link

src/node_i18n.cc Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll at least want to do a CHECK_NE(dest, nullptr)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, use MaybeStackBuffer.

@bnoordhuis
Copy link
Member

High-level comment: is punycode prevalent enough to warrant optimizing for? I'm inclined to say it isn't.

@srl295
Copy link
Member

srl295 commented Jun 22, 2016

@bnoordhuis Maybe this is what you meant, but the function is called every time url.parse() is called, isn't it?

> url.parse('http://mañana.com').host 'xn--maana-pta.com' 
@bnoordhuis
Copy link
Member

Let me rephrase: does punycode.toASCII() show up in the profile of a non-contrived benchmark? Maybe I don't call url.parse() enough, but I've never seen it.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

@bnoordhuis ... it's not used extensively by our code, no, but it is used pretty much every time url.parse() is called which is every time someone does something like http.get('http://example.org') (see https://github.com/nodejs/node/blob/master/lib/url.js#L312). This is the only use of the punycode module in core.

Ultimately what I'd really like is for us to deprecate the punycode module and eventually pull it back out of core entirely, relying solely on the faster ICU implementation. What that would mean is not supporting IDNA's in builds that do not include ICU but if those are not including ICU then internationalization is likely not a requirement and it's not likely to be any issue (and if it is, they can rely on the userland module to get it).

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

@bnoordhuis ... to give an idea, below is a comparison of url.parse benchmarks in v6.2.2 against master with this patch applied:

url/url-parse.js type="one" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 249020 ./node: 370670 ... -32.82% url/url-parse.js type="two" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 299860 ./node: 493920 ... -39.29% url/url-parse.js type="three" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 238360 ./node: 345490 . -31.01% url/url-parse.js type="four" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 727730 ./node: 730270 ... -0.35% url/url-parse.js type="five" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 716050 ./node: 609270 ... 17.52% url/url-parse.js type="six" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 271480 ./node: 421830 ... -35.64% 

The "five" case is an exception and I believe that something else unrelated to this change is going on there.. but the perf boost for the typical use cases is significant.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

src/node_i18n.cc Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this give a simple yes/no on failure, or is there additional information that may be useful to the error message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case the possible failures are rare and non-interesting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you can test for info.errors & UIDNA_ERROR_DOMAIN_NAME_TOO_LONG, maybe other values. The status itself is likely to be something like out of memory, invalid parameters, etc. You could ThrowError including u_errorName(status) to return the ICU status code - that would be good. I don't know if ThrowError takes a parameter.

@trevnorris
Copy link
Contributor

what a pain. so http.get() uses punycode automatically but not encodeURI()? so a request like http://見.香港/見 would properly convert the domain but not the path. At least in a way that node would like. So to be full proof a user would have to do:

// url.parse() automatically runs toASCII() on the // domain, but isn't documented to do so. const o = url.parse('http://見.香港/見'); o.pathname = encodeURI(o.pathname); const url = url.format(o);

Anyway, not directly anything to do with this PR.

In terms of performance, I think punycode.js could be made noticeably faster for node. Since it's not written for node to begin with (as shown by comments like // Avoid split(regex) for IE8 compatibility.). My main question is the need to use full-icu data for this to work, and so how much actual use it will see.

Strictly from the code perspective, the updates LGTM.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

This only needs full-icu currently. There are a couple of minor data files currently missing in the small-icu subset that would need to be added but those are minimal. Those can be added to small-icu without too much impact.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

And I'm working on a better URL impl to deal with the other issues ;-)

@mathiasbynens
Copy link
Contributor

@trevnorris I think punycode.js could be made noticeably faster for node.

How?

@bnoordhuis
Copy link
Member

@mathiasbynens By playing to V8's strengths while avoiding its weaknesses. :-)

For example, .toASCII() calls String#split() twice and is heavy on the map-over-substrings idiom. It would probably be faster to:

  1. Scan for the split character and only split when found; saves allocating an array in the common case.
  2. Iterate rather than map; V8 is not great at lowering/inlining callbacks.
  3. Try to avoid concatenating strings too much; cons strings eventually have to be flattened.

There is probably more but that is what stood out from a quick look.

lib/url.js Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a spacing rule for destructuring before we start using it in full? // @Trott

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places we have spaces around the name.

@jasnell
Copy link
Member Author

jasnell commented Jun 28, 2016

CI had a failure on smartos... green everywhere else. Small tweak and running again: https://ci.nodejs.org/job/node-test-pull-request/3108/

@jasnell
Copy link
Member Author

jasnell commented Jun 28, 2016

@trevnorris @bnoordhuis @srl295 ... CI is good. This now no-longer requires full-icu data. PTAL

src/node_i18n.cc Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be able to put these two on the same line. git grep "return.*Throw" shows that's how it's done in 260 other places.

@bnoordhuis
Copy link
Member

LGTM with style nits.

@jasnell
Copy link
Member Author

jasnell commented Jun 28, 2016

@bnoordhuis @trevnorris .. updated to address the nits

jasnell added 2 commits June 28, 2016 16:07
ICU has a punycode implementation built in. Use it instead of the javascript implementation because it's much faster.
@jasnell
Copy link
Member Author

jasnell commented Jun 29, 2016

@jasnell
Copy link
Member Author

jasnell commented Jun 30, 2016

CI is green.

jasnell added a commit that referenced this pull request Jun 30, 2016
PR-URL: #7355 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
jasnell added a commit that referenced this pull request Jun 30, 2016
ICU has a punycode implementation built in. Use it instead of the javascript implementation because it's much faster. PR-URL: #7355 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@jasnell
Copy link
Member Author

jasnell commented Jun 30, 2016

Landed in 7de59ef and 3d6a01e

@jasnell jasnell closed this Jun 30, 2016
@mhdawson mhdawson mentioned this pull request Jun 30, 2016
Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016
PR-URL: #7355 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016
ICU has a punycode implementation built in. Use it instead of the javascript implementation because it's much faster. PR-URL: #7355 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@Fishrock123 Fishrock123 mentioned this pull request Jul 5, 2016
@MylesBorins
Copy link
Contributor

@jasnell @srl295 I've set this to don't land as the icu doesn't come packed with v4.x

please feel free to update if I'm incorrect here

@srl295
Copy link
Member

srl295 commented Jul 11, 2016

@thealphanerd Correct and correct.

@srl295 srl295 mentioned this pull request Jul 27, 2016
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. net Issues and PRs related to the net subsystem. performance Issues and PRs related to the performance of Node.js. url Issues and PRs related to the legacy built-in url module.

10 participants