- Notifications
You must be signed in to change notification settings - Fork 817
StringLowering: Escape the JSON in the custom section #6316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/support/string.h Outdated
JSON | ||
}; | ||
| ||
std::ostream& printEscaped(std::ostream& os, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a separate printJSONEscaped
function would make more sense for this interface, even if the implementation still uses a flag.
src/support/string.cpp Outdated
if (mode == EscapeMode::Normal) { | ||
os << std::hex << '\\' << (c / 16) << (c % 16) << std::dec; | ||
} else if (mode == EscapeMode::JSON) { | ||
os << std::hex << "\\u00" << (c / 16) << (c % 16) << std::dec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSON requires exactly 4 hex digits representing a UTF-16 code unit (i.e. 2 bytes at once), so it's going to be slightly more complicated than this, unfortunately.
PR now contains code point logic from Emscripten, that is hopefully correct... on a large real-world testcase it seems to parse everything properly at least. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encoding looks right now, but only as long as the string is valid WTF-8. At no point do we validate or otherwise require that strings are valid WTF-8 (although perhaps we should be validating this?), so I think it would make sense to code more defensively here.
We could also make this code much more readable :)
It would be good to get more test coverage as well.
I ported the warning from the JS code to be more defensive here, and I added more comments in general. Did you have anything else in mind for defense/readability? I added test coverage for all escaped characters. |
Last commit adds a test for a weird utf8 char, and fixes our handling of the escape codes. |
} | ||
| ||
// This uses 2 bytes. | ||
i++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to check that we haven't run off the end of the string whenever we increment i
.
if (u0 < 0x10000) { | ||
uEscape(u0); | ||
} else { | ||
// There are two separate code points here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// There are two separate code points here. | |
// This value must be encoded with a surrogate pair of code points. |
if ((u0 & 0xF8) != 0xF0) { | ||
std::cerr << "warning: Bad UTF-8 leading byte " << int(u0) << '\n'; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to emit similar warnings in the 1-, 2-, and 3-byte cases as well
I plan to try to rewrite this to be safer and easier to understand, so LGTM as-is for the short term. |
Also add an end-to-end test using node to verify we can parse the escaped content properly using TextDecoder+JSON.parse.
Also add an end-to-end test using node to verify we can parse the escaped
content properly using TextDecoder+JSON.parse.