Skip to content

Conversation

kripken
Copy link
Member

@kripken kripken commented Feb 15, 2024

Also add an end-to-end test using node to verify we can parse the escaped
content properly using TextDecoder+JSON.parse.

@kripken kripken requested a review from tlively February 15, 2024 18:28
JSON
};

std::ostream& printEscaped(std::ostream& os,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a separate printJSONEscaped function would make more sense for this interface, even if the implementation still uses a flag.

if (mode == EscapeMode::Normal) {
os << std::hex << '\\' << (c / 16) << (c % 16) << std::dec;
} else if (mode == EscapeMode::JSON) {
os << std::hex << "\\u00" << (c / 16) << (c % 16) << std::dec;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON requires exactly 4 hex digits representing a UTF-16 code unit (i.e. 2 bytes at once), so it's going to be slightly more complicated than this, unfortunately.

@kripken
Copy link
Member Author

kripken commented Feb 16, 2024

PR now contains code point logic from Emscripten, that is hopefully correct... on a large real-world testcase it seems to parse everything properly at least.

Copy link
Member

@tlively tlively left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The encoding looks right now, but only as long as the string is valid WTF-8. At no point do we validate or otherwise require that strings are valid WTF-8 (although perhaps we should be validating this?), so I think it would make sense to code more defensively here.

We could also make this code much more readable :)

It would be good to get more test coverage as well.

@kripken
Copy link
Member Author

kripken commented Feb 20, 2024

I ported the warning from the JS code to be more defensive here, and I added more comments in general. Did you have anything else in mind for defense/readability?

I added test coverage for all escaped characters.

@kripken
Copy link
Member Author

kripken commented Feb 20, 2024

Last commit adds a test for a weird utf8 char, and fixes our handling of the escape codes.

}

// This uses 2 bytes.
i++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to check that we haven't run off the end of the string whenever we increment i.

if (u0 < 0x10000) {
uEscape(u0);
} else {
// There are two separate code points here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// There are two separate code points here.
// This value must be encoded with a surrogate pair of code points.
Comment on lines +208 to +210
if ((u0 & 0xF8) != 0xF0) {
std::cerr << "warning: Bad UTF-8 leading byte " << int(u0) << '\n';
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to emit similar warnings in the 1-, 2-, and 3-byte cases as well

@tlively
Copy link
Member

tlively commented Feb 20, 2024

I plan to try to rewrite this to be safer and easier to understand, so LGTM as-is for the short term.

@kripken kripken merged commit 07b91a8 into WebAssembly:main Feb 20, 2024
@kripken kripken deleted the json.escape branch February 20, 2024 21:23
radekdoulik pushed a commit to dotnet/binaryen that referenced this pull request Jul 12, 2024
Also add an end-to-end test using node to verify we can parse the escaped content properly using TextDecoder+JSON.parse.
@gkdn gkdn mentioned this pull request Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants