StringLowering: Escape the JSON in the custom section #6316

kripken · 2024-02-15T18:28:17Z

Also add an end-to-end test using node to verify we can parse the escaped
content properly using TextDecoder+JSON.parse.

tlively · 2024-02-16T00:15:35Z

src/support/string.h

+ JSON
+};
+
+std::ostream& printEscaped(std::ostream& os,


I think having a separate printJSONEscaped function would make more sense for this interface, even if the implementation still uses a flag.
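
A minimal sketch of the interface shape suggested here: a dedicated JSON entry point that forwards to a flag-driven implementation. The signatures and the body of `printEscaped` are assumptions made for illustration, not the code from this PR, and the simplified body only escapes ASCII quotes, backslashes, and control characters (bytes of 0x80 and above pass through unchanged).

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>

enum class EscapeMode { Normal, JSON };

// Hypothetical, simplified implementation that is driven by a mode flag.
std::ostream& printEscaped(std::ostream& os,
                           std::string_view str,
                           EscapeMode mode = EscapeMode::Normal) {
  static const char* digits = "0123456789abcdef";
  os << '"';
  for (unsigned char c : str) {
    if (c == '"' || c == '\\') {
      os << '\\' << char(c);
    } else if (c < 0x20 || c == 0x7f) {
      // Control characters: the two modes differ only in the escape prefix.
      os << (mode == EscapeMode::JSON ? "\\u00" : "\\");
      os << digits[c >> 4] << digits[c & 0xF];
    } else {
      os << char(c);
    }
  }
  return os << '"';
}

// Callers that want JSON never need to know about the flag.
std::ostream& printJSONEscaped(std::ostream& os, std::string_view str) {
  return printEscaped(os, str, EscapeMode::JSON);
}
```

With this split, call sites read as `printJSONEscaped(os, str)` while the shared loop stays in one place.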

tlively · 2024-02-16T00:18:00Z

src/support/string.cpp

+ if (mode == EscapeMode::Normal) {
+ os << std::hex << '\\' << (c / 16) << (c % 16) << std::dec;
+ } else if (mode == EscapeMode::JSON) {
+ os << std::hex << "\\u00" << (c / 16) << (c % 16) << std::dec;


JSON requires exactly 4 hex digits representing a UTF-16 code unit (i.e. 2 bytes at once), so it's going to be slightly more complicated than this, unfortunately.
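
To illustrate the point, here is a small sketch (the helper names are hypothetical, not from the PR) that decodes a 2-byte UTF-8 sequence to its code point and then writes it as one 4-digit `\u` escape. Escaping the bytes 0xC3 0xA9 one at a time would instead yield `\u00c3\u00a9`, which JSON.parse reads as two characters rather than the single character U+00E9.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Writes one UTF-16 code unit as a JSON escape with exactly 4 hex digits.
void writeUEscape(std::ostream& os, uint16_t unit) {
  static const char* digits = "0123456789abcdef";
  os << "\\u" << digits[(unit >> 12) & 0xF] << digits[(unit >> 8) & 0xF]
     << digits[(unit >> 4) & 0xF] << digits[unit & 0xF];
}

// Decodes a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code point.
// Assumes the caller has already checked that both bytes have this form.
uint16_t decodeTwoByte(unsigned char b0, unsigned char b1) {
  return uint16_t(((b0 & 0x1F) << 6) | (b1 & 0x3F));
}

// Example: the UTF-8 bytes C3 A9 are the single code point U+00E9.
std::string escapeEAcute() {
  std::ostringstream os;
  writeUEscape(os, decodeTwoByte(0xC3, 0xA9));
  return os.str();
}
```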

kripken · 2024-02-16T03:14:23Z

The PR now contains code point logic ported from Emscripten, which is hopefully correct. On a large real-world testcase it at least seems to parse everything properly.

tlively

The encoding looks right now, but only as long as the string is valid WTF-8. At no point do we validate or otherwise require that strings are valid WTF-8 (although perhaps we should be validating this?), so I think it would make sense to code more defensively here.

We could also make this code much more readable :)

It would be good to get more test coverage as well.

kripken · 2024-02-20T17:12:29Z

I ported the warning from the JS code to be more defensive here, and I added more comments in general. Did you have anything else in mind for defense/readability?

I added test coverage for all escaped characters.

kripken · 2024-02-20T18:28:51Z

The last commit adds a test for a weird UTF-8 char and fixes our handling of the escape codes.

tlively · 2024-02-20T19:50:15Z

src/support/string.cpp

 }
+
+ // This uses 2 bytes.
 i++;


It would be good to check that we haven't run off the end of the string whenever we increment i.
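
One way to do this, sketched under the assumption that the loop indexes a `std::string_view` with `i` (the helper name is hypothetical), is to route every increment through a function that checks both the bounds and the `10xxxxxx` continuation pattern before advancing.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Advances i to the next byte and returns that byte's low 6 payload bits.
// Returns std::nullopt, leaving i unchanged, if the string ends here or the
// next byte is not a UTF-8 continuation byte (10xxxxxx).
std::optional<uint32_t> takeContinuation(std::string_view str, size_t& i) {
  if (i + 1 >= str.size()) {
    return std::nullopt; // truncated sequence
  }
  unsigned char c = str[i + 1];
  if ((c & 0xC0) != 0x80) {
    return std::nullopt; // not a continuation byte
  }
  ++i;
  return uint32_t(c & 0x3F);
}
```

The caller can then emit a warning and stop, or substitute a replacement character, whenever the result is empty; which of those is appropriate here is a separate design decision.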

tlively · 2024-02-20T19:51:35Z

src/support/string.cpp

 if (u0 < 0x10000) {
 uEscape(u0);
 } else {
+ // There are two separate code points here.


Suggested change

- // There are two separate code points here.
+ // This value must be encoded with a surrogate pair of code points.
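
For reference, a sketch of the standard UTF-16 surrogate split for values of 0x10000 and above (the function name is hypothetical): subtract 0x10000, then place the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.

```cpp
#include <cstdint>
#include <utility>

// Splits a code point in [0x10000, 0x10FFFF] into its UTF-16 surrogate pair.
// The caller is assumed to have checked the range.
std::pair<uint16_t, uint16_t> toSurrogatePair(uint32_t codePoint) {
  uint32_t v = codePoint - 0x10000;               // 20 significant bits remain
  uint16_t high = uint16_t(0xD800 | (v >> 10));   // top 10 bits
  uint16_t low = uint16_t(0xDC00 | (v & 0x3FF));  // bottom 10 bits
  return {high, low};
}
```

For example, U+1F600 splits into 0xD83D and 0xDE00, so its JSON form is `\ud83d\ude00`.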

tlively · 2024-02-20T20:02:13Z

src/support/string.cpp

+ if ((u0 & 0xF8) != 0xF0) {
+ std::cerr << "warning: Bad UTF-8 leading byte " << int(u0) << '\n';
+ }


It would be good to emit similar warnings in the 1-, 2-, and 3-byte cases as well.
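
A sketch of what uniform checking could look like, assuming a single classifier (hypothetical name) that handles every leading-byte pattern and warns on anything that matches none of them, such as a stray continuation byte or 0xF8 and above:

```cpp
#include <iostream>

// Returns the sequence length implied by a UTF-8 leading byte, or 0 after
// printing a warning if the byte cannot start a sequence.
int utf8SequenceLength(unsigned char lead) {
  if ((lead & 0x80) == 0x00) {
    return 1; // 0xxxxxxx
  }
  if ((lead & 0xE0) == 0xC0) {
    return 2; // 110xxxxx
  }
  if ((lead & 0xF0) == 0xE0) {
    return 3; // 1110xxxx
  }
  if ((lead & 0xF8) == 0xF0) {
    return 4; // 11110xxx
  }
  std::cerr << "warning: Bad UTF-8 leading byte " << int(lead) << '\n';
  return 0;
}
```

This only validates the leading byte; overlong encodings and bad continuation bytes would need their own checks.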

tlively · 2024-02-20T20:41:34Z

I plan to try to rewrite this to be safer and easier to understand, so LGTM as-is for the short term.

kripken added 2 commits February 15, 2024 10:06

start

2851c52

fix

5e80819

kripken requested a review from tlively February 15, 2024 18:28

Use json escapes (dash-u)

fc3f9e4

tlively reviewed Feb 16, 2024

kripken added 7 commits February 15, 2024 16:27

change API as suggested

b727091

format

2ffe9ab

fun with code points

24b01e1

work

2ebdec6

test

9558204

fix

827a4df

format

b95c38c

tlively reviewed Feb 16, 2024

kripken added 2 commits February 20, 2024 09:01

Add warning and test coverage

332fb61

comments

5b4c654

fix 16/32 encoding and add a passing test for a weird utf8 char

f0158a6

tlively reviewed Feb 20, 2024

fix node.js nondeterminism across versions

1d7b783

kripken merged commit 07b91a8 into WebAssembly:main Feb 20, 2024

kripken deleted the json.escape branch February 20, 2024 21:23

gkdn mentioned this pull request Aug 31, 2024

stringconsts gkdn/binaryen#1

Closed
