buffer: add buffer.isUtf8 for utf8 validation #45947
Changes from all commits: be10b36, bcb19ec, 6c8ac38, 103e807, b590f06, e940f59
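For orientation, the change exposes an `isUtf8(input)` helper from the `buffer` module that reports whether a `Buffer`, `TypedArray`, or `ArrayBuffer` holds valid UTF-8. A minimal usage sketch, assuming a Node.js release that ships this API (the exact version boundary is not stated in this thread):

```js
const { isUtf8, Buffer } = require('buffer');

// Well-formed UTF-8, including multi-byte sequences, validates as true.
console.log(isUtf8(Buffer.from('hello, ğ'))); // true

// 0xC0 can only start an overlong sequence, so this is not valid UTF-8.
console.log(isUtf8(Buffer.from([0xC0, 0x80]))); // false
```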
New test file, `@@ -0,0 +1,86 @@`:

```js
'use strict';

require('../common');
const assert = require('assert');
const { isUtf8, Buffer } = require('buffer');
const { TextEncoder } = require('util');

const encoder = new TextEncoder();

assert.strictEqual(isUtf8(encoder.encode('hello')), true);
assert.strictEqual(isUtf8(encoder.encode('ğ')), true);
assert.strictEqual(isUtf8(Buffer.from([])), true);
```
Review thread on `assert.strictEqual(isUtf8(Buffer.from([])), true);`:

- **Member:** Why does a zero-length buffer return `true`? I would expect this to be `false`.
- **Member Author:** Because it does not include an invalid code point. Is there a similar Node function that behaves differently?
- **Member:** But the stated description of the API is, "This function is used to check if input contains UTF-8 code points." An empty buffer does not contain UTF-8 code points, so it really can't return `true`. Other methods we have that accept an `ArrayBuffer` or `TypedArray`, with the exception of Web Streams (which have specifically defined handling for detached buffers), treat a detached buffer as indistinguishable from zero-length input.
- **Member Author:** Hm, that's correct. What do you recommend?
- **Member:** I would just follow up with an additional PR that returns `false` for zero-length input, removing the detached check and error entirely.
- **Member:** In my opinion it should not be changed; it should return `true`. To avoid confusion, the documentation could be updated along the lines of "This function returns `true` if the input contains only valid UTF-8 code points."
- **Member:** The challenge with that logic is that the empty buffer would pass any encoding check. isASCII? Yes. isUTF16le? Yes. isUTF32be? Yes. Is Shift-JIS? Yes... which simply isn't useful. If you want the inverse check, `isInvalidUtf8()`, then implement that.
- **Member Author:** I created a pull request: #45973
- **Member:** The UTF-8 RFC (https://www.rfc-editor.org/rfc/rfc3629) specifies UTF-8 by an ABNF that explicitly includes the empty string. Note that, in general, a non-empty buffer alone does not uniquely determine the character encoding. A BOM may help, but UTF-8 is BOM-less. A string of bytes may be interpreted under different encodings, and in some cases that is by design: ASCII buffers, for example, are always valid UTF-8 and valid Latin-1.
- **Member:** Yes, I think it makes sense, and that is how it works in some other popular programming languages.
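For what it's worth, the WHATWG `TextDecoder` in fatal mode is consistent with that RFC reading: decoding an empty byte sequence does not throw. A small sketch, not part of the PR, showing both behaviours side by side:

```js
const { isUtf8 } = require('buffer'); // requires a Node.js release that ships isUtf8
const { TextDecoder } = require('util');

// A fatal TextDecoder throws on malformed UTF-8, but an empty input
// decodes to the empty string without error.
const decoder = new TextDecoder('utf-8', { fatal: true });
console.log(decoder.decode(new Uint8Array(0))); // '' (no throw)

// isUtf8 mirrors this: an empty buffer is reported as valid UTF-8.
console.log(isUtf8(new Uint8Array(0))); // true
```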
```js
// Taken from test/fixtures/wpt/encoding/textdecoder-fatal.any.js
[
  [0xFF], // 'invalid code'
  [0xC0], // 'ends early'
  [0xE0], // 'ends early 2'
  [0xC0, 0x00], // 'invalid trail'
  [0xC0, 0xC0], // 'invalid trail 2'
  [0xE0, 0x00], // 'invalid trail 3'
  [0xE0, 0xC0], // 'invalid trail 4'
  [0xE0, 0x80, 0x00], // 'invalid trail 5'
  [0xE0, 0x80, 0xC0], // 'invalid trail 6'
  [0xFC, 0x80, 0x80, 0x80, 0x80, 0x80], // '> 0x10FFFF'
  [0xFE, 0x80, 0x80, 0x80, 0x80, 0x80], // 'obsolete lead byte'

  // Overlong encodings
  [0xC0, 0x80], // 'overlong U+0000 - 2 bytes'
  [0xE0, 0x80, 0x80], // 'overlong U+0000 - 3 bytes'
  [0xF0, 0x80, 0x80, 0x80], // 'overlong U+0000 - 4 bytes'
  [0xF8, 0x80, 0x80, 0x80, 0x80], // 'overlong U+0000 - 5 bytes'
  [0xFC, 0x80, 0x80, 0x80, 0x80, 0x80], // 'overlong U+0000 - 6 bytes'

  [0xC1, 0xBF], // 'overlong U+007F - 2 bytes'
  [0xE0, 0x81, 0xBF], // 'overlong U+007F - 3 bytes'
  [0xF0, 0x80, 0x81, 0xBF], // 'overlong U+007F - 4 bytes'
  [0xF8, 0x80, 0x80, 0x81, 0xBF], // 'overlong U+007F - 5 bytes'
  [0xFC, 0x80, 0x80, 0x80, 0x81, 0xBF], // 'overlong U+007F - 6 bytes'

  [0xE0, 0x9F, 0xBF], // 'overlong U+07FF - 3 bytes'
  [0xF0, 0x80, 0x9F, 0xBF], // 'overlong U+07FF - 4 bytes'
  [0xF8, 0x80, 0x80, 0x9F, 0xBF], // 'overlong U+07FF - 5 bytes'
  [0xFC, 0x80, 0x80, 0x80, 0x9F, 0xBF], // 'overlong U+07FF - 6 bytes'

  [0xF0, 0x8F, 0xBF, 0xBF], // 'overlong U+FFFF - 4 bytes'
  [0xF8, 0x80, 0x8F, 0xBF, 0xBF], // 'overlong U+FFFF - 5 bytes'
  [0xFC, 0x80, 0x80, 0x8F, 0xBF, 0xBF], // 'overlong U+FFFF - 6 bytes'

  [0xF8, 0x84, 0x8F, 0xBF, 0xBF], // 'overlong U+10FFFF - 5 bytes'
  [0xFC, 0x80, 0x84, 0x8F, 0xBF, 0xBF], // 'overlong U+10FFFF - 6 bytes'

  // UTF-16 surrogates encoded as code points in UTF-8
  [0xED, 0xA0, 0x80], // 'lead surrogate'
  [0xED, 0xB0, 0x80], // 'trail surrogate'
  [0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80], // 'surrogate pair'
].forEach((input) => {
  assert.strictEqual(isUtf8(Buffer.from(input)), false);
});

[
  null,
  undefined,
  'hello',
  true,
  false,
].forEach((input) => {
  assert.throws(
    () => { isUtf8(input); },
    {
      code: 'ERR_INVALID_ARG_TYPE',
    },
  );
});

{
  // Test with detached array buffers
  const arrayBuffer = new ArrayBuffer(1024);
  structuredClone(arrayBuffer, { transfer: [arrayBuffer] });
  assert.throws(
    () => { isUtf8(arrayBuffer); },
    {
      code: 'ERR_INVALID_STATE'
    }
  );
}
```
Review thread on the detached-`ArrayBuffer` test:

- I know it's after the fact, but why does the buffer being detached matter here? It would otherwise be indistinguishable from a zero-length buffer, which we should just return `false` for anyway.
- Without an error here, detached buffers would create a false sense of UTF-8 validation: since there is no way of accessing the underlying data store and validating it as UTF-8, I believe this error is valid.
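Context for that exchange: once an `ArrayBuffer` is transferred (for example via `structuredClone` with a transfer list), it is detached and reports a `byteLength` of 0, so without an explicit detached check it looks exactly like an empty buffer. A standalone sketch, not taken from the PR, showing the state the test provokes:

```js
// Detach an ArrayBuffer by transferring it with structuredClone.
const arrayBuffer = new ArrayBuffer(1024);
console.log(arrayBuffer.byteLength); // 1024

structuredClone(arrayBuffer, { transfer: [arrayBuffer] });

// The original buffer is now detached: its contents are gone and its
// length collapses to 0, which is why it is indistinguishable from a
// genuinely empty buffer unless the detached state is checked explicitly.
console.log(arrayBuffer.byteLength); // 0
```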