Assign unique IDs to tokenization errors (#1339 part1) #2701

inikulin · 2017-05-21T21:05:42Z

PR for tests: html5lib/html5lib-tests#92
Reference implementation: https://github.com/HTMLParseErrorWG/parse5
Preview: https://htmlparseerrorwg.github.io/html-build/output/multipage/syntax.html

zcorpan · 2017-05-30T16:44:09Z

> When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be removed from the token.

This doesn't have an error code.
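
The requirement quoted in this comment can be sketched roughly as follows. This is only an illustration, not the parse5 implementation: the `TagToken` class, the `finish_attribute_name` helper, and the `errors` list are hypothetical names, and the ID `duplicate-attribute` is assumed from the current HTML Standard rather than stated in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class TagToken:
    """Hypothetical tag token; attributes are kept in source order."""
    name: str
    attributes: list = field(default_factory=list)  # list of [name, value]


def finish_attribute_name(token, errors):
    """Run when the tokenizer leaves the attribute name state.

    Compares the newest attribute's name with the earlier attributes on
    the same token. On an exact match, records a parse error and removes
    the new attribute. Returns True if the attribute was kept.
    """
    new_name = token.attributes[-1][0]
    earlier_names = {name for name, _ in token.attributes[:-1]}
    if new_name in earlier_names:
        errors.append("duplicate-attribute")  # assumed ID, not from this thread
        token.attributes.pop()
        return False
    return True


errors = []
token = TagToken("div", [["id", "a"], ["class", "x"]])
token.attributes.append(["id", ""])  # tokenizer just finished a second "id"
kept = finish_attribute_name(token, errors)
```

A real tokenizer would also have to keep consuming, and then discard, the value that follows the dropped attribute name; the sketch leaves that out.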

inikulin · 2017-05-30T16:48:18Z

@zcorpan 0_0 no idea how we've missed this one. Will fix in a moment.

zcorpan · 2017-05-30T17:27:26Z

OK I'm done fiddling here for now, I think it's looking pretty good! Just that error code missing.

inikulin · 2017-05-30T17:34:23Z

@zcorpan I've added the missing error code.

zcorpan · 2017-05-30T17:55:39Z

OK, let's wait for @domenic in case he'd like to check the recent changes. Then we'll need to squash and rebase.

inikulin · 2017-05-30T18:50:00Z

I've updated the preview: https://htmlparseerrorwg.github.io/html-build/output/multipage/syntax.html

domenic

Made some final tweaks. Two minor questions/suggestions, but ready to merge.

domenic · 2017-05-31T17:57:08Z

source

+
+ <p class="note">Parse errors are only errors with the <em>syntax</em> of HTML. In addition to
+ checking for parse errors, conformance checkers will also verify that the document obeys all the
+ other conformance requirements described in this specification.</p>


Is "some" accurate still, or is it "all" now?

Yes, it's accurate. Tree construction errors still don't have error codes.

domenic · 2017-05-31T19:17:25Z

source

+ <td><p>This error occurs if the parser encounters a numeric <span
+ data-x="syntax-charref">character reference</span> that references a U+0000 NULL. The parser
+ resolves such character references to a U+FFFD REPLACEMENT CHARACTER.</p>
+


Can we rename these to "noncharacter-..." instead of talking about "undefined character"?

I've had concerns about this rename (you suggested it earlier): we already have a non-unicode-character-in-input-stream error, so adding noncharacter-in-input-stream would be confusing IMHO. I don't like how it's named in Infra, to be honest: isolated surrogates and code points greater than 0x10FFFF are non-characters as well, but they are not in the same category as "noncharacters" in Infra. In fact, what Infra defines as "noncharacters" are "permanently undefined characters".

I've never understood the name "noncharacters" and have asked for clarification on it before: see whatwg/infra#114. I think it's best to be consistent across the ecosystem though...

I think it'd be good to rename non-unicode-character-in-... to surrogate-in-..., actually. That seems like a clear improvement.

I had the same idea at first, but the problem is that we have a non-unicode-character-reference error that occurs if the parser encounters a numeric character reference that resolves to a surrogate or to a code point greater than 0x10FFFF. Currently it's nice and consistent. As a possible solution, we can split it into two errors: surrogate-character-reference and non-unicode-range-character-reference, thus making the error codes consistent again.

The latter solution sounds great. I'm also OK with just leaving them inconsistent; I don't think they need to be consistent since they are about separate cases.

OK, let's rename it. I'll update the PR within an hour.
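
With the split discussed in this thread, the numeric character reference errors fall into separate cases. Here is a minimal sketch of that classification, using the final IDs from this PR's commit list (`surrogate-character-reference`, `character-reference-outside-unicode-range`, `noncharacter-character-reference`). The ID `null-character-reference` and the replacement behaviour are assumed from the current HTML Standard, the noncharacter ranges follow Infra's definition, and the function names are hypothetical. Handling of control characters is left out.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp):
    """Infra 'noncharacter': U+FDD0..U+FDEF, or a code point ending in FFFE/FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE)


def resolve_numeric_reference(cp):
    """Return (resolved code point, error ID or None) for a numeric reference."""
    if cp == 0x00:
        return REPLACEMENT, "null-character-reference"  # assumed ID
    if cp > 0x10FFFF:
        return REPLACEMENT, "character-reference-outside-unicode-range"
    if 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT, "surrogate-character-reference"
    if is_noncharacter(cp):
        # Still a parse error, but the code point itself is kept.
        return cp, "noncharacter-character-reference"
    return cp, None
```

Keeping the surrogate and out-of-range checks as separate branches mirrors the decision to give each case its own ID instead of one shared "non-unicode" code.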

inikulin · 2017-05-31T21:01:16Z

@domenic Fixed. I've updated the preview as well.

domenic · 2017-05-31T21:43:17Z

Awesome, thanks so much!

Suggestions on a suitably-epic commit message?

inikulin · 2017-05-31T21:47:26Z

@domenic "How I spent nights this May" sounds epic enough? =)

inikulin · 2017-05-31T21:49:45Z

Jokes apart, something like the title of the PR will be fine.

domenic · 2017-05-31T22:25:32Z

How's this?

Assign IDs to and explain all tokenization parse errors

This gives every parse error that occurs during tokenization a unique ID, and adds non-normative text explaining and exemplifying when they occur in an overview table. Part of #1339; tree construction parse errors remain before that issue is finished.

inikulin · 2017-05-31T22:26:48Z

@domenic lgtm

inikulin · 2017-05-31T22:37:19Z

🎉 🎉🎉

- Use new initial states in tests according to: html5lib/html5lib-tests#101
- Implement tokenization errors introduced in: whatwg/html#2701 html5lib/html5lib-tests#92

inikulin and others added 30 commits May 21, 2017 23:22

Add self-closing-non-void-html-element error.

3d0ebcc

Add end-tag-with-attributes error.

2325e53

Fix article before error id

c16b096

Add self-closing-end-tag error.

c9af034

Add tokenizer data state errors.

d03b564

More unexpected-null-character parse errors

22c23e1

Add Tag open state parse errors.

725b672

Add End tag open state parse errors.

1590782

Add Markup declaration open state parse errors.

0e5a7a8

Remove unnecessary colon

d5ec0d3

Add Script data escaped state errors

2c19401

Add Script data escaped dash state errors.

794c72b

Add Script data escaped dash dash state errors.

a847da0

Add Script data double escaped state errors.

cd0b5d8

Adding Tag Name parse errors:

5bd6a5f

- Added unexpected-eof-character
- Added unexpected-null-character

Renaming unexpected-eof to eof-in-tag-name

65b1908

Adding parser errors for before attribute name state

f972b97

Add Comment less-than sign bang dash dash state errors

82669e5

Add Comment start state errors.

ccff2f7

Add Comment start dash state errors.

0d51aae

Add Comment state errors.

0146cdb

Adding parser errors for attribute name state

965dd92

Add Comment end dash state errors.

a69bc10

Add Comment end state errors.

48f3330

Add Comment end bang state errors.

ad48fbe

Adding parser errors for after attribute name state

cca871c

Generalizing error naming

07eaf28

Revert " Adding parser errors for attribute name state"

514bf0a

Revert "Revert " Adding parser errors for attribute name state""

91ddba9

Generalize tag errors. Fix typo.

6e03cea

zcorpan added 3 commits May 30, 2017 18:52

Make sure data-x and the text content match

481bb9d

Fix wording in an example

a912d26

Tweak indentation of markup

2e4e654

Add error for duplicate attribute

8e68ae3

Fixup indentation

8e68e68

zcorpan approved these changes May 30, 2017

Minor tweaks

d6764a5

domenic approved these changes May 31, 2017

inikulin added 4 commits May 31, 2017 23:37

non-unicode-character-in-input-stream -> surrogate-in-input-stream

c26329e

undefined-character-in-input-stream -> noncharacter-in-input-stream

f4cb302

undefined-character-reference -> noncharacter-character-reference

0e5fd11

non-unicode-character-reference -> character-reference-outside-unicode-range + surrogate-character-reference

14a2a3d

Minor tweaks

4e2e8dc

domenic merged commit 32dbd7d into whatwg:master May 31, 2017

inikulin deleted the to-upstream branch May 31, 2017 22:37

TRowbotham mentioned this pull request Jul 27, 2018

Named validation errors whatwg/url#406

Closed
