Skip to content

Conversation

@inikulin
Copy link
Member

@inikulin inikulin commented May 21, 2017

inikulin and others added 30 commits May 21, 2017 23:22
- Added unexpected-eof-character - Added unexepceted-null-character
@zcorpan
Copy link
Member

zcorpan commented May 30, 2017

When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be removed from the token.

This doesn't have an error code.

@inikulin
Copy link
Member Author

@zcorpan 0_0 no idea how we've missed this one. Will fix in a moment.

@zcorpan
Copy link
Member

zcorpan commented May 30, 2017

OK I'm done fiddling here for now, I think it's looking pretty good! Just that error code missing.

@inikulin
Copy link
Member Author

@zcorpan I've added missing error code.

@zcorpan
Copy link
Member

zcorpan commented May 30, 2017

OK let's wait for @domenic if he likes to check the recent changes. Then we'll need to squash and rebase.

@inikulin
Copy link
Member Author

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some final tweaks. Two minor questions/suggestions, but ready to merge.


<p class="note">Parse errors are only errors with the <em>syntax</em> of HTML. In addition to
checking for parse errors, conformance checkers will also verify that the document obeys all the
other conformance requirements described in this specification.</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "some" accurate still, or is it "all" now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's accurate. Tree construction errors still doesn't have error codes.

<td><p>This error occurs if the parser encounters a numeric <span
data-x="syntax-charref">character reference</span> that references a U+0000 NULL. The parser
resolves such character references to a U+FFFD REPLACEMENT CHARACTER.</p>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename these to "noncharacter-..." instead of talking about "undefined character"?

Copy link
Member Author

@inikulin inikulin May 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had concerns about this rename (you've suggested it earlier): we already have non-unicode-character-in-input-stream error, adding noncharacter-in-input-stream will be confusing IMHO. I don't like how it named in infra, to be honest: isolated surrogates and code points more than 0x10ffff are non-characters as well but they are not in the same category as "noncharacters" in Infra. In fact, what Infra defines as "noncharacters" is "permanently undefined characters".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never understood the name "noncharacters" and have asked for clarification on it before: see whatwg/infra#114. I think it's best to be consistent across the ecosystem though...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to rename non-unicode-character-in-... to surrogate-in-..., actually. That seems like a clear improvement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had the same idea at first, but the problem is that we have a non-unicode-character-reference error that occurs if parser encounters a numeric character reference that resolve to surrogate or code point that is more than 0x10ffff. Currently it's nice and consistent. As possible solution we can split it into two errors: surrogate-character-reference and non-unicode-range-character-reference, thus making error codes consistent again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter solution sounds great. I'm also OK with just leaving them inconsistent; I don't think they need to be inconsistent since they are about separate cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's rename it. I'll update the PR within an hour.

@inikulin
Copy link
Member Author

inikulin commented May 31, 2017

@domenic Fixed. I've updated the preview as well.

@domenic
Copy link
Member

domenic commented May 31, 2017

Awesome, thanks so much!

Suggestions on a suitably-epic commit message?

@inikulin
Copy link
Member Author

@domenic "How I spent nights this May" sounds epic enough? =)

@inikulin
Copy link
Member Author

Jokes apart, something like the title of the PR will be fine.

@domenic
Copy link
Member

domenic commented May 31, 2017

How's this?

Assign IDs to and explain all tokenization parse errors This gives every parse error that occurs during tokenization a unique ID, and adds non-normative text explaining and exemplifying when they occur in an overview table. Part of #1339; tree construction parse errors remain before that issue is finished. 
@inikulin
Copy link
Member Author

@domenic lgtm

@domenic domenic merged commit 32dbd7d into whatwg:master May 31, 2017
@inikulin
Copy link
Member Author

🎉 🎉🎉

@inikulin inikulin deleted the to-upstream branch May 31, 2017 22:37
iabudiab added a commit to iabudiab/HTMLKit that referenced this pull request Sep 10, 2017
- Use new initial states in tests according to: html5lib/html5lib-tests#101 - Implement tokenization errors introduced in: whatwg/html#2701 html5lib/html5lib-tests#92
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

5 participants