Skip to content

Conversation

@hamishknight
Copy link
Contributor

Parse the escaped syntaxes for backreferences and subpatterns (the latter are so syntactically similar, it made sense to also parse them).

This doesn't yet handle the non-escaped syntax for either, in particular Python-style backreferences (?P=...). These are syntactically similar to groups (despite being atoms), so will require some more thought on how to parse.

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, though I suspect the parser will be the one to make the call whether \n is a backreference or an octal reference. Otherwise the AST needs API to help clients decide.

@milseman
Copy link
Member

@swift-ci please test linux platform

@hamishknight
Copy link
Contributor Author

Updated to disambiguate octal vs backreferences in the parser, and also extended the \nnn syntax to work more generally in custom character classes, how does this look @milseman?

@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

num <= numberOfPriorGroups {
src.advance(digits.count)
return .backreference(.absolute(num))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just return the octal literal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do, though we'd need to call into expectUnicodeScalar as we can only read up to 3 digits of octal, and it wouldn't cover the \0nn case unless we also add a condition for that here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is that logic when we fall through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In expectUnicodeScalar, called from expectEscaped

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems more straight-forward to make the explicit call, especially if we have context-local constraints on the number of digits or other aspects of interpretation. Otherwise, would the general case have to reconstruct this logic?

Fall-through also means that any intermediary code may have to be concerned with or reason about this possibility. In general I'm a fan of local reasoning. But this can be done after this PR, and it's not a huge deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, would the general case have to reconstruct this logic?

I might not be following what you mean by this, but I don't think so. Both the backreference and unicode scalar cases have differences in what they accept, but there is a syntactic ambiguity between them. In general, we just need to make sure to try lex as a backreference before a unicode scalar. IMO if we change this logic to produce a scalar, we should also change the scalar logic to be able to produce a backreference so we no longer need to reason about which is called first. I'm happy to do that in a follow-up, but going to merge this for now to cleanup the option parsing PR

Missed this when implementing the rest of the group kinds. Because we don't backtrack, we should throw an error here after consuming a `*` in a group.
Throw an error if we reach the end of input before we encounter the closing delimiter we expect. Also add an overload of `lexUntil(eating:)` that takes a character.
Parse the escaped syntaxes for backreferences and subpatterns (the latter are so syntactically similar, it made sense to also parse them). This doesn't yet handle the non-escaped syntax for either, in particular Python-style backreferences `(?P=...)`. These are syntactically similar to groups (despite being atoms), so will require some more thought on how to parse.
Implement octal disambiguation for the `\nnn` syntax where a backreference is only formed if there have been that many prior groups, or it begins with 8 or 9, or is less than 10. In addition, generalize the \0nn syntax to support arbitrary \nnn octal sequences inside and outside character classes.
@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

@hamishknight hamishknight merged commit 0a7d4bb into swiftlang:main Dec 20, 2021
@hamishknight hamishknight deleted the backrefs branch December 20, 2021 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants