Regex literal pitch #12

hamishknight · 2021-10-12T15:04:02Z

No description provided.

Documentation/Evolution/RegexLiteralPitch.md

milseman

This is looking good so far!

Documentation/Evolution/RegexLiteralPitch.md

kylemacomber

I think we're saying "out of scope" too much in the current draft. I'd prefer we not say it at all, since the whole point of the pitch it to encourage feedback.

Instead, maybe we should preface the forum post saying this pitch is a discussion starter and we don't think it's complete enough to graduate to a full review as is. And then maybe we emphasize what we'd like specific feedback on (instead of emphasizing what we don't want feedback on).

Wdyt?

Documentation/Evolution/RegexLiteralPitch.md

hamishknight · 2021-10-13T13:53:50Z

I think we're saying "out of scope" too much in the current draft. I'd prefer we not say it at all, since the whole point of the pitch it to encourage feedback.

Instead, maybe we should preface the forum post saying this pitch is a discussion starter and we don't think it's complete enough to graduate to a full review as is. And then maybe we emphasize what we'd like specific feedback on (instead of emphasizing what we don't want feedback on).

Wdyt?

Makes sense to me, though IMO it's probably still worthwhile to call out the details of the Regex type as well as the specifics of what the stdlib will initially support as out of scope. What do you think @milseman?

Documentation/Evolution/RegexLiteralPitch.md

milseman · 2021-10-13T18:14:27Z

I reduced the occurrences of "out of scope" and chose better terminology for the other uses. I think it is good to clarify what should be discussed here vs elsewhere (defaulting to the overview thread).

milseman · 2021-10-13T18:18:04Z

For choice of PCRE, it's the richest, most well-known, and IIUC basically a superset of perl/python/javascript/ICU/... regular expression syntax.

kylemacomber

I think this is in really good shape!

Documentation/Evolution/RegexLiteralPitch.md

beccadax · 2021-10-14T01:21:40Z

Documentation/Evolution/RegexLiteralPitch.md

+```swift
+// let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/
+let regex = {
+ var builder = T.RegexLiteral()
+
+ // __A4 = /([[:alpha:]]\w*)/
+ let __A1 = builder.posixCharacterClassAlpha()
+ let __A2 = builder.characterClassWord()
+ let __A3 = builder.concatenate(__A1, __A2)
+ let __A4 = builder.captureGroup(__A3)
+
+ // __B1 = / = /
+ let __B1 = builder.literal(" = ")
+
+ // __C3 = /([0-9A-F]+)/
+ let __C1 = builder.customCharacterClass(["0"..."9", "A"..."F"])
+ let __C2 = builder.oneOrMore(__C1)
+ let __C3 = builder.captureGroup(__C2)
+
+ let __D1 = builder.concatenate(__A4, __B1, __C3)
+ return T(regexLiteral: builder.finalize(__D1))
+}()
+```


This might be premature, but I have Thoughts.

We are both calling methods on a mutable builder instance, and returning values that are passed to later methods. So is the model here based on the methods mutating the builder to accumulate match operations into it, or is it based on the methods constructing an immutable value-type AST from the leaf nodes up, or is it somehow both? (Maybe builder is a context that the AST nodes can be made in, but the relationship between those nodes is expressed by the return values? In that case, the initializer should probably be given some estimate information, much like StringInterpolationProtocol.)

If these methods are making AST nodes, should the method names indicate this? Perhaps by following the general makeFoo convention, or the result builder buildFoo convention? On the other hand, if they are having side effects on the builder, should they have verb-y names like appendFoo as we see in StringInterpolationProtocol?

Do we want to name metacharacters' methods based on specific semantics they're expected to follow? It might be better to have more mechanical translation rules, like:

\<identifier character> turns into makeEscaped_<identifier character>() (except for characters that Swift string literals already assign a meaning to).

So for instance, the __A2 line would be let __A2 = builder.makeEscaped_w(). That would free us from actually baking in a semantic for each character at this level, and would give us a consistent answer for characters that are used for different things in different dialects or aren't used in any of the dialects we know of. I also suspect it would be easier for implementers to remember than coming up with a name for each metacharacter. (Is it "Word" or is it "IdentifierCharacter"?)

In StringInterpolationProtocol, I found that I didn't need a finalize because init(stringInterpolation:) could express any finalization logic I might need. The same might be true here.

Yeah, this syntax was mainly intended to be a straw-person to illustrate what the compiler transform would roughly look like, I believe the exact stdlib APIs are still TBD. Personally I think it makes sense for the builder methods to return AST nodes, so I'm not sure the builder would need any mutable state.

@milseman any thoughts on the naming conventions especially for the metachar methods?

Regarding finalize I believe we'd still need some way to get from the resulting root AST node back to the builder type to pass to init(regexLiteral:).

I do think these details @beccadax calls out are worth focusing on in this pitch. IIUC they're the protocol informal requirements and as such are more part of the literal than the concrete conforming Regex type we'll eventually provide.

Regex do have an actual AST as they are recursive, unlike interpolation segments. So it makes sense for there to be some divergence.

I didn’t go further than this with mutating or not, other than I could easily imagine a conformer wanting to update a trivial bit of state in the process of constructing an AST node.

I go back and forth on the metacharacter naming problem. One the one hand, \s really does mean white space and it would be weird for a conformer to pick a completely different meaning. How abstract is the AST? Beyond that very minor concern I have no strong opinions, other than that we be consistent. I think a middle ground that helpfully recognizes that it’s character class s could be worthwhile, so we have manufactureCharacterClass_s and manufactureCharacterClass_POSIX_alpha, picking your favorite construction verb.

beccadax · 2021-10-14T01:35:17Z

Documentation/Evolution/RegexLiteralPitch.md

+Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.
+
+Multi-line comments use the `/*` delimiter. As such, a regex literal starting with `*` wouldn't be parsed. This however isn't a major issue as an unqualified `*` is already invalid regex syntax. An escaped `/\*/` regex literal wouldn't be impacted.
+


A third comment issue that might be worth discussing: If we wanted to support multi-line regex literals, the most natural way to do that given current Swift syntax would be:

let regex = /// pattern ///

However, /// is a comment syntax, and specifically it's the documentation comment syntax. Do we want to rule out this feature, or do we think there's a heuristic we can use to rescue it?

(Someone on the overall pitch thread pointed this out—can't remember who, but you may want to give credit.)

How about just requiring that a regex that is split across lines in the source file start with an option expression (e.g. /(?m), /(?x)), then it's clearly a regex literal and we can continue parsing it until we hit the trailing /? Most regexes that contain newlines are going to have to have the multiline option or the extended option enabled anyway.

My personal feeling is we probably shouldn't support multi-line regex literals, as I think users would likely be better served by the more general Pattern result builder API at that point. I can't immediately think of a clever way of getting the parser to work with /// delimiters for that, at least not without restricting where you can currently place documentation comments.

I definitely agree it's worth calling out for discussion though, I'll add a paragraph for it, including @al45tair's suggestion of inferring it from the options provided.

Added a paragraph for this. Regarding the post on the pitch thread that pointed this out, was this the one you had in mind?

beccadax · 2021-10-14T01:38:45Z

Documentation/Evolution/RegexLiteralPitch.md

+}
+```
+
+`{ print("hello") }` will be parsed as a trailing closure to `SomeType()` rather than as a separate element to the result builder.


There's an even better analogy here to leading-dot syntax, which therefore isn't supported at the top level of a statement in a result builder.

That's a good point! Leading dot syntax is indeed also ambiguous in a result builder, though we do seem to support it if a buildExpression is defined, e.g:

@resultBuilder struct Build { static func buildExpression(_ x: Int) -> Int { x } static func buildBlock(_ x: Int...) {} } func takesBuilder(@Build _ fn: () -> Void) {} func foo() -> Int { 0 } extension Int { static var baz: Int { 0 } var baz: Int { 0 } } takesBuilder { 0 .baz }

The builder closure above will be parsed as 0.baz, but if you insert a ; after the 0, it will become a two-element builder with the second being an inferred Int.baz.

beccadax · 2021-10-14T01:51:28Z

Documentation/Evolution/RegexLiteralPitch.md

+
+### Opting into certain regex features
+
+We intend for the compiler to completely parse [the PCRE syntax][PCRE]. However, types conforming to `RegexLiteralProtocol` might not be able to handle the full feature set. The compiler looks for corresponding function declarations inside `RegexLiteralProtocol` and will emit a compilation error if missing. Conforming types can use `@available` on these function declarations to communicate versioning and add more support in the future.


Are there situations where we might want to fail in a "softer" way than this? For instance, a regex engine that didn't support \Q would typically treat it as equivalent to plain Q. So if the method for \Q was missing, we might want to emit a warning recommending you remove the backslash and then treat it as though you hadn't written the backslash, rather than giving an actual error.

Yeah, we can definitely also talk about warnings. I expect some things like range-based character classes to be a source of warnings/errors in grapheme-semantic mode (outside this scope), but the general idea of being able to issue errors and/or warnings is worth some discussion here.

I don't know if a missing decl is an error or warning, or if it makes sense to be configurable. Your thoughts? Would it differ for metacharacters than, say, named capture groups?

al45tair · 2021-10-14T08:35:28Z

Documentation/Evolution/RegexLiteralPitch.md

+
+## Proposed Solution
+
+We propose the introduction of a regular expression literal that supports [the PCRE syntax][PCRE], in addition to new standard library protocols `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` that allow for the customization of how the regex literal is interpreted (similar to [string interpolation][stringinterpolation]). The compiler will parse the PCRE syntax within a regex literal, and synthesize calls to corresponding builder methods. Types conforming to `ExpressibleByRegexLiteral` will be able to provide a builder type that opts into supporting various regex constructs through the use of normal function declarations and `@available`.


Could we also support UTS#18? http://www.unicode.org/reports/tr18/

AFAIK PCRE doesn't support all of that, but it'd be good if we could at least parse it.

I’m fine supporting a syntactic superset. Every time I’ve read TR-18 it’s been through the lens of where we are semantically (level 1.5), and we clearly want to support libraries at higher or lower levels of support. But if there’s some metacharacters or additional syntax that we can parse that sounds fine. If there are incompatibilities, we’re likely to side with PCRE

Let's continue talk about TR18 syntax on the thread

Documentation/Evolution/RegexLiteralPitch.md

hamishknight · 2021-10-14T13:43:52Z

Updated to fix the default builder type to be DefaultRegexLiteral, and added a section on why we don't want to re-use string literal syntax.

milseman · 2021-10-14T14:00:33Z

Documentation/Evolution/RegexLiteralPitch.md

+- Support a syntax familiar to developers who have learned to use regular expressions in other tools and languages
+- Allow reuse of many regular expressions not specifically designed for Swift (e.g. from Stack Overflow or popular programming books)
+- Allow libraries to define custom types that can be constructed with regex literals, much like string literals
+- Diagnose at compile time if a regex literal uses capabilities that aren't allowed by the type's regex dialect


Nice! Thanks @beccadax

milseman · 2021-10-14T14:02:18Z

Documentation/Evolution/RegexLiteralPitch.md


 Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

+While not conflicting with the syntax proposed in this pitch, it's also worth noting that the `//` comment syntax (in particular documentation comments that use `///`) would likely preclude the ability to use `///` as a delimiter if we ever wanted to support multi-line regex literals. It's possible though that future multi-line support could be inferred from the regex options provided. For example, a regex that uses the multi-line option `/(?m)/` could be allowed to span multiple lines.


Does that mean that the parser would understand (?m) and switch behavior or heuristics?

Also, raw regex literals give us a future route here too.

Yeah we'd have to recognize the m option while lexing and switch to allowing multiple lines. Good point regarding raw literals, I'll add a sentence for that

hamishknight requested a review from milseman October 12, 2021 15:04

hamishknight commented Oct 12, 2021

View reviewed changes

Documentation/Evolution/RegexLiteralPitch.md Outdated Show resolved Hide resolved

milseman reviewed Oct 12, 2021

View reviewed changes

kylemacomber reviewed Oct 13, 2021

View reviewed changes

Add regex literal pitch

74fb687

hamishknight force-pushed the regex-literal-pitch branch from 9e0e83c to 74fb687 Compare October 13, 2021 12:20

hamishknight commented Oct 13, 2021

View reviewed changes

Documentation/Evolution/RegexLiteralPitch.md Outdated Show resolved Hide resolved

Update RegexLiteralPitch.md

38018ab

milseman marked this pull request as ready for review October 13, 2021 18:17

milseman changed the title ~~[WIP] Add regex literal pitch~~ Regex literal pitch Oct 13, 2021

milseman added 3 commits October 13, 2021 13:21

Update RegexLiteralPitch.md

139c5c0

Update RegexLiteralPitch.md

adefd9f

Update RegexLiteralPitch.md

69be4b6

kylemacomber reviewed Oct 13, 2021

View reviewed changes

Documentation/Evolution/RegexLiteralPitch.md Outdated Show resolved Hide resolved

Documentation/Evolution/RegexLiteralPitch.md Outdated Show resolved Hide resolved

Update RegexLiteralPitch.md

3b840c7

beccadax reviewed Oct 14, 2021

View reviewed changes

al45tair reviewed Oct 14, 2021

View reviewed changes

hamishknight commented Oct 14, 2021

View reviewed changes

Documentation/Evolution/RegexLiteralPitch.md Outdated Show resolved Hide resolved

hamishknight added 2 commits October 14, 2021 13:47

Update RegexLiteralPitch.md

83858bd

Update RegexLiteralPitch.md

556d957

milseman reviewed Oct 14, 2021

View reviewed changes

hamishknight and others added 3 commits October 14, 2021 15:11

Update RegexLiteralPitch.md

0dfde7d

Update RegexLiteralPitch.md

4aa41d7

Update RegexLiteralPitch.md

cb80953

hamishknight merged commit 2ab8866 into swiftlang:main Oct 14, 2021

hamishknight deleted the regex-literal-pitch branch October 14, 2021 15:45

		Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

		Multi-line comments use the `/` delimiter. As such, a regex literal starting with `` wouldn't be parsed. This however isn't a major issue as an unqualified `` is already invalid regex syntax. An escaped `/\/` regex literal wouldn't be impacted.


		### Opting into certain regex features

		We intend for the compiler to completely parse [the PCRE syntax][PCRE]. However, types conforming to `RegexLiteralProtocol` might not be able to handle the full feature set. The compiler looks for corresponding function declarations inside `RegexLiteralProtocol` and will emit a compilation error if missing. Conforming types can use `@available` on these function declarations to communicate versioning and add more support in the future.


		## Proposed Solution

		We propose the introduction of a regular expression literal that supports [the PCRE syntax][PCRE], in addition to new standard library protocols `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` that allow for the customization of how the regex literal is interpreted (similar to [string interpolation][stringinterpolation]). The compiler will parse the PCRE syntax within a regex literal, and synthesize calls to corresponding builder methods. Types conforming to `ExpressibleByRegexLiteral` will be able to provide a builder type that opts into supporting various regex constructs through the use of normal function declarations and `@available`.


		Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

		While not conflicting with the syntax proposed in this pitch, it's also worth noting that the `//` comment syntax (in particular documentation comments that use `///`) would likely preclude the ability to use `///` as a delimiter if we ever wanted to support multi-line regex literals. It's possible though that future multi-line support could be inferred from the regex options provided. For example, a regex that uses the multi-line option `/(?m)/` could be allowed to span multiple lines.

Regex literal pitch #12

Regex literal pitch #12

Uh oh!

Conversation

hamishknight commented Oct 12, 2021

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylemacomber left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hamishknight commented Oct 13, 2021

Uh oh!

milseman commented Oct 13, 2021

milseman commented Oct 13, 2021

kylemacomber left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

beccadax Oct 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Uh oh!

hamishknight commented Oct 14, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Labels

5 participants

beccadax Oct 14, 2021 •

edited

Loading