Skip to content

Conversation

@hamishknight
Copy link
Contributor

No description provided.

@hamishknight hamishknight requested a review from milseman October 12, 2021 15:04
Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good so far!

Copy link

@kylemacomber kylemacomber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're saying "out of scope" too much in the current draft. I'd prefer we not say it at all, since the whole point of the pitch it to encourage feedback.

Instead, maybe we should preface the forum post saying this pitch is a discussion starter and we don't think it's complete enough to graduate to a full review as is. And then maybe we emphasize what we'd like specific feedback on (instead of emphasizing what we don't want feedback on).

Wdyt?

@hamishknight
Copy link
Contributor Author

I think we're saying "out of scope" too much in the current draft. I'd prefer we not say it at all, since the whole point of the pitch it to encourage feedback.

Instead, maybe we should preface the forum post saying this pitch is a discussion starter and we don't think it's complete enough to graduate to a full review as is. And then maybe we emphasize what we'd like specific feedback on (instead of emphasizing what we don't want feedback on).

Wdyt?

Makes sense to me, though IMO it's probably still worthwhile to call out the details of the Regex type as well as the specifics of what the stdlib will initially support as out of scope. What do you think @milseman?

@milseman
Copy link
Member

I reduced the occurrences of "out of scope" and chose better terminology for the other uses. I think it is good to clarify what should be discussed here vs elsewhere (defaulting to the overview thread).

@milseman milseman marked this pull request as ready for review October 13, 2021 18:17
@milseman
Copy link
Member

For choice of PCRE, it's the richest, most well-known, and IIUC basically a superset of perl/python/javascript/ICU/... regular expression syntax.

@milseman milseman changed the title [WIP] Add regex literal pitch Regex literal pitch Oct 13, 2021
Copy link

@kylemacomber kylemacomber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is in really good shape!

Comment on lines 43 to 65
```swift
// let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/
let regex = {
var builder = T.RegexLiteral()

// __A4 = /([[:alpha:]]\w*)/
let __A1 = builder.posixCharacterClassAlpha()
let __A2 = builder.characterClassWord()
let __A3 = builder.concatenate(__A1, __A2)
let __A4 = builder.captureGroup(__A3)

// __B1 = / = /
let __B1 = builder.literal(" = ")

// __C3 = /([0-9A-F]+)/
let __C1 = builder.customCharacterClass(["0"..."9", "A"..."F"])
let __C2 = builder.oneOrMore(__C1)
let __C3 = builder.captureGroup(__C2)

let __D1 = builder.concatenate(__A4, __B1, __C3)
return T(regexLiteral: builder.finalize(__D1))
}()
```
Copy link

@beccadax beccadax Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be premature, but I have Thoughts.

  1. We are both calling methods on a mutable builder instance, and returning values that are passed to later methods. So is the model here based on the methods mutating the builder to accumulate match operations into it, or is it based on the methods constructing an immutable value-type AST from the leaf nodes up, or is it somehow both? (Maybe builder is a context that the AST nodes can be made in, but the relationship between those nodes is expressed by the return values? In that case, the initializer should probably be given some estimate information, much like StringInterpolationProtocol.)

  2. If these methods are making AST nodes, should the method names indicate this? Perhaps by following the general makeFoo convention, or the result builder buildFoo convention? On the other hand, if they are having side effects on the builder, should they have verb-y names like appendFoo as we see in StringInterpolationProtocol?

  3. Do we want to name metacharacters' methods based on specific semantics they're expected to follow? It might be better to have more mechanical translation rules, like:

\<identifier character> turns into makeEscaped_<identifier character>() (except for characters that Swift string literals already assign a meaning to).

So for instance, the __A2 line would be let __A2 = builder.makeEscaped_w(). That would free us from actually baking in a semantic for each character at this level, and would give us a consistent answer for characters that are used for different things in different dialects or aren't used in any of the dialects we know of. I also suspect it would be easier for implementers to remember than coming up with a name for each metacharacter. (Is it "Word" or is it "IdentifierCharacter"?)

  1. In StringInterpolationProtocol, I found that I didn't need a finalize because init(stringInterpolation:) could express any finalization logic I might need. The same might be true here.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this syntax was mainly intended to be a straw-person to illustrate what the compiler transform would roughly look like, I believe the exact stdlib APIs are still TBD. Personally I think it makes sense for the builder methods to return AST nodes, so I'm not sure the builder would need any mutable state.

@milseman any thoughts on the naming conventions especially for the metachar methods?

Regarding finalize I believe we'd still need some way to get from the resulting root AST node back to the builder type to pass to init(regexLiteral:).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think these details @beccadax calls out are worth focusing on in this pitch. IIUC they're the protocol informal requirements and as such are more part of the literal than the concrete conforming Regex type we'll eventually provide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex do have an actual AST as they are recursive, unlike interpolation segments. So it makes sense for there to be some divergence.

I didn’t go further than this with mutating or not, other than I could easily imagine a conformer wanting to update a trivial bit of state in the process of constructing an AST node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I go back and forth on the metacharacter naming problem. One the one hand, \s really does mean white space and it would be weird for a conformer to pick a completely different meaning. How abstract is the AST? Beyond that very minor concern I have no strong opinions, other than that we be consistent. I think a middle ground that helpfully recognizes that it’s character class s could be worthwhile, so we have manufactureCharacterClass_s and manufactureCharacterClass_POSIX_alpha, picking your favorite construction verb.

Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

Multi-line comments use the `/*` delimiter. As such, a regex literal starting with `*` wouldn't be parsed. This however isn't a major issue as an unqualified `*` is already invalid regex syntax. An escaped `/\*/` regex literal wouldn't be impacted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A third comment issue that might be worth discussing: If we wanted to support multi-line regex literals, the most natural way to do that given current Swift syntax would be:

let regex = /// pattern /// 

However, /// is a comment syntax, and specifically it's the documentation comment syntax. Do we want to rule out this feature, or do we think there's a heuristic we can use to rescue it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Someone on the overall pitch thread pointed this out—can't remember who, but you may want to give credit.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just requiring that a regex that is split across lines in the source file start with an option expression (e.g. /(?m), /(?x)), then it's clearly a regex literal and we can continue parsing it until we hit the trailing /? Most regexes that contain newlines are going to have to have the multiline option or the extended option enabled anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal feeling is we probably shouldn't support multi-line regex literals, as I think users would likely be better served by the more general Pattern result builder API at that point. I can't immediately think of a clever way of getting the parser to work with /// delimiters for that, at least not without restricting where you can currently place documentation comments.

I definitely agree it's worth calling out for discussion though, I'll add a paragraph for it, including @al45tair's suggestion of inferring it from the options provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a paragraph for this. Regarding the post on the pitch thread that pointed this out, was this the one you had in mind?

}
```

`{ print("hello") }` will be parsed as a trailing closure to `SomeType()` rather than as a separate element to the result builder.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an even better analogy here to leading-dot syntax, which therefore isn't supported at the top level of a statement in a result builder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Leading dot syntax is indeed also ambiguous in a result builder, though we do seem to support it if a buildExpression is defined, e.g:

@resultBuilder struct Build { static func buildExpression(_ x: Int) -> Int { x } static func buildBlock(_ x: Int...) {} } func takesBuilder(@Build _ fn: () -> Void) {} func foo() -> Int { 0 } extension Int { static var baz: Int { 0 } var baz: Int { 0 } } takesBuilder { 0 .baz }

The builder closure above will be parsed as 0.baz, but if you insert a ; after the 0, it will become a two-element builder with the second being an inferred Int.baz.


### Opting into certain regex features

We intend for the compiler to completely parse [the PCRE syntax][PCRE]. However, types conforming to `RegexLiteralProtocol` might not be able to handle the full feature set. The compiler looks for corresponding function declarations inside `RegexLiteralProtocol` and will emit a compilation error if missing. Conforming types can use `@available` on these function declarations to communicate versioning and add more support in the future.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there situations where we might want to fail in a "softer" way than this? For instance, a regex engine that didn't support \Q would typically treat it as equivalent to plain Q. So if the method for \Q was missing, we might want to emit a warning recommending you remove the backslash and then treat it as though you hadn't written the backslash, rather than giving an actual error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can definitely also talk about warnings. I expect some things like range-based character classes to be a source of warnings/errors in grapheme-semantic mode (outside this scope), but the general idea of being able to issue errors and/or warnings is worth some discussion here.

I don't know if a missing decl is an error or warning, or if it makes sense to be configurable. Your thoughts? Would it differ for metacharacters than, say, named capture groups?


## Proposed Solution

We propose the introduction of a regular expression literal that supports [the PCRE syntax][PCRE], in addition to new standard library protocols `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` that allow for the customization of how the regex literal is interpreted (similar to [string interpolation][stringinterpolation]). The compiler will parse the PCRE syntax within a regex literal, and synthesize calls to corresponding builder methods. Types conforming to `ExpressibleByRegexLiteral` will be able to provide a builder type that opts into supporting various regex constructs through the use of normal function declarations and `@available`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also support UTS#18? http://www.unicode.org/reports/tr18/

AFAIK PCRE doesn't support all of that, but it'd be good if we could at least parse it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m fine supporting a syntactic superset. Every time I’ve read TR-18 it’s been through the lens of where we are semantically (level 1.5), and we clearly want to support libraries at higher or lower levels of support. But if there’s some metacharacters or additional syntax that we can parse that sounds fine. If there are incompatibilities, we’re likely to side with PCRE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue talk about TR18 syntax on the thread

@hamishknight
Copy link
Contributor Author

Updated to fix the default builder type to be DefaultRegexLiteral, and added a section on why we don't want to re-use string literal syntax.

- Support a syntax familiar to developers who have learned to use regular expressions in other tools and languages
- Allow reuse of many regular expressions not specifically designed for Swift (e.g. from Stack Overflow or popular programming books)
- Allow libraries to define custom types that can be constructed with regex literals, much like string literals
- Diagnose at compile time if a regex literal uses capabilities that aren't allowed by the type's regex dialect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks @beccadax


Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

While not conflicting with the syntax proposed in this pitch, it's also worth noting that the `//` comment syntax (in particular documentation comments that use `///`) would likely preclude the ability to use `///` as a delimiter if we ever wanted to support multi-line regex literals. It's possible though that future multi-line support could be inferred from the regex options provided. For example, a regex that uses the multi-line option `/(?m)/` could be allowed to span multiple lines.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean that the parser would understand (?m) and switch behavior or heuristics?

Also, raw regex literals give us a future route here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we'd have to recognize the m option while lexing and switch to allowing multiple lines. Good point regarding raw literals, I'll add a sentence for that

@hamishknight hamishknight merged commit 2ab8866 into swiftlang:main Oct 14, 2021
@hamishknight hamishknight deleted the regex-literal-pitch branch October 14, 2021 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

5 participants