@milseman
Member

Generate consumers for AST nodes that are atoms or custom character classes.

Most of the char class and props tests are now passing, with the notable exception of scripts (which the stdlib doesn't surface yet, cc @Azoy).

@kylemacomber left a comment

I only looked at the changes to the tests. This looks like great progress!

Contributor

@Azoy left a comment

There are a lot of properties that we don't have yet that probably make sense to expose at some point. Regarding scripts, I'm going to need to store some script information regardless, because I need to implement the CLDR grapheme breaking rule to be consistent with ICU. So it probably makes sense to start storing all of the script data anyway, since it seems we need it here. Should we start making a list of which properties we need to expose? Should we expose more and more Unicode properties?

Comment on lines 347 to 348
case .extendedPictographic:
break
Contributor

@Azoy Dec 16, 2021

I believe we store this property now as part of grapheme breaking, so if we need to expose it we can.

Comment on lines +381 to +396
case .otherAlphabetic:
break
case .otherDefaultIgnorableCodePoint:
break
case .otherGraphemeExtended:
break
case .otherIDContinue:
break
case .otherIDStart:
break
case .otherLowercase:
break
case .otherMath:
break
case .otherUppercase:
break
Contributor

These properties are unfortunate because they get scooped up with their normal counterpart (e.g. Alphabetic) to match ICU's behavior when asking whether a scalar isAlphabetic. If we need to make the distinction here, that will probably need to be a separate table.

Member Author

Hmm, IIRC they exist because Unicode historically used general categories for these kinds of queries, but it became apparent that a single universal categorization is not great for this use. IIRC isAlphabetic is derived from categories and properties, so it might actually be possible to run that in reverse: check isAlphabetic and exclude the others.
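
A rough sketch of what "running it in reverse" might look like, using only the properties that `Unicode.Scalar.Properties` already exposes. It assumes the derivation in the Unicode Character Database, where Alphabetic is the union of Lowercase, Uppercase, the general categories Lt, Lm, Lo and Nl, and Other_Alphabetic. The function name is hypothetical, and the result is only an approximation: a scalar that is in Other_Alphabetic but is also covered by one of the other inputs would be missed.

```swift
// Hypothetical helper, not part of the stdlib or of this PR.
// Returns true for scalars that are Alphabetic but are not explained by
// Lowercase, Uppercase, or the Lt/Lm/Lo/Nl general categories, i.e. scalars
// that are presumably alphabetic only because of Other_Alphabetic.
func isAlphabeticOnlyViaOtherAlphabetic(_ scalar: Unicode.Scalar) -> Bool {
    let props = scalar.properties
    guard props.isAlphabetic else { return false }

    // Lowercase and Uppercase already fold in Ll/Lu and Other_Lowercase/Other_Uppercase.
    if props.isLowercase || props.isUppercase { return false }

    switch props.generalCategory {
    case .titlecaseLetter, .modifierLetter, .otherLetter, .letterNumber:
        return false
    default:
        return true
    }
}
```

For example, U+0903 DEVANAGARI SIGN VISARGA (general category Mc) should be accepted, while "a" and "1" should not. Whether an approximation like this is good enough, or a separate table is needed, is exactly the open question in this thread.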

Comment on lines 407 to 408
case .regionalIndicator:
break
Contributor

This property is a simple range, 0x1F1E6...0x1F1FF, if we want to expose it.
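
As a minimal sketch of the range check described above (the function name is hypothetical and is not an existing stdlib API):

```swift
// Hypothetical helper: Regional_Indicator covers U+1F1E6 through U+1F1FF,
// the REGIONAL INDICATOR SYMBOL LETTER A...Z block used for flag sequences.
func isRegionalIndicator(_ scalar: Unicode.Scalar) -> Bool {
    (0x1F1E6...0x1F1FF).contains(scalar.value)
}
```

A consumer for `.regionalIndicator` could then presumably wrap a predicate like this instead of falling through to `break`.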

Member Author

Let's expose any properties that let you write grapheme breaking as a regular expression, for testing purposes :-)

milseman and others added 3 commits December 16, 2021 07:32
Co-authored-by: Kyle Macomber <kmacomber@apple.com>
Co-authored-by: Richard Wei <rxrwei@gmail.com>
@milseman
Member Author

Switched all the fatal errors to throws, so that we at least learn the reason for a failure and can better support xfails.
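
To illustrate the general pattern (this is not the PR's actual code; the error type, enum, and function names below are all hypothetical), a consumer generator can throw a descriptive error for a property it cannot handle yet, which a test can then catch and report as an expected failure:

```swift
// Hypothetical error type; the one used in the PR is not shown in this conversation.
struct UnsupportedProperty: Error, CustomStringConvertible {
    let name: String
    var description: String { "Unsupported Unicode property: \(name)" }
}

// Hypothetical two-case stand-in for the AST's property enum.
enum BinaryProperty {
    case regionalIndicator
    case extendedPictographic
}

// Returns a scalar predicate, or throws instead of calling fatalError
// when the property is not yet surfaced by the stdlib.
func makeConsumer(for property: BinaryProperty) throws -> (Unicode.Scalar) -> Bool {
    switch property {
    case .regionalIndicator:
        return { (0x1F1E6...0x1F1FF).contains($0.value) }
    case .extendedPictographic:
        throw UnsupportedProperty(name: "Extended_Pictographic")
    }
}
```

Compared with `fatalError`, the thrown error carries the reason to the test harness without taking down the whole test process.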

@milseman
Member Author

@swift-ci please test linux platform

@milseman
Member Author

@swift-ci please test linux platform

@milseman merged commit 25f6fdc into swiftlang:main Dec 16, 2021
@milseman deleted the arborvore branch December 16, 2021 16:05