Unicode Regexes

Introduction to Unicode Regular Expressions

Unicode is a character set that aims to define all characters and glyphs from all human languages and writing systems, living and dead, from Egyptian hieroglyphs to space age emoji 🚀🛸. With more and more software being required to support multiple languages, or even just any language, not to mention those cute 😺🐶 emoji, Unicode has become essential to most software. Using different character sets for different languages is simply too cumbersome for programmers and users.

Most flavors discussed in this tutorial support Unicode. PCRE is 8-bit by default. But when this tutorial talks about PCRE, it assumes that you compiled PCRE with Unicode support. PCRE2 supports Unicode when compiled with the default settings. PHP uses PCRE or PCRE2 compiled with Unicode support, but requires the /u flag to actually enable the Unicode features. This tutorial assumes you use the /u flag with PHP all the time. JavaScript also has a /u flag to enable full Unicode support. The tutorial mentions when this flag should be used. In JavaScript it has more implications than just enabling Unicode support. For Boost and the standard C++ library Unicode support depends on the character type your code uses. On Windows, boost::regex and std::regex are based on the 8-bit char type while boost::wregex and std::wregex are based on the 16-bit wchar_t. This tutorial assumes you’re using wchar_t.

The regular expressions reference that accompanies this tutorial makes the same assumptions. For JavaScript, the reference tables indicate “with /u” or “without /u” if the feature depends on the /u flag being set or not.

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Unicode continues to evolve, typically releasing a new version each September. When a regex engine is updated to a newer version of Unicode, it can change how your regexes work. It will start matching newly defined characters. But it can also stop matching certain characters if the new version changes a character’s properties. As an example, the Georgian letters U+10D0–U+10FA were originally in the lowercase letter category. Unicode 3.0.0 moved them to the “other letter” category because they didn’t have uppercase equivalents. Unicode 11.0.0 added uppercase equivalents of these letters and moved the original letters back to the lowercase letter category. This change would affect you, for example, if your app was previously running on Java 11 or prior (based on Unicode 10.0.0 or prior) and is then migrated to Java 12 or later (based on Unicode 11.0.0 or later). Whether this actually impacts your application depends on whether you have any users in Georgia and whether your app uses regexes with \p{Ll} and/or \p{Lo}.
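One way to find out which Unicode version your runtime is based on is to ask it directly. The sketch below is only an illustration and assumes Python, whose standard unicodedata module reports the version of its bundled character database. The category it prints for the Georgian letter depends on that version.

```python
import unicodedata

# Unicode version of the character database bundled with this Python build.
unicode_version = unicodedata.unidata_version

# U+10D0 GEORGIAN LETTER AN is reported as "Lo" by databases older than
# Unicode 11.0.0 and as "Ll" by Unicode 11.0.0 and later.
georgian_an_category = unicodedata.category("\u10D0")

print(unicode_version, georgian_an_category)
```

Running the same check before and after a runtime upgrade shows whether a regex relying on \p{Ll} or \p{Lo} could be affected.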

The biggest change that still impacts software today is that Unicode was originally designed as a 16-bit character set. But that turned out to be insufficient if we want Unicode to support every character from every script from all of history and the future. Even though the first code points beyond U+FFFF were assigned way back in 2001 with Unicode 3.1.0, a lot of software, including Windows itself, is still designed around 16-bit characters. Such software uses surrogate pairs to handle code points between U+10000 and U+10FFFF. This can upset your regex engine’s handling of astral characters. Make sure to read up on surrogate pairs and astral characters if your application needs to support such characters, which include many emoji and the 𝔪𝔞𝔱𝔥 𝓼𝔂𝓶𝓫𝓸𝓵𝓼 that are often used for more fanciful purposes.
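The difference between a code point and its surrogate pair can be made visible with a few lines of code. This is a minimal sketch in Python, which counts code points in its strings. The helper name utf16_code_units is made up for this illustration.

```python
def utf16_code_units(text):
    """Return the 16-bit code units that encode text in UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

rocket = "\U0001F680"  # 🚀 is U+1F680, an astral code point

code_points = len(rocket)         # Python strings count code points
units = utf16_code_units(rocket)  # software built on 16-bit characters sees these

print(code_points, [hex(unit) for unit in units])
```

A regex engine that works on 16-bit units treats the rocket as two separate characters, which is why a dot or a character class may match only half of it.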

Characters, Code Points, and Graphemes

What a regex engine sees as a character and what this tutorial means by a character is more accurately called a Unicode code point. What most people see as a character is more accurately called a Unicode grapheme. The topic about Unicode characters, code points, and graphemes explains the difference in detail.
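As a small illustration of the distinction, the following sketch uses Python’s standard re and unicodedata modules. It assumes the letter é written in two ways: as a base letter followed by a combining mark, and as a single precomposed code point.

```python
import re
import unicodedata

decomposed = "e\u0301"  # "e" plus COMBINING ACUTE ACCENT, displayed as one grapheme
precomposed = unicodedata.normalize("NFC", decomposed)  # the single code point U+00E9

# The dot matches exactly one code point, so it cannot match the
# two-code-point form of the grapheme in its entirety.
dot_matches_decomposed = re.fullmatch(r".", decomposed) is not None
dot_matches_precomposed = re.fullmatch(r".", precomposed) is not None

print(len(decomposed), len(precomposed))
print(dot_matches_decomposed, dot_matches_precomposed)
```

Both strings look identical on screen, yet only the precomposed one is a single character as far as the regex engine is concerned.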

Unicode Properties

Unicode assigns various properties to characters. For example, Unicode code point U+0031 which represents the digit 1 is in the Unicode category Nd or Decimal_Number, in the Unicode script Common, and in the Unicode block Basic_Latin. It has various binary Unicode properties: ASCII_Hex_Digit, Grapheme_Base, Emoji, Emoji_Component, Hex_Digit, ID_Continue, and XID_Continue. It also has one value for each Unicode property set: Bidi_Class=European_Number, Bidi_Paired_Bracket_Type=None, Canonical_Combining_Class=0, Decomposition_Type=None, East_Asian_Width=Narrow, Grapheme_Cluster_Break=Other, Identifier_Status=Allowed, Identifier_Type=Recommended, Indic_Conjunct_Break=None, Indic_Positional_Category=NA, Indic_Syllabic_Category=Number, Joining_Group=No_Joining_Group, Joining_Type=Non_Joining, Line_Break=NU, Numeric_Type=Decimal, Numeric_Value=1, Sentence_Break=Numeric, Vertical_Orientation=Rotated, and Word_Break=Numeric.
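A few of these properties can be looked up programmatically. The sketch below assumes Python’s standard unicodedata module, which exposes only a handful of the properties listed above and reports them using the abbreviated values from the Unicode character database.

```python
import unicodedata

digit_one = "\u0031"

# Only some properties are available through unicodedata; values use
# the short aliases, such as "EN" for European_Number and "Na" for Narrow.
properties = {
    "General_Category": unicodedata.category(digit_one),
    "Bidi_Class": unicodedata.bidirectional(digit_one),
    "Canonical_Combining_Class": unicodedata.combining(digit_one),
    "East_Asian_Width": unicodedata.east_asian_width(digit_one),
    "Numeric_Value": unicodedata.numeric(digit_one),
}

for name, value in properties.items():
    print(f"{name}={value}")
```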

The Unicode standard recommends that regular expression engines support the \p{Property_Set=Property_Value} syntax to match any character that has the specified value for the specified property. For example, \p{Numeric_Value=1}+ should match all of 1¹١۱१১৴੧૧୧௧౧೧൧๑໑༡₁⅟Ⅰⅰ①⑴⒈❶➀➊〡㊀１. The syntax \p{Property} should be used to match any character that has the specified binary property. \p{Hex_Digit}+ should match 0123456789ABCDEFabcdef０１２３４５６７８９ＡＢＣＤＥＦａｂｃｄｅｆ. Unicode suggests this shorter notation as an alternative for categories and scripts. So you could use \p{Nd} instead of \p{gc=Nd} and \p{Common} instead of \p{Script=Common}.
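Not every flavor supports this syntax. Python’s standard re module, for instance, has no \p{...} at all. The sketch below is therefore an emulation rather than a regex: it uses unicodedata.numeric() to approximate what \p{Numeric_Value=1}+ would match. The helper names are hypothetical.

```python
import unicodedata

def has_numeric_value(ch, value):
    # Stand-in for \p{Numeric_Value=value} applied to a single code point.
    # Code points without a numeric value yield None and never compare equal.
    return unicodedata.numeric(ch, None) == value

def all_numeric_value(text, value):
    # Rough equivalent of matching the whole text with \p{Numeric_Value=value}+
    return len(text) > 0 and all(has_numeric_value(ch, value) for ch in text)

ones = "1\u00B9\u0661\u2460\u2160\uFF11"  # 1 ¹ ١ ① Ⅰ １

print(all_numeric_value(ones, 1))
print(all_numeric_value("12", 1))
```

The first call reports that every code point in the sample has the numeric value 1, while the second fails because of the digit 2.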

As with anything in regular expressions, regex flavors vary widely in what they actually support. Check the documentation of your regex flavor to learn exactly which Unicode properties it supports. Every flavor that supports Unicode properties supports \p{N} and \p{Nd} to match all characters in a specific Unicode category, specifying just the one-letter or two-letter abbreviation for the category. Far fewer flavors support the more explicit syntax \p{gc=Nd}.
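The relationship between the one-letter and two-letter abbreviations can be sketched in code. The example below is an emulation in Python, whose standard re module does not support \p{...}. It relies on unicodedata.category(), and the helper names in_category and find_runs are made up for this illustration.

```python
import unicodedata
from itertools import groupby

def in_category(ch, abbreviation):
    # "N" covers Nd, Nl, and No, while "Nd" covers only Nd,
    # mirroring the one-letter and two-letter forms \p{N} and \p{Nd}.
    return unicodedata.category(ch).startswith(abbreviation)

def find_runs(text, abbreviation):
    # Rough stand-in for a findall with \p{abbreviation}+ : collect maximal
    # runs of consecutive code points in the requested category.
    runs = groupby(text, key=lambda ch: in_category(ch, abbreviation))
    return ["".join(group) for matched, group in runs if matched]

sample = "a12b\u0663\u0664 \u2167"  # a12b٣٤ Ⅷ

decimal_runs = find_runs(sample, "Nd")  # ASCII and Arabic-Indic digits
number_runs = find_runs(sample, "N")    # also includes the Roman numeral (Nl)

print(decimal_runs, number_runs)
```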

Many flavors also support Unicode scripts specifying just the name of the script. \p{Cyrillic} matches all characters in the Cyrillic script. Most, but not all, of those also support \p{Script=Cyrillic}.

Some flavors support Unicode blocks by specifying the name of the block with the prefix In. \p{InCyrillic} is equivalent to [\u{0400}-\u{04FF}], matching this entire block. Again, most, but not all, of those flavors also support \p{Block=Cyrillic}. The prefix is needed to distinguish the block from the Unicode script of the same name.
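Because a block is just a contiguous range of code points, it can be written as an explicit character class in flavors that lack block syntax. The sketch below assumes Python’s standard re module, which writes the escapes as \u0400 rather than \u{0400}. It also shows why a block is not the same as a script.

```python
import re

# Explicit range covering the Cyrillic block, U+0400 through U+04FF.
cyrillic_block = re.compile(r"[\u0400-\u04FF]+")

words = cyrillic_block.findall("Привет, world! Здравствуйте")

# U+1C80 CYRILLIC SMALL LETTER ROUNDED VE belongs to the Cyrillic script
# but sits in the Cyrillic Extended-C block, outside U+0400 through U+04FF.
extended_letter_in_block = cyrillic_block.fullmatch("\u1C80") is not None

print(words, extended_letter_in_block)
```

A script property would match the extended letter, while the block range does not.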

Support for binary Unicode properties and Unicode property sets other than categories, scripts, and blocks is much more limited. ICU and Perl are really the only regex flavors that support (nearly) all properties.

All flavors that support some Unicode properties also support the syntax with a capital P to negate the property, as recommended by Unicode. \P{Nd} matches any code point that is not in the Decimal_Number category. That includes unassigned code points. Perl, Ruby, PCRE, PCRE2, and flavors based on the latter two such as PHP, Delphi, and R, support an alternative syntax using a caret for negation. \p{^Nd} is another way of writing \P{Nd} in those flavors. Be careful not to negate the property twice. \P{^Nd} with double negation is the same as \p{Nd} without any negation.
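The effect of negation on unassigned code points is easy to overlook. The following sketch emulates \P{Nd} in Python with unicodedata.category(), since Python’s standard re module does not support the property syntax. The function name is hypothetical, and U+0378 is assumed to be unassigned, which holds for the Unicode versions available at the time of writing.

```python
import unicodedata

def is_not_decimal_digit(ch):
    # Stand-in for \P{Nd}: true for every code point outside category Nd.
    return unicodedata.category(ch) != "Nd"

unassigned = "\u0378"  # assumed to be an unassigned code point (category Cn)

letter_result = is_not_decimal_digit("a")
digit_result = is_not_decimal_digit("7")
unassigned_result = is_not_decimal_digit(unassigned)

# Negating twice, as \P{^Nd} does, is the same as testing for Nd directly.
double_negation = not is_not_decimal_digit("7")

print(letter_result, digit_result, unassigned_result, double_negation)
```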

Unicode Boundaries

Unicode Standard Annex 29 titled “Unicode Text Segmentation” defines rules for word boundaries, grapheme boundaries, and sentence boundaries. Perl supports all three in its regex flavor since Perl 5.22. ICU only supports the word boundaries. Java 9 and later support the grapheme boundaries.

Unicode Standard Annex 14 titled “Unicode Line Breaking Algorithm” defines an algorithm for finding potential word wrapping positions. Perl 5.24 and later can match these positions as line boundaries.