Unicode Regexes

Regex Tutorial

Special Characters

Non-Printable Characters

Regex Engine Internals

Character Classes

Character Class Subtraction

Character Class Intersection

Shorthand Character Classes

Backreferences, part 2

Named Groups

Relative Backreferences

Branch Reset Groups

Free-Spacing & Comments

Unicode Characters & Properties

Mode Modifiers

Atomic Grouping

Possessive Quantifiers

Lookahead & Lookbehind

Lookaround, part 2

Lookbehind Limitations

(Non-)Atomic Lookaround

Keep Text out of The Match

Recursion & Quantifiers

Recursion & Capturing

Recursion & Backreferences

Recursion & Backtracking

POSIX Bracket Expressions

Zero-Length Matches

Continuing Matches

Backtracking Control Verbs

Control Verb Arguments

Unicode Characters, Code Points, and Graphemes

Most people would consider à a single character. Unfortunately, it need not be depending on the meaning of the word “character”.

All Unicode regex engines discussed in this tutorial treat any single Unicode code point as a single character. When this tutorial tells you that the dot matches any single character, this translates into Unicode parlance as “the dot matches any single Unicode code point”. In Unicode, à can be encoded as two code points: U+0061 (a) followed by U+0300 (grave accent). In this situation, . applied to à will match a without the accent. ^.$ will fail to match, since the string consists of two code points. ^..$ matches à.

The Unicode code point U+0300 (grave accent) is a combining mark. Any code point that is not a combining mark can be followed by any number of combining marks. This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.

Unfortunately, à can also be encoded with the single Unicode code point U+00E0 (a with grave accent). The reason for this duality is that many historical character sets encode “a with grave accent” as a single character. The Windows 1252 and ISO-8859-1 code pages encode it as the byte 0xE0, for example. Unicode’s designers thought it would be useful to have a one-on-one mapping with popular legacy character sets, in addition to the Unicode way of separating marks and base letters which makes arbitrary combinations not supported by legacy character sets possible. In fact, Unicode code points U+0000 to U+00FF are identical to ISO-8859-1 bytes 0x00 to 0xFF.

Because regex engines work on code points, this tutorial uses the word “character” as a synonym for the term “code point”. It uses the word “grapheme” when talking about graphemes.

This topic, like every topic in this tutorial except the one about astral characters, assumes that your regex engine either correctly handles code points between U+10000 and U+10FFFF as individual code points, or that your subject string does not contain any such code points.

How to Match a Single Unicode Grapheme

Matching a single grapheme, whether it’s encoded as a single code point, or as multiple code points using combining marks, is easy in ICU, Java, Perl, PCRE, PCRE2, PHP, Delphi, R, and Boost: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.

Well, it would be easy if all these flavors agreed on exactly what a grapheme is. All these flavors match à with \X regardless of whether it is encoded as a single code point U+00E0 or as two code points U+0061 U+0300. But there are significant differences with graphemes that are more complex than a letter with a combining mark.

The JGsoft applications, Ruby 2.0 to 2.3, PCRE 5.0 to 8.31, and flavors based on these PCRE versions such as Delphi XE to XE6, PHP 5.0.0 to 5.4.13, and R 2.14.0 to 2.15.2 use the simplest definition of a grapheme. They implement \X as matching one code point that is not in the Mark category along with all following code points, if any, that are in the Mark category. In flavors that do not support \X but do support Unicode categories and atomic grouping, which includes .NET, Java 8 and prior, and Ruby 1.9, you could match simple graphemes with the regex (?>\P{M}\p{M}*) and get the same results.

Boost also uses a simple definition of graphemes but only treats a small subset of the characters in the Mark category as combining characters. It also treats some unassigned code points as combining characters. You could implement Boost’s idea of a grapheme with the regex (?>[^\u0300-\u0361\u0483-\u0486\u0903\u093E-\u0940\u0949-\u094C\u0982\u0983\u09BE-\u09C0\u09C7-\u09CC\u09D7\u0A3E-\u0A40\u0A83\u0ABE-\u0AC0\u0AC9-\u0ACC\u0B02\u0B03\u0B3E\u0B40\u0B47-\u0B4C\u0B57\u0B83\u0BBE\u0BBF\u0BC1-\u0BCC\u0BD7\u0C01-\u0C03\u0C41-\u0C44\u0C82\u0C83\u0CBE\u0CC0-\u0CC4\u0CC7-\u0CCB\u0CD5\u0CD6\u0D02\u0D03\u0D3E-\u0D40\u0D46-\u0D4C\u0D57\u0F7F\u20D0-\u20E1\u3099-\u309A\uFE20-\uFE23\uFFFF][\u0300-\u0361\u0483-\u0486\u0903\u093E-\u0940\u0949-\u094C\u0982\u0983\u09BE-\u09C0\u09C7-\u09CC\u09D7\u0A3E-\u0A40\u0A83\u0ABE-\u0AC0\u0AC9-\u0ACC\u0B02\u0B03\u0B3E\u0B40\u0B47-\u0B4C\u0B57\u0B83\u0BBE\u0BBF\u0BC1-\u0BCC\u0BD7\u0C01-\u0C03\u0C41-\u0C44\u0C82\u0C83\u0CBE\u0CC0-\u0CC4\u0CC7-\u0CCB\u0CD5\u0CD6\u0D02\u0D03\u0D3E-\u0D40\u0D46-\u0D4C\u0D57\u0F7F\u20D0-\u20E1\u3099-\u309A\uFE20-\uFE23\uFFFF]*). Yes, the non-character \uFFFF that was obviously intended as a sentinel in the lookup table in Boost’s code is also treated as a combining character.

The proper way to handle graphemes in modern Unicode is using the grapheme cluster boundaries defined Unicode Standard Annex #29. All code points between two such boundaries should be treated as a single grapheme. ICU, Perl, Java 9 and later, Ruby 2.4 and later, PCRE 8.32 and later, and PCRE2 all base \X on their interpretation of UAX #29 grapheme boundaries. Perl and Java even support \b{g} to match the grapheme boundaries themselves. Internally, a regex engine could implement \X as (?>.+?\b{g}) to match all code points up until the next grapheme boundary.

A very important difference between \X based on simple graphemes and \X based on cluster boundaries is that the simple graphemes require a base character, while the cluster boundaries do not. A grapheme without a base character is called a degenerate grapheme. A string that consists of the single code point U+0300 consists of a single degenerate grapheme. It cannot be matched as a simple grapheme because there is no code point that is not in the Mark category to start the match. But the end of the string is always a grapheme boundary. So U+0300 alone will be matched when \X is based on boundaries. In fact, \X always matches at least one code point when based on boundaries. It matches additional code points if that is needed to reach the next grapheme cluster boundary.

This can also happen in the middle of the string. If you have the string à as two code points U+0061 U+0300 then a\X fails when the regex flavor uses simple graphemes, but matches when it uses grapheme boundaries. a matches a regardless. Then \X is attempted starting at U+0300. This cannot be matched as a simple grapheme, but can be matched as the one code point up to the next boundary.

You may have noticed the phrase “their interpretation” three paragraphs ago. Even when regex flavors implement what should be standards, they do so inconsistently. The UAX #29 rules are a bit complicated. They have also changed with different Unicode versions. We’ll go over the most important rules and how they’re supported to give you a better idea of what Unicode considers a grapheme and how regex flavors follow this. The order of the rules is important. If multiple rules appear applicable, the earlier rule applies. There is no grapheme boundary between the control characters CR and LF, for example, because rule #2 takes precedence over rule #3. The rules use the verb “break” to indicate the positions of grapheme cluster boundaries.

UAX #29 defines both “legacy grapheme clusters” and “extended grapheme clusters”. Legacy grapheme clusters exclude some of the rules. ICU 66 and prior implemented legacy grapheme clusters. Extended grapheme clusters implement all the rules. ICU 67 and later and all the other flavors that implement grapheme boundaries (try to) implement all the rules.

Break at the start at end of the string.
Do not break between between a carriage return and a line break.
Break before and after a control character.
Do not break Hangul (Korean) syllable or other conjoining sequences.
Do not break before extending characters, which include all non-spacing marks, all enclosing marks, some spacing marks, emoji modifiers, and the zero-width joiner and non-joiner.
Extended clusters only: Do not break before most other spacing marks.
Extended clusters only: Do not break after code points with gcb=Prepend which are mainly characters from Indic script that must precede another character.
Extended clusters only, since Unicode 15.1.0: Do not break between certain combinations of Indic characters that include “linkers”.
Do not break within emoji modifier sequences or emoji ZWJ sequences. The actual implementation of this rule has changed significantly with Unicode 10.0.0.
Do not break within emoji flag sequences. Since Unicode 9.0.0 that means: do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point. Between Unicode 6.2.0 and 8.0.0 that meant: do not break between any two RI symbols.
Break everywhere.

So in total, we have four different implementations of graphemes. We have two simple implementations, based either on the Mark category or on Boost’s idea of a combining character. We also have two implementations based on grapheme clusters, which can be either legacy clusters or extended clusters. To better illustrate the differences between the various implementations, consider these sample strings and how they are matched by \X:

Sample String	Simple (Boost)	Simple (Mark category)	Legacy Cluster	Extended Cluster
`à` (U+0061 U+0300)	`à` (U+0061 U+0300)
`คู` (U+0E0F U+0E39)	Each code point separately	`คู` (U+0E0F U+0E39)
`अः` (U+0905 U+0903)	Each code point separately	`अः` (U+0905 U+0903)	Each code point separately	`अः` (U+0905 U+0903)
`ｶﾞ` (U+FF76 U+FF9F)	Each code point separately		`ｶﾞ` (U+FF76 U+FF9F)
`각` (U+AC00 U+11A8)	Each code point separately		`각` (U+AC00 U+11A8)
CRLF pair	CR and LF separately		CRLF as a single match
tab grave (U+0009 U+0300)	Only the tab can be matched		Tab U+0009 and grave U+0300 separately

Matching a Specific Code Point

To match a specific Unicode code point, use \u{10FFFF} or \x{10FFFF} where 10FFFF is the hexadecimal number of the code point you want to match. The number can range from 0 to 10FFFF. Leading zeros are permitted but not required. \x{E0} and \x{00E0} match à when encoded as a single code point U+00E0.\u{10FFFF} is supported by Ruby 1.9 and later and by JavaScript with the /u flag. \x{10FFFF} is supported by ICU, RE2, Perl, PCRE, PCRE2, and Boost. It is also supported by Java 7 and later. The JGsoft applications support \u{FFFF} and \x{FFFF} but only with code points up to U+FFFF.

Since \x by itself is not a valid regex token, \x{9} can never be confused to match \x 9 times. It always matches the Unicode code point U+0009 which represents the ASCII tab. \x{9}{9} matches 9 tabs. \u by itself is a shorthand character class that matches any uppercase letter in the JGsoft applications and Boost. The JGsoft applications interpret \u{9} as Unicode code point U+0009 instead of as 9 uppercase letters. \u{9,9} matches exactly 9 uppercase letters. Boost, however, interprets \u{9} as 9 uppercase letters because it does not support \u{FFFF} to match a Unicode code point.

For code points between U+0000 and U+FFFF you can use \uFFFF without the curly braces. But then you must always specify 4 hexadecimal digits. \u0161 matches š. This syntax is supported by .NET, Python 3.3 and later, and std::wregex. It is also supported by ICU, Java (any version), JavaScript (with and without /u), and the JGsoft applications.

\U0010FFFF also matches a code point between U+0000 and U+10FFFF. You must always specify 8 hexadecimal digits, even though Unicode code points never use more than 6. This syntax is supported by Tcl, ICU, and Python 3.3 and later.

In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you’re doing, the difference may be significant.

XML Schema and XPath do not have a regex token for matching Unicode code points. However, you can easily use XML entities like  to insert literal code points into your regular expressions if they are stored in XML files. The XML parser will convert the entity into the character before the XML Schema or XPath parser compile the regex.

Matching a String of Code Points

Ruby has a unique take on the \u{10FFFF} syntax: it allows you to specify multiple code points delimited by spaces within a single escape. \u{52 75 62 79} is identical to \u{52}\u{75}\u{62}\u{79}. Both match Ruby. Quantifiers affect only the last character. \u{52 75 62 79}+ is identical to \u{52}\u{75}\u{62}\u{79}+. Both match Rubyyy. Use (?:\u{52 75 62 79})+ if you want to match RubyRubyRuby.

Do You Need To Worry About Different Encodings?

While you should always keep in mind the pitfalls created by the different ways in which accented characters can be encoded, you don’t always have to worry about them. If you know that your input string and your regex use the same style then you don’t have to worry about it at all. This process is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, have library routines for normalizing strings. If you normalize both the subject and regex before attempting the match then there won’t be any inconsistencies.

If you are using Java then you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells the Java regex engine to consider canonically equivalent characters as identical. The regex à encoded as U+00E0 matches à encoded as U+0061 U+0300, and vice versa. None of the other regex engines currently support canonical equivalence while matching.

If you type the à key on the keyboard then that is always received as the code point U+00E0 by the application. So if you’re working with text that you typed in yourself then any regex that you type in yourself will match in the same way.

If you’re working with text files encoded using a traditional Windows (often called “ANSI”) or ISO-8859 code page then most applications use the one-on-one substitution. Since all the Windows or ISO-8859 code pages encode accented characters as a single code point, nearly all software uses a single Unicode code point for each character when converting the file to Unicode.