
Introduction to Unicode Regular Expressions

Unicode is a character set that aims to define all characters and glyphs from all human languages and writing systems, living and dead, from Egyptian hieroglyphs to space age emoji 🚀🛸. With more and more software being required to support multiple languages, or even just any language, not to mention those cute 😺🐶 emoji, Unicode has become essential to most software. Using different character sets for different languages is simply too cumbersome for programmers and users.

Most flavors discussed in this tutorial support Unicode. PCRE is 8-bit by default. But when this tutorial talks about PCRE, it assumes that you compiled PCRE with Unicode support. PCRE2 supports Unicode when compiled with the default settings. PHP uses PCRE or PCRE2 compiled with Unicode support, but requires the /u flag to actually enable the Unicode features. This tutorial assumes you use the /u flag with PHP all the time. JavaScript also has a /u flag to enable full Unicode support. The tutorial mentions when this flag should be used. In JavaScript it has more implications than just enabling Unicode support. For Boost and the standard C++ library Unicode support depends on the character type your code uses. On Windows, boost::regex and std::regex are based on the 8-bit char type while boost::wregex and std::wregex are based on the 16-bit wchar_t. This tutorial assumes you’re using wchar_t.

The regular expressions reference that accompanies this tutorial makes the same assumptions. For JavaScript, the reference tables indicate “with /u” or “without /u” if the feature depends on the /u flag being set or not.

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Unicode continues to evolve, typically releasing a new version each September. When a regex engine is updated to a newer version of Unicode it can change how your regexes work. It will start matching newly defined characters. But it can also stop matching certain characters if the new version changes a character’s properties. As an example, the Georgian letters U+10D0–U+10FA were originally in the lowercase letter category. Unicode 3.0.0 moved them to the “other letter” category because they didn’t have uppercase equivalents. Unicode 11.0.0 added uppercase equivalents of these letters and moved the original letters back to the lowercase letter category. This change would affect you, for example, if your app was previously running on Java 11 or prior (based on Unicode 10.0.0 or prior) and is then migrated to Java 12 or later (based on Unicode 11.0.0 or later). Whether this actually impacts your application depends on whether you have any users in Georgia and whether your app uses regexes with \p{Ll} and/or \p{Lo}.
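One way to see which Unicode version a runtime is based on is to ask it directly. The following is a minimal sketch in Python, which is not one of the flavors named above; it uses the standard unicodedata module to report the Unicode version of the running interpreter and the category it assigns to U+10D0, the first of the Georgian letters in question. The helper name describe_runtime is made up for this illustration.

```python
import sys
import unicodedata

# U+10D0 GEORGIAN LETTER AN, one of the Georgian letters discussed above.
GEORGIAN_AN = "\u10D0"


def describe_runtime():
    """Return the Unicode version of this Python build and the general
    category that version assigns to U+10D0."""
    return unicodedata.unidata_version, unicodedata.category(GEORGIAN_AN)


version, category = describe_runtime()
print(
    f"Python {sys.version_info.major}.{sys.version_info.minor} "
    f"uses Unicode {version}; U+10D0 is in category {category}"
)
```

A build based on Unicode 11.0.0 or later should report Ll for this letter, while an older build should report Lo. The same kind of check can be made in other languages that expose their Unicode database version.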

The biggest change that still impacts software today is that Unicode was originally designed as a 16-bit character set. But that turned out to be insufficient if we want Unicode to support every character from every script from all of history and the future. Even though the first code points beyond U+FFFF were assigned way back in 2001 with Unicode 3.1.0, a lot of software, including Windows itself, is still designed around 16-bit characters. Such software uses surrogate pairs to handle code points between U+10000 and U+10FFFF. This can upset your regex engine’s handling of astral characters. Make sure to read the topic about astral characters if your application needs to support such characters, which include many emoji and the 𝔪𝔞𝔱𝔥 𝓼𝔂𝓶𝓫𝓸𝓵𝓼 that are often used for more fanciful purposes.
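The arithmetic behind surrogate pairs is short enough to sketch. The Python example below is only an illustration; the function name to_surrogate_pair is made up for it. It splits a code point above U+FFFF into the two 16-bit code units that UTF-16 software stores, using the rocket emoji U+1F680 as the sample.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into its UTF-16
    high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


rocket = "\U0001F680"  # the rocket emoji
high, low = to_surrogate_pair(ord(rocket))
print(f"U+{ord(rocket):X} -> {high:04X} {low:04X}")

# Python strings are sequences of code points, so the rocket has length 1.
# Its UTF-16 encoding needs two 16-bit code units.
code_points = len(rocket)
utf16_units = len(rocket.encode("utf-16-be")) // 2
print(code_points, utf16_units)
```

A regex engine that works on 16-bit code units sees two "characters" here, which is why the dot or a character class can behave unexpectedly with astral characters in such engines.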

Characters, Code Points, and Graphemes

What a regex engine sees as a character and what this tutorial means by a character is more accurately called a Unicode code point. What most people see as a character is more accurately called a Unicode grapheme. The topic about Unicode characters, code points, and graphemes explains the difference in detail.
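As a small illustration of the difference, the Python sketch below compares two ways of writing à: as the single code point U+00E0, and as the letter a followed by the combining grave accent U+0300. Both display as one grapheme, but the dot in Python's re module matches one code point at a time. The variable names are chosen for this example only.

```python
import re
import unicodedata

precomposed = "\u00E0"   # à as a single code point
decomposed = "a\u0300"   # a followed by COMBINING GRAVE ACCENT

# Both render as one grapheme but differ in code point count.
print(len(precomposed), len(decomposed))

# The dot matches exactly one code point, so a single dot can match the
# whole precomposed string but not the whole decomposed one.
dot_matches_precomposed = re.fullmatch(r".", precomposed) is not None
dot_matches_decomposed = re.fullmatch(r".", decomposed) is not None
print(dot_matches_precomposed, dot_matches_decomposed)

# NFC normalization composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
```

Normalizing the subject string before matching is one possible way to make such regexes behave consistently, under the assumption that a precomposed form exists for the graphemes you care about.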

Unicode Properties

Unicode assigns various properties to characters. For example, Unicode code point U+0031 which represents the digit 1 is in the Unicode category Nd or Decimal_Number, in the Unicode script Common, and in the Unicode block Basic_Latin. It has various binary Unicode properties: ASCII_Hex_Digit, Grapheme_Base, Emoji, Emoji_Component, Hex_Digit, ID_Continue, and XID_Continue. It also has one value for each Unicode property set: Bidi_Class=European_Number, Bidi_Paired_Bracket_Type=None, Canonical_Combining_Class=0, Decomposition_Type=None, East_Asian_Width=Narrow, Grapheme_Cluster_Break=Other, Identifier_Status=Allowed, Identifier_Type=Recommended, Indic_Conjunct_Break=None, Indic_Positional_Category=NA, Indic_Syllabic_Category=Number, Joining_Group=No_Joining_Group, Joining_Type=Non_Joining, Line_Break=NU, Numeric_Type=Decimal, Numeric_Value=1, Sentence_Break=Numeric, Vertical_Orientation=Rotated, and Word_Break=Numeric.
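A few of these properties can be inspected programmatically. The sketch below uses Python's standard unicodedata module, which exposes only a small subset of the properties listed above and reports them with their short value names (EN for European_Number, Na for Narrow). The properties helper is hypothetical and exists only for this illustration.

```python
import unicodedata


def properties(ch):
    """Collect the handful of Unicode properties that the unicodedata
    module exposes for a single code point."""
    return {
        "Name": unicodedata.name(ch),
        "General_Category": unicodedata.category(ch),
        "Bidi_Class": unicodedata.bidirectional(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        "East_Asian_Width": unicodedata.east_asian_width(ch),
        "Numeric_Value": unicodedata.numeric(ch, None),
    }


props = properties("\u0031")
for key, value in props.items():
    print(f"{key} = {value}")
```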

The Unicode standard recommends that regular expression engines support the \p{Property_Set=Property_Value} syntax to match any character that has the specified value for the specified property. For example, \p{Numeric_Value=1}+ should match all of 1¹١۱१১৴੧૧୧௧౧೧൧๑໑༡₁⅟Ⅰⅰ①⑴⒈❶➀➊〡㊀１. The syntax \p{Property} should be used to match any character that has the specified binary property. \p{Hex_Digit}+ should match 0123456789ABCDEFabcdef０１２３４５６７８９ＡＢＣＤＥＦａｂｃｄｅｆ. Unicode suggests this shorter notation as an alternative for categories and scripts. So you could use \p{Nd} instead of \p{gc=Nd} and \p{Common} instead of \p{Script=Common}.
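Python's built-in re module does not support the \p{...} syntax, so the following sketch only emulates what \p{Numeric_Value=1}+ is meant to do when applied to a whole string. It filters code points with unicodedata.numeric instead of using a regex. The function names are made up for this example, and the sample contains only a few of the characters listed above.

```python
import unicodedata


def has_numeric_value(ch, value):
    """True if the code point's Numeric_Value property equals value."""
    return unicodedata.numeric(ch, None) == value


def all_have_numeric_value(text, value):
    """Rough stand-in for fully matching \\p{Numeric_Value=value}+."""
    return bool(text) and all(has_numeric_value(ch, value) for ch in text)


sample = (
    "\u0031"   # DIGIT ONE
    "\u00B9"   # SUPERSCRIPT ONE
    "\u0661"   # ARABIC-INDIC DIGIT ONE
    "\u0967"   # DEVANAGARI DIGIT ONE
    "\u2160"   # ROMAN NUMERAL ONE
    "\u2460"   # CIRCLED DIGIT ONE
)
sample_matches = all_have_numeric_value(sample, 1)
mixed_matches = all_have_numeric_value("12", 1)
print(sample_matches, mixed_matches)
```

In a flavor that does support property sets, the regex itself replaces both helper functions.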

As with anything in regular expressions, regex flavors vary widely in what they actually support. The topics about each kind of Unicode property explain which flavors support exactly which properties. Every flavor that supports Unicode properties supports \p{N} and \p{Nd} to match all characters in a specific Unicode category, specifying just the one-letter or two-letter abbreviation for the category. Far fewer flavors support the more explicit syntax \p{gc=Nd}.

Many flavors also support Unicode scripts specifying just the name of the script. \p{Cyrillic} matches all characters in the Cyrillic script. Most, but not all, of those also support \p{Script=Cyrillic}.

Some flavors support Unicode blocks by specifying the name of the block with the prefix In. \p{InCyrillic} is equivalent to [\u{0400}-\u{04FF}], matching this entire block. Again, most, but not all, of those flavors also support \p{Block=Cyrillic}. The prefix is needed to differentiate the block from the Unicode script of the same name.
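In a flavor without block support, the explicit range does the same job. The sketch below uses Python's re module, which lacks \p{InCyrillic} but accepts \uXXXX escapes in character classes. It also shows why the block and the script are not interchangeable: U+0500 belongs to the Cyrillic script but sits in the Cyrillic Supplement block, just past the range. The variable names are illustrative.

```python
import re

# Equivalent of \p{InCyrillic}+ written as an explicit code point range.
cyrillic_block = re.compile(r"[\u0400-\u04FF]+")

in_block = cyrillic_block.fullmatch("\u041F\u0440\u0438\u0432\u0435\u0442")
latin = cyrillic_block.fullmatch("Privet")

# U+0500 CYRILLIC CAPITAL LETTER KOMI DE is in the Cyrillic script but
# outside the Cyrillic block, so the block range does not match it.
outside_block = cyrillic_block.fullmatch("\u0500")

print(in_block is not None, latin is not None, outside_block is not None)
```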

Support for binary Unicode properties and Unicode property sets other than categories, scripts, and blocks is much more limited. ICU and Perl are really the only regex flavors that support (nearly) all properties.

All flavors that support some Unicode properties also support the syntax with a capital P to negate the property, as recommended by Unicode. \P{Nd} matches any code point that is not in the Decimal_Number category. That includes unassigned code points. Perl, Ruby, PCRE, PCRE2, and flavors based on the latter two such as PHP, Delphi, and R, support an alternative syntax using a caret for negation. \p{^Nd} is another way of writing \P{Nd} in those flavors. Be careful not to negate the property twice. \P{^Nd} with double negation is the same as \p{Nd} without any negation.
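The same negation logic can be tried out with shorthand classes. In Python's re module, which has no \p{Nd}, the shorthand \d matches code points in the Nd category when the pattern is a Unicode string, and \D is its negation. The sketch below uses them as stand-ins for \p{Nd} and \P{Nd}, with [^\D] playing the role of the double negation. This is an analogy under those assumptions, not the \p syntax itself.

```python
import re

digits = re.compile(r"\d+")          # stand-in for \p{Nd}+
non_digit = re.compile(r"\D")        # stand-in for \P{Nd}
double_negated = re.compile(r"[^\D]+")  # negating the negation

# DIGIT ONE, ARABIC-INDIC DIGIT ONE, DEVANAGARI DIGIT ONE are all Nd.
nd_sample = "\u0031\u0661\u0967"
digits_match = digits.fullmatch(nd_sample) is not None

# SUPERSCRIPT ONE is in category No, not Nd, so the negated class matches.
superscript_is_non_digit = non_digit.fullmatch("\u00B9") is not None

# U+0378 is an unassigned code point; the negated class matches it too.
unassigned_is_non_digit = non_digit.fullmatch("\u0378") is not None

# Double negation behaves like the plain, non-negated class.
double_negation_match = double_negated.fullmatch(nd_sample) is not None
```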

Unicode Boundaries

Unicode Standard Annex 29 titled “Unicode Text Segmentation” defines rules for word boundaries, grapheme boundaries, and sentence boundaries. Perl supports all three in its regex flavor since Perl 5.22. ICU only supports the word boundaries. Java 9 and later support the grapheme boundaries.

Unicode Standard Annex 14 titled “Unicode Line Breaking Algorithm” defines an algorithm for finding potential word wrapping positions. Perl 5.24 and later can match these positions as line boundaries.
