Blame - third_party/pcre/files/HACKING - syzygy

blob: 8c6a42de399a4fd0c5e5c08f6716bd7fe1ce31f2 [file] [log] [blame]

Sigurdur Asgeirsson	fe15d50	2010-01-18 20:18:38	[diff] [blame]	1	Technical Notes about PCRE
				2	--------------------------
				3
				4	These are very rough technical notes that record potentially useful information
				5	about PCRE internals.
				6
				7	Historical note 1
				8	-----------------
				9
				10	Many years ago I implemented some regular expression functions to an algorithm
				11	suggested by Martin Richards. These were not Unix-like in form, and were quite
				12	restricted in what they could do by comparison with Perl. The interesting part
				13	about the algorithm was that the amount of space required to hold the compiled
				14	form of an expression was known in advance. The code to apply an expression did
				15	not operate by backtracking, as the original Henry Spencer code and current
				16	Perl code does, but instead checked all possibilities simultaneously by keeping
				17	a list of current states and checking all of them as it advanced through the
				18	subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
				19	algorithm", though it was not a traditional Finite State Machine (FSM). When
				20	the pattern was all used up, all remaining states were possible matches, and
				21	the one matching the longest subset of the subject string was chosen. This did
				22	not necessarily maximize the individual wild portions of the pattern, as is
				23	expected in Unix and Perl-style regular expressions.
				24
				25	Historical note 2
				26	-----------------
				27
				28	By contrast, the code originally written by Henry Spencer (which was
				29	subsequently heavily modified for Perl) compiles the expression twice: once in
				30	a dummy mode in order to find out how much store will be needed, and then for
				31	real. (The Perl version probably doesn't do this any more; I'm talking about
				32	the original library.) The execution function operates by backtracking and
				33	maximizing (or, optionally, minimizing in Perl) the amount of the subject that
				34	matches individual wild portions of the pattern. This is an "NFA algorithm" in
				35	Friedl's terminology.
				36
				37	OK, here's the real stuff
				38	-------------------------
				39
				40	For the set of functions that form the "basic" PCRE library (which are
				41	unrelated to those mentioned above), I tried at first to invent an algorithm
				42	that used an amount of store bounded by a multiple of the number of characters
				43	in the pattern, to save on compiling time. However, because of the greater
				44	complexity in Perl regular expressions, I couldn't do this. In any case, a
				45	first pass through the pattern is helpful for other reasons.
				46
				47	Computing the memory requirement: how it was
				48	--------------------------------------------
				49
				50	Up to and including release 6.7, PCRE worked by running a very degenerate first
				51	pass to calculate a maximum store size, and then a second pass to do the real
				52	compile - which might use a bit less than the predicted amount of memory. The
				53	idea was that this would turn out faster than the Henry Spencer code because
				54	the first pass is degenerate and the second pass can just store stuff straight
				55	into the vector, which it knows is big enough.
				56
				57	Computing the memory requirement: how it is
				58	-------------------------------------------
				59
				60	By the time I was working on a potential 6.8 release, the degenerate first pass
				61	had become very complicated and hard to maintain. Indeed one of the early
				62	things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
				63	I had a flash of inspiration as to how I could run the real compile function in
				64	a "fake" mode that enables it to compute how much memory it would need, while
				65	actually only ever using a few hundred bytes of working memory, and without too
				66	many tests of the mode that might slow it down. So I re-factored the compiling
				67	functions to work this way. This got rid of about 600 lines of source. It
				68	should make future maintenance and development easier. As this was such a major
				69	change, I never released 6.8, instead upping the number to 7.0 (other quite
				70	major changes were also present in the 7.0 release).
				71
				72	A side effect of this work was that the previous limit of 200 on the nesting
				73	depth of parentheses was removed. However, there is a downside: pcre_compile()
				74	runs more slowly than before (30% or more, depending on the pattern) because it
				75	is doing a full analysis of the pattern. My hope was that this would not be a
				76	big issue, and in the event, nobody has commented on it.
				77
				78	Traditional matching function
				79	-----------------------------
				80
				81	The "traditional", and original, matching function is called pcre_exec(), and
				82	it implements an NFA algorithm, similar to the original Henry Spencer algorithm
				83	and the way that Perl works. This is not surprising, since it is intended to be
				84	as compatible with Perl as possible. This is the function most users of PCRE
				85	will use most of the time.
				86
				87	Supplementary matching function
				88	-------------------------------
				89
				90	From PCRE 6.0, there is also a supplementary matching function called
				91	pcre_dfa_exec(). This implements a DFA matching algorithm that searches
				92	simultaneously for all possible matches that start at one point in the subject
				93	string. (Going back to my roots: see Historical Note 1 above.) This function
				94	intreprets the same compiled pattern data as pcre_exec(); however, not all the
				95	facilities are available, and those that are do not always work in quite the
				96	same way. See the user documentation for details.
				97
				98	The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
				99	because it may have a number of states active at one time. More work would be
				100	needed at compile time to produce a traditional FSM where only one state is
				101	ever active at once. I believe some other regex matchers work this way.
				102
				103
				104	Format of compiled patterns
				105	---------------------------
				106
				107	The compiled form of a pattern is a vector of bytes, containing items of
				108	variable length. The first byte in an item is an opcode, and the length of the
				109	item is either implicit in the opcode or contained in the data bytes that
				110	follow it.
				111
				112	In many cases below LINK_SIZE data values are specified for offsets within the
				113	compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be
				114	compiled to use 3-byte or 4-byte values for these offsets (impairing the
				115	performance). This is necessary only when patterns whose compiled length is
				116	greater than 64K are going to be processed. In this description, we assume the
				117	"normal" compilation options. Data values that are counts (e.g. for
				118	quantifiers) are always just two bytes long.
				119
				120	A list of the opcodes follows:
				121
				122
				123	Opcodes with no following data
				124	------------------------------
				125
				126	These items are all just one byte long
				127
				128	OP_END end of pattern
				129	OP_ANY match any one character other than newline
				130	OP_ALLANY match any one character, including newline
				131	OP_ANYBYTE match any single byte, even in UTF-8 mode
				132	OP_SOD match start of data: \A
				133	OP_SOM, start of match (subject + offset): \G
				134	OP_SET_SOM, set start of match (\K)
				135	OP_CIRC ^ (start of data, or after \n in multiline)
				136	OP_NOT_WORD_BOUNDARY \W
				137	OP_WORD_BOUNDARY \w
				138	OP_NOT_DIGIT \D
				139	OP_DIGIT \d
				140	OP_NOT_HSPACE \H
				141	OP_HSPACE \h
				142	OP_NOT_WHITESPACE \S
				143	OP_WHITESPACE \s
				144	OP_NOT_VSPACE \V
				145	OP_VSPACE \v
				146	OP_NOT_WORDCHAR \W
				147	OP_WORDCHAR \w
				148	OP_EODN match end of data or \n at end: \Z
				149	OP_EOD match end of data: \z
				150	OP_DOLL $ (end of data, or before \n in multiline)
				151	OP_EXTUNI match an extended Unicode character
				152	OP_ANYNL match any Unicode newline sequence
				153
				154	OP_ACCEPT ) These are Perl 5.10's "backtracking
				155	OP_COMMIT ) control verbs". If OP_ACCEPT is inside
				156	OP_FAIL ) capturing parentheses, it may be preceded
				157	OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
				158	OP_SKIP ) number, indicating which parentheses must be
				159	OP_THEN ) closed.
				160
				161
				162	Repeating single characters
				163	---------------------------
				164
				165	The common repeats (*, +, ?) when applied to a single character use the
				166	following opcodes:
				167
				168	OP_STAR
				169	OP_MINSTAR
				170	OP_POSSTAR
				171	OP_PLUS
				172	OP_MINPLUS
				173	OP_POSPLUS
				174	OP_QUERY
				175	OP_MINQUERY
				176	OP_POSQUERY
				177
				178	In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.
				179	Those with "MIN" in their name are the minimizing versions. Those with "POS" in
				180	their names are possessive versions. Each is followed by the character that is
				181	to be repeated. Other repeats make use of
				182
				183	OP_UPTO
				184	OP_MINUPTO
				185	OP_POSUPTO
				186	OP_EXACT
				187
				188	which are followed by a two-byte count (most significant first) and the
				189	repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
				190	non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
				191	OP_UPTO (or OP_MINUPTO or OPT_POSUPTO).
				192
				193
				194	Repeating character types
				195	-------------------------
				196
				197	Repeats of things like \d are done exactly as for single characters, except
				198	that instead of a character, the opcode for the type is stored in the data
				199	byte. The opcodes are:
				200
				201	OP_TYPESTAR
				202	OP_TYPEMINSTAR
				203	OP_TYPEPOSSTAR
				204	OP_TYPEPLUS
				205	OP_TYPEMINPLUS
				206	OP_TYPEPOSPLUS
				207	OP_TYPEQUERY
				208	OP_TYPEMINQUERY
				209	OP_TYPEPOSQUERY
				210	OP_TYPEUPTO
				211	OP_TYPEMINUPTO
				212	OP_TYPEPOSUPTO
				213	OP_TYPEEXACT
				214
				215
				216	Match by Unicode property
				217	-------------------------
				218
				219	OP_PROP and OP_NOTPROP are used for positive and negative matches of a
				220	character by testing its Unicode property (the \p and \P escape sequences).
				221	Each is followed by two bytes that encode the desired property as a type and a
				222	value.
				223
				224	Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
				225	three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
				226	value.
				227
				228
				229	Matching literal characters
				230	---------------------------
				231
				232	The OP_CHAR opcode is followed by a single character that is to be matched
				233	casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the
				234	character may be more than one byte long. (Earlier versions of PCRE used
				235	multi-character strings, but this was changed to allow some new features to be
				236	added.)
				237
				238
				239	Character classes
				240	-----------------
				241
				242	If there is only one character, OP_CHAR or OP_CHARNC is used for a positive
				243	class, and OP_NOT for a negative one (that is, for something like [^a]).
				244	However, in UTF-8 mode, the use of OP_NOT applies only to characters with
				245	values < 128, because OP_NOT is confined to single bytes.
				246
				247	Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
				248	negated, single-character class. The normal ones (OP_STAR etc.) are used for a
				249	repeated positive single-character class.
				250
				251	When there's more than one character in a class and all the characters are less
				252	than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
				253	one. In either case, the opcode is followed by a 32-byte bit map containing a 1
				254	bit for every character that is acceptable. The bits are counted from the least
				255	significant end of each byte.
				256
				257	The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
				258	subject characters with values greater than 256 can be handled correctly. For
				259	OP_CLASS they don't match, whereas for OP_NCLASS they do.
				260
				261	For classes containing characters with values > 255, OP_XCLASS is used. It
				262	optionally uses a bit map (if any characters lie within it), followed by a list
				263	of pairs and single characters. There is a flag character than indicates
				264	whether it's a positive or a negative class.
				265
				266
				267	Back references
				268	---------------
				269
				270	OP_REF is followed by two bytes containing the reference number.
				271
				272
				273	Repeating character classes and back references
				274	-----------------------------------------------
				275
				276	Single-character classes are handled specially (see above). This section
				277	applies to OP_CLASS and OP_REF. In both cases, the repeat information follows
				278	the base item. The matching code looks at the following opcode to see if it is
				279	one of
				280
				281	OP_CRSTAR
				282	OP_CRMINSTAR
				283	OP_CRPLUS
				284	OP_CRMINPLUS
				285	OP_CRQUERY
				286	OP_CRMINQUERY
				287	OP_CRRANGE
				288	OP_CRMINRANGE
				289
				290	All but the last two are just single-byte items. The others are followed by
				291	four bytes of data, comprising the minimum and maximum repeat counts. There are
				292	no special possessive opcodes for these repeats; a possessive repeat is
				293	compiled into an atomic group.
				294
				295
				296	Brackets and alternation
				297	------------------------
				298
				299	A pair of non-capturing (round) brackets is wrapped round each expression at
				300	compile time, so alternation always happens in the context of brackets.
				301
				302	[Note for North Americans: "bracket" to some English speakers, including
				303	myself, can be round, square, curly, or pointy. Hence this usage.]
				304
				305	Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
				306	capturing brackets and it used a different opcode for each one. From release
				307	3.5, the limit was removed by putting the bracket number into the data for
				308	higher-numbered brackets. From release 7.0 all capturing brackets are handled
				309	this way, using the single opcode OP_CBRA.
				310
				311	A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
				312	next alternative OP_ALT or, if there aren't any branches, to the matching
				313	OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
				314	the next one, or to the OP_KET opcode. For capturing brackets, the bracket
				315	number immediately follows the offset, always as a 2-byte item.
				316
				317	OP_KET is used for subpatterns that do not repeat indefinitely, while
				318	OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
				319	maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
				320	positive number) the offset back to the matching bracket opcode.
				321
				322	If a subpattern is quantified such that it is permitted to match zero times, it
				323	is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
				324	single-byte opcodes that tell the matcher that skipping the following
				325	subpattern entirely is a valid branch. In the case of the first two, not
				326	skipping the pattern is also valid (greedy and non-greedy). The third is used
				327	when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
				328	because it may be called as a subroutine from elsewhere in the regex.
				329
				330	A subpattern with an indefinite maximum repetition is replicated in the
				331	compiled data its minimum number of times (or once with OP_BRAZERO if the
				332	minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
				333	as appropriate.
				334
				335	A subpattern with a bounded maximum repetition is replicated in a nested
				336	fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
				337	before each replication after the minimum, so that, for example, (abc){2,5} is
				338	compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
				339	has the same number.
				340
				341	When a repeated subpattern has an unbounded upper limit, it is checked to see
				342	whether it could match an empty string. If this is the case, the opcode in the
				343	final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
				344	that it needs to check for matching an empty string when it hits OP_KETRMIN or
				345	OP_KETRMAX, and if so, to break the loop.
				346
				347
				348	Assertions
				349	----------
				350
				351	Forward assertions are just like other subpatterns, but starting with one of
				352	the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
				353	OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
				354	is OP_REVERSE, followed by a two byte count of the number of characters to move
				355	back the pointer in the subject string. When operating in UTF-8 mode, the count
				356	is a character count rather than a byte count. A separate count is present in
				357	each alternative of a lookbehind assertion, allowing them to have different
				358	fixed lengths.
				359
				360
				361	Once-only (atomic) subpatterns
				362	------------------------------
				363
				364	These are also just like other subpatterns, but they start with the opcode
				365	OP_ONCE. The check for matching an empty string in an unbounded repeat is
				366	handled entirely at runtime, so there is just this one opcode.
				367
				368
				369	Conditional subpatterns
				370	-----------------------
				371
				372	These are like other subpatterns, but they start with the opcode OP_COND, or
				373	OP_SCOND for one that might match an empty string in an unbounded repeat. If
				374	the condition is a back reference, this is stored at the start of the
				375	subpattern using the opcode OP_CREF followed by two bytes containing the
				376	reference number. OP_NCREF is used instead if the reference was generated by
				377	name (so that the runtime code knows to check for duplicate names).
				378
				379	If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
				380	group x" (coded as "(?(Rx)"), the group number is stored at the start of the
				381	subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of
				382	zero for "the whole pattern". For a DEFINE condition, just the single byte
				383	OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern
				384	always starts with one of the assertions.
				385
				386
				387	Recursion
				388	---------
				389
				390	Recursion either matches the current regex, or some subexpression. The opcode
				391	OP_RECURSE is followed by an value which is the offset to the starting bracket
				392	from the start of the whole pattern. From release 6.5, OP_RECURSE is
				393	automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
				394	broke it). OP_RECURSE is also used for "subroutine" calls, even though they
				395	are not strictly a recursion.
				396
				397
				398	Callout
				399	-------
				400
				401	OP_CALLOUT is followed by one byte of data that holds a callout number in the
				402	range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
				403	cases there follows a two-byte value giving the offset in the pattern to the
				404	start of the following item, and another two-byte item giving the length of the
				405	next item.
				406
				407
				408	Changing options
				409	----------------
				410
				411	If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT
				412	opcode is compiled, followed by one byte containing the new settings of these
				413	flags. If there are several alternatives, there is an occurrence of OP_OPT at
				414	the start of all those following the first options change, to set appropriate
				415	options for the start of the alternative. Immediately after the end of the
				416	group there is another such item to reset the flags to their previous values. A
				417	change of flag right at the very start of the pattern can be handled entirely
				418	at compile time, and so does not cause anything to be put into the compiled
				419	data.
				420
				421	Philip Hazel
				422	October 2009