Pattern analyzer
The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to `\W+`, which matches all non-word characters.
The pattern analyzer uses Java Regular Expressions.
A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
```
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

The above sentence would produce the following terms:

```
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
```
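Because the pattern matches the token separators, whatever the pattern does not match is kept as a term. As a minimal sketch of this (the index name comma_example and analyzer name comma_analyzer are only illustrative), an analyzer whose pattern is a single comma keeps everything between the commas:

```
PUT comma_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST comma_example/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "red apples,Green Pears,dried figs"
}
```

This would produce the terms [ red apples, green pears, dried figs ]: the commas are discarded, the spaces survive inside the terms, and the default lowercase setting still applies.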
The pattern analyzer accepts the following parameters:

- `pattern` - A Java regular expression, defaults to `\W+`.
- `flags` - Java regular expression flags. Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.
- `lowercase` - Should terms be lowercased or not. Defaults to `true`.
- `stopwords` - A pre-defined stop words list like `_english_` or an array containing a list of stop words. Defaults to `_none_`.
- `stopwords_path` - The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
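As a sketch of how several of these parameters combine (the index name my-index-000002, the analyzer name my_list_analyzer, and the pattern itself are only illustrative), the following configuration treats commas, semicolons, and the word "and" in any letter case as separators, lower-cases the remaining terms, and then removes English stop words:

```
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_list_analyzer": {
          "type": "pattern",
          "pattern": "\\s*(?:,|;|\\band\\b)\\s*",
          "flags": "CASE_INSENSITIVE",
          "lowercase": true,
          "stopwords": "_english_"
        }
      }
    }
  }
}
```

The `CASE_INSENSITIVE` flag is what lets both "and" and "AND" act as separators here.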
In this example, we configure the pattern analyzer to split email addresses on non-word characters or on underscores (\W|_), and to lower-case the result:
```
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
```

- The backslashes in the pattern need to be escaped when specifying the pattern as a JSON string.
The above example produces the following terms:
```
[ john, smith, foo, bar, com ]
```

The following more complicated example splits CamelCase text into tokens:
```
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my-index-000001/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
```

The above example produces the following terms:
```
[ moose, x, ftp, class, 2, beta ]
```

The regex above is easier to understand as:
```
  ([^\p{L}\d]+)                  # swallow non letters and numbers,

| (?<=\D)(?=\d)                  # or non-number followed by number,

| (?<=\d)(?=\D)                  # or number followed by non-number,

| (?<=[ \p{L} && [^\p{Lu}]])     # or lower case
  (?=\p{Lu})                     #   followed by upper case,

| (?<=\p{Lu})                    # or upper case
  (?=\p{Lu}                      #   followed by upper case
    [\p{L}&&[^\p{Lu}]]           #   then lower case
  )
```
The pattern analyzer consists of:

- Tokenizer
  - Pattern Tokenizer
- Token Filters
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)
If you need to customize the pattern analyzer beyond the configuration parameters, then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. The following configuration recreates the built-in pattern analyzer; you can use it as a starting point for further customization:
```
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
```

- The default pattern is `\W+`, which splits on non-word characters, and this is where you'd change it.
- You'd add other token filters after `lowercase`.
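As one sketch of that kind of customization (the index name pattern_example_with_stop is only illustrative), adding the built-in stop token filter after lowercase approximates the pattern analyzer with stop word removal enabled:

```
PUT /pattern_example_with_stop
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase",
            "stop"
          ]
        }
      }
    }
  }
}
```

The stop filter here runs with its default settings; define it as a custom filter if you need a different stop word list.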