https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers
Elasticsearch Tokenizers – Partial Word Tokenizers
In this tutorial, we're gonna look at 2 tokenizers that can break up text or words into small fragments for partial word matching: N-Gram Tokenizer and Edge N-Gram Tokenizer.
I. N-Gram Tokenizer
The ngram tokenizer does 2 things:
- break up text into words when it encounters specified characters (whitespace, punctuation...)
- emit N-grams of the specified length for each word (quick with length = 2 -> [qu, ui, ic, ck])
=> N-grams are like a sliding window of continuous characters that moves across the word.
For example:
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Spring 5"
}

It will generate terms with a sliding window (1 char min-width, 2 chars max-width):

[ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ]

Note that grams containing whitespace ("g ", " ", " 5") appear because, by default, no character classes are excluded.

Configuration
- min_gram: minimum length of characters in a gram (min-width of the sliding window). Defaults to 1.
- max_gram: maximum length of characters in a gram (max-width of the sliding window). Defaults to 2.
- token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don't belong to any of the specified classes:
  - letter (a, b, ...)
  - digit (1, 2, ...)
  - whitespace (" ", "\n", ...)
  - punctuation (!, ", ...)
  - symbol ($, %, ...)
  Defaults to [] (keep all characters).
For example, we will create a tokenizer with a fixed sliding window (width = 3) that keeps only the letter & digit character classes.
PUT jsa_index_n-gram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}

POST jsa_index_n-gram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Spring 5"
}

Terms:

[ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]

The ":" and the whitespace act as split points (they are neither letter nor digit), and "5" produces no gram because it is shorter than min_gram.
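To actually get partial word matching at search time, the analyzer is attached to a field mapping. Below is a minimal sketch, not part of the original example: the index name jsa_index_partial, the title field, and the sample document are made up for illustration, a lowercase token filter is added so matching is case-insensitive, and the mapping uses Elasticsearch 7+ syntax. The standard analyzer is set as search_analyzer so that the query text itself is not broken into N-grams:

# Hypothetical index and field, for illustration only
PUT jsa_index_partial
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "jsa_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

# Index a sample document, then search with a 3-character fragment
PUT jsa_index_partial/_doc/1
{ "title": "Spring 5 Tutorial" }

GET jsa_index_partial/_search
{
  "query": {
    "match": { "title": "Spr" }
  }
}

The match query for "Spr" finds the document because "spr" is one of the indexed grams of "Spring". Note that with min_gram = max_gram = 3, only 3-character fragments can match directly; widening the range (subject to the index's max_ngram_diff setting) lets longer partial queries match, at the cost of a larger index.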
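II. Edge N-Gram Tokenizer

The edge_ngram tokenizer works like ngram, except that every gram is anchored to the beginning of the word, which makes it a popular choice for search-as-you-type. A quick sketch using the default settings (min_gram = 1, max_gram = 2, token_chars = []):

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Spring 5"
}

Terms:

[ "S", "Sp" ]

Because token_chars defaults to [] (no splitting), the whole string "Spring 5" is treated as one token, and only its first 1- and 2-character grams are emitted; min_gram, max_gram and token_chars can be configured exactly as shown for ngram above.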