Elasticsearch Tokenizers – Partial Word Tokenizers

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers

In this tutorial, we're gonna look at two tokenizers that can break up text or words into small fragments for partial word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.

I. N-Gram Tokenizer

The ngram tokenizer does two things:

  • breaks up text into words when it encounters specified characters (whitespace, punctuation, ...)
  • emits N-grams of the specified length for each word (e.g. quick with length = 2 -> [qu, ui, ic, ck])

=> N-grams are like a sliding window of continuous letters.

For example:

 POST _analyze
 {
     "tokenizer": "ngram",
     "text": "Spring 5"
 }

It will generate terms with a sliding window from 1 character (min width) to 2 characters (max width):

 [ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ] 

Configuration

  • min_gram: minimum length of characters in a gram (min width of the sliding window). Defaults to 1.
  • max_gram: maximum length of characters in a gram (max width of the sliding window). Defaults to 2.
  • token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don’t belong to:
      • letter (a, b, ...)
      • digit (1, 2, ...)
      • whitespace (" ", "\n", ...)
      • punctuation (!, ", ...)
      • symbol ($, %, ...)
    Defaults to [] (keep all characters).

For example, we will create a tokenizer with a sliding window of width 3 and only the letter & digit character classes:

 PUT jsa_index_n-gram
 {
     "settings": {
         "analysis": {
             "analyzer": {
                 "jsa_analyzer": {
                     "tokenizer": "jsa_tokenizer"
                 }
             },
             "tokenizer": {
                 "jsa_tokenizer": {
                     "type": "ngram",
                     "min_gram": 3,
                     "max_gram": 3,
                     "token_chars": [ "letter", "digit" ]
                 }
             }
         }
     }
 }

 POST jsa_index_n-gram/_analyze
 {
     "analyzer": "jsa_analyzer",
     "text": "Tut101: Spring 5"
 }

Terms:

 [ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]

The "5" produces no grams because it is shorter than min_gram.
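Note that if min_gram and max_gram differ, recent Elasticsearch versions limit how far apart they may be through the index-level index.max_ngram_diff setting (it defaults to 1). Below is a minimal sketch of raising that limit for a wider window; the index and tokenizer names (jsa_index_wide_gram, jsa_wide_tokenizer) are illustrative placeholders, not part of the original tutorial:

 PUT jsa_index_wide_gram
 {
     "settings": {
         "index": {
             "max_ngram_diff": 4
         },
         "analysis": {
             "tokenizer": {
                 "jsa_wide_tokenizer": {
                     "type": "ngram",
                     "min_gram": 1,
                     "max_gram": 5,
                     "token_chars": [ "letter", "digit" ]
                 }
             }
         }
     }
 }

With min_gram 1 and max_gram 5 the difference is 4, so index.max_ngram_diff must be at least 4 for the index creation to succeed.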
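For a quick contrast with the Edge N-Gram Tokenizer mentioned in the introduction (covered in detail in the linked article), here is a minimal sketch using the built-in edge_ngram tokenizer, which only emits grams anchored to the start of each token:

 POST _analyze
 {
     "tokenizer": "edge_ngram",
     "text": "Spring 5"
 }

With the defaults (min_gram 1, max_gram 2, token_chars []), this should emit only [ "S", "Sp" ].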

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers
