Fingerprint token filter

Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.

For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows:

Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ].
Removes the duplicate instance of the very token.
Concatenates the token stream into a single output token: [ fox quick the very was ]
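
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's implementation: the function name fingerprint is hypothetical, and Python's default string ordering stands in for the filter's alphabetical sort.

```python
def fingerprint(tokens, separator=" "):
    """Sort tokens, drop duplicates, and join them into one string.

    Illustrative only: mirrors the sort / dedupe / concatenate steps
    described above, using Python's default string ordering.
    """
    unique_sorted = sorted(set(tokens))  # dedupe, then sort
    return separator.join(unique_sorted)


tokens = ["the", "fox", "was", "very", "very", "quick"]
print(fingerprint(tokens))  # fox quick the very was
```

Because the output does not depend on the order or repetition of the input tokens, two texts containing the same set of words yield the same string, which is what makes the result usable as a clustering key.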

Output tokens produced by this filter are useful for fingerprinting and clustering a body of text as described in the OpenRefine project.

This filter uses Lucene’s FingerprintFilter.

Example

The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:

  GET _analyze
  {
    "tokenizer" : "whitespace",
    "filter" : ["fingerprint"],
    "text" : "zebra jumps over resting resting dog"
  }

The filter produces the following token:

 [ dog jumps over resting zebra ]

Add to an analyzer

The following create index API request uses the fingerprint filter to configure a new custom analyzer.

  PUT fingerprint_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "whitespace_fingerprint": {
            "tokenizer": "whitespace",
            "filter": [ "fingerprint" ]
          }
        }
      }
    }
  }

Configurable parameters

max_output_size: (Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to 255. If the concatenated token is longer than this, the filter produces no output token.
separator: (Optional, string) Character used to join the input tokens into the output token. Defaults to a space.
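
To make the effect of the two parameters concrete, here is a rough Python sketch of the behavior described above. It is an assumption-laden model, not the filter's actual code: the function name fingerprint_limited is hypothetical, and the length check simply compares the joined string against max_output_size.

```python
from typing import Iterable, Optional


def fingerprint_limited(
    tokens: Iterable[str],
    max_output_size: int = 255,
    separator: str = " ",
) -> Optional[str]:
    """Model of the documented parameters (illustrative, not Lucene's code).

    Returns the joined token, or None when there is nothing to emit or the
    joined token is longer than max_output_size.
    """
    joined = separator.join(sorted(set(tokens)))
    if not joined or len(joined) > max_output_size:
        return None  # no token output
    return joined


words = "zebra jumps over resting resting dog".split()

# Custom separator: the joined token is 28 characters long.
print(fingerprint_limited(words, separator="+"))
# dog+jumps+over+resting+zebra

# A limit below 28 characters suppresses the output entirely.
print(fingerprint_limited(words, max_output_size=20, separator="+"))
# None
```

Note that in this model an oversized result is dropped rather than truncated, matching the description of max_output_size above.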

Customize

To customize the fingerprint filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom fingerprint filter that uses + to concatenate token streams. The filter also limits output tokens to 100 characters or fewer.

  PUT custom_fingerprint_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "whitespace_": {
            "tokenizer": "whitespace",
            "filter": [ "fingerprint_plus_concat" ]
          }
        },
        "filter": {
          "fingerprint_plus_concat": {
            "type": "fingerprint",
            "max_output_size": 100,
            "separator": "+"
          }
        }
      }
    }
  }