Using the annotated-text field
The annotated-text field tokenizes text content as per the more common text field (see "limitations" below) but also injects any marked-up annotation tokens directly into the search index:
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}
Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text and structured tokens. The annotations use a markdown-like syntax, with URL encoding of one or more values separated by the & symbol.
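To make the syntax concrete, here is a minimal Python sketch that splits marked-up text into plain text plus decoded annotation values. It is an illustration only, not the plugin's parser: the function name `parse_annotated_text` and the regular expression are assumptions, and `unquote_plus` stands in for whatever URL decoding the plugin applies.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical pattern for [anchor text](value1&value2) in the markdown-like syntax.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")


def parse_annotated_text(markup):
    """Return (plain_text, annotations).

    Each annotation is (start_offset, end_offset, [decoded values]),
    with offsets measured against the plain text, not the markup.
    """
    plain_parts = []
    annotations = []
    plain_len = 0
    last_end = 0
    for match in ANNOTATION.finditer(markup):
        # Copy the unannotated text that precedes this annotation.
        before = markup[last_end:match.start()]
        plain_parts.append(before)
        plain_len += len(before)

        anchor = match.group(1)
        # Values are separated by "&" and URL-encoded ("+" decodes to a space).
        values = [unquote_plus(v) for v in match.group(2).split("&") if v]
        annotations.append((plain_len, plain_len + len(anchor), values))

        plain_parts.append(anchor)
        plain_len += len(anchor)
        last_end = match.end()
    plain_parts.append(markup[last_end:])
    return "".join(plain_parts), annotations


text, found = parse_annotated_text("Investors in [Apple](Apple+Inc.) rejoiced.")
print(text)   # Investors in Apple rejoiced.
print(found)  # [(13, 18, ['Apple Inc.'])]
```

Note that the offsets are relative to the text with the markup stripped out.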
We can use the _analyze API to test how an example annotation would be stored as tokens in the search index:
GET my-index-000001/_analyze
{
  "field": "my_field",
  "text": "Investors in [Apple](Apple+Inc.) rejoiced."
}
Response:
{
  "tokens": [
    { "token": "investors", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 },
    { "token": "in", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 1 },
    { "token": "Apple Inc.", "start_offset": 13, "end_offset": 18, "type": "annotation", "position": 2 },
    { "token": "apple", "start_offset": 13, "end_offset": 18, "type": "<ALPHANUM>", "position": 2 },
    { "token": "rejoiced", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 3 }
  ]
}
- Note the whole annotation token Apple Inc. is placed, unchanged, as a single token in the token stream and at the same position (position 2) as the text token (apple) it annotates.
We can now perform searches for annotations using regular term queries that don’t tokenize the provided search values. Annotations are a more precise way of matching, as can be seen in this example where a search for Beck will not match Jeff Beck:
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour" <1>
}

PUT my-index-000001/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat" <2>
}

# Example search
GET my-index-000001/_search
{
  "query": {
    "term": {
      "my_field": "Beck"
    }
  }
}
- As well as tokenising the plain text into single words e.g. beck, here we inject the single token value Beck at the same position as beck in the token stream.
- Note annotations can inject multiple tokens at the same position; here we inject both the very specific value Jeff Beck and the broader term Guitarist. This enables broader positional queries e.g. finding mentions of a Guitarist near to strat.
- A benefit of searching with these carefully defined annotation tokens is that a query for Beck will not match document 2 that contains the tokens jeff, beck and Jeff Beck.
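The matching behaviour described above can be imitated with a small Python sketch. It is a simplified model under stated assumptions, not the plugin's analysis chain: `analyze` and `term_query` are hypothetical helpers, words are split with a plain `\w+` pattern and lowercased, and annotation values are placed at the position of the first word they cover.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical pattern: either an annotation [anchor](values) or a plain word.
TOKEN = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)|\w+")


def analyze(markup):
    """Return (position, token, type) triples for one field value."""
    tokens = []
    position = 0
    for match in TOKEN.finditer(markup):
        if match.group(2) is not None:
            # Annotation values are indexed unchanged (no lowercasing).
            for value in match.group(2).split("&"):
                if value:
                    tokens.append((position, unquote_plus(value), "annotation"))
            words = re.findall(r"\w+", match.group(1))
        else:
            words = [match.group(0)]
        # Plain words are lowercased and each takes the next position.
        for word in words:
            tokens.append((position, word.lower(), "<ALPHANUM>"))
            position += 1
    return tokens


def term_query(docs, term):
    """Exact match of the un-analyzed term against the indexed tokens."""
    return [doc_id for doc_id, text in docs.items()
            if any(token == term for _, token, _ in analyze(text))]


docs = {
    "1": "[Beck](Beck) announced a new tour",
    "2": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat",
}
print(term_query(docs, "Beck"))       # ['1']
print(term_query(docs, "Guitarist"))  # ['2']
print(term_query(docs, "beck"))       # ['1', '2']
```

Because the search value is not analyzed, the capitalised Beck only meets the injected annotation token in document 1, while the lowercase text token beck is present in both documents.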
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will cause the document to be rejected with a parse failure. In future we hope to have a use for the equals sign, so documents that contain it are actively rejected today.
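If documents are generated programmatically, a client-side check before indexing can catch this early. The Python sketch below is an assumption about how such a check might look: `check_annotation_values` is a hypothetical helper, the error message is made up, and it inspects the raw (still URL-encoded) values rather than reproducing the server's parser.

```python
import re

# Hypothetical pattern for [anchor text](value1&value2) annotations.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")


def check_annotation_values(markup):
    """Raise ValueError if any raw annotation value contains an equals sign."""
    for match in ANNOTATION.finditer(markup):
        for value in match.group(2).split("&"):
            if "=" in value:
                # Illustrative message only; the server's wording may differ.
                raise ValueError(f"'=' is not allowed in annotation value {value!r}")


check_annotation_values("[Beck](Beck) announced a new tour")  # passes silently

try:
    check_annotation_values("[Prince](person=Prince)")
except ValueError as err:
    print("rejected:", err)
```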
Synthetic _source is Generally Available only for TSDB indices (indices that have index.mode set to time_series). For other indices synthetic _source is in technical preview. Features in technical preview may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
If using a keyword sub-field then the values are sorted in the same way as a keyword field’s values are sorted. By default, that means sorted with duplicates removed. So:
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
Will become:
{ "text": [ "jumped over the lazy dog", "the quick brown fox" ] }
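The effect on the returned array can be mimicked in a couple of lines of Python. This is only a sketch of the observable behaviour described above (sorted, duplicates removed); the function name is hypothetical and it does not model how Elasticsearch rebuilds _source internally.

```python
def synthetic_values_from_keyword_subfield(values):
    """Mimic the default keyword behaviour: sorted, with duplicates removed."""
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(synthetic_values_from_keyword_subfield(original))
# ['jumped over the lazy dog', 'the quick brown fox']
```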
Reordering text fields can have an effect on phrase and span queries. See the discussion about position_increment_gap for more detail. You can avoid this by making sure the slop parameter on the phrase queries is lower than the position_increment_gap. This is the default.
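To see why a slop below the gap keeps phrases from matching across array values, consider this rough Python model of token positions. It rests on assumptions: a gap of 100 (the default for text fields), whitespace tokenization, and a simplified in-order notion of slop; both helper names are hypothetical.

```python
def positions_with_gap(values, position_increment_gap=100):
    """Assign token positions across the values of a multi-valued field."""
    positioned = []
    position = -1
    for index, value in enumerate(values):
        if index > 0:
            # A gap is inserted between consecutive values.
            position += position_increment_gap
        for word in value.lower().split():
            position += 1
            positioned.append((position, word))
    return positioned


def slop_needed(positioned, first, second):
    """Smallest slop for the in-order phrase 'first second', or None."""
    best = None
    for pos_a, word_a in positioned:
        if word_a != first:
            continue
        for pos_b, word_b in positioned:
            if word_b == second and pos_b > pos_a:
                needed = pos_b - pos_a - 1
                best = needed if best is None else min(best, needed)
    return best


tokens = positions_with_gap(["the quick brown fox", "jumped over the lazy dog"])
print(slop_needed(tokens, "quick", "brown"))  # 0
print(slop_needed(tokens, "fox", "jumped"))   # 100
```

In this model, terms within one value match with a slop of 0, while terms from neighbouring values need a slop at least as large as the gap, so reordering the values cannot create or remove matches for phrase queries whose slop stays below it.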
If the annotated_text field sets store to true then order and duplicates are preserved.
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text",
        "store": true
      }
    }
  }
}

PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
Will become:
{ "text": [ "the quick brown fox", "the quick brown fox", "jumped over the lazy dog" ] }