HTML strip character filter
Strips HTML elements from text and replaces HTML entities with their decoded value (e.g., replaces &amp; with &).
The html_strip filter uses Lucene’s HTMLStripCharFilter.
The following analyze API request uses the html_strip filter to change the text <p>I'm so <b>happy</b>!</p> to \nI'm so happy!\n.
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}

The filter produces the following text:
[ \nI'm so happy!\n ]

The following create index API request uses the html_strip filter to configure a new custom analyzer.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}

Configurable parameters
escaped_tags
(Optional, array of strings) Array of HTML elements without enclosing angle brackets (< >). The filter skips these HTML elements when stripping HTML from the text. For example, a value of [ "p" ] skips the <p> HTML element.
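One way to try escaped_tags without creating an index is to pass an inline html_strip definition to the analyze API. The request below is a minimal sketch, assuming the analyze API accepts a character filter given as an inline object; the sample text is reused from the earlier example.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": [ "p" ]
    }
  ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
```

With "p" listed in escaped_tags, the <p> and </p> tags should remain in the output while the <b> element is still stripped.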
To customize the html_strip filter, duplicate it to create the basis for a new custom character filter. You can modify the filter using its configurable parameters.
The following create index API request configures a new custom analyzer using a custom html_strip filter, my_custom_html_strip_char_filter.
The my_custom_html_strip_char_filter filter skips the removal of the <b> HTML element.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
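
To check the custom analyzer, one option is to run the analyze API against the new index. This sketch assumes the index was created with exactly the settings shown above; the sample text is reused from the first example on this page and is only illustrative.

```console
GET /my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}
```

Because <b> is listed in escaped_tags, the resulting text should keep the bold tags while the <p> element is still stripped:

```
[ \nI'm so <b>happy</b>!\n ]
```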