HTML strip character filteredit
Strips HTML elements from a text and replaces HTML entities with their decoded
value (e.g, replaces &
with &
).
The html_strip
filter uses Lucene’s
HTMLStripCharFilter.
Exampleedit
The following analyze API request uses the
html_strip
filter to change the text <p>I'm so <b>happy</b>!</p>
to
\nI'm so happy!\n
.
GET /_analyze { "tokenizer": "keyword", "char_filter": [ "html_strip" ], "text": "<p>I'm so <b>happy</b>!</p>" }
The filter produces the following text:
[ \nI'm so happy!\n ]
Add to an analyzeredit
The following create index API request uses the
html_strip
filter to configure a new
custom analyzer.
PUT /my-index-000001 { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": [ "html_strip" ] } } } } }
Configurable parametersedit
-
escaped_tags
-
(Optional, array of strings)
Array of HTML elements without enclosing angle brackets (
< >
). The filter skips these HTML elements when stripping HTML from the text. For example, a value of[ "p" ]
skips the<p>
HTML element.
Customizeedit
To customize the html_strip
filter, duplicate it to create the basis for a new
custom character filter. You can modify the filter using its configurable
parameters.
The following create index API request
configures a new custom analyzer using a custom
html_strip
filter, my_custom_html_strip_char_filter
.
The my_custom_html_strip_char_filter
filter skips the removal of the <b>
HTML element.
PUT my-index-000001 { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": [ "my_custom_html_strip_char_filter" ] } }, "char_filter": { "my_custom_html_strip_char_filter": { "type": "html_strip", "escaped_tags": [ "b" ] } } } } }