Synonym graph token filter
The synonym_graph token filter makes it easy to handle synonyms, including multi-word synonyms, correctly during the analysis process. To handle multi-word synonyms properly, this token filter creates a graph token stream during processing. For more information on this topic and its various complexities, read Lucene's TokenStreams are actually graphs blog post.
This token filter is designed to be used as part of a search analyzer only. If you want to apply synonyms during indexing, use the standard synonym token filter instead.
Define synonyms sets
Synonyms in a synonyms set are defined using synonym rules. Each synonym rule contains words that are synonyms.
You can use two formats to define synonym rules: Solr and WordNet.
Solr format
This format uses two different definitions:
- Equivalent synonyms: define groups of words that are equivalent. Words are separated by commas. Example:

    ipod, i-pod, i pod
    computer, pc, laptop

- Explicit mappings: match a group of words to other words. Words on the left hand side of the rule definition are expanded into all the possibilities described on the right hand side. Example:

    personal computer => pc
    sea biscuit, sea biscit => seabiscuit
WordNet format
WordNet defines synonyms sets spanning multiple lines. Each line contains the following information:
- Synonyms set numeric identifier
- Ordinal of the synonym in the synonyms set
- Synonym word
- Word type identifier: Noun (n), verb (v), adjective (a) or adverb (b).
- Depth of the word in the synonym net
The following example defines a synonyms set for the words "come", "advance" and "approach":

    s(100000002,1,'come',v,1,0).
    s(100000002,2,'advance',v,1,0).
    s(100000002,3,'approach',v,1,0).
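A token filter can load a file in this format by setting the format parameter to wordnet. A minimal sketch, assuming the rules above are stored in an illustrative file under the config directory:

```json
"filter": {
  "synonyms_filter": {
    "type": "synonym_graph",
    "format": "wordnet",
    "synonyms_path": "analysis/wn_s.pl"
  }
}
```

The filter name and file path here are placeholders; only the format value is significant.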
Configure synonyms sets
Synonyms can be configured using the synonyms API, a synonyms file, or directly inlined in the token filter configuration. See store your synonyms set for more details on each option.
Use the synonyms_set configuration option to provide a synonyms set created via the Synonyms Management APIs:

    "filter": {
      "synonyms_filter": {
        "type": "synonym_graph",
        "synonyms_set": "my-synonym-set",
        "updateable": true
      }
    }
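The referenced set can be created beforehand with the synonyms API. A minimal sketch, with an illustrative set name and rules:

```console
PUT _synonyms/my-synonym-set
{
  "synonyms_set": [
    { "synonyms": "ipod, i-pod, i pod" },
    { "synonyms": "personal computer => pc" }
  ]
}
```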
Use the synonyms_path configuration option to provide a synonyms file:

    "filter": {
      "synonyms_filter": {
        "type": "synonym_graph",
        "synonyms_path": "analysis/synonym-set.txt"
      }
    }

The above configures a synonyms_filter with a path of analysis/synonym-set.txt (relative to the config location).
Use the synonyms configuration option to define inline synonyms:

    "filter": {
      "synonyms_filter": {
        "type": "synonym_graph",
        "synonyms": ["pc => personal computer", "computer, pc, laptop"]
      }
    }
Additional settings are:
- updateable (defaults to false). If true, allows reloading search analyzers to pick up changes to synonym files. Only to be used for search analyzers.
- expand (defaults to true). If true, equivalent synonym rules map each term to every term in the rule; if false, they map only to the first term.
- lenient (defaults to false). If true, ignores exceptions while parsing the synonym configuration. Note that only those synonym rules which cannot be parsed are ignored. For instance, consider the following request:
    PUT /test_index
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "synonym": {
                "tokenizer": "standard",
                "filter": [ "my_stop", "synonym_graph" ]
              }
            },
            "filter": {
              "my_stop": {
                "type": "stop",
                "stopwords": [ "bar" ]
              },
              "synonym_graph": {
                "type": "synonym_graph",
                "lenient": true,
                "synonyms": [ "foo, bar => baz" ]
              }
            }
          }
        }
      }
    }
With the above request the word bar gets skipped but a mapping foo => baz is still added. However, if the mapping being added was foo, baz => bar, nothing would be added to the synonym list, because the target word of the mapping is itself eliminated as a stop word. Similarly, if the mapping was "bar, foo, baz" and expand was set to false, no mapping would be added, since with expand=false the target of the mapping is the first word. However, with expand=true the mappings added would be equivalent to foo, baz => foo, baz, i.e. all mappings other than the stop word.
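When updateable is true, changes to file-based or API-managed synonyms can be picked up without reopening the index via the reload search analyzers API. A minimal sketch, with an illustrative index name:

```console
POST /test_index/_reload_search_analyzers
```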
tokenizer and ignore_case are deprecated
The tokenizer parameter controls the tokenizer that will be used to tokenize the synonyms; it exists for backwards compatibility with indices created before 6.0. The ignore_case parameter works only in combination with the tokenizer parameter.
Configure analyzers with synonym graph token filters
To apply synonyms, you will need to include a synonym graph token filter in an analyzer:

    "analyzer": {
      "my_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "synonym_graph"]
      }
    }
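You can check how such an analyzer tokenizes text with the analyze API. A sketch, assuming the analyzer above is registered on an illustrative index named my-index:

```console
GET /my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "sea biscuit"
}
```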
Token filter ordering
Order is important for your token filters. Text is processed by the filters preceding the synonym filter before it reaches the synonym filter.
In the above example, text will be lowercased by the lowercase filter before being processed by the synonym_graph filter. This means that all the synonyms defined there need to be in lowercase, or they won't be found by the synonym filter.
The synonym rules should not contain words that are removed by a filter that appears later in the chain (like a stop filter). Removing a term from a synonym rule means there will be no matching for it at query time.
Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms. For example, asciifolding will only produce the folded version of the token. Others, like multiplexer, word_delimiter_graph or ngram, will throw an error.
If you need to build analyzers that include both multi-token filters and synonym filters, consider using the multiplexer filter, with the multi-token filters in one branch and the synonym filter in the other.
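One possible shape for that setup, sketched under the assumption that a word_delimiter_graph branch and a synonym branch are wanted; all filter and analyzer names here are illustrative:

```json
"filter": {
  "my_multiplexer": {
    "type": "multiplexer",
    "filters": [ "word_delimiter_graph", "my_synonyms" ]
  },
  "my_synonyms": {
    "type": "synonym_graph",
    "synonyms": [ "sea biscuit => seabiscuit" ]
  }
},
"analyzer": {
  "my_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [ "lowercase", "my_multiplexer" ]
  }
}
```

Each entry in the multiplexer's filters list is a branch; tokens are run through every branch and the outputs are stacked, so the synonym filter only sees the tokens from its own branch.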