While stemming helps to broaden the scope of search by simplifying inflected words to their root form, synonyms broaden the scope by relating concepts and ideas. Perhaps no documents match a query for “English queen,” but documents that contain “British monarch” would probably be considered a good match.
A user might search for “the US” and expect to find documents that contain United States, USA, U.S.A., America, or the States. However, they wouldn’t expect to see results about the states of matter or state machines.
This example provides a valuable lesson. It demonstrates how simple it is for a human to distinguish between separate concepts, and how tricky it can be for mere machines. The natural tendency is to try to provide synonyms for every word in the language, to ensure that any document is findable with even the most remotely related terms.
This is a mistake. In the same way that we prefer light or minimal stemming to aggressive stemming, synonyms should be used only where necessary. Users understand why their results are limited to the words in their search query. They are less understanding when their results seem almost random.
Synonyms can be used to conflate words that have pretty much the same meaning, such as jump, leap, and hop, or pamphlet, leaflet, and brochure.
Alternatively, they can be used to make a word more generic. For instance, bird could be used as a more general synonym for owl or pigeon, and adult could be used for man or woman.
Synonyms appear to be a simple concept but they are quite tricky to get right. In this chapter, we explain the mechanics of using synonyms and discuss the limitations and gotchas.
Synonyms are used to broaden the scope of what is considered a matching document. Just as with stemming or partial matching, synonym fields should not be used alone but should be combined with a query on a main field that contains the original text in unadulterated form. See “Most Fields” for an explanation of how to maintain relevance when using synonyms.
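As a rough illustration (the my_type type and field names here are placeholders, and the my_synonyms analyzer is assumed to be defined in the index settings, as in the example that follows), a main field plus a synonym-analyzed subfield could be queried together with most_fields:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "synonyms": {
          "type": "string",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "English queen",
      "type": "most_fields",
      "fields": [ "text", "text.synonyms" ]
    }
  }
}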
Synonyms can replace existing tokens or be added to the token stream by using the synonym token filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
First, we define a token filter of type synonym. We discuss synonym formats in “Formatting Synonyms”. Then we create a custom analyzer that uses the my_synonym_filter.
Synonyms can be specified inline with the synonyms parameter, or in a synonyms file that must be present on every node in the cluster. The path to the synonyms file should be specified with the synonyms_path parameter, and should be either absolute or relative to the Elasticsearch config directory. See “Updating Stopwords” for techniques that can be used to refresh the synonyms list.
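For example, a minimal sketch of the file-based form (the path analysis/synonyms.txt is just an illustration) would look like this:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}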
Testing our analyzer with the analyze API shows the following:
GET /my_index/_analyze?analyzer=my_synonyms
Elizabeth is the English queen
Pos 1: (elizabeth)
Pos 2: (is)
Pos 3: (the)
Pos 4: (british,english)
Pos 5: (queen,monarch)
A document like this will match queries for any of the following: English queen, British queen, English monarch, or British monarch. Even a phrase query will work, because the position of each term has been preserved.
Using the same synonym token filter at both index time and search time is redundant. If, at index time, we replace English with the two terms english and british, then at search time we need to search for only one of those terms. Alternatively, if we don’t use synonyms at index time, then at search time, we would need to convert a query for English into a query for english OR british.
Whether to do synonym expansion at search or index time can be a difficult choice. We will explore the options more in “Expand or contract”.
In their simplest form, synonyms are listed as comma-separated values:
"jump,leap,hop"
If any of these terms is encountered, it is replaced by all of the listed synonyms. For instance:
Original terms:   Replaced by:
────────────────────────────────
jump              → (jump,leap,hop)
leap              → (jump,leap,hop)
hop               → (jump,leap,hop)
Alternatively, with the => syntax, it is possible to specify a list of terms to match (on the left side), and a list of one or more replacements (on the right side):
"u s a,united states,united states of america => usa" "g b,gb,great britain => britain,england,scotland,wales"
Original terms:   Replaced by:
────────────────────────────────
u s a             → (usa)
united states     → (usa)
great britain     → (britain,england,scotland,wales)
If multiple rules for the same synonyms are specified, they are merged together. The order of rules is not respected. Instead, the longest matching rule wins. Take the following rules as an example:
"united states => usa", "united states of america => usa"
If these rules conflicted, Elasticsearch would turn United States of America into the terms (usa),(of),(america). Instead, the longest sequence wins, and we end up with just the term (usa).
In “Formatting Synonyms”, we have seen that it is possible to replace synonyms by simple expansion, simple contraction, or genre expansion. We will look at the trade-offs of each of these techniques in this section.
With simple expansion, any of the listed synonyms is expanded into all of the listed synonyms:
"jump,hop,leap"
Expansion can be applied either at index time or at query time. Each has advantages (⬆︎) and disadvantages (⬇︎). When to use which comes down to performance versus flexibility.
|             | Index time | Query time |
|-------------|------------|------------|
| Index size  | ⬇︎ Bigger index because all synonyms must be indexed. | ⬆︎ Normal. |
| Relevance   | ⬇︎ All synonyms will have the same IDF (see “What Is Relevance?”), meaning that more commonly used words will have the same weight as less commonly used words. | ⬆︎ The IDF for each synonym will be correct. |
| Performance | ⬆︎ A query needs to find only the single term specified in the query string. | ⬇︎ A query for a single term is rewritten to look up all synonyms, which decreases performance. |
| Flexibility | ⬇︎ The synonym rules can’t be changed for existing documents. For new rules to take effect, existing documents have to be reindexed. | ⬆︎ Synonym rules can be updated without reindexing documents. |
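For example, one way to get query-time-only expansion (a sketch; the my_synonyms analyzer is assumed to be defined in the index settings, and my_type and text are placeholder names) is to index with the standard analyzer and apply the synonym analyzer only at search time:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "analyzer": "standard",
      "search_analyzer": "my_synonyms"
    }
  }
}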
Simple contraction maps a group of synonyms on the left side to a single value on the right side:
"leap,hop => jump"
It must be applied both at index time and at query time, to ensure that query terms are mapped to the same single value that exists in the index.
This approach has some advantages and some disadvantages compared to the simple expansion approach:
⬆︎ The index size is normal, as only a single term is indexed.
⬇︎ The IDF for all terms is the same, so you can’t distinguish between more commonly used words and less commonly used words.
⬆︎ A query needs to find only the single term that appears in the index.
⬆︎ New synonyms can be added to the left side of the rule and applied at query time. For instance, imagine that we wanted to add the word bound to the rule specified previously. The following rule would work for queries that contain bound or for newly added documents that contain bound:
"leap,hop,bound => jump"
But we could expand the effect to also take into account existing documents that contain bound by writing the rule as follows:
"leap,hop,bound => jump,bound"
When you reindex your documents, you could revert to the previous rule to gain the performance benefit of querying only a single term.
Genre expansion is quite different from simple contraction or expansion. Instead of treating all synonyms as equal, genre expansion widens the meaning of a term to be more generic. Take these rules, for example:
"cat => cat,pet", "kitten => kitten,cat,pet", "dog => dog,pet" "puppy => puppy,dog,pet"
By applying genre expansion at index time:
A query for kitten would find just documents about kittens.
A query for cat would find documents about kittens and cats.
A query for pet would find documents about kittens, cats, puppies, dogs, or pets.
Alternatively, by applying genre expansion at query time, a query for kitten would be expanded to return documents that mention kittens, cats, or pets specifically.
You could also have the best of both worlds by applying expansion at index time to ensure that the genres are present in the index. Then, at query time, you can choose to not apply synonyms (so that a query for kitten returns only documents about kittens) or to apply synonyms in order to match kittens, cats, and pets (including the canine variety).
With the preceding example rules, the IDF for kitten will be correct, while the IDF for cat and pet will be artificially deflated. However, this works in your favor—a genre-expanded query for kitten OR cat OR pet will rank documents with kitten highest, followed by documents with cat, and documents with pet would be right at the bottom.
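For instance, if genre expansion has been applied at index time with an analyzer like my_synonyms, a query can skip synonyms simply by overriding the analyzer (a sketch; the text field name is assumed):

GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "kitten",
        "analyzer": "standard"
      }
    }
  }
}

Running the same query without the analyzer override would use the genre-expanded synonyms and also match documents about cats and pets.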
The example we showed in “Formatting Synonyms” used u s a as a synonym. Why did we use that instead of U.S.A.? The reason is that the synonym token filter sees only the terms that the previous token filter or tokenizer has emitted.
Imagine that we have an analyzer that consists of the standard tokenizer, with the lowercase token filter followed by a synonym token filter. The analysis process for the text U.S.A. would look like this:
original string        → "U.S.A."
standard tokenizer     → (U),(S),(A)
lowercase token filter → (u),(s),(a)
synonym token filter   → (usa)
If we had specified the synonym as U.S.A., it would never match anything because, by the time my_synonym_filter sees the terms, the periods have been removed and the letters have been lowercased.
This is an important point to consider. What if we want to combine synonyms with stemming, so that jumps, jumped, jump, leaps, leaped, and leap are all indexed as the single term jump? We could place the synonyms filter before the stemmer and list all inflections:
"jumps,jumped,leap,leaps,leaped => jump"
But the more concise way would be to place the synonyms filter after the stemmer, and to list just the root words that would be emitted by the stemmer:
"leap => jump"
Normally, synonym filters are placed after the lowercase token filter and so all synonyms are written in lowercase, but sometimes that can lead to odd conflations. For instance, a CAT scan and a cat are quite different, as are PET (positron emission tomography) and a pet. For that matter, the surname Little is distinct from the adjective little (although if a sentence starts with the adjective, it will be uppercased anyway).
If you need to use case to distinguish between word senses, you will need to place your synonym filter before the lowercase filter. Of course, that means that your synonym rules would need to list all of the case variations that you want to match (for example, Little,LITTLE,little).
Instead, you could have two synonym filters: one to catch the case-sensitive synonyms and one for all the case-insensitive synonyms. For instance, the case-sensitive rules could look like this:
"CAT,CAT scan => cat_scan" "PET,PET scan => pet_scan" "Johnny Little,J Little => johnny_little" "Johnny Small,J Small => johnny_small"
And the case-insensitive rules could look like this:
"cat => cat,pet" "dog => dog,pet" "cat scan,cat_scan scan => cat_scan" "pet scan,pet_scan scan => pet_scan" "little,small"
The case-sensitive rules are intended to catch CAT scan, but they can end up matching only the CAT in CAT scan, leaving a stray scan token behind. For this reason, we have the odd-looking rule cat_scan scan in the case-insensitive list to catch these bad replacements.
The analyze API is your friend—use it to check that your analyzers are configured correctly. See “Testing Analyzers”.
So far, synonyms appear to be quite straightforward. Unfortunately, this is where things start to go wrong. For phrase queries to function correctly, Elasticsearch needs to know the position that each term occupies in the original text. Multiword synonyms can play havoc with term positions, especially when the injected synonyms are of differing lengths.
To demonstrate, we’ll create a synonym token filter that uses this rule:
"usa,united states,u s a,united states of america"
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa,united states,u s a,united states of america"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze?analyzer=my_synonyms&text=
The United States is wealthy
The tokens emitted by the analyze request look like this:
Pos 1: (the)
Pos 2: (usa,united,u,united)
Pos 3: (states,s,states)
Pos 4: (is,a,of)
Pos 5: (wealthy,america)
If we were to index a document analyzed with synonyms as above, and then run a phrase query without synonyms, we’d have some surprising results. These phrases would not match:
The usa is wealthy
The united states of america is wealthy
The U.S.A. is wealthy
However, these phrases would:
United states is wealthy
Usa states of wealthy
The U.S. of wealthy
U.S. is america
If we were to use synonyms at query time instead, we would see even more-bizarre matches. Look at the output of this validate-query request:
GET /my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "usa is wealthy",
        "analyzer": "my_synonyms"
      }
    }
  }
}
The explanation is as follows:
"(usa united u united) (is states s states) (wealthy a of) america"
This would match documents containing u is of america but wouldn’t match any document that didn’t contain the term america.
Multiword synonyms affect highlighting in a similar way. A query for USA could end up returning a highlighted snippet such as: “The United States is wealthy”.
The way to avoid this mess is to use simple contraction to inject a single term that represents all synonyms, and to use the same synonym token filter at query time:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "united states,u s a,united states of america=>usa"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze?analyzer=my_synonyms
The United States is wealthy
The result of the preceding analyze request looks much more sane:
Pos 1: (the)
Pos 2: (usa)
Pos 3: (is)
Pos 5: (wealthy)
And repeating the validate-query request that we made previously yields a simple, sane explanation:
"usa is wealthy"
The downside of this approach is that, by reducing united states of america down to the single term usa, you can’t use the same field to find just the word united or states. You would need to use a separate field with a different analysis chain for that purpose.
We have tried to avoid discussing the query_string query because we don’t recommend using it. In “More-Complicated Queries”, we said that, because the query_string query supports a terse mini search syntax, it could frequently lead to surprising results or even syntax errors.
One of the gotchas of this query involves multiword synonyms. To support its search syntax, it has to parse the query string to recognize special operators like AND, OR, +, -, field:, and so forth. (See the full query_string syntax here.)
As part of this parsing process, it breaks up the query string on whitespace, and passes each word that it finds to the relevant analyzer separately. This means that your synonym analyzer will never receive a multiword synonym. Instead of seeing United States as a single string, the analyzer will receive United and States separately.
Fortunately, the trustworthy match query supports no such syntax, and multiword synonyms will be passed to the analyzer in their entirety.
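For example, a match query hands the whole query string to the field’s analyzer, so a multiword rule such as united states,u s a => usa can still be applied (the text field name is assumed):

GET /my_index/_search
{
  "query": {
    "match": {
      "text": "The United States is wealthy"
    }
  }
}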
The final part of this chapter is devoted to symbol synonyms, which are unlike the synonyms we have discussed until now. Symbol synonyms are string aliases used to represent symbols that would otherwise be removed during tokenization.
While most punctuation is seldom important for full-text search, character combinations like emoticons may be very significant, even changing the meaning of the text. Compare these:
I am thrilled to be at work on Sunday.
I am thrilled to be at work on Sunday :(
The standard tokenizer would simply strip out the emoticon in the second sentence, conflating two sentences that have quite different intent.
We can use the mapping character filter to replace emoticons with symbol synonyms like emoticon_happy and emoticon_sad before the text is passed to the tokenizer:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":)=>emoticon_happy",
            ":(=>emoticon_sad"
          ]
        }
      },
      "analyzer": {
        "my_emoticons": {
          "char_filter": [ "emoticons" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
GET /my_index/_analyze?analyzer=my_emoticons
I am :) not :(
The mappings filter replaces the characters to the left of => with those to the right. The analyzer emits the tokens i, am, emoticon_happy, not, and emoticon_sad.
It is unlikely that anybody would ever search for emoticon_happy, but ensuring that important symbols like emoticons are included in the index can be helpful when doing sentiment analysis. Of course, we could equally have used real words, like happy and sad.
The mapping character filter is useful for simple replacements of exact character sequences. For more-flexible pattern matching, you can use regular expressions with the pattern_replace character filter.
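As a rough sketch (the filter name, pattern, and replacement below are purely illustrative, not from this chapter), a pattern_replace character filter is configured like this:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_dashes": {
          "type": "pattern_replace",
          "pattern": "-+",
          "replacement": " "
        }
      },
      "analyzer": {
        "my_pattern_analyzer": {
          "char_filter": [ "strip_dashes" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}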