Chapter 22. Stopwords: Performance Versus Precision
Back in the early days of information retrieval, disk space and memory were
limited to a tiny fraction of what we are accustomed to today. It was
essential to make your index as small as possible. Every kilobyte saved meant
a significant improvement in performance. Stemming (see Chapter 21) was
important, not just for making searches broader and increasing retrieval in
the same way that we use it today, but also as a tool for compressing index
size.
Another way to reduce index size is simply to index fewer words. For search
purposes, some words are more important than others. A significant reduction
in index size can be achieved by indexing only the more important terms.
So which terms can be left out? We can divide terms roughly into two groups:
- Low-frequency terms: Words that appear in relatively few documents in the collection. Because of their rarity, they have a high value, or weight.
- High-frequency terms: Common words that appear in many documents in the index, such as the, and, and is. These words have a low weight and contribute little to the relevance score.
Tip
Of course, frequency is really a scale rather than just two points labeled
low and high. We just draw a line at some arbitrary point and say that any
terms below that line are low frequency and above the line are high frequency.
Which terms are low or high frequency depends on the documents themselves. The
word and may be a low-frequency term if all the documents are in Chinese.
In a collection of documents about databases, the word database may be a
high-frequency term with little value as a search term for that particular
collection.
That said, for any language there are words that occur very
commonly and that seldom add value to a search. The default English
stopwords used in Elasticsearch are as follows:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
These stopwords can usually be filtered out before indexing with little
negative impact on retrieval. But is it a good idea to do so?
Pros and Cons of Stopwords
We have more disk space, more RAM, and better compression algorithms than
existed back in the day. Excluding the preceding 33 common words from the index
will save only about 4MB per million documents. Reducing index size is no
longer a valid reason for using stopwords. (However, there is one
caveat to this statement, which we discuss in “Stopwords and Phrase Queries”.)
On top of that, by removing words from the index, we are reducing our ability
to perform certain types of searches. Filtering out the words listed previously
prevents us from doing the following:
- Distinguishing happy from not happy
- Searching for the band The The
- Finding Shakespeare’s quotation “To be, or not to be”
- Using the country code for Norway: no
The primary advantage of removing stopwords is performance. Imagine that we
search an index with one million documents for the word fox. Perhaps fox
appears in only 20 of them, which means that Elasticsearch has to calculate the
relevance _score for 20 documents in order to return the top 10. Now, we
change that to a search for the OR fox. The word the probably occurs in
almost all the documents, which means that Elasticsearch has to calculate
the _score for all one million documents. This second query simply cannot
perform as well as the first.
Fortunately, there are techniques that we can use to keep common words
searchable, while still maintaining good performance. First, we’ll start with
how to use stopwords.
Using Stopwords
The removal of stopwords is handled by the stop token filter, which can be used
when creating a custom analyzer (see “Using the stop Token Filter”).
However, some out-of-the-box analyzers come with the stop filter pre-integrated:
- Language analyzers: Each language analyzer defaults to using the appropriate stopwords list for that language. For instance, the english analyzer uses the _english_ stopwords list.
- standard analyzer: Defaults to the empty stopwords list: _none_, essentially disabling stopwords.
- pattern analyzer: Defaults to _none_, like the standard analyzer.
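You can see these defaults in action with the analyze API, using the same request format we use elsewhere in this chapter. As a quick sketch, compare what the english and standard analyzers emit for the same text:
GET /_analyze?analyzer=english
The quick and the dead

GET /_analyze?analyzer=standard
The quick and the dead
The first request drops the stopwords the and and (and stems the remaining terms), while the second returns every token, simply lowercased.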
Stopwords and the Standard Analyzer
To use custom stopwords in conjunction with the standard analyzer, all we
need to do is to create a configured version of the analyzer and pass in the
list of stopwords that we require:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}
This creates a custom analyzer called my_analyzer, which is the standard
analyzer with some custom configuration. The stopwords to filter out are
and and the.
Tip
This same technique can be used to configure custom stopword lists for
any of the language analyzers.
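For instance, a sketch of the english analyzer configured with its own small stopword list (the analyzer name and word list here are illustrative):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": [ "and", "the", "thee", "thou" ]
        }
      }
    }
  }
}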
Maintaining Positions
The output from the analyze API is quite interesting:
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "dead",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
Note the position of each token.
The stopwords have been filtered out, as expected, but the interesting part is
that the position of the two remaining terms is unchanged: quick is the
second word in the original sentence, and dead is the fifth. This is
important for phrase queries: if the positions of each term had been
adjusted, a phrase query for quick dead would have matched the preceding
example incorrectly.
Specifying Stopwords
Stopwords can be passed inline, as we did in the previous example, by
specifying an array:
"stopwords"
:
[
"and"
,
"the"
]
The default stopword list for a particular language can be specified using the
_lang_ notation:
"stopwords": "_english_"
Tip
The predefined language-specific stopword lists available in Elasticsearch can
be found in the stop token filter documentation.
Stopwords can be disabled by specifying the special list: _none_. For
instance, to use the english analyzer without stopwords, you can do the
following:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": "_none_"
        }
      }
    }
  }
}
The my_english analyzer is based on the english analyzer, but stopwords are
disabled.
Finally, stopwords can also be listed in a file with one word per line. The
file must be present on all nodes in the cluster, and the path can be
specified with the stopwords_path parameter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt"
        }
      }
    }
  }
}
The path to the stopwords file, relative to the Elasticsearch config directory.
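The stopwords file itself is just a plain-text list with one word per line. A minimal sketch of what the stopwords/english.txt file referenced above might contain:
a
an
and
the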
Using the stop Token Filter
The stop token filter can be combined with a tokenizer and other token filters
when you need to create a custom analyzer. For instance, let’s say that we
wanted to create a Spanish analyzer with the following:
- A custom stopwords list
- The light_spanish stemmer
- The asciifolding filter to remove diacritics
We could set that up as follows:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ]
        },
        "light_spanish": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "spanish",
          "filter": [
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
- The stop token filter takes the same stopwords and stopwords_path parameters as the standard analyzer.
- For the light_spanish stemmer, see “Algorithmic Stemmers”.
- The order of token filters is important, as explained next.
We have placed the spanish_stop filter after the asciifolding filter. This
means that esta, ésta, and está will first have their diacritics removed to
become just esta, which will then be removed as a stopword. If, instead, we
wanted to remove esta and ésta, but not está, we would have to put the
spanish_stop filter before the asciifolding filter, and specify both words
in the stopwords list.
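A sketch of that alternative ordering, keeping the rest of the configuration from the example above:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "ésta", "el", "la" ]
        },
        "light_spanish": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "spanish",
          "filter": [
            "lowercase",
            "spanish_stop",
            "asciifolding",
            "light_spanish"
          ]
        }
      }
    }
  }
}
Here the stopword list contains both esta and ésta, and stopword removal runs before asciifolding, so está is not removed; its accent is stripped only afterward, before stemming.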
Updating Stopwords
A few techniques can be used to update the list of stopwords
used by an analyzer. Analyzers are instantiated at index creation time, when a
node is restarted, or when a closed index is reopened.
If you specify stopwords inline with the stopwords
parameter, your
only option is to close the index and update the analyzer configuration with the
update index settings API, then reopen
the index.
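For example, a sketch of that sequence for the my_analyzer example from earlier (the new word list is illustrative):
POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type": "standard",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}

POST /my_index/_open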
Updating stopwords is easier if you specify them in a file with the
stopwords_path
parameter. You can just update the file (on every node in
the cluster) and then force the analyzers to be re-created by either of these actions:
- Closing and reopening the index (see open/close index), or
- Restarting each node in the cluster, one by one
Of course, updating the stopwords list will not change any documents that have
already been indexed. It will apply only to searches and to new or updated
documents. To apply the changes to existing documents, you will need to
reindex your data. See “Reindexing Your Data”.
Divide and Conquer
The terms in a query string can be divided into more-important (low-frequency)
and less-important (high-frequency) terms. Documents that match only the less
important terms are probably of very little interest. Really, we want
documents that match as many of the more important terms as possible.
The match query accepts a cutoff_frequency parameter, which allows it to
divide the terms in the query string into a low-frequency and a high-frequency
group. The low-frequency group (more-important terms) forms the bulk of the
query, while the high-frequency group (less-important terms) is used only for
scoring, not for matching. By treating these two groups differently, we can
gain a real boost of speed on previously slow queries.
Take this query as an example:
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01
    }
  }
}
Any term that occurs in more than 1% of documents is considered to be high
frequency. The cutoff_frequency can be specified as a fraction (0.01) or as
an absolute number (5).
This query uses the cutoff_frequency to first divide the query terms into a
low-frequency group (quick, dead) and a high-frequency group (and, the).
Then, the query is rewritten to produce the following bool query:
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead"  }}
        ]
      }
    },
    "should": {
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
At least one low-frequency/high-importance term must match.
High-frequency/low-importance terms are entirely optional.
The must clause means that at least one of the low-frequency terms (quick or
dead) must be present for a document to be considered a match. All other
documents are excluded. The should clause then looks for the high-frequency
terms and and the, but only in the documents collected by the must clause.
The sole job of the should clause is to score a document like “Quick and the
dead” higher than “The quick but dead”. This approach greatly reduces the
number of documents that need to be examined and scored.
Tip
Setting the operator parameter to and would make all low-frequency terms
required, and score documents that contain all high-frequency terms higher.
However, matching documents would not be required to contain all
high-frequency terms. If you would prefer all low- and high-frequency terms to
be required, you should use a bool query instead. As we saw in
“and Operator”, this is already an efficient query.
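As a sketch, combining the two parameters on the earlier example query would look like this:
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "operator": "and",
      "cutoff_frequency": 0.01
    }
  }
}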
Controlling Precision
The minimum_should_match parameter can be combined with cutoff_frequency,
but it applies only to the low-frequency terms. This query:
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}
would be rewritten as follows:
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead"  }}
        ],
        "minimum_should_match": 1
      }
    },
    "should": {
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
Because there are only two terms, the original 75% is rounded down to 1; that
is, one out of the two low-frequency terms must match. The high-frequency
terms are still optional and are used only for scoring.
Only High-Frequency Terms
An or query for high-frequency terms only, such as “To be, or not to be”, is
the worst case for performance. It is pointless to score all the documents
that contain only one of these terms in order to return just the top 10
matches. We are really interested only in documents in which the terms all
occur together, so in the case where there are no low-frequency terms, the
query is rewritten to make all high-frequency terms required:
{
  "bool": {
    "must": [
      { "term": { "text": "to"  }},
      { "term": { "text": "be"  }},
      { "term": { "text": "or"  }},
      { "term": { "text": "not" }},
      { "term": { "text": "to"  }},
      { "term": { "text": "be"  }}
    ]
  }
}
More Control with Common Terms
While the high/low frequency functionality in the match query is useful,
sometimes you want more control over how the high- and low-frequency groups
should be handled. The match query exposes a subset of the functionality
available in the common terms query.
For instance, we could make all low-frequency terms required, and score only
documents that have 75% of all high-frequency terms with a query like this:
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "low_freq_operator": "and",
      "minimum_should_match": {
        "high_freq": "75%"
      }
    }
  }
}
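Another sketch, this time requiring at least two of the low-frequency terms instead of using low_freq_operator, while still scoring on 75% of the high-frequency terms (the minimum_should_match object accepts separate low_freq and high_freq values):
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": {
        "low_freq": 2,
        "high_freq": "75%"
      }
    }
  }
}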
See the common terms query reference page for more options.
Stopwords and Phrase Queries
About 5% of all queries are phrase queries (see “Phrase Matching”), but they
often account for the majority of slow queries. Phrase queries can perform
poorly, especially if the phrase includes very common words; a phrase like
“To be, or not to be” could be considered pathological. The reason for this
has to do with the amount of data that is necessary to support proximity
matching.
In “Pros and Cons of Stopwords”, we said that removing stopwords saves only a small
amount of space in the inverted index. That was only partially true. A
typical index may contain, among other data, some or all of the following:
- Terms dictionary: A sorted list of all terms that appear in the documents in the index, and a count of the number of documents that contain each term.
- Postings list: A list of which documents contain each term.
- Term frequency: How often each term appears in each document.
- Positions: The position of each term within each document, for phrase and proximity queries.
- Offsets: The start and end character offsets of each term in each document, for snippet highlighting. Disabled by default.
- Norms: A factor used to normalize fields of different lengths, to give shorter fields more weight.
Removing stopwords from the index may save a small amount of space in the
terms dictionary and the postings list, but positions and offsets are
another matter. Positions and offsets data can easily double, triple, or
quadruple index size.
Positions Data
Positions are enabled on analyzed
string fields by default, so that phrase
queries will work out of the box. The more often that a term appears, the more
space is needed to store its position data. Very common words, by
definition, appear very commonly, and their positions data can run to megabytes
or gigabytes on large collections.
Running a phrase query on a high-frequency word like the
might result in
gigabytes of data being read from disk. That data will be stored in the kernel
filesystem cache to speed up later access, which seems like a good thing, but
it might cause other data to be evicted from the cache, which will slow
subsequent queries.
This is clearly a problem that needs solving.
Index Options
The first question you should ask yourself is: Do you need phrase or
proximity queries?
Often, the answer is no. For many use cases, such as logging, you need to
know whether a term appears in a document (information that is provided by
the postings list) but not where it appears. Or perhaps you need to use
phrase queries on one or two fields, but you can disable positions data on all
of the other analyzed string fields.
The index_options parameter allows you to control what information is stored
in the index for each field. Valid values are as follows:
- docs: Only store which documents contain which terms. This is the default for not_analyzed string fields.
- freqs: Store docs information, plus how often each term appears in each document. Term frequencies are needed for complete TF/IDF relevance calculations, but they are not required if you just need to know whether a document contains a particular term.
- positions: Store docs and freqs, plus the position of each term in each document. This is the default for analyzed string fields, but can be disabled if phrase/proximity matching is not needed.
- offsets: Store docs, freqs, positions, and the start and end character offsets of each term in the original string. This information is used by the postings highlighter but is disabled by default.
You can set index_options on fields added at index creation time, or when
adding new fields by using the put-mapping API. This setting can’t be changed
on existing fields:
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string"
        },
        "content": {
          "type": "string",
          "index_options": "freqs"
        }
      }
    }
  }
}
The title field uses the default setting of positions, so it is suitable for
phrase/proximity queries. The content field has positions disabled and so
cannot be used for phrase/proximity queries.
Stopwords
Removing stopwords is one way of reducing the size of the positions data quite
dramatically. An index with stopwords removed can still be used for phrase
queries because the original positions of the remaining terms are maintained,
as we saw in “Maintaining Positions”. But of course, excluding terms from
the index reduces searchability. We wouldn’t be able to differentiate between
the two phrases Man in the moon and Man on the moon.
Fortunately, there is a way to have our cake and eat it: the
common_grams
token filter.
common_grams Token Filter
The common_grams
token filter is designed to make phrase queries with
stopwords more efficient. It is similar to the shingles
token filter (see
“Finding Associated Words”), which creates bigrams out of every pair of adjacent words. It
is most easily explained by example.
The common_grams
token filter produces different output depending on whether
query_mode
is set to false
(for indexing) or to true
(for searching), so
we have to create two separate analyzers:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "index_filter": {
          "type": "common_grams",
          "common_words": "_english_"
        },
        "search_filter": {
          "type": "common_grams",
          "common_words": "_english_",
          "query_mode": true
        }
      },
      "analyzer": {
        "index_grams": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "index_filter" ]
        },
        "search_grams": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "search_filter" ]
        }
      }
    }
  }
}
First we create two token filters based on the common_grams token filter:
index_filter for index time (with query_mode set to the default false), and
search_filter for query time (with query_mode set to true). The
common_words parameter accepts the same options as the stopwords parameter
(see “Specifying Stopwords”). The filter also accepts a common_words_path
parameter, which allows you to maintain the common words list in a file.
Then we use each filter to create an analyzer for index time and another
for query time.
With our custom analyzers in place, we can create a field that will use the
index_grams
analyzer at index time:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "index_analyzer": "index_grams",
      "search_analyzer": "standard"
    }
  }
}
The text field uses the index_grams analyzer at index time, but defaults to
using the standard analyzer at search time, for reasons we will explain next.
At Index Time
If we were to analyze the phrase The quick and brown fox with shingles, it
would produce these terms:
Pos 1: the_quick
Pos 2: quick_and
Pos 3: and_brown
Pos 4: brown_fox
Our new index_grams
analyzer produces the following terms instead:
Pos 1: the, the_quick
Pos 2: quick, quick_and
Pos 3: and, and_brown
Pos 4: brown
Pos 5: fox
All terms are output as unigrams (the, quick, and so forth), but if a word is
a common word or is followed by a common word, then it also outputs a bigram
in the same position as the unigram: the_quick, quick_and, and_brown.
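You can check this output yourself with the analyze API, using the same request format as earlier in this chapter (a quick sketch against the index_grams analyzer defined above):
GET /my_index/_analyze?analyzer=index_grams
The quick and brown fox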
Unigram Queries
Because the index contains unigrams, the field can be queried using the same
techniques that we have used for any other field, for example:
GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "the quick and brown fox",
        "cutoff_frequency": 0.01
      }
    }
  }
}
The preceding query string is analyzed by the search_analyzer configured for
the text field (the standard analyzer in this example) to produce the terms
the, quick, and, brown, fox.
Because the index for the text
field contains the same unigrams as produced
by the standard
analyzer, search functions as it would for any normal
field.
Bigram Phrase Queries
However, when we come to do phrase queries, we can use the specialized
search_grams
analyzer to make the process much more efficient:
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "The quick and brown fox",
        "analyzer": "search_grams"
      }
    }
  }
}
For phrase queries, we override the default search_analyzer and use the
search_grams analyzer instead.
The search_grams
analyzer would produce the following terms:
Pos 1: the_quick
Pos 2: quick_and
Pos 3: and_brown
Pos 4: brown
Pos 5: fox
The analyzer has stripped out all of the common word unigrams, leaving the common word
bigrams and the low-frequency unigrams. Bigrams like the_quick
are much
less common than the single term the
. This has two advantages:
- The positions data for the_quick is much smaller than for the, so it is faster to read from disk and has less of an impact on the filesystem cache.
- The term the_quick is much less common than the, so it drastically decreases the number of documents that have to be examined.
Two-Word Phrases
There is one further optimization. By far the majority of phrase queries
consist of only two words. If one of those words happens to be a common word,
such as
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "The quick",
        "analyzer": "search_grams"
      }
    }
  }
}
then the search_grams analyzer outputs a single token: the_quick. This
transforms what originally could have been an expensive phrase query for
the and quick into a very efficient single-term lookup.
Stopwords and Relevance
The last topic to cover before moving on from stopwords is that of relevance.
Leaving stopwords in your index could make the relevance calculation
less accurate, especially if your documents are very long.
As we have already discussed in “Term-frequency saturation”, the reason for this is
that term-frequency/inverse document frequency doesn’t impose an
upper limit on the impact of term frequency. Very common words may have a low
weight because of inverse document frequency but, in long documents, the sheer
number of occurrences of stopwords in a single document may lead to their
weight being artificially boosted.
You may want to consider using the Okapi BM25 similarity instead of the
default Lucene similarity on long fields that include stopwords.
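For instance, a minimal sketch of a mapping that switches such a field to BM25 (the field and type names here are illustrative):
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "body": {
          "type": "string",
          "similarity": "BM25"
        }
      }
    }
  }
}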