Most languages of the world are inflected, meaning that words can change their form to express differences in the following:
Number: fox, foxes
Tense: pay, paid, paying
Gender: waiter, waitress
Person: hear, hears
Case: I, me, my
Aspect: ate, eaten
Mood: so be it, were it so
While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters. English is a weakly inflected language (you could ignore inflections and still get reasonable search results), but some other languages are highly inflected and need extra work in order to achieve high-quality search results.
Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance foxes may be reduced to the root fox, to remove the difference between singular and plural in the same way that we removed the difference between lowercase and uppercase.
The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn't matter: as long as the same terms are produced at index time and at search time, search will just work.
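A quick way to see what any stemmer produces is to run text through an ad hoc token filter chain with the analyze API. This is just a sketch: in the 1.x releases covered here, the query-string parameter is called filters (later releases accept token_filters or a JSON request body), so check the analyze API documentation for your version:

GET /_analyze?tokenizer=standard&filters=lowercase,porter_stem
jumping jumpiness

Running the same request with the terms you expect users to search for is the easiest way to confirm that index time and search time produce the same tokens.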
If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming.
Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi. Understemming reduces recall: relevant documents are not returned.
Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener. Overstemming reduces precision: irrelevant documents are returned when they shouldn't be.
First we will discuss the two classes of stemmers available in Elasticsearch—“Algorithmic Stemmers” and “Dictionary Stemmers”—and then look at how to choose the right stemmer for your needs in “Choosing a Stemmer”. Finally, we will discuss options for tailoring stemming in “Controlling Stemming” and “Stemming in situ”.
Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don't have to know anything about individual words in order to stem them.
These algorithmic stemmers have the advantage that they are available out of the box, are fast, use little memory, and work well for regular words. The downside is that they don't cope well with irregular words like be, are, and am, or mice and mouse.
One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball.
The kstem token filter is a stemmer for English which combines the algorithmic approach with a built-in dictionary. The dictionary contains a list of root words and exceptions in order to avoid conflating words incorrectly. kstem tends to stem less aggressively than the Porter stemmer.
While you can use the porter_stem or kstem token filter directly, or create a language-specific Snowball stemmer with the snowball token filter, all of the algorithmic stemmers are exposed via a single unified interface: the stemmer token filter, which accepts the language parameter.
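For example, here is a sketch of configuring a Snowball French stemmer through this unified interface. The filter and analyzer names my_french_stemmer and my_french are our own; the language value french comes from the stemmer token filter documentation:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_french_stemmer": {
          "type":     "stemmer",
          "language": "french"
        }
      },
      "analyzer": {
        "my_french": {
          "tokenizer": "standard",
          "filter":    [ "lowercase", "my_french_stemmer" ]
        }
      }
    }
  }
}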
For instance, perhaps you find the default stemmer used by the english analyzer to be too aggressive and you want to make it less aggressive. The first step is to look up the configuration for the english analyzer in the language analyzers documentation, which shows the following:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":      "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type":     "keyword_marker",
          "keywords": []
        },
        "english_stemmer": {
          "type":     "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type":     "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
The keyword_marker token filter lists words that should not be stemmed. This defaults to the empty list.

The english analyzer uses two stemmers: the possessive_english stemmer and the english stemmer. The possessive stemmer removes 's from any words before passing them on to the english_stop, english_keywords, and english_stemmer token filters.
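To see this pipeline in action, you can run a sample phrase through the english analyzer with the analyze API. This example is our own; given the filter chain above, a phrase like John's jumping foxes should come out as something like john, jump, fox, with the possessive removed and each word lowercased and stemmed:

GET /_analyze?analyzer=english
John's jumping foxes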
Having reviewed the current configuration, we can use it as the basis for a new analyzer, with the following changes:
Change the english_stemmer from english (which maps to the porter_stem token filter) to light_english (which maps to the less aggressive kstem token filter).

Add the asciifolding token filter to remove any diacritics from foreign words.

Remove the keyword_marker token filter, as we don't need it. (We discuss this in more detail in “Controlling Stemming”.)
Our new custom analyzer would look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":      "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type":     "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type":     "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  }
}
Dictionary stemmers work quite differently from algorithmic stemmers. Instead of applying a standard set of rules to each word, they simply look up the word in the dictionary. Theoretically, they could produce much better results than an algorithmic stemmer. A dictionary stemmer should be able to do the following:
Return the correct root word for irregular forms such as feet and mice

Recognize the distinction between words that are similar but have different word senses, for example, organ and organization
In practice, a good algorithmic stemmer usually outperforms a dictionary stemmer. There are a few reasons this should be so:
A dictionary stemmer is only as good as its dictionary. The Oxford English Dictionary website estimates that the English language contains approximately 750,000 words (when inflections are included). Most English dictionaries available for computers contain about 10% of those.
The meaning of words changes with time. While stemming mobility to mobil may have made sense previously, it now conflates the idea of mobility with a mobile phone. Dictionaries need to be kept current, which is a time-consuming task. Often, by the time a dictionary has been made available, some of its entries are already out-of-date.
If a dictionary stemmer encounters a word not in its dictionary, it doesn’t know how to deal with it. An algorithmic stemmer, on the other hand, will apply the same rules as before, correctly or incorrectly.
A dictionary stemmer needs to load all words, all prefixes, and all suffixes into memory. This can use a significant amount of RAM. Finding the right stem for a word is often considerably more complex than the equivalent process with an algorithmic stemmer.
Depending on the quality of the dictionary, the process of removing prefixes and suffixes may be more or less efficient. Less-efficient forms can slow the stemming process significantly.
Algorithmic stemmers, on the other hand, are usually simple, small, and fast.
Elasticsearch provides dictionary-based stemming via the hunspell token filter. Hunspell (hunspell.sourceforge.net) is the spell checker used by OpenOffice, LibreOffice, Chrome, Firefox, Thunderbird, and many other open and closed source projects.
Hunspell dictionaries can be obtained from the following:

extensions.openoffice.org: Download and unzip the .oxt extension file.

addons.mozilla.org: Download and unzip the .xpi addon file.

OpenOffice archive: Download and unzip the .zip file.
A Hunspell dictionary consists of two files with the same base name, such as en_US, but with one of two extensions:

.dic
Contains all the root words, in alphabetical order, plus a code representing all possible suffixes and prefixes (which collectively are known as affixes)

.aff
Contains the actual prefix or suffix transformation for each code listed in the .dic file
The Hunspell token filter looks for dictionaries within a dedicated Hunspell directory, which defaults to ./config/hunspell/. The .dic and .aff files should be placed in a subdirectory whose name represents the language or locale of the dictionaries. For instance, we could create a Hunspell stemmer for American English with the following layout:

config/
  └ hunspell/
      └ en_US/
          ├ en_US.dic
          ├ en_US.aff
          └ settings.yml
The location of the Hunspell directory can be changed by setting indices.analysis.hunspell.dictionary.location in the config/elasticsearch.yml file. en_US will be the name of the locale or language that we pass to the hunspell token filter, and settings.yml is a per-language settings file, described in the following section.
The settings.yml file contains settings that apply to all of the dictionaries within the language directory, such as these:
---
ignore_case:          true
strict_affix_parsing: true
The meaning of these settings is as follows:
ignore_case

Hunspell dictionaries are case sensitive by default: the surname Booker is a different word from the noun booker, and so should be stemmed differently. It may seem like a good idea to use the hunspell stemmer in case-sensitive mode, but that can complicate things:

A word at the beginning of a sentence will be capitalized, and thus appear to be a proper noun.
The input text may be all uppercase, in which case almost no words will be found.
The user may search for names in all lowercase, in which case no capitalized words will be found.

As a general rule, it is a good idea to set ignore_case to true.
strict_affix_parsing

The quality of dictionaries varies greatly. Some dictionaries that are available online have malformed rules in the .aff file. By default, Lucene will throw an exception if it can't parse an affix rule. If you need to deal with a broken affix file, you can set strict_affix_parsing to false to tell Lucene to ignore the broken rules.
Once your dictionaries are installed on all nodes, you can define a hunspell token filter that uses them:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "en_US": {
          "type":     "hunspell",
          "language": "en_US"
        }
      },
      "analyzer": {
        "en_US": {
          "tokenizer": "standard",
          "filter":    [ "lowercase", "en_US" ]
        }
      }
    }
  }
}
You can test the new analyzer with the analyze API, and compare its output to that of the english analyzer:

GET /my_index/_analyze?analyzer=en_US
reorganizes

GET /_analyze?analyzer=english
reorganizes

The en_US analyzer returns organize, while the english analyzer returns reorgan.
An interesting property of the hunspell stemmer, as can be seen in the preceding example, is that it can remove prefixes as well as suffixes. Most algorithmic stemmers remove suffixes only.
Hunspell dictionaries can consume a few megabytes of RAM. Fortunately, Elasticsearch creates only a single instance of a dictionary per node. All shards that use the same Hunspell analyzer share the same instance.
While it is not necessary to understand the format of a Hunspell dictionary in order to use the hunspell token filter, understanding the format will help you write your own custom dictionaries. It is quite simple.
For instance, in the US English dictionary, the en_US.dic file contains an entry for the word analyze, which looks like this:

analyze/ADSG
The en_US.aff file contains the prefix or suffix rules for the A, G, D, and S flags. Each flag consists of a number of rules, only one of which should match. Each rule has the following format:

[type] [flag] [letters to remove] [letters to add] [condition]
For instance, the following is suffix (SFX) rule D. It says that, when a word ends in a consonant (anything but a, e, i, o, or u) followed by a y, it can have the y removed and ied added (for example, ready → readied):

SFX    D      y      ied     [^aeiou]y
The rules for the A, G, D, and S flags mentioned previously are as follows:

SFX D Y 4
SFX D   0     d      e
SFX D   y     ied    [^aeiou]y
SFX D   0     ed     [^ey]
SFX D   0     ed     [aeiou]y

SFX S Y 4
SFX S   y     ies    [^aeiou]y
SFX S   0     s      [aeiou]y
SFX S   0     es     [sxzh]
SFX S   0     s      [^sxzhy]

SFX G Y 2
SFX G   e     ing    e
SFX G   0     ing    [^e]

PFX A Y 1
PFX A   0     re     .
analyze ends in an e, so it can become analyzed by adding a d.

analyze does not end in s, x, z, h, or y, so it can become analyzes by adding an s.

analyze ends in an e, so it can become analyzing by removing the e and adding ing.

The prefix re can be added to form reanalyze. This rule can be combined with the suffix rules to form reanalyzes, reanalyzed, and reanalyzing.
More information about the Hunspell syntax can be found on the Hunspell documentation site.
The documentation for the stemmer token filter lists multiple stemmers for some languages. For English we have the following:

english
The porter_stem token filter.

light_english
The kstem token filter.

minimal_english
The EnglishMinimalStemmer in Lucene, which removes plurals.

lovins
The Snowball based Lovins stemmer, the first stemmer ever produced.

porter
The Snowball based Porter stemmer.

porter2
The Snowball based Porter2 stemmer.

possessive_english
The EnglishPossessiveFilter in Lucene, which removes 's.
Add to that list the Hunspell stemmer with the various English dictionaries that are available.
One thing is for sure: whenever more than one solution exists for a problem, it means that none of the solutions solves the problem adequately. This certainly applies to stemming — each stemmer uses a different approach that overstems and understems words to a different degree.
The stemmer documentation page highlights the recommended stemmer for each language in bold, usually because it offers a reasonable compromise between performance and quality. That said, the recommended stemmer may not be appropriate for all use cases. There is no single right answer to the question of which is the best stemmer: it depends very much on your requirements. There are three factors to take into account when making a choice: performance, quality, and degree.
Algorithmic stemmers are typically four or five times faster than Hunspell stemmers. “Handcrafted” algorithmic stemmers are usually, but not always, faster than their Snowball equivalents. For instance, the porter_stem token filter is significantly faster than the Snowball implementation of the Porter stemmer.
Hunspell stemmers have to load all words, prefixes, and suffixes into memory, which can consume a few megabytes of RAM. Algorithmic stemmers, on the other hand, consist of a small amount of code and consume very little memory.
All languages, except Esperanto, are irregular. While more-formal words tend to follow a regular pattern, the most commonly used words often have irregular rules. Some stemming algorithms have been developed over years of research and produce reasonably high-quality results. Others have been assembled more quickly with less research and deal only with the most common cases.
While Hunspell offers the promise of dealing precisely with irregular words, it often falls short in practice. A dictionary stemmer is only as good as its dictionary. If Hunspell comes across a word that isn’t in its dictionary, it can do nothing with it. Hunspell requires an extensive, high-quality, up-to-date dictionary in order to produce good results; dictionaries of this caliber are few and far between. An algorithmic stemmer, on the other hand, will happily deal with new words that didn’t exist when the designer created the algorithm.
If a good algorithmic stemmer is available for your language, it makes sense to use it rather than Hunspell. It will be faster, will consume less memory, and will generally be as good or better than the Hunspell equivalent.
If accuracy and customizability is important to you, and you need (and have the resources) to maintain a custom dictionary, then Hunspell gives you greater flexibility than the algorithmic stemmers. (See “Controlling Stemming” for customization techniques that can be used with any stemmer.)
Different stemmers overstem and understem to a different degree. The light_ stemmers stem less aggressively than the standard stemmers, and the minimal_ stemmers less aggressively still. Hunspell stems aggressively. You can compare the degrees directly, as shown in the sketch that follows.
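One approach (a sketch; the index, filter, and analyzer names are our own) is to define one analyzer per degree in a throwaway index and run the same text through each with the analyze API:

PUT /stemmer_degree_test
{
  "settings": {
    "analysis": {
      "filter": {
        "aggressive_stemmer": { "type": "stemmer", "language": "english" },
        "light_stemmer":      { "type": "stemmer", "language": "light_english" },
        "minimal_stemmer":    { "type": "stemmer", "language": "minimal_english" }
      },
      "analyzer": {
        "aggressive": { "tokenizer": "standard", "filter": [ "lowercase", "aggressive_stemmer" ] },
        "light":      { "tokenizer": "standard", "filter": [ "lowercase", "light_stemmer" ] },
        "minimal":    { "tokenizer": "standard", "filter": [ "lowercase", "minimal_stemmer" ] }
      }
    }
  }
}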
Whether you want aggressive or light stemming depends on your use case. If your search results are being consumed by a clustering algorithm, you may prefer to match more widely (and, thus, stem more aggressively). If your search results are intended for human consumption, lighter stemming usually produces better results. Stemming nouns and adjectives is more important for search than stemming verbs, but this also depends on the language.
The other factor to take into account is the size of your document collection. With a small collection such as a catalog of 10,000 products, you probably want to stem more aggressively to ensure that you match at least some documents. If your collection is large, you likely will get good matches with lighter stemming.
Start out with a recommended stemmer. If it works well enough, there is no need to change it. If it doesn't, you will need to spend some time investigating and comparing the stemmers available for your language in order to find the one that best suits your purposes.
Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers, especially, will blithely apply their rules to any words they encounter, perhaps conflating words that you would prefer to keep separate. Maybe, for your use case, it is important to keep skies and skiing as distinct words rather than stemming them both down to ski (as would happen with the english analyzer).
The keyword_marker and stemmer_override token filters allow us to customize the stemming process.
The stem_exclusion parameter for language analyzers (see “Configuring Language Analyzers”) allowed us to specify a list of words that should not be stemmed. Internally, these language analyzers use the keyword_marker token filter to mark the listed words as keywords, which prevents subsequent stemming token filters from touching those words.
For instance, we can create a simple custom analyzer that uses the porter_stem token filter, but prevents the word skies from being stemmed:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type":     "keyword_marker",
          "keywords": [ "skies" ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
Testing it with the analyze API shows that just the word skies has been excluded from stemming:

GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis

Returns: sky, skies, ski, ski
While the language analyzers allow us only to specify an array of words in the stem_exclusion parameter, the keyword_marker token filter also accepts a keywords_path parameter that allows us to store all of our keywords in a file. The file should contain one word per line, and must be present on every node in the cluster. See “Updating Stopwords” for tips on how to update this file.
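For instance, here is a sketch of the same no_stem filter configured with a file instead of an inline list. The filename stem_keywords.txt is our own; the path is resolved relative to the config directory:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type":          "keyword_marker",
          "keywords_path": "stem_keywords.txt"
        }
      }
    }
  }
}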
In the preceding example, we prevented skies from being stemmed, but perhaps we would prefer it to be stemmed to sky instead. The stemmer_override token filter allows us to specify our own custom stemming rules. At the same time, we can handle some irregular forms like stemming mice to mouse and feet to foot:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [
            "skies=>sky",
            "mice=>mouse",
            "feet=>foot"
          ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet
Rules take the form original=>stem. The stemmer_override filter must be placed before the stemmer. The preceding request returns the, mouse, came, down, from, the, sky, and, ran, over, my, foot.
As with the keyword_marker token filter, rules can be stored in a file whose location should be specified with the rules_path parameter.
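Here is a sketch of the file-based equivalent of the custom_stem filter. The filename stemmer_override_rules.txt is our own; the file would contain one original=>stem rule per line and must be present on every node:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type":       "stemmer_override",
          "rules_path": "stemmer_override_rules.txt"
        }
      }
    }
  }
}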
For the sake of completeness, we will finish this chapter by explaining how to index stemmed words into the same field as unstemmed words. As an example, analyzing the sentence The quick foxes jumped would produce the following terms:
Pos 1: (the)
Pos 2: (quick)
Pos 3: (foxes,fox)
Pos 4: (jumped,jump)
To achieve stemming in situ, we will use the keyword_repeat token filter, which, like the keyword_marker token filter (see “Preventing Stemming”), marks each term as a keyword to prevent the subsequent stemmer from touching it. However, it also repeats the term in the same position, and this repeated term is stemmed.
Using the keyword_repeat token filter alone would result in the following:

Pos 1: (the,the)
Pos 2: (quick,quick)
Pos 3: (foxes,fox)
Pos 4: (jumped,jump)
To prevent the useless repetition of terms that are the same in their stemmed and unstemmed forms, we add the unique token filter into the mix:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "unique_stem": {
          "type":                  "unique",
          "only_on_same_position": true
        }
      },
      "analyzer": {
        "in_situ": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "porter_stem",
            "unique_stem"
          ]
        }
      }
    }
  }
}
The unique token filter is set to remove duplicate tokens only when they occur in the same position. The keyword_repeat token filter must appear before the stemmer. The unique_stem filter removes duplicate terms after the stemmer has done its work.
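You can verify the behavior with the analyze API. Given the token positions shown at the start of this section, this request should return the, quick, foxes, fox, jumped, and jump, with the stemmed and unstemmed variants sharing a position:

GET /my_index/_analyze?analyzer=in_situ
The quick foxes jumped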
People like the idea of stemming in situ: “Why use an unstemmed field and a stemmed field if I can just use one combined field?” But is it a good idea? The answer is almost always no. There are two problems.
The first is the inability to separate exact matches from inexact matches. In this chapter, we have seen that words with different meanings are often conflated to the same stem word: organs and organization both stem to organ.
In “Using Language Analyzers”, we demonstrated how to combine a query on a stemmed field (to increase recall) with a query on an unstemmed field (to improve relevance). When the stemmed and unstemmed fields are separate, the contribution of each field can be tuned by boosting one field over another (see “Prioritizing Clauses”). If, instead, the stemmed and unstemmed forms appear in the same field, there is no way to tune your search results.
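As a sketch of what that tuning looks like with separate fields (the field names title and title.english are illustrative; see “Using Language Analyzers” for the multifield setup), you can boost the exact, unstemmed field over the stemmed one:

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields",
      "query":  "jumping foxes",
      "fields": [ "title^2", "title.english" ]
    }
  }
}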
The second issue has to do with how the relevance score is calculated. In “What Is Relevance?”, we explained that part of the calculation depends on the inverse document frequency: how often a word appears in all the documents in our index. Using in situ stemming for a document that contains the text jump jumped jumps would result in these terms:
Pos 1: (jump)
Pos 2: (jumped,jump)
Pos 3: (jumps,jump)
While jumped and jumps appear once each and so would have the correct IDF, jump appears three times, greatly reducing its value as a search term in comparison with the unstemmed forms.
For these reasons, we recommend against using stemming in situ.