Breaking text into tokens is only half the job. To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word. Would you search for déjà vu, or just for deja vu?
This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job; each one receives the token stream output by the token filter before it.
The most frequently used token filter is the lowercase filter, which does exactly what you would expect; it transforms each token into its lowercase form:

GET /_analyze?tokenizer=standard&filters=lowercase
The QUICK Brown FOX!
It doesn’t matter whether users search for fox or FOX, as long as the same analysis process is applied at index time and at search time. The lowercase filter will transform a query for FOX into a query for fox, which is the same token that we have stored in our inverted index.
To use token filters as part of the analysis process, we can create a custom analyzer:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_lowercaser": {
                    "tokenizer": "standard",
                    "filter":    [ "lowercase" ]
                }
            }
        }
    }
}
And we can test it out with the analyze API:

GET /my_index/_analyze?analyzer=my_lowercaser
The QUICK Brown FOX!
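A quick way to see this symmetry in practice is with a match query. The following is a minimal sketch; the title field and its mapping to the my_lowercaser analyzer are assumptions made for illustration, not part of the preceding example:

GET /my_index/_search
{
    "query": {
        "match": {
            "title": "FOX"
        }
    }
}

The query string FOX is passed through the same my_lowercaser analyzer, so it is the token fox that is looked up in the inverted index.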
English uses diacritics (like ´, ^, and ¨) only for imported words—like rôle, déjà, and däis—but usually they are optional. Other languages require diacritics in order to be correct. Of course, just because words are spelled correctly in your index doesn’t mean that the user will search for the correct spelling.
It is often useful to strip diacritics from words, allowing rôle to match role, and vice versa. With Western languages, this can be done with the asciifolding token filter. Actually, it does more than just strip diacritics. It tries to convert many Unicode characters into a simpler ASCII representation:

ß ⇒ ss
æ ⇒ ae
ł ⇒ l
ɰ ⇒ m
⁇ ⇒ ??
❷ ⇒ 2
⁶ ⇒ 6
Like the lowercase filter, the asciifolding filter doesn’t require any configuration but can be included directly in a custom analyzer:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "folding": {
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "asciifolding" ]
                }
            }
        }
    }
}
GET /my_index/_analyze?analyzer=folding
My œsophagus caused a débâcle
Of course, when you strip diacritical marks from a word, you lose meaning. For instance, consider these three Spanish words:

esta
    Feminine form of the adjective this, as in esta silla (this chair) or esta (this one).

ésta
    An archaic form of esta.

está
    The third-person form of the verb estar (to be), as in está feliz (he is happy).
While we would like to conflate the first two forms, they differ in meaning from the third form, which we would like to keep separate. Similarly:

sé
    The first-person form of the verb saber (to know), as in Yo sé (I know).

se
    The third-person reflexive pronoun used with many verbs, such as se sabe (it is known).
Unfortunately, there is no easy way to separate words that should have their diacritics removed from words that shouldn’t. And it is quite likely that your users won’t know either.
Instead, we index the text twice: once in the original form and once with diacritics removed:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":     "string",
            "analyzer": "standard",
            "fields": {
                "folded": {
                    "type":     "string",
                    "analyzer": "folding"
                }
            }
        }
    }
}
The title field uses the standard analyzer and will contain the original word with diacritics in place. The title.folded field uses the folding analyzer, which strips the diacritical marks.
You can test the field mappings by using the analyze API on the sentence Esta está loca (This woman is crazy):

GET /my_index/_analyze?field=title
Esta está loca

GET /my_index/_analyze?field=title.folded
Esta está loca

The first request emits the tokens esta, está, and loca, while the second emits esta, esta, and loca.
Let’s index some documents to test it out:

PUT /my_index/my_type/1
{ "title": "Esta loca!" }

PUT /my_index/my_type/2
{ "title": "Está loca!" }
Now we can search across both fields, using the multi_match query in most_fields mode to combine the scores from each field:

GET /my_index/_search
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "esta loca",
            "fields": [ "title", "title.folded" ]
        }
    }
}
Running this query through the validate-query API helps to explain how the query is executed:

GET /my_index/_validate/query?explain
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "está loca",
            "fields": [ "title", "title.folded" ]
        }
    }
}
The multi_match query searches for the original form of the word (está) in the title field, and for the form without diacritics (esta) in the title.folded field:

(title:está title:loca) (title.folded:esta title.folded:loca)
It doesn’t matter whether the user searches for esta or está; both documents will match because the form without diacritics exists in the title.folded field. However, only the original form exists in the title field. This extra match will push the document containing the original form of the word to the top of the results list.

We use the title.folded field to widen the net in order to match more documents, and use the original title field to push the most relevant document to the top. This same technique can be used wherever an analyzer is used, to increase matches at the expense of meaning.
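To lean even harder on the original form, you could boost the title field in the multi_match query. This is a sketch using the standard caret boost syntax; the boost factor of 2 is an arbitrary choice for illustration:

GET /my_index/_search
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "esta loca",
            "fields": [ "title^2", "title.folded" ]
        }
    }
}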
The asciifolding filter does have an option called preserve_original that allows you to index the original token and the folded token in the same position in the same field. With this option enabled, you would end up with something like this:

Position 1     Position 2
---------------------------
(ésta,esta)    loca
---------------------------
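Configuring this looks something like the following sketch. The preserve_original parameter is a documented option of the asciifolding filter, but the filter and analyzer names (folding_preserve, folding_with_original) are made up for this example:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "folding_preserve": {
                    "type":              "asciifolding",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "folding_with_original": {
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "folding_preserve" ]
                }
            }
        }
    }
}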
While this appears to be a nice way to save space, it does mean that you have no way of saying, “Give me an exact match on the original word.” Mixing tokens with and without diacritics can also end up interfering with term-frequency counts, resulting in less-reliable relevance calculations.
As a rule, it is cleaner to index each field variant into a separate field, as we have done in this section.
When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways.
For instance, what’s the difference between é and é? It depends on who you ask. According to Elasticsearch, the first one consists of the two bytes 0xC3 0xA9, and the second one consists of three bytes, 0x65 0xCC 0x81.

According to Unicode, the differences in how they are represented as bytes are irrelevant, and they are the same letter. The first one is the single letter é, while the second is a plain e combined with an acute accent ´.
If you get your data from more than one source, it may happen that you have the same letters encoded in different ways, which may result in one form of déjà not matching another!

Fortunately, a solution is at hand. There are four Unicode normalization forms, all of which convert Unicode characters into a standard format, making all characters comparable at a byte level: nfc, nfd, nfkc, and nfkd.

It doesn’t really matter which normalization form you choose, as long as all your text is in the same form. That way, the same tokens consist of the same bytes. That said, the compatibility forms allow you to compare ligatures like ﬃ with their simpler representation, ffi.
You can use the icu_normalizer token filter to ensure that all of your tokens are in the same form:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "nfkc_normalizer": {
                    "type": "icu_normalizer",
                    "name": "nfkc"
                }
            },
            "analyzer": {
                "my_normalizer": {
                    "tokenizer": "icu_tokenizer",
                    "filter":    [ "nfkc_normalizer" ]
                }
            }
        }
    }
}
Besides the icu_normalizer token filter mentioned previously, there is also an icu_normalizer character filter, which does the same job as the token filter, but does so before the text reaches the tokenizer. When using the standard tokenizer or icu_tokenizer, this doesn’t really matter. These tokenizers know how to deal with all forms of Unicode correctly.

However, if you plan on using a different tokenizer, such as the ngram, edge_ngram, or pattern tokenizers, it would make sense to use the icu_normalizer character filter in preference to the token filter.
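Here is a minimal sketch of that arrangement. The icu_normalizer character filter and its name parameter come from the ICU plug-in, but the filter and analyzer names (nfkc_char_normalizer, ngram_normalized) and the choice of the ngram tokenizer are assumptions made for illustration:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "nfkc_char_normalizer": {
                    "type": "icu_normalizer",
                    "name": "nfkc"
                }
            },
            "analyzer": {
                "ngram_normalized": {
                    "char_filter": [ "nfkc_char_normalizer" ],
                    "tokenizer":   "ngram"
                }
            }
        }
    }
}

This way, every character is normalized before the ngram tokenizer slices the text into grams.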
Usually, though, you will want to not only normalize the byte order of tokens, but also lowercase them. This can be done with icu_normalizer, using the custom normalization form nfkc_cf, which we discuss in the next section.
Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages.
Take, for example, the lowercase German letter ß. Converting that to uppercase gives you SS, which converted back to lowercase gives you ss. Or consider the Greek letter ς (sigma, when used at the end of a word). Converting it to uppercase results in Σ, which converted back to lowercase gives you σ.
The whole point of lowercasing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lowercasing. Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.
For instance, the letter ß, which is already lowercase, is folded to ss. Similarly, the lowercase ς is folded to σ, to make σ, ς, and Σ comparable, no matter where the letter appears in a word.
The default normalization form that the icu_normalizer token filter uses is nfkc_cf. Like the nfkc form, this does the following:

- Composes characters into the shortest byte representation
- Uses compatibility mode to convert characters like ﬃ into the simpler ffi

But it also does this:

- Case-folds characters into a form suitable for case comparison
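If you prefer to be explicit about the form rather than rely on the default, you can name it directly, just as nfkc was named in the earlier example. This is a sketch; the filter name nfkc_cf_normalizer is made up, but nfkc_cf is one of the forms accepted by the icu_normalizer filter’s name parameter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "nfkc_cf_normalizer": {
                    "type": "icu_normalizer",
                    "name": "nfkc_cf"
                }
            }
        }
    }
}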
In other words, nfkc_cf is the equivalent of the lowercase token filter, but suitable for use with all languages. The on-steroids equivalent of the standard analyzer would be the following:
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_lowercaser": {
                    "tokenizer": "icu_tokenizer",
                    "filter":    [ "icu_normalizer" ]
                }
            }
        }
    }
}
We can compare the results of running Weißkopfseeadler and WEISSKOPFSEEADLER (the uppercase equivalent) through the standard analyzer and through our Unicode-aware analyzer:

GET /_analyze?analyzer=standard
Weißkopfseeadler WEISSKOPFSEEADLER

GET /my_index/_analyze?analyzer=my_lowercaser
Weißkopfseeadler WEISSKOPFSEEADLER

The first request emits the tokens weißkopfseeadler and weisskopfseeadler, while the second emits weisskopfseeadler twice.
The standard analyzer emits two different, incomparable tokens, while our custom analyzer produces tokens that are comparable, regardless of the original case.
In the same way as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world.
The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents and applies various other numeric, symbolic, and punctuation transformations.
The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_folder": {
                    "tokenizer": "icu_tokenizer",
                    "filter":    [ "icu_folding" ]
                }
            }
        }
    }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥

The Arabic-Indic numerals ١٢٣٤٥ are folded to their Latin equivalent: 12345.
If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters, except for those letters: [^åäöÅÄÖ] (^ means everything except).
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "swedish_folding": {
                    "type":             "icu_folding",
                    "unicodeSetFilter": "[^åäöÅÄÖ]"
                }
            },
            "analyzer": {
                "swedish_analyzer": {
                    "tokenizer": "icu_tokenizer",
                    "filter":    [ "swedish_folding", "lowercase" ]
                }
            }
        }
    }
}
The swedish_folding token filter customizes the icu_folding token filter to exclude Swedish letters, both uppercase and lowercase. The swedish_analyzer analyzer first tokenizes words, then folds each token by using the swedish_folding filter, and then lowercases each token in case it includes some of the uppercase excluded letters: Å, Ä, or Ö.
So far in this chapter, we have looked at how to normalize tokens for the purposes of search. The final use case to consider in this chapter is that of string sorting.
In “String Sorting and Multifields”, we explained that Elasticsearch cannot sort on an analyzed string field, and demonstrated how to use multifields to index the same field once as an analyzed field for search, and once as a not_analyzed field for sorting.

The problem with sorting on an analyzed field is not that it uses an analyzer, but that the analyzer tokenizes the string value into multiple tokens, like a bag of words, and Elasticsearch doesn’t know which token to use for sorting.

Relying on a not_analyzed field for sorting is inflexible: it allows us to sort on only the exact value of the original string. However, we can use analyzers to achieve other sort orders, as long as our chosen analyzer always emits only a single token for each string value.
Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in “String Sorting and Multifields” of using a not_analyzed field for sorting:

PUT /my_index
{
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type": "string",
                    "fields": {
                        "raw": {
                            "type":  "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}
We can index some documents and try sorting:
PUT /my_index/user/1
{ "name": "Boffey" }

PUT /my_index/user/2
{ "name": "BROWN" }

PUT /my_index/user/3
{ "name": "bailey" }

GET /my_index/user/_search?sort=name.raw
The preceding search request would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order, as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.
That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.
In other words, we need an analyzer that will emit a single lowercase token:
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "case_insensitive_sort": {
                    "tokenizer": "keyword",
                    "filter":    [ "lowercase" ]
                }
            }
        }
    }
}

The keyword tokenizer emits the original input string as a single unchanged token, and the lowercase token filter then lowercases that token.
With the case_insensitive_sort analyzer in place, we can now use it in our multifield:
PUT /my_index/_mapping/user
{
    "properties": {
        "name": {
            "type": "string",
            "fields": {
                "lower_case_sort": {
                    "type":     "string",
                    "analyzer": "case_insensitive_sort"
                }
            }
        }
    }
}

PUT /my_index/user/1
{ "name": "Boffey" }

PUT /my_index/user/2
{ "name": "BROWN" }

PUT /my_index/user/3
{ "name": "bailey" }

GET /my_index/user/_search?sort=name.lower_case_sort
The preceding search request returns our documents in the order that we expect: bailey, Boffey, BROWN.
But is this order correct? It appears to be correct as it matches our expectations, but our expectations have probably been influenced by the fact that this book is in English and all of the letters used in our example belong to the English alphabet.
What if we were to add the German name Böhm?
Now our names would be returned in this order: bailey, Boffey, BROWN, Böhm. The reason that böhm comes after BROWN is that these words are still being sorted by the values of the bytes used to represent them: an r is stored as the byte 0x72, while ö is stored as 0xF6 and so is sorted last. The byte value of each character is an accident of history.
Clearly, the default sort order is meaningless for anything other than plain English. In fact, there is no “right” sort order. It all depends on the language you speak.
Every language has its own sort order, and sometimes even multiple sort orders. Here are a few examples of how our four names from the previous section would be sorted in different contexts:
English:           bailey, boffey, böhm, brown
German:            bailey, boffey, böhm, brown
German phonebook:  bailey, böhm, boffey, brown
Swedish:           bailey, boffey, brown, böhm
The reason that the German phonebook sort order places böhm before boffey is that ö and oe are considered synonyms when dealing with names and places, so böhm is sorted as if it had been written as boehm.
Collation is the process of sorting text into a predefined order. The Unicode Collation Algorithm, or UCA (see www.unicode.org/reports/tr10) defines a method of sorting strings into the order defined in a Collation Element Table (usually referred to just as a collation).
The UCA also defines the Default Unicode Collation Element Table, or DUCET, which defines the default sort order for all Unicode characters, regardless of language. As you have already seen, there is no single correct sort order, so DUCET is designed to annoy as few people as possible as seldom as possible, but it is far from being a panacea for all sorting woes.
Instead, language-specific collations exist for pretty much every language under the sun. Most use DUCET as their starting point and add a few custom rules to deal with the peculiarities of each language.
The UCA takes a string and a collation as inputs and outputs a binary sort key. Sorting a collection of strings according to the specified collation then becomes a simple comparison of their binary sort keys.
The approach described in this section will probably change in a future version of Elasticsearch. Check the icu plugin documentation for the latest information.
The icu_collation token filter defaults to using the DUCET collation for sorting. This is already an improvement over the default sort. To use it, all we need to do is to create an analyzer that uses the default icu_collation filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ducet_sort": {
                    "tokenizer": "keyword",
                    "filter":    [ "icu_collation" ]
                }
            }
        }
    }
}
Typically, the field that we want to sort on is also a field that we want to search on, so we use the same multifield approach as we used in “Case-Insensitive Sorting”:
PUT /my_index/_mapping/user
{
    "properties": {
        "name": {
            "type": "string",
            "fields": {
                "sort": {
                    "type":     "string",
                    "analyzer": "ducet_sort"
                }
            }
        }
    }
}
With this mapping, the name.sort field will contain a sort key that will be used only for sorting. We haven’t specified a language, so it defaults to using the DUCET collation.
Now, we can reindex our example docs and test the sorting:
PUT /my_index/user/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

GET /my_index/user/_search?sort=name.sort
Note that the sort key returned with each document, which in earlier examples looked like brown and böhm, now looks like gobbledygook: ᖔ乏昫倈\u0001. The reason is that the icu_collation filter emits keys intended only for efficient sorting, not for any other purpose.
The preceding search returns our docs in this order: bailey, Boffey, Böhm, BROWN. This is already an improvement, as the sort order is now correct for English and German, but it is still incorrect for German phonebooks and Swedish. The next step is to customize our mapping for different languages.
The icu_collation filter can be configured to use the collation table for a specific language, a country-specific version of a language, or some other subset such as German phonebooks. This can be done by creating a custom version of the token filter by using the language, country, and variant parameters, as follows:

English:                   { "language": "en" }
German:                    { "language": "de" }
Austrian German:           { "language": "de", "country": "AT" }
German phonebook ordering: { "language": "de", "variant": "@collation=phonebook" }
You can read more about the locales supported by ICU at: http://bit.ly/1u9LEdp.
This example shows how to set up the German phonebook sort order:
PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "filter": {
                "german_phonebook": {
                    "type":     "icu_collation",
                    "language": "de",
                    "country":  "DE",
                    "variant":  "@collation=phonebook"
                }
            },
            "analyzer": {
                "german_phonebook": {
                    "tokenizer": "keyword",
                    "filter":    [ "german_phonebook" ]
                }
            }
        }
    },
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type": "string",
                    "fields": {
                        "sort": {
                            "type":     "string",
                            "analyzer": "german_phonebook"
                        }
                    }
                }
            }
        }
    }
}
First we create a version of the icu_collation filter customized for the German phonebook collation. Then we wrap it up in a custom analyzer, and we apply that analyzer to our name.sort field.
Reindex the data and repeat the same search as we used previously:
PUT /my_index/user/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

GET /my_index/user/_search?sort=name.sort
This now returns our docs in this order: bailey, Böhm, Boffey, BROWN. In the German phonebook collation, Böhm is the equivalent of Boehm, which comes before Boffey.
The same field can support multiple sort orders by using a multifield for each language:
PUT /my_index/_mapping/user
{
    "properties": {
        "name": {
            "type": "string",
            "fields": {
                "default": {
                    "type":     "string",
                    "analyzer": "ducet"
                },
                "french": {
                    "type":     "string",
                    "analyzer": "french"
                },
                "german": {
                    "type":     "string",
                    "analyzer": "german_phonebook"
                },
                "swedish": {
                    "type":     "string",
                    "analyzer": "swedish"
                }
            }
        }
    }
}

This assumes that suitable ducet, french, and swedish collation analyzers have been defined in the index settings, along the same lines as the german_phonebook analyzer shown previously.
With this mapping in place, results can be ordered correctly for French, German, and Swedish users, just by sorting on the name.french, name.german, or name.swedish fields. Unsupported languages can fall back to using the name.default field, which uses the DUCET sort order.
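For example, to sort results for German phonebook users (a sketch reusing the mapping above), you would simply sort on the name.german field:

GET /my_index/user/_search?sort=name.german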
The icu_collation token filter takes many more options than just language, country, and variant, which can be used to tailor the sorting algorithm. Options are available that will do the following (two of these are sketched after the list):

- Ignore diacritics
- Order uppercase first or last, or ignore case
- Take punctuation and whitespace into account or ignore it
- Sort numbers as strings or by their numeric value
- Customize existing collations or define your own custom collations
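As an illustration of two of these options, the following sketch combines caseFirst (order lowercase before uppercase) with numeric (sort digit sequences by their numeric value). The filter and analyzer names are made up, and the exact option set available depends on your version of the ICU plug-in:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "tailored_sort": {
                    "type":      "icu_collation",
                    "language":  "en",
                    "caseFirst": "lower",
                    "numeric":   true
                }
            },
            "analyzer": {
                "tailored_sort_analyzer": {
                    "tokenizer": "keyword",
                    "filter":    [ "tailored_sort" ]
                }
            }
        }
    }
}

With numeric set to true, foo10 sorts after foo9 instead of between foo1 and foo2.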
Details of these options are beyond the scope of this book, but more information can be found in the ICU plug-in documentation and in the ICU project collation documentation.