A word in English is relatively simple to spot: words are separated by whitespace or (some) punctuation. Even in English, though, there can be controversy: is you’re one word or two? What about o’clock, cooperate, half-baked, or eyewitness?
Languages like German or Dutch combine individual words to create longer compound words like Weißkopfseeadler (white-headed sea eagle), but in order to be able to return Weißkopfseeadler as a result for the query Adler (eagle), we need to understand how to break up compound words into their constituent parts.
Asian languages are even more complex: some have no whitespace between words, sentences, or even paragraphs. Some words can be represented by a single character, but the same single character, when placed next to other characters, can form just one part of a longer word with a quite different meaning.
It should be obvious that there is no silver-bullet analyzer that will miraculously deal with all human languages. Elasticsearch ships with dedicated analyzers for many languages, and more language-specific analyzers are available as plug-ins.
However, not all languages have dedicated analyzers, and sometimes you won’t even be sure which language(s) you are dealing with. For these situations, we need good standard tools that do a reasonable job regardless of language.
The standard analyzer is used by default for any full-text analyzed string field. If we were to reimplement the standard analyzer as a custom analyzer, it would be defined as follows:
{
    "type":      "custom",
    "tokenizer": "standard",
    "filter":  [ "lowercase", "stop" ]
}
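If you want to experiment with this definition yourself, a minimal sketch like the following registers it as a custom analyzer and runs a sample string through the analyze API. The index name my_index and the analyzer name rebuilt_standard are just placeholders:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":  [ "lowercase", "stop" ]
                }
            }
        }
    }
}

GET /my_index/_analyze?analyzer=rebuilt_standard
Some sample TEXT to analyze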
In Chapter 20 and Chapter 22, we talk about the lowercase and stop token filters, but for the moment, let’s focus on the standard tokenizer.
A tokenizer accepts a string as input, processes the string to break it into individual words, or tokens (perhaps discarding some characters like punctuation), and emits a token stream as output.
What is interesting is the algorithm that is used to identify words. The whitespace tokenizer simply breaks on whitespace—spaces, tabs, line feeds, and so forth—and assumes that contiguous nonwhitespace characters form a single token. For instance:
GET /_analyze?tokenizer=whitespace
You're the 1st runner home!
This request would return the following terms: You're, the, 1st, runner, home!
The letter tokenizer, on the other hand, breaks on any character that is not a letter, and so would return the following terms: You, re, the, st, runner, home.
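To compare for yourself, you can run the same sentence through the letter tokenizer with the analyze API:

GET /_analyze?tokenizer=letter
You're the 1st runner home!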
The standard tokenizer uses the Unicode Text Segmentation algorithm (as defined in Unicode Standard Annex #29) to find the boundaries between words, and emits everything in-between. Its knowledge of Unicode allows it to successfully tokenize text containing a mixture of languages.
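For example, a request along the following lines (the sample text is our own) would split a sentence that mixes Latin and Cyrillic script into the separate words Hello and Привет:

GET /_analyze?tokenizer=standard
Hello Привет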
Punctuation may or may not be considered part of a word, depending on where it appears:
GET /_analyze?tokenizer=standard
You're my 'favorite'.
In this example, the apostrophe in You're is treated as part of the word, while the single quotes in 'favorite' are not, resulting in the following terms: You're, my, favorite.
The uax_url_email tokenizer works in exactly the same way as the standard tokenizer, except that it recognizes email addresses and URLs and emits them as single tokens. The standard tokenizer, on the other hand, would try to break them into individual words. For instance, the email address joe-bloggs@foo-bar.com would result in the tokens joe, bloggs, foo, bar.com.
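You can see the difference by passing the same address to both tokenizers; the address shown is just an example:

GET /_analyze?tokenizer=standard
joe-bloggs@foo-bar.com

GET /_analyze?tokenizer=uax_url_email
joe-bloggs@foo-bar.com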
The standard tokenizer is a reasonable starting point for tokenizing most languages, especially Western languages. In fact, it forms the basis of most of the language-specific analyzers like the english, french, and spanish analyzers. Its support for Asian languages, however, is limited, and you should consider using the icu_tokenizer instead, which is available in the ICU plug-in.
The ICU analysis plug-in for Elasticsearch uses the International Components for Unicode (ICU) libraries (see site.icu-project.org) to provide a rich set of tools for dealing with Unicode. These include the icu_tokenizer, which is particularly useful for Asian languages, and a number of token filters that are essential for correct matching and sorting in all languages other than English.
The ICU plug-in is an essential tool for dealing with languages other than English, and it is highly recommended that you install and use it. Unfortunately, because it is based on the external ICU libraries, newer versions of the ICU plug-in may not be compatible with previous versions, so when upgrading, you may need to reindex your data.
To install the plug-in, first shut down your Elasticsearch node and then run the following command from the Elasticsearch home directory:
./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION
The current $VERSION
can be found at
https://github.com/elasticsearch/elasticsearch-analysis-icu.
Once installed, restart Elasticsearch, and you should see a line similar to the following in the startup logs:
[INFO][plugins] [Mysterio] loaded [marvel, analysis-icu], sites [marvel]
If you are running a cluster with multiple nodes, you will need to install the plug-in on every node in the cluster.
The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
For instance, compare the tokens produced by the standard tokenizer and the icu_tokenizer, respectively, when tokenizing “Hello. I am from Bangkok.” in Thai:
GET /_analyze?tokenizer=standard
สวัสดี ผมมาจากกรุงเทพฯ
The standard tokenizer produces two tokens, one for each sentence: สวัสดี, ผมมาจากกรุงเทพฯ. That is useful only if you want to search for the whole sentence “I am from Bangkok.”, but not if you want to search for just “Bangkok.”
GET /_analyze?tokenizer=icu_tokenizer
สวัสดี ผมมาจากกรุงเทพฯ
The icu_tokenizer, on the other hand, is able to break up the text into the individual words (สวัสดี, ผม, มา, จาก, กรุงเทพฯ), making them easier to search.
In contrast, the standard tokenizer “over-tokenizes” Chinese and Japanese text, often breaking up whole words into single characters. Because there are no spaces between words, it can be difficult to tell whether consecutive characters are separate words or form a single word. For instance:
向 means facing, 日 means sun, and 葵 means hollyhock. When written together, 向日葵 means sunflower.
五 means five or fifth, 月 means month, and 雨 means rain. The first two characters written together as 五月 mean the month of May, and adding the third character, 五月雨 means continuous rain. When combined with a fourth character, 式, meaning style, the word 五月雨式 becomes an adjective for anything consecutive or unrelenting.
Although each character may be a word in its own right, tokens are more meaningful when they retain the bigger original concept instead of just the component parts:
GET /_analyze?tokenizer=standard
向日葵

GET /_analyze?tokenizer=icu_tokenizer
向日葵
The standard tokenizer in the preceding example would emit each character as a separate token: 向, 日, 葵. The icu_tokenizer would emit the single token 向日葵 (sunflower).
Another difference between the standard tokenizer and the icu_tokenizer is that the latter will break a word containing characters written in different scripts (for example, βeta) into separate tokens—β, eta—while the former will emit the word as a single token: βeta.
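You can see this difference by running the same word through both tokenizers:

GET /_analyze?tokenizer=standard
βeta

GET /_analyze?tokenizer=icu_tokenizer
βeta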
Tokenizers produce the best results when the input text is clean, valid text, where valid means that it follows the punctuation rules that the Unicode algorithm expects. Quite often, though, the text we need to process is anything but clean. Cleaning it up before tokenization improves the quality of the output.
Passing HTML through the standard tokenizer or the icu_tokenizer produces poor results. These tokenizers just don’t know what to do with the HTML tags. For example:
GET /_analyze?tokenizer=standard
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
The standard tokenizer confuses HTML tags and entities, and emits the following tokens: p, Some, d, eacute, j, agrave, vu, a, href, http, somedomain.com, website, a. Clearly not what was intended!
Character filters can be added to an analyzer to preprocess the text before it is passed to the tokenizer. In this case, we can use the html_strip character filter to remove HTML tags and to decode HTML entities such as &eacute; into the corresponding Unicode characters.

Character filters can be tested out via the analyze API by specifying them in the query string:
GET /_analyze?tokenizer=standard&char_filters=html_strip
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
To use them as part of the analyzer, they should be added to a custom
analyzer definition:
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {
                    "tokenizer":   "standard",
                    "char_filter": [ "html_strip" ]
                }
            }
        }
    }
}
Once created, our new my_html_analyzer can be tested with the analyze API:
GET /my_index/_analyze?analyzer=my_html_analyzer
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
This emits the tokens that we expect: Some, déjà, vu, website.
The standard tokenizer and icu_tokenizer both understand that an apostrophe within a word should be treated as part of the word, while single quotes that surround a word should not. Tokenizing the text You're my 'favorite'. would correctly emit the tokens You're, my, favorite.
Unfortunately, Unicode lists a few characters that are sometimes used as apostrophes:
U+0027
Apostrophe (')—the original ASCII character

U+2018
Left single-quotation mark (‘)—opening quote when single-quoting

U+2019
Right single-quotation mark (’)—closing quote when single-quoting, but also the preferred character to use as an apostrophe
Both tokenizers treat these three characters as an apostrophe (and thus as part of the word) when they appear within a word. Then there are another three apostrophe-like characters:
U+201B
Single high-reversed-9 quotation mark (‛)—same as U+2018 but differs in appearance

U+0091
Left single-quotation mark in ISO-8859-1—should not be used in Unicode

U+0092
Right single-quotation mark in ISO-8859-1—should not be used in Unicode
Both tokenizers treat these three characters as word boundaries—a place to break text into tokens. Unfortunately, some publishers use U+201B as a stylized way to write names like M‛coy, and the second two characters may well be produced by your word processor, depending on its age.
Even when using the “acceptable” quotation marks, a word written with a single right quotation mark—You’re—is not the same as the word written with an apostrophe—You're—which means that a query for one variant will not find the other.
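For instance, tokenizing the curly-quoted form with the standard tokenizer leaves the U+2019 character in the emitted token, so the resulting term You’re is a different term from You're:

GET /_analyze?tokenizer=standard
You’re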
Fortunately, it is possible to sort out this mess with the mapping character filter, which allows us to replace all instances of one character with another. In this case, we will replace all apostrophe variants with the simple U+0027 apostrophe:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "quotes": {
                    "type": "mapping",
                    "mappings": [
                        "\u0091=>\u0027",
                        "\u0092=>\u0027",
                        "\u2018=>\u0027",
                        "\u2019=>\u0027",
                        "\u201B=>\u0027"
                    ]
                }
            },
            "analyzer": {
                "quotes_analyzer": {
                    "tokenizer":   "standard",
                    "char_filter": [ "quotes" ]
                }
            }
        }
    }
}
We define a custom char_filter called quotes that maps all apostrophe variants to a simple apostrophe.

For clarity, we have used the JSON Unicode escape syntax for each character, but we could just have used the characters themselves: "‘=>'".

We use our custom quotes character filter to create a new analyzer called quotes_analyzer.
As always, we test the analyzer after creating it:
GET /my_index/_analyze?analyzer=quotes_analyzer
You’re my ‘favorite’ M‛Coy
This example returns the following tokens, with all of the in-word quotation marks replaced by apostrophes: You're, my, favorite, M'Coy.
The more effort that you put into ensuring that the tokenizer receives good-quality input, the better your search results will be.