Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
These analyzers typically perform four roles:

- Tokenize text into individual words: `The quick brown foxes` → [`The`, `quick`, `brown`, `foxes`]
- Lowercase tokens: `The` → `the`
- Remove common stopwords: [`The`, `quick`, `brown`, `foxes`] → [`quick`, `brown`, `foxes`]
- Stem tokens to their root form: `foxes` → `fox`
Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:

- The `english` analyzer removes the possessive `'s`: `John's` → `john`
- The `french` analyzer removes elisions like `l'` and `qu'` and diacritics like `¨` or `^`: `l'église` → `eglis`
- The `german` analyzer normalizes terms, replacing `ä` and `ae` with `a`, or `ß` with `ss`, among others: `äußerst` → `ausserst`
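You can check this behavior for yourself with the `_analyze` API. The following is a minimal sketch that passes a single German word through the built-in `german` analyzer by name:

```
GET /_analyze?analyzer=german
äußerst
```

Per the normalization rules above, this should emit the single token `ausserst`.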
The built-in language analyzers are available globally and don’t need to be configured before being used. They can be specified directly in the field mapping:
```
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english"
        }
      }
    }
  }
}
```
Of course, by passing text through the `english` analyzer, we lose information:

```
GET /my_index/_analyze?field=title
I'm not happy about the foxes
```
We can’t tell if the document mentions one `fox` or many `foxes`; the word `not` is a stopword and is removed, so we can’t tell whether the document is happy about foxes or not. By using the `english` analyzer, we have increased recall as we can match more loosely, but we have reduced our ability to rank documents accurately.
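For comparison, a sketch of the same request run through the `standard` analyzer, specifying the analyzer by name rather than via the field mapping:

```
GET /my_index/_analyze?analyzer=standard
I'm not happy about the foxes
```

The `standard` analyzer lowercases but, by default, neither stems nor removes stopwords, so `not` and `foxes` survive as distinct tokens.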
To get the best of both worlds, we can use multifields to index the `title` field twice: once with the `english` analyzer and once with the `standard` analyzer:
```
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "english": {
              "type":     "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
```
The main `title` field uses the `standard` analyzer. The `title.english` subfield uses the `english` analyzer.
With this mapping in place, we can index some test documents to demonstrate how to use both fields at query time:
```
PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields",
      "query":  "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
```
Use the `most_fields` query type to match the same text in as many fields as possible.

Even though neither of our documents contains the word `foxes`, both documents are returned as results thanks to word stemming on the `title.english` field. The second document is ranked as more relevant, because the word `not` matches on the `title` field.
While the language analyzers can be used out of the box without any configuration, most of them do allow you to control aspects of their behavior, specifically stem-word exclusion and custom stopwords.
Imagine, for instance, that users searching for the “World Health Organization” are instead getting results for “organ health.” The reason for this confusion is that both “organ” and “organization” are stemmed to the same root word: `organ`. Often this isn’t a problem, but in this particular collection of documents, this leads to confusing results. We would like to prevent the words `organization` and `organizations` from being stemmed.
The default list of stopwords used in English is as follows:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
The unusual thing about `no` and `not` is that they invert the meaning of the words that follow them. Perhaps we decide that these two words are important and that we shouldn’t treat them as stopwords.
To customize the behavior of the `english` analyzer, we need to create a custom analyzer that uses the `english` analyzer as its base but adds some configuration:
```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ],
          "stopwords": [
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The World Health Organization does not sell organs.
```
The `stem_exclusion` parameter prevents `organization` and `organizations` from being stemmed, while the `stopwords` parameter specifies a custom list of stopwords with `no` and `not` removed. The `_analyze` request emits the tokens `world`, `health`, `organization`, `does`, `not`, `sell`, `organ`.
We discuss stemming and stopwords in much more detail in Chapter 21 and Chapter 22, respectively.
If you have to deal with only a single language, count yourself lucky. Finding the right strategy for handling documents written in several languages can be challenging.
Multilingual documents come in three main varieties:
One predominant language per document, which may contain snippets from other languages (See “One Language per Document”.)
One predominant language per field, which may contain snippets from other languages (See “One Language per Field”.)
A mixture of languages per field (See “Mixed-Language Fields”.)
The goal, although not always achievable, should be to keep languages separate. Mixing languages in the same inverted index can be problematic.
The stemming rules for German are different from those for English, French, Swedish, and so on. Applying the same stemming rules to different languages will result in some words being stemmed correctly, some incorrectly, and some not being stemmed at all. It may even result in words from different languages with different meanings being stemmed to the same root word, conflating their meanings and producing confusing search results for the user.
Applying multiple stemmers in turn to the same text is likely to result in rubbish, as the next stemmer may try to stem an already stemmed word, compounding the problem.
In “What Is Relevance?”, we explained that the more frequently a term appears in a collection of documents, the less weight that term has. For accurate relevance calculations, you need accurate term-frequency statistics.
A short snippet of German appearing in predominantly English text would give more weight to the German words, given that they are relatively uncommon. But mix those with documents that are predominantly German, and the short German snippets now have much less weight.
It is not sufficient just to think about your documents, though. You also need to think about how your users will query those documents. Often you will be able to identify the main language of the user either from the language of that user’s chosen interface (for example, `mysite.de` versus `mysite.fr`) or from the `accept-language` HTTP header from the user’s browser.
User searches also come in three main varieties:
Users search for words in their main language.
Users search for words in a different language, but expect results in their main language.
Users search for words in a different language, and expect results in that language (for example, a bilingual person, or a foreign visitor in a web cafe).
Depending on the type of data that you are searching, it may be appropriate to return results in a single language (for example, a user searching for products on the Spanish version of the website) or to combine results in the identified main language of the user with results from other languages.
Usually, it makes sense to give preference to the user’s language. An English-speaking user searching the Web for “deja vu” would probably prefer to see the English Wikipedia page rather than the French Wikipedia page.
You may already know the language of your documents. Perhaps your documents are created within your organization and translated into a list of predefined languages. Human pre-identification is probably the most reliable method of classifying language correctly.
Perhaps, though, your documents come from an external source without any language classification, or possibly with incorrect classification. In these cases, you need to use a heuristic to identify the predominant language. Fortunately, libraries are available in several languages to help with this problem.
Of particular note is the chromium-compact-language-detector library from Mike McCandless, which uses the open source (Apache License 2.0) Compact Language Detector (CLD) from Google. It is small, fast, and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R.
Identifying the language of the user’s search request is not quite as simple. The CLD is designed for text that is at least 200 characters in length. Shorter amounts of text, such as search keywords, produce much less accurate results. In these cases, it may be preferable to take simple heuristics into account, such as the country of origin, the user’s selected language, and the HTTP `accept-language` headers.
A single predominant language per document requires a relatively simple setup. Documents from different languages can be stored in separate indices (`blogs-en`, `blogs-fr`, and so forth) that use the same type and the same fields for each index, just with different analyzers:
```
PUT /blogs-en
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "stemmed": {
              "type":     "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

PUT /blogs-fr
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "stemmed": {
              "type":     "string",
              "analyzer": "french"
            }
          }
        }
      }
    }
  }
}
```
Both `blogs-en` and `blogs-fr` have a type called `post` that contains the field `title`. The `title.stemmed` subfield uses a language-specific analyzer.
This approach is clean and flexible. New languages are easy to add—just create a new index—and because each language is completely separate, we don’t suffer from the term-frequency and stemming problems described in “Pitfalls of Mixing Languages”.
The documents of a single language can be queried independently, or queries can target multiple languages by querying multiple indices. We can even specify a preference for particular languages with the `indices_boost` parameter:
```
GET /blogs-*/post/_search
{
  "query": {
    "multi_match": {
      "query":  "deja vu",
      "fields": [ "title", "title.stemmed" ],
      "type":   "most_fields"
    }
  },
  "indices_boost": {
    "blogs-en": 3,
    "blogs-fr": 2
  }
}
```
This search is performed on any index beginning with `blogs-`. The `title.stemmed` fields are queried using the analyzer specified in each index. Perhaps the user’s `accept-language` headers showed a preference for English, and then French, so we boost results from each index accordingly; any other languages will have a neutral boost of `1`.
Of course, these documents may contain words or sentences in other languages, and these words are unlikely to be stemmed correctly. With predominant-language documents, this is not usually a major problem. The user will often search for the exact words—for instance, of a quotation from another language—rather than for inflections of a word. Recall can be improved by using techniques explained in Chapter 20.
Perhaps some words like place names should be queryable in the predominant language and in the original language, such as Munich and München. These words are effectively synonyms, which we discuss in Chapter 23.
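One way to handle such place names is with a synonym token filter. The following is a minimal sketch, assuming a hypothetical `blogs-de` index and filter and analyzer names chosen for illustration; synonyms are covered properly in Chapter 23:

```
PUT /blogs-de
{
  "settings": {
    "analysis": {
      "filter": {
        "place_names": {
          "type":     "synonym",
          "synonyms": [ "münchen,munich" ]
        }
      },
      "analyzer": {
        "german_with_places": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter":    [ "lowercase", "place_names" ]
        }
      }
    }
  }
}
```

A field analyzed with `german_with_places` would then match queries for either spelling of the city.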
For documents that represent entities like products, movies, or legal notices, it is common for the same text to be translated into several languages. Although each translation could be represented in a single document in an index per language, another reasonable approach is to keep all translations in the same document:
```
{
  "title":    "Fight club",
  "title_br": "Clube de Luta",
  "title_cz": "Klub rváčů",
  "title_en": "Fight club",
  "title_es": "El club de la lucha",
  ...
}
```
Each translation is stored in a separate field, which is analyzed according to the language it contains:
```
PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string"
        },
        "title_br": {
          "type":     "string",
          "analyzer": "brazilian"
        },
        "title_cz": {
          "type":     "string",
          "analyzer": "czech"
        },
        "title_en": {
          "type":     "string",
          "analyzer": "english"
        },
        "title_es": {
          "type":     "string",
          "analyzer": "spanish"
        }
      }
    }
  }
}
```
The `title` field contains the original title and uses the `standard` analyzer. Each of the other fields uses the appropriate analyzer for that language.
Like the index-per-language approach, the field-per-language approach
maintains clean term frequencies. It is not quite as flexible as having
separate indices. Although it is easy to add a new field by using the update-mapping
API, those new fields may require new
custom analyzers, which can only be set up at index creation time. As a
workaround, you can close the index, add the new
analyzers with the update-settings
API,
then reopen the index, but closing the index means that it will require some
downtime.
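That workaround can be sketched as follows, assuming we want to add a hypothetical custom analyzer named `title_shingles` to the existing `movies` index:

```
POST /movies/_close

PUT /movies/_settings
{
  "analysis": {
    "analyzer": {
      "title_shingles": {
        "type":      "custom",
        "tokenizer": "standard",
        "filter":    [ "lowercase", "shingle" ]
      }
    }
  }
}

POST /movies/_open
```

While the index is closed, it can serve neither reads nor writes, hence the downtime mentioned above.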
The documents of a single language can be queried independently, or queries can target multiple languages by querying multiple fields. We can even specify a preference for particular languages by boosting that field:
```
GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":  "club de la lucha",
      "fields": [ "title*", "title_es^2" ],
      "type":   "most_fields"
    }
  }
}
```
Usually, documents that mix multiple languages in a single field come from sources beyond your control, such as pages scraped from the Web:
```
{ "body": "Page not found / Seite nicht gefunden / Page non trouvée" }
```
They are the most difficult type of multilingual document to handle correctly.
Although you can simply use the `standard` analyzer on all fields, your documents will be less searchable than if you had used an appropriate stemmer. But of course, you can’t choose just one stemmer; stemmers are language specific. Or rather, stemmers are language and script specific. As discussed in “Stemmer per Script”, if every language uses a different script, then stemmers can be combined.
Assuming that your mix of languages uses the same script, such as Latin, you have three choices available to you:

- Split into separate fields
- Analyze multiple times
- Use n-grams
The Compact Language Detector mentioned in “Identifying Language” can tell you which parts of the document are in which language. You can split up the text based on language and use the same approach as was used in “One Language per Field”.
If you primarily deal with a limited number of languages, you could use multi-fields to analyze the text once per language:
```
PUT /movies
{
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type":     "string",
              "analyzer": "german"
            },
            "en": {
              "type":     "string",
              "analyzer": "english"
            },
            "fr": {
              "type":     "string",
              "analyzer": "french"
            },
            "es": {
              "type":     "string",
              "analyzer": "spanish"
            }
          }
        }
      }
    }
  }
}
```
You could index all words as n-grams, using the same approach as described in “Ngrams for Compound Words”. Most inflections involve adding a suffix (or in some languages, a prefix) to a word, so by breaking each word into n-grams, you have a good chance of matching words that are similar but not exactly the same. This can be combined with the analyze-multiple-times approach to provide a catchall field for unsupported languages:
```
PUT /movies
{
  "settings": {
    "analysis": {...}
  },
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type":     "string",
              "analyzer": "german"
            },
            "en": {
              "type":     "string",
              "analyzer": "english"
            },
            "fr": {
              "type":     "string",
              "analyzer": "french"
            },
            "es": {
              "type":     "string",
              "analyzer": "spanish"
            },
            "general": {
              "type":     "string",
              "analyzer": "trigrams"
            }
          }
        }
      }
    }
  }
}
```
In the `analysis` section, we define the same `trigrams` analyzer as described in “Ngrams for Compound Words”. The `title.general` field uses the `trigrams` analyzer to index any language.
When querying the catchall `general` field, you can use `minimum_should_match` to reduce the number of low-quality matches. It may also be necessary to boost the other fields slightly more than the `general` field, so that matches on the main language fields are given more weight than those on the `general` field:
```
GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":  "club de la lucha",
      "fields": [ "title*^1.5", "title.general" ],
      "type":   "most_fields",
      "minimum_should_match": "75%"
    }
  }
}
```