While playing around with the data in our index, we notice something odd. Something seems to be broken: we have 12 tweets in our indices, and only one of them contains the date 2014-09-15, but have a look at the total hits for the following queries:
GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1  result
GET /_search?q=date:2014         # 0  results !
Why does querying the _all
field for the full date
return all tweets, and querying the date
field for just the year return no
results? Why do our results differ when searching within the _all
field or
the date
field?
Presumably, it is because the way our data has been indexed in the _all
field is different from how it has been indexed in the date
field.
So let’s take a look at how Elasticsearch has interpreted our document
structure, by requesting the mapping (or schema definition)
for the tweet
type in the gb
index:
GET /gb/_mapping/tweet
This gives us the following:
{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type":   "date",
                  "format": "dateOptionalTime"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}
Elasticsearch has dynamically generated a mapping for us, based on what it could guess about our field types. The response shows us that the date field has been recognized as a field of type date. The _all field isn’t mentioned because it is a default field, but we know that the _all field is of type string.
So fields of type date
and fields of type string
are indexed differently,
and can thus be searched differently. That’s not entirely surprising.
You might expect that each of the core data types—strings, numbers, Booleans,
and dates—might be indexed slightly differently. And this is true:
there are slight differences.
But by far the biggest difference is between fields that represent
exact values (which can include string
fields) and fields that
represent full text. This distinction is really important—it’s the thing
that separates a search engine from all other databases.
Data in Elasticsearch can be broadly divided into two types: exact values and full text.
Exact values are exactly what they sound like. Examples are a date or a user ID, but can also include exact strings such as a username or an email address. The exact value Foo is not the same as the exact value foo. The exact value 2014 is not the same as the exact value 2014-09-15.
Full text, on the other hand, refers to textual data—usually written in some human language — like the text of a tweet or the body of an email.
Full text is often referred to as unstructured data, which is a misnomer—natural language is highly structured. The problem is that the rules of natural languages are complex, which makes them difficult for computers to parse correctly. For instance, consider this sentence:
May is fun but June bores me.
Does it refer to months or to people?
Exact values are easy to query. The decision is binary; a value either matches the query, or it doesn’t. This kind of query is easy to express with SQL:
WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"
Querying full-text data is much more subtle. We are not just asking, “Does this document match the query?” but “How well does this document match the query?” In other words, how relevant is this document to the given query?
We seldom want to match the whole full-text field exactly. Instead, we want to search within text fields. Not only that, but we expect search to understand our intent:
A search for UK should also return documents mentioning the United Kingdom.
A search for jump should also match jumped, jumps, jumping, and perhaps even leap.
johnny walker should match Johnnie Walker, and johnnie depp should match Johnny Depp.
fox news hunting should return stories about hunting on Fox News, while fox hunting news should return news stories about fox hunting.
To facilitate these types of queries on full-text fields, Elasticsearch first analyzes the text, and then uses the results to build an inverted index. We will discuss the inverted index and the analysis process in the next two sections.
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
For example, let’s say we have two documents, each with a content
field
containing the following:
The quick brown fox jumped over the lazy dog
Quick brown foxes leap over lazy dogs in summer
To create an inverted index, we first split the content
field of each
document into separate words (which we call terms, or tokens), create a
sorted list of all the unique terms, and then list in which document each term
appears. The result looks something like this:
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------
Now, if we want to search for quick brown, we just need to find the documents in which each term appears:
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match—is more relevant to our query—than the second document.
But there are a few problems with our current inverted index:
Quick and quick appear as separate terms, while the user probably thinks of them as the same word.
fox and foxes are pretty similar, as are dog and dogs; they share the same root word.
jumped and leap, while not from the same root word, are similar in meaning. They are synonyms.
With the preceding index, a search for +Quick +fox wouldn’t match any documents. (Remember, a preceding + means that the word must be present.) Both the term Quick and the term fox have to be in the same document in order to satisfy the query, but the first doc contains quick fox and the second doc contains Quick foxes.
Our user could reasonably expect both documents to match the query. We can do better.
If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:
Quick can be lowercased to become quick.
foxes can be stemmed—reduced to its root form—to become fox. Similarly, dogs could be stemmed to dog.
jumped and leap are synonyms and can be indexed as just the single term jump.
Now the index looks like this:
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------
But we’re not there yet. Our search for +Quick +fox would still fail, because we no longer have the exact term Quick in our index. However, if we apply the same normalization rules that we used on the content field to our query string, it would become a query for +quick +fox, which would match both documents!
This process of tokenization and normalization is called analysis, which we discuss in the next section.
Analysis is a process that consists of the following:
First, tokenizing a block of text into individual terms suitable for use in an inverted index,
Then normalizing these terms into a standard form to improve their “searchability,” or recall
This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:
First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & characters to the word and.
Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the), or add terms (for example, synonyms like jump and leap).
Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We discuss these in detail in “Custom Analyzers”.
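As a small taste of what such a combination looks like, here is a minimal sketch of a custom analyzer (the index name my_index and the analyzer name my_html_analyzer are our own inventions for this example) that strips HTML with a character filter, tokenizes with the standard tokenizer, and lowercases the resulting terms:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase" ]
                }
            }
        }
    }
}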
However, Elasticsearch also ships with prepackaged analyzers that you can use directly. We list the most important ones next and, to demonstrate the difference in behavior, we show what terms each would produce from this string:
"Set the shape to semi-transparent by calling set_trans(5)"
The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce
set, the, shape, to, semi, transparent, by, calling, set_trans, 5
The simple analyzer splits the text on anything that isn’t a letter, and lowercases the terms. It would produce
set, the, shape, to, semi, transparent, by, calling, set, trans
The whitespace analyzer splits the text on whitespace. It doesn’t lowercase. It would produce
Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
Language-specific analyzers are available for many languages. They are able to
take the peculiarities of the specified language into account. For instance,
the english
analyzer comes with a set of English stopwords (common words
like and
or the
that don’t have much impact on relevance), which it
removes. This analyzer is also able to stem English words because it understands the
rules of English grammar.
The english
analyzer would produce the following:
set, shape, semi, transpar, call, set_tran, 5
Note how transparent, calling, and set_trans have been stemmed to their root form.
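A quick way to check this kind of behavior yourself is the analyze API, which we introduce shortly; for example, you could run the same test string through the english analyzer:

GET /_analyze?analyzer=english
Set the shape to semi-transparent by calling set_trans(5)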
When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.
Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:
When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.
When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.
Now you can understand why the queries that we demonstrated at the start of this chapter return what they do:
The date field contains an exact value: the single term 2014-09-15.
The _all field is a full-text field, so the analysis process has converted the date into the three terms: 2014, 09, and 15.
When we query the _all field for 2014, it matches all 12 tweets, because all of them contain the term 2014:
GET /_search?q=2014              # 12 results
When we query the _all field for 2014-09-15, it first analyzes the query string to produce a query that matches any of the terms 2014, 09, or 15. This also matches all 12 tweets, because all of them contain the term 2014:
GET /_search?q=2014-09-15        # 12 results !
When we query the date field for 2014-09-15, it looks for that exact date, and finds one tweet only:
GET /_search?q=date:2014-09-15   # 1 result
When we query the date field for 2014, it finds no documents, because none contain that exact date:
GET /_search?q=date:2014         # 0 results !
Especially when you are new to Elasticsearch, it is sometimes difficult to
understand what is actually being tokenized and stored into your index. To
better understand what is going on, you can use the analyze
API to see how
text is analyzed. Specify which analyzer to use in the query-string
parameters, and the text to analyze in the body:
GET /_analyze?analyzer=standard
Text to analyze
Each element in the result represents a single term:
{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}
The token
is the actual term that will be stored in the index. The
position
indicates the order in which the terms appeared in the original
text. The start_offset
and end_offset
indicate the character positions
that the original word occupied in the original string.
type
values like <ALPHANUM>
vary per analyzer and can be ignored.
The only place that they are used in Elasticsearch is in the
keep_types
token filter.
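If you are curious, here is a minimal sketch of what configuring that filter might look like (the index name my_index and the filter name numbers_only are hypothetical), keeping only tokens that the tokenizer tagged as numbers:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "numbers_only": {
                    "type":  "keep_types",
                    "types": [ "<NUM>" ]
                }
            }
        }
    }
}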
The analyze
API is a useful tool for understanding what is happening
inside Elasticsearch indices, and we will talk more about it as we progress.
When Elasticsearch detects a new string field in your documents, it
automatically configures it as a full-text string
field and analyzes it with
the standard
analyzer.
You don’t always want this. Perhaps you want to apply a different analyzer that suits the language your data is in. And sometimes you want a string field to be just a string field—to index the exact value that you pass in, without any analysis, such as a string user ID or an internal status field or tag.
To achieve this, we have to configure these fields manually by specifying the mapping.
In order to be able to treat date fields as dates, numeric fields as numbers, and string fields as full-text or exact-value strings, Elasticsearch needs to know what type of data each field contains. This information is contained in the mapping.
As explained in Chapter 3, each document in an index has a type. Every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype for each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.
We discuss mappings in detail in “Types and Mappings”. In this section, we’re going to look at just enough to get you started.
Elasticsearch supports the following simple field types:
String: string
Whole number: byte, short, integer, long
Floating-point: float, double
Boolean: boolean
Date: date
When you index a document that contains a new field—one previously not seen—Elasticsearch will use dynamic mapping to try to guess the field type from the basic datatypes available in JSON, using the following rules:
JSON type                          Field type
---------------------------------------------
Boolean: true or false             boolean
Whole number: 123                  long
Floating point: 123.45             double
String, valid date: 2014-09-15     date
String: foo bar                    string

This means that if you index a number in quotes ("123"), it will be mapped as type string, not type long. However, if the field is already mapped as type long, then Elasticsearch will try to convert the string into a long, and throw an exception if it can't.
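For example, indexing a document like the following into a brand-new field (my_index is a hypothetical index name):

PUT /my_index/tweet/1
{ "count": "123" }

Because of the quotes, the new count field would be dynamically mapped as type string, not long.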
We can view the mapping that Elasticsearch has for one or more types in one or more indices by using the /_mapping endpoint. At the start of this chapter, we already retrieved the mapping for type tweet in index gb:
GET /gb/_mapping/tweet
This shows us the mapping for the fields (called properties) that Elasticsearch generated dynamically from the documents that we indexed:
{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type":   "date",
                  "format": "dateOptionalTime"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}
While the basic field datatypes are sufficient for many cases, you will often need to customize the mapping for individual fields, especially string fields. Custom mappings allow you to do the following:
Distinguish between full-text string fields and exact value string fields
Use language-specific analyzers
Optimize a field for partial matching
Specify custom date formats
And much more
The most important attribute of a field is the type. For fields other than string fields, you will seldom need to map anything other than type:
{
    "number_of_clicks": {
        "type": "integer"
    }
}
Fields of type string
are, by default, considered to contain full text.
That is, their value will be passed through an analyzer before being indexed,
and a full-text query on the field will pass the query string through an
analyzer before searching.
The two most important mapping attributes for string fields are index and analyzer.
The index
attribute controls how the string will be indexed. It
can contain one of three values:
analyzed: First analyze the string and then index it. In other words, index this field as full text.
not_analyzed: Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.
no: Don’t index this field at all. This field will not be searchable.
The default value of index for a string field is analyzed. If we want to map the field as an exact value, we need to set it to not_analyzed:
{
    "tag": {
        "type":  "string",
        "index": "not_analyzed"
    }
}
The other simple types (such as long, double, date, etc.) also accept the index parameter, but the only relevant values are no and not_analyzed, as their values are never analyzed.
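For instance, a minimal sketch (the field name session_checksum is hypothetical) of a numeric field that should be stored in the document but never be searchable:

{
    "session_checksum": {
        "type":  "long",
        "index": "no"
    }
}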
For analyzed string fields, use the analyzer attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:
{
    "tweet": {
        "type":     "string",
        "analyzer": "english"
    }
}
In “Custom Analyzers”, we show you how to define and use custom analyzers as well.
You can specify the mapping for a type when you first create an index.
Alternatively, you can add the mapping for a new type (or update the mapping
for an existing type) later, using the /_mapping
endpoint.
Although you can add to an existing mapping, you can’t change it. If a field already exists in the mapping, the data from that field probably has already been indexed. If you were to change the field mapping, the already indexed data would be wrong and would not be properly searchable.
We can update a mapping to add a new field, but we can’t change an existing field from analyzed to not_analyzed.
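For instance, a request like the following, which tries to switch the already analyzed tweet field to not_analyzed, would be rejected by Elasticsearch rather than applied (shown purely for illustration):

PUT /gb/_mapping/tweet
{
    "properties": {
        "tweet": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}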
To demonstrate both ways of specifying mappings, let’s first delete the gb
index:
DELETE /gb
Then create a new index, specifying that the tweet
field should use
the english
analyzer:
PUT /gb
{
    "mappings": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type":     "string",
                    "analyzer": "english"
                },
                "date": {
                    "type": "date"
                },
                "name": {
                    "type": "string"
                },
                "user_id": {
                    "type": "long"
                }
            }
        }
    }
}
Later on, we decide to add a new not_analyzed
text field called tag
to the
tweet
mapping, using the _mapping
endpoint:
PUT /gb/_mapping/tweet
{
    "properties": {
        "tag": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}
Note that we didn’t need to list all of the existing fields again, as we can’t change them anyway. Our new field has been merged into the existing mapping.
You can use the analyze
API to test the mapping for string fields by
name. Compare the output of these two requests:
GET /gb/_analyze?field=tweet
Black-cats

GET /gb/_analyze?field=tag
Black-cats
The tweet field produces the two terms black and cat, while the tag field produces the single term Black-cats. In other words, our mapping is working correctly.
Besides the simple scalar datatypes that we have mentioned, JSON also
has null
values, arrays, and objects, all of which are supported by
Elasticsearch.
It is quite possible that we want our tag
field to contain more
than one tag. Instead of a single string, we could index an array of tags:
{ "tag": [ "search", "nosql" ] }
There is no special mapping required for arrays. Any field can contain zero, one, or more values, in the same way as a full-text field is analyzed to produce multiple terms.
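For instance, assuming the array document above has been indexed into our gb index, a query for either tag value should find it:

GET /gb/_search?q=tag:nosql    # matches the document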
By implication, this means that all the values of an array must be
of the same datatype. You can’t mix dates with strings. If you create
a new field by indexing an array, Elasticsearch will use the
datatype of the first value in the array to determine the type
of the
new field.
When you get a document back from Elasticsearch, any arrays will be in the
same order as when you indexed the document. The _source
field that you get
back contains exactly the same JSON document that you indexed.
However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to “the first element” or “the last element.” Rather, think of an array as a bag of values.
Arrays can, of course, be empty. This is the equivalent of having zero
values. In fact, there is no way of storing a null
value in Lucene, so
a field with a null
value is also considered to be an empty
field.
These three fields would all be considered to be empty, and would not be indexed:
"null_value"
:
null
,
"empty_array"
:
[],
"array_with_null_value"
:
[
null
]
The last native JSON datatype that we need to discuss is the object — known in other languages as a hash, hashmap, dictionary or associative array.
Inner objects are often used to embed one entity or object inside
another. For instance, instead of having fields called user_name
and user_id
inside our tweet
document, we could write it as follows:
{
    "tweet":            "Elasticsearch is very flexible",
    "user": {
        "id":           "@johnsmith",
        "gender":       "male",
        "age":          26,
        "name": {
            "full":     "John Smith",
            "first":    "John",
            "last":     "Smith"
        }
    }
}
Elasticsearch will detect new object fields dynamically and map them as
type object
, with each inner field listed under properties
:
{
    "gb": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type": "string"
                },
                "user": {
                    "type": "object",
                    "properties": {
                        "id":     { "type": "string" },
                        "gender": { "type": "string" },
                        "age":    { "type": "long" },
                        "name": {
                            "type": "object",
                            "properties": {
                                "full":  { "type": "string" },
                                "first": { "type": "string" },
                                "last":  { "type": "string" }
                            }
                        }
                    }
                }
            }
        }
    }
}
The mapping for the user and name fields has a similar structure to the mapping for the tweet type itself. In fact, the type mapping is just a special type of object mapping, which we refer to as the root object. It is just the same as any other object, except that it has some special top-level fields for document metadata, such as _source, and the _all field.
Lucene doesn’t understand inner objects. A Lucene document consists of a flat list of key-value pairs. In order for Elasticsearch to index inner objects usefully, it converts our document into something like this:
{
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}
Inner fields can be referred to by name (for example, first). To distinguish between two fields that have the same name, we can use the full path (for example, user.name.first) or even the type name plus the path (tweet.user.name.first).
Note that in the simple flattened document shown previously, there is no field called user and no field called user.name. Lucene indexes only scalar or simple values, not complex data structures.
Finally, consider how an array containing inner objects would be indexed.
Let’s say we have a followers
array that looks like this:
{
    "followers": [
        { "age": 35, "name": "Mary White" },
        { "age": 26, "name": "Alex Jones" },
        { "age": 19, "name": "Lisa Smith" }
    ]
}
This document will be flattened as we described previously, but the result will look like this:
{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}
The correlation between {age: 35}
and {name: Mary White}
has been lost as
each multivalue field is just a bag of values, not an ordered array. This is
sufficient for us to ask, “Is there a follower who is 26 years old?”
But we can’t get an accurate answer to this: “Is there a follower who is 26 years old and who is called Alex Jones?”
Correlated inner objects, which are able to answer queries like these, are called nested objects, and we cover them later, in Chapter 41.