By default, results are returned sorted by relevance—with the most
relevant docs first. Later in this chapter, we explain what we mean by
relevance and how it is calculated, but let’s start by looking at the sort
parameter and how to use it.
In order to sort by relevance, we need to represent relevance as a value. In
Elasticsearch, the relevance score is represented by the floating-point
number returned in the search results as the _score, so the default sort
order is _score descending.
Sometimes, though, you don’t have a meaningful relevance score. For instance,
the following query just returns all tweets whose user_id field has the
value 1:
GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "user_id" : 1
                }
            }
        }
    }
}
Filters have no bearing on _score, and the missing-but-implied match_all
query just sets the _score to a neutral value of 1 for all documents. In
other words, all documents are considered to be equally relevant.
In this case, it probably makes sense to sort tweets by recency, with the most
recent tweets first. We can do this with the sort parameter:
GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort" : { "date" : { "order" : "desc" }}
}
You will notice two differences in the results:
"hits" : {
    "total" : 6,
    "max_score" : null,
    "hits" : [ {
        "_index" : "us",
        "_type" : "tweet",
        "_id" : "14",
        "_score" : null,
        "_source" : {
            "date" : "2014-09-24",
            ...
        },
        "sort" : [ 1411516800000 ]
    },
    ...
}
The _score is not calculated, because it is not being used for sorting.
The value of the date field, expressed as milliseconds since the epoch, is returned in the sort values.
The first is that we have a new element in each result called sort, which
contains the value(s) that were used for sorting. In this case, we sorted on
date, which internally is indexed as milliseconds since the epoch. The long
number 1411516800000 is equivalent to the date string 2014-09-24 00:00:00 UTC.
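To read a date sort value by hand, you can convert it back from milliseconds since the epoch. The following Python sketch uses only the standard library; the helper name sort_value_to_utc is made up for illustration and is not part of Elasticsearch.

```python
from datetime import datetime, timezone


def sort_value_to_utc(millis: int) -> datetime:
    """Convert a date sort value (milliseconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# The sort value taken from the search result shown above.
print(sort_value_to_utc(1411516800000).isoformat())  # 2014-09-24T00:00:00+00:00
```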
The second is that the _score and max_score are both null. Calculating
the _score can be quite expensive, and usually its only purpose is for
sorting; we’re not sorting by relevance, so it doesn’t make sense to keep
track of the _score. If you want the _score to be calculated regardless,
you can set the track_scores parameter to true.
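As a rough illustration, the request body for the date-sorted search with score tracking switched on could be built like this in Python. The function name search_body is hypothetical, and the sketch assumes that track_scores sits at the top level of the request body, next to query and sort.

```python
import json


def search_body(user_id: int, track_scores: bool = False) -> dict:
    """Build the date-sorted search body used above (hypothetical helper)."""
    body = {
        "query": {"filtered": {"filter": {"term": {"user_id": user_id}}}},
        "sort": {"date": {"order": "desc"}},
    }
    if track_scores:
        # Assumption: track_scores is a top-level key of the search body.
        body["track_scores"] = True
    return body


print(json.dumps(search_body(1, track_scores=True), indent=2))
```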
Perhaps we want to combine the _score from a query with the date, and
show all matching results sorted first by date, then by relevance:
GET /_search
{
    "query" : {
        "filtered" : {
            "query" : { "match" : { "tweet" : "manage text search" }},
            "filter" : { "term" : { "user_id" : 2 }}
        }
    },
    "sort" : [
        { "date" : { "order" : "desc" }},
        { "_score" : { "order" : "desc" }}
    ]
}
Order is important. Results are sorted by the first criterion first. Only
results whose first sort value is identical will then be sorted by the
second criterion, and so on.
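The same rule can be sketched outside Elasticsearch with a composite sort key. The hits below are invented sample data, not real search results; the point is only that the second criterion matters solely among hits that tie on the first.

```python
# Invented sample hits: two share a date, so _score breaks the tie between them.
hits = [
    {"_id": "a", "date": "2014-09-22", "_score": 0.5},
    {"_id": "b", "date": "2014-09-24", "_score": 0.2},
    {"_id": "c", "date": "2014-09-24", "_score": 0.9},
]

# Sort by date descending first, then by _score descending.
# ISO-8601 date strings compare correctly as plain strings.
ordered = sorted(hits, key=lambda h: (h["date"], h["_score"]), reverse=True)

print([h["_id"] for h in ordered])  # ['c', 'b', 'a']
```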
Multilevel sorting doesn’t have to involve the _score. You could sort
by using several different fields, on geo-distance or on a custom value
calculated in a script.
When sorting on fields with more than one value, remember that the values do not have any intrinsic order; a multivalue field is just a bag of values. Which one do you choose to sort on?
For numbers and dates, you can reduce a multivalue field to a single value
by using the min, max, avg, or sum sort modes. For instance, you
could sort on the earliest date in each dates field by using the following:
"sort" : {
    "dates" : {
        "order" : "asc",
        "mode" : "min"
    }
}
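Conceptually, a sort mode collapses each document's bag of values into one sort key. The Python sketch below mimics that idea with plain integers; the names SORT_MODES and sort_key and the sample documents are illustrative assumptions, not Elasticsearch internals.

```python
from statistics import mean

# Map each sort mode named in the text to a reducing function.
SORT_MODES = {"min": min, "max": max, "avg": mean, "sum": sum}


def sort_key(values, mode="min"):
    """Reduce a multivalue field to the single value used for sorting."""
    return SORT_MODES[mode](values)


# Invented documents, each with a multivalue numeric field.
docs = {"doc1": [30, 10, 20], "doc2": [5, 40], "doc3": [15]}

# Ascending order on the smallest value in each document.
ascending_by_min = sorted(docs, key=lambda d: sort_key(docs[d], "min"))
print(ascending_by_min)  # ['doc2', 'doc1', 'doc3']
```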
Analyzed string fields are also multivalue fields, but sorting on them seldom
gives you the results you want. If you analyze a string like fine old art,
it results in three terms. We probably want to sort alphabetically on the
first term, then the second term, and so forth, but Elasticsearch doesn’t have this
information at its disposal at sort time.
You could use the min and max sort modes (it uses min by default), but
that will result in sorting on either art or old, neither of which was the
intent.
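A quick way to see why is to apply min and max to the three terms directly. In this sketch, splitting on whitespace stands in for analysis, which is a simplification: a real analyzer may also lowercase, stem, or drop terms.

```python
# Whitespace splitting is only a stand-in for the real analysis process.
terms = "fine old art".split()

# The original word order is lost; only the bag of terms remains.
print(min(terms))  # art
print(max(terms))  # old
```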
In order to sort on a string field, that field should contain one term only:
the whole not_analyzed string. But of course we still need the field to be
analyzed in order to be able to query it as full text.
The naive approach to indexing the same string in two ways would be to include
two separate fields in the document: one that is analyzed for searching,
and one that is not_analyzed for sorting.
But storing the same string twice in the _source field is a waste of space.
What we really want to do is to pass in a single field but to index it in two different ways. All of the core field types (strings, numbers,
Booleans, dates) accept a fields parameter that allows you to transform a
simple mapping like
"tweet" : {
    "type" : "string",
    "analyzer" : "english"
}
into a multifield mapping like this:
"tweet" : {
    "type" : "string",
    "analyzer" : "english",
    "fields" : {
        "raw" : {
            "type" : "string",
            "index" : "not_analyzed"
        }
    }
}
The main tweet field is just the same as before: an analyzed full-text field.
The new tweet.raw subfield is not_analyzed.
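If you build mappings programmatically, the transformation from the simple mapping to the multifield mapping can be sketched as a small function. The helper name with_raw_subfield is hypothetical; it only manipulates a Python dict and does not talk to Elasticsearch.

```python
import copy


def with_raw_subfield(mapping: dict) -> dict:
    """Return a copy of a string field mapping with a not_analyzed raw subfield."""
    multi = copy.deepcopy(mapping)  # leave the caller's mapping untouched
    multi.setdefault("fields", {})["raw"] = {
        "type": "string",
        "index": "not_analyzed",
    }
    return multi


simple = {"type": "string", "analyzer": "english"}
print(with_raw_subfield(simple))
```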
Now, or at least as soon as we have reindexed our data, we can use the tweet
field for search and the tweet.raw field for sorting:
GET /_search
{
    "query" : {
        "match" : {
            "tweet" : "elasticsearch"
        }
    },
    "sort" : "tweet.raw"
}
Sorting on an analyzed field can use a lot of memory. See “Fielddata” for more information.
We’ve mentioned that, by default, results are returned in descending order of relevance. But what is relevance? How is it calculated?
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant
the document.
A query clause generates a _score for each document. How that score is
calculated depends on the type of query clause. Different query clauses are
used for different purposes: a fuzzy query might determine the _score by
calculating how similar the spelling of the found word is to the original
search term; a terms query would incorporate the percentage of terms that
were found. However, what we usually mean by relevance is the algorithm that we
use to calculate how similar the contents of a full-text field are to a full-text query string.
The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:
Term frequency: How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.
Field-length norm: How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
Individual queries may combine the TF/IDF score with other factors such as the term proximity in phrase queries, or term similarity in fuzzy queries.
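The three factors above can be made concrete with a small Python sketch. The formulas used here are those of Lucene's classic TF/IDF similarity as commonly documented; this section only describes the factors qualitatively, so treat the exact formulas as an assumption. The function names are illustrative, and the real field-length norm is stored with reduced precision, which this sketch ignores.

```python
import math


def tf(freq: float) -> float:
    """Term frequency: grows with the number of occurrences in the field."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: shrinks as more documents contain the term."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(num_terms)


print(tf(5) > tf(1))                       # more mentions, higher weight
print(idf(1, 1000) > idf(500, 1000))       # rarer term, higher weight
print(field_norm(4) > field_norm(400))     # shorter field, higher weight
```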
Relevance is not just about full-text search, though. It can equally be applied
to yes/no clauses, where the more clauses that match, the higher the _score.
When multiple query clauses are combined using a compound query like the
bool query, the _score from each of these query clauses is combined to
calculate the overall _score for the document.
When debugging a complex query, it can be difficult to understand
exactly how a _score has been calculated. Elasticsearch
has the option of producing an explanation with every search result,
by setting the explain parameter to true.
GET /_search?explain
{
    "query" : { "match" : { "tweet" : "honeymoon" }}
}
Adding explain produces a lot of output for every hit, which can look
overwhelming, but it is worth taking the time to understand what it all means.
Don’t worry if it doesn’t all make sense now; you can refer to this section
when you need it. We’ll work through the output for one hit bit by bit.
First, we have the metadata that is returned on normal search requests:
{
    "_index" : "us",
    "_type" : "tweet",
    "_id" : "12",
    "_score" : 0.076713204,
    "_source" : { ... trimmed ... },
It adds information about the shard and the node that the document came from, which is useful to know because term and document frequencies are calculated per shard, rather than per index:
    "_shard" : 1,
    "_node" : "mzIVYCsqSWCG_M_ZffSs9Q",
Then it provides the _explanation. Each entry contains a description
that tells you what type of calculation is being performed, a value
that gives you the result of the calculation, and the details of any
subcalculations that were required:
    "_explanation" : {
        "description" : "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
        "value" : 0.076713204,
        "details" : [
            {
                "description" : "fieldWeight in 0, product of:",
                "value" : 0.076713204,
                "details" : [
                    {
                        "description" : "tf(freq=1.0), with freq of:",
                        "value" : 1,
                        "details" : [
                            {
                                "description" : "termFreq=1.0",
                                "value" : 1
                            }
                        ]
                    },
                    {
                        "description" : "idf(docFreq=1, maxDocs=1)",
                        "value" : 0.30685282
                    },
                    {
                        "description" : "fieldNorm(doc=0)",
                        "value" : 0.25
                    }
                ]
            }
        ]
    }
}
Summary of the score calculation for honeymoon
Term frequency
Inverse document frequency
Field-length norm
Producing the explain output is expensive. It is a debugging tool
only. Don’t leave it turned on in production.
The first part is the summary of the calculation. It tells us that it has
calculated the weight (the TF/IDF) of the term honeymoon in the field tweet,
for document 0. (This is an internal document ID and, for our purposes, can be ignored.)
It then provides details of how the weight was calculated:
Term frequency: How many times did the term honeymoon appear in the tweet field in this document?
Inverse document frequency: How many times did the term honeymoon appear in the tweet field of all documents in the index?
Field-length norm: How long is the tweet field in this document? The longer the field, the smaller this number.
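The numbers in the explanation can be checked by hand, because the fieldWeight entry is described as a product of its three details. The Python sketch below multiplies the values copied from the output above. The final idf check relies on the formula 1 + ln(maxDocs / (docFreq + 1)), which is an assumption taken from Lucene's classic similarity rather than from this output.

```python
import math

# Values copied from the "details" entries of the explanation above.
tf = 1.0            # "tf(freq=1.0), with freq of:"
idf = 0.30685282    # "idf(docFreq=1, maxDocs=1)"
field_norm = 0.25   # "fieldNorm(doc=0)"

# "fieldWeight in 0, product of:" multiplies the three factors.
field_weight = tf * idf * field_norm
print(round(field_weight, 6))  # 0.076713

# Assumed classic-similarity formula for idf, with docFreq=1 and maxDocs=1.
idf_from_formula = 1.0 + math.log(1 / (1 + 1))
print(math.isclose(idf_from_formula, idf, rel_tol=1e-6))  # True
```

The small difference in the last digits between the computed product and the reported 0.076713204 comes from single-precision arithmetic in the reported value.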
Explanations for more-complicated queries can appear to be very complex, but really they just contain more of the same calculations that appear in the preceding example. This information can be invaluable for debugging why search results appear in the order that they do.
While the explain option adds an explanation for every result, you can use
the explain API to understand why one particular document matched or, more
important, why it didn’t match.
The path for the request is /index/type/id/_explain, as in the following:
GET /us/tweet/12/_explain
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 2 }},
            "query" : { "match" : { "tweet" : "honeymoon" }}
        }
    }
}
Along with the full explanation that we saw previously, we also now have a
description element, which tells us this:
"failure to match filter: cache(user_id:[2 TO 2])"
In other words, our user_id filter clause is preventing the document from
matching.
Our final topic in this chapter is about an internal aspect of Elasticsearch. While we don’t demonstrate any new techniques here, fielddata is an important topic that we will refer to repeatedly, and is something that you should be aware of.
When you sort on a field, Elasticsearch needs access to the value of that field for every document that matches the query. The inverted index, which performs very well when searching, is not the ideal structure for sorting on field values:
When searching, we need to be able to map a term to a list of documents.
When sorting, we need to map a document to its terms. In other words, we need to “uninvert” the inverted index.
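The difference between the two structures can be sketched with plain Python dictionaries. This is a toy model with invented terms and document IDs, not how Elasticsearch stores either structure; the function name uninvert is illustrative.

```python
from collections import defaultdict

# Toy inverted index: each term maps to the IDs of the documents containing it.
inverted = {
    "art": [1, 3],
    "fine": [1, 2],
    "old": [2, 3],
}


def uninvert(index: dict) -> dict:
    """Turn a term -> documents mapping into a document -> terms mapping."""
    doc_terms = defaultdict(list)
    for term, doc_ids in sorted(index.items()):
        for doc_id in doc_ids:
            doc_terms[doc_id].append(term)
    return dict(doc_terms)


# Searching asks: which documents contain "old"?
print(inverted["old"])

# Sorting asks: which terms does document 2 contain?
print(uninvert(inverted)[2])
```

Building the second mapping requires a pass over every term in the index, which hints at why doing it from disk on each request would be slow.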
To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata.
The reason that Elasticsearch loads all values into memory is that uninverting the index from disk is slow. Even though you may need the values for only a few docs for the current request, you will probably need access to the values for other docs on the next request, so it makes sense to load all the values into memory at once, and to keep them there.
Fielddata is used in several places in Elasticsearch:
Sorting on a field
Aggregations on a field
Certain filters (for example, geolocation filters)
Scripts that refer to fields
Clearly, this can consume a lot of memory, especially for high-cardinality string fields—string fields that have many unique values—like the body of an email. Fortunately, insufficient memory is a problem that can be solved by horizontal scaling, by adding more nodes to your cluster.
For now, all you need to know is what fielddata is, and to be aware that it can be memory hungry. Later, we will show you how to determine the amount of memory that fielddata is using, how to limit the amount of memory that is available to it, and how to preload fielddata to improve the user experience.