Now that we have covered the simple case of searching for structured data, it is time to explore full-text search: how to search within full-text fields in order to find the most relevant documents.
The two most important aspects of full-text search are as follows:
The ability to rank results by how relevant they are to the given query, whether relevance is calculated using TF/IDF (see “What Is Relevance?”), proximity to a geolocation, fuzzy similarity, or some other algorithm.
The process of converting a block of text into distinct, normalized tokens (see “Analysis and Analyzers”) in order to (a) create an inverted index and (b) query the inverted index.
As soon as we talk about either relevance or analysis, we are in the territory of queries, rather than filters.
While all queries perform some sort of relevance calculation, not all queries have an analysis phase. Besides specialized queries like the bool or function_score queries, which don't operate on text at all, textual queries can be broken down into two families:

Term-based queries

Queries like the term or fuzzy queries are low-level queries that have no analysis phase. They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains the term.

It is important to remember that the term query looks in the inverted index for the exact term only; it won't match any variants like foo or FOO. It doesn't matter how the term came to be in the index, just that it is. If you were to index ["Foo","Bar"] into an exact value not_analyzed field, or Foo Bar into an analyzed field with the whitespace analyzer, both would result in having the two terms Foo and Bar in the inverted index.
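The exact-match behavior of the term query can be sketched with a toy inverted index. This is a hypothetical Python model for illustration, not Elasticsearch internals:

```python
# Toy inverted index: term -> set of document IDs.
# Terms are stored exactly as they were produced at index time.
inverted_index = {
    "Foo": {1, 2},
    "Bar": {1},
}

def term_query(term):
    # Look up the exact term only: no analysis, no case folding.
    return inverted_index.get(term, set())

print(term_query("Foo"))  # {1, 2}
print(term_query("foo"))  # set() -- the lowercase variant is not found
```

However the terms Foo and Bar came to be in the index, only an exact lookup of those terms will find them.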
Full-text queries

Queries like the match or query_string queries are high-level queries that understand the mapping of a field:

- If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.
- If you query an exact value (not_analyzed) string field, they will treat the whole query string as a single term.
- But if you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each of these terms, and then combines their results to produce the final relevance score for each document.

We will discuss this process in more detail in the following chapters.
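The analyze-then-combine process can be sketched in Python. This is a deliberately simplified model: the `analyze` and `match_query` functions are illustrative stand-ins, and real analyzers and score combination are far more involved:

```python
def analyze(text):
    # Stand-in for an analyzer: strip punctuation, lowercase, split on whitespace.
    return [t.strip("!?.,").lower() for t in text.split()]

def match_query(index, query_string):
    """index maps doc_id -> list of analyzed terms in the field."""
    scores = {}
    for term in analyze(query_string):
        # Run one low-level lookup per term and combine the results;
        # here each matching term simply adds 1 to the document's score.
        for doc_id, doc_terms in index.items():
            if term in doc_terms:
                scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores

docs = {1: ["quick", "brown", "fox"], 2: ["lazy", "dog"]}
print(match_query(docs, "Quick DOG!"))  # {1: 1, 2: 1}
```

Note how the query string Quick DOG! is first reduced to the terms quick and dog before any lookup happens; that is the key difference from the low-level term query.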
You seldom need to use the term-based queries directly. Usually you want to query full text, not individual terms, and this is easier to do with the high-level full-text queries (which end up using term-based queries internally).
If you do find yourself wanting to use a query on an exact value not_analyzed field, think about whether you really want a query or a filter. Single-term queries usually represent binary yes/no questions and are almost always better expressed as a filter, so that they can benefit from filter caching:
GET /_search
{
    "query": {
        "filtered": {
            "filter": {
                "term": { "gender": "female" }
            }
        }
    }
}
The match query is the go-to query: the first query that you should reach for whenever you need to query any field. It is a high-level full-text query, meaning that it knows how to deal with both full-text fields and exact-value fields.

That said, the main use case for the match query is for full-text search. So let's take a look at how full-text search works with a simple example.

First, we'll create a new index and index some documents using the bulk API:
DELETE /my_index

PUT /my_index
{ "settings": { "number_of_shards": 1 }}

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "The quick brown fox" }
{ "index": { "_id": 2 }}
{ "title": "The quick brown fox jumps over the lazy dog" }
{ "index": { "_id": 3 }}
{ "title": "The quick brown fox jumps over the quick dog" }
{ "index": { "_id": 4 }}
{ "title": "Brown fox brown dog" }
Delete the index in case it already exists.
Later, in “Relevance Is Broken!”, we explain why we created this index with only one primary shard.
Our first example explains what happens when we use the match query to search within a full-text field for a single word:
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}
Elasticsearch executes the preceding match query as follows:

1. Check the field type. The title field is a full-text (analyzed) string field, which means that the query string should be analyzed too.

2. Analyze the query string. The query string QUICK! is passed through the standard analyzer, which results in the single term quick. Because we have just a single term, the match query can be executed as a single low-level term query.

3. Find matching docs. The term query looks up quick in the inverted index and retrieves the list of documents that contain that term: in this case, documents 1, 2, and 3.

4. Score each doc. The term query calculates the relevance _score for each matching document, by combining the term frequency (how often quick appears in the title field of each document), with the inverse document frequency (how often quick appears in the title field in all documents in the index), and the length of each field (shorter fields are considered more relevant). See "What Is Relevance?".
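The scoring step can be sketched with a simplified TF/IDF-style formula. This is not Lucene's actual similarity math (which is covered in "What Is Relevance?"); it only illustrates how term frequency, inverse document frequency, and field length interact:

```python
import math

def score(term, doc_terms, all_docs):
    # Simplified TF/IDF-style score, not Lucene's real formula.
    tf = doc_terms.count(term)                    # term frequency in this field
    df = sum(1 for d in all_docs if term in d)    # docs containing the term
    idf = math.log(len(all_docs) / (1 + df)) + 1  # rarer term -> higher weight
    norm = 1 / math.sqrt(len(doc_terms))          # shorter field -> higher score
    return tf * idf * norm

docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
]
# The shorter title scores higher for "quick":
print(score("quick", docs[0], docs) > score("quick", docs[1], docs))  # True
```

This matches what we see in the results below: document 1, with the shortest title, gets the highest score for quick.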
This process gives us the following (abbreviated) results:
"hits": [
   {
      "_id": "1",
      "_score": 0.5,
      "_source": { "title": "The quick brown fox" }
   },
   {
      "_id": "3",
      "_score": 0.44194174,
      "_source": { "title": "The quick brown fox jumps over the quick dog" }
   },
   {
      "_id": "2",
      "_score": 0.3125,
      "_source": { "title": "The quick brown fox jumps over the lazy dog" }
   }
]
If we could search for only one word at a time, full-text search would be pretty inflexible. Fortunately, the match query makes multiword queries just as simple:
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "BROWN DOG!"
        }
    }
}
The preceding query returns all four documents in the results list:
{
  "hits": [
     {
        "_id": "4",
        "_score": 0.73185337,
        "_source": { "title": "Brown fox brown dog" }
     },
     {
        "_id": "2",
        "_score": 0.47486103,
        "_source": { "title": "The quick brown fox jumps over the lazy dog" }
     },
     {
        "_id": "3",
        "_score": 0.47486103,
        "_source": { "title": "The quick brown fox jumps over the quick dog" }
     },
     {
        "_id": "1",
        "_score": 0.11914785,
        "_source": { "title": "The quick brown fox" }
     }
  ]
}
Document 4 is the most relevant because it contains "brown" twice and "dog" once.

Documents 2 and 3 both contain brown and dog once each, and the title field is the same length in both docs, so they have the same score.

Document 1 matches even though it contains only brown, not dog.

Because the match query has to look for two terms (["brown","dog"]), internally it has to execute two term queries and combine their individual results into the overall result. To do this, it wraps the two term queries in a bool query, which we examine in detail in "Combining Queries".

The important thing to take away from this is that any document whose title field contains at least one of the specified terms will match the query. The more terms that match, the more relevant the document.
Matching any document that contains any of the query terms may result in a
long tail of seemingly irrelevant results. It’s a shotgun approach to search.
Perhaps we want to show only documents that contain all of the query terms.
In other words, instead of brown OR dog, we want to return only documents that match brown AND dog.

The match query accepts an operator parameter that defaults to or. You can change it to and to require that all specified terms must match:
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }
}
The structure of the match query has to change slightly in order to accommodate the operator parameter. This query would exclude document 1, which contains only one of the two terms.
The choice between all and any is a bit too black-or-white. What if the user specified five query terms, and a document contains only four of them? Setting operator to and would exclude this document.

Sometimes that is exactly what you want, but for most full-text search use cases, you want to include documents that may be relevant but exclude those that are unlikely to be relevant. In other words, we need something in-between.

The match query supports the minimum_should_match parameter, which allows you to specify the number of terms that must match for a document to be considered relevant. While you can specify an absolute number of terms, it usually makes sense to specify a percentage instead, as you have no control over the number of words the user may enter:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":                "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}
When specified as a percentage, minimum_should_match does the right thing: in the preceding example with three terms, 75% would be rounded down to 66.6%, or two out of the three terms. No matter what you set it to, at least one term must match for a document to be considered a match.
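The percentage rounding can be sketched as follows. This is a simplified model covering only positive integers and positive percentages; the real parameter supports richer specifications (see the reference documentation):

```python
import math

def min_should_match(num_terms, spec):
    # Sketch of how a percentage minimum_should_match is interpreted
    # (positive integers and positive percentages only).
    if isinstance(spec, str) and spec.endswith("%"):
        pct = int(spec[:-1])
        required = math.floor(num_terms * pct / 100)  # round down
    else:
        required = int(spec)
    return max(1, required)  # at least one term must always match

print(min_should_match(3, "75%"))  # 2  (75% of 3 terms = 2.25, rounded down)
print(min_should_match(5, "75%"))  # 3
```

Note the final `max(1, ...)`: even a tiny percentage never reduces the requirement below one matching term.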
The minimum_should_match parameter is flexible, and different rules can be applied depending on the number of terms the user enters. For the full documentation see the minimum_should_match reference documentation.
To fully understand how the match query handles multiword queries, we need to look at how to combine multiple queries with the bool query.

In "Combining Filters" we discussed how to use the bool filter to combine multiple filter clauses with and, or, and not logic. In query land, the bool query does a similar job but with one important difference.
Filters make a binary decision: should this document be included in the results list or not? Queries, however, are more subtle. They decide not only whether to include a document, but also how relevant that document is.
Like the filter equivalent, the bool query accepts multiple query clauses under the must, must_not, and should parameters. For instance:
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "quick" }},
      "must_not": { "match": { "title": "lazy"  }},
      "should": [
                  { "match": { "title": "brown" }},
                  { "match": { "title": "dog"   }}
      ]
    }
  }
}
The results from the preceding query include any document whose title field contains the term quick, except for those that also contain lazy. So far, this is pretty similar to how the bool filter works.

The difference comes in with the two should clauses, which say that a document is not required to contain either brown or dog, but if it does, then it should be considered more relevant:
{
  "hits": [
     {
        "_id": "3",
        "_score": 0.70134366,
        "_source": { "title": "The quick brown fox jumps over the quick dog" }
     },
     {
        "_id": "1",
        "_score": 0.3312608,
        "_source": { "title": "The quick brown fox" }
     }
  ]
}
The bool query calculates the relevance _score for each document by adding together the _score from all of the matching must and should clauses, and then dividing by the total number of must and should clauses.

The must_not clauses do not affect the score; their only purpose is to exclude documents that might otherwise have been included.

All the must clauses must match, and all the must_not clauses must not match, but how many should clauses should match? By default, none of the should clauses are required to match, with one exception: if there are no must clauses, then at least one should clause must match.
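The matching and scoring rules just described can be sketched in Python. This is a toy model, not Lucene's real scoring math: here each matching clause contributes a score of 1, and the total is divided by the number of must plus should clauses:

```python
def bool_query(doc_terms, must=(), must_not=(), should=()):
    # Sketch of bool query semantics; clauses are plain terms here.
    if any(t not in doc_terms for t in must):
        return None                 # every must clause has to match
    if any(t in doc_terms for t in must_not):
        return None                 # must_not excludes, and never scores
    matched = len(must) + sum(1 for t in should if t in doc_terms)
    if not must and should and not any(t in doc_terms for t in should):
        return None                 # no must clauses: at least one should must match
    return matched / (len(must) + len(should))

title = ["the", "quick", "brown", "fox"]
# quick matches, lazy is absent, and one of the two should clauses (brown) matches:
print(bool_query(title, must=["quick"], must_not=["lazy"], should=["brown", "dog"]))
```

Raising the number of matching should clauses raises the score, while must_not clauses can only remove a document from the results.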
Just as we can control the precision of the match query, we can control how many should clauses need to match by using the minimum_should_match parameter, either as an absolute number or as a percentage:
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "brown" }},
        { "match": { "title": "fox"   }},
        { "match": { "title": "dog"   }}
      ],
      "minimum_should_match": 2
    }
  }
}
The results would include only documents whose title field contains "brown" AND "fox", "brown" AND "dog", or "fox" AND "dog". If a document contains all three, it would be considered more relevant than those that contain just two of the three.

By now, you have probably realized that multiword match queries simply wrap the generated term queries in a bool query. With the default or operator, each term query is added as a should clause, so at least one clause must match. These two queries are equivalent:
{
    "match": { "title": "brown fox" }
}

{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}
With the and operator, all the term queries are added as must clauses, so all clauses must match. These two queries are equivalent:
{
    "match": {
        "title": {
            "query":    "brown fox",
            "operator": "and"
        }
    }
}

{
  "bool": {
    "must": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}
And if the minimum_should_match parameter is specified, it is passed directly through to the bool query, making these two queries equivalent:
{
    "match": {
        "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "75%"
        }
    }
}

{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }},
      { "term": { "title": "quick" }}
    ],
    "minimum_should_match": 2
  }
}
Because there are only three clauses, the minimum_should_match value of 75% in the match query is rounded down to 2. At least two out of the three should clauses must match.

Of course, we would normally write these types of queries by using the match query, but understanding how the match query works internally lets you take control of the process when you need to. Some things can't be done with a single match query, such as giving more weight to some query terms than to others. We will look at an example of this in the next section.
Of course, the bool query isn't restricted to combining simple one-word match queries. It can combine any other query, including other bool queries. It is commonly used to fine-tune the relevance _score for each document by combining the scores from several distinct queries.

Imagine that we want to search for documents about "full-text search," but we want to give more weight to documents that also mention "Elasticsearch" or "Lucene." By more weight, we mean that documents mentioning "Elasticsearch" or "Lucene" will receive a higher relevance _score than those that don't, which means that they will appear higher in the list of results.
A simple bool query allows us to write this fairly complex logic as follows:

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": { "content": "Elasticsearch" }},
                { "match": { "content": "Lucene"        }}
            ]
        }
    }
}
The content field must contain all of the words full, text, and search. If the content field also contains Elasticsearch or Lucene, the document will receive a higher _score.

The more should clauses that match, the more relevant the document. So far, so good.

But what if we want to give more weight to the docs that contain Lucene and even more weight to the docs containing Elasticsearch?
We can control the relative weight of any query clause by specifying a boost value, which defaults to 1. A boost value greater than 1 increases the relative weight of that clause. So we could rewrite the preceding query as follows:
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2
                    }
                }}
            ]
        }
    }
}
The must clause uses the default boost of 1. The Elasticsearch clause is the most important, as it has the highest boost of 3. The Lucene clause, with a boost of 2, is more important than the default, but not as important as the Elasticsearch clause.
The boost parameter is used to increase the relative weight of a clause (with a boost greater than 1) or decrease the relative weight (with a boost between 0 and 1), but the increase or decrease is not linear. In other words, a boost of 2 does not result in double the _score.

Instead, the new _score is normalized after the boost is applied. Each type of query has its own normalization algorithm, and the details are beyond the scope of this book. Suffice to say that a higher boost value results in a higher _score.
If you are implementing your own scoring model not based on TF/IDF and you need more control over the boosting process, you can use the function_score query to manipulate a document's boost without the normalization step.
We present other ways of combining queries in the next chapter, Chapter 14. But first, let’s take a look at the other important feature of queries: text analysis.
Queries can find only terms that actually exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time, and to the query string at search time so that the terms in the query match the terms in the inverted index.
Although we say document, analyzers are determined per field. Each field can have a different analyzer, either by configuring a specific analyzer for that field or by falling back on the type, index, or node defaults. At index time, a field’s value is analyzed by using the configured or default analyzer for that field.
For instance, let's add a new field to my_index:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "english_title": {
                "type":     "string",
                "analyzer": "english"
            }
        }
    }
}
Now we can compare how values in the english_title field and the title field are analyzed at index time by using the analyze API to analyze the word Foxes:
GET /my_index/_analyze?field=my_type.title
Foxes

GET /my_index/_analyze?field=my_type.english_title
Foxes
Field title, which uses the default standard analyzer, will return the term foxes. Field english_title, which uses the english analyzer, will return the term fox.

This means that, were we to run a low-level term query for the exact term fox, the english_title field would match but the title field would not.
High-level queries like the match query understand field mappings and can apply the correct analyzer for each field being queried. We can see this in action with the validate-query API:
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes" }},
                { "match": { "english_title": "Foxes" }}
            ]
        }
    }
}
which returns this explanation:

(title:foxes english_title:fox)

The match query uses the appropriate analyzer for each field to ensure that it looks for each term in the correct format for that field.
While we can specify an analyzer at the field level, how do we determine which analyzer is used for a field if none is specified at the field level?
Analyzers can be specified at several levels. Elasticsearch works through each level until it finds an analyzer that it can use. At index time, the order is as follows:
1. The analyzer defined in the field mapping, else
2. The analyzer defined in the _analyzer field of the document, else
3. The default analyzer for the type, which defaults to
4. The analyzer named default in the index settings, which defaults to
5. The analyzer named default at node level, which defaults to
6. The standard analyzer
At search time, the sequence is slightly different:

1. The analyzer defined in the query itself, else
2. The analyzer defined in the field mapping, else
3. The default analyzer for the type, which defaults to
4. The analyzer named default in the index settings, which defaults to
5. The analyzer named default at node level, which defaults to
6. The standard analyzer
Two items highlight the differences between the index time sequence and the search time sequence: the _analyzer field of the document at index time, and the analyzer parameter in the query at search time. The _analyzer field allows you to specify a default analyzer for each document (for example, english, french, spanish), while the analyzer parameter in the query specifies which analyzer to use on the query string. However, this is not the best way to handle multiple languages in a single index because of the pitfalls highlighted in Part III.
Occasionally, it makes sense to use a different analyzer at index and search time. For instance, at index time we may want to index synonyms (for example, for every occurrence of quick, we also index fast, rapid, and speedy). But at search time, we don't need to search for all of these synonyms. Instead we can just look up the single word that the user has entered, be it quick, fast, rapid, or speedy.

To enable this distinction, Elasticsearch also supports the index_analyzer and search_analyzer parameters, and analyzers named default_index and default_search.
Taking these extra parameters into account, the full sequence at index time really looks like this:
1. The index_analyzer defined in the field mapping, else
2. The analyzer defined in the field mapping, else
3. The analyzer defined in the _analyzer field of the document, else
4. The default index_analyzer for the type, which defaults to
5. The default analyzer for the type, which defaults to
6. The analyzer named default_index in the index settings, which defaults to
7. The analyzer named default in the index settings, which defaults to
8. The analyzer named default_index at node level, which defaults to
9. The analyzer named default at node level, which defaults to
10. The standard analyzer
And at search time:

1. The analyzer defined in the query itself, else
2. The search_analyzer defined in the field mapping, else
3. The analyzer defined in the field mapping, else
4. The default search_analyzer for the type, which defaults to
5. The default analyzer for the type, which defaults to
6. The analyzer named default_search in the index settings, which defaults to
7. The analyzer named default in the index settings, which defaults to
8. The analyzer named default_search at node level, which defaults to
9. The analyzer named default at node level, which defaults to
10. The standard analyzer
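Both sequences are simply "use the first analyzer that is actually configured." That fallback logic can be sketched in Python; the chain below is a hypothetical, abbreviated example for a field that configures only a field-level search_analyzer:

```python
def resolve_analyzer(chain):
    # Walk the candidates from most to least specific and return the
    # first one that is configured; None means "not configured here".
    for analyzer in chain:
        if analyzer is not None:
            return analyzer
    return "standard"  # the final fallback

# Hypothetical search-time chain (abbreviated: type- and node-level
# entries are omitted for brevity):
chain = [
    None,       # analyzer in the query itself
    "english",  # search_analyzer in the field mapping
    None,       # analyzer in the field mapping
    None,       # default_search in the index settings
    None,       # default in the index settings
]
print(resolve_analyzer(chain))  # english
```

If nothing in the chain is configured, everything falls through to the standard analyzer.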
The sheer number of places where you can specify an analyzer is quite overwhelming. In practice, though, it is pretty simple.
The first thing to remember is that, even though you may start out using Elasticsearch for a single purpose or a single application such as logging, chances are that you will find more use cases and end up running several distinct applications on the same cluster. Each index needs to be independent and independently configurable. You don’t want to set defaults for one use case, only to have to override them for another use case later.
This rules out configuring analyzers at the node level. Additionally, configuring analyzers at the node level requires changing the config file on every node and restarting every node, which becomes a maintenance nightmare. It’s a much better idea to keep Elasticsearch running and to manage settings only via the API.
Most of the time, you will know what fields your documents will contain ahead of time. The simplest approach is to set the analyzer for each full-text field when you create your index or add type mappings. While this approach is slightly more verbose, it enables you to easily see which analyzer is being applied to each field.
Typically, most of your string fields will be exact-value not_analyzed fields such as tags or enums, plus a handful of full-text fields that will use some default analyzer like standard or english or some other language. Then you may have one or two fields that need custom analysis: perhaps the title field needs to be indexed in a way that supports find-as-you-type.

You can set the default analyzer in the index to the analyzer you want to use for almost all full-text fields, and just configure the specialized analyzer on the one or two fields that need it. If, in your model, you need a different default analyzer per type, then use the type-level analyzer setting instead.
A common workflow for time-based data like logging is to create a new index per day on the fly by just indexing into it. While this workflow prevents you from creating your index up front, you can still use index templates to specify the settings and mappings that a new index should have.
Before we move on to discussing more-complex queries in Chapter 14, let’s make a quick detour to explain why we created our test index with just one primary shard.
Every now and again a new user opens an issue claiming that sorting by relevance is broken and offering a short reproduction: the user indexes a few documents, runs a simple query, and finds apparently less-relevant results appearing above more-relevant results.
To understand why this happens, let's imagine that we create an index with two primary shards and we index ten documents, six of which contain the word foo. It may happen that shard 1 contains three of the foo documents and shard 2 contains the other three. In other words, our documents are well distributed.
In “What Is Relevance?”, we described the default similarity algorithm used in Elasticsearch, called term frequency / inverse document frequency or TF/IDF. Term frequency counts the number of times a term appears within the field we are querying in the current document. The more times it appears, the more relevant is this document. The inverse document frequency takes into account how often a term appears as a percentage of all the documents in the index. The more frequently the term appears, the less weight it has.
However, for performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the foo documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term foo is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.
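The skew is easy to see with a simplified IDF formula (illustrative only, not Lucene's exact math). Ten documents split evenly across two shards, with five foo documents on shard 1 and one on shard 2:

```python
import math

def idf(docs_with_term, total_docs):
    # Simplified inverse document frequency: rarer terms weigh more.
    return math.log(total_docs / (1 + docs_with_term)) + 1

# Each shard holds 5 of the 10 documents and computes only a local IDF.
shard1_local = idf(docs_with_term=5, total_docs=5)   # "foo" looks very common
shard2_local = idf(docs_with_term=1, total_docs=5)   # "foo" looks rare
global_idf   = idf(docs_with_term=6, total_docs=10)  # the "true" index-wide value

# The same term is weighted differently depending on which shard scores it:
print(shard1_local < global_idf < shard2_local)  # True
```

A document on shard 2 therefore gets an inflated score for foo relative to an identical document on shard 1, which is exactly the "broken" ordering that surprises new users.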
In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
For testing purposes, there are two ways we can work around this issue. The first is to create an index with one primary shard, as we did in the section introducing the match query. If you have only one shard, then the local IDF is the global IDF.

The second workaround is to add ?search_type=dfs_query_then_fetch to your search requests. The dfs stands for Distributed Frequency Search, and it tells Elasticsearch to first retrieve the local IDF from each shard in order to calculate the global IDF across the whole index.
Do not use dfs_query_then_fetch in production. It really isn't required. Just having enough data will ensure that your term frequencies are well distributed. There is no reason to add this extra DFS step to every query that you run.