Queries are seldom simple one-clause match queries. We frequently need to search for the same or different query strings in one or more fields, which means that we need to be able to combine multiple query clauses and their relevance scores in a way that makes sense.
Perhaps we’re looking for a book called War and Peace by an author called Leo Tolstoy. Perhaps we’re searching the Elasticsearch documentation for “minimum should match,” which might be in the title or the body of a page. Or perhaps we’re searching for users with first name John and last name Smith.
In this chapter, we present the available tools for constructing multiclause searches and explain how to figure out which solution you should apply to your particular use case.
The simplest multifield query to deal with is the one where we can map search terms to specific fields. If we know that War and Peace is the title, and Leo Tolstoy is the author, it is easy to write each of these conditions as a match clause and to combine them with a bool query:
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "War and Peace" }},
        { "match": { "author": "Leo Tolstoy" }}
      ]
    }
  }
}
The bool query takes a more-matches-is-better approach, so the score from each match clause will be added together to provide the final _score for each document. Documents that match both clauses will score higher than documents that match just one clause.
Of course, you’re not restricted to using just match clauses: the bool query can wrap any other query type, including other bool queries. We could add a clause to specify that we prefer to see versions of the book that have been translated by specific translators:
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "War and Peace" }},
        { "match": { "author": "Leo Tolstoy" }},
        { "bool": {
            "should": [
              { "match": { "translator": "Constance Garnett" }},
              { "match": { "translator": "Louise Maude" }}
            ]
        }}
      ]
    }
  }
}
Why did we put the translator clauses inside a separate bool query? All four match queries are should clauses, so why didn’t we just put the translator clauses at the same level as the title and author clauses?
The answer lies in how the score is calculated. The bool query runs each match query, adds their scores together, then multiplies by the number of matching clauses, and divides by the total number of clauses. Each clause at the same level has the same weight. In the preceding query, the bool query containing the translator clauses counts for one-third of the total score. If we had put the translator clauses at the same level as title and author, they would have reduced the contribution of the title and author clauses to one-quarter each.
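The calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as the text states it, not Elasticsearch's implementation; the function name and the per-clause scores are made up for the example.

```python
def bool_should_score(clause_scores):
    """Combine should-clause scores the way the text describes.

    clause_scores has one entry per clause: a float when the clause
    matched, or None when it did not match.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    # Add the matching scores, multiply by the number of matching
    # clauses, and divide by the total number of clauses.
    return sum(matched) * len(matched) / len(clause_scores)


# Hypothetical per-clause scores, chosen only for illustration.
both = bool_should_score([0.5, 0.5])       # both clauses match
only_one = bool_should_score([0.5, None])  # only one clause matches
print(both, only_one)
```

A document matching both clauses gets the full sum, while a document matching one of two clauses keeps only half of its single clause score.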
It is likely that an even one-third split between clauses is not what we need for the preceding query. Probably we’re more interested in the title and author clauses than we are in the translator clauses. We need to tune the query to make the title and author clauses relatively more important.
The simplest weapon in our tuning arsenal is the boost parameter. To increase the weight of the title and author fields, give them a boost value higher than 1:
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": {
            "title": {
              "query": "War and Peace",
              "boost": 2
        }}},
        { "match": {
            "author": {
              "query": "Leo Tolstoy",
              "boost": 2
        }}},
        { "bool": {
            "should": [
              { "match": { "translator": "Constance Garnett" }},
              { "match": { "translator": "Louise Maude" }}
            ]
        }}
      ]
    }
  }
}
The title and author clauses have a boost value of 2.
The nested bool clause has the default boost of 1.
The “best” value for the boost parameter is most easily determined by trial and error: set a boost value, run test queries, repeat. A reasonable range for boost lies between 1 and 10, maybe 15. Boosts higher than that have little additional impact because scores are normalized.
The bool query is the mainstay of multiclause queries. It works well for many cases, especially when you are able to map different query strings to individual fields.
The problem is that, these days, users expect to be able to type all of their search terms into a single field, and expect that the application will figure out how to give them the right results. It is ironic that the multifield search form is known as Advanced Search—it may appear advanced to the user, but it is much simpler to implement.
There is no simple one-size-fits-all approach to multiword, multifield queries. To get the best results, you have to know your data and know how to use the appropriate tools.
When your only user input is a single query string, you will encounter three scenarios frequently:
When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other. Documents should have as many words as possible in the same field, and the score should come from the best-matching field.
A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.
The main field may contain words in their stemmed form, synonyms, and words stripped of their diacritics, or accents. It is used to match as many documents as possible.
The same text could then be indexed in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with accents, and a third might use shingles to provide information about word proximity.
These other fields act as signals to increase the relevance score of each matching document. The more fields that match, the better.
For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:
Person: first_name and last_name
Book: title, author, and description
Address: street, city, country, and postcode
In this case, we want to find as many words as possible in any of the listed fields. We need to search across multiple fields as if they were one big field.
All of these are multiword, multifield queries, but each requires a different strategy. We will examine each strategy in turn in the rest of this chapter.
Imagine that we have a website that allows users to search blog posts, such as these two documents:
PUT /my_index/my_type/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}
PUT /my_index/my_type/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}
The user types in the words “Brown fox” and clicks Search. We don’t know ahead of time if the user’s search terms will be found in the title or the body field of the post, but it is likely that the user is searching for related words. To our eyes, document 2 appears to be the better match, as it contains both words that we are looking for.
Now we run the following bool query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
And we find that this query gives document 1 the higher score:
{
  "hits": [
    {
      "_id": "1",
      "_score": 0.14809652,
      "_source": {
        "title": "Quick brown rabbits",
        "body": "Brown rabbits are commonly seen."
      }
    },
    {
      "_id": "2",
      "_score": 0.09256032,
      "_source": {
        "title": "Keeping pets healthy",
        "body": "My quick brown fox eats rabbits on a regular basis."
      }
    }
  ]
}
To understand why, think about how the bool query calculates its score:
It runs both of the queries in the should clause.
It adds their scores together.
It multiplies the total by the number of matching clauses.
It divides the result by the total number of clauses (two).
Document 1 contains the word brown in both fields, so both match clauses are successful and have a score. Document 2 contains both brown and fox in the body field but neither word in the title field. The high score from the body query is added to the zero score from the title query, and multiplied by one-half, resulting in a lower overall score than for document 1.
In this example, the title and body fields are competing with each other. We want to find the single best-matching field.
What if, instead of combining the scores from each field, we used the score from the best-matching field as the overall score for the query? This would give preference to a single field that contains both of the words we are looking for, rather than the same word repeated in different fields.
Instead of the bool query, we can use the dis_max or Disjunction Max Query. Disjunction means or (while conjunction means and), so the Disjunction Max Query simply means return documents that match any of these queries, and return the score of the best-matching query:
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
This produces the results that we want:
{
  "hits": [
    {
      "_id": "2",
      "_score": 0.21509302,
      "_source": {
        "title": "Keeping pets healthy",
        "body": "My quick brown fox eats rabbits on a regular basis."
      }
    },
    {
      "_id": "1",
      "_score": 0.12713557,
      "_source": {
        "title": "Quick brown rabbits",
        "body": "Brown rabbits are commonly seen."
      }
    }
  ]
}
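To see why the two query types rank these documents differently, here is a small Python sketch that applies both combination rules to the same per-field scores. The clause scores are hypothetical values picked to mirror the situation (a weak match in both fields versus a strong match in one), and the functions only imitate the arithmetic described in the text.

```python
def bool_score(clause_scores):
    """Sum of matching scores, scaled by matching clauses / total clauses."""
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


def dis_max_score(clause_scores):
    """Score of the single best-matching clause."""
    matched = [s for s in clause_scores if s is not None]
    return max(matched, default=0.0)


# Hypothetical [title, body] clause scores; None means "no match".
doc1 = [0.25, 0.25]  # "brown" appears in both fields
doc2 = [None, 0.75]  # "brown" and "fox" both appear in the body only

print("bool:   ", bool_score(doc1), bool_score(doc2))
print("dis_max:", dis_max_score(doc1), dis_max_score(doc2))
```

With these numbers the bool rule prefers document 1, while the dis_max rule prefers document 2, which is the reordering shown in the two result lists.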
What would happen if the user had searched instead for “quick pets”? Both documents contain the word quick, but only document 2 contains the word pets. Neither document contains both words in the same field.
A simple dis_max query like the following would choose the single best-matching field, and ignore the other:
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}
{
  "hits": [
    {
      "_id": "1",
      "_score": 0.12713557,
      "_source": {
        "title": "Quick brown rabbits",
        "body": "Brown rabbits are commonly seen."
      }
    },
    {
      "_id": "2",
      "_score": 0.12713557,
      "_source": {
        "title": "Keeping pets healthy",
        "body": "My quick brown fox eats rabbits on a regular basis."
      }
    }
  ]
}
We would probably expect documents that match on both the title field and the body field to rank higher than documents that match on just one field, but this isn’t the case. Remember: the dis_max query simply uses the _score from the single best-matching clause.
It is possible, however, to also take the _score from the other matching clauses into account, by specifying the tie_breaker parameter:
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ],
      "tie_breaker": 0.3
    }
  }
}
This gives us the following results:
{
  "hits": [
    {
      "_id": "2",
      "_score": 0.14757764,
      "_source": {
        "title": "Keeping pets healthy",
        "body": "My quick brown fox eats rabbits on a regular basis."
      }
    },
    {
      "_id": "1",
      "_score": 0.124275915,
      "_source": {
        "title": "Quick brown rabbits",
        "body": "Brown rabbits are commonly seen."
      }
    }
  ]
}
The tie_breaker parameter makes the dis_max query behave more like a halfway house between dis_max and bool. It changes the score calculation as follows:
Take the _score of the best-matching clause.
Multiply the score of each of the other matching clauses by the tie_breaker.
Add them all together and normalize.
With the tie_breaker, all matching clauses count, but the best-matching clause counts most.
The tie_breaker can be a floating-point value between 0 and 1, where 0 uses just the best-matching clause and 1 counts all matching clauses equally. The exact value can be tuned based on your data and queries, but a reasonable value should be close to zero (for example, 0.1 to 0.4), so that it does not overwhelm the best-matching nature of dis_max.
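The effect of the tie_breaker can be sketched as a one-line formula in Python. This sketch covers only the first three steps listed above and deliberately leaves out the final normalization; the function name and the example scores are illustrative assumptions.

```python
def dis_max_with_tie_breaker(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores.

    clause_scores contains the scores of the matching clauses only.
    Normalization is intentionally omitted from this sketch.
    """
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


scores = [0.5, 0.25]  # hypothetical scores of two matching clauses
print(dis_max_with_tie_breaker(scores, 0.0))  # best clause only
print(dis_max_with_tie_breaker(scores, 0.5))  # others count for half
print(dis_max_with_tie_breaker(scores, 1.0))  # all clauses count equally
```

At 0 the result is the plain dis_max score, and at 1 it equals the simple sum of all matching clauses, which is why small values keep the best-matching field dominant.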
The multi_match query provides a convenient shorthand way of running the same query against multiple fields.
There are several types of multi_match query, three of which just happen to coincide with the three scenarios that we listed in “Know Your Data”: best_fields, most_fields, and cross_fields.
By default, this query runs as type best_fields, which means that it generates a match query for each field and wraps them in a dis_max query. This dis_max query
{
  "dis_max": {
    "queries": [
      {
        "match": {
          "title": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      {
        "match": {
          "body": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      }
    ],
    "tie_breaker": 0.3
  }
}
could be rewritten more concisely with multi_match as follows:
{
  "multi_match": {
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": [ "title", "body" ],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%"
  }
}
The best_fields type is the default and can be left out.
Parameters like minimum_should_match or operator are passed through to the generated match queries.
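The rewrite from multi_match to dis_max can be imitated with plain Python dictionaries. The helper below is a hypothetical illustration of the expansion described above (it is not part of any Elasticsearch client), and it handles only the parameters mentioned in the text.

```python
def expand_best_fields(multi_match):
    """Expand a best_fields multi_match body into the equivalent dis_max body."""
    # Parameters that the text says are passed down to each match query.
    passthrough = {
        key: multi_match[key]
        for key in ("minimum_should_match", "operator")
        if key in multi_match
    }
    queries = [
        {"match": {field: {"query": multi_match["query"], **passthrough}}}
        for field in multi_match["fields"]
    ]
    dis_max = {"queries": queries}
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}


expanded = expand_best_fields({
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": ["title", "body"],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%",
})
print(expanded)
```

The resulting dictionary has the same shape as the longer dis_max query shown earlier: one match clause per field, each carrying minimum_should_match, with the tie_breaker kept on the outer query.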
Field names can be specified with wildcards: any field that matches the wildcard pattern will be included in the search. You could match on the book_title, chapter_title, and section_title fields, with the following:
{
  "multi_match": {
    "query": "Quick brown fox",
    "fields": "*_title"
  }
}
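As a rough picture of how such a pattern selects fields, the following Python sketch filters a list of field names with the standard library's fnmatch module. This is an analogy only: fnmatch also understands ? and [...] patterns, which may not behave the same way in Elasticsearch, and the field list is a made-up example.

```python
from fnmatch import fnmatchcase


def expand_field_pattern(pattern, mapped_fields):
    """Return the mapped field names that match a shell-style wildcard pattern."""
    return [name for name in mapped_fields if fnmatchcase(name, pattern)]


# Hypothetical fields present in the mapping.
fields = ["book_title", "chapter_title", "section_title", "body"]
matched = expand_field_pattern("*_title", fields)
print(matched)
```

Only the three fields ending in _title are selected; body is left out of the search.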
Full-text search is a battle between recall—returning all the documents that are relevant—and precision—not returning irrelevant documents. The goal is to present the user with the most relevant documents on the first page of results.
To improve recall, we cast the net wide: we include not only documents that match the user’s search terms exactly, but also documents that we believe to be pertinent to the query. If a user searches for “quick brown fox,” a document that contains fast foxes may well be a reasonable result to return.
If the only pertinent document that we have is the one containing fast foxes, it will appear at the top of the results list. But of course, if we have 100 documents that contain the words quick brown fox, then the fast foxes document may be considered less relevant, and we would want to push it further down the list. After including many potential matches, we need to ensure that the best ones rise to the top.
A common technique for fine-tuning full-text relevance is to index the same text in multiple ways, each of which provides a different relevance signal. The main field would contain terms in their broadest-matching form to match as many documents as possible. For instance, we could do the following:
Use a stemmer to index jumps, jumping, and jumped as their root form: jump. Then it doesn’t matter if the user searches for jumped; we could still match documents containing jumping.
Include synonyms like jump, leap, and hop.
Remove diacritics, or accents: for example, ésta, está, and esta would all be indexed without accents as esta.
However, if we have two documents, one of which contains jumped and the other jumping, the user would probably expect the first document to rank higher, as it contains exactly what was typed in.
We can achieve this by indexing the same text in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about word proximity. These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the better.
A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points and is pushed up the results list.
We discuss synonyms, word proximity, partial-matching and other potential signals later in the book, but we will use the simple example of stemmed and unstemmed fields to illustrate this technique.
The first thing to do is to set up our field to be indexed twice: once in a stemmed form and once in an unstemmed form. To do this, we will use multifields, which we introduced in “String Sorting and Multifields”:
DELETE /my_index
PUT /my_index
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english",
          "fields": {
            "std": {
              "type": "string",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
The title field is stemmed by the english analyzer.
The title.std field uses the standard analyzer and so is not stemmed.
Next we index some documents:
PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }
PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }
Here is a simple match query on the title field for jumping rabbits:
GET /my_index/_search
{
  "query": {
    "match": {
      "title": "jumping rabbits"
    }
  }
}
This becomes a query for the two stemmed terms jump and rabbit, thanks to the english analyzer. The title field of both documents contains both of those terms, so both documents receive the same score:
{
  "hits": [
    {
      "_id": "1",
      "_score": 0.42039964,
      "_source": {
        "title": "My rabbit jumps"
      }
    },
    {
      "_id": "2",
      "_score": 0.42039964,
      "_source": {
        "title": "Jumping jack rabbits"
      }
    }
  ]
}
If we were to query just the title.std field, then only document 2 would match. However, if we were to query both fields and to combine their scores by using the bool query, then both documents would match (thanks to the title field) and document 2 would score higher (thanks to the title.std field):
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "jumping rabbits",
      "type": "most_fields",
      "fields": [ "title", "title.std" ]
    }
  }
}
We want to combine the scores from all matching fields, so we use the most_fields type. This causes the multi_match query to wrap the two field-clauses in a bool query instead of a dis_max query.
{
  "hits": [
    {
      "_id": "2",
      "_score": 0.8226396,
      "_source": {
        "title": "Jumping jack rabbits"
      }
    },
    {
      "_id": "1",
      "_score": 0.10741998,
      "_source": {
        "title": "My rabbit jumps"
      }
    }
  ]
}
We are using the broad-matching title field to include as many documents as possible (to increase recall), but we use the title.std field as a signal to push the most relevant results to the top.
The contribution of each field to the final score can be controlled by specifying custom boost values. For instance, we could boost the title field to make it the most important field, thus reducing the effect of any other signal fields:
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "jumping rabbits",
      "type": "most_fields",
      "fields": [ "title^10", "title.std" ]
    }
  }
}
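The caret syntax in the fields list pairs a field name with a boost. As a small illustration, the following Python helper (a hypothetical function written for this example, not an Elasticsearch API) splits such a specification into its two parts, falling back to the default boost of 1 when no caret is present.

```python
def parse_field_boost(spec):
    """Split a field spec such as "title^10" into (field_name, boost)."""
    name, caret, boost = spec.partition("^")
    # Without a caret, the field keeps the default boost of 1.
    return name, (float(boost) if caret else 1.0)


parsed = [parse_field_boost(spec) for spec in ["title^10", "title.std"]]
print(parsed)
```

Here the title field carries ten times the weight of title.std, so the unstemmed signal field still influences ranking but no longer dominates it.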
Now we come to a common pattern: cross-fields entity search. With entities like person, product, or address, the identifying information is spread across several fields. We may have a person indexed as follows:
{
  "first_name": "Peter",
  "last_name": "Smith"
}
Or an address like this:
{
  "street": "5 Poland Street",
  "city": "London",
  "country": "United Kingdom",
  "postcode": "W1V 3DG"
}
This sounds a lot like the example we described in “Multiple Query Strings”, but there is a big difference between these two scenarios. In “Multiple Query Strings”, we used a separate query string for each field. In this scenario, we want to search across multiple fields with a single query string.
Our user might search for the person “Peter Smith” or for the address “Poland Street W1V.” Each of those words appears in a different field, so using a dis_max / best_fields query to find the single best-matching field is clearly the wrong approach.
Really, we want to query each field in turn and add up the scores of every field that matches, which sounds like a job for the bool query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "street": "Poland Street W1V" }},
        { "match": { "city": "Poland Street W1V" }},
        { "match": { "country": "Poland Street W1V" }},
        { "match": { "postcode": "Poland Street W1V" }}
      ]
    }
  }
}
Repeating the query string for every field soon becomes tedious. We can use the multi_match query instead, and set the type to most_fields to tell it to combine the scores of all matching fields:
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}
The most_fields approach to entity search has some problems that are not immediately obvious:
It is designed to find the most fields matching any words, rather than to find the most matching words across all fields.
It can’t use the operator or minimum_should_match parameters to reduce the long tail of less-relevant results.
Term frequencies are different in each field and could interfere with each other to produce badly ordered results.
All three of the preceding problems stem from most_fields being field-centric rather than term-centric: it looks for the most matching fields, when really what we’re interested in is the most matching terms.
First we’ll look at why these problems exist, and then how we can combat them.
Think about how the most_fields query is executed: Elasticsearch generates a separate match query for each field and then wraps these match queries in an outer bool query.
We can see this by passing our query through the validate-query API:
GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}
which yields this explanation:
(street:poland street:street street:w1v)
(city:poland city:street city:w1v)
(country:poland country:street country:w1v)
(postcode:poland postcode:street postcode:w1v)
You can see that a document matching just the word poland in two fields could score higher than a document matching poland and street in one field.
In “Controlling Precision”, we talked about using the and operator or the minimum_should_match parameter to trim the long tail of almost irrelevant results. Perhaps we could try this:
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "operator": "and",
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}
However, with best_fields or most_fields, these parameters are passed down to the generated match queries. The explanation for this query shows the following:
(+street:poland +street:street +street:w1v)
(+city:poland +city:street +city:w1v)
(+country:poland +country:street +country:w1v)
(+postcode:poland +postcode:street +postcode:w1v)
In other words, using the and operator means that all words must exist in the same field, which is clearly wrong! It is unlikely that any documents would match this query.
In “What Is Relevance?”, we explained that the default similarity algorithm used to calculate the relevance score for each term is TF/IDF:
The more often a term appears in a field in a single document, the more relevant the document.
The more often a term appears in a field in all documents in the index, the less relevant is that term.
When searching against multiple fields, TF/IDF can introduce some surprising results.
Consider our example of searching for “Peter Smith” using the first_name and last_name fields. Peter is a common first name and Smith is a common last name, so both will have low IDFs. But what if we have another person in the index whose name is Smith Williams? Smith as a first name is very uncommon and so will have a high IDF!
A simple query like the following may well return Smith Williams above Peter Smith in spite of the fact that the second person is a better match than the first.
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "most_fields",
      "fields": [ "*_name" ]
    }
  }
}
The high IDF of smith in the first name field can overwhelm the two low IDFs of peter as a first name and smith as a last name.
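To put rough numbers on this, the sketch below computes inverse document frequencies with the formula used by Lucene's classic TF/IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The index size and the document frequencies are invented purely for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get higher values."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 10_000  # hypothetical number of person documents

# Hypothetical document frequencies for each (field, term) pair.
idf_peter_first = classic_idf(500, NUM_DOCS)  # "peter" is a common first name
idf_smith_last = classic_idf(800, NUM_DOCS)   # "smith" is a common last name
idf_smith_first = classic_idf(2, NUM_DOCS)    # "smith" is a rare first name

print(idf_peter_first, idf_smith_last, idf_smith_first)
```

Because smith is so rare in first_name, its IDF in that field is far larger than either of the other two values, which is how Smith Williams can end up ahead of Peter Smith.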
These problems only exist because we are dealing with multiple fields. If we were to combine all of these fields into a single field, the problems would vanish. We could achieve this by adding a full_name field to our person document:
{
  "first_name": "Peter",
  "last_name": "Smith",
  "full_name": "Peter Smith"
}
When querying just the full_name field:
Documents with more matching words would trump documents with the same word repeated.
The minimum_should_match and operator parameters would function as expected.
The inverse document frequencies for first and last names would be combined so it wouldn’t matter whether Smith were a first or last name anymore.
While this would work, we don’t like having to store redundant data. Instead, Elasticsearch offers us two solutions—one at index time and one at search time—which we discuss next.
In “Metadata: _all Field”, we explained that the special _all field indexes the values from all other fields as one big string. Having all fields indexed into one field is not terribly flexible, though. It would be nice to have one custom _all field for the person’s name, and another custom _all field for the address.
Elasticsearch provides us with this functionality via the copy_to parameter in a field mapping:
PUT /my_index
{
  "mappings": {
    "person": {
      "properties": {
        "first_name": {
          "type": "string",
          "copy_to": "full_name"
        },
        "last_name": {
          "type": "string",
          "copy_to": "full_name"
        },
        "full_name": {
          "type": "string"
        }
      }
    }
  }
}
With this mapping in place, we can query the first_name field for first names, the last_name field for last names, or the full_name field for first and last names.
The mappings of the first_name and last_name fields have no bearing on how the full_name field is indexed. The full_name field copies the string values from the other two fields, then indexes them according to the mapping of the full_name field only.
The custom _all approach is a good solution, as long as you thought about setting it up before you indexed your documents. However, Elasticsearch also provides a search-time solution to the problem: the multi_match query with type cross_fields.
The cross_fields type takes a term-centric approach, quite different from the field-centric approach taken by best_fields and most_fields. It treats all of the fields as one big field, and looks for each term in any field.
To illustrate the difference between field-centric and term-centric queries, look at the explanation for this field-centric most_fields query:
GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "peter smith",
      "type": "most_fields",
      "operator": "and",
      "fields": [ "first_name", "last_name" ]
    }
  }
}
For a document to match, both peter and smith must appear in the same field, either the first_name field or the last_name field:
(+first_name:peter +first_name:smith) (+last_name:peter +last_name:smith)
A term-centric approach would use this logic instead:
+(first_name:peter last_name:peter) +(first_name:smith last_name:smith)
In other words, the term peter must appear in either field, and the term smith must appear in either field.
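The two shapes differ only in which loop is on the outside: fields or terms. The following Python sketch builds both query strings from the same inputs; the function names are made up for this illustration, and the output format simply mimics the explanations shown above.

```python
def field_centric(terms, fields):
    """One group per field; every term is required within that field."""
    groups = ("(" + " ".join(f"+{field}:{term}" for term in terms) + ")"
              for field in fields)
    return " ".join(groups)


def term_centric(terms, fields):
    """One required group per term; the term may appear in any field."""
    groups = ("+(" + " ".join(f"{field}:{term}" for field in fields) + ")"
              for term in terms)
    return " ".join(groups)


terms = ["peter", "smith"]
fields = ["first_name", "last_name"]
print(field_centric(terms, fields))
print(term_centric(terms, fields))
```

Swapping the nesting order is enough to turn "all terms in one field" into "each term in some field".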
The cross_fields type first analyzes the query string to produce a list of terms, and then it searches for each term in any field. That difference alone solves two of the three problems that we listed in “Field-Centric Queries”, leaving us just with the issue of differing inverse document frequencies.
Fortunately, the cross_fields type solves this too, as can be seen from this validate-query request:
GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "peter smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": [ "first_name", "last_name" ]
    }
  }
}
It solves the term-frequency problem by blending inverse document frequencies across fields:
+blended("peter", fields: [first_name, last_name]) +blended("smith", fields: [first_name, last_name])
In other words, it looks up the IDF of smith in both the first_name and the last_name fields and uses the minimum of the two as the IDF for both fields. The fact that smith is a common last name means that it will be treated as a common first name too.
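A minimal Python sketch of this blending step follows. It takes the per-field document frequencies of a term, computes each field's IDF with the classic formula idf = 1 + ln(numDocs / (docFreq + 1)), and keeps the minimum, as described above. The counts are hypothetical and the function names are not Elasticsearch APIs.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get higher values."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def blended_idf(doc_freq_by_field, num_docs):
    """Use the lowest per-field IDF of a term for every blended field."""
    return min(classic_idf(df, num_docs) for df in doc_freq_by_field.values())


NUM_DOCS = 10_000  # hypothetical index size
smith_doc_freqs = {"first_name": 2, "last_name": 800}  # hypothetical counts

blended = blended_idf(smith_doc_freqs, NUM_DOCS)
print(blended)
```

The rare first_name occurrence no longer earns an outsized weight, because both fields now share the IDF of the field where smith is common.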
For the cross_fields query type to work optimally, all fields should have the same analyzer. Fields that share an analyzer are grouped together as blended fields.
If you include fields with a different analysis chain, they will be added to the query in the same way as for best_fields. For instance, if we added the title field to the preceding query (assuming it uses a different analyzer), the explanation would be as follows:
(+title:peter +title:smith)
(
  +blended("peter", fields: [first_name, last_name])
  +blended("smith", fields: [first_name, last_name])
)
This is particularly important when using the minimum_should_match and operator parameters.
One of the advantages of using the cross_fields query over custom _all fields is that you can boost individual fields at query time.
For fields of equal value like first_name and last_name, this generally isn’t required, but if you were searching for books using the title and description fields, you might want to give more weight to the title field. This can be done as described before with the caret (^) syntax:
GET /books/_search
{
  "query": {
    "multi_match": {
      "query": "peter smith",
      "type": "cross_fields",
      "fields": [ "title^2", "description" ]
    }
  }
}
The advantage of being able to boost individual fields should be weighed against the cost of querying multiple fields instead of querying a single custom _all field. Use whichever of the two solutions delivers the most bang for your buck.
The final topic that we should touch on before leaving multifield queries is that of exact-value not_analyzed fields. It is not useful to mix not_analyzed fields with analyzed fields in multi_match queries.
The reason for this can be demonstrated easily by looking at a query explanation. Imagine that we have set the title field to be not_analyzed:
GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "peter smith",
      "type": "cross_fields",
      "fields": [ "title", "first_name", "last_name" ]
    }
  }
}
Because the title field is not analyzed, it searches that field for a single term consisting of the whole query string!
title:peter smith
(
  blended("peter", fields: [first_name, last_name])
  blended("smith", fields: [first_name, last_name])
)
That term clearly does not exist in the inverted index of the title field, and can never be found. Avoid using not_analyzed fields in multi_match queries.
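The difference comes down to what each field does with the query string before looking it up. In the Python sketch below, analysis is approximated by lowercasing and splitting on whitespace, which is a deliberate simplification of a real analyzer; the function name is hypothetical.

```python
def query_terms(query_string, analyzed=True):
    """Return the terms that would be looked up in a field's inverted index."""
    if analyzed:
        # Rough stand-in for an analyzer: lowercase, split on whitespace.
        return query_string.lower().split()
    # A not_analyzed field treats the entire string as one exact term.
    return [query_string]


analyzed_terms = query_terms("peter smith", analyzed=True)
exact_terms = query_terms("peter smith", analyzed=False)
print(analyzed_terms, exact_terms)
```

The analyzed fields are searched for two terms, while the not_analyzed field is searched for the single term "peter smith", which matches only if a document holds exactly that value.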