Standard full-text search with TF/IDF treats documents, or at least each field within a document, as a big bag of words. The match query can tell us whether that bag contains our search terms, but that is only part of the story. It can't tell us anything about the relationship between words.

Consider the difference between these sentences:

Sue ate the alligator.
The alligator ate Sue.
Sue never goes anywhere without her alligator-skin purse.

A match query for sue alligator would match all three documents, but it doesn't tell us whether the two words form part of the same idea, or even the same paragraph.
Understanding how words relate to each other is a complicated problem, and we can't solve it just by using another type of query, but we can at least find words that appear to be related because they appear near each other, or even right next to each other.

Each document may be much longer than the examples we have presented: Sue and alligator may be separated by paragraphs of other text. Perhaps we still want to return these documents in which the words are widely separated, but we want to give documents in which the words are close together a higher relevance score.

This is the province of phrase matching, or proximity matching.

In this chapter, we are using the same example documents that we used for the match query.
In the same way that the match query is the go-to query for standard full-text search, the match_phrase query is the one you should reach for when you want to find words that are near each other:
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other. A query for the phrase quick fox would not match any of our documents, because no document contains the word quick immediately followed by fox.
The match_phrase query can also be written as a match query with type phrase:

"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
When a string is analyzed, the analyzer returns not only a list of terms, but also the position, or order, of each term in the original string:
GET /_analyze?analyzer=standard
Quick brown fox
This returns the following:
{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}
Positions can be stored in the inverted index, and position-aware queries like the match_phrase query can use them to match only documents that contain all the words in exactly the order specified, with no words in-between.
For a document to be considered a match for the phrase "quick brown fox," the following must be true:

quick, brown, and fox must all appear in the field.
The position of brown must be 1 greater than the position of quick.
The position of fox must be 2 greater than the position of quick.
If any of these conditions is not met, the document is not considered a match.
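A minimal sketch of this positional check, assuming we already have the per-term position lists that the inverted index stores, might look like this (an illustration, not the actual Lucene implementation):

```python
def phrase_match(query_terms, doc_positions):
    """Return True if the document contains all the query terms at
    consecutive positions, in the order given.
    `doc_positions` maps each term to the sorted list of positions
    at which it occurs in the field."""
    for start in doc_positions.get(query_terms[0], []):
        # Every later term must sit exactly i positions after the first.
        if all(start + i in doc_positions.get(term, [])
               for i, term in enumerate(query_terms[1:], start=1)):
            return True
    return False

# Positions for a field containing "Quick brown fox":
index = {"quick": [1], "brown": [2], "fox": [3]}
phrase_match(["quick", "brown", "fox"], index)   # True
phrase_match(["quick", "fox"], index)            # False: not adjacent
```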
Internally, the match_phrase query uses the low-level span query family to do position-aware matching. Span queries are term-level queries, so they have no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the span queries directly, as the match_phrase query is usually good enough. However, certain specialized fields, like patent searches, use these low-level queries to perform very specific, carefully constructed positional searches.
Requiring exact-phrase matches may be too strict a constraint. Perhaps we do want documents that contain “quick brown fox” to be considered a match for the query “quick fox,” even though the positions aren’t exactly equivalent.
We can introduce a degree of flexibility into phrase matching by using the slop parameter:

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}
The slop parameter tells the match_phrase query how far apart terms are allowed to be while still considering the document a match. By how far apart, we mean: how many times do you need to move a term in order to make the query and the document match?

We'll start with a simple example. To make the query quick fox match a document containing quick brown fox, we need a slop of just 1:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      quick         fox
Slop 1:     quick                 ↳     fox
Although all words need to be present in phrase matching, even when using slop, the words don't necessarily need to be in the same sequence in order to match. With a high enough slop value, words can be arranged in any order.

To make the query fox quick match our document, we need a slop of 3:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      fox           quick
Slop 1:     fox|quick  ↵
Slop 2:     quick      ↳  fox
Slop 3:     quick                 ↳     fox
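For phrases whose terms each occur once in the document, the required slop can be sketched as the spread of per-term displacements: for each query term, take its document position minus its query position; the difference between the largest and smallest displacement is the number of moves needed. This is an illustrative model, not Lucene's actual sloppy-matching algorithm:

```python
def needed_slop(query_terms, doc_positions):
    """Estimate the slop required for `query_terms` to match a document,
    assuming each term occurs exactly once.
    `doc_positions` maps each term to its single position in the document."""
    displacements = [doc_positions[term] - i
                     for i, term in enumerate(query_terms)]
    return max(displacements) - min(displacements)

doc = {"quick": 1, "brown": 2, "fox": 3}
needed_slop(["quick", "fox"], doc)   # 1
needed_slop(["fox", "quick"], doc)   # 3
```

This reproduces the examples above: quick fox needs a slop of 1 against quick brown fox, while the reversed fox quick needs a slop of 3.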
A curious thing can happen when you try to use phrase matching on multivalue fields. Imagine that you index this document:
PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith" ]
}
Then run a phrase query for Abraham Lincoln:

GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
Surprisingly, our document matches, even though Abraham and Lincoln belong to two different people in the names array. The reason for this comes down to the way arrays are indexed in Elasticsearch.

When John Abraham is analyzed, it produces this:

Position 1: john
Position 2: abraham

Then when Lincoln Smith is analyzed, it produces this:

Position 3: lincoln
Position 4: smith

In other words, Elasticsearch produces exactly the same list of tokens as it would have for the single string John Abraham Lincoln Smith. Our example query looks for abraham directly followed by lincoln, and these two terms do indeed exist, and they are right next to each other, so the query matches.
Fortunately, there is a simple workaround for cases like these, called the position_offset_gap, which we need to configure in the field mapping:

DELETE /my_index/groups/

PUT /my_index/_mapping/groups
{
    "properties": {
        "names": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}

First delete the groups mapping and all documents of that type, then create a new groups mapping with the correct values.
The position_offset_gap setting tells Elasticsearch that it should increase the current term position by the specified value for every new array element. So now, when we index the array of names, the terms are emitted with the following positions:

Position 1: john
Position 2: abraham
Position 103: lincoln
Position 104: smith

Our phrase query would no longer match a document like this, because abraham and lincoln are now 100 positions apart. You would have to add a slop value of 100 in order for this document to match.
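The position bookkeeping can be sketched as follows, using a naive whitespace "analyzer" for illustration; the gap parameter plays the role of position_offset_gap:

```python
def index_positions(values, gap=100):
    """Assign term positions across a multivalue field, bumping the
    position counter by `gap` between array elements (as
    position_offset_gap does)."""
    positions, pos = {}, 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap                      # the gap between array elements
        for term in value.lower().split():  # naive whitespace analysis
            pos += 1
            positions[term] = pos
    return positions

index_positions(["John Abraham", "Lincoln Smith"], gap=0)
# {'john': 1, 'abraham': 2, 'lincoln': 3, 'smith': 4}
index_positions(["John Abraham", "Lincoln Smith"], gap=100)
# {'john': 1, 'abraham': 2, 'lincoln': 103, 'smith': 104}
```

With gap=0 the terms abraham and lincoln are adjacent and the phrase matches; with gap=100 they are 100 positions apart and it does not.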
Whereas a phrase query simply excludes documents that don't contain the exact query phrase, a proximity query (a phrase query where slop is greater than 0) incorporates the proximity of the query terms into the final relevance _score. By setting a high slop value like 50 or 100, you can exclude documents in which the words are really too far apart, but give a higher score to documents in which the words are closer together.

The following proximity query for quick dog matches both documents that contain the words quick and dog, but gives a higher score to the document in which the words are nearer to each other:
POST /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick dog",
                "slop":  50
            }
        }
    }
}
{
  "hits": [
     {
        "_id": "3",
        "_score": 0.75,
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id": "2",
        "_score": 0.28347334,
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
Although proximity queries are useful, the fact that they require all terms to be present can make them overly strict. It's the same issue that we discussed in "Controlling Precision" in Chapter 13: if six out of seven terms match, a document is probably relevant enough to be worth showing to the user, but the match_phrase query would exclude it.
Instead of using proximity matching as an absolute requirement, we can use it as a signal—as one of potentially many queries, each of which contributes to the overall score for each document (see “Most Fields”).
The fact that we want to add together the scores from multiple queries implies that we should combine them by using the bool query.

We can use a simple match query as a must clause. This is the query that will determine which documents are included in our result set. We can trim the long tail with the minimum_should_match parameter. Then we can add other, more specific queries as should clauses. Every one that matches will increase the relevance of the matching docs.
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": {
            "query": "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": {
          "title": {
            "query": "quick brown fox",
            "slop": 50
          }
        }
      }
    }
  }
}
The must clause includes or excludes documents from the result set. The should clause increases the relevance score of those documents that match.

We could, of course, include other queries in the should clause, where each query targets a specific aspect of relevance.
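The scoring behaviour of this pattern can be sketched in a few lines. This is a simplification of how a bool query combines clause scores (it ignores Lucene's normalization factors); the must clause gates inclusion, while each matching should clause simply adds its score as a signal:

```python
def bool_score(must_score, should_scores):
    """Simplified bool-query scoring: the must clause decides whether the
    document is in the result set at all; every should clause that
    matched adds its score as an extra relevance signal."""
    if must_score is None:          # the must clause did not match: excluded
        return None
    return must_score + sum(s for s in should_scores if s is not None)

# Document matches both the match query and the phrase query:
bool_score(1.0, [0.5])    # 1.5
# Document matches only the match query (the phrase was too strict):
bool_score(1.0, [None])   # 1.0
```

The second call shows why this is only a signal: a document that fails the phrase query is still returned, just without the extra boost.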
Phrase and proximity queries are more expensive than simple match queries. Whereas a match query just has to look up terms in the inverted index, a match_phrase query has to calculate and compare the positions of multiple, possibly repeated, terms.

The Lucene nightly benchmarks show that a simple term query is about 10 times as fast as a phrase query, and about 20 times as fast as a proximity query (a phrase query with slop). And of course, this cost is paid at search time instead of at index time.
Usually the extra cost of phrase queries is not as scary as these numbers suggest. Really, the difference in performance is a testimony to just how fast a simple term query is. Phrase queries on typical full-text data usually complete within a few milliseconds, and are perfectly usable in practice, even on a busy cluster.

In certain pathological cases, phrase queries can be costly, but this is unusual. An example of a pathological case is DNA sequencing, where there are many, many identical terms repeated in many positions. Using higher slop values in this case results in a huge growth in the number of position calculations.
So what can we do to limit the performance cost of phrase and proximity queries? One useful approach is to reduce the total number of documents that need to be examined by the phrase query.
In the preceding section, we discussed using proximity queries just for relevance purposes, not to include or exclude results from the result set. A query may match millions of results, but chances are that our users are interested in only the first few pages of results.
A simple match query will already have ranked documents that contain all search terms near the top of the list. Really, we just want to rerank the top results to give an extra relevance bump to those documents that also match the phrase query.
The search API supports exactly this functionality via rescoring. The rescore phase allows you to apply a more expensive scoring algorithm, like a phrase query, to just the top K results from each shard. These top results are then resorted according to their new scores.
The request looks like this:
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {
                "query": "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50,
        "query": {
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop": 50
                    }
                }
            }
        }
    }
}
The match query decides which results will be included in the final result set and ranks results according to TF/IDF. The window_size is the number of top results to rescore, per shard.
The only rescoring algorithm currently supported is another query, but there are plans to add more algorithms later.
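The mechanics of the rescore phase can be sketched as follows. This is a simplification: real rescoring happens per shard, and Elasticsearch combines the original and rescore scores according to a configurable mode, whereas here we simply add them:

```python
def rescore(hits, rescore_fn, window_size=50):
    """Apply a more expensive scoring function to only the top
    `window_size` hits, then re-sort that window by the combined score.
    `hits` is a list of (doc_id, score) sorted by score, descending."""
    window, rest = hits[:window_size], hits[window_size:]
    rescored = [(doc, score + rescore_fn(doc)) for doc, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest

hits = [("doc1", 1.0), ("doc2", 0.75), ("doc3", 0.5)]
phrase_scores = {"doc2": 0.5}   # only doc2 also matches the phrase query
rescore(hits, lambda d: phrase_scores.get(d, 0.0), window_size=2)
# [('doc2', 1.25), ('doc1', 1.0), ('doc3', 0.5)]
```

Note that doc3 falls outside the window and keeps its original score and position; only the top of the list pays for the expensive query.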
As useful as phrase and proximity queries can be, they still have a downside. They are overly strict: all terms must be present for a phrase query to match, even when using slop.
The flexibility in word ordering that you gain with slop also comes at a price, because you lose the association between word pairs. While you can identify documents in which sue, alligator, and ate occur close together, you can't tell whether Sue ate or the alligator ate.
When words are used in conjunction with each other, they express an idea that is bigger or more meaningful than each word in isolation. The two clauses I'm not happy I'm working and I'm happy I'm not working contain the same words, in close proximity, but have quite different meanings.
If, instead of indexing each word independently, we were to index pairs of words, then we could retain more of the context in which the words were used.
For the sentence Sue ate the alligator, we would not only index each word (or unigram) as a term

["sue", "ate", "the", "alligator"]

but also each word and its neighbor as single terms:

["sue ate", "ate the", "the alligator"]
These word pairs (or bigrams) are known as shingles.
Shingles are not restricted to being pairs of words; you could index word triplets (trigrams) as well:
["sue ate the", "ate the alligator"]
Trigrams give you a higher degree of precision, but greatly increase the number of unique terms in the index. Bigrams are sufficient for most use cases.
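Shingle production itself is straightforward; a sketch that mimics the shingle token filter with output_unigrams disabled:

```python
def shingles(tokens, min_size=2, max_size=2):
    """Emit word n-grams from a token stream: pairs (bigrams) by
    default, larger shingles if max_size is raised."""
    out = []
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` tokens across the stream.
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

shingles(["sue", "ate", "the", "alligator"])
# ['sue ate', 'ate the', 'the alligator']
shingles(["sue", "ate", "the", "alligator"], min_size=3, max_size=3)
# ['sue ate the', 'ate the alligator']
```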
Of course, shingles are useful only if the user enters the query in the same order as in the original document; a query for sue alligator would match the individual words but none of our shingles.
Fortunately, users tend to express themselves using constructs similar to those that appear in the data they are searching. But this point is an important one: it is not enough to index just bigrams; we still need unigrams, but we can use matching bigrams as a signal to increase the relevance score.
Shingles need to be created at index time as part of the analysis process. We could index both unigrams and bigrams into a single field, but it is cleaner to keep unigrams and bigrams in separate fields that can be queried independently. The unigram field would form the basis of our search, with the bigram field being used to boost relevance.
First, we need to create an analyzer that uses the shingle token filter:

DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": false
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter"
                    ]
                }
            }
        }
    }
}

The default min/max shingle size is 2, so we don't really need to set these. The shingle token filter outputs unigrams by default, but we want to keep unigrams and bigrams separate, so we disable that. The my_shingle_analyzer uses our custom my_shingle_filter token filter.
First, let's test that our analyzer is working as expected with the analyze API:

GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

Sure enough, we get back three terms:

sue ate
ate the
the alligator
Now we can proceed to setting up a field to use the new analyzer.
We said that it is cleaner to index unigrams and bigrams separately, so we will create the title field as a multifield (see "String Sorting and Multifields"):

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type": "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}
With this mapping, values from our JSON document in the field title will be indexed both as unigrams (title) and as bigrams (title.shingles), meaning that we can query these fields independently.
And finally, we can index our example documents:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
To understand the benefit that the shingles field adds, let's first look at the results from a simple match query for "The hungry alligator ate Sue":

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}
This query returns all three documents, but note that documents 1 and 2 have the same relevance score because they contain the same words:
{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3",
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
Both documents contain the, alligator, and ate, and so have the same score. We could have excluded document 3 by setting the minimum_should_match parameter. See "Controlling Precision".
Now let's add the shingles field into the query. Remember that we want matches on the shingles field to act as a signal, increasing the relevance score, so we still need to include the query on the main title field:

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}
We still match all three documents, but document 2 has now been bumped into first place because it matched the shingled term ate sue.
{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
Even though our query included the word hungry, which doesn't appear in any of our documents, we still managed to use word proximity to return the most relevant document first.
Not only are shingles more flexible than phrase queries, but they perform better as well. Instead of paying the price of a phrase query every time you search, queries for shingles are just as efficient as a simple match query. A small price is paid at index time, because more terms need to be indexed, which also means that fields with shingles use more disk space. However, most applications write once and read many times, so it makes sense to optimize for fast queries.
This is a theme that you will encounter frequently in Elasticsearch: it enables you to achieve a lot at search time, without requiring any up-front setup. Once you understand your requirements more clearly, you can achieve better results with better performance by modeling your data correctly at index time.