A keen observer will notice that all the queries so far in this book have operated on whole terms. To match something, the smallest unit had to be a single term. You can find only terms that exist in the inverted index.
But what happens if you want to match parts of a term but not the whole thing? Partial matching allows users to specify a portion of the term they are looking for and find any words that contain that fragment.
The requirement to match on part of a term is less common in the full-text search-engine world than you might think. If you have come from an SQL background, you likely have, at some stage of your career, implemented a poor man’s full-text search using SQL constructs like this:
WHERE
text
LIKE
"*quick*"
AND
text
LIKE
"*brown*"
AND
text
LIKE
"*fox*"
Of course, with Elasticsearch, we have the analysis process and the inverted index that remove the need for such brute-force techniques. To handle the case of matching both “fox” and “foxes,” we could simply use a stemmer to index words in their root form. There is no need to match partial terms.
That said, on some occasions partial matching can be useful. Common use cases include the following:
Matching postal codes, product serial numbers, or other not_analyzed
values
that start with a particular prefix or match a wildcard pattern
or even a regular expression
search-as-you-type—displaying the most likely results before the user has finished typing the search terms
Matching in languages like German or Dutch, which contain long compound words, like Weltgesundheitsorganisation (World Health Organization)
We will start by examining prefix matching on exact-value not_analyzed
fields.
We will use United Kingdom postcodes (postal codes in the United States) to illustrate how to use partial matching with
structured data. UK postcodes have a well-defined structure. For instance, the
postcode W1V 3DG
can be broken down as follows:
W1V
: This outer part identifies the postal area and district:
W
indicates the area (one or two letters)
1V
indicates the district (one or two numbers, possibly followed by a letter
3DG
: This inner part identifies a street or building:
3
indicates the sector (one number)
DG
indicates the unit (two letters)
Let’s assume that we are indexing postcodes as exact-value not_analyzed
fields, so we could create our index as follows:
PUT
/
my_index
{
"mappings"
:
{
"address"
:
{
"properties"
:
{
"postcode"
:
{
"type"
:
"string"
,
"index"
:
"not_analyzed"
}
}
}
}
}
PUT
/
my_index
/
address
/
1
{
"postcode"
:
"W1V 3DG"
}
PUT
/
my_index
/
address
/
2
{
"postcode"
:
"W2F 8HW"
}
PUT
/
my_index
/
address
/
3
{
"postcode"
:
"W1F 7HW"
}
PUT
/
my_index
/
address
/
4
{
"postcode"
:
"WC1N 1LZ"
}
PUT
/
my_index
/
address
/
5
{
"postcode"
:
"SW5 0BE"
}
Now our data is ready to be queried.
To find all postcodes beginning with W1
, we could use a simple prefix
query:
GET
/
my_index
/
address
/
_search
{
"query"
:
{
"prefix"
:
{
"postcode"
:
"W1"
}
}
}
The prefix
query is a low-level query that works at the term level. It
doesn’t analyze the query string before searching. It assumes that you have
passed it the exact prefix that you want to find.
By default, the prefix
query does no relevance scoring. It just finds
matching documents and gives them all a score of 1
. Really, it behaves more
like a filter than a query. The only practical difference between the
prefix
query and the prefix
filter is that the filter can be cached.
Previously, we said that “you can find only terms that exist in the inverted
index,” but we haven’t done anything special to index these postcodes; each
postcode is simply indexed as the exact value specified in each document. So
how does the prefix
query work?
Remember that the inverted index consists of a sorted list of unique terms (in this case, postcodes). For each term, it lists the IDs of the documents containing that term in the postings list. The inverted index for our example documents looks something like this:
Term: Doc IDs: ------------------------- "SW5 0BE" | 5 "W1F 7HW" | 3 "W1V 3DG" | 1 "W2F 8HW" | 2 "WC1N 1LZ" | 4 -------------------------
To support prefix matching on the fly, the query does the following:
Skips through the terms list to find the first term beginning with W1
.
Collects the associated document IDs.
Moves to the next term.
If that term also begins with W1
, the query repeats from step 2; otherwise, we’re finished.
While this works fine for our small example, imagine that our inverted index
contains a million postcodes beginning with W1
. The prefix query
would need to visit all one million terms in order to calculate the result!
And the shorter the prefix, the more terms need to be visited. If we were to
look for the prefix W
instead of W1
, perhaps we would match 10 million
terms instead of just one million.
prefix
query or filter are useful for ad hoc prefix matching, but
should be used with care. They can be used freely on fields with a small
number of terms, but they scale poorly and can put your cluster under a lot of
strain. Try to limit their impact on your cluster by using a long prefix;
this reduces the number of terms that need to be visited.
Later in this chapter, we present an alternative index-time solution that
makes prefix matching much more efficient. But first, we’ll take a look at
two related queries: the wildcard
and regexp
queries.
The wildcard
query is a low-level, term-based query similar in nature to the
prefix
query, but it allows you to specify a pattern instead of just a prefix.
It uses the standard shell wildcards: ?
matches any character, and *
matches zero or more characters.
This query would match the documents containing W1F 7HW
and W2F 8HW
:
GET
/
my_index
/
address
/
_search
{
"query"
:
{
"wildcard"
:
{
"postcode"
:
"W?F*HW"
}
}
}
Imagine now that you want to match all postcodes just in the W
area. A
prefix match would also include postcodes starting with WC
, and you would
have a similar problem with a wildcard match. We want to match only postcodes
that begin with a W
, followed by a number. The regexp
query allows you to
write these more complicated patterns:
GET
/
my_index
/
address
/
_search
{
"query"
:
{
"regexp"
:
{
"postcode"
:
"W[0-9].+"
}
}
}
The regular expression says that the term must begin with a W
, followed
by any number from 0 to 9, followed by one or more other characters.
The wildcard
and regexp
queries work in exactly the same way as the
prefix
query. They also have to scan the list of terms in the inverted
index to find all matching terms, and gather document IDs term by term. The
only difference between them and the prefix
query is that they support more-complex patterns.
This means that the same caveats apply. Running these queries on a field with
many unique terms can be resource intensive indeed. Avoid using a
pattern that starts with a wildcard (for example, *foo
or, as a regexp, .*foo
).
Whereas prefix matching can be made more efficient by preparing your data at index time, wildcard and regular expression matching can be done only at query time. These queries have their place but should be used sparingly.
The prefix
, wildcard
, and regexp
queries operate on terms. If you use
them to query an analyzed
field, they will examine each term in the
field, not the field as a whole.
For instance, let’s say that our title
field contains “Quick brown fox”
which produces the terms quick
, brown
, and fox
.
This query would match:
{
"regexp"
:
{
"title"
:
"br.*"
}}
But neither of these queries would match:
{
"regexp"
:
{
"title"
:
"Qu.*"
}}
{
"regexp"
:
{
"title"
:
"quick br*"
}}
Leaving postcodes behind, let’s take a look at how prefix matching can help with full-text queries. Users have become accustomed to seeing search results before they have finished typing their query—so-called instant search, or search-as-you-type. Not only do users receive their search results in less time, but we can guide them toward results that actually exist in our index.
For instance, if a user types in johnnie walker bl
, we would like to show results for Johnnie Walker Black Label and Johnnie Walker Blue
Label before they can finish typing their query.
As always, there are more ways than one to skin a cat! We will start by looking at the way that is simplest to implement. You don’t need to prepare your data in any way; you can implement search-as-you-type at query time on any full-text field.
In “Phrase Matching”, we introduced the match_phrase
query, which matches
all the specified words in the same positions relative to each other. For-query time search-as-you-type, we can use a specialization of this query,
called the match_phrase_prefix
query:
{
"match_phrase_prefix"
:
{
"brand"
:
"johnnie walker bl"
}
}
This query behaves in the same way as the match_phrase
query, except that it
treats the last word in the query string as a prefix. In other words, the
preceding example would look for the following:
johnnie
Followed by walker
Followed by words beginning with bl
If you were to run this query through the validate-query
API, it would
produce this explanation:
"johnnie walker bl*"
Like the match_phrase
query, it accepts a slop
parameter (see “Mixing It Up”) to
make the word order and relative positions somewhat less rigid:
{
"match_phrase_prefix"
:
{
"brand"
:
{
"query"
:
"walker johnnie bl"
,
"slop"
:
10
}
}
}
Even though the words are in the wrong order, the query still matches
because we have set a high enough slop
value to allow some flexibility
in word positions.
However, it is always only the last word in the query string that is treated as a prefix.
Earlier, in “prefix Query”, we warned about the perils of the prefix—how
prefix
queries can be resource intensive. The same is true in this
case. A prefix of a
could match hundreds of thousands of terms. Not only
would matching on this many terms be resource intensive, but it would also not be
useful to the user.
We can limit the impact of the prefix expansion by setting max_expansions
to
a reasonable number, such as 50:
{
"match_phrase_prefix"
:
{
"brand"
:
{
"query"
:
"johnnie walker bl"
,
"max_expansions"
:
50
}
}
}
The max_expansions
parameter controls how many terms the prefix is allowed
to match. It will find the first term starting with bl
and keep collecting
terms (in alphabetical order) until it either runs out of terms with prefix
bl
, or it has more terms than max_expansions
.
Don’t forget that we have to run this query every time the user types another character, so it needs to be fast. If the first set of results isn’t what users are after, they’ll keep typing until they get the results that they want.
All of the solutions we’ve talked about so far are implemented at query time. They don’t require any special mappings or indexing patterns; they simply work with the data that you’ve already indexed.
The flexibility of query-time operations comes at a cost: search performance. Sometimes it may make sense to move the cost away from the query. In a real- time web application, an additional 100ms may be too much latency to tolerate.
By preparing your data at index time, you can make your searches more flexible and improve performance. You still pay a price: increased index size and slightly slower indexing throughput, but it is a price you pay once at index time, instead of paying it on every query.
Your users will thank you.
As we have said before, “You can find only terms that exist in the inverted
index.” Although the prefix
, wildcard
, and regexp
queries demonstrated that
that is not strictly true, it is true that doing a single-term lookup is
much faster than iterating through the terms list to find matching terms on
the fly. Preparing your data for partial matching ahead of time will increase
your search performance.
Preparing your data at index time means choosing the right analysis chain, and
the tool that we use for partial matching is the n-gram. An n-gram can be
best thought of as a moving window on a word. The n stands for a length.
If we were to n-gram the word quick
, the results would depend on the length
we have chosen:
Length 1 (unigram): [ q
, u
, i
, c
, k
]
Length 2 (bigram): [ qu
, ui
, ic
, ck
]
Length 3 (trigram): [ qui
, uic
, ick
]
Length 4 (four-gram): [ quic
, uick
]
Length 5 (five-gram): [ quick
]
Plain n-grams are useful for matching somewhere within a word, a technique
that we will use in “Ngrams for Compound Words”. However, for search-as-you-type,
we use a specialized form of n-grams called edge n-grams. Edge
n-grams are anchored to the beginning of the word. Edge n-gramming the word
quick
would result in this:
q
qu
qui
quic
quick
You may notice that this conforms exactly to the letters that a user searching for “quick” would type. In other words, these are the perfect terms to use for instant search!
The first step to setting up index-time search-as-you-type is to define our analysis chain, which we discussed in “Configuring Analyzers”, but we will go over the steps again here.
The first step is to configure a custom edge_ngram
token filter, which we
will call the autocomplete_filter
:
{
"filter"
:
{
"autocomplete_filter"
:
{
"type"
:
"edge_ngram"
,
"min_gram"
:
1
,
"max_gram"
:
20
}
}
}
This configuration says that, for any term that this token filter receives, it should produce an n-gram anchored to the start of the word of minimum length 1 and maximum length 20.
Then we need to use this token filter in a custom analyzer, which we will call
the autocomplete
analyzer:
{
"analyzer"
:
{
"autocomplete"
:
{
"type"
:
"custom"
,
"tokenizer"
:
"standard"
,
"filter"
:
[
"lowercase"
,
"autocomplete_filter"
]
}
}
}
This analyzer will tokenize a string into individual terms by using the
standard
tokenizer, lowercase each term, and then produce edge n-grams of each
term, thanks to our autocomplete_filter
.
The full request to create the index and instantiate the token filter and analyzer looks like this:
PUT
/
my_index
{
"settings"
:
{
"number_of_shards"
:
1
,
"analysis"
:
{
"filter"
:
{
"autocomplete_filter"
:
{
"type"
:
"edge_ngram"
,
"min_gram"
:
1
,
"max_gram"
:
20
}
},
"analyzer"
:
{
"autocomplete"
:
{
"type"
:
"custom"
,
"tokenizer"
:
"standard"
,
"filter"
:
[
"lowercase"
,
"autocomplete_filter"
]
}
}
}
}
}
You can test this new analyzer to make sure it is behaving correctly by using
the analyze
API:
GET
/
my_index
/
_analyze
?
analyzer
=
autocomplete
quick
brown
The results show us that the analyzer is working correctly. It returns these terms:
q
qu
qui
quic
quick
b
br
bro
brow
brown
To use the analyzer, we need to apply it to a field, which we can do
with the update-mapping
API:
PUT
/
my_index
/
_mapping
/
my_type
{
"my_type"
:
{
"properties"
:
{
"name"
:
{
"type"
:
"string"
,
"analyzer"
:
"autocomplete"
}
}
}
}
Now, we can index some test documents:
POST
/
my_index
/
my_type
/
_bulk
{
"index"
:
{
"_id"
:
1
}}
{
"name"
:
"Brown foxes"
}
{
"index"
:
{
"_id"
:
2
}}
{
"name"
:
"Yellow furballs"
}
If you test out a query for “brown fo” by using a simple match
query
GET
/
my_index
/
my_type
/
_search
{
"query"
:
{
"match"
:
{
"name"
:
"brown fo"
}
}
}
you will see that both documents match, even though the Yellow furballs
doc contains neither brown
nor fo
:
{
"hits"
:
[
{
"_id"
:
"1"
,
"_score"
:
1.5753809
,
"_source"
:
{
"name"
:
"Brown foxes"
}
},
{
"_id"
:
"2"
,
"_score"
:
0.012520773
,
"_source"
:
{
"name"
:
"Yellow furballs"
}
}
]
}
As always, the validate-query
API shines some light:
GET
/
my_index
/
my_type
/
_validate
/
query
?
explain
{
"query"
:
{
"match"
:
{
"name"
:
"brown fo"
}
}
}
The explanation
shows us that the query is looking for edge n-grams of every
word in the query string:
name:b name:br name:bro name:brow name:brown name:f name:fo
The name:f
condition is satisfied by the second document because
furballs
has been indexed as f
, fu
, fur
, and so forth. In retrospect, this
is not surprising. The same autocomplete
analyzer is being applied both at
index time and at search time, which in most situations is the right thing to
do. This is one of the few occasions when it makes sense to break this rule.
We want to ensure that our inverted index contains edge n-grams of every word,
but we want to match only the full words that the user has entered (brown
and fo
). We can do this by using the autocomplete
analyzer at
index time and the standard
analyzer at search time. One way to change the
search analyzer is just to specify it in the query:
GET
/
my_index
/
my_type
/
_search
{
"query"
:
{
"match"
:
{
"name"
:
{
"query"
:
"brown fo"
,
"analyzer"
:
"standard"
}
}
}
}
Alternatively, we can specify the index_analyzer
and search_analyzer
in
the mapping for the name
field itself. Because we want to change only the
search_analyzer
, we can update the existing mapping without having to
reindex our data:
PUT
/
my_index
/
my_type
/
_mapping
{
"my_type"
:
{
"properties"
:
{
"name"
:
{
"type"
:
"string"
,
"index_analyzer"
:
"autocomplete"
,
"search_analyzer"
:
"standard"
}
}
}
}
Use the autocomplete
analyzer at index time to produce edge n-grams of
every term.
Use the standard
analyzer at search time to search only on the terms
that the user has entered.
If we were to repeat the validate-query
request, it would now give us this
explanation:
name:brown name:fo
Repeating our query correctly returns just the Brown foxes
document.
Because most of the work has been done at index time, all this query needs to
do is to look up the two terms brown
and fo
, which is much more efficient
than the match_phrase_prefix
approach of having to find all terms beginning
with fo
.
The edge n-gram approach can also be used for structured data, such as the
postcodes example from earlier in this chapter. Of course,
the postcode
field would need to be analyzed
instead of not_analyzed
, but
you could use the keyword
tokenizer to treat the postcodes as if they were
not_analyzed
.
The keyword
tokenizer is the no-operation tokenizer, the tokenizer that does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as not_analyzed
but that require some other analysis
transformation such as lowercasing.
This example uses the keyword
tokenizer to convert the postcode string into a token stream, so that we can use the edge n-gram token filter:
{
"analysis"
:
{
"filter"
:
{
"postcode_filter"
:
{
"type"
:
"edge_ngram"
,
"min_gram"
:
1
,
"max_gram"
:
8
}
},
"analyzer"
:
{
"postcode_index"
:
{
"tokenizer"
:
"keyword"
,
"filter"
:
[
"postcode_filter"
]
},
"postcode_search"
:
{
"tokenizer"
:
"keyword"
}
}
}
}
Finally, let’s take a look at how n-grams can be used to search languages with compound words. German is famous for combining several small words into one massive compound word in order to capture precise or complex meanings. For example:
Pronunciation dictionary
Military history
White-headed sea eagle, or bald eagle
World Health Organization
The law concerning the delegation of duties for the supervision of cattle marking and the labeling of beef
Somebody searching for “Wörterbuch” (dictionary) would probably expect to see “Aussprachewörtebuch” in the results list. Similarly, a search for “Adler” (eagle) should include “Weißkopfseeadler.”
One approach to indexing languages like this is to break compound words into their constituent parts using the compound word token filter. However, the quality of the results depends on how good your compound-word dictionary is.
Another approach is just to break all words into n-grams and to search for any matching fragments—the more fragments that match, the more relevant the document.
Given that an n-gram is a moving window on a word, an n-gram of any length will cover all of the word. We want to choose a length that is long enough to be meaningful, but not so long that we produce far too many unique terms. A trigram (length 3) is probably a good starting point:
PUT
/
my_index
{
"settings"
:
{
"analysis"
:
{
"filter"
:
{
"trigrams_filter"
:
{
"type"
:
"ngram"
,
"min_gram"
:
3
,
"max_gram"
:
3
}
},
"analyzer"
:
{
"trigrams"
:
{
"type"
:
"custom"
,
"tokenizer"
:
"standard"
,
"filter"
:
[
"lowercase"
,
"trigrams_filter"
]
}
}
}
},
"mappings"
:
{
"my_type"
:
{
"properties"
:
{
"text"
:
{
"type"
:
"string"
,
"analyzer"
:
"trigrams"
}
}
}
}
}
Testing the trigrams analyzer with the analyze
API
GET
/
my_index
/
_analyze
?
analyzer
=
trigrams
Weißkopfseeadler
returns these terms:
wei, eiß, ißk, ßko, kop, opf, pfs, fse, see, eea,ead, adl, dle, ler
We can index our example compound words to test this approach:
POST
/
my_index
/
my_type
/
_bulk
{
"index"
:
{
"_id"
:
1
}}
{
"text"
:
"Aussprachewörterbuch"
}
{
"index"
:
{
"_id"
:
2
}}
{
"text"
:
"Militärgeschichte"
}
{
"index"
:
{
"_id"
:
3
}}
{
"text"
:
"Weißkopfseeadler"
}
{
"index"
:
{
"_id"
:
4
}}
{
"text"
:
"Weltgesundheitsorganisation"
}
{
"index"
:
{
"_id"
:
5
}}
{
"text"
:
"Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz"
}
A search for “Adler” (eagle) becomes a query for the three terms adl
, dle
,
and ler
:
GET
/
my_index
/
my_type
/
_search
{
"query"
:
{
"match"
:
{
"text"
:
"Adler"
}
}
}
which correctly matches “Weißkopfsee-adler”:
{
"hits"
:
[
{
"_id"
:
"3"
,
"_score"
:
3.3191128
,
"_source"
:
{
"text"
:
"Weißkopfseeadler"
}
}
]
}
A similar query for “Gesundheit” (health) correctly matches
“Welt-gesundheit-sorganisation,” but it also matches
“Militär-ges-chichte” and
“Rindfleischetikettierungsüberwachungsaufgabenübertragungs-ges-etz,”
both of which also contain the trigram ges
.
Judicious use of the minimum_should_match
parameter can remove these
spurious results by requiring that a minimum number of trigrams must be
present for a document to be considered a match:
GET
/
my_index
/
my_type
/
_search
{
"query"
:
{
"match"
:
{
"text"
:
{
"query"
:
"Gesundheit"
,
"minimum_should_match"
:
"80%"
}
}
}
}
This is a bit of a shotgun approach to full-text search and can result in a large inverted index, but it is an effective generic way of indexing languages that use many compound words or that don’t use whitespace between words, such as Thai.
This technique is used to increase recall—the number of relevant documents that a search returns. It is usually used in combination with other techniques, such as shingles (see “Finding Associated Words”) to improve precision and the relevance score of each document.
3.133.130.199