We expect a query on structured data like dates and prices to return only documents that match exactly. However, good full-text search shouldn’t have the same restriction. Instead, we can widen the net to include words that may match, but use the relevance score to push the better matches to the top of the result set.
In fact, full-text search that only matches exactly will probably frustrate your users. Wouldn’t you expect a search for “quick brown fox” to match a document containing “fast brown foxes,” “Johnny Walker” to match “Johnnie Walker,” or “Arnold Shcwarzenneger” to match “Arnold Schwarzenegger”?
If documents exist that do contain exactly what the user has queried, they should appear at the top of the result set, but weaker matches can be included further down the list. If no documents match exactly, at least we can show the user potential matches; they may even be what the user originally intended!
We have already looked at diacritic-free matching in Chapter 20, word stemming in Chapter 21, and synonyms in Chapter 23, but all of those approaches presuppose that words are spelled correctly, or that there is only one way to spell each word.
Fuzzy matching allows for query-time matching of misspelled words, while phonetic token filters at index time can be used for sounds-like matching.
Fuzzy matching treats two words that are “fuzzily” similar as if they were the same word. First, we need to define what we mean by fuzziness.
In 1965, Vladimir Levenshtein developed the Levenshtein distance, which measures the number of single-character edits required to transform one word into the other. He proposed three types of one-character edits:
Substitution of one character for another: _f_ox → _b_ox
Insertion of a new character: sic → sic_k_
Deletion of a character: b_l_ack → back

Frederick Damerau later expanded these operations to include one more:

Transposition of two adjacent characters: _st_ar → _ts_ar
For example, to convert the word bieber into beaver requires the following steps:

1. Substitute v for b: bie_b_er → bie_v_er
2. Substitute a for i: b_i_ever → b_a_ever
3. Transpose a and e: b_ae_ver → b_ea_ver

These three steps represent a Damerau-Levenshtein edit distance of 3.
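The edit operations above are straightforward to implement. Here is a minimal Python sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment); the function name is ours for illustration, not anything Elasticsearch exposes:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance allowing substitutions, insertions, deletions,
    and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("bieber", "beaver"))  # 3
```

Running it on the examples above reproduces the distances we calculated by hand: fox → box, sic → sick, black → back, and star → tsar are each one edit apart.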
Clearly, bieber is a long way from beaver; they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.
Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
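The AUTO rules above can be expressed in a few lines. This Python sketch (the function name is illustrative) maps a term's length to the maximum edit distance AUTO allows:

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed by fuzziness=AUTO,
    derived from the length of the term."""
    if len(term) <= 2:
        return 0   # one or two characters: exact match only
    if len(term) <= 5:
        return 1   # three to five characters: one edit
    return 2       # longer terms: up to two edits

for term in ["at", "hat", "fuzzy", "surprize"]:
    print(term, auto_fuzziness(term))
```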
Of course, you may find that an edit distance of 2 is still overkill, and returns results that don’t appear to be related. You may get better results, and better performance, with a maximum fuzziness of 1.
The fuzzy query is the fuzzy equivalent of the term query. You will seldom use it directly yourself, but understanding how it works will help you to use fuzziness in the higher-level match query.

To understand how it works, we will first index some documents:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!" }
{ "index": { "_id": 2 }}
{ "text": "That was surprising." }
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised." }
Now we can run a fuzzy query for the term surprize:
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": "surprize"
    }
  }
}
The fuzzy query is a term-level query, so it doesn’t do any analysis. It takes a single term and finds all terms in the term dictionary that are within the specified fuzziness. The default fuzziness is AUTO.

In our example, surprize is within an edit distance of 2 from both surprise and surprised, so documents 1 and 3 match. We could reduce the matches to just surprise with the following query:
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 1
      }
    }
  }
}
The fuzzy query works by taking the original term and building a Levenshtein automaton, in essence a big graph representing all the strings that are within the specified edit distance of the original string. The fuzzy query then uses the automaton to step efficiently through all of the terms in the term dictionary to see if they match. Once it has collected all of the matching terms that exist in the term dictionary, it can compute the list of matching documents.
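Lucene builds a genuine automaton for this; as a rough illustration of the end result, the following Python sketch simply brute-forces a toy term dictionary with plain Levenshtein distance (all names are ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Row-by-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_terms(term, term_dict, fuzziness=2):
    """Brute-force stand-in for the automaton: collect every term
    in the dictionary within `fuzziness` edits of the query term."""
    return [t for t in term_dict if levenshtein(term, t) <= fuzziness]

# A toy term dictionary like the one produced by the documents above
term_dict = ["i", "me", "surprise", "surprised", "surprising",
             "that", "was", "wasn't"]
print(fuzzy_terms("surprize", term_dict))  # ['surprise', 'surprised']
```

As in the example above, surprize matches surprise (one substitution) and surprised (one substitution plus one insertion), but not surprising, which is four edits away.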
Of course, depending on the type of data stored in the index, a fuzzy query with an edit distance of 2 can match a very large number of terms and perform very badly. Two parameters can be used to limit the performance impact:
prefix_length
The number of initial characters that will not be “fuzzified.” Most spelling errors occur toward the end of the word, not toward the beginning. By using a prefix_length of 3, for example, you can significantly reduce the number of matching terms.
max_expansions
If a fuzzy query expands to three or four fuzzy options, the new options may be meaningful. If it produces 1,000 options, they are essentially meaningless. Use max_expansions to limit the total number of options that will be produced. The fuzzy query will collect matching terms until it runs out of terms or reaches the max_expansions limit.
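As an illustration of how these two parameters prune the candidate set, here is a hedged Python sketch; the helper names are made up, and the real implementation does this work inside the automaton rather than by scanning:

```python
def levenshtein(a: str, b: str) -> int:
    """Row-by-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def expand_fuzzy(term, term_dict, fuzziness=2,
                 prefix_length=0, max_expansions=50):
    """Candidates must share the first `prefix_length` characters
    (that prefix is never fuzzified), and collection stops once
    `max_expansions` terms have been gathered."""
    prefix = term[:prefix_length]
    expansions = []
    for candidate in sorted(term_dict):
        if not candidate.startswith(prefix):
            continue  # cheap prefix check prunes most of the dictionary
        if levenshtein(term, candidate) <= fuzziness:
            expansions.append(candidate)
            if len(expansions) >= max_expansions:
                break  # cap the number of term options produced
    return expansions

print(expand_fuzzy("surprize",
                   ["surprise", "surprised", "sunrise", "reprise"],
                   prefix_length=3))  # ['surprise', 'surprised']
```

With prefix_length=3, sunrise and reprise are rejected before any edit distance is computed, because they don't start with sur.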
The match query supports fuzzy matching out of the box:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
The query string is first analyzed, to produce the terms [surprize, me], and then each term is fuzzified using the specified fuzziness.
Similarly, the multi_match query also supports fuzziness, but only when executing with type best_fields or most_fields:
GET /my_index/my_type/_search
{
  "query": {
    "multi_match": {
      "fields": [ "text", "title" ],
      "query": "SURPRIZE ME!",
      "fuzziness": "AUTO"
    }
  }
}
Both the match and multi_match queries also support the prefix_length and max_expansions parameters.

Fuzziness works only with the basic match and multi_match queries. It doesn’t work with phrase matching, common terms, or cross_fields matches.
Users love fuzzy queries. They assume that these queries will somehow magically find the right combination of proper spellings. Unfortunately, the truth is somewhat more prosaic.
Imagine that we have 1,000 documents containing “Schwarzenegger,” and just one document with the misspelling “Schwarzeneger.” According to the theory of term frequency/inverse document frequency, the misspelling is much more relevant than the correct spelling, because it appears in far fewer documents!
In other words, if we were to treat fuzzy matches like any other match, we would favor misspellings over correct spellings, which would make for grumpy users.
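To see the effect, consider the inverse document frequency weight from Lucene’s classic TF/IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)): the rarer a term, the higher its weight. A quick Python sketch with assumed document counts:

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency as in Lucene's classic TF/IDF
    similarity: idf = 1 + ln(numDocs / (docFreq + 1))."""
    return 1 + math.log(num_docs / (doc_freq + 1))

num_docs = 10_000  # assumed index size, for illustration only
# "schwarzenegger" appears in 1,000 documents; the misspelling in 1
print(round(idf(1000, num_docs), 2))  # 3.3
print(round(idf(1, num_docs), 2))     # 9.52
```

The rare misspelling earns roughly three times the weight of the correct spelling, which is exactly the inversion described above.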
By default, the match query gives all fuzzy matches the constant score of 1. This is sufficient to add potential matches onto the end of the result list, without interfering with the relevance scoring of nonfuzzy queries.
Fuzzy queries alone are much less useful than they initially appear. They are better used as part of a “bigger” feature, such as the search-as-you-type completion suggester or the did-you-mean phrase suggester.
In a last, desperate, attempt to match something, anything, we could resort to searching for words that sound similar, even if their spelling differs.
Several algorithms exist for converting words into a phonetic representation. The Soundex algorithm is the granddaddy of them all, and most other phonetic algorithms are improvements or specializations of it: Metaphone and Double Metaphone (which expands phonetic matching to languages other than English), Caverphone for matching names in New Zealand, the Beider-Morse algorithm, which adapts the Soundex approach for better matching of German and Yiddish names, and the Kölner Phonetik for better handling of German words.
The thing to take away from this list is that phonetic algorithms are fairly crude, and very specific to the languages they were designed for, usually either English or German. This limits their usefulness. Still, for certain purposes, and in combination with other techniques, phonetic matching can be a useful tool.
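To get a feel for how crude these algorithms are, here is a minimal Python sketch of American Soundex (simplified: it omits the special handling of h and w). Both Smith and Smythe collapse to the same four-character code:

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: keep the first letter, map the
    remaining consonants to digits, collapse adjacent duplicates,
    drop vowels, and pad or truncate to four characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits = [codes.get(c, "") for c in word]
    result = word[0].upper()
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:  # skip vowels and collapse runs like "tt"
            result += d
        prev = d
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smythe"))  # S530 S530
print(soundex("Robert"))                    # R163
```

Smith and Smythe, which sound alike but share only four letters, both reduce to S530, while many unrelated names collapse onto the same codes too; that coarseness is the point of the warning above.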
First, you will need to install the Phonetic Analysis plug-in from http://bit.ly/1CreKJQ on every node in the cluster, and restart each node.
Then, you can create a custom analyzer that uses one of the phonetic token filters and try it out:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "standard",
          "filter": "dbl_metaphone"
        }
      }
    }
  }
}
First, configure a custom phonetic token filter that uses the double_metaphone encoder. Then use the custom token filter in a custom analyzer.
Now we can test it with the analyze API:

GET /my_index/_analyze?analyzer=dbl_metaphone
Smith Smythe
Each of Smith and Smythe produces two tokens in the same position: SM0 and XMT. Running John, Jon, and Johnnie through the analyzer will all produce the two tokens JN and AN, while Jonathon results in the tokens JN0N and ANTN.
The phonetic analyzer can be used just like any other analyzer. First map a field to use it, and then index some data:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "string",
      "fields": {
        "phonetic": {
          "type": "string",
          "analyzer": "dbl_metaphone"
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{ "name": "John Smith" }

PUT /my_index/my_type/2
{ "name": "Jonnie Smythe" }
The match query can be used for searching:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name.phonetic": {
        "query": "Jahnnie Smeeth",
        "operator": "and"
      }
    }
  }
}
This query returns both documents, demonstrating just how coarse phonetic matching is. Scoring with a phonetic algorithm is pretty much worthless. The purpose of phonetic matching is not to increase precision, but to increase recall—to spread the net wide enough to catch any documents that might possibly match.
It usually makes more sense to use phonetic algorithms when retrieving results which will be consumed and post-processed by another computer, rather than by human users.