Example: Fuzzy Searching

At this point, we’ve extracted article text from web pages and then created keyword indexes of those articles based on the nouns that occurred most frequently within them.

Now let’s explore how we could search our new database for articles that match a given query. More usefully, you’ll learn how you can perform a fuzzy search—a search that matches not just the exact query you give it, but also things that are similar to it.

Our first step is to write a simple and exact search. Then we can look at how to improve both the storage of the index and the querying process to broaden the text that is matched by it.

A Simple Search

The simplest search we could write would be one that, given a search term, lists all the articles that have that term among their keywords. Although it’s simple, such a search would actually be quite useful, and because we only have to search the pre-extracted terms, which are only a fraction of the length of the original articles, it should be relatively efficient, too.

nlp/simple-search.rb

require "json"
require "json/add/core"

Article = Struct.new(:title, :text, :terms)
articles = JSON.parse(File.read("articles.json"), create_additions: true)

query = ARGV[0]

matches =
  articles.select { |article| article.terms.assoc(query) }

if matches.length > 0
  matches.each { |article| puts article.title }
else
  puts "No matches"
end

We require the necessary libraries, then re-create our Article struct and deserialize the article data from the JSON file it’s stored in. We now have an array of Articles, the same data that we created in our extraction script.

Next is the actual searching logic. We take the first argument passed on the command line and use it as our search query, selecting all of the articles where one or more terms exactly matches that query. Since an article’s terms are an array of arrays, with each entry containing not just the term itself but also information about the frequency of the term, we can use Array’s assoc method for this. Given a value, assoc searches an array of arrays and returns the first sub-array whose first element is equal to that value. For each article that matches, we simply output its title.
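If assoc is unfamiliar, a quick irb session makes its behavior concrete. The entries below are purely illustrative (the exact shape of each entry depends on what the extractor stored), but the first element is the term, which is all assoc looks at:

terms = [["Rincewind", 21, 1], ["Luggage", 8, 1]]   # hypothetical [term, frequency, ...] entries

terms.assoc("Rincewind")   # => ["Rincewind", 21, 1]
terms.assoc("rincewind")   # => nil (assoc compares with ==, so case matters)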

This logic works perfectly fine. We can run the script as follows and see matching articles:

 
$ ruby simple-search.rb Rincewind
The Colour of Magic
The Light Fantastic
Sourcery
Eric (novel)
Interesting Times
The Last Continent
The Last Hero
$ ruby simple-search.rb 'Granny Weatherwax'
Equal Rites
Wyrd Sisters
Witches Abroad
Carpe Jugulum
A Hat Full of Sky
Wintersmith

But the search that we’ve created is somewhat inflexible. Our first search, for Rincewind, worked fine because the character’s name is capitalized and so was our search query. But a search for rincewind produces no results at all:

 
$ ruby simple-search.rb rincewind

This is clearly an issue, but we can do two things to solve it. The first is to normalize our index to ignore elements that we don’t want to consider in our queries (such as case); this can be done in advance. The second is to be more flexible in what we match. Instead of allowing only for an identical match, we can allow for a fuzzy one.

Normalizing the Index

If we want our search queries to ignore some element of the text we’re searching—its case, for example—then a simple solution is to account for that in the index. If we want to ignore case, then we make all of the terms lowercase. If we want to ignore accents, then we either strip them out or convert them to some normalized form (converting é to e, for example). Provided we then make the same normalization to our search queries, we’ll achieve the desired effect.

In this case, it makes sense to ignore case. It’s how a user would expect a search to behave, and it’s a straightforward change to make. While we’re at it, we’ll also ignore dashes, so that Ankh-Morpork will be returned if the user searches for Ankh Morpork.

Let’s adapt our term extraction script to store both the original version of the term and a normalized version. Although this increases the size of our index, in this case we’re not dealing with tens of thousands of articles, so the trade-off is worth it. First, we can write a method that, given a string, returns the normalized form of that string by lowercasing it and replacing dashes with spaces:

nlp/extracting-normalized-terms.rb

def normalize_term(term)
  term.downcase.gsub("-", " ")
end

Then we need to make sure that we’re storing this normalized version of each term at the point we store the article. Let’s modify that portion of the script:

nlp/extracting-normalized-terms.rb

terms = Phrasie::Extractor.new.phrases(text)
terms = terms.reject { |term| ignored_terms.include?(term.first) }

terms.each do |term|
  term.unshift normalize_term(term.first)
end

Just after extracting the terms, we loop over them and prepend a normalized form of the term onto the array. Since this becomes the first element in the array, it becomes the element checked by assoc, so our search script can continue to work in the same way.
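To see the effect on a single entry, here’s a small illustration using normalize_term from above (the frequency values are made up):

term = ["Ankh-Morpork", 7, 1]
term.unshift normalize_term(term.first)
# => ["ankh morpork", "Ankh-Morpork", 7, 1]

The normalized form is now what assoc compares against, while the original term is still available for display.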

Finally, we need to make sure that we’re applying the same normalization to our search query:

 
query = normalize_term(ARGV[0])

After making these changes, a search for rincewind bears the results our users might have expected:

 
$ ruby simple-search.rb rincewind
The Colour of Magic
The Light Fantastic
Sourcery
Eric (novel)
Interesting Times
The Last Continent
The Last Hero

We could adapt our normalization method to do many other things: stripping accents, removing punctuation other than dashes, converting non-ASCII characters to their closest ASCII equivalents (é to e, for example). There are even normalizations in the realms of language processing that we could perform: reducing every keyword to its stem, for example—so rowing would become row. This would mean that a search for one of row or rowing would match both the original keywords row and rowing. The sort of normalization we perform will depend on the domain in which we’re working.
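As a rough sketch of what a more aggressive normalization might look like, the version below strips accents using Ruby’s built-in Unicode normalization and stems each word with the Porter stemmer that ships with the text gem (a library we’ll meet properly in the next section). Whether either step helps depends entirely on your data, so treat this as an illustration rather than a recommendation:

require "text"

def normalize_term(term)
  term
    .downcase
    .gsub("-", " ")
    .unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")      # decompose accented characters, then drop the marks: é becomes e
    .split
    .map { |word| Text::PorterStemming.stem(word) }   # rowing becomes row
    .join(" ")
end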

Fuzzy Matchers

We’ve developed our search to the point that it’s a bit more helpful to users, not forcing them to match the case of the original term in their query. But it still requires them to be pretty exact, particularly when it comes to spelling. Google has spoiled us in this regard: we want to be able to search for, say, rinsewind and have the search know that we really meant rincewind.

There are a variety of techniques within this area, but we’re going to focus on two. The first is the concept of edit distance, which will catch typos and simple spelling mistakes. For instance, if the user types teh king as a query, this will enable us to match (the probably intended) the king.

The second is the use of phonetic algorithms, which will allow the user to search for something they have almost no idea how to spell by searching for terms that sound, when pronounced, like the query. This will allow the user to type reylwey stayshun, for example, and have the term railway station match.

Edit Distance

If we repeatedly add, remove, or modify a single character of a given string of text, it’s possible to transform it into any other string, no matter how different in length or in content the two strings might seem. To illustrate this, let’s take the string boat. We can transform it into bob by performing the following changes:

 
boat
boa
bob

We removed the t to get boa; that’s one transformation. Then we changed the a to a b to get bob; that’s a second transformation. Because it took two single-character transformations to get from our first string to our second, we say that these two strings have an edit distance of 2.

To transform starring to cart, we perform the following changes:

 
starring
starrin
starri
starr
tarr
tart
cart

This time, six transformations are necessary—the two strings have an edit distance of 6. This method of calculating edit distance, by considering the adding, removing, and modifying of characters, is known as the Levenshtein distance.

By using a Levenshtein distance algorithm in Ruby, we can look for terms that are within a certain edit distance of the query and that are therefore a good—but not necessarily exact—match. A Ruby implementation of this algorithm can be found in Paul Battley’s excellent text gem.
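As a quick sanity check, the distances for the two examples above can be computed directly with the gem’s Text::Levenshtein module:

require "text"

Text::Levenshtein.distance("boat", "bob")        # => 2
Text::Levenshtein.distance("starring", "cart")   # => 6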

Using this library, we can extend our script to look for terms within an edit distance of two from our query:

nlp/levenshtein-search.rb

query = normalize_term(ARGV[0])

matches =
  articles
    .select { |article|
      article.terms.find { |term, _|
        Text::Levenshtein.distance(term, query) <= 2
      }
    }

if matches.length > 0
  matches.each { |article| puts article.title }
else
  puts "No matches"
end

Previously, we were checking whether the query matched a term exactly. Now we check whether the Levenshtein distance between the query and the term is less than or equal to two. This will cover the previous exact-matching functionality too, since two identical strings will have a Levenshtein distance of zero. But it will also cover cases where the query differs by one or two transformations, allowing for typos and misspellings—in this case the misspelling of the term rincewind:

 
$ ruby levenshtein-search.rb rinsewind
The Colour of Magic
The Light Fantastic
Sourcery
Eric (novel)
Interesting Times
The Last Continent
The Last Hero
$ ruby levenshtein-search.rb rinsewint
The Colour of Magic
The Light Fantastic
Sourcery
Eric (novel)
Interesting Times
The Last Continent
The Last Hero
$ ruby levenshtein-search.rb rensewint
No matches

By altering one or two characters from the actual term, we see that matches are returned. But when we alter a third character, we stray too far from reality, and our criteria are no longer satisfied; we therefore see no results listed.
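For reference, these are the distances the standard Levenshtein calculation gives between the real term and the three queries above, which is exactly the behavior the sample runs show:

Text::Levenshtein.distance("rincewind", "rinsewind")   # => 1
Text::Levenshtein.distance("rincewind", "rinsewint")   # => 2
Text::Levenshtein.distance("rincewind", "rensewint")   # => 3, over our threshold of 2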

Searching for terms within a certain level of fuzziness is an improvement, but we’re not yet at the slightly magic-feeling Google level, in which we can allow people to search for terms that they’ve heard of but have no idea how to spell. For that, we’ll need another algorithm.

Phonetic Algorithms

English orthography is a strange beast. Many words’ pronunciations have little or no resemblance to the way they’re spelled. Think of though, through, thought, and cough; think of laugh, graph, and staff. Similar spellings don’t mean similar sounds, and similar sounds aren’t necessarily spelled in similar ways. If English is your second language, you’ve likely grappled with these bizarre and confusing aspects of its spelling system.

It might seem that translating an English word into its pronounced form would be impossible, but that’s actually not the case. While English spelling rules are undoubtedly a hodgepodge, they aren’t random. There are rules behind them—rules that are learned to varying degrees by speakers of English—and where there are rules, we can create algorithms.

One such algorithm is called Metaphone.[13] Created by the software engineer Lawrence Philips in 1990, Metaphone is a form of phonetic fingerprinting. If we feed it a string representing an English word or phrase, it will return a representation of that word’s approximate pronunciation, allowing for regional differences in accent and pronunciation. This means that the Metaphone values for two words that are pronounced in similar ways will be identical, regardless of how the words are spelled.

For example, the Metaphone for baron is BRN. The Metaphone for barren, which is pronounced in the same way, is also BRN. These values are of course based on the approximate pronunciations of the words. For example, the Metaphone values for pudding and putting, which are pronounced the same in some American accents but not in British English, are the same. But this fuzziness is likely to benefit us in the case of searching, so it’s actually desirable—it wouldn’t be good if our fuzziness worked only with certain accents.

The text gem, which we used for its Levenshtein distance algorithm, also provides us with an implementation of Metaphone in Ruby. For any string, we can obtain its Metaphone representation by using the Text::Metaphone.metaphone method:

 
Text::Metaphone.metaphone("tooth")
=> "T0"
Text::Metaphone.metaphone("tewth")
=> "T0"
Text::Metaphone.metaphone("tuth")
=> "T0"

We can introduce Metaphone into our search process in much the same way as we did with edit distance; it just becomes another thing to check against potential terms:

nlp/metaphone-search.rb

query = normalize_term(ARGV[0])

metaphone_query = Text::Metaphone.metaphone(query)

matches =
  articles
    .select { |article|
      article.terms.find { |term, _|
        Text::Levenshtein.distance(term, query) <= 2 ||
          Text::Metaphone.metaphone(term) == metaphone_query
      }
    }

if matches.length > 0
  matches.each { |article| puts article.title }
else
  puts "No matches"
end

First, we calculate the Metaphone value for the query passed on the command line and store it for later. Then, when looking at each term, we compare the Metaphone value for that term to that of the query. If they’re the same—which means that the term sounds like the query—then we consider the term to be a match.
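We can confirm in irb (with the text gem required) that the misspelling that was too distant for the Levenshtein check now shares a phonetic fingerprint with the real term, which is why it matches in the first run below:

Text::Metaphone.metaphone("rincewind") == Text::Metaphone.metaphone("rensewint")
# => true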

We can now match things that sound similar to valid terms. So, acceptable substitutes for rincewind and vetinari are:

 
$ ruby metaphone-search.rb rensewint
The Colour of Magic
The Light Fantastic
Sourcery
Eric (novel)
Interesting Times
The Last Continent
The Last Hero
$ ruby metaphone-search.rb vetunahree
Men at Arms
Feet of Clay (novel)
Jingo (novel)
The Truth (novel)
The Last Hero
Going Postal
Thud!
Raising Steam

That’s it; we’ve now created a fuzzy searcher for our index of articles. It allows users to search using queries that exactly match terms, as you might expect, but also ones that are close misspellings or merely sound like genuine terms. Equally, it allows us to match terms that are misspelled in the original articles—a fairly likely occurrence.

If you wanted to extend this further, you could implement some kind of weighting of the results returned by the search routine (so that “better” matches were listed first), perhaps based on keyword density across the entire database of articles. You could pull out snippets from the article to show the terms in context, helping the user to determine whether the match is a good one. By layering language processing techniques, as we’ve already seen, we can create powerful tools. It’s a subject worthy of exploration.
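As one illustrative sketch of the weighting idea, suppose each term entry keeps the occurrence count supplied by the extractor; the exact position of that count within an entry is an assumption here. We could then sort the matches by the frequency of the best-matching term, so that articles which mention the query heavily appear first:

# Hypothetical helper: highest frequency among an article's terms that match the query.
# Assumes entries of the form [normalized_term, original_term, frequency, ...].
def best_match_frequency(article, query)
  article.terms
    .select { |term, _| Text::Levenshtein.distance(term, query) <= 2 }
    .map { |entry| entry[2].to_i }
    .max || 0
end

matches
  .sort_by { |article| -best_match_frequency(article, query) }
  .each { |article| puts article.title }

The same idea works with whichever matching predicate the search uses; the Levenshtein check alone is shown here only to keep the sketch short.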
