8
Language

In this chapter, we step into the messy world of human language. We’ll start by discussing the differences between language and math that make language algorithms difficult. We’ll continue by building a space insertion algorithm that can take any text in any language and insert spaces wherever they’re missing. After that, we’ll build a phrase completion algorithm that can imitate the style of a writer and find the most fitting next word in a phrase.

The algorithms in this chapter rely heavily on two tools that we haven’t used before: list comprehensions and corpuses. List comprehensions enable us to quickly generate lists using the logic of loops and iterations. They’re optimized to run very quickly in Python and they’re easy to write concisely, but they can be hard to read and their syntax takes some getting used to. A corpus is a body of text that will “teach” our algorithm the language and style we want it to use.

Why Language Algorithms Are Hard

The application of algorithmic thinking to language goes back at least as far as Descartes, who noticed that although there are infinite numbers, anyone with a rudimentary understanding of arithmetic knows how to create or interpret a number they’ve never encountered before. For example, maybe you’ve never encountered the number 14,326—never counted that high, never read a financial report about that many dollars, never mashed exactly those keys on the keyboard. And yet I’m confident that you can easily grasp exactly how high it is, what numbers are higher or lower than it, and how to manipulate it in equations.

The algorithm that lets us easily understand hitherto unimagined numbers is simply a combination of the 10 digits (0–9), memorized in order, and the place system. We know that 14,326 is one higher than 14,325 because the digit 6 comes one after the digit 5 in order, they occupy the same place in their respective numbers, and the digits in all the other places are the same. Knowing the digits and the place system enables us to instantly have an idea of how 14,326 is similar to 14,325 and how both are larger than 12 and smaller than 1,000,000. We can also understand at a glance that 14,326 is similar to 4,326 in some respects but differs greatly in size.

Language is not the same. If you are learning English and you see the word stage for the first time, you cannot reliably reason about its meaning simply by noting its similarity to stale or stake or state or stave or stade or sage, even though those words differ from stage about as much as 14,326 does from 14,325. Nor can you reliably suppose that a bacterium is larger than an elk because of the number of syllables and characters in the words. Even supposedly reliable rules of language, like adding s to form plurals in English, can lead us badly astray when we infer that the word “princes” refers to less of something than the word “princess.”

In order to use algorithms with language, we must either make language simpler, so that the short mathematical algorithms we have explored so far can reliably work with it, or make our algorithms smarter, so that they can deal with the messy complexity of human language as it has developed naturally. We’ll do the latter.

Space Insertion

Imagine that you are the chief algorithm officer at a large old company that has a warehouse full of handwritten paper records. The chief record digitization officer has been conducting a long-term project of scanning those paper records to image files, and then using text recognition technology to convert the images to text that can be easily stored in the company’s databases. However, some of the handwriting on the records is awful and the text recognition technology is imperfect, so the final digital text that is extracted from a paper record is sometimes incorrect. You’ve been given only the digitized text and you’re asked to find a way to correct the mistakes without referring to the paper originals.

Suppose that you read the first digitized sentence into Python and find that it’s a quote from G. K. Chesterton: “The one perfectly divine thing, the one glimpse of God’s paradise given on earth, is to fight a losing battle—and not lose it.” You take this imperfectly digitized text and store it in a variable called text:

text = "The oneperfectly divine thing, the oneglimpse of God's paradisegiven on earth, is to fight a losingbattle - and notlose it."

You’ll notice that this text is in English, and while the spelling of each word is correct, there are missing spaces throughout: oneperfectly should actually be one perfectly, paradisegiven should be paradise given, and so on. (Missing a space is uncommon for humans, but text recognition technology often makes this kind of mistake.) In order to do your job, you’ll have to insert spaces at the appropriate spots in this text. For a fluent English speaker, this task may not seem difficult to do manually. However, imagine that you need to do it quickly for millions of scanned pages—you will obviously need to write an algorithm that can do it for you.

Defining a Word List and Finding Words

The first thing we will do is teach our algorithm some English words. This isn’t very hard: we can define a list called word_list and populate it with words. Let’s start with just a few words:

word_list = ['The','one','perfectly','divine']

In this chapter, we’ll create and manipulate lists using list comprehensions, which you’ll probably like after you get used to them. The following is a very simple list comprehension that creates a copy of our word_list:

word_list_copy = [word for word in word_list]

You can see that the syntax for word in word_list is very similar to the syntax for a for loop. But we don’t need a colon or extra lines. In this case, the list comprehension is as simple as possible, just specifying that we want each word in word_list to be in our new list, word_list_copy. This may not be so useful, but we can concisely add logic to make it more useful. For example, if we want to find every word in our word list that contains the letter n, all it takes is the simple addition of an if statement:

has_n = [word for word in word_list if 'n' in word]

We can run print(has_n) to see that the result is what we expect:

['one', 'divine']

Later in the chapter, you’ll see more complex list comprehensions, including some that have nested loops. However, all of them follow the same basic pattern: a for loop specifying iteration, with optional if statements describing the logic of what we want to select for our final list output.
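
For example, here's a small nested list comprehension (a toy example, not used later in the chapter) that produces one entry for every occurrence of the letter e in each word of our word_list:

e_letters = [(word, letter) for word in word_list for letter in word if letter == 'e']
print(e_letters)

The outer loop runs over the words and the inner loop runs over each word's letters, so the printed result is [('The', 'e'), ('one', 'e'), ('perfectly', 'e'), ('perfectly', 'e'), ('divine', 'e')], with perfectly appearing twice because it contains two e's.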

We’ll use Python’s re module to access text manipulation tools. One of re’s useful functions is finditer(), which can search our text to find the location of any word in our word_list. We use finditer() in a list comprehension like so:

import re
locs = list(set([(m.start(),m.end()) for word in word_list for m in re.finditer(word, text)]))

That line is a little dense, so take a moment to make sure you understand it. We’re defining a variable called locs, short for “locations”; this variable will contain the locations in the text of every word in our word list. We’ll use a list comprehension to get this list of locations.

The list comprehension takes place inside the square brackets ([]). We use for word in word_list to iterate over every word in our word_list. For each word, we call re.finditer(), which finds the selected word in our text and returns a list of every location where that word occurs. We iterate over these locations, and each individual location is stored in m. When we access m.start() and m.end(), we’ll get the location in the text of the beginning and end of the word, respectively. Notice—and get used to—the order of the for loops, since some people find it the opposite of the order they expected.

The whole list comprehension is enveloped by list(set()). This is a convenient way to get a list that contains only unique values with no duplicates. Our list comprehension alone might have multiple identical elements, but converting it to a set automatically removes duplicates, and then converting it back to a list puts it in the format we want: a list of unique word locations. You can run print(locs) to see the result of the whole operation:

[(17, 23), (7, 16), (0, 3), (35, 38), (4, 7)]

In Python, ordered pairs like these are called tuples, and these tuples show the locations of each word from word_list in our text. For example, when we run text[17:23] (using the numbers from the first tuple in the preceding list), we find that it’s divine. Here, d is the 17th character of our text, i is the 18th character of our text, and so on until e, the final letter of divine, is the 22nd character of our text, so the tuple is rounded off with 23. You can check that the other tuples also refer to the locations of words in our word_list.

Notice that text[4:7] is one, and text[7:16] is perfectly. The end of the word one runs into the beginning of the word perfectly without any intervening space. If we hadn’t noticed that immediately by reading the text, we could have caught it by looking at the tuples (4, 7) and (7, 16) in our locs variable: since 7 is the second element of (4, 7) and also the first element of (7, 16), we know that one word ends in the same index where another word begins. In order to find places where we need to insert spaces, we’ll look for cases like this: where the end of one valid word is at the same place as the beginning of another valid word.
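
As a quick sketch of that idea (this snippet isn't part of the final algorithm), a nested list comprehension over locs can find every place where one matched word ends exactly where another begins:

collisions = [(first, second) for first in locs for second in locs if first[1] == second[0]]
print(collisions)

With the small word_list we've defined so far, the only collision printed is ((4, 7), (7, 16)): the word one running straight into the word perfectly.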

Dealing with Compound Words

Unfortunately, two valid words appearing together without a space is not conclusive evidence that a space is missing. Consider the word butterfly. We know that butter is a valid word and fly is a valid word, but we can’t necessarily conclude that butterfly was written in error, because butterfly is also a valid word. So we need to check not only for valid words that appear together without a space but also for valid words that, when mashed together without a space, do not together form another valid word. This means that in our text, we need to check whether oneperfectly is a word, whether paradisegiven is a word, and so on.

In order to check this, we need to find all the spaces in our text. We can look at all the substrings between two consecutive spaces and call those potential words. If a potential word is not in our word list, then we’ll conclude that it’s invalid. We can check each invalid word to see whether it’s made up of a combination of two smaller words; if it is, we’ll conclude that there’s a missing space and add it back in, right between the two valid words that have combined to form the invalid word.

Checking Between Existing Spaces for Potential Words

We can use re.finditer() again to find all the spaces in our text, which we’ll store in a variable called spacestarts. We’ll also add two more elements to our spacestarts variable: one to represent the location of the beginning of the text and one to represent the location of the end. This ensures that we find every potential word, since words at the very beginning and end will be the only words that are not between spaces. We also add a line that sorts the spacestarts list:

spacestarts = [m.start() for m in re.finditer(' ', text)]
spacestarts.append(-1)
spacestarts.append(len(text))
spacestarts.sort()

The list spacestarts records the locations of the spaces in our text. We got these locations by using a list comprehension and the re.finditer() tool. In this case, re.finditer() finds the location of every space in the text and stores it in a list, which refers to each individual element as m. For each of those m elements, which are spaces, we get the location where the space begins by using the start() function.

We are looking for potential words between those spaces. It will be useful to have another list that records the locations of characters that come just after a space; these will be the locations of the first character of each potential word. We’ll call that list spacestarts_affine, since in technical terms, this new list is an affine transformation of the spacestarts list. Affine describes transformations such as shifting every element by a constant, which is exactly what we do here by adding 1 to each location. We’ll also sort this list:

spacestarts_affine = [ss+1 for ss in spacestarts]
spacestarts_affine.sort()

Next, we can get all the substrings that are between two spaces:

between_spaces = [(spacestarts[k] + 1,spacestarts[k + 1]) for k in range(0,len(spacestarts) - 1 )]

The variable we’re creating here is called between_spaces, and it’s a list of tuples of the form (<location of beginning of substring>, <location of end of substring>), like (17, 23). We get these tuples through a list comprehension that iterates over k, where k takes on each integer value from 0 up to (but not including) len(spacestarts) - 1. For each k, we generate one tuple. The first element of the tuple is spacestarts[k] + 1, one position after the location of the kth space. The second element of the tuple is spacestarts[k + 1], the location of the next space in the text. This way, our final output contains tuples that indicate the beginning and end of each substring between spaces.
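
If the indexing feels abstract, here's the same logic run on a tiny made-up string, purely for illustration, so you can check the arithmetic by hand:

mini_text = "to be or"
mini_spacestarts = [m.start() for m in re.finditer(' ', mini_text)]
mini_spacestarts.append(-1)
mini_spacestarts.append(len(mini_text))
mini_spacestarts.sort()
mini_between = [(mini_spacestarts[k] + 1, mini_spacestarts[k + 1]) for k in range(0, len(mini_spacestarts) - 1)]
print(mini_between)

This prints [(0, 2), (3, 5), (6, 8)], and mini_text[0:2], mini_text[3:5], and mini_text[6:8] are to, be, and or, respectively.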

Now, consider all of the potential words that are between spaces, and find the ones that are not valid (not in our word list):

between_spaces_notvalid = [loc for loc in between_spaces if text[loc[0]:loc[1]] not in word_list]

Looking at between_spaces_notvalid, we can see that it’s a list of the locations of all invalid potential words in our text:

[(4, 16), (24, 30), (31, 34), (35, 45), (46, 48), (49, 54), (55, 68), (69, 71), (72, 78), (79, 81), (82, 84), (85, 90), (91, 92), (93, 105), (106, 107), (108, 111), (112, 119), (120, 123)]

Our code thinks that all these locations refer to invalid words. However, if you look at some of the words referred to here, they look pretty valid. For example, text[108:111] outputs the valid word and. The reason our code thinks that and is an invalid word is that it isn’t in our word list. Of course, we could add it to our word list manually and keep doing that for every word we need our code to recognize. But remember that we want this space insertion algorithm to work for millions of pages of scanned text, and they may contain many thousands of unique words. It would be helpful if we could import a word list that already contained a substantial body of valid English words. Such a collection of words is referred to as a corpus.

Using an Imported Corpus to Check for Valid Words

Luckily, there are existing Python modules that allow us to import a full corpus with just a few lines. First, we need to download the corpus:

import nltk
nltk.download('brown')

We’ve downloaded a corpus called brown from the module called nltk. Next, we’ll import the corpus:

from nltk.corpus import brown
wordlist = set(brown.words())
word_list = list(wordlist)

We have imported the corpus and converted its collection of words into a Python list. Before we use this new word_list, however, we should do some cleanup to remove what it thinks are words but are actually punctuation marks:

word_list = [word.replace('*','') for word in word_list]
word_list = [word.replace('[','') for word in word_list]
word_list = [word.replace(']','') for word in word_list]
word_list = [word.replace('?','') for word in word_list]
word_list = [word.replace('.','') for word in word_list]
word_list = [word.replace('+','') for word in word_list]
word_list = [word.replace('/','') for word in word_list]
word_list = [word.replace(';','') for word in word_list]
word_list = [word.replace(':','') for word in word_list]
word_list = [word.replace(',','') for word in word_list]
word_list = [word.replace(')','') for word in word_list]
word_list = [word.replace('(','') for word in word_list]
word_list = [word for word in word_list if word != '']

These lines use the replace() function to replace each punctuation mark with an empty string, and then a list comprehension to filter out the empty strings left behind. Now that we have a suitable word list, we’ll be able to recognize invalid words more accurately. We can rerun our check for invalid words using our new word_list and get better results:

between_spaces_notvalid = [loc for loc in between_spaces if text[loc[0]:loc[1]] not in word_list]

When we print the list between_spaces_notvalid, we get a shorter and more accurate list:

[(4, 16), (24, 30), (35, 45), (55, 68), (72, 78), (93, 105), (112, 119), (120, 123)]

Now that we have found the invalid potential words in our text, we’ll check in our word list for words that could be combined to form those invalid words. We can begin by looking for words that start just after a space. These words could be the first half of an invalid word:

partial_words = [loc for loc in locs if loc[0] in spacestarts_affine and loc[1] not in spacestarts]

Our list comprehension iterates over every element of our locs variable, which contains the location of every word in the text. It checks whether loc[0], the beginning of the word, is in spacestarts_affine, the list of locations that come just after a space. Then it checks whether loc[1] is not in spacestarts, which checks that the word does not end exactly where a space begins. If a word starts just after a space and doesn’t end at the same place as a space, we put it in our partial_words variable, because this could be a word that needs to have a space inserted after it.

Next, let’s look for words that end exactly where a space begins. These could be the second half of an invalid word. To find them, we make some small changes to the previous logic:

partial_words_end = [loc for loc in locs if loc[0] not in spacestarts_affine and loc[1] in spacestarts]

Now we can start inserting spaces.

Finding First and Second Halves of Potential Words

Let’s start by inserting a space into oneperfectly. We’ll define a variable called loc that stores the location of oneperfectly in our text:

loc = between_spaces_notvalid[0]

We now need to check whether any of the words in partial_words could be the first half of oneperfectly. For a valid word to be the first half of oneperfectly, it would have to have the same beginning location in the text, but not the same ending location, as oneperfectly. We’ll write a list comprehension that finds the ending location of every valid word that begins at the same location as oneperfectly:

endsofbeginnings = [loc2[1] for loc2 in partial_words if loc2[0] == loc[0] and (loc2[1] - loc[0]) > 1]

We’ve specified loc2[0] == loc[0], which says that our valid word must start at the same place as oneperfectly. We’ve also specified (loc2[1]-loc[0])>1, which ensures that the valid word we find is more than one character long. This is not strictly necessary, but it can help us avoid false positives. Think of words like avoid, aside, along, irate, and iconic, in which the first letter could be considered a word on its own but probably shouldn’t be.

Our list endsofbeginnings should include the ending location of every valid word that begins at the same place as oneperfectly. Let’s use a list comprehension to create a similar variable, called beginningsofends, that will find the beginning location of every valid word that ends at the same place as oneperfectly:

beginningsofends = [loc2[0] for loc2 in partial_words_end if loc2[1] == loc[1] and (loc2[1] - loc[0]) > 1]

We’ve specified loc2[1] == loc[1], which says that our valid word must end at the same place as oneperfectly. We’ve also specified (loc2[1]-loc[0])>1, which ensures that the valid word we find is more than one character long, just as we did before.

We’re almost home; we just need to find whether any locations are contained in both endsofbeginnings and beginningsofends. If there are, that means that our invalid word is indeed a combination of two valid words without a space. We can use the intersection() function to find all elements that are shared by both lists:

pivot = list(set(endsofbeginnings).intersection(beginningsofends))

We use the list(set()) syntax again; just like before, it’s to make sure that our list contains only unique values, with no duplicates. We call the result pivot. It’s possible that pivot will contain more than one element. This would mean that there are more than two possible combinations of valid words that could compose our invalid word. If this happens, we’ll have to decide which combination is the one the original writer intended. This cannot be done with certainty. For example, consider the invalid word choosespain. It’s possible that this invalid word is from a travel brochure for Iberia (“Choose Spain!”), but it’s also possible that it’s from a description of a masochist (“chooses pain”). Because of the huge quantity of words in our language and the numerous ways they can be combined, sometimes we can’t be certain which is right. A more sophisticated approach would take into account context—whether other words around choosespain tend to be about olives and bullfighting or about whips and superfluous dentist appointments. Such an approach would be difficult to do well and impossible to do perfectly, illustrating again the difficulty of language algorithms in general. In our case, we’ll take the smallest element of pivot, not because this is certainly the correct one, but just because we have to take one:

import numpy as np
pivot = np.min(pivot)

Finally, we can write one line that replaces our invalid word with the two valid component words plus a space:

textnew = text
textnew = textnew.replace(text[loc[0]:loc[1]],text[loc[0]:pivot]+' '+text[pivot:loc[1]])

If we print this new text, we can see that it has correctly inserted a space into the misspelling oneperfectly, though it hasn’t yet inserted spaces in the rest of the misspellings.

The one perfectly divine thing, the oneglimpse of God's paradisegiven on earth, is to fight a losingbattle - and notlose it.

We can put all this together into one beautiful function, shown in Listing 8-1. This function will use a for loop to insert spaces into every instance of two valid words running together to become an invalid word.

def insertspaces(text,word_list):
    locs = list(set([(m.start(),m.end()) for word in word_list for m in re.finditer(word, text)]))
    spacestarts = [m.start() for m in re.finditer(' ', text)]
    spacestarts.append(-1)
    spacestarts.append(len(text))
    spacestarts.sort()
    spacestarts_affine = [ss + 1 for ss in spacestarts]
    spacestarts_affine.sort()
    partial_words = [loc for loc in locs if loc[0] in spacestarts_affine and loc[1] not in spacestarts]
    partial_words_end = [loc for loc in locs if loc[0] not in spacestarts_affine and loc[1] in spacestarts]
    between_spaces = [(spacestarts[k] + 1,spacestarts[k+1]) for k in range(0,len(spacestarts) - 1)]
    between_spaces_notvalid = [loc for loc in between_spaces if text[loc[0]:loc[1]] not in word_list]
    textnew = text
    for loc in between_spaces_notvalid:
        endsofbeginnings = [loc2[1] for loc2 in partial_words if loc2[0] == loc[0] and (loc2[1] - loc[0]) > 1]
        beginningsofends = [loc2[0] for loc2 in partial_words_end if loc2[1] == loc[1] and (loc2[1] - loc[0]) > 1]
        pivot = list(set(endsofbeginnings).intersection(beginningsofends))
        if(len(pivot) > 0):
            pivot = np.min(pivot)
            textnew = textnew.replace(text[loc[0]:loc[1]],text[loc[0]:pivot]+' '+text[pivot:loc[1]])
    textnew = textnew.replace('  ',' ')
    return(textnew)

Listing 8-1: A function that inserts spaces into texts, combining much of the code in the chapter so far

Then we can define any text and call our function as follows:

text = "The oneperfectly divine thing, the oneglimpse of God's paradisegiven on earth, is to fight a losingbattle - and notlose it."
print(insertspaces(text,word_list))

We see the output just as we expect, with spaces inserted perfectly:

The one perfectly divine thing, the one glimpse of God's paradise given on earth, is to fight a losing battle - and not lose it.

We’ve created an algorithm that can correctly insert spaces into English text. One thing to consider is whether you can do the same for other languages. You can: as long as you read in a good, appropriate corpus to define word_list for the language you’re working with, the function we defined and called in this example can correctly insert spaces into text in that language. It can even correct a text in a language you’ve never studied or even heard of. Try different corpuses, different languages, and different texts to see what kind of results you can get, and you’ll get a glimpse of the power of language algorithms.
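
For example, here is one way you might try it in Spanish. This is only a sketch: it assumes you download nltk's cess_esp corpus (a Spanish treebank distributed with nltk), the example sentence is made up, and the quality of the result will depend on how well the corpus covers the vocabulary of your text:

import nltk
nltk.download('cess_esp')
from nltk.corpus import cess_esp
spanish_word_list = list(set(cess_esp.words()))
spanish_text = "El primer ministro anuncia que hoy firmaremosla nueva ley"
print(insertspaces(spanish_text, spanish_word_list))

As with the brown corpus, you may want to strip punctuation-only tokens out of spanish_word_list before calling the function.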

Phrase Completion

Imagine that you are doing algorithm consulting work for a startup that is trying to add features to a search engine they are building. They want to add phrase completion so that they can provide search suggestions to users. For example, when a user types in peanut butter and, a search suggestion feature might suggest adding the word jelly. When a user types in squash, the search engine could suggest both court and soup.

Building this feature is simple. We’ll start with a corpus, just like we did with our space checker. In this case, we’re interested not only in the individual words of our corpus but also in how the words fit together, so we’ll compile lists of n-grams from our corpus. An n-gram is simply a collection of n words that appear together. For example, the phrase “Reality is not always probable, or likely” is made up of seven words once spoken by the great Jorge Luis Borges. A 1-gram is an individual word, so the 1-grams of this phrase are reality, is, not, always, probable, or, and likely. The 2-grams are every string of two words that appear together, including reality is, is not, not always, always probable, and so on. The 3-grams are reality is not, is not always, and so on.
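
Before reaching for any library, you can get a feel for n-grams with nothing but Python's built-in zip() function; this toy snippet (not part of the search suggestion code we'll build) pulls the 2-grams and 3-grams out of the Borges phrase:

phrase = "Reality is not always probable, or likely"
words = phrase.split(' ')
two_grams = list(zip(words, words[1:]))
three_grams = list(zip(words, words[1:], words[2:]))
print(two_grams[0:2])
print(three_grams[0])

The first print statement shows [('Reality', 'is'), ('is', 'not')], and the second shows ('Reality', 'is', 'not').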

Tokenizing and Getting N-grams

We’ll use a Python module called nltk to make n-gram collection easy. We’ll first tokenize our text. Tokenizing simply means splitting a string into its component words (with punctuation marks treated as separate tokens). For example:

from nltk.tokenize import sent_tokenize, word_tokenize
text = "Time forks perpetually toward innumerable futures"
print(word_tokenize(text))

The result we see is this:

['Time', 'forks', 'perpetually', 'toward', 'innumerable', 'futures']

We can tokenize and get the n-grams from our text as follows:

import nltk
from nltk.util import ngrams
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
fourgrams = ngrams(token,4)
fivegrams = ngrams(token,5)

Alternatively, we can put all the n-grams in a list called grams:

grams = [ngrams(token,2),ngrams(token,3),ngrams(token,4),ngrams(token,5)]

In this case, we have gotten a tokenization and a list of n-grams for a short one-sentence text. However, in order to have an all-purpose phrase completion tool, we’ll need a considerably larger corpus. The brown corpus won’t work the way we used it for space insertion, because there we reduced it to a set of individual words, and to collect n-grams we need running text in which the words appear in order.

One corpus we could use is a collection of literary texts made available online by Google’s Peter Norvig at http://norvig.com/big.txt. For the examples in this chapter, I downloaded a file of Shakespeare’s complete works, available for free online at http://www.gutenberg.org/files/100/100-0.txt, and then removed the Project Gutenberg boilerplate text at the top. You could also use the complete works of Mark Twain, available at http://www.gutenberg.org/cache/epub/3200/pg3200.txt. Read a corpus into Python as follows:

import requests
file = requests.get('http://www.bradfordtuckfield.com/shakespeare.txt')
file = file.text
text = file.replace('\n', '')

Here, we used the requests module to directly read a text file containing the collected works of Shakespeare from a website where it’s being hosted, and then read it into our Python session in a variable called text.

After reading in your chosen corpus, rerun the code that created the grams variable. Here it is with the new definition of the text variable:

token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
fourgrams = ngrams(token,4)
fivegrams = ngrams(token,5)
grams = [ngrams(token,2),ngrams(token,3),ngrams(token,4),ngrams(token,5)]

Our Strategy

Our strategy for generating search suggestions is simple. When a user types in a search, we check how many words are in their search. In other words, a user enters an n-gram and we determine what n is. When a user searches for an n-gram, we are helping them add to their search, so we will want to suggest an n + 1-gram. We’ll search our corpus and find all n + 1-grams whose first n elements match our n-gram. For example, a user might search for crane, a 1-gram, and our corpus might contain the 2-grams crane feather, crane operator, and crane neck. Each is a potential search suggestion we could offer.

We could stop there, providing every n + 1-gram whose first n elements matched the n + 1-gram the user had entered. However, not all suggestions are equally good. For example, if we are working for a custom engine that searches through manuals for industrial construction equipment, it’s likely that crane operator will be a more relevant, useful suggestion than crane feather. The simplest way to determine which n + 1-gram is the best suggestion is to offer the one that appears most often in our corpus.

Thus, our full algorithm: a user searches for an n-gram, we find all n + 1-grams whose first n elements match the user’s n-gram, and we recommend the matching n + 1-gram that appears most frequently in the corpus.
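
Here is the whole strategy in miniature, run on a made-up toy corpus (the corpus string and the variable names are invented just for this illustration):

toy_corpus = "crane operator lifts the crane operator sees a crane feather"
toy_tokens = toy_corpus.split(' ')
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
candidates = [gram for gram in toy_bigrams if gram[0] == 'crane']
best = max(set(candidates), key=candidates.count)
print(' '.join(best))

The user's 1-gram is crane; the candidate 2-grams are crane operator (which appears twice) and crane feather (which appears once), so the printed suggestion is crane operator. The rest of this section builds the same idea at full scale, with nltk doing the tokenizing and a real corpus supplying the n-grams.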

Finding Candidate n + 1-grams

In order to find the n + 1-grams that will constitute our search suggestions, we need to know how long the user’s search term is. Suppose the search term is life is a, meaning that we’re looking for suggestions for how to complete the phrase “life is a . . .”. We can use the following simple lines to get the length of our search term:

from nltk.tokenize import sent_tokenize, word_tokenize
search_term = 'life is a'
split_term = tuple(search_term.split(' '))
search_term_length = len(search_term.split(' '))

Now that we know the length of the search term, we know n—it’s 3. Remember that we’ll be returning the most frequent n + 1-grams (4-grams) to the user. So we need to take into account the different frequencies of different n + 1-grams. We’ll use a function called Counter(), which will count the number of occurrences of each n + 1-gram in our collection.

from collections import Counter
counted_grams = Counter(grams[search_term_length - 1])

This line has selected only the n + 1-grams (here, the 4-grams) from our grams variable. Applying the Counter() function creates a dictionary-like object that maps each n + 1-gram to its frequency; calling its items() method gives us a list of tuples. Each tuple has an n + 1-gram as its first element and the frequency of that n + 1-gram in our corpus as its second element. For example, we can print the first element of counted_grams:

print(list(counted_grams.items())[0])

The output shows us the first n + 1-gram in our corpus and tells us that it appears only once in the entire corpus:

(('From', 'fairest', 'creatures', 'we'), 1)

This n-gram is the beginning of Shakespeare’s Sonnet 1. It’s fun to look at some of the interesting 4-grams we can randomly find in Shakespeare’s works. For example, if you run print(list(counted_grams)[10]), you can see that the 10th 4-gram in Shakespeare’s works is “rose might never die.” If you run print(list(counted_grams)[240000]), you can see that the 240,000th n-gram is “I shall command all.” The 323,002nd is “far more glorious star” and the 328,004th is “crack my arms asunder.” But we want to do phrase completion, not just n + 1-gram browsing. We need to find the subset of n + 1-grams whose first n elements match our search term. We can do that as follows:

matching_terms = [element for element in list(counted_grams.items()) if element[0][:-1] == tuple(split_term)]

This list comprehension iterates over every n + 1-gram, referring to each one as element as it goes. For each element, it checks whether element[0][:-1] == tuple(split_term). The left side of this equality, element[0][:-1], simply takes the first n elements of each n + 1-gram: the [:-1] is a handy way to disregard the last element of a sequence. The right side of the equality, tuple(split_term), is the n-gram we’re searching for (“life is a”). So we’re checking for n + 1-grams whose first n elements are the same as our n-gram of interest. Whichever terms match are stored in our final output, called matching_terms.
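
To see the [:-1] trick in isolation, here's the same matching logic run on a tiny hand-built Counter (the 4-grams here are invented for the example):

from collections import Counter
toy_counted = Counter([('life', 'is', 'a', 'dream'), ('life', 'is', 'a', 'dream'), ('life', 'is', 'a', 'walking'), ('death', 'is', 'a', 'fearful')])
toy_matches = [element for element in list(toy_counted.items()) if element[0][:-1] == ('life', 'is', 'a')]
print(toy_matches)

This prints [(('life', 'is', 'a', 'dream'), 2), (('life', 'is', 'a', 'walking'), 1)]; the 4-gram starting with death is filtered out because its first three elements don't match the search term.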

Selecting a Phrase Based on Frequency

Our matching_terms list has everything we need to finish the job; it consists of n + 1-grams whose first n elements match the search term, and it includes their frequencies in our corpus. As long as there is at least one element in the matching terms list, we can find the element that occurs most frequently in the corpus and suggest it to the user as the completed phrase. The following snippet gets the job done:

if(len(matching_terms)>0):
    frequencies = [item[1] for item in matching_terms]
    maximum_frequency = np.max(frequencies)
    highest_frequency_term = [item[0] for item in matching_terms if item[1] == maximum_frequency][0]
    combined_term = ' '.join(highest_frequency_term)

In this snippet, we started by defining frequencies, a list containing the frequency of every n + 1-gram in our corpus that matches the search term. Then, we used the numpy module’s max() function to find the highest of those frequencies. We used another list comprehension to get the first n + 1-gram that occurs with the highest frequency in the corpus, and finally we created a combined_term, a string that puts together all of the words in that search term, with spaces separating the words.
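
As a side note, if you prefer, the first three lines inside the if block can be collapsed into a single call to Python's built-in max() with a key function; the behavior is the same, including the tie-breaking (the first of the most frequent matches wins):

if(len(matching_terms) > 0):
    highest_frequency_term = max(matching_terms, key=lambda item: item[1])[0]
    combined_term = ' '.join(highest_frequency_term)

Listing 8-2 keeps the longer version so that each step stays visible.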

Finally, we can put all of our code together in a function, shown in Listing 8-2.

def search_suggestion(search_term, text):
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token,2)
    trigrams = ngrams(token,3)
    fourgrams = ngrams(token,4)
    fivegrams = ngrams(token,5)
    grams = [ngrams(token,2),ngrams(token,3),ngrams(token,4),ngrams(token,5)]
    split_term = tuple(search_term.split(' '))
    search_term_length = len(search_term.split(' '))
    counted_grams = Counter(grams[search_term_length-1])
    combined_term = 'No suggested searches'    
    matching_terms = [element for element in list(counted_grams.items()) if element[0][:-1] == tuple(split_term)]
    if(len(matching_terms) > 0):
        frequencies = [item[1] for item in matching_terms]
        maximum_frequency = np.max(frequencies)
        highest_frequency_term = [item[0] for item in matching_terms if item[1] == maximum_frequency][0]
        combined_term = ' '.join(highest_frequency_term)
    return(combined_term)

Listing 8-2: A function that provides search suggestions by taking an n-gram and returning the most likely n + 1-gram that starts with the input n-gram

When we call our function, we pass an n-gram as the argument, and the function returns an n + 1-gram. We call it as follows:

file = requests.get('http://www.bradfordtuckfield.com/shakespeare.txt')
file = file.text
text = file.replace('\n', '')
print(search_suggestion('life is a', text))

And you can see that the suggestion is life is a tedious, which is the most common 4-gram that Shakespeare used that started with the words life is a (tied with two other 4-grams). Shakespeare used this 4-gram only once, in Cymbeline, when Imogen says, “I see a man’s life is a tedious one.” In King Lear, Edgar tells Gloucester “Thy life is a miracle” (or “Thy life’s a miracle,” depending on which text you use), so that 4-gram would also be a valid completion of our phrase.

We can have some fun by trying a different corpus and seeing how the results differ. Let’s use the corpus of Mark Twain’s collected works:

file = requests.get('http://www.bradfordtuckfield.com/marktwain.txt')
file = file.text
text = file.replace('\n', '')

With this new corpus, we can check for search suggestions again:

print(search_suggestion('life is a',text))

In this case, the completed phrase is life is a failure, indicating a difference between the two text corpuses, and maybe also a difference between the style and attitude of Shakespeare and those of Mark Twain. You can also try other search terms. For example, I love is completed by you if we use Mark Twain’s corpus, and thee if we use Shakespeare’s corpus, showing a difference in style across the centuries and ocean, if not a difference in ideas. Try another corpus and some other phrases and see how your phrases get completed. If you use a corpus written in another language, you can do phrase completion for languages you don’t even speak using the exact function we just wrote.
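
As one concrete starting point, you could feed in the collection of literary texts mentioned earlier, hosted by Peter Norvig; this sketch reuses the requests import and the search_suggestion() function from above, and the suggestions you get will simply reflect whatever happens to be in that file:

file = requests.get('http://norvig.com/big.txt')
text = file.text.replace('\n', ' ')
print(search_suggestion('I love', text))
print(search_suggestion('life is a', text))

Here the newlines are replaced with spaces rather than removed, so that words on different lines don't run together.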

Summary

In this chapter, we discussed algorithms that can be used to work with human language. We started with a space insertion algorithm that can correct incorrectly scanned texts, and we continued with a phrase completion algorithm that can add words to input phrases to match the content and style of a text corpus. The approaches we took to these algorithms are similar to the approaches that work for other types of language algorithms, including spell checkers and intent parsers.

In the next chapter, we’ll explore machine learning, a powerful and growing field that every good algorithm-smith should be familiar with. We’ll focus on a machine learning algorithm called decision trees, which are simple, flexible, accurate, and interpretable models that can take you far on your journey through algorithms and life.
