11
Forging Ahead

You’ve made it through the dark forest of searching and sorting, across the frozen river of esoteric mathematics, over the treacherous mountain passes of gradient ascent, past the swamp of geometric despair, and you’ve conquered the dragon of slow runtimes. Congratulations. If you wish, you’re free to return to your comfortable home in a land free from algorithms. This chapter is for those who instead wish to continue the adventure after they close this book.

No single book can contain everything about algorithms. There is too much to know, and more is being discovered all the time. This chapter is about three things: doing more with algorithms, using them in better and faster ways, and solving their deepest mysteries.

In this chapter, we’ll build a simple chatbot that can talk to us about previous chapters of the book. Then we’ll discuss some of the hardest problems in the world and how we might make progress toward crafting algorithms to solve them. We’ll conclude by discussing some of the deepest mysteries of the world of algorithms, including detailed instructions on how to win a million dollars with advanced algorithmic theory.

Doing More with Algorithms

The 10 previous chapters of this book covered algorithms that can perform a variety of tasks in many fields. But algorithms can do even more than we’ve seen here. If you wish to continue your adventure with algorithms, you should explore other fields and the important algorithms associated with them.

For example, the many algorithms for information compression can store a long book in a coded form that is only a fraction of the size of the original, and they can compress a complex photograph or film file into a manageable size with either minimal or no loss of quality.
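As a tiny taste of that field (this example is not from the chapters of this book), Python's built-in zlib module can losslessly shrink repetitive text to a small fraction of its original size:

import zlib

text = ('the quick brown fox jumps over the lazy dog. ' * 200).encode('utf-8')
compressed = zlib.compress(text, 9)      # DEFLATE compression at the highest compression level
restored = zlib.decompress(compressed)
print(len(text), len(compressed))        # the compressed form is far smaller than the original
print(restored == text)                  # True: nothing was lost in the round trip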

Our ability to communicate securely online, including confidently passing our credit card information to third parties, relies on cryptographic algorithms. Cryptography is great fun to study because it comes with a thrilling history of adventurers, spies, betrayals, and triumphant nerds who broke codes to win wars.

Recently, innovative algorithms have been developed to perform parallel distributed computing. Instead of performing one operation at a time, millions of times, distributed computing algorithms split up a dataset into many little parts and then send them to different computers, which perform the needed operation simultaneously and return the results, to be recompiled and presented as the final output. By working on all parts of the data concurrently instead of consecutively, parallel computing saves a huge amount of time. This is extremely useful for applications in machine learning, where there’s a need to process datasets that are extremely large or to perform a large number of simple computations simultaneously.
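As a rough single-machine sketch of that split-and-recombine pattern (the function and data here are invented purely for illustration), Python's standard multiprocessing module can farm pieces of a dataset out to worker processes:

from multiprocessing import Pool

def total_of_squares(chunk):
    # the simple operation applied to one small piece of the dataset
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    data = list(range(1000000))
    # split the dataset into little parts...
    chunks = [data[i:i + 100000] for i in range(0, len(data), 100000)]
    with Pool() as pool:
        # ...process every part concurrently, then recombine the partial results
        partial_results = pool.map(total_of_squares, chunks)
    print(sum(partial_results))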

For decades, people have been excited about the potential of quantum computing. Quantum computers, if we can engineer them to work properly, have the potential to perform extremely difficult calculations (including the calculations needed to break state-of-the-art cryptography) in a tiny fraction of the time required on today’s nonquantum supercomputers. Since quantum computers are built with different architecture than standard computers, it’s possible to design new algorithms that take advantage of their different physical properties to perform tasks with extra speed. For now, this is more or less only an academic concern, since quantum computers are not yet in a state where they are used for practical purposes. But if the technology ever matures, quantum algorithms could become extremely important.

When you learn about algorithms in these or many other fields, you will not be starting from scratch. By mastering the algorithms of this book, you’ve come to grasp what they are, how they tend to function, and how to write code for them. Learning your first algorithm may have felt quite difficult, but learning your 50th or 200th will be much easier, since your brain will be used to the general patterns of how they are constructed and how to think about them.

To prove that you can now understand and code algorithms, we’ll explore a few algorithms that work together to provide the functionality of a chatbot. If you can pick up how they work and how to write code for them in the short introduction provided here, then you’re on your way to being able to pick up how any algorithm works in any field.

Building a Chatbot

Let’s build a simple chatbot that can answer questions about the table of contents of this book. We’ll start by importing modules that will be important later:

import pandas as pd                                             # imported alongside the rest; the snippets below don't use it directly
from sklearn.feature_extraction.text import TfidfVectorizer    # TFIDF vectorization
from scipy import spatial                                       # cosine distances between vectors
import numpy as np                                              # argmax over similarity scores
import nltk, string                                             # tokenization and stemming, plus punctuation constants
# note: nltk.word_tokenize() may require a one-time nltk.download('punkt') to fetch its tokenizer data

The next step we’ll take to create our chatbot is text normalization, the process of converting natural language text to standardized substrings; it enables easy comparison between superficially different texts. We want our bot to understand that America and america refer to the same thing, that regeneration expresses the same idea as regenerate (albeit a different part of speech), that centuries is the plural of century, and that hello; is not essentially different from hello. We want our chatbot to treat words that come from the same root in the same way, unless there is some reason not to.

Say we have the following query:

query = 'I want to learn about geometry algorithms.'

The first thing we can do is convert all characters to lowercase. Python’s built-in lower() method accomplishes this:

print(query.lower())

This outputs i want to learn about geometry algorithms.. Another thing we can do is remove punctuation. To do that, first we’ll create a Python object called a dictionary:

remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

This snippet creates a dictionary that maps every standard punctuation mark to the Python object None, and it stores the dictionary in a variable called remove_punctuation_map. We then use this dictionary to remove punctuation like so:

print(query.lower().translate(remove_punctuation_map))

Here, we’ve used the translate() method to take all the punctuation marks we find in the query and replace them with nothing—or in other words, remove the punctuation marks. The output we get is the same as we saw before—i want to learn about geometry algorithms—but without the period at the end. Next, we can perform tokenization, which converts a text string to a list of coherent substrings:

print(nltk.word_tokenize(query.lower().translate(remove_punctuation_map)))

We used nltk’s word_tokenize() function to accomplish this, yielding this output: ['i', 'want', 'to', 'learn', 'about', 'geometry', 'algorithms'].

Now we can do what’s called stemming. In English, we use the words jump, jumps, jumping, jumped, and other derived forms that are all different but share a stem: the verb jump. We don’t want our chatbot to be distracted by small differences in word derivation; we want to consider a sentence about jumping to be comparable to a sentence about a jumper, even though they are technically different words. Stemming removes the ends of derived words to convert them into standardized word stems. A function for stemming is available in Python’s nltk module, and we can use this function with a list comprehension as follows:

stemmer = nltk.stem.porter.PorterStemmer()
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

In this snippet, we’ve created a function called stem_tokens(). It takes a list of tokens and calls nltk’s stemmer.stem() function to turn them into stems:

print(stem_tokens(nltk.word_tokenize(query.lower().translate(remove_punctuation_map))))

The output is ['i', 'want', 'to', 'learn', 'about', 'geometri', 'algorithm']. Our stemmer has converted algorithms to algorithm and geometry to geometri. It has replaced a word with what it regards as its stem: a singular word or word portion that will make text comparisons easier. Finally, we put our normalization steps together in one function, normalize():

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

Text Vectorization

Now you’re ready to learn how to convert texts to numeric vectors. It’s easier to make quantitative comparisons between numbers and vectors than between words, and we’ll need to make quantitative comparisons to make our chatbot work.

We’ll use a simple method called TFIDF, or term frequency-inverse document frequency, which converts documents into numeric vectors. Each document vector has one element for each term in a corpus. Each element is the product of the term frequency for a given term (a raw count of the number of times the term occurs in a particular document) and the inverse document frequency for that term (the natural logarithm of the reciprocal of the proportion of documents in which the term appears).

For example, imagine that we are creating TFIDF vectors for biographies of US presidents. In the context of creating TFIDF vectors, we’ll refer to each biography as a document. In the biography of Abraham Lincoln, the word representative will probably appear at least once, since he served in the Illinois House of Representatives and the US House of Representatives. If representative appears three times in the biography, then we say its term frequency is 3. More than a dozen presidents have served in the US House of Representatives, so maybe about 20 out of 44 total presidential biographies contain the term representative. We can then calculate the inverse document frequency as:

log(44 / 20) = log(2.2) ≈ 0.788

The final value we’re looking for is the term frequency times the inverse document frequency: 3 × 0.788 = 2.365. Now consider the term Gettysburg. It may appear twice in Lincoln’s biography but never in any other, so the term frequency will be 2 and the inverse document frequency will be the following:

log(44 / 1) = log(44) ≈ 3.784

The vector element associated with Gettysburg will be the term frequency times the inverse document frequency, which is 2 × 3.784 = 7.568. The TFIDF value for each term should reflect its importance in a document. Soon, this will be important for our chatbot’s ability to determine user intent.
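If you’d like to check this arithmetic, a few lines of Python reproduce it (using the natural logarithm; note that scikit-learn’s own TFIDF formula, which we use below, is slightly smoothed, so its exact values will differ):

import math

documents = 44                          # total number of presidential biographies

# 'representative': term frequency 3, appearing in about 20 of the 44 documents
print(3 * math.log(documents / 20))     # about 2.365

# 'Gettysburg': term frequency 2, appearing in only 1 document
print(2 * math.log(documents / 1))      # about 7.568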

We don’t have to calculate TFIDF manually. We can use a function from the scikit-learn module:

vctrz = TfidfVectorizer(ngram_range = (1, 1),tokenizer = normalize, stop_words = 'english')

This line creates a TfidfVectorizer object, which is capable of creating TFIDF vectors from sets of documents. To create the vectorizer, we have to specify an ngram_range. This tells the vectorizer what to treat as a term. We specified (1, 1), meaning that our vectorizer will treat only 1-grams (individual words) as terms. If we had specified (1, 3), it would treat 1-grams (single words), 2-grams (two-word phrases), and 3-grams (three-word phrases) as terms and create a TFIDF element for each of them. We also specified a tokenizer, passing the normalize() function we created before. Finally, we have to specify stop_words, the words that we want to filter out because they’re not informative. In English, stop words include the, and, of, and other extremely common words. By specifying stop_words = 'english', we’re telling our vectorizer to filter out the built-in set of English stop words and vectorize only less common, more informative words.

Now, let’s configure what our chatbot will be able to talk about. Here, it will be able to talk about the chapters of this book, so we’ll create a list that contains very simple descriptions of each chapter. In this context, each string will be one of our documents.

alldocuments = ['Chapter 1. The algorithmic approach to problem solving, including Galileo and baseball.',
            'Chapter 2. Algorithms in history, including magic squares, Russian peasant multiplication, and Egyptian methods.',
            'Chapter 3. Optimization, including maximization, minimization, and the gradient ascent algorithm.',
            'Chapter 4. Sorting and searching, including merge sort, and algorithm runtime.',
            'Chapter 5. Pure math, including algorithms for continued fractions and random numbers and other mathematical ideas.',
            'Chapter 6. More advanced optimization, including simulated annealing and how to use it to solve the traveling salesman problem.',
            'Chapter 7. Geometry, the postmaster problem, and Voronoi triangulations.',
            'Chapter 8. Language, including how to insert spaces and predict phrase completions.',
            'Chapter 9. Machine learning, focused on decision trees and how to predict happiness and heart attacks.',
            'Chapter 10. Artificial intelligence, and using the minimax algorithm to win at dots and boxes.',
            'Chapter 11. Where to go and what to study next, and how to build a chatbot.']

We’ll continue by fitting our TFIDF vectorizer to these chapter descriptions, which will do the document processing to get us ready to create TFIDF vectors whenever we wish. We don’t have to do this manually, since there’s a fit() method defined in the scikit-learn module:

vctrz.fit(alldocuments)

Now, we’ll create TFIDF vectors for our chapter descriptions and for a new query asking for a chapter about sorting and searching:

query = 'I want to read about how to search for items.'
tfidf_reports = vctrz.transform(alldocuments).todense()
tfidf_question = vctrz.transform([query]).todense()

Our new query is a natural English-language text about searching. The next two lines use the vectorizer’s transform() method and the todense() method to create dense TFIDF vectors for the chapter descriptions and the query.

Now we have converted our chapter descriptions and query into numeric TFIDF vectors. Our simple chatbot will work by comparing the query TFIDF vector to the chapter description TFIDF vectors, concluding that the chapter the user is looking for is the one whose description vector most closely matches the query vector.

Vector Similarity

We’ll decide whether any two vectors are similar with a method called cosine similarity. If you’ve studied a lot of geometry, you’ll know that for any two numeric vectors, we can calculate the angle between them. The rules of geometry enable us to calculate angles between vectors not only in two and three dimensions, but also in four, five, or any number of dimensions. If the vectors are very similar to each other, the angle between them will be quite small. If the vectors are very different, the angle will be large. It’s strange to think that we can compare English language texts by finding the “angle” between them, but this is precisely why we created our numeric TFIDF vectors—so that we can use numeric tools like angle comparison for data that doesn’t start out numeric.
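To make this concrete, the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The following short illustration (separate from our chatbot code, with made-up vectors) computes it directly with numpy:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # points in exactly the same direction as a
c = np.array([3.0, 0.0, -1.0])     # points in a very different direction

def cosine_of_angle(u, v):
    # dot product divided by the product of the vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_of_angle(a, b))       # 1.0: the angle between them is 0
print(cosine_of_angle(a, c))       # 0.0: the vectors are perpendicular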

In practice, it’s easier to calculate the cosine of the angle between two vectors than it is to calculate the angle itself. This is not a problem, since we can conclude that if the cosine of the angle between two vectors is large, then the angle itself is small, and vice versa. In Python, the scipy module contains a submodule called spatial, which contains a function for calculating the cosine distance between vectors; the cosine distance is 1 minus the cosine of the angle, so subtracting it from 1 recovers the cosine itself. We can use the functionality in spatial to calculate the cosine between each chapter description vector and the query vector using a list comprehension:

row_similarities = [1 - spatial.distance.cosine(tfidf_reports[x], tfidf_question) for x in range(len(tfidf_reports))]

When we print out the row_similarities variable, we see the following vector:

[0.0, 0.0, 0.0, 0.3393118510377361, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In this case, only the fourth element is greater than zero, meaning that only the fourth chapter description vector has any angular proximity to our query vector. In general, we can automatically find which row has the highest cosine similarity:

print(alldocuments[np.argmax(row_similarities)])

This gives us the chapter the chatbot thinks we’re looking for:

Chapter 4. Sorting and searching, including merge sort, and algorithm runtime.

Listing 11-1 puts the chatbot’s simple functionality into a function.

def chatbot(query,allreports):
    clf = TfidfVectorizer(ngram_range = (1, 1),tokenizer = normalize, stop_words = 'english')
    clf.fit(allreports)
    tfidf_reports = clf.transform(allreports).todense()
    tfidf_question = clf.transform([query]).todense()
    row_similarities = [1 - spatial.distance.cosine(tfidf_reports[x], tfidf_question) for x in range(len(tfidf_reports))]
    return(allreports[np.argmax(row_similarities)])

Listing 11-1: A simple chatbot function that takes a query and returns the document that’s most similar to it

Listing 11-1 does not contain anything new; all of it is code that we’ve seen before. Now we can call the chatbot with a query about where to find something:

print(chatbot('Please tell me which chapter I can go to if I want to read about mathematics algorithms.',alldocuments))

The output will tell us to go to Chapter 5:

Chapter 5. Pure math, including algorithms for continued fractions and random numbers and other mathematical ideas.

Now that you’ve seen how the whole chatbot works, you can understand why we needed to do the normalization and vectorization. By normalizing and stemming words, we can make sure that the term mathematics will prompt the bot to return the Chapter 5 description, even though that exact word does not appear in it. By vectorizing, we enable the cosine similarity metric that tells us which chapter description is the best match.

We’ve completed our chatbot, which required stitching together a few different smallish algorithms (algorithms for normalizing, stemming, and numerically vectorizing text; an algorithm for calculating cosines of angles between vectors; and the overarching algorithm of providing chatbot answers based on query/document vector similarity). You may have noticed that we didn’t manually do many of the calculations—the actual calculation of TFIDF or cosines was done by modules that we imported. In practice, you often don’t need to truly understand the guts of an algorithm in order to import it and use it in your programs. This can be a blessing, in that it can accelerate our work and put amazingly sophisticated tools at our command when we need them. It can also be a curse because it causes people to misuse algorithms they don’t understand; for example, an article in Wired magazine claimed that the misapplication of a particular financial algorithm (a method to use Gaussian copula functions to predict risks) was responsible for “kill[ing] Wall Street” and “swallow[ing] up trillions of dollars” and was a major cause of the Great Recession (https://www.wired.com/2009/02/wp-quant/). I encourage you to study the deep theory of algorithms even when the ease of importing a Python module makes such study seem unnecessary; it can always make you a better academic or practitioner.

This is perhaps the simplest possible chatbot, and it answers only questions related to chapters in this book. You could add so many enhancements to improve it: make the chapter descriptions more specific and thus more likely to match a broad range of queries; find a vectorization method that performs better than TFIDF; add more documents so that it could answer more queries. But although our chatbot is not the most advanced, we can be proud of it because it’s ours and because we built it ourselves. If you can comfortably build a chatbot, you can consider yourself a competent designer and implementer of algorithms—congratulations for this culminating achievement in your journey through this book.

Becoming Better and Faster

You can do more with algorithms than you could when you started the book. But every serious adventurer will also want to be able to do things better and faster.

Many things can make you better at designing and implementing algorithms. Think about how each algorithm we implemented in this book relied on some understanding of a non-algorithmic topic. Our baseball-catching algorithm relies on an understanding of physics and even a little psychology. Russian peasant multiplication relies on an understanding of exponents and on deep properties of arithmetic, including binary notation. Chapter 7’s geometry algorithms rely on insights into how points, lines, and triangles relate and fit together. The deeper your understanding of the field you’re trying to write algorithms for, the easier it will be for you to design and implement algorithms. Thus, the way to get better at algorithms is easy: just understand everything perfectly.

Another natural next step for a budding algorithmic adventurer is to polish and repolish your raw programming skills. Remember that Chapter 8 introduced list comprehensions as a Pythonic tool that enables us to write language algorithms that are concise and perform well. As you learn more programming languages and master their features, you’ll be able to write code that’s better organized, more compact, and more powerful. Even skilled programmers can benefit from going back to the basics and mastering fundamentals until they’re second nature. Many talented programmers write disorganized, badly documented, or inefficient code and think they can get away with it because it “works.” But remember that code doesn’t usually succeed on its own—it is almost always part of a broader program, some team effort or grand business project that relies on cooperation between people and over time. Because of this, even soft skills like planning, oral and written communication, negotiation, and team management can improve your chances of success in the world of algorithms.

If you enjoy creating perfectly optimal algorithms and pushing them to their highest efficiency, you’re in luck. For a huge number of computer science problems, there is no known efficient algorithm that runs much faster than brute force. In the next section, we sketch a few of these problems and discuss what’s so hard about them. If you, dear adventurer, create an algorithm that solves any of these problems quickly, you could have fame, fortune, and worldwide gratitude for the rest of your life. What are we waiting for? Let’s look at some of these challenges for the most courageous among us.

Algorithms for the Ambitious

Let’s consider a relatively simple problem related to chess. Chess is played on an 8×8 board, and two opponents take turns moving differently styled pieces. One piece, the queen, can move any number of squares along the row, column, or diagonal where it is placed. Usually, a player possesses only one queen, but it’s possible for a player to have up to nine queens in a standard chess game. If a player has more than one queen, it may be that two or more queens “attack” each other—in other words, they are placed on the same row, column, or diagonal. The eight queens puzzle challenges us to place eight queens on a standard chessboard such that no pair of queens is on the same row, column, or diagonal. Figure 11-1 shows one solution to the eight queens puzzle.


Figure 11-1: A solution to the eight queens puzzle (source: Wikimedia Commons)

None of the queens on this board attacks any of the other queens. The easiest possible way to solve the eight queens puzzle is to simply memorize a solution, like the one in Figure 11-1, and repeat it whenever you’re asked to solve the puzzle. However, a couple of extra twists to the puzzle make memorization infeasible. One twist is to increase the number of queens and the size of the board. The n queens problem asks us to place n queens on an n×n chessboard such that no queen attacks any of the others; n could be any natural number, no matter how high. Another twist is the n queens completion problem: your opponent starts by placing some of the queens, maybe in places that will make it difficult for you to place the rest, and you have to place the rest of the n queens so that none attack any others. Can you design an algorithm that will run very quickly and solve this problem? If so, you could earn a million dollars (see “Solving the Deepest Mysteries” on page 212).
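A classic starting point, sketched below in our own (hypothetical) helper functions, is backtracking: place queens one row at a time and undo any placement that leads to a dead end. It solves the basic n queens problem, but nothing about it runs in polynomial time in the worst case, and adapting it to the completion problem (starting from the opponent's pre-placed queens rather than an empty board) doesn't change that, which is exactly why a provably fast algorithm would be such a breakthrough:

def no_conflicts(queens, row, col):
    # queens[r] holds the column of the queen already placed in row r
    for r, c in enumerate(queens):
        if c == col or abs(c - col) == abs(r - row):
            return False               # same column or same diagonal
    return True

def place_queens(n, queens=()):
    row = len(queens)
    if row == n:
        return list(queens)            # all n queens placed with no attacks
    for col in range(n):
        if no_conflicts(queens, row, col):
            solution = place_queens(n, queens + (col,))
            if solution is not None:
                return solution
    return None                        # dead end: backtrack to the previous row

print(place_queens(8))                 # a solution, as column positions for rows 0 through 7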

Figure 11-1 may remind you of sudoku, since it involves checking for the uniqueness of symbols in rows and columns. In sudoku, the goal is to fill in the numbers 1 through 9 such that each row, column, and 3×3 block contains exactly one instance of each number (Figure 11-2). Sudoku first gained popularity in Japan, and indeed a sudoku puzzle is reminiscent of the Japanese magic squares we explored in Chapter 2.


Figure 11-2: An uncompleted sudoku grid (source: Wikimedia Commons)

It’s an interesting exercise to think about how to write an algorithm that could solve sudoku puzzles. The simplest, slowest possible algorithm would rely on brute force: just try every possible combination of numbers and repeatedly check whether they constitute a correct solution, repeating until the solution is found. This would work, but it lacks elegance, and it could take an extremely long time. It doesn’t seem intuitively right that filling in 81 numbers in a grid according to rules that anyone could easily follow should stretch the limits of our world’s computing resources. More sophisticated solutions could rely on logic to cut down the required runtime.
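For instance, a brute-force solver with backtracking, sketched below under assumptions of our own (a 9×9 list of lists with 0 marking empty cells), tries candidate digits cell by cell and backs up whenever the rules are violated. It always finds a solution if one exists, but its worst-case runtime is nowhere near polynomial:

def legal(grid, row, col, digit):
    # the digit must not already appear in the cell's row, column, or 3x3 block
    if digit in grid[row]:
        return False
    if any(grid[r][col] == digit for r in range(9)):
        return False
    br, bc = 3 * (row // 3), 3 * (col // 3)
    return all(grid[r][c] != digit
               for r in range(br, br + 3) for c in range(bc, bc + 3))

def solve(grid):
    for row in range(9):
        for col in range(9):
            if grid[row][col] == 0:              # found an empty cell
                for digit in range(1, 10):
                    if legal(grid, row, col, digit):
                        grid[row][col] = digit
                        if solve(grid):
                            return True          # this digit led to a full solution
                        grid[row][col] = 0       # it didn't: undo and try the next digit
                return False                     # no digit fits here: backtrack
    return True                                  # no empty cells remain: the grid is solved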

The n queens completion problem and sudoku share another important trait: solutions are very easy to check. That is, if I show you a chessboard with queens on it, it will probably take you only a few moments to check whether you’re looking at a solution to the n queens completion problem, and if I show you a grid of 81 numbers, you can easily tell whether you’re looking at a correct sudoku solution. The ease with which we can check solutions is, tragically, not matched by the ease of generating solutions—it can take hours to solve a difficult sudoku puzzle that then takes only seconds to verify. This generation/verification effort mismatch is common in many areas of life: I can tell with very little effort whether a meal is delicious, but creating a wonderful meal takes a much greater investment of time and resources. Similarly, I can check whether a painting is beautiful in much less time than it takes to create a beautiful painting, and I can verify whether a plane can fly with much less effort than it takes to build a flying plane.
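Verification is another story. A short function like the following (again, our own sketch) checks a completed grid in a number of steps that grows only polynomially with the size of the board:

def verify_sudoku(grid):
    # every row, every column, and every 3x3 block must contain the digits 1 through 9 exactly once
    rows = [list(row) for row in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    blocks = [[grid[3 * br + r][3 * bc + c] for r in range(3) for c in range(3)]
              for br in range(3) for bc in range(3)]
    return all(sorted(group) == list(range(1, 10))
               for group in rows + cols + blocks)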

Problems that are difficult to solve algorithmically but whose solutions are easy to verify are extremely important in theoretical computer science, and they are the deepest and most pressing mystery in the field. Especially courageous adventurers may dare to plunge into these mysteries—but beware the perils awaiting you there.

Solving the Deepest Mysteries

When we say that sudoku solutions are easy to verify but hard to generate, what we mean in more formal terms is that solutions can be verified in polynomial time; in other words, the number of steps required for solution verification is some polynomial function of the size of the sudoku board. If you think back to Chapter 4 and our discussion of runtimes, you’ll remember that even though polynomials like x^2 and x^3 can grow fast, they are quite slow compared to exponential functions like e^x. If we can verify an algorithmic solution to a problem in polynomial time, we regard that verification as easy, but if the generation of a solution takes exponential time, we regard it as hard.

There’s a formal name for the class of problems whose solutions can be verified in polynomial time: the NP complexity class. (Here, NP stands for nondeterministic polynomial time, for reasons that would require a long digression into theoretical computer science that would not be useful here.) NP is one of the two most fundamental complexity classes in computer science. The second is called P, for polynomial time. The P complexity class of problems contains all problems whose solutions can be found by an algorithm that runs in polynomial time. For P problems, we can find full solutions in polynomial time, while for NP problems, we can verify solutions in polynomial time, but it may take exponential time to find those solutions.

We know that sudoku is an NP problem—it is easy to verify a proposed sudoku solution in polynomial time. Is sudoku also a P problem? That is, is there an algorithm that can solve any sudoku puzzle in polynomial time? No one has ever found one, and no one appears to be close to finding one, but we don’t feel certain that it’s impossible.

The list of problems that we know are in NP is extremely long. Some versions of the traveling salesman problem are in NP. So is the optimal solution to the Rubik’s cube, as well as important mathematical problems like integer linear programming. Just as with sudoku, we wonder whether these problems are also in P—can we find solutions for them in polynomial time? One way to phrase this question is, Does P = NP?

In 2000, the Clay Mathematics Institute published a list called the Millennium Prize Problems. It announced that any person who published a verified solution to one of the problems would receive a million dollars. The list was meant to be seven of the world’s most important problems related to mathematics, and the question of whether P = NP is one of them; no one has claimed its prize yet. Will one of the noble adventurers reading these words eventually break the Gordian knot and solve this most crucial of algorithmic problems? I sincerely hope so and wish each of you luck, strength, and joy on the journey.

If there is ever a solution, it will be a proof of one of the following two assertions: either that P = NP or that P ≠ NP. A proof that P = NP could be relatively simple, since all that would be required is a polynomial-time algorithmic solution to an NP-complete problem. NP-complete problems are a special type of NP problem defined by the feature that every single NP problem can be quickly reduced to an NP-complete problem; in other words, if you can solve one NP-complete problem, you can solve every NP problem. If you can solve any single NP-complete problem in polynomial time, you can solve every NP problem in polynomial time, which would prove that P = NP. As it happens, sudoku and the n-queens completion problem are both NP-complete. This means that finding a polynomial-time algorithmic solution to either of them would not only solve every existing NP problem but also earn you a million dollars and worldwide, lifelong fame (not to mention the power to beat everyone you know in friendly sudoku competitions).

A proof that P ≠ NP would probably not be as straightforward as a solution to sudoku. The notion that P ≠ NP means that there are NP problems that cannot be solved by any algorithm with polynomial runtime. Proving this amounts to proving a negative, and it is conceptually much harder to prove that something cannot exist than it is to point to an example of something. Making progress in a proof that P ≠ NP will require extended study in theoretical computer science beyond the scope of this book. Though this path is harder, it seems to be the consensus among researchers that P ≠ NP, and that if there is ever a resolution to the P versus NP question, it will probably be a proof that P ≠ NP.

The P versus NP question is not the only deep mystery related to algorithms, although it is the most immediately lucrative one. Every aspect of the field of algorithm design has wide-open fields for adventurers to charge into. There are not only theoretical and academic questions, but also practical ones related to how to implement algorithmically sound practices in business contexts. Waste no time: remember what you have learned here and sally forth anon, carrying your new skills with you to the utmost bounds of knowledge and practice, on your lifelong algorithmic adventure. Friends, adieu.
