15 Classifying data with logistic regression

This chapter covers

  • Understanding classification problems and measuring classifiers
  • Finding decision boundaries to classify two kinds of data
  • Approximating classified data sets with logistic functions
  • Writing a cost function for logistic regression
  • Carrying out gradient descent to find a logistic function of best fit

One of the most important classes of problems in machine learning is classification, which we’ll focus on in the last two chapters of this book. A classification problem is one where we’ve got one or more pieces of raw data, and we want to say what kind of object each one represents. For instance, we might want an algorithm to look at the data of all email messages entering our inbox and classify each one as an interesting message or as unwanted spam. As an even more impactful example, we could write a classification algorithm to analyze a data set of medical scans and decide whether they contain benign or malignant tumors.

We can build machine learning algorithms for classification where the more real data our algorithm sees, the more it learns, and the better it performs at the classification task. For instance, every time an email user flags an email as spam or a radiologist identifies a malignant tumor, this data can be passed back to the algorithm to improve its calibration.

In this chapter, we look at the same simple data set as in the last chapter: mileages and prices of used cars. Instead of using data for a single model of car like in the last chapter, we’ll look at two car models: Toyota Priuses and BMW 5 series sedans. Based only on the numeric data of the car’s mileage and price, and a reference data set of known examples, we want our algorithm to give us a yes or no answer as to whether the car is a BMW. As opposed to a regression model that takes in a number and produces another number, the classification model will take in a vector and produce a number between zero and one, representing the confidence that the vector represents a BMW instead of a Prius (figure 15.1).

Figure 15.1 Our classifier takes a vector of two numbers, the mileage and price of a used car, and returns a number representing its confidence that the car is a BMW.

Even though classification has different inputs and outputs than regression, it turns out we can build our classifier using a type of regression. The algorithm we’ll implement in this chapter is called logistic regression. To train this algorithm, we start with a known data set of used car mileages and prices, labeled with a 1 if they are BMWs and a 0 if they are Priuses. Table 15.1 shows sample points in this data set that we use to train our algorithm.

Table 15.1 Sample data points used to train the algorithm

Mileage (mi)    Price ($)     Is BMW?
110,890.0       13,995.00     1
94,133.0        13,982.00     1
70,000.0        9,900.00      0
46,778.0        14,599.00     1
84,507.0        14,998.00     0
. . .           . . .         . . .

We want a function that takes the values in the first two columns and produces a result that is between zero and one, and hopefully, close to the correct choice of car. I’ll introduce you to a special kind of function called a logistic function, which takes a pair of input numbers and produces a single output number that is always between zero and one. Our classification function is the logistic function that “best fits” the sample data we provide.

Our classification function won’t always get the answer right, but then again, neither would a human. BMW 5 series sedans are luxury cars, so we would expect to get a lower price for a Prius than for a BMW with the same mileage. Defying our expectations, the last two rows of the data in table 15.1 show a Prius and a BMW at roughly the same price, where the Prius has nearly twice the mileage of the BMW. Due to fluke examples like this, we won’t expect the logistic function to produce exactly one or zero for each BMW or Prius it sees. Rather, it might return 0.51, which is the function’s way of telling us it’s not sure, but the data is slightly more likely to represent a BMW.

In the last chapter, we saw that the linear function we chose was determined by the two parameters a and b in the formula f(x) = ax + b. The logistic functions we’ll use in this chapter are parametrized by three parameters, so the task of logistic regression boils down to finding three numbers that get the logistic function as close as possible to the sample data provided. We’ll create a special cost function for the logistic function and find the three parameters that minimize the cost function using gradient descent. There’s a lot of steps here, but fortunately, they all parallel what we did in the last chapter, so it will be a useful review if you’re learning about regression for the first time.

Coding the logistic regression algorithm to classify the cars is the meat of the chapter, but before doing that, we spend a bit more time getting you familiar with the process of classification. And before we train a computer to do the classification, let’s measure how well we can do the task. Then, once we build our logistic regression model, we can evaluate how well it does by comparison.

15.1 Testing a classification function on real data

Let’s see how well we can identify BMWs in our data set using a simple criterion. Namely, if a used car has a price above $25,000, it’s probably too expensive to be a Prius (after all, you can get a brand new Prius for near that amount). If the price is above $25,000, we’ll say that it is a BMW; otherwise, we’ll say that it’s a Prius. This classification is easy to build as a Python function:

def bmw_finder(mileage,price):
    if price > 25000:
        return 1
    else:
        return 0

The performance of this classifier might not be that great because it’s conceivable that BMWs with a lot of miles might sell for less than $25,000. But we don’t have to speculate: we can measure how well this classifier does on actual data.

In this section, we measure the performance of our algorithm by writing a function called test_classifier, which takes a classification function like bmw_finder as well as the data set to test. The data set is an array of tuples of mileages, prices, and a 1 or 0, indicating whether the car is a BMW or a Prius. Once we run the test_classifier function with real data, it returns a percentage telling us how many of the cars it identifies correctly. At the end of the chapter, when we’ve implemented logistic regression, we can instead pass our logistic classification function to test_classifier and see its relative performance.

15.1.1 Loading the car data

It is easier to write the test_classifier function if we first load the car data. Rather than fuss with loading the data from CarGraph.com or from a flat file, I’ve made it easy for you by providing a Python file called car_data.py in the source code for the book. It contains two arrays of data: one for Priuses and one for BMWs. You can import the two arrays as follows:

from car_data import bmws, priuses

If you inspect either the BMW or Prius raw data in the car_data.py file, you’ll see that this file contains more data than we need. For now, we’re focusing on the mileage and price of each car, and we know what car it is, based on the list it belongs to. For instance, the BMW list begins like this:

[('bmw', '5', 2013.0, 93404.0, 13999.0, 22.09145859494213),
 ('bmw', '5', 2013.0, 110890.0, 13995.0, 22.216458611342592),
 ('bmw', '5', 2013.0, 94133.0, 13982.0, 22.09145862741898),
 ...

Each tuple represents one car for sale, and the mileage and price are given by the fourth and fifth entries of the tuple, respectively. Within car_data.py, these are converted to Car objects, so we can write car.price instead of car[4], for example. We can make a list, called all_car_data, of the shape we want by pulling the desired entries from the BMW tuples and Prius tuples:

all_car_data = []
for bmw in bmws:
    all_car_data.append((bmw.mileage,bmw.price,1))
for prius in priuses:
    all_car_data.append((prius.mileage,prius.price,0))

Once this is run, all_car_data is a Python list starting with the BMWs and ending with the Priuses, labeled with 1’s and 0’s, respectively:

>>> all_car_data
[(93404.0, 13999.0, 1),
 (110890.0, 13995.0, 1),
 (94133.0, 13982.0, 1),
 (46778.0, 14599.0, 1),
 ....
(45000.0, 16900.0, 0),
(38000.0, 13500.0, 0),
(71000.0, 12500.0, 0)]

15.1.2 Testing the classification function

With the data in a suitable format, we can now write the test_classifier function. The job of the bmw_finder is to look at the mileage and price of a car and tell us whether these represent a BMW. If the answer is yes, it returns a 1; otherwise, it returns a 0. It’s likely that bmw_finder will get some answers wrong. If it predicts that a car is a BMW (returning 1), but the car is actually a Prius, we’ll call that a false positive. If it predicts the car is a Prius (returning 0), but the car is actually a BMW, we’ll call that a false negative. If it correctly identifies a BMW or a Prius, we’ll call that a true positive or true negative, respectively.

To test a classification function against the all_car_data data set, we need to run the classification function on each mileage and price in that list, and see whether the result of 1 or 0 matches the given value. Here’s what that looks like in code:

def test_classifier(classifier, data):
    trues = 0
    falses = 0
    for mileage, price, is_bmw in data:
        if classifier(mileage, price) == is_bmw:
            trues += 1      # adds 1 to the trues counter if the classification is correct
        else:
            falses += 1     # otherwise, adds 1 to the falses counter
    return trues / (trues + falses)

If we run this function with the bmw_finder classification function and the all_car_data data set, we see that it has 59% accuracy:

>>> test_classifier(bmw_finder, all_car_data)
0.59

That’s not too bad; we got most of the answers right. But we’ll see we can do much better than this! In the next section, we plot the data set to understand what’s qualitatively wrong with the bmw_finder function. This helps us to see how we can improve the classification with our logistic classification function.

15.1.3 Exercises

Exercise 15.1: Update the test_classifier function to print the number of true positives, true negatives, false positives, and false negatives. Printing these for the bmw_finder classifier, what can you tell about the performance of the classifier?

Solution: Rather than just keeping track of correct and incorrect predictions, we can track true and false positives and negatives separately:

def test_classifier(classifier, data, verbose=False):    # specifies whether to print the counts (we might not want to every time)
    true_positives = 0                                    # we now have 4 counters to keep track of
    true_negatives = 0
    false_positives = 0
    false_negatives = 0
    for mileage, price, is_bmw in data:
        predicted = classifier(mileage, price)
        if predicted and is_bmw:                          # depending on whether the car is a Prius or BMW and whether
            true_positives += 1                           # it's classified correctly, increments 1 of the 4 counters
        elif predicted:
            false_positives += 1
        elif is_bmw:
            false_negatives += 1
        else:
            true_negatives += 1

    if verbose:
        print("true positives %f" % true_positives)       # prints the results of each counter
        print("true negatives %f" % true_negatives)
        print("false positives %f" % false_positives)
        print("false negatives %f" % false_negatives)

    total = true_positives + true_negatives

    return total / len(data)                              # number of correct classifications divided by the size of the data set

For the bmw_finder function, this prints the following text:

true positives 18.000000
true negatives 100.000000
false positives 0.000000
false negatives 82.000000

Because the classifier returns no false positives, this tells us it always correctly identifies when the car is not a BMW. But we can’t be too proud of our function yet, because it says most of the cars are not BMWs, including many that are! In the next exercise, you can relax the constraint to get a higher overall success rate.

  

Exercise 15.2: Find a way to update the bmw_finder function to improve its performance and use the test_classifier function to confirm that your improved function has better than 59% accuracy.

Solution: If you solved the last exercise, you saw that bmw_finder was too aggressive in saying that cars were not BMWs. We can lower the price threshold to $20,000 and see if it makes a difference:

def bmw_finder2(mileage,price):
    if price > 20000:
        return 1
    else:
        return 0

  

Indeed, by lowering the threshold, bmw_finder2 improves the success rate to 73.5%:

>>> test_classifier(bmw_finder2, all_car_data)
0.735

15.2 Picturing a decision boundary

Before we implement the logistic regression function, let’s look at one more way to measure our success at classification. Because two numbers, mileage and price, define our used car data points, we can think of these as 2D vectors and plot them as points on a 2D plane. This plot gives us a better sense of where our classification function “draws the line” between BMWs and Priuses, and we can see how to improve it. It turns out that using our bmw_finder function is equivalent to drawing a literal line in the 2D plane, calling any point above the line a BMW and any point below it a Prius.

In this section, we use Matplotlib to draw our plot and see where bmw_finder places the dividing line between BMWs and Priuses. This line is called the decision boundary because the side of the line a point lies on helps us decide what class it belongs to. After looking at the car data on a plot, we can figure out where to draw a better dividing line. This lets us define an improved version of the bmw_finder function, and we can measure exactly how much better it performs.

15.2.1 Picturing the space of cars

All of the cars in our data set have mileage and price values, but some of them represent BMWs and some represent Priuses, depending on whether they are labeled with a 1 or with a 0. To make our plot readable, we want to make a BMW and a Prius visually distinct on the scatter plot.

Figure 15.2 A plot of price vs. mileage for all cars in the data set with each BMW represented by an X and each Prius represented with a circle

The plot_data helper function in the source code takes the whole list of car data and automatically plots the BMWs with X’s and the Priuses with circles. Figure 15.2 shows the plot.

>>> plot_data(all_car_data)

In general, we can see that the BMWs are more expensive than the Priuses; most BMWs are higher on the price axis. This justifies our strategy of classifying the more expensive cars as BMWs. Specifically, we drew the line at a price of $25,000 (figure 15.3). On the plot, this line separates the top of the plot with more expensive cars from the bottom with less expensive cars.
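
One way to see this cut-off line for yourself is a quick Matplotlib sketch, assuming matplotlib.pyplot is imported as plt and using the plot_data helper from the book’s source code:

import matplotlib.pyplot as plt

plot_data(all_car_data)      # scatter plot of BMWs (X's) and Priuses (circles)
plt.axhline(25000, c='k')    # the $25,000 cut-off is a horizontal decision boundary
plt.show()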

Figure 15.3 Shows the decision line with car data plotted

This is our decision boundary. Every X above the line was correctly identified as a BMW, while every circle below the line was correctly identified as a Prius. All other points were classified incorrectly. It’s clear that if we move this decision boundary, we can improve our accuracy. Let’s give it a try.

15.2.2 Drawing a better decision boundary

Based on the plot in figure 15.3, we could lower the line and correctly identify a few more BMWs, while not incorrectly identifying any Priuses. Figure 15.4 shows what the decision boundary looks like if we lower the cut-off price to $21,000.

Figure 15.4 Lowering the decision boundary line appears to increase our accuracy.

The $21,000 cut-off might be a good boundary for low-mileage cars, but the higher the mileage, the lower the threshold should be. For instance, it looks like most BMWs with 75,000 miles or more are priced below $21,000. To model this, we can make our cut-off price mileage dependent. Geometrically, that means drawing a line that slopes downward (figure 15.5).

Figure 15.5 Using a downward-sloping decision boundary

This line is given by the function p(x) = 21,000 − 0.07 · x, where p is price and x is mileage. There is nothing special about this equation; I just played around with the numbers until I plotted a line that looked reasonable. But it looks like it correctly identifies even more BMWs than before, with only a handful of false positives (Priuses incorrectly classified as BMWs). Rather than just eyeballing these decision boundaries, we can turn them into classifier functions and measure their performance.

15.2.3 Implementing the classification function

To turn this decision boundary into a classification function, we need to write a Python function that takes a car mileage and price, and returns one or zero depending on whether the point falls above or below the line. That means taking the given mileage, plugging it into the decision boundary function, p(x), to see what the threshold price is and comparing the result to the given price. This is what it looks like:

def decision_boundary_classify(mileage,price):
    if price > 21000 - 0.07 * mileage:
        return 1
    else:
        return 0

Testing this out, we can see it is much better than our first classifier; 80.5% of the cars are correctly classified by this line. Not bad!

>>> test_classifier(decision_boundary_classify, all_car_data)
0.805

You might ask why we can’t just do a gradient descent on the parameters defining the decision boundary line. If 21,000 and 0.07 don’t give the most accurate decision boundary, maybe some pair of numbers near them does. This isn’t a crazy idea. When we implement logistic regression, you’ll see that under the hood, it moves the decision boundary around using gradient descent until it finds the best one.

There are two important reasons we’ll implement the more sophisticated logistic regression algorithm rather than doing a gradient descent on the parameters a and b of the decision boundary function, ax + b. The first is that if the decision boundary is close to vertical at any step in the gradient descent, the numbers a and b could get very large and cause numerical issues. The other is that there isn’t an obvious cost function. In the next section, we see how logistic regression takes care of both of these issues so we can search for the best decision boundary using gradient descent.

15.2.4 Exercises

Exercise 15.3-Mini Project: What is the decision boundary of the form p = constant that gives the best classification accuracy on the test data set?

Solution: The following function builds a classifier function for any specified, constant cut-off price. In other words, the resulting classifier returns true if the test car has price above the cutoff and false otherwise:

def constant_price_classifier(cutoff_price):
    def c(x,p):
        if p > cutoff_price:
            return 1
        else:
            return 0
    return c

The accuracy of this function can be measured by passing the resulting classifier to the test_classify function. Here’s a helper function to automate this check for any price we want to test as a cut-off value:

def cutoff_accuracy(cutoff_price):
    c = constant_price_classifier(cutoff_price)
    return test_classifier(c,all_car_data)

The best cut-off price lies between two of the prices in our list, so it’s sufficient to check each listed price and see which one works best as a cut-off. We can do that quickly in Python using the max function. The keyword argument key lets us choose what function we want to maximize by. In this case, we want to find the price in the list that is the best cut-off, so we maximize by the cutoff_accuracy function. (Here, all_prices is assumed to be the list of every price appearing in the data set, for example, all_prices = [price for (mileage, price, is_bmw) in all_car_data].)

>>> max(all_prices,key=cutoff_accuracy)
17998.0

This tells us that according to our data set, $17,998 is the best price to use as a cut-off when deciding whether a car is a BMW 5 series or a Prius. It turns out to be quite accurate for our data set, with 79.5% accuracy:

>>> test_classifier(constant_price_classifier(17998.0), all_car_data)
0.795

15.3 Framing classification as a regression problem

The way that we can reframe our classification task as a regression problem is to create a function that takes in the mileage and price of a car, and returns a number measuring how likely it is to be a BMW instead of a Prius. In this section, we implement a function called logistic_classifier that, from the outside, looks a lot like the classifiers we’ve built so far; it takes a mileage and a price, and outputs a number telling us whether the car is a BMW or a Prius. The only difference is that rather than outputting one or zero, it outputs a value between zero and one, telling us how likely it is that the car is a BMW.

You can think of this number as the probability that the mileage and price describe a BMW, or more abstractly, you can think of it as giving the “BMWness” of the data point (figure 15.6). (Yes, that’s a made-up word, which I pronounce “bee-em-doubleyou-ness.” It means how much it looks like a BMW. Maybe we could call the antonym “Priusity.”)

Figure 15.6 The concept of “BMWness” describes how much like a BMW a point in the plane is.

To build the logistic classifier, we start with a guess of a good decision boundary line. Points above the line have high “BMWness,” meaning these are likely to be BMWs and the logistic function should return values close to one. Data points below the line have a low “BMWness,” meaning these are more likely to be Priuses and our function should return values close to zero. On the decision boundary, the “BMWness” value will be 0.5, meaning a data point there is equally as likely to be a BMW as it is to be a Prius.

15.3.1 Scaling the raw car data

There’s a chore we need to take care of at some point in the regression process, so we might as well take care of it now. As we discussed in the last chapter, the large values of mileage and price can cause numerical errors, so it’s better to rescale them to a small, consistent size. We should be safe if we scale all of the mileages and the prices linearly to values between zero and one.

We need to be able to scale and unscale each of mileage and price, so we need four functions in total. To make this a little bit less painful, I’ve written a helper function that takes a list of numbers and returns functions to scale and unscale these linearly, between zero and one, using the maximum and minimum values in the list. Applying this helper function to the whole list of mileages and of prices gives us the four functions we need:

def make_scale(data):
    min_val = min(data)      # the minimum and maximum give the current range of the data set
    max_val = max(data)
    def scale(x):
        # puts the data point at the same fraction of the way between 0 and 1
        # as it was from min_val to max_val
        return (x - min_val) / (max_val - min_val)
    def unscale(y):
        # puts the scaled data point at the same fraction of the way from
        # min_val to max_val as it was from 0 to 1
        return y * (max_val - min_val) + min_val
    return scale, unscale    # returns the scale and unscale functions (closures, if you're familiar with that term)

price_scale, price_unscale = \
    make_scale([x[1] for x in all_car_data])      # one pair of functions for price...
mileage_scale, mileage_unscale = \
    make_scale([x[0] for x in all_car_data])      # ...and one pair for mileage

We can now apply these scaling functions to every car data point in our list to get a scaled version of the data set:

scaled_car_data = [(mileage_scale(mileage), price_scale(price), is_bmw) 
                    for mileage,price,is_bmw in all_car_data]
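
As a quick sanity check, a sketch using the functions defined above confirms that the scaling maps every value into the unit interval and that unscale inverts scale up to floating-point error:

x, p, _ = all_car_data[0]
assert 0 <= mileage_scale(x) <= 1 and 0 <= price_scale(p) <= 1
assert abs(mileage_unscale(mileage_scale(x)) - x) < 1e-6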

The good news is that the plot looks the same (figure 15.7), except that the values on the axes are different.

Figure 15.7 The mileage and price data scaled so that all values are between zero and one. The plot looks the same as before, but our risk of numerical error has decreased.

Because the geometry of the scaled data set is the same, it should give us confidence that a good decision boundary for this scaled data set translates to a good decision boundary for the original data set.

15.3.2 Measuring the “BMWness” of a car

Let’s start with a decision boundary that looks similar to the one from the last section. The function p(x) = 0.56 − 0.35 · x gives price at the decision boundary as a function of mileage. This is pretty close to the one I found by eyeballing in the last section, but it applies to the scaled data set instead (figure 15.8).

Figure 15.8 The decision boundary p(x) = 0.56 − 0.35 · x on the scaled data set

We can still test classifiers on the scaled data set with our test_classifier function; we just need to take care to pass in the scaled data instead of the original. It turns out this decision boundary gives us a 78.5% accurate classification of the data.
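
In code, that check might look like the following sketch, assuming the test_classifier function and scaled_car_data list from earlier:

def scaled_boundary_classify(x, p):
    # BMW if the scaled price lies above the line p = 0.56 - 0.35 * x
    if p > 0.56 - 0.35 * x:
        return 1
    else:
        return 0

test_classifier(scaled_boundary_classify, scaled_car_data)   # 0.785, the 78.5% quoted above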

It also turns out that this decision boundary function can be rearranged to give a measure of the “BMWness” of a data point. To make our algebra easier, let’s write the decision boundary as

p = ax + b

where p is price, x is still mileage, and a and b are the slope and intercept of the line (in this case, a = -0.35 and b = 0.56), respectively. Instead of thinking of this as a function, we can think of it as an equation satisfied by points (x, p) on the decision boundary. If we subtract ax + b from both sides of the equation, we get another correct equation:

p − ax − b = 0

Every point (x, p) on the decision boundary satisfies this equation as well. In other words, the quantity p − ax − b is zero for every point on the decision boundary.

Here’s the point of this algebra: the quantity p − ax − b is a measure of the “BMWness” of the point (x, p). If (x, p) is above the decision boundary, it means p is too big relative to x, so p − ax − b > 0. If, instead, (x, p) is below the decision boundary, p is too small relative to x, so p − ax − b < 0. Otherwise, the expression p − ax − b is exactly zero, and the point is right at the threshold between being interpreted as a Prius and being interpreted as a BMW. This might be a little bit abstract on the first read, so table 15.2 lists the three cases.

Table 15.2 Summary of the possible cases

(x, p) above decision boundary    p − ax − b > 0    Likely to be a BMW
(x, p) on decision boundary       p − ax − b = 0    Could be either car model
(x, p) below decision boundary    p − ax − b < 0    Likely to be a Prius

If you’re not convinced that p − ax − b is a measure of “BMWness” compatible with the decision boundary, an easier way to see this is to look at the heat map of f(x, p) = p − ax − b, together with the data (figure 15.9). When a = -0.35 and b = 0.56, the function is f(x, p) = p + 0.35 · x − 0.56.

Figure 15.9 A plot of the heatmap and decision boundary showing that the bright values (positive “BMWness”) are above the decision boundary and dark values (negative “BMWness”) occur below the decision boundary

The function, f(x, p), almost meets our requirements. It takes a mileage and a price, and it outputs a number that is higher if the numbers are likely to represent a BMW, and lower if the values are likely to represent a Prius. The only thing missing is that the output numbers aren’t constrained to be between zero and one, and the cutoff is at a value of zero rather than at a value of 0.5 as desired. Fortunately, there’s a handy kind of mathematical helper function we can use to adjust the output.

15.3.3 Introducing the sigmoid function

The function f(x, p) = p − ax − b is linear, but this is not a chapter on linear regression! The topic at hand is logistic regression, and to do logistic regression, you need to use a logistic function. The most basic logistic function is the one that follows, which is often called a sigmoid function:

σ(x) = 1 / (1 + e^(−x))

We can implement this function in Python with the exp function, which stands in for e^x, where e = 2.71828... is the constant we’ve used for exponential bases before:

from math import exp
def sigmoid(x):
    return 1 / (1 + exp(-x))

Figure 15.10 shows its graph.

Figure 15.10 The graph of the sigmoid function σ(x)

In the function, we use the Greek letter σ (sigma) because σ is the Greek version of the letter S, and the graph of σ(x) looks a bit like the letter S. Sometimes the words logistic function and sigmoid function are used interchangeably to mean a function like the one in figure 15.10, which smoothly ramps up from one value to another. In this chapter (and the next), when I refer to the sigmoid function, I’ll be talking about this specific function: σ(x).

You don’t need to worry too much about how this function is defined, but you do need to understand the shape of the graph and what it means. This function sends any input number to a value between zero and one, with big negative numbers yielding results closer to zero, and big positive numbers yielding results closer to one. The result of σ(0) is 0.5. We can think of σ as translating the range from -∞ to ∞ to the more manageable range from zero to one.
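
A few quick evaluations confirm this behavior (the values shown are approximate):

sigmoid(-5)   # roughly 0.0067; big negative inputs give results near zero
sigmoid(0)    # exactly 0.5
sigmoid(5)    # roughly 0.9933; big positive inputs give results near one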

15.3.4 Composing the sigmoid function with other functions

Returning to our function f(x, p) = p − ax − b, we saw that it takes a mileage value and a price value and returns a number measuring how much the values look like a BMW rather than a Prius. This number could be positive or negative and arbitrarily large in either direction, and a value of zero indicates that the point is on the boundary between being a BMW and being a Prius.

What we want our function to return is a value between zero and one, with values close to zero and one representing cars likely to be Priuses or BMWs, respectively, and a value of 0.5 representing a car that is equally likely to be either a Prius or a BMW. All we have to do to adjust the outputs of f(x, p) to the expected range is to pass them through the sigmoid function σ(x) as shown in figure 15.11. That is, the function we want is σ(f(x, p)), where x and p are the mileage and price.

Figure 15.11 Schematic diagram of composing the “BMWness” function f(x, p) with the sigmoid function σ(x)

Let’s call the resulting function L(x, p), so in other words, L(x, p) = σ(f(x, p)). Implementing the function L(x, p) in Python and plotting its heatmap (figure 15.12), we can see that it increases in the same direction as f(x, p), but its values are different.
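
Here’s a minimal sketch of that implementation, using the a = −0.35 and b = 0.56 boundary from before (the heatmap plotting itself is done with a helper in the book’s source code):

def f(x, p):
    return p + 0.35 * x - 0.56    # the "BMWness" measure from section 15.3.2

def l(x, p):
    return sigmoid(f(x, p))       # the same measure, squashed into the range from zero to one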

Figure 15.12 The heatmaps look basically the same, but the values of the function are slightly different.

Based on this picture, you might wonder why we went through the trouble of passing the “BMWness” function through the sigmoid. From this perspective, the functions look mostly the same. However, if we plot their graphs as 2D surfaces in 3D (figure 15.13), you can see that the curvy shape of the sigmoid has an effect.

Figure 15.13 While f(x, p) slopes upward linearly, L(x, p) curves up from a minimum value of 0 to a maximum value of 1.

In fairness, I had to zoom out a bit in (x, p) space to make the curvature clear. The point is that if the type of car is indicated by a 0 or 1, the values of the function L(x, p) actually come close to these numbers, whereas the values of f(x, p) go off to positive and negative infinity!

Figure 15.14 illustrates two exaggerated diagrams to show you what I mean. Remember that in our data set, scaled_car_data, we represented Priuses as triples of the form (mileage, price, 0) and BMWs as triples of the form (mileage, price, 1). We can interpret these as points in 3D where the BMWs live in the plane z = 1 and Priuses live in the plane z = 0. Plotting scaled_car_data as a 3D scatter plot, you can see that a linear function can’t come close to many of the data points in the same way as a logistic function.

With functions shaped like L(x, p), we can actually hope to fit the data, and we’ll see how to do that in the next section.

Figure 15.14 The graph of a linear function in 3D can’t come as close to the data points as the graph of a logistic function.

15.3.5 Exercises

Exercise 15.4: Find a function h(x) such that large positive values of x cause h(x) to be close to 0, large negative values of x cause h(x) to be close to 1, and h(3) = 0.5.

Solution: The function y(x) = 3 − x has y(3) = 0, and it goes off to positive infinity when x is large and negative and off to negative infinity when x is large and positive. That means passing the result of y(x) into our sigmoid function gives us a function with the desired properties. Specifically, h(x) = σ(y(x)) = σ(3 − x) works, and its graph is shown here to convince you:

  

Exercise 15.5−Mini Project: There is actually a lower bound on the result of f(x, p) because x and p are not allowed to be negative (negative mileages and prices don’t make sense, after all). Can you figure out the lowest value of f that a car could produce?

Solution: According to the heatmap, the function f(x, p) gets smaller as we go down and to the left. The equation confirms this as well; if we decrease x or p, the value of f = p − ax − b = p + 0.35 · x − 0.56 gets smaller. Therefore, the minimum value of f(x, p) occurs at (x, p) = (0, 0), and it’s f(0, 0) = −0.56.

15.4 Exploring possible logistic functions

Let’s quickly retrace our steps. Plotting the mileages and prices of our set of Priuses and BMWs on a scatter plot, we could try to draw a line between these values, called a decision boundary, that defines a rule by which to distinguish a Prius from a BMW. We wrote our decision boundary as a line in the form p(x) = ax + b, and it looked like -0.35 and 0.56 were reasonable choices for a and b, giving us a classification that was about 80% correct.

Rearranging this function, we found that f(x, p) = p − ax − b was a function taking a mileage and price (x, p) and returning a number that was greater than zero on the BMW side of the decision boundary and smaller than zero on the Prius side. On the decision boundary, f(x, p) returned zero, meaning a car would be equally likely to be a BMW or a Prius. Because we represent BMWs with a 1 and Priuses with a 0, we wanted a version of f(x, p) that returned values between zero and one, where 0.5 would represent a car equally likely to be a BMW or a Prius. Passing the result of f into a sigmoid function σ, we got a new function L(x, p) = σ(f(x, p)), satisfying that requirement.

But we don’t want the L(x, p) I made by eyeballing the best decision boundary; we want the L(x, p) that best fits the data. On our way to finding it, we’ll see that there are three parameters we can control to write a general logistic function that takes 2D vectors, returns numbers between zero and one, and has a decision boundary L(x, p) = 0.5 that is a straight line. We’ll write a Python function, make_logistic(a,b,c), that takes in three parameters a, b, and c, and returns the logistic function they define. As we explored a 2D space of (a, b) pairs to choose a linear function in chapter 14, we’ll explore a 3D space of values (a, b, c) to define our logistic function (figure 15.15).

Figure 15.15 Exploring a 3D space of parameter values (a, b, c) to define a function L(x, p)

Then we’ll create a cost function, much like the one we created for linear regression. The cost function, which we’ll call logistic_cost(a,b,c), takes the parameters a, b, and c that define a logistic function and produces one number, measuring how far that logistic function is from our car data set. The logistic_cost function needs to be implemented in such a way that the lower its value, the better the predictions of the associated logistic function.

15.4.1 Parameterizing logistic functions

The first task is to find the general form of a logistic function L(x, p), whose values range from zero to one and whose decision boundary L(x, p) = 0.5 is a straight line. We got close to this in the last section, starting with the decision boundary p(x) = ax + b and reverse engineering a logistic function from it. The only problem is that a function of the form p = ax + b can’t represent every line in the plane. For instance, figure 15.16 shows a data set where a vertical decision boundary, x = 0.6, makes sense. Such a line can’t be represented in the form p = ax + b, however.

Figure 15.16 A vertical decision boundary might make sense, but it can’t be represented in the form p = ax + b.

The general form of a line that does work is the one we met in chapter 7: ax + by = c. Because we’re calling our variables x and p, we’ll write ax + bp = c. Given an equation like this, the function z(x, p) = ax + bp − c is zero on the line, with positive values on one side and negative values on the other. For us, the side of the line where z(x, p) is positive is the BMW side, and the side where z(x, p) is negative is the Prius side.

Passing z(x, p) through the sigmoid function, we get a general logistic function L(x, p) = σ(z(x, p)), where L(x, p) = 0.5 on the line where z(x, p) = 0. In other words, the function L(x, p) = σ(ax + bp − c) is the general form we’re looking for. This is easy to translate to Python, giving us a function of a, b, and c that returns a corresponding logistic function L(x, p) = σ(ax + bp − c):

def make_logistic(a,b,c):
    def l(x,p):
        return sigmoid(a*x + b*p - c)
    return l
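
As a quick check, the logistic function corresponding to my hand-picked boundary 0.35 · x + p = 0.56 returns a value just above 0.5 for a point on the BMW side of that line (the output noted below is approximate):

l = make_logistic(0.35, 1, 0.56)
l(0.5, 0.6)   # roughly 0.55; the point (0.5, 0.6) is above the boundary, so slightly BMW-like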

The next step is to come up with a measure of how close this function comes to our scaled_car_data dataset.

15.4.2 Measuring the quality of fit for a logistic function

For any BMW, the scaled_car_data list contains an entry of the form (x, p, 1), and for every Prius, it contains an entry of the form (x, p, 0), where x and p denote (scaled) mileage and price values, respectively. If we apply a logistic function, L(x, p), to the x and p values, we’ll get a result between zero and one.

A simple way to measure the error or cost of the function L is to find how far off it is from the correct value, which is either zero or one. If you add up all of these errors, you’ll get a total value telling you how far the function L(x, p) comes from the data set. Here’s what that looks like in Python:

def simple_logistic_cost(a,b,c):
    l = make_logistic(a,b,c)
    errors = [abs(is_bmw-l(x,p)) 
              for x,p,is_bmw in scaled_car_data]
    return sum(errors)

This cost reports the error reasonably well, but it isn’t good enough to get our gradient descent to converge to a best value of a, b, and c. I won’t go into a full explanation of why this is, but I’ll try to quickly give you the general idea.

Suppose we have two logistic functions, L1(x, p) and L2(x, p), and we want to compare the performance of both. Let’s say they both look at the same data point (x, p, 0), meaning a data point representing a Prius. Then let’s say L1(x, p) returns 0.99, which is greater than 0.5, so it predicts incorrectly that the car is a BMW. The error for this point is |0-0.99| = 0.99. If another logistic function, L2(x, p), predicts a value of 0.999, the model predicts with more certainty that the car is a BMW, and is even more wrong. That said, the error would be only |0-0.999| = 0.999, which is not much different.

Figure 15.17 The function -log(x) returns big values for small inputs, and −log(1) = 0.

It’s more appropriate to think of L1 as reporting a 99% chance the data point represents a BMW and a 1% chance that it represents a Prius, with L2 reporting a 99.9% chance it is a BMW and a 0.1% chance it is a Prius. Instead of thinking of L2’s Prius prediction as merely 0.9 percentage points worse, we should really think of it as ten times worse: L2 rates the correct answer one tenth as likely as L1 does. We can, therefore, think of L2 as being ten times more wrong than L1.

We want a cost function such that if L(x, p) is really sure of the wrong answer, then the cost of L is high. To get that, we can look at the difference between L(x, p) and the wrong answer, and pass it through a function that makes tiny values big. For instance, L1(x, p) returned 0.99 for a Prius, meaning it was 0.01 units from the wrong answer, while L2(x, p) returned 0.999 for a Prius, meaning it was 0.001 units from the wrong answer. A good function to return big values from tiny ones is −log(x), where log is the special natural logarithm function. It’s not critical that you know what the −log function does, only that it returns big numbers for small inputs. Figure 15.17 shows the plot of −log(x).

To familiarize yourself with −log(x), you can test it with some small inputs. For L1(x, p), which was 0.01 units from the wrong answer, we get a smaller cost than L2(x, p), which was 0.001 units from the wrong answer:

>>> from math import log
>>> -log(0.01)
4.605170185988091
>>> -log(0.001)
6.907755278982137

By comparison, if L(x, p) returns zero for a Prius, it would be giving the correct answer. That’s one unit away from the wrong answer, and −log(1) = 0, so there is zero cost for the right answer.

Now we’re ready to implement the logistic_cost function that we set out to create. To find the cost for a given point, we calculate how close the given logistic function comes to the wrong answer and then take the negative logarithm of the result. The total cost is the sum of the cost at every data point in the scaled_car_data data set:

def point_cost(l, x, p, is_bmw):              # determines the cost of a single data point
    wrong = 1 - is_bmw
    return -log(abs(wrong - l(x, p)))

def logistic_cost(a, b, c):
    l = make_logistic(a, b, c)
    errors = [point_cost(l, x, p, is_bmw)     # the overall cost is the same as before, except we use
              for x, p, is_bmw in scaled_car_data]   # point_cost for each data point instead of the absolute error
    return sum(errors)

It turns out, we get good results if we try to minimize the logistic_cost function using gradient descent. But before we do that, let’s do a sanity check and confirm that logistic_cost returns lower values for a logistic function with an (obviously) better decision boundary.

15.4.3 Testing different logistic functions

Let’s try out two logistic functions with different decision boundaries and confirm that the one with the obviously better decision boundary has a lower cost. As our two examples, let’s use p = 0.56 − 0.35 · x, my best-guess decision boundary, which is the same as 0.35 · x + 1 · p = 0.56, and an arbitrarily selected one, say x + p = 1. Clearly, the former is a better dividing line between the Priuses and the BMWs.

In the source code, you’ll find a plot_line function to draw a line based on the values a, b, and c in the equation ax + by = c (and as an exercise at the end of the section, you can try implementing this function yourself). The respective values of (a, b, c) are (0.35, 1, 0.56) and (1, 1, 1). We can plot them alongside the scatter plot of car data (shown in figure 15.18) with these three lines:

plot_data(scaled_car_data)
plot_line(0.35,1,0.56)
plot_line(1,1,1)

Figure 15.18 The graphs of two decision boundary lines. One is clearly better than the other at separating Priuses from BMWs.

The corresponding logistic functions are σ(0.35 · x + p − 0.56) and σ(x + p − 1), and we expect the first one has a lower cost with respect to the data. We can confirm this with the logistic_cost function:

>>> logistic_cost(0.35,1,0.56)
130.92490748700456
>>> logistic_cost(1,1,1)
135.56446830870456

As expected, the line x + p = 1 is a worse decision boundary, so the logistic function σ(x + p − 1) has a higher cost. The first function σ(0.35 · x + p − 0.56) has a lower cost and a better fit. But is it the best fit? When we run gradient descent on the logistic_cost function in the next section, we’ll find out.

15.4.4 Exercises

Exercise 15.6: Implement the function plot_line(a,b,c) referenced in section 15.4.3 that plots the line ax + by = c, where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1.

Solution: Note that I used names other than a, b, and c for the function arguments, because c is a keyword argument that sets the color of the plotted line for Matplotlib’s plot function, which I commonly make use of:

import matplotlib.pyplot as plt

def plot_line(acoeff, bcoeff, ccoeff, **kwargs):
    a, b, c = acoeff, bcoeff, ccoeff
    if b == 0:
        # vertical line x = c/a when the coefficient of y is zero
        plt.plot([c/a, c/a], [0, 1], **kwargs)
    else:
        def y(x):
            return (c - a*x) / b
        plt.plot([0, 1], [y(0), y(1)], **kwargs)

  

Exercise 15.7: Use the formula for the sigmoid function σ to write an expanded formula for σ(ax + by − c).

Solution: Given that

σ(x) = 1 / (1 + e^(−x))

we can write

σ(ax + by − c) = 1 / (1 + e^(−(ax + by − c))) = 1 / (1 + e^(c − ax − by))
  

Exercise 15.8−Mini Project: What does the graph of k(x, y) = σ(x² + y² − 1) look like? What does the decision boundary look like, meaning the set of points where k(x, y) = 0.5?

Solution: We know that σ(x² + y² − 1) = 0.5 wherever x² + y² − 1 = 0, that is, where x² + y² = 1. You can recognize the solutions to this equation as the points at distance one from the origin: a circle of radius 1. Inside the circle, the distance from the origin is smaller, so x² + y² < 1 and σ(x² + y² − 1) < 0.5, while outside the circle, x² + y² > 1, so σ(x² + y² − 1) > 0.5. The graph of this function approaches 1 as we move farther away from the origin in any direction, while inside the circle it decreases to a minimum value of about 0.27 at the origin. Here’s the graph:

A graph of σ(x² + y² − 1). Its value is less than 0.5 inside the circle of radius 1, and it increases toward a value of 1 in every direction outside that circle.

  

Exercise 15.9−Mini Project: Two equations, 2x + y = 1 and 4x + 2y = 2, define the same line and, therefore, the same decision boundary. Are the logistic functions σ(2x + y − 1) and σ(4x + 2y − 2) the same?

Solution: No, they aren’t the same function. The quantity 4x + 2y − 2 increases more rapidly with respect to increases in x and y, so the graph of the latter function is steeper:

The graph of the second logistic function is steeper than the graph of the first.

  

Exercise 15.10-Mini Project: Given a line ax + by = c, it’s not as easy to define what is above the line and what is below it. Can you describe on which side of the line the function z(x, y) = ax + by − c takes positive values?

Solution: The line ax + by = c is the set of points where z(x, y) = ax + by − c = 0. As we saw for equations of this form in chapter 7, the graph of z(x, y) = ax + by − c is a plane, so it increases in one direction from the line and decreases in the other. The gradient of z(x, y) is ∇z(x, y) = (a, b), so z(x, y) increases most rapidly in the direction of the vector (a, b) and decreases most rapidly in the opposite direction, (−a, −b). Both of these directions are perpendicular to the line.

15.5 Finding the best logistic function

We now have a straightforward minimization problem to solve; we’d like to find the values a, b, and c that make the logistic_cost function as small as possible. Then the corresponding function, L(x, p) = σ(ax + bp − c), will be the best fit to the data. We can use that resulting function to build a classifier by plugging in the mileage x and price p for an unknown car and labeling it as a BMW if L(x, p) > 0.5 and as a Prius otherwise. We’ll call this classifier best_logistic_classifier(x,p), and we can pass it to test_classifier to see how well it does.

The only major work we have to do here is upgrading our gradient_descent function. So far, we’ve only done gradient descent with functions that take 2D vectors and return numbers. The logistic_cost function takes a 3D vector (a, b, c) and outputs a number, so we need a new version of gradient descent. Fortunately, we covered 3D analogies for every 2D vector operation we’ve used, so it won’t be too hard.

15.5.1 Gradient descent in three dimensions

Let’s look at our existing gradient calculation that we used to work with functions of two variables in chapters 12 and 14. The partial derivatives of a function f(x, y) at a point (x0, y0) are the derivatives with respect to x and y individually, while assuming the other variable is a constant. For instance, plugging in y0 into the second slot of f(x, y), we get f(x, y0), which we can treat as a function of x alone and take its ordinary derivative. Putting the two partial derivatives together as components of a 2D vector gives us the gradient:

def approx_gradient(f,x0,y0,dx=1e-6):
    partial_x = approx_derivative(lambda x:f(x,y0),x0,dx=dx)
    partial_y = approx_derivative(lambda y:f(x0,y),y0,dx=dx)
    return (partial_x,partial_y)

The difference for a function of three variables is that there’s one other partial derivative we can take. If we look at f(x, y, z) at some point (x0, y0, z0), we can look at f(x, y0, z0), f(x0, y, z0), and f(x0, y0, z) as functions of x, y, and z, respectively, and take their ordinary derivatives to get three partial derivatives. Putting these three partial derivatives together in a vector, we get the 3D version of the gradient:

def approx_gradient3(f,x0,y0,z0,dx=1e-6):
    partial_x = approx_derivative(lambda x:f(x,y0,z0),x0,dx=dx)
    partial_y = approx_derivative(lambda y:f(x0,y,z0),y0,dx=dx)
    partial_z = approx_derivative(lambda z:f(x0,y0,z),z0,dx=dx)
    return (partial_x,partial_y,partial_z)

To do the gradient descent in 3D, the procedure is just as you’d expect; we start at some point in 3D, calculate the gradient, and step a small amount in that direction to arrive at a new point, where hopefully, the value of f(x, y, z) is smaller. As one additional enhancement, I’ve added a max_steps parameter so we can set a maximum number of steps to take during the gradient descent. With that parameter set to a reasonable limit, we won’t have to worry about our program stalling even if the algorithm doesn’t converge to a point within the tolerance. Here’s what the result looks like in Python:

def gradient_descent3(f,xstart,ystart,zstart,
                      tolerance=1e-6,max_steps=1000):
    x = xstart
    y = ystart
    z = zstart
    grad = approx_gradient3(f,x,y,z)
    steps = 0
    while length(grad) > tolerance and steps < max_steps:
        x -= 0.01 * grad[0]
        y -= 0.01 * grad[1]
        z -= 0.01 * grad[2]
        grad = approx_gradient3(f,x,y,z)
        steps += 1
    return x,y,z

All that remains is to plug in the logistic_cost function, and the gradient_descent3 function finds inputs that minimize it.

15.5.2 Using gradient descent to find the best fit

To be cautious, we can start by using a small number of max_steps, like 100:

>>> gradient_descent3(logistic_cost,1,1,1,max_steps=100)
(0.21114493546399946, 5.04543972557848, 2.1260122558655405)

If we allow it to take 200 steps instead of 100, we see that it has further to go after all:

>>> gradient_descent3(logistic_cost,1,1,1,max_steps=200)
(0.884571531298388, 6.657543188981642, 2.955057286988365)

Remember, these results are the parameters required to define the logistic function, but they are also the parameters (a, b, c) defining the decision boundary in the form ax + bp = c. If we run gradient descent for 100 steps, 200 steps, 300 steps, and so on, and plot the corresponding lines with plot_line, we can see the decision boundary converging as in figure 15.19.
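
Here’s one way to produce that picture, sketched using the plot_data and plot_line helpers from the book’s source code (each call restarts the descent from (1, 1, 1), which is wasteful but keeps the sketch simple):

plot_data(scaled_car_data)
for steps in range(100, 1000, 100):
    a, b, c = gradient_descent3(logistic_cost, 1, 1, 1, max_steps=steps)
    plot_line(a, b, c)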

Figure 15.19 With more and more steps, the values of (a, b, c) returned by gradient descent seem to be settling on a clear decision boundary.

Somewhere between 7,000 and 8,000 steps, the algorithm actually converges, meaning it finds a point where the length of the gradient is less than 10⁻⁶. Approximately speaking, that’s the minimum point we’re looking for:

>>> gradient_descent3(logistic_cost,1,1,1,max_steps=8000)
(3.7167003153580045, 11.422062409195114, 5.596878367305919)

We can see what this decision boundary looks like relative to the one we’ve been using (figure 15.20 shows the result):

plot_data(scaled_car_data)
plot_line(0.35,1,0.56)
plot_line(3.7167003153580045, 11.422062409195114, 5.596878367305919)

Figure 15.20 Comparing our previous best-guess decision boundary to the one implied by the result of gradient descent

This decision boundary isn’t too far off from our guess. The result of the logistic regression appears to have moved the decision boundary slightly downward from our guess, trading off a few false positives (Priuses that are now incorrectly above the line in figure 15.20) for a few more true positives (BMWs that are now correctly above the line).

15.5.3 Testing and understanding the best logistic classifier

We can easily plug these values for (a, b, c) into a logistic function and then use it to make a car classification function:

def best_logistic_classifier(x,p):
    l = make_logistic(3.7167003153580045, 11.422062409195114, 5.596878367305919)
    if l(x,p) > 0.5:
        return 1
    else:
        return 0

Plugging this function into the test_classifier function, we can see its accuracy rate on the test data set is about what we got from our best attempts, 80% on the dot:

>>> test_classifier(best_logistic_classifier,scaled_car_data)
0.8

The decision boundaries are fairly close, so it makes sense that the performance is not too far off of our guess from section 15.2. That said, if what we had previously was close, why did the decision boundary converge so decisively where it did?

It turns out logistic regression does more than simply find the optimal decision boundary. In fact, we saw a decision boundary early in the section that outperformed this best fit logistic classifier by 0.5%, so the logistic classifier doesn’t even maximize accuracy on the test data set. Rather, logistic regression looks holistically at the data set and finds the model that is most likely to be accurate given all of the examples. Rather than moving the decision boundary slightly to grab one or two more percentage points of accuracy on the test set, the algorithm orients the decision boundary based on a holistic view of the data set. If our data set is representative, we can trust our logistic classifier to do well on data it hasn’t seen yet, not just the data in our training set.

The other information that our logistic classifier has is an amount of certainty about every point it classifies. A classifier based only on a decision boundary is 100% certain that a point above that boundary is a BMW and that a point below it is a Prius. Our logistic classifier has a more nuanced view; we can interpret the values it returns between zero and one as the probability that a car is a BMW rather than a Prius. For real-world applications, it can be valuable to know not only the best guess from your machine learning model but also how trustworthy the model considers itself to be. If we were distinguishing benign tumors from malignant ones based on medical scans, we might act much differently if the algorithm told us it was 99% sure, as opposed to 51% sure, that a tumor was malignant.

The way certainty comes through in the shape of the classifier is in the magnitude of the coefficients (a, b, c). For instance, you can see that the ratio between the entries of our guess (0.35, 1, 0.56) is similar to the ratio between the optimal values (3.717, 11.42, 5.597); the optimal values are approximately ten times bigger than our best guess. The biggest effect of this difference is on the steepness of the logistic function. The optimal logistic function is much more certain of the decision boundary than the first one: as soon as you cross the decision boundary, its certainty of the result increases significantly, as figure 15.21 shows.
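
To see the difference in steepness numerically, we can compare the two logistic functions at the same point, which lies on the BMW side of both decision boundaries (a rough sketch; the outputs noted are approximate):

guess = make_logistic(0.35, 1, 0.56)
best = make_logistic(3.717, 11.42, 5.597)
guess(0.5, 0.5)   # roughly 0.53: barely more BMW-like than Prius-like
best(0.5, 0.5)    # roughly 0.88: much more confident at the same point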

Figure 15.21 The optimized logistic function is much steeper, meaning its certainty that a car is a BMW rather than a Prius increases rapidly as you cross the decision boundary.

In the final chapter, we’ll continue to use sigmoid functions to produce certainties of results between zero and one as we implement classification using neural networks.

15.5.4 Exercises

Exercise 15.11: Modify the gradient_descent3 function to print the total number of steps taken before it returns its result. How many steps does the gradient descent take to converge for logistic_cost?

Solution: All you need to do is add the line print(steps) right before gradient_descent3 returns its result:

def gradient_descent3(f,xstart,ystart,zstart,tolerance=1e-6,max_steps=1000):
    ...
    print(steps)
    return x,y,z

Running the following gradient descent

gradient_descent3(logistic_cost,1,1,1,max_steps=8000)

the number printed is 7244, meaning the algorithm converges in 7,244 steps.

  

Exercise 15.12-Mini Project: Write an approx_gradient function that calculates the gradient of a function in any number of dimensions. Then write a gradient_descent function that works in any number of dimensions. To test your gradient_descent on an n-dimensional function, you can try a function like f(x1, x2, ..., xn) = (x1 − 1)² + (x2 − 1)² + ... + (xn − 1)², where x1, x2, ..., xn are the n input variables to the function f. The minimum of this function should be (1, 1, ..., 1), an n-dimensional vector with the number 1 in every entry.

Solution: Let’s model our vectors of arbitrary dimension as lists of numbers. To take the partial derivative in the ith coordinate at a vector v = (v1, v2, ..., vn), we treat the ith coordinate as a variable xi and take an ordinary derivative with respect to it. That is, we want to look at the function

f(v1, v2, ..., vi−1, xi, vi+1, ..., vn)

which is, in other words, every coordinate of v plugged in to f, except the ith entry, which is left as the variable xi. This gives us a function of a single variable xi, and its ordinary derivative is the ith partial derivative. The code for partial derivatives looks like this:

def partial_derivative(f,i,v,**kwargs):
    def cross_section(x):
        arg = [(vj if j != i else x) for j,vj in enumerate(v)]
        return f(*arg)
    return approx_derivative(cross_section, v[i], **kwargs)

Note that our coordinates are zero-indexed, and the dimension of input to f is inferred from the length of v.

The rest of the work is easy by comparison. To build the gradient, we just take the n partial derivatives and put them in order in a list:

def approx_gradient(f,v,dx=1e-6):
    return [partial_derivative(f,i,v) for i in range(0,len(v))]

To do the gradient descent, we replace all of the manipulations of named coordinate variables, like x, y, and z, with list operations on the list vector of coordinates called v :

def gradient_descent(f,vstart,tolerance=1e-6,max_steps=1000):
    v = vstart
    grad = approx_gradient(f,v)
    steps = 0
    while length(grad) > tolerance and steps < max_steps:
        v = [(vi - 0.01 * dvi) for vi,dvi in zip(v,grad)]
        grad = approx_gradient(f,v)
        steps += 1
    return v

  

To implement the suggested test function, we can write a generalized version of it that takes any number of inputs and returns the sum of their squared difference from one:

def sum_squares(*v):
    return sum([(x - 1)**2 for x in v])

This function can’t be lower than zero because it’s a sum of squares, and a square cannot be less than zero. The value zero is obtained if every entry of the input vector v is one, so that’s the minimum. Our gradient descent confirms this (with only a small numerical error), so everything looks good! Note that because the starting vector v is 5D, all vectors in the computation are automatically 5D.

>>> v = [2,2,2,2,2]
>>> gradient_descent(sum_squares,v)
[1.0000002235452137,
 1.0000002235452137,
 1.0000002235452137,
 1.0000002235452137,
 1.0000002235452137]

  

Exercise 15.13-Mini Project: Attempt to run the gradient descent with the simple_logistic_cost cost function. What happens?

Solution: It does not appear to converge. The values of a, b, and c continue increasing without bound even though the decision boundary stabilizes. This means as the gradient descent explores more and more logistic functions, these are staying oriented in the same direction but becoming infinitely steep. It is incentivized to become closer and closer to most of the points, while neglecting the ones it has already mislabeled. As I mentioned, this can be solved by penalizing the incorrect classifications for which the logistic function is the most confident, and our logistic_cost function does that well.

Summary

  • Classification is a type of machine learning task where an algorithm is asked to look at unlabeled data points and identify each one as a member of a class. In our examples for this chapter, we looked at mileage and price data for used cars and wrote an algorithm to classify them either as 5 series BMWs or Toyota Priuses.

  • A simple way to classify vector data in 2D is to establish a decision boundary; that means drawing a literal boundary in the 2D space where your data lives, where points on one side of the boundary are classified in one class and points on the other side are classified in another. A simple decision boundary is a straight line.

  • If our decision boundary line takes the form ax + by = c, then the quantity ax + by − c is positive on one side of the line and negative on the other. We can interpret this value as a measure of how much the data point looks like a BMW. A positive value means that the data point looks like a BMW, while a negative value means that it looks more like a Prius.

  • The sigmoid function, defined by σ(x) = 1 / (1 + e^(−x)), takes numbers between -∞ and ∞ and crunches them into the finite interval from zero to one.

  • Composing the sigmoid with the function ax + by − c, we get a new function σ(ax + by − c) that also measures how much the data point looks like a BMW, but it only returns values between zero and one. This type of function is a logistic function in 2D.

  • The value between zero and one that a logistic classifier outputs can be interpreted as how confident it is that a data point belongs to one class versus another. For instance, return values of 0.51 or 0.99 would both indicate that the model thinks we’re looking at a BMW, but the latter would be a much more confident prediction.

  • With an appropriate cost function that penalizes confident, incorrect classifications, we can use gradient descent to find the logistic function of best fit. This is the best logistic classifier according to the data set.
