9. Regression

9.1 Introduction

Model fitting is the process of estimating a model’s parameters. In other chapters, we described linear regression as a specific example. Now, you’ll develop that example with a little more detail. You’ll find this same basic pattern throughout the book. It works well for linear regression, and you’ll see the same structure for more advanced models like deep neural networks.

Generally, the first step is to start plotting the data with scatter plots and histograms to see how it is distributed. When you’re plotting, you might see something like Figure 9.1. The constant slope to the data, even when there is a lot of noise around the trend, suggests that y is related linearly to x.

Figure 9.1 Random data from y = sin(10x), along with the line for y = sin(10x)

In Figure 9.1, you can see that there’s still a lot of spread to the y values at fixed values of x. While y increases with x on average, the predicted value of y will generally be a poor estimate of the true y value. This is okay. You need to take more predictors of y into account to explain the rest of the variation of y.
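For instance, a minimal exploration sketch with pandas (the data and column names here are illustrative, not the book's):

import numpy as np
import pandas as pd

# Hypothetical data frame with one predictor and one outcome
df = pd.DataFrame({'x': np.random.uniform(size=200)})
df['y'] = np.sin(10 * df['x']) + 0.1 * np.random.normal(size=200)

df.plot(x='x', y='y', style='bo', alpha=0.3)   # scatter-style plot of y against x
df.hist()                                      # histograms of each column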

9.1.1 Choosing the Model

Once you’ve explored the data distributions, you can write down your model for y as a function of x and some parameters. Our model will be as follows:

y = βx + ε    (9.1)

Here, y is the output, β is a parameter of the model, and ε captures the noise in the y value beyond what you would expect from the value of x alone. This noise can take on many different distributions. In the case of linear regression, you assume that it's Gaussian with constant variance. As part of the model-checking process, you should plot the difference between your model's output and the true value, or the residual, to see the distribution of ε. If it's much different from what you expected, you should consider using a different model (e.g., a generalized linear model).

You could write this model for each y value explicitly, yi (where i ranges from 1 to N when you have N data points), by writing it as follows:

y_i = βx_i + ε_i    (9.2)

You could instead write these as column vectors, where

y = βx + ε    (9.3)

and it will have the same meaning.

Often, you’ll have many different x variables, so you’ll want to write the following:

y_i = β_1 x_i1 + β_2 x_i2 + ⋯ + β_j x_ij + ε_i    (9.4)

Or you can write this more succinctly as the following:

y = Xβ + ε    (9.5)

9.1.2 Choosing the Objective Function

Once you’ve chosen the model, you need a way of saying that one choice of parameter values is better or worse than another choice. The parameters determine the model’s output for a given input, and the objective function scores that output by comparing it to known values.

A common objective function is the mean squared error. The error is just the difference between what the model thinks the y value should be, which you’ll call ŷ, and the actual output, y. You could look at the average error, but there’s a problem: ŷ could be arbitrarily far from y as long as for every positive error there is a negative error that cancels it out. You can see this because when you take the mean error (ME), you sum all the errors together, like so:

ME = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)    (9.6)

Instead, you’d like only positive values to go into the average, so you don’t get cancellation. An easy way to do this is to just square the errors before taking the average, giving you the mean squared error (MSE).

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²    (9.7)

Now, the smaller the mean squared error, the smaller the distance (on average) between the true y values and the model’s estimates of the y values. If you have two sets of parameters, you can use one set to calculate the ŷ and compute the MSE. You can do the same with the other set. The parameters that give you the smaller MSE are the better parameters!
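As a concrete illustration, here is a minimal sketch (the data and parameter values are ours, not the book's) that compares two candidate slopes for the model ŷ = βx by their MSE:

import numpy as np

# Hypothetical data roughly following y = 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

def mse(beta, x, y):
    """Mean squared error of the model y_hat = beta * x."""
    return np.mean((y - beta * x) ** 2)

# The parameter value with the smaller MSE is the better one.
print(mse(1.5, x, y), mse(2.0, x, y))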

You can automate the search for good parameters with an algorithm that iterates over choice of parameters and stops when it finds the ones that give the minimum MSE. This process is called fitting the model.

This isn’t the only choice of objective function you could make. You could have taken absolute values, for example, instead of squaring the errors. What is the difference?

It turns out that when you use MSE and you have data whose mean follows a linear trend, the model that you fit will return the average value of y at x. If you use the absolute value instead (calculating the mean absolute deviation, or MAD), the model reports the median value of y at x. Either might be useful, depending on the application. MSE is the more common choice.

9.1.3 Fitting

When we say fitting, we generally mean systematically varying the model parameters to find the values that give the smallest value of the objective function. Generally, some objective functions should be maximized (likelihood is one example, which you’ll see later), but here you’re interested in minimizing functions. If an algorithm is designed to minimize an objective function, then you can maximize one by running that algorithm on the negative of the objective function.

We’ll give a more thorough treatment later in the book, but a simple algorithm for minimizing a function is gradient descent. The basic idea is that at a maximum or minimum, the derivative of a function is zero. At a minimum, the derivative (slope) will be negative to the left of the minimum and positive to the right.

Consider the one-dimensional case. You start at a random value of β and try to figure out whether you should use a larger or smaller β. If the slope of the objective function points downward at the value of β you’ve chosen, then you should follow it, since it points you toward the minimum: you should adjust β to larger values. If the slope is positive, then the objective function decreases toward the left, and you should use a smaller value of β. You can derive this rigorously with a little more calculus, but the basic result is a rule for updating your value of β to find a better value. If you write your objective function as f(β), then you can write the update rule as follows:

β_new = β_old − λ (df/dβ)    (9.8)

You can see that if the derivative is positive, you decrease the value of β, and if it is negative, you increase the value of β. Here, λ is a parameter that controls how big the algorithm’s steps are. You want it to be small enough that you don’t jump from one side of a minimum clear across to the other. The best size will depend on the context.

Now, given a model and objective function, you have a procedure for choosing the best model parameters. You just write the objective function as a function of the model parameters and minimize it using this gradient descent method.
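Here is a minimal gradient descent sketch for fitting β in the model ŷ = βx with an MSE objective; the data, starting value, and step size are illustrative choices, not the book's:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 0.1 * rng.normal(size=200)

beta = 0.0    # starting guess
lam = 0.1     # step size, the lambda in Equation (9.8)
for _ in range(1000):
    grad = np.mean(-2 * x * (y - beta * x))  # d(MSE)/d(beta)
    beta = beta - lam * grad

print(beta)   # should land close to 2.0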

There are many ways you could do this minimization work. Instead of this iterative algorithm, you can write the objective function as a function of the parameters explicitly and use calculus to minimize it. There are also many variants of iterative algorithms like this one. Which is best for your job depends on the application.

9.1.4 Validation

After fitting your model, you need to test how well it works! You’ll typically do this by giving it some data points and comparing its output with the actual output. You can calculate a score, often with the same loss function you used to fit the model, and summarize the model performance. The problem with this procedure arises when you validate the model on the same data you used to train it. Let’s imagine an extreme case as an example.

Suppose your model had as much freedom as it liked to fit your data. Suppose also that you had only a few data points, like in Figure 9.1. If you try to fit a model to this data, you run the risk that the model overfits the data. To give you a parameter to adjust, you’ll try to make a polynomial regression fit to this data. You can do this by making columns of data with higher and higher powers of x.

You’ll plot a few different lines fit to the data with different powers of x, where you’ll label the power by k, in Figure 9.2. In this figure, you can see the fit improving as k gets higher, until around k = 5. After k = 5, there’s too much freedom in the model, and the function fine-tunes itself to the data set. Since this data is drawn from y = sin(10x), you can see that you’ll generalize much better to new data points with the k = 5 model, even though the k = 8 model fits the training data better. You say the k = 8 model is “overfit” to the data.

Figure 9.2 Polynomial fits of increasing degree k to random data drawn from y = sin(10x)

How, then, can you see when you’re overfitting a model? The typical way to check is to reserve some of the data at training time and use it later to see how well a model generalizes.

You can do this, for example, using the train_test_split function from the sklearn package. Let’s generate some new data and now compute the R2 not only on the training data but also on a reserved set of test data.

First, we generate N = 20 data points.

import numpy as np

N = 20
x = np.random.uniform(size=N)
y = np.sin(x*10) + 0.05 * np.random.normal(size=N)

Next, let’s split this into a train and test split. You’ll train on the training data but hold the test data until the end to compute the validation scores. Typically you want your model to perform well, so you use as much training data as you can afford to train the model and validate with enough data to be sure the precision of your validation metric is reasonable. Here, since you’re generating only 20 data points, you’ll make a split into halves.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=10)

Now, you’ll build up lots of columns of higher and higher powers of the x variable to run the polynomial regression. You’ll use sklearn’s linear regression to do the actual regression fit.

import pandas as pd

k = 10

X = pd.DataFrame({'x^{}'.format(i): x_train**i for i in range(k)})
X['y'] = y_train

X_test = pd.DataFrame({'x^{}'.format(i): x_test**i for i in range(k)})
X_test['y'] = y_test

# The prediction grid spans the observed range of x (xmin and xmax are
# assumed here to be the data bounds; they weren't defined in the original listing)
xmin, xmax = x.min(), x.max()
x_pred = np.arange(xmin, xmax, 0.001)
X_pred = pd.DataFrame({'x^{}'.format(i): x_pred**i for i in range(k)})

Now that you have train and test data preprocessed, let’s actually run the regressions and plot the results.

import matplotlib.pyplot as pp
from sklearn.linear_model import LinearRegression

f, axes = pp.subplots(1, k-1, sharey=True, figsize=(30, 3))

for i in range(k-1):
    model = LinearRegression()
    model = model.fit(X[['x^{}'.format(l) for l in range(i+1)]],
                      X['y'])
    model_y_pred = model.predict(
        X_pred[['x^{}'.format(l) for l in range(i+1)]])
    score = model.score(X[['x^{}'.format(l) for l in range(i+1)]],
                        X['y'])
    test_score = model.score(
        X_test[['x^{}'.format(l) for l in range(i+1)]],
        X_test['y'])
    axes[i].plot(x_pred, model_y_pred)
    axes[i].plot(x, y, 'bo')
    axes[i].set_title('k = {}, $R^2={}$, $R^2_t={}$'
                      .format(i, round(score, 3), round(test_score, 3)))
pp.ylim(-1.5, 1.5)

This produces the plots in Figure 9.3. You can see the R2 gets better and better as k increases, but the validation R2 doesn’t necessarily keep increasing. Past k = 4 it becomes noisy, sometimes increasing and sometimes decreasing. Here, the validation and the training data are plotted on the same graph. When the trend line fails to match the validation data, the validation R2 gets worse, even though the model does a good job of fitting the training data.

Figure 9.3 Polynomial fits of increasing degree k, with the training and validation data plotted together and both the training and validation R2 scores reported

These same basic principles can be applied in different ways. Another validation approach involves splitting the data into several slices and holding out one slice while training on the rest. This is called k-fold cross-validation. You can also train on all of the data but leave out a single data point at a time for validation. You can repeat this for each data point. This is called leave-one-out cross-validation.
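As an example, here is a minimal k-fold cross-validation sketch using scikit-learn's KFold; the data, the five-fold choice, and the model are illustrative, not prescribed by the text:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(size=100).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.1 * rng.normal(size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(x):
    model = LinearRegression().fit(x[train_idx], y[train_idx])
    scores.append(model.score(x[test_idx], y[test_idx]))  # R^2 on the held-out fold

print(np.mean(scores))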

Choosing a good model is a careful balance between giving your model enough freedom to describe the data but not so much that it overfits. You’ll see these themes come up throughout the book, but especially when you look at neural networks in Chapter 14.

9.2 Linear Least Squares

Least squares refers to minimizing the squares of the differences between the values in your data set and the values the model predicts at the same points. These differences are known as residuals.

Least squares estimation can handle models with multiple independent variables, but for simplicity it’s easier to consider simple linear equations.

In general, you want to fit a model consisting of m linear equations (one per data point) with k coefficients weighting the features.

To get the behavior of the model as close as possible to the function you’re fitting, you minimize the function ||y − Xβ||².

Here, X_ij is the jth feature of the ith data point. We show some data generated from such a model in Figure 9.4 (blue data points), along with the model that fits it (green line). Here, the coefficients have been fit to make sure the model has the correct slope (1/4) and y-intercept (2).

Figure 9.4 A curve fit to a randomly generated data set about the line y = x/4 + 2 with linear regression

 

Linear Regression Summary

The Algorithm

Linear regression is for predicting outcomes based on linear combinations of features. It’s great when you want interpretable feature weights, want possible causal interpretations, and don’t think there are interaction effects.

Time Complexity

O(C²N), with C features and N data points. It’s dominated by a matrix multiplication.

Memory Considerations

The feature matrix can get large when you have sparse features! Use sparse encodings when possible.

9.2.1 Assumptions

Ordinary least squares assumes the errors have zero mean and equal variance. This will be true if the errors are Gaussian around the average y value at each x. If they are not Gaussian but the data is linear, consider using a generalized linear model.

9.2.2 Complexity

This complexity estimate is based on the algebraic (closed-form) approach to linear least-squares model fitting.

You have to do the matrix multiplications given in the following formula:

β = (XᵀX)⁻¹Xᵀy    (9.9)

If N is the number of sample points and k is the number of features you’re fitting, then the complexity is O(k²N) (with N > k).
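As a sketch (not the book's code), you can evaluate this closed-form solution directly with numpy; in practice np.linalg.solve or np.linalg.lstsq is preferred over forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(3)
N, k = 1000, 3
X = rng.normal(size=(N, k))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal equations beta = (X^T X)^{-1} X^T y, solved without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)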

9.2.3 Memory Considerations

Large matrix multiplications are usually run with a divide-and-conquer algorithm. When it comes to matrix multiplications in particular, it often takes as much processing to send the matrix data across the network as it does to do the multiplication locally. This is why “communication-avoiding algorithms” like Cannon’s algorithm exist. They minimize the amount of communication required to parallelize the task. Cannon’s algorithm is among the most common and popular.

Part of Cannon’s algorithm requires the matrices to be broken into blocks. Since you choose the size of the blocks, you can choose the amount of memory required for this algorithm.

Another common approach is to just sample down the data matrix. Many common packages will report standard errors on regression coefficients. Use these metrics to help guide your choice of sample size.

Most often, you’ll train a large-scale linear regression with stochastic gradient descent (SGD). There’s a nice implementation for doing this in pyspark’s mllib package.
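As a single-machine stand-in for that idea (this is scikit-learn's SGDRegressor, not the pyspark API), a sketch might look like this:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(10000, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=10000)

# Fit a linear regression by stochastic gradient descent (squared-error loss by default)
model = SGDRegressor(max_iter=1000)
model.fit(X, y)
print(model.coef_)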

9.2.4 Tools

scipy.optimize.leastsq is a nice implementation in Python. This is also the method working under the hood for many of the other optimizations in scipy. scipy’s implementation is actually a loose set of bindings on top of MINPACK’s lmdif and lmder algorithms (which are written in Fortran). MINPACK is also available in C++, though. If you want to minimize the dependencies for your application, numpy also has an implementation that can be found at numpy.linalg.lstsq. You can find implementations for least-squares regression in scikit-learn, in sklearn.linear_model.LinearRegression, or in the statsmodels package in statsmodels.regression.linear_model.OLS.

9.2.5 A Distributed Approach

One of the keys to a distributed approach is minimizing the bandwidth used for communication across the network. Each node at the beginning of a mesh of nodes pulls the data it needs from a database or archival source. It shifts the rows and columns appropriately for the next node and passes that information along. Each node operates on its own data to compute some portion of the resulting multiplication. Figure 9.5 illustrates this architecture.

Figure 9.5 A possible architecture for computing distributed matrix multiplications

Alternatively, you can use the pyspark implementation!

9.2.6 A Worked Example

Now, let’s actually run a linear regression. We’ll use numpy and pandas to work with the data and statsmodels for the regression. First, let’s generate some data.

from statsmodels.api import OLS
import numpy as np
import pandas as pd

N = 1000

x1 = np.random.uniform(90, 100, size=N)
x2 = np.random.choice([1, 2, 3, 4, 5], p=[.5, .25, .1, .1, .05], size=N)
x3 = np.random.gamma(x2, 100)
x4 = np.random.uniform(-10, 10, size=N)

beta1 = 10.
beta2 = 2.
beta3 = 1.

# x4 is unmeasured in what follows, so it acts as noise in y
y = beta1 * x1 + beta3 * x3 + x4

X = pd.DataFrame({'$y$': y, '$x_1$': x1, '$x_2$': x2, '$x_3$': x3})

The independent variables are x1, x2, and x3. The dependent variable, y, will be determined by these independent variables. Just for concreteness, let’s use a specific example. Say the y variable is the monthly cost of living for a household and the x1 variable is the temperature outside for the month, which you expect is a big factor and will significantly increase their housing costs. It will be summer, so the hotter it is, the more expensive the cooling bill. The x2 variable will be the number of people living in the house, which directly leads to higher food costs, x3. Finally, x4 will be the amount spent on everything else. Families don’t tend to keep track of this very well, and you’re not able to measure it.

To get a feel for the data, you should make some plots and tables. You’ll start by looking directly at a random sample of the data, shown in Figure 9.6.

Figure 9.6 A sample of data simulated for this housing data example. Here x1 is the temperature outside, x2 is the number of people living in the house, and x3 is the cost of food for the month.

This gives you some ideas for visualization. You can see the x2 variable is small and discrete and that the y variable (total monthly expenses) tends to be a few thousand dollars.

To get a better picture of what variables are related to one another, you can make a correlation matrix (see Figure 9.7). The matrix is symmetric because the Pearson correlation between two variables doesn’t care which variable is which: the formula gives the same result if they trade their places. You’re curious which variables predict y, so you want to see which ones correlate well with it. From the matrix, you can just read the y column (or row) and see that x1, x2, and x3 are all correlated with y. Notice that since household size (x2) directly determines food cost (x3), these two variables are correlated with each other.

Figure 9.7 The correlation matrix for the housing data simulated earlier. You can see most variables are somewhat strongly correlated with each other, except for x1.

Now that you’ve confirmed that all of these variables might be useful for predicting y, you should check their distributions and scatter plots. This will let you check what model you should use. Pandas’ scatter matrix is great for this.

from pandas.plotting import scatter_matrix
scatter_matrix(X, figsize=(10, 10))

This produces the plots in Figure 9.8.

Figure 9.8 The scatter matrix for our simulated living expense data. The bottom row will be most important.

When you’re examining scatter plots, the standard convention is to place the dependent variable on the vertical axis and the independent variable on the horizontal axis. The bottom row of the scatter matrix has y on the vertical axis and the different x variables on the horizontal.

You can see there’s a lot of noise in the data, but this is okay. The important thing is that the average y value at each x value follows the model. In this case, you want the average y value at each x value to be increasing linearly with x.

Looking at the bottom-left plot in the matrix, you see the points don’t seem to be getting much higher or lower (on average) as you go from left to right in the graph. There seems to be only a weak relationship between x1 and y. Next, you see the data for x2 is in bands at integer values of x2. This is because x2 is a discrete variable. Since there are many other factors that determine the value of y, there is some spread to the data at each value of x2. If x2 were the only factor, then the y values would take on only one specific value at each x2 value. Instead, they follow a different distribution at each x2 value.

Next, you can see the plot of x3 and y. You can see again the data points seem to be increasing linearly (on average), but again there is a lot of noise around the trend. You can see the scatter plot is denser at smaller values of x3. If you compare the scatter plot with the histogram for x3, you can see that this is because most of the data points have smaller values of x3.

Often, when people first see data like the plot of y versus x3, they think a linear model can’t possibly work. After all, if you tried to make a prediction with the linear model, you’d do a very bad job. It would just guess the average y value at each x value. Since there is so much variation around the average y value at each x value, your predictions would tend to be far from the true values.

There are two reasons why these models are still very useful. The first is that you can include many more factors than these two-dimensional plots can show. While there is still a lot of variation around the average y value at each x in these scatter plots, there will tend to be less variation if we plot the data in higher dimensions (using more of the x variables). As we include more and more independent variables, our predictions will tend to get better.

Second, predictive power isn’t the only use for a model. Far more often, you’re interested in learning the model parameters rather than predicting a precise y value. If I learn, for example, that the coefficient on temperature is 10, then I know that for each degree of monthly average temperature increase, there is a ten dollar increase in monthly average living expenses. If I’m working at an energy company, that information might be useful for predicting revenue based on weather forecasts!

How can we see this interpretation of this parameter? Let’s look at our model again.

y = β_1 x_1 + β_3 x_3    (9.10)

What happens if you increase the month’s average temperature by one unit while keeping x3 fixed? You’ll see that y increases by β1 units! If you can measure β1 using a regression, then you can see how much temperature changes tend to increase living expenses.

Finally, you’ll examine the graph for y. It looks like the distribution takes on a bell curve, but it’s slightly skewed. Linear regression requires that the distribution of the errors is Gaussian, and this data will probably violate that assumption. You can use a generalized linear model with a gamma distribution to try to do a better job, but for now let’s see what you can do with the ordinary least squares model. Often, if the residuals are close enough to Gaussian, the error in the results is small. That will be the case here.

Let’s actually perform the regression now. It will be clear in a minute why we’ve omitted x2 from Equation (9.10). We’ll use statsmodels here. Statsmodels doesn’t include a y intercept, so you’ll add one yourself. A y intercept is just a constant value added on to each data point. You can introduce one by adding a column of all ones as a new variable. Then, when the fit finds the coefficient for that variable, that constant is simply added to each row! For a fun exercise, try omitting the intercept yourself, and see what happens.

X['intercept'] = 1.
model = OLS(X['$y$'], X[[u'$x_1$', u'$x_2$', u'$x_3$', 'intercept']])
result = model.fit()
result.summary()

Figure 9.9 shows the results. There are a lot of numbers in this table, so we’ll walk through the most important ones. You can confirm that you used the right variables by seeing the dependent variable (top of the table) is y, and the list of coefficients are for the independent variables you’re regressing on.

Figure 9.9 The results for the regression on all the measured independent variables (including a y-intercept)

Next, you have the R squared or R2. This is a measure of how well the independent variables account for variation in the dependent variable. If they fully explain the variation in y, then the R2 will be 1.

If they explain none of it, it will be zero (or close to zero, due to measurement error). It’s important to include a y intercept for the R2 to have a positive value and a clear interpretation. The R2 is actually just the proportion of the variance in y that is explained by the independent variables. If you define the “residuals” of the model as the left-over part of y that isn’t explained by the model,

r_i = y_i − ŷ_i    (9.11)

then the residuals are all 0 if the model fits perfectly. The worse the predictions, the more the residuals vary. So, the better y is explained by the data, the smaller the variance of the residuals. This leads to our intuition for the definition of R2,

R² = (SS_total − SS_residual) / SS_total    (9.12)

where SS_total = Σ_{i=1}^{N} (y_i − ȳ)² is the sum of squared differences between the y values and the average y value, ȳ, and SS_residual = Σ_{i=1}^{N} r_i² is the sum of squared residuals. You can make this more intuitive by dividing both sums of squares by N to get the (biased) estimates of the population variances, giving R² = (σ_y² − σ_r²) / σ_y², so the R² is just the fraction of the variance of y that the model explains.

You can do better than this by dividing by the correct number of degrees of freedom to get unbiased estimates for the variances. If k is the number of independent variables (excluding the y intercept), then the variance for y has N − 1 degrees of freedom, and the variance for the residuals has N − k − 1 degrees of freedom. The unbiased variance estimates are σ̂_y² = SS_total / (N − 1) and σ̂_r² = SS_residual / (N − k − 1). Plugging these into the formula, you get the definition of the adjusted R².

R²_adj = (σ̂_y² − σ̂_r²) / σ̂_y²    (9.13)

This is better to use in practice since the estimators for the variance are unbiased.
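A minimal numpy sketch of these two definitions (the function and variable names are ours), given observed values y, predictions y_hat, and k independent variables:

import numpy as np

def r_squared(y, y_hat, k):
    """Plain and adjusted R^2, following Equations (9.12) and (9.13)."""
    ss_total = np.sum((y - np.mean(y)) ** 2)
    ss_resid = np.sum((y - y_hat) ** 2)
    r2 = (ss_total - ss_resid) / ss_total

    n = len(y)
    var_y = ss_total / (n - 1)          # unbiased variance of y
    var_r = ss_resid / (n - k - 1)      # unbiased variance of the residuals
    r2_adj = (var_y - var_r) / var_y
    return r2, r2_adj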

Next, you can see in the table of coefficients (in the middle of the figure) that the coefficients (coef column) for x1 and x3 are very close to the values you entered for the beta variables when you created our data! They don’t match exactly because of all the noise from x4 that you’re not able to model (since you didn’t measure it).

Instead of trying to get an estimate for the value of the coefficient, it’s nicer to measure a confidence interval for the parameters! If you look to the right of the coefficients, you’ll see the columns 0.025 and 0.975. These are the 2.5 and 97.5 percentiles for the coefficients. These are the lower and upper bounds of the 95 percent confidence intervals! You can see that both of these contain the true values and that x3 is much more precisely measured than x1. Remember that x1 had a much weaker correlation with y than x3, and you could hardly see the linear trend from the scatter plot. It makes sense that you’ve measured it less precisely.

There’s another interesting problem here. In this example, x2 causes x3 but has no direct effect on y. Its only effect on y is through x3. This may seem like a contrived example, but a less extreme version of this will happen in real data sets. In general, there might be both direct and indirect effects of x2 on y. Notice that the measurement for the coefficient of x2 has a very wide confidence interval! This is the problem of collinearity in independent variables: generally, if independent variables are related to each other, their standard errors (and so their confidence intervals) will be large.

In this example, there’s actually a nuance that’s not clear because the collinearity is so strong. If the confidence interval on the x2 coefficient were narrower, you’d see it’s actually narrowed in around β2 = 0. This is because x3 contains all the information about x2 that is relevant for determining y. If you drop x3 from the regression (try it!), you’ll measure a coefficient for x2 where 80 < β2 < 118 at the 95 percent confidence level. You should proceed by dropping x2 from the regression entirely, as in the short sketch that follows. You’ll get the final result in Figure 9.10. Notice that the confidence interval for x3 has narrowed, but the R2 hasn’t changed significantly!
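A minimal sketch of that final regression, dropping x2 and keeping the manually added intercept:

model = OLS(X['$y$'], X[[u'$x_1$', u'$x_3$', 'intercept']])
result = model.fit()
result.summary()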

Figure 9.10 The results for the regression on the measured independent variables, excluding x2

You’ll use these concepts throughout the book. In particular, R2 is a good metric for evaluating any model that has a real-valued output. Now, you’ll examine your modeling assumptions. Working strictly with linear models can be too restrictive. You can do better, without having to go beyond ordinary least-squares regression.

9.3 Nonlinear Regression with Linear Regression

The model used for ordinary least squares regression might sound overly restrictive, but it can include nonlinear functions as well. If you’d like to fit y = x2 to your data, then instead of passing in x as your independent variable, you can simply square it and regress on x2. You can even use this approach to fit complicated functions like cos(x), or nonlinear functions of several variables, like x1sin(x2). The problem, of course, is that you have to come up with the function you’d like to regress on and compute it in order to do the regression.

Let’s do an example with some toy data. As usual, we’ll use numpy and pandas to work with the data, and we’ll use statsmodels for the regression. The first step is to generate the data.

N = 1000

x1 = np.random.uniform(-2.*np.pi, 2*np.pi, size=N)
y = np.cos(x1)

X = pd.DataFrame({'y': y, 'x1': x1})

Now, you’ll plot it to see what it looks like. The cos function will make this very nonlinear. It’s easy to see with a scatter plot.

X.plot(y='y', x='x1', style='bo', alpha=0.3, xlim=(-10, 10),
       ylim=(-1.1, 1.1), title='The plot of $y = cos(x_1)$')

This generates Figure 9.11.

Figure 9.11 The graph of y = cos(x1). A linear regression will fit this poorly, passing a line through the mean of the data at y = 0.

Without using the trick, a linear regression would try to fit this graph with the model y = β1x1, doing the best it can by fitting the mean. It would succeed by just finding β1 = 0. You can confirm this by trying it.

model = OLS(X['y'], X['x1'])
result = model.fit()
result.summary()

You get the result in Figure 9.12.

Figure 9.12 The result of a linear regression on y = cos(x). Notice that the coefficient is consistent with zero. If you allowed a y-intercept in this regression, you would have found that it was zero too.

You can see it’s just a flat line with slope zero since the coefficient on x1 is zero. Now, you can try the nonlinear version. First, you need to make a new column in the data.

X['cos(x1)'] = np.cos(X['x1'])

Now, you can regress on this new column to see how well it works.

model = OLS(X['y'], X[['cos(x1)']])
result = model.fit()
result.summary()

You find the result in Figure 9.13 on page 108.

Figure 9.13 The result of a linear regression on the transformed column cos(x). Now, we find the coefficient is consistent with 1, suggesting that the function is y = 1 · cos(x).

Many of the metrics diverge! The R2 is one, and the confidence interval on the coefficient is just a single point. This is because there’s no noise in the function. All the residuals are zero since you’ve fully explained the variation in y.

9.3.1 Uncertainty

One great thing about linear regression is that we understand noise in the parameter estimates very well. If you assume the noise term, ε, is drawn from a Gaussian distribution (or that sample sizes are large, so the central limit theorem applies), then you can get good estimates for the coefficient confidence intervals. These will be reported by statsmodels under the 0.025 and 0.975 columns for a 95 percent confidence interval.

What happens if these assumptions are violated? If that happens, you can’t necessarily trust the confidence intervals reported by statsmodels. The most common violations are either that the residuals are not Gaussian or that the variance of the residuals is not constant with the independent variable.

You can tell if the residuals are Gaussian using some of the summary statistics reported with your regression results. In particular, the skew, the kurtosis, and the Jarque-Bera statistic are all good tests for a Gaussian distribution; for a Gaussian, the skew is 0 and the kurtosis is 3. Significant deviations from these values indicate the distribution isn’t Gaussian.

The skew and kurtosis are combined to give the Jarque-Bera statistic. The Jarque-Bera test tests the null hypothesis that the data was drawn from a Gaussian distribution against the alternative that it was not. If Prob(JB) is large in your results, the residuals are consistent with having been drawn from a Gaussian distribution.

Another handy test is to plot a histogram of your residuals. These are stored on the resid attribute of your regression results. You can even plot them against your independent variable to check for heteroskedasticity.
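For example, a minimal diagnostic sketch, assuming result is a fitted statsmodels OLS result as above and x stands for whichever independent variable you want to check:

import matplotlib.pyplot as pp

resid = result.resid          # residuals of the fitted model
pp.hist(resid, bins=30)       # roughly Gaussian shape?
pp.show()

# Plot residuals against an independent variable to look for
# heteroskedasticity, i.e., spread that changes with x.
pp.plot(x, resid, 'bo', alpha=0.3)
pp.show()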

If you have non-Gaussian residuals, you can use a generalized linear model instead of a linear model. You should plot the residuals or conditional distributions of the data to find a more appropriate distribution to describe the residuals.

A solution when you have heteroskedasticity is to use robust standard errors. In some fields (e.g., economics), this is the default choice. The result of using robust standard errors is larger confidence intervals. In the presence of homoskedasticity, this results in overly conservative estimates of standard errors. To use robust confidence intervals in statsmodels, you can set the cov_type argument of the fit method of the OLS regression object.
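For instance (cov_type values such as 'HC1' are one common choice; check the statsmodels documentation for the options in your version):

result_robust = model.fit(cov_type='HC1')
result_robust.summary()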

9.4 Random Forest

Random forests [11], as the name suggests, consist of combinations of “decision trees.” In this section, you’ll learn about decision trees and see how they can combine to form random forests, which are a great first algorithm to try for machine-learning problems.

9.4.1 Decision Trees

You can try to fit functions using simple flow charts, or decision trees, as follows. Take a look at the function given by the blue curve in Figure 9.14.

Figure 9.14 An approximation to the function y = x2 on the interval [0,1] using a single decision

You could try to make some simple rules to approximate the function. A rough estimate might be “If x is greater than 0.5, y should be 0.58. If x is less than 0.5, y should be 0.08.” That would give the orange curve in Figure 9.14. While it’s not a very good estimate, we can see that it’s better than just guessing a uniform constant value! We’re just using the average value of the function over each range.

This is the beginning of the logic for building a decision tree. We can draw a diagram for this decision tree in Figure 9.15.

Figure 9.15 A diagram for a decision tree. You can imagine each data point falls down the tree, going along the branches indicated by the logic on each node.

Here, you start with all of the data at the top of the decision tree. Then, at the branch, you map all points with x < 0.5 to the left and all points with x ≥ 0.5 to the right. The left points get one output value, 0.08, and the right points get the other, 0.58.

You can make the problem more complex by adding more branches. With each branch, you can say “...and this other decision also holds.” For example, making one more decision for each of these branches, you get the tree in Figure 9.16.

Figure 9.16 You can fit data increasingly precisely by adding more decisions. As the complexity grows, you risk overfitting. Imagine the case where each data point maps to a single node!

This tree produces a better fit to the function, because you can subdivide the space more. You can see this in Figure 9.16.

One great strength of decision trees is their interpretability. You have a set of logical rules that lead to each outcome. You can even draw the tree diagram to walk a layperson through how the decision is made!

In this example, we chose our splitting points by dividing the domain in half. In general, decision tree algorithms choose each split to minimize a loss (usually the mean squared error). For that reason, you’d likely get finer splits where the curve is steeper and coarser splits where the curve is flatter.
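To make that concrete, here is a minimal sketch (our own, not the book's) of choosing a single split point to minimize the squared error when each side is predicted by its mean:

import numpy as np

def best_split(x, y):
    """Return the split point on x with the smallest total squared error."""
    best_s, best_loss = None, np.inf
    for s in np.unique(x):
        left, right = y[x < s], y[x >= s]
        if len(left) == 0 or len(right) == 0:
            continue
        loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s

x = np.random.uniform(size=200)
y = x ** 2
print(best_split(x, y))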

Decision trees also have to decide how many split points to use. They’ll keep splitting until there are very few data points left at each node and then stop. The nodes at the ends of the tree are the terminal nodes, or leaf nodes. After tree building stops, the algorithm decides which terminal nodes to prune away by balancing the loss against the number of terminal nodes. See [12] for more details.

You can use the implementation of decision trees in the sklearn package like this:

from sklearn.tree import DecisionTreeRegressor

# x and y here are the samples of y = x**2 on [0, 1] described in the text;
# sklearn expects a 2-D feature array, hence the reshape.
model = DecisionTreeRegressor()
model = model.fit(x.reshape(-1, 1), y)
y_decision_tree = model.predict([[xi] for xi in np.arange(0, 1, 0.1)])

Figure 9.17 shows the result. If you plot the decision tree’s predictions as a line over the original function, the overlap is too close to tell the lines apart!

Figure 9.17 A decision tree (blue line) fit using sklearn to data (red dots) from the function y = x2 on [0, 1]

This same process can be used for classification. If you have classes i ∊ {1, . . . , K}, then you can talk about the proportion of data points of class i at node k as p_ik. Then, the simple rule for classification at a leaf node k would just be to output the class i for which p_ik is the largest!

One issue with decision trees is that the split points can change substantially when you add or remove data points from your data set. The reason for that is all of the decisions that happen below a node depend on the split points above them! You can get rid of some of that instability by averaging the results of lots of trees together, which you’ll see in more detail in the next section. You lose interpretability by doing this since now you’re dealing with a collection of decision trees!

Another issue is that decision trees don’t handle class imbalance well. If they’re just looking to minimize a measure of prediction accuracy, then poor performance on a small subset of the data won’t hurt the model too much. A remedy for this is to balance the classes before fitting a decision tree. You can do that by sampling down the larger class while leaving the smaller class untouched or by drawing random samples from the smaller class (with replacement) to create a larger sample that repeats data points.
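For example, a minimal upsampling sketch with sklearn.utils.resample (the data and class sizes here are hypothetical):

import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced classes
X_major = np.random.normal(size=(900, 2))
y_major = np.zeros(900)
X_minor = np.random.normal(loc=2, size=(100, 2))
y_minor = np.ones(100)

# Upsample the minority class with replacement to match the majority class
X_minor_up, y_minor_up = resample(X_minor, y_minor, replace=True, n_samples=900)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.concatenate([y_major, y_minor_up])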

Finally, because decision trees make decisions at lines of constant values of variables (e.g., the decision “x > 0.5”), decision boundaries tend to be rectangles or combinations of rectangles that form together as a patchwork. They also tend to be oriented along the coordinate axes.

Now, let’s look at the performance increases we can get by averaging lots of decision trees together.

9.4.2 Random Forests

Random forests exploit a technique called bootstrap aggregating, or bagging. The idea is that you draw a random sample (a “bootstrap” sample) of your data with replacement and use that sample to train a decision tree. You repeat the process to train another tree, and another, and so on.

To predict the output for a single data point, you calculate the output for each decision tree. You average the results together (“aggregate” them), and that becomes the output of your model.

The hope is that you get a good sampling of the different possible trees that you can form due to the sensitivity of the decision trees to the sample, and you can average that variability by averaging over these trees.
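A minimal sketch of bagging by hand (the RandomForestRegressor used later in this section does this, and more, for you):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(x_train, y_train, x_new, n_trees=10):
    """Average the predictions of trees trained on bootstrap samples."""
    n = len(x_train)
    preds = []
    for _ in range(n_trees):
        idx = np.random.randint(0, n, size=n)        # bootstrap sample
        tree = DecisionTreeRegressor().fit(x_train[idx].reshape(-1, 1),
                                           y_train[idx])
        preds.append(tree.predict(x_new.reshape(-1, 1)))
    return np.mean(preds, axis=0)                    # aggregate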

You know that when you average random variables together, the error in the mean decreases as the number of samples increases, like σ_μ² = σ²/N, where σ_μ² is the variance of the mean and σ² is the variance of the output variable. That’s true for independent samples, but with dependent samples, the story is a little different.

When you bootstrap, you can regard each sample of data as a random sample that leads to a random output from a decision tree (after training on that sample). If you want to understand how the variance of the random forest output decreases with the number of trees, you can similarly look at this variance formula. Unfortunately, these tree outputs are not independent: every bootstrap sample is drawn from the same data set. You can show [12] that if the tree outputs are correlated with correlation ρ, then the variance in the output is like this:

σ_μ² = ρσ² + ((1 − ρ)/N) σ²    (9.14)

That means that as you increase the number of trees, N, the variance doesn’t keep decreasing as it does with independent samples. The lowest it can get is ρσ².

You can do best, then, by making sure the tree outputs are as uncorrelated as possible. One common trick is to sample a subset of the input data’s columns when training each tree. That way, each tree will tend to use a slightly different subset of variables!

Another trick is to try to force trees to be independent with an iterative procedure: after training each tree, have the next one predict the residuals left over from training the last. Repeat the procedure until termination. This way, the next tree will learn the weaknesses of the tree before it! This procedure is called boosting and can let you turn even very marginal learning algorithms into strong ones. This technique is not used in random forests but is used in many algorithms and can outperform random forests in some contexts, when enough weak learners are added.
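A minimal sketch of that residual-fitting idea (plain gradient boosting with shallow trees; the parameters are illustrative, and this is not the random forest procedure):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit_predict(x_train, y_train, x_new, n_rounds=50, lr=0.1):
    """Each round fits a small tree to the residuals left by the rounds so far."""
    x_train = x_train.reshape(-1, 1)
    x_new = x_new.reshape(-1, 1)
    pred_train = np.zeros(len(y_train))
    pred_new = np.zeros(len(x_new))
    for _ in range(n_rounds):
        residuals = y_train - pred_train
        tree = DecisionTreeRegressor(max_depth=2).fit(x_train, residuals)
        pred_train += lr * tree.predict(x_train)
        pred_new += lr * tree.predict(x_new)
    return pred_new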

Now, let’s see how to implement random forest regression. Let’s take the same data as before but add some noise to it. We’ll try a decision tree and a random forest and see which has the better performance on a test set.

First, let’s generate new data, adding some random noise drawn from a Gaussian distribution. This data is plotted in Figure 9.18.
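The generating code isn't reproduced here; a minimal sketch consistent with Figure 9.18 (the sample size and noise scale are our guesses) might be:

import numpy as np

N = 200
x = np.random.uniform(0, 1, size=N)
y = x ** 2 + 0.1 * np.random.normal(size=N)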

Figure 9.18 Noise added to the function y = x2. The goal is to fit the average value at each x, or E[y|x].

Now, you can train the random forest and the decision tree on this data. You’ll create test and train sets by randomly dividing your data into a training set and a test set and by training a decision tree and a random forest on the training set. You can start with just one decision tree in the random forest.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

x_train, x_test, y_train, y_test = train_test_split(x, y)

decision_tree = DecisionTreeRegressor()
decision_tree = decision_tree.fit(x_train.reshape(-1, 1), y_train)

random_forest = RandomForestRegressor(n_estimators=1)
random_forest = random_forest.fit(x_train.reshape(-1, 1), y_train)

Then, you can check the model performance on the test set by comparing the R2 of each model. You can compute the R2 of the decision tree like this:

decision_tree.score(x_test.reshape(-1, 1), y_test)

and of the random forest like this:

random_forest.score(x_test.reshape(-1, 1), y_test)

The decision tree has an R2 = 0.80, and the random forest with one tree gives R2 = 0.80. They perform almost the same, differing only in later decimal places. Of course, the random forest can have many more than just one tree. Let’s see how it improves as you add trees. You’ll want to measure the error bars on these estimates since the random forests train by bootstrapping.

scores = []
score_std = []
for num_tree in range(10):
    sample = []
    for i in range(10):
        decision_tree = DecisionTreeRegressor()
        decision_tree = decision_tree.fit(x_train.reshape(-1, 1),
                                          y_train)

        random_forest = RandomForestRegressor(n_estimators=num_tree + 1)
        random_forest = random_forest.fit(x_train.reshape(-1, 1),
                                          y_train)
        sample.append(random_forest.score(x_test.reshape(-1, 1),
                                          y_test))
    scores.append(np.mean(sample))
    score_std.append(np.std(sample) / np.sqrt(len(sample)))

You can use this data to get the plot in Figure 9.19. The performance increases sharply as you go from one to two trees but doesn’t increase very much after that. Performance plateaus around five or six trees in this example.

Figure 9.19 You can see here that as the complexity of the random forest (number of trees) increases, you do an increasingly good job of fitting the data (R2 increases). There’s a point of vanishing returns, and forests with more than seven trees don’t seem to do much better than forests with seven trees.

You might have expected this kind of result. There’s unexplainable noise in the data set because of the noise we added. You shouldn’t be able to perform better than R2 = 0.89. Even then, you’re limited by other factors like the heuristics involved in fitting the model and the dependence between trees in the forest.

9.5 Conclusion

In this chapter, you saw the basic tools used to do regression analysis. You learned how to choose a model based on the data and how to fit it. You saw some basic examples for context and how to interpret the model parameters. At this point, you should feel relatively comfortable building basic models yourself!
