Multiple linear regression

So far, we have been working with one dependent variable and one independent variable. Nevertheless, it is not unusual to have several independent variables that we want to include in our model. Some examples could be:

  • Perceived quality of wine (dependent) and acidity, density, alcohol level, residual sugar, and sulphates content (independent variables)
  • A student's average grades (dependent) and family income, distance from home to school, and mother's education (independent variables)

We can easily extend the simple linear regression model to deal with more than one independent variable. We call this model multiple linear regression or less often multivariable linear regression (not to be confused with multivariate linear regression, the case where we have multiple dependent variables).

In a multiple linear regression model, we model the mean of the dependent variable as follows:

$$\mu = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m$$
Notice that this looks similar to polynomial regression, but it is not exactly the same. For multiple linear regression, we have different variables instead of successive powers of the same variable. From the point of view of multiple linear regression, we can say that a polynomial regression is like a multiple linear regression, but with made-up variables.
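To make that connection concrete, here is a small illustrative sketch (the names x and X_poly are made up for this example): a quadratic polynomial regression on a single variable x can be written as a multiple linear regression whose two "variables" are x and x**2:

import numpy as np

x = np.linspace(0, 1, 5)
# Treat x and x**2 as if they were two separate independent variables
X_poly = np.stack([x, x**2], axis=1)
print(X_poly.shape)  # (5, 2): 5 observations, 2 "variables"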

Using linear algebra notation, we can write a shorter version:

$$\mu = \alpha + X\beta$$
Here, β is a vector of coefficients of length m, where m is the number of independent variables. The variable X is a matrix of size n×m, where n is the number of observations and m is the number of independent variables. If you are a little rusty with your linear algebra, you may want to check the Wikipedia article about the dot product between two vectors and its generalization to matrix multiplication. Basically, what you need to know is that we are just using a shorter and more convenient way to write our model:

$$\mu = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m = \alpha + X\beta$$
Using the simple linear regression model, we find a straight line that (hopefully) explains our data. Under the multiple linear regression model we find, instead, a hyperplane of dimension m. Thus, the multiple linear regression model is essentially the same as the simple linear regression model, the only difference being that now β is a vector and X is a matrix.
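As a quick sanity check of the notation, we can verify with NumPy that the matrix form gives the same result as the written-out sum. This is a minimal sketch with made-up values; X_demo, alpha_demo, and beta_demo are illustrative and not part of the model we will build below:

import numpy as np

# A made-up example: n=3 observations and m=2 independent variables
X_demo = np.array([[1.0, 4.0],
                   [2.0, 5.0],
                   [3.0, 6.0]])
alpha_demo = 0.5
beta_demo = np.array([0.9, 1.5])

# Written-out form: alpha + beta_1 * x_1 + beta_2 * x_2
mu_expanded = alpha_demo + beta_demo[0] * X_demo[:, 0] + beta_demo[1] * X_demo[:, 1]
# Matrix form: alpha + X @ beta (equivalent to np.dot(X_demo, beta_demo))
mu_matrix = alpha_demo + X_demo @ beta_demo

print(np.allclose(mu_expanded, mu_matrix))  # True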

Let's define our data:

import numpy as np

np.random.seed(314)
N = 100
alpha_real = 2.5
beta_real = [0.9, 1.5]
eps_real = np.random.normal(0, 0.5, size=N)

# Two independent variables drawn from normals with means 10 and 2
# (and standard deviations 1 and 1.5); we also keep a centered copy of X
X = np.array([np.random.normal(i, j, N) for i, j in zip([10, 2], [1, 1.5])]).T
X_mean = X.mean(axis=0, keepdims=True)
X_centered = X - X_mean
y = alpha_real + np.dot(X, beta_real) + eps_real

Now, we are going to define a convenient function to plot three scatter plots: two between each independent variable and the dependent variable, and a third one between the two independent variables. This is nothing fancy at all, just a function we will use a couple of times during the rest of this chapter:

import matplotlib.pyplot as plt

def scatter_plot(x, y):
    plt.figure(figsize=(10, 10))
    # One scatter plot of each independent variable against y
    for idx, x_i in enumerate(x.T):
        plt.subplot(2, 2, idx+1)
        plt.scatter(x_i, y)
        plt.xlabel(f'x_{idx+1}')
        plt.ylabel('y', rotation=0)

    # Scatter plot of the two independent variables against each other
    plt.subplot(2, 2, idx+2)
    plt.scatter(x[:, 0], x[:, 1])
    plt.xlabel(f'x_{idx}')
    plt.ylabel(f'x_{idx+1}', rotation=0)

Using the scatter_plot function we just defined, we can visualize our synthetic data:

scatter_plot(X_centered, y) 
Figure 3.20

Now, let's use PyMC3 to define a model that's suitable for multiple linear regression. As expected, the code looks pretty similar to what we used for simple linear regression. The main differences are:

  • The variable β is a Gaussian with shape=2, one slope for each independent variable
  • We define the variable μ using the dot product function pm.math.dot()

If you are familiar with NumPy, you probably know that NumPy includes a dot function, and that from Python 3.5 (and NumPy 1.10) the matrix multiplication operator, @, is also available. Nevertheless, here we use the dot function from PyMC3, which is just an alias for a Theano matrix multiplication operator. We do so because the variable β is a Theano tensor and not a NumPy array:

import pymc3 as pm

with pm.Model() as model_mlr:
    α_tmp = pm.Normal('α_tmp', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1, shape=2)
    ϵ = pm.HalfCauchy('ϵ', 5)

    # Mean of y, expressed on the scale of the centered data
    μ = α_tmp + pm.math.dot(X_centered, β)

    # Recover the intercept on the scale of the original, uncentered X
    α = pm.Deterministic('α', α_tmp - pm.math.dot(X_mean, β))

    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)

    trace_mlr = pm.sample(2000)
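Because the model is fitted on the centered predictors, α_tmp is the intercept with respect to X_centered, and the pm.Deterministic line maps it back to the scale of the original X. The following minimal NumPy sketch shows why this works; it reuses the arrays defined earlier, and alpha_tmp_demo is just an arbitrary illustrative value, not a fitted parameter:

# alpha_tmp + X_centered·beta is the same as (alpha_tmp - X_mean·beta) + X·beta
beta_true = np.array(beta_real)
alpha_tmp_demo = 1.0
mu_centered = alpha_tmp_demo + np.dot(X_centered, beta_true)
mu_original = (alpha_tmp_demo - np.dot(X_mean, beta_true)) + np.dot(X, beta_true)
print(np.allclose(mu_centered, mu_original))  # True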

Let's summarize the inferred parameter values for easier analysis of the results. How well did the model do?

import arviz as az

varnames = ['α', 'β', 'ϵ']
az.summary(trace_mlr, var_names=varnames)

        mean    sd     mc error   hpd 3%   hpd 97%   eff_n    r_hat
α[0]    1.86    0.46   0.0        0.95     2.69      5251.0   1.0
β[0]    0.97    0.04   0.0        0.89     1.05      5467.0   1.0
β[1]    1.47    0.03   0.0        1.40     1.53      5464.0   1.0
ϵ       0.47    0.03   0.0        0.41     0.54      4159.0   1.0

As we can see, our model is capable of recovering the correct values (check the values used to generate the synthetic data).
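If you want to make that check explicit in code, a minimal sketch (assuming trace_mlr and the true values defined earlier are still available) is to put the posterior means of the slopes next to the values used to simulate the data:

# Posterior means of the slopes versus the slopes used to generate the data
print(az.summary(trace_mlr, var_names=['β'])['mean'].values)  # roughly [0.97, 1.47]
print(beta_real)                                              # [0.9, 1.5]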

In the following sections, we are going to focus on some precautions we should take when analyzing the results of a multiple regression model, especially the interpretation of the slopes. One important message to take home is that, in multiple linear regression, each parameter only makes sense in the context of the other parameters.
