Multiple linear regression

So far, we have been working with one dependent variable and one independent variable. Nevertheless, it is not unusual to have several independent variables that we want to include in our model. Some examples could be:

  • Perceived quality of wine (dependent) and acidity, density, alcohol level, residual sugar, and sulphates content (independent variables)
  • A student's average grades (dependent) and family income, distance from home to school, and mother's education (independent variables)

We can easily extend the simple linear regression model to deal with more than one independent variable. We call this model multiple linear regression or less often multivariable linear regression (not to be confused with multivariate linear regression, the case where we have multiple dependent variables).

In a multiple linear regression model, we model the mean of the dependent variable as follows:

$$\mu = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m$$
Notice that this looks similar to polynomial regression, but it is not exactly the same. For multiple linear regression, we have different variables instead of successive powers of the same variable. From the point of view of multiple linear regression, we can say that a polynomial regression is like a multiple linear regression, but with made-up variables.
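To make that connection concrete, here is a small illustrative sketch (the names x and X_poly are made up for this example): a quadratic polynomial regression on a single variable x can be written as a multiple linear regression whose two "variables" are x and x**2:

import numpy as np

x = np.linspace(0, 1, 5)
# Treat x and x**2 as if they were two separate independent variables
X_poly = np.stack([x, x**2], axis=1)
print(X_poly.shape)  # (5, 2): 5 observations, 2 "variables"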

Using linear algebra notation, we can write a shorter version:

$$\mu = \alpha + X\beta$$
Here, β is a vector of coefficients of length m, where m is the number of independent variables. The variable X is a matrix of size n×m, where n is the number of observations and m is the number of independent variables. If you are a little rusty with your linear algebra, you may want to check the Wikipedia article about the dot product between two vectors and its generalization to matrix multiplication. Basically, what you need to know is that we are just using a shorter and more convenient way to write our model:

$$\mu = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m = \alpha + X\beta$$
Using the simple linear regression model, we find a straight line that (hopefully) explains our data. Under the multiple linear regression model we find, instead, a hyperplane of dimension m. Thus, the multiple linear regression model is essentially the same as the simple linear regression model, the only difference being that now β is a vector and X is a matrix.
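As a quick sanity check of the notation, we can verify with NumPy that the matrix form gives the same result as the written-out sum. This is a minimal sketch with made-up values; X_demo, alpha_demo, and beta_demo are illustrative and not part of the model we will build below:

import numpy as np

# A made-up example: n=3 observations and m=2 independent variables
X_demo = np.array([[1.0, 4.0],
                   [2.0, 5.0],
                   [3.0, 6.0]])
alpha_demo = 0.5
beta_demo = np.array([0.9, 1.5])

# Written-out form: alpha + beta_1 * x_1 + beta_2 * x_2
mu_expanded = alpha_demo + beta_demo[0] * X_demo[:, 0] + beta_demo[1] * X_demo[:, 1]
# Matrix form: alpha + X @ beta (equivalent to np.dot(X_demo, beta_demo))
mu_matrix = alpha_demo + X_demo @ beta_demo

print(np.allclose(mu_expanded, mu_matrix))  # True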

Let's define our data:

import numpy as np

np.random.seed(314)
N = 100
alpha_real = 2.5
beta_real = [0.9, 1.5]
eps_real = np.random.normal(0, 0.5, size=N)

# Two independent variables drawn from normals with means 10 and 2
# (and standard deviations 1 and 1.5); we also keep a centered copy of X
X = np.array([np.random.normal(i, j, N) for i, j in zip([10, 2], [1, 1.5])]).T
X_mean = X.mean(axis=0, keepdims=True)
X_centered = X - X_mean
y = alpha_real + np.dot(X, beta_real) + eps_real

Now, we are going to define a convenient function to plot three scatter plots: two between each independent variable and the dependent variable, and a third one between the two independent variables. This is nothing fancy at all, just a function we will use a couple of times during the rest of this chapter:

import matplotlib.pyplot as plt

def scatter_plot(x, y):
    plt.figure(figsize=(10, 10))
    # One scatter plot of each independent variable against y
    for idx, x_i in enumerate(x.T):
        plt.subplot(2, 2, idx+1)
        plt.scatter(x_i, y)
        plt.xlabel(f'x_{idx+1}')
        plt.ylabel('y', rotation=0)

    # Scatter plot of the two independent variables against each other
    plt.subplot(2, 2, idx+2)
    plt.scatter(x[:, 0], x[:, 1])
    plt.xlabel(f'x_{idx}')
    plt.ylabel(f'x_{idx+1}', rotation=0)

Using the scatter_plot function we just defined, we can visualize our synthetic data:

scatter_plot(X_centered, y) 
Figure 3.20

Now, let's use PyMC3 to define a model that's suitable for multiple linear regression. As expected, the code looks pretty similar to what we used for simple linear regression. The main differences are:

  • The variable β is a Gaussian with shape=2, one slope for each independent variable
  • We define the variable μ using the dot product function pm.math.dot()

If you are familiar with NumPy, you probably know that NumPy includes a dot function, and that from Python 3.5 (and NumPy 1.10) the matrix multiplication operator, @, is also available. Nevertheless, here we use the dot function from PyMC3, which is just an alias for a Theano matrix multiplication operator. We do so because the variable β is a Theano tensor and not a NumPy array:

import pymc3 as pm

with pm.Model() as model_mlr:
    α_tmp = pm.Normal('α_tmp', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1, shape=2)
    ϵ = pm.HalfCauchy('ϵ', 5)

    # Mean of y, expressed on the scale of the centered data
    μ = α_tmp + pm.math.dot(X_centered, β)

    # Recover the intercept on the scale of the original, uncentered X
    α = pm.Deterministic('α', α_tmp - pm.math.dot(X_mean, β))

    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)

    trace_mlr = pm.sample(2000)
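Because the model is fitted on the centered predictors, α_tmp is the intercept with respect to X_centered, and the pm.Deterministic line maps it back to the scale of the original X. The following minimal NumPy sketch shows why this works; it reuses the arrays defined earlier, and alpha_tmp_demo is just an arbitrary illustrative value, not a fitted parameter:

# alpha_tmp + X_centered·beta is the same as (alpha_tmp - X_mean·beta) + X·beta
beta_true = np.array(beta_real)
alpha_tmp_demo = 1.0
mu_centered = alpha_tmp_demo + np.dot(X_centered, beta_true)
mu_original = (alpha_tmp_demo - np.dot(X_mean, beta_true)) + np.dot(X, beta_true)
print(np.allclose(mu_centered, mu_original))  # True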

Let's summarize the inferred parameter values for easier analysis of the results. How well did the model do?

import arviz as az

varnames = ['α', 'β', 'ϵ']
az.summary(trace_mlr, var_names=varnames)

        mean    sd     mc error   hpd 3%   hpd 97%   eff_n    r_hat
α[0]    1.86    0.46   0.0        0.95     2.69      5251.0   1.0
β[0]    0.97    0.04   0.0        0.89     1.05      5467.0   1.0
β[1]    1.47    0.03   0.0        1.40     1.53      5464.0   1.0
ϵ       0.47    0.03   0.0        0.41     0.54      4159.0   1.0

As we can see, our model is capable of recovering the correct values (check the values used to generate the synthetic data).
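If you want to make that check explicit in code, a minimal sketch (assuming trace_mlr and the true values defined earlier are still available) is to put the posterior means of the slopes next to the values used to simulate the data:

# Posterior means of the slopes versus the slopes used to generate the data
print(az.summary(trace_mlr, var_names=['β'])['mean'].values)  # roughly [0.97, 1.47]
print(beta_real)                                              # [0.9, 1.5]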

In the following sections, we are going to focus on some precautions we should take when analyzing the results of a multiple regression model, especially the interpretation of the slopes. One important message to take home is that, in multiple linear regression, each parameter only makes sense in the context of the other parameters.
