The core of linear regression models

Now that we have discussed some general ideas about linear regression and we have also established a bridge between the vocabulary used in statistics and ML, let's begin to learn how we can build linear models.

The chances are high that you are already familiar with the following equation:

y = α + βx

This equation says that there is a linear relationship between the variable x and the variable y. The parameter β controls the slope of the line and thus is interpreted as the change in the variable y per unit change in the variable x. The other parameter, α, is known as the intercept, and tells us the value of y when x = 0. Graphically, the intercept is the value of y at the point where the line intercepts the y axis.

There are several ways to find the parameters for a linear model; one method is known as least squares fitting. Least squares returns the values of α and β yielding the lowest average quadratic error between the observed y and the predicted ŷ. Expressed in that way, the problem of estimating α and β is an optimization problem, that is, a problem where we try to find the minima or maxima of some function. An alternative to optimization is to build a fully probabilistic model. Thinking probabilistically gives us several advantages; we can obtain the best values of α and β (the same as with optimization methods) together with an estimation of the uncertainty we have about the parameters' values. Optimization methods require extra work to provide this information. Additionally, the probabilistic approach, especially when done using tools such as PyMC3, gives us the flexibility to adapt models to particular problems, as we will see throughout this chapter.
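As a concrete illustration of the optimization view, the least squares problem for a single predictor even has a closed-form solution. The following is a minimal sketch of my own (the helper name least_squares is not from the book) using NumPy:

import numpy as np

def least_squares(x, y):
    """Return the intercept and slope that minimize the mean of (y - ŷ)**2."""
    beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # cov(x, y) / var(x)
    alpha_hat = y.mean() - beta_hat * x.mean()
    return alpha_hat, beta_hat

# np.polyfit(x, y, deg=1) returns the same estimates as [slope, intercept]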

Probabilistically, a linear regression model can be expressed as follows:

y ∼ N(μ = α + βx, σ = ϵ)

That is, the data vector y is assumed to be distributed as a Gaussian with a mean of α + βx and a standard deviation of ϵ.

A linear regression model is an extension of the Gaussian model where the mean is not directly estimated but rather computed as a linear function of a predictor variable and some additional parameters.

Since we do not know the values of α, β, or ϵ, we have to set prior distributions for them. A reasonable generic choice would be:

α ∼ N(μ_α, σ_α)
β ∼ N(μ_β, σ_β)
ϵ ∼ |N(0, σ_ϵ)|

For the prior over α, we can use a very flat Gaussian by setting σ_α to a value that is relatively large for the scale of the data. In general, we do not know where the intercept can be, and its value can vary a lot from one problem to another and across domains. For many of the problems I have worked on, α is usually centered around zero and with a σ_α no larger than 10, but this is just my (almost anecdotal) experience with a small subset of problems and not something that easily transfers to other problems. Regarding the slope, it may be easier to have a general idea of what to expect than for the intercept. For many problems, we can at least know the sign of the slope a priori; for example, we expect the variable weight to increase, on average, with the variable height. For ϵ, we can set σ_ϵ to a large value on the scale of the variable y, for example, ten times the value of its standard deviation. These very vague priors guarantee a very small effect of the prior on the posterior, which is easily overcome by the data.
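To get a feel for how vague such priors are, we can simply plot their densities next to the scale of the data. The following sketch is my own, and the specific scales (10 for α, 1 for β, and 10 for ϵ) are only illustrative assumptions:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Plot the prior densities over a wide grid to see how spread out they are
grid = np.linspace(-40, 40, 500)
plt.plot(grid, stats.norm(0, 10).pdf(grid), label='α ∼ N(0, 10)')
plt.plot(grid, stats.norm(0, 1).pdf(grid), label='β ∼ N(0, 1)')
plt.plot(grid, stats.halfnorm(scale=10).pdf(grid), label='ϵ ∼ |N(0, 10)|')
plt.xlabel('parameter value')
plt.legend()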

The point estimate obtained using the least squares method agrees with the maximum a posteriori (MAP) estimate, that is, the mode of the posterior, from a Bayesian simple linear regression with flat priors.
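We can verify this claim numerically. The following sketch is my own (the data and names such as model_flat are made up for this check): it compares the least squares fit obtained with np.polyfit against the MAP of a PyMC3 model with completely flat priors; the two sets of estimates should essentially coincide:

import numpy as np
import pymc3 as pm

np.random.seed(314)
x_tmp = np.random.normal(10, 1, 50)
y_tmp = 2.5 + 0.9 * x_tmp + np.random.normal(0, 0.5, 50)

# Least squares fit: np.polyfit returns [slope, intercept] for deg=1
beta_ls, alpha_ls = np.polyfit(x_tmp, y_tmp, deg=1)

# The same data with improper flat priors on all parameters
with pm.Model() as model_flat:
    α = pm.Flat('α')
    β = pm.Flat('β')
    ϵ = pm.HalfFlat('ϵ')
    pm.Normal('y_obs', mu=α + β * x_tmp, sd=ϵ, observed=y_tmp)
    map_est = pm.find_MAP()

print(alpha_ls, beta_ls)            # least squares estimates
print(map_est['α'], map_est['β'])   # MAP estimates, essentially the same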

A couple of alternatives to the half-Gaussian distribution for ϵ are the uniform or the half-Cauchy distributions. The half-Cauchy distribution generally works well as a good regularizing prior (see Chapter 6, Model Comparison, for details), while uniform distributions are generally not a very good choice unless you know that the parameter is truly restricted by hard boundaries. If we want to use really strong priors around some specific value for the standard deviation, we can use the gamma distribution. The default parameterization of the gamma distribution in many packages can be a little bit confusing at first, but fortunately PyMC3 allows us to define it using the shape and rate (probably the most common parameterization) or the mean and standard deviation (probably a more intuitive parameterization, at least for newcomers).
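For example, here is a small sketch of my own (the numbers are arbitrary) showing the two parameterizations side by side; both lines define exactly the same Gamma distribution, with a mean of 2 and a standard deviation of 1:

import pymc3 as pm

with pm.Model():
    # shape/rate parameterization: mean = alpha/beta = 2, sd = sqrt(alpha)/beta = 1
    sd_a = pm.Gamma('sd_a', alpha=4, beta=2)
    # mean/standard deviation parameterization of the same distribution
    sd_b = pm.Gamma('sd_b', mu=2, sd=1)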

To see what the gamma and other distributions look like, you can check out the PyMC3 documentation here: https://docs.pymc.io/api/distributions/continuous.html.

Going back to the linear regression model, we can use the nice and easy-to-interpret Kruschke diagrams to represent it, as in Figure 3.1. You may remember from the previous chapter that in Kruschke diagrams we use the symbol = to define deterministic variables (such as μ) and ∼ to define stochastic variables, such as α, β, and ϵ:

Figure 3.1: Kruschke diagram of the simple linear regression model

Now that we have defined the model, we need data to feed it. Once again, we are going to rely on a synthetic dataset to build intuition about the model. One advantage of a synthetic dataset is that we know the true values of the parameters, so we can check whether we are able to recover them with our models:

import numpy as np
import matplotlib.pyplot as plt
import arviz as az

# True parameter values for the synthetic data
np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)

# Noisy observations scattered around the true line
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real

# Left: data and the true line; right: KDE of y
_, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].plot(x, y, 'C0.')
ax[0].set_xlabel('x')
ax[0].set_ylabel('y', rotation=0)
ax[0].plot(x, y_real, 'k')
az.plot_kde(y, ax=ax[1])
ax[1].set_xlabel('y')
plt.tight_layout()
Figure 3.2: Synthetic data with the true line (left) and a KDE of the y variable (right)

Now, we will use PyMC3 to build and fit the model—nothing we haven't seen before. Wait—in fact, there is something new! μ is expressed in the model as a deterministic variable, reflecting what we have already written in mathematical notation and in the Kruschke diagram. If we specify a PyMC3 deterministic variable, we are telling PyMC3 to compute and save the variable in the trace:

import pymc3 as pm

with pm.Model() as model_g:
    # Vague priors for the intercept, slope, and noise scale
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1)
    ϵ = pm.HalfCauchy('ϵ', 5)

    # Linear model for the mean, saved in the trace as a deterministic variable
    μ = pm.Deterministic('μ', α + β * x)
    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)

    trace_g = pm.sample(2000, tune=1000)

Alternatively, we can omit the explicit deterministic variable. In that case, μ will still be computed as part of the model, but it will not be saved in the trace. For example, we could have written the following:

y_pred = pm.Normal('y_pred', mu= α + β * x, sd=ϵ, observed=y) 
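If we take this second route and later decide we want μ after all, we can still reconstruct it from the posterior samples of α and β. This is a small sketch of my own, assuming trace_g and x from the previous code are available:

# Each row is one posterior draw of μ evaluated at every observed x
mu_post = trace_g['α'][:, None] + trace_g['β'][:, None] * x  # shape (samples, N)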

To explore the results of our inference, we are going to generate a trace plot, omitting the deterministic variable μ. We can do this by passing the names of the variables we want to include in the plot as a list to the var_names argument. Many ArviZ functions have a var_names argument:

az.plot_trace(trace_g, var_names=['α', 'β', 'ϵ'])
Figure 3.3: Trace plot for α, β, and ϵ
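Because the data is synthetic, we can also check numerically how well the posterior recovers the true values used to generate it (alpha_real=2.5, beta_real=0.9, and a noise standard deviation of 0.5). A quick way to do so, sketched here as a suggestion rather than taken from the book, is with az.summary:

# The posterior means should be close to the values used to simulate the data
az.summary(trace_g, var_names=['α', 'β', 'ϵ'])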

Feel free to experiment with other ArviZ plots to explore the posterior. In the next section, we are going to discuss a property of the linear model and how it can affect the sampling process and model interpretation. Then, we are going to look at a few ways to interpret and visualize the posterior.
