Chapter 9. Linear Regression

In Chapter 7, Exploring Association Rules with Apriori, we examined association rules with the apriori algorithm. In the previous chapter, we notably examined statistical distributions and the relationship between two attributes using several measures of association. These measures didn't infer any causation between the attributes, only dependence. If we have normally distributed attributes and want to examine how one attribute affects another, we can rely on simple linear regression. If we want to examine how several attributes affect a single attribute, we can rely on multiple linear regression.

In this chapter, we will notably:

  • Build and use our own simple linear regression algorithm
  • Create multiple linear regression models in R
  • Perform diagnostic tests of such models
  • Score new data using a linear regression model
  • Examine how well the model predicts the new data
  • Have a quick look at robust regression and bootstrapping

Understanding simple regression

In simple regression, we analyze the relationship between a predictor (the attribute we believe to be the cause) and the criterion (the attribute we believe to be the consequence). Two very important parameters (among others) result from a regression analysis:

  • The intercept: This is the average value of the criterion when the predictor is 0, that is, when the effect of the predictor is partialed out
  • The slope coefficient: This indicates by how many units, on average, the criterion changes (with reference to the intercept) when the predictor increases by one unit

Regression seeks to obtain the values that best explain the relationship, but such a model only seldom reflects the relationship entirely. Indeed, measurement error, as well as attributes that are not included in the analysis, also affects the data. The residual expresses the deviation of an observed data point from the model; its value is the vertical distance from the point to the regression line. Let's examine this with an example from the iris dataset. We have already seen that this dataset contains data about iris flowers. For the purpose of this example, we will consider the petal length as the criterion and the petal width as the predictor.

We will now create a scatterplot, with the petal width on the x axis and the petal length on the y axis, in order to display the data points on these dimensions. We will then compute the regression model and use it to add the regression line to the plot. This should look familiar, as we have already done this in Chapter 2, Visualizing and Manipulating Data Using R, and Chapter 3, Data Visualization with Lattice, when discussing plots in R. This redundancy is not accidental—plotting data and their relationship is one of the most important aspects of analyzing data:

plot(iris$Petal.Length ~ iris$Petal.Width,
   main = "Relationship between petal length and petal width",
   xlab = "Petal width", ylab = "Petal length")
iris.lm = lm(iris$Petal.Length ~ iris$Petal.Width)
abline(iris.lm)
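
Before looking at the plot, a quick side note: an equivalent and arguably more idiomatic way to fit the same model is to pass the data frame through the data argument of lm(). The object name iris.lm2 used here is only illustrative; this form makes it easier to score new data later with predict():

iris.lm2 = lm(Petal.Length ~ Petal.Width, data = iris)
iris.lm2
# Predictions for new petal widths are now straightforward:
predict(iris.lm2, newdata = data.frame(Petal.Width = c(0.5, 1.5, 2.5)))

For the rest of this section, we will keep working with the iris.lm model fitted above.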

The following plot has been manually annotated in gray in order to make the discussion more intelligible:

Annotated scatterplot of petal length and petal width (iris dataset) with regression line

In this example, the intercept is around 1.1 and the slope coefficient is around 2.2 (the fitted value at a petal width of 1, about 3.3, minus the intercept). As mentioned before, the vertical distance from the line to a point is the residual for that specific point.

Now that this is understood, we will examine how these values can be computed, before going into the results in greater depth.

Computing the intercept and slope coefficient

In simple regression, data can be modeled as the intercept, plus the slope multiplied by the value of the predictor, plus the residual. We are now going to explain how to compute these.

The slope coefficient can be computed in several ways. One is to multiply the correlation coefficient between the predictor and the criterion by the standard deviation of the criterion divided by the standard deviation of the predictor. Another is to compute the number of observations multiplied by the sum of the observation-wise products of the predictor and the criterion, minus the sum of the values of the predictor multiplied by the sum of the values of the criterion; this quantity is then divided by the number of observations multiplied by the sum of the squared values of the predictor, minus the squared sum of the values of the predictor. Yet another way is to rely on matrix computations, which we will not examine here.

The intercept can simply be computed as the mean of the criterion minus the slope coefficient multiplied by the mean of the predictor.
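
For reference, the computations just described can be written as formulas, where y is the criterion, x is the predictor, n is the number of observations, r_xy is their correlation, and s_x and s_y are their standard deviations:

\[
\text{slope} = r_{xy}\,\frac{s_y}{s_x} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
\text{intercept} = \bar{y} - \text{slope}\cdot\bar{x}
\]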

Let's take the same example as before to compute the regression coefficient (using the two computations we have seen), and the intercept.

To compute the slope coefficient using the first way presented, we start by computing the correlation coefficient of the petal length and petal width, and the standard deviation of the predictor and criterion. We then perform the described computation:

SlopeCoef = cor(iris$Petal.Length,iris$Petal.Width) *
   (sd(iris$Petal.Length) / sd(iris$Petal.Width))
SlopeCoef

The outputted value is 2.22994. Let's program a function that implements the other way to compute the slope we've seen. The criterion will be called y and the predictor x:

coeffs = function (y,x) {
   # slope = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
   ( (length(y) * sum( y*x)) -
      (sum( y) * sum(x)) )  /
      (length(y) * sum(x^2) - sum(x)^2)
}
coeffs(iris$Petal.Length, iris$Petal.Width)

The output is 2.22994 again. Let's compare it to the model we built using the lm() function previously:

iris.lm

The output first reminds us of the function call we used, which is pretty handy, as working with many different models can sometimes be confusing. The intercept and slope coefficient are then provided. As can be seen, the slope reported by lm() (2.230 when rounded) matches the value we obtained with our own computations:

Call:
lm(formula = iris$Petal.Length ~ iris$Petal.Width)
Coefficients:
     (Intercept)  iris$Petal.Width  
           1.084             2.230  

Let's now build a function that computes the intercept and returns both intercept and coefficient:

regress = function (y,x) {
   slope = coeffs(y,x)
   intercept = mean(y) - (slope * mean(x))
   model = c(intercept, slope)
   names(model) = c("intercept", "slope")
   model
}
model = regress(iris$Petal.Length, iris$Petal.Width)
model

The value of the intercept is 1.08358, which is the same (but unrounded) as that in the output of the lm() function.
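
If you want to see the lm() estimates with more decimal places than the default print method shows, the coefficients can be extracted with the coef() function; this is just a convenient check, not part of the original walkthrough:

coef(iris.lm)

This returns a named numeric vector containing the intercept and the slope, which should agree with the output of our own regress() function up to rounding.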

Now that we have seen how to compute the intercept and slope coefficients, let's turn to how residuals can be obtained.

Obtaining the residuals

Let's say it once more: the criterion value of any observation can be obtained by summing the intercept, the slope coefficient multiplied by the observation's predictor value, and the observation's residual. As we now know the intercept and slope coefficient and have the data, we can compute the residuals as follows:

resids = function (y,x, model) {
  # residual = observed criterion - (intercept + slope * predictor)
  y - model[1] - (model[2] * x)
}

Let's compute the residuals for our model:

Residuals = resids(iris$Petal.Length, iris$Petal.Width, model)

Let's display the first six residuals:

head(round(Residuals,2))

The output is as follows:

[1] -0.13  -0.13  -0.23  -0.03  -0.13  -0.28

Let's also display the residuals computed by the lm() function:

head(round(residuals(iris.lm),2))

Comparing the preceding and following outputs, we can see that the values are the same:

    1     2     3     4     5     6 
-0.13 -0.13 -0.23 -0.03 -0.13 -0.28 
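
Rather than comparing the two outputs by eye, we can also verify the agreement programmatically. This small check is an addition to the walkthrough; all.equal() compares the two vectors up to a numerical tolerance:

# TRUE if our residuals match those returned by lm()
all.equal(as.numeric(Residuals), as.numeric(residuals(iris.lm)))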

Residuals are very important for several reasons; there is not enough space here to explain all of them. Let's just say that one assumption of regression is that the residuals are normally distributed. If they are not, this can be caused by a non-normal distribution of the data, and/or by nonlinear relationships between the predictors and the criterion.
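
As a complementary check (not shown in the original discussion), the distribution of the residuals can also be examined with a histogram and, if you want a formal test, a Shapiro-Wilk test from base R:

# Quick visual impression of the residual distribution
hist(residuals(iris.lm), main = "Histogram of residuals", xlab = "Residuals")
# Shapiro-Wilk test; a small p-value suggests a departure from normality
shapiro.test(residuals(iris.lm))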

The Quantile-Quantile plot (Q-Q plot) allows us to visually compare the quantiles of the actual distribution of the residuals to the quantiles of the theoretical (normal) distribution. This plot can easily be obtained in R. Type the following:

plot(iris.lm)

Then simply click on the R Graphics window (or, if you are using RStudio, hit Return in the console) until the R Graphics window displays the following:

Q-Q plot of the residuals in the iris.lm model

As the residuals fit the dotted line reasonably well, we can conclude that they are normally distributed. Notice, however, that the values at both extremities do not fit the line so well, so these observations might threaten the reliability of our results. We should be fine but, still, we will examine robust regression and bootstrapping at the end of the chapter.
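
As a side note, if you only want the Q-Q plot without clicking through the other diagnostic plots, plot() applied to an lm object accepts a which argument (the Q-Q plot is the second plot in the sequence), and the same plot can also be built directly from the residuals:

# Display only the normal Q-Q plot of the model
plot(iris.lm, which = 2)
# Equivalent plot built manually from the residuals
qqnorm(residuals(iris.lm))
qqline(residuals(iris.lm))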

We will examine the linearity of residuals further in our practical example in the next section.

Computing the significance of the coefficient

As we have seen in the first section of the chapter, determining the significance of the estimates is essential for interpretation; even a big coefficient cannot be interpreted if it is not significantly different from 0. Here, you will learn a little more about the computation of the significance for simple regression:

  1. The first thing we need to do is to compute the standard error of the slope coefficient (a value that assesses its precision).
  2. We obtain it by first taking the square root of the sum of the squared residuals (called SSE in the following code) divided by the degrees of freedom (DF, that is, the number of observations minus two).
  3. We then divide this value (called S in the following code) by the square root of the sum of the squared mean-subtracted values of x.
  4. After we obtain the standard error, we can compute a t-score by dividing the slope coefficient by the standard error.
  5. The score is then compared to 0 on a t-distribution with DF degrees of freedom.

There is also a significance test for the intercept. In order to compute its standard error, we:

  1. Compute 1 divided by the number of observations, plus the squared mean of the predictor divided by the sum of the squared mean-subtracted values of the predictor.
  2. Take the square root of this value and multiply it by the value S that we saw previously.

After we obtain the standard error for the intercept, its t-score can be computed as seen previously. The following code implements this and returns the standard error, t score, and significance for both the slope coefficient and the intercept of a simple linear regression:

Significance = function (y, x, model) {
   SSE = sum(resids(y, x, model)^2)   # sum of the squared residuals
   DF = length(y) - 2                 # degrees of freedom
   S = sqrt(SSE / DF)
   SEslope = S / sqrt(sum( (x - mean(x))^2 ))
   tslope = model[2] / SEslope
   sigslope = 2 * (1 - pt(abs(tslope), DF))
   SEintercept = S * sqrt(1/length(y) +
      mean(x)^2 / sum( (x - mean(x))^2 ))
   tintercept = model[1] / SEintercept
   sigintercept = 2 * (1 - pt(abs(tintercept), DF))
   RES = c(SEslope, tslope, sigslope, SEintercept,
      tintercept, sigintercept)
   names(RES) = c("SE slope", "t slope", "sig slope",
      "SE intercept", "t intercept", "sig intercept")
   RES
}

Let's see this in practice using the example of the iris dataset.

round(Significance(iris$Petal.Length,iris$Petal.Width, model), 3)

The output is as follows:

Output of the Significance() function for the iris example

Let's now compare our results with the results obtained with lm():

summary(iris.lm)

The following output shows that we obtain the exact same results using our function:

Output of summary(iris.lm)
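
If you prefer to work with these numbers programmatically rather than reading them off the printed summary, the coefficient table (estimates, standard errors, t values, and p-values) can be extracted as a matrix; this is a small convenience addition:

# One row per coefficient: Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(iris.lm))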

If you have been following all along, you now know how to write functions that compute the intercept, slope coefficient and residuals, standard errors, t values, and significance for simple regression. Congratulations! We wish to mention that, as always, the code is presented for pedagogical purposes only, and that the tools provided by default in R or the packages available on CRAN should always be used for real applications.

In the next section, we will briefly examine how multiple regression works, then switch to a more practical section using multiple regression. We will discover important new concepts in the process.
