In Chapter 7, Exploring Association Rules with Apriori, we examined association rules with apriori. In the previous chapter, we notably examined statistical distributions and the relationships between two attributes using several measures of association. These measures do not imply any causation between the attributes, only dependence. If we have normally distributed attributes and want to examine how one attribute affects another, we can rely on simple linear regression. If we want to examine how several attributes affect an attribute, we can rely on multiple linear regression.
In this chapter, we will notably:
In simple regression, we analyze the relationship between a predictor (the attribute we think is the cause) and the criterion (the attribute we think is the consequence). There are two very important parameters (among others) that result from a regression analysis: the intercept and the slope coefficient.
Regression seeks to obtain the values that best explain the relationship, but such a model only seldom reflects the relationship entirely. Indeed, measurement error, as well as attributes that are not included in the analysis, also affect the data. The residuals express the deviation of the observed data points from the model: a residual is the vertical distance from a point to the regression line. Let's examine this with an example using the iris dataset. We have already seen that this dataset contains data about iris flowers. For the purpose of this example, we will consider the petal length as the criterion and the petal width as the predictor.
We will now create a scatterplot, with the petal width on the x axis and the petal length on the y axis, in order to display the data points on these dimensions. We will then compute the regression model and use it to add the regression line to the plot. This should look familiar, as we have already done this in Chapter 2, Visualizing and Manipulating Data Using R, and Chapter 3, Data Visualization with Lattice, when discussing plots in R. This redundancy is not accidental—plotting data and their relationship is one of the most important aspects of analyzing data:
plot(iris$Petal.Length ~ iris$Petal.Width,
   main = "Relationship between petal length and petal width",
   xlab = "Petal width", ylab = "Petal length")
iris.lm = lm(iris$Petal.Length ~ iris$Petal.Width)
abline(iris.lm)
The following plot has been manually annotated in gray in order to make the discussion more intelligible:
In this example, the intercept is around 1.1 and the slope coefficient is around 2.2 (about 3.3 minus the intercept). As mentioned before, the vertical distance from the line to a point is the residual for that specific point.
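These plot readings can be checked numerically. The following sketch (it simply rebuilds the model from the code above) evaluates the fitted line at a petal width of 1, which should give roughly the 3.3 read off the annotated plot:

```r
# rebuild the regression model used in the plot
iris.lm = lm(iris$Petal.Length ~ iris$Petal.Width)
# value of the fitted line at petal width 1: intercept + slope * 1
unname(coef(iris.lm)[1] + coef(iris.lm)[2] * 1)
# about 3.31, consistent with the annotated plot
```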
Now that this is understood, we will examine how these values can be computed, before going into the results in greater depth.
In simple regression, data can be modeled as the intercept, plus the slope multiplied by the value of the predictor, plus the residual. We are now going to explain how to compute these.
The slope coefficient can be computed in several ways. One is to multiply the correlation coefficient by the standard deviation of the criterion divided by the standard deviation of the predictor. Another is to first compute the number of observations multiplied by the sum of the observation-wise products of the criterion and the predictor, minus the sum of the values of the predictor multiplied by the sum of the values of the criterion. The result is then divided by the number of observations multiplied by the sum of the squared values of the predictor, minus the squared sum of the values of the predictor. Another way is to rely on matrix computations, which we will not examine here.
The intercept can simply be computed as the mean of the criterion minus the slope coefficient multiplied by the mean of the predictor.
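For reference, the verbal descriptions above correspond to the following formulas, writing the criterion as $y$, the predictor as $x$, the number of observations as $n$, and the slope and intercept as $b$ and $a$:

```latex
b = r_{xy}\,\frac{s_y}{s_x}
  = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
a = \bar{y} - b\,\bar{x}
```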
Let's take the same example as before to compute the regression coefficient (using the two computations we have seen), and the intercept.
To compute the slope coefficient using the first way presented, we start by computing the correlation coefficient of the petal length and petal width, and the standard deviation of the predictor and criterion. We then perform the described computation:
SlopeCoef = cor(iris$Petal.Length, iris$Petal.Width) *
   (sd(iris$Petal.Length) / sd(iris$Petal.Width))
SlopeCoef
The outputted value is 2.22994. Let's program a function that implements the other way to compute the slope we've seen. The criterion will be called y and the predictor x:
coeffs = function (y, x) {
   ( (length(y) * sum(y * x)) -
     (sum(y) * sum(x)) ) /
   (length(y) * sum(x^2) - sum(x)^2)
}
coeffs(iris$Petal.Length, iris$Petal.Width)
The output is 2.22994 again. Let's compare it to the model we built using the lm() function previously:
iris.lm
The output first reminds us of the function call we used, which is pretty handy, as working with many different models can sometimes be confusing. The intercept and slope coefficients are provided. As can be seen, our own computation of the slope (2.22994) matches the value reported by lm(), up to rounding:
Call:
lm(formula = iris$Petal.Length ~ iris$Petal.Width)

Coefficients:
     (Intercept)  iris$Petal.Width
           1.084             2.230
Let's now build a function that computes the intercept and returns both intercept and coefficient:
regress = function (y, x) {
   slope = coeffs(y, x)
   intercept = mean(y) - (slope * mean(x))
   model = c(intercept, slope)
   names(model) = c("intercept", "slope")
   model
}
model = regress(iris$Petal.Length, iris$Petal.Width)
model
The value of the intercept is 1.08358, which is the same (but unrounded) as that in the output of the lm() function.
Now that we have seen how to compute the intercept and slope coefficients, let's turn to how residuals can be obtained.
Let's say it once more: the criterion value of any observation can be obtained by summing the intercept, the slope coefficient multiplied by the observation's predictor value, and the residual. As we now know the intercept and slope coefficient and have the data, we can compute the residuals as follows:
resids = function (y, x, model) {
   y - model[1] - (model[2] * x)
}
Let's compute the residuals for our model:
Residuals = resids(iris$Petal.Length, iris$Petal.Width, model)
Let's display the first six residuals:
head(round(Residuals,2))
The output is as follows:
[1] -0.13 -0.13 -0.23 -0.03 -0.13 -0.28
Let's also display the residuals computed by the lm() function:
head(round(residuals(iris.lm),2))
Comparing the preceding and following outputs, we can see that the values are the same:
1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---
-0.13 | -0.13 | -0.23 | -0.03 | -0.13 | -0.28
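If you want to check programmatically that the two sets of residuals agree for all 150 observations, not just the first six, a quick comparison can be used. The following sketch rebuilds the pieces seen earlier in the chapter so that it is self-contained:

```r
# rebuild the lm() model and our own slope, intercept, and residuals
iris.lm = lm(iris$Petal.Length ~ iris$Petal.Width)
coeffs = function (y, x) {
   (length(y) * sum(y * x) - sum(y) * sum(x)) /
      (length(y) * sum(x^2) - sum(x)^2)
}
slope = coeffs(iris$Petal.Length, iris$Petal.Width)
intercept = mean(iris$Petal.Length) - slope * mean(iris$Petal.Width)
ours = iris$Petal.Length - intercept - slope * iris$Petal.Width
# compare to lm()'s residuals across all observations
all.equal(as.numeric(ours), as.numeric(residuals(iris.lm)))
# should return TRUE (up to numerical tolerance)
```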
Residuals are very important for several reasons. There is not enough space to explain all of them. Let's just say that one assumption of regression is that the residuals are normally distributed. If the residuals are not normally distributed, this can be caused by a non-normal distribution of our data, and/or by nonlinear relationships between the predictor and the criterion.
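As a complement to visual inspection, a numeric normality check such as the Shapiro-Wilk test can be applied to the residuals (this is one option among several, shown here as a sketch; the test's null hypothesis is that the data are normally distributed):

```r
# Shapiro-Wilk normality test on the residuals of our model
iris.lm = lm(iris$Petal.Length ~ iris$Petal.Width)
shapiro.test(residuals(iris.lm))
```

A small p-value would suggest a departure from normality; as always, such tests should be read together with the plots.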
The Quantile-Quantile plot (Q-Q plot) allows us to visually compare the actual distribution of the residuals, in terms of quantiles, to the quantiles of the theoretical distribution. This plot can easily be obtained in R. Type the following:
plot(iris.lm)
Then simply click on the R Graphics window (or, if you are using RStudio, hit Return in the console) until the R Graphics window displays the following:
As the residuals fit the dotted line reasonably well, we can conclude that they are normally distributed. Notice, however, that the values at both extremities do not fit the line so well, so these observations might threaten the reliability of our results. We should be fine but, still, we will examine robust regression and bootstrapping at the end of the chapter.
We will examine the linearity of residuals further in our practical example in the next section.
As we have seen in the first section of the chapter, determining the significance of the estimates is essential for interpretation; even a large coefficient cannot be interpreted if it is not significantly different from 0. Here, you will learn a little more about the computation of significance for simple regression:
There is also a significance test for the intercept. In order to compute the standard error, we first:
After we obtain the standard error for the intercept, its t-score can be computed as seen previously. The following code implements this and returns the standard error, t score, and significance for both the slope coefficient and the intercept of a simple linear regression:
Significance = function (y, x, model) {
   SSE = sum(resids(y, x, model)^2)
   DF = length(y) - 2
   S = sqrt(SSE / DF)
   SEslope = S / sqrt(sum( (x - mean(x))^2 ))
   tslope = model[2] / SEslope
   sigslope = 2 * (1 - pt(abs(tslope), DF))
   SEintercept = S * sqrt(1/length(y) +
      mean(x)^2 / sum( (x - mean(x))^2 ))
   tintercept = model[1] / SEintercept
   sigintercept = 2 * (1 - pt(abs(tintercept), DF))
   RES = c(SEslope, tslope, sigslope, SEintercept,
      tintercept, sigintercept)
   names(RES) = c("SE slope", "t slope", "sig slope",
      "SE intercept", "t intercept", "sig intercept")
   RES
}
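In formula form, the quantities computed by this function are, with $n$ observations, residuals $e_i$, slope $b$, and intercept $a$:

```latex
S = \sqrt{\frac{\sum_i e_i^2}{n-2}}, \qquad
SE_b = \frac{S}{\sqrt{\sum_i (x_i - \bar{x})^2}}, \qquad
SE_a = S\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}},
```

```latex
t_b = \frac{b}{SE_b}, \qquad
t_a = \frac{a}{SE_a}, \qquad
p = 2\,\bigl(1 - P(T_{n-2} \le |t|)\bigr)
```

where $T_{n-2}$ is a t-distributed random variable with $n-2$ degrees of freedom.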
Let's see this in practice using the example of the iris dataset.
round(Significance(iris$Petal.Length, iris$Petal.Width, model), 3)
The output is as follows:
Let's now compare our results with the results obtained with lm()
:
summary(iris.lm)
The following output shows that we obtain the exact same results using our function:
If you have been following all along, you now know how to write functions that compute the intercept, slope coefficient, residuals, standard errors, t values, and significance for simple regression. Congratulations! We wish to mention that, as always, the code is presented for pedagogical purposes only, and that the tools provided by default in R or the packages available on CRAN should always be used for real applications.
In the next section, we will briefly examine how multiple regression works, then switch to a more practical section using multiple regression. We will discover important new concepts in the process.