Chapter 12
Linear Regression Models

Package(s): faraway, MASS, car

Dataset(s): Euphorbiaceae, anscombe, tc, usc, shelf_stocking, abrasion_index, Frog_survival, flight, viscos, prostate,

12.1 Introduction

Faraway (2002) is probably the first detailed account of the use of R for linear models. The book may be freely circulated, and it may also be printed and sold at a price that covers the cost of printing. It remains an elegant read for current R versions, although it was written when R was in its early 1.x releases. Faraway (2006) is an extended version that also covers generalized linear models, which we deal with in Chapter 17. Fox (2002) deals with regression problems in both R and S-plus. Sheather (2009) is a more recent account of the use of R for the analysis of linear models; SAS users will find this book convenient, as it also gives parallel programs. Ritz and Streibig (2008) is dedicated to the applications of nonlinear regression models using R.

The covariates are also sometimes called explanatory variables, or regressors, or predictors. In general the covariate is an independent variable. The output $c12-math-0001$ is called a regressand.

A rather long route is adopted in this chapter, for two reasons. First, understanding the statistical concepts through the simple linear regression model is of prime importance, even today. It is important to take this route and become familiar with the nuances of the simple linear regression model, whose extension to the multiple regression model is rather straightforward. Second, using the lm function right away may hide some of the conceptual developments of the subject. This is not to say that the developers of the function should not have given the user so many options. Thus, we take the lengthier route of explaining the nitty-gritty of regression using longer R code than strictly necessary. This discussion forms the matter of Section 12.2. Linear regression models, like most statistical techniques, need to be developed with a lot of care. It is not uncommon for any technology to be abused, and it is to guard against such follies that the reader is cautioned in Section 12.3. In most scenarios that use the linear regression model, we will be dealing with more than a single covariate. Thus, Section 12.4 extends the simple linear regression model to the multiple linear regression model, which handles two or more covariates. The use of residuals for the multiple linear regression model is detailed in Section 12.5. Dependencies among the covariates have a dire impact on the estimated values of the regression coefficients, and it is important to identify such relationships. This is the problem of multicollinearity, and the R techniques for it are put into action in Section 12.6. Sometimes the covariates and/or the regressand may indicate that the assumptions of the linear regression model are not appropriate. In specific situations, certain transformations of the variables ensure that the linear regression model continues to yield good results, see Section 12.7. With a multiple linear regression model, we often need to arrive at a more parsimonious model, in the sense that it has fewer variables than the full model while still explaining as much of the variability in the regressand as possible. There are multiple ways of achieving this, and we will go through these techniques in Section 12.8.

12.2 Simple Linear Regression Model

In Section 4.4.3 we described the general linear regression model in equation (4.11) and used the resistant line to obtain the unknown slope and intercept terms. The form of the simple linear regression model, which will be of interest in this section, is given by

$$Y = \beta_0 + \beta_1 X + \epsilon. \tag{12.1}$$

In comparison with equation (4.11), we have $c12-math-0003$ and $c12-math-0004$ , although the error term has not been stated there. The interpretation of $c12-math-0005$ is similar to $c12-math-0006$ , in that it reflects the change in the regressand $c12-math-0007$ for a unit change in $c12-math-0008$ . We refer to $c12-math-0009$ as the intercept term. However, we have an additional term in $c12-math-0010$ , which is the unobservable error term. Similar to the case of the resistant line model, we need to carry out the inference for $c12-math-0011$ and $c12-math-0012$ based on $c12-math-0013$ pairs of observations: $c12-math-0014$ , $c12-math-0015$ , $c12-math-0016$ , $c12-math-0017$ . We state the important assumptions of the simple linear regression models:

1. The regressand and regressor have a linear relationship.
2. The observations $c12-math-0018$ , $c12-math-0019$ , $c12-math-0020$ , $c12-math-0021$ are independent.
3. The errors $c12-math-0022$ are iid normal RVs with mean 0 and variance $c12-math-0023$ .

Using the data, and the above stated assumptions, the goal is the estimation of the parameters $c12-math-0024$ , $c12-math-0025$ , and $c12-math-0026$ . The purpose of estimating the parameters is again to understand the model 12.1. We will next consider an example and first visualize the data to ascertain whether a linear model is appropriate.

Example 12.2.1. The Height of Euphorbiaceae Tree

Botanists are interested in estimating the volume of a tree, which is often a daunting task. Under the assumption that the tree has a conical shape, measurements of the tree height and the radius of the base are sufficient to estimate the volume. Since measuring the height is also cumbersome, trees being as tall as 60 meters, the relationship between height and girth (measured at a height of about 1 meter) is useful for estimating the overall volume. The girth is measured in centimetres. The Euphorbiaceae dataset from the gpk package will be used to illustrate this concept. The dataset has six different species and the ideas will be illustrated for the Haevea brazeliensis species. The simple linear regression model under consideration is the following:

$$\text{Height} = \beta_0 + \beta_1\, \text{GBH} + \epsilon.$$

The data are first loaded from the gpk package and a scatter plot is produced to inspect whether a linear relationship exists between the height of the tree and its girth measured at about 1 meter.

> library(gpk)
> data(Euphorbiaceae)
> Hb <- subset(Euphorbiaceae,  Species_Name=="Haevea brazeliensis")
> plot(Hb$GBH,Hb$Height,xlab="Girth",ylab="Height")

It is apparent from the scatter plot, see Figure 12.1, that as the girth increases, the height of the tree increases too. Furthermore, the plot depicts a linear relationship too.□

Figure 12.1 Scatter Plot for Height vs Girth of Euphorbiaceae Trees

12.2.1 Fitting a Linear Model

The main problem with a simple linear regression model is the estimation of unknown vector of regression coefficients $c12-math-0028$ and the variance of the error term $c12-math-0029$ . If we have estimates of the parameters, we can use the regression model for prediction purposes. For an intuitive discussion about the choice of the regression coefficients, the reader may refer to Chapter 6 of Tattar (2013).

For many statistical reasons, the parameters are estimated using the least squares method. The least-squares criterion is to find those values of $c12-math-0030$ and $c12-math-0031$ which will ensure that the sum of the squares for the difference between the actual values $c12-math-0032$ and $c12-math-0033$ over all the observations is minimized:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. \tag{12.2}$$

Differentiating the above equation with respect to $c12-math-0035$ and equating them to zero gives us the least-squares normal equations, and solving them further gives us the estimators for $c12-math-0036$ :

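$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \tag{12.3}$$

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \tag{12.4}$$

where

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \tag{12.5}$$

$$S_{xy} = \sum_{i=1}^{n} y_i (x_i - \bar{x}). \tag{12.6}$$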

Of course, the term $c12-math-0041$ is known as the sum-of-squares term and $c12-math-0042$ as sum-of-cross-products term. Towards finding an estimate of the variance $c12-math-0043$ , we proceed along the following lines. We first define the model fitted values as the regressand value predicted by the fitted model $c12-math-0044$ . Next, we define residuals as the difference between the observed value $c12-math-0045$ and the corresponding model fitted value $c12-math-0046$ , that is, for $c12-math-0047$ :

$$e_i = y_i - \hat{y}_i. \tag{12.7}$$

Define the residual or error sum of squares, denoted by $c12-math-0049$ , by:

$$SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \tag{12.8}$$

Since the residuals are based on $c12-math-0051$ observations, and the parameters $c12-math-0052$ and $c12-math-0053$ are estimated from it, the degrees of freedom associated with $c12-math-0054$ is $c12-math-0055$ . An unbiased estimator of $c12-math-0056$ is given by

$$\hat{\sigma}^2 = \frac{SS_{Res}}{n-2} = MS_{Res}. \tag{12.9}$$

It is thus meaningful that an estimator of the variance is the residual mean square. Using the estimator of variance, we can carry out statistical tests for the parameters of the regression line. Mathematically, the expressions for the variance of $c12-math-0058$ and $c12-math-0059$ are respectively:

Using the above expressions and the estimator of the variance of the error term, we estimate the standard error of $c12-math-0061$ and $c12-math-0062$ using:

12.10

12.11

Thus, if we are interested in investigating whether the covariate has an effect of magnitude, say $c12-math-0065$ , we need to test the hypothesis $c12-math-0066$ against the hypothesis that it does not have an impact of magnitude $c12-math-0067$ . A useful test statistic is

12.12

which is distributed as a $c12-math-0069$ -distribution with $c12-math-0070$ degrees of freedom. Here we have used the notation $c12-math-0071$ to indicate that the hypothesis testing problem is related to the regression coefficient $c12-math-0072$ . In general, when we wish to test the hypothesis that the covariate has no effect on the regressand, the hypothesis testing problem becomes $c12-math-0073$ against the hypothesis $c12-math-0074$ . Thus, an $c12-math-0075$ -level test would be to reject the hypothesis $c12-math-0076$ if

where $c12-math-0078$ is the upper $c12-math-0079$ percentile point of the $c12-math-0080$ distribution.

Similarly, for the regression coefficient $c12-math-0081$ , or the intercept term, the test statistic for the hypothesis $c12-math-0082$ against the hypothesis $c12-math-0083$ is

12.13

which is again distributed as an RV with a $c12-math-0085$ -distribution with $c12-math-0086$ degrees of freedom, and the test procedure parallels the testing of $c12-math-0087$ .

12.2.2 Confidence Intervals

Using the null distributions of $c12-math-0088$ and $c12-math-0089$ , equivalently $c12-math-0090$ and $c12-math-0091$ , the $c12-math-0092$ % confidence intervals of the slope and intercept are given by

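$$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \tag{12.14}$$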

and

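$$\hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0). \tag{12.15}$$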

Finally, we state that a $c12-math-0095$ percent confidence interval for $c12-math-0096$ is

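$$\frac{(n-2)\, MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\, MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}}. \tag{12.16}$$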

We will illustrate all the above concepts for the Euphorbiaceae data.

Example 12.2.2. The Height of Euphorbiaceae Tree

Contd. The translation from formulas to programs is a vital step. We will illustrate the computations for the related formulas from Equations 12.3 to 12.16 in the following R program. The purpose is again to ensure that the formulas are well understood in terms of a program. We will begin with codes for obtaining $c12-math-0098$ in R. The variables in the data.frame Hb will be first attached for the sake of simplicity, although we warn the reader that it is not good practice to attach variables of a data frame object.

> attach(Hb)
> sxx <- sum((GBH-mean(GBH))^2)
> sxx
[1] 6094.118
> sxy <- sum(Height*(GBH-mean(GBH)))
> sxy
[1] 3600.647
> beta1 <- sxy/sxx
> beta1
[1] 0.5908398
> beta0 <- mean(Height)-beta1*mean(GBH)
> beta0
[1] 2.848716
> n <- length(Height)
> sst <- sum(Height^2)-n*(mean(Height)^2)
> sst
[1] 3011.559
> ssres <- sst-beta1*sxy
> ssres
[1] 884.1533
> (sigma2 <- ssres/(n-2))
[1] 27.62979
> msres <- ssres/(n-2)
> (sebeta1 <- sqrt(msres/sxx))
[1] 0.06733384
> sebeta0 <- sqrt(msres*(1/n + (mean(GBH)^2)/sxx))
> sebeta0
[1] 3.796107
> (testbeta1 <- beta1/sebeta1)
[1] 8.774782
> (abs(qt(0.025,32)))
[1] 2.036933
> # returns the value of t-dist with 32 d.f at alpha=.05

The program shows that for a one-centimeter increase in the girth GBH, the fitted Height of the tree increases by beta1=0.5908398 meters. The intercept term beta0=2.848716 is the fitted height at zero girth; it lies outside the range of the observed girths and should be interpreted with caution. Similarly, the R objects sebeta1 and sebeta0 return the standard errors of the slope and the intercept. The R code here mimics the formulas on an as-is basis and provides a guide towards an understanding of the theory of linear models.

Since the absolute value of the test statistics for the slope term 8.774782 is larger than the value specified for the $c12-math-0099$ -distribution 2.036933, we reject the hypothesis $c12-math-0100$ that the covariate is insignificant and conclude that there is a linear relationship between the height of the tree and girth. The computations related to the confidence intervals, Equations 12.14–12.16, are as follows:

> (lclbeta1 <- (beta1 - abs(qt(.025,32))*sebeta1))
[1] 0.4536852
> #gives lower confidence limit for beta1
> (uclbeta1 <- (beta1 + abs(qt(.025,32))*sebeta1))
[1] 0.7279943
> #gives upper confidence limit for beta1
> (lclbeta0 <- (beta0 - abs(qt(.025,32))*sebeta0))
[1] -4.883701
> #gives lower confidence limit for beta0
> (uclbeta0 <- (beta0 + abs(qt(.025,32))*sebeta0))
[1] 10.58113
> #gives upper confidence limit for beta0
> (lclsigma2 <- (n-2)*msres/qchisq(1-.025,32))
[1] 17.86875
> #gives lower confidence limit for sigma2
> (uclsigma2 <- (n-2)*msres/qchisq(1-.975,32))
[1] 48.33878
> #gives upper confidence limit for sigma2

Stated more formally, and rounded to three decimals, the 95% confidence intervals for the parameters are as follows:

$$0.454 \le \beta_1 \le 0.728, \qquad -4.884 \le \beta_0 \le 10.581, \qquad 17.869 \le \sigma^2 \le 48.339.$$

Since the 95% confidence interval for $c12-math-0102$ does not include 0, we conclude that the variable GBH has significant influence on Height.□
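The hand computations of Example 12.2.2 can be collected into a single function. The following is a minimal sketch rather than the author's code: the helper name slr_by_hand is hypothetical, and the built-in cars data (stopping distance against speed) is used only so that the block runs without the gpk package. The results are checked against lm and confint.

```r
# Least-squares fit of y = beta0 + beta1 * x + error, coded from the formulas.
# slr_by_hand is a hypothetical helper name, not a function from any package.
slr_by_hand <- function(x, y, level = 0.95) {
  n <- length(y)
  sxx <- sum((x - mean(x))^2)          # sum of squares of x
  sxy <- sum(y * (x - mean(x)))        # sum of cross-products
  beta1 <- sxy / sxx                   # slope estimate
  beta0 <- mean(y) - beta1 * mean(x)   # intercept estimate
  sst <- sum((y - mean(y))^2)          # total sum of squares
  ssres <- sst - beta1 * sxy           # residual sum of squares
  msres <- ssres / (n - 2)             # unbiased estimate of sigma^2
  est <- c(beta0, beta1)
  se <- c(sqrt(msres * (1 / n + mean(x)^2 / sxx)), sqrt(msres / sxx))
  tq <- qt(1 - (1 - level) / 2, df = n - 2)   # two-sided t quantile
  data.frame(term = c("(Intercept)", "x"), estimate = est, se = se,
             t = est / se, lower = est - tq * se, upper = est + tq * se)
}

# Assumption: the built-in 'cars' data stand in for the tree data.
fit_hand <- slr_by_hand(cars$speed, cars$dist)
fit_lm <- lm(dist ~ speed, data = cars)
print(fit_hand)

# The hand-coded estimates and interval limits agree with lm() and confint().
all.equal(fit_hand$estimate, unname(coef(fit_lm)))
all.equal(fit_hand$lower, unname(confint(fit_lm)[, 1]))
```

Applied to the tree data, slr_by_hand(Hb$GBH, Hb$Height) should reproduce the values of beta0, beta1, sebeta0, and sebeta1 obtained above.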

12.2.3 The Analysis of Variance (ANOVA)

In the previous sub-section, we investigated whether the covariate has an explanatory power for the regressand. However, we would like to query whether the simple linear regression model 12.1 overall explains the variation in the actual data. An answer to this question is provided by the statistical technique Analysis of Variance, ANOVA. This tool is nearly a century old and Gelman (2005) provides a comprehensive review of the same and explains why it is still relevant and a very useful tool today. Here, ANOVA is used to test the significance of the regression model.

We will first define two quantities, total sum of squares, denoted by $c12-math-0103$ , and regression sum of squares, $c12-math-0104$ , similar to the residual sum of squares 12.8, next:

$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2, \tag{12.17}$$

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. \tag{12.18}$$

A straightforward algebraic manipulation step will give us the result that the total sum of squares $c12-math-0107$ is equal to the sum of regression sum of squares $c12-math-0108$ and residual sum of squares $c12-math-0109$ :

that is,

$$SS_T = SS_R + SS_{Res}. \tag{12.19}$$

The degrees of freedom (df) for the three quantities are related as

$$df_T = df_R + df_{Res}.$$

The explicit values of df are obtained as

$$df_T = n - 1, \qquad df_R = 1, \qquad df_{Res} = n - 2,$$

that is, $c12-math-0114$ .

Under the hypothesis $c12-math-0115$ , $c12-math-0116$ has a $c12-math-0117$ distribution. Since the distribution of $c12-math-0118$ is a $c12-math-0119$ distribution, and $c12-math-0120$ is independent of $c12-math-0121$ , we can use the $c12-math-0122$ -test for testing the significance of the covariate. That is

$$F_0 = \frac{SS_R/1}{SS_{Res}/(n-2)} = \frac{MS_R}{MS_{Res}} \tag{12.20}$$

is distributed as an $c12-math-0124$ -distribution with 1 and $c12-math-0125$ degrees of freedom. The test procedure is to reject the hypothesis $c12-math-0126$ if $c12-math-0127$ , where $c12-math-0128$ is the $c12-math-0129$ percentile of an $c12-math-0130$ -distribution with 1 and $c12-math-0131$ degrees of freedom. We summarize the ANOVA procedure in the form of an ANOVA Table 12.1.

Table 12.1 ANOVA Table for Simple Linear Regression Model

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Square	$c12-math-0132$ -Statistic
Regression	$c12-math-0133$	1	$c12-math-0134$	$c12-math-0135$
Residual	$c12-math-0136$	$c12-math-0137$	$c12-math-0138$
Total	$c12-math-0139$	$c12-math-0140$

The computations and illustration of ANOVA will continue with the Euphorbiaceae example.

Example 12.2.3. The Height of Euphorbiaceae Tree

(contd.) The ANOVA table calculations for the problem, which should be straightforward by now, are given below:

> (srs <- beta1*sxy)
[1] 2127.405
> ssres
[1] 884.1533
> sst
[1] 3011.559
> (msrs <- srs/1)
[1] 2127.405
> msres
[1] 27.62979
> (f0 <- msrs/msres)
[1] 76.9968

The above values can be put in the form of Table 12.2.□

Table 12.2 ANOVA Table for Euphorbiaceae Height

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Square	$c12-math-0141$ -Statistic
Regression	2127.405	1	2127.41	76.997
Residual	884.15	32	27.63
Total	3011.559	33
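The decision rule of the F-test can be checked with the F distribution functions of base R. This is a minimal sketch that copies the sums of squares from the example above; the object names f_crit and p_value are illustrative choices and do not appear in the original program.

```r
# Values copied from the Euphorbiaceae example in the text.
ssr   <- 2127.405   # regression sum of squares
ssres <- 884.1533   # residual sum of squares
n     <- 34         # number of trees (32 residual df + 2)

f0 <- (ssr / 1) / (ssres / (n - 2))                          # F statistic
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)                     # 5% critical value
p_value <- pf(f0, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # upper-tail p-value
print(c(f0 = f0, f_crit = f_crit, p_value = p_value))

# With a single covariate, the F critical value is the square of the
# two-sided t critical value used earlier for the slope.
all.equal(f_crit, qt(0.975, df = n - 2)^2)
```

Since f0 is far larger than f_crit, the hypothesis that the covariate has no effect is rejected, in agreement with the t-test for the slope; indeed f0 is the square of the t statistic 8.774782.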

12.2.4 The Coefficient of Determination

We have thus seen two methods of investigating the significance of the covariate in the regression model. An important question, in the case of the covariate being significant, is how useful the covariate is towards explaining the variation of the regressand. Recall that the total sum of squares is given by $c12-math-0142$ and that the regression sum of squares is given in $c12-math-0143$ . Thus, the ratio $c12-math-0144$ gives an explanation of the total variation explained by the fitted regression model. This important measure is called the coefficient of determination, or the $c12-math-0145$ , and is defined as

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}. \tag{12.21}$$

A natural extension of the $c12-math-0148$ is quickly seen by choosing $c12-math-0149$ and $c12-math-0150$ in place of $c12-math-0151$ and $c12-math-0152$ , and this measure is more popularly known as Adjusted- $c12-math-0153$ :

$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-2)}{SS_T/(n-1)}. \tag{12.22}$$

There are some disadvantages of the measure $c12-math-0156$ , and this will be seen later in Section 12.4. We will first obtain these two measures for the Euphorbiaceae tree.
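As a minimal sketch, the two measures can be computed from the sums of squares already obtained in Example 12.2.2. The values of sst, ssres, and n are copied from that example; the object names r2 and adj_r2 are illustrative.

```r
# Sums of squares copied from the Euphorbiaceae example.
sst   <- 3011.559   # total sum of squares
ssres <- 884.1533   # residual sum of squares
n     <- 34         # number of observations

r2     <- 1 - ssres / sst                           # coefficient of determination
adj_r2 <- 1 - (ssres / (n - 2)) / (sst / (n - 1))   # adjusted version
print(round(c(R2 = r2, adjusted_R2 = adj_r2), 4))
```

The results round to 0.7064 and 0.6972, so about 70% of the variation in Height is explained by the girth.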

12.2.5 The “lm” Function from R

The linear regression model discussed thus far has been worked through with a lot of rudimentary code. The lm function helps to create the linear regression model in R through the powerful object of class formula. The tilde operator ~ in the formula object helps to set up a linear regression model by allowing the user to specify the regressand on the left-hand side of the tilde ~ and the covariates on its right-hand side. This operator has also been used earlier for various graphical methods, statistical methods, etc. In conjunction with the lm function, we will now set up many useful linear regression models.

The main idea of working with the rudimentary codes exercise is to give a programming guide through R, and also in parallel an understanding of the underlying theory with computations. We now see how the lm command can be used for regression analysis.

Example 12.2.5. The Height of Euphorbiaceae Tree

(contd.) The lm function is now used for fitting a linear regression model for the Euphorbiaceae tree data. After fitting a linear model and assigning the result to a new object, the class of the new object becomes lm, and a lot of useful summaries are stored in this lm object. We will explore the lm object in some detail now. The goal is to build a simple linear regression model for Height as a function of the covariate GBH. Hence, the formula for the lm function through the tilde operator becomes Height ~ GBH. The following small program builds the linear model.

> gbhlm <- lm(Height ~ GBH)
> class(gbhlm)
[1] "lm"
> summary(gbhlm)
Call:
lm(formula = Height ~ GBH)
Residuals:
     Min       1Q   Median       3Q      Max
-15.3892  -2.0273   0.4269   3.4509  10.1116
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.84872    3.79611   0.750    0.458
GBH          0.59084    0.06733   8.775 5.01e-10 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 5.256 on 32 degrees of freedom
Multiple R-squared:  0.7064, Adjusted R-squared:  0.6972
F-statistic:    77 on 1 and 32 DF,  p-value: 5.013e-10

The fitted linear model using the lm function is stored in the R object gbhlm, and the class function confirms that it is indeed an lm object. Next, the summary function reveals the details of the fitted linear regression model gbhlm. The regression coefficients, the values of the $c12-math-0158$ -statistics, etc., may be compared with the earlier results to verify that the earlier computations were correct; there is no surprise here. Note that the values of the multiple and adjusted $c12-math-0159$ given here agree with the measures defined in Section 12.2.4.

The part of the output following the $c12-math-0160$ -values needs an explanation. For this fitted regression model, only the covariate GBH has three stars, that is ***, while the intercept term, with Pr(>|t|) equal to 0.458, has none. These symbols are called Signif. codes and are a quick way of drawing attention to the highly significant covariates in the model. A *** variable is more significant than a ** variable, and so forth; this idea will become clearer in Section 12.4. To change and customize the Signif. codes, see Exercise 12.4.

Next, the ANOVA table is obtained using the anova function in R.

> gbhaov <- anova(gbhlm)
> gbhaov
Analysis of Variance Table
Response: Height
          Df  Sum Sq Mean Sq F value    Pr(>F)
GBH        1 2127.41 2127.41  76.997 5.013e-10 ***
Residuals 32  884.15   27.63
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> summary(gbhlm$residuals)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-15.3900  -2.0270   0.4269   0.0000   3.4510  10.1100
> summary(gbhlm$fitted.values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.57   29.44   35.34   35.21   38.30   53.66

Here, we used the anova function and created the gbhaov object. This output should be compared with the values reported in Table 12.2. We defined the residuals earlier and, along with them, the model fitted values. Both are stored in the lm object, and here we extracted and summarized them through $residuals and $fitted.values.

Some details of the lm object are considered later in this chapter. In fact, there are many more advantages to using the lm function, besides obtaining the summary of the linear fit model and the ANOVA table. The $c12-math-0161$ confidence intervals may be obtained using the confint function:

> confint(gbhlm,parm="(Intercept)",level=.99)
                0.5 %   99.5 %
(Intercept) -7.546853 13.24429
> confint(gbhlm,parm="GBH",level=.90)
          5 %      95 %
GBH 0.4767837 0.7048958

By default, confint returns 95% confidence intervals.□

12.2.6 Residuals for Validation of the Model Assumptions

We have earlier defined the residuals, which are the differences between the observed values and the model fitted values. At the beginning of the section, we considered the model $c12-math-0162$ , and we also defined the residual for the fitted model as $c12-math-0163$ . We will have a preliminary look at some properties of the residuals.

Properties of the Residuals. The mean of the residuals is 0, that is,

The variance of the $c12-math-0165$ residuals is the mean residual sum of squares $c12-math-0166$ , or as seen earlier:

It can be proved that $c12-math-0168$ is an unbiased estimator for the variance of the linear regression model $c12-math-0169$ . Finally, we need to record that the residuals are not independent. These are some of the important properties of the residuals.

Semi-Studentized Residuals. We know that the residuals are zero-mean, non-independent random variables. It has been noted by researchers and practitioners that non-independence is not a major problem and we can continue to treat them as independent variables. Thus, the Studentization method is known here as the semi-Studentization method. The semi-Studentized residuals are then defined as

$$e_i^* = \frac{e_i}{\sqrt{MS_{Res}}}. \tag{12.23}$$
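The properties of the residuals and the semi-Studentized residuals of Equation 12.23 can be verified numerically. The sketch below is an illustration under stated assumptions: it uses the built-in cars data rather than a dataset from this chapter, and the object names e, msres, and e_star are illustrative.

```r
# Assumption: built-in 'cars' data, used only to keep the block self-contained.
fit <- lm(dist ~ speed, data = cars)
e <- resid(fit)                         # ordinary residuals
msres <- sum(e^2) / df.residual(fit)    # residual mean square, SSRes / (n - 2)
e_star <- e / sqrt(msres)               # semi-Studentized residuals

# The residuals have mean zero and are orthogonal to the covariate; these two
# linear constraints are the reason the n residuals are not independent.
print(c(mean_e = mean(e), sum_e_x = sum(e * cars$speed)))

# sqrt(msres) is the "Residual standard error" reported by summary().
all.equal(sqrt(msres), summary(fit)$sigma)

# rstandard() additionally adjusts each residual for its leverage, so it is
# close to, but not the same as, the semi-Studentized residual.
head(cbind(semi_studentized = e_star, standardized = rstandard(fit)))
```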

The residuals, including the semi-Studentized residuals, can be used to study departures in the simple linear regression models, adapted from Kutner, et al. (2005), on the following lines:

1. The regression function $c12-math-0171$ is not linear.
2. The error terms do not have constant variance, that is, $c12-math-0172$ for some $c12-math-0173$ .
3. The error terms are not independent.
4. The model fits all but one or a few outlier observations.
5. The error terms are not normally distributed, $c12-math-0174$ .
6. One or several important predictor variables have been omitted from the model.

As further stated in Kutner, et al. (2005), the diagnostics for the above problems may be visualized in the plots of residuals (or semi-Studentized residuals):

1. Plot of residuals against predictor variable. If the linear regression model is appropriate, the residuals are expected to fall in a horizontal band around 0. The plot should be fairly random in the positive and negative residual range across the predictor variable $c12-math-0175$ .
2. Plot of absolute or squared residuals against predictor variable. If the residuals vary in a systematic manner in the positive and negative residual values, such curvilinear behavior will be captured in the absolute or squared residuals plot against the predictor variable.
3. Plot of residuals against fitted values. The purpose of residuals against fitted values is similar to the previous two plots. These plots also help to determine whether the model error has constant variance against the range of predictor variables.
4. Plot of residuals against time. The plot of residuals against time, observation numbers in most cases, is useful to check for randomness of the errors. That is, the residuals should fluctuate randomly around 0 and not exhibit any kind of systematic pattern, such as a trend or a cycle.
5. Plots of residuals against omitted predictor variables. This plot will be more appropriate for the multiple linear regression model, Section 12.4.
6. Box plot of residuals. In the presence of outliers, the box plot of residuals will reflect such observations beyond the whiskers.
7. Normal probability plot of residuals. The assumption of normality for the errors is appropriately validated with the normal probability plot of the residuals.

From the graphical methods developed earlier in this book, we are equipped to handle the first six plots mentioned here. For the example of labor hours required for the lot size of a Toluca Company, we will illustrate these six plots.

Example 12.2.6. The Toluca Company Labour Hours against Lot Size

The Toluca Company manufactures equipment related to refrigerators. The company, in respect of a particular component of a refrigerator, has data on the labor hours required for the component in various lot sizes. Using this data, the officials wanted to find the optimum lot size for producing this part. This dataset has been downloaded from https://netfiles.umn.edu/users/nacht001/www/nachtsheim/5th/KutnerData/Chapter%20%201%20Data%20Sets/CH01TA01.txt. Of course, we are using this well-illustrated example from Kutner, et al. (2005).

A simple understanding of the predictor variable Lot_Size is necessary to begin the analyses.

> tc <- read.table("toluca_company.dat",sep="\t",header=TRUE)
> tclm <- lm(Labour_Hours~Lot_Size,data=tc)
> tclm$coefficients
(Intercept)    Lot_Size
  62.365859    3.570202
> par(mfrow=c(2,2))
> dotchart(tc$Lot_Size,main="Dot Chart for the Lot Size")
> plot.ts(tc$Lot_Size,main="Sequence Plot for the Lot Size",type="b")
> boxplot(tc$Lot_Size,horizontal=TRUE, main="A Box Plot for the Lot Size")
> hist(tc$Lot_Size,main="Histogram of the Lot Size")

The diagram arising as a result of the above R codes is suppressed. The reader can generate them and interpret them as an exercise. The residual plots, which give us an insight into the overall fit of the simple linear regression model, are now generated in the following R program. The focus is on diagnostic plots using the residuals only, and we leave it to the reader to replicate the details with semi-Studentized residuals.

> tc_resid <- resid(tclm) # Note the use of the new function "resid"
> par(mfrow=c(2,3))
> plot(tc$Lot_Size,tc_resid,main="A: Plot of Residuals Vs Predictor Variable",
+ xlab="Predictor Variable",ylab="Residuals")
> abline(h=0)
> plot(tc$Lot_Size,abs(tc_resid),
+ main="B: Plot of Absolute Residual Values Vs \n Predictor Variable",
+ xlab="Predictor Variable",ylab="Absolute Residuals")
> # Equivalently
> plot(tc$Lot_Size,tc_resid^2,
+ main="C: Plot of Squared Residual Values Vs \n Predictor Variable",
+ xlab="Predictor Variable",ylab="Squared Residuals")
> plot(tclm$fitted.values,tc_resid,main="D: Plot of Residuals Vs Fitted Values",
+ xlab="Fitted Values",ylab="Residuals")
> abline(h=0)
> plot.ts(tc_resid, main="E: Sequence Plot of the Residuals")
> boxplot(tc_resid,main="F: Box Plot of the Residuals")

The resid function is used to extract residuals from the tclm linear model object. The graphics window is divided into a 2-by-3 grid of panels with the code par(mfrow=c(2,3)). The plot of residuals against the predictor variable is drawn with plot(tc$Lot_Size,tc_resid,...) and suggests that the linear regression model is appropriate, since the residuals are scattered without pattern in a horizontal band about 0, drawn with abline(h=0), along the range of the predictor variable Lot_Size, see Part A of Figure 12.2. The plots plot(tc$Lot_Size,abs(tc_resid),...) and plot(tc$Lot_Size,tc_resid^2, ...), Parts B and C of Figure 12.2, show no systematic behavior of the residuals and hence no curvilinear pattern for the regression model. The model assumption that the error variance is constant across the predictor variable is checked visually through the two earlier plots, and through the plot of residuals against the fitted values in Part D of the same figure.

Figure 12.2 Residual Plot for a Regression Model

The time sequence plot of residuals, plot.ts(tc_resid,...), in Part E of the figure, shows random fluctuation about 0 with no visible trend, and hence we conclude that there is no systematic error in the dataset. Finally, Part F displays the box plot for the residuals and, since all the residuals lie within the whiskers, it is apt to conclude that outliers are absent.□

The Normal Probability Plot. We need to explain the normal probability plot of residuals. In the normal probability plot, the ranked residual values are plotted against their expected value under the normality assumption. The normal probability plot is obtained in the following steps:

1. Find the rank of each residual.
2. The expected value of the $c12-math-0176$ smallest residual is given by

$$E\left(e_{(k)}\right) \approx \sqrt{MS_{Res}}\; \Phi^{-1}\left(\frac{k - 0.375}{n + 0.25}\right), \tag{12.24}$$

where $c12-math-0178$ denotes the cumulative distribution function of a standard normal variate. The subscript $c12-math-0179$ is a standard notation in the subject of Order Statistics and denotes the $c12-math-0180$ smallest observation of $c12-math-0181$ .
3. Plot the residuals $c12-math-0182$ against $c12-math-0183$ .

If the plot of ranked residuals against the expected rank residuals is a straight line, the theoretical assumption of normal distribution for the error term is a valid assumption. In the event of this plot not reflecting a straight line, we conclude that the normality assumption is not a tenable assumption. The normal probability plot will be illustrated for the Toluca company dataset.

Example 12.2.7. The Toluca Company Labour Hours against Lot Size

Contd. The required R program is set up and executed at the R console.

> tcanova <- anova(tclm)
> tc_resid_rank <- rank(tc_resid)
> tc_mse <- tcanova$Mean[2]
> tc_resid_expected <- sqrt(tc_mse)*qnorm((tc_resid_rank-0.375)
+ /(length(tc$Labour_Hours)+0.25))
> plot(tc_resid_expected,tc_resid,xlab="Expected",
+ ylab="Residuals",main="The Normal Probability Plot")
> abline(0,1) # to check if the points are along a straight line

The first line stores the ANOVA table of the fitted model. The residual mean square $c12-math-0184$ is extracted with tcanova$Mean[2], where Mean partially matches the column name Mean Sq, and stored in tc_mse. The reader should verify that the expected residual computed in tc_resid_expected indeed follows Equation 12.24. The plot command plot(tc_resid_expected,tc_resid,...) is a simple technique seen many times over.

The normal probability plot in Figure 12.3 reveals that there is little difference between the residual values and the expected values. Thus, the simple linear regression model seems appropriate for the data under consideration.□

Figure 12.3 Normal Probability Plot

12.2.7 Prediction for the Simple Regression Model

Most often the purpose of fitting regression models is prediction. That is, given some values of the predictor variables, we need to predict the output values. This problem is known as the prediction problem.

Suppose that $c12-math-0185$ is the value of the predictor variable of interest. A natural estimator of the true response $c12-math-0186$ using the least-squares fitted model is

To develop a prediction interval for the observation $c12-math-0188$ , we define

Using the linearity property of expectations and distribution theory, we can easily see that the random variable $c12-math-0190$ is normally distributed with mean 0 and variance

To develop the prediction confidence interval, we will again use $c12-math-0192$ as an estimator of the variance $c12-math-0193$ . Thus, the $c12-math-0194$ prediction interval is given by

12.25

In practice, we simply use the inbuilt predict function of R for computations related to the prediction interval. For example, to predict the number of labor hours required for a lot size of 85, the predict function returns the fitted value along with the 95% prediction interval:

> predict(tclm,newdata=data.frame(Lot_Size=85),interval ="prediction")
      fit      lwr      upr
1 365.833 262.2730 469.3931
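The prediction interval can also be computed by hand and compared with predict. The following is a minimal sketch, assuming the usual simple-regression form in which the prediction error has estimated variance MSRes(1 + 1/n + (x0 - xbar)^2/Sxx). Since the Toluca data file is not part of base R, the built-in cars dataset and the value x0 = 15 are used purely as stand-ins.

```r
# Stand-in data: built-in 'cars' (dist regressed on speed); not the Toluca data.
fit <- lm(dist ~ speed, data = cars)
x0 <- 15                                   # hypothetical new predictor value
n <- nrow(cars)
xbar <- mean(cars$speed)
Sxx <- sum((cars$speed - xbar)^2)
mse <- sum(resid(fit)^2) / (n - 2)         # residual mean square
y0_hat <- unname(coef(fit)[1] + coef(fit)[2] * x0)
# Estimated standard deviation of the prediction error
se_pred <- sqrt(mse * (1 + 1 / n + (x0 - xbar)^2 / Sxx))
t_crit <- qt(0.975, df = n - 2)
by_hand <- c(fit = y0_hat,
             lwr = y0_hat - t_crit * se_pred,
             upr = y0_hat + t_crit * se_pred)
via_predict <- predict(fit, newdata = data.frame(speed = x0),
                       interval = "prediction")
print(by_hand)
print(via_predict)
```

The two printed results should agree, which suggests that predict with interval="prediction" implements the same formula.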

12.2.8 Regression through the Origin

Practical considerations may require that the regression line passes through the origin, that is, we may have information that the intercept term is $c12-math-0196$ . In such cases, the regression model is

12.26

In this case, the least-squares criterion becomes $c12-math-0198$ and the normal equation leads to the estimator:

12.27

which in turn gives us the fitted model $c12-math-0200$ . An unbiased estimator of the variance $c12-math-0201$ would be the following:

12.28

It is left as an exercise for the reader to obtain the necessary expressions related to the confidence interval, prediction interval, etc. In the next example, we will illustrate this concept through a dataset.

Example 12.2.8. The Shelf-Stocking Data

A merchandiser stocks a shelf with soft drinks, and the quantity stocked is recorded as a number of cases. The time required to put the cases on the shelves is recorded as the response. Clearly, if there are no cases to be stocked, it is natural that the time to put them on the shelf will be 0. Thus, the regression line through the origin makes sense here. A no-intercept model can be easily developed in R.

> shelf_stock <- read.table("shelf_stocking.dat",header=TRUE)
> names(shelf_stock)
[1] "Time"          "Cases_Stocked"
> sslm <- lm(Time ~ Cases_Stocked -1, data=shelf_stock)
> summary(sslm); anova(sslm)
Call:
lm(formula = Time ~ Cases_Stocked - 1)
Residuals:
    Min      1Q  Median      3Q     Max
-0.5252 -0.2198 -0.1202  0.1070  0.5443
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
Cases_Stocked 0.402619   0.004418   91.13   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.2988 on 14 degrees of freedom
Multiple R-squared: 0.9983, Adjusted R-squared: 0.9982
F-statistic:  8305 on 1 and 14 DF,  p-value: < 2.2e-16
Analysis of Variance Table
Response: Time
              Df Sum Sq Mean Sq F value    Pr(>F)
Cases_Stocked  1 741.62  741.62  8305.2 < 2.2e-16 ***
Residuals     14   1.25    0.09
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> confint(sslm)
                  2.5 %    97.5 %
Cases_Stocked 0.3931431 0.4120941
> predict(sslm,data.frame(Cases_Stocked=10),interval="prediction")
       fit      lwr      upr
1 4.026186 3.378308 4.674063

To build an intercept-free model, the necessary modification is specified in lm(Time ~ Cases_Stocked -1, data=shelf_stock). Including -1 removes the intercept term, which is otherwise included by default in lm. The $c12-math-0203$ -value for the $c12-math-0204$ -statistic is reported as less than 2.2e-16, which is highly significant, and hence shows that the fitted linear regression model is indeed a significant model. The values of $c12-math-0205$ and $c12-math-0206$ at respectively 0.9983 and 0.9982 show that the regression model is a good fit. The covariate Cases_Stocked is found to be significant under both the $c12-math-0207$ - and $c12-math-0208$ - statistics. Finally, the confidence and prediction intervals are obtained in the same way as in the earlier example.□
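The closed-form estimators for the no-intercept model can be checked against lm. The sketch below assumes the standard forms, namely the slope estimate sum(x*y)/sum(x^2) and the variance estimate with n - 1 degrees of freedom (which matches the 14 residual degrees of freedom reported above for 15 observations). The data are simulated for illustration and are not the shelf-stocking data.

```r
set.seed(1)                                # simulated data, for illustration only
x <- 1:15
y <- 0.4 * x + rnorm(15, sd = 0.3)
# Least-squares slope for the model without an intercept
beta_hat <- sum(x * y) / sum(x^2)
# Variance estimate: only one parameter is fitted, hence n - 1 degrees of freedom
sigma2_hat <- sum((y - beta_hat * x)^2) / (length(y) - 1)
fit0 <- lm(y ~ x - 1)
print(c(by_hand = beta_hat, via_lm = unname(coef(fit0))))
print(c(by_hand = sqrt(sigma2_hat), via_lm = summary(fit0)$sigma))
```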

The residual plots play a central role in validating the assumptions of the linear regression model. We have explored the relevant R functions for obtaining them.

$c12-math-0209$

12.3 The Anscombe Warnings and Regression Abuse

Applied statistics regress in most cases towards the linear regression model. Anscombe (1973) presented four datasets which have nearly identical values of the mean, variance, correlation, regression line, $c12-math-0210$ value, $c12-math-0211$ -values, etc. This dataset is available in R as anscombe, and the data may be inspected with View(anscombe).

We reproduce here select summaries:

> summary(anscombe)
           x1   x2   x3  x4     y1     y2    y3     y4
Median :  9.0  9.0  9.0   8  7.580  8.140  7.11  7.040
Mean   :  9.0  9.0  9.0   9  7.501  7.501  7.50  7.501

Furthermore, the ANOVA tables for the four datasets are nearly identical:

> anova(lm(y1~x1,data=anscombe))
Analysis of Variance Table
Response: y1
          Df Sum Sq Mean Sq F value   Pr(>F)
x1         1 27.510 27.5100   17.99 0.002170 **
Residuals  9 13.763  1.5292
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> anova(lm(y2~x2,data=anscombe))
Analysis of Variance Table
Response: y2
          Df Sum Sq Mean Sq F value   Pr(>F)
x2         1 27.500 27.5000  17.966 0.002179 **
Residuals  9 13.776  1.5307
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> anova(lm(y3~x3,data=anscombe))
Analysis of Variance Table
Response: y3
          Df Sum Sq Mean Sq F value   Pr(>F)
x3         1 27.470 27.4700  17.972 0.002176 **
Residuals  9 13.756  1.5285
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> anova(lm(y4~x4,data=anscombe))
Analysis of Variance Table
Response: y4
          Df Sum Sq Mean Sq F value   Pr(>F)
x4         1 27.490 27.4900  18.003 0.002165 **
Residuals  9 13.742  1.5269
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

In the data summary, we observe that there is a lot of similarity among the median and mean of the predictor variable and the output. If we proceed with fitting regression lines, there is striking similarity between the estimated regression coefficients, MSS, $c12-math-0212$ -values, etc. These summaries and the fitted regression lines leave us with the impression that the four different datasets are almost alike. However, the scatter plot in Figure 12.4 reveals an entirely different story. For the first quartet, a linear regression model seems appropriate. A non-linear association seems appropriate for the second quartet, and there appears an outlier in the third quartet. On the other hand, there does not appear to be a correlation for the fourth and final quartet. Thus, we need to exhibit real caution when carrying out data analysis.
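The similarity of the four fits can be tabulated in one step. This is a small sketch using the built-in anscombe data; the helper object names (fits, summ) are illustrative only.

```r
# Fit y_i ~ x_i for each of the four Anscombe datasets
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
# Collect intercept, slope, and R-squared of each fit in one matrix
summ <- t(sapply(fits, function(f) {
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r.squared = summary(f)$r.squared)
}))
rownames(summ) <- paste("Dataset", 1:4)
print(round(summ, 2))
```

Each row shows an intercept close to 3 and a slope close to 0.5, even though the scatter plots of the four datasets look entirely different.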

Figure 12.4 Regression and Resistant Lines for the Anscombe Quartet (four panels, Plot of I to IV Quartet, showing y1 to y4 against x1 to x4)

We now fit and plot the resistant lines for the four quartets of this dataset. The regression lines are given in red and the resistant lines in green.

> attach(anscombe)
> rl1 <- resistant_line(x1,y1,iter=4); rl2 <- resistant_line(x2,y2, iter=4)
> rl3 <- resistant_line(x3,y3,iter=4); rl4 <- resistant_line(x4,y4, iter=4)
> par(mfrow=c(2,2))
> plot(x1,y1,main="Plot of I Quartet")
> abline(lm(y1~x1,data=anscombe),col="red")
> curve(rl1$coeffs[1]+rl1$coeffs[2]*(x-rl1$xCenter),add=TRUE, col="green")
> plot(x2,y2,main="Plot of II Quartet")
> abline(lm(y2~x2,data=anscombe),col="red")
> curve(rl2$coeffs[1]+rl2$coeffs[2]*(x-rl2$xCenter),add=TRUE, col="green")
> plot(x3,y3,main="Plot of III Quartet")
> abline(lm(y3~x3,data=anscombe),col="red")
> curve(rl3$coeffs[1]+rl3$coeffs[2]*(x-rl3$xCenter),add=TRUE, col="green")
> plot(x4,y4,main="Plot of IV Quartet")
> abline(lm(y4~x4,data=anscombe),col="red")
> curve(rl4$coeffs[1]+rl4$coeffs[2]*(x-rl4$xCenter),add=TRUE, col="green")
> rl1$coeffs
[1] 7.5447098 0.4617412
> rl2$coeffs
[1] 7.8525641 0.4315385
> rl3$coeffs
[1] 7.1143590 0.3461538
> rl4$coeffs
[1] NA NA

A very interesting thing happens in the fourth quartet. The slope has been computed as NA, and consequently the intercept is not available either. The reason is that 10 out of the 11 observations of the $c12-math-0213$ values are the same, so the medians of the thirds of the $c12-math-0214$ 's coincide and the slope cannot be computed. Figure 12.4 shows the resistant lines fitted to the quartets of the Anscombe data. Also note that, unlike the least-squares estimates, the slope and intercept of the resistant lines differ across the datasets.
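The degeneracy in the fourth dataset can be seen directly from the data. The sketch below assumes that the 11 sorted points are split into thirds of sizes 4, 3, and 4; the exact split used inside resistant_line is not shown here, so this split is an assumption.

```r
xs <- sort(anscombe$x4)
print(table(xs))                           # ten values equal to 8, one equal to 19
# Assumed split of the 11 sorted points into thirds of sizes 4, 3, and 4
lower <- xs[1:4]
upper <- xs[8:11]
print(c(median_lower = median(lower), median_upper = median(upper)))
# A resistant-line slope divides by the difference of these two medians
denominator <- median(upper) - median(lower)
print(denominator)
```

With a zero denominator the slope is undefined, which is consistent with the NA values returned for rl4.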

For more critical abuses of the linear regression model, especially in the context of multiple covariates, the reader should consult Box (1964).

$c12-math-0215$

12.4 Multiple Linear Regression Model

In Section 12.2 we considered the case of one predictor variable. If there is more than one predictor variable, say $c12-math-0216$ , we extend the simple regression model to the multiple linear regression model. Suppose that $c12-math-0217$ is the variable of interest and that it is dependent on some covariates $c12-math-0218$ . The general linear regression model is given by

12.29

That is, the regressors are assumed to have a linear effect on the regressand. The vector $c12-math-0220$ is the vector of regression coefficients. The values of the $c12-math-0221$ are completely unspecified and take values on the real line. In the multiple regression model, the regression coefficients have this interpretation: if the regressor $c12-math-0222$ is changed by one unit while holding all other regressors at the same value, the change in the regressand $c12-math-0223$ will be $c12-math-0224$ .

For the $c12-math-0225$ individual in the study, the model is given by

The errors $c12-math-0227$ are assumed to be independent and identically distributed as the normal distribution $c12-math-0228$ with unknown variance $c12-math-0229$ . We assume that we have a sample of size $c12-math-0230$ . We will begin with the example of US crime data.

Example 12.4.1. US Crime Data

Data is available on the crime rates across 47 states in the USA, and we have additional information on 13 more explanatory variables. This dataset has been used and illustrated in Der and Everitt (2002) on the use of SAS software. The data is given in the file US_Crime.csv. We explain the variables, as detailed in Der and Everitt, as below.

R: Crime rate – the number of offenses known to the police per 1 000 000 population.
Age: Age distribution – the number of males aged 14 to 24 years per 1000 of total state population.
S: Binary variable distinguishing southern states (S = 1) from the rest.
Ed: Educational level – mean number of years of schooling $c12-math-0231$ 10 of the population 25 years old and over.
Ex0: Police expenditure – per capita expenditure on police protection by state and local governments in 1960.
Ex1: Police expenditure – as Ex0, but for 1959.
LF: Labor force participation rate per 1000 civilian urban males in the age group 14 to 24 years.
M: Number of males per 1000 females.
N: State population size in hundred thousands.
NW: Number of non-whites per 1000.
U1: Unemployment rate of urban males per 1000 in the age group 14 to 24 years.
U2: Unemployment rate of urban males per 1000 in the age group 35 to 39 years.
W: Wealth, as measured by the median value of transferable goods and assets or family income (unit 10 dollars).
X: Income inequality: the number of families per 1000 earning below one half of the median income.

The main problem here is to build a multiple linear regression model for the crime rate R as a function of the rest of the demographic covariates.

In Section 12.3 we saw the effectiveness of scatter plots for the Anscombe quartet data. If we have more than one pair of relationships to understand, we require a matrix of scatter plots. This technique is explored in the forthcoming subsection.□

12.4.1 Scatter Plots: A First Look

For the multiple regression data, let us use the R graphics function called pairs. The reader is encouraged to explore the strength of this graphical utility with example(pairs). In a scatter plot matrix, a scatter plot is provided for each variable against every other variable. Thus, the output is a symmetric arrangement of scatter plots.

Example 12.4.2. US Crime Data

Contd. For the US crime data, obtain the matrix of the scatter plot with the pairs function. The matrix of the scatter plot will be complemented with the correlation matrix, which is obtained using the cor function.

> data(usc)
> pairs(usc)
> round(cor(usc),2)
        R   Age         W     X
R    1.00 -0.09      0.44 -0.18
Age -0.09  1.00     -0.67  0.64
S   -0.09  0.58     -0.64  0.74
Ed   0.32 -0.53      0.74 -0.77
Ex0  0.69 -0.51      0.79 -0.63
Ex1  0.67 -0.51      0.79 -0.65
LF   0.19 -0.16      0.29 -0.27
M    0.21 -0.03      0.18 -0.17
N    0.34 -0.28      0.31 -0.13
NW   0.03  0.59     -0.59  0.68
U1  -0.05 -0.22      0.04 -0.06
U2   0.18 -0.24      0.09  0.02
W    0.44 -0.67      1.00 -0.88
X   -0.18  0.64     -0.88  1.00

From the first row of the matrix of scatter plots in Figure 12.5, we see that the crime rate R is weakly related to most of the explanatory variables. A careful examination of the first row also shows that if there is an increase in the values of the variables Ex0, Ex1, and W, the crime rate R also increases. The scatter plots among the explanatory variables reveal a strong relationship between the variables Ex0 and Ex1, indicating multicollinearity, which is discussed in detail in Section 12.6, and a strong negative relationship between the variables W and X, whose correlation is -0.88.□

Figure 12.5 Matrix of Scatter Plot for US Crime Data

12.4.2 Other Useful Graphical Methods

The matrix of a scatter plot is a two-dimensional plot. However, we can also visualize scatter plots in three dimensions. Particularly, if we have two covariates and a regressand, the three-dimensional scatter plot comes in very handy for visualization purposes. The R package scatterplot3d may be used to visualize three-dimensional plots. We will consider the three-dimensional visualization of scatter plots in the next example. The discussion here is slightly varied in the sense that we do not consider a real dataset for obtaining the three-dimensional plots.

Example 12.4.3. Visualization of Some Regression Models

Three-dimensional scatter plots of linear regression functions will be constructed now for the following models:

12.30

12.31

12.32

12.33

The following R codes generate the required three-dimensional plots.

> library(scatterplot3d)
> x1 <- rep(seq(0,10,0.5),100)
> x2 <- rep(seq(0,10,0.5),each=100)
> par(mfrow=c(2,2))
> Ey1 <- 83 + 9*x1 + 6*x2
> scatterplot3d(x1,x2,Ey1,highlight.3d=TRUE,xlim=c(0,10), ylim=c(0,10),zlim=c(0,240),
+ xlab=expression(x[1]),ylab=expression(x[2]),zlab="E(y)",main =
+ expression(paste("A 3-d plot for ", E(Y*"|"*x,beta) == 83 + 9*x[1]
+ + 6*x[2])),z.ticklabs="")
> Ey2 <- 83 + 9*x1 + 6*x2 + 3*x1*x2
> scatterplot3d(x1,x2,Ey2,highlight.3d=TRUE,xlim=c(0,10), ylim=c(0,10),zlim=c(0,600),
+ xlab=expression(x[1]),ylab=expression(x[2]),zlab="E(y)",main =
+ expression(paste("A 3-d plot for ",E(Y*"|"*x,beta)== 83 + 9*x[1]
+ + 6*x[2] + 3*x[1]*x[2])),z.ticklabs="")
> Ey3 <- 83 + 9*x1 + 6*x2 + 2*x1^4 + 3*x2^3 + 3*x1*x2
> scatterplot3d(x1,x2,Ey3,highlight.3d=TRUE,xlim=c(0,10), ylim=c(0,10),zlim=c(0,25000),
+ xlab=expression(x[1]),ylab=expression(x[2]),zlab="E(y)",main =
+ expression(paste("A 3-d plot for ",E(Y*"|"*x,beta)== 83 + 9*x[1]
+ + 6*x[2] + 2*x[1]^4 + 3*x[2]^3 + 3*x[1]*x[2])),z.ticklabs="")
> Ey4 <- 83 + 9*x1 + 6*x2 - 2*x1^4 - 3*x2^3 + 3*x1*x2
> scatterplot3d(x1,x2,Ey4,highlight.3d=TRUE,xlim=c(0,10), ylim=c(0,10),zlim=c(-23000,100),
+ xlab=expression(x[1]),ylab=expression(x[2]),zlab="E(y)",main =
+ expression(paste("A 3-d plot for ",E(Y*"|"*x,beta)==83 + 9*x[1]
+ + 6*x[2] - 2*x[1]^4 - 3*x[2]^3 + 3*x[1]*x[2])),z.ticklabs="")

Two vector variables x1 and x2 are created using the rep function, but with a slight difference: x1 repeats the whole sequence, whereas x2 repeats each value in turn (each=100). The expected values for the four regression models are then computed as Ey1 to Ey4. Using the scatterplot3d function from the package of the same name, we obtain the three-dimensional plots. As a natural extension of the plot function, scatterplot3d takes three coordinate arguments. The main title of each plot is built with expression and paste so that it displays the exact equation; this is non-trivial code and deserves careful attention from the reader.

In the first two three-dimensional plots of Figure 12.6, the linearity of $c12-math-0236$ in terms of $c12-math-0237$ and $c12-math-0238$ is apparent. Thus, a linear regression model seems appropriate. It is to be cautioned that with real datasets, some disturbance from the linearity is expected due to the error term. In the third three-dimensional plot, the curvature suggests that higher-order terms, such as quadratic terms, may be needed in the linear regression model.□

Figure 12.6 Three-Dimensional Plots

Contour plots are again useful techniques for understanding multivariate data through slices of the dataset. In continuation of the previous regression equations, from Example 12.4.3, let us view them in terms of a contour plot.

Example 12.4.4. Continuation of Example 12.4.3

To obtain the contour plot, we need to specify the output value $c12-math-0239$ for each possible combination of $c12-math-0240$ and $c12-math-0241$ values. That is, if we allow 10 different values for the predictor variables $c12-math-0242$ and $c12-math-0243$ , the $c12-math-0244$ values must be obtained for the 100 combinations of them. This task is easily carried out using the outer function. This is demonstrated in the next R program.

> par(mfrow=c(2,2))
> x1=x2=seq(from=0,to=10,by=0.2)
> ey1 <- function(a,b) 83 + 9*a + 6*b
> Ey1 <- outer(x1,x2,ey1)
> contour(x1,x2,Ey1,main = expression(paste("Contour plot for ",
+ E(Y*"|"*x,beta) ==83 + 9*x[1]+ 6*x[2])))
> ey2 <- function(a,b) 83 + 9*a + 6*b + 3*a*b
> Ey2 <- outer(x1,x2,ey2)
> contour(x1,x2,Ey2,main = expression(paste("Contour plot for ",
+ E(Y*"|"*x,beta)==83 + 9*x[1]+ 6*x[2] + 3*x[1]*x[2])))
> ey3 <- function(a,b) 83 + 9*a + 6*b + 2*a^4 + 3*b^3 + 3*a*b
> Ey3 <- outer(x1,x2,ey3)
> contour(x1,x2,Ey3,main = expression(paste("Contour plot for ",
+ E(Y*"|"*x,beta)==83 + 9*x[1] + 6*x[2] + 2*x[1]^4 + 3*x[2]^3 + 3*x[1]*x[2])))
> ey4 <- function(a,b) 83 + 9*a + 6*b - 2*a^4 - 3*b^3 + 3*a*b
> Ey4 <- outer(x1,x2,ey4)
> contour(x1,x2,Ey4,main = expression(paste("Contour plot for ",
+ E(Y*"|"*x,beta)==83 + 9*x[1] + 6*x[2] - 2*x[1]^4 - 3*x[2]^3 + 3*x[1]*x[2])))

Here the vector variables x1 and x2 are created as a common grid of values. For each combination of these values, we need to compute the associated $c12-math-0245$ . The outer function evaluates the functions ey1 to ey4 at all such combinations. Note that each of the R objects Ey1 to Ey4 is a matrix and not a vector, with length(x1) rows and length(x2) columns. This is a major difference from the scatterplot3d function. The contour function is then invoked with the R objects x1, x2, and the corresponding matrix. The contour plots are generated for the four regression models.

As with the three-dimensional plot, the linearity is apparent for the first two contour plots of Figure 12.7, and the quadratic terms for the third example.□

Figure 12.7 The Contour Plots for Three Models

12.4.3 Fitting a Multiple Linear Regression Model

Given $c12-math-0246$ observations, the multiple linear regression model can be written in a matrix form as

12.34

where $c12-math-0248$ , $c12-math-0249$ , and $c12-math-0250$ , and the covariate matrix $c12-math-0251$ ¹ is

The least-squares normal equations for the multiple linear regression model are given by

which leads to the least-squares estimator

12.35

The model fitted values $c12-math-0256$ are obtained as

12.36

where we define

12.37

The matrix $c12-math-0259$ is called the hat matrix, which has many useful properties.

Properties of the Hat Matrix. The hat matrix is symmetric and idempotent, that is,

1. $c12-math-0260$ , symmetric property.
2. $c12-math-0261$ , idempotent property.
3. Furthermore, $c12-math-0262$ is also symmetric and idempotent.
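These properties are easy to check numerically. The following is a minimal sketch that builds the hat matrix as X(X'X)^{-1}X' from a model matrix. The built-in mtcars data and the model mpg ~ wt + hp are stand-ins chosen only for illustration.

```r
# Stand-in data: built-in 'mtcars'; the model mpg ~ wt + hp is illustrative only.
X <- model.matrix(mpg ~ wt + hp, data = mtcars)    # includes the intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)              # hat matrix
I_n <- diag(nrow(X))
print(all.equal(H, t(H)))                          # symmetric
print(all.equal(H %*% H, H))                       # idempotent
print(all.equal((I_n - H) %*% (I_n - H), I_n - H)) # I - H is idempotent as well
print(sum(diag(H)))                                # trace equals the number of columns of X
```

The comparisons use all.equal rather than == because the matrix products are only equal up to floating-point rounding.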

The hat matrix plays a vital role in determining the residual values, which in turn are very useful in assessing model adequacy, as will be seen in Section 12.5. Chatterjee and Hadi (1988) refer to the hat matrix as the prediction matrix. The importance of the hat matrix is that it serves as a bridge between the observed values and the fitted values. By definition

12.38

We had seen earlier, in Section 12.2, the role of the residuals in estimation of the variance $c12-math-0264$ . The results further extend here in the following manner:

12.39

Thus, the residual mean square is

12.40

As in Section 12.2, $c12-math-0267$ is an unbiased estimator of $c12-math-0268$ , that is,

12.41
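The residual sum of squares and the residual mean square can be obtained from the hat matrix and compared with the output of lm. This is a sketch under the assumption that the residuals are (I - H)y and that the divisor is n - p, with p counting the intercept column of the covariate matrix. The mtcars model is again only a stand-in.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)            # stand-in model
X <- model.matrix(fit)
y <- mtcars$mpg
n <- nrow(X)
p <- ncol(X)                                       # number of coefficients, intercept included
H <- X %*% solve(crossprod(X)) %*% t(X)
e <- as.vector((diag(n) - H) %*% y)                # residuals via the hat matrix
ss_res <- sum(e^2)                                 # residual sum of squares
ms_res <- ss_res / (n - p)                         # residual mean square
print(c(by_hand = ms_res, via_lm = summary(fit)$sigma^2))
```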

12.4.4 Testing Hypotheses and Confidence Intervals

12.4.4.1 Testing for Significance of the Regression Model

We need to first assess whether the linear relationship between the regressand and the predictor variables is significant or not. In terms of hypotheses, this translates into testing $c12-math-0270$ against the hypothesis $c12-math-0271$ for at least one $c12-math-0272$ . For the problem of testing the hypothesis $c12-math-0273$ against $c12-math-0274$ , we define the following:

12.42

12.43

12.44

In Equation 12.42, we have $c12-math-0278$ . Computationally, we recognize that the regression sum of squares is obtained by exploiting the constraint $c12-math-0279$ , that is by $c12-math-0280$ . Under the hypothesis $c12-math-0281$ , $c12-math-0282$ follows a $c12-math-0283$ distribution, and $c12-math-0284$ follows a $c12-math-0285$ distribution. Furthermore, it is noted that $c12-math-0286$ and $c12-math-0287$ are independent. Thus, the test statistic for the hypothesis $c12-math-0288$ is the ratio of the mean of regression sum of squares to the mean of residual sum of squares, which is an $c12-math-0289$ -statistic. In summary, the test statistic is

12.45

which follows an $c12-math-0292$ distribution. Now, to test $c12-math-0293$ , compute the test statistic $c12-math-0294$ and reject the hypothesis if

The ANOVA table is hence consolidated in Table 12.3.

Table 12.3 ANOVA Table for Multiple Linear Regression Model

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Square	$c12-math-0296$ -Statistic
Regression	$c12-math-0297$	$c12-math-0298$	$c12-math-0299$	$c12-math-0300$
Residual	$c12-math-0301$	$c12-math-0302$	$c12-math-0303$
Total	$c12-math-0304$	$c12-math-0305$
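The entries of the ANOVA table can be assembled by hand and checked against summary. The sketch assumes the usual decomposition, in which the regression sum of squares is the total sum of squares minus the residual sum of squares, with k and n - k - 1 degrees of freedom for k regressors. The mtcars model is a stand-in.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)            # stand-in model
y <- mtcars$mpg
n <- length(y)
k <- length(coef(fit)) - 1                         # number of regressors, intercept excluded
ss_total <- sum((y - mean(y))^2)
ss_res <- sum(resid(fit)^2)
ss_reg <- ss_total - ss_res
F_stat <- (ss_reg / k) / (ss_res / (n - k - 1))    # ratio of the two mean squares
p_value <- pf(F_stat, k, n - k - 1, lower.tail = FALSE)
print(c(F_by_hand = F_stat,
        F_via_lm = unname(summary(fit)$fstatistic["value"]),
        p_value = p_value))
```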

12.4.4.2 The Role of C Matrix for Testing the Regression Coefficients

The hypothesis $c12-math-0306$ tested above is a global test of model adequacy. If we reject the previous hypothesis, we may be interested in the specific predictors which are significant. Thus, there is a need to find such predictors.

The least-squares estimator $c12-math-0307$ is an unbiased estimator of $c12-math-0308$ , and further

12.46

Define $c12-math-0310$ . Then the variance of the estimator of the $c12-math-0312$ regression coefficient $c12-math-0313$ is $c12-math-0314$ , where $c12-math-0315$ is the $c12-math-0316$ diagonal element of the matrix $c12-math-0317$ , and the covariance between $c12-math-0318$ and $c12-math-0319$ is $c12-math-0320$ .

We are now specifically interested in testing the hypothesis $c12-math-0321$ against the hypothesis $c12-math-0322$ , $c12-math-0323$ . The test statistic for the hypothesis $c12-math-0324$ is

12.47

Under the hypothesis $c12-math-0326$ , the test statistic $c12-math-0327$ follows a $c12-math-0328$ -distribution with $c12-math-0329$ degrees of freedom. Thus the test procedure is to reject the hypothesis $c12-math-0330$ if $c12-math-0331$ .
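The role of the matrix C can be illustrated numerically. This sketch assumes C is the inverse of X'X, so that the standard error of each coefficient is the square root of the residual mean square times the corresponding diagonal element of C. The mtcars model is a stand-in.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)            # stand-in model
X <- model.matrix(fit)
C <- solve(crossprod(X))                           # assumed form: inverse of X'X
ms_res <- sum(resid(fit)^2) / df.residual(fit)     # residual mean square
se <- sqrt(ms_res * diag(C))                       # standard errors of the coefficients
t_stat <- coef(fit) / se
p_val <- 2 * pt(abs(t_stat), df.residual(fit), lower.tail = FALSE)
print(cbind(estimate = coef(fit), se = se, t = t_stat, p = p_val))
```

The printed columns should match the coefficient table given by summary(fit).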

Confidence Intervals

An important result, see Montgomery, et al. (2003), is the following:

12.48

This property allows us to construct the joint confidence region for the vector of regression coefficients $c12-math-0333$ as follows:

The $c12-math-0334$ confidence region for the regression coefficients is given by

In the $c12-math-0336$ -dimensional space, the above inequality is a region of elliptical shape. As the next natural step, if we are interested in some specific predictor variable, the $c12-math-0337$ confidence interval for its regression coefficient $c12-math-0338$ is given by

12.49

Finally, consider a new observation point $c12-math-0340$ . A natural estimate of the future observation $c12-math-0341$ is $c12-math-0342$ . A $c12-math-0343$ prediction interval is given as

12.50
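The individual confidence intervals and the prediction interval can be verified against confint and predict. This is a sketch under the assumption that the interval half-widths are the t quantile times sqrt(MSRes C_jj) for a coefficient and the t quantile times sqrt(MSRes(1 + x0'Cx0)) for a new observation. The mtcars model and the new point (wt = 3, hp = 120) are hypothetical stand-ins.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)            # stand-in model
X <- model.matrix(fit)
C <- solve(crossprod(X))
df <- df.residual(fit)
ms_res <- sum(resid(fit)^2) / df
t_crit <- qt(0.975, df)
# 95% confidence intervals for the individual coefficients
half <- t_crit * sqrt(ms_res * diag(C))
ci <- cbind(lwr = coef(fit) - half, upr = coef(fit) + half)
# 95% prediction interval at a hypothetical new point
x0 <- c(1, 3, 120)                                 # intercept, wt = 3, hp = 120
y0_hat <- sum(x0 * coef(fit))
se_pred <- sqrt(ms_res * (1 + drop(t(x0) %*% C %*% x0)))
pi_by_hand <- c(y0_hat, y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
pi_via_predict <- predict(fit, data.frame(wt = 3, hp = 120),
                          interval = "prediction")
print(ci)
print(confint(fit))
print(pi_by_hand)
print(pi_via_predict)
```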

The theoretical developments for multiple linear regression model will now be taken to an R session.

Example 12.4.5. US Crime Data

Continuation of Example 12.4.1. The lm function is deployed to fit a multiple linear regression model for the US Crime Data. The problem is to fit a linear regression model for the crime rate R as a function of the covariates, from Age to X. The covariates which give significant explanation of the crime rate are identified by the $c12-math-0345$ -values. Using the lm function we build the multiple linear regression model, and then use the functions summary, confint, and anova to gather details of the fitted model.


> crime_rate_lm <- lm(R~Age+S+Ed+Ex0+Ex1+LF+M+N+NW+U1+U2+W+X, data=usc)
> # Equivalently, crime_rate_lm <- lm(R~.,data=usc)
> summary(crime_rate_lm)
Call:
lm(formula = R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW +
    U1 + U2 + W + X, data = usc)
Residuals:
    Min      1Q  Median      3Q     Max
-34.884 -11.923  -1.135  13.495  50.560
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.918e+02  1.559e+02  -4.438 9.56e-05 ***
Age          1.040e+00  4.227e-01   2.460  0.01931 *
S           -8.308e+00  1.491e+01  -0.557  0.58117
Ed           1.802e+00  6.496e-01   2.773  0.00906 **
Ex0          1.608e+00  1.059e+00   1.519  0.13836
U1          -6.017e-01  4.372e-01  -1.376  0.17798
U2           1.792e+00  8.561e-01   2.093  0.04407 *
W            1.374e-01  1.058e-01   1.298  0.20332
X            7.929e-01  2.351e-01   3.373  0.00191 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 21.94 on 33 degrees of freedom
Multiple R-squared:  0.7692,    Adjusted R-squared:  0.6783
F-statistic: 8.462 on 13 and 33 DF,  p-value: 3.686e-07
> confint(crime_rate_lm)
                    2.5 %       97.5 %
(Intercept) -1.008994e+03 -374.6812339
Age          1.798032e-01    1.8998161
S           -3.864617e+01   22.0295401
Ed           4.798774e-01    3.1233247
Ex0         -5.460558e-01    3.7616925
U1          -1.491073e+00    0.2877222
U2           5.049109e-02    3.5340347
W           -7.795484e-02    0.3526718
X            3.146483e-01    1.2712173
> anova(crime_rate_lm)
Analysis of Variance Table
Response: R
          Df  Sum Sq Mean Sq F value    Pr(>F)
Age        1   550.8   550.8  1.1448 0.2924072
S          1   153.7   153.7  0.3194 0.5757727
Ed         1  9056.7  9056.7 18.8221 0.0001275 ***
Ex0        1 30760.3 30760.3 63.9278 3.182e-09 ***
Ex1        1  1530.2  1530.2  3.1802 0.0837349 .
LF         1   611.3   611.3  1.2705 0.2677989
U1         1    70.7    70.7  0.1468 0.7040339
U2         1  2696.6  2696.6  5.6043 0.0239336 *
W          1   347.5   347.5  0.7221 0.4015652
X          1  5474.2  5474.2 11.3768 0.0019126 **
Residuals 33 15878.7   481.2
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

It can be seen from the output that the intercept term, Age, Ed, U2, and X are significant variables to explain the crime rate. The 95% confidence intervals also confirm it, since none of them contains zero. The model is also significant, as may be seen from p-value: 3.686e-07, and the adjusted $c12-math-0346$ value is also satisfactory. The interpretation of anova is left to the reader. Note that the rows for some of the insignificant variables have been omitted from the displayed output.

It becomes a bit tedious to write the variable names for a large dataset. If the model needs to consider all the variables from the data.frame, a simple trick is to use lm(y~., data=data) to include all those covariates.

A multiple linear regression model is not a trivial extension of the simple linear regression model. For instance, the $c12-math-0347$ and Adj- $c12-math-0348$ for this dataset are respectively 0.7692 and 0.6783, and the difference is seen to be nearly 10%. In the case of the simple linear regression model for the fitted models gbhlm and tclm, it was not more than 2%. Let us undertake a simple task. We will add the covariates one after another and see how the $c12-math-0349$ and Adj- $c12-math-0350$ behave.

> R2Various <- AdjR2Various <- 1:13
> for(i in 2:14){
+ R2Various[i-1] <- summary(lm(usc$R~as.matrix(usc[,2:i])))$r.squared
+ AdjR2Various[i-1]<- summary(lm(usc$R~as.matrix(usc[,2:i])))$adj.r.squared
+ }
> round(R2Various,2)
 [1] 0.01 0.01 0.14 0.59 0.61 0.62 0.64 0.64 0.64 0.65 0.68 0.69 0.77
> round(AdjR2Various,2)
 [1] -0.01 -0.03  0.08  0.55  0.56  0.56  0.57  0.57  0.56  0.55  0.59  0.58  0.68
> round(R2Various-AdjR2Various,2)
 [1] 0.02 0.04 0.06 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.10 0.11 0.09

Note that $c12-math-0351$ keeps on increasing as we add more variables to the model, and that the Adj- $c12-math-0352$ can be negative too. These and other related issues will be addressed in Section 12.8.□

The model validation task for the multiple linear regression model will be undertaken next.

$c12-math-0353$

12.5 Model Diagnostics for the Multiple Regression Model

In Sub-section 12.2.6 we saw the role of residuals for model validation. We have those and a few more options for obtaining residuals in the context of multiple regression. Cook and Weisberg (1982) is a detailed account of the role of residuals for the regression models.

Residuals may be developed in one of four ways:

1. Standardized residuals
2. Semi-Studentized residuals
3. Predicted Residuals, known by its famous abbreviation as PRESS
4. R-Student Residuals.

12.5.1 Residuals

Standardized Residuals. Scaling the residuals by dividing them by their estimated standard deviation, which is the square root of the mean residual sum of squares, we obtain the standardized residuals as

12.51

Note that the mean of the residuals is zero and in that sense it is present in the above expression. The standardized residuals $c12-math-0355$ are approximately standard normal variates, and any of their absolute values greater than 3 is an indicator of the presence of an outlier.

Semi-Studentized Residuals. Recollect the definition of the hat matrix from Sub-section 12.4.3 and its usage to define the residuals as

12.52

Substituting $c12-math-0357$ , and on further evaluation, we arrive at

12.53

Thus, the covariance matrix of the residuals is

12.54

The last step follows, since $c12-math-0360$ is an idempotent matrix. As the matrix $c12-math-0361$ is not necessarily a diagonal matrix, the residuals have different variances and are also correlated. Particularly, it can be shown that the variance of the residual $c12-math-0362$ is

12.55

where $c12-math-0364$ is the $c12-math-0365$ diagonal element of the hat matrix $c12-math-0366$ . Furthermore, the covariance between the residuals $c12-math-0367$ and $c12-math-0368$ is

12.56

where $c12-math-0370$ is the $c12-math-0371$ diagonal element of the hat matrix. The semi-Studentized residuals are then defined as

12.57

PRESS Residuals. The PRESS residual, also called the predicted residual and denoted by $e_{(i)}$, is based on a fit to the data that excludes the $i^{th}$ observation. Define $\hat{\beta}_{(i)}$ as the least-squares estimate of the regression coefficients obtained by excluding the $i^{th}$ observation from the dataset. Then the $i^{th}$ predicted residual $e_{(i)}$ is defined by

$$e_{(i)} = y_i - \hat{y}_{(i)} = y_i - x_i'\hat{\beta}_{(i)}, \quad i = 1, \ldots, n.$$

At first sight this looks daunting, since we seem to need $n$ separate model fits to obtain the $n$ predicted residuals. For large $n$ this may take an unreasonable amount of computing time, and fortunately it can be avoided. There is a relationship between the predicted residuals and the ordinary residuals that circumvents the troublesome route of fitting $n$ models. Equation 2.2.23 of Cook and Weisberg (1982) shows that the predicted residuals and the residuals are related by

$$e_{(i)} = \frac{e_i}{1 - h_{ii}}. \tag{12.58}$$

The variance of the $i^{th}$ predicted residual is

$$\mathrm{Var}(e_{(i)}) = \frac{\sigma^2}{1 - h_{ii}}. \tag{12.59}$$

The standardized prediction residual, denoted by $e_{(i)}^{*}$, is obtained by

$$e_{(i)}^{*} = \frac{e_{(i)}}{\sqrt{\mathrm{Var}(e_{(i)})}} = \frac{e_i}{\sqrt{\sigma^2 (1 - h_{ii})}}. \tag{12.60}$$

With $\sigma^2$ estimated by $MS_{Res}$, the standardized prediction residual coincides with the semi-Studentized residual of Equation 12.57.
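The shortcut between the predicted residuals and the ordinary residuals can be checked numerically. The following is a minimal sketch, assuming the built-in cars data (not one of the chapter's datasets) and hypothetical object names; it refits the model once per observation and compares the result with the residual divided by one minus the leverage.

```r
# Illustrative data only: 'cars' ships with base R.
fit <- lm(dist ~ speed, data = cars)
e <- resid(fit)
h <- hatvalues(fit)
n <- nrow(cars)

# Brute force: refit the model n times, each time leaving one observation out.
press_loo <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])
  press_loo[i] <- cars$dist[i] - predict(fit_i, newdata = cars[i, , drop = FALSE])
}

# Shortcut: predicted residual = ordinary residual / (1 - leverage).
press_short <- e / (1 - h)

all.equal(unname(press_loo), unname(press_short))  # the two routes agree
sum(press_short^2)                                  # the PRESS statistic
```

For a dataset of this size the loop is instantaneous, but the shortcut needs only a single fit irrespective of the number of observations.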

R-Student Residuals. Montgomery, et al. (2003) mention that the Studentized residuals are useful for outlier diagnostics and that using the $MS_{Res}$ estimate of $\sigma^2$ amounts to an internal scaling of the residual, since $MS_{Res}$ is an internally generated estimate of $\sigma^2$. Similar to the approach of the prediction residual, we can construct an estimator of $\sigma^2$ by removing the $i^{th}$ observation from the dataset. This estimator, denoted by $S_{(i)}^2$, is computed by

$$S_{(i)}^2 = \frac{(n - p)MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1},$$

where $p$ denotes the number of estimated regression coefficients, including the intercept.

Using $S_{(i)}^2$ for scaling purposes, the R-Student residual is defined by

$$t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, \ldots, n. \tag{12.61}$$
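The externally scaled residual can also be built by hand. The sketch below is an illustration under stated assumptions: it uses the built-in cars data, takes p to be the number of estimated coefficients including the intercept, and relies on the standard leave-one-out identity for the variance estimate. The results are compared with the sigma component of influence and with rstudent.

```r
fit <- lm(dist ~ speed, data = cars)    # illustrative data, not from the chapter
e <- resid(fit)
h <- hatvalues(fit)
n <- nrow(cars)
p <- length(coef(fit))                  # coefficients, intercept included (assumed notation)
mse <- sum(e^2) / (n - p)

# Leave-one-out variance estimate obtained without refitting the model
s2_del <- ((n - p) * mse - e^2 / (1 - h)) / (n - p - 1)

# influence() reports the leave-one-out residual standard deviation
all.equal(unname(sqrt(s2_del)), unname(influence(fit)$sigma))

# R-Student residuals by hand, compared with rstudent()
t_manual <- e / sqrt(s2_del * (1 - h))
all.equal(unname(t_manual), unname(rstudent(fit)))
```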

The utility of these residuals is studied through the next example.

Example 12.5.1. A Regression Model of the Abrasion Index for the Tire Tread

To model the abrasion index for the tire tread, the output $y$, as a linear function of the hydrated silica level $x_1$, the silane coupling agent level $x_2$, and the sulfur level $x_3$, Derringer and Suich (1980) collected data on 14 observation points.

The appropriateness of a linear regression model through graphics for the current dataset is left as an exercise for the reader. A linear regression model for this dataset is considered and then an attempt is made to understand the four types of residuals discussed so far.

> abrasion_index <- read.table("abrasion_index.dat",header=TRUE)
> ailm <- lm(y~x1+x2+x3,data=abrasion_index)
> pairs(abrasion_index) # graphical output suppressed
> aianova <- anova(ailm)
> ai_fitted <- ailm$fitted.values
> ailm_mse <- aianova$Mean[length(aianova$Mean)]
> stan_resid_ai <- resid(ailm)/sqrt(ailm_mse)
> # Standardizing the residuals
> studentized_resid_ai <- resid(ailm)/(sqrt(ailm_mse*(1-hatvalues(ailm))))
> # Studentizing the residuals
> # There is no need to write complex code for the
> # prediction residuals or the R-Student residuals
> # R helps! It has good functions for this purpose
> pred_resid_ai <- rstandard(ailm)
> # returns the prediction residuals in a standardized form
> pred_student_resid_ai <- rstudent(ailm)
> # returns the R-Student prediction residuals
> par(mfrow=c(2,2))
> plot(ai_fitted,stan_resid_ai,xlab="Fitted",ylab="Standardized Residuals",
+ main="A: Plotting Standardized Residuals against Fitted Values")
> plot(ai_fitted,studentized_resid_ai,xlab="Fitted",ylab="Studentized Residuals",
+ main="B: Plotting Studentized Residuals against Fitted Values")
> plot(ai_fitted,pred_resid_ai,xlab="Fitted",ylab="Prediction Residuals",
+ main="C: Plotting PRESS against Fitted Values")
> plot(ai_fitted,pred_student_resid_ai,xlab="Fitted",ylab="R-Student Residuals",
+ main="D: Plotting R-Student Residuals against Fitted Values")
> range(stan_resid_ai)
[1] -1.645103  1.267604
> range(studentized_resid_ai)
[1] -2.025635  1.821972
> range(pred_resid_ai)
[1] -2.025635  1.821972
> range(pred_student_resid_ai)
[1] -2.502501  2.114761
> sum(studentized_resid_ai==pred_resid_ai)
[1] 5
> length(studentized_resid_ai)
[1] 14

The multiple linear regression model is fitted for the abrasion index using the standard lm function. The anova function is again used to facilitate the computation of the mean squared error. The fitted values are stored in the ai_fitted vector object. The standardized residuals given in Equation 12.51 are calculated with the R code stan_resid_ai <- resid(ailm)/sqrt(ailm_mse). The formula for the semi-Studentized residuals given by Equation 12.57 is implemented on the right-hand side of the assignment operator for studentized_resid_ai. Note that the elements $h_{ii}$ are extracted from the fitted object ailm using the R function hatvalues. The standardized PRESS residuals and the R-Student residuals are obtained using the R functions rstandard and rstudent respectively, and in our program they are stored in the R objects pred_resid_ai and pred_student_resid_ai. Using the now familiar par and plot functions, we plot these four sets of residuals against the fitted values.

At first glance, the four plots A to D in Figure 12.8 look identical, and hence the range function has been used for a quick inspection, which shows that the arrays are not all the same. The residual plot gives a clear answer to the presence of an outlier. The residual plot, sans the outlier, shows that the linear model is appropriate. It is to be noted that in a case such as this one, we need to first remove the outlier and then redo the entire analysis.□

Figure 12.8 Residual Plot for the Abrasion Index Data
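In the session above, studentized_resid_ai and pred_resid_ai share the same range, yet sum(studentized_resid_ai==pred_resid_ai) counts only 5 matches out of 14. A plausible explanation is floating-point rounding, since rstandard returns residuals that are algebraically the same as those of Equation 12.57. The sketch below, on the built-in cars data (an assumption, as are the object names), contrasts the exact comparison with a tolerance-based one.

```r
fit <- lm(dist ~ speed, data = cars)             # illustrative data only
mse <- sum(resid(fit)^2) / df.residual(fit)
semi_student <- resid(fit) / sqrt(mse * (1 - hatvalues(fit)))

# Exact comparison: may report fewer matches than observations because of rounding
sum(semi_student == rstandard(fit))

# Tolerance-based comparison: the appropriate check for floating-point vectors
all.equal(semi_student, rstandard(fit))
max(abs(semi_student - rstandard(fit)))          # largest discrepancy
```

Comparisons of computed residuals are therefore safer with all.equal than with ==.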

12.5.2 Influence and Leverage Diagnostics

In the previous subsection, we saw the power of leave-one-out residuals. The residual plots help in understanding the model adequacy. We will now take measures to explain the effect of each covariate on the model fit and also of each output value on the same.

A large residual may arise on account of either an unusual value of the covariate or an unusual value of the output. If some observations are markedly distinct from the rest of the dataset and the reason for this is the covariate value, then such points are called leverage points. The leverage points do not affect the estimates of the regression coefficients, although they are known to drastically change the $R^2$ values. For the fourth dataset of the Anscombe quartet, refer to the scatter plot in Figure 12.8, every observation except the eighth has the covariate value 8. Hence, we may say that the eighth observation is a leverage point.

Another source of discrepant observations, for reasonable levels of the covariate values, may be the output values. Such data points are called influence points. The influence points have a significant effect on the estimated values of the regression coefficients, and in particular tilt the model relationship towards them. For the third dataset of the Anscombe quartet, refer to the scatter plot in Figure 12.8, the outlier is clearly influential.

Leverage Points

We will recollect the definition of the hat matrix:

$$H = X(X'X)^{-1}X'.$$

The elements $h_{ij}$ of the hat matrix may be interpreted as the amount of leverage exerted by the $i^{th}$ observation $y_i$ on the $j^{th}$ fitted value $\hat{y}_j$. Note that the diagonal elements may be easily obtained as

$$h_{ii} = x_i'(X'X)^{-1}x_i, \tag{12.62}$$

where $x_i'$ is the $i^{th}$ row of $X$.

This hat matrix diagonal is a standardized measure of the distance of the $c12-math-0414$ observation from the center of the $c12-math-0415$ -space. The average size of a hat diagonal is $c12-math-0416$ . Any observation for which the $c12-math-0417$ value exceeds twice the expected leverage of $c12-math-0418$ is considered as a leverage point.
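The rule of twice the average hat diagonal is easily tried on the fourth dataset of the Anscombe quartet, which ships with base R as the columns x4 and y4 of anscombe. This is only a sketch with hypothetical object names; the average leverage is computed directly from the hat values rather than from a formula.

```r
fit4 <- lm(y4 ~ x4, data = anscombe)
h <- hatvalues(fit4)
round(h, 3)

sum(h)                    # equals the number of estimated coefficients
cutoff <- 2 * mean(h)     # twice the average hat diagonal
which(h > cutoff)         # observations flagged as leverage points
```

Only the eighth observation is flagged, in line with the earlier remark about this dataset.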

Cook's Distance for the Influence Points

The Cook's distance for identifying the influence points is based on the leave-one-out approach as seen earlier. Let $c12-math-0419$ denote the $c12-math-0420$ fitted value, and $c12-math-0421$ denote the $c12-math-0422$ predicted value when the $c12-math-0423$ observation is removed from the model building, that is,

12.63

Then the Cook's distance for the $c12-math-0425$ observation is given by

12.64

The distance $c12-math-0427$ can be interpreted as the squared Euclidean distance that the vector of fitted values moves when the $c12-math-0428$ observation is deleted. The magnitude of the distance $c12-math-0429$ is assessed by comparing with $c12-math-0430$ . As a rule of thumb, any value of the distance $c12-math-0431$ may be called an influential observation.

DFFITS and DFBETAS Measures for the Influence Points

To understand the influence of the $c12-math-0432$ observation on the regression coefficient $c12-math-0433$ , the $c12-math-0434$ measure proposed by Belsley, Kuh, and Welsch (1980) is given as

12.65

where $c12-math-0436$ is the $c12-math-0437$ diagonal element of the $c12-math-0438$ matrix, and $c12-math-0439$ is the estimate of the $c12-math-0440$ regression coefficient with the deletion of the $c12-math-0441$ observation. Belsley, Kuh, and Welsch suggest a cut-off value of $c12-math-0442$ .

Similarly, a measure for understanding the effect of an observation on the fitted value is given by $c12-math-0443$ . This measure is defined as

12.66

The relevant rule of thumb is a cut-off value of $c12-math-0445$ .
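The influence measures are all available in base R. The sketch below uses the built-in mtcars data and hypothetical object names. The cut-offs applied here, 1 for Cook's distance, 2/sqrt(n) for DFBETAS, and 2*sqrt(p/n) for DFFITS with p counting the intercept, are the commonly quoted ones and are an assumption of this sketch. It also checks the identity that links DFFITS to the R-Student residuals and the leverages.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model only
n <- nrow(mtcars)
p <- length(coef(fit))                    # coefficients including the intercept (assumed notation)
h <- hatvalues(fit)

# DFFITS recovered from the R-Student residuals and the leverages
dffits_manual <- rstudent(fit) * sqrt(h / (1 - h))
all.equal(dffits_manual, dffits(fit))

# Commonly quoted rules of thumb (assumed cut-offs)
which(cooks.distance(fit) > 1)
which(abs(dffits(fit)) > 2 * sqrt(p / n))
which(apply(abs(dfbetas(fit)) > 2 / sqrt(n), 1, any))
```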

Example 12.5.2. A Regression Model of the Abrasion Index for the Tire Tread

Contd. Useful functions such as hatvalues, cooks.distance, dffits, dfbetas, and covratio make it easy to compute all the related values discussed so far. It is so easy to do this in R that a single line of code puts them all together.

> round(cbind(hatvalues(ailm),cooks.distance(ailm),dffits(ailm),dfbetas(ailm),covratio(ailm)),4)
                         (Intercept)      x1      x2      x3
1  0.3404 0.5294 -1.7978     -0.7207  0.4247  0.9392 -0.9392 0.2794
2  0.5160 0.3231 -1.1504     -0.5619 -0.7064  0.7322  0.3603 1.8778
3  0.5957 0.0679  0.4989      0.1008 -0.3385  0.3096 -0.3096 3.5001
4  0.5160 0.8846  2.1834      1.0665  1.3406  0.6838  1.3897 0.6271
5  0.4309 0.3290  1.1974      0.3556 -0.6034 -0.4633 -0.7811 1.2474
6  0.3404 0.1397  0.7510      0.3011 -0.1774 -0.3923  0.3923 1.4611
7  0.4309 0.3589 -1.2627     -0.3750  0.6363 -0.8236 -0.4886 1.1549
8  0.3830 0.2895 -1.1319     -0.5704 -0.4538 -0.5044  0.5044 1.0815
9  0.0745 0.0022 -0.0904     -0.0904 -0.0183  0.0048 -0.0048 1.5745
10 0.0745 0.0022 -0.0904     -0.0904 -0.0183  0.0048 -0.0048 1.5745
11 0.0745 0.0015  0.0729      0.0729  0.0147 -0.0039  0.0039 1.5993
12 0.0745 0.0039  0.1202      0.1202  0.0243 -0.0064  0.0064 1.5216
13 0.0745 0.0099  0.1935      0.1935  0.0391 -0.0103  0.0103 1.3460
14 0.0745 0.0039  0.1202      0.1202  0.0243 -0.0064  0.0064 1.5216

Some rules of thumb were stated earlier. Also, the plot.lm function gives ready plots of the Cook's distances versus the row labels, the residuals against the leverages, and the Cook's distances against leverage/(1-leverage), see Figure 12.9. Finally, we use the dfbetas and dffits values to flag the candidate influence points.

> pdf("Cooks_Distance_ailm.pdf")
> par(mfrow=c(1,3));plot(ailm,which=c(4:6))
> dev.off()
X11cairo
       1
> which(abs(as.vector(dfbetas(ailm)[,1]))>2/sqrt(14))
[1] 1 2 4 8
> which(abs(as.vector(dfbetas(ailm)[,2]))>2/sqrt(14))
[1] 2 4 5 7
> which(abs(as.vector(dfbetas(ailm)[,3]))>2/sqrt(14))
[1] 1 2 4 7
> which(abs(as.vector(dfbetas(ailm)[,4]))>2/sqrt(14))
[1] 1 4 5
> which(abs(as.vector(dffits(ailm)))>2/sqrt(14))
[1] 1 2 4 5 6 7 8

Since, by rule of thumb, none of the observations have Cook's distance greater than 1, there is no influential observation in the dataset. The above program gives a platform to interpret the utility of dffits and dfbetas and is left as an exercise.□

Figure 12.9 Cook's Distance for the Abrasion Index Data

For a detailed list of R functions in the context of regression analysis, refer to the pdf file at http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf. The leverage points answer the problem of an outlier with respect to the covariate value. There may be at times linear dependencies among the covariates themselves, which lead to unstable estimates of the regression coefficients and this problem is known as the multicollinearity problem, which will be dealt with in the next section.


12.6 Multicollinearity

Multicollinearity is best understood by splitting its spelling as “multi-col-linearity”, implying a linear relationship among the multiple columns of the covariate matrix. In another sense, the columns of the covariate matrix are linearly dependent, and further it implies that the covariate matrix may not be of full rank. This linear dependence means that the covariates are strongly correlated. Highly correlated explanatory variables can cause several problems when applying the multiple regression model.

Example 12.6.1. Understanding the Problem of Multicollinearity

We fitted a multiple linear regression model for the US crime dataset in earlier sections. The variables Ex0 and Ex1 are strongly correlated, as can be seen by the following:

> cor(usc$Ex0,usc$Ex1)
[1] 0.9935865
> cor(usc) # output suppressed

Let us have another look at the problem. For the Euphorbiaceae data of Section 12.2, add small noise to the regressor GBH using the jitter function and rebuild the model with these two nearly identical regressors.

> data(Euphorbiaceae)
> gbhlmjit <- lm(Height ~ GBH+jitter(GBH),data=Euphorbiaceae)
> summary(gbhlmjit)
Call:
lm(formula = Height ~ GBH + jitter(GBH), data = Euphorbiaceae)
Residuals:
    Min      1Q  Median      3Q     Max
-29.242  -5.575  -1.455   5.732  27.959
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    8.742      2.166   4.036 0.000105 ***
GBH           -1.761      8.210  -0.214 0.830596
jitter(GBH)    2.065      8.213   0.251 0.802011
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 9.46 on 103 degrees of freedom
Multiple R-squared:  0.404, Adjusted R-squared:  0.3925
F-statistic: 34.91 on 2 and 103 DF,  p-value: 2.654e-12

Now, though the overall model is significant, since the $p$-value of the $F$-statistic is 2.654e-12, GBH is no longer a significant regressor, and its regression coefficient changes from a positive value of 0.59084 in gbhlm to a negative value of -1.761 in gbhlmjit. Thus, the multicollinearity problem needs to be identified in the multiple linear regression model.□

In general, we see that multicollinearity leads to the following problems:

1. Imprecise estimates of $\beta$, which may even carry misleading signs
2. The $t$-tests may fail to reveal significant factors
3. The relative importance of the predictors may be masked.

Spotting multicollinearity amongst a set of explanatory variables is a daunting task. The obvious course of action is to simply examine the correlations between these variables, and while this is often helpful, it is by no means foolproof, and more subtle forms of multicollinearity may be missed. An alternative and generally far more useful approach is to examine what are known as the variance inflation factors of the explanatory variables.

12.6.1 Variance Inflation Factor

The variance inflation factor $VIF_j$ for the $j^{th}$ variable is given by

$$VIF_j = \frac{1}{1 - R_j^2}, \tag{12.67}$$

where $R_j^2$ is the square of the multiple correlation coefficient from the regression of the $j^{th}$ explanatory variable on the remaining explanatory variables. The variance inflation factor of an explanatory variable indicates the strength of the linear relationship between the variable and the remaining explanatory variables. A rough rule of thumb is that variance inflation factors greater than ten give cause for concern.
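The definition translates directly into code. The following sketch, on four covariates picked from the built-in mtcars data purely for illustration (the object names are hypothetical), computes each variance inflation factor by regressing one covariate on the others, and compares the result with the diagonal of the inverse correlation matrix, which is an equivalent route.

```r
X <- mtcars[, c("disp", "hp", "wt", "drat")]    # illustrative covariates only

# Route 1: regress each covariate on the remaining ones
vif_by_regression <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Route 2: diagonal of the inverse of the correlation matrix
vif_by_inverse <- diag(solve(cor(X)))

round(vif_by_regression, 3)
all.equal(unname(vif_by_regression), unname(vif_by_inverse))
```

The second route avoids fitting one regression per covariate, which is convenient when the number of covariates is large.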

Example 12.6.2. US Crime Data

Contd. Earlier, we fitted a full model for the US crime data. We now compute the variance inflation factors (VIF) of the 13 explanatory variables. The R function vif from the faraway package is required for this purpose, and hence the package is first loaded in the present R session. Also, in the spirit of understanding the concepts, we show how the VIF for Age is computed without using the vif function.

> library(faraway)
> uscrimewor <- usc[,-1] # without response variable
> vif(uscrimewor)
   Age      S       W      X
 2.698  4.877   9.969  8.409
> 1/(1-summary(lm(Age~.,data=uscrimewor))$r.squared)
[1] 2.698

Concentrating for now on the variance inflation factors in the above output, we see that those for Ex0 and Ex1 are well above the value 10. As a consequence, we simply drop Ex1, the variable with the highest VIF, from consideration and now regress the crime rates on the remaining 12 explanatory variables using the following code:

> crime_rate_lm2 <- lm(R~Age+S+Ed+Ex0-Ex1+LF+M+N+NW+U1+U2+W+X,usc)
> # Note how the variable Ex1 is removed from the model
> summary(crime_rate_lm2)
Call:
lm(formula = R ~ Age + S + Ed + Ex0 - Ex1 + LF + M + N + NW +
    U1 + U2 + W + X, data = usc)
Residuals:
   Min     1Q Median     3Q    Max
-38.76 -13.59   1.09  13.25  48.93
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.04e+02   1.53e+02   -4.60  5.6e-05 ***
Age          1.06e+00   4.16e-01    2.56  0.01523 *
S           -7.87e+00   1.47e+01   -0.53  0.59682
Ed           1.72e+00   6.29e-01    2.74  0.00975 **
Ex0          1.01e+00   2.44e-01    4.15  0.00021 ***
LF          -1.72e-02   1.46e-01   -0.12  0.90730
M            1.63e-01   2.08e-01    0.78  0.43842
N           -3.89e-02   1.28e-01   -0.30  0.76360
NW          -1.30e-04   6.20e-02    0.00  0.99834
U1          -5.85e-01   4.32e-01   -1.35  0.18467
U2           1.82e+00   8.46e-01    2.15  0.03883 *
W            1.35e-01   1.05e-01    1.29  0.20571
X            8.04e-01   2.32e-01    3.47  0.00145 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 21.7 on 34 degrees of freedom
Multiple R-squared:  0.767,     Adjusted R-squared:  0.685
F-statistic: 9.32 on 12 and 34 DF,  p-value: 1.35e-07
> vif(uscrimewor[,-5])
  Age     S       W     X
2.671 4.865   9.956 8.354

The square of the multiple correlation coefficient for crime_rate_lm2 is 0.767, indicating that the 12 explanatory variables account for about 77% of the variability in the crime rates of the 47 states. Also, recollect that crime_rate_lm did not have Ex0 as a significant variable, whereas crime_rate_lm2 shows the same variable as significant. Thus, the multicollinearity problem, if not addressed, has the potential to hide some significant variables too. The last step vif(uscrimewor[,-5]) shows that the VIF are now all less than ten, and hence we stop further investigation of the multicollinearity problem.□

The eigen system analysis provides another approach to identify the presence of multicollinearity and will be considered in the next subsection.

12.6.2 Eigen System Analysis

Eigenvalues were introduced in Section 2.4. The eigenvalues of the matrix $X'X$ help to identify the multicollinearity effect in the data. Here, we need to standardize the matrix $X$ in the sense that each column has mean zero and standard deviation 1. Suppose $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of the matrix $X'X$. Let $\lambda_{\max} = \max_j \lambda_j$ and $\lambda_{\min} = \min_j \lambda_j$. Define the condition number and the condition indices of $X'X$ respectively by

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}, \tag{12.68}$$

$$\kappa_j = \frac{\lambda_{\max}}{\lambda_j}, \quad j = 1, \ldots, p. \tag{12.69}$$

The multicollinearity problem is not an issue for a condition number less than 100, and moderate multicollinearity exists for condition numbers in the range of 100 to 1000. If the condition number exceeds 1000, the linear dependence among the covariates is severe. Similarly, any condition index $c12-math-0464$ in excess of 1000 is an indicator of an almost linear dependence among the covariates, and in general if $c12-math-0465$ condition indices are greater than 1000, we have $c12-math-0466$ linear dependencies.

The decomposition of the matrix $X'X$ through its eigenvalues gives us the eigen system analysis, in the sense that the matrix is decomposed as

$$X'X = T \Lambda T', \tag{12.70}$$

where, with $p$ the number of covariates, $\Lambda$ is a $p \times p$ diagonal matrix whose diagonal elements are the eigenvalues of $X'X$, and $T$ is a $p \times p$ orthogonal matrix whose columns consist of the eigenvectors of $X'X$. If any of the eigenvalues is close to zero, the corresponding eigenvector gives away the associated linear dependency. We will now continue to use these techniques on the US crime dataset.

Example 12.6.3. US Crime Data

Contd. The data.frame uscrimewor is first converted into a matrix and then standardized using the scale function, and $X'X$ is obtained with t(usc_stan)%*%usc_stan. Using the eigen function, the eigenvalues and the matrix of eigenvectors are obtained. The condition number is obtained with max(usc_eigen$values)/min(usc_eigen$values) and the condition indices with max(usc_eigen$values)/usc_eigen$values.

> uscrimewor <- as.matrix(uscrimewor)
> usc_stan <- scale(uscrimewor)
> x1x_stan <- t(usc_stan)%*%usc_stan
> usc_eigen <- eigen(x1x_stan)
> max(usc_eigen$values)/min(usc_eigen$values) # Condition number
[1] 1081
> max(usc_eigen$values)/usc_eigen$values # Condition indices
 [1]    1.000    2.243         42.387   80.262   91.630 1080.942
> which(max(usc_eigen$values)/usc_eigen$values>1000)
[1] 13
> usc_eigen$values
 [1] 259.9811 115.9304        2.8373   0.2405
> usc_eigen$vectors
          [,1]      [,2]       [,12]     [,13]
 [1,] -0.31718  0.124021     -0.01952 -0.010618
 [2,] -0.34001 -0.179417      0.24040 -0.005195
 [3,]  0.35610  0.214396     -0.04259  0.032819
 [4,]  0.31698 -0.299853      0.07939  0.698089
 [5,]  0.31952 -0.297130      0.07611 -0.713642
 [6,]  0.18310  0.400916       0.14887 -0.033483
 [7,]  0.12649  0.358458     0.05198 -0.003217
 [8,]  0.10731 -0.453749     0.09264 -0.008582
 [9,] -0.30357 -0.222782     -0.13307  0.022698
[10,]  0.04431 -0.118836    -0.04364 -0.004347
[11,]  0.01665 -0.400497    0.10366 -0.013567
[12,]  0.39207 -0.094476     -0.70169 -0.003024
[13,] -0.38113 -0.008553    -0.60971 -0.015409

The condition number of 1081 is very high and clearly points to the presence of the multicollinearity problem. The condition indices greater than 1000 are found with which(max(usc_eigen$values)/usc_eigen$values>1000), which shows a linear dependency associated with the 13th condition index. Thus, using the $13^{th}$ eigenvector, in terms of the covariates, the dependency can be written as

Thus, the eigensystem analysis for identifying multicollinearity is clearly illustrated here.□
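The same steps can be rehearsed on a small synthetic design where the dependency is planted on purpose. In the sketch below, which uses simulated data and hypothetical object names, x2 is almost a copy of x1, so one eigenvalue of the standardized cross-product matrix is close to zero and the matching eigenvector loads on x1 and x2 with opposite signs.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly a copy of x1: the planted dependency
x3 <- rnorm(n)
Xs <- scale(cbind(x1, x2, x3))   # columns with mean 0 and standard deviation 1

es <- eigen(crossprod(Xs))       # crossprod(Xs) is t(Xs) %*% Xs
cond_number  <- max(es$values) / min(es$values)
cond_indices <- max(es$values) / es$values
cond_number
round(cond_indices, 1)

# Eigenvector paired with the smallest eigenvalue
v <- es$vectors[, which.min(es$values)]
round(setNames(v, colnames(Xs)), 3)
```

The near-zero weight on x3 and the roughly equal and opposite weights on x1 and x2 point to the approximate relation x1 - x2 = 0 on the standardized scale.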


12.7 Data Transformations

In Chapter 4 we came across our first transformation technique through the concept of the rootogram. In regression analysis we need transformations for a host of reasons. A partial list of reasons is as follows:

1. The linearity assumption is violated, that is, $c12-math-0480$ may not be linear.
2. For the variation in the $c12-math-0481$ 's, the error variance may not be constant. This means that the variance of the $c12-math-0482$ 's may be a function of the mean. In fact, the assumption of constant variance is also known as the assumption of homoscedasticity.
3. Model errors do not follow a normal distribution.
4. In some experiments, we may have information about the need of transformations. For the model $c12-math-0483$ , the transformation $c12-math-0484$ provides us with a linear model. Such information may not be available a priori, and is reflected by residual plots only.

12.7.1 Linearization

As seen in the introduction, linearity is one of our basic assumptions. In many cases, it is possible to achieve linearity by applying a transformation: many models with non-linear terms may be transformed to a linear model by a suitable choice of transformation. See Table 6.1 of Chatterjee and Hadi (2006), or Table 5.1 of Montgomery, et al. (2003). For example, if we have the model $y = \alpha x^{\beta}$, by using a log transformation, we obtain $\log y = \log \alpha + \beta \log x$. The transformed model is linear in terms of its parameters. Figure 12.10 displays the behavior of $y = \alpha x^{\beta}$ for various choices of $\alpha$ and $\beta$ (before the transformation).

> x <- seq(0,5,0.1)
> alpha <- 1
> par(mfrow=c(1,2))
> plot(x,y=alpha*x^{beta=1},xlab="x",ylab="y",type="l",lwd=1)
> points(x,y=alpha*x^{beta=0.5},type="l",lwd=2)
> points(x,y=alpha*x^{beta=1.5},type="l",lwd=3)
> points(x,y=alpha*x^{beta= -0.5},type="b",lwd=2)
> points(x,y=alpha*x^{beta= -1},type="b",lwd=3)
> points(x,y=alpha*x^{beta= -2},type="b",lwd=2)

Figure 12.10 Illustration of Linear Transformation
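The effect of the log transformation on the power model can be seen on simulated data. The sketch below assumes hypothetical true values of the two parameters and a multiplicative error, generates data from the power curve, and recovers the parameters from a straight-line fit on the log-log scale.

```r
set.seed(123)
alpha <- 2                                   # hypothetical true values
beta  <- 1.5
x <- seq(0.1, 5, length.out = 100)           # start above 0 so that log(x) is finite
y <- alpha * x^beta * exp(rnorm(length(x), sd = 0.1))   # multiplicative error

loglm <- lm(log(y) ~ log(x))
coef(loglm)              # intercept estimates log(alpha), slope estimates beta
exp(coef(loglm)[1])      # back-transformed estimate of alpha
```

The multiplicative error is a design choice of the sketch: it becomes additive after taking logarithms, which is what the linear model requires.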

We will next consider an example where a transformation achieves linearization.

Example 12.7.1. Frog Survival as a Function of Age

A well-known natural phenomenon is that the number of survivors decreases with age. In the Frog_survival dataset from the gpk package, we have the number of survivors at the age of 1, 2, …, 8 years. A straightforward use of the simple linear regression model gives a poor fit: fitting a linear model for the number of Individuals as a function of Age gives an adjusted $R^2$ of 0.2264. Moreover, the variable Age is also an insignificant variable.

Thus, there is a need to carry out a transformation to improve the model, and we attempt to model the logarithm of Individuals as a function of Age. Mathematically speaking, we are building the model $\log(\text{Individuals}) = \beta_0 + \beta_1 \text{Age} + \epsilon$.

> data(Frog_survival)
> plot(Frog_survival$Individuals,Frog_survival$Age) # Output suppressed
> summary(FS1 <- lm(Individuals~Age,data=Frog_survival))
Call:
lm(formula = Individuals ~ Age, data = Frog_survival)
Residuals:
    Min      1Q  Median      3Q     Max
-3017.5 -1693.4  -381.3   943.6  5280.2
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4573.2     2199.2   2.079   0.0828 .
Age           -760.3      435.5  -1.746   0.1314
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 2822 on 6 degrees of freedom
Multiple R-squared:  0.3369, Adjusted R-squared:  0.2264
F-statistic: 3.048 on 1 and 6 DF,  p-value: 0.1314
> summary(FS2 <- lm(log(Individuals)~Age,data=Frog_survival))
Call:
lm(formula = log(Individuals) ~ Age, data = Frog_survival)
Residuals:
    Min      1Q  Median      3Q     Max
-1.9159 -0.5906 -0.1267  0.4820  2.7690
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2212     1.1708   6.168 0.000834 ***
Age          -0.8750     0.2318  -3.774 0.009247 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.503 on 6 degrees of freedom
Multiple R-squared:  0.7036, Adjusted R-squared:  0.6542
F-statistic: 14.24 on 1 and 6 DF,  p-value: 0.009247

The transformation leads to an increase in the adjusted $R^2$ to 0.6542 and the variable Age is also seen to be significant. Thus, the logarithmic transformation helped in achieving linearity for this dataset.□

12.7.2 Variance Stabilization

A transformation may also be used to stabilize the variance of the error term. If the error variance is not constant across the different $x$ values, we say that the error is heteroscedastic. The problem of heteroscedasticity is usually detected through residual plots.

Example 12.7.2. Injuries in Airflights

Injuries in airflights, road accidents, etc., are instances of rare occurrences, which are appropriately modeled by a Poisson distribution. This data is adapted from Table 6.6 of Chatterjee and Hadi (2006). Two models, before and after the transformation, are fitted, and we check whether the transformation leads to a reduction in the variance.

> data(flight)
> attach(flight)
> names(flight)
[1] "Injury_Incidents" "Total_Flights"
> injurylm <- lm(Injury_Incidents~Total_Flights)
> injurysqrtlm <- lm(sqrt(Injury_Incidents)~Total_Flights)
> summary(injurylm)
Call:
lm(formula = Injury_Incidents ~ Total_Flights)
Residuals:
    Min      1Q  Median      3Q     Max
-5.3351 -2.1281  0.1605  2.2670  5.6382
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.1402     3.1412  -0.045   0.9657
Total_Flights  64.9755    25.1959   2.579   0.0365 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 4.201 on 7 degrees of freedom
Multiple R-squared: 0.4872, Adjusted R-squared: 0.4139
F-statistic:  6.65 on 1 and 7 DF,  p-value: 0.03654
> summary(injurysqrtlm)
Call:
lm(formula = sqrt(Injury_Incidents) ~ Total_Flights)
Residuals:
    Min      1Q  Median      3Q     Max
-0.9690 -0.7655  0.1906  0.5874  1.0211
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.1692     0.5783   2.022   0.0829 .
Total_Flights  11.8564     4.6382   2.556   0.0378 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.7733 on 7 degrees of freedom
Multiple R-squared: 0.4828, Adjusted R-squared: 0.4089
F-statistic: 6.535 on 1 and 7 DF,  p-value: 0.03776
> par(mfrow=c(1,2))
> plot(flight$Total_Flights, residuals(injurylm), xlab="Total Flights",ylab="Residuals")
> plot(flight$Total_Flights, residuals(injurysqrtlm), xlab="Total Flights",ylab="Residuals
+ Under Square Root Transformation")

It may be seen from the model summaries that the residual standard error is 4.201 for the original data, that is, for the simple linear regression model. The square-root transformation brings the residual standard error down to 0.7733, though this figure is on the square-root scale. Note that the $R^2$ value remains almost the same. The residual plots are not reproduced here. The residual plot for injurylm shows that the variance of the residuals increases with an increase in the $x$-value. However, the same residual plot for the injurysqrtlm model shows that the variability of the error term remains constant across the $x$-values.□

Technically, we had built the model $\sqrt{y} = \beta_0 + \beta_1 x + \epsilon$. The choice of transformation is not obvious. Moreover, it is difficult to say whether the power of $1/2$ is appropriate or whether some other positive number would do better. Also, it is not always easy to choose between the power transformation and the logarithmic transformation. A more generic approach will be considered in the next subsection.
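Why the square root is a natural candidate for count data can be illustrated by simulation. In the sketch below, which uses simulated Poisson counts and hypothetical object names, the mean grows with x and so does the spread of the raw residuals, whereas the spread on the square-root scale stays roughly constant.

```r
set.seed(42)
x <- rep(1:20, each = 50)
y <- rpois(length(x), lambda = 5 * x)    # Poisson counts: the variance equals the mean

raw_fit  <- lm(y ~ x)
sqrt_fit <- lm(sqrt(y) ~ x)

# Spread of the residuals within each group of x values
sd_raw  <- tapply(resid(raw_fit),  x, sd)
sd_sqrt <- tapply(resid(sqrt_fit), x, sd)

# Ratio of the spread at the largest x to that at the smallest x
sd_raw[20] / sd_raw[1]
sd_sqrt[20] / sd_sqrt[1]
```

The spread is compared within groups so that any curvature left in the fitted mean does not contaminate the comparison.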

12.7.3 Power Transformation

In the previous two subsections, we were helped by the logarithmic and square-root transformations. Box and Cox (1964) proposed a general transformation, called the power transformation. This method is useful when we do not have theoretical or empirical guidelines for an appropriate transformation. The Box-Cox transformation is given by

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda \dot{y}^{\lambda - 1}}, & \lambda \neq 0, \\ \dot{y} \ln y, & \lambda = 0, \end{cases} \tag{12.71}$$

where $\dot{y} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln y_i\right)$ is the geometric mean of the observations. For details, refer to Section 5.4.1 of Montgomery, et al. (2003). The linear regression model that will be built is the following:

$$y^{(\lambda)} = X\beta + \epsilon. \tag{12.72}$$

For a choice of $\lambda$ away from 0, we achieve variance stabilization, whereas for a value closer to 0, we obtain an approximate logarithmic transformation. The exact inference procedure for the choice of $\lambda$ cannot be taken up here, and we simply note that the MLE technique is used for this purpose.
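The workflow can be sketched on the built-in trees data, which is an assumption made only for illustration, as are the object names. The boxcox function of the MASS package returns the profile log-likelihood over a grid of power values; the maximizing grid value is then used in the scaled transformation. The geometric mean is computed on the log scale, which avoids overflow of the product for long vectors.

```r
library(MASS)                                # provides boxcox()

fit <- lm(Volume ~ Height + Girth, data = trees)   # 'trees' ships with base R
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]          # grid value maximizing the profile log-likelihood
lambda_hat

y <- trees$Volume
ygeom <- exp(mean(log(y)))                   # geometric mean, computed on the log scale
ybc <- if (abs(lambda_hat) > 1e-8) {
  (y^lambda_hat - 1) / (lambda_hat * ygeom^(lambda_hat - 1))
} else {
  ygeom * log(y)                             # limiting case as the power tends to 0
}

bc_fit <- lm(ybc ~ Height + Girth, data = trees)
summary(bc_fit)$r.squared
```

Because of the scaling by the geometric mean, the residual sums of squares of models fitted with different power values are comparable with one another.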

Example 12.7.3. The Box-Cox Transformation for Viscosity Dataset

The goal of this study is to find the impact of temperature on the viscosity of toluene-tetralin blends. This dataset is available in Problem 1, Chapter 5, of Montgomery, et al. (2003). First, the viscosity is modeled using simple linear regression. By using the boxcox function from the MASS package, we can find the MLE of $\lambda$ and then use it to obtain $y^{(\lambda)}$ , ybc in the R program. Using the estimated $\lambda$ , a new linear model is built for ybc and Temperature.

> library(MASS)
> data(viscos)
> names(viscos)
[1] "Temperature" "Viscosity"
> viscoslm <- lm(Viscosity~Temperature,data=viscos)
> par(mfrow=c(1,3))
> plot(viscoslm$fitted.values,viscoslm$residuals,
+ xlab="Fitted Values",ylab="Residuals",col="red")
> bclist <- boxcox(viscoslm)
> mlelambda <- bclist$x[which(bclist$y==max(bclist$y))]
> mlelambda
[1] -0.7474747
> ygeom <- prod(viscos$Viscosity)^{1/length(viscos$Viscosity)}
> ybc <- (viscos$Viscosity^{mlelambda}-1)/(mlelambda*ygeom^{mlelambda-1})
> viscosbclm <- lm(ybc~viscos$Temperature)
> plot(viscosbclm$fitted.values,viscosbclm$residuals,
+ xlab="Fitted Values", ylab="Residuals",col="red")
> summary(viscoslm)
Call:
lm(formula = Viscosity ~ Temperature, data = viscos)
Residuals:
      Min        1Q    Median        3Q       Max
-0.043955 -0.035863 -0.009305  0.019900  0.069559
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.2815107  0.0468683   27.34 1.58e-07 ***
Temperature -0.0087578  0.0007284  -12.02 2.01e-05 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.04743 on 6 degrees of freedom
Multiple R-squared:  0.9602, Adjusted R-squared:  0.9535
F-statistic: 144.6 on 1 and 6 DF,  p-value: 2.007e-05
> summary(viscosbclm)
Call:
lm(formula = ybc ~ viscos$Temperature)
Residuals:
      Min        1Q    Median        3Q       Max
-0.011929 -0.001858  0.000970  0.001953  0.010282
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         0.2781401  0.0067156   41.42 1.33e-08 ***
viscos$Temperature -0.0083670  0.0001044  -80.17 2.54e-10 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.006797 on 6 degrees of freedom
Multiple R-squared:  0.9991, Adjusted R-squared:  0.9989
F-statistic:  6428 on 1 and 6 DF,  p-value: 2.536e-10

The boxcox function from the MASS package returns the profile log-likelihood over a grid of $\lambda$ values, as required in the analysis of the Box-Cox transformation. The which function is used to pick the value of $\lambda$ that maximizes it. Since the estimate is mlelambda=-0.7474747, a negative power that is close to a reciprocal transformation and not close to zero, we use the power transformation $y^{(\lambda)}$ through the R objects ygeom and ybc to finally set up the linear regression model viscosbclm. Note the reduction in the residual standard error, from 0.04743 to 0.006797, and also the distribution of the residuals against the fitted values in Figure 12.11.□

Figure 12.11 Box-Cox Transformation for the Viscosity Data
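The grid search for the power can be complemented by an approximate interval. The following is a sketch on simulated data, not on the viscosity data: the objects x, y, fit, mle_lambda, and ci_lambda are hypothetical names, and the interval uses the usual likelihood-ratio cut-off of qchisq(0.95, 1)/2 below the maximum of the profile log-likelihood returned by boxcox.

```r
library(MASS)

# Simulated data (hypothetical): log(y) is linear in x, so a power near 0 is expected
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- exp(0.3 * x + rnorm(50, sd = 0.2))
fit <- lm(y ~ x)

# Profile log-likelihood on a fine grid, without drawing the plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Grid value maximizing the log-likelihood
mle_lambda <- bc$x[which.max(bc$y)]

# Approximate 95% interval: grid values whose log-likelihood lies within
# qchisq(0.95, 1)/2 of the maximum
ci_lambda <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
mle_lambda
ci_lambda
```

If the interval covers a conventional value such as 0, 1/2, or -1, that simpler transformation is often preferred in practice because it is easier to interpret.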

$c12-math-0511$

12.8 Model Selection

Consider a linear regression model with ten covariates. The total number of possible models is then $c12-math-0512$ , which is too large for each model to be investigated in depth. There are further complications. Reconsider Example 12.4.5. It is seen from the output of summary(crime_rate_lm) that the variables S, Ex0, Ex1, LF, M, N, NW, U1, and W are all insignificant according to their $p$-values, as given in the column Pr(>|t|). What do we do with such variables? If we have to discard them, what should be the procedure? For instance, it may be recalled from Example 12.6.1 that if the variable jitter(GBH) is dropped, the variable GBH will be significant. Hence, there is a possibility that if we drop one of the variables from crime_rate_lm, some other variable may turn out to be significant.

Model selection is thus a necessity rather than a luxury. Any text on regression models will invariably discuss this important technique. We will see the rationale behind it with an example.

Example 12.8.1. The Prostate Cancer Example

In the prostate cancer study, we are interested in understanding the relationship between the logarithm of prostate specific antigen and the predictors, which include the logarithms of cancer volume, prostate weight, benign prostatic hyperplasia amount, and capsular penetration, as well as age, seminal vesicle invasion, Gleason score, and the percentage of Gleason scores 4 or 5. In this dataset, we have $n = 97$ observations and $p = 8$ covariates. Clearly, the total number of possible regression models is $c12-math-0516$ . We will now obtain a plot of the residual sum of squares for each of these models. Producing this plot is a good programming exercise: the reader should look at Figure 12.12 first and then make a serious attempt at obtaining it independently. The code is provided here in the hope that the reader will use it only as a last resort.

> # The Need of Model Selection
> library(faraway)
> data(prostate)
> lspa <- prostate[,9]      # response: log of prostate specific antigen
> covariates <- prostate[,-9]
> covnames <- names(covariates)
> p <- length(covnames); n <- length(lspa)
> # One row per non-empty subset of the p covariates
> RSSmatrix <- matrix(nrow=sum(choose(p,p:1)), ncol=11)
> currrow <- 0
> for(i in 1:p) {
+  temp <- choose(p,i)
+  tempmat <- t(combn(1:p,i))
+  for(j in 1:temp) {
+   currrow <- currrow+1
+   RSSmatrix[currrow,1] <- currrow
+   RSSmatrix[currrow,tempmat[j,]+1] <- covnames[tempmat[j,]]
+   templm <- lm(lspa~.,subset(covariates,select = tempmat[j,]))
+   RSSmatrix[currrow,10] <- sum(templm$residuals^2)
+   RSSmatrix[currrow,11] <- i
+   }
+   }
> # RSSmatrix also stores covariate names, so it is a character matrix;
> # convert the two numeric columns before plotting
> plot(as.numeric(RSSmatrix[,11]),as.numeric(RSSmatrix[,10]),
+ xlab="Number of Predictors",ylab="Residual Sum of Squares")

Since the R program is meant to be consulted only as a last resort, its details are not explained here. Figure 12.12 clearly shows that as the number of covariates in the model increases, the smallest attainable RSS decreases. However, the reader can verify that the model with the least RSS for a fixed number of covariates need not contain all the covariates of the best model with one covariate fewer.□

Figure 12.12 An RSS Plot for all Possible Regression Models
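The behaviour of the best model of each size can be explored on a small scale. The following is a sketch on simulated data; the names X, y, best, and best_rss are illustrative and not from the text. For each model size it records the covariates of the model with the smallest RSS, so that the selected subsets can be compared across sizes.

```r
# Simulated data (hypothetical): four candidate covariates, two of them active
set.seed(42)
n <- 60; p <- 4
X <- as.data.frame(matrix(rnorm(n * p), n, p))   # columns V1, ..., V4
y <- 1 + 2 * X$V1 - X$V3 + rnorm(n)

# For each size k, fit every subset of k covariates and keep the best one
best <- lapply(1:p, function(k) {
  subsets <- combn(p, k, simplify = FALSE)
  rss <- sapply(subsets, function(s) {
    sum(residuals(lm(y ~ ., data = X[, s, drop = FALSE]))^2)
  })
  list(vars = names(X)[subsets[[which.min(rss)]]], rss = min(rss))
})

best_rss <- sapply(best, `[[`, "rss")
lapply(best, `[[`, "vars")   # best subset for sizes 1, ..., p
best_rss                     # never increases with the model size
```

The smallest RSS cannot increase with the model size, since adding any covariate to the best model of a given size never raises its RSS. Whether the best subsets are nested, on the other hand, depends on the data.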

A number of methods for model selection are available, including:

Backward elimination
Forward selection
Stepwise regression.

12.8.1 Backward Elimination

The backward elimination method is the simplest of all variable selection procedures and can be easily implemented without a special function/package. In situations where there is a complex hierarchy, backward elimination can be run manually while taking account of what variables are eligible for removal. This method starts with a model containing all the explanatory variables and eliminates the variables one at a time, at each stage choosing the variable for exclusion as the one leading to the smallest decrease in the regression sum of squares. An $F$-type statistic is used to judge when further exclusions would represent a significant deterioration in the model. The algorithm of backward elimination is as follows:

1. Start with all the predictors in the model.
2. Remove the predictor with the highest $p$-value greater than $\alpha$ critical.
3. Refit the model and go to 2.
4. Stop when all $p$-values are less than $\alpha$ critical.

The alpha critical is sometimes called the p-to-remove and does not have to be 5%. If prediction performance is the goal, then a 15 to 20% cut-off may work best, although methods designed more directly for optimal prediction should be preferred.

Example 12.8.2. The US Crime Data

Example 12.6.2. Contd. The backward elimination method is illustrated for the fitted model crime_rate_lm2. At each stage we remove the predictor with the largest $p$-value greater than 0.05. Since the $p$-value of the regressor NW is the largest in the output of summary(crime_rate_lm2), we eliminate it first. The R function update will be used to refit the model with the required modification.

> crime_rate_lm3 <- update(crime_rate_lm2,.~.-NW)
> summary(crime_rate_lm3)
Call:
lm(formula = R ~ Age + S + Ed + Ex0 + LF + M + N + U1 + U2 +
    W + X, data = usc)
Residuals:
    Min      1Q  Median      3Q     Max
-38.755 -13.587   1.089  13.242  48.921
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -704.12001  149.69012  -4.704 3.91e-05 ***
Age            1.06420    0.38919   2.734  0.00974 **
S             -7.88954   12.76942  -0.618  0.54068
Ed             1.72173    0.61413   2.804  0.00819 **
Ex0            1.00952    0.21793   4.632 4.85e-05 ***
LF            -0.01727    0.13814  -0.125  0.90123
M              0.16308    0.19819   0.823  0.41615
N             -0.03886    0.12634  -0.308  0.76021
U1            -0.58496    0.42003  -1.393  0.17251
U2             1.81921    0.83415   2.181  0.03600 *
W              0.13516    0.10070   1.342  0.18819
X              0.80399    0.22856   3.518  0.00123 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 21.41 on 35 degrees of freedom
Multiple R-squared:  0.7669, Adjusted R-squared:  0.6936
F-statistic: 10.47 on 11 and 35 DF,  p-value: 4.008e-08

Since we have to remove the NW variable from the crime_rate_lm2 fitted model, the option .~.-NW achieves the required modification in the update function. Now the $p$-value of LF is the highest in the current model, and hence LF is eliminated using the update function once again.

> crime_rate_lm4 <- update(crime_rate_lm3,.~.-LF)
> summary(crime_rate_lm4)
Call:
lm(formula = R ~ Age + S + Ed + Ex0 + M + N + U1 + U2 + W + X,
    data = usc)
Residuals:
    Min      1Q  Median      3Q     Max
-38.595 -13.628   0.876  12.927  48.991
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -700.19857  144.35146  -4.851 2.37e-05 ***
Age            1.06640    0.38344   2.781 0.008565 **
S             -7.18568   11.30331  -0.636 0.528984
Ed             1.70184    0.58499   2.909 0.006176 **
Ex0            1.01444    0.21139   4.799 2.77e-05 ***
M              0.15059    0.16878   0.892 0.378199
N             -0.04146    0.12290  -0.337 0.737825
U1            -0.56349    0.37804  -1.491 0.144793
U2             1.81277    0.82110   2.208 0.033720 *
W              0.13417    0.09901   1.355 0.183821
X              0.79674    0.21804   3.654 0.000816 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 21.11 on 36 degrees of freedom
Multiple R-squared:  0.7668, Adjusted R-squared:  0.702
F-statistic: 11.84 on 10 and 36 DF,  p-value: 1.125e-08

Now, the variables which will leave the model one after the other are N, S, M, U1, and W. Only the final summary is shown.

> crime_rate_lm5 <- update(crime_rate_lm4,.~.-N)
> summary(crime_rate_lm5)
> crime_rate_lm6 <- update(crime_rate_lm5,.~.-S)
> summary(crime_rate_lm6)
> crime_rate_lm7 <- update(crime_rate_lm6,.~.-M)
> summary(crime_rate_lm7)
> crime_rate_lm8 <- update(crime_rate_lm7,.~.-U1)
> summary(crime_rate_lm8)
> crime_rate_lm9 <- update(crime_rate_lm8,.~.-W)
> summary(crime_rate_lm9)
Call:
lm(formula = R ~ Age + Ed + Ex0 + U2 + X, data = usc)
Residuals:
    Min      1Q  Median      3Q     Max
-45.344  -9.859  -1.807  10.603  62.964
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -524.3743    95.1156  -5.513 2.13e-06 ***
Age            1.0198     0.3532   2.887 0.006175 **
Ed             2.0308     0.4742   4.283 0.000109 ***
Ex0            1.2331     0.1416   8.706 7.26e-11 ***
U2             0.9136     0.4341   2.105 0.041496 *
X              0.6349     0.1468   4.324 9.56e-05 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 21.3 on 41 degrees of freedom
Multiple R-squared:  0.7296, Adjusted R-squared:  0.6967
F-statistic: 22.13 on 5 and 41 DF,  p-value: 1.105e-10

Since none of the $p$-values associated with the covariates in crime_rate_lm9 is greater than 0.05, we now stop the backward elimination process. Notice that the $R^2$ of 0.7692 for the full model is reduced only slightly to 0.7296 in the final model. Thus, the removal of eight predictors causes only a minor reduction in fit.□

The manual procedure in the above example is tedious and is better automated. A customized R function which achieves the same result is given next.

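# The Backward Selection Methodology
# p-values of the covariates; the intercept (row 1) is never a candidate
pvalueslm <- function(fit) {summary(fit)$coefficients[-1,4]}
# Assumes numeric covariates, that is, one coefficient per model term
backwardlm <- function(fit,criticalalpha) {
 fit2 <- fit
 # Drop the covariate with the largest p-value until all are below the cut-off
 while(max(pvalueslm(fit2))>criticalalpha) {
  worst <- attr(fit2$terms,"term.labels")[which.max(pvalueslm(fit2))]
  fit2 <- update(fit2,as.formula(paste(".~.-",worst,sep="")))
 }
 return(fit2)
 }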
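The same elimination loop can also be sketched with the built-in drop1 function, which reports an F test for removing each term in turn. The version below is a sketch on simulated data: the function name backward_drop1 and the data frame dat are hypothetical, and this is an alternative formulation rather than the customized function given above.

```r
# Simulated data (hypothetical): only x1 and x2 affect y
set.seed(7)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

backward_drop1 <- function(fit, criticalalpha = 0.05) {
  repeat {
    tab <- drop1(fit, test = "F")     # one row per term, plus a <none> row
    pvals <- tab[["Pr(>F)"]][-1]      # skip the <none> row
    candidates <- rownames(tab)[-1]
    if (length(pvals) == 0 || max(pvals) <= criticalalpha) break
    worst <- candidates[which.max(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
final_fit <- backward_drop1(full_fit)
formula(final_fit)
```

Because drop1 tests whole terms, this formulation also handles factor covariates, for which a single term contributes several coefficients.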

12.8.2 Forward and Stepwise Selection

The forward selection method reverses the backward method. This method starts with a model containing none of the explanatory variables and then considers variables one by one for inclusion. At each step, the variable added is the one that results in the biggest increase in the regression sum of squares. An $F$-type statistic is used to judge when further additions would not represent a significant improvement in the model.

1. Start with no variables in the model.
2. For all predictors not in the model, check their $p$-value if they are added to the model. Choose the one with the lowest $p$-value less than alpha critical.
3. Continue until no new predictors can be added.
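The forward steps above can be sketched with the built-in add1 function, which reports an F test for adding each candidate term. This is a sketch on simulated data, with the hypothetical names forward_add1 and dat; it is not the implementation referred to in the text.

```r
# Simulated data (hypothetical): only x1 and x2 affect y
set.seed(11)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

forward_add1 <- function(fit, upper, criticalalpha = 0.05) {
  all_terms <- attr(terms(upper), "term.labels")
  repeat {
    # Stop when every candidate is already in the model
    remaining <- setdiff(all_terms, attr(terms(fit), "term.labels"))
    if (length(remaining) == 0) break
    tab <- add1(fit, scope = upper, test = "F")   # <none> row, then candidates
    pvals <- tab[["Pr(>F)"]][-1]
    candidates <- rownames(tab)[-1]
    if (min(pvals) >= criticalalpha) break
    best <- candidates[which.min(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . +", best)))
  }
  fit
}

null_fit <- lm(y ~ 1, data = dat)
final_fit <- forward_add1(null_fit, upper = ~ x1 + x2 + x3 + x4)
formula(final_fit)
```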

For a function similar to backwardlm, refer to Chapter 6 of Tattar (2013) for an implementation of the forward selection algorithm. Stepwise regression is a combination of forward selection and backward elimination. This addresses the situation where variables are added or removed early in the process and we want to change our mind about them later. Starting with no variables in the model, variables are added as with the forward selection method. Here, however, with each addition of a variable, a backward elimination process is considered to assess whether variables entered earlier might now be removed, because they no longer contribute significantly to the model.

We will now look at another criterion for model selection: the Akaike Information Criterion, abbreviated as AIC. Let the log-likelihood function for a fitted regression model with $p$ covariates be denoted by $\log L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid y)$ . The total number of estimated parameters is denoted by $K$ . The AIC for the regression model is then given by

$$\mathrm{AIC} = 2\left[-\log L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid y) + K\right], \qquad (12.73)$$

where $K = p + 2$ . The term $2K$ is referred to as the penalty term. The model which has the least AIC value is considered the best model. For more details, refer to Sheather (2009). The use of AIC for forward and stepwise selection will be illustrated in the following.
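The AIC can be computed by hand and compared with R's own functions. The sketch below uses simulated data and hypothetical object names, and assumes the standard form of the criterion: twice the sum of the negative maximized Gaussian log-likelihood and the number of estimated parameters, taken as the number of covariates plus 2.

```r
# Simulated data (hypothetical)
set.seed(3)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

p <- 2                       # number of covariates
K <- p + 2                   # p slopes, the intercept, and the error variance
rss <- sum(residuals(fit)^2)

# Maximized Gaussian log-likelihood of the linear model
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
aic_by_hand <- 2 * (-loglik + K)
c(by_hand = aic_by_hand, AIC = AIC(fit))

# step() works on the scale of extractAIC(), which drops additive constants
aic_step_scale <- n * log(rss / n) + 2 * (p + 1)
c(by_hand = aic_step_scale, extractAIC = extractAIC(fit)[2])
```

The two scales differ only by a constant that depends on the sample size, so they rank models identically. For the full crime model in the example that follows, with 47 observations, 13 covariates, and an RSS of 15879, the second expression gives approximately 301.66, which is the starting AIC reported by step.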

Example 12.8.3. The US Crime Data

Let us continue the use of the US crime dataset. The R program for stepwise selection is in fact very short.

> step(crime_rate_lm,direction="both")
Start:  AIC=301.66
R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 + W +
    X
       Df Sum of Sq   RSS    AIC
- NW    1       6.1 15885 299.68
- LF    1      34.4 15913 299.76
- N     1      48.9 15928 299.81
- S     1     149.4 16028 300.10
- Ex1   1     162.3 16041 300.14
- M     1     296.5 16175 300.53
<none>              15879 301.66
- W     1     810.6 16689 302.00
- U1    1     911.5 16790 302.29
- Ex0   1    1109.8 16988 302.84
- U2    1    2108.8 17988 305.52
- Age   1    2911.6 18790 307.57
- Ed    1    3700.5 19579 309.51
- X     1    5474.2 21353 313.58
Step:  AIC=299.68
...
Step:  AIC=291.83
R ~ Age + Ed + Ex0 + U2 + W + X
       Df Sum of Sq   RSS    AIC
<none>              17351 291.83
+ U1    1     408.6 16942 292.71
- W     1    1252.6 18604 293.11
+ Ex1   1     251.2 17100 293.14
+ LF    1     230.7 17120 293.20
+ N     1     189.6 17162 293.31
+ M     1     177.8 17173 293.35
+ S     1      71.0 17280 293.64
+ NW    1      59.2 17292 293.67
- U2    1    1628.7 18980 294.05
- Age   1    4461.0 21812 300.58
- Ed    1    6214.7 23566 304.22
- X     1    8932.3 26283 309.35
- Ex0   1   15596.5 32948 319.97
Call:
lm(formula = R ~ Age + Ed + Ex0 + U2 + W + X, data = usc)
Coefficients:
(Intercept)          Age        W X
  -618.5028       1.1252   0.1596  0.8236

Like summary, plot, and predict, the step function can be applied to many kinds of fitted regression models. The output shows that the model selected by stepwise regression retains the variables Age, Ed, Ex0, U2, W, and X.□
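Forward selection by AIC uses the same step function, started from the intercept-only model and given the largest model to consider. The sketch below runs on simulated data; the names dat, null_fit, and forward_fit are hypothetical.

```r
# Simulated data (hypothetical): only x1 and x2 affect y
set.seed(5)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Start from the intercept-only model and let step() add terms by AIC
null_fit <- lm(y ~ 1, data = dat)
forward_fit <- step(null_fit,
                    scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
                    direction = "forward", trace = 0)
formula(forward_fit)
```

Setting trace = 0 suppresses the step-by-step tables. With direction = "forward" the scope argument is essential, since without an upper model there is nothing for step to add.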

$c12-math-0538$

12.9 Further Reading

In this section we consider some of the regression books which have been loosely classified into different sections.

12.9.1 Early Classics

Draper and Smith (1966–98) is a treatise on applied regression analysis. Chatterjee and Hadi (1977–2006) and Chatterjee and Hadi (1988) are two excellent companions for regression analysis. Bapat (2000) builds linear models using linear algebra and the book gives the reader an in-depth knowledge of the necessary theory. Christensen (2011) develops a projective approach for linear models. For firm foundations in linear models, the reader may also use Rao, et al. (2008). Searle (1971) is a classic book, which is still preferred by some readers.

12.9.2 Industrial Applications

Montgomery, et al. (2003) and Kutner, et al. (2005) provide comprehensive coverages of linear models with dedicated emphasis on industrial applications.

12.9.3 Regression Details

Belsley, Kuh and Welsch (1980), Fox (1991), Cook (1998), Cook and Weisberg (1982), Cook and Weisberg (1994), and Cook and Weisberg (1999) are the monographs which have details about regression diagnostics. For a robust regression model, the reader will find a very useful source in Rousseeuw and Leroy (1987).

12.9.4 Modern Regression Texts

Andersen and Skovgaard (2010) consider many variants of regression models which have linear predictors. Gelman and Hill (2007) develop hierarchical regression models. Freedman (2009) is an instant classic which lays more emphasis on matrix algebra. Sengupta and Rao (2003), Clarke (2008), Seber and Lee (2003), and Rencher and Schaalje (2008) are also useful accounts of the linear models.

12.9.5 R for Regression

Fox (2002) is an early text on the use of R software for regression analysis. Faraway (2002) is an open source book which has detailed R programs for linear models.

12.10 Complements, Problems, and Programs

Problem 12.1 Fit a simple linear regression model for the Galton dataset as seen in Example 4.5.1. Compare the values of the regression coefficients of the linear regression model for this dataset with the previously obtained resistant line coefficients.
Problem 12.2 Verify that Equation 12.19 is satisfied for Example 12.2.3. Fit the resistant line model for this dataset, and verify whether the ANOVA decomposition holds for the fitted values obtained using the resistant line model.
Problem 12.3 Extend the concept of $c12-math-0539$ and $c12-math-0540$ for the resistant line model. Create an R function which will extract these two measures for a fitted resistant line model and obtain these values for the Galton dataset, rp, and tc.
Problem 12.4 The Signif. codes as obtained by summary(lm) may be easily customized in R to use your own cut-off points, and symbols too. There are two elements to this: first, the cut-off points for the $p$-values, whose default settings are cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1), and second, the symbols in symbols = c("***", "**", "*", ".", " "). Change these default settings to, say, symbols = c("$$$", "$$", "$", ".", " ") by first running fix(printCoefmat). Edit printCoefmat and save it. The changes in this object are then applied in an R session by running the code assignInNamespace("printCoefmat", printCoefmat, "stats") at the console. Customize your Signif. codes and complete the program!
Problem 12.5 The model validation in Example 12.2.6 for the Toluca Company dataset has been carried out using the residuals. The semi-Studentized residuals also play a critical role in determining departures from the assumptions of the linear regression model. Obtain the six plots as in that example with the semi-Studentized residuals.
Problem 12.6 The model validation aspects need to be checked for the Rocket propellant problem, Example 12.2.1. Complete the program for the fitted simple linear regression model and draw the appropriate conclusions.
Problem 12.7 In Example 12.4.3, change the range of the variables x1 and x2 to x1 <- rep(seq(-10,10,0.5),100) and x2 <- rep(seq(-10,10,0.5),each=100) and redo the three-dimensional plot, especially for the third linear regression model. Similarly, for the contour plot of the same model, change the variable ranges to x1=x2=seq(from=-5,to=5,by=0.2) and redraw the contour plot. What are your typical observations?
Problem 12.8 For the fitted linear model crime_rate_lm, using the usc dataset, obtain the plot of residuals against the fitted values.
Problem 12.9 Verify the properties of the hat matrix $c12-math-0542$ given in Equation 12.37 for the fitted object crime_rate_lm, or any other fitted multiple linear regression model of your choice.
Problem 12.10 Using self-defined functions for $c12-math-0543$ and $c12-math-0544$ , as given in Equations 12.66 and 12.65, say my_dffits and my_dfbetas, compute the values for an ailm fitted object and compare the results with the R functions dffits and dfbetas.
Problem 12.11 The VIF given in Equation 12.67 for a covariate requires computation of $c12-math-0545$ , as obtained in the regression model when the covariate is an output and other covariates are input variables for it. Thus, using 1/(1-summary(lm(xi~x1+...+xi-1+xi+1+...+xp))$r.squared), the VIF of the covariate $c12-math-0546$ may be obtained. Verify the VIFs obtained in Example 12.6.2.
Problem 12.12 Identify if the multicollinearity problems exist for the fitted ailm object using (i) VIF method, and (ii) eigen system analysis.
Problem 12.13 Carry out the Box-Cox transformation method for the bacteria_study and compare it with the result in Example 12.7.1 of the log transformation technique, bacterialoglm.
Problem 12.14 For the prostate cancer problem discussed in Example 12.8.1, find the best possible linear regression model using (i) stepwise, (ii) forward, and (iii) backward selection techniques.
