Chapter 16

Tests in Regression Analysis

Regression analysis investigates and models the relationship between variables. A linear relationship is assumed between a dependent or response variable $Y$ of interest and one or several independent, predictor or regressor variables. We present tests on regression parameters in simple and multiple linear regression analysis. The tests cover hypotheses on the value of individual regression parameters as well as tests for significance of regression, where the hypothesis states that none of the regressor variables has a linear effect on the response.

16.1 Simple linear regression

Simple linear regression relates a response variable $Y$ to the given outcome $x$ of a single regressor variable by assuming the relation $Y=\beta_0+\beta_1 x+\epsilon$, which is linear in the unknown coefficients or parameters $\beta_0$ and $\beta_1$. Further, $\epsilon$ is an error term which models the deviation of the observed values from the linear relationship. In two-dimensional space this relation describes a straight line, which is why simple linear regression is also called straight line regression. The value $x$ of the regressor variable is fixed or measured without error. If the regressor variable is a random variable $X$, the model is commonly understood as modeling the response $Y$ conditional on the outcome $X=x$. To analyze whether the regressor has an influence on the response $Y$, it is tested if the slope $\beta_1$ of the regression line differs from zero. Other tests treat the intercept $\beta_0$.
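To make the model concrete, here is a minimal R sketch that simulates data from a straight-line model and fits it with lm(); the sample size, parameter values and variable names are illustrative assumptions, not taken from the student example used later in this chapter.

# Minimal sketch: simulate data from y_i = beta_0 + beta_1*x_i + eps_i
# (all values below are illustrative assumptions)
set.seed(1)
n <- 20
x <- runif(n, 160, 190)            # regressor values, e.g. heights in cm
eps <- rnorm(n, mean = 0, sd = 10) # Gaussian error term with mean 0
y <- -50 + 0.7 * x + eps           # response generated from the linear model
fit <- lm(y ~ x)                   # least squares fit of the straight line
summary(fit)$coefficients          # estimates, standard errors, t and p-values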

16.1.1 Test on the slope

Description: Tests if the regression coefficient $\beta_1$ of a simple linear regression differs from a value $\beta_{10}$.
Assumptions:
  • A sample of $n$ pairs $(x_1,y_1),\ldots,(x_n,y_n)$ is given.
  • The simple linear regression model for the sample is stated as $y_i=\beta_0+\beta_1 x_i+\epsilon_i$, $i=1,\ldots,n$.
  • The error term $\epsilon_i$ is a random variable which is Gaussian distributed with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i\sim N(0,\sigma^2)$ for all $i=1,\ldots,n$. It further holds that $Cov(\epsilon_i,\epsilon_j)=0$ for all $i\neq j$.
Hypotheses: (A) $H_0: \beta_1=\beta_{10}$ vs $H_1: \beta_1\neq\beta_{10}$
(B) $H_0: \beta_1\leq\beta_{10}$ vs $H_1: \beta_1>\beta_{10}$
(C) $H_0: \beta_1\geq\beta_{10}$ vs $H_1: \beta_1<\beta_{10}$
Test statistic: $T=\dfrac{\hat\beta_1-\beta_{10}}{\sqrt{\hat\sigma^2/S_{xx}}}$
with $\hat\beta_1=S_{xy}/S_{xx}$, $S_{xy}=\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)$,
$S_{xx}=\sum_{i=1}^n (x_i-\bar x)^2$ and $\hat\sigma^2=\frac{1}{n-2}\sum_{i=1}^n (y_i-\hat\beta_0-\hat\beta_1 x_i)^2$, where $\hat\beta_0=\bar y-\hat\beta_1\bar x$
Test decision: Reject $H_0$ if for the observed value $t$ of $T$
(A) $t<t_{\alpha/2,n-2}$ or $t>t_{1-\alpha/2,n-2}$
(B) $t>t_{1-\alpha,n-2}$
(C) $t<t_{\alpha,n-2}$
p-values: (A) $p=2\,P(T\geq|t|)$
(B) $p=P(T\geq t)$
(C) $p=P(T\leq t)$
Annotations:
  • The test statistic $T$ follows a t-distribution with $n-2$ degrees of freedom.
  • $t_{\alpha,n-2}$ is the $\alpha$-quantile of the t-distribution with $n-2$ degrees of freedom.
  • Of special interest is the test problem $H_0:\beta_1=0$ vs $H_1:\beta_1\neq 0$; the test is then also called a test for significance of regression. If $H_0$ cannot be rejected, this indicates that there is no linear relationship between $X$ and $Y$: either $X$ has no or little effect on $Y$, or the true relationship is not linear (Montgomery et al. 2006, p. 23).
  • Alternatively the squared test statistic $T^2$ can be used, which follows an F-distribution with $1$ and $n-2$ degrees of freedom.
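As an illustration of the above formulas, the following R sketch computes the slope test by hand for the hypothesized value $\beta_{10}=0.5$ used in the example below; it assumes the students data frame with the variables weight and height from Table A.6.

# Sketch: slope test computed directly from the formulas above
# (assumes the students data frame from the example below)
x <- students$height; y <- students$weight
n <- length(y)
S_xx <- sum((x - mean(x))^2)
S_xy <- sum((x - mean(x)) * (y - mean(y)))
beta_1 <- S_xy / S_xx                         # estimated slope
beta_0 <- mean(y) - beta_1 * mean(x)          # estimated intercept
sigma2 <- sum((y - beta_0 - beta_1 * x)^2) / (n - 2)  # estimated error variance
t_obs <- (beta_1 - 0.5) / sqrt(sigma2 / S_xx) # test of H0: beta_1 = 0.5
2 * pt(-abs(t_obs), n - 2)                    # (A) two-sided p-value
1 - pt(t_obs, n - 2)                          # (B) one-sided p-value
pt(t_obs, n - 2)                              # (C) one-sided p-value

The value of t_obs should reproduce the t_value of 0.8667 reported in the R example below.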

Example
Of interest is the slope of the regression of weight on height in a specific population of students. For this example two hypotheses are tested, with (a) $H_0:\beta_1=0$ and (b) $H_0:\beta_1=0.5$. A dataset of measurements on a random sample of $n=20$ students has been used (dataset in Table A.6).


SAS code
* Simple linear regression including test for H0: beta_1=0;
proc reg data=students;
 model weight=height;
run;
* Perform test for H0: beta_1=0.5;
proc reg data=students;
 model weight=height;
 test height=0.5;
run;
quit;
SAS output
                   Parameter Estimates
                  Parameter   Standard
Variable     DF   Estimate      Error    t Value   Pr> |t|
Intercept     1   -51.81816   35.76340   -1.45      0.1646
height        1     0.67892    0.20645    3.29      0.0041
                  The REG Procedure
                    Model: MODEL1
     Test 1 Results for Dependent Variable weight
                                Mean
Source             DF         Square    F Value    Pr> F
Numerator           1       94.54374       0.75    0.3975
Denominator        18      125.87535
Remarks:
  • The SAS procedure proc reg is the standard procedure for linear regression. It is a powerful procedure and we use here only a tiny part of it.
  • For the standard hypothesis $H_0:\beta_1=0$ the model dependent variable=independent variable statement is sufficient.
  • For testing a special hypothesis $H_0:\beta_1=\beta_{10}$ you must add the test variable=value statement. Note that here an F-test is used, which is equivalent to the proposed t-test, because a squared t-distributed random variable with $n-2$ degrees of freedom is $F(1,n-2)$-distributed. The p-value stays the same. To get the t-test use the restrict variable=value statement.
  • The quit; statement is used to terminate the procedure; proc reg is an interactive procedure and SAS then knows not to expect any further input.
  • The p-values for the other hypotheses must be calculated by hand. For instance, for $\beta_{10}=0$ the p-value for hypothesis (B) is 1-probt(3.29,18)=0.0020 and for hypothesis (C) probt(3.29,18)=0.9980.


R code
# Read the data
y<-students$weight
x<-students$height
# Simple linear regression including test for H0: beta_1=0
reg<-summary(lm(y~x))
# Perform test for H0: beta_1=0.5
# Get estimated coefficient
beta_1<-reg$coeff[2,1]
# Get standard deviation of estimated coefficient
std_beta_1<-reg$coeff[2,2]
# Perform the test
t_value<-(beta_1-0.5)/std_beta_1
# Calculate p-value
p_value<-2*pt(-abs(t_value),18)
# Output result
# Simple linear regression
reg
# For hypothesis H0: beta_1=0.5
t_value
p_value
R output
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.8182    35.7634  -1.449  0.16456
x             0.6789     0.2065   3.288  0.00408 **
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
> # For hypothesis H0: beta_1=0.5
> t_value
[1] 0.8666546
> p_value
[1] 0.3975371
Remarks:
  • The function lm() performs a linear regression in R. The response variable is placed on the left-hand side of the ~ symbol and the regressor variable on the right-hand side.
  • The summary function gets R to return the estimates, p-values, etc. Here we store the values in the object reg.
  • The standard hypothesis $H_0:\beta_1=0$ is tested by the function lm(). The hypothesis $H_0:\beta_1=\beta_{10}$ with $\beta_{10}\neq 0$ is not covered by this function, but it provides the necessary statistics, which we store in the above example code in the object reg. In the second part of the example we extract the estimated coefficient $\hat\beta_1$ with the command reg$coeff[2,1] and its estimated standard deviation $\sqrt{\hat\sigma^2/S_{xx}}$ with the command reg$coeff[2,2]. These values are then used to perform the test.
  • The p-values for the other hypotheses must be calculated by hand. For instance, for $\beta_{10}=0$ the p-value for hypothesis (B) is 1-pt(3.29,18)=0.0020 and for hypothesis (C) pt(3.29,18)=0.9980.

16.1.2 Test on the intercept

Description: Tests if the regression coefficient $\beta_0$ of a simple linear regression differs from a value $\beta_{00}$.
Assumptions:
  • A sample of $n$ pairs $(x_1,y_1),\ldots,(x_n,y_n)$ is given.
  • The simple linear regression model for the sample is stated as $y_i=\beta_0+\beta_1 x_i+\epsilon_i$, $i=1,\ldots,n$.
  • The error term $\epsilon_i$ is a random variable which is Gaussian distributed with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i\sim N(0,\sigma^2)$ for all $i=1,\ldots,n$. It further holds that $Cov(\epsilon_i,\epsilon_j)=0$ for all $i\neq j$.
Hypotheses: (A) $H_0: \beta_0=\beta_{00}$ vs $H_1: \beta_0\neq\beta_{00}$
(B) $H_0: \beta_0\leq\beta_{00}$ vs $H_1: \beta_0>\beta_{00}$
(C) $H_0: \beta_0\geq\beta_{00}$ vs $H_1: \beta_0<\beta_{00}$
Test statistic: $T=\dfrac{\hat\beta_0-\beta_{00}}{\sqrt{\hat\sigma^2\left(\frac{1}{n}+\frac{\bar x^2}{S_{xx}}\right)}}$
with $\hat\beta_0=\bar y-\hat\beta_1\bar x$, $\hat\beta_1=S_{xy}/S_{xx}$,
$S_{xy}=\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)$, $S_{xx}=\sum_{i=1}^n (x_i-\bar x)^2$
and $\hat\sigma^2=\frac{1}{n-2}\sum_{i=1}^n (y_i-\hat\beta_0-\hat\beta_1 x_i)^2$
Test decision: Reject $H_0$ if for the observed value $t$ of $T$
(A) $t<t_{\alpha/2,n-2}$ or $t>t_{1-\alpha/2,n-2}$
(B) $t>t_{1-\alpha,n-2}$
(C) $t<t_{\alpha,n-2}$
p-values: (A) $p=2\,P(T\geq|t|)$
(B) $p=P(T\geq t)$
(C) $p=P(T\leq t)$
Annotations:
  • The test statistic $T$ follows a t-distribution with $n-2$ degrees of freedom.
  • $t_{\alpha,n-2}$ is the $\alpha$-quantile of the t-distribution with $n-2$ degrees of freedom.
  • The hypothesis $H_0:\beta_0=0$ is used to test if the regression line goes through the origin.
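The following R sketch mirrors the above formulas for the hypothesized value $\beta_{00}=10$ used in the example below; it assumes the students data frame with the variables weight and height from Table A.6.

# Sketch: intercept test computed directly from the formulas above
# (assumes the students data frame from the example below)
x <- students$height; y <- students$weight
n <- length(y)
S_xx <- sum((x - mean(x))^2)
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / S_xx
beta_0 <- mean(y) - beta_1 * mean(x)
sigma2 <- sum((y - beta_0 - beta_1 * x)^2) / (n - 2)
se_beta_0 <- sqrt(sigma2 * (1 / n + mean(x)^2 / S_xx))  # standard error of intercept
t_obs <- (beta_0 - 10) / se_beta_0                      # test of H0: beta_0 = 10
2 * pt(-abs(t_obs), n - 2)                              # (A) two-sided p-value

The value of t_obs should reproduce the t_value of -1.7285 reported in the R example below.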

Example
Of interest is the intercept of the regression of weight on height in a specific population of students. For this example two hypotheses are tested, with (a) $H_0:\beta_0=0$ and (b) $H_0:\beta_0=10$. A dataset of measurements on a random sample of $n=20$ students has been used (dataset in Table A.6).


SAS code
* Simple linear regression including test for H0: beta_0=0;
proc reg data=students;
 model weight=height;
run;
* Perform test for H0: beta_0=10;
proc reg data=students;
 model weight=height;
 test intercept=10;
run;
quit;
SAS output
                   Parameter Estimates
                  Parameter   Standard
Variable     DF   Estimate      Error    t Value   Pr> |t|
Intercept     1   -51.81816   35.76340   -1.45      0.1646
height        1     0.67892    0.20645    3.29      0.0041
                   The REG Procedure
                     Model: MODEL1
          Test 1 Results for Dependent Variable weight
                                Mean
Source             DF         Square    F Value    Pr> F
Numerator           1      376.09299       2.99    0.1010
Denominator        18      125.87535
Remarks:
  • The SAS procedure proc reg is the standard procedure for linear regression. It is a powerful procedure and we use here only a tiny part of it.
  • For the standard hypothesis $H_0:\beta_0=0$ the model dependent variable=independent variable statement is sufficient.
  • For testing a special hypothesis $H_0:\beta_0=\beta_{00}$ you must add the test intercept=value statement. Note that here an F-test is used, which is equivalent to the proposed t-test, because a squared t-distributed random variable with $n-2$ degrees of freedom is $F(1,n-2)$-distributed. The p-value stays the same. To get the t-test use the restrict intercept=value statement.
  • The quit; statement is used to terminate the procedure; proc reg is an interactive procedure and SAS then knows not to expect any further input.
  • The p-values for the other hypotheses must be calculated by hand. For instance, for $\beta_{00}=0$ the observed t-value is -1.45, so the p-value for hypothesis (B) is 1-probt(-1.45,18)=0.9177 and for hypothesis (C) probt(-1.45,18)=0.0823.


R code
# Read the data
y<-students$weight
x<-students$height
# Simple linear regression
reg<-summary(lm(y~x))
# Perform test for H0: beta_0=10
# Get estimated coefficient
beta_0<-reg$coeff[1,1]
# Get standard deviation of estimated coefficient
std_beta_0<-reg$coeff[1,2]
# Perform the test
t_value<-(beta_0-10)/std_beta_0
# Calculate p-Value
p_value<-2*pt(-abs(t_value),18)
# Output result
# Simple linear regression
reg
# For hypothesis H0: beta_0=10
t_value
p_value
R output
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.8182    35.7634  -1.449  0.16456
x             0.6789     0.2065   3.288  0.00408 **
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
>
> # For hypothesis H0: beta_0=10
> t_value
[1] -1.728531
> p_value
[1] 0.1010077
Remarks:
  • The function lm() performs a linear regression in R. The response variable is placed on the left-hand side of the ~ symbol and the regressor variable on the right-hand side.
  • The summary function gets R to return the estimates, p-values, etc. Here we store the values in the object reg.
  • The standard hypothesis $H_0:\beta_0=0$ is tested by the function lm(). The hypothesis $H_0:\beta_0=\beta_{00}$ with $\beta_{00}\neq 0$ is not covered by this function, but it provides the necessary statistics, which we store in the example code in the object reg. In the second part of the example we extract the estimated coefficient $\hat\beta_0$ with the command reg$coeff[1,1] and its estimated standard deviation $\sqrt{\hat\sigma^2(1/n+\bar x^2/S_{xx})}$ with the command reg$coeff[1,2]. These values are then used to perform the test.
  • The p-values for the other hypotheses must be calculated by hand. For instance, for $\beta_{00}=0$ the observed t-value is -1.449, so the p-value for hypothesis (B) is 1-pt(-1.449,18)=0.9177 and for hypothesis (C) pt(-1.449,18)=0.0823.

16.2 Multiple linear regression

Multiple linear regression is an extension of simple linear regression to more than one regressor variable. The response $Y$ is predicted from a set of regressor variables $X_1,\ldots,X_k$. Instead of a straight line a hyperplane is modeled. Again, the values of the regressor variables are either fixed, measured without error, or conditioned on (Rencher 1998, chapter 7). Multiple linear regression is based on assuming a relation $Y=\beta_0+\beta_1 X_1+\cdots+\beta_k X_k+\epsilon$, which is linear in the unknown coefficients or parameters $\beta_0,\beta_1,\ldots,\beta_k$. Further, $\epsilon$ is an error term which models the deviation of the observed values from the hyperplane. To analyze whether individual regressors have an influence on the response $Y$, it is tested if the corresponding parameter differs from zero. Tests for significance of regression test the overall hypothesis that none of the regressors has an influence on $Y$ in the regression model.
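Written out, the matrix notation used in the following tests stacks the $n$ observations as

$$
Y = X\beta + \epsilon, \qquad
Y=\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix},\quad
X=\begin{pmatrix}1 & x_{11} & \cdots & x_{1k}\\ \vdots & \vdots & & \vdots\\ 1 & x_{n1} & \cdots & x_{nk}\end{pmatrix},\quad
\beta=\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_k\end{pmatrix},\quad
\epsilon=\begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix},
$$

so that the least squares estimator is $\hat\beta=(X'X)^{-1}X'Y$, which is the form used in the test statistics below.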

16.2.1 Test on an individual regression coefficient

Description: Tests if a regression coefficient $\beta_j$ of a multiple linear regression differs from a value $\beta_{j0}$.
Assumptions:
  • A sample of $n$ tuples $(x_{11},\ldots,x_{1k},y_1),\ldots,(x_{n1},\ldots,x_{nk},y_n)$ is given.
  • The multiple linear regression model for the sample can be written in matrix notation as $Y=X\beta+\epsilon$ with response vector $Y=(y_1,\ldots,y_n)'$, unknown parameter vector $\beta=(\beta_0,\beta_1,\ldots,\beta_k)'$, random vector of errors $\epsilon=(\epsilon_1,\ldots,\epsilon_n)'$ and the $n\times(k+1)$ matrix $X$ with the values of the regressors, whose first column consists of ones (Montgomery et al. 2006, p. 68).
  • The elements $\epsilon_i$ of $\epsilon$ follow a Gaussian distribution with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i\sim N(0,\sigma^2)$ for all $i=1,\ldots,n$. It further holds that $Cov(\epsilon_i,\epsilon_j)=0$ for all $i\neq j$.
Hypotheses: (A) $H_0: \beta_j=\beta_{j0}$ vs $H_1: \beta_j\neq\beta_{j0}$
(B) $H_0: \beta_j\leq\beta_{j0}$ vs $H_1: \beta_j>\beta_{j0}$
(C) $H_0: \beta_j\geq\beta_{j0}$ vs $H_1: \beta_j<\beta_{j0}$
Test statistic: $T=\dfrac{\hat\beta_j-\beta_{j0}}{\sqrt{\hat\sigma^2\,\mathrm{diag}_{jj}(X'X)^{-1}}}$
with $\hat\beta=(X'X)^{-1}X'Y$, $\hat\beta_j$ the $j$th element of $\hat\beta$,
$\hat\sigma^2=\frac{1}{n-k-1}(Y-X\hat\beta)'(Y-X\hat\beta)$ and $\mathrm{diag}_{jj}(X'X)^{-1}$ the $jj$th element of the diagonal of the inverse matrix of $X'X$.
Test decision: Reject $H_0$ if for the observed value $t$ of $T$
(A) $t<t_{\alpha/2,n-k-1}$ or $t>t_{1-\alpha/2,n-k-1}$
(B) $t>t_{1-\alpha,n-k-1}$
(C) $t<t_{\alpha,n-k-1}$
p-values: (A) $p=2\,P(T\geq|t|)$
(B) $p=P(T\geq t)$
(C) $p=P(T\leq t)$
Annotations:
  • The test statistic $T$ follows a t-distribution with $n-k-1$ degrees of freedom, where $k$ is the number of regressors.
  • $t_{\alpha,n-k-1}$ is the $\alpha$-quantile of the t-distribution with $n-k-1$ degrees of freedom.
  • Usually it is tested if $\beta_j=0$. If this hypothesis cannot be rejected it can be concluded that the regressor variable $X_j$ does not add significantly to the prediction of $Y$, given the other regressor variables $X_l$ with $l\neq j$.
  • Alternatively the squared test statistic $T^2$ can be used, which follows an F-distribution with $1$ and $n-k-1$ degrees of freedom. As the test is a partial test of one regressor, the test is also called a partial F-test.
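The following R sketch mirrors these formulas by computing the t test for the coefficient of the sex dummy via matrix algebra; it assumes the students data frame from Table A.6 and the dummy coding of the example below (male = 0, female = 1).

# Sketch: partial t test via matrix algebra
# (assumes the students data frame; sex coded 1 = male, 2 = female)
y <- students$weight
s <- ifelse(students$sex == 2, 1, 0)     # dummy variable, male as reference
X <- cbind(1, students$height, s)        # design matrix with column of ones
n <- nrow(X); k <- ncol(X) - 1           # k regressors
XtX_inv <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y       # least squares estimates
sigma2 <- sum((y - X %*% beta_hat)^2) / (n - k - 1)  # estimated error variance
se <- sqrt(sigma2 * diag(XtX_inv))       # standard errors of the coefficients
t_obs <- beta_hat[3] / se[3]             # test of H0: beta_2 = 0 (sex dummy)
2 * pt(-abs(t_obs), n - k - 1)           # two-sided p-value

The values of t_obs and the p-value should reproduce the -0.48 and 0.6392 reported in the output below.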

Example
Of interest is the effect of sex in a regression of weight on height and sex in a specific population of students. The variable sex needs to be coded as a dummy variable for the regression model. In our example we choose the outcome male as reference; hence, the new dummy variable takes the value $1$ for female students and $0$ for male students. We test the hypothesis $H_0:\beta_2=0$, where $\beta_2$ is the coefficient of the sex dummy. A dataset of measurements on a random sample of $n=20$ students has been used (dataset in Table A.6).


SAS code
* Create dummy variable for sex with reference male;
data reg;
 set students;
 if sex=1 then s=0;
 if sex=2 then s=1;
run;
* Perform linear regression;
proc reg data=reg;
 model weight=height s;
run;
quit;
SAS output
                  Parameter   Standard
Variable     DF    Estimate      Error   t Value   Pr> |t|
Intercept     1   -44.10291   39.97051     -1.10     0.2852
height        1     0.64182    0.22489      2.85     0.0110
s             1    -2.60868    5.46554     -0.48     0.6392
Remarks:
  • The SAS procedure proc reg is the standard procedure for linear regression. It is a powerful procedure and we use here only a tiny part of it.
  • For the standard hypothesis $H_0:\beta_j=0$ the model dependent variable=independent variables statement is sufficient. The independent variables are separated by blanks.
  • Categorical variables can also be regressors but care must be taken as to which value is the reference value. Here we code sex as the dummy variable, with males as the reference group.
  • The quit; statement is used to terminate the procedure; proc reg is an interactive procedure and SAS then knows not to expect any further input.
  • The p-values for the other hypotheses must be calculated by hand. For instance, for the sex dummy with $\beta_{20}=0$ the observed t-value is -0.477 with 17 degrees of freedom, so the p-value for hypothesis (B) is 1-probt(-0.477,17)=0.6804 and for hypothesis (C) probt(-0.477,17)=0.3196.
  • For testing a special hypothesis $H_0:\beta_j=\beta_{j0}$ you must add the test variable=value statement. Note that here an F-test is used, which is equivalent to the proposed t-test, because a squared t-distributed random variable with $n-k-1$ degrees of freedom is $F(1,n-k-1)$-distributed. The p-value stays the same. To get the t-test use the restrict variable=value statement.


R code
# Read the data
weight<-students$weight
height<-students$height
sex<-students$sex
# Multiple linear regression
summary(lm(weight~height+factor(sex)))
R output
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -44.1029    39.9705  -1.103    0.285
height         0.6418     0.2249   2.854    0.011 *
factor(sex)2  -2.6087     5.4655  -0.477    0.639
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
Remarks:
  • The function lm() performs a linear regression in R. The response variable is placed on the left-hand side of the ~ symbol and the regressor variables on the right-hand side, separated by a plus (+).
  • Categorical variables can also be regressors, but care must be taken as to which value is the reference value. We use the factor() function to tell R that sex is a categorical variable. We see from the output label factor(sex)2 that the reported effect is for females and therefore the males are the reference. To switch these, recode the values of males and females, or change the reference level directly as sketched after these remarks.
  • The summary function gets R to return the estimates, p-values, etc.
  • The standard hypothesis $H_0:\beta_j=0$ is tested by the function lm(). The hypothesis $H_0:\beta_j=\beta_{j0}$ is not covered by this function, but it provides the necessary statistics which can then be used. See Test 16.1.1 on how to do so.
  • The p-values for the other hypotheses must be calculated by hand. For instance, for $\beta_{20}=0$ the observed t-value is -0.477 with 17 degrees of freedom, so the p-value for hypothesis (B) is 1-pt(-0.477,17)=0.6804 and for hypothesis (C) pt(-0.477,17)=0.3196.
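As mentioned in the remarks above, the reference level can also be changed without recoding the raw data. A minimal sketch using R's relevel() function, assuming the students data frame:

# Sketch: make female (coded 2) the reference level instead of male
sex_f <- relevel(factor(students$sex), ref = "2")
summary(lm(students$weight ~ students$height + sex_f))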

16.2.2 Test for significance of regression

Description: Tests if there is a linear relationship between any of the regressors $X_1,\ldots,X_k$ and the response $Y$ in a linear regression.
Assumptions:
  • A sample of $n$ tuples $(x_{11},\ldots,x_{1k},y_1),\ldots,(x_{n1},\ldots,x_{nk},y_n)$ is given.
  • The multiple linear regression model for the sample can be written in matrix notation as $Y=X\beta+\epsilon$ with response vector $Y=(y_1,\ldots,y_n)'$, unknown parameter vector $\beta=(\beta_0,\beta_1,\ldots,\beta_k)'$, random vector of errors $\epsilon=(\epsilon_1,\ldots,\epsilon_n)'$ and the $n\times(k+1)$ matrix $X$ with the values of the regressors, whose first column consists of ones (Montgomery et al. 2006, p. 68).
  • The elements $\epsilon_i$ of $\epsilon$ follow a Gaussian distribution with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i\sim N(0,\sigma^2)$ for all $i=1,\ldots,n$. It further holds that $Cov(\epsilon_i,\epsilon_j)=0$ for all $i\neq j$.
Hypotheses: $H_0: \beta_1=\beta_2=\cdots=\beta_k=0$
vs $H_1: \beta_j\neq 0$ for at least one $j\in\{1,\ldots,k\}$.
Test statistic: $F=\dfrac{\sum_{i=1}^n(\hat y_i-\bar y)^2/k}{\sum_{i=1}^n(y_i-\hat y_i)^2/(n-k-1)}$
where the fitted values $\hat y_i$ are calculated through $\hat Y=X\hat\beta=X(X'X)^{-1}X'Y$
Test decision: Reject $H_0$ if for the observed value $f$ of $F$
$f>f_{1-\alpha;k,n-k-1}$
p-values: $p=P(F\geq f)$
Annotations:
  • The test statistic $F$ is $F(k,n-k-1)$-distributed.
  • $f_{1-\alpha;k,n-k-1}$ is the $1-\alpha$-quantile of the F-distribution with $k$ and $n-k-1$ degrees of freedom.
  • If the null hypothesis cannot be rejected, this indicates that none of the regressors adds significantly to the prediction of $Y$. The test is sometimes called the overall F-test.
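The following R sketch computes the overall F statistic directly from the fitted values; it assumes the students data frame and the model of the example below, and should reproduce the F value reported by summary().

# Sketch: overall F test computed from the fitted values
# (assumes the students data frame from the example below)
fit <- lm(students$weight ~ students$height + factor(students$sex))
y <- students$weight
y_hat <- fitted(fit)                       # fitted values X * beta_hat
n <- length(y); k <- length(coef(fit)) - 1 # k regressors
F_obs <- (sum((y_hat - mean(y))^2) / k) /
         (sum((y - y_hat)^2) / (n - k - 1))
F_obs                                      # compare with F-statistic in summary(fit)
1 - pf(F_obs, k, n - k - 1)                # p-value of the overall F test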

Example
Of interest is the regression of weight on height and sex in a specific population of students. We test for overall significance of regression, hence the hypothesis $H_0:\beta_1=\beta_2=0$. A dataset of measurements on a random sample of $n=20$ students has been used (dataset in Table A.6).


SAS code
proc reg data=reg;
 model weight=height s;
run;
quit;
SAS output
                   Analysis of Variance
                          Sum of      Mean
Source           DF      Squares    Square  F Value  Pr> F
Model             2   1391.20481  695.60241    5.29  0.0164
Error            17   2235.79519  131.51736
Corrected Total  19   3627.00000
Remarks:
  • The SAS procedure proc reg is the standard procedure for linear regression. It is a powerful procedure and we use here only a tiny part of it.
  • Categorical variables can also be regressors, but care must be taken as to which value is the reference value. Here we code sex as the dummy variable, with male as the reference group.
  • The quit; statement is used to terminate the procedure; proc reg is an interactive procedure and SAS then knows not to expect any further input.


R code
summary(lm(students$weight~students$height+factor(students$sex)))
R output
F-statistic: 5.289 on 2 and 17 DF,  p-value: 0.01637
Remarks:
  • The function lm() performs a linear regression in R. The response variable is placed on the left-hand side of the ~ symbol and the regressor variables on the right-hand side, separated by a plus (+).
  • We use the R function factor() to tell R that sex is a categorical variable.
  • The summary function gets R to return parameter estimates, p-values for the overall F-tests, p-values for tests on individual regression parameters, etc.

References

Montgomery D.C., Peck E.A. and Vining G.G. 2006 Introduction to Linear Regression Analysis, 4th edn. John Wiley & Sons, Ltd.

Rencher A.C. 1998 Multivariate Statistical Inference and Applications. John Wiley & Sons, Ltd.
