Chapter 16
Tests in Regression Analysis
Regression analysis investigates and models the relationship between variables. A linear relationship is assumed between a dependent (response) variable of interest and one or several independent (predictor or regressor) variables. We present tests on regression parameters in simple and multiple linear regression analysis. These cover hypotheses on the value of individual regression parameters as well as tests for significance of regression, where the null hypothesis states that none of the regressor variables has a linear effect on the response.
16.1 Simple linear regression
Simple linear regression relates a response variable $Y$ to the given outcome $x$ of a single regressor variable $X$ by assuming the relation $Y = \beta_0 + \beta_1 x + \epsilon$, which is linear in the unknown coefficients or parameters $\beta_0$ and $\beta_1$. Further, $\epsilon$ is an error term which models the deviation of the observed values from the linear relationship. In two-dimensional space this relation describes a straight line; for this reason simple linear regression is also called straight line regression. The value $x$ of the regressor variable is fixed or measured without error. If the regressor variable $X$ is a random variable, the model is commonly understood as modeling the response conditional on the outcome $X = x$. To analyze whether the regressor has an influence on the response, it is tested if the slope $\beta_1$ of the regression line differs from zero. Other tests treat the intercept $\beta_0$.
16.1.1 Test on the slope
Description:
Tests if the regression coefficient $\beta_1$ of a simple linear regression differs from a value $\beta_1^0$.
Assumptions:
- A sample $(x_1, y_1), \dots, (x_n, y_n)$ of $n$ pairs is given.
- The simple linear regression model for the sample is stated as $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \dots, n$.
- The error term $\epsilon_i$ is a random variable which is Gaussian distributed with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \dots, n$. It further holds that $\operatorname{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
Hypotheses:
(A) $H_0: \beta_1 = \beta_1^0$ vs $H_1: \beta_1 \neq \beta_1^0$
(B) $H_0: \beta_1 \leq \beta_1^0$ vs $H_1: \beta_1 > \beta_1^0$
(C) $H_0: \beta_1 \geq \beta_1^0$ vs $H_1: \beta_1 < \beta_1^0$
Test statistic:
$T = \dfrac{\hat\beta_1 - \beta_1^0}{\sqrt{\hat\sigma^2 / S_{xx}}}$
with $\hat\beta_1 = S_{xy} / S_{xx}$, $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$, $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$
and $\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$ with $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.
Test decision:
Reject $H_0$ if for the observed value $t$ of $T$
(A) $t < t_{\alpha/2, n-2}$ or $t > t_{1-\alpha/2, n-2}$
(B) $t > t_{1-\alpha, n-2}$
(C) $t < t_{\alpha, n-2}$
p-values:
(A) $p = 2\,P(T \leq -|t|)$
(B) $p = 1 - P(T \leq t)$
(C) $p = P(T \leq t)$
Annotations:
- The test statistic $T$ follows a t-distribution with $n-2$ degrees of freedom.
- $t_{\alpha, n-2}$ is the $\alpha$-quantile of the t-distribution with $n-2$ degrees of freedom.
- Of special interest is the test problem $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$; the test is then also called a test for significance of regression. If $H_0$ cannot be rejected, this indicates that there is no linear relationship between $X$ and $Y$: either $X$ has no or little effect on $Y$, or the true relationship is not linear (Montgomery et al. 2006, p. 23).
- Alternatively the squared test statistic $T^2$ can be used, which follows an F-distribution with $1$ and $n-2$ degrees of freedom.
Example
Of interest is the slope $\beta_1$ of the regression of weight on height in a specific population of students. For this example two hypotheses, (a) and (b), on the value of $\beta_1$ are tested. A dataset of measurements on a random sample of $n = 20$ students has been used (dataset in Table A.6).
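One way such a test could be carried out in R is sketched below. It assumes the students data frame used later in this chapter, with columns weight and height, and a hypothetical null value beta1_0, which is not taken from the example.
# Sketch of the test on the slope; 'students' with columns 'weight'
# and 'height' is assumed, beta1_0 is a hypothetical null value.
fit <- lm(weight ~ height, data = students)
summary(fit)    # row 'height' gives the t-test of H0: beta1 = 0

# For a null value other than zero, compute the statistic by hand:
beta1_0 <- 0.5                                # hypothetical null value
est     <- coef(summary(fit))["height", ]
t_stat  <- (est["Estimate"] - beta1_0) / est["Std. Error"]
df      <- fit$df.residual                    # n - 2
2 * pt(abs(t_stat), df, lower.tail = FALSE)   # p-value, hypothesis (A)
pt(t_stat, df, lower.tail = FALSE)            # p-value, hypothesis (B)
pt(t_stat, df)                                # p-value, hypothesis (C)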
16.1.2 Test on the intercept
Description:
Tests if the regression coefficient $\beta_0$ of a simple linear regression differs from a value $\beta_0^0$.
Assumptions:
- A sample $(x_1, y_1), \dots, (x_n, y_n)$ of $n$ pairs is given.
- The simple linear regression model for the sample is stated as $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \dots, n$.
- The error term $\epsilon_i$ is a random variable which is Gaussian distributed with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \dots, n$. It further holds that $\operatorname{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
Hypotheses:
(A) $H_0: \beta_0 = \beta_0^0$ vs $H_1: \beta_0 \neq \beta_0^0$
(B) $H_0: \beta_0 \leq \beta_0^0$ vs $H_1: \beta_0 > \beta_0^0$
(C) $H_0: \beta_0 \geq \beta_0^0$ vs $H_1: \beta_0 < \beta_0^0$
Test statistic:
$T = \dfrac{\hat\beta_0 - \beta_0^0}{\sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}}$
with $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$, $\hat\beta_1 = S_{xy} / S_{xx}$, $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$,
$S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$,
and $\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$.
Test decision:
Reject $H_0$ if for the observed value $t$ of $T$
(A) $t < t_{\alpha/2, n-2}$ or $t > t_{1-\alpha/2, n-2}$
(B) $t > t_{1-\alpha, n-2}$
(C) $t < t_{\alpha, n-2}$
p-values:
(A) $p = 2\,P(T \leq -|t|)$
(B) $p = 1 - P(T \leq t)$
(C) $p = P(T \leq t)$
Annotations:
- The test statistic $T$ follows a t-distribution with $n-2$ degrees of freedom.
- $t_{\alpha, n-2}$ is the $\alpha$-quantile of the t-distribution with $n-2$ degrees of freedom.
- The hypothesis $H_0: \beta_0 = 0$ is used to test if the regression line goes through the origin.
Example
Of interest is the intercept $\beta_0$ of the regression of weight on height in a specific population of students. For this example two hypotheses, (a) and (b), on the value of $\beta_0$ are tested. A dataset of measurements on a random sample of $n = 20$ students has been used (dataset in Table A.6).
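The same approach applies to the intercept. A minimal R sketch, again assuming the students data frame and a hypothetical null value beta0_0:
# Sketch of the test on the intercept (illustrative assumptions as above)
fit     <- lm(weight ~ height, data = students)
beta0_0 <- 0                                  # hypothetical null value
est     <- coef(summary(fit))["(Intercept)", ]
t_stat  <- (est["Estimate"] - beta0_0) / est["Std. Error"]
2 * pt(abs(t_stat), fit$df.residual, lower.tail = FALSE)  # p-value, (A)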
16.2 Multiple linear regression
Multiple linear regression is an extension of simple linear regression to more than one regressor variable. The response $Y$ is predicted from a set of $k$ regressor variables $X_1, \dots, X_k$. Instead of a straight line a hyperplane is modeled. Again, the values of the regressor variables are either fixed, measured without error, or conditioned on (Rencher 1998, chapter 7). Multiple linear regression is based on assuming a relation $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon$, which is linear in the unknown coefficients or parameters $\beta_0, \beta_1, \dots, \beta_k$. Further, $\epsilon$ is an error term which models the deviation of the observed values from the hyperplane. To analyze whether individual regressors have an influence on the response, it is tested if the corresponding parameter differs from zero. Tests for significance of regression test the overall hypothesis that none of the regressors has an influence on $Y$ in the regression model.
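To make the matrix notation used in the following tests concrete, here is a minimal R sketch computing the least squares estimate $\hat\beta = (X'X)^{-1}X'y$ directly on simulated data (purely illustrative, not the student dataset):
# Least squares in matrix form on simulated toy data
set.seed(1)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)            # n x (k+1) design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                          # agrees with coef(lm(y ~ x1 + x2))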
16.2.1 Test on an individual regression coefficient
Description:
Tests if a regression coefficient $\beta_j$, $j \in \{0, 1, \dots, k\}$, of a multiple linear regression differs from a value $\beta_j^0$.
Assumptions:
- A sample of $n$ tuples $(x_{11}, \dots, x_{1k}, y_1), \dots, (x_{n1}, \dots, x_{nk}, y_n)$ is given.
- The multiple linear regression model for the sample can be written in matrix notation as $y = X\beta + \epsilon$ with response vector $y = (y_1, \dots, y_n)'$, unknown parameter vector $\beta = (\beta_0, \beta_1, \dots, \beta_k)'$, random vector of errors $\epsilon = (\epsilon_1, \dots, \epsilon_n)'$ and the $n \times (k+1)$ matrix $X$ with values of the regressors (Montgomery et al. 2006, p. 68).
- The elements $\epsilon_i$ of $\epsilon$ follow a Gaussian distribution with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \dots, n$. It further holds that $\operatorname{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
Hypotheses:
(A) $H_0: \beta_j = \beta_j^0$ vs $H_1: \beta_j \neq \beta_j^0$
(B) $H_0: \beta_j \leq \beta_j^0$ vs $H_1: \beta_j > \beta_j^0$
(C) $H_0: \beta_j \geq \beta_j^0$ vs $H_1: \beta_j < \beta_j^0$
Test statistic:
$T = \dfrac{\hat\beta_j - \beta_j^0}{\sqrt{\hat\sigma^2 \,\operatorname{diag}_{jj}(X'X)^{-1}}}$
with $\hat\beta = (X'X)^{-1}X'y$, $\hat\sigma^2 = \frac{1}{n-k-1}(y - X\hat\beta)'(y - X\hat\beta)$,
and $\operatorname{diag}_{jj}(X'X)^{-1}$ the $jj$th element of the diagonal of the inverse matrix of $X'X$.
Test decision:
Reject $H_0$ if for the observed value $t$ of $T$
(A) $t < t_{\alpha/2, n-k-1}$ or $t > t_{1-\alpha/2, n-k-1}$
(B) $t > t_{1-\alpha, n-k-1}$
(C) $t < t_{\alpha, n-k-1}$
p-values:
(A) $p = 2\,P(T \leq -|t|)$
(B) $p = 1 - P(T \leq t)$
(C) $p = P(T \leq t)$
Annotations:
- The test statistic $T$ follows a t-distribution with $n-k-1$ degrees of freedom.
- $t_{\alpha, n-k-1}$ is the $\alpha$-quantile of the t-distribution with $n-k-1$ degrees of freedom.
- Usually it is tested if $\beta_j = 0$. If this hypothesis cannot be rejected, it can be concluded that the regressor variable $X_j$ does not add significantly to the prediction of $Y$, given the other regressor variables $X_i$ with $i \neq j$.
- Alternatively the squared test statistic $T^2$ can be used, which follows an F-distribution with $1$ and $n-k-1$ degrees of freedom. As the test is a partial test of one regressor, the test is also called a partial F-test.
Example
Of interest is the effect of sex in a regression of weight on height and sex in a specific population of students. The variable sex needs to be coded as a dummy variable for the regression model. In our example we choose the outcome male as reference, hence the new variable sex takes the value $1$ for female students and $0$ for male students. We test the hypothesis $H_0: \beta_2 = 0$ vs $H_1: \beta_2 \neq 0$, where $\beta_2$ denotes the coefficient of sex. A dataset of measurements on a random sample of $n = 20$ students has been used (dataset in Table A.6).
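In R this partial t-test could be carried out as sketched below; the students data frame is assumed to contain the columns weight, height and sex, with sex holding the values "male" and "female" (an assumption about the value labels, not taken from Table A.6):
# Dummy coding with male as reference group (assumed value labels)
students$sexf <- ifelse(students$sex == "female", 1, 0)
fit <- lm(weight ~ height + sexf, data = students)
coef(summary(fit))["sexf", ]   # estimate, std. error, t value, Pr(>|t|)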
16.2.2 Test for significance of regression
Description:
Tests if there is a linear relationship between any of the regressors and the response in a linear regression.
Assumptions:
- A sample of $n$ tuples $(x_{11}, \dots, x_{1k}, y_1), \dots, (x_{n1}, \dots, x_{nk}, y_n)$ is given.
- The multiple linear regression model for the sample can be written in matrix notation as $y = X\beta + \epsilon$ with response vector $y = (y_1, \dots, y_n)'$, unknown parameter vector $\beta = (\beta_0, \beta_1, \dots, \beta_k)'$, random vector of errors $\epsilon = (\epsilon_1, \dots, \epsilon_n)'$ and the $n \times (k+1)$ matrix $X$ with values of the regressors (Montgomery et al. 2006, p. 68).
- The elements $\epsilon_i$ of $\epsilon$ follow a Gaussian distribution with mean $0$ and variance $\sigma^2$, that is, $\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \dots, n$. It further holds that $\operatorname{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
Hypotheses:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
vs $H_1: \beta_j \neq 0$ for at least one $j \in \{1, \dots, k\}$.
Test statistic:
$F = \dfrac{SS_R / k}{SS_E / (n - k - 1)}$
with regression sum of squares $SS_R = \hat\beta' X' y - n \bar{y}^2$ and error sum of squares $SS_E = y'y - \hat\beta' X' y$, where the $\hat\beta_j$ are calculated through $\hat\beta = (X'X)^{-1} X' y$.
Test decision:
Reject $H_0$ if for the observed value $f$ of $F$
$f > f_{1-\alpha; k, n-k-1}$
p-values:
$p = 1 - P(F \leq f)$
Annotations:
- The test statistic $F$ is $F(k, n-k-1)$-distributed.
- $f_{1-\alpha; k, n-k-1}$ is the $(1-\alpha)$-quantile of the F-distribution with $k$ and $n-k-1$ degrees of freedom.
- If the null hypothesis cannot be rejected, none of the regressors adds significantly to the prediction of $Y$. Therefore the test is sometimes called the overall F-test.
Example
Of interest is the regression of weight on height and sex in a specific population of students. We test for overall significance of regression, hence the hypothesis $H_0: \beta_1 = \beta_2 = 0$ vs $H_1: \beta_1 \neq 0$ or $\beta_2 \neq 0$. A dataset of measurements on a random sample of $n = 20$ students has been used (dataset in Table A.6).
SAS code
proc reg data=reg;
model weight=height sex;
run;
quit;
SAS output
Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              2        1391.20481      695.60241       5.29    0.0164
Error             17        2235.79519      131.51736
Corrected Total   19        3627.00000
Remarks:
- The SAS procedure proc reg is the standard procedure for linear regression. It is a powerful procedure, of which we use only a small part here.
- Categorical variables can also be regressors, but care must be taken as to which value is the reference value. Here we code sex as the dummy variable, with male as the reference group.
- The quit; statement is used to terminate the procedure; proc reg is an interactive procedure and SAS then knows not to expect any further input.
R code
summary(lm(weight ~ height + factor(sex), data = students))
R output
F-statistic: 5.289 on 2 and 17 DF, p-value: 0.01637
Remarks:
- The function lm() performs a linear regression in R. The response variable is placed on the left-hand side of the ~ symbol and the regressor variables on the right-hand side, separated by a plus sign (+).
- We use the R function factor() to tell R that sex is a categorical variable.
- The summary function gets R to return parameter estimates, the p-value for the overall F-test, p-values for tests on individual regression parameters, etc.
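To connect the output above with the formula of the test statistic, the following sketch recovers the F value from the sums of squares of the fitted model:
# Overall F statistic computed by hand from SS_R and SS_E
fit  <- lm(weight ~ height + factor(sex), data = students)
sst  <- sum((students$weight - mean(students$weight))^2)  # corrected total SS
sse  <- sum(resid(fit)^2)                                 # error SS
k    <- length(coef(fit)) - 1                             # number of regressors
df2  <- fit$df.residual                                   # n - k - 1
f    <- ((sst - sse) / k) / (sse / df2)                   # F = MS_R / MS_E
pf(f, k, df2, lower.tail = FALSE)                         # p-value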
References
Montgomery D.C., Peck E.A. and Vining G.G. 2006 Introduction to Linear Regression Analysis, 4th edn. John Wiley & Sons, Ltd.
Rencher A.C. 1998 Multivariate Statistical Inference and Applications. John Wiley & Sons, Ltd.