CHAPTER 7

Regression Analysis and Modeling

Chapter Highlights

  • Introduction to Regression and Correlation
  • Linear Regression
  • Regression Model
  • The Estimated Equation of the Regression Line
  • The Method of Least Squares
  • Illustration of Least Squares Regression Method
  • Analysis of a Simple Regression Problem
  • Regression Analysis using Computer
  • Simple Regression using EXCEL
  • Simple Regression using MINITAB
  • Analysis of Regression Output
  • Model Adequacy Test
  • Assumptions of Regression Model and Checking the Assumptions using MINITAB Residual Plots
  • Checking the Assumptions of Regression using Residual Plots
  • Multiple Regression: Computer Analysis and Results
  • Introduction to Multiple Regression
  • Multiple Regression Model
  • The Least Squares Multiple Regression Model
  • Models with Two Quantitative Independent Variables x1 and x2
  • Assumptions of Multiple Regression Model
  • Computer Analysis of Multiple Regression
  • The Coefficient of Multiple Determination (r²)
  • Hypothesis Tests in Multiple Regression
  • Testing the Overall Significance of Regression
  • Hypothesis Tests on Individual Regression Coefficients
  • Multicollinearity and Autocorrelation in Multiple Regression
  • Summary of the Key Features of Multiple Regression Model
  • Model Building and Computer Analysis
  • Model with a Single Quantitative Independent Variable
  • First-order Model/ Second-order Model/ Third-order Model
  • A Quadratic (second-order) Model: Second-order Model using MINITAB
  • Analysis of Computer Results
  • Models with Qualitative Independent (Dummy) Variables
  • One Qualitative Independent Variable at Two Levels
  • Model with One Qualitative Independent Variable at Three Levels
  • Example: Dummy Variables
  • Overview of Regression Models
  • Implementation Steps and Strategy for Regression Models

Introduction to Regression and Correlation

This chapter provides an introduction to regression and correlation analysis. The techniques of regression enable us to explore the relationship between variables. We will discuss how to develop regression models that can be used to predict one variable using another variable, or even multiple variables. The following features of regression analysis are also covered in this chapter.

  I. Concepts of the dependent or response variable and the independent variables or predictors,
  II. The basics of the least squares method in regression analysis and its purpose in estimating the regression line,
  III. Determining the best-fitting line through the data points,
  IV. Calculating the slope and y-intercept of the best-fitting regression line and interpreting the meaning of the regression line, and
  V. Measures of association between two quantitative variables: the covariance and the coefficient of correlation.

Linear Regression

Regression analysis is used to investigate the relationship between two or more variables. Often we are interested in predicting a variable y using one or more independent variables x1, x2,...,xk. For example, we might be interested in the relationship between two variables: sales and profit for a chain of stores, the number of hours required to produce a certain number of products, the number of accidents vs. blood alcohol level, advertising expenditures and sales, or the height of parents compared to that of their children. In all these cases, regression analysis can be applied to investigate the relationship between the variables.

In general, we have one dependent or response variable, y and one or more independent variables, x1, x2,...,xk. The independent variables are also called predictors. If there is only one independent variable x that we are trying to relate to the dependent variable y, then this is a case of simple regression. On the other hand, if we have two or more independent variables that are related to a single response or dependent variable, then we have a case of multiple regression. In this section, we will discuss simple regression, or to be more specific, simple linear regression. This means that the relationship we obtain between the dependent or response variable y and the independent variable x will be linear. In this case, there is only one predictor or independent variable (x) of interest that will be used to predict the dependent variable (y).

In regression analysis, the dependent or response variable y is a random variable, whereas the independent variable or variables x1, x2,...,xk are measured with negligible error and are controlled by the analyst. The relationship between the dependent and independent variable or variables is described by a mathematical model known as a regression model.

The Regression Model

In the simple linear regression method, we study the linear relationship between two variables: the dependent or response variable (y) and the independent variable or predictor (x).

Suppose that the Mountain Power Utility company is interested in developing a model that will enable them to predict the home heating cost based on the size of homes in two of the western states that they serve. This model involves two variables: the heating cost and the size of the homes. We will denote them by y and x respectively. The manager in charge of developing the model believes that there is a positive relationship between x and y, meaning that larger homes (homes with larger square-footage) tend to have higher heating costs. The regression model relating the two variables—home heating cost y as the dependent variable and the size of the homes as the independent variable x—can be denoted using equation (7.1).

Equation (7.1) shows the relationship between the values of x and y, or the independent and dependent variable and an error term in a simple regression model.

y = β0 + β1x + ε        (7.1)

where
y = dependent variable
x = independent variable
β0 = y-intercept (population)
β1 = slope of the population regression line
ε = random error term (ε is the Greek letter "epsilon")

The model represented by equation (7.1) can be viewed as a population model in which β0 and β1 are the parameters of the model. The error term ε represents the variability in y that cannot be explained by the relationship between x and y.

In our example, the population consists of all the homes in the region. This population consists of subpopulations, one for each home size x. Thus, one subpopulation may be viewed as all homes with 1,500 square feet, another subpopulation as all homes with 2,100 square feet, and so on. Each of these subpopulations of size x will have a corresponding distribution of y values with mean or expected value E(y). The relationship between the expected value of y, or E(y), and x is the regression equation given by:

E(y) = β0 + β1x        (7.2)

where
E(y) = mean or expected value of y for a given value of x
β0 = y-intercept of the regression line
β1 = slope of the regression line

The regression equation represented by equation (7.2) is the equation of a straight line describing the relationship between E(y) and x. This relationship, shown in Figure 7.1 (a)–(c), can be described as positive, negative, or no relationship. A positive linear relationship is identified by a positive slope. It shows that an increase in the value of x causes an increase in the mean value of y, or E(y), whereas a negative linear relationship is identified by a negative slope and indicates that an increase in the value of x causes a decrease in the mean value of y.

Figure 7.1 Possible linear relationship between E(y) and x in simple linear regression

No relationship between x and y means that the mean value of y, or E(y), is the same for every value of x. In this case, the regression equation cannot be used to make a prediction because of a weak or nonexistent relationship between x and y.

The Estimated Equation of Regression Line

In equation (7.2), β0 and β1 are the unknown population parameters that must be estimated using the sample data. The estimates of β0 and β1 are denoted by b0 and b1 that provide the estimated regression equation given by the following equation.

ŷ = b0 + b1x        (7.3)

where
ŷ = point estimator of E(y) or the mean value of y for a given value of x
b0 = y-intercept of the regression line
b1 = slope of the regression line

The regression equation above represents the estimated line of regression in slope-intercept form. The y-intercept b0 and the slope b1 in equation (7.3) are determined using the least squares method. Before we discuss the least squares method in detail, we will describe the process of estimating the regression equation. Figure 7.2 explains this process.

Figure 7.2 Estimating the regression equation

The Method of Least Squares

The regression model is described in the form of a regression equation that is obtained using the least squares method. In simple linear regression, the form of the regression equation is ŷ = b0 + b1x. This is the equation of a straight line in slope-intercept form.

Figure 7.3 shows a scatter plot of the data in Table 7.1. Scatter plots are often used to investigate the relationship between two variables. An investigation of the plot shows a positive relationship between sales and advertising expenditures; therefore, the manager would like to predict the sales from the advertising expenditures using a simple regression model.

Figure 7.3 Scatterplot of sales and advertisement expenditures

Table 7.1 Sales and advertisement data

Sales ($1,000s)     Advertising ($1,000s)
458                 34
390                 30
378                 29
426                 30
330                 26
400                 31
458                 33
410                 30
628                 41
553                 38
728                 44
498                 40
708                 48
719                 47
658                 45

As outlined above, a simple regression model involves two variables where one variable is used to predict the other variable. The variable to be predicted is the dependent or response variable, and the other variable is the independent variable. The dependent variable is usually denoted by y while the independent variable is denoted by x.

In a scatter plot the dependent variable (y) is plotted on the vertical axis and the independent variable (x) is plotted on the horizontal axis.

The scatter plot in Figure 7.3 suggests a positive linear relationship between the sales (y) and the advertising expenditures (x). From the figure, it can be seen that the plotted points can be well approximated by a straight line of the form ŷ = b0 + b1x, where b0 and b1 are the y-intercept and the slope of the line. The process of estimating this regression equation uses a well-known mathematical tool: the least squares method.

The least squares method requires fitting a line through the data points so that the sum of the squares of the errors or residuals is a minimum. These errors or residuals are the vertical distances of the points from the fitted line. Thus, the least squares method determines the best-fitting line through the data points: the line for which the sum of the squared vertical distances or deviations between the given points and the fitted line is a minimum.

Figure 7.4 shows the concept of the least squares method. The figure shows a line fitted to the scatter plot of Figure 7.3 using the least squares method. This line is the estimated line denoted using y-hat (ŷ). The method of estimating this line will be illustrated later. The equation of this line is given below.

Figure 7.4 Fitting the regression line to the sales and advertising data of table 7.1

ŷ = −150.9 + 18.33x

The vertical distance of each point from the line is known as the error or residual. Note that the residual or error of a point can be positive, negative, or zero depending upon whether the point is above, below, or on the fitted line. If the point is above the line, the error is positive, whereas if the point is below the fitted line, the error is negative.

Figure 7.4 shows graphically the errors for a few points. To demonstrate how the error or residual for a point is calculated, refer to the data in Table 7.1.

This table shows that for the advertising expenditure of 40 (or x = 40) the sales are 498 (or y = 498). This is shown graphically in Figure 7.4. The estimated or predicted sales for x = 40 correspond to the height of the fitted regression line at x = 40. This predicted value can be determined using the equation of the fitted line as

ŷ = −150.9 + 18.33(40) = 582.3

This is shown in Figure 7.4 as ŷ = 582.3. The difference between the observed sales, y = 498, and the predicted value of y is the error or residual and is equal to

e = y − ŷ = 498 − 582.3 = −84.3

Figure 7.4 shows this error value. This error is negative because the point y = 498 lies below the fitted regression line.

Now, consider the advertising expenditure of x = 44. The observed sales for this value are 728, or y = 728 (from Table 7.1). The predicted sales for x = 44 correspond to the height of the fitted regression line at x = 44. This value is calculated as:

ŷ = −150.9 + 18.33(44) = 655.6

The value is shown in Figure 7.4. The error for this point is the difference between the observed and the predicted, or the estimated value which is

e = y − ŷ = 728 − 655.6 = 72.4

This value of the error is positive because the point y = 728 lies above the fitted line.

The errors for the other observed values can be calculated in a similar way. The vertical deviation of a point from the fitted regression line represents the amount of error associated with that point. The least squares method determines the values b0 and b1 in the fitted regression line ŷ = b0 + b1x that will minimize the sum of the squares of the errors. Minimizing the sum of the squares of the errors provides a unique line through the data points such that the overall squared distance of the points from the fitted line is a minimum.

Since the least squares criteria require that the sum of the squares of the errors be minimized, we have the following relationship:

min Σ(yi − ŷi)² = min Σ(yi − b0 − b1xi)²        (7.4)

where y is the observed value and ŷ is the estimated value of the dependent variable given by ŷ = b0 + b1x.

Equation (7.4) involves two unknowns b0 and b1. Using differential calculus, the following two equations can be obtained:

Σyi = nb0 + b1Σxi
Σxiyi = b0Σxi + b1Σxi²        (7.5)

These equations are known as the normal equations and can be solved algebraically to obtain the unknown values of the slope and y-intercept b0 and b1. Solving these equations yields the results shown below.

b1 = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)        (7.6)

b0 = ȳ − b1x̄ = (Σy)/n − b1(Σx)/n        (7.7)

The values b0 and b1, when calculated using equations (7.6) and (7.7), minimize the sum of the squares of the vertical deviations or errors. These values can be calculated easily using the data points (xi, yi), which are the observed values of the independent and dependent variables (the collected data in Table 7.1).
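
Equations (7.6) and (7.7) are straightforward to program. Below is a minimal sketch in Python (assuming NumPy is available; the function name is illustrative, not from the text):

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    # Equation (7.6): slope from the sums of x, y, xy, and x squared
    b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        n * np.sum(x ** 2) - np.sum(x) ** 2)
    # Equation (7.7): intercept from the sample means
    b0 = np.mean(y) - b1 * np.mean(x)
    return b0, b1
```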

Illustration of Least Squares Regression Method

In this section we demonstrate the least squares method, which is the basis of the regression model. We also discuss the process of finding the regression equation using the sales and advertising expenditures data in Table 7.1. Since the sales manager found a positive linear relationship between the sales and advertising expenditures through an investigation of the scatter plot in Figure 7.3, he would now use the data to find the best-fitting line through the points on the scatter plot. The line of best fit can be obtained by first calculating b0 and b1 using equations (7.6) and (7.7) above. These values will provide a line of the form ŷ = b0 + b1x that can be used to predict the sales (y) using the advertising expenditures (x).

In order to evaluate b0 and b1, we need to perform some intermediate calculations shown in Table 7.2. We must first calculate Σx, Σy, Σxy, and Σx². These values can be calculated using the data points x and y. For later calculations, we will also need the value of Σy²; therefore, an extra column for y², the squares of the dependent variable (y), is added in this table.

Table 7.2 Intermediate calculations for determining the estimated regression line

Obs.    y (Sales, $1,000s)    x (Advertising, $1,000s)    xy        x²       y²
1       458                   34                          15,572    1,156    209,764
2       390                   30                          11,700    900      152,100
3       378                   29                          10,962    841      142,884
4       426                   30                          12,780    900      181,476
5       330                   26                          8,580     676      108,900
6       400                   31                          12,400    961      160,000
7       458                   33                          15,114    1,089    209,764
8       410                   30                          12,300    900      168,100
9       628                   41                          25,748    1,681    394,384
10      553                   38                          21,014    1,444    305,809
11      728                   44                          32,032    1,936    529,984
12      498                   40                          19,920    1,600    248,004
13      708                   48                          33,984    2,304    501,264
14      719                   47                          33,793    2,209    516,961
15      658                   45                          29,610    2,025    432,964
Sum     Σy = 7,742            Σx = 546                    Σxy = 295,509    Σx² = 20,622    Σy² = 4,262,358

Note: n = the number of observations = 15; x̄ = Σx/n = 546/15 = 36.4; ȳ = Σy/n = 7,742/15 = 516.13

Using the values in Table 7.2 and equations (7.6) and (7.7), we first calculate the value of b1:

b1 = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²) = (15(295,509) − (546)(7,742)) / (15(20,622) − (546)²) = 205,503/11,214 = 18.33

Using the value of b1, we obtain the value of b0.

b0 = ȳ − b1x̄ = 516.13 − (18.3256)(36.4) = −150.9

This gives us the following equation for the estimated regression line:

ŷ = −150.9 + 18.33x

This equation is plotted in Figure 7.5.

Figure 7.5 Graph of the estimated regression equation

The slope (b1) of the estimated regression line has a positive value of 18.33. This means that as the advertising expenditures (x) increase, the sales increase. Since the advertising expenditures (x) and the sales (y) are both measured in $1,000s, the estimated regression equation ŷ = −150.9 + 18.33x means that each unit increase in the value of x (or every $1,000 increase in the advertising expenditures) will lead to an increase of $18,330 (or 18.33 × 1,000 = 18,330) in expected sales. We can also use the regression equation to predict the sales for a given value of x, the advertising expenditure. For instance, the predicted sales for x = 40 can be calculated as:

ŷ = −150.9 + 18.33(40) = 582.3

Thus, for the advertising expenditure of $40,000 the predicted sales would be $582,300.
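
As a quick check, the whole fit can be reproduced from the Table 7.1 data in a few lines of Python (a sketch assuming NumPy is available; np.polyfit returns the slope and intercept of a degree-1 least squares fit):

```python
import numpy as np

# Advertising (x) and sales (y) from Table 7.1, both in $1,000s
x = np.array([34, 30, 29, 30, 26, 31, 33, 30, 41, 38, 44, 40, 48, 47, 45])
y = np.array([458, 390, 378, 426, 330, 400, 458, 410, 628,
              553, 728, 498, 708, 719, 658])

b1, b0 = np.polyfit(x, y, 1)       # degree-1 fit: slope first, then intercept
print(round(b1, 2), round(b0, 1))  # 18.33 and -150.9
print(round(b0 + b1 * 40, 1))      # predicted sales at x = 40: roughly 582
```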

It is important to check the adequacy of the estimated regression equation before using the equation to make predictions. In the sections that follow, we will discuss several tests to check the adequacy of the regression model.

Analysis of a Simple Regression Problem

The example below demonstrates the necessary computations, their interpretation, and the application of a simple regression problem using computer packages. Suppose the operations manager of a manufacturing company wants to predict the number of hours required to produce a certain number of products. The data for the number of units produced and the time in hours to produce those units are shown in Table 7.3 (Data File: Hours_Units). This is a simple linear regression problem, so we have one dependent or response variable that we are trying to relate to one independent variable or predictor. Since we are trying to predict the number of hours using the number of units produced, hours is the dependent or response variable (y) and the number of units is the independent variable or predictor (x). For the data in Table 7.3, we first calculate the intermediate values shown in Table 7.4. All these values are calculated using the observed values of x and y in Table 7.3. These intermediate values will be used in most of the computations related to simple regression analysis.

Table 7.3 Data for regression example

Obs. No.    1       2       3       4       5       6       7       8       9       10
Units (x)   932     951     531     766     814     914     899     535     554     445
Hours (y)   16.20   16.05   11.84   14.21   14.42   15.08   14.45   11.73   12.24   11.12

Obs. No.    11      12      13      14      15      16      17      18      19      20
Units (x)   704     897     949     632     477     754     819     869     1,035   646
Hours (y)   12.63   14.43   15.46   12.64   11.92   13.95   14.33   15.23   16.77   12.41

Obs. No.    21      22      23      24      25      26      27      28      29      30
Units (x)   1,055   875     969     1,075   655     1,125   960     815     555     925
Hours (y)   17.00   15.50   16.20   17.50   12.92   18.20   15.10   14.00   12.20   15.50

We will also use computer packages such as MINITAB and EXCEL to analyze the simple regression problem and provide detailed analysis of the computer output. First, we will explain the manual calculations and interpret the results. You will find that all the formulas are written in terms of the values calculated in Table 7.4.

Table 7.4 Intermediate calculations for data in Table 7.3

[Table 7.4 lists the intermediate sums for the n = 30 observations in Table 7.3: Σx = 24,132 and Σy = 431.23, together with Σxy, Σx², and Σy².]

Constructing a Scatterplot of the Data

We can use EXCEL or MINITAB to create a scatter plot of the data. From the data in Table 7.3, enter the units (x) in the first column and the hours (y) in the second column of EXCEL or MINITAB and construct a scatter plot. Figure 7.6 shows the scatter plot for this data.

Figure 7.6 Scatter plot of Hours (y) and Units (x)

The above plot clearly shows an increasing trend. It shows a linear relationship between x and y; therefore, the data can be approximated using a straight line with a positive slope.

Finding the Equation of the Best Fitting Line (Estimated Line)

The equation of the estimated regression line is given by:

ŷ = b0 + b1x

where b0 = y-intercept and b1 = slope. The y-intercept b0 and the slope b1 are determined using the least squares method, via equations (7.6) and (7.7) discussed earlier.

Using the values in Table 7.4, first calculate the values of b1 (the slope) and b0 (the y-intercept) as shown below.

b1 = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²) = 0.0096388

and

b0 = ȳ − b1x̄ = 431.23/30 − (0.0096388)(24,132/30) = 14.374 − 7.754 = 6.62

Therefore, the equation of the estimated line,

ŷ = 6.62 + 0.00964x

The regression equation or the equation of the “best” fitting line can also be written as:

Hours(y) = 6.62 + 0.00964 Units(x)

or simply, ŷ = 6.62 + 0.00964x

where y is the hours and x is the number of units produced. The hat (^) over y means that the line is estimated. Thus, the equation of the line is, in fact, an estimated equation of the best-fitting line. The line is also known as the least squares line, which minimizes the sum of the squares of the errors. This means that when the line is placed over the scatter plot, the sum of the squared vertical distances from the points to the line is minimized. The error is the vertical distance of a point from the estimated line. The error is also known as the residual. Figure 7.7 shows the least squares line and the residuals for each of the points as the vertical distance from the point to the estimated regression line.

Figure 7.7 The least squares line and residuals

[Note: The estimated line is denoted by ŷ and the residual for a point yi is given by (yi − ŷi)]

Recall that the error or residual for a point is given by ei = yi − ŷi, which is the vertical distance of the point from the estimated line. Figure 7.8 shows the fitted regression line over the scatter plot.

Figure 7.8 Fitted line regression plot

Interpretation of the Fitted Regression Line

The estimated least squares line is of the form ŷ = b0 + b1x, where b1 is the slope and b0 is the y-intercept. The equation of the fitted line is

ŷ = 6.62 + 0.00964x

In this equation of the fitted line, 6.62 is the y-intercept and 0.00964 is the slope. This line provides the relationship between the hours and the number of units produced. The equation means that for each unit increase in x (the number of units produced), y (the number of hours) will increase by 0.00964. The value 6.62 represents the portion of the hours that is not affected by the number of units.

Making Predictions Using the Regression Line

The regression equation can be used to predict the number of hours to produce a certain number of units. For example, suppose we want to predict the number of hours (y) required to produce 900 units (x). This can be determined using the equation of the fitted line as:

Hours(y) = 6.62 + 0.00964 Units(x)

Hours(y) = 6.62 + 0.00964 × (900) = 15.296 hours

Thus, it will take approximately 15.3 hours to produce 900 units of the product. Note that making a prediction outside of the range will introduce error in the predicted value. For example, if we want to predict the time for producing 2,000 units; this prediction will be outside of the data range (see the data in Table 7.3, the range of x values is from 445 to 1,125). The value x = 2,000 is far greater than all the other x values in the data. From the scatter plot, a straight line fit with an increasing trend is evident for the data but we should be cautious about assuming that this straight line trend will continue to hold for values as large as x = 2,000. Therefore, it may not be reasonable to make this prediction for values that are far beyond the range of the data values.
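
The range check described above is easy to build into a prediction routine. Below is a minimal sketch (the function name is hypothetical; the coefficients 6.62 and 0.00964 and the data range 445 to 1,125 come from the text):

```python
def predict_hours(units, b0=6.62, b1=0.00964, x_min=445, x_max=1125):
    """Predict hours from units, refusing to extrapolate beyond the data."""
    if not (x_min <= units <= x_max):
        raise ValueError(f"x = {units} is outside the observed range "
                         f"[{x_min}, {x_max}]; the prediction would be unreliable")
    return b0 + b1 * units

print(predict_hours(900))   # about 15.3 hours
# predict_hours(2000) raises an error: far beyond the observed x values
```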

The Standard Error of the Estimate (s)

The standard error of the estimate measures the variation or scatter of the points around the fitted line of regression. This is measured in units of the response or dependent variable (y). The standard error of the estimate is analogous to the standard deviation. The standard deviation measures the variability around the mean, whereas the standard error of the estimate (s) measures the variability around the fitted line of regression. A large value of s indicates larger variation of the points around the fitted line of regression. The standard error of the estimate is calculated using the following formula:

s = √( Σ(yi − ŷi)² / (n − 2) )        (7.7A)

The equation can also be written in terms of b0, b1, and the sums in Table 7.4. Using these values, the standard error of the estimate can be calculated as:

s = √( (Σy² − b0Σy − b1Σxy) / (n − 2) )        (7.8)

Equation (7.7A) measures the average deviation of the points from the fitted line of regression. Equation (7.8) is mathematically equivalent to equation (7.7A) and is computationally more efficient. Thus,

s = 0.4481

A small value of s indicates less scatter of the data points around the fitted line of regression (see Figure 7.8). The value s = 0.4481 indicates that the average deviation is 0.4481 hours (measured in units of dependent variable y).
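
A short sketch of this computation (Python with NumPy assumed; equation (7.7A) applied to the residuals):

```python
import numpy as np

def standard_error(x, y, b0, b1):
    """Standard error of the estimate, s = sqrt(SSE / (n - 2))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    residuals = y - (b0 + b1 * x)   # vertical distances from the fitted line
    return np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))

# For the Units/Hours data of Table 7.3 with b0 = 6.6209 and b1 = 0.0096388,
# this returns approximately 0.4481.
```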

Assessing the Fit of the Simple Regression Model: The Coefficient of Determination (r²) and Its Meaning

The coefficient of determination, r², is an indication of how well the independent variable predicts the dependent variable. In other words, it is used to judge the adequacy of the regression model. The value of r² lies between 0 and 1 (0 ≤ r² ≤ 1), or 0 to 100 percent. The closer the value of r² is to 1 or 100 percent, the better the model, because the r² value indicates the amount of variation in the data explained by the regression model. Figure 7.9 shows the relationship between the explained, unexplained, and total variation.

Figure 7.9 SST = SSR + SSE

In regression, the total sum of squares is partitioned into two components, the regression sum of squares and the error sum of squares, giving the following relationship:

SST = SSR + SSE

SST = total sum of squares for y

SSR = regression sum of squares (measures the variability in y, accounted for by the regression line, also known as explained variation)

SSE = error sum of squares (measures the variation due to the residual or error. This is also known as unexplained variation).

yi = any observed point i; ȳ = average of the y values

From Figure 7.9, the SST and SSE are calculated as

SST = Σ(yi − ȳ)²        (7.9)

SSE = Σ(yi − ŷi)²        (7.10)

Note that we can calculate SSR by calculating SST and SSE since,

SST = SSR + SSE or SSR = SST − SSE

Using the SSR and SST values, the coefficient of determination, r2 is calculated using

r² = SSR / SST

The coefficient of determination, r², is used to measure the goodness of fit for the regression equation. It measures the variation in y explained by the variation in the independent variable x; that is, r² is the ratio of the explained variation to the total variation.

The calculation of r² is explained below. First, calculate SST and SSE using equations (7.9) and (7.10) and the values in Table 7.3.

SST = Σ(yi − ȳ)² ≈ 103.7 and SSE = Σ(yi − ŷi)² ≈ 5.62

Since

SST = SSR + SSE

Therefore,

r² = SSR / SST = (SST − SSE) / SST ≈ (103.7 − 5.6) / 103.7 = 0.946

or, r² = 94.6%

This means that 94.6 percent of the variation in the dependent variable y is explained by the variation in x, and 5.4 percent of the variation is due to unexplained reasons or error.
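
The same partition is easy to verify numerically. A minimal sketch (Python with NumPy assumed; y holds the observed values and y_hat the fitted values):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination via the partition SST = SSR + SSE."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)   # total variation
    sse = np.sum((y - y_hat) ** 2)      # unexplained variation (error)
    ssr = sst - sse                     # explained variation
    return ssr / sst                    # about 0.946 for the Hours/Units data
```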

The Coefficient of Correlation (r) and Its Meaning

The coefficient of correlation, r, can be calculated by taking the square root of r², or,

r = √r²        (7.13)

In this case, r = 0.973, or 97.3 percent, indicates a strong positive correlation between x and y. Note that r is positive if the slope b1 is positive, indicating a positive correlation between x and y. The value of r is between −1 and +1.

−1 ≤ r ≤ +1

The value of r determines the correlation between the x and y variables. The closer the value of r is to −1 or +1, the stronger the correlation between x and y.

The value of the coefficient of correlation r can be positive or negative. The value of r is positive if the slope b1 is positive; it is negative if b1 is negative. If r is positive it indicates a positive correlation, whereas a negative r indicates a negative correlation. The coefficient of correlation r can also be calculated using the following formula:

r = (nΣxy − (Σx)(Σy)) / √( [nΣx² − (Σx)²][nΣy² − (Σy)²] )        (7.15)

Using the values in Table 7.4, we can calculate r from equation (7.15).
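
Equation (7.15) translates directly into code. The sketch below (Python with NumPy assumed) also notes a built-in cross-check:

```python
import numpy as np

def correlation(x, y):
    """Coefficient of correlation r per equation (7.15)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

# Cross-check: np.corrcoef(x, y)[0, 1] gives the same value
# (about 0.973 for the Units/Hours data).
```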

Summary of the Main Features of the Simple Regression Model Discussed Above

The sections above illustrated the least squares method, which is the basis of the regression model. The process of finding the regression equation using the least squares method was demonstrated using the sales and advertising expenditures data. The problem involved predicting the sales, the response or dependent variable (y), using the predictor or independent variable (x), the advertising expenditures. Another example involved the number of hours (y) required to produce a number of products (x). The analysis of this simple regression problem was presented by calculating and interpreting several measures. In particular, the following analyses were performed: (a) constructing a scatterplot of the data, (b) finding the equation of the best-fitting line, (c) interpreting the fitted regression line, and (d) making predictions using the fitted regression equation. Other important measures critical to assessing the quality of the regression model were calculated and explained. These measures include: (a) the standard error of the estimate (s), which measures the variation or scatter of the points around the fitted line of regression; (b) the coefficient of determination (r²), which measures how well the independent variable predicts the dependent variable, or the percent of variation in the dependent variable y explained by the variation in the independent variable x; and (c) the coefficient of correlation (r), which measures the strength of the relationship between x and y.

Regression Analysis Using Computer

This section provides a step-wise computer analysis of the regression model. In the real world, computer software is almost always used to analyze regression problems. A number of software packages are in use today, among which MINITAB, EXCEL, SAS, and SPSS are a few. Here, we have used EXCEL and MINITAB to analyze the regression models. The applications of simple, multiple, and higher-order regressions using EXCEL and MINITAB software are demonstrated in this and subsequent sections. If you perform regression analysis with a substantial amount of data and need more detailed analyses, the use of a statistical package such as MINITAB, SAS, or SPSS is recommended. Besides these, a number of packages, including R, Stata, and others, are readily available and widely used in research and data analysis.

Simple Regression Using EXCEL

The instructions in Table 7.5 will produce the regression output shown in Table 7.6. If you checked the boxes under Residuals and the Line Fit Plots, the residuals and fitted line plot will be displayed.

Table 7.5 EXCEL instructions for regression

  1. Label columns A and B of the EXCEL worksheet with Units (x) and Hours (y) and enter the data of Table 7.3 or, open the EXCEL data file: Hours_Units.xlsx
  2. Click the Data tab on the main menu
  3. Click the Data Analysis tab (on far right)
  4. Select Regression
  5. Select Hours(y) for Input y range and Units(x) for Input x range (including the labels)
  6. Check the Labels box
  7. Click on the circle to the left of Output Range, click on the box next to output range and specify where you want to store the output by clicking a blank cell (or select New Worksheet Ply)
  8. Check the Line Fit Plot under residuals. Click OK
  9. You may check the boxes under residuals and normal probability plot as desired.

Table 7.6 EXCEL regression output

[Table 7.6 shows the EXCEL regression output: the regression statistics, the ANOVA table, and the coefficients, with the y-intercept 6.620904991 and the slope 0.009638772 in the Coefficients column.]

Table 7.6 shows the output with regression statistics. We calculated all of these manually, except the adjusted R-Squared, in the previous sections. The regression equation can be read from the Coefficients column. The regression coefficients are b0 and b1, the y-intercept and the slope. In the Coefficients column, 6.620904991 is the y-intercept and 0.009638772 is the slope. The regression equation from this table is

ŷ = 6.62 + 0.00964x

This is the same equation we obtained earlier using manual calculations.

The Coefficient of Determination (r2) Using EXCEL

The values of SST and SSR were calculated manually in the previous sections. Recall that in regression, the total sum of squares is partitioned into two components, the regression sum of squares (SSR) and the error sum of squares (SSE), giving the following relationship: SST = SSR + SSE. The coefficient of determination r², which is also the measure of goodness of fit for the regression equation, can be calculated using

r² = SSR / SST

The values of SSR, SSE, and SST can be obtained from the ANOVA table, which is part of the regression analysis output of EXCEL. Table 7.7 shows the EXCEL regression output with the SSR and SST values. Using these values, the coefficient of determination is r² = SSR / SST = 0.9458. This value is reported under Regression Statistics in Table 7.7.

Table 7.7 EXCEL regression output

[Table 7.7 shows the EXCEL regression output; the ANOVA section reports the SSR, SSE, and SST values, and R Square = 0.9458 appears under Regression Statistics.]

The t-test and F-test for the significance of regression can be easily performed using the information in the EXCEL computer output under the ANOVA table. Table 7.8 shows the EXCEL regression output with the ANOVA table.

Table 7.8 EXCEL regression output

[Table 7.8 shows the EXCEL regression output with the ANOVA table, including the slope b1, its standard error sb1, the t statistic, and the corresponding p-value.]

(1) Conducting the t-Test Using the Regression Output in Table 7.8.

The test statistic for testing the significance of regression is given by the following equation:

t = b1 / sb1 (with n − 2 degrees of freedom)

The values of b1, sb1, and the test-statistic value t(n−2) are labeled in Table 7.8.

Using the test-statistic value, the hypothesis test for the significance of regression can be conducted. This test is explained here using the computer results. The appropriate hypotheses for the test are:

H0: β1 = 0
H1: β1 ≠ 0

The null hypothesis states that the slope of the regression line is zero. Thus, if the regression is significant, the null hypothesis must be rejected. A convenient way of testing the above hypotheses is to use the p-value approach. The test statistic value t(n−2) and the corresponding p-value are reported in the regression output in Table 7.8. Note that the p-value is very close to zero (p = 2.92278E-19). If we test the hypothesis at a 5 percent level of significance (α = 0.05), then p ≈ 0.000 is less than α = 0.05, and we reject the null hypothesis and conclude that the regression is significant.
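
The same test can be reproduced outside of EXCEL. A sketch using SciPy (its linregress routine reports the two-sided p-value for H0: β1 = 0 directly):

```python
from scipy import stats

# Units (x) and Hours (y) from Table 7.3
units = [932, 951, 531, 766, 814, 914, 899, 535, 554, 445,
         704, 897, 949, 632, 477, 754, 819, 869, 1035, 646,
         1055, 875, 969, 1075, 655, 1125, 960, 815, 555, 925]
hours = [16.20, 16.05, 11.84, 14.21, 14.42, 15.08, 14.45, 11.73, 12.24, 11.12,
         12.63, 14.43, 15.46, 12.64, 11.92, 13.95, 14.33, 15.23, 16.77, 12.41,
         17.00, 15.50, 16.20, 17.50, 12.92, 18.20, 15.10, 14.00, 12.20, 15.50]

res = stats.linregress(units, hours)
print(res.intercept, res.slope)   # about 6.62 and 0.00964
print(res.pvalue)                 # about 2.9e-19, so reject H0: beta1 = 0
```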

Simple Regression Using MINITAB

The regression results using MINITAB are explained in this section. We created a scatter plot, a fitted line plot (a plot with the best-fitting line), and the regression results for the data in Table 7.3. We have already analyzed the corresponding EXCEL results above.

[Note: Readers can download a free 30-day trial copy of MINITAB version 17 or 18 from www.minitab.com]

The scatter plot shown in Figure 7.10 indicates an increasing or direct relationship between the number of units produced (x) and the number of hours (y). Therefore, the data may be approximated by a straight line of the form y = b0 + b1x, where b0 is the y-intercept and b1 is the slope. The fitted line plot with the regression equation from MINITAB is shown in Figure 7.11. Also, the “Regression Analysis” and “Analysis of Variance” tables shown in Table 7.9 will be displayed. We will first analyze the regression and the analysis of variance tables and then provide further analysis.

Figure 7.10 Scatterplot of Hours (y) and Units (x)

Figure 7.11 Fitted line and regression equation

Analysis of Regression Output in Table 7.9

Table 7.9 The regression analysis and analysis of variance tables using MINITAB

[Table 7.9 shows the MINITAB output: the regression equation Hours(y) = 6.62 + 0.00964 Units(x); the coefficient table with Constant 6.6209 and Units(x) 0.0096388; s = 0.4481, R-Sq = 94.6%, and R-Sq(adj) = 94.4%; followed by the analysis of variance table.]

Refer to the Regression Analysis part. In this table, the regression equation is printed as Hours(y) = 6.62 + 0.00964 Units(x). This is the equation of the best-fitting line using the least squares method. Just below the regression equation, a table is printed that describes the model in more detail. Coef stands for coefficients, and the values in this column are the regression coefficients b0 and b1, where b0 is the y-intercept or constant and b1 is the slope of the regression line. Under Predictor, the value for Units (x) is 0.0096388, which is b1 (the slope of the fitted line). The Constant is 6.6209. These values form the regression equation.

Refer to Table 7.9 above

  1. The regression equation or the equation of the “best” fitting line is:

Hours(Y) = 6.62 + 0.00964 Units(X)

or, ŷ = 6.62 + 0.00964x where y is the hours and x is the units produced.

This line minimizes the sum of the squares of the errors. This means that when the line is placed over the scatter plot, the sum of the squared vertical distances from the points to the line is a minimum. The error or residual is the vertical distance of each point from the estimated line. Figure 7.12 shows the least squares line and the residuals. The residual for a point is given by (yi − ŷi), which is the vertical distance of the point from the estimated line.

Figure 7.12 The least squares line and residuals

[Note: The estimated line is denoted by ŷ and the residual for a point yi is given by (yi − ŷi)]

The estimated least squares line is of the form ŷ = b0 + b1x, where b1 is the slope and b0 is the y-intercept. In the regression equation Hours(Y) = 6.62 + 0.00964 Units(X), 6.62 is the y-intercept and 0.00964 is the slope. This line provides the relationship between the hours and the number of units produced. The equation states that for each unit increase in x (the number of units produced), y (the number of hours) will increase by 0.00964.

  2. The Standard Error of the Estimate (s)

The standard error of the estimate measures the variation of the points around the fitted line of regression. This is measured in units of the response or dependent variable (y).

In regression analysis, the standard error of the estimate is reported as s. The value of s is reported in Table 7.9 under “Regression Analysis.” This value is

s = 0.4481

A small value of s indicates less scatter of the points around the fitted line of regression.

  3. The Coefficient of Determination (r²)

The coefficient of determination, r², is an indication of how well the independent variable predicts the dependent variable. In other words, it is used to judge the adequacy of the regression model. The value of r² lies between 0 and 1 (0 ≤ r² ≤ 1), or 0 to 100 percent. The closer the value of r² is to 1 or 100 percent, the better the model. The r² value indicates the amount of variability in the data explained by the regression model. In our example, the r² value is 94.6 percent (Table 7.9, Regression Analysis). The value of r² is reported as:

R-Sq = 94.6%

This means that 94.6 percent of the variation in the dependent variable y can be explained by the variation in x, and 5.4 percent of the variation is due to unexplained reasons or error.

The R-Sq(adj) = 94.4 percent next to the value of r² in the regression output is the adjusted r² value. This is the r² value adjusted for the degrees of freedom. This value has more importance in multiple regression.

Model Adequacy Test

To check whether the fitted regression model is adequate, we first review the assumptions on which regression is based followed by the residual plots that are used to check the model assumptions.

Residuals: A residual or error for any point is the difference between the actual y value and the corresponding estimated value (denoted by y-hat, ŷ). Thus, for a given value of x, the residual is given by e = y − ŷ.

Assumptions of Regression Model and Checking the Assumptions Using MINITAB Residual Plots

The regression analysis is based on the following assumptions:

(1) Independence of errors, (2) Normality, (3) Assumption regarding E(y): the expected values of y fall on the straight line described by the model E(y) = β0 + β1x, (4) Equal variance, and (5) Linearity.

The assumption regarding the independence of errors can be evaluated by plotting the errors or residuals in the order or sequence in which the data were collected. If the errors are not independent, a relationship exists between consecutive residuals, which is a violation of the assumption of independence of errors. When the errors are not independent, the plot of residuals versus the time (or the order) in which the data were collected will show a cyclical pattern. Meeting this assumption is particularly important when data are collected over a period of time. If the data are collected at different time periods, the errors for a specific time period may be correlated with those of the previous time periods.

The assumption that the errors are normally distributed or the normality assumption requires that the errors have a normal or approximately normal distribution. Note that this assumption means that the errors do not deviate too much from normality. The assumption can be verified by plotting the histogram or the normal probability plot of errors.

The assumption that the variances of the errors are equal (equality of variance) is also known as homoscedasticity. This requires that the variance of the errors be constant for all values of x, or that the variability of the y values be the same for both the low and high values of x. The equality of variance assumption is of particular importance for making inferences about b0 and b1.

The linearity assumption means that the relationship between the variables is linear. This assumption can be verified using residual plot to be discussed in the next section.

To check the validity of the above regression assumptions, a graphical approach known as the residual analysis is used. The residual analysis is also used to determine whether the selected regression model is an appropriate model.

Checking the Assumptions of Regression Using MINITAB Residual Plots

Several residual plots can be created using EXCEL and MINITAB to check the adequacy of the regression model. The plots are shown in Figure 7.13a through 7.13d.

Figure 7.13 Plots for residual analysis

The plots to check the regression assumptions include the histogram of residuals, normal plot of residuals, plot of the residuals vs. fits, and residuals vs. order of data. The residuals can also be plotted with each of the independent variables.

Figures 7.13a and 7.13b are used to check the normality assumption. The regression model assumes that the errors are normally distributed with mean zero. Figure 7.13a shows the normal probability plot. This plot is used to check the normality assumption of the regression model. In this plot, if the plotted points lie on or close to a straight line, then the residuals or errors are normally distributed. The pattern of points appears to fall on a straight line, indicating no violation of the normality assumption.

Figure 7.13b shows the histogram of residuals. If the normality assumption holds, the histogram of residuals should look symmetrical or approximately symmetrical. Also, the histogram should be centered at zero because the sum of the residuals is always zero. The histogram of residuals is approximately symmetrical which indicates that the errors appear to be approximately normally distributed. Note that the histogram may not be exactly symmetrical. We would like to see a pattern that is symmetrical or approximately symmetrical.

In Figure 7.13c, the residuals are plotted against the fitted values. This plot is used to check the linearity assumption. The points in this plot should be scattered randomly around the horizontal line drawn through the zero residual value for the linear model to be valid. As can be seen, the residuals are randomly scattered about the horizontal line, indicating that the relationship between x and y is linear.

The plot of residuals vs. the order of the data, shown in Figure 7.13d, is used to check the independence of errors.

The independence of errors can be checked by plotting the errors or the residuals in the order or sequence in which the data were collected. The plot of residuals vs. the order of data should show no pattern or apparent relationship between the consecutive residuals. This plot shows no apparent pattern indicating that the assumption of independence of errors is not violated.

Note that checking the independence of errors is more important in the case where the data were collected over time. Data collected over time sometimes may show an autocorrelation effect among successive data values. In these cases, there may be a relationship between consecutive residuals that violates the assumption of independence of errors.

The equality of variance assumption requires that the variance of the errors be constant for all values of x, or that the variability of y be the same for both the low and high values of x. This can be checked by plotting the residuals against the order of the data points. This plot is shown in Figure 7.13d. If the equality of variance assumption is violated, this plot will show an increasing trend, indicating increasing variability. This demonstrates a lack of homogeneity in the variances of the y values at each level of x. The plot shows no violation of the equality of variance assumption.
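
These four diagnostics can also be generated outside of MINITAB. A sketch using Matplotlib and SciPy (assuming residuals and fitted are NumPy arrays already computed from the fitted model):

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
stats.probplot(residuals, dist="norm", plot=ax[0, 0])   # (a) normal plot
ax[0, 1].hist(residuals, bins=10)                       # (b) histogram
ax[1, 0].scatter(fitted, residuals)                     # (c) residuals vs fits
ax[1, 0].axhline(0, color="gray")
ax[1, 1].plot(residuals, marker="o")                    # (d) residuals vs order
ax[1, 1].axhline(0, color="gray")
plt.show()
```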

Multiple Regression: Computer Analysis and Results

Introduction to Multiple Regression

In the previous sections we explored the relationship between two variables using simple regression and correlation analysis. We demonstrated how the estimated regression equation can be used to predict a dependent variable (y) using an independent variable (x). We also discussed the correlation between two variables, which explains the degree of association between them. Here, we expand the concept of simple linear regression to multiple regression analysis. A multiple linear regression involves one dependent or response variable and two or more independent variables or predictors. The concepts of simple regression discussed earlier also apply to multiple regression.

Multiple Regression Model

The mathematical form of the multiple linear regression model relating the dependent variable y and two or more independent variables x1, x2, ..., xk with the associated error term is given by:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε        (7.16)

where x1, x2, ..., xk are the k independent or explanatory variables; β0, β1, ..., βk are the regression coefficients; and ε is the associated error term. Equation (7.16) can be viewed as a population multiple regression model in which y is a linear function of the unknown parameters β0, β1, ..., βk and an error term. The error ε explains the variability in y that cannot be explained by the linear effects of the independent variables. The multiple regression model is similar to the simple regression model except that multiple regression involves more than one independent variable.

One of the basic assumptions of regression analysis is that the mean or expected value of the error is zero. This implies that the mean or expected value of y, or E(y), in the multiple regression model is given by:

E(y) = β0 + β1x1 + β2x2 + ... + βkxk        (7.17)

The above equation relating the mean value of y and the k independent variables is known as the multiple regression equation.

It is important to note that β0, β1, ..., βk are the unknown population parameters, or regression coefficients, and they must be estimated using the sample data to obtain the estimated equation of multiple regression. The estimated regression coefficients are denoted by b0, b1, ..., bk. These are the point estimates of the parameters β0, β1, ..., βk. The estimated multiple regression equation using the estimates of the unknown population regression coefficients can be written as:

ŷ = b0 + b1x1 + b2x2 + ... + bkxk        (7.18)

where ŷ = point estimator of E(y), or the estimated value of the response y; b0, b1, ..., bk are the estimated regression coefficients and are the estimates of β0, β1, ..., βk.

Equation (7.18) is the estimated multiple regression equation and can be viewed as the sample regression model. The regression equation with the sample regression coefficients is written as in equation (7.18). This equation defines the regression equation for k independent variables.

In equation (7.16), β0, β1, ..., βk denote the regression coefficients for the population. The sample regression coefficients b0, b1, ..., bk are the estimates of these population parameters and can be determined using the least squares method.

In a multiple linear regression, the variation in y (the response variable) may be explained using two or more independent variables or predictors. The objective is to predict the dependent variable. Compared to simple linear regression, a more precise prediction can be made because we use two or more independent variables. By using two or more independent variables, we are often able to make use of more information in the model. The simplest form of a multiple linear regression model involves two independent variables and can be written as:

y = β0 + β1x1 + β2x2 + ε        (7.19)

Equation (7.19) describes a plane. In this equation, β0 is the y-intercept of the regression plane. The parameter β1 indicates the average change in y for each unit change in x1 when x2 is held constant. Similarly, β2 indicates the average change in y for each unit change in x2 when x1 is held constant. When we have more than two independent variables, the regression equation of the form described by equation (7.18) is the equation of a hyperplane in a higher-dimensional space.

The Least Squares Multiple Regression Model

The regression model is described in the form of a regression equation that is obtained using the least squares method. Recall that in simple regression, the least squares method requires fitting a line through the data points so that the sum of the squares of the errors or residuals is minimized. These errors or residuals are the vertical distances of the points from the fitted line. The same concept from simple regression is used to develop the multiple regression equation.

In multiple regression, the least squares method determines the best-fitting plane or hyperplane through the data points: the one for which the sum of the squares of the vertical distances or deviations between the given points and the plane is a minimum.

Figure 7.14 shows a multiple regression model with two independent variables. The response y with two independent variables x1 and x2 forms a regression plane. The observed data points in the figure are shown using dots. The stars on the regression plane indicate the corresponding points that have identical values of x1 and x2. The vertical distances from the observed points to the points on the plane are shown using vertical lines. These vertical lines are the errors. The error for a particular point yi is denoted by ei = yi − ŷi, where the estimated value ŷ is calculated using the regression equation ŷ = b0 + b1x1 + b2x2 for the given values of x1 and x2.

Figure 7.14 Scatter plot and regression plane with two independent variables

The least squares criterion requires that the sum of the squares of the errors be minimized, or,

min Σ(y − ŷ)²

where y is the observed value and ŷ is the estimated value of the dependent variable given by ŷ = b0 + b1x1 + b2x2 + ... + bkxk.

[Note: The terms independent, or explanatory variables, and the predictors have the same meaning and are used interchangeably in this chapter. The dependent variable is often referred to as the response variable in multiple regression.]

Similar to simple regression, the least squares method uses the sample data to estimate the regression coefficients b0, b1, ..., bk and hence the estimated equation of multiple regression. Figure 7.15 shows the process of estimating the regression coefficients and the multiple regression equation.

Figure 7.15 Process of estimating the multiple regression equation
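
In practice these estimates come from solving the least squares problem numerically. A minimal sketch with NumPy (the function name is illustrative; the design matrix carries a leading column of ones for the intercept):

```python
import numpy as np

def fit_multiple(y, *xs):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors."""
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (intercept) plus one column per predictor
    X = np.column_stack([np.ones(len(y))] +
                        [np.asarray(x, dtype=float) for x in xs])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs
```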

Models with Two Quantitative Independent Variables x1 and x2

The model with two quantitative independent variables is the simplest multiple regression model. It is a first-order model and is written as:

ŷ = b0 + b1x1 + b2x2

where b0 = y-intercept, the value of y when x1 = x2 = 0

b1 = change in y for a 1-unit increase in x1 when x2 is constant

b2 = change in y for a 1-unit increase in x2 when x1 is constant

The graph of the first order model is shown in Figure 7.16. This graph with two independent quantitative variables x1 and x2 plots a plane in a three-dimensional space. The plane plots the value of y for every combination (x1, x2). This corresponds to the points in the (x1, x2) plane.

Figure 7.16 A multiple regression model with two quantitative variables

The first-order model with two quantitative variables x1 and x2 is based on the assumption that there is no interaction between x1 and x2. This means that the effect on the response y of a change in x1 (for a fixed value of x2) is the same regardless of the value of x2, and the effect on y of a change in x2 (for a fixed value of x1) is the same regardless of the value of x1.

For simple regression analysis in the previous sections, we presented both the manual calculations and the computer analysis of the problem. Most of the concepts we discussed for simple regression also apply to multiple regression; however, the computations for multiple regression are more involved and require the use of matrix algebra and other mathematical concepts that are beyond the scope of this text. Therefore, we provide computer analysis of the multiple linear regression models using EXCEL and MINITAB. This section provides examples with computer instructions and analysis of the computer results. The assumptions and the interpretation of the multiple linear regression model are similar to those of the simple linear regression model. As we provide the analysis, we will point out the similarities and the differences between the simple and multiple regression models.

Assumptions of Multiple Regression Model

As discussed earlier, the relationship of the response variable (y) to the independent variables x1, x2, ..., xk in multiple regression is assumed to be a model of the form y = β0 + β1x1 + β2x2 + ... + βkxk + ε, where β0, β1, ..., βk are the regression coefficients and ε is the associated error term. The multiple regression model is based on the following assumptions about the error term ε.

  1. The independence of errors assumption. The independence of errors means that the errors are independent of each other. That is, the error for a set of values of the independent variables is not related to the error for any other set of values of the independent variables. This assumption is critical when the data are collected over different time periods. When the data are collected over time, the errors in one time period may be correlated with those in another time period.
  2. The normality assumption. This means that the errors or residuals (εi), calculated using ei = yi − ŷi, are normally distributed. The normality assumption in regression is fairly robust against departures from normality. Unless the distribution of errors is extremely different from normal, the inferences about the regression parameters β0, β1, ..., βk are not affected seriously.
  3. The error assumption. The error ε is a random variable with mean or expected value of zero, that is, E(ε) = 0. This implies that the mean value of the dependent variable y, for given values of the independent variables, is the expected or mean value of y, denoted by E(y), and the population regression model can be written as:

E(y) = β0 + β1x1 + β2x2 + ... + βkxk

  4. Equality of variance assumption. This assumption requires that the variance of the errors (εi), denoted by σ², be constant for all values of the independent variables x1, x2, ..., xk. In case of serious departure from the equality of variance assumption, methods such as weighted least squares or data transformation may be used.

[Note: The terms error and residual have the same meaning and these terms are used interchangeably in this chapter.]

Computer Analysis of Multiple Regression

In this section we provide a computer analysis of multiple regression. Due to the complexity involved in the computation, computer software is always used to model and solve regression problems. We discuss the steps using MINITAB and EXCEL.

Problem Description: The home heating cost is believed to be related to the average outside temperature, the size of the house, and the age of the heating furnace. A multiple regression model is to be fitted to investigate the relationship between the heating cost and the three predictors or independent variables. The data in Table 7.10 show the home heating cost (y), average temperature (x1), house size (x2) in thousands of square feet, and the age of the furnace (x3) in years. The home heating cost is the response variable and the other three variables are predictors. The data for this problem (MINITAB data file: HEAT_COST.MTW; EXCEL data file: HEAT_COST.xlsx) are listed in Table 7.10 below.

Table 7.10 Data for home heating cost

Row   Avg Temp   House Size   Age of Furnace   Heating Cost
1     37         3.0          6                210
2     30         4.0          9                365
3     37         2.5          4                182
4     61         1.0          3                 65
5     66         2.0          5                 82
6     39         3.5          4                205
7     15         4.1          6                360
8      8         3.8          9                295
9     22         2.9          10               235
10    56         2.2          4                125
11    55         2.0          3                 78
12    40         3.8          4                162
13    21         4.5          12               405
14    40         5.0          6                325
15    61         1.8          5                 82
16    21         4.2          7                277
17    63         2.3          2                 99
18    41         3.0          10               195
19    28         4.2          7                240
20    31         3.0          4                144
21    33         3.2          4                265
22    31         4.2          11               355
23    36         2.8          3                175
24    56         1.2          4                 57
25    35         2.3          8                196
26    36         3.6          6                215
27     9         4.3          8                380
28    10         4.0          11               300
29    21         3.0          9                240
30    51         2.5          7                130

Constructing Scatter Plots and Matrix Plots

We begin our analysis by constructing scatter plots and matrix plots of the data. These plots provide useful information about the model. We first construct scatterplots of the response (y) versus each of the independent or predictor variables (Figure 7.17). If the scatterplots of y on the independent variables appear to be linear enough, a multiple regression model can be fitted. Based on the analysis of the scatter plots of y and each of the independent variables, an appropriate model (for example, a first-order model) can be recommended to predict the home heating cost.

image

Figure 7.17 Matrix plot of each y vs. each x

A first-order multiple regression model does not include any higher-order terms (e.g., x²). An example of a first-order model with five independent variables can be written as:

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε

The multiple linear regression model is based on the assumption that the relationship between the response and the independent variables is linear. This relationship can be checked using a matrix plot. The matrix plot is used to investigate the relationships between pairs of variables by creating an array of scatterplots. MINITAB provides two options for constructing the matrix plot: Matrix of Plots and Each Y versus each X. The first of these plots is used to investigate the relationships among pairs of variables when several independent variables are involved. The other plot (each y versus each x) produces separate plots of the response y and each of the explanatory or independent variables.

Recall that in a simple regression, a scatter plot was constructed to investigate the relationship between the response y and the predictor. A matrix plot should be constructed when two or more independent variables are investigated. To investigate the relationships between the response and each of the independent or explanatory variables before fitting a multiple regression model, a matrix plot may prove to be very useful. The plot allows graphically visualizing the possible relationship between response and independent variables. The plot is also very helpful in investigating and verifying the linearity assumption of multiple regression and to determine which explanatory variables are good predictors of y. For this example, we have constructed matrix plots using MINITAB.

Figure 7.17 shows such a matrix plot (each y versus each x). In this plot, the response variable y is plotted against each of the independent variables. The plot shows scatterplots for heating cost (y) versus each of the independent variables: average temperature, house size, and age of the furnace. An investigation of the plot shows an inverse relationship between the heating cost and the average temperature (the heating cost decreases as the temperature rises) and a positive relationship between the heating cost and each of the other two variables: house size and age of the furnace. The heating cost increases with increasing house size and also with an older furnace. None of these plots shows a bending (nonlinear or curvilinear) pattern between the response and the explanatory variables; the presence of bending patterns in these plots would suggest transformation of variables. The scatterplots in Figure 7.17 (also known as side-by-side scatter plots) show a linear relationship between the response and each of the explanatory variables, indicating that all three explanatory variables could be good predictors of the home heating cost. In this case, a multiple linear regression would be an adequate model for predicting the heating cost.
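Although the book's plots are produced with MINITAB, the same two views can be sketched in Python. The following is a minimal sketch, assuming the Table 7.10 data have been exported to a CSV file; the file name HEAT_COST.csv and the column names are illustrative assumptions, not part of the book's data files.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of the HEAT_COST data; column names are assumed.
df = pd.read_csv("HEAT_COST.csv")  # columns: AvgTemp, HouseSize, FurnaceAge, HeatCost

# "Each y versus each x": one scatterplot of the response against each predictor.
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, x in zip(axes, ["AvgTemp", "HouseSize", "FurnaceAge"]):
    ax.scatter(df[x], df["HeatCost"])
    ax.set_xlabel(x)
axes[0].set_ylabel("HeatCost")

# "Matrix of plots": scatterplots for every pair of variables, including
# the pairs of predictors, useful for spotting interrelationships.
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()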

Matrix of Plots: Simple

Another variation of the matrix plot is known as a "matrix of plots" in MINITAB and is shown in Figure 7.18. This plot provides scatterplots that are helpful in visualizing not only the relationship of the response variable with each of the independent variables but also the interaction effects between the variables. This plot can be used when a more detailed model beyond a first-order model is of interest. Note that the first-order model is the one that contains only the first-order terms, with no square or interaction terms, and is written as y = β0 + β1x1 + β2x2 + … + βkxk + ε

image

Figure 7.18 Matrix plot

The matrix plot in Figure 7.18 is a table of scatterplots with each cell showing a scatterplot of the variable that is labeled for the column versus the variable labeled for the row. The cell in the first row and first column displays the scatterplot of heating cost (y) versus average temperature (x1). The plot in the second row and first column is the scatterplot of heating cost (y) and the house size (x2) and the plot in the third row and the first column shows the scatterplot of heating cost (y) and the age of the furnace (x3).

The second column and the second row of the matrix plot shows a scatterplot displaying the relationship between average temperature (x1) and the house size (x2). The scatterplots showing the relationship between the pairs of independent variables are obtained from columns 2 and 3 of the matrix plot. The matrix plot is helpful in visualizing the interaction relationships. For fitting the first order model, a plot of y versus each x is adequate.

The matrix plots in Figures 7.17 and 7.18 show a negative association or relationship between the heating cost (y) and the average temperature (x1) and a positive association or relationship between the heating cost (y) and the other two explanatory variables: house size (x2) and the age of the furnace (x3). All these relationships are linear indicating that all the three explanatory variables can be used to build a multiple regression model. Constructing the matrix plot and investigating the relationships between the variables can be very helpful in building a correct regression model.

Multiple Linear Regression Model

Since a first order model can be used adequately to predict the home heating cost, we will fit a multiple linear regression model of the form

y = β0 + β1x1 + β2x2 + β3x3 + ε

where,

y = Home heating cost (dollars), x1 = Average temperature (in °F)

x2 = Size of the house (in thousands of square feet), x3 = Age of the furnace (in years)

Table 7.10 and the data file HEAT_COST.MTW show the data for this problem. We used MINITAB to run the regression model for this problem.
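For readers working outside MINITAB or EXCEL, a minimal Python sketch of the same first-order fit follows; it assumes the DataFrame df from the plotting sketch above, with the same illustrative column names.

import statsmodels.formula.api as smf

# Fit y = b0 + b1*x1 + b2*x2 + b3*x3 by ordinary least squares.
model = smf.ols("HeatCost ~ AvgTemp + HouseSize + FurnaceAge", data=df).fit()
print(model.summary())  # coefficients, standard errors, t-tests, F-test, R-sq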

Table 7.11 shows the results of running the multiple regression problem using MINITAB. In this table, we have marked some of the calculations (e.g., b0, b1, sbo, sb1, etc.) for clarity and explanation. These are not part of the computer output. The regression computer output has two parts: Regression Analysis and Analysis of Variance.

Table 7.11 MINITAB regression analysis results

image

The Regression Equation

Refer to the “Regression Analysis” part of Table 7.11 for analysis. Since there are three independent or explanatory variables, the regression equation is of the form:

ŷ = b0 + b1x1 + b2x2 + b3x3    (7.22)

The regression equation from the computer output is

image

where y is the response variable (heating cost); x1, x2, x3 are the independent variables as described above; and the regression coefficients b0, b1, b2, b3 are stored under the column Coef. In the regression equation these coefficients appear in rounded form.

The regression equation which can be stated in the form of equation (7.22) or (7.23) is the estimated regression equation relating the heating cost to all the three independent variables.

Interpreting the Regression Equation

Equation (7.22) or (7.23) can be interpreted in the following way:

  • b1 = −1.65 means that for each unit increase in the average temperature (x1), the heating cost y (in dollars) can be predicted to go down by 1.65 (or, $1.65) when the house size (x2), and the age of the furnace (x3) are held constant.
  • b2 = +57.5 means that for each unit increase in the house size (x2 in thousands of square feet), the heating cost, y (in dollars) can be predicted to go up by 57.5 when the average temperature (x1) and the age of the furnace (x3) are held constant.
  • b3 = + 7.91 means that for each unit increase in the age of the furnace (x3 in years), the heating cost y can be predicted to go up by $7.91 when the average temperature (x1) and the house size (x2) are held constant.

Standard Error of the Estimate(s) and Its Meaning

The standard error of the estimate or the standard deviation of the model s is a measure of scatter or the measure of variation of the points around the regression hyperplane. A small value of s is desirable for a good regression model. The estimation of y is more accurate for smaller values of s. The value of the standard error of estimate is reported in the regression analysis (see Table 7.11). This value is measured in terms of the response variable (y). For our example, the standard error of the estimate,

s = 37.32 dollars

The standard error of the estimate is used to check the utility of the model and to provide a measure of reliability of the prediction made from the model. One interpretation of s is that the interval ±2s will provide an approximation to the accuracy with which the regression model will predict the future value of the response y for given values of the independent variables. Thus, for our example, we can expect the model to provide predictions of heating cost (y) to be within

±2s = ±2(37.32) = ±74.64 dollars
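The same quantity can be computed directly from the residuals. A minimal sketch, continuing the Python fit above:

import numpy as np

n, k = len(df), 3                   # 30 observations, 3 predictors
sse = np.sum(model.resid ** 2)      # SSE: unexplained variation
s = np.sqrt(sse / (n - k - 1))      # standard error of the estimate
print(s, 2 * s)                     # s is about 37.32; +/- 2s is about 74.64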

The Coefficient of Multiple Determination (r2)

The coefficient of multiple determination is often used to check the adequacy of the regression model. The value of r² lies between 0 and 1, or 0 percent and 100 percent; that is, 0 ≤ r² ≤ 1. It indicates the fraction of the total variation of the dependent variable y that is explained by the independent variables or predictors. Usually, the closer the value of r² is to 1 or 100 percent, the stronger the model. However, one should be careful in drawing conclusions based solely on the value of r². A large value of r² does not necessarily mean that the model provides a good fit to the data. In multiple regression, the addition of a new variable to the model always increases the value of r², even if the added variable is not statistically significant. Thus, adding a new variable will increase r², indicating a stronger model, but may lead to poor predictions of new values. The value of r² can be calculated using the expression

r² = SSR/SST = 1 − SSE/SST

In the above equation, SSE is the sum of squares of errors (unexplained variation, or error), SST is the total sum of squares, and SSR is the sum of squares due to regression (explained variation). These values can be read from the "Analysis of Variance" part of Table 7.11, and the resulting value of r² is reported in the "Regression Analysis" part of Table 7.11. For our example, the coefficient of multiple determination r² (reported as R-sq) is

r2 = 88.0%

This means that 88.0 percent of the variability in y is explained by the three independent variables used in the model. Note that r2 = 0 implies a complete lack of fit of the model to the data; whereas, r2 = 1 implies a perfect fit.

The value of r2 = 88.0% for our example implies that using the three independent variables; average temperature, size of the house, and the age of the furnace in the model, 88.0 percent of the total variation in heating cost (y) can be explained. The statistic r2 tells how well the model fits the data, and thus, provides the overall predictive usefulness of the model.

The value of adjusted R² is also used in comparing two regression models that have the same response variable but different numbers of independent variables or predictors.
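A short sketch of both measures, continuing the Python example above; statsmodels reports the same values directly:

import numpy as np

n, k = len(df), 3
sse = np.sum(model.resid ** 2)                               # unexplained variation
sst = np.sum((df["HeatCost"] - df["HeatCost"].mean()) ** 2)  # total variation
r2 = 1 - sse / sst                                           # = SSR/SST
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))           # penalizes extra predictors
print(r2, r2_adj)
print(model.rsquared, model.rsquared_adj)                    # same values from statsmodels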

Hypothesis Tests in Multiple Regression

In multiple regression, two types of hypothesis tests are conducted to measure the model adequacy. These are

  1. Hypothesis test for the overall usefulness, or significance, of regression
  2. Hypothesis tests on the individual regression coefficients

The test for overall significance of regression can be conducted using the information in the “Analysis of Variance” part of Table 7.11. The information contained in the “Regression Analysis” part of this table is used to conduct the tests on the individual regression coefficients using the “T ” or “p” column. These tests are explained below.

Testing the Overall Significance of Regression

Recall that in simple regression analysis, we conducted the test for significance using a t-test and an F-test. Both of these tests in simple regression provide the same conclusion: if the null hypothesis is rejected, we conclude that the slope is not zero, that is, β1 ≠ 0. In multiple regression, the t-test and the F-test have somewhat different interpretations. These tests have the following objectives:

  1. The F-test in a multiple regression is used to test the overall significance of the regression. This test is conducted to determine whether a significant relationship exists between the response variable y and the set of independent variables, or predictors, x1, x2, …, xk.
  2. If the conclusion of the F-test indicates that the regression is significant overall, then a separate t-test is conducted for each of the independent variables to determine whether each of the independent variables is significant.

Both the F-test and t-test are explained below.

F-Test

The null and alternate hypotheses for the multiple regression model are:

H0: β1 = β2 = … = βk = 0
H1: At least one of the βj is not zero

If the null hypothesis H0 is rejected, we conclude that at least one of the independent variables x1, x2, …, xk contributes significantly to the prediction of the response variable y. If H0 is not rejected, there is no evidence that any of the independent variables contributes to the prediction of y. The test statistic for testing this hypothesis uses an F-statistic and is given by

F = MSR/MSE    (7.25)

where MSR = mean squares due to regression, or explained variability, and MSE = mean square error, or unexplained variability. In equation (7.25), the larger the explained variation of the total variability, the larger is the F-statistic. The values of MSR, MSE, and the F statistic are calculated in the “Analysis of Variance” table of the multiple regression computer output (see Table 7.12 below).

Table 7.12 Analysis of variance table

image

The critical value for the test is given by Fα,k,n−k−1 where k is the number of independent variables, n is the number of observations in the model, and α is the level of significance. Note that k and (n − k − 1) are the degrees of freedom associated with MSR and MSE, respectively. The null hypothesis is rejected if F > Fα,k,n−k−1, where F is the calculated F value, or the test statistic value, in the Analysis of Variance table.

Test the Overall Significance of Regression for the Example Problem at a 5 Percent Level of Significance

Step 1: State the Null and Alternate Hypotheses

For the overall significance of regression, the null and alternate hypotheses are:

H0: β1 = β2 = β3 = 0
H1: At least one of the β values is not zero    (7.26)

Step 2: Specify the Test Statistic to Test the Hypothesis

The test statistic is given by

F = MSR/MSE

The value of the F statistic is obtained from the "Analysis of Variance" (ANOVA) table of the computer output. We have reproduced the Analysis of Variance part of the table (Table 7.12). In this table the labels k, [n − (k + 1)], SSR, SSE, etc. are added for explanation purposes. They are not part of the computer results.

In the ANOVA table below, the first column refers to the sources of variation, DF = the degrees of freedom, SS = the sum of squares, MS = mean squares, F = the F statistic, and p is the probability or p-value associated with the calculated F statistic.

The degrees of freedom (DF) for Regression and Error are k and n − (k + 1) respectively where, k is the number of independent variables (k = 3 for our example) and n is the number of observations (n = 30). Also, the total sum of squares (SST) is partitioned into two parts: sum of squares due to regression (SSR) and the sum of squares due to error (SSE) having the following relationship.

SST = SSR + SSE

We have labeled SST, SSR, and SSE values in Table 7.12. The mean square due to regression (MSR) and the mean squares due to error (MSE) are calculated using the following relationships:

MSR = SSR/k and MSE = SSE/(n − k − 1)

The F-test statistic is calculated as F = MSR/MSE.
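These quantities can be verified numerically. A minimal sketch, continuing the Python example above:

import numpy as np

n, k = len(df), 3
sse = np.sum(model.resid ** 2)                               # SSE
sst = np.sum((df["HeatCost"] - df["HeatCost"].mean()) ** 2)  # SST
ssr = sst - sse                                              # SSR, since SST = SSR + SSE
msr, mse = ssr / k, sse / (n - k - 1)                        # MSR and MSE
print(msr / mse)   # the F statistic of the Analysis of Variance table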

Step 3: Determine the Value of the Test Statistic

The test statistic value or the F statistic from the ANOVA table (see Table 7.12) is

F = 63.62

Step 4: Specify the Critical Value

The critical value is given by

Fcritical = F0.05,3,26 = 2.98 (from the F-table)

Step 5: Specify the Decision Rule

Reject H0 if F-statistic > Fcritical

Step 6: Reach a Decision and State Your Conclusion

The calculated F statistic value is 63.62. Since F = 63.62 > Fcritical = 2.98, we reject the null hypothesis stated in equation (7.26) and conclude that the regression is significant overall. This indicates that there exists a significant relationship between the dependent and independent variables.

Alternate Method of Testing the above Hypothesis

The hypothesis stated using equation (7.26) can also be tested using the p-value approach. The decision rule using the p-value approach is given by

If p ≥ α, do not reject H0

If p < α, reject H0

From Table 7.12, the calculated p value is 0.000 (see the P column). Since p = 0.000 < α = 0.05, we reject the null hypothesis H0 and conclude that the regression is significant overall.
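Both the critical value and the p-value can be obtained from the F distribution in scipy; with k = 3 and n = 30 the degrees of freedom are 3 and 26. A minimal sketch:

from scipy import stats

f_crit = stats.f.ppf(1 - 0.05, 3, 26)   # upper 5 percent point, about 2.98
p_value = stats.f.sf(63.62, 3, 26)      # P(F > 63.62), essentially 0
print(f_crit, p_value)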

Hypothesis Tests on Individual Regression Coefficients

t-tests

If the F-test shows that the regression is significant, a t-test on the individual regression coefficients is conducted to determine whether a particular independent variable is significant. We are often interested in determining which of the independent variables contributes to the prediction of y. The hypothesis test described here can be used for this purpose.

To determine which of the independent variables contributes to the prediction of the dependent variable y, the following hypothesis test can be conducted:

H0: βj = 0
H1: βj ≠ 0

This hypothesis tests an individual regression coefficient. If the null hypothesis H0 is rejected, it indicates that the independent variable xj is significant and contributes to the prediction of y. On the other hand, if the null hypothesis H0 is not rejected, then xj is not a significant variable and can be deleted from the model or investigated further. The test is repeated for each of the independent variables in the model.

This hypothesis test also helps to determine if the model can be made more effective by deleting certain independent variables, or by adding extra variables. The information to conduct the hypothesis test for each of the independent variables is contained in the “Regression Analysis” part of the computer output which is reproduced in Table 7.13 below. The columns labeled T and p are used to test the hypotheses. Since there are three independent variables, we will test to determine whether each of the three variables is a significant variable; that is, if each of the independent variables contributes in the prediction of y. The hypothesis to be tested and the test procedure are explained below. We will use a significance level of α = 0.05 for testing each of the independent variables.

Table 7.13 MINITAB regression analysis results

image

Test the Hypothesis That Each of the Three Independent Variables Is Significant at a 5 Percent Level of Significance

Test for the significance of x1 or Average Temperature

Step 1: State the null and alternate hypotheses. The null and alternate hypotheses are:

H0: β1 = 0 (x1 is not significant, or x1 does not contribute to the prediction of y)

H1: β1 ≠ 0 (x1 is significant, or x1 does contribute to the prediction of y)    (7.29)

Step 2: Specify the test statistic to test the hypothesis.

The test statistic is given by

t = b1/sb1

where, b1 is the estimate of slope β1 and sb1 is the estimated standard deviation of b1.

Step 3: Determine the value of the test statistic

The values b1, sb1 and t are all reported in the Regression Analysis part of Table 7.13. From this table, these values for the variable x1 or the average temperature (Avg. Temp.) are

image

and the test statistic value is

t = b1/sb1 = −2.36

This value is reported under the T column.

Step 4: Specify the critical value

The critical values for the test are given by

±tα/2,[n−(k+1)]

which is the t-value from the t-table for [n − (k + 1)] degrees of freedom and α/2, where n is the number of observations (n = 30), k is the number of independent variables (k = 3), and α is the level of significance (0.05 in this case). Thus,

tcritical = ±t0.025,26 = ±2.056

Step 5: Specify the decision rule: The decision rule for the test:

Reject H0 if t > +2.056

or, if t < −2.056

Step 6: Reach a decision and state your conclusion

The test statistic value (T value) for the variable average temperature (x1) from Table 7.13 is −2.36. Since t = −2.36 < tcritical = −2.056,

we reject the null hypothesis H0 (stated in equation 7.29) and conclude that the variable average temperature (x1) is a significant variable and does contribute to the prediction of y.

The significance of the other independent variables can be tested in the same way. The test statistic or t values for all the independent variables are reported in Table 7.13 under the T column. The critical values for testing each independent variable are the same as in the test for the first independent variable above. Thus, the critical values for testing the other independent variables are

±t0.025,26 = ±2.056
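The same t-test can be reproduced in Python from the fitted model above; AvgTemp is the illustrative column name used earlier. A minimal sketch:

from scipy import stats

b1 = model.params["AvgTemp"]                 # estimated coefficient b1
sb1 = model.bse["AvgTemp"]                   # its estimated standard deviation
t_stat = b1 / sb1                            # about -2.36
t_crit = stats.t.ppf(1 - 0.025, n - k - 1)   # 2.056 for 26 degrees of freedom
print(t_stat, t_crit)                        # reject H0 if |t_stat| > t_crit
print(model.pvalues["AvgTemp"])              # the same test via its p-value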

Alternate Way of Testing the above Hypothesis

The hypothesis stated using equation (7.29) can also be tested using the p-value approach. The decision rule for the p-value approach is given by

If p ≥ α, do not reject H0

If p < α, reject H0

From Table 7.14, the p-value for the variable average temperature (Avg. Temp., x1) is 0.026. Since, p = 0.026 < α = 0.05, we reject H0 and conclude that the variable average temperature (x1) is a significant variable.

Table 7.14 Summary table

Independent Variable    p-value from Table 7.13    Compare p to α    Decision    Significant? Yes or No
Av. Temp. (x1)          0.026                      p < α             Reject H0   Yes
House Size (x2)         0.000                      p < α             Reject H0   Yes
Age of Furnace (x3)     0.024                      p < α             Reject H0   Yes

Test for the other independent variables
The other two independent variables are
x2 = Size of the house (or House Size)
x3 = Age of the furnace

It is usually more convenient to test the hypothesis using the p-value approach. Table 7.14 provides a summary of the tests using the p-value approach for all the three independent variables. The significance level α is 0.05 for all the tests. The hypothesis can be stated as:

H0: βj = 0 (xj is not a significant variable)

H1: βj ≠ 0 (xj is a significant variable)

where j = 1, 2, 3 for our example.

From Table 7.14 it can be seen that all the independent variables are significant. This means that all the three independent variables contribute in predicting the response variable y, the heating cost.

Note: The above method of conducting t-tests on each β parameter in a model is not the best way to determine whether the overall model provides information for the prediction of y. In this method, we need to conduct a t-test for each independent variable to determine whether the variable is significant. Conducting a series of t-tests increases the likelihood of making an error in deciding which variables to retain in the model and which ones to exclude. For example, suppose we are fitting a first-order model like the one in this example with 10 independent variables and decide to conduct t-tests on all 10 of the β's, each at α = 0.05. Each test carries a 5 percent chance of making a wrong decision (a Type I error, the probability of rejecting a true null hypothesis) and a 95 percent chance of making a right decision. If 10 tests are conducted, the probability that all the decisions are correct drops to approximately 60 percent [(0.95)¹⁰ = 0.599], assuming all the tests are independent of each other. This means that even if all the β parameters (except β0) are equal to 0, approximately 40 percent of the time the null hypothesis will be rejected incorrectly at least once, leading to the conclusion that some β differs from 0. Thus, in multiple regression models where a large number of independent variables are involved and a series of t-tests is conducted, there is a chance of including a large number of insignificant variables and excluding some useful ones from the model. In order to assess the utility of a multiple regression model, we need to conduct a test that includes all the β parameters simultaneously. Such a test would test the overall significance of the multiple regression model.

The other useful measure of the utility of the model would be to find some statistical quantity such as R2 that measures how well the model fits the data.

A Note on Checking the Utility of a Multiple Regression Model (Checking the Model Adequacy)

Step 1.To test the overall adequacy of a regression model, first test the following null and alternate hypotheses,

H0: β1 = β2 = … = βk = 0
H1: At least one of the βj is not zero

  A) If the null hypothesis is rejected, there is evidence that not all the β parameters are zero and the model is adequate. Go to step 2.
  B) If the null hypothesis is not rejected, then the overall regression model is not adequate. In this case, fit another model with more independent variables, or consider higher-order terms.

Step 2. If the overall model is adequate, conduct t-tests on the β parameters of interest, or the parameters considered to be most important in the model. Avoid conducting a series of t-tests on the β parameters; doing so increases the probability of a Type I error, α.

Multicollinearity and Autocorrelation in Multiple Regression

Multicollinearity is a measure of correlation among the predictors in a regression model. Multicollinearity exists when two or more independent variables in the regression model are correlated with each other. In practice, it is not unusual to see correlations among the independent variables. However, if serious multicollinearity is present, it may cause problems by increasing the variance of the regression coefficients and making them unstable and difficult to interpret. Also, highly correlated independent variables increase the likelihood of rounding errors in the calculation of β estimates and standard errors. In the presence of multicollinearity, the regression results may be misleading.

Effects of Multicollinearity

  A) Consider a regression model where the production cost (y) is related to three independent variables: machine hours (x1), material cost (x2), and labor hours (x3):

y = β0 + β1x1 + β2x2 + β3x3 + ε

MINITAB computer output for this model is shown in Table 7.15. If we perform t-tests for β1, β2, and β3, we find that all three independent variables are non-significant at α = 0.05, while the F-test for H0: β1 = β2 = β3 = 0 is significant (see the p-value in the Analysis of Variance results shown in Table 7.15). The results appear contradictory, but in fact they are not. The tests on the individual βi parameters indicate that the contribution of one variable, say x1 = machine hours, is not significant after the effects of x2 = material cost and x3 = labor hours have been accounted for. However, the result of the F-test indicates that at least one of the three variables is significant, or is making a contribution to the prediction of the response y. It is also possible that at least two, or all three, of the variables are contributing to the prediction of y; here, the contribution of one variable overlaps with that of the other variable or variables. This is the effect of multicollinearity.

Table 7.15 Regression Analysis: PROD COST vs. MACHINE HOURS, MATERIAL COST, and LABOR Hours.

image

  B) Multicollinearity may also have an effect on the signs of the parameter estimates. For example, refer to the regression equation in Table 7.15. In this model, the production cost (y) is related to the three explanatory variables: machine hours (x1), material cost (x2), and labor hours (x3). If we check the effect of the variable machine hours (x1), the regression model indicates that for each unit increase in machine hours, the production cost (y) decreases when the other two factors are held constant. However, we would expect the production cost (y) to increase as more machine hours are used. This may be due to the presence of multicollinearity: because of multicollinearity, a β parameter may have the opposite sign from what is expected.

One way of avoiding multicollinearity in regression is to use designed experiments, selecting the levels of the factors in such a way that the levels are uncorrelated. This may not be possible in many situations. It is not unusual to have correlated independent variables; therefore, it is important to detect the presence of multicollinearity in order to make the necessary modifications in the regression analysis.

Detecting Multicollinearity

Several methods are used to detect the presence of multicollinearity in regression. We will discuss two of them.

  1. Detecting multicollinearity using the variance inflation factor (VIF): MINITAB provides an option to calculate the variance inflation factor (VIF) for each predictor variable, which measures how much the variance of the estimated regression coefficient is inflated compared to when the predictor variables are not linearly related. Use the guidelines in Table 7.16 to interpret the VIF.

Table 7.16 Detecting correlation using VIF values

Values of VIF               Predictors are…
VIF = 1                     Not correlated
1 < VIF < 5                 Moderately correlated
VIF = 5 to 10 or greater    Highly correlated

VIF values greater than 10 may indicate that multicollinearity is unduly influencing your regression results. In this case, you may want to reduce multicollinearity by removing unimportant independent variables from your model.

Refer to the regression output in Table 7.15 for the VIF values for the production cost example. The VIF value for each predictor is greater than 10, indicating the presence of multicollinearity. The VIF values indicate that the predictors are highly correlated. The VIF for each of the independent variables is calculated automatically when a multiple regression model is run using MINITAB.
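A minimal sketch of the same computation with statsmodels follows; since the production-cost data are not listed here, the file name PROD_COST.csv and the column names are assumptions for illustration only.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

cost_df = pd.read_csv("PROD_COST.csv")   # hypothetical data file for this example
X = sm.add_constant(cost_df[["MachineHours", "MaterialCost", "LaborHours"]])
for i, name in enumerate(X.columns):
    if name != "const":                  # skip the intercept column
        print(name, variance_inflation_factor(X.values, i))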

Detecting Multicollinearity by Calculating Coefficient of Correlation, r

A simple way of determining multicollinearity is to calculate the coefficient of correlation, r, between each pair of predictor or independent variables in the model. The degree of multicollinearity depends on the magnitude of the value of r. Use Table 7.17 as a guide to determine the multicollinearity.

Table 7.17 Determining multicollinearity using the correlation coefficient, r

Correlation Coefficient, r    Degree of Multicollinearity
image                         Extreme multicollinearity
image                         Moderate multicollinearity
image                         Low multicollinearity

Table 7.18 shows the correlation coefficient, r between each pair of predictors for the production cost example.

Table 7.18 Correlation coefficient between pairs of variables

Correlations: Machine Hours, Material Cost, Labor Hours

                 Machine Hours    Material Cost
Material Cost    0.964
Labor Hours      0.953            0.917

Cell Contents: Pearson correlation

The above values of r show that the variables are highly correlated. The correlation coefficient matrix above was calculated using MINITAB.
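The same matrix can be produced in Python with pandas, continuing the hypothetical cost_df from the VIF sketch above:

# Pairwise Pearson correlations between the predictors (cf. Table 7.18).
print(cost_df[["MachineHours", "MaterialCost", "LaborHours"]].corr().round(3))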

Summary of the Key Features of Multiple Regression Model

The sections above extended the concept of simple linear regression and provided an in-depth analysis of the multiple regression model—one of the most widely used prediction techniques in data analysis and decision making. The multiple regression model explores the relationship between a response variable and two or more independent variables, or predictors. The sections provided computer analysis and interpretation of multiple regression models. Several examples of matrix plots were presented; these plots are helpful in the initial stages of model building. Using the computer results, the following key features of the multiple regression model were explained: (a) the multiple regression equation and its interpretation; (b) the standard error of the estimate—a measure used to check the utility of the model and to provide a measure of reliability of the predictions made from the model; and (c) the coefficient of multiple determination r², which measures the variability in the response y explained by the independent variables used in the model. Besides these, we discussed the hypothesis tests using the computer results. Step-wise instructions were provided to conduct the F-test and t-tests. The overall significance of the regression model is tested using the F-test; a t-test is conducted on each individual predictor, or independent variable, to determine the significance of that variable. The effect of multicollinearity and the detection of multicollinearity using the computer were discussed with examples.

Model Building and Computer Analysis

Introduction to Model Building

In the previous sections, we discussed simple and multiple regression and provided detailed analysis of these techniques, including the analysis and interpretation of computer results. In both the simple and multiple regression models, the relationship among the variables is linear. In this section we provide an introduction to model building and nonlinear regression models. By model building, we mean selecting the model that will provide a good fit to a set of data and a good estimate of the response or dependent variable y that is related to the independent variables or factors x1, x2, …, xn. It is important to choose the right model for the data.

In regression analysis, the dependent or the response variable is usually quantitative. The independent variables may be either quantitative or qualitative. The quantitative variable is one that assumes numerical values or can be expressed as numbers. The qualitative variable may not assume numerical values.

In experimental situations we often encounter both the quantitative and qualitative variables. In the model building examples, we will show later how to deal with qualitative independent variables.

Model with a Single Quantitative Independent Variable

The models relating the dependent variable y to a single quantitative independent variable x are derived from the polynomial of the form:

y = b0 + b1x + b2x² + … + bnxⁿ

In the above equation, n is an integer and b0, b1,...,bn are unknown parameters that must be estimated.

  A) First-order Model
    The first-order model is given by:

y = b0 + b1x

where b0 = y-intercept and b1 = regression coefficient (slope)

  B) Second-order Model
    A second-order model can be written as

y = b0 + b1x + b2x²    (7.34)

Equation (7.34) is a parabola in which:

b0 = y-intercept; b1 = shift parameter (a change in the value of b1 shifts the parabola to the left or right; increasing the value of b1 shifts the parabola to the left); b2 = rate of curvature

The second order model is a parabola. If b2 > 0 the parabola opens up; if b2 < 0, the parabola opens down. The two cases are shown in Figure 7.19.

image

Figure 7.19 The second order model

  C) Third-order Model
    A third-order model can be written as:

y = b0 + b1x + b2x² + b3x³

where b0 is the y-intercept and b3 controls the rate of reversal of the curvature of the curve.

A second-order model has no reversal in curvature: the y value either continues to increase or continues to decrease as x increases, producing either a trough or a peak. A third-order model produces one reversal in curvature, with one peak and one trough. Reversals in curvature are not very common but can be modeled using third- or higher-order polynomials. The graph of an nth-order polynomial contains at most (n − 1) peaks and troughs. Figure 7.20 shows the graph of a third-order polynomial. In real-world situations, the second-order model is perhaps the most useful.

image

Figure 7.20 The third-order model

Example: A Quadratic (Second-Order) Model

The life of an electronic component is believed to be related to the temperature in the operating environment. Table 7.19 shows 25 observations (Data File: COMP_LIFE) giving the life of the components (in hours) and the corresponding operating temperature (in °F). We would like to fit a model to predict the life of the component. In this case, the life of the component is the dependent variable (y) and the operating temperature is the independent variable (x).

Table 7.19 Life of electronic components

Obs.   X (Temp.)   Y (Life)
1      99          141.0
2      101         136.7
3      100         145.7
4      113         194.3
5      72          101.5
6      93          121.4
7      94          123.5
8      89          118.4
9      95          137.0
10     111         183.2
11     72          106.6
12     76          97.5
13     105         156.9
14     84          111.2
15     102         158.2
16     103         155.1
17     92          119.7
18     81          105.9
19     73          101.3
20     97          140.1
21     105         148.6
22     90          116.4
23     94          121.5
24     79          108.9
25     91          110.1

Figure 7.21 shows the scatter plot of the data in Table 7.19. From the scatter plot, we can see that the data can be well approximated by a quadratic model.

image

Figure 7.21 Scatter Plot of Life (y) vs. Operating Temp. (x)

We used MINITAB and EXCEL to fit a second order model to the data. The analysis of the computer results is presented below.

Second-Order Model Using MINITAB

A second order model was fitted using MINITAB. The regression output of the model is shown in Table 7.20.

Table 7.20 Computer results of second order model

image
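The same quadratic fit can be sketched in Python by constructing the x² column explicitly and regressing y on x and x²; the data are those of Table 7.19, and the variable names are illustrative.

import pandas as pd
import statsmodels.formula.api as smf

temp = [99, 101, 100, 113, 72, 93, 94, 89, 95, 111,
        72, 76, 105, 84, 102, 103, 92, 81, 73, 97,
        105, 90, 94, 79, 91]
life = [141.0, 136.7, 145.7, 194.3, 101.5, 121.4, 123.5, 118.4, 137.0, 183.2,
        106.6, 97.5, 156.9, 111.2, 158.2, 155.1, 119.7, 105.9, 101.3, 140.1,
        148.6, 116.4, 121.5, 108.9, 110.1]
comp = pd.DataFrame({"Temp": temp, "Life": life})
comp["Temp2"] = comp["Temp"] ** 2        # the x**2 column

quad = smf.ols("Life ~ Temp + Temp2", data=comp).fit()
print(quad.summary())                    # compare with Tables 7.20 and 7.21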

A quadratic model in MINITAB can also be run using the fitted line plot option. The results of the quadratic model using this option provide a fitted line plot (shown in Figure 7.22).

image

Figure 7.22 Regression Plot with Equation

While running the quadratic model, the data values and residuals can be stored, and plots of the residuals can be created.

Residual Plots for the above Example Using MINITAB

Figure 7.23 shows the residual plots for this quadratic model. The residual plots are useful in checking the assumptions of the model and the model adequacy.

image

Figure 7.23 Residual plots for the quadratic model example

The analysis of residual plots for this model is similar to that of simple and multiple regression models. The investigation of the plots shows that the normality assumption is met. The plot of residuals versus the fitted values shows a random pattern indicating that the quadratic model fitted to the data is adequate.
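A minimal sketch of two of these diagnostic plots in Python, using the fitted quad model from the sketch above:

import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sm.qqplot(quad.resid, line="s", ax=ax1)       # normality check of the residuals
ax2.scatter(quad.fittedvalues, quad.resid)    # should look random if the model fits
ax2.axhline(0, color="gray")
ax2.set_xlabel("Fitted value")
ax2.set_ylabel("Residual")
plt.show()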

Running a Second-Order Model Using EXCEL

Unlike MINITAB, EXCEL does not provide an option to run a quadratic model of the form

y = b0 + b1x + b2x²

However, we can run a quadratic regression model by calculating the x2 column from the x column in the data file. The EXCEL computer results are shown in Table 7.21.

Table 7.21 EXCEL computer output for the quadratic model

image

Analysis of Computer Results of Tables 7.20 and 7.21

Refer to MINITAB output in Table 7.20 or the EXCEL computer output in Table 7.21. The prediction equation from this table can be written using the coefficients column. The equation is

image

In the EXCEL output, the prediction equation can be read from the “coefficients” column.

The r2 value is 95.9 percent which is an indication of a strong model. It indicates that 95.9 percent of the variation in y can be explained by the variation in x and 4.1 percent of the variation is unexplained or due to error. The equation can be used to predict the life of the components at a specified temperature.

We can also test a hypothesis to determine if the second order term in our model, in fact, contributes to the prediction of y. The null and alternate hypotheses to be tested for this can be expressed as

H0: β2 = 0
H1: β2 ≠ 0

The test statistic for this test is given by

t = b2/sb2

The test statistic value is calculated by the computer and is shown in Table 7.21. In this table, the t value is reported in x**2 row and under t stat column. This value is 7.93. Thus,

t = 7.93

The critical value for the test is

tcritical = ±tα/2,(n−k−1)

[Note: tα/2 is the t-value from the t-table for (n − k − 1) degrees of freedom, where n is the number of observations and k is the number of independent variables.]

For our example, n = 25, k = 2 and the level of significance, α = 0.05. Using these values, the critical value or the t-value from the t-table for 22 degrees of freedom and α = 0.025 is 2.074. Since the calculated value of t

t = 7.93 > tcritical = 2.074

We reject the null hypothesis and conclude that the second order term in fact contributes in the prediction of the life of components (y). Note: we could have tested the following hypotheses:

H0: β2 = 0

H1: β2 > 0

which will determine that the value of b2 = 0.0598 in the prediction equation is large enough to conclude that the life of the components increases at an increasing rate with temperature. This hypothesis will have the same test statistic and can be tested at α = 0.05.

Therefore, our conclusion is that the mean component life increases at an increasing rate with temperature and that the second-order term in our model is, in fact, significant and contributes to the prediction of y.

Another Example: Quadratic (Second-Order) Model

The fitted line plot of the temperature and yield in Figure 7.24 shows the yield of a chemical process at different temperatures. The plot clearly indicates a nonlinear relationship. There is an indication that the data can be well approximated by a quadratic model.

image

Figure 7.24 Fitted line plot showing the yield of a chemical process

We used MINITAB and EXCEL to run a quadratic model to the data. The prediction equation from the regression output is shown below.

Yield (y) = 1,459 + 277 Temperature (x) − 0.896 x*x or,

ŷ = 1,459 + 277x − 0.896x²

The coefficient of determination, R2 is 88.2 percent. This tells us that 88.2 percent of the variation in y is explained by the regression and 11.8 percent of the variation is unexplained or due to error. The model is appropriate and the prediction equation can be used to predict the yield at different temperatures.

Summary of Model Building

The sections above provided an introduction to model building. First-order, second-order, and third-order models were discussed. Unlike the simple and multiple regression models, where the relationship among the variables is linear, there are situations where the relationship among the variables under study may not be linear. We discussed situations where higher-order and nonlinear models provide a better relationship between the response and the independent variables and provided examples of quadratic or second-order models. Scatter plots were created to select the model that would provide a good fit to a set of data and can be used to obtain a good estimate of the response or dependent variable y that is related to the independent variables or predictors. Since second-order or quadratic models are appropriate in many applications, we provided a detailed computer analysis of such models. The computer analysis and interpretation of the computer results were explained and examined, including the residual plots and analysis.

Models with Qualitative Independent (Dummy) Variables

Dummy or Indicator Variables in Multiple Regression: In regression we often encounter qualitative or indicator variables that need to be included as one of the independent variables in the model. For example, if we are interested in building a regression model to predict the salary of male and female employees based on their education and years of experience; the variable male or female is a qualitative variable that must be included as a separate independent variable in the model. To include such qualitative variables in the model we use a dummy or indicator variable. The use of dummy or indicator variables in a regression model allows us to include qualitative variables in the model. For example, to include the sex of employees in a regression model as an independent variable, we define this variable as

x = 1 if the employee is male
x = 0 if the employee is female

In the above formulation, a “1” indicates that the employee is a male and a “0” means the employee is a female. Which one of the male or female is assigned the value of 1 is arbitrary.

In general, the number of dummy or indicator variables needed is one less than the number of levels of the qualitative variable to be included in the model.

One Qualitative Independent Variable at Two Levels

Suppose we want to build a model to predict the mean salary of male and female employees. This model can be written as

y = b0 + b1x

where x is the dummy variable coded as

x = 1 if male employee
x = 0 if female employee

This coding scheme will allow us to compare the mean salary for male and female employees by substituting the appropriate code in the regression equation: y = b0 + b1 x.

Suppose µM = mean salary for the male employees

µF = mean salary for the female employees

Then the mean salary for the male: µM = y = b0 + b1 (1) = b0 + b1

and the mean salary for the female: µF = y = b0 + b1 (0) = b0

Thus, the mean salary for the female employees is b0. In a 0-1 coding system, the mean response will always be b0 for the level of the qualitative variable that is assigned the value 0. This is also called the base level.

The difference in the mean salary for the male and female employees can be calculated by taking the difference (µMµF)

µMµF = (b0 +b1) − b0 = b1

The above is the difference between the mean response for the level that is assigned the value 1 and the level that is assigned the value 0 or the base level. The mean salary for the male and female employees is shown graphically in Figure 7.25. We can also see that

b0 = µF and b1 = µM − µF

Figure 7.25 Mean salary of female and male employees

image

Model with One Qualitative Independent Variable at Three Levels

We would like to write a model relating the mean profit of a grocery chain. It is believed that the profit to a large extent depends on the location of the stores. Suppose that the management is interested in three specific locations where the stores are located. We will call these locations A, B, and C. In this case, the store location is a single qualitative variable which is at three levels corresponding to the three locations A, B, and C. The prediction equation relating the mean profit (y) and the three locations can be written as:

y = b0 + b1 x1 + b2 x2 where,

x1 = 1 if location B, 0 otherwise
x2 = 1 if location C, 0 otherwise

The variables x1 and x2 are the dummy variables in this model.

Explanation of the Model

Suppose, µA = mean profit for location A

µB = mean profit for location B

µC = mean profit for location C

If we set x1 = 0 and x2 = 0, we will get the mean profit for location A. Therefore, the mean value of profit y when the store location is A

µA = y = b0 + b1(0) + b2 (0)

or,

µA = b0

Thus, the mean profit for location A is b0 or, b0 = µA
Similarly, the mean profit for location B can be calculated by setting x1 = 1 and x2 = 0. The resulting equation is

µB = y = b0 + b1 x1 + b2 x2 = b0 + b1(1) + b2(0)

or,

µB = b0 + b1

Since b0 = µA, we can write

µB = µA + b1

or,

b1 = µB − µA

Finally, the mean profit for location C can be calculated by setting x1 = 0 and x2 = 1. The resulting equation is

µC = y = b0 + b1 x1 + b2 x2 = b0 + b1(0) + b2(1)

or,

µC = b0 + b2

Since b0 = µA, we can write

µC = µA + b2

b2 = µC − µA

Thus, in the above coding system, one qualitative independent variable is at three levels,

µA = b0

µB = b0 + b1 or b1 = µB − µA

µC = b0 + b2 or b2 = µC − µA

where µA, µB, µC are the mean profits for locations A, B, and C.

Note that the three levels of the qualitative variable can be described with only two dummy variables. This is because the mean of the base level (in this case location A) is accounted for by the intercept b0. In general form, for m levels of qualitative variable, we need (m − 1) dummy variables.
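pandas can generate this coding automatically. A minimal sketch with toy data (the column and prefix names are illustrative):

import pandas as pd

stores = pd.DataFrame({"Location": ["A", "B", "C", "B", "A", "C"]})  # toy data
# m = 3 levels need m - 1 = 2 dummies; drop_first=True makes "A" the base level.
dummies = pd.get_dummies(stores["Location"], prefix="Loc", drop_first=True)
print(dummies)   # columns Loc_B (x1) and Loc_C (x2); location A is coded (0, 0)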

The bar graph in Figure 7.26 shows the values of mean profit (y) for the three locations.

image

Figure 7.26 Bar chart showing the mean profit for three locations A, B, C

In the above bar chart, the height of the bar corresponding to location A is y = b0. Similarly, the heights of the bars corresponding to locations B and C are y = b0 + b1 and y = b0 + b2 respectively. Note that either b1 or b2, or both could be negative. In Figure 7.26, b1 and b2 are both positive.

Example: Dummy Variables

Consider the pharmaceutical company problem where the relationship between the sales volume (y) and three quantitative independent variables—advertisement dollars spent (x1) in hundreds of dollars, commission paid to the salespersons (x2) in hundreds of dollars, and the number of salespersons (x3)—was investigated. The company is now interested in including the different sales territories where it markets the drug. The territory in which the company markets the drug is divided into three zones: A, B, and C. The management wants to predict the sales for the three zones separately. To do this, the variable "zone," which is a qualitative independent variable, must be included in the model. The company identified the sales volumes for the three zones along with the variables considered earlier. The data, including the sales volume and the zone for each observation (shown in the last column), are listed in Table 7.22 (Data File: DummyVar_File1).

Table 7.22 Sales for different zones

Row   Sales Volume (y)   Advertisement (x1)   Commission (x2)   No. of Salespersons (x3)   Zone
1     973.62             580.17               235.48            8                          A
2     903.12             414.67               240.78            7                          A
3     1,067.37           420.48               276.07            10                         A
4     1,193.37           454.59               295.70            14                         B
5     1,429.62           524.05               286.67            16                         C
6     1,557.87           623.77               325.66            18                         A
7     1,590.12           641.89               298.82            17                         A
8     1,081.62           403.03               210.19            12                         C
9     1,088.37           415.76               202.91            13                         C
10    1,132.62           506.73               275.88            11                         B
11    1,314.87           490.35               337.14            15                         A
12    1,562.37           624.24               266.30            19                         C
13    1,050.12           459.56               240.13            10                         C
14    1,055.37           447.03               254.18            12                         B
15    1,112.37           493.96               237.49            14                         B
16    1,235.37           543.84               276.70            16                         B
17    1,518.12           618.38               271.14            18                         A
18    1,574.37           690.50               281.94            15                         C
19    1,644.87           591.27               316.75            20                         C
20    1,169.37           530.73               297.37            10                         C
21    1,212.87           541.34               272.77            13                         B
22    1,304.37           492.20               344.35            11                         B
23    1,477.62           546.34               295.53            15                         C
24    1,593.87           590.02               293.79            19                         C
25    1,134.87           505.32               277.05            11                         B
B

x4 = 1 if zone A, 0 otherwise
x5 = 1 if zone B, 0 otherwise

In the above coding system, the choice of 0 and 1 in the coding is arbitrary.

Note that we have defined only two dummy variables—x4 and x5—for a total of three zones. It is not necessary to define a third dummy variable for zone C.

From the above discussion, it follows that the regression model for the data in Table 7.22 including the variable “zone” can be written as:

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε

where:

y = sales volume,

x1 = advertisement dollars spent in hundreds of dollars,

x2 = commission paid to the salespersons in hundreds of dollars,

x3 = the number of salespersons, and

x4 and x5 are the dummy variables:

x4 = 1 if zone A, 0 otherwise
x5 = 1 if zone B, 0 otherwise

Table 7.23 shows the data file for this regression model with the dummy variables. The data can be analyzed using the MINITAB data file DummyVar_File(2) or the EXCEL data file DummyVar_File(2).xlsx.

Table 7.23 Data file for the model with dummy variables

Row   Volume (y)   Advertisement (x1)   Commission (x2)   No. of Salespersons (x3)   Zone A (x4)   Zone B (x5)
1     973.62       580.17               235.48            8                          1             0
2     903.12       414.67               240.78            7                          1             0
3     1,067.37     420.48               276.07            10                         1             0
4     1,193.37     454.59               295.70            14                         0             1
5     1,429.62     524.05               286.67            16                         0             0
6     1,557.87     623.77               325.66            18                         1             0
7     1,590.12     641.89               298.82            17                         1             0
8     1,081.62     403.03               210.19            12                         0             0
9     1,088.37     415.76               202.91            13                         0             0
10    1,132.62     506.73               275.88            11                         0             1
11    1,314.87     490.35               337.14            15                         1             0
12    1,562.37     624.24               266.30            19                         0             0
13    1,050.12     459.56               240.13            10                         0             0
14    1,055.37     447.03               254.18            12                         0             1
15    1,112.37     493.96               237.49            14                         0             1
16    1,235.37     543.84               276.70            16                         0             1
17    1,518.12     618.38               271.14            18                         1             0
18    1,574.37     690.50               281.94            15                         0             0
19    1,644.87     591.27               316.75            20                         0             0
20    1,169.37     530.73               297.37            10                         0             0
21    1,212.87     541.34               272.77            13                         0             1
22    1,304.37     492.20               344.35            11                         0             1
23    1,477.62     546.34               295.53            15                         0             0
24    1,593.87     590.02               293.79            19                         0             0
25    1,134.87     505.32               277.05            11                         0             1

We used both MINITAB and EXCEL to run this model. The MINITAB and EXCEL regression outputs and results are shown in Tables 7.24 and 7.25. Refer to the computer results to answer the following questions.

image

  A) Using the EXCEL data file, run a regression model. Show your regression output.
  B) Using the MINITAB or EXCEL regression output, write down the regression equation.
  C) Using a 5 percent level of significance and the column "p" in the MINITAB regression output or the "p-value" column in the EXCEL regression output, conduct appropriate hypothesis tests to determine whether the independent variables advertisement, commission paid, and number of salespersons are significant, that is, whether they contribute in predicting the sales volume.
  D) Write separate regression equations to predict the sales for each of the zones A, B, and C.
  E) Refer to the given MINITAB residual plots and check that all the regression assumptions are met and the fitted regression model is adequate.

Solution:

  A) The MINITAB regression output is shown in Table 7.24.
  B) Table 7.25 shows the EXCEL regression output.
  C) From the MINITAB or the EXCEL regression outputs in Tables 7.24 and 7.25, the regression equation is:

    Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) − 67.2 Zone A (x4) − 105 Zone B (x5)

or

image

The regression equation from the EXCEL output in Table 7.25 can be written using the coefficients column.

  D) The hypotheses to check the significance of each of the independent variables can be written as:

H0: βj = 0 (xj is not a significant variable)

H1: βj ≠ 0 (xj is a significant variable)

The above hypothesis can be tested using the “p” column in either MINITAB or the p-value column in EXCEL computer results. The decision rule for the p-value approach is given by

If p ≥ α, do not reject H0

If p < α, reject H0

Table 7.26 shows the p-value for each of the predictor variables, taken from the MINITAB or EXCEL computer results in Table 7.24 or 7.25 (see the "p" or the "p-value" columns in these tables).

Table 7.26 Summary table

Independent Variable        p-value from Table 7.24 or 7.25    Compare p to α    Decision    Significant? Yes or No
Advertisement (x1)          0.000                              p < α             Reject H0   Yes
Commissions (x2)            0.000                              p < α             Reject H0   Yes
No. of salespersons (x3)    0.000                              p < α             Reject H0   Yes

From the above table it can be seen that all the three independent variables are significant.

E) As indicated, the overall regression equation is

Sales Volume (y) = −98.2 + 0.884 Advertisement(x1) + 1.81 Commission(x2) + 33.8 No. of Salespersons(x3) − 67.2 Zone A (x4) − 105 Zone B (x5)

Separate equations for each zone can be written from this equation.

Zone A: x4 = 1.0, x5 = 0

Therefore, the equation for the sales volume of Zone A can be written as

Sales Volume (y) = −98.2 + 0.884 Advertisement(x1) + 1.81 Commission(x2) +33.8 No. of Salespersons(x3) − 67.2(1) − 105 (0.0) or,

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) – 67.2 or,

Sales Volume (y) = −165.4 + 0.884 Advertisement(x1) + 1.81 Commission(x2) + 33.8 No. of Salespersons(x3)

Similarly, the regression equations for the other two zones are shown below.

Zone B: x4 = 0, x5 = 1.0
Substituting these values in the overall regression equation of part (c)

Sales Volume (y) = −98.2 + 0.884 Advertisement(x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons(x3) − 105 or, Sales Volume (y) = −203.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) +33.8 No. of Salespersons (x3)

Zone C: x4 = 0, x5 = 0
Substituting these values in the overall regression equation of part (c)

Sales Volume (y) = −98.2 + 0.884 Advertisement(x1) + 1.81 Commission(x2) + 33.8 No. of Salespersons(x3)

Note that in all of the above equations, the slopes are the same but the intercepts are different.
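A minimal sketch of the same fit in Python, recovering the common slopes and the per-zone intercepts; the file name and the column names are illustrative assumptions based on Table 7.23:

import pandas as pd
import statsmodels.formula.api as smf

sales = pd.read_csv("DummyVar_File2.csv")   # hypothetical CSV export of Table 7.23
fit = smf.ols("Volume ~ Advertisement + Commission + Salespersons + ZoneA + ZoneB",
              data=sales).fit()
b = fit.params
print("Zone C intercept:", b["Intercept"])                # base level (x4 = x5 = 0)
print("Zone A intercept:", b["Intercept"] + b["ZoneA"])   # shifted by the A dummy
print("Zone B intercept:", b["Intercept"] + b["ZoneB"])   # shifted by the B dummy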

F) The MINITAB residual plots are shown in Figure 7.27.

image

Figure 7.27 Residual plots for the dummy variable example

In the residual plots of Figure 7.27, the normal probability plot and the histogram show that the residuals are approximately normally distributed. The plot of residuals versus fits does not show any pattern and is quite random, indicating that the fitted linear regression model is adequate. The plot of residuals versus the order of the data points shows no apparent pattern, indicating that there is no violation of the independence of errors assumption.

Overview of Regression Models

Regression is a powerful tool and is widely used in studying the relationships among variables. A number of regression models were discussed in this book; these models are summarized here:

Simple Linear Regression

y = b0 + b1x

Multiple Regression

y = b0 + b1x1 + b2x2 + … + bkxk

Polynomial Regression (second-order models can be extended to higher-order models)

Second-order polynomial:

y = b0 + b1x + b2x^2

Higher-order polynomial:

y = b0 + b1x + b2x^2 + … + bpx^p
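As an illustration, a second-order polynomial can be fitted by least squares in Python with numpy.polyfit; the data below are hypothetical.

    import numpy as np

    # Hypothetical data showing a curvilinear (second-order) trend
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 3.9, 7.2, 11.8, 18.1, 26.2, 35.9, 47.8])

    # Fit y = b0 + b1*x + b2*x^2; polyfit returns the highest power first
    b2, b1, b0 = np.polyfit(x, y, deg=2)
    print(b0, b1, b2)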

Interaction Models

An interaction model relating y and two quantitative independent variables can be written as

y = b0 + b1x1 + b2x2 + b3x1x2
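A minimal sketch of fitting such an interaction model on hypothetical data: the cross-product x1*x2 is simply included as a third predictor column.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x1 = rng.uniform(0, 10, 30)
    x2 = rng.uniform(0, 10, 30)
    # Hypothetical response with a genuine interaction effect
    y = 5 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2 + rng.normal(0, 1, 30)

    # Predictors: x1, x2, and the interaction term x1*x2
    X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)   # b0, b1, b2, b3 (the interaction coefficient)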

Models with Dummy Variables

General form of a model with one qualitative (dummy) independent variable at m levels

y = b0 + b1x1 + b2x2 + … + bm−1xm−1

where xi is the dummy variable for level (i + 1) and

xi = 1 if the observation is from level (i + 1); xi = 0 otherwise
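For example, coding a qualitative variable at m = 3 levels into m − 1 = 2 dummy variables can be done in Python with pandas (the level names here are hypothetical):

    import pandas as pd

    # Hypothetical qualitative variable at m = 3 levels
    zone = pd.Series(["A", "B", "C", "A", "C", "B"])

    # m - 1 = 2 dummy columns; the dropped level ("A") is the baseline,
    # identified by zeros in all dummy columns
    dummies = pd.get_dummies(zone, prefix="zone", drop_first=True).astype(int)
    print(dummies)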

All Subset and Stepwise Regression

Finding the best set of predictor variables to be included in the model

Note: the interaction models and all-subset regression are not discussed in this chapter.

There are other regression models that are not discussed here but can be developed using the concepts presented for the models above. Some of these models are outlined below.

Reciprocal Transformation of x Variable

This transformation can produce a linear relationship and is of the form:

y = b0 + b1(1/x)

This model is appropriate when x and y have an inverse relationship. Note that the inverse relationship is not linear.
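A minimal sketch of this transformation on hypothetical data: transform the predictor to 1/x and fit an ordinary straight line by least squares.

    import numpy as np

    # Hypothetical data with an inverse relationship (y falls as x grows)
    x = np.array([1.0, 2.0, 4.0, 5.0, 8.0, 10.0])
    y = np.array([10.2, 5.4, 2.9, 2.3, 1.6, 1.3])

    # Fit y = b0 + b1*(1/x); polyfit returns the slope first
    b1, b0 = np.polyfit(1.0 / x, y, deg=1)
    print(b0, b1)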

Log Transformation of x Variable

The logarithmic transformation of the x variable is of the form:

y = b0 + b1 ln(x)

This is a useful curvilinear form, where ln(x) is the natural logarithm of x and x > 0.

Log Transformation of x and y Variables

The logarithmic transformation of both variables is of the form:

ln(y) = b0 + b1 ln(x)

The purpose of this transformation is to achieve a linear relationship. The model is valid for positive values of x and y. This transformation is more involved, and it is difficult to compare the resulting model to other models with y as the dependent variable.
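A minimal sketch of the log-log transformation on hypothetical positive data; a straight line in the transformed scale corresponds to a power-law relationship, y ≈ exp(b0) · x^b1, in the original scale.

    import numpy as np

    # Hypothetical positive data roughly following y = a * x^b
    x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 12.0])
    y = np.array([2.0, 5.5, 10.1, 22.4, 44.9, 80.3])

    # Fit ln(y) = b0 + b1*ln(x); both variables must be positive
    b1, b0 = np.polyfit(np.log(x), np.log(y), deg=1)
    print(np.exp(b0), b1)   # back-transformed intercept and the exponent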

Logistic Regression

This model is used when the response variable is categorical. In all the regression models developed in this book, the response variable was quantitative. When the response is categorical or qualitative, the simple and multiple least-squares regression models violate the normality assumption. The correct model in this case is logistic regression, which is not discussed in this book.
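Although logistic regression is beyond the scope of this book, the sketch below shows the idea on hypothetical data using scikit-learn: the response is binary, and the model estimates the probability of each category rather than a numeric value.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    # Hypothetical data: one predictor and a binary (categorical) response
    x = rng.uniform(0, 10, size=(100, 1))
    p = 1.0 / (1.0 + np.exp(-(x[:, 0] - 5)))      # true success probability
    y = (rng.uniform(size=100) < p).astype(int)   # observed 0/1 outcomes

    clf = LogisticRegression().fit(x, y)
    print(clf.intercept_, clf.coef_)              # fitted logit coefficients
    print(clf.predict_proba([[7.0]]))             # P(y = 0), P(y = 1) at x = 7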

Implementation Steps and Strategy for Regression Models

Successful implementation of regression models requires an understanding of the different types of models. A knowledge of the least-squares method, on which many regression models are based, as well as an awareness of the assumptions of least-squares regression, is critical in evaluating and implementing the correct regression model. Computer packages have made model building and analysis easy. As we have demonstrated, scatter plots and matrix plots constructed using the computer are very helpful in the initial stages of selecting the right model for the given data, and the residual plots for checking the assumptions of regression can be constructed just as easily. While computer packages have removed the computational hurdle, it is important to understand the fundamentals underlying regression in order to apply regression models properly. A lack of understanding of the least-squares method and of the assumptions underlying regression may lead to wrong conclusions. For example, if the assumptions of regression are violated, it is important to determine the alternate course or courses of action.
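As a starting point for this kind of exploratory screening, a matrix plot of a data set can be produced in Python with pandas; the data below are hypothetical.

    import numpy as np
    import pandas as pd
    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    # Hypothetical data set: a response y and three candidate predictors
    df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["y", "x1", "x2", "x3"])

    # Pairwise scatter plots suggest a starting form for the regression model
    scatter_matrix(df, figsize=(8, 8))
    plt.show()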
