CHAPTER 3

Multiple Linear Regression

Invo tells us that his boss wanted him to examine several factors that can affect consumer spending. Prof. Metric replies that this is one of the topics to be discussed this week and that once we finish the chapter, we will be able to:

1. Develop models for multiple linear regression;

2. Discuss specific issues with the OLS estimation method;

3. Explain basic concepts for F-tests and other measurements;

4. Perform data analyses and interpret the results using Excel.

We learn that this chapter will involve two or more explanatory variables.

Econometric Model

So far we have learned to perform regression with only one independent variable. In a real-life situation, we often see more than one factor affecting the movement of a market. Hence, a new model needs to be introduced.

A model with more than one determinant of spending can look like this:

SPEND = a1 + a2 WAGE + a3 HOUSEP,    (3.1)

where SPEND and WAGE are the same as in chapter 2, and HOUSEP is the average house price. When house prices go up, consumers feel richer and so increase their spending. The interpretation of a1 is the same as that in Chapter 2. The interpretation of a2 and a3 needs some revision. The parameter a2 now provides an estimate of the change in consumer spending due to a one-unit change in the wage, holding HOUSEP constant. The parameter a3 represents the change in consumer spending due to a one-unit change in the average house price, holding WAGE constant.

The econometric model derived from equation (3.1) is:

SPENDi = a1 + a2 WAGEi + a3 HOUSEPi + ei.    (3.2)

For cross-sectional data, the general model is:

yi = a1 + a2 xi2 + a3 xi3 + … + aK xiK + ei,    (3.3)

where y is the dependent variable, and the x’s are usually called explanatory variables or regressors instead of independent variables because multiple x’s might not be completely independent of each other. The interpretation of the slope ak is:

ak = ΔE(y)/Δxk, holding all other x's constant.

Regarding estimations using cross-sectional data, the six classic assumptions in multiple linear regression are:

  (i) The model is yi = a1 + a2 xi2 + … + aK xiK + ei.

 (ii) E(ei) = 0, which implies E(yi) = a1 + a2 xi2 + … + aK xiK.

(iii) Var(ei) = Var(yi) = σ2.

(iv) Cov(ei, ej) = Cov(yi, yj) = 0 for i ≠ j.

 (v) Each xik is not random, must take at least two different values, and is not an exact linear function of any other x.

(vi) ei ~ N(0, σ2); yi ~ N([a1 + a2 xi2 + … + aK xiK], σ2).

Assumption (v) is modified for time-series data as follows:

(v.a) y and the x's are stationary random variables that take at least two different values, and et is independent of current, past, and future values of the x's.

(v.b) when some of the x's are lagged values of y, et is uncorrelated with all x's and their past values.

Prof. Metric reminds us,

Assumption (v) only requires that x’s are not perfectly correlated to each other. In empirical studies, any correlation of less than 90 percent between two variables can be acceptable; otherwise, they are considered highly correlated with each other, and we will run into the problem of multicollinearity. To overcome this problem, we can replace or eliminate one of the highly correlated variables. We can also modify the model using nonsample information, which will be discussed later in this chapter.

If assumptions (i) through (v) hold, then the OLS technique will produce the BLUE (best linear unbiased estimators) in multiple linear regression. If assumption (vi) also holds, then the test results are valid; even when it does not, the results remain valid in large samples, where we can cite the CLT concerning the approximately normal distribution of the errors.

Taila then asks, “What could be the consequences of the multicollinearity problem?” Prof. Metric says that there are several consequences when two or more explanatory variables are highly correlated with each other and continues his explanation. First, the effect of each explanatory variable on the dependent variable Y tends to be imprecise. As indicated in equation (3.1), a regression coefficient approximates the change in the dependent variable due to a one-unit change in an explanatory variable, holding the other variables constant. If variable Z is highly correlated with variable X, a one-unit change in Z causes a change in X, which can no longer be held constant. Second, since perfectly correlated variables can be written as functions of each other, some of them are redundant. For example, if Z = 2X, the estimated equation Y = a1 + a2 Z + a3 X can be written as Y = a1 + 2a2 X + a3 X = a1 + (2a2 + a3) X. As a result, the estimated equation simplifies to Y = a1 + a4 X, and Z is redundant. Third, since some of the explanatory variables are highly correlated with each other, the standard errors of the affected coefficients are often inflated, and the test of the hypothesis that a coefficient is significantly different from zero is less reliable. Finally, because the explanatory variables are highly correlated with each other, small changes to the data of one variable can lead to large changes in the estimated model and might bias the results. A small simulation below illustrates the inflated standard errors.
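The simulation is a minimal sketch in Python (outside the book's Excel workflow); the variables and sample are invented purely for illustration, and the helper function is a bare-bones OLS rather than a routine from any particular package:

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
z = 2 * x + rng.normal(scale=0.05, size=n)  # Z is almost an exact multiple of X
y = 1 + x + z + rng.normal(size=n)

def ols_se(X, y):
    # Bare-bones OLS: coefficients and their standard errors
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])  # sigma-hat squared, df = N - K
    return b, np.sqrt(s2 * np.diag(XtX_inv))

X_both = np.column_stack([np.ones(n), x, z])  # keeps both collinear regressors
X_drop = np.column_stack([np.ones(n), x])     # eliminates Z
print(ols_se(X_both, y)[1])  # standard errors on x and z are inflated
print(ols_se(X_drop, y)[1])  # the standard error on x shrinks once z is removed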

We now see the seriousness of multicollinearity and look forward to learning how to detect and correct the problem.

Estimators and Estimates

We learn that the OLS procedure for multiple linear regression continues to minimize the sum of the squared differences between the observed values of y and their expected values E(y). Let this sum of squares be a function of a1, a2,…, aK; then we can write:

S(a1, a2,…, aK) = Σi (yi − a1 − a2 xi2 − … − aK xiK)2.

Prof. Metric says that minimizing this function can be quite complicated, as it requires calculus, and the formulas for the estimators are too unwieldy for us to practice calculating the estimates manually. Hence, the only requirement here is to know that the estimators for multiple regression are â1, â2,…, âK, and the estimated equation is:

ŷi = â1 + â2 xi2 + … + âK xiK.    (3.4)

The numeric values of these estimators, which can be obtained from a data analysis of a specific sample using any econometric software, are the estimates of the regression.
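As a rough sketch of what such software does internally, the Python lines below compute â1, â2, and â3 for the spending model by least squares; the six observations are made up solely so the code runs:

import numpy as np

# Invented sample, in hundreds of dollars, mimicking equation (3.2)
wage   = np.array([4.0, 5.5, 6.0, 7.2, 8.1, 9.0])
housep = np.array([9.0, 10.0, 11.5, 12.0, 13.2, 14.0])
spend  = np.array([3.3, 4.0, 4.3, 4.9, 5.4, 5.9])

X = np.column_stack([np.ones_like(wage), wage, housep])  # column of ones for the intercept
coefs, *_ = np.linalg.lstsq(X, spend, rcond=None)        # minimizes the sum of squared errors
print(coefs)  # estimates of a1, a2, a3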

Point Estimates

Suppose that estimating equation (3.2) yields the following results:

SPEND = 1 + 0.5 WAGE + 0.03 HOUSEP,    (3.5)

where the units for all three variables are hundreds of dollars. The results imply that:

  (i) Weekly spending of a person without a wage is $100.

 (ii) Holding the house price constant, a $100 increase in the weekly wage raises weekly spending by $50.

(iii) Holding the weekly wage constant, a $100 increase in the average house price increases weekly spending by $3 (= 0.03*100).

Interval Estimates

Prof. Metric tells us that the equation for interval estimate is the same as in Chapter 2:

âk ± tC se(âk).    (3.6)

The only difference is that we have K parameters to be estimated in multiple regression. Hence, the t-critical value is based on (N − K) degrees of freedom instead of (N − 2).
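As a minimal sketch, the computation in Python looks as follows; the coefficient estimate and its standard error are hypothetical, and only the df argument differs from the Chapter 2 case:

from scipy import stats

a_hat, se_a = 0.5, 0.1                     # hypothetical estimate and standard error
N, K = 53, 3                               # sample size and number of parameters
t_c = stats.t.ppf(1 - 0.05 / 2, df=N - K)  # two-tailed critical value with N - K df
print(a_hat - t_c * se_a, a_hat + t_c * se_a)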

Predicted Values

Prof. Metric says that the predicted values can be calculated by substituting the estimated coefficients into the model following a similar procedure discussed in Chapter 2.

In this example, we substitute the values of the wage and house price into equation (3.5) and find that a person with a weekly wage of $600, facing a change in the average house price of $1,000, can expect weekly spending of:

SPEND = 1 + 0.5*6 + 0.03*10 = 4.3, that is, $430 per week.    (3.7)

Prof. Metric says that interval predictions can also be made for multiple regression with similar formulas as those in Chapter 2, except for the standard errors:

se(p) = sqrt(var(y0 − ŷ0)),    (3.8)

where var(y0 − ŷ0) is the estimated variance of the prediction error, which econometric packages report.

From (3.8), the interval prediction is:

ŷ0 ± tC se(p).    (3.9)

In the spending example, if N = 53, then df = 53 − 3 = 50. We choose α = 0.05, so tC = 2.009. Suppose se(p) = 0.2; then the interval prediction at the 95 percent confidence level is:

4.3 ± 2.009*0.2 = (3.8982; 4.7018) = ($389.82; $470.18).

This result implies that, with 95 percent confidence, a person with a weekly wage of $600 will spend anywhere from $389.82 to $470.18 weekly when the house price is included in the model.
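The sketch below reproduces this arithmetic in Python; it is only a check of the numbers above:

from scipy import stats

point, se_p, N, K = 4.3, 0.2, 53, 3
t_c = stats.t.ppf(0.975, df=N - K)  # about 2.009 with 50 df
print(point - t_c * se_p, point + t_c * se_p)  # about 3.898 and 4.702, i.e., $389.82 to $470.18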

Hypothesis Testing

In order to evaluate each coefficient estimate, a t-test is still appropriate. Prof. Metric says that to determine the joint significance of two or more coefficients or the significance of a model, we need to perform F-tests.

Tests of Joint Significance

Prof. Metric gives us the full model in equation (3.3) with more variables written out here:

yi = a1 + a2 xi2 + a3 xi3 + a4 xi4 + … + aK xiK + ei.

This model is called an unrestricted model, on which we have to perform a regression and obtain its SSEU, where U stands for unrestricted. Suppose that we want to test whether a2 and a3 are jointly significant, then the restricted model is:

yi = a1 + a4 xi4 + a5 xi5 + … + aK xiK + ei.    (3.10)

Hence, we need to perform a regression on this restricted model and obtain its SSER, where R stands for restricted. The F-test is then performed in the four standard steps; for equation (3.10) the hypotheses are written as:

  (i) H0: a2 = a3 = 0; Ha: a2 and a3 are jointly significant.

 (ii) The F-statistic:

FSTAT = [(SSER − SSEU)/J] / [SSEU/(N − K)],    (3.11)

where J is called the number of restrictions, which is the number of coefficients in the null hypothesis. The null hypothesis for equation (3.10) only has a2 and a3 and so J = 2.

(iii) The F-critical value, FC, can be found from any F-distribution table by choosing α and looking through the table for FC = F(α, J, N − K).

In Excel, FC can be found by typing =FINV(α, J, N − K), then pressing Enter.

(iv) Decision: If FSTAT > Fc, we reject the null hypothesis, meaning the two coefficients are jointly significant and implying that we might not want to eliminate one of them from the regression equation.

Prof. Metric reminds us that an F-distribution has only positive values, so we do not have to compare absolute values of F-statistics to F-critical values. In addition, existing textbooks use various notations for J and N − K, as outlined in the following:

J = numerator degrees of freedom = Num. df = v1 (displayed across the top row);

N − K = denominator degrees of freedom = Den. df = v2 (displayed down the first column).

He then gives us an unrestricted model with three explanatory variables:

image

where SALARY is the yearly salary, EDU is years of education, EXP is years of experience, and IQ is the intelligence quotient score, the same as in section 2. Suppose the regression results for this model are:

SALARY = 0.23 + 0.082 EDU + 0.13 EXP + 0.091 IQ;

(se)  (0.08)  (0.02)  (0.05)  (0.09)

Number of observations = 51;

SSEU = 140.

From these results, the coefficient of IQ is insignificant. However, if IQ and EXP are jointly significant, or IQ and EDU are jointly significant, then we might not want to eliminate IQ from the model. To test the joint significance of IQ and EXP, we need to perform a regression on the restricted model:

SALARY = a1 + a2 EDU + e.

Suppose the results for this restricted model are:

SALARY = 0.12 + 0.103 EDU;

(se) (0.07) (0.03)

SSER = 170, where R stands for restricted.

We perform the test as follows:

  (i) The hypotheses:

H0: a3 = a4 = 0; Ha: a3 and a4 are jointly significant.

 (ii) The F-statistic:

In this case, J = 2 (for a3 and a4), and N − K is the degrees of freedom (df): df = 51 − 4 = 47; hence,

FSTAT = [(170 − 140)/2] / [140/47] = 15/2.9787 ≈ 5.04.

(iii) We choose α = 0.05, so FC = F (0.95, 2, 47) ≈ 3.19.

In Excel, we can type =FINV(0.05, 2, 47) into any empty cell and then press the Enter key. It gives us the result of 3.195.

(iv) Decision: Since FSTAT > FC, we reject the null, meaning the two coefficients are jointly significant and implying that we might not want to eliminate IQ from the regression equation.
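This worked example can be verified with a short Python sketch of equation (3.11); scipy's f.ppf returns the same critical value as Excel's FINV. The helper function below is generic, not a routine from the text:

from scipy import stats

def f_test(sse_r, sse_u, J, N, K, alpha=0.05):
    # F-statistic for J joint restrictions, equation (3.11)
    f_stat = ((sse_r - sse_u) / J) / (sse_u / (N - K))
    f_crit = stats.f.ppf(1 - alpha, J, N - K)  # matches Excel's =FINV(alpha, J, N - K)
    return f_stat, f_crit

f_stat, f_crit = f_test(sse_r=170, sse_u=140, J=2, N=51, K=4)
print(f_stat, f_crit, f_stat > f_crit)  # about 5.04, 3.195, True: reject H0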

Tests of Model Significance

At this point, Booka asks, “What if all coefficients of the explanatory variables in a model are not statistically significant?” Prof. Metric praises her for raising the issue and says that in this case the model does not make any significant contribution to the estimation and should be revised. Thus, we need to test for the model significance, which needs only the full model with its SST and SSE. The four-step procedure for the test is:

  (i) H0: all ak = 0 for k = 2, 3,…, K; Ha: at least one ak ≠ 0.

 (ii) The F-statistic:

FSTAT = [(SST − SSE)/J] / [SSE/(N − K)],    (3.13)

where J is the number of restrictions. Note that in this case, only a1 is excluded from the null hypothesis because it is the constant term, so J = K − 1.

(iii) F-critical: We learn that we again need to look for FC = F(α, J, N − K).

(iv) Decision: If FSTAT > Fc, we reject the null hypothesis. This means at least one ak ≠ 0 and implies that the model is statistically significant.

We then work on the example in equation (3.12) with SST given by Prof. Metric as SST = 180. SSE is already given and can be written as SSE = 140 because we have only a single model in (3.12); that is, no subscript U is needed. The four-step procedure for the test is:

  (i) H0: a2 = a3 = a4 = 0; Ha: at least one ak ≠ 0.

 (ii) The F-statistic:

FSTAT = [(180 − 140)/3] / [140/47] = 13.3333/2.9787 ≈ 4.48.

Note that J = K − 1 = 4 − 1 = 3.

(iii) F-critical: We choose α = 0.05, so FC = F (0.95, 3, 47) ≈ 2.80.

We also type =FINV(0.05, 3, 47) into an empty cell in Excel and then press Enter. It gives us the same result of 2.8024.

(iv) Decision: Since FSTAT = 4.48 > FC = 2.80, we reject the null hypothesis and conclude that at least one ak ≠ 0, implying that the model is statistically significant. A quick check of this calculation is sketched below.
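This is a minimal sketch using only the SST, SSE, N, and K reported above:

from scipy import stats

sst, sse, N, K = 180, 140, 51, 4
f_stat = ((sst - sse) / (K - 1)) / (sse / (N - K))  # equation (3.13) with J = K - 1
f_crit = stats.f.ppf(0.95, K - 1, N - K)            # about 2.8024
print(f_stat, f_stat > f_crit)                      # about 4.48, True: reject the null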

F-test Versus t-Test

Tailor suddenly asks, “What are the differences between t-tests and F-tests?” Prof. Metric provides the following explanations.

F-tests and t-tests share one similarity: both test hypotheses about the significance or the expected values of the coefficient estimates. In addition, when J = 1, the t- and F-tests are equivalent.

However, the two tests have several disparities:

1. In a t-test, we have a hypothesis about a single estimated coefficient.

2. In an F-test, we have joint hypotheses. An F-test can also be used to test the significance of a model.

3. For the test with ak ≠ 0, the t-test is a two-tailed test whereas the F-test is a one-tailed test.

4. The F-distribution has J numerator degrees of freedom (df) and (N − K) denominator df, whereas the t-distribution corresponds to only one numerator df (J = 1).

We are now ready to move to the next section.

Goodness-of-Fit and Reporting the Results

Goodness-of-Fit

An adjusted R2 is used to measure goodness-of-fit in multiple regression because adding explanatory variables uses up degrees of freedom (df). The adjusted R2 value accounts for this loss of df, even though an R2 value is still reported by most econometric packages, including Excel. The formula for calculating the adjusted R2 value is:

adjusted R2 = 1 − (1 − R2)(N − 1)/(N − K).    (3.14)
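In Python, the formula is a one-line function. As a minimal sketch, the check below plugs in the investment regression reported later in this chapter (R2 = 0.9369, N = 51, K = 3) and reproduces the reported adjusted R2 up to rounding:

def adjusted_r2(r2, n, k):
    # Penalizes R2 for the degrees of freedom used by the k estimated parameters
    return 1 - (1 - r2) * (n - 1) / (n - k)

print(adjusted_r2(0.9369, 51, 3))  # about 0.9343, close to the reported 0.9344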

In multiple regression, the concept of the p-value learned in Chapter 2 comes in very handy for measuring goodness-of-fit, because we might want to perform regression through the origin once in a while. Touro bursts out, “Yes, sometimes regressing with the intercept does not make sense. Yesterday, my boss wanted me to investigate how the sizes of land and buildings affect prices of vacation houses in the city. I ran a regression and found these results:

HOUSEP = 54,201 + 12 LANDS + 102 BUILDS,

where HOUSEP is the vacation-house prices, LANDS is the square feet of land, and BUILDS is the square feet of the building on the land. I stared at the results and told my boss that a house that has zero square feet of land costs roughly $54,000!”

Prof. Metric says that was an excellent example and that in this case, regressing through the origin makes more sense because the intercept should be zero; so, equation (3.3) becomes:

yi = a2 xi2 + a3 xi3 + … + aK xiK + ei.    (3.15)

The only problem is that the R2 value sometimes comes out negative, which is embarrassing, so using p-values to measure goodness-of-fit makes sense in this case.

Reporting the Results

We learn that the results for simple and multiple regression are reported in similar manners. The only difference between the two is that the adjusted R2 is added for a multiple regression. For example:

SALEi = 18.97 − 1.901 PRICEi + 0.763 ADSi,

 (se)  (6.35)  (0.096)  (0.314)

 R2 = 0.824; adjusted R2 = 0.798; N = 36.

Data Analyses

Prof. Empirie says that a correlation analysis is crucial in multiple linear regression. Before estimating, we need to detect and eliminate any variable that causes multicollinearity.

Detecting Multicollinearity

The dataset is available in the file Ch03.xls, Fig.3.1 to 3.2. The dependent variable is investment (INV), and the two explanatory variables are tax credits by the government (CREDIT) and personal income (INCOME). First, we carry out a correlation analysis:

Go to Data then Data Analysis on the ribbon.

Select Correlation instead of Regression, then click OK.

A dialog box appears, as shown in Figure 3.1.

The result shows that the correlation coefficient between INCOME and CREDIT is 0.8016, which is acceptable to perform a regression (in the data file, you can find this correlation coefficient in cell Q3).

Regressing

Next, we perform a regression of INV on CREDIT and INCOME:

Go to Data then Data Analysis, select Regression and click OK.


Figure 3.1 Dialog box for correlation analysis

The Input Y Range is B1:B52, the Input X Range is C1:D52.

Check the boxes Labels and Residuals.

Check the button Output Range and enter F1, then click OK.

A dialog box appears; click OK to overwrite the data.

We also copy and paste the correlation results into cells L1 through N3 together with the regression results in Figure 3.2, which shows the correlation coefficient in cell M3.

From these results, the estimated equation can be written as:

INVi = 4421 + 0.1041 INCOMEi + 0.2592 CREDITi;

(se) (1760) (0.0076) (0.0966)

R2 = 0.9369; adjusted R2 = 0.9344.
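For readers who prefer code to the Excel dialog boxes, a rough Python equivalent of the same correlation-and-regression workflow is sketched below. The file name and the assumption that INV, CREDIT, and INCOME occupy columns B through D with labels in row 1 follow the text's description and may need adjusting to your copy of the data:

import pandas as pd
import statsmodels.api as sm

# Assumes Ch03.xls holds INV, CREDIT, INCOME in columns B, C, D with labels in row 1
df = pd.read_excel("Ch03.xls", usecols="B:D")
print(df.corr())  # screen for multicollinearity before regressing

X = sm.add_constant(df[["CREDIT", "INCOME"]])  # add the intercept column
result = sm.OLS(df["INV"], X).fit()
print(result.summary())  # coefficients, standard errors, R2, adjusted R2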

Excel again displays the predicted values next to the residuals. Prof. Metric reminds us that we can use equations (3.8) and (3.9) to calculate interval prediction by following the steps given in Chapter 2.

Performing Regression for F-Tests

To carry out the F-test, we need to estimate two models: the unrestricted and the restricted. Data on Thailand-Japan real exchange rate (EXCHA), Real GDP (RGDP) for Thailand, and exports from Japan to Thailand (EXPS) are available in the file Ch03.xls, Fig.3.3 to 3.4. The hypothesis is that EXCHA and RGDP jointly affect EXPS, so we regress EXPSt on RGDPt and EXCHAt.


Figure 3.2 Multiple regression results

We learn that the following steps have to be performed:

Go to Data then Data Analysis, select Regression and click OK.

The input Y range is B1:B33, the input X range is C1:D33.

Check the box Labels.

Check the button Output Range and enter F1, then click OK.

A dialog box appears; click OK to overwrite the data.

The results for the unrestricted model are displayed in Figure 3.3.

From these results, the estimated equation is:

EXPSt = 970.0661 − 649.6924 EXCHAt + 0.0193 RGDPt; SSEU = 493,470,879.

We then estimate the restricted model by regressing EXPSt on EXCHAt using the same dataset:

Click Data and then Data Analysis on the ribbon.

Select Regression in the list and click OK.


Figure 3.3 Estimation results for unrestricted model

Source: IMF Data and Statistics and the World Bank.

A dialog box appears.

Type B1:B33 into the Input Y Range box.

Type C1:C33 into the Input X Range box.

Choose Labels.

Check the Output Range button and enter F21.

Click OK.

A dialog box appears.

Click OK again to overwrite the data.

The estimated results are displayed in Figure 3.4. Prof. Empirie reminds us that we can also test for the model significance by using the unrestricted model with SST = 17,905,964,312 and the same SSE = 493,470,879.

From the results, the estimated equation is:

EXPSt = 9929.6148 + 4807.4210 EXCHAt;
SSER = 15,392,098,311.

Prof. Empirie also reminds us that there are two formulas for calculating F-statistics:

  (i) The one for joint significance of the two coefficients uses SSEU from the regression results in Figure 3.3 and the SSER from the regression results in Figure 3.4. The formula for calculating this F-statistic is in equation (3.11).

 (ii) The F-test for the model significance only uses SST and SSE which come from the regression results in Figure 3.3. Hence, if the problem raised is to test for the model significance, we do not need to perform a regression on the restricted model. The formula for calculating this F-statistic is given in equation (3.13).
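As a minimal sketch of point (ii), the model-significance F-statistic can be computed directly from the SST and SSE reported above; the sample size N = 32 is inferred from the input range B1:B33 with labels and is an assumption worth verifying against the file:

from scipy import stats

sst, sse = 17_905_964_312, 493_470_879
N, K = 32, 3                                        # intercept plus two slope coefficients
f_stat = ((sst - sse) / (K - 1)) / (sse / (N - K))  # equation (3.13)
f_crit = stats.f.ppf(0.95, K - 1, N - K)            # roughly 3.33
print(f_stat, f_stat > f_crit)                      # a very large F: the model is significant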

Invo suddenly exclaims, “Oh! In the regression of EXPSt on RGDPt and EXCHAt, the two tests are the same because we have only two explanatory variables, so testing for the joint significance of these two variables is the same as testing for the model significance.”


Figure 3.4 Estimation results for restricted model

Source: IMF Data and Statistics.

Prof. Empirie smiles, “Yes, that is very true.”

We look at Invo with awe and are very happy that we need to practice performing only one test for the two cases when we get home.

Exercises

1. The file RGDP.xls contains data on real GDP (RGDP), consumption (CONS), investment (INV), and exports (EXPS). The data are for the United States, from the first quarter of 2006 to the first quarter of 2014. Given RGDP as the dependent variable,

(a) Perform a correlation analysis for the three explanatory variables.

(b) Perform a multiple regression of RGDP on the other three variables. Provide comments on the results, including the significance of a1, a2, a3, and a4, R2, adjusted R2, and the standard error of the regression.

2. Write the estimated equation for the regression results for Exercise 1; enter the standard errors below the estimated coefficients, adding the adjusted R2 next to the equation. Obtain the point prediction for the second quarter of the year 2014 based on this equation using a handheld calculator.

3. Use the results in Exercise 1 and carry out an additional regression on a restricted model as needed to test the joint significance of INV and EXPS at a 5 percent significance level. Write the procedure in four standard steps similar to those in the Hypothesis Testing section. The calculations of the F-statistics may be performed using a handheld calculator or Excel.
