CHAPTER 7

Modeling Issues and Endogeneity

Touro shares an anecdote with us. His company sent him to China last week to carry out a study on the demand for travel among Chinese residents. However, his colleagues in China warned him that China’s data contain a great deal of measurement error. He is wondering how we can control for this problem. Prof. Metric tells him not to worry, because we will discuss the issue this week. He says that upon finishing this chapter, we will be able to:

1. Analyze several modeling issues in regression.

2. Explain cases when endogeneity occurs.

3. Detect endogeneity problems and perform the correction procedure.

4. Apply Excel tools to estimate the related models.

We will discuss the modeling issues first because some of them are related to the endogeneity problem.

Modeling Issues

Model Specification

Unless we develop an econometric model based on a theoretical model, there will always be the possibility of having fewer or more variables than needed.

Omitted Variables

An omitted variable will significantly bias the regression coefficients. Given a model

image

If all coefficient estimates of this model are significant, but we accidentally omit xi2, then there are two consequences:

1. The coefficient estimates will be biased.

2. The variances will be incorrect, so the test results will be invalid.

For example, given the model:

DUR = a1 + a2WAGE + a3ASSETP + a4DURP + e    (7.2)

where DUR is durable expenditures, WAGE is average weekly wage, ASSETP is average asset price, and DURP is durable price. If a1, a2, a3, and a4 are all significant, but we accidentally eliminate ASSETP, then there is an omitted-variable problem.
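The bias from an omitted variable can be illustrated with a small simulation. The Python sketch below is not part of the chapter's Excel workflow. The data-generating process, the coefficient values (1, 2, 3), the link x2 = 0.5·x3 + noise, and the helper name ols are all assumptions chosen for illustration. It fits the full model and then the model with x2 omitted.

```python
import random

def ols(X, y):
    """OLS coefficients from the normal equations (X'X)b = X'y.

    X is a list of rows (include a 1 for the intercept); y is a list.
    Hypothetical helper written only for this illustration.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 5000
x3 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * v + rng.gauss(0, 1) for v in x3]   # x2 is correlated with x3 (assumed)
y = [1 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x2, x3)]

full = ols([[1, a, b] for a, b in zip(x2, x3)], y)   # both regressors included
short = ols([[1, b] for b in x3], y)                 # x2 accidentally omitted

print("full model : coefficient on x3 =", round(full[2], 3))
print("x2 omitted : coefficient on x3 =", round(short[1], 3))
```

Under these assumed values, the full model should return a coefficient on x3 close to 3, while the short model is pulled toward 3 + 2 × 0.5 = 4, because x3 picks up the effect of the omitted x2.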

Irrelevant Variables

Including irrelevant variables will not significantly bias the regression coefficients, but the Ordinary Least Squares (OLS) procedure may provide incorrect variances of the coefficient estimates, so the tests are less reliable, as discussed in Kmenta (2000). For example, if we accidentally add average stock price (STOCKP), having forgotten that we had already included it in the ASSETP variable, then STOCKP is an irrelevant variable in Equation (7.2) and the model becomes:

DUR = a1 + a2WAGE + a3ASSETP + a4DURP + a5STOCKP + e

The presence of the irrelevant variable, STOCKP, will not bias the coefficient estimates of the relevant variables, but their variances might be incorrect, so the test results will be less reliable.

We now see that choosing a correct model is crucial. Prof. Metric says that we might want to use a piecewise-downward approach starting from all theoretically possible variables with all available data. We then use F- and t-tests to eliminate the highly insignificant variables. He says that we can also use a piecewise-upward approach, which starts from a single explanatory variable. However, the downward approach is preferable, because this approach avoids the omitted variable problem that might arise if you use the piecewise-upward approach.

The Effect of Scaling the Data

Prof. Metric points out three cases of scaling the data.

1. Changing the scale of x: The coefficient estimate of x and its standard error change by the same proportion, so the t-statistic and R2 are unaffected. The interpretation changes according to the new unit.

2. Changing the scale of y: All coefficient estimates and their standard errors are scaled up or down by the same proportion, so the t-statistics and R2 are unaffected. The interpretation changes according to the new unit.

3. Changing the scale of x and y by the same factor: There is no change in the regression results for b2 and its standard error, so the t-statistic and R2 are unaffected. The interpretation changes according to the new unit.

We then work on an example of a spending model:

SPEND = 61.7 + 12.4 INCOME      R2 = 0.685

(se)      (8.76)   (3.52)

where income (INCOME) is in hundreds of dollars and spending (SPEND) is in dollars.

In this case, the t-statistic for INCOME is 3.52 (= 12.4/3.52).

Hence, spending changes by $12.40 when income changes by $100.

If income is in dollars instead of hundreds of dollars, then we have

SPEND = 61.7 + 0.124 INCOME      R2 = 0.685

(se)      (8.76)   (0.0352)

In this case, the t-statistic for INCOME is still 3.52 (= 0.124/0.0352), and spending changes by 12.4 cents when income changes by $1 (the same as spending changing by $12.40 when income changes by $100).
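The scaling result can be checked numerically. The Python sketch below uses simulated data, not the data behind the spending model above; the sample size, the noise level, and the helper name simple_ols are assumptions. It regresses spending on income measured in hundreds of dollars and then on income measured in dollars.

```python
import math
import random

def simple_ols(x, y):
    """Slope, its standard error, t-statistic, and R2 for y = b1 + b2*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b2 = sxy / sxx
    b1 = my - b2 * mx
    sse = sum((b - b1 - b2 * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    se_b2 = math.sqrt(sse / (n - 2) / sxx)
    return b2, se_b2, b2 / se_b2, 1 - sse / sst

rng = random.Random(1)
# Simulated sample (assumed values, for illustration only)
income_hundreds = [rng.uniform(5, 50) for _ in range(40)]
spend = [61.7 + 12.4 * inc + rng.gauss(0, 100) for inc in income_hundreds]
income_dollars = [100 * inc for inc in income_hundreds]   # same data, new unit

res_hundreds = simple_ols(income_hundreds, spend)
res_dollars = simple_ols(income_dollars, spend)

for label, (b2, se, t, r2) in [("hundreds", res_hundreds), ("dollars", res_dollars)]:
    print(f"{label:9s} b2={b2:.4f}  se={se:.4f}  t={t:.3f}  R2={r2:.3f}")
```

The slope and its standard error in the second run are both one hundredth of their values in the first run, so the t-statistic and R2 match across the two runs.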

Using Nonsample Information

Nonsample information is over and above the information contained in the sample observations. One popular application in econometrics is the use of Constant Returns to Scale (CRTS) to improve the precision of the estimated coefficients.

A production function (F) exhibits CRTS when it shows a constant ratio between inputs and outputs. In other words, CRTS occurs when output changes by the same proportion as all inputs. Algebraically, for a constant c > 0, the equation for CRTS is

Y(cK, cL) = cY(K, L)

where Y denotes output, K is capital, and L is labor. An example of a Cobb-Douglas production function in CRTS is

Y = cL^aK^b = cL^aK^(1 − a), with a + b = 1

Taking the logarithm of both sides yields:

LnY = Lnc + aLnL + bLnK = d + aLnL + (1 − a)LnK, where d = Lnc

LnY = d + aLnL + LnK − aLnK = d + a(LnL − LnK) + LnK

Hence, we can estimate the following equation:

LnY − LnK = d + a(LnL − LnK) + e, that is, Ln(Y/K) = d + aLn(L/K) + e

Once we find a, we can calculate b = 1 − a.

Prof. Metric says that in addition to improving the precision of the estimated coefficients, the use of nonsample information might help correct other problems such as multicollinearity or volatility of the data because the logarithmic function is less volatile.
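A short simulation can show how the CRTS restriction is used in practice. The Python sketch below is only an illustration: the true values a = 0.6 and d = 1, the way LnK is generated to move closely with LnL, and the helper name intercept_slope are all assumptions. Subtracting LnK from both sides of the log model turns it into a one-regressor model of LnY − LnK on LnL − LnK, which is what the code estimates.

```python
import random

def intercept_slope(x, y):
    """Intercept and slope of a one-regressor OLS fit (hypothetical helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

rng = random.Random(2)
n = 500
true_a, d = 0.6, 1.0                                   # assumed true values
ln_l = [rng.gauss(0, 1) for _ in range(n)]
ln_k = [v + rng.gauss(0, 0.3) for v in ln_l]           # LnK moves closely with LnL
ln_y = [d + true_a * l + (1 - true_a) * k + rng.gauss(0, 0.1)
        for l, k in zip(ln_l, ln_k)]

# Restricted regression under CRTS: Ln(Y/K) on Ln(L/K)
dep = [y - k for y, k in zip(ln_y, ln_k)]
reg = [l - k for l, k in zip(ln_l, ln_k)]
d_hat, a_hat = intercept_slope(reg, dep)
b_hat = 1 - a_hat                                      # recovered from the restriction

print("a_hat =", round(a_hat, 3), " b_hat =", round(b_hat, 3))
```

Because b is computed from the restriction rather than estimated separately, the two highly correlated regressors never appear together in the same regression.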

Cases of Endogeneity

The Issue

Prof. Metric says that in the previous chapters, we assume that the x’s are not random. If this assumption is violated, then we have an endogeneity problem, which is also called the random-regressor problem. Given the equation:

image

If xi2 is random, then xi2 might be correlated with ei, that is, xi2 is a random regressor and is said to be endogenous. In this case, the model has an endogeneity problem. There are two main consequences of the random-regressor problem:

1. OLS estimators are biased in small samples and are inconsistent even in a very large sample (the estimators do not converge to the true values in a large sample).

2. The standard errors are incorrect, so the interval estimates and the hypothesis-testing results are invalid.

Cases in Which x and e Are Correlated

There are many cases in which x and e are correlated.

Omitted Variables

A model might have an omitted variable or several omitted variables that will cause biased coefficient estimates. Omitted variable bias comes in two forms. The first one is called “theoretically driven bias” by Edwards (2014). Prof. Metric offers an example of this bias: We know that wage often depends on education (EDU)

WAGEi = a1 + a2EDUi + ei

However, IQ might affect WAGEi, so IQi lies in the error term ei. Also, since some people who have a higher IQ might also have a higher education, EDUi is correlated with ei.

The second case is called the “statistical omitted variable bias” in Edwards (2014). For example, given an estimated model

image

In this case, w is a direct function of x and so is correlated with x and y. If we estimate Equation (7.7b) without w, then w lies in the error term ei which will be correlated to x.

Autocorrelation in Lagged-Dependent Model

In Chapter 5, we learned about the model with lagged-dependent variables:

yt = ayt−1 + et, so yt−1 = ayt−2 + et−1

If the error has an autocorrelation problem, then:

image

Since yt−1 is correlated with et−1, which is also correlated to et as shown in equation (7.8), yt−1 is correlated with et as well.
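The following Python sketch illustrates this case by simulation. It assumes that the autocorrelation takes the first-order form et = ρet−1 + vt, and it assumes the values a = 0.5 and ρ = 0.5; neither the functional form nor the numbers are taken from the chapter's equations. It generates a long series and regresses yt on yt−1 without an intercept, as in the model above.

```python
import random

rng = random.Random(3)
a, rho, n = 0.5, 0.5, 20000     # assumed true values and sample size

y_prev, e_prev = 0.0, 0.0
ys = []
for _ in range(n):
    e = rho * e_prev + rng.gauss(0, 1)   # assumed AR(1) error
    y = a * y_prev + e                   # lagged-dependent model
    ys.append(y)
    y_prev, e_prev = y, e

lagged, current = ys[:-1], ys[1:]
# OLS slope of y_t on y_{t-1} with no intercept
a_hat = sum(p * c for p, c in zip(lagged, current)) / sum(p * p for p in lagged)

print("true a =", a, " OLS estimate =", round(a_hat, 3))
```

Even with 20,000 observations the estimate stays well above 0.5. Under these assumptions it settles near (a + ρ)/(1 + aρ) = 0.8, which shows that the problem does not disappear in large samples.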

Measurement Error (Attenuation Bias)

Prof. Metric says that we now come to the case asked by Touro at the beginning of the chapter. Suppose the estimated equation includes x* which is measured with some error:

image

We do not know x*, so we write the correct variable, x, as:

image

where u is some error.

Combining Equations (7.9) and (7.10) yields

image

where e = (v − a2u)

Because x* includes x and some error u, which belongs to the composite error e, x* is correlated with e, that is, the covariance of x* and e is different from zero. Hence, the measurement error creates an endogeneity problem called attenuation bias.
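Touro's question can be explored with a small simulation. The Python sketch below is an illustration under assumed values: a true slope of 2, a correctly measured regressor with variance 1, and a measurement error with variance 1. The variable names x_true and x_observed and the helper name slope are hypothetical.

```python
import random

def slope(x, y):
    """OLS slope of y on x with an intercept (hypothetical helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

rng = random.Random(4)
n = 5000
x_true = [rng.gauss(0, 1) for _ in range(n)]            # correctly measured regressor
x_observed = [x + rng.gauss(0, 1) for x in x_true]      # regressor plus measurement error
y = [1 + 2 * x + rng.gauss(0, 1) for x in x_true]       # y depends on the true regressor

b_true = slope(x_true, y)
b_observed = slope(x_observed, y)

print("slope using the true regressor    :", round(b_true, 3))
print("slope using the mismeasured values:", round(b_observed, 3))
```

With equal variances for the true regressor and the measurement error, the slope is pulled toward half of its true value, which is why this bias is described as attenuation: the estimate shrinks toward zero.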

Simultaneous Equation Bias

In the supply-demand model, Qd = Qs = Q in equilibrium, so the demand function can be written as having quantity (Q) dependent on price (P) and income (Y):

image

Alternatively, the inverted demand function has P dependent on Q and Y

image

However, the supply function has Q dependent on P and input price, for example, wage (W):

image

In practice, we should estimate these two equations in a system of simultaneous equations as follows:

image

It is clear from Equations (7.12b) and (7.12c) that P and Q are not exogenous (they depend on each other and must be determined simultaneously). Hence, the system has an endogeneity problem called the simultaneous equation bias.

Dealing with Endogeneity

Detecting Endogeneity

We learn that in order to detect endogeneity, we need to perform a Hausman test. Prof. Metric says that to perform the original Hausman test, we need to master knowledge of matrix algebra and know how to operate a sophisticated econometric package. Thus, he would rather teach us the modified Hausman test discussed in Kennedy (2008). The theoretical concept of the modified Hausman test is as follows:

Suppose that xi2 is the only endogenous variable in Equation (7.6), then all other variables are exogenous and are denoted as xj’s, where xj = xi1, xi3,..,xik. If we perform a regression of xi2 on these xj’s, the part of xi2 that is explained by the exogenous xj’s will be factored out as shown in Figure 7.1.

The remaining part of xi2 is explained by the residual vi from the estimation:

image

Once an estimation of vi is obtained from estimating Equation (7.14), it can be added to Equation (7.6) in the subsequent regression:

image

A t-test on the coefficient estimate of vi is then performed. If this coefficient estimate is not significantly different from zero, then xi2 is exogenous, and the model does not have an endogeneity problem.

image

Figure 7.1 Diagram for the modified Hausman test

Hence, the steps to perform the test are as follows:

1. Perform a regression of xi2 on all exogenous xj’s and obtain the values of the residuals v̂i.

2. Add v̂i to Equation (7.6) and perform the subsequent regression of the model in Equation (7.15).

3. Look at the p-value or calculate the t-statistic of ĉ, the coefficient estimate of v̂i, to test the following hypotheses:

H0: c = 0; Ha: c ≠ 0.

4. If the null hypothesis is rejected, then c ≠ 0, meaning xi2 is correlated with the error term and implying that the model has an endogeneity problem.

Prof. Metric then gives us an example, “Suppose regressing Equation (7.15) in Step 2 yields the estimated coefficient of v̂i as ĉ = 3.2 with its standard error se(ĉ) = 4.5, and N − K = 34. Let’s find out whether the model has an endogeneity problem.”

We proceed with the problem and calculate tSTAT = 3.2/4.5 = 0.71. We choose α = 0.05 and look at the t-table for tC = t(0.975, 34) = 2.032. Since |tSTAT| < tC, we do not reject the null and so c = 0, meaning that xi2 is not correlated with the error and implying that the model does not have an endogeneity problem.
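The steps of the modified Hausman test can also be sketched in Python. This is a minimal illustration on simulated data, not the chapter's Excel procedure. The data-generating process, the instrument w included in the first-stage regression, the shared shock h that makes x2 endogenous, and the helper name ols are all assumptions. The value 1.96 is the large-sample two-tail 5% critical value, used here because the simulated sample has 2,000 observations.

```python
import math
import random

def ols(X, y):
    """Return (coefficients, standard errors, residuals); hypothetical helper."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | I]; Gauss-Jordan turns it into [I | (X'X)^-1]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(X, y)]
    s2 = sum(u * u for u in resid) / (n - k)
    se = [math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return b, se, resid

rng = random.Random(5)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]   # exogenous regressor
w = [rng.gauss(0, 1) for _ in range(n)]   # instrument (assumed available)
h = [rng.gauss(0, 1) for _ in range(n)]   # shock shared by x2 and the error
x2 = [0.8 * wi + 0.5 * zi + hi + rng.gauss(0, 1) for wi, zi, hi in zip(w, z, h)]
y = [1 + 2 * xi + zi + hi + rng.gauss(0, 1) for xi, zi, hi in zip(x2, z, h)]

# Step 1: regress x2 on the exogenous variable and the instrument; keep residuals
_, _, v_hat = ols([[1, zi, wi] for zi, wi in zip(z, w)], x2)
# Step 2: add the residuals to the original equation
b, se, _ = ols([[1, xi, zi, vi] for xi, zi, vi in zip(x2, z, v_hat)], y)
# Steps 3 and 4: t-test on the coefficient of the residuals
t_stat = b[3] / se[3]
print("c_hat =", round(b[3], 3), " t =", round(t_stat, 2),
      " reject H0:", abs(t_stat) > 1.96)

# The chapter's numerical example: t = 3.2/4.5
t_example = 3.2 / 4.5
print("example t-statistic =", round(t_example, 2))
```

In this simulation x2 is endogenous by construction, so the test should reject the null hypothesis. The last two lines simply recompute the t-statistic from Prof. Metric's example.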

Correcting Endogeneity

We learn that when an endogeneity problem exists, we can use an alternative method of estimation, the Method of Moments (MM). When all classic assumptions in linear regressions are satisfied, the MM procedure leads us to the OLS estimators. When one or more of the explanatory variables are random, the MM procedure leads us to the Instrumental Variable (IV) estimators that are asymptotically consistent in large samples.

The purpose of the MM estimation is to find a variable as a substitute for the endogenous variable x. Theoretically, the MM procedure will lead to estimators that satisfy the condition that cov(w,e) = 0. Empirically, this procedure will lead to estimators that almost satisfy this condition, that is, the method will minimize cov(w,e) as much as possible. The variable w is called the IV.

Suppose that xi2 continues to be the endogenous variable as in the previous section, then:

The IV = w must satisfy two conditions:

1. w is correlated with xi2, that is, cov(w, xi2) ≠ 0.

2. w is not correlated with e, so that cov(w, e) = 0.

Prof. Metric also says that in cross-sectional data, it is very difficult to choose an IV that satisfies both conditions. In time-series or panel data, we can choose a lagged value that is correlated to the endogenous variable as the IV so that the first condition is satisfied. Assuming no serial correlation, the second condition is automatically satisfied by the classic assumptions, that is, the IV is in a lagged period, so it is not correlated to the error of the current period. The IV estimation is performed in two stages and so is also called two-stage least squares (2SLS) estimation. We learn that we can have more than one IV in performing a 2SLS procedure.

Prof. Metric also reminds us that the lags must be close enough in time to serve as acceptable IVs. For instance, if data are collected every five years, it is unlikely that the first lag will be a good instrument for the current variable because the lagged dataset and the current dataset were collected five years apart. However, if the data are collected annually, then it is possible that the lagged variable will be a good instrument for the current variable.

Using 2SLS in Single Equation Estimations

For a single equation model, the procedure is as follows:

First stage: Perform a regression of the endogenous variable xi2 on all exogenous variables, including the IV, which is wi:

image

Prof. Metric says that Equation (7.16) is called the reduced-form equation.

Second stage: Estimate Equation (7.6), which is called the structural equation, with xi2 replaced by the predicted value of xi2, which is x̂i2, obtained from the reduced-form estimation of Equation (7.16).

image

The problem is solved because we use wi as the IV for xi2, and wi is not correlated with ei, so x̂i2 is not correlated with ei either, that is, x̂i2 is exogenous.

Another issue is the standard errors, which are calculated from the residuals of the regression on Equation (7.17)

image

This error term is incorrect because it is different from the original ei in Equation (7.6)

image

Most econometric software will automatically provide the correct standard errors based on ei, as long as we select the command “2SLS” or “IV” estimation. Unfortunately, Excel does not have this feature, so Prof. Empirie will teach us how to obtain these correct standard errors in the Data Analysis section.
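The two stages and the standard-error correction can be sketched in Python on simulated data. This is a minimal illustration, not the output of any econometric package: the data-generating process, the single instrument w, the absence of other exogenous regressors, and the helper name simple_ols are all assumptions. The correction rescales the second-stage standard error by the ratio of the new model standard error (residuals computed with the original x) to the old one (residuals computed with the predicted x).

```python
import math
import random

def simple_ols(x, y):
    """Intercept, slope, slope standard error, and model standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b2 = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    b1 = my - b2 * mx
    sse = sum((c - b1 - b2 * a) ** 2 for a, c in zip(x, y))
    sigma = math.sqrt(sse / (n - 2))
    return b1, b2, sigma / math.sqrt(sxx), sigma

rng = random.Random(6)
n = 5000
w = [rng.gauss(0, 1) for _ in range(n)]   # instrument: moves x, unrelated to the error
h = [rng.gauss(0, 1) for _ in range(n)]   # shock shared by x and the error
x = [wi + hi + rng.gauss(0, 1) for wi, hi in zip(w, h)]
y = [1 + 2 * xi + hi + rng.gauss(0, 1) for xi, hi in zip(x, h)]

# Plain OLS for comparison (biased because x is correlated with the error)
_, b_ols, _, _ = simple_ols(x, y)

# First stage: reduced-form regression of x on the instrument
p1, p2, _, _ = simple_ols(w, x)
x_hat = [p1 + p2 * wi for wi in w]

# Second stage: structural equation with x replaced by its predicted value
b1, b2, se_naive, sigma_old = simple_ols(x_hat, y)

# Correct model standard error: residuals use the ORIGINAL x, not x_hat
sse_new = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
sigma_new = math.sqrt(sse_new / (n - 2))
se_correct = se_naive / sigma_old * sigma_new

print("OLS slope  =", round(b_ols, 3))
print("2SLS slope =", round(b2, 3))
print("second-stage se =", round(se_naive, 4), " corrected se =", round(se_correct, 4))
```

With a true slope of 2 in this simulation, the OLS slope is biased upward while the 2SLS slope should land close to 2. The second-stage standard error is not the one to report; the corrected value is.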

Using 2SLS in System of Equations Estimations

There are two modifications needed for using the 2SLS procedure on a system of equations. The first is called the “identification condition,” which requires that there are at least (M − 1) variables excluded from the other equation in a system of M equations. Detailed explanations of the theoretical intuition behind this condition are in Hill, Griffiths, and Lim (2011). Going back to System (7.13), we see that the first equation has the variable Y excluded from the second equation, which has the variable W excluded from the first equation, so System (7.13) satisfies the identification condition.

The second modification is that the reduced form in the first stage includes all exogenous variables from both equations, in addition to the IVs for the system. Hence, in System (7.13), the reduced form is:

image

where IVj, j = 1, 2, …, J, is one of the IVs for the system.

Obtaining the predicted value of P from estimating Equation (7.18), we should substitute it into each of the two structural equations in System (7.13) to perform the second stage of the IV estimation.

image

Prof. Metric also tells us that the 2SLS procedure does not yield a reliable R2 value, so Greene (2012) recommends that we use the Root Mean Square Error (RMSE) as the alternative measure of goodness-of-fit. This is true for all 2SLS procedures, regardless of a single equation or a system of equations estimations. He reminds us that discussions of the RMSE concept are in Chapter 6 of this textbook.

Data Analyses

Nonsample Information

Prof. Empirie says that we will learn to use nonsample information to correct the problem of multicollinearity between two explanatory variables. She tells us to go to the file Ch07.xls, Fig.7.2, where data for a production process, in millions of dollars, include logs of capital (LnK), labor (LnL), and output (LnY). We first examine the correlation between LnK and LnL by performing the following steps:

Go to Data then Data Analysis on the Ribbon.

Select Correlation instead of Regression then click OK.

A dialog box will appear, type A1:B34 into the box Input Range.

image

Figure 7.2 Correlation analyses: Original results and correction

Check the box Labels and enter the Output Range of J1.

Click OK then OK again to override the existing data.

The results are displayed in Cells J1 through L3 of Figure 7.2 and reveal that there is a multicollinearity problem between the two variables.

We then use the nonsample information of CRTS to change one variable to Ln(K/L) and perform another correlation analysis between Ln(K/L) and LnL:

Go to Data then Data Analysis on the Ribbon.

Select Correlation instead of Regression then click OK.

A dialog box will appear, type D1:E34 into the box Input Range.

Check the box Labels and enter the Output Range of N1.

Click OK then OK again to override the existing data.

The results are displayed in Cells N1 through P3 of Figure 7.2 and reveal that there is no multicollinearity problem between Ln(K/L) and LnL. Hence, we can proceed to perform the regression of the following model:

image

Once the coefficient b is obtained from the regression, the coefficient a can be calculated as a = 1 − b.

Endogeneity

Testing Endogeneity

Data on sale values (DEMt), values of promotion (PROMt), income (INCt), and advertisement expenditures (ADSt), are from the file Ch07.xls, Fig.7.3-7.5. The model is:

DEMt = a1 + a2INCt + a3PROMt + et

We suspect that PROMt is endogenous and wish to use ADSt−1 as an instrumental variable (IV) for PROMt. Since most companies use promotion sales to support their advertisements, PROMt is likely correlated with ADSt−1; this will be examined later. Since ADSt−1 is in period (t − 1), it is not correlated with et by the classic assumptions. This might make ADSt−1 an acceptable IV for PROMt. To perform the modified Hausman test for endogeneity, we first perform a regression of PROM on ADS and INC:

Go to Data then Data Analysis, select Regression then click OK.

The input Y range is E1:E35, the input X range is C1:D35.

Check the Labels and Residuals boxes.

Check the button Output Range and enter J1 then click OK.

A dialogue box will appear, click OK to overwrite the data.

The main results for the qualification of ADS are displayed in Figure 7.3.

Next, we perform the regression of the original equation with the Residuals v̂t added:

image

Copy the Residuals from Cells L25 through L59 and paste into Cell F1 through F35.

Go to Data then Data Analysis, select Regression then click OK.

The input Y range is B1:B35, the input X range is D1:F35.

Check the Labels box.

Check the button Output Range and enter N22.

Click OK and then OK again to overwrite the data.

image

Figure 7.3 Qualification of ADSt−1 as instrument variable

image

Figure 7.4 Main results for modified Hausman test

The main results are displayed in Figure 7.4.

The results reveal that the coefficient estimate of v̂t, labeled “Residuals,” is weakly significant with a p-value of 0.06, so an endogeneity problem exists, although it is not too serious.

Correcting the Endogeneity Problem

The results in Figure 7.3 also support our argument that ADSt−1 is correlated with PROMt because the coefficient estimate of ADSt−1 has the p-value = 0.0003 < 0.05. Additionally, by the classic assumptions, ADSt−1 is not correlated to et. Therefore, ADSt−1 satisfies both conditions to be an acceptable IV.

To correct this endogeneity problem, continue with the given dataset.

Copy and paste Predicted PROM from Cells K25 through K59, into Cells G1 through G35.

Second, copy INC in Column D and paste INC into Column H next to Predicted PROM.

Finally regress the original equation with Predicted PROM in place of PROM:

image

Go to Data then Data Analysis, select Regression then click OK.

The input Y range is B1:B35, the input X range is G1:H35.

Check the box Labels.

Check the button Output Range and enter B37.

Click OK then OK again to obtain the results in Figure 7.5.

From the results, the estimated equation is

image

Figure 7.5 Regression results for endogeneity correction

image

As discussed in the theoretical section, we still must obtain the correct standard errors. For your convenience, we have copied and pasted the original data into the file Ch07.xls, Fig.7.6-7.7. In this file, we also have the aforementioned results in Cells I1 through O17. You should perform the following steps to obtain the correct standard errors for the slope estimates:

In Cell B2 type =$M$15+($M$16*C2)+($M$17*D2) and press enter.

Copy and paste the formula into Cells B3 through B35 (this is our predicted DEM).

In Cell E2 type =SUMXMY2(A2:A35,B2:B35), which gives the new Sum of Squared Errors (SSE).

In Cell F2, type =E2/31, where 31 is T-K, which gives the new variance of the model.

In Cell G2, type =(F2)^(1/2), which gives the new standard error of the model, se(mod).

The results are reported in Figure 7.6.

In Cell K18 type =K16/$J$5, which divides the old standard error for PROM by the old se(mod).

Copy and paste the formula into Cell K19 for INC.

image

Figure 7.6 Obtaining correct standard error for the model

image

Figure 7.7 Obtain correct standard errors and p-values for slopes

In Cell K20, type =K18*$G$2, which multiplies by the new se(mod) to give the correct standard error for PROM.

In Cell L20 type =J16/K20, which calculates the correct t-statistic for PROM.

In Cell M20 type =TDIST(L20,31,2), which gives the correct p-value for PROM.

Copy and paste the formulas in Cells K20 to M20 into Cells K21 to M21 for INC.

The results are reported in Figure 7.7 with correct values in Cells K20 through M21.

We now can use these correct values for the t-tests, F-tests, interval estimations, and interpretations of the coefficient estimates.

Exercises

1. Given the following equation:

lnY = ln a + α ln K + β ln L + δ ln H

The correlation between ln L and ln K has the coefficient = r = 0.99.

a. What is the problem?

b. An econometrician believes that using nonsample information of CRTS might correct the problem. Suggest a model of CRTS, showing all steps of derivation.

2. Data from the file Samples.xls, are on demand (DEMt), promotion expenditures (PROt), income (INCt), and values of free samples (SAMPt). Given the model

DEMt = a1 + a2INCt + a3PROt + et

We suspect that PROt is endogenous and want to use SAMPt−1 as an instrument variable (IV) for PROt. Carry out an endogeneity test for this model using a handheld calculator or Excel.

3. Propose a method to correct for the endogeneity problem in Question 2 and carry out the correction procedure using Excel.
