Modeling Issues and Endogeneity
Touro shares an anecdote with us. His company sent him to China last week to carry out a study on demand for travel from Chinese residents. However, his colleagues in China warned him that China’s data contain a great deal of measurement error. He is wondering how we can control for this problem. Prof. Metric tells him not to worry, because we will discuss the issue this week. He says that upon finishing this chapter, we will be able to:
1. Analyze several modeling issues in regression.
2. Explain cases when endogeneity occurs.
3. Detect endogeneity problems and perform the correction procedure.
4. Apply Excel tools to estimate the related models.
We will discuss the modeling issues first because some of them are related to the endogeneity problem.
Modeling Issues
Model Specification
Unless we develop an econometric model based on a theoretical model, there will always be the possibility of having fewer or more variables than needed.
Omitted Variables
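Omitting a relevant variable biases the coefficient estimates of the remaining variables whenever the omitted variable is correlated with them. Consider a model in which xi2 is one of the explanatory variables.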
If all coefficient estimates of this model are significant, but we accidentally omit xi2, then there are two consequences:
1. The coefficient estimates will be biased.
2. The variances will be incorrect, so the test results will be invalid.
For example, given the model:
DUR = a1 + a2WAGE + a3ASSETP + a4DURP + e (7.2)
where DUR is durable expenditures, WAGE is average weekly wage, ASSETP is average asset price, and DURP is durable price. If a1, a2, a3, and a4 are all significant, but we accidentally eliminate ASSETP, then there is an omitted-variable problem.
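The omitted-variable bias can be illustrated with a minimal sketch. The data below are hypothetical and noise-free, and the names x2, x3, full_coef, and short_coef are chosen only for this illustration. Because x2 and x3 are correlated, dropping x2 pushes part of its effect onto the coefficient of x3.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for y = X b + e; X must already contain a constant column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical data: x2 and x3 are correlated (regressing x2 on x3 gives a slope of 0.8).
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x2 + 3.0 * x3   # true model, noise-free so the bias is exact

const = np.ones_like(x2)
full_coef = ols(np.column_stack([const, x2, x3]), y)   # correct specification
short_coef = ols(np.column_stack([const, x3]), y)      # x2 accidentally omitted

print("full model :", np.round(full_coef, 3))
print("x2 omitted :", np.round(short_coef, 3))
print("bias on x3 :", round(short_coef[1] - 3.0, 3))
```

In this sketch the bias on the coefficient of x3 equals the true coefficient of x2 (2.0) times the slope from regressing x2 on x3 (0.8).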
Irrelevant Variables
Including irrelevant variables will not significantly bias the regression coefficients, but the Ordinary Least Squares (OLS) procedure may provide incorrect variances of the coefficient estimates, so the tests are less reliable as discussed in Kmenta (2000). For example, if we accidentally add average stock price (STOCKP), having forgotten that it is already included in the ASSETP variable, then STOCKP is an irrelevant variable to Equation (7.2) and the model becomes:
DUR = a1 + a2WAGE + a3ASSETP + a4DURP + a5STOCKP + e (7.3)
The presence of the irrelevant variable, STOCKP, will not bias the coefficient estimates of the relevant variables, but their variances might be incorrect, so the test results will be less reliable.
We now see that choosing a correct model is crucial. Prof. Metric says that we might want to use a piecewise-downward approach starting from all theoretically possible variables with all available data. We then use F- and t-tests to eliminate the highly insignificant variables. He says that we can also use a piecewise-upward approach, which starts from a single explanatory variable. However, the downward approach is preferable, because this approach avoids the omitted variable problem that might arise if you use the piecewise-upward approach.
The Effect of Scaling the Data
Prof. Metric points out three cases of scaling the data.
1. Change in the scale of x: The coefficient estimate of x and its standard error change by the same proportion, so the t-statistic and R2 are unaffected. The interpretation changes according to the new unit.
2. Change in the scale of y: All coefficient estimates and their standard errors are scaled up or down by the same proportion, so the t-statistics and R2 are unaffected. The interpretation changes according to the new unit.
3. Change in the scales of x and y by the same factor: There is no change in the slope estimate b2 or its standard error, so the t-statistic and R2 are unaffected. The interpretation changes according to the new units.
We then work on an example of a spending model:
SPEND = 61.7 + 12.4 INCOME        R2 = 0.685
(se)    (8.76)  (3.52)
where income (INCOME) is in hundreds of dollars and spending (SPEND) is in dollars.
In this case, the t-statistic for INCOME is 3.52 (= 12.4/3.52).
Hence, spending changes by $12.40 when income changes by $100.
If income is in dollars instead of hundreds of dollars, then we have
SPEND = 61.7 + 0.124 INCOME        R2 = 0.685
(se)    (8.76)  (0.0352)
In this case, the t-statistic for INCOME is still 3.52 (= 0.124/0.0352), and spending changes by 12.4 cents when income changes by $1 (the same as spending changing by $12.40 when income changes by $100).
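The effect of rescaling x can be checked numerically. The following is a minimal sketch with hypothetical data (the numbers are not those behind the SPEND equation above), and simple_ols is a helper written only for this illustration.

```python
import numpy as np

def simple_ols(x, y):
    """Return (slope, se_slope, t_stat, r2) for y = b1 + b2*x + e."""
    X = np.column_stack([np.ones_like(x), x])
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)          # estimated error variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef[1], se[1], coef[1] / se[1], r2

# Hypothetical data: income in hundreds of dollars, spending in dollars.
income_hundreds = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
spend = np.array([190.0, 240.0, 330.0, 350.0, 460.0, 480.0, 570.0, 610.0])

b_h, se_h, t_h, r2_h = simple_ols(income_hundreds, spend)
b_d, se_d, t_d, r2_d = simple_ols(income_hundreds * 100.0, spend)  # income in dollars

print("slope ratio:", b_h / b_d, " se ratio:", se_h / se_d)
print("t-statistics:", t_h, t_d, " R2:", r2_h, r2_d)
```

The slope and its standard error both shrink by the factor of 100, which is why the t-statistic and R2 do not move.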
Nonsample information is over and above the information contained in the sample observations. One popular application in econometrics is the use of Constant Returns to Scale (CRTS) to improve the precision of the estimated coefficients.
A production function exhibits CRTS when scaling all inputs by the same factor scales output by that same factor. In other words, CRTS occurs when output changes in the same proportion as all inputs. Algebraically, for a constant c > 0, the equation for CRTS is
Y(cK, cL) = cY(K, L)
where Y denotes output, K is capital, and L is labor. An example of a Cobb-Douglas production function in CRTS is
Y = cL^aK^b, where a + b = 1
Taking the logarithm of both sides yields:
LnY = Lnc + aLnL + bLnK = d + aLnL + (1 − a)LnK, where d = Lnc
LnY = d + aLnL + LnK − aLnK = d + a(LnL − LnK) + LnK
Hence, we can estimate the following equation:
LnY − LnK = d + a(LnL − LnK), that is, Ln(Y/K) = d + aLn(L/K)
Once we find a, we can calculate b = 1 − a.
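A minimal sketch of the restricted estimation follows. It assumes a noise-free CRTS technology Y = cL^aK^(1 − a) with hypothetical input data and hypothetical values a = 0.7 and c = 2, and it regresses Ln(Y/K) on Ln(L/K) to recover a before computing b = 1 − a.

```python
import numpy as np

# Hypothetical inputs and a noise-free CRTS technology Y = c * L**a * K**(1 - a).
a_true, c_true = 0.7, 2.0
L = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
K = np.array([5.0, 30.0, 10.0, 60.0, 20.0, 45.0])
Y = c_true * L**a_true * K**(1.0 - a_true)

# Restricted model: Ln(Y/K) = d + a * Ln(L/K).
# np.polyfit with degree 1 returns the slope first, then the intercept.
a_hat, d_hat = np.polyfit(np.log(L / K), np.log(Y / K), 1)
b_hat = 1.0 - a_hat   # implied by the CRTS restriction a + b = 1

print("a_hat:", a_hat, " b_hat:", b_hat, " d_hat:", d_hat)
```

Only one slope is estimated, so the restriction removes the need to estimate separate coefficients on two highly correlated regressors.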
Prof. Metric says that in addition to improving the precision of the estimated coefficients, the use of nonsample information might help correct other problems such as multicollinearity or volatility of the data because the logarithmic function is less volatile.
Cases of Endogeneity
The Issue
Prof. Metric says that in the previous chapters, we assumed that the x’s are not random. If this assumption is violated, then we have an endogeneity problem, which is also called the random-regressor problem. Given the equation:
yi = a1xi1 + a2xi2 + … + akxik + ei (7.6)
If xi2 is random, then xi2 might be correlated with ei, that is, xi2 is a random regressor and is said to be endogenous. In this case, the model has an endogeneity problem. There are two main consequences of the random-regressor problem:
1. OLS estimators are biased in small samples and are inconsistent even in a very large sample (the estimators do not converge to the true values in a large sample).
2. The standard errors are incorrect, so the interval estimates and the hypothesis-testing results are invalid.
Cases in Which x and e Are Correlated
There are many cases in which x and e are correlated.
Omitted Variables
A model might have an omitted variable or several omitted variables that will cause biased coefficient estimates. Omitted variable bias comes in two forms. The first one is called “theoretically driven bias” by Edwards (2014). Prof. Metric offers an example of this bias: We know that wage often depends on education (EDU):
WAGEi = a1 + a2EDUi + ei (7.7a)
However, IQ might affect WAGEi, so IQi lies in the error term ei. Also, since some people who have a higher IQ might also have a higher education, EDUi is correlated with ei.
The second case is called the “statistical omitted variable bias” in Edwards (2014). For example, given an estimated model
In this case, w is a direct function of x and so is correlated with x and y. If we estimate Equation (7.7b) without w, then w lies in the error term ei, which will be correlated with x.
Autocorrelation in Lagged-Dependent Model
In Chapter 5, we learned about the model with lagged-dependent variables:
yt = ayt−1 + et, so yt−1 = ayt−2 + et−1
If the error has an autocorrelation problem, for example first-order autocorrelation, then:
et = ρet−1 + vt (7.8)
where vt is a random error.
Since yt−1 is correlated with et−1, which is also correlated to et as shown in equation (7.8), yt−1 is correlated with et as well.
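A small simulation can make this case concrete. The sketch below assumes first-order autocorrelation in the error with hypothetical values a = 0.5 and ρ = 0.5. It shows that yt−1 is correlated with et and that the OLS estimate of a settles well above 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, rho = 20000, 0.5, 0.5    # hypothetical sample size and parameters

v = rng.normal(size=n)
e = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + v[t]   # autocorrelated error
    y[t] = a * y[t - 1] + e[t]     # lagged-dependent model

y_lag, y_now = y[:-1], y[1:]
a_ols = (y_lag @ y_now) / (y_lag @ y_lag)       # OLS slope (no intercept)
corr_lag_error = np.corrcoef(y_lag, e[1:])[0, 1]

print("OLS estimate of a:", a_ols)
print("corr(y_lag, e):   ", corr_lag_error)
```

With these parameter values the OLS estimate converges to (a + ρ)/(1 + aρ) = 0.8 rather than 0.5, so a larger sample does not remove the bias.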
Measurement Error (Attenuation Bias)
Prof. Metric says that we now come to the case asked by Touro at the beginning of the chapter. Suppose the true model depends on a correct variable, x:
y = a1 + a2x + v (7.9)
We do not observe x. Instead, we observe x*, which is measured with some error, so we write the correct variable, x, as:
x = x* − u (7.10)
where u is some error.
Combining Equations (7.9) and (7.10) yields
y = a1 + a2x* + e (7.11)
where e = (v − a2u)
Because x* includes x and some error u, which belongs to the composite error e, x* is correlated with e, that is, the covariance of x* and e is different from zero. Hence, the measurement error creates an endogeneity problem called attenuation bias.
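The attenuation can be seen in a short simulation. This is a sketch under stated assumptions: the observed variable is x* = x + u, the measurement error u is independent of x and has the same variance, and the coefficients a1 = 1 and a2 = 2 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, a1, a2 = 10000, 1.0, 2.0        # hypothetical sample size and coefficients

x_true = rng.normal(size=n)        # correct variable x (not observed in practice)
u = rng.normal(size=n)             # measurement error
x_star = x_true + u                # observed variable, x* = x + u
y = a1 + a2 * x_true + rng.normal(size=n)

# np.polyfit with degree 1 returns the slope first, then the intercept.
slope_true, _ = np.polyfit(x_true, y, 1)   # infeasible regression on the correct x
slope_star, _ = np.polyfit(x_star, y, 1)   # feasible regression on the noisy x*

print("slope using x :", slope_true)
print("slope using x*:", slope_star)
```

Under these assumptions the slope on x* converges to a2·var(x)/(var(x) + var(u)), which is half of a2 here, so the estimate is pulled toward zero.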
Simultaneous Equation Bias
In the supply-demand model, Qd = Qs = Q in equilibrium, so the demand function can be written as having quantity (Q) dependent on price (P) and income (Y):
Alternatively, the inverted demand function has P dependent on Q and Y
However, the supply function has Q dependent on P and input price, for example, wage (W):
In practice, we should estimate these two equations in a system of simultaneous equations as follows:
It is clear from Equations (7.12b) and (7.12c) that P and Q are not exogenous (they depend on each other and must be determined simultaneously). Hence, the system has an endogeneity problem called the simultaneous equation bias.
Dealing with Endogeneity
Detecting Endogeneity
We learn that in order to detect endogeneity, we need to perform a Hausman test. Prof. Metric says that to perform the original Hausman test, we need to master knowledge of matrix algebra and know how to operate a sophisticated econometric package. Thus, he would rather teach us the modified Hausman test discussed in Kennedy (2008). The theoretical concept of the modified Hausman test is as follows:
Suppose that xi2 is the only endogenous variable in Equation (7.6), so all other variables are exogenous and are denoted as xj’s, where xj = xi1, xi3, …, xik. If we perform a regression of xi2 on these xj’s, the part of xi2 that is explained by the exogenous xj’s will be factored out as shown in Figure 7.1.
The remaining part of xi2 is explained by the residual vi from the estimation:
Once an estimation of vi is obtained from estimating Equation (7.14), it can be added to Equation (7.6) in the subsequent regression:
A t-test on the coefficient estimate of vi is then performed. If this coefficient estimate is not significantly different from zero, then xi2 is exogenous, and the model does not have an endogeneity problem.
Hence, the steps to perform the test are as follows:
1. Perform a regression of xi2 on all exogenous xj’s and obtain the residuals v̂i, which are the estimated values of the error term vi.
2. Add v̂i to Equation (7.6) and perform the subsequent regression of the model in Equation (7.15).
3. Look at the p-value or calculate the t-statistic of ĉ, the coefficient estimate of v̂i, to test the following hypotheses:
H0: c = 0; Ha: c ≠ 0.
4. If the null hypothesis is rejected, then c ≠ 0, meaning xi2 is correlated with the error term and implying that the model has an endogeneity problem.
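The four steps above can be sketched in code. The example below uses simulated data in which x2 is endogenous by construction. The variable names (x3, w, v_hat), the inclusion of an extra exogenous variable w in Step 1, and the large-sample critical value of 1.96 are assumptions made for this illustration.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, standard errors, and residuals of an OLS regression."""
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return coef, np.sqrt(sigma2 * np.diag(xtx_inv)), resid

rng = np.random.default_rng(2)
n = 2000
const = np.ones(n)
x3 = rng.normal(size=n)               # exogenous regressor
w = rng.normal(size=n)                # additional exogenous variable (hypothetical)
v = rng.normal(size=n)
x2 = 1.0 + w + 0.5 * x3 + v           # x2 contains the component v ...
e = 0.8 * v + rng.normal(size=n)      # ... which also enters the error, so x2 is endogenous
y = 1.0 + 3.0 * x2 + 2.0 * x3 + e

# Step 1: regress x2 on the exogenous variables and keep the residuals.
_, _, v_hat = ols(np.column_stack([const, x3, w]), x2)
# Step 2: add the residuals to the original equation and re-estimate.
coef, se, _ = ols(np.column_stack([const, x2, x3, v_hat]), y)
# Steps 3 and 4: t-test on the coefficient of v_hat.
t_stat = coef[3] / se[3]
endogenous = abs(t_stat) > 1.96       # large-sample 5 percent critical value

print("c_hat:", coef[3], " t-statistic:", t_stat, " endogenous:", endogenous)
```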
Prof. Metric then gives us an example, “Suppose regressing Equation (7.15) in Step 2 yields the estimated coefficient of v̂i as ĉ = 3.2 with its standard error se(ĉ) = 4.5, and N − K = 34. Let’s find out whether the model has an endogeneity problem.”
We proceed with the problem and calculate tSTAT = 3.2/4.5 = 0.71. We choose α = 0.05 and look at the t-table for tC = t(0.975, 34) = 2.032. Since |tSTAT| < tC, we do not reject the null hypothesis that c = 0. Hence, there is no evidence that xi2 is correlated with the error, which implies that the model does not have an endogeneity problem.
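The arithmetic of this example can be reproduced in a few lines. The critical value is simply the table value quoted in the text, not computed by the code.

```python
c_hat, se_c_hat = 3.2, 4.5     # values from the example
t_critical = 2.032             # t(0.975, 34) from the t-table, as quoted in the text

t_stat = c_hat / se_c_hat
reject_null = abs(t_stat) > t_critical

print(f"t-statistic = {t_stat:.2f}, reject H0: {reject_null}")
# prints: t-statistic = 0.71, reject H0: False
```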
Correcting Endogeneity
We learn that when an endogeneity problem exists, we can use an alternative method of estimation, the Method of Moments (MM). When all classic assumptions in linear regressions are satisfied, the MM procedure leads us to the OLS estimators. When one or more of the explanatory variables are random, the MM procedure leads us to the Instrumental Variable (IV) estimators, which are consistent in large samples.
The purpose of the MM estimation is to find a variable w as a substitute for the endogenous variable x. Theoretically, the MM procedure will lead to estimators that satisfy the condition that cov(w, e) = 0. Empirically, this procedure will lead to estimators that almost satisfy this condition, that is, the method will bring the sample covariance of w and e as close to zero as possible. The variable w is called the IV.
Suppose that xi2 continues to be the endogenous variable in Equation (7.6), as in the previous section. The IV, w, must satisfy two conditions:
1. w is correlated with xi2, that is, cov(w, xi2) ≠ 0.
2. w is not correlated with e, so that cov(w, e) = 0.
Prof. Metric also says that in cross-sectional data, it is very difficult to choose an IV that satisfies both conditions. In time-series or panel data, we can choose a lagged value that is correlated to the endogenous variable as the IV so that the first condition is satisfied. Assuming no serial correlation, the second condition is automatically satisfied by the classic assumptions, that is, the IV is in a lagged period, so it is not correlated to the error of the current period. The IV estimation is performed in two stages and so is also called two-stage least squares (2SLS) estimation. We learn that we can have more than one IV in performing a 2SLS procedure.
Prof. Metric also reminds us that the lags must be close enough in time to serve as acceptable IVs. For instance, if data are collected every five years, it is unlikely that the first lag will be a good instrument for the current variable because the lagged dataset and the current dataset were collected five years apart. However, if the data are collected annually, then it is possible that the lagged variable will be a good instrument for the current variable.
Using 2SLS in Single Equation Estimations
For a single equation model, the procedure is as follows:
First stage: Perform a regression of the endogenous variable xi2 on all exogenous variables, including the IV, which is wi:
Prof. Metric says that Equation (7.16) is called the reduced-form equation.
Second stage: Estimate Equation (7.6), which is called the structural equation, with xi2 replaced by the predicted value of xi2, which is obtained from the reduced-form estimation of Equation (7.16).
The problem is solved because we use wi as the IV for xi2, and wi is not correlated with ei, so the predicted value x̂i2 is not correlated with ei either, that is, x̂i2 is exogenous.
Another issue is the standard errors, which are calculated from the residuals of the regression on Equation (7.17)
This error term is incorrect because it is different from the original in Equation (7.6)
Most econometric software will automatically provide the correct standard errors based on ei, as long as we select the command “2SLS” or “IV” estimation. Unfortunately, Excel does not have this feature, so Prof. Empirie will teach us how to obtain these correct standard errors in the Data Analysis section.
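The two stages and the standard-error correction can be sketched as follows. This is a minimal illustration on simulated data, not the output of any particular econometric package. The data-generating process and all variable names are hypothetical. The key point is that the correct residuals are computed with the actual xi2, while the naive second-stage residuals use its predicted value.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and the matrix (X'X)^-1."""
    xtx_inv = np.linalg.inv(X.T @ X)
    return xtx_inv @ X.T @ y, xtx_inv

rng = np.random.default_rng(3)
n = 5000
const = np.ones(n)
x3 = rng.normal(size=n)               # exogenous regressor
w = rng.normal(size=n)                # instrument
v = rng.normal(size=n)
x2 = 1.0 + w + 0.5 * x3 + v           # endogenous regressor
e = 0.8 * v + rng.normal(size=n)      # error correlated with x2 through v
y = 1.0 + 3.0 * x2 + 2.0 * x3 + e

# First stage: reduced form of x2 on all exogenous variables, including w.
Z = np.column_stack([const, x3, w])
gamma, _ = ols(Z, x2)
x2_hat = Z @ gamma

# Second stage: structural equation with x2 replaced by its predicted value.
X_hat = np.column_stack([const, x2_hat, x3])
beta, xtx_inv = ols(X_hat, y)

# Naive residuals use x2_hat; correct residuals use the actual x2.
X_actual = np.column_stack([const, x2, x3])
k = X_hat.shape[1]
s2_naive = np.sum((y - X_hat @ beta) ** 2) / (n - k)
s2_correct = np.sum((y - X_actual @ beta) ** 2) / (n - k)

se_naive = np.sqrt(s2_naive * np.diag(xtx_inv))
se_correct = np.sqrt(s2_correct * np.diag(xtx_inv))

print("2SLS coefficients:       ", beta)
print("naive standard errors:   ", se_naive)
print("corrected standard errors:", se_correct)
```

Each corrected standard error equals the naive one multiplied by the ratio of the corrected to the naive standard error of the model, so the correction is a simple rescaling once the new residuals are available.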
Using 2SLS in System of Equations Estimations
There are two modifications needed for using the 2SLS procedure on a system of equations. The first is called the “identification condition,” which requires that, in a system of M equations, at least (M − 1) variables that appear elsewhere in the system be excluded from each equation. Detailed explanations of the theoretical intuition behind this condition are in Hill, Griffiths, and Lim (2011). Going back to System (7.13), where M = 2, we see that the first equation includes the variable Y, which is excluded from the second equation, and the second equation includes the variable W, which is excluded from the first equation, so System (7.13) satisfies the identification condition.
The second modification is that the reduced form in the first stage includes all exogenous variables from both equations, in addition to the IVs for the system. Hence, in System (7.13), the reduced form is:
where IVj, j = 1, 2, …, J, is one of the IVs for the system.
Obtaining the predicted value of P from estimating Equation (7.18), we should substitute it into each of the two structural equations in System (7.13) to perform the second stage of the IV estimation.
Prof. Metric also tells us that the 2SLS procedure does not yield a reliable R2 value, so Greene (2012) recommends that we use the Root Mean Square Error (RMSE) as an alternative measure of goodness of fit. This is true for all 2SLS procedures, whether for a single equation or for a system of equations. He reminds us that discussions of the RMSE concept are in Chapter 6 of this textbook.
Data Analyses
Nonsample Information
Prof. Empirie says that we will learn to use nonsample information to correct the problem of multicollinearity between two explanatory variables. She tells us to go to the file Ch07.xls, Fig.7.2, which contains data on the logs of capital (LnK), labor (LnL), and output (LnY) for a production process, where the underlying values are in millions of dollars. We first examine the correlation between LnK and LnL by performing the following steps:
Go to Data then Data Analysis on the Ribbon.
Select Correlation instead of Regression then click OK.
A dialog box will appear, type A1:B34 into the box Input Range.
Check the box Labels and enter the Output Range of J1.
Click OK then OK again to overwrite the existing data.
The results are displayed in Cells J1 through L3 of Figure 7.2 and reveal that there is a multicollinearity problem between the two variables.
We then use the nonsample information of CRTS to change one variable to Ln(K/L) and perform another correlation analysis between Ln(K/L) and LnL:
Go to Data then Data Analysis on the Ribbon.
Select Correlation instead of Regression then click OK.
A dialog box will appear, type D1:E34 into the box Input Range.
Check the box Labels and enter the Output Range of N1.
Click OK then OK again to overwrite the existing data.
The results are displayed in Cells N1 through P3 of Figure 7.2 and reveal that there is no multicollinearity problem between Ln(K/L) and LnL. Hence, we can proceed to perform the regression of the following model:
Once the coefficient b is obtained from the regression, the coefficient a can be calculated by writing a = 1 − b.
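The two correlation checks can also be reproduced outside Excel. The sketch below uses hypothetical data rather than the contents of Ch07.xls. It is constructed so that capital moves roughly in step with labor, which makes LnK and LnL highly correlated while Ln(K/L) and LnL are only weakly correlated.

```python
import numpy as np

# Hypothetical data: capital is labor times a ratio that alternates around 1.
L = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
K = L * np.array([1.1, 0.9, 1.1, 0.9, 1.1, 0.9, 1.1, 0.9])

ln_L, ln_K = np.log(L), np.log(K)
ln_K_over_L = ln_K - ln_L             # Ln(K/L)

corr_K_L = np.corrcoef(ln_K, ln_L)[0, 1]
corr_ratio_L = np.corrcoef(ln_K_over_L, ln_L)[0, 1]

print("corr(LnK, LnL):     ", corr_K_L)
print("corr(Ln(K/L), LnL): ", corr_ratio_L)
```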
Endogeneity
Testing Endogeneity
Data on sale values (DEMt), values of promotion (PROMt), income (INCt), and advertisement expenditures (ADSt) are from the file Ch07.xls, Fig.7.3-7.5. The model is:
DEMt = a1 + a2INCt + a3PROMt + et
We suspect that PROMt is endogenous and wish to use ADSt−1 as an instrumental variable (IV) for PROMt. Since most companies use promotional sales to support their advertisements, PROMt is likely correlated with ADSt−1; this correlation will be examined later. Since ADSt−1 is in period (t − 1), it is not correlated with et by the classic assumptions. This might make ADSt−1 an acceptable IV for PROMt. To perform the modified Hausman test for endogeneity, we first perform a regression of PROMt on ADSt−1 and INCt:
Go to Data then Data Analysis, select Regression then click OK.
The input Y range is E1:E35, the input X range is C1:D35.
Check the Labels and Residuals boxes.
Check the button Output Range and enter J1 then click OK.
A dialog box will appear, click OK to overwrite the data.
The main results for the qualification of ADS are displayed in Figure 7.3.
Next, we perform the regression of the original equation with the Residuals added:
Copy the Residuals from Cells L25 through L59 and paste them into Cells F1 through F35.
Go to Data then Data Analysis, select Regression then click OK.
The input Y range is B1:B35, the input X range is D1:F35.
Check the Labels box.
Check the button Output Range and enter N22.
Click OK and then OK again to overwrite the data.
The main results are displayed in Figure 7.4.
The results reveal that the coefficient estimate of v̂t, called “Residuals,” is weakly significant with a p-value of 0.06 (significant at the 10 percent level but not at the 5 percent level), so there is some evidence of an endogeneity problem, although it is not too serious.
Correcting the Endogeneity Problem
The results in Figure 7.3 also support our argument that ADSt−1 is correlated with PROMt because the coefficient estimate of ADSt−1 has the p-value = 0.0003 < 0.05. Additionally, by the classic assumptions, ADSt−1 is not correlated to et. Therefore, ADSt−1 satisfies both conditions to be an acceptable IV.
To correct this endogeneity problem, continue with the given dataset.
First, copy Predicted PROM from Cells K25 through K59 and paste it into Cells G1 through G35.
Second, copy INC in Column D and paste it into Column H next to Predicted PROM.
Finally, regress the original equation with Predicted PROM in place of PROM:
Go to Data then Data Analysis, select Regression then click OK.
The input Y range is B1:B35, the input X range is G1:H35.
Check the box Labels.
Check the button Output Range and enter B37.
Click OK then OK again to obtain the results in Figure 7.5.
From the results, the estimated equation is
As discussed in the theoretical section, we still must obtain the correct standard errors. For your convenience, we have copied and pasted the original data into the file Ch07.xls, Fig.7.6-7.7. In this file, we also have the aforementioned results in Cells I1 through O17. You should perform the following steps to obtain the correct standard errors for the slope estimates:
In Cell B2 type =$M$15+($M$16*C2)+($M$17*D2) and press enter.
Copy and paste the formula into Cells B3 through B35 (this is our predicted DEM).
In Cell E2 type =SUMXMY2(A2:A35,B2:B35), which gives the new Sum of Squared Errors (SSE).
In Cell F2, type =E2/31, where 31 is T − K (34 observations minus 3 estimated coefficients), which gives the new estimated variance of the model.
In Cell G2, type =(F2)^(1/2), which gives the new standard error of the model, se(mod).
The results are reported in Figure 7.6.
In Cell K18 type =K16/$J$5, which divides the old se(a2) for PROM by the old se(mod).
Copy and paste the formula into Cell K19 for INC.
In Cell K20, type =K18*$G$2, which multiplies the result by the new se(mod) to give the new se(a2) for PROM.
In Cell L20 type =J16/K20, which calculates the correct t-statistic for PROM.
In Cell M20 type =TDIST(L20,31,2), which gives the correct p-value for PROM.
Copy and paste the formulas in Cells K20 to M20 into Cells K21 to M21 for INC.
The results are reported in Figure 7.7 with correct values in Cells K20 through M21.
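The logic of these spreadsheet steps can be sketched in a few lines. The sketch assumes the formulas rescale each second-stage standard error by the ratio of the new to the old standard error of the model. The function names and all numbers below are hypothetical and are not taken from Figures 7.5 through 7.7.

```python
import math

def model_se(actual, predicted, k):
    """Standard error of the model from actual and predicted values (k coefficients)."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # same idea as SUMXMY2
    return math.sqrt(sse / (len(actual) - k))

def corrected_t(coef, old_se, old_se_model, new_se_model):
    """Rescale a second-stage standard error and return (new_se, t_stat)."""
    new_se = old_se / old_se_model * new_se_model
    return new_se, coef / new_se

# Hypothetical numbers for illustration only.
example_se_model = model_se([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 7.0], k=1)
new_se, t_stat = corrected_t(coef=1.5, old_se=0.5, old_se_model=2.0, new_se_model=3.0)

print("example se(mod):", example_se_model)
print("corrected se:", new_se, " corrected t-statistic:", t_stat)
```

The p-value step is left to Excel's TDIST function or a statistics library, since it requires the t-distribution.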
We now can use these correct values for the t-tests, F-tests, interval estimations, and interpretations of the coefficient estimates.
Exercises
1. Given the following equation:
lnY = ln a + α ln K + β ln L + δ ln H
The correlation between ln L and ln K has a coefficient of r = 0.99.
a. What is the problem?
b. An econometrician believes that using nonsample information of CRTS might correct the problem. Suggest a model of CRTS, showing all steps of derivation.
2. Data from the file Samples.xls are on demand (DEMt), promotion expenditures (PROt), income (INCt), and values of free samples (SAMPt). Given the model
DEMt = a1 + a2INCt + a3PROt + et
We suspect that PROt is endogenous and want to use SAMPt−1 as an instrumental variable (IV) for PROt. Carry out an endogeneity test for this model using a handheld calculator or Excel.
3. Propose a method to correct for the endogeneity problem in Question 2 and carry out the correction procedure using Excel.