CHAPTER 6

Panel Data Techniques

Taila asks, “Prof. Empirie introduced us to longitudinal/panel datasets in chapter 2, but we have not discussed how to use them. I was wondering if you could provide us with more information.” Prof. Metric responds enthusiastically that we will learn panel data analysis this week. He says that upon finishing this chapter, we will be able to:

1. Explain the nature and advantages of learning panel data techniques;

2. Master panel data techniques in obtaining regression coefficients;

3. Discuss the goodness-of-fit issue that arises with the panel data technique;

4. Use Excel to carry out corresponding analyses.

Nature of Panel Data

Prof. Metric reminds the class that a panel dataset combines a cross-sectional dataset with a time-series dataset. Invo asks, “Why does one have to learn panel data techniques?” Taila volunteers to give an example of the advantage of using panel data. If we have a time-series dataset on pajama sales in Korea for 15 months and another dataset on pajama sales in China for the same 15 months, then combining the two datasets gives us a sample size of 30 data points.

Prof. Metric praises Taila on offering a good example and says that another advantage of using the panel data technique is that we will be able to observe more than one identity over time to control for the individual heterogeneity. In this particular case, the observation provides us with additional information on the specific characteristics of each market. For example, we can understand demand for pajamas in Northeast Asia by studying pajama sales in Korea and China. In addition, we are able to carry out a comparative study over a period of time. For example, we can compare demand in Korea over the past five years with demand in China during the same time period and then develop different strategies to increase sales in each country.

Panel data can be divided into “short-and-wide” panels or “long-and-narrow” panels. If I is the number of individuals observed in each of T time periods, then a short-and-wide panel has I > T while a long-and-narrow panel has T > I. Panel data can also be divided into balanced and unbalanced panels, with the unbalanced panels having some observations missing, while the balanced ones do not have any missing observation. Prof. Metric says that Excel cannot handle missing observations, so we need to delete the whole row if we run into this problem.

We learn that we can use OLS to run regression on a panel dataset if the two identities have the same characteristics; for example, if pajama sales in Korea and China have the same pattern, then the coefficient estimates will be the same for the two countries. If this is the case, all we have to do is stack one dataset above the other and then run a regression called a “pooled OLS” estimation to extend the dataset to 30 data points, as mentioned by Taila, so that CLT will guarantee valid test results. Hence, the model for the pooled OLS is:

yit = a1 + a2x2it + a3x3it + eit. (6.1)

Prof. Metric reminds us to notice that the parameters a1, a2, and a3 are still the same as those in Chapter 3, even though each of the variables and the error term has the subscript “it” added:

a1it = a1; a2it = a2; a3it = a3.

We understand that all the classic assumptions have to hold for us to use this pooled OLS estimator, specifically,

E(eit) = 0; var(eit) = E(e²it) = σ²;

Cov(eit, ejz) = E(eit ejz) = 0 for i ≠ j or t ≠ z;

Cov(eit, x2it) = Cov(eit, x3it) = 0. (6.2)
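The stacking idea behind pooled OLS can be sketched outside Excel as well. The snippet below is a minimal illustration, not part of the chapter's Excel workflow: it generates made-up pajama data for two countries (all coefficient values and sample sizes are hypothetical), stacks one dataset above the other, and runs a single OLS regression; numpy is assumed to be available.

```python
import numpy as np

# Hypothetical pajama-sales data: 15 months for Korea and 15 for China,
# generated with the SAME intercept and slopes, as pooled OLS assumes.
rng = np.random.default_rng(0)
months = 15
price_kr, prom_kr = rng.uniform(8, 12, months), rng.uniform(1, 3, months)
price_cn, prom_cn = rng.uniform(8, 12, months), rng.uniform(1, 3, months)
sales_kr = 50 - 2.0 * price_kr + 4.0 * prom_kr + rng.normal(0, 0.5, months)
sales_cn = 50 - 2.0 * price_cn + 4.0 * prom_cn + rng.normal(0, 0.5, months)

# Pooled OLS: stack the two datasets to get 30 data points.
y = np.concatenate([sales_kr, sales_cn])
X = np.column_stack([np.ones(2 * months),
                     np.concatenate([price_kr, price_cn]),
                     np.concatenate([prom_kr, prom_cn])])

# One common intercept and one common pair of slopes for both countries.
a1, a2, a3 = np.linalg.lstsq(X, y, rcond=None)[0]
print(a1, a2, a3)
```

Because both countries were generated with identical characteristics, the pooled estimates recover the common parameters; this is exactly the situation in which pooled OLS is valid.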

We all feel that the two markets cannot have the exact same characteristics most of the time. Prof. Metric says that in this case, stacking two or more datasets for different identities will bias the coefficient estimates, and the variances will be inflated, so the tests will be invalid. He tells us that the most general case of panel data is written as:

yit = a1it + a2itx2it + a3itx3it + eit. (6.3)

Booka exclaims, “Oh… I now see that each of the parameters a1it, a2it, and a3it has the subscript it.” Prof. Metric says, “Yes, this model allows each identity to be different across sections and over time.” He continues by stating that we cannot estimate this model because there are not enough data points to cover all the unknown parameters. Hence, some simplifications are needed. The first way is to allow the identities to differ in their intercepts:

yit = a1i + a2x2it + a3x3it + eit. (6.4)

Prof. Metric reminds us to note the subscript i in a1i, which indicates different intercepts across the identities, whereas the slope parameters a2 and a3 do not have this subscript, implying that the identities share the same slopes. The model in equation (6.4) assumes that all behavioral differences among the identities are captured by the constant term, so that:

a1it = a1i; a2it = a2; a3it = a3.

Another case occurs when sectional individuals have different slopes:

yit = a1 + a2ix2it + a3ix3it + eit. (6.5)

The third case combines equations (6.4) and (6.5):

yit = a1i + a2ix2it + a3ix3it + eit. (6.6)

These differences in characteristics are unobserved effects that need to be removed. Panel data techniques can be applied for either case, and more cases will be discussed later.

Panel Data Techniques

We first focus on the model in equation (6.4) using either the first-difference estimation or the fixed-effects estimation.

First-Difference Estimation

Given the model in equation (6.4), we can obtain a second equation by lagging the model one period:

yit = a1i + a2x2it + a3x3it + eit;

yi,t−1 = a1i + a2x2i,t−1 + a3x3i,t−1 + ei,t−1.

Subtract the second equation from the first:

Δyit = a2Δx2it + a3Δx3it + Δeit. (6.7)

The model in equation (6.7) will control for the difference in the intercepts because the constant term in equation (6.4) has been eliminated and no longer exists in equation (6.7).

Prof. Metric then tells us that the suitable case for using a first-difference model, instead of the fixed-effects model, is when the error term follows a random walk (Wooldridge 2013).

yit = a1i + a2x2it + a3x3it + eit,

where eit = ei,t−1 + vit,

where vit satisfies all the classic assumptions:

E(vit) = 0; var(vit) = E(v²it) = σ²v;

Cov(vit, vjz) = E(vit vjz) = 0 for i ≠ j or t ≠ z;

Cov(vit, x2it) = Cov(vit, x3it) = 0.

Taking the first difference yields:

Δyit = a2Δx2it + a3Δx3it + vit.

In this case, taking the first difference serves two purposes: (i) elimination of the intercept a1i present in equation (6.4) and (ii) correction of the autocorrelation in the errors. Since vit satisfies all the classic assumptions, there is no longer an autocorrelation problem.
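The mechanics of first-difference estimation can be sketched in a few lines of Python (a minimal illustration with made-up data, not part of the chapter's Excel workflow; the intercepts and slopes below are invented, and numpy is assumed):

```python
import numpy as np

# Illustrative panel: 3 identities with very different intercepts but the
# same slopes (a2 = 1.5, a3 = -0.8 assumed for this example).
rng = np.random.default_rng(1)
I, T = 3, 10
a1 = np.array([5.0, 20.0, -7.0])          # different intercepts a1i
x2 = rng.normal(size=(I, T))
x3 = rng.normal(size=(I, T))
y = a1[:, None] + 1.5 * x2 - 0.8 * x3 + rng.normal(0, 0.1, (I, T))

# First-difference within each identity, then pool and run OLS with no
# constant: the intercepts a1i have been differenced away.
dy = np.diff(y, axis=1).ravel()
dX = np.column_stack([np.diff(x2, axis=1).ravel(),
                      np.diff(x3, axis=1).ravel()])
a2_hat, a3_hat = np.linalg.lstsq(dX, dy, rcond=None)[0]
print(round(a2_hat, 2), round(a3_hat, 2))
```

Even though the three identities have intercepts as far apart as −7 and 20, the differenced regression recovers the common slopes because the intercepts drop out of equation (6.7).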

An example of the first-difference estimation is the following model:

HOUSEit = a1i + a2INCOMEit + a3CREDITit + eit; eit = ei,t−1 + vit,

where HOUSE is the average value of investment in residential housing, INCOME is per capita income, and CREDIT is the investment credit from the federal government. Going backward one period and subtracting the second equation from the first yields:

ΔHOUSEit = a2ΔINCOMEit + a3ΔCREDITit + vit.

Suppose that the regression results are:

ΔHOUSEit = 1.2 ΔINCOMEit + 0.3 ΔCREDITit,

where ΔINCOMEit = $8,000 and ΔCREDITit = $5,000, then the point prediction of the change in housing investment is:

ΔHOUSEit = 1.2 * 8,000 + 0.3 * 5,000 = 9,600 + 1,500 = $11,100.

Since this is a change in the investment value instead of the investment value itself, we can calculate the predicted value of the investment as follows:

ΔHOUSEit = HOUSEit − HOUSEi,t−1,

HOUSEit = ΔHOUSEit + HOUSEi,t−1.

Suppose data on the previous period provides the average investment value as HOUSEi,t−1 = $150,000, then the predicted value is:

HOUSEit = $11,100 + $150,000 = $161,100.
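The prediction arithmetic above can be verified with a short snippet (all values taken from the example in the text):

```python
# Point prediction from the first-difference results (values from the text).
a2, a3 = 1.2, 0.3
d_income, d_credit = 8_000, 5_000
d_house = a2 * d_income + a3 * d_credit   # predicted change: $11,100
house_prev = 150_000                      # HOUSE_{i,t-1} from last period
house_now = d_house + house_prev          # recover the level of investment
print(round(d_house, 2), round(house_now, 2))   # 11100.0 161100.0
```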

We learn that the concept of differencing can be extended to more than two periods. For example, a three-period difference model will be as follows:

Δyi,t = a2Δx2i,t + a3Δx3i,t + a4Δx3i,t−1 + Δei,t + Δei,t−1.

Prof. Metric tells us that in this case, a comma is often placed between the subscript for the identity and the time period to avoid any confusion on the meaning of the notation.

Fixed-Effects Estimation

Next, Prof. Metric shows us another method of controlling for the differences in the intercepts by using the “fixed effects” estimation. We again zero in on the difference in intercepts first. The theoretical model is obtained by taking the deviation from the mean, so we have to take the time average values of equation (6.4):

ȳi = a1i + a2x̄2i + a3x̄3i + ēi, (6.8)

where a bar over a variable denotes its average over time for identity i.

We then subtract equation (6.8) from equation (6.4):

yit − ȳi = a2(x2it − x̄2i) + a3(x3it − x̄3i) + (eit − ēi).

We now can perform forecasts on the following model using OLS:

ỹit = a2x̃2it + a3x̃3it + ẽit, (6.9)

where

ỹit = yit − ȳi; x̃2it = x2it − x̄2i; x̃3it = x3it − x̄3i; ẽit = eit − ēi.

The transformed model in (6.9) will control the fixed effects problem because the intercept term in equation (6.4) has been removed from equation (6.9). The data are said to be “demeaned” because we take the deviation from the mean of the data.
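The demeaning (within) transformation is easy to sketch in code. The example below is a minimal illustration with made-up data (the intercepts and slopes are invented; numpy is assumed), not part of the chapter's Excel workflow:

```python
import numpy as np

# Fixed-effects (within) estimation by demeaning: subtract each identity's
# time average so the intercepts a1i drop out (a2 = 2.0, a3 = 0.5 assumed).
rng = np.random.default_rng(2)
I, T = 4, 8
a1 = np.array([1.0, 10.0, -3.0, 6.0])     # different intercepts a1i
x2 = rng.normal(size=(I, T))
x3 = rng.normal(size=(I, T))
y = a1[:, None] + 2.0 * x2 + 0.5 * x3 + rng.normal(0, 0.1, (I, T))

def demean(m):
    # Deviation from the time mean within each identity (row).
    return m - m.mean(axis=1, keepdims=True)

yt = demean(y).ravel()
Xt = np.column_stack([demean(x2).ravel(), demean(x3).ravel()])
a2_hat, a3_hat = np.linalg.lstsq(Xt, yt, rcond=None)[0]
print(round(a2_hat, 2), round(a3_hat, 2))
```

The demeaned regression has no constant, which is why R² from a regression through the origin becomes unreliable, as discussed in the Goodness-of-Fit section.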

In practice, the most convenient and flexible way to carry out regression on a fixed-effects model is to use the least squares dummy variable (LSDV) method whenever the number of identities is not too large (fewer than 100 identities or time periods). To obtain the LSDV estimators for equation (6.4), we generate a dummy variable for each of the identities. Suppose we have six different identities; then:

Di = 1 if the observation belongs to identity i, and Di = 0 otherwise, for i = 1, 2, …, 6.

With these six dummies added, equation (6.4) can be written as:

yit = a11D1 + a12D2 + a13D3 + a14D4 + a15D5 + a16D6 + a2x2it + a3x3it + eit.

Prof. Metric reminds us to suppress the constant when running this regression, because the dummies have taken over the role of the intercept; keeping the constant as well would create perfect collinearity.

Prof. Metric continues with the example of pajama sales in Korea and China and gives us this model:

SALEit = a1i + a2PRICEit + a3PROMit + eit,

where SALE is the values of pajama sales, PRICE is price of pajamas, and PROM is the expenditures on sale promotion. He asks us to write a model to control for the difference in sale characteristics in the two countries.

We decide to use the LSDV model and see that only the intercept term is different between the two countries, as indicated by the subscript i, so we add two intercept dummies: DK = Korea and DC = China and write the model as:

SALEit = a11DK + a12DC + a2PRICEit + a3PROMit + eit.

We also tell Prof. Metric that we will have to suppress the constant when running the regression, and he is very pleased that we remembered the details of this method.

Touro then asks, “We learned in chapter 4 that a researcher can only add (G−1) dummies. Why do we have to use two dummies for two groups here?” Prof. Metric commends Touro on the question and says that we already suppressed the constant, so there is no perfect collinearity here, and the reason we want to add two dummies is that we want to control for different characteristics in both countries.
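The LSDV estimation for the pajama model can be sketched as follows (a minimal illustration with made-up data; the country intercepts and slopes are invented, and numpy is assumed — the chapter itself uses Excel with the constant suppressed):

```python
import numpy as np

# LSDV version of the pajama model: one intercept dummy per country and
# NO overall constant (a1_K = 30, a1_C = 40 assumed for this example).
rng = np.random.default_rng(3)
T = 15
price = rng.uniform(8, 12, (2, T))        # row 0: Korea, row 1: China
prom = rng.uniform(1, 3, (2, T))
sale = (np.array([30.0, 40.0])[:, None]
        - 2.0 * price + 4.0 * prom + rng.normal(0, 0.5, (2, T)))

d_k = np.repeat([1.0, 0.0], T)            # DK: 1 for Korea rows, else 0
d_c = np.repeat([0.0, 1.0], T)            # DC: 1 for China rows, else 0
X = np.column_stack([d_k, d_c, price.ravel(), prom.ravel()])  # no constant
a1k, a1c, a2, a3 = np.linalg.lstsq(X, sale.ravel(), rcond=None)[0]
print(round(a1k, 1), round(a1c, 1), round(a2, 2), round(a3, 2))
```

Using both dummies with the constant suppressed avoids perfect collinearity while letting each country keep its own intercept, exactly as in Touro's question.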

He then says that if the two markets also differ over time, then adding time dummies to the equation will help:

SALEit = a11DK + a12DC + a2PRICEit + a3PROMit + b1t1 + ... + bTtT + eit

Invo asks, “Can we add the slopes in addition to the time dummies? Or had we better stick to the model in (6.6)?” Prof. Metric says, “We can add the slope dummies to the equation, so using either cross-sectional or a combination of cross-sectional and time dummies is fine.” Combining the two produces this model:

SALEit = a11DK + a12DC + a2PRICEit + a3PROMit + b1t1 + ... + bTtT
+ c1 (DK * PRICEit) + c2 (DC * PRICEit) + d1 (DK * PROMit)
+ d2 (DC * PROMit) + eit

Prof. Metric also tells us that we will learn more applications of time dummies and slope dummies in later chapters.

Seemingly Unrelated Regressions (SUR)

Because this method requires econometric software packages other than Excel, Prof. Metric only provides us with a quick introduction. Suppose that we have three equations for Singapore (1), Myanmar (2), and Laos (3). SUR estimations assume that the errors of these three equations exhibit contemporaneous correlation in the same period. The basic SUR estimation, which is a GLS procedure, can be performed in three steps, as follows:

  (i) Estimate the three equations separately using OLS.

 (ii) Use the residuals from the OLS estimations in step (i) to estimate the error variances σ̂²1, σ̂²2, σ̂²3 and the covariances σ̂12, σ̂13, σ̂23.

(iii) Use the estimates from step (ii) to regress the three equations jointly within a GLS framework.

This method is usually very effective for identities of the same region (Southeast Asia in this case), country, state, or city, because they are often correlated with each other.
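Although the chapter defers SUR to an econometric package, the three steps can be sketched in Python as a simplified two-step FGLS with made-up data (all coefficients and the error-covariance matrix below are invented; numpy is assumed):

```python
import numpy as np

# Two-step SUR sketch for three equations whose errors are
# contemporaneously correlated (hypothetical Singapore/Myanmar/Laos data).
rng = np.random.default_rng(4)
T, n_eq = 20, 3
cov = np.array([[1.0, 0.6, 0.4], [0.6, 1.0, 0.5], [0.4, 0.5, 1.0]])
e = rng.multivariate_normal(np.zeros(n_eq), cov, size=T)        # (T, 3)
Xs = [np.column_stack([np.ones(T), rng.normal(size=T)]) for _ in range(n_eq)]
betas = [np.array(b) for b in [(1, 2), (3, -1), (0.5, 4)]]
ys = [X @ b + e[:, j] for j, (X, b) in enumerate(zip(Xs, betas))]

# (i) Estimate each equation separately by OLS and keep the residuals.
resid = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                         for X, y in zip(Xs, ys)])
# (ii) Estimate the error variances and covariances from the residuals.
S = resid.T @ resid / T                                         # (3, 3)
# (iii) GLS on the stacked system with Omega = S kron I_T.
X_blk = np.zeros((n_eq * T, sum(X.shape[1] for X in Xs)))
col = 0
for j, X in enumerate(Xs):
    X_blk[j * T:(j + 1) * T, col:col + X.shape[1]] = X
    col += X.shape[1]
y_stk = np.concatenate(ys)
Omega_inv = np.kron(np.linalg.inv(S), np.eye(T))
b_sur = np.linalg.solve(X_blk.T @ Omega_inv @ X_blk,
                        X_blk.T @ Omega_inv @ y_stk)
print(np.round(b_sur, 2))    # SUR estimates of the six coefficients
```

The gain over equation-by-equation OLS comes from step (iii): the GLS weighting exploits the cross-equation error correlation, which is why SUR works well for identities in the same region.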

Detecting Different Characteristics

Since pooled OLS can be performed when all identities or time periods have the same characteristics, we need to perform a test on these characteristics. For example, we want to know if the equations for Korea and China have identical parameters. This is an F-test, called the Chow test, for the significance of the dummy variable. The restricted model is:

SALEit = a11 + a2PRICEit + a3PROMit + eit.

For the purpose of testing the preceding equation, the unrestricted model needs only one dummy added because a11 already catches characteristics of one of the two countries, those of Korea in this case, so the equation is:

SALEit = a11 + a12DC + a2PRICEit + a3PROMit + eit.

Suppose a12 = 0, then China shares the same characteristics with Korea, so the hypotheses are written as:

H0:a12 = 0; Ha:a12 ≠ 0.

Assuming that heteroscedasticity or serial correlation is not a problem with this model, the formula for the F-statistic is similar to the one discussed in chapters 3 and 4, with the same definitions for J and K, except that we now have panel data, so:

F = [(SSER − SSEU)/J] / [SSEU/(IT − K)],

where IT is the sample size, with I = the number of identities observed across sections and T = the number of observations over time.

Prof. Metric reminds us that we still have to perform the test in the four standard steps. If the F-statistic is greater than F-critical, then we reject H0, and the two equations do not have identical coefficients, so a panel-data technique is needed for estimations.
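The F-statistic computation can be captured in a small helper function. The SSE values below are made up purely for illustration (in practice they come from the restricted and unrestricted regression outputs):

```python
# Chow-test F-statistic from the two SSEs. J = number of restrictions,
# IT = sample size, K = parameters in the unrestricted model.
def chow_f(sse_r, sse_u, J, IT, K):
    return ((sse_r - sse_u) / J) / (sse_u / (IT - K))

# Hypothetical SSEs for I = 2 countries, T = 15 months, K = 4, J = 1.
f_stat = chow_f(sse_r=820.0, sse_u=500.0, J=1, IT=30, K=4)
print(round(f_stat, 2))   # 16.64
```

If this value exceeds the F-critical value, we reject H0 and conclude that a panel-data technique is needed.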

Goodness-of-Fit

Prof. Metric tells us that eliminating the constant amounts to regressing through the origin. This procedure causes the R2 value to become an unreliable measure for the goodness-of-fit, which is also compromised for several other models in Volume Two. For this reason, he introduces the Root Mean Squared Error (RMSE) here so that we know an alternative measure for the goodness-of-fit:

RMSE = √[(1/N) Σ (yi − ŷi)²],

where N is the number of observations, yi is the observed value, and ŷi is the predicted value.

We then work on an RMSE example given the information in Table 6.1.

We then calculate the mean squared errors (MSE):

MSE = (1 + 1 + 4)/3 = 2.

Finally, we take the square root of the MSE:

RMSE = √2 ≈ 1.41.

Prof. Metric says that the smaller the RMSE, the better the model fits.

Table 6.1 Steps for calculating RMSE

(1) Observation   (2) y   (3) ŷ   (4) y − ŷ      (5) (y − ŷ)²
1                 8       7       8 − 7 = 1      1² = 1
2                 9       8       9 − 8 = 1      1² = 1
3                 7       9       7 − 9 = −2     (−2)² = 4
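The steps in Table 6.1 can be reproduced with a few lines of code (values taken directly from the table):

```python
import math

# RMSE for the three observations in Table 6.1.
y = [8, 9, 7]          # observed values
y_hat = [7, 8, 9]      # predicted values
sq_errors = [(a - b) ** 2 for a, b in zip(y, y_hat)]   # column (5)
mse = sum(sq_errors) / len(sq_errors)                  # mean squared error
rmse = math.sqrt(mse)                                  # root of the MSE
print(sq_errors, mse, round(rmse, 2))   # [1, 1, 4] 2.0 1.41
```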

Data Analyses

Taila shares with us a yearly dataset on output per worker (OUT) and exports (EXPS) for Malaysia, Australia, and Cambodia during the years 2007 to 2015. She has downloaded the data from the World Bank website. Since one lagged variable is generated, we have data for the years 2008 to 2015 to perform the regressions and tests. We find that the data are in the file Ch06.xls, Fig. 6.1.

Testing Different Characteristics

The restricted model is:

EXPSit = a11 + a2OUTi,t−1 + eit.

The unrestricted model is:

EXPSit = a11 + a1ADA + a1CDC + a2OUTi,t−1 + eit.

DA and DC are the dummies for Australia and Cambodia, respectively. Prof. Empirie reminds us that the three countries will share the same characteristics if a1A = a1C = 0, so we only need two dummies for the test.

To perform the cross-equation test on the three countries, we first estimate the restricted model by regressing EXPS on OUT:

Go to Data then Data Analysis.

Select Regression, then click OK.

The input Y range is E1:E25, the input X range is G1:G25.

Check the box Labels.

Check the button Output Range and enter L1.

Click OK then OK again to overwrite the data.

Next, we regress EXPS on OUT, DA and DC:

Go to Data then Data Analysis.

Select Regression, then click OK.

The input Y range is E1:E25, the input X range is G1:I25.

Check the box Labels.

Check the button Output Range and enter L20.

Click OK then OK again to overwrite the data.

The Analysis of Variance (ANOVA) sections that report the SSEs for the two models are displayed in Figure 6.1. We find that SSER is reported in cell X4 and SSEU in cell AC4.

From the results in this figure, the four steps for the test are:

  (i) H0: a1A = a1C = 0; Ha: a1A ≠ 0, or a1C ≠ 0, or both ≠ 0.

 (ii) FSTAT = [(SSER − SSEU)/2] / [SSEU/(24 − 4)].

(iii) We decide to use α = 0.05 and type = FINV (0.05, 2, 20) into an Excel cell, which gives us FC = 3.49.

(iv) Since FSTAT > FC, we reject the null, meaning at least one pair of parameters is different and implying that a panel-data estimation is needed.

Estimating with Panel Data

First-Difference Estimation

We find that the data are in the file Ch06.xls, Fig. 6.2. The model is

ΔEXPSit = a2ΔOUTi,t−1 + Δeit.

We have to perform the following steps:


Figure 6.1 ANOVA sections for restricted and unrestricted models, respectively

In cell H2 type =D2-E2, then press Enter.

In cell I2 type =F2-G2, then press Enter.

Copy cells H2 through I2 and paste into cells H3 through I25.

Go to Data then Data Analysis.

Select Regression, then click OK.

The input Y range is H1:H25, the input X range is I1:I25.

Check the boxes Labels; Constant is Zero.

Check the Residuals button to obtain the predicted values.

Check the button Output Range and enter K1.

Click OK then OK again to overwrite the data.

Figure 6.2 shows that the results for the intercept are suppressed and reported as #N/A.

Prof. Empirie points out that we were estimating the change of the variables, so we need to follow the theoretical equation to recover the intercept for predictions:

yit = a1i + a2x2it + a3x3it,

so,

â1i = ȳi − â2x̄2i − â3x̄3i;

for our model, â1i equals the time average of EXPSit minus â2 times the time average of OUTi,t−1 for each country.

Figure 6.2 Results for first-difference estimation

Once the intercept is recovered, we can substitute it into the estimated equation to calculate the point and interval predictions as usual.
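The recovery rule â1i = ȳi − â2x̄2i − â3x̄3i amounts to a single line of arithmetic. The numbers below are hypothetical time averages and slope estimates, chosen only to illustrate the calculation:

```python
# Recovering the intercept for identity i after a first-difference (or
# demeaned) estimation: a1i = mean(y_i) - a2*mean(x2_i) - a3*mean(x3_i).
# All values here are hypothetical.
y_bar, x2_bar, x3_bar = 161.1, 80.0, 50.0   # time averages for identity i
a2_hat, a3_hat = 1.2, 0.3                   # slope estimates
a1_hat = y_bar - a2_hat * x2_bar - a3_hat * x3_bar
print(round(a1_hat, 1))   # 50.1
```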

Fixed-Effects Estimation

We find that the data are in the file Ch06.xls, Fig. 6.3. Prof. Empirie reminds us that theoretically, we can perform a fixed-effects model by using the demeaned model. Empirically, it is much more convenient to use the LSDV techniques, so we are going to perform an LSDV estimation, and the regression equation is:

EXPSit = a1ADA + a1CDC + a1MDM + a2OUTi,t−1 + eit.

We are sure that you noticed the three dummies added to the equation and that the constant is suppressed from this equation. You should perform the following regression steps:

Go to Data then Data Analysis.

Select Regression, then click OK.

The input Y range is E1:E25, the input X range is G1:J25.

Check the boxes Labels; Constant is Zero.

Check the Residuals button to obtain the predicted values.

Check the button Output Range and enter L1.

Click OK then OK again to overwrite the data.

The results are reported in Figure 6.3. You can see that we use all three dummies to control for the fixed effects.

Prof. Empirie reminds us that recovering the intercept is possible by using the model:

ȳi = a1i + a2x̄2i + a3x̄3i,

so,

â1i = ȳi − â2x̄2i − â3x̄3i.

Figure 6.3 Results for fixed-effects estimation

Once the intercept is recovered, substitute it into the estimated equation to calculate the point and interval predictions as usual.

Exercises

1. The file Growth.xls contains data on GDP growth (GROW), investment (INV), and money growth (MONEY) for four regions A, B, C, and D for 10 years. Assuming that the four regions differ in intercepts only,

(a) Perform a regression of GROW on INV and MONEY using the first difference technique.

(b) Report the results using the standard format learned in chapter 3.

2. Using the dataset from Exercise 1,

(a) Perform a regression of GROW on INV and MONEY using the LSDV technique.

(b) Provide an interpretation of MONEY, including magnitude and significance level.

3. Use the results in Exercise 2 to provide point and interval predictions of GROW for given values of INV and MONEY.
