2.2. Estimation with Two Observations Per Person

When there are exactly two observations per person (t = 1, 2), the fixed effects model can be estimated easily by OLS regression on difference scores for all the time-varying variables. The equations for the two time points are

yi1 = μ1 + βxi1 + γzi + αi + εi1
yi2 = μ2 + βxi2 + γzi + αi + εi2

Subtracting the first equation from the second, we get

yi2 − yi1 = (μ2 − μ1) + β(xi2 − xi1) + (εi2 − εi1)

Notice that both γzi and αi have been "differenced out" of the equation. Consequently, we cannot estimate γ, the coefficients for the time-invariant predictors. Nevertheless, this method completely controls for the effects of these variables. Furthermore, if εi1 and εi2 both satisfy the assumptions of the standard linear model, then their difference will also satisfy those assumptions (even if εi1 and εi2 are correlated). So OLS applied to the difference scores should give unbiased and efficient estimates of β, the coefficients for the time-varying predictors.
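
The claim about the differenced errors can be checked with a short variance calculation. As a sketch, assume (these symbols are introduced here and do not appear in the text) that the two errors have a common variance σ² and a within-person correlation ρ, and that they are independent across persons:

```latex
\begin{aligned}
\operatorname{Var}(\varepsilon_{i2} - \varepsilon_{i1})
  &= \operatorname{Var}(\varepsilon_{i2}) + \operatorname{Var}(\varepsilon_{i1})
     - 2\operatorname{Cov}(\varepsilon_{i1}, \varepsilon_{i2}) \\
  &= \sigma^2 + \sigma^2 - 2\rho\sigma^2 \\
  &= 2\sigma^2 (1 - \rho)
\end{aligned}
```

Under these assumptions the differenced error has mean zero and the same variance for every person. Because each person contributes only one difference, the differenced errors are also uncorrelated across observations, whatever the value of ρ.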

Let's try this on some real data. The sample consists of 581 children who were interviewed in 1990, 1992, and 1994 as part of the National Longitudinal Survey of Youth (Center for Human Resource Research 2002). We'll look at three variables that were measured in each of the three interviews:


ANTI   antisocial behavior, measured with a scale ranging from 0 to 6
SELF   self-esteem, measured with a scale ranging from 6 to 24
POV    poverty status of family, coded 1 for in poverty, otherwise 0

In this section, we will use only the data from 1990 and 1994. Our goal is to estimate a regression model of the form:

ANTI(t) = μ(t) + β1SELF(t) + β2POV(t) + ε(t)

for t = 1, 2. That is, we shall assume that poverty and self-esteem at time t affect antisocial behavior at the same time. I recognize that there may be uncertainties about causal ordering for these variables, but such difficulties will be ignored in this chapter. A related issue is whether the independent variables should be lagged, but with only two time points, it's impossible to estimate a model with both lags and fixed effects. Another implicit assumption is that the regression coefficients are the same at each time point, but that assumption can be tested or relaxed, as we'll see later.

As a point of departure, I begin by estimating the regression equation for each year separately using PROC REG. The SAS data set MY.NLSY has one observation per person, with separate variables for the measurements in the different years. The following SAS code is used for estimating the model:

PROC REG DATA=my.nlsy;
   /* Separate cross-sectional regressions for 1990 and 1994 */
   MODEL anti90=self90 pov90;
   MODEL anti94=self94 pov94;
RUN;
QUIT;

Please note that this and all other data sets used in this book are available for download at support.sas.com/companionsites.

Selected portions of the output are displayed in Output 2.1. We see that both of the independent variables are statistically significant beyond the .05 level in both years. As one might expect, higher self-esteem is associated with lower levels of antisocial behavior, whereas poverty is associated with higher levels of antisocial behavior. The effect of SELF is slightly larger in magnitude in 1994 than in 1990, while the reverse is true for POV.

Table 2.1. Output 2.1 Regressions of ANTI on POV and SELF in 1990 and 1994

Dependent Variable: anti90 child antisocial behavior in 1990
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              2.37482          0.38447      6.18     <.0001
self90       1             −0.05014          0.01870     −2.68     0.0075
pov90        1              0.59473          0.12629      4.71     <.0001

Dependent Variable: anti94 child antisocial behavior in 1994
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              2.88797          0.44688      6.46     <.0001
self94       1             −0.06388          0.02113     −3.02     0.0026
pov94        1              0.54712          0.14765      3.71     0.0002

The problem with these regressions is that they do not control for any time-invariant variables. Rather than putting such variables into the model, we'll proceed directly to the difference equation, which controls for all time-invariant variables. To do this, we first need a DATA step to create the difference scores:

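DATA diff;
   SET my.nlsy;
   /* Difference scores: 1994 value minus 1990 value */
   antidiff=anti94-anti90;
   povdiff=pov94-pov90;
   selfdiff=self94-self90;
RUN;

/* OLS regression on the difference scores */
PROC REG DATA=diff;
   MODEL antidiff=selfdiff povdiff;
RUN;
QUIT;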

Results are shown in Output 2.2. The coefficient for SELFDIFF is about midway between the two coefficients for SELF in Output 2.1, but the coefficient for POVDIFF is markedly lower than the two coefficients in Output 2.1 and is far from statistically significant. It thus appears that although there might be an association between poverty status and antisocial behavior, that association is largely cross-sectional and is perhaps explainable by their mutual dependence on other variables. But changes in poverty status do not seem to be associated with changes in antisocial behavior.

Table 2.2. Output 2.2 Regression with Difference Scores

Dependent Variable: antidiff
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.20923          0.06305      3.32     0.0010
selfdiff     1             −0.05615          0.01531     −3.67     0.0003
povdiff      1             −0.03631          0.12827     −0.28     0.7772

One concern is that there might be too few changes in poverty status to reliably estimate the effect of this variable. Output 2.3 shows that although the majority of children did not change in status, about 24% did change in one direction or another. This change should be sufficient to get a reliable estimate. In fact, the standard error for the poverty coefficient in the difference equation is about the same as the smaller of the two standard errors for the cross-sectional coefficients in Output 2.1.

Table 2.3. Output 2.3 Cross-Tabulation of Poverty Status in 1990 and 1994

Frequency        pov94
pov90          0      1   Total
0            321     65     386
1             73    122     195
Total        394    187     581
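
The text does not show the code behind Output 2.3. One way to produce such a table, assuming the same MY.NLSY data set and variable names used above, is PROC FREQ. The options that suppress the percentages are a presentation choice made for this sketch, not something taken from the source:

```sas
/* Cross-tabulate 1990 poverty status (rows) by 1994 status (columns) */
PROC FREQ DATA=my.nlsy;
   /* NOROW, NOCOL, and NOPERCENT leave only the cell frequencies */
   TABLES pov90*pov94 / NOROW NOCOL NOPERCENT;
RUN;
```

The off-diagonal cells are the children who changed status: (65 + 73)/581 = 138/581, or about 24%.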

Regression with difference scores is not the only way to produce OLS estimates of the fixed effects model for these data. The following alternative method is computationally cumbersome, but instructive. The first step is to reorganize the data so that, instead of one observation per person, there is one observation per person-year. The same variable name is used for the measurements of each conceptual variable in the two years. The new data set also contains an ID variable that has the same value for both years for the same person, and a TIME variable with a value of 0 for 1990 and 1 for 1994. Here is the SAS code to produce this new data set:

DATA persyr2;
   SET my.nlsy;
   id=_N_;
   time=0;
   anti=anti90;
   self=self90;
   pov=pov90;
   OUTPUT;
   time=1;
   anti=anti94;
   self=self94;
   pov=pov94;
   OUTPUT;
RUN;
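
A quick check on the restructured file can catch mistakes in a DATA step like this one. The following sketch assumes the PERSYR2 data set just created; the particular checks are an illustrative addition rather than part of the original analysis:

```sas
/* Each value of TIME should account for exactly one record per child */
PROC FREQ DATA=persyr2;
   TABLES time;
RUN;

/* List the first four records: two rows each for the first two children */
PROC PRINT DATA=persyr2 (OBS=4);
   VAR id time anti self pov;
RUN;
```

If the two frequencies for TIME differ, or if a child's two rows carry different ID values, the OUTPUT statements in the DATA step are the first place to look.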

The new data set has 1162 observations, two for each of the 581 children. Equation (2.1) is now estimated in its original form

yit = μt + βxit + γzi + αi + εit

except that γzi is removed because it is perfectly collinear with αi. To allow for different intercepts in the two years, the regression includes the TIME variable. To estimate the αi terms, the regression model includes 580 dummy variables, one for each child except the last. This would be awkward in PROC REG because all the dummies would have to be created in a DATA step. It's easy in PROC GLM, however, because the CLASS statement can create the dummies automatically:

PROC GLM DATA=persyr2;
   CLASS id;
   MODEL anti=self pov time id / SOLUTION;
RUN;

The SOLUTION option tells GLM to print out the coefficient estimates and their associated statistics. (This is unnecessary when there are no CLASS variables.) Selected results are shown in Output 2.4. Only the first 10 coefficients for the dummy variables are shown.

The most important fact about this output is that the coefficients for SELF and POV (along with their standard errors, t-statistics, and p-values) are identical to those in Output 2.2, which was based on the difference equation. Furthermore, the coefficient for TIME is identical to the intercept in Output 2.2. So the two computational methods give equivalent results. But the dummy variable method is much slower than the difference score method because it requires the inversion of a very large matrix. On my PC, PROC REG took 0.01 seconds to estimate the difference-score model, whereas PROC GLM took 3.3 seconds to estimate the dummy variable model.

Table 2.4. Output 2.4 PROC GLM Results for Person-Year Data Set

Parameter        Estimate   Standard Error   t Value   Pr > |t|
Intercept     1.508435913       0.81214706      1.86     0.0638
self         −0.056148639       0.01530506     −3.67     0.0003
pov          −0.036308618       0.12827438     −0.28     0.7772
time          0.209233732       0.06305436      3.32     0.0010
id 1          0.658525908       1.06750664      0.62     0.5376
id 2         −0.377782710       1.06740470     −0.35     0.7235
id 3          4.650291609       1.06769622      4.36     <.0001
id 4          1.122217290       1.06740470      1.05     0.2935
id 5          0.178365929       1.06804248      0.17     0.8674
id 6          0.594142970       1.06716799      0.56     0.5779
id 7          2.925697051       1.06689988      2.74     0.0063
id 8          1.869548412       1.06724956      1.75     0.0803
id 9          0.037994330       1.06685908      0.04     0.9716
id 10         4.313399772       1.06781851      4.04     <.0001
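
Why the TIME coefficient reproduces the intercept of the difference-score regression can be seen by writing the year-specific intercept in terms of the 0/1 TIME variable. In this sketch, μ1 and μ2 denote the intercepts for 1990 and 1994, and the symbol δ for the TIME coefficient is introduced only for illustration:

```latex
\begin{aligned}
\mu_t &= \mu_1 + \delta \,\mathrm{TIME}_t,
  \qquad \mathrm{TIME}_1 = 0,\quad \mathrm{TIME}_2 = 1 \\
\mu_2 - \mu_1 &= (\mu_1 + \delta) - \mu_1 = \delta
\end{aligned}
```

So δ is the change in intercepts between the two years, which is exactly what the intercept estimates when the 1990 scores are subtracted from the 1994 scores. Both outputs report it as about 0.209.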

Using the same person-year data configuration, it's possible to greatly reduce the computation time by not explicitly estimating the dummy variable coefficients. I'll explain this method in more detail in section 2.4, but let's first look at how it is implemented. In PROC GLM, we take the variable ID out of the MODEL and CLASS statements, and we put it in an ABSORB statement instead. Here is the new program:

PROC GLM DATA=persyr2;
   ABSORB id;
   MODEL anti=self pov time;
RUN;

This program took about the same computing time as using PROC REG with the difference scores. It's apparent that the results in Output 2.5 are identical to those found in Outputs 2.4 and 2.2. As we'll see in section 2.4, this last method for OLS estimation of the fixed effects model is generally preferred when there are more than two observations per person.

Table 2.5. Output 2.5 GLM Results Using the ABSORB Statement

Parameter        Estimate   Standard Error   t Value   Pr > |t|
self         −.0561486395       0.01530506     −3.67     0.0003
pov          −.0363086183       0.12827438     −0.28     0.7772
time         0.2092337322       0.06305436      3.32     0.0010
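
As a rough preview of the idea behind ABSORB, the same coefficients can be obtained by expressing every variable as a deviation from its person-specific mean and running OLS on the deviations. The sketch below is an illustration, not the author's code: it assumes the PERSYR2 data set created earlier (already in ID order, because ID was assigned from _N_), and the output data set name DEV is hypothetical.

```sas
/* Center each variable at its person-specific mean.            */
/* MEAN=0 with a BY statement subtracts the within-person mean. */
PROC STANDARD DATA=persyr2 MEAN=0 OUT=dev;
   BY id;
   VAR anti self pov time;
RUN;

/* OLS on the deviation scores; no intercept is needed because  */
/* every centered variable has an overall mean of zero.         */
PROC REG DATA=dev;
   MODEL anti=self pov time / NOINT;
RUN;
QUIT;
```

The coefficient estimates from this regression should match those in Output 2.5, but the standard errors and p-values reported by PROC REG will be too small, because PROC REG does not subtract the absorbed person effects from the error degrees of freedom. PROC GLM with ABSORB makes that adjustment automatically.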
