2.4. Estimation with PROC GLM for More Than Two Observations Per Person

When each person has three or more measurements on the time-varying variables, it's not obvious how to extend the method of difference scores. One approach is to compute first-difference scores for each pair of adjacent observations, yielding T – 1 observations for each individual. Then the problem is to estimate a single model for the entire set while allowing for correlated errors.[1] Another approach is the dummy variable method, which gives the correct results in this new situation but is computationally intensive. In general, the easiest method is the one that was implemented in the last section using the ABSORB statement in PROC GLM. We now consider that method in greater detail.

[1] One reasonable method is to do generalized least squares on the difference equations, allowing for unrestricted correlations between the error terms from the same individual. In SAS, this can be done with PROC GENMOD using the REPEATED statement.
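As a sketch of the footnote's suggestion (the data set and variable names here are hypothetical; it assumes an earlier DATA step has already computed first-difference scores DANTI, DSELF, and DPOV, with a PERIOD variable indexing each adjacent pair):

```sas
/* GLS-type estimation on first-difference scores.
   Assumes a data set DIFF with variables id, period, and the
   difference scores danti, dself, dpov (hypothetical names),
   one observation per person per adjacent pair of waves. */
PROC GENMOD DATA=diff;
   CLASS id period;
   MODEL danti = dself dpov period;
   REPEATED SUBJECT=id / TYPE=UN;
RUN;
```

The TYPE=UN option leaves the within-person correlation structure of the difference-equation errors unrestricted, as the footnote describes.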

As before, our basic model is given by the equation

yit = μt + β xit + αi + εit

where αi is a set of fixed parameters, εit satisfies the assumptions of a standard linear model, and xit is assumed to be strictly exogenous. OLS produces optimal estimates of the parameters, but direct application of OLS with dummy variables for the αi terms is computationally tedious. It turns out, however, that we can get identical results by "conditioning out" the αi terms and performing OLS on deviation scores. That is, for each person and for each time-varying variable (both response variables and predictors), we compute the means over time for that person:

ȳi = (1/ni) Σt yit    and    x̄i = (1/ni) Σt xit

where ni is the number of measurements for person i. Then we subtract the person-specific means from the observed values of each variable:

y*it = yit − ȳi    and    x*it = xit − x̄i

Finally, we regress y* on x*, plus variables to represent the effect of time. This is what PROC GLM does when you use the ABSORB statement.

If you construct the deviation scores yourself (using, say, PROC MEANS and a DATA step) and then use PROC REG to estimate the regression, you will get the correct OLS regression coefficients for the time-varying predictors. But the standard errors and p-values will be incorrect. That's because PROC REG calculates the degrees of freedom from the number of variables on the MODEL statement alone, when the calculation should also count the dummy variables implicitly used to represent the different persons in the sample (580 for the NLSY data). Formulas are available to correct these statistics (Judge et al. 1985), but it's much easier to let PROC GLM do it automatically. When the ABSORB statement is used, GLM converts all variables to deviation scores, estimates the regression, and uses the correct degrees of freedom to compute standard errors and p-values.
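For illustration, the do-it-yourself approach might look like the following sketch, using the PERSYR3 person-year data set constructed below (time effects are omitted for brevity, and remember that the standard errors and p-values from PROC REG will be based on the wrong degrees of freedom):

```sas
/* Person-specific means of each time-varying variable */
PROC MEANS DATA=persyr3 NOPRINT NWAY;
   CLASS id;
   VAR anti self pov;
   OUTPUT OUT=means MEAN=manti mself mpov;
RUN;

/* Merge the means back in and compute deviation scores */
DATA devs;
   MERGE persyr3 means;
   BY id;
   anti_d = anti - manti;
   self_d = self - mself;
   pov_d  = pov  - mpov;
RUN;

/* Coefficients will match GLM with ABSORB;
   standard errors and p-values will not */
PROC REG DATA=devs;
   MODEL anti_d = self_d pov_d;
RUN;
```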

Let's try this with the NLSY data, except now we also include data from the middle year, 1992. Again, the first step is to construct a data set with one observation for each person at each time point:

DATA persyr3;
   SET my.nlsy;
   id=_N_;
   time=1;
   anti=anti90;
   self=self90;
   pov=pov90;
   OUTPUT;
   time=2;
   anti=anti92;
   self=self92;
   pov=pov92;
   OUTPUT;
   time=3;
   anti=anti94;
   self=self94;
   pov=pov94;
   OUTPUT;
RUN;

If there were more than three time points, it might be worthwhile to shorten this program by using arrays and a DO loop. Note that TIME has been assigned values of 1, 2 and 3, which facilitates the use of a CLASS statement in PROC GLM. This DATA step produced 1,743 observations, three for each of the 581 children.
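For example, an equivalent and more compact version of the DATA step using arrays and a DO loop might look like this:

```sas
DATA persyr3;
   SET my.nlsy;
   /* Each array holds one variable's measurements in time order */
   ARRAY a (*) anti90 anti92 anti94;
   ARRAY s (*) self90 self92 self94;
   ARRAY p (*) pov90 pov92 pov94;
   id=_N_;
   DO time=1 TO 3;
      anti=a(time);
      self=s(time);
      pov=p(time);
      OUTPUT;        /* one record per person per time point */
   END;
RUN;
```

With more time points, only the ARRAY statements and the upper bound of the DO loop would need to change.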

The PROC GLM statements for estimating the basic model are virtually identical to those for the two-period case, except that we now use a CLASS statement to handle the three-valued TIME variable:

PROC GLM DATA=persyr3;
   ABSORB id;
   CLASS time;
   MODEL anti=self pov time / SOLUTION;
RUN;

Note that for this to work, the data set must be sorted by the variable specified on the ABSORB statement. Of course, the DATA step that produced PERSYR3 did this automatically.

Results in Output 2.10 are similar to what we found in Output 2.2 for two time points: a highly significant effect of self-esteem with a coefficient of about –.055, and a nonsignificant effect of poverty. TIME also has a significant effect, with antisocial behavior increasing over the three periods.

Table 2.10. Output 2.10 GLM Estimates of a Fixed Effects Model for Three Periods
Dependent Variable: anti
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             584      3181.883112      5.448430      5.48   <.0001
Error            1158      1151.232207      0.994156
Corrected Total  1742      4333.115318

R-Square   Coeff Var   Root MSE   anti Mean
0.734318    60.91480   0.997074    1.636833

Source   DF     Type I SS   Mean Square   F Value   Pr > F
id      580   3142.448652      5.418015      5.45   <.0001
self      1     23.966255     23.966255     24.11   <.0001
pov       1      1.254392      1.254392      1.26   0.2615
time      2     14.213813      7.106907      7.15   0.0008

Source   DF   Type III SS   Mean Square   F Value   Pr > F
self      1   27.29362397   27.29362397     27.45   <.0001
pov       1    1.44138475    1.44138475      1.45   0.2288
time      2   14.21381348    7.10690674      7.15   0.0008

Parameter       Estimate   Standard Error   t Value   Pr > |t|
self        −.0551514027       0.01052575     −5.24     <.0001
pov         0.1124748908       0.09340988      1.20     0.2288
time 1      −.2107365666       0.05879781     −3.58     0.0004
time 2      −.1663431979       0.05856544     −2.84     0.0046
time 3      0.0000000000       .                  .          .

We also learn from the output that 73% of the variation in antisocial behavior is between children, whereas the remaining 27% is within children (across time). I got these numbers by dividing the Type I sum of squares for ID (3142.45) by the corrected total sum of squares (4333.12), which yields .73, leaving .27 within children. The square root of .73 (about .85) is an estimate of the intraclass correlation for these data. Since the total R2 from the model is also about .73, we conclude that the time-varying predictors are not accounting for much additional variation.

It's also instructive to compare the results in Output 2.10 to what you get when the ABSORB statement is omitted—that is, OLS regression with no control for between-person variation. These results are shown in Output 2.11. Notice first that the mean squared error in Output 2.10 is less than half of what we see in Output 2.11. That's because the control for between-person variation greatly reduces the error sum of squares (the R2 increases from .048 to .734). It also reduces the degrees of freedom (which, by itself, would make the mean squared error larger), but in this case the proportional reduction in degrees of freedom is much smaller than the reduction in the error sum of squares. In data sets where the between-person proportion of variation in the dependent variable is small, the mean squared error could go up rather than down in a fixed effects model.

Table 2.11. Output 2.11 Conventional OLS without Control for Between-Person Variation
Dependent Variable: anti
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               4       208.312848     52.078212     21.94   <.0001
Error            1738      4124.802470      2.373304
Corrected Total  1742      4333.115318

R-Square   Coeff Var   Root MSE   anti Mean
0.048075    94.11792   1.540553    1.636833

Source   DF   Type III SS   Mean Square   F Value   Pr > F
self      1    86.3022527    86.3022527     36.36   <.0001
pov       1   102.8341414   102.8341414     43.33   <.0001
time      2    15.7786990     7.8893495      3.32   0.0362

Parameter       Estimate   Standard Error   t Value   Pr > |t|
Intercept    2.959617390       0.23985318     12.34     <.0001
self        −0.066894066       0.01109311     −6.03     <.0001
pov          0.517573842       0.07862856      6.58     <.0001
time 1      −0.222741659       0.09059359     −2.46     0.0140
time 2      −0.172135548       0.09043193     −1.90     0.0571
time 3       0.000000000       .                  .          .

The root mean squared error directly affects the standard errors of the coefficients, so we might expect the standard errors to be smaller for the fixed effects regression than for the conventional regression. That's true for SELF, where the fixed effects standard error is .0105 while the conventional OLS standard error is .0111. But for POV, the fixed effects standard error is .0934 and the conventional OLS standard error is .0786. Why the difference? The answer is that the standard errors depend not only on the root mean square error, but also on the relative proportion of within- and between-person variation on the predictor variables. Other things being equal, the greater the proportion of variation that is between persons on a given predictor variable, the larger the standard error of its fixed effects coefficient. Other analysis shows that for SELF, 53% of the variation is between persons. For POV, on the other hand, the between-person variation is 70%. That's why the standard error for POV went up rather than down under the fixed effects analysis. The ideal situation for a fixed effects analysis is when all of the variation on the time-varying predictors is within persons, but there's still lots of between-person variation on the response variable.
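One simple way to obtain such a within/between decomposition for a predictor (a sketch, not shown in the text) is a one-way ANOVA of the predictor on the person identifier; the reported R-square is then the between-person proportion of variation:

```sas
/* Between-person proportion of variation in SELF:
   R-square from the one-way ANOVA of SELF on person ID
   (SS for id divided by the corrected total SS).
   Repeat with pov as the dependent variable for POV. */
PROC GLM DATA=persyr3;
   CLASS id;
   MODEL self=id;
RUN;
```

The complement of that R-square is the within-person proportion, which is what the fixed effects estimator actually uses.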

As in the two-period case, we can also test whether the time-varying predictors have coefficients that vary with time by including interactions between them and TIME:

PROC GLM DATA=persyr3;
   ABSORB id;
   CLASS time;
   MODEL anti=self pov time self*time pov*time / SOLUTION;
RUN;

With high p-values for the two interactions, Output 2.12 shows no evidence for variation over time of the coefficients for SELF and POV.

Table 2.12. Output 2.12 Tests for Interaction between TIME and Time-Varying Predictors
Source      DF   Type III SS   Mean Square   F Value   Pr > F
self         1   26.61340572   26.61340572     26.74   <.0001
pov          1    1.34058714    1.34058714      1.35   0.2460
time         2    3.72428319    1.86214160      1.87   0.1544
self*time    2    2.62684393    1.31342197      1.32   0.2676
pov*time     2    0.04291216    0.02145608      0.02   0.9787

In a similar way, we can test for constancy of the effects of time-invariant predictors:

PROC GLM DATA=persyr3;
   ABSORB id;
   CLASS time;
   MODEL anti=self pov time black*time hispanic*time
         childage*time married*time gender*time
         momage*time momwork*time / SOLUTION;
RUN;

As shown in Output 2.13, there is no evidence that any of the time-invariant predictors has an effect that varies with time.

Table 2.13. Output 2.13 Tests for Interaction between TIME and Time-Invariant Predictors
Source          DF   Type III SS   Mean Square   F Value   Pr > F
self             1   24.13352368   24.13352368     24.41   <.0001
pov              1    1.28045845    1.28045845      1.30   0.2553
time             2    0.48255089    0.24127545      0.24   0.7835
black*time       2    4.99285922    2.49642961      2.53   0.0805
hispanic*time    2    1.63509176    0.81754588      0.83   0.4376
childage*time    2    5.06884670    2.53442335      2.56   0.0774
married*time     2    1.21443928    0.60721964      0.61   0.5412
gender*time      2    0.94702064    0.47351032      0.48   0.6195
momage*time      2    2.77934289    1.38967145      1.41   0.2456
momwork*time     2    3.47491248    1.73745624      1.76   0.1729

In addition to PROC GLM, another SAS procedure, PROC TSCSREG (for time series cross section regression), also does OLS estimation of the fixed effects model. PROC TSCSREG, which is part of the SAS/ETS product, has one nice feature that I will discuss in the next section: a Hausman test of fixed effects versus random effects. However, the downside of PROC TSCSREG is that it explicitly estimates coefficients for the dummy variables for the fixed effects and thus may use excessive computer time for large samples.
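For comparison, a sketch of the corresponding one-way fixed effects specification in PROC TSCSREG might look like this (TSCSREG has no CLASS statement, so the time effects would have to be represented by dummy variables created in a prior DATA step; they are omitted here):

```sas
/* One-way (cross-sectional) fixed effects model in PROC TSCSREG.
   The data must be sorted by id and time, as PERSYR3 already is. */
PROC TSCSREG DATA=persyr3;
   ID id time;
   MODEL anti=self pov / FIXONE;
RUN;
```

The FIXONE option requests fixed effects for the cross-sectional (person) dimension, which should reproduce the coefficients from the ABSORB approach.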
