Contingency tables sometimes have cells with frequency counts of 0. These may cause problems or require special treatment. There are two kinds of zeros:
Structural zeros: These are cells for which a nonzero count is impossible because of the nature of the phenomenon or the design of the study. The classic example is a cross-tabulation of sex by type of surgery in which structural zeros occur for male hysterectomies and female vasectomies.
Random zeros: In these cells, nonzero counts are possible (at least as far as we know), but a zero occurs because of random variation. Random zeros are especially likely to arise when the sample is small and the contingency table has many cells.
Structural zeros are easily accommodated with PROC GENMOD. Simply delete the structural zeros from the data set before estimating the model.

Random zeros can be a little trickier. Most of the time they don’t cause any difficulty, except that the expected cell counts may be very small, thereby degrading the chi-square approximation to the deviance and Pearson’s statistic. However, more serious problems arise when random zeros show up in fitted marginal tables. When a fitted marginal table contains a frequency count of zero, at least one ML parameter estimate is infinite and the fitting algorithm will not converge.

We already encountered this problem in Section 3.4 for the binary logit model. There, we saw that if there is a 0 in the 2 × 2 table describing the relationship between the dependent variable and any dichotomous independent variable, the coefficient for that variable is infinite and the algorithm will not converge. The identical problem arises when fitting a logit model by means of its equivalent loglinear model, and the potential solutions are the same. However, problems arise more frequently when fitting loglinear models because it’s necessary to fit the full marginal table describing the relationships among all the independent variables. As we’ve seen, loglinear models typically contain many nuisance parameters, any of which could have infinite estimates causing problems with convergence.
Here’s a simple, hypothetical example. Consider the following three-way table for dichotomous variables X, Y, and Z:
Y     | X=1, Z=1 | X=1, Z=0 | X=0, Z=1 | X=0, Z=0 |
1     | 20       | 5        | 4        | 0        |
0     | 5        | 5        | 11       | 0        |
Total | 25       | 10       | 15       | 0        |
Considering the two-way marginal tables, neither the XY table nor the YZ table has any zeros. But the XZ table clearly has one random zero, producing two random zeros in the three-way table.
Now suppose we want to estimate a logit model for Y dependent on X and Z. We can read the table into SAS as follows:
DATA zero;
   INPUT x y z f;
   DATALINES;
1 1 1 20
1 0 1 5
1 1 0 5
1 0 0 5
0 1 1 4
0 0 1 11
0 1 0 0
0 0 0 0
;
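Before fitting anything, it can help to look at the two-way marginal tables directly. The following is a minimal sketch, assuming a SAS release in which the WEIGHT statement of PROC FREQ accepts the ZEROS option; the option is not used elsewhere in this chapter and is included here only so that zero-count records are retained.

```sas
/* Sketch: display the XZ, XY, and YZ marginal tables of ZERO. */
/* ZEROS asks PROC FREQ to retain observations whose weight is 0 */
/* (assumes a release that supports this option).              */
PROC FREQ DATA=zero;
   WEIGHT f / ZEROS;
   TABLES x*z x*y y*z / NOROW NOCOL NOPERCENT;
RUN;
```

In this example the X by Z table should show a frequency of 0 in the cell where both variables equal 0, while the other two tables have no empty cells.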
We can estimate the logit model directly with:
PROC GENMOD DATA=zero;
   FREQ f;
   MODEL y = x z / D=B;
RUN;
This produces the results in Output 10.11. There is no apparent problem here. The coefficient for X is large and statistically significant, while the coefficient for Z is smaller and not quite significant. Both goodness-of-fit statistics are 0 with 0 degrees of freedom. That’s because the model has three parameters, but there are only three combinations of X and Z for which we observe the dependent variable Y.
Criteria For Assessing Goodness Of Fit

Criterion            DF      Value   Value/DF
Deviance              0     0.0000   .
Pearson Chi-Square    0     0.0000   .
Log Likelihood        .   -28.1403   .

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1    -2.3979    0.9954      5.8027   0.0160
X            1     2.3979    0.7687      9.7306   0.0018
Z            1     1.3863    0.8062      2.9566   0.0855
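Because the model fits the three observed combinations of X and Z exactly, the estimates can be checked by hand: each linear predictor must equal the observed log-odds that Y = 1 in its column of the table. The symbols \beta_0, \beta_X, and \beta_Z are labels introduced here for the intercept and the two coefficients; they do not appear in the SAS output.

```latex
\begin{aligned}
\beta_0 + \beta_X &= \log(5/5) = 0 && (X=1,\ Z=0)\\
\beta_0 + \beta_Z &= \log(4/11) \approx -1.0116 && (X=0,\ Z=1)\\
\beta_0 + \beta_X + \beta_Z &= \log(20/5) \approx 1.3863 && (X=1,\ Z=1)\\[4pt]
\beta_Z &= 1.3863 - 0 = 1.3863\\
\beta_0 &= -1.0116 - 1.3863 = -2.3979\\
\beta_X &= 0 - (-2.3979) = 2.3979
\end{aligned}
```

These values agree with the reported estimates, and the exact fit to three log-odds with three parameters is why no degrees of freedom remain.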
Now let’s estimate the equivalent loglinear model:
PROC GENMOD DATA=zero;
   MODEL f = x z y x*z y*z y*x / D=P OBSTATS;
RUN;
As Output 10.12 shows, all the parameters pertaining to Y (and associated statistics) are the same in both the loglinear and logit versions of the model. But all the nuisance parameters are very large in magnitude, and those with a reported chi-square have huge ones. The goodness-of-fit chi-squares are again 0, but the reported degrees of freedom is 1 rather than 0.
Criteria For Assessing Goodness Of Fit

Criterion            DF      Value   Value/DF
Deviance              1     0.0000   0.0000
Pearson Chi-Square    1     0.0000   0.0000
Log Likelihood        .    65.9782   .

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1   -23.6020    0.7006   1134.7429   0.0001
X            1    25.2115    0.5394   2184.9366   0.0001
Z            1    25.9999    0.6325   1689.9898   0.0001
Y            1    -2.3979    0.9954      5.8027   0.0160
X*Z          0   -25.9999    0.0000           .   .
Z*Y          1     1.3863    0.8062      2.9566   0.0855
X*Y          1     2.3979    0.7687      9.7306   0.0018
The four parameter estimates greater than 20 in absolute value all stem from the 0 in the marginal table for X and Z. While there’s no guarantee, my experience with PROC GENMOD is that it invariably produces the right estimates and standard errors for the parameters that do not pertain to the marginal table with zeros. However, other software may not do the same. Even if the logit parameter estimates are correct, the incorrect degrees of freedom for the deviance and Pearson statistics may invalidate comparisons with other models.
The solution is to treat the random zeros that arise from marginal zeros as if they were structural zeros—that is, delete them from the data before fitting the model. How do we know which random zeros in the full table come from zeros in the fitted marginal tables? In this example, it’s fairly evident, but more complicated tables may present some difficulties. One approach is to examine the table produced by the OBSTATS option and check for records in which the observed frequency is 0 and the estimated frequency is very small. Output 10.13 shows the OBSTATS table produced for the model in Output 10.12. We see that the last two lines do, in fact, have observed frequencies of 0 and predicted frequencies near 0.
Observation Statistics

 F        Pred      Xbeta      Std     HessWgt       Lower       Upper
20     20.0000     2.9957   0.2236     20.0000     12.9031     31.0002
 5      5.0000     1.6094   0.4472      5.0000      2.0811     12.0127
 5      5.0000     1.6094   0.4472      5.0000      2.0811     12.0127
 5      5.0000     1.6094   0.4472      5.0000      2.0811     12.0127
 4      4.0000     1.3863   0.5000      4.0000      1.5013     10.6576
11     11.0000     2.3979   0.3015     11.0000      6.0918     19.8628
 0   5.109E-12   -25.9999   0.7071   5.109E-12   1.278E-12   2.043E-11
 0    5.62E-11   -23.6020   0.7006    5.62E-11   1.424E-11   2.219E-10
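In larger tables, scanning the OBSTATS listing by eye may be impractical. One possible alternative, sketched here with PROC SQL, is to compute the fitted marginal table directly and keep only the cells whose marginal total is positive. The data set names xz_margin and zero_trim and the column name total are hypothetical, introduced only for this illustration.

```sas
/* Sketch: drop cells of ZERO that fall in an empty XZ margin. */
/* XZ_MARGIN, ZERO_TRIM, and TOTAL are illustrative names.     */
PROC SQL;
   /* Collapse over Y to get the fitted XZ marginal table */
   CREATE TABLE xz_margin AS
   SELECT x, z, SUM(f) AS total
   FROM zero
   GROUP BY x, z;

   /* Keep only the cells whose XZ marginal total is nonzero */
   CREATE TABLE zero_trim AS
   SELECT a.*
   FROM zero AS a, xz_margin AS m
   WHERE a.x = m.x AND a.z = m.z AND m.total > 0;
QUIT;
```

For the data in this example, zero_trim should contain the six cells with X = 1 or Z = 1. Unlike a simple filter on the cell count, this approach would retain any random zero that does not come from an empty fitted margin.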
Now let’s refit the model without these two zeros:
PROC GENMOD DATA=zero;
   WHERE f NE 0;
   MODEL f = x z y x*z y*z y*x / D=P;
RUN;
Results in Output 10.14 give the correct logit parameter estimates and the correct degrees of freedom, 0. None of the estimated parameters is unusually large. Note, however, that we do not get an estimate for the X*Z interaction, because we’ve eliminated one component of the XZ table. Deletion of cells with zero frequency should only be used when the parameters corresponding to the marginal tables with zeros are nuisance parameters. Otherwise, follow the strategies discussed in Section 3.4.
Criteria For Assessing Goodness Of Fit

Criterion            DF      Value   Value/DF
Deviance              0     0.0000   .
Pearson Chi-Square    0     0.0000   .
Log Likelihood        .    65.9782   .

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1     2.3979    0.7006     11.7128   0.0006
X            1    -0.7885    0.5394      2.1370   0.1438
Z            1    -0.0000    0.6325      0.0000   1.0000
Y            1    -2.3979    0.9954      5.8027   0.0160
X*Z          0     0.0000    0.0000           .   .
Z*Y          1     1.3863    0.8062      2.9566   0.0855
X*Y          1     2.3979    0.7687      9.7306   0.0018