10.7. The Problem of Zeros

Contingency tables sometimes have cells with frequency counts of 0. These may cause problems or require special treatment. There are two kinds of zeros:

  • Structural zeros: These are cells for which a nonzero count is impossible because of the nature of the phenomenon or the design of the study. The classic example is a cross-tabulation of sex by type of surgery in which structural zeros occur for male hysterectomies and female vasectomies.

  • Random zeros: In these cells, nonzero counts are possible (at least as far as we know), but a zero occurs because of random variation. Random zeros are especially likely to arise when the sample is small and the contingency table has many cells.

Structural zeros are easily accommodated with PROC GENMOD. Simply delete the structural zeros from the data set before estimating the model. Random zeros can be a little trickier. Most of the time they don’t cause any difficulty, except that the expected cell counts may be very small, thereby degrading the chi-square approximation to the deviance and Pearson’s statistic. However, more serious problems arise when random zeros show up in fitted marginal tables. When a fitted marginal table contains a frequency count of zero, at least one ML parameter estimate is infinite and the fitting algorithm will not converge.

We already encountered this problem in Section 3.4 for the binary logit model. There, we saw that if there is a 0 in the 2 × 2 table describing the relationship between the dependent variable and any dichotomous independent variable, the coefficient for that variable is infinite and the algorithm will not converge. The identical problem arises when fitting a logit model by means of its equivalent loglinear model, and the potential solutions are the same. However, problems arise more frequently when fitting loglinear models because it’s necessary to fit the full marginal table describing the relationships among all the independent variables. As we’ve seen, loglinear models typically contain many nuisance parameters, any of which could have infinite estimates causing problems with convergence.

Here’s a simple, hypothetical example. Consider the following three-way table for dichotomous variables X, Y, and Z:

                    X = 1               X = 0
                Z = 1    Z = 0      Z = 1    Z = 0
Y = 1              20        5          4        0
Y = 0               5        5         11        0
Total              25       10         15        0

Considering the two-way marginal tables, neither the XY table nor the YZ table has any zeros. But the XZ table clearly has one random zero, producing two random zeros in the three-way table.

Now suppose we want to estimate a logit model for Y dependent on X and Z. We can read the table into SAS as follows:

DATA zero;
  INPUT x y z f;
  DATALINES;
1    1    1    20
1    0    1     5
1    1    0     5
1    0    0     5
0    1    1     4
0    0    1    11
0    1    0     0
0    0    0     0
;
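
Before fitting anything, it can help to tabulate the two-way margins and look for empty cells. The following is a minimal sketch, assuming the data set zero just created. The ZEROS option on the WEIGHT statement asks PROC FREQ to keep observations whose count is 0 rather than drop them.

```sas
/* Two-way marginal tables for the data set ZERO read in above.  */
/* WEIGHT f treats f as the cell count; the ZEROS option keeps   */
/* observations whose count is 0 in the tabulation.              */
PROC FREQ DATA=zero;
  WEIGHT f / ZEROS;
  TABLES x*y y*z x*z / NOROW NOCOL NOPERCENT;
RUN;
```

In the x*z table, the cell for x=0, z=0 should show a frequency of 0, while the x*y and y*z tables should have no empty cells.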

We can estimate the logit model directly with:

PROC GENMOD DATA=zero;
  FREQ f;
  MODEL y = x z / D=B;
RUN;

This produces the results in Output 10.11. There is no apparent problem here. The coefficient for X is large and statistically significant, while the coefficient for Z is smaller and not quite significant. Both goodness-of-fit statistics are 0 with 0 degrees of freedom. That’s because the model has three parameters, and there are only three combinations of X and Z for which we observe the dependent variable Y (no cases have X = 0 and Z = 0).

Output 10.11. Logit Output for Data with Marginal Zeros
         Criteria For Assessing Goodness Of Fit

Criterion             DF         Value      Value/DF

Deviance               0        0.0000             .
Pearson Chi-Square     0        0.0000             .
Log Likelihood         .      -28.1403             .

           Analysis Of Parameter Estimates

Parameter    DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT     1     -2.3979      0.9954      5.8027  0.0160
X             1      2.3979      0.7687      9.7306  0.0018
Z             1      1.3863      0.8062      2.9566  0.0855
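
Because the model is saturated, the three estimates can be recovered by hand from the three observed log-odds of Y = 1. The DATA step below is only an illustrative sketch of that arithmetic. The variable names (l11, l10, l01, b0, bx, bz) are hypothetical, and the counts are taken from the three-way table above.

```sas
/* Recover the saturated logit estimates from the observed odds of y=1. */
DATA _NULL_;
  l11 = LOG(20/5);   /* log-odds of y=1 when x=1, z=1 */
  l10 = LOG(5/5);    /* log-odds of y=1 when x=1, z=0 */
  l01 = LOG(4/11);   /* log-odds of y=1 when x=0, z=1 */
  bz = l11 - l10;    /* effect of z, holding x at 1 */
  b0 = l01 - bz;     /* intercept: implied log-odds at x=0, z=0 */
  bx = l10 - b0;     /* effect of x, holding z at 0 */
  PUT b0= 8.4 bx= 8.4 bz= 8.4;   /* written to the SAS log */
RUN;
```

The values written to the log should agree with the INTERCEPT, X, and Z estimates in Output 10.11. Note that the intercept is an extrapolation to the X = 0, Z = 0 cell, where no cases were observed.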

Now let’s estimate the equivalent loglinear model:

PROC GENMOD DATA=zero;
  MODEL f=x z y x*z y*z y*x / D=P OBSTATS;
RUN;

As Output 10.12 shows, all the parameters pertaining to Y (and associated statistics) are the same in both the loglinear and logit versions of the model. But all the nuisance parameters are very large in magnitude, with huge chi-squares. The goodness-of-fit chi-squares are again 0, but the reported degrees of freedom is 1 rather than 0.

Output 10.12. Loglinear Output for Data with Marginal Zeros
       Criteria For Assessing Goodness Of Fit

Criterion             DF         Value      Value/DF

Deviance               1        0.0000        0.0000
Pearson Chi-Square     1        0.0000        0.0000
Log Likelihood         .       65.9782             .

            Analysis Of Parameter Estimates

Parameter   DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT    1    -23.6020      0.7006   1134.7429  0.0001
X            1     25.2115      0.5394   2184.9366  0.0001
Z            1     25.9999      0.6325   1689.9898  0.0001
Y            1     -2.3979      0.9954      5.8027  0.0160
X*Z          0    -25.9999      0.0000           .       .
Z*Y          1      1.3863      0.8062      2.9566  0.0855
X*Y          1      2.3979      0.7687      9.7306  0.0018

The four parameter estimates greater than 20 in absolute value (INTERCEPT, X, Z, and X*Z) all stem from the 0 in the marginal table for X and Z. While there’s no guarantee, my experience with PROC GENMOD is that it invariably produces the right estimates and standard errors for the parameters that do not pertain to the marginal table with zeros. However, other software may not do the same. Even if the logit parameter estimates are correct, the incorrect degrees of freedom for the deviance and Pearson statistics may invalidate comparisons with other models.

The solution is to treat the random zeros that arise from marginal zeros as if they were structural zeros—that is, delete them from the data before fitting the model. How do we know which random zeros in the full table come from zeros in the fitted marginal tables? In this example, it’s fairly evident, but more complicated tables may present some difficulties. One approach is to examine the table produced by the OBSTATS option and check for records in which the observed frequency is 0 and the estimated frequency is very small. Output 10.13 shows the OBSTATS table produced for the model in Output 10.12. We see that the last two lines do, in fact, have observed frequencies of 0 and predicted frequencies near 0.

Output 10.13. OBSTATS Output for Model with Marginal Zeros
                          Observation Statistics
 F       Pred      Xbeta        Std    HessWgt      Lower      Upper

20    20.0000     2.9957     0.2236    20.0000    12.9031    31.0002
 5     5.0000     1.6094     0.4472     5.0000     2.0811    12.0127
 5     5.0000     1.6094     0.4472     5.0000     2.0811    12.0127
 5     5.0000     1.6094     0.4472     5.0000     2.0811    12.0127
 4     4.0000     1.3863     0.5000     4.0000     1.5013    10.6576
11    11.0000     2.3979     0.3015    11.0000     6.0918    19.8628
 0  5.109E-12   -25.9999     0.7071  5.109E-12  1.278E-12  2.043E-11
 0   5.62E-11   -23.6020     0.7006   5.62E-11  1.424E-11  2.219E-10
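
With a larger table, scanning the OBSTATS listing by eye becomes tedious. One possible way to script the same check is sketched below. It assumes the data set zero from earlier in this section and a SAS release in which PROC GENMOD supports the OUTPUT statement. The names fitted, fhat, flagged, and marginal_zero are hypothetical, and the cutoff of 1E-6 is an arbitrary choice for illustration, not a recommended rule.

```sas
/* Refit the loglinear model and save the predicted cell counts. */
/* Assumes the data set ZERO created earlier in this section.    */
PROC GENMOD DATA=zero;
  MODEL f = x z y x*z y*z y*x / D=P;
  OUTPUT OUT=fitted PRED=fhat;   /* fhat = predicted frequency */
RUN;

/* Flag cells with an observed count of 0 and a predicted count  */
/* near 0. The cutoff 1E-6 is an arbitrary illustrative value.   */
DATA flagged;
  SET fitted;
  marginal_zero = (f = 0 AND fhat < 1E-6);
RUN;

/* List only the flagged cells. */
PROC PRINT DATA=flagged;
  WHERE marginal_zero = 1;
  VAR x y z f fhat;
RUN;
```

For this example, the listing should contain the two cells with x=0 and z=0, the same two records that appear at the bottom of Output 10.13.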

Now let’s refit the model without these two zeros:

PROC GENMOD DATA=zero;
  WHERE f NE 0;
  MODEL f=x z y x*z y*z y*x / D=P;
RUN;

Results in Output 10.14 give the correct logit parameter estimates and the correct degrees of freedom, 0. None of the estimated parameters is unusually large. Note, however, that we do not get an estimate for the X*Z interaction, because we’ve eliminated one component of the XZ table. Deletion of cells with zero frequency should only be used when the parameters corresponding to the marginal tables with zeros are nuisance parameters. Otherwise, follow the strategies discussed in Section 3.4.

Output 10.14. Loglinear Output for Model with Zeros Deleted
           Criteria For Assessing Goodness Of Fit

     Criterion             DF         Value      Value/DF

     Deviance               0        0.0000             .
     Pearson Chi-Square     0        0.0000             .
     Log Likelihood         .       65.9782             .

             Analysis Of Parameter Estimates

Parameter   DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT    1      2.3979      0.7006     11.7128  0.0006
X            1     -0.7885      0.5394      2.1370  0.1438
Z            1     -0.0000      0.6325      0.0000  1.0000
Y            1     -2.3979      0.9954      5.8027  0.0160
X*Z          0      0.0000      0.0000           .       .
Z*Y          1      1.3863      0.8062      2.9566  0.0855
X*Y          1      2.3979      0.7687      9.7306  0.0018
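
The WHERE statement above drops every cell with a zero count, which is appropriate here because both zeros come from the empty XZ margin. In a larger table, some random zeros may have nothing to do with a marginal zero, and we might want to keep those. The sketch below shows one way to delete only the cells whose XZ marginal total is 0. It assumes the data set zero from earlier in this section; zero_trim and xz_total are hypothetical names. It relies on PROC SQL remerging the group totals back onto the individual rows.

```sas
/* Keep only cells whose X-by-Z marginal total is positive.       */
/* Assumes the data set ZERO created earlier in this section.     */
/* SUM(f) is computed within each (x, z) group and remerged onto  */
/* every row of that group; SAS writes a note about remerging.    */
PROC SQL;
  CREATE TABLE zero_trim AS
  SELECT *, SUM(f) AS xz_total
  FROM zero
  GROUP BY x, z
  HAVING CALCULATED xz_total > 0;
QUIT;

/* Refit the loglinear model on the trimmed table. */
PROC GENMOD DATA=zero_trim;
  MODEL f = x z y x*z y*z y*x / D=P;
RUN;
```

For this table, zero_trim contains the same six cells that WHERE f NE 0 retains, so the fit should match Output 10.14.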
