2.8. Maximum Likelihood Estimation with PROC GENMOD

GENMOD is a relatively new SAS procedure that’s designed to estimate generalized linear models (McCullagh and Nelder 1989), which include the standard linear model, logit and probit models, loglinear models, Poisson regression models, and many other less familiar models. It’s very similar in features and syntax to the famous GLIM program introduced by the Royal Statistical Society in the early 1970s.

Here’s how to use PROC GENMOD to estimate the same logit model that we fit with PROC LOGISTIC in the previous section. The code for Release 6.12 and later is

PROC GENMOD DATA=my.penalty;
  MODEL death=blackd whitvic serious / DIST=BINOMIAL;
RUN;

(Earlier releases of SAS require that the dependent variable be specified as DEATH/N, where N is a variable that is always equal to 1.) Although PROC GENMOD doesn’t need the DESCENDING option, it does require the DIST=BINOMIAL option (which can be abbreviated D=B) in the MODEL statement. This tells PROC GENMOD that the dependent variable is dichotomous with a binomial distribution. For dichotomous data, the default in GENMOD is to fit a logit model. In Section 3.10, we’ll see how to use the LINK option to fit probit and complementary log-log models.
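For readers on an earlier release, the events/trials form just described might look like the following sketch. The working data set PENALTY1 and the variable N are names introduced here for illustration only.

```sas
/* Sketch for releases before 6.12: supply a constant number-of-trials variable */
DATA penalty1;              /* PENALTY1 is a hypothetical working data set */
  SET my.penalty;
  n = 1;                    /* every case counts as a single trial */
RUN;

PROC GENMOD DATA=penalty1;
  MODEL death/n = blackd whitvic serious / DIST=BINOMIAL;
RUN;
```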

Results are shown in Output 2.3. The first section of the output, labeled “Model Information,” is largely self-explanatory. The third section, “Analysis of Parameter Estimates,” reports the same numbers we got with LOGISTIC. However, we don’t get standardized estimates or odds ratios. This section also contains a line labeled SCALE, along with a note saying that “the scale parameter was held fixed.” This information can be ignored for binary regression models unless you’re working with grouped data and want to allow for something called overdispersion (see Section 4.6).

In the middle section of Output 2.3 under “Criteria for Assessing Goodness of Fit,” we find the deviance and the Pearson chi-square. (For now, we can ignore the scaled versions of these statistics.) For individual-level data, the deviance is just –2 times the log-likelihood, which we also saw in the LOGISTIC output. In logit analysis, the deviance plays the same role as the residual sum of squares in linear regression analysis.

By adding 2k to the deviance (where k is the number of parameters), you can calculate the Akaike Information Criterion. Or by adding k log n to the deviance (n being the sample size), you get the BIC statistic. You can also use the difference in deviances for two nested models as a chi-square test for whether the simpler model is valid or not. We’ll see many examples of this later on. What you cannot do (at least not legitimately) is treat the deviance itself as a goodness of fit chi-square statistic and compute a p-value for the model. That may be appropriate for grouped data (as we’ll see in Chapter 4), but for individual-level data the deviance does not have a chi-square distribution. The Pearson chi square is even worse for individual-level data.
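As a quick illustration of the arithmetic, the following DATA step is a minimal sketch that plugs in the deviance and sample size from Output 2.3. It assumes k = 4 (the intercept plus three coefficients) and that n is taken to be the 147 observations used.

```sas
/* Sketch: AIC and BIC from the deviance reported in Output 2.3 */
DATA _NULL_;
  deviance = 176.2850;          /* deviance from Output 2.3 */
  k = 4;                        /* intercept plus three slope coefficients */
  n = 147;                      /* observations used */
  aic = deviance + 2*k;         /* Akaike Information Criterion */
  bic = deviance + k*LOG(n);    /* LOG is the natural logarithm in SAS */
  PUT aic= bic=;                /* writes both values to the SAS log */
RUN;
```

With these inputs, the AIC is 184.285 and the BIC comes out at roughly 196.25.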

Output 2.3. PROC GENMOD Output for Death Penalty Data
                  The GENMOD Procedure

                    Model Information

         Description                     Value

         Data Set                        MY.JUDGE
         Distribution                    BINOMIAL
         Link Function                   LOGIT
         Dependent Variable              DEATH
         Observations Used               147
         Number Of Events                50
         Number Of Trials                147


          Criteria For Assessing Goodness Of Fit

   Criterion             DF         Value      Value/DF

   Deviance             143      176.2850        1.2328
   Scaled Deviance      143      176.2850        1.2328
   Pearson Chi-Square   143      149.5144        1.0456
   Scaled Pearson X2    143      149.5144        1.0456
   Log Likelihood         .      -88.1425             .


              Analysis Of Parameter Estimates

Parameter    DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT     1     -2.6516      0.6748     15.4424  0.0001
BLACKD        1      0.5952      0.3939      2.2827  0.1308
WHITVIC       1      0.2565      0.4002      0.4107  0.5216
SERIOUS       1      0.1871      0.0612      9.3343  0.0022
SCALE         0      1.0000      0.0000           .       .

NOTE:  The scale parameter was held fixed.

Unlike PROC LOGISTIC, PROC GENMOD does not report a global test for the null hypothesis that all the coefficients are 0. The best way to get that statistic is to fit a model with no explanatory variables (MODEL death=/D=B;), sometimes called a null model. For these data, the deviance for the null model is 188.49. Taking the difference between 188.49 and 176.3 (the deviance for the model in Output 2.3), we get 12.2, which is the likelihood-ratio chi square (with 3 degrees of freedom) reported by PROC LOGISTIC.
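The two steps might be sketched as follows. The first step fits the null model. The second, which is an addition for illustration, types in the two deviances by hand and uses the PROBCHI function to get a p-value on 3 degrees of freedom (one for each explanatory variable dropped).

```sas
/* Step 1: fit the null model (intercept only) to get its deviance */
PROC GENMOD DATA=my.penalty;
  MODEL death = / D=B;
RUN;

/* Step 2: likelihood-ratio chi square from the two deviances */
DATA _NULL_;
  lr = 188.49 - 176.2850;       /* null deviance minus deviance in Output 2.3 */
  df = 3;                       /* blackd, whitvic, serious */
  p  = 1 - PROBCHI(lr, df);     /* upper-tail chi-square probability */
  PUT lr= df= p=;               /* writes the results to the SAS log */
RUN;
```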

Now that you’ve seen the basic syntax and output for both the LOGISTIC and GENMOD procedures, you may well ask why we need GENMOD. After all, GENMOD doesn’t produce some of the useful statistics reported by LOGISTIC. The answer is that GENMOD has several features that are absent in LOGISTIC, features that are extremely useful for some applications, especially the analysis of contingency tables. Here are some of GENMOD’s features that I find particularly valuable:

CLASS variables. As with the GLM, LIFEREG, and PROBIT procedures, GENMOD has a CLASS statement that allows you to specify that a variable is to be treated as categorical (nominal). When a CLASS variable is included as an explanatory variable in the MODEL statement, GENMOD automatically creates a dummy variable for each distinct value of the original variable. To accomplish the same thing in LOGISTIC, you must create the dummy variables yourself in a DATA step.

Here’s an example with the death-penalty data. The data set contains the variable CULP, which has the integer values 1 to 5 (5 denotes high culpability and 1 denotes low culpability, based on a large number of aggravating and mitigating circumstances defined by statute). Although we could treat this variable as an interval scale, we might prefer to treat it as a set of categories. To do this, we run the program

PROC GENMOD DATA=my.penalty;
  CLASS culp;
  MODEL death = blackd whitvic culp / D=B;
RUN;

which produces the results in Output 2.4.

Output 2.4. Use of a CLASS Variable in PROC GENMOD
                 Analysis Of Parameter Estimates
Parameter      DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT       1      0.5533      0.7031      0.6193  0.4313
BLACKD          1      1.7246      0.6131      7.9141  0.0049
WHITVIC         1      0.8385      0.5694      2.1687  0.1408
CULP      1     1     -4.8670      0.8251     34.7926  0.0001
CULP      2     1     -3.0547      0.7754     15.5185  0.0001
CULP      3     1     -1.5294      0.8400      3.3153  0.0686
CULP      4     1     -0.3610      0.8857      0.1662  0.6835
CULP      5     0      0.0000      0.0000           .       .

Because the variable CULP has 5 possible values, GENMOD has created four dummy variables, one for each of the values 1 through 4. As in other procedures that have CLASS variables, the default in GENMOD is to take the highest value as the omitted category. Thus, each of the four coefficients for CULP is a comparison between that particular value and the highest value. More specifically, each coefficient can be interpreted as the log-odds for that particular value of CULP minus the log-odds for CULP=5, controlling for other variables in the model. The pattern for the four coefficients is just what we’d expect. Defendants with CULP=1 are much less likely to get the death sentence than those with CULP=5. Each increase of CULP is associated with an increase in the probability of a death sentence. Note that when CULP is included in the model, the coefficient for BLACKD (black defendant) is much larger than it was in Output 2.3 and is now statistically significant.
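For comparison, here is a sketch of how the same model could be set up in PROC LOGISTIC by building the dummy variables in a DATA step. The data set name PENALTY2 and the variables CULP1 through CULP4 are names introduced here for illustration. CULP=5 is left out so that it serves as the reference category, as in Output 2.4.

```sas
/* Sketch: hand-built dummy variables for CULP, with CULP=5 as the reference */
DATA penalty2;                  /* PENALTY2 is a hypothetical working data set */
  SET my.penalty;
  /* A logical expression evaluates to 1 if true and 0 if false. */
  /* Note that a missing CULP would be coded 0 on all four dummies. */
  culp1 = (culp = 1);
  culp2 = (culp = 2);
  culp3 = (culp = 3);
  culp4 = (culp = 4);
RUN;

PROC LOGISTIC DATA=penalty2 DESCENDING;
  MODEL death = blackd whitvic culp1 culp2 culp3 culp4;
RUN;
```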

Multiplicative terms in the MODEL statement. Regression analysts often want to build models that have interactions in which the effect of one variable depends on the level of another variable. The most popular way of doing this is to include a new explanatory variable in the model, one that is the product of the two original variables. With PROC LOGISTIC, you have to create the product variables in a DATA step. With PROC GENMOD, you just specify the product in the MODEL statement. For example, some criminologists have argued that black defendants who kill white victims may be especially likely to receive a death sentence. We can test that hypothesis for the New Jersey data with this program:

PROC GENMOD DATA=my.penalty;
  MODEL death = blackd whitvic culp blackd*whitvic / D=B;
RUN;

This produces the table in Output 2.5.

Output 2.5. Multiplicative Variables in PROC GENMOD
             Analysis Of Parameter Estimates

Parameter        DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT         1     -5.4047      1.1627     21.6073  0.0001
BLACKD            1      1.8723      1.0463      3.2021  0.0735
WHITVIC           1      1.0727      0.9877      1.1794  0.2775
CULP              1      1.2704      0.1968     41.6881  0.0001
BLACKD*WHITVIC    1     -0.3274      1.1782      0.0772  0.7811
SCALE             0      1.0000      0.0000           .       .

With a p-value of .78, the product term is clearly not significant and can be excluded from the model.
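To fit the same model with PROC LOGISTIC, the product variable has to be built first. A minimal sketch follows, in which the data set PENALTY3 and the variable BLACKWHT are hypothetical names chosen for illustration.

```sas
/* Sketch: create the product term in a DATA step for use in PROC LOGISTIC */
DATA penalty3;                  /* PENALTY3 is a hypothetical working data set */
  SET my.penalty;
  blackwht = blackd*whitvic;    /* 1 only for black defendants with white victims */
RUN;

PROC LOGISTIC DATA=penalty3 DESCENDING;
  MODEL death = blackd whitvic culp blackwht;
RUN;
```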

The product syntax in GENMOD also makes it easy to construct polynomial functions. For example, to estimate a cubic equation you can specify a model of the form

MODEL y = x x*x x*x*x / D=B;

This fits the model $\log(p/(1 - p)) = \alpha + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$.

Likelihood-ratio tests for individual coefficients. While PROC LOGISTIC reports a likelihood-ratio test for the null hypothesis that all coefficients are 0, the tests for the individual coefficients are Wald statistics. That’s because Wald statistics are so easy to compute: just divide the coefficient by its estimated standard error and square the result. By contrast, to get likelihood-ratio tests, you must refit the model multiple times, deleting each explanatory variable in turn. You then compute twice the positive difference between the log-likelihood for the full model and for each of the reduced models. Despite the greater computational burden, there is mounting evidence that likelihood-ratio tests are superior (Hauck and Donner 1977; Jennings 1986), especially in small samples or samples with unusual data patterns, and many authorities express a strong preference for them (for example, Collett 1991).
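To make the Wald calculation concrete, here is a small sketch that reproduces the chi square for BLACKD from the estimate and standard error printed in Output 2.3. Because those printed values are rounded, the result should agree with the reported 2.2827 only to about two decimal places.

```sas
/* Sketch: Wald chi square for BLACKD, using rounded values from Output 2.3 */
DATA _NULL_;
  estimate = 0.5952;            /* coefficient for BLACKD */
  stderr   = 0.3939;            /* its estimated standard error */
  wald = (estimate/stderr)**2;  /* divide, then square */
  p    = 1 - PROBCHI(wald, 1);  /* chi-square p-value with 1 d.f. */
  PUT wald= p=;                 /* writes the results to the SAS log */
RUN;
```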

As with PROC LOGISTIC, PROC GENMOD reports Wald statistics for the individual coefficients. But you can also get likelihood-ratio statistics by putting the TYPE3 option in the MODEL statement. For the model in Output 2.5, we can modify the MODEL statement to read

MODEL death = blackd whitvic culp blackd*whitvic / D=B TYPE3;

which produces the likelihood-ratio tests in Output 2.6. Although these statistics are similar to the Wald statistics, there are some noteworthy differences. In particular, the likelihood-ratio chi square for CULP is nearly twice as large as the Wald chi square.

Output 2.6. Likelihood-ratio Tests in GENMOD
LR Statistics For Type 3 Analysis

Source            DF   ChiSquare  Pr>Chi

BLACKD             1      3.6132  0.0573
WHITVIC            1      1.2385  0.2658
CULP               1     75.3762  0.0001
BLACKD*WHITVIC     1      0.0775  0.7807

The TYPE3 option does not produce likelihood-ratio tests for the individual coefficients of a CLASS variable. Instead, a single chi square for the entire set is reported, testing the null hypothesis that all the coefficients in the set are equal to 0. To get individual likelihood-ratio tests, use the CONTRAST statement described in Section 4.5.

Generalized Estimating Equations. With Release 6.12, GENMOD provides an optional estimation method—known as GEE—that’s ideal for longitudinal and other clustered data. For conventional logit analysis, a crucial assumption is that observations are independent—the outcome for one observation is completely unrelated to the outcome for any other observation. But suppose you follow people over a period of time, measuring the same dichotomous variable at regular intervals. It’s highly unlikely that a person’s response at one point in time will be independent of earlier or later responses. Failure to take such correlations into account can lead to seriously biased standard errors and test statistics. The GEE method—invoked by using the REPEATED statement—solves these problems in a convenient, flexible way. We’ll discuss this method in some detail in Chapter 8.
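As a preview only, a GEE run might look something like the sketch below. Everything about the data is assumed for illustration: PANEL is a hypothetical data set with one record per person per time point, ID identifies the person, Y is the dichotomous outcome, and X and TIME are explanatory variables. The exchangeable correlation structure (TYPE=EXCH) is just one possible choice.

```sas
/* Sketch: GEE estimation for repeated dichotomous outcomes (hypothetical data) */
PROC GENMOD DATA=panel;             /* PANEL: hypothetical long-format data set */
  CLASS id;                         /* the subject variable must be a CLASS variable */
  MODEL y = x time / D=B;
  REPEATED SUBJECT=id / TYPE=EXCH;  /* exchangeable working correlation */
RUN;
```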

What the GENMOD procedure lacks. Despite these desirable features, GENMOD lacks some important capabilities of PROC LOGISTIC, and their absence prevents it from being the universal logit procedure in SAS. Most importantly, LOGISTIC has

  • The cumulative logit model. LOGISTIC can estimate logit models where the dependent variable has more than two ordered categories. We’ll discuss such models extensively in Chapter 6. This capability will be added to GENMOD in Version 7.

  • Influence statistics. These statistics, discussed in Section 3.8, tell you how much the coefficients change with deletion of each case from the model fit. They can be very helpful in discovering problem observations.

  • Automated variable selection methods. LOGISTIC has a variety of methods for selecting variables into a model from a pool of potential explanatory variables. But because I’m not a fan of such methods, I won’t discuss them in this book.
