GENMOD is a relatively new SAS procedure that’s designed to estimate generalized linear models (McCullagh and Nelder 1989), which include the standard linear model, logit and probit models, loglinear models, Poisson regression models, and many other less familiar models. It’s very similar in features and syntax to the famous GLIM program introduced by the Royal Statistical Society in the early 1970s.
Here’s how to use PROC GENMOD to estimate the same logit model that we fit with PROC LOGISTIC in the previous section. The code for Release 6.12 and later is
PROC GENMOD DATA=my.penalty;
   MODEL death=blackd whitvic serious / DIST=BINOMIAL;
RUN;
(Earlier releases of SAS require that the dependent variable be specified as DEATH/N, where N is a variable that is always equal to 1.) Although PROC GENMOD doesn’t need the DESCENDING option, it does require the DIST=BINOMIAL option (which can be abbreviated D=B) in the MODEL statement. This tells PROC GENMOD that the dependent variable is dichotomous with a binomial distribution. For dichotomous data, the default in GENMOD is to fit a logit model. In Section 3.10, we’ll see how to use the LINK option to fit probit and complementary log-log models.
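For readers on an earlier release, the DEATH/N form might look like the following sketch. The data set name penalty1 is hypothetical; the variable name N follows the text.

```sas
/* Add a "trials" variable that is always equal to 1 */
DATA penalty1;                 /* penalty1 is a hypothetical name */
   SET my.penalty;
   n = 1;
RUN;

/* Events/trials syntax: DEATH events out of N trials per case */
PROC GENMOD DATA=penalty1;
   MODEL death/n = blackd whitvic serious / DIST=BINOMIAL;
RUN;
```

With N fixed at 1 for every case, this specifies the same model as the single-variable form shown above.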
Results are shown in Output 2.3. The first section of the output, labeled “Model Information,” is largely self-explanatory. The third section, “Analysis of Parameter Estimates,” reports the same numbers we got with LOGISTIC. However, we don’t get standardized estimates or odds ratios. This section also contains a line labeled SCALE, along with a note saying that “the scale parameter was held fixed.” This information can be ignored for binary regression models unless you’re working with grouped data and want to allow for something called overdispersion (see Section 4.6).
In the middle section of Output 2.3 under “Criteria for Assessing Goodness of Fit,” we find the deviance and the Pearson chi-square. (For now, we can ignore the scaled versions of these statistics.) For individual-level data, the deviance is just –2 times the log-likelihood, which we also saw in the LOGISTIC output. In logit analysis, the deviance plays the same role as the residual sum of squares in linear regression analysis.
By adding 2k to the deviance (where k is the number of parameters), you can calculate the Akaike Information Criterion. By adding k log n to the deviance (n being the sample size and log the natural logarithm), you get the BIC statistic. You can also use the difference in deviances for two nested models as a chi-square test for whether the simpler model is valid. We’ll see many examples of this later on. What you cannot do (at least not legitimately) is treat the deviance itself as a goodness-of-fit chi-square statistic and compute a p-value for the model. That may be appropriate for grouped data (as we’ll see in Chapter 4), but for individual-level data the deviance does not have a chi-square distribution. The Pearson chi square is even worse for individual-level data.
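As a rough check on the arithmetic, here are the calculations with the figures in Output 2.3 (log-likelihood −88.1425, n = 147). I am assuming that k counts the intercept along with the three slopes, so k = 4. That count agrees with the 143 degrees of freedom reported for the deviance.

```latex
\begin{align*}
\text{Deviance} &= -2\log L = -2(-88.1425) = 176.2850 \\
\text{AIC} &= \text{Deviance} + 2k = 176.2850 + 2(4) = 184.2850 \\
\text{BIC} &= \text{Deviance} + k\ln n = 176.2850 + 4\ln(147)
            \approx 176.2850 + 19.9617 = 196.2467
\end{align*}
```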
                     The GENMOD Procedure

                       Model Information

          Description                  Value

          Data Set                     MY.JUDGE
          Distribution                 BINOMIAL
          Link Function                LOGIT
          Dependent Variable           DEATH
          Observations Used            147
          Number Of Events             50
          Number Of Trials             147

            Criteria For Assessing Goodness Of Fit

   Criterion               DF         Value      Value/DF

   Deviance               143      176.2850        1.2328
   Scaled Deviance        143      176.2850        1.2328
   Pearson Chi-Square     143      149.5144        1.0456
   Scaled Pearson X2      143      149.5144        1.0456
   Log Likelihood           .      -88.1425         .

                Analysis Of Parameter Estimates

   Parameter     DF    Estimate     Std Err   ChiSquare    Pr>Chi

   INTERCEPT      1     -2.6516      0.6748     15.4424    0.0001
   BLACKD         1      0.5952      0.3939      2.2827    0.1308
   WHITVIC        1      0.2565      0.4002      0.4107    0.5216
   SERIOUS        1      0.1871      0.0612      9.3343    0.0022
   SCALE          0      1.0000      0.0000       .         .

   NOTE: The scale parameter was held fixed.
Unlike PROC LOGISTIC, PROC GENMOD does not report a global test for the null hypothesis that all the coefficients are 0. The best way to get that statistic is to fit a model with no explanatory variables (MODEL death=/D=B;), sometimes called a null model. For these data, the deviance for the null model is 188.49. Taking the difference between 188.49 and 176.3 (the deviance for the model in Output 2.3), we get 12.2, which is the likelihood-ratio chi square reported by PROC LOGISTIC.
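The two steps just described can be sketched as follows. The DATA step is my own illustration rather than anything GENMOD produces: the data set name lrtest and its variable names are hypothetical, the deviances are typed in by hand from the text and Output 2.3, and the 3 degrees of freedom correspond to the three coefficients set to 0.

```sas
/* Step 1: null model (intercept only); note its deviance */
PROC GENMOD DATA=my.penalty;
   MODEL death = / DIST=BINOMIAL;
RUN;

/* Step 2: difference in deviances as a chi-square statistic */
DATA lrtest;                        /* hypothetical data set name       */
   dev_null = 188.49;               /* deviance of the null model       */
   dev_full = 176.2850;             /* deviance from Output 2.3         */
   chisq    = dev_null - dev_full;  /* likelihood-ratio chi square      */
   df       = 3;                    /* coefficients constrained to 0    */
   p        = 1 - PROBCHI(chisq, df);  /* upper-tail p-value            */
RUN;

PROC PRINT DATA=lrtest;
RUN;
```

The same pattern works for any pair of nested models, with df set to the difference in the number of parameters.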
Now that you’ve seen the basic syntax and output for both the LOGISTIC and GENMOD procedures, you may well ask why we need GENMOD. After all, GENMOD doesn’t produce some of the useful statistics reported by LOGISTIC. The answer is that GENMOD has several features that are absent in LOGISTIC, features that are extremely useful for some applications, especially the analysis of contingency tables. Here are some of GENMOD’s features that I find particularly valuable:
CLASS variables. As with the GLM, LIFEREG, and PROBIT procedures, GENMOD has a CLASS statement that allows you to specify that a variable is to be treated as categorical (nominal). When a CLASS variable is included as an explanatory variable in the MODEL statement, GENMOD automatically creates a dummy variable for each distinct value of the original variable. To accomplish the same thing in LOGISTIC, you must create the dummy variables yourself in a DATA step.
Here’s an example with the death-penalty data. The data set contains the variable CULP, which has the integer values 1 to 5 (5 denotes high culpability and 1 denotes low culpability, based on a large number of aggravating and mitigating circumstances defined by statute). Although we could treat this variable as an interval scale, we might prefer to treat it as a set of categories. To do this, we run the program
PROC GENMOD DATA=my.penalty;
   CLASS culp;
   MODEL death = blackd whitvic culp / D=B;
RUN;
which produces the results in Output 2.4.
                Analysis Of Parameter Estimates

   Parameter        DF    Estimate     Std Err   ChiSquare    Pr>Chi

   INTERCEPT         1      0.5533      0.7031      0.6193    0.4313
   BLACKD            1      1.7246      0.6131      7.9141    0.0049
   WHITVIC           1      0.8385      0.5694      2.1687    0.1408
   CULP       1      1     -4.8670      0.8251     34.7926    0.0001
   CULP       2      1     -3.0547      0.7754     15.5185    0.0001
   CULP       3      1     -1.5294      0.8400      3.3153    0.0686
   CULP       4      1     -0.3610      0.8857      0.1662    0.6835
   CULP       5      0      0.0000      0.0000       .         .
Because the variable CULP has 5 possible values, GENMOD has created four dummy variables, one for each of the values 1 through 4. As in other procedures that have CLASS variables, the default in GENMOD is to take the highest value as the omitted category. Thus, each of the four coefficients for CULP is a comparison between that particular value and the highest value. More specifically, each coefficient can be interpreted as the log-odds for that particular value of CULP minus the log-odds for CULP=5, controlling for other variables in the model. The pattern for the four coefficients is just what we’d expect. Defendants with CULP=1 are much less likely to get the death sentence than those with CULP=5. Each increase of CULP is associated with an increase in the probability of a death sentence. Note that when CULP is included in the model, the coefficient for BLACKD (black defendant) is much larger than it was in Output 2.3 and is now statistically significant.
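For comparison, here is a minimal sketch of the DATA step approach needed for PROC LOGISTIC, with CULP=5 as the omitted category to match GENMOD’s default. The data set name penalty2 and the dummy variable names are hypothetical, and the sketch assumes CULP has no missing values (a missing value would be coded 0 on every dummy).

```sas
/* Build the four dummy variables by hand */
DATA penalty2;               /* hypothetical data set name              */
   SET my.penalty;
   culp1 = (culp = 1);       /* a true comparison yields 1, false yields 0 */
   culp2 = (culp = 2);
   culp3 = (culp = 3);
   culp4 = (culp = 4);       /* CULP=5 is the omitted category          */
RUN;

PROC LOGISTIC DATA=penalty2 DESCENDING;
   MODEL death = blackd whitvic culp1 culp2 culp3 culp4;
RUN;
```

The coefficients for the four dummies should then correspond to the CULP 1 through CULP 4 rows in Output 2.4.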
Multiplicative terms in the MODEL statement. Regression analysts often want to build models that have interactions in which the effect of one variable depends on the level of another variable. The most popular way of doing this is to include a new explanatory variable in the model, one that is the product of the two original variables. With PROC LOGISTIC, you have to create the product variables in a DATA step. With PROC GENMOD, you just specify the product in the MODEL statement. For example, some criminologists have argued that black defendants who kill white victims may be especially likely to receive a death sentence. We can test that hypothesis for the New Jersey data with this program:
PROC GENMOD DATA=my.penalty;
   MODEL death = blackd whitvic culp blackd*whitvic / D=B;
RUN;
This produces the table in Output 2.5.
                  Analysis Of Parameter Estimates

   Parameter          DF    Estimate     Std Err   ChiSquare    Pr>Chi

   INTERCEPT           1     -5.4047      1.1627     21.6073    0.0001
   BLACKD              1      1.8723      1.0463      3.2021    0.0735
   WHITVIC             1      1.0727      0.9877      1.1794    0.2775
   CULP                1      1.2704      0.1968     41.6881    0.0001
   BLACKD*WHITVIC      1     -0.3274      1.1782      0.0772    0.7811
   SCALE               0      1.0000      0.0000       .         .
With a p-value of .78, the product term is clearly not significant and can be excluded from the model.
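To fit the same interaction model with PROC LOGISTIC, the product variable has to be built first. A minimal sketch, in which the data set name penalty3 and the variable name BW are hypothetical:

```sas
/* Create the product term in a DATA step */
DATA penalty3;               /* hypothetical data set name */
   SET my.penalty;
   bw = blackd*whitvic;      /* 1 only for black defendant AND white victim */
RUN;

PROC LOGISTIC DATA=penalty3 DESCENDING;
   MODEL death = blackd whitvic culp bw;
RUN;
```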
The product syntax in GENMOD also makes it easy to construct polynomial functions. For example, to estimate a cubic equation you can specify a model of the form
MODEL y = x x*x x*x*x / D=B;
This fits the model log(p/(1 – p)) = α + β₁x + β₂x² + β₃x³.
Likelihood-ratio tests for individual coefficients. While PROC LOGISTIC reports a likelihood-ratio test for the null hypothesis that all coefficients are 0, the tests for the individual coefficients are Wald statistics. That’s because Wald statistics are so easy to compute: just divide the coefficient by its estimated standard error and square the result. By contrast, to get likelihood-ratio tests, you must refit the model multiple times, deleting each explanatory variable in turn. You then compute twice the positive difference between the log-likelihood for the full model and for each of the reduced models. Despite the greater computational burden, there is mounting evidence that likelihood-ratio tests are superior (Hauck and Donner 1977; Jennings 1986), especially in small samples or samples with unusual data patterns, and many authorities express a strong preference for them (for example, Collett 1991).
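In symbols, the two statistics described above can be written as follows. The notation is my own, not SAS’s, and the worked Wald example uses the BLACKD row of Output 2.3. The hand calculation comes out at about 2.283 rather than the reported 2.2827, presumably because the printed estimate and standard error are rounded.

```latex
\begin{align*}
\text{Wald: } W &= \left(\frac{\hat\beta}{\widehat{\mathrm{se}}(\hat\beta)}\right)^{2},
  \qquad W_{\text{BLACKD}} = \left(\frac{0.5952}{0.3939}\right)^{2} \approx 2.283 \\[4pt]
\text{Likelihood ratio: } LR &= 2\left[\log L_{\text{full}} - \log L_{\text{reduced}}\right]
\end{align*}
```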
As with PROC LOGISTIC, PROC GENMOD reports Wald statistics for the individual coefficients. But you can also get likelihood-ratio statistics by putting the TYPE3 option in the MODEL statement. For the model in Output 2.5, we can modify the MODEL statement to read
MODEL death = blackd whitvic culp blackd*whitvic / D=B TYPE3;
which produces the likelihood-ratio tests in Output 2.6. While these are similar to the Wald statistics, there are some noteworthy differences. In particular, the likelihood-ratio chi square for CULP is nearly twice as large as the Wald chi square.
        LR Statistics For Type 3 Analysis

   Source              DF   ChiSquare    Pr>Chi

   BLACKD               1      3.6132    0.0573
   WHITVIC              1      1.2385    0.2658
   CULP                 1     75.3762    0.0001
   BLACKD*WHITVIC       1      0.0775    0.7807
The TYPE3 option does not produce likelihood-ratio tests for the individual coefficients of a CLASS variable. Instead, a single chi square for the entire set is reported, testing the null hypothesis that all the coefficients in the set are equal to 0. To get individual likelihood-ratio tests, use the CONTRAST statement described in Section 4.5.
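As a preview, a CONTRAST statement for the CLASS version of CULP might look like the sketch below. It assumes the five levels are ordered 1 through 5, so that each contrast compares one level with the omitted category, CULP=5. As I understand it, GENMOD computes likelihood-ratio statistics for CONTRAST by default; the labels in quotes are arbitrary.

```sas
PROC GENMOD DATA=my.penalty;
   CLASS culp;
   MODEL death = blackd whitvic culp / D=B TYPE3;
   /* One coefficient per level of CULP, in sorted order (1 to 5) */
   CONTRAST 'CULP 1 vs 5' culp 1 0 0 0 -1;
   CONTRAST 'CULP 2 vs 5' culp 0 1 0 0 -1;
RUN;
```

Each CONTRAST statement yields a single-degree-of-freedom test for the corresponding coefficient.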
Generalized Estimating Equations. With Release 6.12, GENMOD provides an optional estimation method—known as GEE—that’s ideal for longitudinal and other clustered data. For conventional logit analysis, a crucial assumption is that observations are independent—the outcome for one observation is completely unrelated to the outcome for any other observation. But suppose you follow people over a period of time, measuring the same dichotomous variable at regular intervals. It’s highly unlikely that a person’s response at one point in time will be independent of earlier or later responses. Failure to take such correlations into account can lead to seriously biased standard errors and test statistics. The GEE method—invoked by using the REPEATED statement—solves these problems in a convenient, flexible way. We’ll discuss this method in some detail in Chapter 8.
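To give a flavor of the syntax, here is a minimal sketch of a GEE specification. Everything about the data is hypothetical: the data set my.panel, the person identifier ID, the dichotomous outcome Y, and the explanatory variables X1 and X2. The exchangeable correlation structure (TYPE=EXCH) is just one of several choices.

```sas
/* Hypothetical long-format data: one record per person per time point */
PROC GENMOD DATA=my.panel;
   CLASS id;                          /* the cluster identifier must be a CLASS variable */
   MODEL y = x1 x2 / DIST=BINOMIAL;
   REPEATED SUBJECT=id / TYPE=EXCH;   /* observations within a person are correlated */
RUN;
```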
What the GENMOD procedure lacks. Despite these desirable features, GENMOD lacks some important capabilities of PROC LOGISTIC, and their absence prevents it from being the universal logit procedure in SAS. Most importantly, LOGISTIC has:
The cumulative logit model. LOGISTIC can estimate logit models where the dependent variable has more than two ordered categories. We’ll discuss such models extensively in Chapter 6. This capability will be added to GENMOD in Version 7.
Influence statistics. These statistics, discussed in Section 3.8, tell you how much the coefficients change with deletion of each case from the model fit. They can be very helpful in discovering problem observations.
Automated variable selection methods. LOGISTIC has a variety of methods for selecting variables into a model from a pool of potential explanatory variables. But because I’m not a fan of such methods, I won’t discuss them in this book.