Chapter 12: Binary Logistic Regression

Introduction

Preparing the Birth Weight Data Set for Logistic Regression

Selecting Reference Levels for Your Model

Conclusions

Problems

Introduction

In the last chapter, you learned how to create multiple regression models. Conceptually, logistic regression has some similarities to multiple regression, although the computational method (maximum likelihood) is quite different (and CPU-intensive). Multiple regression uses a set of predictor variables to predict (and model) a continuous outcome variable. Binary logistic regression uses a set of predictor variables to predict a dichotomous outcome. Theoretically, a multiple regression equation can predict values from negative infinity to positive infinity. Binary logistic regression, in contrast, attempts to compute the probability that an event occurs or does not occur. Because probabilities are bounded between 0 and 1, multiple regression should not be used. Instead, a transformation (called a logit) is performed so that the results of a binary logistic model are bounded by 0 and 1. The transformation, for the mathematically interested reader, is to take the natural log of the odds (the probability that the event occurs divided by the probability that the event does not occur). Luckily, the results of a binary logistic model provide you with odds ratios (ORs) for classification variables that have a straightforward interpretation.
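As a small numeric sketch of the logit transformation (the probability value here is chosen purely for illustration): a probability of .75 corresponds to odds of .75/.25 = 3, and the logit is the natural log of 3, about 1.10. You could verify this in a short DATA step:

```sas
*Sketch of the logit transformation (illustrative value only);
data logit_demo;
   p = 0.75;              *probability that the event occurs;
   odds = p / (1 - p);    *odds = 3;
   logit = log(odds);     *natural log of the odds, about 1.10;
   put p= odds= logit=;
run;
```

Probabilities near 0 produce large negative logits and probabilities near 1 produce large positive logits, which is why the transformed scale is suitable for a regression-style model.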

Preparing the Birth Weight Data Set for Logistic Regression

It is instructive to reanalyze the birth weight data from the last chapter using logistic regression. However, there is one problem: the outcome variable (Weight) in the multiple regression examples was the actual birth weight (in grams). You need to create a dichotomous variable to represent high or low birth weights. (OK, I've just got to tell you this: Speaking of birth weights, our new grandson was born TODAY and he was 7 pounds 13 ounces!)

Instead of using the actual birth weight in grams as the outcome variable, let's create a new variable (call it Wt_Group) that represents weights below and above the median birth weight (3,402 grams). To do this, you need to write a short SAS program. Figure 1 below is such a program. It creates a new, permanent data set called High_Low that contains all the observations from the 25% sample of the SASHELP birth weight data plus the Wt_Group variable.

Figure 1: Program to Create Weight Groups

*Program to create a categorical weight variable;
data BOOKDATA.High_Low;
   set Birth_Wt_Sample;
   where Weight is not missing;

   *Wt_Group = 1 is lower weight group
    The median weight is 3402 grams;

   if Weight lt 3402 then Wt_Group = 1;
   else Wt_Group = 0;
run;

title "Listing of First 10 Observations from High_Low";
proc print data=BOOKDATA.High_Low(obs=10);
run;

The program starts with a COMMENT statement. COMMENT statements start with an asterisk and end with a semicolon. These statements are ignored when the program executes—they are included to document the program. Next, the DATA statement names the new data set High_Low and uses the libref BOOKDATA to instruct the program to make the data set permanent and place it in the BOOKDATA library.

When you use a two-level data set name (two names separated by a period), the first part of the name (before the period) is the library name. The second part of the name (after the period) is the data set name.

If you left off the BOOKDATA (and the period), the High_Low data set would be placed in the WORK library. That is fine, except that you would need to re-create it every time you opened a new SAS Studio session.
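If the BOOKDATA libref has not already been defined in your session, you would assign it with a LIBNAME statement. The path below is only a placeholder; substitute the folder where you want your permanent data sets stored:

```sas
*Assign the BOOKDATA libref (the path shown here is hypothetical);
libname BOOKDATA '/home/your-userid/bookdata';
```

Once the libref is assigned, any two-level data set name beginning with BOOKDATA refers to that folder.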

SAS uses a SET statement to read observations from another SAS data set. This SET statement says to read observations from the Birth_Wt_Sample data set, the same data set that was created with the Random selection task in Chapter 9.

Because you want to eliminate any observation in the input data set (Birth_Wt_Sample) where the variable Weight is a missing value, you add a WHERE statement. (Note: There are no observations with missing birth weights in the SASHELP data set, but it is always a good idea to test for missing values in a SAS program.) The WHERE statement in this program uses the keywords IS NOT MISSING to test for missing values of Weight. Because WHERE filters observations as they are read from the input data set, any observation with a missing value for Weight never enters the DATA step; only observations where Weight is not missing are processed.

There is another COMMENT statement telling anyone reading this program that Wt_Group = 1 includes the low birth weight babies (a possible risk factor). The IF statement tests whether the variable Weight is less than (abbreviated LT) 3,402. If the logical expression in the IF statement is true, the statement following THEN is executed. So, in this case, if Weight is less than 3,402, Wt_Group is set to 1; otherwise, it is set to 0.

The reason that the program did not include observations where the variable Weight was a missing value is that missing values in SAS are logically lower than any real value. If you had missing values and did not eliminate or test for them, the expression Weight LT 3402 would be true for those observations, and they would be incorrectly assigned to the low birth weight group.
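To see why this matters, here is a minimal sketch (with made-up data) showing that a missing Weight satisfies the comparison Weight LT 3402 when you do not screen for missing values:

```sas
*Sketch: missing values compare as lower than any real value;
data demo;
   input Weight;
   if Weight lt 3402 then Wt_Group = 1;
   else Wt_Group = 0;
datalines;
2500
.
4000
;
run;
*The second observation (missing Weight) gets Wt_Group = 1,
 which is why the WHERE statement in Figure 1 is needed;
```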

The DATA step ends with a RUN statement. You are actually done now, but it is always a good idea to take a look at the first few observations in a data set to be sure the program ran as you expected. You could have used the List data task in SAS Studio, but it was just as easy to include a procedure to print the first few observations in the new data set. You use the DATA= option to tell PROC PRINT that you want to print observations from the BOOKDATA.High_Low data set. The OBS=10 in parentheses following the data set name is a data set option, and it says to stop printing when you reach observation 10.

This program was run and the output is shown in Figure 2:

Figure 2: Listing of the First 10 Observations in Data Set High_Low

image

Notice the last column of this listing. This column contains the new Wt_Group variable. You are now ready to run a binary logistic regression model. Start by selecting Binary logistic regression from the task list. Select the BOOKDATA.High_Low data set, and select Wt_Group as your response variable. In addition, you can select which event (in this case, 0 – above the median or 1 – below the median) you want to predict. Although most of the birth weights in this data set are not extremely low, choosing event = 1 (low birth weight) seems to make the most sense (see Figure 3 below):

Figure 3: Selecting a Response Variable and Event of Interest

image

The next step is to select classification (categorical) variables and choose a Coding method. For all the examples in this book, Reference coding is used. Reference coding, the most popular coding method (at least in the health sciences), computes statistics such as odds ratios based on a reference level. For this first demonstration of binary logistic regression, you are going to allow the procedure to select a reference level. In a later section, you will see how to choose the reference level yourself.

The two variables, Black (1=black, 0=non-black) and MomSmoke (1=yes, 0=no), are good candidates for classification variables (Figure 4):

Figure 4: Selecting Classification Variables and Reference Coding

image

Next, add MomWtGain and MomAge as continuous variables (not shown).

Click the Model tab to specify your model. This process is similar to the way you specified a model for multiple regression. First, click Edit. Next, highlight the four variables on the top left-side of the screen. Finally, click Add and then OK (Figure 5):

Figure 5: Specifying your Model

image

At this point, use the default values for the remaining options and selection criteria, and run the task. There is a lot of output, so we'll discuss it piece by piece.

To start, you will see model information; numbers of observations read and used (pay careful attention to these, especially if there are a lot fewer observations used than read); a table showing values of your response variable (Wt_Group); and the fact that you are modeling Wt_Group = 1 (low birth weight).

Figure 6: Model Information and Response Profile

image

Next, there is a table showing your classification variables and their values. Make sure that these values are what you expect. A single data error can result in extra levels of these variables and invalidate the entire process.

Figure 7: Class Level Information

image

Everything is as you expected, with values of 0 and 1 for both variables.

The next section of output shows model fit statistics. These are a bit complicated and are more useful when comparing models. Smaller values of AIC (Akaike's information criterion) indicate better models. The criterion labeled SC (Schwarz criterion) starts from the same -2 log likelihood value as the AIC but imposes a heavier penalty for each additional parameter in the model. As with AIC, smaller values of SC indicate better models. The Schwarz criterion is probably better to use than the AIC if you want a parsimonious model (fewer predictor variables).

Figure 8: Fit Statistics

image

Next comes the global test of the null hypothesis. If this is not significant, it may be back to the drawing board. Here, three tests of the null hypothesis all reject it with very low p-values.

Figure 9: Test of Global Null Hypothesis

image

The p-values for each of the predictor variables (both classification variables and continuous variables) are presented (Figure 10):

Figure 10: p-Values for Effects

image

All variables are significant. Now comes the important information on the predictor variables: the odds ratios. The odds ratios for the two classification variables are easier to understand, and we will look at them first. For Black (0 versus 1), you see a point estimate of .491. Because you didn't select a reference level for the two classification variables, the task selected the higher value (1) as the reference level (Figure 11):

Figure 11: Odds Ratios

image

Because the odds ratio for Black is .491, you conclude that a non-black mother (0) is less likely to have a baby whose weight is below the median value. If you take the reciprocal of this value (about 2.04), it is easier to interpret. You could say, based on this model, that the odds of a black mother having a baby whose weight is below the median are about 2.04 times higher than for a non-black mother. The 95% confidence limits indicate that you are 95% confident that the true odds ratio lies between those two limits. Because both classification variables were significant, these limits do not include 1 (a value of 1 would mean that the odds are equal in the two groups). The same interpretation holds for the mother smoking or not: the value .436 means that non-smoking moms are less likely to have babies with birth weights below the median.

The odds ratios for the continuous variables show the odds for each year (for MomAge) or each pound (for MomWtGain). There are options in PROC LOGISTIC for changing these odds ratios so that they represent the odds ratios for multiples of each continuous variable. (See Cody 2011 for more details.) For example, you might want to know the odds ratio that corresponds to 10-pound changes instead of 1-pound changes.
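One such option is the UNITS statement in PROC LOGISTIC. The sketch below shows where it would go in a program like the one generated by the task; the model matches the one used in this chapter, and the result would be an additional table reporting the odds ratio for MomWtGain per 10-pound change:

```sas
*Sketch: report the MomWtGain odds ratio in 10-pound increments;
proc logistic data=BOOKDATA.High_Low;
   class Black(ref='0') MomSmoke(ref='0') / param=ref;
   model Wt_Group(event='1') = MomAge MomWtGain Black MomSmoke;
   units MomWtGain = 10;   *odds ratio per 10-pound change;
run;
```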

Selecting Reference Levels for Your Model

It is quite easy to modify the program produced by the Binary logistic regression task to specify reference levels for your classification variables. First, click the Edit icon in the Code window. All you need to do is add a reference level for each variable listed in the CLASS statement. In the program shown in Figure 12, a reference level of 0 was selected for the two variables Black and MomSmoke. Notice that this level is placed in quotation marks. The reason for the quotation marks is a bit complicated. (See Cody 2011, SAS Statistics by Example for a complete explanation or just use the quotation marks.)

Figure 12: Editing the Program to Specify Reference Levels

proc logistic data=BOOKDATA.HIGH_LOW
      plots(maxpoints=none)=(oddsratio(cldisplay=serifarrow) roc);
   class Black(ref='0') MomSmoke(ref='0') / param=ref;
   model Wt_Group(event='1')=MomAge MomWtGain Black MomSmoke /
      link=logit
      technique=fisher;
run;

Running the program again with the reference levels specified results in odds ratios showing the risk for a black mother or a mother who smokes compared to a non-black mother or a non-smoking mother. Notice that the new odds ratios are the reciprocals of the odds ratios where the reference level was 1 instead of 0. See Figure 13 below:

Figure 13: New Odds Ratios

image

Conclusions

Programming a logistic regression model by writing the actual SAS code is a bit daunting. However, by using the Binary logistic regression task, it is quite easy to do. This is one instance where even a veteran programmer may decide that a point-and-click approach to running this task is a good choice.

Problems

12-1: Starting with the SASHELP data set Heart, run a binary logistic regression with Status as the Response variable and Dead as the event of interest. Select the two variables Chol_Status (cholesterol status) and BP_Status (blood pressure status) as classification variables. Create a filter using the expression:

BP_Status ne 'Optimal'

Set parameterization of effects to Reference.

12-2: Run problem 12-1 again except edit the code so that the reference level for Chol_Status is 'Desirable' and the reference level for BP_Status is 'Normal'. Compare the odds ratio to the previous result.

12-3: Starting with the CSV file Risk.csv, create a temporary SAS data set (call it Risk). Use the binary logistic regression task to predict a heart attack (variable Heart_Attack, 1=yes, 0=no) with the two risk factors Age_Group (Less than 60, 61 to 70, and 71 and older) and Chol_High (1=yes, 0=no). Use reference coding.

12-4: Repeat problem 12-3, but set the reference level for Age_Group = '1:Less 60' and for Chol_High = '0'.
