The Logit Model for Discrete Time

We begin with the logit version of the model because it is more widely used and because logit regression is already familiar to many readers. In Chapter 5 (see The DISCRETE Method), we considered Cox’s model for discrete-time data. In brief, we let $P_{it}$ be the conditional probability that individual i has an event at time t, given that an event has not already occurred to that individual. The model says that $P_{it}$ is related to the covariates by a logistic regression equation:

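Equation 7.1

$$\log\left(\frac{P_{it}}{1-P_{it}}\right) = \alpha_t + \beta_1 x_{it1} + \cdots + \beta_k x_{itk}$$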
where t = 1, 2, 3,.... This model is most appropriate when events can only occur at regular, discrete points in time, but it has also been frequently employed when ties arise from grouping continuous-time data into intervals.
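It may help to see the same logistic model solved for the probability itself. The sketch below writes the linear predictor as $\eta_{it}$ (a shorthand introduced here, not notation from the text), with one intercept $\alpha_t$ per time point and coefficients $\beta_1, \ldots, \beta_k$ on the covariates:

```latex
\begin{aligned}
\eta_{it} &= \alpha_t + \beta_1 x_{it1} + \cdots + \beta_k x_{itk}, \\
\log\!\left(\frac{P_{it}}{1-P_{it}}\right) &= \eta_{it}
\quad\Longleftrightarrow\quad
P_{it} = \frac{1}{1+\exp(-\eta_{it})}.
\end{aligned}
```

Because the right-hand expression always lies between 0 and 1, any values of the $\alpha_t$ and $\beta$ coefficients yield a valid conditional probability.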

In Chapter 5, we saw how to estimate this model by the method of partial likelihood, thereby discarding any information about the $\alpha_t$s. Now we are going to estimate the same model by maximum likelihood, so that we get explicit estimates of the $\alpha_t$s. The procedure is best explained by way of an example. As in Chapter 5, we’ll estimate the model for 100 simulated job durations, measured from the year of entry into the job until the year that the employee quit. Durations after the fifth year are censored. We know only the year in which the employee quit, so the survival times have values of 1, 2, 3, 4, or 5. These values are contained in a variable called DUR, while the variable EVENT is coded 1 if the employee quit; otherwise, it is coded 0. Covariates are ED (years of education), PRESTIGE (a measure of the prestige of the occupation), and SALARY in the first year of the job. None of these covariates are time dependent.

The first task is to take the original data set (JOBDUR) with one record per person and create a new data set (JOBYRS) with one record for each year that each person was observed. Thus, someone who quit in the third year gets three observations, while someone who still had not quit after five years on the job gets five observations. The following DATA step accomplishes this task:

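data jobyrs;
   set jobdur;
   /* one output record per year the person was observed */
   do year=1 to dur;
      /* QUIT=1 only in the year the quit occurred; otherwise 2 */
      if year=dur and event=1 then quit=1;
      else quit=2;
      output;
   end;
run;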

The DO loop creates 272 person-years that are written to the output data set. The IF statement defines the dependent variable QUIT, which equals 1 if the employee quit in that particular person-year; otherwise, QUIT equals 2. (I use 2 rather than 0 for nonevents because the default in PROC LOGISTIC and PROC PROBIT is to predict the probability of the smaller value of a binary variable. You can change the default in PROC LOGISTIC with the DESCENDING option.) Thus, if a person quit in the fifth year, QUIT is coded 2 for the first four records and 1 in the last record. For people who don’t quit during any of the five years, QUIT is coded 2 for all five records.
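For readers who prefer the conventional 0/1 coding, the alternative mentioned in the parenthetical note can be sketched as follows. This is a minimal illustration, not the approach used in this chapter: the data set name JOBYRS01 and the variable QUIT01 are hypothetical names introduced here, and the sketch assumes a SAS release in which PROC LOGISTIC supports the CLASS statement.

```sas
/* Hypothetical variant: code the event as 1 and the nonevent as 0 */
data jobyrs01;
   set jobdur;
   do year = 1 to dur;
      /* the logical expression evaluates to 1 if true, 0 if false */
      quit01 = (year = dur and event = 1);
      output;
   end;
run;

/* DESCENDING makes PROC LOGISTIC model the probability of QUIT01=1 */
proc logistic data=jobyrs01 descending;
   class year / param=ref;   /* reference coding; last level is the default reference */
   model quit01 = ed prestige salary year;
run;
```

With this coding the coefficients should carry the same interpretation as in the 1/2 coding, because in both cases the modeled outcome is quitting in the given person-year.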

Output 7.1 shows the first 20 records produced by this DATA step. Observation 1 is for a person who quit in the first year of the job. Observations 2 through 5 correspond to a person who quit in the fourth year. QUIT is coded 2 for the first three years and 1 for the fourth. Observations 6 through 10 correspond to a person who still held the job at the end of the fifth year.

Output 7.1. First 20 Cases of Person-Year Data Set for Job Durations
OBS    DUR    EVENT    QUIT    YEAR    ED    PRESTIGE    SALARY

  1     1       1        1       1      7        3          19
  2     4       1        2       1     14       62          17
  3     4       1        2       2     14       62          17
  4     4       1        2       3     14       62          17
  5     4       1        1       4     14       62          17
  6     5       0        2       1     16       70          18
  7     5       0        2       2     16       70          18
  8     5       0        2       3     16       70          18
  9     5       0        2       4     16       70          18
 10     5       0        2       5     16       70          18
 11     2       1        2       1     12       43         135
 12     2       1        1       2     12       43         135
 13     3       1        2       1      9       18          12
 14     3       1        2       2      9       18          12
 15     3       1        1       3      9       18          12
 16     1       1        1       1     11       31          12
 17     1       1        1       1     13       26           6
 18     1       1        1       1     10        1           4
 19     2       1        2       1     12       28          17
 20     2       1        1       2     12       28          17
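Before fitting any model, it can be worth confirming that the person-year file has the expected shape. The following is a simple check, offered as a suggestion rather than as part of the original analysis; it uses only the JOBYRS data set and the variables created above.

```sas
/* Count events and nonevents across all person-years */
proc freq data=jobyrs;
   tables quit;
   /* events by year: shows how many people quit in each year at risk */
   tables year*quit / norow nocol nopercent;
run;
```

The one-way table should total 272 person-years, and the split between QUIT=1 and QUIT=2 should agree with the response counts that the model-fitting procedure reports.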

Now we’re ready to estimate a logistic regression model for these data. We can use any of the three procedures discussed in this chapter, but let’s start with PROC PROBIT. (Despite the name, PROC PROBIT can optionally estimate logistic regression models.) The following PROC PROBIT statements accomplish this task:

proc probit data=jobyrs;
   class year quit;
   model quit=ed prestige salary year / d=logistic;
run;

By specifying YEAR as a CLASS variable, we tell PROC PROBIT to create a set of four indicator (dummy) variables, with the reference category being YEAR=5 (the highest value). Output 7.2 shows the results. Comparing these estimates with the partial likelihood estimates in Output 5.10, we see that the coefficients of ED, PRESTIGE, and SALARY are similar, as are the chi-square statistics. Again, this is not surprising since they are simply alternative ways of estimating the same model. Despite the similarity in results, however, PROC PHREG (using the TIES=DISCRETE option) took six times as long to estimate the model as PROC PROBIT did.

Output 7.2. ML Estimates of Discrete-Time Logistic Model for Job Duration Data
Weighted Frequency Counts for the Ordered Response Categories

                                 Level     Count
                                     1        68
                                     2       204

 Log Likelihood for LOGISTIC -99.68329834

       Variable  DF   Estimate  Std Err ChiSquare Pr>Chi Label/Value

       INTERCPT   1 3.44431429 1.182107  8.489692 0.0036 Intercept
       ED         1 0.22485581  0.08598  6.839274 0.0089
       PRESTIGE   1 -0.1235217 0.018099  46.57946 0.0001
       SALARY     1 -0.0268422 0.010386  6.679172 0.0098

       YEAR       4                      23.25297 0.0001
                  1 -2.6874898 0.832656  10.41748 0.0012           1
                  1 -1.4475238 0.767124  3.560574 0.0592           2
                  1 -0.0129973 0.727165  0.000319 0.9857           3
                  1 0.23547505 0.777895  0.091632 0.7621           4
                  0          0        0         .  .               5

Unlike partial likelihood, the maximum likelihood method also gives us estimates for the effect of time on the odds of quitting, as reflected in the $\alpha_t$s in equation (7.1). INTERCPT, in Output 7.2, is an estimate of $\alpha_5$, the log-odds of quitting in year 5 for a person with values of 0 on all covariates. For level j of the YEAR variable, the coefficient is an estimate of $\alpha_j - \alpha_5$, that is, the difference in the log-odds of quitting in year j and the log-odds of quitting in year 5 (controlling for the covariates). We see that the log-odds is lowest in the first year of the job, rises steadily to year 3, and then stays roughly constant for the next two years. Overall, the effect of YEAR is highly significant with a Wald chi-square statistic of 23.25 with 4 d.f.
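To make the pattern over time explicit, the year-specific intercepts can be recovered by adding each YEAR coefficient in Output 7.2 to INTERCPT. The arithmetic below uses the reported estimates rounded to three decimals; the hat notation is introduced here for the estimates.

```latex
\begin{aligned}
\hat\alpha_5 &= 3.444 \\
\hat\alpha_1 &= 3.444 - 2.687 = 0.757 \\
\hat\alpha_2 &= 3.444 - 1.448 = 1.997 \\
\hat\alpha_3 &= 3.444 - 0.013 = 3.431 \\
\hat\alpha_4 &= 3.444 + 0.235 = 3.680
\end{aligned}
```

Exponentiating a YEAR coefficient gives an odds ratio relative to year 5. For year 1, $\exp(-2.687) \approx 0.068$, so the estimated odds of quitting in the first year are roughly 7 percent of the odds in the fifth year, holding the covariates constant.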

When the model in equation (7.1) is estimated by partial likelihood, there can be no restrictions on the $\alpha_t$s. With the ML method of this chapter, however, we can readily estimate restricted versions of the model. In fact, because time (in this case YEAR) is just another variable in the regression model, we can specify the dependence of the hazard on time as any function that SAS allows in the DATA step. For example, if we remove YEAR from the CLASS statement but keep it in the MODEL statement, we constrain the effect of YEAR to be linear on the log-odds of quitting. Alternatively, we can take the logarithm of YEAR before putting it in the model, or we can fit a quadratic model with YEAR and YEAR squared. The log-likelihoods for these models are
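The three restricted specifications described above could be set up along the following lines. This is a sketch only: the data set name JOBYRS2 and the variable names LOGYEAR and YEARSQ are hypothetical names chosen for illustration, and the PROC PROBIT calls simply mirror the unrestricted run with YEAR dropped from the CLASS statement.

```sas
/* Derive the transformed time variables in a DATA step */
data jobyrs2;
   set jobyrs;
   logyear = log(year);   /* natural logarithm of year */
   yearsq  = year**2;     /* year squared */
run;

/* Linear effect of YEAR on the log-odds */
proc probit data=jobyrs2;
   class quit;
   model quit = ed prestige salary year / d=logistic;
run;

/* Logarithmic effect */
proc probit data=jobyrs2;
   class quit;
   model quit = ed prestige salary logyear / d=logistic;
run;

/* Quadratic effect */
proc probit data=jobyrs2;
   class quit;
   model quit = ed prestige salary year yearsq / d=logistic;
run;
```

Each run reports its own log-likelihood, which is the quantity needed to compare a restricted model against the unrestricted one.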

Model           Log-likelihood

Unrestricted        -99.68
Linear             -103.14
Logarithmic        -100.99
Quadratic          -100.01

Taking twice the positive difference between the linear and unrestricted log-likelihoods, we get a chi-square statistic of 6.92 with 3 d.f., for a p-value of .07. (The three degrees of freedom correspond to the three additional parameters estimated in the unrestricted model.) While this is marginally acceptable, the logarithmic and quadratic models fit much better, with p-values of .45 (3 d.f.) and .72 (2 d.f.), respectively. And since the logarithmic model has one fewer coefficient than the quadratic model, it has the edge in parsimony. The coefficients for ED, PRESTIGE, and SALARY in the logarithmic model (not shown) hardly change at all from the unrestricted model. The coefficient for the logarithm of YEAR is 2.00, indicating that a 1-percent increase in time in the job produces a 2-percent increase in the odds of quitting.
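The likelihood ratio comparisons can be reproduced with a short DATA step. The sketch below types in the log-likelihoods reported above and uses the PROBCHI function for the chi-square distribution; the data set name LRTEST and its variable names are hypothetical, and the degrees of freedom are the number of parameters each restricted model saves relative to the unrestricted model (3 for linear and logarithmic, 2 for quadratic).

```sas
data lrtest;
   length model $ 12;
   input model $ loglik df;
   /* twice the difference from the unrestricted log-likelihood */
   chisq = 2 * (-99.68 - loglik);
   /* upper-tail probability of the chi-square distribution */
   p = 1 - probchi(chisq, df);
   datalines;
Linear      -103.14 3
Logarithmic -100.99 3
Quadratic   -100.01 2
;
run;

proc print data=lrtest;
run;
```

The printed p-values should round to the figures quoted in the text. On the interpretation of the logarithmic coefficient: with a coefficient of 2.00 the odds are proportional to YEAR squared, so a 1-percent increase in YEAR multiplies the odds by about 1.01 squared, or roughly 1.02.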
