As we have seen, the form of the dependence of the hazard on time is left unspecified in the proportional hazards model. Furthermore, the partial likelihood method discards that portion of the likelihood function that contains information about the dependence of the hazard on time. Nevertheless, it is still possible to get nonparametric estimates of the survivor function based on a fitted proportional hazards model.
When there are no time-dependent covariates, the Cox model can be written as
S(t) = [S0(t)]^exp(βx)
where S(t) is the survival probability at time t for an individual with covariate values x, and S0(t) is the baseline survivor function, that is, the survivor function for an individual whose covariate values are all 0. After estimating β by partial likelihood, you can get an estimate of S0(t) by a nonparametric maximum likelihood method. With that estimate in hand, you can generate the estimated survivor function for any set of covariate values by substitution in the equation above.
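The arithmetic behind this substitution is simple enough to sketch numerically. The following Python fragment uses hypothetical coefficient and covariate values (for illustration only; these are not the recidivism estimates) to show how a baseline survival probability is raised to the power exp(βx):

```python
import math

# Hypothetical fitted coefficients and covariate values (illustrative only;
# not the estimates from the recidivism analysis).
beta = {"fin": -0.35, "age": -0.07, "prio": 0.10}
x = {"fin": 0, "age": 40, "prio": 3}

# exp(beta'x): the relative hazard for this covariate pattern
risk_score = math.exp(sum(beta[k] * x[k] for k in beta))

def survival(s0_t, risk_score):
    """S(t) = [S0(t)]^exp(beta'x): raise the baseline survivor
    probability at time t to the power of the relative hazard."""
    return s0_t ** risk_score

# Suppose the estimated baseline survival probability at some time t is .90
s_t = survival(0.90, risk_score)
print(round(s_t, 4))
```

Because the relative hazard here is exp(-2.5), well below 1, the predicted survival probability is pulled toward 1 relative to the baseline.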
In PROC PHREG, you accomplish this with the BASELINE statement. The easiest task is to get the survivor function for x = x̄, the vector of sample means. For the recidivism data set, the SAS code for accomplishing this task is as follows:
proc phreg data=recid;
   model week*arrest(0)=fin age prio / ties=efron;
   baseline out=a survival=s logsurv=ls loglogs=lls;
run;
proc print data=a;
run;
(Only statistically significant covariates are included in this run so that the width of an output record does not exceed the printed page.) The OUT=A option requests that SAS put the survival estimates in a temporary data set named A. SURVIVAL=S asks that the survival probabilities be stored in a variable named S. You can interpret these probabilities as estimates of the survivor function, controlling for the effects of the covariates. LOGSURV=LS requests the logarithm of the survival probabilities; the negative of this quantity is the estimated cumulative hazard function. LOGLOGS=LLS requests log[–log S(t)]. These logarithmic transformations of the survivor function were discussed in more detail in Chapters 2, 3, and 4.
Data set A is printed in Output 5.19. There are fifty records, corresponding to the 49 unique weeks in which arrests were observed to occur, plus an initial record for time 0. Each record gives the mean value for each of the three covariates. The S column gives the estimated survival probabilities. The last two columns are the two logarithmic transforms.
Output 5.19

OBS   FIN      AGE     PRIO  WEEK        S        LS       LLS
  1   0.5  24.5972  2.98380     0  1.00000   0.00000         .
  2   0.5  24.5972  2.98380     1  0.99801  -0.00200  -6.21670
  3   0.5  24.5972  2.98380     2  0.99601  -0.00399  -5.52280
  4   0.5  24.5972  2.98380     3  0.99402  -0.00600  -5.11671
  5   0.5  24.5972  2.98380     4  0.99203  -0.00800  -4.82813
  6   0.5  24.5972  2.98380     5  0.99004  -0.01001  -4.60378
  7   0.5  24.5972  2.98380     6  0.98804  -0.01203  -4.41998
  8   0.5  24.5972  2.98380     7  0.98604  -0.01406  -4.26428
  9   0.5  24.5972  2.98380     8  0.97601  -0.02428  -3.71816
 10   0.5  24.5972  2.98380     9  0.97200  -0.02840  -3.56145
 11   0.5  24.5972  2.98380    10  0.96999  -0.03047  -3.49104
 12   0.5  24.5972  2.98380    11  0.96593  -0.03467  -3.36193
 13   0.5  24.5972  2.98380    12  0.96184  -0.03891  -3.24646
 14   0.5  24.5972  2.98380    13  0.95979  -0.04104  -3.19320
 15   0.5  24.5972  2.98380    14  0.95363  -0.04748  -3.04744
 16   0.5  24.5972  2.98380    15  0.94951  -0.05181  -2.96010
 17   0.5  24.5972  2.98380    16  0.94538  -0.05616  -2.87948
 18   0.5  24.5972  2.98380    17  0.93918  -0.06275  -2.76859
 19   0.5  24.5972  2.98380    18  0.93293  -0.06942  -2.66754
 20   0.5  24.5972  2.98380    19  0.92876  -0.07391  -2.60492
 21   0.5  24.5972  2.98380    20  0.91831  -0.08522  -2.46257
 22   0.5  24.5972  2.98380    21  0.91414  -0.08977  -2.41050
 23   0.5  24.5972  2.98380    22  0.91205  -0.09206  -2.38534
 24   0.5  24.5972  2.98380    23  0.90996  -0.09435  -2.36070
 25   0.5  24.5972  2.98380    24  0.90160  -0.10358  -2.26738
 26   0.5  24.5972  2.98380    25  0.89528  -0.11061  -2.20170
 27   0.5  24.5972  2.98380    26  0.88891  -0.11775  -2.13915
 28   0.5  24.5972  2.98380    27  0.88467  -0.12254  -2.09928
 29   0.5  24.5972  2.98380    28  0.88041  -0.12737  -2.06068
 30   0.5  24.5972  2.98380    30  0.87614  -0.13223  -2.02324
 31   0.5  24.5972  2.98380    31  0.87400  -0.13467  -2.00493
 32   0.5  24.5972  2.98380    32  0.86972  -0.13958  -1.96909
 33   0.5  24.5972  2.98380    33  0.86542  -0.14455  -1.93416
 34   0.5  24.5972  2.98380    34  0.86109  -0.14956  -1.90008
 35   0.5  24.5972  2.98380    35  0.85242  -0.15968  -1.83458
 36   0.5  24.5972  2.98380    36  0.84590  -0.16736  -1.78763
 37   0.5  24.5972  2.98380    37  0.83719  -0.17771  -1.72761
 38   0.5  24.5972  2.98380    38  0.83500  -0.18032  -1.71303
 39   0.5  24.5972  2.98380    39  0.83063  -0.18557  -1.68434
 40   0.5  24.5972  2.98380    40  0.82186  -0.19618  -1.62870
 41   0.5  24.5972  2.98380    42  0.81746  -0.20155  -1.60173
 42   0.5  24.5972  2.98380    43  0.80863  -0.21241  -1.54922
 43   0.5  24.5972  2.98380    44  0.80418  -0.21793  -1.52360
 44   0.5  24.5972  2.98380    45  0.79972  -0.22349  -1.49840
 45   0.5  24.5972  2.98380    46  0.79078  -0.23473  -1.44932
 46   0.5  24.5972  2.98380    47  0.78855  -0.23756  -1.43733
 47   0.5  24.5972  2.98380    48  0.78406  -0.24326  -1.41361
 48   0.5  24.5972  2.98380    49  0.77284  -0.25769  -1.35601
 49   0.5  24.5972  2.98380    50  0.76607  -0.26648  -1.32247
 50   0.5  24.5972  2.98380    52  0.75703  -0.27836  -1.27885
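You can verify the two transformed columns directly from S: LS is log S(t) and LLS is log[–log S(t)]. A quick check in Python against the week-1 row (the results agree with the table up to the rounding of S):

```python
import math

s = 0.99801          # estimated survival probability at week 1 (Output 5.19)
ls = math.log(s)     # LS column: log of the survival probability
lls = math.log(-ls)  # LLS column: log(-log S), the log cumulative hazard

print(round(ls, 5), round(lls, 5))
```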
Of what use are these estimated functions? If you have hypotheses about the shape of the hazard function, the estimates can provide some helpful evidence, as we saw in Chapter 3. In particular, a constant hazard implies a cumulative hazard function that increases as a straight line. If a graph of the negative log-survivor (cumulative hazard) function curves upward, that is evidence for an increasing hazard. On the other hand, if it bends below a straight line, the hazard is probably decreasing with time. For this purpose, it doesn't really matter at what covariate values the survivor function is calculated.
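The diagnostic rests on a simple fact: under a constant hazard λ, the survivor function is S(t) = exp(–λt), so the cumulative hazard –log S(t) = λt is exactly linear in t. A minimal sketch (the value of λ is arbitrary and illustrative):

```python
import math

lam = 0.05  # an arbitrary constant hazard rate, for illustration

def cumulative_hazard(t, lam):
    """-log S(t) when S(t) = exp(-lam*t), i.e., a constant hazard."""
    s_t = math.exp(-lam * t)
    return -math.log(s_t)

# Equally spaced times yield equal increments: a straight line of slope lam.
points = [cumulative_hazard(t, lam) for t in (10, 20, 30)]
print(points)
```

Any systematic departure of the plotted curve from this straight-line pattern signals a hazard that changes with time.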
A graph of the negative log-survivor (cumulative hazard) function, shown in Output 5.20, is produced by the following statements:
data b;
   set a;
   ls=-ls;
run;
proc gplot data=b;
   symbol1 value=none interpol=join;
   plot ls*week;
run;
The curve appears to bend slightly upward, suggesting a hazard that increases with time. That’s consistent with what we found in Chapters 3 and 4.
By combining the BASELINE statement with stratification, we can also produce graphs that are helpful in evaluating the proportional hazards assumption. Suppose we take financial aid (FIN) as the stratifying variable for the recidivism data. That might seem self-defeating since FIN is the variable of greatest interest and stratifying on it means that no tests or estimates of its effect are produced. But after stratifying, we can graph the baseline survivor function for the two financial aid groups using the following code:
proc phreg data=recid;
   model week*arrest(0)=age prio / ties=efron;
   strata fin;
   baseline out=a loglogs=lls survival=s;
run;
proc gplot data=a;
   plot lls*week=fin;
   symbol1 interpol=join color=black line=1;
   symbol2 interpol=join color=black line=2;
run;
The resulting graph in Output 5.21 shows the log-log survivor functions for each of the two financial aid groups, evaluated at the means of the covariates. If the hazards are proportional, the log-log survivor functions should be parallel. Here’s why. If two hazard functions, h1(t) and h2(t), are proportional, we can write
h1(t) = γh2(t)
where γ is the constant of proportionality. Substituting this into equation (2.6), it’s easily shown that
S1(t) = [S2(t)]^γ.
Taking the logarithm, multiplying by –1, and taking the logarithm a second time yields
log[–log S1(t)] = log γ + log[–log S2(t)]
which says that the two log-log survival curves differ by a constant amount, log γ. Examining Output 5.21, we see that the two curves have approximately the same shape but are farther apart in some regions than in others.
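The algebra above is easy to confirm numerically. Starting from an arbitrary set of survivor values for group 2 and a proportionality constant γ, the log-log curves differ by exactly log γ at every time point (all numbers here are illustrative):

```python
import math

gamma = 2.0                       # constant of proportionality, for illustration
s2 = [0.95, 0.85, 0.70, 0.50]     # arbitrary survivor values for group 2
s1 = [p ** gamma for p in s2]     # S1(t) = [S2(t)]^gamma

# log[-log S1(t)] - log[-log S2(t)] should equal log(gamma) at every t
offsets = [math.log(-math.log(a)) - math.log(-math.log(b))
           for a, b in zip(s1, s2)]
print([round(o, 6) for o in offsets])
```

Every offset equals log 2 ≈ 0.693147, which is why parallel (constantly separated) log-log curves are the graphical signature of proportional hazards.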
The differences are more dramatic if we compare the smoothed hazard functions produced with the SMOOTH macro that was introduced in Chapter 3 and that is described in detail in Appendix 1, “Macro Programs.” After producing baseline data set A, which contains the survivor function estimates, the macro is invoked by submitting the following statement, which gives the graph in Output 5.22.
%smooth(data=a,time=week,survival=s)
Here we see evidence that the hazard of arrest is almost identical in the two groups during the earlier weeks, but the hazards diverge rapidly after week 15 or thereabouts, reaching a maximum difference around week 25. (Group 2 received aid; Group 1 did not.) This evidence suggests that it takes a while for the financial aid to have its desired effect, but that the effect eventually wears off after the aid is terminated.
This graph suggests a specification for the interaction between financial aid and time that differs from the one we investigated earlier (see Interactions with Time as Time-Dependent Covariates). Specifically, let's construct a dummy variable coded 1 when time is between 20 and 30 weeks and 0 otherwise. Then we include the product of that variable and FIN in the Cox regression model:
proc phreg data=recid;
   model week*arrest(0)=fin finmid age prio / ties=efron;
   mid=(20<week<30);
   finmid=fin*mid;
run;
Results in Output 5.23 show that the interaction is significant at the .03 level. During this middle period, the arrest rate for those who did not receive financial aid is more than five times larger than the rate for those who did receive aid. The p-value is perhaps an underestimate, however, since the unusual specification is dependent on the graphical analysis rather than some a priori hypothesis.
Analysis of Maximum Likelihood Estimates

              Parameter   Standard       Wald        Pr >      Risk
Variable  DF   Estimate      Error   Chi-Square  Chi-Square   Ratio
FIN        1  -0.158078    0.20504      0.59435      0.4407   0.854
FINMID     1  -1.455933    0.66475      4.79696      0.0285   0.233
AGE        1  -0.066964    0.02084     10.32821      0.0013   0.935
PRIO       1   0.096731    0.02727     12.58657      0.0004   1.102
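The "more than five times" figure comes from combining the FIN main effect with the interaction: during the middle period, subjects who received aid have fin + finmid added to the log hazard, so the no-aid versus aid hazard ratio is exp[–(fin + finmid)]. A quick check using the estimates in Output 5.23:

```python
import math

fin = -0.158078     # coefficient for FIN (Output 5.23)
finmid = -1.455933  # coefficient for FIN*MID (Output 5.23)

# Hazard ratio of the no-aid group relative to the aid group
# during weeks 20-30:
ratio = math.exp(-(fin + finmid))
print(round(ratio, 2))
```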
Another major use of the baseline survivor function is to obtain predictions about survival time for particular sets of covariate values. These covariate values need not be ones that appear in the data set being analyzed. For the recidivism data, for example, we may want to say something about arrest times for 40-year-olds with three prior convictions who did not receive financial aid. The mechanics of doing this are a bit awkward. You must create a new data set containing the values of the covariates for which you want predictions and then pass the name of that data set to PROC PHREG:
data covals;
   input fin age prio;
   cards;
0 40 3
;
run;
proc phreg data=recid;
   model week*arrest(0)=fin age prio / ties=efron;
   baseline out=a covariates=covals survival=s lower=lcl upper=ucl / nomean;
run;
proc print data=a;
run;
The advantage of doing it this way is that predictions can easily be generated for many different sets of covariate values just by including more input lines in the data set COVALS. Each input line produces a complete set of survivor estimates, but all estimates are output to a single data set. The NOMEAN option suppresses the output of survivor estimates evaluated at the mean values of the covariates, which are otherwise included by default. The LOWER= and UPPER= options (available in Release 6.10 and later) give 95-percent confidence intervals around the survival probability.
Output 5.24 displays a portion of the data set generated by the BASELINE statement above. In generating predictions, it’s typical to focus on a single summary measure rather than the entire distribution. The median survival time is easily obtained by finding the smallest value of t such that S(t) ≤ .50. That won’t work for the recidivism data, however, because the data are censored long before a .50 probability is reached. For these data, it’s probably more useful to pick a fixed point in time and calculate survival probabilities at that time under varying conditions. For the covariate values in Output 5.24, the six-month (26 week) survival probability is .95, with a 95 percent confidence interval of .92 to .99.
Output 5.24

OBS  FIN  AGE  PRIO  WEEK        S      LCL      UCL
 19    0   40     3    18  0.97101  0.94823  0.99434
 20    0   40     3    19  0.96916  0.94510  0.99384
 21    0   40     3    20  0.96453  0.93728  0.99258
 22    0   40     3    21  0.96267  0.93415  0.99207
 23    0   40     3    22  0.96174  0.93257  0.99182
 24    0   40     3    23  0.96080  0.93100  0.99156
 25    0   40     3    24  0.95705  0.92471  0.99053
 26    0   40     3    25  0.95421  0.91996  0.98973
 27    0   40     3    26  0.95132  0.91514  0.98894
 28    0   40     3    27  0.94939  0.91192  0.98841
 29    0   40     3    28  0.94746  0.90869  0.98788
 30    0   40     3    30  0.94551  0.90544  0.98734
 31    0   40     3    31  0.94453  0.90382  0.98707
 32    0   40     3    32  0.94256  0.90056  0.98653
 33    0   40     3    33  0.94058  0.89728  0.98598
 34    0   40     3    34  0.93859  0.89399  0.98542
 35    0   40     3    35  0.93457  0.88737  0.98428
 36    0   40     3    36  0.93154  0.88239  0.98342
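Given the predicted survivor function as a sequence of (time, probability) pairs, both summary measures reduce to simple lookups: the median is the smallest t with S(t) ≤ .50, and a fixed-time probability is the value of the step function at the last event time at or before that point. A sketch in Python, using a few values in the spirit of Output 5.24 (not the full data set):

```python
def median_survival(curve):
    """Smallest time t with S(t) <= 0.50, or None if the curve never
    reaches .50 (as with the censored recidivism data)."""
    for t, s in curve:
        if s <= 0.50:
            return t
    return None

def survival_at(curve, t0):
    """S(t0) for a step-function survivor curve: the estimate at the
    last event time at or before t0."""
    prob = 1.0
    for t, s in curve:
        if t > t0:
            break
        prob = s
    return prob

# A few (week, survival) pairs drawn from Output 5.24:
curve = [(0, 1.0), (18, 0.97101), (26, 0.95132), (36, 0.93154)]
print(median_survival(curve))   # None: S never drops to .50
print(survival_at(curve, 26))   # the six-month survival probability
```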
Release 6.10 and later releases let you choose between two alternative methods (labeled PL for product limit and CH for cumulative hazard) for calculating the survivor function and its transformations, but there are no strong reasons for preferring one over the other. PL is the default. The two methods produce identical results (apart from rounding error) when there is only one censoring time for all cases, as with the recidivism data. Note, finally, that the BASELINE statement will not produce any output when there are time-dependent covariates.