CHAPTER 5

Model Development

5.1 INTRODUCTION

In any applied setting, performing a proportional hazards regression analysis of survival data requires a number of critical decisions. It is likely that we will have data on more covariates than we can reasonably expect to include in the model, so we must decide on a method to select a subset. We must consider such issues as clinical importance and adjustment for confounding, as well as statistical significance. Once we have selected the subset, we must determine whether the model is “linear” in the continuous covariates and, if not, what transformations are suggested by the data and clinical considerations. Which interactions, if any, should be included in the model is another important decision. In this chapter, we discuss these and other practical model development issues.

The end use of the estimated regression model will most often be a summary presentation and interpretation of the factors that have influenced survival. This summary may take the form of a table of estimated hazard ratios and confidence intervals and/or estimated covariate–adjusted survival functions. Before this step can be taken, we must critically examine the estimated model for adherence to key assumptions (e.g., proportional hazards) and determine whether any subjects have an undue influence on the fitted model. In addition, we may calculate summary measures of goodness-of-fit to support our efforts at model assessment. Methods for model assessment are discussed and illustrated in Chapter 6.

The methods available to select a subset of covariates to include in a proportional hazards regression model are essentially the same as those used in any other regression model. In this chapter, we present three methods for selecting a subset of covariates. Purposeful selection is a method completely controlled by the data analyst, while stepwise and best subsets selection of covariates are statistical methods. We also discuss an iterative method called “multivariable fractional polynomials” that makes statistical decisions regarding which covariates to include in the model as well as possible transformations of continuous covariates. These approaches to covariate selection have been chosen because use of one or more will yield, in the vast majority of model building applications, a subset of statistically and clinically significant covariates.

A word of caution: statistical software for fitting regression models to survival data is, for the most part, easy to use and provides a vast array of sophisticated statistical tools and techniques. One must be careful, therefore, not to lose sight of the problem and end up with the software prescribing the model to the analyst rather than the other way around.

Regardless of which method is used for covariate selection, any survival analysis should begin with a thorough univariable analysis of the association between survival time and each of the covariates under consideration. These methods are discussed in detail in Chapter 2. For categorical covariates, this should include Kaplan–Meier estimates of the group–specific survival functions, point and interval estimates of the median (and/or other quantiles of) survival time and use of one or more of the significance tests to compare survival experience across the groups defined by the covariate. For descriptive purposes, continuous covariates could be broken into quartiles, or other clinically meaningful groups, and the methods for categorical covariates could then be applied. Alternatively, point and interval estimates of the hazard ratio for a clinically relevant change in the covariate could be used in conjunction with the significance level of the partial likelihood ratio test. These results should be displayed using the tabular conventions of the scientific field.
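The univariable toolkit just described can be assembled in a few lines of code. The sketch below uses the Python lifelines package rather than the packages discussed in this text, and assumes the WHAS500 data are loaded into a pandas DataFrame named `whas` with the book's variable names (`lenfol`, `fstat`, `gender`, `bmi`); the file name and the five-unit change in bmi are illustrative choices, not taken from the text.

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import multivariate_logrank_test

whas = pd.read_csv("whas500.csv")        # hypothetical file location
whas["years"] = whas["lenfol"] / 365.25  # follow-up time in years

# Categorical covariate: group-specific Kaplan-Meier estimates, median
# survival time, and the log-rank test comparing survival across groups.
for level, grp in whas.groupby("gender"):
    kmf = KaplanMeierFitter().fit(grp["years"], grp["fstat"], label=f"gender={level}")
    print(level, kmf.median_survival_time_)
print(multivariate_logrank_test(whas["years"], whas["gender"], whas["fstat"]).p_value)

# Continuous covariate: hazard ratio for a clinically relevant change
# (here, 5 units of bmi) and the partial likelihood ratio test p-value.
cph = CoxPHFitter().fit(whas[["years", "fstat", "bmi"]], "years", event_col="fstat")
print(np.exp(5 * cph.params_["bmi"]), cph.log_likelihood_ratio_test().p_value)
```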

5.2 PURPOSEFUL SELECTION OF COVARIATES

Modern statistical software is so powerful and easy to use that it is sometimes difficult to avoid shortcuts in analyses. The expression “cutting to the chase” is not appropriate to describe multivariable model building. We feel that one should approach multivariable model building with patience and a keen eye for the details that differentiate a good model from one that is merely adequate for the job. A good model is one that has been chosen by using a careful, well thought out, covariate selection process that gives thoughtful consideration to issues of adjustment and interaction (confounding and effect modification) and thoroughly evaluates the model for assumptions, influential observations, and tests for goodness-of-fit. We feel that the method described below comes as close to this ideal as any method.

Our purposeful selection method consists of the following seven steps:

Step 1: We begin by fitting a multivariable model containing all variables significant in the univariable analysis at the 20-25 percent level, as well as any other variables not selected with this criterion but judged to be of clinical importance. (If there are adequate data to fit a model containing all study covariates, this full model could be the beginning multivariable model. We will provide further details regarding our perspective on what adequate data means toward the end of this section.) The rationale for choosing a relatively modest level of significance is based on recommendations for linear regression by Bendel and Afifi (1977), for discriminant analysis by Costanza and Afifi (1979), and for change-in-coefficient modeling in epidemiology by Mickey and Greenland (1989). Use of this level of significance should lead to the inclusion, in the preliminary multivariable model, of any statistically significant variable or one with the potential to be an important confounder.

Step 2: Following the fit of the initial multivariable model, we use the p-values from the Wald tests of the individual coefficients to identify covariates that might be deleted from the model. Some caution should be taken at this point not to reduce the size of the model by deleting too many seemingly non-significant variables at one time. The p-value of the partial likelihood ratio test should confirm that the deleted covariate is not significant. This is especially important when a nominal scale covariate with more than one design variable has been deleted, because we typically make a rough guess about overall significance based on the significance levels of the individual coefficients of the design variables.

Step 3: Following the fit of the reduced model, we assess whether removal of the covariate has produced an “important” change in the coefficients of the variables remaining in the model. In general, we use a value of about 20 percent as an indicator of an important change in a coefficient. If the variable excluded is an important confounder, it should be added back into the model. This process continues until no covariates can be deleted from the model.

Step 4: At this point, we add to the model, one at a time, all variables excluded from the initial multivariable model to confirm that they are neither statistically significant nor an important confounder. We have encountered situations in practice where a variable had a univariable test p-value that exceeded 0.8 but became highly significant when added to the multivariable model obtained at step (3). We refer to the model at the conclusion of this step as the preliminary main effects model.

Step 5: We now examine the scale of the continuous covariates in the preliminary main effects model. A number of techniques are available, all of which are designed to determine whether the data support the hypothesis that the effect of the covariate is linear in the log hazard and, if not, what transformation of the covariate is linear in the log hazard. Discussion of these methods is somewhat involved and would interfere with the flow of the presentation of purposeful selection, if done at this point. Hence, we discuss these methods in Section 5.2.1. We refer to the model at the end of step (5) as the main effects model.

Step 6: The final step in the variable selection process (but not the final step in the model building process) is to determine whether interactions are needed in the model. In this setting, an interaction term is a new variable that is the product of two covariates in the model. Special considerations may dictate that a particular interaction term or terms be included in the model, regardless of the statistical significance of the coefficient(s). If this is the case, these interaction terms and their component terms should be added to the main effects model and the larger model fit before proceeding with a statistical evaluation of other possible interactions. However, in most settings, there will be insufficient clinical theory to justify automatic inclusion of interactions.

The selection process begins by forming a set of plausible interaction terms from the main effects in the model. Each individual interaction is assessed by comparing the model with the interaction term to the main effects model via the partial likelihood ratio test. All interactions significant at the 5 percent, or other, level are then added jointly to the main effects model. Wald statistic p-values are used as a guide to selecting interactions that may be eliminated from the model, but significance should be checked by the partial likelihood ratio test. Often when an interaction term enters a model, the coefficient of one of its component main effects may have a non-significant Wald statistic. We firmly believe that all main effects of significant interactions should remain in the model because, as shown in Section 4.4, estimates of effect require both main effect and interaction coefficients.

Several points should be kept in mind when selecting interaction terms. Because interactions are included to improve inferences and obtain a more realistic model, we feel that all interaction terms should be statistically significant at usual levels of significance, such as 5 or 10 percent, and perhaps as low as 1 percent in some settings. Inclusion of any interaction term in a model makes interpretation more difficult, but often more informative. However, if the interaction term is not significant, then standard error estimates will needlessly increase, thus unnecessarily widening confidence interval estimates of hazard ratios.

Step 7: We refer to the model at the conclusion of step (6) as the preliminary model. It does not become the final model until we thoroughly evaluate it. Model evaluation should include checking for adherence to key model assumptions, using casewise diagnostic statistics to check for influential observations, and testing for overall goodness-of-fit. This step is mandatory for any model building strategy, not just purposeful selection. We discuss evaluation methods in Chapter 6.
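Steps 2 and 3 amount to a delete-refit-check loop. The sketch below is a minimal rendering of that loop, continuing the `whas` DataFrame assumptions from above; the Wald cutoff of 0.05 and the 20 percent coefficient-change threshold follow the text, while the helper names and structure are illustrative.

```python
import numpy as np
from scipy.stats import chi2
from lifelines import CoxPHFitter

def fit_cox(df, covariates):
    cph = CoxPHFitter()
    cph.fit(df[["years", "fstat"] + covariates], "years", event_col="fstat")
    return cph

def purposeful_reduction(df, covariates, alpha=0.05, delta=0.20):
    current = list(covariates)
    while True:
        model = fit_cox(df, current)
        worst = model.summary["p"].idxmax()          # largest Wald p-value
        if model.summary["p"][worst] <= alpha:
            return model, current                    # nothing left to delete
        keep = [v for v in current if v != worst]
        reduced = fit_cox(df, keep)
        # Confirm the deletion with the partial likelihood ratio test.
        G = 2 * (model.log_likelihood_ - reduced.log_likelihood_)
        if chi2.sf(G, 1) <= alpha:
            return model, current                    # LR test says it matters
        # Step 3: a ~20% change in any remaining coefficient flags confounding.
        change = np.abs((reduced.params_ - model.params_[keep]) / model.params_[keep])
        if (change > delta).any():
            return model, current                    # keep the confounder
        current = keep
```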

A modification that is sometimes used in a clinical trial setting where there is a clear "treatment" variable is to exclude the treatment variable from the variable selection process. After the preliminary main effects model containing all other variables associated with outcome has been developed, treatment is then added to the model. This approach provides an estimate of the additional effect of treatment, adjusting for other covariates, in contrast to modeling in epidemiological studies where "treatment" would be the risk factor of interest. In these settings, selection of variables may be based on the change in the coefficient (estimate of effect) of the risk factor variable. Thus, rather than being the last variable to enter, the risk factor enters the model first. This points out that one must have clear goals for the analysis and proceed thoughtfully, using a variety of statistical tools and methods. The variable selection methods discussed may be an integral part of this analysis.

We mentioned in the beginning of this section that it is likely that we will have data on more covariates than we can reasonably expect to include in a single model. This might mean that the number of covariates showing a statistically significant association with survival from the univariable analyses will be much larger than can be expected to be included in a beginning multivariable model. In such a case, we recommend rank ordering all covariates based on the p-values from the univariable analyses and using, for the beginning model, only the most highly significant ones. It is difficult to make a general statement about how many covariates should or can be included, but a rough guideline is to include one covariate per ten events. As an example, for the WHAS500 study where 215 events are observed, a model with fewer than 20 estimated coefficients would not be expected to cause problems of overfitting. On the other hand, a data set with fewer than 20 events cannot be expected to yield meaningful results from multivariable analyses.

5.2.1 METHODS TO EXAMINE THE SCALE OF CONTINUOUS COVARIATES IN THE LOG HAZARD

As noted above in step (5), an important but, in practice, often ignored modeling step is to determine whether the data support the assumption of linearity in the log hazard for all continuous covariates. Methods ranging from simple to complex can now be performed in many software packages.

The simplest method is to replace the covariate with design variables formed from its quartiles (these are easily obtained from the univariable descriptive analysis). The estimated coefficients for the design variables are plotted versus the midpoints of the intervals defined by the cutpoints. At the midpoint of the first interval, a point is plotted at zero. If the correct scale is linear in the log hazard, then the polygon connecting the points should be approximately a straight line. If the polygon departs substantially from a linear trend, its form may be used to suggest a transformation of the covariate. This is often difficult as there may only be four plotted points. We refer to this method as the quartile design variable method and its clear advantage is that it does not require any special software. Its disadvantage is that it is not powerful enough to detect subtle, but often important, deviations from a linear trend.
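A sketch of the quartile design variable method follows, under the same `whas` assumptions as the earlier snippets; bmi is the example covariate, and using the observed minimum and maximum for the outer interval midpoints is our choice, not prescribed by the text.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter

q = whas["bmi"].quantile([0.25, 0.50, 0.75]).tolist()
groups = pd.cut(whas["bmi"], bins=[-np.inf, *q, np.inf],
                labels=["q1", "q2", "q3", "q4"])
design = pd.get_dummies(groups, drop_first=True).astype(float)  # q1 is the reference
cph = CoxPHFitter().fit(pd.concat([whas[["years", "fstat"]], design], axis=1),
                        "years", event_col="fstat")

# Midpoints of the four intervals (outer intervals use the observed min/max).
edges = [whas["bmi"].min(), *q, whas["bmi"].max()]
mids = [(edges[i] + edges[i + 1]) / 2 for i in range(4)]
plt.plot(mids, [0.0, *cph.params_], marker="o")   # first midpoint plotted at zero
plt.xlabel("bmi interval midpoint"); plt.ylabel("estimated coefficient")
```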

Another approach is to use fractional polynomials, developed by Royston and Altman (1994), to suggest transformations. Sauerbrei and Royston (1999) and Royston and Sauerbrei (2006) study the method in some detail. Basically, we wish to determine what value of p in x^p yields the best model for the covariate (i.e., the log hazard is linear in x^p). In theory, we could incorporate the power, p, as an additional parameter in the estimation procedure. However, this greatly increases the complexity of the estimation problem. Royston and Altman propose replacing full maximum likelihood estimation of the power by a search through a small but useful set of possible values. We provide a brief description of the method here and use it in the model building example in the next section.

Fractional polynomials can be used with a multivariable proportional hazards regression model, but, for sake of simplicity, we describe them using a model with a single continuous covariate. We discuss an iterative multivariable version first proposed by Royston and Ambler (1998) in Section 5.3. The univariable hazard function for the proportional hazards regression model shown in (3.7) is

$$h(t, x, \beta) = h_{0}(t)\,e^{x\beta}$$

and the log-hazard function, which is linear in the covariate, is

$$\ln\left[h(t, x, \beta)\right] = \ln\left[h_{0}(t)\right] + x\beta$$

One way to generalize this log-hazard function is to specify it as a function of J terms

$$\ln\left[h(t, x, \beta)\right] = \ln\left[h_{0}(t)\right] + \sum_{j=1}^{J} F_{j}(x)\,\beta_{j}$$

The functions F_j(x) are a particular type of power function. The value of the first function is F_1(x) = x^{p_1}. In theory, the power, p_1, could be any number, but in most applied settings we try to use something simple. Royston and Altman (1994) propose restricting the power to be among those in the set

$$\mathcal{P} = \{-2,\; -1,\; -0.5,\; 0,\; 0.5,\; 1,\; 2,\; 3\},$$

where p1 = 0 denotes the natural log of the variable. The remaining functions are defined as

$$F_{j}(x) = \begin{cases} x^{p_{j}}, & p_{j} \neq p_{j-1} \\ F_{j-1}(x)\ln(x), & p_{j} = p_{j-1} \end{cases}$$

for j = 2, ..., J, with the powers again restricted to those in the set defined above. For example, if we chose J = 2 with p1 = 0 and p2 = -0.5, then the log-hazard function is

$$\ln\left[h(t, x, \beta)\right] = \ln\left[h_{0}(t)\right] + \beta_{1}\ln(x) + \beta_{2}\,x^{-0.5}$$

As another example, if we chose J = 2 with p1 = 2 and p2 = 2, then the log-hazard function is

$$\ln\left[h(t, x, \beta)\right] = \ln\left[h_{0}(t)\right] + \beta_{1}x^{2} + \beta_{2}\,x^{2}\ln(x)$$

This is true for all repeated powers (p1 = p, p2 = p). The model is quadratic in x if p1 = 1 and p2 = 2. Again, we could allow the covariate to enter the model with any number of functions, J, but in most applied settings an adequate transformation can be found if we use J = 1 or 2. Implementation requires, for J = 1, fitting 8 models, one for each power in the set defined above. The best model is the one with the largest log partial likelihood or, equivalently, the smallest value of what we call the Deviance, -2 times the log partial likelihood. The process is repeated with J = 2 by fitting the 36 models obtained from the unique possible pairs of powers (28 pairs where p1 ≠ p2 and eight pairs where p1 = p2). The best model is again the one with the largest log partial likelihood. The relevant question is whether either of the two best models is significantly better than the linear model. Let L(1) denote the log partial likelihood for the linear model, that is, J = 1 and p1 = 1, let L(p1) denote the log partial likelihood for the best J = 1 model, and let L(p1, p2) denote the log partial likelihood for the best J = 2 model. Royston and Altman (1994) and Ambler and Royston (2001) suggest, and verify with simulations, that each term in the fractional polynomial model contributes approximately 2 degrees of freedom to the model, effectively one for the power and one for the coefficient. Thus, the partial likelihood ratio test comparing the linear model to the best J = 1 model,

$$G(1, p_{1}) = 2\left\{L(p_{1}) - L(1)\right\},$$

is approximately distributed as chi–square with 1 degree of freedom under the null hypothesis of linearity. The partial likelihood ratio test comparing the best J = 1 model to the best J = 2 model,

$$G\left[p_{1}, (p_{1}, p_{2})\right] = 2\left\{L(p_{1}, p_{2}) - L(p_{1})\right\},$$

is approximately distributed as chi–square with 2 degrees of freedom under the null hypothesis that the second fractional polynomial function is equal to zero. Similarly, the partial likelihood ratio test comparing the linear model to the best J = 2 model,

$$G\left[1, (p_{1}, p_{2})\right] = 2\left\{L(p_{1}, p_{2}) - L(1)\right\},$$

is distributed approximately as chi–square with 3 degrees of freedom. To keep the notation simple, we have used p1 to denote the best power both when J = 1 and as the first of the two powers for J = 2. These are not likely to be the same numeric value in practice.
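Before turning to how these tests are used in practice, note that the power set and the repeated-power rule translate directly into code. The function below is a minimal sketch that builds the columns F_1(x), ..., F_J(x) for a tuple of powers, treating p = 0 as ln(x); it assumes x is strictly positive, as body mass index is, and the names are illustrative.

```python
import numpy as np

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # the set of candidate powers

def fp_terms(x, powers):
    """Columns F_1(x), ..., F_J(x) for a tuple of powers (repeats adjacent)."""
    x = np.asarray(x, dtype=float)
    cols, prev_p, prev = [], None, None
    for p in powers:
        if p == prev_p:                    # repeated power: F_j = F_{j-1} * ln(x)
            col = prev * np.log(x)
        elif p == 0:                       # p = 0 denotes ln(x)
            col = np.log(x)
        else:
            col = x ** p
        cols.append(col)
        prev_p, prev = p, col
    return np.column_stack(cols)

# fp_terms(x, (2, 2)) returns the columns x**2 and x**2 * ln(x), matching the
# repeated-powers example above; fp_terms(x, (0, -0.5)) returns ln(x), x**-0.5.
```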

In an applied setting, these partial likelihood ratio tests can be used in one of two ways to find the best fractional polynomial model: a closed test procedure and a sequential test procedure; see Sauerbrei, Meier-Hirmer, Benner and Royston (2006) and the references cited therein. Sauerbrei, Meier-Hirmer, Benner and Royston also consider, as a candidate null model, one in which x is not included in the model at all. We do not consider this model because all covariates remaining at the end of purposeful selection step (4) have passed statistical selection criteria and thus are in the model; the null model is no longer an option.

In the closed test procedure one begins by comparing the linear model to the best two-term model via G[1, (p1,p2)]. If this test is not significant at the chosen level of significance (usually 5 percent) then we stop and assume that the log hazard is linear in x. If the test is significant then we compare the best one-term model to the best two-term model via G[p1,(p1, p2)]. If the test is significant, then we use the best two-term model otherwise we use the best one-term model.

In the sequential test procedure, we begin by comparing the best one-term model to the best two-term model, again via G[p1, (p1, p2)]. If the test is significant, we stop and use the best two-term model. If the test is not significant, then we compare the linear model to the best one-term model via G(1, p1). If this test is significant, then we use the best one-term model. If this test is not significant, then we use the linear model.

Ambler and Royston (2001) examined the type I error rates of the two procedures via simulation and conclude that the closed test procedure comes closer to maintaining the overall level of significance and thus is the one we use in this text.
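A sketch of the full search and the closed test procedure, built on fp_terms() above and continuing the same DataFrame assumptions, is shown below; deviance is -2 times the log partial likelihood, as defined earlier, and the structure of the helper functions is illustrative.

```python
from itertools import combinations_with_replacement
from scipy.stats import chi2
from lifelines import CoxPHFitter

def deviance(df, covs):
    cph = CoxPHFitter().fit(df[["years", "fstat"] + covs], "years", event_col="fstat")
    return -2 * cph.log_likelihood_

def fp_closed_test(data, var="bmi", alpha=0.05):
    df = data[["years", "fstat"]].copy()
    dev = {}
    singles = [(p,) for p in POWERS]                        # 8 one-term models
    pairs = list(combinations_with_replacement(POWERS, 2))  # 36 two-term models
    for powers in singles + pairs:
        terms = fp_terms(data[var], powers)
        names = [f"{var}_fp{j}" for j in range(terms.shape[1])]
        df[names] = terms
        dev[powers] = deviance(df, names)
    best1 = min(singles, key=dev.get)
    best2 = min(pairs, key=dev.get)
    # Closed test: linear vs best J = 2 (3 df); if significant, best J = 1
    # vs best J = 2 (2 df); otherwise keep the covariate linear.
    if chi2.sf(dev[(1,)] - dev[best2], 3) >= alpha:
        return (1,)
    return best2 if chi2.sf(dev[best1] - dev[best2], 2) < alpha else best1
```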

In general, we recommend that, if either the one or two-term fractional polynomial model is selected for use, it should not only provide a statistically significant improvement, but the transformation(s) must make clinical sense.

STATA is the only software package to have fully implemented fractional polynomials. Its fractional polynomial routine offers the user considerable flexibility in expanding the number of terms, J, as well as the set of powers searched. In most settings, the default values of J = 2 and the set of powers given above are adequate. Sauerbrei, Meier-Hirmer, Benner and Royston (2006) describe software they have developed to implement the method in the SAS package and the R computer language.

Graphical methods, other than the design variable method, to check the scale of covariates may be performed in most software packages. The most easily used is similar to the added variable plot from linear regression; see Ryan (1997). A complete discussion of residuals from a fitted proportional hazards model is provided in Chapter 6. The reader wishing to know the details is welcome to read Section 6.2 before proceeding, but it is not necessary at this point.

The first plot we discuss uses as the components of the residual for the ith subject, the value of the censoring variable, ci, and a modification of the estimator of the cumulative hazard shown in (3.42). Specifically, we use the Nelson-Aalen estimator of the log of the baseline survival function, defined as

$$\tilde{H}_{0}(t) = -\ln\left[\tilde{S}_{0}(t)\right] = \sum_{t_{(i)} \le t} \frac{d_{i}}{\sum_{j \in R(t_{(i)})} e^{\mathbf{x}_{j}\hat{\boldsymbol{\beta}}}},$$

yielding the estimator

$$\tilde{H}_{i} = \tilde{H}\left(t_{i}, \mathbf{x}_{i}, \hat{\boldsymbol{\beta}}\right) = \tilde{H}_{0}(t_{i})\,e^{\mathbf{x}_{i}\hat{\boldsymbol{\beta}}}.$$

These are used to calculate the estimated martingale residuals, defined as

$$\hat{M}_{i} = c_{i} - \tilde{H}_{i}, \quad i = 1, 2, \ldots, n.$$

It may seem a small and unimportant detail, but the sum of the martingale residuals is not equal to zero if we use the estimator in (3.42), while they do sum to zero as defined above.

Therneau, Grambsch and Fleming (1990) suggest calculating the martingale residuals from a model that excludes the covariate of interest. One then plots these martingale residuals versus the excluded covariate. They suggest adding a lowess or other smooth to the scatterplot to aid interpretation. This plot is analogous to the added variable plot in normal errors linear regression. The smooth provides an estimate of the functional form of the covariate in the log hazard. The difficulty in using this plot is deciding what the correct functional form is if the plot looks decidedly non-linear. We refer to this plot as the smoothed added variable plot.
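A sketch of the smoothed added variable plot follows, continuing the running example: the model omits bmi, and its martingale residuals are smoothed against bmi with lowess. The covariate list mirrors main effects used later in this chapter, and the residual column name and the reindex call reflect our reading of the lifelines API, so treat them as assumptions.

```python
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter
from statsmodels.nonparametric.smoothers_lowess import lowess

others = ["age", "hr", "diasbp", "gender", "chf"]      # model without bmi
train = whas[["years", "fstat"] + others]
cph = CoxPHFitter().fit(train, "years", event_col="fstat")
mart = (cph.compute_residuals(train, kind="martingale")["martingale"]
           .reindex(whas.index))                       # restore original row order

sm = lowess(mart, whas["bmi"], frac=0.6)               # sorted (x, smooth) pairs
plt.scatter(whas["bmi"], mart, s=8)
plt.plot(sm[:, 0], sm[:, 1])
plt.xlabel("bmi"); plt.ylabel("martingale residual")
```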

Grambsch, Therneau and Fleming (1995) expand on their earlier work to propose a plot that has greater diagnostic power to detect the functional form of a model covariate. One begins by calculating the martingale residuals from the fit of a model containing all covariates. One then calculates a lowess smooth from the scatterplot of the censoring variable, c_i, versus the covariate of interest and another smooth of the scatterplot of the estimated cumulative hazard, H̃_i, versus the covariate of interest. Denote these smoothed values as c_i^sm and H̃_i^sm. One uses the two sets of smoothed values to calculate

$$\hat{f}_{i} = \ln\left(\frac{c_{i}^{sm}}{\tilde{H}_{i}^{sm}}\right) + \hat{\beta}_{x}\,x_{i},$$

where β̂_x denotes the estimated coefficient of the covariate of interest, denoted as x. One then plots the values of f̂_i versus x_i. The correct functional form of the covariate is estimated in this plot. But again, we face the same question as in the added variable martingale residual plot. Namely, what is the functional form if the plot is non-linear? We refer to the plot of f̂_i versus x_i as the GTF smoothed plot. Descriptions of applications of these and other related methods may be found in Therneau and Grambsch (2000).
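The GTF plot requires the residuals from the full model plus two lowess smooths; a sketch continuing the previous snippet follows, recovering H̃_i as c_i - M̂_i. Variable names carry over from the added variable plot sketch above.

```python
import numpy as np

full = whas[["years", "fstat"] + others + ["bmi"]]
cph_full = CoxPHFitter().fit(full, "years", event_col="fstat")
M = (cph_full.compute_residuals(full, kind="martingale")["martingale"]
             .reindex(whas.index))
H = whas["fstat"] - M                                  # H_i = c_i - M_i

c_sm = lowess(whas["fstat"], whas["bmi"], frac=0.6, return_sorted=False)
H_sm = lowess(H, whas["bmi"], frac=0.6, return_sorted=False)
f = np.log(c_sm / H_sm) + cph_full.params_["bmi"] * whas["bmi"]
plt.scatter(whas["bmi"], f, s=8)                       # plot f_i versus x_i
plt.xlabel("bmi"); plt.ylabel("estimated functional form")
```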

In practice, our decision to transform a continuous covariate is most often based on the results of a fractional polynomial analysis. We tend to use the graphical methods to see if they support the fractional polynomial model.

Rather than illustrate the methods here, we use and illustrate them in the model building example in the next section.

Gray (1992) suggests that spline functions may be used as a way of modeling a continuous covariate that avoids assuming a linear scale. Ryan (1997) discusses construction and use of spline functions in linear regression. Harrell et al. (1996) and Harrell (2001) use spline functions in a variety of modeling settings, including the proportional hazards model. Because spline functions are not readily available in most packages and not as easy to use as the methods we discuss in this section, we do not consider them in this text.

5.2.2 AN EXAMPLE OF PURPOSEFUL SELECTION OF COVARIATES

Among the data sets that we have chosen to use in this text, the Worcester Heart Attack Study’s WHAS500 data provides the best opportunity to illustrate not only issues of variable selection including confounding and interaction but also covariates that are not linearly scaled. At the end of this exercise, you may feel that we have gone a bit overboard and the resulting model is unnecessarily complicated. However, one of our goals is to present some working rules or guidelines that you can use in your own practice when complications such as interactions and non-linear continuous covariates occur.

The results of the univariable analysis of each covariate in relation to survival time (in years) following admission to a hospital after an MI are given in Table 5.1 for the discrete covariates and in Table 5.2 for the continuous covariates defined in Table 1.2. All variables except complete heart block are significant at the 20 percent level and therefore are candidates for inclusion in the multivariable model.

In Table 5.1, the cardiogenic shock and complete heart block variables present difficulties when considered for inclusion in a multivariable model. There are only 22 and 11 subjects, respectively, with the condition present, and of these 17 and 7 die, respectively. Cardiogenic shock is significant at the 1 percent level. Examination of the survival times for the 17 subjects with cardiogenic shock who died showed that nine died in the first week of follow-up and five more within the first month. Thus, those having cardiogenic shock have a high rate of early death, and there is little data remaining to estimate its effect beyond one month of follow-up. For these reasons, we drop this covariate as a candidate for the multivariable model. Complete heart block is not a significant univariable predictor (p = 0.254) but has similar “thin data” problems. Hence, it is also dropped from further consideration for the multivariable model.

Table 5.1 Estimated Median Time to Death with 95% Brookmeyer-Crowley Confidence Interval Estimates, Log-Rank Test, and Partial Likelihood Ratio Test p-values for Categorical Covariates in the WHAS500 data (n = 500)

c05t001

Before we fit the multivariable model, we note the close agreement in Table 5.1 between the significance levels of the partial likelihood ratio test and the log-rank test. This is as expected because, for a discrete covariate, the score test from a univariable proportional hazards model is algebraically equivalent to the log–rank test, and the score test and likelihood ratio test perform similarly. This implies that the log–rank test is a perfectly acceptable choice for purposes of covariate selection for the initial multivariable model.

Table 5.3 presents the results of fitting the multivariable proportional hazards model containing all variables significant at the p < 0.20 level in the univariable analysis (excluding cardiogenic shock as stated above).

Examining the p-values for the Wald statistics with the goal of trying to simplify the model, we see that the two largest values are p = 0.967 for history of cardiovascular disease (cvd) and p = 0.904 for systolic blood pressure (sysbp). Deleting these two covariates and refitting the model (results shown in Table 5.4) yields a two degrees-of-freedom partial likelihood ratio test whose p-value = 0.991 (calculations not shown here). We see that the estimates of the coefficients for the covariates remaining in the model are virtually unchanged and conclude that cvd and sysbp are not confounders.

Table 5.2 Estimated Hazard Ratio for Time to Death with 95% Confidence Interval Estimates, Wald Test, and Partial Likelihood Ratio Test p-values for Continuous Covariates in the WHAS500 data (n = 500)

c05t002

In Table 5.4, the covariate with the largest p-value is MI order (miord) where p = 0.758. The results of fitting the model deleting this covariate are shown in Table 5.5 and p = 0.758 for the one degree-of-freedom partial likelihood ratio test comparing the model in Table 5.5 to the one in Table 5.4 (calculations not shown). The estimates of the coefficients for the covariates common to the two models are nearly the same in Table 5.4 and Table 5.5, thus, miord is not a confounder.

The covariates in Table 5.5 with the largest p-values for the Wald statistics are atrial fibrillation (afb) with p = 0.464 and MI type (mitype) with p = 0.309. The results of fitting a model deleting these two are shown in Table 5.6. The partial likelihood ratio test comparing the model in Table 5.6 to the model in Table 5.5 has p = 0.473 (calculations not shown). We see that the estimates of the coefficients for the covariates remaining in the model are virtually unchanged and conclude that afb and mitype are not confounders.

Table 5.3 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for the Proportional Hazards Model Containing Variables Significant at the 20% Level in the Univariable Analysis for the WHAS500 data (n = 500)

c05t003

Table 5.4 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for the Reduced Proportional Hazards Model for the WHAS500 data (n = 500)

c05t004

With the exception of gender, each of the covariates in Table 5.6 has a significant Wald test when using α = 0.05. Because gender is such an important clinical variable and is significant at the 10 percent level, we keep it in the model. No further model reduction is possible. At this point, we would normally add any covariates not in the initial multivariable model. The only candidates from Table 5.1 are cardiogenic shock and complete heart block, which were dropped due to inadequate data. The next step is to examine the scale of the continuous covariates age, heart rate (hr), diastolic blood pressure (diasbp) and body mass index (bmi).

Before checking the scale of the continuous covariates, we note that we deleted two covariates in the first and third model reductions. This reflects one of our basic modeling strategies: to delete simultaneously covariates displaying similar statistical associations with the outcome. However, we rarely remove more than two or three at a time, and if one covariate has more clinical relevance than the others, we will remove it separately from the others. Also, if removal of more than one covariate introduces confounding for one or more remaining variables, we step back and remove them one at a time.

Table 5.5 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for the Reduced Proportional Hazards Model for the WHAS500 data (n = 500)

c05t005

Table 5.6 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for the Reduced Proportional Hazards Model for the WHAS500 data (n = 500)

c05t006

The first method we illustrate for checking the scale of continuous covariates uses the quartile design variables. Separately, for each of the four continuous variables, we replace the variable in the model with three design variables formed using as cutpoints the three quartiles. Table 5.7 presents a summary of the resulting coefficients and group midpoints. We next graph the coefficients against the group midpoints. These are shown in Figure 5.1.

The plots of the coefficients for age and heart rate support an assumption of linearity in the log hazard. The plot for diastolic blood pressure has a small increase from the second to third quartile. The plot of the coefficients for body mass index has a similar but larger jump. It is difficult to tell from Figure 5.1 if the plots for diastolic blood pressure and body mass index indicate a statistically significant departure from linearity or are due to random variation. Based on these plots alone, we would be reluctant to recommend any non-linear transformation of diastolic blood pressure or body mass index.

Next, we use fractional polynomials to examine the linearity assumption. The fractional polynomial analysis of age, heart rate, and diastolic blood pressure did not yield any significant transformations, thus we do not show these results. The analysis of body mass index did not support linearity in the log hazard and the summary of results from STATA is shown in Table 5.8. Unfortunately, the results do not completely conform to either the closed test or sequential test procedure, but all the required information is provided. This situation may change in later releases of STATA. We first compare the linear model to the best two-term model. The partial likelihood ratio test statistic for this test is given in the last row of the third column as G = 10.215. The three degrees-of-freedom p-value for this test is 0.017 (Note: this result is not currently provided in the STATA output).

Table 5.7 Estimated Coefficients for the Three Design Variables Formed from the Quartiles for the Variables Age, Heart Rate, Diastolic Blood Pressure, and Body Mass Index for the WHAS500 data (n = 500)

c05t007

Thus, we conclude that the best two-term model is significantly different from the linear model. We next compare the best one-term model to the best two-term model. The partial likelihood ratio test statistic for this test is not provided in the table but is calculated as

$$G\left[p_{1}, (p_{1}, p_{2})\right] = G\left[-2, (2, 3)\right] = 2\left\{L(2, 3) - L(-2)\right\},$$

Figure 5.1 Graphs of estimated coefficients versus quartile midpoints for (a) age, (b) heart rate, (c) diastolic blood pressure, and (d) body mass index.

c05f001

and its two degree-of-freedom p-value is 0.143 (Note: this p-value is provided in the last row of the fourth column of the STATA output). Because this test is not significant, we choose the one-term fractional polynomial model, which states that the transformation is bmi^-2. Before accepting this inverse squared transformation of body mass index, we need to be sure it is clinically plausible and examine whether the fractional polynomial analysis is being highly influenced by either exceptionally small or large values.

We know that both low and high body mass index are associated with poorer survival than moderate values. The fitted model (results not shown) using the inverse squared transformation of body mass index is monotonic. Hence, this transformation models the decrease in the log hazard from low to moderate body mass index but does not pick up the increase for high body mass index. A two-term model is required for any form of “quadratic like” shape in the log hazard. As noted above, the two-term model was not significantly different from the one-term model. Thus, on clinical grounds, we would choose the two-term model over the one-term model. If we are going to use the two-term model in Table 5.8, then we should look at what the fitted (2, 3) fractional polynomial looks like. Royston and Sauerbrei (2006) point out that the fractional polynomial analysis can be unduly influenced by outlying values. They propose a preliminary, somewhat complicated, mathematical transformation of the covariate to minimize any outlier effect. We refer the reader to their paper for the details. A box plot of the distribution of body mass index, not shown here, indicates that there is one small value, 13.05, and one large value, 44.83, that have a gap of at least 1.5 from their nearest neighbor. Excluding these two values and rerunning the fractional polynomial analysis yielded models and results fully equivalent to those shown in Table 5.8. Hence, we conclude that the results in Table 5.8 are not outlier dependent.

We still have to make a choice regarding a main effects model, but before doing so, we consider what the smoothed added variable plot and GTF smoothed plot tell us about the functional form of body mass index. The two plots are shown in Figure 5.2a and Figure 5.2b. The smoothed lines in both plots show a decrease, then an increase, on the log hazard scale, which supports the clinical knowledge about the effect of body mass index on post MI survival. The extreme increase in Figure 5.2b is caused by a few subjects with high body mass index. When we put all the results and clinical factors together, we conclude that a two-term fractional polynomial is preferred over a one-term or linear model. When using fractional polynomials, or any method for that matter, to identify the functional form of the relationship between outcome and covariate, one must take into account the fact that there is low power with small sample sizes. With time-to-event regression models, power is largely a function of the number of events, not pure sample size, a point we discussed above, and will detail more fully when discussing sample size methods.

Table 5.8 Summary of Fractional Polynomials Analysis of Body Mass Index (bmi) for WHAS500 data (n = 500)

c05t008

* Compares linear model to model without bmi.

+ Compares the best J = 1 model to one with bmi linear.

# Compares the best J = 2 model to the best J = 1 model.

The two-term (2, 3) fractional polynomial model has the numerically smallest value of -2 log partial likelihood among the 36 two-term models fit. By using the “log” option in STATA’s implementation of fractional polynomials, we are able to see what other two-term models have values of -2 log partial likelihood close to the minimum of 2237.784 of the (2, 3) model. From the log listing (not shown here), we see that there are several. In particular, the quadratic model, the two-term (1, 2) fractional polynomial, has a value of 2238.103, which is only trivially different from the best model. All things being equal, we would certainly prefer the mathematically simpler quadratic model. However, the two fractional polynomial functions, (2, 3) and (1, 2), are different. The (1, 2) model is a symmetric function, while the (2, 3) model is not. Choosing the symmetric model implies that the rate of decrease and then increase in the log hazard is the same. However, both smoothed plots in Figure 5.2 indicate an asymmetric behavior. Thus, we decide to proceed with the two-term (2, 3) fractional polynomial model. The results of fitting this model are shown in Table 5.9.

In Table 5.9 the two fractional polynomial transformations of body mass index are

$$\text{bmifp1} = (\text{bmi}/10)^{2}$$

and

$$\text{bmifp2} = (\text{bmi}/10)^{3}.$$

Before we accept the model in Table 5.9 as our main effects model, we plot the fitted two-term (2, 3) fractional polynomial model and the GTF smooth versus body mass index. That is, we form a new plot where we add a centered version of the parametric two-term fractional polynomial model to Figure 5.2b. To accomplish this, we have saved the GTF smoothed values shown in Figure 5.2b in

Figure 5.2a Smoothed added variable plot.

c05f002

Figure 5.2b Grambsch, Therneau and Fleming (GTF) smoothed plot.

Figure 5.2 Smoothed residual plots for body mass index.

c05f003

Table 5.9 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for Main Effects Proportional Hazards Model for the WHAS500 data

c05t009

our STATA program that produced the figure. Next, we generate a new variable containing the values of the fitted two-term fractional polynomial as follows

$$\widehat{fp}_{i} = \hat{\beta}_{1}\,\text{bmifp1}_{i} + \hat{\beta}_{2}\,\text{bmifp2}_{i}.$$

Figure 5.3 Plot of the GTF smooth and fitted two-term (2, 3) fractional polynomial model from Table 5.8 versus body mass index for the WHAS500 data.

c05f004

To determine whether the two functions are similar, we add the average difference between the GTF smooth and fpi, 0.90, to fpi before plotting. The plot of the two functions is shown in Figure 5.3. The similarity in the two plotted functions is striking, thus adding further credence to using the (2, 3) fractional polynomial to model the effect of body mass index in the WHAS500 data.
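Reproducing this comparison is mechanical once the pieces above are in hand. The sketch below refits the main effects model with the two (bmi/10) transformations, centers the fitted fractional polynomial on the GTF smooth `f` from the earlier snippet, and overlays the two; the text's 0.90 shift corresponds to the mean difference computed here, and the variable names carry over from the previous sketches.

```python
whas["bmifp1"] = (whas["bmi"] / 10) ** 2
whas["bmifp2"] = (whas["bmi"] / 10) ** 3
cph23 = CoxPHFitter().fit(
    whas[["years", "fstat"] + others + ["bmifp1", "bmifp2"]],
    "years", event_col="fstat")

fp = (cph23.params_["bmifp1"] * whas["bmifp1"]
      + cph23.params_["bmifp2"] * whas["bmifp2"])
fp = fp + (f - fp).mean()                  # shift by the average difference
order = np.argsort(whas["bmi"].values)
plt.plot(whas["bmi"].values[order], fp.values[order], label="(2, 3) FP")
plt.scatter(whas["bmi"], f, s=8, label="GTF smooth")
plt.legend(); plt.xlabel("bmi")
```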

Hence, we proceed to the final step in purposeful selection, selecting interactions, using the main effects model in Table 5.9. This step begins by creating a list of plausible interactions formed from the main effects in Table 5.9. By consulting with the Worcester Heart Attack Study investigators, we learned that only interactions involving age and gender with all other model covariates are of clinical interest. These are added, one at a time, to the main effects model and tested for significance. Table 5.10 shows the two variables forming the interaction, the degrees-of-freedom of the interaction, and the p-value for the partial likelihood ratio test comparing the models with and without the interaction. The interaction terms are formed as the arithmetic product of the pair of variables. The interactions involving body mass index are formed using the two fractional polynomial transformations, bmifp1 and bmifp2.

As an example of the calculations used in Table 5.10, we show in Table 5.11 the fitted model containing the age-by-gender interaction. The Wald statistic for the interaction coefficient has p = 0.022. The log partial likelihood of the fitted main effects model in Table 5.9 is L = –1118.8921, and for the fitted model in Table 5.11, it is L = –1116.2793. The partial likelihood ratio test statistic is

$$G = 2\left\{-1116.2793 - (-1118.8921)\right\} = 5.23,$$

yielding a p-value of 0.022 from a chi-square distribution with one degree of freedom. The remaining p-values in Table 5.10 are determined in a similar manner.
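The arithmetic of these partial likelihood ratio tests is worth verifying once; a two-line check using scipy (an illustrative choice of tool) reproduces the G statistic and p-value above.

```python
from scipy.stats import chi2

G = 2 * (-1116.2793 - (-1118.8921))              # twice the log likelihood difference
print(round(G, 2), round(chi2.sf(G, df=1), 3))   # 5.23 0.022
```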

Table 5.10 Interaction Variables, Degrees-of-Freedom (df), and p-values for the Partial Likelihood Ratio Test for the Addition of the Interaction to the Model in Table 5.9

c05t010

Table 5.11 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for Main Effects Proportional Hazards Model with the Interaction of Age and Gender Added for the WHAS500 data

c05t011

As we noted when we presented the steps in purposeful selection, we believe that interactions should be significant at traditional levels. We see, in Table 5.10, that only the age-by-gender interaction and the age-by-congestive-heart-complications (chf) interaction are significant at the 5 percent level. However, the age-by-diastolic-blood-pressure and gender-by-diastolic-blood-pressure interactions are significant at the 10 percent level. Hence, we decide to add all four to the main effects model. The results of fitting this model are shown in Table 5.12.

The log partial likelihood of the fitted model in Table 5.12 is L = –1112.5953. The partial likelihood ratio test statistic comparing it to the

Table 5.12 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for Main Effects Proportional Hazards Model Containing Selected Interactions for the WHAS500 data

c05t012

Table 5.13 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Interval Estimates for Main Effects Proportional Hazards Model Containing Selected Interactions for the WHAS500 data.

c05t013

main effects model in Table 5.9 is

$$G = 2\left\{-1112.5953 - (-1118.8921)\right\} = 12.59,$$

yielding a p-value of 0.013 from a chi-square distribution with four degrees of freedom. Thus, in aggregate, the interactions contribute significantly to the main effects model. However, none of the Wald statistics for the individual interaction terms are now significant at the 5% level. At this point, we begin to reduce the model in the same manner as was used when reducing the model in Table 5.3. The least significant interaction in Table 5.12 is that of gender and diastolic blood pressure. The partial likelihood ratio test comparing the model in Table 5.12 to the model with this interaction removed yields p = 0.196, a value nearly identical to that of the Wald test in Table 5.12. In a similar manner, we remove the age-by-diastolic-blood-pressure interaction and the age-by-congestive-heart-complications interaction to obtain the model shown in Table 5.13.

The model in Table 5.13 is our preliminary final model. It does not become the final model until we test it for adherence to model assumptions, examine case-wise diagnostic statistics and test for goodness of fit. Before we consider these important topics in detail in Chapter 6, we discuss stepwise, best-subsets selection and the multivariable fractional polynomial method [Royston and Ambler, (1998)].

5.3 STEPWISE, BEST-SUBSETS AND MULTIVARIABLE FRACTIONAL POLYNOMIAL METHODS OF SELECTING COVARIATES

Statistical algorithms for selecting covariates, such as stepwise and best-subsets selection, can be used with the proportional hazards model. They operate in an identical manner to those same methods when used in regression models such as linear or logistic regression. In addition, an iterative multivariable fractional polynomial method that combines model reduction and scale selection by fractional polynomials in a sequential manner has been proposed [Royston and Ambler, (1998)]. While the method has more user control than stepwise or best-subsets, it has some of the same algorithmic elements; thus, we consider it in this section rather than in the previous one.

5.3.1 STEPWISE SELECTION OF COVARIATES

Covariates may be selected for inclusion in a proportional hazards regression model using stepwise selection methods. The statistical test used as a criterion is most often the partial-likelihood ratio test. However, the score test and Wald test are often used by software packages. From a conceptual point of view, it does not matter which test is used. However, the partial-likelihood ratio test has been shown to have the best statistical testing properties of the three and should be used when there is a choice.

We assume familiarity with stepwise methods from either linear or logistic regression. Thus the presentation here will not be detailed. Detailed descriptions of stepwise selection of covariates may be found in Hosmer and Lemeshow (2000), Chapter 4, for logistic regression and in Ryan (1997), Chapter 7, for linear regression.

We begin by describing the full stepwise selection process, which consists of forward selection followed by backward elimination. The forward selection process adds to the model the covariate most statistically significant among those not in the model. The backward elimination process checks each covariate in the model for continued significance. Two variations of the full stepwise procedure available in most software packages use forward selection only or backward elimination only.

Most software packages now have the capability to create design variables for nominal scale covariates at more than two levels and to treat these design variables as a unit when considering the covariate for entry or removal from the model. However, to keep the notation to a minimum, we describe the stepwise procedure using single degree of freedom tests for entry and removal of covariates. Thus, for this description, we assume all covariates are either continuous or dichotomous.

Step 0: Assume that there are p possible covariates, denoted xj, j = 1,2,···,p. This list is assumed to include all covariates. At step 0, the partial likelihood ratio test and its p-value for the significance of each covariate are computed by comparing the log-partial likelihood of the model containing xj to the log-partial likelihood of model zero (i.e., the model containing no covariates). This test statistic is

$$G^{(0)}(j) = 2\left\{L^{(0)}(j) - L(0)\right\}, \qquad (5.1)$$

where L(0) is equal to the log partial likelihood of model zero, the no-covariate model, and L(0)(j) is equal to the log partial likelihood of the model containing covariate xj. The test’s significance level is

$$p^{(0)}(j) = \Pr\left[\chi^{2}(1) > G^{(0)}(j)\right]. \qquad (5.2)$$

Evaluating (5.1) and (5.2) requires fitting p separate proportional hazards models. The parenthesized superscript in (5.1) and (5.2) denotes the step, and j indexes the particular covariate. The candidate for entry into the model at step 1 is the most significant covariate and is denoted by xe1, where

$$p^{(0)}(e_{1}) = \min_{j}\left[p^{(0)}(j)\right]. \qquad (5.3)$$

For the variable xe1 to be entered into the model, its p-value must be smaller than some pre-chosen criterion for significance, denoted pE. If the variable selected for entry is significant (i.e., p(0)(e1) < pE), then the program goes to step 1; otherwise it stops.

Step 1: This step begins with variable xe1 in the model. Then p – 1 new proportional hazards models (each including one remaining variable along with xe1) are fit, and the results are used to compute the partial likelihood ratio test comparing the fitted two-variable model to the one-variable model containing only xe1,

$$G^{(1)}(j) = 2\left\{L\left(x_{e_{1}}, x_{j}\right) - L\left(x_{e_{1}}\right)\right\}. \qquad (5.4)$$

The p-value for the test of the significance of adding xj to the model containing xe1 is

$$p^{(1)}(j) = \Pr\left[\chi^{2}(1) > G^{(1)}(j)\right]. \qquad (5.5)$$

The variable selected as the candidate for entry at step 2 is xe2 where

$$p^{(1)}(e_{2}) = \min_{j}\left[p^{(1)}(j)\right]. \qquad (5.6)$$

If the selected covariate xe2 is significant, p(1) (e2) < pE, then the program goes to step 2; otherwise it stops.

Step 2: This step begins with both xe1 and xe2 in the model. During this step, two different evaluations occur. The step begins with a backward elimination check for the continued contribution of xe1. That is, does xe1 still contribute to the model after xe2 has been added? This is essentially an evaluation of (5.4) and (5.5) with the roles of the two variables reversed. From an operational point of view, we choose a different significance criterion for this check, denoted pR. We choose this value such that pR > pE to eliminate the possibility of entering and removing the same variable in an endless number of successive steps. Assume the variable entered at step 1 is still significant.

The program fits p – 2 proportional hazards models (each including one remaining variable along with xe1 and xe2) and computes the partial likelihood ratio test and its p-value for the addition of the new covariate to the model, namely

$$G^{(2)}(j) = 2\left\{L\left(x_{e_{1}}, x_{e_{2}}, x_{j}\right) - L\left(x_{e_{1}}, x_{e_{2}}\right)\right\}$$

and

$$p^{(2)}(j) = \Pr\left[\chi^{2}(1) > G^{(2)}(j)\right].$$

The covariate xe3 selected for entry at step 3 is the one with the smallest p-value, that is,

$$p^{(2)}(e_{3}) = \min_{j}\left[p^{(2)}(j)\right].$$

The program proceeds to step 3 if p(2) (e3) < pE; otherwise it stops.

Step 3: Step 3, if reached, is similar to step 2 in that the elimination process determines whether all variables entered into the model at earlier steps are still significant. The selection process then followed is identical to the selection part of earlier steps. This procedure is followed until the last step, step S.

Step S: At this step, one of two things happens: (1) all the covariates are in the model and none may be removed or (2) each covariate not in the model has p(S) (j) > pE. At this point, no covariates are selected for entry and none of the covariates in the model may be removed.

The number of variables selected in any application will depend on the strength of the associations between covariates and survival time and the choice of pE and pR. Due to the multiple testing that occurs, it is nearly impossible to calculate the actual statistical significance of the full stepwise process. Research in linear regression by Bendel and Afifi (1977) and in discriminant analysis by Costanza and Afifi (1979) indicates that use of pE = 0.05 excludes too many important covariates and one should choose a level of significance of 15 percent. In many applications, it may make sense to use 25-50 percent to allow more variables to enter than will ultimately be used and then narrow the field of selected variables using p < 0.15 to obtain a multivariable model for further analysis. An unavoidable problem with any stepwise selection procedure is the potential for the inclusion of “noise” covariates and the exclusion of important covariates. One must always examine the variables selected and excluded for basic scientific plausibility.
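A sketch of the full stepwise machinery follows, reusing fit_cox() from the purposeful selection sketch earlier in the chapter. The single degree-of-freedom partial likelihood ratio test is used throughout, as the description above assumes; the default pE = 0.15 and pR = 0.20 are illustrative choices consistent with the requirement that pR exceed pE.

```python
from scipy.stats import chi2

def lr_pvalue(df, base, extra):
    full = fit_cox(df, base + [extra])
    if not base:                           # compare against model zero
        return full.log_likelihood_ratio_test().p_value
    G = 2 * (full.log_likelihood_ - fit_cox(df, base).log_likelihood_)
    return chi2.sf(G, 1)

def stepwise(df, candidates, pE=0.15, pR=0.20):
    selected = []
    while True:
        # Backward elimination: recheck each covariate already in the model.
        for v in list(selected):
            rest = [u for u in selected if u != v]
            if lr_pvalue(df, rest, v) > pR:
                selected.remove(v)
        # Forward selection: the most significant covariate not in the model.
        remaining = [v for v in candidates if v not in selected]
        pvals = {v: lr_pvalue(df, selected, v) for v in remaining}
        if not pvals or min(pvals.values()) >= pE:
            return selected                # step S: nothing to enter or remove
        selected.append(min(pvals, key=pvals.get))
```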

At this point, the model is likely to contain continuous covariates, and these should be examined carefully for linearity using the previously discussed methods. We then see if any interactions significantly improve the model. The stepwise selection procedure uses as candidate variables a list of plausible interactions among the main effects previously identified during the initial stepwise model building. We begin with a model containing all the main effects, and the final model is selected using usual levels of statistical significance.

As an example of stepwise selection, we consider the WHAS500 data. As in the example in the previous section, the list of candidate variables includes: age, gender, heart rate, systolic blood pressure, diastolic blood pressure, body mass index, history of cardiovascular disease, atrial fibrillation, congestive heart complications, MI order and MI type, for a total of 11 covariates.

In general, the exact order of variable selection will depend on whether we use the partial likelihood ratio test, the score test, or the Wald test as there can be small differences in the magnitude of the three test statistics. In the end, each should select nearly the same set of covariates. The amount of output available to the user at each step varies from package to package. STATA has the least, reporting only the significance level of the variable selected for entry or removal and provides only the fit of the model at the last step. Both SAS and SPSS provide more detailed output.

The results presented in Table 5.14 were obtained using SPSS and the score test. For illustrative purposes, we use entry and removal p-values of pE = 0.25 and pR = 0.8. Many of the covariates have quite large score test values that lead to significance levels of <0.0001, thus making it impossible to see the rank ordering. Hence, we report the score test statistics. We are able to do this in the example because each variable has a single degree of freedom. If we had a nominal scaled covariate with more than two levels, we would have to report p-values. One further complication in reporting the result is that SPSS only provides Wald chi-square statistics for covariates in the fitted model at each step.

There were a total of 7 steps, counting step 0. At step 0, age is the variable with the smallest p-value and the largest score test, with a value of 126.26 and p < 0.0001. Because the p-value for age is smaller than pE = 0.25 (i.e., the score test exceeds the 75th percentile of the chi-square distribution with one degree of freedom, 1.32), the variable enters the model at step 1. At step 1, congestive heart complications (chf) has the largest score statistic, 38.41, thus the smallest p-value, and it is smaller than pE = 0.25. Congestive heart complications enters the model at step 2. At step 2, both age and congestive heart complications have Wald chi-square statistics that exceed the 20th percentile of the chi-square distribution with one degree of freedom, 0.0642, and p-values to remove that are less than pR = 0.8. Thus they remain in the model. Among the variables not in the model, heart rate has the largest score test statistic at 9.11 and the smallest p-value, p = 0.003. Because it is less than the criterion for entry, it enters the model. The program goes to step 3, where the three-variable model is fit. All Wald chi-square tests exceed 0.0642 and thus have p-values to remove that are less than 0.8. No variables are removed from the model. At this point, the variable with the largest score test and smallest p-value for entry is diastolic blood pressure, with a score test of 9.29 and p = 0.002, which is less than pE = 0.25. The program then goes to step 4 and fits the four-variable model.

This process of fitting, checking for continued significance, and selection continues until step 6. At this step, each of the six variables in the model has a p-value less than 0.8 to remove, and the p-values to enter for the five variables not in the model exceed 0.25 (i.e., their score tests are less than 1.32). Therefore, the program terminates the selection process at step 6.

We use the results in Table 5.14 with a significance level of 0.15 to identify the preliminary main effects model by proceeding sequentially to the next step, as long as the p-value for the variable entered is less than 0.15. At step 6 the smallest score statistic is 3.53 for the variable that entered at that step, gender. Its p-value is 0.060, less than 0.15. Thus using the 15 percent rule, we take as our preliminary main effects model the one fit at step 6, which is the same model found by purposeful selection.

Table 5.14 Results of Stepwise Selection of Covariates, Score Test for Entry Below Solid Line, and Wald Chi-Square Test for Removal Above Solid Line in Each Column for WHAS500 data (n = 500). Rows are in Order of Entry


We next check the scale of the continuous covariates in the model, following the same procedure illustrated in the previous section. Because the model is the same as the purposeful selection preliminary main effects model (Table 5.6), our stepwise main effects model is the same as the one in Table 5.9.

Stepwise selection of interactions proceeds using as candidate variables the interactions listed in Table 5.10. At step 0, the model contains all the main effects (i.e., the model in Table 5.9). The same interactions chosen by purposeful selection are selected using a p-value to enter of 0.10, so we do not present the computational details. For this example, the preliminary model chosen by stepwise methods is also the one in Table 5.13, identified using purposeful selection.

5.3.2 BEST SUBSETS SELECTION OF COVARIATES

In the previous section, we discussed stepwise selection of covariates. Most analysts are familiar with its use from other regression modeling settings, and it is available in most major software packages. However, the procedure considers only a small number of the total possible models that can be formed from the covariates (with p covariates there are 2^p − 1 possible non-empty subsets). The method of best subsets selection, if available, provides a computationally efficient way to screen all possible models.

The conceptual basis for best subsets selection of covariates in a proportional hazards regression is the same as in linear regression. The procedure requires a criterion to judge a model. Given the criterion, the software screens all models containing q covariates and reports the covariates in the best, say 5, models for q = 1,2,...,p, where p denotes the total number of covariates.

Software to implement best subsets normal errors linear regression is generally, though not widely, available and has been used to provide best subsets selection capabilities for non-normal errors linear regression models such as logistic regression; see Hosmer and Lemeshow (2000, Chapter 4). There are three requirements to use the method described by Hosmer and Lemeshow: (1) it must be possible to obtain estimates of the coefficients of the model containing all p covariates from a weighted linear regression where the dependent variable is of the form

z = x′β̂ + r/v,

(2) the weight v must be an easily computed function of the variance of the residual r, and (3) both weight and residual must be easily computed functions of the estimated coefficients and covariates. Only requirement (1) is satisfied in the proportional hazards regression model when fit using the partial likelihood. Even though the partial likelihood, see (3.19), is a product of n terms, the terms are not independent of each other. Each “subject” may contribute information to more than one term in the product (i.e., “subjects” appear in every risk set until they fail or are censored). Thus, Hosmer and Lemeshow’s method cannot be used to perform best subsets proportional hazards regression. We do not want to dwell on this point, but feel that it is important to explain why this easily used approach is not appropriate in this setting.
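
For reference, a standard representation of the partial likelihood (our notation; the form in (3.19) may differ slightly) is

    \[
    l_p(\boldsymbol{\beta}) \;=\; \prod_{i=1}^{n}
      \left[ \frac{e^{\mathbf{x}_i' \boldsymbol{\beta}}}
                  {\sum_{j \in R(t_i)} e^{\mathbf{x}_j' \boldsymbol{\beta}}} \right]^{c_i},
    \]

where c_i is the censoring indicator and R(t_i) is the risk set at observed time t_i. A subject remains in R(t_i) for every t_i up to his or her own observed time, so the n factors share observations; this is the dependence referred to above.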

Kuk (1984) described how best subsets selection in a proportional hazards regression model may be performed with a normal errors linear regression best subsets program if the program allows data input in the form of a covariance matrix. Kuk’s method is related to a general method described by Lawless and Singhal (1978). While Kuk’s method is clever, none of the major software packages allows for covariance matrix input. The only best subsets program we are aware of that allowed this type of input to its regression routines is the BMDP (1992) program BMDP9R, which is no longer distributed. Hence, we do not discuss Kuk’s method in this edition. Interested readers can find the details in the first edition, Hosmer and Lemeshow (1999).

An alternative method for best subset selection is to mimic the approach used in stepwise selection and choose as “best” models those in which the covariates in the model have the highest level of significance. Selection of covariates thus proceeds by inclusion rather than exclusion. The best models containing k covariates are those with the largest values of a test of the significance of the model. Theoretically, one could use any one of the three equivalent tests: partial likelihood ratio, Wald or score test. The SAS package, PROC PHREG, has implemented this selection method using the score test. Models identified are, for each fixed number of covariates, the ones with the largest value of the score test.

It is difficult to compare models of different sizes using the score test for model significance because the score test tends to increase with the number of covariates in the model. The most frequently used criterion to compare normal errors linear regression models containing different numbers of covariates is Mallow’s C. See Mallows (1973) and Ryan (1997, Chapter 7) for a discussion of the use of Mallow’s C in normal errors linear regression modeling. Good models are those with small values of Mallow’s C. In the context of the proportional hazards model, Mallow’s C is defined as

(5.7)    C = Wq + (p − 2q),

where p is the number of variables under consideration and q denotes the number of covariates not included in the subset model. The quantity Wq is the value of the multivariable Wald statistic testing that the coefficients for the q covariates are simultaneously equal to zero and is obtained from a fit of the model containing all p covariates. We use score tests, as follows, to approximate the value of Wq in (5.7). Let the score test for the model containing all p covariates be denoted Sp and the score test for the model containing a particular set of k (= p − q) covariates be denoted Sk. The value of the score test for the exclusion of the q covariates from the full p-variable model is approximately Sq = Sp − Sk. Because the Wald and score tests are asymptotically equivalent, this suggests that an approximation to Mallow’s C for a fitted model containing p − q covariates is

(5.8)    C ≈ Sq + (p − 2q) = (Sp − Sk) + (p − 2q).
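
The computation in (5.8) is simple enough to script rather than do by hand; a minimal sketch in Python (the function name is ours) that also reproduces the worked example below:

    def approx_mallows_c(S_p, S_k, p, q):
        """Approximate Mallow's C of (5.8): (S_p - S_k) + (p - 2q).

        S_p: score test for the model containing all p covariates
        S_k: score test for the subset model with k = p - q covariates
        """
        return (S_p - S_k) + (p - 2 * q)

    # Model 1 in Table 5.15: p = 11 covariates considered, q = 5 excluded
    print(approx_mallows_c(208.6223, 206.4896, p=11, q=5))  # 3.1327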

Using SAS’s PROC PHREG with its best subsets (score test) selection option to perform the computations to obtain Sp and Sk, we show in Table 5.15 the five best models from the WHAS500 data. We note that we must compute (5.8) by hand or use a spreadsheet program, which simplifies sorting to find the models with the smallest values of C.

As an example, consider model 1 in Table 5.15, with k = 6 covariates in the model. The value of the score test for the significance of the 11-covariate model is S11 = 208.6223, and the value of the score test for the significance of the 6-covariate model is S6 = 206.4896. The approximation to the score test for the exclusion of the remaining 5 covariates from the full model is

S5 = S11 − S6 = 208.6223 − 206.4896 = 2.1327.

The value of the approximation to Mallow’s C is

C = 2.1327 + (11 − 2 × 5) = 3.133.

The advantage of best subsets over stepwise is illustrated in Table 5.15, where we see that all good models, by Mallow’s C, contain age, hr, diasbp, bmi, and chf. The covariate gender is in four of the five models. Adding sysbp, mitype, or miord does not improve the models. With stepwise, we are able to examine only progressively larger models, not competing models with the same number of covariates. At this point, one strategy would begin by fitting one of the larger models (e.g., the second best, containing age, gender, hr, diasbp, bmi, chf, and mitype) and then use Wald tests and confounding considerations to simplify it. Another possibility is to begin with a model containing the nine different covariates in the five best models. In work not shown, this process leads to the same model found by both purposeful selection and stepwise. Complete agreement in the covariates in the models selected by the three methods may not always occur. However, it is our experience that the three methods select a similar set of covariates.

Table 5.15 Five Best Models Identified Using the Score Test Approximation to Mallow’s C. Model Covariates, Approximate Mallow’s C, and the Approximate Score Test for the Excluded Covariates for the WHAS500 data (n = 500)


When using procedures such as stepwise or best subsets selection to identify possible model covariates, we must remember that the results should be taken only as suggestions for models to be examined in more detail. One cannot rule out the possibility that these methods may reveal new and interesting associations, but the collection of covariates must make clinical sense to the researchers. The statistical selection procedures suggest, but do not dictate, the model.

5.3.3 SELECTING COVARIATES AND CHECKING THEIR SCALE USING MULTIVARIABLE FRACTIONAL POLYNOMIALS

Sauerbrei, Meier-Hirmer, Benner and Royston (2006) describe software for SAS, STATA, and R that implements a multivariable fractional polynomial method, referred to here as mfp, that Royston and Ambler (1999) first wrote for the STATA package. The algorithm combines elements of backward elimination of non-significant covariates with an iterative examination of the scale of all continuous covariates, using either the closed or sequential test procedures described in Section 5.2.1.

The mfp procedure begins by fitting a multivariable model that contains the user-specified covariates. All variables are modeled linearly, and the significance level of their respective Wald tests defines the order in which they are processed in all following steps. This initial collection, ideally, would include all study covariates. However, we may have a setting where there is an inadequate number of events to allow inclusion of all covariates and, in this case, we might choose, as a subset, the clinically important covariates and those significant at, say, the 25 percent level on univariable analysis. In the example below, using the WHAS500 data, we include all covariates except cardiogenic shock and complete heart block, which are excluded for reasons of inadequate data (described previously). The initial model includes all covariates as linear terms in the log hazard. In subsequent fits, each covariate is modeled according to a specified number of degrees of freedom. All dichotomous and design variables have one degree of freedom, meaning they are not candidates for fractional polynomial transformation. Continuous covariates may be forced to be modeled linearly by specifying one degree of freedom, or may be candidates for a one- or two-term fractional polynomial by specifying two or four degrees of freedom, respectively.
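
To make the fractional polynomial candidates concrete, the following Python sketch (our own illustration, not the mfp software) generates the one- and two-term transformations from the usual Royston–Altman power set; an FP1 term costs two degrees of freedom and an FP2 pair four, matching the specification above:

    import numpy as np

    # The customary fractional polynomial powers; 0 is taken to mean log(x).
    FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

    def fp1_term(x, p):
        """One-term (FP1) transformation x**p, with p = 0 meaning log(x)."""
        x = np.asarray(x, dtype=float)  # x must be positive
        return np.log(x) if p == 0 else x ** p

    def fp2_terms(x, p1, p2):
        """Two-term (FP2) transformation; a repeated power (p, p) uses x**p and x**p * log(x)."""
        x = np.asarray(x, dtype=float)
        t1 = fp1_term(x, p1)
        t2 = t1 * np.log(x) if p1 == p2 else fp1_term(x, p2)
        return t1, t2

    # Example: the (2, 3) transformation selected for bmi in Table 5.16
    bmi = np.array([22.5, 27.1, 31.8])   # hypothetical values
    b1, b2 = fp2_terms(bmi, 2, 3)        # bmi**2 and bmi**3

Within each cycle, candidate columns such as these are added to the proportional hazards model and the fitted deviances are compared, as described next.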

Following the initial multivariable linear fit, variables are considered in descending order of their Wald statistics. For covariates modeled with one degree of freedom, a partial likelihood ratio test is used to assess their contribution to the model, and its significance relative to a chosen alpha level is noted. Continuous covariates are examined using either the closed or sequential test method, noting whether the covariate should be removed, kept linear, or transformed. This completes the first cycle.

The second cycle begins with a fit of a multivariable model containing the model from cycle one (i.e., the model with covariates transformed or deleted). All covariates are considered again for possible transformation, inclusion, or exclusion from the model, and are examined in descending order of their Wald statistics. Continuous covariates with a significant fractional polynomial transformation are entered transformed, and the transformed version becomes their null model. The point of this step is twofold: (1) to check whether the transformation “linearizes” the covariate in the log hazard, and (2) to check whether the transformation affects the scaling of other covariates. Each covariate’s level of significance is noted, as well as the need to transform. This completes the second cycle.

The mfp procedure stops when the results of two consecutive cycles are the same; the minimum number of cycles is therefore two. More than two cycles occur if additional transformations of continuous covariates are suggested in cycle two and beyond, or if the level of significance of the partial likelihood ratio test for contribution to the model changes the decision to include or exclude a covariate.

We use the mfp method on a model for the WHAS500 data that begins with all categorically scaled model covariates in Table 5.1, except cardiogenic shock and complete heart block, and all continuous variables in Table 5.2. The dichotomous covariates in Table 5.1 are each modeled with one degree of freedom. The continuous covariates in Table 5.2 are each modeled with up to four degrees of freedom. The process took two cycles to converge, using the 15 percent level of significance for both inclusion and transformation. The results from cycle 2 are shown in Table 5.16.

In this table, the first covariate processed is age, so we know it had the largest Wald statistic. Because age is continuous and we allow up to four degrees of freedom, the first test, line 1, compares a model not containing age to one in which age is transformed by the (3, 3) transformation. The value in the Deviance column, 2301.581, is for the model that excludes age. The value in the G column, 67.477, is the difference between 2301.581 and the Deviance for the best two-term fractional polynomial model. The value in the p column in line 1 is the significance level using four degrees of freedom, p = Pr[χ2(4) ≥ 67.477]. The superscripted “*” means that the test is significant at the user-specified significance level for inclusion in the model, 0.15 in this example. Because the test is significant, the procedure then compares the best two-term fractional polynomial model to the linear model. Hence, the value of 2237.784 in the Deviance column is for a linear model including age. The difference between 2237.784 and the Deviance for the two-term fractional polynomial model is 3.680, and its p-value, computed using 3 degrees of freedom, is 0.298 = Pr[χ2(3) ≥ 3.680]. Because this is not significant at our chosen alpha level, 0.15, the processing of age stops, and the final model, shown in line 3, has age linear in the log hazard.
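
The two tail probabilities quoted above are easy to verify; a sketch using scipy’s chi-square survival function:

    from scipy.stats import chi2

    # Line 1: excluding age vs. the best two-term FP for age, 4 df
    print(chi2.sf(67.477, df=4))   # ~1e-13, reported as < 0.0001
    # Line 2: linear age vs. the best two-term FP for age, 3 df
    print(chi2.sf(3.680, df=3))    # ~0.298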

The second covariate processed is congestive heart complications, chf; it had the second largest Wald statistic. This variable is dichotomous and, as such, we modeled it with a single degree of freedom. Thus, for this covariate there are only two choices: it is either not in the model, or it is in the model and modeled linearly. The Deviance in line 4, 2268.808, is for a model that excludes chf. The test is G = 31.023, which is equal to the difference between 2268.808 and the Deviance from the model including chf, and its p-value, computed using a chi-square distribution with one degree of freedom, is reported as 0.000*, indicating that it is significant at the 15 percent level. The value of “1” in the transformation column means that chf is not transformed (i.e., it is modeled linearly). Hence, the final model for chf is to include chf modeled linearly.

The third variable processed is heart rate. Its results are similar to those for age, except the best two-term fractional polynomial is (–2, –2).

The results for body mass index, bmi, are different and warrant elaboration. The results in line 9 are similar, in terms of the models being compared, to those for age in line 1, except that the best two-term fractional polynomial is (2, 3). The results in line 10 are similar to those for age in line 2. However, in this case the test is significant at the 15 percent level, so processing continues. The results in line 11 compare the best two-term to the best one-term, (–2), fractional polynomial model. The p-value for this comparison is 0.143, which is significant at the 15 percent level, denoted by a “+”. Hence, the final model for bmi is the two-term fractional polynomial (2, 3), shown in line 12.

Processing of diastolic blood pressure is similar to age and that of gender is similar to chf.

You will note that two different significance levels are being used. One controls the fractional polynomial processing; we chose a value of 0.15. If we had used 0.05, then processing of bmi would have chosen a one-term fractional polynomial model, as the p-value comparing the two- to one-term models in line 11 is 0.143. Tests significant at this level are denoted by a “+” following the p-value. The second significance level controls the inclusion and exclusion of covariates from the model. For example, if the p-value in line 1 for age were greater than 0.15, then age would have been tagged for removal. Significant results at this level are denoted by an “*” following the p-value. Examples of variables tagged for removal begin with MI type (mitype) in line 18 and continue through history of cardiovascular disease in line 27. In STATA, the user may choose these values or use the program defaults. The rationale for our choice of 0.15 for both is twofold: we wanted to include main effects that had the possibility to be confounders, and we wanted to see whether one- and two-term transformations were marginally different. We prefer to put the user in control of making important final modeling decisions. We discuss this in greater detail below.

Table 5.16 Results from the Final Cycle of MFP Applied to the WHAS500 data (n = 500)


*: p < chosen significance level for inclusion

+: p < chosen significance level for transformation

All variables are processed until all transformations and decisions to include or exclude are the same for each covariate at two consecutive cycles. The results in Table 5.16 show that the final mfp model contains age, chf, hr, bmi transformed using (2, 3), diasbp, and gender. In this case, the mfp method identified the same main effects model as the other methods, and also indicated that, among the continuous covariates, only bmi needs to be transformed.

The mfp method is clearly an extremely powerful analytic modeling tool which, on the surface, would appear to relieve the analyst of having to think too hard about model content. This is not the case, of course. We recommend that, if one uses mfp, its model be considered a suggestion for a possible main effects model, much in the way that stepwise and best subsets identify possible models. The model needs a thorough evaluation to be sure that all covariates and transformations make clinical sense, that transformations are not caused by a few extreme observations and, quite importantly, that excluded covariates are not confounders of model covariate estimates of effect. We highly recommend you spend time with Royston and Sauerbrei (2006), Sauerbrei, Meier-Hirmer, Benner and Royston (2006), and the host of other excellent papers cited there that describe, in detail, the development and use of both fractional polynomials and the mfp procedure.

In summary, stepwise, best subsets and mfp have their place as covariate selection methods, but it is always the responsibility of the user to choose the content and form of the final model.

5.4 NUMERICAL PROBLEMS

The software available in the major statistical packages for fitting the proportional hazards model is easy to use and, for the most part, contains checks and balances that warn the user of impending numerical disasters. However, certain data configurations cause numerical difficulties that may not produce a suitable warning to the user. The problem of monotone likelihood described by Bryson and Johnson (1981) is one such problem. In a survival analysis, it is similar to the occurrence of a zero-frequency cell in a two-by-two contingency table, or to the complete separation of a continuous covariate by the binary outcome variable in logistic regression. The problem occurs in a proportional hazards regression when the rank ordering of the covariate and the survival times are the same; that is, at each observed survival time, the subject who fails has the largest (smallest) value of one of the covariates among the subjects in the risk set.
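
To see why the likelihood is monotone in this situation, consider a single binary covariate and suppose that at each failure time the subject who fails has x = 1 while the risk set still contains subjects with x = 0. A sketch of the argument (a_i and b_i are our notation for the risk-set counts):

    \[
    l_p(\beta) \;=\; \prod_{i} \frac{e^{\beta}}{a_i e^{\beta} + b_i},
    \]

where a_i ≥ 1 and b_i ≥ 1 are the numbers of subjects in the ith risk set with x = 1 and x = 0, respectively. Each factor is strictly increasing in β and approaches 1/a_i as β → ∞, so the partial likelihood has no finite maximizer and the estimated coefficient diverges.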

To illustrate the problem, we created a hypothetical data set containing 100 observations of survival time in days, truncated at one year, with approximately 30 percent of the observations censored. We created a dichotomous covariate whose value is equal to one if the observed survival time was less than the median and zero otherwise. The results of fitting the proportional hazards model are shown in Table 5.17, where the notation “9.7E6” means 9.7 × 10⁶.
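
A data set in the same spirit is easy to simulate; a sketch in Python using the lifelines package (the construction and seed are ours, so the exact numbers will differ from Table 5.17):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    n = 100
    time = np.minimum(rng.exponential(scale=250, size=n), 365)  # truncated at one year
    event = (rng.uniform(size=n) > 0.3).astype(int)             # ~30 percent censored
    x = (time < np.median(time)).astype(int)                    # tracks the survival ordering

    df = pd.DataFrame({"time": time, "event": event, "x": x})
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")  # expect convergence warnings
    cph.print_summary()  # implausibly large coefficient and standard error for x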

Table 5.17 Estimated Coefficient, Standard Error, z-Score, Two-Tailed p-value, and 95% Confidence Intervals for a Proportional Hazards Model Containing a Monotone Likelihood Covariate (n = 100)


Table 5.18 Estimated Coefficients, Standard Errors, z-Scores, Two-Tailed p-values, and 95% Confidence Intervals for a Proportional Hazards Model Containing Two Highly Correlated Continuous Covariates (n = 100)


The estimated coefficient and its standard error are unreasonably large. The software also required 25 iterations to obtain this value. As in the case of logistic regression, any implausibly large coefficient and standard error is a clear indication of numerical difficulties. In this case, a graph of the covariate versus time would indicate the problem.

The example in Table 5.17 is a simple one because it involves a single covariate. In practice, the situation is likely to be more complex, with a combination of multiple covariates inducing the same effect. Bryson and Johnson (1981) show that certain types of linear combinations (e.g., a simple sum of the covariates) may yield monotone likelihood. In these situations, the problem will manifest itself with unreasonably large coefficients and standard errors.

Extreme collinearity among the covariates is another possible problem. Most software packages contain diagnostic checks for highly correlated data, but clinically implausible results may be produced before the program’s diagnostic switch is tripped. The results of fitting a proportional hazards model when the relationship between the two covariates is x2 = x1 + u, where u is the value of a uniformly distributed random variable on the interval (0, 0.01), are shown in Table 5.18. The correlation between the covariates is effectively 1.0, yet the program prints a result. Similar results were obtained until u ~ U(0, 0.0001), at which point one of the covariates was dropped from the model by the program.
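
The construction is again easy to reproduce in outline; a lifelines sketch under our own data-generating assumptions (depending on the package and the draw, the fit may warn, drop a covariate, or print implausible estimates):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + rng.uniform(0, 0.01, size=n)     # corr(x1, x2) is effectively 1.0
    time = rng.exponential(scale=np.exp(-0.5 * x1), size=n)
    event = np.ones(n, dtype=int)              # no censoring, for simplicity

    df = pd.DataFrame({"time": time, "event": event, "x1": x1, "x2": x2})
    print(df[["x1", "x2"]].corr())             # ~1.0 off the diagonal

    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    cph.print_summary()  # huge offsetting coefficients with enormous standard errors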

The bottom line is that it is ultimately the user of the software who is responsible for the results of an analysis. Any analysis producing “large” effect(s) or standard error(s) should be treated as a “mistake” until the involved covariate(s) are examined critically.

EXERCISES

1. An important step in any model building process is assessing the scale of continuous variables in the model. There are two continuous variables, age and bmi, in the WHAS100 data. Use the methods discussed in this chapter to assess the scale of both in a model containing age and bmi.

2. In this problem, use the ACTG320 data with covariates (see Table 1.5) tx, sex, ivdrug, karnof, cd4, priorzdv, and age. Using the methods for model building discussed in this chapter, find the best model for estimating the effect of the covariates on survival time to AIDS diagnosis or death. This process should include the following steps: variable selection, assessment of the scale of continuous variables, and selection of interactions.

Note: Save any work done for Exercise 2 as there are exercises in Chapter 6 dealing with this model.

3. Without referring to the work by Sauerbrei and Royston (1999), use the methods in this chapter to find the best model for the GBCS data for time to recurrence. Is the same model appropriate for modeling time to death?

1 The German Breast Cancer Study is another good data set, but we could add little to the excellent work illustrating modeling done on these data by Sauerbrei and Royston (1999).

2 We used their transformation on bmi and obtained results and models equivalent to those in Table 5.8.

3 The Stata output does not include line numbers. We included them in Table 5.16 to help in discussing the results.
