Chapter 6

Time to Event Analysis

6.1 Introduction

In survival or reliability analysis, the primary outcome is the time to an event of interest. Often in a clinical trial the goal is to evaluate the effectiveness of a new treatment at prolonging survival; i.e., to extend the time to the event of death. It is usually the case that at the end of followup a portion of the subjects in the trial have not experienced the event; for these subjects the outcome variable is censored. Similarly, in engineering studies, the lifetimes of mechanical or electrical parts are often of interest. In a typical experimental design, lifetimes of these parts are recorded along with covariates (including design variables). Often the lifetimes are called failure times, i.e., times until failure. As in a clinical study, at the end of the experiment there may be parts which are still functioning (censored observations).

In this chapter we discuss standard nonparametric and semiparametric methods for analysis of time to event data. In Section 6.2, we discuss the Kaplan–Meier estimate of the survival function for these models and associated nonparametric tests. Section 6.3 introduces the proportional hazards analysis for these models, while in Section 6.4 we discuss rank-based fits of accelerated failure time models, which include proportional hazards models. We illustrate our discussion with analyses of real datasets based on computation by R functions. For a more complete introduction to survival data we refer the reader to Chapter 7 of Cook and DeMets (2008) or to the monograph by Kalbfleisch and Prentice (2002). Therneau and Grambsch (2000) provide a thorough treatment of modeling survival data using SAS and R/S.

6.2 Kaplan–Meier and Log Rank Test

Let T denote the time to an event. Assume T is a continuous random variable with cdf F(t). The survival function is defined as the probability that a subject survives until at least time t; i.e., S(t) = P(T > t) = 1 − F(t). When all subjects in the trial experience the event during the course of the study, so that there are no censored observations, an estimate of S(t) may be based on the empirical cdf. However, in most studies there are a number of subjects who are not known to have experienced the event prior to the study's completion. Kaplan and Meier (1958) developed their product-limit estimate as an estimate of S(t) which incorporates the information in censored observations. In this section we briefly discuss estimates of the survival function and illustrate them on small samples. The focus, however, is on the R syntax for analysis. We describe how to store time to event data and censoring information in R, as well as the computation of the Kaplan–Meier estimate and the log-rank test, a standard test for comparing two survival distributions.

We begin with a brief overview of survival data as well as simple examples which illustrate the calculation of the Kaplan–Meier estimate.

Example 6.2.1 (Treatment of Pulmonary Metastasis).

In a study of the treatment of pulmonary metastasis arising from osteosarcoma, survival times were collected; the data are provided in Table 6.1.

Table 6.1

Survival Times (in months) for Treatment of Pulmonary Metastasis.

11   13   13   13   13   13   14   14   15   15   17

As there are no censored observations, an estimate of the survival function at time t is

$$\hat{S}(t) = \frac{\#\{t_i > t\}}{n} \qquad (6.1)$$

which is based on the empirical cdf. Because of the small number of distinct time points, the estimate (6.1) is easily calculated by hand, as we briefly illustrate next. Since n = 11, the result is

$$\hat{S}(t) = \begin{cases} 1 & 0 \le t < 11 \\ 10/11 & 11 \le t < 13 \\ 5/11 & 13 \le t < 14 \\ 3/11 & 14 \le t < 15 \\ 1/11 & 15 \le t < 17 \\ 0 & t \ge 17. \end{cases}$$

The estimated survival function is plotted in Figure 6.1.

Figure 6.1

Estimated survival curve $\hat{S}(t)$.
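Although this computation is simple enough by hand, the same estimate can be obtained in R. The following is a minimal sketch, assuming the survival package is loaded, using the data of Table 6.1; with no censored observations, the Kaplan–Meier estimate computed by survfit reduces to the empirical estimate (6.1).

> library(survival)
> times <- c(11, 13, 13, 13, 13, 13, 14, 14, 15, 15, 17)
> km <- survfit(Surv(times, rep(1, 11)) ~ 1)   # status 1: all events observed
> summary(km)$surv   # 10/11, 5/11, 3/11, 1/11, 0, matching (6.1)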

Though (6.1) aids in the understanding of survival functions, it is not often useful in practice. In most clinical studies, at the end of followup there are subjects who have yet to experience the event being studied. In this case, the Kaplan–Meier product-limit estimate is used, which we describe briefly next. Suppose n experimental units are put on test. Let $t_{(1)} < \cdots < t_{(k)}$ denote the ordered distinct event times. If there are censored responses, then k < n. Let $n_i$ denote the number of subjects at risk at the beginning of time $t_{(i)}$ and let $d_i$ denote the number of events occurring at time $t_{(i)}$ (i.e., during that day, month, etc.). The Kaplan–Meier estimate of the survival function is defined as

$$\hat{S}(t) = \prod_{t_{(i)} \le t} \left( 1 - \frac{d_i}{n_i} \right). \qquad (6.2)$$

Note that when there is no censoring (6.2) reduces to (6.1). To aid in interpretation, we illustrate the calculation in the following example.

Example 6.2.2 (Cancer Remission: Time to Relapse).

The data in Table 6.2 represent time to relapse (in months) in a cancer study. Notice, based on the top row of the table, that there are k = 5 distinct event times at which relapse occurred. Table 6.3 illustrates the calculation of the Kaplan–Meier estimate for this dataset.

Table 6.2

Time in Remission (in months) in Cancer Study.

Relapse                                  3   6.5   6.5   10   12   15
Lost to followup                         8.4
Alive and in remission at end of study   4   5.7   10

Table 6.3

Illustration of the Kaplan–Meier Estimate.

t      n    d    1 − d/n       Ŝ(t)
3     10    1    9/10 = 0.9    0.9
6.5    7    2    5/7 = 0.71    0.9 * 0.71 = 0.64
10     4    1    3/4 = 0.75    0.64 * 0.75 = 0.48
12     2    1    1/2 = 0.5     0.48 * 0.5 = 0.24
15     1    1    0/1 = 0.0     0
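The hand calculation in Table 6.3 is easily verified in R. The following is a minimal sketch, assuming the survival package is loaded; the censored observations of Table 6.2 are coded with status 0.

> library(survival)
> time <- c(3, 6.5, 6.5, 10, 12, 15, 8.4, 4, 5.7, 10)
> status <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)   # 1 = relapse, 0 = censored
> summary(survfit(Surv(time, status) ~ 1))    # survival column matches Table 6.3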

Often a survival study involves the effect of different treatments on survival time. Suppose we have r independent groups (treatments). Let H0 be the null hypothesis that the distributions of the groups are the same; i.e., the population survival functions are the same. Overlaid Kaplan–Meier survival curves provide an effective graphical comparison of the times until failure of the different treatment groups. A nonparametric test that is often used to test for a difference in group survival times is the log-rank test. The derivation of this test is involved; complete details can be found, for example, in Kalbfleisch and Prentice (2002). Briefly, as above, let $t_1 < t_2 < \cdots < t_k$ be the distinct failure times of the combined samples. At each time point $t_j$, it can be shown that the number of failures in Group i, conditioned on the total number of failures at $t_j$, has a distribution-free hypergeometric distribution under H0. Based on this, a goodness-of-fit type test statistic (the log-rank statistic) can be formulated which has an asymptotic χ²-distribution with r − 1 degrees of freedom under H0. The next example illustrates this discussion for the time until relapse of two groups of patients who had survived a lobar intracerebral hemorrhage.

Example 6.2.3 (Hemorrhage Data).

For demonstration we use the hemorrhage data discussed in Chapter 6 of Dupont (2002). The study population consisted of patients who had survived a lobar intracerebral hemorrhage and whose genotype was known. The outcome variable was the time until recurrence of lobar intracerebral hemorrhage. The investigators were interested in examining the genetic effect on recurrence, as there were three common alleles: e2, e3, and e4. The analysis focused on the effect of homozygous e3/e3 (Group 1) versus at least one e2 or e4 allele (Group 2). The data are available at the author's website. The following code segment illustrates the creation of a survival object in R, which combines the event times with their censoring information. Many of the functions for survival data are available in the R package survival (Therneau 2013).

> with(hemorrhage,Surv(round(time,2),recur))
 [1] 0.23  1.05+ 1.22  1.38+ 1.41 1.51+  1.58+ 1.58  3.06  3.32
[11] 3.52  3.55  4.04+ 4.63+ 4.76 8.08+  8.44+ 9.53 10.61+ 10.68+
[21] 11.86+ 12.32 13.27+ 13.60+ 14.69+ 15.57 16.72+ 17.84+ 18.04+ 18.46+
[31] 18.46+ 18.46+ 18.66+ 19.15 19.55+ 19.75+ 20.11+ 20.27+ 20.47+ 24.77
[41] 24.87 25.56+ 25.63+ 26.32+ 26.81+ 28.09 30.52+ 32.95+ 33.05+ 33.61
[51] 34.99+ 35.06+ 36.24+ 37.03+ 37.52 37.75+ 38.54+ 38.97+ 39.16+ 40.61+
[61] 42.22+ 42.41+ 42.78+ 42.87 43.27+ 44.65+ 45.24+ 46.29+ 46.88+ 47.57+
[71] 53.88+

In the output are survival times (in months) for the 71 subjects. However, one subject's genotype information is missing, and that subject is excluded from the analysis. Of the remaining 70 subjects, 32 are in Group 1 and 38 are in Group 2. A + sign indicates a censored observation, meaning that at that point in time the subject had yet to report a recurrence; the study could have ended, or the subject could have been lost to followup. Kaplan–Meier estimates are available through the command survfit. The resulting estimates may then be plotted, as is usually done for Kaplan–Meier estimates; the following code illustrates this. If confidence bands are desired, one may use the conf.type argument of survfit; setting conf.type='plain' returns the usual Greenwood (1926) estimates.

> fit <- with(hemorrhage, survfit(Surv(time,recur) ~ genotype))

> plot(fit, lty=1:2,
+   ylab='Probability of Hemorrhage-Free Survival',
+   xlab='Time (in Months)'
+ )
> legend('bottomleft', c('Group 1','Group 2'), lty=1:2, bty='n')
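As a brief sketch of the conf.type option just mentioned (the object name fitG is arbitrary and the call is for illustration only):

> # Kaplan-Meier estimates with Greenwood pointwise confidence intervals
> fitG <- with(hemorrhage,
+   survfit(Surv(time,recur) ~ genotype, conf.type='plain'))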

As illustrated in Figure 6.2, patients that were homozygous e3/e3 (Group 1) seem to have significantly greater survival.

Figure 6.2

Plots of Kaplan–Meier estimated survival distributions.

> with(hemorrhage, survdiff(Surv(time,recur) ~ genotype))

Call:
survdiff(formula = Surv(time, recur) ~ genotype)

n=70, 1 observation deleted due to missingness.

            N Observed Expected (O-E)^2/E (O-E)^2/V
genotype=0 32        4     9.28      3.00      6.28
genotype=1 38       14     8.72      3.19      6.28

Chisq= 6.3  on 1 degrees of freedom, p= 0.0122

Note that the log-rank test statistic is 6.3 with p-value 0.0122 based on a null χ2-distribution with 1 degree of freedom. Thus the log-rank test confirms the difference in survival time of the two groups.

6.2.1 Gehan’s Test

Gehan's test (see Higgins (2003)), sometimes referred to as the Gehan–Wilcoxon test, is an alternative to the log-rank test. Gehan's method is a generalization of the Wilcoxon procedure discussed in Chapter 3. Suppose in a randomized controlled trial subjects are randomized to one of two treatments, with survival times represented by X and Y. Represent the samples as $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, with a censored observation denoted with a plus sign, $X_i^+$, for example. Only unambiguous pairs of observations are used. Not used are ambiguous pairs, such as when an observed X is greater than a censored Y ($X_i > Y_j^+$), or when both observations are censored. The test statistic is the number of times an X clearly beats a Y minus the number of times a Y clearly beats an X. Let $S_1$ denote the set of pairs in which both observations are uncensored, $S_2$ the set of pairs for which X is censored and Y is uncensored, and $S_3$ the set for which Y is censored and X is uncensored. Then Gehan's test statistic can be represented as

$$U = \left( \#_{S_1}\{X_i > Y_j\} + \#_{S_2}\{X_i^+ \ge Y_j\} \right) - \left( \#_{S_1}\{Y_j > X_i\} + \#_{S_3}\{Y_j^+ \ge X_i\} \right).$$
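To make the definition concrete, the following sketch computes U by direct enumeration of the pairs. The helper gehanU is hypothetical, written only to illustrate the definition; the npsm function gehan.test used below computes a standardized version of U along with a p-value.

> gehanU <- function(xt, xe, yt, ye) {
+   # xt, yt: survival times; xe, ye: event indicators (1 = observed, 0 = censored)
+   u <- 0
+   for (i in seq_along(xt)) {
+     for (j in seq_along(yt)) {
+       if (xe[i] == 1 && ye[j] == 1) {        # both observed: set S1
+         u <- u + sign(xt[i] - yt[j])
+       } else if (xe[i] == 0 && ye[j] == 1 && xt[i] >= yt[j]) {
+         u <- u + 1                           # X censored at or after observed Y: set S2
+       } else if (xe[i] == 1 && ye[j] == 0 && yt[j] >= xt[i]) {
+         u <- u - 1                           # Y censored at or after observed X: set S3
+       }                                      # all other pairs are ambiguous and skipped
+     }
+   }
+   u
+ }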

Example 6.2.4 (Higgins’ Cancer Data).

Example 7.3.1 of Higgins (2003) describes an experiment to assess the effect of a new treatment relative to a standard one. The data are in the dataset cancertrt, but for convenience they are also given in Table 6.4. We illustrate the computation of Gehan's test based on the npsm function gehan.test. There are three required arguments to the function: the survival time, an event indicator (signifying that the survival time corresponds to an event and is not censored), and a dichotomous variable representing one of two treatments; see the output of the args function below.

Table 6.4

Survival Times (in days) for Subjects Undergoing the Standard Treatment (S) and a New Treatment (N).

S    94   180+    741   1133   1261   382   567+    988   1355+
N   155    375   951+   1198    175   521   683+   1216+

> args(gehan.test)

function (time, event, trt)
NULL

We use the function gehan.test next on the cancertrt dataset.

> with(cancertrt,gehan.test(time,event,trt))

statistic = -0.6071557 , p-value = 0.5437476

The results agree with those in Higgins (2003); the two-sided p-value of 0.5437 is nonsignificant. As a final note, using the survdiff function with rho=1 gives the Peto–Peto modification of the Gehan test, as sketched below.
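A sketch of that call for the cancertrt data:

> # Peto-Peto modification of the Gehan test via survdiff with rho = 1
> with(cancertrt, survdiff(Surv(time,event) ~ trt, rho=1))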

6.3 Cox Proportional Hazards Models

As in the last section, let T denote the time until the event for an experimental unit and let x denote the corresponding p × 1 vector of covariates. Assume that T is a continuous random variable with pdf and cdf denoted by f(t) and F(t), respectively. Let S(t) = 1 − F(t) denote the survival function of T. Let T0 denote a baseline response; i.e., a response in the absence of all covariate effects.

The hazard function of T, which is often interpreted as the instantaneous chance of the event (death), is defined as

$$h(t) = \frac{f(t)}{S(t)};$$

see expression (6.5) for a formal definition. For a simple but much used example, assume that T0 has the exponential distribution with pdf $f_0(t) = \lambda_0 \exp\{-\lambda_0 t\}$, t > 0. Then the hazard function of T0 has the constant value λ0: since $S_0(t) = e^{-\lambda_0 t}$, we have $h_0(t) = f_0(t)/S_0(t) = \lambda_0$. The proportional hazards model assumes that the hazard function of T is given by

$$\lambda(t; x) = \lambda_0 e^{\beta^T x} \qquad (6.3)$$

where x is a p × 1 vector of covariates and β is a p × 1 vector of parameters. Note that the hazard function of T is proportional to that of T0.

To illustrate these ideas, assume that T0 has constant hazard λ0. Suppose the only covariate is an indicator variable w which is either 0 or 1 depending on whether a subject is not treated or treated. Assuming a proportional hazards model, the hazard function of T is given by

$$\lambda(t; w) = \lambda_0 e^{w\Delta}. \qquad (6.4)$$

The hazard ratio of the experimental treatment relative to the control is then $e^\Delta$; that is, Δ is the log of the hazard ratio. A hazard ratio less than 1 (less hazardous) favors the experimental treatment, while a ratio greater than 1 (more hazardous) favors the control. Further examples are given in Section 7.4.2 of Cook and DeMets (2008).

The proportional hazards model developed by Cox (1972) is a semiparametric model which leaves the baseline hazard function unspecified; only the relative effect of the covariates is estimated. In the simple case under discussion it can be used to estimate the parameter Δ, as shown in the following example.

Example 6.3.1 (Hemorrhage data example, Continued).

As a first example, we again consider Example 6.2.3 concerning the hemorrhage data from the previous section. Using the function coxph from the survival package, we obtain an estimate of Δ and the corresponding inference.

> fit <- coxph(Surv(time,recur) ~ genotype, data=hemorrhage)
> summary(fit)

Call:
coxph(formula = Surv(time, recur) ~ genotype, data = hemorrhage)

 n= 70, number of events= 18
  (1 observation deleted due to missingness)

           coef exp(coef) se(coef)     z Pr(>|z|)
genotype 1.3317    3.7874   0.5699 2.337   0.0195 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
genotype     3.787      0.264     1.239     11.57
Concordance= 0.622 (se = 0.065)
Rsquare= 0.09 (max possible= 0.851)
Likelihood ratio test= 6.61 on 1 df , p=0.01015
Wald test   = 5.46 on 1 df , p=0.01946
Score (logrank) test = 6.28 on 1 df , p=0.01219

From the output we see that Δ̂ = 1.3317, which indicates an increased risk for Group 2, those with at least one e2 or e4 allele. The estimated hazard of hemorrhage for Group 2 is 3.787 times that of Group 1 (homozygous e3/e3), with a 95% confidence interval of (1.239, 11.57). Notice that the value of the score test statistic is the same as that of the log-rank test from the last section.
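The reported interval for the hazard ratio is obtained by exponentiating the endpoints of the Wald interval for Δ; a quick sketch using the estimates in the output above:

> # 95% CI for the hazard ratio: exp(coef +- 1.96 * se)
> exp(1.3317 + c(-1, 1) * qnorm(0.975) * 0.5699)   # approximately (1.239, 11.57)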

More generally, assume that the baseline hazard function is λ0(t). Assume that the hazard function of T is

$$\lambda(t; x) = \lambda_0(t) e^{\beta^T x}.$$

Notice that the hazard ratio of two covariate patterns (e.g., for two subjects) is independent of the baseline hazard:

$$\frac{\lambda(t; x_1)}{\lambda(t; x_2)} = e^{\beta^T (x_1 - x_2)}.$$

We close this section with the following example concerning an investigation with treatment at two levels and several covariates.

Example 6.3.2 (DES for treatment of prostate cancer).

The following example is taken from Collett (2003); the data are available from the publisher's website. Under investigation in this clinical trial was the pharmaceutical agent diethylstilbestrol (DES); subjects were randomized to 1.0 mg DES (treatment = 2) or to placebo (treatment = 1). Covariates include age, serum hemoglobin level, tumor size, and the Gleason index.

In Exercise 6.5.2 the reader is asked to obtain the full model fit for the Cox proportional hazards model. Several of the explanatory variables are nonsignificant, though in practice one may want to include important risk factors such as age in the final model. For demonstration purposes, we have dropped age and shb from the model. As discussed in Collett (2003), the most important predictor variables are size and index.

> f2 <- coxph(Surv(time,event=status) ~ as.factor(treatment)+size+index,
+   data=prostate)
> summary(f2)

Call:
coxph(formula = Surv(time, event = status) ~ as.factor(treatment) +
    size + index, data = prostate)
  n= 38, number of events= 6

                          coef exp(coef) se(coef)      z Pr(>|z|)
as.factor(treatment)2 -1.11272   0.32866  1.20313 -0.925   0.3550
size                   0.08257   1.08608  0.04746  1.740   0.0819 .
index                  0.71025   2.03450  0.33791  2.102   0.0356 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
as.factor(treatment)2    0.3287     3.0426   0.03109     3.474
size                     1.0861     0.9207   0.98961     1.192
index                    2.0345     0.4915   1.04913     3.945

Concordance= 0.873 (se = 0.132)
Rsquare= 0.304 (max possible= 0.616)
Likelihood ratio test= 13.78 on 3 df , p=0.003226
Wald test   = 10.29 on 3 df , p=0.01627
Score (logrank) test = 14.9 on 3 df , p=0.001903

These data suggest that the Gleason index is a significant risk factor for mortality (p-value = 0.0356). Size of tumor is marginally significant (p-value = 0.0819). Given that Δ̂ = −1.11272 < 0, i.e., an estimated hazard ratio of $e^{\hat{\Delta}} = 0.329 < 1$, it appears that DES lowers the risk of mortality; however, the p-value of 0.3550 is nonsignificant.

6.4 Accelerated Failure Time Models

In this section we consider analysis of survival data based on an accelerated failure time model. We assume that all survival times are observed. Rank-based analysis with censored survival times is considered in Jin et al. (2003).

Consider a study on experimental units (subjects) in which data are collected on the time until failure of the subjects. Hence, the setup for this section is the same as in the previous two sections of this chapter, with time until event replaced by time until failure. For such an experiment or study, let T be the time until failure of a subject and let x be the vector of associated covariates. The components of x could be indicators of an underlying experimental design and/or concomitant variables collected to help explain random variability. Note that T > 0 with probability one. Generally, in practice, T has a skewed distribution. As in the last section, let the random variable T0 denote the baseline time until failure. This is the response in the absence of all covariates.

In this section, let g(t; x) and G(t; x) denote the pdf and cdf of T, respectively. In the last section, we introduced the hazard function h(t). A more formal definition of the hazard function is the limit of the rate of instantaneous failure at time t; i.e.,

$$h(t; x) = \lim_{\Delta t \to 0} \frac{P[t < T \le t + \Delta t \mid T > t; x]}{\Delta t} = \lim_{\Delta t \to 0} \frac{g(t; x)\,\Delta t}{\Delta t\,(1 - G(t; x))} = \frac{g(t; x)}{1 - G(t; x)}. \qquad (6.5)$$

Models frequently used with failure time data are the log-linear models

$$Y = \alpha + x^T \beta + \varepsilon, \qquad (6.6)$$

where Y = logT and ε is random error with respective pdf and cdf f(s) and F(s). We assume that the random error ε is free of x. Hence, the baseline response is given by T0 = exp{ε}. Let h0(t) denote the hazard function of T0. Because

$$T = \exp\{Y\} = \exp\{\alpha + x^T \beta + \varepsilon\} = \exp\{\alpha + x^T \beta\} \exp\{\varepsilon\} = \exp\{\alpha + x^T \beta\}\, T_0,$$

it follows that the hazard function of T is

$$h_T(t; x) = \exp\{-(\alpha + x^T \beta)\}\, h_0\!\left( \exp\{-(\alpha + x^T \beta)\}\, t \right). \qquad (6.7)$$

Notice that the effect of the covariate x either accelerates or decelerates the instantaneous failure time of T; hence, log-linear models of the form (6.6) are generally called accelerated failure time models.
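To fix ideas, here is a small simulation sketch of model (6.6) with a single covariate; the names and parameter values are illustrative. With extreme-valued errors the baseline time T0 is exponential, and the covariate multiplicatively rescales, i.e., accelerates or decelerates, the failure time.

> set.seed(123)
> n <- 100; alpha <- 1; beta <- 0.5
> x <- rnorm(n)
> T0 <- rexp(n, rate=1)              # baseline failure times (lambda0 = 1)
> T <- exp(alpha + beta * x) * T0    # T = exp(alpha + x'beta) * T0
> Y <- log(T)                        # Y follows the log-linear model (6.6)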

If T0 has an exponential distribution with mean 1/λ0, then the hazard function of T simplifies to:

$$h_T(t; x) = \lambda_0 \exp\{-(\alpha + x^T \beta)\}; \qquad (6.8)$$

i.e., Cox’s proportional hazard function given by expression (6.3) of the last section. In this case, it follows that the density function of ε is the extreme-valued pdf given by

$$f(s) = \lambda_0 e^{s} \exp\{-\lambda_0 e^{s}\}, \quad -\infty < s < \infty. \qquad (6.9)$$

Accelerated failure time models are discussed in Kalbfleisch and Prentice (2002). As a family of possible error distributions for ε, they suggest the generalized log F family; that is, ε = log T0, where, down to a scale parameter, T0 has an F-distribution with 2m1 and 2m2 degrees of freedom. In this case, we say that ε = log T0 has a GF(2m1, 2m2) distribution. Kalbfleisch and Prentice discuss this family for m1, m2 ≥ 1, while McKean and Sievers (1989) extended it to m1, m2 > 0. This provides a rich family of distributions: the distributions are symmetric for m1 = m2; positively skewed for m1 > m2; negatively skewed for m1 < m2; moderate to light-tailed for m1, m2 > 1; and heavy-tailed for m1, m2 ≤ 1. For m1 = m2 = 1, ε has a logistic distribution, while as m1 = m2 → ∞ the limiting distribution of ε is normal. Also, if one of the mi is one while the other approaches infinity, then the GF distribution approaches an extreme-valued distribution, with pdf of the form (6.9). So at least in the limit, the accelerated GF models encompass the proportional hazards models. See Kalbfleisch and Prentice (2002) and Section 3.10 of Hettmansperger and McKean (2011) for discussion.
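A quick way to get a feel for this family is to simulate from it; the sketch below draws ε = log T0, where T0 has an F-distribution with 2m1 and 2m2 degrees of freedom (the scale parameter is ignored here). With m1 = m2 = 1 the draws are logistic.

> m1 <- 1; m2 <- 1
> eps <- log(rf(10000, 2*m1, 2*m2))   # GF(2, 2): the logistic distribution
> hist(eps, breaks=50, freq=FALSE)    # symmetric; skewed when m1 != m2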

The accelerated failure time models are linear models, so the rank-based fit and associated inference using Wilcoxon scores can be used for their analyses. By a prudent choice of score function, though, the analysis can be optimized. We next discuss optimal score functions for these models and show how to compute the analyses based on them using Rfit. We begin with the proportional hazards model and then discuss the scores for the generalized log F-family.

Suppose a proportional hazards model is appropriate, where the baseline random variable T0 has an exponential distribution with mean 1/λ0. Then ε has the extreme-valued pdf given by (6.9), and, as shown in Exercise 6.5.5, the optimal rank-based score function is φ(u) = −1 − log(1 − u), for 0 < u < 1. A rank-based analysis using this score function is asymptotically fully efficient. These scores are in the package npsm under the name logrankscores. The left panel of Figure 6.3 contains a plot of these scores, while the right panel shows a graph of the corresponding extreme-valued pdf, (6.9). Note that the density has a very light right tail and a much heavier left tail. To guard against the influence of large (in absolute value) observations from the left tail, the scores are bounded on the left, while their behavior on the right accommodates light-tailed error structure. The scores, though, are unbounded on the right and, hence, the resulting R analysis is not bias robust. In the sensitivity analysis discussed in McKean and Sievers (1989), however, the R estimates based on these scores were much less sensitive to outliers than the maximum likelihood estimates. Similar to the normal scores, these log-rank scores appear to be technically bias robust.

Figure 6.3

Log-rank score function.
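The left panel of Figure 6.3 may be reproduced with a few lines of R; a minimal sketch:

> u <- seq(0.01, 0.99, length=99)
> phi <- -1 - log(1 - u)          # log-rank score function
> plot(u, phi, type='l', xlab='u', ylab='phi(u)')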

We illustrate the use of these scores in the next example.

Example 6.4.1 (Simulated Exponential Data).

The data for this model are generated from a proportional hazards model with λ0 = 1 based on the code eps <- log(rexp(10)); x <- 1:10; y <- round(4*x + eps, digits=2). The actual data used are given in Exercise 6.5.11. Using Rfit with the log-rank score function, we obtain the fit of this dataset:

> fit <- rfit(y ~ x, scores=mylogrank)
> summary(fit)

Call:
rfit.default(formula = y ~ x, scores = mylogrank)

Coefficients:
            Estimate Std. Error t.value   p.value
(Intercept) -1.60687    1.49251 -1.0766     0.313
x            4.19125    0.22496 18.6310 7.107e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared (Robust): 0.9287307
Reduction in Dispersion Test: 104.2504 p-value: 1e-05

Note that the true slope of 4 is included in the approximate 95% confidence interval 4.19 ± 2.31 · 0.22.
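That interval is computed as in the following sketch, using the t-critical value with n − 2 = 8 degrees of freedom:

> 4.19125 + c(-1, 1) * qt(0.975, df=8) * 0.22496   # approximately (3.67, 4.71)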

Next, suppose that the random errors in the accelerated failure time model (6.6) have, down to a scale parameter, a GF(2m1, 2m2) distribution. Then, as shown on page 234 of Hettmansperger and McKean (2011), the optimal score function is

$$\varphi_{m_1, m_2}(u) = \frac{m_1 m_2 \left[ \exp\{F^{-1}(u)\} - 1 \right]}{m_2 + m_1 \exp\{F^{-1}(u)\}}, \quad m_1 > 0, \; m_2 > 0, \qquad (6.10)$$

where F is the cdf of ε. Note that, for all values of m1 and m2, these score functions are bounded over the interval (0, 1); hence, the corresponding R analysis is bias robust. These scores are called the generalized log-F scores (GLF). The package npsm contains the R function logfscores which adds these scores to the class of score functions. For this code, we have used the fact that the pth quantile of the F-distribution with 2m1 and 2m2 degrees of freedom satisfies

$$q = \exp\{F_\varepsilon^{-1}(p)\}, \quad \text{where } q = F_{2m_1, 2m_2}^{-1}(p).$$

The default values are set at m1 = m2 = 1, which gives the Wilcoxon scores. Figure 6.4 shows the diversity of these scores for different values of m1 and m2; it contains plots of four of the score functions. The upper left panel displays the scores for m1 = 1 and m2 = 20. These are suitable for error distributions which have a moderately heavy left tail (of the heaviness of a logistic distribution) and a very light right tail. In contrast, the scores for m1 = 1 and m2 = 0.10 are appropriate for a moderately heavy left tail and a very heavy right tail. The lower left panel of the figure is a score function designed for heavy-tailed and symmetric distributions. The final plot, for m1 = 5 and m2 = 0.8, is appropriate for a moderate left tail and a heavy right tail. Note from the degree of downweighting, though, that the right tails for this last case are clearly not as heavy as for the two cases with m2 = 0.10.

Figure 6.4

GLF scores for various settings of m1 and m2.
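Using the quantile identity above, the score function (6.10) is easily evaluated; glfscore below is a hypothetical helper written only to reproduce the shapes in Figure 6.4. As a check, m1 = m2 = 1 yields φ(u) = 2u − 1, the (unstandardized) Wilcoxon score function.

> glfscore <- function(u, m1, m2) {
+   q <- qf(u, 2*m1, 2*m2)         # q = exp{F^{-1}(u)}, as noted above
+   m1 * m2 * (q - 1) / (m2 + m1 * q)
+ }
> u <- seq(0.01, 0.99, length=99)
> plot(u, glfscore(u, 1, 1), type='l')   # reduces to 2u - 1: Wilcoxon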

The next example serves as an application of the log F-scores.

Example 6.4.2 (Insulating Fluid Data).

Hettmansperger and McKean (2011) present an example involving the failure time (T) of an electrical insulating fluid subject to seven different levels of voltage stress (x). The data are in the dataset insulation. Figure 6.5 shows a scatterplot of the log of failure time (Y = log T) versus the voltage stress. As voltage stress increases, the time until failure of the insulating fluid decreases. It appears that a simple linear model suffices. In their discussion, Hettmansperger and McKean recommend a rank-based fit based on generalized log F-scores with m1 = 1 and m2 = 5. This corresponds to a distribution with a left tail as heavy as that of a logistic distribution and a right tail lighter than that of a logistic distribution; i.e., moderately skewed left. The following code segment illustrates computation of the rank-based fit of these data based on this log F-score.

Figure 6.5

Log failure times of the insulating fluid versus the voltage stress.

> myscores <- logfscores
> myscores@param <- c(1, 5)
> fit <- rfit(logfail ~ voltstress, scores=myscores)
> summary(fit)

Call:

rfit.default(formula = logfail ~ voltstress, scores = myscores)

Coefficients:
             Estimate Std. Error t.value   p.value
(Intercept)   63.9596     6.5298   9.795 5.324e-15 ***
voltstress   -17.6624     1.8669  -9.461 2.252e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared (Robust): 0.5092232
Reduction in Dispersion Test: 76.78138 p-value: 0

> fit$tauhat

[1] 1.572306

Not surprisingly, the estimate of the slope is highly significant. As a check on goodness-of-fit, Figure 6.6 presents the Studentized residual plot and the q−q plot of the Studentized residuals versus the quantiles of a log F-distribution with 2 and 10 degrees of freedom. The q−q plot is fairly linear, indicating that an appropriate choice of scores was made.1 The residual plot indicates a good fit. The outliers on the left are mild and, based on the q−q plot, follow the pattern of the log F-distribution with 2 and 10 degrees of freedom.
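The plots in Figure 6.6 may be produced along the following lines; this is a sketch which assumes that fit is the rank-based fit obtained above and that Rfit's rstudent method is available.

> rs <- rstudent(fit)                        # Studentized residuals
> qs <- log(qf(ppoints(length(rs)), 2, 10))  # log F(2, 10) population quantiles
> plot(fit$fitted.values, rs)                # Studentized residual plot
> qqplot(qs, rs)                             # q-q plot against log F(2, 10)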

Figure 6.6

The top panel contains the Studentized residual plot of the rank-based fit using generalized log F-scores with 2 and 10 degrees of freedom. The bottom panel shows the q−q plot of the Studentized residuals versus the quantiles of the log F-distribution with 2 and 10 degrees of freedom.

6.5 Exercises

6.5.1. Using the data discussed in Example 6.2.4:

    (a) Obtain a plot of the Kaplan–Meier estimates for the two treatment groups.
    (b) Obtain the p-value based on the log-rank statistic.
    (c) Obtain the p-value based on the Peto–Peto modification of the Gehan statistic.
6.5.2. Obtain the full model fit of the prostate cancer data discussed in Example 6.3.2. Include age, serum hemoglobin level, size, and the Gleason index. Comment on the similarity or dissimilarity of the estimated regression coefficients to those obtained in Example 6.3.2.

6.5.3. For the dataset hodgkins, plot the Kaplan–Meier estimated survival curves for both treatments. Note that treatment code 1 denotes radiation of the affected node and treatment code 2 denotes total nodal radiation.

6.5.4. To simulate survival data, it is often useful to simulate multiple time points, for example, the time to event and the time to end of study; events occurring after the time to end of study are censored. Suppose the time to the event of interest follows an exponential distribution with mean 5 years and the time to end of study follows an exponential distribution with mean 1.8 years. For a sample size of n = 100, simulate survival times from this model and plot the Kaplan–Meier estimate.

6.5.5. Show that the optimal rank-based score function is φ(u) = −1 − log(1 − u), for 0 < u < 1, for random variables which have the extreme-valued distribution (6.9). In this case, the generated scores are called the log-rank scores.

6.5.6. Consider the dataset rs. This is simulated data from a simple regression model with the true slope parameter at 0.5. The first column is the independent variable x while the second column is the dependent variable y. Obtain the following three fits of the model: least squares, Wilcoxon rank-based, and rank-based using logfscores with m1 = 1 and m2 = 0.10.

    (a) Scatterplot the data and overlay the three fits.
    (b) Obtain Studentized residual plots of all three fits.
    (c) Based on Parts (a) and (b), which fit is worst?
    (d) Compare the two rank-based fits in terms of precision (estimates of $\tau_\varphi$). Which fit is better?
6.5.7. Generate data from a linear model with log-F errors with degrees of freedom 4 and 8 using the following code:

    n <- 75; m1 <- 2; m2 <- 4; x<-rnorm(n,50,10)
    errs1 <- log(rf(n,2*m1,2*m2)); y1 <- x + 30*errs1
    (a) Using logfscores, obtain the optimal scores for this dataset.
    (b) Obtain side-by-side plots of the pdf of the random errors and the scores. Comment on the plots.
    (c) Fit the simple linear model to these data using the optimal scores. Obtain a residual analysis including a Studentized residual plot and a normal q−q plot. Comment on the plots and the quality of the fit.
    (d) Obtain a histogram of the residuals from the fit in Part (c). Overlay the histogram with an estimate of the density and compare it to the plot of the pdf in Part (b).
    (e) Obtain a summary of the fit of the simple linear model using the optimal scores. Obtain a 95% confidence interval for the slope parameter β. Did the interval trap the true parameter?
    (f) Use the fit to obtain a confidence interval for the expected value of y when x = 60.
6.5.8. For the situation described in Exercise 6.5.7, conduct a simulation study comparing the mean squared errors of the estimates of slope based on the Wilcoxon scores and the optimal scores. Use 10,000 simulations.

6.5.9. Consider the failure time data discussed in Example 6.4.2. Recall that the generalized log F-scores with 2m1 = 2 and 2m2 = 10 degrees of freedom were used to compute the rank-based fit. The Studentized residuals from this fit were then used in a q−q plot to check goodness-of-fit, based on the strength of linearity in the plot, where the population quantiles were obtained from a log F-distribution with 2 and 10 degrees of freedom. Obtain the rank-based fits based on the Wilcoxon scores, the normal scores, and the log F-scores with 2m1 = 10 and 2m2 = 2. For each, obtain the q−q plot of the Studentized residuals using as population quantiles the logistic distribution, the normal distribution, and the log F-distribution with 10 and 2 degrees of freedom, respectively. Compare the plots. Which, if any, is most linear?
6.5.10. Suppose we are investigating the relationship between a response Y and an independent variable x. In a planned experiment, we record responses at r values of x, $x_1 < x_2 < \cdots < x_r$. Suppose $n_i$ independent replicates are obtained at $x_i$. Let $Y_{ij}$ denote the response for the jth replicate at $x_i$. Then the model for a linear relationship is

    $$Y_{ij} = \alpha + x_i \beta + e_{ij}, \quad i = 1, \ldots, r; \; j = 1, \ldots, n_i. \qquad (6.11)$$

    In this setting, we can obtain a lack-of-fit test. For this test, the null hypothesis is Model (6.11). For the alternative, we take the most general model, which is a one-way design with r groups; i.e., the model

    $$Y_{ij} = \mu_i + e_{ij}, \quad i = 1, \ldots, r; \; j = 1, \ldots, n_i, \qquad (6.12)$$

    where $\mu_i$ is the median (or mean) of the ith group (the responses at $x_i$). The rank-based drop in dispersion test is easily formulated to test these hypotheses. Select a score function φ. Let D(RED) denote the minimum value of the dispersion function when Model (6.11) is fit and let D(FULL) denote the minimum value of the dispersion function when Model (6.12) is fit. The $F_\varphi$ test statistic is

    $$F_\varphi = \frac{[D(RED) - D(FULL)]/(r - 2)}{\hat{\tau}_\varphi / 2}.$$

    This test statistic should be compared with F-critical values having r − 2 and n − r degrees of freedom, where $n = \sum_i n_i$ is the total sample size. In general the drop in dispersion test is computed by the function drop.test. Carry out this test for the data in Example 6.4.2 using the log F-scores with 2m1 = 2 and 2m2 = 10 degrees of freedom.

6.5.11. The data for Example 6.4.1 are:

    x      1      2      3       4       5       6       7       8       9      10
    y   2.84   6.52   6.87   16.43   18.17   25.24   28.15   31.65   36.37   38.84

    (a) Using Rfit, verify the analysis presented in Example 6.4.1.
    (b) Obtain the Studentized residuals from the fit. Comment on the residual plot.
    (c) Obtain the q−q plot of the sorted residuals of Part (b) versus the quantiles of the random variable ε, which is distributed as the log of an exponential. Comment on the linearity of the q−q plot.

1 See the discussion in Section 3.10 of Hettmansperger and McKean (2011).
