Maximum Likelihood Estimation

All models in PROC LIFEREG are estimated by the method of maximum likelihood. This section explores some of the basics of ML estimation, with an emphasis on how it handles censored observations. The discussion is not intended to be rigorous. If you want a more complete and careful treatment of ML, you should consult one of the many texts available on the subject. (Kalbfleisch and Prentice, 1980, Chapter 3, gives a more detailed introduction in the context of survival analysis).

ML is a quite general approach to estimation that has become popular in many different areas of application. There are two reasons for this popularity. First, ML produces estimators that have good large-sample properties. Provided that certain regularity conditions are met, ML estimators are consistent, asymptotically efficient, and asymptotically normal. Consistency means that the estimates converge in probability to the true values as the sample gets larger, implying that the estimates will be approximately unbiased in large samples. Asymptotically efficient means that, in large samples, the estimates will have standard errors that are (approximately) at least as small as those for any other estimation method. And, finally, asymptotically normal means that the sampling distribution of the estimates will be approximately normal in large samples, which means that you can use the normal and chi-square distributions to compute confidence intervals and p-values.

All these approximations get better as the sample size gets larger. The fact that these desirable properties have only been proven for large samples does not mean that ML has bad properties for small samples. It simply means that we usually don’t know what the small-sample properties are. And in the absence of attractive alternatives, researchers routinely use ML estimation for both large and small samples. Although I won’t argue against that practice, I do urge caution in interpreting p-values and confidence intervals when samples are small. Despite the temptation to accept larger p-values as evidence against the null hypothesis in small samples, it is actually more reasonable to demand smaller values to compensate for the fact that the approximation to the normal or chi-square distributions may be poor.

The other reason for ML’s popularity is that it is often straightforward to derive ML estimators when there are no other obvious possibilities. As we will see, one case that ML handles nicely is data with censored observations. While you can use least squares with certain adjustments for censoring (Lawless 1982, p. 328), such estimates often have much larger standard errors, and there is little available theory to justify the construction of hypothesis tests or confidence intervals.

The basic principle of ML is to choose as estimates those values that will maximize the probability of observing what we have, in fact, observed. There are two steps to this: (1) write down an expression for the probability of the data as a function of the unknown parameters, and (2) find the values of the unknown parameters that make the value of this expression as large as possible.

The first step is known as constructing the likelihood function. To accomplish this, you must specify a model, which amounts to choosing a probability distribution for the dependent variable and choosing a functional form that relates the parameters of this distribution to the values of the covariates. We have already considered those two choices.

The second step—maximization—typically requires an iterative numerical method, that is, one involving successive approximations. Such methods are often computationally demanding, which explains why ML estimation has become popular only in the last two decades.

In the next section, I work through the basic mathematics of constructing and maximizing the likelihood function. You can skip this part without loss of continuity if you’re not interested in the details or if you simply want to postpone the effort. Immediately after this section, I discuss some of the practical details of ML estimation with PROC LIFEREG.

Maximum Likelihood Estimation: Mathematics

Assume that we have n independent individuals (i = 1, ..., n). For each individual i, the data consist of three parts: ti, δi, and xi, where ti is the time of the event or the time of censoring, δi is an indicator variable with a value of 1 if ti is uncensored or 0 if censored, and xi = [1 xi1 ... xik]′ is a vector of covariate values (the 1 is for the intercept). For simplicity, we will treat δi and xi as fixed rather than random. We could get equivalent results if δi were random but noninformative, and if the distributions of δi and ti were expressed conditional on the values of xi. But that would just complicate the notation.

For the moment, suppose that all the observations are uncensored. Since we are assuming independence, it follows that the probability of the entire data is found by taking the product of the probabilities of the data for every individual. Because ti is assumed to be measured on a continuum, the probability that it will take on any specific value is 0. Instead, we represent the probability of each observation by the p.d.f. f(ti). Thus, the probability (or likelihood) of the data is given by the following, where Π indicates repeated multiplication:

$$L = \prod_{i=1}^{n} f_i(t_i)$$

Notice that fi is subscripted to indicate that each individual has a different p.d.f. that depends on the covariates.

To proceed further, we need to substitute an expression for fi(ti) that involves the covariates and the unknown parameters. Before we do that, however, let’s see how this likelihood is altered if we have censored cases. If an individual is censored at time ti, all we know is that this individual’s event time is greater than ti. But the probability of an event time greater than ti is given by the survivor function S(t) evaluated at time ti. Now suppose that we have r uncensored observations and n − r censored observations. If we arrange the data so that all the uncensored cases come first, we can write the likelihood as

$$L = \prod_{i=1}^{r} f_i(t_i) \prod_{i=r+1}^{n} S_i(t_i)$$

where, again, we subscript the survivor function to indicate that it depends on the covariates. Using the censoring indicator δ, we can equivalently write this as

$$L = \prod_{i=1}^{n} \bigl[f_i(t_i)\bigr]^{\delta_i} \bigl[S_i(t_i)\bigr]^{1-\delta_i}$$

Here δi acts as a switch, turning the appropriate function on or off, depending on whether the observation is censored. As a result, we do not need to order the observations by censoring status. This last expression, which applies to all the models that PROC LIFEREG estimates with right-censored data, shows how censored and uncensored cases are combined in ML estimation.

Once we choose a particular model, we can substitute appropriate expressions for the p.d.f. and the survivor function. Take the simplest case—the exponential model. We have

$$f_i(t_i) = \lambda_i e^{-\lambda_i t_i} \qquad \text{and} \qquad S_i(t_i) = e^{-\lambda_i t_i}$$

where λi = exp{−βxi} and β is a vector of coefficients. Substituting, we get

$$L = \prod_{i=1}^{n} \bigl[\lambda_i e^{-\lambda_i t_i}\bigr]^{\delta_i} \bigl[e^{-\lambda_i t_i}\bigr]^{1-\delta_i} = \prod_{i=1}^{n} \lambda_i^{\delta_i}\, e^{-\lambda_i t_i}$$

Although this expression can be maximized directly, it is generally easier to work with the natural logarithm of the likelihood function because products get converted into sums and exponents become coefficients. Because the logarithm is an increasing function, whatever maximizes the logarithm also maximizes the original function.

Taking the logarithm of both sides, we get

$$\log L = \sum_{i=1}^{n} \delta_i \log \lambda_i - \sum_{i=1}^{n} \lambda_i t_i = -\sum_{i=1}^{n} \delta_i \beta x_i - \sum_{i=1}^{n} t_i e^{-\beta x_i}$$

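To make this expression concrete, here is a small sketch in SAS/IML that evaluates the exponential log-likelihood for a trial value of β. The data are made up purely for illustration (ten observations, one covariate z), and all names are my own rather than anything built into SAS:

proc iml;
/* Hypothetical data: survival times t, censoring indicators d
   (1 = uncensored, 0 = censored), and a single covariate z */
t = {9, 13, 13, 18, 23, 28, 31, 34, 45, 48};
d = {1,  1,  0,  1,  1,  0,  1,  1,  0,  1};
z = {1,  0,  1,  1,  0,  0,  1,  0,  1,  0};
X = j(nrow(t), 1, 1) || z;           /* design matrix with intercept column */

start logLik(beta) global(t, d, X);
   lambda = exp(-X * beta);          /* lambda_i = exp(-beta x_i) */
   return( sum(d # log(lambda)) - sum(lambda # t) );
finish;

beta0 = {0, 0};                      /* an arbitrary trial value */
print (logLik(beta0))[label="log-likelihood at beta = 0"];
quit;

Evaluating the function at a few trial values of β gives a feel for the surface that the maximization routine must climb.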
This brings us to step 2, choosing values of β that make this expression as large as possible. There are many different methods for maximizing functions like this. One well-known approach is to find the derivative of the function with respect to β, set the derivative equal to 0, and then solve for β. Taking the derivative and setting it equal to 0 gives us

$$\frac{\partial \log L}{\partial \beta} = -\sum_{i=1}^{n} \delta_i x_i + \sum_{i=1}^{n} t_i x_i e^{-\beta x_i} = 0$$

Because xi is a vector, this is actually a system of k + 1 equations, one for each element of β.

While these equations are not terribly complicated, the problem is that they involve nonlinear functions of β. Consequently, except in special cases (like a single dichotomous x variable), there is no explicit solution. Instead, we have to rely on iterative methods, which amount to successive approximations to the solution until the approximations converge to the correct value. Again, there are many different methods for doing this. All give the same solution, but they differ in such factors as speed of convergence, sensitivity to starting values, and computational difficulty at each iteration.

PROC LIFEREG uses the Newton-Raphson algorithm (actually a ridge stabilized version of the algorithm), which is by far the most popular numerical method for solving for β. The method is named after Isaac Newton, who devised it for a single equation and a single unknown. But who was Raphson? Some say he was Newton’s programmer. Actually Joseph Raphson was a younger contemporary of Newton who generalized the algorithm to multiple equations with multiple unknowns.

The Newton-Raphson algorithm can be described as follows. Let U(β) be the vector of first derivatives of log L with respect to β, and let I(β) be the negative of the matrix of second derivatives of log L with respect to β. That is,

$$U(\beta) = \frac{\partial \log L}{\partial \beta} \qquad\text{and}\qquad I(\beta) = -\frac{\partial^2 \log L}{\partial \beta\, \partial \beta'}$$

The vector of first derivatives U(β) is sometimes called the gradient or score, while the matrix of second derivatives (without the minus sign) is called the Hessian; I(β) itself, the negative of the Hessian, is known as the observed information matrix. The Newton-Raphson algorithm is then

$$\beta_{j+1} = \beta_j + I^{-1}(\beta_j)\, U(\beta_j) \tag{4.4}$$

where I^{-1} is the inverse of I. In practice, we need a set of starting values β0, which PROC LIFEREG calculates by using ordinary least squares, treating the censored observations as though they were uncensored. These starting values are substituted into the right side of equation (4.4), which yields the result for the first iteration, β1. These values are then substituted back into the right side, the first and second derivatives are recomputed, and the result is β2. This process is repeated until the maximum change in the parameter estimates from one step to the next is less than .001. (This is an absolute change if the current parameter value is less than .01; otherwise it is a relative change.)
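Here is a sketch of the Newton-Raphson iteration in equation (4.4) for the exponential model, again in SAS/IML with the same made-up data as before. Two simplifications relative to PROC LIFEREG: the iteration starts from zeros rather than OLS starting values, and the ridge stabilization is omitted:

proc iml;
t = {9, 13, 13, 18, 23, 28, 31, 34, 45, 48};
d = {1,  1,  0,  1,  1,  0,  1,  1,  0,  1};
z = {1,  0,  1,  1,  0,  0,  1,  0,  1,  0};
X = j(nrow(t), 1, 1) || z;

beta = {0, 0};                       /* crude starting values (LIFEREG uses OLS) */
do iter = 1 to 50;
   lambda = exp(-X * beta);
   U    = X` * (t # lambda - d);     /* gradient U(beta) */
   Info = X` * ((t # lambda) # X);   /* I(beta), the negative Hessian */
   step = solve(Info, U);            /* I^{-1}U without forming the inverse */
   beta = beta + step;
   if max(abs(step)) < 1e-3 then leave;   /* crude version of the convergence test */
end;

cov = inv(Info);                     /* estimated covariance matrix at convergence */
se  = sqrt(vecdiag(cov));
print iter beta se;
quit;

The last two lines anticipate the next paragraph: the inverse of I at the final iteration serves as an estimated covariance matrix, and the square roots of its diagonal elements give the standard errors.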

Once the solution is found, a byproduct of the Newton-Raphson algorithm is an estimate of the covariance matrix of the coefficients, which is just I^{-1}(β̂), the inverse of the information matrix evaluated at the final estimates. This matrix, which can be printed by listing COVB as an option in the MODEL statement, is often useful for constructing hypothesis tests about linear combinations of coefficients. PROC LIFEREG computes standard errors of the parameters by taking the square roots of the main diagonal elements of this matrix.

Maximum Likelihood Estimation: Practical Details

PROC LIFEREG chooses parameter estimates that maximize the logarithm of the likelihood of the data. For the most part, the iterative methods used to accomplish this task work quite well with no attention from the data analyst. If you’re curious to see how the iterative process works, you can request ITPRINT as an option in the MODEL statement. Then, for each iteration, PROC LIFEREG will print out the log-likelihood and the parameter estimates. When the iterations are complete, the final gradient vector and the negative of the Hessian matrix will also be printed (see the preceding section for definitions of these quantities). When the exponential model was fitted to the recidivism data, the ITPRINT output revealed that it took six iterations to reach a solution. The log-likelihood for the starting values was –531.1, which increased to –327.5 at convergence. Examination of the coefficient estimates showed only slight changes after the fourth iteration. By comparison, the generalized gamma model took 13 iterations to converge.
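For reference, a call that requests the iteration history and the covariance matrix might look like the following. The data set and variable names (recid, week, arrest, fin, age, prio) are assumptions standing in for the recidivism data, which is not listed in this section:

proc lifereg data=recid;
   model week*arrest(0) = fin age prio / dist=exponential itprint covb;
run;

Here week*arrest(0) declares week as the event or censoring time, with arrest=0 flagging the censored cases.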

Occasionally the algorithm fails to converge, although this seems to occur much less frequently than it does with logistic regression. In general, nonconvergence is more likely to occur when samples are small, when censoring is heavy, or when many parameters are being estimated. There is one situation, in particular, that guarantees nonconvergence (at least in principle). If all the cases at one value of a dichotomous covariate are censored, the coefficient for that variable becomes larger in magnitude at each iteration. Here’s why: the coefficient of a dichotomous covariate is a function of the logarithm of the ratio of the hazards for the two groups. But if all the cases in a group are censored, the ML estimate for the hazard in that group is 0. If the 0 is in the denominator of the ratio, then the coefficient tends toward plus infinity. If it’s in the numerator, taking the logarithm yields a result that tends toward minus infinity. By extension, if a covariate has multiple values that are treated as a set of dichotomous variables (e.g., with a CLASS statement) and all cases are censored for one or more of the values, nonconvergence should result. When this happens, there is no ideal solution. You can remove the offending variable from the model, but that variable may actually be one of the strongest predictors. When the variable has more than two values, you can combine adjacent values or treat the variable as quantitative.
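To see this in action, here is a hypothetical demonstration (the data are simulated, and all names are mine): every case in group 1 is censored, so the coefficient for GROUP should grow in magnitude from one iteration to the next instead of converging, and ITPRINT lets you watch it happen:

data allcens;
   call streaminit(54321);
   do i = 1 to 100;
      group = (i > 50);
      t = 10 * rand('exponential');   /* simulated event times */
      event = 1;
      if group = 1 then do;           /* censor every case in group 1 */
         event = 0;
         t = 10;
      end;
      output;
   end;
run;

proc lifereg data=allcens;
   model t*event(0) = group / dist=exponential itprint;
run;

Depending on the SAS release, the run may end with one of the warnings described below, or it may simply report a large coefficient and a huge standard error for GROUP.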

PROC LIFEREG has two ways of alerting you to convergence problems. If the number of iterations exceeds the maximum allowed (the default is 50), SAS issues the message

WARNING: Convergence not attained in 50 iterations.
WARNING: The procedure is continuing but the validity of the model fit is questionable.

If it detects a problem before the iteration limit is reached, the software says

WARNING: The negative of the Hessian is not positive definite. The convergence is questionable.

Unfortunately, PROC LIFEREG sometimes reports estimates and gives no warning message in situations that are fundamentally nonconvergent. The only indication of a problem is a coefficient that is large in magnitude together with a huge standard error.

It’s tempting to try to get convergence by raising the default maximum number of iterations or by relaxing the convergence criterion. This rarely works, however, so don’t get your hopes up. You can raise the maximum with the MAXITER= option in the MODEL statement. You can alter the convergence criterion with the CONVERGE= option, but I don’t recommend this unless you know what you’re doing. Too large a value could make it seem that convergence had occurred when there is actually no ML solution.
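For completeness, here is how those options are specified, again using the assumed data set and variable names from the earlier example (the particular values shown are arbitrary):

proc lifereg data=recid;
   model week*arrest(0) = fin age prio / dist=exponential
         maxiter=100 converge=1e-6;
run;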
