Chapter 4

Restricted maximum likelihood and inference of random effects in linear mixed models

Abstract

Chapter 4 concerns the restricted maximum likelihood (REML) estimator and some other Bayes-type techniques applied in longitudinal data analysis. Therefore, Bayes’ theorem and Bayesian inference are reviewed first to familiarize the reader with the rationale of various Bayes-type models and methods included in the current and many of the succeeding chapters. Next, the general specifications and inference of the REML estimator are delineated. Two computational procedures are then displayed for the estimation of model parameters described in Chapter 3: the Newton–Raphson (NR) and the Expectation–Maximization (EM) algorithms. The best linear unbiased predictor (BLUP) is then presented, a popular method to approximate the subject-specific random effects in longitudinal data analysis. Corresponding to the specification of BLUP, statistical shrinkage is also introduced, which serves as a powerful statistical technique in linear predictions of longitudinal response outcomes. Lastly, an empirical illustration is provided, in which the analytic results from the ML and REML estimators are compared, and the longitudinal trajectory of the response variable is predicted.

Keywords

Bayesian inference
best linear unbiased predictor (BLUP)
Expectation–Maximization (EM) algorithm
Newton–Raphson (NR) algorithm
restricted maximum likelihood estimator (REML)
shrinkage
As described in Chapter 3, given its desirable large-sample properties, maximization of the log-likelihood function on longitudinal data yields statistically robust and consistent estimates of the coefficients $\beta$ in linear mixed models. When the sample size is small, however, the ML estimator is criticized for failing to account for the loss of degrees of freedom incurred in estimating $\beta$, thereby resulting in downward bias in the variance estimates (Patterson and Thompson, 1971). In those situations, the restricted maximum likelihood (REML) approach can be applied to correct the bias. Compared to MLE, REML accounts for the loss of degrees of freedom in estimating the variance components by using error contrasts, linear combinations of the data with zero expectation. It then maximizes the likelihood function of the distribution of these contrasts. Harville (1974) proves from a Bayesian perspective that using only error contrasts to make inference on the variance components is equivalent to ignoring any prior information on the fixed effects and using all the data. Given this proof, REML is recognized as an empirical Bayes method.
As summarized in Chapter 3, inference for linear mixed models consists of three estimating procedures: for the fixed-effects parameters $\beta$, the random effects $b_i$, and the variance–covariance components $R$ and $G$. As briefly indicated in Chapter 3, these procedures are often performed separately. I described the general specification and inference of the first and the third components in the last chapter, but not their computational details; nor did I describe the second procedure of the inference, the estimation of the random effects $b_i$. As the specified random effects are not empirically observable, the estimation of $b_i$ requires a complex approximation process. In some sense, it is more appropriate to state that the random effects are predicted, rather than estimated, given the observed data and the fixed-effects estimates. In predicting the random effects, Bayes' theorem and empirical Bayes techniques play an extremely important role.
In this chapter, Bayes’ theorem and Bayesian inference are reviewed first to help the reader understand the rationale of the REML estimator and the methods for approximating the random effects. The reader can also benefit from a brief overview of this methodology to set a solid foundation for comprehending various Bayes-type models and methods described in many of the succeeding chapters. Next, the general specifications and inference of the REML estimator are described. The third section presents two computational procedures for estimating parameters described in Chapter 3 and in Section 4.2: the Newton–Raphson (NR) and the Expectation–Maximization (EM) algorithms. The methods for approximating the subject-specific random effects are then delineated. Lastly, by following the two examples provided in Chapter 3, analytic results derived from the ML and REML estimators are compared, and longitudinal trajectories of the response variables are predicted and displayed.

4.1. Overview of Bayesian inference

Because it provides a flexible and practical approximating system, Bayesian inference and its associated algorithms have gained tremendous popularity in longitudinal data analysis in recent decades. With their widespread applications, Bayes formulations will be used frequently in the succeeding chapters when various Bayes-type techniques are described. Therefore, it is essential for the reader to grasp the basic concepts and the underlying properties of Bayesian inference described in this section before studying the details of various Bayes-type models.
By definition, Bayesian inference is the process of fitting a probability model to a set of data (Gelman et al., 2004). The inference summarizes the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations. Given such a capacity of using probability models, Bayesian inference is valuable to quantify uncertainty in regression models, both generally and with specific regard to longitudinal data analysis.
In creating a Bayes model, the first step is to establish a joint distribution for all observable and unobservable quantities in a topic of interest. Let $y = (y_1, y_2, \ldots, y_N)$ be a random vector of observed outcome data for $N$ units. If each unit has more than one observation, as in the case of longitudinal data, the dataset $y$ is specified as a random block vector in which each element $y_i$ is itself a vector of repeated measurements. Let $\theta$ be the unobservable vector of the population parameters (e.g., the subject-specific random effects in longitudinal analysis) and $X$ be a random matrix of the explanatory variables or covariates with $M$ columns. In Bayes' theorem, $\theta$ is expressed as a random vector estimated from data $y$, which cannot be determined exactly. Given these conditions, uncertainty about $\theta$ is expressed through probability statements and distributions. The probability statements are conditional on the observed outcome value $y$ and implicitly on the observed values of the covariates $X$, with the density of $\theta$ written as $p(\theta \mid y)$. The function $p(\theta \mid y)$ represents a conditional probability density with the argument determined by the observed data. With such probability statements on $\theta$, Bayesian inference can be described by three steps.
First, set up a probability distribution for $\theta$, defined as $\tilde{\pi}(\theta)$ and referred to, in the terminology of Bayesian inference, as the prior distribution or simply the prior. The specification of the prior distribution depends on the researcher's knowledge about the parameter before the data are examined (e.g., the normal distribution of the random effects in longitudinal data analysis).
Second, given the observed dataset $y$, a statistical model is chosen to formulate the distribution of $y$ given $\theta$, written as $p(y \mid \theta)$.
Third, knowledge about $\theta$ is updated by combining information from the prior distribution and the data through the calculation of $p(\theta \mid y)$, referred to as the posterior distribution. In a sense, the posterior distribution is simply the empirical realization of a probability distribution based on a prior.
The execution of the third step is based on a basic property of conditional probability, referred to in Bayesian inference as Bayes' rule. Specifically, the posterior density $p(\theta \mid y)$ can be expressed in terms of the product of two densities, the prior distribution $\tilde{\pi}(\theta)$ and the sampling distribution $p(y \mid \theta)$; their product is the joint probability distribution, given by

$$p(\theta, y) = \tilde{\pi}(\theta)\, p(y \mid \theta).$$
The conditional probability of $\theta$ given the known value of the data $y$ can be written as

$$p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{p(y \mid \theta)\,\tilde{\pi}(\theta)}{p(y)} = \frac{p(y \mid \theta)\,\tilde{\pi}(\theta)}{\int p(y \mid \theta)\,\tilde{\pi}(\theta)\, d\theta}, \tag{4.1}$$
where the denominator in the third equality, $\int p(y \mid \theta)\,\tilde{\pi}(\theta)\, d\theta$, is the Bayes expression of $p(y)$, assuming $\theta$ to be continuous. In Bayesian inference, this expression is referred to as the marginal distribution of $y$ or the prior predictive distribution (Gelman et al., 2004). The specification of this term is the key to understanding Bayesian inference as fitting a probability model to data. To better understand the rationale of Bayesian inference, the reader not well versed in calculus might want to view the marginal distribution of the data as a weighted average of the sampling distribution, with weights given by the prior of $\theta$ (the discrete realization of an integral).
Given observed data $y$, an unknown observable $\tilde{y}$ can be predicted from the above inference. For example, let $y = (y_1, y_2, \ldots, y_N)$ be the vector of recorded values of an object weighed $N$ times on a scale, and let $\theta = (\mu, \sigma^2)$ contain the unknown true weight $\mu$ and the measurement variance $\sigma^2$. Then, the distribution of a new measurement $\tilde{y}$, given $y$, is given by

$$p(\tilde{y} \mid y) = \int p(\tilde{y}, \theta \mid y)\, d\theta = \int p(\tilde{y} \mid \theta, y)\, p(\theta \mid y)\, d\theta = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta. \tag{4.2}$$
Given Equation (4.2), the distribution or density of $\tilde{y}$ is referred to as the posterior predictive distribution: posterior because it is conditional on the observed $y$, and predictive because it is a prediction for an observable $\tilde{y}$. The third equality in Equation (4.2) expresses the posterior predictive distribution as an integral of the conditional predictions over the posterior distribution of $\theta$. This equation is important in longitudinal data analysis because there is generally unobserved heterogeneity across subjects, and with the above inference it is practical to specify mixed-effects models in terms of the marginal mean given the distribution of the random effects. By viewing $\theta$ as an unobservable parameter, the researcher can use Equation (4.2) to express the marginal mean of a posterior predictive distribution with prior knowledge about the distribution of $\theta$. The reader might want to link this inference to longitudinal data analysis: the subject-specific random effects are specified as a prior to account for intraindividual correlation, and the marginal mean of the posterior predictive distribution is the prediction given the prior distribution.
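To make Equations (4.1) and (4.2) concrete, the following sketch works through the weighing example under the simplifying assumption that the measurement variance is known and that the unknown weight has a normal prior, so both the posterior and the posterior predictive distribution have closed forms. All numerical values (the prior mean and variance, the true weight, and the number of weighings) are hypothetical choices made only for illustration.

import numpy as np

# Hypothetical weighing example for Equations (4.1) and (4.2): an object is
# weighed several times; measurement errors are N(0, sigma2) with sigma2 known,
# and the unknown true weight mu has a normal prior N(m0, s0_sq).
rng = np.random.default_rng(0)
true_mu, sigma2 = 10.0, 0.5**2          # unknown truth and known error variance
m0, s0_sq = 8.0, 2.0**2                 # prior mean and prior variance for mu
y = rng.normal(true_mu, np.sqrt(sigma2), size=6)    # observed data y

# Posterior p(mu | y) is normal (conjugate update), Equation (4.1).
n = y.size
post_var = 1.0 / (1.0 / s0_sq + n / sigma2)
post_mean = post_var * (m0 / s0_sq + y.sum() / sigma2)

# Posterior predictive p(y_tilde | y), Equation (4.2): normal with the posterior
# mean and with variance equal to the posterior variance plus sigma2.
pred_mean, pred_var = post_mean, post_var + sigma2

print(f"posterior:  N({post_mean:.3f}, {post_var:.4f})")
print(f"predictive: N({pred_mean:.3f}, {pred_var:.4f})")

With a very diffuse prior, the posterior mean approaches the sample mean of the weighings, which foreshadows the connection between Bayesian updating and the shrinkage predictions discussed later in this chapter.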
In Bayesian inference, given a chosen probability model, the data $y$ affect the posterior inference only through the function $p(y \mid \theta)$, which, as a function of $\theta$ for fixed data $y$, is the likelihood function. The likelihood function of $\theta$ can be any function proportional to $p(y \mid \theta)$, written as

$$L(\theta) \propto p(y \mid \theta). \tag{4.3}$$
With the above specification, a Bayes-type model can be expressed in terms of a likelihood function, given by

$$p(\theta \mid y) = \frac{L(\theta)\,\tilde{\pi}(\theta)}{\int L(\theta)\,\tilde{\pi}(\theta)\, d\theta}, \tag{4.4}$$
where the denominator on the right of Equation (4.4) is the likelihood expression of the marginal distribution $p(y)$ as an integral. As long as the integral is finite, its value does not provide any additional information about the posterior distribution. Correspondingly, the distribution of $\theta$ given $y$ can be expressed, up to an arbitrary constant, in proportional form:

$$p(\theta \mid y) \propto L(\theta)\,\tilde{\pi}(\theta). \tag{4.5}$$
To summarize, Bayesian inference starts with prior knowledge of the distribution for $\theta$ and then updates that knowledge after learning information from the observed data $y$. Empirically, all Bayesian inferences are performed from the posterior distribution, namely $p(\theta \mid y)$. In longitudinal data analysis, if the researcher has prior information about the probability distribution of the subject-specific random effects, that distribution can be included in statistical inference by applying Bayesian inference. Various Bayes techniques can be applied to integrate the joint posterior distribution of all unknowns, and consequently, the desired marginal distribution can be obtained. It follows that the specified parameters of observed variables can be adequately estimated by including the unobserved quantities in a probability distribution. In statistics, the unknown quantities that are not of direct interest but are required in specifying a joint likelihood function are referred to as nuisance parameters. Most Bayesian methods require sophisticated computations, including complex simulation techniques and approximation algorithms.

4.2. Restricted maximum likelihood estimator

The maximum likelihood approach is perhaps the most popular fitting method in regression modeling. When the population mean is unknown, however, the maximum likelihood estimator of the variance $\sigma^2$ is biased downward due to the loss of degrees of freedom in the estimation of the fixed effects. For small samples, it may be more appropriate to apply the REML estimator to derive unbiased estimates of the variance and covariance parameters. In this section, I first show how the variance estimate is biased in the maximum likelihood estimator. Next, based on the discussion of the effect of the nuisance parameters, I describe statistical inference for the REML estimator.

4.2.1. MLE bias in variance estimate in general linear models

I begin by reviewing some basic statistics on the computation of the sample variance for a sample of $N$ subjects. With a known or an unknown population mean, denoted by $\mu$, the variance estimate is computed differently, given by

$$\hat{\sigma}^2 = \begin{cases} \dfrac{\sum_{i=1}^{N}(Y_i - \mu)^2}{N} & \text{if } \mu \text{ is known} \\[2ex] \dfrac{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}{N - 1} & \text{if } \mu \text{ is unknown.} \end{cases}$$
The above equation indicates that when $\mu$ is unknown, one degree of freedom is lost in computing the sample mean $\bar{Y}$. Such a loss of a degree of freedom needs to be corrected for to derive an unbiased variance estimate. The corrected variance estimate given unknown $\mu$ is statistically defined as the mean square error.
The above simple case can be extended to the inference of the MLE bias in the variance estimate in general linear models. Let the vectors $Y_i$, $\varepsilon_i$ and the matrices $X_i$ be combined into $Y$, $\varepsilon$, and $X$, respectively. Then, a combined general linear model can be written as

$$Y = X\beta + \varepsilon,$$
where $\varepsilon \sim N(0, \Sigma)$ and $\Sigma$ is an $N \times N$ positive-definite variance–covariance matrix of the random errors. The matrix $\Sigma$ depends on unknown parameters organized into a vector $\tilde{\Gamma}$; given the specification of $\beta$, this matrix is often simplified as $\sigma^2 I$, where $I$ is the identity matrix, assuming the random errors to be conditionally independent given the fixed-effect parameters. For $\Sigma = \sigma^2 I$, $\tilde{\Gamma}$ reduces to the single parameter $\sigma^2$. It follows that the classical MLE of $\sigma^2$, given the observed outcome data $y$, is given by

$$\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{N}. \tag{4.6}$$
If the population mean is unknown, $\beta$ needs to be estimated from the data first to derive the fitted mean $X\hat{\beta}$. Given the $M$ degrees of freedom lost in estimating $\hat{\beta}$, the expectation of $\hat{\sigma}^2$ becomes

$$E(\hat{\sigma}^2) = \frac{N - M}{N}\,\sigma^2, \tag{4.7}$$
where $M$ is the dimension of $X$, or the number of covariates if the model has full rank. Dividing the sum of squared residuals by $N - M$ rather than $N$ yields the mean square error statistic, an unbiased estimator for $\sigma^2$. It can be mathematically proved that when $\mu$ is unknown, the expectation of the estimator in Equation (4.6) is given by Equation (4.7) rather than $\sigma^2$, thereby yielding a biased estimate of $\sigma^2$. Below I provide a simple mathematical proof.
By contradiction, suppose $E(\hat{\sigma}^2) = \sigma^2$, given by

$$E\!\left[\frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{N}\right] = \sigma^2. \tag{4.8}$$
It follows that

$$E\!\left[\frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{N}\right] = \frac{N - M}{N}\,\sigma^2 < \sigma^2 = E\!\left[\frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{N - M}\right]. \tag{4.9}$$
Obviously, Equation (4.9) contradicts the supposition that $E(\hat{\sigma}^2) = \sigma^2$. Therefore, the ML estimator of $\sigma^2$ is biased downward due to the failure to account for the loss of degrees of freedom in estimating the mean $X\beta$. It is also clear from Equation (4.9) that if $N$ is large and $M$ is relatively small, the bias in the MLE $\hat{\sigma}^2$ becomes negligible and thus can be overlooked.
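A brief simulation makes the bias factor in Equation (4.7) visible. The sketch below repeatedly fits an ordinary least squares regression to simulated data and averages the two variance estimators; the sample size, number of covariates, error variance, and number of replications are hypothetical values chosen only for illustration.

import numpy as np

# Simulation sketch of Equation (4.7): with M regression coefficients estimated,
# the ML variance estimator divides by N and is biased by the factor (N - M)/N,
# while dividing by N - M (the mean square error) is unbiased.
rng = np.random.default_rng(42)
N, M, sigma2, reps = 25, 5, 2.0, 20000

ml_estimates, mse_estimates = [], []
for _ in range(reps):
    X = rng.normal(size=(N, M))
    beta = rng.normal(size=M)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    ml_estimates.append(rss / N)         # Equation (4.6)
    mse_estimates.append(rss / (N - M))  # mean square error correction

print("true sigma^2:            ", sigma2)
print("mean of ML estimates:    ", np.mean(ml_estimates))   # about (N - M)/N * sigma2
print("expected (N-M)/N*sigma^2:", (N - M) / N * sigma2)
print("mean of MSE estimates:   ", np.mean(mse_estimates))  # about sigma2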

4.2.2. Specification of REML in general linear models

The REML estimator is designed to correct for the bias incurred by the classical MLE by avoiding having to estimate $\beta$ first in a simplified likelihood function. Specifically, REML maximizes the likelihood of a set of selected error contrasts, not of all the data, with the likelihood specified only in terms of $\Sigma$. The method starts with the construction of error contrasts with zero expectation, and maximization of the corresponding joint likelihood is then performed to yield unbiased parameter estimates.
As indicated earlier, the MLE of $\sigma^2$ is biased downward by the factor $(N - M)/N$. To correct this bias in the simplest setting, the REML approach first specifies an $N$-dimensional vector containing only ones, denoted by $1_N$. The distribution of $Y$ can then be written as $N(\mu 1_N, \sigma^2 I_N)$ (notice that $1_N$ and the identity matrix $I_N$ are two different concepts). Let $A = \{a_1, \ldots, a_{N-1}\}$ be any $N \times (N - 1)$ matrix with $N - 1$ linearly independent columns orthogonal to the vector $1_N$. It follows that an $(N - 1) \times 1$ vector of error contrasts can be created, defined as $U = A'Y$. By definition, $U$ follows a normal distribution with mean $0$ and covariance matrix $\sigma^2 A'A$. Maximizing the corresponding likelihood with respect to the only remaining parameter $\sigma^2$ gives

$$\hat{\sigma}^2 = \frac{Y'A(A'A)^{-1}A'Y}{N - 1}. \tag{4.10}$$
Obviously, the above estimator is unbiased. Equation (4.10) is the basic specification of the REML estimator.
Equation (4.10) can be readily adapted to the estimation of $\sigma^2$ in general linear models. Consider the linear regression $Y = X\beta + \varepsilon$, where $Y$ is an $N$-dimensional vector and $X$ is an $N \times M$ matrix with known covariate values. All elements in $\varepsilon$ are assumed to be independent and identically distributed (iid) with zero expectation and variance $\sigma^2$. Given the inclusion of the covariates, $A = \{a_1, \ldots, a_{N-M}\}$ is now any $N \times (N - M)$ matrix with $N - M$ linearly independent columns orthogonal to the columns of the design matrix $X$. Accordingly, in general linear models, the vector of error contrasts $U$ is an $(N - M) \times 1$ vector, given by

$$U = \begin{pmatrix} u_1 \\ \vdots \\ u_{N-M} \end{pmatrix} = \begin{pmatrix} a_1'y \\ \vdots \\ a_{N-M}'y \end{pmatrix} = A'Y, \tag{4.11}$$
where a given element in $U$ is $a_i'y$, which, by definition, is an error contrast if $E(a_i'y) = 0$, and thereby $a_i'X = 0$.
The vector $U$ follows a normal distribution with mean $0$ and the newly structured covariance matrix $\sigma^2 A'A$, where $\sigma^2$ remains the only unknown parameter. Maximizing the corresponding likelihood with respect to $\sigma^2$ gives rise to the following estimator:

$$\hat{\sigma}^2 = \frac{\left[y - X(X'X)^{-1}X'y\right]'\left[y - X(X'X)^{-1}X'y\right]}{N - M}, \tag{4.12}$$
which is the mean square error in the context of general linear models and is unbiased for $\sigma^2$. As a result, the underestimation in the MLE of $\sigma^2$ is corrected. Therefore, the REML estimator is essentially a maximum likelihood approach applied to residuals. Additionally, from the specification $U = A'Y$, the following inference can be derived:

$$U = A'Y = A'(X\beta + \varepsilon) = A'X\beta + A'\varepsilon = 0 + A'\varepsilon = A'\varepsilon,$$
where $A'\varepsilon \sim N(0, A'\Sigma A)$.
After some additional algebra, the restricted log-likelihood function with respect to Σ for a sample of N subjects can be written as

$$l_R(\Sigma \mid U) = -\frac{N - M}{2}\log(2\pi) - \frac{1}{2}\log\left|A'\Sigma A\right| - \frac{1}{2}U'(A'\Sigma A)^{-1}U. \tag{4.13}$$
Given the above inference, the REML estimator is a residual-based estimating method that integrates β out. For this reason, this estimator is also referred to as the residual maximum likelihood or the modified maximum likelihood estimator (Harville, 1974; Patterson and Thompson, 1971).
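The error-contrast construction can also be illustrated numerically. In the sketch below, the columns of $A$ are taken as an orthonormal basis of the orthogonal complement of the columns of $X$, the contrasts $U = A'Y$ are formed, and the resulting variance estimate reproduces the mean square error of Equation (4.12); the simulated design, coefficients, and error variance are hypothetical values used only for illustration.

import numpy as np
from scipy.linalg import null_space

# Sketch of the error contrasts behind Equations (4.11)-(4.12) for a general
# linear model with iid errors and a simulated design matrix X.
rng = np.random.default_rng(1)
N, M = 30, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
beta = np.array([1.0, 0.5, -2.0])
y = X @ beta + rng.normal(scale=1.5, size=N)

# A: N x (N - M) matrix whose columns are orthogonal to the columns of X.
A = null_space(X.T)                  # orthonormal basis of the complement of col(X)
U = A.T @ y                          # (N - M) error contrasts, each with zero expectation

# REML estimate of sigma^2 from the contrasts (here A'A = I), which coincides
# with the mean square error of Equation (4.12); the ML estimate divides by N.
sigma2_reml = (U @ U) / (N - M)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_mse = (resid @ resid) / (N - M)
sigma2_ml = (resid @ resid) / N

print(sigma2_reml, sigma2_mse, sigma2_ml)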

4.2.3. REML estimator in linear mixed models

The above REML estimator for general linear models can be easily extended to linear mixed models by following the same approach of error contrasts. Combining the vectors $Y_i$, $b_i$, and $\varepsilon_i$ and the matrices $X_i$ and $Z_i$ into $Y$, $b$, $\varepsilon$, $X$, and $Z$ leads to

$$Y = X\beta + Zb + \varepsilon,$$
where $Z$ is the block diagonal matrix with blocks $Z_i$ on the main diagonal and zeroes off-diagonally. As formalized in Chapter 3, the marginal distribution of $Y$ is normal with mean vector $X\beta$ and covariance matrix $V(R, G)$ with blocks $V_i$ on the main diagonal and zeroes off-diagonally. After some simplification, the classical log-likelihood function associated with MLE is

$$l(G, R) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V| - \frac{1}{2}r'V^{-1}r, \tag{4.14}$$
where $r = y - X\hat{\beta}$ is defined as the vector of marginal residuals, which will be further discussed in Chapter 6.
Maximization of the above log-likelihood function on the data yields estimates of $\beta$ and $V(G, R)$. Analogous to the case of general linear models, the diagonal elements in $V(G, R)$ are underestimated for small samples due to the loss of degrees of freedom incurred by first estimating the marginal mean $X\beta$. By extending the REML estimator for general linear models, such underestimation in the variance estimates can be corrected.
The REML estimator for the variance components $R$ and $G$ in linear mixed models can be obtained by maximizing the log-likelihood function of a set of error contrasts $U = A'Y$, where $A$ is any $N \times (N - M)$ full-rank matrix with columns orthogonal to the columns of the $X$ matrix. The vector $U$ follows a normal distribution with mean vector $0$ and covariance matrix $A'V(R, G)A$, a distribution that does not depend on $\beta$. After some algebra, the joint restricted log-likelihood function with respect to the parameter vectors $R$ and $G$ for a random sample of size $N$ is

$$l_R(G, R) = -\frac{N - M}{2}\log(2\pi) - \frac{1}{2}\log|V| - \frac{1}{2}r'V^{-1}r - \frac{1}{2}\log\left|X'V^{-1}X\right|, \tag{4.15}$$
where, given the standard formula,

$$r = y - X\hat{\beta} = y - X(X'V^{-1}X)^{-1}X'V^{-1}y.$$
Maximization of the above residual log-likelihood function on the transformed data of error contrasts yields an unbiased estimate of $V(G, R)$. In turn, Equation (3.23) can be applied to estimate the vector $\beta$ (Fitzmaurice et al., 2004).
A comparison of Equations (4.15) and (4.14) shows one additional term in the REML log-likelihood: the last term on the right of Equation (4.15). Rewriting that term yields

$$-\frac{1}{2}\log\left|X'V^{-1}X\right| = \frac{1}{2}\log\left|(X'V^{-1}X)^{-1}\right| = \log\left|\operatorname{cov}(\hat{\beta})\right|^{1/2},$$
where $(X'V^{-1}X)^{-1}$ is the covariance matrix of $\hat{\beta}$. This additional term can be regarded as an adjustment factor for the bias in $V(G, R)$ from MLE. Correspondingly, the restricted log-likelihood function can be expressed as the ML log-likelihood plus an adjustment factor, given by

$$l_R(G, R) \cong l(G, R) + \log\left|\operatorname{cov}(\hat{\beta})\right|^{1/2}. \tag{4.16}$$
Analogous to MLE, the REML estimator is based on the MAR hypothesis. Operationally, a maximum likelihood estimate of V, derived either from MLE or from REML, can be obtained through a specific iterative scheme, such as the NR (Lindstrom and Bates, 1988) and the EM algorithms (Dempster et al., 1977), which will be described in Section 4.3.
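As a computational illustration, the sketch below evaluates the restricted log-likelihood of Equation (4.15) for a deliberately simple linear mixed model in which $G = \sigma_b^2 I$ and $R = \sigma_\varepsilon^2 I$; the function name and this covariance structure are assumptions made only for the example. Maximizing this function over the two variance components, for instance with a numerical optimizer, yields their REML estimates.

import numpy as np

def reml_loglik(y, X, Z, sigma2_b, sigma2_e):
    """Restricted log-likelihood of Equation (4.15) for a linear mixed model
    with a single random-effect design matrix Z, G = sigma2_b * I, and
    R = sigma2_e * I (a deliberately simple covariance structure)."""
    N, M = X.shape
    V = sigma2_b * (Z @ Z.T) + sigma2_e * np.eye(N)      # marginal covariance V(G, R)
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    beta_hat = np.linalg.solve(XtVinvX, X.T @ Vinv @ y)  # GLS estimate of beta
    r = y - X @ beta_hat                                 # marginal residuals
    return (-0.5 * (N - M) * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * r @ Vinv @ r
            - 0.5 * np.linalg.slogdet(XtVinvX)[1])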

4.2.4. Justification of the restricted maximum likelihood method

There is some debate concerning the validity of the REML estimator. At first glance, some information seems lost by basing inferences of $G$ and $R$ on $l_R$ rather than on $l$. Patterson and Thompson (1971) contend that for any vector of $N - M$ linearly independent error contrasts, the log-likelihood function for the transformed data vector $U = A'Y$ is proportional to the log-likelihood function for $y$, so that inferences based on $A'Y$ are as valid as those based on $y$. It actually makes no difference which $N - M$ linearly independent contrasts are used, because the log-likelihood functions for any two such sets differ by no more than an additive constant (Harville, 1977).
Harville (1974) found it attractive to use only error contrasts rather than all the data. Given this convenience, a prior distribution for $\beta$ does not need to be specified, nor do analytic or numerical integrations need to be performed to determine the posterior distribution for $G$ and $R$. To prove his argument, Harville (1974) provides a justification of the REML estimator in general linear models by using Bayesian inference.
Let $a_i'y$ be an error contrast, a linear combination of the data such that $E(a_i'y) = 0$ and $a_i'X = 0$, and let the density of $U$ be $f_U(A'y \mid G, R)$. Given the familiar statistical expression of $\beta$ in linear regression models

$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
the estimate $\hat{\beta}$ can be rewritten as

$$\hat{\beta} = H'y, \tag{4.17}$$
where $H$ is defined as

$$H = V^{-1}X(X'V^{-1}X)^{-1}. \tag{4.18}$$
Using results on determinants in matrix algebra, and taking the columns of $A$ to be orthonormal so that $A'A = I$, gives rise to

$$\left|\det(A, H)\right| = \left\{\det\!\left[(A, H)'(A, H)\right]\right\}^{1/2} = \left[\det(I)\right]^{1/2}\left[\det\!\left(H'H - H'A(A'A)^{-1}A'H\right)\right]^{1/2} = \left[\det(X'X)\right]^{-1/2}. \tag{4.19}$$
Let $f_{\hat{\beta}}(\cdot \mid \beta, G, R)$ and $f_y(\cdot \mid \beta, G, R)$ be the probability density functions (p.d.f.) of $\hat{\beta}$ and $y$, respectively. Given the well-known relationship

$$(y - X\beta)'V^{-1}(y - X\beta) = (y - X\hat{\beta})'V^{-1}(y - X\hat{\beta}) + (\beta - \hat{\beta})'(X'V^{-1}X)(\beta - \hat{\beta}),$$
the Bayes-type expression for the likelihood of $U$ is then given by

$$f_U(A'y \mid G, R) = \int f_U(A'y \mid G, R)\, f_{\hat{\beta}}(H'y \mid \beta, G, R)\, d\beta = \left[\det(X'X)\right]^{1/2}\int f_y(y \mid \beta, G, R)\, d\beta$$
$$= (2\pi)^{-\frac{1}{2}(N - M)}\left[\det(X'X)\right]^{1/2}\left[\det(V)\right]^{-1/2}\left[\det(X'V^{-1}X)\right]^{-1/2}\exp\!\left[-\tfrac{1}{2}\,(y - X\hat{\beta})'V^{-1}(y - X\hat{\beta})\right]. \tag{4.20}$$
Compared to Patterson and Thompson's expression, Equation (4.20) provides a more convenient perspective from which to derive the likelihood equations, and it is therefore preferable for numerical computation of the density $f_U(A'y \mid G, R)$.
Let $f(G, R)$ be the prior p.d.f. for the variance–covariance components $G$ and $R$. When only the error contrasts are used, the posterior probability density for $G$ and $R$ is then written as

$$p(G, R \mid y) \propto f(G, R)\, f_U(A'y \mid G, R). \tag{4.21}$$
Equation (4.21) literally states that, assuming $\beta$ and the variance components $(G, R)$ to be statistically independent a priori and the prior distribution of $\beta$ to be noninformative, the posterior density for $(G, R)$ is proportional to the product of the prior density for $(G, R)$ and the likelihood function of an arbitrary set of $N - M$ linearly independent error contrasts. Therefore, from the standpoint of Bayesian inference, using only error contrasts to make inference on $(G, R)$ is equivalent to ignoring any prior information on $\beta$ and using all the data to make these inferences. In other words, the REML estimator is identical to the mode of the marginal posterior density for $(G, R)$, with $\beta$ formally integrated out.

4.2.5. Comparison between ML and REML estimators

Both ML and REML estimators are powerful estimating approaches with desirable properties of statistical consistency, asymptotic normality, and efficiency. It is traditionally regarded that MLE is associated with several advantages in linear regression modeling. ML estimates are asymptotically unbiased, neither overestimating nor underestimating the corresponding population parameters in large samples. The ML estimator is consistent in that the parameter estimates converge in probability to the true values of the population parameters as the sample size increases, and it asymptotically follows a multivariate normal distribution. When this large-sample behavior holds, the likelihood ratio statistic approximately follows a chi-square distribution under the assumption of multivariate normality, and accordingly, the model chi-square can be used to test overall model fit. In linear mixed models, these statistical strengths usually remain effective in the estimation of the fixed effects and the variance components, as long as the sample size is sufficiently large. For small samples, however, these advantages in precision from using all the data may be negated to some extent by the additional approximation procedures needed to specify a complete prior distribution for $\beta$, $G$, and $R$ (Harville, 1974, 1977). As indicated earlier, in theory the ML estimates of the variance components are biased downward due to the loss of degrees of freedom in estimating the fixed effects. From the Bayesian standpoint, when the prior distribution of $\beta$ is concentrated away from the true value of $\beta$, the posterior distribution of $G$ and $R$ based on the complete data is adversely affected. In these situations, maximizing the likelihood based on error contrasts is effective.
In the application of linear mixed models, the REML estimator is often considered preferable to MLE because it corrects for the loss of degrees of freedom incurred in the ML estimation of the fixed effects (Davidian and Giltinan, 1995; Fitzmaurice et al., 2004; Harville, 1977). If the sample size is sufficiently larger than the number of model parameters, however, the bias in MLE is negligible, and in such situations the two estimators usually yield very close parameter estimates. Harville (1977) compares the size of the estimated variances derived from the two estimators in the application of general linear models. The ML estimator of the variance $\sigma^2$ consistently yields smaller mean square errors than the REML estimator when $M = \operatorname{rank}(X) \leq 4$; by contrast, the REML estimator of the variance $\sigma^2$ has smaller mean square errors than MLE when $M = \operatorname{rank}(X) \geq 5$ and $N - M$ is sufficiently large. Therefore, when more fixed effects are specified, the difference between the ML and REML estimators of variance widens.
Notice that while it can be used to compare nested models for the covariance, the REML log-likelihood cannot be used to compare nested regression models for the mean (Fitzmaurice et al., 2004). This is because the adjustment factor added in the REML estimator, the last term on the right of Equation (4.15), depends on the specification of the mean model. Therefore, the REML likelihoods for two nested models for the mean response are based on different sets of error contrasts. If a statistical assessment of the difference between two nested mean models is necessary, the researcher needs to use the ML estimator for the two nested regression models after carefully checking the statistical adequacy of using this classical estimator.

4.3. Computational procedures

In longitudinal data analysis, the maximum likelihood or the REML estimators cannot be expressed in a simple and closed form except in some special situations. Therefore, the ML and REML estimates of the parameters in G and R must be computed by applying numeric methods. In this section, two popular computational methods are described: the NR and the EM algorithms. Both methods are iterative schemes and have been widely applied in longitudinal data analysis.

4.3.1. Newton–Raphson algorithm

The NR algorithm is an iterative method for finding parameter estimates by minimizing −2 times a specific log-likelihood function. In applying this algorithm, both the ML and the REML log-likelihood functions can be used to estimate the variance components (Laird and Ware, 1982; Ware, 1985). The first step of the NR algorithm is to simplify the computational procedure by solving for the ML and REML estimates of $\sigma^2$ as a function of $\beta$, $G$, and $R$. In statistics, the procedure of reducing the dimension of an objective function by analytic substitution is referred to as profiling. In the NR algorithm, profiling is applied to ensure that the numerical optimization is performed only over the remaining model parameters. Given Equations (3.22) and (4.15), the estimate of $\sigma^2$ can be written as

$$\hat{\sigma}^2_{\text{ML}}(\beta, G, R) = \frac{1}{N}\, r'V(G, R)^{-1}r, \tag{4.22}$$
and

$$\hat{\sigma}^2_{\text{REML}}(\beta, G, R) = \frac{1}{N - M}\, r'V(G, R)^{-1}r, \tag{4.23}$$
where $N = \sum_{i} n_i$ denotes the total number of observations and $r = y - X\beta$, both defined earlier.
The above two equations are then substituted into Equations (3.22) and (4.15). After some simplification, two corresponding profile log-likelihood functions of β, G, and R can be specified:

$$P_{\text{ML}}(\beta, G, R \mid y) = -\frac{N}{2}\log\!\left(r'V(G, R)^{-1}r\right) - \frac{1}{2}\log\left|V(G, R)\right|, \tag{4.24}$$

$$P_{\text{REML}}(\beta, G, R \mid y) = -\frac{N - M}{2}\log\!\left(r'V(G, R)^{-1}r\right) - \frac{1}{2}\log\left|V(G, R)\right| - \frac{1}{2}\log\left|X'V(G, R)^{-1}X\right|, \tag{4.25}$$
where $P_{\text{ML}}(\cdot)$ and $P_{\text{REML}}(\cdot)$ are the profile log-likelihood functions for computing the ML and the REML estimates, respectively. According to Lindstrom and Bates (1988), the NR optimization can be based either on the original log-likelihood or on the profile log-likelihood function, but the latter is recommended because the profile log-likelihood usually requires fewer iterations to derive estimates of $\beta$, $G$, and $R$.
In the NR algorithm, a series of formulas is specified for the derivatives of the profile log-likelihoods in Equations (4.24) and (4.25) with respect to $\beta$, $G$, and $R$. These formulations enable an NR implementation. The derivatives for linear mixed models are computed under conditionally independent errors, with $V_i = \sigma^2 V_i(G, R)$. The chain rule (a rule in mathematics for differentiating compositions of functions) is used to find the derivatives for a specific mixed-effects model with respect to the transformed $G$ and $R$. The inverse of the matrix of second derivatives of the profile log-likelihood function at convergence yields an estimate of the variance–covariance matrix of the parameter estimates. If the residual variance $\sigma^2$ is part of the mixed model, it can be profiled out by analytically solving for the optimal $\sigma^2$ and then plugging this expression back into the likelihood formula (Wolfinger et al., 1994). The exact procedures for these formulas are complex, and the reader interested in the computational details concerning the first and second derivatives of $P_{\text{ML}}(\beta, G, R \mid y)$ and $P_{\text{REML}}(\beta, G, R \mid y)$ is referred to the original articles (Lindstrom and Bates, 1988; Wolfinger et al., 1994).
Operationally, the estimation of $G$ and $R$ can be performed through a conventional NR iterative scheme, given some specified initial values for the parameter estimates. With regard to linear mixed models, the series of parameter estimates in the iterative scheme is generally denoted by $\hat{\theta}_{(j)}$ $(j = 1, 2, \ldots)$, where $\theta$ consists of $\beta$, $G$, and $R$. The iterative scheme terminates when $\hat{\theta}_{(j+1)}$ is sufficiently close to $\hat{\theta}_{(j)}$ given some convergence criterion. As a result, the maximum likelihood estimate of $\theta$ can be operationally defined as $\hat{\theta} = \hat{\theta}_{(j+1)}$.
The NR algorithm is widely considered to be a preferred computational method over other procedures in the application of linear mixed models given its desirable convergence properties and its capacity to derive information matrices. Sometimes, however, the researcher might encounter failure of convergence in the iterative processes or the occurrence of convergence at unrealistic parameter values when performing the NR algorithm. In general, such convergence problems occur much more frequently in the estimation of the variance components than in that of the fixed effects. In these situations, the researcher might want to consider specifying a different set of starting values for parameter estimates or using other numeric methods.
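To illustrate the iterative estimation numerically, the following sketch maximizes the restricted log-likelihood for a simulated random-intercept model. A general-purpose quasi-Newton routine is used here in place of the analytic NR updates implemented in production software; the simulated design, starting values, and variance parameters are all hypothetical.

import numpy as np
from scipy.optimize import minimize

# Minimal sketch: numerically maximize the REML log-likelihood of a
# random-intercept model (G = sigma_b^2, R_i = sigma_e^2 I) on simulated data.
rng = np.random.default_rng(7)
n_subj, n_per = 40, 5
subj = np.repeat(np.arange(n_subj), n_per)
Z = (subj[:, None] == np.arange(n_subj)[None, :]).astype(float)   # random-intercept design
X = np.column_stack([np.ones(n_subj * n_per), rng.normal(size=n_subj * n_per)])
b = rng.normal(scale=1.0, size=n_subj)                             # true sigma_b = 1
y = X @ np.array([2.0, 0.5]) + Z @ b + rng.normal(scale=0.7, size=n_subj * n_per)

def neg_reml(log_params):
    sigma2_b, sigma2_e = np.exp(log_params)        # log scale keeps variances positive
    N, M = X.shape
    V = sigma2_b * (Z @ Z.T) + sigma2_e * np.eye(N)
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    beta_hat = np.linalg.solve(XtVinvX, X.T @ Vinv @ y)
    r = y - X @ beta_hat
    logl = (-0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * r @ Vinv @ r
            - 0.5 * np.linalg.slogdet(XtVinvX)[1])
    return -logl                                   # constant term omitted

fit = minimize(neg_reml, x0=np.log([0.5, 0.5]), method="BFGS")
print("REML estimates (sigma_b^2, sigma_e^2):", np.exp(fit.x))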

4.3.2. Expectation–maximization algorithm

The EM method is designed for maximum likelihood estimation in models involving unobserved quantities, such as the random effects in mixed-effects regression models. Like the NR algorithm, the EM method can be applied to obtain both the ML and the REML estimates (Laird and Ware, 1982). When MLE is applied, the covariance matrix can be estimated by using the classical EM algorithm described in Dempster et al. (1977, 1981), Laird (1982), and Laird and Ware (1982).
In longitudinal data analysis, the EM algorithm can be applied to approximate the unobserved subject-specific random effects by maximizing the likelihood function given the observed outcomes y (Laird and Ware, 1982). Given the estimates of β, G, and R, the random effect for a specific subject can be obtained by the following equation:

$$\hat{b}_i = \hat{G} Z_i' \hat{V}_i^{-1}\left(y_i - X_i\hat{\beta}\right). \tag{4.26}$$
In Equation (4.26), $V_i$ is the overall variance–covariance matrix of $y_i$, and its inverse, $V_i^{-1}$, is often used as a weight matrix in the estimation process to yield efficient and robust parameter estimates. In this equation, $\hat{b}_i$ is computed as the proportion of the overall variance–covariance matrix that comes from the between-subjects variability, multiplied by the overall residual from the marginal mean. Therefore, given the availability of $\hat{G}$ and $\hat{V}_i$, $\hat{b}_i$ can be readily computed. This estimator is efficient and robust because the fixed-effects estimate maximizes the likelihood based on the marginal distribution of the empirical data. The inference behind Equation (4.26) will be further presented in Section 4.4.
Let $R_i = \sigma^2 I$ and let $G$ be an arbitrary $q \times q$ nonnegative-definite covariance matrix. The variance $\sigma^2$ can then be estimated as

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} e_i'e_i}{\sum_{i=1}^{N} n_i} = \frac{\tilde{t}_1}{\sum_{i=1}^{N} n_i}, \tag{4.27}$$
and

$$\hat{G} = N^{-1}\sum_{i=1}^{N} b_i b_i' = \frac{\tilde{t}_2}{N}, \tag{4.28}$$
where $e_i$ is an $n_i \times 1$ column vector of model residuals for subject $i$, as defined in Chapter 3. According to Laird and Ware (1982), the two sums of squares, $\tilde{t}_1 = \sum_{i=1}^{N} e_i'e_i$ and $\tilde{t}_2 = \sum_{i=1}^{N} b_i b_i'$, are the sufficient statistics for $R_i$ and $G$, respectively.
Let $\Theta$ be an available estimate of the parameters in $R_i$ and $G$ (notice that $\Theta$ differs from $\theta$). The statistics $\tilde{t}_1$ and $\tilde{t}_2$ can then be calculated by using their expectations conditional on the observed data vector $y$, given by

$$\hat{\tilde{t}}_1 = E\!\left\{\sum_{i=1}^{N} e_i'e_i \,\Big|\, y_i, \hat{\beta}(\hat{\Theta}), \hat{\Theta}\right\} = \sum_{i=1}^{N}\left\{\hat{e}_i(\hat{\Theta})'\hat{e}_i(\hat{\Theta}) + \operatorname{tr}\operatorname{var}\!\left[e_i \mid y_i, \hat{\beta}(\hat{\Theta}), \hat{\Theta}\right]\right\}, \tag{4.29}$$
and

$$\hat{\tilde{t}}_2 = E\!\left\{\sum_{i=1}^{N} b_i b_i' \,\Big|\, y_i, \hat{\beta}(\hat{\Theta}), \hat{\Theta}\right\} = \sum_{i=1}^{N}\left\{\hat{b}_i(\hat{\Theta})\hat{b}_i(\hat{\Theta})' + \operatorname{var}\!\left[b_i \mid y_i, \hat{\beta}(\hat{\Theta}), \hat{\Theta}\right]\right\}, \tag{4.30}$$
where

$$\hat{e}_i(\hat{\Theta}) = E\!\left(e_i \mid y_i, \hat{\beta}(\hat{\Theta}), \hat{\Theta}\right) = y_i - X_i\hat{\beta}(\hat{\Theta}) - Z_i\hat{b}_i(\hat{\Theta}).$$
Operationally, the EM algorithm requires suitable starting values for $\hat{R}_i$ and $\hat{G}$ and then iterates between Equations (4.29) and (4.30), referred to as the E-step (expectation step), and Equations (4.27) and (4.28), referred to as the M-step (maximization step). At convergence of the values of $\tilde{t}_1$ and $\tilde{t}_2$ after a series of iterations, the ML estimates of both the fixed and the random effects, $\hat{\Theta}_{\text{ML}}$, $\hat{\beta}(\hat{\Theta}_{\text{ML}})$, and $\hat{b}_i(\hat{\Theta}_{\text{ML}})$, can be obtained from the last E-step.
The EM algorithm can also be applied to compute the REML estimates, denoted by $\hat{\Theta}_{\text{REML}}$, $\hat{\beta}(\hat{\Theta}_{\text{REML}})$, and $\hat{b}_i(\hat{\Theta}_{\text{REML}})$, through an empirical Bayes approach. In the EM iterative series for the REML estimator, the M-step remains the same, iterating between Equations (4.27) and (4.28), because the estimation of $\Theta$ is still based on $y$, $\beta$, and $e$. The E-step in the REML estimator, however, differs from the procedure in MLE. While the E-step for MLE conditions on $y$ and $\beta$, the expectation with REML is conditional only on $y$, as $\beta$ is integrated out of the likelihood. Consequently, in REML, Equations (4.29) and (4.30) are replaced with the following formulas:

$$\hat{\tilde{t}}_1 = E\!\left\{\sum_{i=1}^{N} e_i'e_i \,\Big|\, y_i, \hat{\Theta}\right\} = \sum_{i=1}^{N}\left\{\hat{e}_i(\hat{\Theta})'\hat{e}_i(\hat{\Theta}) + \operatorname{tr}\operatorname{var}\!\left[e_i \mid y_i, \hat{\Theta}\right]\right\}, \tag{4.31}$$
and

$$\hat{\tilde{t}}_2 = E\!\left\{\sum_{i=1}^{N} b_i b_i' \,\Big|\, y_i, \hat{\Theta}\right\} = \sum_{i=1}^{N}\left\{\hat{b}_i(\hat{\Theta})\hat{b}_i(\hat{\Theta})' + \operatorname{var}\!\left[b_i \mid y_i, \hat{\Theta}\right]\right\}, \tag{4.32}$$
where $\hat{e}_i(\hat{\Theta})$ is still defined as $y_i - X_i\hat{\beta}(\hat{\Theta}) - Z_i\hat{b}_i(\hat{\Theta})$.
The expectations computed in the above E-step of the REML estimator involve the conditional means and variances of $b_i$ and $e_i$. Theoretically, the variance estimates obtained from REML can differ from their ML counterparts, which are biased downward, as indicated in Section 4.2. For large samples, however, the bias in the ML estimates tends to be negligible, and therefore, the ML and the REML estimates are equally robust if the sample size is sufficiently large.
In summary, the EM algorithm is a simple and convenient approach for estimating the variance–covariance parameters. Given some desirable properties in this approach (Dempster et al., 1977), the EM procedure has historically seen considerable applications in longitudinal data analysis. In many situations, however, the EM algorithm is regarded as a less preferable numerical method than the NR in several aspects. Its relatively weaker convergence properties are particularly criticized. Lindstrom and Bates (1988) and Wolfinger et al. (1994) provide reasons for this negative assessment on the EM algorithm. Consequently, the EM method has gradually become a less-applied computational algorithm than the NR approach in longitudinal data analysis.
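The following sketch implements a minimal ML version of the EM iterations described above for a balanced random-intercept model, with the E-step computing the conditional expectations of the sufficient statistics in Equations (4.29) and (4.30) and the M-step applying Equations (4.27) and (4.28). The REML variant would replace the conditional variances with ones that also account for the uncertainty in $\hat{\beta}$. The simulated data, the starting values, and the fixed number of iterations are hypothetical choices for illustration.

import numpy as np

# Minimal EM sketch (ML version) for a random-intercept model: R_i = sigma2_e I,
# G = sigma2_b (q = 1), balanced data with n_i observations per subject.
rng = np.random.default_rng(3)
N, n_i = 50, 4
X_list = [np.column_stack([np.ones(n_i), rng.normal(size=n_i)]) for _ in range(N)]
y_list = [X @ np.array([1.0, -0.5]) + rng.normal(scale=1.2) + rng.normal(scale=0.8, size=n_i)
          for X in X_list]   # true sigma2_b = 1.44, sigma2_e = 0.64

sigma2_b, sigma2_e = 1.0, 1.0                       # starting values
for _ in range(200):
    # GLS estimate of beta given the current variance components
    V_cur = lambda: sigma2_b * np.ones((n_i, n_i)) + sigma2_e * np.eye(n_i)
    XtVX = sum(X.T @ np.linalg.solve(V_cur(), X) for X in X_list)
    XtVy = sum(X.T @ np.linalg.solve(V_cur(), y) for X, y in zip(X_list, y_list))
    beta = np.linalg.solve(XtVX, XtVy)

    t1, t2 = 0.0, 0.0                               # E-step
    for X, y in zip(X_list, y_list):
        Vinv = np.linalg.inv(V_cur())
        resid = y - X @ beta
        b_hat = sigma2_b * np.ones(n_i) @ Vinv @ resid           # E(b_i | y_i)
        var_b = sigma2_b - sigma2_b**2 * np.ones(n_i) @ Vinv @ np.ones(n_i)
        e_hat = resid - b_hat                                    # E(e_i | y_i)
        var_e_tr = n_i * sigma2_e - sigma2_e**2 * np.trace(Vinv)
        t1 += e_hat @ e_hat + var_e_tr
        t2 += b_hat**2 + var_b
    sigma2_e, sigma2_b = t1 / (N * n_i), t2 / N     # M-step, Equations (4.27)-(4.28)

print("EM estimates:", beta, sigma2_b, sigma2_e)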

4.4. Approximation of random effects in linear mixed models

In longitudinal data analysis, researchers are often interested in predicting the subject-specific trajectories of the response variable. To accomplish this prediction, the random effect for each subject needs to be approximated first. In linear mixed models, the conditional mean of $b_i$ can be predicted from the data $y$ and $\hat{\beta}$, given Equation (4.26). In this section, I introduce a popular predictor in longitudinal data analysis, referred to as the best linear unbiased predictor (BLUP). Corresponding to the specification of BLUP, the shrinkage approach is also described, which is a powerful statistical technique in performing linear predictions of longitudinal trajectories.

4.4.1. Best linear unbiased prediction

BLUP was originally designed by Henderson (1950, 1975) and now sees widespread applications in predicting the random effects in longitudinal data analysis. As indicated earlier, the fixed and the random components of the parameters can be estimated by the following two equations:

$$\hat{\beta} = \left(\sum_{i=1}^{N} X_i'V_i^{-1}X_i\right)^{-1}\sum_{i=1}^{N} X_i'V_i^{-1}y_i,$$
and

$$\hat{b}_i = \hat{G} Z_i' \hat{V}_i^{-1}\left(y_i - X_i\hat{\beta}\right).$$
In the second equation, $\hat{b}_i$ is computed as the proportion of the overall variance–covariance matrix that comes from the between-subjects variability, multiplied by the overall residual from the marginal mean. Given this perspective, the covariance matrix of $\hat{b}_i$ in a linear mixed model can be written as

$$\operatorname{var}(\hat{b}_i) = \hat{G} Z_i'\left\{\hat{V}_i^{-1} - \hat{V}_i^{-1}X_i\left(\sum_{i=1}^{N} X_i'\hat{V}_i^{-1}X_i\right)^{-1}X_i'\hat{V}_i^{-1}\right\}Z_i\hat{G}. \tag{4.33}$$
If $\beta$, $G$, and $R$ are known, the best linear predictor of $X_0\beta + Z_0 b$, where $X_0$ and $Z_0$ are matrices with some specific values, is

$$X_0\beta + Z_0 G Z'V^{-1}\left(y - X\hat{\beta}\right), \tag{4.34}$$
where $\hat{\beta}$ can be viewed, in general terms, as any solution of the following generalized least squares (GLS) equation:

$$X'V^{-1}X\hat{\beta} = X'V^{-1}y. \tag{4.35}$$
As V is a large matrix containing G and R, Henderson (1950) specifies the celebrated BLUP equation, given by

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix}\begin{pmatrix} \hat{\beta} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}. \tag{4.36}$$
For those not highly familiar with matrix algebra, the above block matrices can be expanded into the following simultaneous equations:

$$X'R^{-1}X\hat{\beta} + X'R^{-1}Z\hat{b} = X'R^{-1}y, \tag{4.37a}$$

$$Z'R^{-1}X\hat{\beta} + \left(Z'R^{-1}Z + G^{-1}\right)\hat{b} = Z'R^{-1}y. \tag{4.37b}$$
Henderson (1975) proves that in Equation (4.36), $\hat{\beta}$ is a solution to Equation (4.35) and $\hat{b}$ is equal to $GZ'V^{-1}(y - X\hat{\beta})$ of Equation (4.34). The advantage of applying Equation (4.36) over Equation (4.34) is that neither $V$ nor its inverse is required in the computation, as $R$ is usually specified as an identity matrix and $G$ is often a diagonal matrix.
The variance–covariance matrix of parameter estimates in BLUP is also specified by Henderson (1975), given by

$$E\begin{pmatrix} \hat{\beta} - \beta \\ \hat{b} - b \end{pmatrix}\begin{pmatrix} \hat{\beta} - \beta \\ \hat{b} - b \end{pmatrix}' = \begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix}^{-1}\sigma^2. \tag{4.38}$$
In summary, the BLUP procedure maximizes the sum of two component log-likelihoods given the joint distribution of $y$ and $b$. In this expression, the log-likelihood is not a classical likelihood function of general linear models due to the specification of $b$. Instead, BLUP selects estimates, or more accurately predictors, of $\beta$, $b$, and $\sigma^2$ to maximize the joint log-likelihood function. For continuous response data, the BLUP estimators, denoted by $\tilde{\beta}$ and $\tilde{b}$, are the values that set the derivatives of the log-likelihood function with respect to $\beta$ and $b$ equal to 0 (McGilchrist, 1994).
Statistically, the BLUP estimates have desirable distributional properties and may differ from those derived from the generalized least squares estimator. In the context of linear mixed models, BLUP estimates are linear in the sense that they are linear functions of the data $y$; unbiased because the average value of the estimate equals the average value of the quantity being estimated, written as $E(X_0\beta + Z_0 b) = X_0\beta$; best in view of the fact that they have minimum mean squared error within the class of linear unbiased estimators; and predictions so as to distinguish them from estimates of the fixed effects (Henderson, 1975; Robinson, 1991). For more details regarding BLUP's derivations, justifications, and links to other statistical techniques, the reader is referred to Henderson (1950, 1975), Liu et al. (2008), McGilchrist (1994), and Robinson (1991).
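The sketch below assembles and solves Henderson's equations (4.36) for a small simulated random-intercept model with the variance components treated as known, and verifies that the resulting random-effect predictions coincide with $GZ'V^{-1}(y - X\hat{\beta})$ from Equation (4.34), without inverting $V$ in the first computation. The simulated dimensions and variance values are hypothetical.

import numpy as np

# Solve the mixed model equations (4.36) for a simulated random-intercept model.
rng = np.random.default_rng(11)
n_subj, n_per = 20, 4
n = n_subj * n_per
subj = np.repeat(np.arange(n_subj), n_per)
Z = (subj[:, None] == np.arange(n_subj)[None, :]).astype(float)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2_b, sigma2_e = 1.5, 0.8                    # treated as known here
y = X @ np.array([3.0, 1.0]) + Z @ rng.normal(scale=np.sqrt(sigma2_b), size=n_subj) \
    + rng.normal(scale=np.sqrt(sigma2_e), size=n)

R_inv = np.eye(n) / sigma2_e                     # R = sigma2_e * I
G_inv = np.eye(n_subj) / sigma2_b                # G = sigma2_b * I

# Assemble and solve the block system of Equation (4.36).
C = np.block([[X.T @ R_inv @ X, X.T @ R_inv @ Z],
              [Z.T @ R_inv @ X, Z.T @ R_inv @ Z + G_inv]])
rhs = np.concatenate([X.T @ R_inv @ y, Z.T @ R_inv @ y])
sol = np.linalg.solve(C, rhs)
beta_hat, b_hat = sol[:2], sol[2:]

# Cross-check against b = G Z' V^{-1} (y - X beta_hat), which does require V.
V = sigma2_b * (Z @ Z.T) + sigma2_e * np.eye(n)
b_check = sigma2_b * Z.T @ np.linalg.solve(V, y - X @ beta_hat)
print(np.allclose(b_hat, b_check))               # True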

4.4.2. Shrinkage and reliability

Equations (4.36) and (4.38) imply that the estimation of model parameters depends on the relative size of the random effects per subject because $\hat{b}_i$ is equal to $GZ_i'V_i^{-1}(y_i - X_i\hat{\beta})$. In the expression of $\hat{b}_i$, the term $GZ'V^{-1}$ is the proportion of the overall covariance matrix that comes from the random-effect components (Verbeke and Molenberghs, 2000). Therefore, if between-subjects variability is relatively low compared with the residual variability, much weight will be given to the population predictor $X\hat{\beta}$. In contrast, if between-subjects variability is relatively high, much weight will be given to the observed data $y$. Consequently, the predicted outcome for subject $i$, denoted by $\hat{y}_i$, can be regarded as a weighted average of the model-based population mean and the observed data for subject $i$. In statistics, this estimating technique is referred to as shrinkage, also called the borrowing-of-strength approach. In longitudinal data analysis, this technique not only plays an important role in the estimation of parameters in mixed-effects regression models, but it is also impactful in missing-data analysis and in the study of misspecification of the assumed random-effects distribution (McCulloch and Neuhaus, 2011).
The concept of shrinkage might sound unfamiliar to the reader not well versed in advanced statistics. Given the importance of shrinkage in longitudinal data analysis, it may be useful to link shrinkage with the reliability analysis of the random effects applied in statistical inference for multilevel regression modeling (Raudenbush and Bryk, 2002). The reliability of a measurement is defined as the ratio between the variance of the true scores and the variance of the observed data. In multilevel regression models, of which longitudinal data analysis can be regarded as a special case, reliability can be estimated by the ratio between the level-2 variance, denoted by $\sigma_{00}$, and the sum of the level-2 and level-1 variance components, with the latter divided by the number of observations within a level-2 unit $i$. The equation is

$$\rho_{0i} = \frac{\sigma_{00}}{\sigma_{00} + \sigma^2/n_i}, \tag{4.39}$$
where $\rho_{0i}$ is the reliability score for level-2 unit $i$, $\sigma^2$ is the level-1 variance, and $n_i$ is the sample size within level-2 unit $i$. Clearly, the smaller the size of $n_i$, the lower the reliability for level-2 unit $i$. Given Equation (4.39), it is easy to calculate the reliability score of the random effects for each level-2 unit using the parameter estimates from a multilevel model.
The above reliability analysis can be extended to the longitudinal setting, in which the level-2 unit i is the individual and the level-1 unit is a given observation at a given time point. In most datasets of a panel design, there are only a limited number of time points designed to follow up respondents at baseline, among whom many only have one, two, or three outcome observations due to various types of attrition (death, migration, or refusal to answer sensitive questions), as discussed in Chapter 1. As a result, reliability is low for many subjects, thereby resulting in massive instability in the random-effects estimates. Given this long-existing problem in longitudinal data analysis, statisticians and other quantitative methodologists have applied the shrinkage approach to borrow information from subjects with high reliability in the estimation of the random effects for those with fewer observations (Gelman et al., 2004; Laird and Ware, 1982). In this empirical Bayes method, reliability of a subject is used as a weight factor in prediction of the random effects.
The shrinkage technique applied in longitudinal data analysis can be viewed as an extension of the above reliability analysis. From the inference described in Section 4.4.1, the prediction of y for subject i can be specified by the following empirical Bayes algorithm:

$$\hat{y}_i = X_i\hat{\beta} + Z_i\hat{b}_i = X_i\hat{\beta} + Z_i\hat{G}Z_i'\hat{V}_i^{-1}\left(y_i - X_i\hat{\beta}\right)$$
$$= \left(I_{n_i} - Z_i\hat{G}Z_i'\hat{V}_i^{-1}\right)X_i\hat{\beta} + Z_i\hat{G}Z_i'\hat{V}_i^{-1}y_i = \hat{R}_i\hat{V}_i^{-1}X_i\hat{\beta} + \left(I_{n_i} - \hat{R}_i\hat{V}_i^{-1}\right)y_i. \tag{4.40}$$
In the above BLUP estimation, the prediction of the outcomes for subject $i$ is expressed as a weighted average of the population mean $X_i\hat{\beta}$ and the observed data $y_i$, with weights $\hat{R}_i\hat{V}_i^{-1}$ and $I_{n_i} - \hat{R}_i\hat{V}_i^{-1}$, respectively. When within-subject variability is relatively sizable, the observed data are shrunk toward the prior average $X_i\hat{\beta}$ because the weight component $I_{n_i} - \hat{R}_i\hat{V}_i^{-1}$ is relatively small.
To help the reader better comprehend the rationale of the shrinkage approach in longitudinal data analysis, below I specialize Equation (4.40) to a random intercept linear model in which a scalar between-subjects random term, $b_i$, is specified. According to the inference described in Section 4.4.1, the BLUP of $b_i$ is given by

$$\text{BLUP}(\hat{b}_i) = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_\varepsilon^2/n_i}\left(y_i - X_i\hat{\beta}\right), \tag{4.41}$$
where the term $\hat{\sigma}_b^2/(\hat{\sigma}_b^2 + \hat{\sigma}_\varepsilon^2/n_i)$ is referred to as the shrinkage factor in longitudinal data analysis. With the inclusion of the shrinkage factor, the BLUP of $b_i$ has mean 0 and variance $\left[\hat{\sigma}_b^2/(\hat{\sigma}_b^2 + \hat{\sigma}_\varepsilon^2/n_i)\right]\hat{\sigma}_b^2$. Therefore, $\operatorname{var}\!\left[\text{BLUP}(\hat{b}_i)\right] < \operatorname{var}(b_i)$.
It follows then that the BLUP of yi can be written as

$$\hat{y}_i = X_i\hat{\beta} + \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_\varepsilon^2/n_i}\left(y_i - X_i\hat{\beta}\right). \tag{4.42}$$
After some algebra, Equation (4.42) can be rewritten as

$$\text{BLUP}(\hat{y}_i) = y_i - \frac{\hat{\sigma}_\varepsilon^2/n_i}{\hat{\sigma}_b^2 + \hat{\sigma}_\varepsilon^2/n_i}\left(y_i - X_i\hat{\beta}\right). \tag{4.43}$$
In the above random intercept model, the prediction of $y_i$ depends on $\sigma_\varepsilon^2$, $\sigma_b^2$, and $n_i$. Other things being equal, the number of observations for the subject determines the amount of shrinkage in linear estimation and prediction. In other words, if $n_i$ is small, the within-subject variance component $\sigma_\varepsilon^2/n_i$ is relatively large, and consequently, the observed data are more severely shrunk toward the population average $X_i\hat{\beta}$. This simple case can be readily extended to BLUP predictions in the linear random coefficient model.
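The following arithmetic sketch evaluates the shrinkage factor of Equation (4.41) for a few hypothetical values of $n_i$ and a hypothetical value of the subject's average residual from the marginal mean, showing how subjects with fewer observations have their predicted random intercepts pulled more strongly toward zero and their predicted outcomes toward $X_i\hat{\beta}$. All numbers are made up for illustration.

# Shrinkage factor of Equation (4.41) for hypothetical variance components.
sigma2_b, sigma2_e = 1.0, 4.0
mean_resid = 2.0                       # hypothetical average residual for the subject
for n_i in (1, 2, 4, 8, 20):
    shrink = sigma2_b / (sigma2_b + sigma2_e / n_i)   # shrinkage factor, same form as (4.39)
    b_blup = shrink * mean_resid                      # predicted random intercept
    print(f"n_i = {n_i:2d}  shrinkage = {shrink:.3f}  BLUP(b_i) = {b_blup:.3f}")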
In multilevel regression modeling, shrinkage has proved to be an efficient, robust, and consistent approach, even in the presence of large-scale missing observations. Its application in longitudinal data analysis, based on the MAR hypothesis, reduces instability in parameter estimates due to missing data and serial correlation among repeated measurements for the same subject. Little and Rubin (1999) believe that in situations where good covariate information is available and included in the analysis, the MAR assumption is a reasonable approximation to reality. Given the application of shrinkage, the mixed-effects models have often been found to generate more reliable and robust outcomes than the models handling longitudinal data with the MNAR hypothesis. Furthermore, McCulloch and Neuhaus (2011) found that BLUPs computed under very different assumed random-effects distributions perform similarly in practice, as measured by mean square error; random-effects distributions that fit the data better statistically may not perform better in overall prediction. Given such desirable properties, the standard approaches in linear mixed models, described in both Chapter 3 and the present one, generally suffice to yield reliable parameter estimates and predictions. The statistical methods for handling exceptional cases of nonignorable missing data will be described and discussed in Chapter 14.
In programming linear mixed models, many statistical software packages, including SAS, use various shrinkage estimators, combining empirical Bayes and the ML or the REML estimators for the estimation of parameters and prediction of the random effects. For example, in SAS, the PROC MIXED procedure provides the REML and BLUP estimators, both yielding Bayes-type shrinkage estimates (SAS, 2012).

4.5. Hypothesis testing on variance component G

In performing hypothesis testing on the variance components in $G$, the critical issue is whether the between-subjects variance components are tested on the boundary of the parameter space. If a hypothesis test is conducted on the nonnegative variance of a single random effect, $\sigma_b^2$, negative values are not allowed given the parameter space $[0, \infty)$. In this situation, hypothesis testing is performed on the boundary of the parameter space, and consequently, the classical hypothesis testing approaches are not valid. Instead, one-sided testing statistics should be developed to test the null and alternative hypotheses $H_0: \sigma_b^2 = 0$ and $H_1: \sigma_b^2 > 0$ (Verbeke and Molenberghs, 2003). For large samples, the asymptotic null distribution of the likelihood ratio statistic should be approximated as a 50:50 mixture of two chi-square distributions (Self and Liang, 1987).
For example, in the likelihood ratio test for the variance of a single random effect, the test statistic asymptotically follows a 50:50 mixture of the chi-square distributions $\chi_0^2$ and $\chi_1^2$. In this case, the critical value should be set at $\chi^2_{1, 1-2\alpha}$, rather than $\chi^2_{1, 1-\alpha}$, for a significance test. Correspondingly, the adjusted p-value of the likelihood ratio statistic is given by

$$p = \begin{cases} 1 & \text{if } \hat{\Lambda} = 0 \\ 0.5\,\Pr\!\left(\chi_1^2 \geq \hat{\Lambda}\right) & \text{if } \hat{\Lambda} > 0 \end{cases} \qquad \text{for } q = 1, \tag{4.44}$$
where $\hat{\Lambda}$ is the observed likelihood ratio test statistic. That is, for a single random effect, the p-value for its variance is one-half of the p-value from the regular $\chi_1^2$ distribution.
If two random-effect parameters are set to zero under the null hypothesis ($q = 2$), the p-value of the likelihood ratio statistic is written as

$$p = 0.5\,\Pr\!\left(\chi_1^2 \geq \hat{\Lambda}\right) + 0.5\,\Pr\!\left(\chi_2^2 \geq \hat{\Lambda}\right), \qquad \text{for } q = 2. \tag{4.45}$$
Given the significance level α, the critical value, denoted by cα, for the likelihood ratio test can be obtained by the equation

$$\alpha = 0.5\,\Pr\!\left(\chi_1^2 \geq c_\alpha\right) + 0.5\,\Pr\!\left(\chi_2^2 \geq c_\alpha\right). \tag{4.46}$$
For $q > 2$, the researcher can test the statistical significance of the $q$th element of the random effects $b_i$. Given the same rationale of the 50:50 mixture distributions, the p-value of the likelihood ratio statistic $\hat{\Lambda}$ is

$$p = 0.5\,\Pr\!\left(\chi_s^2 \geq \hat{\Lambda}\right) + 0.5\,\Pr\!\left(\chi_{s+1}^2 \geq \hat{\Lambda}\right), \qquad s = 1, 2, \ldots, q - 1. \tag{4.47}$$
In Equation (4.47), the scenario is similar to the case for $q = 2$, except that the random effects retained under the null hypothesis form a vector rather than a scalar. When $q$ is large, the p-value based on the mixture distributions is close to the p-value from the classical likelihood ratio test.
If one random effect is removed from an unstructured $G$ matrix containing $\tilde{q}$ random effects, the p-value of the likelihood ratio statistic is

$$p = 0.5\,\Pr\!\left(\chi_{\tilde{q}}^2 \geq \hat{\Lambda}\right) + 0.5\,\Pr\!\left(\chi_{\tilde{q}-1}^2 \geq \hat{\Lambda}\right). \tag{4.48}$$
In longitudinal data analysis, there are more complex situations for the specification of mixture distributions in the likelihood ratio test. In these complicated situations, it is not easy for the researcher to derive the mixing probabilities, and simulation techniques can be used to calculate the p-value. In this process, the structure of the information matrix and the approximating dimensions play important roles in the derivation of the asymptotic null distribution. For more details concerning the mixture distribution approach, the interested reader is referred to Searle et al. (1992), Self and Liang (1987), and Verbeke and Molenberghs (2000, Chapter 6; 2003).
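A minimal sketch of the 50:50 mixture p-value calculation is given below, assuming the likelihood ratio statistic has already been obtained from fitting the two nested models; the function name and the example statistic value are hypothetical.

from scipy.stats import chi2

def mixture_pvalue(lr_stat, df_null, df_alt):
    """Approximate p-value for a likelihood ratio statistic whose null
    distribution is a 50:50 mixture of chi-square distributions with df_null
    and df_alt degrees of freedom (df = 0 is treated as a point mass at 0),
    as in Equations (4.44)-(4.45)."""
    def upper_tail(df):
        if df == 0:
            return float(lr_stat <= 0)      # chi^2_0: point mass at zero
        return chi2.sf(lr_stat, df)
    return 0.5 * upper_tail(df_null) + 0.5 * upper_tail(df_alt)

# Testing a single random-effect variance (q = 1): mixture of chi^2_0 and chi^2_1.
print(mixture_pvalue(3.2, 0, 1))   # 0.5 * P(chi^2_1 >= 3.2)
# Testing the second of two random effects (q = 2): mixture of chi^2_1 and chi^2_2.
print(mixture_pvalue(3.2, 1, 2))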