Chapter 6

Residual and influence diagnostics

Abstract

Chapter 6 first describes the statistical methods of residual diagnostics. A variety of residual types are specified, both generally and with specific regard to longitudinal data analysis. I also describe the semivariogram, a popular residual diagnostic technique in longitudinal data analysis. This method is used to check whether serial correlation among repeated measurements for the same subject remains given the specified fixed effects and covariance parameters. The method can be specified from both the random intercept and the random coefficient perspectives. Next, the statistical methods of influence diagnostics are presented. The basic diagnostics for identifying influential observations include Cook’s distance statistic, leverage, the DFFITS score, the MDFFITS statistic, COVTRACE, COVRATIO, the likelihood displacement (LD) statistic, and the LMAX standardized statistic. An empirical illustration shows how to check for influential observations when fitting a linear mixed model.

Keywords

Influence diagnostics
leverage
likelihood displacement statistic
LMAX statistic
residual types
semi-variogram
One of the primary objectives in creating a linear mixed model is to describe the time trend of a response measurement on the continuous scale, either by experimental units (such as treatments) or by population groups (such as currently married vs. currently not married). Given within-subject dependence in longitudinal data, specification of the random effects or of a residual covariance structure is required to account for intraindividual correlation, even though those parameters are usually not of direct interest. In the estimation of the fixed and the random effects, some aberrant observations can have undue influence on the fit of linear mixed models, thereby affecting the quality of parameter estimates. Therefore, once a linear mixed model is fitted with longitudinal data, regression diagnostics are necessary for verifying whether the statistical model fits the data appropriately and meets the various assumptions on model specification. In general linear models, the assessment of model adequacy regularly focuses on checking linearity, normality, homogeneity of variance, and independence of errors. With respect to longitudinal data analysis, regression diagnostics are more complex because an individual’s response is measured repeatedly over time, so that the results of regression diagnostics may vary across time points. The general principles in statistical modeling, however, are universal for all regression models, and linear mixed modeling is no exception. Standard diagnostic techniques, such as examining the deviation of the expected value from the observed, assessing the overall model fit, and identifying influential observations, also need to be performed in longitudinal data analysis.
In this chapter, statistical methods of residual diagnostics are described first, starting with an introduction on different types of residuals applied in linear mixed models. Next, the statistical methods on influence diagnostics are presented. I then provide an empirical illustration to display the application of various regression diagnostics in linear mixed models based on the same longitudinal data previously used. Lastly, I summarize the chapter with comments on the regression diagnostics applied in linear mixed models.

6.1. Residual Diagnostics

In this section, a variety of residual types applied in linear mixed models are introduced. Next, I describe a popular residual diagnostic technique in longitudinal data analysis, referred to as the semivariogram, which is used to check whether serial correlation among repeated measurements for the same subject remains given the specification of the fixed effects and the covariance parameters. This method can be applied in both the random intercept and the random coefficient models.

6.1.1. Types of Residuals in Linear Regression Models

In regression modeling, residuals are closely related to random errors; the two concepts, however, are distinct. While a random error, or disturbance, of an observed value is defined as the deviation of the observed value from the (unobservable) true function value, the residual of an observed value is the difference between the observed value and the estimated function. A regression model, based on various assumptions and with certain properties on random errors, yields parameter estimates and model-based predictions from a statistical procedure, which, in turn, yield residuals, not random errors.
Let the predicted mean response for subject i at time point j be $\hat{\mu}_{ij} = X_{ij}'\hat{\beta}$. An $n_i$-dimensional vector of residuals for each subject can then be derived from a linear mixed model, given by

$$r_{mi} = Y_i - X_i\hat{\beta}, \tag{6.1}$$
where $r_{mi}$, briefly discussed in Chapter 4, is referred to as the vector of the marginal residuals because $X_i\hat{\beta}$ is a marginal mean vector in linear mixed models. If the random-effects component is considered, residuals for each subject become conditional on the random effects, written as

$$r_{ci} = Y_i - X_i\hat{\beta} - Z_i\hat{b}_i = r_{mi} - Z_i\hat{b}_i, \tag{6.2}$$
where $r_{ci}$ represents the conditional residuals.
Given the specifications of linear mixed models, the variance–covariance matrix of marginal residuals is

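$$\operatorname{var}(\hat{r}_{mi}) = \hat{V}_i - X_i\left(X_i'\hat{V}_i^{-1}X_i\right)^{-1}X_i', \tag{6.3}$$
where $\hat{V}_i$ is the estimated total variance of $Y_i$, obtained from either the maximum likelihood or the restricted maximum likelihood (REML) estimator. The variance–covariance matrix of conditional residuals can be specified according to Gregoire et al. (1995) as follows:

$$\operatorname{var}(\hat{r}_{ci}) = \left(I_{n_i} - Z_i\hat{G}Z_i'\hat{V}_i^{-1}\right)\operatorname{var}(\hat{r}_{mi})\left(I_{n_i} - Z_i\hat{G}Z_i'\hat{V}_i^{-1}\right)'. \tag{6.4}$$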
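To make the link between Equations (6.2) and (6.4) concrete, the following minimal sketch computes both residual types for one hypothetical subject. All numbers (`Y_i`, `beta_hat`, `G_hat`, `sigma2`) are invented for illustration, a random intercept and slope with independent within-subject errors are assumed, and the random effects are predicted with the standard empirical BLUP formula $\hat{b}_i = \hat{G}Z_i'\hat{V}_i^{-1}r_{mi}$, which this section does not spell out.

```python
import numpy as np

# Hypothetical data for one subject with n_i = 4 visits (illustrative values only)
t = np.array([0.0, 1.0, 2.0, 3.0])
X_i = np.column_stack([np.ones(4), t])      # fixed-effects design: intercept, time
Z_i = X_i.copy()                            # random intercept and random slope
Y_i = np.array([10.2, 11.1, 13.0, 13.4])

# Assumed estimates, as if taken from a fitted linear mixed model
beta_hat = np.array([10.0, 1.0])
G_hat = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
sigma2 = 0.8                                # assumed within-subject error variance
V_i = Z_i @ G_hat @ Z_i.T + sigma2 * np.eye(4)

# Marginal residuals, Equation (6.1)
r_m = Y_i - X_i @ beta_hat

# Empirical BLUP of the random effects (standard formula, assumed here)
b_hat = G_hat @ Z_i.T @ np.linalg.solve(V_i, r_m)

# Conditional residuals, Equation (6.2)
r_c = r_m - Z_i @ b_hat

# The matrix that appears on both sides of var(r_mi) in Equation (6.4)
A = np.eye(4) - Z_i @ G_hat @ Z_i.T @ np.linalg.inv(V_i)
```

Under these assumptions `r_c` equals `A @ r_m`, so the conditional residuals are a linear transformation of the marginal residuals, which is where the sandwich form of Equation (6.4) comes from.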
In linear regression models, the distributions of residuals at different data points may not necessarily follow an expected pattern, even if random errors from the true function value are identically distributed. For example, in the conditional residuals, the random effect estimate $\hat{b}_i$, obtained from the empirical Bayes approximation, heavily depends on the normality assumption and is also influenced by the assumed covariance structure $\hat{V}_i$. Therefore, the elements in $\hat{b}_i$ are the empirical best linear unbiased predictors (BLUPs), thereby being shrunk toward the population fixed effects $\beta$. As a result, the distribution of the BLUPs does not accurately represent the distribution of the true random effects. In the application of linear mixed models, residuals often need to be adjusted by the expected variability for comparative purposes at different time points. A popular approach for such an adjustment is to standardize the raw residuals by using the estimated residual variance, referred to as studentizing in statistical analysis, given by

$$r^{\mathrm{student}}_{mi} = \frac{r_{mi}}{\sqrt{\operatorname{var}(r_{mi})}}, \tag{6.5a}$$

$$r^{\mathrm{student}}_{ci} = \frac{r_{ci}}{\sqrt{\operatorname{var}(r_{ci})}}, \tag{6.5b}$$
where $r^{\mathrm{student}}_{mi}$ is the vector of studentized marginal residuals for subject i and $r^{\mathrm{student}}_{ci}$ is the corresponding vector of studentized conditional residuals. Standardization of residuals is particularly important in longitudinal data with distinctive outliers. Some scientists suggest external studentization of residuals in which a common residual variance estimate is used for all subjects.
Raw residuals can also be scaled by the estimate of the variance for the observed response y, referred to as Pearson-type residuals, given by

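$$r^{\mathrm{pearson}}_{mi} = \frac{r_{mi}}{\sqrt{\operatorname{var}(y_i)}}, \tag{6.6a}$$

$$r^{\mathrm{pearson}}_{ci} = \frac{r_{ci}}{\sqrt{\operatorname{var}(y_i)}}. \tag{6.6b}$$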
A more refined type of standardized residuals is the scaled residuals. Construction of this type of residuals is based on the argument that given the specification of the random effects or the inclusion of a selected residual covariance structure in linear mixed models, residuals should behave as pure measurement error, and therefore, they should reflect variability that is not explained by the specified random effects or the covariance parameters (Verbeke et al., 1998). Thus, it is necessary to eliminate all sources of correlation by appropriate scaling to check residuals. The classical approach in this regard is to apply the Cholesky decomposition for the generation of transformed residuals that have constant variance and zero correlation.
The application of the Cholesky decomposition on the variance–covariance matrix starts with construction of a lower triangular matrix for each subject, denoted by $C_i$, which satisfies the condition

$$\hat{V}_i = C_iC_i',$$
where $C_i$ represents the Cholesky root of $\hat{V}_i$, with $C_i^{-1}Y_i$ having constant variance and zero correlation.
Given these properties, the $C_i$ matrix can be used to transform the correlated residuals to correlation-free transformed residuals. For the marginal distribution of longitudinal data, the scaled residuals, denoted by $R_{mi}$, are defined as

$$R_{mi} = C_i^{-1}r_{mi} = C_i^{-1}\left(Y_i - X_i\hat{\beta}\right), \tag{6.7}$$
which have unit variance and zero correlation.
For practical purposes, the scaled residuals can be plotted against the transformed predictions of Y, denoted by $\hat{Y}^*_i$ and defined as

$$\hat{Y}^*_i = C_i^{-1}X_i\hat{\beta}.$$
If a linear mixed model is correctly specified, the plot of the transformed residuals against $\hat{Y}^*_i$ should be scattered randomly around zero with a constant range of variation. In contrast, if this scatter-plot displays a systematic trend, residuals remain correlated even with the specified random effects and/or a selected residual covariance structure. In the latter case, more random effect terms need to be considered in the specification of a linear mixed model and in statistical inference. Sometimes, the researcher might want to fit a smooth curve to the scatter-plot. If the linear mixed model is adequately specified, the fitted curve should not display distinctive systematic departures from a horizontal line centered at zero (Fitzmaurice et al., 2004). The transformed residuals can also be used to identify skewness, detect potential outliers, and assess the normality assumption.
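The Cholesky scaling of Equation (6.7) can be sketched numerically as follows. This is only an illustration under stated assumptions: the covariance matrix `V_i` (a random intercept with variance 2.0 plus independent errors with variance 1.0) is hypothetical and treated as known, and the marginal residuals are simulated rather than taken from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginal covariance for n_i = 4 visits:
# random intercept (variance 2.0) plus independent errors (variance 1.0)
n_i = 4
V_i = 2.0 * np.ones((n_i, n_i)) + 1.0 * np.eye(n_i)

# Lower-triangular Cholesky root, so that V_i = C_i C_i'
C_i = np.linalg.cholesky(V_i)

# Simulated marginal residuals for 5000 subjects, one row per subject
r_m = rng.multivariate_normal(np.zeros(n_i), V_i, size=5000)

# Scaled residuals R_mi = C_i^{-1} r_mi, obtained by solving C_i R = r
R_m = np.linalg.solve(C_i, r_m.T).T

# Empirical covariance before and after scaling
emp_cov_raw = np.cov(r_m, rowvar=False)
emp_cov_scaled = np.cov(R_m, rowvar=False)
```

In this sketch `emp_cov_raw` retains the large off-diagonal entries induced by the random intercept, whereas `emp_cov_scaled` should be close to the identity matrix, which is the "unit variance and zero correlation" property the scaled residuals are meant to have.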

6.1.2. Semivariogram in Random Intercept Linear Models

Based on the random intercept linear model, Diggle (1988) introduces the approach of empirical semivariogram to assess residuals in longitudinal data analysis. This method is designed to check serial correlation in residuals, conditionally on the random components already included in the model. Construction of this residual diagnostic method begins with the specification of a general linear model with an error term that is assumed to be independently distributed as multivariate normal, referred to as the generalized multivariate linear model.
Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ be an $n_i$-dimensional vector of repeated measurements for subject i and $e_i = (e_{i1}, \ldots, e_{in_i})'$ be an $n_i \times 1$ column vector of assumed errors. A generalized multivariate linear regression model is then written as

$$Y_i = X_i\beta + e_i, \tag{6.8}$$
where $X_i$ is a known $n_i \times M$ matrix of covariates and $\beta$ is an $M \times 1$ vector of unknown population parameters. Given the correlation between serial observations for the same subject, $e_i$ is assumed to follow a multivariate normal distribution with mean 0 and $n_i \times n_i$ covariance matrix $V_i$. Correspondingly, $Y_i$ can be expressed as $Y_i \sim \mathrm{MVN}(X_i\beta, V_i)$. As summarized by Diggle (1988) and Diggle et al. (2002), an appropriate $V_i$ should accommodate at least three different sources of random variation. First, average responses usually vary randomly between subjects, with some subjects being intrinsically high and some being low. Second, a subject’s observed measurement profile may be a response to time-varying stochastic processes. Third, as the individual measurements involve some kind of subsampling within subjects, the measurement process adds a component of variation to the data. This perspective on the breakdown of variability is briefly mentioned in Chapter 1.
The above decomposition of stochastic variations in repeated measurements of the response facilitates the formulation of correlation between pairs of measurements for the same subject. A generalized multivariate linear model, specifying all three features in Vi, can be written as

$$Y_{ij} = X_{ij}'\beta + b_i + \tilde{W}_i(T_{ij}) + \varepsilon_{ij}, \tag{6.9}$$
where $X_{ij}'\beta$ represents the model-based mean response for subject i at time point j, $b_i$ indicates the i.i.d. variation in average response between subjects with mean 0 and variance $\sigma_b^2$, and $\varepsilon_{ij}$ represents the subsampling variation within the subject with mean 0 and variance $\sigma_\varepsilon^2$. The term $\tilde{W}_i(T_{ij})$ reflects independent stationary Gaussian processes with zero expectation and covariance function $\sigma^2\rho(|T_j - T_{j'}|)$. Clearly, the terms $b_i$, $\tilde{W}_i(T_{ij})$, and $\varepsilon_{ij}$ correspond to the random intercept, autoregressive correlation, and within-subject random error, respectively, as described previously.
Given the specification of Equation (6.9), the variance matrix for subject i can be specified as

$$V_i = \sigma_b^2 J + \sigma^2 R(T_i) + \sigma_\varepsilon^2 I, \tag{6.10}$$
where J is a square matrix with all its elements being unity, $T_i = (T_{i1}, \ldots, T_{in_i})'$ is the time vector, $R(T_i)$ is a symmetric matrix with element $\rho(|T_j - T_{j'}|)$, and I is the identity matrix. As indicated in Chapter 5, the form of $\rho(|T_j - T_{j'}|)$ can be parameterized in different situations. When $\sigma^2 = 0$, the above generalized multivariate linear model reduces to the uniform correlation model (Diggle, 1988). If $\sigma_b^2$ is also equal to zero, the model reduces further to a typical general linear model with independent random errors. Such cases, however, are empirically very rare in longitudinal data.
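As a small illustration of Equation (6.10), the sketch below assembles $V_i$ for one subject with unequally spaced visits. The exponential correlation function $\rho(u) = \exp(-\phi u)$, the helper name `build_V`, and all parameter values are assumptions made for this example only; the text leaves the form of $\rho$ open.

```python
import numpy as np

def build_V(times, sigma2_b, sigma2, sigma2_eps, phi=1.0):
    """Covariance matrix of Equation (6.10) for one subject.

    rho(u) = exp(-phi * u) is an assumed correlation function.
    """
    t = np.asarray(times, dtype=float)
    n = t.size
    lags = np.abs(t[:, None] - t[None, :])   # |T_j - T_j'| for every pair
    R = np.exp(-phi * lags)                  # serial correlation matrix R(T_i)
    J = np.ones((n, n))                      # matrix with all elements equal to 1
    I = np.eye(n)                            # identity matrix
    return sigma2_b * J + sigma2 * R + sigma2_eps * I

times = [0.0, 0.5, 2.0, 3.5]                 # hypothetical, unequally spaced visits

# All three variance components present
V_full = build_V(times, sigma2_b=1.0, sigma2=2.0, sigma2_eps=0.5)

# sigma^2 = 0: the uniform correlation special case
V_uniform = build_V(times, sigma2_b=1.0, sigma2=0.0, sigma2_eps=0.5)
```

With `sigma2=0.0` every off-diagonal element of `V_uniform` equals `sigma2_b`, so all pairs of measurements share the same correlation regardless of their time separation, which is the uniform correlation model mentioned above.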
Let $T \in \mathcal{T}$ be a continuous time variable. Then the semivariogram of a random process $\{Y(T)\}: T \in \mathcal{T}$ is defined as the function

$$g(u) = \frac{1}{2}E\left[\{Y(T) - Y(T-u)\}^2\right], \quad u \ge 0, \tag{6.11}$$
which is assumed to be independent of T. Given this specification, the empirical semivariogram of a time series or a time trend is specified as a scatter-plot of half squared differences, given by

$$d^2_{jj'} = \frac{1}{2}\left[y(T_j) - y(T_{j'})\right]^2,$$
plotted against the corresponding time lags $u_{jj'} = |T_j - T_{j'}|$.
Empirically, the marginal residuals, denoted by $r_{mi}$, can be used to generate the empirical semivariogram. With $r_{ij} = Y_{ij} - X_{ij}'\hat{\beta}$, the empirical semivariogram can be written as

$$\frac{1}{2}E\left[(r_{ij} - r_{ij'})^2\right] = \sigma_\varepsilon^2 + \sigma^2\left[1 - \rho(|T_{ij} - T_{ij'}|)\right]. \tag{6.12}$$
Obviously, the empirical semivariogram is a spatial data specification. The rationale of this method is that, given the specification of the random intercepts across subjects, the empirical semivariogram displays ordinary-least-squares (OLS)-type residuals by removing the term for correlation in repeated measurements. Consequently, if the random intercept model is correctly assumed, the errors should follow a constant pattern, and, correspondingly, the empirical semivariogram should be scattered randomly rather than systematically.
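The step from Equation (6.9) to Equation (6.12) can be filled in as follows. The derivation is a sketch under two simplifying assumptions that are not stated in the text: $\hat{\beta}$ is treated as equal to $\beta$, so that $r_{ij}$ behaves like the error part of Equation (6.9), and $b_i$, $\tilde{W}_i$, and $\varepsilon_{ij}$ are taken to be mutually independent.

```latex
\begin{align*}
r_{ij} - r_{ij'}
  &= \{\tilde{W}_i(T_{ij}) - \tilde{W}_i(T_{ij'})\}
     + (\varepsilon_{ij} - \varepsilon_{ij'})
     && \text{(the random intercept } b_i \text{ cancels)} \\
E\left[(r_{ij} - r_{ij'})^2\right]
  &= \left\{2\sigma^2 - 2\sigma^2\rho(|T_{ij} - T_{ij'}|)\right\}
     + 2\sigma_\varepsilon^2 \\
\tfrac{1}{2}E\left[(r_{ij} - r_{ij'})^2\right]
  &= \sigma_\varepsilon^2 + \sigma^2\left[1 - \rho(|T_{ij} - T_{ij'}|)\right]
\end{align*}
```

Because $b_i$ drops out of every within-subject difference, the semivariogram reflects only the serial correlation and measurement error components.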
Equation (6.12) literally states that the process variance can be estimated by half the average squared difference between pairs of observations from different subjects. As a nonparametric diagnostic method, the empirical semivariogram provides a useful graphical check on the adequacy of a specific covariance structure in R, provided that there are no random effects except a random intercept term. This approach is particularly useful to check residuals for longitudinal data with unequal time intervals.
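A minimal sketch of how the empirical semivariogram might be computed from residuals is given below. The function name `empirical_semivariogram`, the two toy subjects, and the simple averaging at each distinct lag are illustrative assumptions; in practice the point cloud is usually smoothed rather than averaged lag by lag.

```python
import numpy as np

def empirical_semivariogram(times, residuals):
    """Return time lags u and half squared differences d2 for one subject."""
    t = np.asarray(times, dtype=float)
    r = np.asarray(residuals, dtype=float)
    j, k = np.triu_indices(t.size, k=1)      # all pairs j < k within the subject
    u = np.abs(t[j] - t[k])                  # time lag of each pair
    d2 = 0.5 * (r[j] - r[k]) ** 2            # half squared residual difference
    return u, d2

# Two hypothetical subjects with unequally spaced visits: (times, residuals)
subjects = [
    ([0.0, 1.0, 3.0], [1.0, 2.0, 0.0]),
    ([0.0, 2.0], [0.5, -0.5]),
]

# Pool the within-subject pairs across subjects
pairs = [empirical_semivariogram(t, r) for t, r in subjects]
u_all = np.concatenate([u for u, _ in pairs])
d2_all = np.concatenate([d for _, d in pairs])

# Average the half squared differences at each distinct lag
lags = np.unique(u_all)
g_hat = np.array([d2_all[u_all == lag].mean() for lag in lags])
```

Plotting `d2_all` against `u_all` gives the scatter-plot described above; only pairs of measurements from the same subject enter the calculation, so unequal spacing poses no difficulty.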

6.1.3. Semivariogram in the Linear Random Coefficient Model

The specification of the generalized multivariate linear model provides a flexible framework for checking residuals in modeling normal longitudinal data. The development of this approach is based on the thought that the effect of serial correlation is dominated by the combination of the random intercept and random errors, and therefore, inclusion of more random effect terms is unnecessary.
More recently, some more flexible formulations on covariance structures have been proposed to handle complex longitudinal data patterns (Chi and Reinsel, 1989; Verbeke et al., 1998). The majority of these methods introduce an additional random term in the expression of total errors, containing both the random coefficients across subjects and serial correlation in within-subject random errors, written as

$$Y_i = X_i\beta + Z_ib_i + e_i, \tag{6.13}$$
where $e_i$ is assumed to follow a multivariate normal distribution with mean 0 and $n_i \times n_i$ covariance matrix $R_i$. As indicated earlier, in the standard linear random coefficient model, $\operatorname{cov}(e_i) = \sigma^2 I$ is usually assumed given the specified covariates and the random effects. In performing residual diagnostics, however, specifying $\operatorname{cov}(e_i) = R_i$ as multivariate is considered essential to check whether residuals remain correlated with the specified random coefficients. Chi and Reinsel (1989) propose a score test to examine the random coefficient model with $\operatorname{cov}(e_i) = \sigma^2 I$ against the random coefficient multivariate model with AR(1) errors for $e_i$. This approach provides a simple check for possible autocorrelation in residuals, which, in linear mixed models, are generally assumed to be conditionally independent. It is found from the score test that the specification of the random effects generally accounts for the serial correlation among repeated measurements, and therefore, the random coefficient model adding a serial correlation term overparameterizes the covariance structure.
In the framework of the random coefficient multivariate model, the error structure can be more flexibly decomposed (Verbeke et al., 1998), given by

$$e_i = Z_ib_i + \varepsilon_{1i} + \varepsilon_{2i}, \tag{6.14}$$
where $\varepsilon_{1i}$ is the $n_i \times 1$ residual vector indicating time-varying stochastic processes operating within the subject (serial correlation), assumed to be normally distributed with mean zero and covariance matrix $\tilde{H}_i$. The covariance matrix $\tilde{H}_i$ has element $\tilde{h}_{ijj'}$ taking the form $\tilde{\tau}^2 g(|T_{ij} - T_{ij'}|)$, where $\tilde{\tau}^2$ is the profiled variance for all elements of $\varepsilon_{1i}$ and $g(\cdot)$ is the unknown positive decreasing function. The vector $\varepsilon_{2i}$ represents random errors assumed to be independent and identically distributed with zero expectation. The q-dimensional vector $b_i$ of subject-specific random effects is assumed to follow a multivariate normal distribution with mean zero and covariance matrix G, as regularly specified.
Given Equation (6.14), the response vector $y_i$ marginally follows a normal distribution with mean vector $X_i\beta$ and covariance matrix

$$V_i = Z_iGZ_i' + \tilde{H}_i + \sigma_\varepsilon^2 I_{n_i}, \tag{6.15}$$
where $I_{n_i}$ is the $n_i$-dimensional identity matrix. This marginal random coefficient multivariate model can be estimated by using either the maximum likelihood (ML) or the REML approach. As discussed by Verbeke et al. (1998), this regression model can be used to check whether a classical linear mixed model sufficiently accounts for intraindividual correlation without considering an additional residual covariance structure.
Empirically, one might compare the analytic and graphical results from different regression models such as the ordinary least squares, the random intercept, the random coefficient, and the random coefficient plus a specific residual covariance structure. If both the OLS and the random-intercept residuals are found to deviate markedly from normality, serial correlation needs to be further specified and some regression coefficients should be considered random across subjects. Likewise, if residuals from a random coefficient model still deviate notably from a normal distribution, the researcher might want to add an appropriate residual covariance matrix to the linear random coefficient model for yielding efficient and consistent parameter estimates. The latter case, however, rarely occurs in longitudinal data analysis.
With the inclusion of the random coefficients, the semivariogram based on the random intercept model is extended to the random coefficient perspective. As the covariance structure specified in Equation (6.15) has been found to be dominated by its first component $Z_iGZ_i'$, it is necessary to remove all variability explained by the random effects $b_i$ before proceeding with the serial correlation check. The scaled residuals, denoted by $R_{mi}$ and described in Section 6.1.1, can be used for this removal. These standardized residuals are independent of any distributional assumptions on $b_i$, and therefore, their computation does not require an estimate of the random-effects covariance matrix G (Verbeke et al., 1998). Because the scaled residuals $R_{mi}$ have unit variance and zero correlation, the semivariogram based on the linear random coefficient model can be written as

$$\frac{1}{2}E\left[(R_{ij} - R_{ij'})^2\right] = \frac{1}{2}\operatorname{var}(R_{ij}) + \frac{1}{2}\operatorname{var}(R_{ij'}) - \operatorname{cov}(R_{ij}, R_{ij'}) = \frac{1}{2} + \frac{1}{2} - 0 = 1. \tag{6.16}$$
Equation (6.16) indicates that if a random coefficient linear model is correctly specified, the plot of the semivariogram of the transformed residuals against time should be scattered randomly around unity without displaying any systematic time trend. With this property, the semivariogram can be applied to check whether the specified random effects explain all serial correlation between repeated measurements. In performing this residual diagnostic method, the conditional residuals, defined as $Y_i - X_i\hat{\beta} - Z_i\hat{b}_i$, are not recommended for use because the BLUP $\hat{b}_i$ from the empirical Bayes approximation depends heavily on the normality assumption and is also influenced by the assumed covariance structure $\hat{V}_i$.
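The property in Equation (6.16) can be checked by simulation. The sketch below assumes a correctly specified random intercept and slope model with a known covariance matrix; the visit times, `G`, `sigma2_eps`, and the number of simulated subjects are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design: four unequally spaced visits, random intercept and slope
t = np.array([0.0, 1.0, 2.0, 4.0])
Z = np.column_stack([np.ones(t.size), t])
G = np.array([[1.0, 0.1],
              [0.1, 0.3]])
sigma2_eps = 0.5
V = Z @ G @ Z.T + sigma2_eps * np.eye(t.size)   # no serial correlation component
C = np.linalg.cholesky(V)                        # V = C C'

# Simulated marginal residuals (one row per subject) and their scaled version
n_subjects = 5000
r_m = rng.multivariate_normal(np.zeros(t.size), V, size=n_subjects)
R = np.linalg.solve(C, r_m.T).T                  # R_mi = C^{-1} r_mi

# Half squared differences for every within-subject pair of time points
j, k = np.triu_indices(t.size, k=1)
lags = np.abs(t[j] - t[k])
semivar_scaled = 0.5 * (R[:, j] - R[:, k]) ** 2
semivar_raw = 0.5 * (r_m[:, j] - r_m[:, k]) ** 2

# Average over subjects for each pair
mean_scaled = semivar_scaled.mean(axis=0)
mean_raw = semivar_raw.mean(axis=0)
```

In this setting `mean_scaled` should hover around 1 at every lag, whereas `mean_raw` increases with the lag because the random slope has not been removed; a systematic departure of the scaled version from unity would point to serial correlation left unexplained by the random effects.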
The semivariogram, based on either the random intercept or the random coefficient model, is usually applied for spatial data when the time intervals are unequally spaced. For highly unbalanced longitudinal data, a smooth plot of the empirical semivariogram may be fitted from the scatter-plot of the transformed residuals, which should be centered at unity displaying no systematic time trend. Because the empirical semivariogram is sensitive to outliers, influence diagnostics need to be performed first before fitting a smooth curve to the scatter-plot of the empirical semivariogram. Large-scale longitudinal data are needed to conduct this residual diagnostic technique.

6.2. Influence Diagnostics

In regression diagnostics, another important area is identification of influential observations. For various regression models, it is essential to identify particular observations that have extraordinary influences on the analytic results. Identification of influential observations in linear mixed models differs slightly from the classical approaches applied in general linear models. Most significantly, diagnostics on longitudinal data involve individuals having multiple data points, rather than at a single time. Consequently, removal of one individual can affect a series of observations, thereby magnifying the case’s influence on parameter estimates, both the fixed effects and the random components. Therefore, more refined techniques are sometimes required to identify influential observations for a linear mixed model. In most situations, however, influence diagnostics applied in longitudinal data analysis follow the standard perspectives as those used in general linear models.
This section describes a variety of the basic diagnostic techniques to identify influential observations in linear mixed models, including Cook’s distance statistic, leverage, the DFFITS score, the MDFFITS statistic, COVTRACE, COVRATIO, the likelihood displacement (LD) statistic and its approximations, and the LMAX standardized statistic. An illustration is provided to check whether there are any influential observations in fitting the two linear mixed models described in the preceding three chapters.

6.2.1. Cook’s D and Related Influence Diagnostics

In fitting a linear mixed model, some observations may unduly influence the inferential process for deriving parameter estimates, as is frequently encountered in fitting other types of regression models. Traditionally, these influential cases can be identified by the change in the estimated regression coefficients after deleting each observation in sequence (Cook, 1977, 1979).
Let $\hat{\beta}$ be the estimate of $\beta$ that maximizes the log-likelihood or the log restricted likelihood function and $\hat{\beta}_{(-i)}$ be the same estimate of $\beta$ when subject i is eliminated from the estimating process. For a single covariate $X_m$, the distance in the estimated regression coefficient after removing the subject, denoted $d_{mi}$, is written as

$$d_{mi} = \hat{\beta}_m - \hat{\beta}_{m(-i)}. \tag{6.17}$$
Equation (6.17) provides an exact measurement for the absolute influence of deleting subject i from the estimating process on the regression coefficient estimate, referred to as Cook’s distance statistic. This statistic can be expressed in terms of the entire vector of the estimated regression coefficients, given by $d_i = \hat{\beta} - \hat{\beta}_{(-i)}$. A greater value of $d_i$ suggests that subject i has a stronger influence on the estimate of $\beta$; likewise, a lower value indicates that subject i’s impact on the model fit is limited.
For convenience of performing the significance test, Cook’s distance statistic is often scaled by the estimates of the coefficient standard errors after removing subject i. Mathematically, this scaled or standardized distance score is written as

$$\bar{d}_i = \frac{\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]'(X'X)\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]}{\left[1 + \operatorname{rank}(X)\right]s^2}, \tag{6.18}$$
where $\bar{d}_i$ is the scaled Cook’s distance, or simply Cook’s D, and $s^2$ is the mean square error of the data. In the literature of influence diagnostics, the two Cook’s distance statistics, $d_i$ and $\bar{d}_i$, are also referred to as DFBETA$_i$ and DFBETAS$_i$, respectively (Belsley et al., 1980; Fox, 1991).
Like the original statistic $d_i$, a large value of $\bar{d}_i$ indicates that the parameter estimates are sensitive to removal of the ith subject. With standardization, $\bar{d}_i$ approximately follows an F-distribution with the numerator degrees of freedom being $\operatorname{rank}(X)$ and the denominator degrees of freedom being $N - \operatorname{rank}(X)$. Given the F-distribution, the significance of Cook’s D can be statistically tested. Specifically, assuming X to have full rank, the F statistic can be used to test the null hypothesis with threshold $F_{M,\,N-M,\,\alpha}$.
For linear mixed models, the specification of the scaled Cook’s D is slightly more complex, given by

$$\bar{d}_i = \frac{\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]'\operatorname{var}(\hat{\beta})^{-1}\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]}{\operatorname{rank}(X)}, \tag{6.19}$$
where $\operatorname{var}(\hat{\beta})^{-1}$ is the matrix from sweeping

$$\left[X'V(\hat{G}, \hat{R})^{-1}X\right]^{-1}.$$
If V is known, Cook’s D can be evaluated according to a chi-square distribution with the degrees of freedom being the rank of X (Christensen et al., 1992). If V is unknown, an estimate of V needs to be obtained from the approach described in Chapters 3 and 4, and the statistical evaluation of $\bar{d}_i$ should be based on the $F_{M,\,N-M,\,\alpha}$ distribution. For large samples, however, checking the statistical significance of $\bar{d}_i$ for each subject in sequence is unrealistic. In these situations, plotting the scaled statistics graphically is a more practical approach for a quick diagnostic check. An analytic approach to checking the statistical significance of $\bar{d}_i$ without removing subjects in sequence will be described in Section 6.2.4.
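The subject-level deletion behind Equation (6.19) can be sketched as follows. The example rests on several assumptions not made in the text: the covariance matrix is treated as known and identical for all subjects (a random intercept structure), so that $\operatorname{var}(\hat{\beta})^{-1} = X'V^{-1}X$ and the fixed effects are obtained by generalized least squares; the data are simulated; and one subject is deliberately contaminated. The helper names `gls_fit` and `cooks_d_by_subject` are hypothetical.

```python
import numpy as np

def gls_fit(X_list, y_list, V_list):
    """GLS estimate of beta and the matrix X'V^{-1}X, summed over subjects."""
    info = sum(X.T @ np.linalg.solve(V, X) for X, V in zip(X_list, V_list))
    score = sum(X.T @ np.linalg.solve(V, y)
                for X, y, V in zip(X_list, y_list, V_list))
    return np.linalg.solve(info, score), info

def cooks_d_by_subject(X_list, y_list, V_list):
    """Scaled Cook's D of Equation (6.19), deleting one subject at a time."""
    n = len(X_list)
    beta_hat, info = gls_fit(X_list, y_list, V_list)
    rank_X = np.linalg.matrix_rank(np.vstack(X_list))
    d = np.empty(n)
    for i in range(n):
        keep = [k for k in range(n) if k != i]
        beta_del, _ = gls_fit([X_list[k] for k in keep],
                              [y_list[k] for k in keep],
                              [V_list[k] for k in keep])
        delta = beta_hat - beta_del
        d[i] = delta @ info @ delta / rank_X   # info = var(beta_hat)^{-1} for known V
    return d

rng = np.random.default_rng(7)
t = np.array([0.0, 1.0, 2.0, 3.0])
X_i = np.column_stack([np.ones(4), t])
V_i = 1.0 * np.ones((4, 4)) + 1.0 * np.eye(4)   # assumed known covariance
n_subjects = 20
X_list = [X_i] * n_subjects
V_list = [V_i] * n_subjects
y_list = [X_i @ np.array([5.0, 0.5])
          + rng.multivariate_normal(np.zeros(4), V_i)
          for _ in range(n_subjects)]
y_list[3] = y_list[3] + 25.0                    # contaminate the subject at index 3

D = cooks_d_by_subject(X_list, y_list, V_list)
```

Plotting `D` against the subject index is the kind of quick graphical check suggested above; in this constructed example the contaminated subject stands out clearly. With an estimated covariance matrix, the covariance parameters would in principle also change when a subject is deleted, which this sketch ignores.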
The influential cases can also be identified by the change in the predicted value after deleting each subject in a sequence (Cook, 1977, 1979). One of the useful diagnostics in this regard is the PRESS statistic, defined as

$$\mathrm{PRESS} = \sum_{i' \ne i}\hat{r}^2_{i'(-i)}, \tag{6.20}$$
where the sum does not include subject i whose influence on the model fit is under check, and

$$\hat{r}_{i'(-i)} = y_{i'} - X_{i'}\hat{\beta}_{(-i)}.$$
There are some other influence diagnostics based on the predicted value when a subject is removed from the regression. As more or less related to the leverage statistic, they are described in Section 6.2.3 after leverage is delineated.

6.2.2. Leverage

In linear regression models, leverage is used to assess outliers with respect to the independent variables by identifying the observations that are distant from the average predictor values. While potentially impactful on the parameter estimates and the model fit, a high-leverage point does not necessarily exert strong influence on the regression coefficient estimates, because a subject whose predictor values lie far from those of others can still be situated on the same regression line as the other observations (Fox, 1991). Therefore, substantial influence requires high leverage combined with discrepancy of the case from the rest of the data.
The basic measurement of leverage is the so-called hat-value, denoted by $h_i$. In general linear models, the hat-value is specified as a weight variable in the expression of the fitted value of the predicted response $\hat{y}_{\tilde{j}}$, given by

$$\hat{y}_{\tilde{j}} = \sum_{i=1}^{N} h_{i\tilde{j}}\,y_i, \tag{6.21}$$
where $h_{i\tilde{j}}$ is the weight of subject i in predicting the outcome Y at data point $\tilde{j}$ ($\tilde{j} = 1, 2, \ldots, N$), and $\hat{y}_{\tilde{j}}$ is specified as a weighted average of N observed values. Therefore, the weight variable $h_{i\tilde{j}}$ displays the influence of $y_i$ on $\hat{y}_{\tilde{j}}$, with a higher score indicating a greater impact on the fitted value. Let

$$h_i = h_{ii} = \sum_{\tilde{j}=1}^{N} h^2_{i\tilde{j}}. \tag{6.22}$$
According to Equation (6.22), the hat-value $h_i$, with property $0 \le h_i \le 1$, is the leverage score of $y_i$ on all fitted values.
In general linear models including a number of independent variables, leverage measures distance from the means of the independent variables and can be expressed as a matrix quantity given the covariance structure of the X matrix. Correspondingly, the hat-value hi is the ith diagonal of a hat matrix H. The hat matrix H is given by

$$H = X(X'X)^{-1}X'. \tag{6.23}$$
The diagonal of H provides a standardized measure of the distance for the ith subject from the center of the X-space, with a large value indicating that the subject is potentially influential. If all cases have equal influence, each subject will have a leverage score of M/N, where M is the number of independent variables (including the intercept) and N is the number of observations. In the literature of influence diagnostics, the leverage values exceeding 2M/N for large samples or 3M/N for samples of N ≤ 30 are regarded roughly as the influential cases.
Given the H matrix, the predicted values of y in general linear models can be written as

$$\hat{y} = Hy. \tag{6.24}$$
Therefore, the H matrix determines the variance and covariance of the fitted values and residuals, given by

$$\operatorname{var}(\hat{y}) = \sigma^2 H, \tag{6.25a}$$

$$\operatorname{var}(r) = \sigma^2 (I - H). \tag{6.25b}$$
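The hat-matrix relations in Equations (6.22) through (6.24) and the $2M/N$ rule of thumb can be illustrated with a small ordinary regression. The data below are made up, with the last predictor value placed deliberately far from the others.

```python
import numpy as np

# Hypothetical data: the last predictor value is far from the rest
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 8.2, 8.8, 10.1, 12.0])
X = np.column_stack([np.ones_like(x), x])    # intercept and one covariate
N, M = X.shape

# Hat matrix H = X (X'X)^{-1} X', Equation (6.23)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                                # leverage scores h_i

# Fitted values as weighted averages of the observed y, Equation (6.24)
y_hat = H @ y

# Indices of observations exceeding the 2M/N rule of thumb
flagged = np.flatnonzero(h > 2 * M / N)
```

In this example the leverage scores sum to `M`, each `h[i]` equals the sum of squares of row `i` of `H` as in Equation (6.22), and only the last observation is flagged. Whether that point is actually influential still depends on how far its response lies from the regression line, as noted at the start of this section.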
In longitudinal data analysis, the specification of the H matrix becomes more complex due to the inclusion of the covariance matrix V(R, G). Let $\hat{\Theta}$ be an available estimate of the covariance parameters in R and G (also specified in Chapter 4). The leverage score for subject i can be expressed as the ith diagonal of the following hat matrix:

$$H = X\left[X'V(\hat{\Theta})^{-1}X\right]^{-}X'V(\hat{\Theta})^{-1}. \tag{6.26}$$
The ith diagonal of the above matrix is the leverage score for subject i displaying the degree of the case’s difference from others in one or more independent variables.

6.2.3. DFFITS, MDFFITS, COVTRACE, and COVRATIO Statistics

In addition to Cook’s D, a number of other case-deletion diagnostics are frequently used in linear regression models. These diagnostics include the DFFITS, the MDFFITS, the COVTRACE, and the COVRATIO statistics (Belsley et al., 1980; Christensen et al., 1992; Fox, 1991; SAS, 2012, Chapter 59). The DFFITS statistic is defined as the change in the predicted value after removing a case, standardized by dividing by the estimated standard error of the fit. In general linear models, the statistic is defined as

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(-i)}}{s_{(-i)}\sqrt{h_i}}, \tag{6.27}$$
where $\hat{y}_i$ is the predicted value of Y for subject i from the linear regression including i, $\hat{y}_{i(-i)}$ is the same prediction after removing data point i in fitting the regression model, $s_{(-i)}$ is the standard error estimate without i, and $h_i$ is the corresponding leverage score. Given studentizing, the DFFITS statistic follows a Student t distribution multiplied by a leverage factor, given by

$$\mathrm{DFFITS}_i = t_i\sqrt{\frac{h_i}{1 - h_i}}. \tag{6.28}$$
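The agreement between the deletion form (6.27) and the closed form (6.28) can be verified numerically for an ordinary regression. In this sketch the data are simulated, and $t_i$ is taken to be the externally studentized residual $r_i / \{s_{(-i)}\sqrt{1 - h_i}\}$, which is an assumption about the notation since the text does not define $t_i$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data for an ordinary linear regression
N = 30
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + rng.normal(size=N)
M = X.shape[1]

# Full-data fit, residuals, and leverage scores
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

dffits_deletion = np.empty(N)
dffits_closed = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    # Refit without observation i
    beta_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    res_del = y[keep] - X[keep] @ beta_del
    s_del = np.sqrt(res_del @ res_del / (N - 1 - M))   # s_(-i)

    # Equation (6.27): change in the fitted value, scaled by s_(-i) * sqrt(h_i)
    dffits_deletion[i] = (X[i] @ beta - X[i] @ beta_del) / (s_del * np.sqrt(h[i]))

    # Equation (6.28): externally studentized residual times the leverage factor
    t_i = resid[i] / (s_del * np.sqrt(1.0 - h[i]))
    dffits_closed[i] = t_i * np.sqrt(h[i] / (1.0 - h[i]))
```

The two arrays agree up to rounding error, which shows that in ordinary regression DFFITS can be obtained from a single fit without refitting the model for each deleted case. No such shortcut is implied here for linear mixed models.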
In longitudinal data analysis, the statistical procedure to fit a regression model is more complex than that of a simple linear regression given the specification of Vi. For example, in linear mixed models, the DFFITS statistic is specified as

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(-i)}}{\mathrm{ese}(\hat{y}_i)}, \tag{6.29}$$
where $\mathrm{ese}(\hat{y}_i)$ is the asymptotic standard error estimate for $\hat{y}_i$. This statistic can be approximated by

$$\mathrm{ese}(\hat{y}_i) = \sqrt{x_i'\left[X'V(\hat{\Theta}_{(-i)})^{-1}X\right]^{-1}x_i}, \tag{6.30}$$
where $x_i$ contains the observed values of X for subject i. As a standardized diagnostic measure, the DFFITS statistic indicates the number of standard errors by which the fitted value changes after the removal of subject i.
The MDFFITS statistic is used when multiple data points are removed from the regression (Belsley et al., 1980). In longitudinal data, case deletion generally implies removal of multiple data points, thereby indicating that MDFFITS is an appropriate statistic for performing influence diagnostics in linear mixed models. Christensen et al. (1992) specify this statistic on the fixed effects in the context of linear mixed models, given by

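$$\mathrm{MDFFITS}[\beta_{(-i)}] = \frac{[\hat{\beta} - \hat{\beta}_{(-i)}]'\,\mathrm{var}[\hat{\beta}_{(-i)}]^{-}\,[\hat{\beta} - \hat{\beta}_{(-i)}]}{\mathrm{rank}(X)}. \tag{6.31}$$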
There is a striking similarity between the above MDFFITS[β(−i)] formulation and Cook’s D, specified by Equation (6.19). Both statistics measure the influence of data points on a vector of the fixed effects, with the only difference being the specification of $\mathrm{var}[\hat{\beta}_{(-i)}]$ in Equation (6.31).
If the covariance parameters are assumed to be fixed, the MDFFITS score for each subject can be estimated by a noniterative procedure to check only the fixed effects and the residual variance. The MDFFITS score, however, can be underestimated if a subject’s impact on the covariance parameters is overlooked. Therefore, the iterative approach is preferred to assess the overall impact of a subject, including the influence on the covariance parameter estimates. For the iterative influence analysis, $\mathrm{var}[\hat{\beta}_{(-i)}]$ is evaluated at $\hat{\Theta}_{(-i)}$, and therefore, an MDFFITS score can be computed specifically for the covariance parameters, written as

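$$\mathrm{MDFFITS}[\Theta_{(-i)}] = [\hat{\Theta} - \hat{\Theta}_{(-i)}]'\,\mathrm{var}[\hat{\Theta}_{(-i)}]^{-1}\,[\hat{\Theta} - \hat{\Theta}_{(-i)}]. \tag{6.32}$$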
The covariance trace and the ratio statistics, referred to as COVTRACE and COVRATIO, respectively, are the other two widely used diagnostics for identifying influential cases in linear regression models (Belsley et al., 1980; Christensen et al., 1992; Fox, 1991; SAS, 2012, Chapter 59). While Cook’s D, DFFITS, and MDFFITS statistics are used to measure a subject’s influence on the parameter estimates and the fitted values, the COVTRACE and COVRATIO statistics are applied to assess the influence on the precision of parameter estimates. For linear mixed models, Christensen et al. (1992) define the corresponding COVTRACE and COVRATIO statistics, given by

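$$\mathrm{COVTRACE}[\beta_{(-i)}] = \left|\operatorname{trace}\left\{\mathrm{var}(\hat{\beta})^{-}\,\mathrm{var}[\hat{\beta}_{(-i)}]\right\} - \mathrm{rank}(X)\right|, \tag{6.33}$$

$$\mathrm{COVRATIO}[\beta_{(-i)}] = \frac{\det\left\{\mathrm{var}[\hat{\beta}_{(-i)}]\right\}}{\det\left[\mathrm{var}(\hat{\beta})\right]}, \tag{6.34}$$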
where $\det\{\mathrm{var}[\hat{\beta}_{(-i)}]\}$ indicates the determinant of the nonsingular part of the $\mathrm{var}[\hat{\beta}_{(-i)}]$ matrix. The COVRATIO statistic can be used to assess the precision of the fixed effects with the following criteria: if COVRATIO > 1, inclusion of subject i in the regression improves the precision of the parameter estimates; if COVRATIO < 1, the incorporation of the subject in the estimating process reduces the precision of the estimation, so that subject i may be deleted in the model fit.
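A noniterative version of these subject-level statistics can be sketched as follows. This is an illustration under stated assumptions, not the algorithm implemented in SAS: the covariance block `V_i` is treated as known and is not re-estimated after deletion, the data are simulated, the helper `gls_fit` is hypothetical, and Cook's D is computed with $\mathrm{var}(\hat{\beta})$ in place of $\mathrm{var}[\hat{\beta}_{(-i)}]$, following the description of the difference between the two statistics given above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_time = 12, 4
t = np.arange(n_time, dtype=float)

# Assumed known compound-symmetry covariance block for every subject
V_i = 4.0 * np.eye(n_time) + 2.0 * np.ones((n_time, n_time))
V_i_inv = np.linalg.inv(V_i)
chol = np.linalg.cholesky(V_i)

# Per-subject design (intercept, time, treatment) and simulated responses
X_list = [np.column_stack([np.ones(n_time), t, np.full(n_time, float(i % 2))])
          for i in range(n_subj)]
true_beta = np.array([50.0, -3.0, 2.0])
y_list = [Xi @ true_beta + chol @ rng.normal(size=n_time) for Xi in X_list]

def gls_fit(subjects):
    """GLS estimate of beta and its covariance, using the listed subjects only."""
    A = sum(X_list[j].T @ V_i_inv @ X_list[j] for j in subjects)
    b = sum(X_list[j].T @ V_i_inv @ y_list[j] for j in subjects)
    cov = np.linalg.inv(A)
    return cov @ b, cov

all_subj = list(range(n_subj))
beta, cov = gls_fit(all_subj)
p = np.linalg.matrix_rank(np.vstack(X_list))

results = []
for i in all_subj:
    beta_d, cov_d = gls_fit([j for j in all_subj if j != i])
    d = beta - beta_d
    cooks_d = d @ np.linalg.inv(cov) @ d / p                   # uses var(beta_hat)
    mdffits = d @ np.linalg.inv(cov_d) @ d / p                 # Equation (6.31)
    covtrace = abs(np.trace(np.linalg.inv(cov) @ cov_d) - p)   # Equation (6.33)
    covratio = np.linalg.det(cov_d) / np.linalg.det(cov)       # Equation (6.34)
    results.append((i, cooks_d, mdffits, covtrace, covratio))

for i, cd, md, ct, cr in results:
    print(f"subject {i:2d}  D={cd:.4f}  MDFFITS={md:.4f}  "
          f"COVTRACE={ct:.4f}  COVRATIO={cr:.4f}")
```

With the covariance parameters held fixed, deleting a subject can only remove information, so in this sketch COVRATIO never falls below one and MDFFITS never exceeds Cook's D. Values of COVRATIO below one can arise only when the covariance parameters are re-estimated, as in the iterative analysis.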
In the iterative influence analysis, the COVTRACE and COVRATIO statistics can also be computed for the covariance parameter estimates:

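$$\mathrm{COVTRACE}[\Theta_{(-i)}] = \left|\operatorname{trace}\left\{\mathrm{var}(\hat{\Theta})^{-}\,\mathrm{var}[\hat{\Theta}_{(-i)}]\right\} - \mathrm{rank}[\mathrm{var}(\hat{\Theta})]\right|, \tag{6.35}$$

$$\mathrm{COVRATIO}[\Theta_{(-i)}] = \frac{\det\left\{\mathrm{var}[\hat{\Theta}_{(-i)}]\right\}}{\det\left[\mathrm{var}(\hat{\Theta})\right]}. \tag{6.36}$$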
Empirically, iterative computations of COVTRACE and COVRATIO for the covariance parameter estimates are burdensome and tedious, particularly for a large sample. When difficulty in performing the iterative methods arises, the researcher might want to consider using another diagnostic approach.

6.2.4. Likelihood Displacement Statistic Approximation

The above influence diagnostics are useful for graphically identifying influential cases. Some of those diagnostic statistics are linked to certain types of probability distributions through standardization, and empirically based cut-points may be created from the standardized results. In this respect, a popular diagnostic approach is the likelihood displacement (LD) statistic, which is directly associated with the likelihood ratio statistic. This likelihood-type diagnostic statistic, originally proposed by Cook (1977), produces both graphical plots and analytic results simultaneously. In the literature of influence diagnostics, the statistic is generally referred to as LD if it is generated from the log-likelihood function or as RLD if it is derived from the restricted log-likelihood function.
Let l be the log-likelihood function log L and lR be the restricted log-likelihood function log LR. In general linear models, the LD and RLD scores for subject i, denoted by LDi and RLDi, respectively, can then be defined as

$$LD_i = 2l(\hat{\beta}) - 2l[\hat{\beta}_{(-i)}], \tag{6.37a}$$

$$RLD_i = 2l_R(\hat{\beta}) - 2l_R[\hat{\beta}_{(-i)}], \tag{6.37b}$$
where $l(\hat{\beta})$ is the log-likelihood function and $l_R(\hat{\beta})$ is the log restricted likelihood function with respect to the estimated regression coefficients $\hat{\beta}$. In simple linear regression models, the LD statistic is distributed as chi-square with one degree of freedom under the null hypothesis that $\hat{\beta}_{(-i)} = \hat{\beta}$. With this desirable property, the observations having a strong impact on $\hat{\beta}$ can be statistically tested as well as graphically identified. Consequently, this statistic provides sufficient information on whether an identified influential case should be removed from the regression.
In longitudinal data analysis, the use of this exact statistic is often not realistic. When the sample size is large, this case-deletion process becomes extremely tedious and time-consuming. Consider, for example, a sample of 2000 subjects: the analyst needs to fit 2001 linear mixed models to identify which case or cases have an exceptionally strong impact on the parameter estimates. If the distance score, denoted by $\hat{\theta} - \hat{\theta}_{(-i)}$, can be statistically approximated by a scalar measurement in linear mixed models, influential observations can be identified without removing each case in sequence from the estimating process.
Cook (1986) developed a method to approximate Equation (6.37) by introducing weights into the likelihood function, with different individuals allowed different weights. In this method, an N-dimensional vector $w = (w_1, w_2, \ldots, w_N)'$ is created, with element $w_i$ (i = 1, 2, …, N) being the weight for subject i. The w vector in the classical log-likelihoods can be denoted by $w_0 = (1, 1, \ldots, 1)'$, in which each subject has weight one. This approach can be readily extended to linear mixed models. Let θ be a parameter vector consisting of β and the parameters in G and R. The log-likelihood and the log restricted likelihood functions with weights, referred to as perturbed log-likelihoods (Verbeke and Molenberghs, 2000), are then given by

$$l(\theta \mid w) = \sum_{i=1}^{N} w_i\, l_i(\theta), \tag{6.38a}$$

$$l_R(\theta \mid w) = \sum_{i=1}^{N} w_i\, l_{Ri}(\theta), \tag{6.38b}$$
where $l(\theta \mid w)$ is the perturbed log-likelihood function given the inclusion of the weight vector w, and $l_R(\theta \mid w)$ is the corresponding perturbed log restricted likelihood function. In longitudinal data analysis, $w_i$ is set at zero if the entire set of observations for subject i is not considered in linear mixed models.
Given the specification of weights, the influential cases can be identified by measuring the distance between $\hat{\theta}_w$ and $\hat{\theta}$ in the LD formulation. The approximated LD and RLD scores for all subjects, denoted by $LD(w)$ or $RLD(w)$, are given by

$$LD(w) = 2l(\hat{\theta}) - 2l[\hat{\theta}_w], \tag{6.39a}$$

$$RLD(w) = 2l_R(\hat{\theta}) - 2l_R[\hat{\theta}_w]. \tag{6.39b}$$
Obviously, it is still unrealistic to evaluate $LD(w)$ or $RLD(w)$ for all elements in w. Cook (1986) suggests studying the local behavior of $LD(w)$ around $w_0$ because it describes the sensitivity of $LD(w)$ with respect to slight perturbations of the weights. Accordingly, the $LD(w)$ at $w_0$ can be approximated by the classical likelihood expressions without directly including w. In the context of linear mixed models, the LD score for subject i can then be approximated by

$$LD_i(w_i) \approx \tilde{U}_i'\, I^{-1}(\hat{\theta})\, \tilde{U}_i, \tag{6.40}$$
where $\tilde{U}_i$ is the score function for subject i, mathematically defined as the first partial derivative of the subject's log-likelihood contribution with respect to θ, and $I^{-1}(\hat{\theta})$ is the inverse of the observed Fisher information matrix, mathematically defined as the negative of the matrix of second partial derivatives of the log-likelihood function with respect to θ. Both are evaluated at $\theta = \hat{\theta}$. As both $\tilde{U}_i$ and $I(\hat{\theta})$ can be obtained from standard linear mixed modeling, this approximation does not require additional estimation, thereby evading the specification of $w_i$. Given Equation (4.16), the $RLD_i(w_i)$ statistic can be obtained in the same fashion with minor modifications (Lesaffre and Verbeke, 1998).
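The logic of Equation (6.40) can be illustrated with a deliberately simple model in place of a linear mixed model. The sketch below assumes independent normal observations with unknown mean and variance, so that the per-subject scores and the observed information have closed forms; the data, the planted outlier, and the helper `two_loglik` are hypothetical. It compares the approximation with the exact case-deletion LD of Equation (6.37a).

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=40)
y[0] = 8.0                            # planted outlier (hypothetical)
N = y.size

def two_loglik(mu, s2, data):
    """Twice the normal log-likelihood of `data` evaluated at (mu, s2)."""
    return -data.size * np.log(2 * np.pi * s2) - np.sum((data - mu) ** 2) / s2

mu_hat, s2_hat = y.mean(), y.var()    # ML estimates (variance divisor N)

# Per-subject score vectors at the ML estimates, one row per subject
e = y - mu_hat
U = np.column_stack([e / s2_hat,
                     -0.5 / s2_hat + e**2 / (2 * s2_hat**2)])

# Observed information at the ML estimates (negative Hessian)
info = np.diag([N / s2_hat, N / (2 * s2_hat**2)])

# Equation (6.40): LD_i is approximated by U_i' I^{-1} U_i, with no refitting
ld_approx = np.einsum("ij,jk,ik->i", U, np.linalg.inv(info), U)

# Exact LD by case deletion: N additional fits
ld_exact = np.empty(N)
for i in range(N):
    yd = np.delete(y, i)
    ld_exact[i] = (two_loglik(mu_hat, s2_hat, y)
                   - two_loglik(yd.mean(), yd.var(), y))

print(int(np.argmax(ld_approx)), int(np.argmax(ld_exact)))
```

In this toy setting both versions single out the same case, although the approximation, being local, need not reproduce the magnitude of the exact LD for an extreme observation.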
As discussed in Lesaffre and Verbeke (1998) and Verbeke and Molenberghs (2000), the change in the log-likelihoods between $\hat{\theta}_w$ and $\hat{\theta}$ reflects variability of $\hat{\theta}$. If the $LD(w)$ or the $RLD(w)$ score is large, $l(\theta)$ or $l_R(\theta)$ is strongly curved at $\hat{\theta}$, thereby suggesting θ to be estimated with high precision. If the statistic is small, θ is estimated with high variability. As a result, a graph of the $LD(w)$ or $RLD(w)$ approximations against the w vector can provide important information on the total local influence of each subject to identify influential cases. For statistical details concerning this graphical method, the reader is referred to Gruttola et al. (1987), Lesaffre and Verbeke (1998), and Verbeke and Molenberghs (2000, Chapter 11).

6.2.5. LMAX Statistic for Identification of Influential Observations

Based on the LD approximation, a more refined technique for identifying influential observations is the LMAX statistic, originally developed by Cook (1986) as a standardized diagnostic method in general regression modeling and later extended to linear mixed models by Lesaffre and Verbeke (1998). Methodologically, this method seeks the direction of perturbation, standardized to unit length, that maximizes the approximation to the LD(w) statistic.
Cook (1986) suggests the use of a standardized LD statistic to more adequately measure influences of particular cases in the estimating process. Given the LD statistic, Cook first defines a symmetric matrix B, given by

$$B \approx \tilde{U}\, I^{-1}(\hat{\theta})\, \tilde{U}', \tag{6.41}$$
where $\tilde{U}$ is the matrix whose rows contain the score residual vectors $\tilde{U}_i'$.
With B defined, Cook considers the direction of the N × 1 vector $\tilde{l}$ that maximizes $\tilde{l}'B\tilde{l}$, with $\tilde{l}$ standardized to have unit length. Because the M × M matrix $I^{-1}(\hat{\theta})$ is positive definite, the N × N symmetric matrix B is positive semidefinite, with rank no more than M. The vector $\tilde{l}_{\max}$ is the unit-length eigenvector of B associated with the largest eigenvalue $\ddot{\lambda}_{\max}$. Therefore, $\tilde{l}_{\max}'B\tilde{l}_{\max}$ is the maximum of $\tilde{l}'B\tilde{l}$, and $\tilde{l}_{\max}$ satisfies the equation

$$B\,\tilde{l}_{\max} = \ddot{\lambda}_{\max}\,\tilde{l}_{\max} \quad \text{and} \quad \tilde{l}_{\max}'\tilde{l}_{\max} = 1,$$
where $\ddot{\lambda}_{\max}$ is the largest eigenvalue of B, and $\tilde{l}_{\max}$ is the eigenvector associated with $\ddot{\lambda}_{\max}$. The elements of $\tilde{l}_{\max}$, standardized to unit length, measure sensitivity of the model fit to each observation. The absolute value of $\tilde{l}_i$, the ith element in $\tilde{l}_{\max}$, is used as the LMAX score for subject i. Given the unit length, the average of the squared LMAX scores across cases is 1/N, where N is the sample size; correspondingly, a case whose squared score is considerably greater than 1/N is identified as having a strong influence on the overall fit of the regression model. If M = 1, the LMAX statistic is proportional to $\tilde{U}$ and $\ddot{\lambda}_{\max} = I(\hat{\theta})^{-1}\tilde{U}'\tilde{U}$. When M > 1, an advantage of examining elements of the LMAX statistic is that each case has a single summary measurement of influence.
As the LMAX score is a standardized statistic, it is useful to plot the LMAX scores against time points and/or the values of other covariates. The standardization of $\tilde{l}_{\max}$ to unit length means that the squares of its elements sum to unity, and the signs of the elements are not of concern; for $\tilde{l}_i$, only the absolute value needs to be plotted. Observations that exert the most undue influence on parameter estimates and the model fit can be readily identified by examining the relative magnitudes of the elements in $\tilde{l}_{\max}$. If none of the subjects has an undue impact on the model fit, the plot of the LMAX scores should approximate a horizontal line.
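The eigenvector computation behind the LMAX scores can be sketched in a few lines. As a simplifying assumption, the example again uses independent normal observations with unknown mean and variance (M = 2) rather than a linear mixed model; the data, the planted outlier, and the cut-off of five times 1/N are illustrative choices only and are not recommendations from the literature.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=40)
y[0] = 8.0                                    # planted outlier (hypothetical)
N = y.size

mu_hat, s2_hat = y.mean(), y.var()            # ML estimates
e = y - mu_hat

# N x M matrix of per-subject scores (M = 2 parameters: mean and variance)
U = np.column_stack([e / s2_hat,
                     -0.5 / s2_hat + e**2 / (2 * s2_hat**2)])
info = np.diag([N / s2_hat, N / (2 * s2_hat**2)])   # observed information

B = U @ np.linalg.inv(info) @ U.T             # Equation (6.41), N x N

eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
lam_max = eigvals[-1]
l_max = eigvecs[:, -1]                        # unit-length eigenvector
lmax_scores = np.abs(l_max)                   # signs are not of concern

# Flag cases whose squared score is well above the average value 1/N
flagged = np.flatnonzero(lmax_scores**2 > 5.0 / N)
print(lam_max, flagged)
```

Plotting `lmax_scores` against the subject index gives the kind of display described above: a roughly flat profile when no case dominates, and a clear spike otherwise.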
There is a lack of analytic expressions for $\tilde{l}_{\max}$. Given high proportional contributions of the graphically influential cases to the overall fit of a linear mixed model, it is necessary to check the exact LD for them. The researcher can perform the significance test to further analyze those influential cases with two steps (Liu, 2012). First, graphically identify the most influential observations by using the LMAX approximation (there are usually only a few distinctive outliers in linear models). Next, analytically examine the exact changes in the estimates of the fixed effects and the covariance parameters after deleting each of those few cases, using the classical LD criterion. Following this strategy, a decision can be made regarding whether those influential cases should be removed in fitting a linear mixed model.

6.3. Empirical Illustrations on Influence Diagnostics

In this empirical illustration, I provide two examples for performing influence diagnostics in longitudinal data analysis. I continue to use the two longitudinal datasets indicated in the preceding chapters: one on the effectiveness of acupuncture treatment on PTSD and one concerning the effect of marital status on an older person’s disability severity.

6.3.1. Influence Checks on Linear Mixed Model Concerning Effectiveness of Acupuncture Treatment on PCL Score

The first example is a diagnostic check of influence on the linear mixed model fitted in Chapter 5 (Section 5.5.1). As a follow-up analysis, the model specifications, covariate definitions, and the use of an appropriate covariance structure are all the same as described in Chapter 5. Specifically, the dependent variable is PCL_SUM, measured at four time points: 0 = baseline survey, 1 = 4-week follow-up, 2 = 8-week follow-up, and 3 = 12-week follow-up. The treatment factor is dichotomous with 1 = receiving acupuncture treatment, 0 = else. Among the three covariance pattern models considered, CS, AR(1), and TOEP, CS yields the smallest values in all four information criteria and is therefore selected as the residual covariance pattern model for this analysis. The objective of this diagnostic analysis is to identify whether there are influential cases for both the fixed effects and the covariance parameters in the model fit. The following SAS program is created to perform the diagnostics by using an iterative approach.
SAS Program 6.1: [program listing not reproduced]
In SAS Program 6.1, the ODS Graphics is enabled to create diagnostic plots. As the analysis is focused on the identification of influential cases, the options to derive the fixed effects and the covariance parameters are not specified. The ML estimator is applied in the diagnostic checks because the REML estimator cannot be used to compare nested regression models for the mean (Fitzmaurice et al., 2004). The INFLUENCE option is specified in the MODEL statement telling SAS to compute influence statistics. The influence suboption EFFECT = ID specifies that the classification effect is subject. Additionally, the suboption ITER = 5 informs SAS that the fixed effects and the covariance parameters be updated by refitting the mixed model up to five iterations. It may be mentioned that for ITER = 0, SAS performs a noniterative influence analysis on the fixed effects only, assuming the covariance parameters or their ratios to be fixed.
Given SAS Program 6.1, SAS constructs a table of influence diagnostics at the subject level. This diagnostic summary table contains eleven diagnostic statistics, including Cook’s D for both the fixed effects and the covariance parameters, PRESS, MDFFITS for both the fixed effects and the covariance parameters, COVRATIO for both the fixed effects and the covariance parameters, COVTRACE for both the fixed effects and the covariance parameters, the likelihood distance (LD), and the root of mean square error (RMSE). While the diagnostic table includes a large amount of data, a portion of the statistics is selected for display, including Cook’s D, MDFFITS for the fixed effects, COVRATIO for the fixed effects, and the LD statistic. Table 6.1 presents those statistics for the first ten subjects.

Table 6.1

Selected Influence Statistics for 10 Subjects: DHCC Acupuncture Treatment Study (N = 10)

| Subject’s ID | Number of Observations | Number of Iterations | Cook’s D | MDFFITS on β | COVRATIO on β | LD Statistic |
|---|---|---|---|---|---|---|
| 132 | 4 | 2 | 0.0533 | 0.0524 | 1.0080 | 0.5396 |
| 167 | 4 | 1 | 0.0114 | 0.0108 | 1.2674 | 0.1007 |
| 193 | 4 | 2 | 0.0242 | 0.0236 | 1.1246 | 0.2139 |
| 220 | 4 | 2 | 0.0134 | 0.0132 | 1.2960 | 0.1690 |
| 227 | 4 | 2 | 0.0195 | 0.0188 | 1.1914 | 0.1657 |
| 230 | 2 | 1 | 0.0006 | 0.0006 | 1.1728 | 0.0158 |
| 235 | 4 | 1 | 0.0263 | 0.0252 | 1.1858 | 0.2130 |
| 245 | 4 | 2 | 0.0358 | 0.0351 | 1.0913 | 0.3399 |
| 271 | 1 | 1 | 0.0000 | 0.0000 | 1.0785 | 0.0057 |
| 276 | 4 | 2 | 0.0324 | 0.0315 | 1.1090 | 0.3003 |

Table 6.1 shows that nine of the ten subjects have multiple observations, the exception being subject 271, and that eight have complete information on the PCL score. All subjects have relatively small values of Cook’s D and the MDFFITS statistics on the fixed effects. As they differ only in the specification of the covariance matrix for the fixed effects, the values of Cook’s D and the MDFFITS are very close, with the MDFFITS value slightly smaller given the use of $\mathrm{var}[\hat{\beta}_{(-i)}]$ rather than $\mathrm{var}[\hat{\beta}]$ in computation. The COVRATIO and the LD statistics are also fairly stable, thereby suggesting that none of these ten subjects has a strong impact on the fit of the linear mixed model on PCL_SUM. From the results of other influence diagnostics and for other subjects, no distinctively influential cases can be found.
SAS Program 6.1 also produces a set of diagnostic plots. First, a plot of the likelihood distance is presented in Fig. 6.1.
Figure 6.1 Likelihood Distance for Each Subject
As there are 55 subjects in the analysis, not all the ID numbers are displayed in Fig. 6.1. Based on the output table from SAS Program 6.1, influential cases can be easily identified. As judged from both Fig. 6.1 and the output table, subjects 791 and 814 have the strongest impact on the model fit. Relative to the large value of the overall model fit statistic, however, their likelihood distances, 1.1328 and 1.5064, are too small to support a firm conclusion that the two cases actually influence the fit of the linear mixed model.
Figure 6.2 displays Cook’s D and the COVRATIO statistics for both the fixed effects and the covariance parameters considered in the model.
Figure 6.2 Other Influence Statistics on PCL_SUM
In Fig. 6.2, the aforementioned two subjects also display high Cook’s D values, thereby suggesting a strong impact on the results of the model fit. These two cases are shown to be influential not only on the fixed effects but also on the estimates of the variance and covariance parameters. With respect to Cook’s D, subject 623 is identified as an additional influential case in the estimation of the fixed effects but not in the covariance parameters. The three subjects, 791, 814, and 623, are also linked to low precision both in the fixed effects and in the covariance parameters, given the low COVRATIO scores. These diagnostic statistics, however, display the relative contribution to the model fit but do not necessarily translate into an actual influence.
Given the relative importance of the two most influential cases to the overall fit of the linear mixed model, it may be necessary to check the exact LD for both. Accordingly, two additional linear mixed models are created, each deleting one of those influential cases. The SAS program for this step is not presented because programming of the two additional models follows the standard procedure, as previously exhibited, except for removal of a single subject. Table 6.2 summarizes the results.

Table 6.2

Maximum Likelihood Estimates and the Likelihood Displacement Statistic for Three Linear Mixed Models

| Explanatory Variable | Parameter Estimate | Standard Error | Degrees of Freedom | t-value | p-value |
|---|---|---|---|---|---|
| Linear Mixed Model With Full Data (−2 LL = 1411.3; p < 0.0001) | | | | | |
| Intercept | 55.444 | 2.509 | 53 | 22.10 | <0.0001 |
| Treatment | 2.698 | 3.516 | 53 | 0.77 | 0.4463 |
| Time 1 | −3.963 | 2.056 | 128 | −1.93 | 0.0561 |
| Time 2 | −2.977 | 2.168 | 128 | −1.37 | 0.1721 |
| Time 3 | −9.233 | 2.137 | 128 | −4.32 | <0.0001 |
| Treat × Time 1 | −14.494 | 3.007 | 128 | −4.82 | <0.0001 |
| Treat × Time 2 | −15.803 | 3.232 | 128 | −4.89 | <0.0001 |
| Treat × Time 3 | −8.666 | 3.177 | 128 | −2.73 | 0.0073 |
| Linear Mixed Model Deleting Subject 791 (−2 LL = 1371.3; p < 0.0001) | | | | | |
| Intercept | 56.231 | 2.550 | 52 | 22.05 | <0.0001 |
| Treatment | 1.912 | 3.542 | 52 | 0.54 | 0.5916 |
| Time 1 | −4.885 | 2.007 | 125 | −2.43 | 0.0164 |
| Time 2 | −4.359 | 2.122 | 125 | −2.05 | 0.0420 |
| Time 3 | −10.613 | 2.090 | 125 | −5.08 | <0.0001 |
| Treat × Time 1 | −13.537 | 2.908 | 125 | −4.65 | <0.0001 |
| Treat × Time 2 | −14.369 | 3.129 | 125 | −4.59 | <0.0001 |
| Treat × Time 3 | −7.229 | 3.075 | 125 | −2.35 | 0.0203 |
| LD statistic: 40.0 (df = 4; p < 0.01) | | | | | |
| Linear Mixed Model Deleting Subject 814 (−2 LL = 1368.6; p < 0.0001) | | | | | |
| Intercept | 54.923 | 2.535 | 52 | 21.67 | <0.0001 |
| Treatment | 3.220 | 3.520 | 52 | 0.91 | 0.3646 |
| Time 1 | −4.231 | 1.992 | 125 | −2.12 | 0.0356 |
| Time 2 | −3.338 | 2.106 | 125 | −1.59 | 0.1154 |
| Time 3 | −7.959 | 2.074 | 125 | −3.84 | 0.0002 |
| Treat × Time 1 | −14.189 | 2.886 | 125 | −4.92 | <0.0001 |
| Treat × Time 2 | −15.388 | 3.105 | 125 | −4.96 | <0.0001 |
| Treat × Time 3 | −9.880 | 3.052 | 125 | −3.24 | 0.0015 |
| LD statistic: 42.7 (df = 4; p < 0.01) | | | | | |

In Table 6.2, the first linear mixed model uses full data, with results previously reported. The second model is fitted after removing subject 791 with four observations. The exact LD statistic, from the formula $2l(\hat{\theta}) - 2l[\hat{\theta}_{(-i)}]$, is statistically significant with four degrees of freedom at α = 0.05 (LD = 40.0, p < 0.01), thereby indicating that this subject makes a very strong statistical impact on the overall fit of the first linear mixed model. Likewise, the third model is fitted after deleting subject 814, with the exact LD statistic also statistically significant at the same criterion (LD = 42.7, p < 0.01). Furthermore, in both the second and the third models, the parameter estimates, including the regression coefficients and the standard errors, vary notably after removing each of those two cases. Obviously, those two cases make a genuine impact in fitting the linear mixed model. Given the considerable changes in the estimated regression coefficients as well as the exceptionally strong statistical contributions, I would recommend that the two influential cases be eliminated from the regression. The lack of statistical stability may also be linked to the small sample size for the data. For large samples, a strong relative influence of a few particular cases usually can be washed out by the effects of the vast majority of the normal observations in the estimating process. This argument will be further discussed in the next example where a large sample is analyzed.
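The arithmetic behind the two LD statistics in Table 6.2 can be reproduced directly from the reported −2 LL values. The sketch below takes the three −2 LL values and the four degrees of freedom from the table; the helper `chi2_sf_even_df` is a hypothetical function that relies on the closed-form upper-tail probability of the chi-square distribution for an even number of degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k)
                                 for k in range(df // 2))

neg2ll_full = 1411.3                                # full-data model, Table 6.2
neg2ll_deleted = {"791": 1371.3, "814": 1368.6}     # models deleting one subject

ld_results = {}
for subject, neg2ll in neg2ll_deleted.items():
    ld = neg2ll_full - neg2ll                       # difference in -2 LL
    p_value = chi2_sf_even_df(ld, df=4)
    ld_results[subject] = (ld, p_value)
    print(f"subject {subject}: LD = {ld:.1f}, p = {p_value:.2e}")
```

Both differences match the LD values reported in the table, and both tail probabilities fall far below 0.01.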
Influence diagnostics can also be applied in the random coefficient model in which time is treated as a continuous variable. For example, I can graphically examine the fixed effects and the covariance parameters in the random coefficient model on the PCL score specified in Chapter 3. By adapting SAS Program 4.1, the code is displayed below.
SAS Program 6.2: [program listing not reproduced]
In SAS Program 6.2, some options are selected to request an influence analysis on the parameter estimates in the random coefficient model. In the PROC MIXED statement, the PLOT(ONLY) = INFLUENCESTATPANEL option requests panels of influence statistics, with the ONLY suboption in the parentheses informing SAS that only the requested plots be produced while the default plots in the package be suppressed. For iterative analysis, the panel displays Cook’s D and the COVRATIO statistic for the fixed effects and the covariance parameters. In the INFLUENCE option in the MODEL statement, the suboption EST is added, which requests SAS to write the updated parameter estimates to the ODS output dataset. The resulting plots from this program are displayed in Fig. 6.3.
Figure 6.3 Influence Statistics on PCL_SUM in the Random Coefficient Model
Generally, Fig. 6.3 demonstrates the same results as those from the influence analysis for the linear model using a residual covariance pattern model. Such a similarity is not surprising, as both types of linear mixed models usually yield very close estimates for the fixed effects and the covariance parameters.
If the researcher is interested in graphically checking the detailed influence statistics for each fixed-effect and covariance parameter in the random coefficient model, additional analyses are needed for a thorough assessment. This influence analysis can be achieved by making some minor modifications in SAS Program 6.2. Specifically, by replacing the option PLOT(ONLY) = INFLUENCESTATPANEL with the PLOT(ONLY) = INFLUENCEESTPLOT option, the following graphs are produced for the fixed effects and the 3 × 3 variance–covariance matrix of the random coefficients.
In the interpretation of Fig. 6.4, the focus is on the behaviors of the two most influential cases, subjects 791 and 814. Specifically, subject 791 has a strong impact on the model fit of the intercept and the slopes for CT, CT_2, TREAT, and the interactions between CT and TREAT and between CT_2 and TREAT. Removal of this subject, however, has no influence at all on the fixed effect of CT_3 and the interaction between CT_3 and TREAT. On the other hand, subject 814 has a very strong influence on the regression coefficient estimates of the intercept, CT_2, TREAT, and the interaction between CT_2 and TREAT. At the same time, subject 814 has some impact on the fixed effects of CT_3 and the interactions between CT and TREAT and between CT_3 and TREAT but has no influence at all on the fixed effects of CT. In this analysis, subject 998 is identified as an additional influential case since it has a strong impact on half of the fixed effects.
Figure 6.4 Fixed-Effects Deletion Estimates on PCL_SUM
Figure 6.5 displays the results of the deletion estimates for each of the covariance parameters on the PCL score.
Figure 6.5 Covariance Parameter Deletion Estimates on PCL_SUM
In Fig. 6.5, deletion estimates are available only for subjects 132, 227, and 998. For all other cases, the estimation of G after removal of the subject does not yield a positive definite matrix, so the refit fails and no estimate is displayed. With a small sample size of a longitudinal dataset, the influence analysis on the covariance parameter deletion estimates is often inefficient in checking the model fit of the random coefficient model.

6.3.2. Influence Diagnostics on Linear Mixed Model Concerning Marital Status and Disability Severity Among Older Americans

In the second example of influence diagnostics, the AHEAD data are used (2000 randomly selected subjects and six waves: 1998, 2000, 2002, 2004, 2006, and 2008). In this analysis, the dependent variable remains the health-related difficulty in performing five activities of daily living (dress, bathe/shower, eat, walk across a room, get in/out of bed), measured at six time points and named ADL_COUNT. Each activity is scored one if an older person has difficulty, personal help, or equipment help for health-related reasons and scored zero otherwise. Therefore, the value of the ADL count ranges from 0 to 5 at each time point. Marital status, named married, is the same dichotomous variable with 1 = currently married and 0 = else, specified as a time-varying variable. The controls are the three centered variables, Age_mean, Educ_mean, and Female_mean. Given the same set of covariates, the same linear mixed model described in Chapter 5 (Section 5.6.2) is applied. The ML estimator is used to perform an influence analysis on the fixed effects and the covariance parameters. The following is the SAS program for this analysis, an adapted version of SAS Program 5.4.
SAS Program 6.3: [program listing not reproduced]
SAS Program 6.3 is similar to SAS Program 6.1, with contextual modifications. In the PROC MIXED statement, the PLOTS(MAXPOINTS = 20,000) option tells SAS to increase the maximum points of data in plotting the least squares means from the default 5,000 to 20,000, given a large sample size of the AHEAD longitudinal data. As indicated in Chapter 5 (Section 5.6.2), the TOEP covariance pattern model is selected to address the covariance structure in residuals.
Given the large sample size, SAS Program 6.3 yields a substantially large quantity of output data, containing eleven diagnostic statistics for each of the 2000 subjects. The diagnostic table for this influence analysis takes the same format as that of the first example, in which a rich body of influence diagnostics, including tables and graphs, is presented. Given the large quantity of the output results, the detailed contents of diagnostic statistics and the figures cannot all be reported. Instead, the focus of this influence analysis is placed upon identification of the most influential cases, rather than on the interpretation of the output tables and graphs. It follows that the exact LD scores are computed by removing each of the most influential cases in the linear mixed model. Furthermore, the actual influences of those outstanding cases are examined by checking changes in the parameter estimates after removing each of them from the regression.
After a careful examination of the results from SAS Program 6.3, two influential subjects are identified: subject 200520020 (HHIDPN number) and subject 207669010. Given the relatively high LD scores, it is necessary to check the exact LD for both cases. Analogous to the approach applied for the first example, two additional linear mixed models are created, each deleting one of those two influential cases. The SAS program for this step is not presented because the two mixed models follow the standard procedure, as previously displayed, except for the removal of a single subject. Table 6.3 summarizes the results for the fixed effects.

Table 6.3

Maximum Likelihood Estimates and the Likelihood Displacement Statistic for Three Linear Mixed Models on ADL Count: Older Americans

| Explanatory Variable | Parameter Estimate | Standard Error | Degrees of Freedom | t-value | p-value |
|---|---|---|---|---|---|
| Linear Mixed Model With Full Data (−2 LL = 20,016.3; p < 0.0001) | | | | | |
| Intercept | 0.673 | 0.044 | 1726 | 15.31 | <0.0001 |
| Married | 0.029 | 0.059 | 300 | 0.49 | 0.6223 |
| Time 1 | 0.257 | 0.036 | 4814 | 7.19 | <0.0001 |
| Time 2 | 0.513 | 0.052 | 4814 | 9.87 | <0.0001 |
| Time 3 | 0.579 | 0.058 | 4814 | 9.94 | <0.0001 |
| Time 4 | 0.768 | 0.061 | 4814 | 12.56 | <0.0001 |
| Time 5 | 0.932 | 0.060 | 4814 | 15.60 | <0.0001 |
| Married × Time 1 | −0.159 | 0.053 | 4814 | −3.01 | 0.0026 |
| Married × Time 2 | −0.252 | 0.078 | 4814 | −3.24 | 0.0012 |
| Married × Time 3 | −0.233 | 0.090 | 4814 | −2.58 | 0.0099 |
| Married × Time 4 | −0.060 | 0.098 | 4814 | −0.61 | 0.5441 |
| Married × Time 5 | −0.309 | 0.101 | 4814 | −3.07 | 0.0021 |
| Linear Mixed Model Deleting Subject 200520020 (−2 LL = 20,000.5; p < 0.0001) | | | | | |
| Intercept | 0.673 | 0.044 | 1725 | 15.29 | <0.0001 |
| Married | 0.028 | 0.059 | 300 | 0.48 | 0.6335 |
| Time 1 | 0.257 | 0.036 | 4809 | 7.18 | <0.0001 |
| Time 2 | 0.513 | 0.052 | 4809 | 9.87 | <0.0001 |
| Time 3 | 0.580 | 0.058 | 4809 | 9.94 | <0.0001 |
| Time 4 | 0.768 | 0.061 | 4809 | 12.56 | <0.0001 |
| Time 5 | 0.932 | 0.060 | 4809 | 15.60 | <0.0001 |
| Married × Time 1 | −0.159 | 0.053 | 4809 | −3.01 | 0.0026 |
| Married × Time 2 | −0.250 | 0.078 | 4809 | −3.21 | 0.0013 |
| Married × Time 3 | −0.234 | 0.090 | 4809 | −2.59 | 0.0096 |
| Married × Time 4 | −0.059 | 0.098 | 4809 | −0.60 | 0.5478 |
| Married × Time 5 | −0.302 | 0.101 | 4809 | −3.00 | 0.0027 |
| LD statistic: 15.8 (df = 2; p < 0.01) | | | | | |
| Linear Mixed Model Deleting Subject 207669010 (−2 LL = 19,985.0; p < 0.0001) | | | | | |
| Intercept | 0.675 | 0.044 | 1725 | 15.35 | <0.0001 |
| Married | 0.026 | 0.059 | 300 | 0.44 | 0.6599 |
| Time 1 | 0.255 | 0.036 | 4809 | 7.11 | <0.0001 |
| Time 2 | 0.513 | 0.052 | 4809 | 9.86 | <0.0001 |
| Time 3 | 0.578 | 0.058 | 4809 | 9.92 | <0.0001 |
| Time 4 | 0.770 | 0.061 | 4809 | 12.58 | <0.0001 |
| Time 5 | 0.923 | 0.060 | 4809 | 15.47 | <0.0001 |
| Married × Time 1 | −0.156 | 0.053 | 4809 | −2.96 | 0.0031 |
| Married × Time 2 | −0.252 | 0.078 | 4809 | −3.23 | 0.0012 |
| Married × Time 3 | −0.231 | 0.090 | 4809 | −2.56 | 0.0105 |
| Married × Time 4 | −0.061 | 0.098 | 4809 | −0.62 | 0.5324 |
| Married × Time 5 | −0.300 | 0.100 | 4809 | −2.99 | 0.0028 |
| LD statistic: 31.3 (df = 6; p < 0.01) | | | | | |


In Table 6.3, the first linear mixed model uses the full data, while the second model is fitted after deleting the most influential case (subject ID: 200520020). The exact LD statistic, computed from the formula $2l(\hat{\theta}) - 2l[\hat{\theta}_{(i)}]$, is statistically significant at α = 0.05 with two degrees of freedom, as this subject has two observations (LD = 15.8, p < 0.01). This result indicates that the most influential case has a very strong statistical impact on the overall fit of the fixed effects. Likewise, the third mixed model is fitted after deleting the second most influential case (subject ID: 207669010), and the exact LD statistic is also statistically significant with six degrees of freedom at the same criterion (LD = 31.3, p < 0.01). In both the second and the third models, however, the fixed-effects estimates, including the regression coefficients and the standard errors, barely change after removing each of those two cases: the three sets of estimated regression coefficients are almost identical. The fixed effects of the three control variables, not presented in Table 6.3, are also strikingly consistent after each removal. Deleting those two cases therefore has no genuine impact on the fixed-effects estimates, despite their strong statistical influence.
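The arithmetic behind the two LD statistics can be checked with a short script. This is a minimal sketch, assuming that each LD in Table 6.3 is obtained as the difference between the two reported −2 LL values and that the degrees of freedom equal the number of deleted observations, as stated in the text. The function names are hypothetical, and the chi-square tail probability uses a closed form that holds only for even degrees of freedom, which covers both cases here.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    # P(X > x) = exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

def ld_from_neg2ll(neg2ll_full, neg2ll_deleted):
    """LD taken as the difference of the two reported -2 LL values (assumption)."""
    return neg2ll_full - neg2ll_deleted

# -2 LL values and degrees of freedom transcribed from Table 6.3.
NEG2LL_FULL = 20016.3
cases = {
    "200520020": (20000.5, 2),
    "207669010": (19985.0, 6),
}

results = {}
for subject, (neg2ll_deleted, df) in cases.items():
    ld = ld_from_neg2ll(NEG2LL_FULL, neg2ll_deleted)
    results[subject] = (round(ld, 1), chi2_sf_even_df(ld, df))
    print(subject, results[subject])
```

Both tail probabilities fall well below 0.01, in line with the p-values reported in the table. For odd degrees of freedom, a general routine such as `scipy.stats.chi2.sf` would be needed instead of the closed form.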
Next, the impacts of the two influential cases on the covariance parameter estimates are examined. The following SAS program output displays three panels of covariance parameter estimates, derived from the three linear mixed models described above.
SAS Program Output 6.1 (output images not reproduced):
The covariance parameter estimates for full data
The covariance parameter estimates after deleting the first case
The covariance parameter estimates after deleting the second case
As shown in SAS Program Output 6.1, the three sets of covariance parameter estimates are very close; deleting those two statistically influential cases therefore does not affect the quality of the covariance parameter estimates either. Given the remarkable similarity of the estimated regression coefficients and covariance parameters across the three models, I see no reason to eliminate the two statistically influential cases from the linear mixed model, although they make exceptionally strong statistical contributions to the model fit. Indeed, in large samples, a strong relative influence of a few particular cases can easily be averaged out by the effects of the other cases in the estimating process. Consequently, in such samples, deleting an influential observation can hardly have any actual impact on the regression coefficient estimates and the covariance parameter estimates in linear mixed models.
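The similarity of the fixed-effects estimates can also be expressed numerically. The sketch below uses the estimates and full-data standard errors transcribed from Table 6.3 and reports, for each deletion, the largest shift in a coefficient measured in units of its full-data standard error. This standardized shift is a simple illustrative summary chosen here; it is an assumption of the example and not one of the diagnostics defined in this chapter.

```python
# Values transcribed from Table 6.3. Tuple layout:
# (full-data estimate, full-data SE, estimate without 200520020, estimate without 207669010)
TABLE_6_3 = {
    "Intercept":      (0.673, 0.044, 0.673, 0.675),
    "Married":        (0.029, 0.059, 0.028, 0.026),
    "Time 1":         (0.257, 0.036, 0.257, 0.255),
    "Time 2":         (0.513, 0.052, 0.513, 0.513),
    "Time 3":         (0.579, 0.058, 0.580, 0.578),
    "Time 4":         (0.768, 0.061, 0.768, 0.770),
    "Time 5":         (0.932, 0.060, 0.932, 0.923),
    "Treat x Time 1": (-0.159, 0.053, -0.159, -0.156),
    "Treat x Time 2": (-0.252, 0.078, -0.250, -0.252),
    "Treat x Time 3": (-0.233, 0.090, -0.234, -0.231),
    "Treat x Time 4": (-0.060, 0.098, -0.059, -0.061),
    "Treat x Time 5": (-0.309, 0.101, -0.302, -0.300),
}

def largest_shift(table, column):
    """Return (term, |change| / full-data SE) for the most affected coefficient.

    `column` is 2 for the first deletion and 3 for the second deletion.
    """
    shifts = {
        term: abs(row[column] - row[0]) / row[1]
        for term, row in table.items()
    }
    worst = max(shifts, key=shifts.get)
    return worst, shifts[worst]

first_deletion = largest_shift(TABLE_6_3, 2)
second_deletion = largest_shift(TABLE_6_3, 3)
print("Deleting 200520020:", first_deletion)
print("Deleting 207669010:", second_deletion)
```

In both cases the largest shift is a small fraction of one standard error, which agrees with the conclusion that the two deletions leave the fixed-effects estimates essentially unchanged.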

6.4. Summary

In this chapter, a number of statistical methods are described for performing residual diagnostics in linear mixed models. The chapter starts with an introduction to a variety of residual types that are widely used in longitudinal data analysis. The semivariogram approaches are then delineated to help the reader master the techniques for checking the behavior of residuals in linear mixed models. The semivariogram can be applied to check whether serial correlation in repeated measurements is present given the fixed effects and the specified random effects. Unlike the classical residual diagnostics, the semivariogram was originally proposed for the analysis of spatial data and has therefore seen few applications in the analysis of longitudinal data with equal time intervals. In longitudinal data analysis, plotting various residuals is effective in revealing inconsistencies between the observed data and the model-based predictions (Diggle et al., 2002).
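As a reminder of how an empirical semivariogram is assembled from residuals, the following sketch averages half the squared differences between pairs of residuals within the same subject, grouped by time lag. The function name and the toy residuals are hypothetical and serve only as an illustration; in practice the residuals would come from the fitted linear mixed model.

```python
from collections import defaultdict
from itertools import combinations

def empirical_semivariogram(residuals_by_subject):
    """Map each time lag to the mean of 0.5 * (r_ij - r_ik)^2 over within-subject pairs.

    residuals_by_subject: {subject: [(time, residual), ...]}
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for points in residuals_by_subject.values():
        # Only pairs of residuals from the same subject are compared.
        for (t1, r1), (t2, r2) in combinations(points, 2):
            lag = abs(t2 - t1)
            sums[lag] += 0.5 * (r1 - r2) ** 2
            counts[lag] += 1
    return {lag: sums[lag] / counts[lag] for lag in sorted(sums)}

# Hypothetical residuals for two subjects, for illustration only.
toy_residuals = {
    "a": [(0, 1.0), (1, 0.0), (2, -1.0)],
    "b": [(0, 0.5), (1, 0.5)],
}
semivariogram = empirical_semivariogram(toy_residuals)
print(semivariogram)
```

A semivariogram that stays roughly flat across lags is usually read as little evidence of serial correlation, whereas values that rise with the lag suggest that serial correlation remains in the residuals.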
Compared to residual diagnostics, the identification of cases that influence the goodness-of-fit of linear mixed models is more complex. In Section 6.2, I describe a variety of popular approaches in this area. For large samples, the actual impact of a few statistically influential cases is often found to be very limited, even though they contribute significantly to the model fit. Given this finding, the techniques for identifying influential cases are particularly important in the analysis of longitudinal data with a small sample size. In such analyses, a few influential cases can substantially alter the analytic results of a linear mixed model. I recommend the following steps for handling influential cases in those studies. First, identify the most influential cases by using the influence diagnostic methods described in this chapter. Next, examine the exact change in the estimated regression coefficients and the covariance parameter estimates after deleting each of the identified cases. Following this strategy, a decision can be made regarding whether those cases should be removed in the application of a linear mixed model.
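The two recommended steps can be sketched as a small case-deletion loop. To keep the example self-contained, a normal mean-and-variance model fitted by maximum likelihood stands in for the linear mixed model; this substitution, the function names, and the toy data are all assumptions made for illustration. The likelihood displacement is computed here by evaluating the full-data log-likelihood at the case-deleted estimates.

```python
import math

def fit_mean_model(values):
    """Stand-in 'model': ML estimates of a normal mean and variance."""
    n = len(values)
    mu = sum(values) / n
    sigma2 = sum((v - mu) ** 2 for v in values) / n
    return mu, sigma2

def loglik(values, mu, sigma2):
    """Normal log-likelihood of `values` at the given parameters."""
    n = len(values)
    sq = sum((v - mu) ** 2 for v in values)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)

def review_influential_cases(data, top_k=1):
    """data: {subject: [responses]}. Returns top_k rows (subject, LD, change in mean)."""
    everyone = [v for vals in data.values() for v in vals]
    mu_full, s2_full = fit_mean_model(everyone)
    ll_full = loglik(everyone, mu_full, s2_full)
    report = []
    for subject in data:
        kept = [v for s, vals in data.items() if s != subject for v in vals]
        mu_del, s2_del = fit_mean_model(kept)
        # Step 1: likelihood displacement for this subject.
        ld = 2.0 * (ll_full - loglik(everyone, mu_del, s2_del))
        # Step 2: actual change in the estimate of interest after deletion.
        report.append((subject, ld, mu_del - mu_full))
    report.sort(key=lambda row: row[1], reverse=True)
    return report[:top_k]

# Hypothetical small sample in which subject "s4" is clearly unusual.
toy = {"s1": [1.0, 1.2], "s2": [0.9, 1.1], "s3": [1.0, 0.8], "s4": [4.0, 4.2]}
top = review_influential_cases(toy, top_k=1)
print(top)
```

With only four subjects, deleting the unusual subject shifts the estimated mean noticeably, which illustrates why the second step matters most in small samples. For an actual analysis, `fit_mean_model` and `loglik` would be replaced by a mixed-model fitting routine.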