Chapter 6

Residual and influence diagnostics

Abstract

In Chapter 6, the statistical methods of residual diagnostics are described first. A variety of residual types are specified, both generally and with specific regard to longitudinal data analysis. I also delineate the semivariogram, a popular residual diagnostic technique applied in longitudinal data analysis. This method is used to check whether serial correlation among repeated measurements for the same subject is present given the specified fixed effects and the covariance parameters. The method can be specified from both the random intercept and the random coefficient perspectives. Next, the statistical methods of influence diagnostics are presented. The basic diagnostics for identifying influential observations include Cook’s distance statistic, leverage, the DFFITS score, the MDFFITS statistic, COVTRACE, COVRATIO, the likelihood displacement statistic (the LD statistic), and the LMAX standardized statistic. An empirical illustration shows how to check whether there are any influential observations in fitting a linear mixed model.

Keywords

Influence diagnostics

leverage

likelihood displacement statistic

LMAX statistic

residual types

semi-variogram

One of the primary objectives in creating a linear mixed model is to describe the time trend of a response measurement on the continuous scale, either by experimental units (such as treatments) or by population groups (such as currently married vs. currently not married). Given within-subject dependence in longitudinal data, specification of the random effects or of a residual covariance structure is required to account for intraindividual correlation, even though those parameters are usually not of direct interest. In the estimation of the fixed and the random effects, some unusual observations can have undue influence on the fit of linear mixed models, thereby affecting the quality of parameter estimates. Therefore, once a linear mixed model is fitted to longitudinal data, regression diagnostics are necessary for verifying whether the statistical model fits the data appropriately and meets the various assumptions on model specification. In general linear models, the assessment of model adequacy regularly focuses on checking linearity, normality, homogeneity of variance, and independence of errors. With respect to longitudinal data analysis, regression diagnostics are more complex because an individual’s response is measured repeatedly over time, so that the results of regression diagnostics may vary over different time points. The general principles in statistical modeling, however, are universal for all regression models, and linear mixed modeling is no exception. Standard diagnostic techniques, such as examining the deviation of the expected value from the observed, the overall model fit, and the identification of influential observations, also need to be performed in longitudinal data analysis.

In this chapter, statistical methods of residual diagnostics are described first, starting with an introduction on different types of residuals applied in linear mixed models. Next, the statistical methods on influence diagnostics are presented. I then provide an empirical illustration to display the application of various regression diagnostics in linear mixed models based on the same longitudinal data previously used. Lastly, I summarize the chapter with comments on the regression diagnostics applied in linear mixed models.

6.1. Residual Diagnostics

In this section, a variety of residual types applied in linear mixed models are introduced. Next, I delineate a popular residual diagnostic technique in longitudinal data analysis, referred to as the semivariogram, which is used to check whether serial correlation among repeated measurements for the same subject is present given the specification of the fixed effects and the covariance parameters. This method can be applied in both the random intercept and the random coefficient models.

6.1.1. Types of Residuals in Linear Regression Models

In regression modeling, residuals are closely related to random errors; the two concepts, however, differ distinctly. While a random error, or disturbance, of an observed value is defined as the deviation of the observed value from the true (unobservable) function value, the residual of an observed value is the difference between the observed value and the estimated function. A regression model, based on various assumptions and with certain properties on random errors, yields parameter estimates and model-based predictions from a statistical procedure, which, in turn, yield residuals, not random errors.

Let the predicted mean response for subject i at time point j be

${\hat{μ}}_{ij} = X_{i j}^{'} \hat{β}$

. An n_i-dimensional vector of residuals for each subject can then be derived from a linear mixed model, given by

$r_{m i} = Y_{i} - X_{i}^{'} \hat{β},$

(6.1)

where

$r_{m i}$

, briefly discussed in Chapter 4, is referred to as the vector of the marginal residuals because

$X_{i j}^{'} \hat{β}$

is a marginal mean vector in linear mixed models. If the random-effects component is considered, residuals for each subject become conditional on the random effects, written as

$r_{c i} = Y_{i} - X_{i}^{'} \hat{β} - Z_{i} {\hat{b}}_{i} = r_{m i} - Z_{i} {\hat{b}}_{i},$

(6.2)

where

$r_{c i}$

represents the conditional residuals.

Given the specifications of linear mixed models, the variance–covariance matrix of marginal residuals is

$var ({\hat{r}}_{m i}) = {\hat{V}}_{i} - X_{i} {(X_{i}^{'} {\hat{V}}_{i}^{- 1} X_{i})}^{- 1} X_{i}^{'},$

(6.3)

where

${\hat{V}}_{i}$

is the estimated total variance of Y_i, obtained from either the maximum likelihood or the restricted maximum likelihood (REML) estimator. The variance–covariance matrix of conditional residuals can be specified according to Gregoire et al. (1995) as follows:

$var ({\hat{r}}_{c i}) = (I_{n_{i}} - Z_{i} \hat{G} Z_{i}^{'} {\hat{V}}_{i}^{- 1}) var ({\hat{r}}_{m i}) {(I_{n_{i}} - Z_{i} \hat{G} Z_{i}^{'} {\hat{V}}_{i}^{- 1})}^{'} .$

(6.4)
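As a minimal numerical sketch of Equations (6.1) and (6.2) and of the factor that appears in Equation (6.4), the Python code below computes both residual vectors for one subject. All numbers (responses, fixed-effects estimates, the random-effects covariance, and the residual variance) are made up for illustration, the design is a hypothetical random intercept and slope model, and the empirical BLUP is assumed to take the standard form of G Z_i' V_i^{-1} r_mi evaluated at the estimates. The fixed-effects mean is computed as the product of the n_i × M design matrix and the estimate of β.

```python
import numpy as np

# Hypothetical single-subject design: 4 visits, intercept and time.
t = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(t), t])   # fixed-effects design (n_i x M)
Z = X.copy()                                # random intercept and slope
y = np.array([10.2, 11.9, 14.1, 15.8])      # made-up responses

beta_hat = np.array([10.0, 1.5])            # assumed fixed-effects estimates
G_hat = np.array([[1.0, 0.2],
                  [0.2, 0.5]])              # assumed random-effects covariance
sigma2_hat = 0.4                            # assumed residual variance

# Total covariance of Y_i implied by the model: V_i = Z G Z' + sigma^2 I.
V_hat = Z @ G_hat @ Z.T + sigma2_hat * np.eye(len(t))

# Equation (6.1): marginal residuals.
r_m = y - X @ beta_hat

# Empirical BLUP of the random effects (standard form, assumed here).
b_hat = G_hat @ Z.T @ np.linalg.solve(V_hat, r_m)

# Equation (6.2): conditional residuals.
r_c = r_m - Z @ b_hat

# The same vector via the factor (I - Z G Z' V^{-1}) used in Equation (6.4).
A = np.eye(len(t)) - Z @ G_hat @ Z.T @ np.linalg.inv(V_hat)
print(np.allclose(r_c, A @ r_m))
```

Under this covariance structure the factor simplifies to the residual variance times the inverse of V_i, which is one way to see why the conditional residuals inherit the shrinkage built into the BLUPs.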

In linear regression models, the distributions of residuals at different data points may not necessarily follow an expected pattern, even if random errors from the true function value are identically distributed. For example, in the conditional residuals, the random effect estimate

${\hat{b}}_{i}$

, obtained from the empirical Bayes approximation, heavily depends on the normality assumption and is also influenced by the assumed covariance structure

${\hat{V}}_{i}$

. Therefore, the elements in

${\hat{b}}_{i}$

are the empirical best linear unbiased predictors (BLUPs), thereby being shrunk toward the population fixed effects β. As a result, the distribution of the BLUPs does not accurately represent the distribution of the true random effects. In the application of linear mixed models, residuals often need to be adjusted by the expected variability for comparative purposes at different time points. A popular approach for such an adjustment is to standardize the raw residuals by using the estimated residual variance, referred to as studentizing in statistical analysis, given by

$r_{m i}^{student} = \frac{r_{m i}}{\sqrt{var (r_{m i})}},$

(6.5a)

$r_{c i}^{student} = \frac{r_{c i}}{\sqrt{var (r_{c i})}},$

(6.5b)

where

$r_{m i}^{student}$

is the vector of studentized marginal residuals for subject i and

$r_{c i}^{student}$

is the corresponding vector of studentized conditional residuals. Standardization of residuals is particularly important in the longitudinal data with distinctive outliers. Some scientists suggest external studentization of residuals in which a common residual variance estimate is used for all subjects.

Raw residuals can also be scaled by the estimate of the variance for the observed response y, referred to as Pearson-type residuals, given by

$r_{m i}^{pearson} = \frac{r_{m i}}{\sqrt{var (y_{i})}},$

(6.6a)

$r_{c i}^{pearson} = \frac{r_{c i}}{\sqrt{var (y_{i})}} .$

(6.6b)

A more refined type of standardized residuals is the scaled residuals. Construction of this type of residuals is based on the argument that given the specification of the random effects or the inclusion of a selected residual covariance structure in linear mixed models, residuals should behave as pure measurement error, and therefore, they should reflect variability that is not explained by the specified random effects or the covariance parameters (Verbeke et al., 1998). Thus, it is necessary to eliminate all sources of correlation by appropriate scaling to check residuals. The classical approach in this regard is to apply the Cholesky decomposition for the generation of transformed residuals that have constant variance and zero correlation.

The application of the Cholesky decomposition on the variance–covariance matrix starts with construction of a lower triangular matrix for each subject, denoted by

${\overset{⌢}{C}}_{i}$

, which satisfies the condition

${\hat{V}}_{i} = {\overset{⌢}{C}}_{i} {\overset{⌢}{C}}_{i}^{'},$

where

${\overset{⌢}{C}}_{i}$

represents the Cholesky root of

${\hat{V}}_{i}$

, with

${\overset{⌢}{C}}_{i}^{- 1} Y_{i}$

having constant variance and zero correlation.

Given the attached properties, the

${\overset{⌢}{C}}_{i}$

matrix can be used to transform the correlated residuals to correlation-free transformed residuals. For the marginal distribution of longitudinal data, the scaled residuals, denoted by

$R_{m i}$

, are defined as

$R_{m i} = {\overset{⌢}{C}}_{i}^{- 1} r_{m i} = {\overset{⌢}{C}}_{i}^{- 1} (Y_{i} - X_{i}^{'} \hat{β}),$

(6.7)

which have unit variance and zero correlation.
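The scaling in Equation (6.7) can be sketched in a few lines of Python. The covariance matrix and residual values below are hypothetical (a compound symmetry structure chosen only for illustration), and the function name `scaled_residuals` is introduced here, not taken from any package. The final lines check the whitening property numerically.

```python
import numpy as np

def scaled_residuals(r_m, V_hat):
    """Return C^{-1} r_m, where V_hat = C C' and C is lower triangular."""
    C = np.linalg.cholesky(V_hat)      # lower-triangular Cholesky root
    return np.linalg.solve(C, r_m)     # solves C R = r_m without forming C^{-1}

# Hypothetical covariance for one subject with 4 visits:
# compound symmetry with variance 2.0 and covariance 1.2.
V_hat = 0.8 * np.eye(4) + 1.2 * np.ones((4, 4))
r_m = np.array([0.2, 0.4, 1.1, 1.3])   # made-up marginal residuals

R_m = scaled_residuals(r_m, V_hat)

# Check of the whitening property: C^{-1} V C^{-T} should be the identity.
C = np.linalg.cholesky(V_hat)
C_inv = np.linalg.inv(C)
whitened_cov = C_inv @ V_hat @ C_inv.T
print(np.round(whitened_cov, 10))
```

Solving the triangular system rather than inverting the Cholesky root is a common numerical choice; the explicit inverse appears here only to verify the identity covariance.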

For practical purposes, the scaled residuals can be plotted against the transformed predictions of Y, denoted by

${\hat{Y}}_{i}^{*}$

and defined as

${\hat{Y}}_{i}^{*} = {\overset{⌢}{C}}_{i}^{- 1} X_{i}^{'} \hat{β} .$

If a linear mixed model is correctly specified, the plot of the transformed residuals against

${\hat{Y}}_{i}^{*}$

should be scattered randomly around zero with a constant range of variation. In contrast, if this scatter-plot displays a systematic trend, residuals remain correlated even with the specified random effects and/or a selected residual covariance structure. In the latter case, more random effect terms need to be considered in the specification of a linear mixed model and in statistical inference. Sometimes, the researcher might want to fit a smooth curve to the scatter-plot. If the linear mixed model is adequately specified, the fitted line should not display distinctive systematic departures from a horizontal line centered at zero (Fitzmaurice et al., 2004). The transformed residuals can also be used to identify skewness, detect potential outliers, and assess the normal distribution hypothesis.

6.1.2. Semivariogram in Random Intercept Linear Models

Based on the random intercept linear model, Diggle (1988) introduces the approach of empirical semivariogram to assess residuals in longitudinal data analysis. This method is designed to check serial correlation in residuals, conditionally on the random components already included in the model. Construction of this residual diagnostic method begins with the specification of a general linear model with an error term that is assumed to be independently distributed as multivariate normal, referred to as the generalized multivariate linear model.

Let

$Y_{i} = {(Y_{i 1}, ..., Y_{i n_{i}})}^{'}$

be an n_i-dimensional vector of repeated measurements for subject i and

$e_{i} = {(e_{i 1}, ..., e_{i n_{i}})}^{'}$

be an n_i × 1 column vector of assumed errors. A generalized multivariate linear regression model is then written as

$Y_{i} = X_{i}^{'} β + e_{i},$

(6.8)

where X_i is a known n_i × M matrix of covariates and β is an M × 1 vector of unknown population parameters. Given the correlation between serial observations for the same subject, e_i is assumed to follow a multivariate normal distribution with mean 0 and n_i × n_i covariance matrix V_i. Correspondingly, Y_i can be expressed in terms of

$Y_{i} \sim MVN (X_{i}^{'} β, V_{i})$

. As summarized by Diggle (1988) and Diggle et al. (2002), an appropriate V_i should at least accommodate three different sources of random variations. First, average responses usually vary randomly between subjects, with some subjects being intrinsically high and some being low. Second, a subject’s observed measurement profile may be a response to time-varying stochastic processes. Third, as the individual measurements involve some kind of subsampling within subjects, the measurement process adds a component of variation to the data. This perspective on the breakdown of variability is briefly mentioned in Chapter 1.

The above decomposition of stochastic variations in repeated measurements of the response facilitates the formulation of correlation between pairs of measurements for the same subject. A generalized multivariate linear model, specifying all three features in V_i, can be written as

$Y_{i j} = X_{i j}^{'} β + b_{i} + {\tilde{W}}_{i} (T_{i j}) + ɛ_{i j},$

(6.9)

where

$X_{i j}^{'} β$

represents the model-based mean response for subject i at time point j, b_i indicates the i.i.d. variation in average response between subjects with mean 0 and variance

$σ_{b}^{2}$

, and ɛ_ij represents the subsampling variation within the subject with mean 0 and variance

$σ_{ɛ}^{2}$

. The term

${\tilde{W}}_{i} (T_{i j})$

reflects independent stationary Gaussian processes with zero expectation and covariance function

$σ^{2} ρ (|T_{j^{'}} - T_{j}|)$

. Clearly, the terms b_i,

${\tilde{W}}_{i} (T_{i j})$

, and ɛ_ij correspond to the random intercept, autoregressive correlation, and within-subject random error, respectively, as described previously.

Given the specification of Equation (6.9), the variance matrix for subject i can be specified as

$V_{i} = σ_{b}^{2} J + σ^{2} R (T_{i}) + σ_{ɛ}^{2} I,$

(6.10)

where J is a square matrix with all its elements being unity,

$T_{i} = {(T_{i 1}, ..., T_{i n_{i}})}^{'}$

is the time vector,

$R (T_{i})$

is a symmetric matrix with element

$ρ (|T_{j^{'}} - T_{j}|)$

, and I is the identity matrix. As indicated in Chapter 5, the form of

$ρ (|T_{j^{'}} - T_{j}|)$

can be parameterized in different situations. When

$σ^{2} = 0$

, the above generalized multivariate linear model reduces to the uniform correlation model (Diggle, 1988). If

$σ_{b}^{2}$

is also equal to zero, the model reduces further to a typical general linear model with independent random errors. Such cases, however, are empirically very rare in longitudinal data.
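To make the three components of Equation (6.10) concrete, the following Python sketch assembles V_i for one subject with unequally spaced visits. The exponential correlation function ρ(u) = exp(−φu), the parameter name `phi`, and all numerical values are assumptions chosen for illustration; the text leaves the form of ρ open.

```python
import numpy as np

def covariance_matrix(times, sigma2_b, sigma2, sigma2_eps, phi):
    """V_i = sigma2_b * J + sigma2 * R(T_i) + sigma2_eps * I (Equation 6.10).

    The correlation function rho(u) = exp(-phi * u) is an assumption made
    for this sketch; other decreasing functions could be substituted.
    """
    times = np.asarray(times, dtype=float)
    n = times.size
    lags = np.abs(times[:, None] - times[None, :])   # |T_j' - T_j|
    R = np.exp(-phi * lags)                          # serial correlation matrix
    J = np.ones((n, n))                              # all elements unity
    return sigma2_b * J + sigma2 * R + sigma2_eps * np.eye(n)

times = [0.0, 0.5, 2.0, 3.5]                         # unequally spaced visits
V = covariance_matrix(times, sigma2_b=1.0, sigma2=0.6, sigma2_eps=0.3, phi=0.8)

# With sigma2 = 0 the serial component vanishes and all off-diagonal
# covariances equal sigma2_b (uniform correlation).
V_uniform = covariance_matrix(times, sigma2_b=1.0, sigma2=0.0,
                              sigma2_eps=0.3, phi=0.8)
```

Setting `sigma2_b` to zero as well would leave only the diagonal term, the case of independent random errors mentioned above.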

Let

$T \in \mathcal{T}$

be a continuous time variable. Then the semivariogram of a random process

$\{Y (T) : T \in \mathcal{T}\}$

is defined as the function

$g (u) = \frac{1}{2} [E {\{Y (T) - Y (T - u)\}}^{2}] : u \geq 0,$

(6.11)

which is assumed to be independent of T. Given this specification, the empirical semivariogram of a time series or a time trend is specified as a scatter-plot of squared differences, given by

$d_{j j^{'}}^{2} = \frac{1}{2} {[y (T_{j}) - y (T_{j^{'}})]}^{2},$

plotted against the corresponding time lags |T_j − T_{j'}|.

Empirically, the marginal residuals, denoted by r_mi, can be used to generate the empirical semivariogram. With

$r_{i j} = Y_{i j} - X_{i j}^{'} \hat{β}$

, the empirical semivariogram can be written as

$\frac{1}{2} [E {(r_{i j} - r_{i j^{'}})}^{2}] = σ_{ɛ}^{2} + σ^{2} [1 - ρ (|T_{i j} - T_{i j^{'}}|)] .$

(6.12)

The empirical semivariogram originates in spatial data analysis. The rationale of this method is that, given the specification of the random intercepts across subjects, the empirical semivariogram displays ordinary-least-squares (OLS)-type residuals by removing the term for correlation in repeated measurements. Consequently, if the random intercept model is correctly specified, the errors should follow a constant pattern, and, correspondingly, the empirical semivariogram should be scattered randomly rather than systematically.

Equation (6.12) literally states that the process variance can be estimated by half the average squared difference between pairs of observations from different subjects. As a nonparametric diagnostic method, the empirical semivariogram provides a useful graphical check on the adequacy of a specific covariance structure in R, provided that there are no random effects except a random intercept term. This approach is particularly useful to check residuals for longitudinal data with unequal time intervals.
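The construction of the empirical semivariogram can be sketched as follows: for every pair of measurements on the same subject, half the squared difference of the residuals is paired with the time lag between the two measurements. The function name `empirical_semivariogram` and the small data set are hypothetical and serve only to illustrate the bookkeeping.

```python
import numpy as np

def empirical_semivariogram(residuals, times, subject_ids):
    """Return (lags, half_sq_diffs) over all within-subject pairs j < j'."""
    residuals = np.asarray(residuals, dtype=float)
    times = np.asarray(times, dtype=float)
    subject_ids = np.asarray(subject_ids)
    lags, half_sq = [], []
    for sid in np.unique(subject_ids):
        mask = subject_ids == sid
        r, t = residuals[mask], times[mask]
        j, k = np.triu_indices(r.size, k=1)          # all pairs with j < j'
        lags.append(np.abs(t[j] - t[k]))
        half_sq.append(0.5 * (r[j] - r[k]) ** 2)
    return np.concatenate(lags), np.concatenate(half_sq)

# Made-up residuals for two subjects with unequally spaced visits.
ids = np.array([1, 1, 1, 2, 2])
t = np.array([0.0, 1.0, 3.0, 0.0, 2.0])
r = np.array([0.5, -0.5, 1.5, 0.0, 2.0])

u, v = empirical_semivariogram(r, t, ids)
```

In practice, `v` would be plotted against `u`, possibly with a smooth curve added; the plotting step is omitted here.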

6.1.3. Semivariogram in the Linear Random Coefficient Model

The specification of the generalized multivariate linear model provides a flexible framework for checking residuals in modeling normal longitudinal data. The development of this approach is based on the thought that the effect of serial correlation is dominated by the combination of the random intercept and random errors, and therefore, inclusion of more random effect terms is unnecessary.

More recently, some more flexible formulations on covariance structures have been proposed to handle complex longitudinal data patterns (Chi and Reinsel, 1989; Verbeke et al., 1998). The majority of these methods introduce an additional random term in the expression of total errors, containing both the random coefficients across subjects and serial correlation in within-subject random errors, written as

$Y_{i} = X_{i}^{'} β + Z_{i}^{'} b_{i} + e_{i},$

(6.13)

where e_i is assumed to follow a multivariate normal distribution with mean 0 and n_i × n_i covariance matrix R_i. As indicated earlier, in the standard linear random coefficient model,

$cov (e_{i}) = σ^{2} I$

is usually assumed given the specified covariates and the random effects. In performing residual diagnostics, however, specifying

$cov (e_{i}) = R_{i}$

as multivariate is considered essential to check whether residuals remain correlated with the specified random coefficients. Chi and Reinsel (1989) propose a score test to examine the random coefficient model with

$cov (e_{i}) = σ^{2} I$

against the random coefficient multivariate model with the AR(1) errors for e_i. This approach provides a simple check for possible autocorrelation in residuals, which, in linear mixed models, are generally assumed to be conditionally independent. It is found from the score test that the specification of the random effects generally accounts for the serial correlation among repeated measurements, and therefore, the random coefficient model adding a serial correlation term overparameterizes the covariance structure.

In the framework of the random coefficient multivariate model, the error structure can be more flexibly decomposed (Verbeke et al., 1998), given by

$e_{i} = Z_{i}^{'} b_{i} + ɛ_{1 i} + ɛ_{2 i},$

(6.14)

where ɛ_1i is the n_i × 1 residual vector to indicate time-varying stochastic processes operating within the subject (serial correlation), assumed to be normally distributed with mean zero and covariance matrix

${\tilde{H}}_{i}$

. The covariance matrix

${\tilde{H}}_{i}$

has element

${\tilde{h}}_{i j j^{'}}$

taking the form

${\tilde{τ}}^{2} g (|T_{i j} - T_{i j^{'}}|)$

, where

${\tilde{τ}}^{2}$

is the profiled variance for all elements of ɛ_1i and g(.) is an unknown positive decreasing function. The vector ɛ_2i represents random errors assumed to be independent and identically distributed with zero expectation. The q-dimensional vector b_i of subject-specific random effects is assumed to follow a multivariate normal distribution with mean zero and covariance matrix G, as regularly specified.

Given Equation (6.14), the response vector y_i marginally follows a normal distribution with mean vector

$X_{i}^{'} β$

and covariance matrix

$V_{i} = Z_{i} G Z_{i}^{'} + {\tilde{H}}_{i} + σ_{ɛ}^{2} I_{n_{i}},$

(6.15)

where

$I_{n_{i}}$

is the n_i-dimensional identity matrix. This marginal random coefficient multivariate model can be estimated by using either the maximum likelihood (ML) or the REML approach. As discussed by Verbeke et al. (1998), this regression model can be used to check whether a classical linear mixed model sufficiently accounts for intraindividual correlation without considering an additional residual covariance structure.

Empirically, one might compare the analytic and graphical results from different regression models such as the ordinary least squares, the random intercept, the random coefficient, and the random coefficient plus a specific residual covariance structure. If both the OLS and the random-intercept residuals are found to deviate markedly from normality, serial correlation needs to be further specified and some regression coefficients should be considered random across subjects. Likewise, if residuals from a random coefficient model still deviate notably from a normal distribution, the researcher might want to add an appropriate residual covariance matrix to the linear random coefficient model for yielding efficient and consistent parameter estimates. The latter case, however, rarely occurs in longitudinal data analysis.

With the inclusion of the random coefficients, the semivariogram based on the random intercept model is extended to the random coefficient perspective. As the covariance structure specified in Equation (6.15) has been found to be dominated by its first component

$Z_{i} G Z_{i}^{'}$

, it is necessary to remove all variability explained by the random effects b_i before proceeding with the serial correlation check. The scaled residuals, denoted by

$R_{m i}$

and described in Section 6.1.1, can be used for this removal. These standardized residuals are independent of any distributional assumptions on b_i, and therefore, their computation does not require an estimate of the random-effects covariance matrix G (Verbeke et al., 1998). Because the scaled residuals

$R_{m i}$

have unit variance and zero correlation, the semivariogram based on the linear random coefficient model can be written as

$\begin{array}{l} \frac{1}{2} [E {(R_{i j} - R_{i j^{'}})}^{2}] = \frac{1}{2} var (R_{i j}) + \frac{1}{2} var (R_{i j^{'}}) - cov (R_{i j}, R_{i j^{'}}) \\ = \frac{1}{2} + \frac{1}{2} - 0 = 1. \end{array}$

(6.16)

Equation (6.16) indicates that if a random coefficient linear model is correctly specified, the plot of the semivariogram of the transformed residuals against time should be scattered randomly around unity without displaying any systematic time trend. With this property, the semivariogram can be applied to check whether the specified random effects explain all serial correlation between repeated measurements. In performing this residual diagnostic method, the conditional residuals, defined as

$Y_{i} - X_{i}^{'} \hat{β} - Z_{i} {\hat{b}}_{i}$

, are not recommended for use because the BLUP

${\hat{b}}_{i}$

from the empirical Bayes approximation depends heavily on the normality assumption and is also influenced by the assumed covariance structure

${\hat{V}}_{i}$.

The semivariogram, based on either the random intercept or the random coefficient model, is usually applied for spatial data when the time intervals are unequally spaced. For highly unbalanced longitudinal data, a smooth plot of the empirical semivariogram may be fitted from the scatter-plot of the transformed residuals, which should be centered at unity displaying no systematic time trend. Because the empirical semivariogram is sensitive to outliers, influence diagnostics need to be performed first before fitting a smooth curve to the scatter-plot of the empirical semivariogram. Large-scale longitudinal data are needed to conduct this residual diagnostic technique.
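The property stated in Equation (6.16) can be checked with a small simulation. The sketch below is built on simplifying assumptions: a hypothetical random intercept and slope model, a balanced design with four visits, and known values of β and V_i in place of estimates. When the covariance structure is correct, the average half squared difference of the scaled residuals should be close to unity.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical random intercept and slope model with known parameters.
t = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(t), t])
Z = X.copy()
beta = np.array([10.0, 1.5])
G = np.array([[1.0, 0.2],
              [0.2, 0.5]])
sigma2 = 0.4
V = Z @ G @ Z.T + sigma2 * np.eye(t.size)
C = np.linalg.cholesky(V)                      # V = C C'

n_subjects = 2000
j, k = np.triu_indices(t.size, k=1)            # within-subject pairs j < j'
half_sq = []
for _ in range(n_subjects):
    b = rng.multivariate_normal(np.zeros(2), G)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=t.size)
    y = X @ beta + Z @ b + eps
    R = np.linalg.solve(C, y - X @ beta)       # scaled residuals, Equation (6.7)
    half_sq.append(0.5 * (R[j] - R[k]) ** 2)

mean_semivariogram = np.mean(np.concatenate(half_sq))
print(round(mean_semivariogram, 2))
```

Replacing `V` in the Cholesky step with a misspecified matrix (for example, one that omits the random slope) would be one way to see how the semivariogram departs from unity, although that variation is not shown here.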

6.2. Influence Diagnostics

In regression diagnostics, another important area is identification of influential observations. For various regression models, it is essential to identify particular observations that have extraordinary influences on the analytic results. Identification of influential observations in linear mixed models differs slightly from the classical approaches applied in general linear models. Most significantly, diagnostics on longitudinal data involve individuals having multiple data points, rather than at a single time. Consequently, removal of one individual can affect a series of observations, thereby magnifying the case’s influence on parameter estimates, both the fixed effects and the random components. Therefore, more refined techniques are sometimes required to identify influential observations for a linear mixed model. In most situations, however, influence diagnostics applied in longitudinal data analysis follow the standard perspectives as those used in general linear models.

This section describes a variety of the basic diagnostic techniques to identify influential observations in linear mixed models, including the Cook’s distance statistic, leverage, the DFFITS score, the MDFFITS statistic, COVTRACE, COVRATIO, the likelihood displacement (LD) statistic and its approximations, and the LMAX standardized statistic. An illustration is provided to check whether there are any influential observations in fitting the two linear mixed models described in the preceding three chapters.

6.2.1. Cook’s D and Related Influence Diagnostics

In fitting a linear mixed model, some observations may unduly affect the inferential process used to derive parameter estimates, as is frequently encountered in fitting other types of regression models. Traditionally, these influential cases can be identified by the change in the estimated regression coefficients after deleting each observation in turn (Cook, 1977, 1979).

Let

$\hat{β}$

be the estimate of β that maximizes the log-likelihood or the log restricted likelihood function and

${\hat{β}}_{(- i)}$

be the same estimate of β when subject i is eliminated from the estimating process. For a single covariate X_m, the distance in the estimated regression coefficient after removing the subject, denoted

${\overset{⌢}{d}}_{m i}$

, is written as

${\overset{⌢}{d}}_{m i} = {\hat{β}}_{m} - {\hat{β}}_{m (- i)} .$

(6.17)

Equation (6.17) provides an exact measurement for the absolute influence of deleting subject i from the estimating process on the regression coefficient estimate, referred to as Cook’s distance statistic. This statistic can be expressed in terms of the entire vector of the estimated regression coefficients, given by

${\overset{⌢}{d}}_{i} = \hat{β} - {\hat{β}}_{(- i)}$

. A greater value of

${\overset{⌢}{d}}_{i}$

suggests that subject i has a stronger influence on the estimate of β; likewise, a lower value indicates that subject i’s impact on the model fit is limited.

For convenience of performing the significance test, Cook’s distance statistic is often scaled by the estimates of the coefficient standard errors after removing subject i. Mathematically, this scaled or standardized distance score is written as

${\bar{d}}_{i} = \frac{{[\hat{β} - {\hat{β}}_{(- i)}]}^{'} (X^{'} X) [\hat{β} - {\hat{β}}_{(- i)}]}{[1 + rank (X)] s^{2}},$

(6.18)

where

${\bar{d}}_{i}$

is the scaled Cook’s distance, or simply Cook’s D, and s² is the mean square error of the data. In the literature of influence diagnostics, the two Cook’s distance statistics,

${\overset{⌢}{d}}_{i}$

and

${\bar{d}}_{i}$

, are also referred to as DFBETA_i and DFBETAS_i, respectively (Belsley et al., 1980; Fox, 1991).

Like the original statistic

${\overset{⌢}{d}}_{i}$

, a large value in

${\bar{d}}_{i}$

indicates that the parameter estimates are sensitive to removal of the ith subject. With standardization,

${\bar{d}}_{i}$

approximately follows an F-distribution with the numerator degrees of freedom being rank(X) and the denominator degrees of freedom being N − rank(X). Given the F-distribution, the significance of Cook’s D can be statistically tested. Specifically, assuming X to have full rank, the F statistic can be used to test the null hypothesis with threshold F_{M, N−M, α}.

For linear mixed models, the specification of the scaled Cook’s D is slightly more complex, given by

${\bar{d}}_{i} = \frac{{[\hat{β} - {\hat{β}}_{(- i)}]}^{'} var {(\hat{β})}^{- 1} [\hat{β} - {\hat{β}}_{(- i)}]}{rank (X)},$

(6.19)

where

$var {(\hat{β})}^{- 1}$

is the matrix from sweeping

${[X^{'} {V (\hat{G}, \hat{R})}^{- 1} X]}^{- 1} .$

If V is known, Cook’s D can be evaluated according to a chi-square distribution with the degrees of freedom being the rank of X (Christensen et al., 1992). If V is unknown, an estimate of V needs to be obtained from the approach described in Chapters 3 and 4, and the statistical evaluation of

${\bar{d}}_{i}$

should be based on the F_{M, N−M, α} distribution. For large samples, however, checking the statistical significance of

${\bar{d}}_{i}$

for each subject in sequence is unrealistic. In these situations, plotting the scaled statistics graphically is a more practical approach for a quick diagnostic check. An analytic approach to checking the statistical significance of

${\bar{d}}_{i}$

without removing subjects in sequence will be described in Section 6.2.4.
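A subject-deletion version of Equation (6.19) can be sketched in Python as follows. The sketch rests on several assumptions that are not part of the text: the covariance matrix V_i is treated as known, is common to all subjects, and is held fixed rather than re-estimated after each deletion; the fixed effects are obtained by generalized least squares; and the data, with one deliberately planted subject whose slope differs from the rest, are simulated. The function names are hypothetical.

```python
import numpy as np

def gls_fit(X_list, y_list, V_inv):
    """GLS estimate and its covariance for subjects sharing one V_i."""
    info = sum(X.T @ V_inv @ X for X in X_list)           # X' V^{-1} X
    score = sum(X.T @ V_inv @ y for X, y in zip(X_list, y_list))
    cov_beta = np.linalg.inv(info)                         # var(beta_hat)
    return cov_beta @ score, cov_beta

def cooks_d_by_subject(X_list, y_list, V):
    """Scaled Cook's D of Equation (6.19), deleting one subject at a time."""
    V_inv = np.linalg.inv(V)
    beta_hat, cov_beta = gls_fit(X_list, y_list, V_inv)
    rank_X = np.linalg.matrix_rank(np.vstack(X_list))
    d = []
    for i in range(len(X_list)):
        X_rest = X_list[:i] + X_list[i + 1:]
        y_rest = y_list[:i] + y_list[i + 1:]
        beta_del, _ = gls_fit(X_rest, y_rest, V_inv)
        diff = beta_hat - beta_del
        d.append(diff @ np.linalg.solve(cov_beta, diff) / rank_X)
    return np.array(d)

# Hypothetical data: 8 subjects, 3 visits each, one subject with a shifted slope.
rng = np.random.default_rng(0)
t = np.array([0.0, 1.0, 2.0])
X_i = np.column_stack([np.ones_like(t), t])
V = 0.5 * np.eye(3) + 0.5 * np.ones((3, 3))                # assumed known V_i
X_list = [X_i] * 8
y_list = [X_i @ np.array([10.0, 1.0]) + rng.normal(0, 0.1, 3) for _ in range(8)]
y_list[5] = X_i @ np.array([10.0, 6.0])                    # planted influential subject

D = cooks_d_by_subject(X_list, y_list, V)
print(int(np.argmax(D)))
```

In this constructed example the planted subject (index 5) yields by far the largest value. In a full analysis the covariance parameters would typically be updated, or at least approximated, after each deletion, which this sketch does not attempt.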

The influential cases can also be identified by the change in the predicted value after deleting each subject in a sequence (Cook, 1977, 1979). One of the useful diagnostics in this regard is the PRESS statistic, defined as

$PRESS = \sum_{\overset{⌢}{i} \neq i} {\hat{r}}_{\overset{⌢}{i} (- i)}^{2},$

(6.20)

where the sum does not include subject i whose influence on the model fit is under check, and

${\hat{r}}_{\overset{⌢}{i} (- i)} = y_{\overset{⌢}{i}} - X_{\overset{⌢}{i}}^{'} {\hat{β}}_{(- i)} .$

There are some other influence diagnostics based on the predicted value when a subject is removed from the regression. As more or less related to the leverage statistic, they are described in Section 6.2.3 after leverage is delineated.

6.2.2. Leverage

In linear regression models, leverage is used to assess outliers with respect to the independent variables by identifying the observations that are distant from the average predictor values. While potentially impactful on the parameter estimates and the model fit, a high-leverage point does not necessarily exert strong influence on the regression coefficient estimates, because a subject whose predictor values lie far from those of others can still fall on the same regression line as the other observations (Fox, 1991). Therefore, substantial influence requires high leverage combined with discrepancy of the case from the rest of the data.

The basic measurement of leverage is the so-called hat-value, denoted by h_i. In general linear models, the hat-value is specified as a weight variable in the expression of the fitted value of the predicted response

${\hat{y}}_{\tilde{j}}$

, given by

${\hat{y}}_{\tilde{j}} = \sum_{i = 1}^{N} h_{i \tilde{j}} y_{i},$

(6.21)

where

$h_{i \tilde{j}}$

is the weight of subject i in predicting the outcome Y at data point

$\tilde{j}$

(

$\tilde{j} = 1,2,..., N$

), and

${\hat{y}}_{\tilde{j}}$

is specified as a weighted average of N observed values. Therefore, the weight variable

$h_{i \tilde{j}}$

displays the influence of y_i on

${\hat{y}}_{\tilde{j}}$

, with a higher score indicating a greater impact on the fitted value. Let

$h_{i} = h_{i i} = \sum_{\tilde{j} = 1}^{N} h_{i \tilde{j}}^{2} .$

(6.22)

According to Equation (6.22), the hat-value h_i, with property

$0 \leq h_{i} \leq 1$

, is the leverage score of y_i on all fitted values.

In general linear models including a number of independent variables, leverage measures distance from the means of the independent variables and can be expressed as a matrix quantity given the covariance structure of the X matrix. Correspondingly, the hat-value h_i is the ith diagonal of a hat matrix H. The hat matrix H is given by

$H = X {(X^{'} X)}^{- 1} X^{'} .$

(6.23)

The ith diagonal element of H provides a standardized measure of the distance of the ith subject from the center of the X-space, with a large value indicating that the subject is potentially influential. If all cases have equal leverage, each subject has a leverage score of M/N, where M is the number of independent variables (including the intercept) and N is the number of observations. In the literature on influence diagnostics, leverage values exceeding 2M/N for large samples, or 3M/N for samples of N ≤ 30, are roughly regarded as indicating potentially influential cases.
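As a minimal sketch of the hat-value and the cut-off rule described above, the following Python snippet computes leverage scores for a simple regression with an intercept and one predictor (M = 2). For this special case, the ith diagonal element of H in Equation (6.23) reduces to 1/N + (x_i − x̄)²/Σ(x_k − x̄)². The data values and the function names are hypothetical and serve only as an illustration.

```python
def hat_values(x):
    """Leverage h_i for a simple linear regression with an intercept.

    With one predictor plus an intercept, the ith diagonal element of
    H = X (X'X)^{-1} X' equals 1/N + (x_i - mean)^2 / Sxx.
    """
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def leverage_cutoff(m, n):
    """Rough cut-off from the text: 3M/N for N <= 30, otherwise 2M/N."""
    return (3.0 if n <= 30 else 2.0) * m / n


# Hypothetical predictor values; the last one lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
h = hat_values(x)

m = 2  # columns of X: intercept and slope
cutoff = leverage_cutoff(m, len(x))
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

print([round(hi, 4) for hi in h])
print("sum of hat-values (equals M):", round(sum(h), 6))
print("cut-off:", cutoff, "high-leverage indices:", flagged)
```

The hat-values sum to M, so their average is M/N, which is the benchmark against which the cut-off is scaled. A flagged case is only potentially influential; whether it actually changes the estimates depends on its discrepancy from the rest of the data.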

Given the H matrix, the predicted values of y in general linear models can be written as

$\hat{y} = Hy .$

(6.24)

Therefore, the H matrix determines the variance and covariance of the fitted values and residuals, given by

$var (\hat{y}) = σ^{2} H,$

(6.25a)

$var (r) = σ^{2} (I - H) .$

(6.25b)

In longitudinal data analysis, the specification of the H matrix becomes more complex due to the inclusion of the covariance matrix V(R, G). Let $\hat{Θ}$ be an available estimate of the covariance parameters in R and G (also specified in Chapter 4). The leverage score for subject i can be expressed as the ith diagonal element of the following hat matrix:

$H = X {[X^{'} V {(\hat{Θ})}^{- 1} X]}^{-} X^{'} V {(\hat{Θ})}^{- 1} .$

(6.26)

The ith diagonal of the above matrix is the leverage score for subject i displaying the degree of the case’s difference from others in one or more independent variables.

6.2.3. DFFITS, MDFFITS, COVTRACE, and COVRATIO Statistics

In addition to Cook’s D, a number of other case-deletion diagnostics are frequently used in linear regression models. These diagnostics include the DFFITS, the MDFFITS, the COVTRACE, and the COVRATIO statistics (Belsley et al., 1980; Christensen et al., 1992; Fox, 1991; SAS, 2012, Chapter 59). The DFFITS statistic is defined as the change in the predicted value after removing a case, standardized by dividing by the estimated standard error of the fit. In general linear models, the statistic is defined as

$DFFITS_{i} = \frac{{\hat{y}}_{i} - {\hat{y}}_{i (- i)}}{s_{(- i)} \sqrt{h_{i}}},$

(6.27)

where ${\hat{y}}_{i}$ is the predicted value of Y for subject i from the linear regression including i, ${\hat{y}}_{i (- i)}$ is the same prediction after removing data point i in fitting the regression model, $s_{(- i)}$ is the standard error estimate without i, and h_i is the corresponding leverage score. Given studentizing, the DFFITS statistic can be written as the studentized residual t_i, which follows a Student t distribution, multiplied by a leverage factor, given by

$DFFITS_{i} = t_{i} \sqrt{\frac{h_{i}}{1 - h_{i}}} .$

(6.28)

In longitudinal data analysis, the statistical procedure to fit a regression model is more complex than that of a simple linear regression given the specification of V_i. For example, in linear mixed models, the DFFITS statistic is specified as

$DFFITS_{i} = \frac{{\hat{y}}_{i} - {\hat{y}}_{i (- i)}}{ese ({\hat{y}}_{i})},$

(6.29)

where $ese ({\hat{y}}_{i})$ is the asymptotic standard error estimate for ${\hat{y}}_{i}$. This statistic can be approximated by

$ese ({\hat{y}}_{i}) = \sqrt{{x^{'}}_{i} {[X^{'} V {({\hat{Θ}}_{(- i)})}^{- 1} X]}^{-} x_{i}},$

(6.30)

where x_i contains the observed values of X for subject i. As a standardized diagnostic measure, the DFFITS statistic indicates the number of standard errors by which the fitted value changes after the removal of subject i.
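The relationship between the case-deletion form of DFFITS in Equation (6.27) and the studentized form in Equation (6.28) can be checked numerically. The sketch below does so for an ordinary simple regression (not a linear mixed model), refitting the line without each case in turn. The data and all function names are hypothetical, and t_i is taken to be the externally studentized residual, that is, the residual divided by s_(−i)√(1 − h_i).

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def residual_sd(x, y, b0, b1):
    """Residual standard deviation with N - 2 degrees of freedom."""
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))


def leverage(x, i):
    """Hat-value of case i in a simple regression with an intercept."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return 1.0 / n + (x[i] - mx) ** 2 / sxx


def dffits_by_deletion(x, y, i):
    """Equation (6.27): change in the fitted value after refitting without i."""
    b0, b1 = fit_line(x, y)
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    b0_del, b1_del = fit_line(x_del, y_del)
    s_del = residual_sd(x_del, y_del, b0_del, b1_del)
    change = (b0 + b1 * x[i]) - (b0_del + b1_del * x[i])
    return change / (s_del * math.sqrt(leverage(x, i)))


def dffits_studentized(x, y, i):
    """Equation (6.28): studentized residual times a leverage factor."""
    b0, b1 = fit_line(x, y)
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    s_del = residual_sd(x_del, y_del, *fit_line(x_del, y_del))
    h_i = leverage(x, i)
    t_i = (y[i] - b0 - b1 * x[i]) / (s_del * math.sqrt(1.0 - h_i))
    return t_i * math.sqrt(h_i / (1.0 - h_i))


# Hypothetical data; the last case departs from the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 12.0]

for i in range(len(x)):
    print(i, round(dffits_by_deletion(x, y, i), 4),
          round(dffits_studentized(x, y, i), 4))
```

The two columns agree up to rounding error, which illustrates why DFFITS in ordinary regression can be obtained without actually refitting the model. In linear mixed models, Equation (6.29) generally requires the deletion estimates themselves.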

The MDFFITS statistic is used when multiple data points are removed from the regression (Belsley et al., 1980). In longitudinal data, case deletion generally implies removal of multiple data points, thereby indicating that MDFFITS is an appropriate statistic for performing influence diagnostics in linear mixed models. Christensen et al. (1992) specify this statistic on the fixed effects in the context of linear mixed models, given by

$MDFFITS [β_{(- i)}] = \frac{{[\hat{β} - {\hat{β}}_{(- i)}]}^{'} var {[{\hat{β}}_{(- i)}]}^{-} [\hat{β} - {\hat{β}}_{(- i)}]}{rank (X)} .$

(6.31)

There is a striking similarity between the above MDFFITS[β_(−i)] formulation and Cook’s D, specified by Equation (6.19). Both statistics measure the influence of data points on a vector of the fixed effects, with the only difference being the use of $var [{\hat{β}}_{(- i)}]$ in Equation (6.31) in place of $var (\hat{β})$.
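To make the comparison concrete, the following sketch computes both statistics by case deletion for an ordinary simple regression, in which var(β̂) = s²(X′X)⁻¹ and var[β̂_(−i)] = s²_(−i)[X′_(−i)X_(−i)]⁻¹. This is an ordinary least squares analogue rather than the linear mixed model version, and the data and function names are hypothetical. The quadratic forms are evaluated through the identity d′X′Xd = Σ(d₀ + d₁x_k)², which avoids explicit matrix inversion.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def residual_variance(x, y, b0, b1):
    """Residual variance estimate with N - 2 degrees of freedom."""
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (len(x) - 2)


def cooks_d_and_mdffits(x, y, i):
    """Cook's D and MDFFITS for case i in a simple regression (rank(X) = 2)."""
    rank = 2
    b0, b1 = fit_line(x, y)
    s2 = residual_variance(x, y, b0, b1)

    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    b0_del, b1_del = fit_line(x_del, y_del)
    s2_del = residual_variance(x_del, y_del, b0_del, b1_del)

    # Change in the coefficient vector after deleting case i.
    d0, d1 = b0 - b0_del, b1 - b1_del

    # d' (X'X) d over all cases (Cook's D) and over the retained cases (MDFFITS).
    quad_full = sum((d0 + d1 * xk) ** 2 for xk in x)
    quad_del = sum((d0 + d1 * xk) ** 2 for xk in x_del)

    cooks_d = quad_full / (rank * s2)        # uses var(beta_hat)
    mdffits = quad_del / (rank * s2_del)     # uses var(beta_hat_(-i))
    return cooks_d, mdffits


# Hypothetical data with mild scatter around a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.3, 3.8, 5.4, 5.7, 7.3, 9.5]

for i in range(len(x)):
    d, m = cooks_d_and_mdffits(x, y, i)
    print(i, round(d, 4), round(m, 4))
```

The two statistics share the same numerator vector and differ only in which covariance estimate scales it, so their values tend to move together across cases.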

If the covariance parameters are assumed to be fixed, the MDFFITS score for each subject can be estimated by a noniterative procedure to check only the fixed effects and the residual variance. The MDFFITS score, however, can be underestimated if a subject’s impact on the covariance parameters is overlooked. Therefore, the iterative approach is preferred to assess the overall impact of a subject, including the influence on the covariance parameter estimates. For the iterative influence analysis, $var [{\hat{β}}_{(- i)}]$ is evaluated at ${\hat{Θ}}_{(- i)}$, and therefore, an MDFFITS score can be computed specifically for the covariance parameters, written as

$MDFFITS [Θ_{(- i)}] = {[\hat{Θ} - {\hat{Θ}}_{(- i)}]}^{'} var {[{\hat{Θ}}_{(- i)}]}^{- 1} [\hat{Θ} - {\hat{Θ}}_{(- i)}] .$

(6.32)

The covariance trace and the ratio statistics, referred to as COVTRACE and COVRATIO, respectively, are the other two widely used diagnostics for identifying influential cases in linear regression models (Belsley et al., 1980; Christensen et al., 1992; Fox, 1991; SAS, 2012, Chapter 59). While Cook’s D, DFFITS, and MDFFITS statistics are used to measure a subject’s influence on the parameter estimates and the fitted values, the COVTRACE and COVRATIO statistics are applied to assess the influence on the precision of parameter estimates. For linear mixed models, Christensen et al. (1992) define the corresponding COVTRACE and COVRATIO statistics, given by

$COVTRACE [β_{(- i)}] = |trace \{var {(\hat{β})}^{-} var [{\hat{β}}_{(- i)}]\} - rank (X)|,$

(6.33)

$COVRATIO [β_{(- i)}] = \frac{\det \{var [{\hat{β}}_{(- i)}]\}}{\det [var (\hat{β})]},$

(6.34)

where $\det \{var [{\hat{β}}_{(- i)}]\}$ indicates the determinant of the nonsingular part of the $var [{\hat{β}}_{(- i)}]$ matrix. The COVRATIO statistic can be used to assess the precision of the fixed effects with the following criteria: if COVRATIO > 1, inclusion of subject i in the regression improves the precision of the parameter estimates; if COVRATIO < 1, the incorporation of the subject in the estimating process reduces the precision of the estimation, so that subject i may be deleted in the model fit.
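The direction of the COVRATIO criterion can be illustrated with a small sketch. The code below evaluates Equations (6.33) and (6.34) for an ordinary simple regression, where var(β̂) = s²(X′X)⁻¹ is a 2 × 2 matrix, by refitting the model without each case. It is an ordinary least squares analogue under hypothetical data, not the linear mixed model computation, and the helper names are illustrative only.

```python
def design_crossproduct(x):
    """X'X for a design with an intercept column and one predictor."""
    return [[float(len(x)), sum(x)], [sum(x), sum(v * v for v in x)]]


def det2(a):
    """Determinant of a 2 x 2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of a nonsingular 2 x 2 matrix."""
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]


def coefficient_covariance(x, y):
    """Estimated var(beta_hat) = s^2 (X'X)^{-1} from an OLS fit."""
    a_inv = inv2(design_crossproduct(x))
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b0 = a_inv[0][0] * sy + a_inv[0][1] * sxy
    b1 = a_inv[1][0] * sy + a_inv[1][1] * sxy
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (len(x) - 2)
    return [[s2 * v for v in row] for row in a_inv]


def covratio_covtrace(x, y, i):
    """COVRATIO and COVTRACE for case i, with rank(X) = 2."""
    full = coefficient_covariance(x, y)
    deleted = coefficient_covariance(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    ratio = det2(deleted) / det2(full)
    full_inv = inv2(full)
    # Trace of the product var(beta_hat)^{-1} var(beta_hat_(-i)).
    trace = sum(full_inv[r][c] * deleted[c][r]
                for r in range(2) for c in range(2))
    return ratio, abs(trace - 2)


# Hypothetical data; the last case departs from the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 12.0]

for i in range(len(x)):
    ratio, trace_stat = covratio_covtrace(x, y, i)
    print(i, round(ratio, 4), round(trace_stat, 4))
```

In this illustration the discrepant last case yields a COVRATIO well below one: deleting it shrinks the estimated covariance matrix, meaning that its inclusion reduces the precision of the estimates.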

In the iterative influence analysis, the COVTRACE and COVRATIO statistics can also be computed for the covariance parameter estimates:

$COVTRACE [Θ_{(- i)}] = |trace \{var {(\hat{Θ})}^{-} var [{\hat{Θ}}_{(- i)}]\} - rank [var (\hat{Θ})]|,$

(6.35)

$COVRATIO [Θ_{(- i)}] = \frac{\det \{var [{\hat{Θ}}_{(- i)}]\}}{\det [var (\hat{Θ})]} .$

(6.36)

Empirically, iterative computations of COVTRACE and COVRATIO for the covariance parameter estimates are burdensome and tedious, particularly for a large sample. When difficulty in performing the iterative methods arises, the researcher might want to consider using another diagnostic approach.

6.2.4. Likelihood Displacement Statistic Approximation

The above influence diagnostics are useful for graphically identifying influential cases. Some of those diagnostic statistics are linked to certain types of probability distributions by using standardization. Based on the standardized results, empirically based cut-points may be created. In this aspect, a popular diagnostic approach is LD, which is directly associated with the likelihood ratio statistic. This likelihood-type diagnostic statistic, originally proposed by Cook (1977), produces both graphical plots and analytic results simultaneously. In the literature of influence diagnostics, the LD statistic is generally referred to as LD if the statistic is generated from the log-likelihood function or as RLD if it is derived from the restricted log-likelihood function.

Let l be the log-likelihood function log L and l_R be the restricted log-likelihood function log L_R. In general linear models, the LD and RLD scores for subject i, denoted by LD_i and RLD_i, respectively, can then be defined as

$LD_{i} = 2 l (\hat{β}) - 2 l [{\hat{β}}_{(- i)}],$

(6.37a)

$RLD_{i} = 2 l_{R} (\hat{β}) - 2 l_{R} [{\hat{β}}_{(- i)}],$

(6.37b)

where $l (\hat{β})$ is the log-likelihood function and $l_{R} (\hat{β})$ is the log restricted likelihood function with respect to the estimated regression coefficients $\hat{β}$. In simple linear regression models, the LD statistic is distributed as chi-square with one degree of freedom under the null hypothesis that ${\hat{β}}_{(- i)} = \hat{β}$. With this desirable property, the observations having a strong impact on $\hat{β}$ can be statistically tested as well as graphically identified. Consequently, this statistic provides sufficient information on whether an identified influential case should be removed from the regression.

In longitudinal data analysis, the use of this exact statistic is often not realistic. When the sample size is large, this case-deletion process becomes extremely tedious and time-consuming. Consider, for example, a sample of 2000 subjects: the analyst needs to fit 2001 linear mixed models (the full-data model plus one model for each deleted subject) to identify which case or cases have an exceptionally strong impact on the parameter estimates. If the distance score, denoted by $\hat{θ} - {\hat{θ}}_{(- i)}$, can be statistically approximated by a scalar measurement in linear mixed models, influential observations can be identified without removing each case in sequence from the estimating process.

Cook (1986) developed a method to approximate Equation (6.37) by introducing weights into the likelihood function, with different individuals allowed different weights. In this method, an N-dimensional vector $w = {(w_{1}, w_{2}, \ldots, w_{N})}^{'}$ is created, with element w_i (i = 1, 2, …, N) being the weight for subject i. The w vector in the classical log-likelihoods can be denoted by $w_{0} = {(1, 1, \ldots, 1)}^{'}$, in which each subject has weight one. This approach can be readily extended to linear mixed models. Let θ be a parameter vector consisting of β and the parameters in G and R. The log-likelihood and the log restricted likelihood functions with weights, referred to as perturbed log-likelihoods (Verbeke and Molenberghs, 2000), are then given by

$l (θ |w) = \sum_{i = 1}^{N} w_{i} l_{i} (θ),$

(6.38a)

$l_{R} (θ |w) = \sum_{i = 1}^{N} w_{i} l_{R i} (θ),$

(6.38b)

where $l (θ |w)$ is the perturbed log-likelihood function given the inclusion of the weight vector w, and $l_{R} (θ |w)$ is the corresponding perturbed log restricted likelihood function. In longitudinal data analysis, w_i is set at zero if the entire set of observations for subject i is not considered in linear mixed models.

Given the specification of weights, the influential cases can be identified by measuring the distance between ${\hat{θ}}_{w}$ and $\hat{θ}$ in the LD formulation. The approximated LD and RLD scores for all subjects, denoted by $LD (w)$ and $RLD (w)$, are given by

$LD (w) = 2 l (\hat{θ}) - 2 l [{\hat{θ}}_{w}],$

(6.39a)

$RLD (w) = 2 l_{R} (\hat{θ}) - 2 l_{R} [{\hat{θ}}_{w}] .$

(6.39b)

Obviously, it is still unrealistic to evaluate $LD (w)$ and $RLD (w)$ for all elements in w. Cook (1986) suggests studying the local behavior of $LD (w)$ around w₀ because it describes the sensitivity of $LD (w)$ with respect to slight perturbations of the weights. Accordingly, the $LD (w)$ at w₀ can be approximated by the classical likelihood expressions without directly including w. In the context of linear mixed models, the LD score for subject i can then be approximated by

$LD_{i} (w_{i}) \approx {\tilde{U}}_{i}^{'} I^{- 1} (\hat{θ}) {\tilde{U}}_{i},$

(6.40)

where ${\tilde{U}}_{i}$ is the score function for subject i, mathematically defined as the first partial derivative of the subject’s log-likelihood contribution with respect to θ, and $I^{- 1} (\hat{θ})$ is the inverse of the observed Fisher information matrix, mathematically defined as the negative of the second partial derivative of the log-likelihood function with respect to θ. Both quantities are evaluated at $θ = \hat{θ}$. As both ${\tilde{U}}_{i}$ and $I (\hat{θ})$ can be obtained from the standard linear mixed modeling, this approximation does not require additional statistical inference, thereby avoiding the specification of w_i. Given Equation (4.16), the $RLD_{i} (w_{i})$ statistic can be obtained in the same fashion with minor modifications (Lesaffre and Verbeke, 1998).
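How closely the quadratic form in Equation (6.40) tracks the exact likelihood displacement can be seen in a deliberately simple setting. The sketch below assumes a toy model that is not a linear mixed model: independent normal observations with an unknown mean and a known variance, one observation per subject. In that case the score contribution is U_i = (y_i − μ̂)/σ² and the observed information is N/σ². The data, the variance value, and the function names are hypothetical.

```python
def exact_ld(y, i, sigma2):
    """Exact LD_i = 2 l(mu_hat) - 2 l(mu_hat_(-i)) on the full-data likelihood."""
    n = len(y)
    mu_hat = sum(y) / n
    mu_deleted = (sum(y) - y[i]) / (n - 1)

    def loglik(mu):
        # Normal log-likelihood up to an additive constant (variance known).
        return -sum((v - mu) ** 2 for v in y) / (2.0 * sigma2)

    return 2.0 * loglik(mu_hat) - 2.0 * loglik(mu_deleted)


def approximate_ld(y, i, sigma2):
    """Approximation U_i' I^{-1} U_i from Equation (6.40) for the toy model."""
    n = len(y)
    mu_hat = sum(y) / n
    score_i = (y[i] - mu_hat) / sigma2   # first derivative of l_i at mu_hat
    information = n / sigma2             # negative second derivative of l
    return score_i ** 2 / information


# Hypothetical responses; the last value is far from the rest.
y = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 5.1, 9.0]
sigma2 = 1.0

for i in range(len(y)):
    print(i, round(exact_ld(y, i, sigma2), 5),
          round(approximate_ld(y, i, sigma2), 5))
```

For this toy model the ratio of the approximation to the exact value works out to (N − 1)²/N² for every case, so the two rank the subjects identically and converge as N grows. No such simple relationship should be assumed for linear mixed models, where the approximation is local.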

As discussed in Lesaffre and Verbeke (1998) and Verbeke and Molenberghs (2000), the change in the log-likelihoods between ${\hat{θ}}_{w}$ and $\hat{θ}$ reflects variability of $\hat{θ}$. If the $LD (w)$ or the $RLD (w)$ score is large, $l (θ)$ or $l_{R} (θ)$ is strongly curved at $\hat{θ}$, thereby suggesting that θ is estimated with high precision. If the statistic is small, θ is estimated with high variability. As a result, a graph of the $LD (w)$ or $RLD (w)$ approximations against the w vector can provide important information on the total local influence of each subject to identify influential cases. For statistical details concerning this graphical method, the reader is referred to Gruttola et al. (1987), Lesaffre and Verbeke (1998), and Verbeke and Molenberghs (2000, Chapter 11).

6.2.5. LMAX Statistic for Identification of Influential Observations

Based on the LD approximation, a more refined, innovative technique for identifying influential observations is the LMAX statistic, originally developed by Cook (1986) as a standardized diagnostic method in general regression modeling and later introduced and advanced into linear mixed models by Lesaffre and Verbeke (1998). Methodologically, this method maximizes approximation to the LD(w) statistic for the changes standardized to the unit length.

Cook (1986) suggests the use of a standardized LD statistic to more adequately measure influences of particular cases in the estimating process. Given the LD statistic, Cook first defines a symmetric matrix B, given by

$B \approx {\tilde{U}}^{'} I^{- 1} (\hat{θ}) \tilde{U},$

(6.41)

where $\tilde{U}$ is the matrix with rows containing the score residual vector ${\hat{\tilde{U}}}_{i}$. With B defined, Cook considers the direction of the N × 1 vector $\tilde{l}$ that maximizes ${\tilde{l}}^{'} B \tilde{l}$, where $\tilde{l}$ is standardized to have unit length. Because the M × M matrix $I^{- 1} (\hat{θ})$ is positive definite, the N × N symmetric matrix B is positive semidefinite, with rank no more than M. The statistic ${\tilde{l}}_{max}$ corresponds to the unit-length eigenvector of B that has the largest eigenvalue ${\ddot{λ}}_{\max}$. Therefore, ${\tilde{l}}_{max}^{'} B {\tilde{l}}_{max}$ maximizes ${\tilde{l}}^{'} B \tilde{l}$ and satisfies the equations

$B {\tilde{l}}_{max} = {\ddot{λ}}_{\max} {\tilde{l}}_{max} \quad \mathrm{and} \quad {\tilde{l}}_{max}^{'} {\tilde{l}}_{max} = 1,$

where ${\ddot{λ}}_{\max}$ is the largest eigenvalue of B, and ${\tilde{l}}_{max}$ is the eigenvector associated with ${\ddot{λ}}_{\max}$. The elements of ${\tilde{l}}_{max}$, standardized to unit length, measure sensitivity of the model fit to each observation. The absolute value of ${\tilde{l}}_{i}$, the ith element in ${\tilde{l}}_{max}$, is used as the LMAX score for subject i. Given the unit length, the expected value of the squared LMAX statistic for each case is 1/N, where N is the sample size; correspondingly, a squared value substantially greater than 1/N identifies a case with a strong influence on the overall fit of the regression model. If M = 1, the LMAX statistic is proportional to ${\tilde{U}}^{'}$ and

${\ddot{λ}}_{\max} = I {(\hat{θ})}^{-1} {∥\tilde{U}∥}^{2} .$

When M > 1, an advantage of examining elements of the LMAX statistic is that each case has a single summary measurement of influence.

As the LMAX score is a standardized statistic, it is useful to plot the LMAX scores against time points and/or the values of other covariates. The standardization of ${\tilde{l}}_{max}$ to unit length means that the squares of the elements in ${\tilde{l}}_{max}$ sum to unity, and therefore, the signs of the elements of ${\tilde{l}}_{max}$ are not of concern. Accordingly, for ${\tilde{l}}_{i}$, only the absolute value needs to be plotted. Observations that have the most undue influence on parameter estimates and the model fit can be readily identified by examining the relative magnitude of the elements in ${\tilde{l}}_{max}$. If none of the subjects has an undue impact on the model fit, the plot of the LMAX scores should approximate a horizontal line.
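The construction of the LMAX scores can be sketched with a small numerical example. The code below forms B from subject-level score vectors and an inverse information matrix, then extracts the leading unit-length eigenvector by power iteration. The score vectors and the 2 × 2 inverse information matrix are hypothetical numbers chosen for illustration (the scores sum to zero, as they would at a maximum likelihood estimate), and the function names are not taken from any software package.

```python
import math


def build_b(scores, info_inv):
    """B[i][j] = U_i' I^{-1} U_j for subject-level score vectors U_i."""
    m = len(info_inv)

    def quad(u, v):
        return sum(u[a] * info_inv[a][b] * v[b]
                   for a in range(m) for b in range(m))

    return [[quad(u, v) for v in scores] for u in scores]


def lmax(b, iterations=500):
    """Largest eigenvalue of B and the absolute values of its unit eigenvector."""
    n = len(b)
    # Start from the column of B with the largest diagonal entry; a vector of
    # ones would be mapped to zero because the scores sum to zero.
    k = max(range(n), key=lambda i: b[i][i])
    vec = [row[k] for row in b]
    for _ in range(iterations):
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
        vec = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = math.sqrt(sum(v * v for v in vec))
    return eigenvalue, [abs(v) / eigenvalue for v in vec]


# Hypothetical score vectors (N = 8 subjects, M = 2 parameters).
scores = [
    (-0.5, -0.3), (-0.4, -0.2), (3.0, 2.0), (-0.6, -0.4),
    (-0.3, -0.3), (-0.5, -0.2), (-0.4, -0.3), (-0.3, -0.3),
]
# Hypothetical inverse of the observed information matrix (positive definite).
info_inv = [[0.5, 0.1], [0.1, 0.4]]

eigenvalue, lmax_scores = lmax(build_b(scores, info_inv))
benchmark = 1.0 / len(scores)  # expected squared LMAX score under equal influence

for i, score in enumerate(lmax_scores):
    marker = "  <-- exceeds 1/N" if score ** 2 > benchmark else ""
    print(i, round(score, 4), round(score ** 2, 4), marker)
```

Because the squared scores sum to one, a single dominant subject necessarily pushes the remaining scores below the 1/N benchmark, which is why the plot is read in relative rather than absolute terms.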

There is a lack of analytic expressions for ${\tilde{l}}_{max}$. Given high proportional contributions of the graphically influential cases to the overall fit of a linear mixed model, it is necessary to check the exact LD for them. The researcher can perform the significance test to further analyze those influential cases in two steps (Liu, 2012). First, graphically identify the most influential observations by using the LMAX approximation (there are usually only a few distinctive outliers in linear models). Next, analytically examine the exact changes in the estimates of the fixed effects and the covariance parameters after deleting each of those few cases, using the classical LD criterion. Following this strategy, a decision can be made regarding whether those influential cases should be removed in fitting a linear mixed model.

6.3. Empirical Illustrations on Influence Diagnostics

In this empirical illustration, I provide two examples for performing influence diagnostics in longitudinal data analysis. I continue to use the two longitudinal datasets indicated in the preceding chapters: one on the effectiveness of acupuncture treatment on PTSD and one concerning the effect of marital status on an older person’s disability severity.

6.3.1. Influence Checks on Linear Mixed Model Concerning Effectiveness of Acupuncture Treatment on PCL Score

The first example is a diagnostic check of influence on the linear mixed model fitted in Chapter 5 (Section 5.5.1). As a follow-up analysis, the model specifications, covariate definitions, and the use of an appropriate covariance structure are all the same as described in Chapter 5. Specifically, the dependent variable is PCL_SUM, measured at four time points: 0 = baseline survey, 1 = 4-week follow-up, 2 = 8-week follow-up, and 3 = 12-week follow-up. The treatment factor is dichotomous with 1 = receiving acupuncture treatment, 0 = else. Among three covariance pattern models, CS, AR(1), and TOEP, CS yields the smallest values in all four information criteria, thereby being selected as the appropriate residual covariance pattern model for this analysis. The objective of this diagnostic analysis is to identify whether there are influential cases for both the fixed effects and the covariance parameters in the model fit. The following SAS program is created to perform the diagnostics by using an iterative approach.

SAS Program 6.1:. . . . . .

In SAS Program 6.1, the ODS Graphics is enabled to create diagnostic plots. As the analysis is focused on the identification of influential cases, the options to derive the fixed effects and the covariance parameters are not specified. The ML estimator is applied in the diagnostic checks because the REML estimator cannot be used to compare nested regression models for the mean (Fitzmaurice et al., 2004). The INFLUENCE option is specified in the MODEL statement telling SAS to compute influence statistics. The influence suboption EFFECT = ID specifies that the classification effect is subject. Additionally, the suboption ITER = 5 informs SAS that the fixed effects and the covariance parameters be updated by refitting the mixed model up to five iterations. It may be mentioned that for ITER = 0, SAS performs a noniterative influence analysis on the fixed effects only, assuming the covariance parameters or their ratios to be fixed.

Given SAS Program 6.1, SAS constructs a table of influence diagnostics at the subject level. This diagnostic summary table contains eleven diagnostic statistics, including Cook’s D for both the fixed effects and the covariance parameters, PRESS, MDFFITS for both the fixed effects and the covariance parameters, COVRATIO for both the fixed effects and the covariance parameters, COVTRACE for both the fixed effects and the covariance parameters, the likelihood distance (LD), and the root of mean square error (RMSE). While the diagnostic table includes a large amount of data, a portion of the statistics is selected for display, including Cook’s D, MDFFITS for the fixed effects, COVRATIO for the fixed effects, and the LD statistic. Table 6.1 presents those statistics for the first ten subjects.

Table 6.1

Selected Influence Statistics for 10 Subjects: DHCC Acupuncture Treatment Study (N = 10)

Subject’s ID	Influence Diagnostics for 10 Subjects
Subject’s ID	Number of Observations	Number of Iterations	Cook’s D	MDFFITS on β	COVRATIO on β	LD Statistic
132	4	2	0.0533	0.0524	1.0080	0.5396
167	4	1	0.0114	0.0108	1.2674	0.1007
193	4	2	0.0242	0.0236	1.1246	0.2139
220	4	2	0.0134	0.0132	1.2960	0.1690
227	4	2	0.0195	0.0188	1.1914	0.1657
230	2	1	0.0006	0.0006	1.1728	0.0158
235	4	1	0.0263	0.0252	1.1858	0.2130
245	4	2	0.0358	0.0351	1.0913	0.3399
271	1	1	0.0000	0.0000	1.0785	0.0057
276	4	2	0.0324	0.0315	1.1090	0.3003

Table 6.1 shows that each of the ten subjects has multiple observations except subject 271, and that eight of them have complete information on the PCL score. All subjects have relatively small values of Cook’s D and the MDFFITS statistic on the fixed effects. As the two statistics differ only in the specification of the covariance matrix for the fixed effects, the values of Cook’s D and MDFFITS are very close, with the MDFFITS value slightly smaller given the use of $var [{\hat{β}}_{(- i)}]$ rather than $var [\hat{β}]$ in computation. The COVRATIO and the LD statistics are also fairly stable, thereby suggesting that none of these ten subjects has a strong impact on the fit of the linear mixed model on PCL_SUM. From the results of other influence diagnostics and for other subjects, no distinctively influential cases can be found.

From SAS Program 6.1, a set of diagnostic plots is produced. First, a plot of the likelihood distance is presented in Fig. 6.1.

Figure 6.1 Likelihood Distance for Each Subject

As there are 55 subjects in the analysis, not all the ID numbers are displayed in Fig. 6.1. Based on the output table from SAS Program 6.1, influential cases can be easily identified. As judged from both Fig. 6.1 and the output table, subjects 791 and 814 have the strongest impact on the model fit. Given the large value of the overall model fit statistic, however, those relatively outstanding values, 1.1328 and 1.5064, do not by themselves support a firm conclusion that the two cases exert an actual influence on the fit of the linear mixed model.

Figure 6.2 displays Cook’s D and the COVRATIO statistics for both the fixed effects and the covariance parameters considered in the model.

Figure 6.2 Other Influence Statistics on PCL_SUM

In Fig. 6.2, the aforementioned two subjects also display high Cook’s D values, thereby suggesting a strong impact on the results of the model fit. These two cases are shown to be influential not only on the fixed effects but also on the estimates of the variance and covariance parameters. With respect to Cook’s D, subject 623 is identified as an additional influential case in the estimation of the fixed effects but not in the covariance parameters. The three subjects, 791, 814, and 623, are also linked to low precision both in the fixed effects and in the covariance parameters, given the low COVRATIO scores. These diagnostic statistics, however, display the relative contribution to the model fit but do not necessarily translate into an actual influence.

Given the relative importance of the two most influential cases to the overall fit of the linear mixed model, it may be necessary to check the exact LD for both. Accordingly, two additional linear mixed models are created with each deleting one of those influential cases. The SAS program for this step is not presented because programming of the two additional models follows the standard procedure, as previously exhibited, except for removal of a single subject. Table 6.2 summarizes the results.

Table 6.2

Maximum Likelihood Estimates and the Likelihood Displacement Statistic for Three Linear Mixed Models

Explanatory Variable	Parameter Estimate	Standard Error	Degrees of Freedom	t-value	p-value
	Linear Mixed Model With Full Data (-2 LL = 1411.3; p < 0.0001)
Intercept	55.444	2.509	53	22.10	<0.0001
Treatment	2.698	3.516	53	0.77	0.4463
Time 1	−3.963	2.056	128	−1.93	0.0561
Time 2	−2.977	2.168	128	−1.37	0.1721
Time 3	−9.233	2.137	128	−4.32	<0.0001
Treat × Time 1	−14.494	3.007	128	−4.82	<0.0001
Treat × Time 2	−15.803	3.232	128	−4.89	<0.0001
Treat × Time 3	−8.666	3.177	128	−2.73	0.0073
	Linear Mixed Model Deleting Subject 791 (-2 LL = 1371.3; p < 0.0001)
Intercept	56.231	2.550	52	22.05	<0.0001
Treatment	1.912	3.542	52	0.54	0.5916
Time 1	−4.885	2.007	125	−2.43	0.0164
Time 2	−4.359	2.122	125	−2.05	0.0420
Time 3	−10.613	2.090	125	−5.08	<0.0001
Treat × Time 1	−13.537	2.908	125	−4.65	<0.0001
Treat × Time 2	−14.369	3.129	125	−4.59	<0.0001
Treat × Time 3	−7.229	3.075	125	−2.35	0.0203
LD statistic	40.0 (df = 4; p < 0.01)
	Linear Mixed Model Deleting Subject 814 (-2 LL = 1368.6; p < 0.0001)
Intercept	54.923	2.535	52	21.67	<0.0001
Treatment	3.220	3.520	52	0.91	0.3646
Time 1	−4.231	1.992	125	−2.12	0.0356
Time 2	−3.338	2.106	125	−1.59	0.1154
Time 3	−7.959	2.074	125	−3.84	0.0002
Treat × Time 1	−14.189	2.886	125	−4.92	<0.0001
Treat × Time 2	−15.388	3.105	125	−4.96	<0.0001
Treat × Time 3	−9.880	3.052	125	−3.24	0.0015
LD statistic	42.7 (df = 4; p < 0.01)

In Table 6.2, the first linear mixed model uses full data, with results previously reported. The second model is fitted after removing subject 791 with four observations. The exact LD statistic, from the formula $2 l (\hat{θ}) - 2 l [{\hat{θ}}_{(- i)}]$, is statistically significant with four degrees of freedom at α = 0.05 (LD = 40.0, p < 0.01), thereby indicating that this subject makes a very strong statistical impact on the overall fit of the first linear mixed model. Likewise, the third model is fitted after deleting subject 814, with the exact LD statistic also statistically significant at the same criterion (LD = 42.7, p < 0.01). Furthermore, in both the second and the third models, the parameter estimates, including the regression coefficients and the standard errors, vary notably after removing each of those two cases. Obviously, those two cases make a genuine impact in fitting the linear mixed model. Given the considerable changes in the estimated regression coefficients as well as the exceptionally strong statistical contributions, I would recommend that the two influential cases be eliminated from the regression. The lack of statistical stability may also be linked to the small sample size for the data. For large samples, a strong relative influence of a few particular cases usually can be washed out by the effects of the vast majority of the normal observations in the estimating process. This argument will be further discussed in the next example where a large sample is analyzed.
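The arithmetic behind the two tests can be retraced with a short sketch. It takes each LD value as the difference between the −2 LL values reported in Table 6.2, which is how the tabulated LD statistics appear to be obtained, and evaluates the upper-tail chi-square probability with four degrees of freedom. For even degrees of freedom this probability has a closed form, so only the standard library is needed; the function name is hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variate with even df.

    For df = 2k, P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))


# -2 log-likelihood values as reported in Table 6.2.
minus2ll_full = 1411.3
minus2ll_deleted = {"791": 1371.3, "814": 1368.6}
df = 4  # degrees of freedom used in the text

results = {}
for subject, value in minus2ll_deleted.items():
    ld = minus2ll_full - value
    results[subject] = (ld, chi2_upper_tail_even_df(ld, df))

for subject, (ld, p_value) in results.items():
    print(f"subject {subject}: LD = {ld:.1f}, p = {p_value:.2e}")
```

Both LD values lie far beyond 9.49, the 0.05 critical value of the chi-square distribution with four degrees of freedom, which is consistent with the p < 0.01 entries in the table.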

Influence diagnostics can also be applied in the random coefficient model in which time is treated as a continuous variable. For example, I can graphically examine the fixed effects and the covariance parameters in the random coefficient model on the PCL score specified in Chapter 3. By adapting SAS Program 4.1, the code is displayed below.

SAS Program 6.2:. . . . . .

In SAS Program 6.2, some options are selected to request an influence analysis on the parameter estimates in the random coefficient model. In the PROC MIXED statement, the PLOT(ONLY) = INFLUENCESTATPANEL option requests panels of influence statistics, with the ONLY suboption in the parentheses informing SAS that only the requested plots be produced while the default plot in the package be suppressed. For iterative analysis, the panel displays Cook’s D and the COVRATIO statistic for the fixed effects and the covariance parameters. In the INFLUENCE option in the MODEL statement, the suboption EST is added, which requests SAS to write the updated parameter estimates to the ODS output dataset. The resulting plots from this program are displayed in Fig. 6.3.

Figure 6.3 Influence Statistics on PCL_SUM in the Random Coefficient Model

Generally, Fig. 6.3 demonstrates the same results as those from the influence analysis for the linear model using a residual covariance pattern model. Such a similarity is not surprising, as both types of linear mixed models usually yield very close estimates for the fixed effects and the covariance parameters.

If the researcher is interested in graphically checking the detailed influence statistics for each fixed-effect and covariance parameter in the random coefficient model, additional analyses are needed for a thorough assessment. This influence analysis can be achieved by making some minor modifications in SAS Program 6.2. Specifically, by replacing the option PLOT(ONLY) = INFLUENCESTATPANEL with the PLOT(ONLY) = INFLUENCEESTPLOT option, the following graphs are produced for the fixed effects and the 3 × 3 variance–covariance matrix of the random coefficients.

In the interpretation of Fig. 6.4, the focus is on the behaviors of the two most influential cases, subjects 791 and 814. Specifically, subject 791 has a strong impact on the model fit of the intercept and the slopes for CT, CT_2, TREAT, and the interactions between CT and TREAT and between CT_2 and TREAT. Removal of this subject, however, has no influence at all on the fixed effect of CT_3 and the interaction between CT_3 and TREAT. On the other hand, subject 814 has a very strong influence on the regression coefficient estimates of the intercept, CT_2, TREAT, and the interaction between CT_2 and TREAT. At the same time, subject 814 has some impact on the fixed effects of CT_3 and the interactions between CT and TREAT and between CT_3 and TREAT but has no influence at all on the fixed effects of CT. In this analysis, subject 998 is identified as an additional influential case since it has a strong impact on half of the fixed effects.

Figure 6.4 Fixed-Effects Deletion Estimates on PCL_SUM

Figure 6.5 displays the results of the deletion estimates for each of the covariance parameters on the PCL score.

Figure 6.5 Covariance Parameter Deletion Estimates on PCL_SUM

In Fig. 6.5, no deletion estimates are displayed for any case other than subjects 132, 227, and 998, because the estimation of G after removal of each of the other subjects does not yield a positive definite matrix. With a small sample size of a longitudinal dataset, the influence analysis on the covariance parameter deletion estimates is often inefficient in checking the model fit of the random coefficient model.

6.3.2. Influence Diagnostics on Linear Mixed Model Concerning Marital Status and Disability Severity Among Older Americans

In the second example of influence diagnostics, the AHEAD data are used (2000 randomly selected subjects and six waves: 1998, 2000, 2002, 2004, 2006, and 2008). In this analysis, the dependent variable remains the health-related difficulty in performing five activities of daily living (dress, bath/shower, eat, walk across a room, get in/out of bed), measured at six time points and named ADL_COUNT. Each activity is scored one if an older person has difficulty, personal help, or equipment help for health-related reasons and scored zero otherwise. Therefore, the value of the ADL count ranges from 0 to 5 at each time point. Marital status, named married, is the same dichotomous variable with 1 = currently married and 0 = else, specified as a time-varying variable. The controls are the three centered variables, Age_mean, Educ_mean, and Female_mean. Given the same set of covariates, the same linear mixed model described in Chapter 5 (Section 5.6.2) is applied. The ML estimator is used to perform an influence analysis on the fixed effects and the covariance parameters. The following is the SAS program for this analysis, an adapted version of SAS Program 5.4.

SAS Program 6.3:. . . . . .

SAS Program 6.3 is similar to SAS Program 6.1, with contextual modifications. In the PROC MIXED statement, the PLOTS(MAXPOINTS = 20000) option tells SAS to raise the maximum number of data points processed in producing plots from the default 5,000 to 20,000, given the large sample size of the AHEAD longitudinal data. As indicated in Chapter 5 (Section 5.6.2), the TOEP covariance pattern model is selected to address the covariance structure in the residuals.

Given the large sample size, SAS Program 6.3 yields a very large quantity of output, containing eleven diagnostic statistics for each of the 2000 subjects. The diagnostic table for this influence analysis takes the same format as that of the first example, in which a rich body of influence diagnostics, including tables and graphs, is presented. Given the quantity of output, the detailed diagnostic statistics and figures cannot all be reported. Instead, this influence analysis focuses on identifying the most influential cases rather than on interpreting the output tables and graphs. The exact LD scores are then computed by removing each of the most influential cases from the linear mixed model. Finally, the actual influence of those cases is examined by checking the changes in the parameter estimates after removing each of them from the regression.

After a careful examination of the results from SAS Program 6.3, two influential subjects are identified: subject 200520020 (HHIDPN number) and subject 207669010. Given their relatively high LD scores, it is necessary to check the exact LD for both cases. Analogous to the approach applied in the first example, two additional linear mixed models are fitted, each deleting one of those two influential cases. The SAS program for this step is not presented because the two mixed models follow the standard procedure, as previously displayed, except for the removal of a single subject. Table 6.3 summarizes the results for the fixed effects.

Table 6.3

Maximum Likelihood Estimates and the Likelihood Displacement Statistic for Three Linear Mixed Models on ADL Count: Older Americans

Explanatory Variable	Parameter Estimate	Standard Error	Degrees of Freedom	t-value	p-value
	Linear Mixed Model With Full Data (-2 LL = 20,016.3; p < 0.0001)
Intercept	0.673	0.044	1726	15.31	<0.0001
Married	0.029	0.059	300	0.49	0.6223
Time 1	0.257	0.036	4814	7.19	<0.0001
Time 2	0.513	0.052	4814	9.87	<0.0001
Time 3	0.579	0.058	4814	9.94	<0.0001
Time 4	0.768	0.061	4814	12.56	<0.0001
Time 5	0.932	0.060	4814	15.60	<0.0001
Treat × Time 1	−0.159	0.053	4814	−3.01	0.0026
Treat × Time 2	−0.252	0.078	4814	−3.24	0.0012
Treat × Time 3	−0.233	0.090	4814	−2.58	0.0099
Treat × Time 4	−0.060	0.098	4814	−0.61	0.5441
Treat × Time 5	−0.309	0.101	4814	−3.07	0.0021
	Linear Mixed Model Deleting Subject 200520020 (-2 LL = 20,000.5; p < 0.0001)
Intercept	0.673	0.044	1725	15.29	<0.0001
Married	0.028	0.059	300	0.48	0.6335
Time 1	0.257	0.036	4809	7.18	<0.0001
Time 2	0.513	0.052	4809	9.87	<0.0001
Time 3	0.580	0.058	4809	9.94	<0.0001
Time 4	0.768	0.061	4809	12.56	<0.0001
Time 5	0.932	0.060	4809	15.60	<0.0001
Treat × Time 1	−0.159	0.053	4809	−3.01	0.0026
Treat × Time 2	−0.250	0.078	4809	−3.21	0.0013
Treat × Time 3	−0.234	0.090	4809	−2.59	0.0096
Treat × Time 4	−0.059	0.098	4809	−0.60	0.5478
Treat × Time 5	−0.302	0.101	4809	−3.00	0.0027
LD statistic	15.8 (df = 2; p < 0.01)
	Linear Mixed Model Deleting Subject 207669010 (-2 LL = 19,985.0; p < 0.0001)
Intercept	0.675	0.044	1725	15.35	<0.0001
Married	0.026	0.059	300	0.44	0.6599
Time 1	0.255	0.036	4809	7.11	<0.0001
Time 2	0.513	0.052	4809	9.86	<0.0001
Time 3	0.578	0.058	4809	9.92	<0.0001
Time 4	0.770	0.061	4809	12.58	<0.0001
Time 5	0.923	0.060	4809	15.47	<0.0001
Treat × Time 1	−0.156	0.053	4809	−2.96	0.0031
Treat × Time 2	−0.252	0.078	4809	−3.23	0.0012
Treat × Time 3	−0.231	0.090	4809	−2.56	0.0105
Treat × Time 4	−0.061	0.098	4809	−0.62	0.5324
Treat × Time 5	−0.300	0.100	4809	−2.99	0.0028
LD statistic	31.3 (df = 6; p < 0.01)

In Table 6.3, the first linear mixed model uses the full data, while the second model is fitted after deleting the most influential case (subject ID: 200520020). The exact LD statistic, computed from the formula $2l(\hat{\theta}) - 2l[\hat{\theta}_{(-i)}]$, is statistically significant at α = 0.05 with two degrees of freedom, because this subject has two observations (LD = 15.8, p < 0.01). This indicates that the most influential case has a very strong statistical impact on the overall fit of the fixed effects. Likewise, the third mixed model is fitted after deleting the second most influential case (subject ID: 207669010), and its exact LD statistic is also statistically significant, with six degrees of freedom at the same criterion (LD = 31.3, p < 0.01). In both the second and the third models, however, the fixed-effects estimates, including the regression coefficients and the standard errors, barely change after removing each of those two cases. The fixed effects of the three control variables, not presented in Table 6.3, are also strikingly consistent after removing each of those two cases. The three sets of estimated regression coefficients are almost identical. Clearly, deleting those two cases has no genuine impact on the fixed-effects estimates, despite their strong statistical influence.
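The LD figures in Table 6.3 can be retraced with a few lines of arithmetic. The sketch below is illustrative only: it takes the difference between the reported −2 LL values, which reproduces the tabulated LD values, and refers each difference to a chi-square distribution with the degrees of freedom stated in the text. Whether the author's computation followed exactly this route is an assumption. The helper `chi2_sf_even_df` is a hypothetical name; it uses the closed-form upper tail of the chi-square distribution, which holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# -2 log-likelihood values and degrees of freedom as reported in Table 6.3
neg2ll_full = 20016.3
deleted_fits = {
    "200520020": {"neg2ll": 20000.5, "df": 2},
    "207669010": {"neg2ll": 19985.0, "df": 6},
}

results = {}
for subject, fit in deleted_fits.items():
    # Difference of the reported -2 LL values, rounded to the table's precision
    ld = round(neg2ll_full - fit["neg2ll"], 1)
    p_value = chi2_sf_even_df(ld, fit["df"])
    results[subject] = (ld, p_value)
    print(f"subject {subject}: LD = {ld:.1f}, df = {fit['df']}, p = {p_value:.2g}")
```

Both tail probabilities fall well below 0.01, in line with the p-values reported in the table.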

Next, the impacts of the two influential cases are further examined on the covariance parameter estimates. The following SAS program output displays three panels of covariance parameter estimates, derived from the aforementioned three linear mixed models, respectively.

SAS Program Output 6.1:

The covariance parameter estimates for full data

The covariance parameter estimates after deleting the first case

The covariance parameter estimates after deleting the second case

As shown in SAS Program Output 6.1, the three sets of covariance parameter estimates are very close; deleting those two statistically influential cases therefore does not affect the quality of the covariance parameter estimates either. Given the remarkable similarity of the estimated regression coefficients and covariance parameters, I do not see any reason to eliminate the two statistically influential cases from the linear mixed model, although they make exceptionally strong statistical contributions to the model fit. Indeed, in large samples, the strong relative influence of a few particular cases can easily be averaged out by the effects of other cases in the estimating process. Consequently, in such samples, deleting an influential observation hardly has any actual impact on the regression coefficient estimates or the covariance parameter estimates of a linear mixed model.
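One way to make the similarity of the fixed-effects estimates concrete is to express each change in units of the full-data standard error. The following sketch does this with the values printed in Table 6.3; the term labels are copied as they appear in the table, and scaling by the standard error is an illustrative choice rather than a method prescribed in this chapter.

```python
# Values copied from Table 6.3: (full-data estimate, full-data standard error,
# estimate without subject 200520020, estimate without subject 207669010)
table_6_3 = {
    "Intercept":      (0.673, 0.044, 0.673, 0.675),
    "Married":        (0.029, 0.059, 0.028, 0.026),
    "Time 1":         (0.257, 0.036, 0.257, 0.255),
    "Time 2":         (0.513, 0.052, 0.513, 0.513),
    "Time 3":         (0.579, 0.058, 0.580, 0.578),
    "Time 4":         (0.768, 0.061, 0.768, 0.770),
    "Time 5":         (0.932, 0.060, 0.932, 0.923),
    "Treat × Time 1": (-0.159, 0.053, -0.159, -0.156),
    "Treat × Time 2": (-0.252, 0.078, -0.250, -0.252),
    "Treat × Time 3": (-0.233, 0.090, -0.234, -0.231),
    "Treat × Time 4": (-0.060, 0.098, -0.059, -0.061),
    "Treat × Time 5": (-0.309, 0.101, -0.302, -0.300),
}

def shifts_in_se_units(table):
    """Largest absolute change in each estimate, divided by its full-data standard error."""
    return {
        term: max(abs(without_first - full), abs(without_second - full)) / se
        for term, (full, se, without_first, without_second) in table.items()
    }

shifts = shifts_in_se_units(table_6_3)
worst_term = max(shifts, key=shifts.get)
for term, shift in shifts.items():
    print(f"{term:<16s} {shift:.3f} SE")
```

In this tabulation the largest shift belongs to Time 5 and amounts to about 0.15 of a standard error, so no coefficient moves by enough to alter its interpretation.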

6.4. Summary

In this chapter, a number of statistical methods are described for performing residual diagnostics in linear mixed models. The chapter starts with an introduction to a variety of residual types that are widely used in longitudinal data analysis. The semivariogram approaches are then delineated to help the reader master the techniques for checking the behavior of residuals in linear mixed models. The semivariogram can be applied to check whether serial correlation in repeated measurements is present given the fixed effects and the specified random effects. Unlike the classical residual diagnostics, the semivariogram was originally proposed for analyzing spatial data and therefore has not seen many applications in the analysis of longitudinal data with equal time intervals. In longitudinal data analysis, plotting various residuals is effective in revealing inconsistencies between the observed data and the model-based predictions (Diggle et al., 2002).
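As a reminder of the quantity involved, the sample semivariogram averages half the squared differences between pairs of within-subject residuals at each time lag. The sketch below is a minimal illustration under simplifying assumptions: all subjects share the same measurement times, missing residuals are coded as None, and the residual values are hypothetical rather than drawn from either empirical example. The function name `sample_semivariogram` is also an assumption.

```python
def sample_semivariogram(residuals, times):
    """Mean of 0.5 * (r_ij - r_ik)**2 over all within-subject pairs, grouped by time lag.

    residuals: one list per subject, aligned with `times`; None marks a missing value
    times: measurement times shared by all subjects
    """
    sums, counts = {}, {}
    for subject_residuals in residuals:
        for j in range(len(times)):
            for k in range(j + 1, len(times)):
                r_j, r_k = subject_residuals[j], subject_residuals[k]
                if r_j is None or r_k is None:
                    continue  # skip pairs involving a missing measurement
                lag = abs(times[k] - times[j])
                sums[lag] = sums.get(lag, 0.0) + 0.5 * (r_j - r_k) ** 2
                counts[lag] = counts.get(lag, 0) + 1
    return {lag: sums[lag] / counts[lag] for lag in sorted(sums)}

# Hypothetical residuals for three subjects at three equally spaced waves
toy_residuals = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
    [-1.0, 1.0, 0.0],
]
toy_times = [0, 2, 4]

semivariogram = sample_semivariogram(toy_residuals, toy_times)
for lag, value in semivariogram.items():
    print(f"lag {lag}: {value:.4f}")
```

Under the usual reading, values that stay roughly constant across lags give little sign of serial correlation, whereas values that rise with the lag suggest that measurements closer in time are more alike than the fitted model assumes.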

Compared to residual diagnostics, the identification of cases that are influential on the goodness-of-fit of linear mixed models is more complex. In Section 6.2, I describe a variety of popular approaches in this area. For large samples, the actual impact of a few statistically influential cases is often found to be very limited, even though they contribute significantly to the model fit. Given this finding, the techniques for identifying influential cases are particularly important in the analysis of longitudinal data with a small sample size. In such analyses, a few influential cases can substantially modify the analytic results of a linear mixed model. I recommend the following steps for identifying influential cases in those studies. First, identify the most influential cases by using the influence diagnostic methods described in this chapter. Next, examine the exact change in the estimated regression coefficients and the covariance parameter estimates after deleting each of the identified cases. Following this strategy, a decision can be made regarding whether those cases should be removed in the application of a linear mixed model.
