Chapter 10
Beyond mixed modelling

10.1 Review of the uses of mixed models

Just as mixed modelling is an extension of the linear-modelling methods comprised in regression analysis and analysis of variance, mixed modelling itself can be further extended in several directions, to give even more versatile and realistic models. This chapter reviews the various contexts in which we have seen that mixed modelling is preferable to a simple regression or analysis of variance approach, then outlines the ways in which the concepts of mixed modelling can be developed further. Fuller accounts of such advanced uses of mixed modelling are given by Brown and Prescott (2006) and by Pinheiro and Bates (2000). Brown and Prescott demonstrate the use of the statistical software SAS to fit the models, whereas Pinheiro and Bates use the statistical computer language S (of which the software R is one implementation – see Section 1.11). Both books place much more emphasis on the underlying mathematical theory than is given here.

A mixed-model analysis provides a fuller interpretation of the data than a simple regression or analysis of variance approach, and permits wider inferences about the observations to be expected in future, in the following situations:

  • When one or more of the factors in a regression model is a random-effect term, and should therefore contribute to the standard errors (SEs) of estimates of effects of other terms. Examples include
    • the variation in house prices among towns, which contributed to the SE of the estimated effect of latitude (Chapter 1);
    • the variation in bone mineral density among patients sampled at different hospitals, which contributed to the SEs of the estimated effects of gender, age, height and weight (Sections 7.2–7.4);
    • a meta-analysis of treatment effects in several studies, when study × treatment interaction effects are present (Sections 8.1–8.7).
  • Where variance components are of intrinsic interest, for example, in the investigation of the sources of variation in the strength of a chemical paste (delivery, cask and sample – Sections 3.2–3.7). The relative magnitude of the different sources of variation can be estimated, with a view to their control by replication in subsequent investigation.
  • When candidates from an exchangeable set of entities are to be identified, for example, in the identification of high-yielding breeding lines among the progeny of a cross between two barley varieties (Sections 3.8–3.17). The Best Linear Unbiased Predictor (BLUP), obtained from the mixed-model analysis, provides a more realistic – and more conservative – prediction of the future performance of the selected candidates than is given by the simple mean performance of each candidate (or by the closely related Best Linear Unbiased Estimate (BLUE)) (Chapter 5; Sections 8.8–8.10).
  • When it has not been possible to achieve exact balance in the design of an experiment, for example:
    • if one block has to be omitted from a balanced incomplete block design (Sections 9.1–9.4);
    • in an alpha or alphalpha design (Sections 9.5–9.8).

10.2 The generalized linear mixed model (GLMM): Fitting a logistic (sigmoidal) curve to proportions of observations

All the models that we have considered so far are linear: that is, they can be expressed in the form

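$$ y_k = \beta_0 + \sum_{i=1}^{p} \beta_i x_{ik} + \sum_{j=1}^{q} \upsilon_j z_{jk} + \varepsilon_k \qquad (10.1) $$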

where

  1. yk = the kth observation of the response variable Y,
  2. xik = the kth observation of the ith explanatory variable in the fixed-effect model, Xi,
  3. p = the number of explanatory variables in the fixed-effect model,
  4. zjk = the kth observation of the jth explanatory variable in the random-effect model, Zj,
  5. q = the number of explanatory variables in the random-effect model,
  6. εk = the kth value of the random variable E, which represents the residual variation in Y,

and β0, β1, …, βp and υ1, …, υq are parameters to be estimated.

When one or more of the explanatory variables are factors, some ingenuity is needed to express the model in this form. For example, the model used in Chapter 1 to relate house prices to latitude and town can be expressed in this form by setting

Y = log(house price)
X1 = latitude
Z1 = 1 for observations from Bradford, 0 otherwise
Z2 = 1 for observations from Buxton, 0 otherwise
Z11 = 1 for observations from Witney, 0 otherwise.

The variables Z1 to Z11, with their arbitrary values that indicate the category (the town) to which each observation belongs, are known as dummy variables. When the response variable and the explanatory variables are specified in this way, we find that

β0 = intercept
β1 = effect of latitude
υ1 = effect of Bradford
υ2 = effect of Buxton
υ11 = effect of Witney.

The estimates of υ1 to υ11 are also the estimates of the parameters τ1 to τ11, the deviations of the town means from the regression line relating log(house price) to latitude, defined in Section 1.4. The decision to treat this model as a mixed model is equivalent to a decision to treat these parameters as values of a random variable, as described in Section 1.6.
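The construction of dummy variables can be sketched in a few lines of Python. This is only an illustration: the helper name dummy_variables is hypothetical, just the three towns named in the text are used, and the list of observations is invented for the example rather than taken from the Chapter 1 data.

```python
def dummy_variables(labels, categories):
    """Return a dict mapping each category to its 0/1 indicator column."""
    return {
        category: [1 if label == category else 0 for label in labels]
        for category in categories
    }

# Three of the 11 towns named in the text. The observation list is invented
# purely for illustration and is not the Chapter 1 data.
towns = ["Bradford", "Buxton", "Witney"]
town_of_observation = ["Bradford", "Bradford", "Buxton", "Witney", "Buxton"]

Z = dummy_variables(town_of_observation, towns)
for town in towns:
    print(town, Z[town])

# Each observation belongs to exactly one town, so its indicators sum to 1.
row_sums = [sum(Z[town][k] for town in towns)
            for k in range(len(town_of_observation))]
print(row_sums)
```

Each column plays the role of one of the variables Z1 to Z11: it takes the value 1 for observations from its own town and 0 otherwise.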

In addition to being linear, all the models considered so far have had residuals that can reasonably be assumed to be normally distributed. There are many other regression models, relating a response variable to one or more explanatory variables, that do not have these properties. As an example of a situation in which neither a linear model nor a normal distribution of the residuals is adequate, we can consider the results of an experiment to determine the toxicity of ammonia to a species of beetle, Tribolium confusum (Finney, 1971, Section 9.1, p. 177). The experiment was performed in two batches, each comprising a series of samples. These batches will be represented by dummy variables, Z1 and Z2, and the explanatory variable X is log10(concentration of ammonia) applied to each sample. In any sample, the number of dead beetles, R, must be an integer between 0 and the number of beetles in the sample, N. The data are shown in the spreadsheet in Table 10.1. (Data reproduced by kind permission of Cambridge University Press.)

Table 10.1 Mortality of the beetle Tribolium confusum at different concentrations of ammonia.

X = log10(concentration of ammonia), N = number of beetles in the sample, and R = number of dead beetles in the sample.

A B C D
1 batch! X N R
2 1 0.72 29 2
3 2 0.72 29 1
4 1 0.80 30 7
5 2 0.80 31 12
6 1 0.87 31 12
7 2 0.87 32 4
8 1 0.93 28 19
9 2 0.93 31 18
10 1 0.98 26 24
11 2 0.98 31 25
12 1 1.02 27 27
13 2 1.02 28 27
14 1 1.07 26 26
15 2 1.07 31 29
16 1 1.10 30 30
17 2 1.10 31 30

Source: Data reproduced by kind permission of Cambridge University Press.

If the death of each individual is independent of that of every other individual, then the random variable R has a binomial distribution, the precise shape of which is determined by the value of N and by the probability that an individual beetle dies, designated by π. This statement can be written in symbolic shorthand as

$$ R \sim \text{Binomial}(N, \pi) \qquad (10.2) $$

The value of π may depend on the value of X under consideration and on the batch, that is, π may be a function of X and ‘batch’. This value can never be known, though it can be estimated from the data.

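A brief digression on the binomial distribution is required here. This distribution is defined by the statement

$$ P(R = r) = \frac{N!}{r!\,(N-r)!}\, \pi^{r} (1-\pi)^{N-r} \qquad (10.3) $$

where

$$ r = 0, 1, \ldots, N. $$

For example, if

$$ \pi = 0.3 $$

then in a sample of 30 beetles, the probability that 8 die is given by

$$ P(R = 8) = \frac{30!}{8!\,22!} \times 0.3^{8} \times 0.7^{22} = 0.1501 $$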

Substituting each possible value of r from 0 to 30 into Equation 10.3, we obtain the distribution illustrated in Figure 10.1. For a fuller account of the binomial distribution, and why it occurs in such situations, see, for example, Snedecor and Cochran (1989, Sections 7.1–7.5, pp. 107–117) or Bulmer (1979, Chapter 6, pp. 81–90).
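The calculation for a binomial distribution with N = 30 and π = 0.3, the parameter values shown in Figure 10.1, can be checked with a short Python sketch. The function name binomial_pmf is an illustrative choice, and the code simply evaluates the standard binomial probability for each possible number of deaths.

```python
from math import comb

def binomial_pmf(r, n, pi):
    """Probability of exactly r responses out of n, each with probability pi."""
    return comb(n, r) * pi**r * (1 - pi)**(n - r)

N = 30      # beetles in the sample
PI = 0.3    # probability that an individual beetle dies

# Probability that exactly 8 of the 30 beetles die.
p8 = binomial_pmf(8, N, PI)
print(round(p8, 4))

# The whole distribution, r = 0, 1, ..., 30, as plotted in Figure 10.1.
distribution = [binomial_pmf(r, N, PI) for r in range(N + 1)]
print(round(sum(distribution), 6))   # the probabilities sum to 1
```

The most probable values of r lie close to Nπ = 9, which is where the distribution in Figure 10.1 has its peak.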


Figure 10.1 The binomial distribution with parameters π = 0.3 and N = 30.

In a system of this kind, the relationship between the combined value of the explanatory variables (‘batch’ and X in this case) and π is often sigmoidal (S-shaped – see Figure 10.2). At one extreme of the range of the explanatory variables, the probability of the event under consideration (death in this case) is close to zero: at the other extreme it is close to one. There are two commonly used functions that specify a relationship of this form. One of these, the integral of the normal distribution, is the basis of a method called probit analysis (Finney, 1971). This method of fitting a sigmoid curve probably has the clearer conceptual basis: it is based on the assumption that an underlying variable, in this case the tolerance of the beetles to the toxin, is normally distributed, and that at any given dose, all individuals up to a certain level of tolerance are killed. The alternative is the logistic function, which we will use here as it has the advantage of being rather easier to express algebraically: in the present case, it is

$$ p_{ij} = \frac{r_{ij}}{n_{ij}} = \frac{1}{1 + \exp\{-[\beta_0 + \beta_1 (x_j - \bar{x}) + \upsilon_1 z_{1i} + \upsilon_2 z_{2i} + \upsilon_3 z_{1i}(x_j - \bar{x}) + \upsilon_4 z_{2i}(x_j - \bar{x})]\}} + \varepsilon_{ij} \qquad (10.4) $$

where

  1. rij = the value of R at the jth level of X in the ith batch, that is, the number of dead beetles in this sample (the ijth sample),
  2. nij = the value of N, that is, the total number of beetles, in the ijth sample,
  3. xj = the jth level of X,
  4. z1i = the value of the dummy variable Z1 in the ith batch, indicating whether each observation was obtained from Batch 1,
  5. z2i = the value of the dummy variable Z2 in the ith batch, indicating whether each observation was obtained from Batch 2,
  6. εij = the value of the random variable E in the ijth sample, which represents the residual effect on pij,
  7. x̄ = the mean value of X over all samples,

and the βs and υs are parameters to be estimated, namely

β0 = constant
β1 = effect of X
υ1 = effect of Batch 1
υ2 = effect of Batch 2
υ3 = effect of interaction between X and Batch 1
υ4 = effect of interaction between X and Batch 2.

The relationship between the batches and the dummy variables is as shown in Table 10.2. The value pij gives an estimate of π based on the information from the ijth sample, applicable to the batch and the level of X in question. However, the fitting of the logistic function will give estimates based on all the data, for any combination of levels of ‘batch’ and X.
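As a small illustration, the sample proportions pij = rij/nij and the mean of X over all samples can be computed directly from Table 10.1. The Python sketch below transcribes the table; the variable names are illustrative choices, not part of the original analysis.

```python
# (batch, X, N, R) for the 16 samples, transcribed from Table 10.1.
samples = [
    (1, 0.72, 29, 2),  (2, 0.72, 29, 1),
    (1, 0.80, 30, 7),  (2, 0.80, 31, 12),
    (1, 0.87, 31, 12), (2, 0.87, 32, 4),
    (1, 0.93, 28, 19), (2, 0.93, 31, 18),
    (1, 0.98, 26, 24), (2, 0.98, 31, 25),
    (1, 1.02, 27, 27), (2, 1.02, 28, 27),
    (1, 1.07, 26, 26), (2, 1.07, 31, 29),
    (1, 1.10, 30, 30), (2, 1.10, 31, 30),
]

# Observed proportion of dead beetles in each sample.
proportions = [(batch, x, r / n) for batch, x, n, r in samples]
for batch, x, p in proportions:
    print(f"batch {batch}  X = {x:.2f}  p = {p:.3f}")

# Mean value of X over all samples.
x_bar = sum(x for _, x, _, _ in samples) / len(samples)
print(round(x_bar, 5))
```

The proportions rise from near 0 at the lowest concentration to near 1 at the highest, which is the pattern that the logistic function is intended to describe.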


Figure 10.2 A sigmoidal relationship between an explanatory variable (or variables) and the probability (π) of a response.

Table 10.2 The relationship between batches and dummy variables in the model fitted to the data on beetle mortality.

Batch Z1 Z2
1 z11 = 1 z21 = 0
2 z12 = 0 z22 = 1

The relationship between Y and the explanatory variables in this equation is not linear. However, if we ignore the residual term εij for the time being, we can transform the equation to the familiar linear-model form, as follows:

$$ \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_0 + \beta_1 (x_j - \bar{x}) + \upsilon_1 z_{1i} + \upsilon_2 z_{2i} + \upsilon_3 z_{1i}(x_j - \bar{x}) + \upsilon_4 z_{2i}(x_j - \bar{x}) \qquad (10.5) $$

The function log(p/(1 − p)) is known as the logit function. Models of this type, which can be expressed in linear form by dropping the residual term and applying a suitable transformation to the response variable, are known as generalized linear. Every generalized linear model is characterized by a probability distribution and a link function: in the present case, the binomial distribution and the logit function. For each probability distribution, there is a particular link function known as the canonical link which has special mathematical properties, including the fact that when the residual term is the only random-effect term in the model, it always gives a unique set of parameter estimates, the sufficient statistic (McCullagh and Nelder, 1989, Sections 2.2.2–2.2.4, pp. 28–32). In the case of the binomial distribution, the logit function is the canonical link – another reason for preferring it to the probit function, the corresponding function used in probit analysis.
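The relationship between the logit link and the logistic curve can be illustrated with a minimal Python sketch. The function names logit and inverse_logit are illustrative; the code shows only that the logit transformation maps a probability to the linear-predictor scale and that the logistic function maps it back.

```python
from math import exp, log

def logit(p):
    """Map a probability (strictly between 0 and 1) to the linear-predictor scale."""
    return log(p / (1 - p))

def inverse_logit(eta):
    """Logistic function: map a linear predictor back to a probability."""
    return 1 / (1 + exp(-eta))

# A probability of 0.5 corresponds to a linear predictor of zero.
print(logit(0.5))

# Applying the link and then its inverse recovers the original probability.
for p in (0.05, 0.3, 0.5, 0.9):
    eta = logit(p)
    print(p, round(eta, 4), round(inverse_logit(eta), 4))
```

Because the logit is unbounded while the probability is confined to the interval from 0 to 1, a straight line on the logit scale corresponds to a sigmoidal curve on the probability scale.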

In the notation of Wilkinson and Rogers (1973; see Section 2.2), the model specified is

X + batch + X.batch

It is reasonable to specify ‘X’ as a fixed-effect term, and ‘batch’ as a random-effect term, from which it follows that ‘X.batch’ is also a random-effect term (see the guidelines in Section 6.3). We have then specified a generalized linear mixed model (GLMM).

This GLMM can be fitted to the data by the following GenStat statements:

IMPORT 'IMM edn 2\Ch 10\ammonia Tribolium.xlsx'
GLMM [PRINT = model, monitoring, components, vcovariance, effects; \
 DISTRIBUTION = binomial; LINK = logit; DISPERSION = *; \
 RANDOM = batch + X.batch; FIXED = X] Y = R; NBINOMIAL = N
VDISPLAY [PRINT = effects; PTERMS = batch/X; PSE = estimates]

In the GLMM statement, the DISTRIBUTION option specifies that the response variate ‘R’ follows the general form of the binomial distribution, though its variance may be greater or less than that of a binomial variable, as will be specified in a moment. The LINK option specifies that the logit function has been chosen as the link function. If the assumptions that each death is independent (conditional on π) and that R follows the binomial distribution in every respect were correct, it would follow that

$$ \mathrm{var}(R) = N\pi(1 - \pi) \qquad (10.6) $$

and there would be no need to estimate the residual variance from the data. We could indicate that we were willing to make this assumption by setting the option ‘DISPERSION = 1’: instead, by setting this option to a missing value (‘*’) we indicate that the residual variance is to be estimated. The options RANDOM and FIXED specify the random-effect and fixed-effect model terms, respectively. It is possible that there is a correlation between the effects of ‘batch’ and those of ‘X.batch’, but with only two batches there is not enough information to estimate this and it cannot be included in the model. The parameter Y specifies the response variate, and the parameter NBINOMIAL specifies the variate that holds the number of observations in each sample, that is, the maximum value that each value of the response variate might take.

The output of the GLMM and VDISPLAY statements is as follows:

The output first specifies the fitting method used and the model fitted. Next comes some monitoring information, indicating the estimated values of certain model parameters at successive iterations of the model-fitting process, leading to convergence, that is, successful fitting (see Section 10.5). Next come estimates of the variance components for the random-effect terms, with their SEs. The estimate for ‘batch’ is smaller than its SE, and that for ‘batch.X’ is zero, indicating that these terms could probably be dropped from the model. The estimate of the residual variance is presented in terms of the dispersion, which is given by the ratio

equation

In the present case, the dispersion is substantially larger than 1, indicating that there is more residual variation from sample to sample than would be expected if the distribution were truly binomial. Next comes the matrix of covariances among these variance estimates. The values on the diagonal are simply the squares of the SEs above – the variances of the variance estimates. For example,

equation

The off-diagonal values indicate the extent to which the estimate of one variance component is associated with that of another.

Next come the estimated effects of each model term. These can be substituted into the original model (Equation (10.4)), together with the value

$$ \bar{x} = 0.93625 $$

to provide an estimate of π, the true probability that an insect dies, for any combination of batch and X, thus

equation

This function is displayed over the range of the data, together with the observed values, in Figure 10.3.


Figure 10.3 Fitted curves from a logistic model relating the proportion of T. confusum individuals killed to the concentration of ammonia applied.

fitted, Batch 1; fitted, Batch 2; × observed, Batch 1; ○ observed, Batch 2.

Overall, the curves give a reasonable fit to the data, permitting a realistic estimate of the proportion of insects that would be killed by a particular concentration of ammonia. The effect of batch corresponds to the horizontal displacement between the two curves. This is small relative to the scatter of the data, confirming that the batch has little if any effect. The ‘X.batch’ interaction effect corresponds to the difference in average slope between the two curves, which is much too slight to be detected by eye. The corresponding variance component estimate is zero, and it might be preferable to drop these terms from the model, but they are included in the substitution into Equation (10.4) in order to show how the equation is constructed. The fact that the dispersion is larger than 1 indicates that the scatter of the observations about the curves is wider than would be expected if the deaths of individuals within each sample were independent: that is, there is evidence of some heterogeneity in the conditions of each sample, even after allowing for the effects of ammonia and batch. Because the ‘X.batch’ interaction effect is negligible, the decision to use a mixed model has had little effect on the precision of the parameter estimates in this case. However, if this effect were substantial, it would be important to take it into account in order to obtain realistic values for the SEs of c10-math-0036, c10-math-0037 and c10-math-0038.

10.3 Use of R to fit the logistic curve

The following commands import the data into R, convert batch to a factor and fit the model in Equation (10.4) to the data:

rm(list = ls())
# Read the beetle mortality data (columns batch, X, N and R)
ammonia.tribolium <- read.table(
 "IMM edn 2\\Ch 10\\ammonia tribolium.txt",
 header = TRUE)
attach(ammonia.tribolium)
# Treat batch as a factor so that it can define random effects
fbatch <- factor(batch)
library(lme4)
# Two-column response: numbers dying (R) and surviving (N - R)
responses <- cbind(R, N - R)
ammonia.tribolium.glmer <- glmer(responses ~ X + (1|fbatch) +
 (X|fbatch), family = binomial(link = "logit"),
 data = ammonia.tribolium)
summary(ammonia.tribolium.glmer)
coef(ammonia.tribolium.glmer)

The function cbind() combines the number of individuals that respond (R) in each sample and the number that do not respond (N − R) into a single structure, ‘responses’. The model specified in the glmer() function indicates that

  • ‘responses’ holds the information on the responses;
  • ‘X’ is a fixed-effect term;
  • the main effect of batch is a random-effect term, indicated by (1|fbatch);
  • the variation in the response to ‘X’ among batches is a random-effect term, indicated by (X|fbatch).

The family option in this function indicates the distribution of the response variable and the link function. The output of the function summary() is as follows:

The estimates of effects are generally different from those obtained from GenStat, because GenStat fits the model centred on the mean value of X, whereas R fits it without centring. One consequence of this is that it is now the main effect of batch that has a variance component estimate of zero, while the X.batch interaction has a positive variance component estimate. The function coef() reports the intercept and the coefficient of X for the two levels of fbatch directly, rather than as departures from the fixed-effect estimates as specified in Equation (10.4). Adapting this equation accordingly, and substituting the coefficient estimates, the probability that an insect dies is estimated by

equation

The curves produced by this function do not agree closely with those produced by GenStat, but agree closely with those produced by SAS, and fit the data fairly well. No straightforward method of obtaining an estimate of the dispersion parameter from R is known to the author.

10.4 Use of SAS to fit the logistic curve

The following commands import the data into SAS and fit the model in Equation (10.4) to the data:

PROC IMPORT OUT = ammonia DBMS = EXCELCS REPLACE
   DATAFILE = "&pathname.IMM edn 2\Ch 10\ammonia Tribolium.xlsx";
 SHEET = "for SAS";
RUN;
ODS RTF;
PROC GLIMMIX DATA = ammonia METHOD = RSPL ASYCOV;
   CLASS batch;
   MODEL R/N = X/
      DISTRIBUTION = BINOMIAL LINK = LOGIT SOLUTION;
   RANDOM batch batch * X/ SOLUTION;
   RANDOM _RESIDUAL_;
RUN;
ODS RTF CLOSE;

The logistic model is fitted by the procedure GLIMMIX. The option NOBOUND of this procedure is not set, so variance component estimates are constrained to be non-negative: without this constraint, the model-fitting process fails to converge. The MODEL statement indicates that the response variable is the number of individuals responding, R, out of a total N, and gives the fixed-effect model. In this statement, the DISTRIBUTION option specifies the distribution of the response variable, the LINK option specifies the link function and the SOLUTION option specifies that the regression coefficients for the fixed-effect terms are to be displayed in the output. The first RANDOM statement indicates the terms in the random-effects model. The second RANDOM statement indicates that the dispersion parameter, related to the residual variance, is to be estimated, not fixed at 1.

Part of the output from PROC GLIMMIX is as follows:

Covariance parameter estimates
Cov Parm        Estimate   Standard error
batch           0
X*batch         0.09689    0.2797
Residual (VC)   2.2192     0.8704

Asymptotic covariance matrix of covariance parameter estimates
Cov Parm        CovP1   CovP2      CovP3
batch
X*batch                 0.07824    −0.03379
Residual (VC)           −0.03379   0.7576

Solutions for fixed effects
Effect      Estimate   Standard error   DF   t value   Pr > |t|
Intercept   −15.8489   2.0920           1    −7.58     0.0835
X           17.8700    2.3105           1    7.73      0.0819

Type III tests of fixed effects
Effect   Num DF   Den DF   F value   Pr > F
X        1        1        59.82     0.0819

Solution for random effects
Effect    batch   Estimate   Std Err Pred   DF   t value   Pr > |t|
batch     1       0
batch     2       0
X*batch   1       0.1548     0.2700         12   0.57      0.5771
X*batch   2       −0.1548    0.2700         12   −0.57     0.5771

The estimates of effects are generally different from those obtained from GenStat, because GenStat fits the model centred on the mean value of X, whereas SAS fits it without centring. One consequence of this is that it is now the main effect of batch that has a variance component estimate of zero, while the X.batch interaction has a positive variance component estimate. PROC GLIMMIX reports the effects of ‘batch’ and ‘X*batch’ for the two levels of batch as departures from the fixed-effect estimates, as specified in Equation (10.4). Adapting Equation (10.4) to take account of the absence of centring, and substituting the coefficient estimates into the model, the probability that an insect dies is estimated by

$$ \hat{\pi} = \frac{1}{1 + \exp\{-[-15.8489 + 17.8700\,x + 0.1548\,z_1 x - 0.1548\,z_2 x]\}} $$

The curves produced by this function do not agree closely with those produced by GenStat, but agree closely with those produced by R and fit the data fairly well.
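The fitted curves implied by the PROC GLIMMIX output can be evaluated with a short Python sketch. It assumes that the linear predictor is the uncentred fixed-effect line plus the batch-specific departures listed under ‘Solution for random effects’; the function and constant names are illustrative choices.

```python
from math import exp

# Estimates copied from the PROC GLIMMIX output above.
INTERCEPT = -15.8489
SLOPE = 17.8700
BATCH_EFFECT = {1: 0.0, 2: 0.0}             # 'batch' solutions
SLOPE_DEPARTURE = {1: 0.1548, 2: -0.1548}   # 'X*batch' solutions

def fitted_probability(x, batch):
    """Estimated probability of death at log10(concentration) x in a given batch.

    Assumes an uncentred linear predictor: fixed effects plus the
    batch-specific departures in intercept and slope.
    """
    eta = (INTERCEPT + BATCH_EFFECT[batch]
           + (SLOPE + SLOPE_DEPARTURE[batch]) * x)
    return 1 / (1 + exp(-eta))

# Evaluate both curves at the eight levels of X in Table 10.1.
x_levels = [0.72, 0.80, 0.87, 0.93, 0.98, 1.02, 1.07, 1.10]
for x in x_levels:
    print(f"X = {x:.2f}  batch 1: {fitted_probability(x, 1):.3f}"
          f"  batch 2: {fitted_probability(x, 2):.3f}")
```

The two curves differ only slightly, in keeping with the small estimated variance component for the X.batch interaction.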

10.5 Fitting a GLMM to a contingency table: Trouble-shooting when the mixed modelling process fails

There are several other types of data that cannot be realistically represented by an ordinary linear model with normally distributed residual variation, but which do fulfil the criteria for fitting a GLMM, namely

  • they can be represented by a model that can be converted to the general-linear form by omitting the residual term and applying an appropriate transformation;
  • an appropriate probability distribution can be specified for the response variable.

Hence the use of GLMMs permits a wide extension to the range of situations in which the concepts of mixed modelling can be applied. Another important case in which generalized linear models can be used is the analysis of contingency tables. These are data sets in which events of a particular type are counted and are classified by factors that indicate the combination of circumstances, or contingency, in which each event occurred. This type of data set is illustrated by an example concerning the frequency of damage caused by waves to the forward sections of cargo-carrying ships (McCullagh and Nelder, 1989, Section 6.3.2, pp. 204–208). Each occurrence of damage is classified by the type of ship to which it occurred (A to E), the year of construction of the ship and its period of operation. For each category defined by these three factors – that is, each contingency – the number of incidents of damage observed was recorded, along with the number of months of service over which observations were available. The first and last few rows of the data are shown in the spreadsheet in Table 10.3. The null hypothesis to be tested (H0) is that none of the factors influenced the frequency of incidents, and in this case the number of incidents in each category is expected to be proportional to the number of months of service. (Data reproduced by kind permission of Chapman and Hall.)

Table 10.3 Number of incidents of damage to ships, classified by ship type, year of construction and period of operation.

A B C D E
1 type! constrctn_y! operatn_period! service_months damage_incidents
2 A 1960–64 1960–74 127 0
3 A 1960–64 1975–79 63 0
4 A 1965–69 1960–74 1095 3
5 A 1965–69 1975–79 1095 4
6 A 1970–74 1960–74 1512 6
7 A 1970–74 1975–79 3353 18
8 A 1975–79 1960–74
9 A 1975–79 1975–79 2244 11
10 B 1960–64 1960–74 44882 39
11 B 1960–64 1975–79 17176 29
· ·
· ·
· ·
34 E 1960–64 1960–74 45 0
35 E 1960–64 1975–79 0 0
36 E 1965–69 1960–74 789 7
37 E 1965–69 1975–79 437 7
38 E 1970–74 1960–74 1157 5
39 E 1970–74 1975–79 2161 12
40 E 1975–79 1960–74
41 E 1975–79 1975–79 542 1

Source: Data reproduced by kind permission of Chapman and Hall.

We can represent the number of incidents of damage to ships of the ith type, constructed during the jth range of years, during the kth period of operation (the ijkth category) by the symbol rijk. If each damage incident is independent of all the others, then it can be shown that rijk is an observation of a random variable Rijk which has a Poisson distribution, the mean of which is given by Nπijk, where

  1. N = the expected total number of incidents of damage given the total number of months of observation and their distribution over the categories

and

  1. πijk = the true probability that an individual damage incident falls in the ijkth category.

This statement can be written in symbolic shorthand as

$$ R_{ijk} \sim \text{Poisson}(N\pi_{ijk}) \qquad (10.7) $$

Again a brief digression is required, this time on the Poisson distribution. This is defined by the statement that if

$$ R \sim \text{Poisson}(\mu) \qquad (10.8) $$

where

  1. μ = the mean number of incidents in an observation period,

then

$$ P(R = r) = \frac{e^{-\mu} \mu^{r}}{r!}, \quad r = 0, 1, 2, \ldots \qquad (10.9) $$

For example, if the expected number of incidents of damage in a particular category (the mean number over an infinite hypothetical population of data sets similar to the present data set) is 8, then the observed number of incidents will be distributed as shown in Figure 10.4. For a fuller account of the Poisson distribution, and why it occurs in this context, see, for example, Snedecor and Cochran (1989, Section 7.14, pp. 130–133) or Bulmer (1979, pp. 90–97).
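The Poisson probabilities for a mean of 8, the case shown in Figure 10.4, can be computed with a short Python sketch. The function name poisson_pmf is an illustrative choice, and the range of r is truncated at 60 purely for convenience, since the probabilities beyond that point are negligible.

```python
from math import exp, factorial

def poisson_pmf(r, mu):
    """Probability of observing exactly r incidents when the mean is mu."""
    return exp(-mu) * mu**r / factorial(r)

MU = 8  # expected number of incidents in the category

# Probabilities of 0, 1, ..., 60 incidents.
probabilities = [poisson_pmf(r, MU) for r in range(61)]
for r in range(0, 17, 2):
    print(r, round(probabilities[r], 4))

# The mean and the variance of the distribution are both equal to mu.
mean = sum(r * p for r, p in enumerate(probabilities))
variance = sum((r - mean) ** 2 * p for r, p in enumerate(probabilities))
print(round(mean, 4), round(variance, 4))
```

The equality of the mean and the variance is the property that allows the dispersion to be fixed at 1 when the Poisson assumption is accepted.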


Figure 10.4 The Poisson distribution with parameter (mean value) μ = 8.

In the present case, if H0 is true, then

$$ \pi_{ijk} = a_{ijk} \qquad (10.10) $$

where

  1. aijk = the proportion of months of service that fall in the ijkth category.

In this case, the expected number of incidents in the ijkth category is Naijk, and

$$ r_{ijk} = N a_{ijk} + \varepsilon_{ijk} \qquad (10.11) $$

where

  1. εijk = the value of the random variable E in the ijkth category, which represents the residual effect on rijk.

We can add terms to this model to represent the possibility that the factors type of ship, year of construction and period of operation and their interactions influence the probability of a damage incident, as follows:

10.12 equation

where

  1. qi.. = the main effect of the ith type of ship,
  2. q.j. = the main effect of the jth range of years of construction,
  3. q..k = the main effect of the kth period of operation,
  4. qij. = the effect of interaction between the ith type and the jth range of years, and so on.

N and the qs are parameters to be estimated from the data. This model, like that in Equation (10.4), can be transformed to the linear-model form by ignoring the residual term εijk and applying a suitable transformation, in this case the logarithmic transformation, namely

10.13 equation

However, the term log(aijk) does not have to be estimated: it is a separate variable supplied with the data. It is equivalent to a term β log(aijk) when the value of the parameter β is fixed at 1. Such a term is called an offset. In the notation of Wilkinson and Rogers (1973), the model specified here (excluding the offset term) is

equation

The following statements import the data and fit this model, specifying all terms as fixed-effect terms for the time being:

IMPORT 'IMM edn 2\Ch 10\ship damage.xlsx'
CALCULATE logservice = LOG(service_months/SUM(service_months))
MODEL [DISTRIBUTION = poisson; LINK = log; DISPERSION = 1; \
 OFFSET = logservice] damage_incidents
TERMS type*constrctn_y*operatn_period
FIT [PRINT = *] type
ADD [PRINT = *] constrctn_y
ADD [PRINT = *] operatn_period
ADD [PRINT = *] type.constrctn_y
ADD [PRINT = *] type.operatn_period
ADD [PRINT = model, accumulated; FPROBABILITY = yes] \
 constrctn_y.operatn_period

The CALCULATE statement obtains the natural logarithm of the proportion of months of service in each category, transforming this variable to the scale on which it will be required in the model. A message in the output (not shown here) warns that in the case of Unit 34 (Row 35 of the spreadsheet) an attempt has been made to obtain the logarithm of zero, and that the result is a missing value. Consequently this unit is omitted from the analysis, which is appropriate, as it represents a category of ship that spent no time at sea, and therefore was not exposed to the risk of damage. In the MODEL statement, the DISTRIBUTION option specifies that the response variate (‘damage_incidents’) follows the general form of the Poisson distribution, though its variance may be greater or less than that of a Poisson variable, unless we constrain it using the DISPERSION option (see below). The LINK option specifies the function required to transform the model to the linear form, just as the same option did in the GLMM statement in the previous example (Section 10.2). If the assumptions that each damage incident is independent and that R follows the Poisson distribution in every respect are correct, then it follows that

$$ \mathrm{var}(R) = \mu \qquad (10.14) $$

and there is no need to estimate the residual variance from the data. (It is a peculiarity of the Poisson distribution that its variance is equal to its mean.) The option setting ‘DISPERSION = 1’ indicates that we are willing to make this assumption. The offset term is specified by the OFFSET option.
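The calculation of the offset can be sketched in Python. Only the type E rows visible in Table 10.3 are used, so the total below is for that subset and the resulting values are not those that the CALCULATE statement would produce from the full data set. The function name log_proportion is an illustrative choice, and the sketch assumes that blank cells are ignored when the total is formed.

```python
from math import exp, log

# service_months for the type E rows visible in Table 10.3 (spreadsheet rows
# 34 to 41). None marks a blank cell. This is a subset of the data, so the
# total is not the total that would be obtained from the full data set.
service_months = [45, 0, 789, 437, 1157, 2161, None, 542]

def log_proportion(months, total):
    """Natural log of the proportion of service; None where it is undefined."""
    if months is None or months <= 0:
        return None   # log of zero or of a missing value gives a missing value
    return log(months / total)

# Assumption: blank cells are ignored when summing.
total = sum(m for m in service_months if m is not None)
logservice = [log_proportion(m, total) for m in service_months]

for months, offset in zip(service_months, logservice):
    print(months, None if offset is None else round(offset, 4))
```

The category with zero months of service yields a missing offset, which is why the corresponding unit drops out of the analysis.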

In order to obtain an analysis in which the deviance accounted for by each model term is shown separately, it is necessary to specify each of the terms in a separate statement. The FIT statement specifies the model on H0, comprising only the constant term log(N) and the offset term log(aijk). The option setting ‘PRINT = *’ indicates that no output is to be produced from this initial model. The main effect and interaction terms are added by a succession of ADD statements, again with no printing, until the final ADD statement specifies that the complete model and the accumulated analysis of deviance are to be printed. The analysis of deviance is a method closely related to analysis of variance: in the special case where the residual variation is normally distributed (i.e. in all the models considered prior to the present chapter), the two are equivalent. The term accumulated indicates that the deviance accounted for by adding each term to the model successively is to be presented (cf. Section 1.4, where an anova constructed on the same basis is presented). The FPROBABILITY option specifies that the analysis of deviance table is to include a p-value for the significance of each term in the model. Note that it is not necessary to fit the three-way interaction term ‘type.constrctn_y.operatn_period’ explicitly, as there is only one observation for each combination of these factors, and this term is therefore the residual term.

The output of the final ADD statement is as follows:

A message first warns that the term ‘constrctn_y.operatn_period’ cannot be fully included in the model. This is because of the presence of missing values, as a result of which not all combinations of levels of the three factors are represented in the data. The model fitted and the fitting method used are then specified. The accumulated analysis of deviance then shows how the variation among the values of ‘damage_incidents’ (after adjusting for ‘logservice’) is distributed among the terms in the model. Roughly speaking, the deviance per degree of freedom – the mean deviance – gives a measure of the amount of variation accounted for by each term, when the terms are added successively to the model. Thus the mean deviance corresponds to the mean square in an analysis of variance. It is divided by the dispersion parameter to give the deviance ratio: since the dispersion parameter has been fixed at 1 (i.e. ‘damage_incidents’ has been assumed to follow a Poisson distribution in every respect, including the value of the variance), the two are identical. If the dispersion parameter had not been specified, it would be estimated by the residual mean deviance: hence the deviance ratio is equivalent to the variance ratio (the F statistic) in an analysis of variance. If the assumption that ‘damage_incidents’ follows a Poisson distribution is correct, and if H0 is true, then the deviance for each term is distributed approximately as χ² with the degrees of freedom (d.f.) indicated. Thus each deviance provides a significance test for the term in question, and the column headed ‘approx chi pr’ gives the corresponding p-value. These p-values indicate that the main effects of the three factors are highly significant, and that the ‘type.constrctn_y’ interaction is also significant, but that the other interactions are not.

It is of interest to obtain estimates of the mean frequency of damage to ships of each type, but it is not possible to do so from the model fitted above, because some pair-wise combinations of factor levels are not represented in the data. However, we can omit the non-significant two-way interaction terms from the model and obtain estimates based on the simpler model comprising only the main effects of the three factors and the ‘type.constrctn_y’ interaction. This is done by the following statements:

FIT [PRINT = *] type * constrctn_y + operatn_period
PREDICT [PRINT = description, prediction, se, sed] type

The output of the PREDICT statement is as follows:

The note that the estimated means are formed on the scale of the response variable indicates that they must be back-transformed, using the inverse of the link function, in order to obtain them on the original scale, namely the number of incidents of damage. For example, the mean value for ships of Type A corresponds to

equation

The accompanying SE indicates that the true value is likely to lie in the range from

equation

to

equation

Another note states that these estimates are based on a value of −4.956 for the offset variate ‘logservice’. This means that they are based on the value

\[ \log_e(A) = -4.956 \tag{10.15} \]

where

  1. A = the proportion of months of service that is assumed to fall in the category under consideration.

Although A varies from category to category in the data, it is held constant for the purpose of prediction, in order to provide a valid basis for comparison between risks in the different categories. From this decision, it follows that

\[ \text{months of service in each category} = AT \tag{10.16} \]

where

  1. T = total number of months of service.

Totalling the values of ‘service_months’ in the data, we obtain

\[ T = 163574; \tag{10.17} \]

rearranging Equation 10.15 we obtain

\[ A = e^{-4.956} = 0.0070410, \tag{10.18} \]

and substituting from Equations 10.17 and 10.18 into Equation 10.16, we find that the predicted numbers of incidents are based on

\[ AT = 0.0070410 \times 163574 = 1151.7 \text{ months of service.} \]
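
The arithmetic behind the offset can be retraced in a few lines. The Python sketch below is an illustration, not part of the original analysis: it uses only the offset value −4.956 reported in the prediction note and the total of 163574 service months quoted in Section 10.6, and the variable names are assumptions made here.

```python
import math

offset = -4.956   # value of 'logservice' reported in the PREDICT output
T = 163574        # total of 'service_months', as quoted in the text

A = math.exp(offset)   # proportion of service months assumed for each category
months = A * T         # months of service on which each prediction is based

print(f"A        = {A:.7f}")
print(f"A * T    = {months:.1f} months of service")
print(f"log_e(T) = {math.log(T):.3f}")
```

Holding A at a single value for every category is what makes the predicted counts for different ship types directly comparable.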

Another note indicates that the calculation of this mean has required averaging over the levels of the other factors, and that in this process marginal weights have been applied: that is, operation periods and construction years that are more heavily represented in the data contribute more heavily to the mean. However, it is noted that the status of weights is constant over levels of other factors: this means that a combination of ‘operatn_period’ and ‘constrctn_y’ that is more heavily represented in the data than the marginal weights would lead one to predict does not contribute more heavily to the mean. Another note states that the SEs are not appropriate for the forecasting of new observations: that is, they indicate the precision of the means themselves, not the amount of variation among individual observations. As noted earlier (Section 4.6), SEs of differences between means are usually of more interest than SEs of the means themselves, and these are provided, for all pair-wise comparisons, on the transformed scale.

If ‘constrctn_y’ and ‘operatn_period’ are representative of a broader population of construction and operation dates, it may be reasonable to specify these factors as random-effect terms. However, it may be reasonable to retain ‘type’ as a fixed-effect term representing particular methods of construction that are of individual interest, and that may be amenable to choice or control: ship builders can decide what type of hull to construct, and insurers can decide what premiums to offer on each type of ship. According to the guidelines given in Section 6.3, it follows that the remaining terms in the model should then be specified as random-effect terms. The following statements specify the fitting of this model:

GLMM [DISTRIBUTION = poisson; LINK = log; DISPERSION = 1; 
 OFFSET = logservice; 
 RANDOM = constrctn_y * operatn_period + type.constrctn_y + 
 type.operatn_period; FIXED = type] Y = damage_incidents

The structure of this statement corresponds to that of the GLMM statement in Section 10.2: the only new feature is the OFFSET option, which has the same meaning as that in the MODEL statement earlier in this section. However, the output from this statement begins with the following diagnostic messages:

It is therefore safest to ignore the rest of the output. Whereas the methods of ordinary regression analysis and analysis of variance can be applied without arithmetic problems to almost any data set, the same is not true of mixed modelling, and certainly not of GLMM. We will now examine the kinds of problems that can be encountered, in order to interpret these warning messages.

In ordinary regression analysis and analysis of variance, analytical formulae are applied which give, in a single step, the best estimates, according to a certain criterion, of the parameters of the model being fitted. The criterion used to identify the ‘best’ parameter estimates is that they should be the values that maximize the probability of the observed data, and hence minimize the value of the deviance (a concept introduced in Section 3.12). These are said to be the parameter estimates with the highest likelihood: as noted in Section 3.12,

\[ \text{deviance} = -2 \log_e(\text{likelihood}). \tag{10.19} \]

When all the random-effect terms are assumed to be normally distributed (as is the case in all the examples considered in earlier chapters), these estimates are also those that minimize the estimate of the residual variance. However, in mixed modelling, the maximum-likelihood estimates cannot generally be obtained in a single step: a search must be made for them. In broad terms, the search strategy is to start from an arbitrary set of parameter values, then determine a set of changes that can be made to these that will increase the value of the likelihood. This process is repeated, the change made getting smaller at each iteration, until no change that produces a further increase can be found. At this point, the model-fitting process is said to have converged. The process is analogous to trying to climb a mountain in fog, following the rule ‘always walk uphill’. The process will lead to the summit (analogous to the maximum-likelihood estimates), provided that

  • one starts on the flank of the mountain, not in some other part of the landscape,
  • one takes steps that are not too large and
  • the mountain has one, and only one, peak.

If these criteria are not met, one may walk uphill indefinitely (failure of convergence), or the process may lead to a small peak on the flank of the likelihood ‘mountain’, not its true summit (convergence to a local maximum). The latter problem may be recognized by unrealistic parameter estimates and a poor fit to the data. Sometimes these problems can be overcome by a careful choice of initial parameter values for the fitting process, giving the process a better chance of ‘climbing the right peak’.
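
The ‘always walk uphill’ rule can be illustrated with a deliberately simple search. The Python sketch below is not the algorithm used by GenStat, R or SAS: it is a one-parameter step-halving search, and the counts, function names and tolerance are hypothetical choices made for illustration. It maximizes the log-likelihood of a single log-scale mean for Poisson counts, a case in which the likelihood ‘mountain’ has only one peak, at the logarithm of the sample mean.

```python
import math

def poisson_loglik(beta, counts):
    """Log-likelihood of a log-scale mean beta for independent Poisson counts.
    The term -sum(log(y!)) is omitted: it does not depend on beta."""
    mu = math.exp(beta)
    return sum(y * beta - mu for y in counts)

def climb(loglik, start, step=1.0, tol=1e-8, max_iter=10_000):
    """Move to a neighbouring value whenever it raises the log-likelihood;
    halve the step when neither neighbour does; stop when the step is tiny."""
    beta, best = start, loglik(start)
    for iteration in range(max_iter):
        if step < tol:
            return beta, iteration, True       # converged
        moved = False
        for candidate in (beta + step, beta - step):
            value = loglik(candidate)
            if value > best:
                beta, best, moved = candidate, value, True
                break
        if not moved:
            step /= 2.0
    return beta, max_iter, False               # failure of convergence

counts = [3, 0, 2, 5, 1, 4]                    # hypothetical counts, illustration only
estimate, n_iter, converged = climb(lambda b: poisson_loglik(b, counts), start=0.0)
print(converged, round(estimate, 4), round(math.log(sum(counts) / len(counts)), 4))
```

With several parameters and several variance components the surface need not have a single peak, which is why a poor starting point or too large a step can produce the failures described above.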

In the present case, the model-fitting process has encountered severe difficulties. Generally, such problems occur because the model being fitted does not represent the pattern of variation in the data well: a review of the model may suggest terms that can be dropped, or others that should be added. In the output above, the message that negative variance components are present suggests that the difficulties may have occurred because some of the model terms have little or no effect: if such terms are dropped, model-fitting may be more successful. Inspection of the results produced when all terms were specified as fixed-effect terms indicates that the non-significant terms ‘type.operatn_period’ and ‘constrctn_y.operatn_period’ are candidates for omission. Fitting of the resulting model is specified by the following statement:

GLMM [DISTRIBUTION = poisson; LINK = log; DISPERSION = 1; 
 OFFSET = logservice; 
 RANDOM = constrctn_y + operatn_period + type.constrctn_y; FIXED = type; 
 PRINT = model, monitoring, components, vcovariance, means, 
 backmeans, waldtests] Y = damage_incidents

The output produced by this statement is as follows:

This simpler model has been successfully fitted. The output follows the same general form as that from the GLMM fitted in Section 10.2. The statement of the fitting method used and the model fitted is followed by a warning about the missing values in the data. Next comes the monitoring information on the fitting process, which shows that convergence has been achieved: this is indicated by the values of the gammas, statistics closely related to the variance components for the three random-effect terms (see Section 10.10), which hardly change between the sixth and seventh iteration. Next come the estimates of the variance components. Although the main effects of ‘constrctn_y’ and ‘operatn_period’ were significant in the fixed-effects-only analysis, the variance component estimates for these terms are smaller than their respective SEs. It should be remembered that in the fixed-effects-only analysis, each term was tested against the residual deviance: the significance of these terms may have been due to other variance components that contribute to their deviance. Several variance components can contribute to the deviance for a particular model term, just as several can contribute to an expected mean square in an anova (see Section 3.3). Next comes the variance–covariance matrix among these variance component estimates: the interpretation of such a matrix is explained in Section 10.2. Next come the tests of the significance of the fixed-effect term ‘type’. The F statistic shows that this falls short of significance. Nevertheless, the estimated means for each level of ‘type’ are then presented, on the transformed scale, that is, the scale after the link function has been applied to the model: in this case, the logarithmic scale. These means are somewhat different from those obtained from the fixed-effects model: consistently larger, but less variable.
However, the ranking of the five types of ship is the same, Type C giving the fewest incidents of damage and Type E the most. The SEs of differences between these means are consistently smaller than those from the fixed-effects model. Overall, the variation among the means is somewhat less, relative to the SEs of differences, than when the fixed-effects model is used. This is perhaps to be expected, as the mixed model recognizes ‘constrctn_y.type’ as a random-effect term, which contributes to the SEs of differences between levels of ‘type’. Finally, the back-transformed means on the original scale (count of incidents) are presented. These are obtained from the means on the transformed scale using the inverse of the link function, namely the exponential function: for example,

equation
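
The back-transformation is simply exponentiation. As an illustration, the following Python sketch applies it to the means on the logarithmic scale reported by SAS in Section 10.8, which are stated there to agree with those from GenStat. The dictionary names are assumptions made here.

```python
import math

# Means on the logarithmic scale, copied from the PROC GLIMMIX output
# in Section 10.8
log_means = {"A": 6.3487, "B": 5.7631, "C": 5.6096, "D": 6.1312, "E": 6.7527}

# Inverse of the logarithmic link: exponentiate to return to the count scale
back_means = {ship: math.exp(m) for ship, m in log_means.items()}

# List the ship types from fewest to most incidents
ranking = sorted(back_means, key=back_means.get)
for ship in ranking:
    print(ship, round(back_means[ship], 1))
```

Because the exponential function is monotonic, the ranking of the ship types is the same on both scales.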

10.6 The hierarchical generalized linear model (HGLM)

In the GLMMs fitted above (Sections 10.2 and 10.5), it is assumed that although the residual variation may not be normally distributed, the other random-effect terms do follow the normal distribution. However, this is not always the most natural assumption. For example, it has been suggested (Lee and Nelder, 2001) that when a Poisson distribution and a logarithmic link function are specified for the residual variation (as in the model fitted in Section 10.5), a more appropriate assumption for the other random-effect terms might be a gamma distribution and a logarithmic link function. Such models, in which a probability distribution and a link function can be specified for each random-effect term, have been called hierarchical generalized linear models (HGLMs). However, this name is slightly misleading, as the random-effect terms do not have to be nested to form a hierarchy: they may be crossed, as in the model in Section 10.5. An alternative name, taking account of this possibility, would be stratified generalized linear models.

A system to fit HGLMs has been developed (Lee and Nelder, 1996, 2001), and is available in GenStat. The following statements use this system to fit the same GLMM as was fitted in Section 10.5, with no change to the distributions or link functions:

CALCULATE logservice1 = LOG(service_months)
SUBSET [logservice1 .NE. !s(*)] 
 type,constrctn_y,operatn_period,damage_incidents,logservice1
HGFIXEDMODEL [DISTRIBUTION = poisson; LINK = logarithm; 
 OFFSET = logservice1] type
HGRANDOMMODEL [DISTRIBUTION = normal; LINK = identity] 
 constrctn_y + operatn_period + type.constrctn_y
HGANALYSE [LMETHOD = eql] damage_incidents
HGPLOT METHOD = histogram, fittedvalues, normal, halfnormal
HGPLOT [RANDOMTERM = type.constrctn_y] 
 METHOD = histogram, fittedvalues, normal, halfnormal

Note that for presentation to the HGLM fitting system, the offset variable must be specified as ‘LOG(service_months)’ not ‘LOG(service_months/SUM(service_months))’. Note also that the HGLM fitting system is not able to cope with the missing values of the offset variate, so the SUBSET statement is used to exclude these from consideration, together with the corresponding values of the other variates and factors used. The HGFIXEDMODEL and HGRANDOMMODEL statements specify the fixed-effect and random-effect models, respectively. The HGFIXEDMODEL statement also specifies the distribution and link function for the residual term, and the offset variable, as was done in the equivalent GLMM statement (Section 10.5). The HGRANDOMMODEL statement further indicates that the other random-effect terms are to have a normal distribution and the identity link function, that is, no transformation, as was assumed implicitly in the GLMM statement. The HGANALYSE statement indicates that the response variable is ‘damage_incidents’, and that the model is to be fitted using a criterion known as extended quasi-likelihood, rather than the default criterion of exact likelihood. This specification produces output (not shown) that is numerically equivalent to that from the GLMM statement with regard to the fixed-effect estimates. The two HGPLOT statements produce diagnostic plots of the random effects, for the residual term and the term ‘type.constrctn_y’, respectively (Figure 10.5). The other two random-effect terms do not have enough levels to justify the production of diagnostic plots.

Figure 10.5 Diagnostic plots of the distribution of random effects from the mixed model relating damage to ships to their type, year of construction and period of operation, fitted with the normal distribution and the identity link function. (a) Effects in the residual term and (b) effects in the term ‘type.constrctn_y’.

The distributions used in the model fitted here are not both normal, and the reference to normal plots and normal quantiles in the labelling of these diagnostic plots is therefore imprecise. However, the interpretation of the plots is the same as in models that assume a normal distribution (Section 1.10). In the case of the residual term, the histogram is reasonably bell-shaped and symmetrical, the scatter of the points in the fitted-value plot is reasonably even over the range of fitted values, and the points in the normal and half-normal plots lie reasonably close to a straight diagonal line. In the case of the term ‘type.constrctn_y’, the diagnostic plots do not conform so closely to these ideals: in particular, more will be said about the fitted-value plot in a moment.

In order to adopt the suggested distribution and link function for the random-effect terms other than the residual term, the HGRANDOMMODEL statement is modified to

HGRANDOMMODEL [DISTRIBUTION = gamma; LINK = logarithm] 
 constrctn_y + operatn_period + type.constrctn_y

The output of the HGANALYSE statement is then as follows:

The monitoring information at the beginning of the output shows the value of the dispersion parameter λ_i for each random-effect term (see below) at each stage of the model-fitting process, and the largest absolute change in this parameter value at each step. The largest absolute change between the final two cycles is very small, confirming that the model-fitting process has converged. The message ‘Aitken extrapolation OK’ indicates points at which the use of this method to accelerate convergence is acceptable. The model fitted is then specified. Next come the estimates of the constant and of the effects of ‘type’. These are calculated using ‘A’ as the reference level of ‘type’, so the corresponding means are given by

equation

and

equation

for the other types. The values obtained are presented in Table 10.4. They differ from those given by the GLMM by approximately log_e(T) = log_e(163574) = 12.005, because the offset of the HGLM was specified as log_e(service months) whereas that of the GLMM was specified as log_e(service months/sum(service months)). Hence they are based on A = e^(−4.956) = 0.0070410 months of service.

Table 10.4 Predicted number of damage incidents suffered by each type of ship, per 0.0070410 months of service, transformed to natural logarithms, obtained from an HGLM.

Type                     A        B        C        D        E
Predicted value     −5.624   −6.203   −6.353   −5.774   −5.185
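
The size of the shift between the two sets of means can be checked directly. The Python sketch below is an illustration only. It assumes that the GLMM means may be represented by the values in the PROC GLIMMIX output of Section 10.8, which are stated there to agree with those from GenStat, and compares their differences from the values in Table 10.4 with log_e(T).

```python
import math

T = 163574  # total months of service quoted in the text

# Means on the logarithmic scale
hglm_log_means = {"A": -5.624, "B": -6.203, "C": -6.353, "D": -5.774, "E": -5.185}  # Table 10.4
glmm_log_means = {"A": 6.3487, "B": 5.7631, "C": 5.6096, "D": 6.1312, "E": 6.7527}  # Section 10.8

shift = math.log(T)  # expected difference caused by the two forms of offset
differences = {ship: glmm_log_means[ship] - hglm_log_means[ship]
               for ship in hglm_log_means}

for ship, diff in differences.items():
    print(ship, round(diff, 3), round(diff - shift, 3))
```

The differences are close to, but not exactly, log_e(T), which is consistent with the two fitting systems giving similar rather than identical estimates.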

The parameter estimates from the dispersion model indicate the deviance due to the random-effect terms ‘constrctn_y’, ‘operatn_period’ and ‘constrctn_y.type’, respectively. These estimates are given relative to the dispersion: that is,

In the present case, we have specified that

so rearranging Equation 10.20 and substituting the numerical values from the output and from Equation 10.21, we obtain, for example,

equation

The deviances for the random-effect terms can be interpreted and compared in roughly the same way as estimates of variance components. For example, each is compared in the output with its own SE, to obtain a t statistic that gives a tentative indication of whether any variation is accounted for by the term (cf. Section 3.11, where the analogous statistic based on estimates of variance components is referred to as a z statistic). In the present case, all three dispersion parameter estimates are negative, although GLMM gave small positive values for the corresponding variance component estimates. Thus according to the HGLM models, no variation is accounted for by these terms, and the estimates of all random effects, except the residual effects, are shrunk to zero. This is why the fitted values in the fitted-value plots for ‘type.constrctn_y’ are all zero (Figures 10.5b and 10.6b).

The diagnostic plots obtained from the residual term in this model (Figure 10.6a) are very little changed from those obtained previously (Figure 10.5a), but those for the term ‘type.constrctn_y’ (Figure 10.6b) are considerably less satisfactory than when the normal distribution and the identity link function were used (Figure 10.5b). Thus, in this particular case, the use of the HGLM system has not improved the outcome of the modelling process. This may be because the parameters of a non-normal distribution are more difficult to fit, even when such a distribution is more appropriate on theoretical grounds (D. Hedderly, personal communication).

Figure 10.6 Diagnostic plots of the distribution of random effects from the mixed model relating damage to ships to their type, year of construction and period of operation, fitted with the gamma distribution and the logarithmic link function. (a) Effects in the residual term and (b) effects in the term ‘type.constrctn_y’.

10.7 Use of R to fit a GLMM and a HGLM to a contingency table

The following commands import the ship-damage data into R, convert ‘type’, ‘constrctn_y’ and ‘operatn_period’ to factors and calculate the offset variable ‘logservice’:

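# Start from an empty workspace
rm(list = ls())
# Backslashes in an R string must be doubled (forward slashes also work)
ship.damage <- read.table(
 "IMM edn 2\\Ch 10\\ship damage.txt",
 header=TRUE)
attach(ship.damage)
# Convert the classifying variables to factors
ftype <- factor(type)
fconstrctn_y <- factor(constrctn_y)
foperatn_period <- factor(operatn_period)
# Offset: log of the proportion of all service months in each category
logservice <- log(service_months/sum(service_months, na.rm = TRUE))
# log(0) gives -Inf: treat such categories as missing
logservice[logservice == -Inf] <- NA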

Note that in categories where there are no months of service, and hence ‘logservice’ is minus infinity, this is replaced by a missing value.

The following commands fit to these data the model in which all terms are specified as fixed-effect terms:

ship.damage.glm.1 <-
 glm(damage_incidents ~ ftype * fconstrctn_y * foperatn_period,
 family = poisson(link = "log"),
 offset = logservice, data = ship.damage)
anova(ship.damage.glm.1)

The output from the function anova() is as follows:

The deviances and their d.f. agree with those obtained from GenStat.

The following commands fit the model from which the non-significant two-way interaction terms have been omitted:

ship.damage.glm.2 <-
 glm(damage_incidents ~ ftype * fconstrctn_y + foperatn_period,
 family = poisson(link = "log"),
 offset = logservice, data = ship.damage)
anova(ship.damage.glm.2)

Again, the deviances and their d.f. agree with those obtained from GenStat. However, the author has not been able to obtain meaningful predictions of the ship type means, or the differences between them, from this model.

The combination of the random-effect model ‘constrctn_y*operatn_period + type.constrctn_y + type.operatn_period’ with the fixed-effect model ‘type’, which produced non-convergence in GenStat, is successfully fitted by R, but for purposes of comparison, the model subsequently fitted in GenStat will also be fitted here. This is done by the following commands:

library(lme4)
ship.damage.glmer.2 <- glmer(damage_incidents ~ ftype +
 (1|fconstrctn_y) + (1|foperatn_period) +
 (1|ftype:fconstrctn_y),
 family = poisson(link = "log"), offset = logservice)
summary(ship.damage.glmer.2)

The function glmer() fits a generalized linear mixed model, the response variable, fixed-effect model and random-effect model being specified as described in Section 10.3. The family option indicates the distribution of the response variable and the link function, and the offset option indicates the offset variable.

Part of the output from the function summary() is as follows:

The variance component estimates do not agree with those obtained from GenStat: the reason for this discrepancy is not known to the author. The estimates of the effects of ‘type’ are calculated relative to the estimated mean for Type A, which is the intercept. Thus, for example, the estimated mean for Type B on the transformed (logarithmic) scale is

equation

These estimates do not agree closely with those obtained from GenStat, but they rank the ship types in the same order.

At the time of writing, no straightforward method for specifying in R the HGLM fitted in GenStat, with several random-effect terms, is known to the author. A simpler HGLM with a single random-effect term is therefore fitted, using the package ‘hglm’. This package can be installed from the R website by the method described in Section 3.16. It is not able to deal with an offset variate containing missing values, and a new data frame (see Section 1.11 for an explanation of this concept) in which the corresponding observations are excluded from all variables is therefore created by the following command:

ship.damage.short <-
 data.frame(damage_incidents, ftype, fconstrctn_y,
 foperatn_period, logservice)[!is.na(logservice), ]

The following commands then fit an HGLM that estimates the effects of ‘ftype’ and ‘fconstrctn_y’, but does not take account of ‘foperatn_period’:

library(hglm)
ship.damage.hglm.1 <- hglm(fixed = damage_incidents ~ ftype,
 random = ~ 1|fconstrctn_y,
 family = poisson(link = "log"),
 rand.family = gaussian(link = "identity"),
 offset = logservice, data = ship.damage.short)
summary(ship.damage.hglm.1)

In the function hglm(), the fixed and random options specify the fixed-effect and random-effect models, respectively. The family option specifies the distribution and the link function in the residual stratum, and the rand.family option specifies the distribution and link function for the term in the random-effect model – in the present case, the Gaussian (normal) distribution and the identity link function. The data option indicates that the analysis is to be performed using the data frame from which observations with missing values have been excluded.

The output from the summary() function is as follows:

The output first reproduces the HGLM model specification. It then shows the effect estimates for the fixed-effect model ‘type’, calculated relative to Type A as before. The ship type means obtained from these effect estimates agree fairly closely with those given by GenStat. BLUPs are then presented for the levels of the term in the random-effect model, ‘fconstrctn_y’. A fuller account of the function hglm() is given by Rönnegård, Shen and Alam (2010), and a somewhat broader range of HGLMs can be fitted using the newer function hglm2() (L. Rönnegård, personal communication).

Diagnostic plots of residuals from this model are produced by the following commands:

windows()
par(mfrow=c(2,2))
plot(ship.damage.hglm.1)

Part of the output from the plot() function is shown in Figure 10.7. The plot of studentized residuals against fitted values (Plot a) shows that larger fitted values are associated with more variable residuals, indicating that the model fitted should ideally be refined to take account of this heterogeneity of variance. This pattern is confirmed by the corresponding plot of the absolute values of the residuals (Plot b), which shows a clear positive trend. The Q–Q plot (Plot c) shows somewhat more residuals with very large positive or negative values than expected. However, the histogram of residuals (Plot d) is as close to a normal distribution as can reasonably be expected from this fairly small sample. Further diagnostic plots are produced, but they are less informative than those shown here. In particular, the plots that relate to the random effects of ‘fconstrctn_y’ give very little information, being based on only four values.

Figure 10.7 (a–d) Diagnostic plots of the distribution of random effects from the mixed model relating damage to ships to their type and year of construction, fitted with the normal distribution and the identity link function.

The HGLM can be modified to specify a gamma distribution for the effects of ‘fconstrctn_y’ and the logarithmic link function for this stratum by changing the setting of the rand.family option to ‘rand.family = Gamma(link = "log")’. This causes some change to the estimates of the effects of ‘type’, but has very little effect on the diagnostic plots.

10.8 Use of SAS to fit a GLMM to a contingency table

The following SAS statements import the ship-damage data, calculate the offset variable ‘logservice’ and fit the model specifying all terms as fixed-effect terms:

PROC IMPORT OUT = ship DBMS = EXCELCS REPLACE
   DATAFILE = "&pathname.IMM edn 2\Ch 10\ship damage.xlsx";
   SHEET = "for SAS";
RUN;
PROC UNIVARIATE DATA = ship NOPRINT;
   VAR service_months;
   OUTPUT OUT = ship_summ SUM = serv_sum;
RUN;
DATA ship_summ_2; SET ship_summ;
   dummy = 1;
RUN;
DATA ship_2; SET ship;
   dummy = 1;
RUN;
PROC SQL;
   CREATE TABLE ship_3 AS
   SELECT a.*, b.*
   FROM ship_2 AS a, ship_summ_2 AS b
   WHERE a.dummy EQ b.dummy;
QUIT;
DATA ship_4; SET ship_3;
   logservice = log(service_months/serv_sum);
RUN;
ODS RTF;
PROC GENMOD DATA = ship_4;
   CLASS type constrctn_y operatn_period;
   MODEL damage_incidents = type constrctn_y operatn_period
      type * constrctn_y type * operatn_period constrctn_y * operatn_period /
      DIST = POISSON LINK = LOG OFFSET = logservice TYPE1;
RUN;
ODS RTF CLOSE;

PROC UNIVARIATE obtains the total of ‘service_months’, and the DATA steps and use of PROC SQL that follow place this total in an additional column in the data set and calculate ‘logservice’. PROC GENMOD fits the model. In the MODEL statement, the DIST, LINK and OFFSET options specify the distribution of the response variable, the link function and the offset variable, respectively, and the TYPE1 option specifies Type I hypothesis tests, which are those performed in an accumulated analysis of deviance. Part of the output from PROC GENMOD is as follows:

LR statistics for Type 1 analysis

Source                  Deviance   DF   Chi-square   Pr > ChiSq
Intercept               146.3283
type                     90.8893    4        55.44      <0.0001
constrctn_y              49.3552    3        41.53      <0.0001
operatn_period           38.6951    1        10.66       0.0011
type*constrctn_y         14.5869   12        24.11       0.0197
type*operatn_period       8.5208    4         6.07       0.1943
constrctn*operatn_pe      6.8565    2         1.66       0.4351

The chi-square statistics and their d.f. agree with the deviances in the accumulated analysis of deviance produced by GenStat.
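
The relationship between the two columns of this table can be made explicit: each chi-square statistic is the reduction in the residual deviance when the term is added. The Python sketch below is an illustration, not the method used internally by SAS or GenStat. It recomputes the chi-square statistics, the mean deviances and the p-values from the ‘Deviance’ column, using a helper function `chi2_sf` written here for integer degrees of freedom. All names in the sketch are assumptions.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of chi-square with integer df.  With z = x / 2,
    start from Q(1/2, z) = erfc(sqrt(z)) or Q(1, z) = exp(-z) and apply
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)."""
    z = x / 2.0
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# (term, residual deviance once the term is in the model, d.f. of the term),
# copied from the PROC GENMOD Type 1 table
type1 = [
    ("Intercept", 146.3283, None),
    ("type", 90.8893, 4),
    ("constrctn_y", 49.3552, 3),
    ("operatn_period", 38.6951, 1),
    ("type*constrctn_y", 14.5869, 12),
    ("type*operatn_period", 8.5208, 4),
    ("constrctn_y*operatn_period", 6.8565, 2),
]

table = []
for (_, previous, _), (term, current, df) in zip(type1, type1[1:]):
    deviance = previous - current   # deviance accounted for by adding the term
    table.append((term, df, deviance, deviance / df, chi2_sf(deviance, df)))

for term, df, deviance, mean_deviance, p in table:
    print(f"{term:28s}{df:3d}{deviance:9.2f}{mean_deviance:9.2f}{p:9.4f}")
```

Because the dispersion parameter is fixed at 1, the mean deviance in the fourth column is also the deviance ratio.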

The following statements fit the GLMM with ‘type’ as the fixed-effect model and all other terms in the random-effect model, with the omission of the terms ‘type.operatn_period’ and ‘constrctn_y.operatn_period’:

ODS RTF;
PROC GLIMMIX DATA = ship_4 METHOD = RSPL ASYCOV NOBOUND;
   CLASS type constrctn_y operatn_period;
   MODEL damage_incidents = type /
      DISTRIBUTION = POISSON LINK = LOG OFFSET = logservice;
   RANDOM constrctn_y operatn_period constrctn_y * type;
   LSMEANS type;
RUN;
ODS RTF CLOSE;

In PROC GLIMMIX, the option setting ‘METHOD = RSPL’ indicates that the default RSPL method of estimation will be used. The MODEL statement specifies the response variable and the fixed-effect term, and the RANDOM statement specifies the random-effect terms. The LSMEANS statement indicates that the mean for each level of ‘type’ is to be estimated. Part of the output from PROC GLIMMIX is as follows:

Covariance parameter estimates

Cov Parm            Estimate   Standard error
constrctn_y          0.07661           0.1329
operatn_period       0.07152           0.1109
type*constrctn_y     0.1440            0.1372

Asymptotic covariance matrix of covariance parameter estimates

Cov Parm               CovP1      CovP2      CovP3
constrctn_y          0.01767   −0.00035   −0.00454
operatn_period      −0.00035    0.01230   −0.00002
type*constrctn_y    −0.00454   −0.00002    0.01883

Type III tests of fixed effects

Effect   Num DF   Den DF   F value   Pr > F
type          4       12      2.43   0.1043

Type least squares means

Type   Estimate   Standard error   DF   t value   Pr > |t|
A        6.3487           0.3639   12     17.45    <0.0001
B        5.7631           0.3114   12     18.51    <0.0001
C        5.6096           0.4411   12     12.72    <0.0001
D        6.1312           0.4175   12     14.68    <0.0001
E        6.7527           0.3853   12     17.53    <0.0001

The covariance parameter estimates agree with the estimated variance components produced by GenStat, and the covariances between the estimates are similar though not identical. The F test for ‘type’ agrees fairly closely with that given by GenStat, but the denominator d.f. is different, being obtained by a somewhat different method, and the p-value is therefore also different, though still non-significant. The estimated means for types of ship (on the logarithmic scale) agree with those given by GenStat.
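
Two simple checks on the covariance parameter table may help in reading it. The Python sketch below is an illustration using only the values printed above, with variable names that are assumptions. It computes the ratio of each variance component estimate to its SE, and confirms that each SE is the square root of the corresponding diagonal element of the asymptotic covariance matrix.

```python
import math

# (estimate, SE) for each variance component, from the PROC GLIMMIX output
components = {
    "constrctn_y": (0.07661, 0.1329),
    "operatn_period": (0.07152, 0.1109),
    "type*constrctn_y": (0.1440, 0.1372),
}
# Diagonal elements of the asymptotic covariance matrix, from the same output
diagonal = {
    "constrctn_y": 0.01767,
    "operatn_period": 0.01230,
    "type*constrctn_y": 0.01883,
}

ratios = {}
se_from_matrix = {}
for term, (estimate, se) in components.items():
    ratios[term] = estimate / se                       # estimate relative to its SE
    se_from_matrix[term] = math.sqrt(diagonal[term])   # SE implied by the matrix
    print(f"{term:18s}{ratios[term]:7.2f}{se_from_matrix[term]:9.4f}")
```

The ratios for ‘constrctn_y’ and ‘operatn_period’ are below 1, in line with the earlier remark that these variance component estimates are smaller than their SEs.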

At the time of writing, no straightforward method of fitting an HGLM in SAS is known to the author.

10.9 The role of the covariance matrix in the specification of a mixed model

In all the mixed models we have examined so far, each random-effect term consists of a set of factor levels. Any two observations either come from the same level of the factor, in which case they have the same random effect for the term in question, or from different levels, in which case their random effects are independent. In the residual stratum, each observation comprises a separate level, and its random effect is independent of those of all other observations. However, the relationship between the random effects in each term need not be so simple. In order to explore more elaborate relationships among the random effects, we need first to establish a notation in which to display them. Such a notation is provided by the covariance matrix, which is specified as follows.

Consider a simple data set comprising four groups of three observations of a response variable (Table 10.5). The natural model to fit to these data is

$$
y_{ij} = \mu + \gamma_i + \epsilon_{ij}
$$

where

  1. yij = the jth observation of the response variable Y in the ith group (the ijth observation),
  2. μ = the overall mean,
  3. γi = the effect of the ith group,
  4. ϵij = the residual effect on the ijth observation of Y.

We will make the usual assumption that the ϵij are values of a random, normally distributed variable Ε, that is,

$$
\mathrm{E} \sim \mathrm{N}(0, \sigma^2). \tag{10.23}
$$

The anova corresponding to this model is shown in Table 10.6, and the estimate of $\sigma^2$ is given by MSResidual,

$$
\hat{\sigma}^2 = \mathrm{MS}_{\mathrm{Residual}} = 1.644.
$$

Table 10.5 A simple data set comprising a grouping factor and a response variable.

      A        B
 1    group!   y
 2    1        1.97
 3    1        1.01
 4    1        4.53
 5    2        0.76
 6    2        1.60
 7    2        3.52
 8    3        4.37
 9    3        4.05
10    3        5.17
11    4        6.94
12    4        7.50
13    4        5.62

Table 10.6 Anova of a simple data set comprising a grouping factor and a response variable.

Source of variation   DF   MS       F      p
Group                 3    13.875   8.44   0.007
Residual              8    1.644
Total                 11
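
The mean squares and F statistic in Table 10.6 can be reproduced directly from the data in Table 10.5. The following is a minimal sketch in plain Python, assuming the balanced one-way layout described above; the function name `one_way_anova` and the dictionary keys are illustrative, and the p-value is omitted because it requires the F distribution, which is not in the standard library.

```python
# Data from Table 10.5: group number -> observations of y.
data = {
    1: [1.97, 1.01, 4.53],
    2: [0.76, 1.60, 3.52],
    3: [4.37, 4.05, 5.17],
    4: [6.94, 7.50, 5.62],
}


def one_way_anova(groups):
    """Return the degrees of freedom, mean squares and F for a one-way anova."""
    values = [y for ys in groups.values() for y in ys]
    grand_mean = sum(values) / len(values)

    # Between-group sum of squares: group size times squared deviation
    # of each group mean from the grand mean.
    ss_group = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
        for ys in groups.values()
    )
    # Residual sum of squares: squared deviations from the group means.
    ss_resid = sum(
        (y - sum(ys) / len(ys)) ** 2
        for ys in groups.values()
        for y in ys
    )

    df_group = len(groups) - 1
    df_resid = len(values) - len(groups)
    ms_group = ss_group / df_group
    ms_resid = ss_resid / df_resid
    return {
        "df_group": df_group,
        "df_resid": df_resid,
        "ms_group": ms_group,
        "ms_resid": ms_resid,
        "f": ms_group / ms_resid,
    }


result = one_way_anova(data)
print(f"MS Group    = {result['ms_group']:.3f} on {result['df_group']} DF")
print(f"MS Residual = {result['ms_resid']:.3f} on {result['df_resid']} DF")
print(f"F           = {result['f']:.2f}")
```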

In order to express the idea that the 12 values ϵij, i = 1…4, j = 1…3, are independent, we need to think of each of them as a value of a different variable, Εij. Similarly, each value of the response variable, yij, is thought of as an observation of a different variable, Yij. The relationships among the 12 variables Εij can then be expressed in a covariance matrix, namely:

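$$
\begin{array}{cc|cccccccccccc}
  & i' & 1 & 1 & 1 & 2 & 2 & 2 & 3 & 3 & 3 & 4 & 4 & 4 \\
  & j' & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 \\
i & j  &   &   &   &   &   &   &   &   &   &   &   &   \\
\hline
1 & 1 & \sigma^2 \\
1 & 2 & 0 & \sigma^2 \\
1 & 3 & 0 & 0 & \sigma^2 \\
2 & 1 & 0 & 0 & 0 & \sigma^2 \\
2 & 2 & 0 & 0 & 0 & 0 & \sigma^2 \\
2 & 3 & 0 & 0 & 0 & 0 & 0 & \sigma^2 \\
3 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma^2 \\
3 & 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma^2 \\
3 & 3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma^2 \\
4 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma^2 \\
4 & 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma^2 \\
4 & 3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma^2
\end{array}
$$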

where

  1. i = the group to which the first value in a particular comparison belongs,
  2. j = the position within group i of the first value,
  3. i′ = the group to which the second value in the comparison belongs and
  4. j′ = the position within group i′ of the second value.

The value of cov(Εij, Εi′j′), the covariance between Εij and Εi′j′, is given in the cell in the ijth row and the i′j′th column of this matrix. Since Εij is the only random term that contributes to Yij,

$$
\operatorname{cov}(Y_{ij}, Y_{i'j'}) = \operatorname{cov}(\mathrm{E}_{ij}, \mathrm{E}_{i'j'}). \tag{10.24}
$$

The variance of each of the variables Εij – its covariance with itself – is given along the leading diagonal of the matrix (the 12 cells from top left to bottom right). All other covariances are zero, reflecting the assumption that all the variables are mutually independent. Because

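$$
\operatorname{cov}(\mathrm{E}_{ij}, \mathrm{E}_{i'j'}) = \operatorname{cov}(\mathrm{E}_{i'j'}, \mathrm{E}_{ij}) \tag{10.25}
$$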

by definition, the matrix is symmetrical about the leading diagonal: hence to improve clarity, the top right-hand half can be omitted.

Now suppose that we decide to treat the groups as a random-effect term. We then make the assumption that the γi, i = 1…4, are values of a variable Γ, and that

$$
\Gamma \sim \mathrm{N}(0, \sigma_\Gamma^2). \tag{10.26}
$$

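The anova can now be interpreted as shown in Table 10.7, and the estimate of $\sigma_\Gamma^2$ is

$$
\hat{\sigma}_\Gamma^2 = \frac{\mathrm{MS}_{\mathrm{Group}} - \mathrm{MS}_{\mathrm{Residual}}}{3} = \frac{13.875 - 1.644}{3} = 4.077.
$$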
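
The estimate of the group variance component follows from the two mean squares in Table 10.6 and the number of observations per group. The following is a minimal sketch in Python, assuming a balanced design with three observations in every group; the function name `variance_components` is illustrative and is not taken from GenStat or SAS.

```python
def variance_components(ms_group, ms_residual, n_per_group):
    """Method-of-moments estimates for a balanced one-way random-effects model.

    Returns (group variance component, residual variance component).
    """
    sigma2_residual = ms_residual
    # Equate the observed mean squares to their expected values
    # (n * sigma2_group + sigma2_residual for Group) and solve.
    sigma2_group = (ms_group - ms_residual) / n_per_group
    return sigma2_group, sigma2_residual


# Mean squares from Table 10.6; three observations per group (Table 10.5).
sigma2_group, sigma2_residual = variance_components(13.875, 1.644, 3)
print(f"group variance component    = {sigma2_group:.3f}")
print(f"residual variance component = {sigma2_residual:.3f}")
```

Note that this simple calculation gives a negative value whenever the Group mean square is smaller than the Residual mean square.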

In order to express the idea that the four values γi, i = 1…4, are independent, we think of each of them as a value of a different variable, Γi. The variable representing the random part of Yij is now no longer Εij but Γi + Εij, and the covariances among these 12 variables are as follows:

equation
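
The structure of this covariance matrix can be generated from one rule: two observations share the group variance component if they belong to the same group, and the residual variance is added only when an observation is compared with itself. The following is a minimal sketch in plain Python; the function name `covariance_matrix` is illustrative, and the numerical values are the estimates derived from Table 10.6 (4.077 for the group component, 1.644 for the residual), used here only for illustration.

```python
def covariance_matrix(n_groups, n_per_group, sigma2_group, sigma2_residual):
    """Covariance matrix of Gamma_i + E_ij for a balanced one-way layout.

    Returns (labels, matrix), where labels[k] is the (i, j) pair that
    indexes row k and column k of the matrix.
    """
    labels = [
        (i, j)
        for i in range(1, n_groups + 1)
        for j in range(1, n_per_group + 1)
    ]
    matrix = []
    for (i, j) in labels:
        row = []
        for (i2, j2) in labels:
            cov = 0.0
            if i == i2:
                cov += sigma2_group      # same group: shared group effect
            if (i, j) == (i2, j2):
                cov += sigma2_residual   # same observation: own residual
            row.append(cov)
        matrix.append(row)
    return labels, matrix


labels, v = covariance_matrix(4, 3, 4.077, 1.644)


def cov(first, second):
    """Look up the covariance between observations labelled (i, j)."""
    return v[labels.index(first)][labels.index(second)]


print("var(Y12)      =", round(cov((1, 2), (1, 2)), 3))
print("cov(Y12, Y13) =", round(cov((1, 2), (1, 3)), 3))
print("cov(Y12, Y23) =", round(cov((1, 2), (2, 3)), 3))
```

Unlike the display above, the sketch fills in both halves of the matrix, so its symmetry about the leading diagonal can be checked directly.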

Table 10.7 Expected mean squares in the anova of a simple data set comprising a grouping factor and a response variable.

Source of variation   DF   Expected MS
Group                 3    $3\sigma_\Gamma^2 + \sigma^2$
Residual              8    $\sigma^2$
Total                 11
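
The expected mean squares in Table 10.7 can be justified in a few steps for this balanced design, with four groups of three observations. The following derivation is a sketch under the model assumptions stated above; the notation $\bar{Y}_{i\cdot}$ for the mean of group $i$ and $\bar{\mathrm{E}}_{i\cdot}$ for the mean of its residual effects is introduced here for illustration and is not used elsewhere in the text.

```latex
\begin{align*}
\bar{Y}_{i\cdot} &= \mu + \Gamma_i + \bar{\mathrm{E}}_{i\cdot}
  && \text{(mean of the three observations in group } i\text{)} \\
\operatorname{var}(\bar{Y}_{i\cdot}) &= \sigma_\Gamma^2 + \frac{\sigma^2}{3}
  && \text{(}\Gamma_i \text{ and the } \mathrm{E}_{ij} \text{ are independent)} \\
\mathrm{E}[\mathrm{MS}_{\mathrm{Group}}] &= 3 \operatorname{var}(\bar{Y}_{i\cdot})
  = 3\sigma_\Gamma^2 + \sigma^2
  && \text{(}\mathrm{MS}_{\mathrm{Group}} = 3 \times \text{sample variance of the group means)} \\
\mathrm{E}[\mathrm{MS}_{\mathrm{Residual}}] &= \sigma^2 \\
\frac{\mathrm{E}[\mathrm{MS}_{\mathrm{Group}}] - \mathrm{E}[\mathrm{MS}_{\mathrm{Residual}}]}{3}
  &= \sigma_\Gamma^2
\end{align*}
```

Replacing the expected mean squares in the last line by their observed values gives the estimate of the group variance component.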

The variance of each observation of the response variable is now $\sigma_\Gamma^2 + \sigma^2$. The covariance between observations from different groups – for example, between Y12 and Y23 – is zero, as before. However, the covariance between observations from the same group – for example, between Y12 and Y13 – is now $\sigma_\Gamma^2$, because they share the same group effect (Γ1) though they have different individual-observation effects (Ε12 and Ε13). This covariance matrix can be expressed in the notation of matrix algebra as

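$$
\sigma_\Gamma^2
\begin{pmatrix}
\mathbf{J}_3 &              &              &              \\
\mathbf{0}   & \mathbf{J}_3 &              &              \\
\mathbf{0}   & \mathbf{0}   & \mathbf{J}_3 &              \\
\mathbf{0}   & \mathbf{0}   & \mathbf{0}   & \mathbf{J}_3
\end{pmatrix}
+ \sigma^2 \mathbf{I}_{12},
$$

where $\mathbf{J}_3$ is the 3 × 3 matrix in which every element is 1, $\mathbf{0}$ is the 3 × 3 matrix of zeros, $\mathbf{I}_{12}$ is the 12 × 12 identity matrix, and the top right-hand half of the first matrix is omitted as before.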

In the first term in this expression, all rows and all columns representing observations in the same group are identical. (To see this, it may be helpful to fill in the top right-hand half of the matrix.) Hence the information in this term can be expressed more concisely in the form

10.27 equation