Just as mixed modelling is an extension of the linear-modelling methods comprised in regression analysis and analysis of variance, mixed modelling itself can be further extended in several directions, to give even more versatile and realistic models. This chapter reviews the various contexts in which we have seen that mixed modelling is preferable to a simple regression or analysis of variance approach, then outlines the ways in which the concepts of mixed modelling can be developed further. Fuller accounts of such advanced uses of mixed modelling are given by Brown and Prescott (2006) and by Pinheiro and Bates (2000). Brown and Prescott demonstrate the use of the statistical software SAS to fit the models, whereas Pinheiro and Bates use the statistical computer language S (of which the software R is one implementation – see Section 1.11). Both books place much more emphasis on the underlying mathematical theory than is given here.
A mixed-model analysis provides a fuller interpretation of the data than a simple regression or analysis of variance approach, and permits wider inferences about the observations to be expected in future, in the following situations:
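All the models that we have considered so far are linear: that is, they can be expressed in the form
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \upsilon_1 Z_1 + \dots + \upsilon_q Z_q + \varepsilon$$
where Y is the response variable, X1 to Xp and Z1 to Zq are explanatory variables, the βs and υs are parameters to be estimated, and ε is the residual term.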
When one or more of the explanatory variables are factors, some ingenuity is needed to express the model in this form. For example, the model used in Chapter 1 to relate house prices to latitude and town can be expressed in this form by setting
Y = log(house price)
X1 = latitude
Z1 = 1 for observations from Bradford, 0 otherwise
Z2 = 1 for observations from Buxton, 0 otherwise
⋮
Z11 = 1 for observations from Witney, 0 otherwise.
The variables Z1 to Z11, with their arbitrary values that indicate the category (the town) to which each observation belongs, are known as dummy variables. When the response variable and the explanatory variables are specified in this way, we find that
β0 = intercept
β1 = effect of latitude
υ1 = effect of Bradford
υ2 = effect of Buxton
⋮
υ11 = effect of Witney.
The estimates of υ1 to υ11 are also the estimates of the parameters τ1 to τ11, the deviations of the town means from the regression line relating log(house price) to latitude, defined in Section 1.4. The decision to treat this model as a mixed model is equivalent to a decision to treat these parameters as values of a random variable, as described in Section 1.6.
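The construction of dummy variables described above can be illustrated by a minimal sketch in Python. It is not part of the original analysis: the function name `dummy_variables` and the four observations are hypothetical, and only three of the eleven towns are used.

```python
# Minimal sketch of dummy-variable (indicator) coding for a factor.
# The town names come from the text; the observations are hypothetical.

def dummy_variables(levels, observations):
    """Return one 0/1 indicator list per factor level."""
    return {
        level: [1 if obs == level else 0 for obs in observations]
        for level in levels
    }

towns = ["Bradford", "Buxton", "Witney"]                 # subset of the 11 towns
observed = ["Buxton", "Bradford", "Witney", "Buxton"]    # hypothetical data

z = dummy_variables(towns, observed)
for town in towns:
    print(town, z[town])
```

Each observation has the value 1 in exactly one of the indicator columns, which is what allows a factor to be written in the linear-model form.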
In addition to being linear, all the models considered so far have had residuals that can reasonably be assumed to be normally distributed. There are many other regression models, relating a response variable to one or more explanatory variables, that do not have these properties. As an example of a situation in which neither a linear model nor a normal distribution of the residuals is adequate, we can consider the results of an experiment to determine the toxicity of ammonia to a species of beetle, Tribolium confusum (Finney, 1971, Section 9.1, p. 177). The experiment was performed in two batches, each comprising a series of samples. These batches will be represented by dummy variables, Z1 and Z2, and the explanatory variable X is log10(concentration of ammonia) applied to each sample. In any sample, the number of dead beetles, R, must be an integer between 0 and the number of beetles in the sample, N. The data are shown in the spreadsheet in Table 10.1. (Data reproduced by kind permission of Cambridge University Press.)
Table 10.1 Mortality of the beetle Tribolium confusum at different concentrations of ammonia.
X = log10(concentration of ammonia), N = number of beetles in the sample, and R = number of dead beetles in the sample.
A | B | C | D | |
1 | batch! | X | N | R |
2 | 1 | 0.72 | 29 | 2 |
3 | 2 | 0.72 | 29 | 1 |
4 | 1 | 0.80 | 30 | 7 |
5 | 2 | 0.80 | 31 | 12 |
6 | 1 | 0.87 | 31 | 12 |
7 | 2 | 0.87 | 32 | 4 |
8 | 1 | 0.93 | 28 | 19 |
9 | 2 | 0.93 | 31 | 18 |
10 | 1 | 0.98 | 26 | 24 |
11 | 2 | 0.98 | 31 | 25 |
12 | 1 | 1.02 | 27 | 27 |
13 | 2 | 1.02 | 28 | 27 |
14 | 1 | 1.07 | 26 | 26 |
15 | 2 | 1.07 | 31 | 29 |
16 | 1 | 1.10 | 30 | 30 |
17 | 2 | 1.10 | 31 | 30 |
If the death of each individual is independent of that of every other individual, then the random variable R has a binomial distribution, the precise shape of which is determined by the value of N and by the probability that an individual beetle dies, designated by π. This statement can be written in symbolic shorthand as
$$R \sim \text{Binomial}(N, \pi)$$
The value of π may depend on the value of X under consideration and on the batch, that is, π may be a function of X and ‘batch’. This value can never be known, though it can be estimated from the data.
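A brief digression on the binomial distribution is required here. This distribution is defined by the statement
$$P(R = r) = \frac{N!}{r!\,(N-r)!}\,\pi^{r}(1-\pi)^{N-r} \qquad (10.3)$$
where P(R = r) is the probability that exactly r of the N individuals die, for r = 0, 1, …, N.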
For example, if
then in a sample of 30 beetles, the probability that 8 die is given by
Substituting each possible value of r from 0 to 30 into Equation 10.3, we obtain the distribution illustrated in Figure 10.1. For a fuller account of the binomial distribution, and why it occurs in such situations, see, for example, Snedecor and Cochran (1989, Sections 7.1–7.5, pp. 107–117) or Bulmer (1979, Chapter 6, pp. 81–90).
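The calculation of a binomial distribution of this kind can be sketched in Python as follows. The sample size of 30 is taken from the text, but the value of π used here (0.3) is an assumption made purely for illustration, and the function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(r, n, pi):
    """Probability that exactly r of n independent individuals respond."""
    return comb(n, r) * pi**r * (1 - pi)**(n - r)

N = 30     # sample size used in the text
PI = 0.3   # assumed probability of death (illustrative value only)

# Probability of every possible number of deaths, r = 0, 1, ..., N
dist = [binomial_pmf(r, N, PI) for r in range(N + 1)]

print(f"P(R = 8) = {dist[8]:.4f}")
print(f"Total probability = {sum(dist):.6f}")
```

Plotting `dist` against r gives a distribution of the general kind described in the text; the probabilities necessarily sum to 1.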
In a system of this kind, the relationship between the combined value of the explanatory variables (‘batch’ and X in this case) and π is often sigmoidal (S-shaped – see Figure 10.2). At one extreme of the range of the explanatory variables, the probability of the event under consideration (death in this case) is close to zero: at the other extreme it is close to one. There are two commonly used functions that specify a relationship of this form. One of these, the integral of the normal distribution, is the basis of a method called probit analysis (Finney, 1971). This method of fitting a sigmoid curve probably has the clearer conceptual basis: it is based on the assumption that an underlying variable, in this case the tolerance of the beetles to the toxin, is normally distributed, and that at any given dose, all individuals up to a certain level of tolerance are killed. The alternative is the logistic function, which we will use here as it has the advantage of being rather easier to express algebraically: in the present case, it is
where
and the βs and υs are parameters to be estimated, namely
= | constant | |
= | effect of X | |
= | effect of Batch 1 | |
= | effect of Batch 2 | |
= | effect of interaction between X and Batch 1 | |
= | effect of interaction between X and Batch 2. |
The relationship between the batches and the dummy variables is as shown in Table 10.2. The value pij gives an estimate of π based on the information from the ijth sample, applicable to the batch and the level of X in question. However, the fitting of the logistic function will give estimates based on all the data, for any combination of levels of ‘batch’ and X.
Table 10.2 The relationship between batches and dummy variables in the model fitted to the data on beetle mortality.
Batch | Z1 | Z2 |
1 | z11 = 1 | z21 = 0 |
2 | z12 = 0 | z22 = 1 |
The relationship between Y and the explanatory variables in this equation is not linear. However, if we ignore the residual term for the time being, we can transform the equation to the familiar linear-model form, as follows:
The function log(p/(1 − p)) is known as the logit function. Models of this type, which can be expressed in linear form by dropping the residual term and applying a suitable transformation to the response variable, are known as generalized linear models. Every generalized linear model is characterized by a probability distribution and a link function: in the present case, the binomial distribution and the logit function. For each probability distribution, there is a particular link function known as the canonical link which has special mathematical properties, including the fact that when the residual term is the only random-effect term in the model, it always gives a unique set of parameter estimates, the sufficient statistic (McCullagh and Nelder, 1989, Sections 2.2.2–2.2.4, pp. 28–32). In the case of the binomial distribution, the logit function is the canonical link – another reason for preferring it to the probit function, the corresponding function used in probit analysis.
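The relationship between the logistic function and the logit link can be sketched in Python as follows. The function names and the values of the linear predictor are chosen for illustration only; they are not estimates from the beetle data.

```python
from math import exp, log

def logistic(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))

def logit(p):
    """Link function: map a probability in (0, 1) back to the linear scale."""
    return log(p / (1.0 - p))

# Illustrative values of the linear predictor
for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = logistic(eta)
    print(f"eta = {eta:5.1f}   p = {p:.4f}   logit(p) = {logit(p):7.4f}")
```

The logit of the logistic function returns the original linear predictor, which is why applying the link function to the non-linear model recovers the familiar linear form.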
In the notation of Wilkinson and Rogers (1973; see Section 2.2), the model specified is
It is reasonable to specify ‘X’ as a fixed-effect term, and ‘batch’ as a random-effect term, from which it follows that ‘X.batch’ is also a random-effect term (see the guidelines in Section 6.3). We have then specified a generalized linear mixed model (GLMM).
This GLMM can be fitted to the data by the following GenStat statements:
IMPORT "IMM edn 2\Ch 10\ammonia Tribolium.xlsx"
GLMM [PRINT = model, monitoring, components, vcovariance, effects;
DISTRIBUTION = binomial; LINK = logit; DISPERSION = *;
RANDOM = batch + X.batch; FIXED = X] Y = R; NBINOMIAL = N
VDISPLAY [PRINT = effects; PTERMS = batch/X; PSE = estimates]
In the GLMM statement, the DISTRIBUTION option specifies that the response variate ‘R’ follows the general form of the binomial distribution, though its variance may be greater or less than that of a binomial variable, as will be specified in a moment. The LINK option specifies that the logit function has been chosen as the link function. If the assumptions that each death is independent (conditional on π) and that R follows the binomial distribution in every respect were correct, it would follow that
$$\operatorname{var}(R) = N\pi(1-\pi)$$
and there would be no need to estimate the residual variance from the data. We could indicate that we were willing to make this assumption by setting the option ‘DISPERSION = 1’: instead, by setting this option to a missing value (‘*’) we indicate that the residual variance is to be estimated. The options RANDOM and FIXED specify the random-effect and fixed-effect model terms, respectively. It is possible that there is a correlation between the effects of ‘batch’ and those of ‘X.batch’, but with only two batches there is not enough information to estimate this and it cannot be included in the model. The parameter Y specifies the response variate, and the parameter NBINOMIAL specifies the variate that holds the number of observations in each sample, that is, the maximum value that each value of the response variate might take.
The output of the GLMM and VDISPLAY statements is as follows:
The output first specifies the fitting method used and the model fitted. Next comes some monitoring information, indicating the estimated values of certain model parameters at successive iterations of the model-fitting process, leading to convergence, that is, successful fitting (see Section 10.5). Next come estimates of the variance components for the random-effect terms, with their SEs. The estimate for ‘batch’ is smaller than its SE, and that for ‘batch.X’ is zero, indicating that these terms could probably be dropped from the model. The estimate of the residual variance is presented in terms of the dispersion, which is given by the ratio
In the present case, the dispersion is substantially larger than 1, indicating that there is more residual variation from sample to sample than would be expected if the distribution were truly binomial. Next comes the matrix of covariances among these variance estimates. The values on the diagonal are simply the squares of the SEs above – the variances of the variance estimates. For example,
The off-diagonal values indicate the extent to which the estimate of one variance component is associated with that of another.
Next come the estimated effects of each model term. These can be substituted into the original model (Equation (10.4)), together with the value
to provide an estimate of π, the true probability that an insect dies, for any combination of batch and X, thus
This function is displayed over the range of the data, together with the observed values, in Figure 10.3.
Overall, the curves give a reasonable fit to the data, permitting a realistic estimate of the proportion of insects that would be killed by a particular concentration of ammonia. The effect of batch corresponds to the horizontal displacement between the two curves. This is small relative to the scatter of the data, confirming that the batch has little if any effect. The ‘X.batch’ interaction effect corresponds to the difference in average slope between the two curves, which is much too slight to be detected by eye. The corresponding variance component estimate is zero, and it might be preferable to drop these terms from the model, but they are included in the substitution into Equation (10.4) in order to show how the equation is constructed. The fact that the dispersion is larger than 1 indicates that the scatter of the observations about the curves is wider than would be expected if the deaths of individuals within each sample were independent: that is, there is evidence of some heterogeneity in the conditions of each sample, even after allowing for the effects of ammonia and batch. Because the ‘X.batch’ interaction effect is negligible, the decision to use a mixed model has had little effect on the precision of the parameter estimates in this case. However, if this effect were substantial, it would be important to take it into account in order to obtain realistic values for the SEs of the fixed-effect parameter estimates.
The following commands import the data into R, convert batch to a factor and fit the model in Equation (10.4) to the data:
# Clear the workspace
rm(list = ls())
# Read the data; forward slashes avoid invalid escape sequences in the path
ammonia.tribolium <- read.table(
  "IMM edn 2/Ch 10/ammonia tribolium.txt",
  header = TRUE)
attach(ammonia.tribolium)
# Treat batch as a factor
fbatch <- factor(batch)
library(lme4)
# Two-column response: number responding, number not responding
responses <- cbind(R, N - R)
ammonia.tribolium.glmer <- glmer(responses ~ X + (1|fbatch) +
  (X|fbatch), family = binomial(link = "logit"),
  data = ammonia.tribolium)
summary(ammonia.tribolium.glmer)
coef(ammonia.tribolium.glmer)
The function cbind() combines the number of individuals that respond (R) in each sample and the number that do not respond (N − R) into a single structure, ‘responses’. The model specified in the glmer() function indicates that the main effect of batch is a random-effect term, specified by (1|fbatch), and that the X.batch interaction is also a random-effect term, specified by (X|fbatch). The family option in this function indicates the distribution of the response variable and the link function. The output of the function summary() is as follows:
The estimates of effects are generally different from those obtained from GenStat, because GenStat fits the model centred on the mean value of X, whereas R fits it without centring. One consequence of this is that it is now the main effect of batch that has a variance component estimate of zero, while the X.batch interaction has a positive variance component estimate. The function coef() reports the intercept and the coefficient of X for the two levels of fbatch directly, rather than as departures from the fixed-effect estimates as specified in Equation (10.4). Adapting this equation accordingly, and substituting the coefficient estimates, the probability that an insect dies is estimated by
The curves produced by this function do not agree closely with those produced by GenStat, but agree closely with those produced by SAS, and fit the data fairly well. No straightforward method of obtaining an estimate of the dispersion parameter from R is known to the author.
The following commands import the data into SAS and fit the model in Equation (10.4) to the data:
PROC IMPORT OUT = ammonia DBMS = EXCELCS REPLACE
DATAFILE = "&pathname.IMM edn 2\Ch 10\ammonia Tribolium.xlsx";
SHEET = "for SAS";
RUN;
ODS RTF;
PROC GLIMMIX DATA = ammonia METHOD = RSPL ASYCOV;
CLASS batch;
MODEL R/N = X/
DISTRIBUTION = BINOMIAL LINK = LOGIT SOLUTION;
RANDOM batch batch * X/ SOLUTION;
RANDOM _RESIDUAL_;
RUN;
ODS RTF CLOSE;
The logistic model is fitted by the procedure GLIMMIX. The option NOBOUND of this procedure is not set, so variance-component estimates are constrained to be non-negative: without this constraint, the model-fitting process fails to converge. The MODEL statement indicates that the response variable is the number of individuals responding, R, out of a total N, and gives the fixed-effect model. In this statement, the DISTRIBUTION option specifies the distribution of the response variable, the LINK option specifies the link function and the SOLUTION option specifies that the regression coefficients for the fixed-effect terms are to be displayed in the output. The first RANDOM statement indicates the terms in the random-effects model. The second RANDOM statement indicates that the dispersion parameter, related to the residual variance, is to be estimated, not fixed at 1.
Part of the output from PROC GLIMMIX is as follows:
Covariance parameter estimates
Cov Parm | Estimate | Standard error |
batch | 0 | |
X*batch | 0.09689 | 0.2797 |
Residual (VC) | 2.2192 | 0.8704 |
Asymptotic covariance matrix of covariance parameter estimates
Cov Parm | CovP1 | CovP2 | CovP3 |
batch | | | |
X*batch | | 0.07824 | −0.03379 |
Residual (VC) | | −0.03379 | 0.7576 |
Solutions for fixed effects | |||||
Effect | Estimate | Standard error | DF | t value | Pr > |t| |
Intercept | −15.8489 | 2.0920 | 1 | −7.58 | 0.0835 |
X | 17.8700 | 2.3105 | 1 | 7.73 | 0.0819 |
Type III tests of fixed effects | ||||
Effect | Num DF | Den DF | F value | Pr > F |
X | 1 | 1 | 59.82 | 0.0819 |
Solution for random effects | ||||||
Effect | batch | Estimate | Std Err Pred | DF | t value | Pr > |t| |
batch | 1 | 0 | ||||
batch | 2 | 0 | ||||
X*batch | 1 | 0.1548 | 0.2700 | 12 | 0.57 | 0.5771 |
X*batch | 2 | −0.1548 | 0.2700 | 12 | −0.57 | 0.5771 |
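As a check on the relationship between the SEs of the covariance parameter estimates and the asymptotic covariance matrix in the output above, the following Python sketch squares each SE and compares it with the corresponding diagonal value. It also converts the off-diagonal covariance to a correlation, which is an illustrative addition rather than part of the SAS output. All numerical values are copied from the output; the variable names are hypothetical.

```python
from math import sqrt

# Values copied from the PROC GLIMMIX output
standard_errors = {"X*batch": 0.2797, "Residual (VC)": 0.8704}
reported_variances = {"X*batch": 0.07824, "Residual (VC)": 0.7576}
covariance = -0.03379   # between the X*batch and residual estimates

# Diagonal values should equal the squared SEs, apart from rounding
for name, se in standard_errors.items():
    print(f"{name}: SE squared = {se**2:.5f}, reported = {reported_variances[name]}")

# Correlation between the two variance estimates
correlation = covariance / sqrt(
    reported_variances["X*batch"] * reported_variances["Residual (VC)"])
print(f"Correlation between estimates = {correlation:.3f}")
```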
The estimates of effects are generally different from those obtained from GenStat, because GenStat fits the model centred on the mean value of X, whereas SAS fits it without centring. One consequence of this is that it is now the main effect of batch that has a variance component estimate of zero, while the X.batch interaction has a positive variance component estimate. PROC GLIMMIX reports the intercept and the coefficient of X for the two levels of batch directly, rather than as departures from the fixed-effect estimates as specified in Equation (10.4). Adapting Equation (10.4) to take account of the absence of centring, and substituting the coefficient estimates into the model, the probability that an insect dies is estimated by
The curves produced by this function do not agree closely with those produced by GenStat, but agree closely with those produced by R and fit the data fairly well.
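The way in which the estimates in the SAS output combine to give an estimated probability of death can be sketched in Python as follows. This is an illustration under stated assumptions, not a reproduction of the equation in the text: it assumes that the slope for each batch is the fixed-effect coefficient of X plus the corresponding X*batch solution, that the batch main effects are zero as reported, and that X is not centred. The function name `estimated_pi` is hypothetical.

```python
from math import exp

# Estimates copied from the PROC GLIMMIX output
INTERCEPT = -15.8489                      # fixed-effect intercept
SLOPE = 17.8700                           # fixed-effect coefficient of X
BATCH_EFFECT = {1: 0.0, 2: 0.0}           # random-effect solutions for batch
SLOPE_EFFECT = {1: 0.1548, 2: -0.1548}    # random-effect solutions for X*batch

def estimated_pi(x, batch):
    """Estimated probability of death at log10(concentration) x in a batch."""
    eta = INTERCEPT + BATCH_EFFECT[batch] + (SLOPE + SLOPE_EFFECT[batch]) * x
    return 1.0 / (1.0 + exp(-eta))

# Lowest, intermediate and highest values of X in Table 10.1
for x in (0.72, 0.93, 1.10):
    print(x, [round(estimated_pi(x, b), 3) for b in (1, 2)])
```

The two batches give nearly identical curves, consistent with the small estimated X*batch effects.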
There are several other types of data that cannot be realistically represented by an ordinary linear model with normally distributed residual variation, but which do fulfil the criteria for fitting a GLMM, namely
Hence the use of GLMMs permits a wide extension to the range of situations in which the concepts of mixed modelling can be applied. Another important case in which generalized linear models can be used is the analysis of contingency tables. These are data sets in which events of a particular type are counted and are classified by factors that indicate the combination of circumstances, or contingency, in which each event occurred. This type of data set is illustrated by an example concerning the frequency of damage caused by waves to the forward sections of cargo-carrying ships (McCullagh and Nelder, 1989, Section 6.3.2, pp. 204–208). Each occurrence of damage is classified by the type of ship to which it occurred (A to E), the year of construction of the ship and its period of operation. For each category defined by these three factors – that is, each contingency – the number of incidents of damage observed was recorded, along with the number of months of service over which observations were available. The first and last few rows of the data are shown in the spreadsheet in Table 10.3. The null hypothesis to be tested (H0) is that none of the factors influenced the frequency of incidents, and in this case the number of incidents in each category is expected to be proportional to the number of months of service. (Data reproduced by kind permission of Chapman and Hall.)
Table 10.3 Number of incidents of damage to ships, classified by ship type, year of construction and period of operation.
A | B | C | D | E | |
1 | type! | constrctn_y! | operatn_period! | service_months | damage_incidents |
2 | A | 1960–64 | 1960–74 | 127 | 0 |
3 | A | 1960–64 | 1975–79 | 63 | 0 |
4 | A | 1965–69 | 1960–74 | 1095 | 3 |
5 | A | 1965–69 | 1975–79 | 1095 | 4 |
6 | A | 1970–74 | 1960–74 | 1512 | 6 |
7 | A | 1970–74 | 1975–79 | 3353 | 18 |
8 | A | 1975–79 | 1960–74 | ||
9 | A | 1975–79 | 1975–79 | 2244 | 11 |
10 | B | 1960–64 | 1960–74 | 44882 | 39 |
11 | B | 1960–64 | 1975–79 | 17176 | 29 |
· | · | ||||
· | · | ||||
· | · | ||||
34 | E | 1960–64 | 1960–74 | 45 | 0 |
35 | E | 1960–64 | 1975–79 | 0 | 0 |
36 | E | 1965–69 | 1960–74 | 789 | 7 |
37 | E | 1965–69 | 1975–79 | 437 | 7 |
38 | E | 1970–74 | 1960–74 | 1157 | 5 |
39 | E | 1970–74 | 1975–79 | 2161 | 12 |
40 | E | 1975–79 | 1960–74 | ||
41 | E | 1975–79 | 1975–79 | 542 | 1 |
We can represent the number of incidents of damage to ships of the ith type, constructed during the jth range of years, during the kth period of operation (the ijkth category) by the symbol rijk. If each damage incident is independent of all the others, then it can be shown that rijk is an observation of a random variable Rijk which has a Poisson distribution, the mean of which is given by Nπijk where
and
This statement can be written in symbolic shorthand as
$$R_{ijk} \sim \text{Poisson}(N\pi_{ijk})$$
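Again a brief digression is required, this time on the Poisson distribution. This is defined by the statement that if
$$R \sim \text{Poisson}(\mu)$$
where μ is the mean of the distribution, then
$$P(R = r) = \frac{e^{-\mu}\mu^{r}}{r!}, \qquad r = 0, 1, 2, \dots$$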
For example, if the expected number of incidents of damage in a particular category (the mean number over an infinite hypothetical population of data sets similar to the present data set) is 8, then the observed number of incidents will be distributed as shown in Figure 10.4. For a fuller account of the Poisson distribution, and why it occurs in this context, see, for example, Snedecor and Cochran (1989, Section 7.14, pp. 130–133) or Bulmer (1979, pp. 90–97).
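The Poisson distribution for an expected number of 8 incidents, as in the example above, can be calculated with the following Python sketch. The function name `poisson_pmf` is hypothetical, and the range of r is truncated at 30 for display, beyond which the probabilities are negligible for this mean.

```python
from math import exp, factorial

def poisson_pmf(r, mean):
    """Probability of observing exactly r events when the expected number is mean."""
    return exp(-mean) * mean**r / factorial(r)

MEAN = 8   # expected number of incidents, as in the example in the text

# Probabilities for r = 0, 1, ..., 30
probs = [poisson_pmf(r, MEAN) for r in range(31)]

for r in (0, 4, 8, 12, 16):
    print(f"P(R = {r:2d}) = {probs[r]:.4f}")
print(f"Total over r = 0..30: {sum(probs):.6f}")
```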
In the present case, if H0 is true, then
where
In this case, the expected number of incidents in the ijkth category is Naijk, and
where
We can add terms to this model to represent the possibility that the factors type of ship, year of construction and period of operation and their interactions influence the probability of a damage incident, as follows:
where
N and the qs are parameters to be estimated from the data. This model, like that in Equation (10.4), can be transformed to the linear-model form by ignoring the residual term and applying a suitable transformation, in this case the logarithmic transformation, namely
However, the term log(aijk) does not have to be estimated: it is a separate variable supplied with the data. It is equivalent to a term β log(aijk) when the value of the parameter β is fixed at 1. Such a term is called an offset. In the notation of Wilkinson and Rogers (1973), the model specified here (excluding the offset term) is
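The role of an offset can be illustrated by a minimal Python sketch. The value of the linear predictor and the two exposure shares are hypothetical; the point is only that, because the coefficient of the offset is fixed at 1, the expected count on the original scale is directly proportional to the exposure.

```python
from math import exp, log

def expected_count(exposure_share, linear_predictor):
    """Expected count under a log link with an offset of log(exposure_share)."""
    offset = log(exposure_share)   # coefficient fixed at 1, not estimated
    return exp(offset + linear_predictor)

eta = 5.0   # hypothetical value of the remaining terms in the linear model

small = expected_count(0.01, eta)   # category with 1% of the service months
large = expected_count(0.02, eta)   # category with 2% of the service months

print(f"Expected counts: {small:.3f} and {large:.3f}")
print(f"Ratio: {large / small:.3f}")
```

Doubling the exposure doubles the expected count, whatever the value of the other terms in the model.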
The following statements import the data and fit this model, specifying all terms as fixed-effect terms for the time being:
IMPORT 'IMM edn 2\Ch 10\ship damage.xlsx'
CALCULATE logservice = LOG(service_months/SUM(service_months))
MODEL [DISTRIBUTION = poisson; LINK = log; DISPERSION = 1;
OFFSET = logservice] damage_incidents
TERMS type*constrctn_y*operatn_period
FIT [PRINT = *] type
ADD [PRINT = *] constrctn_y
ADD [PRINT = *] operatn_period
ADD [PRINT = *] type.constrctn_y
ADD [PRINT = *] type.operatn_period
ADD [PRINT = model, accumulated; FPROBABILITY = yes]
constrctn_y.operatn_period
The CALCULATE statement obtains the natural logarithm of the proportion of months of service in each category, transforming this variable to the scale on which it will be required in the model. A message in the output (not shown here) warns that in the case of Unit 34 (Row 35 of the spreadsheet) an attempt has been made to obtain the logarithm of zero, and that the result is a missing value. Consequently this unit is omitted from the analysis, which is appropriate, as it represents a category of ship that spent no time at sea, and therefore was not exposed to the risk of damage. In the MODEL statement, the DISTRIBUTION option specifies that the response variate (‘damage_incidents’) follows the general form of the Poisson distribution, though its variance may be greater or less than that of a Poisson variable, unless we constrain it using the DISPERSION option (see below). The LINK option specifies the function required to transform the model to the linear form, just as the same option did in the GLMM statement in the previous example (Section 10.2). If the assumption that each damage incident is independent and that R follows the Poisson distribution in every respect is correct, then it follows that
$$\operatorname{var}(R_{ijk}) = N\pi_{ijk}$$
and there is no need to estimate the residual variance from the data. (It is a peculiarity of the Poisson distribution that its variance is equal to its mean.) The option setting ‘DISPERSION = 1’ indicates that we are willing to make this assumption. The offset term is specified by the OFFSET option.
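The calculation of the offset variate, including the treatment of a category with no months of service, can be sketched in Python as follows. Only the rows for ships of Type E shown in Table 10.3 are used, so the total and the resulting values are not those of the full analysis; the function name `log_share` and the use of `None` for a missing value are assumptions of this sketch.

```python
from math import log

# service_months for the Type E rows shown in Table 10.3 (None = empty cell)
service_months = [45, 0, 789, 437, 1157, 2161, None, 542]

# Total over the rows used here only, not over the full data set
total = sum(m for m in service_months if m is not None)

def log_share(months):
    """Natural log of the share of service months; None if it cannot be formed."""
    if months is None or months <= 0:
        return None   # log of zero is undefined, so the unit is treated as missing
    return log(months / total)

logservice = [log_share(m) for m in service_months]
for months, value in zip(service_months, logservice):
    print(months, value)
```

The category with zero months of service receives a missing value and would therefore drop out of the analysis, as described in the text.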
In order to obtain an analysis in which the deviance accounted for by each model term is shown separately, it is necessary to specify each of the terms in a separate statement. The FIT statement specifies the model on H0, comprising only the constant term and the offset term. The option setting ‘PRINT = *’ indicates that no output is to be produced from this initial model. The main effect and interaction terms are added by a succession of ADD statements, again with no printing, until the final ADD statement specifies that the complete model and the accumulated analysis of deviance are to be printed. The analysis of deviance is a method closely related to analysis of variance: in the special case where the residual variation is normally distributed (i.e. in all the models considered prior to the present chapter), the two are equivalent. The term accumulated indicates that the deviance accounted for by adding each term to the model successively is to be presented (cf. Section 1.4, where an anova constructed on the same basis is presented). The FPROBABILITY option specifies that the analysis of deviance table is to include a p-value for the significance of each term in the model. Note that it is not necessary to fit the three-way interaction term ‘type.constrctn_y.operatn_period’ explicitly, as there is only one observation for each combination of these factors, and this term is therefore the residual term.
The output of the final ADD statement is as follows:
A message first warns that the term ‘constrctn_y.operatn_period’ cannot be fully included in the model. This is because of the presence of missing values, as a result of which not all combinations of levels of the three factors are represented in the data. The model fitted and the fitting method used are then specified. The accumulated analysis of deviance then shows how the variation among the values of ‘damage_incidents’ (after adjusting for ‘logservice’) is distributed among the terms in the model. Roughly speaking, the deviance per degree of freedom – the mean deviance – gives a measure of the amount of variation accounted for by each term, when the terms are added successively to the model. Thus the mean deviance corresponds to the mean square in an analysis of variance. It is divided by the dispersion parameter to give the deviance ratio: since the dispersion parameter has been fixed at 1 (i.e. ‘damage_incidents’ has been assumed to follow a Poisson distribution in every respect, including the value of the variance), the two are identical. If the dispersion parameter had not been specified, it would be estimated by the residual mean deviance: hence the deviance ratio is equivalent to the variance ratio (the F statistic) in an analysis of variance. If the assumption that ‘damage_incidents’ follows a Poisson distribution is correct, and if H0 is true, then the deviance for each term is distributed approximately as χ2 with the degrees of freedom (d.f.) indicated. Thus each deviance provides a significance test for the term in question, and the column headed ‘approx chi pr’ gives the corresponding p-value. These p-values indicate that the main effects of the three factors are highly significant, and that the ‘type.constrctn_y’ interaction is also significant, but that the other interactions are not.
It is of interest to obtain estimates of the mean frequency of damage to ships of each type, but it is not possible to do so from the model fitted above, because some pair-wise combinations of factor levels are not represented in the data. However, we can omit the non-significant two-way interaction terms from the model and obtain estimates based on the simpler model comprising only the main effects of the three factors and the type.constrctn_y interaction. This is done by the following statements:
FIT [PRINT = *] type * constrctn_y + operatn_period
PREDICT [PRINT = description, prediction, se, sed] type
The output of the PREDICT statement is as follows:
The note that the estimated means are formed on the scale of the response variable indicates that they must be back-transformed, using the inverse of the link function, in order to obtain them on the original scale, namely the number of incidents of damage. For example, the mean value for ships of Type A corresponds to
The accompanying SE indicates that the true value is likely to lie in the range from
to
Another note states that these estimates are based on a value of −4.956 for the offset variate ‘logservice’. This means that they are based on the value
where
Although A varies from category to category in the data, it is held constant for the purpose of prediction, in order to provide a valid basis for comparison between risks in the different categories. From this decision, it follows that
where
Totalling the values of ‘service_months’ in the data, we obtain
rearranging Equation 10.15 we obtain
and substituting from Equations 10.17 and 10.18 into Equation 10.16, we find that the predicted numbers of incidents are based on
Another note indicates that the calculation of this mean has required averaging over the levels of the other factors, and that in this process marginal weights have been applied: that is, operation periods and construction years that are more heavily represented in the data contribute more heavily to the mean. However, it is noted that the status of weights is constant over levels of other factors: this means that a combination of ‘operatn_period’ and ‘constrctn_y’ that is more heavily represented in the data than the marginal weights would lead one to predict does not contribute more heavily to the mean. Another note states that the SEs are not appropriate for the forecasting of new observations: that is, they indicate the precision of the means themselves, not the amount of variation among individual observations. As noted earlier (Section 4.6), SEs of differences between means are usually of more interest than SEs of the means themselves, and these are provided, for all pair-wise comparisons, on the transformed scale.
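The back-transformation of predictions from the scale of the link function to the original scale can be sketched in Python as follows. The offset value of −4.956 is the one quoted in the text; the predicted mean and its SE on the log scale are hypothetical values chosen for illustration, not the values in the GenStat output.

```python
from math import exp

OFFSET = -4.956            # value of 'logservice' quoted in the text
share = exp(OFFSET)        # corresponding proportion of the total service months

log_mean = 1.0             # hypothetical predicted mean on the log scale
se = 0.2                   # hypothetical SE on the log scale

# Inverse of the log link: exponentiate the mean and the mean plus or minus one SE
mean = exp(log_mean)
lower = exp(log_mean - se)
upper = exp(log_mean + se)

print(f"Proportion of service months implied by the offset: {share:.5f}")
print(f"Back-transformed mean: {mean:.3f} (range {lower:.3f} to {upper:.3f})")
```

Because the SE is applied on the log scale, the back-transformed range is asymmetric about the mean: the upper limit is further from the mean than the lower limit.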
If ‘constrctn_y’ and ‘operatn_period’ are representative of a broader population of construction and operation dates, it may be reasonable to specify these factors as random-effect terms. However, it may be reasonable to retain ‘type’ as a fixed-effect term representing particular methods of construction that are of individual interest, and that may be amenable to choice or control: ship builders can decide what type of hull to construct, and insurers can decide what premiums to offer on each type of ship. According to the guidelines given in Section 6.3, it follows that the remaining terms in the model should then be specified as random-effect terms. The following statements specify the fitting of this model:
GLMM [DISTRIBUTION = poisson; LINK = log; DISPERSION = 1;
OFFSET = logservice;
RANDOM = constrctn_y * operatn_period + type.constrctn_y +
type.operatn_period; FIXED = type] Y = damage_incidents
The structure of this statement corresponds to that of the GLMM statement in Section 10.2: the only new feature is the OFFSET option, which has the same meaning as that in the MODEL statement earlier in this section. However, the output from this statement begins with the following diagnostic messages:
It is therefore safest to ignore the rest of the output. Whereas the methods of ordinary regression analysis and analysis of variance can be applied without arithmetic problems to almost any data set, the same is not true of mixed modelling, and certainly not of GLMM. We will now examine the kinds of problems that can be encountered, in order to interpret these warning messages.
In ordinary regression analysis and analysis of variance, analytical formulae are applied which give, in a single step, the best estimates, according to a certain criterion, of the parameters of the model being fitted. The criterion used to identify the ‘best’ parameter estimates is that they should be the values that maximize the probability of the observed data, and hence minimize the value of the deviance (a concept introduced in Section 3.12). These are said to be the parameter estimates with the highest likelihood: as noted in Section 3.12,
When all the random-effect terms are assumed to be normally distributed (as is the case in all the examples considered in earlier chapters), these estimates are also those that minimize the estimate of the residual variance. However, in mixed modelling, the maximum-likelihood estimates cannot generally be obtained in a single step: a search must be made for them. In broad terms, the search strategy is to start from an arbitrary set of parameter values, then determine a set of changes that can be made to these that will increase the value of the likelihood. This process is repeated, the change made getting smaller at each iteration, until no change that produces a further increase can be found. At this point, the model-fitting process is said to have converged. The process is analogous to trying to climb a mountain in fog, following the rule ‘always walk uphill’. The process will lead to the summit (analogous to the maximum-likelihood estimates), provided that
If these criteria are not met, one may walk uphill indefinitely (failure of convergence), or the process may lead to a small peak on the flank of the likelihood ‘mountain’, not its true summit (convergence to a local maximum). The latter problem may be recognized by unrealistic parameter estimates and a poor fit to the data. Sometimes these problems can be overcome by a careful choice of initial parameter values for the fitting process, giving the process a better chance of ‘climbing the right peak’.
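The search strategy described above can be illustrated with a deliberately simple case. The following R sketch is not the algorithm used by GenStat, glmer() or PROC GLIMMIX: it applies Newton's method to the log-likelihood of a single Poisson log-rate with an offset, using invented counts and exposures (the vectors `y` and `exposure` are hypothetical), and records the log-likelihood and the size of the change at each iteration.

```r
# Hypothetical data: counts of incidents and the exposure for each unit.
y        <- c(3, 0, 7, 2, 11, 4)
exposure <- c(120, 45, 310, 90, 400, 150)

# Poisson log-likelihood (up to a constant) for a common log-rate beta,
# with log(exposure) as the offset.
loglik <- function(beta) sum(y * (beta + log(exposure)) - exposure * exp(beta))

beta  <- -1            # arbitrary starting value
steps <- numeric(0)    # size of the change made at each iteration
ll    <- loglik(beta)  # log-likelihood at each stage of the search
for (iter in 1:50) {
  mu       <- exposure * exp(beta)   # fitted counts at the current estimate
  gradient <- sum(y - mu)            # slope of the log-likelihood
  hessian  <- -sum(mu)               # curvature (always negative here)
  change   <- -gradient / hessian    # Newton step, in the uphill direction
  beta     <- beta + change
  steps    <- c(steps, abs(change))
  ll       <- c(ll, loglik(beta))
  if (abs(change) < 1e-8) break      # converged: no worthwhile change left
}

# For this one-parameter model the maximum is also known analytically.
beta_closed_form <- log(sum(y) / sum(exposure))
print(cbind(iteration = seq_along(steps), step = steps))
print(c(iterative = beta, closed_form = beta_closed_form))
```

In this toy problem the likelihood has a single peak, so the search reaches it from the chosen starting value. A starting value far from the maximum can make the first Newton step overshoot badly, which is one way in which the failures described above arise.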
In the present case, the model-fitting process has encountered severe difficulties. Generally, such problems occur because the model being fitted does not represent the pattern of variation in the data well: a review of the model may suggest terms that can be dropped, or others that should be added. In the output above, the message that negative variance components are present suggests that the difficulties may have occurred because some of the model terms have little or no effect: if such terms are dropped, model-fitting may be more successful. Inspection of the results produced when all terms were specified as fixed-effect terms indicates that the non-significant terms ‘type.operatn_period’ and ‘constrctn_y.operatn_period’ are candidates for omission. Fitting of the resulting model is specified by the following statement:
GLMM [DISTRIBUTION = poisson; LINK = log; DISPERSION = 1;
OFFSET = logservice;
RANDOM = constrctn_y + operatn_period + type.constrctn_y; FIXED = type;
PRINT = model, monitoring, components, vcovariance, means,
backmeans, waldtests] Y = damage_incidents
The output produced by this statement is as follows:
This simpler model has been successfully fitted. The output follows the same general form as that from the GLMM fitted in Section 10.2. The statement of the fitting method used and the model fitted is followed by a warning about the missing values in the data. Next comes the monitoring information on the fitting process, which shows that convergence has been achieved: this is indicated by the values of the gammas, statistics closely related to the variance components for the three random-effect terms (see Section 10.10), which hardly change between the sixth and seventh iterations. Next come the estimates of the variance components. Although the main effects of ‘constrctn_y’ and ‘operatn_period’ were significant in the fixed-effects-only analysis, the variance component estimates for these terms are smaller than their respective SEs. It should be remembered that in the fixed-effects-only analysis, each term was tested against the residual deviance: the significance of these terms may have been due to other variance components that contribute to their deviance. Several variance components can contribute to the deviance for a particular model term, just as several can contribute to an expected mean square in an anova (see Section 3.3). Next comes the variance–covariance matrix among these variance component estimates: the interpretation of such a matrix is explained in Section 10.2. Next come the tests of the significance of the fixed-effect term ‘type’. The F statistic shows that this falls short of significance. Nevertheless, the estimated mean for each level of ‘type’ is then presented, on the transformed scale, that is, the scale after the link function has been applied to the model: in this case, the logarithmic scale. These means are somewhat different from those obtained from the fixed-effects model: consistently larger, but less variable.
However, the ranking of the five types of ship is the same, Type C giving the fewest incidents of damage and Type E the most. The SEs of differences between these means are consistently smaller than those from the fixed-effects model. Overall, the variation among the means is somewhat less, relative to the SEs of differences, than when the fixed-effects model is used. This is perhaps to be expected, as the mixed model recognizes ‘constrctn_y.type’ as a random-effect term, which contributes to the SEs of differences between levels of ‘type’. Finally, the back-transformed means on the original scale (count of incidents) are presented. These are obtained from the means on the transformed scale using the inverse of the link function, namely the exponential function: for example,
In the GLMMs fitted above (Sections 10.2 and 10.5), it is assumed that although the residual variation may not be normally distributed, the other random-effect terms do follow the normal distribution. However, this is not always the most natural assumption. For example, it has been suggested (Lee and Nelder, 2001) that when a Poisson distribution and a logarithmic link function are specified for the residual variation (as in the model fitted in Section 10.5), a more appropriate assumption for the other random-effect terms might be a gamma distribution and a logarithmic link function. Such models, in which a probability distribution and a link function can be specified for each random-effect term, have been called hierarchical generalized linear models (HGLMs). However, this name is slightly misleading, as the random-effect terms do not have to be nested to form a hierarchy: they may be crossed, as in the model in Section 10.5. An alternative name, taking account of this possibility, would be stratified generalized linear models.
A system to fit HGLMs has been developed (Lee and Nelder, 1996, 2001), and is available in GenStat. The following statements use this system to fit the same GLMM as was fitted in Section 10.5, with no change to the distributions or link functions:
CALCULATE logservice1 = LOG(service_months)
SUBSET [logservice1 .NE. !s(*)]
type,constrctn_y,operatn_period,damage_incidents,logservice1
HGFIXEDMODEL [DISTRIBUTION = poisson; LINK = logarithm;
OFFSET = logservice1] type
HGRANDOMMODEL [DISTRIBUTION = normal; LINK = identity]
constrctn_y + operatn_period + type.constrctn_y
HGANALYSE [LMETHOD = eql] damage_incidents
HGPLOT METHOD = histogram, fittedvalues, normal, halfnormal
HGPLOT [RANDOMTERM = type.constrctn_y]
METHOD = histogram, fittedvalues, normal, halfnormal
Note that for presentation to the HGLM fitting system, the offset variable must be specified as ‘LOG(service_months)’, not ‘LOG(service_months/SUM(service_months))’. Note also that the HGLM fitting system is not able to cope with the missing values of the offset variate, so the SUBSET statement is used to exclude these from consideration, together with the corresponding values of the other variates and factors used. The HGFIXEDMODEL and HGRANDOMMODEL statements specify the fixed-effect and random-effect models, respectively. The HGFIXEDMODEL statement also specifies the distribution and link function for the residual term, and the offset variable, as was done in the equivalent GLMM statement (Section 10.5). The HGRANDOMMODEL statement further indicates that the other random-effect terms are to have a normal distribution and the identity link function, that is, no transformation, as was assumed implicitly in the GLMM statement. The HGANALYSE statement indicates that the response variable is ‘damage_incidents’, and that the model is to be fitted using a criterion known as extended quasi-likelihood, rather than the default criterion of exact likelihood. This specification produces output (not shown) that is numerically equivalent to that from the GLMM statement with regard to the fixed-effect estimates. The two HGPLOT statements produce diagnostic plots of the random effects, for the residual term and the term ‘type.constrctn_y’, respectively (Figure 10.5). The other two random-effect terms do not have enough levels to justify the production of diagnostic plots.
The distributions used in the model fitted here are not both normal, and the reference to normal plots and normal quantiles in the labelling of these diagnostic plots is therefore imprecise. However, the interpretation of the plots is the same as in models that assume a normal distribution (Section 1.10). In the case of the residual term, the histogram is reasonably bell-shaped and symmetrical, the scatter of the points in the fitted-value plot is reasonably even over the range of fitted values, and the points in the normal and half-normal plots lie reasonably close to a straight diagonal line. In the case of the term ‘type.constrctn_y’, the diagnostic plots do not conform so closely to these ideals: in particular, more will be said about the fitted-value plot in a moment.
In order to adopt the suggested distribution and link function for the random-effect terms other than the residual term, the HGRANDOMMODEL statement is modified to
HGRANDOMMODEL [DISTRIBUTION = gamma; LINK = logarithm]
constrctn_y + operatn_period + type.constrctn_y
The output of the HGANALYSE statement is then as follows:
The monitoring information at the beginning of the output shows the value of the dispersion parameter λi for each random-effect term (see below) at each stage of the model-fitting process, and the largest absolute change in this parameter value at each step. The largest absolute change between the final two cycles is very small, confirming that the model-fitting process has converged. The message ‘Aitken extrapolation OK’ indicates points at which the use of this method to accelerate convergence is acceptable. The model fitted is then specified. Next come the estimates of the constant and of the effects of ‘type’. These are calculated using ‘A’ as the reference level of ‘type’, so the corresponding means are given by
and
for the other types. The values obtained are presented in Table 10.4. They differ from those given by the GLMM by approximately log_e(T) = log_e(163574) = 12.005, because the offset of the HGLM was specified as log_e(service months) whereas that of the GLMM was specified as log_e(service months/sum(service months)). Hence they are based on A = e^(−4.956) = 0.0070410 months of service.
Table 10.4 Predicted number of damage incidents suffered by each type of ship, per 0.0070410 months of service, transformed to natural logarithms, obtained from an HGLM.
Type | ||||
A | B | C | D | E |
−5.624 | −6.203 | −6.353 | −5.774 | −5.185 |
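The arithmetic linking the two offsets can be checked directly. The following R sketch uses only the total T and the exponent quoted in the text, together with the values in Table 10.4; the object names are illustrative. Adding log_e(T) to the HGLM means places them on the scale used by the GLMM, where they can be compared with the GLMM means, from which the text says they differ only approximately by this amount.

```r
# Values quoted in the text and in Table 10.4.
total_service <- 163574                      # T, total months of service
hglm_means <- c(A = -5.624, B = -6.203, C = -6.353, D = -5.774, E = -5.185)

shift <- log(total_service)                  # natural logarithm of T
print(round(shift, 3))

# Adding the shift moves the HGLM means to the scale of the GLMM means.
print(round(hglm_means + shift, 3))

# Months of service corresponding to the quoted exponent.
print(signif(exp(-4.956), 5))
```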
The three parameter estimates from the dispersion model indicate the deviance due to the random-effect terms ‘constrctn_y’, ‘operatn_period’ and ‘constrctn_y.type’, respectively. These estimates are given relative to the dispersion: that is,
In the present case, we have specified that
so rearranging Equation 10.20 and substituting the numerical values from the output and from Equation 10.21, we obtain, for example,
The deviances for the random-effect terms can be interpreted and compared in roughly the same way as estimates of variance components. For example, each is compared in the output with its own SE, to obtain a t statistic that gives a tentative indication of whether any variation is accounted for by the term (cf. Section 3.11, where the analogous statistic based on estimates of variance components is referred to as a z statistic). In the present case, all three dispersion parameter estimates are negative, although GLMM gave small positive values for the corresponding variance component estimates. Thus according to the HGLM models, no variation is accounted for by these terms, and the estimates of all random effects, except the residual effects, are shrunk to zero. This is why the fitted values in the fitted-value plots for ‘type.constrctn_y’ are all zero (Figures 10.5b and 10.6b).
The diagnostic plots obtained from the residual term in this model (Figure 10.6a) are very little changed from those obtained previously (Figure 10.5a), but those for the term ‘type.constrctn_y’ (Figure 10.6b) are considerably less satisfactory than when the normal distribution and the identity link function were used (Figure 10.5b). Thus, in this particular case, the use of the HGLM system has not improved the outcome of the modelling process. This may be because the parameters of a non-normal distribution are more difficult to fit, even when such a distribution is more appropriate on theoretical grounds (D. Hedderly, personal communication).
The following commands import the ship-damage data into R, convert ‘type’, ‘constrctn_y’ and ‘operatn_period’ to factors and calculate the offset variable ‘logservice’:
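rm(list = ls())
# Backslashes in an R string must be doubled
ship.damage <- read.table(
  "IMM edn 2\\Ch 10\\ship damage.txt",
  header = TRUE)
attach(ship.damage)
# Convert the classifying variables to factors
ftype <- factor(type)
fconstrctn_y <- factor(constrctn_y)
foperatn_period <- factor(operatn_period)
# Offset: log of each category's share of the total months of service
logservice <- log(service_months/sum(service_months, na.rm = TRUE))
logservice[logservice == -Inf] <- NA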
Note that in categories where there are no months of service, and hence ‘logservice’ is minus infinity, this is replaced by a missing value.
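The effect of this replacement can be seen on a small invented example. In the following R sketch the vector `service_months` is hypothetical, with one category that has no months of service; it is not the ship-damage data.

```r
# Hypothetical months of service, including one category with none.
service_months <- c(127, 0, 1095, 63)

log_share <- log(service_months / sum(service_months))
print(log_share)                       # the zero entry gives -Inf

# Replace infinite values by NA so that they are treated as missing.
log_share[is.infinite(log_share)] <- NA
print(log_share)
```

The test `is.infinite()` is used here in place of the comparison with `-Inf`; for a logarithm of a non-negative quantity the two have the same effect.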
The following commands fit to these data the model in which all terms are specified as fixed-effect terms:
ship.damage.glm.1 <-
  glm(damage_incidents ~ ftype * fconstrctn_y * foperatn_period,
      family = poisson(link = "log"),
      offset = logservice, data = ship.damage)
anova(ship.damage.glm.1)
The output from the function anova() is as follows:
The deviances and their d.f. agree with those obtained from GenStat.
The following commands fit the model from which the non-significant two-way interaction terms have been omitted:
ship.damage.glm.2 <-
  glm(damage_incidents ~ ftype * fconstrctn_y + foperatn_period,
      family = poisson(link = "log"),
      offset = logservice, data = ship.damage)
anova(ship.damage.glm.2)
Again, the deviances and their d.f. agree with those obtained from GenStat. However, the author has not been able to obtain meaningful predictions of the ship type means, or the differences between them, from this model.
The combination of the random-effect model ‘constrctn_y*operatn_period + type.constrctn_y + type.operatn_period’ with the fixed-effect model ‘type’, which produced non-convergence in GenStat, is successfully fitted by R, but for purposes of comparison, the model subsequently fitted in GenStat will also be fitted here. This is done by the following commands:
library(lme4)
ship.damage.glmer.2 <- glmer(damage_incidents ~ ftype +
    (1|fconstrctn_y) + (1|foperatn_period) +
    (1|ftype:fconstrctn_y),
  family = poisson(link = "log"), offset = logservice)
summary(ship.damage.glmer.2)
The function glmer() fits a generalized linear mixed model, the response variable, fixed-effect model and random-effect model being specified as described in Section 10.3. The family option indicates the distribution of the response variable and the link function, and the offset option indicates the offset variable.
Part of the output from the function summary() is as follows:
The variance component estimates do not agree with those obtained from GenStat: the reason for this discrepancy is not known to the author. The estimates of the effect of the type are calculated relative to the estimated mean for Type A, which is the intercept. Thus, for example, the estimated mean for Type B on the transformed (logarithmic) scale is
These estimates do not agree closely with those obtained from GenStat, but they rank the ship types in the same order.
At the time of writing, no straightforward method for specifying in R the HGLM fitted in GenStat, with several random-effect terms, is known to the author. A simpler HGLM with a single random-effect term is therefore fitted, using the package ‘hglm’. This package can be installed from the R website by the method described in Section 3.16. It is not able to deal with an offset variate containing missing values, and a new data frame (see Section 1.11 for an explanation of this concept) in which the corresponding observations are excluded from all variables is therefore created by the following command:
ship.damage.short <-
data.frame(damage_incidents, ftype, fconstrctn_y,
foperatn_period, logservice)[is.na(logservice) == FALSE,]
The following commands then fit an HGLM that estimates the effects of ‘ftype’ and ‘fconstrctn_y’, but does not take account of ‘foperatn_period’:
library(hglm)
ship.damage.hglm.1 <- hglm(fixed = damage_incidents ~ ftype,
  random = ~ 1|fconstrctn_y,
  family = poisson(link = "log"),
  rand.family = gaussian(link = "identity"),
  offset = logservice, data = ship.damage.short)
summary(ship.damage.hglm.1)
In the function hglm(), the fixed and random options specify the fixed-effect and random-effect models, respectively. The family option specifies the distribution and the link function in the residual stratum, and the rand.family option specifies the distribution and link function for the term in the random-effect model – in the present case, the Gaussian (normal) distribution and the identity link function. The data option indicates that the analysis is to be performed using the data frame from which observations with missing values have been excluded.
The output from the summary() function is as follows:
The output first reproduces the HGLM model specification. It then shows the effect estimates for the fixed-effect model ‘type’, calculated relative to Type A as before. The ship type means obtained from these effect estimates agree fairly closely with those given by GenStat. BLUPs are then presented for the levels of the term in the random-effect model, ‘fconstrctn_y’. A fuller account of the function hglm() is given by Rönnegård, Shen and Alam (2010), and a somewhat broader range of HGLMs can be fitted using the newer function hglm2() (L. Rönnegård, personal communication).
Diagnostic plots of residuals from this model are produced by the following commands:
windows()
par(mfrow=c(2,2))
plot(ship.damage.hglm.1)
Part of the output from the plot() function is shown in Figure 10.7. The plot of studentized residuals against fitted values (Plot a) shows that larger fitted values are associated with more variable residuals, indicating that the model fitted should ideally be refined to take account of this heterogeneity of variance. This pattern is confirmed by the corresponding plot of the absolute values of the residuals (Plot b), which shows a clear positive trend. The Q–Q plot (Plot c) shows somewhat more residuals with very large positive or negative values than expected. However, the histogram of residuals (Plot d) is as close to a normal distribution as can reasonably be expected from this fairly small sample. Further diagnostic plots are produced, but they are less informative than those shown here. In particular, the plots that relate to the random effects of ‘fconstrctn_y’ give very little information, being based on only four values.
The HGLM can be modified to specify a gamma distribution for the effects of ‘fconstrctn_y’ and the logarithmic link function for this stratum by changing the setting of the rand.family option to ‘rand.family = Gamma(link = "log")’. This causes some change to the estimates of the effects of ‘type’, but has very little effect on the diagnostic plots.
The following SAS statements import the ship-damage data, calculate the offset variable ‘logservice’ and fit the model specifying all terms as fixed-effect terms:
PROC IMPORT OUT = ship DBMS = EXCELCS REPLACE
DATAFILE = "&pathname.IMM edn 2\Ch 10\ship damage.xlsx";
SHEET = "for SAS";
RUN;
PROC UNIVARIATE DATA = ship NOPRINT;
VAR service_months;
OUTPUT OUT = ship_summ SUM = serv_sum;
RUN;
DATA ship_summ_2; SET ship_summ;
dummy = 1;
RUN;
DATA ship_2; SET ship;
dummy = 1;
RUN;
PROC SQL;
CREATE TABLE ship_3 AS
SELECT a.*, b.*
FROM ship_2 AS a, ship_summ_2 AS b
WHERE a.dummy EQ b.dummy;
QUIT;
DATA ship_4; SET ship_3;
logservice = log(service_months/serv_sum);
RUN;
ODS RTF;
PROC GENMOD DATA = ship_4;
CLASS type constrctn_y operatn_period;
MODEL damage_incidents = type constrctn_y operatn_period
type * constrctn_y type * operatn_period constrctn_y * operatn_period /
DIST = POISSON LINK = LOG OFFSET = logservice TYPE1;
RUN;
ODS RTF CLOSE;
PROC UNIVARIATE obtains the total of ‘service_months’, and the DATA steps and use of PROC SQL that follow place this total in an additional column in the data set and calculate ‘logservice’. PROC GENMOD fits the model. In the MODEL statement, the DIST, LINK and OFFSET options specify the distribution of the response variable, the link function and the offset variable, respectively, and the TYPE1 option specifies Type I hypothesis tests, which are those performed in an accumulated analysis of deviance. Part of the output from PROC GENMOD is as follows:
LR statistics for Type 1 analysis | ||||
Source | Deviance | DF | Chi-square | Pr > ChiSq |
Intercept | 146.3283 | |||
type | 90.8893 | 4 | 55.44 | <0.0001 |
constrctn_y | 49.3552 | 3 | 41.53 | <0.0001 |
operatn_period | 38.6951 | 1 | 10.66 | 0.0011 |
type*constrctn_y | 14.5869 | 12 | 24.11 | 0.0197 |
type*operatn_period | 8.5208 | 4 | 6.07 | 0.1943 |
constrctn*operatn_pe | 6.8565 | 2 | 1.66 | 0.4351 |
The chi-square statistics and their d.f. agree with the deviances in the accumulated analysis of deviance produced by GenStat.
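The relationship between the deviances and the chi-square statistics in this table can be verified by hand: each chi-square statistic is the reduction in deviance when the term is added to the model. The following R sketch reproduces the calculation from the tabulated deviances and d.f. alone; the object names are illustrative.

```r
# Residual deviances after adding each term in turn (from the SAS output above).
deviance <- c(Intercept = 146.3283, type = 90.8893, constrctn_y = 49.3552,
              operatn_period = 38.6951, "type*constrctn_y" = 14.5869,
              "type*operatn_period" = 8.5208,
              "constrctn_y*operatn_period" = 6.8565)
df <- c(4, 3, 1, 12, 4, 2)             # d.f. of each added term

chisq <- -diff(deviance)               # reduction in deviance due to each term
p     <- pchisq(chisq, df, lower.tail = FALSE)

print(data.frame(DF = df, ChiSq = round(chisq, 2), P = signif(p, 3)))
```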
The following statements fit the GLMM with ‘type’ as the fixed-effect model and all other terms in the random-effect model, with the omission of the terms ‘type.operatn_period’ and ‘constrctn_y.operatn_period’:
ODS RTF;
PROC GLIMMIX DATA = ship_4 METHOD = RSPL ASYCOV NOBOUND;
CLASS type constrctn_y operatn_period;
MODEL damage_incidents = type /
DISTRIBUTION = POISSON LINK = LOG OFFSET = logservice;
RANDOM constrctn_y operatn_period constrctn_y * type;
LSMEANS type;
RUN;
ODS RTF CLOSE;
In PROC GLIMMIX, the option setting ‘METHOD = RSPL’ indicates that the default RSPL method of estimation will be used. The MODEL statement specifies the response variable and the fixed-effect term, and the RANDOM statement specifies the random-effect terms. The LSMEANS statement indicates that the mean for each level of ‘type’ is to be estimated. Part of the output from PROC GLIMMIX is as follows:
Covariance parameter estimates | ||
Cov Parm | Estimate | Standard error |
constrctn_y | 0.07661 | 0.1329 |
operatn_period | 0.07152 | 0.1109 |
type*constrctn_y | 0.1440 | 0.1372 |
Asymptotic covariance matrix of covariance parameter estimates | |||
Cov Parm | CovP1 | CovP2 | CovP3 |
constrctn_y | 0.01767 | −0.00035 | −0.00454 |
operatn_period | −0.00035 | 0.01230 | −0.00002 |
type*constrctn_y | −0.00454 | −0.00002 | 0.01883 |
Type III tests of fixed effects | ||||
Effect | Num DF | Den DF | F value | Pr > F |
type | 4 | 12 | 2.43 | 0.1043 |
Type least squares means | |||||
Type | Estimate | Standard error | DF | t value | Pr > |t| |
A | 6.3487 | 0.3639 | 12 | 17.45 | <0.0001 |
B | 5.7631 | 0.3114 | 12 | 18.51 | <0.0001 |
C | 5.6096 | 0.4411 | 12 | 12.72 | <0.0001 |
D | 6.1312 | 0.4175 | 12 | 14.68 | <0.0001 |
E | 6.7527 | 0.3853 | 12 | 17.53 | <0.0001 |
The covariance parameter estimates agree with the estimated variance components produced by GenStat, and the covariances between the estimates are similar though not identical. The F test for ‘type’ agrees fairly closely with that given by GenStat, but the denominator d.f. is different, being obtained by a somewhat different method, and the p-value is therefore also different, though still non-significant. The estimated means for types of ship (on the logarithmic scale) agree with those given by GenStat.
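Several of the quantities in this output can be examined further with a few lines of R. The following sketch uses only the numbers tabulated in the SAS output above (the object names are illustrative): it back-transforms the least squares means with the inverse of the logarithmic link, ranks the ship types, compares each variance component estimate with its SE and recomputes the p-value of the F test.

```r
# Least squares means on the logarithmic scale (from the SAS output above).
ls_means <- c(A = 6.3487, B = 5.7631, C = 5.6096, D = 6.1312, E = 6.7527)
back_means <- exp(ls_means)            # inverse of the log link
print(round(back_means, 1))
print(names(sort(ls_means)))           # ranking, fewest incidents first

# Variance component estimates relative to their standard errors.
estimate <- c(constrctn_y = 0.07661, operatn_period = 0.07152,
              "type*constrctn_y" = 0.1440)
se       <- c(0.1329, 0.1109, 0.1372)
print(round(estimate / se, 2))

# p-value of the F test for 'type' with 4 and 12 d.f.
p_type <- pf(2.43, 4, 12, lower.tail = FALSE)
print(round(p_type, 4))
```

Because the F value is used here as rounded to two decimal places, the recomputed p-value agrees with the tabulated value of 0.1043 only approximately.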
At the time of writing, no straightforward method of fitting an HGLM in SAS is known to the author.
In all the mixed models we have examined so far, each random-effect term consists of a set of factor levels. Any two observations either come from the same level of the factor, in which case they have the same random effect for the term in question, or from different levels, in which case their random effects are independent. In the residual stratum, each observation comprises a separate level, and its random effect is independent of those of all other observations. However, the relationship between the random effects in each term need not be so simple. In order to explore more elaborate relationships among the random effects, we need first to establish a notation in which to display them. Such a notation is provided by the covariance matrix, which is specified as follows.
Consider a simple data set comprising four groups of three observations of a response variable (Table 10.5). The natural model to fit to these data is
where
We will make the usual assumption that the ϵij are values of a random, normally distributed variable Ε, that is,
The anova corresponding to this model is shown in Table 10.6, and the estimate of the residual variance is given by MSResidual,
Table 10.5 A simple data set comprising a grouping factor and a response variable.
A | B | |
1 | group! | y |
2 | 1 | 1.97 |
3 | 1 | 1.01 |
4 | 1 | 4.53 |
5 | 2 | 0.76 |
6 | 2 | 1.60 |
7 | 2 | 3.52 |
8 | 3 | 4.37 |
9 | 3 | 4.05 |
10 | 3 | 5.17 |
11 | 4 | 6.94 |
12 | 4 | 7.50 |
13 | 4 | 5.62 |
Table 10.6 Anova of a simple data set comprising a grouping factor and a response variable.
Source of variation | DF | MS | F | p |
Group | 3 | 13.875 | 8.44 | 0.007 |
Residual | 8 | 1.644 | ||
Total | 11 |
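The anova in Table 10.6 can be reproduced from the data in Table 10.5. The following R sketch enters the 12 observations directly and fits the one-way model with lm(); the object names are illustrative and are not taken from the original analysis.

```r
# Data from Table 10.5.
group <- factor(rep(1:4, each = 3))
y <- c(1.97, 1.01, 4.53, 0.76, 1.60, 3.52,
       4.37, 4.05, 5.17, 6.94, 7.50, 5.62)

fit <- lm(y ~ group)                   # one-way model: group effect plus residual
aov_table <- anova(fit)
print(aov_table)

ms_group    <- aov_table["group", "Mean Sq"]
ms_residual <- aov_table["Residuals", "Mean Sq"]   # estimate of residual variance
f_value     <- aov_table["group", "F value"]
p_value     <- aov_table["group", "Pr(>F)"]
```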
In order to express the idea that the 12 values ϵij, i = 1…4, j = 1…3, are independent, we need to think of each of them as a value of a different variable, Εij. Similarly, each value of the response variable, yij, is thought of as an observation of a different variable, Yij. The relationships among the 12 variables Εij can then be expressed in a covariance matrix, namely:
where
The value of cov(Εij,Εi′j′), the covariance between Εij and Εi′j′, is given in the cell in the ijth row and the i′j′th column of this matrix. Since Εij is the only random term that contributes to Yij,
The variance of each of the variables Εij – its covariance with itself – is given along the leading diagonal of the matrix (the 12 cells from top left to bottom right). All other covariances are zero, reflecting the assumption that all the variables are mutually independent. Because
by definition, the matrix is symmetrical about the leading diagonal: hence to improve clarity, the top right-hand half can be omitted.
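As an illustration, the covariance matrix for the 12 independent residual effects can be built numerically. The following R sketch assumes that the common variance is estimated by the residual mean square from Table 10.6; the labels Eij and the object names are illustrative.

```r
sigma2_hat <- 1.644                      # MS(Residual) from Table 10.6
V_indep <- sigma2_hat * diag(12)         # variance on the diagonal, zeros elsewhere

# Label rows and columns by group i and observation j, as in the text.
labels <- paste0("E", rep(1:4, each = 3), rep(1:3, times = 4))
dimnames(V_indep) <- list(labels, labels)
print(V_indep[1:4, 1:4])                 # top left-hand corner of the matrix
```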
Now suppose that we decide to treat the groups as a random-effect term. We then make the assumption that the γi, i = 1…4, are values of a variable Γ, and that
The anova can now be interpreted as shown in Table 10.7, and the estimate of the variance component for groups is
In order to express the idea that the four values γi, i = 1…4, are independent, we think of each of them as a value of a different variable, Γi. The variable representing the random part of Yij is now no longer Εij but Γi + Εij, and the covariances among these 12 variables are as follows:
Table 10.7 Expected mean squares in the anova of a simple data set comprising a grouping factor and a response variable.
Source of variation | DF | Expected MS |
Group | 3 | |
Residual | 8 | |
Total | 11 |
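The variance component for groups can be estimated from the mean squares in Table 10.6. The following R sketch assumes the standard expected mean squares for a balanced one-way classification, namely the residual variance for the residual line, and the residual variance plus three times the group variance component for the group line (three being the number of observations per group); the object names are illustrative.

```r
ms_group    <- 13.875    # from Table 10.6
ms_residual <- 1.644     # from Table 10.6
n_per_group <- 3         # observations in each group

var_residual <- ms_residual
var_group    <- (ms_group - ms_residual) / n_per_group
print(c(group = var_group, residual = var_residual))
```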
The variance of each observation of the response variable is now the sum of the variance component for groups and the residual variance. The covariance between observations from different groups – for example, between Y12 and Y23 – is zero, as before. However, the covariance between observations from the same group – for example, between Y12 and Y13 – is now equal to the variance component for groups, because they share the same group effect (Γ1) though they have different individual-observation effects (Ε12 and Ε13). This covariance matrix can be expressed in the notation of matrix algebra as
In the first term in this expression, all rows and all columns representing observations in the same group are identical. (To see this, it may be helpful to fill in the top right-hand half of the matrix.) Hence the information in this term can be expressed more concisely in the form
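The block structure described above can be sketched numerically. The following R code builds the covariance matrix of the 12 observations as the sum of a group part and a residual part, using moment estimates of the two variance components derived from Table 10.6. Writing the group part as a Kronecker product of a 4 × 4 identity matrix and a 3 × 3 matrix of ones is one concise way to express it, and is an assumption here rather than necessarily the notation used in the text.

```r
var_group    <- (13.875 - 1.644) / 3     # moment estimate from Table 10.6
var_residual <- 1.644                    # MS(Residual) from Table 10.6

J3 <- matrix(1, 3, 3)                    # all-ones block for one group
group_part    <- var_group * kronecker(diag(4), J3)   # shared group effects
residual_part <- var_residual * diag(12)              # independent residuals
V_mixed <- group_part + residual_part

print(round(V_mixed[1:6, 1:6], 3))       # groups 1 and 2 only
```

In the printed corner, the diagonal cells hold the sum of the two components, cells within the same group hold the group component alone, and cells linking different groups are zero.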