Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 14
Multivariate Statistical Analysis - I

Package(s): ICSNP, scatterplot3d, aplpack, mvtnorm, foreign

Dataset(s): cardata, stiff, iris, hw, calcium, mfp, rootstock, waterquality, pw, sheishu

14.1 Introduction

In many real-world problems, data is seldom univariate. We have more than one variable, which needs a good understanding of the underlying uncertain phenomenons. Thus, we need a set of tools to handle this type of data, and this is provided by Multivariate Statistical Analysis (MSA), a branch of the subject. We saw in the previous chapters on regression, that multiple regressors explain the regressand. Sometimes experiments may need a deeper study of the covariates themselve. In particular, we are now concerned with a random vector, the characteristics of which form the crux of this and the next chapter.

In Section 14.2 we look at graphical plots, which give a deeper insight into the structure of the dataset. The core concepts of MSA are introduced in Section 14.3. Sections 14.4 and 14.5 deal with the inference problem related to the mean vectors of multivariate data, whereas inference related with the variance-covariance matrix are performed in Sections 14.7 and 14.8. Multivariate Analysis of Variance, abbreviated as MANOVA, tools are introduced and illustrated in Section 14.6 and some tests for independence of sub-vectors are addressed in Section 14.9. Advanced topics of multivariate statistical analysis are carried over to the next chapter.

14.2 Graphical Plots for Multivariate Data

In Chapter 12 we saw the use of scatter plots and pairs (matrix of scatter plots). A slight modification of the matrix of a scatter plot is considered here, which gives more insight into the multivariate aspect of the dataset. Multi-dimensional data can be still visualized in two dimensions and a particularly effective technique provided by Chernoff faces is detailed. We will begin with a multivariate dataset.

Example 14.2.1. Car Data

We consider the car dataset consisting of 13 variables and 74 car variables. This example is drawn from Hárdle and Simar (2007), Appendix B.3, which has been earlier analyzed by Chambers, et al. (1983). The variable description, variables names as in the dataset in verbatim font, and the notation of random variable in italics, are given in the following:

Price: P, $c14-math-0001$
Mileage (in miles per gallon): M, $c14-math-0002$
Repair record 1978 (rated on a 5-point scale; 5 best, 1 worst): R78, $c14-math-0003$
Repair record 1977 (scale as before): R77, $c14-math-0004$
Headroom (in inches): H, $c14-math-0005$
Rear seat clearance (distance from front seat back to rear seat, in inches): R, $c14-math-0006$
Trunk space (in cubic feet): Tr, $c14-math-0007$
Weight (in pound): W, $c14-math-0008$
Length (in inches): L, $c14-math-0009$
Turning diameter (clearance required to make a U-turn, in feet): T, $c14-math-0010$
Displacement (in cubic inches): D, $c14-math-0011$
Gear ratio for high gear: G, $c14-math-0012$
Company headquarter (1 for U.S., 2 for Japan, 3 for Europe): C, $c14-math-0013$

Let us now plot the matrix of scatter plots and examine the correlation among the 13 variables. The matrix of scatter plots can be customized to extract a lot of information. For instance, we use the functions panel.hist and panel.cor, which have been defined in the R example of pairs function, seen by running example(pairs) at the R terminal. The dataset is first imported into R and then the pairs function is applied to it with the options of panel.hist and panel.cor.

> panel.hist <- function(x, ...)
+  {
+      usr <- par("usr"); on.exit(par(usr))
+      par(usr = c(usr[1:2], 0, 1.5) )
+      h <- hist(x, plot = FALSE)
+      breaks <- h$breaks; nB <- length(breaks)
+      y <- h$counts; y <- y/max(y)
+      rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
+  }
> panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
+      usr <- par("usr"); on.exit(par(usr))
+      par(usr = c(0, 1, 0, 1))
+      r <- abs(cor(x, y))
+      txt <- format(c(r, 0.123456789), digits=digits)[1]
+      txt <- paste(prefix, txt, sep="")
+      if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
+      text(0.5, 0.5, txt, cex = cex.cor * r)
+  }
> data(cardata)
> pairs(cardata[,2:14],diag.panel=panel.hist, lower.panel=panel.smooth,
+ upper.panel=panel.cor)
+ # as some data is missing, we remove them and replot below
> pairs(na.omit(cardata[,2:14]),diag.panel=panel.hist,
+ lower.panel=panel.smooth, upper.panel=panel.cor)

Figure 14.1 is the output of the previous R program. The price P appears to be weakly correlated with the other components of the random vector. Strong associations appear among the variables M, W, L, T, D, and G. In this figure, as we look at the upper panel for the correlation coefficients, the lower panel also needs to be checked for the scatter plot. Of course, the diagonal elements indicate the spread of the variables themselves. Thus, using the options in diag.panel, lower.panel, and upper.panel, the pairs becomes a very effective tool for visualization of multivariate data.□

Figure 14.1 A Correlation Matrix Scatter Plot for the Car Data

Chernoff (1973) gave a very innovative technique to visualize multivariate data, which considers each variate as some dimension of the human face, say nose, ears, smile, cheeks, etc. Interpretation of three-dimensional plots itself is difficult, even if we were to deploy features such as rotation of the plots. Certainly, visualization in more than three dimensions is not possible. Thus, the Chernoff technique of visualizing the multivariate data through faces is very helpful and such a plot is of course well known as the Chernoff faces. We deploy this method here using the graphical function faces from the R package aplpack.

Example 14.2.2. Car Data

Contd. The first 25 data points are considered here for purposes of brevity.

> library(aplpack)
> faces(cardata[1:25,2:14])
[1] "Warning: NA elements have been exchanged by mean values!!"
effect of variables:
 modified item       Var
 "height of face   " "P"
 "width of face    " "M"
 "height of ear   "  "M"

Note that the missing values have been replaced by the average values of that variate. The diagram is given in Figure 14.2.

Figure 14.2 Chernoff Faces for a Sample of 25 Data Points of Car Data

A car freak reader will not be surprised by the Chernoff faces, as the cars labeled 14 to 16 are Cadillac luxury car models Deville, Eldorado, and Seville, whereas the ones labeled 2, 3, 21, and 25 (yes, the ones with grim faces), are low-end hatchback cars, respectively AMC-Pacer, AMC-Spirit, Chevrolet Monza, and Datsun-510.□

Chernoff faces gives one facet of data visualization of multivariate data. There are many other similar techniques, though we will not dwell more on them. In the next Section 14.3 we consider more basic aspects of the multivariate and define the multivariate normal distribution in more detail.

$c14-math-0014$

14.3 Definitions, Notations, and Summary Statistics for Multivariate Data

14.3.1 Definitions and Data Visualization

We will denote a $c14-math-0015$ -random vector by $c14-math-0016$ , and its $c14-math-0017$ replicates by $c14-math-0018$ . The random vector of the $c14-math-0019$ replicate is denoted as $c14-math-0020$ . The mean vector and variance-covariance matrix of X are respectively denoted as

14.1

14.2

14.3

Note that the matrix $c14-math-0024$ is a symmetric matrix, that is, $c14-math-0025$ .

Sometimes, we may also be interested in the correlation matrix defined by

Each correlation coefficient will be a number between –1 and +1, that is, $c14-math-0027$ .

The $c14-math-0031$ -dimensional normal density will be denoted by $c14-math-0032$ . The case of $c14-math-0033$ refers to the bivariate normal random distribution. For bivariate normal random variables with a zero-mean vector and a couple of positive and negative correlations, we obtain the probability density plots.

Example 14.3.1. Plot of a few Bivariate Normal Densities

Assume $c14-math-0034$ , with $c14-math-0035$ , $c14-math-0036$ , $c14-math-0037$ or $c14-math-0038$ . That is, we consider four types of bivariate normal distribution. Using the dmvnorm function from the mvtnorm package and scatterplot3d from the same named package, we generate the three-dimensional plots of bivariate normal densities.

> x <- rep(seq(-4,4,.2),each=41)
> y <- rep(seq(-4,4,.2),41)
> sigma5 <- matrix(c(1,.5,.5,1),nrow=2)
> sigma_5 <- matrix(c(1,-.5,-.5,1),nrow=2)
> sigma9 <- matrix(c(1,.9,.9,1),nrow=2)
> sigma_9 <- matrix(c(1,-.9,-.9,1),nrow=2)
> dxy5 <- dmvnorm(cbind(x,y),sigma=sigma5)
> dxy_5 <- dmvnorm(cbind(x,y),sigma=sigma_5)
> dxy9 <- dmvnorm(cbind(x,y),sigma=sigma9)
> dxy_9 <- dmvnorm(cbind(x,y),sigma=sigma_9)
> scatterplot3d(x, y, dxy5, highlight.3d=TRUE,type="l",xlab="x",
+ ylab="y",zlab="Bivariate Density Function",main="Bivariate Normal
+ Density with Correlation 0.5")
> scatterplot3d(x, y, dxy_5, highlight.3d=TRUE,type="l",xlab="x",
+ ylab="y",zlab="Bivariate Density Function",main="Bivariate Normal
+ Density with Correlation -0.5")
> scatterplot3d(x, y, dxy9, highlight.3d=TRUE,type="l",xlab="x",
+ ylab="y",zlab="Bivariate Density Function",main="Bivariate Normal
+ Density with Correlation 0.9")
> scatterplot3d(x, y, dxy_9, highlight.3d=TRUE,type="l",xlab="x",
+ ylab="y",zlab="Bivariate Density Function",main="Bivariate Normal
+ Density with Correlation -0.9")

The resulting three-dimensional plots for the bivariate normal densities are displayed in Figure 14.3. The user must experiment with different values in the mean vector and the variance-covariance matrix to get a firm grip over the bivariate normal densities.□

Figure 14.3 Understanding Bivariate Normal Densities

Example 14.3.2. Normally Distributed and Uncorrelated Random Variables do not imply that the Random Variables are Independent

Melnick and Tenenbein (1982) have demystified the popular notion that normally distributed and uncorrelated random variable are independent random variables. To see this, consider the following counter example. Let $c14-math-0039$ be a standard normal variable, and define

For small values of $c14-math-0041$ , the correlation between $c14-math-0042$ and $c14-math-0043$ is nearly equal to 1, and for large $c14-math-0044$ values, it is –1. Thus it tells us that there must be some intermediate value of $c14-math-0045$ , which should make the correlation value close to 0, and this value is in the vicinity of 1.54. However, as the $c14-math-0046$ values completely determine the corresponding $c14-math-0047$ value, they are certainly not independent. In the following R program and Figure 14.4, we see the above-mentioned theory.

> x <- rnorm(300)
> constant <- c(.005,1.54,5)
> y1 <- ifelse(abs(x)>constant[1],x,-x)
> y2 <- ifelse(abs(x)>constant[2],x,-x)
> y3 <- ifelse(abs(x)>constant[3],x,-x)
> layout(matrix(c(1,2,3,3),2,byrow=TRUE))
> plot(x,y1,col="blue",ylab="y-values",xlab="x-values","p", main="c = 0.005")
> plot(x,y3,col="green",pch=21,ylab="y-values",xlab="x-values", "p",main="c = 5")
> plot(x,y2,col="red",ylab="y-values",xlab="x-values","p", main="c = 1.54")

Figure 14.4 A Counter Example of the Myth that Uncorrelated and Normal Distribution imply Independence

First, standard normal deviates are obtained using the rnorm simulator and stored in the object x. Three constant values are stored with c(0.005,1.54,5) and the associated y1-y3 objects are generated with the ifelse control loop. We have not considered the layout options earlier. The layout function divides the device into rows and columns as specified by the matrix. It is a very effective method of obtaining multiple plots in a single frame. The resulting Figure 14.4 confirms the earlier discussion.□

We now describe some standard methods of estimation of the mean vector and the variance-covariance matrix. Define

14.5

14.6

Result 4.11 of Johnson and Wichern (2006) shows that the estimators $c14-math-0050$ and $c14-math-0051$ are respectively MLE's of their parameters.

Example 14.3.3. The Board Stiffness Dataset

Four measures of stiffness of 30 boards are available. The first measure of stiffness is obtained by sending a shock wave down the board, the second measure is obtained by vibrating the board, and the remaining are obtained from static tests. We see that the mean, variance-covariance matrix, as well as the correlation matrix are very easy to compute.

> data(stiff)
> mean(stiff)
      x1       x2       x3       x4
1906.100 1749.533 1509.133 1724.967
> var(stiff)
          x1        x2       x3        x4
x1 105616.30  94613.53 87289.71  94230.73
x2  94613.53 101510.12 76137.10  81064.36
x3  87289.71  76137.10 91917.09  90352.38
x4  94230.73  81064.36 90352.38 104227.96
> cor(stiff)
          x1        x2        x3        x4
x1 1.0000000 0.9137620 0.8859301 0.8981212
x2 0.9137620 1.0000000 0.7882129 0.7881034
x3 0.8859301 0.7882129 1.0000000 0.9231013
x4 0.8981212 0.7881034 0.9231013 1.0000000
> pairs(stiff)

The matrix of scatter plots indicate that there might be an outlier among the observations. We will say more about this in the coming subsection.□

Example 14.3.4. Conversion of a Variance-Covariance Matrix into a Correlation Matrix

The variance-covariance matrix for the four parameters of the iris dataset is the following:

The task is to obtain $c14-math-0053$ from the given estimated variance-covariance matrix $c14-math-0054$ . First, define the following diagonal matrix:

It can then be shown that $c14-math-0056$ . Thus, the sample correlation matrix can be obtained by the relation

The variance-covariance matrix $c14-math-0058$ is now converted into a correlation matrix in R. An R function cov2cor is also available for the same purpose, which will be used to confirm our understanding.

> covmat <- round(var(iris[,1:4]),4)
> v1_2 <- covmat*0; diag(v1_2)=sqrt(diag(covmat))
> cormat <- solve(v1_2)%*%covmat%*%solve(v1_2)
> cormat
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1174687    0.8717360   0.8179881
Sepal.Width    -0.1174687   1.0000000   -0.4284721  -0.3659896
Petal.Length    0.8717360  -0.4284721    1.0000000   0.9628602
Petal.Width     0.8179881  -0.3659896    0.9628602   1.0000000
> cov2cor(cormat)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1174687    0.8717360   0.8179881
Sepal.Width    -0.1174687   1.0000000   -0.4284721  -0.3659896
Petal.Length    0.8717360  -0.4284721    1.0000000   0.9628602
Petal.Width     0.8179881  -0.3659896    0.9628602   1.0000000

Note that any arbitrary square matrix cannot be a variance-covariance matrix. For such inappropriate matrices, you will find that the correlation matrix, if converted, will contain elements which will be less than –1 or greater than 1.□

14.3.2 Early Outlier Detection

An important concept for analysis of multivariate data is given by the Mahalanobis distance.

Thus, for each vector $c14-math-0063$ , the Mahalanobis distance $c14-math-0064$ may be computed and any unusually large value may then be marked as an outlier. The Mahalanobis distance is also called the generalized squared distance.

It is a familiar story that outliers, as in univariate cases, need to be addressed as early as possible. Graphical methods become a bit difficult if the number of variables is more than three. Johnson and Wichern (2006) suggest we should obtain the standardized values of the observations with respect to each variable, and they also recommend looking at the matrix of scatter plots. The four steps listed by them are:

Obtain the dot plot for each variable.
Obtain the matrix of scatter plots.
Obtain the standardized scores for each variable, $c14-math-0065$ , for $c14-math-0066$ , and $c14-math-0067$ . Check for large and small values of these scores.
Obtain the generalized squared distances $c14-math-0068$ . Check for large distances.

A dot plot is also known as a Cleveland dot plot and it is set up using the dotchart function, which in turn is a substitute to the bar plot. In this plot, a dot is used to represent the magnitude of the observation along the $c14-math-0069$ -axis with the observation number on the $c14-math-0070$ -axis.

Example 14.3.5. The Board Stiffness Dataset

Contd. The matrix of scatter plots from Figure 14.5 already indicates the presence of an outlier. Next, look at the dot plots of each of the variables.

> par(mfrow=c(4,1),cex=.5)
> dotchart(stiff[,1],main="Dotchart of X1")
> dotchart(stiff[,2],main="Dotchart of X2")
> dotchart(stiff[,3],main="Dotchart of X3")
> dotchart(stiff[,4],main="Dotchart of X4")

Figure 14.5 A Matrix Scatter Plot for the Board Stiffness Dataset

The dot chart in Figure 14.6 clearly shows that, for each variable, there is one observation at the right-hand side of the diagram whose value is very large compared with the rest of the observations. This is a clean case of the presence of an outlier. If all the four points on the dot chart belong to one observation, we have one outlier, else there may be more outliers. The standardized values are obtained using the scale function.

> cbind(stiff,round(scale(stiff),1))
     x1   x2   x3   x4   x1   x2   x3   x4
1  1889 1651 1561 1778 -0.1 -0.3  0.2  0.2
2  2403 2048 2087 2197  1.5  0.9  1.9  1.5
3  2119 1700 1815 2222  0.7 -0.2  1.0  1.5
4  1645 1627 1110 1533 -0.8 -0.4 -1.3 -0.6
5  1976 1916 1614 1883  0.2  0.5  0.3  0.5
9  2983 2794 2412 2581  3.3  3.3  3.0  2.7
30 1490 1382 1214 1284 -1.3 -1.2 -1.0 -1.4

Figure 14.6 Early Outlier Detection through Dot Charts

Note that the last four columns are standardized values. From our knowledge of the univariate normal cases, we guess that the observation labeled 9 must be an outlier, since three of the four components are in excess of the 3 sigma distance of the mean 0. Multivariate data have a lot of surprises as we see from the outliers given by the generalized squared distances. We now calculate the generalized squared distances for each of the observations.

> mahalanobis(stiff,colMeans(stiff),cov(stiff))
 [1]  0.6000129  5.4770196  7.6166439  5.2076098  1.3980776
[29]  6.2837628  2.5838186

As we mentioned earlier, we have a surprise outlier in the observation labeled 16 in addition to that labeled 9. Though each of its components is within the control level sense of the univariate data, the generalized squared distance for observation 16 is unusually high.□

We have seen some interesting graphical plots for the multivariate data. The multivariate normal distribution has also been introduced here, along with a few of its properties. We will next consider some testing problems for multivariate normal distribution.

$c14-math-0071$

14.4 Testing for Mean Vectors : One Sample

Suppose that we have random vector samples from a multivariate normal distribution, say $c14-math-0072$ . In Chapter 7, we saw testing problems for the univariate normal distribution. If the variance-covariance matrix $c14-math-0073$ is known to be a diagonal matrix, that is components are uncorrelated random variables, we can revert back to the methods discussed there. However, this is not generally the case for multivariate distributions and they need specialized methods for testing hypothesis problems.

For the problem of testing $c14-math-0074$ against $c14-math-0075$ , we consider two cases here: (i) $c14-math-0076$ is known, and (ii) $c14-math-0077$ is unknown. Suppose that we have $c14-math-0078$ samples of random vectors from $c14-math-0079$ , which we will denote as $c14-math-0080$ . The sample of random vectors is the same as saying that we have iid random vectors.

14.4.1 Testing for Mean Vector with Known Variance-Covariance Matrix

If the variance-covariance matrix is known, the test statistic is a multivariate extension of the $c14-math-0081$ -statistic, and is given by

14.8

Under the hypothesis $c14-math-0084$ , the test statistic $c14-math-0085$ is distributed as a chi-square variate with $c14-math-0086$ degrees of freedom, and a $c14-math-0087$ random variable. The computations are fairly easy and we do that in the next example.

Example 14.4.1. The Importance of Handling the Covariances

Rencher (2002). Consider the height and weight of 20 college-age males. This dataset is available in the csv file Height_Weight.csv. Assume that the variance-covariance matrix is known as $c14-math-0088$ , and that the hypothesis of interest is $c14-math-0089$ , where the height is measured in inches and the weight in pounds.

> data(hw)
> mu0 <- c(70,170)
> n <- nrow(hw)
> sigma <- matrix(c(20, 100, 100,1000),nrow=2)
> meanx <- mean(hw)
> z2 <- n*t(meanx-mu0)%*%solve(sigma)%*%(meanx-mu0)
> z2 # the test statistic value
[1] 8.4026
> qchisq(1-.05,2) # 95% confidence level and 2 d.f.
[1] 5.991465

Since the calculated $c14-math-0090$ value is greater than the tabulated value, we reject the hypothesis that the average height and weight are at 70 and 170 units respectively. If we were to ignore the correlation between the height and weight, we have the following conclusions:

> htest <- (meanx[1]-70)/(sqrt(sigma[1,1]/n)) # testing for height
> wtest <- (meanx[2]-170)/(sqrt(sigma[2,2]/n)) # testing for weight
> as.numeric(htest);as.numeric(wtest)
[1] 1.45
[1] -0.7495332

The absolute value of each of these tests is less than 1.96, and hence we would have failed to reject the hypothesis $c14-math-0091$ , which is not the case when the correlations are adjusted for. Thus, we learn an important story that whenever the correlations are known, it is always better to adjust the statistical procedure for them.□

Example 14.4.2. The Board Stiffness Data

Contd. Suppose that the variance-covariance matrix for this data is known as

The hypothesis testing problem is $c14-math-0093$ against the hypothesis $c14-math-0094$ . Repeating the above program with small changes, we get the value of the test statistics, and also the critical value.

> n <- nrow(stiff)
> sigma <- matrix(10ˆ4*c(11,9,9,9,9,10,8,8,9,8,9,9,9,8,9,10),nrow=4)
> mu0 <- 10ˆ3*c(2,1.5,1.5,2)
> meanx <- mean(stiff)
> z2 <- n*t(meanx-mu0)%*%solve(sigma)%*%(meanx-mu0)
> z2 # the test statistic value
         [,1]
[1,] 365.9636
> qchisq(1-.05,4) #95% confidence level and 4 d.f.
[1] 9.487729

The test procedure thus rejects the hypothesis $c14-math-0095$ , since the value of the test statistic is larger than the critical value.□

14.4.2 Testing for Mean Vectors with Unknown Variance-Covariance Matrix

It turns out that in many practical settings, the variance-covariance matrix is unknown. Therefore, we need to extend the test procedure for this important case. The Hotelling's $c14-math-0097$ -statistic is given by

14.9

where $c14-math-0100$ is the sampling covariance matrix. Under the hypothesis $c14-math-0101$ , the test statistic $c14-math-0102$ is distributed as Hotellings' $c14-math-0103$ distribution with $c14-math-0104$ and $c14-math-0105$ degrees of freedom.

Example 14.4.3. The Calcium in Soil and Turnip Greens Data of Rencher (2002)

Kramer and Jensen (1969) collected data on three variables at ten different locations. The variables of interest are (i) $c14-math-0106$ : available calcium in the soil, (ii) $c14-math-0107$ : exchangeable soil calcium, and (iii) $c14-math-0108$ : turnip green calcium. Suppose the hypothesis of interest is $c14-math-0109$ .

> data(calcium)
> n <- nrow(calcium)
> meanx <- mean(calcium[,-1])
> varx <- var(calcium[,-1])
> mu0 <- c(15,6,2.85)
> t2 <- n*t(meanx-mu0)%*%solve(varx)%*%(meanx-mu0)
> t2
[1] 24.55891

If we compare the test statistic value with a Hotelling's distribution with three degrees of freedom and non-centrality parameter 9 at 95% significance level, the critical value is 16.766, and thus we will have to reject the null-hypothesis.□

The Hotellings' $c14-math-0110$ can be converted to an $c14-math-0112$ -statistics using the transformation:

14.10

An R function for calculating the test statistic is available in the ICSNP package. The test function HotellingsT2 implements the $c14-math-0115$ -test by comparing if the mean of a normal vector equals some specified null vector. We can easily carry out the $c14-math-0116$ -test for the above example.

Example 14.4.4. The Calcium in Soil and Turnip Greens Data of Rencher (2002)

Contd. First call up the ICSNP library and then use the function HotellingsT2 for the calcium problem.

> library(ICSNP)
> HotellingsT2(calcium[,-1],mu=mu0,test="f")
        Hotelling's one sample T2-test
data:  calcium[, -1]
T.2 = 6.3671, df1 = 3, df2 = 7, p-value = 0.02068
alternative hypothesis: true location is not equal to c(15,6,2.85)

The conclusion on using any of the methods does not vary and we are led to reject the hypothesis $c14-math-0117$ .□

It is known that the likelihood-ratio tests are very generic methods for testing hypothesis problems. The likelihood-ratio test of $c14-math-0118$ against $c14-math-0119$ is given by

14.11

It is further known from the general theory of the likelihood ratios that the above-mentioned test statistics follow a chi-square random variate. Yes, the computations are not trivial for the $c14-math-0121$ -test based on the likelihood-ratio test, and hence the ICSNP package will be used to bail us out of this scenario. The illustration is continued with the previous data-set.

Example 14.4.5. The Calcium in Soil and Turnip Greens Data of Rencher (2002)

Contd. The option of test=“f” needs to be replaced with test=“chi” in the previous R line to carry out the $c14-math-0122$ -test for testing the hypothesis $c14-math-0123$ .

> HotellingsT2(calcium[,-1],mu=mu0,test="chi")
        Hotelling's one sample T2-test
data:  calcium[, -1]
T.2 = 24.5589, df = 3, p-value = 1.909e-05
alternative hypothesis: true location is not equal to c(15,6,2.85)

The conclusion does not change and, as earlier, the hypotheses $c14-math-0124$ is rejected in confirmation with earlier methods.□

The problem of testing $c14-math-0125$ against $c14-math-0126$ will be considered for the two-sample problem in Section 14.5.

$c14-math-0127$

14.5 Testing for Mean Vectors : Two-Samples

Consider the case of random vector samples from two plausible populations, $c14-math-0128$ and $c14-math-0129$ . Such scenarios are very likely if new machinery and a population labeled 1 refers to samples obtained under the new machinery and that labeled 2 refers to the samples under older machinery. As the comparison of the mean vectors become sensible only if the covariance matrices are equal, we assume that the covariance matrices are equal, but not known. That is, $c14-math-0130$ , with $c14-math-0131$ unknown.

Suppose that we have samples from the two populations as $c14-math-0132$ , $c14-math-0133$ , and $c14-math-0134$ , $c14-math-0135$ . The hypothesis of interest is to test $c14-math-0136$ against $c14-math-0137$ . The estimates of the mean vectors from each of the populations is given by $c14-math-0138$ .

We will first define the matrices of sum squares and cross-products as below:

14.12

14.13

Further define the pooled covariance matrix:

14.14

The Hotelling's $c14-math-0142$ test statistic is then given by

14.15

The test statistic $c14-math-0146$ , under the hypothesis $c14-math-0147$ , is distributed as Hotelling's $c14-math-0148$ distribution with parameters $c14-math-0149$ and $c14-math-0150$ . We list below some important properties of the Hotelling's $c14-math-0151$ statistic:

1. Hotelling's $c14-math-0153$ distribution is skewed.
2. For a two-sided alternative hypothesis, the critical-region is one-tailed.
3. A necessary condition for the inverse of the pooled covariance matrix to exist is that the $c14-math-0154$ .
4. A straightforward, not necessarily simple, transformation of the Hotelling's statistic gives us an $c14-math-0155$ -statistic.

As in the previous section, we may also use the likelihood-ratio tests, which lead to an appropriate $c14-math-0156$ -test, for large $c14-math-0157$ of course, see Rencher (2002) or Johnson and Wichern (2006). In the next illustrative example, we obtain the Hotelling's test statistics, the associated $c14-math-0158$ -statistic, and the likelihood-ratio test.

Example 14.5.1. Psychological Tests for Males and Females

A psychological study consisting of four tests was conducted on male and female groups and the results were noted. Since the four tests are correlated and each one is noted for all the individuals, we are interested in knowing if the mean vector of the test scores is the same across the gender group. The four tests here are as follows:

$c14-math-0159$ : pictorial inconsistencies
$c14-math-0160$ : paper form board

$c14-math-0161$ : tool recognition
$c14-math-0162$ : vocabulary.

Assume that the covariance matrix is the same for both the groups, and that it is unknown. We will write a small program for calculating the Hotelling's $c14-math-0163$ statistic, and use the function from the ICSNP package for the $c14-math-0164$ - test and the chi-square tests.

> data(mfp)
> males <- mfp[,1:4]; females <- mfp[,5:8]
> nm <- nrow(males); nf <- nrow(females)
> meanm <- colMeans(males); meanf <- colMeans(females)
> sigmam <- var(males); sigmaf <- var(females)
> sigmapl <- (1/(nm+nf-2))*((nm-1)*sigmam+(nf-1)*sigmaf)
> t2 <- ((nm*nf)/(nm+nf))*(t(meanm-meanf)%*%solve(sigmapl)%*% (meanm-meanf))
> nm;nf;meanm;meanf;sigmapl;t2
[1] 32
[1] 32
    M_y1     M_y2     M_y3     M_y4
15.96875 15.90625 27.18750 22.75000
    F_y1     F_y2     F_y3     F_y4
12.34375 13.90625 16.65625 21.93750
         M_y1      M_y2      M_y3      M_y4
M_y1 7.164315  6.047379  5.693044  4.700605
M_y2 6.047379 15.894153  8.492440  5.855847
M_y3 5.693044  8.492440 29.356351 13.980847
M_y4 4.700605  5.855847 13.980847 22.320565
        [,1]
[1,] 97.6015
> HotellingsT2(males,females,test="f")
        Hotelling's two sample T2-test
data:  males and females
T.2 = 23.2197, df1 = 4, df2 = 59, p-value = 1.464e-11
alternative hypothesis: true location difference is not equal to c(0,0,0,0)
> HotellingsT2(males,females,test="chi")
        Hotelling's two sample T2-test
data:  males and females
T.2 = 97.6015, df = 4, p-value < 2.2e-16
alternative hypothesis: true location difference is not equal to c(0,0,0,0)

After importing the dataset, the first few lines of the code obtain the quantities $c14-math-0165$ , $c14-math-0166$ , $c14-math-0167$ , $c14-math-0168$ , $c14-math-0169$ , and $c14-math-0170$ . The pooled variance as specified in Equation 14.14 is computed by the R code line beginning with t2. The computations are illustrated to clarify the formulas. The R function HotellingsT2 is then used to obtain the results in the routine way.

Comparing the value of the test statistic $c14-math-0171$ with critical value $c14-math-0172$ , we are led to reject the hypothesis of equal mean vectors for the gender groups.□

$c14-math-0173$

14.6 Multivariate Analysis of Variance

The ANOVA deals with testing of $c14-math-0174$ -means being equal in the univariate case. A host of ANOVA techniques was seen in Chapter 13. For the multivariate case, we have the generalization multivariate analysis of variance, more commonly simply known as MANOVA. The data structure can be easily displayed in tabular form, and we adapt the notation from Rencher (2002).

Suppose we want to test for equality of mean of $c14-math-0175$ -vector samples. Let $c14-math-0176$ denote observation $c14-math-0177$ from population $c14-math-0178$ , $c14-math-0179$ . We assume that $c14-math-0180$ . The observation model is specified by

14.16

Here $c14-math-0182$ , and $c14-math-0183$ is the mean effect in the $c14-math-0184$ population. The hypothesis of interest is given by $c14-math-0185$ . To test the hypothesis $c14-math-0186$ , we need to define, as usual, the “between” and “within” sum of squares matrices, denoted by $c14-math-0187$ and $c14-math-0188$ respectively:

14.17

14.18

Let $c14-math-0191$ and $c14-math-0192$ respectively denote the rank of $c14-math-0193$ and $c14-math-0194$ . There are four different statistics to test for $c14-math-0195$ and they are now explained in some detail.

14.6.1 Wilks Test Statistic

The Wilks test statistic for $c14-math-0196$ is given by

14.19

In the above expression, $c14-math-0198$ denotes the determinant of the matrix. The multivariate literature refers to $c14-math-0199$ as Wilks' $c14-math-0200$ . The test statistic can be equivalently expressed in terms of the eigenvalues $c14-math-0201$ of $c14-math-0202$ , where $c14-math-0203$ is the rank of $c14-math-0204$ , and is given by

14.20

The Wilks' $c14-math-0206$ takes values in the interval $c14-math-0207$ . Thus, the test procedure is to reject $c14-math-0208$ if $c14-math-0209$ . These ideas and concepts are next illustrated using the well-known root-stack dataset.

Example 14.6.1. Apple of Different Rootstock

The variables description is listed:

$c14-math-0210$ = trunk girth at 4 years ( $c14-math-0211$ )
$c14-math-0212$ = extension growth at 4 years ( $c14-math-0213$ )

$c14-math-0214$ = trunk girth at 15 years ( $c14-math-0215$ )
$c14-math-0216$ = weight of tree above ground at 15 years ( $c14-math-0217$ )

The goal is to test if the mean vector of the four variables is the same across six stratas of the experiment, that is, $c14-math-0218$ . We will first use a laborious approach to obtain the Wilks' $c14-math-0219$ and test the hypothesis. The R function will follow these tedious codes later.

> # rootstock.dta is available at
> # http://www.stata-press.com/data/r10/rootstock.dta
> library(foreign)
> rootstock <- read.dta("rootstock.dta")
> rootstock1 <- rootstock[rootstock[,1]==1,2:5]
> rootstock2 <- rootstock[rootstock[,1]==2,2:5]
> rootstock3 <- rootstock[rootstock[,1]==3,2:5]
> rootstock4 <- rootstock[rootstock[,1]==4,2:5]
> rootstock5 <- rootstock[rootstock[,1]==5,2:5]
> rootstock6 <- rootstock[rootstock[,1]==6,2:5]
> n <- 8; p <- 4; vh <- 5; ve <- 6*(8-1); k <- 6
> ymm<- colSums(rootstock[,2:5])
> y1m <- colSums(rootstock1)
> y2m <- colSums(rootstock2)
> y3m <- colSums(rootstock3)
> y4m <- colSums(rootstock4)
> y5m <- colSums(rootstock5)
> y6m <- colSums(rootstock6)
> H <- ((y1m%*%t(y1m))/n) + ((y2m%*%t(y2m))/n)+((y3m%*%t(y3m))/n)
+ +((y4m%*%t(y4m))/n) + ((y5m%*%t(y5m))/n)+((y6m%*%t(y6m))/n)
+ - (ymm%*%t(ymm))/(k*n)
> E <- matrix(0,nrow=4, ncol=4);
> for(i in 1:nrow(rootstock)) {
+ a <- as.numeric(rootstock[i,2:5])
+ E <- E + a%*%t(a)
+ }
> E <- E - (((y1m%*%t(y1m))/n) + ((y2m%*%t(y2m))/n)
+ +((y3m%*%t(y3m))/n)+((y4m%*%t(y4m))/n) + ((y5m%*%t(y5m))/n)
+ +((y6m%*%t(y6m))/n))
> E_H <- E+H
> wlambda <- det(E)/(det(E_H))
> options(digits=3)
> E;H;E_H;wlambda
        y1    y2    y3    y4
[1,] 0.320  1.70 0.554 0.217
[2,] 1.697 12.14 4.364 2.110
[3,] 0.554  4.36 4.291 2.482
[4,] 0.217  2.11 2.482 1.723
         y1    y2    y3    y4
[1,] 0.0736 0.537 0.332 0.208
[2,] 0.5374 4.200 2.355 1.637
[3,] 0.3323 2.355 6.114 3.781
[4,] 0.2085 1.637 3.781 2.493
        y1    y2     y3    y4
[1,] 0.394  2.23  0.886 0.426
[2,] 2.234 16.34  6.719 3.747
[3,] 0.886  6.72 10.405 6.263
[4,] 0.426  3.75  6.263 4.216
[1] 0.154

From the data object rootstock, the data for each stratum is extracted with the code rootstock1 <- rootstock[rootstock[,1]==1,2:5]. We have exactly eight observations from each stratum and hence n <- 8, and similarly the other scalar numbers are created. The R function colSums is essentially used to capture terms such as $c14-math-0220$ and $c14-math-0221$ . To obtain the matrix $c14-math-0222$ , the R code ((y1m%*%t(y1m))/n)+ ... +((y6m%*%t(y6m))/n) - (ymm%*%t(ymm))/(k*n) imitates Formula 14.17. To obtain the matrix $c14-math-0223$ , the R program takes it into two steps. In the first step, it simply captures the first part of Formula 14.18, that is the $c14-math-0224$ part with the for loop for(i in 1:nrow(rootstock)). The rest of Equation 14.18 is computed with E <- E - (((y1m%*%t(y1m))/n) + ... + ((y6m%*%t(y6m))/n)). The rest of the R program is trivial to follow.

The calculated values of Wilks lambda 0.154 is less than the theoretical value of 0.455 (corresponding to $c14-math-0225$ , $c14-math-0226$ = 5, $c14-math-0227$ = 42). Thus, we reject the hypothesis that the mean vector is the same for the six strata. The R function manova shows the same result:

> attach(rootstock)
> rs <- rootstock[,1]
> rs <- factor(rs,ordered=is.ordered(rs)) # Too important a step
> root.manova <- manova(cbind(y1,y2,y3,y4)∼rs)
> summary(root.manova, test = "Wilks")
          Df Wilks approx F num Df den Df  Pr(>F)
rs         5 0.154     4.94     20    130 7.7e-09 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Complex computational statistics, at least from a vector and matrix point of view, is completely followed through a rudimentary approach and then verified through the R function manova.□

14.6.2 Roy's Test

It is beyond the scope of the current work to clearly underpin the statistical motivation of Roy's test. An elegant description of the Roy's test can be found in Section 6.1.4 of Rencher (2002). We now give a watered-down version of Roy's test. Let $c14-math-0228$ denote the largest eigenvalue of the matrix $c14-math-0229$ . The Roy's largest root test is given by

14.21

The test procedure is then to reject the hypothesis $c14-math-0231$ if $c14-math-0232$ , where $c14-math-0233$ .

For the rootstock dataset, the one-line R code below gives the result based on the Roy's test.

> summary(root.manova, test = "Roy")
          Df  Roy approx F num Df den Df Pr(>F)
rs         5 1.88     15.8      5     42  1e-08 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Roy's test also rejects the hypothesis $c14-math-0234$ that the mean vector for the six strata are equal.

14.6.3 Pillai's Test Statistic

Let $c14-math-0235$ denote the $c14-math-0236$ eigenvalues of the matrix $c14-math-0237$ . The Pillai test statistic is then given by

14.22

The test procedure is to reject $c14-math-0239$ if $c14-math-0240$ . In R, we carry out the Pillai's method as below.

> summary(root.manova, test = "Pillai")
          Df Pillai approx F num Df den Df Pr(>F)
rs         5   1.30     4.07     20    168  2e-07 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Pillai's test statistic confirms the findings of the Wilks's and Roy's test that the mean vector for the six strata are significantly different. Finally, we look at the fourth test for testing the hypothesis $c14-math-0241$ , that the strata mean vectors are equal to the Lawley-Hotelling test statistic.

14.6.4 The Lawley-Hotelling Test Statistic

The Lawley-Hotelling statistic is defined by:

14.23

The test procedure is to reject the hypothesis $c14-math-0243$ for large values of $c14-math-0244$ . This is illustrated in R.

> summary(root.manova, test = "Hotelling")
          Df Hotelling-Lawley approx F num Df den Df  Pr(>F)
rs         5             2.92     5.48     20    150 2.6e-10 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, it is concluded that all the four statistical tests lead to the same conclusion, that the mean vector for the six strata are different. From a theoretical point of view, there is no reason to prefer one of the four methods over the others. A general advice is to use all the four methods. Consider one more example before closing the MANOVA section.

Example 14.6.2. Testing for Physico-chemical Properties of Water in Four Cities

Water samples from four cities are collected and their physico-chemical properties for ten variables, such as pH, Conductivity, Total_Dissolved_Solid, etc., are measured. We would then like to test if the properties are the same across the four cities and in which case a same water treatment approach can be adopted for all cities. The MANOVA test is used for statistical analysis of the problem. This example is drawn from Gore, et al. (2006). This dataset is also available in the R package gpk and the dataset name is Waterquality.

> data(Waterquality)
> attach(Waterquality)
> City <- factor(City,ordered=is.ordered(City))
> WQ.manova <- manova(cbind(pH,Conductivity,Total_Dissolved_Solid, Alkalinity,Hardness,
+ Calcium_Hardness,Magnesium_Hardness,Chlorides,Sulphates)∼City)
> summary(WQ.manova, test = "Wilks")
          Df    Wilks approx F num Df den Df    Pr(>F)
City       3 0.030522   12.758     27 149.59 < 2.2e-16 ***
Residuals 59
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(WQ.manova, test = "Roy")
          Df    Roy approx F num Df den Df    Pr(>F)
City       3 6.3098   37.158      9     53 < 2.2e-16 ***
Residuals 59
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(WQ.manova, test = "Pillai")
          Df Pillai approx F num Df den Df    Pr(>F)
City       3  1.909   10.305     27    159 < 2.2e-16 ***
Residuals 59
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(WQ.manova, test = "Hotelling")
          Df Hotelling-Lawley approx F num Df den Df    Pr(>F)
City       3           8.5864   15.795     27    149 < 2.2e-16 ***
Residuals 59
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since each of the four statistical tests indicates that the mean vector of the ten variates across the four cities are significantly different from each other, the water treatment program across the cities has to be different.□

$c14-math-0245$

We will next look at testing hypotheses problems related to the variance-covariance matrices.

14.7 Testing for Variance-Covariance Matrix: One Sample

In MSA, the covariance matrix plays the role of the scale parameter. We thus naturally encounter the problem of testing $c14-math-0246$ . Let us begin with the testing problems of covariance matrix in the one sample case.

Let $c14-math-0247$ denote the sample covariance matrix of $c14-math-0248$ . Define $c14-math-0249$ . The likelihood ratio test statistic for the hypothesis $c14-math-0250$ is given by

14.24

where $c14-math-0252$ is the degrees of freedom of $c14-math-0253$ , $c14-math-0254$ is the natural logarithm (base $c14-math-0255$ ), and tr is the trace, sum of diagonal matrix, of a matrix. For large $c14-math-0256$ values, $c14-math-0257$ is approximately distributed as a $c14-math-0258$ - random variable with $c14-math-0259$ degrees of freedom. For moderate-sized samples, Rencher (2002) recommends the use of the following modification:

14.25

The test procedure is to reject the hypothesis $c14-math-0261$ if the values of $c14-math-0262$ or $c14-math-0263$ are greater than $c14-math-0264$ . Both the test statistics $c14-math-0265$ and $c14-math-0266$ are computed in the next example.

Example 14.7.1. Understanding the Height-Weight Relationship

This is a continuation of Example 14.4.1, where we have the height and weight of 20 college-age males. Here, test if the covariance matrix is indeed $c14-math-0267$ .

> data(hw)
> sigma0 <- matrix(c(20, 100, 100, 1000),nrow=2)
> sigma <- var(hw)
> v <- nrow(hw)-1
> p <- ncol(hw)
> u <- v*(log(det(sigma0))-log(det(sigma)) + sum(diag(sigma%*%solve (sigma0)))-p)
> u1 <- (1- (1/(6*v-1))*(2*p+1 - 2/(p+1)))*u
> u;u1;qchisq(1-0.05,p*(p+1)/2)
[1] 11.09374
[1] 10.66832
[1] 7.814728

The R code u <- v*(log(det(sigma0))... does the computation as required in Equation 14.24, while u1 <- (1- (1/(6*v-1))... does the computation for Equation 14.25. Comparing with the critical values of $c14-math-0268$ , with p <- ncol(hw), since the values of $c14-math-0269$ and $c14-math-0270$ are greater than the critical value, we reject the hypothesis $c14-math-0271$ and conclude that $c14-math-0272$ .□

14.7.1 Testing for Sphericity

The problem of testing if the components of a random vector are independent is equivalent to the problem of testing $c14-math-0273$ , where $c14-math-0274$ is the identity matrix. We would like to caution the reader to always keep in mind the counter-example of Section 14.3. The hypothesis is that we are testing equivalent tests if all the correlations among the component are equal to zero, that is, it examines if the components are independent.

Note that if the hypothesis $c14-math-0275$ holds true, the ellipsoid $c14-math-0276$ becomes $c14-math-0277$ , which is the equation of a sphere. Here, $c14-math-0278$ is some non-negative constant, that is, $c14-math-0279$ . Hence, the problem of a test for independence of the components is also known as tests of sphericity.

The log-likelihood ratio test for $c14-math-0280$ is given by

which on further evaluation leads to

where $c14-math-0283$ . The test statistic $c14-math-0284$ can be restated in terms of the eigenvalues as

14.26

where $c14-math-0286$ are the eigenvalues of the sample covariance matrix $c14-math-0287$ . An improvement of $c14-math-0288$ by $c14-math-0289$ is further given by

14.27

where $c14-math-0291$ is the degrees of freedom of $c14-math-0292$ . The statistic $c14-math-0293$ , under the hypothesis $c14-math-0294$ , has a $c14-math-0295$ distribution with $c14-math-0296$ degrees of freedom. The test procedure is to reject the hypothesis $c14-math-0297$ if $c14-math-0298$ .

Example 14.7.2. The Linguistic Probe Word Analysis

Probe words are used to test the recall ability of words in various linguistic contexts. In this experiment the response time to 5 different probe words are recorded for 11 individuals. The interest in the experiment is to examine if the response times to the different words are independent or not. The failure to reject the hypothesis of sphericity implies that the response times can be compared using ANOVA.

> data(pw)
> sigma <- var(pw[2:6])
> p <- ncol(pw)-1; v <- nrow(pw)-1
> u <- pˆp*(det(sigma))/(sum(diag(sigma))ˆp)
> u1 <- -(v-(2*pˆ2+p+2)/(6*p))*log(u)
> u;u1
[1] 0.03948874
[1] 26.17709

Note here that the R program code u <- pˆp*(det(sigma))/(sum(diag(sigma))ˆp) does not imitate the expression given in Equation 14.26, though the code for u1 follows Equation 14.27. However, we will leave it to the reader to obtain the relationship between determinants and eigenvalues, which will assure that the program is correct.

Since the calculated $c14-math-0299$ value is greater than the critical value of 23.68479, we reject the sphericity hypothesis.□

$c14-math-0300$

14.8 Testing for Variance-Covariance Matrix: $c14-math-0301$ -Samples

Consider the case when we have samples from $c14-math-0302$ -populations, that is, $c14-math-0303$ , for $c14-math-0304$ . For the $c14-math-0305$ sample, we have a sample of size $c14-math-0306$ . The hypothesis of interest here is $c14-math-0307$ . Also, define the following:

$c14-math-0309$ : the sample covariance matrix of the $c14-math-0310$ population;
$c14-math-0311$ : the degrees of freedom associated with the estimated covariance matrix $c14-math-0312$ .

Technically, we need to have $c14-math-0313$ for ensuring that the estimated covariance matrices are non-singular. Define the pooled sample covariance matrix by

The test statistic for the hypothesis $c14-math-0315$ is then given by the following:

14.28

The range of values for $c14-math-0317$ is between 0 and 1, with values closer to 1 favoring the hypothesis $c14-math-0318$ , and values closer to 0 leading to its rejection. This can be easily seen by rewriting the expression of $c14-math-0319$ as

14.29

An expression for $c14-math-0321$ is given by

14.30

The hypothesis $c14-math-0323$ may be tested using the exact M-test with $c14-math-0324$ , see page 258 of Rencher (2002).

To test the hypothesis $c14-math-0325$ , we may also use the Box's $c14-math-0326$ and $c14-math-0327$ - approximations for the probability distribution of $c14-math-0328$ . Towards this, we will first define $c14-math-0329$ as follows:

14.31

It can then be proved that

14.32

is distributed as a $c14-math-0332$ random variable with $c14-math-0333$ degrees of freedom.

The steps for obtaining the $c14-math-0336$ -approximation may appear cumbersome, but its benefits are also equally rewarding. As with the $c14-math-0337$ approximation, we will first define the required quantities. Define $c14-math-0338$ as a function of $c14-math-0339$ by

14.33

and also define the quantities $c14-math-0341$ in the following:

We have two scenarios here: (i) $c14-math-0343$ , and (ii) $c14-math-0344$ . In case (i), the appropriate $c14-math-0345$ -statistic is

14.34

and in case (ii), it is

14.35

In both cases, the approximation follows the $c14-math-0349$ distribution, and the test procedure is to reject the hypothesis $c14-math-0350$ if $c14-math-0351$ .

Example 14.8.1. Psychological Tests for Males and Females

Contd. We have considered the problem of testing hypothesis of equality of mean vectors in Section 14.5. We will now test if the covariance matrices for the male and female groups are equal or not, that is, $c14-math-0352$ . We first compute the exact $c14-math-0353$ test statistic value, and then calculate the Box's $c14-math-0354$ and $c14-math-0355$ test statistic values.

> # Testing for Equality of Covariance Matrices
> data(mfp)
> males <- mfp[,1:4]; females <- mfp[,5:8]
> nm <- nrow(males);nf <- nrow(females)
> p <- 4; k <- 2
> vm <- nm-1; vf <- nf-1
> meanm <- mean(males); meanf <- mean(females)
> sigmam <- var(males); sigmaf <- var(females)
> sigmapl <- (1/(nm+nf-2))*((nm-1)*sigmam+(nf-1)*sigmaf)
> ln_M <- .5*(vm*log(det(sigmam))+vf*log(det(sigmaf)))
+ -.5*(vm+vf)*log(det(sigmapl))
> exact_test <- -2*ln_M # the Exact Test
> exact_test
[1] 14.5606
> # The Box's chi-square approximation
> c1 <- (sum(c(1/vm,1/vf))- (1/sum(c(vm,vf))))*((2*pˆ2+3*p-1)
+ /(6*(p+1)*(k-1)))
> u <- -2*(1-c1)*ln_M
> qchisq(1-0.05,(k-1)*p*(p+1)/2)
[1] 18.30704
> u; qchisq(1-0.05,(k-1)*p*(p+1)/2)
[1] 13.55075
[1] 18.30704
> c2 <- ((p-1)*(p+2)/(6*(k-1)))*(sum(c(1/vm,1/vf)ˆ2)-
+ (1/(sum(c(vm,vf))ˆ2)))
> a1 <- (k-1)*p*(p+1)/2; a2 <- (a1+2)/(abs(c2-c1ˆ2))
> b1 <- (1-c1-a1/a2)/a1; b2 <- (1-c1+2/a2)/a2
> if(c2>c1ˆ2) {Ftest = -2*b1*ln_M} else {Ftest = (2*a2*b2*ln_M)/
+ (a1*(1+2*b2*ln_M))}
> Ftest; qf(1-.05,10,Inf)
[1] 1.354283
[1] 1.830704

The R code ln_M <- .5*(vm*log(det(sigmam))... delivers the computation for Equation 14.30 and exact_test <- -2*ln_M obtains the required exact M test statistic of 14.5606. Compared with the critical value, Appendix Table A.14 of Rencher (2002), of 19.74, we will be rejecting the hypothesis that the covariance matrices for males and females are equal.

To obtain the Box's $c14-math-0356$ , and $c14-math-0357$ -statistics, we need to obtain the $c14-math-0358$ and $c14-math-0359$ values. The R codes c1 <- ... and c2 <- ... clearly implement the required formulas, as given in Equations 14.31 and 14.33 . The program for Box's $c14-math-0360$ statistic in Equation 14.32 is carried out with u <- -2*(1-c1)*ln_M and the resultant value is 13.55075, and a comparison with the critical value for qchisq(1-0.05,(k-1)*p*(p1)/2)+ in 18.30704 again leads to rejecting the hypothesis.

For our problem, we have that $c14-math-0361$ , although the R program takes care of both scenarios. The details of obtaining $c14-math-0362$ , and $c14-math-0363$ is left to the reader. The computation of the $c14-math-0364$ -statistic is handled by the if-else control loop for both the cases, as given in Equations 14.34 and 14.35. All the three procedures lead to rejection of the hypothesis.□

14.9 Testing for Independence of Sub-vectors

The test of sphericity addresses the problem of testing if all the covariates are independent or not. A very likely practical problem could be that we may know beforehand that certainly not all the components are independent. However, we may also have knowledge that though the first three components and the next four components are related, it may be the case that the set of the first three components are independent of the set of the next four components. We would thus like to have some statistical tests to help us prove if our hypothesis is true or not. In fact, we need methods to help us test any combination of vectors as independent or not, and the methods in this section exactly help us to accomplish this.

Consider the $c14-math-0365$ -dimensional random vector $c14-math-0366$ . Suppose that we are interested to find if the sub-vectors $c14-math-0367$ are a $c14-math-0368$ mutually independent sets of sub-vectors. The notation needs a bit of explanation. If $c14-math-0369$ , we are testing if all the components are independent. We denote $c14-math-0370$ for the number of elements in the $c14-math-0371$ sub-vector $c14-math-0372$ , and we require that $c14-math-0373$ . Note that $c14-math-0374$ denotes a partitioning of $c14-math-0375$ and not a random sample of size $c14-math-0376$ of $c14-math-0377$ .

Let $c14-math-0378$ denote the covariance matrix between the sub-vectors $c14-math-0379$ and $c14-math-0380$ , $c14-math-0381$ . The hypothesis for independence of the sub-vectors can then be stated symbolically as $c14-math-0382$ , $c14-math-0383$ , and in matrix notation as below:

14.36

Let us denote the partition of estimated covariance matrix by the following:

14.37

The likelihood ratio test statistic for the hypothesis $c14-math-0386$ is given by

14.38

A $c14-math-0388$ approximation of the distribution of $c14-math-0390$ is given by

14.39

where $c14-math-0392$ and $c14-math-0393$ are determined by the following:

14.40

We reject the hypothesis $c14-math-0395$ if $c14-math-0396$ .

Example 14.9.1. The Seishu Wine Study

The odor and taste of wines are recorded in a study. It is believed that the variables such as the $c14-math-0397$ concentration, alcohol content, total sugar, etc., explain the odor and taste of the wine. The variables are enumerated below.

1. $c14-math-0398$ : Taste
2. $c14-math-0399$ : Odor
3. $c14-math-0400$ : pH
4. $c14-math-0401$ : Acidity 1
5. $c14-math-0402$ : Acidity 2

6 $c14-math-0403$ : Sake meter
7 $c14-math-0404$ : Direct reducing sugar
8 $c14-math-0405$ : Total sugar
9 $c14-math-0406$ : Alcohol
10 $c14-math-0407$ : Formyl nitrogen

Note that the variables $c14-math-0408$ and $c14-math-0409$ are in some sense outputs, or regressands, and appear to be a function of the other regressors. The (input) variables $c14-math-0410$ are the acidic content, whereas $c14-math-0411$ are the sugar content of the wine, and the remaining variables describe the alcohol and nitrogen content. It is thus natural to believe that we have here four types of sub-vectors, which may be mutually exclusive. Mathematically speaking, we are interested in testing if the sub-vectors $c14-math-0412$ are a mutually exclusive subsets of vector. The following R program deals with the related computations.

> data(sheishu)
> noc <- c(2,3,3,2)
> nov <- 10
> v <- nrow(sheishu)-1
> varsheishu <- var(sheishu)
> s11 <- varsheishu[1:2,1:2]
> s22 <- varsheishu[3:5,3:5]
> s33 <- varsheishu[6:8,6:8]
> s44 <- varsheishu[9:10,9:10]
> u <- det(varsheishu)/(det(s11)*det(s22)*det(s33)*det(s44))
> a2 <- novˆ2 - sum(nocˆ2)
> a3 <- novˆ3 - sum(nocˆ3)
> f <- a2/2
> cc <- 1 - (2*a3 + 3*a2)/(12*f*v)
> u1 <- -v*cc*log(u)
> u; a2; a3; f; cc; u1
[1] 0.01627025
[1] 74
[1] 930
[1] 37
[1] 0.8383038
[1] 100.1221
> qchisq(1-0.001,37)
[1] 69.34645

Using the var function, we first obtain $c14-math-0413$ as required in Equation 14.37, and the partitions with s11, s22, s33, and s44. The test statistic $c14-math-0414$ as required by Equation 14.38 is computed by u <- det(varsheishu)/(det(s11)*...*det(s44)). The constants required in Equation 14.40 are easily obtained in the variables a2, a3, f, v, and cc, which further help to obtain $c14-math-0415$ given by Equation 14.39, calculated with u1 <- -v*cc*log(u). Since the $c14-math-0416$ value exceeds qchisq(1-0.001,37) equal to 69.34645, we reject the hypothesis of independence of sub-vectors.□

$c14-math-0417$

In the next chapter on MSA, we will consider some of the more advanced topics, which are more useful in an applied context.

14.10 Further Reading

Anderson (1953, 1984, and 2003) is the first comprehensive and benchmark book in this area of statistics. Rencher (2002), Johnson and Wichern (2007), Hair, et al. (2010), Hardle and Simar (2007), and Izenman (2008) are some of the modern accounts of multivariate statistical analysis. Everitt (2005) has handled the associated computations through R and S-Plus. However, as Everitt (2005) and Everitt and Hothorn (2011) are dwelling more in advanced methods of MSA, we believe that the reader can benefit from the coverage given in this and the next chapter.

14.11 Complements, Problems, and Programs

Problem 14.1 The iris data has been introduced in AD2. Obtain the matrix of scatter plots for (i) the overall dataset (removing the Species), and (ii) three subsets according to the Species. Obtain the average of the four characteristics by the Species group and using the faces function from the aplpack package, plot the Chernoff faces. Do the Chernoff faces offer enough insight to identify the group?
Problem 14.2 For the board stiffness data discussed in Example 14.3.3, obtain the covariance matrix and then using the cov2cor function, obtain the correlation matrix.
Problem 14.3 The Mahalanobis distance $c14-math-0418$ given in Equation 14.7 is easily obtained in R using the mahalanobis function. Using this function, obtain the distance of the observations from the entire dataset for the board stiffness dataset and investigate for presence of outliers. Repeat the exercise for the presence of outliers in the iris dataset too.
Problem 14.4 Using the HotellingsT2 function from the ICSNP package, test whether average sepal and petal length and width for setosa species equals [5.936 2.770 4.260 1.326] in the iris dataset.
Problem 14.5 Using the HotellingsT2 function from the ICSNP package, test whether average sepal and petal length and width for setosa species equals that of versicolor in the iris dataset.
Problem 14.6 Run the example code of the function HotellingsT2, that is run example(HotellingsT2), and explore the options available with this function.
Problem 14.7 Carry out the MANOVA analysis for the iris datasets, where the hypothesis problem is that the mean of the multivariate vector of the four variables are equal across the three types of species.
Problem 14.8 Using base matrix tools of R, create a function which returns the value of Roy's test statistic given in Equation 14.21.
Problem 14.9 Repeat the above exercise for the Pillai and Lawley-Hotelling tests respectively given in Equations 14.22 and 14.23.
Problem 14.10 For the iris dataset, test the hypothesis $c14-math-0419$ . Repeat the exercise for the stack loss problem too.
Problem 14.11 Test whether the Sepal and Petal characteristics are independent of each other in the iris dataset.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.