Chapter 14
Multivariate Statistical Analysis - I

Package(s): ICSNP, scatterplot3d, aplpack, mvtnorm, foreign

Dataset(s): cardata, stiff, iris, hw, calcium, mfp, rootstock, waterquality, pw, sheishu

14.1 Introduction

In many real-world problems, data is seldom univariate. We typically have more than one variable, and a good understanding of the underlying uncertain phenomena requires studying them jointly. Thus, we need a set of tools to handle this type of data, and this is provided by Multivariate Statistical Analysis (MSA), a branch of the subject. We saw in the previous chapters on regression that multiple regressors explain the regressand. Sometimes experiments may need a deeper study of the covariates themselves. In particular, we are now concerned with a random vector, the characteristics of which form the crux of this and the next chapter.

In Section 14.2 we look at graphical plots, which give a deeper insight into the structure of the dataset. The core concepts of MSA are introduced in Section 14.3. Sections 14.4 and 14.5 deal with the inference problem related to the mean vectors of multivariate data, whereas inference related to the variance-covariance matrix is performed in Sections 14.7 and 14.8. Multivariate Analysis of Variance, abbreviated as MANOVA, tools are introduced and illustrated in Section 14.6, and some tests for independence of sub-vectors are addressed in Section 14.9. Advanced topics of multivariate statistical analysis are carried over to the next chapter.

14.2 Graphical Plots for Multivariate Data

In Chapter 12 we saw the use of scatter plots and pairs (matrix of scatter plots). A slight modification of the matrix of scatter plots is considered here, which gives more insight into the multivariate aspect of the dataset. Multi-dimensional data can still be visualized in two dimensions, and a particularly effective technique, provided by Chernoff faces, is detailed. We will begin with a multivariate dataset.

Chernoff (1973) gave a very innovative technique to visualize multivariate data, which maps each variate to some dimension of the human face, say the nose, ears, smile, cheeks, etc. Interpretation of three-dimensional plots is itself difficult, even if we deploy features such as rotation of the plots, and visualization in more than three dimensions is certainly not possible. Thus, the Chernoff technique of visualizing multivariate data through faces is very helpful, and such a plot is well known as a Chernoff faces plot. We deploy this method here using the graphical function faces from the R package aplpack.
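As a quick illustration, the following minimal sketch draws Chernoff faces in R; the choice of the iris data and of the first 16 rows is ours, purely for illustration.

# A minimal sketch of Chernoff faces via aplpack; the dataset choice
# (first 16 rows of iris, four numeric measurements) is illustrative.
library(aplpack)
faces(as.matrix(iris[1:16, 1:4]),
      labels = as.character(iris$Species[1:16]))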

Chernoff faces give one facet of visualization of multivariate data. There are many other similar techniques, though we will not dwell more on them. In the next Section 14.3 we consider more basic aspects of multivariate data and define the multivariate normal distribution in more detail.


14.3 Definitions, Notations, and Summary Statistics for Multivariate Data

14.3.1 Definitions and Data Visualization

We will denote a $p$-dimensional random vector by $X$, its $n$ replicates by $X_1, X_2, \ldots, X_n$, and the random vector of the $i$-th replicate as $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})'$. The mean vector and variance-covariance matrix of $X$ are respectively denoted as

14.1 $\mu = E(X) = (\mu_1, \mu_2, \ldots, \mu_p)'$
14.2 $\Sigma = \mathrm{Cov}(X) = E\left[(X - \mu)(X - \mu)'\right]$
14.3 $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$

Note that the matrix $\Sigma$ is a symmetric matrix, that is, $\sigma_{ij} = \sigma_{ji}$ for all $i, j = 1, \ldots, p$.

Sometimes, we may also be interested in the correlation matrix defined by

$\rho = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}, \quad \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$

Each correlation coefficient will be a number between $-1$ and $+1$, that is, $-1 \le \rho_{ij} \le 1$.

The $p$-dimensional normal density will be denoted by $N_p(\mu, \Sigma)$, and is given by

14.4 $f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu)\right\}$

The case of $p = 2$ refers to the bivariate normal distribution. For bivariate normal random variables with a zero mean vector and a couple of positive and negative correlations, we obtain the probability density plots, as sketched below.
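A minimal sketch of such a density plot, using the dmvnorm function from the mvtnorm package, is given below; the correlation value 0.7 is an illustrative choice, and the negative-correlation surface follows on setting rho to, say, -0.7.

# Bivariate normal density surface; rho is an illustrative correlation.
library(mvtnorm)
rho <- 0.7
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)   # zero-mean bivariate normal
x <- y <- seq(-3, 3, length.out = 50)
z <- outer(x, y, function(u, v) dmvnorm(cbind(u, v), sigma = Sigma))
persp(x, y, z, theta = 30, phi = 25, xlab = "x1", ylab = "x2",
      zlab = "density")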

We now describe some standard methods of estimation of the mean vector and the variance-covariance matrix. Define

14.5 $\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$
14.6 $\hat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)(x_i - \bar{x}_n)'$

Result 4.11 of Johnson and Wichern (2006) shows that the estimators $\bar{x}_n$ and $\hat{\Sigma}_n$ are respectively the MLEs of $\mu$ and $\Sigma$.

14.3.2 Early Outlier Detection

An important concept for the analysis of multivariate data is the Mahalanobis distance, defined for an observation $x_i$ by

14.7 $D_i^2 = (x_i - \bar{x}_n)' S^{-1} (x_i - \bar{x}_n)$

where $S$ is the sample covariance matrix. Thus, for each vector $x_i$, the Mahalanobis distance $D_i^2$ may be computed, and any unusually large value may then be marked as an outlier. The Mahalanobis distance is also called the generalized squared distance.

As in the univariate case, outliers in multivariate data need to be detected and addressed as early as possible. Graphical methods become a bit difficult if the number of variables is more than three. Johnson and Wichern (2006) suggest we should obtain the standardized values of the observations with respect to each variable, and they also recommend looking at the matrix of scatter plots. The four steps listed by them are:

  • Obtain the dot plot for each variable.
  • Obtain the matrix of scatter plots.
  • Obtain the standardized scores for each variable, $z_{ij} = (x_{ij} - \bar{x}_j)/\sqrt{s_{jj}}$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. Check for large and small values of these scores.
  • Obtain the generalized squared distances $D_i^2$ of Equation 14.7. Check for large distances.

A dot plot is also known as a Cleveland dot plot and it is set up using the dotchart function, which is an alternative to the bar plot. In this plot, a dot is used to represent the magnitude of the observation along the $x$-axis, with the observation number on the $y$-axis. The four steps are sketched in the code below.
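A minimal sketch of the four-step screening is as follows; the data frame stiff (the board stiffness data) is assumed to be available in the session, and the cut-off of 3 for the standardized scores is an illustrative choice.

# Four-step outlier screening; `stiff` is an assumed data frame of
# numeric measurements, e.g., the board stiffness data.
X <- as.matrix(stiff)
# Step 1: dot plot for each variable
for (j in seq_len(ncol(X))) dotchart(X[, j], main = colnames(X)[j])
# Step 2: matrix of scatter plots
pairs(X)
# Step 3: standardized scores; flag unusually large |z| values
Z <- scale(X)
which(abs(Z) > 3, arr.ind = TRUE)
# Step 4: generalized squared distances of Equation 14.7
D2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
dotchart(D2, main = "Generalized squared distances")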

We have seen some interesting graphical plots for the multivariate data. The multivariate normal distribution has also been introduced here, along with a few of its properties. We will next consider some testing problems for multivariate normal distribution.


14.4 Testing for Mean Vectors: One Sample

Suppose that we have random vector samples from a multivariate normal distribution, say $N_p(\mu, \Sigma)$. In Chapter 7, we saw testing problems for the univariate normal distribution. If the variance-covariance matrix $\Sigma$ is known to be a diagonal matrix, that is, the components are uncorrelated random variables, we can revert to the methods discussed there. However, this is generally not the case for multivariate distributions, and they need specialized methods for hypothesis testing problems.

For the problem of testing $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$, we consider two cases here: (i) $\Sigma$ is known, and (ii) $\Sigma$ is unknown. Suppose that we have $n$ samples of random vectors from $N_p(\mu, \Sigma)$, which we will denote as $X_1, X_2, \ldots, X_n$. A sample of random vectors is the same as saying that we have iid random vectors.

14.4.1 Testing for Mean Vector with Known Variance-Covariance Matrix

If the variance-covariance matrix is known, the test statistic is a multivariate extension of the $z$-statistic, and is given by

14.8 $\chi^2 = n (\bar{x} - \mu_0)' \Sigma^{-1} (\bar{x} - \mu_0)$

Under the hypothesis $H_0$, the test statistic $\chi^2$ is distributed as a chi-square variate with $p$ degrees of freedom, that is, as a $\chi^2_p$ random variable. The computations are fairly easy and we do that next.
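A minimal sketch of the computation follows; the sample size, the null vector, the "known" $\Sigma$, and the simulated data (via the mvtnorm package) are all illustrative choices.

# Chi-square test for a mean vector with known Sigma; all values are
# illustrative, with the data simulated for the sketch.
library(mvtnorm)
n <- 25; p <- 2
mu0   <- c(0, 0)                              # hypothesized mean vector
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)      # assumed known covariance
X    <- rmvnorm(n, mean = c(0.3, 0.1), sigma = Sigma)
xbar <- colMeans(X)
chi2 <- drop(n * t(xbar - mu0) %*% solve(Sigma) %*% (xbar - mu0))
pchisq(chi2, df = p, lower.tail = FALSE)      # p-value, Equation 14.8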

14.4.2 Testing for Mean Vectors with Unknown Variance-Covariance Matrix

It turns out that in many practical settings the variance-covariance matrix is unknown. Therefore, we need to extend the test procedure for this important case. Hotelling's $T^2$-statistic is given by

14.9 $T^2 = n (\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)$

where $S$ is the sample covariance matrix. Under the hypothesis $H_0$, the test statistic $T^2$ is distributed as Hotelling's $T^2$ distribution with $p$ and $n - 1$ degrees of freedom.

Hotelling's $T^2$ can be converted to an $F$-statistic using the transformation:

14.10 $F = \frac{n - p}{p(n - 1)} T^2 \sim F_{p, n - p}$

An R function for calculating the test statistic is available in the ICSNP package. The test function HotellingsT2 implements the $T^2$-test, checking whether the mean of a normal vector equals a specified null vector. We can easily carry out the $T^2$-test for the above example; a minimal sketch follows.
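In the sketch below, the setosa rows of the iris data and the null vector are our illustrative choices, not values from the original example.

# One-sample Hotelling T^2 test via ICSNP; the null vector mu is an
# illustrative assumption.
library(ICSNP)
setosa <- iris[iris$Species == "setosa", 1:4]
HotellingsT2(setosa, mu = c(5.0, 3.4, 1.5, 0.2))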

It is known that likelihood-ratio tests are very generic methods for hypothesis testing problems. The likelihood-ratio test statistic for $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ is given by

14.11 $\Lambda = \left(\frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|}\right)^{n/2}, \quad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})', \quad \hat{\Sigma}_0 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_0)(x_i - \mu_0)'$

It is further known from the general theory of likelihood ratios that $-2\ln\Lambda$ asymptotically follows a chi-square distribution. However, the computations are not trivial for the $\chi^2$-test based on the likelihood ratio, and hence the ICSNP package will be used to bail us out of this scenario. The illustration is continued with the previous dataset.

The problem of testing $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \ne \mu_2$ will be considered for the two-sample problem in Section 14.5.


14.5 Testing for Mean Vectors: Two Samples

Consider the case of random vector samples from two plausible populations, $N_p(\mu_1, \Sigma_1)$ and $N_p(\mu_2, \Sigma_2)$. Such scenarios arise, for instance, when new machinery is introduced: the population labeled 1 refers to samples obtained under the new machinery, and that labeled 2 refers to samples under the older machinery. As the comparison of the mean vectors becomes sensible only if the covariance matrices are equal, we assume that the covariance matrices are equal, but not known. That is, $\Sigma_1 = \Sigma_2 = \Sigma$, with $\Sigma$ unknown.

Suppose that we have samples from the two populations as $x_{11}, x_{12}, \ldots, x_{1n_1}$ and $x_{21}, x_{22}, \ldots, x_{2n_2}$. The hypothesis of interest is to test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \ne \mu_2$. The estimates of the mean vectors from the two populations are given by $\bar{x}_1$ and $\bar{x}_2$.

We will first define the matrices of sums of squares and cross-products as below:

14.12 $W_1 = \sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)(x_{1i} - \bar{x}_1)'$
14.13 $W_2 = \sum_{i=1}^{n_2} (x_{2i} - \bar{x}_2)(x_{2i} - \bar{x}_2)'$

Further define the pooled covariance matrix:

14.14 $S_{\mathrm{pooled}} = \frac{W_1 + W_2}{n_1 + n_2 - 2}$

Hotelling's $T^2$ test statistic is then given by

14.15 $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)' S_{\mathrm{pooled}}^{-1} (\bar{x}_1 - \bar{x}_2)$

The test statistic $T^2$, under the hypothesis $H_0$, is distributed as Hotelling's $T^2$ distribution with parameters $p$ and $n_1 + n_2 - 2$. We list below some important properties of the Hotelling's $T^2$ statistic:

  1. Hotelling's $T^2$ distribution is skewed.
  2. For a two-sided alternative hypothesis, the critical region is one-tailed.
  3. A necessary condition for the inverse of the pooled covariance matrix to exist is that $n_1 + n_2 - 2 \ge p$.
  4. A straightforward, though not necessarily simple, transformation of the Hotelling's statistic gives us an $F$-statistic: $F = \frac{n_1 + n_2 - p - 1}{p(n_1 + n_2 - 2)} T^2 \sim F_{p, n_1 + n_2 - p - 1}$.

As in the previous section, we may also use the likelihood-ratio test, which leads to an appropriate $\chi^2$-test for large $n$, of course; see Rencher (2002) or Johnson and Wichern (2006). In the next illustrative example, we obtain the Hotelling's test statistic, the associated $F$-statistic, and the likelihood-ratio test; a minimal sketch is also given below.
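In the sketch that follows, two species of the iris data serve as illustrative groups; the test = "chi" option of HotellingsT2 gives the large-sample likelihood-ratio version of the test.

# Two-sample Hotelling T^2 test via ICSNP; the iris groups are
# illustrative choices.
library(ICSNP)
setosa     <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]
HotellingsT2(setosa, versicolor)                # F-distributed form
HotellingsT2(setosa, versicolor, test = "chi")  # large-sample LRT form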


14.6 Multivariate Analysis of Variance

ANOVA deals with testing the equality of $k$ means in the univariate case. A host of ANOVA techniques was seen in Chapter 13. For the multivariate case, we have its generalization, multivariate analysis of variance, more commonly known simply as MANOVA. The data structure can be easily displayed in tabular form, and we adapt the notation from Rencher (2002).

Suppose we want to test for the equality of mean vectors of samples from $k$ populations. Let $y_{ij}$ denote observation $j$ from population $i$, $j = 1, \ldots, n_i$, $i = 1, \ldots, k$. We assume that $y_{ij} \sim N_p(\mu_i, \Sigma)$. The observation model is specified by

14.16 $y_{ij} = \mu_i + \epsilon_{ij}, \quad j = 1, \ldots, n_i, \; i = 1, \ldots, k$

Here $\epsilon_{ij} \sim N_p(0, \Sigma)$, and $\mu_i$ is the mean effect in the $i$-th population. The hypothesis of interest is given by $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$. To test the hypothesis $H_0$, we need to define, as usual, the "between" and "within" sums of squares matrices, denoted by $H$ and $E$ respectively:

14.17 $H = \sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})(\bar{y}_{i.} - \bar{y}_{..})'$
14.18 $E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})(y_{ij} - \bar{y}_{i.})'$

Let $\nu_H$ and $\nu_E$ respectively denote the ranks of $H$ and $E$. There are four different statistics to test for $H_0$, and they are now explained in some detail.

14.6.1 Wilks Test Statistic

The Wilks test statistic for $H_0$ is given by

14.19 $\Lambda = \frac{|E|}{|E + H|}$

In the above expression, $|\cdot|$ denotes the determinant of the matrix. The multivariate literature refers to $\Lambda$ as Wilks' $\Lambda$. The test statistic can be equivalently expressed in terms of the eigenvalues $\lambda_1, \ldots, \lambda_s$ of $E^{-1} H$, where $s$ is the rank of $H$, and is given by

14.20 $\Lambda = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i}$

Wilks' $\Lambda$ takes values in the interval $[0, 1]$, with small values counting against the hypothesis. Thus, the test procedure is to reject $H_0$ if $\Lambda \le \Lambda_{\alpha, p, \nu_H, \nu_E}$. These ideas and concepts are next illustrated using the well-known rootstock dataset, beginning with the sketch below.
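In the sketch, the data frame rootstock, its grouping factor rs, and the response names y1, ..., y4 are assumptions made to match the outputs shown in the following subsections.

# MANOVA fit for the rootstock data; column names are our assumption.
root.manova <- manova(cbind(y1, y2, y3, y4) ~ rs, data = rootstock)
summary(root.manova, test = "Wilks")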

14.6.2 Roy's Test

It is beyond the scope of the current work to clearly underpin the statistical motivation of Roy's test. An elegant description of Roy's test can be found in Section 6.1.4 of Rencher (2002). We now give a simplified version of Roy's test. Let $\lambda_1$ denote the largest eigenvalue of the matrix $E^{-1} H$. Roy's largest root test statistic is given by

14.21 $\theta = \frac{\lambda_1}{1 + \lambda_1}$

The test procedure is then to reject the hypothesis $H_0$ if $\theta \ge \theta_{\alpha}$, where $\theta_{\alpha}$ is the upper $\alpha$ critical value of the distribution of Roy's largest root.

For the rootstock dataset, the one-line R code below gives the result based on Roy's test.

> summary(root.manova, test = "Roy")
          Df  Roy approx F num Df den Df Pr(>F)
rs         5 1.88     15.8      5     42  1e-08 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Roy's test also rejects the hypothesis $H_0$ that the mean vectors for the six strata are equal.

14.6.3 Pillai's Test Statistic

Let $\lambda_1, \lambda_2, \ldots, \lambda_s$ denote the nonzero eigenvalues of the matrix $E^{-1} H$. The Pillai test statistic is then given by

14.22 $V^{(s)} = \mathrm{tr}\left[(E + H)^{-1} H\right] = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i}$

The test procedure is to reject $H_0$ for large values of $V^{(s)}$, that is, if $V^{(s)} \ge V^{(s)}_{\alpha}$. In R, we carry out Pillai's method as below.

> summary(root.manova, test = "Pillai")
          Df Pillai approx F num Df den Df Pr(>F)
rs         5   1.30     4.07     20    168  2e-07 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pillai's test statistic confirms the findings of the Wilks and Roy tests that the mean vectors for the six strata are significantly different. Finally, we look at the fourth test of the hypothesis $H_0$ that the strata mean vectors are equal: the Lawley-Hotelling test statistic.

14.6.4 The Lawley-Hotelling Test Statistic

The Lawley-Hotelling statistic is defined by:

14.23 $U^{(s)} = \mathrm{tr}\left(E^{-1} H\right) = \sum_{i=1}^{s} \lambda_i$

The test procedure is to reject the hypothesis $H_0$ for large values of $U^{(s)}$. This is illustrated in R.

> summary(root.manova, test = "Hotelling")
          Df Hotelling-Lawley approx F num Df den Df  Pr(>F)
rs         5             2.92     5.48     20    150 2.6e-10 ***
Residuals 42
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, it is concluded that all four statistical tests lead to the same conclusion: the mean vectors for the six strata are different. From a theoretical point of view, there is no reason to prefer one of the four methods over the others, and the general advice is to use all four. Consider one more example before closing the MANOVA section.


We will next look at hypothesis testing problems related to the variance-covariance matrix.

14.7 Testing for Variance-Covariance Matrix: One Sample

In MSA, the covariance matrix plays the role of the scale parameter. We thus naturally encounter the problem of testing $H_0: \Sigma = \Sigma_0$. Let us begin with the testing problems of the covariance matrix in the one-sample case.

Let $S$ denote the sample covariance matrix based on a sample of size $n$ from $N_p(\mu, \Sigma)$, and define $\nu = n - 1$. The likelihood ratio test statistic for the hypothesis $H_0: \Sigma = \Sigma_0$ is given by

14.24 $u = \nu \left[\ln|\Sigma_0| - \ln|S| + \mathrm{tr}\left(S \Sigma_0^{-1}\right) - p\right]$

where $\nu$ is the degrees of freedom of $S$, $\ln$ is the natural logarithm (base $e$), and $\mathrm{tr}$ is the trace, the sum of the diagonal elements, of a matrix. For large $\nu$ values, $u$ is approximately distributed as a $\chi^2$ random variable with $p(p+1)/2$ degrees of freedom. For moderate-sized samples, Rencher (2002) recommends the use of the following modification:

14.25 $u' = \left[1 - \frac{1}{6\nu - 1}\left(2p + 1 - \frac{2}{p + 1}\right)\right] u$

The test procedure is to reject the hypothesis $H_0$ if the value of $u$ or $u'$ is greater than $\chi^2_{\alpha, p(p+1)/2}$. Both the test statistics $u$ and $u'$ are computed in the sketch below.
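A minimal sketch of both statistics, wrapped in a small helper function, is given below; the function name cov_lrt is ours, with X an assumed data matrix and Sigma0 the hypothesized covariance matrix.

# One-sample covariance LRT of Equations 14.24 and 14.25.
cov_lrt <- function(X, Sigma0) {
  n <- nrow(X); p <- ncol(X); nu <- n - 1
  S <- cov(X)
  u <- nu * (log(det(Sigma0)) - log(det(S)) +
             sum(diag(S %*% solve(Sigma0))) - p)
  u_mod <- (1 - (2 * p + 1 - 2 / (p + 1)) / (6 * nu - 1)) * u
  df <- p * (p + 1) / 2
  c(u = u, u.modified = u_mod,
    p.value = pchisq(u_mod, df, lower.tail = FALSE))
}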

14.7.1 Testing for Sphericity

The problem of testing whether the components of a random vector are independent is equivalent to the problem of testing $H_0: \Sigma = \sigma^2 I$, where $I$ is the identity matrix and $\sigma^2 > 0$. We would like to caution the reader to always keep in mind the counter-example of Section 14.3: the hypothesis being tested is equivalent to testing that all the correlations among the components are equal to zero, which, under normality, examines whether the components are independent.

Note that if the hypothesis $H_0$ holds true, the ellipsoid $(x - \mu)' \Sigma^{-1} (x - \mu) = c^2$ becomes $(x - \mu)'(x - \mu) = \sigma^2 c^2$, which is the equation of a sphere. Here, $c$ is some non-negative constant, that is, $c \ge 0$. Hence, the problem of a test for independence of the components is also known as a test of sphericity.

The likelihood ratio test statistic for $H_0: \Sigma = \sigma^2 I$ is given by

$\Lambda = \left[\frac{|S|}{\left(\mathrm{tr}(S)/p\right)^{p}}\right]^{n/2}$

which on further evaluation leads to

$u = \Lambda^{2/n} = \frac{|S|}{\left(\mathrm{tr}(S)/p\right)^{p}}$

where $S$ is the sample covariance matrix. The test statistic $u$ can be restated in terms of the eigenvalues as

14.26 $u = \frac{\prod_{i=1}^{p} \lambda_i}{\left(\frac{1}{p}\sum_{i=1}^{p} \lambda_i\right)^{p}}$

where $\lambda_1, \ldots, \lambda_p$ are the eigenvalues of the sample covariance matrix $S$. An improvement over $u$ is further given by

14.27 $u' = -\left(\nu - \frac{2p^2 + p + 2}{6p}\right) \ln u$

where $\nu = n - 1$ is the degrees of freedom of $S$. The statistic $u'$, under the hypothesis $H_0$, has an approximate $\chi^2$ distribution with $p(p+1)/2 - 1$ degrees of freedom. The test procedure is to reject the hypothesis $H_0$ if $u' > \chi^2_{\alpha, p(p+1)/2 - 1}$. A minimal sketch of the computation is given below.
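In the sketch, the helper name sphericity_test is ours, and X is an assumed numeric data matrix.

# Sphericity LRT via Equations 14.26 and 14.27.
sphericity_test <- function(X) {
  n <- nrow(X); p <- ncol(X); nu <- n - 1
  lam <- eigen(cov(X), only.values = TRUE)$values
  u <- prod(lam) / mean(lam)^p                            # Equation 14.26
  u_mod <- -(nu - (2 * p^2 + p + 2) / (6 * p)) * log(u)   # Equation 14.27
  df <- p * (p + 1) / 2 - 1
  c(u = u, u.modified = u_mod,
    p.value = pchisq(u_mod, df, lower.tail = FALSE))
}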


14.8 Testing for Variance-Covariance Matrix: $k$-Samples

Consider the case when we have samples from $k$ populations, that is, $N_p(\mu_i, \Sigma_i)$, for $i = 1, 2, \ldots, k$. From the $i$-th population, we have a sample of size $n_i$. The hypothesis of interest here is $H_0: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_k$. Also, define the following:

  • $S_i$: the sample covariance matrix of the $i$-th population;
  • $\nu_i = n_i - 1$: the degrees of freedom associated with the estimated covariance matrix $S_i$.

Technically, we need to have $\nu_i > p$ to ensure that the estimated covariance matrices are non-singular. Define the pooled sample covariance matrix by

$S_{\mathrm{pooled}} = \frac{\sum_{i=1}^{k} \nu_i S_i}{\sum_{i=1}^{k} \nu_i}$

The test statistic for the hypothesis $H_0$ is then given by the following:

14.28 $M = \frac{\prod_{i=1}^{k} |S_i|^{\nu_i / 2}}{|S_{\mathrm{pooled}}|^{\left(\sum_{i=1}^{k} \nu_i\right)/2}}$

The range of values for $M$ is between 0 and 1, with values closer to 1 favoring the hypothesis $H_0$, and values closer to 0 leading to its rejection. This can be easily seen by rewriting the expression of $M$ as

14.29 $M = \prod_{i=1}^{k} \left(\frac{|S_i|}{|S_{\mathrm{pooled}}|}\right)^{\nu_i / 2}$

An expression for $\ln M$ is given by

14.30 $\ln M = \frac{1}{2} \sum_{i=1}^{k} \nu_i \ln |S_i| - \frac{1}{2} \left(\sum_{i=1}^{k} \nu_i\right) \ln |S_{\mathrm{pooled}}|$

The hypothesis $H_0$ may be tested using the exact $M$-test when the $\nu_i$ are all equal; see page 258 of Rencher (2002).

To test the hypothesis $H_0$, we may also use Box's $\chi^2$- and $F$-approximations for the probability distribution of $M$. Towards this, we will first define $c_1$ as follows:

14.31 $c_1 = \left(\sum_{i=1}^{k} \frac{1}{\nu_i} - \frac{1}{\sum_{i=1}^{k} \nu_i}\right) \frac{2p^2 + 3p - 1}{6(p+1)(k-1)}$

It can then be proved that

14.32 $u = -2(1 - c_1) \ln M$

is distributed approximately as a $\chi^2$ random variable with $\frac{1}{2}(k-1)p(p+1)$ degrees of freedom.

The steps for obtaining the $F$-approximation may appear cumbersome, but its benefits are equally rewarding. As with the $\chi^2$ approximation, we will first define the required quantities. Define $c_2$ as a function of the $\nu_i$ by

14.33 $c_2 = \frac{(p-1)(p+2)}{6(k-1)} \left(\sum_{i=1}^{k} \frac{1}{\nu_i^2} - \frac{1}{\left(\sum_{i=1}^{k} \nu_i\right)^2}\right)$

and also define the quantities $a_1, a_2, b_1, b_2$ in the following:

$a_1 = \frac{1}{2}(k-1)p(p+1), \quad a_2 = \frac{a_1 + 2}{|c_2 - c_1^2|}, \quad b_1 = \frac{1 - c_1 - a_1/a_2}{a_1}, \quad b_2 = \frac{1 - c_1 + 2/a_2}{a_2}$

We have two scenarios here: (i) $c_2 > c_1^2$, and (ii) $c_2 < c_1^2$. In case (i), the appropriate $F$-statistic is

14.34 $F = -2 b_1 \ln M$

and in case (ii), it is

14.35 $F = -\frac{2 a_2 b_2 \ln M}{a_1 (1 + 2 b_2 \ln M)}$

In both cases, the approximation follows the $F_{a_1, a_2}$ distribution, and the test procedure is to reject the hypothesis $H_0$ if $F > F_{\alpha, a_1, a_2}$. A sketch of the $\chi^2$ version of the test is given below.
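In the sketch, the helper name box_m is ours; the input is assumed to be a list of per-group data matrices (or data frames), and only the $\chi^2$ approximation is implemented.

# Box's M test with the chi-square approximation of Equation 14.32.
box_m <- function(groups) {
  k  <- length(groups)
  p  <- ncol(groups[[1]])
  nu <- sapply(groups, nrow) - 1
  Ss <- lapply(groups, cov)
  Sp <- Reduce(`+`, Map(`*`, Ss, nu)) / sum(nu)             # pooled S
  lnM <- 0.5 * sum(nu * sapply(Ss, function(S) log(det(S)))) -
         0.5 * sum(nu) * log(det(Sp))                       # Equation 14.30
  c1 <- (sum(1 / nu) - 1 / sum(nu)) *
        (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))     # Equation 14.31
  u  <- -2 * (1 - c1) * lnM                                 # Equation 14.32
  df <- (k - 1) * p * (p + 1) / 2
  c(u = u, df = df, p.value = pchisq(u, df, lower.tail = FALSE))
}
# e.g.: box_m(split(iris[, 1:4], iris$Species))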

14.9 Testing for Independence of Sub-vectors

The test of sphericity addresses the problem of testing whether all the covariates are independent. A very likely practical situation is that we know beforehand that certainly not all the components are independent. However, we may also believe that, although the first three components are interrelated and the next four components are interrelated, the set of the first three components is independent of the set of the next four. We would thus like some statistical tests to help us assess whether such a hypothesis is tenable. In fact, we need methods to test whether any partition of the vector consists of independent sub-vectors, and the methods in this section accomplish exactly this.

Consider the $p$-dimensional random vector $y \sim N_p(\mu, \Sigma)$. Suppose that we are interested in finding whether the sub-vectors $y_1, y_2, \ldots, y_k$ form $k$ mutually independent sets of sub-vectors. The notation needs a bit of explanation. If $k = p$, we are testing whether all the components are independent. We denote by $p_i$ the number of elements in the $i$-th sub-vector $y_i$, and we require that $\sum_{i=1}^{k} p_i = p$. Note that $y_1, \ldots, y_k$ denotes a partitioning of $y$ and not a random sample of size $k$ of $y$.

Let $\Sigma_{ij}$ denote the covariance matrix between the sub-vectors $y_i$ and $y_j$, $i \ne j$. The hypothesis of independence of the sub-vectors can then be stated symbolically as $H_0: \Sigma_{ij} = 0$, $i \ne j$, and in matrix notation as below:

14.36 $\Sigma_0 = \begin{pmatrix} \Sigma_{11} & 0 & \cdots & 0 \\ 0 & \Sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_{kk} \end{pmatrix}$

Let us denote the partition of the estimated covariance matrix by the following:

14.37 $S = \begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1k} \\ S_{21} & S_{22} & \cdots & S_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ S_{k1} & S_{k2} & \cdots & S_{kk} \end{pmatrix}$

The likelihood ratio test statistic for the hypothesis $H_0$ is given by

14.38 $\Lambda = \frac{|S|}{\prod_{i=1}^{k} |S_{ii}|}$

A $\chi^2$ approximation of the distribution of $\Lambda$ is given by

14.39 $u = -\nu c \ln \Lambda$

where $c$ and the degrees of freedom $f$ are determined by the following:

$a_2 = p^2 - \sum_{i=1}^{k} p_i^2, \quad a_3 = p^3 - \sum_{i=1}^{k} p_i^3, \quad f = \frac{1}{2} a_2, \quad c = 1 - \frac{2 a_3 + 3 a_2}{12 f \nu}$

with $\nu$ the degrees of freedom of $S$. We reject the hypothesis $H_0$ if $u \ge \chi^2_{\alpha, f}$. A minimal sketch of the computation is given below.
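In the sketch, the helper name subvec_indep is ours; X is an assumed data matrix, and blocks is a list of column indices defining the partition (compare Problem 14.11).

# LRT for independence of sub-vectors, Equations 14.38 and 14.39.
subvec_indep <- function(X, blocks) {
  n <- nrow(X); p <- ncol(X); nu <- n - 1
  S <- cov(X)
  lambda <- det(S) /
    prod(sapply(blocks, function(b) det(S[b, b, drop = FALSE])))
  p_i <- sapply(blocks, length)
  a2 <- p^2 - sum(p_i^2); a3 <- p^3 - sum(p_i^3)
  f  <- a2 / 2
  cc <- 1 - (2 * a3 + 3 * a2) / (12 * f * nu)
  u  <- -nu * cc * log(lambda)
  c(statistic = u, df = f, p.value = pchisq(u, f, lower.tail = FALSE))
}
# e.g., Sepal block vs Petal block in iris:
# subvec_indep(as.matrix(iris[, 1:4]), list(1:2, 3:4))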


In the next chapter on MSA, we will consider some of the more advanced topics, which are more useful in an applied context.

14.10 Further Reading

Anderson (1958, 1984, and 2003) is the first comprehensive and benchmark book in this area of statistics. Rencher (2002), Johnson and Wichern (2007), Hair et al. (2010), Härdle and Simar (2007), and Izenman (2008) are some of the modern accounts of multivariate statistical analysis. Everitt (2005) has handled the associated computations through R and S-Plus. However, as Everitt (2005) and Everitt and Hothorn (2011) dwell more on advanced methods of MSA, we believe that the reader can benefit from the coverage given in this and the next chapter.

14.11 Complements, Problems, and Programs

  1. Problem 14.1 The iris data has been introduced in AD2. Obtain the matrix of scatter plots for (i) the overall dataset (removing the Species), and (ii) three subsets according to the Species. Obtain the average of the four characteristics by the Species group and using the faces function from the aplpack package, plot the Chernoff faces. Do the Chernoff faces offer enough insight to identify the group?

  2. Problem 14.2 For the board stiffness data discussed in Example 14.3.3, obtain the covariance matrix and then using the cov2cor function, obtain the correlation matrix.

  3. Problem 14.3 The Mahalanobis distance $D_i^2$ given in Equation 14.7 is easily obtained in R using the mahalanobis function. Using this function, obtain the distances of the observations for the board stiffness dataset and investigate for the presence of outliers. Repeat the exercise for the iris dataset too.

  4. Problem 14.4 Using the HotellingsT2 function from the ICSNP package, test whether average sepal and petal length and width for setosa species equals [5.936 2.770 4.260 1.326] in the iris dataset.

  5. Problem 14.5 Using the HotellingsT2 function from the ICSNP package, test whether average sepal and petal length and width for setosa species equals that of versicolor in the iris dataset.

  6. Problem 14.6 Run the example code of the function HotellingsT2, that is run example(HotellingsT2), and explore the options available with this function.

  7. Problem 14.7 Carry out the MANOVA analysis for the iris dataset, where the hypothesis is that the mean vectors of the four variables are equal across the three species.

  8. Problem 14.8 Using base matrix tools of R, create a function which returns the value of Roy's test statistic given in Equation 14.21.

  9. Problem 14.9 Repeat the above exercise for the Pillai and Lawley-Hotelling tests respectively given in Equations 14.22 and 14.23.

  10. Problem 14.10 For the iris dataset, test a hypothesis of the form $H_0: \Sigma = \Sigma_0$ using the tests of Section 14.7. Repeat the exercise for the stack loss problem too.

  11. Problem 14.11 Test whether the Sepal and Petal characteristics are independent of each other in the iris dataset.
