Package(s): DAAG, HSAUR2, qcc
Dataset(s): iris, socsupport, chemicaldata, USairpollution, hearing, cork, adjectives, life
In the previous chapter we built up some of the essential multivariate techniques. The results there set up a platform for more practical applications. Classification and discriminant analysis techniques work well for classifying observations into distinct groups; this topic forms the content of Section 15.2. Canonical correlations help to identify whether there are groups of variables present in a multivariate vector, and they will be dealt with in Section 15.3. Principal Component Analysis (PCA) helps in obtaining a new, smaller set of variables which retain most of the overall variation of the original set of variables. This multivariate technique will be developed in Section 15.4, whereas specific areas of application of the technique will be dealt with in Section 15.5. Multivariate data may also be used to find a new set of variables using Factor Analysis; see Section 15.6.
An important application of MSA is to classify the data into distinct groups. This task is achieved through two steps: (i) Discriminant Analysis, and (ii) Classification. In the first step we identify linear functions which describe the similarities and differences among the groups; this is achieved through the relative contribution of the variables towards the separation of the groups, and finds an optimal plane which separates them. The second task is the allocation of observations to the groups identified in the first step, which is broadly called Classification. We will begin with the first task in the forthcoming subsection.
Suppose that there are two groups characterized by two multivariate normal distributions: $N_p(\boldsymbol{\mu}_1, \Sigma)$ and $N_p(\boldsymbol{\mu}_2, \Sigma)$. It is assumed that the variance-covariance matrix $\Sigma$ is the same for both groups. Assume that we have $n_1$ observations from $N_p(\boldsymbol{\mu}_1, \Sigma)$ and $n_2$ observations from $N_p(\boldsymbol{\mu}_2, \Sigma)$. The discriminant function is a linear combination of the variables which maximizes the distance between the two groups' mean vectors. Thus, we are seeking a vector $\mathbf{a}$ which achieves the required objective.
As a first step, the observation vectors $\mathbf{y}$ are transformed to scalars $z$ as below:
$$z = \mathbf{a}'\mathbf{y}.$$
Define the means of the transformed scalars and the pooled variance as below:
$$\bar{z}_1 = \mathbf{a}'\bar{\mathbf{y}}_1, \quad \bar{z}_2 = \mathbf{a}'\bar{\mathbf{y}}_2, \quad s_z^2 = \mathbf{a}' S_{pl} \mathbf{a},$$
where $\bar{\mathbf{y}}_1, \bar{\mathbf{y}}_2$ are the sample mean vectors of the two groups and $S_{pl}$ is the pooled sample covariance matrix.
Since the goal is to find that $\mathbf{a}$ which maximizes the distance between the group means, the problem is to maximize the squared distance:
$$\frac{(\bar{z}_1 - \bar{z}_2)^2}{s_z^2} = \frac{\left[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)\right]^2}{\mathbf{a}' S_{pl} \mathbf{a}}.$$
The maximum of the squared distance occurs at the $\mathbf{a}$ given by
$$\mathbf{a} = S_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2).$$
The steps of discriminant analysis are illustrated through the next example.
The use of the discriminant function for classification is considered next.
Let $\mathbf{y}$ be a new vector of observations. The goal is to classify it into one of the groups by using the discriminant function. The simple, and fairly obvious, technique is to first obtain the discriminant score by
$$z = \mathbf{a}'\mathbf{y} = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)' S_{pl}^{-1} \mathbf{y}.$$
Next, classify $\mathbf{y}$ to group 1 or group 2 according as $z$ is closer to $\bar{z}_1$ or $\bar{z}_2$. A simple illustration is given next.
The function lda from the MASS package handles Linear Discriminant Analysis very well. The particular reason for not using that function here is that our focus has been the elucidation of the formulas in the flow of the theory. The results arising from the lda function, via the command lda(GROUP ~ X1 + X2, data = rencher), are a bit different, and the reader is asked to figure out why. It goes without explicit mention that the reader has a host of other options using the lda function.
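To make these steps concrete, a minimal R sketch of the above computations is given below. It assumes a data frame rencher with columns GROUP, X1, and X2, matching the lda command cited above; the coding of GROUP as 1 and 2 is an assumption.

g1 <- subset(rencher, GROUP == 1, select = c(X1, X2))
g2 <- subset(rencher, GROUP == 2, select = c(X1, X2))
n1 <- nrow(g1); n2 <- nrow(g2)
ybar1 <- colMeans(g1); ybar2 <- colMeans(g2)
Spl <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)  # pooled covariance
a <- solve(Spl) %*% (ybar1 - ybar2)               # discriminant coefficient vector
zbar1 <- sum(a * ybar1); zbar2 <- sum(a * ybar2)  # group mean scores
y0 <- ybar1                                       # a placeholder new observation
z0 <- sum(a * y0)                                 # its discriminant score
ifelse(abs(z0 - zbar1) < abs(z0 - zbar2), "Group 1", "Group 2")
# Compare with MASS::lda(GROUP ~ X1 + X2, data = rencher)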
In multivariate data, we may have the case that there are two distinct subsets of vectors, with each subset characterizing certain traits of the unit of measurement. As an example, the marks obtained by a student in the examinations for different subjects form one subset of measurements, whereas the performance in different sports may form another subset. Canonical correlations help us to understand the relationship between such sets of vector data.
Let $\mathbf{y}$ be a $p$-dimensional vector and $\mathbf{x}$ a $q$-dimensional vector measured on the same experimental unit. The goal of a canonical correlation study is to obtain vectors $\mathbf{a}$ and $\mathbf{b}$ such that the correlation between $u = \mathbf{a}'\mathbf{y}$ and $v = \mathbf{b}'\mathbf{x}$ is a maximum, that is, $r_{uv}$ is a maximum.
The sample covariance matrix for the combined vector $(\mathbf{y}', \mathbf{x}')'$ is
$$S = \begin{pmatrix} S_{yy} & S_{yx} \\ S_{xy} & S_{xx} \end{pmatrix},$$
where $S_{yy}$ is the sample covariance matrix of $\mathbf{y}$, $S_{yx}$ is the sample covariance matrix between $\mathbf{y}$ and $\mathbf{x}$, and $S_{xx}$ that of $\mathbf{x}$. A measure of association between the $y$'s and the $x$'s is given by
$$\prod_{i=1}^{s} r_i^2,$$
where $s = \min(p, q)$ and $r_1^2, r_2^2, \ldots, r_s^2$ are the eigenvalues of $S_{yy}^{-1} S_{yx} S_{xx}^{-1} S_{xy}$. Note that this association measure will be a poor one, since each of the $r_i^2$ values is between 0 and 1, and hence the product of such numbers approaches 0 quickly. However, the eigenvalues themselves provide a useful measure of association between the vectors. In particular, the square roots of the eigenvalues lead to useful interpretations of the measures of association. The collection of the square roots of the eigenvalues, $r_1, \ldots, r_s$, has been named the canonical correlations in the multivariate literature. Without loss of generality, we assume that $r_1^2 \geq r_2^2 \geq \cdots \geq r_s^2$.
As mentioned in Rencher (2002), the best overall measure of association between the $y$'s and $x$'s is the largest squared canonical correlation $r_1^2$. However, the other eigenvalues, leading to the remaining squared canonical correlations, also provide measures of supplemental dimensions of linear relationships between the $y$'s and $x$'s.
The two important properties of canonical correlations as listed by Rencher are the following: (i) the canonical correlations are invariant to changes of scale in either set of variables; and (ii) the first canonical correlation $r_1$ is the maximum correlation attainable between any linear function of $\mathbf{y}$ and any linear function of $\mathbf{x}$.
See Chapter 11 of Rencher for a comprehensive coverage of canonical correlations. We can test the independence of the $y$'s and the $x$'s using any of the four tests discussed in Section 14.6. The concepts are illustrated for the chemical dataset of Box and Youle (1955), which is also analyzed in Rencher.
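Before moving on, a hedged R sketch of these computations follows, using the cancor function from the stats package; the split of chemicaldata into the y- and x-sets via the column indices below is an assumption about the data layout.

Y <- chemicaldata[, 1:3]   # assumed columns of the y-set
X <- chemicaldata[, 4:6]   # assumed columns of the x-set
cc <- cancor(X, Y)         # canonical correlation analysis
cc$cor                     # the canonical correlations r_1 >= r_2 >= ...
prod(cc$cor^2)             # product of squared canonical correlations: close to 0
cc$xcoef; cc$ycoef         # the coefficient vectors b and a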
The next section takes up a very important concept in multivariate analysis.
Principal Component Analysis (PCA) is a powerful data reduction tool. In the earlier multivariate studies we had $p$ components for a random vector. PCA considers the problem of identifying a new, smaller set of variables which explains most of the variance in the dataset. Jolliffe (2002) explains the importance of PCA as follows: “The central idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset.” In general, most of the ideas in multivariate statistics are extensions of concepts from univariate statistics. PCA is an exception!
Jolliffe (2002) treats PCA theory and applications in a monumental way. Jackson (1991) is a very elegant exposition of PCA applications. For useful applications of PCA in chemometrics, refer to Varmuza and Filzmoser (2009). The development of this section owes a large extent to Jolliffe (2002) and Rencher (2002).
PCA may be useful in the following two cases: (i) too many explanatory variables relative to the number of observations; and (ii) the explanatory variables are highly correlated. Let us begin with a brief discussion of the math behind PCA.
We begin with a discussion of population principal components. Consider a $p$-variate normal random vector $\mathbf{y}$ with mean $\boldsymbol{\mu}$ and variance-covariance matrix $\Sigma$. We assume that we have a random sample of $n$ observations. The goal of PCA is to return a new set of variables $Z_1, \ldots, Z_p$, where each $Z_i$ is some linear combination of the $y$'s. Furthermore, and importantly, the $Z_i$'s are in decreasing order of importance in the sense that $Z_i$ has more information about the $y$'s than $Z_j$ whenever $i < j$. The $Z_i$'s are constructed in such a way that they are uncorrelated. Information here is used to convey the fact that $\operatorname{Var}(Z_i) \geq \operatorname{Var}(Z_j)$ whenever $i < j$.
From its definition, the PCs are linear combinations of the $y$'s. The $i$-th principal component is defined by
$$Z_i = \mathbf{a}_i'\mathbf{y} = a_{i1}y_1 + a_{i2}y_2 + \cdots + a_{ip}y_p, \quad i = 1, \ldots, p.$$
We know from the linearity of variance that we can scale the $\mathbf{a}_i$'s in such a way that the variance of $Z_i$ becomes arbitrarily large. Thus we may end up with components whose variances can be inflated without bound, which is of course meaningless. We will thus impose a restriction:
$$\mathbf{a}_i'\mathbf{a}_i = 1, \quad i = 1, \ldots, p.$$
We need to find $\mathbf{a}_1$ such that $\operatorname{Var}(Z_1) = \mathbf{a}_1'\Sigma\mathbf{a}_1$ is a maximum. Next, we need to obtain $\mathbf{a}_2$ such that
$$\operatorname{Var}(Z_2) = \mathbf{a}_2'\Sigma\mathbf{a}_2 \ \text{is a maximum, subject to} \ \mathbf{a}_2'\mathbf{a}_2 = 1 \ \text{and} \ \operatorname{Cov}(Z_1, Z_2) = 0,$$
and in general $\mathbf{a}_i$, $i = 2, \ldots, p$, such that
$$\operatorname{Var}(Z_i) = \mathbf{a}_i'\Sigma\mathbf{a}_i \ \text{is a maximum, subject to} \ \mathbf{a}_i'\mathbf{a}_i = 1 \ \text{and} \ \operatorname{Cov}(Z_i, Z_k) = 0 \ \text{for} \ k < i.$$
For the first component, mathematically, we need to solve the maximization problem
$$\max_{\mathbf{a}_1} \ \mathbf{a}_1'\Sigma\mathbf{a}_1 - \lambda(\mathbf{a}_1'\mathbf{a}_1 - 1),$$
where $\lambda$ is a Lagrangian multiplier. As with any optimization problem, we will differentiate the above expression and equate the result to 0 to obtain the optimal value of $\mathbf{a}_1$:
$$\Sigma\mathbf{a}_1 - \lambda\mathbf{a}_1 = \mathbf{0}, \quad \text{that is,} \quad (\Sigma - \lambda I)\mathbf{a}_1 = \mathbf{0}.$$
Thus, we see that $\lambda$ is an eigenvalue of $\Sigma$ and $\mathbf{a}_1$ is the corresponding eigenvector. Since we need to maximize $\operatorname{Var}(Z_1) = \mathbf{a}_1'\Sigma\mathbf{a}_1 = \lambda$, we select the maximum eigenvalue and its corresponding eigenvector for $\mathbf{a}_1$.
Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ denote the eigenvalues of $\Sigma$. We assume that the eigenvalues are distinct. Without loss of generality, we further assume that $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. For the first PC we select the eigenvector corresponding to $\lambda_1$, that is, $\mathbf{a}_1$ is the eigenvector related to $\lambda_1$.
The second PC needs to maximize $\mathbf{a}_2'\Sigma\mathbf{a}_2$ with the restrictions that $\mathbf{a}_2'\mathbf{a}_2 = 1$ and $\operatorname{Cov}(Z_1, Z_2) = 0$. Note that, after a few matrix computational steps,
$$\operatorname{Cov}(Z_1, Z_2) = \operatorname{Cov}(\mathbf{a}_1'\mathbf{y}, \mathbf{a}_2'\mathbf{y}) = \mathbf{a}_2'\Sigma\mathbf{a}_1 = \lambda_1\,\mathbf{a}_2'\mathbf{a}_1.$$
Thus, the constraint that the first two PCs are uncorrelated may be specified by $\mathbf{a}_2'\mathbf{a}_1 = 0$. The maximization problem for the second PC is specified in the equation below:
$$\max_{\mathbf{a}_2} \ \mathbf{a}_2'\Sigma\mathbf{a}_2 - \lambda(\mathbf{a}_2'\mathbf{a}_2 - 1) - \phi\,\mathbf{a}_2'\mathbf{a}_1,$$
where $\lambda$ and $\phi$ are the Lagrangian multipliers. We need to optimize the above equation to obtain the second PC. As we generally do with optimization problems, we differentiate the maximization statement with respect to $\mathbf{a}_2$ and obtain
$$2\Sigma\mathbf{a}_2 - 2\lambda\mathbf{a}_2 - \phi\mathbf{a}_1 = \mathbf{0},$$
which by multiplication of the left-hand side by $\mathbf{a}_1'$ gives us
$$2\mathbf{a}_1'\Sigma\mathbf{a}_2 - 2\lambda\,\mathbf{a}_1'\mathbf{a}_2 - \phi\,\mathbf{a}_1'\mathbf{a}_1 = 0.$$
Since $\mathbf{a}_1'\mathbf{a}_2 = 0$, the first two terms of the above equation equal zero, and since $\mathbf{a}_1'\mathbf{a}_1 = 1$, we get $\phi = 0$. Substituting this into the two displayed expressions above, we get $\Sigma\mathbf{a}_2 - \lambda\mathbf{a}_2 = \mathbf{0}$. On rearrangement, we get $(\Sigma - \lambda I)\mathbf{a}_2 = \mathbf{0}$, and we again see $\lambda$ as an eigenvalue of $\Sigma$. Under the assumption of distinct eigenvalues for $\Sigma$, we choose the second largest eigenvalue $\lambda_2$ and its corresponding eigenvector for $\mathbf{a}_2$. We proceed in a similar way for the rest of the PCs. As with the first two PCs, $\mathbf{a}_i$ is chosen as the eigenvector corresponding to $\lambda_i$, for $i = 3, \ldots, p$.
The variance of the $i$-th principal component is
$$\operatorname{Var}(Z_i) = \lambda_i, \quad i = 1, \ldots, p.$$
The amount of variation explained by the $i$-th PC is
$$\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}.$$
Since the PCs are uncorrelated, the variation explained by the first $k$ PCs is
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j}.$$
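A quick numerical check of these results can be sketched in R: the PC variances are the eigenvalues of $\Sigma$, and the explained-variation ratios follow from them. The covariance matrix below is an arbitrary illustrative choice.

Sigma <- matrix(c(4, 2, 0,
                  2, 3, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)  # illustrative covariance matrix
e <- eigen(Sigma)
e$values                          # lambda_1 >= lambda_2 >= lambda_3, the PC variances
e$vectors[, 1]                    # a_1, the first PC direction
e$values / sum(e$values)          # variation explained by each PC
cumsum(e$values) / sum(e$values)  # variation explained by the first k PCs
all.equal(sum(e$values), sum(diag(Sigma)))  # total variance is preserved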
The variance explained by the PCs is best understood through a scree plot. A scree plot looks like the profile of a mountain where, after a steep slope, a flatter region appears that is built from fallen and deposited stones (called scree); hence the name SCREE PLOT. It is investigated from the top until the debris is reached. This explanation is from Varmuza and Filzmoser (2009).
The development thus far focuses on population principal components, which involve the unknown parameters $\boldsymbol{\mu}$ and $\Sigma$. Since these parameters are seldom known, the sample principal components are obtained by replacing the unknown parameters with their respective MLEs. If the observations are on different scales of measurement, a practical rule is to use the sample correlation matrix instead of the covariance matrix.
The covariance between the $j$-th variable $y_j$ and the $i$-th PC $Z_i$ is given by
$$\operatorname{Cov}(y_j, Z_i) = \lambda_i a_{ij},$$
and the correlation is
$$r_{y_j, Z_i} = \frac{\lambda_i a_{ij}}{\sqrt{\lambda_i}\sqrt{\sigma_{jj}}} = \frac{a_{ij}\sqrt{\lambda_i}}{\sqrt{\sigma_{jj}}}.$$
However, if the PCs are extracted from the correlation matrix, then
$$r_{y_j, Z_i} = a_{ij}\sqrt{\lambda_i}.$$
The concepts will be demonstrated in the next subsection.
We will use two datasets to illustrate the usage of PCA.
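As a foretaste, a minimal sketch with princomp is given below for the USairpollution data from the HSAUR2 package listed for this chapter; dropping the first column (the SO2 response) and extracting the PCs from the correlation matrix, since the variables are on very different scales, are the assumptions made here.

library(HSAUR2)
data("USairpollution")
pc <- princomp(USairpollution[, -1], cor = TRUE)  # PCs from the correlation matrix
summary(pc)                    # standard deviations and cumulative variance explained
screeplot(pc, type = "lines")  # look for the bend where the "scree" begins
loadings(pc)                   # the coefficient vectors a_1, ..., a_p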
In the next subsection, we focus on the applications of PCA.
Jolliffe (2002) and Jackson (1991) are two detailed treatises which discuss variants of PCA and their applications. PCA can be applied alongside, and/or augmented by, statistical techniques such as ANOVA, linear regression, multidimensional scaling, factor analysis, microarray modeling, time series, etc.
Section 12.6 indicated the problem of multicollinearity in linear models. If the covariates are replaced with the PCs, the problem of multicollinearity ceases, since the PCs are uncorrelated with each other. It is thus the right time to fuse the multicollinearity problems of linear models with PCA. We are familiar with all the relevant concepts, and hence we take up the example of Maindonald and Braun (2009) to throw light on this technique. Maindonald and Braun (2009) have made the required dataset available in their package DAAG. See Streiner and Norman (2003) for more details of this study.
It is thus seen how PCA helps to reduce the number of variables in the linear regression model. Note that even if we replace the original variables with an equivalent set of PCs, the problem of multicollinearity is fixed, since the PC scores are uncorrelated.
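A minimal sketch of this principal components regression idea follows. To keep the code self-contained it uses the built-in stackloss data instead of the socsupport example; the choice of retaining two PCs is illustrative.

pc <- princomp(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")], cor = TRUE)
summary(pc)                                             # variance explained per PC
pcr_fit <- lm(stackloss$stack.loss ~ pc$scores[, 1:2])  # regress on the first two PCs
summary(pcr_fit)
round(cor(pc$scores), 10)  # PC scores are uncorrelated: no multicollinearity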
Gower and Hand (1996) have written a monograph on the use of biplots for multivariate data. Gower, et al. (2011) is a recent book on biplots, complemented with the R package UBbipl, and is also an extension of Gower and Hand (1996). Greenacre (2010) has implemented all the biplot techniques in his book. This book has R codes for all the data analysis, and he has also been very generous to gift it to the world at http://www.multivariatestatistics.org/biplots.html. For theoretical aspects of biplots, the reader may also refer to Rencher (2002), Johnson and Wichern (2007), and Jolliffe (2002), among others. For a simpler and effective understanding of biplots, see the Appendix of Desmukh and Purohit (2007).
The biplot is a visualization technique for the data matrix through two coordinate systems representing the observations (rows) and variables (columns) of the dataset. In this method, the variance-covariance structure of the variables and the distances between the observations are plotted in a single figure, and to reflect this facet the prefix “bi” is used. In this plot, the distance between the points, which represent observations, approximates the Mahalanobis distance between them. The length of a vector displayed on the plot, from the origin to the coordinates of a variable, represents the variance of that variable, with the angle between two vectors reflecting the correlation between the corresponding variables. A small angle between two vectors indicates that the variables are strongly correlated.
For the sake of simplicity, we will assume that the data matrix $X$, consisting of $n$ observations of a $p$-dimensional vector, is a centered matrix in the sense that each column has a zero mean. By the singular value decomposition (SVD) result, we can write the matrix $X$ as
$$X = U \Lambda V',$$
where $U$ is an $n \times r$ matrix, $\Lambda$ is an $r \times r$ diagonal matrix, and $V$ is a $p \times r$ matrix, $r$ being the rank of $X$. By the properties of the SVD, we have $U'U = I$ and $V'V = I$. Furthermore, $\Lambda$ has diagonal elements $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$, the singular values of $X$. We will consider a simple illustration of the SVD for the famous “cork” dataset of Rao (1973).
Notice the decline of the singular values, the $\lambda_i$ values, for the cork dataset. In the spirit of PCA, we tend to believe that if such a decline is steep, we can probably obtain a good understanding of the dataset by resorting to plots which use two dimensions. In fact, such a result is validated by a theorem of Eckart and Young (1936). We need to connect the SVD result with the rank factorization underlying the biplot, which is now stated. This factorization says that any $n \times p$ matrix $X$ of rank $r$ can be expressed as
$$X = GH',$$
where $G$ is an $n \times r$ matrix and $H$ is a $p \times r$ matrix. In a certain sense, the goal is to understand the variation among the observations through the matrix $G$ and the variation among the variables through $H$. The matrices $G$ and $H$ may be obtained as a combination of the SVD elements as $G = U\Lambda^c$ and $H = V\Lambda^{1-c}$. For different choices of $c$, we have different representations of $X$. The three most common choices of $c$ are 0, 1/2, and 1; see Gabriel (1971). We mention some consequences of these choices; see Khattree and Naik (1999).
For the choice $c = 1/2$, we place an equal emphasis on the variables and the observations, since $G = U\Lambda^{1/2}$ and $H = V\Lambda^{1/2}$.
$c = 0$. Here
$$G = U, \quad H = V\Lambda.$$
The distance between the $\mathbf{g}_i$ vectors approximates the squared Mahalanobis distance between the corresponding observation vectors. Furthermore, the inner product between the $\mathbf{h}_j$ vectors approximates the covariance between the corresponding variables, and the length of such a vector gives the variance of that variable.
$c = 1$. Here,
$$G = U\Lambda, \quad H = V.$$
For this case, the distance between the $\mathbf{g}_i$'s is the usual Euclidean distance between the corresponding observations, the values of $G$ equal the principal component scores of the observations, and the values of $H$ give the principal component loadings.
For the cork dataset, we will obtain the biplot for one of the above choices of $c$.
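A hedged R sketch of the SVD-based biplot is given below; it assumes the cork data of Rao (1973) is available as a four-column data frame cork, as among the datasets listed for this chapter.

X <- scale(cork, center = TRUE, scale = FALSE)  # column-centered data matrix
sv <- svd(X)                                    # X = U diag(lambda) V'
cc <- 1                                         # the exponent c: 0, 1/2, or 1
G <- sv$u %*% diag(sv$d^cc)                     # coordinates for the observations
H <- sv$v %*% diag(sv$d^(1 - cc))               # coordinates for the variables
plot(G[, 1:2], pch = 19, xlab = "Axis 1", ylab = "Axis 2")
arrows(0, 0, H[, 1], H[, 2], col = "red", length = 0.1)  # variable vectors
text(H[, 1:2], labels = colnames(cork), col = "red", pos = 3)
# Compare with biplot(princomp(cork)), R's standard PC biplot.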
We will have a look at another important facet of multivariate statistical analysis: Factor Analysis. The data observations $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are assumed to arise from an $N_p(\boldsymbol{\mu}, \Sigma)$ distribution. Consider a hypothetical example where the correlation matrix is given by, say,
$$R = \begin{pmatrix} 1 & 0.9 & 0.05 & 0.05 & 0.05 \\ 0.9 & 1 & 0.05 & 0.05 & 0.05 \\ 0.05 & 0.05 & 1 & 0.9 & 0.9 \\ 0.05 & 0.05 & 0.9 & 1 & 0.9 \\ 0.05 & 0.05 & 0.9 & 0.9 & 1 \end{pmatrix}.$$
Here, we can see that the first two components are strongly correlated with each other and also appear to be nearly independent of the rest of the components. Similarly, the last three components are strongly correlated among themselves and nearly independent of the first two components. A natural intuition is to think of the first two components as arising due to one factor and the remaining three due to a second factor. The factors are also sometimes called latent variables.
The development in the rest of the section relates only to the orthogonal factor model, and it is this model that is meant whenever we speak of the factor analysis model. For other variants, refer to Basilevsky (1994), Reyment and Jöreskog (1996), and Brown (2006).
Let $\mathbf{y}$ be a $p$-vector. To begin with, we will assume that there are $m$ factors $f_1, \ldots, f_m$, with $m < p$, and that each of the $y_i$'s is a function of the factors. The factor analysis model is given by
$$y_i - \mu_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \cdots + \lambda_{im} f_m + \varepsilon_i, \quad i = 1, \ldots, p,$$
where the $\varepsilon_i$, $i = 1, \ldots, p$, are normally distributed errors associated with the $i$-th variable. In the factor analysis model, the $\lambda_{ij}$, $i = 1, \ldots, p$, $j = 1, \ldots, m$, are the regression coefficients between the observed variables and the factors. Two points need to be observed. In the factor analysis literature, the regression coefficients are called loadings, which indicate how the weights of the $y$'s depend on the factors $f$'s. The loadings are denoted by $\lambda$'s, which we have thus far used for eigenvalues and eigenvectors. However, the notation of $\lambda$'s for the loadings is standard in the factor analysis literature, and in the rest of this section they will denote the loadings and not quantities related to eigenvalues.
We will now use the matrix notation and then state the essential assumptions. The (orthogonal) factor model may be stated in matrix form as
$$\mathbf{y} - \boldsymbol{\mu} = \Lambda \mathbf{f} + \boldsymbol{\varepsilon},$$
where $\Lambda$ is the $p \times m$ matrix of loadings $\lambda_{ij}$, $\mathbf{f} = (f_1, \ldots, f_m)'$ is the vector of factors, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_p)'$ is the vector of errors.
The essential assumptions related to the factors are as follows: $E(\mathbf{f}) = \mathbf{0}$, $\operatorname{Cov}(\mathbf{f}) = I_m$, $E(\boldsymbol{\varepsilon}) = \mathbf{0}$, $\operatorname{Cov}(\boldsymbol{\varepsilon}) = \Psi = \operatorname{diag}(\psi_1, \ldots, \psi_p)$, and $\operatorname{Cov}(\mathbf{f}, \boldsymbol{\varepsilon}) = \mathbf{0}$.
Under the above assumptions, we can see that the variance of the $i$-th component can be expressed in terms of the loadings as
$$\operatorname{Var}(y_i) = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2 + \psi_i.$$
Define $h_i^2 = \sum_{j=1}^{m} \lambda_{ij}^2$. Thus, the variance of a component can be written as the sum of a common variance component and a specific variance component. It is common practice in the factor analysis literature to refer to $h_i^2$ as the common variance, or communality, and to the specific variance $\psi_i$ as the specificity, unique variance, or residual variance.
The covariance matrix $\Sigma$ can be written in terms of $\Lambda$ and $\Psi$ as
$$\Sigma = \Lambda\Lambda' + \Psi.$$
Using the above relationship, we can arrive at the next expression:
$$\operatorname{Cov}(\mathbf{y}, \mathbf{f}) = \Lambda, \quad \text{that is,} \quad \operatorname{Cov}(y_i, f_j) = \lambda_{ij}.$$
We will consider three methods for the estimation of the loadings and communalities: (i) the Principal Component Method, (ii) the Principal Factor Method, and (iii) the Maximum Likelihood Method. We omit a fourth important technique of estimating the factors, the “Iterated Principal Factor Method”.
We will first consider the principal component method. Let $S$ denote the sample covariance matrix. The problem is then to find an estimator $\hat\Lambda$ which will approximate $S$ such that
$$S \approx \hat\Lambda\hat\Lambda' + \hat\Psi.$$
In this approach, the last component $\hat\Psi$ is ignored and we approximate the sample covariance matrix by a spectral decomposition:
$$S \approx CDC',$$
where $C$ is an orthogonal matrix constructed with the normalized eigenvectors $\mathbf{c}_i$ of $S$, and $D$ is a diagonal matrix with the eigenvalues of $S$. That is, if $\theta_1 > \theta_2 > \cdots > \theta_p$ are the eigenvalues of $S$, then $D = \operatorname{diag}(\theta_1, \ldots, \theta_p)$. Since the eigenvalues of the positive semi-definite matrix $S$ are all positive or zero, we can factor $D$ as
$$D = D^{1/2} D^{1/2},$$
and substituting this in (15.28), we get
$$S \approx C D^{1/2} D^{1/2} C' = (C D^{1/2})(C D^{1/2})'.$$
This suggests that we can use $\hat\Lambda = C D^{1/2}$. However, we seek a $\hat\Lambda$ whose order is $p \times m$ with $m$ less than $p$, and hence we consider the $m$ largest eigenvalues, taking $D_1 = \operatorname{diag}(\theta_1, \ldots, \theta_m)$ and $C_1 = (\mathbf{c}_1, \ldots, \mathbf{c}_m)$ with the corresponding eigenvectors. Thus, a useful estimator of $\Lambda$ is given by
$$\hat\Lambda = C_1 D_1^{1/2}.$$
Note that the $i$-th diagonal element of $\hat\Lambda\hat\Lambda'$ is the sum of squares of the $i$-th row of $\hat\Lambda$. We can then use this to estimate the $i$-th diagonal element of $\Psi$ by
$$\hat\psi_i = s_{ii} - \sum_{j=1}^{m} \hat\lambda_{ij}^2,$$
and using this relationship approximate $S$ by
$$S \approx \hat\Lambda\hat\Lambda' + \hat\Psi.$$
Since, here, the sums of squares of the rows and columns of $\hat\Lambda$ equal the communalities and the eigenvalues respectively, an estimate of the $i$-th communality is given by
$$\hat h_i^2 = \sum_{j=1}^{m} \hat\lambda_{ij}^2.$$
Similarly, we have
$$\sum_{i=1}^{p} \hat\lambda_{ij}^2 = \theta_j \, \mathbf{c}_j'\mathbf{c}_j = \theta_j,$$
where the last equality follows from the fact that $\mathbf{c}_j'\mathbf{c}_j = 1$. Using the estimates $\hat h_i^2$ and $\hat\psi_i$ in (15.29), we obtain a partition of the variance of the $i$-th variable as
$$s_{ii} = \hat h_i^2 + \hat\psi_i.$$
The contribution of the $j$-th factor to the total sample variance is therefore
$$\frac{\theta_j}{\operatorname{tr}(S)} = \frac{\sum_{i=1}^{p} \hat\lambda_{ij}^2}{s_{11} + s_{22} + \cdots + s_{pp}}.$$
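The computations of this method are simple enough to sketch in R; the block correlation matrix below re-creates the illustrative structure from the beginning of this section, with hypothetical numerical values.

R_mat <- matrix(0.05, 5, 5)
R_mat[1:2, 1:2] <- 0.9
R_mat[3:5, 3:5] <- 0.9
diag(R_mat) <- 1  # illustrative block correlation matrix
m <- 2            # number of factors
e <- eigen(R_mat)
Lambda_hat <- e$vectors[, 1:m] %*% diag(sqrt(e$values[1:m]))  # C1 D1^(1/2)
h2 <- rowSums(Lambda_hat^2)  # communalities h_i^2
psi_hat <- 1 - h2            # specific variances, as diag(R) = 1
round(cbind(Lambda_hat, h2, psi_hat), 3)
e$values[1:m] / 5            # contribution of each factor to the total variance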
We will now illustrate the concepts with a solved example from Rencher (2002).
We will next consider the principal factor method. In the previous method we omitted $\hat\Psi$. In the principal factor method we use an initial estimate $\hat\Psi$ of $\Psi$ and factor $S - \hat\Psi$, or $R - \hat\Psi$, whichever is appropriate:
$$S - \hat\Psi \approx \hat\Lambda\hat\Lambda',$$
where $\hat\Lambda$ is as specified in (15.30), now with the eigenvalues and eigenvectors of $S - \hat\Psi$ or $R - \hat\Psi$. Since the $i$-th diagonal element of $S - \hat\Psi$ is the communality, we have $\hat h_i^2 = s_{ii} - \hat\psi_i$. In the case of $R - \hat\Psi$, we have $\hat h_i^2 = 1 - \hat\psi_i$. For more details, refer to Section 13.2 of Rencher (2002). We will illustrate these computations as a continuation of the previous example.
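Continuing with R_mat and m from the sketch above, one pass of the principal factor method may be sketched as below; the use of squared multiple correlations as initial communality estimates is a common choice, stated here as an assumption rather than the text's own.

h2_init <- 1 - 1 / diag(solve(R_mat))  # initial communality estimates
R_adj <- R_mat
diag(R_adj) <- h2_init                 # R - Psi-hat: diagonal replaced by h_i^2
e2 <- eigen(R_adj)
Lambda_pf <- e2$vectors[, 1:m] %*% diag(sqrt(e2$values[1:m]))
round(Lambda_pf, 3)                    # principal factor loadings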
Finally, we conclude this section with a discussion of the maximum likelihood estimation method. Under the assumption that the observations $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are a random sample from $N_p(\boldsymbol{\mu}, \Sigma)$, it may be shown that the estimates $\hat\Lambda$ and $\hat\Psi$ satisfy the following set of equations:
$$S \hat\Psi^{-1} \hat\Lambda = \hat\Lambda \left( I + \hat\Lambda' \hat\Psi^{-1} \hat\Lambda \right), \qquad \hat\Psi = \operatorname{diag}\left( S - \hat\Lambda\hat\Lambda' \right),$$
with the uniqueness condition that $\hat\Lambda' \hat\Psi^{-1} \hat\Lambda$ is diagonal.
The equations need to be solved iteratively, and happily for us, R does that through the factanal function. The MLE technique is illustrated in the next example. We need to address a few important questions before then.
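Ahead of that example, a hedged sketch of fitting the model with factanal is given below; it assumes the life-expectancy data is available as a numeric data frame life (as among the datasets listed for this chapter), and the choice of three factors is purely illustrative.

fa <- factanal(life, factors = 3, rotation = "none")  # ML factor analysis
fa$loadings      # the estimated loadings Lambda-hat
fa$uniquenesses  # the estimated specific variances psi-hat
fa$PVAL          # p-value for the test that 3 factors are sufficient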
The important question regards the choice of the number of factors to be extracted. Some rules given in Rencher are stated in the following.
We leave it to the reader to find out more about the concept of rotation, and merely give a summary of it, adapted from Hair, et al. (2010).
We have thus learned about fairly complex and powerful techniques in multivariate statistics. The techniques vary from classifying observations into specific classes, identifying groups of independent (sub)vectors, and reducing the number of variables, to determining hidden variables which possibly explain the observed variables. More details may be found in the references concluding this chapter.
We will begin with a disclaimer that the classification of the texts in different sections is not perfect.
Anderson (1958, 1984, 2003) is the first primer on MSA. The book is currently in its third edition, and it is worth noting that the second and third editions are probably the only ones which discuss the Stein effect in depth. Chapter 8 of Rao (1973) provides the necessary theoretical background for multivariate analysis and also contains some of Rao's remarkable research in multivariate statistics; in a certain way, that one chapter may have more results than we can possibly cover in a complete book. A vector space approach to multivariate statistics is to be found in Eaton (1983, 2007). Mardia, Kent, and Bibby (1979) is an excellent treatise on multivariate analysis which considers many geometrical aspects. The geometrical approach is also considered in Gnanadesikan (1977, 1997), where robustness aspects are further developed. Muirhead (1982), Giri (2004), Bilodeau and Brenner (1999), Rencher (2002), and Rencher (1998) are among some of the important texts on multivariate analysis. We note here that our coverage is mainly based on Rencher (2002).
Jolliffe (2002) is a detailed monograph on Principal Component Analysis. Jackson (1991) is a remarkable account of the applications of PCA. It is needless to say that if you read through these two books, you may become an authority on PCA.
Missing data, EM algorithms, and multivariate analysis have been aptly handled in Schafer (1997), and in fact many useful programs have been provided there in S, which can be easily adapted to R. In this sense, it is a stand-alone reference book dealing with missing data. Of course, McLachlan and Krishnan (2008) may also be used!
Johnson and Wichern (2007) is a popular course text, which does apt justice to both theory and applications. Hair, et al. (2010) may commonly be found on a practitioner's desk. Izenman (2008) is a modern flavor of multivariate statistics with coverage of the fashionable area of machine learning. Sharma (1996) and Timm (2002) also provide a firm footing in multivariate statistics.
Gower, et al. (2011) discuss many variants of biplots, as an extension of Gower and Hand (1996). Greenacre (2010) is an open-source book with in-depth coverage of biplots.
The two companion volumes of Khattree and Naik (1999) and Khattree and Naik (2000) provide excellent coverage of multivariate analysis and computations through the SAS software. It may be noted, one more time, that the programs and logical thinking are of paramount importance rather than any particular software. It is worth recording here that these two companions provide a fine balance between the theoretical aspects and computations. Härdle and Simar (2007) have used the “XploRe” software for computations. Last, but not least, the most recent book of Everitt and Hothorn (2011) is a good source for multivariate analysis through R. Varmuza and Filzmoser (2009) have used the R software with a special emphasis on applications to chemometrics. Husson, et al. (2011) is also a recent arrival, which integrates R with multivariate analysis. Desmukh and Purohit (2007) also present PCA, biplots, and other multivariate aspects in R, though their emphasis is more on microarray data.
Problem 15.1 Explore the R examples for linear discriminant analysis and canonical correlation with example(lda) and example(cancor).
Problem 15.2 In the “Seishu Wine Study” of Example 16.9.1, the tests for independence of four sub-vectors lead to rejection of the hypothesis of their independence. Combine the subvectors s11 with s22, and s33 with s44. Find the canonical correlations between these combined subvectors. Furthermore, find the canonical correlations for each subvector while pooling the others together.
Problem 15.3 Principal components offer effective reduction in data dimensionality. In Examples 15.4.1 and 15.4.2, it is observed that the first few PCs explain most of the variation in the original data. Do you expect further reduction if you perform PCA on these PCs? Validate your answer by running princomp on the PCs.
Problem 15.4 Find the PCs for the stack loss dataset, which explain 85% of the variation in the original dataset.
Problem 15.5 Perform the PCA on the iris dataset along two lines: (i) the entire dataset, and (ii) three subsets according to the three species. Check whether the PC scores are significantly different across the three species using an appropriate multivariate test.
Problem 15.6 For the US crime data of Example 13.4.2, carry out the PCA for the covariates and then perform the regression analysis on the PC scores. Investigate if the multicollinearity problem persists in the fitted regression model based on the PC scores.
Problem 15.7 How do outliers affect PC scores? Perform PCA on the board stiffness dataset of Example 16.3.5 with and without the detected outliers therein.
Problem 15.8 Check out the example of the factanal function. Are factors present in the iris dataset? Develop the complete analysis for the problem.