Chapter 15
Multivariate Statistical Analysis - II

Package(s): DAAG, HSAUR2, qcc

Dataset(s): iris, socsupport, chemicaldata, USairpollution, hearing, cork, adjectives, life

15.1 Introduction

In the previous chapter we developed some of the essential multivariate techniques, and the results there set up a platform for more practical applications. Classification and discriminant analysis techniques work well for classifying observations into distinct groups; this topic forms the content of Section 15.2. Canonical correlations help to identify whether there are related groups of variables present in a multivariate vector, and they are dealt with in Section 15.3. Principal Component Analysis (PCA) helps in obtaining a new, smaller set of variables which retain most of the overall variation of the original set of variables. This multivariate technique is developed in Section 15.4, whereas specific areas of application of the technique are dealt with in Section 15.5. Multivariate data may also be used to find a new set of latent variables using Factor Analysis; see Section 15.6.

15.2 Classification and Discriminant Analysis

An important application of MSA is to classify the data into distinct groups. This task is achieved through two steps: (i) Discriminant Analysis, and (ii) Classification. In the first step we identify linear functions which describe the similarities and differences among the groups; this is achieved through the relative contribution of the variables towards the separation of the groups, and it yields an optimal plane which separates the groups. The second task is the allocation of observations to the groups identified in the first step, which is broadly called Classification. We will begin with the first task in the forthcoming subsection.

15.2.1 Discrimination Analysis

Suppose that there are two groups characterized by two multivariate normal distributions: $N_p(\mu_1, \Sigma)$ and $N_p(\mu_2, \Sigma)$. It is assumed that the variance-covariance matrix $\Sigma$ is the same for both groups. Assume that we have $n_1$ observations $y_{11}, \ldots, y_{1n_1}$ from $N_p(\mu_1, \Sigma)$ and $n_2$ observations $y_{21}, \ldots, y_{2n_2}$ from $N_p(\mu_2, \Sigma)$. The discriminant function is a linear combination of the $p$ variables which maximizes the distance between the two groups' mean vectors. Thus, we are seeking a vector $a$ which achieves the required objective.

As a first step, the $y$ vectors are transformed to scalars $z$ through $a$ as below:

$$z_{1i} = a'y_{1i}, \quad i = 1, \ldots, n_1, \qquad z_{2i} = a'y_{2i}, \quad i = 1, \ldots, n_2. \tag{15.1}$$

Define the means of the transformed scalars and the pooled variance as below:

$$\bar{z}_1 = a'\bar{y}_1, \qquad \bar{z}_2 = a'\bar{y}_2, \qquad s_z^2 = a'S_{pl}a, \tag{15.2}$$

where $\bar{y}_1$ and $\bar{y}_2$ are the two group mean vectors and $S_{pl}$ is the pooled sample variance-covariance matrix.

Since the goal is to find that $a$ which maximizes the distance between the group means, the problem is to maximize the squared distance:

$$\frac{(\bar{z}_1 - \bar{z}_2)^2}{s_z^2} = \frac{[a'(\bar{y}_1 - \bar{y}_2)]^2}{a'S_{pl}a}. \tag{15.3}$$

The maximum of the squared distance occurs at the $a$ given by

$$a = S_{pl}^{-1}(\bar{y}_1 - \bar{y}_2). \tag{15.4}$$

An illustration of the discriminant analysis steps is done through the next example.
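A minimal sketch of these computations in R follows; since the example data of the text are not a standard R dataset, two of the iris species stand in for the two groups and two of the measurements for the variables (an assumption made purely for illustration).

    # Sketch of (15.2)-(15.4): discriminant vector for two groups (iris species)
    g1 <- as.matrix(iris[iris$Species == "versicolor", c("Sepal.Length", "Petal.Length")])
    g2 <- as.matrix(iris[iris$Species == "virginica",  c("Sepal.Length", "Petal.Length")])
    n1 <- nrow(g1); n2 <- nrow(g2)
    ybar1 <- colMeans(g1); ybar2 <- colMeans(g2)
    Spl <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)  # pooled covariance
    a <- solve(Spl, ybar1 - ybar2)   # discriminant vector, equation (15.4)
    a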

The use of the discriminant function for classification is considered next.

15.2.2 Classification

Let $y_0$ be a new vector of observations. The goal is to classify it into one of the two groups by using the discriminant function. The simple, and fairly obvious, technique is to first obtain the discriminant score

$$z_0 = a'y_0 = (\bar{y}_1 - \bar{y}_2)'S_{pl}^{-1}y_0.$$

Next, classify $y_0$ to group 1 or group 2 according as $z_0$ is closer to $\bar{z}_1$ or $\bar{z}_2$. A simple illustration is given next.
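Continuing the sketch above (with the objects a, ybar1, and ybar2 already computed), a hypothetical new observation can be classified by its discriminant score:

    zbar1 <- sum(a * ybar1); zbar2 <- sum(a * ybar2)   # group means of the scores
    y0 <- c(Sepal.Length = 6.1, Petal.Length = 4.9)    # a hypothetical new flower
    z0 <- sum(a * y0)                                  # discriminant score of y0
    ifelse(abs(z0 - zbar1) < abs(z0 - zbar2), "Group 1", "Group 2")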

The function lda from the MASS package handles Linear Discriminant Analysis very well. The particular reason for not using the function here is that our focus has been the elucidation of the formulas in the flow of the theory. The results arising from the lda function, through the command lda(GROUP˜X1+X2, data=rencher), are slightly different, and the reader is asked to figure out why. It goes without explicit mention that the reader has a host of other options when using the lda function.


15.3 Canonical Correlations

In multivariate data, we may have the case that there are two distinct subsets of variables, with each subset characterizing certain traits of the unit of measurement. As an example, the marks obtained by a student in the examinations for different subjects form one subset of measurements, whereas the performance in different sports may form another subset. Canonical correlations help us to understand the relationship between such sets of vector data.

Let $y = (y_1, \ldots, y_p)'$ and $x = (x_1, \ldots, x_q)'$ be two sets of variables measured on the same experimental unit. The goal of a canonical correlation study is to obtain vectors $a$ and $b$ such that the correlation between $a'y$ and $b'x$ is a maximum, that is, $\mathrm{corr}(a'y, b'x)$ is a maximum.

The sample covariance matrix for the combined vector $(y', x')'$ is

$$S = \begin{bmatrix} S_{yy} & S_{yx} \\ S_{xy} & S_{xx} \end{bmatrix},$$

where $S_{yy}$ is the sample covariance matrix of the $y$'s, $S_{yx}$ is the sample covariance matrix between the $y$'s and the $x$'s, and $S_{xx}$ that of the $x$'s. A measure of association between the $y$'s and the $x$'s is given by

$$\prod_{i=1}^{s} r_i^2, \qquad s = \min(p, q),$$

where $r_1^2, \ldots, r_s^2$ are the eigenvalues of $S_{yy}^{-1}S_{yx}S_{xx}^{-1}S_{xy}$. Note that this product will be a poor measure of association, since each of the $r_i^2$ values is between 0 and 1, and hence the product of such numbers approaches 0 quickly. However, the eigenvalues themselves provide a useful measure of association between the two sets of variables. Particularly, the square roots of the eigenvalues lead to useful interpretations of the measures of association. The collection of the square roots of the eigenvalues, $r_1, \ldots, r_s$, has been named the canonical correlations in the multivariate literature. Without loss of generality we assume that $r_1 \geq r_2 \geq \cdots \geq r_s$.

As mentioned in Rencher (2002), the best overall measure of association between the $y$'s and the $x$'s is the largest squared canonical correlation $r_1^2$. However, the other eigenvalues $r_2^2, \ldots, r_s^2$, that is, the remaining squared canonical correlations, also provide measures of supplemental dimensions of linear relationships between the $y$'s and the $x$'s.
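A minimal sketch in R: the canonical correlations between two arbitrary subsets of the USairpollution variables (assumed to be available from the HSAUR2 package listed for this chapter) are computed as the square roots of the eigenvalues of $S_{yy}^{-1}S_{yx}S_{xx}^{-1}S_{xy}$ and checked against the built-in cancor function. This is not the chemical data example of the text.

    data("USairpollution", package = "HSAUR2")
    Y <- as.matrix(USairpollution[, c("SO2", "temp", "manu")])
    X <- as.matrix(USairpollution[, c("popul", "wind", "precip", "predays")])
    S <- cov(cbind(Y, X)); p <- ncol(Y); q <- ncol(X)
    Syy <- S[1:p, 1:p];  Sxx <- S[(p + 1):(p + q), (p + 1):(p + q)]
    Syx <- S[1:p, (p + 1):(p + q)]; Sxy <- t(Syx)
    M <- solve(Syy) %*% Syx %*% solve(Sxx) %*% Sxy
    sqrt(sort(Re(eigen(M)$values), decreasing = TRUE))[1:min(p, q)]  # canonical correlations
    cancor(X, Y)$cor    # should agree up to numerical error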

The two important properties of canonical correlations as listed by Rencher are the following:

  • Canonical correlations are scale invariant, that is, invariant to changes in the scales of the $y$'s as well as the $x$'s.
  • The first canonical correlation $r_1$ is the maximum correlation among all linear combinations of the $y$'s and the $x$'s.

See Chapter 11 of Rencher (2002) for a comprehensive coverage of canonical correlations. We can test the independence of the $y$'s and the $x$'s using any of the four tests discussed in Section 14.6. The concepts are illustrated for the chemical dataset of Box and Youle (1955), an example which also appears in Rencher.

The next section is a very important concept in multivariate analysis.


15.4 Principal Component Analysis – Theory and Illustration

Principal Component Analysis (PCA) is a powerful data reduction tool. In the earlier multivariate studies we had $p$ components for a random vector. PCA considers the problem of identifying a new, smaller set of variables which explains most of the variance in the dataset. Jolliffe (2002) explains the importance of PCA as follows: “The central idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset.” In general, most of the ideas in multivariate statistics are extensions of concepts from univariate statistics. PCA is an exception!

Jolliffe (2002) considers PCA theory and applications in a monumental way. Jackson (1991) is a very elegant exposition of PCA applications. For useful applications of PCA in chemometrics, refer to Varmuza and Filzmoser (2009). The development of this section owes a large extent to Jolliffe (2002) and Rencher (2002).

PCA may be useful in the following two cases: (i) too many explanatory variables relative to the number of observations; and (ii) the explanatory variables are highly correlated. Let us begin with a brief discussion of the math behind PCA.

15.4.1 The Theory

We begin with a discussion of population principal components. Consider a $p$-variate normal random vector $Y = (Y_1, \ldots, Y_p)'$ with mean $\mu$ and variance-covariance matrix $\Sigma$. We assume that we have a random sample of $n$ observations. The goal of PCA is to return a new set of variables $Z_1, \ldots, Z_p$, where each $Z_i$ is some linear combination of the $Y_j$'s. Furthermore, and importantly, the $Z_i$'s are in decreasing order of importance in the sense that $Z_i$ has more information about the $Y_j$'s than $Z_j$ whenever $i < j$. The $Z_i$'s are constructed in such a way that they are uncorrelated. Information here is used to convey the fact that $\mathrm{Var}(Z_i) \geq \mathrm{Var}(Z_j)$ whenever $i < j$.

From its definition, a PC is a linear combination of the $Y_j$'s. The $i$th principal component is defined by

$$Z_i = a_i'Y = a_{i1}Y_1 + a_{i2}Y_2 + \cdots + a_{ip}Y_p. \tag{15.7}$$

We know from the properties of variance that we can scale the $a_i$ in such a way that the variance of $Z_i$ becomes arbitrarily large. Thus we may end up with components whose variances are unbounded, which is of course meaningless. We will thus impose a restriction:

$$a_i'a_i = 1, \quad i = 1, \ldots, p.$$

We need to find $a_1$ such that $\mathrm{Var}(Z_1) = a_1'\Sigma a_1$ is a maximum. Next, we need to obtain $a_2$ such that

$$\mathrm{Var}(Z_2) = a_2'\Sigma a_2 \ \text{is a maximum, subject to } a_2'a_2 = 1 \text{ and } \mathrm{Cov}(Z_2, Z_1) = 0,$$

and in general

$$\mathrm{Var}(Z_k) = a_k'\Sigma a_k \ \text{is a maximum, subject to } a_k'a_k = 1 \text{ and } \mathrm{Cov}(Z_k, Z_j) = 0, \ j < k.$$

For the first component, mathematically, we need to solve the maximization problem

$$\max_{a_1} \; a_1'\Sigma a_1 - \lambda(a_1'a_1 - 1), \tag{15.8}$$

where $\lambda$ is a Lagrangian multiplier. As with any optimization problem, we will differentiate the above expression and equate the result to 0 to obtain the optimal value of $a_1$:

$$\frac{\partial}{\partial a_1}\left[a_1'\Sigma a_1 - \lambda(a_1'a_1 - 1)\right] = 2\Sigma a_1 - 2\lambda a_1 = 0, \quad \text{that is,} \quad \Sigma a_1 = \lambda a_1.$$

Thus, we see that $\lambda$ is an eigenvalue of $\Sigma$ and $a_1$ is the corresponding eigenvector. Since we need to maximize $\mathrm{Var}(Z_1) = a_1'\Sigma a_1 = \lambda$, we select the maximum eigenvalue and its corresponding eigenvector for $a_1$.

Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ denote the $p$ eigenvalues of $\Sigma$. We assume that the eigenvalues are distinct. Without loss of generality, we further assume that $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. For the first PC we select the eigenvector corresponding to $\lambda_1$, that is, $a_1$ is the eigenvector related to $\lambda_1$.

The second PC $Z_2 = a_2'Y$ needs to maximize $\mathrm{Var}(Z_2) = a_2'\Sigma a_2$ with the restrictions that $a_2'a_2 = 1$ and $\mathrm{Cov}(Z_2, Z_1) = 0$. Note that, post a few matrix computational steps,

$$\mathrm{Cov}(Z_2, Z_1) = a_2'\Sigma a_1 = \lambda_1 a_2'a_1.$$

Thus, the constraint that the first two PCs are uncorrelated may be specified by $a_2'a_1 = 0$. The maximization problem for the second PC is specified in the equation below:

$$\max_{a_2} \; a_2'\Sigma a_2 - \lambda(a_2'a_2 - 1) - \phi\, a_2'a_1, \tag{15.9}$$

where $\lambda$ and $\phi$ are the Lagrangian multipliers. We need to optimize the above expression to obtain the second PC. As we generally do with optimization problems, we will differentiate the maximization statement with respect to $a_2$ and obtain:

$$2\Sigma a_2 - 2\lambda a_2 - \phi a_1 = 0,$$

which by multiplication of the left-hand side by $a_1'$ gives us

$$2a_1'\Sigma a_2 - 2\lambda a_1'a_2 - \phi a_1'a_1 = 0. \tag{15.10}$$

Since $a_1'a_2 = 0$, and hence $a_1'\Sigma a_2 = \lambda_1 a_1'a_2 = 0$, the first two terms of the above equation equal zero, and since $a_1'a_1 = 1$, we get $\phi = 0$. Substituting this into the two displayed expressions above, we get $2\Sigma a_2 - 2\lambda a_2 = 0$. On readjustment, we get $\Sigma a_2 = \lambda a_2$, and we again see $\lambda$ as an eigenvalue of $\Sigma$ with $a_2$ as the corresponding eigenvector. Under the assumption of distinct eigenvalues for $\Sigma$, we choose the second largest eigenvalue $\lambda_2$ and its corresponding eigenvector for $a_2$. We proceed in a similar way for the rest of the $p - 2$ PCs. As with the first PC, $a_k$ is chosen as the eigenvector corresponding to $\lambda_k$, for $k = 3, \ldots, p$.

The variance of the $i$th principal component $Z_i$ is

$$\mathrm{Var}(Z_i) = a_i'\Sigma a_i = \lambda_i. \tag{15.11}$$

The amount of variation explained by the $i$th PC is

$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}.$$

Since the PCs are uncorrelated, the variation explained by the first $k$ PCs is

$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}. \tag{15.12}$$

The variance explained by the PCs is best understood through a scree plot. A scree plot looks like the profile of a mountain where, after a steep slope, a flatter region appears that is built by fallen and deposited stones (called scree); this is why the display is named the scree plot. It is investigated from the top until the debris is reached. This explanation is from Varmuza and Filzmoser (2009).

The development thus far focuses on population principal components, which involve the unknown parameters $\mu$ and $\Sigma$. Since these parameters are seldom known, the sample principal components are obtained by replacing the unknown parameters with their respective MLEs. If the observations are on different scales of measurement, a practical rule is to use the sample correlation matrix instead of the covariance matrix.
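A minimal sketch of sample PCs, using the USairpollution data (assumed available from the HSAUR2 package) rather than the worked examples of the next subsection; since the variables are on very different scales, the correlation matrix is used.

    data("USairpollution", package = "HSAUR2")
    pc <- princomp(USairpollution, cor = TRUE)   # PCs from the correlation matrix
    summary(pc)              # sqrt(lambda_i) and the cumulative proportions, (15.12)
    pc$loadings[, 1:2]       # the coefficient vectors a_1 and a_2
    screeplot(pc, type = "lines")   # the scree plot discussed above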

The covariance between the variable $Y_j$ and the PC $Z_i$ is given by

$$\mathrm{Cov}(Y_j, Z_i) = \lambda_i a_{ij},$$

and the correlation is

$$\mathrm{Corr}(Y_j, Z_i) = \frac{\lambda_i a_{ij}}{\sqrt{\lambda_i}\sqrt{\sigma_{jj}}} = \frac{a_{ij}\sqrt{\lambda_i}}{\sqrt{\sigma_{jj}}}.$$

However, if the PCs are extracted from the correlation matrix, then

$$\mathrm{Corr}(Y_j, Z_i) = a_{ij}\sqrt{\lambda_i}.$$
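The last relation is easily verified numerically; the sketch below (again with the USairpollution data, an assumption) compares the empirical correlation between the first variable and the first PC score with $a_{11}\sqrt{\lambda_1}$.

    data("USairpollution", package = "HSAUR2")
    e <- eigen(cor(USairpollution))
    scores <- scale(USairpollution) %*% e$vectors   # PC scores from the correlation matrix
    cor(USairpollution$SO2, scores[, 1])            # empirical correlation of Y_1 with Z_1
    e$vectors[1, 1] * sqrt(e$values[1])             # a_11 * sqrt(lambda_1): should match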

The concepts will be demonstrated in the next subsection.

15.4.2 Illustration Through a Dataset

We will use two datasets to illustrate the use of PCA.

In the next subsection, we focus on the applications of PCA.


15.5 Applications of Principal Component Analysis

Jolliffe (2002) and Jackson (1991) are two detailed treatises which discuss variants of PCA and their applications. PCA can be applied and/or augmented by statistical techniques such as ANOVA, linear regression, Multidimensional scaling, factor analysis, microarray modeling, time series, etc.

15.5.1 PCA for Linear Regression

Section 12.6 indicated the problem of multicollinearity in linear models. If the covariates are replaced by the PCs, the problem of multicollinearity ceases, since the PCs are uncorrelated with each other. It is thus the right time to fuse the multicollinearity problem of linear models with PCA. We are familiar with all the relevant concepts, and hence we take the example of Maindonald and Braun (2009) to throw light on this technique. Maindonald and Braun (2009) have made the required dataset available in their package DAAG. See Streiner and Norman (2003) for more details of this study.

It is thus seen how PCA helps to reduce the number of variables in the linear regression model. Note that even if we replace the original variables with the full set of equivalent PCs, the problem of multicollinearity is fixed.
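A minimal sketch of regression on PC scores (using the base R stackloss data rather than the socsupport example of the text):

    pcs <- prcomp(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")], scale. = TRUE)
    round(cor(pcs$x), 10)                            # the PC scores are uncorrelated
    fit <- lm(stackloss$stack.loss ~ pcs$x[, 1:2])   # regress on the first two PC scores
    summary(fit)

Since the scores are orthogonal, the estimated coefficients do not suffer from the variance inflation caused by correlated covariates.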

15.5.2 Biplots

Gower and Hand (1996) have written a monograph on the use of biplots for multivariate data. Gower et al. (2011) is a recent book on biplots complemented with the R package UBbipl, and is also an extension of Gower and Hand (1996). Greenacre (2010) has implemented all the biplot techniques in his book. This book has R codes for carrying out all the data analysis, and he has also been generous enough to gift it to the world at http://www.multivariatestatistics.org/biplots.html. For the theoretical aspects of biplots, the reader may also refer to Rencher (2002), Johnson and Wichern (2007), and Jolliffe (2002), among others. For a simpler and effective understanding of biplots, see the Appendix of Desmukh and Purohit (2007).

The biplot is a visualization technique for the data matrix $X$ through two coordinate systems representing the observations (rows) and the variables (columns) of the dataset. In this method, the covariances between the variables and the distances between the observations are plotted in a single figure, and to reflect this facet the prefix “bi” is used. In this plot, the distance between the points, which are the observations, represents the Mahalanobis distance between them. The length of a vector displayed on the plot, from the origin to the coordinates, represents the variance of the corresponding variable, with the angle between two such vectors reflecting the correlation between the variables. If the angle between two vectors is small, the corresponding variables are strongly correlated.

For the sake of simplicity, we will assume that the data matrix $X$, consisting of $n$ observations on a $p$-dimensional vector, is a centered matrix in the sense that each column has zero mean. By the singular value decomposition (SVD) result, we can write the matrix $X$ as

$$X = U\Lambda V', \tag{15.13}$$

where $U$ is an $n \times r$ matrix, $\Lambda$ is an $r \times r$ diagonal matrix, and $V$ is a $p \times r$ matrix, $r$ being the rank of $X$. By the properties of the SVD, we have $U'U = I_r$ and $V'V = I_r$. Furthermore, $\Lambda$ has diagonal elements $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$, the singular values of $X$. We will consider a simple illustration of the SVD for the famous “Cork” dataset of Rao (1973).

Notice the decline of the singular values, the $\lambda_i$ values, for the cork dataset. In the spirit of PCA, we tend to believe that if such a decline is steep, we can probably obtain a good understanding of the dataset by resorting to plots which use only two dimensions. In fact, such a result is validated by a theorem of Eckart and Young (1936). We need to connect the SVD result with a two-factor decomposition of $X$, which is now stated. This decomposition says that any $n \times p$ matrix can be expressed as

$$X = GH', \tag{15.14}$$

where $G$ is an $n \times r$ matrix and $H$ is a $p \times r$ matrix, $r$ being the rank of the matrix $X$. In a certain sense, the goal is to understand the variation among the $n$ observations through the matrix $G$ and the variation among the $p$ variables through $H$. The matrices $G$ and $H$ may be obtained as a combination of the SVD elements as $G = U\Lambda^{c}$ and $H' = \Lambda^{1-c}V'$. For different choices of $c$, we have different representations of $X$. The three most common choices of $c$ are 0, 1/2, and 1, see Gabriel (1971). We mention some consequences of these choices, see Khatree and Naik (1999).

  • $c = 1/2$. In this case, the factor matrices may be expressed in terms of the SVD matrices as

    $$G = U\Lambda^{1/2}, \qquad H = V\Lambda^{1/2}. \tag{15.15}$$

    For the choice $c = 1/2$, we place an equal emphasis on the variables and the observations.

  • $c = 0$. Here

    $$G = U, \qquad H = V\Lambda. \tag{15.16}$$

    The distance between the vectors $g_i$, the rows of $G$, approximates the squared Mahalanobis distance between the observation vectors. Furthermore, the inner product between the vectors $h_j$, the rows of $H$, approximates the covariance between the corresponding variables, and the length of a vector $h_j$ gives its variance.

  • $c = 1$. Here,

    $$G = U\Lambda, \qquad H = V. \tag{15.17}$$

    For this case, the distance between the $g_i$'s is the usual Euclidean distance between the observations, the values of $G$ equal the principal component scores for the observations, whereas the values of $H$ refer to the principal component loadings.

For the cork dataset, we will obtain the biplot for one of the above choices of $c$.
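Since the cork data is not shipped with a standard R package, the sketch below uses the USairpollution data (assumed available from HSAUR2) to show the SVD factorization and a biplot; the choice $c = 1$ is taken here purely as an illustration.

    data("USairpollution", package = "HSAUR2")
    X <- scale(USairpollution, center = TRUE, scale = FALSE)  # column-centered data matrix
    sv <- svd(X)
    sv$d                          # the singular values lambda_1 >= ... >= lambda_r
    G <- sv$u %*% diag(sv$d)      # c = 1: G = U Lambda
    H <- sv$v                     # c = 1: H = V
    max(abs(X - G %*% t(H)))      # reconstruction check: X = GH'
    biplot(princomp(USairpollution), scale = 0)  # scale = 0 plots raw scores and loadings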


15.6 Factor Analysis

We will have a look at another important facet of multivariate statistical analysis: Factor Analysis. The data observations $y_1, y_2, \ldots, y_n$ are assumed to arise from an $N_p(\mu, \Sigma)$ distribution. Consider a hypothetical example where the correlation matrix of a five-component vector has a block pattern of the form

$$R = \begin{bmatrix} 1 & \text{high} & \approx 0 & \approx 0 & \approx 0 \\ \text{high} & 1 & \approx 0 & \approx 0 & \approx 0 \\ \approx 0 & \approx 0 & 1 & \text{high} & \text{high} \\ \approx 0 & \approx 0 & \text{high} & 1 & \text{high} \\ \approx 0 & \approx 0 & \text{high} & \text{high} & 1 \end{bmatrix}.$$

Here, we can see that the first two components are strongly correlated with each other and also appear to be independent of the rest of the components. Similarly, the last three components are strongly correlated among themselves and independent of the first two components. A natural intuition is to think of the first two components arising due to one factor and the remaining three due to a second factor. The factors are also sometimes called latent variables.

The development in the rest of the section relates only to the orthogonal factor model, and this is what is meant whenever we talk about the factor analysis model. For other variants, refer to Basilevsky (1994), Reyment and Jöreskog (1996), and Brown (2006).

15.6.1 The Orthogonal Factor Analysis Model

Let $Y = (Y_1, Y_2, \ldots, Y_p)'$ be a $p$-dimensional random vector. To begin with, we will assume that there are $m$ factors $F_1, \ldots, F_m$, with $m < p$, and that each of the $Y_i$'s is a function of the $m$ factors. The factor analysis model is given by

$$Y_i - \mu_i = \lambda_{i1}F_1 + \lambda_{i2}F_2 + \cdots + \lambda_{im}F_m + \epsilon_i, \quad i = 1, \ldots, p, \tag{15.18}$$

where the $\epsilon_i$, $i = 1, \ldots, p$, are normally distributed errors associated with the variables $Y_i$. In the factor analysis model, the $\lambda_{ij}$, $i = 1, \ldots, p$, $j = 1, \ldots, m$, are the regression coefficients between the observed variables and the factors. Two points need to be observed. In the factor analysis literature, the regression coefficients are called loadings, which indicate how much weight each of the $Y_i$'s places on the factors $F_j$. The loadings are denoted by $\lambda_{ij}$'s, a symbol which we have thus far used for eigenvalues and eigenvectors. However, this notation for the loadings is standard in the factor analysis literature, and in the rest of this section the $\lambda$'s will denote the loadings and not quantities related to eigenvalues.

We will now use matrix notation and then state the essential assumptions. The (orthogonal) factor model may be stated in matrix form as

$$Y - \mu = \Lambda F + \epsilon, \tag{15.19}$$

where

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_p \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_p \end{bmatrix}, \quad \Lambda = \begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1m} \\ \vdots & \ddots & \vdots \\ \lambda_{p1} & \cdots & \lambda_{pm} \end{bmatrix}, \quad F = \begin{bmatrix} F_1 \\ \vdots \\ F_m \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_p \end{bmatrix}.$$

The essential assumptions related to the factors and the errors are as follows:

$$E(F) = 0, \tag{15.20}$$
$$\mathrm{Cov}(F) = I_m, \tag{15.21}$$
$$E(\epsilon) = 0, \tag{15.22}$$
$$\mathrm{Cov}(\epsilon) = \Psi = \mathrm{diag}(\psi_1, \ldots, \psi_p), \tag{15.23}$$
$$\mathrm{Cov}(F, \epsilon) = 0. \tag{15.24}$$

Under the above assumptions, we can see that the variance of the component $Y_i$ can be expressed in terms of the loadings as

$$\mathrm{Var}(Y_i) = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2 + \psi_i. \tag{15.25}$$

Define $h_i^2 = \sum_{j=1}^{m}\lambda_{ij}^2$. Thus, the variance of a component can be written as the sum of a common variance component and a specific variance component. It is common practice in the factor analysis literature to refer to $h_i^2$ as the communality, or common variance, and to the specific variance $\psi_i$ as the specificity, unique variance, or residual variance.

The covariance matrix $\Sigma$ can be written in terms of $\Lambda$ and $\Psi$ as

$$\Sigma = \Lambda\Lambda' + \Psi. \tag{15.26}$$

Using the model and the above assumptions, we can also arrive at the next expression:

$$\mathrm{Cov}(Y, F) = \Lambda, \quad \text{that is,} \quad \mathrm{Cov}(Y_i, F_j) = \lambda_{ij}.$$
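A small simulation sketch of the orthogonal factor model: with $m = 2$ factors, $p = 5$ variables, and hypothetical loadings (chosen here only for illustration), the sample covariance matrix of the generated data should be close to $\Lambda\Lambda' + \Psi$.

    set.seed(123)
    Lambda <- matrix(c(0.9, 0.8, 0.1, 0.2, 0.1,
                       0.1, 0.2, 0.8, 0.7, 0.9), ncol = 2)  # hypothetical loadings
    Psi <- diag(1 - rowSums(Lambda^2))                       # specific variances
    n <- 5000
    F <- matrix(rnorm(n * 2), n, 2)                          # factors, Cov(F) = I
    eps <- matrix(rnorm(n * 5), n, 5) %*% sqrt(Psi)          # errors, Cov(eps) = Psi
    Y <- F %*% t(Lambda) + eps                               # the model (15.19)
    round(cov(Y) - (Lambda %*% t(Lambda) + Psi), 2)          # should be close to zero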

We will consider three methods for the estimation of the loadings and communalities: (i) the principal component method, (ii) the principal factor method, and (iii) the maximum likelihood method. We omit a fourth important technique, the iterated principal factor method.

15.6.2 Estimation of Loadings and Communalities

We will first consider the principal component method. Let $S$ denote the sample covariance matrix. The problem is then to find an estimator $\hat{\Lambda}$ which approximates $S$ in the sense that

$$S \approx \hat{\Lambda}\hat{\Lambda}' + \hat{\Psi}. \tag{15.27}$$

In this approach, the last component $\hat{\Psi}$ is ignored and we approximate the sample covariance matrix by its spectral decomposition:

$$S = CDC', \tag{15.28}$$

where $C$ is an orthogonal matrix constructed with the normalized eigenvectors $c_1, \ldots, c_p$ of $S$, and $D$ is a diagonal matrix with the eigenvalues of $S$. That is, if $\theta_1 > \theta_2 > \cdots > \theta_p$ are the eigenvalues of $S$, then $D = \mathrm{diag}(\theta_1, \ldots, \theta_p)$. Since the eigenvalues $\theta_i$ of the positive semi-definite matrix $S$ are all positive or zero, we can factor $D$ as

$$D = D^{1/2}D^{1/2},$$

and substituting this in (15.28), we get

$$S = CD^{1/2}D^{1/2}C' = \left(CD^{1/2}\right)\left(CD^{1/2}\right)'. \tag{15.29}$$

This suggests that we can use $\hat{\Lambda} = CD^{1/2}$. However, we seek a $\hat{\Lambda}$ with fewer than $p$ columns, and hence we consider only the first $m$ largest eigenvalues $\theta_1, \ldots, \theta_m$ and take $D_1 = \mathrm{diag}(\theta_1, \ldots, \theta_m)$ and $C_1 = (c_1, \ldots, c_m)$, the matrix of the corresponding eigenvectors. Thus, a useful estimator of $\Lambda$ is given by

$$\hat{\Lambda} = C_1 D_1^{1/2}. \tag{15.30}$$

Note that the $i$th diagonal element of $\hat{\Lambda}\hat{\Lambda}'$ is the sum of squares of the $i$th row of $\hat{\Lambda}$. We can then use this to estimate the diagonal elements of $\Psi$ by

$$\hat{\psi}_i = s_{ii} - \sum_{j=1}^{m}\hat{\lambda}_{ij}^2, \tag{15.31}$$

and using this relationship approximate $\Sigma$ by

$$S \approx \hat{\Lambda}\hat{\Lambda}' + \hat{\Psi}, \quad \text{where } \hat{\Psi} = \mathrm{diag}(\hat{\psi}_1, \ldots, \hat{\psi}_p). \tag{15.32}$$

Since, here, the sums of squares of the rows and columns of $\hat{\Lambda}$ equal the communalities and the eigenvalues respectively, an estimate of the $i$th communality is given by

$$\hat{h}_i^2 = \sum_{j=1}^{m}\hat{\lambda}_{ij}^2. \tag{15.33}$$

Similarly, we have

$$\sum_{i=1}^{p}\hat{\lambda}_{ij}^2 = \sum_{i=1}^{p}\theta_j c_{ij}^2 = \theta_j, \tag{15.34}$$

where the last equality follows from the fact that $\hat{\lambda}_{ij} = \sqrt{\theta_j}\,c_{ij}$ and the eigenvectors are normalized, $\sum_{i=1}^{p}c_{ij}^2 = 1$. Using the estimates $\hat{\lambda}_{ij}$ and $\hat{\psi}_i$ in (15.32), we obtain a partition of the variance of the $i$th variable as

$$s_{ii} = \hat{h}_i^2 + \hat{\psi}_i = \sum_{j=1}^{m}\hat{\lambda}_{ij}^2 + \hat{\psi}_i. \tag{15.35}$$

The contribution of the $j$th factor to the total sample variance $s_{11} + s_{22} + \cdots + s_{pp} = \mathrm{tr}(S)$ is therefore

$$\frac{\theta_j}{\mathrm{tr}(S)} = \frac{\sum_{i=1}^{p}\hat{\lambda}_{ij}^2}{s_{11} + s_{22} + \cdots + s_{pp}}. \tag{15.36}$$

We will now illustrate the concepts with a solved example from Rencher (2002).
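As a parallel sketch (using the built-in ability.cov data rather than the Rencher example), the principal component method of equations (15.30)-(15.33) can be carried out directly on a correlation matrix:

    R <- cov2cor(ability.cov$cov)
    e <- eigen(R)
    m <- 2
    Lhat <- e$vectors[, 1:m] %*% diag(sqrt(e$values[1:m]))  # loadings C1 D1^{1/2}, (15.30)
    h2 <- rowSums(Lhat^2)                                   # communalities, (15.33)
    psi <- diag(R) - h2                                     # specific variances, (15.31)
    cbind(Lhat, h2, psi)
    e$values[1:m] / sum(diag(R))                            # contribution of each factor, (15.36)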

We will next consider the principal factor method. In the previous method we ignored $\hat{\Psi}$. In the principal factor method we use an initial estimate of $\Psi$, say $\hat{\Psi}$, and factor $S - \hat{\Psi}$, or $R - \hat{\Psi}$, whichever is appropriate:

$$S - \hat{\Psi} \approx \hat{\Lambda}\hat{\Lambda}', \tag{15.37}$$
$$R - \hat{\Psi} \approx \hat{\Lambda}\hat{\Lambda}', \tag{15.38}$$

where $\hat{\Lambda}$ is as specified in (15.30), now computed with the eigenvalues and eigenvectors of $S - \hat{\Psi}$ or $R - \hat{\Psi}$. Since the $i$th diagonal element of $S - \hat{\Psi}$ is the $i$th communality, we have $\hat{h}_i^2 = s_{ii} - \hat{\psi}_i$. In the case of $R - \hat{\Psi}$, we have $\hat{h}_i^2 = 1 - \hat{\psi}_i$. For more details, refer to Section 13.2 of Rencher (2002). We will illustrate these computations as a continuation of the previous example.
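Continuing the ability.cov sketch: a common choice for the initial communalities with a correlation matrix is the squared multiple correlation $1 - 1/r^{ii}$ (an assumption here, as the text does not fix the initial estimate), after which the eigen-decomposition of $R - \hat{\Psi}$ gives the principal factor loadings.

    R <- cov2cor(ability.cov$cov)
    h2.init <- 1 - 1 / diag(solve(R))   # initial communality estimates
    R.adj <- R
    diag(R.adj) <- h2.init              # the reduced matrix R - Psi_hat
    e <- eigen(R.adj)
    m <- 2
    e$vectors[, 1:m] %*% diag(sqrt(e$values[1:m]))   # principal factor loadings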

Finally, we conclude this section with a discussion of the maximum likelihood estimation method. Under the assumption that the observations $y_1, y_2, \ldots, y_n$ are a random sample from $N_p(\mu, \Sigma)$, it may be shown that the estimates $\hat{\Lambda}$ and $\hat{\Psi}$ satisfy the following set of equations:

$$S\hat{\Psi}^{-1}\hat{\Lambda} = \hat{\Lambda}\left(I_m + \hat{\Lambda}'\hat{\Psi}^{-1}\hat{\Lambda}\right), \tag{15.39}$$
$$\hat{\Psi} = \mathrm{diag}\left(S - \hat{\Lambda}\hat{\Lambda}'\right), \tag{15.40}$$
$$\hat{\Lambda}'\hat{\Psi}^{-1}\hat{\Lambda} \ \text{is a diagonal matrix}. \tag{15.41}$$

The equations need to be solved iteratively, and happily for us R does that. The MLE technique is illustrated in the next example. We need to address a few important questions before then.

The important question is regarding the choice of the number of factors to be determined. Some rules given in Rencher are stated in the following.

  • Select $m$ equal to the number of factors necessary to account for a pre-specified percentage, say 80%, of the total variance.
  • Select $m$ as the number of eigenvalues that are greater than the average eigenvalue.
  • Use a scree plot to determine $m$.
  • Test the hypothesis that $m$ is the correct number of factors, that is, $H_0: \Sigma = \Lambda\Lambda' + \Psi$, where $\Lambda$ is a $p \times m$ matrix; see the sketch following this list.
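A sketch of the ML method together with the test for the number of factors, using the built-in ability.cov data (the dataset from the factanal help page, not an example from the text):

    fa2 <- factanal(factors = 2, covmat = ability.cov)
    fa2$loadings        # ML estimates of the loadings (varimax rotated by default)
    fa2$uniquenesses    # the specific variances psi_i
    fa2$PVAL            # p-value for H0: two factors are sufficient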

We leave it to the reader to find out more about the concept of rotation, and give a summary of the main methods, adapted from Hair et al. (2010).

  • Varimax Rotation is the most popular orthogonal factor rotation method, which focuses on simplifying the columns of a factor matrix. It is generally superior to other orthogonal factor rotation methods. Here, we seek a rotation of the loadings which maximizes the variance of the squared loadings in each column of $\Lambda$; a minimal sketch follows this list.
  • Quartimax Rotation is a less powerful technique than varimax rotation, which focuses on simplifying the rows of the factor matrix.
  • Oblique Rotation obtains the factors in such a way that the extracted factors are allowed to be correlated, and hence it identifies the extent to which the factors are correlated.
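A minimal sketch of an explicit rotation, again with the ability.cov data: the unrotated ML loadings are obtained first and then rotated with the varimax criterion.

    fa.none <- factanal(factors = 2, covmat = ability.cov, rotation = "none")
    varimax(loadings(fa.none))$loadings   # orthogonally (varimax) rotated loadings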

We have thus learnt about some fairly complex and powerful techniques in multivariate statistics. The techniques range from classifying observations into specific classes, identifying groups of independent sub-vectors, and reducing the number of variables, to determining hidden variables which possibly explain the observed variables. More details can be found in the references discussed at the conclusion of this chapter.


15.7 Further Reading

We will begin with a disclaimer that the classification of the texts in different sections is not perfect.

15.7.1 The Classics and Applied Perspectives

Anderson (1958, 1984, 2003) is the first primer on MSA. Currently, Anderson's book is in its third edition, and it is worth noting that the second and third editions are probably the only ones which discuss the Stein effect in depth. Chapter 8 of Rao (1973) provides the necessary theoretical background for multivariate analysis and also contains some of Rao's remarkable research in multivariate statistics; in a certain way, this one chapter may have more results than we can possibly cover in a complete book. A vector space approach to multivariate statistics is to be found in Eaton (1983, 2007). Mardia, Kent, and Bibby (1979) is an excellent treatise on multivariate analysis and considers many geometrical aspects. The geometrical approach is also considered in Gnanadesikan (1977, 1997), where further robustness aspects are developed as well. Muirhead (1982), Giri (2004), Bilodeau and Brenner (1999), Rencher (2002), and Rencher (1998) are among some of the important texts on multivariate analysis. We note here that our coverage is mainly based on Rencher (2002).

Jolliffe (2002) is a detailed monograph on Principal Component Analysis. Jackson (1991) is a remarkable account of the applications of PCA. It is needless to say that if you read through these two books, you may become an authority on PCA.

Missing data, EM algorithms, and multivariate analysis have been aptly handled in Schafer (1997), and in fact many useful programs have been provided in S which can be easily adapted to R. In this sense, this is a stand-alone reference book for dealing with missing data. Of course, McLachlan and Krishnan (2008) may also be used!

Johnson and Wichern (2007) is a popular course text, which does apt justice to both theory and applications. Hair et al. (2010) may be commonly found on a practitioner's desk. Izenman (2008) is a modern flavor of multivariate statistics with coverage of the fashionable area of machine learning. Sharma (1996) and Timm (2002) also provide a firm footing in multivariate statistics.

Gower, et al. (2011) discuss many variants of biplots, which is an extension of Gower and Hand (1996). Greenacre (2010) is an open source book with in-depth coverage of biplots.

15.7.2 Multivariate Analysis and Software

The two companion volumes of Khatree and Naik (1999) and Khatree and Naik (2000) provide excellent coverage of multivariate analysis and computations through the SAS software. It may be noted, one more time, that the programs and logical thinking are of paramount importance rather than a particular software. It is worth recording here that these two companions provide a fine balance between the theoretical aspects and computations. Härdle and Simar (2007) have used the “XploRe” software for computations. Last, but not least, the recent book of Everitt and Hothorn (2011) is a good source for multivariate analysis through R. Varmuza and Filzmoser (2009) have used the R software with a special emphasis on applications to chemometrics. Husson et al. (2011) is also a recent arrival, which integrates R with multivariate analysis. Desmukh and Purohit (2007) also present PCA, biplots, and other multivariate aspects in R, though their emphasis is more on microarray data.

15.8 Complements, Problems, and Programs

  1. Problem 15.1 Explore the R examples for linear discriminant analysis and canonical correlation with example(lda) and example(cancor).

  2. Problem 15.2 In the “Seishu Wine Study” of Example 16.9.1, the tests for independence of four sub-vectors lead to rejection of the hypothesis of their independence. Combine the subvectors s11 with s22 and s33 with s44. Find the canonical correlations between these combined subvectors. Furthermore, find the canonical correlations for each subvector while pooling the others together.

  3. Problem 15.3 Principal components offer effective reduction in data dimensionality. In Examples 15.4.1 and 15.4.2, it is observed that the first few PCs explain most of the variation in the original data. Do you expect further reduction if you perform PCA on these PCs? Validate your answer by running princomp on the PCs.

  4. Problem 15.4 Find the PCs for the stack loss dataset, which explain 85% of the variation in the original dataset.

  5. Problem 15.5 Perform the PCA on the iris dataset along the two lines: (i) the entire dataset, (ii) three subsets according to the three species. Check whether the PC scores are significantly different across the three species using an appropriate multivariate testing problem.

  6. Problem 15.6 For the US crime data of Example 13.4.2, carry out the PCA for the covariates and then perform the regression analysis on the PC scores. Investigate if the multicollinearity problem persists in the fitted regression model based on the PC scores.

  7. Problem 15.7 How do outliers affect PC scores? Perform PCA on the board stiffness dataset of Example 16.3.5 with and without the detected outliers therein.

  8. Problem 15.8 Check out the example of the factanal function. Are factors present in the iris dataset? Develop the complete analysis for the problem.
