Now that we have some cleansed data ready for analysis, let's first see how we can find our way around the high number of variables in our dataset. This chapter will introduce statistical techniques to reduce the number of variables by dimension reduction and feature extraction, such as principal component analysis (PCA) and factor analysis.
Most dimension reduction methods require that two or more numeric variables in the dataset are highly associated or correlated, so that the columns of our matrix are not totally independent of each other. In such a situation, the goal of dimension reduction is to decrease the number of columns in the dataset to the actual matrix rank; or, in other words, the number of variables can be decreased while most of the information content is retained. In linear algebra, the matrix rank refers to the dimension of the vector space generated by the matrix, or, in simpler terms, the number of linearly independent columns (or rows) of the matrix. Rank is probably easier to understand with a quick example: imagine a dataset on students where we know the gender, the age, and the date of birth of the respondents. This data is redundant, as the age can be computed (via a linear transformation) from the date of birth. Similarly, the Year variable is static (without any variability) in the hflights dataset, and the elapsed time can also be computed from the departure and arrival times.
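To make the idea of rank more tangible, here is a minimal sketch with made-up numbers (the students matrix below is purely illustrative): the age column is a linear combination of the static year column and the year of birth, so the matrix has three columns but a rank of only two:

> students <- cbind(
+     year = rep(2011, 4),
+     born = c(1990, 1992, 1991, 1993),
+     age  = rep(2011, 4) - c(1990, 1992, 1991, 1993))
> ncol(students)      # 3 columns ...
> qr(students)$rank   # ... but the rank is only 2, as age = year - born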
These dimension reduction methods basically concentrate on the common variance identified among the variables and exclude the remaining (unique) variance. This results in a dataset with fewer columns, which is probably easier to maintain and process, but at the cost of some information loss and the creation of artificial variables, which are usually harder to comprehend than the original columns.
In the case of perfect dependence, all but one of the perfectly correlated variables can be omitted, as the rest provide no additional information about the dataset. Although such perfect dependence does not happen often, in most cases it is still totally acceptable to keep only one or a few components extracted from a set of variables, for example from the questions of a survey, for further analysis.
The first thing you want to do, when thinking about reducing the number of dimensions or looking for latent variables in the dataset with multivariate statistical analysis, is to check whether the variables are correlated and the data is normally distributed.
The latter is often not a strict requirement. For example, the results of a PCA can still be valid and interpretable if we do not have multivariate normality; on the other hand, maximum likelihood factor analysis does have this strong assumption.
Anyway, you can use (for example) qqplot to do a pair-wise comparison of variables, and qqnorm to do univariate, graphical checks of the normality of your variables. First, let's demonstrate this with a subset of hflights:
> library(hflights)
> JFK <- hflights[which(hflights$Dest == 'JFK'),
+     c('TaxiIn', 'TaxiOut')]
So we filter our dataset to only those flights heading to the John F. Kennedy International Airport, and we are interested in only two variables describing how long the taxiing in and out times were in minutes. The preceding command with the traditional [ indexing can be refactored with subset for much more readable source code:
> JFK <- subset(hflights, Dest == 'JFK', select = c(TaxiIn, TaxiOut))
Please note that now there's no need to quote the variable names or refer to the data.frame name inside the subset call. For more details on this, please see Chapter 3, Filtering and Summarizing Data. And now let's see how the values of these two columns are distributed:
> par(mfrow = c(1, 2))
> qqnorm(JFK$TaxiIn, ylab = 'TaxiIn')
> qqline(JFK$TaxiIn)
> qqnorm(JFK$TaxiOut, ylab = 'TaxiOut')
> qqline(JFK$TaxiOut)
To render the preceding plot, we set up the graphical device with par to hold two plots in a row, then called qqnorm to show the quantiles of the empirical variables against those of the normal distribution, and also added a reference line for the latter with qqline for easier comparison. If the data was scaled previously, qqline would render a 45-degree line.
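As a quick, hedged sketch of that last remark: standardizing the variable with scale first and then drawing the normal QQ plot should yield a qqline running close to the y = x diagonal (the abline call below only adds that diagonal for reference):

> TaxiIn_scaled <- scale(JFK$TaxiIn)
> qqnorm(TaxiIn_scaled, ylab = 'TaxiIn (scaled)')
> qqline(TaxiIn_scaled)
> abline(0, 1, lty = 'dotted')   # the 45-degree reference line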
Checking the QQ plots suggests that the data does not fit the normal distribution very well, which can also be verified by an analytical test, such as the Shapiro-Wilk normality test:
> shapiro.test(JFK$TaxiIn)

	Shapiro-Wilk normality test

data:  JFK$TaxiIn
W = 0.8387, p-value < 2.2e-16
The p-value is really small, so the null hypothesis (stating that the data is normally distributed) is rejected. But how can we test the normality of a bunch of variables together, without (and beyond) running a separate statistical test for each of them?
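Just for reference, the brute-force approach would be a sketch along the following lines, looping over the columns and collecting the univariate Shapiro-Wilk p-values; this, however, tells us nothing about the joint (multivariate) distribution:

> sapply(JFK, function(x) shapiro.test(x)$p.value)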
Similar statistical tests exist for multiple variables as well; these methods provide different ways to check whether the data fits the multivariate normal distribution. To this end, we will use the MVN package, but similar methods can also be found in the mvnormtest package. The latter includes the multivariate version of the previously discussed Shapiro-Wilk test as well.
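For completeness, a hedged sketch of that multivariate Shapiro-Wilk test from mvnormtest is shown below; note that mshapiro.test expects the variables in rows, hence the transpose, and missing values have to be dropped first:

> library(mvnormtest)
> mshapiro.test(t(as.matrix(na.omit(JFK))))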
But Mardia's test is used more often to check multivariate normality and, even better, it does not limit the sample size to below 5,000. After loading the MVN package, calling the appropriate R function is pretty straightforward, and the results have a very intuitive interpretation, after getting rid of the missing values in our dataset:
> JFK <- na.omit(JFK)
> library(MVN)
> mardiaTest(JFK)
   Mardia's Multivariate Normality Test 
--------------------------------------- 
   data : JFK 

   g1p            : 20.84452 
   chi.skew       : 2351.957 
   p.value.skew   : 0 

   g2p            : 46.33207 
   z.kurtosis     : 124.6713 
   p.value.kurt   : 0 

   chi.small.skew : 2369.368 
   p.value.small  : 0 

   Result : Data is not multivariate normal. 
---------------------------------------
For more details on handling and filtering missing values, please see Chapter 8, Polishing Data.
Out of the three p values, the third one refers to cases when the sample size is extremely small (<20), so now we only concentrate on the first two values, both below 0.05. This means that the data does not seem to be multivariate normal. Unfortunately, Mardia's test fails to perform well in some cases, so more robust methods might be more appropriate to use.
The MVN package can run Henze-Zirkler's and Royston's multivariate normality tests as well. Both return user-friendly and easy-to-interpret results:
> hzTest(JFK)
  Henze-Zirkler's Multivariate Normality Test 
--------------------------------------------- 
  data : JFK 

  HZ      : 42.26252 
  p-value : 0 

  Result  : Data is not multivariate normal. 
--------------------------------------------- 

> roystonTest(JFK)
  Royston's Multivariate Normality Test 
--------------------------------------------- 
  data : JFK 

  H       : 264.1686 
  p-value : 4.330916e-58 

  Result  : Data is not multivariate normal. 
---------------------------------------------
A more visual method to test multivariate normality is to render QQ plots similar to those we used before. But, instead of comparing only one variable with the theoretical normal distribution, let's first compute the squared Mahalanobis distances of our observations, which should follow a chi-square distribution with the degrees of freedom being the number of our variables. The MVN package can automatically compute all the required values and render those with any of the preceding normality test R functions; just set the qqplot argument to TRUE:
> mvt <- roystonTest(JFK, qqplot = TRUE)
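Under the hood, this is roughly what gets plotted; the following hedged sketch reproduces the idea with base R functions (mahalanobis, ppoints, and qchisq), comparing the squared distances against the appropriate chi-square quantiles:

> d2 <- mahalanobis(JFK, colMeans(JFK), cov(JFK))
> qqplot(qchisq(ppoints(nrow(JFK)), df = ncol(JFK)), d2,
+     xlab = 'Chi-square quantiles',
+     ylab = 'Squared Mahalanobis distance')
> abline(0, 1)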
If the dataset were normally distributed, the points shown in the preceding graphs should fit a straight line. Other, alternative graphical methods can produce more visual and user-friendly plots with the previously created mvt R object. The MVN package ships the mvnPlot function, which can render perspective and contour plots for two variables and thus provides a nice way to test bivariate normality:
> par(mfrow = c(1, 2))
> mvnPlot(mvt, type = "contour", default = TRUE)
> mvnPlot(mvt, type = "persp", default = TRUE)
On the right-hand plot, you can see the empirical distribution of the two variables on a perspective plot, where most cases can be found in the bottom-left corner. This means that most flights had only relatively short TaxiIn and TaxiOut times, which suggests a rather heavy-tailed, right-skewed distribution. The left-hand plot shows a similar image, but from a bird's eye view: the contour lines represent a cross-section of the 3D graph on the right. A multivariate normal distribution looks much more central, something like a 2-dimensional bell curve:
> set.seed(42)
> mvt <- roystonTest(MASS::mvrnorm(100, mu = c(0, 0),
+     Sigma = matrix(c(10, 3, 3, 2), 2)))
> mvnPlot(mvt, type = "contour", default = TRUE)
> mvnPlot(mvt, type = "persp", default = TRUE)
See Chapter 13, Data Around Us on how to create similar contour maps on spatial data.
Besides normality, relatively high correlation coefficients are desired when applying dimension reduction methods. The reason is that, if there is no statistical relationship between the variables, PCA, for example, will return the exact same values without much transformation.
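A hedged sketch of this claim: when the columns are (nearly) independent random variables, the principal components explain roughly equal proportions of the variance, so there is nothing useful to reduce:

> set.seed(42)
> independent <- data.frame(a = rnorm(1000), b = rnorm(1000),
+     c = rnorm(1000))
> summary(prcomp(independent, scale. = TRUE))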
To this end, let's see how the numerical variables of the hflights dataset are correlated (the output, being a large matrix, is suppressed this time):
> hflights_numeric <- hflights[, which(sapply(hflights, is.numeric))]
> cor(hflights_numeric, use = "pairwise.complete.obs")
In the preceding example, we created a new R object to hold only the numeric columns of the original hflights data frame, leaving out the five character vectors. Then, we ran cor with pair-wise deletion of missing values, which returns a matrix with 16 columns and 16 rows:
> str(cor(hflights_numeric, use = "pairwise.complete.obs"))
 num [1:16, 1:16] NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:16] "Year" "Month" "DayofMonth" "DayOfWeek" ...
  ..$ : chr [1:16] "Year" "Month" "DayofMonth" "DayOfWeek" ...
The number of missing values in the resulting correlation matrix seems to be very high. This is because Year was 2011 in all cases, resulting in a standard deviation of zero. It's wise to exclude Year along with the non-numeric variables from the dataset, by not only filtering for numeric values, but also checking the variance:
> hflights_numeric <- hflights[, which(
+     sapply(hflights, function(x)
+         is.numeric(x) && var(x, na.rm = TRUE) != 0))]
Now the number of missing values is a lot lower:
> table(is.na(cor(hflights_numeric, use = "pairwise.complete.obs")))

FALSE  TRUE 
  209    16
Can you guess why we still have some missing values here despite the pair-wise deletion of missing values? Well, running the preceding command results in a rather informative warning, but we will get back to this question later:
Warning message:
In cor(hflights_numeric, use = "pairwise.complete.obs") :
  the standard deviation is zero
Let's now proceed with analyzing the actual numbers in the 15x15 correlation matrix, which would be way too large to print in this book. This is why we did not show the result of the original cor command previously; instead, let's visualize those 225 numbers with the graphical capabilities of the ellipse package:
> library(ellipse)
> plotcorr(cor(hflights_numeric, use = "pairwise.complete.obs"))
Now we see the values of the correlation matrix represented by ellipses, where a perfect circle stands for a correlation coefficient of zero, and the narrower the ellipse, the further the correlation coefficient is from zero.
To help you with analyzing the preceding results, let's render a similar plot with a few artificially generated numbers that are easier to interpret:
> plotcorr(cor(data.frame(
+     1:10,
+     1:10 + runif(10),
+     1:10 + runif(10) * 5,
+     runif(10),
+     10:1,
+     check.names = FALSE)))
Similar plots on the correlation matrix can be created with the corrgram package.
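As a quick, hedged illustration of that package on a small built-in dataset (the panel functions below are the ones shipped with corrgram; consider this a sketch rather than a recipe):

> library(corrgram)
> corrgram(mtcars, lower.panel = panel.shade, upper.panel = panel.pie)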
But let's get back to the hflights dataset! On the previous diagram, some narrow ellipses are rendered for the time-related variables, which show a relatively high correlation coefficient, and even the Month variable seems to be slightly associated with the FlightNum variable:
> cor(hflights$FlightNum, hflights$Month)
[1] 0.2057641
On the other hand, the plot shows perfect circles in most cases, which stand for a correlation coefficient around zero. This suggests that most variables are not correlated at all, so computing the principal components of the original dataset would not be very helpful due to the low proportion of common variance.
We can verify this assumption on low communalities with a number of statistical tests; for example, the SAS and SPSS folks tend to use the KMO index or Bartlett's test to see if the data is suitable for PCA. Both algorithms are available in R as well, for example via the psych package:
> library(psych)
> KMO(cor(hflights_numeric, use = "pairwise.complete.obs"))
Error in solve.default(r) : 
  system is computationally singular: reciprocal condition number = 0
In addition: Warning message:
In cor(hflights_numeric, use = "pairwise.complete.obs") :
  the standard deviation is zero
matrix is not invertible, image not found
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(hflights_numeric, use = "pairwise.complete.obs"))
Overall MSA =  NA
MSA for each item = 
            Month        DayofMonth         DayOfWeek 
              0.5               0.5               0.5 
          DepTime           ArrTime         FlightNum 
              0.5                NA               0.5 
ActualElapsedTime           AirTime          ArrDelay 
               NA                NA                NA 
         DepDelay          Distance            TaxiIn 
              0.5               0.5                NA 
          TaxiOut         Cancelled          Diverted 
              0.5                NA                NA
Unfortunately, the Overall MSA (Measure of Sampling Adequacy, representing the average correlation between the variables) is not available in the preceding output due to the previously identified missing values in the correlation matrix. Let's pick a pair of variables where the correlation coefficient was NA for further analysis! Such a pair can be easily identified from the previous plot; no circle or ellipse was drawn for the missing values, for example, for Cancelled and AirTime:
> cor(hflights_numeric[, c('Cancelled', 'AirTime')])
          Cancelled AirTime
Cancelled         1      NA
AirTime          NA       1
This can be explained by the fact that, if a flight is cancelled, the time spent in the air does not vary much; as a matter of fact, it is not available at all:
> cancelled <- which(hflights_numeric$Cancelled == 1)
> table(hflights_numeric$AirTime[cancelled], exclude = NULL)

<NA> 
2973
So we get missing values when calling cor due to these NA values; and we get NA even with pair-wise deletion, as in that case only the non-cancelled flights remain in the data, resulting in zero variance for the Cancelled variable:
> table(hflights_numeric$Cancelled)

     0      1 
224523   2973
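A quick sanity check (just a sketch) confirms this: among the rows where both variables are available, Cancelled is constant, hence the zero standard deviation and the undefined correlation coefficient:

> ok <- complete.cases(hflights_numeric[, c('Cancelled', 'AirTime')])
> sd(hflights_numeric$Cancelled[ok])   # expected to be zero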
This suggests removing the Cancelled variable from the dataset before we run the previously discussed assumption tests, as the information stored in that variable is redundantly available in other columns of the dataset as well. Or, in other words, the Cancelled column can be computed by a linear transformation of the other columns, so it can be left out from further analysis:
> hflights_numeric <- subset(hflights_numeric, select = -Cancelled)
And let's see if we still have any missing values in the correlation matrix:
> which(is.na(cor(hflights_numeric, use = "pairwise.complete.obs")),
+     arr.ind = TRUE)
                  row col
Diverted           14   7
Diverted           14   8
Diverted           14   9
ActualElapsedTime   7  14
AirTime             8  14
ArrDelay            9  14
It seems that the Diverted column is responsible for a similar situation: the other three variables were not available when the flight was diverted. After another subset, we are now ready to call KMO on a full correlation matrix:
> hflights_numeric <- subset(hflights_numeric, select = -Diverted)
> KMO(cor(hflights_numeric[, -c(14)], use = "pairwise.complete.obs"))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(hflights_numeric[, -c(14)],
    use = "pairwise.complete.obs"))
Overall MSA =  0.36
MSA for each item = 
            Month        DayofMonth         DayOfWeek 
             0.42              0.37              0.35 
          DepTime           ArrTime         FlightNum 
             0.51              0.49              0.74 
ActualElapsedTime           AirTime          ArrDelay 
             0.40              0.40              0.39 
         DepDelay          Distance            TaxiIn 
             0.38              0.67              0.06 
          TaxiOut 
             0.06
The Overall MSA, or the so-called Kaiser-Meyer-Olkin (KMO) index, is a number between 0 and 1; this value suggests whether the partial correlations of the variables are small enough to continue with data reduction methods. A general rating system, or rule of thumb, for KMO can be found in the following table, as suggested by Kaiser:
Value | Description
---|---
KMO < 0.5 | Unacceptable
0.5 < KMO < 0.6 | Miserable
0.6 < KMO < 0.7 | Mediocre
0.7 < KMO < 0.8 | Middling
0.8 < KMO < 0.9 | Meritorious
KMO > 0.9 | Marvelous
A KMO index below 0.5 is considered unacceptable, which basically means that the partial correlations computed from the correlation matrix suggest that the variables are not correlated enough for a meaningful dimension reduction or latent variable model.
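If you find yourself looking this up repeatedly, a hypothetical little helper (not part of any package; purely illustrative) could map a KMO value to Kaiser's verbal rating from the preceding table:

> kmo_rating <- function(kmo)
+     cut(kmo, breaks = c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1),
+         labels = c('Unacceptable', 'Miserable', 'Mediocre',
+                    'Middling', 'Meritorious', 'Marvelous'))
> kmo_rating(0.36)   # the Overall MSA we have just computed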
Although leaving out some variables with the lowest MSA would improve the Overall MSA, and we could build some appropriate models in the following pages, for instructional purposes we won't spend any more time on data transformation for the time being, and we will instead use the mtcars dataset, which was introduced in Chapter 3, Filtering and Summarizing Data:
> KMO(mtcars)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = mtcars)
Overall MSA =  0.83
MSA for each item = 
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
0.93 0.90 0.76 0.84 0.95 0.74 0.74 0.91 0.88 0.85 0.62
It seems that the mtcars dataset is a great choice for multivariate statistical analysis. This can also be verified by the so-called Bartlett test, which checks whether the correlation matrix is similar to an identity matrix, or, in other words, whether there is a statistical relationship between the variables. If the correlation matrix has only zeros except for the diagonal, then the variables are independent from each other; thus it would not make much sense to think of multivariate methods. The psych package provides an easy-to-use function to compute Bartlett's test as well:
> cortest.bartlett(cor(mtcars))
$chisq
[1] 1454.985

$p.value
[1] 3.884209e-268

$df
[1] 55
The very low p-value suggests that we reject the null hypothesis of the Bartlett test. This means that the correlation matrix differs from the identity matrix, so the correlation coefficients between the variables seem to be closer to 1 than to 0. This is in sync with the high KMO value.
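One small caveat worth noting as a hedged aside: when cortest.bartlett is given only a correlation matrix, it cannot know the sample size and falls back to a default value, so passing the real number of observations explicitly should be more accurate:

> cortest.bartlett(cor(mtcars), n = nrow(mtcars))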
Before focusing on the actual statistical methods, please be advised that, although the preceding assumptions make sense in most cases and should be followed as a rule of thumb, KMO and Bartlett's tests are not always required. High communality is important for factor analysis and other latent variable models, while, for example, PCA is a mathematical transformation that will work even with low KMO values.