Chapter 9. From Big to Small Data

Now that we have some cleansed data ready for analysis, let's first see how we can find our way around the high number of variables in our dataset. This chapter will introduce some statistical techniques to reduce the number of variables by dimension reduction and feature extraction, such as:

  • Principal Component Analysis (PCA)
  • Factor Analysis (FA)
  • Multidimensional Scaling (MDS) and a few other techniques

Note

Most dimension reduction methods require that two or more numeric variables in the dataset are highly associated or correlated, so that the columns in our matrix are not totally independent of each other. In such a situation, the goal of dimension reduction is to decrease the number of columns in the dataset to the actual matrix rank; or, in other words, the number of variables can be decreased while most of the information content is retained. In linear algebra, the rank of a matrix refers to the dimension of the vector space generated by its columns; in simpler terms, it is the number of linearly independent columns (and rows) of a square matrix. Rank is probably easier to understand with a quick example: imagine a dataset on students where we know the gender, the age, and the date of birth of the respondents. This data is redundant, as the age can be computed (via a simple transformation) from the date of birth. Similarly, the Year variable is static (without any variability) in the hflights dataset, and the elapsed time can also be computed from the departure and arrival times.
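
To make the concept of rank more tangible, here is a minimal sketch with made-up departure and arrival times (in minutes); the numbers are purely illustrative and are not taken from hflights:

> m <- cbind(dep = c(10, 25, 40, 5, 30),
+            arr = c(55, 80, 95, 45, 90))
> m <- cbind(m, elapsed = m[, 'arr'] - m[, 'dep'])
> ncol(m)       # three columns ...
> qr(m)$rank    # ... but the rank is only two, as elapsed is redundant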

These transformations concentrate on the common variance identified among the variables and discard the remaining, unique variance. This results in a dataset with fewer columns, which is probably easier to maintain and process, but at the cost of some information loss and the creation of artificial variables, which are usually harder to interpret than the original columns.

In the case of perfect dependence, all but one of the perfectly correlated variables can be omitted, as the rest provide no additional information about the dataset. Although such perfect dependence does not happen often, in most cases it's still perfectly acceptable to keep only one or a few components extracted from a set of variables (for example, from a battery of survey questions) for further analysis.

Adequacy tests

The first thing you want to do, when thinking about reducing the number of dimensions or looking for latent variables in the dataset with multivariate statistical analysis, is to check whether the variables are correlated and the data is normally distributed.

Normality

The latter is often not a strict requirement. For example, the results of a PCA can be still valid and interpreted if we do not have multivariate normality; on the other hand, maximum likelihood factor analysis does have this strong assumption.

Tip

You should always use the appropriate methods to achieve your data analysis goals, based on the characteristics of your data.

Anyway, you can use (for example) qqplot to compare the quantiles of two variables pair-wise, and qqnorm to check a single variable against the normal distribution. First, let's demonstrate the latter with a subset of hflights:

> library(hflights)
> JFK <- hflights[which(hflights$Dest == 'JFK'),
+                 c('TaxiIn', 'TaxiOut')]

So we filter our dataset to those flights heading to John F. Kennedy International Airport, keeping only the two variables that describe the taxiing in and out times in minutes. The preceding command with the traditional [ indexing can be refactored with subset for much more readable source code:

> JFK <- subset(hflights, Dest == 'JFK', select = c(TaxiIn, TaxiOut))

Please note that now there's no need to quote variable names or refer to the data.frame name inside the subset call. For more details on this, please see Chapter 3, Filtering and Summarizing Data. And now let's see how the values of these two columns are distributed:

> par(mfrow = c(1, 2))
> qqnorm(JFK$TaxiIn, ylab = 'TaxiIn')
> qqline(JFK$TaxiIn)
> qqnorm(JFK$TaxiOut, ylab = 'TaxiOut')
> qqline(JFK$TaxiOut)
[Figure: Normal Q-Q plots of the TaxiIn and TaxiOut variables]

To render the preceding plots, we set up the graphical device with par to hold two plots in a row, then called qqnorm to plot the quantiles of the empirical variables against those of the normal distribution, and added a reference line with qqline for easier comparison. If the data had been scaled previously, qqline would render a roughly 45-degree line.
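
To see what this means in practice, here is a quick sketch with simulated (and thus already standardized) values instead of the hflights columns; the dashed 45-degree identity line ends up very close to the line drawn by qqline:

> set.seed(42)
> x <- scale(rnorm(1000))
> qqnorm(x)
> qqline(x)               # line through the first and third quartiles
> abline(0, 1, lty = 2)   # the 45-degree identity line for comparison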

The QQ plots of TaxiIn and TaxiOut suggest that the data does not fit the normal distribution very well, which can also be verified by an analytical test such as the Shapiro-Wilk normality test:

> shapiro.test(JFK$TaxiIn)

  Shapiro-Wilk normality test

data:  JFK$TaxiIn
W = 0.8387, p-value < 2.2e-16

The p-value is extremely small, so the null hypothesis (stating that the data is normally distributed) is rejected. But how can we test normality for a bunch of variables, beyond running separate statistical tests for each column?
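
Running the univariate test column by column is of course an option; here is a minimal sketch with sapply on the two JFK variables, collecting only the p-values:

> sapply(JFK, function(x) shapiro.test(x)$p.value)

However, such separate tests say nothing about the joint distribution of the variables, which is exactly what the following multivariate normality tests address.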

Multivariate normality

Similar statistical tests exist for multiple variables as well; these methods provide different ways to check whether the data fits the multivariate normal distribution. To this end, we will use the MVN package, but similar methods can also be found in the mvnormtest package. The latter includes the multivariate version of the previously discussed Shapiro-Wilk test as well.

But Mardia's test is used more often to check multivariate normality and, even better, it does not limit the sample size to below 5,000. After loading the MVN package, calling the appropriate R function is pretty straightforward and comes with a very intuitive interpretation, once we have gotten rid of the missing values in our dataset:

> JFK <- na.omit(JFK)
> library(MVN)
> mardiaTest(JFK)
   Mardia's Multivariate Normality Test 
--------------------------------------- 
   data : JFK 

   g1p            : 20.84452 
   chi.skew       : 2351.957 
   p.value.skew   : 0 

   g2p            : 46.33207 
   z.kurtosis     : 124.6713 
   p.value.kurt   : 0 

   chi.small.skew : 2369.368 
   p.value.small  : 0 

   Result          : Data is not multivariate normal. 
---------------------------------------

Tip

For more details on handling and filtering missing values, please see Chapter 8, Polishing Data.

Out of the three p-values, the third one refers to cases where the sample size is extremely small (<20), so here we concentrate only on the first two values, both of which are below 0.05. This means that the data does not seem to be multivariate normal. Unfortunately, Mardia's test fails to perform well in some cases, so more robust methods might be more appropriate.

The MVN package can run Henze-Zirkler's and Royston's multivariate normality tests as well. Both return user-friendly, easy-to-interpret results:

> hzTest(JFK)
  Henze-Zirkler's Multivariate Normality Test 
--------------------------------------------- 
  data : JFK 

  HZ      : 42.26252 
  p-value : 0 

  Result  : Data is not multivariate normal. 
--------------------------------------------- 

> roystonTest(JFK)
  Royston's Multivariate Normality Test 
--------------------------------------------- 
  data : JFK 

  H       : 264.1686 
  p-value : 4.330916e-58 

  Result  : Data is not multivariate normal. 
---------------------------------------------

A more visual method to test multivariate normality is to render QQ plots similar to those we used before. But, instead of comparing only one variable with the theoretical normal distribution, let's first compute the squared Mahalanobis distance of each observation from the centroid of the data, which, under multivariate normality, should follow a chi-square distribution with the degrees of freedom equal to the number of variables. The MVN package can automatically compute all the required values and render those with any of the preceding normality test R functions; just set the qqplot argument to TRUE:

> mvt <- roystonTest(JFK, qqplot = TRUE)
[Figure: Chi-square Q-Q plot of the squared Mahalanobis distances]
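
For reference, a very similar chi-square QQ plot can also be drawn by hand with base R functions only; the following minimal sketch shows what such a plot computes behind the scenes:

> ## squared Mahalanobis distances of the observations from the centroid
> d2 <- mahalanobis(JFK, colMeans(JFK), cov(JFK))
> ## compared against chi-square quantiles with df = number of variables
> qqplot(qchisq(ppoints(length(d2)), df = ncol(JFK)), d2,
+        xlab = 'Chi-square quantiles',
+        ylab = 'Squared Mahalanobis distance')
> abline(0, 1)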

If the dataset were normally distributed, the points shown in the preceding graph would closely follow the straight line. Other graphical methods can produce even more visual and user-friendly plots from the previously created mvt R object. The MVN package ships the mvnPlot function, which can render perspective and contour plots for two variables and thus provides a nice way to inspect bivariate normality:

> par(mfrow = c(1, 2))
> mvnPlot(mvt, type = "contour", default = TRUE)
> mvnPlot(mvt, type = "persp", default = TRUE)
[Figure: Contour and perspective plots of the TaxiIn and TaxiOut distribution]

On the right-hand plot, you can see the empirical distribution of the two variables rendered in perspective, where most cases can be found in the bottom-left corner. This means that most flights had relatively short TaxiIn and TaxiOut times, which suggests a right-skewed, heavy-tailed distribution rather than a normal one. The left-hand plot shows the same information from a bird's eye view: the contour lines represent horizontal cross-sections of the 3D graph on the right. A multivariate normal distribution looks much more symmetric, something like a two-dimensional bell curve:

> set.seed(42)
> mvt <- roystonTest(MASS::mvrnorm(100, mu = c(0, 0),
+          Sigma = matrix(c(10, 3, 3, 2), 2)))
> mvnPlot(mvt, type = "contour", default = TRUE)
> mvnPlot(mvt, type = "persp", default = TRUE)
[Figure: Contour and perspective plots of simulated bivariate normal data]

See Chapter 13, Data Around Us on how to create similar contour maps on spatial data.

Dependence of variables

Besides normality, relatively high correlation coefficients are desired when applying dimension reduction methods. The reason is that, if there is no statistical relationship between the variables, then PCA, for example, will return components that hardly differ from the original variables, so no real reduction is possible.
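
As a quick simulated illustration of this point (a sketch with independent random columns, not the hflights data), the principal components of uncorrelated variables each explain roughly the same proportion of variance, so none of them can be dropped without losing information:

> set.seed(42)
> indep <- as.data.frame(replicate(5, rnorm(1000)))
> ## each of the five components explains roughly 20% of the variance
> summary(prcomp(indep, scale. = TRUE))$importance['Proportion of Variance', ]

With correlated columns, by contrast, the first one or two components would capture most of the variance.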

To this end, let's see how the numerical variables of the hflights dataset are correlated (the output, being a large matrix, is suppressed this time):

> hflights_numeric <- hflights[, which(sapply(hflights, is.numeric))]
> cor(hflights_numeric, use = "pairwise.complete.obs")

In the preceding example, we have created a new R object to hold only the numeric columns of the original hflights data frame, leaving out five character vectors. Then, we run cor with pair-wise deletion of missing values, which returns a matrix with 16 columns and 16 rows:

> str(cor(hflights_numeric, use = "pairwise.complete.obs"))
 num [1:16, 1:16] NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "dimnames")=List of 2
 ..$ : chr [1:16] "Year" "Month" "DayofMonth" "DayOfWeek" ...
 ..$ : chr [1:16] "Year" "Month" "DayofMonth" "DayOfWeek" ...

The number of missing values in the resulting correlation matrix seems to be very high. This is because Year was 2011 in all cases, thus resulting in a standard deviation of zero. It's wise to exclude Year along with the non-numeric variables from the dataset, not only by filtering for numeric values, but also by checking the variance:

> hflights_numeric <- hflights[,which(
+     sapply(hflights, function(x)
+         is.numeric(x) && var(x, na.rm = TRUE) != 0))]

Now the number of missing values is a lot lower:

> table(is.na(cor(hflights_numeric, use = "pairwise.complete.obs")))
FALSE  TRUE 
  209    16

Can you guess why we still have some missing values here despite the pair-wise deletion of missing values? Well, running the preceding command results in a rather informative warning, but we will get back to this question later:

Warning message:
In cor(hflights_numeric, use = "pairwise.complete.obs") :
  the standard deviation is zero

Let's now proceed with analyzing the actual numbers in the 15x15 correlation matrix, which would be far too large to print in this book. This is why the result of the original cor command was suppressed previously; instead, let's visualize those 225 coefficients with the graphical capabilities of the ellipse package:

> library(ellipse)
> plotcorr(cor(hflights_numeric, use = "pairwise.complete.obs"))
[Figure: plotcorr visualization of the hflights correlation matrix]

Now we see the values of the correlation matrix represented by ellipses, where:

  • A perfect circle stands for a correlation coefficient of zero
  • The narrower the ellipse, the further the correlation coefficient is from zero
  • The orientation (slope) of the ellipse shows the sign of the coefficient: upward-sloping ellipses stand for positive, downward-sloping ones for negative correlations

To help you with analyzing the preceding results, let's render a similar plot with a few artificially generated numbers that are easier to interpret:

> plotcorr(cor(data.frame(
+     1:10,
+     1:10 + runif(10),
+     1:10 + runif(10) * 5,
+     runif(10),
+     10:1,
+     check.names = FALSE)))
[Figure: plotcorr visualization of the artificially generated data]

Similar plots on the correlation matrix can be created with the corrgram package.

But let's get back to the hflights dataset! In the previous diagram, some narrow ellipses are rendered for the time-related variables, indicating relatively high correlation coefficients, and even the Month variable seems to be slightly associated with the FlightNum variable:

> cor(hflights$FlightNum, hflights$Month)
[1] 0.2057641

On the other hand, the plot shows perfect circles in most cases, which stand for a correlation coefficient around zero. This suggests that most variables are not correlated at all, so computing the principal components of the original dataset would not be very helpful due to the low proportion of common variance.

KMO and Bartlett's test

We can verify this assumption about low communalities with a number of statistical tests; for example, SAS and SPSS users tend to use the KMO or Bartlett's test to see whether the data is suitable for PCA. Both algorithms are available in R as well, for example via the psych package:

> library(psych)
> KMO(cor(hflights_numeric, use = "pairwise.complete.obs"))
Error in solve.default(r) : 
 system is computationally singular: reciprocal condition number = 0
In addition: Warning message:
In cor(hflights_numeric, use = "pairwise.complete.obs") :
 the standard deviation is zero
matrix is not invertible, image not found
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(hflights_numeric, use = "pairwise.complete.obs"))
Overall MSA = NA
MSA for each item = 
            Month        DayofMonth         DayOfWeek 
              0.5               0.5               0.5 
          DepTime           ArrTime         FlightNum 
              0.5                NA               0.5 
ActualElapsedTime           AirTime          ArrDelay 
               NA                NA                NA 
         DepDelay          Distance            TaxiIn 
              0.5               0.5                NA 
          TaxiOut         Cancelled          Diverted 
              0.5                NA                NA

Unfortunately, the Overall MSA (Measure of Sampling Adequacy, representing the average correlation between the variables) is not available in the preceding output due to the previously identified missing values of the correlation matrix. Let's pick a pair of variables where the correlation coefficient was NA for further analysis! Such a pair can be easily identified from the previous plot, as no circle or ellipse was drawn for missing values; for example, for Cancelled and AirTime:

> cor(hflights_numeric[, c('Cancelled', 'AirTime')])
          Cancelled AirTime
Cancelled         1      NA
AirTime          NA       1

This can be explained by the fact that, if a flight is cancelled, the time spent in the air is simply not recorded:

> cancelled <- which(hflights_numeric$Cancelled == 1)
> table(hflights_numeric$AirTime[cancelled], exclude = NULL)
<NA> 
2973

So we get missing values when calling cor due to these NA values; and we also get NA even with pair-wise deletion, because after dropping the rows where AirTime is missing, only the non-cancelled flights remain in the dataset, resulting in zero variance for the Cancelled variable:

> table(hflights_numeric$Cancelled)
     0      1 
224523   2973

This suggests removing the Cancelled variable from the dataset before we run the previously discussed assumption tests, as the information stored in that variable is redundantly available in other columns of the dataset as well. Or, in other words, the Cancelled column can be derived from the other columns, so it can be left out of further analysis:

> hflights_numeric <- subset(hflights_numeric, select = -Cancelled)

And let's see if we still have any missing values in the correlation matrix:

> which(is.na(cor(hflights_numeric, use = "pairwise.complete.obs")),
+   arr.ind = TRUE)
                  row col
Diverted           14   7
Diverted           14   8
Diverted           14   9
ActualElapsedTime   7  14
AirTime             8  14
ArrDelay            9  14

It seems that the Diverted column is responsible for a similar situation, as the other three variables are not available when a flight was diverted. After another subset, we are now ready to call KMO on a correlation matrix free of missing values:

> hflights_numeric <- subset(hflights_numeric, select = -Diverted)
> KMO(cor(hflights_numeric[, -c(14)], use = "pairwise.complete.obs"))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(hflights_numeric[, -c(14)], use = "pairwise.complete.obs"))
Overall MSA =  0.36
MSA for each item = 
            Month        DayofMonth         DayOfWeek 
             0.42              0.37              0.35 
          DepTime           ArrTime         FlightNum 
             0.51              0.49              0.74 
ActualElapsedTime           AirTime          ArrDelay 
             0.40              0.40              0.39 
         DepDelay          Distance            TaxiIn 
             0.38              0.67              0.06 
          TaxiOut 
             0.06

The Overall MSA, or the so-called Kaiser-Meyer-Olkin (KMO) index, is a number between 0 and 1; this value suggests whether the partial correlations of the variables are small enough to continue with data reduction methods. A general rating system or rule of thumb for KMO, as suggested by Kaiser, can be found in the following table:

Value              Description
KMO < 0.5          Unacceptable
0.5 < KMO < 0.6    Miserable
0.6 < KMO < 0.7    Mediocre
0.7 < KMO < 0.8    Middling
0.8 < KMO < 0.9    Meritorious
KMO > 0.9          Marvelous

A KMO index below 0.5 is considered unacceptable, which basically means that the partial correlations computed from the correlation matrix suggest that the variables are not correlated enough for a meaningful dimension reduction or latent variable model.
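
For quick reference, Kaiser's verbal labels can also be encoded in a small helper function; the following is just a hypothetical convenience wrapper around cut, not part of the psych package:

> kmo_label <- function(msa)
+     cut(msa, breaks = c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1),
+         labels = c('Unacceptable', 'Miserable', 'Mediocre',
+                    'Middling', 'Meritorious', 'Marvelous'),
+         right = FALSE, include.lowest = TRUE)
> kmo_label(0.36)   # returns the 'Unacceptable' label for our Overall MSA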

Although leaving out some of the variables with the lowest MSA would improve the Overall MSA, and we could build some appropriate models in the following pages, we won't spend more time on data transformation for now; instead, for instructional purposes, we will use the mtcars dataset, which was introduced in Chapter 3, Filtering and Summarizing Data:

> KMO(mtcars)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = mtcars)
Overall MSA =  0.83
MSA for each item = 
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
0.93 0.90 0.76 0.84 0.95 0.74 0.74 0.91 0.88 0.85 0.62

It seems that the mtcars dataset is a great choice for multivariate statistical analysis. This can also be verified by the so-called Bartlett's test, which checks whether the correlation matrix is similar to an identity matrix; in other words, whether there is any statistical relationship between the variables. If the correlation matrix contained only zeros except for the diagonal, the variables would be independent of each other, and it would not make much sense to use multivariate methods. The psych package provides an easy-to-use function to compute Bartlett's test as well:

> cortest.bartlett(cor(mtcars))
$chisq
[1] 1454.985

$p.value
[1] 3.884209e-268

$df
[1] 55

The very low p-value suggests that we reject the null hypothesis of Bartlett's test. This means that the correlation matrix differs from the identity matrix, so the correlation coefficients between the variables seem to be closer to 1 than to 0. This is in line with the high KMO value.

Note

Before focusing on the actual statistical methods, please be advised that, although the preceding assumptions make sense in most cases and should be followed as a rule of thumb, KMO and Bartlett's tests are not always required. High communality is important for factor analysis and other latent variable models, while PCA, for example, is a mathematical transformation that will work even with low KMO values.
