Finding associations between continuous variables

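The first measure of association between two continuous variables is covariance. Here is the formula for the sample covariance, which is what the R cov() function computes:

$$\mathrm{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \mathrm{Mean}(X)) (Y_i - \mathrm{Mean}(Y))$$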

Covariance indicates how two variables, X and Y, are related to each other. When large values of both variables occur together, both deviations are positive (because Xi - Mean(X) > 0 and Yi - Mean(Y) > 0), and their product is therefore positive. Similarly, when small values occur together, both deviations are negative, and the product is positive as well. When one deviation is negative and the other is positive, the product is negative. This happens when a small value of X occurs with a large value of Y, or the other way around. If the positive products outweigh the negative ones in total, the covariance is positive; if the negative products outweigh the positive ones, it is negative. If the negative and positive products cancel each other out, the covariance is zero. When do they cancel each other out? The typical case is when the two variables are independent, although a zero covariance can also arise when the variables are related in a nonlinear way. So the covariance summarizes the linear relation between the variables:

  • If the covariance is positive, large values of one variable tend to occur with large values of the other
  • If the covariance is negative, large values of one variable tend to occur with small values of the other
  • If the covariance is zero, there is no linear association between the variables; independent variables always have zero covariance, but a zero covariance alone does not prove independence
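The reasoning above can be checked on a small example. The following is a minimal sketch that uses hypothetical toy vectors (x, y, w, and z are made up for illustration and are not part of the TM data frame). It sums the products of deviations by hand and compares the result with cov(). It also includes a pair where one variable is fully determined by the other, yet the covariance is zero, a reminder that covariance captures only linear association.

```r
# Hypothetical toy data, not taken from the TM data frame
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Product of the deviations from the mean for each pair of values
products <- (x - mean(x)) * (y - mean(y))

# Sample covariance: sum of the products divided by n - 1
manual_cov <- sum(products) / (length(x) - 1)
builtin_cov <- cov(x, y)

# A dependent pair with zero covariance: z is fully determined by w,
# but the positive and negative products cancel each other out
w <- c(-2, -1, 0, 1, 2)
z <- w^2
zero_cov <- cov(w, z)

print(c(manual = manual_cov, builtin = builtin_cov, zero = zero_cov))
```

The manual calculation and cov() should agree, because cov() divides by n - 1 as well.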

In order to compare the strength of association between two different pairs of variables, a relative measure is better than an absolute one. This is Pearson's correlation coefficient, which divides the covariance by the product of the standard deviations of both variables:

$$\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X) \, \mathrm{SD}(Y)}$$

The correlation coefficient is a useful measure of the relationship between two variables because it is always bounded: -1 <= correlation coefficient <= 1. If the variables are independent, the correlation is zero, because the covariance is zero. The correlation is 1 if the variables have a perfect positive linear relation (if you correlate a variable with itself, for example). Similarly, the correlation is -1 for a perfect negative linear relation. The larger the absolute value of the coefficient, the stronger the linear relation between the variables. Whether a coefficient is statistically significant, however, also depends on the size of the sample.

The following code creates a data frame that is a subset of the TM data frame used so far. The new data frame includes only continuous variables. Then the code calculates the covariance and the correlation coefficient between all possible pairs of variables:

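# Keep only the variables treated as continuous
x <- TM[,c("YearlyIncome", "Age", "NumberCarsOwned")];
# Covariance matrix for all pairs of variables
cov(x);
# Pearson correlation matrix for all pairs of variables
cor(x);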

Here are the correlation coefficients shown in a correlation matrix:

                    YearlyIncome       Age NumberCarsOwned
    YearlyIncome       1.0000000 0.1446627       0.4666472
    Age                0.1446627 1.0000000       0.1836616
    NumberCarsOwned    0.4666472 0.1836616       1.0000000

You can see that income and the number of cars owned are more strongly correlated, with a coefficient of about 0.47, than the other pairs of variables.
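The definition of Pearson's coefficient and the remark about sample size can also be tried out on a small example. This is a minimal sketch on hypothetical toy vectors (not the TM data). It rebuilds the coefficient from cov() and sd(), and then uses cor.test() from the stats package to show that the same coefficient yields a smaller p-value on a larger sample. Repeating the data with rep() is purely an illustrative device for enlarging the sample without changing the coefficient; real observations would not be exact copies.

```r
# Hypothetical toy data, not taken from the TM data frame
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's coefficient: covariance divided by the product
# of the standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))
builtin_cor <- cor(x, y)

# Same coefficient, different sample sizes: repeating the data four times
# leaves the estimate unchanged but shrinks the p-value
small_test <- cor.test(x, y)
large_test <- cor.test(rep(x, 4), rep(y, 4))

print(c(manual = manual_cor, builtin = builtin_cor))
print(c(p_small = small_test$p.value, p_large = large_test$p.value))
```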

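Pearson's coefficient is not well suited to ordinal variables, because it treats the distances between the values as meaningful. For such variables, you can calculate Spearman's rank correlation coefficient instead. The following code calculates both Pearson's and Spearman's coefficients on ordinal variables, so that you can compare them: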

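# Select the ordinal variables of the TM data frame
y <- TM[,c("TotalChildren", "NumberChildrenAtHome", "HouseOwnerFlag", "BikeBuyer")];
# Pearson's coefficients, for comparison
cor(y);
# Spearman's coefficients, based on ranks
cor(y, method = "spearman");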
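To see what the spearman method does, it may help to compute it in two ways. The following minimal sketch uses hypothetical toy vectors (children and at_home are made-up names and values, not columns of TM). It shows that Spearman's coefficient equals Pearson's coefficient applied to the ranks of the values, with tied values receiving average ranks.

```r
# Hypothetical ordinal toy data with ties, not taken from the TM data frame
children <- c(0, 1, 1, 2, 3, 5)
at_home  <- c(0, 0, 1, 1, 2, 2)

# Spearman's coefficient as computed by cor()
spearman <- cor(children, at_home, method = "spearman")

# The same value: Pearson's coefficient applied to the ranks;
# rank() assigns average ranks to ties by default
pearson_on_ranks <- cor(rank(children), rank(at_home))

print(c(spearman = spearman, pearson_on_ranks = pearson_on_ranks))
```

Because only the ranks matter, the result does not change under any strictly increasing transformation of the values, which is why the measure suits ordinal data.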

Finally, you can also visualize a correlation matrix. A nice visualization is provided by the corrgram() function in the corrgram package, as the following code shows:

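# Install the package once, then load it
install.packages("corrgram");
library(corrgram);
# Shaded panels below and above the diagonal, variable names on the diagonal;
# order = TRUE reorders the variables so that similar ones are placed together
corrgram(y, order = TRUE, lower.panel = panel.shade,
         upper.panel = panel.shade, text.panel = panel.txt,
         cor.method = "spearman", main = "Corrgram");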

The following figure shows Spearman's correlation coefficients between the pairs of the four ordinal variables:

Correlation matrix visualization

The darker the shading in the preceding figure, the stronger the association; a right-oriented texture means a positive association, and a left-oriented texture means a negative association.
