Covariance and correlation

Before going into the topic of this section in depth, let me remind the reader of three mathematical notions that will be used in this chapter: the arithmetic mean, the variance, and the standard deviation. Some have already been discussed in other chapters, but a more formal definition will be useful here.

The arithmetic mean is a measure of central tendency. Considering a sample of observations of an attribute (for instance, the height of individuals), the arithmetic mean is simply the sum of the values of the observations divided by the number of observations. As an example, let's compute the mean height of three individuals measuring 160 cm, 170 cm, and 180 cm.

The formula for the mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Type the following in the R console to compute the arithmetic mean of this sample:

(160 + 170 + 180) / 3

R outputs the following:

[1] 170

Check the solution by typing this:

mean(c(160,170,180))

Our computation of the mean was correct—R outputs:

[1] 170

Variance is a measure of the dispersion of the data, that is, of how different the values are in a sample or population. Considering a sample of observations of an attribute, the variance is computed as the sum of the squared deviations of the observations from the mean (the sum of squares) divided by the number of observations minus 1 (the degrees of freedom). The formula for the variance is:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Type the following to obtain the variance of heights of the three individuals:

Variance = ( (160-170)^2 + (170-170)^2 + (180-170)^2 ) / (3-1)
Variance

The output is as follows:

[1] 100

Now type the following to check our solution:

var(c(160,170,180))

The output is 100 again.

Standard deviation is another measure of dispersion. Unlike variance, it is expressed in the same unit as the data. The formula for standard deviation is as follows:

$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

In other words, considering a sample of observations of an attribute, the standard deviation is the square root of the variance. Type the following to obtain the standard deviation of height in the three individuals presented previously:

sqrt(Variance)

The output is 10. Now type the following:

sd(c(160,170,180))

Our computation of the standard deviation is correct; the output is 10 as well.

Let's now proceed to the main topics of this section.

Covariance and correlation are measures of how much two attributes are related—that is, how much they change together. For instance, one can easily figure out that the weight of individuals is related to their height (positive relation) more than to the length of their hair. The weight of individuals is most probably not at all related to the length of the last movie they have seen.

Covariance

The covariance of two normally distributed numeric attributes (data consisting of quantities, or that can be treated as quantities) is computed as the sum of the products of the mean-subtracted observations of the two attributes, divided by the number of observations in the sample minus 1. The formula for the covariance is as follows:

$cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$

Imagine our three individuals (ordered by increasing height) weigh 55 kg, 70 kg, and 85 kg. The arithmetic mean of the weights is therefore 70 kg. The covariance of the two measures (height and weight) can be computed as follows:

Covariance = ((160-170) * (55-70) + (170-170) * (70-70) + (180-170) * (85-70)) / (3-1)
Covariance

The output is 150. Let's check it using the built-in function:

heights=c(160,170,180)
weights=c(55,70,85)
cov(heights,weights)

The output is 150 again! Our solution is correct.

The problem with covariance is that it is not a standardized measure of association—that is, the value of the measure depends upon the unit in which the attributes are measured. In our example, the heights were previously measured in centimeters. Let's try it with the same values converted to meters. For this purpose, we will divide the height measures in centimeters by 100:

cov(heights/100, weights)

The result is 1.5.

Measuring the height in inches and the weight in pounds would have led to yet another covariance value. The covariance tells us the direction of the relationship between two attributes, but not the magnitude of the relationship. This is because, as mentioned previously, the covariance is not a standardized measure of association between two attributes. The correlation does not present such a drawback.
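As a quick sketch of this unit dependence (reusing the heights and weights from above, with the standard conversion factors of 2.54 cm per inch and about 2.20462 pounds per kilogram), rescaling either attribute rescales the covariance accordingly:

```r
heights = c(160, 170, 180)   # centimeters
weights = c(55, 70, 85)      # kilograms
cov(heights, weights)                    # 150 (centimeter-kilogram units)
# Same data expressed in inches and pounds: the covariance changes
cov(heights / 2.54, weights * 2.20462)   # about 130.2
```

The relationship between the attributes is unchanged; only the units of the covariance differ.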

Correlation

The correlation is a standardized measure of association between attributes. We have already mentioned the correlation a few times in the previous chapters, but let's examine it in more detail.

Pearson's correlation

Pearson's correlation indicates the strength of the association between two normally distributed numeric attributes.

Before we continue on the topic, let's examine how the measure is computed. There are multiple ways to compute the correlation. The easiest to remember (considering we already know how to compute the covariance) is to simply divide the covariance by the product of the standard deviations of both attributes. Let's try again, using our previous example (heights measured in centimeters). To obtain the correlation of height and weight, we simply type:

Covariance / (sd(heights) * sd(weights))

The output is 1. Let's check that we computed the correlation correctly, with the following line of code:

cor(heights,weights)

We were right! The output is 1 again. So what does this value mean? A correlation can take any value between -1 and 1. A value of -1 means a complete and negative correspondence of the changes in the values of the two attributes. A correlation of 1 means a complete and positive correspondence. A value of 0 means that there is no linear association between the two attributes (note that a correlation of 0 does not necessarily imply that the attributes are independent). The correlation allowed us to examine the strength of the correspondence of changes in height and weight in our example: a perfect association between the two attributes.
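To illustrate a negative correlation with a minimal (and hypothetical) variation on our example: if the weights decreased as the heights increased, the correlation would be -1:

```r
heights = c(160, 170, 180)
falling_weights = c(85, 70, 55)   # hypothetical weights decreasing with height
cor(heights, falling_weights)     # -1: a perfect negative correlation
```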

It is worth mentioning that Pearson's correlation only assesses linear relationships between the attributes. Let's have a look at what is meant here using the classic example of Anscombe's quartet. The dataset is part of the datasets package and is, therefore, directly available to us:

data(anscombe)

The dataset is composed of eight attributes: x1, x2, x3, x4, y1, y2, y3, and y4. We want to know the correlation between each of the x and y attributes that share their numbers (x1 and y1; x2 and y2; and so on). This is achieved using the following code:

c1=cor(anscombe$x1, anscombe$y1)
c2=cor(anscombe$x2, anscombe$y2)
c3=cor(anscombe$x3, anscombe$y3)
c4=cor(anscombe$x4, anscombe$y4)
c1; c2; c3; c4

These four correlations all have a value of approximately 0.816. Does this mean that the relationship between each of the x and y attributes is the same? You might already suspect that this is not the case at all. In Chapter 2, Visualizing and Manipulating Data Using R, and Chapter 3, Data Visualization with Lattice, we have already looked at scatterplots and discovered that they allow us to visualize the relationship between two attributes. Let's examine the relationships that we are interested in here. This demonstration is inspired by the example in the documentation (type ?anscombe):

par(mfcol=c(1,4))
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)


Figure 8.4: Scatterplots of four relationships yielding the same correlation (Anscombe's quartet)

From the preceding diagram, we can see that the relationship between x1 and y1 is positive and linear (y1 increases as x1 increases), but a bit noisy. The relationship between x2 and y2 is curvilinear: y2 increases as x2 increases up to an x2 value of 11, and decreases from an x2 value of 12 onward. The relationship between x3 and y3 is linear but not as strong as in the first example. The correlation is so high because of a bivariate outlier at x3 = 13.

Finally, we can see that x4 is constant (with values equal to 8) and that, here again, a bivariate outlier is responsible for the strong association between x4 and y4. Note that, without this observation, the correlation could not be computed as there would be no variance in x4. In this extreme case, looking at the plot would already have discouraged any further analysis of the relationship between x4 and y4!
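We can check this claim with a short sketch: removing the single outlying observation leaves x4 constant, and cor() then returns NA (along with a warning that the standard deviation is zero):

```r
data(anscombe)
outlier = which(anscombe$x4 != 8)   # the one observation where x4 differs from 8
# Without the outlier, x4 has no variance, so the correlation is undefined
cor(anscombe$x4[-outlier], anscombe$y4[-outlier])   # NA, with a warning
```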

It is quite tempting to simply take note of the correlation between attributes, especially when they show interesting linear patterns in the data or confirm our hypotheses. We should refrain from drawing conclusions from a mere look at the correlations, and always visualize the data first. It is also necessary to always examine the significance of the correlations before drawing any conclusions from them. Therefore, I suggest using the cor.test() function instead of the cor() function, as it performs a significance test and reports the 95 percent confidence interval. In the case of the relationship between x1 and y1, check the following code:

with(anscombe, cor.test(x1, y1))

The output follows and shows that the correlation is significantly different from 0 (p-value = 0.00217). Relying on the 95 percent confidence interval, the true value of the correlation lies between its lower and upper bounds, which are respectively 0.42 and 0.95. Note that we can tell whether the correlation is significant by looking at the confidence interval: if it doesn't include 0, the correlation is significant at the given threshold (here, 95 percent). The sample estimate of the correlation is 0.8164205, which we can round to 0.816.

The output is as follows:

        Pearson's product-moment correlation

data:  x1 and y1
t = 4.2415, df = 9, p-value = 0.00217
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4243912 0.9506933
sample estimates:
      cor 
0.8164205

We obtain the proportion of variance shared by two attributes by squaring the correlation. In the present case, the two attributes share 0.8164205^2 * 100 = 66.65424% of their variance.
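This computation can be reproduced directly in R:

```r
data(anscombe)
r = with(anscombe, cor(x1, y1))   # 0.8164205
r^2 * 100                         # shared variance, about 66.65 percent
```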

For the interested reader, assessing the significance of the correlation requires the computation of the corresponding t value, which is obtained by dividing the correlation by the square root of 1 minus the squared correlation divided by the degrees of freedom (the sample size minus 2). The significance is then obtained from Student's t distribution.
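As a sketch, we can reproduce the t value and p-value that cor.test() reported for x1 and y1:

```r
data(anscombe)
r  = with(anscombe, cor(x1, y1))
df = nrow(anscombe) - 2              # degrees of freedom: 11 - 2 = 9
t_value = r / sqrt((1 - r^2) / df)   # 4.2415, as reported by cor.test()
p_value = 2 * pt(-abs(t_value), df)  # two-tailed p-value, 0.00217
```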

Spearman's correlation

When the attributes are not normally distributed, Spearman's correlation should be used. This correlation coefficient first ranks the observations of both attributes included in the analysis. It then computes the differences between the ranks of each observation on the two attributes. Finally, it obtains the correlation coefficient by subtracting the following quantity from 1: 6 times the sum of the squared observation-wise rank differences, divided by n(n^2 - 1), where n is the number of observations.

Let's examine this in an example using the following attributes named A and B:

A = c(3,4,2,6,7)
B = c(4,3,1,6,5)

We first compute the ranks of observations of A and B:

RankA = rank(A);  RankB = rank(B)

The Spearman correlation can be computed like this:

1 - ( (6 * sum((RankA-RankB)^2)) / (5* (5^2 -1)) )

The output is 0.8. Let's check if our answer is right:

cor(A,B,method = "spearman")

We did it correctly; the output using the R function is 0.8 as well.
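Incidentally, since these attributes contain no ties, Spearman's correlation is simply Pearson's correlation computed on the ranks, which we can verify with this small sketch:

```r
A = c(3, 4, 2, 6, 7)
B = c(4, 3, 1, 6, 5)
# Pearson's correlation on the ranks equals Spearman's correlation here
cor(rank(A), rank(B))   # 0.8
```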

As mentioned earlier, it is necessary to know the significance of the correlation, which we can obtain using the following code:

cor.test(A,B, method = "spearman")

The output follows and can be interpreted like the output of the test for a Pearson's correlation (note, however, that no confidence interval is provided here):

        Spearman's rank correlation rho
data:  A and B
S = 4, p-value = 0.1333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho 
0.8 

In this case, the correlation, although high, is not significantly different from 0, as the p-value fails to reach the threshold of 0.05 (p-value = 0.1333).
