Pearson correlation coefficient

Sometimes, we want to measure the degree of (linear) dependence between two variables. The most common measure of the linear correlation between two variables is the Pearson correlation coefficient, often identified with a lowercase r. When r = 1, we have a perfect positive linear correlation; that is, an increase of one variable predicts an increase of the other. When r = -1, we have a perfect negative linear correlation, and an increase of one variable predicts a decrease of the other. When r = 0, we have no linear correlation. In practice, we will generally get intermediate values. It is important to keep two aspects in mind: the Pearson correlation coefficient says nothing about non-linear correlations, and we should not confuse r with the slope of a regression. The following image from Wikipedia is a good one to have at hand:

https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg

Part of the confusion between r and the slope of a regression may be explained by the following relationship:

β = r (σ_y / σ_x)

That is, the slope (β) and the Pearson correlation coefficient (r) have the same value only when the standard deviations of x and y are equal, σ_x = σ_y. Notice that this is true, for example, when we standardize the data. To further clarify:

  • The Pearson correlation coefficient (r) is a measure of the degree of linear correlation between two variables and is always restricted to the interval [-1, 1]. It is insensitive to the scale of the data.
  • The slope of a linear regression (β) indicates how much y changes per unit change of x, and it can take any real value (see the numerical check after this list).
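
To make this relationship concrete, here is a minimal sketch using NumPy (the simulated data and variable names are ours, just for illustration): we draw correlated data, compute r, recover the least-squares slope via β = r (σ_y / σ_x), and verify that after standardizing both variables the slope equals r.

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.5 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
beta = r * y.std() / x.std()            # least-squares slope via β = r (σ_y / σ_x)

xs = (x - x.mean()) / x.std()           # standardize both variables
ys = (y - y.mean()) / y.std()
beta_std = np.polyfit(xs, ys, deg=1)[0] # the slope now equals r
print(f"r = {r:.3f}, beta = {beta:.3f}, standardized slope = {beta_std:.3f}")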

The Pearson coefficient is related to a quantity known as the coefficient of determination, and, for a linear regression model, this is just the square of the Pearson coefficient, that is, r² (or sometimes R²). This is pronounced r squared and can be defined as the variance of the predicted values divided by the variance of the data. Therefore, it can be interpreted as the proportion of the variance in the dependent variable that is predicted from the independent variable. For Bayesian linear regression, the variance of the predicted values can be larger than the variance of the data, and this would lead to an R² larger than 1; a good solution is to define R² as follows:

R² = V(ŷ) / (V(ŷ) + V(e)),  where e = y − ŷ

In the above equation, ŷ is the expected (or mean) value of y over the posterior samples, and V denotes the variance.

This is the variance of the predicted values divided by the variance of the predicted values plus the variance of the errors (or residuals). This definition has the advantage of ensuring that R² is restricted to the interval [0, 1].
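
As a sketch of this definition (ours, not the actual ArviZ implementation), computing R² per posterior sample and then summarizing might look like this:

import numpy as np

def bayesian_r2(y, y_pred):
    # y: observed data, shape (n_obs,)
    # y_pred: posterior predictive expected values, shape (n_samples, n_obs)
    var_pred = y_pred.var(axis=1)         # variance of predictions, per posterior sample
    var_err = (y - y_pred).var(axis=1)    # variance of residuals, per posterior sample
    r2 = var_pred / (var_pred + var_err)  # guaranteed to lie in [0, 1]
    return r2.mean(), r2.std()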

The easiest way to compute R² is to use the r2_score() function from ArviZ. We need the observed values of y and the predicted values of y; remember that we can get the latter from sample_posterior_predictive():

az.r2_score(y, ppc['y_pred'])

By default, this function will return the mean R² (0.8, for this example) and its standard deviation (0.03).
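
For a self-contained illustration, the snippet below fabricates posterior predictive draws around a known line (a stand-in for what sample_posterior_predictive() would return; the data and names are ours) and passes them to r2_score():

import arviz as az
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y_obs = 1.0 + 2.0 * x + rng.normal(scale=0.25, size=100)
# stand-in for posterior predictive draws, shape (n_samples, n_obs)
y_pred = 1.0 + 2.0 * x + rng.normal(scale=0.25, size=(1000, 100))
print(az.r2_score(y_obs, y_pred))  # reports r2 and r2_std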
