Pearson correlation coefficient

Sometimes, we want to measure the degree of (linear) dependence between two variables. The most common measure of the linear correlation between two variables is the Pearson correlation coefficient, often identified with a lowercase r. When r = 1, we have a perfect positive linear correlation; that is, an increase of one variable predicts an increase of the other. When r = -1, we have a perfect negative linear correlation, and an increase of one variable predicts a decrease of the other. When r = 0, we have no linear correlation. In practice, we will generally get intermediate values. It is important to keep two aspects in mind: the Pearson correlation coefficient says nothing about non-linear correlations, and we should not confuse r with the slope of a regression. The following image from Wikipedia is a good one to have at hand:

https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg

Part of the confusion between r and the slope of a regression may be explained by the following relationship:

β = r (σ_y / σ_x)

That is, the slope (β) and the Pearson correlation coefficient (r) have the same value only when the standard deviations of x and y are equal, σ_x = σ_y. Notice that this is true, for example, when we standardize the data. To further clarify:

  • The Pearson correlation coefficient (r) is a measure of the degree of linear correlation between two variables and is always restricted to the interval [-1, 1]. It is insensitive to the scale of the data.
  • The slope of a linear regression (β) indicates how much y changes per unit change of x, and it can take any real value (see the numerical check after this list).
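
To make this relationship concrete, here is a minimal sketch using NumPy (the simulated data and variable names are ours, just for illustration): we draw correlated data, compute r, recover the least-squares slope via β = r (σ_y / σ_x), and verify that after standardizing both variables the slope equals r.

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.5 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
beta = r * y.std() / x.std()            # least-squares slope via β = r (σ_y / σ_x)

xs = (x - x.mean()) / x.std()           # standardize both variables
ys = (y - y.mean()) / y.std()
beta_std = np.polyfit(xs, ys, deg=1)[0] # the slope now equals r
print(f"r = {r:.3f}, beta = {beta:.3f}, standardized slope = {beta_std:.3f}")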

The Pearson coefficient is related to a quantity known as the coefficient of determination, and, for a linear regression model, this is just the square of the Pearson coefficient, that is, r² (or sometimes R²). This is pronounced r squared and can be defined as the variance of the predicted values divided by the variance of the data. Therefore, it can be interpreted as the proportion of the variance in the dependent variable that is predicted from the independent variable. For Bayesian linear regression, the variance of the predicted values can be larger than the variance of the data, and this would lead to an R² larger than 1; a good solution is to define R² as follows:

R² = V(ŷ) / (V(ŷ) + V(e)),  where e = y − ŷ

In the above equation, ŷ is the expected (or mean) value of y over the posterior samples, and V denotes the variance.

This is the variance of the predicted values divided by the variance of the predicted values plus the variance of the errors (or residuals). This definition has the advantage of ensuring that R² is restricted to the interval [0, 1].
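
As a sketch of this definition (ours, not the actual ArviZ implementation), computing R² per posterior sample and then summarizing might look like this:

import numpy as np

def bayesian_r2(y, y_pred):
    # y: observed data, shape (n_obs,)
    # y_pred: posterior predictive expected values, shape (n_samples, n_obs)
    var_pred = y_pred.var(axis=1)         # variance of predictions, per posterior sample
    var_err = (y - y_pred).var(axis=1)    # variance of residuals, per posterior sample
    r2 = var_pred / (var_pred + var_err)  # guaranteed to lie in [0, 1]
    return r2.mean(), r2.std()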

The easiest way to compute R² is to use the r2_score() function from ArviZ. We need the observed values of y and the predicted values of y; remember that we can get the latter from sample_posterior_predictive():

az.r2_score(y, ppc['y_pred'])

By default, this function will return the mean R² (0.8, for this example) and its standard deviation (0.03).
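
For a self-contained illustration, the snippet below fabricates posterior predictive draws around a known line (a stand-in for what sample_posterior_predictive() would return; the data and names are ours) and passes them to r2_score():

import arviz as az
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y_obs = 1.0 + 2.0 * x + rng.normal(scale=0.25, size=100)
# stand-in for posterior predictive draws, shape (n_samples, n_obs)
y_pred = 1.0 + 2.0 * x + rng.normal(scale=0.25, size=(1000, 100))
print(az.r2_score(y_obs, y_pred))  # reports r2 and r2_std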
