Pearson coefficient from a multivariate Gaussian

Another way to compute the Pearson coefficient is by estimating the covariance matrix of a multivariate Gaussian distribution. A multivariate Gaussian distribution is the generalization of the Gaussian distribution to more than one dimension. Let's focus on the case of two dimensions, because that is what we are going to use right now; generalizing to higher dimensions is almost trivial once we understand the bivariate case. To fully describe a bivariate Gaussian distribution, we need two means (or a vector with two elements), one for each marginal Gaussian. We also need two standard deviations, right? Well, not exactly; we need a 2 x 2 covariance matrix, which looks like this:

$$\Sigma = \begin{pmatrix} \sigma_{x_1}^2 & \rho \sigma_{x_1} \sigma_{x_2} \\ \rho \sigma_{x_1} \sigma_{x_2} & \sigma_{x_2}^2 \end{pmatrix}$$

Here, $\Sigma$ is the Greek capital sigma letter, and it is common practice to use it to represent the covariance matrix. On the main diagonal, we have the variances of each variable, expressed as the square of their standard deviations, $\sigma_{x_1}$ and $\sigma_{x_2}$. The off-diagonal elements are the covariances between the variables, which are expressed in terms of the individual standard deviations and $\rho$, the Pearson correlation coefficient between the variables. Notice that we have a single $\rho$ because we have only two dimensions (or variables); for three variables, we would have three Pearson coefficients.
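Before moving on, it can help to see this relationship in code. The following is a minimal sketch (with arbitrarily chosen values for the standard deviations and $\rho$) showing that the Pearson coefficient can be recovered from a covariance matrix by dividing the off-diagonal term by the product of the standard deviations:

import numpy as np

sigma_1, sigma_2, rho = 1., 2., 0.5
# build Σ from the standard deviations and the Pearson coefficient
cov = np.array([[sigma_1**2, sigma_1 * sigma_2 * rho],
                [sigma_1 * sigma_2 * rho, sigma_2**2]])
# recover ρ: off-diagonal term over the product of the standard deviations
rho_back = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(rho_back)  # 0.5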

The following code generates contour plots for bivariate Gaussian distributions, all with the mean vector fixed at (0, 0). One of the standard deviations is fixed at $\sigma_{x_1} = 1$, while $\sigma_{x_2}$ takes the values 1 or 2, and the Pearson correlation coefficient $\rho$ takes different values:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

sigma_x1 = 1
sigmas_x2 = [1, 2]
rhos = [-0.90, -0.5, 0, 0.5, 0.90]

# grid of points where the pdf will be evaluated
k, l = np.mgrid[-5:5:.1, -5:5:.1]
pos = np.empty(k.shape + (2,))
pos[:, :, 0] = k
pos[:, :, 1] = l

f, ax = plt.subplots(len(sigmas_x2), len(rhos),
                     sharex=True, sharey=True, figsize=(12, 6),
                     constrained_layout=True)
for i in range(2):
    for j in range(5):
        sigma_x2 = sigmas_x2[i]
        rho = rhos[j]
        cov = [[sigma_x1**2, sigma_x1*sigma_x2*rho],
               [sigma_x1*sigma_x2*rho, sigma_x2**2]]
        rv = stats.multivariate_normal([0, 0], cov)
        ax[i, j].contour(k, l, rv.pdf(pos))
        ax[i, j].set_xlim(-8, 8)
        ax[i, j].set_ylim(-8, 8)
        ax[i, j].set_yticks([-5, 0, 5])
        # invisible point, used only to show the parameters in the legend
        ax[i, j].plot(0, 0,
                      label=rf'$\sigma_{{x2}}$ = {sigma_x2:3.2f} $\rho$ = {rho:3.2f}',
                      alpha=0)
        ax[i, j].legend()
f.text(0.5, -0.05, 'x_1', ha='center', fontsize=18)
f.text(-0.05, 0.5, 'x_2', va='center', fontsize=18, rotation=0)

Figure 3.8

Now that we know the multivariate Gaussian distribution, we can use it to estimate the Pearson correlation coefficient. Since we do not know the values of the covariance matrix, we have to put priors over it. One solution is to use the Wishart distribution, which is the conjugate prior of the inverse covariance matrix (the precision matrix) of a multivariate normal. The Wishart distribution can be considered as the generalization to higher dimensions of the gamma distribution we saw earlier, or also as the generalization of the $\chi^2$ distribution. A second option is to use the LKJ prior (see https://docs.pymc.io/notebooks/LKJ.html for details). This is a prior for the correlation matrix (and not the covariance matrix), which may be convenient, given that it is generally more useful to think in terms of correlations; a sketch of this alternative appears after the following model. We are going to explore a third option: putting priors directly on $\sigma_1$, $\sigma_2$, and $\rho$, and then using those values to manually build the covariance matrix:

import pymc3 as pm

data = np.stack((x, y)).T  # x and y are the observed data from the previous example

with pm.Model() as pearson_model:

    # priors on the means of the two marginal Gaussians
    μ = pm.Normal('μ', mu=data.mean(0), sd=10, shape=2)

    # priors on the standard deviations and the Pearson coefficient
    σ_1 = pm.HalfNormal('σ_1', 10)
    σ_2 = pm.HalfNormal('σ_2', 10)
    ρ = pm.Uniform('ρ', -1., 1.)
    r2 = pm.Deterministic('r2', ρ**2)

    # manually assemble the covariance matrix from σ_1, σ_2, and ρ
    cov = pm.math.stack(([σ_1**2, σ_1*σ_2*ρ],
                         [σ_1*σ_2*ρ, σ_2**2]))

    y_pred = pm.MvNormal('y_pred', mu=μ, cov=cov, observed=data)

    trace_p = pm.sample(1000)
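For completeness, here is a sketch of how the LKJ alternative mentioned earlier could look. This is not the model we use in this chapter; it assumes PyMC3's pm.LKJCholeskyCov API, and eta=2. is an arbitrary choice that mildly favors low correlations:

with pm.Model() as lkj_model:
    μ = pm.Normal('μ', mu=data.mean(0), sd=10, shape=2)
    # packed lower-triangular Cholesky factor of the covariance matrix
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2.,
                                 sd_dist=pm.HalfNormal.dist(10.))
    L = pm.expand_packed_triangular(2, packed_L)
    Σ = pm.Deterministic('Σ', L.dot(L.T))
    # the Pearson coefficient is the rescaled off-diagonal term of Σ
    ρ = pm.Deterministic('ρ', Σ[0, 1] /
                         (pm.math.sqrt(Σ[0, 0]) * pm.math.sqrt(Σ[1, 1])))
    y_pred = pm.MvNormal('y_pred', mu=μ, chol=L, observed=data)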

In the trace plot, we are going to omit all variables except r2:

import arviz as az

az.plot_trace(trace_p, var_names=['r2'])
Figure 3.9

We can see that the distribution of r2 is centered around the value we got in the previous example using the r2_score function from ArviZ. Perhaps an easier way to compare the value obtained from the multivariate Gaussian with the previous result is to use az.summary. As you can see, we get a pretty good match:

az.summary(trace_p, var_names=['r2'])

        mean    sd      mc error    hpd 3%    hpd 97%    eff_n    r_hat
r2      0.79    0.04    0.0         0.72      0.86       839.0    1.0
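As a final sanity check (a quick sketch, assuming x and y are still the arrays used to build data), the classical point estimate computed with NumPy should fall inside the reported HPD interval:

r = np.corrcoef(x, y)[0, 1]  # classical Pearson coefficient estimate
print(r**2)  # should lie within the reported hpd 3%-97% interval for r2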
