Regression with spatial autocorrelation

The following example is taken from the book Statistical Rethinking by Richard McElreath. The author kindly allowed me to reuse his example here. I strongly recommend reading his book, as you will find many good examples like this and very good explanations. The only caveat is that the book examples are in R/Stan, but don't worry and keep sampling; you will find the Python/PyMC3 versions of those examples at https://github.com/pymc-devs/resources.

Well, going back to the example, we have 10 different island-societies; for each one of them, we have the number of tools they use. Some theories predict that larger populations develop and sustain more tools than smaller populations. Another important factor is the contact rates among populations.

Since the number of tools is the dependent variable, we can use a Poisson regression with the population as an independent variable. In fact, we can use the logarithm of the population because what really matters (according to theory) is the order of magnitude of the populations and not the absolute size. One way to include the contact rate in our model is to gather information about how frequently these societies were in contact throughout history and create a categorical variable such as low/high rate (see the contact column in the islands DataFrame). Yet another way is to use the distance between societies as a proxy for the contact rate, since it is reasonable to assume that closer societies come into contact more often than distant ones.

We can access a distance matrix, with the values expressed in thousands of kilometers, by reading the islands_dist.csv file that comes with this book:

islands_dist = pd.read_csv('../data/islands_dist.csv',
                           sep=',', index_col=0)
islands_dist.round(1)

	Ml	Ti	SC	Ya	Fi	Tr	Ch	Mn	To	Ha
Malekula	0.0	0.5	0.6	4.4	1.2	2.0	3.2	2.8	1.9	5.7
Tikopia	0.5	0.0	0.3	4.2	1.2	2.0	2.9	2.7	2.0	5.3
Santa Cruz	0.6	0.3	0.0	3.9	1.6	1.7	2.6	2.4	2.3	5.4
Yap	4.4	4.2	3.9	0.0	5.4	2.5	1.6	1.6	6.1	7.2
Lau Fiji	1.2	1.2	1.6	5.4	0.0	3.2	4.0	3.9	0.8	4.9
Trobriand	2.0	2.0	1.7	2.5	3.2	0.0	1.8	0.8	3.9	6.7
Chuuk	3.2	2.9	2.6	1.6	4.0	1.8	0.0	1.2	4.8	5.8
Manus	2.8	2.7	2.4	1.6	3.9	0.8	1.2	0.0	4.6	6.7
Tonga	1.9	2.0	2.3	6.1	0.8	3.9	4.8	4.6	0.0	5.0
Hawaii	5.7	5.3	5.4	7.2	4.9	6.7	5.8	6.7	5.0	0.0

As you can see, the main diagonal is filled with zeros: each island-society is zero kilometers away from itself. The matrix is also symmetrical; the upper triangle and the lower triangle contain the same information. This is a direct consequence of the fact that the distance from point A to point B is the same as the distance from B to A.
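Both properties are easy to verify numerically. The following is a minimal sketch that uses only the three societies in the top-left corner of the table (Malekula, Tikopia, and Santa Cruz), with the rounded values copied by hand; the name dist_sub is introduced here purely for illustration:

```python
import numpy as np

# Distances (thousands of km) among Malekula, Tikopia, and Santa Cruz,
# copied from the rounded table above
dist_sub = np.array([[0.0, 0.5, 0.6],
                     [0.5, 0.0, 0.3],
                     [0.6, 0.3, 0.0]])

# every society is at distance zero from itself
zero_diagonal = bool(np.all(np.diag(dist_sub) == 0))
# the distance from A to B equals the distance from B to A
symmetric = bool(np.allclose(dist_sub, dist_sub.T))

print(zero_diagonal, symmetric)
```

The same two checks should work on islands_dist.values once the file has been loaded.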

The number of tools and the population size are stored in another file, islands.csv, which is also distributed with the book:

islands = pd.read_csv('../data/islands.csv', sep=',')
islands.head().round(1)

	culture	population	contact	total_tools	mean_TU	lat	lon	lon2	logpop
0	Malekula	1100	low	13	3.2	-16.3	167.5	-12.5	7.0
1	Tikopia	1500	low	22	4.7	-12.3	168.8	-11.2	7.3
2	Santa Cruz	3600	low	24	4.0	-10.7	166.0	-14.0	8.2
3	Yap	4791	high	43	5.0	9.5	138.1	-41.9	8.5
4	Lau Fiji	7400	high	33	5.0	-17.7	178.1	-1.9	8.9
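The logpop column appears to be the natural logarithm of population. A quick check, using only the five rows shown above and array names of our own choosing, could look like this:

```python
import numpy as np

# population and logpop values copied from the five rows shown above
population = np.array([1100, 1500, 3600, 4791, 7400])
logpop_shown = np.array([7.0, 7.3, 8.2, 8.5, 8.9])

# natural log of the population, rounded like the displayed table
logpop_check = np.log(population).round(1)

print(np.allclose(logpop_check, logpop_shown))
```

Notice that a population almost seven times larger (Malekula versus Lau Fiji) moves logpop by less than two units, which is the order-of-magnitude behavior the theory cares about.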

From this DataFrame, we are only going to use the columns culture, total_tools, lat, lon2, and logpop:

islands_dist_sqr = islands_dist.values**2
culture_labels = islands.culture.values
index = islands.index.values
log_pop = islands.logpop
total_tools = islands.total_tools
x_data = [islands.lat.values[:, None], islands.lon.values[:, None]]

The model we are going to build is:

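$$
\begin{aligned}
f &\sim \mathcal{GP}([0, \cdots, 0], K) \\
\mu &= \exp(\alpha + \beta x + f) \\
y &\sim \text{Poisson}(\mu)
\end{aligned}
$$

Here, we are omitting the priors for $\alpha$ and $\beta$, as well as the kernel's hyperpriors. $x$ is the log population and $y$ is the total number of tools.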

Basically, this model is a Poisson regression with one novelty compared to the models in Chapter 4, Generalizing Linear Models: one of the terms in the linear model comes from a GP. For computing the kernel of the GP, we will use the distance matrix, islands_dist. In this way, we will be effectively incorporating a measure of similarity in technology exposure (estimated from the distance matrix). Thus, instead of assuming that the total number of tools is a consequence of population alone and independent from one society to the next, we will be modeling the number of tools in each society as a function of their geographical similarity.
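To make the role of the GP term concrete, here is a minimal NumPy sketch of the generative process for three societies. It is not the PyMC3 model. The kernel form, eta * exp(-ell * d**2), is an assumption chosen to match the expression plotted later in this section, and the values of alpha, beta, eta, and ell are hypothetical numbers picked only for illustration:

```python
import numpy as np

rng = np.random.default_rng(123)

# Distances (thousands of km) among Malekula, Tikopia, and Santa Cruz,
# and their log-populations, copied from the tables above
dist = np.array([[0.0, 0.5, 0.6],
                 [0.5, 0.0, 0.3],
                 [0.6, 0.3, 0.0]])
log_pop_sub = np.array([7.0, 7.3, 8.2])

# Hypothetical parameter values (assumptions, not estimates)
alpha, beta = 1.0, 0.25
eta, ell = 0.3, 1.0

# Assumed kernel: covariance decays with the squared distance
K = eta * np.exp(-ell * dist**2)

# One draw of the GP term: a correlated offset for each society
f = rng.multivariate_normal(np.zeros(len(dist)), K)

# Poisson regression with the GP offset added to the linear model
mu = np.exp(alpha + f + beta * log_pop_sub)
tools = rng.poisson(mu)

print(K.round(2))
print(tools)
```

Because nearby societies have large off-diagonal entries in K, their offsets f tend to move together, which is how geographical proximity ends up shaping the expected number of tools.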

This model, including priors, looks like the following code in PyMC3:

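with pm.Model() as model_islands:
    # hyperpriors for the kernel: η scales the covariance, ℓ is the length-scale
    η = pm.HalfCauchy('η', 1)
    ℓ = pm.HalfCauchy('ℓ', 1)

    # GP term: exponentiated quadratic kernel scaled by η
    cov = η * pm.gp.cov.ExpQuad(1, ls=ℓ)
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior('f', X=islands_dist_sqr)

    # Poisson regression with the GP term added to the linear model
    α = pm.Normal('α', 0, 10)
    β = pm.Normal('β', 0, 1)
    μ = pm.math.exp(α + f[index] + β * log_pop)
    tt_pred = pm.Poisson('tt_pred', μ, observed=total_tools)
    trace_islands = pm.sample(1000, tune=1000)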

In order to understand the posterior distribution of covariance functions in terms of distances, we can plot some samples from the posterior distribution:

trace_η = trace_islands['η']
trace_ℓ = trace_islands['ℓ']

_, ax = plt.subplots(1, 1, figsize=(8, 5))
xrange = np.linspace(0, islands_dist.values.max(), 100)

ax.plot(xrange, np.median(trace_η) *
        np.exp(-np.median(trace_ℓ) * xrange**2), lw=3)

ax.plot(xrange, (trace_η[::20][:, None] * np.exp(- trace_ℓ[::20][:, None] * xrange**2)).T,
        'C0', alpha=.1)

ax.set_ylim(0, 1)
ax.set_xlabel('distance (thousand kilometers)')
ax.set_ylabel('covariance')

Figure 7.9

The thick line in Figure 7.9 is the posterior median of the covariance between pairs of societies as a function of distance. We use the median because the distributions of η and ℓ are very skewed. We can see that the covariance is, on average, not that high and that it drops to almost 0 at about 2,000 kilometers. The thin lines represent the uncertainty, and we can see that there is a lot of it.
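The reason for preferring the median can be seen with a small experiment. In this sketch, half-Cauchy draws stand in for a skewed, positive parameter; they are not posterior samples from model_islands, and the seed and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a skewed, positive parameter: half-Cauchy(1) draws.
# These are NOT posterior samples from model_islands.
draws = np.abs(rng.standard_cauchy(10_000))

sample_mean = draws.mean()
sample_median = np.median(draws)

print(f"mean = {sample_mean:.2f}, median = {sample_median:.2f}")
```

A handful of very large draws drags the mean upward, while the median stays close to the bulk of the distribution, so the median gives a more representative summary for the thick line.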

You may find it interesting to compare model_islands, and the posterior computed from it, with model m_10_10 in https://github.com/pymc-devs/resources. You may want to use ArviZ functions, such as az.summary or az.plot_forest. Model m_10_10 is similar to model_islands, but without a Gaussian process term.

We are now going to explore how strongly the island-societies are correlated with one another according to our model. In order to do this, we have to turn the covariance matrix into a correlation matrix:

# compute posterior median covariance among societies
Σ = np.median(trace_η) * (np.exp(-np.median(trace_ℓ) * islands_dist_sqr))
# convert to correlation matrix
Σ_post = np.diag(np.diag(Σ)**-0.5)
ρ = Σ_post @ Σ @ Σ_post
ρ = pd.DataFrame(ρ, index=islands_dist.columns, columns=islands_dist.columns)
ρ.round(2)

	Ml	Ti	SC	Ya	Fi	Tr	Ch	Mn	To	Ha
Ml	1.00	0.90	0.84	0.00	0.50	0.16	0.01	0.03	0.21	0.0
Ti	0.90	1.00	0.96	0.00	0.50	0.16	0.02	0.04	0.18	0.0
SC	0.84	0.96	1.00	0.00	0.34	0.27	0.05	0.08	0.10	0.0
Ya	0.00	0.00	0.00	1.00	0.00	0.07	0.34	0.31	0.00	0.0
Fi	0.50	0.50	0.34	0.00	1.00	0.01	0.00	0.00	0.77	0.0
Tr	0.16	0.16	0.27	0.07	0.01	1.00	0.23	0.72	0.00	0.0
Ch	0.01	0.02	0.05	0.34	0.00	0.23	1.00	0.52	0.00	0.0
Mn	0.03	0.04	0.08	0.31	0.00	0.72	0.52	1.00	0.00	0.0
To	0.21	0.18	0.10	0.00	0.77	0.00	0.00	0.00	1.00	0.0
Ha	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.0
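The conversion used above is the standard rescaling $\rho_{ij} = \Sigma_{ij} / \sqrt{\Sigma_{ii}\,\Sigma_{jj}}$. The following sketch checks, on a three-society example, that the matrix form and the element-wise form agree. The values of eta and ell are hypothetical rather than posterior medians, and the kernel form is the one assumed in the code above:

```python
import numpy as np

# Squared distances among Malekula, Tikopia, and Santa Cruz
# (thousands of km, rounded values copied from the distance table)
dist_sq = np.array([[0.0, 0.5, 0.6],
                    [0.5, 0.0, 0.3],
                    [0.6, 0.3, 0.0]])**2

eta, ell = 0.3, 1.0  # hypothetical values, not posterior medians

cov = eta * np.exp(-ell * dist_sq)

# Route 1: D^(-1/2) @ cov @ D^(-1/2), with D the diagonal of cov
d_inv_sqrt = np.diag(np.diag(cov)**-0.5)
corr_matrix = d_inv_sqrt @ cov @ d_inv_sqrt

# Route 2: divide each entry by the product of the two standard deviations
sd = np.sqrt(np.diag(cov))
corr_elementwise = cov / np.outer(sd, sd)

# With this kernel every variance equals eta, so eta cancels out
corr_kernel = np.exp(-ell * dist_sq)

print(np.allclose(corr_matrix, corr_elementwise),
      np.allclose(corr_matrix, corr_kernel))
```

Under this assumed kernel, the correlation depends only on ell and the distances; eta affects the scale of the covariance but not the correlation.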

Two observations stand out. First, Hawaii is very lonely. This makes sense, as Hawaii is very far away from the rest of the island-societies. Second, Malekula (Ml), Tikopia (Ti), and Santa Cruz (SC) are highly correlated with one another. This also makes sense, as these societies are very close together, and they also have a similar number of tools.

Now we are going to use the latitude and longitude information to plot the island-societies in their relative positions:

# scale point size to logpop
logpop = np.copy(log_pop)
logpop /= logpop.max()
psize = np.exp(logpop*5.5)
log_pop_seq = np.linspace(6, 14, 100)
lambda_post = np.exp(trace_islands['α'][:, None] +
                     trace_islands['β'][:, None] * log_pop_seq)

_, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].scatter(islands.lon2, islands.lat, psize, zorder=3)
ax[1].scatter(islands.logpop, islands.total_tools, psize, zorder=3)

for i, itext in enumerate(culture_labels):
    ax[0].text(islands.lon2[i]+1, islands.lat[i]+1, itext)
    ax[1].text(islands.logpop[i]+.1, islands.total_tools[i]-2.5, itext)


ax[1].plot(log_pop_seq, np.median(lambda_post, axis=0), 'k--')

# draw the HPD band on the right-hand panel explicitly
az.plot_hpd(log_pop_seq, lambda_post, fill_kwargs={'alpha':0},
            plot_kwargs={'color':'k', 'ls':'--', 'alpha':1}, ax=ax[1])


for i in range(10):
    for j in np.arange(i+1, 10):
        ax[0].plot((islands.lon2[i], islands.lon2[j]),
                   (islands.lat[i], islands.lat[j]), 'C1-',
                   alpha=ρ.iloc[i, j]**2, lw=4)
        ax[1].plot((islands.logpop[i], islands.logpop[j]),
                   (islands.total_tools[i], islands.total_tools[j]), 'C1-',
                   alpha=ρ.iloc[i, j]**2, lw=4)
ax[0].set_xlabel('longitude')
ax[0].set_ylabel('latitude')


ax[1].set_xlabel('log-population')
ax[1].set_ylabel('total tools')
ax[1].set_xlim(6.8, 12.8)
ax[1].set_ylim(10, 73)

Figure 7.10

The left-hand panel of Figure 7.10 shows, as lines, the posterior median correlations we previously computed among societies, in the context of their relative geographical positions. Some of the lines are not visible, since we have used the squared correlation to set the opacity of the lines (with matplotlib's alpha parameter). On the right-hand panel, we have the posterior median correlations again, but this time plotted in terms of the log-population versus the total number of tools. The dashed lines represent the median number of tools and the 94% HPD interval as a function of log-population. In both panels, the size of the dots grows with the population of each island-society.

Notice how the correlations among Malekula, Tikopia, and Santa Cruz describe the fact that they have a rather low number of tools, close to the median or lower than the expected number of tools for their populations. Something similar is happening with Trobriand and Manus; they are geographically close and have fewer tools than expected for their population sizes. Tonga has way more tools than expected for its population and a relatively high correlation with Lau Fiji. In a way, the model is telling us that Tonga has a positive effect on Lau Fiji, increasing the total number of tools and counteracting the effect of its close neighbors, Malekula, Tikopia, and Santa Cruz.
