The redwood dataset

Now we are going to apply the same type of model we just used to a 2D spatial problem, using the redwood data. I took this dataset (distributed with a GPL license) from the GPstuff package, a Gaussian process package for Matlab, Octave, and R. The dataset consists of the locations of redwood trees over a given area. The goal of the inference is to identify how the rate of trees is distributed over this area.

As usual, we load the data and plot it:

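import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm  # assumed: pm is PyMC3 (the trace indexing used later is PyMC3-style)

# the CSV has no header row; column 0 and column 1 hold the two coordinates
rw_df = pd.read_csv('../data/redwood.csv', header=None)
_, ax = plt.subplots(figsize=(8, 8))
ax.plot(rw_df[0], rw_df[1], 'C0.')
ax.set_xlabel('x1 coordinate')
ax.set_ylabel('x2 coordinate')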

Figure 7.16

As with the coal-mining disaster example, we need to discretize the data:

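# discretize spatial data; column 1 is passed first so that the rows of
# hist correspond to the vertical axis when plotted with imshow
bins = 20
hist, x1_edges, x2_edges = np.histogram2d(
    rw_df[1].values, rw_df[0].values, bins=bins)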
# compute the location of the centers of the discretized data
x1_centers = x1_edges[:-1] + (x1_edges[1] - x1_edges[0]) / 2
x2_centers = x2_edges[:-1] + (x2_edges[1] - x2_edges[0]) / 2
# arrange xdata into proper shape for GP
x_data = [x1_centers[:, None], x2_centers[:, None]]
# arrange ydata into proper shape for GP
y_data = hist.flatten()
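To make the shapes concrete, here is a minimal sketch of the same discretization on a tiny, made-up set of points (the array `pts` and the explicit `range` argument are assumptions for illustration, not part of the redwood data). It shows that the counts form a grid, that the centers sit in the middle of each bin, and that `flatten()` unrolls the grid row by row:

```python
import numpy as np

# Hypothetical stand-in for the redwood coordinates: four points in the unit square
pts = np.array([[0.1, 0.1], [0.2, 0.8], [0.9, 0.9], [0.8, 0.7]])

# 2 x 2 grid; the explicit range keeps the bin edges at 0, 0.5, and 1
hist, e1, e2 = np.histogram2d(
    pts[:, 1], pts[:, 0], bins=2, range=[[0, 1], [0, 1]])

# bin centers: left edge plus half the bin width
centers1 = e1[:-1] + (e1[1] - e1[0]) / 2
centers2 = e2[:-1] + (e2[1] - e2[0]) / 2

# row-major flattening: y[i * 2 + j] == hist[i, j]
y = hist.flatten()

print(hist)
print(centers1, centers2)
print(y)
```

Every point falls in exactly one cell, so the flattened counts add up to the number of points.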

Notice that instead of building a mesh grid, we keep the x1 and x2 data separate. This allows us to build one covariance matrix per coordinate, which greatly reduces the size of the matrices needed to compute the GP. We only need to take care to use the LatentKron class to define the GP. It is important to note that this is not a numerical trick but a mathematical property of the structure of this type of matrix, so we are not introducing any approximation or error into our model. We are just expressing it in a way that makes faster computations possible:

with pm.Model() as model_rw:
    ℓ = pm.HalfNormal('ℓ', rw_df.std().values, shape=2)
    cov_func1 = pm.gp.cov.ExpQuad(1, ls=ℓ[0])
    cov_func2 = pm.gp.cov.ExpQuad(1, ls=ℓ[1])

    gp = pm.gp.LatentKron(cov_funcs=[cov_func1, cov_func2])
    f = gp.prior('f', Xs=x_data)

    y = pm.Poisson('y', mu=pm.math.exp(f), observed=y_data)
    trace_rw = pm.sample(1000)
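The claim that the Kronecker structure is exact, and not an approximation, can be checked numerically. The following is a minimal sketch in plain NumPy, not the PyMC3 implementation: the helper `exp_quad`, the small grids `x1` and `x2`, and the length-scales `ls1` and `ls2` are all made up for illustration. It assumes the full grid is flattened with the second coordinate varying fastest, which is the order `hist.flatten()` produces:

```python
import numpy as np

def exp_quad(x, ls):
    """Exponentiated quadratic kernel matrix for a 1-D array of inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

# Hypothetical small grids and length-scales
x1 = np.linspace(0, 1, 4)
x2 = np.linspace(0, 1, 3)
ls1, ls2 = 0.3, 0.5

# One small covariance matrix per coordinate, combined with a Kronecker product
K1 = exp_quad(x1, ls1)   # 4 x 4
K2 = exp_quad(x2, ls2)   # 3 x 3
K_kron = np.kron(K1, K2)  # 12 x 12

# Covariance built directly on the full mesh grid (x2 varies fastest),
# using the product of the two one-dimensional kernels
g1, g2 = np.meshgrid(x1, x2, indexing='ij')
g1, g2 = g1.ravel(), g2.ravel()
K_full = exp_quad(g1, ls1) * exp_quad(g2, ls2)

print(K_kron.shape, np.allclose(K_kron, K_full))
```

The two matrices agree to floating-point precision, but only the 4 x 4 and 3 x 3 factors ever need to be stored and factorized. For the redwood grid, that means two 20 x 20 matrices instead of one 400 x 400 matrix.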

And, finally, we plot the results:

rate = np.exp(np.mean(trace_rw['f'], axis=0).reshape((bins, -1)))
fig, ax = plt.subplots(figsize=(6, 6))
ims = ax.imshow(rate, origin='lower')
ax.grid(False)
ticks_loc = np.linspace(0, bins-1, 6)
ticks_lab = np.linspace(0, 1, 6).round(1)
ax.set_xticks(ticks_loc)
ax.set_yticks(ticks_loc)
ax.set_xticklabels(ticks_lab)
ax.set_yticklabels(ticks_lab)
cbar = fig.colorbar(ims, fraction=0.046, pad=0.04)

Figure 7.17

In Figure 7.17, lighter colors indicate a higher rate of trees than darker colors. We may imagine that we want to find zones with a high growth rate because we are interested in how a forest is recovering from a fire, or that we are interested in the properties of the soil and use the trees as a proxy.
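One detail of the plotting code is worth keeping in mind: the rate is computed as the exponential of the posterior mean of f, which is not the same quantity as the posterior mean of exp(f). The sketch below illustrates the difference with made-up draws (the array `f_samples` is a hypothetical stand-in for `trace_rw['f']`, on a 3 x 3 grid instead of 20 x 20):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior draws of f: 1000 samples for a flattened 3 x 3 grid
f_samples = rng.normal(loc=1.0, scale=0.5, size=(1000, 9))

# Same recipe as the plotting code: exponentiate the mean of f
rate_from_mean_f = np.exp(f_samples.mean(axis=0)).reshape((3, -1))

# Alternative: average the rate itself over the draws
mean_rate = np.exp(f_samples).mean(axis=0).reshape((3, -1))

print(rate_from_mean_f.round(2))
print(mean_rate.round(2))
```

Because the exponential is convex, the second estimate is never smaller than the first. Both give the same qualitative picture of where the rate is high, so the choice mainly affects the values on the color bar.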
