The coal-mining disasters

The first example is known as the coal-mining disaster example. It consists of a record of coal-mining disasters in the UK from 1851 to 1962. The number of disasters is thought to have been affected by changes in safety regulations during this period. We want to model the rate of disasters as a function of time. Our dataset consists of a single column, and each entry corresponds to the time at which a disaster happened.

Let's load the data and look at some of its values:

coal_df = pd.read_csv('../data/coal.csv', header=None)
coal_df.head()

           0
0  1851.2026
1  1851.6324
2  1851.9692
3  1851.9747
4  1852.3142

The model we will use to fit the data in the coal_df dataframe is:

$$f(x) \sim \mathcal{GP}(0, K(x, x'))$$
$$y \sim \text{Poisson}(\exp(f(x)))$$

As you can see, this is a Poisson regression problem. You may be wondering, at this point, how we are going to perform a regression if we only have a single column with just the dates of the disasters. The answer is to discretize the data, just as if we were building a histogram. We are going to use the centers of the bins as the variable x and the counts per bin as the variable y:

# discretize data
years = int((coal_df.max() - coal_df.min()).iloc[0])
bins = years // 4
hist, x_edges = np.histogram(coal_df, bins=bins)
# compute the location of the centers of the discretized data
x_centers = x_edges[:-1] + (x_edges[1] - x_edges[0]) / 2
# arrange xdata into proper shape for GP
x_data = x_centers[:, None]
# express the data as a rate: number of disasters per year
y_data = hist / 4
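To make the binning arithmetic concrete, here is a minimal sketch of the same idea using only the standard library. The function name `bin_events` and the toy event times are hypothetical and are not part of the coal dataset. Note one deliberate difference from the code above: this sketch divides each count by the actual bin width, whereas the code above divides by 4 because its bins are approximately four years wide.

```python
# Minimal sketch of the discretization step, standard library only.
# `bin_events` and `toy_times` are illustrative names, not from the book.

def bin_events(times, n_bins):
    """Count events in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for t in times:
        # the maximum value falls in the last bin (closed on the right),
        # which is also how np.histogram treats its last bin
        idx = min(int((t - lo) / width), n_bins - 1)
        counts[idx] += 1
    # bin centers play the role of x, counts (or rates) the role of y
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts, width

# made-up event times, in years
toy_times = [1900.0, 1900.5, 1901.0, 1903.0, 1905.0, 1906.5, 1908.0]
centers, counts, width = bin_events(toy_times, n_bins=2)

# events per year in each bin
rates = [c / width for c in counts]
```

With these toy values, the two bins are four years wide, so the first bin (four events) has a rate of one event per year.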

And now we define and solve the model with PyMC3:

with pm.Model() as model_coal:
    ℓ = pm.HalfNormal('ℓ', x_data.std())
    cov = pm.gp.cov.ExpQuad(1, ls=ℓ) + pm.gp.cov.WhiteNoise(1E-5)
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior('f', X=x_data)

    y_pred = pm.Poisson('y_pred', mu=pm.math.exp(f), observed=y_data)
    trace_coal = pm.sample(1000, chains=1)
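It may help to see what the two main ingredients of model_coal compute. The following is a rough sketch in plain Python, not PyMC3 code: it evaluates the exponentiated quadratic kernel with a small white-noise term on the diagonal, and the Poisson log-probability when the rate is exp(f). The helper names, the bin centers, and the length-scale value are all assumptions chosen for illustration.

```python
import math

def exp_quad(x1, x2, ls):
    """Exponentiated quadratic kernel: exp(-(x1 - x2)^2 / (2 * ls^2))."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * ls ** 2))

def cov_matrix(xs, ls, noise_sigma=1e-5):
    """Kernel matrix plus noise_sigma^2 on the diagonal.

    The diagonal term mirrors the ExpQuad + WhiteNoise sum in the model
    and keeps the matrix numerically well behaved.
    """
    n = len(xs)
    return [[exp_quad(xs[i], xs[j], ls) + (noise_sigma ** 2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def poisson_logp(k, f):
    """log P(k | rate = exp(f)); exp is the inverse of the log link."""
    mu = math.exp(f)  # always positive, whatever the sign of f
    return k * f - mu - math.lgamma(k + 1)

# hypothetical bin centers (years) and length-scale
xs = [1853.0, 1857.0, 1861.0, 1901.0]
K = cov_matrix(xs, ls=10.0)
```

Nearby bin centers get a covariance close to 1 and distant ones a covariance close to 0, which is what makes the inferred rate vary smoothly with time. Exponentiating f guarantees a positive Poisson rate even though the Gaussian process itself can take negative values.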

Now we plot the results:

_, ax = plt.subplots(figsize=(10, 6))

f_trace = np.exp(trace_coal['f'])
rate_median = np.median(f_trace, axis=0)

ax.plot(x_centers, rate_median, 'w', lw=3)
az.plot_hpd(x_centers, f_trace)

az.plot_hpd(x_centers, f_trace, credible_interval=0.5,
            plot_kwargs={'alpha': 0})

ax.plot(coal_df, np.zeros_like(coal_df)-0.5, 'k|')
ax.set_xlabel('years')
ax.set_ylabel('rate')

Figure 7.15

Figure 7.15 shows, using a white line, the median disaster rate as a function of time. The bands describe the 50% HPD interval (darker) and the 94% HPD interval (lighter). At the bottom, black markers indicate each disaster (this is sometimes known as a rug plot). As we can see, the rate of accidents decreases with time, except for a brief initial increase. The PyMC3 documentation includes the coal-mining disaster example, but modeled from a different perspective. I strongly recommend that you check that example, as it is very useful on its own and it is also useful to compare it with the approach we just implemented with the model_coal model.
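The median line and the HPD bands in Figure 7.15 are summaries of the posterior samples computed separately at each bin center. As a minimal sketch of what such a summary involves at a single point, the following plain-Python code finds the median and the narrowest interval containing a given fraction of the samples. The function `hpd` and the values in `draws` are hypothetical, and ArviZ's own computation may differ in its details.

```python
import math
import statistics

def hpd(samples, mass=0.94):
    """Narrowest interval containing `mass` of the samples.

    This simple version assumes a unimodal distribution.
    """
    s = sorted(samples)
    n = len(s)
    # number of samples that must fall inside the interval
    k = min(n, max(1, math.ceil(mass * n)))
    # slide a window of k consecutive samples and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

# made-up posterior draws of the rate at one bin center
draws = [0.5, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0, 8.0]

rate_median = statistics.median(draws)
inner = hpd(draws, mass=0.5)    # analogous to the darker band
outer = hpd(draws, mass=0.94)   # analogous to the lighter band
```

Because the interval is the narrowest one rather than one with equal tails, it is not necessarily centered on the median, which matters for skewed quantities such as a rate obtained by exponentiating f.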

Notice that even though we binned the data, we obtain, as a result, a smooth curve. In this sense, we can see model_coal (and, in general, this type of model) as building a histogram and then smoothing it.
