Dirichlet process

All models that we have seen so far were parametric models. These are models with a fixed number of parameters that we are interested in estimating, such as a fixed number of clusters. We can also have non-parametric models; a better name for them would probably be non-fixed-parametric models, but someone has already decided the name for us. Non-parametric models are models with a theoretically infinite number of parameters. In practice, we somehow let the data reduce the theoretically infinite number of parameters to some finite number; in other words, the data decides the actual number of parameters, which makes non-parametric models very flexible. In this book we are going to see two examples of such models: the Gaussian process (the subject of the next chapter) and the Dirichlet process, which we will start discussing in the next paragraph.

Just as the Dirichlet distribution is the n-dimensional generalization of the beta distribution, the Dirichlet process (DP) is the infinite-dimensional generalization of the Dirichlet distribution. The Dirichlet distribution is a probability distribution on the space of probability vectors, while the DP is a probability distribution on the space of distributions; this means that a single draw from a DP is actually a distribution. For finite mixture models we used the Dirichlet distribution to assign a prior over a fixed number of clusters or groups. A DP is a way to assign a prior distribution to a non-fixed number of clusters; even more, we can think of a DP as a way to sample from a prior distribution over distributions.
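To make the contrast concrete, here is a minimal sketch (not part of the models in this chapter) that draws once from a Dirichlet distribution; the result is a vector of probabilities over a fixed number of categories, whereas a draw from a DP is itself an entire distribution:

from scipy import stats

K = 3
# a single draw from a Dirichlet distribution is a probability vector
# of fixed length K (its elements sum to 1)
p = stats.dirichlet.rvs([1.] * K)[0]
print(p, p.sum())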

Before we move on to the actual non-parametric mixture model, let us take a moment to discuss some of the details of the DP. The formal definition of a DP is somewhat obscure unless you know your probability theory very well, so instead let me describe some of the properties of a DP that are relevant to understanding its role in modeling mixtures:

  • A DP is a distribution whose realizations are probability distributions, instead of, say, real numbers as for a Gaussian distribution.
  • A DP is specified by a base distribution H and a positive real number α called the concentration parameter (this is analogous to the concentration parameter of the Dirichlet distribution).
  • H is the expected value of the DP; this means that a DP will generate distributions around the base distribution, somewhat like the mean of a Gaussian distribution.
  • As α increases, the realizations become less and less concentrated.
  • In practice, a DP always generates discrete distributions.
  • In the limit α → ∞ the realizations from a DP will be equal to the base distribution, so if the base distribution is continuous the DP will generate a continuous distribution. For this reason mathematicians say that the distributions generated from a DP are almost surely discrete. In practice, as α will be a finite number, we will always work with discrete distributions.

To make these properties more concrete, let us take another look at the categorical distribution in Figure 6.3. We can completely specify such a distribution by indicating the positions on the x axis and the heights on the y axis. For the categorical distribution, the positions on the x axis are restricted to be integers and the sum of the heights has to be 1. Let us keep the last restriction but relax the former one. To generate the positions on the x axis we are going to sample from a base distribution H. In principle H can be any distribution we want: if we choose a Gaussian, the locations can in principle be any value on the real line; if we choose a beta, the locations will be restricted to the interval [0, 1]; and if we choose a Poisson as the base distribution, the locations will be restricted to the non-negative integers {0, 1, 2, ...}.
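As a small illustrative sketch (not one of the figures in this chapter), we can sample a handful of locations from three different base distributions to see how the choice of H restricts where the locations can fall:

from scipy import stats

# the support of the base distribution H determines the possible locations
for H in [stats.norm(0, 1), stats.beta(2, 2), stats.poisson(3)]:
    locs = H.rvs(size=5)
    print(H.dist.name, locs)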

So far so good, but how do we choose the values on the y axis? We follow a Gedankenexperiment known as the stick-breaking process. Imagine we have a stick of length 1 and we break it into two parts (not necessarily equal). We set one part aside, break the other part in two, and then keep repeating this forever. In practice, as we cannot really repeat the process infinitely, we truncate it at some predefined value K, but the general idea holds. To control the stick-breaking process we use a parameter α. As we increase the value of α we break the stick into smaller and smaller portions. Thus, in the limit α → 0 we do not break the stick at all, and in the limit α → ∞ we break it into infinitely many pieces. Figure 6.11 shows four draws from a DP, for four different values of α. I will explain the code that generates that figure in a moment; let us focus first on understanding what these samples tell us about a DP:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def stick_breaking_truncated(α, H, K):
    """
    Truncated stick-breaking process view of a DP

    Parameters
    ----------
    α : float
        concentration parameter
    H : scipy distribution
        base distribution
    K : int
        number of components

    Returns
    -------
    locs : array
        locations
    w : array
        probabilities
    """
    βs = stats.beta.rvs(1, α, size=K)
    w = βs * np.concatenate(([1.], np.cumprod(1 - βs[:-1])))
    locs = H.rvs(size=K)
    return locs, w

# Parameters of the DP
K = 500
H = stats.norm
alphas = [1, 10, 100, 1000]

_, ax = plt.subplots(2, 2, sharex=True, figsize=(10, 5))
ax = np.ravel(ax)
for idx, α in enumerate(alphas):
    locs, w = stick_breaking_truncated(α, H, K)
    ax[idx].vlines(locs, 0, w, color='C0')
    ax[idx].set_title('α = {}'.format(α))

plt.tight_layout()

Figure 6.11

We can see from Figure 6.11 that the DP is a discrete distribution. When α increases we obtain a more spread-out distribution and smaller pieces of the stick; notice the change in the scale of the y axis and remember that the total length of the stick is fixed at 1. The base distribution controls the locations, as the locations are sampled from it. We can also see from Figure 6.11 that as α increases the shape of the DP realizations resembles the base distribution H more and more; from this we can hopefully see that in the limit α → ∞ we should obtain the base distribution exactly.

We can think of a DP as the prior on a random distribution f, where the base distribution H is what we expect f to be and the concentration parameter α represents how confident we are about our prior guess.

Figure 6.1 shows that if you place a Gaussian on top of each data point and then sum all the Gaussians, you can approximate the distribution of the data. We can use a DP to do something similar, but instead of placing a Gaussian on top of each data point, we place a Gaussian at the location of each substick from a DP realization and scale (or weight) that Gaussian by the length of that substick. This procedure provides a general recipe for an infinite Gaussian mixture model. Alternatively, we can replace the Gaussian with any other distribution and we get a recipe for a general infinite mixture model. Figure 6.12 shows an example of such a model, where we use a mixture of Laplace distributions. I chose a Laplace distribution arbitrarily, just to reinforce the idea that you are by no means restricted to Gaussian mixture models:

α = 10
H = stats.norm
K = 5

x = np.linspace(-4, 4, 250)
x_ = np.array([x] * K).T
locs, w = stick_breaking_truncated(α, H, K)

# place a Laplace distribution at each location and weight it by the
# length of the corresponding substick
dist = stats.laplace(locs, 0.5)
plt.plot(x, np.sum(dist.pdf(x_) * w, 1), 'C0', lw=2)
plt.plot(x, dist.pdf(x_) * w, 'k--', alpha=0.7)
plt.yticks([])

Figure 6.12

I hope at this point you have a good intuition for the DP; the only detail still missing is to understand the function stick_breaking_truncated. Mathematically, the stick-breaking process view of the DP can be represented in the following way:

$$\sum_{k=1}^{\infty} w_k \, \delta_{\theta_k}(\theta)$$

Where:

  • $\delta_{\theta_k}$ is the indicator function, which evaluates to zero everywhere except at $\theta_k$; this represents the locations sampled from the base distribution H
  • The probabilities $w_k$ are given by:

$$w_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i)$$

Where:

  • $w_k$ is the length of a substick
  • $\prod_{i=1}^{k-1} (1 - \beta_i)$ is the length of the remaining portion, the one we need to keep breaking
  • $\beta_k$ indicates how to break the remaining portion
  • $\beta_k \sim \text{Beta}(1, \alpha)$; from this expression we can see that when α increases, $\beta_k$ will be on average smaller
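If you want to convince yourself that the vectorized expression inside stick_breaking_truncated really computes the product in the formula above, the following short check (not part of the original code) compares it against breaking the stick one piece at a time:

import numpy as np
from scipy import stats

α, K = 10, 500
βs = stats.beta.rvs(1, α, size=K)

# sequential view: at each step we break off a fraction β_k of what is left
remaining = 1.
w_seq = np.empty(K)
for k in range(K):
    w_seq[k] = βs[k] * remaining
    remaining *= 1 - βs[k]

# vectorized view, exactly as in stick_breaking_truncated
w_vec = βs * np.concatenate(([1.], np.cumprod(1 - βs[:-1])))

print(np.allclose(w_seq, w_vec), w_vec.sum())  # True, sum ≈ 1 for large K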

Now we are more than ready to try implementing a DP in PyMC3. Let us first define a stick_breaking function that works with PyMC3:

import pymc3 as pm
import theano.tensor as tt

N = cs_exp.shape[0]
K = 20


def stick_breaking(α):
    β = pm.Beta('β', 1., α, shape=K)
    w = β * pm.math.concatenate([[1.],
                                 tt.extra_ops.cumprod(1. - β)[:-1]])
    return w

We have to define a prior for α, the concentration parameter. A common choice for this is a Gamma distribution:

with pm.Model() as model:
    α = pm.Gamma('α', 1., 1.)
    w = pm.Deterministic('w', stick_breaking(α))
    means = pm.Normal('means', mu=cs_exp.mean(), sd=10, shape=K)
    sd = pm.HalfNormal('sd', sd=10, shape=K)
    obs = pm.NormalMixture('obs', w, means, sd=sd, observed=cs_exp.values)
    trace = pm.sample(1000, tune=2000, nuts_kwargs={'target_accept': 0.9})

Figure 6.13

From Figure 6.13 we can see that the value of α is rather low, indicating that only a few components are necessary to describe the data.
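Besides looking at the plot, we can summarize the posterior of α numerically. The following couple of lines are just a sketch, assuming trace is the result of pm.sample from the model above:

# posterior mean and a 94% interval for the concentration parameter α
α_post = trace['α']
print(α_post.mean(), np.percentile(α_post, [3, 97]))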

Because we are approximating the infinite DP with a truncated stick-breaking procedure, it is important to check that the truncation value (K = 20 in this example) is not introducing any bias. A simple way to do this is to plot the average weight of each component; to be on the safe side we should have several components with negligible weight, otherwise we must increase the truncation value. An example of this type of plot is Figure 6.14. We can see that only a few of the first components are important, and thus we can be confident that the chosen upper value of K is large enough for this model and data:

plt.figure(figsize=(8, 6))
plot_w = np.arange(K)
plt.plot(plot_w, trace['w'].mean(0), 'o-')
plt.xticks(plot_w, plot_w+1)
plt.xlabel('Component')
plt.ylabel('Average weight')

Figure 6.14

Figure 6.15 shows the mean density estimated using the DP model (black line) together with samples from the posterior (gray lines) to reflect the uncertainty in the estimation. This model also shows a less smooth density compared to the KDE from Figure 6.2 and Figure 6.8:

x_plot = np.linspace(cs_exp.min()-1, cs_exp.max()+1, 200)

# per-sample contribution of each component, evaluated over x_plot
post_pdf_contribs = stats.norm.pdf(np.atleast_3d(x_plot),
                                   trace['means'][:, np.newaxis, :],
                                   trace['sd'][:, np.newaxis, :])
post_pdfs = (trace['w'][:, np.newaxis, :] *
             post_pdf_contribs).sum(axis=-1)

plt.figure(figsize=(8, 6))

plt.hist(cs_exp.values, bins=25, density=True, alpha=0.5)
plt.plot(x_plot, post_pdfs[::100].T, c='0.5')
plt.plot(x_plot, post_pdfs.mean(axis=0), c='k')

plt.xlabel('x')
plt.yticks([])

Figure 6.15