Non-identifiability of mixture models

If you carefully check Figure 6.6, you will find something funny going on. Both means are estimated as bimodal distributions with values around (47, 57.5), and if you check the summary obtained with az.summary, the averages of the means are almost equal, at around 52. We can see something similar with the values of p. This is an example of a phenomenon known in statistics as parameter non-identifiability. It happens because the model is the same if component 1 has a mean of 47 and component 2 has a mean of 57.5, or vice versa; both scenarios are fully equivalent. In the context of mixture models, this is also known as the label-switching problem. We already found an example of parameter non-identifiability in Chapter 3, Modeling with Linear Regression, when we discussed linear models and variables with high correlation.
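To see why swapping labels leaves the model unchanged, we can write the two-component mixture likelihood for a single data point y, using the notation of the model below (weights p, means μ, and a shared standard deviation σ):

p(y \mid p, \mu, \sigma) = p_1 \, \mathcal{N}(y \mid \mu_1, \sigma) + p_2 \, \mathcal{N}(y \mid \mu_2, \sigma) = p_2 \, \mathcal{N}(y \mid \mu_2, \sigma) + p_1 \, \mathcal{N}(y \mid \mu_1, \sigma)

Swapping the labels of the two components (together with their weights) just reorders the terms of the sum, so both labelings produce exactly the same likelihood and the data cannot distinguish between them.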

When possible, the model should be defined to remove non-identifiability. With mixture models, there are at least two ways of doing this:

  • Force the components to be ordered; for example, arrange the means of the components in strictly increasing order (see the sketch after the following note)
  • Use informative priors
Parameters in a model are not identified if the same likelihood function is obtained for more than one choice of the model parameters.
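
One way to impose the ordering directly is PyMC3's ordered transform, which keeps the means sorted during sampling. The following is a minimal sketch of this alternative, assuming the same cs_exp data used throughout the chapter (model_ordered is just an illustrative name); the rest of this section uses a potential instead:

import numpy as np
import pymc3 as pm

clusters = 2
with pm.Model() as model_ordered:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    # the ordered transform constrains means[0] < means[1] during sampling;
    # testval supplies a starting point that already satisfies the ordering
    means = pm.Normal('means', mu=np.array([.9, 1]) * cs_exp.mean(), sd=10,
                      shape=clusters,
                      transform=pm.distributions.transforms.ordered,
                      testval=np.array([.9, 1]) * cs_exp.mean())
    sd = pm.HalfNormal('sd', sd=10)
    y = pm.NormalMixture('y', w=p, mu=means, sd=sd, observed=cs_exp)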

Using PyMC3, one easy way to enforce that the components are ordered is pm.Potential(). A potential is an arbitrary factor we add to the likelihood without adding a variable to the model. The main difference between a likelihood and a potential is that a potential does not necessarily depend on the data, while a likelihood does. We can use a potential to enforce a constraint. For example, we can define the potential in such a way that if the constraint is not violated, we add a factor of zero to the likelihood; otherwise, we add a factor of -∞. The net result is that the model considers any parameters (or combination of parameters) violating the constraint to be impossible, while the rest of the values are left unperturbed:

import numpy as np
import pymc3 as pm
import theano.tensor as tt
import arviz as az

clusters = 2
with pm.Model() as model_mgp:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    means = pm.Normal('means', mu=np.array([.9, 1]) * cs_exp.mean(),
                      sd=10, shape=clusters)
    sd = pm.HalfNormal('sd', sd=10)
    # add -inf to the log-probability whenever means[1] < means[0],
    # effectively rejecting unordered means
    order_means = pm.Potential('order_means',
                               tt.switch(means[1] - means[0] < 0,
                                         -np.inf, 0))
    y = pm.NormalMixture('y', w=p, mu=means, sd=sd, observed=cs_exp)
    trace_mgp = pm.sample(1000, random_seed=123)

varnames = ['means', 'p']
az.plot_trace(trace_mgp, varnames)

Figure 6.7

Let's also compute the summary for this model:

az.summary(trace_mgp)

          mean   sd    mc error  hpd 3%  hpd 97%  eff_n   r_hat
means[0]  46.84  0.42  0.01      46.04   47.61    1328.0  1.0
means[1]  57.46  0.10  0.00      57.26   57.65    2162.0  1.0
p[0]       0.09  0.01  0.00       0.07    0.11    1365.0  1.0
p[1]       0.91  0.01  0.00       0.89    0.93    1365.0  1.0
sd         3.65  0.07  0.00       3.51    3.78    1959.0  1.0

Another constraint that we may find useful is one ensuring that all components have a non-null probability or, in other words, that each component in the mixture gets at least one observation assigned to it. You can do this with the following expression:

p_min = pm.Potential('p_min', tt.switch(tt.min(p) < min_p, -np.inf, 0))

Here, you can set min_p to some arbitrary but reasonable value, such as 0.1 or 0.01, as sketched below.
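
As an illustration of where this line goes, here is a sketch of the previous model with both potentials added (model_mgp2 and the value of min_p are just illustrative choices):

min_p = 0.01  # arbitrary but reasonable lower bound for each weight

with pm.Model() as model_mgp2:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    means = pm.Normal('means', mu=np.array([.9, 1]) * cs_exp.mean(),
                      sd=10, shape=clusters)
    sd = pm.HalfNormal('sd', sd=10)
    order_means = pm.Potential('order_means',
                               tt.switch(means[1] - means[0] < 0,
                                         -np.inf, 0))
    # reject samples where any component gets less than min_p of the total weight
    p_min = pm.Potential('p_min', tt.switch(tt.min(p) < min_p, -np.inf, 0))
    y = pm.NormalMixture('y', w=p, mu=means, sd=sd, observed=cs_exp)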

As we can see from Figure 6.4, the value of α controls the concentration of the Dirichlet distribution. A flat prior distribution on the simplex is obtained with α = 1, as used in model_mgp. Larger values of α mean more informative priors. Empirical evidence suggests that values of α larger than 1 are generally a good default choice, as these values usually lead to posterior distributions in which each component has at least one data point assigned to it, while reducing the chance of overestimating the number of components.
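For example, moving from the flat prior to a more concentrated one only changes the a argument of pm.Dirichlet; the following sketch shows both options (the model names and the value 10 are just illustrative):

clusters = 2
with pm.Model() as model_flat:
    # flat prior over the simplex, as in model_mgp: a = [1, 1]
    p = pm.Dirichlet('p', a=np.ones(clusters))

with pm.Model() as model_concentrated:
    # more informative prior: larger concentration values pull the prior
    # towards equal weights and away from the corners of the simplex
    p = pm.Dirichlet('p', a=10 * np.ones(clusters))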
