Metropolis-Hastings

For some distributions, such as the Gaussian, we have very efficient algorithms to obtain samples, but for other distributions this is not the case. Metropolis-Hastings enables us to obtain samples from any probability distribution, $p(x)$, given that we can compute at least a value proportional to it, thus ignoring the normalization factor. This is very useful, since in a lot of problems, not just in Bayesian statistics, the hard part is computing the normalization factor.
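To see why this matters in Bayesian statistics, recall Bayes' theorem: the posterior is proportional to the likelihood times the prior,

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$$

and the hard-to-compute marginal likelihood, $p(y)$, is exactly the normalization factor that Metropolis-Hastings lets us ignore, because, as we will see, the method only ever uses ratios of the target density.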

To conceptually understand this method, we are going to use the following analogy. Suppose we are interested in finding the volume of water a lake contains and the location of its deepest point. The water is really muddy, so we can't estimate the depth just by looking through to the bottom, and the lake is really big, so a grid approximation does not seem like a very good idea. To develop a sampling strategy, we seek help from two of our best friends: Markovia and Monty. After a fruitful discussion, they come up with the following algorithm, which requires a boat (nothing fancy, we can even use a wooden raft) and a very long stick. This is cheaper than sonar, and we have already spent all our money on the boat, anyway! Check out these steps:

  1. Choose a random place in the lake, and move the boat there.
  2. Use the stick to measure the depth of the lake.
  3. Move the boat to another point and take a new measurement.
  4. Compare the two measurements in the following way:
    • If the new spot is deeper than the old one, write down in your notebook the depth of the new spot and repeat from step 3.
    • If the new spot is shallower than the old one, we have two options: accept or reject. Accepting means writing down the depth of the new spot and repeating from step 3. Rejecting means going back to the old spot, writing down (yes, again!) the depth of the old spot, and repeating from step 3.

The rule to decide whether to accept or reject is known as the Metropolis-Hastings criterion, and it basically says that we must accept the new spot with a probability proportional to the ratio of the depth of the new spot to that of the old one.
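For example (the numbers are ours, purely for illustration), if the new spot is 3 meters deep and the old one is 6 meters deep, we accept the move with probability $3/6 = 0.5$; if instead the new spot is 6 meters deep and the old one 3, we always accept the move.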

If we follow this iterative procedure, we will get not only the total volume of the lake and the deepest point, but also an approximation of the entire curvature of the lake's bottom. As you may have guessed, in this analogy the curvature of the bottom of the lake is the posterior distribution, and the deepest point is the mode. According to our friend Markovia, the larger the number of iterations, the better the approximation. Indeed, the theory guarantees that, under certain general circumstances, we will get the exact answer with an infinite number of samples. Luckily for us, in practice, and for many, many problems, we can get a very accurate approximation using a finite and relatively small number of samples.

The preceding explanation is enough to get a conceptual-level understanding of Metropolis-Hastings. The next few paragraphs contain a more detailed and formal explanation in case you want to dig deeper.

The Metropolis-Hastings algorithm has the following steps:

  1. Choose an initial value for the parameter, $x_i$. This can be done randomly or by making an educated guess.
  2. Choose a new parameter value, $x_{i+1}$, by sampling from an easy-to-sample distribution, such as a Gaussian or uniform distribution, $q(x_{i+1} \mid x_i)$. We can think of this step as perturbing the state $x_i$ somehow.
  3. Compute the probability of accepting the new parameter value by using the Metropolis-Hastings criterion:

$$p_a(x_{i+1} \mid x_i) = \min\left(1, \frac{p(x_{i+1})\, q(x_i \mid x_{i+1})}{p(x_i)\, q(x_{i+1} \mid x_i)}\right)$$

  4. If the probability computed in step 3 is larger than a value drawn from a uniform distribution on the [0, 1] interval, we accept the new state; otherwise, we stay in the old state.
  5. We iterate from step 2 until we have enough samples (a code sketch of a single step follows this list).
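To make steps 2 to 4 concrete, here is a minimal sketch of a single Metropolis-Hastings step with a Gaussian proposal (the names mh_step, p, and proposal_scale are ours for illustration, not from any library):

import numpy as np
from scipy import stats

def mh_step(old_x, p, proposal_scale=0.5):
    """One Metropolis-Hastings step; p is the (possibly unnormalized) target pdf."""
    new_x = np.random.normal(old_x, proposal_scale)  # step 2: perturb the state
    # A Gaussian proposal is symmetric, q(new|old) == q(old|new), so these
    # terms cancel; we keep them to show the general (asymmetric) criterion
    q_forward = stats.norm(old_x, proposal_scale).pdf(new_x)
    q_backward = stats.norm(new_x, proposal_scale).pdf(old_x)
    acceptance = min(1, (p(new_x) * q_backward) / (p(old_x) * q_forward))  # step 3
    if np.random.random() < acceptance:  # step 4
        return new_x  # accept the proposed state
    return old_x      # reject and stay in the old state

For instance, mh_step(0.5, stats.beta(2, 5).pdf) performs one step toward sampling from a beta(2, 5) distribution.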

Here are a couple of notes to take into account:

  • If the proposal distribution $q(x_{i+1} \mid x_i)$ is symmetric, we get the Metropolis criterion (notice that we drop the Hastings part):

$$p_a(x_{i+1} \mid x_i) = \min\left(1, \frac{p(x_{i+1})}{p(x_i)}\right)$$

  • Steps 3 and 4 imply that we will always accept a move to a more probable state. Less probable parameter values are accepted probabilistically, given the ratio between the probability of the new parameter value, $p(x_{i+1})$, and the old parameter value, $p(x_i)$. This criterion for accepting proposed steps gives us a more efficient sampling approach than the grid approximation, while ensuring correct sampling.
  • The target distribution (the posterior distribution in Bayesian statistics) is approximated by a list of sampled parameter values. If we accept, we add the new sampled value, $x_{i+1}$, to the list. If we reject, we add the value of $x_i$ to the list (even though the value is repeated).

At the end of the process, we will have a list of values. If everything was done the right way, these samples will be an approximation of the posterior. The most frequent values in our trace will be the most probable values according to the posterior. An advantage of this procedure is that analyzing the posterior is as simple as manipulating an array of values, as you have already experienced in the previous chapters.

The following code illustrates a very basic implementation of the Metropolis algorithm. It is not meant to solve any real problem, only to show that it is possible to sample from a probability distribution if we know how to compute its density point-wise. Notice also that the following implementation has nothing Bayesian about it; there is no prior and we do not even have data! Remember that MCMC methods are very general algorithms that can be applied to a broad array of problems.

The first argument of the metropolis function is a SciPy distribution; we are assuming we do not know how to directly get samples from this distribution:

import numpy as np

def metropolis(func, draws=10000):
    """A very simple Metropolis implementation"""
    trace = np.zeros(draws)
    old_x = 0.5  # func.mean()
    old_prob = func.pdf(old_x)

    # propose jumps from a Gaussian centered at 0 (a symmetric proposal)
    delta = np.random.normal(0, 0.5, draws)
    for i in range(draws):
        new_x = old_x + delta[i]  # perturb the current state
        new_prob = func.pdf(new_x)
        # Metropolis criterion: the ratio of the new and old densities
        acceptance = new_prob / old_prob
        if acceptance >= np.random.random():
            trace[i] = new_x  # accept: move to the new state
            old_x = new_x
            old_prob = new_prob
        else:
            trace[i] = old_x  # reject: record the old state again

    return trace

In the following example, we have defined func as a beta distribution, simply because it is easy to change its parameters and get different shapes. We plot the samples obtained by metropolis as a histogram, together with the true distribution as a continuous (orange) line:

from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(3)
func = stats.beta(2, 5)
trace = metropolis(func=func)
x = np.linspace(0.01, 0.99, 100)
y = func.pdf(x)
plt.xlim(0, 1)
plt.plot(x, y, 'C1-', lw=3, label='True distribution')
plt.hist(trace[trace > 0], bins=25, density=True, label='Estimated distribution')
plt.xlabel('x')
plt.ylabel('pdf(x)')
plt.yticks([])
plt.legend();

Figure 8.5
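As mentioned earlier, analyzing the posterior is just manipulating an array of values. For instance, we can summarize the trace we just generated (the 94% interval is an arbitrary, illustrative choice):

print(f'posterior mean: {trace.mean():.2f}')
print(f'94% credible interval: {np.percentile(trace, [3, 97])}')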

The efficiency of the algorithm depends heavily on the proposal distribution: if the proposed state is very far away from the current state, the chance of rejection is very high, and if it is very close, we explore the parameter space very slowly. In both scenarios, we need many more samples than in a less extreme situation. Usually, the proposal is a multivariate Gaussian distribution whose covariance matrix is determined during a tuning phase. PyMC3 tunes the covariance adaptively, following the rule of thumb that the ideal acceptance rate goes from around 44% for a unidimensional Gaussian target distribution down to around 23% as the dimensionality grows.
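To see this effect, here is a small experiment using a hypothetical variant of our metropolis function with a tunable proposal width (metropolis_scaled is our name, not part of any library):

def metropolis_scaled(func, draws=10000, scale=0.5):
    """Like metropolis, but with a tunable proposal width; also returns the acceptance rate."""
    trace = np.zeros(draws)
    old_x = 0.5
    old_prob = func.pdf(old_x)
    accepted = 0
    for i in range(draws):
        new_x = old_x + np.random.normal(0, scale)  # proposal width set by scale
        new_prob = func.pdf(new_x)
        if new_prob / old_prob >= np.random.random():
            old_x, old_prob = new_x, new_prob
            accepted += 1
        trace[i] = old_x
    return trace, accepted / draws

for scale in [0.001, 0.5, 50.]:
    _, rate = metropolis_scaled(stats.beta(2, 5), scale=scale)
    print(f'scale={scale}: acceptance rate {rate:.2f}')

Tiny steps are accepted almost always but barely move, while huge steps are rejected almost always; both extremes waste samples.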

MCMC methods often take some time before they start generating samples from the target distribution. So, in practice, people perform a burn-in step, which consists of eliminating the first portion of the samples. Burn-in is a practical trick and not part of Markovian theory; in fact, it would be unnecessary if we could draw an infinite number of samples. Removing the first portion of the samples is just an ad hoc way to get better results, given that we can only compute a finite number of them. Having theoretical guarantees or guidance is better than not having them, but for any practical problem, it is important to understand the difference between theory and practice. Remember, we should not get confused by mixing mathematical objects with the approximations of those objects. Spheres, Gaussians, Markov chains, and all the other mathematical objects live only in the Platonic world of ideas, not in our imperfect, real world.
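With our trace, burn-in is just array slicing (the cutoff of 1,000 samples below is an arbitrary, illustrative choice):

burned_trace = trace[1000:]  # discard the first 1,000 draws as burn-in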

At this point I hope you have a good conceptual grasp of the Metropolis-Hastings method. You may need to go back and read this section one or more times; that's totally fine. The main ideas are simple but also subtle.
