As mentioned earlier, we started with the coin flip examples because of the ease of determining the posterior distribution analytically, primarily because of the beta distribution's conjugacy with respect to the binomial likelihood function (a beta prior yields a beta posterior).
It turns out that most real-world Bayesian analyses require a more complicated solution. In particular, the hyper-parameters that define the posterior distribution are rarely known. What can be determined is the probability density of the posterior distribution at any given parameter value. The easiest way to get a sense of the shape of the posterior is to sample from it many thousands of times. More specifically, we sample points from the space of all possible parameter values and record the probability density at each point.
How do we do this? Well, in the case of just one parameter value, it's often computationally tractable to just randomly sample willy-nilly from the space of all possible parameter values. For cases where we are using Bayesian analysis to determine the credible values for two parameters, things get a little more hairy.
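To make the single-parameter case concrete, here is a minimal sketch in Python of this naive random sampling. The coin-flip numbers (30 heads in 50 flips) and the flat prior are hypothetical choices for illustration, not values carried over from our earlier examples:

```python
# A sketch of naive random sampling for a one-parameter posterior.
# The coin-flip data (30 heads in 50 flips) and the flat prior are
# hypothetical choices for illustration.
import numpy as np

heads, flips = 30, 50

def unnormalized_posterior(theta):
    # binomial likelihood kernel times a flat prior on [0, 1]
    return theta**heads * (1 - theta)**(flips - heads)

rng = np.random.default_rng(seed=1)
thetas = rng.uniform(0, 1, size=100_000)     # sample willy-nilly
densities = unnormalized_posterior(thetas)   # record the density at each point

# a histogram of thetas, weighted by densities, approximates the posterior
```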
The posterior distribution for more than one parameter value is called a joint distribution; in the case of two parameters, it is, more specifically, a bivariate distribution. One such bivariate distribution can be seen in Figure 7.10:
To picture what it is like to sample a bivariate posterior, imagine placing a bell jar on top of a piece of graph paper (be careful to make sure Esther Greenwood isn't under there!). We don't know the shape of the bell jar, but we can, for each intersection of the lines in the graph paper, find the height of the bell jar over that exact point. Clearly, the smaller the grid on the graph paper, the higher the resolution of our estimate of the posterior distribution.
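As a sketch of this graph-paper idea, the following Python snippet evaluates a two-parameter posterior's density at every grid intersection. The normal likelihood, the made-up observations, and the grid bounds are all hypothetical stand-ins for a real model:

```python
# The "graph paper" idea: evaluate a bivariate posterior's density at
# every grid intersection. The likelihood, data, and grid bounds here
# are hypothetical, chosen only to illustrate the mechanics.
import numpy as np
from scipy.stats import norm

data = np.array([4.8, 5.2, 4.9, 5.5, 5.1])    # made-up observations

mus = np.linspace(4, 6, 200)                   # grid for the first parameter
sigmas = np.linspace(0.1, 2, 200)              # grid for the second parameter
mu_grid, sigma_grid = np.meshgrid(mus, sigmas)

# log-likelihood of the data at every (mu, sigma) intersection;
# with flat priors, this is the unnormalized log-posterior
log_post = sum(norm.logpdf(x, mu_grid, sigma_grid) for x in data)
posterior = np.exp(log_post - log_post.max())  # the "bell jar" heights
```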
Note that in the univariate case, we were sampling from n points; in the bivariate case, we are sampling from n² points (n points for each axis). For models with more than two parameters, it is simply intractable to use this random sampling method. Luckily, there's a better option than just randomly sampling the parameter space: Markov chain Monte Carlo (MCMC).
I think the easiest way to get a sense of what MCMC is would be to liken it to the game of hot and cold. In this game, which you may have played as a child, an object is hidden and a searcher is blindfolded and tasked with finding this object. As the searcher wanders around, the other player tells the searcher whether she is hot or cold: hot when she is near the object, cold when she is far from it. The other player also indicates whether the movement of the searcher is getting her closer to the object (getting warmer) or further from it (getting cooler).
In this analogy, warm regions are areas where the probability density of the posterior distribution is high, and cool regions are areas where the density is low. Put this way, random sampling is like the searcher teleporting to random places in the space where the other player hid the object and just recording how hot or cold it is at each point. The guided search we described before is far more efficient at exploring the areas of interest in the space.
At any one point, the blindfolded searcher has no memory of where she has been before. Her next position only depends on the point she is at currently (and the feedback of the other player). A memory-less transition process whereby the next position depends only upon the current position, and not on any previous positions, is called a Markov chain.
The technique for determining the shape of high-dimensional posterior distributions is therefore called Markov chain Monte Carlo, because it uses Markov chains to intelligently sample many times from the posterior distribution (Monte Carlo simulation).
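To make this less abstract, here is a minimal Metropolis sampler in Python, one of the simplest flavors of MCMC. This is a sketch of the general idea, not the algorithm any particular package uses; the target is the same hypothetical coin-flip posterior as in the earlier snippet:

```python
# A minimal Metropolis sampler, sketching the MCMC idea. The target
# (30 heads in 50 flips, flat prior) is hypothetical, as before.
import numpy as np

heads, flips = 30, 50

def log_posterior(theta):
    if not 0 < theta < 1:
        return -np.inf                 # zero density outside [0, 1]
    return heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

rng = np.random.default_rng(seed=1)
theta, samples = 0.5, []
for _ in range(50_000):
    proposal = theta + rng.normal(0, 0.05)   # step from the current position
    # accept with probability min(1, posterior ratio): "warmer" moves
    # are always taken, "cooler" moves only sometimes
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)                    # the Markov chain
```

Notice that each proposal depends only on the chain's current position, which is exactly the Markov property we just described.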
The development of software to perform MCMC on commodity hardware is, for the most part, responsible for a Bayesian renaissance in recent decades. Problems that were, not too long ago, completely intractable can now be solved on even relatively low-powered computers.
There is far more to know about MCMC than we have the space to discuss here. Luckily, we will be using software that abstracts some of these deeper topics away from us. Nevertheless, if you decide to use Bayesian methods in your own analyses (and I hope you do!), I'd strongly recommend consulting resources that can afford to discuss MCMC at a deeper level. There are many such resources, available for free, on the web.
Before we move on to examples using this method, it is important that we bring up this one last point: mathematically, an infinitely long MCMC chain would give us a perfect picture of the posterior distribution. Unfortunately, we don't have all the time in the world (universe [?]), and we have to settle for a finite number of MCMC samples. The longer our chains, the more accurate the description of the posterior. As the chains get longer and longer, each new sample provides a smaller and smaller amount of new information (economists call this diminishing marginal returns). There is a point in the MCMC sampling at which the description of the posterior becomes sufficiently stable and, for all practical purposes, further sampling is unnecessary. It is at this point that we say the chain has converged. Unfortunately, there is no perfect guarantee that our chain has achieved convergence. Of all the criticisms of using Bayesian methods, this is the most legitimate, but only slightly.
There are really effective heuristics for determining whether a running chain has converged, and we will be using a function that will automatically stop sampling the posterior once it has achieved convergence. Further, convergence can be all but perfectly verified by visual inspection, as we'll see soon.
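As one example of such a heuristic, here is a hand-rolled Python sketch of the Gelman-Rubin diagnostic (R-hat): run several chains from different, dispersed starting points and compare the within-chain variance to the between-chain variance. This illustrates the idea, and is not necessarily the exact check the software we'll use performs:

```python
# A sketch of the Gelman-Rubin diagnostic (R-hat). Values near 1
# suggest the chains agree on the same distribution; values well
# above 1 (say, > 1.1) suggest we should keep sampling.
import numpy as np

def r_hat(chains):
    chains = np.asarray(chains)        # shape: (n_chains, n_samples)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()       # W: within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: between-chain variance
    var_est = (n - 1) / n * within + between / n     # pooled variance estimate
    return np.sqrt(var_est / within)

# e.g., r_hat([chain_1, chain_2, chain_3]), where each chain holds the
# samples from a run of the Metropolis sketch above, each started at a
# different value of theta
```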
For the simple models in this chapter, none of this will be a problem, anyway.