Grid computing

Grid computing is a simple brute-force approach. Even if you cannot compute the whole posterior, you may be able to evaluate the prior and the likelihood point-wise; this is a pretty common scenario, if not the most common one. Assuming we want to compute the posterior for a single-parameter model, the grid approximation works as follows:

  1. Define a reasonable interval for the parameter (the prior should give you a hint).
  2. Place a grid of points (generally equidistant) on that interval.
  3. For each point in the grid, evaluate the prior and the likelihood at that point and multiply them together.

Optionally, we may normalize the computed values, that is, divide the value at each point by the sum of the values over all points.

The following block of code implements the grid approach to compute the posterior for the coin-flipping model:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def posterior_grid(grid_points=50, heads=6, tails=9):
    """
    A grid implementation for the coin-flipping problem
    """
    grid = np.linspace(0, 1, grid_points)
    prior = np.repeat(1/grid_points, grid_points)  # uniform prior
    likelihood = stats.binom.pmf(heads, heads+tails, grid)
    posterior = likelihood * prior
    posterior /= posterior.sum()  # normalize so the values sum to 1
    return grid, posterior

Assuming we flipped a coin 13 times and we observed three heads:

data = np.repeat([0, 1], (10, 3))  # 10 tails (0) and 3 heads (1)
points = 10
h = data.sum()     # number of heads
t = len(data) - h  # number of tails
grid, posterior = posterior_grid(points, h, t)

plt.plot(grid, posterior, 'o-')
plt.title(f'heads = {h}, tails = {t}')
plt.yticks([])
plt.xlabel('θ');

Figure 8.1

It is easy to notice that a larger number of points (or, equivalently, a smaller spacing between them) results in a better approximation. In fact, in the limit of an infinite number of points we recover the exact posterior, at the cost of ever-increasing computational resources.
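We can check this numerically: with a uniform prior, the exact posterior for this model is the Beta(h+1, t+1) distribution, so we can compare the grid values against it as the number of points grows. The following is a minimal sketch (it reuses posterior_grid, h, and t from the previous blocks; the grid sizes are arbitrary choices):

# Turn the normalized grid values back into densities (divide by the
# grid spacing) and compare them with the exact Beta(h+1, t+1) pdf.
for n in (10, 100, 1000):
    grid, posterior = posterior_grid(n, h, t)
    approx_density = posterior / (grid[1] - grid[0])
    exact_density = stats.beta.pdf(grid, h + 1, t + 1)
    print(f'{n:>5} points -> max abs error = '
          f'{np.abs(approx_density - exact_density).max():.4f}')

The maximum absolute error shrinks as the grid gets finer, in line with the limit argument above.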

The biggest caveat of the grid approach is that it scales poorly with the number of parameters, also referred to as dimensions. We can see this with a simple example. Suppose we want to sample the unit interval (see Figure 8.2), as in the coin-flipping problem, and we use four equidistant points; this means a resolution of 0.25 units. Now suppose we have a 2D problem (the square in Figure 8.2) and we want a grid with the same resolution: we will need 16 points. For a 3D problem, we will need 64 (see the cube in Figure 8.2). In this example, we need 16 times as many resources to sample from a cube of side 1 as from a line of length 1 at a resolution of 0.25. If we instead decide we need a resolution of 0.1 units, we will have to sample 10 points for the line and 1,000 for the cube:

Figure 8.2
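The arithmetic behind this growth is simple enough to sketch in a few lines (the dimension counts below are just illustrative):

# Number of grid points needed to keep a fixed resolution grows
# exponentially with the number of dimensions: points_per_dim ** dims.
for points_per_dim in (4, 10):
    for dims in (1, 2, 3, 10):
        print(f'{points_per_dim} points per dimension, {dims:>2}D -> '
              f'{points_per_dim ** dims:,} grid points')

At 10 points per dimension, a 10-dimensional problem already requires ten billion grid evaluations.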

Besides how the number of points increases, there is another phenomenon at play, one that is not a property of the grid method, or of any other method for that matter, but of high-dimensional spaces themselves. As you increase the number of parameters, the region of the parameter space where most of the posterior is concentrated gets smaller and smaller relative to the sampled volume. This is a pervasive phenomenon in statistics and machine learning and is usually known as the curse of dimensionality or, as mathematicians prefer to call it, the concentration of measure.

The curse of dimensionality is the name used to talk about various related phenomena that are absent in low-dimensional spaces but present in high-dimensional ones. Here are some examples of these phenomena:

  • As the number of dimensions increases, the Euclidean distances between pairs of samples become more and more similar. That is, in high-dimensional spaces, most points are basically at the same distance from one another (see the sketch after this list).
  • For a hypercube, most of the volume is at its corners, not in the middle. For a hypersphere, most of the volume is at its surface and not in the middle.
  • In high dimensions, most of the mass of a multivariate Gaussian distribution is not close to the mean (or mode), but in a shell around it that moves away from the mean toward the tails as the dimensionality increases. This shell is known as the typical set.

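Here is a minimal sketch of the first and last of these facts; the sample sizes, dimensions, and seed are arbitrary choices for illustration:

# Two curse-of-dimensionality phenomena, checked with random samples.
rng = np.random.default_rng(42)

for dims in (2, 10, 100, 1000):
    # Distance concentration: as dims grows, the ratio between the
    # closest and farthest pair of uniform samples approaches 1.
    x = rng.uniform(size=(100, dims))
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dists = dists[np.triu_indices_from(dists, k=1)]
    # Typical set: standard Gaussian samples concentrate in a shell
    # at distance ~sqrt(dims) from the mode, not near the mode itself.
    z = rng.standard_normal(size=(1000, dims))
    radii = np.linalg.norm(z, axis=1)
    print(f'{dims:>4}D  min/max pair distance = '
          f'{dists.min() / dists.max():.2f}  '
          f'Gaussian radius = {radii.mean():.1f} ± {radii.std():.1f} '
          f'(sqrt(dims) = {dims**0.5:.1f})')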
For more code examples of these facts, please check out https://github.com/aloctavodia/BAP

For our current discussion, all these facts mean that if we do not choose wisely where to evaluate the posterior, we will spend most of our time computing values that contribute almost nothing to it, wasting valuable resources. The grid method is not a smart way of choosing where to evaluate the posterior distribution, which makes it of little use as a general method for high-dimensional problems.
