Entropy

Now, I would like to briefly talk about the concept of entropy. Mathematically, for a discrete distribution $p$, we can define it as follows:

$$H(p) = -\mathbb{E}_p[\log p] = -\sum_{i} p(x_i) \log p(x_i)$$

Intuitively, the more spread out a distribution is, the larger its entropy. We can see this by running the following block of code and inspecting Figure 5.15:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(912)
x = range(0, 10)
q = stats.binom(10, 0.75)
r = stats.randint(0, 10)

# empirical distribution built from 200 draws from q
true_distribution = [list(q.rvs(200)).count(i) / 200 for i in x]

q_pmf = q.pmf(x)
r_pmf = r.pmf(x)

_, ax = plt.subplots(1, 3, figsize=(12, 4), sharey=True,
                     constrained_layout=True)

for idx, (dist, label) in enumerate(zip([true_distribution, q_pmf, r_pmf],
                                        ['true_distribution', 'q', 'r'])):
    ax[idx].vlines(x, 0, dist, label=f'entropy = {stats.entropy(dist):.2f}')
    ax[idx].set_title(label)
    ax[idx].set_xticks(x)
    ax[idx].legend(loc=2, handlelength=0)

Figure 5.15

As we can see, distribution r in the preceding diagram is the most spread out of the three distributions and is also the one with the largest entropy. I suggest that you play with the code and explore how entropy changes (see exercise 10 for more information).
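As a quick example of such an exploration (this snippet is not from the book), we can compute the entropy of a Binomial(10, p) for a few values of p; the distribution is most spread out at p = 0.5, and its entropy drops as p approaches 1 and the probability mass concentrates:

# Entropy of Binomial(10, p) for several values of p
# (assumes np and stats are imported as in the previous block)
for p in [0.5, 0.75, 0.9, 0.99]:
    pmf = stats.binom(10, p).pmf(np.arange(11))
    print(f'p = {p:4.2f}  entropy = {stats.entropy(pmf):.2f}')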

Following the previous example, you may be tempted to declare entropy a weird way of measuring the variance of a distribution. While both concepts are related, they are not the same. Under some circumstances, an increase in entropy implies an increase in variance; this is the case for a Gaussian distribution. However, we can also have examples where the variance increases and the entropy does not. We can understand why this happens without being very rigorous. Suppose we have a distribution that is a mixture of two Gaussians (we will discuss mixture distributions in detail in Chapter 6, Mixture Models). As we increase the distance between the modes, we move the bulk of the points further from the mean, and the variance is precisely the average squared distance of all points from the mean. So, if we keep increasing the distance, the variance will keep increasing without limit. The entropy will be less affected because, as we increase the distance between the modes, the points between the modes have less and less probability, and thus their contribution to the total entropy becomes negligible. From the perspective of entropy, if we start from two overlapping Gaussians and then move one with respect to the other, at some point we will simply have two well-separated Gaussians, and increasing the distance further barely changes the entropy, even though the variance keeps growing.
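To make this concrete, here is a minimal numerical sketch (not from the book; the grid limits and the separations d are arbitrary choices). It evaluates an equal-weight mixture of two unit-variance Gaussians on a fine grid and approximates its variance and differential entropy by numerical integration. The variance keeps growing with the separation, while the entropy levels off once the two components stop overlapping:

import numpy as np
from scipy import stats

grid = np.linspace(-50, 50, 20001)
dx = grid[1] - grid[0]

for d in [0, 2, 5, 10, 20]:
    # equal-weight mixture of N(-d/2, 1) and N(+d/2, 1)
    pdf = 0.5 * stats.norm(-d / 2, 1).pdf(grid) + 0.5 * stats.norm(d / 2, 1).pdf(grid)
    mean = np.sum(grid * pdf) * dx
    variance = np.sum((grid - mean)**2 * pdf) * dx
    entropy = -np.sum(pdf * np.log(pdf + 1e-300)) * dx  # differential entropy, in nats
    print(f'd = {d:2d}  variance = {variance:6.1f}  entropy = {entropy:.2f}')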

Entropy is also related to the concept of information or, equivalently, its counterpart, uncertainty. In fact, we have been saying throughout this book that a more spread-out or flat prior distribution is a less informative one. This is not only intuitively correct but also has theoretical support from the concept of entropy. In fact, there is a tribe among Bayesians that uses entropy to justify their weakly informative or regularizing priors. This is usually known as the Maximum Entropy principle: we want to find the distribution with the largest possible entropy (the least informative one), while also taking into account the constraints defined by our problem. This is an optimization problem that can be solved mathematically, but we will not look at the details of those computations; instead, I will provide some examples. The distributions with the largest entropy under the following constraints are:

  • No constraints: Uniform (continuous or discrete, according to the type of variable)
  • A positive mean, for a variable restricted to positive values: Exponential
  • A given variance: Normal distribution (see the sketch after this list)
  • Only two unordered outcomes and a constant mean: Binomial, or the Poisson if we have rare events (remember that the Poisson can be obtained as a limiting case of the binomial)
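As a concrete check of the variance constraint (a sketch, not from the book), we can compare the differential entropy of a normal, a Laplace, and a uniform distribution whose scale parameters are chosen so that all three have the same variance; the normal comes out with the largest entropy, as the maximum entropy principle predicts:

import numpy as np
from scipy import stats

target_var = 1.0
same_variance = {
    'normal': stats.norm(0, np.sqrt(target_var)),
    'laplace': stats.laplace(0, np.sqrt(target_var / 2)),   # variance = 2 * scale**2
    'uniform': stats.uniform(-np.sqrt(3 * target_var),
                             2 * np.sqrt(3 * target_var)),  # variance = width**2 / 12
}

for name, dist in same_variance.items():
    print(f'{name:8s} variance = {dist.var():.2f}  entropy = {dist.entropy():.3f}')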

It is interesting to note that many of the generalized linear models, like the ones we saw in Chapter 4, Generalizing Linear Models, are traditionally defined using maximum entropy distributions, given the constraints of the models.
