Kullback-Leibler divergence

Now, I would like to briefly talk about the Kullback-Leibler (KL) divergence, or just KL divergence. This is a concept that you may find when reading about statistics, machine learning, information theory, or statistical mechanics. You may argue that the reason for the recurrence of the KL divergence, as well as of other concepts such as entropy or the marginal likelihood, is simply that, at least partially, all of these disciplines are discussing the same sets of problems, just from slightly different perspectives.

The KL divergence is useful because it is a way of measuring how close two distributions are, and it is defined as follows:

$$\mathbb{D}_{KL}(p \parallel q) = \sum_i p_i \log{\frac{p_i}{q_i}} \tag{5.16}$$

This reads as the Kullback-Leibler divergence from q to p (yes, you have to read it backwards), where p and q are two probability distributions. For continuous variables, you should compute an integral instead of a summation, but the main idea is the same.
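To see the definition in action, here is a minimal sketch that computes the KL divergence directly from the formula for two made-up discrete distributions and checks the result against SciPy; the arrays p and q below are purely illustrative and are not taken from the chapter's examples:

import numpy as np
from scipy import stats

# two made-up discrete distributions over the same three events
p = np.array([0.5, 0.3, 0.2])   # plays the role of the "true" distribution
q = np.array([0.4, 0.4, 0.2])   # an approximation to p

# KL divergence straight from the definition: sum_i p_i * log(p_i / q_i)
kl_manual = np.sum(p * np.log(p / q))

# SciPy returns the same quantity when entropy() is given two arguments
kl_scipy = stats.entropy(p, q)

print(kl_manual, kl_scipy)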

We can interpret the $\mathbb{D}_{KL}(p \parallel q)$ divergence as the extra entropy or uncertainty that's introduced by using the probability distribution q to approximate the distribution p. In fact, the KL divergence is the difference between two entropies:

$$\mathbb{D}_{KL}(p \parallel q) = \underbrace{-\sum_i p_i \log{q_i}}_{H(p, q)} - \underbrace{\left(-\sum_i p_i \log{p_i}\right)}_{H(p)} \tag{5.17}$$

By using the properties of logarithms, we can rearrange equation 5.17 to recover equation 5.16. For this reason, we can also read $\mathbb{D}_{KL}(p \parallel q)$ as the relative entropy of p with respect to q (this time, we read it forward).
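As a quick numerical check of equation 5.17, the following sketch (again using made-up arrays that are not from the chapter's examples) verifies that the KL divergence equals the cross-entropy H(p, q) minus the entropy H(p):

import numpy as np
from scipy import stats

p = np.array([0.5, 0.3, 0.2])   # made-up "true" distribution
q = np.array([0.4, 0.4, 0.2])   # made-up approximation to p

entropy_p = -np.sum(p * np.log(p))       # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)

# the difference between the two entropies is the KL divergence
print(cross_entropy - entropy_p, stats.entropy(p, q))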

As a simple example, we can use the KL divergence to evaluate which distribution, q or r, is a better approximation to true_distribution. Using SciPy, we can compute $\mathbb{D}_{KL}(\text{true\_distribution} \parallel q)$ and $\mathbb{D}_{KL}(\text{true\_distribution} \parallel r)$:

from scipy import stats

stats.entropy(true_distribution, q_pmf), stats.entropy(true_distribution, r_pmf)

If you run the previous block of code, you will get two numbers, with the first one smaller than the second. Thus, we can conclude that q is a better approximation to true_distribution than r, because it is the one introducing less extra uncertainty. I hope you agree with me that this numerical result matches what you expected from inspecting Figure 5.15.
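The variables true_distribution, q_pmf, and r_pmf were built earlier in the chapter for the example in Figure 5.15. If you do not have that code at hand, the following self-contained sketch reproduces the same kind of comparison; the binomial pmfs below are hypothetical stand-ins chosen for illustration, not the ones used in the figure:

import numpy as np
from scipy import stats

# hypothetical stand-ins for the pmfs built for Figure 5.15
values = np.arange(6)
true_distribution = stats.binom(5, 0.5).pmf(values)  # the "true" pmf
q_pmf = stats.binom(5, 0.4).pmf(values)              # closer to the truth
r_pmf = stats.binom(5, 0.8).pmf(values)              # farther from the truth

print(stats.entropy(true_distribution, q_pmf),
      stats.entropy(true_distribution, r_pmf))
# the smaller first value identifies q_pmf as the better approximation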

You may be tempted to describe the KL divergence as a distance, but it is not symmetric and thus is not a real distance. If you run the following block of code, you will see that the two values are not the same. In this example, r is a better approximation of q than the other way around:

stats.entropy(r_pmf, q_pmf), stats.entropy(q_pmf, r_pmf)

$\mathbb{D}_{KL}(p \parallel q)$ indicates how well q approximates p, and we can also think of it in terms of surprise, that is, how surprised we will be if we see q when we expect p. How surprised we are by an event depends on the information we use to judge that event. I grew up in a very arid city with maybe one or two real rainstorms a year. Then I moved to another province to go to college, and I was really shocked that, at least during the wet season, there was, on average, one real rainstorm every week! Some of my classmates were from Buenos Aires, one of the most humid and rainy provinces in Argentina. For them, the frequency of rain was more or less expected. What's more, they thought that it could rain a little more, as the air was not that humid.

We could use the KL divergence to compare models, as this would give us a measure of which model's posterior is closer to the true distribution. The problem is that we do not know the true distribution, so the KL divergence is not directly applicable. Nevertheless, we can use it as an argument to justify the use of the deviance (expression 5.3). If we assume that a true distribution exists, then, as shown in the following equation, that distribution is independent of any model and of any constants, and thus it affects the value of the KL divergence in the same way irrespective of the (posterior) distribution we use to approximate it. Therefore, we can use the deviance, that is, the part that depends on each model, to estimate how close we are to the true distribution, even when we do not know it. From equation 5.17, and by using a little bit of algebra, we can see the following:

$$\mathbb{D}_{KL}(p \parallel q) = \underbrace{\sum_i p_i \log{p_i}}_{\text{depends only on } p} - \underbrace{\sum_i p_i \log{q_i}}_{\text{depends on the model } q}$$

Even if we do not know p, we can conclude that the distribution with the larger $\sum_i p_i \log{q_i}$ term (the log-likelihood, or the deviance, if you want) is the distribution closer, in terms of KL divergence, to the true distribution. In practice, the log-likelihood/deviance is obtained from a model that has been fitted to a finite sample. Therefore, we must also add a penalization term to correct for the overestimation of the deviance, and that leads us to WAIC and other information criteria.
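The following sketch illustrates this argument numerically with hypothetical discrete distributions (the same made-up binomial pmfs as before, not the ones from the chapter): the term that depends only on the true distribution cancels when we compare two models, so ranking models by $\sum_i p_i \log{q_i}$, which in practice we estimate with the average log-likelihood of a finite sample, ranks them exactly as their KL divergence to the true distribution would:

import numpy as np
from scipy import stats

# hypothetical true distribution and two candidate models
values = np.arange(6)
p = stats.binom(5, 0.5).pmf(values)
q = stats.binom(5, 0.4).pmf(values)
r = stats.binom(5, 0.8).pmf(values)

# the term that depends only on p cancels when comparing models:
# D_KL(p||q) - D_KL(p||r) = E_p[log r] - E_p[log q]
e_log_q = np.sum(p * np.log(q))
e_log_r = np.sum(p * np.log(r))
print(stats.entropy(p, q) - stats.entropy(p, r), e_log_r - e_log_q)  # same number

# in practice, E_p[log q] is estimated by the average log-likelihood of a finite sample
rng = np.random.default_rng(123)
sample = rng.choice(values, size=1000, p=p)
print(np.mean(np.log(q[sample])), np.mean(np.log(r[sample])))  # larger value, closer model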
