33 cross-entropy and KL divergence
Assume I live in city A, where it rains 50% of the days. A friend of mine lives in city B, where it rains 10% of the days.
33.1 wrong model
What happens when my friend visits me in city A and, not knowing any better, assumes that it rains 10% of the days?
We have two probability distributions, the true weather distribution P, and the assumed weather distribution Q:
- True distribution: P(rain) = 0.5, P(no rain) = 0.5
- Assumed distribution: Q(rain) = 0.1, Q(no rain) = 0.9
What will be my friend's expected surprise when he visits me in city A? Now that we have discussed surprise (information) and entropy, we can calculate the following quantity, called the cross-entropy:
H(P, Q) = - \sum_x P(x) \log Q(x)
My friend will evaluate his surprise using the mental model that he has, i.e., the assumed distribution Q. For example, because he comes from a dry city, every time it rains he is surprised a lot more than when it does not rain.
However, since my friend is visiting me in city A, he will actually experience the weather according to the true distribution P. He will not weigh the big surprise of rain with the probability of rain in city B (10%), but with the probability of rain in city A (50%).
This reasoning explains the asymmetry of the cross-entropy: the first argument is the true distribution, which determines how often each event happens, while the second argument is the assumed distribution, which determines how surprised my friend will be when each event happens.
Let’s compute my friend’s expected surprise when he visits me in city A:
\begin{align*} H(P, Q) &= - \sum_x P(x) \log Q(x) \\ &= - (P(\text{rain}) \log Q(\text{rain}) + P(\text{no rain}) \log Q(\text{no rain})) \\ &= - (0.5 \log 0.1 + 0.5 \log 0.9) \\ &= - (0.5 \cdot -3.322 + 0.5 \cdot -0.152) \\ &\approx 1.74 \text{ bits}, \end{align*}
where we used the base-2 logarithm, so the result is in bits.
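If you want to check this arithmetic yourself, here is a minimal Python sketch (the function name `cross_entropy` and the dictionaries `P` and `Q` are just illustrative names) that reproduces the 1.74 bits:

```python
from math import log2

# True weather distribution in city A, and my friend's assumed distribution from city B.
P = {"rain": 0.5, "no rain": 0.5}
Q = {"rain": 0.1, "no rain": 0.9}

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x)), in bits."""
    return -sum(p[x] * log2(q[x]) for x in p)

print(cross_entropy(P, Q))  # ~1.737 bits: my friend's expected surprise in city A
```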
To see the asymmetry of the cross-entropy, let’s compute my expected surprise when I visit my friend in city B:
\begin{align*} H(Q, P) &= - \sum_x Q(x) \log P(x) \\ &= - (Q(\text{rain}) \log P(\text{rain}) + Q(\text{no rain}) \log P(\text{no rain})) \\ &= - (0.1 \log 0.5 + 0.9 \log 0.5) \\ &= - (0.1 \cdot -1 + 0.9 \cdot -1) \\ &= 1 \text{ bit}. \end{align*}
My friend’s expected surprise will be higher when he visits me in city A (1.74 bits) than my expected surprise when I visit him in city B (1 bit).
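Continuing the sketch above (reusing `cross_entropy`, `P`, and `Q`), swapping the arguments reproduces both numbers and shows that the two directions differ:

```python
print(cross_entropy(P, Q))  # ~1.737 bits: my friend's expected surprise in city A
print(cross_entropy(Q, P))  # 1.0 bit: my expected surprise in city B
```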
33.2 Kullback-Leibler divergence
Not all of my friend’s surprise is due to the fact that he has an inaccurate mental model of the weather. Some of his surprise is simply due to the inherent randomness of the weather. This would be the same surprise that I myself experience when I live in city A, and I have the correct mental model of the weather.
It would make sense to separate the surprise that is due to the inherent randomness of the weather from the surprise that is due to my friend’s wrong mental model. We can do this by subtracting the entropy of the true distribution P from the cross-entropy H(P, Q):
D_{KL}(P \| Q) = H(P, Q) - H(P)
This quantity is called the Kullback-Leibler divergence, and it measures the amount of surprise that is only due to my friend’s wrong mental model.
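For the rain example, the true distribution P is a fair coin, so H(P) = 1 bit, and the extra surprise due to my friend’s wrong mental model is
\begin{align*} D_{KL}(P \| Q) = H(P, Q) - H(P) \approx 1.74 - 1 = 0.74 \text{ bits}. \end{align*}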
Let’s use the properties of logarithms to rewrite the KL divergence in a more convenient form:
\begin{align*} D_{KL}(P \| Q) &= H(P, Q) - H(P)\\ &= - \sum_x P(x) \log Q(x) + \sum_x P(x) \log P(x) \\ &= \sum_x P(x) (\log P(x) - \log Q(x)) \\ &= \sum_x P(x) \log \frac{P(x)}{Q(x)}. \end{align*}
This is the most common form of the KL divergence.
When our model is perfect, i.e., when P = Q, the KL divergence is zero (\log(P/P) = 0), because there is no extra surprise due to a wrong mental model. When our model is not perfect, the KL divergence is strictly positive (a result known as Gibbs' inequality): on average, we are always more surprised when our mental model is wrong.
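Here is a minimal sketch of this last form (reusing `P`, `Q`, and `log2` from the cross-entropy sketch above; `kl_divergence` is just an illustrative name), which also checks both properties numerically:

```python
def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p)

print(kl_divergence(P, Q))  # ~0.737 bits: extra surprise, equal to H(P, Q) - H(P)
print(kl_divergence(P, P))  # 0.0: a perfect model adds no extra surprise
```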
33.3 model training and objective functions
We might want to train a model that classifies photos. We have a dataset of photos, and for each photo we know the correct label (cat, dog, elephant, etc.). The goal of the model is to predict the correct label for each photo.
At every step of the training process, we need to evaluate how well the model is doing. The true data distribution P is given by the labels in the training dataset, while the model’s predicted distribution Q is given by the model’s output. Ideally, our model’s predicted distribution Q should be as close as possible to the true data distribution P. That sounds like a job for the KL divergence!
We will adjust the model’s parameters to minimize the KL divergence between the true data distribution P and the model’s predicted distribution Q. In practice, we will minimize the cross-entropy H(P, Q) instead of the KL divergence D_{KL}(P \| Q), because the entropy H(P) does not depend on the model’s parameters, and therefore does not affect the optimization process. Think about it: no matter what the model’s parameters are, the entropy of the true data distribution P will always be the same. So minimizing the KL divergence is equivalent to minimizing the cross-entropy. We don’t care if they differ by a constant.
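As a small illustration (a minimal sketch with made-up numbers, not any particular library's API): when the true distribution P is a one-hot training label, the cross-entropy reduces to the negative log-probability the model assigns to the correct class.

```python
from math import log2

# Hypothetical model output for one photo: a probability for each class.
q = {"cat": 0.7, "dog": 0.2, "elephant": 0.1}

# The training label "cat", written as a one-hot distribution P.
p = {"cat": 1.0, "dog": 0.0, "elephant": 0.0}

# H(P, Q) = -sum_x P(x) log Q(x); only the correct class contributes, since P(x) = 0 elsewhere.
loss = -sum(p[x] * log2(q[x]) for x in p if p[x] > 0)
print(loss)  # ~0.515 bits = -log2(0.7): small when the model is confident and correct
```

Deep-learning libraries usually compute this loss with the natural logarithm rather than the base-2 logarithm, which only rescales it by a constant and therefore does not change which parameters minimize it.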