40 MLE and information theory
We will show here how maximizing the log likelihood naturally connects to minimizing the Kullback-Leibler (KL) divergence.
We will start from a true data generating distribution P(x), and we will assume we have n independent and identically distributed (i.i.d.) observations x_1, x_2, \ldots, x_n drawn from it. We model the data using a parametric family of distributions Q(x | \theta), where \theta is the parameter we want to estimate. Our goal is to find the parameters \theta that make the model distribution Q(x | \theta) as close as possible to the true distribution P(x). This set of parameters is typically found using Maximum Likelihood Estimation (MLE):
\hat{\theta} = \arg\max_{\theta} \ell(\theta \mid x_{1:n})
In English: our best estimate \hat{\theta} of the parameters is the value of \theta (the argument of the maximum, hence \arg\max) that maximizes the log likelihood of the observed data x_{1:n} = (x_1, x_2, \ldots, x_n).
The log likelihood function for the observed data given the parameters \theta is defined as:
\ell(\theta \mid x_{1:n}) = \sum_{i=1}^{n} \log Q(x_i | \theta)
Here we used Q to denote the model distribution, to distinguish it from the true data generating distribution P. We will now divide both sides by n to express the average log likelihood per observation:
\frac{1}{n} \ell(\theta \mid x_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} \log Q(x_i | \theta)
This step can be justified in a few ways:
- Dividing by a constant does not change the location of the maximum.
- It is clear that \ell scales linearly with n, that is, the more data we have, the larger the log likelihood becomes in magnitude. Dividing by n normalizes this effect (see the short numerical sketch after this list).
- I already know where I want to go with this, and this step helps me :)
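To make the first two points concrete, here is a minimal numerical sketch (my own illustration, not part of the derivation; the Bernoulli model, the true parameter 0.7, and the grid of candidate values are all made up): we evaluate the Bernoulli log likelihood on a grid of \theta values and check that \ell and \ell / n peak at the same place.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)   # made-up i.i.d. draws from a Bernoulli(0.7) "true" P

def log_likelihood(theta, x):
    """ell(theta | x_{1:n}) = sum_i log Q(x_i | theta) for a Bernoulli model Q."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)                    # grid of candidate parameters
ll = np.array([log_likelihood(t, x) for t in thetas])

# Dividing by n rescales the curve but does not move its maximum.
print(thetas[np.argmax(ll)], thetas[np.argmax(ll / len(x))])   # same theta_hat, close to x.mean()
```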
Note that now the right-hand side is the sample mean of the log probabilities \log Q(x | \theta) evaluated at the observed data points x_i. This sample mean can also be understood as an empirical expectation over the observed data, where each data point is given equal weight 1/n:
\begin{align*} \frac{1}{n} \ell(\theta \mid x_{1:n}) &= \frac{1}{n} \sum_{i=1}^{n} \log Q(x_i | \theta) \\ &= \mathbb{E}_{x \sim \hat{P}_n}[\log Q(x | \theta)], \end{align*}
where the subscript x \sim \hat{P}_n indicates that the expectation (\mathbb{E}) is taken with respect to the empirical distribution \hat{P}_n defined by the observed data points x_1, x_2, \ldots, x_n.
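As a quick sanity check of this identity (again with made-up Bernoulli data and an arbitrary candidate \theta = 0.6), the plain sample mean of \log Q(x_i | \theta) matches the expectation of \log Q(x | \theta) under the empirical distribution \hat{P}_n that puts weight count(v)/n on each observed value v:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)   # made-up Bernoulli data, as in the sketch above
theta = 0.6                           # an arbitrary candidate parameter

# Sample mean of log Q(x_i | theta).
sample_mean = np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Expectation under the empirical distribution \hat{P}_n (weight count(v)/n per value v).
values, counts = np.unique(x, return_counts=True)
weights = counts / counts.sum()
log_q = values * np.log(theta) + (1 - values) * np.log(1 - theta)
empirical_expectation = np.sum(weights * log_q)

print(np.isclose(sample_mean, empirical_expectation))   # True
```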
Now, what happens as we collect more and more data, i.e., as n approaches infinity? By the law of large numbers, the sample mean converges to the expectation under the true distribution: our “data-driven” distribution \hat{P}_n starts to look exactly like the “true” distribution P, therefore:
\begin{align*} \lim_{n \to \infty} \mathbb{E}_{x \sim \hat{P}_n}[\log Q(x | \theta)] &= \mathbb{E}_{x \sim P}[\log Q(x | \theta)] \\ &= \sum_{x} P(x) \log Q(x | \theta) \end{align*}
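A small simulation makes this limit tangible (same made-up Bernoulli setup; \theta = 0.6 is an arbitrary fixed model parameter): as n grows, the empirical average log likelihood approaches the exact expectation \sum_{x} P(x) \log Q(x | \theta).

```python
import numpy as np

rng = np.random.default_rng(0)
p, theta = 0.7, 0.6   # made-up true parameter and a fixed model parameter

# Exact population expectation: sum over x in {0, 1} of P(x) log Q(x | theta).
exact = (1 - p) * np.log(1 - theta) + p * np.log(theta)

for n in (10, 1_000, 100_000):
    x = rng.binomial(1, p, size=n)
    empirical = np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    print(n, empirical, exact)   # the gap shrinks as n grows
```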
We’re almost there. Remember the cross-entropy from the chapter on cross-entropy and KL divergence? We defined it as:
H(P, Q) = - \sum_x P(x) \log Q(x),
which is exactly the negative of what we obtained above. In other words, maximizing the average log likelihood is equivalent to minimizing the cross-entropy between the true distribution P and the model distribution Q, with the caveat that the equivalence with the true P holds only in the limit of infinite data (for finite n, we are minimizing the cross-entropy between the empirical distribution \hat{P}_n and Q). The very last step is to connect this to the KL divergence, which we defined as:
D_{KL}(P \| Q) = H(P, Q) - H(P).
The term H(P) is the entropy of the true distribution P, which does not depend on the model parameters \theta. Therefore, minimizing the cross-entropy H(P, Q) is equivalent to minimizing the KL divergence D_{KL}(P \| Q).
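To close the loop, here is a tiny numerical check (with a made-up two-outcome P and the same Bernoulli-style model family as before): H(P, Q_\theta) and D_{KL}(P \| Q_\theta) differ only by the constant H(P), so they are minimized at the same \theta, which is also where \mathbb{E}_{x \sim P}[\log Q(x | \theta)] is maximized.

```python
import numpy as np

P = np.array([0.3, 0.7])                  # made-up true distribution over x in {0, 1}
thetas = np.linspace(0.01, 0.99, 99)      # grid of Bernoulli model parameters

H_P = -np.sum(P * np.log(P))              # entropy of P: constant in theta

def cross_entropy(theta):
    Q = np.array([1 - theta, theta])      # model distribution Q(. | theta)
    return -np.sum(P * np.log(Q))

H_PQ = np.array([cross_entropy(t) for t in thetas])
D_KL = H_PQ - H_P                         # KL divergence: cross-entropy minus a constant

# Maximizing E_P[log Q] (= -H_PQ), minimizing H(P, Q), and minimizing D_KL
# all pick out the same theta (here, approximately 0.7).
print(thetas[np.argmax(-H_PQ)], thetas[np.argmin(H_PQ)], thetas[np.argmin(D_KL)])
```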
40.1 Implication
Minimizing the KL divergence is ubiquitous in machine learning: in practice it usually appears as a cross-entropy loss used to train models. The connection between MLE and KL divergence derived above is what gives this common practice its theoretical foundation.