37  maximum likelihood estimation

The one idea that permeates all of statistics and machine learning is that of maximum likelihood estimation. I cannot overstate its centrality.

In the chapter on probability and likelihood, we solved a practical example of estimating the rate parameter of an exponential distribution using maximum likelihood estimation. For the case where we observed many samples, the familiar formula for the sample mean emerged naturally: the maximum likelihood estimate of the rate is the reciprocal of the sample mean. We will dig deeper into this idea and show its connections to many other statistical concepts.
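As a quick numerical sanity check of that earlier result, here is a minimal sketch that maximizes the exponential likelihood directly over some simulated waiting times and compares the answer with the reciprocal of the sample mean. The data, the true rate of 2.0, and the use of scipy.optimize.minimize_scalar are assumptions made for illustration, not part of the original derivation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated waiting times (an assumption for this sketch): exponential, true rate 2.0.
rng = np.random.default_rng(0)
t = rng.exponential(scale=1 / 2.0, size=25)

# Likelihood of rate lam for iid exponential data: prod_i lam * exp(-lam * t_i).
# We negate it because minimize_scalar minimizes.
def neg_likelihood(lam):
    return -np.prod(lam * np.exp(-lam * t))

fit = minimize_scalar(neg_likelihood, bounds=(1e-6, 10.0), method="bounded")

print("rate that maximizes the likelihood:", fit.x)
print("reciprocal of the sample mean:     ", 1 / t.mean())
# The two agree up to optimizer tolerance, matching the closed-form result.
```

The sample is kept small (25 observations) so that the raw product of densities can be evaluated without numerical trouble; the next section explains why larger samples push us toward the log likelihood instead.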

37.1 log likelihood

We already saw the formula for the likelihood for multiple independent observations:

L(\lambda \mid t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} L(\lambda \mid t_i).

We justified the product form by assuming that the observations are “iid”, meaning independent and identically distributed. Instead of working with the product directly, we often work with the log likelihood, which converts the product into a sum:

\ell(\theta \mid x_{1:n}) = \log L(\theta \mid x_{1:n}) = \sum_{i=1}^{n} \log L(\theta \mid x_i) = \sum_{i=1}^{n} \log P( x_i \mid \theta).

Here we have switched to the notation more common in statistics: \theta for the parameters (which may be a vector) and x_{1:n} = (x_1, \ldots, x_n) for the data. There are a few reasons why we prefer to work with the log likelihood instead of the likelihood itself:

  1. Numerical Stability (Underflow): As we saw when computing the likelihood of multiple observations, likelihoods are products: \prod_i P(x_i \mid \theta). If you have 100 observations, you are multiplying 100 numbers that are typically much smaller than one, which can lead to numerical underflow, where the computer rounds the result to zero. Summing logs, \sum_i \ln P(x_i \mid \theta), keeps the numbers in a range floating-point arithmetic can handle (see the sketch after this list).
  2. Mathematical Elegance: Many common distributions (like the Exponential or Normal) use e, Euler’s number. The natural log cancels the exponent, turning products of complex terms into simple sums that are much easier to differentiate.
  3. Preservation of the Maximum: Because \ln(x) is a monotonically increasing function, the value of \theta that maximizes the log likelihood is identical to the value that maximizes the likelihood.
  4. Optimization Surface: Taking the log reshapes the likelihood surface. The raw likelihood of many observations is an extremely sharp spike that is nearly flat, and numerically close to zero, away from its peak, so gradients there carry almost no information; the log likelihood spreads that spike into a well-behaved “hill” whose slopes optimization algorithms can follow reliably.
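To make reasons 1 and 3 concrete, here is a minimal sketch. The data (1,000 standard-normal draws), the grid of candidate means, and the use of scipy.stats.norm are assumptions made for illustration; the point is only that the raw likelihood of many observations underflows to zero while the log likelihood stays well behaved and peaks at the same parameter value.

```python
import numpy as np
from scipy.stats import norm

# Illustrative data (assumed for this demo): 1,000 draws from a standard normal.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=1000)

# Candidate values of the mean parameter mu; the standard deviation is held at 1.
mus = np.linspace(-1.0, 1.0, 201)

# Raw likelihood: a product of 1,000 densities, each typically below one -> underflow.
likelihood = np.array([np.prod(norm.pdf(x, loc=mu, scale=1.0)) for mu in mus])

# Log likelihood: a sum of log densities -> stays in a comfortable numerical range.
log_likelihood = np.array([norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mus])

print("largest raw likelihood on the grid:", likelihood.max())   # 0.0: underflowed
print("mu maximizing the log likelihood:  ", mus[log_likelihood.argmax()])
print("sample mean:                       ", x.mean())
# The log likelihood peaks at (essentially) the sample mean, while the raw
# likelihood has been rounded to zero everywhere on the grid.
```

A grid search is used here purely to keep the comparison simple; the same underflow would affect any optimizer that evaluates the raw likelihood of a large sample.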

In the following chapters, we will see the deep connections between maximum likelihood estimation and many fundamental concepts in statistics and machine learning.