39  MLE and linear regression

The equation for linear regression is given by:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon,

where \beta_0 is the intercept, \beta_1, \beta_2, \ldots, \beta_k are the coefficients for the predictor variables x_1, x_2, \ldots, x_k, and \epsilon is the error term.

Reminder: the regression is called linear because it is linear in the parameters \beta_i, not necessarily in the predictor variables x_i. For example, a model like y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon is still considered linear regression because it is linear in \beta_0, \beta_1, and \beta_2. If you are still not convinced, consider that you can always create new predictor variables that are transformations of the original ones (e.g., x_2 = x^2) and include them in a linear regression model.
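If it helps to see this concretely, here is a minimal sketch (my own toy example, not part of the text): the quadratic model becomes an ordinary two-predictor linear model once we define a new predictor x_2 = x^2 and fit by least squares. All names and numbers below are made up for the illustration.

import numpy as np

# toy data: y = 1 + 2*x1 - 0.5*x1^2 + noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=200)
x2 = x1**2                                   # transformed predictor
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=200)

# design matrix: intercept, x1, x2 -- the model is linear in the betas
X = np.column_stack([np.ones_like(x1), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # should be close to the true values [1.0, 2.0, -0.5]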

The \epsilon term is typically assumed to be normally distributed with mean 0 and constant variance \sigma^2. This assumption is key! It allows us to use maximum likelihood estimation to estimate the parameters \beta_0, \beta_1, \ldots, \beta_k. Let’s see how.

First, we solve the equation above for \epsilon:

\epsilon = y - (\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k).

See the figure below for a practical example.

# import libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon, uniform, norm
fig, ax = plt.subplots(1, 3, figsize=(8, 3), gridspec_kw=dict(width_ratios=[1, 1, 0.2]))
fig.subplots_adjust(wspace=0.4)

# true parameters of the quadratic model
beta0 = 2
beta1 = 13
beta2 = -1

# simulate the predictor, the normal errors, and the responses
np.random.seed(seed=1)
x = uniform.rvs(size=100) * 15
x = np.sort(x)
epsilon = norm.rvs(size=100, scale=5)
y = beta0 + beta1 * x + beta2 * x**2 + epsilon

# left panel: simulated data and the true regression curve
ax[0].scatter(x, y, alpha=0.5, label=r"$y=\beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$")
ax[0].plot(x, beta0 + beta1 * x + beta2 * x**2, color='red', lw=2, label=r"$\beta_0 + \beta_1 x + \beta_2 x^2$", alpha=0.5)
ax[0].set(xlabel="x",
          ylim=(-30, 90),
          )
ax[0].legend(frameon=False, loc='upper left')

# middle panel: the residuals epsilon as a function of x
ax[1].scatter(x, epsilon, alpha=0.5)
ax[1].axhline(0, color='black', ls='--')
ax[1].set(xlabel="x")
ax[1].set_ylabel(r"$\epsilon$", rotation=0)
ax[1].yaxis.set_label_position("right")
ax[1].yaxis.tick_right()

# right panel: the assumed normal pdf of epsilon, drawn sideways
eps_vec = np.linspace(-12, 12, 1000)
ax[2].plot(norm.pdf(eps_vec, loc=0, scale=5), eps_vec)
ax[2].set(ylim=(-12, 12),
          xlabel='pdf',
          xticks=[],
          yticks=[]
          );

We can now “forget” about the y and x variables for a moment and model the error term \epsilon directly. Since we assumed that \epsilon is normally distributed with mean 0 and variance \sigma^2, each individual error has the probability density function:

f(\epsilon \mid \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right).

Assuming the errors are independent, the joint pdf is the product of the individual densities, so plugging this expression into the log likelihood function yields: \ell(\sigma^2 \mid \epsilon_{1:n}) = \sum_{i=1}^{n} \log f(\epsilon_i \mid \sigma^2).
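As a small sanity check (my own sketch, reusing the epsilon values and the norm object from the figure code above), we can evaluate this log likelihood at the true scale \sigma = 5 and at a deliberately wrong scale, and confirm that the true one scores higher:

# log likelihood of the simulated residuals: sum of log densities
loglik_true = norm.logpdf(epsilon, loc=0, scale=5).sum()    # evaluated at the true sigma
loglik_wrong = norm.logpdf(epsilon, loc=0, scale=10).sum()  # evaluated at a wrong sigma
print(loglik_true, loglik_wrong)  # the true scale should give the larger value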

Now it is time to go back to the original variables. Substituting the expression for \epsilon into the log likelihood function gives:

\begin{align*}
\ell(\beta_0, \beta_1, \ldots, \beta_k, \sigma^2 \mid y_{1:n}, x_{1:n}) &= \sum_{i=1}^{n} \log f\left(y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}) \mid \sigma^2\right)\\
&= \sum_{i=1}^{n} \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}))^2}{2\sigma^2}\right) \right)\\
&= \sum_{i=1}^{n} \left( -\frac{1}{2} \log(2\pi \sigma^2) - \frac{(y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}))^2}{2\sigma^2} \right)\\
&= -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}))^2.
\end{align*}

In the expression above, rather than treating the specific case of two predictor variables (as in the figure, where x_1 = x and x_2 = x^2), I wrote out the general case of k predictor variables, so we are looking for the best values of the k+1 coefficients \beta_0, \beta_1, \ldots, \beta_k. Notice that the first term in the log likelihood does not depend on the \beta parameters, so we can ignore it when maximizing over the coefficients.
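As a quick check of the algebra (again a sketch of mine, reusing x, y, and the true coefficients from the figure code), the closed-form expression on the last line should return the same number as summing norm.logpdf over the residuals:

def loglik(beta, sigma2, X, y):
    # last line of the derivation: -n/2 * log(2*pi*sigma^2) - SSE / (2*sigma^2)
    resid = y - X @ beta
    n = len(y)
    return -n / 2 * np.log(2 * np.pi * sigma2) - (resid**2).sum() / (2 * sigma2)

X = np.column_stack([np.ones_like(x), x, x**2])       # intercept, x, x^2
beta_true = np.array([beta0, beta1, beta2])
print(loglik(beta_true, 25.0, X, y))                  # closed-form expression (sigma^2 = 25)
print(norm.logpdf(y - X @ beta_true, scale=5).sum())  # direct sum of log densities, same number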

Now, here’s the kicker: because of the negative sign in front of the second term, maximizing the log likelihood with respect to the \beta coefficients is equivalent to minimizing the sum of squared errors (SSE):

\text{SSE} = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}))^2.

Mind blown.
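To see the equivalence numerically, here is a sketch (my own, reusing x and y from the figure code): estimate the coefficients once with plain least squares and once by numerically maximizing the log likelihood for a fixed \sigma; the two estimates should essentially coincide.

from scipy.optimize import minimize

X = np.column_stack([np.ones_like(x), x, x**2])

# 1) least squares: minimize the SSE directly
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) maximum likelihood: minimize the negative log likelihood over beta
#    (for any fixed sigma, this is the same problem as minimizing the SSE)
def negloglik(beta, sigma=5.0):
    return -norm.logpdf(y - X @ beta, scale=sigma).sum()

beta_ml = minimize(negloglik, x0=np.zeros(X.shape[1])).x

print(beta_ls)
print(beta_ml)  # essentially the same coefficients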

The general statement that the best set of parameters \hat{\theta} is the one that maximizes the log likelihood, \hat{\theta} = \arg\max_{\theta} \ell(\theta \mid x_{1:n}), translates in the specific case of linear regression to “the best set of parameters \hat{\beta} is the one that minimizes the sum of squared errors”: \hat{\beta} = \arg\min_{\beta} \text{SSE}.

Final thought: whenever we use the least squares method to fit a linear regression model, we are assuming (implicitly or explicitly) that the error terms are independent and normally distributed with constant variance.