13.1 ordinary least squares (OLS) regression

Let’s go over a few things that appear in this notebook: the statsmodels library and Ordinary Least Squares (OLS) regression.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
sns.set_theme(style="ticks", font_scale=1.5)
np.random.seed(9876789)

13.2 polynomial regression

Let’s start with a simple polynomial regression example, generating synthetic data from a quadratic equation plus some noise.

# number of points
nsample = 100
# create independent variable x
x = np.linspace(0, 10, nsample)
# create design matrix with linear and quadratic terms
X = np.column_stack((x, x ** 2))
# create coefficients array
beta = np.array([5, -2, 0.5])
# create random error term
e = np.random.normal(size=nsample)

x and e can be understood as column vectors of length n, while X and \beta are:

X = \begin{pmatrix} x_0 & x_0^2 \\ \vdots & \vdots \\ x_i & x_i^2 \\ \vdots & \vdots \\ x_n & x_n^2 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}.

Oops, there is no intercept column \vec{1} in the design matrix X. Let’s add it:

X = sm.add_constant(X)
print(X[:5, :])  # print first 5 rows of design matrix
[[1.         0.         0.        ]
 [1.         0.1010101  0.01020304]
 [1.         0.2020202  0.04081216]
 [1.         0.3030303  0.09182736]
 [1.         0.4040404  0.16324865]]

This add_constant function is smart: by default it has prepend=True, meaning that the intercept is added as the first column, and has_constant='skip', meaning that it will not add a constant if one is already present in the matrix.
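Here is a quick sketch of those two arguments in action (my addition, not part of the original example):

# prepend=False appends the constant as the last column instead of the first
X_last = sm.add_constant(np.column_stack((x, x ** 2)), prepend=False)
print(X_last[:2, :])  # the constant column now sits on the right

# has_constant='skip' (the default) leaves X untouched if a constant is already there
print(sm.add_constant(X).shape)  # still (100, 3), no duplicate column added
# has_constant='add' would append another constant column anyway,
# and has_constant='raise' would raise an error instead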

The matrix X is now a design matrix for a polynomial regression of degree 2: X = \begin{pmatrix} 1 & x_0 & x_0^2 \\ \vdots & \vdots & \vdots \\ 1 & x_i & x_i^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}

We now put everything together in the following equation:

y = X \beta + e

This creates the dependent variable y as a linear combination of the independent variables in X and the coefficients in \beta, plus an error term e.

y = np.dot(X, beta) + e

Let’s visualize this:

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, label='data', facecolors='none', edgecolors='black', alpha=0.5)
ax.set(xlabel='x',
       ylabel='y',
       title='Simulated Data with Linear and Quadratic Terms'
       );

13.3 solving the “hard way”

I’m going to do something that almost nobody does in practice: I will use the formula we derived in the previous chapter to find the coefficients \beta of the polynomial regression.

\beta = (X^TX)^{-1}X^Ty.

Translating this into code, and keeping in mind that matrix multiplication in Python is done with the @ operator, we get:

beta_opt = np.linalg.inv(X.T@X)@X.T@y
print(f"beta = {beta_opt}")
beta = [ 5.34233516 -2.14024948  0.51025357]

That’s it. We did it (again).

Let’s take a look at the matrix X^TX. Because X is a tall and skinny matrix of shape (n, 3), the matrix X^T is a wide and short matrix of shape (3, n). This is because we have many more data points n than the number of predictors (\vec{1},x,x^2), which is of course equal to the number of coefficients (\beta_0,\beta_1,\beta_2).
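We can check the shapes directly, using the X defined above:

print(X.shape)          # (100, 3): tall and skinny
print(X.T.shape)        # (3, 100): wide and short
print((X.T @ X).shape)  # (3, 3): a small square matrix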

print(X.T@X)
[[1.00000000e+02 5.00000000e+02 3.35016835e+03]
 [5.00000000e+02 3.35016835e+03 2.52525253e+04]
 [3.35016835e+03 2.52525253e+04 2.03033670e+05]]

When we multiply the matrices X^T_{3\times n} and X_{n\times 3}, the inner dimensions match (the number of columns in X^T is equal to the number of rows in X), so the product X^TX is a square matrix of shape (3, 3), which is quite easy to invert. If it were the other way around (X\,X^T), we would get a matrix of shape (n, n), which is much harder to invert, especially if n is large. Lucky us.
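As a side note (not something the derivation above requires), explicitly inverting X^TX is usually avoided in numerical code. Here is a sketch of two safer routes that should give the same coefficients:

# Solve the normal equations (X^T X) beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Or let NumPy perform a least-squares solve on X directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_solve)  # should match beta_opt up to floating-point noise
print(beta_lstsq)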

Now let’s see if the parameters we found are any good.

print("beta parameters used to generate data:")
print(beta)
print("beta parameters estimated from data:")
print(beta_opt)
beta parameters used to generate data:
[ 5.  -2.   0.5]
beta parameters estimated from data:
[ 5.34233516 -2.14024948  0.51025357]

Pretty good, right? Now let’s see the best fit polynomial on the graph.

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, label='data', facecolors='none', edgecolors='black', alpha=0.5)
ax.plot(x, np.dot(X, beta_opt), color='red', label='fitted line')
ax.legend(frameon=False)
ax.set(xlabel='x',
       ylabel='y',
       title='Simulated Data with Linear and Quadratic Terms'
       );

Why did I call it the “hard way”? Because these operations are so common that of course there are libraries that do them for us. We don’t need to remember the equation; we can just use, for example, sm.OLS from the statsmodels library, which does exactly this. Let’s see how it works.

model = sm.OLS(y, X)
results = model.fit()

Now we can compare the results of our manual calculation with the results from statsmodels. We should get the same coefficients, and we do.

print("beta parameters used to generate data:")
print(beta)
print("beta parameters from our calculation:")
print(beta_opt)
print("beta parameters from statsmodels:")
print(results.params)
beta parameters used to generate data:
[ 5.  -2.   0.5]
beta parameters from our calculation:
[ 5.34233516 -2.14024948  0.51025357]
beta parameters from statsmodels:
[ 5.34233516 -2.14024948  0.51025357]

13.4 statsmodels.OLS and the summary

Statsmodels provides us with a lot more information than just the coefficients. Let’s take a look at the summary of the OLS regression.

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     3965.
Date:                Mon, 23 Jun 2025   Prob (F-statistic):           9.77e-94
Time:                        12:50:31   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.3423      0.313     17.083      0.000       4.722       5.963
x1            -2.1402      0.145    -14.808      0.000      -2.427      -1.853
x2             0.5103      0.014     36.484      0.000       0.482       0.538
==============================================================================
Omnibus:                        2.042   Durbin-Watson:                   2.274
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
Skew:                           0.234   Prob(JB):                        0.392
Kurtosis:                       2.519   Cond. No.                         144.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I won’t go into the details of the summary, but I encourage you to take a look at it and see if you can make sense of it.
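If you want to pull individual numbers out of the table programmatically, the results object exposes them as attributes; a few common ones are shown below (this list is my addition, not from the original notebook):

print(results.params)       # estimated coefficients (the 'coef' column)
print(results.bse)          # standard errors ('std err')
print(results.pvalues)      # p-values ('P>|t|')
print(results.conf_int())   # 95% confidence intervals ('[0.025  0.975]')
print(results.rsquared, results.rsquared_adj)  # R-squared and adjusted R-squared
print(results.aic, results.bic)                # information criteria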

13.5 R-squared

R-squared is a measure of how well the model fits the data. It is defined as the proportion of the variance in the dependent variable that is predictable from the independent variables. It can be computed as follows:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

where SS_{res} and SS_{tot} are defined as follows:

\begin{align*} SS_{res} &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\ SS_{tot} &= \sum_{i=1}^n (y_i - \bar{y})^2 \end{align*}

The letters SS mean “sum of squares”, and \bar{y} is the mean of the dependent variable. Let’s compute it manually, and then compare it with the value from the statsmodels summary.

y_hat = np.dot(X, beta_opt)
SS_res = np.sum((y - y_hat) ** 2)
SS_tot = np.sum((y - np.mean(y)) ** 2)
R2 = 1 - (SS_res / SS_tot)
print("R-squared (manual calculation): ", R2)
print("R-squared (from statsmodels): ", results.rsquared)
R-squared (manual calculation):  0.9879144521349076
R-squared (from statsmodels):  0.9879144521349076

This high R^2 value tells us that the model explains a very large proportion of the variance in the dependent variable.

What does variance have to do with R^2? If we divide both SS_{res} and SS_{tot} by n-1, we get the sample variances of the residuals and of the dependent variable, respectively.

\begin{align*} s^2_{res} = \frac{SS_{res}}{n-1} &= \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-1} \\ s^2_{tot} = \frac{SS_{tot}}{n-1} &= \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n-1} \end{align*}

R^2 can then be expressed as: R^2 = 1 - \frac{s^2_{res}}{s^2_{tot}}.

I prefer this form over the first, because it makes it clear that R^2 compares the variance of the residuals with the variance of the data, which is a more intuitive way to think about it.
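As a quick sanity check of this form (my addition): because the model includes an intercept, the residuals have zero mean, so NumPy’s sample variance with ddof=1 reproduces the s^2 terms above.

s2_res = np.var(y - y_hat, ddof=1)  # sample variance of the residuals (divides by n-1)
s2_tot = np.var(y, ddof=1)          # sample variance of the dependent variable
print(1 - s2_res / s2_tot)          # same R-squared as before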

13.6 any function will do

The formula we derived in the previous chapter works for predictors (independent variables) of any kind, not only polynomials. The formula will work as long as the model is linear in the parameters \beta, i.e., each \beta multiplies a known function of the predictors. For example, we could have a nonlinear function like this:

y = \beta_0 + \beta_1 e^x + \beta_2 \sin(x^2)

because each \beta multiplies a predictor. On the other hand, the following function would not work, because the model is not linear in the parameters:

y = \beta_0 + e^{\beta_1 x} + \sin(\beta_2 x^2)
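To make the contrast concrete, here is a small sketch (my own toy example, with made-up coefficients) of fitting the linear-in-parameters model y = \beta_0 + \beta_1 e^x + \beta_2 \sin(x^2) by simply stacking the transformed predictors into the design matrix:

# Toy example: the model is nonlinear in x but linear in the betas
x_demo = np.linspace(0, 2, 50)                    # keep exp(x) from blowing up
X_demo = sm.add_constant(np.column_stack((np.exp(x_demo), np.sin(x_demo ** 2))))
beta_demo = np.array([1.0, 0.8, -0.5])            # made-up [beta_0, beta_1, beta_2]
y_demo = X_demo @ beta_demo + 0.1 * np.random.normal(size=x_demo.size)

print(sm.OLS(y_demo, X_demo).fit().params)        # recovers roughly [1.0, 0.8, -0.5]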

Let’s see this in action. I’ll use the same example provided in the statsmodels documentation, which is a nonlinear function of the form:

y = \beta_0 x + \beta_1 \sin(x) + \beta_2(x - 5)^2 + \beta_3

nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
X = np.column_stack((
    x,
    np.sin(x),
    (x - 5) ** 2,
    np.ones(nsample)
    ))
beta = [0.5, 0.5, -0.02, 5.0]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, label='data', facecolors='none', edgecolors='black', alpha=0.5)
ax.set(xlabel='x',
       ylabel='y',
       )

result = sm.OLS(y, X).fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.933
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     211.8
Date:                Mon, 23 Jun 2025   Prob (F-statistic):           6.30e-27
Time:                        12:51:08   Log-Likelihood:                -34.438
No. Observations:                  50   AIC:                             76.88
Df Residuals:                      46   BIC:                             84.52
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.4687      0.026     17.751      0.000       0.416       0.522
x2             0.4836      0.104      4.659      0.000       0.275       0.693
x3            -0.0174      0.002     -7.507      0.000      -0.022      -0.013
const          5.2058      0.171     30.405      0.000       4.861       5.550
==============================================================================
Omnibus:                        0.655   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.721   Jarque-Bera (JB):                0.360
Skew:                           0.207   Prob(JB):                        0.835
Kurtosis:                       3.026   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note something interesting: in our design matrix X we encoded the intercept as the last column; there is no reason why it should be the first column (although that is a common choice). statsmodels.OLS detects the constant column, and when we print the summary it shows the intercept (labeled const) as the last coefficient. Nice!
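A related tip (my addition, not from the original): if you wrap the design matrix in a pandas DataFrame, statsmodels will use the column names in the summary instead of the generic x1, x2, … labels. The names below are just chosen for illustration.

# Hypothetical column names, picked here only for readability
X_df = pd.DataFrame(X, columns=["x", "sin_x", "x_minus_5_sq", "const"])
named_result = sm.OLS(y, X_df).fit()
print(named_result.params)  # the index now shows the column names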

Let’s see a graph of the data and the fitted model.

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, label='data', facecolors='none', edgecolors='black', alpha=0.5)
ax.plot(x, np.dot(X, result.params), color='red', label='fitted line')
ax.legend(frameon=False)
ax.set(xlabel='x',
       ylabel='y',
       )