14  equivalence

14.1 orthogonality

In the context of linear regression, the orthogonality condition states that the residual vector r is orthogonal to the column space of the design matrix X:

X^T r = 0

Let’s substitute r = y - \hat{y} = y - X\beta into the equation above.

X^T(y - X\beta) = 0

Distributing yields

X^Ty - X^TX\beta = 0,

and then

X^TX\beta = X^Ty.

We need to solve this equation for \beta, so we left-multiply both sides by the inverse of X^TX (assuming X^TX is invertible, which holds when the columns of X are linearly independent),

\beta = (X^TX)^{-1}X^Ty.

We already did that in a previous chapter. Now let’s get exactly the same result using another approach.
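
Before moving on, here is a quick numerical sanity check of the orthogonality condition (a minimal sketch, not part of the derivation; it assumes NumPy and a small randomly generated dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic dataset: an intercept column plus p random predictors.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# The residual is orthogonal to every column of X.
r = y - X @ beta
print(X.T @ r)  # all entries ~0, up to floating-point error
```

Solving the normal equations with np.linalg.solve avoids forming the inverse explicitly, but it yields the same \beta as the formula above.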

14.2 optimization

We wish to minimize the sum of squared errors, which is given by

L = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \sum_{j=1}^p X_{ij}\beta_j \right)^2.

The letter L is the customary symbol for a loss function, and it also evokes the Lagrangian. Later on, when we talk about regularization, we will see that the constrained version of this problem can be written as a Lagrangian with Lagrange multipliers. In any case, let’s keep going with the optimization.

It is useful to express L in matrix notation. The sum of squared errors can be thought of as the dot product of the residual vector with itself:

\begin{align*} L &= (y - X\beta)^T(y - X\beta) \\ &= y^Ty - y^TX\beta - \beta^TX^Ty + \beta^TX^TX\beta, \end{align*}

where we used the following properties of matrix algebra:

  1. The dot product of a vector with itself is the sum of the squares of its components, i.e., a^Ta = \sum_{i=1}^n a_i^2. We used this to express the sum of squared errors in matrix notation.
  2. The dot product is bilinear, i.e., (a - b)^T(c - d) = a^Tc - a^Td - b^Tc + b^Td. We used this to expand the expression for the sum of squared errors.
  3. The transpose of a product of matrices is the product of their transposes in reverse order, i.e., (AB)^T = B^TA^T. We used this to compute (X\beta)^T.

Let’s use one more property to join the two middle terms, - y^TX\beta - \beta^TX^Ty:

  1. The dot product is symmetric, i.e., a^Tb = b^Ta. This is evident when we express the dot product as a summation:

a^Tb = \sum_{i=1}^n a_i b_i = \sum_{i=1}^n b_i a_i = b^Ta.

Joining the two middle terms results in the following L:

L = y^Ty - 2y^TX\beta + \beta^TX^TX\beta.
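
This expansion is easy to verify numerically; here is a minimal sketch (assuming NumPy, with arbitrary small X, y and \beta):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

# Compact form: the residual dotted with itself.
lhs = (y - X @ beta) @ (y - X @ beta)

# Expanded form with the two middle terms joined.
rhs = y @ y - 2 * y @ X @ beta + beta @ X.T @ X @ beta

print(np.isclose(lhs, rhs))  # True
```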

The set of parameters \beta that minimizes L is the one that satisfies the first-order condition for an extremum of L (either a maximum or a minimum). This means that the gradient of L with respect to \beta must be zero:

\frac{\partial L}{\partial \beta} = 0.

Let’s plug in the expression for L, rewriting the second term as -2\beta^TX^Ty (the two forms are equal because a scalar equals its own transpose, so y^TX\beta = \beta^TX^Ty):

\frac{\partial}{\partial \beta} \left( y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \right) = 0

The quantity L is a scalar, and so is each of the three terms that we are differentiating. Let’s differentiate them one by one.

The first term, y^Ty, is a constant with respect to \beta, so its derivative is zero.

For the second term, we pull the constant factor out of the derivative:

\frac{\partial}{\partial \beta} \left( - 2\beta^TX^Ty \right) = - 2\frac{\partial}{\partial \beta} \left( \beta^TX^Ty \right).

The quantity being differentiated is a scalar: it is the product of the row vector \beta^T and the column vector X^Ty. Right now we don’t care much about X^Ty; it could be any column vector, so let’s call it c. The derivative of the dot product \beta^T c with respect to a specific element \beta_k can be written explicitly as

\frac{\partial (\beta^T c)}{\partial \beta_k} = \frac{\partial}{\partial \beta_k} \left( \sum_{i=1}^p \beta_i c_i \right) = \frac{\partial}{\partial \beta_k} \left( \beta_1 c_1 + \beta_2 c_2 + \ldots + \beta_p c_p \right) = c_k.

Whatever value we choose for the index k, every term in the sum has derivative zero except the term \beta_k c_k, whose derivative is c_k; that explains the result above.

Since the gradient \nabla_{\beta}(\beta^T c) is a vector,

\nabla_{\beta}(\beta^T c) = \frac{\partial (\beta^T c)}{\partial \beta} = \begin{pmatrix} \frac{\partial (\beta^T c)}{\partial \beta_1} \\ \frac{\partial (\beta^T c)}{\partial \beta_2} \\ \vdots \\ \frac{\partial (\beta^T c)}{\partial \beta_p} \end{pmatrix}

and we have just figured out what each component is, we can write the solution as

\frac{\partial (\beta^T c)}{\partial \beta} = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix} = c = X^Ty.

So the derivative of the second term is simply -2 X^Ty.
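
If you want to convince yourself of this identity without the index gymnastics, a finite-difference check works well. The sketch below assumes NumPy; the vector c stands in for X^Ty and the step size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
c = rng.normal(size=p)      # plays the role of X^T y
beta = rng.normal(size=p)

f = lambda b: b @ c         # the scalar beta^T c

# Central finite differences, one component of beta at a time.
eps = 1e-6
grad_fd = np.array([
    (f(beta + eps * np.eye(p)[k]) - f(beta - eps * np.eye(p)[k])) / (2 * eps)
    for k in range(p)
])

print(np.allclose(grad_fd, c))  # True: the gradient of beta^T c is c
```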

To compute the derivative of the third term, \beta^TX^TX\beta, we use the following property:

\frac{\partial}{\partial \beta} \left( \beta^T A \beta \right) = 2A\beta,

which holds when A is a symmetric matrix. In our case, A = X^TX, which is symmetric because (X^TX)^T = X^T(X^T)^T = X^TX. Therefore we have:

\frac{\partial}{\partial \beta} \left( \beta^T X^TX \beta \right) = 2X^TX\beta,

and that is the derivative of the third term.

We still have to prove that the derivative of \beta^T A \beta is 2A\beta. First, let’s use summation notation to express the term \beta^T A \beta:

\beta^T A \beta = \sum_{i=1}^p \sum_{j=1}^p \beta_i A_{ij} \beta_j.

Now, let’s differentiate this expression with respect to \beta_k, using the product rule:

\frac{\partial}{\partial \beta_k} \left( \sum_{i=1}^p \sum_{j=1}^p \beta_i A_{ij} \beta_j \right) = \sum_{i=1}^p A_{ik} \beta_i + \sum_{j=1}^p \beta_j A_{kj}.

Using the symmetry of A (A_{ij} = A_{ji}):

\frac{\partial}{\partial \beta_k} \left( \sum_{i=1}^p \sum_{j=1}^p \beta_i A_{ij} \beta_j \right) = 2 \sum_{i=1}^p A_{ik} \beta_i = 2(A\beta)_k.

This is the k-th element of the vector 2A\beta. Since this holds for every index k, we can write the gradient as

\nabla_{\beta}(\beta^T A \beta) = 2A\beta.
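
The same kind of finite-difference check confirms this gradient as well (a sketch, assuming NumPy; A is built as M^TM so that it has the same symmetric form as X^TX):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
M = rng.normal(size=(p, p))
A = M.T @ M                 # symmetric, like X^T X
beta = rng.normal(size=p)

f = lambda b: b @ A @ b     # the quadratic form beta^T A beta

# Central finite differences, one component of beta at a time.
eps = 1e-6
grad_fd = np.array([
    (f(beta + eps * np.eye(p)[k]) - f(beta - eps * np.eye(p)[k])) / (2 * eps)
    for k in range(p)
])

print(np.allclose(grad_fd, 2 * A @ beta))  # True
```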

Now that we have the derivatives of all three terms, we can write the gradient of L:

\frac{\partial L}{\partial \beta} = 0 - 2X^Ty + 2X^TX\beta = 0.

Rearranging…

X^TX\beta = X^Ty

…and solving for \beta:

\beta = (X^TX)^{-1}X^Ty
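
In code one rarely forms (X^TX)^{-1} explicitly, since a dedicated least squares routine is more stable; still, both routes give the same \beta. A minimal sketch with synthetic data (assuming NumPy; the coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# The closed-form solution derived above.
beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# A dedicated least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_closed, beta_lstsq))  # True
```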

14.3 discussion

Using two completely different approaches, we arrived at the same result for the least squares solution:

\beta = (X^TX)^{-1}X^Ty.

  • Approach 1: We used the orthogonality condition, which states that the residual vector is orthogonal to the column space of the design matrix.
  • Approach 2: We applied the optimization method, minimizing the sum of squared errors—which corresponds to minimizing the squared length of the residual vector.

There is a deep connection here. The requirement that the residual vector be orthogonal to the column space of the design matrix is equivalent to minimizing the sum of squared errors. We can see this visually: if the projection of the response vector y onto the column space of X were anywhere else, the residual vector would not only fail to be orthogonal, it would also be longer!

\text{orthogonality} \iff \text{optimization}
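
A small numerical experiment illustrates the equivalence (a sketch, assuming NumPy and an arbitrary perturbation of the solution): any \beta other than the least squares one produces a residual that is both longer and no longer orthogonal to the columns of X.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_other = beta_ols + 0.1 * rng.normal(size=4)   # any other candidate

for name, b in [("OLS", beta_ols), ("perturbed", beta_other)]:
    r = y - X @ b
    print(name,
          "||r|| =", round(float(np.linalg.norm(r)), 4),
          "max |X^T r| =", round(float(np.max(np.abs(X.T @ r))), 6))
# The OLS residual is the shortest and (numerically) orthogonal to X's columns.
```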

This result even transfers to other contexts, as long as there is a vector space with a well-defined inner product (dot product) and an orthogonal basis. In these cases, the least squares solution can be interpreted as finding the projection of a vector onto a subspace spanned by an orthogonal basis. Some examples include:

  • Fourier series: the Fourier coefficients are the least squares solution to the problem of approximating a function by a sum of sines and cosines, where these functions form an orthogonal basis (see the sketch after this list).
  • SVD (Singular Value Decomposition): A matrix can be decomposed into orthogonal matrices, and the singular values can be interpreted as the least squares solution to the problem of approximating a matrix by a sum of outer products of orthogonal vectors.
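
To make the Fourier example concrete, the sketch below (assuming NumPy; the square wave, the number of samples, and the number of harmonics are arbitrary choices) fits a few sine terms to a square wave with a generic least squares solver and checks that the result matches the projection formula that orthogonality provides:

```python
import numpy as np

# Square wave sampled on an equally spaced grid; the sampled sines are
# (numerically) orthogonal, so the least squares fit is a projection.
N, M = 400, 5                                    # samples and harmonics (arbitrary)
t = 2 * np.pi * np.arange(N) / N
f = np.sign(np.sin(t))                           # square wave

X = np.column_stack([np.sin(m * t) for m in range(1, M + 1)])

# Generic least squares fit.
coef_lstsq, *_ = np.linalg.lstsq(X, f, rcond=None)

# Projection formula: <f, sin(m t)> / <sin(m t), sin(m t)>, one term at a time.
coef_proj = (X.T @ f) / (X.T @ X).diagonal()

print(np.allclose(coef_lstsq, coef_proj))  # True
print(np.round(coef_lstsq, 3))             # roughly 4/(m*pi) for odd m, ~0 for even m
```

Because the sampled sines are orthogonal, X^TX is (numerically) diagonal, and the least squares fit reduces to one independent projection per basis function.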

14.4 other properties

14.4.1 the sum of residuals is zero

As long as the model includes an additive intercept term and its parameters are fit by least squares, the sum of the residuals is zero. The model could be anything, not necessarily linear regression. Let’s prove this property.

\hat{y}_i = \beta_0 + f(x_i; \beta_1, \beta_2, \ldots, \beta_p)

The Ordinary Least Squares (OLS) estimates of the parameters \beta minimize the sum of squared residuals L, so the partial derivative of L with respect to \beta_0 must vanish:

\begin{align*} \frac{\partial L}{\partial \beta_0} &= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n \left(y_i - \hat{y}_i \right)^2 \\ &= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n \left[y_i - \beta_0 - f(x_i; \beta_1, \beta_2, \ldots, \beta_p) \right]^2 \\ &= -2 \sum_{i=1}^n \left[y_i - \beta_0 - f(x_i; \beta_1, \beta_2, \ldots, \beta_p) \right] \\ &= -2 \sum_{i=1}^n \left(y_i - \hat{y}_i \right) = 0 \end{align*}

From the last line it follows that the sum of the residuals is zero.
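
As a numerical illustration (a sketch, assuming NumPy and SciPy's curve_fit; the exponential model and its parameters are arbitrary choices), the residuals of a nonlinear model with an additive intercept sum to zero once the parameters are fit by least squares:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(6)

# A nonlinear model with an additive intercept b0 (model choice is arbitrary).
def model(x, b0, b1, b2):
    return b0 + b1 * np.exp(b2 * x)

x = np.linspace(0, 2, 80)
y = model(x, 1.0, 2.0, -1.5) + 0.05 * rng.normal(size=x.size)

# Least squares fit of the nonlinear model.
params, _ = curve_fit(model, x, y, p0=[0.5, 1.0, -1.0])
residuals = y - model(x, *params)

print(residuals.sum())  # ~0 (up to optimizer tolerance), thanks to the intercept
```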