15  partitioning of the sum of squares

One of the most important equations in regression analysis is the partitioning of the sum of squares (SS).

SS_\text{Total} = SS_\text{Model} + SS_\text{Error}

This equation states that the total variability in the response variable can be partitioned into two components: the variability explained by the regression model and the variability that is not explained by the model (the residuals).

In more precise mathematical language, the equation states that:

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where:

  • y_i is the i-th observed value of the response;
  • \bar{y} is the mean of the observed values;
  • \hat{y}_i is the value predicted by the model for observation i.
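To see the identity in action, here is a minimal numeric sketch in Python. The simulated data and the use of np.polyfit for the least-squares fit are illustrative assumptions, not part of the text:

import numpy as np

# Simulated data for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

# Least-squares fit of a simple linear regression
b1, b0 = np.polyfit(x, y, deg=1)   # polyfit returns highest degree first
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((y_hat - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)

print(ss_total, ss_model + ss_error)  # the two numbers coincide
assert np.isclose(ss_total, ss_model + ss_error)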

15.1 high-dimensional Pythagorean theorem

Why is this equation true? The easiest way to understand it is as a high-dimensional version of the Pythagorean theorem:

\lVert T \rVert ^2 = \lVert M \rVert ^2 + \lVert E \rVert ^2,

where:

  • T = y - \bar{y} is the total vector, the vector of deviations of the observed values from their mean;
  • M = \hat{y} - \bar{y} is the model vector, the vector of deviations of the predicted values from the mean of the observed values;
  • E = y - \hat{y} is the error vector, the vector of residuals, or deviations of the observed values from the predicted values.

I omitted the subscript i to emphasize that these are vectors, not scalars.

Make sure you understand the figure below, which was already presented in a previous chapter.

The vector r = E, in black, is the residual or error vector. It is orthogonal to the subspace spanned by the column vectors of the design matrix, represented in this image by the blue plane.

The vector M is not shown, but it necessarily lies in the blue plane. How do we know that? Because the predicted values \hat{y} are a linear combination of the columns of the design matrix, so \hat{y} lies in the column space of the design matrix. And since \bar{y} is a constant vector (a multiple of the vector of ones), it too lies in the column space, provided the model includes an intercept, i.e., the design matrix contains a column of ones. Therefore, M = \hat{y} - \bar{y} lies in the column space of the design matrix.
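Both claims are easy to check numerically. A minimal sketch, again with simulated data and an assumed design matrix containing an intercept column:

import numpy as np

# Simulated data for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
y_hat = X @ beta

M = y_hat - y.mean()   # model vector
E = y - y_hat          # error vector

print(X.T @ E)   # ~(0, 0): E is orthogonal to each column of X
print(M @ E)     # ~0: hence E is orthogonal to M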

From the above, we can already conclude that E is orthogonal to M. These are the two legs of a right triangle. We now need a hypotenuse. The hypotenuse is the total vector T = y - \bar{y}, which is the sum of the other two vectors:

T = M + E.

This is easy to see by substituting the definitions of M and E:

T = y - \bar{y} = (\hat{y} - \bar{y}) + (y - \hat{y}) = M + E.
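Since M and E are orthogonal, expanding the squared norm of T = M + E yields the Pythagorean identity directly:

\lVert T \rVert^2 = \lVert M + E \rVert^2 = \lVert M \rVert^2 + 2\, M \cdot E + \lVert E \rVert^2 = \lVert M \rVert^2 + \lVert E \rVert^2,

which is exactly the partitioning of the sum of squares stated at the beginning of the chapter.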

The following code draws this right triangle:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
ax.set_aspect('equal', adjustable='box')

# Model vector M: one leg of the right triangle, lying in the column space
ax.annotate('', xy=(2, 0), xytext=(0, 0),
            arrowprops=dict(facecolor='black', edgecolor='black', width=2, headwidth=10, headlength=10))
# Error vector E: the other leg, orthogonal to M
ax.annotate('', xy=(2, 1), xytext=(2, 0),
            arrowprops=dict(facecolor='blue', edgecolor='blue', width=2, headwidth=10, headlength=10))
# Total vector T = M + E: the hypotenuse
ax.annotate('', xy=(2, 1), xytext=(0, 0),
            arrowprops=dict(facecolor='red', edgecolor='red', width=2, headwidth=10, headlength=10))

# Label the three vectors
ax.text(1.0, 0.0, 'M', fontsize=20, ha='center', va='bottom')
ax.text(2.03, 0.5, 'E', fontsize=20, ha='left', va='center')
ax.text(1.0, 0.55, 'T', fontsize=20, ha='center', va='bottom')

# Hide the axes so that only the triangle is visible
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)
ax.set_xlim(0, 2)
ax.set_ylim(0, 1)
plt.show()