20 correlation and linear regression
The correlation coefficient is closely related to linear regression. We will see below a few instances of this relationship.
20.1 prelude: finding the intercept and slope
Let’s derive the formulas for the intercept and slope of the regression line. We want to minimize the sum of squared residuals L:
L = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \tag{1}
where
\hat{y}_i = \beta_0 + \beta_1 x_i. \tag{2}
To find the optimal values of \beta_0 and \beta_1, we take the partial derivatives of L with respect to \beta_0 and \beta_1, set them to zero, and solve the resulting equations.
\begin{align*} \frac{\partial L}{\partial \beta_0} &= 0 \tag{3a}\\ \frac{\partial L}{\partial \beta_1} &= 0 \tag{3b} \end{align*}
We already did this for the general case, but there the calculation had the variables in vector/matrix form. Here we will do it for the simple case of a single predictor variable, so that we can see the relationship with correlation more clearly.
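Before diving into the derivation, here is a minimal sketch in Python with NumPy (using made-up toy data, so purely illustrative) that evaluates the loss L of Eq. (1) for a couple of candidate intercept/slope pairs; the optimal pair we derive below should give a smaller L than either guess:

```python
import numpy as np

# Made-up toy data, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_squared_residuals(beta0, beta1, x, y):
    y_hat = beta0 + beta1 * x           # Eq. (2): fitted values
    return np.sum((y - y_hat) ** 2)     # Eq. (1): sum of squared residuals L

# Two arbitrary candidate lines; the least-squares solution should beat both.
print(sum_squared_residuals(0.0, 2.0, x, y))
print(sum_squared_residuals(0.5, 1.8, x, y))
```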
20.1.1 intercept
Let’s start with Eq. (3a), and substitute into it Eq. (2):
\begin{align*} \frac{\partial L}{\partial \beta_0} &= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{4a} \\ &= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \tag{4b} \\ &= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0 \tag{4c} \end{align*}
Eliminating the constant factor -2 and expanding the summation, we get:
\sum_{i=1}^n y_i - n \beta_0 - \beta_1 \sum_{i=1}^n x_i = 0 \tag{5}
We now divide by n (so that the sums become the means \bar{y} and \bar{x}) and rearrange to isolate \beta_0:
\beta_0 = \bar{y} - \beta_1 \bar{x} \tag{6}
Note: we can rewrite equation (4c) as
\sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n \text{residuals} = 0,
which tells us that the residuals of a least-squares fit with an intercept always sum to zero. A nice thing to know.
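As a quick sanity check of Eq. (6) and of this zero-sum property, here is a short Python/NumPy sketch (same made-up toy data), compared against the fit returned by np.polyfit:

```python
import numpy as np

# Made-up toy data, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit returns coefficients from highest to lowest degree: [slope, intercept].
beta1, beta0 = np.polyfit(x, y, deg=1)

print(np.isclose(beta0, y.mean() - beta1 * x.mean()))   # Eq. (6): beta0 = y_bar - beta1 * x_bar

residuals = y - (beta0 + beta1 * x)
print(np.isclose(residuals.sum(), 0.0))                 # residuals sum to zero
```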
20.1.2 slope
Now let’s move on to Eq. (3b), and substitute the result of Eq. (6) into it:
\begin{align*} \frac{\partial L}{\partial \beta_1} &= \frac{\partial}{\partial \beta_1} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{7a} \\ &= \frac{\partial}{\partial \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \tag{7b} \\ &= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) x_i = 0 \tag{7c} \\ &= -2 \sum_{i=1}^n (y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i) x_i = 0 \tag{7d} \end{align*}
Eliminating the constant factor 2 and expanding the summation, we get:
-\sum_{i=1}^n x_i y_i + \sum_{i=1}^n x_i \bar{y} - \beta_1 \sum_{i=1}^n x_i \bar{x} + \beta_1 \sum_{i=1}^n x_i^2 = 0 \tag{8}
Let’s group the terms involving \beta_1 on one side and the rest on the other side:
\beta_1 \left( \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i \bar{x} \right) = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y} \tag{9}
Isolating \beta_1, we have:
\beta_1 = \frac{\sum x_i y_i - \sum x_i \bar{y}}{\sum x_i^2 - \sum x_i \bar{x}} = \frac{\text{numerator}}{\text{denominator}} \tag{10}
It’s easier to interpret the numerator and denominator separately. To each we will add and subtract a term that will allow us to express them in simpler forms.
Numerator:
\text{numerator} = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y} + \sum_{i=1}^n \bar{x} y_i - \sum_{i=1}^n \bar{x} y_i \tag{11}
We express the third term thus:
\text{third term} = \sum_{i=1}^n \bar{x} y_i = \bar{x} \sum_{i=1}^n y_i = n \bar{x} \bar{y} = \sum_{i=1}^n \bar{x} \bar{y} \tag{12}
The numerator now becomes:
\begin{align*} \text{numerator} &= \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y} + \sum_{i=1}^n \bar{x} \bar{y} - \sum_{i=1}^n \bar{x} y_i \tag{13a} \\ &= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \tag{13b} \\ \end{align*}
Now the denominator:
\text{denominator} = \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i \bar{x} + \sum_{i=1}^n x_i\bar{x} - \sum_{i=1}^n x_i\bar{x} \tag{14}
We group the second and fourth terms (they combine into -2 \sum_{i=1}^n x_i \bar{x}), and express the third term thus:
\text{third term} = \sum_{i=1}^n x_i \bar{x} = \bar{x} \sum_{i=1}^n x_i = n \bar{x}^2 = \sum_{i=1}^n \bar{x}^2 \tag{15}
The denominator now becomes:
\begin{align*} \text{denominator} &= \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i \bar{x} + \sum_{i=1}^n \bar{x}^2 \tag{16a} \\ &= \sum_{i=1}^n (x_i - \bar{x})^2 \tag{16b} \end{align*}
Putting it all together, we have:
\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \tag{17}
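A short numerical check (Python/NumPy, made-up toy data) confirms that the closed-form slope of Eq. (17) agrees with the slope returned by np.polyfit:

```python
import numpy as np

# Made-up toy data, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Eq. (17): closed-form slope.
beta1_formula = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

beta1_polyfit, _ = np.polyfit(x, y, deg=1)
print(np.isclose(beta1_formula, beta1_polyfit))
```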
20.2 slope and correlation
Let’s divide both the numerator and denominator of Eq. (17) by n-1:
\beta_1 = \frac{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} \tag{18}
The numerator is the sample covariance \text{cov}(X, Y), and the denominator is the sample variance \text{var}(X):
\beta_1 = \frac{\text{cov}(X, Y)}{\text{var}(X)} \tag{19}
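We can verify Eq. (19) numerically with another small Python/NumPy sketch (same made-up toy data), computing the sample covariance and variance with the n-1 denominator:

```python
import numpy as np

# Made-up toy data, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta1, _ = np.polyfit(x, y, deg=1)

cov_xy = np.cov(x, y, ddof=1)[0, 1]        # sample covariance (divides by n - 1)
var_x = np.var(x, ddof=1)                  # sample variance (divides by n - 1)
print(np.isclose(beta1, cov_xy / var_x))   # Eq. (19)
```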
Now, we can express the covariance in terms of the correlation coefficient \rho_{X,Y} and the standard deviations \sigma_X and \sigma_Y (see here):
\text{cov}(X, Y) = \rho_{X,Y} \sigma_X \sigma_Y \tag{20}

Substituting Eq. (20) into Eq. (19), we get:
\beta_1 = \frac{\rho_{X,Y} \sigma_X \sigma_Y}{\sigma_X^2} \tag{21}
And finally, we have:
\beta_1 = \rho_{X,Y} \frac{\sigma_Y}{\sigma_X} \tag{22}
This shows that, for given standard deviations \sigma_X and \sigma_Y, the slope of the regression line is directly proportional to the correlation coefficient: a higher absolute value of the correlation coefficient indicates a steeper slope, while a lower absolute value indicates a flatter slope.
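The same kind of numerical check works for Eq. (22), using the sample correlation and the sample standard deviations (Python/NumPy sketch, made-up toy data):

```python
import numpy as np

# Made-up toy data, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta1, _ = np.polyfit(x, y, deg=1)

r = np.corrcoef(x, y)[0, 1]               # sample correlation coefficient
s_x = np.std(x, ddof=1)                   # sample standard deviations
s_y = np.std(y, ddof=1)
print(np.isclose(beta1, r * s_y / s_x))   # Eq. (22)
```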
20.3 R^2 = square of the correlation coefficient r
Let’s start by saying that the correlation coefficient is called \rho when referring to the population, and r when referring to a sample. I’m playing loose with this convention here, but I hope it’s clear from the context.
Let’s show now that the coefficient of determination R^2 is equal to the square of the correlation coefficient r.
We start with the definition of R^2:
R^2 = 1 - \frac{\text{SS}_{\text{Error}}}{\text{SS}_{\text{Total}}} = \frac{\text{SS}_{\text{Total}} - \text{SS}_{\text{Error}}}{\text{SS}_{\text{Total}}} = \frac{\text{SS}_{\text{Model}}}{\text{SS}_{\text{Total}}} \tag{23}
where we used the fact that \text{SS}_{\text{Total}} = \text{SS}_{\text{Model}} + \text{SS}_{\text{Error}}, already seen before.
We now substitute into Eq. (23) the definitions of \text{SS}_{\text{Model}} and \text{SS}_{\text{Total}}:
R^2 = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \tag{24}
We now substitute into Eq. (24) the expression of \hat{y}_i from Eq. (2):
R^2 = \frac{\sum_{i=1}^n (\beta_0 + \beta_1 x_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \tag{25}
Now we substitute into Eq. (25) the expression of \beta_0 from Eq. (6):
R^2 = \frac{\sum_{i=1}^n \left( \bar{y} - \beta_1 \bar{x} + \beta_1 x_i - \bar{y} \right)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{\sum_{i=1}^n \left( \beta_1 (x_i - \bar{x}) \right)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \tag{26}
\beta_1 is a number, so we can take it out of the summation in the numerator:
R^2 = \frac{\beta_1^2 \sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \tag{26b}
Now, let’s substitute into Eq. (26b) the expression of \beta_1 from Eq. (17):
R^2 = \frac{\left( \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)^2 \sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \tag{27}
We can simplify Eq. (27) by canceling one instance of the denominator in the squared term with the factor outside the squared term:
R^2 = \frac{\left[ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2} \tag{28}
Now, let’s multiply both the numerator and denominator of Eq. (28) by \frac{1}{(n-1)^2}:
R^2 = \frac{\left[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \right]^2}{\left[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \right] \left[ \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2 \right]} \tag{29}
The numerator is the square of the sample covariance \text{cov}(X, Y), and the denominator is the product of the sample variances \text{var}(X)=\sigma_X^2 and \text{var}(Y)=\sigma_Y^2:
R^2 = \frac{\text{cov}(X, Y)^2}{\sigma_X^2 \sigma_Y^2} = \left( \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y} \right)^2 \tag{30}
This is exactly the square of the correlation coefficient r (see here):
R^2 = r^2 \tag{31}
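As a final check, here is a Python/NumPy sketch (made-up toy data) verifying both the sum-of-squares decomposition and Eq. (31):

```python
import numpy as np

# Made-up toy data, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x

ss_error = np.sum((y - y_hat) ** 2)
ss_model = np.sum((y_hat - y.mean()) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
print(np.isclose(ss_total, ss_model + ss_error))   # SS_Total = SS_Model + SS_Error

r_squared = 1 - ss_error / ss_total                # Eq. (23)
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, r ** 2))               # Eq. (31)
```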