19  correlation

19.1 variance

To understand correlation, we need to start with variance. Let’s say X is a random variable with mean \mu. The variance of X, denoted \text{var}(X)=\sigma^2, is defined as the expected value of the squared deviation from the mean:

\sigma^2 = \text{var}(X) = E[(X - \mu)^2] = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2.

I should point out that the formula above is for the population variance. If we are working with a sample, we would use N-1 in the denominator instead of N to get an unbiased estimate of the population variance. The sample mean and sample variance are denoted \bar{X} and s^2, respectively. In any case, let’s continue with the population variance for simplicity.

In simple words, the variance measures how much the values of X deviate from the mean \mu. A high variance indicates that the data points are spread out over a wider range of values, while a low variance indicates that they are closer to the mean.
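The population and sample formulas above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up data; the key detail is the `ddof` argument of `np.var`, which switches the denominator between N and N-1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # made-up data: mean 5, std 2

mu = x.mean()
pop_var = np.mean((x - mu) ** 2)                  # population variance: divide by N
samp_var = np.sum((x - mu) ** 2) / (len(x) - 1)   # sample variance: divide by N-1

# np.var exposes the same choice via ddof: ddof=0 -> N, ddof=1 -> N-1
assert np.isclose(pop_var, np.var(x, ddof=0))
assert np.isclose(samp_var, np.var(x, ddof=1))
```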

19.2 covariance

Now, let’s consider two random variables, X and Y, with means \mu_X and \mu_Y. The covariance between X and Y, denoted \text{cov}(X, Y), is defined as:

\text{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y).

A positive covariance indicates that when X is above its mean, Y tends to be above its mean as well (and vice versa). A negative covariance indicates that when X is above its mean, Y tends to be below its mean. A covariance near zero indicates no consistent linear relationship between the two.
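The covariance formula translates directly to code. A short sketch with simulated data (the linear relationship between `x` and `y` is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)  # y moves with x, plus noise

# population covariance: mean of the product of deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov returns the full covariance matrix; bias=True uses the 1/N convention
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```

Since `y` is constructed to increase with `x`, the computed covariance comes out positive.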

19.3 correlation

The covariance can take any value, and its magnitude depends on the units and scales of X and Y, which makes it difficult to interpret. To standardize the measure, we use the correlation coefficient, denoted \rho (for population) or r (for sample). The correlation coefficient is defined as:

\begin{align*} \rho_{X,Y} &= \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y} \\ &= E\left[\frac{(X - \mu_X)}{\sigma_X}\frac{(Y - \mu_Y)}{\sigma_Y}\right] \\ &= \frac{1}{N} \sum_{i=1}^{N} \frac{X_i - \mu_X}{\sigma_X}\frac{Y_i - \mu_Y}{\sigma_Y}. \end{align*}

where \sigma_X and \sigma_Y are the standard deviations of X and Y, respectively.

One property becomes clear immediately. If we calculate the correlation of X with itself, we get:

\rho_{X,X} = \frac{\text{cov}(X, X)}{\sigma_X \sigma_X} = \frac{\text{var}(X)}{\sigma_X^2} = 1.

The highest possible correlation is 1, which indicates a perfect positive linear relationship between the two variables. The lowest possible correlation is -1, which indicates a perfect negative linear relationship. A correlation of 0 indicates no linear relationship between the variables.
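A small sketch that implements the definition above and checks it against NumPy's built-in `np.corrcoef` (the data and the negative slope are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = -1.5 * x + rng.normal(scale=1.0, size=500)  # negatively related to x

def corr(a, b):
    # cov(a, b) / (sigma_a * sigma_b), using population (1/N) conventions;
    # np.std defaults to ddof=0, i.e. the population standard deviation
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return np.mean(a_dev * b_dev) / (a.std() * b.std())

r_xy = corr(x, y)
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])  # matches NumPy's result
assert np.isclose(corr(x, x), 1.0)                # correlation of X with itself is 1
assert -1.0 <= r_xy < 0.0                         # negative linear relationship
```

Note that the 1/N versus 1/(N-1) choice cancels in the ratio, which is why the hand-rolled version agrees with `np.corrcoef` exactly.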

19.4 Pearson correlation coefficient

When we say “correlation”, we usually mean the Pearson correlation coefficient, which is the formula given above. It is named after Karl Pearson, who developed it. There are other correlation coefficients, such as Spearman’s rank correlation coefficient and Kendall’s tau coefficient, which are non-parametric measures based on ranks and are often used for ordinal data or monotone non-linear relationships.
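To make the distinction concrete, here is a sketch comparing the three coefficients on a monotone but non-linear relationship (it assumes SciPy is available; the exponential relationship is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=300)
y = np.exp(x)  # strictly increasing in x, but highly non-linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
kendall_t, _ = stats.kendalltau(x, y)

# Spearman and Kendall operate on ranks, so a perfectly monotone relationship
# scores 1.0 even though the Pearson (linear) correlation falls short of 1.
assert np.isclose(spearman_r, 1.0)
assert np.isclose(kendall_t, 1.0)
assert pearson_r < 1.0
```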

19.5 covariance of z-scored variables

Notice that the correlation formula can be interpreted as the covariance of the z-scored variables. The z-score of a variable X is defined as:

Z_X = \frac{X - \mu_X}{\sigma_X}

Thus, the correlation can be rewritten as:

\rho_{X,Y} = \text{cov}(Z_X, Z_Y) = \frac{1}{N} \sum_{i=1}^{N} Z_{X_i} Z_{Y_i}

It is quite easy to compute the correlation on a computer. If X and Y are two arrays, we first standardize (z-score) them, then compute their dot product (the sum of element-wise products), and finally divide by N (or by N-1, if we standardized using the sample standard deviation).
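The procedure just described can be sketched as follows (simulated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = 0.8 * x + rng.normal(scale=0.6, size=400)

# 1) standardize (z-score) both variables, using the population (ddof=0) std
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# 2) dot product of the z-scores, 3) divide by N
r = zx @ zy / len(x)

assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # agrees with NumPy
```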

19.6 linearity

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables. It does not capture non-linear relationships. For example, if Y = X^2, the correlation between X and Y may be low or even zero, despite a clear non-linear relationship. When we say “correlation”, it is usually implicit that we are referring to linear correlation.
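The Y = X^2 case is easy to demonstrate: with X symmetric around zero, the positive and negative deviations cancel exactly, so the linear correlation is zero despite a perfect (deterministic) relationship:

```python
import numpy as np

# x symmetric around 0; y is a deterministic, non-linear function of x
x = np.linspace(-1, 1, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
# the linear correlation is (numerically) zero
assert abs(r) < 1e-8
```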