21  cosine

In the spirit of this website, beautiful things happen when we imagine data in high dimensional spaces. Let’s do that for the correlation between two vectors.

\rho(x,y) = \frac{1}{N} \sum_{i=1}^N \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right) \tag{1}

As we saw before, it is particularly useful to rewrite this formula in terms of z-scored variables:

\rho(x,y) = \frac{1}{N} \sum_{i=1}^N z_{x_i} z_{y_i} \tag{2}

where z_{x_i} = \frac{x_i - \bar{x}}{\sigma_x} and z_{y_i} = \frac{y_i - \bar{y}}{\sigma_y}.
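As a quick numerical check of Eq. (2) — a sketch assuming NumPy, with simulated data — we can compute the z-scores by hand and compare against a library implementation:

```python
import numpy as np

# simulated data; any two same-length vectors work
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)

# z-score using the population standard deviation (NumPy's default, ddof=0),
# matching the 1/N in Eq. (1)
z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()

rho = np.mean(z_x * z_y)  # Eq. (2)
print(rho, np.corrcoef(x, y)[0, 1])  # the two values agree
```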

Now consider the dot product of the two z-scored vectors z_x and z_y:

z_x \cdot z_y = \sum_{i=1}^N z_{x_i} z_{y_i} = N \left( \frac{1}{N} \sum_{i=1}^N z_{x_i} z_{y_i} \right) = N \, \rho(x,y) \tag{3}
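Eq. (3) is easy to verify numerically — a sketch assuming NumPy, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.normal(size=N)
y = 0.5 * x + rng.normal(size=N)

# z-score with the population standard deviation (ddof=0)
z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()

dot = z_x @ z_y                 # left-hand side of Eq. (3)
rho = np.corrcoef(x, y)[0, 1]
print(dot, N * rho)             # the dot product is N times the correlation
```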

There is another formula for the dot product that involves the angle \theta between the two vectors:

z_x \cdot z_y = \lVert z_x \rVert \, \lVert z_y \rVert \cos(\theta) \tag{4}

where \lVert z_x \rVert and \lVert z_y \rVert are the magnitudes (or lengths) of the vectors z_x and z_y.

The magnitude squared of a z-scored vector is:

\begin{align*} \lVert z_x \rVert ^2 &= \sum_{i=1}^N z_{x_i}^2 \\ &= \sum_{i=1}^N \left(\frac{x_i - \bar{x}}{\sigma_x}\right)^2 \\ &= \frac{1}{\sigma_x^2} \sum_{i=1}^N (x_i - \bar{x})^2 \\ &= \frac{1}{\sigma_x^2} \left( \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2 \right) N \\ & = N\frac{\sigma_x^2}{\sigma_x^2} \\ &= N \tag{5} \end{align*}

Of course, the same goes for z_y, so we have \lVert z_x \rVert = \lVert z_y \rVert = \sqrt{N}. Substituting this into Eq. (4) gives:
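Eq. (5) also checks out in code — a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
x = rng.normal(size=N)
z_x = (x - x.mean()) / x.std()  # population std (ddof=0)

# Eq. (5): the squared magnitude of a z-scored vector equals N
print(np.linalg.norm(z_x) ** 2)
```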

z_x \cdot z_y = \sqrt{N} \cdot \sqrt{N} \cdot \cos(\theta) = N \cos(\theta) \tag{6}

Finally, we equate Eqs. (3) and (6) and divide both sides by N:

\rho(x,y) = \cos(\theta) \tag{7}

The correlation between two variables is equal to the cosine of the angle between their corresponding z-scored vectors in an N-dimensional space, one dimension per observation.
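The whole derivation can be confirmed end to end — a sketch assuming NumPy, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = -0.4 * x + rng.normal(size=200)

z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()

# cosine of the angle between the z-scored vectors, via Eq. (4)
cos_theta = z_x @ z_y / (np.linalg.norm(z_x) * np.linalg.norm(z_y))
rho = np.corrcoef(x, y)[0, 1]
print(cos_theta, rho)  # identical up to floating-point error
```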

21.1 cosine similarity

The cosine similarity is a measure of similarity between two non-zero vectors. It is defined as:

\text{cosine\_similarity}(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \tag{8}

This measure is common in text analysis, where a non-zero element of a vector represents the presence of a word in a document, and the value of the element represents the frequency of that word. The cosine similarity measures the cosine of the angle between two vectors, which indicates how similar the two vectors are in terms of their direction, regardless of their magnitude. If we were to z-score the vectors, the absence of a word would be represented by a negative value (below average), which is not useful in this context.
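Eq. (8) in code, applied to hypothetical word-count vectors — a sketch assuming NumPy; the vocabulary and counts are made up:

```python
import numpy as np

def cosine_similarity(x, y):
    """Eq. (8): dot product divided by the product of the magnitudes."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# word counts over a shared 4-word vocabulary for two short documents
doc_a = np.array([2, 1, 0, 3])
doc_b = np.array([1, 0, 1, 2])
print(cosine_similarity(doc_a, doc_b))
```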

When the vectors are z-scored, the cosine similarity is identical to the Pearson correlation coefficient.

  • cosine similarity: works on any non-zero vectors and requires no z-scoring; the reference point stays at the origin.
  • Pearson correlation: equivalent to cosine similarity on z-scored vectors; centering and scaling move the reference point to the mean of each variable. Useful when comparing variables with different units or scales.
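The difference in reference points shows up when the data are shifted; a small illustration, assuming NumPy, with arbitrary numbers:

```python
import numpy as np

def cosine_similarity(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x  # proportional to x

# Proportional vectors: both measures equal 1
print(cosine_similarity(x, y), np.corrcoef(x, y)[0, 1])

# Adding a constant moves y away from the origin but not from its mean:
# the cosine similarity drops, while the correlation stays at 1
y_shifted = y + 100
print(cosine_similarity(x, y_shifted), np.corrcoef(x, y_shifted)[0, 1])
```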