26 logistic connection
26.1 from Bayes to the logistic
The arguments below follow those in subsection 12.2 of “Introduction to Environmental Data Science” by William W. Hsieh.
We start with Bayes’ theorem for two classes C_1 and C_2:
P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x)} \tag{1}
Using the law of total probability in the denominator, we get:
P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} \tag{2}
We now divide the numerator and denominator by P(x|C_1)P(C_1):
P(C_1|x) = \frac{1}{1 + \frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}} \tag{3}
We now note that the ratio P(C_2|x)/P(C_1|x) can be expressed as:
\frac{P(C_2|x)}{P(C_1|x)} = \frac{\frac{P(x|C_2)P(C_2)}{P(x)}}{\frac{P(x|C_1)P(C_1)}{P(x)}} = \frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)} \tag{4}
In the expression above, we used Bayes’ theorem (1) to express P(C_2|x) and P(C_1|x) in terms of P(x|C_2) and P(x|C_1). We can now rewrite (3) as: P(C_1|x) = \frac{1}{1 + \frac{P(C_2|x)}{P(C_1|x)}} = \frac{1}{1 + \left(\frac{P(C_1|x)}{P(C_2|x)}\right)^{-1}} \tag{5}
The posterior probability P(C_1|x) is a function of the ratio P(C_1|x)/P(C_2|x). This ratio is called the posterior odds, or simply the odds. Taking the logarithm of the posterior odds turns this function into a sigmoid. The logarithm of the posterior odds is called the log-odds or logit: \text{logit} = u = \ln\left(\frac{P(C_1|x)}{P(C_2|x)}\right) \tag{6}
We can now rewrite (5) in terms of the logit: P(C_1|x) = \frac{1}{1 + e^{-u}} \tag{7}
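As a quick numerical sanity check of (6) and (7), here is a minimal Python sketch; the posterior value 0.8 is an assumed number chosen purely for illustration. Applying the logistic function to the logit recovers the posterior we started from:

```python
import math

# Quick numerical check of (6) and (7): the logistic of the log-odds
# recovers the original posterior probability.
p_c1 = 0.8            # assumed value of P(C_1|x), chosen only for illustration
p_c2 = 1.0 - p_c1     # two-class problem, so the posteriors sum to 1

u = math.log(p_c1 / p_c2)               # logit, Eq. (6)
recovered = 1.0 / (1.0 + math.exp(-u))  # logistic, Eq. (7)

print(u, recovered)   # recovered equals 0.8 up to floating-point error
```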
Finally, we assume that there is a linear relationship between u and the features x:
u = \sum_j w_j x_j + w_0 = \mathbf{w}^T \mathbf{x} + w_0 \tag{8}
We now have the logistic function that connects the features x to the posterior probability P(C_1|x): P(C_1|x) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}} \tag{9}
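In code, (9) is a one-line computation. The sketch below uses made-up weights w, w_0 and a made-up feature vector x, just to show the shape of the calculation; in practice the weights would be estimated from data (e.g., by maximum likelihood).

```python
import numpy as np

def posterior_c1(x, w, w0):
    """Eq. (9): P(C_1|x) as the logistic of the linear score w^T x + w_0."""
    u = np.dot(w, x) + w0
    return 1.0 / (1.0 + np.exp(-u))

# Illustrative weights and feature vector, not fitted to any data.
w = np.array([1.5, -0.7])
w0 = 0.2
x = np.array([0.4, 1.1])

print(posterior_c1(x, w, w0))   # a probability in (0, 1)
```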
The one assumption that is needed to make the connection from Bayes’ theorem to the logistic function is that there is a linear relationship between the log-odds and the features x:
\ln\left(\frac{P(C_1|x)}{P(C_2|x)}\right) = \mathbf{w}^T \mathbf{x} + w_0 \tag{10}
This seems a rather arbitrary assumption. Why does this make sense?
- A linear relationship between the log odds and the features is simple and easy to interpret.
- Linear models are easy to implement and computationally efficient.
- In a few specific cases (see below) the linearity doesn’t have to be assumed; it emerges naturally from the model.
26.2 emergent linearity
Let’s start from the log odds definition in (6):
u = \ln\left(\frac{P(C_1|x)}{P(C_2|x)}\right) = \ln\left(\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}\right) \tag{11}
We rewrite this as:
\begin{align*} u &= \ln \frac{P(x|C_1)}{P(x|C_2)} + \ln \frac{P(C_1)}{P(C_2)} \\ &= \ln P(x|C_1) - \ln P(x|C_2) + \ln \frac{P(C_1)}{P(C_2)}. \tag{12} \end{align*}
We now make the assumption that the likelihoods P(x|C_k) are Gaussian distributions. For simplicity, let’s assume that x is a single feature (univariate case).
P(x|C_k) = \frac{1}{\sqrt{2\pi}\sigma_k} \exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right), \tag{13}
where k = 1, 2 indexes the two classes C_1 and C_2.
We now calculate the log of the likelihoods:
\ln P(x|C_k) = -\ln \sqrt{2\pi \sigma_k^2} - \frac{(x-\mu_k)^2}{2\sigma_k^2}. \tag{14}
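As a quick check, (14) agrees with the Gaussian log-density provided by scipy; the values of x, μ_k and σ_k below are arbitrary.

```python
import math
from scipy.stats import norm

# Sanity check of Eq. (14) against scipy's Gaussian log-density.
# The values of x, mu_k and sigma_k are arbitrary.
x, mu_k, sigma_k = 1.3, 0.5, 2.0

log_lik_manual = (-math.log(math.sqrt(2 * math.pi * sigma_k**2))
                  - (x - mu_k)**2 / (2 * sigma_k**2))
log_lik_scipy = norm.logpdf(x, loc=mu_k, scale=sigma_k)

print(log_lik_manual, log_lik_scipy)   # the two values agree
```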
We now substitute this into Eq. (12) for the log odds:
\begin{align*} u &= -\ln \sqrt{2\pi \sigma_1^2} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \ln \sqrt{2\pi \sigma_2^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2} + \ln \frac{P(C_1)}{P(C_2)} \\ &= \ln \frac{\sigma_2}{\sigma_1} + \frac{(x-\mu_2)^2}{2\sigma_2^2} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \ln \frac{P(C_1)}{P(C_2)}. \tag{15} \end{align*}
KEY ASSUMPTION: if we assume that the two classes have the same variance, \sigma_1 = \sigma_2 = \sigma, the expression simplifies to:
\begin{align*} u &= \frac{1}{2\sigma^2} \left( (x-\mu_2)^2 - (x-\mu_1)^2 \right) + \ln \frac{P(C_1)}{P(C_2)} \\ &= \frac{1}{2\sigma^2} \left( x^2 - 2x\mu_2 + \mu_2^2 - x^2 + 2x\mu_1 - \mu_1^2 \right) + \ln \frac{P(C_1)}{P(C_2)} \\ &= \frac{\mu_1 - \mu_2}{\sigma^2} x + \frac{\mu_2^2 - \mu_1^2}{2\sigma^2} + \ln \frac{P(C_1)}{P(C_2)}. \tag{16} \end{align*}
The first term depends on x linearly, and the other two terms are constants. We can thus rewrite the log odds u as:
u = wx + w_0, \tag{17}
where w = \frac{\mu_1 - \mu_2}{\sigma^2}, \quad w_0 = \frac{\mu_2^2 - \mu_1^2}{2\sigma^2} + \ln \frac{P(C_1)}{P(C_2)}. \tag{18}
Under the assumption that the distributions have equal variance, the posterior probability can be expressed as a logistic function of a linear combination of the input feature.
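This can be verified numerically. The sketch below, which assumes arbitrary values for μ_1, μ_2, σ and the priors, compares the posterior computed directly from Bayes’ theorem (2) with the logistic of wx + w_0 from (17)–(18):

```python
import numpy as np
from scipy.stats import norm

# Numerical check of (16)-(18): with equal-variance Gaussian likelihoods,
# the Bayes posterior P(C_1|x) equals the logistic of w*x + w_0.
# All parameter values are arbitrary choices for illustration.
mu1, mu2, sigma = 1.0, -0.5, 0.8
prior1, prior2 = 0.3, 0.7

w = (mu1 - mu2) / sigma**2
w0 = (mu2**2 - mu1**2) / (2 * sigma**2) + np.log(prior1 / prior2)

x = np.linspace(-3, 3, 7)

# Posterior directly from Bayes' theorem, Eq. (2).
lik1 = norm.pdf(x, loc=mu1, scale=sigma)
lik2 = norm.pdf(x, loc=mu2, scale=sigma)
posterior_bayes = lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)

# Posterior from the logistic of the linear log-odds, Eqs. (17)-(18).
posterior_logistic = 1.0 / (1.0 + np.exp(-(w * x + w0)))

print(np.allclose(posterior_bayes, posterior_logistic))   # True
```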
This is probably the simplest example of a connection between a generative model (Gaussian distributions for each class) and a discriminative model (logistic regression). The same result holds for other distributions from the exponential family, e.g., Poisson, Bernoulli, or exponential. The one condition they all need to satisfy is that the non-linear part of the log-likelihoods cancels out when we compute the log-odds, leaving a linear function of x.
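For example, with Poisson likelihoods with rates λ_1 and λ_2, the \ln(x!) term cancels in the log-odds, leaving u = x\ln(λ_1/λ_2) - (λ_1 - λ_2) + \ln\frac{P(C_1)}{P(C_2)}, which is again linear in x. A small sketch, with arbitrary rates and priors, confirms this:

```python
import numpy as np
from scipy.stats import poisson

# Poisson example of the exponential-family claim: the ln(x!) terms cancel
# in the log-odds, leaving a function that is linear in x.
# Rates and priors are arbitrary illustrative values.
lam1, lam2 = 4.0, 1.5
prior1, prior2 = 0.5, 0.5

x = np.arange(0, 10)

log_odds = (poisson.logpmf(x, lam1) + np.log(prior1)
            - poisson.logpmf(x, lam2) - np.log(prior2))

# Closed form: u = x * ln(lam1/lam2) - (lam1 - lam2) + ln(prior1/prior2)
w = np.log(lam1 / lam2)
w0 = -(lam1 - lam2) + np.log(prior1 / prior2)

print(np.allclose(log_odds, w * x + w0))   # True: the log-odds is linear in x
```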
When we have real data in our hands, we usually don’t know the underlying distributions. The calculation above shows that a linear relationship between the log odds and the features emerges naturally in a few cases, and this is the motivation for the more general assumption in (10).