28  conjugate prior

We will learn about conjugate priors in Bayesian statistics from a concrete example. Once we understand it, we will generalize the idea.

28.1 question

Let’s use here the same example from the chapter on cross-entropy and KL divergence:

Assume I live in city A, where it rains 50% of the days. A friend of mine lives in city B, where it rains 10% of the days. What happens when my friend visits me in city A and, not knowing any better, assumes that it rains 10% of the days?

The specific question we want to answer is: how can my friend update their belief about the probability of rain when they arrive in city A?

28.2 bayes’ theorem

We will use Bayes’ theorem to update our friend’s belief. To make things easier to remember, let’s call the hypothesis p (the probability of rain) and the evidence R (a specific observation of rain or no rain). Bayes’ theorem can then be written as:

\begin{align} P(p|R) &= \frac{P(R|p)\cdot P(p)}{P(R)} \\ \text{posterior}&= \frac{\text{likelihood}\cdot \text{prior}}{\text{evidence}} \end{align}

where:

  • p is the hypothesis, the probability of rain.
  • R is the evidence, the observation of rain or no rain.
  • P(p) is the prior probability, our friend’s initial belief about the probability of rain.
  • P(R|p) is the likelihood, the likelihood of observing rain given the hypothesis that it rains with a certain probability.
  • P(R) is the evidence, the total probability of observing the evidence.
  • P(p|R) is the posterior probability, this is what we want to find: our friend’s updated belief about the probability of rain p after observing the evidence.
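
To make the terms above concrete, here is a minimal numerical sketch (the numbers are hypothetical and only for illustration): suppose our friend entertains just two candidate hypotheses, p=0.1 and p=0.5, considers them equally likely a priori, and then observes a single rainy day.

two-hypothesis Bayes update (illustrative sketch)
import numpy as np

hypotheses = np.array([0.1, 0.5])  # candidate values of p
prior = np.array([0.5, 0.5])       # equal prior belief in each hypothesis
likelihood = hypotheses            # P(R=1|p) = p for a rainy day
evidence = np.sum(likelihood * prior)      # P(R=1), total probability of rain
posterior = likelihood * prior / evidence  # Bayes' theorem
print(posterior)  # approximately [0.17, 0.83]: belief shifts towards p=0.5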

28.3 modeling the likelihood

In this problem, on every day that passes it either rains or it does not. This can be understood as a Bernoulli process, where each day is an independent trial with two possible outcomes. “Success” would be the occurrence of rain (R=1), which happens with probability p. “Failure” would be no rain (R=0), which happens with probability 1-p. In mathematical terms, the likelihood can be modeled as:

P(R=r|p) = \begin{cases} p & \text{if } r=1 \\ 1-p & \text{if } r=0 \end{cases}.

This can be more compactly written as:

P(R=r|p) = p^r (1-p)^{1-r}.

This equation describes only one observation. However, we can extend it to multiple observations. Suppose our friend observes, over a total of n days, k days of rain and n-k days of no rain. The likelihood of observing k rainy days out of n (in any order), given the probability p, can be modeled with the binomial distribution, which is the natural extension of the Bernoulli process to multiple trials:

P(R=k|p) = \binom{n}{k} p^k (1-p)^{n-k}.

This is a good time to be more precise. When we said earlier that “p is the hypothesis, the probability of rain”, we glossed over the modeling aspect. We assumed a generative model, the Bernoulli process. Rephrasing the statement more precisely: “p is the parameter of the generative model (the Bernoulli process) that generates the observations of rain and no rain”.
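
As a small illustration (with hypothetical numbers), we can evaluate this binomial likelihood over a grid of candidate p values for, say, k=2 rainy days out of n=14, and see that it peaks near p = k/n:

binomial likelihood over a grid of p (illustrative sketch)
import numpy as np
from scipy.stats import binom

n, k = 14, 2                           # hypothetical: 2 rainy days out of 14
p_grid = np.linspace(0.01, 0.99, 99)   # candidate values of p
likelihood = binom.pmf(k, n, p_grid)   # P(R=k|p) for each candidate p
print(p_grid[np.argmax(likelihood)])   # ~0.14, close to k/n = 2/14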

28.4 modeling the prior

The prior in the Bayesian framework is not a single value. From the question above, we might think that our friend’s prior belief about the probability of rain is simply 0.1=10\%. However, in Bayesian statistics the prior is represented as a probability distribution over all possible values of p. It would make sense to choose a distribution that is highest around 0.1 and lower elsewhere. There are infinitely many distributions that could represent this belief, so which should we choose? Below are three examples of possible prior distributions, all with their mean at 0.1.

import libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.gridspec as grid_spec
from matplotlib.lines import Line2D
from scipy.stats import beta, norm, uniform, binom
plot various possible distributions
fig, ax = plt.subplots(figsize=(6, 4))
location = 0.1
p = np.linspace(0, 1, 1000)
ax.plot(p, beta.pdf(p, 2, 2/location-2), label='Beta(2, 18)', color="black")
ax.plot(p, norm.pdf(p, loc=location, scale=0.02), label='Normal(0.1, 0.02)', color="tab:orange")
ax.plot(p, uniform.pdf(p, loc=location-0.05, scale=0.1), label='Uniform(0.05, 0.15)', color="tab:blue")
ax.legend(frameon=False)
ax.set(xlabel='Probability of rain (p)',
       ylabel='Probability Density',
       title='Possible prior distributions for the probability of rain');

A particularly good choice is the Beta distribution. The Beta distribution is defined on the interval [0, 1], which makes it suitable for modeling probabilities. It is parameterized by two positive shape parameters, \alpha and \beta, which determine its shape. The probability density function (PDF) of the Beta distribution is given by:

\text{Beta}(p|\alpha, \beta) = \frac{p^{\alpha - 1} (1-p)^{\beta - 1}}{B(\alpha, \beta)},

where B(\alpha, \beta) is the Beta function, which serves as a normalization constant to ensure that the total probability integrates to 1.

The Beta distribution in the graph above is \text{Beta}(p|\alpha=2, \beta=18). How did I choose these parameters? The mean of a Beta distribution is given by:

\text{mean} = \frac{\alpha}{\alpha + \beta}.

Deriving this formula involves calculating the expected value of the distribution using its pdf, together with properties of the Beta function and the Gamma function; a quick sketch is given below. Indeed, by choosing \alpha=2 and \beta=18, we get a mean of 0.1.
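
A quick sketch of that derivation, using B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} and \Gamma(z+1) = z\,\Gamma(z):

\begin{align*} \mathbb{E}[p] &= \int_0^1 p \cdot \frac{p^{\alpha - 1} (1-p)^{\beta - 1}}{B(\alpha, \beta)} \, dp = \frac{B(\alpha+1, \beta)}{B(\alpha, \beta)} \\ &= \frac{\Gamma(\alpha+1)\,\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} = \frac{\alpha}{\alpha+\beta}. \end{align*}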

Intuitively, a rain probability of 0.1 means that out of every n days, we expect it to rain on average 0.1n days. In terms of “successes” and “failures”:

\begin{align*} \text{mean} &= \frac{\alpha}{\alpha + \beta} \\ &= \frac{\text{expected successes}}{\text{expected successes} + \text{expected failures}} \\ &= \frac{\text{expected successes}}{\text{total number of trials}} \end{align*}

Instead of choosing \alpha=2 and \beta=18, we could choose other values that maintain the same mean but represent different levels of confidence or prior knowledge. See the three Beta distributions plotted below. All have their mean at 0.1, but the black one (\alpha=2, \beta=18) is more spread out, indicating less certainty about the probability of rain when only 2+18=20 days are considered. Increasing the number of days to 50 (orange, \alpha=5, \beta=45) or 200 (blue, \alpha=20, \beta=180) makes the distribution more peaked around the mean, indicating greater confidence in the estimate of the probability of rain.

three beta distributions
fig, ax = plt.subplots(figsize=(6, 4))
location = 0.1
p = np.linspace(0, 1, 1000)
ax.plot(p, beta.pdf(p, 2, 2/location-2), label='Beta(2, 18)', color="black")
ax.plot(p, beta.pdf(p, 5, 5/location-5), label='Beta(5, 45)', color="tab:orange")
ax.plot(p, beta.pdf(p, 20, 20/location-20), label='Beta(20, 180)', color="tab:blue")
ax.legend(frameon=False)
ax.set(xlabel='Probability of rain (p)',
       ylabel='Probability Density',
       title='Beta prior distributions with mean at 0.1');

28.5 posterior looks like an updated prior

Why is this distribution particularly convenient? Note that it has the factors p^{\alpha - 1} and (1-p)^{\beta - 1}, which mirror the factors p^k (1-p)^{n-k} in the likelihood function.

According to Bayes’ theorem, the posterior distribution is proportional to the product of the likelihood and the prior. Thus, if we choose a Beta distribution as the prior, the posterior distribution will also be a Beta distribution, but with updated parameters. Let’s see how this works mathematically:

\begin{align*} P(p|R=k) &= \frac{1}{\underbrace{P(R=k)}_{\text{evidence}}} \cdot \underbrace{P(R=k|p)}_{\text{likelihood}} \cdot \underbrace{P(p)}_{\text{prior}} \\ &= \frac{1}{P(R=k)} \left( \binom{n}{k} p^k (1-p)^{n-k} \right) \cdot \left( \frac{p^{\alpha - 1} (1-p)^{\beta - 1}}{B(\alpha, \beta)} \right) \\ &= \underbrace{ \frac{1}{P(R=k)} \binom{n}{k} \frac{1}{B(\alpha, \beta)} }_{\text{normalization constant}} p^{k + \alpha - 1} (1-p)^{n - k + \beta - 1}\\ & = \text{Beta}(p | \alpha + k, \beta + n - k) \\ & = \text{Beta}(p | \alpha + \text{successes}, \beta + \text{failures}). \end{align*}

This isn’t magic. We simply chose a prior distribution (Beta) that, when multiplied by the likelihood (Binomial), results in a posterior distribution of the same family (Beta). This property is what defines conjugate priors. Why is this useful?

  • Computational Simplicity: The posterior distribution can be computed analytically without the need for complex numerical methods.
  • Intuitive Interpretation: The parameters of the posterior distribution can be interpreted as updated counts of successes and failures, making it easy to understand how new data influences our beliefs.
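
The conjugate update derived above is easy to check numerically. Here is a minimal sketch (with hypothetical counts): multiply the likelihood by the prior on a grid of p values, normalize, and compare the result with the analytic \text{Beta}(\alpha + k, \beta + n - k) density.

verify the Beta posterior numerically (illustrative sketch)
import numpy as np
from scipy.stats import beta, binom

alpha_0, beta_0 = 2, 18   # prior parameters
n, k = 14, 2              # hypothetical data: 2 rainy days out of 14

p_grid = np.linspace(0.001, 0.999, 999)
dp = p_grid[1] - p_grid[0]
unnormalized = binom.pmf(k, n, p_grid) * beta.pdf(p_grid, alpha_0, beta_0)
numerical_posterior = unnormalized / (unnormalized.sum() * dp)   # normalize on the grid
analytic_posterior = beta.pdf(p_grid, alpha_0 + k, beta_0 + n - k)
print(np.max(np.abs(numerical_posterior - analytic_posterior)))  # small, up to grid error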

28.6 iterative updating of beliefs

Let’s say that my friends Bob and Alice move to city A, where it rains on 50% of the days. Bob has a prior belief modeled as a Beta distribution with parameters \alpha=2 and \beta=18, reflecting his initial belief that it rains 10% of the days, but with a low level of confidence. Alice, however, has an initial belief closer to the truth, 20%, but with a much higher level of confidence, modeled as a Beta distribution with parameters \alpha=100 and \beta=400. Every week that passes, they observe the weather and write down the number of rainy days, thus collecting the following data over 52 weeks (one year):

generate binomial data for 52 weeks
np.random.seed(6)
N_weeks = 52
success_array_daily = binom.rvs(n=1, p=0.5, size=N_weeks*7)  # first generate daily data
success_array = success_array_daily.reshape(-1, 7).sum(axis=1)  # aggregate to weekly data
# success_array = binom.rvs(n=7, p=0.5, size=N_weeks)  # do this if you don't need daily data
failure_array = 7 - success_array
sf_array = np.vstack([success_array, failure_array]).T
print(success_array)
[4 4 5 6 4 4 2 3 5 4 2 3 5 3 3 5 4 3 1 3 4 4 2 2 3 5 1 2 6 4 3 2 3 5 6 3 4
 5 3 5 2 3 4 3 3 5 5 3 5 6 5 2]

Now we just use the updating formula iteratively over the 52 weeks of observations:

update priors over 52 weeks and print parameters
bobs_parameters = np.array([[2, 18]])
alices_parameters = np.array([[100, 400]])
print("\t\tBob\t\tBob\t\tAlice\tAlice")
print("week\talpha\tbeta\talpha\tbeta")
print(f"{0}\t\t{bobs_parameters[0][0]}\t\t{bobs_parameters[0][1]}\t\t{alices_parameters[0][0]}\t\t{alices_parameters[0][1]}")
for week in np.arange(N_weeks):
    bobs_last_weeks_parameters = bobs_parameters[-1]
    bobs_this_weeks_parameters = bobs_last_weeks_parameters + sf_array[week]
    bobs_parameters = np.vstack([bobs_parameters, bobs_this_weeks_parameters])

    alices_last_weeks_parameters = alices_parameters[-1]
    alices_this_weeks_parameters = alices_last_weeks_parameters + sf_array[week]
    alices_parameters = np.vstack([alices_parameters, alices_this_weeks_parameters])

    print(f"{week+1}\t\t{bobs_this_weeks_parameters[0]}\t\t{bobs_this_weeks_parameters[1]}\t\t{alices_this_weeks_parameters[0]}\t\t{alices_this_weeks_parameters[1]}")
        Bob     Bob     Alice   Alice
week    alpha   beta    alpha   beta
0       2       18      100     400
1       6       21      104     403
2       10      24      108     406
3       15      26      113     408
4       21      27      119     409
5       25      30      123     412
6       29      33      127     415
7       31      38      129     420
8       34      42      132     424
9       39      44      137     426
10      43      47      141     429
11      45      52      143     434
12      48      56      146     438
13      53      58      151     440
14      56      62      154     444
15      59      66      157     448
16      64      68      162     450
17      68      71      166     453
18      71      75      169     457
19      72      81      170     463
20      75      85      173     467
21      79      88      177     470
22      83      91      181     473
23      85      96      183     478
24      87      101     185     483
25      90      105     188     487
26      95      107     193     489
27      96      113     194     495
28      98      118     196     500
29      104     119     202     501
30      108     122     206     504
31      111     126     209     508
32      113     131     211     513
33      116     135     214     517
34      121     137     219     519
35      127     138     225     520
36      130     142     228     524
37      134     145     232     527
38      139     147     237     529
39      142     151     240     533
40      147     153     245     535
41      149     158     247     540
42      152     162     250     544
43      156     165     254     547
44      159     169     257     551
45      162     173     260     555
46      167     175     265     557
47      172     177     270     559
48      175     181     273     563
49      180     183     278     565
50      186     184     284     566
51      191     186     289     568
52      193     191     291     573

Finally, we can plot how the probability densities of Bob and Alice are updated over the 52 weeks.

ridge plot
N_weeks = 52
N_panels = N_weeks + 1
gs = grid_spec.GridSpec(N_panels,1)
fig = plt.figure(figsize=(10,12))
p = np.linspace(0, 1, 1000)
bob_colors = mpl.cm.Blues(np.linspace(0.4,0.8,N_panels))
alice_colors = mpl.cm.Reds(np.linspace(0.4,0.8,N_panels))
ax_objs = []

for week in np.arange(N_panels)[::-1]:
    # creating new axes object, start from top = week 52
    ax_objs.append(fig.add_subplot(gs[N_panels-week-1:N_panels-week, 0:]))
    
    bobs_params = bobs_parameters[week]
    alices_params = alices_parameters[week]

    # don't plot the whole distribution, only when greater than threshold
    range_bob = beta.pdf(p, bobs_params[0], bobs_params[1])
    domain = np.where(range_bob > 1e-3*np.max(range_bob))
    ax_objs[-1].fill_between(p[domain], range_bob[domain],
                             color=bob_colors[week], alpha=1.0,
                             clip_on=False, ec="white", zorder=N_panels-week,
                             label="Bob")
    range_alice = beta.pdf(p, alices_params[0], alices_params[1])
    domain = np.where(range_alice > 1e-3*np.max(range_alice))
    ax_objs[-1].fill_between(p[domain], range_alice[domain],
                             color=alice_colors[week], alpha=1.0,
                             clip_on=False, ec="white", zorder=N_panels-week,
                             label="Alice")

    ax_objs[-1].set(xlim=(0,0.6),
                    ylim=(0,15),
                    yticks=[])
    if week>0:
        ax_objs[-1].set_xticks([])

    # make background transparent
    rect = ax_objs[-1].patch
    rect.set_alpha(0)
    ax_objs[-1].set_yticklabels([])

    if week == N_panels-1:
        ax_objs[-1].legend(frameon=False, loc="upper left", fontsize=12, ncol=2)
    if week%4==0:
        ax_objs[-1].set_yticks([0])
        ax_objs[-1].set_yticklabels([f"week {week}"])
        ax_objs[-1].tick_params(axis='y', length=0)
        ax_objs[-1].axhline(0, color="black", lw=1, zorder=N_panels-week+1)
    spines = ["top","right","left","bottom"]
    for s in spines:
        ax_objs[-1].spines[s].set_visible(False)

ax_objs[-1].set(xlabel='Probability densities updated over 52 weeks')
gs.update(hspace=-0.7)

ax_top = ax_objs[0]
ax_bottom = ax_objs[-1]
pos_top = ax_top.get_position().extents # left, bottom, right, top
pos_bottom = ax_bottom.get_position().extents # left, bottom, right, top
x_dotted_lines = [0.1,0.2,0.3,0.4,0.5,0.6]
for x in x_dotted_lines:
    display_coord = ax_bottom.transData.transform([x, 0.0])  # convert data coordinates to display coordinates
    fig_coord = fig.transFigure.inverted().transform(display_coord)  # convert display coordinates to figure coordinates
    line = Line2D([fig_coord[0], fig_coord[0]],     # x coordinates
                [fig_coord[1], pos_top[1]],          # y coordinates
                transform=fig.transFigure, color='black', ls=':', lw=0.5)
    fig.add_artist(line)

Although Alice starts with a more accurate prior, Bob’s belief converges faster towards the true probability of rain (50%) over time, because his prior at week 0 was less confident (more spread out), allowing new evidence to have a greater impact on his posterior belief. For each person’s pdf, we can plot its mean and standard deviation over time. For a Beta distribution \text{Beta}(p|\alpha, \beta), the mean and variance are given by:

\begin{align*} \text{mean} &= \frac{\alpha}{\alpha + \beta}, \\ \text{variance} &= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}. \end{align*}

Bob and Alice’s estimates over time
fig, ax = plt.subplots(figsize=(6,4))
t = np.arange(N_weeks+1)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)
def beta_variance(alpha, beta):
    return (alpha * beta) / ((alpha + beta)**2 * (alpha + beta + 1))

bob_means = beta_mean(bobs_parameters[:,0], bobs_parameters[:,1])
bob_sqrt_variance = np.sqrt(beta_variance(bobs_parameters[:,0], bobs_parameters[:,1]))
alice_means = beta_mean(alices_parameters[:,0], alices_parameters[:,1])
alice_sqrt_variance = np.sqrt(beta_variance(alices_parameters[:,0], alices_parameters[:,1]))

line_bob, = ax.plot(t, bob_means, color="tab:blue", label="Bob's Mean")
fill_bob = ax.fill_between(t, bob_means - bob_sqrt_variance, bob_means + bob_sqrt_variance,
                color="tab:blue", alpha=0.2, label="Bob's Std Dev")
line_alice, = ax.plot(t, alice_means, color="tab:red", label="Alice's Mean")
fill_alice = ax.fill_between(t, alice_means - alice_sqrt_variance, alice_means + alice_sqrt_variance,
                color="tab:red", alpha=0.2, label="Alice's Std Dev")

ax.axhline(0.5, color="black", ls="--", lw=1)
ax.legend([(line_bob, fill_bob), (line_alice, fill_alice)],
          [r"Bob's mean$\pm$std", r"Alice's mean$\pm$std"],
          handler_map={tuple: mpl.legend_handler.HandlerTuple(ndivide=1)},
          frameon=False)

ax.set(xlabel='Weeks',
       ylabel='Estimated Expected Probability of Rain',
       title='Convergence of Bob and Alice\'s Estimates Over Time',
       ylim=(0,0.6),
       xlim=(0,N_weeks));

28.7 comparison with frequentist approach

Let’s add to the graph above one more line, for Charlie, who follows a frequentist approach to estimate the probability of rain over time. Charlie doesn’t have any prior belief; instead, he simply counts the number of rainy days observed so far (k) and divides it by the total number of days that passed (n). As shown in the chapter MLE and summary statistics, when assuming a generative model that is a Bernoulli process (or Binomial distribution for multiple trials), the maximum likelihood estimate (MLE) of the probability of success (rain) is given by the ratio of the number of successes to the total number of trials:

\hat{p} = \frac{k}{n}.

The conclusion here is that, even if Charlie doesn’t know it, his estimate is based on the Maximum Likelihood Estimation (MLE) principle.
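
For completeness, a quick sketch of why the MLE is k/n under the binomial model: write the log-likelihood, differentiate with respect to p, and set the derivative to zero.

\begin{align*} \ell(p) &= \log P(R=k|p) = \log\binom{n}{k} + k \log p + (n-k)\log(1-p), \\ \frac{d\ell}{dp} &= \frac{k}{p} - \frac{n-k}{1-p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{k}{n}. \end{align*}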

This time, instead of plotting the weekly estimates, let’s see what happens if each of the residents of city A updates their beliefs daily.

compute daily parameter updates for Bob and Alice
failure_array_daily = 1 - success_array_daily
sf_array_daily = np.vstack([success_array_daily, failure_array_daily]).T

bobs_parameters_daily = np.array([[2, 18]])
alices_parameters_daily = np.array([[100, 400]])

for day in np.arange(N_weeks*7):
    bobs_yesterdays_parameters = bobs_parameters_daily[-1]
    bobs_todays_parameters = bobs_yesterdays_parameters + sf_array_daily[day]
    bobs_parameters_daily = np.vstack([bobs_parameters_daily, bobs_todays_parameters])
    alices_yesterdays_parameters = alices_parameters_daily[-1]
    alices_todays_parameters = alices_yesterdays_parameters + sf_array_daily[day]
    alices_parameters_daily = np.vstack([alices_parameters_daily, alices_todays_parameters])
frequentist vs Bayesian daily updates
fig, ax = plt.subplots(figsize=(6,4))

t = 7*np.arange(N_weeks+1)
t_day= np.arange(N_weeks*7+1)


bob_means_daily = beta_mean(bobs_parameters_daily[:,0], bobs_parameters_daily[:,1])
bob_sqrt_variance_daily = np.sqrt(beta_variance(bobs_parameters_daily[:,0], bobs_parameters_daily[:,1]))
alice_means_daily = beta_mean(alices_parameters_daily[:,0], alices_parameters_daily[:,1])
alice_sqrt_variance_daily = np.sqrt(beta_variance(alices_parameters_daily[:,0], alices_parameters_daily[:,1]))

line_bob_daily, = ax.plot(t_day, bob_means_daily, color="tab:blue", label="Bob's Mean (Daily)", alpha=0.5, ls="-")
fill_bob_daily = ax.fill_between(t_day, bob_means_daily - bob_sqrt_variance_daily, bob_means_daily + bob_sqrt_variance_daily,
                color="tab:blue", alpha=0.2, label="Bob's Std Dev")
line_alice_daily, = ax.plot(t_day, alice_means_daily, color="tab:red", label="Alice's Mean (Daily)", alpha=0.5, ls="-")
fill_alice_daily = ax.fill_between(t_day, alice_means_daily - alice_sqrt_variance_daily, alice_means_daily + alice_sqrt_variance_daily,
                color="tab:red", alpha=0.2, label="Alice's Std Dev")

bob_parameters_frequentist = bobs_parameters_daily - bobs_parameters_daily[0]  # subtract the prior pseudo-counts, leaving only observed successes and failures
bob_means_frequentist = bob_parameters_frequentist[1:,0] / (bob_parameters_frequentist[1:,0] + bob_parameters_frequentist[1:,1])  # Charlie's estimate: k / n
line_bob_frequentist, = ax.plot(t_day[1:], bob_means_frequentist, color="purple", label="Bob's Frequentist Estimate", ls="-", marker="o", markersize=3, alpha=0.5, clip_on=True)

ax.axhline(0.5, color="black", ls="--", lw=1)
ax.legend([(line_bob_daily, fill_bob_daily), (line_alice_daily, fill_alice_daily), (line_bob_frequentist,)],
          [r"Bob's mean$\pm$std", r"Alice's mean$\pm$std", "Charlie, the frequentist"],
          handler_map={tuple: mpl.legend_handler.HandlerTuple(ndivide=1)},
          frameon=False)

ax.set(xlabel='Days',
       ylabel='Estimated Expected Probability of Rain',
       title='Frequentist vs Bayesian Daily Updates',
       ylim=(0,1),
       xlim=(0,150)
       );

Let’s digest what we see above.

  • The frequentist approach does not use prior beliefs; it simply estimates the probability based on the observed data.
  • At day zero, Bob and Alice have beliefs about the world, but Charlie doesn’t have any data to base his estimate on, so we see nothing for him at day zero.
  • The very first day was a rainy day, so Charlie’s estimate begins at 100%.
  • In the following days, Charlie’s estimate fluctuates more wildly than Bob’s and Alice’s, especially in the early weeks when the amount of data is still small. As more data is collected, Charlie’s estimate stabilizes and converges towards the true probability of rain (50%).
  • As the number of observations increases, all three estimates (Bob’s, Alice’s, and Charlie’s) converge towards the true probability of rain (50%).

Pierre-Simon Laplace came up with a solution to Charlie’s “small sample size” problem, in which a single rainy day leads to a 100% estimate. Laplace faced a similar situation when answering the question “will the sun rise tomorrow?”. Instead of assuming nothing, as Charlie does, Laplace proposed the “rule of succession”: assume one success and one failure even before observing any data. Translated to a Beta prior, this means starting with \alpha=1 and \beta=1, which is a uniform prior over p; see the graph below.
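
In the conjugate framework, Laplace’s rule falls out directly: starting from \text{Beta}(1, 1), the posterior mean after observing k rainy days in n days is

\frac{\alpha + k}{\alpha + \beta + n} \bigg|_{\alpha=\beta=1} = \frac{k+1}{n+2},

so after a single rainy day the estimate is 2/3 rather than Charlie’s 100%.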

Laplace vs frequentist vs Bayesian daily updates
laplaces_parameters_daily = np.array([[1, 1]])
for day in np.arange(N_weeks*7):
    laplaces_yesterdays_parameters = laplaces_parameters_daily[-1]
    laplaces_todays_parameters = laplaces_yesterdays_parameters + sf_array_daily[day]
    laplaces_parameters_daily = np.vstack([laplaces_parameters_daily, laplaces_todays_parameters])

fig, ax = plt.subplots(1, 2, figsize=(8,6))

laplaces_means_daily = beta_mean(laplaces_parameters_daily[:,0], laplaces_parameters_daily[:,1])
laplaces_sqrt_variance_daily = np.sqrt(beta_variance(laplaces_parameters_daily[:,0], laplaces_parameters_daily[:,1]))

line_bob_daily, = ax[0].plot(t_day, bob_means_daily, color="black", label="Bob's Mean (Daily)", alpha=0.5, ls="-")
fill_bob_daily = ax[0].fill_between(t_day, bob_means_daily - bob_sqrt_variance_daily, bob_means_daily + bob_sqrt_variance_daily,
                color="tab:blue", alpha=0.2, label="Bob's Std Dev")
line_alice_daily, = ax[0].plot(t_day, alice_means_daily, color="black", label="Alice's Mean (Daily)", alpha=0.5, ls="-")
fill_alice_daily = ax[0].fill_between(t_day, alice_means_daily - alice_sqrt_variance_daily, alice_means_daily + alice_sqrt_variance_daily,
                color="tab:red", alpha=0.2, label="Alice's Std Dev")

bob_parameters_frequentist = bobs_parameters_daily - bobs_parameters_daily[0]
bob_means_frequentist = bob_parameters_frequentist[1:,0] / (bob_parameters_frequentist[1:,0] + bob_parameters_frequentist[1:,1])
line_bob_frequentist, = ax[0].plot(t_day[1:], bob_means_frequentist, color="purple", label="Bob's Frequentist Estimate", ls="-", marker="o", markersize=3, alpha=0.5, clip_on=True)

line_laplaces_daily, = ax[0].plot(t_day, laplaces_means_daily, color="tab:orange", label="Laplace's Mean (Daily)", alpha=0.5, ls="-")
fill_laplaces_daily = ax[0].fill_between(t_day, laplaces_means_daily - laplaces_sqrt_variance_daily, laplaces_means_daily + laplaces_sqrt_variance_daily,
                color="tab:orange", alpha=0.2, label="Laplace's Std Dev")

ax[0].axhline(0.5, color="black", ls="--", lw=1)
ax[0].legend([(line_bob_daily, fill_bob_daily), (line_alice_daily, fill_alice_daily), (line_bob_frequentist,), (line_laplaces_daily, fill_laplaces_daily)],
          [r"Bob's mean$\pm$std", r"Alice's mean$\pm$std", "Charlie, the frequentist", "Laplace"],
          handler_map={tuple: mpl.legend_handler.HandlerTuple(ndivide=1)},
          frameon=False)

ax[0].set(xlabel='Days',
       ylabel='Estimated Expected Probability of Rain',
       title='Frequentist vs Bayesian Daily Updates',
       ylim=(0,1),
       xlim=(0,20)
       );

p = np.linspace(0, 1, 1000)
ax[1].plot(p, beta.pdf(p, laplaces_parameters_daily[0,0], laplaces_parameters_daily[0,1]), label='Beta(1, 1) - Laplace', color="black")
ax[1].plot(p, beta.pdf(p, bobs_parameters_daily[0,0], bobs_parameters_daily[0,1]), label=f'Beta({bobs_parameters_daily[0,0]}, {bobs_parameters_daily[0,1]}) - Bob', color="tab:orange")
ax[1].plot(p, beta.pdf(p, alices_parameters_daily[0,0], alices_parameters_daily[0,1]), label=f'Beta({alices_parameters_daily[0,0]}, {alices_parameters_daily[0,1]}) - Alice', color="tab:blue")
ax[1].legend(frameon=False)
ax[1].set(xlabel='Probability of rain (p)',
       ylabel='Probability Density',
       title='Prior distributions of Laplace, Bob, and Alice');

Laplace’s estimate doesn’t suffer from Charlie’s extreme initial estimate. After just a few days, however, it is almost indistinguishable from Charlie’s estimate.

28.8 conjugate pairs

In the example above, we saw how convenient it is to use a prior function that is conjugate to the likelihood function. The generating process behind the observations was a Bernoulli process (or Binomial distribution for multiple trials), so we chose its conjugate, the Beta distribution.

Some examples of processes that can be described by a Binomial distribution (or Bernoulli process) are:

  • Conversion Rates: A marketing team has a prior belief about how many people will click an email link (Success) vs. ignore it (Failure).
  • Quality Control: A factory manager has a prior belief about the proportion of defective items (Success) vs. non-defective items (Failure) in a production batch.
  • Medical Trials: A researcher has a prior belief about the effectiveness of a new drug (Success) vs. ineffectiveness (Failure) based on preliminary studies.

What if the generating process was different?

28.8.1 Poisson distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space (the “count”). Its conjugate prior is the Gamma distribution, which is defined for positive values (0 to \infty) and is therefore well suited for modeling the rate (\lambda) at which these events happen. It can be used in scenarios such as:

  • Ecohydrology: Modeling the number of rainfall pulses in a desert ecosystem during the growing season.
  • Customer Service: Estimating the rate of phone calls arriving at a help desk per hour.
  • Radioactive Decay: Predicting the number of particles emitted by a substance over a specific duration.
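
The update rule mirrors the Beta-Binomial case. Under the shape-rate parameterization, if the prior on the rate is \text{Gamma}(\alpha, \beta) and we observe counts k_1, \dots, k_n over n intervals, the posterior is

\text{Gamma}\left(\alpha + \sum_{i=1}^{n} k_i,\; \beta + n\right),

so \alpha accumulates the total event count and \beta accumulates the number of observation intervals.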

28.8.2 normal distribution (known variance)

When you assume your data follows a Normal distribution (like measurement errors) and you know the variance, the conjugate prior for the mean is also a Normal distribution. This creates a beautiful “weighted average” effect: the posterior mean will sit somewhere between your prior guess and the data’s mean, depending on which one is more certain. It is useful in scenarios such as:

  • Measurement Errors: Estimating the true value of a physical quantity when measurements are subject to random errors.
  • Quality Control: Determining the average weight of products in a manufacturing process where individual weights vary normally around a true mean.
  • Psychometrics: Estimating the average score of a psychological test when individual scores are normally distributed with known variance.
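
Concretely, with known data variance \sigma^2, a \mathcal{N}(\mu_0, \tau_0^2) prior on the mean, and n observations with sample mean \bar{x}, the posterior for the mean is normal with

\mu_{\text{post}} = \frac{\mu_0/\tau_0^2 + n\bar{x}/\sigma^2}{1/\tau_0^2 + n/\sigma^2}, \qquad \tau_{\text{post}}^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1},

which is exactly the precision-weighted average described above: whichever source (prior or data) is more certain pulls the posterior mean towards itself.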

28.8.3 categorical / multinomial distribution

The conjugate prior here is the Dirichlet distribution, the multi-dimensional version of the Beta distribution. Instead of just “Success/Failure”, you have multiple categories (e.g., “Rain/Sun/Clouds”), and the Dirichlet allows you to track the probabilities of all these categories simultaneously. It is useful in scenarios such as:

  • Topic Modeling: In Machine Learning, determining the “theme” of a document by looking at the frequency of different words (each word is a category).
  • Political Polling: Estimating the support for four different candidates in an upcoming election.
  • Genetics: Modeling the frequency of different alleles (gene variants) within a specific population.
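
The update again reduces to adding counts: with a \text{Dirichlet}(\alpha_1, \dots, \alpha_K) prior over the K category probabilities and observed counts c_1, \dots, c_K, the posterior is

\text{Dirichlet}(\alpha_1 + c_1, \dots, \alpha_K + c_K).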

28.8.4 uniform distribution (unknown upper bound)

This is used when you are trying to find the maximum possible value (\theta) of a process. If you assume the data is spread evenly (Uniform) up to some unknown limit, the Pareto distribution is the conjugate prior that helps you “narrow in” on where that upper limit actually sits. It is useful in scenarios such as:

  • The German Tank Problem: Estimating the total number of tanks produced by an enemy based on the highest serial number found on captured tanks.
  • Ecological Limits: Estimating the maximum possible size a specific fish species can reach based on the largest specimens caught.
  • Quality Control: Determining the maximum defect size in a batch of manufactured items based on the largest defect observed in a sample.
  • Project Management: Estimating the maximum time required to complete a project based on the longest time taken for similar past projects.
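
Here the update tracks the running maximum. In one common parameterization, with a \text{Pareto}(x_m, \alpha) prior on the upper bound \theta and n observations whose largest value is m, the posterior is

\text{Pareto}\big(\max(x_m, m),\; \alpha + n\big),

so each new observation tightens the tail (\alpha grows by one) and can only push the lower bound of plausible \theta values upwards.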