10  numpy vs pandas

10.1 numpy

In the previous chapter, we ran the permutation test using numpy. We had two samples of different sizes. Before permuting, we concatenated the two samples into one array; then we shuffled the concatenated array and split it back into two samples of the original sizes. A sketch of the code is shown below:

Store the two samples in numpy arrays:

import numpy as np

sample_boys = np.array([121, 123, 124, 125])
sample_girls = np.array([120, 121, 121, 122, 123, 123, 128, 129])
N_boys = len(sample_boys)
N_girls = len(sample_girls)

Define the statistic and compute the observed difference:

statistic = np.median
# compute the median for each sample and the difference
median_girls = statistic(sample_girls)
median_boys = statistic(sample_boys)
observed_diff = median_girls - median_boys

Run the permutation test:

N_permutations = 1000
# combine all values in one array
all_data = np.concatenate([sample_girls, sample_boys])
# create an array to store the differences
diffs = np.empty(N_permutations - 1)

for i in range(N_permutations - 1):    # this "minus 1" will be explained later
    # permute the labels
    permuted = np.random.permutation(all_data)
    new_girls = permuted[:N_girls]  # first N_girls values are girls
    new_boys = permuted[N_girls:]   # remaining values are boys
    diffs[i] = statistic(new_girls) - statistic(new_boys)
# add the observed difference to the array of differences
diffs = np.append(diffs, observed_diff)

All this works well if this is what your data looks like. Sometimes, however, you have structured data with more information, such as a DataFrame with multiple columns. In this case, you can leverage the capabilities of pandas.

10.2 pandas

Let’s give an example of structured data. Suppose we have a DataFrame with the following columns: sex, height, and weight.

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="ticks", font_scale=1.5)
from scipy.stats import norm, ttest_ind, t
# %matplotlib widget
N_total = 20
np.random.seed(3)
height_list = norm.rvs(size=N_total, loc=150, scale=7)
weight_list = norm.rvs(size=N_total, loc=42, scale=5)
sex_list = np.random.choice(['M', 'F'], size=N_total, replace=True)
df = pd.DataFrame({
    'sex': sex_list,
    'height (cm)': height_list,
    'weight (kg)': weight_list
})
df
   sex  height (cm)  weight (kg)
0    M   162.520399    36.074767
1    M   153.055569    40.971751
2    M   150.675482    49.430742
3    F   136.955551    43.183581
4    F   148.058283    36.881074
5    M   147.516687    38.435034
6    F   149.420810    45.126225
7    F   145.610995    41.197433
8    F   149.693273    38.155818
9    M   146.659474    40.849846
10   M   140.802947    45.725281
11   F   156.192357    51.880554
12   F   156.169226    35.779383
13   M   161.967011    38.867915
14   F   150.350235    37.981170
15   M   147.167258    29.904584
16   M   146.182480    37.381040
17   M   139.174659    36.880621
18   M   156.876572    47.619890
19   M   142.292527    41.340429

Calculate sample statistics using groupby:

sample_stats = df.groupby('sex')['height (cm)'].median()
observed_diff = sample_stats['F'] - sample_stats['M']
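
To see what groupby returns here, the following minimal sketch uses a small made-up DataFrame (the name toy and its values are invented for illustration, they are not the data above). The median of each group comes back as a Series indexed by the group labels, which is why we can look up 'F' and 'M' by name.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M', 'F'],
    'height (cm)': [150.0, 140.0, 160.0, 170.0, 150.0],
})

# one median per group; the result is a Series indexed by 'F' and 'M'
toy_stats = toy.groupby('sex')['height (cm)'].median()
toy_diff = toy_stats['F'] - toy_stats['M']

print(toy_stats)
print(toy_diff)   # 150.0 - 160.0 = -10.0
```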

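We can now use the pandas sample method to shuffle the data. The method exists for both DataFrames (pandas.DataFrame.sample) and single columns (pandas.Series.sample); below we call it on the sex column. We use the following options: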

  • frac=1 means we want to sample 100% of rows, but shuffled.
  • replace=False means we want to sample without replacement, that is, no duplicate rows.
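
The effect of these two options can be checked on a small made-up column. This is only a sketch: the Series labels is invented for illustration, and random_state is added here for reproducibility (it is not used in the chapter's code).

```python
import pandas as pd

# toy column, invented for illustration only
labels = pd.Series(['M', 'F', 'F', 'M', 'F'], name='sex')

# frac=1 keeps every row, replace=False forbids duplicates:
# the result is a random permutation of the original rows
shuffled = labels.sample(frac=1, replace=False, random_state=0)

# original index labels, now in shuffled order
print(shuffled.index.tolist())
# every row appears exactly once
print(sorted(shuffled.index) == list(labels.index))
# the number of 'M' and 'F' labels is unchanged
print(shuffled.value_counts().to_dict() == labels.value_counts().to_dict())
```

Note that the shuffled values carry their original index labels with them; only their order changes.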

We will shuffle the sex column and store the result in a new column called sex_shuffled. Then we can use groupby to compute the median.

N_permutations = 1000
diffs = np.empty(N_permutations - 1)
for i in range(N_permutations - 1):
    # shuffle the 'sex' column and store it in 'sex_shuffled';
    # reset_index makes the assignment positional instead of index-aligned
    df['sex_shuffled'] = df['sex'].sample(frac=1, replace=False).reset_index(drop=True)
    shuffled_stats = df.groupby('sex_shuffled')['height (cm)'].median()
    diffs[i] = shuffled_stats['F'] - shuffled_stats['M']  # median(F) - median(M)
# add the observed difference to the array of differences
diffs = np.append(diffs, observed_diff)
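
The call to reset_index(drop=True) in the loop matters. When a Series is assigned to a DataFrame column, pandas matches values by index label, not by position. The sketch below illustrates this on made-up data (the DataFrame toy and the column names aligned and by_position are invented for illustration, and random_state is added only for reproducibility). It assumes the DataFrame has the default index 0, 1, 2, ..., as df does above.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({'sex': ['M', 'F', 'F', 'M', 'F']})
shuffled = toy['sex'].sample(frac=1, replace=False, random_state=1)

# without reset_index: pandas aligns on the index labels, so every value
# returns to its original row and the shuffle is undone
toy['aligned'] = shuffled
print((toy['aligned'] == toy['sex']).all())   # True

# with reset_index(drop=True): labels are renumbered 0..n-1 in shuffled
# order, so the values are assigned by position
toy['by_position'] = shuffled.reset_index(drop=True)
print(toy['by_position'].tolist() == shuffled.tolist())   # True
```

Without the reset, every permuted difference would equal the observed difference, and the test would be meaningless.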