10 numpy vs pandas
10.1 numpy
In the previous chapter, we ran the permutation test using numpy. We had two samples of different sizes, so before permuting we concatenated them into one array. Then we shuffled the concatenated array and split it back into two samples with the original sizes. A sketch of the code is shown below:
Store the two samples in numpy arrays:
import numpy as np

sample_boys = np.array([121, 123, 124, 125])
sample_girls = np.array([120, 121, 121, 122, 123, 123, 128, 129])
N_boys = len(sample_boys)
N_girls = len(sample_girls)
Define the statistic and compute the observed difference:
statistic = np.median
# compute the median for each sample and the difference
median_girls = statistic(sample_girls)
median_boys = statistic(sample_boys)
observed_diff = median_girls - median_boys
Run the permutation test:
N_permutations = 1000
# combine all values in one array
all_data = np.concatenate([sample_girls, sample_boys])
# create an array to store the differences
diffs = np.empty(N_permutations - 1)
for i in range(N_permutations - 1):  # this "minus 1" will be explained later
    # permute the labels
    permuted = np.random.permutation(all_data)
    new_girls = permuted[:N_girls]  # first N_girls values are girls
    new_boys = permuted[N_girls:]   # remaining values are boys
    diffs[i] = statistic(new_girls) - statistic(new_boys)
# add the observed difference to the array of differences
diffs = np.append(diffs, observed_diff)
All this works well if your data really looks like this. Sometimes, however, you have structured data with more information, such as a DataFrame with multiple columns. In that case, you can leverage the capabilities of pandas.
10.2 pandas
Let’s give an example of structured data. Suppose we have a DataFrame with the following columns: sex, height, and weight.
import numpy as np
import pandas as pd
from scipy.stats import norm

N_total = 20
np.random.seed(3)
height_list = norm.rvs(size=N_total, loc=150, scale=7)
weight_list = norm.rvs(size=N_total, loc=42, scale=5)
sex_list = np.random.choice(['M', 'F'], size=N_total, replace=True)
df = pd.DataFrame({
    'sex': sex_list,
    'height (cm)': height_list,
    'weight (kg)': weight_list
})
df
| | sex | height (cm) | weight (kg) |
---|---|---|---|
0 | M | 162.520399 | 36.074767 |
1 | M | 153.055569 | 40.971751 |
2 | M | 150.675482 | 49.430742 |
3 | F | 136.955551 | 43.183581 |
4 | F | 148.058283 | 36.881074 |
5 | M | 147.516687 | 38.435034 |
6 | F | 149.420810 | 45.126225 |
7 | F | 145.610995 | 41.197433 |
8 | F | 149.693273 | 38.155818 |
9 | M | 146.659474 | 40.849846 |
10 | M | 140.802947 | 45.725281 |
11 | F | 156.192357 | 51.880554 |
12 | F | 156.169226 | 35.779383 |
13 | M | 161.967011 | 38.867915 |
14 | F | 150.350235 | 37.981170 |
15 | M | 147.167258 | 29.904584 |
16 | M | 146.182480 | 37.381040 |
17 | M | 139.174659 | 36.880621 |
18 | M | 156.876572 | 47.619890 |
19 | M | 142.292527 | 41.340429 |
Calculate sample statistics using groupby:
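A minimal sketch of this step, assuming the column names used above. The small DataFrame holds made-up values so that the block runs on its own; the names toy_df, sample_stats, and observed_diff_toy are illustrative assumptions.

```python
import pandas as pd

# toy data (hypothetical values, only for illustration)
toy_df = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M', 'F', 'M'],
    'height (cm)': [150.0, 148.0, 152.0, 160.0, 146.0, 155.0],
})

# median height for each sex; the result is a Series indexed by 'F' and 'M'
sample_stats = toy_df.groupby('sex')['height (cm)'].median()

# observed difference, median(F) - median(M)
observed_diff_toy = sample_stats['F'] - sample_stats['M']

print(sample_stats)
print(observed_diff_toy)
```

Applying the same two groupby lines to df instead of toy_df would give the observed difference for the sample shown in the table.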
We can now use the pandas.DataFrame.sample method to shuffle the rows of the DataFrame. Here, we use the following options:
frac=1 means we want to sample 100% of the rows, but shuffled.
replace=False means we want to sample without replacement, that is, no row is drawn more than once.
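To see what these two options do together, here is a small sketch on a hypothetical Series of labels. The random_state argument is added here only to make the shuffle reproducible; it is not part of the original example.

```python
import pandas as pd

# hypothetical labels, only for illustration
labels = pd.Series(['M', 'M', 'F', 'F', 'M'], name='sex')

# frac=1 returns every row exactly once, in random order
shuffled = labels.sample(frac=1, replace=False, random_state=0)

# the values are a permutation: each label appears as often as before
print(shuffled.value_counts().to_dict())

# the original index labels travel with the values
print(sorted(shuffled.index) == list(labels.index))
```

Because the sampling is without replacement and covers all rows, the group sizes never change, which is what a permutation test requires.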
We will shuffle the sex column and store the result in a new column called sex_shuffled. Then we can use groupby to compute the median of each shuffled group.
# observed difference on the unshuffled labels: median(F) - median(M)
observed_stats = df.groupby('sex')['height (cm)'].median()
observed_diff = observed_stats['F'] - observed_stats['M']

N_permutations = 1000
diffs = np.empty(N_permutations - 1)
for i in range(N_permutations - 1):
    # shuffle the 'sex' column of the dataframe, store it in 'sex_shuffled'
    df['sex_shuffled'] = df['sex'].sample(frac=1, replace=False).reset_index(drop=True)
    shuffled_stats = df.groupby('sex_shuffled')['height (cm)'].median()
    diffs[i] = shuffled_stats['F'] - shuffled_stats['M']  # median(F) - median(M)
# add the observed difference to the array of differences
diffs = np.append(diffs, observed_diff)
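The call to reset_index(drop=True) in the loop matters because pandas aligns on the index when a Series is assigned to a column. The sketch below illustrates this on hypothetical labels; toy_df and the column names aligned and sex_shuffled_np are assumptions made for the demonstration, and random_state is only there for reproducibility.

```python
import pandas as pd

# hypothetical labels, only for illustration
toy_df = pd.DataFrame({'sex': ['M', 'F', 'F', 'M', 'F', 'M']})

shuffled = toy_df['sex'].sample(frac=1, replace=False, random_state=1)

# without reset_index: pandas aligns on the index, so every label
# returns to its original row and nothing ends up shuffled
toy_df['aligned'] = shuffled
print((toy_df['aligned'] == toy_df['sex']).all())

# with reset_index(drop=True): the old index is discarded,
# so the shuffled order is kept when assigning
toy_df['sex_shuffled'] = shuffled.reset_index(drop=True)

# .to_numpy() is another way to drop the index before assigning
toy_df['sex_shuffled_np'] = shuffled.to_numpy()
print((toy_df['sex_shuffled'] == toy_df['sex_shuffled_np']).all())
```

Either variant should work for the permutation loop; which one reads better is a matter of taste.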