6 population and samples
When we do an experiment, we take measurements. The complete set of all individual outcomes we are interested in is called the population. The set of measurements that we actually take is called a sample from the population.
6.1 height
Here the population is literally the population of people. Every person whose height we measure is a sample from this population. When we are done measuring, we also call the whole set of measurements a sample.
6.2 temperature
The same idea applies here. However, we don’t have an already-existing population of temperatures just waiting to be measured. Instead, we have a process that generates temperatures (the weather), and every time we measure the temperature, we are taking a sample from the population of all possible temperatures that this process can generate. In the temperature case, and in many other cases, the population is conceptually infinite, since the process can generate an infinite number of outcomes.
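The two situations can be sketched in code. This is only an illustration: the numbers (a mean height of 170 cm, a mean temperature of 15 degrees Celsius, normally distributed values) and the names `population_heights` and `measure_temperature` are assumptions made up for the example, not real data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Case 1: a finite, already-existing population (hypothetical heights in cm).
population_heights = rng.normal(loc=170.0, scale=10.0, size=10_000)

# A sample: 20 people chosen from that population, nobody measured twice.
height_sample = rng.choice(population_heights, size=20, replace=False)


# Case 2: a generating process (hypothetical temperatures in degrees Celsius).
# There is no stored population; every call produces new outcomes.
def measure_temperature(n):
    return rng.normal(loc=15.0, scale=5.0, size=n)


temperature_sample = measure_temperature(20)

print("mean height of the sample:", height_sample.mean())
print("mean temperature of the sample:", temperature_sample.mean())
```

In the first case we could, in principle, list the whole population. In the second case we can only ever ask the process for more measurements.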
6.3 extracting information
If we had unlimited time and resources, we could measure the height of every person in the world, or we could ask everyone in the country who they will vote for, or we could take an infinite number of temperature measurements. In that case, we would have the entire population. We could then answer any question we want about the population, and we would have no uncertainty in our answers.
In reality, we have limited time and resources, so we can only take a finite sample from the population. Our job is to learn as much as we can about the population from the sample. Many questions arise from taking a sample:
- Is the sample representative of the population? Maybe we measured the height of 20 people, but we did so on a basketball court, so our sample is biased towards taller people.
- Is every measurement in the sample independent of the others? Maybe we measured the height of 20 people, but they are all from the same extended family, so their heights are not independent.
- If we were to take another sample from the population, would we get roughly the same results? Maybe we measured the height of 20 people, but if we were to measure another 20 people, we would get a very different average height. Is a sample of 20 people enough to learn what I want?
- Suppose I have a machine (a camera hooked up to a computer) that automatically measures the height of 5-year-olds as they enter a kids' play area. When analyzing the data, I find an outlier measurement of 180 cm. This might mean that the machine accidentally measured a parent, that is, the measurement was sampled from a different population than the one we care about (5-year-olds). The point is that outliers are not only extreme values; they can also be evidence of contamination from an unintended population.
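The third question in the list above (would another sample give roughly the same result?) can be explored with a small simulation. The sketch below assumes a made-up population of heights (normal, mean 170 cm, standard deviation 10 cm) and a hypothetical helper `sample_means`. It repeats the experiment many times and looks at how much the sample mean changes from one sample to the next.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population of heights in cm (an assumption for illustration).
population = rng.normal(loc=170.0, scale=10.0, size=10_000)


def sample_means(sample_size, repeats=1000):
    """Draw `repeats` independent samples and return the mean of each one."""
    return np.array([
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(repeats)
    ])


means_20 = sample_means(20)
means_200 = sample_means(200)

# The spread of the sample means tells us how much the answer would change
# if we repeated the whole experiment with new people.
print("spread of the mean, samples of 20: ", means_20.std())
print("spread of the mean, samples of 200:", means_200.std())
```

Under these assumptions, the means of the larger samples vary much less from one repetition to the next, which is one way to judge whether 20 people are enough for the question at hand.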
6.4 large sample behavior
There are mathematical theorems that tell us what happens when we take larger and larger samples from the population.
For example, the law of large numbers tells us that as we take larger and larger samples, the sample mean will get closer and closer to the population mean.
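A minimal simulation of this statement, assuming the process generates exponentially distributed values with a population mean of 2 (an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Assumed process: exponential distribution, whose mean equals its scale.
population_mean = 2.0
draws = rng.exponential(scale=population_mean, size=100_000)

# running_mean[k - 1] is the sample mean of the first k measurements.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: sample mean = {running_mean[n - 1]:.4f}")
```

The printed sample means should wander for small n and settle near 2 as n grows.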
The central limit theorem tells us that as we take larger samples, the distribution of the sample mean will get closer to a normal distribution. This is true regardless of the distribution of the population, as long as the measurements are independent and the population has a finite mean and variance.
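The central limit theorem can also be checked numerically. The sketch below assumes a strongly skewed population (exponential with scale 1) and uses skewness as a rough indicator of how far a distribution is from the symmetric normal shape. The helpers `skewness` and `sample_mean_distribution` are written just for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=3)


def skewness(values):
    """Third standardized moment: 0 for any symmetric distribution."""
    centered = values - values.mean()
    return (centered**3).mean() / values.std() ** 3


def sample_mean_distribution(sample_size, repeats=20_000):
    """Take `repeats` samples of the given size and return each sample's mean."""
    samples = rng.exponential(scale=1.0, size=(repeats, sample_size))
    return samples.mean(axis=1)


for n in (1, 5, 50):
    means = sample_mean_distribution(n)
    print(f"n = {n:>2}: skewness of the sample means = {skewness(means):.3f}")
```

For n = 1 the "sample mean" is just a single measurement, so it is as skewed as the population itself. As n grows, the skewness of the sample means shrinks towards zero, consistent with the distribution approaching a normal shape.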
6.5 notation
When talking about means and standard deviations, we give different letters to the population and to the sample:
- Population. We use Greek letters: \mu for the mean, and \sigma for the standard deviation. When the population size is known, we can call it N.
- Sample. We use Latin letters: \bar{x} for the mean, and s for the standard deviation. Of course, if we call what we are measuring h instead of x, then \bar{h} would represent the sample mean. The sample size is usually denoted by n.
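The notation maps directly onto code. In the sketch below the population is made up (normal heights, an assumption for illustration), and the variable names mirror the symbols above. One assumption not stated in the text: the sample standard deviation s is computed with the common n - 1 denominator (`ddof=1` in NumPy), while the population standard deviation uses N (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Hypothetical population of heights in cm (an assumption for illustration).
population = rng.normal(loc=170.0, scale=10.0, size=50_000)

# Population quantities: Greek letters.
N = population.size
mu = population.mean()
sigma = population.std(ddof=0)  # divide by N

# Sample quantities: Latin letters.
sample = rng.choice(population, size=20, replace=False)
n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # divide by n - 1 (assumed convention)

print(f"population: N = {N}, mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"sample:     n = {n}, x_bar = {x_bar:.2f}, s = {s:.2f}")
```

In practice we usually only have access to the second line of output, and use \bar{x} and s as estimates of \mu and \sigma.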