In Chapter 4, Probability, we mentioned that the mean height of US females is 65 inches. Now pretend we didn't know this fact—how could we find out what the average height is?
We could measure every US female, but that's untenable; we would run out of money, resources, and time before we even finished with a small city!
Inferential statistics gives us the power to answer this question using a very small sample of all US women. We can use the sample to tell us something about the population we drew it from. We can use observed data to make inferences about unobserved data. By the end of this chapter, you too will be able to go out and collect a small amount of data and use it to reason about the world!
In the example that spans this entire chapter, we will examine how to estimate the mean height of all US women using only samples. Specifically, we will estimate the population mean using sample means as the estimator.
I am going to use the vector all.us.women to represent the population. For simplicity's sake, let's say there are only 10,000 US women.
> # setting seed will make random number generation reproducible
> set.seed(1)
> all.us.women <- rnorm(10000, mean=65, sd=3.5)
We have just used the rnorm function to create a vector of 10,000 normally distributed random values with the same parameters as our population of interest. Of course, at this point, we could just call mean on this vector and call it a day, but that's cheating! We are going to see that we can get very close to the population mean without actually using the entire population.
Now, let's take a random sample of ten from this population using the sample function and compute the mean:
> our.sample <- sample(all.us.women, 10)
> mean(our.sample)
[1] 64.51365
Hey, not a bad start!
Our sample will, in all likelihood, contain some short people, some people of average height, and some tall people. There's a chance, though, that we happen to choose a sample that contains predominantly short people, or a disproportionate number of tall people. Because of this, our estimate will not be exactly accurate. However, as we include more and more people in our sample, those chance occurrences (imbalanced proportions of the short and tall) tend to balance each other out.
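To make the role of chance concrete, here is a minimal sketch that draws several samples of ten and compares their means. It assumes the same simulated population as above; the name five.sample.means is introduced here purely for illustration, and replicate is a base R function that re-evaluates an expression a given number of times.

```r
# Recreate the simulated population used in this chapter
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

# Draw five independent samples of ten and compute each sample's mean.
# (five.sample.means is an illustrative name, not from the original text.)
five.sample.means <- replicate(5, mean(sample(all.us.women, 10)))
print(five.sample.means)

# The gap between the lowest and the highest of the five means
print(diff(range(five.sample.means)))
```

Each of the five means is an estimate of the same population mean, yet they will generally differ from one another, because each sample happens to contain a different mix of shorter and taller people.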
Note that as we increase our sample size, the sample mean isn't always closer to the population mean, but it will be closer on average.
We can test that assertion ourselves! Study the following code carefully and try running it yourself.
> population.mean <- mean(all.us.women)
>
> for(sample.size in seq(5, 30, by=5)){
+   # create empty vector with 1000 elements
+   sample.means <- numeric(1000)
+   for(i in 1:1000){
+     sample.means[i] <- mean(sample(all.us.women, sample.size))
+   }
+   distances.from.true.mean <- abs(sample.means - population.mean)
+   mean.distance.from.true.mean <- mean(distances.from.true.mean)
+   print(mean.distance.from.true.mean)
+ }
[1] 1.245492
[1] 0.8653313
[1] 0.7386099
[1] 0.6355692
[1] 0.5458136
[1] 0.5090788
For each sample size from 5 to 30 (going up by 5), we take 1,000 different samples from the population, calculate each sample's mean, take the absolute difference between each of those means and the population mean, and average those differences.
As you can see, increasing the sample size gets us closer, on average, to the population mean. Increasing the sample size also reduces the standard deviation of the sample means; that is, the means of different samples vary less from one another.
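The claim about the spread of the sample means can be checked in the same way. The following is a sketch under the same assumptions as before (the simulated all.us.women population); the names sample.sizes and sd.of.sample.means are hypothetical and chosen only for this illustration.

```r
# Recreate the simulated population used in this chapter
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

sample.sizes <- seq(5, 30, by=5)

# For each sample size, draw 1000 samples, compute the mean of each,
# and record the standard deviation of those 1000 means
sd.of.sample.means <- sapply(sample.sizes, function(sample.size){
  sample.means <- replicate(1000, mean(sample(all.us.women, sample.size)))
  sd(sample.means)
})

# Label each result with the sample size that produced it
names(sd.of.sample.means) <- sample.sizes
print(sd.of.sample.means)
```

If the assertion holds, the printed values should tend to shrink as the sample size grows, mirroring the pattern we saw for the average distance from the population mean.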
Knowing that, with all other things being equal, larger samples are preferable to smaller ones, let's work with a sample size of 40 for right now. We'll take our sample and estimate our population mean as follows:
> our.new.sample <- sample(all.us.women, 40)
> mean(our.new.sample)
[1] 65.19704