Remember when I said that the sampling distribution of sample means is approximately normal for a large enough sample size? This caveat means that for smaller sample sizes (usually considered to be below 30), the sampling distribution of the sample means is not well approximated by a normal distribution. It is, however, well approximated by another distribution: the t-distribution.
A bit of history…
The t-distribution is also known as the Student's t-distribution. It gets its name from the 1908 paper that introduced it, by William Sealy Gosset writing under the pen name Student. Gosset worked as a statistician at the Guinness Brewery and used the t-distribution and the related t-test to study the quality of the beer's raw constituents using small samples. He is thought to have used a pen name at the request of Guinness so that competitors wouldn't know that they were using the t statistic to their advantage.
The t-distribution has two parameters: the mean and the degrees of freedom (or df). For our purposes here, the degrees of freedom is equal to our sample size minus 1. For example, if we have a sample of 10 from some population and the sample mean is 5, then a t-distribution with parameters mean=5 and df=9 describes the sampling distribution of sample means with that sample size.
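This relationship can be seen in a quick simulation (not from the text; it uses a made-up normal population with mean 5 and standard deviation 2). When each sample mean is standardized using the *sample* standard deviation, the resulting statistic follows a t-distribution with df = 9, not a standard normal:

```r
set.seed(1)
n <- 10
# simulate 10,000 standardized sample means from samples of size 10
t.stats <- replicate(10000, {
  samp <- rnorm(n, mean = 5, sd = 2)
  (mean(samp) - 5) / (sd(samp) / sqrt(n))
})
# the empirical 97.5th percentile tracks the t quantile with df = 9
# (about 2.26), which is noticeably larger than qnorm(.975) (about 1.96)
quantile(t.stats, .975)
qt(.975, df = 9)
qnorm(.975)
```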
The t-distribution looks a lot like the normal distribution at first glance. A closer look, however, reveals that the curve is flatter and wider. This extra width accounts for the higher level of uncertainty we have when working with a smaller sample.
Notice that as the sample size (degrees of freedom) increases, the distribution gets narrower. As the sample size grows, it gets closer and closer to a normal distribution; by 29 degrees of freedom, it is very close indeed. This is why 30 is considered a good rule-of-thumb cut-off between large and small sample sizes and, thus, for deciding whether to use a normal distribution or a t-distribution as a model for the sampling distribution.
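We can watch this convergence directly by comparing t quantiles at increasing degrees of freedom against the corresponding normal quantile:

```r
# 97.5th percentile of the t-distribution for a range of df values;
# the quantiles shrink toward qnorm(.975) as df increases
# (roughly 4.30, 2.57, 2.23, 2.05, and 1.98)
sapply(c(2, 5, 10, 29, 100), function(df) qt(.975, df))
qnorm(.975)   # about 1.96
```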
Let's say that we could only afford to measure the heights of 15 US women. What, then, is our 95% interval estimate?
> small.sample <- sample(all.us.women, 15)
> mean(small.sample)
[1] 65.51277
> qt(.025, df=14)
[1] -2.144787
> # notice the difference
> qnorm(.025)
[1] -1.959964
Instead of using the qnorm function to get the correct multiplier for the standard error, we want to find the quantile of the t-distribution at .025 (and .975). For this, we use the qt function, which takes a probability and a number of degrees of freedom. Note that the quantile of the t-distribution is larger in magnitude than the quantile of the normal distribution, which translates to wider confidence interval bounds; again, this reflects the additional uncertainty we have in our estimate due to the smaller sample size.
> err <- sd(small.sample) / sqrt(length(small.sample))
> mean(small.sample) - (2.145 * err)
[1] 64.09551
> mean(small.sample) + (2.145 * err)
[1] 66.93003
In this case, the bounds of our 95% confidence interval are 64.1 and 66.9.
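The steps above can be wrapped into a small helper function (not from the text; the name t.conf.int is made up for illustration) that computes a t-based confidence interval for the mean of any numeric vector, looking up the multiplier with qt rather than hardcoding 2.145:

```r
t.conf.int <- function(x, level = 0.95) {
  n <- length(x)
  err <- sd(x) / sqrt(n)                        # standard error of the mean
  mult <- qt(1 - (1 - level) / 2, df = n - 1)   # e.g. about 2.145 when n = 15
  c(lower = mean(x) - mult * err,
    upper = mean(x) + mult * err)
}
```

R's built-in t.test function produces the same interval, which can be extracted with t.test(small.sample)$conf.int.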