One of the core ideas of statistics is that we can use a subset of a group, study it, and then make inferences or conclusions about that much larger group.
For example, let's say we wanted to find the average (mean) weight of all the people in Germany. One way to do this is to visit all 81 million people in Germany, record their weights, and then find the average. However, it is a far more sane endeavor to take down the weights of only a few hundred Germans, and use those to deduce the average weight of all Germans. In this case, the few hundred people we do measure are called the sample, and the entirety of people in Germany is called the population.
Now, there are Germans of all shapes and sizes: some heavier, some lighter. If we only pick a few Germans to weigh, we run the risk of, by chance, choosing a group of primarily underweight Germans or overweight ones. We might then come to an inaccurate conclusion about the weight of all Germans. But, as we add more Germans to our sample, those chance variations tend to balance themselves out.
All things being equal, it would be preferable to measure the weights of all Germans so that we can be absolutely sure that we have the right answer, but that just isn't feasible. If we take a large enough sample, though, and are careful that our sample is representative of the population, not only can we get extraordinarily close to the actual average weight of the population, but we can quantify our uncertainty. The more Germans we include in our sample, the less uncertain we are about our estimate of the population mean.
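This behavior is easy to see in a quick simulation. The sketch below uses a synthetic "population" of 100,000 made-up weights (the real 81 million would be overkill, and the numbers here are invented for illustration), and shows that larger samples land closer to the population mean.

```python
# Synthetic demonstration: bigger samples give sample means closer to
# the population mean. The weights here are made up, not real data.
import random

random.seed(42)

# A stand-in "population": 100,000 synthetic weights centered near 80 kg
population = [random.gauss(80, 15) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

for n in (10, 100, 1_000, 10_000):
    sample = random.sample(population, n)
    sample_mean = sum(sample) / n
    # The absolute error tends to shrink as n grows
    print(n, round(abs(sample_mean - pop_mean), 3))
```

Any single small sample can still land far from the truth by chance; it is the *tendency* of the error to shrink with n that the simulation illustrates.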
In the preceding case, we are using the sample mean as an estimator of the population mean, and the actual value of the sample mean is called our estimate. It turns out that the formula for the population mean, when applied to only a sample, is a great estimator of the population mean. This is why we make no distinction between the population and sample means, except to replace the µ with x̄. Unfortunately, there exists no perfect estimator for the standard deviation of a population for all population types. There will always be some systematic difference between the expected value of the estimator and the real value of the population. This means that there is some bias in the estimator. Fortunately, we can partially correct it.
Note that the two differences between the population and the sample standard deviation are that (a) the µ is replaced by x̄ in the sample standard deviation, and (b) the divisor n is replaced by n-1.
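Written out (using σ for the population standard deviation and s for the sample standard deviation), the two formulas are:

```latex
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
```

The first uses the true population mean µ and divides by n; the second substitutes the sample mean x̄ and divides by n-1.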
In the case of the standard deviation of the population, we know the mean µ. In the case of the sample, however, we don't know the population mean; we only have an estimate of the population mean based on the sample mean x̄. This must be taken into account and corrected in the new equation. No longer can we divide by the number of elements in the data set—we have to divide by the degrees of freedom, which is n-1.
What in the world are degrees of freedom? And why is it n-1?
Let's say we were gathering a party of six to play a board game. In this board game, each player controls one of six colored pawns. People start to join in at the board. The first person at the board gets their pick of their favorite colored pawn. The second player has one less pawn to choose from, but she still has a choice in the matter. By the time the last person joins in at the game table, she doesn't have a choice in what pawn she uses; she is forced to use the last remaining pawn. The concept of degrees of freedom is a little like this.
If we have a group of five numbers, but hold the mean of those numbers fixed, all but the last number can vary, because the last number must take on the value that will satisfy the fixed mean. We only have four degrees of freedom in this case.
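The "last pawn" idea can be sketched with actual numbers. In the toy example below (the particular values are arbitrary), we fix the mean of five numbers, choose four of them freely, and see that the fifth has no freedom left.

```python
# Fix the mean of five numbers; four can be anything, the fifth is forced.
fixed_mean = 10.0
free_values = [3.0, 18.0, 7.5, 12.0]   # four free choices (arbitrary)

# To keep the mean at 10, the five values must total 5 * 10 = 50,
# so the last value is completely determined by the first four.
forced = 5 * fixed_mean - sum(free_values)
values = free_values + [forced]

assert sum(values) / 5 == fixed_mean
print(forced)  # → 9.5, the one value we had no freedom to choose
```

Change any of the four free values and the forced value changes with them; only four numbers ever vary independently, hence four degrees of freedom.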
More generally, the degrees of freedom is the sample size minus the number of parameters estimated from the data. When we are using the mean estimate in the standard deviation formula, we are effectively keeping one of the parameters of the formula fixed, so that only n-1 observations are free to vary. This is why the divisor of the sample standard deviation formula is n-1; it is the degrees of freedom that we are dividing by, not the sample size.
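A small simulation makes the effect of the divisor concrete. Strictly speaking, dividing by n-1 makes the sample *variance* unbiased (the standard deviation remains slightly biased, as noted earlier); the sketch below, using synthetic data, shows that dividing by n systematically undershoots the true variance while n-1 lands close to it.

```python
# Repeatedly draw small samples and compare the two divisors.
import random

random.seed(0)
true_var = 15 ** 2  # population variance of the synthetic draws (σ = 15)

biased, corrected = [], []
for _ in range(20_000):
    sample = [random.gauss(80, 15) for _ in range(5)]
    m = sum(sample) / 5
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / 5)      # divide by n
    corrected.append(ss / 4)   # divide by n - 1 (degrees of freedom)

print(sum(biased) / len(biased))        # noticeably below 225
print(sum(corrected) / len(corrected))  # close to 225
```

The undershoot happens because the sample mean is, by construction, the value that minimizes the sum of squared deviations for that sample, so deviations from x̄ are on average smaller than deviations from the true µ; dividing by n-1 compensates.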
If you thought that the last few paragraphs were heady and theoretical, you're right. If you are confused, particularly by the concept of degrees of freedom, you can take solace in the fact that you are not alone; degrees of freedom, bias, and subtleties of population vs. sample standard deviation are notoriously confusing topics for newcomers to statistics. But you only have to learn it once!