Interval estimation

Again, we care about the standard error (the standard deviation of the sampling distribution of sample means) because it expresses the degree of uncertainty we have in our estimation. Because of this, it's not uncommon for statisticians to report the standard error along with their estimate.

What's more common, though, is for statisticians to report a range of numbers to describe their estimates; this is called interval estimation. In contrast, when we were just providing the sample mean as our estimate of the population mean, we were engaging in point estimation.

One common approach to interval estimation is to use confidence intervals. A confidence interval gives us a range over which a significant proportion of the sample means would fall when samples are repeatedly drawn from a population and their means are calculated. Concretely, a 95% confidence interval is the range that would contain 95% of the sample means if multiple samples were taken from the same population. 95% confidence intervals are very common, but 90% and 99% confidence intervals aren't rare.

Think about this for a second: if a 95% confidence interval contains 95% of the sample means, that means that the 95% confidence interval covers 95% of the area of the sampling distribution.
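A quick simulation can make this concrete. The sketch below is not part of the running example: it assumes a made-up normal population (mean 65, standard deviation 3.5) and a sample size of 40, and every variable name in it is chosen only for illustration. It draws many samples, compares the spread of their means with the standard error, and finds the range that holds the middle 95% of those means.

```r
# Hypothetical population parameters and sample size, chosen for illustration
set.seed(1)
pop.mean <- 65
pop.sd <- 3.5
n <- 40

# Draw 10,000 samples of size n and record the mean of each one
sample.means <- replicate(10000, mean(rnorm(n, mean = pop.mean, sd = pop.sd)))

# The standard deviation of the sample means approximates the standard error
sd(sample.means)
pop.sd / sqrt(n)

# The range that contains the middle 95% of the simulated sample means
middle.95 <- quantile(sample.means, c(.025, .975))
middle.95

# Share of the sample means that fall inside that range
mean(sample.means >= middle.95[1] & sample.means <= middle.95[2])
```

The two standard error figures should come out close to each other, and the last line should be very close to 0.95, since the range was built to hold the middle 95% of the simulated means.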

Figure 5.5: The 95% confidence interval of our estimate of the population mean (64.085 to 66.31) covers 95% of the area in our estimated sampling distribution

Okay, so how do we find the bounds of the confidence interval? Think back to the three-zs rule from the previous chapter on probability. Recall that about 95% of a normal distribution's area is within two standard deviations of the mean. Well, if the bounds of a confidence interval cover 95% of the sampling distribution, then the bounds must be two standard deviations away from the mean on both sides! Since the standard deviation of the distribution of interest (the sampling distribution of sample means) is the standard error, the bounds of the confidence interval are the mean minus 2 times the standard error and the mean plus 2 times the standard error.

In reality, two standard deviations (or two z-scores) away from the mean contain a little bit more than 95% of the area of the distribution. To be more precise, the range between -1.96 z-scores and 1.96 z-scores contains 95% of the area. Therefore, the bounds of a 95% confidence interval are:

$$\bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{N}}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $N$ is the sample size.
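The two coverage figures quoted above can be checked with the pnorm function from the previous chapter. This is only a small sanity check on the standard normal distribution, not part of the running example.

```r
# Area of the standard normal distribution within 2 z-scores of the mean
pnorm(2) - pnorm(-2)         # about 0.9545, slightly more than 95%

# Area within 1.96 z-scores of the mean
pnorm(1.96) - pnorm(-1.96)   # about 0.9500
```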

In our example, our bounds are:

  > err <- sd(our.new.sample) / sqrt(length(our.new.sample))
  > mean(our.new.sample) - (1.96*err)
  [1] 64.08497
  > mean(our.new.sample) + (1.96*err)
  [1] 66.30912

How did we get 1.96?

You can get this number yourself by using the qnorm function.

The qnorm function is a little like the opposite of the pnorm function that we saw in the previous chapter. That function started with a p because it gave us a probability: the probability that we would see a value equal to or below the one we supplied in a normal distribution. The q in qnorm stands for quantile. A quantile, for a given probability, is the value for which the probability of seeing a value equal to or below it is that given probability.

I know that was confusing! Stated differently, but equivalently, a quantile for a given probability is the value such that if we put it in the pnorm function, we get back that same probability.

  > qnorm(.025)
  [1] -1.959964
  > pnorm(-1.959964)
  [1] 0.025

We showed earlier that 95% of the area under the curve of a normal distribution is within about 1.96 z-scores of the mean. We put .025 in the qnorm function because, if the mean is right smack in the middle of the 95% confidence interval, then 2.5% of the area lies to the left of the lower bound and 2.5% lies to the right of the upper bound. Together, this lower 2.5% and upper 2.5% make up the missing 5% of the area.
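The same reasoning gives the upper bound: the value with 2.5% of the area to its right has 97.5% of the area to its left. As a small sketch (the name z.bounds is chosen here only for illustration), both bounds can be requested at once, because qnorm accepts a vector of probabilities:

```r
# Both bounds of the middle 95% of the standard normal distribution
z.bounds <- qnorm(c(.025, .975))
z.bounds

# pnorm undoes qnorm: we get the original probabilities back
pnorm(z.bounds)

# Area between the two bounds
diff(pnorm(z.bounds))
```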

Don't feel limited to the 95% confidence interval, though. You can figure out the bounds of a 90% confidence interval using just the same procedure. In an interval that contains 90% of the area of a curve, the bounds are the values for which 5% of the area is to the left of the lower bound and 5% of the area is to the right of the upper bound (because 5% and 5% make up the missing 10%).

  > qnorm(.05)
  [1] -1.644854
  > qnorm(.95)
  [1] 1.644854
  > # notice the symmetry?

That means that, for this example, the 90% confidence interval is 64.26 to 66.13, or 65.197 ± 0.933.
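The procedure generalizes to any confidence level. The helper below is a hypothetical sketch, not a function from the text or from base R: conf.interval and stand.in.sample are names made up here, and the data are simulated stand-ins because our.new.sample is not reproduced in this section. It uses the same normal approximation as the calculations above.

```r
# Hypothetical helper: bounds of a confidence interval for the mean,
# using the normal approximation described in this section
conf.interval <- function(x, level = 0.95) {
  err <- sd(x) / sqrt(length(x))      # standard error of the mean
  z <- qnorm(1 - (1 - level) / 2)     # about 1.96 when level = 0.95
  c(lower = mean(x) - z * err, upper = mean(x) + z * err)
}

# Simulated stand-in data (assumed values, not the sample used in the text)
set.seed(1)
stand.in.sample <- rnorm(40, mean = 65, sd = 3.5)

conf.interval(stand.in.sample, level = 0.90)
conf.interval(stand.in.sample, level = 0.95)
conf.interval(stand.in.sample, level = 0.99)
```

Each interval is centered on the sample mean, and the intervals widen as the confidence level increases.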

Note

A warning about confidence intervals

There are many misconceptions about confidence intervals floating about. The most pervasive is the misconception that 95% confidence intervals represent the interval such that there is a 95% chance that the population mean is in the interval. This is false. Once the bounds are created, it is no longer a question of probability; the population mean is either in there or it's not.

To convince yourself of this, take two samples from the same distribution and create 95% confidence intervals for both of them. They are different, right? Create a few more. How could it be the case that all of these intervals have the same probability of including the population mean?
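This exercise is easy to sketch in code. The population below is made up for illustration (normal, mean 65, standard deviation 3.5, samples of size 40), and one.interval is a hypothetical helper, not something from the text. Because we invented the population, we know its mean and can check each interval against it.

```r
set.seed(1)
pop.mean <- 65   # hypothetical "true" population mean
pop.sd <- 3.5
n <- 40

# Build one 95% confidence interval from a fresh sample
one.interval <- function() {
  x <- rnorm(n, mean = pop.mean, sd = pop.sd)
  err <- sd(x) / sqrt(length(x))
  c(lower = mean(x) - 1.96 * err, upper = mean(x) + 1.96 * err)
}

# Two samples give two different intervals
one.interval()
one.interval()

# Repeat many times; each column of `intervals` is one interval
intervals <- replicate(10000, one.interval())
covered <- intervals["lower", ] <= pop.mean & pop.mean <= intervals["upper", ]

# Share of intervals that contain the population mean
mean(covered)
```

Any single interval either contains the population mean or it doesn't. The share printed at the end should land near 0.95, which describes how the procedure behaves over many repetitions rather than any one interval it produces.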

Using a Bayesian interpretation of probability, it is possible to say that there exist intervals for which we are 95% certain that they encompass the population mean, since Bayesian probability is a measure of our certainty, or degree of belief, in something. This Bayesian response to confidence intervals is called credible intervals, and we will learn about them in Chapter 7, Bayesian Methods. The procedure for their construction is very different from that of the confidence interval.
