Do you remember when, in Chapter 2, The Shape of Data, we described the normal distribution and how ubiquitous it is? The behavior of many real-life random variables is described very well by a normal distribution with certain parameters.
The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located and σ, the standard deviation, describes how wide or narrow the distribution is.
The heights of American females are approximately normally distributed with parameters µ = 65 inches and σ = 3.5 inches.
With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights.
As mentioned earlier in Chapter 2, The Shape of Data, we can't really answer the question "What is the probability that we choose a person who is exactly 60 inches tall?" because virtually no one is exactly 60 inches tall. Instead, we answer questions about how probable it is that a random person is within a certain range of heights.
What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this:
> f <- function(x){ dnorm(x, mean=65, sd=3.5) }
> integrate(f, 70, Inf)
0.07656373 with absolute error < 2.2e-06
The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller.
Luckily for us, the normal distribution is so popular and well studied that there is a function built into R for this, so we don't need to do the integration ourselves.
> pnorm(70, mean=65, sd=3.5)
[1] 0.9234363
The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P(> 70 inches), we can either subtract this value from 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier.
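As a quick check, here is a minimal sketch that computes the upper tail both ways using only pnorm from base R. The variable names are illustrative and do not appear elsewhere in the text.

```r
# Probability of a height below 70 inches (the lower tail)
p_below <- pnorm(70, mean = 65, sd = 3.5)

# Upper tail, route 1: take the complement of the lower tail
p_above_complement <- 1 - p_below

# Upper tail, route 2: ask pnorm for the upper tail directly
p_above_tail <- pnorm(70, mean = 65, sd = 3.5, lower.tail = FALSE)

print(p_above_complement)  # about 0.0766
print(p_above_tail)        # about 0.0766
```

Both values should agree with the result of integrate to within its reported error.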
When dealing with a normal distribution, we know that it is more likely to observe an outcome that is close to the mean than one that is distant, but just how much more likely? Well, it turns out that roughly 68% of all the values drawn from a normal distribution lie within 1 standard deviation, or 1 z-score, of the mean. Expanding our boundaries, we find that roughly 95% of all values are within 2 z-scores of the mean. Finally, about 99.7% of normal deviates are within 3 standard deviations of the mean. This is called the three-sigma rule.
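These three percentages can be reproduced with pnorm. The following is a small sketch on the standard normal distribution (mean 0, standard deviation 1); the names k and within_k are just for illustration.

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Area between -k and +k under the standard normal curve
within_k <- pnorm(k) - pnorm(-k)

round(within_k, 4)  # 0.6827 0.9545 0.9973
```

Because every normal distribution has the same shape once it is rescaled, the same proportions hold for any µ and σ.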
Before computers came on the scene, finding the probability of ranges associated with random deviates was a little more complicated. To save mathematicians from having to integrate the Gaussian (normal) function by hand (eww!), they used a z-table, or standard normal table. Though using this method today is, strictly speaking, unnecessary, and it is a little more involved, understanding how it works is important at a conceptual level. Not to mention that it gives you street cred as far as statisticians are concerned!
Formally, the z-table tells us the values of the cumulative distribution function of a standard normal distribution at different z-scores. Less abstractly, the z-table tells us the area under the curve from negative infinity to certain z-scores. For example, looking up -1 on a z-table will tell us the area to the left of 1 standard deviation below the mean (15.9%).
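In R, pnorm with its default arguments (mean 0, standard deviation 1) plays the role of a z-table lookup. As a brief sketch, the lookup for -1 can be checked like this; the name area_left is illustrative.

```r
# Area under the standard normal curve to the left of z = -1
area_left <- pnorm(-1)

round(area_left, 3)  # 0.159, that is, about 15.9%
```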
Z-tables only describe the cumulative distribution function (area under the curve) of a standard normal distribution—one with a mean of 0 and a standard deviation of 1. However, we can use a z-table on normal distributions with any parameters, µ and σ. All you need to do is convert a value from the original distribution into a z-score. This process is called standardization.
To use a z-table to find the probability of choosing a US woman at random who is taller than 70 inches, we first have to convert this value into a z-score. To do this, we subtract the mean (65 inches) from 70 and then divide that value by the standard deviation (3.5 inches), which gives (70 - 65) / 3.5 ≈ 1.43.
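The standardization step is easy to sketch in R. The variable names below (mu, sigma, x, z) are chosen for illustration; the point is that looking up the z-score in the standard normal distribution gives the same probability as using the original parameters.

```r
mu <- 65      # mean height in inches
sigma <- 3.5  # standard deviation in inches
x <- 70       # the height we are interested in

# Standardize: how many standard deviations is x above the mean?
z <- (x - mu) / sigma
round(z, 2)  # 1.43

# The standard normal CDF at z matches the original CDF at x
p_standard <- pnorm(z)
p_original <- pnorm(x, mean = mu, sd = sigma)
all.equal(p_standard, p_original)  # TRUE
```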
Then, we find 1.43 on the z-table; on most z-table layouts, this means finding the row labeled "1.4" (the z-score up to the tenths place) and the column labeled ".03" (the value in the hundredths place). The value at this intersection is .9236, which means that the complement (someone taller than 70 inches) is 1 - .9236 = 0.0764. Apart from a small difference caused by rounding the z-score to two decimal places, this is the same answer we got when we used integration and the pnorm function.
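To see where the table value comes from, here is a rough sketch that rebuilds a few rows of a z-table with pnorm and performs the same lookup. The layout (rows for tenths, columns for hundredths) follows the description above; the object names are illustrative assumptions.

```r
# Row labels: z-score up to the tenths place
tenths <- c(1.3, 1.4, 1.5)

# Column labels: the hundredths place, 0 through 9
hundredths_digit <- 0:9
hundredths <- hundredths_digit / 100

# Each cell is the standard normal CDF at (row value + column value)
z_table <- outer(tenths, hundredths, function(a, b) pnorm(a + b))
z_table <- round(z_table, 4)
dimnames(z_table) <- list(
  format(tenths, nsmall = 1),
  sprintf(".%02d", hundredths_digit)
)

# Look up z = 1.43: row "1.4", column ".03"
p_shorter <- z_table["1.4", ".03"]
p_shorter      # 0.9236
1 - p_shorter  # 0.0764
```

The table value is rounded to four decimal places and the z-score to two, which is why the result differs slightly from the 0.0766 that pnorm reports for the unrounded z-score.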