We've already noted that there are times when we'll want to ask how well a normal distribution can serve as a model of a real-world population or process. Because a normal model often has advantages, the practical question usually boils down to whether our actual data deviate too grossly from a normal distribution. If not, a normal model can be quite useful.
In this chapter we'll introduce a basic approach to this question, based on the comparison of observed and theoretical quantiles (that is, the types of comparisons we've just made in the prior section). In Chapter 9 we'll return to this topic and refine the approach.
In the previous section, we computed three different quantiles of the normal variable X ~ N(119.044, 18.841). We also have a large data table containing well over 6,000 observations of a variable, BPXSY1, that shares the same mean and standard deviation as X, and we know the median (50th percentile) of each. If we compare our computed quantiles to the observed quantiles, we see the following:
Percentile | Observed Value of BPXSY1 | Computed Value of X |
---|---|---|
25 | 106 | 106.34 |
50 | 116 | 119.04 |
75 | 128 | 131.75 |
90 | 142 | 143.19 |
The observed and computed values are similar, though not identical. If BPXSY1 were normally distributed, the values in the last two columns of Table 6.1 would match almost exactly, differing only by sampling variation and rounding.
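The computed column can also be reproduced outside JMP. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`, assuming that 18.841 is the standard deviation of X, as the text states. The function name `normal_quantiles` is chosen here purely for illustration.

```python
from statistics import NormalDist

def normal_quantiles(mean, sd, percentiles):
    """Return {percentile: theoretical value} for a normal model."""
    model = NormalDist(mu=mean, sigma=sd)
    # inv_cdf expects a probability between 0 and 1, so divide by 100.
    return {p: round(model.inv_cdf(p / 100), 2) for p in percentiles}

computed = normal_quantiles(119.044, 18.841, [25, 50, 75, 90])
observed = {25: 106, 50: 116, 75: 128, 90: 142}  # observed BPXSY1 values from the table

# Print percentile, observed value, computed value, and the gap between them.
for p in computed:
    print(p, observed[p], computed[p], round(computed[p] - observed[p], 2))
```

Listing the gaps side by side makes it easier to see where the normal model and the data disagree most.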
We could continue to calculate theoretical quantiles for other percentiles and continue to compare the values one pair at a time. Fortunately, there is a more direct and simple way to carry out the comparison—a Normal Quantile Plot (sometimes known as a Normal Probability Plot, or NPP).
In JMP, a normal quantile plot is a scatterplot with values of the observed data on the vertical axis and theoretical percentiles on the horizontal axis.[2] If the normal model were to match the observed data perfectly, the points in the graph would plot out along a 45° diagonal line. For this reason, the plot includes a red diagonal reference line. To the extent that the points deviate from the line, we see imperfections in the fit. Let's look at two examples.
[2] These are the default axis settings.
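To make the idea behind the plot concrete, here is a rough sketch of how the plotted pairs could be built by hand in Python. It is not JMP's algorithm: the plotting position (i - 0.5) / n is an assumed convention, the function name `normal_quantile_pairs` is hypothetical, and the sample values are invented rather than taken from NHANES.

```python
from statistics import NormalDist, mean, stdev

def normal_quantile_pairs(data):
    """Pair each sorted observation with its theoretical normal quantile."""
    values = sorted(data)
    n = len(values)
    # Fit the normal model using the sample mean and standard deviation.
    model = NormalDist(mu=mean(values), sigma=stdev(values))
    pairs = []
    for i, observed in enumerate(values, start=1):
        # Assumed plotting position; JMP may use a different convention.
        theoretical = model.inv_cdf((i - 0.5) / n)
        pairs.append((theoretical, observed))
    return pairs

# Made-up illustrative readings, not NHANES data.
sample = [98, 104, 110, 112, 116, 118, 121, 127, 135, 150]
for theoretical, observed in normal_quantile_pairs(sample):
    print(f"{theoretical:7.2f}  {observed:4d}  gap = {observed - theoretical:+.2f}")
```

Because the theoretical quantiles here are expressed in the same units as the data, a perfect fit would put every pair on the line where the two values are equal. The printed gaps show how far each point would sit from that line.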
Return to the NHANES Distribution report window.
Hold down the CTRL key and click on the red triangle next to BPXSY1; select Normal Quantile Plot.
Figure 6.8 shows the plots for both blood pressure columns. Neither shows a perfectly straight diagonal pattern, but the plot of diastolic pressure on the right runs more closely along the diagonal for most of the distribution. For diastolic pressure, the normal model fits poorly in the tails of the distribution but fairly well elsewhere.
Recall that we have more than 6,600 observations here. The shadowgram, histogram, and box plots show that the diastolic distribution is much more symmetric than the systolic.
As an example of a quite good fit for a normal model, let's look at some other columns from a subset of the NHANES data table, selecting just the two-year-old girls in the sample.
Select Rows → Row Selection → Select Where. We need to specify two conditions to select the rows corresponding to two-year-old girls. Within the dialog box, first set the condition that RIAGENDR equals 2 (the code for females), and then click the Add Condition button in the lower portion of the dialog box.
Now choose RIDAGEYR equals 2 to select the two-year-old children, click Add Condition, and then OK.
If you now scroll down the rows of the data table you'll find a relatively small number of rows selected. In fact there are just 177 two-year-old girls in this sample of almost 10,000 people.
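The logic of the two-condition selection can be sketched in Python as follows. This only illustrates the "both conditions must hold" rule, not how JMP implements Select Where. The column names follow the text, but the rows and the helper `select_where` are invented for this example.

```python
# Hypothetical rows; the column names follow the text, the values are invented.
rows = [
    {"RIAGENDR": 2, "RIDAGEYR": 2, "BMXRECUM": 88.1},
    {"RIAGENDR": 1, "RIDAGEYR": 2, "BMXRECUM": 90.4},
    {"RIAGENDR": 2, "RIDAGEYR": 34, "BMXRECUM": None},
    {"RIAGENDR": 2, "RIDAGEYR": 2, "BMXRECUM": 86.7},
]

def select_where(table, **conditions):
    """Keep rows for which every named column equals its required value."""
    return [row for row in table
            if all(row.get(col) == val for col, val in conditions.items())]

# Female (code 2) AND two years old: both conditions must be true.
girls_age_2 = select_where(rows, RIAGENDR=2, RIDAGEYR=2)
print(len(girls_age_2))
```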
Select Analyze → Distribution. Cast BMXRECUM as Y. This column is the recumbent (reclining) height of these two-year-old girls.
Create a normal quantile plot for the data; it will look like Figure 6.11. Here the points lie very close to the diagonal, suggesting that the normal model would be very suitable in this case.
As a final example in this section, perform all of the steps necessary to create a normal quantile plot for INDFMPIR. The correct result is shown in Figure 6.12.
This column is the ratio of family income to the federally established definition of poverty. A ratio less than 1 means that the family lives in poverty. By definition, if the ratio is more than 5.0, NHANES records the value as equal to 5; there are 11 such families in this subsample.
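The recording rule just described can be sketched in a few lines of Python. The function names `record_ratio` and `in_poverty` are hypothetical, and the ratios are invented for illustration; they are not NHANES values.

```python
def record_ratio(raw_ratio, cap=5.0):
    """Record an income-to-poverty ratio, replacing anything above the cap with the cap."""
    return min(raw_ratio, cap)

def in_poverty(ratio):
    """A ratio below 1 means the family lives in poverty."""
    return ratio < 1

# Invented ratios for illustration only.
raw = [0.42, 0.97, 1.80, 3.25, 5.00, 6.70, 9.10]
recorded = [record_ratio(r) for r in raw]
print(recorded)
print(sum(in_poverty(r) for r in recorded))
```

Every raw ratio above the cap ends up stacked at exactly 5, and a stack of identical values is something a continuous normal model cannot reproduce.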
In contrast to the prior graph we should conclude that a normal distribution forms a poor model for this variable. One could develop a normal model, but the results calculated from that model would rarely come close to matching the reality represented by the observed data from the families of two-year-old girls.