Checking Data for Suitability of Normal Model

We've already noted that there are times when we'll want to ask how well a normal distribution can serve as a model of a real-world population or process. Because there are often advantages to using a normal model, the practical question often boils down to asking if our actual data deviate too grossly from a normal distribution. If not, a normal model can be quite useful.

In this chapter we'll introduce a basic approach to this question, based on the comparison of observed and theoretical quantiles (that is, the types of comparisons we've just made in the prior section). In Chapter 9 we'll return to this topic and refine the approach.

Normal Quantile Plots

In the previous section, we computed three different quantiles of the normal variable X ~ N(119.044, 18.841). We also have a large data table containing well over 6,000 observations of a variable, BPXSY1, that shares the same mean and standard deviation as X. We also know the median (50th percentile) of each. Comparing the computed quantiles to the observed quantiles gives the following:

Table 6.1. Comparison of Observed and Theoretical Quantiles
Percentile    Observed Value BPXSY1    Computed Value of X
25            106                      106.34
50            116                      119.04
75            128                      131.75
90            142                      143.19

The observed and computed values are similar, though not identical. If BPXSY1 followed this normal model exactly, the values in the last two columns of Table 6.1 would match perfectly.
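The computed column of Table 6.1 can be reproduced outside JMP. The following is a minimal sketch, assuming Python's standard-library NormalDist as a stand-in for the JMP calculation; the variable names are illustrative and are not part of the NHANES data or of JMP.

```python
from statistics import NormalDist

# Normal model from the text: mean 119.044, standard deviation 18.841.
model = NormalDist(mu=119.044, sigma=18.841)

# Observed BPXSY1 quantiles copied from Table 6.1.
observed = {25: 106, 50: 116, 75: 128, 90: 142}

# Theoretical quantile of the model at each percentile, to two decimals.
computed = {p: round(model.inv_cdf(p / 100), 2) for p in observed}

for p in observed:
    gap = round(computed[p] - observed[p], 2)
    print(f"{p}th percentile: observed {observed[p]}, "
          f"computed {computed[p]}, gap {gap}")
```

Each gap is the amount by which the normal model overstates the observed quantile at that percentile.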

We could continue to calculate theoretical quantiles for other percentiles and continue to compare the values one pair at a time. Fortunately, there is a more direct and simple way to carry out the comparison—a Normal Quantile Plot (sometimes known as a Normal Probability Plot, or NPP).

In JMP, a normal quantile plot is a scatterplot with the observed data values on the vertical axis and theoretical percentiles on the horizontal axis.[2] If the normal model were to match the observed data perfectly, the points in the graph would fall along a 45° diagonal line. For this reason, the plot includes a red diagonal reference line. To the extent that the points deviate from the line, we see imperfections in the fit. Let's look at two examples.

[2] These are the default axis settings.
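To make the construction concrete, here is a rough sketch of how the points of such a plot could be computed by hand. It is not JMP's algorithm: the plotting position (i - 0.5)/n is one common convention and is an assumption here, the function name normal_quantile_pairs is hypothetical, and the sample values are invented for illustration.

```python
from statistics import NormalDist, mean, stdev

def normal_quantile_pairs(data):
    """Return (theoretical, observed) pairs for a normal quantile plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Normal model with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting position for the i-th smallest value: (i - 0.5) / n.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical illustrative values, not NHANES data.
sample = [98, 104, 110, 112, 116, 118, 121, 125, 131, 150]
pairs = normal_quantile_pairs(sample)

for t, obs in pairs:
    print(f"theoretical {t:7.2f}   observed {obs:4d}   deviation {obs - t:+6.2f}")
```

Because the theoretical quantiles here are expressed in the same units as the data, a perfect fit would put every pair on the line where observed equals theoretical, and the deviation column shows how far each point strays from it.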

  1. Return to the NHANES Distribution report window.

  2. Hold down the CTRL key and click on the red triangle next to BPXSY1; select Normal Quantile Plot.

Figure 6.8 shows the plots for both blood pressure columns. Neither shows a perfectly straight diagonal pattern, but the plot of diastolic pressure on the right runs more closely along the diagonal for most of the distribution. There, the normal model fits poorly in the tails of the distribution but fairly well elsewhere.

Recall that we have more than 6,600 observations here. The shadowgram, histogram, and box plots show that the diastolic distribution is much more symmetric than the systolic.

Figure 6.8. Normal Quantile Plots for Blood Pressure Data

As an example of a quite good fit for a normal model, let's look at some other columns from a subset of the NHANES data table, selecting just the two-year-old girls in the sample.

  1. Select Rows → Row Selection → Select Where. We need to specify two conditions to select the rows corresponding to two-year-old girls. Within the dialog box, first set the condition that RIAGENDR equals 2 (the code for females), and then click the Add Condition button in the lower portion of the dialog box.

    Figure 6.9. Setting Select Rows Criteria
  2. Now choose RIDAGEYR equals 2 to select the two-year-old children, click Add Condition, and then OK.

If you now scroll down the rows of the data table, you'll find a relatively small number of rows selected. In fact, there are just 177 two-year-old girls in this sample of almost 10,000 people.
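The logic of this two-condition selection can be sketched in a few lines of code. This is only an illustration of requiring both conditions at once: the rows below are hypothetical stand-ins for the NHANES table, and only the column names RIAGENDR, RIDAGEYR, and BMXRECUM come from the text.

```python
# Hypothetical rows standing in for the NHANES data table.
rows = [
    {"RIAGENDR": 2, "RIDAGEYR": 2, "BMXRECUM": 88.1},
    {"RIAGENDR": 1, "RIDAGEYR": 2, "BMXRECUM": 90.4},
    {"RIAGENDR": 2, "RIDAGEYR": 34, "BMXRECUM": None},
    {"RIAGENDR": 2, "RIDAGEYR": 2, "BMXRECUM": 85.7},
]

# Both conditions must hold: female (code 2) and two years old.
selected = [r for r in rows if r["RIAGENDR"] == 2 and r["RIDAGEYR"] == 2]

print(len(selected), "of", len(rows), "rows selected")
```

A row that meets only one condition, such as a two-year-old boy or an adult woman, is left out of the selection.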

  1. In the Rows panel of the data table window, point to the word Selected, right-click, and choose Data View. This opens a new data table containing just the selected rows.

    Figure 6.10. Choosing the Data View of Selected Rows
  2. Select Analyze → Distribution. Cast BMXRECUM as Y. This column is the recumbent (reclining) height of these two-year-old girls.

  3. Create a normal quantile plot for the data; it will look like Figure 6.11. Here the points lie very close to the diagonal, suggesting that a normal model would be very suitable in this case.

    Figure 6.11. Normal Quantile Plot for Recumbent Length
  4. As a final example in this section, perform all of the steps necessary to create a normal quantile plot for INDFMPIR. The correct result is shown in Figure 6.12.

This column is the ratio of family income to the federally established definition of poverty. A ratio less than 1 means that the family lives in poverty. By definition, if the ratio is more than 5.0, NHANES records the value as equal to 5; there are 11 such families in this subsample.
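The recording rule for large ratios amounts to capping each value at 5. The sketch below illustrates that rule with invented ratios; the numbers and variable names are hypothetical and are not taken from NHANES.

```python
# Hypothetical income-to-poverty ratios, not NHANES values.
raw_ratios = [0.4, 0.9, 1.6, 2.3, 3.8, 5.0, 6.2, 7.5]

# Rule described in the text: any ratio above 5.0 is recorded as 5.
recorded = [min(r, 5.0) for r in raw_ratios]

in_poverty = sum(1 for r in recorded if r < 1)
at_cap = sum(1 for r in recorded if r == 5.0)

print(recorded)
print(f"{in_poverty} below the poverty line, {at_cap} recorded at the cap of 5")
```

Capped values pile up at exactly 5, so in a quantile plot they appear as a flat run of points at the top of the vertical axis rather than continuing along the diagonal.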

Figure 6.12. Quantile Plot for Family Income Poverty Ratio

In contrast to the prior graph, this plot leads us to conclude that a normal distribution forms a poor model for this variable. One could develop a normal model, but the results calculated from that model would rarely come close to matching the reality represented by the observed data from the families of two-year-old girls.
