8 The Difference Between Two Means

Overview

Are the mean responses from two groups different? What evidence would it take to convince you? This question opens the door to many of the issues that pervade statistical inference, and this chapter explores these issues. Comparing group means also introduces an important statistical distinction regarding how the measurement or sampling process affects the way the resulting data are analyzed. This chapter also talks about validating statistical assumptions.

When two groups are considered, there are two distinct situations that lead to two different analyses:

Independent Groups—the responses from the two groups are unrelated and statistically independent. For example, the two groups might be two classrooms with two sets of students in them. The responses come from different experimental units or subjects. The responses are uncorrelated, and the means from the two groups are uncorrelated.

Matched Pairs—the two responses form a pair of measurements coming from the same experimental unit or subject. For example, a matched pair might be a before- and-after blood pressure measurement from the same subject. These responses are correlated, and the statistical method must take that into account.

Chapter Contents

Overview

Two Independent Groups

When the Difference Isn’t Significant

Check the Data

Launch the Fit Y by X Platform

Examine the Plot

Display and Compare the Means

Inside the Student’s t-Test

Equal or Unequal Variances?

One-Sided Version of the Test

Analysis of Variance and the All-Purpose F-Test

How Sensitive Is the Test?

How Many More Observations Are Needed?

When the Difference Is Significant

Normality and Normal Quantile Plots

Testing Means for Matched Pairs

Thermometer Tests

Look at the Data

Look at the Distribution of the Difference

Student’s t-Test

The Matched Pairs Platform for a Paired t-Test

Optional Topic:
An Equivalent Test for Stacked Data

Two Extremes of Neglecting the Pairing Situation: A Dramatization

A Nonparametric Approach

Introduction to Nonparametric Methods

Paired Means: The Wilcoxon Signed-Rank Test

Independent Means: The Wilcoxon Rank Sum Test

Exercises

Two Independent Groups

For two different groups, the goal might be to estimate the group means and determine if they are significantly different. Along the way, it is certainly advantageous to notice anything else of interest about the data.

When the Difference Isn’t Significant

A study compiled height measurements from 63 children, all age 12. It’s safe to say that as they get older, the mean height for males will be greater than for females, but is this the case at age 12? Let’s find out:

image   Select Help > Sample Data Library and open Htwt12.jmp to see the data shown (partially) below.

There are 63 rows and three columns. This example uses Gender and Height. Gender has the Nominal modeling type, with codes for the two categories, “f” and “m”. Gender will be the X variable for the analysis. Height contains the response of interest, and so it will be the Y variable.

image

Check the Data

To check the data, first look at the distributions of both variables graphically with histograms and box plots.

Every pilot walks around the plane looking for damage or other problems before starting up. No one would submit an analysis to the FDA without making sure that the data were not confused with data from another study. Do your kids use the same computer that you do? Then check your data. Does your data set have so many decimals of precision that it looks like it came from a random number generator? Great detectives let no clue go unnoticed. Great data analysts check their data carefully.

image   Select Analyze > Distribution.

image   In the launch window, assign Gender and Height to Y, Columns.

image   Click OK to see an analysis window like the one shown in Figure 8.1.

Figure 8.1 Histograms and Summary Tables

image

A look at the histograms for Gender and Height reveals that there are a few more males than females. The overall mean height is about 59, and there are no missing values (N is 63, and there are 63 rows in the table). The box plot indicates that two of the children seem unusually short compared to the rest of the data.

image   Move the cursor to the Gender histogram, and click on the bar for “m”.

Clicking the bar highlights the males in the data table and also highlights the males in the Height histogram (See Figure 8.2). Now click on the “f” bar, which highlights the females and un-highlights the males.

By alternately clicking on the bars for males and females, you can see the conditional distributions of each subset highlighted in the Height histogram. This gives a preliminary look at the height distribution within each group, and it is these group means we want to compare.

Figure 8.2 Interactive Histogram

image

Launch the Fit Y by X Platform

We know to use the Fit Y by X platform because our context is comparing two variables. In this example, there are two gender groups, and we want to compare their mean heights.

You can compare these group means by assigning Height as the continuous Y variable and Gender as the nominal (grouping) X variable. Begin by launching the analysis platform:

image   Select Analyze > Fit Y by X.

image   In the launch window, assign Height to Y and Gender to X.

Notice that the role-prompting window indicates that you are doing a one-way analysis of variance (ANOVA). Because Height is continuous and Gender is categorical (nominal), the Fit Y by X command automatically gives a one-way layout for comparing distributions.

image   Click OK to see the initial graphs, which are side-by-side vertical dot plots for each group (see the left picture in Figure 8.3).

Examine the Plot

The horizontal line across the middle shows the overall mean of all the observations. To identify possible outliers (students with unusual values):

image   Click the lowest point in the “f” vertical scatter, and Shift-click the lowest point in the “m” sample.

Shift-clicking extends a selection so that the first selection does not un-highlight.

image   Select Rows > Label/Unlabel to see the plot on the right in Figure 8.3.

Now the points are labeled 29 and 34, the row numbers corresponding to each data point. Move your mouse over these points (or any other points) to see the values for Gender and Height. Click anywhere in the graph to un-highlight (deselect) the points.

Figure 8.3 Plot of the Responses, Before and After Labeling Points

image

Display and Compare the Means

The next step is to display the group means in the graph, and to obtain an analysis of them.

image   Select Means/Anova/Pooled t from the red triangle menu next to Oneway Analysis.

image   From the same menu, select t Test.

This adds analyses that estimate the group means and test to see if they are different.

Note: You don’t usually select both versions of the t-test (shown in Figure 8.5). We’re selecting these for illustration. To determine the correct test for other situations, see “Equal or Unequal Variances?” on page 185.

Let’s discuss the first test, Means/Anova/Pooled t. This option automatically displays the mean diamonds as shown on the left in Figure 8.4, with summary tables and statistical test reports.

The center lines of the mean diamonds are the group means. The top and bottom of the diamonds form the 95% confidence intervals for the means. You can say the probability is 0.95 that this confidence interval contains the true group mean.

The confidence intervals show whether a mean is significantly different from some hypothesized value, but what can they show about whether two means are significantly different from each other? Use the rule shown here to interpret mean diamonds.

Interpretation Rule for Mean Diamonds: If the confidence intervals shown by the mean diamonds do not overlap, the groups are significantly different (but the reverse is not necessarily true).

It is clear that the mean diamonds in this example overlap. Therefore, you need to take a closer look at the text report beneath the plots to determine if the means are really different. The report, shown in Figure 8.4, includes summary statistics, t-test reports, an analysis of variance, and means estimates.

Note that the p-value of the t-test (labeled Prob>|t| in the t Test section of the report) is not significant.

Figure 8.4 Diamonds to Compare Group Means and Pooled t Report

image

Inside the Student’s t-Test

The Student’s t-test appeared in the last chapter to test whether a mean was significantly different from a hypothesized value. Now the situation is to test whether the difference of two means is significantly different from the hypothesized value of zero. The t-ratio is formed by first finding the difference between the estimate and the hypothesized value, and then dividing that quantity by its standard error.

t statistic = (estimate − hypothesized value) / (standard error of the estimate)

In the current case, the estimate is the difference in the means for the two groups, and the hypothesized value is zero.

t statistic = ((mean1 − mean2) − 0) / (standard error of the estimate)

For the means of two independent groups, the pooled standard error of the difference is the square root of the sum of squares of the standard errors of the means.

standard error of the difference = √( (standard error of mean1)² + (standard error of mean2)² )

JMP calculates the pooled standard error and forms the tables shown in Figure 8.4. Roughly, you look for a t-statistic greater than 2 in absolute value to get significance at the 0.05 level. The p-value is determined in part by the degrees of freedom (DF) of the t-distribution. For this case, DF is the number of observations (63) minus two, because two means are estimated. With the calculated t (-0.817) and DF, the p-value is 0.4171. The label Prob>|t| is given to this p-value in the test table to indicate that it is the probability of getting an even greater absolute t statistic. Usually a p-value less than 0.05 is regarded as significant—this is the significance level.

In this example, the p-value of 0.4171 isn’t small enough to detect a significant difference in the means. Is this to say that the means are the same? Not at all. You just don’t have enough evidence to show that they are different. If you collect more data, you might be able to show a significant, albeit small, difference.
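If you want to see this arithmetic outside JMP, the sketch below (in Python, with simulated heights standing in for the Htwt12.jmp columns; it is not JMP’s implementation) computes the t-ratio exactly as described above: the difference in means divided by the square root of the summed squared standard errors of the means. Note that JMP’s pooled report uses a pooled variance estimate for the standard error, and its unequal-variance report adjusts the degrees of freedom, so both reports will differ slightly from this sketch.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
female = rng.normal(59, 3, 30)   # simulated stand-ins for the "f" heights
male   = rng.normal(59, 3, 33)   # simulated stand-ins for the "m" heights

estimate = male.mean() - female.mean()              # mean1 - mean2
se_f = female.std(ddof=1) / np.sqrt(len(female))    # standard error of each group mean
se_m = male.std(ddof=1) / np.sqrt(len(male))
se_diff = np.sqrt(se_f**2 + se_m**2)                # sqrt of summed squared standard errors

t_ratio = (estimate - 0) / se_diff                  # hypothesized difference is zero
df = len(female) + len(male) - 2                    # degrees of freedom used by the pooled report
p_two_sided = 2 * stats.t.sf(abs(t_ratio), df)      # probability of a greater absolute t
print(t_ratio, p_two_sided)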

Equal or Unequal Variances?

The report shown in Figure 8.5 shows two t-test reports.

   The uppermost report is labeled Assuming equal variances, and is generated with the Means/Anova/Pooled t command.

   The lower report is labeled Assuming unequal variances, and is generated with the t Test command.

Which is the correct report to use?

Figure 8.5 t-Test and ANOVA Reports

image

In general, the unequal-variance t-test (also known as the unpooled t-test) is the preferred test. This is because the pooled version is quite sensitive (the opposite of robust) to departures from the equal-variance assumption (especially if the number of observations in the two groups is not the same), and often we cannot assume the variances of the two groups are equal. In addition, if the two variances are unequal, the unpooled test maintains the prescribed α-level and retains good power. With the pooled test, you might think you are conducting a test with α = 0.05, but it might in fact be 0.10 or 0.20. What you think is a 95% confidence interval might be, in reality, an 80% confidence interval (Cryer and Wittmer, 1999). For these reasons, we recommend the unpooled t-test (the t Test command) for most situations. In this example, neither t-test is significant.

However, the equal-variance version is included and discussed for several reasons.

   For situations with very small sample sizes (for example, three or fewer observations in each group), the individual variances cannot be estimated very well, but the pooled variance can be, giving better power. In these circumstances, the pooled version has a slight power advantage.

   Pooling the variances is the only option when there are more than two groups, where the F-test must be used. Therefore, the pooled t-test is a useful analogy for learning the analysis of the more general, multi-group situation. This situation is covered in Chapter 9, “Comparing Many Means: One-Way Analysis of Variance.”

Rule for t-tests: Unless you have very small sample sizes, or a specific a priori reason for assuming the variances are equal, use the unpooled t-test produced by the t Test command. When in doubt, use the unpooled version.
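To illustrate the rule outside JMP, here is a minimal sketch (Python with SciPy assumed, and simulated heights in place of the real Htwt12.jmp data) that runs both forms of the test. The equal_var argument switches between the pooled and the unpooled (Welch) versions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
female = rng.normal(59, 3, 30)   # simulated stand-ins for the "f" heights
male   = rng.normal(59, 3, 33)   # simulated stand-ins for the "m" heights

print(stats.ttest_ind(male, female, equal_var=True))    # pooled, like Means/Anova/Pooled t
print(stats.ttest_ind(male, female, equal_var=False))   # unpooled (Welch), like the t Test command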

One-Sided Version of the Test

The Student’s t-test in the previous example is for a two-sided alternative. In that situation, the difference could go either way (that is, either group could be taller), so a two-sided test is appropriate. The one-sided p-values are shown on the report, but you can also get them by doing a little arithmetic on the reported two-sided p-value, forming one-sided p-values by using

p/2 or 1 − p/2

depending on the direction of the alternative.

Figure 8.6 One- and Two-sided t-Test

image

In this example, the mean for males was less than the mean for females (the mean difference, using m-f, is -0.6252). The pooled t-test (top table in Figure 8.5) shows the p-value for the alternative hypothesis that females are taller is 0.2085, which is half the two-tailed p-value. Testing the other direction, the p-value is 0.7915. These values are reported in Figure 8.5 as Prob < t and Prob > t, respectively.
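The conversion is simple enough to do by hand, but here is a small sketch of the arithmetic in Python, using the numbers reported for this example.

p_two_sided = 0.4171    # two-sided p-value from the pooled t-test report
t_ratio = -0.817        # the m - f difference is negative

p_less    = p_two_sided / 2 if t_ratio < 0 else 1 - p_two_sided / 2   # Prob < t
p_greater = 1 - p_less                                                # Prob > t
print(round(p_less, 4), round(p_greater, 4))    # roughly 0.2085 and 0.7915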

Analysis of Variance and the All-Purpose F-Test

As well as showing the t-test for comparing two groups, the top report in Figure 8.5 shows an analysis of variance with its F-test. The F-test surfaces many times in the next few chapters, so an introduction is in order. Details will unfold later.

The F-test compares variance estimates for two situations, one a special case of the other. This is useful not only for testing means, but for other things as well. Furthermore, when there are only two groups, the F-test is equivalent to the pooled (equal-variance) t-test, and the F-ratio is the square of the t-ratio: (−0.817)² ≈ 0.667, as you can see in Figure 8.5.

To begin, look at the different estimates of variance as reported in the Analysis of Variance table.

First, the analysis of variance procedure pools all responses into one big population and estimates the population mean (the grand mean). The variance around that grand mean is estimated by taking the average sum of squared differences of each point from the grand mean.

The difference between a response value and an estimate such as the mean is called a residual, or sometimes the error.

What happens when a separate mean is computed for each group instead of the grand mean for all groups? The variance around these individual means is calculated, and this is shown in the Error line in the Analysis of Variance table. The Mean Square for Error is the estimate of this variance, called residual variance (also called s2), and its square root, called the root mean squared error (or s), is the residual standard deviation estimate.

If the true group means are different, then the separate means give a better fit than the one grand mean. In other words, there will be less variance using the separate means than when using the grand mean. The change in the residual sum of squares from the single-mean model to the separate-means model leads us to the F-test shown in the Model line of the Analysis of Variance table (“Model”, in this case, is Gender). If the hypothesis that the means are the same is true, the Mean Square for Model also estimates the residual variance.

The F-ratio is the Model Mean Square divided by the Error Mean Square:

F-ratio = (Mean Square for the Model) / (Mean Square for the Error) = 6.141 / 9.200 = 0.6675

The F-ratio is a measure of improvement in fit when separate means are considered. If there is no difference between fitting the grand mean and individual means, then both numerator and denominator estimate the same variance (the grand mean residual variance), so the F-ratio is around 1. However, if the separate-means model does fit better, the numerator (the model mean square) contains more than just the grand mean residual variance, and the value of the F-test increases.

If the two mean squares in the F-ratio are statistically independent (and they are in this kind of analysis), then you can use the F-distribution associated with the F- ratio to get a p-value. This tells how likely you are to see the F-ratio given by the analysis if there really was no difference in the means.

If the tail probability (p-value) associated with the F-ratio in the F-distribution is smaller than 0.05 (or the α-level of your choice), you can conclude that the variance estimates are different, and thus that the means are different.

In this example, the model mean square and the error mean square are not much different. In fact, the F-ratio is actually less than one, and the p-value of 0.4171 (the same as seen for the pooled t-test) is far from significant (it is much greater than 0.05).
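As a quick check of this arithmetic, the sketch below (Python with SciPy assumed) recomputes the F-ratio from the mean squares in Figure 8.5 and looks up its p-value from the F-distribution. With two groups, the F-ratio is the square of the pooled t-ratio, so the p-values agree.

from scipy import stats

ms_model, ms_error = 6.141, 9.200       # mean squares from the Analysis of Variance table
df_model, df_error = 1, 61              # 2 groups - 1, and 63 observations - 2 means

F = ms_model / ms_error                 # about 0.6675
p = stats.f.sf(F, df_model, df_error)   # about 0.417, matching the pooled t-test
print(F, p)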

The F-test can be viewed as whether the variance around the group means (the histogram on the left in Figure 8.7) is significantly less than the variance around the grand mean (the histogram on the right). In this case, the variance isn’t much different. If the effect were significant, the variation showing on the left would have been much less than that on the right.

In this way, a test of variances is also a test on means. The F-test turns up again and again because it is oriented to comparing the variation around two models. Most statistical tests can be constructed this way.

Figure 8.7 Residuals for Group Means Model (left) and Grand Mean Model (right)

image

Terminology for Sums of Squares: All disciplines that use statistics use analysis of variance in some form. However, you may find different names used for its components. For example, the following are different names for the same kinds of sums of squares (SS):

SS(model) = SS(regression) = SS (between)

SS(error) = SS(residual) = SS(within)

How Sensitive Is the Test?

How Many More Observations Are Needed?

So far, in this example, there is no conclusion to report because the analysis failed to show anything. This is an uncomfortable state of affairs. It is tempting to state that we have shown no significant difference, but in statistics this is the same as saying the findings were inconclusive. Our conclusions (or lack thereof) can just as easily be attributed to not having enough data as to there being a very small true effect.

To gain some perspective on the power of the test, or to estimate how many data points are needed to detect a difference, we use the Sample Size and Power facility in JMP. Looking at power and sample size lets us explore, graphically, the trade-offs among sample size, effect size, and the power of the test.

image   Select DOE > Design Diagnostics > Sample Size and Power.

This command brings up a list of prospective power and sample size calculators for several situations, as shown in Figure 8.8. In our case, we are concerned with comparing two means. From the Distribution report on height, we can see that the standard deviation is about 3. Suppose we want to detect a difference of 0.5.

image   Click Two Sample Means.

image   Enter 3 for Std Dev and 0.5 as Difference to Detect, as shown on the right in Figure 8.8.

Figure 8.8 Sample Size and Power Window

image

image   Click Continue to see the graph shown on the left in Figure 8.9.

image   Use the crosshair tool to find out what sample size is needed to have a power of 90%.

We would need around 1,519 data points to have a probability of 0.90 of detecting a difference of 0.5 with the current standard deviation.

How would this change if we were interested in a difference of 2 rather than a difference of 0.5?

image   Click the Back button and change the Difference to Detect from 0.5 to 2.

image   Click Continue.

image   Use the crosshair tool to find the number of data points you need for 90% power.

The results should be similar to the plot on the right in Figure 8.9.

We need only about 96 participants if we were interested in detecting a difference of 2.

Figure 8.9 Finding a Sample Size for 90% Power

image
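The same prospective calculation can be sketched outside JMP. The code below assumes the statsmodels package is available; its solve_power function returns the sample size per group, whereas the JMP window reports the total across both groups, so the per-group answer is doubled here. The totals come out close to the 1,519 and 96 found with JMP.

from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for delta in (0.5, 2.0):                        # differences we want to detect
    d = delta / 3.0                             # effect size = difference / std dev
    n_per_group = solver.solve_power(effect_size=d, alpha=0.05,
                                     power=0.90, alternative="two-sided")
    print(delta, round(2 * n_per_group))        # total sample size across both groups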

When the Difference Is Significant

The 12-year-olds in the previous example don’t have significantly different average heights, but let’s take a look at the 15-year-olds.

image   To start, select Help > Sample Data Library and open Htwt15.jmp.

Then, proceed as before:

image   Select Analyze > Fit Y by X, assign Gender to X, Factor and Height to Y, Response, and then click OK.

image   Select Means/Anova/Pooled t from the red triangle menu next to Oneway Analysis.

You should see the plot and tables shown in Figure 8.10.

Figure 8.10 Analysis for Mean Heights of 15-year-olds

image

Note: As we discussed earlier, we normally recommend the unpooled (t Test command) version of the test. We’re using the pooled version here as a basis for comparison between the results of the pooled t-test and the F-test.

The results for the analysis of the 15-year-old heights are completely different from the results for the 12-year-olds. Here, the males are significantly taller than the females. You can see this because the confidence intervals shown by the mean diamonds do not overlap. You can also see that the p-values for both the two-tailed t-test and the F-test are 0.0002, which is highly significant.

The F-test results say that the variance around the group means is significantly less than the variance around the grand mean. These two variances are shown, using uniform scaling, in the histograms in Figure 8.11.

Figure 8.11 Histograms of Grand Means Variance and Group Mean Variance

image

Normality and Normal Quantile Plots

The t-tests (and F-tests) used in this chapter assume that the sampling distribution for the group means is the normal distribution. With sample sizes of at least 30 for each group, normality is probably a safe assumption. The Central Limit Theorem says that means approach a normal distribution as the sample size increases, even if the original data are not normal.

If you suspect non-normality (due to small samples, or outliers, or a non-normal distribution), consider using nonparametric methods, covered at the end of this chapter.

To assess normality, use a normal quantile plot. This is particularly useful when overlaid for several groups, because so many attributes of the distributions are visible in one plot.

image   Return to the Fit Y by X platform showing Height by Gender for the 12-year-olds and select Normal Quantile Plot > Plot Actual by Quantile from the red triangle menu next to Oneway Analysis.

image   Do the same for the 15-year-olds.

image

The resulting plots (Figure 8.12) show the data compared to the normal distribution. The normality is judged by how well the points follow a straight line. In addition, the normal quantile plot gives other useful information:

   The standard deviations are the slopes of the straight lines. Lines with steep slopes represent the distributions with the greater variances.

   The vertical separation of the lines in the middle shows the difference in the means. The separation at other quantiles shows the difference at other points in the distributions.

The distributions for all groups look reasonably normal since the points (generally) cluster around their corresponding line.

The first graph in Figure 8.12 confirms that heights of 12-year-old males and females have nearly the same mean and variance—the slopes (standard deviations) are the same and the positions (means) are only slightly different.

The second graph in Figure 8.12 shows 15-year-old males and females have different means and different variances—the slope (standard deviation) is higher for the females, but the position (mean) is higher for the males. Recall that we used the pooled t-test in the analysis in Figure 8.10. Since the variances are different, the unpooled t-test (the t Test command) would have been the more appropriate test.

Figure 8.12 Normal Quantile Plots for 12-year-olds and 15-year-olds

image
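Outside JMP, a similar per-group check can be sketched with SciPy’s probplot and matplotlib (both assumed available). The data here are simulated heights, not the actual Htwt15.jmp values; the point is only that each panel plots one group against normal quantiles, with the slope of the fitted line reflecting that group’s standard deviation.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
female = rng.normal(63, 3.0, 30)   # simulated heights, larger spread
male   = rng.normal(68, 2.0, 33)   # simulated heights, higher mean

fig, axes = plt.subplots(1, 2, sharey=True)
for ax, sample, label in zip(axes, (female, male), ("f", "m")):
    stats.probplot(sample, dist="norm", plot=ax)   # normal quantile (Q-Q) plot
    ax.set_title("Gender = " + label)
plt.show()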

Testing Means for Matched Pairs

Consider a situation where two responses form a pair of measurements coming from the same experimental unit. A typical situation is a before-and-after measurement on the same subject. The responses are correlated, and if only the group means are compared—ignoring the fact that the groups have a pairing — information is lost. The statistical method called the paired t-test enables you to compare the group means, while taking advantage of the information gained from the pairings.

In general, if the responses are positively correlated, the paired t-test gives a more significant p-value than the t-test for independent means (grouped t-test) discussed in the previous sections. If responses are negatively correlated, then the paired t-test is less significant than the grouped t-test. In most cases where the pair of measurements are taken from the same individual at different times, they are positively correlated, but be aware that it is possible for pairs to have a negative correlation.

Thermometer Tests

A health care center suspected that temperature readings from a new ear drum probe thermometer were consistently higher than readings from the standard oral mercury thermometer. To test this hypothesis, two temperature readings were taken on 20 patients, one with the ear-drum probe, and the other with the oral thermometer. Of course, there was variability among the readings, so they were not expected to be exactly the same. However, the suspicion was that there was a systematic difference—that the ear probe was reading too high.

image   For this example, select Help > Sample Data Library and open Therm.jmp.

A partial listing of the data table appears in Figure 8.13. The Therm.jmp data table has 20 observations and 4 variables. The two responses are the temperatures taken orally and tympanically (by ear) on the same person on the same visit.

Figure 8.13 Comparing Paired Scores

image

For paired comparisons, the two responses need to be arranged in two columns, each with a continuous modeling type. This is because JMP assumes that each row represents a single experimental unit. Since the two measurements are taken from the same person, they belong in the same row. It is also useful to create a new column with a formula to calculate the difference between the two responses. (If your data table is arranged with the two responses in different rows, use the Tables > Split command to rearrange it. For more information, see “Juggling Data Tables” on page 51.)

Look at the Data

Start by inspecting the distribution of the data. To do this:

image   Select Analyze > Distribution and assign Oral and Tympanic to Y, Columns.

image   When the results appear, select Uniform Scaling from the red triangle menu next to Distribution to display the plots on the same scale.

The histograms (in Figure 8.14) show the temperatures to have different distributions. The mean looks higher for the Tympanic temperatures. However, as you will see later, this side-by-side picture of each distribution can be misleading if you try to judge the significance of the difference from this perspective.

What about the outliers at the top end of the Oral temperature distribution? Are they of concern? Can you expect the distribution to be normal? Not really. It is not the temperatures that are of interest, but the difference in the temperatures. So there is no concern about the distribution so far. If the plots showed temperature readings of 110 or 90, there would be concern, because that would be suspicious data for human temperatures.

Figure 8.14 Plots and Summary Statistics for Temperature

image

Look at the Distribution of the Difference

The comparison of the two means is actually a comparison of the difference between them. Inspect the distribution of the differences:

image   Select Analyze > Distribution and assign difference to Y, Columns.

The results (shown in Figure 8.15) show a distribution that seems to be above zero. In the Summary Statistics table, the lower 95% limit for the mean is 0.828, which is greater than zero.

Figure 8.15 Histogram and Summary Statistics of the Difference

image

Student’s t-Test

image   Select Test Mean from the red triangle menu for the histogram of the difference variable. When prompted for a hypothesized value, accept the default value of zero.

image   Click OK.

Now you have the t-test for testing that the mean difference over the matched pairs is zero.

In this case, the results in the Test Mean table, shown here, show a p-value of less than 0.0001, which supports our visual guess that there is a significant difference between methods of temperature taking. The tympanic temperatures are significantly higher than the oral temperatures.

image

There is also a nonparametric test, the Wilcoxon signed-rank test, described at the end of this chapter, that tests the difference between two means. This test is produced by selecting the appropriate box on the test mean window.

The last section in this chapter discusses the Wilcoxon signed-rank test.
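Before moving on, here is a minimal sketch of the paired t-test outside JMP (Python with SciPy assumed; simulated readings stand in for the Oral and Tympanic columns of Therm.jmp). It shows that testing the mean of the differences against zero is the same as the paired t-test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
oral     = rng.normal(98.6, 0.6, 20)          # simulated oral readings
tympanic = oral + rng.normal(1.1, 0.5, 20)    # correlated and systematically higher

diff = tympanic - oral
print(stats.ttest_1samp(diff, 0))        # test that the mean difference is zero
print(stats.ttest_rel(tympanic, oral))   # the paired t-test gives the identical result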

The Matched Pairs Platform for a Paired t-Test

JMP offers a special platform for the analysis of paired data. The Matched Pairs platform compares means between two response columns using a paired t-test. The primary plot in the platform is a plot of the difference of the two responses on the y-axis, and the mean of the two responses on the x-axis. This graph is the same as a scatterplot of the two original variables, but rotated 45° clockwise. A 45° rotation turns the original coordinates into a difference and a sum. By rescaling, this plot can show a difference and a mean, as illustrated in Figure 8.16.

Figure 8.16 Transforming to Difference by Sum Is a Rotation by 45°

image

   There is a horizontal line at zero, which represents no difference between the group means (y2 - y1 = 0 or y2 = y1).

   There is a line that represents the computed difference between the group means, and dashed lines around it showing a confidence interval.

Note: If the confidence interval does not contain the horizontal zero line, the test detects a significant difference.

Seeing this platform in use reveals its usefulness.

image   Select Analyze > Specialized Modeling > Matched Pairs and assign Oral and Tympanic to Y, Paired Response.

image   Click OK to see a scatterplot of Tympanic and Oral as a matched pair.

To see the rotation of the scatterplot in Figure 8.17 more clearly,

image   Select the Reference Frame option from the red triangle menu next to Matched Pairs.

Figure 8.17 Scatterplot of Matched Pairs Analysis

image

The analysis first draws a reference line where the difference is equal to zero. This is the line where the means of the two columns are equal. If the means are equal, then the points should be evenly distributed around this line. You should see about as many points above this line as below it. If a point is above the reference line, it means that the difference is greater than zero. In this example, points above the line show the situation where the Tympanic temperature is greater than the Oral temperature.

Parallel to the reference line at zero is a solid red line that is displaced from zero by an amount equal to the difference in means between the two responses. This red line is the line of fit for the sample. The test of the means is equivalent to asking if the red line through the points is significantly separated from the reference line at zero.

The dashed lines around the red line of fit show the 95% confidence interval for the difference in means.

This scatterplot gives you a good idea of each variable’s distribution, as well as the distribution of the difference.

Interpretation Rule for the Paired t-test Scatterplot: If the confidence interval (represented by the dashed lines around the red line) contains the reference line at zero, then the two means are not significantly different.

Another feature of the scatterplot is that you can see the correlation structure. If the two variables are positively correlated, they lie closer to the line of fit, and the variance of the difference is small. If the variables are negatively correlated, then most of the variation is perpendicular to the line of fit, and the variance of the difference is large. It is this variance of the difference that scales the difference in a t-test and determines whether the difference is significant.

The paired t-test table beneath the scatterplot of Figure 8.17 gives the statistical details of the test. The results should be identical to those shown earlier in the Distribution platform. The table shows that the observed difference in temperature readings of 1.12 degrees is significantly different from zero.

Optional Topic:
An Equivalent Test for Stacked Data

There is a third approach to the paired t-test. Sometimes, you receive grouped data with the response values stacked into a single column instead of having a column for each group.

Suppose the temperature data is arranged as shown here. Both the oral and tympanic temperatures are in the single column called Temperature. They are identified by the values of the Type and the Name columns.

image

Note: You can create this table yourself by using the Tables > Stack command to stack the Oral and Tympanic columns in the Therm.jmp table used in the previous examples.

If you select Analyze > Fit Y by X with Temperature (the response of both temperatures) as Y and Type (the classification) as X and select t Test from the red triangle menu, you get the t-test designed for independent groups, which is inappropriate for paired data.

However, fitting a model that includes an adjustment for each person fixes the independence problem because the correlation is due to temperature differences from person to person. To do this, you need to use the Fit Model command, covered in Chapter 14, “Fitting Linear Models.” The response is modeled as a function of both the category of interest (Type—Oral or Tympanic) and the Name category that identifies the person.

image   Select Analyze > Fit Model.

image   When the Fit Model window appears, assign Temperature to Y, and both Type and Name as model effects.

image   Click Run Model.

The resulting p-value for the category effect is identical to the p-value from the paired t-test shown previously. In fact, the F-ratio in the effect test is exactly the square of the t-test value in the paired t-test. In this case the formula is

(paired t-test statistic)² = 8.0302² = 64.4848 = (stacked F-test statistic)

The Fit Model platform gives you a plethora of information, but for this example you need only the Effect Test table (Figure 8.18). It shows an F-ratio of 64.48, which is exactly the square of the t-ratio of 8.03 found with the previous approach. It’s just another way of doing the same test.

Figure 8.18 Equivalent F-test on Stacked Data

image
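For readers who want to try the stacked formulation outside JMP, here is a sketch using the statsmodels package (assumed available), again with simulated readings in place of Therm.jmp. Fitting Temperature as a function of Type and Name and requesting the ANOVA table gives an F for Type equal to the square of the paired t-ratio.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
oral     = rng.normal(98.6, 0.6, 20)          # simulated oral readings
tympanic = oral + rng.normal(1.1, 0.5, 20)    # correlated and systematically higher

stacked = pd.DataFrame({
    "Temperature": np.concatenate([oral, tympanic]),
    "Type": ["Oral"] * 20 + ["Tympanic"] * 20,
    "Name": list(range(20)) * 2,              # identifies the person for each reading
})

fit = ols("Temperature ~ C(Type) + C(Name)", data=stacked).fit()
print(sm.stats.anova_lm(fit, typ=2))          # the F for C(Type) is the paired t-ratio squared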

The alternative formulation for the paired means covered in this section is important for cases in which there are more than two related responses. Having many related responses is a repeated-measures or longitudinal situation. The generalization of the paired t-test is called the multivariate or T2 approach, whereas the generalization of the stacked formulation is called the mixed-model or split-plot approach.

Two Extremes of Neglecting the Pairing Situation: A Dramatization

What happens if you do the wrong test? What happens if you do a t-test for independent groups on highly correlated paired data?

Consider the following two data tables:

image   Select Help > Sample Data Library and open Blood Pressure by Time.jmp to see the left-hand table in Figure 8.19.

This table represents blood pressure measured for ten people in the morning and again in the afternoon. The hypothesis is that, on average, the blood pressure in the morning is the same as it is in the afternoon.

image   Open the sample data table called BabySleep.jmp to see the right-hand table in Figure 8.19.

In this table, a researcher monitored ten two-month-old infants at 10 minute intervals over a day and counted the intervals in which a baby was asleep or awake. The hypothesis is that at two months old, the asleep time is equal to the awake time.

Figure 8.19 The Blood Pressure by Time and BabySleep Sample Data Tables

image

Let’s do the incorrect t-test (the t-test for independent groups). Before conducting the test, we need to reorganize the data using the Stack command.

image   Select Tables > Stack to create two new tables (a stacked version of BabySleep.jmp and a stacked version of Blood Pressure by Time.jmp). Stack Awake and Asleep to form a single column in one table, and BP AM and BP PM to form a single column in a second table.

image   Select Analyze > Fit Y by X on both new tables, assigning the stacked Data column to Y, Response and the Label column to X, Factor.

image   Select t Test from the red triangle menu for each plot.

The results for the two analyses are shown in Figure 8.20. The conclusions are that there is no significant difference between Awake and Asleep time, nor is there a difference between time of blood pressure measurement. The summary statistics are the same in both analyses and the probability is the same, showing no significance (p = 0.1426).

Figure 8.20 Results of t-test for Independent Means

image

Now do the proper test, the paired t-test.

image   Using the original (unstacked) tables, choose Analyze > Distribution and examine a distribution of the Dif variable in each table.

image   Double-click on the axis of the blood pressure histogram and make its scale match the scale of the baby sleep axis.

image   Then, test that each mean is zero (see Figure 8.21).

In this case, the analysis of the differences leads to very different conclusions.

   The mean difference between time of blood pressure measurement is highly significant because the variance is small (Std Dev=3.89).

   The mean difference between awake and asleep time is not significant because the variance of this difference is large (Std Dev=51.32).

So don’t judge the mean of the difference by the difference in the means without noting that the variance of the difference is the measuring stick, and that the measuring stick depends on the correlation between the two responses.

Figure 8.21 Histograms and Summary Statistics Show the Problem

image

The scatterplots produced by the Bivariate platform (Figure 8.22) and the Matched Pairs platform (Figure 8.23) show what is happening. The first pair is highly positively correlated, leading to a small variance for the difference. The second pair is highly negatively correlated, leading to a large variance for the difference.

Figure 8.22 Bivariate Scatterplots of Blood Pressure and Baby Sleep Data

image

Figure 8.23 Paired t-test for Positively and Negatively Correlated Data

image
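A short simulation makes the point about the measuring stick concrete. The sketch below (simulated data, not the sample tables) shows that the variance of the difference follows var(x) + var(y) - 2 cov(x, y): positively correlated pairs give a small difference variance, as in the blood pressure data, and negatively correlated pairs give a large one, as in the baby sleep data.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(100, 10, 1000)
y_pos = 0.9 * x + rng.normal(0, 4, 1000)          # positively correlated partner
y_neg = 290 - 0.9 * x + rng.normal(0, 4, 1000)    # negatively correlated partner

# var(x - y) = var(x) + var(y) - 2 * cov(x, y)
print(np.var(x - y_pos, ddof=1))   # small variance: pairing helps, like the blood pressure pairs
print(np.var(x - y_neg, ddof=1))   # large variance: pairing hurts, like the awake/asleep pairs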

To review, make sure you can answer the following question:

What is the reason that you use a different t-test for matched pairs?

a. Because the statistical assumptions for the t-test for groups are not satisfied with correlated data.

b. Because you can detect the difference much better with a paired t-test. The paired t-test is much more sensitive to a given difference.

c. Because you might be overstating the significance if you used a group t-test rather than a paired t-test.

d. Because you are testing a different thing.

Answer: All of the above.

a.   The grouped t-test assumes that the data are uncorrelated and paired data are correlated. So you would violate assumptions using the grouped t-test.

b.   Most of the time the data are positively correlated, so the difference has a smaller variance than you would attribute if they were independent. So the paired t-test is more powerful—that is, more sensitive.

c.   There may be a situation in which the pairs are negatively correlated, and if so, the variance of the difference would be greater than you expect from independent responses. The grouped t-test would overstate the significance.

d.   You are testing the same thing in that the mean of the difference is the same as the difference in the means. But you are testing a different thing in that the variance of the mean difference is different from the variance of the differences in the means (ignoring correlation), and the significance for means is measured with respect to the variance.

Mouse Mystery

Comparing two means is not always straightforward. Consider this story.

A food additive showed promise as a dieting drug. An experiment was run on mice to see if it helped control their weight gain. If it proved effective, then it could be sold to millions of people trying to control their weight.

After the experiment was over, the average weight gain for the treatment group was significantly less than for the control group, as hoped for. Then someone noticed that the treatment group had fewer observations than the control group. It seems that the food additive caused the obese mice in that group to tend to die young, so the thinner mice had a better survival rate for the final weighing.

Returning to the blood pressure and baby sleep examples: the two tables are set up so that the values are identical for the two responses, as a marginal distribution, but the values are paired differently, so that the Blood Pressure by Time difference is highly significant and the BabySleep difference is non-significant. This illustrates that it is the distribution of the difference that is important, not the distribution of the original values. If you don’t look at the data correctly, the data can appear the same even when they are dramatically different.

A Nonparametric Approach

Introduction to Nonparametric Methods

Nonparametric methods provide ways to analyze and test data that do not depend on assumptions about the distribution of the data. In order to ignore normality assumptions, nonparametric methods disregard some of the information in your data. Typically, instead of using actual response values, you use the rank ordering of the response.

Most of the time you don’t really throw away much relevant information, but you avoid information that might be misleading. A nonparametric approach creates a statistical test that ignores all the spacing information between response values. This protects the test against distributions that have very non-normal shapes, and can also provide insulation from data contaminated by rogue values.

In many cases, the nonparametric test has almost as much power as the corresponding parametric test and in some cases has more power. For example, if a batch of values is normally distributed, the rank-scored test for the mean has 95% efficiency relative to the most powerful normal-theory test.

The most popular nonparametric techniques are based on functions (scores) of the ranks:

   the rank itself, called a Wilcoxon score

   whether the value is greater than the median, that is, whether the rank is more than (n + 1)/2, called the Median test

   a normal quantile, computed as in normal quantile plots, called the van der Waerden score

Nonparametric methods are not contained in a single platform in JMP, but are available through many platforms according to the context where that test naturally occurs.

Paired Means: The Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is the nonparametric analog to the paired t-test. You do a signed-rank test by testing the distribution of the difference of matched pairs, as discussed previously. The following example shows the advantage of using the signed-rank test when data are non-normal.

image   Select Help > Sample Data Library and open Chamber.jmp.

The data represent electrical measurements on 24 wiring boards. Each board is measured first when soldering is complete, and again after three weeks in a chamber with a controlled environment of high temperature and humidity (Iman 1995).

image

image   Examine the diff variable (difference between the outside and inside chamber measurements) with Analyze > Distribution.

image   Select Continuous Fit > Normal from the red triangle menu next to diff.

image   Select Goodness of Fit from the red triangle menu next to Fitted Normal.

The Shapiro-Wilk W-test in the report tests the assumption that the data are normal. The probability of 0.0090 given by the normality test indicates that the data are significantly non-normal. In this situation, it might be better to use signed ranks for comparing the mean of diff to zero. Since this is a matched pairs situation, use the Matched Pairs platform.

Figure 8.24 The Chamber Data and Test For Normality

image

image   Select Analyze > Specialized Modeling > Matched Pairs.

image   Assign outside and inside as the paired responses, then click OK.

When the report appears,

image   Select Wilcoxon Signed Rank from the red triangle menu on the Matched Pairs title bar.

Note that the standard t-test probability is not significant (p = 0.1107). However, in this example, the signed-rank test detects a difference between the groups with a p-value of 0.0106.

image
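A comparable check can be sketched outside JMP with SciPy (assumed available). The differences here are simulated and skewed, standing in for the diff column of Chamber.jmp; the sequence mirrors the steps above: test normality, then compare the ordinary t-test with the signed-rank test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
diff = rng.exponential(1.0, 24) - 0.7     # simulated, skewed differences

print(stats.shapiro(diff))                # Shapiro-Wilk W-test of normality
print(stats.ttest_1samp(diff, 0))         # ordinary (paired) t-test on the differences
print(stats.wilcoxon(diff))               # Wilcoxon signed-rank test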

Independent Means: The Wilcoxon Rank Sum Test

If you want to nonparametrically test the means of two independent groups, as in the t-test, then you can rank the responses and analyze the ranks instead of the original data. This is the Wilcoxon rank sum test. It is also known as the Mann-Whitney U-test because there is a different formulation of it that was not discovered to be equivalent to the Wilcoxon rank sum test until after it had become widely used.

image   Open Htwt15.jmp again, select Analyze > Fit Y by X with Height as Y and Gender as X, and then click OK.

This is the same platform that gave the t-test.

image   Select Nonparametric > Wilcoxon Test from the red triangle menu next to Oneway Analysis.

The result is the report in Figure 8.25. This table shows the sum and mean ranks for each group, then the Wilcoxon statistic along with an approximate p-value based on the large-sample distribution of the statistic. In this case, the difference in the mean heights is declared significant, with a p-value of 0.0002. If you have small samples, consider also checking tables of the Wilcoxon statistic to obtain a more exact test, because the normal approximation is not very precise in small samples.

Figure 8.25 Wilcoxon Rank Sum Test for Independent Groups

image
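For comparison outside JMP, the rank sum test is available in SciPy (assumed). The sketch below uses simulated heights in place of Htwt15.jmp; mannwhitneyu gives the Mann-Whitney form of the test, and ranksums gives the large-sample normal approximation mentioned above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
female = rng.normal(63, 3.0, 30)    # simulated stand-ins for the Htwt15.jmp heights
male   = rng.normal(68, 2.5, 33)

print(stats.mannwhitneyu(male, female, alternative="two-sided"))
print(stats.ranksums(male, female))   # normal approximation to the rank sum test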

Exercises

1.   The sample data table On-Time Arrivals.jmp (Aviation Consumer Home Page, 1999) contains the percentage of airlines’ planes that arrived on time in 29 airports (those that the Department of Transportation designates “reportable”). You are interested in seeing if there are differences between certain months.

(a)   Suppose you want to examine the differences between March and June. Is this a situation where a grouped test of two means is appropriate or would a matched pairs test be a better choice?

(b)   Based on your answer in (a), determine if there is a difference in on-time arrivals between the two months.

(c)   Similarly, determine if there is a significant difference between the months June and August, and also between March and August.

2.   William Gosset was a pioneer in statistics. In one famous experiment, he wanted to investigate the yield from corn planted from two different types of seeds. One type of seed was dried in the normal way, while the other was kiln-dried. Gosset planted one seed of each seed type in 11 different plots and measured the yield for each one. The drying methods are represented by the columns Regular or Kiln in the sample data table Gosset’s Corn.jmp (Gosset 1908).

(a)   This is a matched-pairs experiment. Explain why it is inappropriate to use the grouped-means method of determining the difference between the two seeds.

(b)   Using the matched-pairs platform, determine if there is a difference in yield between kiln-dried and regular-dried corn.

3.   The sample data table Companies.jmp (Fortune Magazine, 1990) contains data on sales, profits, and employees for two different industries (Computers and Pharmaceutical). This exercise is interested in detecting differences between the two types of companies.

(a)   Suppose you wanted to test for differences in sales amounts for the two business types. First, examine histograms of the variables Type and Sales $ and comment on the output.

(b)   In comparing sales for the two types of companies, should you use grouped means or matched pairs for the test?

(c)   Using your answer in part (b), determine if there is a difference between the sales amounts of the two types of companies.

(d)   Should you remove any outliers in your analysis of part (c)? Comment on why this would or would not be appropriate in this situation.

4.   The sample data table Cars.jmp (Henderson and Velleman, 1981) contains information on several different brands of cars, including number of doors and impact compression for various parts of the body during crash tests.

(a)   Is there a difference between two- and four-door cars when it comes to impact compression on left legs?

(b)   Is there a difference between two- and four-door cars when it comes to compression on right legs?

(c)   Is there a difference between two- and four-door cars when it comes to head impact compression?

5.   The sample data table Chamber.jmp represents electrical measurements on 24 electrical boards. (This is the same data used in “Paired Means: The Wilcoxon Signed-Rank Test” on page 211.) Each measurement was taken when soldering was complete and then again three weeks later after sitting in a temperature- and humidity-controlled chamber. The investigator wants to know if there is a difference between the measurements.

(a)   Why is this a situation that calls for a matched-pairs analysis?

(b)   Using the paired t-test, determine if there is a significant difference between the means when the boards were outside versus inside the chamber.

(c)   Does the analysis in part (b) lead to the same conclusion as the Wilcoxon signed-rank test? Why or why not?

6.   A manufacturer of widgets determined the quality of its product by measuring abrasion on samples of finished products. The manufacturer was concerned that there was a difference in the abrasion measurement for the two shifts of workers that were employed at the factory. Use the data stored in the sample data table Abrasion.jmp to compute a t-test of abrasion comparing the two shifts. Is there statistical evidence for a difference?

7.   The manufacturers of a medication were concerned about adverse reactions in patients treated with their drug. Data on adverse reactions is stored in the sample data table AdverseR.jmp. The duration of the adverse reaction is stored in the ADR DURATION variable.

(a)   Patients given a placebo are noted with PBO listed in the treatment group variable, while those that received the standard drug regimen are designated with ST_DRUG. Test whether there is a significant difference in adverse reaction times between the two groups.

(b)   Test whether there is a difference in adverse reaction times based on the gender of the patient.

(c)   Redo the analyses in parts (a) and (b) using a nonparametric test. Do the results differ?

(d)   A critic of the study claims that the weights of the patients in the placebo group are not the same as those of the treatment group. Do the data support the critic’s claim?
