This chapter contains discussions of six enduring controversies in measurement and statistics.
Because many usability practitioners deeply depend on the use of measurement and statistics to guide their design recommendations, they inherit these controversies. In this chapter we summarize both sides of each issue and discuss what we, as pragmatic usability practitioners, recommend.
“There is, of course, nothing strange or scandalous about divisions of opinion among scientists. This is a condition for scientific progress.”
(Grove, 1989, p. 133)
“Criticism is the mother of methodology.”
Depending upon what type of scale we have constructed, some statistics are appropriate, others not. … The criterion for the appropriateness of a statistic is invariance under the transformations permitted by the scale. … Thus, the mean is appropriate to an interval scale and also to a ratio scale (but not, of course, to an ordinal or a nominal scale).
That I do not accept Stevens’ position on the relationship between strength of measurement and “permissible” statistical procedures should be evident from the kinds of data used as examples throughout this Primer: level of agreement with a questionnaire item, as measured on a 5-point scale having attached verbal labels … This is not to say, however, that the researcher may simply ignore the level of measurement provided by his or her data. It is indeed crucial for the investigator to take this factor into account in considering the kinds of theoretical statements and generalizations he or she makes on the basis of significance tests.
(Harris, 1985, pp. 326–328)
Even if one believes that there is a “real” scale for each attribute, which is either mirrored directly in a particular measure or mirrored as some monotonic transformation, an important question is, “What difference does it make if the measure does not have the same zero point or proportionally equal intervals as the ‘real’ scale?” If the scientist assumes, for example, that the scale is an interval scale when it “really” is not, something should go wrong in the daily work of the scientist. What would really go wrong? All that could go wrong would be that the scientist would make misstatements about the specific form of the relationship between the attribute and other variables. … How seriously are such misassumptions about scale properties likely to influence the reported results of scientific experiments? In psychology at the present time, the answer in most cases is “very little.”
(Nunnally, 1978, p. 28)
“But these numbers are not cardinal numbers,” the professor expostulated. “You can’t add them.”
“Oh, can’t I?” said the statistician. “I just did. Furthermore, after squaring each number, adding the squares, and proceeding in the usual fashion, I find the population standard deviation to be exactly 16.0.”
“But you can’t multiply ‘football numbers,’” the professor wailed. “Why, they aren’t even ordinal numbers, like test scores.”
“The numbers don’t know that,” said the statistician. “Since the numbers don’t remember where they came from, they always behave just the same way, regardless.”
On the other hand, for this ‘illegal’ statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results. While the outlawing of this procedure would probably serve no good purpose, it is proper to point out that means and standard deviations computed on an ordinal scale are in error to the extent that the successive intervals on the scale are unequal in size. When only the rank-order of data is known, we should proceed cautiously with our statistics, and especially with the conclusions we draw from them.
Table 9.1
Comparison of t With 30 Degrees of Freedom to z
 | α = 0.10 | α = 0.05 | α = 0.01
t(30) | 1.697 | 2.042 | 2.750 |
z | 1.645 | 1.960 | 2.576 |
Difference | 0.052 | 0.082 | 0.174 |
Percent | 3.2% | 4.2% | 6.8% |
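The entries in Table 9.1 are easy to reproduce. The z critical values come from the inverse of the standard normal CDF; the t(30) values below are taken from the table itself (in practice a t inverse CDF, e.g., from scipy, would supply them). A minimal sketch using only the Python standard library:

```python
from statistics import NormalDist

# Two-tailed critical z for each alpha: z = inverse_CDF(1 - alpha/2)
alphas = [0.10, 0.05, 0.01]
t30 = [1.697, 2.042, 2.750]  # t critical values for 30 df, from Table 9.1

for alpha, t in zip(alphas, t30):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = t - z
    pct = 100 * diff / z
    print(f"alpha={alpha:.2f}  z={z:.3f}  diff={diff:.3f}  pct={pct:.1f}%")
```

Running this reproduces the Difference and Percent rows of the table, showing that with 30 degrees of freedom t and z critical values differ by only about 3–7%.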
“Surely, God loves the .06 nearly as much as the .05”
(Rosnow and Rosenthal, 1989, p. 1277).
“Use common sense to extract the meaning from your data. Let the science of human factors and psychology drive the statistics; do not let statistics drive the science”
(Wickens, 1998, p. 22).
From this viewpoint, the cost of each type of error to user performance and possibly to user safety should be regarded as equivalent, and not as in the classical statistics of the 0.05 level, weighted heavily to avoiding Type I errors (a 1-in-20 chance of observing the effect, given that there is no difference between the old and new system). Indeed, it seems irresponsible to do otherwise than treat the two errors equivalently. Thus, there seems no possible reason why the decision criterion should be locked at 0.05 when, with applied studies that often are destined to have relatively low statistical power, the probability of a Type II error may be considerably higher than 0.05. Instead, designers should be at the liberty to adjust their own decision criteria (trading off between the two types of statistical errors) based on the consequences of the errors to user performance.
Table 9.2
Different Combinations of zα and zβ Summing to 2.93
zα | zβ | α | β |
2.93 | 0.00 | 0.003 | 0.500 |
2.68 | 0.25 | 0.007 | 0.401 |
2.43 | 0.50 | 0.015 | 0.309 |
2.18 | 0.75 | 0.029 | 0.227 |
1.93 | 1.00 | 0.054 | 0.159 |
1.65 | 1.28 | 0.100 | 0.100 |
1.25 | 1.68 | 0.211 | 0.046 |
1.00 | 1.93 | 0.317 | 0.027 |
0.75 | 2.18 | 0.453 | 0.015 |
0.50 | 2.43 | 0.617 | 0.008 |
0.25 | 2.68 | 0.803 | 0.004 |
0.00 | 2.93 | 1.000 | 0.002 |
Note: The row with zα = 1.65 and zβ = 1.28 contains the values for which alpha = beta (both 0.10).
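The α and β columns in Table 9.2 follow directly from the standard normal CDF (Φ): for a two-tailed test, α = 2(1 − Φ(zα)), and β = 1 − Φ(zβ). A quick check in Python (small discrepancies from the table arise because its z values are rounded to two decimals):

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF

def alpha_beta(z_alpha, z_beta):
    """Two-tailed alpha and one-tailed beta for a given split of z."""
    return 2 * (1 - phi(z_alpha)), 1 - phi(z_beta)

# A few rows from Table 9.2; z_alpha + z_beta = 2.93 throughout
for za, zb in [(2.93, 0.00), (1.93, 1.00), (1.65, 1.28), (0.00, 2.93)]:
    a, b = alpha_beta(za, zb)
    print(f"z_alpha={za:.2f}  z_beta={zb:.2f}  alpha={a:.3f}  beta={b:.3f}")
```

The point of the table survives the rounding: holding the sum of zα and zβ constant, loosening α tightens β and vice versa, which is exactly the tradeoff practitioners are free to tune.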
In such cases [multiple dependent variables] the investigator faces a choice of whether to present the results for each variable separately, to aggregate them in some way before analysis, or to use multivariate analysis of variance. … One of these alternatives—MANOVA—stands at the bottom of my list of options. … Technical discussion of MANOVA would carry us too far afield, but my experience with the method is that it is effortful to articulate the results. … Furthermore, when MANOVA comes out with simple results, there is almost always a way to present the same outcome with one of the simpler analytical alternatives. Manova mania is my name for the urge to use this technique.
In 1972 Maurice Kendall commented on how regrettable it was that during the 1940s mathematics had begun to ‘spoil’ statistics. Nowhere is this shift in emphasis from practice, with its room for intuition and pragmatism, to theory and abstraction, more evident than in the area of multiple comparison procedures. The rules for making such comparisons have been discussed ad nauseam and they continue to be discussed.
Table 9.4
Illustration of Alpha Inflation for 20 Tests Conducted With α = 0.05
x | p(x) | p(at least x) |
0 | 0.35849 | 1.00000 |
1 | 0.37735 | 0.64151 |
2 | 0.18868 | 0.26416 |
3 | 0.05958 | 0.07548 |
4 | 0.01333 | 0.01590 |
5 | 0.00224 | 0.00257 |
6 | 0.00030 | 0.00033 |
7 | 0.00003 | 0.00003 |
8 | 0.00000 | 0.00000 |
9 | 0.00000 | 0.00000 |
10 | 0.00000 | 0.00000 |
11 | 0.00000 | 0.00000 |
12 | 0.00000 | 0.00000 |
13 | 0.00000 | 0.00000 |
14 | 0.00000 | 0.00000 |
15 | 0.00000 | 0.00000 |
16 | 0.00000 | 0.00000 |
17 | 0.00000 | 0.00000 |
18 | 0.00000 | 0.00000 |
19 | 0.00000 | 0.00000 |
20 | 0.00000 | 0.00000 |
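Under the null hypothesis, each of the 20 tests is an independent Bernoulli trial with success probability 0.05, so the entries in Table 9.4 are binomial probabilities, and the headline alpha-inflation figure is 1 − (1 − α)ⁿ. A sketch using only the standard library:

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, alpha = 20, 0.05
for x in range(6):
    p_x = binom_pmf(x, n, alpha)
    p_at_least = sum(binom_pmf(k, n, alpha) for k in range(x, n + 1))
    print(f"x={x}  p(x)={p_x:.5f}  p(at least x)={p_at_least:.5f}")

# Alpha inflation: probability of at least one Type I error in 20 tests
print(f"p(at least 1) = {1 - (1 - alpha)**n:.5f}")
```

This reproduces the top rows of Table 9.4, including the key result that the chance of at least one spurious significant result across 20 tests is about 0.64, not 0.05.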
When there are multiple tests within the same study or series of studies, a stylistic issue is unavoidable. As Diaconis (1985) put it, “Multiplicity is one of the most prominent difficulties with data-analytic procedures. Roughly speaking, if enough different statistics are computed, some of them will be sure to show structure” (p. 9). In other words, random patterns will seem to contain something systematic when scrutinized in many particular ways. If you look at enough boulders, there is bound to be one that looks like a sculpted human face. Knowing this, if you apply extremely strict criteria for what is to be recognized as an intentionally carved face, you might miss the whole show on Easter Island.
Table 9.5
Likelihoods of Number of Type I Errors When the Null Hypothesis Is True Given 10, 20, and 100 Tests When α = 0.05
x | n = 10: p(x) | p(at least x) | n = 20: p(x) | p(at least x) | n = 100: p(x) | p(at least x)
0 | 0.59874 | 1.00000 | 0.35849 | 1.00000 | 0.00592 | 1.00000 |
1 | 0.31512 | 0.40126 | 0.37735 | 0.64151 | 0.03116 | 0.99408 |
2 | 0.07463 | 0.08614 | 0.18868 | 0.26416 | 0.08118 | 0.96292 |
3 | 0.01048 | 0.01150 | 0.05958 | 0.07548 | 0.13958 | 0.88174 |
4 | 0.00096 | 0.00103 | 0.01333 | 0.01590 | 0.17814 | 0.74216 |
5 | 0.00006 | 0.00006 | 0.00224 | 0.00257 | 0.18002 | 0.56402 |
6 | 0.00000 | 0.00000 | 0.00030 | 0.00033 | 0.15001 | 0.38400 |
7 | 0.00000 | 0.00000 | 0.00003 | 0.00003 | 0.10603 | 0.23399 |
8 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.06487 | 0.12796 |
9 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.03490 | 0.06309 |
10 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.01672 | 0.02819 |
Table 9.6
Critical Values of Number of Type I Errors (x) Given 5–100 Tests Conducted With α = 0.05 or 0.10
Number of Tests (α = 0.05) | Critical x (p(x or more) ≤ 0.10) | Number of Tests (α = 0.10) | Critical x (p(x or more) ≤ 0.10)
5–11 | 2 | 5 | 2
12–22 | 3 | 6–11 | 3
23–36 | 4 | 12–18 | 4
37–50 | 5 | 19–25 | 5
51–64 | 6 | 26–32 | 6
65–79 | 7 | 33–40 | 7
80–95 | 8 | 41–48 | 8
96–111 | 9 | 49–56 | 9
 | | 57–64 | 10
 | | 65–72 | 11
 | | 73–80 | 12
 | | 81–88 | 13
 | | 89–97 | 14
 | | 98–105 | 15
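Each entry in Table 9.6 is the smallest number of significant results x for which, under the null hypothesis, the binomial probability of seeing x or more is at most 0.10. A sketch of that search (entries at the edges of the table's ranges can shift by one test, because the exact tail probability sometimes lands a hair above or below 0.10):

```python
from math import comb

def tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def critical_x(n, alpha, threshold=0.10):
    """Smallest x with P(x or more Type I errors) <= threshold."""
    x = 0
    while tail(x, n, alpha) > threshold:
        x += 1
    return x

for n in (5, 20, 100):
    print(f"n={n}  critical x (alpha=0.05): {critical_x(n, 0.05)}"
          f"  critical x (alpha=0.10): {critical_x(n, 0.10)}")
```

For example, with 20 tests at α = 0.05 you should not be surprised by one or two significant results, but three or more would occur by chance less than 10% of the time.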
Table 9.7
Significant Findings for 100 Tests Conducted With α = 0.05
Task | Measure | Product A | Product B | Product C | Product D | Product E |
1 | 1 | * | * | * | * | * |
1 | 2 | |||||
1 | 3 | |||||
1 | 4 | * | * | |||
2 | 1 | * | * | * | ||
2 | 2 | |||||
2 | 3 | |||||
2 | 4 | * | * | |||
3 | 1 | * | * | |||
3 | 2 | |||||
3 | 3 | |||||
3 | 4 | * | ||||
4 | 1 | * | * | |||
4 | 2 | * | ||||
4 | 3 | |||||
4 | 4 | |||||
5 | 1 | * | * | |||
5 | 2 | |||||
5 | 3 | |||||
5 | 4 | * | ||||
# Sig? | 1 | 2 | 3 | 6 | 9 |
Table 9.8
Sample Size Estimation for Review Question 2
 | Initial | Iteration 1 | Iteration 2
tα | 1.65 | 1.83 | 1.80
tβ | 1.28 | 1.38 | 1.36
tα+β | 2.93 | 3.22 | 3.16
(tα+β)² | 8.58 | 10.34 | 9.98
s² | 10 | 10 | 10
d | 3 | 3 | 3
d² | 9 | 9 | 9
df | 9 | 11 | 11
Unrounded | 9.5 | 11.5 | 11.1
Rounded up | 10 | 12 | 12
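Table 9.8 follows the usual iterative procedure: start with z values, compute n = (tα + tβ)²s²/d², then replace the z values with t critical values for the degrees of freedom implied by n, and repeat until n stabilizes. The sketch below hardcodes the critical values shown in the table (in practice a t inverse CDF, e.g., from scipy, would replace the lookup):

```python
import math

# Critical values from Table 9.8 (two-tailed t-alpha, one-tailed t-beta),
# keyed by degrees of freedom; None stands in for the initial z values.
T_CRIT = {None: (1.65, 1.28), 9: (1.83, 1.38), 11: (1.80, 1.36)}

s2, d = 10, 3  # variance estimate and minimum difference of interest
df = None
while True:
    t_a, t_b = T_CRIT[df]
    n = math.ceil((t_a + t_b) ** 2 * s2 / d ** 2)
    if df == n - 1:  # converged: the df implied by n matches the df used
        break
    df = n - 1

print(n)  # 12
```

The initial z-based estimate of 10 participants rises to 12 on the first t-based iteration and stays at 12 on the second, so the iteration stops.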
Table 9.9
Probabilities for Number of Significant Results Given 20 Tests and α = 0.05
Product | x (# sig) | P(x or more) |
A | 1 | 0.642 |
B | 2 | 0.264 |
C | 3 | 0.075 |
D | 6 | 0.0003 |
E | 9 | 0.0000002 |