Chapter 9

Six enduring controversies in measurement and statistics

Abstract

This chapter contains discussions of six enduring controversies in measurement and statistics, specifically:

Is it OK to average data from multipoint scales?
Do you need to test at least 30 users?
Should you always conduct a two-tailed test?
Can you reject the null hypothesis when p > 0.05?
Can you combine usability metrics into single scores?
What if you need to run more than one test?

Because many usability practitioners deeply depend on the use of measurement and statistics to guide their design recommendations, they inherit these controversies. In this chapter we summarize both sides of each issue and discuss what we, as pragmatic usability practitioners, recommend.

Keywords

multipoint scale
levels of measurement
Type I error
Type II error
confidence
power
two-tailed test
one-tailed test
combining measurements
multiple comparisons

Introduction

“There is, of course, nothing strange or scandalous about divisions of opinion among scientists. This is a condition for scientific progress.”

(Grove, 1989, p. 133)

“Criticism is the mother of methodology.”

(Abelson’s 8th law, 1995)

Controversy is one of the engines of scientific progress. Proponents of one point of view debate those who hold a different point of view, ideally using empirical (data-based) and rational (logic-based) arguments. When there is no clear winner, these debates can carry on over decades, or even centuries. The fields of measurement and statistics are no strangers to such debates (Abelson, 1995; Cowles, 1989; Stigler, 1986, 1999).
Because many usability practitioners deeply depend on the use of measurement and statistics to guide their design recommendations, they inherit these controversies. In earlier chapters of this book, we’ve already addressed a number of controversies, including:
Can you use statistical analysis when samples are small? (see Chapters 3–7 for numerous examples of using statistics with small sample sizes)
What is the best average to report when estimating task completion times? (Chapter 3, “The Geometric Mean”)
What is the best choice for estimating binomial confidence intervals? (Chapter 3, “Adjusted-Wald: Add Two Successes and Two Failures”)
Is it legitimate to use mid-probabilities rather than exact tests? (Chapter 4, “Mid-probabilities”)
When are t-tests robust (insensitive) to violations of the assumptions of normality and equal variance? (Chapter 5, “Normality assumption of the paired t-test” and “Assumptions of the t-tests”)
What is the best choice for analyzing 2 × 2 contingency tables? (Chapter 5, “Comparing Completion Rates, Conversion Rates, and A/B Testing”)
Does observing five users enable the discovery of 85% of usability problems? (Chapter 7, “Reconciling the ‘Magic Number Five’ with ‘Eight is Not Enough’”)
Is it possible to estimate the total number of problems available for discovery (and thus the number of still undiscovered problems) in a formative usability study? (Chapter 7, “Estimating the number of problems available for discovery and the number of undiscovered problems”)
Should usability practitioners use methods based on the binomial probability formula when planning and analyzing the results of formative user research? (Chapter 7, “Other Statistical Models for Problem Discovery”)
How many scale steps should there be in the items used in standardized usability questionnaires? (Chapter 8, “Number of scale steps”)
Is it necessary to balance positive and negative tone of the items in standardized usability questionnaires? (Chapter 8, “Does it hurt to be positive? Evidence from an alternate form of the SUS”)
In this chapter we discuss in a little more detail six enduring controversies, summarizing both sides of each issue and what we, as pragmatic user researchers, recommend. Whether you ultimately agree or disagree with our analyses and recommendations, always keep in mind Abelson’s third law (Abelson, 1995, xv): “Never flout a convention just once.” In other words, within a single study or group of related studies, you should consistently apply whatever decision you’ve made, controversial or not. Ideally, you should make and document these decisions before collecting any data to reduce the temptation to pick and choose among the alternatives to make the findings favorable to your point of view (capitalizing on chance effects). The main goal of this chapter is to provide the information needed to make those decisions. Several of the controversies involve discussions of Type I and Type II errors, so if you don’t remember what they are, be sure to review Chapter 6 (“Example 7: Where’s the Power?”, especially Fig. 6.3).

Is it OK to average data from multipoint scales?

On one hand

In 1946, S. S. Stevens declared that all numbers are not created equal. Specifically, he defined four levels of measurement:
Nominal: Numbers that are simply labels, such as the numbering of football players or model numbers.
Ordinal: Numbers that have an order, but where the differences between numbers do not necessarily correspond to the differences in the underlying attribute, such as levels of multipoint rating scales or rank order of baseball teams based on percentage of wins.
Interval: Numbers that are not only ordinal, but for which equal differences in the numbers correspond to equal differences in the underlying attribute, such as Fahrenheit or Celsius temperature scales.
Ratio: Numbers that are not only interval, but for which there is a true 0 point so equal ratios in the numbers correspond to equal ratios in the underlying attribute, such as time intervals (reaction time, task completion times) or the Kelvin temperature scale.
From these four classes of measurements, Stevens developed a rational argument that certain types of arithmetic operations were not reasonable to apply to certain types of data. Based on his “principle of invariance,” he argued against doing anything more than counting nominal and ordinal data, and restricted addition, subtraction, multiplication, and division to interval and ratio data. For example, because you need to add and divide data to compute an arithmetic mean, he stated (Stevens, 1959, pp. 26–28):

Depending upon what type of scale we have constructed, some statistics are appropriate, others not. … The criterion for the appropriateness of a statistic is invariance under the transformations permitted by the scale. … Thus, the mean is appropriate to an interval scale and also to a ratio scale (but not, of course, to an ordinal or a nominal scale).

From this perspective, strictly speaking, the multipoint scales commonly used for rating attitudes are ordinal measurements, so it would not be permissible to even compute their arithmetic means. If it’s illogical to compute means of rating scale data, then it follows that it is incorrect when analyzing ordinal or nominal data to use statistical procedures such as t-tests that depend on computing the mean. Stevens’ levels of measurement have been very influential, appearing in numerous statistics textbooks and used to guide recommendations given to users of some statistical analysis programs (Velleman and Wilkinson, 1993).

On the other hand

After the publication of Stevens’ levels of measurement, arguments against their relationship to permissible arithmetic operations and associated statistical procedures appeared. For example:

That I do not accept Stevens’ position on the relationship between strength of measurement and “permissible” statistical procedures should be evident from the kinds of data used as examples throughout this Primer: level of agreement with a questionnaire item, as measured on a 5-point scale having attached verbal labels … This is not to say, however, that the researcher may simply ignore the level of measurement provided by his or her data. It is indeed crucial for the investigator to take this factor into account in considering the kinds of theoretical statements and generalizations he or she makes on the basis of significance tests.

(Harris, 1985, pp. 326–328)

Even if one believes that there is a “real” scale for each attribute, which is either mirrored directly in a particular measure or mirrored as some monotonic transformation, an important question is, “What difference does it make if the measure does not have the same zero point or proportionally equal intervals as the ‘real’ scale?” If the scientist assumes, for example, that the scale is an interval scale when it “really” is not, something should go wrong in the daily work of the scientist. What would really go wrong? All that could go wrong would be that the scientist would make misstatements about the specific form of the relationship between the attribute and other variables. … How seriously are such misassumptions about scale properties likely to influence the reported results of scientific experiments? In psychology at the present time, the answer in most cases is “very little.”

(Nunnally, 1978, p. 28)

For analyzing ordinal data, some researchers have recommended the use of statistical methods that are similar to the well-known t- and F-tests, but which replace the original data with ranks before analysis (Bradley, 1976). These methods (e.g., the Mann–Whitney U-test, the Friedman test, or the Kruskal–Wallis test), however, involve taking the means and standard deviations of the ranks, which are ordinal—not interval or ratio—data. Despite these violations of permissible manipulation of the data from Stevens’ point of view, those methods work perfectly well.
Probably the most famous counterargument was by Lord (1953) with his parable of a retired professor who had a machine used to randomly assign football numbers to the jerseys of freshmen and sophomore football players at his university—a clear use of numbers as labels (nominal data). After assigning numbers, the freshmen complained that the assignment wasn’t random—they claimed to have received generally smaller numbers than the sophomores, and that the sophomores must have tampered with the machine. “The sophomore team was laughing at them because they had such low numbers (Fig. 9.1). The freshmen were all for routing the sophomores out of their beds one by one and throwing them in the river” (pp. 750–751).
Figure 9.1 Example of assignment of football numbers. Source: Gabe Clogston (2009), used with permission.
In a panic and to avoid the impending violence, the professor consulted with a statistician to investigate how likely it was that the freshmen got their low numbers by chance. Over the professor’s objections, the statistician determined the population mean and standard deviation of the football numbers—54.3 and 16.0, respectively. He found that the mean of the freshmen’s numbers was too low to have happened by chance, strongly indicating that the sophomores had tampered with the football number machine to get larger numbers. The famous fictional dialog between the professor and the statistician was (Lord, 1953, p. 751):

“But these numbers are not cardinal numbers,” the professor expostulated. “You can’t add them.”

“Oh, can’t I?” said the statistician. “I just did. Furthermore, after squaring each number, adding the squares, and proceeding in the usual fashion, I find the population standard deviation to be exactly 16.0.”

“But you can’t multiply ‘football numbers,’” the professor wailed. “Why, they aren’t even ordinal numbers, like test scores.”

“The numbers don’t know that,” said the statistician. “Since the numbers don’t remember where they came from, they always behave just the same way, regardless.”

And so it went on for decades, with measurement theorists generally supporting the idea that levels of measurement should influence the choice of statistical analysis methods and applied statisticians arguing against the practice. In their recap of the controversy, Velleman and Wilkinson (1993, p. 68) wrote, “At times, the debate has been less than cordial. Gaito (1980) aimed sarcastic barbs at the measurement theory camp and Townsend and Ashby (1984) fired back. Unfortunately, as Michell (1986) noted, they often shot past each other.” The debate continues into the 21st century (Scholten and Borsboom, 2009).
In Stevens’ original paper (1946, p. 679), he actually took a more moderate stance on this topic than most people realize.

On the other hand, for this ‘illegal’ statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results. While the outlawing of this procedure would probably serve no good purpose, it is proper to point out that means and standard deviations computed on an ordinal scale are in error to the extent that the successive intervals on the scale are unequal in size. When only the rank-order of data is known, we should proceed cautiously with our statistics, and especially with the conclusions we draw from them.

Responding to criticisms of the implications of his 1953 paper, Lord (1954, pp. 264–265) stated, “nominal and ordinal numbers (including test scores) may be treated by the usual arithmetic operations so as to obtain means, standard deviations, and other similar statistics from which (in certain restricted situations) correct conclusions may usefully be deduced with complete logical rigor.” He then suggested that critics of his logic agree to participate in a game based on the “football numbers” story, with the statistician paying the critic one dollar every time the statistician incorrectly designates a sample as being drawn from one of two populations of nominal two-digit numbers and the critic paying the statistician one dollar when he is right. As far as we know, no critic ever took Lord up on his offer to play this game.

Our recommendation

So, which is it—all numbers are not equal (Stevens, 1946), or the numbers don’t remember where they came from (Lord, 1953)? Given our backgrounds in applied statistics (and personal experiences attempting to act in accordance with Stevens’ reasoning that didn’t work out very well—see the sidebar), we fall firmly in the camp that supports the use of statistical techniques (such as the t-test, analysis of variance, and factor analysis) on ordinal data such as multipoint rating scales. However, you can’t just ignore the level of measurement of your data.
When you make claims about the meaning of the outcomes of your statistical tests, you do have to be careful not to act as if rating scale data are interval rather than ordinal data. An average rating of 4 might be better than an average rating of 2, and a t-test might indicate that across a group of participants, the difference is consistent enough to be statistically significant. Even so, you can’t claim that it is twice as good (a ratio claim), nor can you claim that the difference between 4 and 2 is equal to the difference between 4 and 6 (an interval claim). You can only claim that there is a consistent difference. Fortunately, even if you made the mistake of thinking one product is twice as good as another when the scale doesn’t justify it, it would be a mistake that often would not affect the practical decision of which product is better. You would still have identified the better of the two products even if the actual difference in satisfaction was more modest.

Means work better than medians when analyzing ordinal multipoint data

How acting in accordance with Stevens’ levels of measurement nearly tripped me up—from the files of Jim Lewis

In the late 1980s I was involved in a high-profile project at IBM in which we were comparing performance and satisfaction across a set of common tasks for three competitive office application suites (Lewis et al., 1990). Based on what I had learned in my college statistics classes about Stevens’ levels of measurement, I pronounced that the multipoint rating scale data we were dealing with did not meet the assumptions required to take the mean of the data for the rating scales because they were ordinal rather than interval or ratio, so we should present their central tendencies using medians rather than means. I also advised against the use of t-tests for individual comparisons of the rating scale results, promoting instead its nonparametric analog, the Mann–Whitney U-test.
The folks who started running the statistics and putting the presentation together (which would have been given to a group that included high-level IBM executives) called me in a panic after they started following my advice. In the analyses, there were cases where the medians were identical, but the U-test detected a statistically significant difference. It turns out that the U-test is sensitive not only to central tendency, but also to the shape of the distribution, and in these cases the distributions had opposite skew but overlapping medians. As a follow-up, I systematically investigated the relationship among mean and median differences for multipoint scales and the observed significance levels of t- and U-tests conducted on the same data, all taken from our fairly large-scale usability test. It turned out that the mean difference correlated more than the median difference with the observed significance levels (both parametric and nonparametric) for discrete multipoint scale data.
Consequently, I no longer promote the concepts of Stevens’ levels of measurement with regard to permissible statistical analysis, although I believe this distinction is critical when interpreting and applying results. It appears that t-tests have sufficient robustness for most usability work—especially when you can create a set of difference scores to use for the analysis. For details, see Lewis (1993).
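The pattern behind that panicked phone call is easy to reproduce with made-up rating data (illustrative values only, not the original IBM data): two groups with identical medians but opposite skew, where both the t-test and the Mann–Whitney U-test detect a consistent difference that a comparison of medians alone would hide.

```python
# Illustrative rating data (hypothetical, not the original IBM study):
# identical medians with opposite skew, yet both the t-test and the
# Mann-Whitney U-test detect a consistent difference between the groups.
import numpy as np
from scipy import stats

group_a = np.array([3] * 11 + [5] * 9)   # ratings piled at the top of the scale
group_b = np.array([1] * 9 + [3] * 11)   # ratings piled at the bottom

print(np.median(group_a), np.median(group_b))   # both medians are 3.0
print(np.mean(group_a), np.mean(group_b))       # means differ: 3.9 vs. 2.1
print(stats.mannwhitneyu(group_a, group_b, alternative="two-sided"))
print(stats.ttest_ind(group_a, group_b))        # both tests give p well below 0.05
```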

Do you need to test at least 30 users?

On one hand

Probably most of us who have taken an introductory statistics class (or know someone who took such a class) have heard the rule of thumb that to estimate or compare means, your sample size should be at least 30. According to the central limit theorem, as the sample size increases, the distribution of the mean becomes more and more normal, regardless of the normality of the underlying distribution. Some simulation studies have shown that for a wide variety of distributions (but not all—see Bradley, 1978), the distribution of the mean becomes near normal when n = 30.
Another consideration is that it is slightly simpler to use z-scores rather than t-scores because z-scores do not require the use of degrees of freedom. As shown in Table 9.1 and Fig. 9.2, by the time you have about 30 degrees of freedom the value of t gets pretty close to the value of z. Consequently, there can be a feeling that you don’t have to deal with small samples that require small-sample statistics (Cohen, 1990).

Table 9.1

Comparison of t With 30 Degrees of Freedom to z

            α = 0.10   α = 0.05   α = 0.01
t(30)       1.697      2.042      2.750
z           1.645      1.960      2.576
Difference  0.052      0.082      0.174
Percent     3.2%       4.2%       6.8%

Figure 9.2 Approach of t to z as a function of α and degrees of freedom.
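If you want to check these values yourself, the comparison takes one line per α level; the sketch below assumes Python with scipy is available and reproduces the two-tailed critical values behind Table 9.1 and Fig. 9.2.

```python
# Sketch: reproduce the two-tailed critical values of t(30) and z (cf. Table 9.1).
from scipy.stats import norm, t

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)        # two-tailed critical z
    t_crit = t.ppf(1 - alpha / 2, df=30)    # two-tailed critical t with 30 df
    pct = 100 * (t_crit - z_crit) / z_crit
    print(f"alpha={alpha:.2f}  t(30)={t_crit:.3f}  z={z_crit:.3f}  "
          f"diff={t_crit - z_crit:.3f} ({pct:.1f}%)")
```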

On the other hand

When the cost of a sample is expensive, as it typically is in many types of user research (e.g., moderated usability testing), it is important to estimate the needed sample size as accurately as possible, with the understanding that it is an estimate. The likelihood that 30 is exactly the right sample for a given set of circumstances is very low. As shown in our chapters on sample size estimation, a more appropriate approach is to take the formulas for computing the significance levels of a statistical test and, using algebra to solve for n, convert them to sample size estimation formulas. Those formulas then provide specific guidance on what you have to know or estimate for a given situation to estimate the required sample size.
The idea that even with the t-distribution (as opposed to the z-distribution) you need to have a sample size of at least 30 is inconsistent with the history of the development of the distribution. In 1899, William S. Gossett, a recent graduate of New College in Oxford with degrees in chemistry and mathematics, became one of the first scientists to join the Guinness brewery. “Compared with the giants of his day, he published very little, but his contribution is of critical importance. … The nature of the process of brewing, with its variability in temperature and ingredients, means that it is not possible to take large samples over a long run” (Cowles, 1989, p. 108–109).
This meant that Gossett could not use z-scores in his work—they just don’t work well with small samples. After analyzing the deficiencies of the z-distribution for statistical tests with small samples, he worked out the necessary adjustments as a function of degrees of freedom to produce his t tables, published under the pseudonym “Student” due to the policies of Guinness prohibiting publication by employees (Salsburg, 2001). In the work that led to the publication of the tables, Gossett performed an early version of Monte Carlo simulations (Stigler, 1999). He prepared 3000 cards labeled with physical measurements taken on criminals, shuffled them, then dealt them out into 750 groups of size 4—a sample size much smaller than 30.

Our recommendation

This controversy is similar to the “five is enough” versus “eight is not enough” argument covered in Chapter 7, but applied to summative rather than formative research. For any research, the number of users to test depends on the purpose of the test and the type of data you plan to collect. The “magic number” 30 has some empirical rationale, but in our opinion, it’s very weak. As you can see from the numerous examples in this book that have sample sizes not equal to 30 (sometimes less, sometimes more), we do not hold this rule of thumb in very high regard. As described in our sample size chapter for summative research, the appropriate sample size for a study depends on the type of distribution, the expected variability of the data, the desired levels of confidence and power, and the minimum size of the effect that you need to be able to reliably detect.
As illustrated in Fig. 9.2, when using the t-distribution with very small samples (e.g., with degrees of freedom less than 5), the very large values of t compensate for small sample sizes with regard to the control of Type I errors (claiming a difference is significant when it really is not). With sample sizes these small, your confidence intervals will be much wider than what you would get with larger samples. But once you’re dealing with more than 5 degrees of freedom, there is very little absolute difference between the value of z and the value of t. From the perspective of the approach of t to z, there is very little gain past 10 degrees of freedom.
It isn’t much more complicated to use the t-distribution than the z-distribution (you just need to be sure to use the right value for the degrees of freedom), and the reason for the development of the t-distribution was to enable the analysis of small samples. This is just one of the less obvious ways in which usability practitioners benefit from the science and practice of beer brewing. Historians of statistics widely regard Gossett’s publication of Student’s t-test as a landmark event (Box, 1984; Cowles, 1989; Stigler, 1999). In a letter to Ronald A. Fisher (one of the fathers of modern statistics) containing an early copy of the t tables, Gossett wrote, “You are probably the only man who will ever use them” (Box, 1978). Gossett got a lot of things right, but he certainly got that wrong.

Should you always conduct a two-tailed test?

On one hand

The controversy over the legitimate use of one-tailed tests began in the early 1950s (Cowles, 1989). Before then, the standard practice was to run two-tailed tests with equal rejection regions in each tail. For example, a researcher setting α to 0.05 would use z = ±1.96 as the critical values for a z-test, which corresponds to a rejection region of 0.025 in each tail (Fig. 9.3, two-tailed test). The rationale for two-sided tests was that in advance of data collection, the researcher could not be sure of the direction the results would take, so the unbiased approach was to put an equal amount of rejection region in each tail (where the rejection region is the set of test outcomes that indicate sufficient evidence to reject the null hypothesis).
Figure 9.3 Different ways to allocate probability to rejection regions.
The controversy began with the realization that many experimental hypotheses are not pure null hypotheses of no difference. Instead, there can be a directional component to the hypothesis, for example, after having fixed a number of usability problems in an early version of a product, participants should do better with the next version—higher completion rate, faster completion times, and greater satisfaction. For that test context, it seems reasonable to put all of the rejection probability in the same tail—a one-tailed test (Fig. 9.3, one-tailed test).

On the other hand

One concern over using one-tailed tests is dealing with the temptation to convert what started as a two-tailed test to a one-tailed test after the fact. Suppose you’re really not sure which product will better support users’ tasks, so you decide to run a two-tailed test with 0.025 in each tail (α = 0.05). Also, suppose your sample size is large enough that you can use a z-test, and the value of z that you get is 1.8. If you had run a one-tailed test in the right direction, you’d have a statistically significant result. If, after having run the test, you decide to treat the result like a one-tailed test, then instead of it really being a one-tailed test with α = 0.05, it’s a one-and-a-half-tailed test with α = 0.075 (Abelson, 1995)—you can’t make the 0.025 in the left tail disappear just by wishing it gone after data collection (Fig. 9.3, one-and-a-half-tailed test).
Another of the concerns with the one-tailed test is what a researcher should do if, against all expectation, the test result points strongly in the other direction. Suppose you originally set up a one-tailed test so any value of z greater than 1.65 would indicate a significant result but the result you actually get is z = −2.12. If, after having run the test, you decide to change it to a two-tailed test, you actually have another case of a one-and-a-half-tailed test with a rejection region that turns out to be 0.075 instead of the planned 0.05. Note that we’re not saying that there’s anything wrong with deciding to set α = 0.075 or even higher before running the test. The problem is that changing your mind after you’ve got the data in hand capitalizes on chance (which is not a good thing), inflating the actual value of α by 50% compared to the planned α.
A few statisticians have suggested a test strategy for directional hypotheses in which just a little bit of rejection region gets assigned to the unexpected direction—the “lopsided test” (Abelson, 1995) or “split-tailed test” (Braver, 1975; Harris, 1997). Fig. 9.3 shows the Abelson lopsided test, with a rejection region of 0.05 in the expected direction and 0.005 in the unexpected direction, for a total α = 0.055. By the way, if you really wanted to keep the total α = 0.05, you could adjust the rejection region on the right to 0.045 (z = 1.7 instead of 1.65, a relatively minor adjustment).
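The critical values behind these layouts come straight from the normal quantile function; a short sketch (assuming Python with scipy) shows where the values 1.96, 1.65, 1.7, and 2.58 come from.

```python
# Sketch: critical z values for the rejection-region layouts in Fig. 9.3.
from scipy.stats import norm

print(norm.ppf(1 - 0.025))   # 1.96: two-tailed test with 0.025 in each tail
print(norm.ppf(1 - 0.05))    # 1.64: one-tailed test with 0.05 in one tail
print(norm.ppf(1 - 0.045))   # 1.70: lopsided test with 0.045 in the expected tail
print(norm.ppf(1 - 0.005))   # 2.58: boundary for the 0.005 unexpected-direction tail
```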

Our recommendation

For user researchers, the typical practice should be to use two-tailed tests, with equal distribution of the probability of rejection to both tails unless there is a compelling a priori reason to use an unequal distribution (the “lopsided” or “split-tailed” test). The exception to this is when you’re making a comparison with a benchmark. For example, if you need to prove that it’s very likely that the completion rate for a task exceeds 85% and you fail to reach that goal with a one-sided test, it doesn’t matter if the completion rate is significantly less than 85%. Significant or not, you’ve still got work to do, so the one-tailed test is appropriate for this situation (which is more of a usability engineering than a usability research context).

Can you reject the null hypothesis when p > 0.05?

On one hand

Setting α = 0.05 provides substantial control over the likelihood of a Type I error (rejecting the null hypothesis when there is actually no difference). With this test criterion, any result with p < 0.05 is, by definition, statistically significant; all others are not. Over the long run, when the null hypothesis is true, you should make a Type I error in only one out of every 20 tests.
In the late 19th century, Francis Edgeworth, one of the first statisticians to routinely conduct tests of significance, used a very conservative α = 0.005 (Stigler, 1999). The first formal statement of judging significance with p < 0.05 dates back to Fisher in the early 20th century, although there is evidence that it had been conventional for some time (Cowles, 1989).

On the other hand

“Surely, God loves the .06 nearly as much as the .05”

(Rosnow and Rosenthal, 1989, p. 1277).

“Use common sense to extract the meaning from your data. Let the science of human factors and psychology drive the statistics; do not let statistics drive the science”

(Wickens, 1998, p. 22).

The history of setting α = 0.05 shows that it is a convention that has some empirical basis, but is still just a convention, not the result of some law of nature. The problem with a narrow focus on just the Type I error is that it takes attention away from the Type II error—emphasizing confidence over power (Baguley, 2004). For scientific publication, the p < 0.05 convention and an emphasis on the Type I error is reasonable because it’s generally less damaging to commit a Type II error (delaying the introduction of a real effect into the scientific database due to low power) than to commit a Type I error (introducing false findings into the scientific discourse). This might also be true for certain kinds of user research, but for other kinds of user research, it is possible that Type II errors might be more damaging than Type I errors, which would indicate using a different strategy for balancing the two types of error.
Wickens (1998) discussed the importance of balancing Type I and Type II errors in system development. Suppose you’ve conducted a usability study of two systems (one old, the other new), with the null hypothesis being that the new system is no better than the old. If you make a Type I error, the likely decision will be to adopt the new system when it’s really no better than the old (but also very likely no worse). If you make a Type II error, the likely decision will be to keep the old system when the new one is really better. Wickens (1998, p. 19) concluded:

From this viewpoint, the cost of each type of error to user performance and possibly to user safety should be regarded as equivalent, and not as in the classical statistics of the 0.05 level, weighted heavily to avoiding Type I errors (a 1-in-20 chance of observing the effect, given that there is no difference between the old and new system). Indeed, it seems irresponsible to do otherwise than treat the two errors equivalently. Thus, there seems no possible reason why the decision criterion should be locked at 0.05 when, with applied studies that often are destined to have relatively low statistical power, the probability of a Type II error may be considerably higher than 0.05. Instead, designers should be at the liberty to adjust their own decision criteria (trading off between the two types of statistical errors) based on the consequences of the errors to user performance.

As discussed in Chapter 6, when you’re planning a study you should have some idea of what sample size you’re going to need to provide adequate control over Type I and Type II errors for your specific situation. To estimate the sample size for a within-subjects t-test, for example, you start with the formula:

$$ n = \frac{(z_{\alpha} + z_{\beta})^2 s^2}{d^2} $$
The zα and zβ in the numerator correspond to the planned values for the Type I and Type II errors, respectively; s is the expected standard deviation, and d is the minimum size of the effect that you want to be able to detect in the study. The value for zα depends on the desired level of confidence and whether the test will be one- or two-tailed. The value for zβ depends on the desired amount of power, and is always one-tailed (Diamond, 1981). Once you add zα and zβ together, though, to paraphrase Lord (1953), that sum doesn’t remember where it came from.
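As a concrete illustration (the values of s and d below are hypothetical), a few lines of Python with scipy turn the formula into a first estimate of n; Chapter 6 describes the additional step of iterating with the t-distribution, so treat this as the initial z-based approximation rather than the full procedure.

```python
# Minimal sketch of the z-based sample size estimate for a within-subjects
# t-test: n = (z_alpha + z_beta)^2 * s^2 / d^2. The standard deviation (s)
# and minimum detectable difference (d) below are hypothetical.
import math
from scipy.stats import norm

def sample_size(alpha, power, s, d, two_tailed=True):
    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)   # power is always one-tailed
    return math.ceil((z_alpha + z_beta) ** 2 * s ** 2 / d ** 2)

print(sample_size(alpha=0.05, power=0.80, s=10, d=5))   # about 32 participants
```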
For example, suppose you take Wickens’ (1998) advice and decide to relax the Type I error to α = 0.10 and to also set the Type II error to β = 0.10 (so you have 90% confidence and 90% power). For z-scores corresponding to 0.10, the two-tailed z is about 1.65 (zα) and the one-tailed z is about 1.28 (zβ), so (zα + zβ) equals 2.93. Table 9.2 shows some of the possible combinations of zα and zβ (in addition to 1.65 + 1.28) that equal 2.93.

Table 9.2

Different Combinations of zα and zβ Summing to 2.93

zα      zβ      α       β
2.93    0.00    0.003   0.500
2.68    0.25    0.007   0.401
2.43    0.50    0.015   0.309
2.18    0.75    0.029   0.227
1.93    1.00    0.054   0.159
1.65    1.28    0.100   0.100
1.25    1.68    0.211   0.046
1.00    1.93    0.317   0.027
0.75    2.18    0.453   0.015
0.50    2.43    0.617   0.008
0.25    2.68    0.803   0.004
0.00    2.93    1.000   0.002

Note: The row with zα = 1.65 and zβ = 1.28 is the one for which alpha = beta.

The same z of 2.93 could mean that you’ve set α to 0.003, so you’re almost certain not to make a Type I error—in the long run, only about 3/1000 tests conducted when the null hypothesis is true would produce a false alarm. Unfortunately, you only have a 50–50 chance of proving that real differences exist because when α = 0.003, then β = 0.50. If you take the opposite approach of setting α to 1.0 and β to 0.002, then you’ll almost never make a Type II error (missing a real effect only 2/1000 times), but you’re guaranteed to make many, many Type I errors (false alarms). If you set α to 0.054 and β to 0.159, then you will have results that are close to the convention of setting α to 0.05 and β to 0.20 (95% confidence and 80% power—more precisely for these z-scores of 1.93 and 1.00, 94.6% confidence and 84.1% power).
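You can verify any row of Table 9.2 by splitting the composite z of 2.93 and converting each piece back into a probability; the sketch below (Python with scipy) reproduces a few of the rows discussed above.

```python
# Sketch: split a composite z of 2.93 between z_alpha (two-tailed) and
# z_beta (one-tailed) and recover the implied alpha and beta (cf. Table 9.2).
from scipy.stats import norm

total_z = 2.93
for z_alpha in (2.93, 1.93, 1.65, 1.00, 0.00):
    z_beta = total_z - z_alpha
    alpha = 2 * (1 - norm.cdf(z_alpha))   # two-tailed Type I error rate
    beta = 1 - norm.cdf(z_beta)           # one-tailed Type II error rate
    print(f"z_alpha={z_alpha:.2f}  z_beta={z_beta:.2f}  "
          f"alpha={alpha:.3f}  beta={beta:.3f}")
```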

Fisher on the convention of using α = 0.05

How Fisher recommended using p-values

When Karl Pearson was the grand old man of statistics and Ronald Fisher was a relative newcomer, Pearson, apparently threatened by Fisher’s ideas and mathematical ability, used his influence to prevent Fisher from publishing in the major statistical journals of the time, Biometrika and the Journal of the Royal Statistical Society. Consequently, Fisher published his ideas in a variety of other venues such as agricultural and meteorological journals, including several papers for the Proceedings of the Society for Psychical Research. It was in one of the papers for this latter journal that he mentioned the convention of setting what we now call the acceptable Type I error (alpha) to 0.05 (Fisher, 1929, p. 191) and, critically, also mentioned the importance of reproducibility when encountering an unexpected significant result.

An observation is judged to be significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely, all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.

Our recommendation

Unless you’re planning to submit your results to an academic journal for publication, we recommend not worrying excessively about trying to control your Type I error to 0.05. The goal of statistics is not to make the correct decision every time—that just isn’t possible. The purpose of using statistics is to help you make better decisions in the long run. In an industrial setting, that could well mean setting α to 0.10 or in some cases, even to 0.20 (in which case you’ll make a Type I error in about one out of every five tests).
The important thing is to make these decisions before you run the test. Spend some time thinking about the relative consequences of making Type I and Type II errors in your specific context, carefully choosing appropriate criteria for α and β. Then use your analysis along with the expected standard deviation (s) and critical difference (d) to estimate the sample size you’ll need to achieve the statistical goals of your study. If the sample size turns out to be unfeasible, then revisit your decisions about α, β, and d until you find a combination that will work for you (as discussed in more detail in Chapter 6).

Estimating the sample size increase needed to achieve significance

You need to consider both confidence and power—from the files of Jeff Sauro

I was working with an Internet retailer who wanted to determine which of two changes to an item-detail page would help users make better decisions about a product and ultimately lead to more purchases. We had participants attempt to understand information about a product and answer a series of 11-point scales about the experience. We recruited participants using a customer list and launched the unmoderated study. We estimated we’d need approximately 700 participants for each of the two designs (1400 total) to detect a difference of at least 0.3 of a point (a 3% difference), using an alpha of 0.05 and power of 0.80. After one week of data collection though, we had fewer than 200 responses (92 in each group)—a fraction of the number we were planning for! Collecting more data costs money and time, so we needed to know how many more samples we absolutely needed to find statistical differences—if they existed. I ran a t-test on one of the items for the 184 participants who saw either Design A or B (Table 9.3). Design A was currently leading by 0.46 of a point. The p-value of 0.15 indicated it wasn’t statistically significant (at α < 0.05), but the lower limit of the confidence interval suggested it was getting close.

Table 9.3

Findings With n = 184

Design       Mean   s      n
A            8.21   2.04   92
B            7.75   2.23   92
Difference   0.46
p-value      0.15

95% Confidence interval around the difference
Upper limit   1.08
Lower limit  −0.16

With this information, at what sample size would we expect a 0.46 point difference to be statistically significant at the designated level of alpha? If we ran the sample size calculation using the formula in Chapter 6 for two independent means, we’d find that to detect a 0.46 difference (using an alpha of 0.05, power of 0.80, and a standard deviation of 2.135—the average of 2.04 and 2.23), we would need a sample size of 340 in each group (680 in total). That’s more than three times the sample size at which we were getting close to a statistically significant difference. Furthermore, if everything stayed as it was except for increasing the sample size to 680, the p-value would not be close to 0.05—it would be a much smaller 0.005. What’s going on here?
The information in Table 9.2 provides a clue. When you do sample size estimation from scratch, you make a decision about your acceptable Type I and Type II error rates (associated with confidence and power, respectively), and then put the appropriate z-scores in the numerator of the formula \(n = (z_{\alpha} + z_{\beta})^2 s^2 / d^2\), where they are summed into a single composite value of z. Using a value higher than 0 for zβ means that you are essentially purchasing additional power. Think of it as an insurance policy that protects you in case the actual variability is greater than you expected, or the mean difference is smaller than expected. Once you have data and you’re trying to determine the minimum sample size you need to declare a statistically significant outcome exactly at the specified level of α, you need to set power to 50% because the question you’re trying to answer has changed, and “purchasing” additional power will cause you to overestimate the minimum required sample size for these conditions. With this adjustment (setting power to 50%), the recalculated sample size estimate is n = 334 (167 in each group) and, if everything stayed as it was except for increasing the sample size, the p-value would be exactly 0.05. This is a much more manageable sample size increase—one that’s achievable from sending out reminders rather than needing to pull a whole new list of customers.
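A sketch of that recalculation follows, using the z-based approximation to the two-independent-means formula (the method in Chapter 6 also iterates with the t-distribution, which adds a participant or two per group, so the printed values land just under the 340 and 167 quoted above).

```python
# Sketch of the sidebar's recalculation with the z-based approximation of the
# sample size formula for two independent means; the book's method also
# iterates with t, which adds a participant or two per group.
import math
from scipy.stats import norm

def n_per_group(alpha, power, s, d):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed
    z_beta = norm.ppf(power)            # one-tailed; 0 when power = 50%
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * s ** 2 / d ** 2)

s, d = 2.135, 0.46
print(n_per_group(0.05, 0.80, s, d))   # about 339 per group (roughly 680 in total)
print(n_per_group(0.05, 0.50, s, d))   # about 166 per group (roughly 334 in total)
```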

Can you combine usability metrics into single scores?

On one hand

Throughout the history of statistics, there has been an initial reluctance to combine measurements in any way, typically followed by empirical and theoretical work that supports the combination. For example, before the mid-17th century, astronomers would not average their observations—“the idea that accuracy could be increased by combining measurements made under different conditions was slow to come” (Stigler, 1986, p. 4).
We are now so used to the arithmetic mean that we often don’t give a second thought to computing it (and in some situations we really should). But what about combining similar measurements from different sources into a composite metric? That’s exactly what we do when we compute a stock index such as the Dow-Jones Industrial Average. We are comfortable with this type of combined score, especially given its successful use for over 100 years, but that level of comfort was not always in place. When William Stanley Jevons published analyses in which he combined the prices of different commodities into an index to study the global variation in the price of gold in the mid-19th century, he met with significant criticism (Stigler, 1986, 1999).
Stock and commodity indices at least have the common metric of price. What about the combination of different metrics—for example, the standard usability metrics of successful completion rates, completion times, and satisfaction? The statistical methods for accomplishing this task, based on the concepts of correlation and regression, appeared in the early 20th century and underwent an explosion of development in its first half (Cowles, 1989), producing principal components analysis, factor analysis, discriminant analysis, and multivariate analysis of variance (MANOVA).
Lewis (1991) used nonparametric rank-based methods to combine and analyze time-on-task, number of errors, and task-level satisfaction in summative usability tests. Conversion to ranks puts the different usability metrics on a common ordinal scale, allowing their combination through rank averaging. An important limitation of a rank-based approach is that it can only represent a relative comparison between like products with similar tasks—it does not result in a measure of usability comparable across products or different sets of tasks. More recently, Sauro and Kindlund (2005) described methods for converting different usability metrics (task completion, error counts, task times, and satisfaction scores) to z-scores—another way to get different metrics to a common scale (their Single Usability Metric, or SUM).
Sauro and Kindlund (2005) reported significant correlations among the metrics they studied. Advanced analysis (specifically, a principal components analysis) indicated that the four usability metrics contributed about equally to the composite SUM score. In 2009, Sauro and Lewis also found substantial correlations among prototypical usability metrics such as task times, completion rates, errors, post-task satisfaction, and post-study satisfaction collected during a large number of unpublished summative usability tests. According to psychometric theory, an advantage of any composite score is an increase in the reliability of measurement, with the magnitude of the increase depending on correlations among the component scores (Nunnally, 1978).

SUM: the single usability metric

Calculating SUM scores—from the files of Jeff Sauro

SUM is a standardized, summated, and single usability metric, developed to represent the majority of variation in four common usability metrics used in summative usability tests: task completion rates, task time, error counts, and satisfaction. To standardize each of the usability metrics Erika Kindlund and I created a z-score type value or z-equivalent. For the continuous data (time and average satisfaction), we subtracted the mean value from a specification limit and divided by the standard deviation. For discrete data (completion rates and errors) we divided the unacceptable conditions (defects) by all opportunities for defects—a method of standardization adapted from the process sigma metric used in Six Sigma. For more details on how to standardize and combine these scores, see Sauro and Kindlund (2005).
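A rough sketch of that standardization appears below; the data, specification limit, and opportunity counts are hypothetical, and the mapping of the defect proportion to a z-equivalent follows the process-sigma convention mentioned above (see Sauro and Kindlund, 2005, for the full procedure).

```python
# Rough sketch of the z-equivalent standardization described above; all data,
# the spec limit, and the opportunity counts are hypothetical, and the full
# SUM procedure is documented in Sauro and Kindlund (2005).
import statistics
from scipy.stats import norm

# Continuous metric (task time in seconds): z = (spec limit - mean) / sd
times = [65, 72, 80, 95, 110, 58, 77, 84]
spec_limit = 100                            # maximum acceptable task time
z_time = (spec_limit - statistics.mean(times)) / statistics.stdev(times)

# Discrete metric (errors): defects divided by opportunities for defects,
# then mapped to a z-equivalent per the process-sigma convention
defects, opportunities = 6, 8 * 5           # 8 participants x 5 error opportunities
z_errors = norm.ppf(1 - defects / opportunities)

print(round(z_time, 2), round(z_errors, 2))  # roughly 1.2 and 1.04
```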
To make it easier to work with SUM, I’ve provided free Web (Usability Scorecard) and Excel (SUM Calculator) tools (see www.measuringu.com/sum). The Usability Scorecard application takes raw usability metrics (completion, time, satisfaction, errors, and clicks) and automatically calculates confidence intervals and graphs. You can also combine any subset of the metrics into a 2-, 3-, or 4-measure score. The SUM calculator takes raw usability metrics and converts them into a SUM score with confidence intervals.
You need to provide the raw metrics on a task-by-task basis and know the opportunity for errors. SUM will automatically calculate the maximum acceptable task time, or you can provide it. This calculator is an Excel-based version of the Usability Scorecard, with the limitation that it can only combine four measures (time, errors, sat, and completion) rather than any combination, and it does not graph the results. Once you have a set of SUM scores, you can treat them statistically as you would any raw score, computing confidence intervals, comparing them against benchmarks, or comparing SUM scores from different products or tasks.

On the other hand

If the component scores do not correlate, the reliability of the composite score will not increase relative to the component scores. Hornbæk and Law (2007), based on correlational analyses of a wide range of metrics and tasks gathered from published human–computer–interaction (HCI) literature, argued that attempts to reduce usability to one measure are bound to lose important information because there is no strong correlation among usability aspects (a finding that appears to be true for the broad range of HCI metrics studied by Hornbæk & Law, but not for prototypical usability metrics—see Sauro and Lewis, 2009). Indeed, loss of information occurs whenever you combine measurements. This is one of the reasons why it is important to provide additional information such as the standard deviation or a confidence interval when reporting a mean.
The combination of data can be particularly misleading if you blindly use statistical procedures such as MANOVA or discriminant analysis to combine different types of dependent measures. These methods automatically determine how to weight the different component metrics into a combined measure in a way that maximizes the differences between levels of independent variables. This increases the likelihood of getting a statistically significant result, but runs the risk of creating composite measures that are uninterpretable with regard to any real world attribute such as usability (Lewis, 1991). More generally in psychological experimentation, Abelson has warned against the blind use of these methods (1995, pp. 127–128):

In such cases [multiple dependent variables] the investigator faces a choice of whether to present the results for each variable separately, to aggregate them in some way before analysis, or to use multivariate analysis of variance. … One of these alternatives—MANOVA—stands at the bottom of my list of options. … Technical discussion of MANOVA would carry us too far afield, but my experience with the method is that it is effortful to articulate the results. … Furthermore, when MANOVA comes out with simple results, there is almost always a way to present the same outcome with one of the simpler analytical alternatives. Manova mania is my name for the urge to use this technique.

As true as this might be for psychological research, it is even truer for usability research intended to affect the design of products or systems. If you run a test with a composite measure and find a significant difference between products, then what do you really know? You will have to follow up that test with separate tests of the component metrics, so one could reasonably argue against running the test with the composite metric, instead starting with the tests of the component metrics.

Our recommendation

Both of us, at various times in our careers, have worked on methods for combining different usability metrics into single scores (Lewis, 1991; Sauro and Kindlund, 2005)—clearly, we are on the side of combining usability metrics when it is appropriate, but using a method that produces an interpretable composite such as SUM rather than MANOVA. There are situations in the real world in which practitioners must choose only one product from a summative competitive usability test of multiple products and, in so doing, must either rely on a single measurement (a very limiting approach), must try to rationally justify some priority of the dependent measures, or must use a composite score. Composite usability scores can also be useful on executive management dashboards. Even without an increase in reliability it can still be advantageous to combine the scores for these situations, but the factor analysis of Sauro and Lewis (2009) lends statistical support to the practice of combining component usability metrics into a single score.
Any summary score (median, mean, index, or other composite) must lose important information (just as an abstract does not contain all of the information in a full paper)—it is the price paid for summarizing data. It is certainly not appropriate to rely exclusively on summary data, but it is important to keep in mind that the data that contribute to a summary score remain available as component scores for any analyses and decisions that require more detailed information (such as providing guidance about how a product or system should change in a subsequent design iteration). You don’t lose anything permanently when you combine scores—you just gain an additional view.

What if you need to run more than one test?

What if you have collected data from three groups instead of two, and want to compare Group A with B, A with C, and B with C? You’ll need to perform multiple comparisons. As Cowles (1989, p. 171) pointed out, this has been a controversial topic in statistics for decades.

In 1972 Maurice Kendall commented on how regrettable it was that during the 1940s mathematics had begun to ‘spoil’ statistics. Nowhere is this shift in emphasis from practice, with its room for intuition and pragmatism, to theory and abstraction, more evident than in the area of multiple comparison procedures. The rules for making such comparisons have been discussed ad nauseam and they continue to be discussed.

On one hand

When the null hypothesis of no difference is true, you can think of a single test with α = 0.05 as the flip of a single coin that has a 95% chance of heads (correctly failing to reject the null hypothesis) and a 5% chance of tails (falsely concluding there is a difference when there really isn’t one—a false alarm, a Type I error). These are the probabilities for a single toss of the coin (a single test), but what if you run more than one test? Statisticians sometimes make a distinction between the error rate per comparison (EC) and the error rate per family (EF, or family-wise error rate) (Myers, 1979).
For example, if you ran 20 t-tests after collecting data in a usability study and there was really no difference in the tested products, you’d expect one Type I error, falsely concluding that there was a difference when that outcome happened just by chance. Unfortunately, other possible outcomes, such as seeing two or three Type I errors, also have a reasonable likelihood of happening by chance. The technical term for this is alpha inflation. For this series of tests, the actual value of α (defining α as the likelihood of getting one or more Type I errors) is much higher than 0.05. Table 9.4 shows, as expected, that the most likely number of Type I errors in a set of 20 independent tests with α = 0.05 is one, with a point probability of 0.37735. The likelihood of at least one Type I error, however, is higher—as shown in Table 9.4, it’s 0.64151. So, rather than having a 5% chance of encountering a Type I error when there is no real difference, α has inflated to about 64%.

Table 9.4

Illustration of Alpha Inflation for 20 Tests Conducted With α = 0.05

x p(x) p(at least x)
0 0.35849 1.00000
1 0.37735 0.64151
2 0.18868 0.26416
3 0.05958 0.07548
4 0.01333 0.01590
5 0.00224 0.00257
6 0.00030 0.00033
7 0.00003 0.00003
8 0.00000 0.00000
9 0.00000 0.00000
10 0.00000 0.00000
11 0.00000 0.00000
12 0.00000 0.00000
13 0.00000 0.00000
14 0.00000 0.00000
15 0.00000 0.00000
16 0.00000 0.00000
17 0.00000 0.00000
18 0.00000 0.00000
19 0.00000 0.00000
20 0.00000 0.00000
A quick way to compute the inflation of α (defining inflation as the probability of one or more Type I errors) is to use the same formula we used in Chapter 7 to model the discovery of problems in formative user research:

$$ p(\text{at least one Type I error}) = 1 - (1 - \alpha)^n $$
where n is the number of independent tests (Winer et al., 1991). For the previous example (20 tests conducted with α = 0.05), that would be:

$$ p(\text{at least one Type I error}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} = 1 - 0.35849 = 0.64151 $$
In other words, the probability of at least one Type I error equals one minus the probability of none (see the entries for p(0) and p(at least 1) in Table 9.4).
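The same numbers drop out of a few lines of Python (assuming scipy for the binomial distribution), both from the shortcut formula and from the binomial probabilities used to build Table 9.4.

```python
# Sketch: alpha inflation for 20 independent tests at alpha = 0.05, computed
# from the shortcut 1 - (1 - alpha)^n and from the binomial distribution
# behind Table 9.4.
from scipy.stats import binom

alpha, n_tests = 0.05, 20
print(1 - (1 - alpha) ** n_tests)       # 0.64151: at least one Type I error
print(binom.pmf(1, n_tests, alpha))     # 0.37735: exactly one Type I error
print(binom.sf(0, n_tests, alpha))      # 0.64151: same "at least one" value
```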
Since the middle of the 20th century, there have been many strategies and techniques published to guide the analysis of multiple comparisons (Abelson, 1995; Cliff, 1987; Myers, 1979; Winer et al., 1991), such as omnibus tests (e.g., ANOVA and MANOVA) and procedures for the comparison of pairs of means (e.g., Tukey’s WSD and HSD procedures, the Student–Newman–Keuls test, Dunnett’s test, the Duncan procedure, the Scheffé procedure, the Bonferroni adjustment, and the Benjamini–Hochberg adjustment). A detailed discussion of all these techniques for reducing the effect of alpha inflation on statistical decision-making is beyond the scope of this book.
A popular and conceptually simple approach to controlling alpha inflation is the Bonferroni adjustment (Cliff, 1987; Myers, 1979; Winer et al., 1991). To apply the Bonferroni adjustment, divide the desired overall level of alpha by the number of tests you plan to run. For example, to run 10 tests for a family-wise error rate of 0.05, you would set α = 0.005 for each individual test. For 20 tests, it would be 0.0025 (0.05/20). Setting α = 0.0025 and running 20 independent tests would result in alpha inflation bringing the family-wise error rate to just under 0.05:

$$ p(\text{at least one Type I error}) = 1 - (1 - 0.0025)^{20} = 1 - 0.9975^{20} = 1 - 0.9512 = 0.0488 $$
A relatively new method called the Benjamini–Hochberg adjustment (Benjamini and Hochberg,  1995) offers a good balance between making the Bonferroni adjustment and no adjustment at all. Rather than using a significance threshold of α/k (where k is the number of comparisons) for all multiple comparisons (the Bonferroni approach), the Benjamini–Hochberg method produces a graduated series of significance thresholds. To use the method, take the p-values from all the comparisons and rank them from lowest to highest. Then create a new threshold for statistical significance by dividing the rank by the number of comparisons and then multiplying this by the initial significance threshold (alpha). The first threshold will always be the same as the Bonferroni threshold, and the last threshold will always be the same as the unadjusted value of α. The thresholds in between the first and last comparisons rise in equal steps from the Bonferroni to the unadjusted threshold. For a detailed example of applying the Bonferroni and Benjamini–Hochberg adjustments to multiple comparisons, see Chapter 10 (“Comparing More than Two Means”).
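To make the two adjustments concrete, the sketch below applies them to a hypothetical set of five p-values (the values are made up for illustration): Bonferroni uses the single threshold α/k, while Benjamini–Hochberg compares the i-th smallest p-value to (i/k) times α.

```python
# Sketch: Bonferroni vs. Benjamini-Hochberg thresholds for a hypothetical
# set of p-values from k = 5 comparisons at alpha = 0.05.
alpha = 0.05
p_values = sorted([0.001, 0.012, 0.020, 0.041, 0.170])   # hypothetical results
k = len(p_values)

bonferroni = [p < alpha / k for p in p_values]
bh_thresholds = [(i + 1) / k * alpha for i in range(k)]
benjamini_hochberg = [p < threshold for p, threshold in zip(p_values, bh_thresholds)]

print(bonferroni)           # [True, False, False, False, False]: only p = 0.001 passes alpha/k
print(bh_thresholds)        # thresholds rise in equal steps from alpha/k (0.01) to alpha (0.05)
print(benjamini_hochberg)   # [True, True, True, False, False]
```

With these particular values, the simple per-comparison check shown here and the formal step-up version of the Benjamini–Hochberg rule (reject every comparison up to the largest p-value that falls under its threshold) reach the same conclusions.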
Problem solved—or is it?

On the other hand

When the null hypothesis is not true, applying techniques such as Bonferroni or Benjamini–Hochberg adjustments can increase the number of Type II errors—the failure to detect differences that are really there (misses as opposed to the false alarms of Type I errors) (Abelson, 1995; Myers, 1979; Perneger, 1998; Winer et al., 1991). As illustrated in Table 9.2, an overemphasis on the prevention of Type I errors leads to the proliferation of Type II errors. Unless, for your situation, the cost of a Type I error is much greater than the cost of a Type II error, you should avoid applying any of the techniques designed to suppress alpha inflation, including Bonferroni or Benjamini–Hochberg adjustments. “Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons” (Perneger, 1998, p. 1236).

Abelson’s styles of rhetoric

Brash, stuffy, liberal, and conservative styles

In Chapter 4 of his highly regarded book Statistics as Principled Argument, Robert Abelson (1995) noted that researchers using statistics to support their claims can adopt different styles of rhetoric, of which he defined four:
Brash (unreasonable): Overstates every statistical result; specifically, always uses one-tailed tests, runs different statistical tests on the same data and selects the one that produces the most significant result, focuses on significant results when running multiple comparisons without regard to the number of comparisons, and states the actual value of p but talks around it to include results that are not quite significant according to the preselected value of alpha
Stuffy (unreasonable): Determined to never be brash under any circumstances—excessively cautious
Liberal (reasonable): Less extreme version of brash—willing to explore and speculate about data
Conservative (reasonable): Less extreme version of stuffy—more codified and cautious approach to data analysis than the liberal style
From our experience, we encourage a liberal style for most user research, but we acknowledge Abelson (1995, p. 57): “Debatable cases arise when null hypotheses are rejected according to liberal test procedures, but accepted by conservative tests. In these circumstances, reasonable people may disagree. The investigator faces an apparent dilemma: ‘Should I pronounce my results significant according to liberal criteria, risking skepticism by critical readers, or should I play it safe with conservative procedures and have nothing much to say?’” Throughout this chapter, we’ve tried to provide guidance to help user researchers resolve this apparent dilemma logically and pragmatically for their specific research situations.

Our recommendation

When there are multiple tests within the same study or series of studies, a stylistic issue is unavoidable. As Diaconis (1985) put it, “Multiplicity is one of the most prominent difficulties with data-analytic procedures. Roughly speaking, if enough different statistics are computed, some of them will be sure to show structure” (p. 9). In other words, random patterns will seem to contain something systematic when scrutinized in many particular ways. If you look at enough boulders, there is bound to be one that looks like a sculpted human face. Knowing this, if you apply extremely strict criteria for what is to be recognized as an intentionally carved face, you might miss the whole show on Easter Island.

As discussed throughout this chapter, user researchers need to balance confidence and power in their studies, avoiding excessive attention to Type I errors over Type II errors unless the relative cost of a Type I error (thinking an effect is real when it isn’t) is much greater than that of a Type II error (failing to find and act upon real effects). This general strategy applies to the treatment of multiple comparisons just as it did in the previous discussions of one- versus two-tailed testing and the legitimacy of setting α > 0.05.
For most situations, we encourage user researchers to follow Perneger’s (1998) advice to run multiple comparisons at the designated level of alpha, making sure to report what tests have been done and why. For example, in summative usability tests, most practitioners use a fairly small set of well-defined and conventional measurements (success rates, completion times, user satisfaction) collected in a carefully constructed set of test scenarios, either for purposes of estimation or for comparison with benchmarks or with a fairly small and carefully selected set of products/systems. This practice helps to legitimize multiple testing at a specified and not overly conservative level of alpha because the researchers have clear a priori hypotheses under test.
Researchers engaging in this practice should, however, keep in mind the likely number of Type I errors for the number of tests they conduct. For example, Table 9.5 shows the likelihoods of different numbers of Type I errors when the null hypothesis is true and α = 0.05. When n = 10 tests, the most likely number of Type I errors is 0 (p = 0.60), the likelihood of getting at least one Type I error is 0.40, and the likelihood of getting two or more Type I errors is less than 10% (p = 0.086). When n = 100, the most likely number of Type I errors is 5 (p = 0.18), the likelihood of getting at least one Type I error is 0.994 (virtually certain), and the likelihood of getting nine or more Type I errors is less than 10% (p = 0.06).

Table 9.5

Likelihoods of Number of Type I Errors When the Null Hypothesis Is True Given 10, 20, and 100 Tests When α = 0.05

n = 10 α = 0.05 n = 20 α = 0.05 n = 100 α = 0.05
x p(x) p(at least x) p(x) p(at least x) p(x) p(at least x)
0 0.59874 1.00000 0.35849 1.00000 0.00592 1.00000
1 0.31512 0.40126 0.37735 0.64151 0.03116 0.99408
2 0.07463 0.08614 0.18868 0.26416 0.08118 0.96292
3 0.01048 0.01150 0.05958 0.07548 0.13958 0.88174
4 0.00096 0.00103 0.01333 0.01590 0.17814 0.74216
5 0.00006 0.00006 0.00224 0.00257 0.18002 0.56402
6 0.00000 0.00000 0.00030 0.00033 0.15001 0.38400
7 0.00000 0.00000 0.00003 0.00003 0.10603 0.23399
8 0.00000 0.00000 0.00000 0.00000 0.06487 0.12796
9 0.00000 0.00000 0.00000 0.00000 0.03490 0.06309
10 0.00000 0.00000 0.00000 0.00000 0.01672 0.02819
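
The entries in Table 9.5 come directly from the binomial distribution with p = α. A minimal sketch (assuming Python with scipy available) reproduces them, along with the 2.8% figure used below for getting 10 or more significant results out of 100 tests.

# Likelihoods of x Type I errors out of n tests when the null hypothesis is
# true and each test uses alpha = 0.05 (binomial model; illustrative sketch).
from scipy.stats import binom

alpha = 0.05
for n in (10, 20, 100):
    for x in range(0, 11):
        p_exact = binom.pmf(x, n, alpha)        # p(x)
        p_at_least = binom.sf(x - 1, n, alpha)  # p(at least x)
        print(n, x, round(p_exact, 5), round(p_at_least, 5))

print(binom.sf(9, 100, alpha))  # p(10 or more out of 100) = about 0.028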

So, if you ran 100 tests (e.g., tests of your product against five competitive products with five tasks and four measurements per task) and had only five statistically significant results with p < 0.05, then you’re seeing exactly the number of expected Type I errors. You could go ahead and consider what those findings mean for your product, but you should be relatively cautious in their interpretation. On the other hand, if you had ten statistically significant results, the likelihood of that happening if there really was no difference is just 2.8%, so you could be stronger in your interpretation of those results and what they mean for your product.
This alleviates a potential problem with the Bonferroni adjustment strategy, which addresses the likelihood of getting one or more Type I errors when the null hypothesis is true. A researcher who conducts 100 tests (e.g., from a summative usability study with multiple products, measures, and tasks) and who has set α = 0.05 is not expecting one Type I error if the null hypothesis is true; the expectation is five Type I errors. Adjusting the significance criterion to hold the expected number of Type I errors to one in this situation does not seem like a logically consistent strategy. In fact, the Benjamini–Hochberg method was developed to address this inconsistency. Rather than controlling the family-wise error rate to a specified level, it controls the false discovery rate: the proportion of rejected null hypotheses that are incorrect rejections (false discoveries). For this reason, if there is a need to adjust thresholds of significance, we recommend the Benjamini–Hochberg procedure because it falls between the liberal (unadjusted) and conservative (Bonferroni) approaches.
Using the method illustrated in Table 9.5, but covering a broader range of numbers of tests and values of alpha, Table 9.6 shows, for α = 0.05 and α = 0.10, the number of Type I errors that would be unexpected (a cumulative likelihood of 10% or less for that many or more) given the number of tests, if the null hypothesis of no difference is true.

Table 9.6

Critical Values of Number of Type I Errors (x) Given 5–100 Tests Conducted With α = 0.05 or 0.10

α = 0.05                                              α = 0.10
Number of Tests   Critical x (p(x or more) ≤ 0.10)    Number of Tests   Critical x (p(x or more) ≤ 0.10)
5–11              2                                   5                 2
12–22             3                                   6–11              3
23–36             4                                   12–18             4
37–50             5                                   19–25             5
51–64             6                                   26–32             6
65–79             7                                   33–40             7
80–95             8                                   41–48             8
96–111            9                                   49–56             9
                                                      57–64             10
                                                      65–72             11
                                                      73–80             12
                                                      81–88             13
                                                      89–97             14
                                                      98–105            15
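
The critical values in Table 9.6 can be generated the same way: for a given number of tests and value of alpha, find the smallest number of significant results x for which the likelihood of x or more is 10% or less when the null hypothesis is true. A sketch, again assuming Python with scipy:

# Smallest number of significant results (x) that would be surprising
# (p(x or more) <= 0.10) if the null hypothesis were true for every test.
from scipy.stats import binom

def critical_x(n_tests, alpha=0.05, cutoff=0.10):
    x = 0
    while binom.sf(x - 1, n_tests, alpha) > cutoff:  # p(x or more) under H0
        x += 1
    return x

print(critical_x(9, alpha=0.05))    # 2, as in the medical device example below
print(critical_x(25, alpha=0.10))   # 5, as in the example following this table
print(critical_x(100, alpha=0.05))  # 9, as used in Review Question 6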

For example, suppose you’ve conducted 25 tests of significance using α = 0.10, and got four significant results. For α = 0.10 and 25 tests, the critical value of x is 5. Because 4 is less than 5, you should be relatively cautious in how you interpret and act upon the significant findings. Alternatively, suppose you had found seven significant results. Because this is greater than the critical value of 5, you can act upon these results with more confidence.
One quick note—these computations assume that the multiple tests are independent, which will rarely be the case, especially when conducting within-subjects studies or doing multiple comparisons of all pairs of products in a multiproduct study. Fortunately, according to Winer et al. (1991), dependence among tests reduces the extent of alpha inflation as a function of the degree of dependence. This means that acting as if the data are independent even when they are not is consistent with a relatively conservative approach to this aspect of data analysis. The potential complexities of accounting for dependencies among data are beyond the scope of this book, and are not necessary for most practical user research.

Multiple comparisons in the real world

A test of multiple medical devices—from the files of Jeff Sauro

I recently assisted in the comparison of competing medical devices related to diabetes. A company wanted to know if they could claim that their product was easier to use than their three competitors. They conducted a usability test in which over 90 people used each device in a counterbalanced order, then ranked the products from most to least preferred and answered multiple questions to assess the perceived ease of use and learning. Was there any evidence that their product was better?
There were three competing products requiring three within-subjects t-tests (see Chapter 5) for each of the three measures of interest, for a total of nine comparisons. Although I tend to be cautious when assessing medical devices, I recommended against using a Bonferroni or other adjustment strategy because the comparisons were both planned and sensible. The company’s only interest was in how their product’s scores compared to the competition; they really didn’t care whether competitor B was better than competitor C. What’s more, the metrics were all correlated (in general, products that are more usable are also more learnable and preferred).
The company commissioning the study paid a lot of money to have the test done. Had we included an adjustment to correct for alpha inflation, we’d have increased the chance of a Type II error, concluding there was no difference in the products when there really was. In the end there were five significant differences using α = 0.05. One of the differences showed the product of interest was the most difficult to learn—good thing we used two-tailed tests! The other four significant findings showed the product as easier to use and ranked more highly than some of the competitors. Using Table 9.6, for α = 0.05 the critical value of x for nine tests is 2, making it very likely that the five observed significant differences were due to real differences rather than alpha inflation.

Key points

There are quite a few enduring controversies in measurement and statistics that can affect user researchers and usability practitioners. They endure because there is at least a grain of truth on each side, so there is no absolute right or wrong position to take on these controversies.
It’s OK to average data from multipoint scales, but be sure to take into account their ordinal level of measurement when interpreting the results.
Rather than relying on a rule of thumb to set the sample size for a study, take the time to use the methods described in this book to determine the appropriate sample size. “Magic numbers,” whether for formative (“5 is enough”) or summative (“you need at least 30”) studies, are rarely going to be exactly right for a given study.
You should use two-tailed testing for most user research. The exception is when testing against a benchmark, in which case you should use a one-tailed test.
Just like other aspects of statistical testing, there is nothing magic about setting your Type I error (α) to 0.05 unless you plan to publish your results. For many industrial testing contexts, it will be just as important (if not more so) to control the Type II error. Before you run a study, give careful thought to the relative costs of Type I and Type II errors for that specific study then make decisions about confidence and power accordingly. Use the methods described in this book to estimate the necessary sample size and, if the sample size turns out to be unfeasible, continue trying different combinations of α, β, and d until you find one that will accomplish your goals without exceeding your resources.
Single-score (combined) usability metrics can be useful for executive dashboards or when considering multiple dependent measurements simultaneously to make a high-level go/no-go decision. They tend to be ineffective when providing specific direction about how to improve a product or system. Fortunately, the act of combining components into a single score does not result in the loss of the components—they are still available for analysis if necessary.
When running more than one test on a set of data, go ahead and conduct multiple tests with α set to whatever value you’ve deemed appropriate given your criteria for Type II errors and your sample size. If the number of significant tests is close to what you’d expect for the number of false alarms (Type I errors) if the null hypothesis is true, then proceed with caution in interpreting the findings. Otherwise, if the number of significant tests is greater than the expected number of false alarms under the null hypothesis, you can take a stronger stand, interpreting the findings as indicating real differences and using them to guide your decisions.
The proper use of statistics is to guide your judgment—not to replace it.

Chapter review questions

1. Refer back to Fig. 9.1. Assume the coach of that football team has decided to assign numbers in sequence to players as a function of chest width to ensure that players with smaller chests get single-digit numbers. The smallest player (chest width of 275 mm) got the number 1; the next smallest (chest width 287 mm) got the 2; the next (chest width 288 mm) got the 3, and so on. What is the level of measurement of those football numbers—nominal, ordinal, interval, or ratio?
2. Suppose you’re planning a within-subjects comparison of the accuracy of two dictation products, with the following criteria (similar to Example 5 in Chapter 6):
Difference score variability from a previous evaluation = 10.0
Critical difference (d) = 3.0
Desired level of confidence (two-tailed): 90% (so the initial value of zα is 1.65)
Desired power (one-tailed): 90% (so the initial value of zβ is 1.28)
Sum of desired confidence and power: zα + zβ = 2.93
What sample size should you plan for the test? If it turns out to be less than 30 (and it will), what should you say to someone who criticizes your plan by claiming “you have to have a sample size of at least 30 to be statistically significant?”
3. For the planned study in Review Question 2—would it be OK to run that as a one-tailed rather a two-tailed test? Why?
4. Once more referring to the study described in Review Question 2, how would you respond to the criticism that you have to set α = 0.05 (95% confidence) to be able to claim statistical significance?
5. If you use SUM to combine a set of usability measurements that include successful completion rates, successful completion times, and task-level satisfaction, will that SUM score be less reliable, more reliable, or have the same reliability as its component scores?
6. Suppose you’ve run a formative usability study comparing your product against five competitive products with four measures and five tasks, for a total of 100 tests, using α = 0.05, with the results shown in Table 9.7 (an asterisk indicates a significant result). Are you likely to have seen this many significant results out of 100 tests by chance if the null hypothesis is true? How would you interpret the findings by product?

Table 9.7

Significant Findings for 100 Tests Conducted With α = 0.05

Task Measure Product A Product B Product C Product D Product E
1 1 * * * * *
1 2
1 3
1 4 * *
2 1 * * *
2 2
2 3
2 4 * *
3 1 * *
3 2
3 3
3 4 *
4 1 * *
4 2 *
4 3
4 4
5 1 * *
5 2
5 3
5 4 *
# Sig? 1 2 3 6 9

Answers to chapter review questions

1. This is an ordinal assignment of numbers to football players. The player with the smallest chest will have the lowest number and the player with the largest chest will have the largest number, but there is no guarantee that the difference in chest size between the smallest and next-to-smallest player will be the same as the difference in chest size between the largest and next-to-largest player, and so on.
2. As shown in Table 9.8, you should plan for a sample size of 12. If challenged because this is less than 30, your response should acknowledge the general rule of thumb (no need to get into a fight), but point out that you’ve used a statistical process to get a more accurate estimate based on the needs of the study, the details of which you’d be happy to share with the critic. After all, no matter how good a rule of thumb might (or might not) be, when it states a single specific number, it’s very unlikely to be exactly right.
3. You can do anything that you want to do, but we would advise against a one-tailed test when you’re comparing competitive products because you really do not know in advance which one (if either) will be better. For most user research, the only time you’d use a one-tailed test is when you’re making a comparison to a benchmark.
4. You could respond to the criticism by pointing out that you’ve determined that the relative costs of Type I and Type II errors to your company are about equal, and you have no plans to publish the results in a scientific journal. Consequently, you’ve made the Type I and Type II errors equal rather than focusing primarily on the Type I error. The result is that you can be 90% confident that you will not claim a difference where one does not exist, and the test will also have 90% power to detect differences of at least 3% in the accuracy of the two products. If you hold the sample size to 12 and change α to 0.05 (increasing confidence to 95%), then the power of the test will drop from 90% to about 84% (Table 9.2).
5. It should be more reliable. Based on data we published a few years ago (Sauro and Lewis, 2009), these types of measurements are usually correlated in industrial usability studies. According to psychometric theory, composite metrics derived from correlated components will be more reliable than the individual components.
6. For the full set of 100 tests, there were 21 significant results (α = 0.05). From Table 9.6, the critical value of x (number of significant tests) for 100 tests if the null hypothesis is true is 9, so it seems very unlikely that the overall null hypothesis is true (in fact, the probability is just 0.00000002). For a study like this, the main purpose is usually to understand where a control product is in its competitive usability space, so the focus is on differences in products rather than differences in measures or tasks. For the subsets of 20 tests by product, the critical value of x is 3, so you should be relatively cautious in how you use the significant results for Products A and B, but can make stronger claims with regard to the statistically significant differences between the control product and Products C, D, and E (slightly stronger for C, much stronger for D and E). Table 9.9 shows the probabilities for these hypothetical product results.

Table 9.8

Sample Size Estimation for Review Question 2

Initial 1 2
tα 1.65 1.83 1.80
tβ 1.28 1.38 1.36
tα+β 2.93 3.22 3.16
(tα+β)² 8.58 10.34 9.98
s² 10 10 10
d 3 3 3
d² 9 9 9
df 9 11 11
Unrounded 9.5 11.5 11.1
Rounded up 10 12 12
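
The iteration shown in Table 9.8 can also be scripted. The sketch below (assuming Python with scipy; the function and variable names are ours) repeats the computation n = (tα + tβ)² s² / d², recomputing the t values from the degrees of freedom implied by the previous estimate until the recommended sample size stops changing.

# Iterative sample size estimation for a within-subjects (paired) t-test;
# the inputs match Review Question 2 (s^2 = 10, d = 3, 90% confidence, 90% power).
import math
from scipy.stats import norm, t

def paired_sample_size(var, d, conf=0.90, power=0.90):
    z_alpha = norm.ppf(1 - (1 - conf) / 2)  # two-tailed confidence
    z_beta = norm.ppf(power)                # one-tailed power
    n = math.ceil((z_alpha + z_beta) ** 2 * var / d ** 2)  # initial z-based estimate
    while True:
        df = n - 1
        t_alpha = t.ppf(1 - (1 - conf) / 2, df)
        t_beta = t.ppf(power, df)
        n_new = math.ceil((t_alpha + t_beta) ** 2 * var / d ** 2)
        if n_new == n:
            return n_new
        n = n_new

print(paired_sample_size(var=10.0, d=3.0))  # 12, matching Table 9.8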

Table 9.9

Probabilities for Number of Significant Results Given 20 Tests and α = 0.05

Product x (# sig) P(x or more)
A 1 0.642
B 2 0.264
C 3 0.075
D 6 0.0003
E 9 0.0000002

References

Abelson RP. Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum; 1995.

Baguley T. Understanding statistical power in the context of applied research. Appl. Ergon. 2004;35:73–80.

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57(1):289–300.

Box JF. Fisher, the Life of a Scientist. New York, NY: John Wiley; 1978.

Box GEP. The importance of practice in the development of statistics. Technometrics. 1984;26(1):1–8.

Bradley JV. Probability; Decision; Statistics. Englewood Cliffs, NJ: Prentice-Hall; 1976.

Bradley JV. Robustness? Br. J. Math. Stat. Psychol. 1978;31:144–152.

Braver SL. On splitting the tails unequally: a new perspective on one- versus two-tailed tests. Educ. Psychol. Meas. 1975;35:283–301.

Cliff N. Analyzing Multivariate Data. San Diego, CA: Harcourt, Brace, Jovanovich; 1987.

Cohen J. Things I have learned (so far). Am. Psychol. 1990;45(12):1304–1312.

Cowles M. Statistics in Psychology: An Historical Perspective. Hillsdale, NJ: Lawrence Erlbaum; 1989.

Diaconis P. Theories of data analysis: from magical thinking through classical statistics. In: Hoaglin DC, Mosteller F, Tukey JW, eds. Exploring Data Tables, Trends, and Shapes. New York, NY: John Wiley; 1985:1–36.

Diamond WJ. Practical Experiment Designs for Engineers and Scientists. Belmont, CA: Lifetime Learning Publications; 1981.

Fisher RA. The statistical method in psychical research. Proc. Soc. Psychical Res. 1929;39:189–192.

Gaito J. Measurement scales and statistics: resurgence of an old misconception. Psychol. Bull. 1980;87(3):564–567.

Grove JW. In Defence of Science: Science, Technology, and Politics in Modern Society. Toronto, Canada: University of Toronto Press; 1989.

Harris RJ. A Primer of Multivariate Statistics. Orlando, FL: Academic Press; 1985.

Harris RJ. Significance tests have their place. Psychol. Sci. 1997;8(1):8–11.

Hornbæk, K., Law, E., 2007. Meta-analysis of correlations among usability measures. In: Proceedings of CHI 2007, ACM, San Jose, CA pp. 617–626.

Lewis, J.R., 1991. A rank-based method for the usability comparison of competing products. In: Proceedings of the Human Factors Society 35th Annual Meeting, HFS, Santa Monica pp. 1312–1316.

Lewis JR. Multipoint scales: mean and median differences and observed significance levels. Int. J. Hum.-Comput. Interact. 1993;5(4):383–392.

Lewis, J.R., Henry, S.C., Mack, R.L., 1990. Integrated office software benchmarks: a case study. In: Diaper, D., et al. (Eds.), Proceedings of the third IFIP Conference on Human–Computer Interaction, INTERACT’90, Elsevier Science, Cambridge, UK, pp. 337–343.

Lord FM. On the statistical treatment of football numbers. Am. Psychol. 1953;8:750–751.

Lord FM. Further comment on “football numbers.” Am. Psychol. 1954;9(6):264–265.

Michell J. Measurement scales and statistics: a clash of paradigms. Psychol. Bull. 1986;100:398–407.

Myers JL. Fundamentals of Experimental Design. third ed. Boston, MA: Allyn and Bacon; 1979.

Nunnally JC. Psychometric Theory. New York, NY: McGraw-Hill; 1978.

Perneger TV. What’s wrong with Bonferroni adjustments? Br. Med. J. 1998;316:1236–1238.

Rosnow RL, Rosenthal R. Statistical procedures and the justification of knowledge in psychological science. Am. Psychol. 1989;44:1276–1284.

Salsburg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York, NY: W. H. Freeman; 2001.

Sauro, J., Kindlund, E., 2005. A method to standardize usability metrics into a single score. In: Proceedings of CHI 2005, ACM, Portland, OR, pp. 401–409.

Sauro, J., Lewis, J.R., 2009. Correlations among prototypical usability metrics: evidence for the construct of usability. In: Proceedings of CHI 2009, ACM, Boston, MA, pp. 1609–1618.

Scholten AZ, Borsboom D. A reanalysis of Lord’s statistical treatment of football numbers. J. Math. Psychol. 2009;53(2):69–75.

Stevens SS. On the theory of scales of measurement. Science. 1946;103(2684):677–680.

Stevens SS. Measurement, psychophysics, and utility. In: Churchman CW, Ratoosh P, eds. Measurement: Definitions and Theories. New York, NY: John Wiley; 1959:18–82.

Stigler SM. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press; 1986.

Stigler SM. Statistics on the Table: The History of Statistical Concepts and Methods. Cambridge, MA: Harvard University Press; 1999.

Townsend JT, Ashby FG. Measurement scales and statistics: the misconception misconceived. Psychol. Bull. 1984;96(2):394–401.

Velleman PF, Wilkinson L. Nominal, ordinal, interval, and ratio typologies are misleading. Am. Stat. 1993;47(1):65–72.

Wickens CD. Commonsense statistics. Ergon. Des. 1998;6(4):18–22.

Winer BJ, Brown DR, Michels KM. Statistical Principles in Experimental Design. third ed. New York, NY: McGraw-Hill; 1991.
