Key points

Sample size estimation is an important part of planning a user study, especially when the cost of a sample is high.
Different types of studies require different methods for sample size estimation. This chapter covers methods for user studies (such as summative usability studies) that use measurements that are continuous (such as time on task), multipoint scale (such as usability questionnaires), or discrete (such as successful task completions).
Different research goals (such as estimation of a value, comparison with a benchmark, or comparison among alternatives) require different methods for sample size estimation.
To obtain a sample size estimation formula, take the formula for the appropriate test and solve for n.
Sample size formulas for estimation of a value or comparison with benchmarks or alternatives require (a) an estimate of the expected measurement variance, (b) a decision about the required level of confidence, (c) a decision about the required power of the test, and (d) a decision about the smallest difference that is important for the test to be able to detect. Table 6.10 provides a list of the sample size formulas discussed in this chapter.

Table 6.10

List of Sample Size Formulas for Summative Testing

Type of Evaluation | Basic Formula | Notes
Estimation (nonbinary data) | n = t²s²/d² | Start by using the appropriate two-sided z-score in place of t for the desired level of confidence, then iterate to the final solution, as described in the text
Comparison with a benchmark (nonbinary data) | n = (tα + tβ)²s²/d² | Start by using the appropriate one-sided values of z in place of t for the desired levels of confidence (α) and power (β), then iterate to the final solution, as described in the text
Comparison of alternatives (nonbinary data, within-subjects) | n = (tα + tβ)²s²/d² | Start by using the appropriate values of z in place of t for the desired levels of confidence (two-sided α) and power (one-sided β), then iterate as described in the text
Comparison of alternatives (nonbinary data, between-subjects, assuming equal group sizes) | n = 2(tα + tβ)²s²/d² | Start by using the appropriate values of z in place of t for the desired levels of confidence (two-sided α) and power (one-sided β), then iterate to the final solution, as described in the text, to get the estimated sample size for each group
Estimation (binary data, large sample) | n = z²(p̂)(1 − p̂)/d² | Use for large-sample studies, or as the first step in the process for small-sample studies. For this and the remaining equations, z is the sum of zα and zβ (confidence plus power)
Estimation (binary data, small sample) | p̂adj = (np̂ + z²/2)/(n + z²) | Use as the second step in the process for small-sample studies to get the adjusted estimate of p
Estimation (binary data, small sample) | nadj = z²(p̂adj)(1 − p̂adj)/d² − z² | Use as the third step in the process for small-sample studies to get the adjusted estimate of n
Comparison of alternatives (binary data, between-subjects) | n = 2z²p(1 − p)/d² + 1/2 | Use to estimate group sizes for N − 1 chi-squared tests (independent proportions) with equal group sizes; the total sample size estimate is 2n
Comparison of alternatives (binary data, within-subjects) | n = z²(p̂12 + p̂21)/d² − z² | Use for the initial estimate of n for a McNemar exact test (matched proportions)
Comparison of alternatives (binary data, within-subjects) | p̂adj12 = (p12n + z²/8)/(n + z²/2); p̂adj21 = (p21n + z²/8)/(n + z²/2) | Use for the second step in the process of estimating n for a McNemar exact test (matched proportions)
Comparison of alternatives (binary data, within-subjects) | n = z²(p̂adj12 + p̂adj21)/dadj² − 1.5z² | Use for the third step in the process of estimating n for a McNemar exact test (matched proportions)
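
The nonbinary formulas above share one iterative procedure: start with the appropriate z-score(s) in place of t, compute an initial n, recompute t at the resulting degrees of freedom, and repeat until the estimate stops changing. The sketch below shows that loop in Python (a minimal illustration, not code from the chapter; the function name required_n and its parameters are ours, and it assumes scipy is installed):

```python
import math
from scipy import stats

def required_n(s2, d, conf=0.90, power=None,
               two_sided_alpha=True, between=False):
    """Iterate the t-based sample size formulas from Table 6.10.

    s2: estimated variance; d: smallest difference worth detecting;
    conf: confidence level; power: desired power (None for pure
    estimation, which has no power term); two_sided_alpha: False for
    one-sided benchmark tests; between: True for a two-group design
    (the return value is then the size of EACH group).
    """
    q_alpha = 1 - (1 - conf) / 2 if two_sided_alpha else conf
    mult = 2 if between else 1

    def crit_sum(df=None):
        # First pass uses z-scores (df is None); later passes use t.
        dist = stats.norm if df is None else stats.t(df)
        total = dist.ppf(q_alpha)
        if power is not None:
            total += dist.ppf(power)  # one-sided term for power
        return total

    n = math.ceil(mult * crit_sum() ** 2 * s2 / d ** 2)
    for _ in range(100):  # converges in a handful of steps
        df = 2 * (n - 1) if between else n - 1
        n_next = math.ceil(mult * crit_sum(df) ** 2 * s2 / d ** 2)
        if n_next == n:
            break
        n = n_next
    return n
```

The worked answers later in this section (Tables 6.11 through 6.15) follow exactly this loop.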

Chapter review questions

1. Assume you’ve been using a single 100-point item as a post-task measure of ease-of-use in past usability tests. One of the tasks you routinely conduct is installation. For the most recent usability study of the current version of the software package, the variability of this measurement (s2) was 25 (s = 5). You’re planning your first usability study with a new version of the software, and all you want to do is to get an estimate of this measure with 90% confidence and to be within ±2.5 points of the true value. How many participants do you need to run in the study?
2. Continuing with the previous review question, what if your research goal is to show that your result is greater than a benchmark of 75? Also assume that for this comparison you want a test with 80% power and the ability to detect differences that are at least 2.5 points above the benchmark. The estimated variability of measurement is still 25 (s = 5) and the desired confidence is still 90%. How many participants do you need to run in the study?
3. Again continuing with this example, what if you have improved the installation procedures for the new version, and want to test it against the previous version in a study where each participant performs the installation task with both the current and new versions, with the ability to detect a difference of at least 2.5 points? Assume that power and confidence remain at 80 and 90%, respectively, and that the estimated variability is still 25 (s = 5). How many participants do you need to run in the study?
4. Next, assume that the installation procedure is so time consuming that you cannot get participants to perform installation with both products, so you’ll have to have the installations done by independent groups of participants. How many participants do you need to run in the study? Assume that nothing else changes—power and confidence remain at 80 and 90%, respectively, variance is still 25, and the critical difference is still 2.5.
5. Continuing with the situation described in the previous question, suppose your resources (time and money) will only allow you to run a total of 20 participants to compare the alternative installation procedures. What can you do to reduce the estimated sample size?
6. Suppose that in addition to your subjective assessment of ease-of-use, you have also been measuring installation successes and failures using small-sample moderated usability studies. For the most recent usability study, the installation success rate was 65%. Using this as your best estimate of future success rates, what sample size do you need if you want to estimate with 90% confidence the new success rate within ±15 percentage points of the true value?
7. You’re pretty confident that your new installation process will be much more successful than the current process—in fact, you think you should have about 85% correct installation—much better than the current success rate of 65%. The current installation process is lengthy, typically taking two–three days to complete with verification of correct installation, so each participant will perform just one installation. You want to be able to detect the expected difference of 20 percentage points between the success rates with 80% confidence and 80% power, and are planning to run the same number of participants with the current and new installation procedures. How many participants (total including both groups) do you need to run?
8. For another product (Product B for “Before”), the current installation procedure is fairly short (about a half-hour), but that current process has numerous usability issues that have led to an estimated 50% failure rate on first attempts. You’ve tracked down the most serious usability issues and now have a prototype of an improved product (Product A for “After”). In a pilot study with 10 participants, you had 4 participants succeed with both products, 1 failed with both, 4 were successful with Product A but not B, and 1 was successful with Product B but not A. What are the resulting estimates for p1, p2, p12, and p21? If you want to run a larger-scale test with 95% confidence and 80% power, how many participants should you plan to run if you expect this pattern of results to stay roughly the same?

Answers to chapter review questions

1. The research problem in this exercise is to estimate a value without comparison to a benchmark or alternative. From the problem statement, the variability (s2) is 25 (s = 5) and the critical difference (d) is 2.5. This situation requires iteration to get to the final sample size estimate, starting with the z-score associated with two-sided testing and 90% confidence, which is 1.645. As shown in Table 6.11, the final sample size estimate for this study is 13 participants.

Table 6.11

Iterations for Review Question 1

Term | Initial | 1 | 2 | 3
t | 1.645 | 1.812 | 1.771 | 1.782
t² | 2.71 | 3.29 | 3.14 | 3.18
s² | 25 | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25 | 6.25
df | 10 | 13 | 12 | 12
Unrounded | 10.8 | 13.1 | 12.5 | 12.7
Rounded up | 11 | 14 | 13 | 13
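
Using the required_n sketch given after Table 6.10, this iteration reduces to one call:

```python
# Estimation only: two-sided 90% confidence, no power term.
print(required_n(s2=25, d=2.5, conf=0.90))  # -> 13
```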

2. Relative to Review Question 1, we're moving from a simple estimation problem to a comparison with a benchmark. This means we now need to consider the power of the test, and because we're testing against a benchmark, we'll use a one-sided rather than a two-sided test. Like the previous exercise, this requires iteration, starting with the sum of the one-sided z-scores for 90% confidence and 80% power, which are, respectively, 1.282 and 0.842. As shown in Table 6.12, the final sample size estimate for this study is 20 participants.

Table 6.12

Iterations for Review Question 2

Term | Initial | 1 | 2
tα | 1.282 | 1.330 | 1.328
tβ | 0.842 | 0.862 | 0.861
tα + tβ | 2.123 | 2.192 | 2.189
(tα + tβ)² | 4.51 | 4.81 | 4.79
s² | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25
df | 18 | 19 | 19
Unrounded | 18.0 | 19.2 | 19.2
Rounded up | 19 | 20 | 20
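
With the same required_n sketch, the benchmark case changes only the power and sidedness arguments:

```python
# One-sided test against a benchmark: 90% confidence plus 80% power.
print(required_n(s2=25, d=2.5, conf=0.90, power=0.80,
                 two_sided_alpha=False))  # -> 20
```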

3. Relative to Review Question 2, we're moving from comparison with a fixed benchmark to a within-subjects comparison of alternative designs, so the test should be two-sided rather than one-sided. The two-sided z-score for 90% confidence is 1.645, and the one-sided z-score for 80% power is 0.842. Table 6.13 shows the iteration process for this situation, with a final sample size estimate of 27 participants.

Table 6.13

Iterations for Review Question 3

Term | Initial | 1 | 2
tα | 1.645 | 1.711 | 1.706
tβ | 0.842 | 0.857 | 0.856
tα + tβ | 2.487 | 2.568 | 2.561
(tα + tβ)² | 6.18 | 6.59 | 6.56
s² | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25
df | 24 | 26 | 26
Unrounded | 24.7 | 26.4 | 26.2
Rounded up | 25 | 27 | 27
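
In the required_n sketch, the within-subjects comparison simply restores the two-sided confidence term:

```python
# Within-subjects comparison: two-sided 90% confidence, 80% power.
print(required_n(s2=25, d=2.5, conf=0.90, power=0.80))  # -> 27
```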

4. Relative to Review Question 3, we're moving from a within-subjects experimental design to one that is between-subjects. The iterative process therefore starts with n = 2z²s²/d² rather than n = z²s²/d² (where z is the sum of the z-scores for confidence and power, zα and zβ), essentially doubling the required sample size at that point in the process. Furthermore, the estimate is for the size of one group, so we need to double it again to get the sample size for the entire study. Table 6.14 shows the iteration process for this situation, with a final estimate of 51 participants per group, for a total sample size estimate of 102 participants.

Table 6.14

Iterations for Review Question 4

Term | Initial | 1 | 2
tα | 1.645 | 1.661 | 1.660
tβ | 0.842 | 0.842 | 0.842
tα + tβ | 2.487 | 2.503 | 2.502
(tα + tβ)² | 6.185 | 6.263 | 6.261
s² | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25
df | 98 | 100 | 100
Unrounded | 49.5 | 50.1 | 50.1
Rounded up | 50 | 51 | 51
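
In the required_n sketch, setting between=True doubles the formula and returns a per-group size:

```python
# Between-subjects comparison: n is the size of EACH group.
n_group = required_n(s2=25, d=2.5, conf=0.90, power=0.80, between=True)
print(n_group, 2 * n_group)  # -> 51 102
```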

5. Holding most conditions constant over the course of the first four review questions, we've gone from needing a sample size of 13 to simply estimate the ease-of-use score within a specified level of precision, to 20 to compare it against a benchmark, to 27 to perform a within-subjects usability test, to 102 to perform a between-subjects usability test. Clearly, the change that led to the greatest increase in the sample size estimate was the shift from a within- to a between-subjects comparison of alternatives, so one way to reduce the estimated sample size is to run within-subjects studies rather than between-subjects studies when you must compare alternatives. The other aspects of experimental design you can control are the choices of confidence level, power, and critical difference. Let's assume you were able to change your plan to a within-subjects study and, working with your stakeholders, relaxed the requirement for the critical difference (d) from 2.5 to 3.5. As shown in Table 6.15, these two changes (switching from a between- to a within-subjects design and increasing the critical difference by just one point) lead to a study design that should need only 15 participants. If the critical difference were relaxed to five points, the required sample size would be just eight participants. Note that this is only one of many ways to reduce the sample size requirement; for example, you could instead have reduced the levels of confidence and power.

Table 6.15

Iterations for Review Question 5

Term | Initial | 1 | 2
tα | 1.645 | 1.782 | 1.761
tβ | 0.842 | 0.873 | 0.868
tα + tβ | 2.487 | 2.655 | 2.629
(tα + tβ)² | 6.18 | 7.05 | 6.91
s² | 25 | 25 | 25
d | 3.5 | 3.5 | 3.5
d² | 12.25 | 12.25 | 12.25
df | 12 | 14 | 14
Unrounded | 12.6 | 14.4 | 14.1
Rounded up | 13 | 15 | 15
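
The effect of relaxing the critical difference is easy to explore with the required_n sketch:

```python
# Larger critical differences shrink the estimate quickly.
print(required_n(s2=25, d=3.5, conf=0.90, power=0.80))  # -> 15
print(required_n(s2=25, d=5.0, conf=0.90, power=0.80))  # -> 8
```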

6. For this question, the variable of interest is a binomial pass/fail measurement, so the appropriate approach is the sample size method based on the adjusted-Wald binomial confidence interval. We have the three pieces of information we need to proceed: the success rate from the previous evaluation (p) was 0.65, the critical difference (d) is 0.15, and the desired level of confidence is 90% (so the two-sided value of z is 1.645). First, compute an initial sample size with the standard Wald formula, n = z²p(1 − p)/d², which for this problem is n = (1.645²)(0.65)(0.35)/0.15² = 27.4, rounded up to 28. Next, use that initial estimate of n to compute the adjusted value of p: padj = (np + z²/2)/(n + z²) = ((28)(0.65) + 1.645²/2)/(28 + 1.645²) = 0.6368. Finally, use the adjusted value of p and the initial estimate of n to compute the adjusted estimate of n: nadj = z²(padj)(1 − padj)/d² − z² = (1.645²)(0.6368)(0.3632)/0.15² − 1.645² = 25.11, rounded up to 26. As a check, we could set the expected number of successes (x) to 0.65(26), which rounds to 17. A 90% adjusted-Wald binomial confidence interval for 17/26 has an observed p of 0.654, an adjusted p of 0.639, and a margin of error of 0.147, just a little more precise than the target precision of 0.15.
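
If you want to check this three-step computation programmatically, here is a minimal self-contained sketch (the function name binomial_estimation_n is ours; it assumes scipy is installed):

```python
import math
from scipy import stats

def binomial_estimation_n(p, d, conf=0.90):
    """Three-step adjusted-Wald sample size for estimating a binary rate.

    p: expected success rate; d: target margin of error;
    conf: two-sided confidence level (no power term for pure estimation).
    """
    z = stats.norm.ppf(1 - (1 - conf) / 2)            # two-sided z
    n0 = math.ceil(z**2 * p * (1 - p) / d**2)         # step 1: standard Wald
    p_adj = (n0 * p + z**2 / 2) / (n0 + z**2)         # step 2: adjust p
    n_adj = z**2 * p_adj * (1 - p_adj) / d**2 - z**2  # step 3: adjust n
    return math.ceil(n_adj)

print(binomial_estimation_n(p=0.65, d=0.15, conf=0.90))  # -> 26
```
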
7. Because in this problem you're planning to compare success rates between independent groups, the appropriate test is the N − 1 chi-squared test. From the conditions of the problem, we have the information needed for sample size estimation: the expected values of p1 and p2 (0.65 and 0.85, respectively, for an average p = 0.75 and d = 0.20) and the sum of the z-scores for 80% confidence (two-sided z = 1.282) and 80% power (one-sided z = 0.842), which is 2.124. Plugging these values into the appropriate sample size estimation formula, we get n = 2(2.124²)(0.75)(0.25)/0.2² + 0.5 = 42.8, which rounds up to 43 participants per group, for a total of 86 participants. This is beyond the scope of most moderated usability tests. Relaxing the power to 50% (so its associated z-score would be 0, making the total value of z 1.282) would reduce the estimate of n per group to 16 (a total sample size of 32).
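
The same arithmetic as a short, self-contained sketch (variable names are ours; assumes scipy):

```python
import math
from scipy import stats

# Group size for an N - 1 chi-squared test of two independent proportions.
p1, p2 = 0.65, 0.85
p_bar = (p1 + p2) / 2                    # average proportion, 0.75
d = abs(p2 - p1)                         # difference to detect, 0.20
z = stats.norm.ppf(0.90) + stats.norm.ppf(0.80)  # two-sided 80% conf + 80% power
n_group = math.ceil(2 * z**2 * p_bar * (1 - p_bar) / d**2 + 0.5)
print(n_group, 2 * n_group)              # -> 43 86
```
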
8. The appropriate statistical test for this type of study is the McNemar Exact Test (or, equivalently, a confidence interval using the adjusted-Wald method for matched proportions). From the pilot study, the estimates for the different key proportions are p1 = 0.8, p2 = 0.5, p12 = 0.4, and p21 = 0.1, so d = 0.3. Using the three-step process, first compute an initial estimate of n with the standard Wald formula, using z = 2.8 (the sum of 1.96 for two-tailed 95% confidence and 0.84 for one-tailed 80% power):

n = 2.8²(0.1 + 0.4)/0.3² − 2.8² = 35.7

Rounded up, this initial estimate is 36. Next, compute the adjustments:

p̂adj12 = (0.4(36) + 2.8²/8)/(36 + 2.8²/2) = 0.385271

p̂adj21 = (0.1(36) + 2.8²/8)/(36 + 2.8²/2) = 0.114729

dadj = 0.385271 − 0.114729 = 0.270541

Then compute the final sample size estimate, which, after rounding up, is 42:

n = 2.8²(0.114729 + 0.385271)/0.270541² − 1.5(2.8²) = 41.8

You can check this estimate by computing a confidence interval to see whether it includes or excludes 0. Because the power of the test is 80%, you need to compute an equivalent confidence that combines the nominal power and confidence of the test (see the sidebar "Equivalent Confidence"). The composite z for this problem is 2.8, so the equivalent confidence to use for a two-sided confidence interval is 99.4915%. The closest integer values for a, b, c, and d are, respectively, 17, 17, 4, and 4, for the following values:

p1: 34/42 = 0.81; p2: 21/42 = 0.50; p12: 17/42 = 0.405; p21: 4/42 = 0.095

The resulting confidence interval ranges from −0.55 to −0.015—close to but not including 0.

Using an n of 40, the expected values of p1, p2, p12, and p21 are exactly 0.8, 0.5, 0.4, and 0.1, respectively, and the confidence interval ranges from −0.549 to 0.0025, just barely including 0. The bounds of these confidence intervals support the sample size estimate of 42, but if samples were expensive, 40 would probably be adequate.
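
Finally, the three-step McNemar calculation can be reproduced with a minimal sketch (the function name mcnemar_n is ours; assumes scipy):

```python
import math
from scipy import stats

def mcnemar_n(p12, p21, conf=0.95, power=0.80):
    """Three-step adjusted-Wald sample size for matched proportions
    (McNemar exact test). p12 and p21 are the discordant proportions."""
    # Composite z: two-sided confidence plus one-sided power.
    z = stats.norm.ppf(1 - (1 - conf) / 2) + stats.norm.ppf(power)
    d = abs(p12 - p21)
    n0 = math.ceil(z**2 * (p12 + p21) / d**2 - z**2)   # step 1: initial n
    # Step 2: adjust the discordant proportions and their difference.
    p12_adj = (p12 * n0 + z**2 / 8) / (n0 + z**2 / 2)
    p21_adj = (p21 * n0 + z**2 / 8) / (n0 + z**2 / 2)
    d_adj = abs(p12_adj - p21_adj)
    # Step 3: final adjusted estimate of n.
    n_adj = z**2 * (p12_adj + p21_adj) / d_adj**2 - 1.5 * z**2
    return math.ceil(n_adj)

print(mcnemar_n(p12=0.4, p21=0.1))  # -> 42
```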

