Key points

Sample size estimation is an important part of planning a user study, especially when the cost of a sample is high.
Different types of studies require different methods for sample size estimation. This chapter covers methods for user studies (such as summative usability studies) that use measurements that are continuous (such as time on task), multipoint scale (such as usability questionnaires), or discrete (such as successful task completions).
Different research goals (such as estimation of a value, comparison with a benchmark, or comparison among alternatives) require different methods for sample size estimation.
To obtain a sample size estimation formula, take the formula for the appropriate test and solve for n.
Sample size formulas for estimation of a value or comparison with benchmarks or alternatives require (a) an estimate of the expected measurement variance, (b) a decision about the required level of confidence, (c) a decision about the required power of the test, and (d) a decision about the smallest difference that is important for the test to be able to detect. Table 6.10 provides a list of the sample size formulas discussed in this chapter.

Table 6.10

List of Sample Size Formulas for Summative Testing

Type of Evaluation | Basic Formula | Notes
Estimation (nonbinary data) | n = t²s²/d² | Start by using the appropriate two-sided z-score in place of t for the desired level of confidence, then iterate to the final solution, as described in the text
Comparison with a benchmark (nonbinary data) | n = (tα + tβ)²s²/d² | Start by using the appropriate one-sided values of z in place of t for the desired levels of confidence (α) and power (β), then iterate to the final solution, as described in the text
Comparison of alternatives (nonbinary data, within-subjects) | n = (tα + tβ)²s²/d² | Start by using the appropriate values of z in place of t for the desired levels of confidence (two-sided α) and power (one-sided β), then iterate as described in the text
Comparison of alternatives (nonbinary data, between-subjects, assuming equal group sizes) | n = 2(tα + tβ)²s²/d² | Start by using the appropriate values of z in place of t for the desired levels of confidence (two-sided α) and power (one-sided β), then iterate to the final solution, as described in the text, to get the estimated sample size for each group
Estimation (binary data, large sample) | n = z²(p̂)(1 − p̂)/d² | Use for large-sample studies, or as the first step in the process for small-sample studies. For this and the remaining equations, z is the sum of zα and zβ (confidence plus power)
Estimation (binary data, small sample) | p̂adj = (np̂ + z²/2)/(n + z²) | Use as the second step in the process for small-sample studies to get the adjusted estimate of p
Estimation (binary data, small sample) | nadj = z²(p̂adj)(1 − p̂adj)/d² − z² | Use as the third step in the process for small-sample studies to get the adjusted estimate of n
Comparison of alternatives (binary data, between-subjects) | n = 2z²p(1 − p)/d² + 1/2 | Use to estimate group sizes for N − 1 chi-squared tests (independent proportions) with equal group sizes; the total sample size estimate is 2n
Comparison of alternatives (binary data, within-subjects) | n = z²(p̂12 + p̂21)/d² − z² | Use for the initial estimate of n for a McNemar exact test (matched proportions)
Comparison of alternatives (binary data, within-subjects) | p̂adj12 = (p12n + z²/8)/(n + z²/2); p̂adj21 = (p21n + z²/8)/(n + z²/2) | Use for the second step in the process of estimating n for a McNemar exact test (matched proportions)
Comparison of alternatives (binary data, within-subjects) | n = z²(p̂adj12 + p̂adj21)/dadj² − 1.5z² | Use for the third step in the process of estimating n for a McNemar exact test (matched proportions)
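
The nonbinary formulas above share one iterative procedure: start with the appropriate z-score(s) in place of t, compute an initial n, recompute t at the resulting degrees of freedom, and repeat until the estimate stops changing. The sketch below shows that loop in Python (a minimal illustration, not code from the chapter; the function name required_n and its parameters are ours, and it assumes scipy is installed):

```python
import math
from scipy import stats

def required_n(s2, d, conf=0.90, power=None,
               two_sided_alpha=True, between=False):
    """Iterate the t-based sample size formulas from Table 6.10.

    s2: estimated variance; d: smallest difference worth detecting;
    conf: confidence level; power: desired power (None for pure
    estimation, which has no power term); two_sided_alpha: False for
    one-sided benchmark tests; between: True for a two-group design
    (the return value is then the size of EACH group).
    """
    q_alpha = 1 - (1 - conf) / 2 if two_sided_alpha else conf
    mult = 2 if between else 1

    def crit_sum(df=None):
        # First pass uses z-scores (df is None); later passes use t.
        dist = stats.norm if df is None else stats.t(df)
        total = dist.ppf(q_alpha)
        if power is not None:
            total += dist.ppf(power)  # one-sided term for power
        return total

    n = math.ceil(mult * crit_sum() ** 2 * s2 / d ** 2)
    for _ in range(100):  # converges in a handful of steps
        df = 2 * (n - 1) if between else n - 1
        n_next = math.ceil(mult * crit_sum(df) ** 2 * s2 / d ** 2)
        if n_next == n:
            break
        n = n_next
    return n
```

The worked answers later in this section (Tables 6.11 through 6.15) follow exactly this loop.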

Chapter review questions

1. Assume you’ve been using a single 100-point item as a post-task measure of ease-of-use in past usability tests. One of the tasks you routinely conduct is installation. For the most recent usability study of the current version of the software package, the variability of this measurement (s2) was 25 (s = 5). You’re planning your first usability study with a new version of the software, and all you want to do is to get an estimate of this measure with 90% confidence and to be within ±2.5 points of the true value. How many participants do you need to run in the study?
2. Continuing with the previous review question, what if your research goal is to show that your result is greater than a benchmark of 75? Also assume that for this comparison you want a test with 80% power and the ability to detect differences that are at least 2.5 points above the benchmark. The estimated variability of measurement is still 25 (s = 5) and the desired confidence is still 90%. How many participants do you need to run in the study?
3. Again continuing with this example, what if you have improved the installation procedures for the new version, and want to test it against the previous version in a study where each participant performs the installation task with both the current and new versions, with the ability to detect a difference of at least 2.5 points? Assume that power and confidence remain at 80 and 90%, respectively, and that the estimated variability is still 25 (s = 5). How many participants do you need to run in the study?
4. Next, assume that the installation procedure is so time consuming that you cannot get participants to perform installation with both products, so you’ll have to have the installations done by independent groups of participants. How many participants do you need to run in the study? Assume that nothing else changes—power and confidence remain at 80 and 90%, respectively, variance is still 25, and the critical difference is still 2.5.
5. Continuing with the situation described in the previous question, suppose your resources (time and money) will only allow you to run a total of 20 participants to compare the alternative installation procedures. What can you do to reduce the estimated sample size?
6. Suppose that in addition to your subjective assessment of ease-of-use, you have also been measuring installation successes and failures using small-sample moderated usability studies. For the most recent usability study, the installation success rate was 65%. Using this as your best estimate of future success rates, what sample size do you need if you want to estimate with 90% confidence the new success rate within ±15 percentage points of the true value?
7. You’re pretty confident that your new installation process will be much more successful than the current process—in fact, you think you should have about 85% correct installation—much better than the current success rate of 65%. The current installation process is lengthy, typically taking two–three days to complete with verification of correct installation, so each participant will perform just one installation. You want to be able to detect the expected difference of 20 percentage points between the success rates with 80% confidence and 80% power, and are planning to run the same number of participants with the current and new installation procedures. How many participants (total including both groups) do you need to run?
8. For another product (Product B for “Before”), the current installation procedure is fairly short (about a half-hour), but that current process has numerous usability issues that have led to an estimated 50% failure rate on first attempts. You’ve tracked down the most serious usability issues and now have a prototype of an improved product (Product A for “After”). In a pilot study with 10 participants, you had 4 participants succeed with both products, 1 failed with both, 4 were successful with Product A but not B, and 1 was successful with Product B but not A. What are the resulting estimates for p1, p2, p12, and p21? If you want to run a larger-scale test with 95% confidence and 80% power, how many participants should you plan to run if you expect this pattern of results to stay roughly the same?

Answers to chapter review questions

1. The research problem in this exercise is to estimate a value without comparison to a benchmark or alternative. From the problem statement, the variability (s2) is 25 (s = 5) and the critical difference (d) is 2.5. This situation requires iteration to get to the final sample size estimate, starting with the z-score associated with two-sided testing and 90% confidence, which is 1.645. As shown in Table 6.11, the final sample size estimate for this study is 13 participants.

Table 6.11

Iterations for Review Question 1

Term | Initial | 1 | 2 | 3
t | 1.645 | 1.812 | 1.771 | 1.782
t² | 2.71 | 3.29 | 3.14 | 3.18
s² | 25 | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25 | 6.25
df | 10 | 13 | 12 | 12
Unrounded | 10.8 | 13.1 | 12.5 | 12.7
Rounded up | 11 | 14 | 13 | 13
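
Using the required_n sketch given after Table 6.10, this iteration reduces to one call:

```python
# Estimation only: two-sided 90% confidence, no power term.
print(required_n(s2=25, d=2.5, conf=0.90))  # -> 13
```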

2. Relative to Review Question 1, we're moving from a simple estimation problem to a comparison with a benchmark. This means we now need to consider the power of the test, and because we're testing against a benchmark, we'll use a one-sided rather than a two-sided test. Like the previous exercise, this requires iteration, starting with the sum of the one-sided z-scores for 90% confidence and 80% power, which are, respectively, 1.282 and 0.842. As shown in Table 6.12, the final sample size estimate for this study is 20 participants.

Table 6.12

Iterations for Review Question 2

Term | Initial | 1 | 2
tα | 1.282 | 1.330 | 1.328
tβ | 0.842 | 0.862 | 0.861
tα + tβ | 2.123 | 2.192 | 2.189
(tα + tβ)² | 4.51 | 4.81 | 4.79
s² | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25
df | 18 | 19 | 19
Unrounded | 18.0 | 19.2 | 19.2
Rounded up | 19 | 20 | 20
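
With the same required_n sketch, the benchmark case changes only the power and sidedness arguments:

```python
# One-sided test against a benchmark: 90% confidence plus 80% power.
print(required_n(s2=25, d=2.5, conf=0.90, power=0.80,
                 two_sided_alpha=False))  # -> 20
```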

3. Relative to Review Question 2, we're moving from comparison with a fixed benchmark to a within-subjects comparison of alternative designs, so the test should be two-sided rather than one-sided. The two-sided z-score for 90% confidence is 1.645, and the one-sided z-score for 80% power is 0.842. Table 6.13 shows the iteration process for this situation, with a final sample size estimate of 27 participants.

Table 6.13

Iterations for Review Question 3

Term | Initial | 1 | 2
tα | 1.645 | 1.711 | 1.706
tβ | 0.842 | 0.857 | 0.856
tα + tβ | 2.487 | 2.568 | 2.561
(tα + tβ)² | 6.18 | 6.59 | 6.56
s² | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25
df | 24 | 26 | 26
Unrounded | 24.7 | 26.4 | 26.2
Rounded up | 25 | 27 | 27
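
In the required_n sketch, the within-subjects comparison simply restores the two-sided confidence term:

```python
# Within-subjects comparison: two-sided 90% confidence, 80% power.
print(required_n(s2=25, d=2.5, conf=0.90, power=0.80))  # -> 27
```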

4. Relative to Review Question 3, we're moving from a within-subjects experimental design to one that is between-subjects. The iterative process therefore starts with n = 2z²s²/d² rather than n = z²s²/d² (where z is the sum of the z-scores for confidence and power, zα and zβ), essentially doubling the required sample size at that point in the process. Furthermore, the estimate is for the size of one group, so we need to double it again to get the sample size for the entire study. Table 6.14 shows the iteration process for this situation, with a final estimate of 51 participants per group, for a total sample size estimate of 102 participants.

Table 6.14

Iterations for Review Question 4

Term | Initial | 1 | 2
tα | 1.645 | 1.661 | 1.660
tβ | 0.842 | 0.842 | 0.842
tα + tβ | 2.487 | 2.503 | 2.502
(tα + tβ)² | 6.185 | 6.263 | 6.261
s² | 25 | 25 | 25
d | 2.5 | 2.5 | 2.5
d² | 6.25 | 6.25 | 6.25
df | 98 | 100 | 100
Unrounded | 49.5 | 50.1 | 50.1
Rounded up | 50 | 51 | 51
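
In the required_n sketch, setting between=True doubles the formula and returns a per-group size:

```python
# Between-subjects comparison: n is the size of EACH group.
n_group = required_n(s2=25, d=2.5, conf=0.90, power=0.80, between=True)
print(n_group, 2 * n_group)  # -> 51 102
```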

5. Holding most conditions constant over the course of the first four review questions, we've gone from needing a sample size of 13 to simply estimate the ease-of-use score within a specified level of precision, to 20 to compare it against a benchmark, to 27 to perform a within-subjects usability test, to 102 to perform a between-subjects usability test. Clearly, the change that led to the greatest increase in the sample size estimate was the shift from a within- to a between-subjects comparison of alternatives, so one way to reduce the estimated sample size is to run within-subjects studies rather than between-subjects studies when you must compare alternatives. The other aspects of experimental design you can control are the choices of confidence level, power, and critical difference. Let's assume you were able to change your plan to a within-subjects study and, working with your stakeholders, relaxed the requirement for the critical difference (d) from 2.5 to 3.5. As shown in Table 6.15, these two changes (switching from a between- to a within-subjects design and increasing the critical difference by just one point) lead to a study design that should need only 15 participants. If the critical difference were relaxed to five points, the required sample size would be just eight participants. Note that this is only one of many ways to reduce the sample size requirement; for example, you could instead have reduced the levels of confidence and power.

Table 6.15

Iterations for Review Question 5

Term | Initial | 1 | 2
tα | 1.645 | 1.782 | 1.761
tβ | 0.842 | 0.873 | 0.868
tα + tβ | 2.487 | 2.655 | 2.629
(tα + tβ)² | 6.18 | 7.05 | 6.91
s² | 25 | 25 | 25
d | 3.5 | 3.5 | 3.5
d² | 12.25 | 12.25 | 12.25
df | 12 | 14 | 14
Unrounded | 12.6 | 14.4 | 14.1
Rounded up | 13 | 15 | 15
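
The effect of relaxing the critical difference is easy to explore with the required_n sketch:

```python
# Larger critical differences shrink the estimate quickly.
print(required_n(s2=25, d=3.5, conf=0.90, power=0.80))  # -> 15
print(required_n(s2=25, d=5.0, conf=0.90, power=0.80))  # -> 8
```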

6. For this question, the variable of interest is a binomial pass/fail measurement, so the appropriate approach is the sample size method based on the adjusted-Wald binomial confidence interval. We have the three pieces of information we need to proceed: the success rate from the previous evaluation (p) was 0.65, the critical difference (d) is 0.15, and the desired level of confidence is 90% (so the two-sided value of z is 1.645). First, compute an initial sample size with the standard Wald formula, n = z²p(1 − p)/d², which for this problem is n = (1.645²)(0.65)(0.35)/0.15² = 27.4, rounded up to 28. Next, use that initial estimate of n to compute the adjusted value of p: padj = (np + z²/2)/(n + z²) = ((28)(0.65) + 1.645²/2)/(28 + 1.645²) = 0.6368. Finally, use the adjusted value of p and the initial estimate of n to compute the adjusted estimate of n: nadj = z²(padj)(1 − padj)/d² − z² = (1.645²)(0.6368)(0.3632)/0.15² − 1.645² = 25.11, rounded up to 26. As a check, we could set the expected number of successes (x) to 0.65(26), which rounds to 17. A 90% adjusted-Wald binomial confidence interval for 17/26 has an observed p of 0.654, an adjusted p of 0.639, and a margin of error of 0.147, just a little more precise than the target precision of 0.15.
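
If you want to check this three-step computation programmatically, here is a minimal self-contained sketch (the function name binomial_estimation_n is ours; it assumes scipy is installed):

```python
import math
from scipy import stats

def binomial_estimation_n(p, d, conf=0.90):
    """Three-step adjusted-Wald sample size for estimating a binary rate.

    p: expected success rate; d: target margin of error;
    conf: two-sided confidence level (no power term for pure estimation).
    """
    z = stats.norm.ppf(1 - (1 - conf) / 2)            # two-sided z
    n0 = math.ceil(z**2 * p * (1 - p) / d**2)         # step 1: standard Wald
    p_adj = (n0 * p + z**2 / 2) / (n0 + z**2)         # step 2: adjust p
    n_adj = z**2 * p_adj * (1 - p_adj) / d**2 - z**2  # step 3: adjust n
    return math.ceil(n_adj)

print(binomial_estimation_n(p=0.65, d=0.15, conf=0.90))  # -> 26
```
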
7. Because in this problem you're planning to compare success rates between independent groups, the appropriate test is the N − 1 chi-squared test. From the conditions of the problem, we have the information needed for sample size estimation: the expected values of p1 and p2 (0.65 and 0.85, respectively, for an average p = 0.75 and d = 0.20) and the sum of the z-scores for 80% confidence (two-sided z = 1.282) and 80% power (one-sided z = 0.842), which is 2.124. Plugging these values into the appropriate sample size estimation formula, we get n = 2(2.124²)(0.75)(0.25)/0.2² + 0.5 = 42.8, which rounds up to 43 participants per group, for a total of 86 participants. This is beyond the scope of most moderated usability tests. Relaxing the power to 50% (so its associated z-score would be 0, making the total value of z 1.282) would reduce the estimate of n per group to 16 (a total sample size of 32).
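
The same arithmetic as a short, self-contained sketch (variable names are ours; assumes scipy):

```python
import math
from scipy import stats

# Group size for an N - 1 chi-squared test of two independent proportions.
p1, p2 = 0.65, 0.85
p_bar = (p1 + p2) / 2                    # average proportion, 0.75
d = abs(p2 - p1)                         # difference to detect, 0.20
z = stats.norm.ppf(0.90) + stats.norm.ppf(0.80)  # two-sided 80% conf + 80% power
n_group = math.ceil(2 * z**2 * p_bar * (1 - p_bar) / d**2 + 0.5)
print(n_group, 2 * n_group)              # -> 43 86
```
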
8. The appropriate statistical test for this type of study is the McNemar Exact Test (or, equivalently, a confidence interval using the adjusted-Wald method for matched proportions). From the pilot study, the estimates for the different key proportions are p1 = 0.8, p2 = 0.5, p12 = 0.4, and p21 = 0.1, so d = 0.3. Using the three-step process, first compute an initial estimate of n with the standard Wald formula, using z = 2.8 (the sum of 1.96 for two-tailed 95% confidence and 0.84 for one-tailed 80% power):

n = 2.8²(0.1 + 0.4)/0.3² − 2.8² = 35.7

Rounded up, this initial estimate is 36. Next, compute the adjustments:

p̂adj12 = (0.4(36) + 2.8²/8)/(36 + 2.8²/2) = 0.385271

p̂adj21 = (0.1(36) + 2.8²/8)/(36 + 2.8²/2) = 0.114729

dadj = 0.385271 − 0.114729 = 0.270541

Then compute the final sample size estimate, which, after rounding up, is 42:

n = 2.8²(0.114729 + 0.385271)/0.270541² − 1.5(2.8²) = 41.8

You can check this estimate by computing a confidence interval to see whether it includes or excludes 0. Because the power of the test is 80%, you need to compute an equivalent confidence that combines the nominal power and confidence of the test (see the sidebar "Equivalent Confidence"). The composite z for this problem is 2.8, so the equivalent confidence to use for a two-sided confidence interval is 99.4915%. The closest integer values for a, b, c, and d are, respectively, 17, 17, 4, and 4, for the following values:

p1: 34/42 = 0.81; p2: 21/42 = 0.50; p12: 17/42 = 0.405; p21: 4/42 = 0.095

The resulting confidence interval ranges from −0.55 to −0.015—close to but not including 0.

Using an n of 40, the expected values of p1, p2, p12, and p21 are exactly 0.8, 0.5, 0.4, and 0.1, respectively, and the confidence interval ranges from −0.549 to 0.0025, just barely including 0. The bounds of these confidence intervals support the sample size estimate of 42, but if samples were expensive, 40 would probably be adequate.
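
Finally, the three-step McNemar calculation can be reproduced with a minimal sketch (the function name mcnemar_n is ours; assumes scipy):

```python
import math
from scipy import stats

def mcnemar_n(p12, p21, conf=0.95, power=0.80):
    """Three-step adjusted-Wald sample size for matched proportions
    (McNemar exact test). p12 and p21 are the discordant proportions."""
    # Composite z: two-sided confidence plus one-sided power.
    z = stats.norm.ppf(1 - (1 - conf) / 2) + stats.norm.ppf(power)
    d = abs(p12 - p21)
    n0 = math.ceil(z**2 * (p12 + p21) / d**2 - z**2)   # step 1: initial n
    # Step 2: adjust the discordant proportions and their difference.
    p12_adj = (p12 * n0 + z**2 / 8) / (n0 + z**2 / 2)
    p21_adj = (p21 * n0 + z**2 / 8) / (n0 + z**2 / 2)
    d_adj = abs(p12_adj - p21_adj)
    # Step 3: final adjusted estimate of n.
    n_adj = z**2 * (p12_adj + p21_adj) / d_adj**2 - 1.5 * z**2
    return math.ceil(n_adj)

print(mcnemar_n(p12=0.4, p21=0.1))  # -> 42
```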

