Chapter 5

Is there a statistical difference between designs?

Abstract

The goal of this chapter is to cover methods for determining if a difference is statistically significant and how large or small of a difference likely exists in the untested population. It is important to account for chance differences when comparing two designs or products. To do this, you need to find a p-value from the appropriate statistical test. To understand the likely range of the difference between designs or products, you should compute a confidence interval around the difference. To determine which statistical test you need to use, you need to identify whether your outcome measure is binary or continuous and whether you have the same users in each group (within-subjects) or a different set of users (between-subjects).

Keywords

p-value
confidence interval
statistical test
binary measure
continuous measure
within-subjects
between-subjects
two-sample t-test
paired t-test
chi-squared test
N−1 chi-squared test
Fisher exact test
two-proportion test
N−1 two-proportion test
McNemar exact test
adjusted-Wald binomial confidence interval
normality assumption
completion rates
conversion rates
A/B testing
Yates correction

Introduction

Many researchers first realize the need for statistics when they have to compare two designs or products in an A/B test or competitive analysis. When stakes are high (or subject to scrutiny) just providing descriptive statistics and declaring one design better is insufficient. What is needed is to determine whether the difference between designs (such as between conversion rates, task times, or ratings) is greater than what we’d expect from chance. This chapter is all about determining whether a difference is statistically significant and how large or small of a difference likely exists in the untested population.

Comparing two means (rating scales and task times)

A central theme in this book is to understand the role of chance in our calculations. When we can’t measure every user to compute a mean likelihood to recommend or a median task time, we have to estimate these averages from a sample.
Just because a sample of users from Product A has a higher average System Usability Scale (SUS) score than a sample from Product B does not mean the average SUS score for all users is higher on Product A than Product B. Chance plays a role in every sample selection and we need to account for that when comparing means. See Sauro (2011a) for more detail on using the SUS for comparing interface usability.
To determine whether SUS scores, Net Promoter Scores, task times, or any two means from continuous variables are significantly different (such as comparing different versions of the same product over time or against a competitive product), you first need to identify whether the same users were used in each test (within-subjects design) or whether there was a different set of users tested on each product (between-subjects design).

Within-subjects comparison (paired t-test)

When the same users are in each test group you have removed a major source of variation between your sets of data. In such tests you should alternate which product users encounter first to minimize carry-over effects. If all users encounter Product A first, this runs the risk of unfairly biasing users—either for or against Product A. The advantages are that you can attribute differences in measurements to differences between products, and you can detect smaller differences with the same sample size.
To determine whether there is a significant difference between means of continuous or rating scale measurements, use the following formula:

$$t = \frac{\bar{D}}{s_D / \sqrt{n}}$$
where $\bar{D}$ is the mean of the difference scores,
$s_D$ is the standard deviation of the difference scores,
n is the sample size (the total number of users), and
t is the test statistic (look up the two-sided p-value using the t-distribution based on the sample size). See Technical Note 1.

Example 1: Comparing two System Usability Scale (SUS) means

For example, in a test between two expense-reporting applications, 26 users worked (in random order) with two web applications (A and B). They performed several tasks on both systems and then completed the 10-item SUS questionnaire, with the results shown in Table 5.1 (subtracting the score for B from the score for A to get the difference score).

Table 5.1

Pairs of SUS Scores and Their Differences for Example 1

User A B Difference
1 77.5 60 17.5
2 90 62.5 27.5
3 80 45 35
4 77.5 20 57.5
5 100 80 20
6 95 42.5 52.5
7 82.5 32.5 50
8 97.5 80 17.5
9 80 52.5 27.5
10 87.5 60 27.5
11 77.5 42.5 35
12 87.5 87.5 0
13 82.5 52.5 30
14 50 10 40
15 77.5 67.5 10
16 82.5 40 42.5
17 80 57.5 22.5
18 65 32.5 32.5
19 72.5 67.5 5
20 85 47.5 37.5
21 80 45 35
22 100 62.5 37.5
23 80 40 40
24 57.5 45 12.5
25 97.5 65 32.5
26 95 72.5 22.5
Mean 82.2 52.7 29.5

Product A had a mean SUS score of 82.2 and Product B had a mean SUS score of 52.7. The mean of the difference scores was 29.5 with a standard deviation of 14.125. Plugging these values in the formula, we get

$$t = \frac{29.5}{14.125 / \sqrt{26}}$$

$$t = 10.649$$
We have a test statistic (t) equal to 10.649. To determine whether this is significant, we need to look up the p-value using a t-table, the Excel function =TDIST(), or the calculator available at http://www.usablestats.com/calcs/tdist.
The degrees of freedom for this type of test are equal to n − 1, so we have 25 degrees of freedom (26 − 1). Because this is a two-sided test (see Technical Note 1), the p-value is =TDIST(10.649,25,2) = .0000000001. Because this value is so small, a difference this large would be exceedingly unlikely (less than a one in a billion chance) if the population mean SUS scores were really equal. Put another way, we can be over 99.999% sure products A and B have different SUS scores. Product A’s SUS score of 82.2 is statistically significantly higher than Product B’s of 52.7, so we can conclude users perceive Product A as easier to use.
Technical Note 1: We’re using the two-sided area (instead of the one-sided area that was used in comparing a mean to a benchmark in Chapter 4) because we want to see whether the difference between SUS means is equal to 0, which is a two-sided research question. It is tempting to look at the results, see that Product A had a higher mean, and then use just a one-sided test. Although it wouldn’t matter in this example, it can happen that the one-sided test generates a significant p-value but the corresponding two-sided p-value is not significant. Waiting until after the test has been conducted to decide whether to use a one- or two-sided test improperly capitalizes on chance. We strongly recommend sticking with the two-sided test when comparing two means (also see Chapter 9, “Should You Always Conduct a Two-Tailed Test?”).

Confidence interval around the difference

With any comparison we also want to know the size of the difference (often referred to as the effect size). The p-value we get from conducting the paired t-test tells us only that the difference is significant. A significant difference could mean just a one-point difference in SUS scores, which would not be of much practical importance. As sample sizes get large (above 100), as is common in remote unmoderated testing, it becomes more likely to see a statistically significant difference when the actual effect size is not practically significant. The confidence interval around the difference helps us distinguish between trivial (albeit statistically significant) differences and differences users would likely notice.
To generate a confidence interval around the difference scores to understand the likely range of the true difference between products, use the following formula:

$$\bar{D} \pm t_{\alpha}\frac{s_D}{\sqrt{n}}$$
where $\bar{D}$ is the mean of the difference scores (as was used in computing the test statistic),
n is the sample size (the total number of users),
$s_D$ is the standard deviation of the difference scores (also used in computing the test statistic), and
$t_{\alpha}$ is the critical value from the t-distribution for n − 1 degrees of freedom and the specified level of confidence.
For a 95% confidence interval and sample size of 26 (25 degrees of freedom), the critical value is 2.06. See http://www.usablestats.com/calcs/tdist to obtain critical values from the t-distribution, or in Excel use =TINV(0.05,25).
Plugging in the values we get

$$29.5 \pm 2.06\frac{14.125}{\sqrt{26}}$$

$$29.5 \pm 5.705$$
We can be 95% confident the actual difference between product SUS scores is between 23.8 and 35.2.
Practical significance
The difference is statistically significant, but is it practically significant? The answer depends on how we interpret the lowest and highest plausible differences. Even the lowest estimate of the difference, 23.8 points, puts Product A about 45% higher than Product B. It also helps to know something about SUS scores. A difference of 23.8 points crosses a substantial range of products and places Product A’s perceived usability much higher than Product B’s relative to hundreds of other product scores (Sauro, 2011a; also see Chapter 8, Tables 8.5 and 8.7). Given this information, it seems reasonable to conclude that users would notice the difference in usability and that the difference is both statistically and practically meaningful.
Technical Note 2: For the confidence interval formula we use the convention that $t_{\alpha}$ represents a two-sided confidence level. Many, but not all, statistics books use the convention $t_{1-\alpha/2}$, which is based on a table of values that is one-sided. We find this approach more confusing because in most cases you’ll be working with two-sided rather than one-sided confidence intervals. It is also inconsistent with the Excel TINV function, which is a very convenient way to get desired values of t when computing confidence intervals.

Comparing task times

In earlier chapters we saw how task times have a strong positive skew from some users taking a long time to complete a task. This skew makes confidence intervals (Chapter 3) and tests against benchmarks (Chapter 4) less accurate. In those situations we applied a log transformation to the raw times to improve the accuracy of the results. When analyzing difference scores, however, the two-tailed paired t-test is widely considered robust to violations of normality, especially when the skew in the data takes the same shape in both samples (Agresti and Franklin, 2007; Box, 1953; Howell, 2002). Although sample mean task times will differ from their population median, we can still accurately tell whether the difference between means is greater than what we’d expect from chance alone using the paired t-test, so there is no need to complicate this test with a transformation.

Example 2: Comparing two task times

In the same test of two accounting systems used in Example 1, task times were also collected. One task asked users to create an expense report. Of the 26 users who attempted the task, 21 completed it successfully on both products. These 21 task times and their difference scores appear in Table 5.2. Users who failed the task on one of the products appear with only a single time and are not included in the calculation.

Table 5.2

Pairs of Completion Times and Their Differences for Example 2

User A B Difference
1 223
2 140
3 178 184 −6
4 145 195 −50
5 256
6 148 210 −62
7 222 299 −77
8 141 148 −7
9 149 184 −35
10 150
11 133 229 −96
12 160
13 117 200 −83
14 292 549 −257
15 127 235 −108
16 151 210 −59
17 127 218 −91
18 211 196 15
19 106 162 −56
20 121 176 −55
21 146 269 −123
22 135 336 −201
23 111 167 −56
24 116 203 −87
25 187 247 −60
26 120 174 −54
Mean 158 228 −77

The mean difference score is −77 s and the standard deviation of the difference scores is 61 s. Plugging these values in the formula we get

$$t = \frac{\bar{D}}{s_D / \sqrt{n}}$$

$$t = \frac{-77}{61 / \sqrt{21}}$$

$$t = -5.78$$
We have a test statistic (t) equal to −5.78 with 20 (n − 1) degrees of freedom and, as decided prior to running the study, a two-sided test. To determine whether this is significant we need to look up the p-value using a t-table, the Excel function =TDIST(), or the calculator available at http://www.usablestats.com/calcs/tdist. Using =TDIST(5.78,20,2), we find p = 0.00001, so there is strong evidence to conclude that users take less time to complete an expense report on Product A. If you follow the steps from the previous example, you’ll find that the 95% confidence interval for this difference ranges from about 49 to 104 s—a difference that users are likely to notice.
In this example, the test statistic is negative because we subtracted the typically longer task time (from Product B) from the shorter task time (Product A). We would get the same p-value if we subtracted the smaller time from the larger time, changing the sign of the test statistic. When using the Excel TDIST function, keep in mind that it only works with positive values of t.

Normality assumption of the paired t-test

As we’ve seen with the paired t-test formula, the computations are performed on the difference scores. We are therefore working with only one sample of data, which means the paired t-test is really just the one-sample t-test from Chapter 4 with a different name.
The paired t-test therefore has the same normality assumption as the one-sample t-test. For large sample sizes (above 30) normality isn’t a concern because the sampling distribution of the mean is normally distributed (see Chapter 9). For smaller sample sizes (less than 30) and for two-tailed tests, the one-sample t-test/paired t-test is considered robust against violations of the normality assumption. That is, data can be non-normal (as with task-time data) but still generate accurate p-values (Box, 1953) when using a paired t-test.

Between-subjects comparison (two-sample t-test)

When a different set of users is tested on each product there is variation both between users and between designs. Any difference between the means (e.g., questionnaire data, task times) must be tested to see whether it is greater than the variation between the different users.
To determine whether there is a significant difference between means of independent samples of users, we use the two-sample t-test (also called t-test on independent means). It uses the following formula:

$$t = \frac{\hat{x}_1 - \hat{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
where $\hat{x}_1$ and $\hat{x}_2$ are the means from samples 1 and 2,
$s_1$ and $s_2$ are the standard deviations from samples 1 and 2,
$n_1$ and $n_2$ are the sample sizes from samples 1 and 2, and
t is the test statistic (look up the two-sided p-value using the t-distribution based on the sample size).

Example 1: Comparing two SUS scores

For example, in a test between two CRM applications, the following SUS scores were obtained after 11 users attempted tasks on Product A and 12 different users attempted the same tasks on Product B for a total of 23 different users tested (Table 5.3).

Table 5.3

Data for Comparison of SUS Scores from Independent Groups

A B
50 50
45 52.5
57.5 52.5
47.5 50
52.5 52.5
57.5 47.5
52.5 50
50 50
52.5 50
55 40
47.5 42.5
(none) 57.5
Mean 51.6 49.6

Product A had a mean SUS score of 51.6 (sd = 4.07, n = 11) and Product B had a mean SUS score of 49.6 (sd = 4.63, n = 12); the 57.5 in the final data row belongs to Product B, which had one more participant than Product A. Plugging these values in the formula we get

$$t = \frac{51.6 - 49.6}{\sqrt{\frac{4.07^2}{11} + \frac{4.63^2}{12}}}$$

$$t = 1.102$$
The observed difference in SUS scores generates a test statistic (t) equal to 1.102. To determine whether this is significant, we need to look up the p-value using a t-table, the Excel function =TDIST, or the calculator available at http://www.usablestats.com/calcs/tdist.
We have 20 degrees of freedom (see the sidebar on “Degrees of Freedom for the Two-sample t-test”) and want the two-sided area, so the p-value is =TDIST(1.102,20,2) = 0.2835. Because this value is rather large (and well above 0.05 or 0.10) we can’t conclude that the difference is greater than chance. A p-value of 0.2835 means that if there really were no difference between products, we’d expect to see a difference of two points or more about 28% of the time. Put another way, we can be only about 71.65% sure that products A and B have different SUS scores—a level of certainty that is better than 50–50 but that falls well below the usual criterion for claiming a significant difference. Product A’s SUS score of 51.6, while higher, is not statistically distinguishable from Product B’s score of 49.6 at this sample size.
If we had to pick one product, there’s more evidence that Product A has a higher SUS score, but in reality it could be that the two are indistinguishable in the minds of users or, less likely, that users think Product B is more usable. In most applied research settings, having only 71.65% confidence that the products are different is not sufficient evidence for a critical decision.
With time and budget to collect more data, you can use the estimates of the standard deviation and the observed difference to compute the sample size needed to detect a two-point difference in SUS scores (see Chapter 6). Given a sample standard deviation of 4.1 and a difference of two points (95% confidence and 80% power), you’d need a sample size of 136 (68 in each group) to reliably detect a difference this small.

Degrees of freedom for the two-sample t-test

It’s a little more complicated than the one-sample test, but that’s what computers are for

It’s simple to calculate the degrees of freedom for a one-sample t-test—just subtract 1 from the sample size (n – 1). There’s also a simple formula for computing degrees of freedom for a two-sample t-test, which appears in many statistics books—add the independent sample sizes together and subtract 2 (n1 + n2 – 2).
Instead of using that simple method for the two-sample t-test, in this book we use a modification called the Welch–Satterthwaite procedure (Satterthwaite, 1946; Welch, 1938). It provides accurate results even if the variances are unequal (equal variances being one of the assumptions of the two-sample t-test) by adjusting the number of degrees of freedom using the following formula:

$$df' = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}$$
where $s_1$ and $s_2$ are the standard deviations of the two groups, and $n_1$ and $n_2$ are the groups’ sample sizes.
For fractional results, round the degrees of freedom (df) down to the nearest integer. For the data in Table 5.3, the computation of the degrees of freedom is

$$df' = \frac{\left(\frac{4.07^2}{11} + \frac{4.63^2}{12}\right)^2}{\frac{\left(\frac{4.07^2}{11}\right)^2}{11 - 1} + \frac{\left(\frac{4.63^2}{12}\right)^2}{12 - 1}} = \frac{10.8}{0.52} = 20.8, \text{ which rounds down to } 20$$
The computations are a bit tedious to do by hand, but most software packages compute it automatically, and it’s fairly easy to set up in Excel. If, for some reason, you don’t have access to a computer and the variances are approximately equal, you can use the simpler formula (n1 + n2 − 2). If the variances are markedly different (e.g., the ratio of the standard deviations is greater than 2), as a conservative shortcut you can subtract 2 from the smaller of the two sample sizes.

Confidence interval around the difference

With any comparison, we also want to know the size of the difference (the effect size). The p-value we get from conducting the two-sample t-test only tells us that a significant difference exists. For example, a significant difference could mean just a one-point difference in SUS scores (which would not be of much practical importance) or a 20-point difference, which would be meaningful.
There are several ways to report an effect size, but for practical work, the most compelling and easiest-to-understand is the confidence interval. We can use the following formula to generate a confidence interval around the difference scores to understand the likely range of the true difference between products:

$$(\hat{x}_1 - \hat{x}_2) \pm t_{\alpha}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where $\hat{x}_1$ and $\hat{x}_2$ are the means from samples 1 and 2,
$s_1$ and $s_2$ are the standard deviations from samples 1 and 2,
$n_1$ and $n_2$ are the sample sizes from samples 1 and 2, and
$t_{\alpha}$ is the critical value from the t-distribution for a specified level of confidence and degrees of freedom. For a 95% confidence interval and 20 degrees of freedom, the critical value is 2.086. See http://www.usablestats.com/calcs/tdist for obtaining critical values from the t-distribution.
Plugging in the values we get

$$(51.6 - 49.6) \pm 2.086\sqrt{\frac{4.07^2}{11} + \frac{4.63^2}{12}}$$

$$2.0 \pm 3.8$$
So, we can be 95% confident that the actual difference between product SUS scores is between −1.8 and 5.8. Because the interval crosses zero, we can’t be 95% sure that a difference exists; as stated previously, we’re only 71.65% sure. Although Product A appears to be a little better than Product B, the confidence interval tells us that there is still a modest chance that Product B has a higher SUS score (by as much as 1.8 points).

Example 2: Comparing two task times

Twenty users were asked to add a contact to a CRM application. Eleven users completed the task on the existing version and nine different users completed the same task on the new enhanced version. Is there compelling evidence to conclude that there has been a reduction in the mean time to complete the task? The raw values (in seconds) appear in Table 5.4.

Table 5.4

Data for Comparison of Task Times from Independent Groups

Old New
18 12
44 35
35 21
78 9
38 2
18 10
16 5
22 38
40 30
77
20
The mean task time for the 11 users of the old version was 37 s with a standard deviation of 22.4 s. The mean task time for the nine users of the new version was 18 s with a standard deviation of 13.4 s. Plugging in the values we get

$$t = \frac{\hat{x}_1 - \hat{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

$$t = \frac{37 - 18}{\sqrt{\frac{22.4^2}{11} + \frac{13.4^2}{9}}}$$

$$t = 2.33$$
The observed difference in mean times generates a test statistic (t) equal to 2.33. To determine whether this is significant we need to find the p-value using a t-table, the Excel function =TDIST, or the calculator available at http://www.usablestats.com/calcs/tdist.
We have 16 degrees of freedom (see the sidebar on “Degrees of Freedom for the Two-sample t-test”) and want the two-sided area, so the p-value is =TDIST(2.33,16,2) = 0.033. Because this value is rather small (less than 0.05) there is reasonable evidence that the two task times are different. We can conclude users take less time with the new design. From this sample we can estimate the likely range of the difference between mean times by generating a confidence interval. For a 95% confidence interval with 16 degrees of freedom, the critical value of t is 2.12. See http://www.usablestats.com/calcs/tdist for obtaining critical values from the t-distribution.
Plugging the values into the formula we get

$$(\hat{x}_1 - \hat{x}_2) \pm t_{\alpha}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

$$(37 - 18) \pm 2.12\sqrt{\frac{22.4^2}{11} + \frac{13.4^2}{9}}$$

$$19 \pm 17.2$$
We can be 95% confident the difference in mean times is between about 2 and 36 s.

Assumptions of the t-tests

The two-sample t-test has four assumptions:
1. Both samples are representative of their parent populations (representativeness).
2. The two samples are unrelated to each other (independence).
3. Both samples are approximately normally distributed (normality).
4. The variances in both groups are approximately equal (homogeneity of variances).
As with all statistical procedures, the first assumption is the most important. The p-values, confidence intervals, and conclusions are only valid if the sample of users is representative of the population about which you are making inferences. In user research this means having the right users attempt the right tasks on the right interface.
Meeting the second assumption is usually not a problem in user research as the values from one participant are unlikely to affect the responses of another. The latter two assumptions, however, can cause some consternation and are worth discussing.

Normality

Like the one-sample t-test, the paired t-test, and most parametric statistical tests, the two-sample t-test has an underlying assumption of normality. Specifically, this test assumes that the sampling distribution of the mean differences (not the distribution of the raw scores) is approximately normally distributed. When this distribution of mean differences is not normal, the p-values can be off by some amount. For large samples (above 30 for all but the most extreme distributions) the normality assumption isn’t an issue because the sampling distribution of the mean is normally distributed according to the Central Limit Theorem (see Chapter 9).
Fortunately, even for small sample sizes (less than 30), the t-test generates reliable results when the data are not normally distributed; Box (1953) showed that a typical amount of error is a manageable 2%. For example, if you generate a p-value of 0.02, the long-term actual probability might be 0.04. This is especially the case when the sample sizes in both groups are equal, so, if possible, you should plan for equal sample sizes in each group, even though you might end up with uneven sample sizes.

Equality of variances

The third assumption is that the variances (and equivalently the standard deviations) are approximately equal in both groups. As a general rule, you should only be concerned about unequal variances when the ratio between the two standard deviations is greater than 2 (e.g., a standard deviation of 4 in one sample and 12 in the other is a ratio of 3) (Agresti and Franklin, 2007). The robustness of the two-sample t-test also extends to violations of this assumption, especially when the sample sizes are roughly equal (Agresti and Franklin, 2007; Box, 1953; Howell, 2002). For a method of adjusting degrees of freedom to help compensate for unequal variances, see the sidebar: “Degrees of Freedom for the Two-sample t-test.”

Don’t worry too much about violating assumptions (except representativeness)

Now that we’ve covered the assumptions for the two-sample t-test, we want to reassure you that you shouldn’t concern yourself with them too much for most practical work—except of course representativeness. No amount of statistical manipulations can overcome the problem of measuring the wrong users performing the wrong tasks.
We’ve provided the detail on the other assumptions here so you can be aware that they exist. You might have encountered warnings about non-normal data and heterogeneous variances in statistics books, or from colleagues critical of the use of t-tests with typical continuous or rating-scale usability metrics. It is our opinion that the two-sample t-test, especially when used with two-sided probabilities and (near) equal sample sizes, is a workhorse that will generate accurate results for statistical comparisons in user research. It is, however, always a good idea to examine your data, ideally graphically to look for outliers or unusual observations that could have arisen from coding errors or errors users made while responding. These types of data quality errors can have a real effect on your results—an effect that properly conducted statistics cannot fix.

Comparing completion rates, conversion rates, and A/B testing

A binary response variable takes on only two values: yes/no, convert/didn’t convert, purchased/didn’t purchase, completed the task/failed the task, and so on. These are coded into values of 1 and 0, respectively. Even continuous measures can be degraded into binary measures: proportion of users taking less than a minute to complete a task, proportion of responses scoring 9 or 10 on an 11-point scale. These types of binary measures appear extensively in user research.
As with the continuous method for comparing task times and satisfaction scores, we need to consider whether the two samples being compared have different users in each group (between-subjects) or use the same people (within-subjects).

Between-subjects

Comparing the two outcomes of binary variables for two independent groups happens to be one of the most frequently computed procedures in applied statistics. Surprisingly, there is little agreement on the best statistical test for this situation. For large sample sizes, the chi-square test is typically recommended. For small sample sizes, the Fisher exact test (also called the Fisher–Irwin test) is typically recommended. However, there is disagreement on what constitutes a “small” or “large” sample size and what version of these tests to use. A recent survey of medical and general statistics textbooks by Campbell (2007) found that only 2 of 14 books agreed on what procedure to recommend for comparing two independent binary outcomes.
The latest research suggests that a slight adjustment to the standard chi-square test, and equivalently to the two-proportion test, generates the best results for almost all sample sizes. The adjustment is simply subtracting 1 from the total sample size and using it in the standard chi-square or two-proportion test formulas (shown later in this chapter). Because there is so much debate on this topic we spend the next few pages describing the alternatives which you are likely to encounter (or were taught) and then present the recommended N−1 chi-square test and N−1 two-proportion test. You can skip to the N−1 chi-square section if you have no interest in understanding the alternative formulas and their drawbacks.

Chi-square test of independence

One of the oldest methods and the one typically taught in introductory statistics books is the chi-square test. Karl Pearson, who also developed the most widely used correlation coefficient, proposed the chi-square test in 1900 (Pearson, 1900).
It uses an intuitive concept of comparing the observed counts in each group with what you would expect from chance. The chi-square test makes no assumptions about the parent population in each group, so it is a distribution-free nonparametric test. It uses a 2 × 2 table (pronounced two by two) with the nomenclature shown in Table 5.5.

Table 5.5

Nomenclature for Chi-Square Tests of Independence

Pass Fail Total
Design A a b m
Design B c d n
Total r s N

To conduct a chi-square test, compare the result of the following formula to the chi-square distribution with 1 degree of freedom.

$$\chi^2 = \frac{(ad - bc)^2 N}{mnrs}$$

Degrees of freedom for chi-square tests

For a 2 × 2 table, it’s always 1

The general formula for calculating the degrees of freedom for a chi-square test of independence is to multiply one less than the number of rows by one less than the number of columns.

$$df = (r - 1)(c - 1)$$
In a 2 × 2 table, there are two rows and two columns, so all chi-square tests conducted on these types of tables have 1 degree of freedom.
For example, if 40 out of 60 (67%) users complete a task on Design A, can we conclude it is statistically different from Design B where 15 out of 35 (43%) users passed? Setting this up in Table 5.6 and filling the values in the formula, we get

Table 5.6

Data for Chi-Square Test of Independence

Pass Fail Total
Design A 40 20 60
Design B 15 20 35
Total 55 40 95

$$\chi^2 = \frac{(40 \times 20 - 20 \times 15)^2 \times 95}{60 \times 35 \times 55 \times 40}$$

$$\chi^2 = 5.1406$$
We use a table of chi-square values or the Excel function =CHIDIST(5.1406, 1) and get a p-value of 0.0234. Because this value is low, we conclude the completion rates are statistically different: Design A’s completion rate is statistically higher than Design B’s.

Small sample sizes

The chi-square test tends to generate accurate results for large sample sizes, but is not recommended when sample sizes are small. As mentioned earlier, both what constitutes a small sample size and what alternative procedure to use is the subject of continued research and debate.
The most common sample size guideline is to use the chi-square test when the expected cell counts are greater than 5 (Cochran, 1952, 1954). This rule appears in most introductory statistics texts despite being somewhat arbitrary (Campbell, 2007). The expected counts are different from the observed cell counts; they are computed by multiplying the row and column totals for each cell and then dividing by the total sample size. From the aforementioned example, this generates the following expected cell counts:

$$\frac{r \times m}{N} = \frac{55 \times 60}{95} = 34.74$$

$$\frac{s \times m}{N} = \frac{40 \times 60}{95} = 25.26$$

$$\frac{r \times n}{N} = \frac{55 \times 35}{95} = 20.26$$

$$\frac{s \times n}{N} = \frac{40 \times 35}{95} = 14.74$$
The minimum expected cell count for the data in the example is 14.74, which is greater than 5, so according to the common sample size guideline the standard chi-square test is appropriate.
Here is another example comparing conversion rates on two designs with a total sample size of 22 and some expected cell counts less than 5. The cell nomenclature appears in parentheses in Table 5.7. Eleven out of 12 users (92%) completed the task on Design A; 5 out of 10 (50%) completed it on Design B.

Table 5.7

Conversion Rates for Two Designs

Pass Fail Total
Design A 11 (a) 1 (b) 12 (m)
Design B 5 (c) 5 (d) 10 (n)
Total 16 (r) 6 (s) 22 (N)

Filling in these values we get

$$\chi^2 = \frac{(ad - bc)^2 N}{mnrs}$$

$$\chi^2 = \frac{(11 \times 5 - 1 \times 5)^2 \times 22}{12 \times 10 \times 16 \times 6}$$

$$\chi^2 = 4.7743$$
Looking up this value in a chi-square table or using the Excel function =CHIDIST(4.7743, 1) we get the p-value of 0.0288, so we conclude there is a statistically significant difference between conversion rates for these designs.
However, in examining the expected cell frequencies we see that two are less than 5.

$$\frac{r \times m}{N} = \frac{16 \times 12}{22} = 8.73$$

$$\frac{s \times m}{N} = \frac{6 \times 12}{22} = 3.27$$

$$\frac{r \times n}{N} = \frac{16 \times 10}{22} = 7.27$$

$$\frac{s \times n}{N} = \frac{6 \times 10}{22} = 2.73$$
With low expected cell counts, most statistics textbooks warn against using the chi-square test and instead recommend either the Fisher exact test (aka the Fisher–Irwin test) or the chi-square test with the Yates correction. Before covering those alternative methods, however, we should mention the two-proportion test.

Two-proportion test

Another common way of comparing two proportions is the two-proportion test. It is mathematically equivalent to the chi-square test. Agresti and Franklin (2007) suggest a rule of thumb for its minimum sample size: there should be at least 10 successes and 10 failures in each sample.
It generates a test statistic that is looked up using the normal (z) distribution to find the p-values. It uses the following formula and will be further discussed in a subsequent section (N−1 two-proportion test).

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Fisher exact test

The Fisher exact test uses exact probabilities instead of approximations as is done with the chi-square distribution and t distributions. As with the exact binomial confidence interval method used in Chapter 4, exact methods tend to be conservative and generate p-values that are higher than they should be and therefore require larger differences between groups to achieve statistical significance.
The Fisher exact test computes the p-values by finding the probabilities of all possible combinations of 2 × 2 tables that have the same marginal totals (the values in cells m, n, r, and s) that are equal to or more extreme than the ones observed. These values are computed for each 2 × 2 table using the following formula:

$$p = \frac{m!\,n!\,r!\,s!}{a!\,b!\,c!\,d!\,N!}$$
The computations are very tedious to do by hand and, because they involve factorials, can generate extremely large numbers. Software is used in computing the p-values because there are typically dozens of tables that have the same marginal or more extreme marginal totals (m, n, r, and s) even for modest sample sizes. An online Fisher exact test calculator is available at www.measuringu.com/fisher.php.
For the data in Table 5.7, the two-tailed p-value generated from the calculator is 0.0557. Using 0.05 as our threshold for significance, strictly speaking, we would conclude there is NOT a statistically significant difference between designs using the Fisher exact test. In applied use, we’d likely come to the same conclusion whether we have 94.4% or 95% confidence—namely that it’s unlikely that the difference is due to chance. For more discussion of this topic, see “Can You Reject the Null Hypothesis When p > 0.05?” in Chapter 9.

Yates correction

The Yates correction attempts to approximate the p-values from the Fisher exact test with a simple adjustment to the original chi-square formula.

$$\chi^2_{Yates} = \frac{\left(|ad - bc| - \frac{N}{2}\right)^2 N}{mnrs}$$
Using the same aforementioned example, we get the Yates chi-square test statistic of

$$\chi^2_{Yates} = \frac{\left(|11 \times 5 - 1 \times 5| - \frac{22}{2}\right)^2 \times 22}{12 \times 10 \times 16 \times 6}$$

$$\chi^2_{Yates} = 2.905$$
Looking up this value in a chi-square table or using the Excel function =CHIDIST(2.905, 1) we get the p-value of 0.0883. Using 0.05 as our threshold for significance, we would conclude there is NOT a statistically significant difference between designs using the Yates correction (although, as with the Fisher test, this outcome would probably draw our attention to the possibility of a significant difference).
For this example, the p-value for the Yates correction is higher than the Fisher exact test, which is a typical result. In general, the Yates correction tends to generate p-values higher than the Fisher exact test and is therefore even more conservative, overstating the true long-term probability of a difference. For this reason and because most software programs can easily calculate the Fisher exact test, we do not recommend the use of the chi-square test with the Yates correction.

N−1 Chi-square test

Pearson also proposed an alternate form of the chi-square test in his original work (Campbell, 2007; Pearson, 1900). Instead of multiplying the numerator by N (the total sample size), it is multiplied by N−1.

$$\chi^2 = \frac{(ad - bc)^2 (N - 1)}{mnrs}$$
Campbell (2007) has shown this simple adjustment to perform better than the standard chi-square, Yates variant, and Fisher exact tests for almost all sample sizes. It tends not to work well when the minimum expected cell count is less than 1. Fortunately, having such low expected cell counts doesn’t happen a lot in user research, and when it does, the Fisher exact test is an appropriate substitute. Using the N−1 chi-square test, we get the following p-value from the example data used previously:

$$\chi^2 = \frac{(11 \times 5 - 1 \times 5)^2 \times 21}{12 \times 10 \times 16 \times 6}$$

$$\chi^2 = 4.557$$
Looking up this value in a chi-square table or using the Excel function =CHIDIST(4.557, 1) we get the p-value of 0.0328. Using 0.05 as our threshold for significance, we would conclude there is a statistically significant difference between designs.

N−1 Two-proportion test

An alternative way of analyzing a 2 × 2 table is to compare the differences in proportions. Similar to the two-sample t-test where the difference between the means was compared to the t-distribution, the N−1 chi-square test is equivalent to an N−1 two-proportion test. Instead of using the chi-square distribution to generate the p-values, we use the normal (z) distribution.
Many readers may find this approach more intuitive for three reasons.
1. It is often easier to think in terms of completion rates or conversion rates (measured as proportions) rather than the number of users that pass or fail.
2. We use the more familiar and readily available normal distribution as the reference distribution for finding p-values and don’t need to worry about degrees of freedom.
3. The confidence interval formula uses the difference between the two proportions and makes for an easier transition in computation and understanding.
The N−1 two-proportion test uses the standard large sample two-proportion formula (as shown in the previous section) except that it is adjusted by a factor of $\sqrt{\frac{N-1}{N}}$. This adjustment is algebraically equivalent to the N−1 chi-square adjustment. The resulting formula is

$$z = \frac{(\hat{p}_1 - \hat{p}_2)\sqrt{\frac{N-1}{N}}}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where
$\hat{p}_1$ and $\hat{p}_2$ are the sample proportions,
$P = \frac{x_1 + x_2}{n_1 + n_2}$, where $x_1$ and $x_2$ are the numbers completing or converting, and $n_1$ and $n_2$ are the numbers attempting,
Q = 1 − P, and
N is the total sample size in both groups.
Using the example data we have 11 out of 12 (91.7%) completing on Design A and 5 out of 10 (50.0%) completing on Design B for a total sample size of 22.
First we compute the values for P and Q and substitute them in the larger equation.

$$P = \frac{11 + 5}{12 + 10} = 0.727 \quad \text{and} \quad Q = 1 - 0.727 = 0.273$$

$$z = \frac{(0.917 - 0.50)\sqrt{\frac{22 - 1}{22}}}{\sqrt{0.727 \times 0.273 \times \left(\frac{1}{12} + \frac{1}{10}\right)}}$$

$$z = 2.135$$
We can use a normal (z) table to look up the two-sided p-value or the Excel function =(1-NORMSDIST(2.135))*2 which generates a two-sided p-value of 0.0328—the same p-value we got from the N−1 chi-square test, demonstrating their mathematical equivalence.
Table 5.8 summarizes the p-values generated from the sample data for all approaches and our recommended strategy.

Table 5.8

Summary of p-values Generated from Sample Data for Chi-Square and Fisher Tests

Method P-value Notes
N−1 chi-square/N−1 two-proportion test 0.0328 Recommended: When expected cell counts are all >1
Chi-square/two-proportion test 0.0288 Not Recommended: Understates true probability for small sample sizes
Chi-square with Yates correction 0.0883 Not Recommended: Overstates true probability for all sample sizes
Fisher exact test 0.0557 Recommended: When any expected cell count is <1

Confidence interval for the difference between proportions

As with all tests of statistical comparisons, in addition to knowing whether the difference is significant, we also want to know how large of a difference likely exists. To do so for this type of comparison, we generate a confidence interval around the difference between two proportions. The recommended formula is an adjusted-Wald confidence interval similar to that used in Chapter 4, except that it is for a difference between proportions (Agresti and Caffo, 2000) instead of around a single proportion (Agresti and Coull, 1998).
The adjustment is to add a quarter of a squared z-critical value to the numerator and half a squared z-critical value to the denominator when computing each proportion. For a 95% confidence level the two-sided z-critical value is 1.96. This is like adding two pseudo observations to each sample—one success and one failure—as shown in the following:

$$\hat{p}_{adj} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{x + \frac{1.96^2}{4}}{n + \frac{1.96^2}{2}} = \frac{x + 0.9604}{n + 1.92} \approx \frac{x + 1}{n + 2}$$
This adjustment is then inserted into the more familiar (to some) Wald confidence interval formula.

$$(\hat{p}_{adj1} - \hat{p}_{adj2}) \pm z_{\alpha}\sqrt{\frac{\hat{p}_{adj1}(1 - \hat{p}_{adj1})}{n_{adj1}} + \frac{\hat{p}_{adj2}(1 - \hat{p}_{adj2})}{n_{adj2}}}$$

where $z_{\alpha}$ is the two-sided z critical value for the level of confidence (e.g., 1.96 for a 95% confidence level).
With the same example data we’ve used so far, we will compute a 95% confidence interval. First we compute the adjustments.
For Design A, 11 out of 12 users completed the task, and these become our x and n, respectively.

$$\hat{p}_{adj1} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{11 + \frac{1.96^2}{4}}{12 + \frac{1.96^2}{2}} = \frac{11 + 0.9604}{12 + 1.92} = \frac{11.96}{13.92} = 0.859$$
For Design B, five out of ten users completed the task, and these become our respective x and n.

$$\hat{p}_{adj2} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{5 + \frac{1.96^2}{4}}{10 + \frac{1.96^2}{2}} = \frac{5 + 0.9604}{10 + 1.92} = \frac{5.96}{11.92} = 0.50$$
Note: When the sample proportion is 0.5, the adjusted p will also be 0.50, as seen in this example.
Plugging these adjustments into the main formula we get

$$(0.859 - 0.50) \pm 1.96\sqrt{\frac{0.859(1 - 0.859)}{13.92} + \frac{0.50(1 - 0.50)}{11.92}}$$

$$0.359 \pm 0.338$$
By adding and subtracting 0.338 to the difference between proportions of 0.359, we get a 95% confidence interval that ranges from 0.022 to 0.697. That is, we can be 95% confident that the actual difference between design completion rates is between 2% and 70%.

Example 1: Comparing two completion rates

A new version of a CRM software application was created to improve the process of adding contacts to a distribution list. Four out of nine users (44.4%) completed the task on the old version and 11 out of 12 (91.7%) completed it on the new version. Is there enough evidence to conclude the new design improves completion rates? We will use the N−1 two-proportion test.

$$z = \frac{(\hat{p}_1 - \hat{p}_2)\sqrt{\frac{N-1}{N}}}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

$$P = \frac{x_1 + x_2}{n_1 + n_2}$$
Filling in the values we get

$$P = \frac{4 + 11}{9 + 12} = 0.714 \quad \text{and} \quad Q = 1 - 0.714 = 0.286$$

$$z = \frac{(0.917 - 0.444)\sqrt{\frac{21 - 1}{21}}}{\sqrt{0.714 \times 0.286 \times \left(\frac{1}{9} + \frac{1}{12}\right)}}$$

$$z = 2.313$$
We can use a normal (z) table to look up the two-sided p-value or the Excel function NORMSDIST for the test statistic of 2.313. Using NORMSDIST, enter the formula =(1-NORMSDIST(2.313))*2, which generates a p-value of 0.0207. Because this value is low, we have reasonable evidence to conclude the completion rate on the new CRM design has improved. To estimate the actual improvement in the completion rate for the entire user population, we now generate a 95% confidence interval around the difference in proportions using the adjusted-Wald procedure.

$$\hat{p}_{adj1} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{4 + \frac{1.96^2}{4}}{9 + \frac{1.96^2}{2}} = \frac{4 + 0.96}{9 + 1.92} = \frac{4.96}{10.92} = 0.454$$

$$\hat{p}_{adj2} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{11 + \frac{1.96^2}{4}}{12 + \frac{1.96^2}{2}} = \frac{11 + 0.96}{12 + 1.92} = \frac{11.96}{13.92} = 0.859$$
The critical value of z for a 95% confidence level is 1.96.

$$(0.859 - 0.454) \pm 1.96\sqrt{\frac{0.859(1 - 0.859)}{13.92} + \frac{0.454(1 - 0.454)}{10.92}}$$

$$0.405 \pm 0.347$$
The 95% confidence interval is 0.058 to 0.752, that is, we can be 95% confident the actual improvement in completion rates on the new task design is between 6% and 75%.

Example 2: A/B testing

An A/B test was conducted live on an e-commerce website for two weeks to determine which product page converted more users to purchase a product. Concept A was presented to 455 users and 37 (8.13%) purchased the product. Concept B was presented to 438 users and 22 (5.02%) purchased the product. Is there evidence that one concept is statistically better than the other? Using the N−1 two-proportion test we get

$$P = \frac{37 + 22}{455 + 438} = 0.066 \quad \text{and} \quad Q = 1 - 0.066 = 0.934$$

$$z = \frac{(0.0813 - 0.0502)\sqrt{\frac{893 - 1}{893}}}{\sqrt{0.066 \times 0.934 \times \left(\frac{1}{455} + \frac{1}{438}\right)}}$$

$$z = 1.87$$
Looking up the test statistic 1.87 in a normal table, we get a two-sided p-value of 0.06. A difference this large would occur by chance only about 6% of the time if the two concepts really converted at the same rate; put another way, we can be about 94% confident the conversion rates are different. The 90% confidence interval around the difference in conversion rates (which uses the critical value of 1.64) is

$$\hat{p}_{adj1} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{37 + \frac{1.64^2}{4}}{455 + \frac{1.64^2}{2}} = \frac{37 + 0.68}{455 + 1.35} = \frac{37.68}{456.35} = 0.083$$

$$\hat{p}_{adj2} = \frac{x + \frac{z^2}{4}}{n + \frac{z^2}{2}} = \frac{22 + \frac{1.64^2}{4}}{438 + \frac{1.64^2}{2}} = \frac{22 + 0.68}{438 + 1.35} = \frac{22.68}{439.35} = 0.052$$

$$(0.083 - 0.052) \pm 1.64\sqrt{\frac{0.083(1 - 0.083)}{456.35} + \frac{0.052(1 - 0.052)}{439.35}}$$

$$0.031 \pm 0.027$$
The 90% confidence interval around the observed difference of 0.031 ranges from 0.004 to 0.058. That is, if Concept A were used for all users (assuming the two-week period was representative), we could expect it to convert between 0.4% and 6% more users than Concept B. As with any confidence interval, the actual long-term difference in conversion rates is more likely to be closer to the middle value of 3.1% than to either of the extreme end points. For many large-volume e-commerce websites, however, even the lower-limit estimated advantage of 0.4% for Concept A could translate into a lot more revenue.

Within-subjects

When the same users are used in each group the test design is within-subjects (also called matched pairs). As with the continuous within-subjects test (the paired t-test) the variation between users has been removed and you have a better chance of detecting differences (higher power) with the same sample size as a between-subjects design.
To determine whether there is a significant difference between completion rates, conversion rates, or any other dichotomous variable, we use the McNemar exact test, which generates p-values by testing whether the proportion of discordant pairs differs from what we’d expect by chance (0.5). This procedure, called the sign test, works for all sample sizes.

McNemar exact test

The McNemar exact test uses a 2 × 2 table similar to those in the between-subjects section, but the primary test metric is the number of participants who switch from pass to fail or fail to pass—the discordant pairs (McNemar, 1969).
Unlike the between-subjects chi-square test, we cannot set up our 2 × 2 table just from the summary data of the participants who passed and failed. We need to know the number who had a different outcome on each design—the discordant pairs of responses. Table 5.9 shows the nomenclature used to represent the cells of the 2 × 2 table for this type of analysis.

Table 5.9

Nomenclature for McNemar Exact Test

Design B Pass Design B Fail Total
Design A Pass a b m
Design A Fail c d n
Total r s N

We want to know if the proportion of discordant pairs (cells b and c) is greater than what we’d expect to see from chance alone. For this type of analysis, we set chance to 0.50. If the proportion of pairs that are discordant is different from 0.50 (higher or lower), then we have evidence that there is a difference between designs.
To test the observed proportion against a test proportion, we use the nonparametric binomial test. This is the same approach we took in Chapter 4 (Comparing Small Sample Completion Rates to a Criterion). When the proportion tested is 0.50, the binomial test goes by the special name “the sign test.”
The sign test uses the following binomial probability formula:

$$p(x) = \frac{n!}{x!(n - x)!}p^x(1 - p)^{(n - x)}$$
where
x is the number of positive or negative discordant pairs (cell c or cell b, whichever is smaller),
n is the total number of discordant pairs (cell b + cell c), and
p = .50.
Note: The term n! is pronounced “n factorial” and is $n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$.
As discussed in Chapter 4, we will again use mid-probabilities as a less conservative alternative to exact probabilities, which tend to overstate the value of p, especially when sample sizes are small.

Example 1: Completion rates

For example, 15 users attempted the same task on two different designs. The completion rate on Design A was 87% and on Design B was 53%. Table 5.10 shows how each user performed, with 0s representing failed task attempts and 1s for passing attempts.

Table 5.10

Sample Data for McNemar Exact Test

User Design A Design B
1 1 0
2 1 1
3 1 1
4 1 0
5 1 0
6 1 1
7 1 1
8 0 1
9 1 0
10 1 1
11 0 0
12 1 1
13 1 0
14 1 1
15 1 0
Comp rate 87% 53%
Next we total the number of concordant and discordant responses in a 2 × 2 table (Table 5.11).

Table 5.11

Concordant and Discordant Responses for Example 1

Design B Pass Design B Fail Total
Design A Pass 7 (a) 6 (b) 13 (m)
Design A Fail 1 (c) 1 (d) 2 (n)
Total 8 (r) 7 (s) 15 (N)

Concordant pairs
Seven users completed the task on both designs (cell a)
One user failed on Design A and failed on Design B (cell d)
Discordant pairs
Six users completed on Design A but failed on Design B (cell b)
One user failed on Design A and passed on Design B (cell c)
Table 5.12 shows the discordant users along with a sign (positive or negative) to indicate whether they performed better (plus sign) or worse (negative sign) on Design B. By the way, this is where this procedure gets its name the “sign test”—we’re testing whether the proportion of pluses to minuses is significantly different from 0.50.

Table 5.12

Discordant Performance from Example 1

User Relative Performance on B
1 −
4 −
5 −
8 +
9 −
13 −
15 −
In total, there were seven discordant pairs (cell b + cell c). Most users who performed differently performed better on Design A (six of seven). We will use the smaller of the discordant cells to simplify the computation, which is the one person in cell c who failed on Design A and passed on Design B. (Note that you will get the same result if you used the larger of the discordant cells, but it would be more work.) Plugging these values in the formula, we get

$$p(0) = \frac{7!}{0!(7 - 0)!}0.5^0(1 - 0.5)^{(7 - 0)} = 0.0078$$

$$p(1) = \frac{7!}{1!(7 - 1)!}0.5^1(1 - 0.5)^{(7 - 1)} = 0.0547$$
The one-tailed exact-p value is these two probabilities added together, 0.0078 + 0.0547 = 0.0625, so the two-tailed probability is double this (0.125). The mid-probability is equal to half the exact probability for the value observed plus the cumulative probability of all values less than the one observed. In this case, the probability of all values less than the one observed is just the probability of 0 discordant pairs, which is 0.0078:

$$\text{Mid-}p = \frac{1}{2}(0.0547) + 0.0078$$

$$\text{Mid-}p = 0.0352$$
The one-tailed mid-p value is 0.0352, so the two-tailed mid-p value is double this (0.0704). Thus, the probability of seeing a split this extreme (only one of seven discordant users performing better on Design B) if there really was no difference between designs is 0.0704. Put another way, we can be about 93% sure Design A has a better completion rate than Design B.
The computations for this two-sided mid-p value are rather tedious to do by hand, but are fairly easy to get using the Excel function =2*(BINOMDIST(0,7,0.5,FALSE) + 0.5*BINOMDIST(1,7,0.5,FALSE)).
If you need to guarantee that the reported p-value is greater than or equal to the actual long-term probability, then you should use the exact-p values instead of the mid-p values. This is similar to the recommendation we gave when comparing the completion rate to a benchmark and when computing binomial confidence intervals (see Chapter 4). For most applications in user research, the mid-p value will work better (lead to more correct decisions) over the long run (Agresti and Coull, 1998).

Alternate approaches

As with the between-subjects chi-square test, there isn’t much agreement among statistics texts (or statisticians) on the best way to compute the within-subjects p-value. This section provides information about additional approaches you might have encountered. You may safely skip this section if you trust our recommendation (or if you’re not interested in more geeky technical details).
Chi-square statistic
The most common recommendation in statistics textbooks for large-sample within-subjects comparisons is to use the chi-square statistic. It is typically called the McNemar chi-square test (McNemar, 1969), as opposed to the McNemar exact test, which we presented in the earlier section. It uses the following formula:

$$\chi^2 = \frac{(c - b)^2}{c + b}$$
You will notice that the formula only uses the discordant cells (b and c). You can look up the test statistic in a chi-square table with 1 degree of freedom to generate the p-value, or use the Excel CHIDIST function. Using the data from Example 1 with seven discordant pairs we get a test statistic of

$$\chi^2 = \frac{(1 - 6)^2}{7} = 3.571$$
Using the Excel function =CHIDIST(3.571, 1), we get the p-value of 0.0587, which, for this example, is reasonably close to our mid-p value of 0.0704.
However, to use this approach, the sample size needs to be reasonably large to have accurate results. As a general guide, it is a large enough sample if the number of discordant pairs (b + c) is greater than 30 (Agresti and Franklin, 2007).
You can equivalently use the z-statistic and corresponding normal table of values to generate a p-value instead of the chi-square statistic, by simply taking the square root of the entire equation.

$$z = \frac{c - b}{\sqrt{c + b}}$$

$$z = \frac{1 - 6}{\sqrt{6 + 1}} = \frac{-5}{\sqrt{7}} = -1.89$$
Using the Excel NORMSDIST function (=2*NORMSDIST(-1.89)), we get p = 0.0587, demonstrating the mathematical equivalence of the methods.
Yates correction to the chi-square statistic
To further complicate matters, some texts recommend using a Yates corrected chi-square for all sample sizes (Bland, 2000). As shown in the following, the Yates correction is

$$\chi^2 = \frac{(|c - b| - 1)^2}{b + c}$$
Using the data from Example 1 with seven discordant pairs we get

$$\chi^2 = \frac{(|1 - 6| - 1)^2}{7} = 2.29$$
We look up this value in a chi-square table of values with 1 degree of freedom or use the Excel function =CHIDIST(2.29, 1) to get the p-value of 0.1306. For this example, this value is even higher than the exact-p value from the sign test, which we expect to overstate the magnitude of p. A major criticism of the Yates correction is that it will likely exceed the p-value from the sign test. Recall that this overcorrection also occurs with the Yates correction of the between-subjects chi-square test. For this reason, we do not recommend the use of the Yates correction.
Table 5.13 provides a summary of the p-values generated from the different approaches and our recommendations.

Table 5.13

Summary of p-values Generated from Sample Data for McNemar Tests

Method P-value Notes
McNemar exact test using mid-probabilities 0.0704 Recommended: For all sample sizes will provide best average long-term probability, but some individual tests may understate actual probability
McNemar exact test using exact probabilities 0.125 Recommended: For all sample sizes when you need to guarantee the long-term probability is greater than or equal to the p-value (a conservative approach)
McNemar chi-square test/z test 0.0587 Not Recommended: Understates true probability for small sample sizes, and there is no clear guidance on what constitutes a large sample size
McNemar chi-square test with Yates correction 0.1306 Not Recommended: Overstates true probability for all sample sizes

Confidence interval around the difference for matched pairs

To estimate the likely magnitude of the difference between matched pairs of binary responses, we recommend the appropriate adjusted-Wald confidence interval (Agresti and Min, 2005). As described in Chapter 3 for confidence intervals around a single proportion, this adjustment uses the same concept as that for the between-subjects confidence interval around two proportions.
When applied to a 2 × 2 table for a within-subjects setup (as shown in Table 5.14), the adjustment is to add 1/8th of a squared critical value from the normal distribution for the specified level of confidence to each cell in the 2 × 2 table. For a 95% level of confidence, this has the effect of adding two pseudo observations to the total number of trials (N).

Table 5.14

Framework for Adjusted-Wald Confidence Interval

Design B Pass Design B Fail Total
Design A Pass a_adj b_adj m_adj
Design A Fail c_adj d_adj n_adj
Total r_adj s_adj N_adj

Using the same notation from the 2 × 2 table, with "adj" meaning that zα²/8 has been added to each value, we have the formula:

(p̂2adj − p̂1adj) ± zα √[(p̂12adj + p̂21adj − (p̂21adj − p̂12adj)²)/Nadj]
where

p̂1adj = madj/Nadj
p̂2adj = radj/Nadj
p̂12adj = badj/Nadj
p̂21adj = cadj/Nadj
zα = two-sided z critical value for the level of confidence (e.g., 1.96 for a 95% confidence level)
zα²/8 = the adjustment added to each cell (e.g., for a 95% confidence level this is 1.96²/8 = 0.48)
The formula is similar to the confidence interval around two independent proportions. The key difference here is how we generate the proportions from the 2 × 2 table.
Table 5.15 shows the results from Example 1 (so you don’t need to flip back to the original page).

Table 5.15

Results from Example 1

Design B Pass Design B Fail Total
Design A Pass 7 (a) 6 (b) 13 (m)
Design A Fail 1 (c) 1 (d) 2 (n)
Total 8 (r) 7 (s) 15 (N)

Table 5.16 shows the adjustment of approximately 0.5 added to each cell.

Table 5.16

Adjusted Values for Computing Confidence Interval

Design B Pass Design B Fail Total
Design A Pass 7.5 (aadj) 6.5 (badj) 14 (madj)
Design A Fail 1.5 (cadj) 1.5 (dadj) 3 (nadj)
Total 9 (radj) 8 (sadj) 17 (Nadj)

You can see that the adjustment has the effect of adding two pseudo users to the sample, as the total goes from 15 to 17. Filling these values into the formula for a 95% confidence interval (which has a critical z-value of 1.96), we get

p̂1adj = 14/17 = 0.825

p̂2adj = 9/17 = 0.529

p̂12adj = 6.5/17 = 0.383

p̂21adj = 1.5/17 = 0.087

(p̂2adj − p̂1adj) ± zα √[(p̂12adj + p̂21adj − (p̂21adj − p̂12adj)²)/Nadj]

(0.529 − 0.825) ± 1.96 √[(0.383 + 0.087 − (0.087 − 0.383)²)/17]

−0.296 ± 0.295
The 95% confidence interval around the difference in completion rates between designs is −59.1% to −0.1%. Both limits are negative because we subtracted the design with the better completion rate (Design A) from the one with the worse completion rate (Design B).
There's nothing sacred about the order in which you subtract the proportions. We could just as easily subtract Design B from Design A, which would generate a confidence interval of 0.1% to 59.1%. Neither interval quite crosses zero, so we can be about 95% confident there is a difference. It is typically easier to subtract the smaller proportion from the larger when reporting confidence intervals, so we will do that through the remainder of this section.
The mid-p value from the McNemar exact test was 0.0704, which gave us around 93% confidence that there was a difference, just short of the 95% confidence indicated by the adjusted-Wald confidence interval (which is based on a somewhat different statistical procedure) but likely confident enough for many early-stage designs to move on to the next research question (or to make any indicated improvements to the current design and test the next iteration).
In most applied settings, the difference between 94% confidence and 95% confidence shouldn't lead to different decisions. If you are using a rigid cutoff of 0.05, such as for a publication, then use the p-value to decide whether to reject the null hypothesis. Keep in mind that most statistical calculations approximate the role of chance. Both the approximation and the choice of method can produce p-values that fluctuate by a few percentage points (as we saw in Table 5.13), so don't get too hung up on what the "right" p-value is. If you are testing in an environment where you need to guarantee a certain p-value (medical device testing comes to mind), then increasing your confidence level to 99% and using exact-p values instead of mid-p values will substantially reduce the probability of identifying a chance difference as significant.
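If you want to automate this interval, the following minimal sketch shows one possible implementation in Python; the function name adjusted_wald_matched is an illustrative label rather than anything from this chapter, and the code applies the unrounded adjustment of zα²/8 ≈ 0.48 to each cell, so its output can differ from the hand calculation above in the third decimal place.

from math import sqrt

def adjusted_wald_matched(a, b, c, d, z=1.96):
    """Adjusted-Wald CI for the difference between matched proportions (Agresti and Min, 2005)."""
    adj = z * z / 8.0                      # about 0.48 for a 95% interval
    a, b, c, d = a + adj, b + adj, c + adj, d + adj
    n_total = a + b + c + d
    p1 = (a + b) / n_total                 # adjusted proportion passing Design A
    p2 = (a + c) / n_total                 # adjusted proportion passing Design B
    p12, p21 = b / n_total, c / n_total    # adjusted discordant proportions
    diff = p2 - p1
    margin = z * sqrt((p12 + p21 - (p21 - p12) ** 2) / n_total)
    return diff - margin, diff + margin

# Example 1 counts from Table 5.15: a = 7, b = 6, c = 1, d = 1
print(adjusted_wald_matched(7, 6, 1, 1))   # roughly (-0.59, -0.001)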

Example 2: Completion rates

In a comparative usability test, 14 users attempted to rent the same type of car in the same city on two different websites (Avis.com and Enterprise.com). All 14 users completed the task on Avis.com but only 10 of 14 completed it on Enterprise.com. The users and their task results appear in Table 5.17 and Table 5.18. Is there sufficient evidence that more users could complete the task on Avis.com than on Enterprise.com (as designed at the time of this study)?

Table 5.17

Completion Data from CUE-8 Task

User Avis.com Enterprise.com
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 1
7 1 1
8 1 0
9 1 1
10 1 1
11 1 1
12 1 0
13 1 1
14 1 1
Comp rate 100% 71%

Table 5.18

Organization of Concordant and Discordant Pairs from CUE-8 Task

Enterprise.com Pass Enterprise.com Fail Total
Avis.com Pass 10 (a) 4 (b) 14 (m)
Avis.com Fail 0 (c) 0 (d) 0 (n)
Total 10 (r) 4 (s) 14 (N)

In total there were four discordant users (cell b + cell c), all of whom performed better on Avis.com. Table 5.19 lists these four users, each of whom performed worse on Enterprise.com (failing the task they completed on Avis.com).

Table 5.19

Discordant Performance from CUE-8 Task

User Relative Performance on Enterprise.com
3 Worse
4 Worse
8 Worse
12 Worse
Plugging the appropriate values in the formula we get

p(x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n − x)

p(0) = [4!/(0!(4 − 0)!)] 0.5^0 (1 − 0.5)^(4 − 0) = 0.0625
The one-tailed exact-p value is 0.0625, so the two-tailed probability is double this (0.125). The mid-probability is equal to half the exact probability for the value observed plus the cumulative probability of all values less than the one observed. Because there are no values less than 0, the one-tailed mid-probability is equal to half of 0.0625:

Mid-p = (1/2)(0.0625) = 0.0313
The one-tailed mid-p value is 0.0313, so the two-tailed mid-p value is double this (0.0625). Thus, the probability of seeing zero out of four users perform worse on Enterprise.com if there really was no difference is 0.0625. Put another way, we can be around 94% sure Avis.com had a better completion rate than Enterprise.com on this rental car task at the time of this study.
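The same kind of binomial sketch used earlier (Python with scipy assumed; not part of the chapter's own tooling) confirms these probabilities for the four discordant pairs.

from scipy import stats

b, c = 4, 0                          # discordant pairs from Table 5.18
x, n = min(b, c), b + c
exact_one_tail = stats.binom.cdf(x, n, 0.5)                        # 0.0625
mid_one_tail = exact_one_tail - 0.5 * stats.binom.pmf(x, n, 0.5)   # 0.03125
print(2 * exact_one_tail, 2 * mid_one_tail)                        # 0.125 0.0625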

Comparing rental car websites

Why Enterprise.com had a worse completion rate— from the files of Jeff Sauro

In case you were wondering why Enterprise.com had a worse completion rate, the task required users to add a GPS system to the rental car reservation. On Enterprise.com, this option appeared only AFTER users had entered their personal information. As a result, four users spent a lot of time hunting for the option and either gave up or said they would call customer service. Allowing users to add that feature earlier in the reservation process (it changes the total rental price) would likely increase the completion rate (and rental rate) for Enterprise.com.
The 95% confidence interval around the difference is found by first adjusting the values in each interior cell of the 2 × 2 table by approximately 0.5 (zα²/8 = 1.96²/8 = 0.48), as shown in Table 5.20.

Table 5.20

Adjusted Counts for CUE-8 Task

Design B Pass Design B Fail Total
Design A Pass 10.5 (aadj) 4.5 (badj) 15 (madj)
Design A Fail 0.5 (cadj) 0.5 (dadj) 1 (nadj)
Total 11 (radj) 5 (sadj) 16 (Nadj)

Finding the component parts of the formula and entering the values we get

(p̂2adj − p̂1adj) ± zα √[(p̂12adj + p̂21adj − (p̂21adj − p̂12adj)²)/Nadj]

p̂1adj = madj/Nadj = 15/16 = 0.938

p̂2adj = radj/Nadj = 11/16 = 0.688

p̂12adj = badj/Nadj = 4.5/16 = 0.281

p̂21adj = cadj/Nadj = 0.5/16 = 0.031

(0.938 − 0.688) ± 1.96 √[(0.281 + 0.031 − (0.031 − 0.281)²)/16]

0.250 ± 0.245
We can be 95% confident the difference between proportions is between 0.5% and 49.5%. This interval does not cross zero, which tells us we can be 95% confident the difference is greater than zero. It is another example of a significant difference indicated by the confidence interval but not by the p-value. We didn't plan on both examples having p-values so close to 0.05; they are a consequence of using data from actual usability tests. Fortunately, you are more likely to encounter cases where the p-value and the confidence interval point to the same conclusion.
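As a quick check on the arithmetic, a short self-contained sketch in plain Python, using the 0.5-adjusted counts from Table 5.20, reproduces this interval.

from math import sqrt

a, b, c, d = 10.5, 4.5, 0.5, 0.5               # adjusted counts from Table 5.20
n_total = a + b + c + d                        # 16
p1, p2 = (a + b) / n_total, (a + c) / n_total  # 0.938 and 0.688
p12, p21 = b / n_total, c / n_total            # 0.281 and 0.031
margin = 1.96 * sqrt((p12 + p21 - (p21 - p12) ** 2) / n_total)
print(round(p1 - p2, 3), round(margin, 3))     # 0.25 0.245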