Chapter review questions

1. Assume you need to conduct a single-shot (not iterative) formative usability study that can detect about 85% of the problems that have a probability of occurrence of 0.25 for the specific participants and tasks used in the study (in other words, not 85% of ALL possible usability problems, but 85% of the problems discoverable with your specific method). How many participants should you plan to run?
2. Suppose you decide that you will maintain your goal of 85% discovery, but need to set the target value of p to 0.20. Now how many participants do you need?
3. You just ran a formative usability study with 4 participants. What percentage of the problems of p = 0.50 are you likely to have discovered? What about p = 0.01; 0.90; 0.25?
4. Table 7.11 shows the results of a formative usability evaluation of an interactive voice response application (Lewis, 2008) in which six participants completed four tasks, with the discovery of twelve distinct usability problems. For this matrix, what is the observed value of p across these problems and participants?
5. Continuing with the data in Table 7.11, what is the adjusted value of p?
6. Using the adjusted value of p, what is the estimated total number of the problems available for discovery with these tasks and types of participants? What is the estimated number of undiscovered problems? How confident should you be in this estimate? Should you run more participants, or is it reasonable to stop?

Table 7.11

Results from Lewis (2008) Formative Usability Study (6 Participants × 12 Problems)

Participant   Problems Experienced (Count)   Proportion of the 12 Problems
1             2                              0.17
2             5                              0.42
3             2                              0.17
4             3                              0.25
5             4                              0.33
6             4                              0.33
Total         20

Problem                         1   2   3   4   5   6   7   8   9   10   11   12   Total
Participants Experiencing It    1   2   3   1   1   1   1   1   1    1    5    2      20

Note: Counts are the number of the 12 problems each participant experienced (rows) and the number of the 6 participants who experienced each problem (columns), for a total of 20 problem instances in the 6 × 12 participant-by-problem matrix.
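
For readers who want to reproduce the marginal statistics in Table 7.11, the short Python sketch below (the variable names are ours) derives each participant's proportion and the overall observed proportion of filled cells from the row counts.

# Row counts from Table 7.11: problems experienced by participants 1-6.
row_counts = [2, 5, 2, 3, 4, 4]
n_problems = 12

proportions = [round(count / n_problems, 2) for count in row_counts]
total_instances = sum(row_counts)                                  # 20 problem instances
p_observed = total_instances / (len(row_counts) * n_problems)      # 20/72

print(proportions)           # [0.17, 0.42, 0.17, 0.25, 0.33, 0.33]
print(round(p_observed, 2))  # 0.28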

Answers to chapter review questions

1. From Table 7.1, when p = 0.25, you need to run seven participants to achieve the discovery goal of 85% [P(x ≥ 1) = 0.85]. Alternatively, you could search the row in Table 7.2 for p = 0.25 until you find the sample size at which the value in the cell first exceeds 0.85, which is at n = 7.
2. Tables 7.1 and 7.2 do not have entries for p = 0.20, so you need to use the formula below, which indicates a sample size requirement of 9 (8.5 rounded up).
$$n = \frac{\ln(1 - P(x \geq 1))}{\ln(1 - p)} = \frac{\ln(1 - 0.85)}{\ln(1 - 0.20)} = \frac{\ln(0.15)}{\ln(0.80)} = \frac{-1.897}{-0.223} = 8.5$$
3. Table 7.2 shows that the expected percentage of discovery when n = 4 and p = 0.50 is 94%. For p = 0.01, it’s 4% expected discovery; for p = 0.90, it’s 100%; for p = 0.25, it’s 68%.
4. For the results shown in Table 7.11, the observed average value of p is 0.28. You can get this by averaging the proportions across the six participants (shown in the table), by averaging the proportions across the twelve problems (not shown in the table), or by dividing the total number of problem instances by the number of cells in the participant-by-problem matrix [20/(6 × 12) = 20/72 = 0.28].
5. To compute the adjusted value of p, use the formula below, where GT_adj is the proportion of discovered problems that occurred only once.
$$p_{adj} = \frac{1}{2}\left[\left(p_{est} - \frac{1}{n}\right)\left(1 - \frac{1}{n}\right)\right] + \frac{1}{2}\left[\frac{p_{est}}{1 + GT_{adj}}\right]$$
The deflation component is (0.28 − 1/6)(1 − 1/6) = 0.11(0.83) = 0.09. Because there were 12 distinct problems, 8 of which occurred only once, the Good-Turing component is 0.28/(1 + 8/12) = 0.28/1.67 = 0.17. The average of these two components, which is the adjusted value of p, is 0.13.
6. The adjusted estimate of p (from Question 5) is 0.13. We know from Table 7.11 that twelve problems were discovered with six participants. To estimate the proportion of discovery so far, use 1 − (1 − p)^n. Putting in the values of n and p, you get 1 − (1 − 0.13)^6 = 0.57 (57% estimated discovery). If 57% discovery corresponds to 12 problems, then the estimated number of problems available for discovery is 12/0.57 = 21.05 (which rounds up to 22), so the estimated number of undiscovered problems is about 10. Because a sample size of 6 is in the range of over-optimism when using the binomial model, there are probably more than 10 problems remaining for discovery. Given the results shown in Table 7.9, it's reasonable to believe that there could be an additional two to seven undiscovered problems, so it's unlikely that there are more than 17 undiscovered problems. This low rate of problem discovery (p_adj = 0.13) indicates an interface in which there are few high-frequency problems to find. If there are resources to continue testing, it might be more productive to change the tasks in an attempt to create the conditions for discovering a different set of problems and, possibly, more frequently occurring problems. (The Python sketch following these answers works through the computations in Questions 1 through 6.)
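
As a check on the arithmetic in these answers, here is a minimal Python sketch that reproduces the computations for Questions 1 through 6; the function names, variable names, and rounding choices are ours, while the formulas are the ones used above.

import math

# Sample size needed to reach a discovery goal (Questions 1 and 2):
# n = ln(1 - goal) / ln(1 - p), rounded up to the next whole participant.
def sample_size(p, goal=0.85):
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

# Expected proportion of problems of probability p discovered with
# n participants (Question 3): 1 - (1 - p)^n.
def discovery(p, n):
    return 1 - (1 - p) ** n

print(sample_size(0.25))                                            # 7
print(sample_size(0.20))                                            # 9
print(round(discovery(0.50, 4), 2), round(discovery(0.25, 4), 2))   # 0.94 0.68

# Observed and adjusted p from the Table 7.11 marginals (Questions 4 and 5):
# 20 problem instances in a 6 x 12 matrix, 8 problems seen only once.
n_participants, n_problems, instances, singletons = 6, 12, 20, 8
p_obs = instances / (n_participants * n_problems)                   # about 0.28
deflated = (p_obs - 1 / n_participants) * (1 - 1 / n_participants)  # about 0.09
good_turing = p_obs / (1 + singletons / n_problems)                 # about 0.17
p_adj = (deflated + good_turing) / 2                                # about 0.13

# Estimated discovery so far and total problems available (Question 6).
so_far = discovery(p_adj, n_participants)                           # about 0.57
total = n_problems / so_far                                         # about 21
print(round(p_obs, 2), round(p_adj, 2), round(so_far, 2), round(total - n_problems, 1))

Because the sketch carries full precision rather than the two-decimal rounding used in the answers, its estimate of undiscovered problems (about 9) comes out slightly lower than the 10 reported above.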

References

Agresti A. Simple capture-recapture models permitting unequal catchability and variable sampling effort. Biometrics. 1994;50:494–500.

Al-Awar J, Chapanis A, Ford R. Tutorials for the first-time computer user. IEEE Trans. Prof. Commun. 1981;24:30–37.

Baecker RM. Themes in the early history of HCI—some unanswered questions. Interactions. 2008;15(2):22–27.

Barnum, C., Bevan, N., Cockton, G., Nielsen, J., Spool, J., Wixon, D., 2003. The “Magic Number 5”: Is it enough for web testing? In: Proceedings of CHI 2003. Ft. Lauderdale, FL: ACM, pp. 698–699.

Boehm BW. Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall; 1981.

Borsci S, Londei A, Federici S. The bootstrap discovery behaviour (BDB): a new outlook on usability evaluation. Cogn. Process. 2011;12:23–31.

Borsci S, Macredie RD, Barnett J, Martin J, Kuljis J, Young T. Reviewing and extending the five-user assumption: A grounded procedure for interaction evaluation. ACM Trans. Comput. Hum. Interact. 2013;20(5): Article 29, 1–23.

Bradley JV. Probability; Decision; Statistics. Englewood Cliffs, NJ: Prentice-Hall; 1976.

Bradley JV. Robustness? Br. J. Math. Stat. Psychol. 1978;31:144–152.

Briand LC, El Emam K, Freimut BG, Laitenberger O. A comprehensive evaluation of capture-recapture models for estimating software defect content. IEEE Trans. Softw. Eng. 2000;26(6):518–540.

Burnham KP, Overton WS. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika. 1979;65:625–633.

Caulton DA. Relaxing the homogeneity assumption in usability testing. Behav. Inf. Technol. 2001;20:1–7.

Chapanis A. Some generalizations about generalization. Hum. Factors. 1988;30:253–267.

Coull BA, Agresti A. The use of mixed logit models to reflect heterogeneity in capture-recapture studies. Biometrics. 1999;55:294–301.

Cowles M. Statistics in Psychology: An Historical Perspective. Hillsdale, NJ: Lawrence Erlbaum; 1989.

Dalal SR, Mallows CL. Some graphical aids for deciding when to stop testing software. IEEE J. Sel. Area. Comm. 1990;8(2):169–175.

Dorazio RM. On selecting a prior for the precision parameter of Dirichlet process mixture models. J. Stat. Plan. Inference. 2009;139:3384–3390.

Dorazio RM, Royle JA. Mixture models for estimating the size of a closed population when capture rates vary among individuals. Biometrics. 2003;59:351–364.

Dumas J. The great leap forward: The birth of the usability profession (1988–1993). J. Usability Stud. 2007;2(2):54–60.

Dumas, J., Sorce, J., Virzi, R., 1995. Expert reviews: How many experts is enough? In: Proceedings of the Human Factors and Ergonomics Society Thirty-Ninth Annual Meeting. Santa Monica, CA: Human Factors and Ergonomics Society, pp. 228–232.

Eick, S.G., Loader, C.R., Vander Wiel, S.A., Votta, L.G., 1993. How many errors remain in a software design document after inspection? In: Proceedings of the Twenty-Fifth Symposium on the Interface. Fairfax Station, VA: Interface Foundation of North America, pp. 195–202.

Ennis DM, Bi J. The beta-binomial model: accounting for inter-trial variation in replicated difference and preference tests. J. Sens. Stud. 1998;13:389–412.

Faulkner L. Beyond the five-user assumption: benefits of increased sample sizes in usability testing. Behav. Res. Methods Instrum. Comput. 2003;35:379–383.

Gould JD. How to design usable systems. In: Helander M, ed. Handbook of Human–Computer Interaction. Amsterdam, Netherlands: North-Holland; 1988:757–789.

Gould JD, Boies SJ. Human factors challenges in creating a principal support office system: The Speech Filing System approach. ACM Trans. Inf. Syst. 1983;1:273–298.

Gould JD, Lewis C. Designing for Usability: Key Principles and What Designers Think. Yorktown Heights, NY: IBM Corporation; 1984: (Tech. Report RC-10317).

Gould JD, Boies SJ, Levy S, Richards JT, Schoonard J. The 1984 Olympic message system: a test of behavioral principles of system design. Commun. ACM. 1987;30:758–769.

Guest G, Bunce A, Johnson L. How many interviews are enough? An experiment with data saturation and variability. Field Methods. 2006;18(1):59–82.

Hertzum M, Jacobsen NJ. The evaluator effect: A chilling fact about usability evaluation methods. Int. J. Hum. Comput. Interact. 2001;13:421–443.

Hornbæk K. Dogmas in the assessment of usability evaluation methods. Behav. Inf. Technol. 2010;29(1):97–111.

Hwang W, Salvendy G. What makes evaluators to find more usability problems?: A meta-analysis for individual detection rates. In: Jacko J, ed. Human-Computer Interaction, Part I, HCII 2007. Heidelberg, Germany: Springer-Verlag; 2007:499–507.

Hwang W, Salvendy G. Integration of usability evaluation studies via a novel meta-analytic approach: What are significant attributes for effective evaluation. Int. J. Hum. Comput. Interact. 2009;25(4):282–306.

Hwang W, Salvendy G. Number of people required for usability evaluation: The 10 ± 2 rule. Commun. ACM. 2010;53(5):130–133.

Jelinek F. Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press; 1997.

Kanis H. Estimating the number of usability problems. Appl. Ergon. 2011;42:337–347.

Kennedy, P. J., 1982. Development and testing of the operator training package for a small computer system. In: Proceedings of the Human Factors Society Twenty-Sixth Annual Meeting. Santa Monica, CA: Human Factors Society, pp. 715–717.

Law, E. L., Hvannberg, E. T., 2004. Analysis of combinatorial user effect in international usability tests. In: Proceedings of CHI 2004. Vienna, Austria: ACM, pp. 9–16.

Lewis, J. R., 1982. Testing small system customer set-up. In: Proceedings of the Human Factors Society Twenty-Sixth Annual Meeting. Santa Monica, CA: Human Factors Society.

Lewis JR. Sample sizes for usability studies: additional considerations. Hum. Factors. 1994;36:368–378.

Lewis, J. R., 2000. Evaluation of problem discovery rate adjustment procedures for sample sizes from two to ten (Tech. Report 29.3362). Raleigh, NC: IBM Corp. Available from: http://drjim.0catch.com/pcarlo5-ral.pdf.

Lewis JR. Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. Int. J. Hum. Comput. Interact. 2001;13:445–479.

Lewis, J. R., 2006a. Effect of level of problem description on problem discovery rates: Two case studies. In: Proceedings of the Human Factors and Ergonomics Society Fiftieth Annual Meeting. Santa Monica, CA: HFES, pp. 2567–2571.

Lewis JR. Sample sizes for usability tests: mostly math, not magic. Interactions. 2006;13(6):29–33.

Lewis JR. Usability evaluation of a speech recognition IVR. In: Tullis T, Albert B, eds. Measuring the User Experience, Chapter 10: Case Studies. Amsterdam, Netherlands: Morgan Kaufmann; 2008:244–252.

Lewis JR. Usability testing. In: Salvendy G, ed. Handbook of Human Factors and Ergonomics. 4th ed. New York, NY: John Wiley; 2012:1267–1312.

Lewis JR, Sauro J. When 100% really isn’t 100%: Improving the accuracy of small-sample estimates of completion rates. J. Usability Stud. 2006;1(3):136–150.

Lewis, J. R., Henry, S. C., Mack, R. L., 1990. Integrated office software benchmarks: a case study. In: Proceedings of the Third IFIP Conference on Human-Computer Interaction–INTERACT ’90. Cambridge, UK: Elsevier Science Publishers, pp. 337–343.

Lindgaard, G., Chattratichart, J., 2007. Usability testing: What have we overlooked? In: Proceedings of CHI 2007. San Jose, CA: ACM, pp. 1415–1424.

Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press; 1999.

Medlock MC, Wixon D, McGee M, Welsh D. The rapid iterative test and evaluation method: Better products in less time. In: Bias RG, Mayhew DJ, eds. Cost-Justifying Usability: An Update for the Internet Age. Amsterdam, Netherlands: Elsevier; 2005:489–517.

Nielsen, J., 1992. Finding usability problems through heuristic evaluation. In: Proceedings of CHI ’92. Monterey, CA: ACM, pp. 373–380.

Nielsen, J., 2000. Why you only need to test with 5 users. Alertbox, www.useit.com/alertbox/20000319.html. (Downloaded January 26, 2011.).

Nielsen, J., Landauer, T. K., 1993. A mathematical model of the finding of usability problems. In: Proceedings of INTERCHI’93. Amsterdam, Netherlands: ACM, pp. 206–213.

Nielsen, J., Molich, R., 1990. Heuristic evaluation of user interfaces. In: Proceedings of CHI ’90. New York, NY: ACM, pp. 249–256.

Perfetti, C., Landesman, L., 2001. Eight is not enough. Available from: http://www.uie.com/articles/eight_is_not_enough/.

Sauro J. The relationship between problem frequency and problem severity in usability evaluations. J. Usability Stud. 2014;10(1):17–25.

Schmettow, M., 2008. Heterogeneity in the usability evaluation process. In: Proceedings of the Twenty-Second British HCI Group Annual Conference on HCI 2008: People and Computers XXII: Culture, Creativity, Interaction—Volume 1. Liverpool, UK: ACM, pp. 89–98.

Schmettow, M., 2009. Controlling the usability evaluation process under varying defect visibility. In: Proceedings of the 2009 British Computer Society Conference on Human-Computer Interaction. Cambridge, UK: ACM, pp. 188–197.

Schmettow M. Sample size in usability tests. Commun. ACM. 2012;55(4):64–70.

Schnabel ZE. The estimation of the total fish population of a lake. Amer. Math. Monthly. 1938;45:348–352.

Smith DC, Irby C, Kimball R, Verplank B, Harslem E. Designing the Star user interface. Byte. 1982;7(4):242–282.

Spool, J., Schroeder, W., 2001. Testing websites: five users is nowhere near enough. In: CHI 2001 Extended Abstracts. New York, NY: ACM, pp. 285–286.

Turner CW, Lewis JR, Nielsen J. Determining usability test sample size. In: Karwowski W, ed. The International Encyclopedia of Ergonomics and Human Factors. Boca Raton, FL: CRC Press; 2006:3084–3088.

Virzi, R. A., 1990. Streamlining the design process: Running fewer subjects. In: Proceedings of the Human Factors Society Thirty-Fourth Annual Meeting. Santa Monica, CA: Human Factors Society, pp. 291–294.

Virzi RA. Refining the test phase of usability evaluation: how many subjects is enough? Hum. Factors. 1992;34:457–468.

Virzi RA. Usability inspection methods. In: Helander MG, Landauer TK, Prabhu PV, eds. Handbook of Human–Computer Interaction. 2nd ed. Amsterdam, Netherlands: Elsevier; 1997:705–715.

Walia, G. S., Carver, J. C., 2008. Evaluation of capture-recapture models for estimating the abundance of naturally-occurring defects. In: Proceedings of ESEM ’08. Kaiserslautern, Germany: ACM, pp. 158–167.

Walia, G. S., Carver, J. C., Nagappan, N., 2008. The effect of the number of inspectors on the defect estimates produced by capture-recapture models. In: Proceedings of ICSE ’08. Leipzig, Germany: ACM, pp. 331–340.

Whiteside J, Bennett J, Holtzblatt K. Usability engineering: Our experience and evolution. In: Helander M, ed. Handbook of Human–Computer Interaction. Amsterdam, Netherlands: North-Holland; 1988:791–817.

Williams G. The Lisa computer system. Byte. 1983;8(2):33–50.

Wilson EB. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 1927;22:209–212.

Woolrych, A., Cockton, G., 2001. Why and when five test users aren’t enough. In: Vanderdonckt, J., Blandford, A., Derycke, A. (Eds.), Proceedings of IHM–HCI 2001 Conference, vol. 2. Toulouse, France: Cépadèus Éditions, pp. 105–108.

Wright PC, Monk AF. A cost-effective evaluation method for use by designers. Int. J. Man Mach. Stud. 1991;35:891–912.
