Usability evaluations can generally be divided into formative and summative evaluations. Formative evaluations are done early in the product development life cycle to discover insights and shape the design direction. They typically involve usability inspection methods or usability testing with low-fidelity mocks or prototypes. Summative evaluations, on the other hand, are typically done toward the end of the product development life cycle with high-fidelity prototypes or the actual final product to evaluate it against a set of metrics (e.g., time on task, success rate). This can be done via in-person or remote usability testing or live experiments. Table 14.1 lists several formative and summative evaluation methods. Each method mentioned in this chapter could fill an entire chapter, if not a book, devoted just to discussing its origins, alternatives, and intricacies. As a result, our goal is to provide an overview of the evaluation methods available and provide information about where to look to learn more.
Table 14.1
Comparison of evaluation methodologies
Method | Formative or summative | State of your product | Goal | Resources required |
Heuristic evaluation | Formative | Low to high fidelity | Identify violations of known usability guidelines | Low |
Cognitive walkthrough | Formative | Low to high fidelity | Identify low-hanging issues early | Low |
Usability testing in-person | It depends | Any stage | Identify usability issues | Medium |
Eye tracking | Summative | Very high fidelity to launched | Identify where users look for features/information | High |
RITE | Formative | Any stage | Iterate quickly on a design | High |
Desirability testing | Summative | Very high fidelity to launched | Measure emotional response | Medium |
Remote testing | Summative | Very high fidelity to launched | Identify usability issues across large sample | Low to High |
Live experiments | Summative | Launched | Measure product changes with large sample of actual users | High |
Just like with other research methods described in this book, none of the evaluation methods here are meant to stand alone. Each uncovers different issues and should be used in combination to develop the ideal user experience.
It is ideal to have a third party (e.g., someone who has not been directly involved in the design of the product or service) conduct the evaluations to minimize bias. This is not always possible, but regardless of who conducts the evaluation, it must remain neutral. This means that the evaluator must:
■ Recruit representative participants, not just those who are fans or critics of your company/product
■ Use a range of representative tasks, not just those that your product/service is best or worst at
■ Use neutral language and nonverbal cues to avoid giving participants any signal what the “right” response is or what you want to hear, and never guide the participants or provide your feedback on the product
■ Be true to the data, rather than interpreting what he or she thinks the participant “really meant”
Depending on where you are in product development, what your research questions are, and what your budget is, there are a number of evaluation methodologies to choose from (see Table 14.1 for a comparison). Whether you choose a storyboard, paper prototype, low- or high-fidelity interactive prototype, or a launched product, you should evaluate it early and often.
Usability inspection methods leverage experts (e.g., people with experience in usability/user research, subject matter experts) rather than actual end users to evaluate your product or service against a set of specific criteria. These are quick and cheap ways of catching the "low-hanging fruit" or obvious usability issues throughout the product development cycle. If you are pressed for time or budget, these methods represent a minimum standard to meet. However, be aware that experts can miss issues that methods involving actual users will reveal. System experts can make incorrect assumptions about what the end user knows or wants.
Jakob Nielsen and Rolf Molich introduced the heuristic evaluation as a “discount” usability inspection method (Nielsen & Molich, 1990). “Discount usability engineering” meant that the methods were designed to save practitioners time and money over the standard lab usability study (Nielsen, 1989). They argued that there are 10 heuristics that products should adhere to for a good user experience (Nielsen, 1994). Three to five UX experts (or novices trained on the heuristics)—not end users or subject matter experts (SMEs)—individually assess a product by walking through a core set of tasks and noting any places where heuristics are violated. The evaluators then come together to combine their findings into a single report of issues that should be addressed. Note that it is difficult for every element in a product to adhere to all 10 heuristics, as they can sometimes be at odds. Additionally, products that adhere to all 10 heuristics are not guaranteed to meet users’ needs, but it is significantly less likely that they will face the barriers of poor design. Nielsen’s heuristics are as follows:
1. Visibility of system status: Keep the user informed about the status of your system and give them feedback in a reasonable time.
2. Match between system and the real world: Use terminology and concepts the user is familiar with and avoid technical jargon. Present information in a logical order and follow real-world conventions.
3. User control and freedom: Allow users to control what happens in the system and be able to return to previous states (e.g., undo, redo).
4. Consistency and standards: Be consistent throughout your product (e.g., terminology, layout, actions). Follow known standards and conventions.
5. Error prevention: To the greatest extent possible, help users avoid making errors, make it easy for users to see when an error has been made (i.e., error checking), and give users a chance to fix them before committing to an action (e.g., confirmation dialog).
6. Recognition rather than recall: Do not force users to rely on their memory to use your system. Make options or information (e.g., instructions) visible or easily accessible across your product when needed.
7. Flexibility and efficiency of use: Make accelerators available for expert users but hidden for novice ones. Allow users to customize the system based on their frequent actions.
8. Aesthetic and minimalist design: Avoid irrelevant information, and hide infrequently needed information. Keep the design to a minimum to avoid overloading the user’s attention.
9. Help users recognize, diagnose, and recover from errors: Although your system should prevent errors in the first place, when they do happen, provide error messages in clear terms (no error codes) that indicate the problem and how to recover from it.
10. Help and documentation: Ideally, your system should be used without documentation; however, that is not always realistic. When help or documentation is needed, make it brief, easy to find, focused on the task at hand, and clear.
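Because three to five evaluators assess the product individually before combining their findings, a common bookkeeping step is collapsing duplicate reports into one prioritized issue list. The sketch below is one hypothetical way to do that merge; the `Finding` record and the 1-4 severity scale are assumptions for illustration, not part of Nielsen's method.

```python
from dataclasses import dataclass

# Hypothetical record of one violation noted by one evaluator.
@dataclass(frozen=True)
class Finding:
    heuristic: str   # e.g., "Visibility of system status"
    location: str    # screen or element where the violation occurs
    severity: int    # assumed scale: 1 (cosmetic) to 4 (catastrophic)

def merge_findings(per_evaluator):
    """Combine individual evaluators' findings into one issue list.

    Issues reported by several evaluators collapse into a single entry;
    we keep the highest severity assigned and count how many evaluators
    caught the issue, which helps prioritize the combined report."""
    combined = {}
    for findings in per_evaluator:
        for f in findings:
            key = (f.heuristic, f.location)
            sev, count = combined.get(key, (0, 0))
            combined[key] = (max(sev, f.severity), count + 1)
    return combined
```

An issue flagged by several evaluators at high severity would then rise to the top of the single report the team addresses.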
We have provided a worksheet at http://tinyurl.com/understandingyourusers to help you conduct heuristic evaluations.
Cognitive walkthroughs are a formative usability inspection method (Lewis, Polson, Wharton, & Rieman, 1990; Polson, Lewis, Rieman, & Wharton, 1992; Nielsen, 1994). Whereas the heuristic evaluation looks at a product or system holistically, the cognitive walkthrough is task-specific. It is based on the belief that people learn a system by trying to accomplish tasks with it, rather than by first reading through instructions. It is ideal for products that are meant to be walk-up-and-use (i.e., no training required).
In a group of three to six people, your colleagues or SMEs are asked to put themselves in the shoes of the intended user group and walk through a scenario. To increase the validity and reliability of the original method, Jacobsen and John (2000) recommended including individuals whose backgrounds span the range of your intended user audience (to increase the likelihood of catching issues), creating scenarios that cover the full functionality of the system, and having your colleagues consider multiple points of view within the intended user group (e.g., users booking a flight for themselves versus for someone else). State a clear goal that the user wants to achieve (e.g., book a flight, check in for a flight) and make sure everyone understands that goal. The individual conducting the walkthrough presents the scenario and then shows everyone one screen at a time (e.g., mobile app screen, airline kiosk screen). When each screen is presented, everyone is asked to write down answers to four questions:
1. Is this what you expected to see?
2. Are you making progress toward your goal?
3. What would your next action be?
4. What do you expect to see next?
Going around the room, the evaluator asks each individual to state his or her answers and provide any related thoughts. For example, participants who feel they are not making progress toward the goal state why that is. A separate notetaker should record any areas where expectations were violated, along with any other usability issues identified.
Conduct two or three group sessions to ensure you have covered your scenarios and identified the range of issues. When examining the individual issues identified, you should consider if the issue can be applied more generally across the product (Jacobsen & John, 2000). For example, your colleagues may have noted that they want to live chat with a customer service agent when booking their flight. You should consider if there are other times when users may want to live chat with an agent, and therefore, that feature should be made more widely available. Ideally, you would iterate on the designs and conduct another round to ensure you have addressed the issues.
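One lightweight way to aggregate the written answers is to score, screen by screen, how many participants answered "yes" to the first question. The sketch below is a hypothetical tally, assuming the notetaker records each evaluator's answer as a boolean per screen; the 50% threshold is an illustrative assumption, not part of the published method.

```python
WALKTHROUGH_QUESTIONS = (
    "Is this what you expected to see?",
    "Are you making progress toward your goal?",
    "What would your next action be?",
    "What do you expect to see next?",
)

def flag_problem_screens(responses, threshold=0.5):
    """Flag screens where expectations were widely violated.

    `responses` maps a screen name to a list of booleans, one per
    evaluator, recording whether question 1 ("Is this what you expected
    to see?") was answered yes. Screens where fewer than `threshold` of
    evaluators said yes are candidates for redesign."""
    return [screen for screen, answers in responses.items()
            if sum(answers) / len(answers) < threshold]
```

Screens that repeatedly violate expectations across the two or three group sessions are the ones to iterate on first.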
Usability testing is the systematic observation of end users attempting to complete a task or set of tasks with your product based on representative scenarios. In individual sessions, participants interact with your product (e.g., paper prototype, low- or high-fidelity prototype, the launched product) as they think aloud (refer to Chapter 7, “Using a Think-Aloud Protocol” section, page 169), and user performance is evaluated against metrics such as task success, time on task, and conversion rate (e.g., whether or not the participant made a purchase). Several participants are shown the same product and asked to complete the same tasks in order to identify as many usability issues as possible.
There is a lot of debate about the number of participants needed for usability evaluation (see Borsci et al. (2013) for an academic evaluation and Sauro (2010) for a great history on the sample size debate). Nielsen and Landauer (1993) found that you get a better return on your investment if you conduct multiple rounds of testing; however, only five participants are needed per round. In other words, you will find more usability issues conducting three rounds of testing with five participants each if you iterate between rounds (i.e., make changes or add features/functionality to your prototype or product based on each round of feedback) than if you conduct a single study with 15 participants. If you have multiple, distinct user types, you will want to include three to four participants from each user type in your study per round.
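The diminishing returns behind the "five participants per round" guidance come from the cumulative problem-discovery model, usually written as 1 - (1 - p)^n, where p is the average probability that a single participant exposes a given problem. The sketch below uses p = 0.31, the approximate value Nielsen and Landauer estimated from their data sets; your own p will vary with product and task complexity.

```python
def share_of_problems_found(n, p=0.31):
    """Expected share of usability problems found with n participants.

    Cumulative discovery model 1 - (1 - p)^n; p = 0.31 is roughly the
    average Nielsen and Landauer (1993) estimated across their data
    sets, and is only an assumption for any particular product."""
    return 1 - (1 - p) ** n
```

With these assumptions, five participants surface roughly 85% of problems, so three iterated rounds of five each let you find (and fix) far more than one round of fifteen.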
There are a few variations on the in-person usability test that you can choose from based on your research questions, space availability, user availability, and budget.
A lab study involves bringing users to a dedicated testing space within your company or university or at a vendor's site. If your organization does not have a formal lab space for conducting usability studies, you can create your own impromptu lab with a conference room, laptop, screen recording software, and (optionally) a video camera. See Chapter 4, "Setting Up Research Facilities" on page 82 to learn more. An academic or corporate lab environment likely does not match the user's environment: it probably has high-end equipment and a fast Internet connection, looks like an office, and is devoid of distractions (e.g., colleagues, spouses, or kids making noise and interrupting). Although a lab environment may lack ecological validity (i.e., it does not mimic the real-world environment), it offers everyone a consistent experience and allows participants to focus on evaluating your product.
One variation on a lab study incorporates a special piece of equipment called an eye tracker. While most eye trackers are used on a desktop, there are also mobile eye trackers that can be used in the field (e.g., in a store for shopping studies, in a car for automotive studies). Eye tracking was first used in cognitive psychology (Rayner, 1998); however, the HCI community has adapted it to study where people look (or do not look) for information or functionality and for how long. Figures 14.1 and 14.2 show desktop and mobile eye tracking devices. By recording participant fixations and saccades (i.e., rapid eye movements between fixation points), a heat map can be created (see Figure 14.3). The longer participants' gazes stay fixed on a spot, the "hotter" the area is on the map, indicated by the color red. As fewer participants look at an area, or look for less time, the "cooler" it gets and transitions to blue. Areas where no one looked are black. By understanding where people look for information or features, you can understand whether or not participants discover and process an item. If participants' eyes do not dwell on an area of an interface, there is no way for them to process that area. This information can help you decide if changes are needed to your design to make something more discoverable.
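Under the hood, a heat map is essentially dwell time accumulated over screen regions. The sketch below is a minimal, hypothetical version of that accumulation step, assuming the tracker exports fixations as (x, y, duration) tuples; real eye tracking software applies smoothing and maps the totals onto the red-blue-black color ramp described above.

```python
def fixation_heat_grid(fixations, width, height, cell=50):
    """Accumulate fixation durations into a coarse grid for a heat map.

    `fixations` is a list of (x, y, duration_ms) tuples from the
    tracker; `width` and `height` are the stimulus dimensions in
    pixels. Each cell's total dwell time would map to color in the
    rendered heat map: long dwell = "hot" (red), short = "cool"
    (blue), zero = black."""
    cols, rows = width // cell + 1, height // cell + 1
    grid = [[0] * cols for _ in range(rows)]
    for x, y, dur in fixations:
        grid[y // cell][x // cell] += dur
    return grid
```

Cells with zero dwell time are exactly the areas participants never processed, which is the signal used to judge discoverability.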
Eye tracking studies are the one type of evaluation methodology where you do not want participants to think aloud as they interact with your product. Asking participants to think aloud will cause them to change their eye gaze as they speak with the evaluator or recall past remarks (Kim, Dong, Kim, & Lee, 2007). This will muddy the eye tracking data and should be avoided. One work-around is the retrospective think-aloud, in which participants are shown a video of their session and asked to tell the moderator what they were thinking at the time (Russell & Chi, 2014). A study found that showing participants a gaze plot or gaze video cue during the retrospective think-aloud resulted in a higher identification of usability issues (Tobii Technology, 2009). This will double the length of your study session, so most researchers using this method include half the number of tasks they normally would in order to keep the entire session to an hour.
In 2002, the Microsoft games division developed Rapid Iterative Testing and Evaluation (RITE) as a formative method to quickly address issues that prevented participants from proceeding in a game and from evaluating the remaining functionality (Medlock, Wixon, Terrano, Romero, & Fulton, 2002). Unlike traditional usability testing, which is meant to identify as many usability issues as possible and measure their severity, RITE is designed to quickly identify any large usability issue that prevents users from completing a task or keeps the product from meeting its stated goals. RITE studies should be conducted early in the development cycle with a prototype. The development team must observe all usability sessions, and following each session where a blocking usability issue is identified, they agree on a solution. The prototype is then updated, and another session is conducted to see if the solution fixed the problem. If the team cannot agree on the severity of a problem, an additional session can be conducted before any changes are made. This cycle of immediately fixing and testing updated prototypes continues until multiple sessions are conducted in which no further issues are identified. In contrast to traditional usability testing, where five or more participants see the same design, in RITE, at most two participants see the same design before changes are made for the next session.
RITE sessions typically require more sessions and therefore more participants than a single, traditional usability study with five to eight participants. Additionally, because this method requires the dedication of the development team to observe all sessions, brainstorm solutions following each session, and someone to update the prototype quickly and repeatedly, it is a more resource-intensive methodology. Overall, this can be perceived as a risky method because a lot of resources are invested early, based on initially small sample sizes. However, if done early in the development cycle, the team can feel confident they are building a product free of major usability issues.
At Google, when we need to make a quick decision about which design direction to take among several, we may conduct a 5-10-minute study with guests who visit our cafés for lunch. In only a couple of hours, we can get feedback from 10 or more participants using this formative method. Although the participants may be friends or family members of Googlers, there is surprising diversity in skills, demographics, and familiarity with Google products. We are able to collect enough data in a very short period of time to inform product direction and identify critical usability issues, confusing terminology, etc. You could do this at any café, in front of a grocery store, at any mall, etc. Of course, you need permission from the store manager or owner.
You can increase the ecological validity of your study by conducting evaluations in the field. This will give you a better sense of how people will use your product in the “real world.” If your product will be used at home, you could conduct the study in the participants’ homes, for example. See Chapter 13, “Field Studies” on page 380 for tips on conducting field research. This can be done either very early or much later in the product development life cycle, depending on your goals (i.e., identify opportunities and inform product direction or measure the product against a set of metrics).
To be successful, it is not enough for a product to be usable (i.e., users can complete a specified set of tasks with the product); it must also be pleasant to use and desirable. Don Norman (2004) argued that aesthetically pleasing products are actually more effective. The games division at Microsoft introduced us to another new methodology in 2002, this time focusing on emotions rather than usability issues (Benedek & Miner, 2002). Desirability testing evaluates whether or not a product elicits the desired emotional response from users. It is most often conducted with a released version of your product (or a competitor's product) to see how it makes participants feel. The Microsoft researchers identified a set of 118 positive, negative, and neutral adjectives based on market research and their own research (e.g., unconventional, appealing, inconsistent, professional, motivating, intimidating). However, you may have specific adjectives you would like users to associate (or not associate) with your product. You can add those to the list; just be mindful to keep a balance of positive and negative adjectives.
To conduct the method, create a set of flash cards with a single adjective per index card. After interacting with the product, perhaps following your standard usability study, hand the stack of cards to participants. Ask them to pick anywhere from five to ten cards from the stack that describe how the product made them feel. Then, ask participants to tell you why they chose each card. The researchers suggest conducting this method with 25 participants per user segment. From here, you can do affinity diagramming on the themes participants highlighted. If your product does not elicit the emotions you had hoped for, you can make changes as needed (e.g., adding/removing functionality, changing the tone of messaging, adding different visuals) and retest.
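Before the affinity diagramming step, it helps to tally which adjectives came up most often across the 25 participants per segment. A minimal sketch of that tally, assuming you record each participant's chosen cards as a list of adjective strings (the participant IDs and adjectives below are illustrative):

```python
from collections import Counter

def tally_cards(selections):
    """Tally adjective-card picks across participants.

    `selections` maps a participant ID to the 5-10 cards he or she
    chose. Each participant counts a given card at most once. The most
    frequently chosen adjectives suggest themes to probe in the
    follow-up "why did you choose this card?" discussion."""
    counts = Counter()
    for cards in selections.values():
        counts.update(set(cards))
    return counts.most_common()
```

The frequency list is only a starting point; the reasons participants give for each card carry the real insight.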
It is not always possible, or even desirable, to conduct evaluations with participants in person. For example, if you are based in any of the tech-savvy regions of the country, conducting only in-person sessions around your company will result in a sampling bias. You may also miss out on any region-specific issues users face (e.g., customer service hours are unfriendly to one or more time zones). Remote testing can help you gather data from participants outside of your geographic area. Another benefit of remote testing is that you can typically collect feedback from a much larger sample size in a shorter period of time, and no lab facilities are needed. Unfortunately, if you are conducting studies on hardware devices or with highly confidential products, you may still need to conduct studies in person.
There are two ways to conduct remote studies:
1. Use online vendors or services to conduct evaluations with their panels (e.g., UserZoom, UserTesting.com).
2. Use a service like GoToMeeting, WebEx, or Hangouts to remotely connect to the participant in his or her environment (e.g., home, office) while you remain in your lab or office. You will need to call or e-mail directions to the participant for how to connect your computer to his or her computer to share screens, and it may be necessary to walk participants through the installation step-by-step over the phone. Once you have done this, you can show the participant your prototype and conduct the study as you normally would in the lab. Alternatively, you can ask the participant to show you his or her computer or, using the web cam, show his or her environment, mobile device, etc. You will also need to leave time at the end of the session to walk participants through the process of uninstalling the application.
A live experiment, from an HCI standpoint, is a summative evaluation method that involves comparing two or more designs (e.g., live websites) to see which performs better (e.g., higher click-through rate, higher conversion rate). To avoid biasing the results, users in industry studies are usually not informed they are part of an experiment; however, in academia, consent is often required. In A/B testing, a percentage of users are shown one design ("A") and, via logs analysis, its performance is compared against another version ("B"). The designs can be a variation on a live control (typically the current version of your product) or two entirely new designs. See Figure 14.4 for an illustration. Multivariate testing follows the same principle, but in this case, multiple variables are manipulated to examine how changes in those variables interact to produce the ideal combination. All versions must be tested in parallel to control for extraneous variables that could affect your experiment (e.g., website outage, change in fees for your service).
You will need a large enough sample per design in order to conduct statistical analysis and find any significant differences; however, multivariate testing requires a far larger sample than simple A/B because of the number of combinations under consideration.
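For a simple A/B comparison of conversion rates, the standard statistical analysis is a two-proportion z-test. The sketch below is a minimal version using only the standard library; it assumes samples large enough for the normal approximation to hold, and the 0.05 alpha mentioned in the comment is the conventional choice, not a requirement.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for an A/B conversion comparison.

    `conv_a`/`conv_b` are conversion counts, `n_a`/`n_b` the users
    exposed to each design. Returns (z, p_value). Assumes large samples
    so the normal approximation holds; a p-value below your alpha
    (commonly 0.05) suggests the difference in conversion rates is
    unlikely to be due to chance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

Because multivariate testing multiplies the number of cells being compared, each cell needs this kind of sample on its own, which is why it requires a far larger total sample than a simple A/B test.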
There are a few free tools and several fee-based online services that can enable you to conduct live experiments. Simply do a web search for “website optimization” or “A/B testing,” and you will find several vendors and tools to help you.
In addition to identifying usability issues, there are a few metrics you may consider collecting in a summative evaluation:
■ Time on task: Length of time to complete a task.
■ Number of errors: Errors made completing a task and/or across the study.
■ Completion rate: Proportion of participants who completed the task successfully.
■ Satisfaction: Overall, how satisfied participants are on a given task and/or with the product as a whole at the end of the study (e.g., “Overall, how satisfied or dissatisfied are you with your experience? Extremely dissatisfied, Very dissatisfied, Moderately dissatisfied, Slightly dissatisfied, Neither satisfied nor dissatisfied, Slightly satisfied, Moderately satisfied, Very satisfied, Extremely satisfied”).
■ Page views or clicks: As a measure of efficiency, you can compare the number of page views or clicks by a participant against the most efficient/ideal path. Of course, the optimal path for a user may not be his or her preferred path. Collecting site analytics can tell you what users do but not why. To understand why users traverse your product in a certain path, you must conduct other types of studies (e.g., lab study, field study).
■ Conversion: Usually measured in a live experiment, this is a measure of whether or not participants (users) “converted” or successfully completed their desired task (i.e., signed up, made a purchase).
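The metrics above can be rolled up per task from your session logs. The sketch below is one hypothetical way to do so, assuming each participant's result is recorded as a small dict; reporting the median time alongside the mean is a common practice because time-on-task data is usually skewed by a few slow participants.

```python
from statistics import mean, median

def summarize_task(sessions):
    """Summarize per-task results from a summative usability test.

    `sessions` is a list of dicts, one per participant, like
    {"completed": True, "seconds": 74, "errors": 1}. Time on task is
    summarized over successful completions only, since failed attempts
    do not reflect the time the task actually takes."""
    times = [s["seconds"] for s in sessions if s["completed"]]
    return {
        "completion_rate": sum(s["completed"] for s in sessions) / len(sessions),
        "mean_time": mean(times) if times else None,
        "median_time": median(times) if times else None,
        "total_errors": sum(s["errors"] for s in sessions),
    }
```

Tracking these summaries across rounds of testing shows whether your design changes are actually moving the metrics.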
In a benchmarking study, you can compare the performance of your product or service against that of a competitor or a set of industry best practices. Keep in mind that with small sample sizes (e.g., in-person studies), you should not conduct statistical tests on the data and expect it will be representative of your broader population. However, it can be helpful to compare the metrics between rounds of testing to see if design solutions are improving the user experience.
For all of the evaluation methods listed, you should document the design that was evaluated to avoid repeating the same mistakes in later versions. This might be as simple as including screenshots in a slide deck or maintaining version control on a prototype so it is easy to see exactly what the participant experienced.
In small sample size or informal studies (e.g., cognitive walkthrough, café study, RITE), a simple list of issues identified and recommendations is usually sufficient to communicate with your stakeholders. A brief description of the participants can be important, particularly if there are any caveats stakeholders should be aware of (e.g., only company employees participated for confidentiality reasons). These are meant to be fast, lightweight methods that do not slow down the process with lengthy documentation.
For larger or more complex studies (e.g., eye tracking, live experiment), you will want to include a description of the methodology, participant demographics, graphs (e.g., heat map, click path), and any statistical analysis. Different presentation formats are usually required for different stakeholders. For example, it is unlikely engineers or designers will care about the details of the methodology or statistical analysis, but other researchers will. A simple presentation of screenshots showing the issues identified and recommendations is often best for the majority of your stakeholders, but we recommend creating a second report that has all of the details mentioned above, so if questions arise, you can easily answer (or defend) your results.
In this chapter, we have discussed several methods for evaluating your product or service. There is a method available for every stage in your product life cycle and for every schedule or budget. Evaluating your product is not the end of the life cycle. You will want (and need) to continue other forms of user research so you continually understand the needs of your users and how to best meet them.