Chapter 11. Evaluation of 3D User Interfaces

Most of this book has covered the various aspects of 3D UI design. However, one of the central truths of HCI is that even the most careful and well-informed designs can still go wrong in any number of ways. Thus, evaluation of user experience becomes critical. In this chapter, we discuss some of the evaluation methods that can be used for 3D UIs, metrics that help to describe the 3D user experience, distinctive characteristics of 3D UI evaluation, and guidelines for choosing 3D UI evaluation methods. Evaluation should be performed not only when a design is complete, but also throughout the design process.

11.1 Introduction

For many years, the fields of 3D UIs, VR, and AR were so novel and the possibilities so limitless that many researchers simply focused on developing new devices, interaction techniques, and UI metaphors—exploring the design space—without taking the time to assess how good the new designs were. As the fields have matured, however, researchers are taking a closer look at the payoff of 3D UI designs in terms of user experience (including usability, usefulness, and emotional impact). We must critically analyze, assess, and compare devices, interaction techniques, UIs, and applications if 3D UIs are to be used in the real world.

11.1.1 Purposes of Evaluation

Simply stated, evaluation is the analysis, assessment, and testing of an artifact. In UI evaluation, the artifact is the entire UI or part of it, such as a particular input device or interaction technique. The main purpose of UI evaluation is the identification of usability problems or issues, leading to changes in the UI design. In other words, design and evaluation should be performed in an iterative fashion, such that design is followed by evaluation, leading to a redesign, which can then be evaluated, and so on. The iteration ends when the UI is “good enough,” based on the metrics that have been set (or more frequently in real-world situations, when the budget runs out or the deadline arrives).

Although problem identification and redesign are the main goals of evaluation, it may also have secondary purposes. One of these is a more general understanding of the usability of a particular technique, device, or metaphor. This general understanding can lead to design guidelines (such as those presented throughout this book), so that each new design can start from an informed position rather than starting from scratch. For example, we can be reasonably sure that users will not have usability problems with the selection of items from a pull-down menu in a desktop application, if the interface designers follow well-known design guidelines, because the design of those menus has already gone through many evaluations and iterations.

Another, more ambitious, goal of UI evaluation is the development of performance models. These models aim to predict the performance of a user on a particular task within an interface. As discussed in Chapter 3, “Human Factors Fundamentals,” Fitts’s law (Fitts 1954) predicts how quickly a user will be able to position a pointer over a target area based on the distance to the target, the size of the target, and the muscle groups used in moving the pointer. Such performance models must be based on a large number of experimental trials on a wide range of generic tasks, and they are always subject to criticism (e.g., the model doesn’t take an important factor into account, or the model doesn’t apply to a particular type of task). Nevertheless, if a useful model can be developed, it can provide important guidance for designers.

It’s also important to note that UI evaluation is only one piece of the broader evaluation of user experience (UX). A well-designed UI can still provide a poor user experience if it doesn’t provide the necessary functionality to help users achieve their goals in the context of use (usefulness) or if it fails to satisfy and delight the user (emotional impact). Although we focus on UI evaluation in this chapter, production 3D applications should be evaluated with a broader UX focus in mind.

11.1.2 Terminology

We must define some important terms before continuing with our discussion of 3D UI evaluation. The most important term (which we’ve already used a couple of times) is usability. We define usability to encompass everything about an artifact and a person that affects the person’s use of the artifact. Evaluation, then, measures some aspects of the usability of an interface (it is not likely that we can quantify the usability of an interface with a single score). Usability metrics fall into several categories, such as system performance, user task performance, and subjective response (see section 11.3).

There are at least two roles that people play in a usability evaluation. A person who designs, implements, administers, or analyzes an evaluation is called an evaluator. A person who takes part in an evaluation by using the interface, performing tasks, or answering questions is called a user. In formal experimentation, a user is often called a participant.

Finally, we distinguish below between evaluation methods and evaluation approaches. Evaluation methods are particular steps that can be used in an evaluation. An evaluation approach, on the other hand, is a combination of methods, used in a particular sequence, to form a complete usability evaluation.

11.1.3 Chapter Roadmap

We’ve already covered the basics of UI evaluation in Chapter 4, “General Principles of Human-Computer Interaction,” section 4.4.5. Evaluation methods and metrics were also discussed throughout Chapter 3, “Human Factors Fundamentals”. We begin this chapter by providing some background information on evaluation methods for 3D UIs (section 11.2). We then focus on 3D UI evaluation metrics (section 11.3) and distinctive characteristics of 3D UI evaluation (section 11.4). In section 11.5, we classify 3D UI evaluation methods and follow that in section 11.6 with a description and comparison of three comprehensive approaches to 3D UI evaluation—sequential evaluation, testbed evaluation, and component evaluation. We then present a set of guidelines for those performing evaluations of 3D UIs (section 11.7). We conclude with a discussion of evaluating the 3D UIs of our two case studies (section 11.8).

11.2 Evaluation Methods for 3D UIs

In this section, we describe some of the common methods used in 3D UI evaluation. None of these methods are new or unique to 3D UIs. They have all been used and tested in many other usability evaluation contexts. We present them here as an introduction to these topics for the reader who has never studied UX evaluations. For more detailed information, you can consult any one of a large number of introductory books on UX evaluations (see the recommended reading list at the end of the chapter).

From the literature, we have compiled a list of usability evaluation methods that have been applied to 3D UIs (although numerous references could be cited for some of the techniques we present, we have included citations that are most recognized and accessible). Most of these methods were developed for 2D or GUI usability evaluation and have been subsequently extended to support 3D UI evaluation.

Cognitive Walkthrough

The cognitive walkthrough (Polson et al. 1992) is an approach to evaluating a UI based on stepping through common tasks that a user would perform and evaluating the interface’s ability to support each step. This approach is intended especially to gain an understanding of the usability of a system for first-time or infrequent users, that is, for users in an exploratory learning mode. Steed and Tromp (1998) used a cognitive walkthrough approach to evaluate a collaborative VE.

Heuristic Evaluation

Heuristic or guidelines-based expert evaluation (Nielsen and Molich 1992) is a method in which several usability experts separately evaluate a UI design (probably a prototype) by applying a set of heuristics or design guidelines that are either general enough to apply to any UI or are tailored for 3D UIs in particular. No representative users are involved. Results from the several experts are then combined and ranked to prioritize iterative design or redesign of each usability issue discovered. The current lack of well-formed guidelines and heuristics for 3D UI design and evaluation makes this approach more challenging for 3D UIs. Examples of this approach applied to 3D UIs can be found in Gabbard et al. (1999); Stanney and Reeves (2000); and Steed and Tromp (1998).

Formative Evaluation

Formative evaluation (both formal and informal; Hix and Hartson 1993) is an observational, empirical evaluation method, applied during evolving stages of design, that assesses user interaction by iteratively placing representative users in task-based scenarios in order to identify usability problems, as well as to assess the design’s ability to support user exploration, learning, and task performance. Formative evaluations can range from being rather informal, providing mostly qualitative results such as critical incidents, user comments, and general reactions, to being very formal and extensive, producing both qualitative and quantitative results (such as task timing or errors).

Collected data are analyzed to identify UI components that both support and detract from user task performance and user satisfaction. Alternating between formative evaluation and design or redesign efforts ultimately leads to an iteratively refined UI design. Most usability evaluations of 3D UI applications fall into the formative evaluation category. The work of Hix et al. (1999) provides a good example.

Summative Evaluation

Summative or comparative evaluation (both formal and informal; Hix and Hartson 1993; Scriven 1967) is a method for either (a) comparing the usability of a UI to target usability values or (b) comparing two or more configurations of UI designs, UI components, and/or UI techniques. As with formative evaluation, representative users perform task scenarios while evaluators collect both qualitative and quantitative data, and summative evaluations can likewise be applied either formally or informally.

Summative evaluation is generally performed after UI designs (or components) are complete. Summative evaluation enables evaluators to measure and subsequently compare the productivity and cost benefits associated with different UI designs. Comparing 3D UIs requires a consistent set of user task scenarios (borrowed and/or refined from the formative evaluation effort), resulting in primarily quantitative results that compare (on a task-by-task basis) a design’s support for specific user task performance.

Summative evaluations that are run as formal experiments need to follow a systematic process for experimental design. Research questions are typically of the form, “What are the effects of X and Y on Z?” For example, a formal 3D UI experiment might ask, “What are the effects of interaction technique and display field of view on accuracy in a manipulation task?” In such a research question, X and Y (interaction technique and display FOV in the example) are called independent variables, while Z (accuracy) is called a dependent variable. The independent variables are manipulated explicitly among multiple levels. In our example, the interaction technique variable might have two levels—Simple Virtual Hand and Go-Go (section 7.4.1)—while the display FOV variable might have three levels—60, 80, and 100 degrees. In a factorial design (the most common design), each combination of the independent variables’ levels becomes a condition. In our example, there would be six conditions (two interaction techniques times three display FOVs).
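
To make the factorial structure concrete, the following is a minimal, purely illustrative Python sketch that enumerates the conditions for the example above; the variable names are our own and are not tied to any particular experiment framework.

```python
from itertools import product

# Hypothetical levels for the two independent variables in the example above
techniques = ["Simple Virtual Hand", "Go-Go"]
display_fovs_deg = [60, 80, 100]

# In a factorial design, each combination of levels is one condition.
conditions = [{"technique": t, "fov_deg": f}
              for t, f in product(techniques, display_fovs_deg)]

for i, c in enumerate(conditions, start=1):
    print(f"Condition {i}: {c['technique']}, {c['fov_deg']}-degree FOV")
# Prints 6 conditions (2 techniques x 3 FOVs)
```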

Experimenters must decide how many of the conditions each participant will experience. If there’s a likelihood that participants will have significant learning from one condition to the next, the best approach is a between-subjects design, meaning that each participant is exposed to only one of the conditions. If it’s important that participants be able to compare the conditions, and if learning is not expected, a within-subjects design might be appropriate, in which participants experience all the conditions. Hybrids are also possible, where some independent variables are between subjects and others are within subjects.

The goal of this sort of experimental design is to test the effects of the independent variables on the dependent variables. This includes main effects, where a single independent variable has a direct effect on a dependent variable, and interactions, where two or more independent variables have a combined effect on a dependent variable. To be able to find these effects, it is critical that other factors besides the independent variables are held constant; that is, that the only thing that changes from one condition to the next are the independent variables. It is also important to avoid confounds within an independent variable. For example, if one compared an HWD to a surround-screen display and found effects, it would be impossible to know whether those effects were due to differences in FOV, FOR, spatial resolution, weight of headgear, or any number of other differences between the two displays. The best experimental designs have independent variables whose levels only differ in one way. Readers wanting more detail on experimental design are encouraged to consult one of the many books on the subject, such as Cairns and Cox (2008).

Many of the formal experiments discussed in Part IV of this book are summative evaluations of 3D interaction techniques. For example, see Bowman, Johnson, and Hodges (1999) and Poupyrev, Weghorst, and colleagues (1997).

Questionnaires

A questionnaire (Hix and Hartson 1993) is a written set of questions used to obtain information from users before or after they have participated in a usability evaluation session. Questionnaires are good for collecting demographic information (e.g., age, gender, computer experience) and subjective data (e.g., opinions, comments, preferences, ratings) and are often more convenient and more consistent than spoken interviews.

In the context of 3D UIs, questionnaires are used quite frequently, especially to elicit information about subjective phenomena such as presence (Witmer and Singer 1998) or simulator sickness/cybersickness (Kennedy et al. 1993).

Interviews and Demos

An interview (Hix and Hartson 1993) is a technique for gathering information about users by talking directly to them. An interview can gather more information than a questionnaire and may go to a deeper level of detail. Interviews are good for getting subjective reactions, opinions, and insights into how people reason about issues. Structured interviews have a predefined flow of questions. Open-ended interviews permit the respondent (interviewee) to provide additional information, and they permit the interviewer to explore paths of questioning that arise spontaneously during the interview. Demonstrations (typically of a prototype) may be used in conjunction with user interviews to aid a user in talking about the interface.

In 3D UI evaluation, the use of interviews has not been studied explicitly, but informal interviews are often used at the end of formative or summative usability evaluations (Bowman and Hodges 1997).

11.3 Evaluation Metrics for 3D UIs

Now we turn to metrics. That is, how do we measure the characteristics of a 3D UI when evaluating it? We focus on metrics of usability: a 3D UI is usable when the user can reach her goals; when the important tasks can be done better, easier, or faster than with another system; and when users are not frustrated or uncomfortable. Note that all of these have to do with the user.

We discuss three types of metrics for 3D UIs: system performance metrics, task performance metrics, and subjective response metrics.

11.3.1 System Performance Metrics

System performance refers to typical computer or graphics system performance, using metrics such as frame rate, latency, network delay, and optical distortion. From the interface point of view, system performance metrics are really not important in and of themselves. Rather, they are important only insofar as they affect the user’s experience or tasks. For example, the frame rate of immersive visual displays needs to be high enough to avoid inducing cybersickness. Also, in a collaborative setting, task performance will likely be negatively affected if there is too much network delay.
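
Because these metrics only matter through their effect on the user, they are usually logged passively during evaluation sessions so that they can be reported alongside task and subjective measures. The sketch below is a minimal, hypothetical example of recording per-frame times; render_frame is a placeholder for whatever the application does each frame.

```python
import time

def run_instrumented_session(render_frame, duration_s=60.0):
    """Log per-frame times (in ms) for one evaluation session."""
    frame_times_ms = []
    session_start = time.perf_counter()
    while time.perf_counter() - session_start < duration_s:
        t0 = time.perf_counter()
        render_frame()  # application-specific update and rendering
        frame_times_ms.append((time.perf_counter() - t0) * 1000.0)
    mean_ms = sum(frame_times_ms) / len(frame_times_ms)
    print(f"mean frame time {mean_ms:.2f} ms (~{1000.0 / mean_ms:.1f} fps), "
          f"worst frame {max(frame_times_ms):.2f} ms")
    return frame_times_ms
```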

11.3.2 Task Performance Metrics

User task performance refers to the quality of performance of specific tasks in the 3D application, such as the time to navigate to a specific location, the accuracy of object placement, or the number of errors a user makes in selecting an object from a set. Task performance metrics may also be domain-specific. For example, evaluators may want to measure student learning in an educational application or spatial awareness in a military training VE.

Typically, speed and accuracy are the most important task performance metrics. The problem with measuring both speed and accuracy is that there is an implicit relationship between them: I can go faster but be less accurate, or I can increase my accuracy by decreasing my speed. It is assumed that for every task, there is some curve representing this speed/accuracy tradeoff, and users must decide where on the curve they want to be (even if they don’t do this consciously). In an evaluation, therefore, if you simply tell your participants to do a task as quickly and precisely as possible, they will probably end up all over the curve, giving you data with a high level of variability. Therefore, it is very important that you instruct users in a very specific way if you want them to be at one end of the curve or the other. Another way to manage the tradeoff is to tell users to do the task as quickly as possible one time, as accurately as possible the second time, and to balance speed and accuracy the third time. This gives you information about the tradeoff curve for the particular task you’re looking at.

11.3.3 Subjective Response Metrics

A subjective response refers to the personal perception and experience of the interface by the user (e.g., perceived ease of use, ease of learning, satisfaction, etc.). These responses are often measured via questionnaires or interviews and may be either qualitative (descriptive) or quantitative (numeric). Subjective response metrics are often related to emotional impact but generally contribute to usability as well. A usable application is one whose interface does not pose any significant barriers to task completion. UIs should be intuitive, provide good affordances (see Chapter 4, “General Principles of Human-Computer Interaction”), provide good feedback, be unobtrusive, and so on. An application cannot be effective unless users are willing to use it. It is possible for a 3D UI to provide functionality for the user to do a task, but a lack of attention to subjective user experience can keep it from being used.

For 3D UIs in particular, presence and user comfort (see Chapter 3, “Human Factors Fundamentals”, sections 3.3.6 and 3.5) can be important metrics that are not usually considered in traditional UI evaluation. Presence is a crucial but not very well understood metric for VE systems. It is the “feeling of being there”—existing in the virtual world rather than in the physical world. How can we measure presence? One method simply asks users to rate their feeling of being there on a 1 to 100 scale. Questionnaires can also be used and can contain a wide variety of questions, all designed to get at different aspects of presence. Psychophysical measures are used in controlled experiments where stimuli are manipulated and then correlated to users’ ratings of presence (for example, how does the rating change when the environment is presented in monaural versus stereo modes?). There are also some more objective measures. Some are physiological (how the body responds to the VE, for example via heart rate). Others might look at users’ reactions to events in the VE (e.g., does the user duck when he’s about to hit a virtual beam?). Tests of memory for the environment and the objects within it might give an indirect measurement of the level of presence. Finally, if we know a task for which presence is required, we can measure users’ performance on that task and infer the level of presence. There is still a great deal of debate about the definition of presence, the best ways to measure presence, and the importance of presence as a metric (e.g., Slater et al. 2009; Usoh et al. 2000; Witmer and Singer 1998).

The other novel subjective response metric for 3D systems is user comfort. This includes several different things. The most notable and well-studied is cybersickness (also called simulator sickness, because it was first noted in flight simulators, or VR sickness, to call out the problem specifically in the VR medium). It is symptomatically similar to motion sickness and may result from mismatches in sensory information (e.g., your eyes tell your brain that you are moving, but your vestibular system tells your brain that you are not moving). 3D applications may also result in physical aftereffects of exposure. For example, if a 3D UI misregisters the virtual hand and the real hand (they’re not at the same physical location), the user may have trouble doing precise manipulation in the real world after exposure to the virtual world, because the sensorimotor systems have adapted to the misregistration. Even more seriously, activities like driving or walking may be impaired after extremely long exposures. Finally, there are simple strains on arms/hands/eyes from the use of 3D devices. User comfort is also usually measured subjectively, using rating scales or questionnaires. The most famous questionnaire is the simulator sickness questionnaire (SSQ) developed by Kennedy et al. (1993). Researchers have used some objective measures in the study of aftereffects—for example, by measuring the accuracy of a manipulation task in the real world after exposure to a virtual world (Wann and Mon-Williams 2002).
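
As a small example of how questionnaire data become a metric, the sketch below converts raw SSQ subscale sums into weighted scores using the conversion weights published by Kennedy et al. (1993); the assignment of the 16 symptom items to the three (overlapping) subscales is omitted here and should be taken from the original paper.

```python
def ssq_scores(raw_nausea, raw_oculomotor, raw_disorientation):
    """Weighted SSQ scores from raw subscale sums (each symptom rated 0-3).

    Weights follow Kennedy et al. (1993); see the original paper for which
    of the 16 symptom items contribute to each (overlapping) subscale.
    """
    return {
        "nausea": raw_nausea * 9.54,
        "oculomotor": raw_oculomotor * 7.58,
        "disorientation": raw_disorientation * 13.92,
        "total": (raw_nausea + raw_oculomotor + raw_disorientation) * 3.74,
    }

# Example with purely illustrative raw sums for one participant
print(ssq_scores(raw_nausea=3, raw_oculomotor=4, raw_disorientation=2))
```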

11.4 Characteristics of 3D UI Evaluations

The approaches we discuss below for evaluation of 3D UIs have been developed and used in response to perceived differences between the evaluation of 3D UIs and the evaluation of traditional UIs such as GUIs. Many of the fundamental concepts and goals are similar, but use of these approaches in the context of 3D UIs is distinct. Here, we present some of the issues that differentiate 3D UI evaluation, organized into several categories. The categories contain overlapping considerations but provide a rough partitioning of these important issues. Note that many of these issues are not necessarily found in the literature but instead come from personal experience and extensive discussions with colleagues.

11.4.1 Physical Environment Issues

One of the most obvious differences between 3D UIs and traditional UIs is the physical environment in which that interface is used. In many 3D UIs, nontraditional input and output devices are used, which can preclude the use of some types of evaluation. Users may be standing rather than sitting, and they may be moving about a large space, using whole-body movements. These properties give rise to several issues for usability evaluation. Following are some examples:

In interfaces using non-see-through HWDs, the user cannot see the surrounding physical world. Therefore, the evaluator must ensure that the user will not bump into walls or other physical objects, trip over cables, or move outside the range of the tracking device (Viirre 1994). A related problem in surround-screen VEs, such as the CAVE (Cruz-Neira et al. 1993), is that the physical walls can be difficult to see because of projected graphics. Problems of this sort could contaminate the results of an evaluation (e.g., if the user trips while in the midst of a timed task) and more importantly could cause injury to the user. To mitigate risk, the evaluator can ensure that cables are bundled and will not get in the way of the user (e.g., cables may descend from above or a human cable wrangler may be necessary). Some modern systems also help mitigate this issue through the use of wireless technologies. Also, the user may be placed in a physical enclosure that limits movement to areas where there are no physical objects to interfere.

Many 3D displays do not allow multiple simultaneous viewers (e.g., user and evaluator), so hardware or software must be set up so that an evaluator can see the same image as the user. With an HWD, the user’s view is typically rendered to a normal monitor. In a surround-screen or workbench VE, a monoscopic view of the scene could be rendered to a monitor, or, if performance will not be adversely affected, both the user and the evaluator can be tracked (this can cause other problems, however; see section 11.4.2 on evaluator considerations). If images are viewed on a monitor, then it may be difficult to see and understand both the actions of the user and the view of the virtual environment at the same time. Another approach would be to use a green-screen setup to allow the evaluator to see the user in the context of the VE.

A common and very effective technique for generating important qualitative data during usability evaluation sessions is the “think-aloud” protocol (as described in Hix and Hartson 1993). With this technique, participants talk about their actions, goals, and thoughts regarding the interface while they are performing specific tasks. In some 3D UIs, however, speech recognition is used as an interaction technique, making the think-aloud protocol much more difficult and perhaps even impossible. Post-session interviews may help to recover some of the information that would have been obtained from the think-aloud protocol.

Another common technique involves recording video of both the user and the interface (as described in Hix and Hartson 1993). Because 3D UI users are often mobile, a single, fixed camera may require a very wide shot, which may not allow precise identification of actions. This could be addressed by using a tracking camera (with, unfortunately, additional expense and complexity) or a camera operator (additional personnel). Moreover, views of the user and the graphical environment must be synchronized so that cause and effect can clearly be seen in the video. Finally, recording the video of a stereoscopic graphics image can be problematic.

An ever-increasing number of 3D applications are collaborative and shared among two or more users (Glencross et al. 2007; Tang et al. 2010; Beck et al. 2013; Chen et al. 2015). These collaborative 3D UIs become even more difficult to evaluate than single-user 3D UIs due to physical separation of users (i.e., users can be in more than one physical location), the additional information that must be recorded for each user, the unpredictability of network behavior as a factor influencing usability, the possibility that each user will have different devices, and the additional complexity of the system, which may cause more crashes or other problems.

11.4.2 Evaluator Issues

A second set of issues relates to the role of the evaluator in a 3D UI evaluation. Because of the complexities and distinctive characteristics of 3D UIs, a study may require multiple evaluators, different evaluator roles and behaviors, or both. Following are some examples:

Many VEs attempt to produce a sense of presence in the user—that is, a feeling of actually being in the virtual world rather than the physical one. Evaluators or cable wranglers can cause breaks in presence if the user can sense them. In VEs using projected graphics, the user will see an evaluator if the evaluator moves into the user’s field of view. This is especially likely in a CAVE environment (Cruz-Neira et al. 1993) where it is difficult to see the front of a user (and thus the user’s facial expressions and detailed use of handheld devices) without affecting that user’s sense of presence. This may break presence, because the evaluator is not part of the virtual world. In any type of VE, touching or talking to the user can cause such breaks. If the evaluation is assessing presence or if presence is hypothesized to affect performance, then the evaluator must take care to remain unsensed during the evaluation.

When presence is deemed very important for a particular VE, an evaluator may not wish to intervene at all during an evaluation session. This means that the experimental application/interface must be robust enough that the session does not have to be interrupted to fix a problem. Also, instructions given to the user must be very detailed, explicit, and precise, and the evaluator should make sure the user has a complete understanding of the procedure and tasks before beginning the session.

3D UI hardware and software are often more complex and less robust than traditional UI hardware and software. Again, multiple evaluators may be needed to do tasks such as helping the user with display and input hardware, running the software that produces graphics and other output, recording data such as timings and errors, and recording critical incidents and other qualitative observations of a user’s actions.

Traditional UIs typically require only discrete streams of input (e.g., from the mouse and keyboard), but many 3D UIs include multimodal input, combining discrete events, gestures, voice, and/or whole-body motion. It is very difficult for an evaluator to observe these multiple input streams simultaneously and record an accurate log of the user’s actions. These challenges make automated data collection (see section 11.4.4) very important.

11.4.3 User Issues

There are also a large number of issues related to the user population used as participants in 3D UI usability evaluations. In traditional evaluations, participants are gleaned from the target user population of an application or from a similar representative group of people. Efforts are often made, for example, to preserve gender balance, to have a good distribution of ages, and to test both experts and novices if these differences are representative of the target user population. The nature of 3D UI evaluation, however, does not always allow for such straightforward selection of users. Following are some examples:

3D interaction techniques are still often a “solution looking for a problem.” Because of this, the target user population for a 3D application or interaction technique to be evaluated may not be known or well understood. For example, a study comparing two virtual travel techniques is not aimed at a particular set of users. Thus, it may be difficult to generalize performance results. The best course of action is to evaluate the most diverse user population possible in terms of age, gender, technical ability, physical characteristics, and so on, and to include these factors in any models of performance.

It may be difficult to differentiate between novice and expert users because there are very few potential participants who could be considered experts in 3D UIs (although this may be changing). In a research setting, most users who could be considered experts might be lab coworkers, whose participation in an evaluation could confound the results. Also, because most users are typically novices, evaluators can make no assumptions about a novice user’s ability to understand or use a given interaction technique or device.

Because 3D UIs will be novel to many potential participants, the results of an evaluation may exhibit high variability and differences among individuals. This means that the number of participants needed to obtain a good picture of performance may be larger than for traditional usability evaluations. If statistically significant results are required (depending on the type of usability evaluation being performed), the number of participants may be even greater.
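
One way to put a number on “how many participants” is an a priori power analysis. The sketch below assumes the statsmodels package and a simple two-group (between-subjects) comparison; the effect size is a planning guess that would normally come from pilot data or published studies.

```python
from statsmodels.stats.power import TTestIndPower

# Planning assumptions (illustrative): medium effect (Cohen's d = 0.5),
# alpha = 0.05, desired power = 0.8, two-group between-subjects comparison.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.8, alternative="two-sided")
print(f"~{n_per_group:.0f} participants per group")
# Higher variability among novice 3D UI users shrinks the standardized effect
# size, which pushes the required sample size up.
```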

Researchers are still studying a large design space for 3D interaction techniques and devices. Because of this, evaluations often compare two or more techniques, devices, or combinations of the two. To perform such evaluations using a within-subjects design, users must be able to adapt to a wide variety of situations. If a between-subjects design is used, a larger number of subjects will again be needed.

3D UI evaluations must consider the effects of cybersickness and fatigue on participants. Although some of the causes of cybersickness are known, there are still no predictive models for it (Kennedy et al. 2000), and little is known regarding acceptable exposure time to 3D UIs. For evaluations, then, a worst-case assumption must be made. A lengthy experiment must contain planned rest breaks and contingency plans in case of ill or fatigued participants. Shortening the experiment is often not an option, especially if statistically significant results are needed.

Because it is not known exactly what 3D UI situations cause sickness or fatigue, most 3D UI evaluations should include some measurement (e.g., subjective, questionnaire-based, or physiological) of these factors. A result indicating that an interaction technique was 50% faster than any other evaluated technique would be severely misleading if that interaction technique also made 30% of participants sick. Thus, user comfort measurements should be included in low-level 3D UI evaluations.

Presence is another example of a measure often required in VE evaluations that has no analog in traditional UI evaluation. VE evaluations must often take into account subjective reports of perceived presence, perceived fidelity of the virtual world, and so on. Questionnaires (Usoh et al. 2000; Witmer and Singer 1998) have been developed that purportedly obtain reliable and consistent measurements of such factors. Chapter 3, “Human Factors Fundamentals”, section 3.4.3 also discusses some techniques that can be used to measure presence.

11.4.4 Evaluation Type Issues

Traditional UI evaluation can take many forms. These include informal user studies, formal experiments, task-based usability studies, heuristic evaluations, and the use of predictive models of performance (see Chapter 4, “General Principles of Human-Computer Interaction,” and section 11.2 above for further discussion of these types of evaluations). There are several issues related to the use of various types of usability evaluation in 3D UIs. Following are some examples:

Because of the complexity and novelty of 3D UIs, automated data collection of system performance and task performance metrics by the 3D UI software or ancillary software is nearly a necessity. For example, several issues above have noted the need for more than one evaluator or video recording. Automated data collection can reduce the need for multiple evaluators during a single session. Additionally, automated data collection is more accurate than evaluator-based observations. Task time can be measured in milliseconds (instead of seconds), and predefined errors are always identified and counted. The major limiting factor of automated data collection is the additional programming required to properly identify and record performance metrics. Often this requires close integration with the key events of an interaction technique or interface.
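
A minimal sketch of such instrumentation is shown below (hypothetical; the class and event names are placeholders). The idea is simply that the interaction technique’s key events call into a logger that records millisecond task times and predefined errors per trial.

```python
import csv
import time

class TrialLogger:
    """Minimal automated collection of per-trial task time (ms) and error counts."""

    def __init__(self, path):
        self._file = open(path, "w", newline="")
        self._writer = csv.writer(self._file)
        self._writer.writerow(["participant", "condition", "trial", "time_ms", "errors"])
        self._t0 = None
        self._errors = 0

    def start_trial(self):
        self._t0 = time.perf_counter()
        self._errors = 0

    def log_error(self):
        # Called from the interaction technique's predefined error events,
        # e.g., selecting the wrong object.
        self._errors += 1

    def end_trial(self, participant, condition, trial):
        elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        self._writer.writerow([participant, condition, trial,
                               round(elapsed_ms, 1), self._errors])
        self._file.flush()
```

Calls to start_trial, log_error, and end_trial would be hooked into the key events of the technique or interface, which is the integration work noted above.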

Evaluations based solely on heuristics (i.e., design guidelines), performed by usability experts, are very difficult in 3D UIs because of a lack of published verified guidelines for 3D UI design. There are some notable exceptions (McMahan et al. 2014; Bowman et al. 2008; Teather and Stuerzlinger 2008; Conkar et al. 1999; Gabbard 1997; Kaur 1999; Kaur et al. 1999; Mills and Noyes 1999; Stanney and Reeves 2000), but for the most part, it is difficult to predict the usability of a 3D interface without studying real users attempting representative tasks in the 3D UI. It is not likely that a large number of heuristics will appear, at least not until 3D input and output devices become more standardized. Even assuming standardized devices, however, the design space for 3D interaction techniques and interfaces is very large, making it difficult to produce effective and general heuristics to use as the basis for evaluation.

Heuristic evaluations and other forms of usability inspections are analytic in nature and are typically carried out by experts through the examination of prototypes. If very early prototypes are being used (e.g., storyboards or video prototypes), it is very difficult for experts to perform this sort of analysis with confidence, because 3D UIs must be experienced first-hand before their user experience can be understood. Thus, the findings of early heuristic evaluations should not be taken as gospel.

As we’ve noted several times in this book, the devil is in the details of 3D UI design. Thus, designers should not place too much weight on the results of any formative 3D UI evaluation based on a rough prototype without all the interaction details fully specified.

Another major type of evaluation that does not employ users is the application of performance models (e.g., GOMS, Fitts’s law). Aside from a few examples (Kopper et al. 2010; Zhai and Woltjer 2003), very few models of this type have been developed for or adapted to 3D UIs. However, the lower cost of both heuristic evaluation and performance model application makes them attractive for evaluation.

When performing statistical experiments to quantify and compare the usability of various 3D interaction techniques, input devices, interface elements, and so on, it is often difficult to know which factors have a potential impact on the results. Besides the primary independent variables (i.e., interaction technique), a large number of other potential factors could be included, such as environment, task, system, or user characteristics. One approach is to try to vary as many of these potentially important factors as possible during a single experiment. This “testbed evaluation” approach (Bowman et al. 1999; Snow and Williges 1998) has been used with some success (see section 11.6.2). The other extreme would be to simply hold as many of these other factors as possible constant and run the evaluation only in a particular set of circumstances. Thus, statistical 3D UI experimental evaluations may be either overly complex or overly simplistic—finding the proper balance is difficult.

11.4.5 General Issues

General issues related to 3D UI evaluation include the following:

3D UI usability evaluations generally focus at a lower level than traditional UI evaluations. In the context of GUIs, a standard look and feel and a standard set of interface elements and interaction techniques exist, so evaluation usually looks at subtle interface nuances, overall interface metaphors, information architecture, functionality, or other higher-level issues. In 3D UIs, however, there are no interface standards. Therefore, 3D UI evaluations most often evaluate lower-level components, such as interaction techniques or input devices.

It is tempting to overgeneralize the results of evaluations of 3D interaction performed in a generic (non-application) context. However, because of the fast-changing and complex nature of 3D UIs, one cannot assume anything (display type, input devices, graphics processing power, tracker accuracy, etc.) about the characteristics of a real 3D application. Everything has the potential to change. Therefore, it is important to report on information about the apparatus with which the evaluation was performed and to evaluate with a range of setups (e.g., using different devices) if possible.

In general, 3D UI evaluations require users as participants. It is the responsibility of 3D UI evaluators to ensure that the proper steps are taken to protect these human subjects. This process involves having well-defined procedures and obtaining approval to conduct the evaluation from an Institutional Review Board (IRB) or a similar ethics committee.

11.5 Classification of Evaluation Methods

A classification space for 3D UI usability evaluation methods can provide a structured means for comparing evaluation methods. One such space classifies methods according to three key characteristics: involvement of representative users, context of evaluation, and types of results produced (Figure 11.1).

The first characteristic discriminates between those methods that require the participation of representative users (to provide design- or use-based experiences and opinions) and those methods that do not (methods not requiring users still require a usability expert). The second characteristic describes the type of context in which the evaluation takes place. In particular, this characteristic identifies those methods that are applied in a generic context and those that are applied in an application-specific context. The context of evaluation inherently imposes restrictions on the applicability and generality of results. Thus, conclusions or results of evaluations conducted in a generic context can typically be applied more broadly (i.e., to more types of interfaces) than results of an application-specific evaluation method, which may be best suited for applications that are similar in nature. The third characteristic identifies whether or not a given usability evaluation method produces (primarily) qualitative or quantitative results.

Figure 11.1 A classification of usability evaluation methods for 3D UIs. (Image reprinted by permission of MIT Press and Presence: Teleoperators and Virtual Environments)

Note that these characteristics are not designed to be mutually exclusive and are instead designed to convey one (of many) usability evaluation method characteristics. For example, a particular usability evaluation method may produce both quantitative and qualitative results. Indeed, many identified methods are flexible enough to provide insight at many levels. These three characteristics were chosen (over other potential characteristics) because they are often the most significant (to evaluators) due to their overall effect on the usability process. That is, a researcher interested in undertaking usability evaluation will likely need to know what the evaluation will cost, what the impact of the evaluation will be, and how the results can be applied. Each of the three characteristics addresses these concerns: degree of user involvement directly affects the cost to proctor and analyze the evaluation; results of the process indicate what type of information will be produced (for the given cost); and context of evaluation inherently dictates to what extent results may be applied.

This classification is useful on several levels. It structures the space of evaluation methods and provides a practical vocabulary for discussion of methods in the research community. It also allows researchers to compare two or more methods and understand how they are similar or different on a fundamental level. Finally, it reveals “holes” in the space (Card et al. 1990)—combinations of the three characteristics that have rarely or never been tried in the 3D UI community.

Figure 11.1 shows that there are two such holes in this space (the shaded boxes). More specifically, there is a lack of current 3D UI usability evaluation methods that do not require users and that can be applied in a generic context to produce quantitative results (upper right of the figure). Note that some possible existing 2D and GUI evaluation methods are listed in parentheses, but few, if any, of these methods have been applied to 3D UIs. Similarly, there appears to be no method that provides quantitative results in an application-specific setting that does not require users (third box down on the right of the figure). These areas may be interesting avenues for further research.

11.6 Three Multimethod Approaches

A shortcoming of the classification of evaluation methods presented in section 11.5 is that it does not convey when in the UX lifecycle a method is best applied or how several methods may be applied. In most cases, answers to these questions cannot be determined without a comprehensive understanding of each of the methods, as well as the specific goals and circumstances of the 3D UI research or development effort. In this section, we present three well-developed 3D UI evaluation approaches and compare them in terms of practical usage and results.

11.6.1 Sequential Evaluation Approach

Gabbard et al. (1999) present a sequential approach to usability evaluation for specific 3D applications. The sequential evaluation approach is a UX engineering approach and addresses both design and evaluation of 3D UIs. However, for the scope of this chapter, we focus on different types of evaluation and address analysis, design, and prototyping only when they have a direct effect on evaluation.

Although some of its components are well suited for evaluation of generic interaction techniques, the complete sequential evaluation approach employs application-specific guidelines, domain-specific representative users, and application-specific user tasks to produce a usable and useful interface for a particular application. In many cases, results or lessons learned may be applied to other, similar applications (for example, 3D applications with similar display or input devices, or with similar types of tasks). In other cases (albeit less often), it is possible to abstract the results for general use. You should consider an approach like sequential evaluation if you are designing a production 3D application (i.e., a targeted application that will be used in the real world by real users).

Sequential evaluation evolved from iteratively adapting and enhancing existing 2D and GUI usability evaluation methods. In particular, it modifies and extends specific methods to account for complex interaction techniques, nonstandard and dynamic UI components, and multimodal tasks inherent in 3D UIs. Moreover, the adapted/extended methods both streamline the UX engineering process and provide sufficient coverage of the usability space. Although the name implies that the various methods are applied in sequence, there is considerable opportunity to iterate both within a particular method as well as among methods.

It is important to note that all the pieces of this approach have been used for years in GUI usability evaluations. The unique contribution of the work in Gabbard et al. (1999) is the breadth and depth offered by progressive use of these techniques, adapted when necessary for 3D UI evaluation, in an application-specific context. Further, the way in which each step in the progression informs the next step is an important finding: the ordering of the methods guides developers toward a usable application.

Figure 11.2 presents the sequential evaluation approach. It allows developers to improve a 3D UI through a combination of expert-based and user-based techniques. This approach is based on sequentially performing user task analysis (see Figure 11.2, state 1), heuristic (or guideline-based expert) evaluation (Figure 11.2, state 2), formative evaluation (Figure 11.2, state 3), and summative evaluation (Figure 11.2, state 4), with iteration as appropriate within and among each type of evaluation. This approach leverages the results of each individual method by systematically defining and refining the 3D UI in a cost-effective progression.

Figure 11.2 Sequential evaluation approach. (Image reprinted by permission of MIT Press and Presence: Teleoperators and Virtual Environments)

Depending upon the nature of the application, this sequential evaluation approach may be applied in a strictly serial approach (as Figure 11.2’s solid black arrows illustrate) or iteratively applied (either as a whole or per-individual method, as Figure 11.2’s gray arrows illustrate) many times. For example, when used to evaluate a complex command-and-control battlefield visualization application (Hix et al. 1999), user task analysis was followed by significant iterative use of heuristic and formative evaluation and lastly followed by a single broad summative evaluation.

From experience, this sequential evaluation approach provides cost-effective assessment and refinement of usability for a specific 3D application. Obviously, the exact cost and benefit of a particular evaluation effort depends largely on the application’s complexity and maturity. In some cases, cost can be managed by performing quick lightweight formative evaluations (which involve users and thus are typically the most time-consuming to plan and perform). Moreover, by using a “hallway methodology,” user-based methods can be performed quickly and cost-effectively by simply finding volunteers from within one’s own organization. The hallway methodology should be used only as a last resort or in cases where the representative user class includes just about anyone. When it is used, care should be taken to ensure that “hallway” users provide a close representative match to the application’s ultimate end users.

The individual methods involved in sequential evaluation are described earlier in Chapter 4 (user task analysis in section 4.4.2 and heuristic, formative, and summative evaluation in section 4.4.5), as well as in section 11.2 above.

Example of Approach

The sequential evaluation approach has been applied to several 3D UIs, including the Naval Research Lab’s Dragon application: a VE for battlefield visualization (Gabbard et al. 1999). Dragon is presented on a workbench (see Chapter 5, “3D User Interface Output Hardware,” section 5.2.2) that provides a 3D display for observing and managing battle-space information shared among commanders and other battle planners. The researchers performed several formative evaluations over a nine-month period, using one to three users and two to three evaluators per session. Each evaluation session revealed a set of usability problems and generated a corresponding set of recommendations. The developers would address the recommendations and produce an improved UI for the next iteration of evaluation. The researchers performed four major cycles of iteration during the evaluation of Dragon, each cycle using the progression of usability methods described in this section.

During the expert guidelines-based evaluations, various user interaction design experts worked alone or collectively to assess the evolving user interaction design for Dragon. The expert evaluations uncovered several major design problems that are described in detail in Hix et al. (1999). Based on user task analysis and early expert guideline-based evaluations, the researchers created a set of user task scenarios specifically for battlefield visualization. During each formative session, there were at least two and often three evaluators present. Although both the expert guideline-based evaluation sessions and the formative evaluation sessions were personnel intensive (with two or three evaluators involved), it was found that the quality and amount of data collected by multiple evaluators greatly outweighed the cost of those evaluators.

Finally, the summative evaluation statistically examined the effect of four factors: locomotion metaphor (egocentric versus exocentric), gesture control (rate control versus position control), visual presentation device (workbench, desktop, CAVE), and stereopsis (present versus not present). The results of these efforts are described in Hix and Gabbard (2002). This experience with sequential evaluation demonstrated its utility and effectiveness.

11.6.2 Testbed Evaluation Approach

Bowman and Hodges (1999) take the approach of empirically evaluating interaction techniques outside the context of applications (i.e., within a generic context rather than within a specific application) and add the support of a framework for design and evaluation, which we summarize here. This testbed evaluation approach is primarily aimed at researchers who are attempting to gain an in-depth understanding of interaction techniques and input devices, along with the tradeoffs inherent in their designs as they are used in different scenarios.

Principled, systematic design and evaluation frameworks give formalism and structure to research on interaction; they do not rely solely on experience and intuition. Formal frameworks provide us not only with a greater understanding of the advantages and disadvantages of current techniques but also with better opportunities to create robust and well-performing new techniques based on knowledge gained through evaluation. Therefore, this approach follows several important evaluation concepts, elucidated in the following sections. Figure 11.3 presents an overview of this approach.

Figure 11.3 Testbed evaluation approach. (Image reprinted by permission of MIT Press and Presence: Teleoperators and Virtual Environments)

Initial Evaluation

The first step toward formalizing the design, evaluation, and application of interaction techniques is to gain an intuitive understanding of the generic interaction tasks in which one is interested and current techniques available for the tasks (see Figure 11.3, state 1). This is accomplished through experience using interaction techniques and through observation and evaluation of groups of users. These initial evaluation experiences are heavily drawn upon for the processes of building a taxonomy, listing outside influences on usability, and listing performance measures. It is helpful, therefore, to gain as much experience of this type as possible so that good decisions can be made in the next phases of formalization.

Taxonomy

The next step is to establish a taxonomy (Figure 11.3, state 2) of interaction techniques for the interaction task being evaluated. In particular, the testbed approach uses task-decomposition taxonomies, like the ones presented in Chapter 7, “Selection and Manipulation,” (section 7.3.1) and Chapter 8, “Travel,” (section 8.3.1). For example, the task of changing an object’s color might be made up of three subtasks: selecting an object, choosing a color, and applying the color. The subtask for choosing a color might have two possible technique components: changing the values of R, G, and B sliders or touching a point within a 3D color space. The subtasks and their related technique components make up a taxonomy for the object coloring task.

Ideally, the taxonomies established by this approach need to be correct, complete, and general. Any interaction technique that can be conceived for the task should fit within the taxonomy. Thus, subtasks will necessarily be abstract. The taxonomy will also list several possible technique components for each of the subtasks, but not necessarily every conceivable component.

Building taxonomies is a good way to understand the low-level makeup of interaction techniques and to formalize differences between them, but once they are in place, they can also be used in the design process. One can think of a taxonomy not only as a characterization, but also as a design space. Because a taxonomy breaks the task down into separable subtasks, a wide range of designs can be considered quickly by simply trying different combinations of technique components for each of the subtasks. There is no guarantee that a given combination will make sense as a complete interaction technique, but the systematic nature of the taxonomy makes it easy to generate designs and to reject inappropriate combinations.
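
To illustrate the “taxonomy as design space” idea, the sketch below enumerates every combination of technique components for the object-coloring example; the “choose color” components come from the example above, while the selection and application components are placeholders we invented for illustration.

```python
from itertools import product

# Object-coloring taxonomy: each subtask maps to candidate technique components.
taxonomy = {
    "select object": ["ray-casting", "simple virtual hand"],
    "choose color": ["RGB sliders", "touch point in 3D color space"],
    "apply color": ["press button", "drag color onto object"],
}

subtasks = list(taxonomy)
design_space = [dict(zip(subtasks, combo))
                for combo in product(*taxonomy.values())]

print(f"{len(design_space)} candidate designs")  # 2 * 2 * 2 = 8
for design in design_space:
    print(design)
# Not every combination is sensible as a complete technique; the enumeration
# just makes candidates easy to generate and to reject.
```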

Outside Factors

Interaction techniques cannot be evaluated in a vacuum. The usability of a technique may depend on a variety of factors (Figure 11.3, state 3), of which the interaction technique is but one. In order for the evaluation framework to be complete, such factors must be included explicitly and used as secondary independent variables in evaluations. Bowman and Hodges (1999) identified four categories of outside factors.

First, task characteristics are those attributes of the task that may affect usability, including distance to be traveled or size of the object being manipulated. Second, the approach considers environment characteristics, such as the number of obstacles and the level of activity or motion in the 3D scene. User characteristics, including cognitive measures such as spatial ability and physical attributes such as arm length, may also have effects on usability. Finally, system characteristics, such as the lighting model used or the mean frame rate, may be significant.

Performance Metrics

This approach is designed to obtain information about human performance in common 3D interaction tasks—but what is performance? Speed and accuracy are easy to measure, are quantitative, and are clearly important in the evaluation of interaction techniques, but there are also many other performance metrics (Figure 11.3, state 4) to be considered. Thus, this approach also considers subjective usability metrics, such as perceived ease of use, ease of learning, and user comfort. The choice of interaction technique could conceivably affect all of these, and they should not be discounted. Also, more than any other current computing paradigm, 3D UIs involve the user’s senses and body in the task. Thus, a focus on user-centric performance measures is essential. If an interaction technique does not make good use of human skills or if it causes fatigue or discomfort, it will not provide overall usability, despite its performance in other areas.

Testbed Evaluation

Bowman and Hodges (1999) use testbed evaluation (Figure 11.3, state 5) as the final stage in the evaluation of interaction techniques for 3D interaction tasks. This approach allows generic, generalizable, and reusable evaluation through the creation of testbeds—environments and tasks that involve all important aspects of a task, that evaluate each component of a technique, that consider outside influences (factors other than the interaction technique) on performance, and that have multiple performance measures. A testbed experiment uses a formal factorial experimental design and normally requires a large number of participants. If many interaction techniques or outside factors are included in the evaluation, the number of trials per subject can become overly large, so interaction techniques are usually a between-subjects variable (each subject uses only a single interaction technique), while other factors are within-subjects variables.
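The following sketch illustrates, with hypothetical techniques, factors, and levels, what such a mixed factorial design might look like in code: the interaction technique is assigned between subjects, while the outside factors are fully crossed within subjects and the trial order is randomized.

from itertools import product
import random

# A minimal sketch of a mixed factorial testbed design; techniques, factor
# names, and levels are hypothetical.
techniques = ["ray-casting", "go-go", "HOMER"]        # between-subjects variable
within_factors = {                                    # within-subjects variables
    "object distance":    ["near", "far"],
    "object size":        ["small", "large"],
    "distractor density": ["low", "high"],
}
trials_per_cell = 3

def schedule_for(participant_id: int):
    technique = techniques[participant_id % len(techniques)]  # round-robin group assignment
    cells = list(product(*within_factors.values()))
    trials = [dict(zip(within_factors, cell), technique=technique)
              for cell in cells for _ in range(trials_per_cell)]
    random.shuffle(trials)                            # randomize trial order per participant
    return trials

print(len(schedule_for(0)))  # 2 * 2 * 2 cells * 3 trials = 24 trials per participant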

Application and Generalization of Results

Testbed evaluation produces a set of results or models (Figure 11.3, state 6) that characterize the usability of an interaction technique for the specified task. Usability is given in terms of multiple performance metrics with respect to various levels of outside factors. These results become part of a performance database for the interaction task, with more information being added to the database each time a new technique is run through the testbed. These results can also be generalized into heuristics or guidelines (Figure 11.3, state 7) that can easily be evaluated and applied by 3D UI developers. Many of the guidelines presented in earlier chapters are generalizations of the results of such experiments.

The last step is to apply the performance results to 3D applications (Figure 11.3, state 8) with the goal of making them more useful and usable. In order to choose interaction techniques for applications appropriately, one must understand the interaction requirements of the application. There is no single best technique, because the technique that is best for one application may not be optimal for another application with different requirements. Therefore, applications need to specify their interaction requirements before the most appropriate interaction techniques can be chosen. This specification is done in terms of the performance metrics that have already been defined as part of the formal framework. Once the requirements are in place, the performance results from testbed evaluation can be used to recommend interaction techniques that meet those requirements.
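As a rough illustration of this matching process, the sketch below assumes a hypothetical performance database and a hypothetical set of application requirements expressed as weights on the same performance metrics; it simply ranks the techniques by a weighted score. A real recommendation would be more nuanced, but the principle is the same.

# A minimal sketch of matching application requirements to testbed results.
# All scores and weights are illustrative, normalized so that higher is better.
performance_db = {
    "ray-casting": {"speed": 0.9, "accuracy": 0.6, "comfort": 0.8},
    "go-go":       {"speed": 0.7, "accuracy": 0.8, "comfort": 0.6},
    "HOMER":       {"speed": 0.8, "accuracy": 0.7, "comfort": 0.7},
}

# The application specifies its interaction requirements as weights on the
# same performance metrics used in the testbed.
requirements = {"speed": 0.2, "accuracy": 0.6, "comfort": 0.2}

def score(technique: str) -> float:
    return sum(requirements[m] * performance_db[technique][m] for m in requirements)

ranked = sorted(performance_db, key=score, reverse=True)
print(ranked)  # techniques ordered by fit to this application's requirements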

Examples of Approach

Although testbed evaluation could be applied to almost any type of interactive system, it is especially appropriate for 3D UIs because of its focus on low-level interaction techniques. We’ll summarize two testbed experiments that were performed to compare techniques for the tasks of travel (Bowman et al. 1999) and selection/manipulation (Bowman and Hodges 1999).

The travel testbed experiment compared seven different travel techniques for the tasks of naïve search and primed search. In the primed search trials, the initial visibility of the target and the required accuracy of movement were also varied. The dependent variables were time for task completion and subjective user comfort ratings. Forty-four participants took part in the experiment. The researchers gathered both demographic and spatial ability information for each subject.

The selection/manipulation testbed compared the usability and performance of nine different interaction techniques. For selection tasks, the independent variables were distance from the user to the object, size of the object, and density of distracter objects. For manipulation tasks, the required accuracy of placement, the required degrees of freedom, and the distance through which the object was moved were varied. The dependent variables in this experiment were the time for task completion, the number of selection errors, and subjective user comfort ratings. Forty-eight participants took part, and the researchers again obtained demographic data and spatial ability scores.

In both instances, the testbed approach produced unexpected and interesting results that would not have been revealed by a simpler experiment. For example, in the selection/manipulation testbed, it was found that selection techniques using an extended virtual hand (see Chapter 7, “Selection and Manipulation,” section 7.4.1) performed well with larger, nearer objects and more poorly with smaller, farther objects, while selection techniques based on ray-casting (section 7.5.1) performed well regardless of object size or distance. The testbed environments and tasks have also proved to be reusable. The travel testbed was used to evaluate a new travel technique and compare it to existing techniques, while the manipulation testbed was reused to evaluate the usability of common techniques in the context of different VE display devices.

11.6.3 Component Evaluation Approach

Another framework-based evaluation approach is to focus on the stages of action and the components that affect those stages. McMahan et al. (2015) provide a system-centric adaptation of Norman’s Seven Stages of Action (Norman 2013) called the User-System Loop as the basis of this approach (see Chapter 4, section 4.2.2). Every interaction begins with user actions, which input devices sense or are directly manipulated by; transfer functions then interpret this input as meaningful system effects. Those effects alter the data and models underlying the simulation’s objects, physics, and artificial intelligence. Rendering software captures aspects of the simulation’s updated state and sends commands to output devices to create sensory stimuli for the user to perceive.

At each stage of the User-System Loop, there are components that affect the overall interaction and usability of the system. For example, at the output devices stage, the visual display’s field of view will affect how much of the 3D UI the user can see at once. Whether the display is stereoscopic or not will also affect the user’s depth cues. At the transfer function stage, if the inputs sensed by the input devices are directly mapped to the effect’s output properties in a one-to-one manner, the interaction will afford direct manipulation (Shneiderman 1998). On the other hand, if the inputs are scaled and manipulated in a dramatic or nonlinear fashion when being mapped, the interaction will be a “magic” technique, like the ones described in Chapter 10, section 10.3.3. As discussed in the previous chapters, such changes to the interaction can dramatically affect the user experience.
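To illustrate the difference, the following sketch contrasts a one-to-one transfer function with a nonlinear “magic” mapping in the spirit of the Go-Go arm-extension technique; the threshold and gain values are illustrative assumptions, not taken from any particular implementation.

# A minimal sketch contrasting a one-to-one transfer function with a
# nonlinear "magic" mapping; threshold and gain are illustrative.

def direct_mapping(hand_distance: float) -> float:
    """Virtual hand distance equals physical hand distance (direct manipulation)."""
    return hand_distance

def nonlinear_mapping(hand_distance: float, threshold: float = 0.4, k: float = 6.0) -> float:
    """Beyond a threshold, amplify reach quadratically (a 'magic' technique)."""
    if hand_distance <= threshold:
        return hand_distance
    return hand_distance + k * (hand_distance - threshold) ** 2

for d in (0.2, 0.4, 0.6, 0.8):  # physical hand distance from the torso, in meters
    print(d, direct_mapping(d), round(nonlinear_mapping(d), 2))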

Evaluating Components

Like testbed evaluation, component evaluation explicitly investigates specific components while keeping other factors the same. First, this helps to prevent confounding variables (factors not being investigated) from affecting the results of the evaluations. For example, when comparing a surround-screen-based 3D UI to an HWD-based 3D UI, there are multiple confounding variables that may contribute to any significant differences between the UIs, such as field of view, the total size of the visual field surrounding the user, latency, brightness, and the weight of the stereo glasses compared to the weight of the HWD. By explicitly controlling components not being investigated so they are the same, component evaluations avoid potential confounds.

Second, component evaluation yields important knowledge about how to best use components together if the evaluation is used to investigate multiple components. As an example, McMahan et al. (2012) found that conditions with mixed input and output components, such as using a mouse with a visual field completely surrounding the user, yielded significantly lower user performance in a first-person shooter game than conditions with matched pairs of input and output components, such as using a 6-DOF wand with a visual field completely surrounding the user. This interesting result would not have been revealed by a simpler or less-controlled experiment.

Finally, component evaluation will also provide information on the individual effects of a component, if it has no interaction effects with the other investigated components. For example, Ragan et al. (2015) investigated both field of view and visual complexity—the amount of detail, clutter, and objects in a scene (Oliva et al. 2004)—and did not find a significant interaction effect between the components for affecting how users learned a prescribed strategy for a visual scanning task. However, the researchers did find that visual complexity had an individual effect on how users learned the strategy, with users who trained with higher visual complexity demonstrating the strategy better than those who trained with lower complexity.

Applying Component Results

With its focus on controlled experiments to understand the effects of individual components and combinations of components, the component evaluation approach is primarily designed for researchers. It results in general knowledge that can be applied to the design of many 3D UIs.

While the sequential evaluation approach focuses on improving a particular application through iteration and the testbed evaluation approach focuses on characterizing the usability of specific interaction techniques, the results of the component evaluation approach could be used to improve a particular application or to generalize the effects of a specific component for most 3D UIs and 3D interaction techniques. During the design and iterative development of a 3D UI, design choices will present themselves, and the best choice may not be obvious. For example, a large field of view may provide a more realistic view of the VE but may also increase cybersickness. A component evaluation that specifically investigates the effects of field of view could be used to determine if a smaller field of view would still provide a realistic view of the VE while inducing less cybersickness for the given 3D UI application. Alternatively, a series of component evaluations could be used with several 3D UI applications to generalize the effects of field of view and its tradeoffs between realistic views and cybersickness.

Defining Components

A major consideration when using the component evaluation approach is how to define the components that affect each stage of action. For example, the GOMS model (Card et al. 1983), described in Chapter 4, could be extended to define components as the operators required at each stage of action. Component evaluations using this strategy would yield information and knowledge about how elementary actions affect interactions within a system.

A powerful method is to define the components in terms of fidelity, or realism (Bowman and McMahan 2007; McMahan et al. 2010, 2012, 2015; McMahan 2011; Bowman et al. 2012; Laha et al. 2013, 2014; Ragan et al. 2015; Nabiyouni et al. 2015). McMahan et al. (2015) have identified three categories of fidelity components based on the User-System Loop—interaction fidelity, scenario fidelity, and display fidelity, which we discuss further in the following sections.

Interaction Fidelity Components

McMahan et al. (2012) define interaction fidelity as the objective degree of exactness with which real-world actions are reproduced in an interactive system. For example, the real walking technique (see Chapter 8, section 8.4.1) provides one of the highest levels of interaction fidelity, while the walking-in-place technique (see Chapter 8, section 8.4.2) provides a lower level. Interaction fidelity directly plays a role during the stages of user actions, input devices, and transfer functions.

McMahan (2011) presents a framework of interaction fidelity components referred to as the Framework for Interaction Fidelity Analysis (FIFA). In the most recent version of FIFA (McMahan et al. 2015), the components of interaction fidelity are divided into the three subcategories of biomechanical symmetry, input veracity, and control symmetry.

Biomechanical symmetry is the objective degree of exactness with which real-world body movements for a task are reproduced during interaction (McMahan 2011). It consists of three components based on aspects of biomechanics: anthropometric symmetry, kinematic symmetry, and kinetic symmetry. The anthropometric symmetry component is the objective degree of exactness with which body segments involved in a real-world task are required by an interaction technique. For instance, walking in place has a high degree of anthropometric symmetry because it involves the same body segments (i.e., thighs, legs, and feet) as walking in the real world. The kinematic symmetry component is the objective degree of exactness with which a body motion for a real-world task is produced during an interaction. For example, walking in place has a lower degree of kinematic symmetry than real walking, as it does not incorporate the swing phase of the gait cycle, whereas real walking does. Finally, kinetic symmetry is the objective degree of exactness with which the forces involved in a real-world action are reproduced during interaction. The Virtusphere, for instance, requires large internal muscle forces to start the device rolling and to stop it, due to its weight and inertia (Nabiyouni et al. 2015).

McMahan et al. (2015) define input veracity as the objective degree of exactness with which the input devices capture and measure the user’s actions. It also consists of three components: accuracy, precision, and latency. The accuracy of an input device is how close its readings are to the “true” values it senses (Taylor 1997). Accuracy is especially important for interacting with a system that coincides with the real world, such as augmented reality. For AR, aligning virtual images and objects with the user’s view of the real world (i.e., registration) is one of the greatest challenges (Kipper et al. 2012). The precision of an input device, also called its “repeatability”, is the degree to which reported measurements in the same conditions yield the same results (Taylor 1997). A lack of precision is often associated with jitter (Ragan et al. 2009). The last component of input veracity is latency, which is the temporal delay between user input and sensory feedback generated by the system in response to it (Friston and Steed 2014). Several research studies have found latency to have a negative effect on user performance (Allison et al. 2001; Ellis et al. 1999).
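As a simple illustration, the sketch below computes rough estimates of all three input-veracity components for a positional tracker; the sample readings, timestamps, and ground-truth value are hypothetical.

import statistics

# A minimal sketch of quantifying the three input-veracity components for a
# positional tracker; all values are hypothetical (positions in meters,
# timestamps in milliseconds).
true_position = 1.000
samples = [1.012, 0.995, 1.004, 1.009, 0.998]  # repeated readings of a fixed target

accuracy_error = abs(statistics.mean(samples) - true_position)  # closeness to the true value
precision_jitter = statistics.stdev(samples)                    # spread across repeated readings

motion_onset_ms = 120.0    # when the user's hand actually moved
feedback_onset_ms = 172.0  # when the display first reflected that motion
latency_ms = feedback_onset_ms - motion_onset_ms                # end-to-end delay

print(accuracy_error, precision_jitter, latency_ms)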

Finally, McMahan (2011) defines control symmetry as the objective degree of exactness with which control in a real-world task is provided by an interaction. In the most recent version of FIFA (McMahan et al. 2015), the control symmetry subcategory has a single component—transfer function symmetry, which is the objective degree of exactness with which a real-world transfer function is reproduced through interaction. As most real-world actions do not involve an actual transfer function, McMahan et al. (2015) use the component to compare the transfer functions of interaction techniques to the physical properties affected during the real-world action counterpart.

In summary, McMahan et al. (2015) categorize the components of interaction fidelity as aspects of biomechanical symmetry, input veracity, and control symmetry. These categories directly correspond to the user actions, input devices, and transfer functions of the User-System Loop.

Scenario Fidelity Components

Ragan et al. (2015) define the second category of fidelity components as scenario fidelity, which is the objective degree of exactness with which behaviors, rules, and object properties are reproduced in a simulation compared to the real world. For example, VEs designed using basic architectural principles, such as those by Campbell (1996), provide a greater degree of scenario fidelity than VEs designed with impossible architectures, such as the ones created by Suma et al. (2012). Scenario fidelity impacts the data and models underlying the simulation.

Scenario fidelity has not been as extensively researched as interaction fidelity or display fidelity. Currently, its components have only been defined at the high-level categories of behaviors, rules, and object properties (Ragan et al. 2015). Behaviors refer to the artificial intelligence properties that control virtual agents and objects within the simulation. Rules refer to physics and other models that determine what happens to objects within the simulation. Finally, object properties refer to the dimensional (e.g., height, width, shape), light-related (e.g., texture, color), and physics-related qualities (e.g., weight, mass) of objects.

As the components of scenario fidelity have only been loosely defined with three broad categories, there is much research to be done to further identify individual components.

Display Fidelity Components

Finally, the third category of fidelity components is display fidelity, which McMahan et al. (2012) define as the objective degree of exactness with which real-world sensory stimuli are reproduced by a system (note that display fidelity has also been referred to as immersion; see McMahan et al. 2012 for more discussion). A surround-screen display with a 360-degree FOR and stereoscopic glasses, for example, provides a greater level of display fidelity than a monoscopic desktop monitor. Display fidelity directly depends on the qualities of the rendering software, output devices, and sensory stimuli produced.

Bowman and McMahan (2007) identify many of the components of visual display fidelity, including stereoscopy (the display of different images to each eye to provide an additional depth cue), field of view (FOV; the size of the visual field that can be viewed instantaneously by the user), field of regard (FOR; the total size of the visual field surrounding the user), display resolution (the total pixels displayed on the screen or surface), display size (the physical dimensions of the display screen or surface), refresh rate (how often the display draws provided rendered data), and frame rate (how often rendered data is provided to the display).
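Several of these components are related by simple geometry. The sketch below, assuming a hypothetical flat screen viewed straight on from its center, computes the field of view subtended by the display and the resulting angular resolution in pixels per degree; the dimensions are illustrative.

import math

# A minimal sketch relating display size, viewing distance, field of view, and
# angular resolution; the numbers are hypothetical.
screen_width_m = 1.2      # physical display width
viewing_distance_m = 0.8  # user's distance from the screen
horizontal_pixels = 2560

# Field of view subtended by the screen at the viewing position.
fov_deg = math.degrees(2 * math.atan(screen_width_m / (2 * viewing_distance_m)))

# Angular resolution: how many pixels cover each degree of that field of view.
pixels_per_degree = horizontal_pixels / fov_deg

print(round(fov_deg, 1), round(pixels_per_degree, 1))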

In addition to visual components, display fidelity also comprises auditory, haptic, olfactory, and gustatory aspects. However, Bowman and McMahan (2007) were focused on visual aspects and did not address the components of these sensory modalities.

Examples of the Component Evaluation Approach

While the component evaluation approach could be used to evaluate nearly any type of interactive system, the fidelity-components strategy specifically targets 3D UIs. It has been used in many studies to investigate the effects and interactions of numerous components of fidelity. These studies can be categorized into three types of evaluations—those investigating components of interaction fidelity, components of display fidelity, and both. Few studies have explicitly investigated the effects or interactions of scenario fidelity.

Investigations of interaction fidelity have mostly compared levels of interaction fidelity holistically rather than varying individual components, due to the integral relationships among the components (for example, it is difficult to alter the kinetics of an interaction without affecting its kinematics). McMahan et al. (2010) compared two techniques based on a steering wheel metaphor to two game controller techniques that used a joystick to steer a virtual vehicle. They found that the steering wheel techniques were significantly worse for the steering task than the joystick techniques, despite having higher levels of biomechanical symmetry and control symmetry. Similarly, Nabiyouni et al. (2015) found that the Virtusphere technique was significantly worse for travel tasks than a joystick technique, despite also having higher levels of biomechanical symmetry and control symmetry. However, they found that real walking, which provides extremely high levels of biomechanical symmetry and control symmetry, was significantly better than the Virtusphere.

Unlike interaction fidelity, specific components of display fidelity have been controlled and investigated. Arthur (2000) evaluated FOV in an HWD at three levels (48°, 112°, 176°) for a visual search task and determined that increasing FOV improved searching within a VE. Similarly, Ni et al. (2006) investigated display resolution at two levels (1280 × 720 and 2560 × 1440) and display size at two levels of physical viewing angle (42.8° and 90°) for searching and comparing information within a VE. They found that higher levels of display resolution and display size improved user performance metrics. In the AR domain, Kishishita et al. (2014) showed similar results in that target discovery rates consistently drop with in-view labeling and increase with in-situ labeling as display angle approaches 100 degrees of field of view. Past this point, the performances of the two view management methods begin to converge, suggesting equivalent discovery rates at approximately 130 degrees of field of view.

Finally, some studies have investigated both interaction fidelity and display fidelity. McMahan et al. (2012) controlled both at low and high levels for a first-person shooter game. They varied interaction fidelity between a low level (a mouse and keyboard used for aiming and traveling, respectively) and a high level (a 6-DOF tracker technique for aiming and the moderate-fidelity human joystick technique for travel; see Chapter 8, section 8.4.2). They also varied display fidelity between a low level with no stereoscopy and only a 90° field of regard and a high level with stereoscopy and a 360° field of regard. Their results indicated that the mixed conditions (i.e., low-fidelity interactions with a high-fidelity display and high-fidelity interactions with a low-fidelity display) were significantly worse for user performance than the matched conditions (i.e., low-fidelity interactions with a low-fidelity display and high-fidelity interactions with a high-fidelity display).

In most of the case studies above, and others, the component evaluation approach produced unexpected results that provided unique insights. First, results have shown that many components with moderate levels of fidelity can yield poorer performance than those with lower levels (McMahan et al. 2010, 2012; McMahan 2011; Nabiyouni et al. 2015). This could sometimes be due to users not being as familiar with moderate-fidelity interactions as real-world high-fidelity actions or desktop-based low-fidelity interactions, but could also be an inherent limitation of moderate levels of fidelity, which seem natural but are not, similar to the “uncanny valley” effect. Second, several studies have demonstrated that increasing display fidelity normally yields better performance (Arthur 2000; Ni et al. 2006; McMahan 2011; Laha et al. 2014), unless interaction fidelity is also changing (McMahan et al. 2012). These interesting insights would not have been possible without conducting component evaluations.

11.6.4 Comparison of Approaches

The three major evaluation approaches we have presented for 3D UIs—sequential evaluation, testbed evaluation, and component evaluation—take quite different approaches to the same problem: how to understand and improve usability and the user experience for 3D UIs. Sequential evaluation is done in the context of a particular application and can have both quantitative and qualitative results. Testbed evaluation is done in a generic evaluation context and usually seeks quantitative results. Component evaluation can be adapted for a particular application to make a specific design choice, or it can be applied to multiple application contexts to generalize the effects of one or more components. Like testbed evaluation, component evaluation prioritizes quantitative results. All three approaches employ users in evaluation.

In this section, we take a more detailed look at the similarities and differences among these three approaches. We organize this comparison by answering several key questions about each of them. Many of these questions can be asked of other evaluation methods and perhaps should be asked before designing a 3D UI evaluation. Indeed, answers to these questions may help identify appropriate evaluation methods, given specific research, design, or development goals. Another possibility is to understand the general properties, strengths, and weaknesses of each approach so that the three approaches can be linked in complementary ways.

What Are the Goals of the Approach?

As mentioned above, all three approaches ultimately aim to understand and improve usability and the user experience in 3D UIs. However, each approach also has more specific goals that distinguish it from the others.

Sequential evaluation’s immediate goal is to iterate toward a better UI for a particular 3D application. It looks closely at particular user tasks of an application to determine which scenarios and interaction techniques should be incorporated. In general, this approach tends to be quite specific in order to produce the best possible interface design for a particular application under development.

Testbed evaluation has the specific goal of finding generic performance characteristics of interaction techniques. This means that one wants to understand interaction technique performance in a high-level, abstract way, not in the context of a particular application. This goal is important because, if achieved, it can lead to wide applicability of the results. In order to do generic evaluation, the testbed approach is limited to general techniques for common universal tasks (such as navigation, selection, or manipulation). To say this in another way, testbed evaluation is not designed to evaluate special-purpose techniques for specific tasks. Rather, it abstracts away from these specifics, using generic properties of the task, user, environment, and system.

Component evaluation has the goal of determining the main effects of specific system components and any interaction effects among them for either a specific application context or generally, across multiple application contexts. When conducted for a specific application context, component evaluation can be used to decide upon the best design choices. Alternatively, a series of component evaluations can be conducted across multiple application contexts to develop expectations about how the components will generally affect usability and the user experience in most 3D UIs. These expectations can then be used to create design guidelines such as, “Do not employ moderate-fidelity interaction techniques when familiarity and walk-up usability are important” (McMahan et al. 2012).

When Should the Approach Be Used?

Sequential evaluation should be used early and continually throughout the design cycle of a 3D application. User task analysis is necessary before the first interface prototypes are built. Heuristic and formative evaluations of a prototype produce recommendations that can be applied to subsequent design iterations. Formative evaluations of different design possibilities can be done when the choice of design (e.g., for interaction techniques) is not clear.

By its non-application-specific nature, the testbed approach actually falls completely outside the design cycle of a particular application. Ideally, testbed evaluation should be completed before an application is even a glimmer in the eye of a developer. Because it produces general performance/usability results for interaction techniques, these results can be used as a starting point for the design of new 3D UIs.

Like the testbed approach, component evaluation can be used before the system concept for an application is even written. A series of component evaluations over multiple application contexts will yield knowledge of the general effects of one or more components, which can be expressed as design guidelines. Alternatively, single component evaluations can be used during the development of a 3D application to decide upon unclear design choices.

The distinct time periods in which testbed evaluation and sequential evaluation are employed suggest that combining the two approaches is possible and even desirable. Testbed evaluation can first produce a set of general results and guidelines that can serve as an advanced and well-informed starting point for a 3D application’s UI design. Sequential evaluation can then refine that initial design in a more application-specific fashion. Similarly, component evaluation can be used to derive design guidelines concerning specific components that sequential evaluation can rely upon during the initial design of a 3D application. Alternatively, component evaluation can be used during sequential evaluation to make specific design decisions.

In What Situations Is the Approach Useful?

As we have said, the sequential evaluation approach should be used throughout the design cycle of a 3D UI, but it is especially useful in the early stages of interface design. Because sequential evaluation produces results even on very low-fidelity prototypes or design specifications, a 3D application’s UI can be refined much earlier, resulting in greater cost savings. Also, the earlier this approach is used in development, the more time remains for producing design iterations, which ultimately results in a better product. This approach also makes the most sense when a user task analysis has been performed. This analysis will suggest task scenarios that make evaluation more meaningful and effective.

Testbed evaluation allows the researcher to understand detailed performance characteristics of common interaction techniques, especially user performance. It provides a wide range of performance data that may be applicable to a variety of situations. In a development effort that requires a suite of applications with common interaction techniques and interface elements, testbed evaluation could provide a quantitative basis for choosing them, because developers could choose interaction techniques that performed well across the range of tasks, environments, and users in the applications; their choices are supported by empirical evidence.

Component evaluation is useful for determining the general effects of one or more system components and establishing design guidelines for how those components should be controlled or fixed in a design. Single component evaluations are also useful for making design choices that directly involve one or more system components, such as what FOV to provide or whether or not stereoscopic graphics should be used.

What Are the Costs of Using the Approach?

In general, the sequential evaluation approach may be less costly than the testbed evaluation approach because it can focus on a particular 3D application rather than pay the cost of abstraction. However, some important costs are still associated with this approach. Multiple evaluators may be needed. Development of useful task scenarios may take a large amount of effort. Conducting the evaluations themselves may be costly in terms of time, depending on the complexity of task scenarios. Most importantly, because this is part of an iterative design effort, time spent by developers to incorporate suggested design changes after each round of evaluation must be considered.

The testbed evaluation approach can be seen as very costly and is definitely not appropriate for every situation. In certain scenarios, however, its benefits can make the extra effort worthwhile. Some of the most important costs associated with testbed evaluation include difficult experimental design (many independent and dependent variables, where some of the combinations of variables are not testable), experiments requiring large numbers of trials to ensure significant results, and large amounts of time spent running experiments because of the number of participants and trials. Once an experiment has been conducted, the results may not be as detailed as some developers would like. Because testbed evaluation looks at generic situations, information on specific interface details such as labeling, the shape of icons, and so on will not usually be available.

The cost of the component evaluation approach varies, based on when and how it is being used. If a series of component evaluations is being conducted to generalize the effects of one or more components, the cost is similar to that of a testbed evaluation: such a series usually involves difficult experimental designs with many independent and dependent variables and requires large numbers of trials and participants and a great deal of time. On the other hand, single component evaluations used during the design of a 3D application can be less costly than the sequential evaluation approach.

What Are the Benefits of Using the Approach?

For a particular application, the sequential evaluation approach can be very beneficial. Although it does not produce reusable results or general principles in the same broad sense as testbed and component evaluations can, it is likely to produce a more refined and usable 3D UI than if the results of testbed or component evaluations were applied alone. Another of the major benefits of this method relates to its involvement of users in the development process. Because members of the representative user group take part in many of the evaluations, the 3D UI is more likely to be tailored to their needs and should result in higher user acceptance and productivity, reduced user errors, and increased user satisfaction. There may be some transferability of results, because other applications may have similar tasks or requirements, or they may be able to use refined interaction techniques produced by the process.

Because testbed evaluation is so costly, its benefits must be significant before it becomes a useful evaluation method. One such benefit is generality of the results. Because testbed experiments are conducted in a generalized context, the results may be applied many times in many different types of applications. Of course, there is a cost associated with each use of the results because the developer must decide which results are relevant to a specific 3D UI. Second, testbeds for a particular task may be used multiple times. When a new interaction technique is proposed, that technique can be run through the testbed and compared with techniques already evaluated. The same set of participants is not necessary, because testbed evaluation usually uses a between-subjects design. Finally, the generality of the experiments lends itself to development of general guidelines and heuristics. It is more difficult to generalize from experience with a single application.

Like its costs, the benefits of the component evaluation approach vary based on when and how the approach is used. If the approach is used in a series across multiple 3D UI contexts, one benefit is the generality of the results, similar to the testbed approach. Unlike the testbed approach, however, the cost associated with each use of the results is often lower, because the results are based on system components that are present in most 3D UI systems and are therefore relevant to most 3D UIs. Another benefit of using the component approach across multiple 3D UI contexts is that general guidelines for controlling that component within 3D UIs should become apparent. Finally, a single component evaluation provides the benefit of revealing the best design choice for a particular 3D application, when that design choice depends upon one or more system components.

How Are the Approach’s Evaluation Results Applied?

Application of the results of the sequential evaluation approach is straightforward. Heuristic and formative evaluations produce specific suggestions for changes to the application’s UI or interaction techniques. The result of summative evaluation is an interface or set of interaction techniques that performs the best or is the most usable in a comparative study. In any case, results of the evaluation are tied directly to changes in the interface of the 3D application.

The results of testbed evaluation are applicable to any 3D UI that uses the tasks studied with a testbed. For example, testbed results are available for some of the most common tasks in 3D UIs: travel and selection/manipulation (Bowman et al. 2001). The results can be applied in two ways. The first informal technique is to use the guidelines produced by testbed evaluation in choosing interaction techniques for an application (as in Bowman et al. 1999). A more formal technique uses the requirements of the application (specified in terms of the testbed’s performance metrics) to choose the interaction technique closest to those requirements. Both of these approaches should produce a set of interaction techniques for the application that makes it more usable than the same application designed using intuition alone. However, because the results are so general, the 3D UI will almost certainly require further refinement.

The results of a series of component evaluations across multiple 3D UI contexts are applicable to any 3D UI system that includes the system components evaluated. As many 3D UI systems include the same system components, the results of the component evaluation may be more generally applicable than the results of the testbed approach. Finally, the results of a single component evaluation can be straightforwardly applied to a design choice based on how to control one or more system components.

11.7 Guidelines for 3D UI Evaluation

In this section, we present some guidelines for those wishing to perform usability evaluations of 3D UIs. The first subsection presents general guidelines, and the second subsection focuses specifically on formal experimentation.

11.7.1 General Guidelines

The general guidelines include less formal modes of evaluation.


Tip

Begin with informal evaluation.


Informal evaluation is very important, both in the process of developing an application and in doing basic interaction research. In the context of an application, informal evaluation can quickly narrow the design space and point out major flaws in the design. In basic research, informal evaluation helps you understand the task and the techniques on an intuitive level before moving on to more formal classifications and experiments.


Tip

Acknowledge and plan for the differences between traditional UI and 3D UI evaluation.


Section 11.4 detailed a large number of distinctive characteristics of 3D UI evaluation. These differences must be considered when you design a study. For example, you should plan to have multiple evaluators, incorporate rest breaks into your procedure, and assess whether breaks in presence could affect your results.


Tip

Choose an evaluation approach that meets your requirements.


Just as we discussed with respect to interaction techniques, there is no optimal usability evaluation method or approach. A range of methods should be considered, and important questions such as those in section 11.6.4 should be asked. For example, if you have designed a new interaction technique and want to refine the usability of the design before any implementation, a heuristic evaluation or cognitive walkthrough fits the bill. On the other hand, if you must choose between two input devices for a task in which a small difference in efficiency may be significant, a formal experiment may be required.


Tip

Use a wide range of metrics.


Remember that speed and accuracy alone do not equal usability. Also remember to look at learning, comfort, presence, and other metrics in order to get a complete picture of the usability and user experience of the interface.

11.7.2 Guidelines for Formal Experimentation

This section presents a number of guidelines for formal experiments to investigate the usability issues of 3D interfaces.


Tip

Design experiments with general applicability.


If you’re going to do formal experiments, you will be investing a large amount of time and effort. So you want the results to be as general as possible. Thus, you have to think hard about how to design tasks that are generic, performance measures to which real applications can relate, and a method for applications to easily apply the results.


Tip

Use pilot studies to determine which variables should be tested in the main experiment.


In doing formal experiments, especially testbed evaluations, you often have too many variables to actually test without an infinite supply of time and participants. Small pilot studies can show trends that may allow you to remove certain variables because they do not appear to affect the task you’re doing.


Tip

Use automated data collection for system performance and task performance metrics.


As discussed in section 11.4.1, the complexity of 3D UIs can make data collection difficult for evaluators, for they are not able to simultaneously observe the subject’s actions and the results of those actions within the 3D UI. A simple yet effective solution to this issue is to program automated data collection methods within the 3D UI software or ancillary software. Such automated data collection methods are more accurate than evaluator-based observations, with time measured in milliseconds and every predefined error being identified. The major limitation of automated data collection is the time and effort required to program the additional data collection methods.
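As a minimal sketch of what such instrumentation might look like, the following logger (with hypothetical event names and file path) timestamps predefined events to a CSV file so that task completion times and error counts can be computed afterward.

import csv
import time

# A minimal sketch of automated data collection inside 3D UI software;
# event names and the output path are hypothetical. Timestamps use a
# monotonic millisecond clock.
class TrialLogger:
    def __init__(self, path: str):
        self.file = open(path, "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["participant", "trial", "event", "time_ms"])

    def log(self, participant: int, trial: int, event: str):
        self.writer.writerow([participant, trial, event, round(time.monotonic() * 1000, 1)])

    def close(self):
        self.file.close()

# The 3D UI code calls the logger at predefined points instead of relying on
# an evaluator's stopwatch and notes.
logger = TrialLogger("session01.csv")
logger.log(participant=7, trial=1, event="trial_start")
logger.log(participant=7, trial=1, event="selection_error")
logger.log(participant=7, trial=1, event="trial_complete")
logger.close()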


Tip

Look for interactions between variables—rarely will a single technique be the best in all situations.


In most formal experiments on the usability of 3D UIs, the most interesting results have been interactions. That is, it’s rarely the case that technique A is always better than technique B. Rather, for instance, technique A works well when the environment has characteristic X, and technique B works well when the environment has characteristic Y. Statistical analysis should reveal these interactions between variables.
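As an illustration, the following sketch simulates a hypothetical crossover (technique A is fast in environment X, technique B is fast in environment Y) and tests for the interaction with a two-way ANOVA using the pandas and statsmodels libraries; the data are simulated purely for demonstration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
rows = []
# Hypothetical crossover: technique A is fast in environment X, B is fast in Y.
means = {("A", "X"): 4.0, ("A", "Y"): 6.5, ("B", "X"): 6.0, ("B", "Y"): 4.2}
for (tech, env), mu in means.items():
    for t in rng.normal(mu, 0.5, size=20):  # 20 simulated trials per cell
        rows.append({"technique": tech, "environment": env, "time": t})
df = pd.DataFrame(rows)

# Two-way ANOVA: the C(technique):C(environment) row tests the interaction.
model = smf.ols("time ~ C(technique) * C(environment)", data=df).fit()
print(anova_lm(model, typ=2))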

11.8 Case Studies

In this section, we discuss the evaluation aspects of our two case studies. Again, if you have not read sequentially through the book, you can find the introduction to the case studies in section 2.4, and the design decisions for the case studies at the ends of Chapters 5–10.

11.8.1 VR Gaming Case Study

As we noted in the introduction to the VR gaming case study, this design is purely a thought experiment. Although we have significant experience in designing 3D UIs, we are not so arrogant as to suggest that the 3D UI design we’ve presented is the best one or even an especially good one. On the contrary, in our experience, when we imagine 3D interaction in our minds, the interaction concepts can seem really exciting and powerful, but then we find that they don’t turn out to be nearly as cool or effective when they are actually implemented. That’s because it’s very hard to imagine all the subtleties of how an interaction will look and feel—we tend to skip over details in our imagination, and the devil is in the details for 3D UI design.

So while this case study should be helpful in seeing how to think about the design of various parts of a complete 3D UI for a VR game, and while several of the interaction concepts we’ve described would probably work quite well, several rounds of prototyping and formative evaluation would be critical to actually realize an effective 3D UI if this were a real project.

Prototyping is especially important for 3D UIs. While early prototypes of more traditional UIs (desktop, mobile device, etc.) are often low-fidelity paper prototypes or wireframes, it’s difficult to apply these approaches to 3D UI prototyping, which often requires a more complete implementation. Unfortunately, this means that the early stages of a 3D UI design often can’t progress as quickly as traditional UIs, since the notion of rapid throw-away prototypes is less applicable to 3D UIs. Still, 3D UI development tools are becoming more and more powerful, as many game engines gain support for VR, AR, tracking, and the like. And it’s still possible to prototype interaction concepts without having high-fidelity representations of the other parts of the game.

In this case study, then, we would suggest starting with working prototypes of the individual interaction concepts (for selection/manipulation, travel, and system control), with some informal formative usability studies and several rounds of iteration to refine and hone them as quickly as possible. In these studies, we would be looking both for large usability problems (requiring significant technique redesign) and for places where the technique’s mappings need tweaking.

The next step would be to put all the interaction techniques together into a prototype of the complete UI, using just a couple of rooms that are representative of the entire game. We need to evaluate all the interaction techniques together, because even when techniques are effective on their own, they don’t always work well together. In these formative studies, we would look for critical incidents, where interactions break down or don’t flow well together. We would also administer standard usability questionnaires.

Once we are happy with the complete 3D UI, we would move on to broader play-testing of several levels of the actual game, looking not just at usability, but at the broader user experience. These evaluations should use members of the target audience for the game, and should make use of standard UX questionnaires and interviews to determine the impact of the game on things like fun, frustration, and emotion.

Key Concepts

The key concepts that we have discussed in relation to our VR gaming case study are

A 3D interaction concept that seems good in the imagination or on paper does not always translate to an effective implementation. Working prototypes are critical to understand the potential of 3D UI designs.

Be sure to evaluate the complete UI, not just the individual interaction techniques.

Start with usability evaluation, but for real 3D UI applications, go beyond usability to understand the broader user experience.

11.8.2 Mobile AR Case Study

Continuing the thread of the Mobile AR Case Study section in the previous chapter, here we will take a closer look at evaluation approaches that affected the analysis of the various perceptual, cognitive, and physical ergonomics issues of the HYDROSYS system.

With regard to perceptual issues, AR systems have been shown to suffer from a range of problems (Kruijff et al. 2010). In the early stages of development of our system, we performed a user study to address legibility and visibility issues, with a rather curious outcome. A simple paper-based experiment used different backgrounds depicting the usage environments, with different structural and color properties. Overlaid on these environments, labels with different color schemes were shown. The results of the informal study with around ten users showed that, not surprisingly, high-contrast combinations were preferred most. Colors that provided high contrast between the foreground and the background and between the label and the text were especially preferred. The study also brought out one specific result: the color found to be most visible and legible was pink, as it clearly stood out against the natural background while text remained easily readable. However, we found the color unsuitable, as it would not provide a visually pleasing design, at least not in our eyes. In other cultures, this color scheme might actually be fully acceptable!

The results might have been different with a different display. For example, with a wide field-of-view head-worn display, you can visualize information in the periphery. But the outer regions of our visual field react quite differently to colors than our central vision does (Kishishita et al. 2014), as the ability to perceive colors diminishes toward the borders of our visual field. Additional studies could have been performed to examine this question, or the question of label layout, with different displays.

Another issue not to be ignored when performing outdoor evaluations is environmental conditions. The biggest challenge of performing perception-driven outdoor experiments in AR is lighting conditions. It is almost impossible to control lighting and related issues (such as screen reflections), even though some methods can be used to minimize their effects. For example, you can select similar timeslots on a daily basis when the weather is comparable, which gives better control over the direction of sunlight. On the other hand, deliberately performing evaluations under different lighting conditions may better reveal the potential and problems of the techniques being evaluated. At a minimum, one should log the outdoor conditions (brightness, possibly the angle of sunlight) to better understand the evaluation outcomes.

Cognitive issues also played an important role in HYDROSYS. We evaluated different techniques to improve spatial assessment of augmented information through multi-camera navigation systems. As we described in Chapter 8, we performed two studies to analyze the effect of different techniques on spatial knowledge acquisition and search performance. The first experiment (Veas et al. 2010) was a summative evaluation in which we compared three different techniques (mosaic, tunnel, and transition). The focus was on assessing which technique was best for knowledge acquisition and spatial awareness, while addressing cognitive load and user preference. The techniques were used in different local and remote camera combinations. We blindfolded users to block acquisition of spatial knowledge between physical locations, and we used rarely visited locations to avoid effects of prior knowledge. Based on the acquired knowledge, we asked the users to draw the spatial configuration of the site.

To measure the spatial abilities of users, participants filled out an SBSOD questionnaire before the experiment. They also rated the techniques on subjectively perceived cognitive load using the NASA TLX and RSME scales—see Chapter 3, “Human Factors Fundamentals,” for more information on these scales. The experiment illustrated some of the positive and negative aspects of the chosen methods. These scales for assessing spatial abilities and cognitive load are useful and can be completed reasonably well by participants. However, such self-reported measures provide only an initial indication. Objective measures (such as measuring stress with biosensors) would be a good option for coming to more reliable and valid conclusions.

NASA TLX and RSME did not reveal any significant differences among the techniques. Furthermore, we found the maps highly challenging to interpret. This was partly because we did not initially provide a legend informing the participants how to encode the drawing, but also because of the difficulty of comparing results. For example, we found noticeable scale and rotation errors, yet creating a suitable metric to quantify these errors was difficult, as simple offsets would not do them justice. For these reasons, we removed map drawing and the RSME scale from the second experiment, where we instead used performance time as the main measure.

The experiments revealed the usefulness of addressing user abilities and cognitive load, as technique preference in this case was associated with user capacities. We can also conclude that addressing cognitive issues is not easy, as different users have highly different abilities and interpreting results is not always straightforward. In general, the second experiment provided more valid and reliable results, as we mixed objective and subjective measures. We recommend this approach when analyzing cognition-related issues. Combining subjective and objective methods helps us obtain a more complete understanding.

Finally, addressing ergonomic issues is often disregarded in the design of 3D UIs, even though it can have a major effect on the ease of use and performance of a system. In various informal and formal studies of the Vesp’r and HYDROSYS setups—see Chapter 5, “3D User Interface Output Hardware”—we analyzed two interrelated ergonomic issues: grip and pose. The various handheld setups we developed supported multiple grips and poses to enable the user to hold the device comfortably and interact with controllers or the screen. As we discussed in Chapter 3, user comfort is highly affected by both the pose and the duration for which certain poses are held.

As an example of how we included this knowledge in evaluation, in one of our studies users had to compare different grips while performing tasks in different poses (see Figure 11.4). We compared single-handed versus two-handed grips with lower and higher angle postures over a relatively long usage duration. The duration was approximately half an hour; anything shorter would simply not reveal all the key ergonomic issues. Users had to pick up objects from a table and place them on a wall location, forcing them to hold the setup quite high. After the experiment, we asked the users to respond to different questions on, among other things, weight, grip shape, grip material, posture, and user comfort. The experiment revealed the importance of performing longer-duration tests that solely focus on ergonomics, as user comfort and preference varied quite widely. For more details, refer to Veas and Kruijff (2008).

Figure 11.4 Different grips and poses used to validate user comfort. (Images courtesy of Ernst Kruijff and Eduardo Veas).

Key Concepts

Perception: verify the visual appearance of system control elements and general view management. Be sure to evaluate AR systems in the environment in which the system is deployed. Deal appropriately with outdoor conditions during evaluation, as they may bias results.

Cognition: assess subjective mental load of more complex systems, as it may greatly affect performance. Extend and correlate with objective measures (such as performance or biosensors) where possible to gain more valid and reliable insights.

Ergonomics: study ergonomics of systems that are used for lengthy time periods. Evaluate long-term usage duration with tasks designed for different poses and grips to assess user comfort and fatigue.

11.9 Conclusion

In this chapter, we have presented an overview of some of the issues surrounding evaluation of 3D UIs. The most important takeaway message is that evaluation is almost always necessary. Despite all the design knowledge, guidelines, and techniques we presented in earlier chapters, initial 3D UI designs require assessment of usability and user experience so that the design can be iterated and improved. In addition, formal experimentation by researchers deepens our understanding of 3D interaction and provides new knowledge, new guidelines, and evidence that can be used to build models and theories.

Recommended Reading

Many entry-level HCI textbooks, such as the following, provide an excellent introduction to usability evaluation and UX engineering:

Hartson, R., and P. Pyla (2012). The UX Book: Process and Guidelines for Ensuring a Quality User Experience. Waltham, MA: Morgan Kaufmann Publishers.

Hix, D., and H. Hartson (1993). Developing User Interfaces: Ensuring Usability Through Product & Process. New York: John Wiley & Sons.

Rosson, M., and J. Carroll (2001). Usability Engineering: Scenario-Based Development of Human Computer Interaction. San Francisco, CA: Morgan Kaufmann Publishers.

Acknowledgment

Some of the content in this chapter comes from a 2002 article by Doug Bowman, Joseph Gabbard, and Deborah Hix that appeared in the journal Presence: “A Survey of Usability Evaluation in Virtual Environments: Classification and Comparison of Methods.” (Presence: Teleoperators and Virtual Environments 11(4): 404–424).

We thank the coauthors and the MIT Press for their generous permission to reuse the material here. This content is © 2002 MIT Press.
