Interacting in mixed reality

With any new computing paradigm come new ways of interacting with it and, as highlighted in the opening paragraph, history has shown that we are moving from interfaces that are natural to the computer toward interfaces that are more natural to people. For the most part, HoloLens removes dedicated input devices and relies on inferred intent, gestures, and voice. I would argue that this constraint is the second most compelling offering that HoloLens gives us; it is an opportunity to invent more natural and seamless experiences that are accessible to everyone. Microsoft refers to the three main forms of input as Gaze, Gesture, and Voice (GGV); let's examine each of these in turn.

Gaze refers to tracking what the user is looking at; from this, we can infer their interest (and intent). For example, I will normally look at a person before I speak to them, signalling that I have something to say to them. Similarly, during the conversation, I may gaze at an object, signalling to the other person that this object is the subject I'm speaking about.

This concept is heavily used in HoloLens applications for selecting and interacting with holograms. Gaze is accompanied by a cursor, which gives the user visual feedback on what they are looking at. The cursor can also show the state of the application or of the object the user is currently gazing at; for example, it can change visually to signal whether the gazed hologram is interactive or not. On the official developer site, Microsoft has listed the design principles for the cursor; I have paraphrased and listed them here for convenience:

Always present: The cursor is, in some sense, akin to the mouse pointer of a GUI; it helps the users understand the environment and the current state of the application.
Cursor scale: As the cursor is used for selecting and interacting with holograms, its size should be no bigger than the objects the user can interact with. Scale can also be used to assist the user's understanding of depth; for example, the cursor will be larger when on nearby surfaces than when on surfaces farther away.
Look and feel: Using a directionless shape means that you avoid implying any specific direction with the cursor; the shape commonly used is a donut or torus. Making the cursor hug the surfaces gives the user a sense that the system is aware of their surroundings.
Visual cues: As mentioned earlier, the cursor is a great way of communicating to the user what is important, as well as relaying the current state of the application. In addition to signalling what is interactive and what is not, it can also be used to present additional information (such as possible actions) or the current state, such as visually showing the user that their hand has been detected.
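To make the relationship between gaze, cursor, and hologram more concrete, here is a minimal sketch in plain Python rather than HoloLens code. It assumes each hologram can be approximated by a bounding sphere and that the head position and gaze direction are already known; the names Hologram, Cursor, and update_cursor are hypothetical and not part of any HoloLens API. The cursor stays present even when nothing is hit, sits on the nearest gazed surface, and carries an interactive flag that a renderer could use as a visual cue.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Hologram:
    name: str
    center: Vec3
    radius: float      # bounding-sphere radius in metres (an assumption)
    interactive: bool


@dataclass
class Cursor:
    position: Vec3
    target: Optional[str]  # name of the gazed hologram, if any
    interactive: bool      # would drive the cursor's visual state


def ray_sphere_distance(origin: Vec3, direction: Vec3,
                        center: Vec3, radius: float) -> Optional[float]:
    """Distance along a unit-length ray to the sphere's near surface.

    Returns None when the ray misses, the sphere is behind the origin,
    or the origin is inside the sphere.
    """
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t >= 0 else None


def update_cursor(head: Vec3, gaze: Vec3, holograms: Sequence[Hologram],
                  default_distance: float = 2.0) -> Cursor:
    """Place the cursor on the nearest gazed hologram.

    If nothing is hit, the cursor floats at default_distance along the
    gaze so that it is always present. The gaze vector must be non-zero.
    """
    length = math.sqrt(sum(g * g for g in gaze))
    direction = tuple(g / length for g in gaze)
    best = None
    for h in holograms:
        t = ray_sphere_distance(head, direction, h.center, h.radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, h)
    distance = best[0] if best else default_distance
    position = tuple(o + distance * d for o, d in zip(head, direction))
    if best is None:
        return Cursor(position, None, False)
    return Cursor(position, best[1].name, best[1].interactive)
```

A real application would raycast against the spatial mesh and hologram colliders instead of spheres, but the decision logic (nearest hit wins, cursor state follows the target) would look much the same.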

While gaze provides the mechanism for targeting objects, gestures and voice provide the means to interact with them. Gestures can be either discrete or continuous. Discrete gestures execute a specific action; for example, the air-tap gesture is equivalent to a click of a mouse or a tap on a touchscreen. In contrast, continuous gestures are entered and exited and, while active, provide continuous updates to their state. An example of this is the manipulation gesture, whereby the user enters the gesture by holding their finger down (called the hold gesture); once active, it continuously reports the position of the tracked hand until the gesture is exited by lifting the finger. This is equivalent to dragging items on desktop and touch devices, with the addition of depth.
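The difference between discrete and continuous gestures can be sketched as a small state machine. The following plain-Python example is illustrative only: the class GestureRecognizer, its event names, and the 0.5-second hold threshold are assumptions made for this sketch, not the values or API that HoloLens uses. A short press and release yields a single tap event, whereas a press held past the threshold enters a manipulation that emits updates until it is released or hand tracking is lost.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Event = Tuple[str, Optional[Vec3]]

HOLD_THRESHOLD_S = 0.5  # assumed value, for illustration only


class GestureRecognizer:
    """Turns raw hand samples into tap or manipulation events."""

    def __init__(self) -> None:
        self.events: List[Event] = []
        self._press_time: Optional[float] = None
        self._start: Optional[Vec3] = None
        self._manipulating = False

    def _delta(self, position: Vec3) -> Vec3:
        # Offset of the hand from where the press began.
        return tuple(p - s for p, s in zip(position, self._start))

    def _reset(self) -> None:
        self._press_time = None
        self._start = None
        self._manipulating = False

    def on_hand(self, time_s: float, pressed: bool, position: Vec3) -> None:
        """Feed one hand-tracking sample (time, finger state, position)."""
        if pressed and self._press_time is None:
            # Finger just went down: remember when and where.
            self._press_time = time_s
            self._start = position
        elif pressed and self._manipulating:
            # Continuous gesture: report every new hand position.
            self.events.append(("manipulation_updated", self._delta(position)))
        elif pressed and time_s - self._press_time >= HOLD_THRESHOLD_S:
            # Held long enough: enter the continuous gesture.
            self._manipulating = True
            self.events.append(("manipulation_started", self._delta(position)))
        elif not pressed and self._press_time is not None:
            # Finger lifted: finish the manipulation or emit a tap.
            if self._manipulating:
                self.events.append(
                    ("manipulation_completed", self._delta(position)))
            else:
                self.events.append(("tap", None))
            self._reset()

    def on_hand_lost(self) -> None:
        """Hand left the tracked region; cancel any active manipulation."""
        if self._manipulating:
            self.events.append(("manipulation_canceled", None))
        self._reset()
```

Feeding the recognizer a press at 0.0 s and a release at 0.1 s produces one tap event; holding the press beyond the threshold produces a started event, a stream of updated events, and finally a completed (or canceled) event.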

HoloLens recognizes and tracks hands in either the ready state (back of hand facing you with the index finger up) or the pressed state (back of hand facing you with the index finger down). It makes the position and state of the currently tracked hands available, allowing you to devise your own gestures, in addition to providing some standard gestures, some of which are reserved for the operating system. The following gestures are available:

Air-tap: This is when the user presses (finger down) and releases (finger up) within a certain threshold. This interaction is commonly associated with selecting holograms (as mentioned earlier).
Bloom: Reserved for the operating system, bloom is performed by holding your hand in front of you with your fingers closed, and then opening your hand up. When detected, HoloLens will redirect the user to the Start menu.
Manipulation: As mentioned earlier, manipulation is a continuous gesture entered when the user presses their finger down and holds it down, and exited when hand tracking is lost or the user releases their finger. When active, the user's hand is tracked with the intention of using the absolute position to manipulate the targeted hologram.
Navigation: This is similar to the manipulation gesture, except for its intended use. Instead of mapping the absolute position changes of the user's hand to the hologram, as with manipulation, navigation provides a normalized range of -1 to 1 on each axis (x, y, and z); this is useful (and often used) when interacting with user interfaces, such as scrolling or panning.
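The contrast between manipulation and navigation comes down to how the hand offset is reported. The sketch below, in plain Python, assumes the hand's start and current positions are available in metres; the function names and the 0.25 m reach used for normalization are hypothetical choices for illustration, not values taken from the HoloLens SDK.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

NAVIGATION_REACH_M = 0.25  # assumed hand travel that maps to +/-1


def manipulation_delta(start: Vec3, current: Vec3) -> Vec3:
    """Manipulation: the raw hand offset, applied to the hologram as-is."""
    return tuple(c - s for s, c in zip(start, current))


def navigation_value(start: Vec3, current: Vec3,
                     reach: float = NAVIGATION_REACH_M) -> Vec3:
    """Navigation: the hand offset scaled by reach and clamped to [-1, 1]."""
    return tuple(max(-1.0, min(1.0, (c - s) / reach))
                 for s, c in zip(start, current))
```

A scrolling list, for instance, could multiply the x component of the navigation value by a maximum scroll speed, so that moving the hand further scrolls faster, up to a limit.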

The last dominant form of interacting with HoloLens, and one I'm particularly excited about, is voice. In recent times, we have seen the rise of the Conversational User Interface (CUI), so it's timely to introduce a platform where one of its dominant inputs is voice. In addition to being a vision we have had since before the advent of computers, it also provides the following benefits:

Hands free (obviously important for a device like HoloLens)
More efficient and requires less effort to achieve a task; this is true for data entry and navigating deeply nested menus
Reduces cognitive load; when done well, it should be intuitive and natural, with minimal learning required

However, how voice is used really depends on your application. It can simply supplement gestures, such as allowing the user to say the Select keyword (a reserved keyword) to select the object they are currently gazing at, or it can support complex requests, such as answering free-form questions from the user. Voice also has some weaknesses, including these:

Difficulty with handling ambiguity in language; for example, how do you handle a request such as "louder"?
Manipulating things in physical space is also cumbersome
Social acceptance and privacy are also considerations that need to be taken into account

With the success of machine learning (ML) and the adoption of products such as Amazon's Echo, it is likely that these weaknesses will be short-lived.
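To illustrate the simpler end of this spectrum, where voice supplements gaze, the following plain-Python sketch routes a recognized phrase to whatever hologram the user is currently gazing at. Everything here is hypothetical: the Hologram class, the COMMANDS table, and handle_phrase are invented for illustration, and speech recognition itself is assumed to have already produced the phrase as text. The "louder" handler also shows one blunt way of dealing with ambiguity, by assuming a fixed step.

```python
from typing import Callable, Dict, Optional


class Hologram:
    def __init__(self, name: str) -> None:
        self.name = name
        self.selected = False
        self.volume = 0.5


VOLUME_STEP = 0.1  # assumption: "louder" means one fixed step


def select(target: Hologram) -> None:
    target.selected = True


def louder(target: Hologram) -> None:
    # The phrase does not say by how much, so apply a fixed step.
    target.volume = min(1.0, target.volume + VOLUME_STEP)


COMMANDS: Dict[str, Callable[[Hologram], None]] = {
    "select": select,
    "louder": louder,
}


def handle_phrase(phrase: str, gazed: Optional[Hologram]) -> bool:
    """Apply a recognized keyword to the gazed hologram.

    Returns False when the phrase is unknown or nothing is being gazed at,
    so the caller can give the user feedback instead of failing silently.
    """
    handler = COMMANDS.get(phrase.strip().lower())
    if handler is None or gazed is None:
        return False
    handler(gazed)
    return True
```

Gaze supplies the target and voice supplies the action, which is why the two inputs pair so naturally.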
