Chapter 5. Readying Voice Content for Launch

The very notion of making a voice interface usable is complicated by the fact that voice differs so considerably from other types of user interfaces. After all, we need to cater not only to disabled users treating our interfaces as an alternative means of access, but also to users who may never have encountered a voice interface before.

Making our voice content usable starts with understanding the target demographic and how they currently retrieve their information, and it continues with designing an interface faithful to the expectations of both novice and experienced users. As such, some form of usability assessment needs to be built into every step of the process.

In this chapter, we’ll cover voice usability and the final stages before launch: the critical phase of voice usability testing—a far cry from usability testing in other contexts—as well as the prerelease phase that prepares our interface for prime time. We’ll also look at how to make sure our content is in all the right places and how to set up logs and analytics. Finally, we’ll explore ways to keep improving on usability by watching how users interact with our content in the real world.

Voice Usability

The people using our voice interfaces are more likely to succeed if we focus on the ways they will interact with the content inside those interfaces, across a variety of lived experiences. Without deeply understanding our users and catering to their requirements, our voice content won’t fulfill their needs. Early, iterative prototyping will quickly pinpoint any weak spots in the interface where users may struggle.

Building usability in by default

Build usability into every point of the project timeline, including the earliest discovery phases. This could mean conducting background research on the desired audience of your interface and learning about their habits and ideals, even before embarking on articulating a content strategy, conducting a content audit, and writing sample dialogues.

If your voice interface is designed to reduce the burden on customer service agents handling dozens of conversations a day, simply sitting in on a typical shift to listen in on calls can reveal how real users will eventually interact with a voice interface. How do customers negotiate with call handlers? How quickly do they expect a solution to their problem? What are the sorts of questions they ask, and how do they word those questions?

In Designing Voice User Interfaces, Pearl also recommends this approach (http://bkaprt.com/vcu36/03-02). She writes that frontline customer service workers “can provide a huge amount of valuable information” because they “know about what users are really calling in about, what the biggest complaints are, and what information can be difficult for callers to find.” Without real-world input, it becomes harder to cater to your users’ requirements.

Just as early research rooted in real-world conversations yields better dialogues in the long run, so does early prototyping with authentic sample dialogues. This often helps designers fine-tune the personality of the voice interface according to user expectations.

Personality is an important consideration as well. According to voice interface designer Susan Hura, we humans tend to see voice interfaces without human personalities as “robotic or incoherent.” The interface should take on the role of a conversation partner and be capable of authentic dialogues with users as early into the implementation as possible; if the interface’s personality seems abrupt or otherwise uncannily not quite human, set aside sufficient time to address that back in the dialogue scripts (http://bkaprt.com/vcu36/01-03, PDF).

Too often, teams err by leaving usability research for the very end of a project, when it’s far too late to implement any findings, or by conducting too little discovery at the outset. Whether you study actual conversations between humans or iteratively prototype with sample dialogues and call flows, building these processes into your earliest design and implementation phases can help you avoid returning to the drawing board later on.

Voice accessibility and usability go hand in hand

When we talk about the usability of voice interfaces, we can’t ignore some of their most experienced users: people who use screen readers and other text-to-speech devices. Accessibility and usability in voice interfaces go hand in hand, and some people, like Deaf or Deaf-Blind users who require multimodal solutions with a visual or tactile component, can’t use voice-only interfaces at all.

Accessibility has long been a bulwark of web usability, but it hasn’t received quite as much attention when it comes to voice usability. And while voice interfaces may seem an anomaly among devices traditionally thought of as assistive, accessibility advocates nonetheless acknowledge their potential. Digital Services Georgia, for instance, cited accessibility as a major motivator for making their content available through voice in addition to the web. It was a logical stepping-stone in their continuing journey to serve every Georgian.

Because screen readers recite every single piece of text and media on a page, they often overshare by including superfluous information (think “Skip to main content” or overly wordy image descriptions), lengthening the user’s interaction. And because screen readers on the web are rooted in visual, page-based structures, voice interfaces can help users target a single paragraph rather than forcing them to sift through a long page of irrelevant information.

Beyond screen readers, however, designer Bo Campbell has pointed out that voice interfaces can serve distinct access needs along a number of other fronts. People with dyslexia, for instance, often find voice interfaces to be more pleasant than dealing with written text. For teams with control over speech synthesizers, volume or speech rate controls can aid hard-of-hearing users. Meanwhile, for those able to adjust how speech is recognized, a voice interface’s ability to interpret slurred, shaky, or broken speech by people with cognitive disabilities can enrich their user experience. And even though drilling into the inner workings of voice assistants is seldom possible, focusing on a “linear, time-efficient architecture” can help users with cognitive disabilities understand context and wayfinding with minimal friction (http://bkaprt.com/vcu36/05-01).

While the potential for accessibility with voice interfaces remains mostly unexplored, I look forward to a future, closer than we might think, where voice interactions become full-fledged conduits through which disabled people can access content they need. The spread of accessibility into other user experiences will also heighten the demand for multimodal accessibility, which must reach well beyond trappings of the accessible web like the screen reader.

Voice Usability Testing

Because voice interfaces are the most humanlike of interfaces, many traditional strategies for usability testing go out the window. Most usability tests involve some form of real-time conversation during the interaction, which can corrupt the data gathered in a voice usability test. And because methodologies for voice usability testing remain new, diligent preparation for their quirks is the best way to mitigate risk.

Running a voice usability test also presents challenges when it comes to appropriate procedure. Test environments, test subject demographics, study questionnaires, and scenario definition all require paradigms that revolve around voice interactions. Voice usability tests require two moderators—one to watch and ask questions, the other to transcribe and catch signs of trouble. It’s also best to have a developer in the room, not only to find areas for technical improvement but also to debug or troubleshoot any problems arising during the test.

Nevertheless, lessons from other kinds of usability tests still apply. The mantra “test early, test often” applies to interfaces of any type. The objective should always be to build an interface where your users don’t mind spending time—a habitable user experience.

Early usability testing

If your voice interface isn’t yet working, or if there’s still significant work left to be done, usability testing can still happen outside of a formal test environment. Some approaches to this include:

  • Table reads. Often, certain issues spring up only once users begin to interact with a voice experience in unexpected ways. Table reads of sample dialogues can isolate such issues. “You might have some items in your design (such as handling pronouns or referring to a user’s previous behavior) that will take more complex development,” writes Pearl in Designing Voice User Interfaces, “and it’s important to get buy-in from the outset and not surprise [stakeholders] late in the game” (http://bkaprt.com/vcu36/03-02).
  • Wizard of Oz testing. In Wizard of Oz (WOz) testing, the interface isn’t yet ready for a full regimen of testing, so a human stands in “behind the curtain” to present the illusion of a fully functional system. The “Wizard” needs to be a researcher who deeply understands the sample dialogues and can react rapidly to a user’s action with an appropriate response. One of the key benefits of WOz testing is the ability to test early and often without all components of the interface in place.
  • Hallway testing. Once you have a smoke-and-mirrors prototype in hand or a working build, you can conduct hallway testing, which involves mustering work colleagues, family members, friends, or neighbors to participate in an informal study. Because it’s rare that your recruits will be members of the target demographic, hallway testing can provide early insight, but like table reads and WOz testing, it pales in comparison to the real thing.

Until your voice interface is ready for a full round of usability testing, these techniques can help you gain insights before the real deal.

Test environments and subjects

Because the primary medium of voice interfaces is sound, one risk usability researchers new to voice have seldom had to consider is aural interference. Poor soundproofing or outside noise can halt a voice usability test before it even begins. For this reason, I recommend holding tests in recording studios or similar soundproof spaces to ensure as little outside noise as possible. Though absolute silence is ideal, a mostly silent room is often the best we can do.

Preparing subjects who might be used to other types of usability tests can be challenging. Lengthy silences are common during setup, debugging, and transitions between tasks, and may feel awkward to participants—especially if they’re not expecting them. For Ask GeorgiaGov, we asked participants to stay silent when not speaking with Alexa to avoid overfilling our audio recordings and transcripts. Users and moderators who are new to voice usability tests will need onboarding: they should know to hold back on the chatter they might bring to other kinds of evaluations, and to expect voice interface errors to surface differently from visual alerts.

When testing Ask GeorgiaGov, we found that most of our tech-savvy coworkers had little to no exposure to voice assistants in the first year or two after Amazon Alexa’s launch, when it was still a novelty. Ultimately, this inexperience helped us achieve results that came close to how we envisioned actual Georgians interacting with the interface. Nevertheless, if we could repeat our project, we’d try to recruit real users in Georgia across a variety of demographics.

Trust and consent are key in any researcher-subject setting, and if your content deals with potentially traumatic topics, you must shield your users by ensuring they’re comfortable with performing those tasks. State governments routinely grapple with resident inquiries dealing with abuse and trauma in areas like sexual harassment, firearms, and the carceral state. To avoid causing inadvertent harm in our Ask GeorgiaGov testing, we prescreened users with a consent form so anyone could opt out of tasks they might find distressing.

Consider the ways a task prompt might inadvertently surface deeply personal trauma: researching how to get a firearm, searching for “prison visitation,” or studying child support laws won’t be neutral requests for people who have been impacted by gun violence, racist incarceration practices, or domestic upheaval. By clearly distinguishing between benign and potentially harmful tasks, we were able to keep our test subjects safe.

Scenario and task definition

At the start of each evaluation, test administrators should share a scenario (a situation that needs resolution) and task (an action that needs performing) with the test subject. Many voice usability researchers define scenarios and tasks in such a way that subjects have relative free rein over their interactions, with any necessary guardrails defined.

Note how these task definitions from Ask GeorgiaGov never reveal the controls or navigation required to access the requested content. Instead, the task invites users to explore the interface and orient their mental maps on their own.

You have a business license in Georgia, but you’re not sure if you have to register on an annual basis. Talk with Alexa to find out the information you need. At the end, ask for a phone number for more information.
You’ve just moved to Georgia and you know you need to transfer your driver’s license, but you’re not sure what to do. Talk with Alexa to find out the information you need. At the end, ask for a phone number for more information.

If a participant faces multiple scenarios to resolve, randomizing their sequence for each new subject is also crucial, write Cohen, Giangola, and Balogh in Voice User Interface Design: earlier tasks can influence how the user reacts to later ones, potentially skewing the results (http://bkaprt.com/vcu36/01-02).

Retrospective testing

Usability.gov delineates two categories of testing: concurrent, where the researcher gathers data live during an interaction; and retrospective, where the researcher saves the data gathering for after the interaction is over (http://bkaprt.com/vcu36/05-02).

Many usability researchers use concurrent testing techniques that track the user’s reactions and interactions in real time, but these tactics are less useful when testing voice interfaces. Spoken words relevant to the evaluator but not to the interaction at hand can accidentally invoke wake words (Alexa, for instance, could mishear “I selected” as “Hi Alexa”). This leaves the user juggling two conversations: one with the voice interface, and another with the human evaluator. A retrospective technique allows these conversations to occur at separate points in time, vastly simplifying the experience on both sides of the table.

Retrospective approaches do have frustrating flaws. For instance, humans are notorious for having slippery memories. Test subjects may insert false recollections into their impressions or misinterpret the outcome of their conversations. But there are advantages, too: retrospective approaches give users time to ponder and polish their thoughts rather than doling out tidbits in an incremental stream of consciousness, as would be the case in concurrent techniques.

In retrospective probing (RP), researchers prepare questions to ask participants once a task is complete, focusing on users’ recent impressions as they performed certain actions. Meanwhile, in retrospective think-aloud (RTA), test moderators have participants retrace their steps after the interaction concludes and may even present test subjects with a transcript or recording of their conversation, asking follow-up questions to shed light on key moments.

To test Ask GeorgiaGov, we used retrospective probing exclusively, with questions we asked users after the interaction was over to collect insights about the interface’s performance. Our retrospective questionnaire consisted of three questions:

Facilitator: “On a scale of 1–5, based on the scenario, was the information you received helpful? Why or why not?” (seeking quantitative and qualitative responses about content relevance and search issues)
Facilitator: “On a scale of 1–5, based on the scenario, was the content presented clearly and easy to follow? Why or why not?” (seeking quantitative and qualitative responses about voice content legibility and discoverability)
Facilitator: “What’s the answer to the question you were tasked with asking?” (verifies that the user landed on correct voice content)
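If you want to compare answers across participants and scenarios later, even a lightweight structure for capturing them helps. Here’s a minimal sketch in Python; the field names are ours for illustration and aren’t part of any testing tool.

  # A minimal sketch for recording retrospective-probing responses.
  # The fields mirror our three questions; everything else is illustrative.
  from dataclasses import dataclass
  from statistics import mean

  @dataclass
  class RetrospectiveResponse:
      participant: str
      scenario: str
      helpfulness: int       # 1-5, from the first question
      clarity: int           # 1-5, from the second question
      recalled_answer: str   # checked later against the correct content
      comments: str = ""

  def summarize(responses):
      """Average the quantitative scores across participants."""
      return {
          "helpfulness": mean(r.helpfulness for r in responses),
          "clarity": mean(r.clarity for r in responses),
      }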

Running voice usability tests

The core mission of a voice usability test is to gauge the quality of the interface along multiple dimensions. Because voice interfaces are more humanlike than other interfaces, even if an interaction was successful, the demeanor of the interface or the nature of the response may be grounds for a user to say it wasn’t a good experience.

Your test procedure, written before the first round of tests, should outline not only the questionnaire you’ll use for your retrospective testing, but also any other preemptive questions users need to be asked, such as demographic information or consent to record. Here are some of the initial questions we asked our usability test subjects after they completed our prescreening questionnaire and sat down:

Facilitator: “Is it okay if we record and transcribe?”
Facilitator: “On a scale of 1–5, what would you say your skill level with Amazon Alexa is, 1 being ‘I’ve never used it before’ and 5 being ‘I use it for everything at home’?”

Our test procedure also spelled out how we provided participants with materials, such as any needed guidance:

User is given a printout consisting of help text from the skill page on the Alexa Marketplace and a scenario to work with.

We also clearly described how users would proceed through scenarios.

Scenario: “You’re a registered nurse and you’ve just moved to Georgia, but you don’t know if your license from your old state is still valid. Talk with Alexa to find out the information you need. At the end, ask for a phone number for more information.”
User is then told to start skill.
User: “Alexa, ask GeorgiaGov.”

Our procedure ended with facilitators asking the three questions from the retrospective questionnaire, and then repeating the procedure to guide the user through two more scenarios, interceding if any issues arose during the test.

As you can see, robust test procedures ensure your evaluation is consistent and repeatable. Because our Alexa skill also involved the integration of discrete technologies such as a CMS, we encountered considerable challenges when it came to debugging, mostly involving hooking up Drupal’s complex content services to Alexa’s own complicated approach to handling data. We learned just how fundamental to the design process “test early, test often” truly is (http://bkaprt.com/vcu36/05-03).
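To give a sense of what that wiring involves (in spirit only; this isn’t the Ask GeorgiaGov code), here’s a simplified Python sketch of a skill backend that hands a user’s question to a CMS search endpoint and wraps the top result in an Alexa-style JSON response. The endpoint URL and response field names are hypothetical.

  # A simplified, hypothetical sketch of a skill backend delegating search to a CMS.
  import json
  from urllib.parse import urlencode
  from urllib.request import urlopen

  CMS_SEARCH_URL = "https://example.georgia.gov/api/search"  # hypothetical endpoint

  def handle_search(query: str) -> dict:
      """Delegate a user's question to the CMS and wrap the top result for the voice assistant."""
      with urlopen(f"{CMS_SEARCH_URL}?{urlencode({'q': query})}") as response:
          results = json.load(response).get("results", [])

      if results:
          speech = results[0]["summary"]  # read the top result's summary aloud
      else:
          speech = "I couldn't find anything about that. Try rephrasing your question."

      return {
          "version": "1.0",
          "response": {
              "outputSpeech": {"type": "PlainText", "text": speech},
              "shouldEndSession": False,
          },
      }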

Multiple rounds of usability testing will uncover problems in your interface that need addressing before deployment and launch. But even beyond this, there are a few other tasks I’d advise teams to perform before launching a voice interface.

Just Before Launch

The prerelease phase differs from the iterative voice usability testing we’ve covered earlier in this chapter. We’re now concerned with making sure that every piece of voice content is discoverable, that every point in the dialogue is accessible, and that the voice interface behaves in lockstep with call-flow diagrams.

Tasks during the prerelease phase involve cross-functional teams and stakeholders, and we’ll cover each in the coming sections:

  • dialogue traversal testing (DTT)
  • logging and analytics
  • creating a custom dictionary (if needed)

Other tasks might be warranted in this phase, such as quality assurance (QA) testing to ensure software code performs as expected, and load testing to ensure many users can simultaneously access the interface. Typically, both of these are responsibilities of the underlying technology and the developer team.

Dialogue traversal testing

In dialogue traversal testing (DTT), someone on the team examines every nook and cranny of the interface to check if there’s any point that could jeopardize the entire interaction. Confusing transitions, missing error-recovery strategies, or missing responses might surface in DTT.

Regardless of what kind of voice interface you’re building, conducting a dialogue traversal test involves interacting with the fully functional voice interface. As Cohen, Giangola, and Balogh write in Voice User Interface Design, “You should try silence to test no-speech timeouts. You should impose multiple successive errors within dialog [sic] states to ensure proper behavior” (http://bkaprt.com/vcu36/01-02).

It’s crucial to traverse as many of the error trajectories associated with every dialogue state as possible, especially for the two most frequent errors voice interfaces need to handle: no-speech timeouts (when no speech was detected) and no-match errors (the system isn’t equipped to handle the user’s response).

There’s no one right way to conduct dialogue traversal testing. An optimal test script ensures every dialogue state and every error-recovery strategy is visited. In some scenarios, especially open-ended prompts (“What can I do for you today?”), a complete round of DTT might be impossible due to the limitless range of possible user responses. For these interfaces, at minimum, this means ensuring that users can reach all basic functions, dialogue states, and happy paths.
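One way to keep that test script honest is to treat the call flow as data and check off each state and error path as testers visit it. Here’s a minimal sketch in Python; the states and error paths are hypothetical, not a real interaction model.

  # A hypothetical coverage checklist for dialogue traversal testing (DTT).
  CALL_FLOW = {
      "welcome":      ["no_speech_timeout", "no_match"],
      "topic_search": ["no_speech_timeout", "no_match", "no_results"],
      "offer_phone":  ["no_speech_timeout", "no_match"],
  }

  visited = set()

  def record_visit(state, path="happy_path"):
      """Testers call this as they walk each branch of the call flow."""
      visited.add((state, path))

  def remaining():
      """Every state and error path the team hasn't traversed yet."""
      expected = {(s, "happy_path") for s in CALL_FLOW}
      expected |= {(s, e) for s, errors in CALL_FLOW.items() for e in errors}
      return sorted(expected - visited)

  record_visit("welcome")
  record_visit("topic_search", "no_match")
  print(remaining())  # whatever is left to test before launch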

In Designing Voice User Interfaces, Pearl suggests printing out the final call-flow diagram for your interface and testing each of the journeys or traversal patterns represented in the diagram, scrawling notes about problem areas in the interface in the process (Fig 5.1) (http://bkaprt.com/vcu36/03-02). Like running into every possible dead end in a labyrinth, DTTs are first and foremost about ensuring your users never get lost or waylaid at any critical point; they’re not meant to be exhaustive, since that is usually unrealistic.

Fig 5.1: Highlighted in red, one of many possible traversal patterns through our call-flow diagram for Ask GeorgiaGov, in which we trigger a recursive behavior and decline a phone number.

Logging and analytics

If your interface pulls content from a CMS, how often do 404 errors occur? Do users who ask about a particular topic run into more errors? Is there a point at which a handoff to a third party is timing out, returning nothing? Logs and analytics, which track users as well as their successes and failures across the entire interaction, help stakeholders measure how the quality of the interface evolves over time and inform debugging and maintenance later on.

Common metrics include the amount of time spent within the interface, task completion rates, dropout rates, barge-in (when a user interrupts an ongoing machine utterance) rates, and the frequency of no-speech timeouts and no-match errors. Each metric offers insight into how your initial users will judge the interface, allowing for quick calibration just after launch. For instance, a high barge-in rate suggests that users are growing frustrated too early and interrupting the interface before it’s done talking.
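If your logs capture these events, the metrics themselves take little effort to derive. A minimal sketch in Python, assuming hypothetical per-interaction records:

  # Hypothetical per-interaction records; adapt the fields to whatever your logs capture.
  interactions = [
      {"completed": True,  "barge_ins": 0, "no_speech": 0, "no_match": 1, "seconds": 48},
      {"completed": False, "barge_ins": 2, "no_speech": 1, "no_match": 3, "seconds": 95},
  ]

  total = len(interactions)
  metrics = {
      "task_completion_rate": sum(i["completed"] for i in interactions) / total,
      "dropout_rate": sum(not i["completed"] for i in interactions) / total,
      "barge_in_rate": sum(i["barge_ins"] > 0 for i in interactions) / total,
      "no_match_per_interaction": sum(i["no_match"] for i in interactions) / total,
      "average_seconds_in_interface": sum(i["seconds"] for i in interactions) / total,
  }
  print(metrics)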

For all of your analytics and logging to succeed, your voice interface needs to provide or export data in a format that stakeholders and maintainers alike can understand. Without a mechanism for logging, you won’t have a clue how your users are evaluating your voice content once you launch.

For voice content, your CMS or database can often capture logs the same way it might track metrics for a website. In other cases, such as when you depend on a database you don’t control, you may need to forward the gathered data to a third-party system or dashboard. Logs help developers understand where things are going awry, but they’re also critical for content strategists and editors, who can use them to gauge how their content performs in voice and adjust it accordingly.

Logs should archive key information from interactions themselves to give both engineers and editors insight into user intents that result in more errors, machine prompts that confuse users, and content that requires reformulation. Your logs might encode information such as:

  • the recognition result (how the system understood a user, with confidence scores; this might include audio recordings)
  • the match result (what the system matched the user’s utterance to)
  • error states (no-speech and no-match errors)
  • dialogue or decision state names (identifying where the user was in the interface)
  • latency (to account for where delays may be occurring)
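To make those fields concrete, here’s one possible shape for a single log entry, sketched in Python. The field names are illustrative, not a schema any particular platform requires.

  # One possible shape for a voice interaction log entry; field names are illustrative.
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class VoiceLogEntry:
      timestamp: str                 # e.g., "2017-10-12T14:03:22Z"
      dialogue_state: str            # where the user was in the interface
      recognition_text: str          # what the system heard
      recognition_confidence: float  # 0.0 to 1.0
      matched_intent: Optional[str]  # None on a no-match error
      error: Optional[str]           # "no_speech", "no_match", or None
      latency_ms: int                # time taken to respond

  entry = VoiceLogEntry(
      timestamp="2017-10-12T14:03:22Z",
      dialogue_state="topic_search",
      recognition_text="renew my business license",
      recognition_confidence=0.87,
      matched_intent="SearchIntent",
      error=None,
      latency_ms=640,
  )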

Because Ask GeorgiaGov forwards users’ questions to the search service on the Georgia.gov website, we couldn’t rely solely on the reports that Amazon Alexa itself provides developers and maintainers. In addition to Alexa’s built-in logs, we built logs into the Drupal CMS that recorded not only errors but also events that Alexa couldn’t handle solo, like search queries it needed to delegate. In the process, we were able to gather transcriptions of all search queries, queries that returned no results, and, most crucially, what content was pulled from Drupal into Alexa.

Conveniently for the Digital Services Georgia team, these logs and reports sat alongside the very same dashboards the editorial organization used to analyze the performance of the Georgia.gov website, allowing for quick and easy comparison between the two channels. If you’re dealing with content delivered through multiple channels, consider what will make the lives of your content editors easiest.

Custom dictionaries

One crucial step before launch is to populate your custom dictionary, if your framework permits it. In Amazon Alexa and Google Home, for instance, you can register terms and phrases that might otherwise be unfamiliar to the platform’s automated speech recognition.

Say you’re building an in-car navigation system for a specific city. Registering “Clemons Street” at a higher probability than “chemistry” can help Alexa more readily select the likelier result. In other cases, you may need to become a bit of a lexicographer and introduce terms to the interface’s vocabulary that are absent from common usage. Some names, for example, may require a pronunciation that a speech synthesizer doesn’t recognize from the natural language it encountered during its training (like “Houston Street” in Manhattan, which is not pronounced like the name of the Texan metropolis).
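On the synthesis side, major platforms, Alexa included, let you override how a term is spoken using SSML’s phoneme tag. A minimal sketch, with the markup wrapped in a Python string and an approximate IPA transcription:

  # Overriding text-to-speech pronunciation with SSML's phoneme tag.
  # The IPA transcription is approximate; verify it against your platform's SSML reference.
  ssml = (
      "<speak>"
      'Take the subway to <phoneme alphabet="ipa" ph="ˈhaʊstən">Houston</phoneme> Street.'
      "</speak>"
  )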

Check your dialogues for:

  • brand names
  • proper nouns
  • neologisms
  • jargon
  • loanwords from other languages
  • unusual pronunciations
  • uncommon words
  • homophones or similar-sounding terms needing disambiguation

Before deploying Ask GeorgiaGov, we discovered several issues that required a custom dictionary. Certain common terms in Georgia state law went unrecognized by Alexa, most notably ad valorem tax, a Latin term that refers to a tax based on the appraised value of a transaction or property. We added this and several dozen other terms to our custom dictionary. Depending on the kinds of terminology you deal with, you’ll want to consider a dictionary, too, if your platform supports it.
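How you register recognition hints varies by platform. On Alexa, domain terms like these typically become sample values on a custom slot type in the interaction model. Here’s a hedged sketch of that shape in Python; the slot type name and every value besides ad valorem tax are invented, and the exact schema may differ from what your tooling expects.

  # A hedged sketch of custom slot values for domain terms the recognizer kept missing.
  custom_slot_type = {
      "name": "GEORGIA_TOPIC",
      "values": [
          {"name": {"value": "ad valorem tax"}},
          {"name": {"value": "homestead exemption"}},
          {"name": {"value": "business license renewal"}},
      ],
  }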

Voice Content in the Wild

Ask GeorgiaGov was released on the Alexa Skills Marketplace in October 2017 to great fanfare from both the Acquia Labs innovation team and our client, Digital Services Georgia. We were overjoyed to receive feedback from real Georgians all over the Peach State who now had another means of reaching their state government. And thanks to our usability test results, logs, and reports, we had the evidence to back that up.

Eight months after Ask GeorgiaGov launched, we held a retrospective with the Digital Services Georgia team to leaf through the logs and review the reception among residents of Georgia. What we found shocked us all. As it turned out, there was a massive gap between the topics most searched for on the Georgia.gov website and those sought through the Ask GeorgiaGov interface. Users who turned to Alexa for content searched most often for information about vehicle registration, driver’s licenses, and the state sales tax. Though Ask GeorgiaGov has since been decommissioned, these results were deeply instructive for Georgia’s subsequent forays into conversational interfaces, like the text-driven chatbot that carries the same name (http://bkaprt.com/vcu36/02-01).

Buoying all of us who built one of the first-ever content-focused Alexa skills, 79.2 percent of all user interactions over the preceding eight months had led to the delivery of a successful content response, a number that was striking due to the relative immaturity of Amazon Alexa at the time and the lack of informational voice interfaces in production. We also found that 71.2 percent of all interactions led users to request an agency phone number, validating our choice to broaden our thinking to devices beyond Alexa, like smartphones.

Buried deep in the reports, however, we found perplexing 404 errors citing a search term that appeared repeatedly in the logs as “Lawson’s.” It was an anomaly we couldn’t unravel. After consulting the native Georgians around the table, we unearthed the culprit. As it turned out, one of our valued users, somewhere in Georgia, had been trying fruitlessly to get Alexa to understand her pronunciation of the word “license” in her native drawl (http://bkaprt.com/vcu36/05-03).

This heartwarming anecdote highlights an honest truth. Though testing early and often and preparing copious logs can bring an interface to within an inch of perfection, flaws still arise after launch, whether due to human error or, in this case, because of Alexa’s own inability to understand our kaleidoscope of American English dialects. It seems we can rest assured that voice interfaces still have a way to go before they outwit us at our own game of human conversation.
