Preface

This book has been almost twenty years in the making. During those twenty years, the running line among practitioners in the speech technology field was, and for many still is: “Speech is just around the corner.” Meaning, by this time next year, God willing, Speech technology will finally deliver on its promise and will at long last be adopted as a reliable way for humans to retrieve and create information, as well as do things: instead of typing, pushing buttons, tapping, and swiping, they will just speak and listen.

In the early days, the proposition that “Speech is just around the corner” was an earnest aspiration. There was exuberance (this was the 1990s after all) and, for the most part, the prediction was hope-filled. In hindsight, the proposition looks almost irrational, given the state of the technology’s usability at the time, its cost, and its basic performance, which was slow and inaccurate. But then, as the years wore on, the prediction turned into a healthy mix of self-deprecation (‘How could we have been so arrogant?’), stubborn defiance (‘But -- we will make it happen!’), and a sober aversion to anything that smacks of hype (‘And when it does seem to be happening, we will keep our skeptical eyes wide open’).

Voice telephony systems, otherwise known as the unloved Interactive Voice Response (IVR) systems -- where humans call a phone number intending to speak with another human, only to be unpleasantly met by a system that tries to speak and listen -- were the first interactive speech technologies to go mainstream, in the early 2000s. And although they did deliver value, notwithstanding the justified grousing from users, they somehow didn’t count as the fulfillment of the “Speech is just around the corner” aspiration. It was not until the unveiling of the iPhone 4S on October 4, 2011 (one day before the death of Steve Jobs) that one could arguably say that speech had arrived: Siri was born, and interactive speech was now available, on demand, to the tens of millions of people who owned an iPhone.

The arrival of Siri was a watershed moment not only because interactive speech was now available on a smartphone, but because the type of speech-based interactions it delivered was fundamentally different from the ones users were encountering in IVR systems. The key difference lies in the basic fact that when someone calls a phone number, they intend to speak to a human being. When they encounter an automated system instead, they have to decide whether to interact with it, immediately ask for an agent (or “zero out”), or simply hang up. With Siri, in contrast, the user willingly engages in speech automation. When they press and hold the home button (as the first versions of Siri had the user do), they are not expecting to speak to a human: they expect, and want, to speak to the speech app. In other words, the user wants to self-serve using speech. This was a first for mainstream voice technology.

Even so, Siri never really took off the way we speech practitioners were hoping it would. After the initial swell of enthusiasm, it quickly became clear that what we had witnessed with Siri was not the fulfillment of “Speech is just around the corner” but an incremental, though important, enhancement of a multimodal interface to which a new mode -- voice -- was added. Not only could you see, touch, and feel (through haptic response), but you could also speak and listen to fulfill self-service tasks. However, Siri did achieve a major accomplishment for which we Speech practitioners should forever be thankful to Steve Jobs. It introduced a cutting-edge, “cool” context within which Speech was being used -- namely, the iPhone -- raising the brand standing of Speech as a technology that now had a future to look forward to beyond the crabbed and unexciting world of telephony IVR.

And so we had to wait until November 6, 2014, when the Amazon Echo arrived, to be able to make a reasonable case that speech was no longer just around the corner but that it had arrived.

The Amazon Echo was the first device to deliver on three fundamental aspects of the voice interface that made it a candidate for fulfilling the promise of Speech technology: (1) it was an interface the user engaged willingly; (2) it enabled far-field interactions, meaning that the user did not need to place the device near their mouth, as they had to with smartphone and dictation microphones; and (3) most crucially, the user could engage with it while both their eyes and their hands were busy: they didn’t need to look at anything or touch anything to interact with the speech system.

We, the authors of this book, are very much of that generation: We did enthusiastically believe in those early years that speech was around the corner, we were disappointed with every passing year that it had not arrived, but we kept our hope and our resolve alive, and now we do believe that Speech has arrived. Also, because we are of that generation, we are averse to hype, and our critical, if not skeptical, eyes remain wide open.

About This Book

This book focuses on a very specific type of interface: the voice-first interface -- or “voicebot” for short from now on. This is the interface that helps users with interactions where their eyes and their hands are not at their disposal -- or where they choose not to have them at their disposal: they are under the hood of the car fixing an oil leak, potting a plant, taking a shower, lying in bed half asleep, preparing food, driving their car, blind or temporarily without sight, folding clothes while watching TV, tidying up the house, walking the dog, having a face-to-face conversation with someone, on a Zoom call, typing away at their laptop, pecking at their smartphone, or standing in a museum staring at a painting. It is the challenge of designing for those myriad use cases that this book takes on.

Therefore, this book does not touch on designing for non-voice, text-based chatbots. Nor does it propose to help designers build multimodal interfaces. Designing multimodal interfaces, even those where voice is a central modality and the other modalities (screen, touch, haptics) play a supporting role, is an entirely different endeavor. Novice designers often make the mistake of thinking of voice-centric multimodality as something like ‘voice-first plus,’ when in reality it is its very own, separate type of interface, as different from voice-first as it is from visual multimodality, where the visual modality, and not voice, sits at the center of the interface (for instance, the smartphone or the tablet).

A few words on the format. You will notice that this book does not make use of any visuals except at the very end, in the appendices. Our stance is this: if we are going to help the reader design compelling, effective, and even enjoyable voicebots, with the only tool at the disposal of both the designer and the user being spoken language and audio, we had better be able to communicate our concepts and recommendations through language alone, and, we hope, illustrate how effective voicebot design can be delivered through the way we write and how we lay out our material.

You will also notice that we are short on wordy introductions, that we avoid elaborations, and that we usually skip neatly wrapping up chapters with conclusions. This is by design. If we are going to preach brevity, precision, and moving the conversation along as core principles of effective voicebot design, we had better reflect that in our writing style.

Speaking of style, this book is written in the spirit of the classic English style guide The Elements of Style by Strunk and White, a book familiar to most, if not all, English majors, and in fact probably to anyone who has taken an English composition class. What makes this tiny, century-old monograph, first published by Harcourt in 1920, a compelling book that has served generations of writers is its focus on the “bottom line”: it is a book that cuts to the chase. Our aim is to emulate that style.

In addition, just as The Elements of Style was never meant to be the final word on writing and composition but a handy manual for writers to consult when they need actionable answers to concrete questions, this book is similarly meant to be used as a companion to, and not a replacement for, the many excellent books on conversational voice design readily available to the designer. We provide a long list in our references section.

Who Should Read This Book?

The target readers of this book are budding and practicing voicebot designers in the newly emerging technology space of far-field voice, as delivered by platforms such as Amazon Alexa and Google Assistant, and hearable/speakable technologies, such as Apple’s AirPods. The book can also be useful to those who design IVR systems, but only to the extent that those systems are used eyes-free and hands-free.

While this book is meant primarily to help voicebot designers think through and make sound design decisions, we have written it explicitly to be highly readable and jargon-free, so that it is also accessible to the colleagues with whom a designer must work to deliver effective voicebots: user experience (UX) researchers, product managers, developers, testers, marketers, and business development professionals.

Why We Wrote This Book

This book aims to provide direct answers to questions such as, “How do I design an effective opening interaction with a voicebot?” or “What should I keep in mind as I design for failures?” or “What are some best practices for designing a conversational voice help system?” Answers to such questions can sometimes be found in other books, but the reader usually has to hunt for them, and they may need to consult several books before they find what they need. This book pulls those answers into one text and focuses on answering such questions directly and succinctly.

This book does not, however, pretend to provide final, immovable, timelessly frozen answers. Our aim instead is to crystallize the crucial questions that designers should ask themselves when they undertake their work, and then to provide our answers, drawing on our decades-long experience designing and deploying voicebots. For instance, the designer needs to consider carefully how a conversation opens: the first few seconds of an interaction are crucial and can spell its success or failure. Someone who has never designed a voicebot may not even be aware of how crucial those opening moments are. It may also not occur to that designer that the first-time user and the frequent user must be engaged differently, or that prompts should be crafted so that the user knows what to say when the prompt completes, or that there are time-proven techniques for writing effective error prompts. Teaching the designer how to grapple critically with the many challenges they will face when designing voicebots is our ultimate goal, not prescribing fixed and non-negotiable recipes.

This book has a second, and perhaps more ambitious, aim: to argue, and to advocate through its recommendations, for the following position. The practice of designing effective voicebots needs to free itself from the notion that the closer a voicebot mimics a human (through the sound of its voice, the language it uses, the “persona” it assumes), the better the experience of the voicebot user will be. We believe that this position is as faulty as, say, stating that the ways an adult speaks to a baby, a child speaks to a dog, or a person speaks to someone who doesn’t share their language are imperfect styles that should be improved upon until they emulate two humans speaking with each other. We will advocate for the outlines of a style of interacting with voicebots that borrows many of the ways humans speak to each other but deviates, at times in significant ways, from human-to-human speech.
