Chapter 6. The Future of Voice Content

We’re rapidly entering a new era of voice interface design. Up to this point, this book has presented a contemporary account of how we design and build voice interfaces that may bear little relation to how they’ll look—and sound—in the coming years. In the near future, how will new tools and technologies make voice interfaces not only more robust, but also more democratized for designers who neither write code nor want to deal with hardware?

Accelerating innovation in voice interfaces is rightfully leading designers and content strategists to pose key questions about where they fit in the picture, how lifelike speech synthesizers will truly perform, and whether approaches that aspire to authentic conversation will be available to anyone and everyone, not just rarefied corporations.

It’s ironic that soon, many of the machine-centric tactics laid out in this text may give way to a much more human future. Design artifacts like sample dialogues and call-flow diagrams may yield the floor to intelligent algorithms and neural networks that can bamboozle people into thinking they are chatting with a fellow human being. But debate continues about whether voice users of today actually want a conversational partner who is indistinguishable from a real human being. Do we truly want to be hoodwinked? Will voice interfaces actually represent our multifaceted world?

After all, rooted in the notion of “perfectly human” voice interfaces is a brewing debate about how faithfully our voice interfaces ought to mirror our society. When the face of a voice interface is necessarily a human being, how do we resolve issues of representation and belonging so that users see and hear themselves reflected in the technology they use? As more people build voice interfaces on their own and the gap between computers and ourselves narrows, the future of voice content will be bright, so long as we also repair the damaging biases in our society that threaten that lofty vision.

The Democratization of Voice Design

Since the advent of the first IVR systems, with their finicky hardware and dizzying requirements, the way designers and developers build voice interfaces has continued to evolve, lowering the barrier to entry for a wider audience. No longer do we need a postgraduate degree in computational linguistics, or a working knowledge of call-routing systems and VoiceXML, to succeed.

A new competitive landscape in the conversational industry is bringing a medium formerly confined to large corporations down to the level of designers and developers working alone to craft voice interfaces. Thanks to new technologies and frameworks that make designing and building voice interfaces easier than ever, voice interfaces are no longer available only to some; they’re now accessible to all of us.

But even though what-you-see-is-what-you-get (WYSIWYG) and drag-and-drop tools are beginning to surface for voice interfaces, many designers today still depend on developers. It’s often still fundamentally the domain of developers to implement the voice applications designers and content strategists want to see. In some ways, the situation has changed little from the IVR-systems era.

Industry leaders are beginning to grasp in earnest the value of designer-friendly tooling for voice interfaces. Breaking this barrier for designers will democratize voice interface design in much the same way that no-code tools have revolutionized web design. In only a few short years, we may see even more no-code tools proliferate—joining the likes of Voiceflow, Botsociety, and Google’s Dialogflow—that allow creatives to spearhead the process from start to finish, all without the aid of a developer. Soon, for the first time, voice interfaces will truly be buildable by anyone and everyone.

How Voice Will Reframe Content

For editorial teams, voice content will prove to be a compelling conduit to reach users beyond the website in the coming years. In addition to growing demands for voice-enabled content, we’re also in the midst of an explosion in other channels for content delivery like mixed reality, wearables, and digital signage. Our page-bound content strategies and web-only content management systems will need a full refresh to keep up.

Today, we can no longer assume that our content will always display as a page. The era of macrocontent, the stuff of long-form websites, is giving way to more modular microcontent: atomic chunks of copy that resemble the sort of content we scroll through on social media, microblogging tools, and, of course, voice interfaces.

But how can organizations successfully transform legacy macrocontent into modern microcontent? After all, there isn’t yet a good solution for reconciling content that needs to be rendered into distinct spoken and written forms. Without managing multiple versions (a potential maintenance headache), it’s still difficult to imagine toggling between written text in a formal register and spoken text in a colloquial register on the fly.

Some content strategists might suggest maintaining parallel renditions of text, one for the written website and another for the spoken voice interface, but this increases the burden of upkeep for editorial teams. I’m interested in seeing more automated approaches become reality: a tool could traverse content items in a corpus and, based on certain settings, chunk that content into voice-ready pieces and even potentially convert formal, buttoned-up text into more fluid, conversational prose.
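To make that idea concrete, here’s a minimal sketch in Python of what such a tool might do. Everything in it is hypothetical: the contraction table, the chunk size, and the sample copy are illustrative assumptions, not features of any real product, and a production pipeline would need far more linguistic nuance than a lookup table and a sentence splitter can offer.

```python
import re

# Illustrative mapping from formal wording to conversational contractions.
CONTRACTIONS = {
    "cannot": "can't",
    "do not": "don't",
    "will not": "won't",
    "it is": "it's",
}
PATTERN = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")\b", re.IGNORECASE)

def conversationalize(text: str) -> str:
    """Swap formal constructions (like "cannot") for colloquial contractions."""
    def swap(match: re.Match) -> str:
        casual = CONTRACTIONS[match.group(0).lower()]
        # Preserve sentence-initial capitalization.
        return casual.capitalize() if match.group(0)[0].isupper() else casual
    return PATTERN.sub(swap, text)

def chunk_for_voice(text: str, max_words: int = 40) -> list[str]:
    """Split long-form web copy into short, utterance-sized chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

page_copy = (
    "Applicants cannot renew a license online once it has expired. "
    "It is necessary to visit a customer service center in person."
)
for chunk in chunk_for_voice(conversationalize(page_copy)):
    print(chunk)
```

Even this simplistic pass hints at the shape of the pipeline: transform the register first, then carve the result into utterance-sized pieces a voice interface can comfortably speak.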

Some emerging services are headed in the right direction. Amazon Polly, for instance, renders written text into speech that sounds natural to the human ear and can integrate with CMSs like WordPress and Drupal to generate audio recordings of copy read aloud by configurable voices (http://bkaprt.com/vcu36/06-01). However, Polly doesn’t yet support in-text transformations, like swapping the formal word cannot for the informal contraction can’t.
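For developers curious what this looks like in practice, here’s a minimal sketch of calling Polly through the AWS SDK for Python (boto3); the voice, engine, and sample text are illustrative choices, not recommendations:

```python
import boto3

# Synthesize a short piece of voice content with Amazon Polly.
polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text="You can't renew an expired license online.",
    OutputFormat="mp3",
    VoiceId="Joanna",  # one of many voices Polly offers
    Engine="neural",   # neural voices tend to sound more natural
)

# The response carries a streaming audio payload we can save to disk.
with open("voice-content.mp3", "wb") as audio_file:
    audio_file.write(response["AudioStream"].read())
```

Note that the text passes through verbatim: any register shift, like cannot becoming can’t, still has to happen upstream before the copy reaches the synthesizer.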

A truly channel-agnostic solution would dynamically generate content on demand, optimized for voice and other contextless content experiences, all without demanding any manual modification of the original content source. Whatever innovations take hold, voice content won’t just upend how we consume content; it’ll also reinvent how we author, govern, and publish it.

Inclusive Voice Content

More voice content will also benefit disabled people who employ assistive technologies such as screen readers to navigate web content, in addition to people who use mobility aids or can’t use a keyboard and mouse. For Ask GeorgiaGov, for instance, we discovered that many elderly and disabled Georgians could acquire information far more quickly from our voice content than by traveling in person to or calling an agency office. If this was our experience, the implications of voice content for accessibility and inclusion could be staggering.

For Blind users, screen readers leave much to be desired, because the visual nature of the web limits efficient aural navigation. Interfaces that present voice content offer a chance for Blind people to interact with content using conversational structures, which are far more understandable than a screen reader’s verbose translation of what a web page looks like. Never before has another user interface mounted a serious challenge to screen readers as Blind users’ primary means of accessing web content. Voice interfaces slinging voice content could conceivably supersede screen readers altogether.

When we extricate ourselves from a single-channel bias and evaluate how different users interact in distinct ways with content on a range of devices, we can better meet their needs and make all access to content more equitable. Delivering content through voice could support someone with chronic migraines who can’t stare at a screen for more than a few minutes, or someone with Parkinson’s disease who prefers to speak instead of moving a mouse. Offering multiple keys to unlock content gives people with different lived experiences more choice, more flexibility, and more equity.

But throughout this book, we’ve identified a problem: What’s the use of providing voice content if the conversations we have with an interface are stilted and repetitive? If we truly want to meet every user where they are, shouldn’t we foster more lifelike human conversation?

Conversation-Centric Design

There’s something eerily soulless about reducing our entire capacity for conversation to two seemingly ill-fitting types: transactional and informational voice interactions. After all, among the acknowledged pitfalls of voice interfaces is their uncanny aloofness. User experience researchers Robert Moore and Raphael Arar contend in Studies in Conversational UX Design that the voice interfaces of today aren’t adequate for true conversation, precisely because of this rigidity:

Creating a user interface that approximates [natural conversation] requires modeling the patterns of human conversation, either through manual design or machine learning. Rather than avoiding this complexity, by producing simplistic interactions, we should embrace the complexity of natural conversation because it mirrors the complexity of the world. The result will be machines that we can talk to in a natural way, instead of merely machines that we can interact with through unnatural uses of natural language. (http://bkaprt.com/vcu36/03-03)

According to this argument, voice interactions that feel natural are more faithful to the sort of organic and meandering conversation we might have in passing with a cashier at the grocery store—at times transactional, at times informational, and at times prosocial.

Most voice interfaces up until now have been solely transactional or informational, with limited capacity for idle chatter beyond some onboarding and feedback. But today, voice assistants like Amazon Alexa and Google Home compete with each other on the basis of their affinity for human conversation. After all, as conversational designer Deborah Dahl emphasized at Mobile Voice 2016, a voice assistant “just doing a web search doesn’t show understanding” (http://bkaprt.com/vcu36/06-02).

Sadly, the utopian ideal of conversation-centric design—equipping a voice interface with the capacity for truly extemporaneous dialogue—is still a long way away. Attractive, conversation-centric design requires budgets that most organizations simply don’t have. Without the full force of technical teams exploring the latest and greatest in low-level speech technologies, which remain luxuries only behemoths like Amazon, Apple, and Google can afford, a truly conversation-centric interface remains a pipe dream for most designers.

Voice interfaces occupy a new niche in our growing collection of artificially learned interfaces, right next to keyboards, mice, and game controllers. We’ve gotten used to their mechanical awkwardness and cold repetitiveness, and in the process we’ve developed a new category of artificial interactions to acquire and rehearse. Learning to type quickly on a keyboard, then, exercises much the same kind of muscle memory as learning to have effective voice interactions with Alexa or Google Home, but those rehearsed behaviors don’t necessarily pass as true conversation.

This highlights the paradox of conversation-centric design. Designing voice interfaces that are more stilted allows us to limit the interface’s responsibilities to what’s within reason, like delivering content related to a single topic. It keeps our scope tight. But conversation-centric design to its fullest extent means washing away those boundaries and meeting the expectations humans have of a normal conversation. True conversation-centric interfaces can answer any conceivable question instead of only handling preprogrammed transactional or informational use cases.

That said, while the ideal of conversation-centric design poses additional problems when it comes to user expectations, that doesn’t mean we can’t aspire toward conversation-centricity with custom-built informational and transactional interactions—and even a smidgen of prosocial small talk, even if it’s on the kitschy side—that coalesce toward the user’s happiest path. Moreover, conversation-centric design is a fundamental component of how technologists see voice interfaces innovating in the coming years.

The Conversational Singularity

Perhaps the most urgent question for the voice industry at large is the one whose answer will render this book obsolete and relegate it to bookstore bargain bins: When will the manual design strategies we’ve covered in this book, like sample dialogues and call flows, give way to truly conversation-centric design?

After all, the mission of voice assistants like Alexa and Google Home is to make all content across the web available to voice users, not just a carefully curated subset of it siloed away on a site somewhere. Though we remain far from truly natural conversation, substantial investments by the likes of Amazon, Apple, Google, and IBM are well underway to outperform the highly structured interactions that voice users have with their devices today using custom-built skills.

Slowly but surely, we’re making our way toward what Mark Curtis calls the conversational singularity, a moment when the frictions between humans and conversational interfaces vanish because machines are capable of having bona fide organic conversation at last (http://bkaprt.com/vcu36/06-03). Dependent on advancements in natural language processing, natural language generation, and speech synthesis, the conversational singularity will render our carefully constructed, custom-built voice interfaces obsolete—but will swing the door open to new promise for voice experiences.

The conversational singularity is compelling for futurists, because it portends a future where voice interfaces display what Harris calls habitability, which he defines in Voice Interaction Design as characterizing “a system the user can inhabit.” Rather than constraining the user like gutter guards in a bowling alley, Harris suggests voice interfaces could do the work to accommodate the user ergonomically and adapt intelligently to whatever they have in mind (http://bkaprt.com/vcu36/02-02).

Right now, we’re in that strange transitional time where anyone can architect voice interfaces, but we must still program them in ways diametrically opposed to the ideal of conversation-centric design. Soon, however, the corporations that gave us Siri, Cortana, and Alexa may reach capabilities that go well beyond all custom-built voice interfaces on earth. This outcome may sound like a happy ending, but it is in fact a dangerous proposition.

One of the less desirable effects of the conversational singularity could be the continued concentration of the reins and levers governing conversational technology in the hands of the wealthy and privileged few. It could also lead to mass layoffs of customer service agents and call center staffers around the world, many of them in low- and middle-income countries. This leads us to our final topic: inclusion and representation in the voice world itself.

Identity and Intrinsic Bias in Voice

When you hear Alexa, Cortana, or Siri speaking, who is the person you picture in your mind?

We eagerly anthropomorphize the machines we have conversations with, even though machines couldn’t care less about how they are personified. Despite the fact that they are entirely digital automata, we ascribe traits to them and name our voice assistants Alexa and Siri. We treat such “executive assistants” as cisgender white women, despite the misogyny and racism inherent in such characterizations—not to mention the fact that faceless voice assistants bear no resemblance to a real person.

A human identity can give life to normally austere interfaces like IVR systems and voice assistants, but it comes at a cost. The sexism in our society that portrays executive assistants only as secretarial women pervades every aspect of voice assistants, including how we talk to them. Deeply held biases prejudice us in favor of one type of voice assistant over another. The attributes we bestow upon speech synthesizers to establish identity may in fact worsen the systemic oppression that many of us face on a daily basis in society.

Because those tasked with building the first IVR systems were generally straight cisgender white men, it’s no surprise that most voice assistants are, to the user’s ears, straight cisgender white women who speak in a General American dialect. There’s seemingly no space for dialects like Indian English and African-American Vernacular English (AAVE), immigrants who speak English as a second language, and queer people who code-switch between LGBTQ+ and straight-passing modes of speech.

There are some tentative emerging steps in the right direction. In-car navigation applications like Waze now permit users to upload their own short audio clips to replace the default voice recordings that pepper rides with various versions of “Keep left” and “Accident reported ahead.” Once, while I was in a Lyft humming down an expressway, my driver shared how he’d enlisted his school-age daughter to record her own voice. “I wanted to have her with me on every ride,” he told me proudly as she rattled off directions.

I’m eager to see this level of voice customization, or at least a broad selection of possible voices to choose from, become available for speech synthesizers in voice assistants. Not only would it pave the way for better representation of diverse dialects and lived experiences; it would also facilitate more inclusive and equitable forms of voice interface design.

Our voice interfaces are capable of confirming or challenging our deeply held biases about society. Today’s audiences for voice interfaces are increasingly attuned to the need for representation of the marginalized and underrepresented, which might manifest in voice interactions through bilingual or dialectal code-switching, colloquialisms used among communities of color, and speech synthesizer customization. Because voice interfaces are the most humanlike of all digital experiences (a refrain throughout this book), we must respect—and celebrate—the humanity of those we aim to reach.

Representation Matters

Inclusion in voice design is not merely about accessible alternatives to screen readers or hearing a variety of voices. It’s also about authentic representation. After all, the digital identities we give voice interfaces can introduce human problems that machines are not coded to consider.

In the realm of voice, interfaces are no longer just machines—we see them, for better or worse, as people with their own fully formed identities. In order for voice content to become truly inclusive of people having diverse lived experiences, creators of voice interfaces must grapple with the systemic issues that surface. My sincere hope is that corporations like Microsoft, Apple, Google, and Amazon instigate top-down approaches to improve inclusion and customization in speech synthesizers.

In proprietary software, however, this is unlikely. At the same time, then, those responsible for the foundations of speech synthesis that all voice designers rely on should open avenues for others to create vocal patterns and speech styles that more accurately represent the real world. Such a grassroots, bottom-up solution would require speech synthesis technologies to be open-sourced, and the resulting progress could reinvent representation in voice interfaces forever.

I hope for a not-too-distant future where we can configure the voices we hear on Alexa like we do on Waze, where those who engage in bilingual code-switching can hear those same toggles represented on Google Home, and where Black trans women and other multiply marginalized groups can hear someone from their own communities telling them the baggage carousel for their flight, the closing time of continental breakfast, and the best way to get a small business loan.

The golden age of voice content is rapidly approaching, but we remain in the early stages of a mass migration of formerly web-only content to newly voice-enabled content. We must not let it fall victim to the vagaries of human society that our machines simply couldn’t care less about, like the inherent biases—racism and ableism, misogyny and misogynoir, homophobia and transphobia—that continue to silence the people we serve.

We’ve had a long-running conversation throughout this book about the amazing things that can happen when we give our copy, long trapped on the web, a second life as voice content. Now, it’s time for us to work toward the next and far more important milestone as technologists. It’s your turn to take up the challenge. Just imagine what we can accomplish and invent if we can restore to those who have long been oppressed and underrepresented in our society a voice—one that each and every one of us can truly hear ourselves in, and call our own.
