© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
A. Thymé-Gobbel, C. Jankowski, Mastering Voice Interfaces, https://doi.org/10.1007/978-1-4842-7005-9_3

3. Running a Voice App—and Noticing Issues

Ann Thymé-Gobbel (1) and Charles Jankowski (2)
(1) Brisbane, CA, USA
(2) Fremont, CA, USA

In the first two chapters, we introduced you to voice interaction technology and the reasons why some things are harder to get right than others for humans and machines in conversational interactions. Now it’s time to jump in and get your own simple voice-first interaction up and running. You’ll stay in the familiar food domain; it’s a convenient test bed—finding a restaurant is a task most people know well, and it touches on many core voice-first concepts. The task seems simple, but things get complicated fast. When you expand functionality to deal with real users, you’ll stray from the happy path quickly, but let’s not worry about real life yet.

Hands-On: Preparing the Restaurant Finder

Figure 3-1 (similar to Figure 2-1) illustrates a simple restaurant finder. A person asks for a recommendation; a voice assistant provides one. That’s it. A word on notation: “AGENT” in panel 2 is shorthand indicating that you don’t yet care about the name or platform of the voice assistant or how listening is initiated. It can be a wake word, like Alexa or Hey, Google, or a press of a button. For IVRs, it’s calling a number. It merely suggests that this dialog is initiated by the user.
Figure 3-1. Plausible restaurant finder dialog between a user and a voice assistant

Next, let’s define this interaction so you can build it. In Chapter 2, you started thinking about the decisions you need to make for a restaurant finder. Many decisions are about scope: users ask for a restaurant recommendation for a specific city; the app responds with a few best-match suggestions based on parameters specified by the user, ordered by proximity. You learn about how and why to limit scope in Chapter 4; here’s a first take:
  • Geographical scope: Your initial restaurant finder is limited to a predefined area, one small town, because doing well at unconstrained name recognition is very complicated. So, we’ve picked a small city in the San Francisco area: Brisbane. There are only about 20 places in town that offer food.

  • Search scope: Starting small, users can’t refine their search or search for something based on food type or hours; they can only get a recommendation from a predefined set of responses.

  • Information scope: Here too, you limit the initial functionality. You need a list of all restaurants in the city and a set of responses, such as business hours or street address, to play based on that list.

  • User scope: No customized results based on user settings and no follow-on questions to refine searches. But make sure users don’t hear the same suggestion every time, so something needs to be tracked. Assume there can be multiple users.

  • Interaction scope: Start by creating a custom voice app, a skill (for Alexa) or action (for Google). Access will be primarily from a stationary device (as opposed to a mobile app or in-car), probably indoors, at home or work, and not requiring a screen.

SUCCESS TIP 3.1 VOICE SUCCESS IS BASED ON SCOPE AND LIMITATIONS

No matter what voice interaction you’re creating, much of your time should be spent understanding the scope and limiting it where you can—and knowing you can’t limit user behavior. That means you need to figure out how best to handle out-of-scope requests. “You” means you: designer, developer, and product owner.

Having settled on the scope, you define the basic dialog for when all goes well: the happy path. The recommended starting point is to write sample dialogs:
  • User AGENT, ask Brisbane Helper to recommend a restaurant.

  • AGENT [Look up data and generate an appropriate response.]

  • AGENT Na Na’s Chinese gets good reviews.

Sample dialogs show up throughout this book. They’re high-level examples of the functionality a voice app offers, how it’s offered, what’s understood, and what responses the app will give. Sample dialogs are the conceptual draft drawings of voice, showing core functionality and representative conversations; they don’t show every possible wording a user might say. Any “real” implementation will use more than one sample dialog, but this is all the first version of this restaurant finder will do, so one’s enough. It’s time to build it.

Choosing Voice Platforms

Today you have access to an ever-increasing number of voice platforms of ever-increasing abilities, so you don’t need to start by building your own from individual modules. Build your simple restaurant finder using a platform you prefer and have access to. We assume many of you will use either the Amazon Alexa or the Google Assistant platform. If coding isn’t your core strength, make use of one of the many no-code platforms that let you run your creations on either of those two platforms.1 If you prefer the flexibility of open source, there are yet other options.2 Any mature platform lets you get something functional running quickly, something concrete that’s a sandbox for playing with and understanding the core concepts. And that’s what we want right now. Remember: You’re not reading a platform-specific cookbook—we cover concepts and provide concrete examples for you to run and play with to get a solid understanding of voice. We actually recommend you develop the simple dialog in this chapter using more than one platform, including any you’re interested in that we don’t cover. Seeing what’s platform-specific vs. what appears across all platforms helps you become a stronger voice practitioner, and you’ll be less dependent on any specific platform. Early on, you’ll prefer platforms that handle as many details as possible for you. The more experience you gain, the more flexibility you want. And you’ll be able to make use of that flexibility without getting in trouble. You want a stable robust platform with superior speech recognition. We have chosen to go the Google route in this book for several reasons that we’ll get to a little later in this chapter. But first, let’s jump into building something!

Hands-On: Implementing the Restaurant Finder

It’s time to implement the basic restaurant finder in Figure 3-1. You’ll be using Actions on Google, the developer platform for Google Assistant. So far, it’s only one question, but that’s enough to start. We’ll walk you through the steps of getting set up, pointing out some things to watch out for without going into every detail. For more details, refer to one of the many available references for the most up-to-date information.3

Basic Setup

Here we go… OK, Google, ask Brisbane Helper to recommend a restaurant.
  1. Create a Google account, or log into an existing one.
  2. Open the Actions on Google Developers Console4 and create a new project.
    a. Click the New Project button, and then type BrisbaneHelper as the project name. Click Create Project.
    b. Choose Custom for “What kind of Action do you want to build?” Click Next.
    c. Important: On the screen asking “How do you want to build it?”, scroll all the way to the bottom and click “Click here to build your Action using Dialogflow.” Later in this chapter, we’ll discuss why we’re using Dialogflow.
    d. Choose Build your Action and then Add Action(s).
    e. Click Get Started and then Build.
  3. Dialogflow (Dialogflow Essentials) should open. Click Create.

On the referenced web page, it asks you to enable Fulfillment. Fulfillment is the ability for Dialogflow to connect with a web service that you build to do more detailed processing (analogous to the AWS Lambda function used for Alexa skills). You don’t need it here because Brisbane Helper is so simple. In Chapter 7, you’ll add that flexibility.
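Although you won’t enable Fulfillment yet, it helps to see roughly what a webhook involves. Below is a minimal sketch in Python using Flask; the framework, route name, and response wording are our assumptions, not something the Dialogflow setup above creates for you. Dialogflow ES POSTs a JSON request whose queryResult identifies the matched intent, and the webhook returns a fulfillmentText to speak back.

```python
# Minimal sketch of a Dialogflow ES fulfillment webhook (assumed Flask setup).
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    req = request.get_json(force=True)
    intent_name = req["queryResult"]["intent"]["displayName"]

    if intent_name == "RecommendRestaurant":
        reply = "Na Na's Chinese gets good reviews."
    else:
        reply = "Sorry, I can't help with that yet."

    # Dialogflow ES reads fulfillmentText from the webhook response.
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    # Dialogflow needs a public HTTPS endpoint; for local testing you'd
    # typically expose this port through a tunneling service.
    app.run(port=8080)
```

You’ll build a real version of this in Chapter 7; for now, static responses in Dialogflow are all Brisbane Helper needs.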

Step 1: Respond to an Invocation

In Dialogflow, choose the “Default Welcome Intent” that’s automatically created. This is the starting point of the interaction when the user invokes it—in other words, when saying, OK, Google, talk to Brisbane Helper.
  1. Click Responses. Click the trash icon to remove any existing responses.
  2. Type something like “Welcome to Brisbane Helper. How can I help?” or “Hi, I’m Brisbane Helper. What’s your question?” This adds a new response that plays when the user engages the action without specifying a question.
  3. In the Intents window, click Save.

Step 2: Specify What the User Says

In Dialogflow (as with Alexa), an intent is the basic “unit” of the user “doing something,” like requesting information (Recommend a restaurant, When is it open?, Do they take reservations?) or requesting some action (Order a pizza for delivery, Make a reservation for 7 PM).
  1. Create a new intent. To the right of Intents, click the plus symbol (+).
  2. Type RecommendRestaurant in the Intent name field. Save.
In a voice application, the way to “do something” is of course to say it. Specify what the user might say to make their request.
  1. In the section Training Phrases, click Add Training Phrases.
  2. In the box Add User Expression, type “Recommend a restaurant.”
  3. Press Enter, and then click Save. That’s the phrase in the sample dialog.

Step 3: Specify What the VUI Says

Next, specify how the action should respond when the user says, Recommend a restaurant.
  1. Find Enter a Text Response Variant. Type “Na Na’s Chinese gets good reviews.” Press return.
  2. After the application has responded, this dialog is complete, and you’re done. Click the slider to the left of Set this intent as end of conversation. If you don’t do this, the action won’t end, but will wait for the user to say more after it gives the recommendation. Click Save.

Step 4: Connect Dialogflow to Actions on Google

You’ve defined the conversation in Dialogflow, but you started with Actions on Google, which is the development environment to extend Google Assistant. You need to connect the two.
  1. Click Integrations, then Google Assistant, and then Integration Settings. In Discovery, you see two sections—Explicit Invocation and Implicit Invocation:
    • Explicit invocation is what you’ll use here. It refers to the user starting their request by naming the action, either by itself or in combination with an intent, for example, OK, Google, talk to Brisbane Helper or OK, Google, ask Brisbane Helper to recommend a restaurant.
    • Implicit invocation refers to specifying an intent that leads to invoking an action without invoking it by name, for example, Hey, Google, I need a local recommendation for a restaurant.
  2. In Explicit Invocation, find Default Welcome Intent. Click the checkbox.
  3. Where it says Auto-preview Changes, move the slider to the right.
  4. Click Test, at the bottom of the box.
  5. Leave the Auto-preview Changes checked.
  6. Actions on Google should now appear, with the successful integration, in the Test section.

Step 5: Test Your VUI

You’ve built your simple app. Hooray! Time to test it. You can test in several ways: in Dialogflow, in the Actions console, or with a Google device.

Testing Using Dialogflow

In Dialogflow, you can test with either text or voice, as shown in Figure 3-2:
  • Text test: Find Try It Now, and then type “recommend a restaurant.” Press Enter. It should show the correct response, “Na Na’s Chinese gets good reviews.”

  • Voice test: Find Try It Now, click the mic icon (just click; don’t click and hold), and say Recommend a restaurant. Click the mic again to stop listening. You should get the same response as for text testing.

Figure 3-2. Testing method #1: Dialogflow
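If you’d rather drive the same text test from code instead of the Try It Now box, the Dialogflow API exposes a detect_intent call. Here’s a hedged sketch using the google-cloud-dialogflow Python client; the project ID and session ID are placeholders, you need credentials for the same Google project, and the exact import and request style vary somewhat between client-library versions.

```python
# Sketch: sending a text query to your Dialogflow agent programmatically.
from google.cloud import dialogflow

session_client = dialogflow.SessionsClient()
# "your-gcp-project-id" and "test-session-1" are placeholders.
session = session_client.session_path("your-gcp-project-id", "test-session-1")

text_input = dialogflow.TextInput(text="recommend a restaurant", language_code="en-US")
query_input = dialogflow.QueryInput(text=text_input)

response = session_client.detect_intent(
    request={"session": session, "query_input": query_input}
)

print("Matched intent:", response.query_result.intent.display_name)
print("Response:", response.query_result.fulfillment_text)
```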

Troubleshooting

Before you test using Actions on Google, go to the Activity Controls page for your Google account and turn on Web & App Activity, Device Information, and Voice & Audio Activity permissions.

Testing Using the Actions console

In the Actions console, you can test with either text or voice:
  1. In your Actions console window, click Test.
    • Make sure your volume isn’t muted.
  2. Invoke the action. Type “Talk to Brisbane Helper.” Or click the mic and say it. This time, you don’t need to click it a second time; it stops listening on its own.
    • This invocation phrase wasn’t necessary when you tested in Dialogflow. But now that you’re in Actions on Google, you need to tell Google Assistant how to access your action, so you set the invocation phrase to “Brisbane Helper.”
  3. You should hear the response Sure, here’s the test version of Brisbane Helper and then maybe a different voice saying, Welcome to Brisbane Helper. How can I help? You added that prompt earlier.
  4. Type, or click the mic and speak, the magic phrase: “recommend a restaurant.”
  5. You should get the expected response: Na Na’s Chinese gets good reviews.

Testing with Assistant, Home, or Nest Devices

If you’ve read some of the resources we pointed to earlier, you’ve seen that there’s a well-defined publishing process for making your creation available to other users. How cool is that? Refer to those resources for the current process—remember this stuff is in flux. For quick testing, as long as your device (Google Home, Nest Hub, or the Google Assistant app on iOS or Android) is logged into the same Google account as your development environment, you can test with that device, which is handy.
  1. Set up your Google Home device with the same Google account you’re using for Actions on Google and Dialogflow.
  2. Now define how you’ll access your app from Google Home or other devices. In the Actions console (where you just tested), click Develop at the top and then Invocation on the left. Under Display Name, type “Brisbane Helper.” Click Save on the upper right.
  3. When the device is ready, say, “OK, Google, talk to Brisbane Helper.”
  4. Here you need both the wake word and the invocation phrase.
  5. You hear the same thing you heard in the simulator, but it’s coming out of your device. Nice!
Do I Need to Test with a Google Home Device?

You may wonder if you really need to test “on-device” with whatever platform will be the target platform of your application. Strictly speaking, no, not early in the process. But if you want to build a voice app that not only does something but also does it well, you’ll need to put yourself in the mindset of your expected users. That’s a recurring theme of this book. It means experiencing the app the way a user would. So whenever possible, we strongly recommend that you test what you build in the same environment, under the same contexts and conditions that your users will experience, and within the same platform and ecosystem they’ll use.

Step 6: Save, Export, or Import Your Work

You have options for saving away your work to share with others and using work by others.
  1. In Dialogflow, click the gear icon by your app’s name.
  2. Click Export and Import.
  3. Click Export as ZIP. The file will be saved as APPNAME.zip in the Downloads directory. Rename it if you want to keep track of versions.
To load an app from a .zip file:
  1. In Dialogflow, click the gear icon by your app’s name.
  2. Click Export and Import.
  3. Click Restore from ZIP. All current intents and entities will be cleared before loading.

You also have the option to Import from ZIP. If there are duplicates, existing intents and entities will be replaced. Note that you have to actually type the word “restore” to restore from a .zip file, since the process wipes out anything you currently have.

You’ve now built a voice app that does one very simple thing. You’ll return to your restaurant app shortly to expand it. But first, let’s take a closer look at the Google platform and why we chose to use it.

Why We’re Using Actions on Google and Assistant

Short answer: We had to pick something. The main goal of this book is not to teach you how to create Google actions specifically, but to demonstrate the concepts and principles that apply to any voice development—no matter the framework. We happen to like many aspects of Google. And of Alexa and Nuance and Apple and Bixby and Mycroft and…and…and so on. We would’ve loved to compare half a dozen platforms throughout the book, but it’s long enough as it is, so for brevity’s sake, we had to pick one. Here are some of the reasons why we chose to use Google.

First, we wanted a widely available platform that was free to use, well supported, mature enough to be robust, flexible for a variety of use cases, and with top-of-the-line performance across all core voice system components. It should be familiar to end users as well as to designers and developers, with components and support in multiple languages. And it should “come with” all tools, hardware and software necessary to build something fully functional, while also not being completely tied to those components.

There are additional practical considerations. We’d like you to not worry about complexities unless it’s necessary for some point we make. But we also don’t want you to depend on what a specific platform provides. Nor do you want to be limited by it. That’s not how you learn about voice. As your dialogs get more complex or you’re working with a platform that doesn’t offer some tool or feature, you’ll want the ability to customize to provide the best experience for your users. That’s our focus.

Meeting all those criteria basically narrowed the field to Amazon and Google. We simply have more experience creating voice interactions with Google (and Nuance and some open source platforms) than with Amazon, so we know that it will work well for our purpose, and it leads naturally into more hands-on development with the Google Cloud Speech-to-Text API. But don’t worry: Alexa and Google share many concepts and approaches to conversational voice, so ideas you see in one will map to the other, as well as to other platforms.

For additional code, we personally favor Python over Node.js, again based on personal experience from product development; it’s flexible and practically the standard in the voice and natural language community. Ultimately you should use what’s appropriate for your particular use case and fits into your overall product development environment with the best speech and language performance available. There’s no one single right answer for everyone.

Again, this isn’t a cookbook or manual for Actions on Google or for a particular device or use case—we use the platform so you can take advantage of what’s already available to you and build something quicker than if you have to put all the pieces together. Platforms and computing power change—the advanced features we discuss in later chapters may be available on the Actions platform by the time you read those chapters, which is great. But you still need to understand the reasons behind a tool to use it correctly. And that’s what this book is about. Speech is speech; natural conversational interfaces rely on the same core principles this decade as they did last decade and the decade before that. The principles of spoken language apply across all platforms and devices, because it’s about human language.

Google’s Voice Development Ecosystem

Let’s take a closer look at the Google voice development architecture to see how it maps to our voice discussion in Chapter 1. Figure 3-3 is a variant of Figures 1-3 and 2-10, showing how the same building blocks fit into the Google Assistant voice service and Actions on Google framework.
Figure 3-3. Building blocks for a voice system using Actions on Google and Google Assistant

The mapping here is approximate. Think of it as a big picture rather than as a definition of Google’s voice architecture. Three main building blocks are involved in Google Assistant voice development:
  • Actions on Google for STT and TTS

  • Dialogflow for NLU, NLG, and DM

  • Node.js or Python as the back end to handle anything but the simplest DM

Many of these components can be used separately, adding to the flexibility. Google has cloud APIs for speech-to-text,5 text-to-speech,6 natural language,7 and translation,8 among others. We use Actions on Google and Dialogflow since they package these components together nicely, letting you focus on application design, and they also interface with the hardware that we’re going to test on such as Google Home.
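As a taste of what one of those standalone APIs looks like from code, here’s a hedged sketch of a one-shot transcription call with the Google Cloud Speech-to-Text Python client. It assumes the google-cloud-speech package is installed, credentials are configured for a project with the API enabled, and request.wav is a placeholder 16 kHz mono file.

```python
# Sketch: transcribing a short audio file with Google Cloud Speech-to-Text.
from google.cloud import speech

client = speech.SpeechClient()

with open("request.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result carries one or more alternatives, best first.
    print(result.alternatives[0].transcript)
```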

Actions on Google

Actions on Google is the platform on which Google Assistant’s “intelligent voice” functionality can be extended into actions. A great advantage of this framework is that the platform takes care of some of the thorniest components of voice development (STT, NLU, TTS) through its libraries and services. Your focus when starting out can therefore be on the core aspect of implementing the DM and NLG parts to build the app.

Dialogflow

There are at least three ways that you can design dialogs in Actions on Google:
  • Actions SDK: A code-like API for specifying actions, intents, and so on

  • Dialogflow: A more GUI-based application that provides a graphical wrapper around the Actions SDK, adds NLU functionality, and incorporates machine learning, allowing synonym expansion of phrases mapping to intents

  • Actions Builder: A more recent GUI, more directly connected to Actions on Google.

You’ll use Dialogflow for now so you can focus on application building rather than on code formats. Dialogflow is more or less responsible for handling the NLU, DM, and NLG components of the action. From the DM and NLG point of view, the main responsibility of the application is to decide what to say next, based on user input, contexts, and the progress of the dialog. There are two ways to specify the responses the system plays back to the user:
  • Provide static responses in Dialogflow through the GUI.

  • Use a webhook back end to provide responses. This can be very dynamic and depend on history, user data, or application data—all the data sources shown in Figure 3-3. We’ll make extensive use of webhooks in later chapters.

Dialogflow provides a good balance of simplicity, extensibility, performance, and openness of various components. It’s also mature and stable because it’s been around for a while—for our goal of teaching you about design principles and best practices, that’s more important than including the latest platform features. So we use what’s now called Dialogflow ES, rather than the more recent (and not free) Dialogflow CX, and we don’t use the Actions Builder environment, which is less flexible than ES (currently only deployable to Assistant, not to third-party channels).9

The Pros and Cons of Relying on Tools

Until recently, it was next to impossible to create any kind of voice interaction unless you were in research or employed by one of the few enterprise companies directly working in voice. Today, there’s easy access to tools and platforms that didn’t exist a few years ago. Practically anyone with a standard laptop and a free account can quickly put together something that responds when spoken to. If you already have experience with Google Assistant or Amazon Alexa, you know that the platform provides a lot “for free” by handling many details for you.

But tools and platforms come and go, and they can change quickly, as we just mentioned in the previous section. Even while writing this chapter, at least one voice tool went from free to pay-for, another changed its name, and yet another was acquired. If you depend on tools to handle the details for you, you may be in trouble if those tools disappear or change. More importantly for our purpose, you won’t learn how to improve your voice interactions if you don’t understand why you got some result or can’t access what you need to fix. As users become savvier and more familiar with voice, they become more demanding. Being too closely tied to a specific framework may limit your ability to create the right voice interaction or one that stands out from the crowd.

What if you need precise control over responses or secure access to data? What if you deal with large data sets, have branding needs, have integrations with other systems, or need access to your users’ utterances for validation? Or what if you need precise control over accuracy and responses to specific requests? Well, then you’ll need more than these platforms probably give you, and you need to take charge of a larger chunk of the development. Fortunately, none of these are new issues for voice. Techniques exist for addressing them, but they’re either not yet available for developer modification, or their relevance is not fully understood. Most voice developers today are fairly new to the field and only have experience with one platform. One of our goals is to reintroduce some powerful techniques that have fallen by the wayside because of having been less accessible. As those features are added, you still need to know how to apply them. The choices you make affect every voice app you build, from the simplest question-answer to a full prescription refill system. Learning to “think in voice” will empower you to make the right choices.

If you choose to develop within the Google and Alexa platforms, you may trade ecosystem access, convenience, and development speed for various limitations. If you’re willing and able to build up your own environment with your own parsing and designs, a new world opens up, offering choices at every step to create something different from what others do. But as is the case in life, with choices and flexibility come increased responsibility and careful planning. Only you know what’s the right answer for you.

Each company also offers products for other voice use cases, such as the following:
  • The Internet of Things (IoT) use case: AGENT, dim the lights in the living room or AGENT, set the temperature to 70. Controlling appliances via spoken commands directed at an Amazon Echo device or Google Home device. The appliances themselves don’t have microphones or speakers for voice responses. These are connected devices, usually referred to by the term smart home.

  • Third-party devices with built-in microphones and speakers that make use of the recognition and NL “smarts” of Alexa or Google: AGENT, play music by Queen. The spoken commands are directed at a device that has its own microphones and speakers but is “inhabited” by one of the two voice assistants. These are built-in devices.

General voice UI best practices apply across all use cases, but details differ because users’ needs differ, devices have different functionalities, and different overlapping architectures and ecosystems need to interact. Custom skills and actions need invocation, while smart home and built-ins don’t. Because of the narrower focus and specific needs of smart home and built-in applications, we don’t focus on those in this book.

You might not need an invocation or wake word at all. That's the case if the VUI is delimited in other ways, for example, a phone call to an IVR, a push-to-talk device, an always-on handsfree device, or a free-standing VUI outside the common smart home ecosystems. They’re all valid and all current environments—and they’re all equally able to produce both great and crummy voice interactions.

SUCCESS TIP 3.2 ARE YOU SURE THAT WORD MEANS WHAT YOU THINK IT MEANS?

Be aware that product names and features are in near-constant flux, as you might expect from quickly evolving high tech. What you see here is the landscape at the time of writing in 2020–2021. If you work with a team to build a custom skill or action, a smart appliance, or a speaker with its own audio input and output on the Amazon or Google platform, or anywhere else, make sure everyone on your team is clear on what’s possible and what’s not within that approach and that everyone matches the same name to the same product.

Hands-On: Making Changes

Back to the restaurant finder. You’re up and running and talking to your voice assistant. Cool, it works. Better yet, you make your skill or action available and proudly tell your friends. They use it, and…they tell you it doesn’t work well.

What happened? What did they say to make them complain when it worked for you? If you only built exactly what we covered earlier, very likely some of the following issues happened:
  1. They made requests that were within your defined scope, but those requests
    a. Were rejected
    b. Resulted in a response that was wrong
    c. Resulted in a response that was misleading
    d. Resulted in a response that was uninformative
    e. Resulted in a response that sounded weird or off
  2. They asked for things you knew were out of scope.
  3. They heard only one response even though you had created several variants.

We recommend you don’t ignore your friends or tell them they’re stupid for not talking in just the right way! In reality, you didn’t do your planning—understanding your users and designing for them—because we haven’t told you how yet. This chapter is about getting a first taste of the process, so let’s look at how you can address these issues.

SUCCESS TIP 3.3 IF USERS SAY IT, YOU HAVE TO DEAL WITH IT. PERIOD.

Your users are your users because they want to be, so treat them well, like friends, not like enemies or objects. This might seem obvious, but it’s surprisingly easy to fall into thinking Nobody will say that or What a stupid way to say that or even If they don’t read my description of what’s covered, it’s their own fault. Most of the time, users are not trying to trip the system, and they know when they are. Sure, you can create clever responses to out-of-scope requests, but don’t do that at the expense of making the actual intended experience more robust.10

Adding More Phrases with the Same Meaning

In the case of issue 1a (rejection), your friends probably made reasonably worded requests that weren’t covered by your training phrases or the auto-generated coverage based on those. Maybe they didn’t say restaurant or eat but said What’s the best place? or Where to? Or they included preambles you didn’t cover: I would absolutely love to hear about a new café I’m not familiar with. Remember: A core strength of voice interactions is that people don’t have to learn some set of commands—they can just speak normally. Your first solution is to add more variety; if you think you’ve covered everything, ask a few people how they would phrase the request. Later, we’ll show you more methodical, exhaustive approaches.

Pushing the Limits

First, let’s do a little testing. Recall that you added one training phrase, “recommend a restaurant.” You might think that you’d have to say exactly that. Try some phrases:
  • “Recommend a restaurant.” (That had better work; it’s the only one you added!)

  • “Recommend restaurant.” “Please recommend a restaurant.”

  • “Can you please recommend a restaurant for me?”

  • “Would it be possible for you to recommend a restaurant for me to eat at?”

  • “Can you tell me a good restaurant?”

  • “What’s a good restaurant?” “Can you suggest a good restaurant?”

  • “Recommend a place to eat.” “Got any restaurant suggestions?”

  • “Where is a good place to eat?” “Where should I eat?”

  • “Where can I get some food?”

Well, this is interesting. You should have found that, except for the last two sets of phrases,11 it still responded with the right answer! Why is that?

Look back to Chapter 1, specifically the section “Natural Language Understanding” (NLU). NLU is the module that takes a string of words and generates the meaning of the sentence. Here, the meaning is the Intent you created, RecommendRestaurant. If you have multiple Intents (you soon will), NLU figures out which Intent, if any, matches the phrase that was spoken and produces the associated response for that Intent. (We’re capitalizing Intent here because it’s a specific Dialogflow concept.)

How does it work? You may think, Aha! If the sentence has both recommend and restaurant, that’s it, right? No: What’s a good restaurant? and Recommend a place to eat both worked. Hmmm. Maybe it’s enough to include either “recommend” or “restaurant”? That would work for now, but you’re about to add some more phrases that break this hypothesis. As you add enough phrases to make the app robust, rules like that become too unwieldy, so you need some sort of automatic method.

What’s happening is that Dialogflow is “learning” how to map your training phrases to intents automatically, using technologies like natural language processing and machine learning. Or, rather, it’s making use of rules and statistical mappings of words and phrases to meaning. When you add more phrases and click Save, it relearns using those new phrases. Automatically extracting meaning, especially training such a process from new data, is a problem that researchers have been working hard on for years, and the fruits of that work are what you see in Google, Alexa, and other systems. It’s not the whole answer, but enjoy!
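Dialogflow doesn’t expose its internals, but you can build an intuition for the idea with a toy classifier of your own. The sketch below is purely illustrative (it is not how Dialogflow actually works) and uses scikit-learn to turn training phrases and user utterances into TF-IDF vectors, pick the closest intent, and reject anything that isn’t similar enough, which also previews the rejection behavior discussed next.

```python
# Toy illustration of mapping phrases to intents (NOT Dialogflow's real method).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

training = {
    "RecommendRestaurant": [
        "recommend a restaurant",
        "where is a good place to eat",
        "what's a good restaurant",
    ],
    "OpeningHours": [
        "when is it open",
        "what are their hours",
    ],
}

phrases = [(intent, p) for intent, ps in training.items() for p in ps]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([p for _, p in phrases])

def classify(utterance, threshold=0.3):
    scores = cosine_similarity(vectorizer.transform([utterance]), matrix)[0]
    best = scores.argmax()
    if scores[best] < threshold:
        return None  # rejection: nothing is a good enough match
    return phrases[best][0]

print(classify("can you suggest a good restaurant"))  # expect RecommendRestaurant
print(classify("where can I get some food"))          # likely rejected with this tiny set
```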

Rejection

Let’s look at the last three phrases in the list. What happened when you said, Where is a good place to eat? It probably responded, Sorry, could you say that again?, What was that?, or something similar. This is a rejection, one of the most important concepts in building good voice apps and one you first learned about in Chapter 2. Basically, the app (specifically the NLU part) not only has to hypothesize which Intent is represented by a user’s request but also whether it’s none of them. You hear Sorry, could you say that again? when the NLU doesn’t think any of the Intents match. This response is a default response. Don’t worry yet about whether it’s a good response (spoiler: it’s not); only remember the importance of the concept of rejection. As you design and develop more complex voice interactions, you’ll probably be shocked at how much of your effort is spent on handling rejections.

Fixing Rejection

As you might expect, if the app rejects too many reasonable user utterances and your users only hear Sorry…, that’s no good. You address this by adding more phrases. Let’s do that now in Dialogflow:
  1. Make sure you’re still in BrisbaneHelper in Dialogflow. If not, open it.
  2. Select the RecommendRestaurant intent.
  3. In Training Phrases, find Add User Expression. Type “Where is a good place to eat?” Press Enter.
  4. Click Save.
  5. When you see Agent Training Completed, retest Where is a good place to eat?
  6. You should now get the correct answer.

You’ve added one of the rejected phrases. Keep testing. Try Where should I eat? It works too, even though you didn’t add it as a phrase! Again, smart, machine learning NLU to the rescue; it figured out that Where is a good place to eat? and Where should I eat? have enough similar meaning. How about Where can I get some food? That’s still rejected. Once we add that phrase as well, we should be good for the whole list.

Troubleshooting

After adding phrases, don’t forget to not only click Save but also wait for Dialogflow to display the phrase Agent Training Completed. Only then will new phrases work. While you’re waiting, the NLU is being retrained using machine learning, and that takes a little bit of time.

Expanding Information Coverage

Na Na’s may get good reviews, but sometimes people need to eat something different. Now that you’ve added some variability on the input side, let’s add some to the output, changing the response users might hear when asking for a restaurant recommendation. In Dialogflow:
  1. Make sure you’re still in BrisbaneHelper. If not, open it.
  2. Choose Intents, and then select the RecommendRestaurant intent.
  3. Find the Responses section. At Enter a Text Response Variant, type “Don’t forget to try a quiche at Star Box Food. And get there before the end of lunchtime.” Press Enter. Then click Save.
  4. When you see Agent Training Completed, test using the Recommend a restaurant phrase. You should get either of the two phrases for Na Na’s or Star Box.
  5. Add a few more phrases:
    • “Did you know that Madhouse Coffee serves sandwiches and other bites?”
    • “What about further outside, but still mostly Brisbane? 7 Mile House is on Bayshore, and WhiteCaps Drinks is out on Sierra Point.”

How does a response get picked from the list? It’s just random, with a slight tweak to avoid playing the same response twice in a row. Issue 3 (hearing the same response every time) would point to a logic oversight: not randomizing responses. Watch out for those! Later, you’ll learn about more detailed schemes where you might factor in what’s open or what kind of food the user is looking for and so on.
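If you were picking responses in your own code (say, in a webhook later on), the same no-immediate-repeat behavior is easy to approximate. This is a small sketch of our own logic, not Dialogflow’s implementation:

```python
# Sketch: pick a random response, avoiding the one played last time.
import random

RESPONSES = [
    "Na Na's Chinese gets good reviews.",
    "Don't forget to try a quiche at Star Box Food.",
    "Did you know that Madhouse Coffee serves sandwiches and other bites?",
]

_last_response = None

def pick_response():
    global _last_response
    # Exclude the previous response unless that would leave nothing to pick.
    choices = [r for r in RESPONSES if r != _last_response] or RESPONSES
    _last_response = random.choice(choices)
    return _last_response
```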

Adding Granularity

Issues 1b–e (incorrect or odd responses) listed earlier have two related causes. One is simply knowing how best to word a clear, unambiguous response. We’ll cover that in great detail in later chapters. The other cause needs addressing even when the responses are worded well: intent granularity. Right now, there’s only one single intent with no slots. If your friends asked Recommend a restaurant for breakfast, they may or may not have been understood. That is, for breakfast could have been dropped by the interpreter. Since there’s no handling of [MEAL] yet, if the response happened to be for a dinner-only place, it would be wrong. Same for …open now, …open late, or …open tomorrow if it’s not open and you don’t yet track and compare time. And if they said Recommend a Mexican restaurant, they wouldn’t want to hear about a great Chinese restaurant.

You want to support these sample phrases, starting with Recommend a restaurant for breakfast. In Dialogflow:
  1. Click the plus (+) to the right of Intents to create a new intent. Name it RecommendRestaurantBreakfast.
  2. Add a training phrase “Recommend a restaurant for breakfast.”
  3. Add a response phrase “If you’re hungry in the morning, you have several options. How about Madhouse Coffee or Melissa’s Taqueria?”
  4. In Responses, Set this intent as end of conversation.
  5. Save and test.
Keep going to support more sample phrases for different search criteria:
  1. “Recommend a restaurant open late”:
    a. Create a new Intent RecommendRestaurantLate with the sample phrase “Recommend a restaurant open late.”
    b. Add a response “Mama Mia Pizza is open late.”
    c. Enable Set intent as end of conversation. Save and test.
  2. “Recommend a Mexican restaurant”:
    a. Create a new Intent RecommendRestaurantMexican with the sample phrase “Recommend a Mexican restaurant.”
    b. Add the response “Hungry for Mexican food? Try Melissa’s Taqueria.”
    c. Enable Set intent as end of conversation. Save and test.

Now you support breakfast, late, and Mexican food. You’re thinking, Wow, this could get really messy; there are all those other cuisines, types of meals, times of day, and intents for every one of those combinations? Here you used static responses, and yes, with this approach it will be a crazy number of combinations. But you’ll soon learn about a couple of enhancements, slots or entities, and dynamic or webhook responses, which will, believe it or not, allow you to support all those possibilities with one intent.
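To preview where that’s headed, here’s a hedged sketch of the kind of lookup a single RecommendRestaurant intent with cuisine and meal parameters could drive from a webhook. The restaurant data and parameter names are made up for illustration; you’ll see the real entity and webhook mechanics in later chapters.

```python
# Sketch: one intent, two optional parameters, one lookup function.
# With entities/slots, the NLU fills cuisine and meal; the code does the rest.
RESTAURANTS = [
    {"name": "Na Na's Chinese", "cuisine": "chinese", "meals": {"lunch", "dinner"}},
    {"name": "Melissa's Taqueria", "cuisine": "mexican", "meals": {"breakfast", "lunch"}},
    {"name": "Madhouse Coffee", "cuisine": "cafe", "meals": {"breakfast", "lunch"}},
]

def recommend(cuisine=None, meal=None):
    matches = [
        r for r in RESTAURANTS
        if (cuisine is None or r["cuisine"] == cuisine)
        and (meal is None or meal in r["meals"])
    ]
    if not matches:
        return "I don't know a place like that in Brisbane yet."
    return f"How about {matches[0]['name']}?"

print(recommend(cuisine="mexican"))   # Melissa's Taqueria
print(recommend(meal="breakfast"))    # a breakfast option
```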

SUCCESS TIP 3.4 YOUR USERS ARE NOT YOUR UNPAID TESTERS

Of course you want to update your voice interactions based on data from real users, but don’t expect your users to debug your voice apps for you. Get utterance examples from a varied group of alpha testers. Find beta testers who tell you if they felt understood and if responses were accurate. Without this, you’ll never get rid of poor reviews.

Does this mean you have to answer every question? No. It means you can start by recognizing more of what’s said to avoid problems and sound smarter until you add the actual functionality. You can add the most relevant information to every prompt. Maybe the user didn’t ask for it, but they probably won’t mind hearing it if it’s brief. You could also handle some words, like open now and dinner, even if you can’t provide an answer. Use sample dialogs to also show interactions that aren’t on the happy path:
  • User Hey, Google, ask Brisbane Helper to recommend a restaurant that’s open now.

  • VUI Na Na’s gets good reviews. They’re open for lunch and dinner every day except Saturday.

  • User Hey, Google, ask Brisbane Helper to recommend a restaurant that’s open now.

  • VUI Na Na’s gets good reviews. I’m not sure about their hours.

SUCCESS TIP 3.5 EVERYONE WANTS TO BE HEARD AND UNDERSTOOD

The more unfulfilled requests you can shift from not handled to handled, the smarter your voice interaction sounds. When you acknowledge your user, they feel understood even if you can’t do what they ask.

What about issue 2, real out-of-scope requests? You can’t cover everything, and this is true for all VUIs on all platforms. So you need some appropriately vague response, like telling the user you can’t help, but give them something general, maybe with a hint of what you can provide. Or you can ask the user to try again. Asking again, or reprompting, is more appropriate for some tasks and interaction types than others. It’s a big topic you’ll learn about in Chapter 13.
  • User Hey, Google, ask Brisbane Helper which place has the highest rating but isn’t busy.

  • VUI [No matching handling.]

  • VUI I’m not sure how to help with that, but here’s a general Brisbane recommendation. Na Na’s gets good reviews.

Preview: Static vs. Dynamic Responses

The way we’ve handled responses so far is by using static responses. You use these when you specify responses in the Responses section of the Intent. Other than the randomization we talked about, you can have no logic or decision making with static responses, which really limits you. The alternative is to use dynamic or webhook responses. These responses are generated by code (we’ll use Python) that you write. Here you can imagine having unlimited logic to filter responses by cuisine, time of day, and so on. You use webhook responses when you enable Fulfillment for your Intent, which you’ll do soon. Look at the documentation at https://dialogflow.com/docs/intents/responses for more details on using static vs. webhook responses.

What’s Next?

Take a look at the last few sample dialogs and think about the context for when these responses work more or less well. What changes would you make in the wording to cast the net wider and have responses that sound “intelligent” but don’t actually fulfill the requests?

With this quick and simple introduction to voice development, you’re already starting to get a sense of what throws voice apps for a loop and what’s no issue. This is what you need to master to succeed with voice: how to find big and small issues and fix them. The happy path is usually relatively straightforward; robustness is what’s a bit tricky. By learning not just the “how” but also the “why,” you minimize your dependency on specific tools and platforms while also learning how to choose the best implementation for a particular context and create conversational interactions where the platform doesn’t limit you.

When you want to create a robust and well-functioning voice application, you start by asking the types of questions you’ve already started asking—just in greater detail. In the next chapter, we’ll look at that “blueprints” part of the process.

Summary

  • Because voice is fundamentally limitless, you need to define your VUI’s scope carefully.

  • It’s possible to handle a request without fulfilling it. A well-worded voice prompt shows users they were understood even when they can’t get what they asked for.

  • Learn to anticipate what your users will ask for and how they’ll say it. Available voice assistant ecosystem tools can get you far fast, but VUI performance will always involve iterations to improve coverage.

  • A VUI can disappoint users in many ways: not recognize a reasonable request, not respond at all, respond incorrectly or in an uninformative manner, or respond in a way that the user doesn’t understand. All must be addressed and all have different solutions.
