© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
E. Wu, D. Maslov, Raspberry Pi Retail Applications, https://doi.org/10.1007/978-1-4842-7951-9_5

5. Voice Interaction Drive-through Self-service Station

Elaine Wu and Dmitry Maslov
Shenzhen, China

Problem Overview

Thanks to the rapid development of artificial intelligence technology, fast and reliable ASR (Automatic Speech Recognition) systems have finally become a reality. Neural network models capable of converting audible speech into written text for later parsing are now available to almost everyone on the globe, in hundreds of different languages. They are used in intelligent assistant services such as Siri and Google Assistant, in telephone customer service, and in many other fields. The next logical step is to move the dialogue with a customer from a mobile app to a voice-controlled device such as Amazon Echo or Google Home. See Figure 5-1.
Figure 5-1

Amazon smart speakers at an exhibition

Voice-controlled self-service kiosks bring the industry another step closer to the “fully automated” fast food restaurant.

The problem for fast food restaurants is that, because of their high throughput, it is financially difficult to have every customer served by a human employee. That is why the industry is looking for ways to automate order taking. The two main problems are the volume of orders and the payment process. The first can be addressed with touchscreen self-service kiosks, which are already in use in airports and railway stations and work well for fast food restaurants. The second is harder to solve, because a cashier is needed to collect money and the restaurant cannot be left without any employees. This is where speech interaction for ordering and payment comes in. The customer can place an order, pay, and get information about the waiting time without leaving their seat. This saves both customers and employees time and reduces errors and misunderstandings between the customer and the cashier.

The fast food industry is one of the fastest-growing markets for automation solutions. Its most famous representatives are McDonald’s and Kentucky Fried Chicken (KFC). Both companies have already implemented self-service kiosks in many of their restaurants.

Most recently, the COVID-19 epidemic has accelerated the process of adopting new technologies. Using a voice-enabled drive-through service allows businesses to reduce the spread of infection and protect their employees. See Figure 5-2.
Figure 5-2

COVID-19 prevention measures at a fast food restaurant

Business Impact

The future of the fast food industry depends on the ability of restaurants to provide a better customer experience by automating repetitive and time-consuming tasks. The main problem with ordinary touchscreen self-service kiosks is that customers do not know how to use them, so order entry is error-prone; an error occurs roughly once in every ten customer interactions. This leads to long queues in front of the kiosk, which is why voice interaction could be a solution. In natural conversation, people do not consult a manual on how to pronounce words, and nobody expects them to speak in an unnatural register (overly formal or slang). Using speech interaction therefore seems like a natural fit for voice-controlled self-service stations.

There are many ways to use voice recognition and speech feedback in a restaurant. It can be used by customers as well as employees. The customer can use a kiosk to order food, pay for it, and get information about the waiting time. This is already common practice in some fast food restaurants, such as KFC.

The technology is not always advanced enough to understand the order accurately or to provide the right information at the right time, which is why customers make many mistakes when placing orders. Speech recognition technology can reduce this problem significantly, because it can respond correctly even when a customer mispronounces something or uses terminology the machine does not expect. For example, if someone orders “a chicken nuggets with barbecue sauce,” instead of giving an error message, the machine will recognize that the person wants chicken nuggets with barbecue sauce and give further instructions on how to proceed with the order. See Figure 5-3.
Figure 5-3

People ordering at touchscreen kiosks

Another possibility for using speech interaction in fast food restaurants is for employees who work at self-service kiosks or counters. Instead of typing all the information into the computer manually, they can simply dictate it and have the system convert their speech into text. This may also help restaurants reduce the rate of mistakes made by cashiers.

Speech recognition technology will also make it easier for retail businesses to provide information to their customers. For example, a self-service kiosk can give a customer information about dishes and drinks, as well as where they can find something. This can be achieved by using custom software with speech-recognition and speech-synthesis capabilities.

The third way to use speech-interaction technology is in terms of marketing and advertising. Some fast food restaurants are using virtual assistants like Amazon Alexa and Google Home to advertise their products on customers’ smart speakers. This form of advertising may be especially useful for businesses that use shopping services such as Amazon Pantry or UberEats, because they can remind customers about their products whenever they see them in the list of deliveries.

Of course, there are some difficulties in using speech recognition technology in retail. For example, accents lower recognition accuracy: speech-recognition systems work best with a “standard” accent and are less accurate when processing an uncommon one. This is why it is very important to have sufficient speech data from different speakers and regions. See Figure 5-4.
Figure 5-4

Different languages and accents still present a challenge for speech recognition systems

Thus, the more data a speech recognition system has, the better its accuracy. A similar problem could occur between two different languages and dialects of one language (for example, between British English and American English). The second problem is that there should be a sufficient number of phrases for parsing in different situations (negative answers, questions, etc.). Additionally, there may be problems with background noise interfering with voice interaction.

This is especially relevant in fast food restaurants, where there is often music playing or other noise around the customers and employees. Finally, another problem may occur if people speak too quickly, too slowly, or with a stutter. If this happens in the middle of an interaction, it is difficult for the device to continue the conversation naturally, because it has little experience with such speech.

Related Knowledge

The main reason behind the significant increase in accuracy of automatic speech-recognition software, and by extension its usability, is deep learning. As with computer vision, before about 2010 speech recognition relied heavily on hand-crafted algorithms, which were complicated to program and rigid, and therefore didn’t perform well when encountering new accents or background noise. Anyone who used speech recognition to dial a phone number before 2010 knows what the quality of interaction was like. Deep learning has changed all that. In the decade since, the accuracy of speech recognition has improved from about 50% to nearly 95% thanks to deep learning. See Figure 5-5.
Figure 5-5

Graph of decreasing word error rate over the last decade

This is a revolution similar to the Internet and mobile computing revolutions. Deep learning has transformed self-driving cars and many other domains that deal with large volumes of data. Speech is one such domain, where deep learning is making a huge difference in the way we interact with computers.

How Does Deep Learning Work with Speech Recognition?

Deep learning works for speech recognition on the assumption that any language or accent has a fixed set of distinguishing characteristics, and each word and sentence is made up of these characteristics to varying degrees. The problem with speech recognition is therefore reduced to finding these distinguishing characteristics in the speech signal and mapping them to words or characters. The recognition algorithm involves a sequence of steps:
  1. Collecting large amounts of data: Collect hundreds of hours of speech samples from multiple speakers of a language/accent to train the neural network on.

  2. Preprocessing the speech data: This involves segmenting the speech into smaller frames of 100 ms-1000 ms and applying audio augmentations to the recordings.

  3. Transforming speech signals into a form that computers can understand: Each of the small frames of the recorded speech is converted into a vector of parameters. These vectors are then stacked together and converted into a 15-dimensional vector, which is passed through a neural network that learns the mapping between these parameters and words. (A short feature-extraction sketch follows this list.)

  4. Modeling: A neural network is trained on the speech data to learn the mapping between these parameters and words.

  5. Making predictions: Finally, the model is used to predict words given the parameters of the speech signal.
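To make step 3 more concrete, here is a minimal feature-extraction sketch in Python. It is only an illustration of the general idea, not the exact pipeline used by Picovoice or any other engine; the file name, sample rate, and frame sizes are assumptions, and the librosa package is assumed to be installed (pip3 install librosa).
import librosa
import numpy as np

# Load a mono recording, resampled to 16 kHz (a common rate for ASR)
audio, sample_rate = librosa.load("order_sample.wav", sr=16000)

# Split the signal into short overlapping frames and compute 13 MFCC
# coefficients per frame (window and hop sizes here are assumptions)
mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13,
                            n_fft=400,       # 25 ms analysis window
                            hop_length=160)  # 10 ms step between frames

# Stack the coefficients with their first derivatives so that each frame
# is described by a richer feature vector for the neural network
features = np.concatenate([mfcc, librosa.feature.delta(mfcc)], axis=0)
print(features.shape)  # (feature dimension, number of frames)
Each column of the resulting matrix corresponds to one frame of speech and would be fed, together with its neighbors, into the acoustic model.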
Deep learning for speech recognition provides the following benefits:
  • Very high accuracy: Speech recognition accuracy is measured in terms of “recognition rate,” which is the percentage of words that are correctly recognized by the software. For English, this number is about 90%, which is similar to what a human can achieve.

  • Better quality of interaction: Deep learning for speech recognition has made possible natural language user interfaces for computers. For example, a user might say “Turn on the lights in the living room” and the computer would know to turn on the lights in the living room. This is just not possible with conventional speech recognition methods.

  • Speakers don’t need to use a special microphone: Because deep learning can adapt to the speaker’s accent, a speaker can use any microphone, which is a very big deal. The common way to collect speech data was to ask people to record themselves saying a few hundred sentences using a special microphone, which was expensive and cumbersome.

As you might have realized by now, training a speech-recognition model is not an easy endeavor. You need a large enough dataset and powerful training servers to achieve good accuracy. There is no need to reinvent the wheel with speech recognition, however. There are plenty of options available, both for online and offline speech recognition. Next, we look at some of them to give you an idea as to which approach is best suited for your specific scenario.

Online Speech-Recognition Service Providers

This section discusses the common online speech-recognition service providers on the market today.

  • Microsoft Azure: With Azure (see Figure 5-6), you can have your own custom speech-recognition service up and running in under an hour. The flexibility of the platform allows developers to fine-tune the speech-recognition model.

Figure 5-6

Microsoft Azure logo

  • Amazon Alexa: Alexa (see Figure 5-7) offers a number of features, including voice recognition, text-to-speech, and natural language processing. The service is available for mobile, web, and devices running Alexa, such as the Amazon Echo.

Figure 5-7

Amazon Alexa logo

  • Amazon Lex: Lex (see Figure 5-8) is a service for building conversational interfaces for applications using voice and text. The service helps you build applications with chat bots so users can interact naturally with your software using voice and text.

Figure 5-8

Amazon Lex logo

  • IBM Watson: Watson (see Figure 5-9) is a cognitive computing platform that uses natural language processing and machine learning to reveal insights from large amounts of unstructured data.

Figure 5-9

IBM Watson logo

  • Google Cloud Speech API: The Google Cloud Speech API (see Figure 5-10) is a REST API that enables you to convert audio to text. It supports over 80 languages, and it provides high-quality transcription and low latency.

Figure 5-10

Google Cloud Speech API logo

All of these online services charge based on the amount of audio (hours or spoken utterances) they process. While it is cheap to get started, if you have many devices running, the cost may grow high enough for you to consider hybrid or offline solutions.

Offline speech recognition means the audio is processed on the device itself, which has the added benefits of lower latency and better privacy protection. The overall accuracy of offline speech recognition is not as high as that of its online alternatives, because it has to run on a device with much less compute capacity than a server. However, if, as with the project in this chapter, the vocabulary is limited to a specific domain (such as ordering fast food at a drive-through), you can still achieve good results.

Offline Speech Recognition Frameworks

This section discusses the common offline speech-recognition service providers on the market today.

  • Mozilla DeepSpeech: DeepSpeech (see Figure 5-11) is an open source, deep learning-based speech-recognition engine. It is capable of producing high-quality results in a wide variety of environments and languages. In 2020, following an internal reorganization at Mozilla, some of the original DeepSpeech developers forked DeepSpeech into another project called Coqui STT (pronounced “ko-kee”). The model inference engine and training scripts are open source, so it is possible to use pretrained models or train your own.

Figure 5-11

Mozilla DeepSpeech logo

  • Kaldi: This is an open source speech-recognition and signal-processing toolkit written in C++, freely available under the Apache License v2.0. Kaldi (see Figure 5-12) aims to provide flexible and extensible software and is intended for ASR researchers building recognition systems. While Kaldi is a powerful tool, it has recently been sidelined by other projects, since it is difficult to set up for production and is more suitable for research and experimentation.

Figure 5-12

Kaldi logo

  • Picovoice: This is an end-to-end platform for building voice products (see Figure 5-13). Unlike Alexa and the Google services, Picovoice runs entirely on-device while remaining comparatively accurate for a specific task. The main difference between Picovoice and other speech-recognition products is that it combines speech recognition with intent recognition. Using Picovoice, one can infer the user’s intent from a naturally spoken utterance such as:

  • “Hey Edison, set the lights in the living room to blue”

  • Picovoice detects the occurrence of the custom wake word (Hey Edison), and then extracts the intent from the follow-on spoken command. Picovoice is free to use for prototyping, but enterprise customers need to purchase a license to use it.

Figure 5-13

Picovoice logo

  • Fluent.ai: A set of solutions similar to Picovoice, enabling comparatively accurate and intuitive speech understanding in a small-footprint, low-latency package. It is capable of running fully offline on small devices. See Figure 5-14.

Figure 5-14

Fluent.ai logo

For the demonstration project in this chapter, we’re going to use Picovoice, since it focuses on speech-to-intent use cases with a specific domain, in this case ordering fast food.

Implementation

The Picovoice system consists of two main components: the wake-up word listener and the speech-to-intent recognition engine. The first is extremely lightweight and consumes hardly any system resources, so it can run 24/7 in the background. Upon detection of the wake-up word, the speech-to-intent engine starts listening. When the user finishes the utterance, it outputs the intent in parsed form, for example:
{
  "intent": "changeColor",
  "slots": {
    "location": "living room",
    "color": "blue"
  }
}
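As a simple illustration of how an application might act on such a parsed result, here is a hypothetical handler; the function and its arguments are not part of the Picovoice API, just a sketch of the dispatching idea.
def handle_inference(is_understood, intent, slots):
    # Hypothetical dispatcher for a parsed result like the JSON above
    if not is_understood:
        return "Sorry, I did not catch that."
    if intent == "changeColor":
        # slots is a dictionary, e.g. {"location": "living room", "color": "blue"}
        return "Setting the {} lights to {}.".format(
            slots.get("location", "room"), slots.get("color", "white"))
    return "I cannot handle the '{}' request yet.".format(intent)

print(handle_inference(True, "changeColor",
                       {"location": "living room", "color": "blue"}))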
The parsed intent can then be used to serve customers according to their requests. Additionally, you will see how to implement a simple on-device text-to-speech interface for confirming the order and conveying payment information to the customer. This enables the entire ordering process to be done hands-free and without a screen. The project involves three steps:
  1. Install Picovoice on the Raspberry Pi 4 and try the pretrained Picovoice model for lights control.

  2. Train a new speech-to-intent model for ordering fast food.

  3. Combine the new speech-to-intent model with the text-to-speech engine.

Use the Pretrained Picovoice Model for Lights Control

First, you need to connect a microphone to the Raspberry Pi 4. The exact choice will depend on your environmental conditions. For demonstration purposes, you can use a regular USB microphone if you have one lying around.

In this book, we use an affordable dual-microphone expansion board for the Raspberry Pi, Seeed Studio’s reSpeaker 2-mic Pi Hat (see Figure 5-15). It is based on the WM8960, a low-power stereo codec. While costing only $9.90, the board is equipped with two microphones, one on each side of the board, for collecting sound. It also provides three APA102 RGB LEDs, one user button, and two on-board Grove interfaces for expanding your applications. In addition, both a 3.5mm audio jack and a JST 2.0 speaker output are available for audio output.
Figure 5-15

Seeed Studio’s reSpeaker 2-mic Hat for Raspberry Pi

The first step is to install the drivers for your microphone. We use the Raspberry Pi OS 64-bit image as the starting point; for the purposes of this project, you can continue using the image you installed in Chapter 2. For some microphones, the drivers are already included in the Raspberry Pi OS Linux kernel. If you opt for the reSpeaker 2-mic Pi Hat, after installing the board on the Raspberry Pi 4 GPIO header, you need to install the drivers by executing a few simple commands:
sudo apt-get update
git clone https://github.com/respeaker/seeed-voicecard.git
cd seeed-voicecard
sudo ./install_arm64.sh
sudo reboot now

If you use a 32-bit image of the Raspberry Pi OS, you need to use the install.sh script instead of install_arm64.sh.

After the installation is finished, you can check for the presence of recording and playback devices with the aplay and arecord tools, as shown in Figure 5-16.
Figure 5-16

aplay -l and arecord -l execution results

Provided that the output of these commands on your system matches the output shown in Figure 5-16, your system is ready to start recording sound for inference.
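If you want to verify the audio path end to end, you can make a short test recording and play it back. The card and device numbers below are assumptions; substitute the ones reported by arecord -l and aplay -l on your system.
# Record five seconds of 16 kHz mono audio from card 1, device 0
# (adjust plughw:1,0 to match your arecord -l output)
arecord -D plughw:1,0 -f S16_LE -r 16000 -c 1 -d 5 test.wav
# Play the recording back through the default output device
aplay test.wav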

For the next step, clone the Picovoice GitHub repository and install the demo dependencies. Execute the following commands from the Chapter_5 folder:
git clone https://github.com/Picovoice/picovoice.git exercise_1
cd exercise_1
git submodule update --init
pip3 install -r demo/python/requirements.txt
Then run the following command from the cloned Picovoice folder (exercise_1) to start the demo:
python3 demo/python/picovoice_demo_mic.py --keyword_path resources/porcupine/resources/keyword_files/raspberry-pi/picovoice_raspberry-pi.ppn --context_path resources/rhino/resources/contexts/raspberry-pi/smart_lighting_raspberry-pi.rhn
Say the wake-up word (“picovoice”) immediately followed by a command (e.g., “turn on the lights in the bedroom”) to see intent being recognized and displayed in the terminal. See Figure 5-17.
Figure 5-17

The result of the “Turn on the lights in the kitchen” voice command being recognized

The next step is to train the custom model for fast food ordering using Picovoice Console.

Train a New Speech-to-Intent Model for Ordering Fast Food

To create a new speech-to-intent model for your application with Picovoice, go to https://console.picovoice.ai/ and register an account there. Then access the console and choose the Rhino engine. Choose an empty template and click Create Context. For the purposes of this project, we use the name “fastfood,” as shown in Figure 5-18.
Figure 5-18

Context creation interface

In the newly opened window, create the necessary intents using the New Intent box in the left column. For this example, we are going to have three intents: orderFood, confirm, and cancel. See Figure 5-19.
Figure 5-19

Intents for the fast food context

Add the slots next. When ordering food, the slots will be main, side, and drink. Add some food items to each of the slots as well. Here is what we used for this project (see Figure 5-20):
  • Main: Hamburger, chicken burger, fish burger, wrap

  • Side: French fries, salad, corn, onion rings

  • Drink: Coke, orange juice, Sprite, Diet Coke

Figure 5-20

Sample words for the main slot

After that, go to the orderFood intent and add some sentences. Make them as descriptive as you can. See Figure 5-21.
Figure 5-21

Sample sentences for the orderFood intent

Different grammar/politeness variations of the same sentence can be created by marking optional words with square brackets and listing alternatives in parentheses. For syntax details, see the Rhino syntax cheat sheet at https://picovoice.ai/docs/tips/syntax-cheat-sheet/.
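For example, an expression along the following lines (an illustrative sketch written from the cheat sheet’s notation, not copied from the console, so double-check it against the current syntax) covers several phrasings of the same order at once:
(I want, I would like, can I get) [a] $main:main [with] [a] $side:side [and] [a] $drink:drink [please]
Here $main:main, $side:side, and $drink:drink refer to the slots created earlier, the parenthesized group offers alternative openings, and the bracketed words are optional.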

After you finish adding expressions, use the microphone button on the right column to test your context in-browser. Wait for the “Listening for voice input…” prompt. Speak the phrase that matches your expression. The results of the speech-to-intent inference appear in a box below the microphone button. See Figure 5-22.
Figure 5-22

Inference results for “I would like a fish burger and french fries” sentence

If it matches an expression, it will show the intent to which that expression belongs. If the spoken phrase did not match any expressions, it will report that it did not understand. To make the model more flexible, try exploring possible variations in different accents/phrasing by asking different people to test the model.

Once you’re satisfied with the results, click the Save button and then train the model. Go to the Models tab and click Download, then choose Raspberry Pi as the target platform. After that, you can download the ZIP archive with a model file and copy it to your Raspberry Pi. If you use the Visual Studio Code IDE, that would be as simple as dragging and dropping the ZIP file to Chapter_5/exercise_2. Extract the ZIP archive with the following command from the exercise_2 folder:
unzip fastfood_raspberry-pi_v1.6.0.zip
Note

The name of the ZIP archive might differ.

After that, execute the following command from the Chapter_5 folder (NOT from the Chapter_5/exercise_2 folder!):
python3 demo/python/picovoice_demo_mic.py --keyword_path resources/porcupine/resources/keyword_files/raspberry-pi/picovoice_raspberry-pi.ppn --context_path fast-food.rhn
Make sure the context path argument points to the exact path of the context file you trained and copied to the Raspberry Pi (adjust it relative to your working directory if necessary). See Figure 5-23.
Figure 5-23

Execution result on Raspberry Pi 4 with the reSpeaker hat

Combine the New Speech-to-Intent Model with the Text-To-Speech Engine

The final step is to combine the custom speech-to-intent model with a simple text-to-speech engine in one Python script, so after understanding the order, the system will repeat it back to the customer and ask for confirmation. If the order is correct, the customer confirms it verbally or cancels it. Upon confirmation of the order, the total price and request for payment will be spoken.

You can find the finalized script in the Chapter_5/exercise_3 folder of the book materials. Let’s go through the most important parts of the script.
# Read the path to the Rhino context file from the command line
parser = argparse.ArgumentParser()
parser.add_argument('--context_path', help="Absolute path to context file.")
args = parser.parse_args()
# Create the order processor and start the main listening loop
processor = OrderProcessor(args)
processor.main()

After the imports at the top of the script, the script’s main entry point is executed: it reads the argument with the context file location, instantiates the OrderProcessor class with the arguments from the command line, and then calls the main() method of the OrderProcessor class.

The __init__ method of the OrderProcessor class initializes the “Porcupine” wake-up word engine with the default keyword set to “picovoice” and the Rhino speech-to-intent engine with the context file specified in the script’s arguments. It also prepares the audio stream with start_audio_input() and instantiates the text-to-speech engine for the talk-back function.
def __init__(self, args):
    # Wake-up word engine, using the built-in "picovoice" keyword
    self.wakeword_engine = pvporcupine.create(keywords=['picovoice'])
    self.wakeword_engine_frame_length = self.wakeword_engine.frame_length
    # Speech-to-intent engine, loaded with the custom fast food context
    self.nlu_engine = pvrhino.create(library_path=pvrhino.LIBRARY_PATH,
                                     model_path=pvrhino.MODEL_PATH,
                                     context_path=args.context_path)
    self.nlu_engine_frame_length = self.nlu_engine.frame_length
    # Terminal spinner shown while the system is listening
    self.spinner = Halo(text='Listening', spinner='dots')
    # Open the microphone stream and prepare the text-to-speech engine
    self.start_audio_input()
    self.engine = pyttsx3.init()
    # Menu prices in USD, keyed by the slot values of the Rhino context
    self.menu_prices = {"hamburger": 1.99,
                        "wrap": 1.3,
                        "chicken burger": 1.1,
                        "fish burger": 1.5,
                        "french fries": 0.5,
                        "salad": 0.6,
                        "corn": 0.4,
                        "coke": 0.2,
                        "sprite": 0.2,
                        "diet coke": 0.2,
                        "orange juice": 0.2}

Once these are all ready, the main logic of the program runs in the main() method. The important parts of the ordering flow are separated into methods of the OrderProcessor class. First you wait for the keyword, then a TTS greeting is played, and then the Rhino speech-to-intent engine parses the customer’s order. The system repeats the order back to the customer using TTS and waits for confirmation or cancellation.

After the order is confirmed, the cost is output and the customer is asked to pay using a QR code. The actual QR code payment code is not included in this example, but you can use the Paypal payment processing example in Chapter 3 to add this feature.
def main(self):
    while True:
        try:
            self.wait_for_keyword()
            self.speak('Welcome to order at Robo Fast Food.')
            time.sleep(0.5)
            self.order, phrase = self.process_order()
            self.speak(phrase)
            time.sleep(0.5)
            result = self.wait_for_confirmation()
            if result:
                total = sum([self.menu_prices[item] for item in self.order])
                phrase = "Your order total is {} USD. Please scan the QR code to pay. Enjoy your meal!".format(total)
            else:
                phrase = "Alright. Welcome to come back again!"
            self.speak(phrase)
        except KeyboardInterrupt:
            # Exit the listening loop cleanly on Ctrl+C
            break

See the example code for the detailed content of the other OrderProcessor methods.
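As a rough illustration of what two of those helpers might look like, here is a minimal sketch of speak() and wait_for_keyword(). It is a simplified approximation rather than the exact code from the book materials; in particular, read_audio_frame() is a hypothetical helper standing in for reading one frame of 16-bit PCM samples from the stream opened in start_audio_input().
def speak(self, phrase):
    # Convert a phrase to audible speech with the offline TTS engine
    # (pyttsx3 drives espeak on Raspberry Pi OS)
    self.engine.say(phrase)
    self.engine.runAndWait()

def wait_for_keyword(self):
    # Block until Porcupine reports the wake word; process() returns the
    # index of the detected keyword, or -1 if nothing was detected
    while True:
        pcm = self.read_audio_frame(self.wakeword_engine_frame_length)
        if self.wakeword_engine.process(pcm) >= 0:
            return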

Before you run the exercise_3 code, you need to install some additional dependencies. You can do that with the following (from the exercise_3 folder):
sudo apt-get install portaudio19-dev espeak
pip3 install -r requirements.txt
You can then run the example with the following command from the exercise_3 folder:
python3 porcupine_demo_mic.py --context_path [path-to-your-rhino-model]

Pro Tips

There are multiple improvements that can be made to the example script and setup you created in this project. If you want your device to handle a wider range of queries, you can use a generic speech-to-text engine, for example DeepSpeech or one of the cloud-based solutions described earlier, which offer very high accuracy even in noisy environments. You would then create a text-to-intent model based on your use case. Such models are normally easier to train, since the training data is text rather than raw audio. This setup would be more flexible and allow for more natural customer interaction.
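To illustrate the text-to-intent idea, here is a toy matcher that scans a transcript produced by a generic speech-to-text engine for known menu items. It is only a stand-in for a trained natural-language-understanding model; the menu list and phrases are assumptions based on the context built earlier.
MENU_ITEMS = ["hamburger", "chicken burger", "fish burger", "wrap",
              "french fries", "salad", "corn", "onion rings",
              "coke", "sprite", "diet coke", "orange juice"]

def text_to_intent(transcript):
    # Toy keyword-based intent extraction over a transcript string
    text = transcript.lower()
    if any(word in text for word in ("cancel", "never mind")):
        return {"intent": "cancel", "items": []}
    items = []
    # Match longer names first so "diet coke" is not also counted as "coke"
    for item in sorted(MENU_ITEMS, key=len, reverse=True):
        if item in text:
            items.append(item)
            text = text.replace(item, " ")
    if items:
        return {"intent": "orderFood", "items": items}
    return {"intent": "unknown", "items": []}

print(text_to_intent("I would like a fish burger and french fries"))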

While we used a regular Raspberry Pi 4 in this chapter, it is certainly possible to deploy the project on a Raspberry Pi Compute Module 4 installed on a custom-made carrier board similar to the one described in Chapter 3, or on a reTerminal with a microphone array attached. See Figure 5-24.
Figure 5-24

Raspberry Pi 3B+ with reSpeaker 4-mic

Some other possible expansions include a touchscreen for adding visual feedback to the interaction, a camera for face-recognition payment processing, and a QR code scanner for other payment options.
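If you add a display, one simple way to implement the QR payment step mentioned in the implementation section is to render a QR code that encodes a payment link. Here is a minimal sketch using the qrcode package (pip3 install qrcode[pil]); the payment URL is a placeholder, and the real link would come from your payment provider, such as the PayPal integration from Chapter 3.
import qrcode

# Encode a placeholder payment link into a QR code image that the kiosk
# could show on an attached screen
payment_url = "https://example.com/pay?order=1234&amount=3.99"
image = qrcode.make(payment_url)
image.save("payment_qr.png")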

Since text-to-speech is not the main focus of this chapter, we used the most basic engine available for the Raspberry Pi 4, espeak. Although it is easy to set up and use, it sounds a bit robotic, and there are better options available, both online and offline.

Summary

This chapter explained the main principles of speech recognition and then demonstrated the technology in action with a speech recognition-enabled fast food drive-through kiosk. The kiosk takes customers’ orders in natural language using the Picovoice Rhino speech-to-intent engine and talks back to them with the help of the espeak text-to-speech engine.
