Problem Overview
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig1_HTML.jpg)
Amazon smart speakers at an exhibition
Voice-controlled self-service kiosks bring the industry another step closer to a “fully automated” fast food restaurant.
The problem fast food restaurants face is that, due to their high throughput, it is financially difficult to have every customer served by a human employee. That is why the industry is looking for ways to automate order taking. Two of the main challenges are the volume of orders and the payment process. The first can be addressed with touchscreen self-service kiosks, which are already common in airports and railway stations and work well in fast food restaurants. The second is harder to solve, because a cashier is needed to collect money and because the restaurant should not be left without any employees. This is where speech interaction for ordering and payment comes in: the customer can place an order, pay, and get an estimate of the waiting time without leaving their seat. This saves time for customers and employees alike, and reduces the errors and misunderstandings that arise in communication between a customer and a cashier.
The fast food industry is one of the fastest-growing markets for automation solutions. Its most famous representatives are McDonald’s and Kentucky Fried Chicken (KFC). Both companies have already implemented self-service kiosks in many of their restaurants.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig2_HTML.png)
COVID-19 prevention measures at a fast food restaurant
Business Impact
The future of the fast food industry depends on restaurants’ ability to provide a better customer experience by automating repetitive and time-consuming tasks. The main problem with ordinary touchscreen self-service kiosks is that customers do not know how to use them, so order entry is error-prone (an error occurs roughly once in every ten customer interactions), which leads to long queues in front of the kiosk. This is why voice interaction could be a solution. In natural conversation, people do not consult a manual on how to pronounce words, and nobody expects them to speak in an unnatural register, whether overly formal or slang. Speech interaction is therefore a natural fit for voice-controlled self-service stations.
There are many ways to use voice recognition and speech feedback in a restaurant. It can be used by customers as well as employees. The customer can use a kiosk to order food, pay for it, and get information about the waiting time. This is already common practice in some fast food restaurants, such as KFC.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig3_HTML.png)
People ordering at touchscreen kiosks
Another possibility for using speech interaction in fast food restaurants is for employees who work at self-service kiosks or counters. Instead of typing all the information into the computer manually, they can simply dictate it and have the system convert their speech into text. This may also help restaurants reduce the rate of cashier mistakes.
Speech recognition technology will also make it easier for retail businesses to provide information to their customers. For example, a self-service kiosk can give a customer information about dishes and drinks, as well as where they can find something. This can be achieved by using custom software with speech-recognition and speech-synthesis capabilities.
The third way to use speech-interaction technology is in marketing and advertising. Some fast food restaurants use virtual assistants like Amazon Alexa and Google Home to advertise their products on customers’ smart speakers. This form of advertising may be especially useful for businesses that use shopping services such as Amazon Pantry or UberEats, because they can remind customers about their products whenever those appear in the list of deliveries.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig4_HTML.png)
Different languages and accents still present a challenge for speech recognition systems
Thus, the more data a speech recognition system has, the better its accuracy. Similar problems can occur between different languages, or between dialects of one language (for example, British English versus American English). A second issue is that there must be a sufficient number of phrases for parsing in different situations (negative answers, questions, etc.). Additionally, background noise may interfere with voice interaction.
This is especially relevant in fast food restaurants, where music or other sounds surround the customers and employees. Finally, problems may occur when people speak too quickly or too slowly, or when they stutter. If this happens in the middle of an interaction, the device may struggle to continue the conversation naturally, because it has little training data for such speech.
Related Knowledge
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig5_HTML.jpg)
Graph of decreasing word error rate over the last decade
Deep learning is a revolution similar to the Internet and mobile computing revolutions. It has enabled self-driving cars and advanced many other domains that deal with large volumes of data. Speech is one such domain, and deep learning is making a huge difference in the way we interact with computers.
How Does Deep Learning Work with Speech Recognition?
- 1.
Collecting large amounts of data: Collecting hundreds of hours of speech samples from multiple speakers of a language/accent to train the neural network on this data.
- 2.
Preprocessing the speech data: This involves segmenting the speech into smaller frames of 100 to 1,000 ms and applying audio augmentations to the recordings.
- 3.
Transforming speech signals into a form that computers can understand: Each of the small frames of the recorded speech is converted into a vector of parameters. These vectors are then stacked together and converted into a 15-dimensional vector. This vector is then passed through a neural network that learns the mapping between these parameters and words.
- 4.
Modeling: A neural network is trained on the speech data and the mapping between these parameters and words.
- 5.
Making predictions: Finally, the model is used to predict words given the parameters of the speech signal.
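The preprocessing and feature-extraction steps above can be sketched in plain Python. This is a simplified illustration, not production feature extraction: real systems compute spectral features such as MFCCs per frame, whereas here each frame is reduced to a toy two-value vector.

```python
# Simplified sketch of steps 2-3: segment a speech signal into frames,
# then stack per-frame feature vectors. Real pipelines compute spectral
# features (e.g., MFCCs) instead of the toy features used here.

def frame_signal(samples, sample_rate=16000, frame_ms=100):
    """Split raw samples into non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_features(frame):
    """Toy per-frame feature vector: mean amplitude and energy."""
    n = len(frame)
    mean = sum(frame) / n
    energy = sum(s * s for s in frame) / n
    return [mean, energy]

def featurize(samples, sample_rate=16000, frame_ms=100):
    """Steps 2-3 combined: segment, then stack frame vectors."""
    return [frame_features(f)
            for f in frame_signal(samples, sample_rate, frame_ms)]

# One second of audio at 16 kHz yields ten 100 ms frames.
features = featurize([0.0] * 16000)
print(len(features), len(features[0]))  # 10 2
```

The stacked vectors are what the neural network in steps 4 and 5 consumes.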
Very high accuracy: Speech recognition accuracy is measured in terms of the “recognition rate,” the percentage of words that the software recognizes correctly. For English, this number is about 90%, similar to what a human can achieve.
Better quality of interaction: Deep learning for speech recognition has made possible natural language user interfaces for computers. For example, a user might say “Turn on the lights in the living room” and the computer would know to turn on the lights in the living room. This is just not possible with conventional speech recognition methods.
Speakers don’t need to use a special microphone: Because deep learning can adapt to the speaker’s accent, a speaker can use any microphone, which is a very big deal. The common way to collect speech data was to ask people to record themselves saying a few hundred sentences using a special microphone, which was expensive and cumbersome.
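The recognition rate mentioned above is usually reported through its complement, the word error rate (WER). A minimal sketch of computing WER with word-level edit distance (the function name and dynamic-programming layout are illustrative, not tied to any particular toolkit):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("turn on the lights", "turn off the light")
print(f"WER: {wer:.2f}, recognition rate: {1 - wer:.0%}")
# WER: 0.50, recognition rate: 50%
```

A 90% recognition rate therefore corresponds to a WER of about 0.10.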
As you might have realized by now, training a speech-recognition model is not an easy endeavor. You need a large enough dataset and powerful training servers to achieve good accuracy. There is no need to reinvent the wheel with speech recognition, however. There are plenty of options available, both for online and offline speech recognition. Next, we look at some of them to give you an idea as to which approach is best suited for your specific scenario.
Online Speech-Recognition Service Providers
This section discusses the common online speech-recognition service providers on the market today.
Microsoft Azure: With Azure (see Figure 5-6), you can have your own custom speech-recognition service up and running in under an hour. The flexibility of the platform allows developers to fine-tune the speech-recognition model.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig6_HTML.jpg)
Microsoft Azure logo
Amazon Alexa: Alexa (see Figure 5-7) offers a number of features, including voice recognition, text-to-speech, and natural language processing. The service is available for mobile, web, and devices running Alexa, such as the Amazon Echo.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig7_HTML.jpg)
Amazon Alexa logo
Amazon Lex: Lex (see Figure 5-8) is a service for building conversational interfaces for applications using voice and text. It helps you build chatbot applications so users can interact naturally with your software.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig8_HTML.jpg)
Amazon Lex logo
IBM Watson: Watson (see Figure 5-9) is a cognitive computing platform that uses natural language processing and machine learning to reveal insights from large amounts of unstructured data.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig9_HTML.jpg)
IBM Watson logo
Google Cloud Speech API: The Google Cloud Speech API (see Figure 5-10) is a REST API that enables you to convert audio to text. It supports over 80 languages, and it provides high-quality transcription and low latency.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig10_HTML.jpg)
Google Cloud Speech API logo
All of these online services charge based on the number of hours of audio or spoken utterances processed. While it is cheap to get started, if you have many devices running, the cost may become high enough to make you consider hybrid or offline solutions.
Offline speech recognition means that the sound data is processed on the device, which has the added benefits of lower latency and better privacy protection. Its overall accuracy is not as high as that of its online alternatives, because it has to run on a device with much less compute capacity than a server. However, if, as with the project in this chapter, the vocabulary for the task is limited to a certain area (such as ordering fast food at a drive-through), you can still achieve good results.
Offline Speech Recognition Frameworks
This section discusses the common offline speech-recognition service providers on the market today.
Mozilla DeepSpeech: DeepSpeech (see Figure 5-11) is an open source, deep learning-based speech-recognition engine. It is capable of producing high-quality results in a wide variety of environments and languages. In 2020, because of an internal reorganization at Mozilla, some of the original DeepSpeech developers forked the project into Coqui STT (pronounced “ko-kee”). The model inference engine and training scripts are open source, so you can use pretrained models or train your own.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig11_HTML.jpg)
Mozilla DeepSpeech logo
Kaldi: This is an open source speech-recognition toolkit, written in C++ for speech recognition and signal processing and freely available under the Apache License v2.0. Kaldi (see Figure 5-12) aims to provide flexible and extensible software, and it is intended for use by ASR researchers building recognition systems. While Kaldi is a powerful tool, it has recently been sidelined by other projects, since it is difficult to set up for production and is more suitable for research and experimentation.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig12_HTML.jpg)
Kaldi logo
Picovoice: This is an end-to-end platform for building voice products (see Figure 5-13). Unlike the Alexa and Google services, Picovoice runs entirely on-device while remaining comparatively accurate for a specific task. The main difference between Picovoice and other speech-recognition products is that it combines speech recognition with intent recognition. Using Picovoice, one can infer the user’s intent from a naturally spoken utterance such as:
“Hey Edison, set the lights in the living room to blue”
Picovoice detects the occurrence of the custom wake word (Hey Edison), and then extracts the intent from the follow-on spoken command. Picovoice is free to use for prototyping, but enterprise customers need to purchase a license to use it.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig13_HTML.jpg)
Picovoice logo
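Once the speech-to-intent engine returns an inference, the application still has to map intents and slots to actions. A minimal dispatcher sketch in plain Python (the intent name `changeLightState` and the slot names are hypothetical illustrations, not taken from an actual Picovoice context):

```python
# Hypothetical intent and slot names for illustration; a real context
# file defines its own. Each handler receives the slot dictionary
# extracted from the spoken command.

def change_light_state(slots):
    return f"Turning the lights in the {slots['location']} {slots['state']}"

HANDLERS = {
    "changeLightState": change_light_state,
}

def dispatch(is_understood, intent, slots):
    """Route a speech-to-intent inference result to an action handler."""
    if not is_understood or intent not in HANDLERS:
        return "Sorry, I didn't understand that."
    return HANDLERS[intent](slots)

print(dispatch(True, "changeLightState",
               {"location": "living room", "state": "on"}))
# Turning the lights in the living room on
```

The wake word detection happens before this step; the dispatcher only sees the follow-on command's inference.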
Fluent.ai: A set of solutions similar to Picovoice, enabling comparatively accurate and intuitive speech understanding in a small-footprint, low-latency package. It is capable of running fully offline on small devices. See Figure 5-14.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig14_HTML.jpg)
Fluent.ai logo
For the demonstration project in this chapter, we’re going to use Picovoice, since it focuses on speech-to-intent use cases with a specific domain, in this case ordering fast food.
Implementation
The implementation consists of three steps:
- 1)
Install Picovoice on Raspberry Pi 4 and try the pretrained Picovoice model for lights control.
- 2)
Train a new speech-to-intent model for ordering fast food.
- 3)
Combine the new speech-to-intent model with the text-to-speech engine.
Use the Pretrained Picovoice Model for Lights Control
First you need to install a microphone on the Raspberry Pi 4. The exact choice will be determined by your environmental conditions. For demonstration purposes, you can use a regular USB microphone if you have one lying around.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig15_HTML.jpg)
Seeed Studio’s reSpeaker 2-mic Hat for Raspberry Pi
If you use a 32-bit image of the Raspberry Pi OS, you need to use the install.sh script instead of install_arm64.sh.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig16_HTML.jpg)
aplay -l and arecord -l execution results
Provided that the output of these commands on your system matches the output shown in Figure 5-16, your system is ready to start recording sound for inference.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig17_HTML.jpg)
The result of the “Turn on the lights in the kitchen” voice command being recognized
The next step is to train the custom model for fast food ordering using Picovoice Console.
Train a New Speech-to-Intent Model for Ordering Fast Food
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig18_HTML.jpg)
Context creation interface
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig19_HTML.jpg)
Intents for the fast food context
Main: Hamburger, chicken burger, fish burger, wrap
Side: French fries, salad, corn, onion rings
Drink: Coke, orange juice, Sprite, Diet Coke
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig20_HTML.jpg)
Sample words for the main slot
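Inside the ordering application, these slots map naturally onto lookup tables. A sketch with made-up prices (the dollar figures are placeholders, not from the book's example):

```python
# Slot values from the context, paired with hypothetical prices.
MENU = {
    "main":  {"hamburger": 4.50, "chicken burger": 4.90,
              "fish burger": 5.20, "wrap": 4.20},
    "side":  {"french fries": 2.10, "salad": 2.80,
              "corn": 1.90, "onion rings": 2.40},
    "drink": {"coke": 1.50, "orange juice": 2.00,
              "sprite": 1.50, "diet coke": 1.50},
}

def order_total(slots):
    """Sum the prices for the slot values extracted from one utterance."""
    return round(sum(MENU[slot][value.lower()]
                     for slot, value in slots.items()), 2)

print(order_total({"main": "fish burger", "side": "french fries"}))  # 7.3
```

The slot dictionary here mirrors the shape of the inference result the speech-to-intent engine produces for the orderFood intent.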
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig21_HTML.jpg)
Sample sentences for the orderFood intent
Different grammar/politeness variations of the same sentence can be created by placing options within square brackets. For syntax details, see the Rhino syntax cheat sheet at https://picovoice.ai/docs/tips/syntax-cheat-sheet/.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig22_HTML.jpg)
Inference results for “I would like a fish burger and french fries” sentence
If the spoken phrase matches an expression, the console shows the intent to which that expression belongs; if it does not match any expression, it reports that it did not understand. To make the model more flexible, explore variations in accent and phrasing by asking different people to test the model.
The name of the ZIP archive might differ.
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig23_HTML.jpg)
Execution result on Raspberry Pi 4 with the reSpeaker hat
Combine the New Speech-to-Intent Model with the Text-To-Speech Engine
The final step is to combine the custom speech-to-intent model with a simple text-to-speech engine in one Python script, so that after understanding the order, the system repeats it back to the customer and asks for confirmation. The customer then verbally confirms or cancels the order. Upon confirmation, the total price and a request for payment are spoken.
After the necessary imports at the top of the script, the script’s main entry point is executed. It reads the argument with the context file location and instantiates the OrderProcessor class with arguments from the command line. After that, the main() method of the OrderProcessor class is called.
Once everything is ready, the main logic of the program executes in the main() method. The important parts of the food-ordering flow are separated into methods of the OrderProcessor class. First you wait for the wake word, then a TTS greeting is played, and then the Rhino speech-to-intent engine parses the customer’s order. The system repeats the order back to the customer using TTS and waits for confirmation or cancellation.
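A stripped-down sketch of that flow, with the speech engines replaced by injectable callables so the control logic is visible (the class name follows the chapter's description, but the method bodies here are illustrative, not the book's actual code):

```python
class OrderProcessor:
    """Order flow skeleton: greet, take the order, confirm, request payment.
    listen() stands in for the Rhino speech-to-intent engine and
    speak() for the TTS engine (espeak in this chapter's setup)."""

    def __init__(self, listen, speak):
        self.listen = listen   # () -> dict mapping slot name to value
        self.speak = speak     # str -> None

    def main(self):
        self.speak("Welcome! What would you like to order?")
        slots = self.listen()
        summary = " and ".join(slots.values())
        self.speak(f"You ordered {summary}. Is that correct?")
        if self.listen().get("confirmation") == "yes":
            self.speak("Thank you, please proceed to payment.")
        else:
            self.speak("Order cancelled.")

# Simulated interaction: the first utterance is the order,
# the second one confirms it.
replies = iter([{"main": "hamburger", "drink": "coke"},
                {"confirmation": "yes"}])
spoken = []
OrderProcessor(lambda: next(replies), spoken.append).main()
print(spoken[-1])  # Thank you, please proceed to payment.
```

Injecting the engines as callables also makes the flow testable without a microphone or speaker attached.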
See the example code for the detailed content of the other OrderProcessor methods.
Pro Tips
There are several improvements you can make to the example script and setup created in this project. If you want your device to handle more queries, you can use a generic speech-to-text engine, such as DeepSpeech or one of the cloud-based solutions described earlier, which achieve high accuracy even in noisy environments. You would then create a text-to-intent model based on your use case. Such models are normally easier to train, since the training dataset is text rather than raw audio. This setup would be more flexible and allow for more natural customer interaction.
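The text-to-intent stage suggested here can start out as simple keyword matching over the transcript. A naive sketch (the keyword lists are illustrative; a real model would be trained on labeled utterances):

```python
# Naive text-to-intent: scan a transcript produced by a generic
# speech-to-text engine for known menu keywords. A trained text
# classifier would replace this lookup in a real deployment.
MENU_KEYWORDS = {
    "main":  ["hamburger", "chicken burger", "fish burger", "wrap"],
    "drink": ["coke", "sprite", "orange juice"],
}

def text_to_intent(transcript):
    """Return (intent, slots) extracted from a plain-text transcript."""
    text = transcript.lower()
    slots = {slot: value
             for slot, values in MENU_KEYWORDS.items()
             for value in values if value in text}
    intent = "orderFood" if slots else None
    return intent, slots

print(text_to_intent("I would like a wrap and a coke, please"))
# ('orderFood', {'main': 'wrap', 'drink': 'coke'})
```

Because the input is text, collecting and labeling training data for a better version of this stage is far cheaper than recording audio.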
![](https://imgdetail.ebookreading.net/2023/10/9781484279519/9781484279519__9781484279519__files__images__512749_1_En_5_Chapter__512749_1_En_5_Fig24_HTML.png)
Raspberry Pi 3B+ with reSpeaker 4-mic
Some other possible expansions include a touchscreen for adding visual feedback to interaction, a camera for face recognition payment processing, and a QR code scanner for other payment options.
Since text-to-speech is not the main focus of this chapter, we used the most basic engine available for the Raspberry Pi 4: espeak. Although it is easy to set up and use, it sounds a bit robotic, and better options are available, both online and offline.
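Invoking espeak from Python is a one-liner via `subprocess`. A small sketch (the speed and voice values are arbitrary examples of espeak's `-s` and `-v` flags; espeak must be installed on the Pi for `say()` to work):

```python
import subprocess

def build_espeak_command(text, speed=140, voice="en"):
    """Build the espeak argument list: -s sets words per minute,
    -v selects the voice."""
    return ["espeak", "-s", str(speed), "-v", voice, text]

def say(text):
    # Requires espeak to be installed (e.g., sudo apt install espeak).
    subprocess.run(build_espeak_command(text), check=True)

print(build_espeak_command("Your order is ready"))
```

Separating command construction from execution keeps the TTS call easy to swap out for a better-sounding engine later.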
Summary
This chapter explained the main principles of speech recognition and then demonstrated the technology with an example of a speech recognition-enabled fast food drive-through kiosk. The kiosk takes customers’ orders in natural language using the Picovoice Rhino speech-to-intent engine and talks back to customers with the help of the espeak text-to-speech engine.