From Wikipedia: hearing, auditory perception, or audition is the ability to perceive sound by detecting vibrations, changes in the pressure of the surrounding medium through time, through an organ such as the ear. Like touch, audition requires sensitivity to the movement of molecules in the world outside the organism. Both hearing and touch are types of mechanosensation.
This chapter contains many short examples and code snippets for common audio tasks: installing audio programs, playing music files, recording and processing audio streams, performing speech recognition, and performing text-to-speech. Many of these examples can be combined to build full end-to-end audio systems and voice-controlled electronics. A quick example of a voice-controlled electronic system is given at the end of this chapter.
There are only two parts required for this chapter: parts 1 and 2 below. As an alternative for part 1, you can use parts 3 and 4 in conjunction:
1. A Linux-compatible USB headset
I use the Logitech ClearChat Comfort/USB Headset H390. It’s fairly cheap at $25 on Amazon and has a mute button that comes in handy. However, any Linux-compatible USB headset should work well with Edison, and cheaper ones are available.
2. A power supply
Same requirements as those listed in Chapter 5 (see “Materials List”).
3. A USB-to-3.5 mm speaker/headphone and microphone jack
If you prefer to use headphones you already own, you can buy a USB-to-3.5 mm jack converter. Good examples are the Plugable USB Audio Adapter with 3.5mm Speaker/Headphone and Microphone Jacks and the iLuv USB Audio Adapter.
4. A 3.5 mm microphone
Assuming you already have 3.5 mm headphones, you’ll also need a 3.5 mm jack microphone to plug into your USB adapter.
The first step in exploring sound is being able to record sounds and listen to recorded sounds. Whereas in Chapter 5, you started video processing by pulling images from the Internet, here you’ll start right away by connecting a headset.
Make sure the switch in between the microUSB and USB slot is flipped toward the USB port and plug in your USB headset. If you have the same (or similar) Logitech that I’m using, the red LED on the mute button will start blinking after a few seconds. This is an indication that the headset is receiving power from Edison and that the microphone is muted. If you push the button, the LED will turn solid red, indicating that the microphone is now active.
As a first step, you need to tell Edison to use your USB headset as the default system device for sound. You can see how Edison references your sound card with the aplay command:
# aplay -Ll | tail -5
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 2: Headset [Logitech USB Headset], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
Your device should appear as card 2 and will be called by name. The important piece is what comes after the colon and before the opening square bracket. This is the name by which you can reference your headset sound card; in this case, it’s simply Headset.
Use your favorite text editor to open the /etc/asound.conf file. Add the following line to the top, save it, and then close the file:
pcm.!default sysdefault:Headset
Remember to replace Headset with whatever device name you discovered using the aplay command.
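If you would rather not pick the name out by eye, the same rule (the text between the colon and the opening square bracket) can be applied with a few lines of Python. This is only a sketch: the function name card_names and the regular expression are illustrative assumptions, and the sample text is copied from the aplay output in this section.

```python
import re

# Matches lines such as:
#   card 2: Headset [Logitech USB Headset], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(r"^card\s+(\d+):\s+(\S+)\s+\[")


def card_names(aplay_output):
    """Return (card_number, short_name) pairs found in aplay-style text."""
    found = []
    for line in aplay_output.splitlines():
        match = CARD_LINE.match(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


sample = """\
card 2: Headset [Logitech USB Headset], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""

for number, name in card_names(sample):
    # The line to place at the top of /etc/asound.conf for this card
    print("pcm.!default sysdefault:" + name)
```

On a real system you would feed the function the captured output of aplay instead of the hard-coded sample.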
Install the alsa utilities for sound and the text-to-speech engine espeak:
# opkg install alsa-utils espeak
Occasionally you’ll get errors using the opkg installer, most of which can be solved by refreshing opkg’s package lists and trying again:
# opkg update
Alsa automatically installs some wave files for testing your headset. You can play one of these to make sure your audio is configured properly:
# aplay /usr/share/sounds/alsa/Front_Center.wav
If everything is installed properly, you should hear the words “Front, Center” in a nice-sounding woman’s voice. Let’s switch that to a creepy robot voice by using espeak!
# espeak "Front, Center"
If you just type the command espeak with no arguments, it reads lines from standard input and speaks each one in the creepy robot voice until you terminate it with Ctrl+C:
# espeak
front center
hello
whatever
^C
#
The espeak program has a lot of fun options to play around with, from intonation to speech speed and even language! You can see the full list by calling the help dialogue (the same flag works for aplay and arecord and many other command-line programs):
# espeak -h
To give you a feel for it, try asking, “How are you?” in fast-paced, high-pitched Swedish:
# espeak -v sv "Hur mår du?" -p 99 -s 250
You can also record your own voice using the arecord command and then play it back using aplay. Unless you specify a duration, arecord will record until you stop it with Ctrl+C:
# arecord hello.wav
Recording WAVE 'hello.wav' : Unsigned 8 bit, Rate 8000 Hz, Mono
^CAborted by signal Interrupt...
# aplay hello.wav
Playing WAVE 'hello.wav' : Unsigned 8 bit, Rate 8000 Hz, Mono
Remember how awesome the iPod was when it came out? You can turn Edison into your very own iPod quite easily.
If you’d like to turn Edison into a small form-factor mp3 player, one great setup option is the SparkFun Base and Battery Blocks. From here, you can tack on the OLED Block for display controls or the microSD Block for increased storage capacity. If you’d like your music player to last for a really long time, a cute trick is to connect a cell phone battery pack to one of the microUSBs instead of using the Battery Block. These have a much higher battery capacity.
First, install mpg123, a command-line player for mp3s:
# opkg install mpg123
Next, just for this example, you’ll pull some songs from LastFM’s free download collection:
# wget http://freedownloads.last.fm/download/59565166/From%2BEmbrace%2BTo%2BEmbrace.mp3
# wget http://freedownloads.last.fm/download/569330114/Lost%2BBoys.mp3
# wget http://freedownloads.last.fm/download/569264057/Get%2BGot.mp3
The downloaded filenames are shown here:
# ls *.mp3
From+Embrace+To+Embrace.mp3
Lost+Boys.mp3
Get+Got.mp3
You can play any one of these files with the mpg123 command and quit with Ctrl+C:
# mpg123 Get+Got.mp3
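You can play all the mp3 files in random order with the -Z flag and the wildcard *. Note that -Z keeps picking random tracks until you stop it; the lowercase -z flag shuffles the list and plays each file once:
# mpg123 -Z *.mp3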
You can create a playlist for these files using a simple text document. Create a file called playlist.txt with the following contents, one file per line. Remember to replace /home/root/ with the complete paths to the files on your own system:
/home/root/Get+Got.mp3
/home/root/From+Embrace+To+Embrace.mp3
/home/root/Lost+Boys.mp3
Now play your playlist with mpg123:
# mpg123 -@ playlist.txt
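Typing full paths by hand gets tedious once you have more than a few songs. As a rough sketch, a short Python script can build the playlist for you. The function name write_playlist is an assumption made for illustration, and the script simply lists every .mp3 file in one directory in alphabetical order.

```python
import glob
import os


def write_playlist(music_dir, playlist_path):
    """Write the absolute path of every .mp3 in music_dir, one per line."""
    pattern = os.path.join(os.path.abspath(music_dir), "*.mp3")
    tracks = sorted(glob.glob(pattern))
    with open(playlist_path, "w") as playlist:
        for track in tracks:
            playlist.write(track + "\n")
    return tracks


if __name__ == "__main__":
    # Build playlist.txt from the mp3 files in the current directory
    tracks = write_playlist(".", "playlist.txt")
    print("Wrote %d tracks to playlist.txt" % len(tracks))
```

The resulting file can be passed to mpg123 with the -@ flag exactly like the handwritten one.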
You’ve just created an iPod! You can even use your SPI screen to display the info for each song.
Let’s do something a little more computationally savvy with our audio. You’ll use Python to interface with the audio stream and detect when you’re speaking into the mic and when you’re silent.
You’ll use the pyaudio library for this. First, install the dependencies using opkg:
# opkg install libjack
# opkg install --nodeps jack-dev
# opkg install libportaudio-dev
Then, install pyaudio, making sure to set the flags so that pip trusts the unverified pyaudio library:
# pip install --allow-external pyaudio --allow-unverified pyaudio pyaudio
Now, download a simple audio-recording script from GitHub, courtesy of Mahmoud Abdrabo. This is a modified example from the pyaudio documentation:
# wget https://gist.githubusercontent.com/mabdrabo/8678538/raw/30e63a8c2ab78b516b13a180895308b8a4244ecf/sound_recorder.py
This will download sound_recorder.py. Open this file.
import pyaudio
import wave

FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "file.wav"

audio = pyaudio.PyAudio()

# start Recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE, input=True,
                    frames_per_buffer=CHUNK)
print("recording...")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("finished recording")

# stop Recording
stream.stop_stream()
stream.close()
audio.terminate()

waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
waveFile.setnchannels(CHANNELS)
waveFile.setsampwidth(audio.get_sample_size(FORMAT))
waveFile.setframerate(RATE)
waveFile.writeframes(b''.join(frames))
waveFile.close()
The script first imports the necessary libraries: pyaudio to pull the audio stream, and wave to save the audio file.
Next, it sets the audio recording format by declaring variables for the sample format (FORMAT), number of channels (CHANNELS), sample rate (RATE), amount of data to read at a time (CHUNK), duration of recording (RECORD_SECONDS), and an output filename (WAVE_OUTPUT_FILENAME).
The script then creates a pyaudio audio stream and opens it with the given configuration. Before the script starts recording, it prints “recording…” and then creates an empty frames array.
Each time through the for loop, which iterates as many times as there are chunks to read in five seconds, the audio data that is read is tacked onto (or appended to) the frames array. When the loop finishes, “finished recording” is printed.
The audio stream is stopped and closed, and the pyaudio object is terminated.
The full five-second recorded audio is written to disk using wave functions. The file is opened and configured, the frames array is written out, and the file is closed.
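To make the numbers in this walkthrough concrete, the following sketch repeats the script’s arithmetic without touching the microphone. It assumes the same settings as sound_recorder.py and substitutes chunks of digital silence for stream.read(CHUNK); the output path and the variable names num_chunks and bytes_per_chunk are made up for illustration.

```python
import os
import tempfile
import wave

FORMAT_BYTES = 2          # paInt16 samples are 2 bytes wide
CHANNELS = 2
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5

# Same expression as the loop bound in sound_recorder.py
num_chunks = int(RATE / CHUNK * RECORD_SECONDS)
# Each read returns CHUNK frames; a frame holds one sample per channel
bytes_per_chunk = CHUNK * CHANNELS * FORMAT_BYTES

# Stand-in for stream.read(CHUNK): chunks of digital silence
frames = [b"\x00" * bytes_per_chunk for _ in range(num_chunks)]

path = os.path.join(tempfile.mkdtemp(), "silence.wav")
wave_file = wave.open(path, "wb")
wave_file.setnchannels(CHANNELS)
wave_file.setsampwidth(FORMAT_BYTES)
wave_file.setframerate(RATE)
wave_file.writeframes(b"".join(frames))
wave_file.close()

# Read the header back to confirm how much audio was stored
check = wave.open(path, "rb")
seconds = check.getnframes() / float(RATE)
check.close()

print("chunks read:     %d" % num_chunks)
print("bytes per chunk: %d" % bytes_per_chunk)
print("file length:     %.2f s" % seconds)
```

Because the chunk count is truncated to a whole number, the file comes out slightly shorter than five seconds.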
Run the file and, making sure your mic is not muted, record yourself speaking in between the “recording…” and “finished recording” lines. Play your wonderful recording back to you using aplay:
# aplay file.wav
It’s great to just blanket record sounds, but it would be nice to make our system a bit smarter. We’re going to modify sound_recorder.py to help us do that.
First, add the following line at the end of the import statements:
import audioop
Now, in the for loop where you recorded the stream, add the print statement in between the two other lines as shown here:
    data = stream.read(CHUNK)
    print(audioop.rms(data, 2))
    frames.append(data)
The audioop module contains a series of functions to help you manipulate raw audio data. In this case, you’re taking the root mean square (RMS) of each audio chunk you read; the second argument, 2, is the sample width in bytes. The root mean square is calculated by squaring every data point in the sample, taking the mean of these values, and then taking the square root of that mean. You can think of it like the mean, except the RMS value puts a higher emphasis on outlier points.
The RMS value is a common way to detect speech and silence in audio streams. If even just a few points have higher volumes, the RMS value will be skewed way up. Save your modified sound_recorder.py file and run it again. This time, as the data is processed, the RMS values will be printed to the screen.
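The calculation described above (square, average, square root) can be written out by hand to see why a few loud samples pull the value up so sharply. This is a minimal sketch rather than the audioop implementation: the helper names rms_int16 and mean_abs_int16 and the sample values are invented for illustration, and the library’s own result may differ slightly by rounding.

```python
import math
import struct


def rms_int16(chunk):
    """RMS of a bytes object holding little-endian signed 16-bit samples."""
    count = len(chunk) // 2
    samples = struct.unpack("<%dh" % count, chunk)
    mean_square = sum(s * s for s in samples) / float(count)
    return math.sqrt(mean_square)


def mean_abs_int16(chunk):
    """Plain average of the absolute sample values, for comparison."""
    count = len(chunk) // 2
    samples = struct.unpack("<%dh" % count, chunk)
    return sum(abs(s) for s in samples) / float(count)


# Eight quiet samples, then the same data with one loud outlier
quiet = struct.pack("<8h", 10, -10, 10, -10, 10, -10, 10, -10)
spiky = struct.pack("<8h", 10, -10, 10, -10, 10, -10, 10, 8000)

for label, chunk in (("quiet", quiet), ("spiky", spiky)):
    print("%s: mean |x| = %.2f, RMS = %.2f"
          % (label, mean_abs_int16(chunk), rms_int16(chunk)))
```

For the quiet data the two measures agree, while a single outlier moves the RMS much further than it moves the plain average.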
Run the script again with your mic on mute. You’ll notice that the printed values drop immediately down to low single-digit values. The audio stream is receiving basically 0 noise. Run it again with the mic unmuted but without speaking. You’ll probably see values in the 100s and 200s (if your house is approximately as noisy as mine). If you put the headset on and begin talking, you’ll notice that the values probably shoot up into the 1,000s. If you’re using a USB headset, one of the reasons this value shoots up so high is that your mic is directional, meaning it prefers sound to come from one direction as opposed to all over. In this case, it’s from the slot facing your mouth.
By choosing an adequate threshold, you can actually detect fairly well when people are speaking and when they’re silent. Try playing with this yourself. Open the sound_recorder.py script one more time, define a threshold for speech, and modify the loop as follows:
threshold = 800
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    rms = audioop.rms(data, 2)
    if rms > threshold:
        print("You're speaking")
    frames.append(data)
Save and exit the file, and then run it. This time through, silence gets you nothing, but any time you speak, you should see the “You’re speaking” text. Feel free to choose a threshold that best suits your headset; it will definitely vary by model and even by fit. If you find yourself needing more time, modify the recording duration (RECORD_SECONDS variable) to better suit your needs.
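If you want to experiment with thresholds away from the microphone, the same test can be run on synthetic audio. The sketch below is an assumption-laden stand-in: tone_chunk fabricates mono sine-wave chunks in place of stream.read(CHUNK), the amplitudes 150 and 5000 are arbitrary choices for “room noise” and “speech”, and the RMS is computed by hand instead of with audioop.

```python
import math
import struct

RATE = 44100
CHUNK = 1024
THRESHOLD = 800


def rms_int16(chunk):
    """RMS of a bytes object holding little-endian signed 16-bit samples."""
    count = len(chunk) // 2
    samples = struct.unpack("<%dh" % count, chunk)
    return math.sqrt(sum(s * s for s in samples) / float(count))


def tone_chunk(amplitude, freq=440.0):
    """One CHUNK of a mono 16-bit sine wave (a stand-in for mic data)."""
    samples = [int(amplitude * math.sin(2 * math.pi * freq * n / RATE))
               for n in range(CHUNK)]
    return struct.pack("<%dh" % CHUNK, *samples)


def speech_flags(chunks, threshold=THRESHOLD):
    """True for every chunk whose RMS is above the threshold."""
    return [rms_int16(chunk) > threshold for chunk in chunks]


# Two quiet chunks, two loud chunks, one quiet chunk
chunks = [tone_chunk(150), tone_chunk(150),
          tone_chunk(5000), tone_chunk(5000), tone_chunk(150)]
flags = speech_flags(chunks)

# Each chunk covers CHUNK / RATE seconds of audio (about 23 ms here)
speaking_seconds = sum(flags) * CHUNK / float(RATE)
print(flags)
print("speaking for about %.3f s" % speaking_seconds)
```

Counting flagged chunks this way gives a rough measure of how long someone spoke during a recording.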
If you think speech detection is cool, you’re in for a treat, because speech recognition blows it out of the water. Up until now, we’ve been doing everything offline, directly on Edison without the need for Internet. This example uses the Google Speech Recognition API and so needs an Internet connection to work. Make sure your Edison is on WiFi.
The Python library you’ll be using is SpeechRecognition. You’ll need to install a single dependency and then pip install the library itself:
# opkg install flac-dev
# pip install SpeechRecognition
After installation, create a Python script called SpeechRecognize.py with the following contents (this code comes almost directly from the examples on the SpeechRecognition page):
import speech_recognition as sr

r = sr.Recognizer()
print("Start Listening...")
with sr.Microphone() as source:
    # use system default microphone as the audio source
    audio = r.listen(source)
    # listen for the first phrase and extract it into audio data
print("Done Listening...")

try:
    # recognize speech using Google Speech Recognition
    print("You said: " + r.recognize(audio))
except LookupError:
    # speech is unintelligible
    print("Could not understand audio")
Run the script now. You’ll see that the script hangs after it prints “Start Listening…” until you both start and finish speaking. When you finish, the script prints “Done Listening…”, hangs for another second or so, and then prints out the contents of your speech or the error message, “Could not understand audio.” If you play with this script a little bit, you’ll immediately see how powerful it is. The listen method is very good at capturing full phrases, and the speech classifier is very, very accurate. Essentially, you’ve just added Google Now to your Edison in less than 20 lines of code.
It’s worth noting that this example uses pyAudio to handle the audio stream from the mic. The thresholding and audio stream capture are done in much the same way as you did manually in the previous example, but this time you’ve got a fancy library to do it all in the background!
This does raise an interesting point though. In the last exercise, I told you that ambient noise and speech volume would vary with not only the environment but also with the microphone. You might wonder how the SpeechRecognition library works so well right off the bat. SpeechRecognition has some default settings that work pretty well for most normal situations. Having a nice, directional headset microphone makes the algorithm work even better. Think about how much more defined your voice is into this headset than it would be yelling at your phone on a New York street corner. However, if you are out on that New York street corner, you’re going to need a way to recalibrate as we did previously. All you need to do is add a single line to the script, directly above the listen command:
with sr.Microphone() as source:
    # use system default microphone as the audio source
    r.adjust_for_ambient_noise(source)  # adjust thresholding
    audio = r.listen(source)
    # listen for the first phrase and extract it into audio data
This adjust_for_ambient_noise function listens for one second to the audio stream and uses that to calibrate the threshold for ambient noise. Now, in theory, your recognition should be good in almost any environment.
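To get a feel for what this kind of calibration involves, here is a deliberately simple sketch that derives a speech threshold from RMS readings taken while nobody is talking. It is not the SpeechRecognition library’s actual algorithm: the function calibrate_threshold, the margin of 1.5, the floor of 100, and the sample readings are all assumptions chosen for illustration.

```python
def calibrate_threshold(ambient_rms, margin=1.5, floor=100):
    """Pick a speech threshold from RMS readings of background noise.

    The threshold is the average ambient level scaled up by a safety
    margin, and never lower than a fixed floor.
    """
    if not ambient_rms:
        return floor
    average = sum(ambient_rms) / float(len(ambient_rms))
    return max(floor, average * margin)


# Made-up RMS readings from about one second of a quiet room
ambient = [120, 180, 150, 210, 140]
threshold = calibrate_threshold(ambient)
print("ambient average: %.1f" % (sum(ambient) / float(len(ambient))))
print("speech threshold: %.1f" % threshold)
```

A noisier room produces larger readings and therefore a higher threshold, which is the behavior you want on that street corner.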
There’s one last example in this book, and it’s using speech recognition to control peripheral devices, in this case your SPI screen. Connect the SPI screen to the Arduino breakout board now. You’ll use the SpeechRecognition library for parsing speech and the ILI9341 library to drive the SPI screen. The script will take in your speech, parse it, and then use it to drive the color of the screen.
I’ve posted a quick example of this system to GitHub. In order for this example to work as is, you have to download it into the EdisonILI9341 directory or wherever your ILI9341.py file resides if you moved it. Change into that directory now, then pull the script using wget:
# wget https://raw.githubusercontent.com/smoyerman/VoiceControlledScreen/master/ScreenControl.py
Open this file in a text editor. It should look fairly similar to the previous example, but expanded:
import speech_recognition as sr
import ILI9341

# Construct screen and speech recognizer
disp = ILI9341.ILI9341()
disp.begin()
r = sr.Recognizer()

# Hard-coded colors
red = (255, 0, 0)
green = (0, 255, 0)
blue = (0, 0, 255)
white = (255, 255, 255)
black = (0, 0, 0)
purple = (100, 0, 100)

# Loop and listen 10 times
for i in range(10):

    # Listen for audio each time
    print("Start Listening...")
    with sr.Microphone() as source:
        # Only check for ambient noise the first time
        if i == 0:
            r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    print("Done Listening...")

    try:
        # Parse text
        speechString = r.recognize(audio)
        print("You said " + speechString)
        speechArray = speechString.split()

        # Check for colors
        if "red" in speechArray:
            color = red
        elif "blue" in speechArray:
            color = blue
        elif "green" in speechArray:
            color = green
        elif "purple" in speechArray:
            color = purple
        elif "white" in speechArray:
            color = white
        elif "black" in speechArray:
            color = black

        # Display to screen
        disp.clear(color)
        disp.display()

    # Check for error
    except LookupError:
        print("Could not understand audio")
There are several new components to this script:
At the top, the script imports ILI9341.
It declares the screen object and initializes the screen.
After initializing the speech recognition object, colors are hardcoded as different RGB tuples.
The speech recognition is then placed in a loop that will run 10 times. Each time, we process audio the same way as in the previous example.
When the processing is finished, we store the output as the variable speechString.
The speechString variable is then split into an array, so “An expression like this” becomes ["An", "expression", "like", "this"]. This is to keep words like “bluebird” from triggering the keyword “blue”. By splitting it into an array, the script is forcing an exact match. The if and elif statements then perform the color keyword searching.
Finally, disp.clear(color) sets the color, and disp.display() renders it to the screen.
Run the code, and you’ll see the screen change with your voice commands:
# python ScreenControl.py
Start Listening...
Done Listening...
You said I see a red door    <-- screen turns red
Start Listening...
Done Listening...
You said and I want to paint it black    <-- screen turns black
Start Listening...
The code checks the keywords in order, like going down a queue: it only checks for the next keyword if none of the conditions before it have been met. This is the nature of elif. Therefore, the expressions “red is black” and “black is red” will both do the same thing: render the screen red.
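The matching rules can be tried out without the screen or the microphone. The sketch below mirrors the order of the if and elif chain in a small lookup table; the names COLORS and pick_color are assumptions made for illustration, and returning None when no color word is present is a choice of this sketch rather than something the original script does.

```python
# Checked in the same order as the if/elif chain in ScreenControl.py
COLORS = [
    ("red", (255, 0, 0)),
    ("blue", (0, 0, 255)),
    ("green", (0, 255, 0)),
    ("purple", (100, 0, 100)),
    ("white", (255, 255, 255)),
    ("black", (0, 0, 0)),
]


def pick_color(speech_string):
    """Return the RGB tuple of the first listed color that appears as a
    whole word in the phrase, or None if no color word is present."""
    words = speech_string.split()
    for name, rgb in COLORS:
        if name in words:
            return rgb
    return None


for phrase in ("red is black", "black is red", "I saw a bluebird"):
    print("%-20s -> %s" % (phrase, pick_color(phrase)))
```

Both orderings of “red” and “black” resolve to red because red comes first in the table, and “bluebird” matches nothing because the comparison is against whole words.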
There are two main areas of study for going further in this chapter: digital signal processing, which encompasses audio signals, and natural language processing, which covers topics like automatic speech recognition and speech to text:
A free wikibook starting with the basics of digital signals and moving up into some very complicated territory. It contains a lot of hands-on exercises in Matlab and its free counterpart, Octave. Many other free online resources such as this one exist; do a quick Google search to find which works best for you.
A hands-on text with a lot of examples. Note that this book can get a little heavy on theory and upper-level mathematics.