
5. Experimenting with inputs


In the previous chapters, we looked into how to use machine learning with images and text data to do object detection and classification, as well as sentiment analysis, toxicity classification and question answering.

These are probably the most common examples of what machine learning can do. However, this list is not exhaustive and many more inputs can be used.

In this chapter, we’re going to explore different kinds of input data and build a few experimental projects to understand how to use machine learning with audio and hardware data, as well as using models focused on body and movement recognition.

5.1 Audio data

When you first read the words “audio data,” you might think that this section of the book is going to focus on music; however, I am going to dive into using sound more generally.

We don't often think about it, but many things around us produce sounds that give us contextual information about our environment.

For example, the sound of thunder tells you the weather is probably bad without you having to look out the window, you can recognize the sound of a plane passing by before you even see it, and hearing the sound of waves indicates you are probably close to the ocean.

Without us realizing it, recognizing and understanding the meaning of these sounds impacts our daily lives and our actions. Hearing a knock on your door indicates someone is probably behind it, waiting for you to open it, and hearing water boiling while you are cooking suggests it is ready for you to pour something in.

Using sound data and machine learning could help us leverage the rich properties of sounds to recognize certain human activities and enhance current smart systems such as Siri, Alexa, and so on.

This is what is called acoustic activity recognition.

Considering a lot of the devices we surround ourselves with possess a microphone, there are a lot of opportunities for this technology.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig1_HTML.jpg
Figure 5-1

Illustration of personal devices that possess a microphone

So far, the smart systems some of us may be using recognize words to trigger commands, but they have no understanding of what is going on around them; your phone does not know you are in the bathroom, your Alexa device does not know you might be in the kitchen, and so on. However, they could, and this contextual awareness could be used to create more tailored and useful digital experiences.

Before we dive into the practical part of this chapter and see how to build such systems in JavaScript using TensorFlow.js, it is helpful to start by understanding the basics of what sound is, and how it is translated to data we can use in code.

5.1.1 What is sound?

Sound is the vibration of air molecules.

If you have ever turned the volume of speakers really loud, you might have noticed that they end up moving back and forth with the music. This movement pushes on air particles, changing the air pressure and creating sound waves.

The same phenomenon happens with speech. When you speak, your vocal cords vibrate, disturbing air molecules around and changing the air pressure, creating sound waves.

A way to illustrate this phenomenon is with the following image.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig2_HTML.jpg

When you hit a tuning fork, it will start vibrating. This back and forth movement will change the surrounding air pressure. The movement forward will create a higher pressure and the movement backward will create a region of lower pressure. The repetition of this movement will create waves.

On the receiver side, our eardrums vibrate with the changes of pressure and this vibration is then transformed into an electrical signal sent to the brain.

So, if sound is a change in air pressure, how do we transform a sound wave into data we can use with our devices?

To be able to interpret sound data, our devices use microphones.

There are different types of microphones, but in general, they have a diaphragm or membrane that vibrates when exposed to the changes in air pressure caused by sound waves.

These vibrations move a magnet near a coil inside the microphone that generates a small electrical current. Your computer then converts this signal into numbers that represent both volume and frequency.

5.1.2 Accessing audio data

In JavaScript, the Web API that lets developers access data coming from the computer’s microphone is the Web Audio API.

If you have never used this API before, it’s totally fine; we are going to go through the main lines of code you need to get everything set up.

To start, we need to access the AudioContext interface on the global window object, as well as making sure we can get permission to access an audio and video input device with getUserMedia.
window.AudioContext = window.AudioContext || window.webkitAudioContext;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;
Listing 5-1

Setup to use the Web Audio API in JavaScript

This code sample takes into consideration cross-browser compatibility.

Then, to start listening to input coming from the microphone, we need to wait for a user action on the page, for example, a click.

Once the user has interacted with the web page, we can instantiate an audio context, allow access to the computer’s audio input device, and use some of the Web Audio API built-in methods to create a source and an analyzer and connect the two together to start getting some data.
document.body.onclick = async () => {
  const audioctx = new window.AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioctx.createMediaStreamSource(stream);
  analyser = audioctx.createAnalyser();
  analyser.smoothingTimeConstant = 0;
  source.connect(analyser);
  analyser.fftSize = 1024;
  getAudioData();
};
Listing 5-2

JavaScript code sample to set up the audio context on click

In the preceding code sample, we are using navigator.mediaDevices.getUserMedia to get access to the microphone. If you have ever built applications that were using audio or video input devices before, you might be familiar with writing navigator.getUserMedia(); however, this is deprecated and you should now be using navigator.mediaDevices.getUserMedia().

Writing it the old way will still work but is not recommended as it will probably not be supported in the next few years.
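If you want to guard against environments where the modern API is unavailable, a minimal check like the following can run before requesting the stream. This is only a defensive sketch (it mirrors the kind of guard used later in this chapter for the camera), not something required by the Web Audio API itself.
if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
  // Without this API, the rest of the samples in this section cannot run.
  throw new Error("navigator.mediaDevices.getUserMedia is not available in this browser");
}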

Once the basic setup is done, the getAudioData function filters the raw data coming from the device to only get the frequency data.
const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
Listing 5-3

Function to filter through the raw data to get the frequency data we will use

We also call requestAnimationFrame to continuously call this function and update the data we are logging with live data.

Altogether, you can access live data from the microphone in less than 25 lines of JavaScript!
window.AudioContext = window.AudioContext || window.webkitAudioContext;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;
let analyser;
document.body.onclick = async () => {
  const audioctx = new window.AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioctx.createMediaStreamSource(stream);
  analyser = audioctx.createAnalyser();
  analyser.smoothingTimeConstant = 0;
  source.connect(analyser);
  analyser.fftSize = 1024;
  getAudioData();
};
const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
Listing 5-4

Complete code sample to get input data from the microphone in JavaScript

The output from this code is an array of raw data we are logging in the browser’s console.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig3_HTML.jpg
Figure 5-3

Screenshot of the data returned by the preceding code sample

These arrays represent the frequencies that make up the sounds recorded by the computer’s microphone. The default sample rate is typically 44,100Hz, which means the microphone signal is sampled 44,100 times per second.
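If you want to verify these numbers on your own machine, you can log a few properties of the audio context and analyser. This is a quick sketch that assumes the audioctx and analyser variables from Listing 5-4 are accessible where you run it.
console.log(audioctx.sampleRate); // commonly 44100 (Hz), but hardware dependent
console.log(analyser.fftSize); // 1024, as set earlier
console.log(analyser.frequencyBinCount); // fftSize / 2 = 512 values per array
// Each value in the array covers a band of sampleRate / fftSize Hz, roughly 43Hz here.
console.log(audioctx.sampleRate / analyser.fftSize);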

In the format shown earlier (arrays of integers), finding patterns to recognize some type of activity seems pretty difficult. We wouldn’t really be able to identify the difference between speaking, laughing, music playing, and so on.

To help make sense of this raw frequency data, we can turn it into visualizations.

5.1.3 Visualizing audio data

There are different ways to visualize sound. A couple of ways you might be familiar with are waveforms or frequency charts.

Waveform visualizers represent the displacement of sound waves over time.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig4_HTML.jpg
Figure 5-4

Illustration of a waveform visualization. Source: https://css-tricks.com/making-an-audio-waveform-visualizer-with-vanilla-javascript/

On the x axis (the horizontal one) is time, and on the y axis (the vertical one) is the amplitude of the signal. Sound happens over a certain period of time and is made up of multiple frequencies.

This way of visualizing sound is a bit too minimal to be able to identify patterns. As you can see in the illustration earlier, all frequencies that make up a sound are reduced to a single line.

Frequency charts are visualizations that represent a measure of how many times a waveform repeats in a given amount of time.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig5_HTML.jpg
Figure 5-5

Illustration of a frequency chart visualization

You might be familiar with this type of audio visualization, as it is probably the most common one.

This way of visualizing can give you some insight into the beat, as it represents repetitions, or into how loud the sound is, as the y axis shows the volume, but that’s about it.

This visualization does not give us enough information to be able to recognize and classify sounds we are visualizing.

Another type of visualization that is much more helpful is called a spectrogram.

A spectrogram is like a picture of a sound. It shows the frequencies that make up the sound from low to high and how they change over time. It is a visual representation of the spectrum of frequencies of a signal, a bit like a heat map of sound.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig6_HTML.jpg
Figure 5-6

Illustration of a spectrogram

On the y axis is the spectrum of frequencies and, on the x axis, the amount of time. The axes seem similar to the two other types of visualizations we mentioned previously, but instead of representing all frequencies in a single line, we represent the whole spectrum.

In a spectrogram, a third dimension is also encoded: the amplitude. The amplitude of a sound can be described as its volume; the brighter the color, the louder the sound.

Visualizing sounds as spectrograms is much more helpful in finding patterns that would help us recognize and classify sounds.

For example, next is a screenshot of the output of a spectrogram running while I am speaking.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig7_HTML.jpg
Figure 5-7

Illustration of a spectrogram taken while speaking

By itself, this might not help you understand why spectrograms are more helpful visualizations. The following is another screenshot of a spectrogram taken while I was clapping my hands three times.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig8_HTML.jpg
Figure 5-8

Illustration of a spectrogram taken while clapping my hands three times

Hopefully, it starts to make more sense! If you compare both spectrograms, you can clearly distinguish between the two activities: speaking and clapping my hands.

If you wanted, you could try to visualize more sounds like coughing, your phone ringing, toilets flushing, and so on.

Overall, the main takeaway is that spectrograms help us see the signature of various sounds more clearly and distinguish between different activities.

If we can make this differentiation by looking at a screenshot of a spectrogram, we can hope that using this data with a machine learning algorithm will also work for finding patterns and classifying these sounds to build an activity classifier.

A broader example of using spectrograms for activity classification comes from a research paper published by Carnegie Mellon University in the United States. In their paper titled “Ubicoustics: Plug-and-Play Acoustic Activity Recognition,” they created spectrograms for various activities, from using a chainsaw to a vehicle driving nearby.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig9_HTML.jpg
Figure 5-9

Spectrograms collected from the research by the Carnegie Mellon University. Source: http://www.gierad.com/projects/ubicoustics/

So, before we dive into using sound with machine learning, let’s go through how we can turn the live microphone data we logged in the console using the Web Audio API into a spectrogram.

Creating a spectrogram

In the code sample we wrote earlier, we created a getAudioData function that was getting the frequency data from the raw data and was logging it to the browser’s console.
const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
Listing 5-5

getAudioData function to get frequency data from raw data

Where we wrote our console.log statement, we are going to add the code to create the visualization.

To do this, we are going to use the Canvas API, so we need to start by adding a canvas element to our HTML file like so.
<canvas id="canvas"></canvas>
Listing 5-6

Adding a canvas element to the HTML file

In our JavaScript, we are going to be able to access this element and use some methods from the Canvas API to draw our visualization.
const canvas = document.getElementById("canvas");
const ctx = canvas.getContext("2d");
Listing 5-7

Getting the canvas element and context in JavaScript

The main concept of this visualization is to draw the spectrum of frequencies as they vary with time, so we need to get the current canvas and redraw over it every time we get new live data.
const imagedata = ctx.getImageData(1, 0, canvas.width - 1, canvas.height);
ctx.putImageData(imagedata, 0, 0);
Listing 5-8

Getting the image data from the canvas element and redrawing over it

Then, we need to loop through the frequency data we get from the Web Audio API and draw them onto the canvas.
for (let i = 0; i < freqdata.length; i++) {
  const value = (2 * freqdata[i]) / 255;
  ctx.beginPath();
  ctx.strokeStyle = `rgba(${Math.max(0, 255 * value)}, ${Math.max(
    0,
    255 * (value - 1)
  )}, 54, 255)`;
  ctx.moveTo(
    canvas.width - 1,
    canvas.height - i * (canvas.height / freqdata.length)
  );
  ctx.lineTo(
    canvas.width - 1,
    canvas.height -
      (i * (canvas.height / freqdata.length) +
        canvas.height / freqdata.length)
  );
  ctx.stroke();
}
Listing 5-9

Looping through frequency data and drawing it onto the canvas

Inside this for loop, we use the beginPath method to indicate that we are going to start drawing something onto the canvas.

Then, we call strokeStyle and pass it a dynamic value that will represent the colors used to display the amplitude of the sound.

After that, we call moveTo to position the drawing cursor at the far right of the canvas, where the newest column of frequency data is drawn with lineTo; the getImageData and putImageData calls above are what shift the existing visualization 1 pixel to the left to make room for it.

Finally, we call the stroke method to draw the line.

Altogether, our getAudioData function should look something like this.
const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  const imagedata = ctx.getImageData(1, 0, canvas.width - 1, canvas.height);
  ctx.putImageData(imagedata, 0, 0);
  for (let i = 0; i < freqdata.length; i++) {
    const value = (2 * freqdata[i]) / 255;
    ctx.beginPath();
    ctx.strokeStyle = `rgba(${Math.max(0, 255 * value)}, ${Math.max(
      0,
      255 * (value - 1)
    )}, 54, 255)`;
    ctx.moveTo(
      canvas.width - 1,
      canvas.height - i * (canvas.height / freqdata.length)
    );
    ctx.lineTo(
      canvas.width - 1,
      canvas.height -
        (i * (canvas.height / freqdata.length) +
          canvas.height / freqdata.length)
    );
    ctx.stroke();
  }
  requestAnimationFrame(getAudioData);
};
Listing 5-10

Full getAudioData function

You might be wondering why it is important to understand how to create spectrograms. The main reason is that it is what is used as training data for the machine learning algorithm.

Instead of using the raw data the way we logged it in the browser’s console, we use the generated spectrogram images, transforming a sound problem into an image one.

Image recognition and classification have advanced significantly over the past few years, and algorithms working with image data have proven to perform very well.

Also, turning sound data into an image means we can work with a smaller amount of data to train a model, which results in shorter training times.

Indeed, the default sample rate of the Web Audio API is typically 44.1kHz, which means it collects 44,100 samples of data per second.

If we record 2 seconds of audio, that is 88,200 data points for a single sample.

You can imagine that as we need to record a lot more samples, it would end up being a very large amount of data fed to a machine learning algorithm, which would take a long time to train.

On the other hand, a spectrogram extracted as a picture can easily be resized to something much smaller, for example, a 28x28 pixel image, which represents only 784 data points for a 2-second audio clip.
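To make that difference concrete, here is a minimal sketch of how the spectrogram canvas from Listing 5-10 could be downscaled to 28x28 pixels and turned into 784 values. The 28x28 size and the single-channel reduction are illustrative choices, not part of the Teachable Machine workflow described next.
// Downscale the spectrogram canvas into a 28x28 canvas.
const smallCanvas = document.createElement("canvas");
smallCanvas.width = 28;
smallCanvas.height = 28;
const smallCtx = smallCanvas.getContext("2d");
smallCtx.drawImage(canvas, 0, 0, 28, 28);
// Extract the pixels and keep a single normalized value per pixel.
const { data } = smallCtx.getImageData(0, 0, 28, 28);
const grayscale = [];
for (let i = 0; i < data.length; i += 4) {
  grayscale.push(data[i] / 255); // red channel only, scaled to [0, 1]
}
console.log(grayscale.length); // 784 data points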

Now that we covered how to access live data from the microphone in JavaScript and how to transform it into a spectrogram visualization, allowing us to see how different sounds create visually different patterns, let’s look into how to train a machine learning model to create a classifier.

5.1.4 Training the classifier

Instead of creating a custom machine learning algorithm for this, we are going to use one of the Teachable Machine experiments dedicated to sound data. You can find it at https://teachablemachine.withgoogle.com/train/audio.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig10_HTML.jpg
Figure 5-10

Teachable Machine interface

This project allows us to record samples of sound data, label them, train a machine learning algorithm, test the output, and export the model all within a single interface and in the browser!

To start, we need to record some background noise for 20 seconds using the section highlighted in red in the following figure.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig11_HTML.jpg
Figure 5-11

Teachable Machine interface with background noise section highlighted

Then, we can start to record some samples for whatever sound we would like the model to recognize later on.

The minimum number of samples is 8, and each of them needs to be 2 seconds long.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig12_HTML.jpg
Figure 5-12

Teachable Machine interface with custom section highlighted

As this experiment uses transfer learning to quickly retrain a model that has already been trained with sound data, we need to work with the same format the original model was trained with.

Eight samples is the minimum but you can record more if you’d like. The more samples, the better. However, don’t forget that it will also impact the amount of time the training will take.

Once you have recorded your samples and labelled them, you can start the live training in the browser and make sure not to close the browser window.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig13_HTML.jpg
Figure 5-13

Teachable Machine interface – training the model

When this step is done, you should be able to see some live predictions in the last step of the experiment. Before you export the model, you can try to repeat the sounds you recorded to verify the accuracy of the predictions. If you don’t find it accurate enough, you can record more samples and restart the training.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig14_HTML.jpg
Figure 5-14

Teachable Machine interface – running live predictions

If you are ready to move on, you can either upload your model to some Google servers and be provided with a link to it, or download the machine learning model that was created.

If you’d like to get a better understanding of how it works in the background, how the machine learning model is created, and so on, I’d recommend having a look at the source code available on GitHub!

Even though I really like interfaces like Teachable Machine as they allow anyone to get started and experiment quickly, looking at the source code can reveal some important details. For example, the next image is how I realized that this project was using transfer learning.

While going through the code to see how the machine learning model was created and how the training was done, I noticed the following sample of code.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig15_HTML.jpg
Figure 5-15

Sample from the open source GitHub repository of Teachable Machine

On line 793, we can see that the method addExample is called. This is the same method we used in the chapter of this book dedicated to image recognition when we used transfer learning to train an image classification model quickly with new input images.

Noticing these details is important if you decide to experiment with re-creating this model on your own, without going through the Teachable Machine interface.

Now that we went through the training process, we can write the code to generate the predictions.

5.1.5 Predictions

Before we can start writing this code, we need to import TensorFlow.js and the speech commands model.
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/[email protected]/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/[email protected]/dist/speech-commands.min.js"></script>
Listing 5-11

Import TensorFlow.js and the speech commands model in an HTML file

As I mentioned earlier, this experiment uses transfer learning, so we need to import the speech commands model that has already been trained with audio data to make it simpler and faster to get started.

The speech commands model was originally trained to recognize and classify spoken words, like “yes”, “no”, “up”, and “down”. However, here, we are using it with sounds produced by activities, so it might not be as accurate as if we were using spoken words in our samples.

Before going through the rest of the code samples, make sure you have downloaded your trained model from the Teachable Machine platform, unzipped it, and added it to your application folder.

The following code samples will assume that your model is stored in a folder called activities-model at the root of your application.

Overall, your file structure should look something like this:
  • activities-model/
    • metadata.json

    • model.json

    • weights.bin

  • index.html

  • index.js

In our JavaScript file, we will need to create a function to load our model and start the live predictions, but first, we can create two variables to hold the paths to our model and metadata files.
let URL = "http://localhost:8000/activities-model/";
const modelURL = `${URL}model.json`;
const metadataURL = `${URL}metadata.json`;
Listing 5-12

Variables to refer to the model and its metadata

You may have noticed that I used localhost:8000 in the preceding code; however, feel free to change the port and make sure to update this if you decide to release your application to production.

Then, we need to load the model and ensure it is loaded before we continue.
const model = window.speechCommands.create(
    "BROWSER_FFT",
    undefined,
    modelURL,
    metadataURL
);
await model.ensureModelLoaded();
Listing 5-13

Loading the model

Once the model is ready, we can run live predictions by calling the listen method on the model.
model.listen(
    (prediction) => {
      predictionCallback(prediction.scores);
    },
    modelParameters
  );
Listing 5-14

Live predictions

Altogether, the setupModel function should look like this.
async function setupModel(URL, predictionCB) {
  predictionCallback = predictionCB;
  const modelURL = `${URL}model.json`;
  const metadataURL = `${URL}metadata.json`;
  model = window.speechCommands.create(
    "BROWSER_FFT",
    undefined,
    modelURL,
    metadataURL
  );
  await model.ensureModelLoaded();
  const modelParameters = {
    invokeCallbackOnNoiseAndUnknown: true, // run even when only background noise is detected
    includeSpectrogram: true,
    overlapFactor: 0.5, // how often per second to sample audio, 0.5 means twice per second
  };
  model.listen(
    (prediction) => {
      predictionCallback(prediction.scores);
    },
    modelParameters
  );
}
Listing 5-15

Full code sample

When called, this function passes the prediction data to the callback, which is invoked every time the model produces a prediction.
document.body.onclick = () => {
  setupModel(URL, (data) => {
     console.log(data)
  });
}
Listing 5-16

Calling the function

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig16_HTML.jpg
Figure 5-16

Example of data returned when calling the function

The array containing the results of the prediction is ordered by the labels used during training. In the previous example, I had trained the model with six different labels, so each array returned contains six values.

In each array, the value closest to 1 represents the predicted label.

To match the predicted data with the correct label, we can create an array containing the labels we used for training and use it when calling the setupModel function.
const labels = [
  "Coughing",
  "Phone ringing",
  "Speaking",
  "_background_noise_",
];
let currentPrediction;
document.body.onclick = () => {
  setupModel(URL, (data) => {
    let maximum = Math.max(...data);
    if (maximum > 0.7) {
      let maxIndex = data.indexOf(maximum);
      currentPrediction = labels[maxIndex];
      console.log(currentPrediction);
    }
  });
}
Listing 5-17

Mapping scores to labels

In less than 100 lines of JavaScript, we are able to load and run a machine learning model that can classify live audio input!
let model, predictionCallback;
let URL = "http://localhost:8000/activities-model/";
const labels = [
  "Coughing",
  "Phone ringing",
  "Speaking",
  "_background_noise_",
];
let currentPrediction;
document.body.onclick = () => {
  setupModel(URL, (data) => {
    let maximum = Math.max(...data);
    if (maximum > 0.7) {
      let maxIndex = data.indexOf(maximum);
      currentPrediction = labels[maxIndex];
      console.log(currentPrediction);
    }
  });
};
async function setupModel(URL, predictionCB) {
  predictionCallback = predictionCB;
  const modelURL = `${URL}model.json`;
  const metadataURL = `${URL}metadata.json`;
  model = window.speechCommands.create(
    "BROWSER_FFT",
    undefined,
    modelURL,
    metadataURL
  );
  await model.ensureModelLoaded();
  // This tells the model how to run when listening for audio
  const modelParameters = {
    invokeCallbackOnNoiseAndUnknown: true, // run even when only background noise is detected
    includeSpectrogram: true, // give us access to numerical audio data
    overlapFactor: 0.5, // how often per second to sample audio, 0.5 means twice per second
  };
  model.listen(
    (prediction) => {
      predictionCallback(prediction.scores);
    },
    modelParameters
  );
}
Listing 5-18

Full code sample

5.1.6 Transfer learning API

In the previous section, we covered how to record sound samples and train the model using the Teachable Machine experiment, for simplicity. However, if you are looking to implement this in your own application and let users run this same process themselves, you can use the transfer learning API.

This API lets you build your own interface and call API endpoints to record samples, train the model, and run live predictions.

Recording samples

Let’s imagine a very simple web interface with a few buttons.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig17_HTML.jpg
Figure 5-17

Web interface with a few button elements

Some of these buttons are used to collect sample data, one button to start the training and the last one to trigger the live predictions.

To get started, we need an HTML file with these six buttons and two script tags to import TensorFlow.js and the Speech Commands model.
<html lang="en">
  <head>
    <title>Speech recognition</title>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/[email protected]/dist/tf.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/[email protected]/dist/speech-commands.min.js"></script>
  </head>
  <body>
    <section>
        <button id="red">Red</button>
        <button id="blue">Blue</button>
        <button id="green">Green</button>
        <button id="background">Background</button>
        <button id="train">Train</button>
        <button id="predict">Predict</button>
    </section>
    <script src="index.js"></script>
  </body>
</html>
Listing 5-19

HTML file

In the JavaScript file, before being able to run these actions, we need to create the base model, ensure it is loaded, and call createTransfer with a name to create a transfer recognizer that will hold our audio samples.
const init = async () => {
  const baseRecognizer = speechCommands.create("BROWSER_FFT");
  await baseRecognizer.ensureModelLoaded();
  transferRecognizer = baseRecognizer.createTransfer("colors");
};
Listing 5-20

Set up the recognizers

Then, we can add event listeners on our buttons so they will collect samples on click. For this, we need to call the collectExample method on our recognizer and pass it a string we would like the sample to be labelled with.
const redButton = document.getElementById("red");
redButton.onclick = async () => await transferRecognizer.collectExample("red");
Listing 5-21

Collecting samples
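The other color buttons from the HTML in Listing 5-19 would presumably be wired up the same way, each passing its own label to collectExample.
const blueButton = document.getElementById("blue");
const greenButton = document.getElementById("green");
blueButton.onclick = async () =>
  await transferRecognizer.collectExample("blue");
greenButton.onclick = async () =>
  await transferRecognizer.collectExample("green");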

To start the training, we call the train method on the recognizer.
const trainButton = document.getElementById("train");
trainButton.onclick = async () => {
  await transferRecognizer.train({
    epochs: 25,
    callback: {
      onEpochEnd: async (epoch, logs) => {
        console.log(`Epoch ${epoch}: loss=${logs.loss}, accuracy=${logs.acc}`);
      },
    },
  });
};
Listing 5-22

Training

And finally, to classify live audio inputs after training, we call the listen method.
const predictButton = document.getElementById("predict");
predictButton.onclick = async () => {
  await transferRecognizer.listen(
    (result) => {
      const words = transferRecognizer.wordLabels();
      for (let i = 0; i < words.length; ++i) {
        console.log(`score for word '${words[i]}' = ${result.scores[i]}`);
      }
    },
    { probabilityThreshold: 0.75 }
  );
};
Listing 5-23

Predict

Altogether, this code sample looks like the following.
let transferRecognizer;
const init = async () => {
  const baseRecognizer = speechCommands.create("BROWSER_FFT");
  await baseRecognizer.ensureModelLoaded();
  transferRecognizer = baseRecognizer.createTransfer("colors");
};
init();
const redButton = document.getElementById("red");
const backgroundButton = document.getElementById("background");
const trainButton = document.getElementById("train");
const predictButton = document.getElementById("predict");
redButton.onclick = async () => await transferRecognizer.collectExample("red");
backgroundButton.onclick = async () =>
  await transferRecognizer.collectExample("_background_noise_");
trainButton.onclick = async () => {
  await transferRecognizer.train({
    epochs: 25,
    callback: {
      onEpochEnd: async (epoch, logs) => {
        console.log(`Epoch ${epoch}: loss=${logs.loss}, accuracy=${logs.acc}`);
      },
    },
  });
};
predictButton.onclick = async () => {
  await transferRecognizer.listen(
    (result) => {
      const words = transferRecognizer.wordLabels();
      for (let i = 0; i < words.length; ++i) {
        console.log(`score for word '${words[i]}' = ${result.scores[i]}`);
      }
    },
    { probabilityThreshold: 0.75 }
  );
};
Listing 5-24

Full code sample

5.1.7 Applications

Even though the examples I have used so far for our code samples (speaking and coughing) might have seemed simple, the way this technology is currently being used shows how interesting it can be.

Health

In July 2020, Apple announced the release of a new version of watchOS that included an application triggering a countdown when the user washes their hands. In line with advice from public health officials on limiting the spread of COVID-19, this application uses the watch’s microphone to detect the sound of running water and trigger a 20-second countdown.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig18_HTML.jpg
Figure 5-18

Countdown triggered when the user washes their hands

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig19_HTML.jpg
Figure 5-19

Interface of the countdown on the Apple Watch

From the code samples shown in the last few pages, a similar application can be built using JavaScript and TensorFlow.js.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig20_HTML.jpg
Figure 5-20

Prototype of similar countdown interface using TensorFlow.js
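As a rough illustration, here is a hedged sketch of how the prediction callback from Listing 5-17 could drive such a countdown. It assumes a model trained with a "Running water" label and reuses the setupModel function, labels array, and 0.7 threshold from the earlier samples; the label name and the 20-second duration are only examples, not part of Apple's implementation.
let countdownTimer = null;
let secondsLeft = 20;
document.body.onclick = () => {
  setupModel(URL, (data) => {
    const maximum = Math.max(...data);
    const predictedLabel = labels[data.indexOf(maximum)];
    // Start the countdown once when running water is detected with enough confidence.
    if (maximum > 0.7 && predictedLabel === "Running water" && !countdownTimer) {
      secondsLeft = 20;
      countdownTimer = setInterval(() => {
        secondsLeft -= 1;
        console.log(`${secondsLeft} seconds of handwashing left`);
        if (secondsLeft === 0) {
          clearInterval(countdownTimer);
          countdownTimer = null;
          console.log("Done, you can stop washing your hands!");
        }
      }, 1000);
    }
  });
};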

Biodiversity research and protection

One of my favorite applications for this technology is in biodiversity research and protection of endangered species.

A really good example of this is the Rainforest Connection collective.

This collective uses old cell phones and their built-in microphones to detect the sound of chainsaws in the forest and alert rangers to potential illegal deforestation activity.

Using solar panels and attaching the installation to trees, they can constantly monitor what is going on around them and run live predictions.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig21_HTML.jpg
Figure 5-21

Example of installation made of solar panels and used mobile phones. Source: https://www.facebook.com/RainforestCx/

If this is a project that interests you, they also have a mobile application called Rainforest Connection, in which you can listen to the sound of nature, live from the forest, if you would like to check it out!

Another use of this technology is in protecting killer whales. A collaboration between Google, Rainforest Connection, and Fisheries and Oceans Canada (DFO) uses bioacoustics monitoring to track, monitor, and observe the animal’s behavior in the Salish Sea.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig22_HTML.jpg

Web Accessibility

Another application you might not have noticed is currently implemented in a service you probably know. Indeed, if you are using YouTube, you may have come across live ambient sound captioning.

If you have ever activated captions on a YouTube video, you may know of spoken words being displayed as an overlay at the bottom.

However, there is more information in a video than what can be found in the transcript.

Indeed, people without hearing impairment benefit from having access to additional information in the form of contextual sounds like music playing or the sound of rain in a video.

Only displaying spoken words in captions can cut quite a lot of information out for people with hearing impairment.

About 3 years ago, in 2017, YouTube released live ambient sound captioning that uses acoustic recognition to add to the captions details about ambient sounds detected in the soundtrack of a video.

Here is an example.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig23_HTML.jpg
Figure 5-23

Example of live ambient sound captioning on YouTube

The preceding screenshot is taken from an interview between Janelle Monae and Pharrell Williams where the captions are activated.

Spoken words are displayed as expected, but we can also see ambient sounds like [Applause].

People with hearing impairment can now have the opportunity to get more information about the video than only dialogues.

At the moment, the ambient sounds that can be detected on YouTube videos include
  • Applause

  • Music playing

  • Laughter

It might not seem like much, but again, this is something we take for granted if we never have to think about the experience some people with disabilities have on these platforms.

Besides, the fact that this feature was implemented about 3 years ago already shows that a major technology company like Google has been actively exploring the potential of using machine learning with audio data and working on finding useful applications.

5.1.8 Limits

Now that we covered how to experiment with acoustic activity recognition in JavaScript and a few different applications, it is important to be aware of some of the limitations of such technology to have a better understanding of the real opportunities.

Quality and quantity of the data

If you decide to build a similar acoustic activity recognition system from scratch and write your own model without using transfer learning and the speech commands model from TensorFlow.js, you are going to need to collect a lot more sound samples than the minimum of 8 required when using Teachable Machine.

To gather a large number of samples, you can either record them yourself or buy them from a professional audio library.

Another important point is to make sure to check the quality of the data recorded. If you want to detect the sound of a vacuum cleaner running, for example, make sure that there is no background noise and that the vacuum cleaner can be clearly heard in the audio track.

One tip to generate samples of data from a single one is to use an audio editing software to change some parameters of a single audio source to create multiple versions of it. You can, for example, modify the reverb, the pitch, and so on.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig24_HTML.jpg
Figure 5-24

Transforming sounds. Gierad Laput, Karan Ahuja, Mayank Goel, and Chris Harrison. 2018. Ubicoustics: Plug-and-Play Acoustic Activity Recognition. In The 31st Annual ACM Symposium on User Interface Software and Technology (UIST '18). ACM, New York, NY, USA, 213-224. DOI: https://doi.org/10.1145/3242587.3242609
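If you prefer staying in JavaScript rather than using dedicated audio editing software, a similar effect can be approximated with the Web Audio API itself. The sketch below renders a copy of an AudioBuffer at a different playback rate through an OfflineAudioContext, which shifts the pitch (and also the duration); it is only a quick way to augment a small dataset, not a substitute for proper audio tooling.
// Render a pitch-shifted variant of an existing AudioBuffer.
// rate > 1 raises the pitch and shortens the clip; rate < 1 does the opposite.
async function createPitchVariant(buffer, rate = 1.1) {
  const offlineCtx = new OfflineAudioContext(
    buffer.numberOfChannels,
    Math.ceil(buffer.length / rate),
    buffer.sampleRate
  );
  const source = offlineCtx.createBufferSource();
  source.buffer = buffer;
  source.playbackRate.value = rate;
  source.connect(offlineCtx.destination);
  source.start();
  return offlineCtx.startRendering(); // resolves with a new AudioBuffer
}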

Single activity

At the moment, this technology seems to be effective at recognizing only a single sound at a time.

For example, suppose you trained your model to recognize both the sound of someone speaking and the sound of running water. If you placed your system in the kitchen and the user was speaking while washing the dishes, the activity predicted would only be the one with the highest score in the predictions returned.

However, as the system runs continuously, it would probably get confused between the two activities, alternating between “speaking” and “running water” until one of the activities stopped.

This would definitely become a problem if you built an application that can detect sounds produced by activities that can be executed at the same time.

For example, let’s imagine you usually play music while taking a shower and you built an application that can detect two activities: the sound of the shower running and speaking.

You want to be able to trigger a counter whenever it detects that the shower is running so you can avoid taking long showers and save water.

You also want to be able to lower the sound of your speakers when it detects that someone is speaking in the bathroom.

As these two activities can happen at the same time (you can speak while taking a shower), the system could get confused between the two activities and detect the shower running for a second and someone speaking the next.

As a result, it would start and stop the speakers one second, and start/stop the counter the next. This would definitely not create an ideal experience.

However, this does not mean that there is no potential in building applications using acoustic activity recognition; it only means that we would need to work around this limitation.

Besides, some research is being done around developing systems that can handle the detection of multiple activities at once. We will look into it in the next few pages.

User experience

When it comes to user experience, there are always some challenges with new technologies like this one.

First of all, privacy.

Having devices listening to users always raises concerns about where the data is stored, how it is used, whether it is secure, and so on.

Considering that some companies releasing Internet of Things devices do not always put security first in their products, these concerns are very normal.

As a result, the adoption of these devices by consumers can be slower than expected.

Not only should privacy and security be baked into these systems, but this should also be communicated to users in a clear way to reassure them and give them a sense of empowerment over their data.

Secondly, another challenge is in teaching users new interactions.

For example, even though most modern phones now have voice assistants built-in, getting information from asking Siri or Google is not the primary interaction.

This could be for various reasons including privacy and limitations of the technology itself, but people also have habits that are difficult to change.

Besides, considering the current imperfect state of this technology, it is easy for users to give up after a few trials, when they do not get the response they were looking for.

A way to mitigate this would be to release small applications to analyze users’ reactions to them and adapt. The work Apple did by implementing the water detection in their new watchOS is an example of that.

Finally, one of the big challenges of creating a custom acoustic activity recognition system is in the collection of sample data and training by the users.

Even though you can build and release an application that detects the sound of a faucet running because there’s a high probability that it produces a similar sound in most homes, some other sounds are not so common.

As a result, empowering users to use this technology would involve letting them record their own samples and train the model so they can have the opportunity to have a customized application.

However, as machine learning algorithms need to be trained with a large amount of data to have a chance to produce accurate predictions, it would require a lot of effort from users and would inevitably not be successful.

Luckily, some researchers are experimenting with solutions to these problems.

Now, even though there are some limits to this technology, solutions are also starting to appear.

For example, in terms of protecting users’ privacy, an open source project called Project Alias by Bjørn Karmann attempts to empower voice assistant users.

This project is a DIY add-on made with a Raspberry Pi, a speaker, and a microphone module, all in a 3D-printed enclosure, that aims at blocking voice assistants like Amazon Alexa and Google Home from continuously listening to people.

Through a mobile application, users can train Alias to react to a custom wake word or sound. Once it is trained, Alias can take control over the home assistant and activate it for you. When you don’t use it, the add-on prevents the assistant from listening by emitting white noise into its microphone.

Since Alias’s neural network runs locally, the user’s privacy is protected.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig25_HTML.jpg
Figure 5-25

Project Alias. Source: https://bjoernkarmann.dk/project_alias

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig26_HTML.jpg
Figure 5-26

Project Alias components. Source: https://bjoernkarmann.dk/project_alias

Another project, called Synthetic Sensors, aims at creating a system that can accurately predict multiple sounds at once.

Developed by a team of researchers at Carnegie Mellon University, this project involves a custom-built piece of hardware made of multiple sensors, including an accelerometer, microphone, temperature sensor, motion sensor, and color sensor.

Using the raw data collected from these sensors, researchers created multiple stacked spectrograms and trained algorithms to detect patterns produced by multiple activities such as
  • Microwave door closed

  • Wood saw running

  • Kettle on

  • Faucet running

  • Toilet flushing

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig27_HTML.jpg
Figure 5-27

Project Synthetic Sensors hardware. Gierad Laput, Yang Zhang, and Chris Harrison. 2017. Synthetic Sensors: Towards General-Purpose Sensing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA, 3986-3999. DOI: https://doi.org/10.1145/3025453.3025773.

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig28_HTML.jpg
Figure 5-28

Project Synthetic Sensors example of spectrograms and activities recognition. Gierad Laput, Yang Zhang, and Chris Harrison. 2017. Synthetic Sensors: Towards General-Purpose Sensing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA, 3986-3999. DOI: https://doi.org/10.1145/3025453.3025773

Finally, in terms of user experience, a research project called Listen Learner aims at allowing users to collect data and train a model to recognize custom sounds, with minimal effort.

The full name of the project is Listen Learner, Automatic Class Discovery and One-Shot Interaction for Activity Recognition.

It aims at providing high classification accuracy, while minimizing user burden, by continuously listening to sounds in its environment, classifying them by cluster of similar sounds, and asking the user what the sound is after having collected enough similar samples.

The result of the study shows that this system can accurately and automatically learn acoustic events (e.g., 97% precision, 87% recall), while adhering to users’ preferences for nonintrusive interactive behavior.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig29_HTML.jpg
Figure 5-29

Wu, J., Harrison, C., Bigham, J. and Laput, G. 2020. Automated Class Discovery and One-Shot Interactions for Acoustic Activity Recognition. In Proceedings of the 38th Annual SIGCHI Conference on Human Factors in Computing Systems. CHI '20. ACM, New York, NY. Source: www.chrisharrison.net/index.php/Research/ListenLearner

5.2 Body and movement tracking

After looking at how to use machine learning with audio data, let’s look into another type of input, that is, body tracking.

In this section, we are going to use data from body movements via the webcam using three different TensorFlow.js models.

5.2.1 Facemesh

The first model we are going to experiment with is called Facemesh. It is a machine learning model focused on face recognition that predicts the position of 486 3D facial landmarks on a user’s face, returning points with their x, y, and z coordinates.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig30_HTML.jpg
Figure 5-30

Example of visualization of face tracking with Facemesh. source: https://github.com/tensorflow/tfjs-models/tree/master/facemesh

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig31_HTML.jpg

The main difference between this face recognition model and other face tracking JavaScript libraries like face-tracking.js is that the TensorFlow.js model intends to approximate the surface geometry of a human face and not only the 2D position of some key points.

This model provides coordinates in a 3D environment, which allows us to approximate the depth of facial features as well as track the position of key points even when the user rotates their face in three dimensions.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig32_HTML.jpg
Figure 5-32

Key points using the webcam and in a 3D visualization. Source: https://storage.googleapis.com/tfjs-models/demos/facemesh/index.html

Loading the model

To start using the model, we need to load it using the two following lines in your HTML file.
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src='https://cdn.jsdelivr.net/npm/@tensorflow-models/facemesh'></script>
Listing 5-25

Importing TensorFlow.js and Facemesh in an HTML file

As we are going to use the video feed from the webcam to detect faces, we also need to add a video element to our file.

Altogether, the very minimum HTML you need for this is as follows.
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Facemesh</title>
  </head>
  <body>
    <video></video>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/facemesh"></script>
    <script src="index.js"></script>
  </body>
</html>
Listing 5-26

Core HTML code needed

Then, in your JavaScript code, you need to load the model and the webcam feed using the following code.
let model;
let video;
const init = async () => {
  model = await facemesh.load();
  video = await loadVideo();
  main(); // This will be declared in the next code sample.
}
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
Listing 5-27

Load the model and set up the webcam feed

Predictions

Once the model and the video are ready, we can call our main function to find facial landmarks in the input stream.
async function main() {
  const predictions = await model.estimateFaces(
    document.querySelector("video")
  );
  if (predictions.length > 0) {
    console.log(predictions);
    for (let i = 0; i < predictions.length; i++) {
      const keypoints = predictions[i].scaledMesh;
      // Log facial keypoints.
      for (let i = 0; i < keypoints.length; i++) {
        const [x, y, z] = keypoints[i];
        console.log(`Keypoint ${i}: [${x}, ${y}, ${z}]`);
      }
    }
  }
}
Listing 5-28

Function to find face landmarks

The output of this code sample in the browser’s console returns the following.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig33_HTML.jpg
Figure 5-33

Output of the landmarks in the console

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig34_HTML.jpg
Figure 5-34

Output of loop statement in the console

As we can see in the preceding two screenshots, the predictions returned contain an important amount of information.

The annotations are organized by landmark area, in alphabetical order, each containing arrays of x, y, and z coordinates.

The bounding box contains two main keys, bottomRight and topLeft, to indicate the boundaries of the position of the detected face in the video stream. These two properties contain an array of only two coordinates, x and y, as the z axis is not useful in this case.

Finally, the mesh and scaledMesh properties contain all coordinates of the landmarks and are useful to render all points in 3D space on the screen.
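For example, a minimal sketch of rendering the scaledMesh points in 2D could look like the following, assuming a canvas element with the id "overlay" sized like the video (this element is not part of the HTML shown earlier); only the x and y values are used, and the z coordinate is ignored for this flat projection.
const overlay = document.getElementById("overlay");
const overlayCtx = overlay.getContext("2d");
const drawKeypoints = (predictions) => {
  overlayCtx.clearRect(0, 0, overlay.width, overlay.height);
  predictions.forEach((prediction) => {
    prediction.scaledMesh.forEach(([x, y]) => {
      overlayCtx.fillRect(x, y, 2, 2); // draw each landmark as a small square
    });
  });
};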

Full code sample

Altogether, the JavaScript code to set up the model, the video feed, and start predicting the position of landmarks should look like the following.
let video;
let model;
const init = async () => {
  video = await loadVideo();
  await tf.setBackend("webgl");
  model = await facemesh.load();
  main();
};
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
init();
async function main() {
  const predictions = await model.estimateFaces(
    document.querySelector("video")
  );
  if (predictions.length > 0) {
    console.log(predictions);
    for (let i = 0; i < predictions.length; i++) {
      const keypoints = predictions[i].scaledMesh;
      // Log facial keypoints.
      for (let i = 0; i < keypoints.length; i++) {
        const [x, y, z] = keypoints[i];
        console.log(`Keypoint ${i}: [${x}, ${y}, ${z}]`);
      }
    }
  }
  requestAnimationFrame(main);
}
Listing 5-29

Full JavaScript code sample

Project

To put this code sample into practice, let’s build a quick prototype to allow users to scroll down a page by tilting their head back and forth.

We are going to be able to reuse most of the code written previously and make some small modifications to trigger a scroll using some of the landmarks detected.

The specific landmark we are going to use to detect the movement of the head is the lipsLowerOuter and more precisely its z axis.

Looking at all the properties available in the annotations object, lipsLowerOuter is the one closest to the chin, so we can look at the predicted changes of the z coordinate for this area to determine whether the head is tilting backward (chin moving forward) or forward (chin moving backward).

To do this, in our main function, once we get predictions, we can add the following lines of code.
if (predictions[0].annotations.lipsLowerOuter) {
   let zAxis = predictions[0].annotations.lipsLowerOuter[9][2];
   if (zAxis > 5) {
     // Scroll down
     window.scrollTo({
       top: (scrollPosition += 10),
       left: 0,
       behavior: "smooth",
     });
   } else if (zAxis < -5) {
     // Scroll up
     window.scrollTo({
      top: (scrollPosition -= 10),
      left: 0,
      behavior: "smooth",
    });
   }
}
Listing 5-30

Triggering scroll when z axis changes

In this code sample, I declare a variable that I call zAxis to store the value of the z coordinate I want to track. To get this value, I look into the array of coordinates contained in the lipsLowerOuter property of the annotations object.

Based on the annotation objects returned, we can see that the lipsLowerOuter property contains 10 arrays of 3 values each.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig35_HTML.jpg
Figure 5-35

Annotations returned with lipsLowerOuter values

This is why the code sample shown just earlier was accessing the z coordinates using predictions[0].annotations.lipsLowerOuter[9][2].

I decided to access the last element ([9]) of the lipsLowerOuter property and its third value ([2]), the z coordinate of the section.

The value 5 was selected after trial and error and seeing what threshold would work for this particular project. It is not a standard value that you will need to use every time you use the Facemesh model. Instead, I decided it was the correct threshold for me to use after logging the variable zAxis and seeing its value change in the browser’s console as I was tilting my head back and forth.
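If you want to find a threshold that suits your own setup, the simplest approach is to temporarily add a log inside the main function, right after the predictions are returned, and watch how the value changes while you tilt your head.
// Temporary logging to pick a threshold that works for your setup.
const zAxis = predictions[0].annotations.lipsLowerOuter[9][2];
console.log("lipsLowerOuter z:", zAxis);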

Then, assuming that you declared scrollPosition earlier in the code and set it to a value (I personally set it to 0), a “scroll up” event will happen when you tilt your head backward and “scroll down” when you tilt your head forward.

Finally, I set the property behavior to “smooth” so we have some smooth scrolling happening, which, in my opinion, creates a better experience.

If you did not add any content to your HTML file, you won’t see anything happen yet though, so don’t forget to add enough text or images to be able to test that everything is working!

In less than 75 lines of JavaScript, we loaded a face recognition model, set up the video stream, ran predictions to get the 3D coordinates of facial landmarks, and wrote some logic to trigger a scroll up or down when tilting your head backward or forward!
let video;
let model;
const init = async () => {
  video = await loadVideo();
  await tf.setBackend("webgl");
  model = await facemesh.load();
  main();
};
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
init();
let scrollPosition = 0;
async function main() {
  const predictions = await model.estimateFaces(
    document.querySelector("video")
  );
  if (predictions.length > 0) {
    if (predictions[0].annotations.lipsLowerOuter) {
      let zAxis = predictions[0].annotations.lipsLowerOuter[9][2];
      if (zAxis > 5) {
        // Scroll down
        window.scrollTo({
          top: (scrollPosition += 10),
          left: 0,
          behavior: "smooth",
        });
      } else if (zAxis < -5) {
        // Scroll up
        window.scrollTo({
          top: (scrollPosition -= 10),
          left: 0,
          behavior: "smooth",
        });
      }
    }
  }
  requestAnimationFrame(main);
}
Listing 5-31

Complete JavaScript code

This model is specialized in detecting face landmarks. Next, we’re going to look into another one, to detect keypoints in a user’s hands.

5.2.2 Handpose

The second model we are going to experiment with is called Handpose. This model specializes in recognizing the position of 21 3D keypoints in the user’s hands.

The following is an example of the output of this model, once visualized on the screen using the Canvas API.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig36_HTML.jpg
Figure 5-36

Keypoints from Handpose visualized. Source: https://github.com/tensorflow/tfjs-models/tree/master/handpose

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig37_HTML.jpg
Figure 5-37

Keypoints from Handpose visualized. Source: https://github.com/tensorflow/tfjs-models/tree/master/handpose

To implement this, the lines of code will look very familiar if you have read the previous section.

Loading the model

We need to start by requiring TensorFlow.js and the Handpose model:
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/handpose"></script>
Listing 5-32

Import TensorFlow.js and the Handpose model

Similarly to the Facemesh model, we are going to use the video stream as input, so we also need to add a video element to the main HTML file.

Then, in your JavaScript file, we can use the same functions we wrote before to set up the camera and load the model. The only line we will need to change is the line where we call the load method on the model.

As we are using Handpose instead of Facemesh, we need to replace facemesh.load() with handpose.load().

So, overall the base of your JavaScript file should have the following code.
let video;
let model;
const init = async () => {
  video = await loadVideo();
  await tf.setBackend("webgl");
  model = await handpose.load();
};
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
init();
Listing 5-33

Code to set up to load the model and video input

Predicting key points

Once the model is loaded and the webcam feed is set up, we can run predictions and detect keypoints when a hand is placed in front of the webcam.

To do this, we can copy the main() function we created when using Facemesh, but replace the expression model.estimateFaces with model.estimateHands.

As a result, the main function should be as follows.
async function main() {
  const predictions = await model.estimateHands(
    document.querySelector("video")
  );
  if (predictions.length > 0) {
    console.log(predictions);
  }
  requestAnimationFrame(main);
}
Listing 5-34

Run predictions and log the output

The output of this code will log the following data in the browser’s console.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig38_HTML.jpg
Figure 5-38

Output when detecting hands

We can see that the format of this data is very similar to the one when using the Facemesh model!

This makes it easier and faster to experiment, as you can reuse code samples written in other projects and focus on exploring what can be built with these models rather than spending time on configuration.

The main differences that can be noticed are the properties defined in annotations, the additional handInViewConfidence property, and the lack of mesh and scaledMesh data.

The handInViewConfidence property represents the probability of a hand being present. It is a floating value between 0 and 1. The closer it is to 1, the more confident the model is that a hand is found in the video stream.
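
For example, if you only want to run your own logic when the model is fairly sure a hand is present, you could guard it with this property; the 0.9 threshold here is an arbitrary value chosen for this sketch.
if (predictions.length > 0 && predictions[0].handInViewConfidence > 0.9) {
  // Only use the predictions when the model is confident a hand is visible.
  console.log(predictions[0].annotations);
}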

At the moment of writing this book, this model is able to detect only one hand at a time. As a result, you cannot build applications that would require a user to use both hands at once as a way of interacting with the interface.

Full code sample

To check that everything is working properly, here is the full JavaScript code sample needed to test your setup.
let video;
let model;
const init = async () => {
  video = await loadVideo();
  await tf.setBackend("webgl");
  model = await handpose.load();
  main();
};
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
init();
async function main() {
  const predictions = await model.estimateHands(
    document.querySelector("video")
  );
  if (predictions.length > 0) {
    console.log(predictions);
  }
  requestAnimationFrame(main);
}
Listing 5-35

Full code sample

Project

To experiment with the kind of applications that can be built with this model, we’re going to build a small “Rock Paper Scissors” game.

To understand how we are going to recognize the three gestures, let’s have a look at the following visualizations to understand the position of the keypoints per gesture.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig39_HTML.jpg
Figure 5-39

“Rock” gesture visualized

The preceding screenshot represents the “rock” gesture. As we can see, all fingers are folded, so the tip of each finger should sit further along the z axis than the keypoint at the end of its first phalanx.

Alternatively, we can consider that the y value of the fingertips should be greater than the one of the major knuckles, keeping in mind that the top of the screen corresponds to 0, so the lower a keypoint is on the screen, the higher its y value.

We’ll be able to play around with the data returned in the annotations object to see if this is accurate and can be used to detect the “rock” gesture.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig40_HTML.jpg
Figure 5-40

“Paper” gesture visualized

In the “paper” gesture, all fingers are straight so we can use mainly the y coordinates of different fingers. For example, we could check if the y value of the last point of each finger (at the tips) is less than the y value of the palm or the base of each finger.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig41_HTML.jpg
Figure 5-41

“Scissors” gesture visualized

Finally, the “scissors” gesture could be recognized by looking at the space in x axis between the index finger and the middle finger, as well as the y coordinate of the other fingers.

If the tips of the ring finger and little finger sit lower on the screen than their bases, meaning their y value is greater, these fingers are probably folded.

Reusing the code samples we have gone through in the previous sections, let’s look into how we can write the logic to recognize and differentiate these gestures.

If we start with the “rock” gesture, here is how we could check if the y coordinate of each finger is higher than the one of the base knuckle.
let indexBase = predictions[0].annotations.indexFinger[0][1];
let indexTip = predictions[0].annotations.indexFinger[3][1];
if (indexTip > indexBase) {
  console.log("index finger folded");
}
Listing 5-36

Logic to check if the index finger is folded

We can start by declaring two variables, one to store the y position of the base of the index finger and one for the tip of the same finger.

Looking back at the data from the annotations object when a finger is present on screen, we can see that, for the index finger, we get an array of 4 arrays representing the x, y, and z coordinates of each key point.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig42_HTML.jpg
Figure 5-42

Output data when a hand is detected

The y coordinate in the first array has a value of about 352.27 and the y coordinate in the last array has a value of about 126.62, which is lower, so we can deduce that the first array represents the position of the base of the index finger, and the last array represents the keypoint at the tip of that finger.

We can test that this information is correct by writing the if statement shown earlier that logs the message “index finger folded” if the value of indexTip is higher than the one of indexBase.

And it works!

If you test this code by placing your hand in front of the camera and switch from holding your index finger straight and then folding it, you should see the message being logged in the console!

If we wanted to keep things really simple, we could stop here and decide that this single check determines the “rock” gesture. However, to have more confidence in our detection, we can repeat the same process for the middle finger, ring finger, and little finger.

The thumb would be a little different as we would check the difference in x coordinate rather than y, because of the way this finger folds.
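
As a rough illustration, and keeping in mind that the direction of the comparison depends on which hand is shown and how it is oriented, a sketch of such a check could compare how close the tip of the thumb gets to the base of the index finger.
// Hypothetical thumb check: the thumb folds sideways, across the palm,
// so we compare x coordinates instead of y.
let thumbTipX = predictions[0].annotations.thumb[3][0];
let thumbBaseX = predictions[0].annotations.thumb[0][0];
let indexBaseX = predictions[0].annotations.indexFinger[0][0];
// When the thumb is folded, its tip usually ends up closer to the base of
// the index finger than the base of the thumb is.
let thumbFolded =
  Math.abs(thumbTipX - indexBaseX) < Math.abs(thumbBaseX - indexBaseX);
console.log(thumbFolded ? "thumb folded" : "thumb extended");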

For the “paper” gesture, as all fingers are extended, we could check that the tip of each finger has a smaller y coordinate than the base.

Here’s what the code could look like to verify that.
let indexBase = predictions[0].annotations.indexFinger[0][1];
let indexTip = predictions[0].annotations.indexFinger[3][1];
let thumbBase = predictions[0].annotations.thumb[0][1];
let thumbTip = predictions[0].annotations.thumb[3][1];
let middleBase = predictions[0].annotations.middleFinger[0][1];
let middleTip = predictions[0].annotations.middleFinger[3][1];
let ringBase = predictions[0].annotations.ringFinger[0][1];
let ringTip = predictions[0].annotations.ringFinger[3][1];
let pinkyBase = predictions[0].annotations.pinky[0][1];
let pinkyTip = predictions[0].annotations.pinky[3][1];
let indexExtended = indexBase > indexTip ? true : false;
let thumbExtended = thumbBase > thumbTip ? true : false;
let middleExtended = middleBase > middleTip ? true : false;
let ringExtended = ringBase > ringTip ? true : false;
let pinkyExtended = pinkyBase > pinkyTip ? true : false;
if (indexExtended && thumbExtended && middleExtended && ringExtended &&
    pinkyExtended) {
  console.log("paper gesture!");
} else {
  console.log("other gesture");
}
Listing 5-37

Check the y coordinate of each finger for the “paper” gesture

We start by storing the coordinates we are interested in into variables and then compare their values to set the extended states to true or false.

If all fingers are extended, we log the message “paper gesture!”.

If everything is working fine, you should be able to place your hand in front of the camera with all fingers extended and see the logs in the browser’s console.

If you change to another gesture, the message “other gesture” should be logged.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig43_HTML.jpg
Figure 5-43

Screenshot of hand detected in the webcam feed and paper gesture logged in the console

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig44_HTML.jpg
Figure 5-44

Screenshot of hand detected in the webcam feed and other gesture logged in the console

Finally, detecting the “scissors” gesture can be done by looking at the difference in x coordinates between the tips of the index and middle fingers, as well as making sure the other fingers are not extended.
let indexTipX = predictions[0].annotations.indexFinger[3][0];
let middleTipX = predictions[0].annotations.middleFinger[3][0];
let diffFingersX =
      indexTipX > middleTipX ? indexTipX - middleTipX : middleTipX - indexTipX;
console.log(diffFingersX);
Listing 5-38

Check the difference in x coordinate for the index and middle finger tips

The following are two screenshots of the data we get back with this code sample.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig45_HTML.jpg
Figure 5-45

Output data when executing a “scissors” gesture

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig46_HTML.jpg
Figure 5-46

Output data when executing another gesture

We can see that when we do the “scissors” gesture, the value of the diffFingersX variable is much higher than when the two fingers are close together.

Looking at this data, we can settle on a threshold of 100: if the value of diffFingersX is more than 100 and the ring and little fingers are folded, the likelihood of the gesture being “scissors” is very high.

So, altogether, we could check this gesture with the following code sample.
let ringBase = predictions[0].annotations.ringFinger[0][1];
let ringTip = predictions[0].annotations.ringFinger[3][1];
let pinkyBase = predictions[0].annotations.pinky[0][1];
let pinkyTip = predictions[0].annotations.pinky[3][1];
let ringExtended = ringBase > ringTip ? true : false;
let pinkyExtended = pinkyBase > pinkyTip ? true : false;
let indexTipX = predictions[0].annotations.indexFinger[3][0];
let middleTipX = predictions[0].annotations.middleFinger[3][0];
let diffFingersX =
      indexTipX > middleTipX ? indexTipX - middleTipX : middleTipX - indexTipX;
if (diffFingersX > 100 && !ringExtended && !pinkyExtended) {
  console.log("scissors gesture!");
}
Listing 5-39

Detect “scissors” gesture

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig47_HTML.jpg
Figure 5-47

Screenshot of “scissors” gesture detection working

Now that we wrote the logic to detect the gestures separately, let’s put it all together.
let indexBase = predictions[0].annotations.indexFinger[0][1];
let indexTip = predictions[0].annotations.indexFinger[3][1];
let thumbBase = predictions[0].annotations.thumb[0][1];
let thumbTip = predictions[0].annotations.thumb[3][1];
let middleBase = predictions[0].annotations.middleFinger[0][1];
let middleTip = predictions[0].annotations.middleFinger[3][1];
let ringBase = predictions[0].annotations.ringFinger[0][1];
let ringTip = predictions[0].annotations.ringFinger[3][1];
let pinkyBase = predictions[0].annotations.pinky[0][1];
let pinkyTip = predictions[0].annotations.pinky[3][1];
let indexExtended = indexBase > indexTip ? true : false;
let thumbExtended = thumbBase > thumbTip ? true : false;
let middleExtended = middleBase > middleTip ? true : false;
let ringExtended = ringBase > ringTip ? true : false;
let pinkyExtended = pinkyBase > pinkyTip ? true : false;
if (
  indexExtended &&
  thumbExtended &&
  middleExtended &&
  ringExtended &&
  pinkyExtended
) {
  console.log("paper gesture!");
}
/* Rock gesture */
if (!indexExtended && !middleExtended && !ringExtended && !pinkyExtended) {
  console.log("rock gesture!");
}
/* Scissors gesture */
let indexTipX = predictions[0].annotations.indexFinger[3][0];
let middleTipX = predictions[0].annotations.middleFinger[3][0];
let diffFingersX =
      indexTipX > middleTipX ? indexTipX - middleTipX : middleTipX - indexTipX;
if (diffFingersX > 100 && !ringExtended && !pinkyExtended) {
  console.log("scissors gesture!");
}
Listing 5-40

Logic for detecting all gestures

If everything works properly, you should see the correct message being logged in the console when you do each gesture!

Once you have verified that the logic works, you can move on from using console.log and use this to build a game or use these gestures as a controller for your interface, and so on.
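
For example, a minimal way to make this logic reusable could be to wrap it into a helper function that returns the name of the detected gesture; this refactor is only a sketch, using the same checks and the same threshold of 100 as above.
const detectGesture = (predictions) => {
  const annotations = predictions[0].annotations;
  // A finger is considered extended when its base has a greater y value than its tip.
  const extended = (finger) => annotations[finger][0][1] > annotations[finger][3][1];
  const diffFingersX = Math.abs(
    annotations.indexFinger[3][0] - annotations.middleFinger[3][0]
  );
  if (
    extended("thumb") &&
    extended("indexFinger") &&
    extended("middleFinger") &&
    extended("ringFinger") &&
    extended("pinky")
  ) {
    return "paper";
  }
  if (
    !extended("indexFinger") &&
    !extended("middleFinger") &&
    !extended("ringFinger") &&
    !extended("pinky")
  ) {
    return "rock";
  }
  if (diffFingersX > 100 && !extended("ringFinger") && !extended("pinky")) {
    return "scissors";
  }
  return null;
};
Inside the main function, once predictions contain at least one hand, you could then call const gesture = detectGesture(predictions); and react to the value returned.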

The most important thing is to understand how the model works and get familiar with building logic based on coordinates, so you can explore the opportunities while staying conscious of some of the limits.

5.2.3 PoseNet

Finally, the last body tracking model we are going to talk about is called PoseNet.

PoseNet is a pose detection model that can estimate a single pose or multiple poses in an image or video.

Similarly to the Facemesh and Handpose models, PoseNet tracks the position of keypoints in a user’s body.

The following is an example of these key points visualized.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig48_HTML.jpg
Figure 5-48

Visualization of the keypoints detected by PoseNet. Source: https://github.com/tensorflow/tfjs-models/tree/master/posenet

This body tracking model can detect 17 keypoints and their 2D coordinates, indexed by part ID.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig49_HTML.jpg
Figure 5-49

List of keypoints and their ID. Source: https://github.com/tensorflow/tfjs-models/tree/master/posenet

Even though this model is also specialized in tracking a person’s body using the webcam feed, using it in your code is a little bit different from the two models we covered in the previous sections.

Importing and loading the model

Importing and loading the model follows the same standard as most of the code samples in this book.
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
Listing 5-41

Import TensorFlow.js and the PoseNet model in HTML

const net = await posenet.load();
Listing 5-42

Loading the model in JavaScript

This default way of loading PoseNet uses a faster and smaller model based on the MobileNetV1 architecture. The trade-off for speed is a lower accuracy.

If you want to experiment with the parameters, you can also load it this way.
const net = await posenet.load({
  architecture: 'MobileNetV1',
  outputStride: 16,
  inputResolution: { width: 640, height: 480 },
  multiplier: 0.75
});
Listing 5-43

Alternative ways of loading the model

If you want to try the second configuration available, you can indicate that you’d like to use the model based on the ResNet50 architecture, which has better accuracy but is larger, so it will take more time to load.
const net = await posenet.load({
  architecture: 'ResNet50',
  outputStride: 32,
  inputResolution: { width: 257, height: 200 },
  quantBytes: 2
});
Listing 5-44

Loading the model using the ResNet50 architecture

If you feel a bit confused by the different parameters, don’t worry: as you get started, using the defaults provided is completely fine. If you want to learn more about them, you can find more information in the official TensorFlow documentation.

Once the model is loaded, you can focus on predicting poses.

Predictions

To get predictions from the model, you mainly need to call the estimateSinglePose method on the model.
const pose = await net.estimateSinglePose(image, {
  flipHorizontal: false
});
Listing 5-45

Predicting single poses

The image parameter can either be some imageData, an HTML image element, an HTML canvas element, or an HTML video element. It represents the input image you want to get predictions on.

The flipHorizontal parameter indicates if you would like to flip/mirror the pose horizontally. By default, its value is set to false.

If you are using videos, it should be set to true if the video is by default flipped horizontally (e.g., when using a webcam).

The preceding code sample will set the variable pose to a single pose object that will contain a confidence score and an array of keypoints detected, with their 2D coordinates, the name of the body part, and a probability score.

The following is an example of the object that will be returned.
{
  "score": 0.32371445304906,
  "keypoints": [
    {
      "position": {
        "y": 76.291801452637,
        "x": 253.36747741699
      },
      "part": "nose",
      "score": 0.99539834260941
    },
    {
      "position": {
        "y": 71.10383605957,
        "x": 253.54365539551
      },
      "part": "leftEye",
      "score": 0.98781454563141
    },
    {
      "position": {
        "y": 71.839515686035,
        "x": 246.00454711914
      },
      "part": "rightEye",
      "score": 0.99528175592422
    },
    {
      "position": {
        "y": 72.848854064941,
        "x": 263.08151245117
      },
      "part": "leftEar",
      "score": 0.84029853343964
    },
    {
      "position": {
        "y": 79.956565856934,
        "x": 234.26812744141
      },
      "part": "rightEar",
      "score": 0.92544466257095
    },
    {
      "position": {
        "y": 98.34538269043,
        "x": 399.64068603516
      },
      "part": "leftShoulder",
      "score": 0.99559044837952
    },
    {
      "position": {
        "y": 95.082359313965,
        "x": 458.21868896484
      },
      "part": "rightShoulder",
      "score": 0.99583911895752
    },
    {
      "position": {
        "y": 94.626205444336,
        "x": 163.94561767578
      },
      "part": "leftElbow",
      "score": 0.9518963098526
    },
    {
      "position": {
        "y": 150.2349395752,
        "x": 245.06030273438
      },
      "part": "rightElbow",
      "score": 0.98052614927292
    },
    {
      "position": {
        "y": 113.9603729248,
        "x": 393.19735717773
      },
      "part": "leftWrist",
      "score": 0.94009721279144
    },
    {
      "position": {
        "y": 186.47859191895,
        "x": 257.98034667969
      },
      "part": "rightWrist",
      "score": 0.98029226064682
    },
    {
      "position": {
        "y": 208.5266418457,
        "x": 284.46710205078
      },
      "part": "leftHip",
      "score": 0.97870296239853
    },
    {
      "position": {
        "y": 209.9910736084,
        "x": 243.31219482422
      },
      "part": "rightHip",
      "score": 0.97424703836441
    },
    {
      "position": {
        "y": 281.61965942383,
        "x": 310.93188476562
      },
      "part": "leftKnee",
      "score": 0.98368924856186
    },
    {
      "position": {
        "y": 282.80120849609,
        "x": 203.81164550781
      },
      "part": "rightKnee",
      "score": 0.96947449445724
    },
    {
      "position": {
        "y": 360.62716674805,
        "x": 292.21047973633
      },
      "part": "leftAnkle",
      "score": 0.8883239030838
    },
    {
      "position": {
        "y": 347.41177368164,
        "x": 203.88229370117
      },
      "part": "rightAnkle",
      "score": 0.8255187869072
    }
  ]
}
Listing 5-46

Complete object returned as predictions

If you expect an image or video to contain multiple people and would like to detect multiple poses, you can change the method called as follows.
const poses = await net.estimateMultiplePoses(image, {
  flipHorizontal: false,
  maxDetections: 5,
  scoreThreshold: 0.5,
  nmsRadius: 20
});
Listing 5-47

Predicting multiple poses

We can see that some additional parameters are passed in.
  • maxDetections indicates the maximum number of poses we’d like to detect. The value 5 is the default but you can change it to more or less.

  • scoreThreshold indicates that you only want instances to be returned if the score value at the root of the object is higher than the value set. 0.5 is the default value.

  • nmsRadius stands for the nonmaximum suppression radius and indicates the number of pixels that should separate two detected poses. The value needs to be strictly positive and defaults to 20.

Using this method will set the value of the variable poses to an array of pose objects, like the following.
[
  // Pose 1
  {
    // Pose score
    "score": 0.42985695206067,
    "keypoints": [
      {
        "position": {
          "x": 126.09371757507,
          "y": 97.861720561981
        },
        "part": "nose",
        "score": 0.99710708856583
      },
      {
        "position": {
          "x": 132.53466176987,
          "y": 86.429876804352
        },
        "part": "leftEye",
        "score": 0.99919074773788
      },
      ...
    ],
  },
  // Pose 2
  {
    // Pose score
    "score": 0.13461434583673,
    "keypoints": [
      {
        "position": {
          "x": 116.58444058895,
          "y": 99.772533416748
        },
        "part": "nose",
        "score": 0.0028593824245036
      },
      {
        "position": {
          "x": 133.49897611141,
          "y": 79.644590377808
        },
        "part": "leftEye",
        "score": 0.99919074773788
      },
      ...
    ],
  }
]
Listing 5-48

Output array when detecting multiple poses
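
Assuming poses holds an array like the one above, a short sketch of how you could loop over it and skip low-confidence detections might look like this; the 0.3 and 0.5 thresholds are arbitrary values chosen for the example.
poses.forEach((pose, index) => {
  // Ignore poses the model is not confident about.
  if (pose.score < 0.3) return;
  pose.keypoints.forEach((keypoint) => {
    if (keypoint.score > 0.5) {
      console.log(`Pose ${index} - ${keypoint.part}:`, keypoint.position);
    }
  });
});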

Full code sample

Altogether, the code sample to set up the prediction of poses in an image is as follows.
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PoseNet</title>
  </head>
  <body>
    <!-- you can replace the path to the asset with any you'd like -->
    <img src="image-pose.jpg" alt="" />
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
    <script src="index.js"></script>
  </body>
</html>
Listing 5-49

HTML code to detect poses in an image

const imageElement = document.getElementsByTagName("img")[0];
posenet
  .load()
  .then(function (net) {
    const pose = net.estimateSinglePose(imageElement, {
      flipHorizontal: true,
    });
    return pose;
  })
  .then(function (pose) {
    console.log(pose);
  })
  .catch((err) => console.log(err));
Listing 5-50

JavaScript code

For a video from the webcam feed, the code should be as follows.
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PoseNet</title>
  </head>
  <body>
    <video></video>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
    <script src="index.js"></script>
  </body>
</html>
Listing 5-51

HTML code to detect poses in a video

let video;
let model;
const init = async () => {
  video = await loadVideo();
  model = await posenet.load();
  main();
};
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
init();
const main = async () => {
  const pose = await model.estimateSinglePose(video, {
    flipHorizontal: true,
  });
  console.log(pose);
  requestAnimationFrame(main);
};
Listing 5-52

JavaScript code to detect poses in a video from the webcam

Visualizing keypoints

So far, we’ve mainly used console.log to be able to see the results coming back from the model. However, you might want to visualize them on the page to make sure that the body tracking is working and that the keypoints are placed in the right position.

To do this, we are going to use the Canvas API.

We need to start by adding an HTML canvas element to the HTML file. Then, we can create a function that will access this element and its context, detect the poses, and draw the keypoints.
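
In the HTML file, this can be as simple as the following element, whose id matches the one used in the JavaScript code below.
<canvas id="output"></canvas>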

Accessing the canvas element and its context is done with the following lines.
const canvas = document.getElementById("output");
const ctx = canvas.getContext("2d");
canvas.width = window.innerWidth;
canvas.height = window.innerHeight;
Listing 5-53

Accessing the canvas element

Then, we can create a function that will call the estimateSinglePose method to start the detection, draw the video on the canvas, and loop through the keypoints found to render them on the canvas element.
async function poseDetectionFrame() {
    const pose = await net.estimateSinglePose(video, {
      flipHorizontal: true
    });
    ctx.clearRect(0, 0, videoWidth, videoHeight);
    ctx.save();
    ctx.scale(-1, 1);
    ctx.translate(-videoWidth, 0);
    ctx.drawImage(video, 0, 0, videoWidth, videoHeight);
    ctx.restore();
    drawKeypoints(pose.keypoints, 0.5, ctx);
    drawSkeleton(pose.keypoints, 0.5, ctx);
    requestAnimationFrame(poseDetectionFrame);
}
Listing 5-54

Start the detection, draw the webcam feed on a canvas element, and render the keypoints

The drawKeypoints and drawSkeleton functions use some of the Canvas API methods to draw circles and lines at the coordinates of the keypoints detected.
const color = "aqua";
const lineWidth = 2;
const toTuple = ({ y, x }) => [y, x];
function drawPoint(ctx, y, x, r, color) {
  ctx.beginPath();
  ctx.arc(x, y, r, 0, 2 * Math.PI);
  ctx.fillStyle = color;
  ctx.fill();
}
function drawSegment([ay, ax], [by, bx], color, scale, ctx) {
  ctx.beginPath();
  ctx.moveTo(ax * scale, ay * scale);
  ctx.lineTo(bx * scale, by * scale);
  ctx.lineWidth = lineWidth;
  ctx.strokeStyle = color;
  ctx.stroke();
}
function drawSkeleton(keypoints, minConfidence, ctx, scale = 1) {
  const adjacentKeyPoints = posenet.getAdjacentKeyPoints(
    keypoints,
    minConfidence
  );
  adjacentKeyPoints.forEach((keypoints) => {
    drawSegment(
      toTuple(keypoints[0].position),
      toTuple(keypoints[1].position),
      color,
      scale,
      ctx
    );
  });
}
function drawKeypoints(keypoints, minConfidence, ctx, scale = 1) {
  for (let i = 0; i < keypoints.length; i++) {
    const keypoint = keypoints[i];
    if (keypoint.score < minConfidence) {
      continue;
    }
    const { y, x } = keypoint.position;
    drawPoint(ctx, y * scale, x * scale, 3, color);
  }
}
Listing 5-55

Some helper functions to draw the keypoints onto the canvas element

The poseDetectionFrame function should be called once the video and model are loaded.

Altogether, the full code sample should look like the following.
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PoseNet</title>
  </head>
  <body>
    <video id="video"></video>
    <canvas id="output"></canvas>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
    <script src="utils.js" type="module"></script>
    <script src="index.js" type="module"></script>
  </body>
</html>
Listing 5-56

Complete HTML code to visualize keypoints

import { drawKeypoints, drawSkeleton } from "./utils.js";
const videoWidth = window.innerWidth;
const videoHeight = window.innerHeight;
async function setupCamera() {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  const video = document.getElementById("video");
  video.width = videoWidth;
  video.height = videoHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: videoWidth,
      height: videoHeight,
    },
  });
  video.srcObject = stream;
  return new Promise((resolve) => {
    video.onloadedmetadata = () => {
      resolve(video);
    };
  });
}
async function loadVideo() {
  const video = await setupCamera();
  video.play();
  return video;
}
function detectPoseInRealTime(video, net) {
  const canvas = document.getElementById("output");
  const ctx = canvas.getContext("2d");
  const flipPoseHorizontal = true;
  canvas.width = videoWidth;
  canvas.height = videoHeight;
  async function poseDetectionFrame() {
    let minPoseConfidence;
    let minPartConfidence;
    const pose = await net.estimateSinglePose(video, {
      flipHorizontal: flipPoseHorizontal,
    });
    minPoseConfidence = 0.1;
    minPartConfidence = 0.5;
    ctx.clearRect(0, 0, videoWidth, videoHeight);
    ctx.save();
    ctx.scale(-1, 1);
    ctx.translate(-videoWidth, 0);
    ctx.drawImage(video, 0, 0, videoWidth, videoHeight);
    ctx.restore();
    drawKeypoints(pose.keypoints, minPartConfidence, ctx);
    drawSkeleton(pose.keypoints, minPartConfidence, ctx);
    requestAnimationFrame(poseDetectionFrame);
  }
  poseDetectionFrame();
}
let net;
export async function init() {
  net = await posenet.load();
  let video;
  try {
    video = await loadVideo();
  } catch (e) {
    throw e;
  }
  detectPoseInRealTime(video, net);
}
navigator.getUserMedia =
  navigator.getUserMedia ||
  navigator.webkitGetUserMedia ||
  navigator.mozGetUserMedia;
init();
Listing 5-57

JavaScript code to visualize keypoints in index.js

const color = "aqua";
const lineWidth = 2;
function toTuple({ y, x }) {
  return [y, x];
}
export function drawPoint(ctx, y, x, r, color) {
  ctx.beginPath();
  ctx.arc(x, y, r, 0, 2 * Math.PI);
  ctx.fillStyle = color;
  ctx.fill();
}
export function drawSegment([ay, ax], [by, bx], color, scale, ctx) {
  ctx.beginPath();
  ctx.moveTo(ax * scale, ay * scale);
  ctx.lineTo(bx * scale, by * scale);
  ctx.lineWidth = lineWidth;
  ctx.strokeStyle = color;
  ctx.stroke();
}
export function drawSkeleton(keypoints, minConfidence, ctx, scale = 1) {
  const adjacentKeyPoints = posenet.getAdjacentKeyPoints(
    keypoints,
    minConfidence
  );
  adjacentKeyPoints.forEach((keypoints) => {
    drawSegment(
      toTuple(keypoints[0].position),
      toTuple(keypoints[1].position),
      color,
      scale,
      ctx
    );
  });
}
export function drawKeypoints(keypoints, minConfidence, ctx, scale = 1) {
  for (let i = 0; i < keypoints.length; i++) {
    const keypoint = keypoints[i];
    if (keypoint.score < minConfidence) {
      continue;
    }
    const { y, x } = keypoint.position;
    drawPoint(ctx, y * scale, x * scale, 3, color);
  }
}
Listing 5-58

JavaScript code to visualize keypoints in utils.js

The output of this code should visualize the keypoints like the following.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig50_HTML.jpg
Figure 5-50

Output of the complete code sample

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig51_HTML.jpg
Figure 5-51

Output of the complete code sample

Now that we have gone through the code to detect poses, access coordinates for different parts of the body, and visualize them on a canvas, feel free to experiment with this data to create projects exploring new interactions.
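
As a starting point, here is a small hypothetical sketch that makes an HTML element follow the nose keypoint; the "cursor" element is an assumption and would need to exist in your page, positioned absolutely.
// Assumes an element like <div id="cursor"></div> styled with position: absolute.
const nose = pose.keypoints.find((keypoint) => keypoint.part === "nose");
if (nose && nose.score > 0.5) {
  const cursor = document.getElementById("cursor");
  cursor.style.transform = `translate(${nose.position.x}px, ${nose.position.y}px)`;
}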

5.3 Hardware data

For the last section of this chapter, and the last part of the book that will contain code samples, we are going to look into something a bit more advanced and experimental. The next few pages will focus on using data generated by hardware and on building a custom machine learning model to detect gestures.

Usually, when working with hardware, I use boards such as the Arduino or Raspberry Pi; however, to make it more accessible to anyone reading this book who might not have access to such hardware, this next section is going to use another device with built-in sensors: your mobile phone!

This is assuming you possess a modern mobile phone with at least an accelerometer and gyroscope.

To have access to this data in JavaScript, we are going to use the Generic Sensor API.

This API is rather new and experimental and has browser support only in Chrome at the moment, so if you decide to write the following code samples, make sure to use Chrome as your browser.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig52_HTML.jpg
Figure 5-52

Browser support for the Generic Sensor API. Source: https://caniuse.com/#search=sensor%20api

To build our gesture classifier, we are going to access and record data from the accelerometer and gyroscope present in your phone, save this data into files in your project, create a machine learning model, train it, and run predictions on new live data.

To be able to do this, we are going to need a little bit of Node.js, web sockets with socket.io, the Generic Sensor API, and TensorFlow.js.

If you are unfamiliar with some of these technologies, don’t worry, I’m going to explain each part and provide code samples you should be able to follow.

5.3.1 Web Sensors API

As we are using hardware data in this section, the first thing we need to do is verify that we can access the correct data.

As said a little earlier, we need to record data from the gyroscope and accelerometer.

The gyroscope gives us details about the orientation of the device and its angular velocity, and the accelerometer focuses on giving us data about the acceleration.

Even though we could use only one of these sensors if we wanted, I believe that combining the data from both the gyroscope and the accelerometer gives us more precise information about the motion, which will be helpful for gesture recognition.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig53_HTML.jpg
Figure 5-53

Accelerometer axes on mobile phone. Source: https://developers.google.com/web/fundamentals/native-hardware/device-orientation

../images/496132_1_En_5_Chapter/496132_1_En_5_Fig54_HTML.jpg
Figure 5-54

Gyroscope axes on mobile phone. Source: https://www.sitepoint.com/using-device-orientation-html5/

5.3.2 Accessing sensors data

To access the data using the Generic Sensor API, we need to start by declaring a few variables: one that will refer to the requestAnimationFrame statement so we can cancel it later on, and two others that will contain gyroscope and accelerometer data.
let dataRequest;
let gyroscopeData = {
    x: '',
    y: '',
    z: ''
}
let accelerometerData = {
    x: '',
    y: '',
    z: ''
}
Listing 5-59

In index.js. Declaring variables to contain hardware data

Then, to access the phone’s sensors data, you will need to instantiate a new Gyroscope and Accelerometer interface, use the reading event listener to get the x, y, and z coordinates of the device’s motion, and call the start method to start the tracking.
function initSensors() {
    let gyroscope = new Gyroscope({frequency: 60});
    gyroscope.addEventListener('reading', e => {
        gyroscopeData.x = gyroscope.x;
        gyroscopeData.y = gyroscope.y;
        gyroscopeData.z = gyroscope.z;
    });
    gyroscope.start();
    let accelerometer = new Accelerometer({frequency: 60});
    accelerometer.addEventListener('reading', e => {
        accelerometerData.x = accelerometer.x;
        accelerometerData.y = accelerometer.y;
        accelerometerData.z = accelerometer.z;
    });
    accelerometer.start();
}
Listing 5-60

In index.js. Get accelerometer and gyroscope data
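
Since this API is experimental, you might also want to guard the call to initSensors with a quick feature check; this is only a defensive sketch, relying on the fact that the Gyroscope and Accelerometer constructors are only exposed when the API is supported.
if ("Gyroscope" in window && "Accelerometer" in window) {
  initSensors();
} else {
  console.log("Generic Sensor API not supported in this browser");
}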

Finally, as we are interested in recording data only while a specific gesture is being executed, we need to call the preceding function only when the user is pressing on the mobile screen, using the touchstart event. We should also cancel it on touchend.
function getData() {
  let data = {
    xAcc: accelerometerData.x,
    yAcc: accelerometerData.y,
    zAcc: accelerometerData.z,
    xGyro: gyroscopeData.x,
    yGyro: gyroscopeData.y,
    zGyro: gyroscopeData.z,
  };
  document.body.innerHTML = JSON.stringify(data);
  dataRequest = requestAnimationFrame(getData);
}
window.onload = function () {
  initSensors();
  document.body.addEventListener("touchstart", (e) => {
    getData();
  });
  document.body.addEventListener("touchend", (e) => {
    cancelAnimationFrame(dataRequest);
  });
};
Listing 5-61

In index.js. Use the touchstart event listener to start displaying data

At this point, if you want to check the output of this code, you will need to visit the page on your mobile phone using a tool like ngrok, for example, to create a tunnel to your localhost.

What you should see is the live accelerometer and gyroscope data displayed on the screen when you press it, and when you release it, the data should not update anymore.

At this point, we display the data on the page so we can double check that everything is working as expected.

However, what we really need is to store this data in files when we record gestures. For this, we are going to use web sockets to send the data from the front end to a back-end server that will be in charge of writing it to files in our application folder.

5.3.3 Setting up web sockets

To set up web sockets, we are going to use socket.io.

So far, in all previous examples, we only worked with HTML and JavaScript files without any back end.

If you have never written any Node.js before, you will need to install it as well as npm or yarn to be able to install packages.

Once you have these two tools set up, at the root of your project folder, in your terminal, write npm init to generate a package.json file that will contain some details about the project.

Once your package.json file is generated, in your terminal, write npm install socket.io to install the package.

Once this is done, add the following script tag in your HTML file.
<script type="text/javascript" src="./../socket.io/socket.io.js"></script>
Listing 5-62

Import the socket.io script in the HTML file

Now, you should be able to use socket.io in the front end. In your JavaScript file, start by instantiating it with const socket = io().

If you have any issue with setting up the package, feel free to refer to the official documentation.

Then, in our event listener for touchstart, we can use socket.io to send data to the server with the following code.
socket.emit("motion data", `${accelerometerData.x} ${accelerometerData.y} ${accelerometerData.z} ${gyroscopeData.x} ${gyroscopeData.y} ${gyroscopeData.z}`);
Listing 5-63

Send motion data via web sockets

We are sending the motion data as a string as we want to write these values down into files.

On touchend, we need to send another event indicating that we want to stop the emission of data with socket.emit('end motion data').

Altogether, our first JavaScript file should look like the following.
const socket = io();
let gyroscopeData = {
  x: "",
  y: "",
  z: "",
};
let accelerometerData = {
  x: "",
  y: "",
  z: "",
};
let dataRequest;
function initSensors() {
  let gyroscope = new Gyroscope({ frequency: 60 });
  gyroscope.addEventListener("reading", (e) => {
    gyroscopeData.x = gyroscope.x;
    gyroscopeData.y = gyroscope.y;
    gyroscopeData.z = gyroscope.z;
  });
  gyroscope.start();
  let accelerometer = new Accelerometer({ frequency: 60 });
  accelerometer.addEventListener("reading", (e) => {
    accelerometerData.x = accelerometer.x;
    accelerometerData.y = accelerometer.y;
    accelerometerData.z = accelerometer.z;
  });
  accelerometer.start();
}
function getData() {
  dataRequest = requestAnimationFrame(getData);
  socket.emit(
    "motion data",
    `${accelerometerData.x} ${accelerometerData.y} ${accelerometerData.z} ${gyroscopeData.x} ${gyroscopeData.y} ${gyroscopeData.z}`
  );
}
window.onload = function () {
  initSensors();
  document.body.addEventListener("touchstart", (e) => {
    getData();
  });
  document.body.addEventListener("touchend", (e) => {
    socket.emit("end motion data");
    cancelAnimationFrame(dataRequest);
  });
};
Listing 5-64

Complete JavaScript code in the index.js file

Now, let’s implement the server side of this project to serve our front-end files, receive the data, and store it into text files.

First, we need to create a new JavaScript file. I personally named it server.js.

To serve our front-end files, we are going to use the express npm package. To install it, type npm install express --save in your terminal.

Once installed, write the following code to create a '/record' route that will serve our index.html file.
const express = require("express");
const app = express();
var http = require("http").createServer(app);
app.use("/record", express.static(__dirname + '/'));
http.listen(process.env.PORT || 3000);
Listing 5-65

Initial setup of the server.js file

You should be able to type node server.js in your terminal, visit http://localhost:3000/record in your browser, and it should serve the index.html file we created previously.

Now, let’s test our web sockets connection by requiring the socket.io package and write the back-end code that will receive messages from the front end.

At the top of the server.js file, require the package with const io = require('socket.io')(http).

Then, set up the connection and listen to events with the following code.
io.on("connection", function (socket) {
  socket.on("motion data", function (data) {
    console.log(data);
  });
  socket.on("end motion data", function () {
    console.log('end');
  });
});
Listing 5-66

In server.js. Web sockets connection

Now, restart the server, visit the page on ‘/record’ on your mobile, and you should see motion data logged in your terminal when you touch your mobile’s screen.

If you don’t see anything, double check that your page is served using https.

At this point, we know that the web sockets connection is properly set up, and the following step is to save this data into files in our application so we’ll be able to use it to train a machine learning algorithm.

To save files, we are going to use the Node.js File System module, so we need to start by requiring it with const fs = require('fs');.

Then, we are going to write some code that will be able to handle arguments passed when starting the server, so we can easily record new samples.

For example, if we want to record three gestures, one performing the letter A in the air, the second the letter B, and the third the letter C, we want to be able to type node server.js letterA 1 to indicate that we are currently recording data for the letter A gesture (letterA parameter) and that this is the first sample (the 1 parameter).

The following code will handle these two arguments, store them in variables, and use them to name the new file created.
let stream;
let sampleNumber;
let gestureType;
let previousSampleNumber;
process.argv.forEach(function (val, index, array) {
  gestureType = array[2];
  sampleNumber = parseInt(array[3]);
  previousSampleNumber = sampleNumber;
  stream = fs.createWriteStream(
    `data/sample_${gestureType}_${sampleNumber}.txt`,
    { flags: "a" }
  );
});
Listing 5-67

In server.js. Code to handle arguments passed in to generate file names dynamically

Now, when starting the server, you will need to pass these two arguments (gesture type and sample number).

To actually write the data from the front end to these files, we need to write the following lines of code.
socket.on("motion data", function (data) {
    /* This following line allows us to record new files without having to start/stop the server. On motion end, we increment the sampleNumber variable so when receiving new data, we deduce it is related to a new gesture and create a file with the correct sample number. */
    if (sampleNumber !== previousSampleNumber) {
      stream = fs.createWriteStream(
        `./data/sample_${gestureType}_${sampleNumber}.txt`,
        { flags: "a" }
      );
    }
    stream.write(`${data} `);
});
socket.on("end motion data", function () {
    stream.end();
    sampleNumber += 1;
});
Listing 5-68

In server.js. Code to create a file and stream when receiving data

We also close the stream when receiving the “end motion data” event so we stop writing motion data when the user has stopped touching their phone’s screen, as this means they’ve stopped executing the gesture we want to record.

To test this setup, start by creating an empty folder called ‘data’ in your application, then type node server.js letterA 1 in your terminal. Go back to the web page on your mobile and draw the letter A in the air while pressing the screen. When you release it, you should see a new file named sample_letterA_1.txt in the data folder, and it should contain gesture data!

At this stage, we are able to get accelerometer and gyroscope data, send it to our server using web sockets, and save it into files in our application.
const express = require("express");
const app = express();
const http = require("http").createServer(app);
const io = require('socket.io')(http);
const fs = require('fs');
let stream;
let sampleNumber;
let gestureType;
let previousSampleNumber;
app.use("/record", express.static(__dirname + '/'));
process.argv.forEach(function (val, index, array) {
  gestureType = array[2];
  sampleNumber = parseInt(array[3]);
  previousSampleNumber = sampleNumber;
  stream = fs.createWriteStream(
    `data/sample_${gestureType}_${sampleNumber}.txt`,
    { flags: "a" }
  );
});
io.on("connection", function (socket) {
  socket.on("motion data", function (data) {
    if (sampleNumber !== previousSampleNumber) {
      stream = fs.createWriteStream(
        `./data/sample_${gestureType}_${sampleNumber}.txt`,
        { flags: "a" }
      );
    }
    stream.write(`${data} `);
  });
  socket.on("end motion data", function () {
    stream.end();
    sampleNumber += 1;
  });
});
http.listen(process.env.PORT || 3000);
Listing 5-69

Complete code sample in the server.js file

Before moving on to writing the code responsible for formatting our data and creating the machine learning model, make sure to record a few samples of data for each of our three gestures; the more, the better, but I would advise recording at least 20 samples per gesture.

5.3.4 Data processing

For this section, I would advise to create a new JavaScript file. I personally called it train.js.

In this file, we are going to read through the text files we recorded in the previous step, transform the data from strings to tensors, and create and train our model. Some of the following code samples are not directly related to TensorFlow.js (reading folders and files, and formatting the data into multidimensional arrays), so I will not dive into them too much.

The first step here is to go through our data folder, get the data for each sample and gesture, and organize it into arrays of features and labels.

For this, I used the line-reader npm package, so we need to install it using npm install line-reader.

We also need to install TensorFlow with npm install @tensorflow/tfjs-node.

Then, I created two functions readDir and readFile to loop through all the files in the data folder and for each file, loop through each line, transform strings into numbers, and return an object containing the label and features for that gesture.
const lineReader = require("line-reader");
var fs = require("fs");
const tf = require("@tensorflow/tfjs-node");
const gestureClasses = ["letterA", "letterB", "letterC"];
let numClasses = gestureClasses.length;
let numSamplesPerGesture = 20; // the number of times you recorded each gesture.
let totalNumDataFiles = numSamplesPerGesture * numClasses;
let numPointsOfData = 6; // x, y, and z for both accelerometer and gyroscope
let numLinesPerFile = 100; // Files might have a different amount of lines so we need a value to truncate and make sure all our samples have the same length.
let totalNumDataPerFile = numPointsOfData * numLinesPerFile;
function readFile(file) {
  let allFileData = [];
  return new Promise((resolve, reject) => {
    fs.readFile(`data/${file}`, "utf8", (err, data) => {
      if (err) {
        reject(err);
      } else {
        lineReader.eachLine(`data/${file}`, function (line) {
          // Turn each line into an array of floats.
          let dataArray = line
            .split(" ")
            .map((arrayItem) => parseFloat(arrayItem));
          allFileData.push(...dataArray);
          let concatArray = [...allFileData];
          if (concatArray.length === totalNumDataPerFile) {
            // Get the label from the filename
            let label = file.split("_")[1];
            let labelIndex = gestureClasses.indexOf(label);
            // Return an object with data as features and the label index
            resolve({ features: concatArray, label: labelIndex });
          }
        });
      }
    });
  });
}
const readDir = () =>
  new Promise((resolve, reject) =>
    fs.readdir(`data/`, "utf8", (err, data) =>
      err ? reject(err) : resolve(data)
    )
  );
(async () => {
  const filenames = await readDir();
  let allData = [];
  filenames.map(async (file) => {
    let originalContent = await readFile(file);
    allData.push(originalContent);
    if (allData.length === totalNumDataFiles) {
      console.log(allData);
    }
  });
})();
Listing 5-70

In train.js. Loop through files to transform raw data into objects of features and labels

I am not going to dive deeper into the preceding code sample, but I added some inline comments to help.

If you run this code using node train.js, you should get some output similar to the following figure.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig55_HTML.jpg
Figure 5-55

Sample output of formatted data

At this point, our variable allData holds all features and labels for each gesture sample, but we are not done yet. Before feeding this data to a machine learning algorithm, we need to transform it to tensors, the data type that TensorFlow.js works with.

The following code samples are going to be more complicated as we need to format the data further, create tensors, split them between a training set and a test set to validate our future predictions, and then generate the model.

I have added inline comments to attempt to explain each step.

So, where we wrote console.log(allData) in the preceding code, replace it with format(allData); the following shows the implementation of this function.
let justFeatures = [];
let justLabels = [];
const format = (allData) => {
  // Sort all data by label to get [{label: 0, features: ...}, {label: 1, features: ...}];
  let sortedData = allData.sort((a, b) => (a.label > b.label ? 1 : -1));
 // Tensorflow works with arrays and not objects so we need to separate labels and tensors.
  sortedData.map((item) => {
    createMultidimentionalArrays(justLabels, item.label, item.label);
    createMultidimentionalArrays(justFeatures, item.label, item.features);
  });
};
function createMultidimentionalArrays(dataArray, index, item) {
  !dataArray[index] && dataArray.push([]);
  dataArray[index].push(item);
}
Listing 5-71

In train.js. Sorting and formatting the data

Running this should result in justFeatures and justLabels being multidimensional arrays containing features and labels indices, respectively.

For example, justLabels should look like [ [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], [ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ] ].
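
As a quick sanity check of these shapes, and assuming the 20 samples per gesture recorded earlier, you could log the lengths at the end of the format function.
// 3 gesture classes, 20 samples per gesture, 600 values per sample (6 x 100).
console.log(justFeatures.length);       // 3
console.log(justFeatures[0].length);    // 20
console.log(justFeatures[0][0].length); // 600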

Now that we are getting closer to a format TensorFlow can work with, we still need to transform these multidimensional arrays to tensors. To do this, let’s create a function called transformToTensor and call it at the end of the format function, once justFeatures and justLabels have been populated.
const [
    trainingFeatures,
    trainingLabels,
    testingFeatures,
    testingLabels,
] = transformToTensor(justFeatures, justLabels);
const transformToTensor = (features, labels) => {
  return tf.tidy(() => {
    // Preparing to split the dataset between training set and test set.
    const featureTrainings = [];
    const labelTrainings = [];
    const featureTests = [];
    const labelTests = [];
    // For each gesture trained, convert the data to tensors and store it between training set and test set.
    for (let i = 0; i < gestureClasses.length; ++i) {
      const [
        featureTrain,
        labelTrain,
        featureTest,
        labelTest,
      ] = convertToTensors(features[i], labels[i], 0.2);
      featureTrainings.push(featureTrain);
      labelTrainings.push(labelTrain);
      featureTests.push(featureTest);
      labelTests.push(labelTest);
    }
    // Return all data concatenated
    return [
      tf.concat(featureTrainings, 0),
      tf.concat(labelTrainings, 0),
      tf.concat(featureTests, 0),
      tf.concat(labelTests, 0),
    ];
  });
};
Listing 5-72

In train.js. Transforming multidimensional arrays into tensors

The preceding code calls a function called convertToTensors so let’s define it.
const convertToTensors = (featuresData, labelData, testSplit) => {
  if (featuresData.length !== labelData.length) {
    throw new Error(
      "features set and labels set have different numbers of examples"
    );
  }
  // Shuffle the data to avoid having a model that gets used to the order of the samples.
  const [shuffledFeatures, shuffledLabels] = shuffleData(
    featuresData,
    labelData
  );
  // Create the tensor
  const featuresTensor = tf.tensor2d(shuffledFeatures, [
    numSamplesPerGesture,
    totalNumDataPerFile,
  ]);
  // Create a 1D `tf.Tensor` to hold the labels, and convert the number label from the set {0, 1, 2} into one-hot encoding (e.g., 0 --> [1, 0, 0]).
  const labelsTensor = tf.oneHot(
    tf.tensor1d(shuffledLabels).toInt(),
    numClasses
  );
  // Split all this data into training set and test set and return it.
  return split(featuresTensor, labelsTensor, testSplit);
};
Listing 5-73

In train.js. Convert data to tensors

This function calls two other functions, shuffleData and split. Note that shuffleData shuffles an array of indices rather than the features directly, so the same random order can be applied to both the features and the labels, keeping each sample paired with its label.
const shuffleData = (features, labels) => {
  const indices = [...Array(numSamplesPerGesture).keys()];
  tf.util.shuffle(indices);
  const shuffledFeatures = [];
  const shuffledLabels = [];
  features.map((featuresArray, index) => {
    shuffledFeatures.push(features[indices[index]]);
    shuffledLabels.push(labels[indices[index]]);
  });
  return [shuffledFeatures, shuffledLabels];
};
Listing 5-74

In train.js. Shuffle the data

const split = (featuresTensor, labelsTensor, testSplit) => {
  // Split the data into a training set and a test set, based on `testSplit`.
  const numTestExamples = Math.round(numSamplesPerGesture * testSplit);
  const numTrainExamples = numSamplesPerGesture - numTestExamples;
  const trainingFeatures = featuresTensor.slice(
    [0, 0],
    [numTrainExamples, totalNumDataPerFile]
  );
  const testingFeatures = featuresTensor.slice(
    [numTrainExamples, 0],
    [numTestExamples, totalNumDataPerFile]
  );
  const trainingLabels = labelsTensor.slice(
    [0, 0],
    [numTrainExamples, numClasses]
  );
  const testingLabels = labelsTensor.slice(
    [numTrainExamples, 0],
    [numTestExamples, numClasses]
  );
  return [trainingFeatures, trainingLabels, testingFeatures, testingLabels];
};
Listing 5-75

In train.js. Split the data into training and test set

At this point, if you add a console.log statement in the code to log the trainingFeatures variable in the format function, you should get a tensor as output.
Tensor {
  kept: false,
  isDisposedInternal: false,
  shape: [ 12, 600 ],
  dtype: 'float32',
  size: 7200,
  strides: [ 600 ],
  dataId: {},
  id: 70,
  rankType: '2',
  scopeId: 0
}
Listing 5-76

Example of output tensor

The values in the “shape” array will differ depending on how many data samples you train with and the number of lines per file.
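To make the arithmetic concrete: with the constants used in the complete train.js file shown later (numPointsOfData = 6, numLinesPerFile = 100, numSamplesPerGesture = 5, three gesture classes, and a test split of 0.2), each file yields 6 * 100 = 600 features, Math.round(5 * 0.2) = 1 sample per gesture is held out for testing, and the remaining 4 training samples per gesture across 3 gestures give the 12 rows of the [12, 600] shape above.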

Altogether, the code sample starting from the format function should look like the following.
const format = (allData) => {
  let sortedData = allData.sort((a, b) => (a.label > b.label ? 1 : -1));
  sortedData.map((item) => {
    createMultidimentionalArrays(justLabels, item.label, item.label);
    createMultidimentionalArrays(justFeatures, item.label, item.features);
  });
  const [
    trainingFeatures,
    trainingLabels,
    testingFeatures,
    testingLabels,
  ] = transformToTensor(justFeatures, justLabels);
};
function createMultidimentionalArrays(dataArray, index, item) {
  !dataArray[index] && dataArray.push([]);
  dataArray[index].push(item);
}
const transformToTensor = (features, labels) => {
  return tf.tidy(() => {
    const featureTrainings = [];
    const labelTrainings = [];
    const featureTests = [];
    const labelTests = [];
    for (let i = 0; i < gestureClasses.length; ++i) {
      const [
        featureTrain,
        labelTrain,
        featureTest,
        labelTest,
      ] = convertToTensors(features[i], labels[i], 0.2);
      featureTrainings.push(featureTrain);
      labelTrainings.push(labelTrain);
      featureTests.push(featureTest);
      labelTests.push(labelTest);
    }
    return [
      tf.concat(featureTrainings, 0),
      tf.concat(labelTrainings, 0),
      tf.concat(featureTests, 0),
      tf.concat(labelTests, 0),
    ];
  });
};
const convertToTensors = (featuresData, labelData, testSplit) => {
  if (featuresData.length !== labelData.length) {
    throw new Error(
      "features set and labels set have different numbers of examples"
    );
  }
  const [shuffledFeatures, shuffledLabels] = shuffleData(
    featuresData,
    labelData
  );
  const featuresTensor = tf.tensor2d(shuffledFeatures, [
    numSamplesPerGesture,
    totalNumDataPerFile,
  ]);
  const labelsTensor = tf.oneHot(
    tf.tensor1d(shuffledLabels).toInt(),
    numClasses
  );
  return split(featuresTensor, labelsTensor, testSplit);
};
const shuffleData = (features, labels) => {
  const indices = [...Array(numSamplesPerGesture).keys()];
  tf.util.shuffle(indices);
  const shuffledFeatures = [];
  const shuffledLabels = [];
  features.map((featuresArray, index) => {
    shuffledFeatures.push(features[indices[index]]);
    shuffledLabels.push(labels[indices[index]]);
  });
  return [shuffledFeatures, shuffledLabels];
};
const split = (featuresTensor, labelsTensor, testSplit) => {
  const numTestExamples = Math.round(numSamplesPerGesture * testSplit);
  const numTrainExamples = numSamplesPerGesture - numTestExamples;
  const trainingFeatures = featuresTensor.slice(
    [0, 0],
    [numTrainExamples, totalNumDataPerFile]
  );
  const testingFeatures = featuresTensor.slice(
    [numTrainExamples, 0],
    [numTestExamples, totalNumDataPerFile]
  );
  const trainingLabels = labelsTensor.slice(
    [0, 0],
    [numTrainExamples, numClasses]
  );
  const testingLabels = labelsTensor.slice(
    [numTrainExamples, 0],
    [numTestExamples, numClasses]
  );
  return [trainingFeatures, trainingLabels, testingFeatures, testingLabels];
};
Listing 5-77

In train.js. Full code sample for formatting the data

It is a lot to take in if you are new to machine learning and TensorFlow.js, but we are almost there. Our data is formatted and split between a training set and a test set, so the last step is the creation of the model and the training.

5.3.5 Creating and training the model

This part of the code is a bit arbitrary as there are multiple ways to create models and to pick values for parameters. However, you can copy the following code as a starting point and play around with different values later to see how they impact the accuracy of the model.
const createModel = async (featureTrain, labelTrain, featureTest, labelTest) => {
  const params = { learningRate: 0.1, epochs: 40 };
  // Instantiate a sequential model
  const model = tf.sequential();
  // Add a few layers
  model.add(
    tf.layers.dense({
      units: 10,
      activation: "sigmoid",
      inputShape: [featureTrain.shape[1]],
    })
  );
  model.add(tf.layers.dense({ units: numClasses, activation: "softmax" }));
  model.summary();
  const optimizer = tf.train.adam(params.learningRate);
  model.compile({
    optimizer: optimizer,
    loss: "categoricalCrossentropy",
    metrics: ["accuracy"],
  });
  // Train the model with our features and labels
  await model.fit(featureTrain, labelTrain, {
    epochs: params.epochs,
    validationData: [featureTest, labelTest],
  });
  // Save the model in our file system.
  await model.save("file://model");
  return model;
};
Listing 5-78

In train.js. Create, train, and save a model

At the end of our format function, call this createModel function using createModel(trainingFeatures, trainingLabels, testingFeatures, testingLabels).

Now, if everything works fine and you run node train.js in your terminal, you should see the model training and find a model folder in your application!

In case something is not working as expected, here’s what the complete train.js file should look like.
const lineReader = require("line-reader");
var fs = require("fs");
const tf = require("@tensorflow/tfjs-node");
let justFeatures = [];
let justLabels = [];
const gestureClasses = ["letterA", "letterB", "letterC"];
let numClasses = gestureClasses.length;
let numSamplesPerGesture = 5;
let totalNumDataFiles = numSamplesPerGesture * numClasses;
let numPointsOfData = 6;
let numLinesPerFile = 100;
let totalNumDataPerFile = numPointsOfData * numLinesPerFile;
function readFile(file) {
  let allFileData = [];
  return new Promise((resolve, reject) => {
    fs.readFile(`data/${file}`, "utf8", (err, data) => {
      if (err) {
        reject(err);
      } else {
        lineReader.eachLine(`data/${file}`, function (line) {
          let dataArray = line
            .split(" ")
            .map((arrayItem) => parseFloat(arrayItem));
          allFileData.push(...dataArray);
          let concatArray = [...allFileData];
          if (concatArray.length === totalNumDataPerFile) {
            let label = file.split("_")[1];
            let labelIndex = gestureClasses.indexOf(label);
            resolve({ features: concatArray, label: labelIndex });
          }
        });
      }
    });
  });
}
const readDir = () =>
  new Promise((resolve, reject) =>
    fs.readdir(`data/`, "utf8", (err, data) =>
      err ? reject(err) : resolve(data)
    )
  );
(async () => {
  const filenames = await readDir();
  let allData = [];
  filenames.map(async (file) => {
    let originalContent = await readFile(file);
    allData.push(originalContent);
    if (allData.length === totalNumDataFiles) {
      format(allData);
    }
  });
})();
const format = (allData) => {
  let sortedData = allData.sort((a, b) => (a.label > b.label ? 1 : -1));
  sortedData.map((item) => {
    createMultidimentionalArrays(justLabels, item.label, item.label);
    createMultidimentionalArrays(justFeatures, item.label, item.features);
  });
  const [trainingFeatures, trainingLabels, testingFeatures, testingLabels] = transformToTensor(justFeatures, justLabels);
  createModel(trainingFeatures, trainingLabels, testingFeatures, testingLabels);
};
function createMultidimentionalArrays(dataArray, index, item) {
  !dataArray[index] && dataArray.push([]);
  dataArray[index].push(item);
}
const transformToTensor = (features, labels) => {
  return tf.tidy(() => {
    const featureTrainings = [];
    const labelTrainings = [];
    const featureTests = [];
    const labelTests = [];
    for (let i = 0; i < gestureClasses.length; ++i) {
      const [featureTrain, labelTrain, featureTest, labelTest] = convertToTensors(features[i], labels[i], 0.2);
      featureTrainings.push(featureTrain);
      labelTrainings.push(labelTrain);
      featureTests.push(featureTest);
      labelTests.push(labelTest);
    }
    const concatAxis = 0;
    return [
      tf.concat(featureTrainings, concatAxis),
      tf.concat(labelTrainings, concatAxis),
      tf.concat(featureTests, concatAxis),
      tf.concat(labelTests, concatAxis),
    ];
  });
};
const convertToTensors = (featuresData, labelData, testSplit) => {
  if (featuresData.length !== labelData.length) {
    throw new Error(
      "features set and labels set have different numbers of examples"
    );
  }
  const [shuffledFeatures, shuffledLabels] = shuffleData(
    featuresData, labelData);
  const featuresTensor = tf.tensor2d(shuffledFeatures, [
    numSamplesPerGesture,
    totalNumDataPerFile,
  ]);
  const labelsTensor = tf.oneHot(
    tf.tensor1d(shuffledLabels).toInt(),
    numClasses
  );
  return split(featuresTensor, labelsTensor, testSplit);
};
const shuffleData = (features, labels) => {
  const indices = [...Array(numSamplesPerGesture).keys()];
  tf.util.shuffle(indices);
  const shuffledFeatures = [];
  const shuffledLabels = [];
  features.map((featuresArray, index) => {
    shuffledFeatures.push(features[indices[index]]);
    shuffledLabels.push(labels[indices[index]]);
  });
  return [shuffledFeatures, shuffledLabels];
};
const split = (featuresTensor, labelsTensor, testSplit) => {
  const numTestExamples = Math.round(numSamplesPerGesture * testSplit);
  const numTrainExamples = numSamplesPerGesture - numTestExamples;
  const trainingFeatures = featuresTensor.slice(
    [0, 0],
    [numTrainExamples, totalNumDataPerFile]
  );
  const testingFeatures = featuresTensor.slice(
    [numTrainExamples, 0],
    [numTestExamples, totalNumDataPerFile]
  );
  const trainingLabels = labelsTensor.slice(
    [0, 0],
    [numTrainExamples, numClasses]
  );
  const testingLabels = labelsTensor.slice(
    [numTrainExamples, 0],
    [numTestExamples, numClasses]
  );
  return [trainingFeatures, trainingLabels, testingFeatures, testingLabels];
};
const createModel = async (xTrain, yTrain, xTest, yTest) => {
  const params = { learningRate: 0.1, epochs: 40 };
  const model = tf.sequential();
  model.add(
    tf.layers.dense({
      units: 10,
      activation: "sigmoid",
      inputShape: [xTrain.shape[1]],
    })
  );
  model.add(tf.layers.dense({ units: numClasses, activation: "softmax" }));
  model.summary();
  const optimizer = tf.train.adam(params.learningRate);
  model.compile({
    optimizer: optimizer,
    loss: "categoricalCrossentropy",
    metrics: ["accuracy"],
  });
  await model.fit(xTrain, yTrain, {
    epochs: params.epochs,
    validationData: [xTest, yTest],
  });
  await model.save("file://model");
  return model;
};
Listing 5-79

Complete code sample in train.js

The training steps shown in your terminal should look like the following figure.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig56_HTML.jpg
Figure 5-56

Sample output of the training steps

The output shows that the final step of the training reached an accuracy of 0.9, which is really good!
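If you would like to inspect these metrics programmatically at each epoch, rather than only reading the terminal output, TensorFlow.js lets you pass callbacks to model.fit. The following is a minimal sketch, assuming the same model.fit call as in the createModel function above.
await model.fit(xTrain, yTrain, {
  epochs: params.epochs,
  validationData: [xTest, yTest],
  callbacks: {
    // Called by TensorFlow.js at the end of each epoch with the current loss and accuracy values
    onEpochEnd: (epoch, logs) => console.log(`Epoch ${epoch}:`, logs),
  },
});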

Now, to test this with live data, let’s move on to the last step of this project, using our model to generate predictions.

5.3.6 Live predictions

For this last step, let’s create a new JavaScript file called predict.js .

We are going to create a new endpoint called ‘/predict’, serve our index.html file, use similar web sockets code to send motion data from our phone to our server, and run live predictions.

The first small modification is in our initial index.js file in the front-end code. Instead of sending the motion data as a string, we need to send an object containing the accelerometer and gyroscope values, as follows.
let data = {
    xAcc: accelerometerData.x,
    yAcc: accelerometerData.y,
    zAcc: accelerometerData.z,
    xGyro: gyroscopeData.x,
    yGyro: gyroscopeData.y,
    zGyro: gyroscopeData.z,
};
socket.emit("motion data", data);
Listing 5-80

In index.js. Update the shape of the motion data sent via web sockets

As the live data will be fed directly to the model, it is easier to send an object of numbers rather than go through the same formatting steps we used during the training process.

Then, our predict.js file is going to look very similar to our server.js file, with the exception of an additional predict function that feeds live data to the model and generates a prediction about the gesture.
const tf = require("@tensorflow/tfjs-node");
const express = require("express");
const app = express();
var http = require("http").createServer(app);
const io = require("socket.io")(http);
let liveData = [];
let predictionDone = false;
let model;
const gestureClasses = ["letterA", "letterB", "letterC"];
// Create new endpoint
app.use("/predict", express.static(__dirname + "/"));
io.on("connection", async function (socket) {
  // Load the model
  model = await tf.loadLayersModel("file://model/model.json");
  socket.on("motion data", function (data) {
    predictionDone = false;
    // This makes sure the data has the same shape as the one used during training. 600 represents 6 values (x,y,z for accelerometer and gyroscope), collected 100 times.
    if (liveData.length < 600) {
      liveData.push(
        data.xAcc,
        data.yAcc,
        data.zAcc,
        data.xGyro,
        data.yGyro,
        data.zGyro
      );
    }
  });
  socket.on("end motion data", function () {
    if (!predictionDone && liveData.length) {
      predictionDone = true;
      predict(model, liveData);
      liveData = [];
    }
  });
});
const predict = (model, newSampleData) => {
  tf.tidy(() => {
    const inputData = newSampleData;
    // Create a tensor from live data
    const input = tf.tensor2d([inputData], [1, 600]);
    const predictOut = model.predict(input);
    // Access the highest probability
    const winner = gestureClasses[predictOut.argMax(-1).dataSync()[0]];
    console.log(winner);
  });
};
http.listen(process.env.PORT || 3000);
Listing 5-81

In predict.js. Complete code for the predict.js file

If you run the preceding code sample using node predict.js, visit the '/predict' page on your phone, and execute one of the three gestures we trained while holding the screen down, you should see a prediction in the terminal once you release the screen!

When running live predictions, you might come across the following error. It happens when a gesture is executed too quickly and the amount of data collected is lower than the 600 values expected, meaning the data does not have the correct shape for the model to use. If you try again a bit more slowly, it should work.
../images/496132_1_En_5_Chapter/496132_1_En_5_Fig57_HTML.jpg
Figure 5-57

Possible error when a gesture is executed too fast
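
One way to avoid hitting this error is to only run a prediction when a full window of data has been collected, and to discard incomplete samples. Here is a minimal sketch of that idea, assuming the same predict.js code as above.
socket.on("end motion data", function () {
  // Only predict when a complete window of 600 values has been collected;
  // otherwise, discard the incomplete sample and wait for the next gesture.
  if (!predictionDone && liveData.length === 600) {
    predictionDone = true;
    predict(model, liveData);
  }
  liveData = [];
});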

Now that our live predictions work, you could move on to changing some of the parameters used to create the model to see how they impact the predictions, train different gestures, or even send the prediction back to the front end using web sockets to create an interactive application. The main goal of this last section was to cover the steps involved in creating your own machine learning model.
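As an example of that last idea, here is a minimal sketch of sending the prediction back to the browser, assuming the same Socket.IO setup used in predict.js and index.js; the #result element is a hypothetical element you would add to index.html yourself.
// In predict.js, inside the predict function, replace console.log(winner) with:
io.emit("prediction", winner);
// In index.js, in the front-end code, listen for the result:
socket.on("prediction", (gesture) => {
  // Display the predicted gesture in a hypothetical #result element on the page
  document.querySelector("#result").textContent = `Predicted gesture: ${gesture}`;
});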

Over the last few pages, we learned how to access hardware data using the Generic Sensor API, set up a server and web sockets to communicate and share data, save motion data into files, process and transform it, and finally create, train, and use a model to predict live gestures!

Hopefully it gives you a better idea of all the possibilities offered by machine learning and TensorFlow.js.

However, this was a lot of new information if you are new to machine learning, and this last section in particular was quite advanced and experimental, so I would not expect you to understand everything or feel completely comfortable with it yet.

Feel free to go back over the code samples, take your time, and play around with building small prototypes if you are interested.
