E. Wu, D. Maslov, Raspberry Pi Retail Applications, https://doi.org/10.1007/978-1-4842-7951-9_2

2. People Counting

Elaine Wu¹ and Dmitry Maslov¹

(1) Shenzhen, China

Problem Overview

The number of customers that visit a store greatly affects its retail profits, but the traditional ways of assessing customer flow have been prone to error. Computer vision techniques can automate the process of people counting and reporting. The solutions available on the market currently cost at least a few hundred dollars and most often are closed-source, which raises privacy concerns. This chapter analyzes and implements a deep learning-based people counting solution using the Raspberry Pi 4 and a ceiling-mounted, top-view camera. This solution is based on open source technology, it’s easy to implement and maintain, and it has the additional benefit of costing a little over 50 dollars.

The Business Impact

Having data about customer flow during different time periods is crucial to optimizing resource allocation. Management can make decisions about staffing at certain times and in certain parts of the store based on customer flow data, which can help prevent overstaffing/understaffing and can raise customer satisfaction rates. Of course, the impact of customer flow data is not limited to people-management decisions; it can also be used to improve store logistics (when multiple counting devices are installed near different aisles/hallways) and inventory stocking.

Related Knowledge

The most traditional approach to people counting is to have a dedicated person count the number of people entering and leaving the facility. No technical skills are required, but the disadvantages are obvious: someone's time has to be allocated and paid for, and accuracy drops over long periods, since people get tired and take breaks.

You can also use traditional sensors for people counting, such as ultrasonic/time-of-flight or infrared sensors installed in door frames. These solutions are more accurate and cost-effective than manual counting, but they are also more limited. Since all of these sensors essentially measure the distance to an obstacle, they can misread nearby objects as a person coming or going. Incoming and outgoing streams of people also need to be separated between different doors for the count to be accurate. Finally, since the height of detection is fixed, the sensor might not work equally well for people of different heights or for children.

Computer Vision and Its Applications

A more robust approach utilizes cameras and computer vision. Computer vision, often abbreviated as CV, is defined as a field of study that seeks to develop techniques to help computers “see” and understand the content of digital images, such as photographs and videos. It is a multidisciplinary field that can broadly be called a subfield of artificial intelligence and machine learning, which may involve the use of specialized methods and general learning algorithms. See Figure 2-1.

Figure 2-1
The relationship between AI, machine learning, and computer vision

One particular problem in computer vision may be easily addressed with a hand-crafted statistical method, whereas another may require a large and complex ensemble of generalized machine learning algorithms.

A 2010 textbook on computer vision entitled Computer Vision: Algorithms and Applications by Richard Szeliski provides a list of some high-level problems that have seen success using computer vision:

Optical character recognition (OCR)
Machine inspection
Retail (e.g., automated checkouts)
3D model building (photogrammetry)
Medical imaging
Automotive safety
Match move (e.g., merging CGI with live actors in movies)
Motion capture (mocap)
Surveillance
Fingerprint recognition and biometrics

Many popular computer vision applications involve trying to recognize things in photographs or videos, for example:

Image classification: What broad category of object is in this photograph?
Object detection: Where are the objects in the photograph?
Semantic segmentation: What pixels belong to the object in the image?

Think for a moment—which of these is most applicable to the task of people counting?

The right answer is object detection. The output of an image classification algorithm is just an object category, which in this case would be “a person.” It cannot tell us where this person is or how many people there are. Semantic segmentation would provide a pixel-by-pixel classification of the original image. Although it might be suitable for people-counting tasks, it would add unnecessary complexity.

Traditional computer vision algorithms can be used for people detection, such as HOG descriptors or even simple frame differencing to detect motion. When referring to “traditional” computer vision, we mean hand-crafted algorithms consisting of complicated code written by specialists in the field that take into account the inherent complexity of visual perception. The opposite of traditional computer vision algorithms involves using machine learning and deep learning in particular. See Figure 2-2.

Figure 2-2
Deep learning description, as related to machine learning and artificial intelligence

Machine learning studies and creates algorithms that can learn rules from data, be it tabular data, sound, or images. Deep learning is a narrower subset of machine learning, utilizing deep neural networks that can statistically learn general rules from vast amounts of data. Deep learning rose to popularity after AlexNet, a deep neural network developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet image classification competition in 2012 by a wide margin. Since then, deep learning has moved to the spotlight of computer science and machine learning and is widely used for a variety of tasks.

Approaches to Object Detection

There are numerous neural network architectures for object detection. One of the earlier approaches is exemplified in the Region-Based Convolutional Neural Networks (R-CNN) architecture. Given an input image, R-CNN begins by applying a mechanism called selective search to extract regions of interest (ROI). Each ROI is a rectangle that represents the boundary of an object in an image. See Figure 2-3.

Figure 2-3
Working principle of an R-CNN diagram, Source: R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms

Depending on the scenario, there may be as many as 2,000 ROIs. Each ROI is fed through a neural network to produce output features. For each ROI's output features, a collection of support-vector machine classifiers is used to determine what type of object (if any) is contained in the ROI.

Although R-CNN could achieve good accuracy on a variety of object detection benchmarks, the main disadvantage of this architecture was that multiple passes over the regions of interest were required. That meant that it was computationally expensive, which limited its usefulness in embedded systems.

The next step in the evolution of object detection networks was the single shot detector, or SSD, such as the well-known YOLO (You Only Look Once) architecture. As the name suggests, single shot detectors detect objects in the image in a single pass. Let’s see how this is done using YOLOv2, which we are going to use later for person detection.

YOLOv2 divides images into a grid and predicts the presence (or absence) of an object in each grid cell. See Figure 2-4.

Figure 2-4
An example of detection boxes in an image divided into a 3x3 grid

To streamline and speed up the training process, so-called anchors or priors are provided to the network. Anchors are initial sizes (width, height) of bounding boxes, which are resized to the detected object size.

Here's a top-level view on what's going on when a YOLO architecture neural network performs object detection on the image. According to the features detected by the feature extractor network, for each grid cell, a set of predictions is made, which includes the anchor’s offset, anchor probability, and anchor class. The predictions with low probabilities are discarded and you get a set of final predictions.
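The decoding step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the book's repository: it assumes a single class (person), a single anchor, and the box parameterization published for YOLOv2 (sigmoid offsets inside the grid cell, exponential scaling of the anchor size). The raw network outputs, the anchor size, and the 0.5 threshold are hypothetical values chosen for the example, and non-maximum suppression is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_cell(raw, col, row, anchor, grid_size):
    """Decode one raw prediction (tx, ty, tw, th, to) for one anchor in one cell.

    Returns (cx, cy, w, h, confidence), with coordinates as fractions of the image.
    """
    tx, ty, tw, th, to = raw
    anchor_w, anchor_h = anchor                # prior box size, in grid-cell units
    cx = (col + sigmoid(tx)) / grid_size       # offset inside the cell -> image fraction
    cy = (row + sigmoid(ty)) / grid_size
    w = anchor_w * math.exp(tw) / grid_size    # the anchor is resized, not replaced
    h = anchor_h * math.exp(th) / grid_size
    return cx, cy, w, h, sigmoid(to)


def filter_predictions(predictions, threshold=0.5):
    """Discard predictions whose confidence is below the threshold."""
    return [p for p in predictions if p[4] >= threshold]


# Hypothetical raw outputs for two cells of a 3x3 grid, keyed by (col, row).
ANCHOR = (1.0, 2.0)
raw_outputs = {
    (1, 1): (0.0, 0.0, 0.0, 0.0, 3.0),     # high objectness score
    (0, 2): (0.5, -0.5, 0.2, 0.1, -3.0),   # low objectness score
}

decoded = [
    decode_cell(raw, col, row, ANCHOR, grid_size=3)
    for (col, row), raw in raw_outputs.items()
]
kept = filter_predictions(decoded)
print(kept)
```

A real detector repeats this for every cell and every anchor, multiplies the confidence by the class probabilities, and then removes overlapping boxes.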

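Once one or more people are detected, the application tracks each detected person and assigns them a unique ID. It also tracks the direction in which each ID is moving. If an ID crosses the dividing line, the application counts that ID as entering or leaving the area. Note that this tracking and counting logic runs on top of the detections; the YOLO network itself only produces bounding boxes.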

Before we start with a practical implementation, it is important to mention the limitations of this technique. As you may know, no single method can handle all situations. The computer vision approach performs very well in brightly lit, uncluttered environments, when the camera is installed above the detection area at a 70-90 degree angle. It will not perform as well if the line of sight is obstructed or when a large number of people (5-10) leave or enter the area simultaneously. In the second case, it will still count people, but the counting accuracy might decrease.

Implementing Object Detection

In this section you learn how to:

1) Install a 64-bit Raspberry Pi OS image on Raspberry Pi 4.
2) Install the necessary packages.
3) Execute people counting code with a pretrained MobileNet SSD model.
4) Train and optimize your own neural network model, which is specifically for people detection from a top-view camera.

Install a 64-bit Raspberry Pi OS on Raspberry Pi 4

For this particular project, you’re going to use Raspberry Pi OS 64-bit instead of the 32-bit Raspberry Pi OS, since 64-bit support is particularly important for efficient, optimized execution of a neural network inference with TensorFlow Lite.

Raspberry Pi recommends that you use the Raspberry Pi Imager to install an operating system on your SD card. You need another computer with an SD card reader to install the image.

Using the Raspberry Pi Imager

Raspberry Pi has developed a graphical SD card writing tool called Raspberry Pi Imager that works on macOS, Ubuntu 18.04, and Windows. This is the easiest option for most users, since it will download the image automatically and install it to the SD card.

Download the latest version of Raspberry Pi Imager from www.raspberrypi.com/software/ and install it on your main computer. The exact installation instructions depend on your OS. Additionally, since the 64-bit image was still in the development stage at the moment of writing this book, you’ll need to download it manually from https://downloads.raspberrypi.org/raspios_lite_arm64/images/. The exact image used for the projects in this book was 2021-05-07-raspios-buster-arm64-lite.zip.

Insert the SD card into your PC and run the Raspberry Pi Imager. Click Choose OS and scroll down to Use Custom, as shown in Figure 2-5.

Figure 2-5
Choosing a custom image in Raspberry Pi Imager

Choose the raspios-buster-arm64-lite.zip archive you downloaded. Since you’re going to use the Raspberry Pi headless, meaning without a keyboard and screen connected, it is necessary to enable SSH and specify the WiFi network name and password (if you want to connect the Raspberry Pi to the Internet with WiFi). You can do all of that in the Advanced options (press Ctrl+Shift+X). See Figure 2-6.

Figure 2-6
Advanced configuration in Raspberry Pi Imager

Save the changes and click the Write button in the Raspberry Pi Imager tool. After a few minutes, the flash process should be complete and you can eject the SD card and insert it into your Raspberry Pi 4.

Install the Necessary Packages

Since you are running the Raspberry Pi 4 headless, you need a convenient way to edit code and move files to and from it. You could use your operating system's built-in SSH client and access the board by typing the following command and then entering the password:

ssh pi@[your-pi-ip-address]

Once your Raspberry Pi 4 is powered on and connected to your router, you can find its IP address on the router's configuration page (normally in the DHCP Client section, although this differs between router models). See Figure 2-7.

Figure 2-7
Accessing Raspberry Pi 4 by using the default SSH client in Ubuntu 20.04

Instead, we recommend using Visual Studio Code, together with the official extension, Remote - SSH. This will allow you to edit code in a more powerful and comfortable IDE and transfer files between the Pi and the computer using a graphical user interface.

To install Visual Studio Code, go to https://code.visualstudio.com/ and follow the installation instructions for your platform. After launching Visual Studio Code, click Extensions and search for Remote - SSH. After a short installation and Visual Studio Code restart, you will see a green button in the bottom-left corner, which allows you to connect to a remote device. See Figure 2-8.

Figure 2-8
Click the green button to start the SSH connection configuration

Click that button, choose Connect Current Window to Host, then choose Add New SSH Host. After that, type pi@[your-pi-ip-address]. Click the green button again, then choose Connect Current Window to Host, and then choose pi@[your-pi-ip-address]. That will start the connection process. Click Open Folder and then OK for the default choice (normally /home/pi/). See Figure 2-9.

Figure 2-9
Visual Studio Code interface after successful SSH connection

Note

If your Raspberry Pi 4 IP address changes, you need to add the new SSH host again.

Raspberry Pi OS Image comes preinstalled with Python, along with other important software to get the project up and running. However, you need to manually install pip (Python Package Manager) for Python 3. You can do that by executing the following commands:

sudo apt-get update

sudo apt-get install python3-pip git

To install the necessary packages, git clone the repository for this book, then run these commands:

cd Chapter_2

pip3 install --upgrade pip

pip3 install --extra-index-url https://google-coral.github.io/py-repo/ tflite_runtime

pip3 install -r requirements.txt

The last command will download and install all the remaining Python packages that the project needs on your Raspberry Pi.
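Before moving on, it can be useful to confirm that the Pi is really running a 64-bit system and that the TensorFlow Lite runtime was installed. The following is a small sanity check, not part of the book's repository; the helper names are made up for this example, and it assumes the package is importable under the name tflite_runtime, as in the pip command above.

```python
import importlib.util
import platform


def is_64bit_arm():
    """True when the kernel reports a 64-bit ARM architecture."""
    # 64-bit Raspberry Pi OS reports "aarch64"; some other systems use "arm64".
    return platform.machine() in ("aarch64", "arm64")


def has_module(name):
    """True when a top-level module with this name can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


print("64-bit ARM system:", is_64bit_arm())
print("tflite_runtime installed:", has_module("tflite_runtime"))
```

If either line prints False on the Pi, recheck the image you flashed or rerun the pip3 commands.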

Execute People Counting Code with a Pretrained MobileNet SSD Model

First, you’re going to use a readily-available MobileNet SSD model trained on a Pascal VOC dataset to detect objects of 20 different classes, including people. You can find the code for this exercise in the Chapter_2/Exercise_1 folder. Let’s go through the most important pieces of code.

detector_video.py is the main script, inside of which you import the TensorFlow Lite interpreter, the Flask web server, and other helper packages and functions. You’ll create a web app, which will render an index.html template with an image placeholder. This image placeholder will access the /video_stream route of the app, which yields processed video streams from the camera.
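To make the streaming idea more concrete, here is a minimal sketch of how a sequence of JPEG-encoded frames can be wrapped into a multipart (MJPEG) stream that a browser image placeholder can display. It is an illustration under stated assumptions, not the code from detector_video.py: the function name and the boundary string are invented for the example, and the frames here are placeholder bytes rather than real JPEG data.

```python
BOUNDARY = b"frame"  # hypothetical boundary name; it must match the response header


def mjpeg_chunks(jpeg_frames):
    """Wrap each JPEG-encoded frame (bytes) as one part of a multipart stream."""
    for jpeg in jpeg_frames:
        yield (
            b"--" + BOUNDARY + b"\r\n"
            b"Content-Type: image/jpeg\r\n\r\n" + jpeg + b"\r\n"
        )


# Placeholder payloads standing in for frames encoded by the camera loop.
chunks = list(mjpeg_chunks([b"<jpeg bytes 1>", b"<jpeg bytes 2>"]))
print(chunks[0])
```

In a Flask app, a generator like this would typically be returned from the streaming route wrapped in a Response object whose mimetype is multipart/x-mixed-replace with the same boundary, so that the browser replaces the image each time a new part arrives.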

After acquiring a frame with OpenCV, you pass it to detector, the instance of the Detector class you initialized earlier. The image is preprocessed and makes a forward pass through the neural network model. You parse the results in the run method of the Detector class. Then, in the draw_overlay() method, for each detected bounding box, you check whether the box centroid is sufficiently close to the centroid of another ID that is already registered and tracked in the self.people_list list.

If these centroids are sufficiently close (Euclidean distance between the two pairs of coordinates is used as a metric), you assign an existing ID to them; if they are not, you assign a new ID to that bounding box. Additionally, if an ID coordinate is not updated during a set amount of frames, you delete it from the self.people_list list. Each ID also has a direction property associated with it, which is calculated as the moving average of the last 12 y-coordinates of the ID. So, if the ID has been moving up the image, it will have a positive direction and if it has been moving down the image, the ID will have a negative direction. The program uses that property to count people as leaving or entering, when they cross a central line. If the person’s ID is moving up the image (in a positive direction) and is above the central line, it counts them as entering. If the person’s ID is moving down the image (in a negative direction) and is below the central line, it counts them as leaving. The leaving-entering direction can be swapped if desired.
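The tracking and counting logic described above can be sketched as follows. This is a simplified illustration rather than the repository's implementation: the class names, the distance threshold, and the number of tolerated missed frames are hypothetical, matching is done greedily, and "direction" is interpreted here as the mean of the last 12 y-coordinates minus the current y-coordinate (image y grows downward, so moving up gives a positive value). As an extra safeguard that the text does not mention, an ID is only counted once, and only if the oldest stored position lies on the other side of the line.

```python
import math
from collections import deque


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


class Track:
    def __init__(self, track_id, centroid, history=12):
        self.id = track_id
        self.centroid = centroid
        self.ys = deque([centroid[1]], maxlen=history)  # last y-coordinates
        self.missed = 0        # consecutive frames without a matching detection
        self.counted = False   # each ID is counted at most once

    def update(self, centroid):
        self.centroid = centroid
        self.ys.append(centroid[1])
        self.missed = 0

    @property
    def direction(self):
        # Positive when the ID has been moving up the image, negative when moving down.
        return sum(self.ys) / len(self.ys) - self.centroid[1]


class CentroidTracker:
    def __init__(self, line_y, max_distance=50.0, max_missed=10):
        self.line_y = line_y
        self.max_distance = max_distance
        self.max_missed = max_missed
        self.tracks = []
        self.next_id = 0
        self.entered = 0
        self.left = 0

    def step(self, centroids):
        """Process the centroids detected in one frame."""
        unmatched = list(self.tracks)
        for c in centroids:
            best = min(unmatched, key=lambda t: distance(t.centroid, c), default=None)
            if best is not None and distance(best.centroid, c) <= self.max_distance:
                best.update(c)              # close enough: reuse the existing ID
                unmatched.remove(best)
            else:
                self.tracks.append(Track(self.next_id, c))  # otherwise register a new ID
                self.next_id += 1
        for t in unmatched:
            t.missed += 1
        # Forget IDs that have not been updated for too many frames.
        self.tracks = [t for t in self.tracks if t.missed <= self.max_missed]
        for t in self.tracks:
            self._count(t)

    def _count(self, t):
        if t.counted:
            return
        y, oldest_y = t.centroid[1], t.ys[0]
        if t.direction > 0 and y < self.line_y <= oldest_y:
            self.entered += 1
            t.counted = True
        elif t.direction < 0 and y > self.line_y >= oldest_y:
            self.left += 1
            t.counted = True


# One person walks up the image at x=50 while another walks down at x=200.
tracker = CentroidTracker(line_y=100)
for step in range(7):
    tracker.step([(50, 160 - 20 * step), (200, 40 + 20 * step)])
print("entered:", tracker.entered, "left:", tracker.left)
```

Greedy nearest-centroid matching is easy to follow but can swap IDs when people walk close together; that is one reason accuracy drops in crowded scenes.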

Finally, you draw centroids, boxes for each ID, and display the total number of people who entered and left using OpenCV drawing functions on the screen.

To check the model and the script on the example video, run the following command from the Exercise_1 folder:

python3 detector_video.py --model models/MobileNet-YOLOv2-224_224.tflite --labels models/MobileNet-YOLOv2-224_224.txt --source file --file ../video_samples/example_01.mp4

Then open the http://[your-pi-ip-address]:5000/ web page. Your Raspberry Pi needs to be on the same network as your computer. You will see the output of the video stream displayed in the web browser, as shown in Figure 2-10.

Figure 2-10
Inference on a prerecorded video file

To use the stream from the web camera, enter the following command on your Raspberry Pi:

python3 detector_video.py --model models/MobileNet-YOLOv2-224_224.tflite --labels models/MobileNet-YOLOv2-224_224.txt --source cv

Finally, to use a PiCamera module, use this command:

python3 detector_video.py --model models/MobileNet-YOLOv2-224_224.tflite --labels models/MobileNet-YOLOv2-224_224.txt --source picamera
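The three commands above differ only in their --source and --file arguments. As a rough sketch of how such flags could be parsed (this is not the actual argument handling in detector_video.py, and the default source and the help texts are assumptions), consider:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="People counting demo (sketch)")
    parser.add_argument("--model", required=True, help="Path to the .tflite model file")
    parser.add_argument("--labels", required=True, help="Path to the labels text file")
    parser.add_argument(
        "--source",
        choices=["file", "cv", "picamera"],
        default="cv",  # assumed default, chosen only for this example
        help="Video source: a file, an OpenCV web camera, or a PiCamera module",
    )
    parser.add_argument("--file", help="Video file path, used when --source is 'file'")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.source == "file" and not args.file:
        parser.error("--file is required when --source is 'file'")
    return args


example = parse_args([
    "--model", "models/MobileNet-YOLOv2-224_224.tflite",
    "--labels", "models/MobileNet-YOLOv2-224_224.txt",
    "--source", "file",
    "--file", "../video_samples/example_01.mp4",
])
print(example.source, example.file)
```

Restricting --source with choices means a typo in the source name fails immediately with a usage message instead of failing later when the camera is opened.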

What you will see is that the performance of the detection model is better when used on frontal images of people. This is because it is a generic detection model trained on a Pascal VOC dataset to detect 20 classes of different objects, including cats, dogs, and potted plants. Most of the images of people in the dataset are taken with handheld cameras, which is why this model doesn’t recognize people in top-view cameras that well. So, the next step is to use a custom dataset of images taken from top-view cameras to create a more specialized model with better accuracy.

Train and Optimize Your Own Neural Network Model

This section covers training a model specifically for people detection from a top-view camera. Training a neural network for object detection can be a daunting task for a beginner, especially if you plan to deploy it to an embedded device, such as Raspberry Pi 4. For the purposes of this chapter, we use aXeleRate, a Keras-based framework for AI on the Edge (i.e., AI deployed to embedded devices). See Figure 2-11.

aXeleRate allows for an easy, streamlined training and conversion process, where the user simply needs to add the data and configuration on the one end and receives a trained model, already converted to the target platform, on the other end. aXeleRate can be run in Google Colab, an interactive Jupyter Notebook service by Google, or on a local machine. Use the local machine-training option if you have an NVIDIA GPU and native installation of Ubuntu 18.04 or 20.04. Otherwise, it is recommended to use Google Colab, since it comes with all the required packages. Also, at the time of this writing, Google provides a certain amount of GPU hours for free accounts.

You can find the dataset in the materials for this book. It has the following folder structure:

imgs contains training images
anns contains training annotations in Pascal VOC format
imgs_validation contains validation images
anns_validation contains validation annotations in Pascal VOC format

The training and validation dataset comes primarily from three sources:

Synthetic images generated with NVIDIA Isaac SDK
Images converted from PIROPO database videos
Personal recordings of the author converted to images

If you want to use your own dataset or add some samples to the existing one, you can use any object-detection dataset-annotation tool available, as long as it supports exporting to the Pascal VOC format.
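If you add your own samples, it helps to know what a Pascal VOC annotation looks like and to be able to check it programmatically. The sketch below parses a small, made-up annotation with Python's standard library; the file name, image size, and box coordinates are invented for illustration, and real annotation files usually contain additional fields.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical annotation in Pascal VOC format.
SAMPLE = """<annotation>
  <filename>frame_0001.jpg</filename>
  <size><width>224</width><height>224</height><depth>3</depth></size>
  <object>
    <name>person</name>
    <bndbox><xmin>30</xmin><ymin>40</ymin><xmax>90</xmax><ymax>120</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(float(box.findtext(tag)))  # some tools write coordinates as floats
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return root.findtext("filename"), boxes


print(parse_voc(SAMPLE))
```

Running a check like this over every file in the anns folder is a quick way to catch misspelled labels or boxes that fall outside the image before you start training.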

For training in the Colab notebook, use Visual Studio Code to open the aXeleRate_people_topview.ipynb file in the Chapter_2/training folder, click Open in Colab, and follow the instructions in the notebook.

For local training, on your Ubuntu 18.04 (or 20.04) PC, install the Anaconda virtual environment manager and create a new environment:

conda create -n ml python=3.8

Then activate it with this command:

conda activate ml

Install the CUDA toolkit and NVIDIA packages for GPU-enabled training with this command:

conda install tensorflow-gpu~=2.4

And finally install aXeleRate in the environment with pip:

git clone https://github.com/AIWintermuteAI/aXeleRate.git

cd aXeleRate

pip install -e .

Place the Chapter_2/training/people_topview.json file inside the configs folder in the aXeleRate repository. Change the path to the training and validation image/annotation folders in the .json config file and then run the following command to start the training (see Figure 2-12):

python axelerate/train.py --config configs/people_topview.json

Figure 2-12
Local training on Ubuntu 20.04 PC with NVIDIA GPU

After training is completed, you can see the trained model in the projects/people_topview/[time-of-training-session] directory. Copy the resulting model file (with the .tflite extension) to the Chapter_2/models/ folder.

The command used to launch an inference with the new model is similar to the ones you used with the pretrained model. From the Exercise_2 folder, run this command:

python3 detector_video.py --model models/YOLO_best_mAP.tflite --labels models/labels.txt --source file --file ../video_samples/example_01.mp4

You will notice that there is less flickering of the bounding boxes, which means the model is better at detecting people from top-view footage. However, if you use a web camera or Pi camera and point it at the room, you might notice that the model you trained outputs a lot of false detections. This is expected, since the model is only trained to detect people from a top-view camera and will not perform well when presented with an entirely new view perspective. See Figure 2-13.

Figure 2-13
The proper camera positioning for top-view people counting camera device

For best results, mount the Raspberry Pi 4 and the camera on the ceiling above the door frame or walkway, approximately 3-4 meters (10-12 feet) from the floor. In the “Pro Tips” section, you’ll find example scripts and ideas on how to add more functionality to the basic implementation.

Pro Tips

While this script can work in a simple scenario, where you have only a single camera and are willing to manually record the count at the end of each day, you will probably want to automate the process even more. You can do that by connecting each Pi instance to a local SQL database or a cloud-based database. For example, you can use the SQLite database management system to store the people count in a local filesystem. This will enable you to query the count for any day or for a list of days that you want. You can learn more about SQLite at its official website.

If you want to store the people count in the cloud, you can use a database service provided by a cloud provider or one that runs on your own server (like the one deployed in the previous section). You can schedule a script to run each day at a specific time and record the people count. The script can then push the data to the cloud or local database.
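As a rough sketch of the local SQLite option, the following example stores one row per day and queries a date range using Python's built-in sqlite3 module. The table layout, the function names, the database file name, and the sample counts are all assumptions made for illustration; they do not come from the book's repository.

```python
import sqlite3
from datetime import date


def open_db(path="people_count.db"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_counts (
               day TEXT PRIMARY KEY,        -- ISO date, e.g. 2021-05-07
               entered INTEGER NOT NULL,
               left_count INTEGER NOT NULL
           )"""
    )
    return conn


def record_day(conn, day, entered, left_count):
    """Insert the totals for one day, replacing an earlier row for the same day."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_counts VALUES (?, ?, ?)",
        (day.isoformat(), entered, left_count),
    )
    conn.commit()


def counts_between(conn, first_day, last_day):
    """Return (day, entered, left_count) rows for an inclusive date range."""
    cursor = conn.execute(
        "SELECT day, entered, left_count FROM daily_counts "
        "WHERE day BETWEEN ? AND ? ORDER BY day",
        (first_day.isoformat(), last_day.isoformat()),
    )
    return cursor.fetchall()


# An in-memory database keeps the example self-contained; use a file path on the Pi.
conn = open_db(":memory:")
record_day(conn, date(2021, 5, 7), 120, 118)   # made-up sample counts
record_day(conn, date(2021, 5, 8), 95, 95)
rows = counts_between(conn, date(2021, 5, 7), date(2021, 5, 7))
print(rows)
```

Storing dates as ISO strings keeps them sortable as plain text, which is why the BETWEEN query works without any date functions. A daily job scheduled with a tool such as cron could call record_day with the totals reported by the counting script.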

To take full advantage of Raspberry Pi 4 CPU performance, when using the 64-bit Raspberry Pi OS image, you also need a custom-built version of TensorFlow Lite with threading support and XNNPACK optimized delegate enabled. You can find this in the Chapter_2/Pro-tips folder. Install it with pip3.

Another thing to consider if you are building a scalable solution is the choice of board for the device. While the standard Raspberry Pi 4 is great for prototyping, thanks to the rich selection of interfaces on the development board, those interfaces become a disadvantage when deploying at scale, mostly because your device doesn’t need four USB ports and a display port. If your device is going to be deployed in an uncluttered environment and can tolerate a higher false-positive rate, you can train a smaller model and then use a previous-generation board, the Raspberry Pi 3 Model A+. This is a stripped-down version of the Raspberry Pi 3 Model B+ with only one USB port, and it costs only 25 USD. An alternative option is to use the Raspberry Pi Compute Module 4, although in that case you need to design and manufacture your own carrier board. We discuss compute module applications and board design in later chapters.

Summary

This chapter introduced and explained the concepts of computer vision and deep learning and demonstrated their application for people counting in retail facilities. You learned how to train your own neural network model for people detection from a top-view camera and then convert it and optimize it for deployment on your Raspberry Pi 4.
