Problem Overview
The number of customers that visit a store greatly affects its retail profits, but the traditional ways of assessing customer flow have been prone to error. Computer vision techniques can automate the process of people counting and reporting. The solutions currently available on the market cost at least a few hundred dollars and are most often closed-source, which raises privacy concerns. This chapter analyzes and implements a deep learning-based people counting solution using the Raspberry Pi 4 and a ceiling-mounted, top-view camera. This solution is based on open source technology, is easy to implement and maintain, and has the additional benefit of costing a little over 50 dollars.
The Business Impact
Having data about customer flow during different time periods is crucial to optimizing resource allocation. Management can make decisions about staffing at certain times and in certain parts of the store based on customer flow data, which can help prevent overstaffing/understaffing and can raise customer satisfaction rates. Of course, the impact of customer flow data is not limited to people-management decisions; it can also be used to improve store logistics (when multiple counting devices are installed near different aisles/hallways) and inventory stocking.
Related Knowledge
The most traditional approach to people counting is to have a dedicated person count the number of people entering and leaving the facility. No technical skills are required, but the disadvantages are obvious: the cost of allocating a person's time, and low accuracy over long periods, since people get tired and take breaks.
You can also use traditional sensors for people counting, such as ultrasonic, time-of-flight, or infrared sensors installed in door frames. These solutions are more accurate and cost-effective than manual counting, but they have limitations of their own. Since all of these sensors essentially check the distance to an obstacle, they can misread nearby objects as a person coming or going. Incoming and outgoing streams of people also need to be separated between different doors for the count to be accurate. Finally, since the height of detection is fixed, the sensor might not work equally well for people of different heights or for children.
Computer Vision and Its Applications
One particular problem in computer vision may be easily addressed with a hand-crafted statistical method, whereas another may require a large and complex ensemble of generalized machine learning algorithms.
Optical character recognition (OCR)
Machine inspection
Retail (e.g., automated checkouts)
3D model building (photogrammetry)
Medical imaging
Automotive safety
Match move (e.g., merging CGI with live actors in movies)
Motion capture (mocap)
Surveillance
Fingerprint recognition and biometrics
Image classification: What broad category of object is in this photograph?
Object detection: Where are the objects in the photograph?
Semantic segmentation: What pixels belong to the object in the image?
Think for a moment—which of these is most applicable to the task of people counting?
The right answer is object detection. The output of an image classification algorithm is just an object category; in this case it would be “a person.” It cannot tell us where this person is or how many people there are. Semantic segmentation would provide a pixel-by-pixel classification of the original image. Although it could be used for people counting, it would add unnecessary complexity.
Machine learning studies and creates algorithms that can learn rules from data, be it tabular data, sound, or images. Deep learning is a subset of machine learning that uses deep neural networks to statistically learn general rules from vast amounts of data. Deep learning rose to popularity after AlexNet, a deep neural network created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet image classification competition in 2012 by a wide margin. Since then, deep learning has moved into the spotlight of computer science and machine learning and is widely used for a variety of tasks.
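To make the difference concrete, here is a minimal sketch of how a people count falls out of an object detector's output. The Detection class, the count_people() helper, and the 0.5 score threshold are hypothetical names and values chosen for illustration; they are not taken from the chapter's code.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure for illustration)."""
    label: str    # predicted class name
    score: float  # confidence between 0 and 1
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def count_people(detections, min_score=0.5):
    """Count detections labeled 'person' with a score at or above min_score."""
    return sum(1 for d in detections
               if d.label == "person" and d.score >= min_score)


detections = [
    Detection("person", 0.91, (40, 60, 120, 220)),
    Detection("person", 0.78, (200, 50, 280, 210)),
    Detection("potted plant", 0.88, (300, 100, 340, 180)),
    Detection("person", 0.30, (10, 10, 30, 40)),  # below the threshold
]
print(count_people(detections))  # prints 2
```

An image classifier would return only the single label "person" for the same frame, which is why it cannot be used for counting on its own.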
Approaches to Object Detection
Depending on the scenario, there may be as many as 2,000 regions of interest (ROIs). Each ROI is fed through a neural network to produce output features. For each ROI's output features, a collection of support-vector machine classifiers is used to determine what type of object (if any) is contained in the ROI.
Although R-CNN could achieve good accuracy on a variety of object detection benchmarks, the main disadvantage of this architecture was that multiple passes over the regions of interest were required. That meant that it was computationally expensive, which limited its usefulness in embedded systems.
The next step in the evolution of object detection networks was single shot detectors, or SSDs, such as the very well-known YOLO (You Only Look Once) architecture. As the name suggests, single shot detectors detect objects in the image in a single pass. Let’s see how this is done using YOLOv2, which we are going to use later for person detection.
To streamline and speed up the training process, so-called anchors or priors are provided to the network. Anchors are initial sizes (width, height) of bounding boxes, which are resized to the detected object size.
Here's a top-level view on what's going on when a YOLO architecture neural network performs object detection on the image. According to the features detected by the feature extractor network, for each grid cell, a set of predictions is made, which includes the anchor’s offset, anchor probability, and anchor class. The predictions with low probabilities are discarded and you get a set of final predictions.
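The following is a minimal sketch of how one anchor prediction can be decoded and filtered, using the box equations from the YOLOv2 paper: the center offset is squashed with a sigmoid and added to the grid cell position, and the anchor's width and height are scaled exponentially. It assumes a single class (person), so the class probability is omitted. The function names, the 7x7 grid, the anchor size, and the 0.5 threshold are illustrative assumptions, not values from the chapter's model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_prediction(raw, col, row, anchor, grid_size):
    """Decode one raw prediction (tx, ty, tw, th, to) for one anchor.

    anchor is (width, height) in grid-cell units. Returns
    (center_x, center_y, width, height, confidence), with the box
    normalized to the 0..1 range of the image.
    """
    tx, ty, tw, th, to = raw
    anchor_w, anchor_h = anchor
    center_x = (col + sigmoid(tx)) / grid_size   # offset inside the cell
    center_y = (row + sigmoid(ty)) / grid_size
    width = anchor_w * math.exp(tw) / grid_size  # anchor resized to the object
    height = anchor_h * math.exp(th) / grid_size
    confidence = sigmoid(to)
    return (center_x, center_y, width, height, confidence)


def filter_predictions(boxes, threshold=0.5):
    """Discard predictions whose confidence is below the threshold."""
    return [b for b in boxes if b[4] >= threshold]


GRID = 7             # assumed 7x7 output grid
ANCHOR = (1.5, 3.0)  # assumed anchor size in grid cells

boxes = [
    decode_prediction((0.0, 0.0, 0.0, 0.0, 2.0), 3, 3, ANCHOR, GRID),
    decode_prediction((0.0, 0.0, 0.0, 0.0, -2.0), 0, 0, ANCHOR, GRID),
]
kept = filter_predictions(boxes)
print(len(kept), kept[0])
```

A real pipeline would normally also apply non-maximum suppression to merge overlapping boxes; that step is left out here to keep the sketch short.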
Once one or more people are detected, the program tracks each detected person and assigns them a unique ID. It also tracks the direction in which each ID moves. If an ID crosses the dividing line, the program counts it as entering or leaving the area. This tracking and counting logic runs on top of the detector's output; YOLO itself only produces the detections.
Before we start with a practical implementation, it is important to mention the limitations of this technique. As you may know, no single method can handle all situations. The computer vision approach performs very well in brightly lit, uncluttered environments, when the camera is installed above the detection area at a 70-90 degree angle. It will not perform as well if the line of sight is obstructed or when large groups of people (5-10) leave or enter the area simultaneously. In the second case, it will still count people, but the counting accuracy might decrease.
Implementing Object Detection
1) Install a 64-bit Raspberry Pi OS image on Raspberry Pi 4.
2) Install the necessary packages.
3) Execute people counting code with a pretrained MobileNet SSD model.
4) Train and optimize your own neural network model, which is specifically for people detection from a top-view camera.
Install a 64-bit Raspberry Pi OS on Raspberry Pi 4
For this particular project, you’re going to use the 64-bit Raspberry Pi OS instead of the 32-bit version, since 64-bit support is particularly important for efficient, optimized execution of neural network inference with TensorFlow Lite.
Raspberry Pi recommends that you use the Raspberry Pi Imager to install an operating system on your SD card. You need another computer with an SD card reader to install the image.
Using the Raspberry Pi Imager
Raspberry Pi has developed a graphical SD card writing tool called Raspberry Pi Imager that works on macOS, Ubuntu 18.04, and Windows. This is the easiest option for most users, since it will download the image automatically and install it to the SD card.
Download the latest version of Raspberry Pi Imager from www.raspberrypi.com/software/ and install it on your main computer. The exact installation instructions depend on your OS. Additionally, since the 64-bit image was still in the development stage at the moment of writing this book, you’ll need to download it manually from https://downloads.raspberrypi.org/raspios_lite_arm64/images/. The exact image used for the projects in this book was 2021-05-07-raspios-buster-arm64-lite.zip.
Save the changes and click the Write button in the Raspberry Pi Imager tool. After a few minutes, the flash process should be complete and you can eject the SD card and insert it into your Raspberry Pi 4.
Install the Necessary Packages
Instead, we recommend using Visual Studio Code, together with the official extension, Remote - SSH. This will allow you to edit code in a more powerful and comfortable IDE and transfer files between the Pi and the computer using a graphical user interface.
If your Raspberry Pi 4 IP address changes, you need to add the new SSH host again.
The second command will download and install all the necessary Python packages to your Raspberry Pi.
Execute People Counting Code with a Pretrained MobileNet SSD Model
First, you’re going to use a readily-available MobileNet SSD model trained on a Pascal VOC dataset to detect objects of 20 different classes, including people. You can find the code for this exercise in the Chapter_2/Exercise_1 folder. Let’s go through the most important pieces of code.
detector_video.py is the main script, inside of which you import the TensorFlow Lite interpreter, the Flask web server, and other helper packages and functions. You’ll create a web app, which will render an index.html template with an image placeholder. This image placeholder will access the /video_stream route of the app, which yields processed video streams from the camera.
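A common way to serve such a stream is MJPEG over HTTP, where each JPEG-encoded frame is sent as one part of a multipart response. The sketch below shows only the framing step with the standard library; the mjpeg_chunks() name and the "frame" boundary are assumptions for illustration, and the actual detector_video.py may be organized differently.

```python
def mjpeg_chunks(jpeg_frames, boundary=b"frame"):
    """Wrap each JPEG-encoded frame as one part of a multipart stream."""
    for jpeg in jpeg_frames:
        header = b"--" + boundary + b"\r\nContent-Type: image/jpeg\r\n\r\n"
        yield header + jpeg + b"\r\n"


# In a Flask app, a generator like this is typically returned from the
# route that the <img> placeholder points to, for example:
#   Response(mjpeg_chunks(frames),
#            mimetype="multipart/x-mixed-replace; boundary=frame")
# where frames is an iterable of JPEG bytes produced after detection.

fake_frames = [b"\xff\xd8first", b"\xff\xd8second"]  # stand-ins for JPEG data
chunks = list(mjpeg_chunks(fake_frames))
print(len(chunks))
```

The browser replaces the displayed image each time a new part arrives, which gives the appearance of live video without any JavaScript on the page.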
After acquiring a frame with OpenCV, you pass it to the detector instance of the Detector class you initialized earlier. The image is preprocessed and makes a forward pass through the neural network model. You parse the results in the run method of the Detector class. Then, in the draw_overlay() method, for each detected bounding box, you check whether the box centroid is sufficiently close to the centroid of an existing ID, which is registered and tracked in the self.people_list list.
If the two centroids are sufficiently close (the Euclidean distance between the two pairs of coordinates is used as the metric), you assign the existing ID to that bounding box; if they are not, you assign it a new ID. Additionally, if an ID's coordinates are not updated for a set number of frames, you delete it from the self.people_list list. Each ID also has a direction property associated with it, which is calculated from the moving average of the last 12 y-coordinates of the ID. If the ID has been moving up the image, it has a positive direction, and if it has been moving down the image, it has a negative direction. The program uses that property to count people as leaving or entering when they cross a central line. If the person’s ID is moving up the image (in a positive direction) and is above the central line, it counts them as entering. If the person’s ID is moving down the image (in a negative direction) and is below the central line, it counts them as leaving. The leaving-entering direction can be swapped if desired.
Finally, you use OpenCV drawing functions to draw the centroid and box of each ID on the frame and to display the total number of people who entered and left.
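The tracking and counting rules described above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the code from Chapter_2/Exercise_1: the class names, the 50-pixel matching distance, and the 10-frame timeout are assumptions, matching is greedy, and direction is computed as the mean of the recent y-coordinates minus the current y, which is one way to obtain a value that is positive when an ID moves up the image.

```python
import math
from collections import deque


class TrackedPerson:
    def __init__(self, person_id, centroid, history=12):
        self.id = person_id
        self.centroid = centroid
        self.ys = deque([centroid[1]], maxlen=history)  # last 12 y values
        self.missed = 0       # frames since the last update
        self.counted = False

    @property
    def direction(self):
        # Image y grows downward, so moving up gives a positive value.
        return sum(self.ys) / len(self.ys) - self.centroid[1]


class LineCounter:
    def __init__(self, line_y, max_distance=50.0, max_missed=10):
        self.line_y = line_y
        self.max_distance = max_distance
        self.max_missed = max_missed
        self.people = {}
        self.next_id = 0
        self.entered = 0
        self.left = 0

    def update(self, centroids):
        """Process the (x, y) centroids detected in one frame."""
        matched = set()
        for c in centroids:
            best, best_dist = None, self.max_distance
            for p in self.people.values():
                d = math.hypot(c[0] - p.centroid[0], c[1] - p.centroid[1])
                if p.id not in matched and d < best_dist:
                    best, best_dist = p, d
            if best is None:  # nothing close enough: register a new ID
                best = TrackedPerson(self.next_id, c)
                self.people[best.id] = best
                self.next_id += 1
            else:             # close enough: update the existing ID
                best.centroid = c
                best.ys.append(c[1])
            best.missed = 0
            matched.add(best.id)
            self._count(best)
        for pid in list(self.people):  # drop IDs not seen for too long
            if pid not in matched:
                self.people[pid].missed += 1
                if self.people[pid].missed > self.max_missed:
                    del self.people[pid]

    def _count(self, p):
        if p.counted:
            return
        if p.direction > 0 and p.centroid[1] < self.line_y:
            self.entered += 1
            p.counted = True
        elif p.direction < 0 and p.centroid[1] > self.line_y:
            self.left += 1
            p.counted = True


counter = LineCounter(line_y=100)
for y in range(160, 20, -20):  # one person walks up the image
    counter.update([(100, y)])
for y in range(40, 180, 20):   # another person walks down the image
    counter.update([(300, y)])
print(counter.entered, counter.left)  # prints: 1 1
```

The counted flag keeps an ID from being counted more than once while it remains in view; swapping the two branches in _count() swaps the entering and leaving directions.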
What you will see is that the detection model performs better on frontal images of people. This is because it is a generic detection model trained on the Pascal VOC dataset to detect 20 classes of different objects, including cats, dogs, and potted plants. Most of the images of people in the dataset were taken with handheld cameras, which is why this model doesn’t recognize people in top-view images that well. So, the next step is to use a custom dataset of images taken from top-view cameras to create a more specialized model with better accuracy.
Train and Optimize Your Own Neural Network Model
aXeleRate allows for an easy, streamlined training and conversion process, where the user simply needs to add the data and configuration on the one end and receives a trained model, already converted to the target platform, on the other end. aXeleRate can be run in Google Colab, an interactive Jupyter Notebook service by Google, or on a local machine. Use the local machine-training option if you have an NVIDIA GPU and native installation of Ubuntu 18.04 or 20.04. Otherwise, it is recommended to use Google Colab, since it comes with all the required packages. Also, at the time of this writing, Google provides a certain amount of GPU hours for free accounts.
imgs contains training images
anns contains training annotations in Pascal VOC format
imgs_validation contains validation images
anns_validation contains validation annotations in Pascal VOC format
Synthetic images generated with NVIDIA Isaac SDK
Images converted from PIROPO database videos
Personal recordings of the author converted to images
If you want to use your own dataset or add some samples to the existing one, you can use any object-detection dataset-annotation tool available, as long as it supports exporting to the Pascal VOC format.
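If you add your own samples, it can help to check that each annotation file parses as expected before training. The sketch below reads a Pascal VOC annotation with the standard library; the sample XML and the parse_voc() helper are made up for illustration, and real annotation files usually contain additional fields.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in Pascal VOC layout (one XML file per image).
SAMPLE_ANNOTATION = """<annotation>
  <filename>frame_0001.jpg</filename>
  <size><width>320</width><height>240</height><depth>3</depth></size>
  <object>
    <name>person</name>
    <bndbox><xmin>40</xmin><ymin>60</ymin><xmax>120</xmax><ymax>200</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>180</xmin><ymin>30</ymin><xmax>250</xmax><ymax>170</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        coords = [int(bndbox.findtext(tag))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes


filename, boxes = parse_voc(SAMPLE_ANNOTATION)
print(filename, boxes)
```

Running a check like this over the anns and anns_validation folders is a quick way to catch empty annotations or unexpected class names.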
For training in the Colab notebook, use Visual Studio Code to open the aXeleRate_people_topview.ipynb file in the Chapter_2/training folder, click Open in Colab, and follow the instructions in the notebook.
After training is completed, you can see the trained model in the projects/people_topview/[time-of-training-session] directory. Copy the resulting model file (with the .tflite extension) to the Chapter_2/models/ folder.
For best results, mount the Raspberry Pi 4 and the camera on the ceiling above the door frame or walkway, approximately 3-4 meters (10-12 feet) from the floor. In the “Pro Tips” section, you’ll find example scripts and ideas on how to add more functionality to the basic implementation.
Pro Tips
While this script can work in a simple scenario, where you have only a single camera and are willing to manually record the count at the end of each day, you will probably want to automate the process even more. You can do that by connecting each Pi instance to a local SQL database or a cloud-based database. For example, you can use the SQLite database management system to store the people count in a local filesystem. This will enable you to query the count for any day or for a list of days that you want. You can learn more about SQLite at its official website.
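As a minimal sketch of the SQLite option, the code below stores one row per day and camera using Python's built-in sqlite3 module and then queries a range of days. The table layout, column names, and camera labels are assumptions made for this example; an in-memory database is used so the sketch runs anywhere, whereas on the Pi you would pass a file path instead.

```python
import sqlite3
from datetime import date


def open_db(path=":memory:"):
    """Open the database and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS people_count (
               day TEXT NOT NULL,
               camera TEXT NOT NULL,
               entered INTEGER NOT NULL,
               left_count INTEGER NOT NULL,
               PRIMARY KEY (day, camera)
           )"""
    )
    return conn


def record_count(conn, day, camera, entered, left_count):
    """Insert or overwrite the daily totals for one camera."""
    conn.execute(
        "INSERT OR REPLACE INTO people_count VALUES (?, ?, ?, ?)",
        (day.isoformat(), camera, entered, left_count),
    )
    conn.commit()


def counts_between(conn, start, end):
    """Return (day, total entered, total left) for each day in the range."""
    cursor = conn.execute(
        """SELECT day, SUM(entered), SUM(left_count)
           FROM people_count
           WHERE day BETWEEN ? AND ?
           GROUP BY day ORDER BY day""",
        (start.isoformat(), end.isoformat()),
    )
    return cursor.fetchall()


conn = open_db()
record_count(conn, date(2021, 5, 7), "door_a", 120, 115)
record_count(conn, date(2021, 5, 7), "door_b", 30, 28)
record_count(conn, date(2021, 5, 8), "door_a", 90, 92)
rows = counts_between(conn, date(2021, 5, 7), date(2021, 5, 8))
print(rows)  # [('2021-05-07', 150, 143), ('2021-05-08', 90, 92)]
```

Storing dates as ISO 8601 text keeps them sortable, so the BETWEEN filter works on plain string comparison.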
If you want to store the people count in the cloud, you can use a database service provided by a cloud provider or one that runs on your own server (like the one deployed in the previous section). You can schedule a script to run each day at a specific time and record the people count. The script can then push the data to the cloud or local database.
To take full advantage of Raspberry Pi 4 CPU performance, when using the 64-bit Raspberry Pi OS image, you also need a custom-built version of TensorFlow Lite with threading support and XNNPACK optimized delegate enabled. You can find this in the Chapter_2/Pro-tips folder. Install it with pip3.
Another thing to consider if you are building a scalable solution is the board choice for the device. While the standard Raspberry Pi 4 is great for prototyping, thanks to its rich interface selection on the development board, this can become a disadvantage when deploying at scale, mostly because your device doesn’t need four USB ports and a display port. If your device is going to be deployed in an uncluttered environment and can tolerate a higher false-positive rate, you can train a smaller model and then use a previous-generation board, the Raspberry Pi 3 Model A+. This is a stripped-down version of the Raspberry Pi 3 Model B+ with only one USB port, and it costs only 25 USD. An alternative option is to use the Raspberry Pi Compute Module 4, although in that case you need to design and manufacture your own carrier board. We discuss compute module applications and board design in later chapters.
Summary
This chapter introduced and explained the concepts of computer vision and deep learning and demonstrated their application for people counting in retail facilities. You learned how to train your own neural network model for people detection from a top-view camera and then convert it and optimize it for deployment on your Raspberry Pi 4.