Chapter 5. Tracking Visually Salient Objects

The goal of this chapter is to track multiple visually salient objects in a video sequence at once. Instead of labeling the objects of interest in the video ourselves, we will let the algorithm decide which regions of a video frame are worth tracking.

We have previously learned how to detect simple objects of interest (such as a human hand) in tightly controlled scenarios, and how to infer geometrical features of a visual scene from camera motion. In this chapter, we ask what we can learn about a visual scene by looking at the image statistics of a large number of frames. By analyzing the Fourier spectrum of natural images, we will build a saliency map, which allows us to label certain statistically interesting patches of the image as potential objects, or proto-objects. We will then feed the locations of all the proto-objects to a mean-shift tracker, which will allow us to keep track of where the objects move from one frame to the next.

To build the app, we need to combine the following two main features:

  • Saliency map: We will use Fourier analysis to gain a general understanding of natural image statistics, which will help us build a model of what typical image backgrounds look like. By comparing this background model to a particular image frame, we can locate the sub-regions that pop out of their surroundings. Ideally, these sub-regions correspond to the image patches that grab our immediate attention when we look at the image (see the sketch after this list).
  • Object tracking: Once all the potentially interesting patches of an image are located, we will track their movement over many frames using a simple yet effective method called mean-shift tracking. Because it is possible to have multiple proto-objects in the scene that might change appearance over time, we need to be able to distinguish between them and keep track of all of them.
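
To make the saliency part concrete, the following is a minimal sketch of the spectral residual method of Hou and Zhang (2007), a popular Fourier-based way to compute such a map and a close relative of the approach we will develop in this chapter. The function name, the 64 x 64 working size, and the blur parameters are illustrative choices, not the chapter's final implementation:

    import cv2
    import numpy as np

    def spectral_residual_saliency(frame, size=64):
        # Work on a small grayscale copy; fine detail is not needed
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (size, size)).astype(np.float32)

        # Split the Fourier spectrum into log-amplitude and phase
        spectrum = np.fft.fft2(small)
        log_amplitude = np.log(np.abs(spectrum) + 1e-8).astype(np.float32)
        phase = np.angle(spectrum)

        # The "spectral residual" is the log-amplitude minus its local
        # average: smooth spectra describe statistically unremarkable
        # content, so whatever sticks out marks the salient parts
        residual = log_amplitude - cv2.blur(log_amplitude, (3, 3))

        # Back to image space, then smooth and rescale to [0, 1]
        saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
        saliency = cv2.GaussianBlur(saliency.astype(np.float32), (11, 11), 2.5)
        saliency -= saliency.min()
        saliency /= saliency.max() + 1e-8
        return cv2.resize(saliency, (frame.shape[1], frame.shape[0]))

Thresholding the resulting map (for example, with Otsu's method via cv2.threshold) then yields the kind of binary proto-objects mask that Saliency.get_proto_objects_map will produce.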

Visual saliency is a technical term from cognitive psychology that describes the visual quality of certain objects or items that allows them to grab our immediate attention. Our brains constantly drive our gaze towards the important regions of the visual scene and keep track of them over time, allowing us to quickly scan our surroundings for interesting objects and events while neglecting the less important parts.

An example of a regular RGB image and its conversion to a saliency map, where the statistically interesting pop-out regions appear bright and the others dark, is shown in the following figure:

[Figure: an example RGB image and its saliency map, with pop-out regions bright and the rest dark]

Traditional models might try to associate particular features with each target (much like our feature matching approach in Chapter 3, Finding Objects via Feature Matching and Perspective Transforms), which would turn the problem into the detection of specific object categories. However, such models require manual labeling and training. But what if the features, or even the number of objects to track, are not known in advance?

Instead, we will try to mimic what the brain does: tune our algorithm to the statistics of natural images so that we can immediately locate the patterns or sub-regions that "grab our attention" in the visual scene (that is, patterns that deviate from these statistical regularities) and flag them for further inspection. The result is an algorithm that works for any number of proto-objects in the scene, such as tracking all the players on a soccer field. Refer to the following image:

[Figure: all the players on a soccer field tracked as proto-objects]
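
The statistical regularity we are alluding to can be made precise: averaged over all orientations, the power spectrum of a natural image falls off roughly as 1/f^2 with spatial frequency f. Here is a minimal sketch of how such a radially averaged power spectrum can be computed; the function name and bin count are illustrative, and Saliency.plot_power_spectrum will serve the same purpose later:

    import numpy as np

    def radial_power_spectrum(gray, num_bins=64):
        # 2D power spectrum with the zero frequency shifted to the center
        gray = gray.astype(np.float32)
        power = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2

        # Distance of every spectrum sample from the center
        h, w = power.shape
        y, x = np.indices((h, w))
        radius = np.sqrt((y - h / 2.0) ** 2 + (x - w / 2.0) ** 2)

        # Average the power over concentric rings of equal width
        edges = np.linspace(0, radius.max(), num_bins + 1)
        ring = np.clip(np.digitize(radius.ravel(), edges), 1, num_bins)
        avg = np.array([power.ravel()[ring == i].mean()
                        for i in range(1, num_bins + 1)])
        return edges[1:], avg

Plotting np.log(edges[1:]) against np.log(avg) for a handful of photographs reveals a shared, nearly linear fall-off with a slope of roughly -2; a salient region is precisely one that violates this regularity.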

Note

This chapter uses OpenCV 2.4.9, as well as the additional packages NumPy (http://www.numpy.org), wxPython 2.8 (http://www.wxpython.org/download.php), and matplotlib (http://www.matplotlib.org/downloads.html). Although parts of the algorithms presented in this chapter have been added to an optional Saliency module of the OpenCV 3.0.0 release, there is currently no Python API for it, so we will write our own code.

Planning the app

The final app will convert each RGB frame of a video sequence into a saliency map, extract all the interesting proto-objects, and feed them to a mean-shift tracking algorithm. To do this, we need the following components:

  • main: The main function routine (in chapter5.py) to start the application.
  • Saliency: A class that generates a saliency map from an RGB color image. It includes the following public methods:
    • Saliency.get_saliency_map: The main method to convert an RGB color image to a saliency map
    • Saliency.get_proto_objects_map: A method to convert a saliency map into a binary mask containing all the proto-objects
    • Saliency.plot_power_density: A method to display the 2D power density of an RGB color image, which is helpful for understanding the Fourier transform
    • Saliency.plot_power_spectrum: A method to display the radially averaged power spectrum of an RGB color image, which is helpful for understanding natural image statistics
  • MultiObjectTracker: A class that tracks multiple objects in a video using mean-shift tracking. It includes the following public method, which itself contains a number of private helper methods:
    • MultiObjectTracker.advance_frame: A method to update the tracking information for a new frame, combining bounding boxes obtained from both the saliency map and mean-shift tracking (see the sketch after this list)
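
Before diving into the details, here is a minimal, self-contained sketch of mean-shift tracking for a single object, built on OpenCV's cv2.meanShift along the lines of the classic OpenCV recipe. The video file name and the initial box are placeholders; MultiObjectTracker will instead run one such search window per proto-object and seed the windows from the saliency map:

    import cv2

    # Hypothetical input and initial bounding box (placeholders)
    cap = cv2.VideoCapture('soccer.avi')
    ok, frame = cap.read()
    x, y, w, h = 300, 200, 80, 120

    # A hue histogram of the region serves as the appearance model
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    roi = hsv[y:y + h, x:x + w]
    hist = cv2.calcHist([roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    # Stop after 10 iterations or once the window moves less than 1 px
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    track_window = (x, y, w, h)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Back-project the model histogram to get a probability image,
        # then let mean-shift climb to its local maximum
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        ret, track_window = cv2.meanShift(prob, track_window, term_crit)
        x, y, w, h = track_window
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('mean-shift tracking', frame)
        if cv2.waitKey(30) & 0xFF == 27:  # press Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()

Note that mean-shift needs no training: as long as the back-projection of the target's histogram stays distinctive, the window follows the object from frame to frame, which is exactly why it pairs so well with the label-free proto-objects extracted from the saliency map.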

In the following sections, we will discuss these steps in detail.
