Chapter 7. Detecting and Recognizing Objects

This chapter will introduce the concept of detecting and recognizing objects, which is one of the most common challenges in computer vision. You've come this far in the book, so at this stage you may be wondering how far you are from mounting a computer in your car that, through the use of a camera, will give you information about the cars and people around you. Well, you're not too far from your goal, actually.

In this chapter, we will expand on the concept of object detection, which we initially explored when talking about recognizing faces, and adapt it to all sorts of real-life objects, not just faces.

Object detection and recognition techniques

We made a distinction in Chapter 5, Detecting and Recognizing Faces, which we'll reiterate for clarity: detecting an object is the ability of a program to determine whether a certain region of an image contains an unidentified object, while recognizing is the ability of a program to identify that object. Recognition normally occurs only in regions of interest where an object has already been detected; for example, we attempted to recognize faces only in the areas of an image that contained a face in the first place.

When it comes to recognizing and detecting objects, there are a number of techniques used in computer vision, which we'll be examining:

  • Histogram of Oriented Gradients
  • Image pyramids
  • Sliding windows

Unlike feature detection algorithms, these are not mutually exclusive techniques; rather, they are complementary. You can compute a Histogram of Oriented Gradients (HOG) while applying the sliding windows technique.

So, let's take a look at HOG first and understand what it is.

HOG descriptors

HOG is a feature descriptor, so it belongs to the same family of algorithms as SIFT, SURF, and ORB.

It is used in image and video processing to detect objects. Its internal mechanism is really clever: an image is divided into portions, and a gradient is calculated for each portion. We observed a similar approach when we talked about face recognition through LBPH.

HOG, however, calculates histograms that are not based on color values, rather, they are based on gradients. As HOG is a feature descriptor, it is capable of delivering the type of information that is vital for feature matching and object detection/recognition.

Before diving into the technical details of how HOG works, let's first take a look at how HOG sees the world; here is an image of a truck:

[Figure: a photograph of a truck]

This is its HOG version:

[Figure: the HOG version of the truck image]

You can easily recognize the wheels and the main structure of the vehicle. So, what is HOG seeing? First of all, you can see how the image is divided into cells; in this case, they are 16x16 pixel cells. Each cell contains a visual representation of the calculated color gradients in eight directions (N, NW, W, SW, S, SE, E, and NE).

These eight values contained in each cell are the famous histograms. Therefore, a single cell gets a unique signature, which you can mentally visualize as something like this:

[Figure: a single cell's histogram of gradients in the eight directions]

The extrapolation of histograms into descriptors is quite a complex process. First, local histograms are calculated for each cell. The cells are then grouped into larger regions called blocks. Blocks can be made of any number of cells, but Dalal and Triggs found that 2x2 cell blocks yielded the best results when performing people detection. A block-wide vector is created and normalized to account for variations in illumination and shadowing (a single cell is too small a region to detect such variations). This normalization improves the accuracy of detection because it reduces the illumination and shadowing differences between the sample and the block being examined.
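
To make the cell and block structure a bit more concrete, here is a minimal sketch that computes a HOG descriptor for a single image with OpenCV's HOGDescriptor. The parameters below are OpenCV's defaults for people detection (a 64x128 window, 16x16 blocks, an 8x8 block stride, 8x8 cells, and 9 orientation bins), not the 16x16 cells of the illustration above, and the image path is hypothetical:

import cv2

# Load a sample image (hypothetical path) and scale it to the window size
img = cv2.imread("../images/truck.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, (64, 128))

# winSize, blockSize, blockStride, cellSize, nbins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

descriptor = hog.compute(gray)
# 105 blocks x 4 cells per block x 9 bins = 3780 values
print(descriptor.shape)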

Simply comparing cells in two images would not work unless the images are identical (both in terms of size and data).

There are two main problems to resolve:

  • Location
  • Scale

The scale issue

Imagine, for example, that your sample is a detail (say, a bike) cropped from a larger image, and you're trying to compare the two pictures. You would not obtain the same gradient signatures, and the detection would fail (even though the bike is in both pictures).

The location issue

Once we've resolved the scale problem, we have another obstacle in our path: a potentially detectable object can be anywhere in the image, so we need to scan the entire image in portions to make sure we can identify areas of interest and, within these areas, try to detect objects. Even if a sample image and the object in the scene are of identical size, there needs to be a way to instruct OpenCV to locate this object; the rest of the image is then discarded, and a comparison is made only on potentially matching regions.

To resolve these problems, we need to familiarize ourselves with the concepts of image pyramids and sliding windows.

Image pyramid

Many of the algorithms used in computer vision utilize a concept called an image pyramid.

An image pyramid is a multiscale representation of an image. This diagram should help you understand this concept:

[Figure: a diagram of an image pyramid, a multiscale representation of an image]

A multiscale representation of an image, or an image pyramid, helps you resolve the problem of detecting objects at different scales. The importance of this concept is easily explained by a real-life fact: it is extremely unlikely that an object will appear in an image at the exact scale at which it appeared in our sample image.

Moreover, you will learn that object classifiers (utilities that allow you to detect objects in OpenCV) need training, and this training is provided through image databases made up of positive matches and negative matches. Among the positives, it is again unlikely that the object we want to identify will appear at the same scale throughout the training dataset.

We've got it, Joe. We need to take scale out of the equation, so now let's examine how an image pyramid is built.

An image pyramid is built through the following process:

  1. Take an image.
  2. Resize (shrink) the image using an arbitrary scale factor.
  3. Smooth the image (using Gaussian blurring).
  4. If the resized image is still larger than an arbitrary minimum size, repeat the process from step 2.
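
As a quick illustration, here is a minimal sketch of this process written as a Python generator; the scale factor of 1.5 and the minimum size of 64x64 pixels are arbitrary illustration values, not prescribed by OpenCV:

import cv2

def pyramid(image, scale_factor=1.5, min_size=(64, 64)):
    # Yield the original image, then progressively smaller, blurred copies
    # until the minimum size is reached
    yield image
    while True:
        w = int(image.shape[1] / scale_factor)
        h = int(image.shape[0] / scale_factor)
        if w < min_size[0] or h < min_size[1]:
            break
        image = cv2.resize(image, (w, h), interpolation=cv2.INTER_AREA)
        image = cv2.GaussianBlur(image, (5, 5), 0)
        yield image

img = cv2.imread("../images/people.jpg")
for i, layer in enumerate(pyramid(img)):
    print(i, layer.shape)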

Although we are exploring image pyramids, scale ratios, and minimum sizes only at this stage of the book, you've already dealt with them. If you recall Chapter 5, Detecting and Recognizing Faces, we used the detectMultiScale method of the CascadeClassifier object.

Straight away, detectMultiScale doesn't sound so obscure anymore; in fact, it has become self-explanatory. The cascade classifier object attempts to detect an object at different scales of an input image. The second piece of information that should now be much clearer is the scaleFactor parameter of the detectMultiScale() method. This parameter represents the ratio by which the image is resampled to a smaller size at each step of the pyramid.

The smaller the scaleFactor parameter, the more layers there are in the pyramid, and the slower and more computationally intensive the operation becomes, although, to an extent, the results will be more accurate.

So, by now, you should have an understanding of what an image pyramid is, and why it is used in computer vision. Let's now move on to sliding windows.

Sliding windows

Sliding windows is a technique used in computer vision that consists of examining shifting portions of an image (the sliding windows) and running detection on each of them, in combination with image pyramids, so that an object can be detected at multiple scales.

Sliding windows resolves location issues by scanning smaller regions of a larger image, and then repeating the scanning on different scales of the same image.

With this technique, each image is decomposed into portions, which allows us to discard portions that are unlikely to contain objects, while the remaining portions are passed to a classifier.
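
Here is a minimal sketch of a sliding window, again written as a Python generator; the step of 16 pixels and the 64x128 window size are arbitrary illustration values:

def sliding_window(image, step=16, window_size=(64, 128)):
    # Yield (x, y, window) tuples, moving the window across the image
    # from left to right and from top to bottom
    for y in range(0, image.shape[0] - window_size[1] + 1, step):
        for x in range(0, image.shape[1] - window_size[0] + 1, step):
            yield (x, y, image[y:y + window_size[1], x:x + window_size[0]])

# Used together with the pyramid generator from the previous sketch:
# for layer in pyramid(img):
#     for (x, y, roi) in sliding_window(layer):
#         ...classify roi here...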

There is one problem that emerges with this approach, though: overlapping regions.

Let's expand a little bit on this concept to clarify the nature of the problem. Say, you're operating face detection on an image and are using sliding windows.

Each window shifts by only a few pixels at a time, which means that a sliding window may turn out to be a positive match for the same face in, say, four slightly different positions. Naturally, we don't want to report four matches, but only one; furthermore, we're not interested in every portion of the image with a good score, but simply in the portion with the highest score.

Here's where non-maximum suppression comes into play: given a set of overlapping regions, we can suppress all the regions that are not classified with the maximum score.

Non-maximum (or non-maxima) suppression

Non-maximum (or non-maxima) suppression is a technique that suppresses all the results relating to the same area of an image that do not have the maximum score for that area. Similarly located windows tend to overlap significantly and to all score relatively high, but we are only interested in the window with the best result, so we discard the overlapping windows with lower scores.

When examining an image with sliding windows, you want to make sure to retain the best window of a bunch of windows, all overlapping around the same subject.

To do this, you determine that all the windows that overlap by more than a certain threshold, x, will be thrown into the non-maximum suppression operation.

This is quite complex, but it's also not the end of this process. Remember the image pyramid? We're scanning the image at smaller scales iteratively to make sure to detect objects in different scales.

This means that you will obtain a series of windows at different scales; you then compute the size of each window obtained at a smaller scale as if it had been detected at the original scale and, finally, throw this window into the original mix.

It does sound a bit complex. Thankfully, we're not the first to come across this problem, which has been resolved in several ways. The fastest algorithm in my experience was implemented by Dr. Tomasz Malisiewicz at http://www.computervisionblog.com/2011/08/blazing-fast-nmsm-from-exemplar-svm.html. The example is in MATLAB, but in the application example, we will obviously use a Python version of it.

The general approach behind non-maximum suppression is as follows:

  1. Once an image pyramid has been constructed, scan the image with the sliding window approach for object detection.
  2. Collect all the current windows that have returned a positive result (beyond a certain arbitrary threshold), and take a window, W, with the highest response.
  3. Eliminate all windows that overlap W significantly.
  4. Move on to the next window with the highest response and repeat the process for the current scale.

When this process is complete, move up to the next scale in the image pyramid and repeat the preceding process. To make sure windows are correctly represented at the end of the entire non-maximum suppression process, be sure to compute the window size in relation to the original size of the image (for example, if you detect a window at 50 percent of the original size in the pyramid, the detected window will actually be twice as wide and twice as tall, covering four times the area, in the original image).

At the end of this process, you will have a set of maximum-scored windows. Optionally, you can check for windows that are entirely contained in other windows (as we will do in the people detection example later in this chapter) and eliminate those.
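
For reference, here is a minimal sketch of that fast non-maximum suppression approach ported to Python and NumPy. It assumes the boxes have already been rescaled to the original image size and are given as rows of (x1, y1, x2, y2, score); the overlap threshold of 0.3 is an arbitrary illustration value:

import numpy as np

def non_max_suppression(boxes, overlap_thresh=0.3):
    # boxes: float array of shape (N, 5) with columns x1, y1, x2, y2, score
    if len(boxes) == 0:
        return boxes
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    scores = boxes[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while len(order) > 0:
        i = order[0]  # window with the highest remaining score
        keep.append(i)
        # Intersection of this window with all the remaining windows
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        overlap = (w * h) / areas[order[1:]]
        # Discard the windows that overlap the kept window significantly
        order = order[1:][overlap <= overlap_thresh]
    return boxes[keep]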

Now, how do we determine the score of a window? We need a classification system that determines whether a certain feature is present or not, along with a confidence score for this classification. This is where support vector machines (SVMs) come into play.

Support vector machines

Explaining in detail what an SVM is and does is beyond the scope of this book, but suffice it to say that an SVM is an algorithm that, given labeled training data, enables the classification of this data by outputting an optimal hyperplane, which, in plain English, is the optimal plane that divides the differently classified data. A visual representation will help you understand this:

[Figure: a visual representation of an SVM's optimal separating hyperplane]

Why is it so helpful in computer vision, and in object detection in particular? Because finding the optimal division line between pixels that belong to an object and pixels that don't is a vital component of object detection.
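
If you'd like to see an SVM at work before we use one for detection, here is a minimal sketch using OpenCV's ml module to separate two clusters of 2D points with a linear SVM; the toy coordinates are made up purely for illustration:

import cv2
import numpy as np

# Toy training data: two classes of 2D points
samples = np.array([[10, 10], [12, 11], [11, 13],
                    [40, 40], [42, 41], [41, 43]], dtype=np.float32)
labels = np.array([0, 0, 0, 1, 1, 1], dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)

# Predict the class of a new point near the second cluster
test = np.array([[38, 39]], dtype=np.float32)
print(svm.predict(test)[1])  # expected to print class 1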

The SVM model has been around since the early 1960s; however, the current form of its implementation originates in a 1995 paper by Corinna Cortes and Vladimir Vapnik, which is available at http://link.springer.com/article/10.1007/BF00994018.

Now that we have a good understanding of the concepts involved in object detection, we can start looking at a few examples. We will start from built-in functions and evolve into training our own custom object detectors.

People detection

OpenCV comes with a HOGDescriptor class that performs people detection out of the box.

Here's a pretty straightforward example:

import cv2
import numpy as np

def is_inside(o, i):
    ox, oy, ow, oh = o
    ix, iy, iw, ih = i
    return ox > ix and oy > iy and ox + ow < ix + iw and oy + oh < iy + ih

def draw_person(image, person):
    x, y, w, h = person
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 255), 2)

img = cv2.imread("../images/people.jpg")
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# detectMultiScale returns the detected rectangles and their weights
found, weights = hog.detectMultiScale(img)

found_filtered = []
for ri, r in enumerate(found):
    for qi, q in enumerate(found):
        if ri != qi and is_inside(r, q):
            break
    else:
        found_filtered.append(r)

for person in found_filtered:
    draw_person(img, person)

cv2.imshow("people detection", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

After the usual imports, we define two very simple functions: is_inside and draw_person, which perform two minimal tasks, namely, determining whether a rectangle is fully contained in another rectangle, and drawing rectangles around detected people.

We then load the image and create a HOGDescriptor through a very simple and self-explanatory line of code:

cv2.HOGDescriptor()

After this, we specify that HOGDescriptor will use a default people detector.

This is done through the setSVMDetector() method, which, after our introduction to SVMs, should sound less obscure than it would have if we hadn't introduced them.

Next, we apply detectMultiScale to the loaded image. Interestingly, unlike the face detection algorithms, we don't need to convert the original image to grayscale before performing this kind of object detection.

The detection method will return an array of rectangles, which would be a good enough source of information for us to start drawing shapes on the image. If we did this, however, you would notice something strange: some of the rectangles are entirely contained in other rectangles. This clearly indicates an error in detection, and we can safely assume that a rectangle entirely inside another one can be discarded.

This is precisely the reason why we defined an is_inside function, and why we iterate through the result of the detection to discard false positives.

If you run the script yourself, you will see rectangles around people in the image.

Creating and training an object detector

Using built-in features makes it easy to come up with a quick prototype for an application, and we're all very grateful to the OpenCV developers for making great features, such as face detection and people detection, readily available (truly, we are).

However, whether you are a hobbyist or a computer vision professional, it's unlikely that you will only deal with people and faces.

Moreover, if you're like me, you wonder how the people detector was created in the first place and whether you can improve it. Furthermore, you may also wonder whether you can apply the same concepts to detect the most diverse types of objects, ranging from cars to goblins.

In an enterprise environment, you may have to deal with very specific detection, such as registration plates, book covers, or whatever your company may deal with.

So, the question is, how do we come up with our own classifiers?

The answer lies in SVMs and the bag-of-words technique.

We've already talked about HOG and SVM, so let's take a closer look at bag-of-words.

Bag-of-words

Bag-of-words (BOW) is a concept that was not initially intended for computer vision; rather, we use an evolved version of this concept in the context of computer vision. So, let's first talk about its basic version, which, as you may have guessed, originally belongs to the field of language analysis and information retrieval.

BOW is the technique by which we assign a count to each word in a series of documents and then represent each document with a vector of these counts. Let's look at an example:

  • Document 1: I like OpenCV and I like Python
  • Document 2: I like C++ and Python
  • Document 3: I don't like artichokes

These three documents allow us to build a dictionary (or codebook) with these values:

{
    I: 4,
    like: 4,
    OpenCV: 1,
    and: 2,
    Python: 2,
    C++: 1,
    don't: 1,
    artichokes: 1
}

We have eight entries. Let's now represent the original documents using eight-entry vectors, each containing all the words in the dictionary, with values representing the count of each term in the document. The vector representation of the preceding three sentences is as follows:

[2, 2, 1, 1, 1, 0, 0, 0]
[1, 1, 0, 1, 1, 1, 0, 0]
[1, 1, 0, 0, 0, 0, 1, 1]
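
Here is a quick sketch of how these vectors could be computed in plain Python; the whitespace tokenization is deliberately naive and purely for illustration:

from collections import Counter

documents = [
    "I like OpenCV and I like Python",
    "I like C++ and Python",
    "I don't like artichokes"
]

# Fixed vocabulary ordering, matching the dictionary above
vocabulary = ["I", "like", "OpenCV", "and", "Python", "C++", "don't", "artichokes"]

for doc in documents:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])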

This kind of representation of documents has many effective applications in the real world, such as spam filtering.

These vectors can be conceptualized as a histogram representation of documents or as a feature (the same way we extracted features from images in previous chapters), which can be used to train classifiers.

Now that we have a grasp of the basic concept of BOW, let's see how its visual counterpart, the bag of visual words (BOVW), applies to the world of computer vision.

BOW in computer vision

We are by now familiar with the concept of image features. We've used feature extractors, such as SIFT and SURF, to extract features from images so that we could match these features in another image.

We've also familiarized ourselves with the concept of a codebook, and we know about SVMs, a model that can be fed a set of features, uses complex algorithms to classify training data, and can predict the classification of new data.

So, the implementation of a BOW approach will involve the following steps:

  1. Take a sample dataset.
  2. For each image in the dataset, extract descriptors (with SIFT, SURF, and so on).
  3. Add each descriptor to the BOW trainer.
  4. Cluster the descriptors into k clusters (okay, this sounds obscure, but bear with me) whose centers (centroids) are our visual words.

At this point, we have a dictionary of visual words ready to be used. As you can imagine, a large dataset will help make our dictionary richer in visual words. Up to an extent, the more words, the better!

After this, we are ready to test our classifier and attempt detection. The good news is that the process is very similar to the one outlined previously: given a test image, we can extract features and quantize them based on their distance to the nearest centroid to form a histogram.
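
To give you a feel for the OpenCV classes involved, here is a minimal sketch of building a visual vocabulary and quantizing a test image against it. The image paths and the vocabulary size of 40 clusters are arbitrary illustration values, and SIFT requires OpenCV's xfeatures2d (contrib) module:

import cv2

sift = cv2.xfeatures2d.SIFT_create()
bow_trainer = cv2.BOWKMeansTrainer(40)  # 40 visual words (clusters)

# Steps 1-3: extract descriptors from each training image
# and add them to the BOW trainer (hypothetical paths)
for path in ["../images/car1.jpg", "../images/car2.jpg"]:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    bow_trainer.add(descriptors)

# Step 4: cluster the descriptors; the centroids are our visual words
vocabulary = bow_trainer.cluster()

# Quantize a test image into a histogram of visual words
bow_extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
bow_extractor.setVocabulary(vocabulary)

test = cv2.imread("../images/test.jpg", cv2.IMREAD_GRAYSCALE)
histogram = bow_extractor.compute(test, sift.detect(test))
print(histogram.shape)  # one row with 40 bins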

Based on this, we can attempt to recognize visual words and locate them in the image. Here's a visual representation of the BOW process:

[Figure: a visual representation of the BOW process]

This is the point in the chapter when you have built up an appetite for a practical example and are raring to write some code. However, before proceeding, I feel that a quick digression into the theory of k-means clustering is necessary so that you can fully understand how visual words are created, and gain a better understanding of the process of object detection using BOW and SVM.

The k-means clustering

k-means clustering is a method of vector quantization used to perform data analysis. Given a dataset, k represents the number of clusters into which the dataset is going to be divided. The term "means" refers to the mathematical concept of the mean, which is pretty basic, but for the sake of clarity, it's what people commonly refer to as the average; when visually represented, the mean of a cluster is its centroid, or the geometric center of the points in the cluster.

Note

Clustering refers to the grouping of points in a dataset into clusters.
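
As a quick illustration of k-means itself, here is a minimal sketch that clusters random 2D points with OpenCV's kmeans function; the five clusters, the random data, and the termination criteria are arbitrary illustration values:

import cv2
import numpy as np

# 100 random 2D points to cluster
points = np.random.randint(0, 100, (100, 2)).astype(np.float32)

# Stop after 10 iterations or when the centroids move by less than 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

compactness, labels, centers = cv2.kmeans(
    points, 5, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

print(centers)  # the five centroids, one per cluster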

One of the classes we will be using to perform object detection is called BOWKMeansTrainer; by now, you should be able to deduce what this class is responsible for creating:

"kmeans() -based class to train a visual vocabulary using the bag-of-words approach"

This is as per the OpenCV documentation.

Here's a representation of a k-means clustering operation with five clusters:

[Figure: a k-means clustering operation with five clusters and their centroids]

After this long theoretical introduction, we can look at an example, and start training our object detector.
