Understanding the Mean Shift algorithm

The Mean Shift algorithm is an iterative algorithm that can be used to find the maxima of a density function. A very rough translation of the preceding sentence to computer vision terminology would be the following—the Mean Shift algorithm can be used to find an object in an image using a back-projection image. But how is it achieved in practice? Let's walk through this step by step. Here are the individual operations that are performed to find an object using the Mean Shift algorithm, in order:

  1. The back-projection of an image is created using a modified histogram to find the pixels that are most likely to contain our object of interest. (It is also common to filter the back-projection image to get rid of unwanted noise, but this is an optional operation to improve the results.)
  2. An initial search window is needed. This search window will contain our object of interest after a number of iterations, which we'll get to in the next step. After each iteration, the algorithm updates the search window by calculating the mass center of the search window in the back-projection image and then shifting the center of the search window to that mass center. The following picture demonstrates the concept of the mass center in a search window and how the shifting happens:

The two points at the two ends of the arrow in the preceding picture correspond to the search-window center and the mass center; a minimal code sketch of this shift computation follows the list.

  3. Just like any iterative algorithm, the Mean Shift algorithm requires termination criteria to stop when the result is good enough or when an acceptable result is not reached fast enough. A maximum number of iterations and an epsilon value are used as the termination criteria: the algorithm stops either when it reaches the maximum number of iterations or when the shift distance falls below the given epsilon value (convergence).
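
To make the mass-center idea concrete, the following is a minimal sketch of how a single shift could be computed by hand using OpenCV's moments function. The backProject and srchWnd variables are assumptions here (a back-projection image and the current search window); the meanShift function shown later performs these iterations internally:

Mat window(backProject, srchWnd); // view of the window's contents
Moments m = moments(window); // image moments of the window
if(m.m00 > 0) // skip empty windows to avoid division by zero
{
    // mass center, relative to the window's top-left corner
    Point massCenter(cvRound(m.m10 / m.m00),
                     cvRound(m.m01 / m.m00));
    // shift the window so its center lands on the mass center
    srchWnd.x += massCenter.x - srchWnd.width / 2;
    srchWnd.y += massCenter.y - srchWnd.height / 2;
}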

Now, let's see a hands-on example of how this algorithm is used in practice by using the OpenCV library. The meanShift function in OpenCV implements the Mean Shift algorithm almost exactly as it was described in the preceding steps. This function requires a back-projection image, a search window, and the termination criteria, and it is used as seen in the following example:

Rect srchWnd(0, 0, 100, 100); 
TermCriteria criteria(TermCriteria::MAX_ITER 
                      + TermCriteria::EPS, 
                      20, // number of iterations 
                      1.0 // epsilon value 
                      ); 
// Calculate back-projection image 
meanShift(backProject, 
          srchWnd, 
          criteria);

srchWnd is a Rect object, which is simply a rectangle that must contain an initial value that is used and then updated by the meanShift function. backProject must contain a proper back-projection image that is calculated with any of the methods that we learned in Chapter 5, Back-Projection and Histograms. The TermCriteria class is an OpenCV class that is used by iterative algorithms that require similar termination criteria. The first parameter defines the type of the termination criteria, which can be MAX_ITER (same as COUNT), EPS, or both. In the preceding example, we have used termination criteria of 20 iterations and an epsilon value of 1.0, which of course can be changed depending on the environment and application. The most important thing to note here is that a higher number of iterations and a lower epsilon value can yield more accurate results, but at the cost of slower performance, and vice versa.

The preceding example is just a demonstration of how the meanShift function is called. Now, let's walk through a complete hands-on example to learn our first real-time object-tracking algorithm:

  1. The structure of the tracking example we'll create is quite similar to the previous examples in this chapter. We need to open a video, or a camera, on the computer using the VideoCapture class and then start reading the frames, as seen here:
VideoCapture cam(0); 
if(!cam.isOpened()) 
    return -1; 
 
int key = -1; 
while(key != ' ') 
{ 
    Mat frame; 
    cam >> frame; 
    if(frame.empty()) 
        break; 
 
    int k = waitKey(10); 
    if(k > 0) 
        key = k; 
} 
 
cam.release(); 

Again, we have used the waitKey function to stop the loop if the spacebar key is pressed.

  2. We're going to assume that our object of interest has a green color. So, we're going to form a hue histogram that contains only the green colors, as seen here:
int bins = 360; // bins over the 0-180 hue range, so bin index = hue in degrees
int grnHue = 120; // green hue, in degrees (OpenCV stores hue as degrees/2)
int hueOffset = 50; // accepted distance from the green hue
Mat histogram(bins, 1, CV_32FC1);
for(int i=0; i<bins; i++)
{
    histogram.at<float>(i, 0) =
            (i > grnHue - hueOffset) && (i < grnHue + hueOffset)
            ? 255.0f : 0.0f;
}

This needs to happen before entering the process loop, since our histogram is going to stay constant throughout the whole process.

  3. The last things to take care of before entering the actual process loop and the tracking code are the initial search window and the termination criteria; the criteria will stay constant throughout the whole process. Here's how we'll create them:
Rect srchWnd(0, 0, 100, 100);
TermCriteria criteria(TermCriteria::MAX_ITER
                      + TermCriteria::EPS,
                      20,
                      1.0);

The initial value of the search window is quite important when using the Mean Shift algorithm to track objects, since this algorithm always makes an assumption about the initial position of the object to be tracked. This is an obvious downside of the Mean Shift algorithm, which we'll learn how to deal with later on in this chapter, when we discuss the CAM Shift algorithm and its implementation in the OpenCV library.
  4. After each frame is read in the while loop we're using for the tracking code, we must calculate the back-projection image of the input frame using the green hue histogram that we created. Here's how it's done:
Mat frmHsv, hue; 
vector<Mat> hsvChannels; 
cvtColor(frame, frmHsv, COLOR_BGR2HSV); 
split(frmHsv, hsvChannels); 
hue = hsvChannels[0]; 
 
int nimages = 1; 
int channels[] = {0}; 
Mat backProject; 
float rangeHue[] = {0, 180}; 
const float* ranges[] = {rangeHue}; 
double scale = 1.0; 
bool uniform = true; 
calcBackProject(&hue, 
                nimages, 
                channels, 
                histogram, 
                backProject, 
                ranges, 
                scale, 
                uniform); 

You can refer to Chapter 5, Back-Projection and Histograms, for more detailed instructions about calculating the back-projection image.

  5. Call the meanShift function to update the search window using the back-projection image and the provided termination criteria, as seen here:
meanShift(backProject, 
          srchWnd, 
          criteria); 
  6. To visualize the search window, or in other words the tracked object, we need to draw the search-window rectangle on the input frame. Here's how you can do this by using the rectangle function:
rectangle(frame, 
          srchWnd, // search window rectangle 
          Scalar(0,0,255), // red color 
          2 // thickness 
          );

We can do the same on the back-projection image; however, we first need to convert the back-projection image to the BGR color space. Remember that the back-projection result is a single-channel image with the same depth as the input image. Here's how we can draw a red rectangle at the search-window position on the back-projection image:

cvtColor(backProject, backProject, COLOR_GRAY2BGR); 
rectangle(backProject, 
          srchWnd, 
          Scalar(0,0,255), 
          2); 
  7. Add a way to switch between the back-projection and the original video frame using the B and V keys. Here's how it's done:
switch(key) 
{ 
case 'b': imshow("Camera", backProject); 
    break; 
case 'v': default: imshow("Camera", frame); 
    break; 
} 

Let's give our program a try and see how it performs when executed in a slightly controlled environment. The following picture demonstrates the initial position of the search window and our green object of interest, both in the original frame view and the back-projection view:

Moving the object around will cause the meanShift function to update the search window and consequently track the object. Here's another result, depicting the object tracked to the bottom-right corner of the view:

Notice the small amount of noise that can be seen in the corner; the meanShift function can tolerate this, since such noise does not affect the mass center too much. However, as mentioned previously, it is a good idea to perform some sort of filtering on the back-projection image to get rid of noise. For instance, for noise similar to what we have in the back-projection image, we can use the GaussianBlur function, or even better, the erode function, to get rid of unwanted pixels in the back-projection image. For more information on how to use filtering functions, you can refer to Chapter 4, Drawing, Filtering, and Transformation.
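
As an illustration, here is a minimal sketch of how such filtering could be applied to the back-projection image right after it is calculated; the kernel sizes used here are assumptions that would need tuning for a given scene:

// remove small isolated blobs of noise from the back-projection
erode(backProject, backProject,
      getStructuringElement(MORPH_RECT, Size(3, 3)));
// optionally smooth what remains; sigma is derived from the kernel size
GaussianBlur(backProject, backProject, Size(5, 5), 0);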

In such tracking applications, we usually need to observe, record, or otherwise process the route that the object of interest has taken over a desired period of time. This can be achieved simply by using the center point of the search window, as seen in the following example:

Point p(srchWnd.x + srchWnd.width/2, 
        srchWnd.y + srchWnd.height/2); 
route.push_back(p); 
if(route.size() > 60) // last 60 frames 
    route.erase(route.begin()); // remove first element

 

Obviously, route is a vector of Point objects. route needs to be updated after the meanShift function call, and then we can use the following call to the polylines function in order to draw the route over the original video frame:

polylines(frame, 
          route, // the vector of Point objects 
          false, // not a closed polyline 
          Scalar(0,255,0), // green color 
          2 // thickness 
          ); 

The following picture depicts the result of displaying the tracking route (for the last 60 frames) on the original video frames read from the camera:

Now, let's address some issues that we observed while working with the meanShift function. First of all, it is not convenient to create the hue histogram manually. A flexible program should allow the user to choose the object they want to track, or at least to conveniently pick the color of the object of interest. The same can be said about the search window size and its initial position. There are a number of ways to deal with such issues, and we're going to address them with a hands-on example.

When using the OpenCV library, you can use the setMouseCallback function to customize the behavior of mouse interaction with an output window. This can be used in combination with a few simple methods, such as bitwise_not, to provide easy-to-use object selection for the users. setMouseCallback, as can be guessed from its name, sets a callback function that handles mouse events on a given window.
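
For instance, assuming the output window is named Camera, as in our examples, and that onMouse is the callback function defined next, registering the callback could look like the following sketch:

// the window must exist before the callback is attached to it
namedWindow("Camera");
setMouseCallback("Camera", onMouse);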

The following callback function in conjunction with the variables defined here can be used to create a convenient object selector:

bool selecting = false; 
Rect selection; 
Point spo; // selection point origin 
 
void onMouse(int event, int x, int y, int flags, void*) 
{ 
    switch(event) 
    { 
    case EVENT_LBUTTONDOWN: 
    { 
        spo.x = x; 
        spo.y = y; 
        selection.x = spo.x; 
        selection.y = spo.y; 
        selection.width = 0; 
        selection.height = 0; 
        selecting = true; 
 
    } break; 
    case EVENT_LBUTTONUP: 
    { 
        selecting = false; 
    } break; 
    default: // EVENT_MOUSEMOVE, among others
    {
        if(selecting) // only resize the rectangle while selecting
        {
            selection.x = min(x, spo.x);
            selection.y = min(y, spo.y);
            selection.width = abs(x - spo.x);
            selection.height = abs(y - spo.y);
        }
    } break;
    } 
} 

The event parameter contains an entry from the MouseEventTypes enum, which describes the type of mouse event, such as a button being pressed or released, or the pointer being moved. Based on such simple events, we can decide when the user is actually selecting an object that's visible on the screen. This is demonstrated as follows:

if(selecting) 
{ 
    Mat sel(frame, selection); 
    bitwise_not(sel, sel); // invert the selected area 
 
    srchWnd = selection; // set the search window 
 
    // create the histogram using the hue of the selection 
} 
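
The last comment in the preceding code is a placeholder. One possible way to fill it in, sketched here as an assumption rather than the definitive implementation, is to copy the selected region (before bitwise_not inverts it in place) and compute its hue histogram with calcHist, similar to what we did in Chapter 5:

// copy the ROI before it is inverted in place
Mat roi = frame(selection).clone();
Mat roiHsv;
cvtColor(roi, roiHsv, COLOR_BGR2HSV);
int channels[] = {0}; // the hue channel
int histSize[] = {360}; // same bin count as before
float rangeHue[] = {0, 180};
const float* ranges[] = {rangeHue};
calcHist(&roiHsv, 1, channels, Mat(), histogram,
         1, histSize, ranges);
// scale bin values to the 0-255 range used by our earlier histogram
normalize(histogram, histogram, 0, 255, NORM_MINMAX);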

This adds a great deal of flexibility to our applications, and the code will work with objects of any color. Make sure to check out the example code for this chapter from the online Git repository for a complete example project that uses all the topics we've learned so far in this chapter.

Another method of selecting an object or a region on an image is by using the selectROI and selectROIs functions in the OpenCV library. These functions allow the user to select a rectangle (or rectangles) on an image using simple mouse clicks and drags. Note that the selectROI and selectROIs functions are easier to use than handling mouse events with callback functions; however, they do not offer the same degree of power, flexibility, and customization.
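
For example, the initial search window could be picked interactively with a sketch like the following; the window name Camera and the frame variable are assumptions carried over from our earlier examples:

// selectROI blocks until the user drags a rectangle and
// confirms it by pressing ENTER or SPACE
Rect srchWnd = selectROI("Camera", frame);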

Before moving on to the next section, let's recall that meanShift does not handle an increase or decrease in the size of the object that is being tracked, nor does it take care of the orientation of the object. These are probably the main issues that have led to the development of a more sophisticated version of the Mean Shift algorithm, which is the next topic we're going to learn about in this chapter.
