Depth estimation with a normal camera

A depth camera is a fantastic little device that captures images and estimates the distance of objects from the camera itself. But how does a depth camera retrieve depth information? And is it possible to reproduce the same kind of calculations with a normal camera?

A depth camera, such as the Microsoft Kinect, combines a traditional camera with an infrared sensor that helps it differentiate similar objects and calculate their distance from the camera. However, not everybody has access to a depth camera or a Kinect, and especially when you're just learning OpenCV, you're probably not going to invest in an expensive piece of equipment until your skills are sharp and your interest in the subject is confirmed.

Our setup includes a simple camera, most likely one integrated in our machine or a webcam attached to our computer, so we need to resort to less fancy means of estimating the difference in distance of objects from the camera.

Geometry comes to the rescue here, in particular epipolar geometry, which is the geometry of stereo vision. Stereo vision is a branch of computer vision that extracts three-dimensional information from two different images of the same subject.

How does epipolar geometry work? Conceptually, it traces imaginary lines from the camera to each object in the image, then does the same on the second image, and calculates the distance of objects based on the intersection of the lines corresponding to the same object. Here is a representation of this concept:

[Figure: diagram of epipolar geometry]
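For a rectified stereo pair (two cameras side by side, looking in the same direction), the line-intersection idea reduces to a simple relation: depth is inversely proportional to disparity, Z = f * B / d. The following is a minimal sketch of that relation only. The function name and the focal length and baseline values are assumptions chosen for illustration; they do not describe the images used in this chapter.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Return depth in metres for a rectified stereo pair.

    Assumes both cameras share the same focal length (in pixels) and are
    separated horizontally by baseline_m metres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to triangulate a depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, cameras 10 cm apart.
near = depth_from_disparity(64, focal_length_px=700, baseline_m=0.1)
far = depth_from_disparity(16, focal_length_px=700, baseline_m=0.1)
print(near, far)
```

A larger disparity gives a smaller depth, which is why nearby objects shift more between the two views than distant ones.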

Let's see how OpenCV applies epipolar geometry to calculate a so-called disparity map, which is basically a representation of the different depths detected in the images. This will enable us to extract the foreground of a picture and discard the rest.

First, we need two images of the same subject taken from different points of view. Take care that both pictures are taken at an equal distance from the subject; otherwise, the calculations will fail and the disparity map will be meaningless.

So, moving on to an example:

import numpy as np
import cv2

def update(val=0):
    # Copy the current trackbar positions into the StereoSGBM instance.
    stereo.setBlockSize(cv2.getTrackbarPos('window_size', 'disparity'))
    stereo.setUniquenessRatio(cv2.getTrackbarPos('uniquenessRatio', 'disparity'))
    stereo.setSpeckleWindowSize(cv2.getTrackbarPos('speckleWindowSize', 'disparity'))
    stereo.setSpeckleRange(cv2.getTrackbarPos('speckleRange', 'disparity'))
    stereo.setDisp12MaxDiff(cv2.getTrackbarPos('disp12MaxDiff', 'disparity'))

    print('computing disparity...')
    # compute() returns fixed-point disparities scaled by 16.
    disp = stereo.compute(imgL, imgR).astype(np.float32) / 16.0

    cv2.imshow('left', imgL)
    # Rescale so that larger disparities are displayed as brighter pixels.
    cv2.imshow('disparity', (disp - min_disp) / num_disp)

if __name__ == "__main__":
    window_size = 5
    min_disp = 16
    num_disp = 192 - min_disp  # must be divisible by 16
    uniquenessRatio = 1
    speckleRange = 3
    speckleWindowSize = 3
    disp12MaxDiff = 200
    P1 = 600   # 8 * 3 channels * window_size ** 2
    P2 = 2400  # 32 * 3 channels * window_size ** 2
    imgL = cv2.imread('images/color1_small.jpg')
    imgR = cv2.imread('images/color2_small.jpg')
    if imgL is None or imgR is None:
        raise SystemExit('could not load the input images')
    stereo = cv2.StereoSGBM_create(
        minDisparity=min_disp,
        numDisparities=num_disp,
        blockSize=window_size,
        uniquenessRatio=uniquenessRatio,
        speckleRange=speckleRange,
        speckleWindowSize=speckleWindowSize,
        disp12MaxDiff=disp12MaxDiff,
        P1=P1,
        P2=P2
    )
    # The window must exist before trackbars can be attached to it.
    cv2.namedWindow('disparity')
    cv2.createTrackbar('speckleRange', 'disparity', speckleRange, 50, update)
    cv2.createTrackbar('window_size', 'disparity', window_size, 21, update)
    cv2.createTrackbar('speckleWindowSize', 'disparity', speckleWindowSize, 200, update)
    cv2.createTrackbar('uniquenessRatio', 'disparity', uniquenessRatio, 50, update)
    cv2.createTrackbar('disp12MaxDiff', 'disparity', disp12MaxDiff, 250, update)
    # Compute and show the initial disparity map, then wait for a key press.
    update()
    cv2.waitKey()
    cv2.destroyAllWindows()

In this example, we take two images of the same subject and calculate a disparity map in which brighter pixels are points closer to the camera. The black areas are the farthest points, or pixels for which no reliable disparity was found.
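To see why the script divides by 16 and then rescales, here is a small sketch that uses a hand-made array as a stand-in for the output of stereo.compute. The array values are made up for illustration; the only facts relied on are that StereoSGBM returns 16-bit fixed-point disparities scaled by 16, that it marks invalid pixels with minDisparity - 1, and that cv2.imshow displays floating-point images by mapping 0.0 to black and 1.0 to white.

```python
import numpy as np

min_disp = 16
num_disp = 192 - min_disp

# Stand-in for stereo.compute(imgL, imgR): int16 values scaled by 16.
# The first entry, (min_disp - 1) * 16, plays the role of an invalid pixel.
raw = np.array([[(min_disp - 1) * 16, 16 * 16, 60 * 16, 104 * 16]],
               dtype=np.int16)

disp = raw.astype(np.float32) / 16.0    # disparities in pixels
display = (disp - min_disp) / num_disp  # 0.0 is far, values near 1.0 are close
display = np.clip(display, 0.0, 1.0)    # invalid pixels end up at 0.0 (black)
print(display)
```

The clip step is optional for display purposes, but it makes the range explicit if you want to save the map or process it further.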

First of all, we import numpy and cv2 as usual.

Let's skip the definition of the update function for a second and take a look at the main code; the process is quite simple: load two images, create a StereoSGBM instance (StereoSGBM stands for semiglobal block matching, and it is an algorithm used for computing disparity maps), and also create a few trackbars to play around with the parameters of the algorithm and call the update function.

The update function applies the trackbar values to the StereoSGBM instance, and then calls the compute method, which produces a disparity map. All in all, pretty simple! Here is the first image I've used:

[Figure: first input image]

This is the second one:

[Figure: second input image]

There you go: a nice disparity map that is quite easy to interpret.

[Figure: resulting disparity map]
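Once a disparity map is available, extracting the foreground can be as simple as keeping the pixels whose disparity exceeds a threshold. The sketch below shows one possible way to do this with NumPy on tiny made-up arrays. The helper name foreground_mask and the threshold of 60 are assumptions for illustration; a real threshold has to be tuned for each image pair.

```python
import numpy as np

def foreground_mask(disp, threshold):
    """Return a boolean mask that is True where disparity exceeds threshold."""
    return disp > threshold

# Made-up 2x3 disparity map (in pixels) and a matching flat gray image.
disp = np.array([[15.0, 20.0, 90.0],
                 [15.0, 95.0, 100.0]], dtype=np.float32)
image = np.full(disp.shape + (3,), 200, dtype=np.uint8)

mask = foreground_mask(disp, threshold=60.0)
# Keep the pixels of the foreground and paint everything else black.
foreground = np.where(mask[..., None], image, 0).astype(np.uint8)
print(mask)
```

In the script above, disp and imgL would take the place of the made-up arrays, since the disparity map is aligned with the left image.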

The parameters used by StereoSGBM are as follows (taken from the OpenCV documentation):




minDisparity: The minimum possible disparity value. Normally, it is zero, but rectification algorithms can sometimes shift images, so this parameter needs to be adjusted accordingly.

numDisparities: The maximum disparity minus the minimum disparity. The value is always greater than zero. In the current implementation, this parameter must be divisible by 16.

blockSize: The matched block size. It must be an odd number greater than or equal to 1. Normally, it should be somewhere in the 3-11 range.

P1: The first parameter controlling the disparity smoothness. See the next point.

P2: The second parameter controlling the disparity smoothness. The larger the values are, the smoother the disparity is. P1 is the penalty on a disparity change of plus or minus 1 between neighboring pixels. P2 is the penalty on a disparity change of more than 1 between neighboring pixels. The algorithm requires P2 > P1. See the stereo_match.cpp sample, where some reasonably good P1 and P2 values are shown (such as 8*number_of_image_channels*windowSize*windowSize and 32*number_of_image_channels*windowSize*windowSize, respectively).

disp12MaxDiff: The maximum allowed difference (in integer pixel units) in the left-right disparity check. Set it to a nonpositive value to disable the check.

preFilterCap: The truncation value for prefiltered image pixels. The algorithm first computes the x-derivative at each pixel and clips its value to the [-preFilterCap, preFilterCap] interval. The resulting values are passed to the Birchfield-Tomasi pixel cost function.

uniquenessRatio: The margin, in percent, by which the best (minimum) computed cost function value should "win" over the second best value for the found match to be considered correct. Normally, a value within the 5-15 range is good enough.

speckleWindowSize: The maximum size of smooth disparity regions that are treated as noise speckles and invalidated. Set it to 0 to disable speckle filtering. Otherwise, set it somewhere in the 50-200 range.

speckleRange: The maximum disparity variation within each connected component. If you do speckle filtering, set the parameter to a positive value; it will implicitly be multiplied by 16. Normally, 1 or 2 is good enough.
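The P1 and P2 formulas quoted from the stereo_match.cpp sample are easy to turn into a helper. The function below is a small sketch of that arithmetic; the name sgbm_penalties is an assumption for illustration and is not part of OpenCV.

```python
def sgbm_penalties(channels, block_size):
    """Return (P1, P2) using the formulas from the stereo_match.cpp sample."""
    p1 = 8 * channels * block_size ** 2
    p2 = 32 * channels * block_size ** 2
    return p1, p2


# Three-channel (color) images with a 5x5 block, as in the script above.
p1, p2 = sgbm_penalties(channels=3, block_size=5)
print(p1, p2)  # 600 2400
```

These are the same values the script hardcodes as P1 = 600 and P2 = 2400, and because P2 is always four times P1, the P2 > P1 requirement holds automatically.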

With the preceding script, you'll be able to load the images and play around with parameters until you're happy with the disparity map generated by StereoSGBM.
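When playing with the parameters, keep in mind that a trackbar can produce any integer in its range, whereas the documentation asks for an odd blockSize and a numDisparities value divisible by 16. One possible approach is to round the raw values before passing them on. The helpers below are a sketch of that idea; their names are assumptions for illustration and are not part of OpenCV.

```python
def valid_block_size(value):
    """Clamp to at least 1 and round even values up to the next odd number."""
    value = max(int(value), 1)
    return value if value % 2 == 1 else value + 1


def valid_num_disparities(value):
    """Round up to the nearest positive multiple of 16."""
    value = max(int(value), 1)
    return ((value + 15) // 16) * 16


print(valid_block_size(4), valid_num_disparities(100))  # 5 112
```

In the update function, the value read from the 'window_size' trackbar could be passed through valid_block_size before calling stereo.setBlockSize.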
