There is virtually no limit to the types of objects you can detect in your images and videos. However, to obtain an acceptable level of accuracy, you need a sufficiently large dataset containing training images that are identical in size.
Building such a dataset would be a time-consuming operation if we were to do it all by ourselves (which is entirely possible).
Instead, we can take advantage of ready-made datasets; a number of them are freely downloadable from various sources.
I'll be using the UIUC dataset in my example, but feel free to explore the Internet for other types of datasets.
Now, let's take a look at an example:
import cv2
import numpy as np
from os.path import join

datapath = "/home/d3athmast3r/dev/python/CarData/TrainImages/"

def path(cls,i):
    return "%s/%s%d.pgm" % (datapath,cls,i+1)

pos, neg = "pos-", "neg-"

detect = cv2.xfeatures2d.SIFT_create()
extract = cv2.xfeatures2d.SIFT_create()

flann_params = dict(algorithm = 1, trees = 5)
flann = cv2.FlannBasedMatcher(flann_params, {})

bow_kmeans_trainer = cv2.BOWKMeansTrainer(40)
extract_bow = cv2.BOWImgDescriptorExtractor(extract, flann)

def extract_sift(fn):
    im = cv2.imread(fn,0)
    return extract.compute(im, detect.detect(im))[1]

for i in range(8):
    bow_kmeans_trainer.add(extract_sift(path(pos,i)))
    bow_kmeans_trainer.add(extract_sift(path(neg,i)))

voc = bow_kmeans_trainer.cluster()
extract_bow.setVocabulary( voc )

def bow_features(fn):
    im = cv2.imread(fn,0)
    return extract_bow.compute(im, detect.detect(im))

traindata, trainlabels = [],[]
for i in range(20):
    traindata.extend(bow_features(path(pos, i))); trainlabels.append(1)
    traindata.extend(bow_features(path(neg, i))); trainlabels.append(-1)

svm = cv2.ml.SVM_create()
svm.train(np.array(traindata), cv2.ml.ROW_SAMPLE, np.array(trainlabels))

def predict(fn):
    f = bow_features(fn)
    p = svm.predict(f)
    print fn, " ", p[1][0][0]
    return p

car, notcar = "/home/d3athmast3r/dev/python/study/images/car.jpg", "/home/d3athmast3r/dev/python/study/images/bb.jpg"
car_img = cv2.imread(car)
notcar_img = cv2.imread(notcar)
car_predict = predict(car)
not_car_predict = predict(notcar)

font = cv2.FONT_HERSHEY_SIMPLEX

if (car_predict[1][0][0] == 1.0):
    cv2.putText(car_img,'Car Detected',(10,30), font, 1,(0,255,0),2,cv2.LINE_AA)

if (not_car_predict[1][0][0] == -1.0):
    cv2.putText(notcar_img,'Car Not Detected',(10,30), font, 1,(0,0, 255),2,cv2.LINE_AA)

cv2.imshow('BOW + SVM Success', car_img)
cv2.imshow('BOW + SVM Failure', notcar_img)
cv2.waitKey(0)
cv2.destroyAllWindows()
This is quite a lot to assimilate, so let's go through what we've done:
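After the usual imports and the declaration of the base path of our training images, we declare a function, path, together with the two class name prefixes:

def path(cls,i):
    return "%s/%s%d.pgm" % (datapath,cls,i+1)

pos, neg = "pos-", "neg-"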
More on the path function
This function is a utility method: given the name of a class (in our case, we have two classes, pos and neg) and a numerical index, we return the full path to a particular training image. Our car dataset contains images named in the following way: pos-x.pgm and neg-x.pgm, where x is a number.
The usefulness of this function becomes clear when iterating through a range of numbers (say, 20): because the function adds 1 to the index, the loop loads all images from pos-1.pgm to pos-20.pgm, and the same goes for the negative class.
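As a quick check of the naming scheme, the following sketch prints the first few paths that the helper generates. It is written so that it also runs under Python 3, and the base path is a placeholder assumed for illustration, not the author's directory:

```python
# Placeholder base path, assumed for illustration only.
datapath = "/path/to/CarData/TrainImages"

def path(cls, i):
    # The index is shifted by one, so i = 0 maps to "<cls>1.pgm".
    return "%s/%s%d.pgm" % (datapath, cls, i + 1)

pos, neg = "pos-", "neg-"

# Paths for the first three positive samples.
first_paths = [path(pos, i) for i in range(3)]
for p in first_paths:
    print(p)
```

Iterating with range(20) therefore touches the files numbered 1 through 20 in each class.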
detect = cv2.xfeatures2d.SIFT_create()
extract = cv2.xfeatures2d.SIFT_create()

flann_params = dict(algorithm = 1, trees = 5)
flann = cv2.FlannBasedMatcher(flann_params, {})
Note that currently, the enum values for FLANN are missing from the Python version of OpenCV 3, so the number 1, which is passed as the algorithm parameter, represents the FLANN_INDEX_KDTREE algorithm. I suspect the final version will be cv2.FLANN_INDEX_KDTREE, which is a little more helpful. Make sure to check the enum values for the correct flags.
bow_kmeans_trainer = cv2.BOWKMeansTrainer(40)
extract_bow = cv2.BOWImgDescriptorExtractor(extract, flann)
def extract_sift(fn):
    im = cv2.imread(fn,0)
    return extract.compute(im, detect.detect(im))[1]
At this stage, we have everything we need to start training the BOW trainer.
for i in range(8):
    bow_kmeans_trainer.add(extract_sift(path(pos,i)))
    bow_kmeans_trainer.add(extract_sift(path(neg,i)))

Once the descriptors have been added, we call the cluster() method on the trainer, which performs the k-means clustering and returns the vocabulary of visual words. We'll assign this vocabulary to BOWImgDescriptorExtractor so that it can extract descriptors from test images:

vocabulary = bow_kmeans_trainer.cluster()
extract_bow.setVocabulary(vocabulary)
def bow_features(fn):
    im = cv2.imread(fn,0)
    return extract_bow.compute(im, detect.detect(im))
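Conceptually, the descriptor returned by a BOW extractor is a normalized histogram of visual words. The following toy sketch in plain Python illustrates the idea; it is not OpenCV's implementation, and the vocabulary, the descriptors, and the helper names are all made up for illustration:

```python
def nearest_word(descriptor, vocabulary):
    # Index of the visual word (cluster center) closest to the descriptor,
    # measured with the squared Euclidean distance.
    distances = [
        sum((d - c) ** 2 for d, c in zip(descriptor, center))
        for center in vocabulary
    ]
    return distances.index(min(distances))

def bow_histogram(descriptors, vocabulary):
    # Count how often each visual word is the nearest match, then
    # normalize so that the histogram sums to 1.
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = float(len(descriptors))
    return [c / total for c in counts]

# Toy data: a 2-word vocabulary and three 2-dimensional "descriptors".
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.5], [9.0, 11.0], [0.2, 0.1]]

histogram = bow_histogram(descriptors, vocabulary)
print(histogram)
```

Two of the three toy descriptors fall closest to the first word, so the first bin holds two thirds of the mass. Real SIFT descriptors have 128 dimensions, but the principle is the same.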
Next, we create two arrays to hold the training data and labels, and populate them with the descriptors generated by BOWImgDescriptorExtractor, associating labels with the positive and negative images we're feeding (1 stands for a positive match, -1 for a negative):

traindata, trainlabels = [],[]
for i in range(20):
    traindata.extend(bow_features(path(pos, i))); trainlabels.append(1)
    traindata.extend(bow_features(path(neg, i))); trainlabels.append(-1)
svm = cv2.ml.SVM_create()
svm.train(np.array(traindata), cv2.ml.ROW_SAMPLE, np.array(trainlabels))
We're all set with a trained SVM; all that is left to do is to feed the SVM a couple of sample images and see how it behaves.
To do so, we define a function that prints the result of the SVM's predict method and returns it:

def predict(fn):
    f = bow_features(fn)
    p = svm.predict(f)
    print fn, " ", p[1][0][0]
    return p

We then load two sample images and run the prediction on each:

car, notcar = "/home/d3athmast3r/dev/python/study/images/car.jpg", "/home/d3athmast3r/dev/python/study/images/bb.jpg"
car_img = cv2.imread(car)
notcar_img = cv2.imread(notcar)

car_predict = predict(car)
not_car_predict = predict(notcar)
Naturally, we're hoping that the car image will be detected as a car (result of predict() should be 1.0), and that the other image will not (result should be -1.0), so we will only add text to the images if the result is the expected one.

font = cv2.FONT_HERSHEY_SIMPLEX

if (car_predict[1][0][0] == 1.0):
    cv2.putText(car_img,'Car Detected',(10,30), font, 1,(0,255,0),2,cv2.LINE_AA)

if (not_car_predict[1][0][0] == -1.0):
    cv2.putText(notcar_img,'Car Not Detected',(10,30), font, 1,(0,0, 255),2,cv2.LINE_AA)

cv2.imshow('BOW + SVM Success', car_img)
cv2.imshow('BOW + SVM Failure', notcar_img)
cv2.waitKey(0)
cv2.destroyAllWindows()
The preceding operation produces the following result:
It also results in this:
Having detected an object is an impressive achievement, but now we want to push this to the next level in these ways:
To accomplish this, we will use the sliding windows approach. If it's not already clear from the previous explanation of the concept of sliding windows, the rationale behind the adoption of this approach will become more apparent if we take a look at a diagram:
Observe the movement of the block:
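When the block reaches the right edge of the image, we reset the x coordinate to 0 and move down a step, and repeat the entire process.
Once the entire image has been classified, we scale it down and repeat the sliding windows process. Continue rescaling and classifying until you get to a minimum size.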
This gives you the chance to detect objects in several regions of the image and at different scales.
At this stage, you will have collected important information about the content of the image; however, there's a problem: it's most likely that you will end up with a number of overlapping blocks that give you a positive score. This means that your image may contain one object that gets detected four or five times, and if you were to report the result of the detection, your report would be quite inaccurate, so here's where non-maximum suppression comes into play.
We are now ready to apply all the concepts we learned so far to a real-life example, and create a car detector application that scans an image and draws rectangles around cars.
Let's summarize the process before diving into the code:
Let's also take a look at the structure of the project, as it is a bit more complex than the classic standalone script approach we've adopted until now.
The project structure is as follows:
├── car_detector
│   ├── detector.py
│   ├── __init__.py
│   ├── non_maximum.py
│   ├── pyramid.py
│   └── sliding_window.py
└── car_sliding_windows.py
The main program is in car_sliding_windows.py, and all the utilities are contained in the car_detector folder. As we're using Python 2.7, we'll need an __init__.py file in the folder for it to be detected as a module.
The four files in the car_detector module are as follows:
Let's examine them one by one, starting from the image pyramid:
import cv2

def resize(img, scaleFactor):
    return cv2.resize(img, (int(img.shape[1] * (1 / scaleFactor)), int(img.shape[0] * (1 / scaleFactor))), interpolation=cv2.INTER_AREA)

def pyramid(image, scale=1.5, minSize=(200, 80)):
    yield image

    while True:
        image = resize(image, scale)
        if image.shape[0] < minSize[1] or image.shape[1] < minSize[0]:
            break

        yield image
This module contains two function definitions:
You will notice that the image is not returned with the return keyword but with the yield keyword. This is because this function is a so-called generator. If you are not familiar with generators, take a look at https://wiki.python.org/moin/Generators.
This will allow us to obtain a resized image to process in our main program.
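To get a feel for how many layers the pyramid produces, the following sketch reproduces the generator's control flow on plain (width, height) tuples instead of images. The function name pyramid_sizes, the 800 x 600 starting size, and the scale of 2.0 are assumptions chosen for illustration:

```python
def pyramid_sizes(width, height, scale=1.5, min_size=(200, 80)):
    # Same control flow as the image pyramid, but tracking only sizes.
    yield (width, height)
    while True:
        width = int(width * (1 / scale))
        height = int(height * (1 / scale))
        # Stop as soon as the layer is smaller than the minimum size.
        if height < min_size[1] or width < min_size[0]:
            break
        yield (width, height)

sizes = list(pyramid_sizes(800, 600, scale=2.0))
print(sizes)
```

With these values the pyramid has three layers; the fourth candidate (100 x 75) is narrower than the minimum width of 200 and ends the loop.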
Next up is the sliding windows function:
def sliding_window(image, stepSize, windowSize):
    for y in xrange(0, image.shape[0], stepSize):
        for x in xrange(0, image.shape[1], stepSize):
            yield (x, y, image[y:y + windowSize[1], x:x + windowSize[0]])
Again, this is a generator. Although the loops are nested, the mechanism is very simple: given an image, it yields a window that moves by an arbitrarily sized step from the left margin towards the right until the entire width of the image is covered. It then goes back to the left margin but one step down, covering the width of the image repeatedly until the bottom-right corner of the image is reached. You can visualize this as the same pattern used for writing on a piece of paper: start from the left margin and reach the right margin, then move on to the next line from the left margin.
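The scanning order can be sketched without any image data. The hypothetical helper below (written so that it runs under Python 3, hence range instead of xrange) lists the top-left corner of each window. Unlike the generator above, it skips windows that would be clipped at the image border:

```python
def window_origins(image_width, image_height, step, window_size):
    # Yield the top-left corner of every full-sized window, scanning
    # left to right and then top to bottom.
    win_w, win_h = window_size
    for y in range(0, image_height, step):
        for x in range(0, image_width, step):
            # Skip windows that would be clipped by the image border.
            if x + win_w <= image_width and y + win_h <= image_height:
                yield (x, y)

# A 140 x 60 "image", a step of 20 pixels, and a 100 x 40 window.
origins = list(window_origins(140, 60, 20, (100, 40)))
print(origins)
```

The x coordinate advances first and resets to 0 each time y increases, which is the left-to-right, top-to-bottom pattern described in the text.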
The last utility is non-maximum suppression, which looks like this (Malisiewicz/Rosebrock's code):
import numpy as np

def non_max_suppression_fast(boxes, overlapThresh):
    # if there are no boxes, return an empty list
    if len(boxes) == 0:
        return []

    # if the bounding boxes are integers, convert them to floats --
    # this is important since we'll be doing a bunch of divisions
    if boxes.dtype.kind == "i":
        boxes = boxes.astype("float")

    # initialize the list of picked indexes
    pick = []

    # grab the coordinates of the bounding boxes
    x1 = boxes[:,0]
    y1 = boxes[:,1]
    x2 = boxes[:,2]
    y2 = boxes[:,3]
    scores = boxes[:,4]

    # compute the area of the bounding boxes and sort the bounding
    # boxes by the score/probability of the bounding box; the sort is
    # ascending, so the last index is the highest-scoring box
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    idxs = np.argsort(scores)

    # keep looping while some indexes still remain in the indexes
    # list
    while len(idxs) > 0:
        # grab the last index in the indexes list and add the
        # index value to the list of picked indexes
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)

        # find the largest (x, y) coordinates for the start of
        # the bounding box and the smallest (x, y) coordinates
        # for the end of the bounding box
        xx1 = np.maximum(x1[i], x1[idxs[:last]])
        yy1 = np.maximum(y1[i], y1[idxs[:last]])
        xx2 = np.minimum(x2[i], x2[idxs[:last]])
        yy2 = np.minimum(y2[i], y2[idxs[:last]])

        # compute the width and height of the bounding box
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)

        # compute the ratio of overlap
        overlap = (w * h) / area[idxs[:last]]

        # delete all indexes from the index list that have an
        # overlap greater than the threshold
        idxs = np.delete(idxs, np.concatenate(([last],
            np.where(overlap > overlapThresh)[0])))

    # return only the bounding boxes that were picked using the
    # integer data type
    return boxes[pick].astype("int")
This function simply takes a list of rectangles and sorts them by their score. Starting from the box with the highest score, it eliminates all boxes that overlap beyond a certain threshold by calculating the area of intersection and determining whether it is greater than a certain threshold.
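The same greedy idea can be sketched in plain Python on a handful of boxes. This is a simplified illustration rather than a replacement for the NumPy listing above: the helper names and the sample boxes are made up, and the overlap ratio deliberately mirrors that listing (intersection area divided by the area of the candidate box, not intersection-over-union):

```python
def box_area(box):
    x1, y1, x2, y2 = box[:4]
    return (x2 - x1 + 1) * (y2 - y1 + 1)

def overlap_ratio(kept, other):
    # Intersection area divided by the area of `other`.
    w = max(0, min(kept[2], other[2]) - max(kept[0], other[0]) + 1)
    h = max(0, min(kept[3], other[3]) - max(kept[1], other[1]) + 1)
    return float(w * h) / box_area(other)

def greedy_nms(boxes, overlap_thresh):
    # Visit boxes from highest to lowest score; keep a box only if it
    # does not overlap too much with a box that was already kept.
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if overlap_ratio(best, b) <= overlap_thresh]
    return kept

# Each box is (x1, y1, x2, y2, score).
boxes = [
    (10, 10, 110, 50, 2.5),     # strong detection
    (15, 12, 115, 52, 1.8),     # near-duplicate of the first box
    (300, 200, 400, 240, 1.2),  # separate object
]

kept = greedy_nms(boxes, 0.25)
print(kept)
```

The near-duplicate is suppressed because roughly 90 percent of its area is covered by the stronger box, while the distant box survives untouched.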
Now, let's examine the heart of this program, which is detector.py. This is a bit long and complex; however, everything should appear much clearer given our newfound familiarity with the concepts of BOW, SVM, and feature detection/extraction.
Here's the code:
import cv2
import numpy as np

datapath = "/path/to/CarData/TrainImages/"
SAMPLES = 400

def path(cls,i):
    return "%s/%s%d.pgm" % (datapath,cls,i+1)

def get_flann_matcher():
    flann_params = dict(algorithm = 1, trees = 5)
    return cv2.FlannBasedMatcher(flann_params, {})

def get_bow_extractor(extract, flann):
    return cv2.BOWImgDescriptorExtractor(extract, flann)

def get_extract_detect():
    return cv2.xfeatures2d.SIFT_create(), cv2.xfeatures2d.SIFT_create()

def extract_sift(fn, extractor, detector):
    im = cv2.imread(fn,0)
    return extractor.compute(im, detector.detect(im))[1]

def bow_features(img, extractor_bow, detector):
    return extractor_bow.compute(img, detector.detect(img))

def car_detector():
    pos, neg = "pos-", "neg-"
    detect, extract = get_extract_detect()
    matcher = get_flann_matcher()
    print "building BOWKMeansTrainer..."
    bow_kmeans_trainer = cv2.BOWKMeansTrainer(1000)
    # use the matcher created above (the name flann is not defined here)
    extract_bow = cv2.BOWImgDescriptorExtractor(extract, matcher)

    print "adding features to trainer"
    for i in range(SAMPLES):
        print i
        bow_kmeans_trainer.add(extract_sift(path(pos,i), extract, detect))
        bow_kmeans_trainer.add(extract_sift(path(neg,i), extract, detect))

    voc = bow_kmeans_trainer.cluster()
    extract_bow.setVocabulary( voc )

    traindata, trainlabels = [],[]
    print "adding to train data"
    for i in range(SAMPLES):
        print i
        traindata.extend(bow_features(cv2.imread(path(pos, i), 0), extract_bow, detect))
        trainlabels.append(1)
        traindata.extend(bow_features(cv2.imread(path(neg, i), 0), extract_bow, detect))
        trainlabels.append(-1)

    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setGamma(0.5)
    svm.setC(30)
    svm.setKernel(cv2.ml.SVM_RBF)

    svm.train(np.array(traindata), cv2.ml.ROW_SAMPLE, np.array(trainlabels))
    return svm, extract_bow
Let's go through it. First, we'll import our usual modules, and then set a path for the training images.
Then, we'll define a number of utility functions:
def path(cls,i):
    return "%s/%s%d.pgm" % (datapath,cls,i+1)

This function returns the path to an image, given a class name and a numerical index; the base path is taken from datapath. In our example, we're going to use the neg- and pos- class names, because this is what the training images are called (for example, neg-1.pgm). The last argument is an integer used to compose the final part of the image path.
Next, we'll define a utility function to obtain a FLANN matcher:
def get_flann_matcher():
    flann_params = dict(algorithm = 1, trees = 5)
    return cv2.FlannBasedMatcher(flann_params, {})

Again, note that the integer 1, passed as the algorithm argument, represents FLANN_INDEX_KDTREE.
The next two functions return a BOW descriptor extractor and the SIFT feature detector/extractor pair:

def get_bow_extractor(extract, flann):
    return cv2.BOWImgDescriptorExtractor(extract, flann)

def get_extract_detect():
    return cv2.xfeatures2d.SIFT_create(), cv2.xfeatures2d.SIFT_create()
The next utility is a function used to return features from an image:
def extract_sift(fn, extractor, detector):
    im = cv2.imread(fn,0)
    return extractor.compute(im, detector.detect(im))[1]

We'll also define a similar utility function to extract the BOW features:

def bow_features(img, extractor_bow, detector):
    return extractor_bow.compute(img, detector.detect(img))
In the main car_detector function, we'll first create the necessary objects used to perform feature detection and extraction:

pos, neg = "pos-", "neg-"
detect, extract = get_extract_detect()
matcher = get_flann_matcher()
bow_kmeans_trainer = cv2.BOWKMeansTrainer(1000)
extract_bow = cv2.BOWImgDescriptorExtractor(extract, matcher)
Then, we'll add features taken from training images to the trainer:
print "adding features to trainer"
for i in range(SAMPLES):
    print i
    bow_kmeans_trainer.add(extract_sift(path(pos,i), extract, detect))
    bow_kmeans_trainer.add(extract_sift(path(neg,i), extract, detect))

In each iteration, we'll add one positive image and one negative image to the trainer.
After this, we'll instruct the trainer to cluster the data into k groups.
The clustered data is now our vocabulary of visual words, and we can set the BOWImgDescriptorExtractor class' vocabulary in this way:

vocabulary = bow_kmeans_trainer.cluster()
extract_bow.setVocabulary(vocabulary)
With a visual vocabulary ready, we can now associate train data with classes. In our case, we have two classes: -1 for negative results and 1 for positive ones.
Let's populate two arrays, traindata and trainlabels, containing extracted features and their corresponding labels. Iterating through the dataset, we can quickly set this up with the following code:

traindata, trainlabels = [], []
print "adding to train data"
for i in range(SAMPLES):
    print i
    traindata.extend(bow_features(cv2.imread(path(pos, i), 0), extract_bow, detect))
    trainlabels.append(1)
    traindata.extend(bow_features(cv2.imread(path(neg, i), 0), extract_bow, detect))
    trainlabels.append(-1)
You will notice that at each cycle, we'll add one positive and one negative image, and then populate the labels with a 1 and a -1 value to keep the data synchronized with the labels.
Should you wish to train more classes, you could do that by following this pattern:
traindata, trainlabels = [], []
print "adding to train data"
for i in range(SAMPLES):
    print i
    traindata.extend(bow_features(cv2.imread(path(class1, i), 0), extract_bow, detect))
    trainlabels.append(1)
    traindata.extend(bow_features(cv2.imread(path(class2, i), 0), extract_bow, detect))
    trainlabels.append(2)
    traindata.extend(bow_features(cv2.imread(path(class3, i), 0), extract_bow, detect))
    trainlabels.append(3)
For example, you could train a detector to detect cars and people and perform detection on these in an image containing both cars and people.
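The repetition in the multi-class pattern can be folded into a loop over (prefix, label) pairs. The sketch below is only an illustration of how data and labels stay aligned: the class prefixes are hypothetical, and a stand-in function replaces the real BOW feature extraction so that the example runs without OpenCV:

```python
def build_labelled_samples(class_labels, samples_per_class, feature_fn):
    # Append one feature vector and one label per image so that both
    # lists stay aligned index by index.
    traindata, trainlabels = [], []
    for i in range(samples_per_class):
        for prefix, label in class_labels:
            traindata.append(feature_fn(prefix, i))
            trainlabels.append(label)
    return traindata, trainlabels

# Hypothetical class prefixes with their integer labels.
class_labels = [("car-", 1), ("person-", 2), ("bike-", 3)]

def fake_features(prefix, i):
    # Stand-in for the real BOW descriptor of image number i.
    return [len(prefix), i]

data, labels = build_labelled_samples(class_labels, 2, fake_features)
print(labels)
```

In a real program, feature_fn would load the image for the given prefix and index and return its BOW descriptor.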
Lastly, we'll train the SVM with the following code:
svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setGamma(0.5)
svm.setC(30)
svm.setKernel(cv2.ml.SVM_RBF)

svm.train(np.array(traindata), cv2.ml.ROW_SAMPLE, np.array(trainlabels))
return svm, extract_bow
There are two parameters in particular that I'd like to focus your attention on:
SVM_LINEAR indicates a linear hyperplane, which, in practical terms, works very well for a binary classification (the test sample either belongs to a class or it doesn't), while SVM_RBF (radial basis function) separates data using Gaussian functions, which means that the data is split into several kernels defined by these functions. When training the SVM to classify more than two classes whose data is not linearly separable, RBF is generally the better choice.
Finally, we'll pass the traindata and trainlabels arrays into the SVM train method, and return the SVM and BOW extractor object. This is because in our applications, we don't want to have to recreate the vocabulary every time, so we expose it for reuse.
We are ready to test our car detector!
Let's first create a simple program that loads an image, and then performs detection using the sliding windows and image pyramid techniques:

import cv2
import numpy as np
from car_detector.detector import car_detector, bow_features
from car_detector.pyramid import pyramid
from car_detector.non_maximum import non_max_suppression_fast as nms
from car_detector.sliding_window import sliding_window

def in_range(number, test, thresh=0.2):
    return abs(number - test) < thresh

test_image = "/path/to/cars.jpg"

svm, extractor = car_detector()
detect = cv2.xfeatures2d.SIFT_create()

w, h = 100, 40
# read the image named by test_image (the original referred to an undefined test_img)
img = cv2.imread(test_image)

rectangles = []
counter = 1
scaleFactor = 1.25
scale = 1
font = cv2.FONT_HERSHEY_PLAIN

for resized in pyramid(img, scaleFactor):
    scale = float(img.shape[1]) / float(resized.shape[1])
    for (x, y, roi) in sliding_window(resized, 20, (w, h)):

        if roi.shape[1] != w or roi.shape[0] != h:
            continue

        try:
            bf = bow_features(roi, extractor, detect)
            _, result = svm.predict(bf)
            a, res = svm.predict(bf, flags=cv2.ml.STAT_MODEL_RAW_OUTPUT)
            print "Class: %d, Score: %f" % (result[0][0], res[0][0])
            score = res[0][0]
            if result[0][0] == 1:
                if score < -1.0:
                    rx, ry, rx2, ry2 = int(x * scale), int(y * scale), int((x+w) * scale), int((y+h) * scale)
                    rectangles.append([rx, ry, rx2, ry2, abs(score)])
        except:
            pass

        counter += 1

windows = np.array(rectangles)
boxes = nms(windows, 0.25)

for (x, y, x2, y2, score) in boxes:
    print x, y, x2, y2, score
    cv2.rectangle(img, (int(x),int(y)),(int(x2), int(y2)),(0, 255, 0), 1)
    cv2.putText(img, "%f" % score, (int(x),int(y)), font, 1, (0, 255, 0))

cv2.imshow("img", img)
cv2.waitKey(0)
The notable part of the program is the code within the pyramid/sliding window loop:

bf = bow_features(roi, extractor, detect)
_, result = svm.predict(bf)
a, res = svm.predict(bf, flags=cv2.ml.STAT_MODEL_RAW_OUTPUT)
print "Class: %d, Score: %f" % (result[0][0], res[0][0])
score = res[0][0]
if result[0][0] == 1:
    if score < -1.0:
        rx, ry, rx2, ry2 = int(x * scale), int(y * scale), int((x+w) * scale), int((y+h) * scale)
        rectangles.append([rx, ry, rx2, ry2, abs(score)])

Here, we extract the features of the region of interest (ROI), which corresponds to the current sliding window, and then we call predict on the extracted features. The predict method has an optional parameter, flags, which, when set to cv2.ml.STAT_MODEL_RAW_OUTPUT, makes predict return the raw score of the prediction (contained at the [0][0] value) instead of the class label.
So, we'll set an arbitrary threshold of -1.0 for classified windows, and all windows with a score lower than -1.0 are going to be taken as good results. As you experiment with your SVMs, you may tweak this threshold until you find a value that gives the best results.
Finally, we add the computed coordinates of the sliding window (meaning, we multiply the current coordinates by the scale of the current layer in the image pyramid so that it gets correctly represented in the final drawing) to the array of rectangles.
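The coordinate mapping is easy to verify by hand. The following sketch isolates it in a hypothetical helper; the image widths are assumed values chosen so that the scale works out to 1.25:

```python
def to_original_coords(x, y, window_size, scale):
    # Scale the window's corners from the resized layer back to the
    # coordinate system of the original image.
    w, h = window_size
    return (int(x * scale), int(y * scale),
            int((x + w) * scale), int((y + h) * scale))

# Assumed example: the original image is 1000 px wide and the current
# pyramid layer is 800 px wide, so scale = 1000 / 800 = 1.25.
scale = float(1000) / float(800)

rect = to_original_coords(40, 20, (100, 40), scale)
print(rect)
```

A 100 x 40 window found at (40, 20) in the smaller layer therefore covers the region from (50, 25) to (175, 75) in the original image.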
There's one last operation we need to perform before drawing our final result: non-maximum suppression.
We turn the rectangles list into a NumPy array (to allow certain kinds of operations that are only possible with NumPy), and then apply NMS:

windows = np.array(rectangles)
boxes = nms(windows, 0.25)
Finally, we proceed with displaying all our results; for the sake of convenience, I've also printed the score obtained for all the remaining windows:
This is a remarkably accurate result!
A final note on SVM: you don't need to train a detector every time you want to use it, which would be extremely impractical. You can use the following code:
svm.save('/path/to/serialized/svmxml')
You can subsequently reload it with a load method and feed it test images or frames.