Object detection with the sliding window solution

In this section, we'll learn to solve the object detection problem of locating multiple objects in an image using the sliding window technique. At the end of this section, we'll explore the downsides of using this method.

The very first step is to train a convolutional neural network (CNN), such as VGG-16, or maybe a more advanced architecture, such as the inception network or even the residual network. We'll use cropped images as the training data, instead of the original, full-size images. The cropped images will contain only the object in question.

In this example, we'll use the car detection problem and the images will contain cars. When it comes to practical implementation, there are two things that we need to keep in mind: 

  • Use different crop sizes for different images. 
  • Add images other than those of cars, because it helps if the neural network knows what an image without a car looks like. We can use images of nature, birds, and so on. 
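The two points above can be sketched in a few lines. This is only a minimal illustration, assuming that images are NumPy arrays and that each car is annotated with a (top, left, height, width) box; the function names and the max_margin parameter are hypothetical and are not part of any library:

```python
import numpy as np

def crop_with_random_margin(image, box, rng, max_margin=0.2):
    """Crop box = (top, left, height, width), enlarged by a random margin.

    The random margin makes the crop size vary from image to image.
    """
    top, left, h, w = box
    margin = rng.uniform(0.0, max_margin)
    dy, dx = int(h * margin), int(w * margin)
    # Clamp the enlarged box to the image borders.
    y0, x0 = max(0, top - dy), max(0, left - dx)
    y1 = min(image.shape[0], top + h + dy)
    x1 = min(image.shape[1], left + w + dx)
    return image[y0:y1, x0:x1]

def build_training_set(car_examples, background_images, rng):
    """Return (crop, label) pairs: 1 for car crops, 0 for images without cars."""
    samples = [(crop_with_random_margin(img, box, rng), 1)
               for img, box in car_examples]
    samples += [(img, 0) for img in background_images]
    return samples
```

In practice, the crops would still need to be resized to the fixed input size that the chosen network expects; that step is left out here.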

To detect multiple cars or objects with a sliding window algorithm, we first need to understand the algorithm itself. The concept of the sliding window is relatively simple. 

Begin by choosing several window sizes, that is, bounding boxes that are similar in size to the cropped images that we used to train the model. The second step involves moving, or sliding, the window across the entire image.

For the sake of this example, let's move the window to the right, then down, and then to the right again along the next row. Continue this process until the window reaches the bottom-right corner of the image. 

We repeat this process for different window sizes and move it in a similar fashion. 
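The traversal just described can be sketched as a small generator. This is a minimal sketch, not a definitive implementation; the function name, the (top, left, height, width) convention, and the stride parameter (how many pixels the window moves per step) are assumptions made for illustration:

```python
def sliding_windows(image_height, image_width, window_sizes, stride):
    """Yield (top, left, height, width) for every window position.

    For each window size, the window moves left to right along a row,
    then drops down by `stride` pixels and starts the next row.
    """
    for win_h, win_w in window_sizes:
        for top in range(0, image_height - win_h + 1, stride):
            for left in range(0, image_width - win_w + 1, stride):
                yield top, left, win_h, win_w

# A 4 x 6 image scanned with a 2 x 2 window and a stride of 2.
windows = list(sliding_windows(4, 6, [(2, 2)], stride=2))
print(len(windows))  # 6
```

A smaller stride produces more windows and finer localization at a higher cost. Note that if the stride does not divide the remaining space evenly, this sketch leaves a strip along the right and bottom edges uncovered.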

The significance of this step is that the pixel values inside each window are used as input to the convolutional network that we trained. Recall that this network was trained only with cropped images. 

The network doesn't know that bigger images exist, but it's capable of telling us whether a car is present in the selected pixels given to the network. 

For any given window size, when the window lies on top of a car, the network tells us that a car is present. When the window doesn't cover a car, the network can tell us that too, because it was also trained with images that don't include cars. 
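To make the idea concrete, here is a minimal sketch of the detection loop. It assumes the trained network is available as a callable that takes a patch of pixels and returns a score between 0 and 1; the names detect, toy_classifier, and threshold are hypothetical, and the toy classifier is only a stand-in for the real CNN:

```python
import numpy as np

def detect(image, classifier, window_sizes, stride, threshold=0.5):
    """Return (top, left, height, width, score) for windows scored as cars."""
    detections = []
    height, width = image.shape[:2]
    for win_h, win_w in window_sizes:
        for top in range(0, height - win_h + 1, stride):
            for left in range(0, width - win_w + 1, stride):
                # The network only ever sees the pixels inside the window.
                patch = image[top:top + win_h, left:left + win_w]
                score = classifier(patch)
                if score >= threshold:
                    detections.append((top, left, win_h, win_w, score))
    return detections

def toy_classifier(patch):
    """Stand-in for the trained CNN: uses mean brightness as the score."""
    return float(patch.mean())

# A dark 8 x 8 image with one bright 2 x 2 "car".
image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0
print(detect(image, toy_classifier, [(2, 2)], stride=2))
# [(2, 2, 2, 2, 1.0)]
```

With a real network, each patch would first be resized to the input size the network was trained on, and the classifier call would be a forward pass of the model.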

Here are some screenshots of the routine in progress. In the following screenshot, you can see that the pixels bounded by the box do not contain a car:

When the selected pixels contain a car, they are marked in red. 

The window size can be varied, and each time we give the selected pixels to the CNN, which was trained with cropped images.
