Chapter Four

Single shot detection for detecting real-time flying objects for unmanned aerial vehicle

Sampurna Mandal1, Sk Md Basharat Mones1, Arshavee Das1, Valentina E. Balas2, Rabindra Nath Shaw3 and Ankush Ghosh1,    1School of Engineering and Applied Sciences, The Neotia University, Kolkata, India,    2Department of Automatics and Applied Software, Aurel Vlaicu University of Arad, Arad, Romania,    3Department of Electrical, Electronics & Communication Engineering, Galgotias University, Greater Noida, India

Abstract

This world of rapidly changing technology has an impact on the infrastructure of worldwide transportation systems, and these changes are becoming clearer with time. As the fields of Robotics and Artificial Intelligence boom, every gadget is getting smarter, and the concept of manual vehicles is shifting toward autonomous vehicles; for example, manual delivery is shifting toward autodelivery by autonomous drones. The Unmanned Aerial Vehicle (UAV) is one of the current hot topics in research, but there are various challenges to overcome regarding safety and reliability. One such challenge for UAVs is detecting flying objects in real time. Our work focuses on detecting real-time objects using the Single Shot Detector (SSD) algorithm. The objective is to train on labeled images of flying objects with SSD and evaluate the efficiency of the algorithm in terms of confidence value and accuracy.

Keywords

Single Shot Detection (SSD); Convolutional Neural Network (CNN); deep learning (DL); machine learning (ML); artificial intelligence (AI); robotics; automation

4.1 Introduction

This world is headed toward a future where the skies are occupied not only by birds, manually piloted airplanes, helicopters, and fighter planes. In the near future the sky will also be occupied by unmanned vehicles, ranging from small drones to large Unmanned Aerial Vehicles (UAVs). Some of these UAVs will be able to communicate among themselves without human intervention. It is extremely important to make commercial as well as military UAVs economical, safe, and reliable. One of the solutions is to use inexpensive sensors and camera devices for collision avoidance.

The latest automotive technologies have provided solutions to similar problems; for example, commercial products [1,2] are now designed to sense obstacles and avoid collisions with other cars, pedestrians, cyclists, and other movable and immovable objects. Modern technology has already made progress on navigation and accurate position estimation with either single or multiple cameras [3–9], although there has been comparatively little improvement in visually guided collision avoidance techniques [10]. It is quite difficult to apply the same algorithms used for pedestrian or automobile detection to this problem simply by extending and modifying them. Some challenges for flying object detection are given below:

  • The object motion is more complex due to the 3D environment.
  • There are various kinds of flying objects with diverse shapes. These flying objects can appear both on the ground and in the sky, which makes the background more diverse and complex.
  • The UAV must detect potentially dangerous objects from a safe distance, where speed is also involved.

Fig. 4.1 shows some examples where flying objects are detected in a single image even though the human eye can hardly see them. When the sequence of frames is observed, these objects pop out and are therefore detected easily, which leads to the conclusion that motion cues play a very crucial role in detection.

image
Figure 4.1 Object detection sample image.

However, these motion cues become very difficult to exploit when the images are taken from a moving camera against textured backgrounds, which are hard to stabilize because they change rapidly and are nonplanar. Moreover, there are often other moving objects in the scene; for example, the first row of Fig. 4.1 shows a person in motion. Motion alone is therefore not sufficient, and the corresponding appearance must also be taken into account.

In this chapter, our target is to detect and classify the object of interest present in the image or frame. The work for flying object detection in this chapter is divided into two steps:

  • At first, images of flying objects taken from various sources, with different shapes and diverse backgrounds, are detected and classified with labeled bounding boxes along with confidence scores predicted by the network.
  • Then, the same network is applied to video to test its performance when motion is included along with a constantly changing background.

The primary motivation behind this work is to test the Single Shot Detection (SSD) algorithm to detect and classify flying objects such as airplanes and birds. The secondary objective for working on this chapter is to demonstrate how this algorithm is successfully applied for real-time flying object detection and classification for frames taken from a video feed during UAV operation.

4.2 Related work

Methods for detecting moving objects fall into three main categories:

  • Appearance-based methods, which rely on the appearance in individual frames;
  • Motion-based methods, which rely on the information about motion across frames; and
  • Hybrid methods, which combine the above two processes.

Let us briefly review each of these categories:

4.2.1 Appearance-based methods

These methods rely on machine learning (ML) algorithms, and their performance is quite good even against cluttered backgrounds or under complex lighting variation. They are typically based on Deformable Part Models [11], Convolutional Neural Networks (CNN) [12], or Random Forests [13]. Among these algorithms, Aggregate Channel Features [14] performs best.

These algorithms perform best when the target objects are clearly visible in individual images, which is not the case here. For example, there may be fog or smoke in the environment, and the algorithm should still be able to detect a flying object under such hazy or foggy conditions. Another example is shown in Fig. 4.1, where the object is quite small and barely visible to the human eye; it was only possible to detect that object because of motion cues.

4.2.2 Motion-based methods

These methods can be divided further into two subclasses. The first comprises methods that rely on background subtraction [15–18]; this subclass identifies objects as groups of pixels that differ from those forming the background. The second subclass comprises methods that rely on optical flow [19–21].

When the camera is in a fixed position, or its motion is very small and almost negligible, the background subtraction method works best, which is not the case for the onboard camera of a fast-moving aircraft. To overcome this problem, flow-based methods are used and give satisfactory results. However, flow-based methods struggle when it comes to detecting small and blurry objects. To overcome this limitation, some methods combine background subtraction and flow-based approaches [22,23].
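The trade-off between these two families of methods can be illustrated with standard OpenCV building blocks. The following is a minimal sketch, not part of the pipeline used in this chapter, that applies MOG2 background subtraction and dense Farneback optical flow to consecutive frames of a video; the file name and all parameter values are assumptions chosen only for illustration.

```python
import cv2
import numpy as np

# Illustrative only: background subtraction vs. dense optical flow.
# The video file name and all parameter values are assumptions.
cap = cv2.VideoCapture("sample_flight.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None

while ok:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Background subtraction: reliable only if the camera is (almost) static.
    fg_mask = subtractor.apply(frame)

    # Dense optical flow: usable with a moving camera, but noisy for small,
    # blurry targets.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    print(f"foreground pixels: {int((fg_mask > 0).sum())}, "
          f"mean flow magnitude: {magnitude.mean():.2f}")
    prev_gray = gray

cap.release()
```

Background subtraction responds only to pixels that change relative to a learned background model, so it degrades quickly once the camera itself moves, whereas the dense flow field remains usable with a moving camera but becomes noisy for small, blurry targets.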

In our case, the background changes as the flying objects move through 3D space. The same type of flying object (e.g., a bird) can appear in various shapes and sizes depending on the varying distance between the camera and the object. So it is clear that motion information alone is not sufficient to detect flying objects with high reliability. As noted earlier, there are methods that combine background subtraction and flow-based approaches; for example, the work in Refs. [24–27] depends critically on optical flow, often computed with [21], and may therefore suffer from low-quality flow vectors. Additionally, the assumption made by Narayana et al. [26] about translational camera motion is violated in aerial videos.

4.2.3 Hybrid methods

These approaches combine information related to both object appearance and motion patterns and are therefore highly reliable; they perform well regardless of the frame background variation and motion cues. Walk et al.'s work [28] is an example, where histograms of flow vectors are used as features, combined with more standard appearance features, and then fed to a statistical learning method. A related approach is taken in Ref. [29], where the patches are first aligned to compensate for motion, and then the differences between frames (both consecutive and more widely separated ones) are used as additional features. This alignment relies on the Lucas–Kanade optical flow algorithm [20]. The resulting algorithm responds very well when it comes to pedestrian detection and performs better than single-frame methods. However, the flow estimates are not reliable for the smaller target objects that are harder to detect, and, like purely flow-based methods, the approach then becomes less effective.

The first deep learning object detection technique was the OverFeat network [30]. This technique used a CNN together with a sliding window to classify every part of the image as either object (object of interest) or nonobject (background). These earlier methods led to even more advanced techniques for object detection. In recent years, the deep learning community has proposed several object detectors, such as Faster R-CNN [31], YOLO [32], R-FCN [33], SSD [34], and RetinaNet [35]. The primary focus of these algorithms was:

  • to improve the accuracy and confidence score of the algorithm in terms of mAP (mean average precision); and
  • to improve the computational complexity, so that these algorithms can be used effectively for real-time object detection on embedded and mobile platforms [36].

These detection algorithms can be divided into two subclasses based on their high-level architecture: (1) the single-step approach and (2) the two-step, that is, region-based, approach. The single-step approach is preferred when the aim is to achieve faster results and high memory efficiency, whereas the two-step approach is chosen when accuracy is more important than time complexity and memory consumption.

4.2.4 Single-step detectors

Single-step detectors are simpler and faster than region-based models while achieving reasonable accuracy. Instead of first proposing object/nonobject regions, these detectors directly predict object classes and coordinate offsets (typically relative to a set of default or anchor boxes, as sketched below). SSD, YOLO, and RetinaNet are examples of single-step detectors.
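To make the idea of "directly predicting coordinate offsets" concrete, the sketch below shows how an SSD-style detector typically decodes a predicted offset vector against one default (prior) box. The variance values 0.1 and 0.2 are the commonly used defaults and, like the example numbers, are assumptions for illustration rather than values reported in this chapter.

```python
import math

def decode_ssd_box(prior, offsets, variances=(0.1, 0.2)):
    """Decode SSD-style location offsets against one default (prior) box.

    prior:   (cx, cy, w, h) of the default box, in relative coordinates.
    offsets: (dx, dy, dw, dh) predicted by the localization head.
    Returns (xmin, ymin, xmax, ymax).
    """
    pcx, pcy, pw, ph = prior
    dx, dy, dw, dh = offsets

    # Center offsets are scaled by the prior size; width/height live in log space.
    cx = pcx + dx * variances[0] * pw
    cy = pcy + dy * variances[0] * ph
    w = pw * math.exp(dw * variances[1])
    h = ph * math.exp(dh * variances[1])

    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Example: a small predicted shift and enlargement of a centered default box.
print(decode_ssd_box((0.5, 0.5, 0.2, 0.2), (1.0, -0.5, 0.3, 0.3)))
```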

4.2.5 Two-step detectors/region-based detectors

This method has two stages for object detection:

  • In the first stage, a set of regions is generated depending on the probability of an object being present within each particular region.
  • In the second stage, the regions with a high probability of containing an object are fed as input for the detection and classification of objects.

R-CNN, Fast R-CNN, FPN, Faster R-CNN, and R-FCN are some examples of two-step (region-based) detectors.

4.3 Methodology

4.3.1 Model training

Here, a pretrained SSD [37] Caffe model [38] is used for training on the dataset. The dataset contains classes of flying objects, namely birds and aeroplanes (all types of planes are labeled as aeroplane). The OpenCV 4 library is used here; within this library, a much-improved deep learning module, "dnn," is available, and it supports deep learning frameworks such as Caffe. The Caffe model architecture used in this chapter is shown below (Fig. 4.2):

image
Figure 4.2 Caffe model architecture.
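As an illustration of the setup just described, the following is a minimal sketch of loading a pretrained SSD Caffe model through OpenCV's dnn module and running it on a single image. The file names, the 300 × 300 input size, the scale and mean values, and the class list are assumptions in the style of the publicly available MobileNet-SSD Caffe release, not values reported in this chapter.

```python
import cv2
import numpy as np

# Assumed file names for the pretrained SSD Caffe model (prototxt + weights).
net = cv2.dnn.readNetFromCaffe("ssd_deploy.prototxt", "ssd_weights.caffemodel")

# Class labels are an assumption; only "bird" and "aeroplane" matter here.
CLASSES = ["background", "aeroplane", "bird"]

image = cv2.imread("flying_object.jpg")
h, w = image.shape[:2]

# Typical SSD preprocessing: resize to 300x300, scale, and mean-subtract.
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)),
                             scalefactor=0.007843, size=(300, 300), mean=127.5)
net.setInput(blob)
detections = net.forward()   # shape: [1, 1, N, 7]

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence < 0.6:     # matches the lowest threshold used in Section 4.3.2
        continue
    class_id = int(detections[0, 0, i, 1])
    box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
    x1, y1, x2, y2 = box.astype(int)
    name = CLASSES[class_id] if class_id < len(CLASSES) else str(class_id)
    label = f"{name}: {confidence * 100:.2f}%"
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(image, label, (x1, max(y1 - 10, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

cv2.imwrite("detected.jpg", image)
```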

The layers of the Caffe model are described as follows:

  • Vision layers:
    • Operate on a particular region of the input and, as a result, produce a corresponding region of the output.
    • In contrast, most other layers ignore the spatial structure of the input, with a few exceptions.
    • May include convolution, pooling, and Local Response Normalization layers.
    • Convolutional layers consist of a set of learnable filters that are applied over the image.
  • Loss layers:
    • Compare an output to a target and assign a cost to minimize.
    • May include SoftMax, sum-of-squares/Euclidean, Hinge/Margin, information gain, and accuracy score evaluation.
  • Activation layers:
    • Perform an elementwise operation, taking one bottom blob and producing one top blob of the same size.
    • Transform the weighted sum of inputs that goes into the artificial neurons.
    • These functions are nonlinear, to encode complex patterns.
    • May include ReLU or Leaky ReLU, Sigmoid, TanH/Hyperbolic Tangent, Absolute Value, etc.
  • Data layers:
    • Take data directly from memory, from files on disk (when efficiency is not critical), or from common image formats.
    • May include database, in-memory, HDF5 input, HDF5 output, image, window, or dummy layers.
  • Common layers:
    • May include inner product, splitting, flattening, reshape, concatenation, slicing, elementwise operation, argmax, softmax, and mean-variance normalization.

4.3.2 Evaluation metric

A confidence score is calculated as the evaluation standard. This confidence score shows the probability that an object in the image is detected correctly by the algorithm and is given as a percentage. The scores are taken as the mean average precision at different IoU (Intersection over Union) thresholds. For this work, the thresholds are set at 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, and 0.9; that is, a detected object is taken for evaluation only if its confidence is over 60%, 65%, 70%, 75%, 80%, 85%, or 90%, respectively.

Intersection over Union (IoU) [39] measures the magnitude of overlap between the bounding boxes of two objects. The formula for IoU is given below:

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}    (4.1)

Eq. (4.1) can be expressed in set notation as:

IoU = \frac{|X \cap Y|}{|X \cup Y|}    (4.2)
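A minimal sketch of Eq. (4.2) in code, assuming axis-aligned boxes given as (xmin, ymin, xmax, ymax):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Intersection rectangle X ∩ Y.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter          # |X ∪ Y|

    return inter / union if union > 0 else 0.0

# Example: two partially overlapping boxes.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```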

Precision is calculated at each threshold value from the True Positives (TP), False Positives (FP), and False Negatives (FN), which result from comparing the predicted objects to the ground truth objects. A TP is counted when a ground truth bounding box is matched by a predicted bounding box. A FP is counted when a predicted object has no associated ground truth object. A FN is counted when a ground truth object has no associated predicted object. True Negatives (TN) are not applicable in our case. The confidence score is given as the mean of the precision scores over all thresholds. The average precision is given as:

\text{Confidence} = \text{Average precision} = \frac{1}{|\text{Thresholds}|} \sum_{T \in \text{Thresholds}} \frac{TP(T)}{TP(T) + FP(T) + FN(T)}    (4.3)
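A minimal sketch of Eq. (4.3), assuming the per-threshold TP/FP/FN counts have already been obtained by matching predictions to ground truth; the example counts are purely illustrative and are not the chapter's data.

```python
def confidence_score(counts_per_threshold):
    """Eq. (4.3): mean of TP / (TP + FP + FN) over all thresholds.

    counts_per_threshold: dict mapping threshold -> (TP, FP, FN).
    """
    precisions = []
    for tp, fp, fn in counts_per_threshold.values():
        denom = tp + fp + fn
        precisions.append(tp / denom if denom > 0 else 0.0)
    return sum(precisions) / len(precisions)

# Illustrative counts only (not the chapter's data).
counts = {0.6: (97, 1, 2), 0.7: (95, 1, 4), 0.8: (92, 0, 8)}
print(confidence_score(counts))
```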

4.4 Results and discussions

Images of the flying objects were taken against diverse backgrounds to test whether the algorithm works properly. Figs. 4.3 and 4.4 show the output when these images are fed to the algorithm for testing purposes. The output contains bounding boxes with the object label and a confidence score. An open source tool known as Bounding Box Label [12] was used to label all the different types of birds and aeroplanes (Table 4.1).

image
Figure 4.3 Flying aeroplane detected by SSD algorithm. SSD, Single shot detection.
image
Figure 4.4 Flying birds detected by SSD algorithm. SSD, Single shot detection.

Table 4.1

Average confidence score for detected objects.

Object       Average confidence score (%)
Bird         96.65
Aeroplane    97.89

The average confidence score was calculated as the mean of all confidence scores given by the algorithm for the labeled images. For testing purposes, 100 images were provided for each class. In some images the algorithm detected birds as aeroplanes and aeroplanes as birds, and in some images containing aeroplanes and birds the network was not able to detect all the objects. The confusion matrix for each class is given below:

Tables 4.2 and 4.3 show that 97% of the aeroplane images and 96% of the bird images were correctly detected by the algorithm. For Table 4.2, TP = 97, FP = 1, FN = 2, and True Negative (TN) = Not Applicable (NA). For Table 4.3, TP = 96, FP = 3, FN = 1, and TN = NA. (A short sketch recomputing the derived metrics from these counts is given after Table 4.3.)

Table 4.2

Confusion matrix for aeroplane.

                           Detected
                           Aeroplane    Not aeroplane
Actual    Aeroplane        97           2
          Not aeroplane    1            NA

NA, Not applicable.

Table 4.3

Confusion matrix for bird.

                           Detected
                           Bird    Not bird
Actual    Bird             96      1
          Not bird         3       NA

NA, Not applicable.
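As a small check on the counts reported above, the sketch below recomputes precision, recall, and a per-class detection rate from the confusion-matrix entries of Tables 4.2 and 4.3; the formulas are the standard ones, and the counts are taken directly from the tables.

```python
def summarize(tp, fp, fn):
    """Standard precision/recall plus detection rate from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    detection_rate = tp / (tp + fp + fn)   # matches the per-threshold term in Eq. (4.3)
    return precision, recall, detection_rate

# Counts from Table 4.2 (aeroplane) and Table 4.3 (bird).
for name, (tp, fp, fn) in {"aeroplane": (97, 1, 2), "bird": (96, 3, 1)}.items():
    p, r, d = summarize(tp, fp, fn)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, detection rate={d:.2f}")
```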

4.4.1 For real-time flying objects from video

Figs. 4.5 and 4.6 show frames taken from two videos. The first video shows birds starting to fly and the second is of an aeroplane takeoff.

image
Figure 4.5 Frames from flying birds video.
image
Figure 4.6 Frames from aeroplane takeoff video.

At first, the confidence of the network was tested for different objects in multiobject scenarios. Then the same network was applied to evaluate its performance on a video feed. Some snapshots were taken after testing the video with the algorithm. The results in Figs. 4.5 and 4.6 clearly show that the network works very well even for instantaneous object detection. The SSD algorithm was able to detect and classify objects accurately in the video feed, even when the object contours were partially obscured by another object. Given the accuracy attained in detecting and classifying flying objects, the approach can be applied in both commercial and military applications. With slight modifications, many transportation-related projects could apply this approach successfully.
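A minimal sketch of running the same kind of network on a video feed, as described above; the file names and the 0.6 confidence threshold are assumptions, and in an onboard setup the VideoCapture source would be the UAV camera (e.g., device index 0).

```python
import cv2
import numpy as np

# Assumed file names; same style of pretrained SSD Caffe model as in Section 4.3.1.
net = cv2.dnn.readNetFromCaffe("ssd_deploy.prototxt", "ssd_weights.caffemodel")

cap = cv2.VideoCapture("aeroplane_takeoff.mp4")   # or 0 for an onboard camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]

    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()

    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence < 0.6:
            continue
        box = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
        cv2.rectangle(frame, (box[0], box[1]), (box[2], box[3]), (0, 255, 0), 2)

    cv2.imshow("SSD detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```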

4.5 Conclusion

Vision-based unmanned autonomous vehicles have a wide range of operations in intelligent navigation systems. In this chapter, a framework for real-time flying object detection is presented. The flying objects were detected by the network with very high accuracy (on average 97% for the bird and aeroplane classes), as shown in the results. These results show that the network is highly reliable for detecting flying objects. The SSD detector used here, an open source object detection and classification framework, is based on a CNN. Future work can focus on detecting more flying objects such as drones, helicopters, and even skydivers and paragliders. UAVs and CNNs can be used for counting, as well as detecting and classifying, surrounding flying objects. Thus SSD can be very useful for detecting objects in real time. We hope that this work will add some value to ongoing research on UAVs. Though this chapter focuses mainly on flying object detection for UAVs, the modified algorithm can also be applied in the transportation and civil engineering fields. It is highly recommended to test the approach with a combination of different object classes on various images and to observe how these advancements impact the way UAVs implement complex tasks. Another application can include automated identification of roadway features. These applications have the potential to transform the infrastructure of transportation asset management in the near future.

References

1. Mercedes-Benz Intelligent Drive, http://techcenter.mercedes-benz.com/en/intelligentdrive/detail.html/.

2. Mobileeye Inc. http://us.mobileye.com/technology/.

3. G. Conte, P. Doherty, An integrated UAV navigation system based on aerial image matching, in: Proceedings of the IEEE Aerospace Conference (2008) 3142–3151.

4. Martínez C, Mondragon IF, Olivares-Méndez M, Campoy P. On-board and ground visual pose estimation techniques for UAV control. J Intell Robot Syst. 2011;61(1–4):301–320.

5. L. Meier, P. Tanskanen, F. Fraundorfer, M. Pollefeys, PIXHAWK: a system for autonomous flight using onboard computer vision, in: Proceedings of the IEEE International Conference on Robotics and Automation (2011).

6. C. Hane, C. Zach, J. Lim, A. Ranganathan, M. Pollefeys, Stereo depth map fusion for robot navigation, in: Proceedings of the International Conference on Intelligent Robots and Systems (2011) 1618–1625.

7. Weiss S, Achtelik M, Lynen S, et al. Monocular vision for long-term micro aerial vehicle state estimation: a compendium. J Field Robot. 2013;30:803–831.

8. S. Lynen, M. Achtelik, S. Weiss, M. Chli, R. Siegwart, A robust and modular multi-sensor fusion approach applied to mav navigation, in: Proceedings of the Conference on Intelligent Robots and Systems, (2013).

9. C. Forster, M. Pizzoli, D. Scaramuzza, SVO: fast semi-direct monocular visual odometry, in: Proceedings of the International Conference on Robotics and Automation (2014).

10. T. Zsedrovits, A. Zarandy, B. Vanek, T. Peni, J. Bokor, T. Roska, Visual detection and implementation aspects of a uav see and avoid system, in: Proceedings of the European Conference on Circuit Theory and Design (2011).

11. Felzenszwalb P, Girshick R, McAllester D, Ramanan D. Object Detection with Discriminatively Trained Part Based Models. IEEE Trans Pattern Anal Mach Intell. 2010.

12. F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, et al., Theano: new features and speed improvements, 2012.

13. A. Bosch, A. Zisserman, X. Munoz, Image classification using random forests and ferns, in: Proceedings of the International Conference on Computer Vision (2007).

14. P. Dollar, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: Proceedings of the British Machine Vision Conference (2009).

15. Oliver N, Rosario B, Pentland A. A Bayesian computer vision system for modeling human interactions. IEEE Trans Pattern Anal Mach Intell. 2000;22(8):831–843.

16. N. Seungjong, J. Moongu, A new framework for background subtraction using multiple cues, in: Proceedings of the Asian Conference on Computer Vision. Springer Berlin Heidelberg (2013) 493–506.

17. A. Sobral, BGSLibrary: an OpenCV C++ background subtraction library, in: Proceedings of the IX Workshop de Visão Computacional, (2013).

18. Zamalieva D, Yilmaz A. Background subtraction for the moving camera: a geometric approach. Comput Vis Image Underst. 2014;127:73–85.

19. T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in: Proceedings of the European Conference on Computer Vision (2010) 282–295.

20. B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the International Joint Conference on Artificial Intelligence (1981) 674–679.

21. Brox T, Malik J. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans Pattern Anal Mach Intell. 2011.

22. Y. Zhang, S.-J. Kiselewich, W.-A. Bauson, R. Hammoud, Robust moving object detection at distance in the visible spectrum and beyond using a moving camera, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (2006).

23. Kim S-W, Yun K, Yi K-M, Kim S-J, Choi J-Y. Detection of moving objects with a moving camera using non-panoramic background model. Mach Vis Appl. 2013;24:1015–1028.

24. S. Kwak, T. Lim, W. Nam, B. Han, J. Han, Generalized background subtraction based on hybrid inference by belief propagation and bayesian filtering, in: Proceedings of the International Conference on Computer Vision (2011) 2174–2181.

25. A. Elqursh, A. Elgammal, Online moving camera background subtraction, in: Proceedings of the European Conference on Computer Vision (2012) 228–241.

26. M. Narayana, A. Hanson, E. Learned-miller, Coherent motion segmentation in moving camera videos using optical flow orientations, in: Proceedings of the International Conference on Computer Vision (2013).

27. A. Papazoglou, V. Ferrari, Fast object segmentation in unconstrained video, in: Proceedings of the International Conference on Computer Vision (2013).

28. S. Walk, N. Majer, K. Schindler, B. Schiele, New features and insights for pedestrian detection, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (2010).

29. D. Park, C.L. Zitnick, D. Ramanan, P. Dollar, Exploring weak stabilization for motion feature extraction, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (2013).

30. A.Z. Ryan, An overview of emerging results in cooperative UAV control, in: Proceedings of the 43rd IEEE Conference on Decision and Control (Vol. 1) (2004) 602–607.

31. Ren SKH. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–1149.

32. J. Redmon, S. Divvala, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 779–788.

33. J. Dai, Y. Li, R-FCN: object detection via region-based fully convolutional networks. NIPS, 2016, pp. 379–387.

34. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu et al. SSD: single shot multibox detector. ECCV, 2016, pp. 21–37.

35. T.-Y. Lin, P. Goyal. Focal loss for dense object detection. ICCV, 2017.

36. S. Mandal, S. Biswas, V.E. Balas, R.N. Shaw, A. Ghosh, Motion prediction for autonomous vehicles from Lyft dataset using deep learning, in: Proceedings of the 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), 30–31 Oct. (2020) 768–773, doi: 10.1109/ICCCA49541.2020.9250790.

37. S. Mandal, V.E. Balas, R.N. Shaw, A. Ghosh, Prediction analysis of idiopathic pulmonary fibrosis progression from OSIC dataset, in: Proceedings of the 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), 2–4 Oct. (2020) 861–865, doi: 10.1109/GUCON48875.2020.9231239.

38. Kumar M, Shenbagaraman VM, Ghosh A. Predictive data analysis for energy management of a smart factory leading to sustainability. Book Chapter [ISBN 978-981-15-4691-4] In: Favorskaya MN, Mekhilef S, Pandey RK, Singh N, eds. Innovations in Electrical and Electronic Engineering. Springer 2020;765–773.

39. Tripathi S, Kang B. Low-complexity object detection with deep convolutional neural network for embedded systems. Proc SPIE. 2017;10396.
