© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
D. Paper, State-of-the-Art Deep Learning Models in TensorFlow, https://doi.org/10.1007/978-1-4842-7341-8_13

13. Object Detection

David Paper, Logan, UT, USA

Object detection is an automated computer vision technique for locating instances of objects in digital photographs or videos. Specifically, object detection draws bounding boxes around one or more effective targets in a still image or a video frame. An effective target is the object of interest in the image or video data being investigated. The effective target (or targets) should be known at the beginning of the task.

Object Detection in a Natural Scene

Detecting objects in natural scenes is effortless for us, but extremely challenging for computer algorithms. As humans, we look at a scene and identify objects of interest without a thought. We process visual data in the ventral visual stream, which is a hierarchy of areas in the brain that helps in object recognition. We recognize different types and sizes of objects and categorize them with the aid of cells in our visual cortex that respond to simple shapes like lines and curves.

With object detection, data scientists attempt to mimic what humans do with computer algorithms. But before a computer algorithm can even begin to detect objects, a natural scene must be captured as a digital image (or digital video). A digital image is composed of picture elements (or pixels). A pixel is the basic logical unit in digital graphics. Simply, a pixel is a tiny square of color.

Pixels are uniformly combined in a two-dimensional (2D) grid to form a complete digital image, video, text or any visible element on a computer display. Each pixel has a specific number, and this number tells the algorithm its color. The number represents the pixel intensity value. The pixel intensity value represents a grayscale or color image.

In grayscale images, each pixel represents the intensity of only one color. So a grayscale image can be represented by a single 2D matrix (or grid). In the standard RGB system, color images have three channels (red, green, and blue). So a color image can be represented by three 2D matrices. One matrix represents the intensity of red in the pixel. One represents the intensity of green in the pixel. And one represents the intensity of blue in the pixel.

Each matrix is considered a color channel. A color channel is an array of values (one per pixel) that together specify one aspect or dimension of the image. So RGB color contains three color channels – red, green, and blue. Each color channel is expressed from 0 (least saturated) to 255 (most saturated). So 16,777,216 (256³) different colors can be represented in the RGB color space by combining red, green, and blue pixel intensities.
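To make the channel idea concrete, here is a minimal NumPy sketch (the tiny 2 × 2 image is hypothetical, not from any dataset) showing how a color image decomposes into three 2D matrices while a grayscale version needs only one:

import numpy as np

# A hypothetical 2x2 color image stored as a (height, width, channel) array.
# Each channel holds pixel intensities from 0 to 255.
tiny_image = np.array(
    [[[255, 0, 0], [0, 255, 0]],        # red pixel, green pixel
     [[0, 0, 255], [255, 255, 255]]],   # blue pixel, white pixel
    dtype=np.uint8)

red_channel = tiny_image[:, :, 0]    # 2D matrix of red intensities
green_channel = tiny_image[:, :, 1]  # 2D matrix of green intensities
blue_channel = tiny_image[:, :, 2]   # 2D matrix of blue intensities

# A grayscale version is a single 2D matrix.
grayscale = tiny_image.mean(axis=2).astype(np.uint8)
print(tiny_image.shape, grayscale.shape)  # (2, 2, 3) (2, 2)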

Once a digital image is captured as a grayscale (2D matrix) or color (three color channel 2D matrices) image, it is processed for model consumption. Only then can computer algorithms inspect each 2D grid of pixel values to begin identifying patterns.

Detection vs. Classification

Image classification presents an input image to a neural network so it can learn a single class label associated with the image. The network can also learn the probability associated with the class label. The learned class label characterizes the contents of the entire image or at least the most dominant and visible contents of the image. So an image classification network is able to learn how to correctly label an image of a cat.

Object detection presents an input image to a neural network so it can learn the exact location of each object in the image (or scene). To locate objects in a scene, object detection algorithms create a list of bounding boxes, one for each object in the picture. Bounding boxes are expressed as (x, y) coordinates that mark each object's location. The algorithms also identify the class label associated with each bounding box and the probability (or confidence score) associated with each bounding box and class label.

Simply, image classification involves one image into the network and one class label out. Object detection involves one image into the network and multiple bounding boxes and their associated class labels out.
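To make the contrast concrete, here is a hypothetical sketch of the two output formats (the labels, boxes, and scores are invented for illustration):

# Image classification: one image in, one class label (with a probability) out.
classification_output = {'label': 'cat', 'probability': 0.94}

# Object detection: one image in, a list of bounding boxes out, each with
# its own class label and confidence score; boxes are (ymin, xmin, ymax, xmax).
detection_output = [
    {'box': (0.12, 0.08, 0.55, 0.40), 'label': 'cat', 'score': 0.91},
    {'box': (0.10, 0.45, 0.60, 0.85), 'label': 'cat', 'score': 0.88},
    {'box': (0.05, 0.30, 0.95, 0.70), 'label': 'person', 'score': 0.83}]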

Object detection is commonly applied to count, locate, and label objects in a scene with precision. Neural networks are the state-of-the-art methods for object detection. Convolutional neural networks are frequently used to automatically learn the inherent features of objects and identify them within an image by differentiating them from the background.

Whereas image classification involves assigning a class label to an image, object localization involves drawing a bounding box around one or more objects in an image scene. So classification separates images into different classes, while object localization identifies effective targets in an image scene. Object detection is more challenging than localization because it combines classification and object localization to learn how to draw a bounding box around each object of interest in the image (or effective target) and assign it a class label. The difference between object localization and object detection is subtle. Object localization aims to locate the main (or most visible) object in an image scene, while object detection tries to find out all the objects and their boundaries.

Imagine that an image contains two cats and a person. Object detection networks are able to locate instances of entities and classify the types of entities found within the image. So an object detection network can locate two cats and a person in this image and classify them correctly.

Object detection is commonly confused with image recognition. So how are they different? Image recognition assigns a label to an image. A picture of a dog receives the label dog. A picture of two dogs still receives the label dog. Object detection, however, draws a bounding box around each dog and labels the box as dog. The model predicts where each object is in the image and the label that should be applied to it. So object detection provides more information about an image than recognition.

Bounding Boxes

A bounding box is an imaginary rectangle that serves as a point of reference for object detection and creates a collision box for that object. The area of a bounding box (usually shortened to bbox) is defined by two pairs of corner coordinates. (In geospatial applications, the same idea is expressed with two longitudes and two latitudes, where latitude is a decimal number between –90.0 and 90.0 and longitude is a decimal number between –180.0 and 180.0.) Data annotators draw bbox rectangles to outline an object of interest within each image (scene) by defining its x and y coordinates.

A hitbox (or collision box) is an invisible shape commonly used in video games for real-time collision detection. So it is a type of bounding box. It is often a rectangle (in 2D games) or cuboid (in 3D) that is attached to and follows a point on a visible object (such as a model or a sprite).

Basic Structure

Deep learning object detection models typically have two parts. An encoder takes an image as input and runs it through a series of blocks and layers that learn to extract statistical features used to locate and label objects. Outputs from the encoder are then passed to a decoder, which predicts bounding boxes and labels for each object.

The simplest decoder is a pure regressor. A regressor is the name given to any variable in a regression model that is used to predict a response variable. A regressor is also referred to as an explanatory variable, an independent variable, a manipulated variable, a predictor variable or a feature. The whole point of building a regression model is to understand how changes in a regressor lead to changes in a response variable (or regressand). So a regressor is a feature, and a regressand is a response variable (or target).

Regression is a supervised ML technique used to predict continuous values. The goal of a regression algorithm is to plot a best-fit line or curve through the data. A regressor is connected to the output of the encoder and directly predicts the location and size of each bounding box.

The output of a regressor model is the (x, y) coordinate pair for the object and its extent (width and height) in the image. But a regressor is limited because we need to specify the number of boxes ahead of time. If an image has two dogs but the regressor model was designed to detect a single object, one of the dogs goes unlabeled. But if we know the number of objects we need to predict in each image ahead of time, pure regressor-based models may be a good option.
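As a rough illustration (not the pre-trained model used later in this chapter), here is a minimal Keras sketch of an encoder plus a pure regressor decoder, assuming a single object per image and a hypothetical three-class problem:

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 3  # hypothetical number of object classes
inputs = keras.Input(shape=(224, 224, 3))

# Encoder: convolutional blocks that extract features from the image.
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Decoder: a regressor that directly predicts one bounding box
# (normalized xmin, ymin, xmax, ymax) plus a class label.
box_output = layers.Dense(4, activation='sigmoid', name='box')(x)
class_output = layers.Dense(num_classes, activation='softmax', name='label')(x)

model = keras.Model(inputs, [box_output, class_output])
model.compile(
    optimizer='adam',
    loss={'box': 'mse', 'label': 'sparse_categorical_crossentropy'})

To detect more than one object with this design, we would have to add another box output (and another label output) per object, which is exactly the limitation described above.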

An extension of the regressor approach is a region proposal network. The decoder in a region proposal network proposes regions of an image where it believes an object might reside. The pixels belonging to these regions are then fed into a classification subnetwork to determine a label (or reject the proposal). The network then runs the pixels containing those regions through a classification network. The benefit of this method is a more accurate and flexible model that can propose arbitrary numbers of regions that may contain a bounding box. But the added accuracy comes at the cost of computational efficiency.

A single shot detector (SSD) seeks a middle ground. Rather than using a subnetwork to propose regions, an SSD relies on a set of predetermined regions. A grid of anchor points is laid over the input image. At each anchor point, boxes of multiple shapes and sizes serve as regions. For each box at each anchor point, the model outputs a prediction of whether or not an object exists within the region and modifications to the box’s location and size to make it fit the object more precisely. Because there are multiple boxes at each anchor point and anchor points may be close together, an SSD produces many potential detections that overlap. So post-processing must be applied to SSD outputs to keep the best detections and prune redundant ones. The most popular post-processing technique for SSD is non-maximum suppression, which selects the most appropriate bounding box for each object.
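TensorFlow provides a non-max suppression op, tf.image.non_max_suppression, which we can sketch with a few hypothetical overlapping boxes:

# Three hypothetical detections in (ymin, xmin, ymax, xmax) form; the second
# box is a near-duplicate of the first.
boxes = tf.constant(
    [[0.10, 0.10, 0.50, 0.50],
     [0.12, 0.11, 0.52, 0.49],
     [0.60, 0.60, 0.90, 0.95]], dtype=tf.float32)
scores = tf.constant([0.90, 0.75, 0.80], dtype=tf.float32)

# Keep at most 10 boxes; discard any box whose IOU with a kept,
# higher-scoring box exceeds 0.5.
keep = tf.image.non_max_suppression(
    boxes, scores, max_output_size=10, iou_threshold=0.5)
print(keep.numpy())  # [0 2] -- the near-duplicate is suppressed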

Object detectors output the location and label for each object, but how do we measure performance? The most common metric for object location is intersection-over-union (IOU). Given two bounding boxes, IOU computes the area of their intersection and divides it by the area of their union. The value ranges from 0 (no overlap) to 1 (perfectly overlapping). A simple percent-correct metric can be used for labels.
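A minimal sketch of the IOU computation for two boxes given in the same (ymin, xmin, ymax, xmax) form used above:

def iou(box_a, box_b):
  # Compute intersection-over-union for two (ymin, xmin, ymax, xmax) boxes.
  ymin = max(box_a[0], box_b[0])
  xmin = max(box_a[1], box_b[1])
  ymax = min(box_a[2], box_b[2])
  xmax = min(box_a[3], box_b[3])
  intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
  area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
  area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
  return intersection / (area_a + area_b - intersection)

print(iou((0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.5, 0.5)))  # 1.0 (perfect overlap)
print(iou((0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)))  # 0.0 (no overlap)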

Notebooks for chapters are located at the following URL:

https://github.com/paperd/deep-learning-models

We demonstrate object detection with an end-to-end code experiment. Begin setting up the Colab ecosystem by importing the main TensorFlow library and instantiating the GPU.

Import the TensorFlow Library

Import the library and alias it as tf:
import tensorflow as tf

Aliasing the TensorFlow library as tf is common practice.

GPU Hardware Accelerator

As a convenience, we provide the steps to enable the GPU in a Colab notebook:
  1. Click Runtime in the top-left menu.

  2. Click Change runtime type from the drop-down menu.

  3. Choose GPU from the Hardware accelerator drop-down menu.

  4. Click Save.
Verify that the GPU is active:
tf.__version__, tf.test.gpu_device_name()

If ‘/device:GPU:0’ is displayed, the GPU is active. If an empty string is displayed, the regular CPU is active.

Note

If you get the error name ‘tf’ is not defined, re-execute the code to import the TensorFlow library!

Object Detection Experiment

We grab images from Google Drive and Wikimedia Commons for the experiment. We use the pre-trained faster_rcnn/openimages_v4/inception_resnet_v2 object detection module for image object detection. The object detection model is trained on Open Images V4 (version 4) with the ImageNet pre-trained Inception ResNet V2 (version 2) as the image feature extractor. The module internally performs non-maximum suppression. The maximum number of detections outputted is 100. So it detects a maximum of 100 objects in a given scene. Detections are outputted for 600 boxable categories. A boxable category is one that is capable of (or suitable for) placing in a bounding box.

Open Images is a dataset of approximately nine million richly annotated images with image-level labels, object bounding boxes, object segmentations, visual relationships, and localized narratives. The images are very diverse and often contain complex scenes with several objects (8.4 per scene on average).

The training set of Open Images V4 contains 14.6 million bounding boxes for 600 object classes on 1.74 million images, which makes it the largest existing dataset with object location annotations (as of this writing). The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. Moreover, the dataset is annotated with image-level labels spanning thousands of classes.

Note

It is recommended to run this module on a GPU to get acceptable inference times.

Import Requisite Libraries

Enable access to the TF-hub module:
import tensorflow_hub as hub
Access a plotting module:
import matplotlib.pyplot as plt
Access modules for file handling and manipulating data in memory:
import tempfile
from six.moves.urllib.request import urlopen
from six import BytesIO

The tempfile module creates temporary files and directories. The urlopen function is Python’s uniform resource locator (URL) handler; it is used to fetch a URL. A URL is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. The BytesIO class manipulates byte data in memory.

Access modules from the PIL library:
from PIL import Image
from PIL import ImageColor
from PIL import ImageDraw
from PIL import ImageFont
from PIL import ImageOps

The Image module allows image upload into memory. The remaining modules enable image manipulation.

Import NumPy:
import numpy as np

Create Functions for the Experiment

The first function displays an image:
def display_image(image):
  fig = plt.figure(figsize=(20, 15))
  plt.grid(False)
  plt.imshow(image)
  plt.axis('off')

We present many different display functions to provide alternative ways to display images. For practice, create your own display function (or functions) and use it for this experiment.
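As one purely illustrative starting point, here is an alternative display function that accepts an optional title and figure size:

def display_image_titled(image, title=None, size=(12, 9)):
  fig = plt.figure(figsize=size)
  plt.grid(False)
  plt.imshow(image)
  plt.axis('off')
  if title is not None:
    plt.title(title)
  plt.show()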

The second function draws a bounding box around an image within a scene as shown in Listing 13-1.
def draw_bounding_box_on_image(
    image, ymin, xmin, ymax, xmax,
    color, font, thickness=4, display_str_list=()):
  """Adds a bounding box to an image."""
  draw = ImageDraw.Draw(image)
  im_width, im_height = image.size
  (left, right, top, bottom) = (
      xmin * im_width, xmax * im_width,
      ymin * im_height, ymax * im_height)
  draw.line([(left, top), (left, bottom),
             (right, bottom), (right, top),
             (left, top)],
             width=thickness, fill=color)
  display_str_heights = [font.getsize(ds)[1]
                         for ds in display_str_list]
  total_display_str_height = (
      1 + 2 * 0.05) * sum(display_str_heights)
  if top > total_display_str_height:
    text_bottom = top
  else:
    text_bottom = top + total_display_str_height
  for display_str in display_str_list[::-1]:
    text_width, text_height = font.getsize(display_str)
    margin = np.ceil(0.05 * text_height)
    draw.rectangle(
        [(left, text_bottom - text_height - 2 * margin),
         (left + text_width, text_bottom)], fill=color)
    draw.text(
        (left + margin, text_bottom - text_height - margin),
        display_str, fill='black', font=font)
    text_bottom -= text_height - 2 * margin
Listing 13-1

Function to Draw Bounding Boxes

The function accepts a processed image, x and y minimum and maximum coordinates, color, font, thickness, and a list of display strings. The processed image is sent by the draw_boxes function, which is presented next. The x and y coordinates provide the boundaries of the bounding box for the image. The color, font, thickness, and list of strings are provided by the draw_boxes function. The remainder of the function adds a bounding box to an image.

Begin by creating an ImageDraw object and assigning it to variable draw. The ImageDraw module is used to create new images, annotate or retouch existing images, and generate graphics on the fly for web use. Image x and y coordinates are assigned to variables, and boundary lines are assigned to the ImageDraw object.

The function continues by assigning the list of display strings to variable display_str_heights. Display strings are used to label the bounding boxes for each image in a scene. If the total height of the display strings added to the top of the bounding box exceeds the top of the image, we need to stack the strings below the bounding box instead of above it. The reason is to ensure that strings that identify each image in a bounding box are readable. The total_display_str_height variable ensures that each display string has a reasonable top and bottom margin for display. You can experiment with this value, but the setting is pretty good as it is. The next bit of logic checks the top and bottom margins to ensure a good fit of the image inside the bounding box container.

The remainder of the function reverses (with a for loop) the list of display strings for display from bottom to top. The draw.rectangle method draws a bounding box around each image. The draw.text method labels each bounding box with its appropriate label. When the loop is finished, an image that includes all bounding boxes is created.

The third function accepts an image, boxes, class names, scores, a maximum number of boxes, and a minimum score. The boxes, class names, and scores are generated by the pre-trained model; the maximum number of boxes and the minimum score are thresholds with default values. With the accepted parameter values, it overlays labeled boxes on the image with formatted scores and label names as shown in Listing 13-2.
def draw_boxes(
    image, boxes, class_names, scores,
    max_boxes=10, min_score=0.1):
  # Overlay labeled boxes on an image with formatted scores and label names.
  colors = list(ImageColor.colormap.values())
  one = '/usr/share/fonts/truetype/liberation/'
  two =  'LiberationSansNarrow-Regular.ttf'
  font_url = one + two
  try:
    font = ImageFont.truetype(font_url, 25)
  except IOError:
    print('Font not found, using default font.')
    font = ImageFont.load_default()
  for i in range(min(boxes.shape[0], max_boxes)):
    if scores[i] >= min_score:
      ymin, xmin, ymax, xmax = tuple(boxes[i])
      display_str = '{}: {}%'.format(
          class_names[i].decode('ascii'),
          int(100 * scores[i]))
      color = colors[hash(class_names[i]) % len(colors)]
      image_pil = Image.fromarray(
          np.uint8(image)).convert('RGB')
      draw_bounding_box_on_image(
          image_pil, ymin, xmin, ymax, xmax,
          color, font, display_str_list=[display_str])
      np.copyto(image, np.array(image_pil))
  return image
Listing 13-2

Container Function for Drawing Bounding Boxes

The draw_boxes function is a container because it accepts the image and dictionaries from the pre-trained model, calls draw_bounding_box_on_image, and returns an image with bounding boxes drawn around detected objects.

Although the function looks complex, it is pretty straightforward. It builds a list of colors from PIL’s colormap and later picks one per class (by hashing the class name) so each bounding box gets a consistent color. It then loads the font that we use to label each bounding box. For practice, you can change the font. It then loops through the detected objects to retrieve (from the dictionaries output by the pre-trained model) their x and y coordinates in the scene, labels, and scores, and supplies them, along with the image pixels and a color, to the draw_bounding_box_on_image function. That function draws the bounding box for each object, and draw_boxes returns the scene with bounding-boxed objects.

Note

To avoid confusion, think of the original image as a scene. The pre-trained model learns how to identify image objects in the scene. It outputs a set of dictionaries that contain information about the objects it detects and coordinates for the bounding boxes from the scene. We then draw the bounding boxes from the dictionary information. So when we talk about image objects, we mean the objects that are detected from the scene.

Load a Pre-trained Object Detection Model

Load an object detection module and apply it on the downloaded image:
p1 = 'https://tfhub.dev/google/faster_rcnn/'
p2 = 'openimages_v4/inception_resnet_v2/1'
URL = p1 + p2
module_handle = URL
obj_detect = hub.load(module_handle).signatures['default']

The pre-trained model is an object detection model trained on Open Images V4 with ImageNet pre-trained Inception ResNet V2 as the image feature extractor.

Open Images is huge! It contains 15,851,536 boxes on 600 categories, 2,785,498 instance segmentations on 350 categories, 3,284,280 relationship annotations on 1,466 relationships, 675,155 localized narratives, and 59,919,574 image-level labels on 19,957 categories. It also includes an optional extension with 478,000 crowdsourced images with more than 6,000 categories.

The module performs non-maximum suppression internally. The maximum number of detections outputted is 100. Detections are outputted for 600 boxable categories. It is recommended to run this module on a GPU to get acceptable inference times.

The model accepts a variable-size three-channel image. It outputs several dictionaries including detection_boxes (bounding box coordinates), detection_class_entities (detection class names), detection_class_names (human-readable class names), detection_class_labels (labels as tensors), and detection_scores (detection scores). Detection scores indicate how confident the model is in labeling the object.
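If you want to see these dictionaries for yourself, the following sketch (a hypothetical helper, not part of the chapter’s main workflow) prints each key along with the shape and dtype of its array:

def inspect_detections(detector, image_path):
  img = tf.image.decode_jpeg(tf.io.read_file(image_path), channels=3)
  converted = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
  result = {key: value.numpy() for key, value in detector(converted).items()}
  for key, value in result.items():
    print(key, value.shape, value.dtype)
  return result

Call it as inspect_detections(obj_detect, filename) once the temporary image file is created later in the chapter.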

Note

We provide the details on Open Images for the curious. The pre-trained model handles the workload. We just use its outputs to create a visualization.

Load an Image from Google Drive

Mount Google Drive to the Colab notebook:
from google.colab import drive
drive.mount('/content/gdrive')

Be sure that the image is in the appropriate directory (i.e., Colab Notebooks) in your Google Drive!

Access and display the image:
img_path = 'gdrive/My Drive/Colab Notebooks/images/cats_dogs.jpg'
pil_image = Image.open(img_path)
display_image(pil_image)

Convert the JPEG image to a Python Imaging Library (PIL) image and display it. PIL is a library that supports opening, manipulating, and saving many different image file formats. It is also known as the Pillow library.

Check image size:
pil_image.size

Prepare the Image

Generate a temporary path for the image file:
_, filename = tempfile.mkstemp(suffix='.jpg')
filename
Prepare the image for processing and save it to the temporary file path:
pil_image_rgb = pil_image.convert('RGB')
pil_image_rgb.save(filename, format='JPEG', quality=90)
print('Image downloaded to %s.' % filename)
display_image(pil_image)

Run Object Detection on the Image

Create a function to load the image:
def load_img(path):
  img = tf.io.read_file(path)
  img = tf.image.decode_jpeg(img, channels=3)
  return img

The function loads the image and prepares it for the pre-trained model.

Create a function to run object detection as shown in Listing 13-3.
def run_detector(detector, path):
  img = load_img(path)
  converted_img  = tf.image.convert_image_dtype(
      img, tf.float32)[tf.newaxis, ...]
  result = detector(converted_img)
  result = {key:value.numpy()
            for key,value in result.items()}
  print("Found %d objects." %
        len(result["detection_scores"]))
  image_with_boxes = draw_boxes(
      img.numpy(), result["detection_boxes"],
      result["detection_class_entities"],
      result["detection_scores"])
  display_image(image_with_boxes)
Listing 13-3

Object Detection Function

The function accepts the loaded pre-trained model signature and the path to the image (scene). It loads the image and converts it to a float32 tensor with an added batch dimension for model consumption. It then runs the model signature on the converted image, retrieves the key/value pairs from the dictionaries output by the pre-trained model, creates a scene with bounding boxes drawn around the detected objects, and displays it.

Run the detector:
run_detector(obj_detect, filename)

As we know, the pre-trained model is limited to detecting 100 objects. But the message is misleading because it says that 100 objects are found no matter what scene is fed into the model.
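If you want a more informative count, one option is to count only detections whose score clears a minimum. The sketch below reuses load_img and assumes a hypothetical threshold of 0.3 (not a tuned value):

def count_confident_detections(detector, path, min_score=0.3):
  img = load_img(path)
  converted_img = tf.image.convert_image_dtype(
      img, tf.float32)[tf.newaxis, ...]
  result = {key: value.numpy()
            for key, value in detector(converted_img).items()}
  confident = int((result['detection_scores'] >= min_score).sum())
  print('Found %d objects with score >= %.2f.' % (confident, min_score))
  return confident

count_confident_detections(obj_detect, filename)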

Detection is perfect, but don’t get too excited. The model is powerful enough to detect images in a simple scene. The scene is simple because the background offers no distractions and each dog/cat is separately presented.

Let’s try another one:
img_path = 'gdrive/My Drive/Colab Notebooks/images/butterfly.jpg'
pil_image = Image.open(img_path)
display_image(pil_image)
Process the image:
_, filename = tempfile.mkstemp(suffix='.jpg')
pil_image_rgb = pil_image.convert('RGB')
pil_image_rgb.save(filename, format='JPEG', quality=90)
print('Image downloaded to %s.' % filename)
Run the detector:
run_detector(obj_detect, filename)

Detection is perfect, but the image is simple.

Detect Images from Complex Scenes

Let’s try detection on more complex images. We’ve already located some images from Wikimedia Commons, but you can locate your own by following a few simple steps:
  1. Go to the Wikimedia Commons website (https://commons.wikimedia.org).

  2. Click the Images link.

  3. Click an image.

  4. Right-click the image.

  5. Select “Copy link address” from the drop-down menu.

  6. Paste the link address into a code cell.

  7. Surround the link address with single or double quotes.

  8. Assign it to a variable.

Create a Download Function

Create a function to download, process, and save an image to a temporary file path as shown in Listing 13-4.
def download_and_resize_image(
    url, new_width=256, new_height=256,
    display=False):
  _, filename = tempfile.mkstemp(suffix='.jpg')
  response = urlopen(url)
  image_data = response.read()
  image_data = BytesIO(image_data)
  pil_image = Image.open(image_data)
  pil_image = ImageOps.fit(
      pil_image, (new_width, new_height),
      Image.ANTIALIAS)
  pil_image_rgb = pil_image.convert('RGB')
  pil_image_rgb.save(
      filename, format='JPEG', quality=90)
  print('Image downloaded to %s.' % filename)
  if display:
    display_image(pil_image)
  return filename
Listing 13-4

Download and Preprocess Function

The function generates a temporary path for the image file. It then reads the image file from the supplied URL. The function continues by converting the image file to a PIL image. The PIL image is then resized, converted to RGB, and saved to the temporary file path. The function ends by returning the filename of the PIL image.

Load an Image Scene

Load a scene from a Wikimedia Commons URL:
p1 = 'https://upload.wikimedia.org/wikipedia/commons/7/79/'
p2 = 'At_taverna_under_the_church%2C_Ano_Potamia%2C_Naxos%'
p3 = '2C_190574.jpg'
URL = p1 + p2 + p3
downloaded_image_path = download_and_resize_image(
    URL, 1280, 856, True)

The source for the image scene is located at

https://commons.wikimedia.org/wiki/File:At_taverna_under_the_church,_Ano_Potamia,_Naxos,_190574.jpg

Detect

Run object detection:
run_detector(obj_detect, downloaded_image_path)

With a more complex scene, detection is not perfect. But it does make some correct detections.

Detect on More Scenes

Piece together some paths as shown in Listing 13-5.
p1 = 'https://upload.wikimedia.org/wikipedia/commons/4/45/'
p2 = 'Green_Dragon_Tavern_%2836196%29.jpg'
tavern = p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/3/31/'
p2 = 'Circus_Circus_Hotel-Casino_sign.jpg'
casino = p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/9/91/'
p2 = 'Leon_hot_air_balloon_festival_2010.jpg'
balloon = p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/d/d8/'
p2 = '2012_Festival_of_Sail_-_7943922284.jpg'
sail = p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/'
p2 = '17_mai_2018.jpg'
flag = p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/4/43/'
p2 = 'Fruit_baskets.jpg'
basket= p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/c/c7/'
p2 = 'Fruit_stands%2C_Rue_de_Seine%2C_Paris_22_May_2014.jpg'
stand= p1 + p2
p1 = 'https://upload.wikimedia.org/wikipedia/commons/9/95/'
p2 = 'Wine_tasting_%40_brown_brothers.jpg'
wine = p1 + p2
Listing 13-5

Image Paths

Create a function to detect images in a scene:
def detect_img(image_url):
  image_path = download_and_resize_image(image_url, 640, 480)
  run_detector(obj_detect, image_path)
Run object detection on one of the scenes:
detect_img(wine)

So scene complexity limits detection accuracy.

Try one more:
detect_img(sail)
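One way to reduce clutter on complex scenes is to filter what gets drawn. The sketch below (with illustrative, untuned thresholds) is a variant of run_detector that passes max_boxes and min_score through to draw_boxes:

def run_detector_filtered(detector, path, max_boxes=15, min_score=0.3):
  img = load_img(path)
  converted_img = tf.image.convert_image_dtype(
      img, tf.float32)[tf.newaxis, ...]
  result = {key: value.numpy()
            for key, value in detector(converted_img).items()}
  image_with_boxes = draw_boxes(
      img.numpy(), result['detection_boxes'],
      result['detection_class_entities'],
      result['detection_scores'],
      max_boxes=max_boxes, min_score=min_score)
  display_image(image_with_boxes)

run_detector_filtered(obj_detect, download_and_resize_image(sail, 640, 480))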

Find the Source

We sometimes come across Wikimedia Commons images in our research. But the sources of such images are never (at least in our experience) included. If we want to use the image in any way, we must locate its source to see if it is permitted.

Find the Source of a Wikipedia Commons Image

We can find the source of an image with a few steps:
  1. Substitute commons for upload.

  2. Change wikipedia to wiki.

  3. Substitute File: for commons/(number)/(number).

  4. Translate each %(number) code to its HTML-encoded equivalent.

To find the HTML encoded equivalent, peruse

https://krypted.com/utilities/html-encoding-reference/

Note

We cannot guarantee that this process works for every image, but it works for the ones that we have used.

Let’s try the process on the tavern image:

https://upload.wikimedia.org/wikipedia/commons/4/45/Green_Dragon_Tavern_%2836196%29.jpg

Substitute commons for upload:

https://commons.wikimedia.org/wikipedia/commons/4/45/Green_Dragon_Tavern_%2836196%29.jpg

Change wikipedia to wiki:

https://commons.wikimedia.org/wiki/commons/4/45/Green_Dragon_Tavern_%2836196%29.jpg

Substitute File: for commons/(number)/(number):

https://commons.wikimedia.org/wiki/File:Green_Dragon_Tavern_%2836196%29.jpg

Translate:

https://commons.wikimedia.org/wiki/File:Green_Dragon_Tavern_(36196).jpg

The %28 and %29 codes translate into left and right parentheses from the HTML Encoding Reference.
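If you prefer not to look up the codes by hand, Python’s standard library can do the percent-decoding for you:

from urllib.parse import unquote

encoded = ('https://commons.wikimedia.org/wiki/'
           'File:Green_Dragon_Tavern_%2836196%29.jpg')
print(unquote(encoded))
# https://commons.wikimedia.org/wiki/File:Green_Dragon_Tavern_(36196).jpg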

The casino image is easier because we don’t have to do any translations:

https://commons.wikimedia.org/wiki/File:Circus_Circus_Hotel-Casino_sign.jpg

Here are the sources for the remaining images:

https://commons.wikimedia.org/wiki/File:Leon_hot_air_balloon_festival_2010.jpg

https://commons.wikimedia.org/wiki/File:2012_Festival_of_Sail_-_7943922284.jpg

https://commons.wikimedia.org/wiki/File:17_mai_2018.jpg

https://commons.wikimedia.org/wiki/File:Fruit_baskets.jpg

https://commons.wikimedia.org/wiki/File:Fruit_stands,_Rue_de_Seine,_Paris_22_May_2014.jpg

https://commons.wikimedia.org/wiki/File:Wine_tasting_@_brown_brothers.jpg

Summary

We used a powerful pre-trained image detector model on several image scenes to demonstrate object detection. The model worked extremely well on simple scenes, but not as well on complex ones. As object detection in deep learning continues to evolve, we are confident that future models will vastly improve detection capability.
