© Mansib Rahman 2017

Mansib Rahman, Beginning Microsoft Kinect for Windows SDK 2.0, https://doi.org/10.1007/978-1-4842-2316-1_6

6. Computer Vision & Image Processing

Mansib Rahman

(1)Montreal, Quebec, Canada

Hopefully, by this point some of the Kinect’s magic has worn off and you can see the machine for what it really is: two cameras with varying degrees of sophistication and a laser pointer. Barring the exorbitant cost of fielding a time of flight (ToF) camera, recreating a device that is conceptually similar to the Kinect in your own garage is not impossible. Getting the color, depth, and infrared streams from it could be technically challenging, but is essentially a solved problem. What sets the Kinect apart from such a hobbyist device, however, other than its precision manufacturing and marginally superior components, is its capability to look at the depth feed and extract bodies and faces from it.

“So what?” you say. “Is that not why we ordered a Kinect from the Microsoft Store instead of digging out our Arduinos and requesting chip and sensor samples from TI?” What if I told you that you, too, could track bodies, faces, dogs, cats, stop signs, paintings, buildings, golden arches, swooshes, other Kinects, and pretty much anything else you can imagine without having to rely on some XDoohickeyFrameSource? That the 25 detected body joints are merely software abstractions that you could have devised yourself? There is no sacred sensor or chip that can track things like human bodies on its own. These are all capabilities that you can code yourself, which you should in fact learn how to do if you intend to get the most out of your Kinect.

What the Kinect is, aside from its various sensors, is essentially a bag of computer vision and image processing algorithmic tricks. Not to disparage its hardware (it is nothing to scoff at), but these algorithms are the reason we write home about the Kinect.

When we say computer vision, we mean algorithms that help computer programs form decisions and derive understanding from images (and by extension, videos). Are there soccer balls in this picture, and if so, how many? Is the Indian elephant in the video clip crying or merely sweaty? Do the seminiferous tubule cross-sections in the microscope slides have a circularity > 0.85, and if so, can they be circled?

Note

Elephants do not sweat, because they do not have sweat glands. This is why they are often seen fanning their ears and rolling in the mud to cool down instead.

Image processing, on the other hand, is the application of mathematical operations on images to transform them in a desired manner. Such algorithms can be used to apply an Instagram filter, perform red-eye removal, or subtract one video frame from another to find out what areas of a scene have changed over time. Image processing techniques are themselves often used in computer vision procedures.

Figure 6-1. Seminiferous tubule cross-sections in male rat reproductive organs with circularities > 0.85 highlighted using computer vision techniques (base image provided courtesy of the lab of Dr. Bernard Robaire at McGill University’s Faculty of Medicine)

Algorithms and mathematical operations may cause consternation for some, but there is no need to worry; you will not immediately need to know how to convolve a kernel with an image matrix (though it helps). Researchers and developers have pooled together to create numerous open source libraries that implement a wide variety of computer vision and image processing techniques for us to freely use. The most popular one is likely OpenCV, Intel’s BSD-licensed library, which is what we will be using in this chapter. There are many other excellent libraries that can be used, however, such as AForge.NET and SimpleCV. Although the exact implementations will be different among the libraries, the general idea can be replicated with any of them.

OpenCV’s primary interface is in C++, but there are bindings for Python and Java. For .NET usage, we must rely on a wrapper. We will be using the Emgu CV wrapper, which in addition to C# supports VB, VC++, IronPython, and other .NET-compatible languages and is cross-platform, supporting not only the typical desktop OS, but also iOS, Android, Windows Phone, Windows Store, and frameworks such as Unity and Xamarin. Whenever we refer to Emgu CV code and structures in this chapter, assume there are equivalents in OpenCV.

Tip

When we say Emgu CV is a .NET wrapper for OpenCV, wrapper refers to a set of classes that can be used to call unmanaged code from a managed context. This allows us to call C++ OpenCV-like method signatures from within our C# code.
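To make the idea concrete, the sketch below shows the general pattern such wrappers rely on; the library name and export here are hypothetical and purely for illustration:

using System.Runtime.InteropServices;

class NativeMethods
{
    // Hypothetical native library and export, shown only to illustrate the pattern.
    // DllImport binds this managed signature to the unmanaged function so that
    // C# code can call it as if it were a regular .NET method.
    [DllImport("SomeNativeLibrary.dll", EntryPoint = "some_native_function")]
    public static extern int SomeNativeFunction(int value);
}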

Computer vision is a massive topic, one that can span entire bookshelves and university courses. This chapter can only cover the equivalent of a tuft of hair in a western lowland gorilla’s armpit. While you might not be able to create a system that captures a couple of dozen body joints from a cat as fluidly as the Kinect does with a human body, at least not off the bat, hopefully this will whet your appetite for computer vision and image processing and inspire you to develop applications with basic tracking and analysis capabilities that you might not have been able to accomplish with just heuristic methods.

Note

Seeing as it is impossible to cover the entire OpenCV/Emgu CV library in one chapter, I recommend that you consult the official Emgu CV API reference at http://www.emgu.com/wiki/files/3.1.0/document/index.html to augment your learning. Additional material on OpenCV can be found at http://docs.opencv.org/3.1.0 .

Setting Up Emgu CV

The process of setting up Emgu CV as a dependency is perhaps its biggest pain point. Like bullet casings and unexploded ordnance littering former battlefields, you will find scores of unanswered questions on Stack Overflow, the Emgu CV forums, and MSDN, often on the issue of “The type initializer for 'Emgu.CV.CvInvoke' threw an exception,” which is code for “You bungled your Emgu CV dependencies.” C’est la vie d’un développeur. Fortunately, we have decent technical books to guide us through the process.

System Specs

Neither Emgu CV nor OpenCV has much in the way of publicly available system specifications. You would think that algorithms that are iterating dozens of times on multi-million-pixel image frames would be very CPU intensive, and you would not be wrong. But OpenCV has been around since the Windows 98 era and can run just fine on an iPhone 4’s 800MHz ARM CPU. Generally, you will want to use something above the system specs for the Kinect for Windows SDK v2 so that there are CPU cycles to spare for your image processing work after expending the necessary system resources for the Kinect. If you plan to use OpenCV’s CUDA module, you will need to have a CUDA-enabled GPU, which means NVIDIA.

Tip

CUDA, which formerly meant Compute Unified Device Architecture but is now no longer an acronym, is a platform and API that allows supported GPUs to be used for many tasks typically handled by the CPU. This is very nifty for image processing tasks because of the high number of cores on GPUs. CUDA allows the GPU to be exploited through a C/C++ or Fortran API. OpenCV (and by extension Emgu CV) has a module that allows us to parallelize many tasks with CUDA-enabled GPUs.

Installation & Verification

Emgu CV can be downloaded from SourceForge at https://sourceforge.net/projects/emgucv/ .

At the time of writing, the latest version of Emgu CV is 3.1.0, and that is the version against which the book’s code and instructions have been tested. There should be no breaking changes in future versions, but your mileage may vary. The particular file downloaded from SourceForge was libemgucv-windesktop-3.1.0.2282.exe, and the installer was run with all its default configuration options.

Once the installer finishes, verify that it put everything in the right place. Visit the installation folder (the default location is C:\Emgu). Then, navigate to emgucv-windesktop 3.1.0.xxxx\Solution\VS2013-2015 and open the Emgu.CV.Example Visual Studio solution.

By default, the Hello World sample program should be loaded (if not, you should skip down to Figure 6-3 to see how you can load it). After compiling and running, you should see a blue screen emblazoned with the words “Hello, world,” as in Figure 6-2.

Figure 6-2. Hello World sample in Emgu CV
Figure 6-3. Choosing the CameraCapture project from among others in the Startup Projects picker

The sample might look like something that could have been achieved trivially with WPF or WinForms, but all of it is in fact made from calls to the Emgu CV API. The blue background is actually an Emgu CV image, which we will discuss further in the next section.

In the example solution, we have other samples that can be experimented with. Let us run the CameraCapture project to see some basic image processing techniques at work. To change the active project in Visual Studio, visit the Startup Projects picker on the toolbar and select your desired project, as in Figure 6-3.

Your computer will have to have a webcam configured in order for the project to run. For devices such as the Surface Book or Surface Pro, the sample works out of the box. Unless you hacked your Kinect to be used as a webcam, Emgu CV does not grab frames from it directly, because we have not connected it to the sample project. The compiled result of the CameraCapture sample should look something like Figure 6-4.

Figure 6-4. Playing with CameraCapture in the Emgu CV samples. Because of the large resolution of the image, there are scroll bars to pan it.
Note

As of Windows 10, it is actually possible to use the Kinect as a regular webcam. As a bonus, with the Kinect’s IR capabilities, we can use it for Windows Hello (though, if you ask me, it is easier to simply use an IR-enabled laptop camera like those found in the Surface line of products).

In Figure 6-4, we see a snapshot of the application livestreaming footage from a webcam. The captured video frames are processed with different techniques and presented for our viewing pleasure. On the top right, we have the regular webcam image in grayscale. Converting a color image to grayscale is one of the most basic image processing techniques. If you recall from Chapter 3, manipulating images involved iterating through the pixels of the image and applying functions to alter their color components. With Emgu CV, one can just call CvInvoke.CvtColor(initialImage, finalImage, ColorConversion.Bgr2Gray) to accomplish the same thing.

Tip

Typically, when we grayscale an image we have to go from a three-color-channel image (RGB) to one with only one channel (gray). One method of doing this is by summing the RGB components of a pixel and dividing them by three. Human eyes do not perceive the three colors equally, however, so it is better to weigh the three components differently. A common approach is $$ \text{gray pixel intensity} = 0.3 \times R + 0.59 \times G + 0.11 \times B $$, where R, G, and B represent the intensities of the Red, Green, and Blue pixel color components, respectively.
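As a concrete illustration of the weighting (this is a minimal sketch, not one of the book’s listings), converting a single pixel could look like the following; the BGRA byte order is assumed to match the Kinect color frames used later in this chapter:

// A minimal sketch: weighted grayscale value for one pixel.
// The blue, green, and red arguments are assumed to come from a BGRA pixel.
static byte ToGray(byte b, byte g, byte r)
{
    return (byte)(0.3 * r + 0.59 * g + 0.11 * b);
}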

The smoothed gray image on the bottom left is the regular grayscale image blurred, downsampled, blurred, and then upsampled. The bottom right image has the Canny edge detector algorithm applied to the regular image with the detected edges highlighted in the output image. We will explore similar algorithms in more depth later in the chapter. If the effects on your images look similar to the ones found in this book, then your installation should be fine.

Tip

Downsampling is essentially a way to reduce the information in an image so that we can shrink the image itself. Upsampling is increasing the size of the image based on an interpolation of what colors the new pixels should be. It can never quite recover details lost through downsampling. The simplest way to explain the difference between resampling an image and resizing it is that resampling changes the number of pixels within an image, whereas resizing changes the size of the image without changing the number of pixels. In other words, upsampling an image magnifies it, whereas resizing does not.
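In Emgu CV, the pyramid functions perform exactly this pair of operations; the following is a minimal sketch assuming an existing Mat named src:

// A minimal sketch: downsample and then upsample a Mat named src.
// Each PyrDown call halves the width and height; each PyrUp call doubles them.
Mat smaller = new Mat();
CvInvoke.PyrDown(src, smaller);    // blur, then drop every other row and column
Mat restored = new Mat();
CvInvoke.PyrUp(smaller, restored); // interpolate back up; fine detail remains lost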

The Canny edge detector is an algorithm used to detect edges in an image, often used in more complex algorithms to select salient information in an image for further processing.
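In Emgu CV, running the detector is a single call; this is a minimal sketch assuming a grayscale Mat named src, with arbitrary hysteresis thresholds chosen purely for illustration:

// A minimal sketch: Canny edge detection on a grayscale Mat named src.
Mat edges = new Mat();
CvInvoke.Canny(src, edges, 100, 200); // lower and upper hysteresis thresholds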

Including Emgu CV as a Dependency

Things have gotten easier with the advent of the 3.1.x releases of Emgu CV, as many of the DLLs have been consolidated, but it still has the potential to trip you up. For the current version, the following instructions should set everything up correctly, but these may be subject to change in future versions.

  1. Create a new blank WPF project in Visual Studio and give it a name.

  2. Set the project’s platform to x64 (Emgu CV works with x86 too, however). See the Note for Figure 5-12 in Chapter 5 if you need help recalling how.

  3. Compile the solution.

  4. Grab Emgu.CV.UI.dll and Emgu.CV.World.dll from the emgucv-windesktop 3.1.0.2282\bin folder and copy them to the <ProjectName>\bin\x64\Debug folder.

  5. Copy the contents of the emgucv-windesktop 3.1.0.2282\bin\x64 folder and paste it into the <ProjectName>\bin\x64\Debug folder.

  6. Open the Reference Manager window from Solution Explorer, then browse and add Emgu.CV.UI.dll and Emgu.CV.World.dll as references.

  7. In MainWindow.xaml.cs, ensure that you include the Emgu.CV namespace.

Note

If you change from a Debug to a Release configuration, you will need to ensure that all the DLLs are in the Release folder as well.

To verify that everything has installed properly, try running the code in Listing 6-1. To run it successfully, you need to include any random image named img.jpg in the project’s x64\Debug folder.

Listing 6-1. Load an Image into WPF with Emgu CV
using System.Windows;
using Emgu.CV;


namespace KinectEmguCVDependency
{
    public partial class MainWindow : Window
    {
        public MainWindow()
        {


            Mat img = CvInvoke.Imread("img.jpg", Emgu.CV.CvEnum.LoadImageType.AnyColor);
            /*
            in later versions, the correct code might be: Mat img = CvInvoke.Imread("img.jpg", Emgu.CV.CvEnum.ImreadModes.AnyColor);
            */
        }
    }
}

The code in Listing 6-1 does not do anything except load an image to be used for whatever purpose. Instead of relying on the default WPF classes, however, it uses an Emgu CV call: CvInvoke.Imread(...). If you get no errors and a blank screen, this means everything is in place.

Manipulating Images in Emgu CV

As we saw in Chapter 3, it is entirely possible—and, in fact, not too difficult—to fiddle with the individual data bytes of an image. For more advanced usage with computer vision libraries such as OpenCV, however, we must rely on somewhat more abstracted versions of byte arrays that allow algorithms to take advantage of the image’s metadata.

Understanding Mat

OpenCV has been around since the turn of the millennium; back then, it was still all in C. Images were primarily represented through the IplImage struct, otherwise known as Image<TColor, TDepth> in Emgu CV. With the advent of C++, OpenCV saw IplImage gradually phased out in favor of the object-oriented Mat class (also known as Mat in Emgu CV). A lot of old code and textbooks rely on IplImage, but as of OpenCV 3.0, Mat is officially preferred.

Byte arrays representing images are essentially matrices, hence the shorthand Mat for OpenCV’s image byte array representation. Mat contains two parts: a matrix header that contains metadata (e.g., how many color channels, rows, cols, etc.) and a pointer to the location of the actual matrix. This distinction is made because computer vision work naturally entails performing very memory-intensive tasks, and it would be very inefficient to copy the actual matrices all the time. Hence, if you create a Mat from another Mat’s data, they share a reference to the same underlying matrix data. A Mat must be explicitly duplicated with its Clone() method to establish two separate copies.
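The following minimal sketch (not one of the book’s listings) illustrates the difference; it assumes the Emgu.CV.CvEnum, Emgu.CV.Structure, and System.Drawing namespaces are in scope, and uses the ROI constructor as one way to create a Mat that shares another Mat’s data:

// A minimal sketch: a Mat built on another Mat's data shares that data,
// whereas Clone() duplicates it.
Mat original = new Mat(100, 100, DepthType.Cv8U, 1);
original.SetTo(new MCvScalar(255));                        // fill with white
Mat view = new Mat(original, new Rectangle(0, 0, 50, 50)); // header into the same data
Mat copy = original.Clone();                               // independent duplicate
original.SetTo(new MCvScalar(0));                          // blacken the original
// view now reads as black as well; copy still holds the white pixels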

Using Mat with Kinect Image Data

The old Image<TColor, TDepth> class required a handful of extension methods to get working with the Kinect. Using Mat is considerably easier and only requires a few extra lines, as opposed to an entire class’s worth of methods like before.

Listing 6-2. Converting Kinect Image Data to Mat Format
using System;
using System.Windows;
using System.Windows.Media;
using System.Windows.Media.Imaging;
using System.Runtime.InteropServices;
using Microsoft.Kinect;
using Emgu.CV;


namespace KinectImageProcessingBasics
{
    public partial class MainWindow : Window
    {
        [DllImport("kernel32.dll", EntryPoint = "CopyMemory", SetLastError = false)]
        public static extern void CopyMemory(IntPtr dest, IntPtr src, uint count);


        private KinectSensor kinect;
        private FrameDescription colorFrameDesc;
        private WriteableBitmap colorBitmap;


        public MainWindow()
        {
            kinect = KinectSensor.GetDefault();
            ColorFrameSource colorFrameSource = kinect.ColorFrameSource;
            colorFrameDesc = colorFrameSource.FrameDescription;
            ColorFrameReader colorFrameReader = colorFrameSource.OpenReader();
            colorFrameReader.FrameArrived += Color_FrameArrived;
            colorBitmap = new WriteableBitmap(colorFrameDesc.Width,
                colorFrameDesc.Height,
                96.0,
                96.0,
                PixelFormats.Bgra32,
                null);


            DataContext = this;

            kinect.Open();

            InitializeComponent();
        }


        public ImageSource ImageSource
        {
            get
            {
                return colorBitmap;
            }
        }


        private void Color_FrameArrived(object sender, ColorFrameArrivedEventArgs e)
        {
            using (ColorFrame colorFrame = e.FrameReference.AcquireFrame())
            {
                if (colorFrame != null)
                {
                    if ((colorFrameDesc.Width == colorBitmap.PixelWidth) && (colorFrameDesc.Height == colorBitmap.PixelHeight))
                    {
                        using (KinectBuffer colorBuffer = colorFrame.LockRawImageBuffer())
                        {
                            colorBitmap.Lock();


                            Mat img = new Mat(colorFrameDesc.Height, colorFrameDesc.Width, Emgu.CV.CvEnum.DepthType.Cv8U, 4);
                            colorFrame.CopyConvertedFrameDataToIntPtr(
                            img.DataPointer,
                            (uint)(colorFrameDesc.Width * colorFrameDesc.Height * 4),
                            ColorImageFormat.Bgra);
                            //Process data in Mat at this point
                            CopyMemory(colorBitmap.BackBuffer, img.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height * 4));


                            colorBitmap.AddDirtyRect(new Int32Rect(0, 0, this.colorBitmap.PixelWidth, this.colorBitmap.PixelHeight));

                            colorBitmap.Unlock();
                            img.Dispose();
                        }
                    }
                }
            }
        }
    }
}

MainWindow.xaml remains the same as before:

<Grid>
        <Image Source="{Binding ImageSource}" Stretch="UniformToFill" />
</Grid>

In Listing 6-2, we have an entire application that grabs a color image frame from the Kinect, transposes its data to a Mat, and then pushes the Mat’s data into a WriteableBitmap for display. Naturally, we must include the Emgu.CV namespace, but we also include the System.Runtime.InteropServices namespace because we will be relying on a C++ function to deal with Mat data.

The function in question is CopyMemory(IntPtr dest, IntPtr src, uint count), which will copy image data from our Mat to the WriteableBitmap’s BackBuffer. To use it, we have to call unmanaged code from the WIN32 API. We thus rely on the DllImport attribute to retrieve the CopyMemory(...) function from kernel32.dll in [DllImport("kernel32.dll", EntryPoint = "CopyMemory", SetLastError = false)].

A new Mat object is created every time the Color_FrameArrived(...) event handler is called, as opposed to reusing the same one, because we want to prevent access to its data by two concurrent calls of the event handler. Mat has nine different constructors, which allow it to be initialized from an image saved to disk or simply be empty with some metadata. In our case, its parameters in order are rows, columns, Emgu.CV.CvEnum.DepthType, and channels. Rows and columns are essentially equivalent to the height and width of the image in pixels. DepthType refers to how much data each channel of a pixel can hold. Cv8U refers to eight unsigned bits, also known as a byte. The channels value is set to 4 for R, G, B, and A. With the CopyConvertedFrameDataToIntPtr(...) method, we copy the Kinect’s color image data directly from its buffer to the Mat’s buffer.

In this example, we directly copy data from the Mat to the WriteableBitmap with CopyMemory(...). In a production scenario, we would have performed some processing or algorithmic work first. Ideally, this work, including the writes to the WriteableBitmap, would be performed on a background worker thread for performance reasons, but for simplicity that was avoided here. On an i5 Surface Book, the code results in performance comparable to the official ColorBasics-WPF sample, with an additional ∼50 MB of RAM usage. After we are done with the Mat object, we make sure to dispose of it to prevent a memory leak.

Although the code in Listing 6-2 showed the basic usage of the Mat class for a Kinect application, we did not even have to use it to the extent that we did. We can also display the Mat directly with Emgu CV’s GUI API. In fact, it is not necessary to render the algorithmic results from the Mat to any image at all. Once we obtain our data or decision, we can throw out the Mat and render the results or decisions on a Canvas or to the Kinect image data directly with a DrawingContext. Combined with a polling and background worker-thread approach, this would allow us to display a color image feed constantly without having to slow it down with algorithmic work. We do not have to apply the computer vision algorithms as frequently as we receive frames. If we are trying to determine a user’s gender, for example, we could run the algorithm on every tenth frame, as the gender is unlikely to change during operation of the app. It is always best to be judicious in our use of computer vision and image processing techniques to avoid overtaxing the host computer.
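As a minimal sketch of that idea (the frameCount field and the DetermineGender method are hypothetical names, not part of the SDK or Emgu CV), a simple counter inside the frame-arrived handler can gate how often the expensive work runs:

// A minimal sketch: run the heavy computer vision work on every tenth frame only.
private int frameCount = 0;

private void RunAlgorithmIfDue(Mat img)
{
    frameCount++;
    if (frameCount % 10 == 0)
    {
        DetermineGender(img); // hypothetical placeholder for the expensive work
    }
}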

Basic Image Processing Techniques

I do not think anyone would disagree with me when I say that the cardinal rule of working with data is that scrubbing and cleaning the data is the most important step in processing it. In the case of images, the best computer vision algorithms in the world on the fastest machines are completely useless if the image data is incomprehensible. As you can imagine, we cannot discern contours in a blurry image, and we have a difficult time tracking something like a red ball in an image full of other objects with varying tones of red. With the help of a few key image processing techniques, we can drastically simplify the image for decision-making algorithms. This is by no means a comprehensive overview of all the basic image processing techniques available for use, but rather a sampling intended to put you in the mindset to go and uncover more.

Color Conversion

In casual parlance, we describe the world’s colors with a system of words and, for the most part, understand each other without too much ambiguity. The sky is blue, the dirt is brown, and the leaves are green. Computers, on the other hand, have several options when it comes to describing colors, all of which have stringent definitions for each of their colors. These color description systems, known as color spaces, are all suitable for different applications. RGB, for example, facilitates representation of colors on digital systems such as televisions and monitors, and CMYK (Cyan, Magenta, Yellow, and Black) is ideal for color printing. In computer vision, we may often need to go in between color spaces to take advantage of certain algorithms or to extract information. There are in fact over 150 color conversions we can make in OpenCV, though you will initially only need to know a select few.

In Emgu CV, we change an image’s color space from one to another in one method call:

CvInvoke.CvtColor(sourceImage, destinationImage, ColorConversion.Bgra2Gray);                

The sourceImage and destinationImage inputs are Mats, and ColorConversion is an Emgu CV enum that indicates from and to which color space we are converting. The source and destination images need to be the same size. You should keep in mind that image information will be lost in many of these conversions. If you convert a color image to grayscale, for example, the color channels will be stripped and that data will be irrecoverable. When the grayscale image is converted back to color, it will have four channels, but the color channels (and not the opacity channel, which will be set to max) will all hold the same values as the prior gray channel.
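The following minimal sketch shows such a round trip, assuming a BGRA Mat named colorImg; the Gray2Bgra conversion value is assumed to mirror OpenCV’s COLOR_GRAY2BGRA:

// A minimal sketch: a grayscale round trip loses the original color information.
Mat gray = new Mat();
CvInvoke.CvtColor(colorImg, gray, ColorConversion.Bgra2Gray);
Mat backToColor = new Mat();
CvInvoke.CvtColor(gray, backToColor, ColorConversion.Gray2Bgra);
// backToColor has four channels again, but B, G, and R all carry the old gray values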

More on Grayscale

Grayscale is a color space that deserves special mention because of its pervasive use in computer vision. It is typically used when something needs to be measured on a single scale, but still needs to be visualized. The most obvious example is when you want to measure the intensity of light in an image. Darker gray regions in the image would naturally represent darker colors, as in Figure 6-5. We are not limited to only grayscaling RGB images, however. We can grayscale an individual color channel to see the intensity of that color in the image. Depth images are grayscaled to measure distance. A force-sensing capacitive touch screen, such as the one found in the iPhone 6s or the scuttled Nokia McLaren project, could provide a grayscale image of which regions of the screen receive more pressure. Grayscale in Emgu CV is measured with 8 bits, where 255 indicates pure white and 0 indicates pure black, with varying degrees of gray in between.

Figure 6-5. An RGB image split into color channels that are then grayscaled to better help visualize their intensities in the original image. White areas represent where the color channel’s intensity is the highest (© Nevit Dilmen)

Harkening back to our Kinect sample with Mat, converting to grayscale requires a couple of alterations. Since there is less data, we need to adjust the size of the buffers to accommodate this.

Listing 6-3. Converting Kinect Color Image to Grayscale
...
//inside body of Color_FrameArrived(...)
colorFrame.CopyConvertedFrameDataToIntPtr(img.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height * 4), ColorImageFormat.Bgra);
CvInvoke.CvtColor(img, img, Emgu.CV.CvEnum.ColorConversion.Bgra2Gray);
CopyMemory(colorBitmap.BackBuffer, img.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height));
...

In Listing 6-3, we make a call to CvtColor(...) and rewrite img as a grayscale image. Additionally, we alter the buffer-size argument in CopyMemory(...). Grayscale pixels are a single byte each; thus, we no longer need to multiply by four as we did with BGRA images.

Listing 6-4. Setting WriteableBitmap Format to Gray8
colorBitmap = new WriteableBitmap(colorFrameDesc.Width,
                colorFrameDesc.Height,
                96.0,
                96.0,
                PixelFormats.Gray8,
                null);

The changes to grayscale need to be reflected to the WriteableBitmap as well. In Listing 6-4, we set the PixelFormats of our WriteableBitmap to Gray8. The compiled result should look similar to Figure 6-6.

Figure 6-6. Kinect color stream in grayscale

Thresholding

Thresholding is an image processing operation that sets the pixels of an image to one of two colors, typically white and black, depending on whether the pixel’s intensity meets a certain threshold. The final result is a binary image where pixels’ values are essentially described as either true (white) or false (black). This helps us segment an image into regions and create sharply defined borders. From a computer vision perspective, this can help us extract desired features from an image or provide a basis from which to apply further algorithms.

OpenCV has five basic thresholding functions that can be used: binary thresholding, inverted binary thresholding, truncating, thresholding to zero, and inversely thresholding to zero. These are described in detail in Table 6-1 and shown in Figure 6-7. While our goal can usually be achieved by more than one of the thresholding functions, using the right one can save extra steps and processing. Some of these functions take a MaxValue argument, which will be the intensity value that is assigned to a true or false pixel. Thresholding should typically be applied to grayscale images, but can be applied to images with other color formats as well.

Table 6-1. Basic Thresholding Operations in OpenCV

Binary Threshold: If the intensity of a pixel is greater than the threshold, set its value to MaxValue. Else, set it to 0. ThresholdType enum for use in Emgu CV is Binary.

Inverted Binary Threshold: The opposite of binary thresholding. If the intensity of a pixel is greater than the threshold, set its value to 0. Else, set it to MaxValue. ThresholdType enum for use in Emgu CV is BinaryInv.

Truncate: If the intensity of a pixel is greater than the threshold, set it to the threshold’s value. Else, the intensity of the pixel stays the same. ThresholdType enum for use in Emgu CV is Trunc.

Threshold to Zero: If the intensity of a pixel is less than the threshold, set its value to 0. Else, the intensity of the pixel stays the same. ThresholdType enum for use in Emgu CV is ToZero.

Inverted Threshold to Zero: The opposite of thresholding to zero. If the intensity of a pixel is greater than the threshold, set its value to 0. Else, the intensity of the pixel stays the same. ThresholdType enum for use in Emgu CV is ToZeroInv.

Figure 6-7. Tokyo Tower at night with different threshold techniques applied. Thresholding applied on top row from left to right: none, binary, inverted binary. Thresholding applied on bottom row from left to right: truncation, threshold to zero, inverted threshold to zero. A value of 100 was used for the threshold and 255 for the maximum value.
Listing 6-5. Thresholding a Kinect Color Feed Image
colorFrame.CopyConvertedFrameDataToIntPtr(img.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height * 4), ColorImageFormat.Bgra);
CvInvoke.CvtColor(img, img, Emgu.CV.CvEnum.ColorConversion.Bgra2Gray);
CvInvoke.Threshold(img, img, 220, 255, Emgu.CV.CvEnum.ThresholdType.Binary);
CopyMemory(colorBitmap.BackBuffer, img.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height));

In Listing 6-5, we make a call to Threshold(...), and, as with CvtColor(...), the first two arguments are the input and output images, respectively. This is followed by the threshold value (double), the max value (double), and the ThresholdType (enum). ThresholdType includes the five basic thresholding operations, as well as the more advanced Otsu and Mask. By setting the threshold to 220 and the max value to 255 and applying a binary threshold, we are telling our application to blacken all but the brightest pixels, which should be made white. The result looks something akin to Figure 6-8, where only the natural light from the outside, as well as its reflection on the computer monitor and table, is considered bright.

Figure 6-8. Kinect color feed with binary thresholding applied

Smoothing

Smoothing (or blurring) is generally used to remove undesirable details within an image. This might include noise, edges, blemishes, sensitive data, or other fine phenomena. It also makes the transition between colors more fluid, which is a consequence of the image’s edges being less distinct. In a computer vision context, this can be important in reducing false positives from object-detection algorithms.

Note

Blurring sensitive data is an inadequate form of protection against prying eyes. This is especially true with alphanumeric characters. Data can be interpolated from the blurred result; it is better to completely erase sensitive data on an image altogether by overwriting its pixels with junk (0-intensity pixels).

There are four basic smoothing filters that can be used in OpenCV. These are averaging, Gaussian filtering, median filtering, and bilateral filtering. As with thresholding operations, there are filters in each scenario that are most appropriate to use. Some of these scenarios are described in Table 6-2.

Table 6-2. Basic Smoothing Filters in OpenCV

Averaging: Basic smoothing filter that determines pixel values by averaging their neighboring pixels. Can be called in Emgu CV with CvInvoke.Blur(Mat src, Mat dst, System.Drawing.Size ksize). src and dst are the input and output images, respectively. ksize is an odd-number-sized box matrix (e.g., (3, 3) or (5, 5) or (7, 7), etc.) otherwise known as a kernel. A higher ksize will result in a more strongly blurred image. The average filter is sometimes known as a mean filter or a normalized box filter.

Gaussian Filtering: A filter that is effective at removing Gaussian noise in an image. Gaussian noise is noise whose values follow a Gaussian distribution. In practice, this means sensor noise in images caused by bad lighting, high temperature, and electric circuitry. Images put through Gaussian filtering look like they are behind a translucent screen. The filter can be called with CvInvoke.GaussianBlur(Mat src, Mat dst, Size ksize, double sigmaX, double sigmaY = 0, BorderType borderType = BorderType.Reflect101). For amateur applications, the sigma values can be left at 0, and OpenCV will calculate the proper values from the kernel size. BorderType is an optional value and can be ignored as well.

Median Filtering: A filter that is effective at removing salt-and-pepper noise in an image. Salt-and-pepper noise typically consists of white and black pixels randomly occurring throughout an image, like the static noise on older TVs that have no channel currently playing. The noise does not necessarily have to be white and black, however; it can be other colors. For lesser levels of Gaussian noise, it can be substituted in place of the Gaussian filter to better preserve edges in images. Can be called with CvInvoke.MedianBlur(Mat src, Mat dst, int ksize). See Figure 6-9.

Bilateral Filtering: A filter that is effective at removing general noise from an image while preserving edges. Slower than other filters and can lead to gradient reversal, which is the introduction of false edges in the image. Can be called with CvInvoke.BilateralFilter(Mat src, Mat dst, int diameter, double sigmaColor, double sigmaSpace, BorderType borderType = BorderType.Reflect101). Values of diameters larger than five tend to be very slow, thus five can be used as a default parameter initially. Both sigma values can share the same value for convenience. Large sigma values, such as 150 or more, will lead to images that appear cartoonish.

Figure 6-9. Salt-and-pepper image before and after median filtering is applied. Top-left: original image, top-right: image with salt-and-pepper noise applied, bottom: image after median filtering has been applied. A 5 × 5 kernel was used.
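For reference, the four filters from Table 6-2 can be invoked as follows; this is a minimal sketch assuming an existing Mat named src, with arbitrary kernel sizes and sigma values:

// A minimal sketch: the four basic smoothing filters applied to a Mat named src.
Mat smoothed = new Mat();
CvInvoke.Blur(src, smoothed, new System.Drawing.Size(5, 5), new System.Drawing.Point(-1, -1));
CvInvoke.GaussianBlur(src, smoothed, new System.Drawing.Size(5, 5), 0);
CvInvoke.MedianBlur(src, smoothed, 5);
CvInvoke.BilateralFilter(src, smoothed, 5, 75, 75);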

Sharpening

There is no built-in sharpening filter in OpenCV. Instead, we can use the Gaussian filter in a procedure that is perhaps ironically known as unsharp masking. Unsharp masking is the most common type of sharpening, and you have probably encountered it if you have used Photoshop or GIMP. The filter works by subtracting a slightly blurry version of the photo in question from said photo. The idea is that the area that gets blurred is where the edges are and that removing a similar quantity of blur from the original will increase the contrast between the edges. In practice, this effect can be replicated in two lines in Emgu CV or OpenCV.

Listing 6-6. Implementing an unsharp mask in Emgu CV
CvInvoke.GaussianBlur(image, mask, new System.Drawing.Size(3, 3), 0);
CvInvoke.AddWeighted(image, 1.3, mask, -0.4, 0, result);

image, mask, and result in Listing 6-6 are all Mats; the second input of AddWeighted(...) is the weight accorded to the elements of the image Mat, whereas the fourth input is the weight accorded to those of the mask Mat (the second image input in the function). The fifth input is a scalar value that can be added to the combined intensity values of image and mask. After we get the Gaussian blur of image in the mask Mat, we add it with negative weight (essentially subtraction) to image with the AddWeighted(...) method. The resulting sharpened image is in the result Mat. The parameters can be tinkered with to alter the degree of sharpening applied to the image. Increasing the kernel size of the Gaussian blur and the weights in favor of the mask to be subtracted will result in a more exaggerated sharpening of the image.

Morphological Transformations

Without going into too much detail, morphological transformations are techniques used to process the image based on its geometry. You will generally come to use them extensively in computer vision work to deal with noise and to bring focus to certain features within an image. While there are numerous morphological transformations that we can make use of, there are four basic ones that should cover your bases initially: erosion, dilation, opening, and closing. These are described in Table 6-3 and demonstrated in Figure 6-10.

Table 6-3. Basic Morphological Transformations in OpenCV

Erosion: The erosion operator thins the bright regions of an image (while growing the darker regions). It is one of the two basic morphological operators and is the opposite of the dilation operator. Some of its uses include separating foreground objects, eradicating noise, and reducing highlights in an image. Erosion can be applied with CvInvoke.Erode(Mat src, Mat dst, Mat element, Point anchor, int iterations, BorderType borderType, MCvScalar borderValue);. An empty Mat (new Mat()) can be passed as the element to use the default 3 x 3 structuring element. Point can be set to the default value of (-1, -1). borderType and borderValue can be set to default and 0 respectively.

Dilation: The dilation operator thickens the bright regions of an image (while shrinking the darker regions). It is one of the two basic morphological operators and is the opposite of the erosion operator. Some of its uses include combining foreground objects and bringing focus to certain areas of the image. Dilation can be applied with CvInvoke.Dilate(Mat src, Mat dst, Mat element, Point anchor, int iterations, BorderType borderType, MCvScalar borderValue);.

Opening: Opening is a morphological operation that essentially consists of eroding an image and then dilating it. It is the sister of the closing morphological operation. The result is an image with small foreground noise reduced. It is less destructive than removing noise through erosion. Opening is applied with CvInvoke.MorphologyEx(Mat src, Mat dst, MorphOp operation, Mat kernel, Point anchor, int iterations, BorderType borderType, MCvScalar borderValue);. MorphOp is an enum representing the name of the morphological operator, which is Open in our case.

Closing: Closing is a morphological operation that essentially consists of dilating an image and then eroding it. It is the sister of the opening morphological operation. The result is an image with small dark holes filled in (or small background noise reduced). It is more precise than trying to fill holes through dilation alone. Closing can be applied with CvInvoke.MorphologyEx(Mat src, Mat dst, MorphOp operation, Mat kernel, Point anchor, int iterations, BorderType borderType, MCvScalar borderValue);. The MorphOp value for closing is Close.

Figure 6-10. An evening view of Moriya, Ibaraki Prefecture, Japan, with different morphological operators applied. From top to bottom: original, erosion, dilation, opening, closing. Each operator was applied on the original image for 20 iterations using a 5 × 5 kernel.
Note & Tip

Morphological transformations in OpenCV are applied in respect to the high-intensity pixels in the image. In a binary image, this means all the white pixels. So, when we say we are dilating an image, we are almost literally dilating the brighter regions in the image. For this reason, we strive to keep whatever we are interested in (the foreground) in white for image processing operations. That being said, all morphological operations have an equivalent reverse operation; thus, we can also keep the objects of interest in black and the background in white if convenient and apply the reverse operation.

Note

Erode and Dilate can also be called with the CvInvoke.MorphologyEx(...) method. The relevant MorphOp values are Erode and Dilate.
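As a minimal sketch of the MorphologyEx(...) call, opening a binary Mat named src with an arbitrarily chosen 5 × 5 rectangular kernel might look like the following:

// A minimal sketch: opening a binary Mat named src to remove small white specks.
Mat kernel = CvInvoke.GetStructuringElement(Emgu.CV.CvEnum.ElementShape.Rectangle,
    new System.Drawing.Size(5, 5), new System.Drawing.Point(-1, -1));
Mat opened = new Mat();
CvInvoke.MorphologyEx(src, opened, Emgu.CV.CvEnum.MorphOp.Open, kernel,
    new System.Drawing.Point(-1, -1), 1, Emgu.CV.CvEnum.BorderType.Default, new MCvScalar(0));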

Highlighting Edges with Morphological Operators

At times, you may want to visually highlight the contours of an object in an image. This might be to bring attention to an object your algorithm detected, or perhaps to add some type of visual effect (e.g., make someone look like a superhero in a Kinect camera feed by giving them a colored glow around their body). One way to achieve this effect is through morphological operators.

The process consists of dilating the area of interest and then subtracting the dilation from the original image.

Listing 6-7. Highlighting an Edge Using Morphological Operators
Mat img = new Mat("orange.jpg", LoadImageType.Grayscale);
Mat thresholdedImg = new Mat();
CvInvoke.Threshold(img, thresholdedImg, 240, 255, ThresholdType.BinaryInv);
Mat dest = thresholdedImg.Clone();
CvInvoke.Dilate(dest, dest, new Mat(), new Point(-1, -1), 5, BorderType.Default, new MCvScalar(0));
CvInvoke.BitwiseXor(thresholdedImg, dest, dest);
CvInvoke.BitwiseXor(img, dest, img);

In Listing 6-7, we choose to highlight the contours of an orange. We start off by taking in an image of an orange (Figure 6-11a) inside of a new Mat. We choose to work with a grayscale version (Figure 6-11b) for simplicity’s sake. We binary threshold it (Figure 6-11c) using CvInvoke.Threshold(img, thresholdedImg, 240, 255, ThresholdType.BinaryInv);, turning all the white background pixels dark, as they are of no interest, and the darker orange pixels white, since that is the region of interest. On a copy of the thresholded image, we dilate the bright regions (Figure 6-11d) with OpenCV’s dilate method: CvInvoke.Dilate(dest, dest, new Mat(), new Point(-1, -1), 5, BorderType.Default, new MCvScalar(0));. All the default arguments are used, except for the iterations value, which is set to 5. Iterations refers to how many times you want to apply the morphological operator. In our case, the more times dilate is called, the larger the edge highlighting will ultimately be. We then XOR the thresholded image with its dilated result (Figure 6-11e). Since the XOR operation only results in a true, or 255, brightness value for each pixel that is different between both corresponding Mats, only the regions in an image that have changed between both images will be highlighted in the resulting image. This region is the part of the image that has been dilated beyond the original thresholded image. Finally, we XOR this resulting image with the original image of the grayscale orange to obtain the highlighted contour of the orange (Figure 6-11f).

Figure 6-11a. Standard orange image asset (United States Department of Health and Human Services)
Figure 6-11b. Orange image, grayscaled
Figure 6-11c. Orange grayscale image with thresholding applied
Figure 6-11d. Image after having dilation applied. While the image looks similar to Figure 6-11c, under closer inspection, you will notice that the white area takes up a slightly greater portion of the image. A couple of black pixels near the top of the white area have also been filled in.
Figure 6-11e. The XOR’d result of Figure 6-11c and Figure 6-11d. This is the contouring that will be applied to the final image.
Figure 6-11f. The XOR’d result of Figure 6-11b and Figure 6-11e. The contour obtained from dilation drawn on the initial image.

We applied the contour highlighting to a grayscale image, but had we wanted, we could have used the same technique to apply, say, a red or blue contour around the original image of the orange. This would entail taking pixel values of Figure 6-11e and applying them to the relevant color channels of Figure 6-11a in a weighted manner.
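As a simpler variation on that idea (a minimal sketch, assuming the contour from Figure 6-11e is in a binary Mat named contourMask and the original color orange is in a BGR Mat named colorImg), the masked pixels could simply be painted a solid color instead of being weighted into the channels:

// A minimal sketch: paint the contour pixels solid red on the color image.
// SetTo only touches the pixels where the mask is non-zero (white).
colorImg.SetTo(new MCvScalar(0, 0, 255), contourMask); // BGR order, so (0, 0, 255) is red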

Bitwise & Arithmetical Operations

In the previous section, we briefly dwelled on the bitwise XOR operator. It was a quick way for us to determine the difference between two pictures. Bitwise and arithmetical operations such as the XOR operator are commonly used in image processing. Since images are ultimately arrays filled with numbers, on a numerical level, such operations work as you would expect. On a visual level, however, it might not immediately be obvious which operation to use for which result. Bitwise and arithmetical operations fall under the broader category of array operations. There are dozens of such operations in OpenCV. We covered a sparse few already, such as the inRange(...) method. While in time you will come to learn all of them, for now we will focus on the elementary ones.

The arithmetical and bitwise operators are add, subtract, divide, multiply, AND, NOT, OR , and XOR. Again, their functionality is self-explanatory. Adding, for example, adds each pixel between the corresponding array indices of two images. We will rely on visual representations of these operations to gain a better understanding of them. Figures 6-12a and 6-12b are two base images we will use to demonstrate the use of these operations.

Figure 6-12a. The first source image
Figure 6-12b. The second source image (the red borders are not a part of the image)

Addition

Adding, as explained earlier, adds each corresponding pixel in both images together. When adding two pixels results in a value larger than 255 in any color channel, the excess is cut off. This results in the big white circle being the most prominent artefact in Figure 6-13. Although it’s mathematically clear, it can still be weird to wrap your head around the fact that adding gray to black results in gray in the world of image processing, as opposed to black in, say, a paint app or the real world.

Figure 6-13. Addition of Figure 6-12b to Figure 6-12a

Subtraction

As you would imagine, subtracting returns the opposite result of adding (Figure 6-14).

Figure 6-14. Subtraction of Figure 6-12b from Figure 6-12a

Multiplication

Multiplying two images together is not often too useful, given that multiplying any two pixel values above 16 will result in a value above 255. One way to make use of it would be to intensify certain areas of an image using a mask (Figure 6-15).

Figure 6-15. Multiplication of Figure 6-12b by Figure 6-12a

We are not obliged to multiply two images together, however. We can multiply an image by a scalar. This uniformly brightens an image. Figure 6-16 features Figure 6-12b brightened by 50 percent. This is achieved by multiplying the source image by a scalar value of 1.5. With a positive value smaller than 1, we can achieve the opposite effect.

Figure 6-16. Multiplication of Figure 6-12b by a scalar value of 1.5

Note how the 25 percent gray stripe all the way on the right has turned white. Gray at 25 percent has a 75 percent intensity value, or 191.25. 191.25 × 1.5 = 286.875, which is larger than 255; hence, it becomes white.
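A minimal sketch of that scalar multiplication, assuming an existing Mat named src:

// A minimal sketch: brighten a Mat named src by 50 percent.
// Channel values that exceed 255 saturate to white, as with the 25 percent gray stripe.
Mat brightened = new Mat();
CvInvoke.Multiply(src, new ScalarArray(1.5), brightened);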

Division

Division possesses characteristics similar to those of multiplication (Figure 6-17).

Figure 6-17. Division of Figure 6-12a by Figure 6-12b

Likewise, for division, operating with small scalar values tends to be more useful (Figure 6-18).

Figure 6-18. Division of Figure 6-12a by a scalar of 2

There is a way to benefit from dividing two images, however. Dividing any number (other than 0) by itself results in a value of 1, a value close enough to 0 from an image processing context. We can use this property to determine areas that have changed between two images. Areas that remain the same will appear black, whereas areas that are not the same will report different values. This is particularly useful in removing glare from images when they were taken from multiple angles, though this is a more advanced technique that will not be discussed here.

Bitwise AND

The bitwise AND operator works the way you might have expected adding two images to work. Lighter colors such as white act as a background to darker colors, and ANDing two light colors together can result in a darker color (Figure 6-19).

Figure 6-19. Bitwise conjunction of Figure 6-12a and Figure 6-12b
Note

I just referred to white as a color, and that is sure to ruffle some feathers (everything does these days!). Many of us learned early in school, whether it be from friends, parents, or teachers, that white and/or black are not colors. The answer will depend on whether you look at the additive or subtractive theory of colors, or whether you look at color as light or as pigmentation. Personally, I prefer Wikipedia’s definition best: White, gray, and black are achromatic colors—colors without a hue.

Bitwise NOT

Bitwise NOT is a bit interesting in that it is a unary operation. Thus, it operates only on a single source image. It is my favorite operator because it is the one with the most predictable results: it inverts the colors in an image (Figures 6-20a and 6-20b).

Figure 6-20a. Logical negation of Figure 6-12a
Figure 6-20b. Logical negation of Figure 6-12b

Bitwise OR

Looking at Figure 6-21, you would think that the bitwise OR operator is equivalent to the additive arithmetical operation. While they work similarly in certain cases, they are in fact different.

Figure 6-21. Bitwise disjunction of Figure 6-12a and Figure 6-12b

To see how they differ, let’s rotate Figure 6-12b by 90 degrees and combine it with its original orientation using the bitwise OR and addition operations (Figures 6-21a and 6-21b).

Figure 6-21a. Bitwise disjunction of Figure 6-12b and Figure 6-12b rotated by 90°
Figure 6-21b. Addition of Figure 6-12b and Figure 6-12b rotated by 90°
Figure 6-22. Exclusive bitwise disjunction of Figure 6-12a and Figure 6-12b

Interestingly, the OR operation results in regions staying the same color if both of the source regions had the same color previously. This is unlike the addition operation, which just maxes the regions to white. Mathematically, this makes sense. 1000 0000 | 1000 0000 (128, the intensity of Gray 50 percent) results in 1000 0000. All in all, the OR operation, being so permissive, will usually result in an image that is much brighter, if not mostly white. Most bits in the resulting image will be switched to ones, unless both of their source bits were zeroes. A great use case for this property is to see which parts two images have in common.

Tip

It is important to investigate array operations on different corner cases; they often produce unexpected results. Better yet, thresholding the images to binary will yield more reliable results.

Bitwise XOR

We already saw the bitwise XOR operator in action. We used it to get the difference between two binary images. It is regularly used for that purpose. We should avoid relying on XOR unless the image is binary, however. You would expect two images with highly differing intensities to result in an image with a high intensity. Thus, it will surprise some to learn that when we XOR an intensity of 128 with an intensity of 100, as opposed to a dull gray, the resulting image would be nearly white! This is because 128 in binary is 1000 0000 and 100 is 110 0100; the two share no set bits, so the XOR yields 1110 0100, or 228. For this reason, it is recommended to use subtraction when possible. See Figure 6-22.

Using Arithmetic & Bitwise Operators

To use any of the arithmetic operators in your app, simply call them through CvInvoke.Add(src1, src2, dst);, with Add being replaced by the relevant arithmetic operation: Subtract, Divide, or Multiply. src2 can also be a ScalarArray(double value) if you wish to operate with a scalar value.

Likewise, with bitwise operators, call CvInvoke.BitwiseAnd(src1, src2, dst);, replacing the And portion of the signature with the relevant bitwise operator: Not, Or, or Xor. BitwiseNot(...) will of course only take one src argument.
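A minimal sketch of a few such calls, assuming two same-sized Mats named imgA and imgB:

// A minimal sketch: arithmetic and bitwise array operations on two same-sized Mats.
Mat dst = new Mat();
CvInvoke.Add(imgA, imgB, dst);                     // arithmetic: image + image
CvInvoke.Subtract(imgA, new ScalarArray(50), dst); // arithmetic: image - scalar
CvInvoke.BitwiseAnd(imgA, imgB, dst);              // bitwise conjunction
CvInvoke.BitwiseNot(imgA, dst);                    // unary bitwise negation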

Visualizing Movement Through the Use of Arithmetic Operators

Arithmetic and bitwise operators are often used to compare and contrast two images. One practical purpose of this capability is to detect movement between frames. This can be employed to develop a simplistic video-surveillance or -monitoring system.

Our system will consist of an image-viewing app that displays a binary image. White pixels will indicate movement, whereas black pixels will indicate the lack of movement. The final app will resemble Figure 6-23a. See Listing 6-8.

Figure 6-23a. The Kinect Motion Detector app visualizing the motion of a swinging pendulum made from a 5¥ coin and a length of floss
Figure 6-23b. The precursor scene to Figure 6-23a in color
Listing 6-8. Detecting Motion with Arithmetic Operators
[...] //Declare Kinect and Bitmap variables

Mat priorFrame;
Queue<Mat> subtractedMats = new Queue<Mat>();
Mat cumulativeFrame;


[...] //Initialize Kinect and WriteableBitmap in MainWindow constructor

private void Color_FrameArrived(object sender, ColorFrameArrivedEventArgs e)
{
    using (ColorFrame colorFrame = e.FrameReference.AcquireFrame())
    {
        if (colorFrame != null)
        {
            if ((colorFrameDesc.Width == colorBitmap.PixelWidth) && (colorFrameDesc.Height == colorBitmap.PixelHeight))
            {
                using (KinectBuffer colorBuffer = colorFrame.LockRawImageBuffer())
                {


                    Mat img = new Mat(colorFrameDesc.Height, colorFrameDesc.Width, Emgu.CV.CvEnum.DepthType.Cv8U, 4);
                    colorFrame.CopyConvertedFrameDataToIntPtr(
img.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height * 4), ColorImageFormat.Bgra);
                    CvInvoke.CvtColor(img, img, Emgu.CV.CvEnum.ColorConversion.Bgra2Gray);


                    if (priorFrame != null)
                    {
                        CvInvoke.Subtract(priorFrame, img, priorFrame);
                        CvInvoke.Threshold(priorFrame, priorFrame, 20, 255, Emgu.CV.CvEnum.ThresholdType.Binary);
                        CvInvoke.GaussianBlur(priorFrame, priorFrame, new System.Drawing.Size(3, 3), 5);
                        subtractedMats.Enqueue(priorFrame);
                    }
                    if (subtractedMats.Count > 4)
                    {
                        subtractedMats.Dequeue().Dispose();


                        Mat[] subtractedMatsArray = subtractedMats.ToArray();
                        cumulativeFrame = subtractedMatsArray[0];


                        for (int i = 1; i < 4; i++)
                        {
                            CvInvoke.Add(cumulativeFrame, subtractedMatsArray[i], cumulativeFrame);
                        }
                        colorBitmap.Lock();


                        CopyMemory(colorBitmap.BackBuffer, cumulativeFrame.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height));
                        colorBitmap.AddDirtyRect(new Int32Rect(0, 0, colorBitmap.PixelWidth, colorBitmap.PixelHeight));


                        colorBitmap.Unlock();
                    }
                    priorFrame = img.Clone();
                    img.Dispose();
                }
            }
        }
    }
}

In Listing 6-8, we repurposed the Kinect Image Processing Basics project to develop our movement-detector app. The concept is pretty straightforward. Subtracting one frame from another shows the regions that have changed over the span of 1/30th of a second. It could suffice to simply display these frames as they are, but for many applications, it would be preferable to sum these frames to show a protracted motion. This would enable a very slight movement—say, the twitch of a finger or the rising and falling of the chest—to be much more perceptible and distinguishable from mere background noise. In our program, we sum four of these frames together to get a cumulative frame representing a motion lasting 2/15th of a second (4 × 1/30th of a second differences). The nice thing is that we are always using the last five frames taken by the Kinect to view these changes; thus, the resulting video feed still plays at 30 frames per second.

Listing 6-9. Kinect Motion Detector – Subtracting Frames
CvInvoke.CvtColor(img, img, Emgu.CV.CvEnum.ColorConversion.Bgra2Gray);

if (priorFrame != null)
{
    CvInvoke.Subtract(priorFrame, img, priorFrame);
    CvInvoke.Threshold(priorFrame, priorFrame, 20, 255, Emgu.CV.CvEnum.ThresholdType.Binary);
    CvInvoke.GaussianBlur(priorFrame, priorFrame, new System.Drawing.Size(3, 3), 5);
    subtractedMats.Enqueue(priorFrame);
}

In Listing 6-9, we have the portion of the Color_FrameArrived(...) method that deals with the subtraction of one frame from another. Although the app could conceivably work in color, we convert all images to grayscale so that the final result is easier for object-detection tools to work with and easier for users to interpret.

A priorFrame Mat holds a reference to the image captured the last time Color_FrameArrived(...) was fired. Since subtraction requires two images, we wait until a second frame arrives before starting. After subtraction, we apply a threshold so that any pixel that changed appreciably between frames appears white. Sensor noise from the Kinect hardware can still leave spurious white specks, but a Gaussian blur helps smooth these out. The thresholding and blurring parameters can be tweaked to your liking. Finally, we save the image containing the difference between the current and prior frames into the subtractedMats queue. This queue holds the last few differences, which will be summed together in the next step (Listing 6-10).

Listing 6-10. Kinect Motion Detector – Summing Frame Differences
if (subtractedMats.Count > 4)                    
{
    subtractedMats.Dequeue().Dispose();


    Mat[] subtractedMatsArray = subtractedMats.ToArray();
    cumulativeFrame = subtractedMatsArray[0];


    for (int i = 1; i < 4; i++)
    {
        CvInvoke.Add(cumulativeFrame, subtractedMatsArray[i], cumulativeFrame);
    }
    colorBitmap.Lock();


    CopyMemory(colorBitmap.BackBuffer, cumulativeFrame.DataPointer, (uint)(colorFrameDesc.Width * colorFrameDesc.Height));
    colorBitmap.AddDirtyRect(new Int32Rect(0, 0, colorBitmap.PixelWidth, colorBitmap.PixelHeight));


    colorBitmap.Unlock();
}


priorFrame = img.Clone();
img.Dispose();

We only display an image if our queue has five images (four after junking the oldest one). These are summed together in a for loop and then displayed. The current image is copied into the priorFrame variable for reuse in the next frame’s event handler call.

A final note: we make sure to dispose of the Mat that is dequeued and the one that was just obtained. Not doing so would cause the application to eventually run out of memory.

Although this is a very basic motion detector, it can serve as the foundation for a more complex project. For example, with the use of blob-detection techniques, it can be used to track the velocity of cars moving down a certain stretch of road. It can be placed near a hospital bed and be used to determine the breathing rate of a patient or to see if they are twitching or having seizures (this can be enhanced with the use of the Kinect’s skeletal-tracking abilities). The possibilities are limitless, yet to start you only require an understanding of arithmetic and 2D grids.

Object Detection

Object detection is probably the most touted capability in computer vision introductions. Detecting objects, after all, is how self-driving cars can attain humanlike prescience. Companies like Google, Microsoft, and Tesla spend millions of dollars and human hours developing and collating object-recognition algorithms for various robotics and artificial-intelligence endeavors (the Kinect, of course, being a notable example). Fortunately, we do not have to spend a dime to start using some of these algorithms. OpenCV has some object-recognition tools that can be used out of the box, and we can build around them further to achieve most of our goals.

It is worth giving some attention to the concept of features and feature detection . Features are essentially geometric points of interest in an image’s foreground. These might include corners, blobs, and edges, among other phenomena. Feature detectors, algorithms that detect certain features, can be strung together with image processing techniques to detect objects in an image. The topic can be expansive; thus, we will focus on a couple of out-of-the-box object-recognition techniques in this chapter.
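To make the idea of a feature detector a little more concrete, here is a minimal, hypothetical sketch using Emgu CV's ORBDetector, one of the detectors bundled with the library. The file name scene.jpg is a placeholder, and the snippet assumes the same using directives as Listing 6-11 later in this chapter:

Mat scene = new Mat("scene.jpg", ImreadModes.AnyColor);
ORBDetector orb = new ORBDetector(500); //look for up to 500 keypoints
MKeyPoint[] keypoints = orb.Detect(scene);
Features2DToolbox.DrawKeypoints(scene, new VectorOfKeyPoint(keypoints), scene, new Bgr(0, 255, 0), Features2DToolbox.KeypointDrawType.DrawRichKeypoints); //draw the keypoints in green
CvInvoke.Imshow("ORB keypoints", scene);
CvInvoke.WaitKey(0);

Each detected keypoint is a corner-like point of interest; swapping in a different detector is usually only a one-line change, which makes experimenting with them quite painless.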

Note

A blob refers to a concentration of uniformly textured pixels. For example, the thresholded image of an orange in Figure 6-11c consists of one big white blob. In a sense, the black background region could also be described as a blob. What we consider a blob depends on algorithmic parameters.

Simple Blob Detector

As previously mentioned, objects are typically detected with a series of feature detectors and image processing techniques. OpenCV bundles one such series together in a class called the Simple Blob Detector. Simple Blob Detector returns all the blobs detected in an image filtered by the desired area, darkness, circularity, convexity, and ratio of their inertias.

A421249_1_En_6_Fig24_HTML.jpg
Figure 6-24. Compared to a similarly sized banana, the apple has a higher inertia. Viewed from a 2D perspective, the banana's contour is also less convex than the apple's (neglecting the apple's stem).
Note

Do not fret if you do not remember your college physics! Inertia in this context refers to how likely the blob is to rotate around its principal axis. For all practical intents, this translates to the degree of elongation of the blob, with higher inertia values referring to lesser degrees of elongation (more inertia means the blob will be less susceptible to rotation). Convexity, on the other hand, refers to how dented a blob is. Higher convexity equates to less denting, whereas a blob with low convexity (in other words, more concavity) has one or more larger dents in its shape.

To demonstrate the use of Simple Blob Detector, we will make a program that detects oranges among a group of similarly shaped fruits and vegetables, like those in Figure 6-25.

A421249_1_En_6_Fig25_HTML.jpg
Figure 6-25. An assortment of fruits on a bed cover
Note

The following project is a console project. I recommend copying an existing Emgu CV sample, such as Hello World, to avoid having to set up the dependencies and platform settings from scratch.

Listing 6-11. Detecting Oranges with Simple Blob Detector
using System;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.Util;
using Emgu.CV.CvEnum;
using Emgu.CV.Structure;
using Emgu.CV.Features2D;


namespace OrangeDetector
{
   class Program
   {
      static void Main(string[] args)
      {
            String win1 = "Orange Detector"; //The name of the window
            CvInvoke.NamedWindow(win1); //Create the window using the specific name


            MCvScalar orangeMin = new MCvScalar(10, 211, 140);
            MCvScalar orangeMax = new MCvScalar(18, 255, 255);


            Mat img = new Mat("fruits.jpg", ImreadModes.AnyColor);
            Mat hsvImg = new Mat();
            CvInvoke.CvtColor(img, hsvImg, ColorConversion.Bgr2Hsv);


            CvInvoke.InRange(hsvImg, new ScalarArray(orangeMin), new ScalarArray(orangeMax), hsvImg);

            CvInvoke.MorphologyEx(hsvImg, hsvImg, MorphOp.Close, new Mat(), new Point(-1, -1), 5, BorderType.Default, new MCvScalar());

            SimpleBlobDetectorParams param = new SimpleBlobDetectorParams();
            param.FilterByCircularity = false;
            param.FilterByConvexity = false;
            param.FilterByInertia = false;
            param.FilterByColor = false;
            param.MinArea = 1000;
            param.MaxArea = 50000;


            SimpleBlobDetector detector = new SimpleBlobDetector(param);
            MKeyPoint[] keypoints = detector.Detect(hsvImg);
            Features2DToolbox.DrawKeypoints(img, new VectorOfKeyPoint(keypoints), img, new Bgr(255, 0, 0), Features2DToolbox.KeypointDrawType.DrawRichKeypoints);


            CvInvoke.Imshow(win1, img); //Show image
            CvInvoke.WaitKey(0); //Wait for key press before executing next line
            CvInvoke.DestroyWindow(win1);
      }
   }
}

Listing 6-11 features all the code necessary to detect oranges in an image. The general process is that we first filter our image by the HSV values of our oranges so that only orange-colored regions in our image show up in a binary image. We then use Simple Blob Detector to highlight these regions.

The first step is performed outside of Visual Studio. Using a tool such as GIMP or Photoshop (or even MS Paint), we obtain the HSV range for the oranges. For those not familiar with HSV, it is a color model like RGB; it stands for hue, saturation, and value. Hue refers to the color (e.g., orange, red, blue, etc.), saturation refers to the intensity of the color, and value refers to the brightness of the color.

The HSV model is not standardized between applications and technologies, so the HSV values in GIMP, as depicted in Figure 6-26, will not match those in OpenCV. We will have to translate the values mathematically.

A421249_1_En_6_Fig26_HTML.jpg
Figure 6-26. HSV values taken by GIMP’s color picker

The hue values for our oranges range from 20 to 36 in GIMP, the saturation from 83 to 100, and the value from 55 to 100. In GIMP, the entire range of the hue is 0 to 360, starting and ending at the color red; the saturation ranges from 0 to 100; and the value ranges from 0 to 100. The ranges in OpenCV depend on the color spaces. Since we will be performing a BGR-to-HSV transformation, our hue will range from 0 to 179, starting and ending at the color red. The saturation and value will range from 0 to 255. To determine the hue range, we simply divide our GIMP values by 2. The saturation and value can be obtained by getting their percentages and multiplying those by 255. The end result is H: 10–18, S: 211–255, V: 140–255.
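If you find yourself repeating this conversion, it is easy to wrap in a small helper. Here is a minimal sketch; the method name is my own, not part of Emgu CV:

//Converts a GIMP-style HSV triple (H: 0-360, S: 0-100, V: 0-100) to the ranges
//OpenCV expects after a Bgr2Hsv conversion (H: 0-179, S: 0-255, V: 0-255)
static MCvScalar GimpHsvToOpenCv(double h, double s, double v)
{
    return new MCvScalar(h / 2.0, s / 100.0 * 255.0, v / 100.0 * 255.0);
}

//For example, the lower bound of our orange range:
MCvScalar orangeMin = GimpHsvToOpenCv(20, 83, 55); //roughly (10, 211, 140)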

Note

Had we done RGB to HSV, our hue would have started and ended at the color blue. Had we done RGB to HSV Full, our hue would have gone from 0 to 255. As you can imagine, improper color spaces can be cause for much consternation in image processing work.

Note

We did not take lighting into consideration in our algorithm. Generally, you will have to use techniques such as histogram equalization to minimize the effects of lighting on your detection tasks. While the technique itself is not very complicated, its proper use is somewhat beyond the scope of this book. Visit http://docs.opencv.org/3.1.0/d5/daf/tutorial_py_histogram_equalization.html to learn more.

Now that we have our HSV range, we create two MCvScalars to describe their lower and upper limits:

MCvScalar orangeMin = new MCvScalar(10, 211, 140);
MCvScalar orangeMax = new MCvScalar(18, 255, 255);

MCvScalar is simply a construct for holding a tuple of one, two, three, or four numeric values.

We then load our fruit image and convert it from BGR to HSV. We could have converted from RGB as well; the native image color space is sRGB. The only real difference between BGR and RGB, however, is the order in which the channels are interpreted, and interpreting the data as BGR made the HSV conversion easier in our case (see the earlier note).

Mat img = new Mat("fruits.jpg", ImreadModes.AnyColor);
Mat hsvImg = new Mat();
CvInvoke.CvtColor(img, hsvImg, ColorConversion.Bgr2Hsv);

We then select all HSV values within our desired range (between orangeMin and orangeMax):

CvInvoke.InRange(hsvImg, new ScalarArray(orangeMin), new ScalarArray(orangeMax), hsvImg);

Our MCvScalars have to be wrapped as the single elements of two ScalarArrays, which are exactly what their name suggests: arrays of scalars. In essence, the InRange(...) method is another way of applying thresholds to our image. In our case, all values within the HSV range we set will appear as white on the image, and all other values will appear as black. The resulting image will look like Figure 6-27.

A421249_1_En_6_Fig27_HTML.jpg
Figure 6-27. Filtering the image by orange HSV values

In Figure 6-27, there are a lot of tiny black specks in the white blobs representing the oranges. There is also a larger single black spot in the bottom right orange, which is where its green stem is situated. We will use the Closing morphological operation to eliminate these artefacts:

CvInvoke.MorphologyEx(hsvImg, hsvImg, MorphOp.Close, new Mat(), new Point(-1, -1), 5, BorderType.Default, new MCvScalar());

We used five iterations of the closing operation with default parameters. The processed result is depicted in Figure 6-28.

A421249_1_En_6_Fig28_HTML.jpg
Figure 6-28. Our HSV-thresholded image with the closing morphological operation applied

In addition to the three larger white blobs, which represent the location of our oranges (which are really tangerines, by the way), we have two smaller white splotches in the image. If you guessed that they represented portions of the tomatoes tinged orange due to the lighting, you would be correct. We will simply filter them out in Simple Blob Detector by ignoring all blobs under 1,000 square pixels in area. In a production project, we would have a more advanced set of criteria to eliminate false positives (the Kinect in particular makes this easier because of its ability to measure depth and in turn measure heights and widths). For our simple project, however, we will stick to what comes out of the box for the purpose of demonstrating Simple Blob Detector’s parameters. We configure these parameters with the use of the SimpleBlobDetectorParams class:

SimpleBlobDetectorParams param = new SimpleBlobDetectorParams();
param.FilterByCircularity = false;
param.FilterByConvexity = false;
param.FilterByInertia = false;
param.FilterByColor = false;
param.MinArea = 1000;
param.MaxArea = 50000;

We set filtering to false for all parameters other than area, as we do not need them. The parameters have default values when set to true, and instead of tuning those values for detecting oranges, it is easier to simply turn the filters off. Had we been trying to separate the oranges from, say, carrots, some of these filters would have been worth keeping on.
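For example, carrots are elongated and far less circular than oranges, so we might have turned those filters back on with cutoffs along these lines (the values are hypothetical and untested, shown only to illustrate the parameters):

param.FilterByCircularity = true;
param.MinCircularity = 0.7f;   //oranges are nearly circular in outline; carrots are not
param.FilterByInertia = true;
param.MinInertiaRatio = 0.4f;  //carrots are strongly elongated, giving them a low inertia ratio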

The final bit of code to get our orange detector working consists of the initialization of SimpleBlobDetector itself and its circling of the blobs:

SimpleBlobDetector detector = new SimpleBlobDetector(param);
MKeyPoint[] keypoints = detector.Detect(hsvImg);
Features2DToolbox.DrawKeypoints(img, new VectorOfKeyPoint(keypoints), img, new Bgr(255, 0, 0), Features2DToolbox.KeypointDrawType.DrawRichKeypoints);

The constructor for SimpleBlobDetector takes our parameter object. We then call the instantiated object's Detect(Mat img) method, which returns an array of keypoints (MKeyPoint in Emgu; the M is for "managed"). These keypoints are passed, along with our original color image, to the DrawKeypoints(...) method, which draws circles around our blobs. The inputs for this method are the input image, the keypoints (which must be wrapped in a VectorOfKeyPoint object), the output image, the color of the circles to be drawn (blue in this case, for contrast), and the KeypointDrawType. The KeypointDrawType dictates how elaborately the keypoints should be drawn on the image. DrawRichKeypoints indicates that we should circle the entire keypoint as opposed to just marking its center with a dot (which is the default). It should be noted that each MKeyPoint contains properties such as the size and position of the detected blob. The final result is shown in Figure 6-29.
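If you want to do more than draw the keypoints, you can read those properties directly. A minimal sketch that prints the center and approximate diameter of each detected orange:

foreach (MKeyPoint keypoint in keypoints)
{
    //Point holds the blob's center in pixel coordinates; Size is the keypoint's diameter
    Console.WriteLine("Center: " + keypoint.Point + ", Diameter: " + keypoint.Size);
}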

A421249_1_En_6_Fig29_HTML.jpg
Figure 6-29. Detected oranges in the assortment of fruits and vegetables

The blue circles fail to completely encircle a couple of the oranges. This is because the HSV range did not include the darkest shades of orange, found near the edges of the oranges where the light does not reach. I left it like this intentionally, even though it is trivial to fix (try altering the HSV thresholds and the SimpleBlobDetector area parameters yourself). The goal is to demonstrate that computer vision algorithms are rarely perfect; more often, they are merely "good enough." In our case we could have gotten close to perfection because the project is simple, but there are always tradeoffs to consider. Allowing for a greater HSV range, for example, would make the tomatoes more susceptible to being detected, and we would then need more aggressive filtering to exclude them.

The example we developed is somewhat contrived. Had we taken the picture from a different angle, a different camera, with different lighting and different fruits and/or vegetables, or had we adjusted a million other factors differently, our code would have probably failed to detect every orange or would have detected false oranges. Additionally, this code would need to be further altered to support the detection of other fruits or vegetables. This is a common predicament you will come across in computer vision work. Recognizing objects in one specific scenario is often not that difficult, but enabling your code to work with a high success rate over many different environments is where the challenge lies.

Tip

Whenever possible, try to reduce environmental inconsistencies before the Kinect even starts filming. Standardize as much of your lighting and scene as possible. Refrain from having too much motion and noise in the background. Remove any unnecessary artefacts in the foreground. This will render your computer vision and image processing tasks less arduous.

Conclusion

There are some optimizations applied to the Kinect by Microsoft that you will never be able to replicate without designing your own Kinect, but for most projects an elementary knowledge of computer vision and image processing techniques will take you far. For those who are the types to dream big, the Kinect need not merely be a way to track skeletal joints. The Kinect is but a starter kit, with sensors and certain computer vision and machine-learning abstractions baked in. If there is something you want to track or analyze that the Kinect cannot do on its own already, let your software be the limit, not your hardware.

The techniques discussed in this chapter will take you only so far, but hopefully this gave you a taste of what can be achieved with some matrices and your intuition. All the large, commercial computer vision projects out there are still built with many of these simple blocks, whether it be Amazon Go or Google Car.

Before moving forward, I would like to include a word of caution about computer vision algorithms, particularly about those in the OpenCV library. Not all the algorithms available in the library are free to use commercially. So, it is recommended that you perform due diligence before proceeding with a complex computer vision project. Additionally, certain frameworks, such as Emgu CV, require your code to be released as open source or that you purchase a commercial license.
