Training and Building Production Models

As we enter the last section of the book, this chapter provides an overview of using machine learning in a production environment. At this point in the book, you have learned the various algorithms that ML.NET provides, and you have created a set of three production applications. With all of this knowledge garnered, your first thought will probably be: how can I immediately create the next killer machine learning app? Prior to jumping right into answering that question, this chapter will help to prepare you for those next steps in that journey. As discussed and utilized in previous chapters, there are three major components of training a model: feature engineering, sample gathering, and creating a training pipeline. In this chapter we will focus on those three components, expanding your thought process for how to succeed in creating a production model, as well as providing some suggested tools for being able to repeat that success with a production-grade training pipeline.

Over the course of this chapter we will discuss the following:

  • Investigating feature engineering
  • Obtaining training and testing datasets
  • Creating your model-building pipeline

Investigating feature engineering

As we have discussed in previous chapters, features are one of the most important components, and arguably the most important component, in the model building process. When approaching a new problem, the main question that arises is: how are you going to solve this problem? For example, a common exploit in the cyber-security world is the use of steganography. Steganography, which dates back to 440 BCE, is the practice of hiding data within a container. These containers have included drawings, crosswords, music, and pictures, to name a few. In the cyber-security world, steganography is used to hide malicious payloads in files that would otherwise be ignored, such as images, audio files, and video files.

Take the following image of a basket of food. This image, created using an online steganography tool, has an embedded message in it; see whether you can spot any unusual patterns in the following image:

Most tools today can mask content within both complex and solid-color images to the point where you, as an end user, wouldn't even notice, as seen in the preceding example.

Continuing this scenario, a quick question you might need to answer right now is: does the file contain another file format within the file? Another element to consider is the scope of your question. Attempting to answer the aforementioned question would lead to a time-consuming deep dive into analyzing every file format used with a recursive parser—not something that would make sense to tackle right off the bat. A better option would be to scope the question to perhaps just analyze audio files or image files. Taking that thought process further, let's proceed by scoping the problem to a specific image type and payload type.

PNG image files with embedded executables

Let us dive into this more specific question: how can we detect Windows executables within Portable Network Graphics (PNG) files? For those curious, the reasoning behind specifically choosing PNG files is that they are a very common lossless image format used in video games and on the internet, due to their great image quality-to-file size ratio. This level of usage gives attackers an avenue to get a PNG file onto your machine without you, as the end user, thinking twice about it, whereas a proprietary format or a Windows Executable (EXE) would likely alarm the end user.

In the next section, we will break down the PNG file format step by step.

To dive further into the PNG file format, the specification for PNG is available here: http://libpng.org/pub/png/spec/1.2/PNG-Contents.html

Creating a PNG parser

Let us now take apart the PNG file format into features that could drive a potential model for detecting hidden payloads. A PNG file is structured as a series of contiguous chunks. Each chunk is composed of a header description field, followed by a payload of data. The chunks required for a PNG file include IHDR, IDAT, and IEND. The chunks, as per the specification, must appear in that order. Each of these chunks will be explained below.

Before parsing the chunks, the first step is to check that the file is actually a PNG image file. This check is generally called the File Magic check. The majority of file formats used throughout our digital world have a unique signature, making both the parsing and saving of these files easier.

For those curious about other file formats' signatures, an extensive list can be found here: https://www.garykessler.net/library/file_sigs.html

PNG files specifically begin with the following bytes:

137, 80, 78, 71, 13, 10, 26, 10

By using these File Magic bytes, we can then utilize the SequenceEqual .NET method to compare the file data's first sequence of bytes, as shown in the following code:

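// Requires: using System.IO; using System.Linq;
using var ms = new MemoryStream(data);

// Read as many bytes as the PNG signature is long
byte[] fileMagic = new byte[FileMagicBytes.Length];
ms.Read(fileMagic, 0, fileMagic.Length);

// Any mismatch (including a file shorter than the signature) means this is not a PNG
if (!fileMagic.SequenceEqual(FileMagicBytes))
{
    return (string.Empty, false, null);
}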

If the bytes read do not match the FileMagicBytes property, SequenceEqual returns false and we return false for the validity flag. In this scenario, the file is not a PNG file, and therefore, we want to stop parsing the file any further.

From this point, we will now iterate over the chunks of the file. If at any point the bytes aren't set properly, this should be noted, since legitimate tools such as Microsoft Paint or Adobe Photoshop save files as per the PNG file format's specification. A malicious generator, on the other hand, may bend the rules around adhering to the specification. The iteration is shown here:

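// "<" rather than "!=" so that seeking past the end of a truncated file still ends the loop
while (ms.Position < data.Length)
{
    // Each chunk starts with a 4-byte length of its data...
    byte[] chunkInfo = new byte[ChunkInfoSize];
    ms.Read(chunkInfo, 0, chunkInfo.Length);
    var chunkSize = chunkInfo.ToInt32();

    // ...followed by the 4-character chunk type (IHDR, IDAT, IEND)
    byte[] chunkIdBytes = new byte[ChunkIdSize];
    ms.Read(chunkIdBytes, 0, ChunkIdSize);
    var chunkId = Encoding.UTF8.GetString(chunkIdBytes);

    // The chunk data itself
    byte[] chunk = new byte[chunkSize];
    ms.Read(chunk, 0, chunkSize);

    // Skip the 4-byte CRC that the PNG specification places after the chunk data
    ms.Seek(4, SeekOrigin.Current);

    switch (chunkId)
    {
        case nameof(IHDR):
            var header = new IHDR(chunk);

            // File size is within what the dimensions allow (long avoids Int32 overflow)
            if (data.Length <= ((long)header.Width * header.Height * MaxByteDepth) + ms.Position)
            {
                break;
            }

            // Payload exceeds length
            return (FileType, false, new[] { "SUSPICIOUS: Payload is larger than what the size should be" });
        case nameof(IDAT):
            // Build Embedded file from the chunks
            break;
        case nameof(IEND):
            // Note that the PNG had an end
            break;
    }
}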

For each chunk, we first read ChunkInfoSize bytes (defined as 4) into the chunkInfo array; once read, this array contains the size of the chunk data that follows. To determine which chunk type we are reading, we then read the 4-byte chunk ID, which is a 4-character string (IHDR, IDAT, or IEND). Per the PNG specification, each chunk also ends with a 4-byte CRC, which a parser must read past before the next chunk begins.

Once we have the chunk size and the type, we then build out the class object representations of each. For the scope of this code example, we will just look at a snippet of the IHDR class, which contains the high-level image properties such as the dimensions, bit depth, and compression:

public class IHDR
{
    public Int32 Width;
    public Int32 Height;
    public byte BitDepth;
    public byte ColorType;
    public byte Compression;
    public byte FilterMethod;
    public byte Interlace;

    public IHDR(byte[] data)
    {
        // Width and Height are the first two 4-byte fields of the IHDR data
        Width = data.ToInt32();
        Height = data.ToInt32(4);
    }
}

We'll just pull the Width and Height properties, which are the first 8 bytes (4 bytes each). For this example, we also make use of an extension method to convert a byte array into an Int32. In most cases, BitConverter would be the ideal choice; however, for this code example, I wanted to simplify the sequential accessing of data, such as the offset of 4 bytes when retrieving the previously mentioned Height property.
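The chapter does not show the extension method itself, so the following is only a minimal sketch of what such a helper might look like. The class name ByteArrayExtensions and the optional offset parameter are assumptions made for illustration. PNG stores its integers in big-endian (network) byte order, which is why this sketch uses BinaryPrimitives rather than relying on the machine's native byte order:

```csharp
using System;
using System.Buffers.Binary;

public static class ByteArrayExtensions
{
    // Hypothetical helper (not the book's implementation): reads a 4-byte
    // big-endian integer from data, starting at the given offset.
    public static int ToInt32(this byte[] data, int offset = 0)
    {
        if (data == null)
        {
            throw new ArgumentNullException(nameof(data));
        }

        if (offset < 0 || data.Length - offset < 4)
        {
            throw new ArgumentOutOfRangeException(nameof(offset));
        }

        // PNG integers are stored most-significant byte first
        return BinaryPrimitives.ReadInt32BigEndian(data.AsSpan(offset, 4));
    }
}
```

With a helper along these lines, data.ToInt32() reads the width and data.ToInt32(4) reads the height from the IHDR data.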

The previously mentioned IDAT chunks hold the actual image data and are therefore the chunks most likely to contain an embedded payload. The IEND chunk, as the name implies, simply tells the PNG parser that the file is complete; there is no payload in the IEND chunk.

Once the file has been parsed, we return the file type (PNG) and whether or not it is a validly structured PNG file, and we note anything that is suspicious, such as the file size being considerably larger than it should be given the width, height, and maximum bit depth (24). In a production model, each of these notes could be normalized, along with the valid/invalid flag. In addition, these could be given a numeric representation with a simple enumeration.

For those who are curious about the full application's source code, please refer to https://github.com/jcapellman/virus-tortoise, which utilizes many of the same principles that were shown in the Creating the File Classification application section of Chapter 9, Using ML.NET with ASP.NET Core.

Taking this example a step further, iterating through the IDAT chunks that contain the actual image data (and potential executable payloads) would complete the extractor in a production application.
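As a rough illustration of the enumeration idea, the sketch below maps parser notes to a fixed-length numeric vector. The PngNote enum, its members, and the PngNoteFeatures class are hypothetical names chosen for this example; they are not taken from the application's source code:

```csharp
using System;

// Hypothetical set of parser notes; a real extractor would define its own.
[Flags]
public enum PngNote
{
    None = 0,
    InvalidStructure = 1,
    OversizedPayload = 2,
    MissingEnd = 4
}

public static class PngNoteFeatures
{
    // Converts the combined notes into one 0/1 value per note, in a fixed
    // order, so the result can be used as part of a model's feature vector.
    public static float[] ToFeatureVector(PngNote notes)
    {
        return new[]
        {
            notes.HasFlag(PngNote.InvalidStructure) ? 1f : 0f,
            notes.HasFlag(PngNote.OversizedPayload) ? 1f : 0f,
            notes.HasFlag(PngNote.MissingEnd) ? 1f : 0f
        };
    }
}
```

Keeping the order of the values fixed matters here, since a trained model associates each position in the vector with one specific note.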
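One possible building block for that step is a check for a Windows executable header inside a byte buffer. The sketch below is an assumption-laden illustration and not the application's actual code: PeSignatureScanner is a hypothetical name, and it assumes the executable is stored uncompressed in the bytes handed to it. Since IDAT data in a valid PNG is zlib-compressed, a production extractor would likely also need to inspect the decompressed stream. The check itself relies on the Windows executable layout: the file starts with "MZ", and the 4-byte little-endian value at offset 0x3C points to the "PE\0\0" signature:

```csharp
using System;
using System.Buffers.Binary;

public static class PeSignatureScanner
{
    // Returns the offset of the first plausible Windows executable header
    // in buffer, or -1 if none is found.
    public static int FindPortableExecutable(byte[] buffer)
    {
        if (buffer == null)
        {
            throw new ArgumentNullException(nameof(buffer));
        }

        // The DOS header is at least 0x40 bytes long
        for (int i = 0; i + 0x40 <= buffer.Length; i++)
        {
            if (buffer[i] != (byte)'M' || buffer[i + 1] != (byte)'Z')
            {
                continue;
            }

            // Offset 0x3C of the DOS header holds the position of the PE header
            int peOffset = BinaryPrimitives.ReadInt32LittleEndian(buffer.AsSpan(i + 0x3C, 4));
            long peStart = (long)i + peOffset;

            if (peOffset < 0x40 || peStart + 4 > buffer.Length)
            {
                continue;
            }

            if (buffer[peStart] == (byte)'P' && buffer[peStart + 1] == (byte)'E'
                && buffer[peStart + 2] == 0 && buffer[peStart + 3] == 0)
            {
                return i;
            }
        }

        return -1;
    }
}
```

Checking both signatures, rather than the two-byte "MZ" marker alone, reduces the chance of flagging image data that happens to contain those two bytes.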

Now that we have seen the required level of effort for building a production level of features, let us dive into building a production training dataset.

Obtaining training and testing datasets

Now that we have completed our discussion on feature engineering, the next step is to obtain a dataset. For some problems, this can be very difficult. For instance, when attempting to predict something that no one else has predicted before, or something in an emerging sector, obtaining a training set would be more difficult than, say, finding malicious files for our previous example.

Another aspect to consider is diversity and how the data is broken out. For instance, consider how you would predict malicious Android applications based on behavioral analysis using the anomaly detection trainer that ML.NET provides. When thinking about building your dataset, most Android users, I would argue, do not have half of their apps as malicious. Therefore, an even (50/50) breakdown of malicious and benign samples in the training and test sets might over-fit on malicious applications. Figuring out and analyzing the actual representation of what your target users will encounter is critical; otherwise, your model may tend toward either false positives or false negatives, neither of which your end users will be happy with.
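To make the class-balance point concrete, here is a minimal sketch of down-sampling a dataset to a chosen malicious-to-benign ratio. The DatasetSampler class, its SampleToRatio method, and the parameter names are all hypothetical, and the right ratio for your problem is something you would need to estimate from what your users actually encounter:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DatasetSampler
{
    // Hypothetical helper: randomly keeps as many samples as possible while
    // making malicious samples roughly maliciousRatio of the result.
    // The fixed seed keeps the selection repeatable between runs.
    public static List<(T Sample, bool IsMalicious)> SampleToRatio<T>(
        IEnumerable<T> benign,
        IEnumerable<T> malicious,
        double maliciousRatio,
        int seed = 0)
    {
        if (maliciousRatio <= 0 || maliciousRatio >= 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maliciousRatio));
        }

        var rng = new Random(seed);
        var b = benign.OrderBy(_ => rng.Next()).ToList();
        var m = malicious.OrderBy(_ => rng.Next()).ToList();

        // Largest total for which neither class runs out of samples
        int total = (int)Math.Floor(Math.Min(
            m.Count / maliciousRatio,
            b.Count / (1 - maliciousRatio)));
        int mCount = Math.Min(m.Count, (int)Math.Round(total * maliciousRatio));
        int bCount = Math.Min(b.Count, total - mCount);

        // Malicious samples come first; shuffle the result if order matters
        return m.Take(mCount).Select(s => (Sample: s, IsMalicious: true))
            .Concat(b.Take(bCount).Select(s => (Sample: s, IsMalicious: false)))
            .ToList();
    }
}
```

For example, with 100 benign samples, 50 malicious samples, and a target ratio of 0.25, this sketch keeps all 100 benign samples and 33 of the malicious ones.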

The last element to consider for your training and testing datasets is how you obtain them. Since your model is largely based on the training and test datasets, finding real datasets that represent your problem set is crucial. Using the previous steganography example, if you pulled random PNG files without validation, there is a possibility of training a model on bad data. A mitigation for this would be to check for hidden payloads within the IDAT chunks. Likewise, validating the actual file types in the PNG example is critical. Training on JPG, BMP, or GIF files mixed in with your PNG files, when you only run against PNG files in your production app, could lead to false positives or negatives. Because the binary structures of the other image formats differ from PNG, this non-representative data will skew the training set toward the unsupported formats.
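One simple form of this validation is to keep only files that begin with the PNG signature shown earlier in the chapter, regardless of their file extension. The sketch below is illustrative only; TrainingSetValidator and its method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TrainingSetValidator
{
    // The PNG File Magic bytes
    private static readonly byte[] PngSignature = { 137, 80, 78, 71, 13, 10, 26, 10 };

    // True when data begins with the PNG signature
    public static bool HasPngSignature(byte[] data)
    {
        return data != null
            && data.Length >= PngSignature.Length
            && data.Take(PngSignature.Length).SequenceEqual(PngSignature);
    }

    // Returns only the paths whose contents start with the PNG signature,
    // reading just the first 8 bytes of each file.
    public static List<string> FilterPngFiles(IEnumerable<string> paths)
    {
        var accepted = new List<string>();

        foreach (var path in paths)
        {
            var prefix = new byte[PngSignature.Length];
            int read;

            using (var fs = File.OpenRead(path))
            {
                read = fs.Read(prefix, 0, prefix.Length);
            }

            if (read == prefix.Length && HasPngSignature(prefix))
            {
                accepted.Add(path);
            }
        }

        return accepted;
    }
}
```

A signature check only confirms the container type; it does not confirm that a file is benign or correctly labeled, so it complements rather than replaces the payload checks described above.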

For those in the cyber-security field, VirusTotal (https://www.virustotal.com) and Reversing Labs (https://www.reversinglabs.com) offer extensive databases of files to download for a fee if local sources of data for various file types prove difficult to obtain.

Creating your model-building pipeline

Once your feature extractor has been created and your dataset obtained, the next element to establish is a model building pipeline. The definition of the model building pipeline can be shown better in the following diagram:

For each of the steps, we will discuss how they relate to the pipeline that you choose in the next section.

Discussing attributes to consider in a pipeline platform

There are quite a few pipeline tools that are available for deployment on-premises, in the cloud, and as SaaS (Software as a Service) services. We will review a few of the more commonly used platforms in the industry. However, the following points are a few elements to keep in mind, no matter which platform you choose:

  • Speed is important for several reasons. While building your initial model, the time to iterate is very important, as you will more than likely be adjusting your training set and hyper-parameters in order to test various combinations. On the other end of the process, when you are in pre-production or production, the time to iterate with testers or customers (who are awaiting a new model in order to address issues or add features) is critical in most cases.
  • Repeatability is also important to ensure that an identical model can be rebuilt every time, given the same dataset, features, and hyper-parameters. Utilizing automation as much as possible is one method to avoid the human-error aspect of training models, while also helping the repeatability aspect. All of the platforms that will be reviewed in the next section promote defining a pipeline without any human input after launching a new training session.
  • Versioning and tracking of comparisons are important in order to ensure that when changes are made, they can be compared. For example, whether the change is to hyper-parameters (such as the depth of your trees in a FastTree model) or to additional samples that you add, keeping track of these changes as you iterate is critical. Hypothetically, if you made a documented change and your efficacy dropped significantly, you could always go back and evaluate that one change. If you hadn't versioned or documented your individual changes for comparison, the cause of the drop in efficacy could be very difficult to pinpoint. Another element of tracking is to track progress over a period of time, such as per quarter or per year. This level of tracking can help to paint a picture of trends in efficacy and can drive next steps, such as obtaining more samples or adding additional features.
  • Lastly, quality assurance is important for several reasons and, in almost every case, critical to the success or failure of a project. Imagine a model being deployed straight to production without any checks by a dedicated quality assurance team performing manual and automated tests. Automated tests can be as simple as a set of unit tests that ensure samples test the same, or better, from model to model prior to release and then to production; such tests can be a good stop-gap solution in place of an entire automated suite of tests with specific ranges of efficacy to keep within.
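The "same, or better, from model to model" check described in the last point can be sketched as a small regression gate. This is only one possible shape for such a test; the ModelRegressionGate class and the use of dictionaries keyed by a sample ID are assumptions made for illustration, with predictions gathered however your pipeline produces them:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ModelRegressionGate
{
    // Returns the IDs of samples that the current model classifies correctly
    // but the candidate model gets wrong (or does not cover at all).
    public static List<string> FindRegressions(
        IReadOnlyDictionary<string, bool> expected,
        IReadOnlyDictionary<string, bool> currentPredictions,
        IReadOnlyDictionary<string, bool> candidatePredictions)
    {
        return expected
            .Where(kv => currentPredictions.TryGetValue(kv.Key, out var cur) && cur == kv.Value)
            .Where(kv => !candidatePredictions.TryGetValue(kv.Key, out var cand) || cand != kv.Value)
            .Select(kv => kv.Key)
            .OrderBy(id => id, StringComparer.Ordinal)
            .ToList();
    }
}
```

A unit test in the pipeline could then fail the build whenever the returned list is non-empty, which blocks a candidate model that regresses on any sample the current model already handles.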

All four of these elements should be considered when performing each step in the model building pipeline that was discussed in the previous section. The last step of delivery depends on the previous three elements being completed properly. The actual delivery is dependent on your application. For instance, if you're creating an ASP.NET application, such as the one that we created in Chapter 9, Using ML.NET with ASP.NET Core, adding the ML.NET model as part of your Jenkins pipeline—so that it automatically gets bundled with your deployment—would be a good approach.

Exploring machine learning platforms

The following are platforms that I have personally used, or that colleagues of mine have utilized, in order to solve various problems. Each platform has its pros and cons, especially given the uniqueness of each problem that we are trying to solve.

Azure Machine Learning

Microsoft's Azure Cloud Platform provides a complete platform for Kubernetes, virtual machines, and databases, in addition to providing a machine learning platform. This platform provides direct connections to Azure SQL databases, Azure File Storage, and public URLs, to name just a few of the sources for training and test sets. A lightweight version that doesn't scale is provided inside Visual Studio Community 2019 for free. The following screenshot shows the full-fledged UI:

In addition, non-.NET technologies, such as TensorFlow, PyTorch, and scikit-learn are fully supported. Tools such as the popular Jupyter Notebook and Azure Notebook are also fully supported.

Similar to Apache Airflow, reviewing run histories in order to compare versions is also easy to do in Azure Machine Learning.

All phases of the aforementioned model building pipeline are supported. Here are some of the pros and cons of Azure Machine Learning:

Pros:

  • Extensive integrations into multiple data sources
  • ML.NET natively supported
  • Can scale up and down depending on your needs
  • No infrastructure setup required

Cons:

  • Can be expensive when training

Apache Airflow

Apache Airflow, an open source tool, provides the ability to create pipelines of almost unlimited complexity. While .NET is not a natively supported framework, .NET Core applications (such as those that we have created throughout this book) can run, provided the .NET Core runtime is installed or the application is compiled with the self-contained flags. While the learning curve is steeper than that of Microsoft's Azure Machine Learning platform, Airflow being free might make it more beneficial in certain scenarios, especially when simply experimenting. The following screenshot shows the UI of Airflow:

Much like Azure Machine Learning, the visualization of the pipelines does make the configuration of a particular pipeline easier than in Apache Spark. However, much like Apache Spark, the setup and configuration (depending on your skill level) can be quite daunting, especially when following the pip installation. An easier path to get up and running is by using a pre-built Docker container, such as Puckel's Docker container (https://hub.docker.com/r/puckel/docker-airflow).

Here are some of the pros and cons of Apache Airflow:

Pros:

  • Free and open source
  • Ample documentation and examples, given its 4+ years of history
  • Runs on Windows, Linux, and macOS

Cons:

  • Complex to set up (especially with the official pip instructions)
  • .NET is not natively supported

Apache Spark

Apache Spark, another open source tool, while generally used in big-data pipelines, can also be configured for feature extraction, training, and the production of models at scale. When memory and CPU constraints are hindering your ability to build models (for example, when training with a massive dataset), Apache Spark can help: I have personally seen it scale to fully utilize multiple 64C/128T AMD servers with over a terabyte of RAM. I found this platform to be more difficult to set up than Apache Airflow or Azure's Machine Learning platform; however, once set up, it can be quite powerful. The following screenshot shows the UI of Apache Spark:

A great step-by-step install guide can be found on Microsoft's Apache Spark page (https://dotnet.microsoft.com/learn/data/spark-tutorial/intro) for both Windows and Linux. This guide does remove some of the unknowns; however, compared to Azure or Airflow, it is still far from easy to get up and running. Here are some of the pros and cons of Apache Spark:

Pros:

  • Free and open source
  • .NET bindings from Microsoft
  • Lots of documentation due to its long history (> 5 years old)
  • Runs on Windows, macOS, and Linux

Cons:

  • Can be difficult to configure and get up and running
  • Sensitive to IT infrastructure changes

Microsoft has written a .NET binding for Apache Spark and released it for free: https://dotnet.microsoft.com/apps/data/spark. These bindings are available for Windows, macOS, and Linux.

Summary

Over the course of this chapter, we have taken a deep dive into what goes into production-ready model training, from the original purpose question to a trained model. Through this deep dive, we have examined the level of effort that is needed to create detailed features through production thought processes and feature engineering. We then reviewed the challenges of obtaining training and test datasets, along with ways to address them. Lastly, we also dove into the importance of an actual model building pipeline, using an entirely automated process.

In the next chapter, we will utilize a pre-built TensorFlow model in a WPF application to determine if a submitted image contains certain objects or not. This deep dive will explore how ML.NET provides an easy-to-use interface for TensorFlow models.
