Setting Up the ML.NET Environment

Now that you have a firm grasp of the basics of machine learning, an understanding of what Microsoft's ML.NET is, and what it provides, it is time to train and create your first machine learning model! We will be building a simple restaurant sentiment analysis model based on reviews and integrating this model into a simple .NET Core application. Before we can jump into training and creating our model, we will first have to configure the development environment.

In this chapter, we will cover the following topics:

  • Setting up your development environment
  • Creating your first model, from start to finish, with ML.NET
  • Evaluating the model

Setting up your development environment

Fortunately, configuring your environment for ML.NET is relatively easy. In this section, we will be installing Visual Studio 2019 and .NET Core 3. If you are unsure whether you have either installed, please observe the following steps. In addition, there are some organizational elements and processes to establish early on as we proceed through this book and you begin experimenting on your own.

Installing Visual Studio

At the heart of ML.NET development is Microsoft Visual Studio. For all samples and screenshots used throughout this book, Microsoft Visual Studio 2019 Professional on Windows 10 19H2 will be used. At the time of writing, 16.3.0 is the latest version. Please use the latest version available. If you do not have Visual Studio 2019, a fully featured Community version is available for free on www.visualstudio.com.

For the scope of this book as mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, we will be creating a wide range of application types to demonstrate ML.NET in various problem areas on specific application platforms. Hence, we are going to install several of the available workloads upfront to avoid having to return to the installer in later chapters:

  1. Firstly, ensure that .NET desktop development, Universal Windows Platform development, and ASP.NET and web development are checked. These workloads will enable you to create the UWP, WPF, and ASP.NET applications that we will be using in later chapters:

  2. In addition, ensure that .NET Core cross-platform development is also checked. This will enable .NET Core development for both command-line and desktop apps, such as the app we will be making later in this chapter:

Installing .NET Core 3

As mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, .NET Core 3 is the preferred .NET framework at the time of writing when targeting multiple platforms, due to the optimization work achieved during the development of .NET Core 3. At the time of writing, .NET Core 3 is not bundled with the Visual Studio Installer prior to version 16.3.0 and needs to be downloaded separately from https://dotnet.microsoft.com/download/dotnet-core/3.0. The download used throughout the scope of this book is version 3.0.100, but a newer version may be available by the time you are reading this. For those readers who are curious, the runtime is bundled with the SDK.

You can verify that the installation was successful by opening a PowerShell or Command Prompt and executing the following command:

dotnet --version
3.0.100

The output should begin with 3, as shown here. At the time of writing, 3.0.100 is the latest production version available.

Be sure to install both 32-bit and 64-bit versions to avoid issues when targeting 32-bit and 64-bit platforms later on in this book and your future experiments.

Creating a process

Over the course of this book and your own explorations, you will gather sample data, build models, and try various applications. Establishing a process early on to keep these elements organized will make things easier in the long run. Here are a few suggestions to keep in mind:

  • Always use source control for all of your code.
  • Ensure that test and training sets are named properly in their own folders (versioned if possible).
  • Version your models with both naming conventions and source control.
  • Retain evaluation metrics in a spreadsheet along with the parameters used.

As you develop your skillset and create more complex problems, additional tooling such as Apache Spark or other clustering platforms will more than likely be required. We will discuss this in Chapter 11, Training and Building Production Models, along with other suggestions on training at scale.

Creating your first ML.NET application

The time has come to start creating your first ML.NET application. For this first application, we will create a .NET Core console application. This application will classify a sentence of words as either a positive statement or a negative statement, training on a small sample dataset provided. For this project, we will use a binary logistic regression classification model using the Stochastic Dual Coordinate Ascent (SDCA) method. In Chapter 3, Regression Model, we will go into greater depth on this method.

Creating the project in Visual Studio

Upon opening Visual Studio, depending on your configuration, it will either open directly to the project creation screen or present an empty Visual Studio window. If your environment displays the latter, simply click File, then New, and then Project:

  1. When the window opens, type console app in the search field to find Console App (.NET Core). Make sure that the language type is C# (there are Visual Basic templates of the same name), highlight this template, and then click Next:

  2. I suggest naming the project something you can refer back to, such as Chapter02, to help you find the project later:

  3. At this point, you have a .NET Core 3 console application, so now let's add the ML.NET NuGet package. Right-click on the project and click Manage NuGet Packages:

  4. Type microsoft ml into the search field. You should see the latest Microsoft.ML version available:

  5. Once found, click the Install button. Simple!
At the time of writing, 1.3.1 is the latest version available and all examples throughout this book will use that version. Prior to 1.0, the syntax was very much in flux, but since then has been consistent, so using a newer version should function identically.

At this point, the project is configured for ML.NET—all future projects will reference ML.NET in this fashion and refer you back to these steps.

Project architecture

The simple project will be split into two primary functions:

  • Training and evaluation
  • Model runs

This split in functionality mirrors real-world production applications that utilize machine learning, as there are often teams dedicated to each.

For those who wish to start with a completed project and follow along with the rest of this section, you can get the code from here: https://github.com/PacktPublishing/Hands-On-Machine-Learning-With-ML.NET/tree/master/chapter02

The following screenshot shows the project breakdown in Solution Explorer of Visual Studio. As mentioned earlier, the project is split into two main classes—Predictor and Trainer:

The Trainer class contains all the model building and evaluation code, while the Predictor class, as the name implies, contains the code to run predictions with a trained model.

The BaseML class is what we will be using in subsequent chapters and expanding upon throughout the remainder of the book. The idea behind this class is to cut down on the DRY (don't repeat yourself) violations and to create a cohesive and easy to iterate framework. The Constants class further assists this idea—to cut down on magic strings as we move into more complex applications; this design will be used in all future chapter projects.

Lastly, the Program class is the main entry point for our console application.

Running the code

We will now deep dive into the various classes used within this project, including the following classes:

  • RestaurantFeedback
  • RestaurantPrediction
  • Trainer
  • Predictor
  • BaseML
  • Program

The RestaurantFeedback class

The RestaurantFeedback class provides the input class for our model. In ML.NET (and other frameworks), the traditional approach is to have a structured input to feed into your data pipeline, which, in turn, is passed into the training phase and eventually your trained model.

The following code defines the container class that holds our input values. This is the approach that we will use throughout the rest of the book:

using Microsoft.ML.Data;

namespace chapter02.ML.Objects
{
public class RestaurantFeedback
{
[LoadColumn(0)]
public bool Label { get; set; }

[LoadColumn(1)]
public string Text { get; set; }
}
}

You might be wondering, at first glance, what the correlation is between the Label and Text properties in the RestaurantFeedback class and the source data. Contained within the Data folder, there is a file named sampledata.csv. This file contains the following:

0 "Great Pizza"
0 "Awesome customer service"
1 "Dirty floors"
1 "Very expensive"
0 "Toppings are good"
1 "Parking is terrible"
0 "Bathrooms are clean"
1 "Management is unhelpful"
0 "Lighting and atmosphere are romantic"
1 "Crust was burnt"
0 "Pineapple was fresh"
1 "Lack of garlic cloves is upsetting"
0 "Good experience, would come back"
0 "Friendly staff"
1 "Rude customer service"
1 "Waiters never came back"
1 "Could not believe the napkins were $10!"
0 "Supersized Pizza is a great deal"
0 "$5 all you can eat deal is good"
1 "Overpriced and was shocked that utensils were an upcharge"

The first column maps to the Label property. As you might recall in Chapter 1, Getting Started with Machine Learning and ML.NET, supervised learning such as that being performed in this sample requires labeling. In this project, our label is a Boolean. False (0) in the dataset indicates positive feedback, while True (1) indicates negative feedback.

The second column maps to the Text property to propagate the sentiment (that is, the sentence to feed into the model).

The RestaurantPrediction class

The RestaurantPrediction class contains the output properties that will come out of our model runs. Depending on the algorithm used, the output class, as you will find in future chapters, will contain many more properties:

using Microsoft.ML.Data;

namespace chapter02.ML.Objects
{
public class RestaurantPrediction
{
[ColumnName("PredictedLabel")]
public bool Prediction { get; set; }

public float Probability { get; set; }

public float Score { get; set; }
}
}

Akin to the RestaurantFeedback Label property, the Prediction property contains the overall result of positive or negative feedback. The Probability property contains the confidence of our model of that decision. The Score property is used for the evaluation of our model.
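For a Platt-calibrated model like the one this chapter trains, the Probability is (roughly speaking) a sigmoid-style squashing of the raw Score into the 0-1 range. The following is a minimal plain-C# sketch of that idea only; the actual calibrator fits its own slope and offset parameters, so real ML.NET probabilities will not come from this exact function:

```csharp
using System;

class CalibrationSketch
{
    // A plain logistic sigmoid; Platt scaling fits slope/offset
    // parameters around a function of this general shape.
    static double Sigmoid(double score) => 1.0 / (1.0 + Math.Exp(-score));

    static void Main()
    {
        // A large positive score maps to a probability near 1,
        // a large negative score to a probability near 0.
        Console.WriteLine(Sigmoid(4.0));  // ~0.98
        Console.WriteLine(Sigmoid(-4.0)); // ~0.02
        Console.WriteLine(Sigmoid(0.0));  // exactly 0.5
    }
}
```

This is why a prediction with a score near zero comes back with a probability near 50%, in other words, the model is unsure.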

The Trainer class

In the following, you will find the sole method in the Trainer class. At a high level, the Train method does the following:

  • It loads the training data (in this case our CSV) into memory.
  • It builds a training set and a test set.
  • It creates the pipeline.
  • It trains and saves the model.
  • It performs an evaluation on the model.

This is the structure and flow we will follow throughout the rest of this book. Now, let's dive into the code behind the Train method:

  1. First, we check to make sure that the training data filename exists:
if (!File.Exists(trainingFileName)) {
    Console.WriteLine($"Failed to find training data file ({trainingFileName})");

    return;
}

Even though this is a simple test application, it is always a good practice to treat it like a production-grade application. In addition, since this is a console application, you may incorrectly pass in a path for the training data, which then can cause exceptions further on in the method.

  2. Use the LoadFromTextFile helper method that ML.NET provides to assist with the loading of text files into an IDataView object:
IDataView trainingDataView = MlContext.Data.LoadFromTextFile<RestaurantFeedback>(trainingFileName);

As you can see, we are passing in both the training filename and the type; in this case, it is the RestaurantFeedback class mentioned earlier. It should be noted that this method has several other parameters, including the following:

  • separatorChar: This is the column separator character; it defaults to \t (in other words, a tab).
  • hasHeader: If set to true, the dataset's first row has the header; it defaults to false.
  • allowQuoting: This defines whether the source file can contain columns defined by a quoted string; it defaults to false.
  • trimWhitespace: This removes trailing whitespace from the rows; it defaults to false.
  • allowSparse: This defines whether the file can contain numerical vectors in sparse format; it defaults to false. The sparse format requires a new column to have the number of features.

For most projects used throughout this book, we will use the default settings.
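To illustrate how these parameters are overridden, here is a hypothetical call for a comma-separated file that has a header row and quoted text columns. The file path and its layout are made up for this sketch; the parameter names match the list above:

```csharp
// Hypothetical: load a comma-separated file with a header row,
// allowing quoted values such as "Great, fresh pizza".
IDataView commaSeparatedView = MlContext.Data.LoadFromTextFile<RestaurantFeedback>(
    "Data/otherdata.csv",
    separatorChar: ',',
    hasHeader: true,
    allowQuoting: true);
```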

  3. Given the IDataView object we created previously, use the TrainTestSplit method that ML.NET provides to create a test set from the main training data:
DataOperationsCatalog.TrainTestData dataSplit = MlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.2);

As mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, sample data is split into two sets—training and test. The parameter, testFraction, specifies the percentage of the dataset to hold back for testing, in our case, 20%. By default, this parameter is set to 0.2.

  4. Next, we create the pipeline:
TextFeaturizingEstimator dataProcessPipeline = MlContext.Transforms.Text.FeaturizeText(
    outputColumnName: "Features",
    inputColumnName: nameof(RestaurantFeedback.Text));

Future examples will have a much more complex pipeline. In this example, we are simply mapping the Text property discussed earlier to the Features output column.

  5. Next, we instantiate our trainer:
SdcaLogisticRegressionBinaryTrainer sdcaRegressionTrainer =
    MlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: nameof(RestaurantFeedback.Label),
        featureColumnName: "Features");

As you might remember from Chapter 1, Getting Started with Machine Learning and ML.NET, the various algorithms found in ML.NET are referred to as trainers. In this project, we are using an SDCA trainer.

  6. Then, we complete the pipeline by appending the trainer we instantiated previously:
EstimatorChain<BinaryPredictionTransformer<CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> trainingPipeline = dataProcessPipeline.Append(sdcaRegressionTrainer);
  7. Next, we train the model with the training set we created earlier in the chapter:
ITransformer trainedModel = trainingPipeline.Fit(dataSplit.TrainSet);
  8. We save our newly created model to the filename specified, matching the training set's schema:
MlContext.Model.Save(trainedModel, dataSplit.TrainSet.Schema, ModelPath);
  9. Now, we transform the test set we created earlier with our newly trained model:
IDataView testSetTransform = trainedModel.Transform(dataSplit.TestSet);
  10. Finally, we pass the testSetTransform object created previously into the BinaryClassification catalog's Evaluate method:
CalibratedBinaryClassificationMetrics modelMetrics =
    MlContext.BinaryClassification.Evaluate(
        data: testSetTransform,
        labelColumnName: nameof(RestaurantFeedback.Label),
        scoreColumnName: nameof(RestaurantPrediction.Score));

Console.WriteLine(
    $"Area Under Curve: {modelMetrics.AreaUnderRocCurve:P2}{Environment.NewLine}" +
    $"Area Under Precision Recall Curve: {modelMetrics.AreaUnderPrecisionRecallCurve:P2}{Environment.NewLine}" +
    $"Accuracy: {modelMetrics.Accuracy:P2}{Environment.NewLine}" +
    $"F1Score: {modelMetrics.F1Score:P2}{Environment.NewLine}" +
    $"Positive Recall: {modelMetrics.PositiveRecall:#.##}{Environment.NewLine}" +
    $"Negative Recall: {modelMetrics.NegativeRecall:#.##}{Environment.NewLine}");

This method allows us to generate model metrics. We then print the main metrics using the trained model with the test set. We will dive into these properties specifically in the Evaluating the Model section of this chapter.

The Predictor class

The Predictor class, as noted earlier, is the class that provides prediction support in our project. The idea behind this method is to provide a simple interface to run the model, given the relatively simple input. In future chapters, we will be expanding this method structure to support more complex integrations, such as those hosted in a web application:

  1. Akin to what was done in the Trainer class, we verify that the model exists prior to reading it:
if (!File.Exists(ModelPath)) {
Console.WriteLine($"Failed to find model at {ModelPath}");

return;
}
  2. Then, we define the ITransformer object:
ITransformer mlModel;

using (var stream = new FileStream(ModelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
mlModel = MlContext.Model.Load(stream, out _);
}

if (mlModel == null)
{
Console.WriteLine("Failed to load model");

return;
}

This object will contain our model once we load it via the Model.Load method. This method can also take a direct file path; however, the stream-based approach lends itself to supporting the non-disk-based approaches that we will use in later chapters.
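For comparison, the file-path overload looks as follows. This is only a sketch of the alternative; it trades the stream's flexibility for brevity:

```csharp
// Alternative: load the model directly from disk, capturing the
// input schema the model was saved with in an out parameter.
DataViewSchema modelSchema;
ITransformer mlModel = MlContext.Model.Load(ModelPath, out modelSchema);
```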

  3. Next, we create a PredictionEngine object given the model we loaded earlier:
var predictionEngine = MlContext.Model.CreatePredictionEngine<RestaurantFeedback, RestaurantPrediction>(mlModel);

We are passing in both TSrc and TDst, in our case for this project, RestaurantFeedback and RestaurantPrediction, respectively.

  4. Then, we call the Predict method on the PredictionEngine class:
var prediction = predictionEngine.Predict(new RestaurantFeedback { Text = inputData });

Because we set TSrc to RestaurantFeedback when we created the object, we have a strongly typed interface to our model. We then create a RestaurantFeedback object with the inputData variable, which contains the sentence we are going to run our model on.

  5. Finally, we display the prediction output along with the probability:
Console.WriteLine(
    $"Based on \"{inputData}\", the feedback is predicted to be:{Environment.NewLine}" +
    $"{(prediction.Prediction ? "Negative" : "Positive")} at a {prediction.Probability:P0} confidence");

The BaseML class

The BaseML class, as discussed earlier, is going to contain the common code between our Trainer and Predictor classes, starting with this chapter. Over the remainder of the book, we will build on top of the BaseML class defined as follows:

using System;
using System.IO;

using chapter02.Common;

using Microsoft.ML;

namespace chapter02.ML.Base
{
public class BaseML
{
protected static string ModelPath =>
Path.Combine(AppContext.BaseDirectory,
Constants.MODEL_FILENAME);

protected readonly MLContext MlContext;

protected BaseML()
{
MlContext = new MLContext(2020);
}
}
}

For all ML.NET applications, in both training and predictions, an MLContext object is required. Initializing the object with a specific seed value creates more consistent results during the testing component. Once a model is loaded, the seed value (or lack thereof) does not affect the output.

The Program class

Those of you who have created console applications should be familiar with the Program class and the Main method within. We will follow this structure for other console-based applications throughout the remainder of the book. The following code block contains the program class from which the application will begin execution:

using System;

using chapter02.ML;

namespace chapter02
{
class Program
{
static void Main(string[] args)
{
if (args.Length != 2)
{
Console.WriteLine(
    $"Invalid arguments passed in, exiting.{Environment.NewLine}" +
    $"{Environment.NewLine}Usage:{Environment.NewLine}" +
    $"predict <sentence of text to predict against>{Environment.NewLine}" +
    $"or{Environment.NewLine}" +
    $"train <path to training data file>{Environment.NewLine}");

return;
}

switch (args[0])
{
case "predict":
new Predictor().Predict(args[1]);
break;
case "train":
new Trainer().Train(args[1]);
break;
default:
Console.WriteLine($"{args[0]} is an invalid option");
break;
}
}
}
}

This constitutes a fairly straightforward Main method implementation for those familiar with parsing command-line arguments. A simple two-argument approach is used, as the help text indicates.

When executing a more complex command-line application that takes in several arguments (optional and required), Microsoft has provided a simple-to-use NuGet package, which is available here: https://github.com/dotnet/command-line-api

Running the example

To run both the training and prediction, simply build the project and then pass in the appropriate data.

For training, you can use the included sampledata.csv file or create your own. We will do this by opening a PowerShell window and passing in the relative path:

.\chapter02.exe train ..\..\..\Data\sampledata.csv
Area Under Curve: 100.00%
Area Under Precision Recall Curve: 100.00%
Accuracy: 100.00%
F1Score: 100.00%
Positive Recall: 1
Negative Recall: 1

Once the model is built, you can run the prediction as follows:

.\chapter02.exe predict "bad"
Based on "bad", the feedback is predicted to be:
Negative at a 64% confidence

Feel free to try various phrases to test the efficacy of the model, and congratulations on training your first model!

Evaluating the model

As you saw when running the trainer component of the sample project, there are various elements of model evaluation. For each model type, there are different metrics to look at when analyzing the performance of a model.

In binary classification models like the one found in the example project, the following properties are exposed in the CalibratedBinaryClassificationMetrics object that we obtained after calling the Evaluate method. However, first, we need to define the four prediction types in a binary classification:

  • True negative: Properly classified as negative
  • True positive: Properly classified as positive
  • False negative: Improperly classified as negative
  • False positive: Improperly classified as positive

The first metric to understand is Accuracy. As the name implies, accuracy is one of the most commonly used metrics when evaluating a model. This metric is calculated simply as the ratio of correctly classified predictions to total classifications.

The next metric to understand is Precision. Precision is defined as the proportion of true positives over all the positive results returned by the model. For example, a precision of 1 means there were no false positives, an ideal scenario. A false positive is classifying something as positive when it should be classified as negative, as mentioned previously. A common example of a false positive is misclassifying a file as malicious when it is actually benign.

The next metric to understand is Recall. Recall is the fraction of all actual positives that the model correctly returned. For example, a recall of 1 means there were no false negatives, another ideal scenario. A false negative is classifying something as negative when it should have been classified as positive.

The next metric to understand is the F-score, which utilizes both precision and recall, producing a weighted average based on the false positives and false negatives. F-scores give another perspective on the performance of the model compared to simply looking at accuracy. The range of values is between 0 and 1, with an ideal value of 1.
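To make these four definitions concrete, the following plain-C# sketch computes each metric from a small set of confusion-matrix counts. The counts themselves are made up for illustration; only the formulas matter here:

```csharp
using System;

class MetricsSketch
{
    static void Main()
    {
        // Hypothetical confusion-matrix counts for a binary classifier.
        double tp = 8, fp = 2, tn = 9, fn = 1;

        double accuracy  = (tp + tn) / (tp + tn + fp + fn); // correct predictions over all predictions
        double precision = tp / (tp + fp);                  // of everything flagged positive, how much was right
        double recall    = tp / (tp + fn);                  // of everything actually positive, how much was found
        double f1        = 2 * precision * recall / (precision + recall);

        Console.WriteLine($"Accuracy: {accuracy:F2}");   // ~0.85
        Console.WriteLine($"Precision: {precision:F2}"); // ~0.80
        Console.WriteLine($"Recall: {recall:F2}");       // ~0.89
        Console.WriteLine($"F1Score: {f1:F2}");          // ~0.84
    }
}
```

Notice how the F-score sits between precision and recall, penalizing a model that scores well on one but poorly on the other.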

Area Under the Curve, also referred to as AUC, is, as the name implies, the area under the curve plotted with the true positive rate on the y-axis and the false positive rate on the x-axis. For classifiers such as the model that we trained earlier in this chapter, as you saw, this returns values between 0 and 1.

Lastly, Average Log Loss and Training Log Loss are both used to further explain the performance of the model. The average log loss is effectively expressing the penalty for wrong results in a single number by taking the difference between the true classification and the one the model predicts. Training log loss represents the uncertainty of the model using probability versus the known values. As you train your model, you will look to have a low number (lower numbers are better).
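The per-prediction penalty behind log loss can be sketched in plain C# as follows. This shows only the standard cross-entropy formula for a single prediction; the exact averaging ML.NET performs across a dataset may differ in detail:

```csharp
using System;

class LogLossSketch
{
    // Cross-entropy penalty for one prediction: label is 1 (true) or 0 (false),
    // probability is the model's confidence that the label is 1.
    static double LogLoss(int label, double probability) =>
        -(label * Math.Log(probability) + (1 - label) * Math.Log(1 - probability));

    static void Main()
    {
        // A confident, correct prediction incurs a small penalty...
        Console.WriteLine(LogLoss(1, 0.95)); // ~0.051
        // ...while a confident, wrong prediction is punished heavily.
        Console.WriteLine(LogLoss(1, 0.05)); // ~2.996
    }
}
```

The asymmetry is the point: the loss grows without bound as a confidently wrong probability approaches 0 or 1, which is why lower average values indicate a better-calibrated model.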

As regards the other model types, we will deep dive into how to evaluate them in their respective chapters, where we will cover regression and clustering metrics.

Summary

Over the course of this chapter, we have set up our development environment and learned about the proper organization of files going forward. We also created our first ML.NET application in addition to training, evaluating, and running predictions against a new model. Lastly, we explored how to evaluate a model and what the various properties mean.

In the next chapter, we will deep dive into logistic regression algorithms.
