Clustering Model

With classification models behind us, it is now time to dive into clustering models. Currently, ML.NET offers only one clustering algorithm, k-means. In this chapter, we will dive into k-means clustering, as well as the applications best suited to a clustering algorithm. In addition, we will build a new ML.NET clustering application that determines the type of a file simply by looking at its content. Finally, we will explore how to evaluate a k-means clustering model with the properties that ML.NET exposes.

In this chapter, we will cover the following topics:

  • Breaking down the k-means algorithm
  • Creating the clustering application
  • Evaluating a k-means model

Breaking down the k-means algorithm

As mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, k-means clustering, by definition, is an unsupervised learning algorithm. This means that data is grouped into clusters based on the data provided to the model for training. In this section, we will dive into a number of use cases for clustering and the k-means trainer.

Use cases for clustering

Clustering, as you may be beginning to realize, has numerous applications in which similar data points are grouped together without any pre-existing labels.

Some of its potential applications include the following:

  • Natural disaster tracking such as earthquakes or hurricanes and creating clusters of high-danger zones
  • Book or document grouping based on the authors, subject matter, and sources
  • Grouping customer data into targeted marketing predictions
  • Search result grouping of similar results that other users found useful

In addition, it has numerous other applications, such as predicting malware families, or in the medical field for cancer research.

Diving into the k-means trainer

The k-means trainer used in ML.NET is based on the Yinyang method as opposed to a classic k-means implementation. Like some of the trainers we have looked at in previous chapters, all of the input must be of the Float type. In addition, all input must be normalized into a single feature vector. Fortunately, the k-means trainer is included in the main ML.NET NuGet package; therefore, no additional dependencies are required.

To learn more about the Yinyang implementation, Microsoft Research published a white paper here: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf.

Take a look at the following diagram, showing three clusters and a data point:

In clustering, each of these clusters represents a grouping of similar data points. With k-means clustering (and other clustering algorithms), the distance between a data point and each cluster determines which cluster the model will return. K-means clustering, specifically, uses the center point of each cluster (also called a centroid) and calculates the distance from the data point to each centroid. The smallest of these distances indicates the predicted cluster.
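To make the assignment step concrete, the following is a minimal sketch in plain C# (independent of ML.NET's internal implementation; the class and method names here are hypothetical) that computes the squared Euclidean distance from a data point to each centroid and returns the index of the nearest one:

using System;
using System.Linq;

public static class KMeansAssignment
{
    // Returns the index of the centroid closest to the given data point
    public static int PredictCluster(float[] point, float[][] centroids)
    {
        // Squared Euclidean distance from the point to each centroid
        var distances = centroids
            .Select(c => c.Zip(point, (a, b) => (a - b) * (a - b)).Sum())
            .ToArray();

        // The smallest distance determines the predicted cluster
        return Array.IndexOf(distances, distances.Min());
    }
}

For example, given centroids at (0, 0) and (5, 5), the point (1, 1) would be assigned to the first cluster, since its squared distance of 2 is smaller than 32.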

For the k-means trainer, it can be initialized in one of three ways. One way is to utilize a randomized initialization; as you have probably guessed, this can lead to randomized prediction results. Another way is to utilize k-means++, which strives to produce a clustering that is provably within O(log K) of the optimal solution. Lastly, k-means||, the default method in ML.NET, uses a parallel method to reduce the number of passes required for initialization.

For more information on k-means||, you can refer to a paper published by Stanford, which explains it in detail: https://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf.

For more information on k-means++, you can refer to a paper published by Stanford in 2006, explaining it in detail: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf.
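If you want to experiment with these initialization methods yourself, the trainer can also be constructed from an options object rather than the simple overload used later in this chapter. The following is a minimal sketch; the KMeansTrainer.Options property and enumeration member names reflect the ML.NET source at the time of writing, so verify them against the package version you are using:

using Microsoft.ML;
using Microsoft.ML.Trainers;

var mlContext = new MLContext();

// Configure the k-means trainer explicitly instead of relying on defaults
var options = new KMeansTrainer.Options
{
    FeatureColumnName = "Features",
    NumberOfClusters = 3,
    // Try Random or KMeansYinyang (the parallel default) for comparison
    InitializationAlgorithm = KMeansTrainer.InitializationAlgorithm.KMeansPlusPlus
};

var trainer = mlContext.Clustering.Trainers.KMeans(options);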

We will demonstrate this trainer in the example application in the next section.

Creating the clustering application

As mentioned earlier, the application we will be creating is a file type classifier. Given a set of attributes statically extracted from a file, the prediction will return whether it is a document, an executable, or a script. For those of you who have used the Linux file command, this is a simplified, machine learning-based version of it. The attributes included in this example aren't the definitive list of attributes, nor should they be used as-is in a production environment; however, you could use this as a starting point for creating a true ML-based replacement for the Linux file command.

As with previous chapters, the completed project code, sample dataset, and project files can be downloaded here: https://github.com/PacktPublishing/Hands-On-Machine-Learning-With-ML.NET/tree/master/chapter05.

Exploring the project architecture

Building on the project architecture and code we created in previous chapters, the major change architecturally is in the feature extraction being done on both the training and test sets.

Here, you will find the Visual Studio Solution Explorer view of the project. The new additions to the solution are the FileTypes, FileData, and FilePrediction files that we will review later on in this section:

The sampledata.csv file contains 80 rows of feature data extracted from random files on my system, comprising 30 Windows executables, 20 PowerShell scripts, and 20 Word documents. Feel free to adjust the data to fit your own observations or to adjust the trained model. Here is a snippet of the data:

0,1,1,0
0,1,1,0
0,1,1,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
1,1,0,1
1,1,0,1
1,1,0,1
1,1,0,1

Each of these rows contains the value for the properties in the newly created FileData class that we will review later on in this chapter.

In addition to this, we added the testdata.csv file, which contains additional data points to test the newly trained model against and evaluate. The breakdown was even with 10 Windows executables, 10 PowerShell scripts, and 10 Word documents. Here is a snippet of the data inside testdata.csv:

0,1,1,0
0,1,1,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
2,0,0,0
1,1,0,1

Diving into the code

For this application, as noted in the previous section, we are building on top of the work completed in Chapter 4, Classification Model. For this deep dive, we are going to focus solely on the code that was changed for this application.

Classes that were changed or added are as follows:

  • Constants
  • BaseML
  • FileTypes
  • FileData
  • FileTypePrediction
  • FeatureExtractor
  • Predictor
  • Trainer
  • Program

The Constants class

The Constants class has been changed to save the model to chapter5.mdl and to support a TEST_DATA constant for the feature-extracted testdata.csv file. The following code block reflects these changes:

namespace chapter05.Common
{
    public class Constants
    {
        public const string MODEL_FILENAME = "chapter5.mdl";

        public const string SAMPLE_DATA = "sampledata.csv";

        public const string TEST_DATA = "testdata.csv";
    }
}

The BaseML class

The sole change in the BaseML class is the addition of the FEATURES variable. By using a variable here, we can remove the use of a magic string in our Trainer class (we will discuss this later in this section):

protected const string FEATURES = "Features";

The FileTypes enumeration

The FileTypes enumeration provides a strongly typed mapping between our classifications and their numeric values. As we discovered in our previous examples, utilizing an enumeration as opposed to magic or constant values provides better flexibility, as shown here and throughout the remaining classes:

namespace chapter05.Enums
{
    public enum FileTypes
    {
        Executable = 0,
        Document = 1,
        Script = 2
    }
}

The FileData class

The FileData class is the container class for the data used to both train our model and make predictions:

  1. First, we add constant values for True and False since k-means requires floating-point values:
public class FileData
{
    private const float TRUE = 1.0f;
    private const float FALSE = 0.0f;
  2. Next, we create a constructor that supports both our prediction and training. We optionally pass in the filename during training to provide a label; in this case, ps1, exe, and doc for scripts, executables, and documents, respectively. We also call helper methods to determine whether the file is binary, and whether it starts with MZ or PK:
public FileData(Span<byte> data, string fileName = null)
{
    // Used for training purposes only
    if (!string.IsNullOrEmpty(fileName))
    {
        if (fileName.Contains("ps1"))
        {
            Label = (float) FileTypes.Script;
        }
        else if (fileName.Contains("exe"))
        {
            Label = (float) FileTypes.Executable;
        }
        else if (fileName.Contains("doc"))
        {
            Label = (float) FileTypes.Document;
        }
    }

    IsBinary = HasBinaryContent(data) ? TRUE : FALSE;

    IsMZHeader = HasHeaderBytes(data.Slice(0, 2), "MZ") ? TRUE : FALSE;

    IsPKHeader = HasHeaderBytes(data.Slice(0, 2), "PK") ? TRUE : FALSE;
}
MZ and PK are considered to be magic numbers of Windows executables and modern Microsoft Office files. Magic numbers are unique byte strings that are found at the beginning of every file. In this case, both are simply two bytes. When performing analysis on files, making quick determinations is crucial for performance. For the keen reader, PK is also the magic number for ZIP. Modern Microsoft Office documents are actually ZIP archives. For the sake of simplicity in this example, PK is used as opposed to performing an additional level of detection.
  3. Next, we also add an additional constructor to support explicitly setting the values for a known file type. We will deep dive into the purpose of this addition later on in this section:
/// <summary>
/// Used for mapping cluster ids to results only
/// </summary>
/// <param name="fileType"></param>
public FileData(FileTypes fileType)
{
    Label = (float)fileType;

    switch (fileType)
    {
        case FileTypes.Document:
            IsBinary = TRUE;
            IsMZHeader = FALSE;
            IsPKHeader = TRUE;
            break;
        case FileTypes.Executable:
            IsBinary = TRUE;
            IsMZHeader = TRUE;
            IsPKHeader = FALSE;
            break;
        case FileTypes.Script:
            IsBinary = FALSE;
            IsMZHeader = FALSE;
            IsPKHeader = FALSE;
            break;
    }
}
  4. Next, we implement our two helper methods. The first, HasBinaryContent, as the name implies, takes the raw binary data and searches for non-text characters to ensure it is a binary file. Secondly, we define HasHeaderBytes; this method takes an array of bytes, converts it into a UTF8 string, and then checks to see whether the string matches the string passed in:
// Returns true if the content contains control characters
// other than carriage returns and line feeds
private static bool HasBinaryContent(Span<byte> fileContent) =>
    System.Text.Encoding.UTF8.GetString(fileContent.ToArray())
        .Any(a => char.IsControl(a) && a != '\r' && a != '\n');

// Returns true if the data decodes to the given magic number string
private static bool HasHeaderBytes(Span<byte> data, string match) =>
    System.Text.Encoding.UTF8.GetString(data) == match;
  5. Next, we add the properties used for prediction, training, and testing:
[ColumnName("Label")]
public float Label { get; set; }

public float IsBinary { get; set; }

public float IsMZHeader { get; set; }

public float IsPKHeader { get; set; }
  6. Lastly, we override the ToString method to be used with the feature extraction:
public override string ToString() => $"{Label},{IsBinary},{IsMZHeader},{IsPKHeader}";

The FileTypePrediction class

The FileTypePrediction class contains the properties mapped to our prediction output. In k-means clustering, the PredictedClusterId property stores the closest cluster found. In addition to this, the Distances array contains the distances from the data point to each of the clusters:

using Microsoft.ML.Data;

namespace chapter05.ML.Objects
{
    public class FileTypePrediction
    {
        [ColumnName("PredictedLabel")]
        public uint PredictedClusterId;

        [ColumnName("Score")]
        public float[] Distances;
    }
}

The FeatureExtractor class

The FeatureExtractor class that we utilized in the logistic regression example from Chapter 3, Regression Model, has been adapted to support both test and training data extraction:

  1. First, we generalize the extraction to take the folder path and the output file. As noted earlier, we also pass in the filename, allowing the labeling to occur cleanly inside the FileData class:
private void ExtractFolder(string folderPath, string outputFile)
{
    if (!Directory.Exists(folderPath))
    {
        Console.WriteLine($"{folderPath} does not exist");

        return;
    }

    var files = Directory.GetFiles(folderPath);

    using (var streamWriter =
        new StreamWriter(Path.Combine(AppContext.BaseDirectory,
            $"../../../Data/{outputFile}")))
    {
        foreach (var file in files)
        {
            var extractedData = new FileData(File.ReadAllBytes(file), file);

            streamWriter.WriteLine(extractedData.ToString());
        }
    }

    Console.WriteLine($"Extracted {files.Length} to {outputFile}");
}
  2. Lastly, we take the two parameters from the command line (called from the Program class) and simply call the preceding method once for each dataset:
public void Extract(string trainingPath, string testPath)
{
    ExtractFolder(trainingPath, Constants.SAMPLE_DATA);
    ExtractFolder(testPath, Constants.TEST_DATA);
}

The Predictor class

There are a couple of changes in this class to handle the file type prediction scenario:

  1. First, we add a helper method, GetClusterToMap, which maps known values to the prediction clusters. Note the use of Enum.GetValues here; as you add more file types, this method does not need to be modified:
private Dictionary<uint, FileTypes> GetClusterToMap(PredictionEngineBase<FileData, FileTypePrediction> predictionEngine)
{
    var map = new Dictionary<uint, FileTypes>();

    var fileTypes = Enum.GetValues(typeof(FileTypes)).Cast<FileTypes>();

    foreach (var fileType in fileTypes)
    {
        var fileData = new FileData(fileType);

        var prediction = predictionEngine.Predict(fileData);

        map.Add(prediction.PredictedClusterId, fileType);
    }

    return map;
}
  2. Next, we pass the FileData and FileTypePrediction types into the CreatePredictionEngine method to create our prediction engine. Then, we read the file in as a binary file and pass its bytes into the constructor of FileData prior to running the prediction and initializing the mapping:
var predictionEngine = MlContext.Model.CreatePredictionEngine<FileData, FileTypePrediction>(mlModel);

var fileData = new FileData(File.ReadAllBytes(inputDataFile));

var prediction = predictionEngine.Predict(fileData);

var mapping = GetClusterToMap(predictionEngine);
  3. Lastly, we need to adjust the output to match the output that a k-means prediction returns, including the Euclidean distances:
Console.WriteLine(
    $"Based on input file: {inputDataFile}{Environment.NewLine}{Environment.NewLine}" +
    $"Feature Extraction: {fileData}{Environment.NewLine}{Environment.NewLine}" +
    $"The file is predicted to be a {mapping[prediction.PredictedClusterId]}{Environment.NewLine}");

Console.WriteLine("Distances from all clusters:");

for (uint x = 0; x < prediction.Distances.Length; x++)
{
    Console.WriteLine($"{mapping[x + 1]}: {prediction.Distances[x]}");
}

The Trainer class

Inside the Trainer class, several modifications need to be made to support k-means clustering:

  1. The first change is the addition of a GetDataView helper method, which builds the IDataView object from the columns previously defined in the FileData class:
private IDataView GetDataView(string fileName)
{
    return MlContext.Data.LoadFromTextFile(path: fileName,
        columns: new[]
        {
            new TextLoader.Column(nameof(FileData.Label), DataKind.Single, 0),
            new TextLoader.Column(nameof(FileData.IsBinary), DataKind.Single, 1),
            new TextLoader.Column(nameof(FileData.IsMZHeader), DataKind.Single, 2),
            new TextLoader.Column(nameof(FileData.IsPKHeader), DataKind.Single, 3)
        },
        hasHeader: false,
        separatorChar: ',');
}
  2. We then build the data process pipeline, transforming the columns into a single Features column:
var trainingDataView = GetDataView(trainingFileName);

var dataProcessPipeline = MlContext.Transforms.Concatenate(
    FEATURES,
    nameof(FileData.IsBinary),
    nameof(FileData.IsMZHeader),
    nameof(FileData.IsPKHeader));
  3. We can then create the k-means trainer with a cluster size of 3 and create the model:
var trainer = MlContext.Clustering.Trainers.KMeans(
    featureColumnName: FEATURES, numberOfClusters: 3);
var trainingPipeline = dataProcessPipeline.Append(trainer);
var trainedModel = trainingPipeline.Fit(trainingDataView);

MlContext.Model.Save(trainedModel, trainingDataView.Schema, ModelPath);
The default value for the number of clusters is 5. An interesting experiment to run based either on this dataset or one modified by you is to see how the prediction results change by adjusting this value.
  4. Now we evaluate the model we just trained using the testing dataset:
var testingDataView = GetDataView(testingFileName);

IDataView testDataView = trainedModel.Transform(testingDataView);

ClusteringMetrics modelMetrics = MlContext.Clustering.Evaluate(
    data: testDataView,
    labelColumnName: "Label",
    scoreColumnName: "Score",
    featureColumnName: FEATURES);
  5. Finally, we output all of the clustering metrics, each of which we will detail in the next section:
Console.WriteLine($"Average Distance: {modelMetrics.AverageDistance}");
Console.WriteLine($"Davies Bouldin Index: {modelMetrics.DaviesBouldinIndex}");
Console.WriteLine($"Normalized Mutual Information: {modelMetrics.NormalizedMutualInformation}");

The Program class

The Program class, as mentioned in previous chapters, is the main entry point for our application. The only change in the Program class is the help text, which now indicates that the extract command accepts a test folder path:

if (args.Length < 2)
{
    Console.WriteLine($"Invalid arguments passed in, exiting.{Environment.NewLine}{Environment.NewLine}Usage:{Environment.NewLine}" +
        $"predict <path to input file>{Environment.NewLine}" +
        $"or {Environment.NewLine}" +
        $"train <path to training data file> <path to test data file>{Environment.NewLine}" +
        $"or {Environment.NewLine}" +
        $"extract <path to training folder> <path to test folder>{Environment.NewLine}");

    return;
}

Finally, we modify the switch/case statement to support the additional parameter to the extract method to support both the training and test datasets:

switch (args[0])
{
    case "extract":
        new FeatureExtractor().Extract(args[1], args[2]);
        break;
    case "predict":
        new Predictor().Predict(args[1]);
        break;
    case "train":
        new Trainer().Train(args[1], args[2]);
        break;
    default:
        Console.WriteLine($"{args[0]} is an invalid option");
        break;
}

Running the application

To run the application, the process is nearly identical to Chapter 3, Regression Model's example application, with the addition of passing in the test dataset when training:

  1. To run the feature extraction on the command line, as we did in previous chapters, simply pass in the following command (assuming you have added two sets of files, one each for your training and test sets):
PS chapter05\bin\Debug\netcoreapp3.0> .\chapter05.exe extract ..\..\..\TrainingData ..\..\..\TestData
Extracted 80 to sampledata.csv
Extracted 30 to testdata.csv
Included in the code repository are two pre-feature extracted files (sampledata.csv and testdata.csv) to allow you to train a model without performing your own feature extraction.  If you would like to perform your own feature extraction, create a TestData and TrainingData folder.  Populate these folders with a sampling of PowerShell (PS1), Windows Executables (EXE) and Microsoft Word documents (DOCX).
  2. After extracting the data, we must then train the model by passing in the newly created sampledata.csv and testdata.csv files:
PS chapter05\bin\Debug\netcoreapp3.0> .\chapter05.exe train ..\..\..\Data\sampledata.csv ..\..\..\Data\testdata.csv
Average Distance: 0
Davies Bouldin Index: 0
Normalized Mutual Information: 1
  3. To run a prediction with the trained model, simply pass the path of a file to the built application (in this case, the compiled chapter05.exe is used) and the predicted output will show:
PS chapter05\bin\Debug\netcoreapp3.0> .\chapter05.exe predict .\chapter05.exe
Based on input file: .\chapter05.exe

Feature Extraction: 0,1,1,0

The file is predicted to be a Executable

Distances from all clusters:
Executable: 0
Script: 2
Document: 2

Note the expanded output, which includes several metric data points; we will go through what each of these means at the end of this chapter.

Feel free to modify the values and see how the prediction changes based on the dataset that the model was trained on. A few areas of experimentation from this point could include the following:

  • Adding some additional features to increase the prediction accuracy
  • Adding additional file types to the clusters such as video or audio
  • Adding a new range of files to generate new sample and test data

Evaluating a k-means model

As discussed in previous chapters, evaluating a model is a critical part of the overall model-building process. A poorly trained model will only provide inaccurate predictions. Fortunately, ML.NET provides many popular attributes to calculate model accuracy based on a test set at the time of training to give you an idea of how well your model will perform in a production environment. 

In ML.NET, as noted in the example application, the ClusteringMetrics class exposes three properties:

  • Average distance
  • The Davies-Bouldin index
  • Normalized mutual information

In the next sections, we will break down how these values are calculated and the ideal values to look for.

Average distance

Average distance, also referred to as the average score, is the average distance from a data point to the center of its assigned cluster. The value, of type double, will decrease as the number of clusters increases, effectively creating clusters for the edge cases. In addition to this, a value of 0, such as the one found in our example, is possible when your features create distinct clusters. This means that, if you find yourself seeing poor prediction performance, you may want to increase the number of clusters.
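As a rough formula (assuming Euclidean distance $d$, $N$ test points, and writing $c_{k(i)}$ for the centroid of the cluster assigned to point $x_i$), the metric is the mean distance over all test points:

$$\text{Average Distance} = \frac{1}{N} \sum_{i=1}^{N} d\left(x_i, c_{k(i)}\right)$$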

The Davies-Bouldin Index

The Davies-Bouldin Index is another measure of clustering quality. Specifically, it measures the ratio of within-cluster scatter to between-cluster separation. The value, of type double, is non-negative, with values closer to 0 being ideal (as was the case in our example).
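For reference, the standard definition of the index for $k$ clusters is as follows, where $\sigma_i$ is the average distance from the points in cluster $i$ to its centroid $c_i$, and $d(c_i, c_j)$ is the distance between two centroids; the index is low when clusters are compact and far apart:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$$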

For more details on the Davies-Bouldin Index, specifically the math behind the algorithm, a good resource can be found here: https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index.

Normalized mutual information

The normalized mutual information metric is used to measure the mutual dependence between the true labels and the cluster assignments that the model produces.

The range of values is from 0 to 1 (the type is double); a value closer to or equal to 1 is ideal, akin to the model we trained earlier in this chapter.
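Several normalized variants exist (see the link that follows); one common form, shown here as a reference rather than as ML.NET's exact implementation, divides the mutual information $I(Y; C)$ between the true labels $Y$ and the cluster assignments $C$ by the average of their entropies:

$$NMI(Y, C) = \frac{2\, I(Y; C)}{H(Y) + H(C)}$$

A value of 1 means the clusters reproduce the labels exactly, while a value of 0 means the assignments are independent of the labels.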

For more details on normalized mutual information along with the math behind the algorithm, please read http://en.wikipedia.org/wiki/Mutual_information#Normalized_variants.

Summary

Over the course of this chapter, we dove into ML.NET's clustering support via the k-means clustering algorithm. We also created and trained our first clustering application, using k-means to predict the type of a file. Lastly, we dove into how to evaluate a k-means clustering model and the various properties that ML.NET exposes to achieve a proper evaluation.

In the next chapter, we will deep dive into anomaly detection algorithms with ML.NET by creating a login anomaly predictor.
