Using ML.NET with UWP

Now that we have established how to create a production-grade .NET Core console application, in this chapter we will deep dive into creating a fully functional Windows 10 application with the Universal Windows Platform (UWP) framework. This application will utilize an ML.NET binary classification model to classify web page content as either benign or malicious. In addition, we will explore breaking your code into a component-based architecture, using a .NET Standard library shared between our desktop application and the console application that trains our model. By the end of the chapter, you should have a firm grasp of designing and coding a production-grade UWP desktop application with ML.NET.

The following topics will be covered in this chapter:

  • Breaking down the UWP application
  • Creating the web browser classification application
  • Exploring additional production-application enhancements

Breaking down the UWP architecture

At a high level, UWP provides an easy framework to create rich desktop applications for Windows 10. As with .NET Core, UWP allows the targeting of x86, x64, and Advanced RISC Machine (ARM) architectures. At the time of writing, ARM is not supported by ML.NET. In addition, UWP applications can also be written in JavaScript and HTML.

A typical UWP desktop application includes the following core code elements:

  • Views
  • Models
  • View Models

These components form the common Model-View-ViewModel (MVVM) architecture pattern. In addition to the code components, images and audio are also common, depending on the nature of your application or game.

Similar to mobile apps on the Android and iOS platforms, each app is sandboxed to the specific permissions that you, the developer, request upon installation. Therefore, as you develop your own UWP applications, request only the access your app absolutely requires.

For the example application we will be creating in this chapter, we only require access to the internet as a client, as found in the Capabilities tab labeled Internet (Client), as shown in the following screenshot:

The Internet (Client) and other permissions are defined in the Package.appxmanifest file found in the root of UWP applications, under the Capabilities tab. This file is shown in the Visual Studio Solution Explorer screenshot in the later Exploring the project architecture section.
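For reference, here is a minimal sketch of how that capability appears inside Package.appxmanifest; the element names follow the standard UWP manifest schema, and the surrounding file contains many more elements than shown:

<Capabilities>
    <Capability Name="internetClient" />
</Capabilities>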

To prepare for our deep dive into integrating ML.NET in a UWP application, let's dive into the three core components found in a UWP application.

Views

Views, as we defined in the previous chapter's Blazor discussion, contain the user interface (UI) components of an application. Views in UWP development, as in Windows Presentation Foundation (WPF) and Xamarin.Forms, use the Extensible Application Markup Language (XAML) syntax. Those familiar with modern web development using Bootstrap's Grid pattern will quickly see the parallels as we deep dive into this later in the chapter.

The biggest difference between web development and UWP development is the powerful two-way binding of XAML views when used with the MVVM pattern. As you will see in the deep dive, XAML binding eliminates the manual setting and getting of values in code-behinds, as you might have done previously in Windows Forms or WebForms projects.

For applications using the web approach, HTML would define your View, as with our Blazor project in Chapter 9, Using ML.NET with ASP.NET Core.

Models

Models provide the data container between the View and the View Model. Think of the Model purely as the transport for the data between the View and the View Model. For example, if you had a movie list, a List collection of MovieItems would be defined in your MovieListingModel class. This container class would be instantiated and populated in the View Model, to be in turn bound to your View.
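For illustration, a minimal sketch of that hypothetical container might look as follows (the MovieItem properties shown are assumptions for the example, not from this chapter's code):

using System.Collections.Generic;

public class MovieItem
{
    public string Title { get; set; }

    public string Genre { get; set; }
}

public class MovieListingModel
{
    public List<MovieItem> MovieItems { get; set; } = new List<MovieItem>();
}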

View Models

View Models provide the business-logic layer for populating your Model, and thereby your View indirectly. As mentioned previously, the MVVM binding provided in UWP development eases the management of trigger points to ensure your UI layer is up to date. This is achieved by implementing the INotifyPropertyChanged interface in our View Model. For each property that we want to bind to our UI, we simply call OnPropertyChanged. The power behind this is that you can have complex forms with triggers within the setters of other properties, without conditionals and endless code to handle the complexity.
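The following is a compact sketch of that pattern, using a hypothetical StatusText property; the full implementation used in this chapter appears later, in the MainPageViewModel class:

using System.ComponentModel;
using System.Runtime.CompilerServices;

public class SampleViewModel : INotifyPropertyChanged
{
    private string _statusText;

    public string StatusText
    {
        get => _statusText;

        set
        {
            _statusText = value;

            // Any XAML element bound to StatusText refreshes automatically
            OnPropertyChanged();
        }
    }

    public event PropertyChangedEventHandler PropertyChanged;

    protected void OnPropertyChanged([CallerMemberName] string propertyName = null) =>
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
}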

If you want to deep dive further into UWP development, Channel9 from Microsoft has a series called Windows 10 Development for Absolute Beginners that covers all of the main aspects of UWP development: https://channel9.msdn.com/Series/Windows-10-development-for-absolute-beginners.

Creating the web browser classification application

As mentioned earlier, the application we will be creating is a web browser classification application. Using the knowledge gained in the logistic classification chapter, we will use the SdcaLogisticRegression algorithm to take the text content of a web page, featurize the text, and provide a confidence level of maliciousness. We will integrate this technique into a Windows 10 UWP application that mimics a web browser, effectively running the model on navigation to a page and making a determination as to whether the page is malicious. If it is found to be malicious, we redirect to a warning page. While in a real-world scenario this might prove too slow to run on every page, the benefits of a highly secured web browser, depending on the environment's requirements, might far outweigh the slight overhead that running our model incurs.

As with previous chapters, the completed project code, sample dataset, and project files can be downloaded from https://github.com/PacktPublishing/Hands-On-Machine-Learning-With-ML.NET/tree/master/chapter10.

Exploring the project architecture

In this chapter, we will dive into a native Windows 10 desktop application. As mentioned in the first section of this chapter, we will use the UWP framework to create our application.

No additional ML.NET NuGet packages are needed for this sample application. However, we will be using the HtmlAgilityPack NuGet package to provide a quick method to extract the text from a given web page. At the time of this writing, version 1.11.18 was the latest version and is the version used in this example.

In the following screenshot, you will find the Visual Studio Solution Explorer view of the solution. Given that this example comprises three separate projects (more akin to a production scenario), the number of both new and significantly modified files is quite large. We will review each of the new files shown in the solution screenshot in detail later in this section:

The sampledata.csv file (found in the Data folder in the code repository) contains eight rows of extracted text from URLs found in the trainingURLList.csv file (also found in the Data folder). Feel free to adjust the URL list file to test websites you frequently visit. Here is a snippet of one such row:

False|BingImagesVideosMapsNewsShoppingMSNOfficeOutlookWordExcelPowerPointOneNoteSwayOneDriveCalendarPeopleSigninRewardsDownloadtoday’simagePlaytoday'squizTheTajMahalinAgraforIndia'sRepublicDay©MicheleFalzone/plainpictureIt'sRepublicDayinIndiaImageofthedayJan26,2020It'sRepublicDayinIndia©MicheleFalzone/plainpictureForIndia's70thRepublicDay

In addition to the sampledata.csv file, we also added the testdata.csv file, which contains additional data points to test the newly trained model against and to evaluate it with. Here is a snippet of a sample row of the data inside of testdata.csv:

True|USATODAY:LatestWorldandUSNews-USATODAY.comSUBSCRIBENOWtogethomedeliveryNewsSportsEntertainmentLifeMoneyTechTravelOpinionWeatherIconHumidityPrecip.WindsOpensettingsSettingsEnterCityNameCancelSetClosesettingsFullForecastCrosswordsInvestigationsAppsBest-SellingBooksCartoons

Due to the size of the example project, we will dive into the code for each of the different components in the following order, before running the applications at the end of this section:

  • .NET Standard Library for common code between the two applications
  • Windows 10 UWP browser application
  • .NET Core console application for feature extraction and training

Diving into the library

Due to the nature of this application, and that of production applications where there are multiple platforms and/or ways to execute shared code, a library is used in this chapter's example application. The benefit of using a library is that all common code can reside in a portable and dependency-free manner. Expanding the functionality of this sample application to include other platforms, such as Linux or Mac applications with Xamarin, would be a much easier lift than having the code duplicated or kept in the actual applications.

Classes and enumerations that were changed or added in the library are as follows:

  • Constants
  • WebPageResponseItem
  • Converters
  • ExtensionMethods
  • WebPageInputItem
  • WebPagePredictionItem
  • WebContentFeatureExtractor
  • WebContentPredictor
  • WebContentTrainer

The Classification, TrainerActions, and BaseML classes remain unmodified from Chapter 9, Using ML.NET with ASP.NET Core.

The Constants class

The Constants class, as used in all of our examples to this point, is the common class that contains the constant values used in our library, trainer, and UWP applications. For this chapter, the MODEL_NAME and MALICIOUS_THRESHOLD properties were added to hold our model's name and an arbitrary threshold for deciding when to classify our prediction as malicious, respectively. If you find your model too sensitive, try adjusting this threshold, like this:

public static class Constants
{
    public const string MODEL_NAME = "webcontentclassifier.mdl";

    public const string SAMPLE_DATA = "sampledata.csv";

    public const string TEST_DATA = "testdata.csv";

    public const double MALICIOUS_THRESHOLD = .5;
}

The WebPageResponseItem class

The WebPageResponseItem class is our container class between our predictor and application. This class contains the properties we set after running the predictor and then use to display in our desktop application, as shown in the following code block:

public class WebPageResponseItem
{
    public double Confidence { get; set; }

    public bool IsMalicious { get; set; }

    public string Content { get; set; }

    public string ErrorMessage { get; set; }

    public WebPageResponseItem()
    {
    }

    public WebPageResponseItem(string content)
    {
        Content = content;
    }
}

The Converters class

The Converters class has been adjusted to provide an extension method to convert our container class into the type our model expects. In this example, we have the Content property, which simply maps to the HTMLContent property in the WebPageInputItem class, as follows:

public static WebPageInputItem ToWebPageInputItem(this WebPageResponseItem webPage)
{
    return new WebPageInputItem
    {
        HTMLContent = webPage.Content
    };
}

The ExtensionMethods class

The ExtensionMethods class, as discussed previously in Chapter 9, Using ML.NET with ASP.NET Core, has been expanded to include the ToWebContentString extension method. In this method, we pass in the URL from which we want to retrieve the web content. Using the previously mentioned HtmlAgilityPack, we create an HtmlWeb object and call the Load method, prior to iterating through the Document Object Model (DOM). Given that most websites have extensive scripts and style sheets, our purpose in this example is just to examine the text in the page, hence the filtering of script and style nodes in our code. Once the nodes have been traversed and added to a StringBuilder object, we return that object's string representation, as shown in the following code block:

public static string ToWebContentString(this string url)
{
    var web = new HtmlWeb();

    var htmlDoc = web.Load(url);

    var sb = new StringBuilder();

    htmlDoc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList().ForEach(n => n.Remove());

    foreach (var node in htmlDoc.DocumentNode.SelectNodes(
        "//text()[normalize-space(.) != '']"))
    {
        sb.Append(node.InnerText.Trim().Replace(" ", ""));
    }

    return sb.ToString();
}

The WebPageInputItem class

The WebPageInputItem class is our input object to our model, containing both the label and extracted content of our web page, as shown in the following code block:

public class WebPageInputItem
{
    [LoadColumn(0), ColumnName("Label")]
    public bool Label { get; set; }

    [LoadColumn(1)]
    public string HTMLContent { get; set; }
}

The WebPagePredictionItem class

The WebPagePredictionItem class is the output object from our model, containing the prediction of whether a web page is malicious or benign, in addition to a probability score that the prediction is accurate, and the Score value used during the evaluation phase of our model creation, as shown in the following code block:

public class WebPagePredictionItem
{
    public bool Prediction { get; set; }

    public float Probability { get; set; }

    public float Score { get; set; }
}

The WebContentFeatureExtractor class

The WebContentFeatureExtractor class contains our GetContentFile and Extract methods, which operate as follows:

  1. First, our GetContentFile method takes the inputFile and outputFile values (the URL list CSV and feature-extracted CSV respectively). It then reads each URL, grabs the content, then outputs to the outputFile string, as follows:
private static void GetContentFile(string inputFile, string outputFile)
{
    var lines = File.ReadAllLines(inputFile);

    var urlContent = new List<string>();

    foreach (var line in lines)
    {
        var url = line.Split(',')[0];
        var label = Convert.ToBoolean(line.Split(',')[1]);

        Console.WriteLine($"Attempting to pull HTML from {line}");

        try
        {
            var content = url.ToWebContentString();

            content = content.Replace('|', '-');

            urlContent.Add($"{label}|{content}");
        }
        catch (Exception)
        {
            Console.WriteLine($"Failed to pull HTTP Content from {url}");
        }
    }

    File.WriteAllText(Path.Combine(AppContext.BaseDirectory, outputFile),
        string.Join(Environment.NewLine, urlContent));
}
  2. Next, we use the Extract method to call both the training and test extraction, passing in the output filenames for both, like this:

public void Extract(string trainingURLList, string testURLList, string trainingOutputFileName, string testingOutputFileName)
{
    GetContentFile(trainingURLList, trainingOutputFileName);

    GetContentFile(testURLList, testingOutputFileName);
}

The WebContentPredictor class

The WebContentPredictor class provides the interface for both our command-line and desktop applications, using an overloaded Predict method, described here:

  1. The first Predict method is for our command-line application that simply takes in the URL and calls into the overload in Step 3, after calling the ToWebContentString extension method, like this:
public WebPageResponseItem Predict(string url) => Predict(new WebPageResponseItem(url.ToWebContentString()));
  2. Then, we create the Initialize method, in which we load our model from the embedded resource. If successful, the method returns true; otherwise, it returns false, as shown in the following code block:
public bool Initialize()
{
    var assembly = typeof(WebContentPredictor).GetTypeInfo().Assembly;

    var resource = assembly.GetManifestResourceStream(
        $"chapter10.lib.Model.{Constants.MODEL_NAME}");

    if (resource == null)
    {
        return false;
    }

    _model = MlContext.Model.Load(resource, out _);

    return true;
}
  3. And finally, we call our Predict method, which creates our prediction engine. We then call the prediction engine's Predict method, and update the Confidence and IsMalicious properties prior to returning the updated WebPageResponseItem object, as follows:
public WebPageResponseItem Predict(WebPageResponseItem webPage)
{
    var predictionEngine = MlContext.Model.CreatePredictionEngine
        <WebPageInputItem, WebPagePredictionItem>(_model);

    var prediction = predictionEngine.Predict(webPage.ToWebPageInputItem());

    webPage.Confidence = prediction.Probability;
    webPage.IsMalicious = prediction.Prediction;

    return webPage;
}

The WebContentTrainer class

The WebContentTrainer class contains all of the code to train and evaluate our model. As with previous examples, this functionality is self-contained within one method called Train:

  1. The first change is the use of the WebPageInputItem class to read the CSV into the dataView object separated by |, as shown in the following code block:
var dataView = MlContext.Data.LoadFromTextFile<WebPageInputItem>(
    trainingFileName, hasHeader: false, separatorChar: '|');
  2. Next, we map our file data features to create our pipeline. In this example, we simply featurize the HTMLContent property and pass it to the SdcaLogisticRegression trainer, like this:

var dataProcessPipeline = MlContext.Transforms.Text
    .FeaturizeText(FEATURES, nameof(WebPageInputItem.HTMLContent))
    .Append(MlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: "Label", featureColumnName: FEATURES));
  3. Then, we fit the model, and save the model to disk, like this:

var trainedModel = dataProcessPipeline.Fit(dataView);

MlContext.Model.Save(trainedModel, dataView.Schema,
    Path.Combine(AppContext.BaseDirectory, modelFileName));
  4. Finally, we load in the testing file, and call the BinaryClassification evaluation, like this:

var testingDataView = MlContext.Data.LoadFromTextFile<WebPageInputItem>(
    testingFileName, hasHeader: false, separatorChar: '|');

IDataView testDataView = trainedModel.Transform(testingDataView);

var modelMetrics = MlContext.BinaryClassification.Evaluate(data: testDataView);

Console.WriteLine($"Entropy: {modelMetrics.Entropy}");
Console.WriteLine($"Log Loss: {modelMetrics.LogLoss}");
Console.WriteLine($"Log Loss Reduction: {modelMetrics.LogLossReduction}");

Diving into the UWP browser application

With the library code having been reviewed, the next component is the desktop application. As discussed in the opening section, our desktop application is a UWP application. For the scope of this example, we are using standard approaches for handling the application architecture, following the MVVM approach discussed in the opening section of this chapter.

The files we will be diving into in this section are as follows:

  • MainPageViewModel
  • MainPage.xaml
  • MainPage.xaml.cs

The rest of the files inside the UWP project, such as the tile images and app class files, are untouched from the default Visual Studio UWP application template.

The MainPageViewModel class

The purpose of the MainPageViewModel class is to contain our business logic and control the View:

  1. The first thing we do is instantiate our previously discussed WebContentPredictor class to be used to run predictions, as follows:
private readonly WebContentPredictor _prediction = new WebContentPredictor();
  2. The next block of code handles the power of MVVM for our GO button, the web service URL field, and the web classification properties. For each of these properties, we call OnPropertyChanged upon a change in value, which triggers the View to refresh any field bound to these properties, as shown in the following code block:
private bool _enableGoButton;

public bool EnableGoButton
{
    get => _enableGoButton;

    private set
    {
        _enableGoButton = value;
        OnPropertyChanged();
    }
}

private string _webServiceURL;

public string WebServiceURL
{
    get => _webServiceURL;

    set
    {
        _webServiceURL = value;

        OnPropertyChanged();

        EnableGoButton = !string.IsNullOrEmpty(value);
    }
}

private string _webPageClassification;

public string WebPageClassification
{
    get => _webPageClassification;

    set
    {
        _webPageClassification = value;
        OnPropertyChanged();
    }
}
  3. Next, we define the Initialize method, which calls the predictor's Initialize method. The method will return false if the model can't be loaded or found, as follows:
public bool Initialize() => _prediction.Initialize();
  4. Then, we take the URL the user entered via the WebServiceURL property. From that value, we validate that it is prefixed with either http or https. If not, http:// is prefixed to the URL prior to converting it to a URI, like this:
public Uri BuildUri()
{
    var webServiceUrl = WebServiceURL;

    if (!webServiceUrl.StartsWith("http://",
            StringComparison.InvariantCultureIgnoreCase) &&
        !webServiceUrl.StartsWith("https://",
            StringComparison.InvariantCultureIgnoreCase))
    {
        webServiceUrl = $"http://{webServiceUrl}";
    }

    return new Uri(webServiceUrl);
}
  5. Now, onto our Classify method, which takes the URL entered by the user. This method calls our Predict method, builds our status bar text, and, if the page is found to be malicious, builds the HTML response to send back to our WebView object, as follows:
public (Classification ClassificationResult, string BrowserContent) Classify(string url)
{
    var result = _prediction.Predict(url);

    WebPageClassification = $"Webpage is considered {result.Confidence:P1} malicious";

    return result.Confidence < Constants.MALICIOUS_THRESHOLD ?
        (Classification.BENIGN, string.Empty) :
        (Classification.MALICIOUS,
            $"<html><body bgcolor=\"red\"><h2 style=\"text-align: center\">Machine Learning has found {WebServiceURL} to be a malicious site and was blocked automatically</h2></body></html>");
}
  6. And lastly, we implement the PropertyChanged event and the OnPropertyChanged method, which are the standard implementation of the INotifyPropertyChanged interface, as discussed in the opening section of this chapter and shown in the following code block:
public event PropertyChangedEventHandler PropertyChanged;

protected virtual void OnPropertyChanged([CallerMemberName] string propertyName = null)
{
    PropertyChanged?.Invoke(this,
        new PropertyChangedEventArgs(propertyName));
}

MainPage.xaml

As discussed in the opening section describing UWP development, XAML markup is used to define your UI. For the scope of this application, our UI is relatively simple:

  1. The first thing we define is our Grid. In XAML, a Grid is a container similar to a div element in web development. We then define our Rows. Similar to Bootstrap (but easier to understand, in my opinion), the height of each row is pre-defined. Setting a row to Auto will auto-size the height to the content's height, while an asterisk translates to using all remaining height based on the main container's height, as shown in the following code block:
<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="Auto" />
        <RowDefinition Height="*" />
        <RowDefinition Height="Auto" />
    </Grid.RowDefinitions>
  2. Similar to the row definitions in Step 1, we pre-define columns. "Auto" and "*" equate to the same principle as they did for the rows, just in regard to width instead of height, as shown in the following code block:
    <Grid.ColumnDefinitions>
        <ColumnDefinition Width="*" />
        <ColumnDefinition Width="Auto" />
    </Grid.ColumnDefinitions>
  3. We then define our TextBox object for the URL entry. Note the Binding call in the Text value. This binds the textbox's text field to the WebServiceURL property in our View Model, as follows:
<TextBox Grid.Row="0" Grid.Column="0" KeyUp="TxtBxUrl_KeyUp" Text="{Binding WebServiceURL, Mode=TwoWay, UpdateSourceTrigger=PropertyChanged}" />
  4. Then, we add the button to mimic a browser's GO button, which triggers the navigation. Also, note the use of Binding to enable or disable the button itself (which is bound based on text being entered into the URL textbox), as shown in the following code block:
<Button Grid.Row="0" Grid.Column="1" Content="GO" Click="BtnGo_Click" IsEnabled="{Binding EnableGoButton}" />
  5. We then add the WebView control that comes with UWP, as follows:
<WebView Grid.Row="1" Grid.Column="0" Grid.ColumnSpan="2" x:Name="wvMain" NavigationStarting="WvMain_OnNavigationStarting" />
  6. Lastly, we add our status bar grid and TextBlock control to show the classification along the bottom of the window, as follows:
<Grid Grid.Column="0" Grid.ColumnSpan="2" Grid.Row="2" Background="#1e1e1e" Height="30">
<TextBlock Text="{Binding WebPageClassification, Mode=OneWay}" Foreground="White" Margin="10,0,0,0" />
</Grid>

MainPage.xaml.cs

The MainPage.xaml.cs file contains the code-behind for the XAML view discussed previously:

  1. The first thing we define is a wrapper property around the DataContext property built into the base Page class, as follows:
private MainPageViewModel ViewModel => (MainPageViewModel) DataContext;
  2. Next, we define the constructor for MainPage to initialize the DataContext to our MainPageViewModel object, as follows:
public MainPage()
{
    InitializeComponent();

    DataContext = new MainPageViewModel();
}
  3. We then override the base OnNavigatedTo method to initialize our View Model and validate that the model was loaded properly, as follows:
protected override async void OnNavigatedTo(NavigationEventArgs e)
{
    var initialization = ViewModel.Initialize();

    if (initialization)
    {
        return;
    }

    await ShowMessage("Failed to initialize model - verify the model has been created");

    Application.Current.Exit();

    base.OnNavigatedTo(e);
}
  4. Next, we add our ShowMessage wrapper to provide an easy one-liner to call throughout our application, like this:
public async Task<IUICommand> ShowMessage(string message)
{
    var dialog = new MessageDialog(message);

    return await dialog.ShowAsync();
}
  5. Then, we handle the GO button click by calling the Navigate method, as follows:
private void BtnGo_Click(object sender, RoutedEventArgs e) => Navigate();
  6. We then create our Navigate wrapper method, which builds the URI and passes it to the WebView object, as follows:
private void Navigate()
{
    wvMain.Navigate(ViewModel.BuildUri());
}
  7. We also want to handle keyboard input to listen for the user hitting the Enter key after entering a URL, giving the user the ability to either hit Enter or click the GO button, like this:
private void TxtBxUrl_KeyUp(object sender, KeyRoutedEventArgs e)
{
    if (e.Key == VirtualKey.Enter && ViewModel.EnableGoButton)
    {
        Navigate();
    }
}
  8. Lastly, we block navigation until a classification can be obtained by hooking into the WebView's OnNavigationStarting event, as follows:
private void WvMain_OnNavigationStarting(WebView sender, WebViewNavigationStartingEventArgs args)
{
    if (args.Uri == null)
    {
        return;
    }

    var (classificationResult, browserContent) =
        ViewModel.Classify(args.Uri.ToString());

    switch (classificationResult)
    {
        case Classification.BENIGN:
            return;
        case Classification.MALICIOUS:
            sender.NavigateToString(browserContent);
            break;
    }
}

Diving into the trainer application

Now that we have reviewed the shared library and the desktop application, let us dive into the trainer application. With the major architectural changes having been performed in Chapter 8's example, the trainer application, by design, has only minimal changes to handle the specific class objects used in this chapter's example.

We will review the following files:

  • ProgramArguments
  • Program

The ProgramArguments class

Building off the work in Chapter 9's ProgramArguments class, we are only making three additions to the class. The first two additions are to include both the Training and Testing output filenames to provide better flexibility with our example's infrastructure. In addition, the URL property holds the URL you can pass, using the command line, into the trainer application to get a prediction, as shown in the following code block:

public string TrainingOutputFileName { get; set; }

public string TestingOutputFileName { get; set; }

public string URL { get; set; }

The Program class

Inside the Program class, we will now modify the switch case statement to use the classes/methods from Chapter 10, Using ML.NET with UWP, as follows:

switch (arguments.Action)
{
    case ProgramActions.FEATURE_EXTRACTOR:
        new WebContentFeatureExtractor().Extract(
            arguments.TrainingFileName, arguments.TestingFileName,
            arguments.TrainingOutputFileName,
            arguments.TestingOutputFileName);
        break;
    case ProgramActions.PREDICT:
        var predictor = new WebContentPredictor();

        var initialization = predictor.Initialize();

        if (!initialization)
        {
            Console.WriteLine("Failed to initialize the model");

            return;
        }

        var prediction = predictor.Predict(arguments.URL);

        Console.WriteLine($"URL is {(prediction.IsMalicious ? "malicious" : "clean")} with a {prediction.Confidence:P2}% confidence");
        break;
    case ProgramActions.TRAINING:
        new WebContentTrainer().Train(arguments.TrainingFileName,
            arguments.TestingFileName, arguments.ModelFileName);
        break;
    default:
        Console.WriteLine($"Unhandled action {arguments.Action}");
        break;
}

Running the trainer application

Before training our model, we first need to run the chapter10.trainer application to perform feature extraction. Running the trainer application is nearly identical to Chapter 9's sample application, with the addition of passing in the test dataset filename path when training:

  1. Run the trainer application, passing in the paths to the training and test URL list CSVs to perform feature extraction, as follows:
PS chapter10\trainer\bin\Debug\netcoreapp3.0> .\chapter10.trainer.exe TrainingFileName ..\..\..\..\Data\trainingURLList.csv TestingFileName ..\..\..\..\Data\testingURLList.csv
Attempting to pull HTML from https://www.google.com, false
Attempting to pull HTML from https://www.bing.com, false
Attempting to pull HTML from https://www.microsoft.com, false
Attempting to pull HTML from https://www8.hp.com/us/en/home.html, false
Attempting to pull HTML from https://dasmalwerk.eu, true
Attempting to pull HTML from http://vxvault.net, true
Attempting to pull HTML from https://www.tmz.com, true
Attempting to pull HTML from http://openmalware.org, true
Failed to pull HTTP Content from http://openmalware.org
Attempting to pull HTML from https://www.dell.com, false
Attempting to pull HTML from https://www.lenovo.com, false
Attempting to pull HTML from https://www.twitter.com, false
Attempting to pull HTML from https://www.reddit.com, false
Attempting to pull HTML from https://www.tmz.com, true
Attempting to pull HTML from https://www.cnn.com, true
Attempting to pull HTML from https://www.usatoday.com, true
  2. Run the application to train the model, based on Step 1's sample and test data exports, as follows:
PS chapter10\trainer\bin\Debug\netcoreapp3.0> .\chapter10.trainer.exe ModelFileName webcontentclassifier.mdl Action TRAINING TrainingFileName ..\..\..\..\Data\sampledata.csv TestingFileName ..\..\..\..\Data\testdata.csv
Entropy: 0.9852281360342516
Log Loss: 0.7992317560011841
Log Loss Reduction: 0.18878508766684401

Feel free to modify the values and see how the prediction changes, based on the dataset on which the model was trained. A few areas of experimentation from this point might be to:

  • Tweak the hyperparameters reviewed in the Trainer class on the Stochastic Dual Coordinate Ascent (SDCA) algorithm, such as MaximumNumberOfIterations, to see how accuracy is affected (see the sketch after this list).
  • Add new features in addition to simply using the HTML content—perhaps the connection type or the number of scripts.
  • Add more variation to the training and sample set to get a better sampling of both benign and malicious content.
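As a sketch of the first idea, ML.NET exposes an options-based overload of SdcaLogisticRegression; the hyperparameter values below are illustrative assumptions, not values from this chapter's code:

using Microsoft.ML.Trainers;

var trainer = MlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
    new SdcaLogisticRegressionBinaryTrainer.Options
    {
        LabelColumnName = "Label",
        FeatureColumnName = FEATURES,
        // By default, ML.NET determines the iteration count automatically
        MaximumNumberOfIterations = 100,
        // A tighter tolerance trades longer training for potentially better accuracy
        ConvergenceTolerance = 0.001f
    });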

For convenience, the GitHub repository includes all of the following data files in the Data folder:

  • The testdata.csv and sampledata.csv feature-extracted CSV files
  • The testingURLList.csv and trainingURLList.csv URL list CSV files

Running the browser application

Now that our model has been trained, we can run our desktop application and test the efficacy of the model. To run the example, make sure the chapter10_app project is set as the startup project, and hit F5. Upon launching our browser application, enter www.google.com, as shown in the following screenshot:

Note the status bar below the web page content in the preceding screenshot, indicating the malicious percentage after running the model. Next, type dasmalwerk.eu into your browser (this is a website that the default training URL list pre-classified as malicious), and note the forced redirect, as shown in the following screenshot:

Feel free to try various websites to see the confidence score, and if you receive a false positive, perhaps add additional features to the model to correct the classification.

Additional ideas for improvements

Now that we have completed our deep dive, there are a few additional elements that could further enhance the application. A few ideas are discussed here.

Single-download optimization

Currently, when a new URL is entered or the page is changed in the WebView UWP control, the navigation is halted until a classification can be made. When this occurs, as we detailed previously, we download and extract the text with the HtmlAgilityPack library. If the page is deemed to be clean (as will more than likely be the case the majority of the time), we will effectively have downloaded the content twice. An optimization here would be to store the extracted text in the application's sandboxed storage once classification is done, and then point the WebView object to that stored content. In addition, if this approach is used, add a purge background worker to remove older data, so that your end users don't end up with several gigabytes of web page content.
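One possible shape for the caching half of this idea is sketched below using the Windows.Storage APIs; the PageCache class and its method names are hypothetical, and the purge worker is omitted for brevity:

using System.Threading.Tasks;
using Windows.Storage;

public static class PageCache
{
    // Store the already-extracted page text under a file name derived from the URL,
    // inside the app's sandboxed LocalFolder
    public static async Task CacheAsync(string fileName, string content)
    {
        StorageFile file = await ApplicationData.Current.LocalFolder
            .CreateFileAsync(fileName, CreationCollisionOption.ReplaceExisting);

        await FileIO.WriteTextAsync(file, content);
    }

    // Retrieve the cached text so it can be handed to WebView.NavigateToString,
    // avoiding a second download of the same page
    public static async Task<string> RetrieveAsync(string fileName)
    {
        StorageFile file = await ApplicationData.Current.LocalFolder.GetFileAsync(fileName);

        return await FileIO.ReadTextAsync(file);
    }
}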

Logging

As with our previous chapter's deep dive into logging, adding logging could be crucial for remotely understanding when an error occurs in a desktop application. Unlike our web application in the previous chapter, where your errors would more than likely be server-side and accessible remotely, your desktop application could be installed on any number of Windows 10 configurations, with an almost unlimited number of permutations. As mentioned previously, logging with NLog (https://nlog-project.org/) or a similar open source project is highly recommended, coupled with a remote logging solution such as Loggly, so that you can get error data from your users' machines. Given the General Data Protection Regulation (GDPR) and the recent California Consumer Privacy Act (CCPA), ensure that the fact that this data is leaving the end user's machine is conveyed, and do not include personal data in these logs.
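A minimal sketch of wiring up NLog programmatically follows; the target filename and rule levels are illustrative assumptions, and a remote target such as Loggly would be registered in the same configuration:

using NLog;
using NLog.Config;
using NLog.Targets;

public static class LoggingSetup
{
    public static Logger Initialize()
    {
        var config = new LoggingConfiguration();

        // Write to a local file; a remote target could be added alongside this
        var fileTarget = new FileTarget("file")
        {
            FileName = "${basedir}/chapter10_app.log"
        };

        // Capture warnings and above to keep log volume (and privacy risk) low
        config.AddRule(LogLevel.Warn, LogLevel.Fatal, fileTarget);

        LogManager.Configuration = config;

        return LogManager.GetCurrentClassLogger();
    }
}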

Utilizing a database

Users typically visit the same websites fairly frequently, so storing the classification of a particular website's URL in a local database such as LiteDB (http://www.litedb.org/) would significantly improve performance for the end user. One implementation method would be to store a SHA256 hash of the URL locally as the key, with the classification as the value. Longer term, you could provide a web URL reputation database, with the SHA256 hash of the URL being sent up to a scalable cloud storage solution such as Microsoft's Cosmos DB. Storing the SHA256 hash of the URL avoids any questions from your end users about personally identifiable information and anonymity.
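One possible sketch of the local-cache half of this idea follows; the class names, database filename, and hashing helper are all hypothetical:

using System;
using System.Security.Cryptography;
using System.Text;
using LiteDB;

public class UrlClassification
{
    public string Id { get; set; } // SHA256 hash of the URL

    public bool IsMalicious { get; set; }

    public DateTime ClassifiedAt { get; set; }
}

public static class ClassificationCache
{
    private const string DB_FILENAME = "classifications.db";

    // Hash the URL so no personally identifiable browsing data is stored directly
    public static string HashUrl(string url)
    {
        using (var sha = SHA256.Create())
        {
            var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(url.ToLowerInvariant()));

            return BitConverter.ToString(bytes).Replace("-", string.Empty);
        }
    }

    // Returns null when the URL has not been classified before
    public static bool? Lookup(string url)
    {
        using (var db = new LiteDatabase(DB_FILENAME))
        {
            return db.GetCollection<UrlClassification>("urls")
                .FindById(HashUrl(url))?.IsMalicious;
        }
    }

    public static void Store(string url, bool isMalicious)
    {
        using (var db = new LiteDatabase(DB_FILENAME))
        {
            db.GetCollection<UrlClassification>("urls").Upsert(new UrlClassification
            {
                Id = HashUrl(url),
                IsMalicious = isMalicious,
                ClassifiedAt = DateTime.UtcNow
            });
        }
    }
}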

Summary

Over the course of this chapter, we have taken a deep dive into what goes into a production-ready Windows 10 UWP application architecture, using the work performed in previous chapters as a foundation. We also created a brand new web-page-classification Windows 10 application, utilizing the SdcaLogisticRegression algorithm from ML.NET. Lastly, we discussed some ways to further enhance the example application (and production applications in general).

With the conclusion of this chapter, this ends the real-world application section. The next section of the book includes both general machine learning practices in an agile production team and extending ML.NET with TensorFlow and Open Neural Network Exchange (ONNX) models. In the next chapter, we will focus on the former.
