© Benjamin Weissman and Enrico van de Laar 2020
B. Weissman, E. van de Laar, SQL Server Big Data Clusters, https://doi.org/10.1007/978-1-4842-5985-6_7

7. Machine Learning on Big Data Clusters

Benjamin Weissman1  and Enrico van de Laar2
(1)
Nurnberg, Germany
(2)
Drachten, The Netherlands
 

In the previous chapters, we spent significant time on how we can query data stored inside SQL Server instances or on HDFS through Spark. One advantage of having access to data stored in different formats is that it allows you to perform analysis of the data at a large, and distributed, scale. One of the more powerful options we can utilize inside Big Data Clusters is the ability to implement machine learning solutions on our data. Because Big Data Clusters allow us to store massive amounts of data in all kinds of formats and sizes, the ability to train, and utilize, machine learning models across all of that data becomes far easier.

In many situations where you are working with machine learning, the challenge of getting all the data you need to build your models into one place takes up the bulk of the work. Building a machine learning model (or training, as it is called in the data science world) becomes far easier if you can directly access all the data you require without having to move it from different data sources to one place. Besides having access to the data from a single point of entry, Big Data Clusters also allow you to operationalize your machine learning models at the same location where your data resides. This means that, technically, you can use your machine learning models to score new data as soon as it is stored inside your Big Data Cluster. This greatly increases the capabilities of implementing machine learning inside your organization, since Big Data Clusters allow you to train, exploit, and store machine learning models inside a single solution instead of having various platforms in place, each performing a specific task inside your organization’s advanced analytics platform.

In this chapter we are going to take a closer look at the various options available inside Big Data Clusters to train, store, and operationalize machine learning models. Generally speaking, there are two directions we are going to cover: In-Database Machine Learning Services inside SQL Server and machine learning on top of the Spark platform. Both of these areas cover different use cases, but they can also overlap. As you have seen in the previous chapter, we can easily bring data stored inside a SQL Server instance to Spark, and vice versa, if we so please. Which of the two you choose for your machine learning processes is, in this situation, based mostly on which solution you personally prefer to work in. We will discuss the technical advantages and disadvantages of both machine learning surfaces inside Big Data Clusters in each section of this chapter. This will give you a better understanding of how each of these solutions works and will hopefully help you select the one that fits your requirements best.

SQL Server In-Database Machine Learning Services

With the release of SQL Server 2016, Microsoft introduced a new feature named In-Database R Services. This feature allows you to execute R code directly inside SQL Server queries using either the new sp_execute_external_script stored procedure or the sp_rxPredict CLR procedure. The introduction of In-Database R Services was a new direction that allowed organizations to integrate their machine learning models directly inside their SQL Server databases by allowing the user to train, score, and store models directly inside SQL Server. While R was the only language available inside SQL Server 2016 for use with sp_execute_external_script, Python was added with the release of SQL Server 2017, which also resulted in the feature being renamed to Machine Learning Services. With the release of SQL Server 2019, on which Big Data Clusters are built, Java was added as the third programming language that can be accessed directly from T-SQL code.

While there are some restrictions in place regarding In-Database Machine Learning Services (for instance, some functions that are available with In-Database Machine Learning Services, like PREDICT, only accept machine learning models trained with Revolution Analytics algorithms), it is a very useful feature if you want to train and score your machine learning models as close as possible to where your data is stored. This is also the area where we believe In-Database Machine Learning Services shines. By utilizing this feature, data movement is practically minimal (considering that the data your machine learning models require is also directly available in the SQL Server instance), model management is taken care of by storing the models inside SQL Server tables, and it opens the door to (near) real-time model scoring by passing the data to your machine learning model before it is stored inside a table in your database.

All of the example code inside this chapter is available as a T-SQL notebook at this book’s GitHub page. For the examples in this section, we have chosen to use R as the language of choice instead of Python which we used in the previous chapter.

Training Machine Learning Models in the SQL Server Master Instance

Before we can get started training our machine learning models, we have to enable the option that allows the use of the sp_execute_external_script stored procedure inside the SQL Server Master Instance of the Big Data Cluster. If you do not enable the option to run external scripts inside the SQL instance, a large portion of the In-Database Machine Learning Services functionality is unavailable.

Some In-Database Machine Learning functionality is still available with external scripts disabled. For instance, you can still use the PREDICT function together with a pretrained machine learning model to score data. However, you cannot run the code needed to train the model, since that mostly happens through the external script functionality.

If you do not have external scripts enabled and want to run a section of R code using sp_execute_external_script , you will be confronted with the following error message (Figure 7-1).
Figure 7-1

Error running sp_execute_external_script with external scripts disabled

Enabling sp_execute_external_script is simple and straightforward. Connect to your SQL Server Master Instance and run the code shown in Listing 7-1 to immediately enable the option.
-- Before we can start, we need to enable external scripts
EXEC sp_configure 'external scripts enabled',1
RECONFIGURE WITH OVERRIDE
GO
Listing 7-1

Enable external scripts

After enabling the use of external scripts, we can directly run R, Python, or Java code through the sp_execute_external_script procedure. As we mentioned in the introduction of this section, we have chosen R as the language of choice for this section of the book, and the code of Listing 7-2 shows a simple R command to return the version information of R.
EXEC sp_execute_external_script
    @language =N'R',
    @script=N'print (version)'
Listing 7-2

Sample R code using sp_execute_external_script

Running the code in Listing 7-2 should return the results shown in Figure 7-2.
Figure 7-2

R version results through sp_execute_external_script

As you can see from the code, the sp_execute_external_script procedure accepts a number of parameters. Our example displays the minimal parameters that need to be supplied when calling the procedure, namely, @language and @script. The @language parameter sets the language that is used in the @script section. In our case, this is R. Through the @script parameter, we can run the R code we want to execute, in this case the print (version) command.

Note that sp_execute_external_script always returns some output regarding the machine it is executed on first; the output of the print (version) R command itself starts on line 5, at the entry platform x86_64.

While we can work just fine with R output being returned inside the message window, we can also supply additional parameters to sp_execute_external_script to return the output generated by R in a tabular format. We do that by mapping a variable we define in R to the procedure’s output (using the @output_data_1_name parameter shown in the following) and adding the WITH RESULT SETS clause when we call the procedure, as shown in the example of Listing 7-3.
EXEC sp_execute_external_script
    @language =N'R',
    @script=N'
            r_hi <- "Hello World!"
            r_hello <- as.data.frame(r_hi)',
    @output_data_1_name = N'r_hello'
WITH RESULT SETS (([hello] nvarchar(250)));
GO
Listing 7-3

Returning data using WITH RESULT SETS

By running the code in Listing 7-3, you should get the text “Hello World!” returned inside a table result as shown in Figure 7-3.
Figure 7-3

Output returned to a table format

Just like we can define and map output results through the sp_execute_external_script procedure, we can define input datasets. This is of course incredibly useful, since it allows us to define a query as an input dataset to the R session and map it to an R variable. Being able to get data stored inside your SQL Server database into the In-Database Machine Learning Services feature opens the door to performing advanced analytics on that data, like training or scoring machine learning models.

We are going to train a machine learning model on the “Iris” dataset. This dataset is directly available inside R and shows various characteristics of Iris flowers and to which species a specific Iris flower belongs. We can use this data to create a classification machine learning model in which we are going to predict which species an Iris flower belongs to.

Since the dataset is already present inside R, we can use a bit of R scripting together with the sp_execute_external_script procedure to return the dataset as a SQL table. The code of Listing 7-4 creates a new database called “InDBML” and a new table called “Iris” and fills that table from the Iris dataset inside an R session.
-- Create a new database to hold the Iris data
CREATE DATABASE InDBML
GO
USE [InDBML]
GO
-- Create a table to hold the Iris data
CREATE TABLE Iris
    (
        Sepal_Length FLOAT,
        Sepal_Width FLOAT,
        Petal_Length FLOAT,
        Petal_Width FLOAT,
        Species VARCHAR(50)
    )
-- Get the Iris dataset from the R session and insert it into our table
INSERT INTO Iris
EXEC sp_execute_external_script
    @language =N'R',
    @script=N'
            r_iris <- iris',
    @output_data_1_name = N'r_iris'
-- Get data from the new table
SELECT * FROM Iris
Listing 7-4

Create a new database and fill it with test data

If everything processed correctly, you should have received the results as shown in Figure 7-4 which shows the values stored inside the Iris table.
Figure 7-4

Iris table values

Now that we have some data to create a machine learning model on, we can get started by training a model. But before we do that, we are going to perform two additional tasks. The first is creating a table to hold our models. One very useful feature of In-Database Machine Learning Services is the ability to “serialize” a model into a binary string which we can then store inside a SQL table. When the model is stored inside a table, we can retrieve it whenever we need it through a SQL query. The code of Listing 7-5 creates a model table inside the InDBML database.
-- Create a table to hold our trained ML models
CREATE TABLE models
    (
    model_name nvarchar(100) not null,
    model_version nvarchar(100) not null,
    model_object varbinary(max) not null
    )
 GO
Listing 7-5

Create model table

Next to the model_object column that is going to hold our serialized model, we also create two additional columns that store the name and the version of the model. This can be very useful in situations where you are storing multiple models inside your SQL Server database and want to select a specific model by version or name.
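
For example, the following query (a minimal sketch using the table we just created; adjust the name and version to your own values) retrieves one specific version of a stored model:
-- Retrieve one specific version of a stored model
DECLARE @model VARBINARY(MAX) =
    (
    SELECT model_object
    FROM models
    WHERE model_name = 'iris.dtree'
      AND model_version = 'v1.0'
    )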

The next thing we are going to do is to split our Iris dataset into a training and a testing set. Splitting a dataset is a common task when you are training machine learning models. The training dataset is the data you are going to feed into the model you are training; the test dataset is a portion of the data you are “hiding” from the model while it is training. That way the model is never exposed to the testing data, which means we can use the data inside the testing set to validate how well the model performs when shown data it has never seen before. For that reason, it is very important that both the training and the testing datasets are a good representation of the full dataset. For instance, if we train the model only on characteristics of the “Setosa” Iris species inside our dataset and then show it data from another species through our test dataset, it will predict incorrectly (always predicting Setosa) since it has never seen that other species during training.

The code of Listing 7-6 randomly selects 80% of the rows from the Iris table and inserts them into a new Iris_train table. The other 20% of the data goes into a new Iris_test table.
-- Randomly select 80% of the data into a separate training table
SELECT TOP 80 PERCENT *
INTO Iris_train
FROM Iris
ORDER BY NEWID()
-- Select the remaining rows into a testing table
SELECT *
INTO Iris_test
FROM Iris
EXCEPT
SELECT * FROM Iris_train
Listing 7-6

Split dataset into training and testing dataset
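
The random split in Listing 7-6 does not guarantee that every species is proportionally represented in both tables. If you want a stratified split instead, a sketch like the following takes 80% of the rows per species (the Iris_train_stratified table name is our own and is not used in the rest of this chapter):
-- Sketch of a stratified split: take 80% of the rows per species
SELECT Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species
INTO Iris_train_stratified
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Species ORDER BY NEWID()) AS rn,
        COUNT(*) OVER (PARTITION BY Species) AS cnt
    FROM Iris
    ) AS strat
WHERE rn <= cnt * 0.8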

Now that we have a model table and our training and testing data separated, we are ready to train our machine learning model and store it inside our model table after training, which is exactly what the code of Listing 7-7 does.
DECLARE @model VARBINARY(MAX)
-- Train a decision tree based on our training dataset
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
            iris.dtree <- rxDTree(Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width, data = iris_sqldata)
            trained_model <- rxSerializeModel(iris.dtree, realtimeScoringOnly = FALSE)',
    @input_data_1 = N'SELECT * FROM Iris_train',
    @input_data_1_name = N'iris_sqldata',
    @params = N'@trained_model VARBINARY(MAX) OUTPUT',
    @trained_model = @model OUTPUT
-- Insert the model into our model table
INSERT INTO models
    (
    model_name,
    model_version,
    model_object
    )
 VALUES
    (
    'iris.dtree',
    'v1.0',
    @model
    )
Listing 7-7

Train a machine learning model using sp_execute_external_script

The preceding code performs a number of steps to train and store a machine learning model. To make sure you understand how sp_execute_external_script can be used to train and store models inside your SQL Server Master Instance, we are going to describe each step that is being performed in the preceding code.
  1. 1.

    The first line of the script, DECLARE @model VARBINARY(MAX), declares a T-SQL variable of the VARBINARY datatype that will hold our model after training it.

     
  2. 2.

    In the second step, we execute the sp_execute_external_script procedure and supply the R code needed to train our model. Notice we are using an algorithm called rxDTree. rxDTree is a decision tree algorithm built by Revolution Analytics, a company that Microsoft acquired in 2015 and that provides parallel and chunk-based machine learning algorithms. The syntax for the model training is pretty straightforward: we are predicting the species based on the other columns (or, as they are called, features) of the training dataset.

    The line trained_model <- rxSerializeModel(iris.dtree, realtimeScoringOnly = FALSE) is the command to serialize our model and store it inside the trained_model R variable. We map that variable as an output parameter to the T-SQL @model variable in the call to the sp_execute_external_script procedure. We map the query that selects all the records from the training dataset as an input variable for R to use as input for the algorithm.

     
  3. 3.

    Finally, in the last step, we insert the trained model inside the model table we created earlier. We supply some additional data like a name and a version so we can easily select this model when we use it to predict Iris species in the next step.

     
After running this code, which should only take a few seconds, we should end up with our model stored as a binary string inside our model table as shown in Figure 7-5.
Figure 7-5

Trained decision tree model inside the model table

Scoring Data Using In-Database Machine Learning Models

Now that we have trained our model, we can use it to score, or predict, the data we stored in the Iris_test table. To do that we can use two methods: one using the sp_execute_external_script procedure, which we also used to train our model, and the other using the PREDICT function that is available in SQL Server.

The code of Listing 7-8 shows the first approach; notice that the syntax is mostly the same as the earlier examples of this method, but this time we supply the trained model as an input parameter together with a query to select the data from the Iris_test table.
-- Retrieve the model from the model table
DECLARE @model VARBINARY(MAX) = (SELECT model_object FROM models WHERE model_name = 'iris.dtree')
-- Run a prediction using the Iris_test data as input
-- Return all columns, including the probability for each species
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
            model = rxUnserializeModel(model);
            Iris_prediction = rxPredict(model, data=Iris_test)
            Iris_pred_results <- cbind(Iris_test, Iris_prediction)
            str(Iris_pred_results)
            ',
    @input_data_1 = N'
                    SELECT
                        Sepal_Length,
                        Sepal_Width,
                        Petal_Length,
                        Petal_Width,
                        Species
                    FROM Iris_test',
    @input_data_1_name = N'Iris_test',
    @output_data_1_name = N'Iris_pred_results',
    @params = N'@model varbinary(max)',
    @model = @model
 WITH RESULT SETS (("Sepal_Length" FLOAT, "Sepal_Width" FLOAT, "Petal_Length" FLOAT, "Petal_Width" FLOAT, Species VARCHAR(50), setosa_Pred FLOAT, versicolor_Pred FLOAT, virginica_Pred FLOAT))
Listing 7-8

Run a prediction using the in-database stored model

In the first part of the R script inside the sp_execute_external_script code, we have to unserialize our model again using rxUnserializeModel. With the model unserialized, we can perform a prediction on the input data using rxPredict. The cbind line of R code adds the probability columns for each Iris species to the input dataset. This means we end up with a single table as output that contains all the input columns as well as the columns generated by the scoring process.

We won’t go into details about machine learning or machine learning algorithms in this book, but the problem we are trying to solve using machine learning in this case is called classification. Machine learning algorithms can basically be grouped into three different categories: regression, classification, and clustering. With regression we are trying to predict a numerical value, for instance, the price of a car. Classification usually deals with predicting a categorical value, like the example we went through in this chapter: What species of Iris plant is this? Clustering algorithms group observations together based on their characteristics. In the Iris example, we could also have chosen a clustering algorithm, since the characteristics of the different Iris species might form clearly separated groups.

After running the code in Listing 7-8, we see the results shown in Figure 7-6. If you ran the code yourself, you might see some different results since we split our training and test data based on randomly selected rows.
Figure 7-6

Scored results for the data inside the Iris_test table using our trained machine learning model

Performing a prediction using the sp_execute_external_script method works perfectly fine and gives you maximum flexibility in terms of what you can do using R code. However, it does result in quite a lot of lines of code. Another method we have available inside SQL Server is the PREDICT function; PREDICT is far easier to use, has a simpler syntax, and, in general, performs faster than sp_execute_external_script. It does have its drawbacks though: for instance, you cannot write custom R code to perform additional steps on the data, and you are required to use a serialized model that was trained using a Revolution Analytics algorithm (with sp_execute_external_script you can use basically every algorithm available in R or its libraries).

We performed the same scoring on our data inside the Iris_test table using the PREDICT function in the code of Listing 7-9.
DECLARE @model VARBINARY(MAX) = (SELECT model_object FROM models WHERE model_name = 'iris.dtree')
-- Alternative method is using the PREDICT function
SELECT
  Iris_test.*,
  pred.*
FROM PREDICT(MODEL = @model, DATA = dbo.Iris_test as Iris_test)
WITH(setosa_Pred FLOAT, versicolor_Pred FLOAT, virginica_Pred FLOAT) AS pred
Listing 7-9

Running a model prediction using the PREDICT function

As you can directly see, PREDICT is far more readable than sp_execute_external_script and, for those more familiar with T-SQL, far easier to understand. In a sense, we are joining the model, and its outputs, to the data inside the Iris_test table. We need to supply the column names and datatypes of the prediction output inside the WITH clause and can select what we want to return using the SELECT statement. In this case we are selecting all the columns of the Iris_test table together with all the columns that are returned by the prediction, and the results should look like those shown in Figure 7-7.
Figure 7-7

Iris species prediction using PREDICT

Now that we have trained a machine learning model, and scored data using it, inside SQL Server Machine Learning Services, you should have a general idea of the capabilities of these methods. In general, we believe In-Database Machine Learning Services is especially useful when all, or the largest part, of your data is stored inside SQL Server databases. With the model stored inside a SQL Server database as well, you can build solutions that score data in (near) real time as soon as it is stored inside your SQL Server database (for instance, by using triggers that call the PREDICT function). You are not limited to just SQL Server tables, however. As you have seen in earlier chapters, we can map data stored inside the Spark cluster (or on other systems altogether) using external tables and pass that data to In-Database Machine Learning Services.
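
As a hypothetical sketch of such a near real-time scenario (the procedure name and the use of a NULL Species value to mark unscored rows are our own additions and not part of the chapter’s examples; the trigger idea mentioned earlier would work along the same lines), a stored procedure could store a new measurement and immediately return its predicted species probabilities:
-- Hypothetical sketch: store a new measurement and score it right away
CREATE OR ALTER PROCEDURE dbo.InsertAndScoreIris
    @Sepal_Length FLOAT,
    @Sepal_Width FLOAT,
    @Petal_Length FLOAT,
    @Petal_Width FLOAT
AS
BEGIN
    DECLARE @model VARBINARY(MAX) =
        (SELECT model_object FROM models WHERE model_name = 'iris.dtree')
    -- Store the incoming row first; Species is left NULL because it is still unknown
    INSERT INTO dbo.Iris (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species)
    VALUES (@Sepal_Length, @Sepal_Width, @Petal_Length, @Petal_Width, NULL)
    -- Score the rows that have not been labeled yet
    SELECT i.*, pred.*
    FROM PREDICT(MODEL = @model, DATA = dbo.Iris AS i)
    WITH(setosa_Pred FLOAT, versicolor_Pred FLOAT, virginica_Pred FLOAT) AS pred
    WHERE i.Species IS NULL
END
GO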

In some situations, however, you cannot use In-Database Machine Learning Services, perhaps because your data doesn’t fit inside SQL Server, either by size or by data type, or you are more familiar with working on Spark. In any of those cases, we always have the option of performing machine learning tasks on the Spark portion of the Big Data Cluster which we are going to explore in more detail in the next section.

Machine Learning in Spark

Since Big Data Clusters are made up of SQL Server and Spark nodes, we can easily choose to run our machine learning processes, from training to scoring, inside the Spark platform. There are many reasons why you would choose Spark over SQL as a machine learning platform (and vice versa). However, when you have a very large dataset that doesn’t make sense to load into a database, you are more or less bound to Spark, since Spark handles large datasets very well and can train various machine learning algorithms in the same distributed manner in which it handles data processing.

As expected on an open, distributed, data processing platform, there are many libraries available which you can use to satisfy your machine learning needs. In this book we decided on using the built-in Spark ML libraries which provide a large selection of different algorithms and should cover most of your advanced analytical needs.

Just like we did in the In-Database Machine Learning Services for SQL Server section, we need to get some data inside Spark to work with. For the sake of simplicity, we decided on reusing the Iris dataset we also used for the SQL Server section. Just like we did in the previous chapter, all the data processing, wrangling, and analysis we are doing in Spark happens on a dataframe. Assuming you worked through the examples in the previous SQL Server section, we are going to extract the Iris dataset from the SQL Server Master Instance and load it into a dataframe in Spark using the code of Listing 7-10. If you are unfamiliar with connecting to the SQL Server Master Instance through Spark, we suggest reading the last section of the previous chapter, where we explain in detail how you can make this scenario work.
# Before we get started, let's get the Iris data from the database/table we
# created in the previous section
df_Iris = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://master-0.master-svc;databaseName=InDBML")
    .option("dbtable", "dbo.Iris")
    .option("user", "[username]")
    .option("password", "[password]")
    .load())
Listing 7-10

Reading data from the SQL Server Master Instance

If we look at some of the contents of the df_Iris dataframe, using the df_Iris.show(10) command, we should see that all the Iris species characteristics, as well as the species itself, are present in the dataframe (Figure 7-8).
Figure 7-8

df_Iris dataframe top ten rows

With our data inside a dataframe in Spark, we are almost ready to start doing some machine learning. The first thing we need to handle, though, is loading a number of Spark ML libraries as shown in the code in Listing 7-11.
# To perform machine learning tasks, we need to import a number of libraries
# In this case we are going to perform classification
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.feature import *
Listing 7-11

Loading machine learning libraries

In this case, since we are solving a so-called classification problem, we only need to import the pyspark.ml.classification library, together with the libraries we need to perform some modifications to the features (which, in this case, is another name for the columns of our dataframe) and to evaluate our model’s performance.

After the libraries are loaded, we are going to perform some modifications on our dataframe to make it suitable to work for our machine learning algorithm. Different machine learning algorithms have different requirements in terms of your data, for instance, some algorithms only work on numerical values as input, just like the classification algorithm we are using. The code of Listing 7-12 performs a number of tasks on our df_Iris dataframe.
# We are going to combine all the features we need to predict the Iris species
# into a single vector feature
feature_cols = df_Iris.columns[:-1]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_Iris = assembler.transform(df_Iris)
df_Iris = df_Iris.select("features", "Species")
# Since we are going to perform logistic regression, we are going to convert
# the string values inside species to a numerical value
label_indexer = StringIndexer(inputCol="Species", outputCol="label").fit(df_Iris)
df_Iris = label_indexer.transform(df_Iris)
Listing 7-12

Process the data so it is suitable for machine learning

The first code section combines the different features into a new column called “features.” All of these features are Iris species characteristics, and they are combined into a single format called a vector (we will take a look at how this looks a bit further on). The line feature_cols = df_Iris.columns[:-1] selects all the columns of the dataframe except the rightmost column, which is the actual species of the Iris plant.

In the second section, we are mapping the different Iris species to a numerical value. The algorithm we are going to use to predict the Iris species requires numerical input, which means we have to perform a conversion. This is not unusual in the realm of machine learning and data science. In many cases you have to convert a string value to a numerical value so the algorithm can work with it. After the conversion from string to numerical, we add a new column called “label” which contains the species in a numerical value.

In the next step, we select only the features and label columns from the df_Iris dataframe and return the top ten rows (the code of Listing 7-13 results in Figure 7-9) to give you an idea of how the data looks after the transformations we performed in the previous code segment.
# We only need the feature column and the label column
df_Iris = df_Iris.select("features", "label")
df_Iris.show(10)
Listing 7-13

Only select the features and label dataframe columns

Figure 7-9

modified df_Iris dataframe

As you can see from Figure 7-9, all of the features (Petal_Length, Petal_Width, etc.) have been transformed into a single vector inside a single column of our dataframe. The label column now contains a number for the species, 2.0 being setosa, 1.0 virginica, and 0.0 versicolor.
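
If you want to verify which species was mapped to which numeric label on your own system, you can inspect the fitted StringIndexer. The following sketch also shows how numeric predictions could be mapped back to species names after scoring (the predicted_species column name is our own choice):
# Inspect the index-to-species mapping of the fitted StringIndexer;
# position 0 in this list corresponds to label 0.0, and so on
print(label_indexer.labels)
# Optionally, map numeric predictions back to species names after scoring
from pyspark.ml.feature import IndexToString
label_converter = IndexToString(inputCol="prediction", outputCol="predicted_species",
                                labels=label_indexer.labels)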

Now that we have our entire dataframe converted into a format that is workable for our machine learning classification algorithm, we can split our data into a training dataframe and a testing dataframe like we did in the previous section as well. The code of Listing 7-14 handles the split in which 80% of the data goes into the Iris_train dataframe and the remaining 20% in the Iris_test dataframe.
# Split the dataset
(Iris_train, Iris_test) = df_Iris.randomSplit([0.8, 0.2])
Listing 7-14

Split the dataframe into a training and testing dataframe

Now that we have our datasets ready for training, we can start the actual machine learning phase. The first thing we need to do is to initialize the machine learning algorithm (Listing 7-15). In this part we can supply which algorithm we want to use and various parameters (also called hyperparameters) we want to configure during the training phase of the machine learning model.
# Initiate the classifier, in this case LogisticRegression
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)
Listing 7-15

Initiate the classifier

In this case we have chosen to use a logistic regression algorithm to try and predict which species of Iris a plant belongs to, based on its characteristics. We are going to ignore the algorithm parameters for now. When you are in the phase when you try to optimize and tune your model, you will frequently go back to the parameters (either manually or programmatically) and modify them to find the optimal setting.

Training the model is actually very easy and straightforward and, in this case, can be achieved by a single line of PySpark code (Listing 7-16).
# Train the multiclass model
model = lr.fit(Iris_train)
Listing 7-16

Train the model

After the code of Listing 7-16 has finished running, we have access to a trained machine learning model in the form of the variable “model.” We can then use the trained model to perform predictions on our test dataset to analyze how well it performed. Using the code of Listing 7-17, we are going to apply the trained model to our test dataset and return the top 20 results, which are shown in Figure 7-10.
# Predict on our test dataset using the model we trained
# and return the predictions
Iris_pred = model.transform(Iris_test)
Iris_pred.show(20)
Listing 7-17

Perform a prediction

Figure 7-10

Prediction results on our test dataset

As you can see in Figure 7-10, our model did a good job on the test dataset. In the top 20 rows that were returned by the command, only a single row had a prediction for a different species than the actual one (we predicted virginica while it should have been versicolor). While we could analyze each and every row to look for differences between the actual species and the predicted species, a far faster way to look at model performance is by using the Spark ML evaluation library which we loaded earlier.
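
If you do want to list only the rows the model got wrong, a small filter on the prediction and label columns does the job (a sketch using the column names from Listing 7-17):
# Show only the misclassified rows of the test dataset
from pyspark.sql.functions import col
Iris_pred.filter(col("prediction") != col("label")).show()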

The code of Listing 7-18 evaluates the model’s performance against our test dataset and measures it on the accuracy metric. Accuracy is frequently used to measure how well a classification model is performing and is the number of correct predictions divided by the total number of predictions.
# How good did our model perform?
evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
accuracy = evaluator.evaluate(Iris_pred)
print("Accuracy: " + format(accuracy))
Listing 7-18

Measuring model performance

The results the preceding code returns will probably vary each time you run it. This is because the dataset we are using is rather small and we perform a randomized split, which means the mix of species that ends up in the training and testing datasets has a huge influence on model performance. We ended up with the results shown in Figure 7-11, which is quite a respectable level of accuracy.
Figure 7-11

Accuracy of our trained model
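
If you want the split, and therefore the accuracy figure, to be repeatable between runs, randomSplit accepts an optional seed (the value 42 below is an arbitrary example):
# Passing a seed makes the random split reproducible between runs
(Iris_train, Iris_test) = df_Iris.randomSplit([0.8, 0.2], seed=42)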

With our model trained and tested, we can take additional steps depending on what we are planning to do with the model. If we are interested in optimizing model performance further, we could go back and tune our algorithm parameters before training the model again. Perhaps it would also be useful, in this scenario, to look at how good the split is between the training and test datasets, since that has a huge impact on the model accuracy, and there are a hundred more things we could do to optimize our model even further if we wanted to (even selecting a different algorithm to see if it predicts better than the current one).
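
As a sketch of what programmatic tuning could look like with the Spark ML tuning library (the grid values below are arbitrary examples, not recommendations), you could search a small parameter grid with cross-validation and keep the best model:
# Search a small parameter grid with cross-validation and keep the best model
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.01, 0.1])
              .addGrid(lr.maxIter, [10, 50])
              .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=param_grid,
                    evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                    numFolds=3)
cv_model = cv.fit(Iris_train)
best_model = cv_model.bestModel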

Another thing we could do is store the model. We are way more flexible in that area than inside SQL Server In-Database Machine Learning Services where the model had to be serialized and stored inside a table. In the case of Spark, we can choose different methods and libraries to store our models. For instance, we can use a library called Pickle to store our model on the filesystem, or use the .save function on the model variable to store it on an HDFS location of our choosing. Whenever we need our trained model to score new data, we can simply load it from the filesystem and use it to score the new data.
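
A small sketch of that second option, saving the model to and loading it back from HDFS (the path below is an arbitrary example):
# Persist the trained model to HDFS and load it back later to score new data
from pyspark.ml.classification import LogisticRegressionModel
model.save("/models/iris_logistic_regression")
# In a later session, load the model again and use it for scoring
loaded_model = LogisticRegressionModel.load("/models/iris_logistic_regression")
new_predictions = loaded_model.transform(Iris_test)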

Summary

In this chapter we explored the various methods available to perform machine learning tasks inside SQL Server Big Data Clusters. We looked at SQL Server In-Database Machine Learning Services which allowed us to train, utilize, and store machine learning models directly inside the SQL Server Master Instance using a combination of T-SQL queries and the new sp_execute_external_script procedure. In the Spark department, we also have a wide variety of machine learning capabilities available to use. We used the Spark ML library to train a model on a dataframe and used it to score new data. Both of the methods have their strengths and weaknesses, but having both of these solutions available inside a single box allows optimal flexibility for all our machine learning needs.
