Building an experiment that uses R

Every meaningful data science project starts with a question, a sense of purpose. What problem are you trying to solve? Who are you trying to save? What do you wish to accomplish? During play (learning) time, you can afford to be aimless, since you are usually focused on the technicalities. When playtime is over, seek meaning.

Once you have the big question, only then do you reach out for data. Our goal here will be to develop a model capable of predicting whether a blood donor will donate again in a given month. For this, the Blood donation data dataset, made available by Microsoft Azure, will be used.

Clear some room by deleting any modules that may still be inside the experiment: select them and hit Delete, or right-click them and choose Delete. Click inside the search bar on the left side and query Blood donation data. The module that shows up corresponds to a dataset.

Blood donation data can also be found under Saved Datasets | Samples.

Once you can see the Blood donation data module, click it and drag it into the middle (grey) area. Now the dataset is ready to be used but, first things first, let's get started by visualizing it. Place the mouse pointer right over the dataset's output circle, right-click it, and select Visualize, as shown in the following screenshot:

Figure 13.10: Visualizing a dataset

The first 100 rows will be displayed along with the number of rows and columns. If you happen to click a column, more information will be displayed about that variable. The following screenshot demonstrates the information displayed when the column Recency is selected:

Some data will require you to run the experiment before you're able to visualize it.

Figure 13.11: Information about the Recency variable

The Blood donation data module comes with 748 rows and five columns. The variables are as follows:

  • Recency: Months since the last donation
  • Frequency: Total number of donations
  • Monetary: Total blood donated in cc
  • Time: Months since the first donation
  • Class: A binary variable where one means the donor gave blood in March 2007 and zero means no donation

Checking whether there are missing values is the least we can do. Fortunately, this dataset has no missing values, so that is not something we will have to deal with; if it did, Studio has a module to handle missing values. Later, we will commit to a boosting model, so, unless you feel that a principal component analysis may be needed, there is no reason to rescale the data.
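If you want to double-check this on your own machine before uploading anything, a quick base R inspection does the job. The following is only a local sketch; blood_donation.csv is a hypothetical export of the dataset, and donations simply stands in for a local copy of the data:

# Local sketch only; blood_donation.csv is a hypothetical file name
donations <- read.csv("blood_donation.csv")

colSums(is.na(donations))        # missing values per column (all zeros here)
mean(!complete.cases(donations)) # share of rows with at least one missing value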

The next step will be to split the data (training and test):

  1. Search for the Split Data module on the left-side blade (Data Transformation | Sample and Split)
  2. Drag it to the middle area, preferably under the Blood donation data module
  3. Connect the output of Blood donation data to the input of the Split Data module

When you click the Split Data module, a set of options regarding this module should appear on the right side (the Properties pane); if it doesn't, look for the small arrow pointing left in the top-right corner, right below the icon you click to SIGN OUT, and click it. The following screenshot shows what it looks like:

Figure 13.12: Split Data options

For the second option, Fraction of rows in the first output dataset, the default value is 0.5, which means the data is split 50/50. Change this to 0.8 so that the data will be split 80/20. The 80% of the dataset delivered by the first output will later be used for training; the other 20%, handled by the second output, will be used during evaluation.
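For intuition, the same 80/20 split could be reproduced in plain R along the following lines. This is only a sketch, again using donations as a hypothetical local copy of the dataset:

set.seed(13)                                   # reproducibility
n <- nrow(donations)
train_idx <- sample(n, size = round(0.8 * n))  # 80% of the row indices
train_set <- donations[train_idx, ]            # goes to training
test_set  <- donations[-train_idx, ]           # held out for evaluation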

Training data can be handed to a module called Train Model:

  1. Search for the Train Model module on the left-side palette (Machine Learning | Train Model)
  2. Drag and drop it into the central area
  3. Connect the first output of Split Data to the second input of Train Model

The Train Model module will display an alert, a white exclamation mark within a red circle. To function properly, the label column (some will recognize it as the target variable) must be specified. Select the module. On the right side, where the split rate was previously adjusted, Train Model will display an option for the label column. Hit the Launch column selector button to pick the label.

All the variables but the one chosen as the label will be considered features by the module. Our target here will be the binary variable called Class. Start typing Class, and soon this variable is displayed; pick it (as shown in the following screenshot) and click the check mark at the bottom-right corner:

Figure 13.13: Selecting the label column

Even after selecting the label column, the alert doesn't go away. Train Model needs another input. We're going for an R model:

  1. Search for the module Create R Model on the left side (R language modules)
  2. Drag it to the central area
  3. Connect the output of Create R Model to the first input of Train Model

In the Properties pane, the Create R Model module comes with two scripts: Trainer R script and Scorer R script. That is where the code must go. Trainer R script receives the data inputted into the related Train Model module as a variable named dataset, and must generate a proper model to be stored in a variable named model.

Detailed information on individual modules can be found in Azure's documentation. For example, detailed info about Create R Model can be found at https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/create-r-model.

The Scorer R script has access to the variables dataset and model. Its goal is to compute scores for the corresponding model and store them in a variable called scores. Let's start with the Trainer R script. Select the Create R Model module, and then click the small stacked-windows icon right next to Trainer R script to open the editor, as shown here:

Figure 13.14: Opening the Trainer R script editor

After clicking the icon next to Trainer R script, a new window will pop up, where the R model can be designed. It comes with a sample code such as the following:

# Input: dataset
# Output: model

# The code below is an example which can be replaced with your own code.
# See the help page of "Create R Model" module for the list of predefined functions and constants.

library(e1071)
features <- get.feature.columns(dataset)
labels <- as.factor(get.label.column(dataset))
train.data <- data.frame(features, labels)
feature.names <- get.feature.column.names(dataset)
names(train.data) <- c(feature.names, "Class")
model <- naiveBayes(Class ~ ., train.data)

It starts by loading the e1071 package. Not all packages are supported by Machine Learning Studio; the following link lists all the supported ones: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/r-packages-supported-by-azure-machine-learning.
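If in doubt about a specific package, one way to see what the environment actually ships with is to list the installed packages from inside an Execute R Script module. This is only a hedged sketch; maml.mapOutputPort() expects a data frame, so the result is wrapped accordingly:

# Sketch: list the packages available in the Studio sandbox
out <- as.data.frame(installed.packages()[, c("Package", "Version")])
maml.mapOutputPort("out");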

There are a number of reasons for a package not to be supported, ranging from Java dependencies and incompatibilities between package binaries and the sandbox environment to the requirement of direct internet access. The sample code uses several functions that are particular to Machine Learning Studio.

Recall that the module Train Model can distinguish labels (Y) from features (X) given that we selected the label column (Class); also, keep in mind that dataset is the variable name given to the actual data inputted to Train Model. Here is the explanation for some of the functions from the sample code:

  • get.feature.columns(dataset) gets the columns from the dataset that are supposed to act as features
  • as.factor(get.label.column(dataset)) gets the column that is supposed to be the label (or target) from the dataset and makes sure its type is factor
  • get.feature.column.names(dataset) gets the names of the feature columns

Except for as.factor(), all of these functions are particular to Machine Learning Studio and work very well in conjunction with Train Model. There are other ways of loading and arranging the data, but the way the sample code does it is actually very good and tends to avoid problems.
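To make those helpers a little less opaque, here is a rough base R analogy of what they return, under the assumption that the label column is named Class. This is only an illustration of the idea, not the helpers' actual implementation:

# Rough local analogy, assuming a data frame named dataset with a Class label column
feature.names <- setdiff(names(dataset), "Class")  # ~ get.feature.column.names(dataset)
features      <- dataset[, feature.names]          # ~ get.feature.columns(dataset)
labels        <- as.factor(dataset$Class)          # ~ as.factor(get.label.column(dataset))
train.data    <- data.frame(features, labels)
names(train.data) <- c(feature.names, "Class")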

The last line of the sample script fits a Naive Bayes model from the e1071 package and stores it in a variable called model. This pretty much explains how Trainer R script is supposed to work: take the data coming in through dataset, handle it, and use it to fit a compatible model that must be stored in a variable named model.

Adapt the sample code so that it looks like the following:

library(gbm)
set.seed(13)                              # for reproducibility
model <- gbm(formula = Class ~ .,         # Class is the label; everything else is a feature
             data = dataset,
             distribution = 'bernoulli',  # binary (0/1) outcome
             n.trees = 3*10^4)            # 30,000 trees

The drill that comes with the sample code is actually very good, but I prefer to stick with simplicity. You can use the hash sign (#) to document the code as well. The model variable is now storing a generalized boosted regression model fitted with 30,000 trees. After making the modifications, hit the check mark near the bottom-right corner again.
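If you are curious whether all of those trees are really needed, a local experiment with the same package can give a hint. The sketch below assumes train_set is a local data frame holding the 80% training split with a 0/1 Class column; gbm.perf() then returns an out-of-bag estimate of the optimal number of trees. Fitting 30,000 trees locally can take a while:

library(gbm)

set.seed(13)
local_model <- gbm(Class ~ .,
                   data = train_set,
                   distribution = 'bernoulli',
                   n.trees = 3*10^4)

# Out-of-bag estimate of how many trees are actually useful
best_n <- gbm.perf(local_model, method = "OOB", plot.it = FALSE)
best_n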

It's time to adjust the Scorer R script. Click the icon right next to the words Scorer R script. The following is the code that comes by default:

# Input: model, dataset
# Output: scores

# The code below is an example which can be replaced with your own code.
# See the help page of "Create R Model" module for the list of predefined functions and constants.

library(e1071)
probabilities <- predict(model, dataset, type="raw")[,2]
classes <- as.factor(as.numeric(probabilities >= 0.5))
scores <- data.frame(classes, probabilities)

The Scorer R script takes the variables model, coming from the Trainer R script, and dataset, the data that will later be inputted into the Score Model module, and must output a DataFrame named scores. Such a variable, as we will see later, will tag along with the dataset inputted into the Score Model module and is expected to carry the predicted values, although it could carry anything, really.

Be extra cautious while designing a Scorer R script. Almost anything can go through the scores variable, and it is not that uncommon to mistake failures for wins.

Adapting the sample code for the gbm model, here is my suggestion:

library(gbm)
probabilities <- predict.gbm(model,
                             newdata = dataset,
                             n.trees = 3*10^4,    # use all 30,000 trees
                             type = 'response')   # return probabilities
class <- as.factor(as.numeric(probabilities >= 0.5))  # 50% threshold
scores <- data.frame(Predicted = class, Probabilities = probabilities)

The preceding code uses the previously defined model to generate predictions. First, it scores each individual's probability of donating again, the outcome represented by the number one. Using a threshold of 50%, the predicted class of each individual is also scored. Both classes and probabilities are then gathered in a DataFrame named scores.

Scorers are meant to work with a specific module, the Score Model module, which is the one to set up next:

  1. In the modules menu (left-side menu), query for Score Model (it will be under Machine Learning | Score Model)
  2. Drag and drop it in the center, preferably somewhere under the Train Model
  3. Connect the output of Train Model to the first input of Score Model
  4. Connect the second output from Split Data to the second input of Score Model

Done. Our experiment is loading the Blood donation data, splitting it into training and test sets (80/20), selecting the Class column as the label (target), training a model using the R gbm package, and scoring it on the test set. After connecting the dots and setting the parameters, the following screenshot shows how my experiment ended up looking:

Figure 13.15: Azure Machine Learning Studio, experimental draft

We're not finished yet but, for now, look for the RUN button in the bottom menu and hit it. You can run your experiment at any stage. A clock icon marks the modules waiting in the queue. Modules successfully processed are marked with a green check mark; failures are marked with a red X.

Check whether everything ran OK. Right-click the output circle of the Score Model module and select Visualize to retrieve the predictions made on the test dataset. The result should look very similar to Figure 13.16: predicted values appear in the Predicted column, while the Probabilities column displays each (estimated) probability:

Figure 13.16: Predicted values for the test set

Usually, a module called Evaluate Model would come after Score Model but, at the time of writing, the former wouldn't easily work with models designed by Create R Model. A solution would be to place an Execute R Script module in between Score Model and Evaluate Model. Instead, I prefer to compute and output the evaluation within Execute R Script itself, thus removing the need for the Evaluate Model module:

  1. Search for Execute R Script under the menu on the left (R language modules)
  2. Drag it into the central area, preferably under Score Model
  3. Connect the output of Score Model to the first input (starting on the left) of Execute R Script

We have almost finished. The reason the output of Score Model must be connected to the first input of Execute R Script (counting from the left) is that the code assumes so. Here is the code that I used to compute the hit rate for the test set:

# Grab the dataset connected to the first input port
test_set <- maml.mapInputPort(1)

# Share of test rows where the predicted class matches the actual class
hit_rate <- mean(test_set$Class == test_set$Predicted)
hit_rate <- data.frame(hit_rate)   # outputs must be data frames

maml.mapOutputPort("hit_rate");

After selecting the Execute R Script module in the experiment area, click the small stacked-windows icon right next to R script to modify the code. After running the experiment, the hit rate will be expressed as a dataset, which can be visualized by right-clicking the first output circle (starting on the left) of Execute R Script.
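A single hit rate can hide where the model goes wrong. If you want a bit more detail, the same pattern can output a confusion matrix instead; this is just a sketch, assuming the column names produced by the scores data frame built earlier:

# Sketch: a confusion matrix instead of (or next to) the hit rate
test_set <- maml.mapInputPort(1)

confusion <- as.data.frame(table(Actual    = test_set$Class,
                                 Predicted = test_set$Predicted))

maml.mapOutputPort("confusion");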

If all went well, the hit rate for the test sample should be around 84%. From there, you may want to make a few modifications to turn the experiment into an application, or tune some of the parameters to improve the result. The following screenshot shows how my final diagram looked by the time I had finished with it:

Figure 13.17: Experiment's final result

Simply saving the model is another option. It's also worth highlighting that the third input of Execute R Script takes files in the .zip format. This allows users to do many things, such as bringing in more data, other R scripts, and even entire packages, although not every package is guaranteed to be compatible with Azure Machine Learning Studio.

This section introduced a very minimal example, drawn up especially to walk the reader through R models in Azure Machine Learning Studio. There are two things I wish to emphasize: one, there is much more to learn about it; this small exercise was only meant to get you in touch with R in Azure Machine Learning Studio. Two, the platform was designed to deal with far more complex problems, and it gracefully enables teams to work well together.

Before closing up this chapter with a quiz, I have some cloud-related tips to share.

First, try a small-scale local version of what you are going to do in the cloud before actually performing any cloud operations. By doing so, you may anticipate problems that would otherwise only be noticed on the fly; while local errors may only cost you time, cloud errors might cost both time and money.

Second, always seek efficiency. Is the data larger before preprocessing than after? If the answer is yes, consider preprocessing the data locally and only uploading the result. Being inefficient is costly both in the cloud and locally, but inefficiency in the former is more likely to lead you to bankruptcy.

Third, program defensively. Notably, when the final goal is an application, being ready to deal with unwanted data types and exceptions pays off. Defensive programming combined with extensive documentation might also prevent you and your colleagues from making big mistakes, even if the final result is not an application.

Despite the fact that cloud computing is more challenging than local computing, it's a real thing, and knowing how to do it right is a game changer. Embrace the challenge. 
