Using Spark machine learning or H2O Sparkling Water

If your objective is to use Spark as a tool for your machine learning projects, this section will show you how to do so from within R. Spark ships with its own machine learning library, which can be accessed through sparklyr, so it's pretty simple to run your machine learning projects on Spark. The website https://spark.rstudio.com/mlib/ gives a good presentation of the Spark machine learning library, so remember to visit it to discover the many available functions; it also includes a short, effective example workflow that shows how to sequence your project code. We already talked about machine learning in Chapter 6, Machine Learning with R; in this section, I'm only going to revisit the decision tree studies from that chapter, but now using the Spark machine learning library.

The Spark machine learning library is part of Spark itself, not an R package.

First, let's load sparklyr, start a connection, and copy the Chile DataFrame from the car package into our Spark cluster:

library(sparklyr)
library(dplyr)
library(car)

#sc <- spark_connect(master = 'local') #Only if you aren't connected
dt_chile <- copy_to(sc, Chile, 'chile')
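
If you want to confirm that the copy reached the cluster, a quick check is possible; this sketch assumes the connection sc from the preceding code is live, and src_tbls() comes from dplyr:

```r
# Assumes the Spark connection `sc` and the dt_chile tbl from above.
src_tbls(sc)    # the list of tables Spark knows about should include "chile"
head(dt_chile)  # peek at the first rows without collecting the whole table
```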

Now we're going to clean our data as we did in Chapter 6, Machine Learning with R, but with sparklyr, and split it into two partitions:

dt_chile <- na.omit(dt_chile)
partitions <- dt_chile %>% sdf_partition(training = 0.7, test = 0.3, seed = 50)

chile_training <- partitions$training
chile_test <- partitions$test
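
If na.omit() is new to you, here is a minimal, Spark-free illustration on a toy local data frame; sparklyr applies the same row-dropping logic to the Spark DataFrame:

```r
# Toy data frame with missing values in two different rows
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

clean <- na.omit(df)  # drops every row containing at least one NA
nrow(clean)           # 1 -- only the first row is complete
```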

Pretty simple, right? Removing the rows that contain at least one NA value took little effort; we did it with the na.omit() function. Creating a test sample with 30% of our observations was even easier: we just called the sparklyr function sdf_partition() to create and store the fractions in the partitions object. seed = 50 was set to make the example reproducible. Let's now train and evaluate a model using the following code:

dt_chile_ML <- chile_training %>% ml_decision_tree(vote ~ ., seed = 50)
dt_chile_pred <- sdf_predict(chile_test, dt_chile_ML)
ml_multiclass_classification_evaluator(dt_chile_pred)
# [1] 0.5965722

The dt_chile_ML object was created by passing the chile_training data to the ml_decision_tree() function, which trains a decision tree model. Then, we created the dt_chile_pred object to store the prediction results from calling sdf_predict() on the test dataset. Finally, we evaluated our results by calling the ml_multiclass_classification_evaluator() function.
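
Beyond the single evaluator score, you may want to see where the model errs. A hedged sketch: collect the predictions into R and cross-tabulate them. The predicted_label column name is an assumption about what sdf_predict() adds in your sparklyr version, so check colnames(dt_chile_pred) first:

```r
# Assumes dt_chile_pred from above; `predicted_label` is an assumed column name.
pred_local <- collect(dt_chile_pred)
table(actual = pred_local$vote,
      predicted = pred_local$predicted_label)
```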

If you want to work with the H2O distributed machine learning algorithms through sparklyr, you'll need the rsparkling package, so make sure you have it installed. The initial code is the same as when using the Spark machine learning library, so copy and paste the previous code, adding the respective library() commands. The code lines preceded by the # symbol don't need to be rerun unless you have started a new R session:

library(sparklyr); library(rsparkling)
library(dplyr); library(h2o)
library(car)


#only if you started a new R session since previous codes
#sc <- spark_connect(master = 'local')
#dt_chile <- copy_to(sc, Chile, 'chile')
#dt_chile <- na.omit(dt_chile)
#partitions <- dt_chile %>% sdf_partition(training = 0.7, test = .3, seed = 50)

When running this code, you may run into a compatibility problem. To solve it, install the right H2O version by running the install.packages() call shown inside the error message, then restart your R session. After that, create the training and test objects with the following code:

chile_training_h2o <- as_h2o_frame(sc, partitions$training, 
strict_version_check = FALSE)
chile_test_h2o <- as_h2o_frame(sc, partitions$test,
strict_version_check = FALSE)
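
Before training, it can be worth confirming that the conversion worked; h2o.dim() and h2o.describe() are standard h2o package functions, and this sketch assumes the H2O cluster started by rsparkling is running:

```r
# Assumes chile_training_h2o from above and a running H2O cluster.
h2o.dim(chile_training_h2o)       # number of rows and columns
h2o.describe(chile_training_h2o)  # per-column types, missing counts, ranges
```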

The preceding as_h2o_frame() calls convert the Spark DataFrames into H2O frames. We set strict_version_check to FALSE to skip the version cross-check. Let's train the data with the H2O machine learning functions:

dt_chile_ML <- h2o.gbm(y = "vote", training_frame = as.factor(chile_training_h2o),
                       model_id = "ModelTree")
dt_chile_pred <- h2o.performance(dt_chile_ML, newdata = as.factor(chile_test_h2o))

h2o.mse(dt_chile_pred)
# [1] 0.5894832

The h2o.* package functions are responsible for training and evaluating our data. The h2o.gbm() function trains a tree-based (gradient boosting) model; it requires numerical, categorical, or factor data, so we wrapped our training data in the as.factor() function, and we set model_id = "ModelTree" to name the training model. Furthermore, h2o.performance() was used to check our model's performance on the test data, also converted to factors. Finally, h2o.mse() was used to display the performance result. Both the Spark machine learning library and H2O Sparkling Water require learning their own sets of functions, so you will have to explore them a bit by yourself. You can start here: https://spark.rstudio.com/.
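
Two more h2o functions worth knowing, sketched here under the assumption that the dt_chile_ML model and dt_chile_pred metrics object from the code above exist: h2o.confusionMatrix() breaks the errors down per class, and h2o.varimp() ranks the predictors by importance.

```r
# Assumes the model and performance objects created above.
h2o.confusionMatrix(dt_chile_pred)  # per-class hits and misses on the test data
h2o.varimp(dt_chile_ML)             # which predictors drove the boosted trees
```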
