In the preceding sections, we trained several models. Now, we will compose them into an ensemble called a super learner, using a deep learning model as the meta-learner. The process to build a super learner is straightforward (see the preceding figure):
- Select base algorithms (for example, GLM, random forest, GBM, and so on).
- Select a meta-learning algorithm (for example, deep learning).
- Train each of the base algorithms on the training set.
- Perform K-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the base algorithms.
- The N cross-validated predicted values from each of the L base algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the "level-one" data.
- Train the meta-learning algorithm on the level-one data.
- The super learner (or so-called "ensemble model") consists of the L base learning models and the meta-learning model, which can then be used to generate predictions on a test set.
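The level-one data construction described in the steps above can be sketched in plain Scala. The values below are purely illustrative; in practice, each column of predictions would come from K-fold cross-validation of a base learner:

```scala
object LevelOneSketch {
  // Hypothetical cross-validated predictions from L = 3 base learners
  // for N = 5 rows (illustrative values, not real model output).
  val cvPredictions: Seq[Seq[Double]] = Seq(
    Seq(0.1, 0.9, 0.2, 0.8, 0.7), // learner 1
    Seq(0.2, 0.8, 0.3, 0.9, 0.6), // learner 2
    Seq(0.0, 1.0, 0.1, 0.7, 0.8)  // learner 3
  )
  // The original response vector.
  val labels: Seq[Double] = Seq(0, 1, 0, 1, 1)

  // Transpose the L x N prediction lists into an N x L matrix and pair
  // each row with its label to form the "level-one" training data.
  val levelOne: Seq[(Seq[Double], Double)] =
    cvPredictions.transpose.zip(labels)
}
```

The meta-learner is then trained on `levelOne` exactly as on any ordinary feature matrix with a response column.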
The key trick of ensembles is to combine a diverse set of strong learners. We already discussed a similar trick in the context of the random forest algorithm.
In our example, we will simplify the whole process by skipping cross-validation and using a single hold-out dataset instead. It is important to mention that this is not the recommended approach!
As the first step, we use the trained models and a transfer dataset to get predictions and compose them into a new dataset, augmenting it with the actual labels.
This sounds easy; however, we cannot use the DataFrame#withColumn method to create a new DataFrame from columns of multiple different datasets, since the method accepts only columns from the calling DataFrame or constant columns.
Nevertheless, we have already prepared the dataset for this situation by assigning a unique ID to each row. We will use it here to join the individual model predictions on row_id. We also need to rename each model's prediction column so that it uniquely identifies the model inside the dataset:
import org.apache.spark.ml.PredictionModel
import org.apache.spark.sql.DataFrame

val models = Seq(("NB", nbModel), ("DT", dtModel), ("RF", rfModel), ("GBM", gbmModel))

def mlData(inputData: DataFrame, responseColumn: String,
           baseModels: Seq[(String, PredictionModel[_, _])]): DataFrame = {
  baseModels.map { case (name, model) =>
      model.transform(inputData)
        .select("row_id", model.getPredictionCol)
        .withColumnRenamed("prediction", s"${name}_prediction")
    }
    .reduceLeft((a, b) => a.join(b, Seq("row_id"), "inner"))
    .join(inputData.select("row_id", responseColumn), Seq("row_id"), "inner")
}

val mlTrainData = mlData(transferData, "label", models).drop("row_id")
mlTrainData.show()
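The reduceLeft join pattern used in mlData can be illustrated with a small pure-Scala sketch (toy data and hypothetical names; the real code joins Spark DataFrames on row_id):

```scala
object JoinSketch {
  // Each model yields a map row_id -> prediction (illustrative values).
  val nb = Map(1 -> 0.0, 2 -> 1.0, 3 -> 1.0) // NB_prediction
  val dt = Map(1 -> 0.0, 2 -> 1.0, 3 -> 0.0) // DT_prediction

  // Wrap each prediction in a Seq so joined rows can accumulate columns.
  val preds: Seq[Map[Int, Seq[Double]]] =
    Seq(nb, dt).map(_.map { case (k, v) => k -> Seq(v) })

  // Fold inner joins over the shared key: the result has one row per id,
  // carrying every model's prediction, mirroring the DataFrame joins.
  val joined: Map[Int, Seq[Double]] =
    preds.reduceLeft { (a, b) =>
      a.keySet.intersect(b.keySet).map(k => k -> (a(k) ++ b(k))).toMap
    }
}
```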
The table is composed of the models' predictions and annotated with the actual label. It is interesting to see how the individual models agree or disagree on the predicted value.
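One simple way to quantify such agreement is the fraction of rows on which two prediction columns match. A hedged sketch on toy values (the column contents are illustrative, not real model output):

```scala
object AgreementSketch {
  // Toy prediction columns, mirroring the renamed "<name>_prediction"
  // columns produced by mlData (illustrative values only).
  val nb  = Seq(0.0, 1.0, 1.0, 0.0) // NB_prediction
  val gbm = Seq(0.0, 1.0, 0.0, 0.0) // GBM_prediction

  // Fraction of rows where both models predict the same class.
  val agreement: Double =
    nb.zip(gbm).count { case (a, b) => a == b }.toDouble / nb.size
}
```

Diversity among base learners (that is, agreement well below 1.0 while each model is individually accurate) is exactly what gives the meta-learner room to improve on any single model.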
We can use the same transformation to prepare a validation dataset for our super learner:
val mlTestData = mlData(validationData, "label", models).drop("row_id")
Now, we can build our meta-learner. In this case, we will use the deep learning algorithm provided by the H2O machine learning library. However, it needs a little bit of preparation: we need to publish the prepared train and test data as H2O frames:
import org.apache.spark.h2o._
val hc = H2OContext.getOrCreate(sc)
val mlTrainHF = hc.asH2OFrame(mlTrainData, "metaLearnerTrain")
val mlTestHF = hc.asH2OFrame(mlTestData, "metaLearnerTest")
We also need to transform the label column into a categorical column. This is necessary; otherwise, the H2O deep learning algorithm would perform regression since the label column is numeric:
import water.fvec.Vec
val toEnumUDF = (name: String, vec: Vec) => vec.toCategoricalVec
mlTrainHF(toEnumUDF, 'label).update()
mlTestHF(toEnumUDF, 'label).update()
Now, we can build an H2O deep learning model. We can directly use the Java API of the algorithm; however, since we would like to compose all the steps into a single Spark pipeline, we will utilize a wrapper exposing the Spark estimator API:
val metaLearningModel = new H2ODeepLearning()(hc, spark.sqlContext)
.setTrainKey(mlTrainHF.key)
.setValidKey(mlTestHF.key)
.setResponseColumn("label")
.setEpochs(10)
.setHidden(Array(100, 100, 50))
.fit(null)
Since we directly specified the validation dataset, we can explore the performance of the model:
Alternatively, we can open the H2O Flow UI (by calling hc.openFlow) and explore its performance in visual form:
You can easily see that the AUC for this model on the validation dataset is 0.868619, which is higher than the AUC of any of the individual models.
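To make the reported metric concrete, AUC can be computed directly from scores and labels via the rank-based (Mann-Whitney) formulation. A minimal sketch on toy data, independent of the H2O API:

```scala
object AucSketch {
  // AUC = probability that a randomly chosen positive example is scored
  // higher than a randomly chosen negative one (ties count as 0.5).
  def auc(scores: Seq[Double], labels: Seq[Int]): Double = {
    val pos = scores.zip(labels).collect { case (s, 1) => s }
    val neg = scores.zip(labels).collect { case (s, 0) => s }
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size * neg.size)
  }
}
```

A perfect ranker scores every positive above every negative and reaches 1.0; a model comparable to the super learner here would land around 0.87 on its validation scores.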