In the preceding sections, we trained several models. Now, we will compose them into an ensemble called a super learner, using a deep learning model as the meta-learner. The process to build a super learner is straightforward (see the preceding figure):
- Select base algorithms (for example, GLM, random forest, GBM, and so on).
- Select a meta-learning algorithm (for example, deep learning).
- Train each of the base algorithms on the training set.
- Perform K-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the base algorithms.
- The N cross-validated predicted values from each of the L base algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the "level-one" data.
- Train the meta-learning algorithm on the level-one data.
- The super learner (or so-called "ensemble model") consists of the L base learning models and the meta-learning model, which can then be used to generate predictions on a test set.
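The level-one data construction described in the steps above can be sketched in plain Scala. The values below are purely illustrative; in practice, each column of predictions would come from K-fold cross-validation of a base learner:

```scala
object LevelOneSketch {
  // Hypothetical cross-validated predictions from L = 3 base learners
  // for N = 5 rows (illustrative values, not real model output).
  val cvPredictions: Seq[Seq[Double]] = Seq(
    Seq(0.1, 0.9, 0.2, 0.8, 0.7), // learner 1
    Seq(0.2, 0.8, 0.3, 0.9, 0.6), // learner 2
    Seq(0.0, 1.0, 0.1, 0.7, 0.8)  // learner 3
  )
  // The original response vector.
  val labels: Seq[Double] = Seq(0, 1, 0, 1, 1)

  // Transpose the L x N prediction lists into an N x L matrix and pair
  // each row with its label to form the "level-one" training data.
  val levelOne: Seq[(Seq[Double], Double)] =
    cvPredictions.transpose.zip(labels)
}
```

The meta-learner is then trained on `levelOne` exactly as on any ordinary feature matrix with a response column.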
The key trick of ensembles is to combine a diverse set of strong learners. We already discussed a similar trick in the context of the random forest algorithm.
In our example, we will simplify the whole process by skipping cross-validation and using a single hold-out dataset instead. It is important to mention that this is not the recommended approach!
As the first step, we use the trained models and a transfer dataset to get predictions and compose them into a new dataset, augmenting it with the actual labels.
This sounds easy; however, we cannot use the DataFrame#withColumn method to create a new DataFrame from columns of multiple different datasets, since the method accepts only columns from the calling DataFrame or constant columns.
Nevertheless, we have already prepared the dataset for this situation by assigning a unique ID to each row. We will use it here to join the individual model predictions on row_id. We also need to rename each model's prediction column so that it uniquely identifies the model inside the dataset:
import org.apache.spark.ml.PredictionModel
import org.apache.spark.sql.DataFrame

val models = Seq(("NB", nbModel), ("DT", dtModel), ("RF", rfModel), ("GBM", gbmModel))

def mlData(inputData: DataFrame, responseColumn: String,
           baseModels: Seq[(String, PredictionModel[_, _])]): DataFrame = {
  baseModels.map { case (name, model) =>
      model.transform(inputData)
        .select("row_id", model.getPredictionCol)
        .withColumnRenamed("prediction", s"${name}_prediction")
    }
    .reduceLeft((a, b) => a.join(b, Seq("row_id"), "inner"))
    .join(inputData.select("row_id", responseColumn), Seq("row_id"), "inner")
}

val mlTrainData = mlData(transferData, "label", models).drop("row_id")
mlTrainData.show()
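The reduceLeft join pattern used in mlData can be illustrated with a small pure-Scala sketch (toy data and hypothetical names; the real code joins Spark DataFrames on row_id):

```scala
object JoinSketch {
  // Each model yields a map row_id -> prediction (illustrative values).
  val nb = Map(1 -> 0.0, 2 -> 1.0, 3 -> 1.0) // NB_prediction
  val dt = Map(1 -> 0.0, 2 -> 1.0, 3 -> 0.0) // DT_prediction

  // Wrap each prediction in a Seq so joined rows can accumulate columns.
  val preds: Seq[Map[Int, Seq[Double]]] =
    Seq(nb, dt).map(_.map { case (k, v) => k -> Seq(v) })

  // Fold inner joins over the shared key: the result has one row per id,
  // carrying every model's prediction, mirroring the DataFrame joins.
  val joined: Map[Int, Seq[Double]] =
    preds.reduceLeft { (a, b) =>
      a.keySet.intersect(b.keySet).map(k => k -> (a(k) ++ b(k))).toMap
    }
}
```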
The table is composed of the models' predictions and annotated with the actual label. It is interesting to see how the individual models agree or disagree on the predicted value.
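One simple way to quantify such agreement is the fraction of rows on which two prediction columns match. A hedged sketch on toy values (the column contents are illustrative, not real model output):

```scala
object AgreementSketch {
  // Toy prediction columns, mirroring the renamed "<name>_prediction"
  // columns produced by mlData (illustrative values only).
  val nb  = Seq(0.0, 1.0, 1.0, 0.0) // NB_prediction
  val gbm = Seq(0.0, 1.0, 0.0, 0.0) // GBM_prediction

  // Fraction of rows where both models predict the same class.
  val agreement: Double =
    nb.zip(gbm).count { case (a, b) => a == b }.toDouble / nb.size
}
```

Diversity among base learners (that is, agreement well below 1.0 while each model is individually accurate) is exactly what gives the meta-learner room to improve on any single model.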
We can use the same transformation to prepare a validation dataset for our super learner:
val mlTestData = mlData(validationData, "label", models).drop("row_id")
Now, we can build our meta-learner. In this case, we will use the deep learning algorithm provided by the H2O machine learning library. However, it needs a little bit of preparation: we need to publish the prepared train and test data as H2O frames:
import org.apache.spark.h2o._
val hc = H2OContext.getOrCreate(sc)
val mlTrainHF = hc.asH2OFrame(mlTrainData, "metaLearnerTrain")
val mlTestHF = hc.asH2OFrame(mlTestData, "metaLearnerTest")
We also need to transform the label column into a categorical column. This is necessary; otherwise, the H2O deep learning algorithm would perform regression since the label column is numeric:
import water.fvec.Vec
val toEnumUDF = (name: String, vec: Vec) => vec.toCategoricalVec
mlTrainHF(toEnumUDF, 'label).update()
mlTestHF(toEnumUDF, 'label).update()
Now, we can build an H2O deep learning model. We can directly use the Java API of the algorithm; however, since we would like to compose all the steps into a single Spark pipeline, we will utilize a wrapper exposing the Spark estimator API:
val metaLearningModel = new H2ODeepLearning()(hc, spark.sqlContext)
.setTrainKey(mlTrainHF.key)
.setValidKey(mlTestHF.key)
.setResponseColumn("label")
.setEpochs(10)
.setHidden(Array(100, 100, 50))
.fit(null)
Since we directly specified the validation dataset, we can explore the performance of the model:
Alternatively, we can open the H2O Flow UI (by calling hc.openFlow) and explore its performance in visual form:
You can easily see that the AUC for this model on the validation dataset is 0.868619, which is higher than the AUC of any of the individual models.
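To make the reported metric concrete, AUC can be computed directly from scores and labels via the rank-based (Mann-Whitney) formulation. A minimal sketch on toy data, independent of the H2O API:

```scala
object AucSketch {
  // AUC = probability that a randomly chosen positive example is scored
  // higher than a randomly chosen negative one (ties count as 0.5).
  def auc(scores: Seq[Double], labels: Seq[Int]): Double = {
    val pos = scores.zip(labels).collect { case (s, 1) => s }
    val neg = scores.zip(labels).collect { case (s, 0) => s }
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size * neg.size)
  }
}
```

A perfect ranker scores every positive above every negative and reaches 1.0; a model comparable to the super learner here would land around 0.87 on its validation scores.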