Finally, we will move on to our gradient boosting machine (GBM), which will be the final model in our ensemble. Note that in the previous chapters we used H2O's version of GBM, but here we will stay with Spark and use its implementation, the GBTClassifier, as follows:
import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val gbmModelPath = s"$MODELS_DIR/gbmModel"
val gbmModel = {
  // Train a gradient-boosted trees classifier on the TF-IDF features
  val model = new GBTClassifier()
    .setFeaturesCol(idf.getOutputCol)
    .setLabelCol("label")
    .setMaxIter(20)
    .setMaxDepth(6)
    .setCacheNodeIds(true)
    .fit(trainData)
  // Score the held-out test data and compute the AUC
  val gbmPrediction = model.transform(testData)
  gbmPrediction.show()
  val gbmAUC = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    // Older Spark releases do not expose a raw prediction column for
    // GBTClassificationModel, so we evaluate on the prediction column instead
    .setRawPredictionCol(model.getPredictionCol)
    .evaluate(gbmPrediction)
  println(s"GBM AUC on test data: $gbmAUC")
  // Persist the fitted model for later reuse
  model.write.overwrite.save(gbmModelPath)
  model
}
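Because the fitted model is persisted to gbmModelPath, it can be reloaded in a later session without retraining. As a minimal sketch (assuming the path saved above is still accessible):

// Reload the persisted GBM model; this is why we imported GBTClassificationModel
val loadedGbmModel = GBTClassificationModel.load(gbmModelPath)
// The loaded model scores new data exactly like the original one
loadedGbmModel.transform(testData).select("label", "prediction").show(5)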
So now, we have trained four different learning algorithms: a (single) decision tree, a random forest, Naive Bayes, and a gradient boosting machine. Each yields a different AUC on the test data, as summarized in the following table. We can see that the best-performing model is the random forest, followed by GBM. However, it is fair to say that we performed no exhaustive parameter search for the GBM model, nor did we use the high number of iterations that is usually recommended; a sketch of such a search follows the table:
| Model | AUC on test data |
| --- | --- |
| Decision tree | 0.659 |
| Naive Bayes | 0.484 |
| Random forest | 0.769 |
| GBM | 0.755 |
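If we wanted to push the GBM further, the usual next step is a parameter grid evaluated with cross-validation. The following is a minimal sketch only, not something we ran for the table above; the grid values and number of folds are illustrative assumptions:

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTClassifier()
  .setFeaturesCol(idf.getOutputCol)
  .setLabelCol("label")
  .setCacheNodeIds(true)

// Illustrative grid; real searches typically use many more iterations
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxIter, Array(20, 50, 100))
  .addGrid(gbt.maxDepth, Array(4, 6, 8))
  .build()

val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("label")
    // As above, evaluate on the prediction column for older Spark releases
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Fits one model per grid point and fold, then refits the best configuration
val cvModel = cv.fit(trainData)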