Next, we move on to the random forest algorithm, which, as you will recall from the previous chapters, is an ensemble of decision trees. We again perform a grid search, alternating between various depths and other hyper-parameters, all of which will be familiar:
import org.apache.spark.ml.classification.{RandomForestClassifier, RandomForestClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val rfModelPath = s"$MODELS_DIR/rfModel"
val rfModel = {
  // Grid search over the number of trees, the impurity measure, and the tree depth
  val rfGridSearch = for (
    rfNumTrees <- Array(10, 15);
    rfImpurity <- Array("entropy", "gini");
    rfDepth <- Array(3, 5))
    yield {
      println(s"Training random forest: numTrees: $rfNumTrees, impurity: $rfImpurity, depth: $rfDepth")
      val rfModel = new RandomForestClassifier()
        .setFeaturesCol(idf.getOutputCol)
        .setLabelCol("label")
        .setNumTrees(rfNumTrees)
        .setImpurity(rfImpurity)
        .setMaxDepth(rfDepth)
        .setMaxBins(10)
        .setSubsamplingRate(0.67)
        .setSeed(42)
        .setCacheNodeIds(true)
        .fit(trainData)
      val rfPrediction = rfModel.transform(testData)
      // The evaluator's default metric is the area under the ROC curve
      val rfAUC = new BinaryClassificationEvaluator()
        .setLabelCol("label")
        .evaluate(rfPrediction)
      println(s"  RF AUC on test data: $rfAUC")
      ((rfNumTrees, rfImpurity, rfDepth), rfModel, rfAUC)
    }
  // Print the top five configurations, best AUC first
  println(rfGridSearch.sortBy(-_._3).take(5).mkString(" "))
  val bestModel = rfGridSearch.sortBy(-_._3).head._2
  // Note that the model is minimal because of the small grid space defined above
  bestModel.write.overwrite.save(rfModelPath)
  bestModel
}
From our grid search, the highest AUC we see is 0.769.
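To make the mechanics of the grid search concrete, here is a minimal sketch in plain Scala that runs without Spark. It shows how a for-comprehension with three generators enumerates every combination of hyper-parameters, and how sorting by the negated score picks the best one. The object name `GridSketch` and the function `fakeScore` are hypothetical and introduced only for illustration; `fakeScore` returns a made-up number in place of a real AUC, so its values say nothing about actual model quality.

```scala
object GridSketch {
  // Hypothetical stand-in for "train a model and compute its AUC".
  // The formula is arbitrary and only exists so the sketch can run.
  def fakeScore(numTrees: Int, impurity: String, depth: Int): Double =
    0.5 + 0.01 * numTrees + 0.02 * depth + (if (impurity == "gini") 0.005 else 0.0)

  // Three generators yield one element per combination of values,
  // that is, the Cartesian product of the three arrays.
  val grid = for (
    numTrees <- Array(10, 15);
    impurity <- Array("entropy", "gini");
    depth <- Array(3, 5))
    yield ((numTrees, impurity, depth), fakeScore(numTrees, impurity, depth))

  // Sorting by the negated score puts the highest score first.
  val (bestParams, bestScore) = grid.sortBy(-_._2).head

  def main(args: Array[String]): Unit = {
    println(s"Grid size: ${grid.length}")       // 2 * 2 * 2 = 8 combinations
    println(s"Best parameters: $bestParams")    // (15,gini,5) for this made-up score
  }
}
```

With two values per hyper-parameter, the grid holds only eight candidate models, which is why the search space explored in this section should be regarded as minimal. Adding one more value to any of the arrays multiplies the number of models to train accordingly.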