Spark random forest model

Next, we will move on to our random forest algorithm, which, as you will recall from the previous chapters, is an ensemble of various decision trees whereby we perform a grid search again alternating between various depths and other hyper-parameters, which will be familiar:

import org.apache.spark.ml.classification.{RandomForestClassifier, RandomForestClassificationModel}
val rfModelPath= s"$MODELS_DIR/rfModel"
val rfModel= {
  val rfGridSearch = for (
    rfNumTrees<- Array(10, 15);
    rfImpurity<- Array("entropy", "gini");
    rfDepth<- Array(3, 5))
    yield {
      println( s"Training random forest: numTrees: $rfNumTrees, 
              impurity $rfImpurity, depth: $rfDepth")
     val rfModel = new RandomForestClassifier()
         .setFeaturesCol(idf.getOutputCol)
         .setLabelCol("label")
         .setNumTrees(rfNumTrees)
         .setImpurity(rfImpurity)
         .setMaxDepth(rfDepth)
         .setMaxBins(10)
         .setSubsamplingRate(0.67)
         .setSeed(42)
         .setCacheNodeIds(true)
         .fit(trainData)
     val rfPrediction = rfModel.transform(testData)
     val rfAUC = new BinaryClassificationEvaluator()
                 .setLabelCol("label")
                 .evaluate(rfPrediction)
     println(s" RF AUC on test data: $rfAUC")
     ((rfNumTrees, rfImpurity, rfDepth), rfModel, rfAUC)
   }
   println(rfGridSearch.sortBy(-_._3).take(5).mkString("
"))
   val bestModel = rfGridSearch.sortBy(-_._3).head._2 
   // Stress that the model is minimal because of defined gird space^
   bestModel.write.overwrite.save(rfModelPath)
   bestModel
}

From our grid search, the highest AUC we are seeing is 0.769.

Table of Contents for Spark random forest model

Create new playlist

Sign In

Sign Up

Table of Contents for
Spark random forest model