Now, let's try building a random forest using 10 decision trees.
import org.apache.spark.mllib.tree.RandomForest

val numClasses = 2                             // binary classification
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: treat all features as continuous
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 5
val maxBins = 10
val seed = 42

val rfModel = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
As with our single decision tree model, we start by declaring the hyper-parameters, many of which should already be familiar from the decision tree example. The preceding code builds a random forest of 10 trees for a two-class problem. One key parameter that is new is the feature subset strategy, described as follows:
The featureSubsetStrategy parameter sets the number of features to consider as split candidates at each node. It can be either a fraction (for example, 0.5) or a function of the number of features in your dataset. The auto setting lets the algorithm choose this number for you, but a common rule of thumb is to use the square root of the number of features.
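To make the effect of this setting concrete, the following plain-Scala sketch estimates how many candidate features each strategy would yield for a given feature count. The object FeatureSubsetSketch, the helper candidatesPerNode, and the feature count of 25 are hypothetical names and values chosen for illustration; the rounding rules are an assumption and may not match MLlib's internals exactly. The auto setting is deliberately not modeled, since its choice is left to the library.

```scala
import scala.util.Try

object FeatureSubsetSketch {

  // Hypothetical helper (not part of MLlib): approximate number of features
  // considered as split candidates at each node for a given strategy.
  def candidatesPerNode(strategy: String, numFeatures: Int): Int = {
    require(numFeatures > 0, "numFeatures must be positive")
    val raw = strategy match {
      case "all"      => numFeatures
      case "sqrt"     => math.ceil(math.sqrt(numFeatures)).toInt
      case "log2"     => math.ceil(math.log(numFeatures) / math.log(2)).toInt
      case "onethird" => math.ceil(numFeatures / 3.0).toInt
      case other =>
        // Anything else is treated as a fraction in (0, 1], for example "0.5".
        Try(other.toDouble).toOption.filter(f => f > 0.0 && f <= 1.0) match {
          case Some(fraction) => math.ceil(fraction * numFeatures).toInt
          case None =>
            throw new IllegalArgumentException(s"Strategy not modeled in this sketch: $other")
        }
    }
    // Always use at least one feature and never more than are available.
    math.min(numFeatures, math.max(1, raw))
  }

  def main(args: Array[String]): Unit = {
    val numFeatures = 25 // arbitrary illustrative value
    Seq("all", "sqrt", "log2", "onethird", "0.5").foreach { strategy =>
      println(s"$strategy -> ${candidatesPerNode(strategy, numFeatures)} of $numFeatures features")
    }
  }
}
```

Smaller subsets make the individual trees less correlated with one another, at the cost of each tree seeing fewer candidate splits per node.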
Now that we have trained our model, let's score it against our hold-out set and compute the total error:
def computeError(model: Predictor, data: RDD[LabeledPoint]): Double = {
  val labelAndPreds = data.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
}

val rfTestErr = computeError(rfModel, testData)
println(f"RF Model: Test Error = ${rfTestErr}%.3f")
The output is as follows:
We can also compute the AUC using the previously defined computeMetrics method:
val rfMetrics = computeMetrics(rfModel, testData)
println(f"RF Model: AUC on Test Data = ${rfMetrics.areaUnderROC}%.3f")
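As a reminder of what the AUC measures, here is a minimal plain-Scala sketch that computes it on a small local collection. It is not the computeMetrics method defined earlier; the object AucSketch, the function auc, and the sample values are hypothetical, and the sketch assumes labels are encoded as 0.0 and 1.0. It uses the pairwise interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```scala
object AucSketch {

  // Hypothetical helper: AUC over (score, label) pairs, labels assumed to be 0.0 or 1.0.
  // Compares every positive with every negative, so it only suits small local collections.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val positives = scoreAndLabels.collect { case (score, label) if label == 1.0 => score }
    val negatives = scoreAndLabels.collect { case (score, label) if label == 0.0 => score }
    require(positives.nonEmpty && negatives.nonEmpty, "need at least one example of each class")

    // A correctly ordered pair counts 1.0, a tie 0.5, a wrongly ordered pair 0.0.
    val pairScores = for (p <- positives; n <- negatives) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0

    pairScores.sum / (positives.size.toLong * negatives.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up scores and labels, for illustration only.
    val sample = Seq((0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.1, 0.0))
    println(f"AUC = ${auc(sample)}%.3f")
  }
}
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which is why it complements the raw error rate when comparing models.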
Our random forest, with its hard-coded hyper-parameters, performs much better than our single decision tree in terms of both overall model error and AUC. In the next section, we will introduce the concept of a grid search and show how to vary hyper-parameter values and combinations and measure the impact on model performance.