In the previous section, we explored the data and unified it into a form without missing values. We still need to transform the data into the form expected by Spark MLlib. As explained in the previous chapter, this involves creating an RDD of LabeledPoints. Each LabeledPoint is defined by a label and a vector of input features. The label serves as the training target for model builders, and here it is the numeric index of the categorical activity (see the activityId2Idx transformation prepared earlier):
```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = processedRawData.map { r =>
  val activityId = r(0)
  val activityIdx = activityId2Idx(activityId)
  val features = r.drop(1)
  LabeledPoint(activityIdx, Vectors.dense(features))
}
```
The next step is to prepare the data for training and model validation. We simply split the data into two parts: 80% for training and the remaining 20% for validation:
```scala
val splits = data.randomSplit(Array(0.8, 0.2))
val (trainingData, testData) = (splits(0), splits(1))
```
After this step, we are ready to invoke the modeling part of the workflow. The strategy for building a Spark RandomForest model is the same as for the GBM shown in the previous chapter: we call the static method trainClassifier on the RandomForest object:
```scala
import org.apache.spark.mllib.tree.configuration._
import org.apache.spark.mllib.tree.impurity._

val rfStrategy = new Strategy(
  algo = Algo.Classification,
  impurity = Entropy,
  maxDepth = 10,
  maxBins = 20,
  numClasses = activityId2Idx.size,
  categoricalFeaturesInfo = Map[Int, Int](),
  subsamplingRate = 0.68)

val rfModel = RandomForest.trainClassifier(
  input = trainingData,
  strategy = rfStrategy,
  numTrees = 50,
  featureSubsetStrategy = "auto",
  seed = 42)
```
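Before dissecting the parameters, it may help to see the model in action. The following is a minimal sketch, not a step of the workflow itself, that scores the hold-out set with rfModel.predict and computes the raw accuracy:

```scala
// A minimal sketch: score the hold-out set and compute raw accuracy.
// Assumes rfModel and testData from the code above.
val predictionsAndLabels = testData.map { point =>
  (rfModel.predict(point.features), point.label)
}
val accuracy = predictionsAndLabels
  .filter { case (prediction, label) => prediction == label }
  .count().toDouble / testData.count()
println(s"Hold-out accuracy: $accuracy")
```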
In the trainClassifier call above, the parameters are split into two sets:
- Strategy, which defines the common parameters for building a single decision tree
- RandomForest-specific parameters
The Strategy parameter list overlaps with the parameter list of the decision tree algorithm discussed in the previous chapter (a short tuning sketch follows the list):
- input: References the training data, represented as an RDD of LabeledPoints.
- numClasses: Number of output classes. In this case, we model only the classes that are included in the input data.
- categoricalFeaturesInfo: Map of categorical features and their arity. We have no categorical features in the input data, which is why we pass an empty map.
- impurity: Impurity measure used for tree node splitting.
- subsamplingRate: The fraction of training data used for building a single decision tree.
- maxDepth: Maximum depth of a single tree. Deep trees have a tendency to memorize the input data and overfit. On the other hand, overfitting in a RandomForest is balanced by assembling multiple trees together. Furthermore, larger trees mean longer training time and a higher memory footprint.
- maxBins: Continuous features are transformed into ordered, discretized features with at most maxBins possible values. The discretization is done before each node split.
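As a tuning sketch for the Strategy parameters above (the values here are arbitrary examples, not recommendations), MLlib also allows deriving a strategy from its built-in defaults and overriding individual fields:

```scala
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.impurity.Gini

// A sketch: start from MLlib's default classification strategy
// and override individual fields (example values only).
val tunedStrategy = Strategy.defaultStrategy("Classification")
tunedStrategy.numClasses = activityId2Idx.size
tunedStrategy.impurity = Gini        // an alternative to Entropy
tunedStrategy.maxDepth = 5           // shallower trees train faster
tunedStrategy.maxBins = 32           // finer discretization of continuous features
tunedStrategy.subsamplingRate = 1.0  // use all training rows for each tree
```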
The RandomForest-specific parameters are the following:
- numTrees: Number of trees in the resulting forest. Increasing the number of trees decreases model variance.
- featureSubsetStrategy: Specifies the method that determines how many features are selected for training a single tree. For example, "sqrt" is normally used for classification, while "onethird" is used for regression problems. See the value of RandomForest.supportedFeatureSubsetStrategies for the available values (printed in the snippet after this list).
- seed: Seed for random generator initialization, since RandomForest depends on the random selection of features and rows.
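The available feature subset strategies can be inspected directly; the commented output below is what current MLlib versions typically report:

```scala
import org.apache.spark.mllib.tree.RandomForest

// Print the feature subset strategies supported by this MLlib version.
println(RandomForest.supportedFeatureSubsetStrategies.mkString(", "))
// auto, all, sqrt, log2, onethird
```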
The parameters numTrees and maxDepth are often referred to as stopping criteria. Spark also provides additional parameters to stop tree growth and produce fine-grained trees (a short sketch follows the list):
- minInstancesPerNode: A node is not split any further if the split would produce left or right child nodes containing fewer observations than the value specified by this parameter. The default value is 1, but typically, for regression problems or large trees, the value should be higher.
- minInfoGain: Minimum information gain a split must yield. The default value is 0.0.
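Both criteria are mutable fields of the Strategy instance and can be tightened before training; a short sketch with arbitrary example values:

```scala
// A sketch: tighten the stopping criteria on the strategy defined earlier.
rfStrategy.minInstancesPerNode = 10  // forbid leaves with fewer than 10 observations
rfStrategy.minInfoGain = 0.01        // require a minimum information gain per split
```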
Furthermore, Spark RandomForest accepts parameters that influence execution performance (see the Spark documentation).
Fixing the seed of the random generator, as we did with the seed parameter earlier, is a common practice for non-deterministic algorithms; however, it is not enough if the algorithm is parallelized and its result depends on thread scheduling. In this case, ad hoc methods need to be adopted (for example, limiting parallelization by having only one computation thread, limiting parallelization by restricting the number of input partitions, or switching the task scheduler to provide a fixed schedule).
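For illustration, two of these ad hoc measures could look as follows (a sketch; local[1] is the standard Spark master setting for a single computation thread):

```scala
// 1. Limit Spark to a single computation thread, for example when
//    starting the shell:
//    spark-shell --master local[1]

// 2. Restrict the number of input partitions before training.
val singlePartitionTrainingData = trainingData.repartition(1)
```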