RandomForestClassifier

Let's assume that we are in a binary classification setting and want to use RandomForestClassifier. All Apache SparkML algorithms share a compatible API, so they can be used interchangeably. Which algorithm we pick therefore matters little here, but RandomForestClassifier exposes more (hyper)parameters than simpler models such as logistic regression. At a later stage we'll use (hyper)parameter tuning, which is also built into Apache SparkML, so it makes sense to choose an algorithm with more knobs to tweak. Adding such a binary classifier to our Pipeline is very simple:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier

// Configure the classifier to read labels and features from the
// columns produced by the preceding transformer stages
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Append the classifier as the final Pipeline stage and train
val model = new Pipeline().setStages(transformers :+ rf).fit(df)

// Apply the fitted Pipeline to obtain predictions
val result = model.transform(df)

As you can see, RandomForestClassifier takes two parameters: the name of the column containing the actual labels (remember that we are in a supervised learning setting) and the name of the column containing the features that we've created before. All other parameters keep their default values, and we'll take care of them later. We simply add this machine learning model as the final stage of the Pipeline with transformers :+ rf. Then we call fit and transform, always passing our DataFrame as a parameter, and we obtain a final DataFrame called result, which contains everything from the df DataFrame plus an additional column called prediction. Done! We've created our first machine learning Pipeline with Apache SparkML. Now we want to check how well we are doing against our test dataset. This is also built into Apache SparkML.
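To sketch what that evaluation step might look like, here is a minimal example using Spark's BinaryClassificationEvaluator. The 80/20 randomSplit fractions, the seed, and the column names are illustrative assumptions, not part of the original code:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hypothetical 80/20 train/test split of the original DataFrame
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Fit on the training split only, then score the held-out test split
val model = new Pipeline().setStages(transformers :+ rf).fit(train)
val predictions = model.transform(test)

// areaUnderROC is the default metric of BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
val auc = evaluator.evaluate(predictions)
println(s"Area under ROC on test data: $auc")
```

Evaluating on data the model has never seen gives a much more honest estimate of generalization than scoring the training DataFrame itself.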
