Cross-validation and model selection

In the previous example, we validated our approach by withholding 30% of the data during training and testing on this subset. This approach is not particularly rigorous: the exact result changes depending on the random train-test split. Furthermore, if we wanted to test several different hyper-parameters (or different models) to choose the best one, we would unwittingly choose the model that best reflects the specific rows in our test set, rather than the population as a whole.

This can be overcome with cross-validation. We have already encountered cross-validation in Chapter 4, Parallel Collections and Futures. In that chapter, we used random subsample cross-validation, where we created the train-test split randomly.

In this chapter, we will use k-fold cross-validation: we split the training set into k parts (typically, k is 3 or 10) and use k-1 parts as the training set and the remaining part as the test set. The train/test cycle is repeated k times, using a different part as the test set each time.
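To make the mechanics concrete, here is a minimal sketch of a k-fold split done by hand on a toy set of row indices (the values and variable names are purely illustrative; the CrossValidator class that we use below takes care of all of this for us):

// Illustrative k-fold split over a toy set of row indices.
val k = 3
val indices = (0 until 12).toVector        // stand-in for the rows of our training set
val folds = indices.groupBy(_ % k)         // Map: fold number -> rows in that fold

// Each pass holds one fold out as the test set and trains on the other k-1 folds.
for (testFold <- 0 until k) {
  val test = folds(testFold)
  val train = indices.filterNot(i => test.contains(i))
  println(s"fold $testFold: ${train.size} training rows, ${test.size} test rows")
}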

Cross-validation is commonly used to choose the best set of hyperparameters for a model. To illustrate choosing suitable hyperparameters, we will go back to our regularized logistic regression example. Instead of intuiting the hyper-parameters ourselves, we will choose the hyper-parameters that give us the best cross-validation score.

We will explore setting both the regularization type (through elasticNetParam) and the degree of regularization (through regParam). A crude, but effective way to find good values of the parameters is to perform a grid search: we calculate the cross-validation score for every pair of values of the regularization parameters of interest.
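Conceptually, the grid is just the Cartesian product of the candidate values, with one cross-validation run per combination. A minimal sketch of the idea, using the same candidate values that we will feed to MLlib below:

// The grid is the Cartesian product of the candidate parameter values:
// one cross-validation run per (regularization strength, regularization type) pair.
val candidateLambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
val candidateAlphas = Array(0.0, 1.0)      // 0.0 -> L2 penalty, 1.0 -> L1 penalty
val grid = for {
  lambda <- candidateLambdas
  alpha <- candidateAlphas
} yield (lambda, alpha)
// grid.length == 8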

We can build a grid of parameters using MLlib's ParamGridBuilder.

scala> import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

scala> val paramGridBuilder = new ParamGridBuilder()
paramGridBuilder: ParamGridBuilder = ParamGridBuilder@1dd694d0

To add hyper-parameters over which to optimize to the grid, we use the addGrid method:

scala> val lambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
lambdas: Array[Double] = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)

scala> val elasticNetParams = Array(0.0, 1.0)
elasticNetParams: Array[Double] = Array(0.0, 1.0)

scala> paramGridBuilder.addGrid(
  lrWithRegularization.regParam, lambdas).addGrid(
  lrWithRegularization.elasticNetParam, elasticNetParams)
paramGridBuilder.type = ParamGridBuilder@1dd694d0

Once all the dimensions are added, we can just call the build method on the builder to build the grid:

scala> val paramGrid = paramGridBuilder.build
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
  logreg_f7dfb27bed7d-elasticNetParam: 0.0,
  logreg_f7dfb27bed7d-regParam: 0.0
}, {
  logreg_f7dfb27bed7d-elasticNetParam: 1.0,
  logreg_f7dfb27bed7d-regParam: 0.0
} ...)

scala> paramGrid.length
Int = 8

As we can see, the grid is just a one-dimensional array of sets of parameters to pass to the logistic regression model prior to fitting.
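For example, we could look up an individual value in one of these parameter sets (ParamMap's get method returns an Option, since a map need not contain every parameter):

// Each element of the grid is a ParamMap; we can look up individual values.
paramGrid(0).get(lrWithRegularization.regParam)   // Some(0.0) for the first grid point shown above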

The next step in setting up the cross-validation pipeline is to define a metric for comparing model performance. Earlier in the chapter, we saw how to use BinaryClassificationMetrics to estimate the quality of a model. Unfortunately, the BinaryClassificationMetrics class is part of the core MLlib API, rather than the new pipeline API, and is thus not (easily) compatible with it. The pipeline API offers a BinaryClassificationEvaluator class instead. This class works directly on DataFrames, and thus fits perfectly into the pipeline API flow:

scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

scala> val evaluator = new BinaryClassificationEvaluator()
evaluator: BinaryClassificationEvaluator = binEval_64b08538f1a2

scala> println(evaluator.explainParams)
labelCol: label column name (default: label)
metricName: metric name in evaluation (areaUnderROC|areaUnderPR) (default: areaUnderROC)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)

From the parameter list, we see that the BinaryClassificationEvaluator class supports two metrics: the area under the ROC curve and the area under the precision-recall curve. It expects, as input, a DataFrame containing a label column (the model truth) and a rawPrediction column (the raw confidence scores that the model assigns to each class: here, how strongly it believes that an e-mail is spam or ham).
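For instance, assuming predictionsDF is a hypothetical DataFrame of predictions produced by some fitted model's transform method (and therefore containing label and rawPrediction columns), we could score it directly, or switch the evaluator to the precision-recall metric:

// predictionsDF is a hypothetical DataFrame with "label" and "rawPrediction" columns.
val aucROC = evaluator.evaluate(predictionsDF)

// The same evaluator can be switched to the area under the precision-recall curve.
val aucPR = evaluator.setMetricName("areaUnderPR").evaluate(predictionsDF)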

We now have all the parameters we need to run cross-validation. We first build the pipeline, and then pass the pipeline, the evaluator and the array of parameters over which to run the cross-validation to an instance of CrossValidator:

scala> val pipeline = new Pipeline().setStages(Array(indexer, tokenizer, hashingTF, lrWithRegularization))
pipeline: Pipeline = pipeline_3ed29f72a4cc

scala> val crossval = (new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3))
crossval: CrossValidator = cv_5ebfa1143a9d  

We will now fit crossval to trainDF:

scala> val cvModel = crossval.fit(trainDF)
cvModel: CrossValidatorModel = cv_5ebfa1143a9d

This step can take a fairly long time (over an hour on a single machine). The result is a transformer, cvModel, corresponding to the pipeline re-fitted on the whole of trainDF with the hyper-parameters that achieved the best average cross-validation score. We can use it to predict the classification error on the test DataFrame:

scala> cvModel.transform(testDF).filter { 
  $"prediction" !== $"label" 
}.count
Long = 20
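To put this count in perspective, we can turn it into a misclassification rate (a quick sketch; numMisclassified is just an illustrative name):

// Turn the raw count of misclassified messages into a misclassification rate.
val numMisclassified = cvModel.transform(testDF).filter {
  $"prediction" !== $"label"
}.count
val errorRate = numMisclassified.toDouble / testDF.count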

Cross-validation has therefore resulted in a model that performs identically to the original, naive logistic regression model for which we did not tune any hyper-parameters. cvModel also contains the average evaluation score for each set of parameters in the parameter grid:

scala> cvModel.avgMetrics
Array[Double] = Array(0.996427805316161, ...)

The easiest way to relate this to the hyper-parameters is to zip it with cvModel.getEstimatorParamMaps. This gives us a list of (hyper-parameter values, cross-validation score) pairs:

scala> val params2score = cvModel.getEstimatorParamMaps.zip(
  cvModel.avgMetrics)
params2score: Array[(ml.param.ParamMap,Double)] = Array(({
  logreg_8f107aabb304-elasticNetParam: 0.0,
  logreg_8f107aabb304-regParam: 0.0
},0.996427805316161),...

scala> params2score.foreach {
  case (params, score) => 
    val lambda = params(lrWithRegularization.regParam)
    val elasticNetParam = params(
      lrWithRegularization.elasticNetParam)
    val l2Orl1 = if(elasticNetParam == 0.0) "L2" else "L1"
    println(s"$l2Orl1, $lambda => $score")
}
L2, 0.0 => 0.996427805316161
L1, 0.0 => 0.996427805316161
L2, 1.0E-12 => 0.9964278053175655
L1, 1.0E-12 => 0.9961429402772803
L2, 1.0E-10 => 0.9964382546369551
L1, 1.0E-10 => 0.9962223090037103
L2, 1.0E-8 => 0.9964159754613495
L1, 1.0E-8 => 0.9891008277659763

The best set of hyper-parameters corresponds to L2 regularization with a regularization parameter of 1E-10, though this only amounts to a tiny improvement in AUC.
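Rather than reading the best combination off the printed list, we could also pick it out programmatically from the params2score array that we just built (a small sketch):

// Pick out the (parameters, score) pair with the highest average AUC.
val (bestParams, bestScore) = params2score.maxBy { case (_, score) => score }
println(s"best AUC $bestScore for:\n$bestParams")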

This completes our spam filter example. We have successfully trained a spam filter for this particular Ling-Spam dataset. To obtain better results, one could experiment with better feature extraction: we could remove stop words or use TF-IDF vectors, rather than raw term-frequency vectors, as features, and we could add additional features such as the length of messages, or even n-grams. We could also experiment with non-linear algorithms, such as random forests. All of these steps would be straightforward to add to the pipeline.
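As a flavour of what the first two of these extensions might look like, here is a sketch using Spark's StopWordsRemover and IDF transformers. The stages indexer, tokenizer, hashingTF, and lrWithRegularization are those defined earlier in the chapter; the intermediate column names here are illustrative, and the neighbouring stages would need their input and output columns wired up to match:

import org.apache.spark.ml.feature.{StopWordsRemover, IDF}

// Drop common English stop words from the token stream produced by the tokenizer.
// The input/output column names below are illustrative.
val stopWordsRemover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filteredTokens")

// Re-weight the raw term frequencies produced by hashingTF into TF-IDF scores.
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

// The extended pipeline: the remaining stages must read from and write to
// the new intermediate columns for the stages to chain together correctly.
val extendedPipeline = new Pipeline().setStages(
  Array(indexer, tokenizer, stopWordsRemover, hashingTF, idf, lrWithRegularization))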
