Regularization in logistic regression

One of the dangers of machine learning is over-fitting: the algorithm captures not only the signal in the training set, but also the statistical noise that results from the finite size of the training set.

A way to mitigate over-fitting in logistic regression is to use regularization: we impose a penalty for large values of the parameters when optimizing. We can do this by adding a penalty to the cost function that is proportional to the magnitude of the parameters. Formally, we re-write the logistic regression cost function (described in Chapter 2, Manipulating Data with Breeze) as:

$$\text{Cost}_{\text{reg}}(\text{params}) = \text{Cost}(\text{params}) + \lambda \, \lVert \text{params} \rVert_n$$

where $\text{Cost}(\text{params})$ is the normal logistic regression cost function:

$$\text{Cost}(\text{params}) = -\sum_{i} \left[\, y_i \log \sigma(\text{params} \cdot x_i) + (1 - y_i) \log\bigl(1 - \sigma(\text{params} \cdot x_i)\bigr) \,\right]$$

Here, params is the vector of parameters, $x_i$ is the vector of features for the $i$th training example, $y_i$ is 1 if the $i$th training example is spam and 0 otherwise, and $\sigma$ is the sigmoid function. This is identical to the logistic regression cost function introduced in Chapter 2, Manipulating Data with Breeze, apart from the addition of the regularization term $\lambda \lVert \text{params} \rVert_n$, where $\lVert \text{params} \rVert_n$ is the $L_n$ norm of the parameter vector. The most common value of $n$ is 2, in which case $\lVert \text{params} \rVert_2$ is just the magnitude of the parameter vector:

$$\lVert \text{params} \rVert_2 = \sqrt{\sum_j \text{params}_j^2}$$
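
To make the regularized cost concrete, here is a minimal sketch of how it could be computed with Breeze, in the spirit of Chapter 2, Manipulating Data with Breeze. The toy data and variable names are purely illustrative and are not part of the spam pipeline built in this chapter:

import breeze.linalg.{DenseMatrix, DenseVector, norm}
import breeze.numerics.sigmoid

// Hypothetical toy data: three training examples with two features each.
val features = DenseMatrix((1.0, 0.5), (0.2, 1.3), (1.1, 0.4))
val labels = DenseVector(1.0, 0.0, 1.0)   // 1 = spam, 0 = ham
val params = DenseVector(0.3, -0.8)       // candidate parameter vector
val lambda = 1.0E-8                       // regularization parameter

// Ordinary logistic regression cost (negative log-likelihood).
val probs = sigmoid(features * params)    // predicted probability of spam
val logLikelihood = (0 until labels.length).map { i =>
  labels(i) * math.log(probs(i)) + (1.0 - labels(i)) * math.log(1.0 - probs(i))
}.sum
val cost = -logLikelihood

// Regularized cost: penalize large parameter vectors via the L2 norm.
val regularizedCost = cost + lambda * norm(params)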

The additional regularization term drives the algorithm to reduce the magnitude of the parameter vector. When using regularization, the features must all have comparable magnitudes. This is commonly achieved by normalizing the features. The logistic regression estimator provided by MLlib normalizes all features by default. This behavior can be turned off with the setStandardization method.
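
For instance, if the features were already on a common scale, standardization could be switched off when constructing the estimator. This is a sketch only; the estimator we actually use below keeps the default behavior:

import org.apache.spark.ml.classification.LogisticRegression

// MLlib standardizes features before fitting by default; this switches it off.
val lrWithoutStandardization = new LogisticRegression()
  .setStandardization(false)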

Spark has two hyperparameters that can be tweaked to control regularization:

  • The type of regularization, set with the elasticNetParam parameter. A value of 0 indicates $L_2$ regularization.
  • The degree of regularization ($\lambda$ in the cost function), set with the regParam parameter. A high value of the regularization parameter indicates strong regularization. In general, the greater the danger of over-fitting, the larger the regularization parameter ought to be (the two settings combine as shown below).
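
These two settings combine into a single elastic net penalty. For Spark's linear models, with mixing parameter $\alpha$ (elasticNetParam) and strength $\lambda$ (regParam), the penalty added to the cost function has the form:

$$\lambda \left( \alpha \, \lVert \text{params} \rVert_1 + \frac{1 - \alpha}{2} \, \lVert \text{params} \rVert_2^2 \right)$$

so that $\alpha = 0$ corresponds to pure $L_2$ (ridge) regularization and $\alpha = 1$ to pure $L_1$ (lasso) regularization.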

Let's create a new logistic regression instance that uses regularization:

scala> val lrWithRegularization = (new LogisticRegression()
  .setMaxIter(50))
lrWithRegularization: LogisticRegression = logreg_16b65b325526

scala> lrWithRegularization.setElasticNetParam(0)
lrWithRegularization.type = logreg_1e3584a59b3a

To choose the appropriate value of $\lambda$, we fit the pipeline to the training set and calculate the classification error on the test set for several values of $\lambda$. Further on in the chapter, we will learn about cross-validation in MLlib, which provides a much more rigorous way of choosing hyper-parameters.

scala> val lambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
lambdas: Array[Double] = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)

scala> lambdas foreach { lambda =>
  lrWithRegularization.setRegParam(lambda)
  val pipeline = new Pipeline().setStages(
    Array(indexer, tokenizer, hashingTF, lrWithRegularization))
  val model = pipeline.fit(trainDF)
  val transformedTest = model.transform(testDF)
  val classificationError = transformedTest.filter { 
    $"prediction" !== $"label"
  }.count
  println(s"$lambda => $classificationError")
}
0 => 20
1.0E-12 => 20
1.0E-10 => 20
1.0E-8 => 23

For our example, adding L2 regularization does not help: the smallest values of $\lambda$ leave the test-set classification error unchanged, while the largest value, $\lambda = 10^{-8}$, actually increases it, reducing the classification accuracy.
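
Cross-validation, covered later in this chapter, gives a more reliable estimate of how well each value of $\lambda$ generalizes. As a rough, forward-looking sketch (the evaluator and number of folds are illustrative choices, and indexer, tokenizer, hashingTF, and lrWithRegularization are the stages constructed above), a grid of regParam values could be searched as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val pipeline = new Pipeline().setStages(
  Array(indexer, tokenizer, hashingTF, lrWithRegularization))

// Candidate values for the regularization parameter.
val paramGrid = new ParamGridBuilder()
  .addGrid(lrWithRegularization.regParam, Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8))
  .build()

// 3-fold cross-validation over the pipeline, scored with area under the ROC curve.
val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val bestModel = crossValidator.fit(trainDF)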
