An example – logistic regression

Let's now imagine we want to build a classifier that takes a person's height and weight and assigns a probability to their being Male or Female. We will reuse the height and weight data introduced earlier in this chapter. Let's start by plotting the dataset:

Height versus weight data for 181 men and women

There are many different algorithms for classification. A first glance at the data shows that we can, approximately, separate men from women by drawing a straight line across the plot. A linear method is therefore a reasonable initial attempt at classification. In this section, we will use logistic regression to build a classifier.

A detailed explanation of logistic regression is beyond the scope of this book. The reader unfamiliar with logistic regression is referred to The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. We will just give a brief summary here.

Logistic regression estimates the probability that a person with a given height and weight is male, using the following sigmoid function:

$$ P(\mathrm{male} \mid \mathrm{height}, \mathrm{weight}) = \frac{1}{1 + \exp\left[ -f(\mathrm{height}, \mathrm{weight}) \right]} $$

Here, f is a linear function:

$$ f(\mathrm{height}, \mathrm{weight}) = \mathrm{params}(0) + \mathrm{params}(1) \cdot \mathrm{height} + \mathrm{params}(2) \cdot \mathrm{weight} $$

Here, params is an array of parameters that we need to determine using the training set. If we gather the height and weight into a features = (height, weight) matrix, we can rewrite the linear kernel f as a matrix multiplication of the features matrix with the params vector:

$$ f = \mathrm{params}(0) + \mathrm{features} \cdot \begin{pmatrix} \mathrm{params}(1) \\ \mathrm{params}(2) \end{pmatrix} $$

To simplify this expression further, it is common to add a dummy feature whose value is always 1 to the features matrix. We can then multiply params(0) by this feature, allowing us to write the entire kernel f as a single matrix-vector multiplication:

$$ f = \mathrm{features} \cdot \mathrm{params} $$

The feature matrix, features, is now a (181 * 3) matrix, where each row is (1, height, weight) for a particular participant.
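
To make the matrix-vector form concrete, here is a minimal Breeze sketch; the feature rows and parameter values are purely illustrative, not the fitted ones:

import breeze.linalg._
import breeze.numerics.sigmoid

// Two hypothetical participants, each row being (1, height, weight),
// and an illustrative parameter vector.
val features = DenseMatrix(
  (1.0,  0.2, -0.1),
  (1.0, -1.3,  0.4)
)
val params = DenseVector(0.0, 1.5, 1.2)

// One probability per participant: the sigmoid applied element-wise to f.
val pMale = sigmoid(features * params)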

To find the optimal values of the parameters, we can maximize the likelihood function, L(params|features). The likelihood takes a given set of parameter values as input and returns the probability of observing the training set given those parameter values. For a set of parameters and the associated probability function P(male | features_i), the likelihood is:

$$ L(\mathrm{params} \mid \mathrm{features}) = \prod_{i\,\in\,\mathrm{males}} P(\mathrm{male} \mid \mathrm{features}_i) \;\times \prod_{i\,\in\,\mathrm{females}} \left[ 1 - P(\mathrm{male} \mid \mathrm{features}_i) \right] $$

If we magically know, ahead of time, the gender of everyone in the population, we can assign P(male)=1 for the men and P(male)=0 for the women. The likelihood function would then be 1. Conversely, any uncertainty leads to a reduction in the likelihood function. If we choose a set of parameters that consistently lead to classification errors (low P(male) for men or high P(male) for women), the likelihood function drops to 0.

The maximum likelihood corresponds to those values of the parameters most likely to describe the observed data. Thus, to find the parameters that best describe our training set, we just need to find parameters that maximize L(params|features). However, maximizing the likelihood function itself is very rarely done, since it involves multiplying many small values together, which quickly leads to floating point underflow. It is best to maximize the log of the likelihood, which has the same maximum as the likelihood. Finally, since most optimization algorithms are geared to minimize a function rather than maximize it, we will minimize -log L(params | features).
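
Written out, the log of the likelihood defined above is just the sum of the per-participant log-probabilities:

$$ \log L(\mathrm{params} \mid \mathrm{features}) = \sum_{i\,\in\,\mathrm{males}} \log P(\mathrm{male} \mid \mathrm{features}_i) \;+ \sum_{i\,\in\,\mathrm{females}} \log\left[ 1 - P(\mathrm{male} \mid \mathrm{features}_i) \right] $$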

For logistic regression, this is equivalent to minimizing:

$$ \mathrm{Cost}(\mathrm{params}) = -\sum_{i} \left[ \mathrm{target}_i \, (\mathrm{features}_i \cdot \mathrm{params}) - \log\left( 1 + \exp(\mathrm{features}_i \cdot \mathrm{params}) \right) \right] $$

Here, the sum runs over all participants in the training data, features_i is the vector (1, height_i, weight_i) of features for the i-th observation in the training set, and target_i is 1 if the person is male, and 0 if the participant is female.
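
To see why minimizing this is the same as minimizing -log L, substitute P(male | features_i) = 1 / (1 + exp[-features_i · params]) into a single term of the log-likelihood; a little algebra gives:

$$ \mathrm{target}_i \log P(\mathrm{male} \mid \mathrm{features}_i) + (1 - \mathrm{target}_i) \log\left[ 1 - P(\mathrm{male} \mid \mathrm{features}_i) \right] = \mathrm{target}_i \, (\mathrm{features}_i \cdot \mathrm{params}) - \log\left( 1 + \exp(\mathrm{features}_i \cdot \mathrm{params}) \right) $$

Summing this over all participants and negating gives exactly the Cost function above.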

To minimize the Cost function, we must also know its gradient with respect to the parameters. Writing σ for the sigmoid function introduced above, the gradient is:

$$ \nabla_{\mathrm{params}}\,\mathrm{Cost} = \sum_{i} \mathrm{features}_i \left[ \sigma(\mathrm{features}_i \cdot \mathrm{params}) - \mathrm{target}_i \right] $$
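
In matrix form, with the sigmoid σ applied element-wise, the same gradient is a single matrix product; this is exactly what the costFunctionGradient function in the listing below computes:

$$ \nabla_{\mathrm{params}}\,\mathrm{Cost} = \mathrm{featureMatrix}^{T} \cdot \left[ \sigma(\mathrm{featureMatrix} \cdot \mathrm{params}) - \mathrm{target} \right] $$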

We will start by rescaling the height and weight by their mean and standard deviation. While this is not strictly necessary for logistic regression, it is generally good practice. It facilitates the optimization and would become necessary if we wanted to use regularization methods or build superlinear features (features that allow the boundary separating men from women to be curved rather than a straight line).

For this example, we will move away from the Scala shell and write a standalone Scala script. Here's the full code listing. Don't worry if this looks daunting. We will break it up into manageable chunks in a minute:

import breeze.linalg._
import breeze.numerics._
import breeze.optimize._
import breeze.stats._

object LogisticRegressionHWData extends App {

  val data = HWData.load

  // Rescale the features to have mean of 0.0 and s.d. of 1.0
  def rescaled(v:DenseVector[Double]) =
    (v - mean(v)) / stddev(v)

  val rescaledHeights = rescaled(data.heights)
  val rescaledWeights = rescaled(data.weights)

  // Build the feature matrix as a matrix with 
  //181 rows and 3 columns.
  val rescaledHeightsAsMatrix = rescaledHeights.toDenseMatrix.t
  val rescaledWeightsAsMatrix = rescaledWeights.toDenseMatrix.t

  val featureMatrix = DenseMatrix.horzcat(
    DenseMatrix.ones[Double](rescaledHeightsAsMatrix.rows, 1),
    rescaledHeightsAsMatrix,
    rescaledWeightsAsMatrix
  )

  println(s"Feature matrix size: ${featureMatrix.rows} x " +s"${featureMatrix.cols}")

  // Build the target variable to be 1.0 where a participant
  // is male, and 0.0 where the participant is female.
  val target = data.genders.values.map {
    gender => if(gender == 'M') 1.0 else 0.0
  }

  // Build the loss function ready for optimization.
  // We will worry about refactoring this to be more 
  // efficient later.
  def costFunction(parameters:DenseVector[Double]):Double = {
    val xBeta = featureMatrix * parameters
    val expXBeta = exp(xBeta)
    - sum((target :* xBeta) - log1p(expXBeta))
  }

  def costFunctionGradient(parameters:DenseVector[Double])
  :DenseVector[Double] = {
    val xBeta = featureMatrix * parameters
    val probs = sigmoid(xBeta)
    featureMatrix.t * (probs - target)
  }

  val f = new DiffFunction[DenseVector[Double]] {
    def calculate(parameters:DenseVector[Double]) =
      (costFunction(parameters), costFunctionGradient(parameters))
  }

  val optimalParameters = minimize(f, DenseVector(0.0, 0.0, 0.0))

  println(optimalParameters)
  // => DenseVector(-0.0751454743, 2.476293647, 2.23054540)
}

That was a mouthful! Let's take this one step at a time. After the obvious imports, we start with:

object LogisticRegressionHWData extends App {

By extending the built-in App trait, we tell Scala to treat the entire object as a main function. This just cuts out the def main(args:Array[String]) boilerplate, sketched below for comparison.
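
Without the App trait, the same object would need an explicit entry point, along these lines:

object LogisticRegressionHWData {
  def main(args: Array[String]): Unit = {
    // ... the body of the script would go here ...
  }
}

We then load the data and rescale the height and weight to have a mean of zero and a standard deviation of one: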

def rescaled(v:DenseVector[Double]) =
  (v - mean(v)) / stddev(v)

val rescaledHeights = rescaled(data.heights)
val rescaledWeights = rescaled(data.weights)

The rescaledHeights and rescaledWeights vectors will be the features of our model. We can now build the training set matrix for this model. This is a (181 * 3) matrix, in which the i-th row is (1, height(i), weight(i)), the values of the height and weight for the i-th participant. We start by transforming both rescaledHeights and rescaledWeights from vectors to (181 * 1) matrices:

val rescaledHeightsAsMatrix = rescaledHeights.toDenseMatrix.t
val rescaledWeightsAsMatrix = rescaledWeights.toDenseMatrix.t
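
The transpose is needed because toDenseMatrix produces a single-row matrix. A minimal sketch with a three-element vector (illustrative values):

val v = DenseVector(1.0, 2.0, 3.0)
v.toDenseMatrix    // a (1 * 3) matrix: a single row
v.toDenseMatrix.t  // a (3 * 1) matrix: a single column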

We must also create a (181 * 1) matrix containing just 1 to act as the dummy feature. We can do this using:

DenseMatrix.ones[Double](rescaledHeightsAsMatrix.rows, 1)

We now need to combine our three (181 * 1) matrices together into a single feature matrix of shape (181 * 3). We can use the horzcat method to concatenate the three matrices together:

val featureMatrix = DenseMatrix.horzcat(
  DenseMatrix.ones[Double](rescaledHeightsAsMatrix.rows, 1),
  rescaledHeightsAsMatrix,
  rescaledWeightsAsMatrix
)

The final step in the data preprocessing stage is to create the target variable. We need to convert the data.genders vector to a vector of ones and zeros. We assign a value of one for men and zero for women. Thus, our classifier will predict the probability that any given person is male. We will use the .values.map method, which behaves like the .map method on Scala collections:

val target = data.genders.values.map {
  gender => if(gender == 'M') 1.0 else 0.0
}

Note that we could also have used the indicator function I, which we encountered earlier in this chapter:

val maleVector = DenseVector.fill(data.genders.size)('M')
val target = I(data.genders :== maleVector)

This results in the allocation of a temporary array, maleVector, and might therefore increase the program's memory footprint if there were many participants in the experiment.

We now have a matrix representing the training set and a vector denoting the target variable. We can write the loss function that we want to minimize. As mentioned previously, we will minimize -log L(params | features). The loss function takes as input a set of values for the linear coefficients and returns a number indicating how well those values of the linear coefficients fit the training data:

def costFunction(parameters:DenseVector[Double]):Double = {
  val xBeta = featureMatrix * parameters 
  val expXBeta = exp(xBeta)
  - sum((target :* xBeta) - log1p(expXBeta))
}

Note that we use log1p(x) to calculate log(1+x). This is much more accurate than computing log(1+x) directly when x is very small, since 1+x would otherwise round to 1 in floating point.
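
A quick illustration with scala.math; the value 1e-17 is arbitrary, chosen to be far below the precision of a Double added to 1.0:

math.log(1.0 + 1e-17)  // 0.0 -- the 1e-17 is lost when added to 1.0
math.log1p(1e-17)      // approximately 1.0e-17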

Let's explore the cost function:

costFunction(DenseVector(0.0, 0.0, 0.0)) // 125.45963968135031
costFunction(DenseVector(0.0, 0.1, 0.1)) // 113.33336518036882
costFunction(DenseVector(0.0, -0.1, -0.1)) // 139.17134594294433

We can see that the cost function is somewhat lower for slightly positive values of the height and weight parameters. This indicates that the likelihood function is larger for slightly positive values of the height and weight. This, in turn, implies (as we expect from the plot) that people who are taller and heavier than average are more likely to be male.

We also need a function that calculates the gradient of the loss function, since that will help with the optimization:

def costFunctionGradient(parameters:DenseVector[Double])
:DenseVector[Double] = {
  val xBeta = featureMatrix * parameters 
  val probs = sigmoid(xBeta)
  featureMatrix.t * (probs - target)
}

Having defined the loss function and gradient, we are now in a position to set up the optimization:

 val f = new DiffFunction[DenseVector[Double]] {
   def calculate(parameters:DenseVector[Double]) = 
     (costFunction(parameters), costFunctionGradient(parameters))
 }
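
Before running the optimization, it can be reassuring to check that the analytic gradient agrees with a finite-difference approximation. A minimal sketch (the test point and step size are arbitrary):

// Compare the analytic gradient with a central finite-difference
// approximation of the cost function at an arbitrary test point.
val testPoint = DenseVector(0.1, -0.2, 0.3)
val eps = 1e-6
val numericalGradient = DenseVector.tabulate(testPoint.length) { i =>
  val up = testPoint.copy
  up(i) += eps
  val down = testPoint.copy
  down(i) -= eps
  (costFunction(up) - costFunction(down)) / (2 * eps)
}
// The difference should be close to the zero vector.
println(numericalGradient - costFunctionGradient(testPoint))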

All that is left now is to run the optimization. The cost function for logistic regression is convex (it has a single minimum), so the starting point for optimization is irrelevant in principle. In practice, it is common to start with a coefficient vector that is zero everywhere (equating to assigning a 0.5 probability of being male to every participant):

  val optimalParameters = minimize(f, DenseVector(0.0, 0.0, 0.0))

This returns the vector of optimal parameters:

DenseVector(-0.0751454743, 2.476293647, 2.23054540)
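
The minimize helper picks a sensible optimizer for us. If we want to control the optimizer explicitly, we can construct one ourselves. Here is a minimal sketch, assuming Breeze's LBFGS class (the constructor arguments are illustrative):

import breeze.optimize.LBFGS

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 3)
val optimalParametersLbfgs = lbfgs.minimize(f, DenseVector(0.0, 0.0, 0.0))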

How can we interpret the values of the optimal parameters? The coefficients for the height and weight are both positive, indicating that people who are taller and heavier are more likely to be male.
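
As a quick sanity check of this interpretation, we can compute the predicted probability for a hypothetical participant whose height and weight are given as z-scores (the feature values below are purely illustrative):

val someone = DenseVector(1.0, 0.5, 0.3)    // (1, rescaled height, rescaled weight)
val kernel = someone dot optimalParameters  // the linear kernel f
val pMale = 1.0 / (1.0 + math.exp(-kernel)) // the sigmoid, applied by hand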

We can also get the decision boundary (the line separating (height, weight) pairs more likely to belong to a woman from (height, weight) pairs more likely to belong to a man) directly from the coefficients. The decision boundary is:

$$ \mathrm{params}(0) + \mathrm{params}(1) \times \mathrm{rescaledHeight} + \mathrm{params}(2) \times \mathrm{rescaledWeight} = 0 $$

$$ -0.0751 + 2.4763 \times \mathrm{rescaledHeight} + 2.2305 \times \mathrm{rescaledWeight} = 0 $$

Height and weight data (shifted by the mean and rescaled by the standard deviation). The orange line is the logistic regression decision boundary. Logistic regression predicts that individuals above the boundary are male.
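
If we want the boundary as an explicit line in the rescaled (height, weight) plane, we can compute its slope and intercept from optimalParameters. A short sketch:

// Solve the boundary equation for rescaledWeight as a function of
// rescaledHeight, using the fitted parameters.
val p = optimalParameters
val slope = -p(1) / p(2)
val intercept = -p(0) / p(2)
// Along the boundary: rescaledWeight = intercept + slope * rescaledHeight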
