Stochastic gradient descent

In the models we've seen so far, such as linear regression, we've talked about a criterion or objective function that the model must minimize while it is being trained. This criterion is also sometimes known as the cost function. For example, the least squares cost function for a model can be expressed as:

$$J = \frac{1}{2n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$

We've added a constant factor of ½ in front of this for reasons that will become apparent shortly. We know from basic differentiation that multiplying a function by a positive constant does not change the location of its minimum. In linear regression, just as with our perceptron model, our model's predicted output, $\hat{y}$, is just a linear weighted combination of the input features. If we assume that our data is fixed and that the weights are variable and must be chosen so as to minimize our criterion, we can treat the cost function as being a function of the weights:

$$J(w) = \frac{1}{2n}\sum_{i=1}^{n}\left(\sum_{j=0}^{p} w_j x_{ij} - y_i\right)^2$$

We have used the letter w to represent the model weights here for the more general case, though in linear regression we've seen that it is customary to use the Greek letter β instead. Here, the index j runs from 0 to p over the input features, with $x_{i0} = 1$ acting as a dummy feature that multiplies the intercept weight $w_0$. As our model variables are the weights, we can consider our cost function to be a function of the weight vector, $w$. To find the minimum of this function, we just need to take the partial derivative of the cost function with respect to this weight vector. For a specific weight, $w_k$, this partial derivative is given by:

$$\frac{\partial J(w)}{\partial w_k} = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=0}^{p} w_j x_{ij} - y_i\right) x_{ik}$$

Note that the coefficient of one half has usefully cancelled out the 2 from the derivative. We now have three different subscripts (i for the observations, j for the features in the inner sum, and k for the weight we are differentiating with respect to), so it is a good idea to take a step back and try to understand this equation. The innermost sum is still computing $\hat{y}_i$, the model's predicted output for the ith observation. Let's substitute this into the equation to simplify things a bit:

$$\frac{\partial J(w)}{\partial w_k} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{ik}$$

Now we should be in a better position to understand this equation. It says that the partial derivative of the cost function with respect to a specific weight, $w_k$, in our model is just the difference between the predicted output of the model and the actual labeled output, multiplied by $x_{ik}$ (the value, for the ith observation, of the input feature that corresponds to our weight $w_k$), and averaged over all the n observations in our data set.
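To make these quantities concrete, here is a minimal R sketch of the cost and gradient computations, assuming the observations are collected in a matrix X with one row per observation and that X already includes a column of ones for the intercept (the function names here are our own, for illustration):

least_squares_cost <- function(w, X, y) {
  y_hat <- X %*% w                       # Predicted outputs for all rows
  sum((y_hat - y) ^ 2) / (2 * nrow(X))   # Includes the 1/2n factor
}

least_squares_gradient <- function(w, X, y) {
  y_hat <- X %*% w
  # For each weight wk: the average of (y_hat_i - y_i) * x_ik over all rows
  t(X) %*% (y_hat - y) / nrow(X)
}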

Tip

If you are not familiar with partial differentiation but are familiar with differentiation, you already know everything you need in order to understand this equation. Partial differentiation simply makes explicit which variable we are differentiating with respect to in an equation that has more than one variable. All other variables are treated as constants and the differentiation is carried out as normal. For example, treating y as a constant, the partial derivative of x²y with respect to x is 2xy.

To find the optimal weights, we need to set this partial derivative to zero and solve the resulting equation for every weight in our weight vector. Note that, through the predicted output term, all the weights in the model appear in the partial derivative for every individual weight. Put differently, this produces a complete system of linear equations, one that is often very large, so solving it directly is often prohibitively expensive, computationally speaking.

Instead, many model implementations use iterative optimization procedures that are designed to gradually approach the correct solution. One such method is gradient descent. For a particular value of the weight vector, gradient descent computes the gradient of the cost function, which points in the direction of steepest ascent, and adjusts the weights by a small amount in the opposite (steepest descent) direction; the size of the step is determined by a parameter known as the learning rate. Thus, the update equation for each weight is:

$$w_k \leftarrow w_k - \eta\,\frac{\partial J(w)}{\partial w_k} = w_k - \frac{\eta}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{ik}$$

In the previous equation, the learning rate is denoted by the Greek letter η. Setting the learning rate to an appropriate value is a very important aspect of optimizing with gradient descent. If we choose a value that is too small, the algorithm will update the weights by a very small amount each time, and thus it will take too long to finish. If we choose a value that is too large, the weights may change too drastically and oscillate between values, so that the algorithm again takes too long to converge, or never converges at all.

There are various sophisticated methods to estimate an appropriate learning rate, the details of which we won't discuss here. Instead, we'll try to find an appropriate learning rate through trial and error, and this often works just fine in practice. One way to keep track of whether our chosen learning rate is decent is to plot the cost function we are trying to minimize against time (represented by the number of iterations made through the data set). If we have chosen a good value for the learning rate, the cost should be decreasing (or at least non-increasing) over time.
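Putting the pieces together, a batch gradient descent loop might look like the following sketch, which reuses the hypothetical functions above (eta and n_iters are our own parameter names) and records the cost at every iteration so that it can be plotted against time:

gradient_descent <- function(X, y, eta, n_iters) {
  w <- rnorm(ncol(X), 0, 0.1)           # Small random starting weights
  cost_history <- numeric(n_iters)
  for (t in 1:n_iters) {
    w <- w - eta * least_squares_gradient(w, X, y)  # Step against the gradient
    cost_history[t] <- least_squares_cost(w, X, y)
  }
  list(weights = w, cost_history = cost_history)
}
# After fitting, plot(fit$cost_history, type = "l") should show a curve
# that is decreasing (or at least non-increasing) if eta was well chosen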

A variant of the gradient descent method is stochastic gradient descent, which performs a similar computation but processes the observations one at a time instead of all together. The key idea is that the gradient of the cost function computed for a particular observation will, on average, equal the gradient computed across all observations. This is, of course, an approximation, but it means that we can process individual observations one at a time, which is very useful, especially if we want to perform online learning. Stochastic gradient descent updates a particular weight, $w_k$, when processing the ith observation in the data set according to the following equation:

$$w_k \leftarrow w_k - \eta\left(\hat{y}_i - y_i\right) x_{ik}$$
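As a sketch, one epoch of this per-observation update could be written in R as follows, under the same assumptions as the earlier snippets (sgd_epoch is our own name; sweeping the data in a random order is one common choice):

sgd_epoch <- function(w, X, y, eta) {
  for (i in sample(nrow(X))) {                # Visit rows in random order
    y_hat_i <- sum(X[i, ] * w)                # Prediction for row i alone
    w <- w - eta * (y_hat_i - y[i]) * X[i, ]  # Update all weights at once
  }
  w
}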

Note

An excellent resource for some of the tricks that are useful when training a model with stochastic gradient descent is a book chapter by Léon Bottou, titled Stochastic Gradient Descent Tricks. A version of this can be found online at http://research.microsoft.com/pubs/192769/tricks-2012.pdf.

Gradient descent and local minima

Gradient descent methods rely on the idea that the cost function being minimized is a convex function. We'll skip the mathematical details and just say that a convex function is one with no local minima other than its single global minimum. Let's look at an example of a non-convex cost function in terms of a single weight w:

[Figure: A non-convex cost function of a single weight w, with its global minimum in a trough near w = 4.5 and a local minimum in a second trough near w = 10.5]

The global minimum of this function is the first trough on the left, at a value of w close to 4.5. If our initial guess for the weight w is 1, the gradient of the cost function will steer us downhill towards the global minimum, and we will progressively approach it until we reach it. If our initial guess is 12, we will instead descend into the trough near the value 10.5. Once we reach this second trough, the gradient of the cost function is 0, and consequently we will not be able to make any further progress towards our global minimum: we have landed in a local minimum.

Detecting and avoiding local minima can be very tricky, especially if there are many of them. One way to mitigate the problem is to repeat the optimization from different starting points and then pick the weights that produce the lowest value of the cost function across the different runs. This procedure works well if the number of local minima is small and they are not too close together. Thankfully, the squared error cost function that we saw in the previous section is convex, so gradient descent methods are guaranteed to find the global minimum, but it is good to be aware that we will encounter other cost functions that are non-convex.
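A sketch of this random restart strategy in R, reusing the hypothetical gradient_descent() function from earlier, might look like this:

best_of_restarts <- function(X, y, eta, n_iters, n_restarts) {
  best <- NULL
  for (r in 1:n_restarts) {
    fit <- gradient_descent(X, y, eta, n_iters)   # New random start each run
    cost <- least_squares_cost(fit$weights, X, y)
    if (is.null(best) || cost < best$cost) {
      best <- list(weights = fit$weights, cost = cost)  # Keep the lowest cost
    }
  }
  best
}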

The perceptron algorithm

Without further ado, we'll present our first training algorithm for classification with neural networks. This is a variation of the perceptron learning algorithm and is known as the pocket perceptron algorithm.

Inputs:

  • x: A two-dimensional matrix, where the rows are the observations and the columns are the input features.
  • y: A vector with the class label (-1 or 1) for all the observations in x.
  • learning_rate: A number that controls the learning rate of the algorithm.
  • max_iterations: The maximum number of cycles through our data that our algorithm is allowed to perform while learning.

Outputs:

  • w: The learned weights of the perceptron.
  • converged: Whether the algorithm converged (true or false).
  • iterations: The actual number of iterations through the data performed during learning.

Method:

1. Randomly initialize the weights w.

2. Select an observation in x, and call it xi.

3. Compute the predicted class, $\hat{y}_i$, using the current values of the weights w and the equation for the output of the perceptron.

4. If the predicted class, $\hat{y}_i$, is not the same as the actual class, $y_i$, then update the weights vector using the stochastic gradient descent update rule.

5. Repeat steps 2–4 for all the observations in our data set and count the number of errors made.

6. If the number of errors is zero, we have converged and the algorithm terminates.

7. If the number of errors made in the current iteration is less than the lowest number of errors ever made, store the weights vector as the best weights vector seen so far.

8. If we have reached the maximum number of iterations, stop and return the value of the best weights vector. Otherwise, begin a new iteration over the data set at step 2.

We'll see the R code for this directly and discuss the steps in detail:

step_function <- function(x) {
   if (x < 0) -1 else 1
}

pocket_perceptron <- function(x, y, learning_rate, max_iterations) {
  nObs = nrow(x)
  nFeatures = ncol(x)
  w = rnorm(nFeatures + 1, 0, 2) # Random weight initialization
  current_iteration = 0
  has_converged = F
  best_weights = w
  # Start by assuming you get all the examples wrong
  best_error = nObs 
  while ((has_converged == F) & 
         (current_iteration < max_iterations)) {
    # Assume we are done unless we misclassify an observation
    has_converged = T 
    # Keep track of misclassified observations
    current_error = 0
    for (i in 1:nObs) {
      xi = c(1, x[i,]) # Append 1 for the dummy input feature x0
      yi = y[i]
      y_predicted = step_function(sum(w * xi))
      if (yi != y_predicted) {
        current_error = current_error + 1
        # We have at least one misclassified example
        has_converged = F 
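        # For our ±1 labels, (y_predicted - yi) is always ±2, so sign()
        # simply folds the constant factor of 2 into the learning rate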
        w = w - learning_rate * sign(y_predicted - yi) * xi
      }
    }
    if (current_error < best_error) {
      best_error = current_error
      best_weights = w
    }
    current_iteration = current_iteration+1
  }
  model <- list("weights" = best_weights, 
                "converged" = has_converged, 
                "iterations" = current_iteration)
  model
}

The first function we define is the step function, which we know will produce either the value -1 or the value 1 corresponding to the two classes in our data set. We then define our main function, which we call pocket_perceptron(). The job of this function is to learn the weights for our perceptron so that our model classifies our training data correctly.

Note that we have not introduced any regularization in our algorithm, so as to keep things simple; consequently, we will likely end up with a model that overfits our data, as we are shooting for 100 percent training accuracy. Proceeding with our algorithm description, we begin our function by initializing the weights vector to small randomly generated numbers. In practice, it is a good idea to make sure that the weights are neither all set to 0 nor symmetric, and random initialization is a good way to achieve this.

We will also set our starting best guess of the weights to be our initial vector and our starting best error rate to be the total number of observations, which is the worst possible error rate on a data set.

The main while loop of the function controls the number of iterations over which our algorithm will run. We will only begin a new iteration when we have not converged and when we have not hit our maximum number of iterations. Inside the while loop, we use a for loop to iterate over the observations in our data set and classify these using the current version of our weight vector.

Every time we make a mistake in classification, we update our error rate, note that we have not converged in this iteration, and update our weight vector according to the stochastic gradient descent update rule for least squares that we saw in the previous section. Although the cost function for the perceptron is not differentiable because of the step function used to threshold the output, it turns out that we can, in fact, still use the same update rule for the weights.

At the end of a complete iteration through our data set, also known as an epoch, we check whether we need to update our best weights vector and update the number of iterations. We update our best weights vector only if the performance in the current iteration on the training data was the best performance we have seen thus far across all completed iterations. When the algorithm terminates, we return the best weights we found, whether or not we converged, and the total number of completed iterations.

Note

The definitive textbook on neural networks, and one that explains perceptron learning in more detail, including a proof of why the algorithm works, is Neural Networks and Learning Machines, 3rd Edition, Simon Haykin, Prentice Hall.

We can put our model to the test by generating some artificial data. We'll do this by sampling values from two uniform distributions in order to create two input features: x1 and x2. We'll then separate these data points into two different classes according to a linear decision boundary that we've chosen randomly:

$$y = \operatorname{sign}\left(-0.89 + 2.07 x_1 - 3.09 x_2\right)$$

Once we have the data and the computed class labels, we can run our perceptron algorithm on it. The following code generates the test data and builds our model:

> set.seed(4910341)
> x1 <- runif(200, 0, 10)
> set.seed(2125151)
> x2 <- runif(200, 0, 10)
> x <- cbind(x1, x2)
> y <- sign(-0.89 + 2.07 * x[,1] - 3.09 * x[,2])
> pmodel <- pocket_perceptron(x, y, 0.1, 1000)
> pmodel
$weights
                 x1        x2 
-1.738271  4.253327 -6.360326 

$converged
[1] TRUE

$iterations
[1] 32

We can see that after 32 iterations, our perceptron algorithm has converged. If we divide our weights vector by 2 (this does not alter our decision boundary), we can see more clearly that we have a decision boundary that is very close to the one that was used when classifying the data:

> pmodel$weights / 2
                   x1         x2 
-0.8691355  2.1266635 -3.1801630 

The following plot shows that the model's decision boundary is virtually indistinguishable from the population line. For our artificially generated data set, this is because the two classes are so close together. If the classes were further apart, we would more likely see a noticeable difference between the population decision boundary and the model's decision boundary. This is because the space of possible lines (or planes when we are dealing with more than two features) that can separate the data would be larger.

[Figure: Scatterplot of the two classes with the population decision boundary and the model's learned decision boundary overlaid; the two lines are virtually indistinguishable]
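One way to produce such a plot (a sketch, assuming the x, y, and pmodel objects created above) is to rearrange each boundary equation $w_0 + w_1 x_1 + w_2 x_2 = 0$ into slope-intercept form, $x_2 = -(w_0 + w_1 x_1) / w_2$:

> plot(x[, 1], x[, 2], col = ifelse(y > 0, "black", "grey"),
+      xlab = "x1", ylab = "x2")
> abline(a = -0.89 / 3.09, b = 2.07 / 3.09)           # Population boundary
> w <- pmodel$weights
> abline(a = -w[1] / w[3], b = -w[2] / w[3], lty = 2) # Model's boundary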

Linear separation

The data that we generated had a particular property that ensured that the perceptron algorithm would converge—it was linearly separable. When two classes are linearly separable in terms of a set of features, it means that it is possible to find a linear combination of these features as a decision boundary that will allow us to classify the two classes with 100 percent accuracy.

If we consider plotting the data points belonging to the two classes in the p-dimensional feature space, then linear separation means that there is a hyperplane (a line in two dimensions, as we saw in our example) that can be drawn to separate the two classes. There is a theorem, known as the perceptron convergence theorem, which states that for linearly separable classes, the perceptron learning algorithm will always converge to a solution that correctly classifies all the data, given enough time.
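The flip side is easy to demonstrate with our own data: if we append a duplicate of one observation with the opposite label, no decision boundary can classify both copies correctly, so the pocket algorithm can never reach zero errors and will exhaust its iteration budget (x_conflict, y_conflict, and cmodel are our own names for this sketch):

> x_conflict <- rbind(x, x[1, ])   # Duplicate an observation...
> y_conflict <- c(y, -y[1])        # ...but give the copy the opposite label
> cmodel <- pocket_perceptron(x_conflict, y_conflict, 0.1, 100)
> cmodel$converged
[1] FALSE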

The logistic neuron

The perceptron is also known as a binary threshold neuron. We can create different types of neurons by changing the activation function. For example, if we remove the threshold function completely, we end up with a linear neuron, which essentially performs the same task as linear regression. By changing the activation function to a logistic function, we can create a logistic neuron.

A logistic neuron performs the same task as logistic regression, by taking a linear combination of inputs and applying the logistic function to predict a value in the interval [0,1]. Stochastic gradient descent can be applied in order to learn the weights of linear neurons as well as logistic neurons. Hence, it can also be applied to learn the weights for logistic and linear regression. The general form of the stochastic gradient descent weight update rule is:

$$w_k \leftarrow w_k - \eta\,\frac{\partial J_i(w)}{\partial w_k}$$

Here, the derivative computes the gradient of the cost function at the particular observation, which we denote $J_i$. We saw its simple form for linear regression and the linear neuron in the previous section. If we differentiate the cost function for logistic regression, we will discover that the stochastic gradient descent update rule for the logistic neuron appears to be exactly the same as that of the linear neuron:

$$w_k \leftarrow w_k - \eta\left(\hat{y}_i - y_i\right) x_{ik}$$

The subtle but important difference is that the form of $\hat{y}_i$ has changed: the weights now sit inside the logistic function, $\hat{y}_i = 1 / (1 + e^{-\sum_j w_j x_{ij}})$, whereas this was not the case in linear regression. Logistic neurons are very important because they are the most common type of neuron used when building networks of many neurons connected together. As we'll see in the next section, we generally build neural networks in layers. The layer containing the neurons that produce our outputs is known as the output layer. The input layer is comprised of our data features, which are the inputs to the network.

Layers in between the input and output layers are known as hidden layers. Logistic neurons are the most common hidden layer neuron. Additionally, we use logistic neurons as output layer neurons when our task is classification, and linear neurons when our task is regression.
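To close the loop, here is a minimal sketch of a logistic neuron trained with stochastic gradient descent, structured like our pocket_perceptron() function but assuming class labels coded as 0 and 1 (the function names are our own):

logistic <- function(z) 1 / (1 + exp(-z))

logistic_neuron_sgd <- function(x, y, learning_rate, epochs) {
  w <- rnorm(ncol(x) + 1, 0, 0.1)        # Random weights, including a bias
  for (epoch in 1:epochs) {
    for (i in sample(nrow(x))) {
      xi <- c(1, x[i, ])                 # Dummy input x0 = 1, as before
      y_hat <- logistic(sum(w * xi))     # Weights now inside the logistic
      w <- w - learning_rate * (y_hat - y[i]) * xi  # Same update form
    }
  }
  w
}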
