Minimizing the cost function

At the core of linear regression is the search for the equation of a line that minimizes the sum of the squared differences between the line's y values and the original ones. As a reminder, let's say our regression function is called h and its predictions are h(X), as in this formulation:

$$h(X) = Xw$$

Here, X is the matrix of predictors and w is the vector of regression coefficients.

Consequently, our cost function to be minimized is as follows:

$$J(w) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(h(X_i) - y_i\bigr)^2$$

There are quite a few methods to minimize it, some performing better than others in the presence of large quantities of data. Among the better performers, the most important ones are Pseudoinverse (you can find this in books on statistics), QR factorization, and gradient descent.

Explaining the reason for using squared errors

Looking under the hood of a linear regression analysis, at first it could be puzzling to realize that we are striving to minimize the squared differences between our estimates and the data from which we are building the model. Squared differences are not as intuitively explainable as absolute differences (the difference without a sign).

For instance, if you have to predict a monetary value, such as the price of a stock or the return from an advertising activity, you are more interested in knowing your absolute error than your squared one, which could be perceived as misleading (since squaring emphasizes larger losses).

As we mentioned before, linear regression takes its steps from the statistical knowledge domain, and there are actually quite a few reasons in statistics that make minimizing a squared error preferable to minimizing an absolute one.

Unfortunately, such reasons are quite complex and technical, and consequently beyond the scope of this book; however, from a high-level point of view, a good and reasonable explanation is that squaring neatly achieves two very important objectives:

  • It removes negative values, so opposite errors won't cancel each other out when summed
  • It emphasizes larger differences, because once squared they increase the sum of the errors proportionally more than they would in a simple sum of absolute values (the short check right after this list illustrates both points)
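As a quick numeric illustration of both points (a small check of our own, separate from the regression code), consider two opposite errors of the same size next to a larger one:

In: import numpy as np
  errors = np.array([2.0, -2.0, 4.0])
  print (np.sum(errors))           # 4.0: the two opposite errors cancel each other out
  print (np.sum(np.abs(errors)))   # 8.0: absolute values keep every contribution
  print (np.sum(errors**2))        # 24.0: squares keep them too, and weight the larger error more heavily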

Minimizing the squared differences with an estimator leads us to use the mean (as we suggested before as a basic model, without providing any justification for it).

Let's just check together using Python, without developing all the formulations. Let's define an x vector of values:

In: import numpy as np
  x = np.array([9.5, 8.5, 8.0, 7.0, 6.0])

Let's also define a function returning the cost function as squared differences:

In: def squared_cost(v,e):
     # v is the candidate estimate, e is the vector of observed values
     return np.sum((v-e)**2)

Using the fmin minimization procedure offered by the SciPy package, we try to figure out the single value that, when compared against our x vector of values, produces the smallest sum of squared differences:

In: from scipy.optimize import fmin
  xopt = fmin(squared_cost, x0=0, xtol=1e-8, args=(x,))

Out: Optimization terminated successfully.
      Current function value: 7.300000
      Iterations: 44
      Function evaluations: 88

We output the best value found by the optimization and verify whether it actually is the mean of the x vector:

In: print ('The result of optimization is %0.1f' % (xopt[0]))
  print ('The mean is %0.1f' % (np.mean(x)))

Out: The result of optimization is 7.8
    The mean is 7.8

If instead we try to figure out what minimizes the sum of absolute errors:

In: def absolute_cost(v,e):
     return np.sum(np.abs(v-e))

In: xopt = fmin(absolute_cost, x0=0, xtol=1e-8, args=(x,))

Out: Optimization terminated successfully.
      Current function value: 5.000000
         Iterations: 44
         Function evaluations: 88

In: print ('The result of optimization is %0.1f' % (xopt[0]))
  print ('The median is %0.1f' % (np.median(x)))

Out: The result of optimization is 8.0
    The median is 8.0

We find out that the value minimizing the absolute errors is the median, not the mean. Unfortunately, the median does not have the same statistical properties as the mean.

Pseudoinverse and other optimization methods

There is an analytical formula that solves a regression analysis directly, producing from the data the vector of coefficients that minimizes the cost function:

$$w = (X^T X)^{-1} X^T y$$

Demonstrating this equation goes beyond the practical scope of this book, but we can test it using the power of Python coding.

We can therefore solve for w directly, using np.linalg.inv from NumPy to obtain the inverse of a matrix, or alternatively by solving for w the system of linear equations known as the normal equations:

$$(X^T X)\,w = X^T y$$

Here the function np.linalg.solve can do all the calculations for us:

In: observations = len(dataset)
  X  = dataset['RM'].values.reshape((observations,1)) # X should always be a matrix, never a vector
  Xb = np.column_stack((X,np.ones(observations))) # We add the bias
  y  = dataset['target'].values # y can be a vector

  def matrix_inverse(X,y, pseudo=False):
    if pseudo:
        return np.dot(np.linalg.pinv(np.dot(X.T,X)),np.dot(X.T,y))
    else:
        return np.dot(np.linalg.inv(np.dot(X.T, X)),np.dot(X.T,y))

  def normal_equations(X,y):
      return np.linalg.solve(np.dot(X.T,X), np.dot(X.T,y))

  print (matrix_inverse(Xb, y))
  print (matrix_inverse(Xb, y, pseudo=True))
  print (normal_equations(Xb, y))

Out:
  [  9.10210898 -34.67062078]
  [  9.10210898 -34.67062078]
  [  9.10210898 -34.67062078]

The only problems in solving a linear regression using these approaches are the computational complexity, a possible loss of accuracy when directly calculating the inverse with np.linalg.inv, and, naturally, the fact that the XᵀX matrix has to be invertible (sometimes it isn't, for instance when you use multiple variables that are strongly correlated with each other).
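As a quick illustration of the invertibility issue (a toy example of our own, unrelated to the Boston dataset), a design matrix containing two identical columns makes XᵀX exactly singular: the plain inverse fails, whereas the pseudoinverse still returns an answer:

In: A = np.column_stack((np.ones(3), [2., 3., 4.], [2., 3., 4.])) # two identical predictors
  # np.linalg.inv(np.dot(A.T, A)) would raise LinAlgError: Singular matrix
  print (np.linalg.pinv(np.dot(A.T, A))) # the pseudoinverse is still defined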

Even using another algorithm (QR factorization, a core algorithm in Statsmodels that can overcome some of the numeric problems just quoted), the worst-case performance can be estimated to be O(n³); that is, cubic complexity.
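As a sketch of the QR route (reusing the Xb and y arrays defined above and plain NumPy, rather than the Statsmodels internals), you factorize Xb into an orthonormal Q and an upper-triangular R, and then solve a small triangular system:

In: Q, R = np.linalg.qr(Xb)                  # Xb = QR
  w_qr = np.linalg.solve(R, np.dot(Q.T, y))  # solve R w = Q^T y
  print (w_qr)                               # should match the coefficients printed above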

Using the pseudoinverse (in NumPy, np.linalg.pinv) can help achieve O(nᵐ) complexity, where m is estimated to be less than 2.37 (approximately quadratic, then).

This can be a real limitation on how quickly a linear regression analysis can be estimated. In fact, if you are working with 10³ observations, a feasible number of observations in statistical analysis, it will take at worst 10⁹ computations; however, when working on data science projects, which easily reach 10⁶ observations, the number of computations required to solve a regression problem may rocket to 10¹⁸.

Gradient descent at work

As an alternative to the classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using far fewer computations. Gradient descent complexity ranks in the order of O(n*p), thus making it feasible to learn the regression coefficients even when n (the number of observations) and p (the number of variables) are large.

The method works by leveraging a simple heuristic that gradually converges on the optimal solution starting from a random one. Explaining it simply, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then going downhill again and so on, always aiming at each stage for where the surface descends until you arrive at a point when you cannot descend anymore. Hopefully, at that point you will have reached your destination.

In such a situation, your only risk is stumbling into an intermediate valley (where there is a wood or a lake, for instance) and mistaking it for your destination because the land stops descending there.

In an optimization process, such a situation is called a local minimum (whereas your target is the global minimum, the best minimum possible), and it is a possible outcome of your journey downhill, depending on the function you are minimizing. The good news, in any case, is that the error function of the linear model family is bowl-shaped (technically, our cost function is convex), so it is unlikely that you will get stuck anywhere if you descend properly.
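As an informal check of that bowl shape (a sketch of our own, reusing the Xb and y arrays from the previous section and keeping the bias fixed at roughly the value found there), you can evaluate the sum of squared errors over a grid of candidate slopes and observe a single dip:

In: slopes = np.linspace(0.0, 20.0, 201)  # candidate values for the RM coefficient
  bias = -34.67                           # kept fixed near the optimal value found earlier
  sse = [np.sum((np.dot(Xb, [s, bias]) - y)**2) for s in slopes]
  print (slopes[np.argmin(sse)])          # the single dip sits around 9.1; the cost rises on both sides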

The necessary steps to work out a gradient-descent-based solution are easily described, given our cost function for a set of coefficients (the vector w):

$$J(w) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(h(X_i) - y_i\bigr)^2$$

We first start by choosing a random initialization of w, for instance drawing some random numbers from a standardized normal curve (having a zero mean and unit variance).

Then we repeatedly update the values of w (using the gradient descent computations) until the marginal improvement over the previous J(w) is small enough to let us conclude that we have finally reached a minimum.

We update our coefficients, one by one, by subtracting from each of them a fraction alpha (α, the learning rate) of the partial derivative of the cost function:

$$w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j}$$

Here, in our formula, wⱼ stands for a single coefficient (we iterate over all of them). After resolving the partial derivative, the update takes its final form:

$$w_j := w_j - \alpha \frac{1}{n}\sum_{i=1}^{n}\bigl(h(X_i) - y_i\bigr)\,x_{ij}$$
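For completeness, here is the chain-rule step behind that form (a short derivation sketch in the same notation, added for clarity):

$$\frac{\partial J(w)}{\partial w_j} = \frac{1}{2n}\sum_{i=1}^{n} 2\,\bigl(h(X_i) - y_i\bigr)\,\frac{\partial h(X_i)}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}\bigl(h(X_i) - y_i\bigr)\,x_{ij}$$

since $h(X_i) = \sum_{j} w_j x_{ij}$, and therefore $\partial h(X_i)/\partial w_j = x_{ij}$.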

Simplifying everything, our gradient for the coefficient of x is just the average of our prediction errors multiplied by their respective x values.

Alpha, the learning rate, is very important in the process because, if it is too large, it may cause the optimization to diverge and fail. You have to think of each gradient as a jump or a run in a direction. If you take it fully, you may happen to overshoot the optimal minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the overall sum of squared errors).

Using a small alpha, gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error; anyway, starting from an alpha such as 0.01 is never a bad choice, based on our experience in many optimization problems.

Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether gradient descent is working out a solution or not.

Though conceptually quite simple (it is based on the intuition of moving step by step in the direction that improves our result), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics have elevated it to the core optimization algorithm in machine learning; it is not limited to the linear model family, but it can also be extended, for instance, to neural networks for the process of backpropagation, which updates all the weights of the neural net in order to minimize the training errors. Surprisingly, gradient descent is also at the core of another complex machine learning algorithm, gradient boosting tree ensembles, where we have an iterative process that minimizes the errors using a simpler learning algorithm (a so-called weak learner, because it is limited by high bias) to progress towards the optimization.

Here is a first implementation in Python. We will slightly modify it in the next chapter to make it work efficiently with more predictors than one:

In: observations = len(dataset)
  X  = dataset['RM'].values.reshape((observations,1))
  # X should always be a matrix, never a vector
  X = np.column_stack((X,np.ones(observations))) # We add the bias
  y  = dataset['target'].values # y can be a vector

Now, after defining the response variable, selecting our predictor (the RM feature, the average number of rooms per dwelling), and adding a bias (the constant number 1), we are ready in the following code to define all the functions in our optimization process:

In: import random

  def random_w( p ):
      return np.array([np.random.normal() for j in range(p)])

  def hypothesis(X,w):
      return np.dot(X,w)

  def loss(X,w,y):
      return hypothesis(X,w) - y

  def squared_loss(X,w,y):
      return loss(X,w,y)**2

  def gradient(X,w,y):
      # The gradient of each coefficient is the average of the errors
      # multiplied by the respective feature values, as in the formula above
      gradients = list()
      n = float(len( y ))
      for j in range(len(w)):
          gradients.append(np.sum(loss(X,w,y) * X[:,j]) / n)
      return gradients

  def update(X,w,y, alpha=0.01):
      return [t - alpha*g for t, g in zip(w, gradient(X,w,y))]

  def optimize(X,y, alpha=0.01, eta = 10**-12, iterations = 1000):
      w = random_w(X.shape[1])
      path = list()
      for k in range(iterations):
          SSL = np.sum(squared_loss(X,w,y))
          new_w = update(X,w,y, alpha=alpha)
          new_SSL = np.sum(squared_loss(X,new_w,y))
          w = new_w
          if k>=5 and (new_SSL - SSL <= eta and
                       new_SSL - SSL >= -eta):
              path.append(new_SSL)
              return w, path
          if k % (iterations / 20) == 0:
              path.append(new_SSL)
      return w, path
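As a side note (our own observation, not part of the original listing), the loop inside gradient is equivalent to a single vectorized NumPy expression, a form that also scales naturally to many predictors:

In: def gradient_vectorized(X, w, y):
      # Same result as gradient(), expressed as one matrix product
      return np.dot(X.T, loss(X, w, y)) / float(len(y))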

After finally defining all the functions necessary for gradient descent to work, we can run the optimization and find a solution to our single regression problem:

In: alpha = 0.048
  w, path = optimize(X,y,alpha, eta = 10**-12, iterations = 25000)
  print ("These are our final coefficients: %s" % w)
  print ("Obtained walking on this path of squared loss %s" % path)

Out: These are our final coefficients: [9.1021032698295059,-34.670584445862119]
  Obtained walking on this path of squared loss [369171.02494038735,   23714.645148620271, 22452.194702610999, 22154.055704515144,   22083.647505550518, 22067.019977742671, 22063.093237887566,   22062.165903044533, 22061.946904602359, 22061.895186155631,   22061.882972380481, 22061.880087987909, 22061.879406812728,   22061.879245947097, 22061.879207957238, 22061.879198985589,   22061.879196866852, 22061.879196366495, 22061.879196248334,   22061.879196220427, 22061.879196220034]
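To see the divergence we warned about earlier, you can rerun the same optimization with a deliberately oversized alpha and just a handful of iterations (a quick experiment of our own; the exact figures will vary because of the random initialization):

In: w_bad, path_bad = optimize(X, y, alpha=0.5, eta=10**-12, iterations=10)
  print (path_bad) # the squared loss grows at every step instead of shrinking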

Several linear models in Scikit-learn's linear_model module (most notably SGDRegressor) are powered by gradient descent, making Scikit-learn our favorite choice when working on data science projects with large and big data.
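As a final cross-check (assuming Scikit-learn is installed and that dataset is the Boston housing data loaded earlier in the chapter), you can verify that our gradient descent landed on the same coefficients a library solver produces:

In: from sklearn.linear_model import LinearRegression
  lm = LinearRegression()
  lm.fit(dataset['RM'].values.reshape((observations, 1)), y)
  print (lm.coef_, lm.intercept_) # expected to be close to 9.102 and -34.671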
