Chapter 10
Calculating and using derivatives

In previous chapters, we have referred several times to derivatives, that is, to gradients and Hessians of functions. In practice, having good derivative information is important both for obtaining solutions and for knowing that we have a valid solution. This chapter will look at ways in which we can acquire and use such information.

10.1 Why and how

Derivative information is important

  • because the KKT conditions (Karush, 1939; Kuhn and Tucker, 1951) for a minimum require first derivatives to be “zero” and the Hessian, that is, the second derivative matrix, to be positive definite;
  • because many methods can use gradient information.

Indeed, even methods that claim to be “derivative free” will often use the concepts of gradients and Hessians, either for the function to be minimized or for an approximating model.

It is my experience that the main utility of good derivative information is in testing that we indeed have a solution. That is, it is useful for termination tests and improves performance because it allows us to cease trying to proceed when our journey is complete. In some cases, approximate derivatives may actually give better performance for some gradient methods in the initial steps, when we are far from the solution. This is similar to secant methods outperforming Newton methods in early iterations.

Unfortunately, the calculation of derivatives is not a trivial task. This chapter looks at some approaches and presents some recommendations that I freely admit are open to discussion. They are my own views based on over four decades of dealing with these matters, but they are inevitably grounded in the particular problems I have had to try to solve, which may be very different from the problems faced by my readers.

The following are some general considerations and suggestions.

  • Always run a check on the computations (a small sketch of such a check follows this list). Indeed, some of the packages I have helped develop include checks of derivatives against numerical approximations because derivative coding is so error prone.
  • Comment your code. I usually find the same person who wrote the code (me!) is the one who has to figure it out later on, so it is important for my own working efficiency to try to write readable scripts. In this respect, derivatives can be particularly tricky.
  • If there are bounds, or if the function is undefined in some areas of the domain, then try to include proper checks to avoid a “crash” or, much worse, getting results that cause a silent failure, that is, where the answer is wrong but there is no error message.
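
As a minimal sketch of the first point, here is one way to check a hand-coded gradient against a numerical approximation from package numDeriv (used again later in this chapter). The test function fq and its gradient fq.g are my own toy choices for illustration.

fq <- function(x) sum((x - 1:length(x))^2)  # simple quadratic test function
fq.g <- function(x) 2 * (x - 1:length(x))  # its analytic gradient
x0 <- c(0.5, 2, -1)
# maximum absolute discrepancy; should be near zero if fq.g is coded correctly
max(abs(fq.g(x0) - numDeriv::grad(fq, x0)))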

10.2 Analytic derivatives—by hand

For problems that are run frequently, I believe that manually coded derivative functions generally allow the best performance. If the programmer is sensible, they will also be commented and more comprehensible to the reader of the code than any automatically generated function. Note that for some functions, it can be quite efficient to compute the objective function, the gradient and the Hessian together, because there may be common subcomputations.
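
As a small sketch of that last point (a toy example of my own, not one of the book's test problems), the value and gradient of f(x) = sum(exp(x^2)) share the subcomputation exp(x^2), so it is natural to return both from one function, here using the same "gradient" attribute convention that deriv() produces later in this chapter.

fgfun <- function(x) {
    ex2 <- exp(x^2)  # common subcomputation
    val <- sum(ex2)  # objective value
    attr(val, "gradient") <- 2 * x * ex2  # gradient reuses ex2
    val
}
fgfun(c(0.5, 1))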

Unfortunately, working out analytic derivatives is tedious and error prone. For nonlinear least squares or similar computations where one is trying to fit a function model(parms) to data by varying the parameters parms, I have found it useful to write residuals in the form

(10.1)   res = model(parms) - data

This avoids one possible sign error in taking the derivatives of the residuals, even though most statisticians will write the residuals in the data - model(parms) form. Of course, it makes no difference to the sum of squares. I urge readers to develop similar tricks that suit their own working habits so they can avoid simple but costly errors in computing derivatives. As noted by McCullagh and Nelder (1989, p. 9), both Gauss and Legendre used the form above for residuals, so I am in good company.
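
A tiny illustration of the convention (a made-up linear model, not one of the book's examples): with residuals written as model - data, the Jacobian of the residuals is exactly the Jacobian of the model, with no sign flip to remember, while the sum of squares is the same under either sign convention.

tt <- 1:5
yy <- c(2.1, 3.9, 6.2, 8.1, 9.8)  # invented data
resfun <- function(b) (b[1] + b[2] * tt) - yy  # model - data form of (10.1)
Jfun <- function(b) cbind(1, tt)  # Jacobian of the model = Jacobian of the residuals
sum(resfun(c(0, 2))^2)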

10.3 Analytic derivatives—tools

As mentioned, there are software tools that permit symbolic mathematics, and these could be used to generate expressions for the derivatives. I have myself made occasional use of such packages, for example, DERIVE (Nash, 1995), which appears to be no longer generally available. I have also used Maxima (http://maxima.sourceforge.net) on occasion. However, I find that these tools are quite difficult to use well and need ongoing use to maintain one's skills with them. If you are willing to use these packages regularly, I believe that they may offer considerable rewards.

A particular use of symbolic maths tools is in allowing for partial development of some of the expressions used in the “by-hand” coding of derivative functions. They may also provide a way of checking that our code generates correct answers, possibly for intermediate stages of the derivative computation.

Similarly, there are packages for automatic differentiation (AD). This is not the same as symbolic differentiation, where a computer program is provided with a symbolic representation of the function to be differentiated and returns a symbolic expression with the derivative(s). Instead, clever application of the chain rule is used to compute a numerical value of the derivative(s) wanted, with the “function” supplied as computational code. Thus AD tools must be adapted to the language in which functions are written. In our case, this is R.

R offers some tools that combine symbolic differentiation and AD, and for readers of this book, I believe that these provide the most value for effort expended. However, the tools are limited in their scope. That is, they encompass only a limited set of functions for which derivatives can be computed.
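
For a single partial derivative, the base-R function D() carries out the symbolic step; a brief sketch, using the logistic model that appears in the next section.

# symbolic partial derivative of the model with respect to b2
D(expression(100 * b1/(1 + 10 * b2 * exp(-0.1 * b3 * t))), "b2")
# deriv(), used below, extends this to a full gradient with
# common subexpressions factored out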

10.4 Examples of use of R tools for differentiation

Let us use our nonlinear least squares Example 1.2. We set up the model expression in mod, provide the parameter names in namev, and use the R function deriv() to get the expression for the gradient.

mod <- quote(100 * b1/(1 + 10 * b2 * exp(-0.1 * b3 * t)))  # unevaluated model expression
mod
## 100 * b1/(1 + 10 * b2 * exp(-0.1 * b3 * t))
namev <- c("b1", "b2", "b3")
try1 <- deriv(mod, namev)
try1
## expression({
##     .expr1 <- 100 * b1
##     .expr2 <- 10 * b2
##     .expr6 <- exp(-0.1 * b3 * t)
##     .expr8 <- 1 + .expr2 * .expr6
##     .expr13 <- .expr8^2
##     .value <- .expr1/.expr8
##     .grad <- array(0, c(length(.value), 3L), list(NULL, c("b1",
##         "b2", "b3")))
##     .grad[, "b1"] <- 100/.expr8
##     .grad[, "b2"] <- -(.expr1 * (10 * .expr6)/.expr13)
##     .grad[, "b3"] <- .expr1 * (.expr2 * (.expr6 * (0.1 * t)))/.expr13
##     attr(.value, "gradient") <- .grad
##     .value
## })

We can now copy and paste this expression into a function for computing the residuals and Jacobian of the model. We then compute the sum of squares and gradient.

resfn <- function(pars, t = t, y = y) {
    b1 <- pars[[1]]
    b2 <- pars[[2]]
    b3 <- pars[[3]]
    .expr1 <- 100 * b1
    .expr2 <- 10 * b2
    .expr6 <- exp(-0.1 * b3 * t)
    .expr8 <- 1 + .expr2 * .expr6
    .expr13 <- .expr8^2
    .value <- .expr1/.expr8
    .grad <- array(0, c(length(.value), 3L), list(NULL, c("b1",
        "b2", "b3")))
    .grad[, "b1"] <- 100/.expr8
    .grad[, "b2"] <- -(.expr1 * (10 * .expr6)/.expr13)
    .grad[, "b3"] <- .expr1 * (.expr2 * (.expr6 * (0.1 * t)))/.expr13
    attr(.value, "gradient") <- .grad
    .value
    res <- .value - y
    attr(res, "jacobian") <- .grad
    res
}
## Test the function
ydat <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443,
    38.558, 50.156, 62.948, 75.995, 91.972)  # for testing
tdat <- 1:length(ydat)  # for testing
start <- c(1, 1, 1)
myres <- resfn(start, y = ydat, t = tdat)
print(as.vector(myres))  # we don't want the ‘gradient’
##  [1]   4.64386   3.64458   2.25518   0.11562  -2.91533
##  [6]  -7.77921 -14.68094 -20.35397 -30.41538 -41.57497
## [11] -52.89343 -67.04642
ss <- as.numeric(crossprod(as.vector(myres)))
ss
## [1] 10685
JJ <- attr(myres, "jacobian")
JJ
##            b1       b2       b3
##  [1,]  9.9519  -8.9615  0.89615
##  [2,] 10.8846  -9.6998  1.93997
##  [3,] 11.8932 -10.4787  3.14361
##  [4,] 12.9816 -11.2964  4.51856
##  [5,] 14.1537 -12.1504  6.07520
##  [6,] 15.4128 -13.0373  7.82235
##  [7,] 16.7621 -13.9524  9.76668
##  [8,] 18.2040 -14.8902 11.91213
##  [9,] 19.7406 -15.8437 14.25933
## [10,] 21.3730 -16.8050 16.80496
## [11,] 23.1016 -17.7647 19.54122
## [12,] 24.9256 -18.7127 22.45528
svd(JJ)$d
## [1] 86.61282 12.82005  0.33299
grad <- crossprod(JJ, as.vector(myres))  # t(J) %*% res; the gradient of the sum of squares is twice this
print(as.vector(grad))
## [1] -5045.7  3917.7 -4117.1

Note that we need to get the individual parameters out of the vector pars or else change the expressions inside the gradient evaluation to extract these values directly. While I believe that it should not be too difficult to use R to automate the “copy and paste” and make the appropriate changes, I am not convinced that it is worth the effort.
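
One partial shortcut worth noting: deriv() can return a ready-made function via its function.arg argument, which avoids the copy and paste, although we must still subtract the data ourselves to get residuals. A brief sketch using the objects defined above:

modf <- deriv(mod, namev, function.arg = c("b1", "b2", "b3", "t"))
mval <- modf(1, 1, 1, tdat)  # model values with a "gradient" attribute
max(abs(attr(mval, "gradient") - JJ))  # matches the Jacobian computed above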

10.5 Simple numerical derivatives

We can approximate derivatives by considering the elementary calculus expressions from which limits yield differentiation. Let us for the moment consider f(x), a function of a single parameter x. That is, using a forward step of size h, an approximation to the derivative is

(10.2)   f'(x) ≈ [ f(x + h) - f(x) ] / h

The only awkward issue is the choice of the stepsize h. In calculus, the limit sends this to zero. In practice, we want it to be big enough that f(x + h) has enough digits different from f(x) to allow a good approximation to the true derivative. However, if it is too large, then we may encounter larger scale changes in f. I have found that a value computed as

(10.3)   h = sqrt(eps) * (|x| + sqrt(eps))

where eps is the floating-point precision, that is, the smallest number such that 1 + eps tests greater than 1, is a reasonable compromise (Nash, 1979, Chapter 18), but no value can be regarded as universally satisfactory.

There is, of course, no particular reason for stepping forward, and a backward derivative approximation is equally reasonable:

(10.4)   f'(x) ≈ [ f(x) - f(x - h) ] / h

A central approximation averages these:

f'(x) ≈ [ f(x + h) - f(x - h) ] / (2 h)

As Wikipedia points out (http://en.wikipedia.org/wiki/Finite_difference), the central difference yields an approximation with error proportional to the square of h, while for the forward and backward approximations the error is proportional to h. We pay for the extra accuracy with an extra function evaluation. The central difference approximation is the default in function grad() from package pracma (Borchers, 2013). Note that there is also a grad() in package numDeriv.
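
To make these formulas concrete, here is a minimal sketch (my own illustration, not package code) of the forward and central approximations for a one-parameter function, using the stepsize rule of (10.3).

myeps <- sqrt(.Machine$double.eps)
fwd <- function(f, x) {
    h <- myeps * (abs(x) + myeps)  # stepsize from (10.3)
    (f(x + h) - f(x))/h
}
ctr <- function(f, x) {
    h <- myeps * (abs(x) + myeps)
    (f(x + h) - f(x - h))/(2 * h)
}
f <- function(x) exp(-0.1 * x) * x^2
# analytic derivative is exp(-0.1 * x) * (2 * x - 0.1 * x^2)
c(forward = fwd(f, 2), central = ctr(f, 2), analytic = exp(-0.2) * 3.6)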

10.6 Improved numerical derivative approximations

There are a couple of other ways by which we can improve our numerical approximations.

10.6.1 The Richardson extrapolation

The Richardson extrapolation is a general technique that aims to estimate a “better” answer for approximate computations. The forward, backward, and central derivative approximations of the previous section all become the desired derivative in the limit as h goes to zero, but only if we can evaluate the function sufficiently precisely that the subtraction does not cancel all the digits. The idea of the Richardson extrapolation is to compute the derivative approximation for several values of h and extrapolate to the value at h = 0. The same idea, when extended to numerical integration, is called the Romberg method. Wikipedia provides a readable summary (http://en.wikipedia.org/wiki/Richardson_extrapolation).

R package numDeriv (Gilbert, 2009) uses the Richardson extrapolation as its default method for computing derivative approximations. The alternatives method="simple" and method="complex" can also be specified, as the sketch below illustrates.
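
Method selection in numDeriv's grad() is by name; here is a brief illustration using the same test function f as in the sketch above (output omitted):

numDeriv::grad(f, 2)  # default: Richardson extrapolation
numDeriv::grad(f, 2, method = "simple")  # plain forward difference
numDeriv::grad(f, 2, method = "complex")  # complex step, discussed in the next section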

10.6.2 Complex-step derivative approximations

If the function for which we wish to obtain derivatives is analytic when its argument is allowed to take complex values, it turns out that we can find approximations that are almost as good as the analytic derivatives. This somewhat surprising result seems to have been first published by James Lyness. Besides many contributions to numerical mathematics, I owe James a personal debt of gratitude for saving me from an uncomfortable night at O'Hare Airport when a careless equipment operator cut a power cable for runway lights. A short and readable account of complex-step derivatives is given by Squire and Trapp (1998).

The complex-step approximation to the first derivative of f(x) is given as

(10.5)   f'(x) ≈ Im( f(x + i h) ) / h

This requires only one function evaluation and can be shown to provide a very accurate approximation when it works. The fly in the ointment is that we need a function that accepts complex arguments and generates a complex result. Thus anything with “if”-like structures, absolute values, or similar forms will fail. Generally, it is best to stick to very simple functions when using this derivative approximation, although it is certainly worth trying. Here is a small example using numDeriv. Thanks to Hans Werner Borchers for this.

if ("package:pracma" %in% search()) detach("package:pracma")  # avoid confusion between the two grad() functions
f1 <- function(x) 1 + x^2
f2 <- function(x) abs(1 + x^2)
require(numDeriv)
grad(f1, 1, method = "complex")
## [1] 2
grad(f2, 1, method = "complex")
## Error: function does not return a complex value as required by method ‘complex’.
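
The same result can be obtained directly from formula (10.5) without any package, and the stepsize can be taken extremely small because there is no subtractive cancellation:

h <- 1e-20
Im(f1(1 + 1i * h))/h  # complex-step derivative of f1 at x = 1; analytic value is 2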

10.7 Strategy and tactics for derivatives

Before suggesting what I believe are reasonable approaches for derivative calculations, let us examine the costs and accuracy of different methods (Figure 10.1). We will use the generalized Rosenbrock function with scale factor 100 for different numbers of parameters.

genrose.f <- function(x, gs = 100) {
    # Objective function: a generalization of the Rosenbrock
    # banana valley function to n parameters
    n <- length(x)
    fval <- 1 + sum(gs * (x[1:(n - 1)]^2 - x[2:n])^2 + (x[2:n] - 1)^2)
    return(fval)
}
genrose.g <- function(x, gs = 100) {
    # vectorized gradient for genrose.f
    n <- length(x)
    gg <- as.vector(rep(0, n))
    tn <- 2:n
    tn1 <- tn - 1
    z1 <- x[tn] - x[tn1]^2
    z2 <- 1 - x[tn]
    gg[tn] <- 2 * (gs * z1 - z2)
    gg[tn1] <- gg[tn1] - 4 * gs * x[tn1] * z1
    return(gg)
}

Figure 10.1 Costs and accuracy of different methods for computing derivatives of the generalized Rosenbrock function.

We compute relative errors in the gradients compared to analytic values by taking the ratio of the maximum absolute error to the maximum absolute gradient element. That is, we are not computing all the relative errors and then finding their maximum absolute value. This is to avoid a zero divide, while still obtaining a measure of the relative error. The results are given in the table at the end of this section.
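
A sketch of how such an error measure can be computed for one problem size (the test point here is an arbitrary choice of my own):

n <- 4
x0 <- seq(0.5, by = 0.25, length.out = n)  # arbitrary test point
ga <- genrose.g(x0)  # analytic gradient
myeps <- sqrt(.Machine$double.eps)
gf <- rep(0, n)  # forward difference approximation
for (j in 1:n) {
    h <- myeps * (abs(x0[j]) + myeps)  # stepsize as in (10.3)
    xh <- x0
    xh[j] <- x0[j] + h
    gf[j] <- (genrose.f(xh) - genrose.f(x0))/h
}
gn <- numDeriv::grad(genrose.f, x0)  # Richardson extrapolation
c(forward = max(abs(gf - ga))/max(abs(ga)), numDeriv = max(abs(gn - ga))/max(abs(ga)))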

Overall, my recommendation to users of optimization software is to try first to use analytic derivatives. This may sometimes give slightly slower timings or work measures for the optimization calculation than using a numerical approximation to the derivative, especially when the starting point is far from the solution. At least a partial understanding of this reality can be seen from a one-dimensional problem, where a secant line will in early iterations make better progress toward a minimum than the Newton (i.e., tangent) line. On the other hand, the analytic derivatives are much more reliable for determining that we are actually at an optimum. Moreover, they save much work in running the KKT and other optimality tests, as well as in evaluating test quantities that are used to decide termination of an optimization code.

If analytic derivatives are not available, then clearly the complex-step approximation, which here gives full accuracy, is the next best choice, but it may not be computable for particular functions. While numDeriv does better than a central difference approximation, the latter is much more economical of computing time and does achieve a reasonable accuracy.

Derivative relative errors

##
## ---------------------------------------------------
##  npar   forward   central   numDeriv   complexstep
## ------ --------- --------- ---------- -------------
##   2    3.25e-07  7.57e-10   6.51e-12        0
##
##   4    3.53e-07  8.37e-10   6.21e-12        0
##
##   6    3.53e-07  8.37e-10   9.91e-12        0
##
##   8    3.53e-07  5.37e-10   2.49e-12        0
##
##   10   3.52e-07  1.18e-09   1.11e-11        0
##
##   12   3.53e-07  8.37e-10   2.49e-12        0
##
##   14   3.55e-07  8.37e-10   1.07e-11        0
##
##   16   3.52e-07  1.18e-09   6.94e-12        0
##
##   18   3.52e-07  1.18e-09   7.84e-12        0
##
##   20   3.52e-07  1.18e-09   1.86e-11        0
## ---------------------------------------------------

References

  1. Borchers HW 2013 pracma: Practical Numerical Math Functions. R package version 1.5.0.
  2. Gilbert P 2009 numDeriv: Accurate Numerical Derivatives. R Foundation for Statistical Computing. R package version 2009.2-1.
  3. Karush W 1939 Minima of functions of several variables with inequalities as side constraints. MSc Dissertation, Department of Mathematics, University of Chicago, Chicago, IL.
  4. Kuhn HW and Tucker AW 1951 Nonlinear programming. Proceedings of 2nd Berkeley Symposium, pp. 481–492, Berkeley, USA.
  5. McCullagh P and Nelder J 1989 Generalized Linear Models, Monographs on Statistics and Applied Probability, Second Edition. Chapman and Hall, London.
  6. Nash JC 1979 Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation. Adam Hilger, Bristol. Second Edition, 1990, Bristol: Institute of Physics Publications.
  7. Nash JC 1995 Computer algebra systems: Derive. American Statistician 49(1), 93–99.
  8. Squire W and Trapp G 1998 Using complex variables to estimate derivatives of real functions. SIAM Review 40(1), 110–112.