In previous chapters, we have referred several times to derivatives, that is, to gradients and Hessians of functions. In practice, having good derivative information is important to obtaining solutions or to knowing that we have a valid solution. This chapter will look at ways in which we can acquire and use such information.
Derivative information is important
Indeed, even methods that claim to be “derivative free” will often use the concepts of gradients and Hessians, either for the function to be minimized or for an approximating model.
It is my experience that the main utility of good derivative information is in testing that we indeed have a solution. That is, it is useful in termination tests and improves performance because it allows us to stop trying to proceed when our journey is complete. In some cases, approximate derivatives may actually give better performance with some gradient methods in the initial steps, when we are far from the solution. This is similar to secant methods outperforming Newton's method in early iterations.
Unfortunately, the calculation of derivatives is not a trivial task. This chapter looks at some approaches and presents some recommendations that I freely admit are open to discussion; they are my own views based on over four decades of dealing with these matters, but they are inevitably grounded in the particular problems I have had to try to solve, which may be very different from the problems faced by my readers.
The following are some general considerations and suggestions.
For problems that are run frequently, I believe that manually coded derivative functions generally allow the best performance. If the programmer is sensible, these functions will also be commented and more comprehensible to the reader of the code than any automatically generated function. Note that for some functions, it can be quite efficient to compute the objective function, the gradient, and the Hessian together, because there may be common subcomputations.
Unfortunately, working out analytic derivatives is tedious and error prone. For nonlinear least squares or similar computations where one is trying to fit a function model(parms) to data by varying the parameters parms, I have found it useful to write residuals in the form

    resids = model(parms) - data

This avoids one possible sign error in taking the derivatives of the resids, even though most statisticians will write the residuals in the data - model(parms) form. Of course, it makes no difference to the sum of squares. I urge readers to develop similar tricks that suit their own working habits so they can avoid simple but costly errors in computing derivatives. As noted by McCullagh and Nelder (1989, p. 9), both Gauss and Legendre used the form above for residuals, so I am in good company.
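To illustrate, here is a tiny made-up example (a one-parameter straight-line model with invented data; the function names are mine, not from any package) showing that the two residual conventions give the same sum of squares while the Jacobian of the residuals changes sign:

```r
# Hypothetical model m(b) = b * t fitted to invented data y
t <- c(1, 2, 3)
y <- c(2.1, 3.9, 6.2)
res_model_minus_data <- function(b) b * t - y  # the form used above
res_data_minus_model <- function(b) y - b * t  # the statisticians' form
b <- 2
# The sum of squares is identical for either convention
ss1 <- sum(res_model_minus_data(b)^2)
ss2 <- sum(res_data_minus_model(b)^2)
# but the derivative of the residuals with respect to b differs in sign:
J1 <- t   # d/db of (b*t - y)
J2 <- -t  # d/db of (y - b*t)
```

With the model - data form, the residual Jacobian is simply the Jacobian of the model, which is the sign error the text warns about.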
As mentioned, there are software tools that permit symbolic mathematics, and these could be used to generate expressions for the derivatives. I have myself made occasional use of such packages, for example, DERIVE (Nash, 1995), which appears to be no longer generally available. I have also made occasional use of Maxima (http://maxima.sourceforge.net). However, I find that these tools are quite difficult to use well and need ongoing use to maintain one's skills with them. If you are willing to use these packages regularly, I believe that they may offer considerable rewards.
A particular use of symbolic maths tools is in allowing for partial development of some of the expressions used in the “by-hand” coding of derivative functions. They may also provide a way of checking that our code generates correct answers, possibly for intermediate stages of the derivative computation.
Similarly, there are packages for automatic differentiation (AD). This is not the same as symbolic differentiation, where a computer program is provided with a symbolic representation of the function to be differentiated and returns a symbolic expression with the derivative(s). Instead, clever application of the chain rule is used to compute a numerical value of the derivative(s) wanted, with the “function” supplied as computational code. Thus AD tools must be adapted to the language in which functions are written. In our case, this is R.
R offers some tools that combine symbolic differentiation and AD, and for readers of this book, I believe that these provide the most value for effort expended. However, the tools are limited in their scope. That is, they encompass only a limited set of functions for which derivatives can be computed.
Let us use our nonlinear least squares Example 1.2. We set up the model expression in mod, provide the parameter names in namev, and use the R function deriv() to get the expression for the gradient.
mod <- ~100 * b1/(1 + 10 * b2 * exp(-0.1 * b3 * t))
mod
## ~100 * b1/(1 + 10 * b2 * exp(-0.1 * b3 * t))
namev <- c("b1", "b2", "b3")
try1 <- deriv(mod, namev)
try1
## expression({
##     .expr1 <- 100 * b1
##     .expr2 <- 10 * b2
##     .expr6 <- exp(-0.1 * b3 * t)
##     .expr8 <- 1 + .expr2 * .expr6
##     .expr13 <- .expr8^2
##     .value <- .expr1/.expr8
##     .grad <- array(0, c(length(.value), 3L), list(NULL, c("b1",
##         "b2", "b3")))
##     .grad[, "b1"] <- 100/.expr8
##     .grad[, "b2"] <- -(.expr1 * (10 * .expr6)/.expr13)
##     .grad[, "b3"] <- .expr1 * (.expr2 * (.expr6 * (0.1 * t)))/.expr13
##     attr(.value, "gradient") <- .grad
##     .value
## })
We can now copy and paste this expression into a function for computing the residuals and Jacobian of the model. We then compute the sum of squares and gradient.
resfn <- function(pars, t = t, y = y) {
    b1 <- pars[[1]]
    b2 <- pars[[2]]
    b3 <- pars[[3]]
    .expr1 <- 100 * b1
    .expr2 <- 10 * b2
    .expr6 <- exp(-0.1 * b3 * t)
    .expr8 <- 1 + .expr2 * .expr6
    .expr13 <- .expr8^2
    .value <- .expr1/.expr8
    .grad <- array(0, c(length(.value), 3L), list(NULL, c("b1", "b2", "b3")))
    .grad[, "b1"] <- 100/.expr8
    .grad[, "b2"] <- -(.expr1 * (10 * .expr6)/.expr13)
    .grad[, "b3"] <- .expr1 * (.expr2 * (.expr6 * (0.1 * t)))/.expr13
    attr(.value, "gradient") <- .grad
    res <- .value - y
    attr(res, "jacobian") <- .grad
    res
}
# Test the function
ydat <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558,
          50.156, 62.948, 75.995, 91.972)  # for testing
tdat <- 1:length(ydat)  # for testing
start <- c(1, 1, 1)
myres <- resfn(start, y = ydat, t = tdat)
print(as.vector(myres))  # we don't want the 'gradient'
## [1]   4.64386   3.64458   2.25518   0.11562  -2.91533
## [6]  -7.77921 -14.68094 -20.35397 -30.41538 -41.57497
## [11] -52.89343 -67.04642
ss <- as.numeric(crossprod(as.vector(myres)))
ss
## [1] 10685
JJ <- attr(myres, "jacobian")
JJ
##            b1       b2       b3
##  [1,]  9.9519  -8.9615  0.89615
##  [2,] 10.8846  -9.6998  1.93997
##  [3,] 11.8932 -10.4787  3.14361
##  [4,] 12.9816 -11.2964  4.51856
##  [5,] 14.1537 -12.1504  6.07520
##  [6,] 15.4128 -13.0373  7.82235
##  [7,] 16.7621 -13.9524  9.76668
##  [8,] 18.2040 -14.8902 11.91213
##  [9,] 19.7406 -15.8437 14.25933
## [10,] 21.3730 -16.8050 16.80496
## [11,] 23.1016 -17.7647 19.54122
## [12,] 24.9256 -18.7127 22.45528
svd(JJ)$d
## [1] 86.61282 12.82005  0.33299
grad <- crossprod(JJ, as.vector(myres))
print(as.vector(grad))
## [1] -5045.7  3917.7 -4117.1
Note that we need to get the individual parameters out of the vector pars, or else change the expressions inside the gradient evaluation to extract these values directly. While I believe that it should not be too difficult to use R to automate the "copy and paste" and make the appropriate changes, I am not convinced that it is worth the effort.
We can approximate derivatives by considering the elementary calculus expressions from which limits yield differentiation. Let us for the moment consider f(x), a function of a single parameter x. Using a forward step of size h, an approximation to the derivative is

    f'(x) ≈ [f(x + h) - f(x)]/h
The only awkward issue is the choice of the stepsize h. In calculus, the limit sends this to zero. In practice, we want it to be big enough that f(x + h) has enough digits different from f(x) to allow a good approximation to the true derivative. However, if it is too large, then we may encounter larger scale changes in f. As a compromise, I have found that a value computed as

    h = sqrt(eps) * (|x| + sqrt(eps))

where eps is the floating-point precision (the smallest number such that 1 + eps tests greater than 1) is a reasonable compromise (Nash, 1979, Chapter 18), but no value can be regarded as universally satisfactory.
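As a sketch, this stepsize rule and the forward difference can be coded directly; fwd_deriv here is an illustrative helper, not a function from any package:

```r
# Forward-difference derivative approximation with the stepsize compromise
# discussed above; .Machine$double.eps is R's floating-point precision.
fwd_deriv <- function(f, x) {
    eps <- .Machine$double.eps
    h <- sqrt(eps) * (abs(x) + sqrt(eps))  # scale step to the size of x
    (f(x + h) - f(x))/h
}
f <- function(x) x^3
fwd_deriv(f, 2)  # analytic derivative is 3 * 2^2 = 12
```

The error of this approximation is of order sqrt(eps), so we expect roughly half the available digits to be correct.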
There is, of course, no particular reason for stepping forward, and a backward derivative approximation is equally reasonable:

    f'(x) ≈ [f(x) - f(x - h)]/h
A central approximation averages these:

    f'(x) ≈ [f(x + h) - f(x - h)]/(2 h)
As Wikipedia points out (http://en.wikipedia.org/wiki/Finite_difference), the central difference yields an approximation with error proportional to the square of h, while for forward and backward approximations the error is proportional to h. We pay for the extra accuracy with an extra function evaluation. The central difference approximation is the default in function grad() from package pracma (Borchers, 2013). Note that there is also a grad() in package numDeriv.
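A quick check of these error orders, using exp() so the true derivative is known exactly (the names fwd and ctr are merely illustrative):

```r
# Forward versus central difference for f(x) = exp(x) at x = 1,
# where the true derivative is exp(1).
f <- function(x) exp(x)
x <- 1
h <- 1e-5
fwd <- (f(x + h) - f(x))/h
ctr <- (f(x + h) - f(x - h))/(2 * h)
err_fwd <- abs(fwd - exp(1))  # error of order h, roughly 1e-5 here
err_ctr <- abs(ctr - exp(1))  # error of order h^2, several orders smaller
```

The central approximation buys its extra accuracy with one more function evaluation per derivative.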
There are a couple of other ways by which we can improve our numerical approximations.
The Richardson extrapolation is a general technique that aims to estimate a "better" answer from approximate computations. The forward, backward, and central derivative approximations of the previous section all become the desired derivative in the limit as h goes to zero, but only if we can evaluate the function sufficiently precisely that the subtraction does not cancel all the digits. The idea of the Richardson extrapolation is to compute the derivative approximation for several values of h and extrapolate to get the value for h = 0. The same idea extended to numerical integration is called the Romberg method. Wikipedia provides a readable summary (http://en.wikipedia.org/wiki/Richardson_extrapolation).
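A minimal sketch of a single Richardson step (richardson1 is an illustrative name, not a library function): applying the h^2-accurate central difference at stepsizes h and h/2 and combining the two estimates cancels the leading error term, leaving an error of order h^4.

```r
# One Richardson extrapolation step applied to the central difference.
# If D(h) has error c*h^2 + O(h^4), then (4*D(h/2) - D(h))/3 removes
# the h^2 term, giving an O(h^4) estimate of the derivative.
richardson1 <- function(f, x, h = 1e-3) {
    D <- function(h) (f(x + h) - f(x - h))/(2 * h)
    (4 * D(h/2) - D(h))/3
}
f <- function(x) sin(x)
richardson1(f, 1)  # true derivative is cos(1)
```

Packages such as numDeriv repeat this kind of step over a sequence of shrinking stepsizes rather than stopping after one.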
R package numDeriv (Gilbert, 2009) uses the Richardson extrapolation as its default method for computing derivative approximations. method="simple" or method="complex" can also be specified.
If the function for which we wish derivatives is analytic when extended to complex arguments, it turns out that we can find approximations that are almost as good as the analytic derivatives. This somewhat surprising result seems to have been first published by James Lyness. Besides many contributions to numerical mathematics, I owe James a personal debt of gratitude for saving me from an uncomfortable night at O'Hare Airport when a careless equipment operator cut a power cable for runway lights. A short and readable account of complex-step derivatives is given by Squire and Trapp (1998).
The complex-step approximation to the first derivative of f(x) is given as

    f'(x) ≈ Im(f(x + i h))/h

where i is the imaginary unit and h is a very small step.
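This formula can be tried directly in base R, since R arithmetic handles complex numbers; cstep is an illustrative helper, and the function chosen is an assumption for the demonstration:

```r
# Complex-step first derivative: Im(f(x + i*h))/h.
# Because no subtraction of nearly equal values occurs, h can be taken
# extremely small without catastrophic cancellation.
cstep <- function(f, x, h = 1e-20) Im(f(x + 1i * h))/h
f <- function(x) x^3 + exp(x)
cstep(f, 1)       # derivative is 3*x^2 + exp(x), i.e., 3 + exp(1) at x = 1
3 * 1^2 + exp(1)  # analytic value for comparison
```

Note that f must accept and return complex values for this to work, which is the limitation discussed next.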
This requires only one function evaluation and can be shown to provide a very accurate approximation when it works. The fly in the ointment is that we need a function that can generate a complex result. Thus anything with "if"-like structures, absolute values, or similar forms will fail. Generally, it is best to stick to very simple functions when using this derivative approximation, although it is certainly worth trying. Here is a small example using numDeriv. Thanks to Hans Werner Borchers for this.
detach(package:pracma)  # avoid confusion between two grad() functions
f1 <- function(x) 1 + x^2
f2 <- function(x) abs(1 + x^2)
require(numDeriv)
grad(f1, 1, method = "complex")
## [1] 2
grad(f2, 1, method = "complex")
## Error: function does not return a complex value as required by method 'complex'.
Before suggesting what I believe are reasonable approaches for derivative calculations, let us examine the costs and accuracy of different methods (Figure 10.1). We will use the generalized Rosenbrock function with scale factor 100 for different numbers of parameters.
genrose.f <- function(x, gs = 100) {
    # objective function: a generalization of the Rosenbrock
    # banana valley function to n parameters
    n <- length(x)
    fval <- 1 + sum(gs * (x[1:(n - 1)]^2 - x[2:n])^2 + (x[2:n] - 1)^2)
    return(fval)
}
genrose.g <- function(x, gs = 100) {
    # vectorized gradient for genrose.f
    n <- length(x)
    gg <- as.vector(rep(0, n))
    tn <- 2:n
    tn1 <- tn - 1
    z1 <- x[tn] - x[tn1]^2
    z2 <- 1 - x[tn]
    gg[tn] <- 2 * (gs * z1 - z2)
    gg[tn1] <- gg[tn1] - 4 * gs * x[tn1] * z1
    return(gg)
}
We compute relative errors in the gradients compared to analytic values by taking the ratio of the maximum absolute error to the maximum absolute gradient element. That is, we are not computing all the relative errors and then finding their maximum absolute value. This is to avoid a zero divide, while still obtaining a measure of the relative error. The results are given in the table at the end of this section.
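The measure just described can be sketched as follows (rel_err is an illustrative name, and the gradient values are made up, with one exact element deliberately zero to show why elementwise relative errors would fail):

```r
# Ratio of the largest absolute error to the largest absolute gradient
# element. Unlike elementwise relative errors, this never divides by a
# zero gradient component, yet still scales the error sensibly.
rel_err <- function(approx, exact) max(abs(approx - exact))/max(abs(exact))
exact  <- c(0, 2, -4)            # hypothetical analytic gradient
approx <- c(1e-8, 2 + 1e-8, -4)  # hypothetical numerical approximation
rel_err(approx, exact)  # 1e-8 / 4 = 2.5e-9
```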
Overall, my recommendation to users of optimization software is to try first to use analytic derivatives. This may sometimes give slightly slower timings or work measures for the optimization calculation than using a numerical approximation to the derivative, especially when the starting point is far from the solution. At least a partial understanding of this reality can be seen from a one-dimensional problem, where a secant line will in early iterations make better progress toward a minimum than the Newton (i.e., tangent) line. On the other hand, the analytic derivatives are much more reliable for determining that we are actually at an optimum. Moreover, they save much work in running the KKT and other optimality tests, as well as in evaluating test quantities that are used to decide termination of an optimization code.
If analytic derivatives are not available, then clearly the complex-step approximation, which here gives full accuracy, is the next best choice, but it may not be computable for particular functions. While numDeriv does better than a central difference approximation, the latter is much more economical of computing time and still achieves reasonable accuracy.
Derivative relative errors
## ## --------------------------------------------------- ## npar forward central numDeriv complexstep ## ------ --------- --------- ---------- ------------- ## 2 3.25e-07 7.57e-10 6.51e-12 0 ## ## 4 3.53e-07 8.37e-10 6.21e-12 0 ## ## 6 3.53e-07 8.37e-10 9.91e-12 0 ## ## 8 3.53e-07 5.37e-10 2.49e-12 0 ## ## 10 3.52e-07 1.18e-09 1.11e-11 0 ## ## 12 3.53e-07 8.37e-10 2.49e-12 0 ## ## 14 3.55e-07 8.37e-10 1.07e-11 0 ## ## 16 3.52e-07 1.18e-09 6.94e-12 0 ## ## 18 3.52e-07 1.18e-09 7.84e-12 0 ## ## 20 3.52e-07 1.18e-09 1.86e-11 0 ## ---------------------------------------------------
Gilbert, P. (2009). numDeriv: Accurate Numerical Derivatives. R package version 2009.2-1.