The scientists who built the core R system have included a number of tools for optimization. In this chapter, we explore these and review their strengths and weaknesses. It is, however, important to note the following.
optim()
The optim() function is one of the most used tools in R. It incorporates five methods, three of which are derived from the author's own work (Nash, 1979). The five methods are largely independent but are all called through optim(), using a call such as
ans <- optim(start, fn = myfn, gr = mygr, method = "BFGS", X = myX, Y = myY, control = list(trace = 1))
where start is the vector of parameters used as the initial point of the optimization iteration, myfn is the function to be minimized, mygr is its gradient, and the control list specifies that we want to follow the iterations (trace = 1). The elements given by X = myX, Y = myY are named variables required by myfn and mygr and are passed to optim() via the "..." arguments. The method is specified via a character string, here "BFGS" (Broyden–Fletcher–Goldfarb–Shanno).
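To make the pattern concrete, here is a minimal runnable sketch of such a call: a straight-line least-squares fit in which the data X and Y are passed through optim()'s "..." mechanism. The data values and the exact forms of myfn and mygr are invented for illustration.

```r
## Sum-of-squares objective for a straight line Y = b1 + b2*X (invented example)
myfn <- function(par, X, Y) sum((Y - par[1] - par[2] * X)^2)
## Its analytic gradient with respect to par = c(b1, b2)
mygr <- function(par, X, Y) {
    res <- Y - par[1] - par[2] * X
    c(-2 * sum(res), -2 * sum(res * X))
}
myX <- 1:10
myY <- 2 + 3 * myX            # exact line, so the minimum is zero
start <- c(0, 0)
ans <- optim(start, fn = myfn, gr = mygr, method = "BFGS",
             X = myX, Y = myY, control = list(trace = 0))
ans$par   # should be near c(2, 3)
```

Note that X and Y are matched by name to the arguments of myfn and mygr; optim() itself never inspects them.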
The methods included in optim() are Nelder–Mead (the default), BFGS, CG, L-BFGS-B, and SANN, selected via the method argument. Notes on some of these follow.

L-BFGS-B
This is perhaps the most elaborate method of the optim() set. It uses a limited-memory update of the inverse Hessian and so is somewhat similar to the BFGS approach. However, the amount of information kept to build the update can be controlled by the user, the default being to keep information from the last five iterations. See Lu et al. (1994) or the code published by Zhu et al. (1997). The method can, moreover, deal with bounds on the parameters and is the default method if any bound is specified. Generally, L-BFGS-B works well, although its very sophistication does get in the way of success on occasion. At the time of writing, R does not include the updated version of this method (Morales and Nocedal, 2011).

CG
The conjugate-gradient method offers a choice of three update formulas (Fletcher–Reeves, Polak–Ribiere, or Beale–Sorenson), selected by the control element type, which takes the values 1 (the default), 2, or 3, respectively. Personally, I have found the second option preferable, but none is definitely better than the others. Overall, I have been disappointed in this method. However, a more recent approach (Dai and Yuan, 2001) that cleverly combines and exploits differences in the updates allowed me to prepare Rcgmin, with which I have been more than pleased.

SANN
This stochastic method, based on simulated annealing, has two particular drawbacks. First, it will always evaluate the function a fixed number of times and then report "convergence." This is exceedingly misleading to most users. Moreover, the idea of "temperature" in the process is inconsistent with the descent ideas of the other methods.
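Two of these behaviors can be verified directly: supplying bounds engages L-BFGS-B (it is the default whenever bounds are present), and SANN consumes its whole budget of function evaluations (by default 10000) yet still reports convergence code 0. A small sketch using an invented quadratic:

```r
fq <- function(x) sum((x - 3)^2)   # minimum at (3, 3), outside the bounds below
## Bounds require (and, if no method is given, default to) L-BFGS-B
ab <- optim(c(0, 0), fq, method = "L-BFGS-B", lower = -1, upper = 2)
ab$par             # pinned at the upper bounds (2, 2)
## SANN runs a fixed number of evaluations and always reports code 0
asann <- optim(c(0, 0), fq, method = "SANN")
asann$counts       # the full evaluation budget is used
asann$convergence  # 0, even though no real convergence test was applied
```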
nlm()
nlm() is an interface to a particularized version of UNCMIN, a set of algorithms for unconstrained minimization from Schnabel et al. (1985). The approach is a Newton or quasi-Newton method, but the details are somewhat difficult to discern from the manual. The original code provided for a variety of line search and updating techniques, as well as different gradient approximations.
The calling sequence for nlm() is rather different from that of optim(), and the manual gives the usage as
nlm(f, p, ..., hessian = FALSE, typsize = rep(1, length(p)),
    fscale = 1, print.level = 0, ndigit = 12, gradtol = 1e-6,
    stepmax = max(1000 * sqrt(sum((p/typsize)^2)), 1000),
    steptol = 1e-6, iterlim = 100, check.analyticals = TRUE)
Note that the function f is given first in the sequence and the parameters p second, the reverse of the order in optim(). Moreover, the gradient and possibly the Hessian code are passed as attributes of f, described as
f: the function to be minimized. If the function value has an attribute called 'gradient' or both 'gradient' and 'hessian' attributes, these will be used in the calculation of updated parameter values. Otherwise, numerical derivatives are used. 'deriv' returns a function with suitable 'gradient' attribute. This should be a function of a vector of the length of 'p' followed by any other arguments specified by the '...' argument.
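For instance, a function can carry its own analytic gradient as an attribute of the returned value. The quadratic below is invented for illustration:

```r
## Sketch: attach an analytic gradient to the function value for nlm()
fg <- function(x) {
    val <- sum((x - 2)^2)                 # simple quadratic, minimum at (2, 2)
    attr(val, "gradient") <- 2 * (x - 2)  # analytic gradient as an attribute
    val
}
res <- nlm(fg, c(0, 0))
res$estimate   # should be near c(2, 2)
```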
As with so many optimization routines, nlm() has a lot of controls that essentially adjust heuristics aimed at tuning the methods. In practice, I have rarely if ever used these, even though I work in the optimization field and should have the expertise to do so. The reality is that each program has different names or ways of implementing relatively similar ideas that concern how "big" or "small" different quantities should be. Users find it tedious to learn the intricacies of each control, and tinkering is time consuming and generally unproductive.
A possible exception is the limit on iterations. I generally want to know if a method gets to a valid solution when allowed to take as many iterations as it likes.
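As a sketch of this practice (the quartic test function here is invented), nlm() reports code 4 when iterlim is exhausted, so a generous limit lets us see whether a genuine termination test was satisfied:

```r
slowf <- function(x) sum((x - 1)^4)  # very flat near the minimum, so progress is slow
short <- nlm(slowf, c(10, 10), iterlim = 5)
short$code   # 4 = iteration limit reached, not a real convergence test
long <- nlm(slowf, c(10, 10), iterlim = 1000)
long$code    # a small code (1 or 2) indicates a genuine termination test was met
```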
Despite the above criticisms of nlm(), in my experience it is one of the more reliable and efficient optimization methods for unconstrained functions, even when only the function code is provided. To simplify its use, I generally call it via optimx(), which allows the use of a common, optim()-like calling sequence.
nlminb()
Because nlm() is an unconstrained minimizer but bounds constraints were needed for a number of maximum likelihood problems, nlminb() was added to the base set of optimizers. The code on which nlminb() is based was written by David Gay at Bell Labs and was part of the PORT Fortran library (Fox et al., 1978). The S language at Bell Labs used this as its optimizer, and Doug Bates, in an e-mail dated June 2008, states that
> I created the nlminb function because I was unable to get reliable convergence on some difficult optimization problems for the nlme and lme4 packages using nlm and optim. The nlme package was originally written for S from Bell Labs (the forerunner of SPLUS) and the PORT package was the optimization code used. Even though it is very old style Fortran code I find it quite reliable as an optimizer.
I agree with this general view, although the code in the PORT library is not easy to follow and use. Moreover, as with nlm(), there are many settings and choices that could be helpful but that most R users will want to leave alone. However, nlminb() is quite similar in call to optim(), namely,
nlminb(start, objective, gradient = NULL, hessian = NULL, ...,
       scale = 1, control = list(), lower = -Inf, upper = Inf)
The issue of controls was highlighted by a discovery in 2010 (see https://stat.ethz.ch/pipermail/r-help/2010-July/245194.html) that nlminb() assumed positive objective functions (likely sums of squares), in that one of the default termination tests compared the function value with abs.tol, a control parameter of nlminb(). Many negative log-likelihood functions will not be positive, and it is difficult to know how many incorrect results were used or published. The issue underlines one that is central to open-source software such as R: we need to be able to easily review code if it is to be improved. The PORT library has many fine codes, but in my view, they are not easily understood or changed, especially as implemented for use by R, where the interface adds another layer of confusion.
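A defensive sketch of this point: for an objective that can go negative, such as many negative log-likelihoods, setting abs.tol to zero disables the absolute-value termination test. (Later versions of R changed the default of abs.tol to 0, but setting it explicitly is harmless.) The objective below is invented:

```r
negobj <- function(x) (x - 1)^2 - 10   # minimum value is -10, i.e., negative
a1 <- nlminb(start = 5, objective = negobj, control = list(abs.tol = 0))
a1$par         # should be near 1
a1$objective   # near -10, not cut off by a positivity assumption
```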
Let us solve the well-known Rosenbrock (1960) banana-shaped valley problem with the methods of this chapter. We will start from the "standard" point of (-1.2, 1), and we will not, for the moment, supply the gradient.
fr <- function(x) { ## Rosenbrock Banana function
    x1 <- x[1]
    x2 <- x[2]
    100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}
sstart <- c(-1.2, 1)
To keep the output tidy, we will use a little display function.
prtopt <- function(optres, headname) { # compact print of optim() result
    cat(headname, ": f(")
    npar <- length(optres$par)
    for (i in 1:(npar - 1)) {
        cat(optres$par[[i]], ", ")
    }
    cat(optres$par[[npar]], ") = ", optres$value, " after ",
        optres$counts[[1]], "f & ", optres$counts[[2]], "g\n")
    cat(optres$message, " convergence code=", optres$convergence, "\n")
}
First, we try the Nelder–Mead, BFGS, CG, and L-BFGS-B methods of optim().
anm <- optim(sstart, fr)  # Nelder-Mead is the default method
prtopt(anm, "anm")
## anm : f( 1.0003 , 1.0005 ) = 8.8252e-08 after 195 f & NA g
## convergence code= 0
abfgs <- optim(sstart, fr, method = "BFGS")
prtopt(abfgs, "abfgs")
## abfgs : f( 0.9998 , 0.99961 ) = 3.8274e-08 after 118 f & 38 g
## convergence code= 0
acg1 <- optim(sstart, fr, method = "CG", control = list(type = 1))
prtopt(acg1, "acg1")
## acg1 : f( -0.76481 , 0.59271 ) = 3.1065 after 402 f & 101 g
## convergence code= 1
acg2 <- optim(sstart, fr, method = "CG", control = list(type = 2))
prtopt(acg2, "acg2")
## acg2 : f( 0.96767 , 0.93628 ) = 0.0010432 after 384 f & 101 g
## convergence code= 1
acg3 <- optim(sstart, fr, method = "CG", control = list(type = 3))
prtopt(acg3, "acg3")
## acg3 : f( 0.94331 , 0.88953 ) = 0.0032119 after 352 f & 101 g
## convergence code= 1
albfgsb <- optim(sstart, fr, method = "L-BFGS-B")
prtopt(albfgsb, "albfgsb")
## albfgsb : f( 0.9998 , 0.9996 ) = 3.9985e-08 after 49 f & 49 g
## CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH  convergence code= 0
Here we note that Nelder–Mead, BFGS, and L-BFGS-B have essentially found the correct solution at (1, 1), but the CG runs have hit the maximum allowed number of gradient computations. Let us check the results using the analytic gradient. Note that we test the gradient code!
frg <- function(x) { ## Rosenbrock Banana function gradient
    x1 <- x[1]
    x2 <- x[2]
    g1 <- -400 * (x2 - x1 * x1) * x1 - 2 * (1 - x1)
    g2 <- 200 * (x2 - x1 * x1)
    gg <- c(g1, g2)
}
## test gradient at standard start:
require(numDeriv)
gn <- grad(fr, sstart)
ga <- frg(sstart)
max(abs(ga - gn))
## [1] 1.47e-08
## analytic gradient results
gabfgs <- optim(sstart, fr, frg, method = "BFGS")
prtopt(gabfgs, "BFGS w. frg")
## BFGS w. frg : f( 1 , 1 ) = 9.595e-18 after 110 f & 43 g
## convergence code= 0
gacg1 <- optim(sstart, fr, frg, method = "CG", control = list(type = 1))
prtopt(gacg1, "CG Fletcher-Reeves w. frg")
## CG Fletcher-Reeves w. frg : f( -0.76484 , 0.59276 ) = 3.1066 after 402 f & 101 g
## convergence code= 1
gacg2 <- optim(sstart, fr, frg, method = "CG", control = list(type = 2))
prtopt(gacg2, "CG Polak-Ribiere w. frg")
## CG Polak-Ribiere w. frg : f( 0.99441 , 0.98882 ) = 3.1238e-05 after 385 f & 101 g
## convergence code= 1
gacg3 <- optim(sstart, fr, frg, method = "CG", control = list(type = 3))
prtopt(gacg3, "CG Beale-Sorenson w. frg")
## CG Beale-Sorenson w. frg : f( 0.89973 , 0.80901 ) = 0.010045 after 355 f & 101 g
## convergence code= 1
galbfgsb <- optim(sstart, fr, frg, method = "L-BFGS-B")
prtopt(galbfgsb, "L-BFGS-B w. frg")
## L-BFGS-B w. frg : f( 0.9998 , 0.9996 ) = 3.9985e-08 after 49 f & 49 g
## CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH  convergence code= 0
We observe that the CG methods are still doing poorly compared to the other approaches, even with the analytic gradient supplied.