This chapter concerns a number of R tools that are extensions or accessories to the materials we have discussed so far. That they are treated here at the end of the book does not mean that they are unimportant. However, my concern has been the machinery for finding estimates of nonlinear parameters by optimizing functions. A number of the tools in this chapter stress other aspects of statistical estimation that illuminate the data or models in other ways.
As maximum likelihood estimation is such a common task in computational statistics, several tools and packages exist for carrying out some of the forms of ML tasks.
The function mle() in the stats4 package (part of the base R distribution, R Core Team (2013), although it appears that one needs to load it with require(stats4)) is intended for minimizing a function minuslogl using a method chosen from optim(). This tool appears to have fallen into disuse, possibly because it seems to be rather fragile. However, it does compute the solution for our Hobbs maximum likelihood example introduced in Chapter 12, and we include an illustration of the use of fixed parameters (masks).
require(stats4, quietly = TRUE)
lhobbs.res <- function(xl, y) {
    # log scaled Hobbs weeds problem -- residual
    # base parameters on log(x)
    x <- exp(xl)
    if (abs(12 * x[3]) > 50) {
        # check computability
        rbad <- rep(.Machine$double.xmax, length(x))
        return(rbad)
    }
    if (length(x) != 3)
        stop("hobbs.res -- parameter vector n!=3")
    t <- 1:length(y)
    res <- x[1]/(1 + x[2] * exp(-x[3] * t)) - y
}
lhobbs.lik <- function(Asym, b2, b3, lsig) {
    # likelihood function including sigma
    y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443)
    y <- c(y, 38.558, 50.156, 62.948, 75.995, 91.972)
    xl <- c(Asym, b2, b3)
    logSigma <- lsig
    sigma2 = exp(2 * logSigma)
    res <- lhobbs.res(xl, y)
    nll <- 0.5 * (length(res) * log(2 * pi * sigma2) + sum(res * res)/sigma2)
}
mystart <- list(Asym = 1, b2 = 1, b3 = 1, lsig = 1)  # must be a list
amlef <- mle(lhobbs.lik, start = mystart, fixed = list(lsig = log(0.4)))
amlef
##
## Call:
## mle(minuslogl = lhobbs.lik, start = mystart, fixed = list(lsig = log(0.4)))
##
## Coefficients:
##     Asym       b2       b3     lsig
##  5.27911  3.89366 -1.15976 -0.91629
amlef@min  # minimal neg log likelihood
## [1] 8.117
amle <- mle(lhobbs.lik, start = as.list(coef(amlef)))
amle
##
## Call:
## mle(minuslogl = lhobbs.lik, start = as.list(coef(amlef)))
##
## Coefficients:
##     Asym       b2       b3     lsig
##  5.27914  3.89373 -1.15976 -0.76712
val <- do.call(lhobbs.lik, args = as.list(coef(amle)))
val
## [1] 7.8215
# Note: This does not work
print(lhobbs.lik(as.list(coef(amle))))
## Error: 'b2' is missing
# But this displays the minimum of the negative log likelihood
amle@min
## [1] 7.8215
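The masking idea can be seen in isolation on a much smaller problem. What follows is a minimal sketch and not part of the Hobbs example: the data vector x and the function nll_norm are invented for illustration, with a normal model whose log standard deviation lsig plays the role of the masked parameter. It assumes only stats4 from the base distribution.

```r
library(stats4)

# Hypothetical illustration data (not from the text)
x <- c(4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 4.8, 5.4)

# Negative log likelihood of a normal sample; the parameters are
# separate named arguments, as mle() requires
nll_norm <- function(mu, lsig) {
    sigma2 <- exp(2 * lsig)
    0.5 * (length(x) * log(2 * pi * sigma2) + sum((x - mu)^2) / sigma2)
}

# Step 1: hold lsig fixed (masked) at 0 and estimate mu only
fit_fixed <- mle(nll_norm, start = list(mu = 4, lsig = 0),
                 fixed = list(lsig = 0))

# Step 2: free all parameters, starting from the masked solution;
# the fullcoef slot holds both the fixed and the estimated values
fit_full <- mle(nll_norm, start = as.list(fit_fixed@fullcoef))

print(fit_fixed@fullcoef)
print(fit_full@fullcoef)
```

Solving the masked problem first and then releasing the mask mirrors the two-stage use of mle() above, and gives the full problem a sensible starting point.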
The function mle2() from package bbmle (Bolker and Team, 2013) does seem to work more satisfactorily in my opinion. It has a data= argument that allows us to specify the data with which the model functions are to be computed. Moreover, the standard output includes the value of the likelihood.
require(bbmle, quietly = TRUE)
lhobbs2.lik <- function(Asym, b2, b3, lsig, y) {
    # likelihood function including sigma
    xl <- c(Asym, b2, b3)
    logSigma <- lsig
    sigma2 = exp(2 * logSigma)
    res <- lhobbs.res(xl, y)
    nll <- 0.5 * (length(res) * log(2 * pi * sigma2) + sum(res * res)/sigma2)
}
y0 <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443)
y0 <- c(y0, 38.558, 50.156, 62.948, 75.995, 91.972)
mystart <- list(Asym = 1, b2 = 1, b3 = 1, lsig = 1)  # must be a list
flist <- list(lsig = log(0.4))
amle2f <- mle2(lhobbs2.lik, start = mystart, data = list(y = y0), fixed = flist)
amle2f
##
## Call:
## mle2(minuslogl = lhobbs2.lik, start = mystart, fixed = flist,
##     data = list(y = y0))
##
## Coefficients:
##     Asym       b2       b3     lsig
##  5.27911  3.89366 -1.15976 -0.91629
##
## Log-likelihood: -8.12
amle2 <- mle2(lhobbs2.lik, start = as.list(coef(amlef)), data = list(y = y0))
amle2
##
## Call:
## mle2(minuslogl = lhobbs2.lik, start = as.list(coef(amlef)), data = list(y = y0))
##
## Coefficients:
##     Asym       b2       b3     lsig
##  5.27914  3.89373 -1.15976 -0.76712
##
## Log-likelihood: -7.82
Note: The log likelihood displayed above for amle2 and amle2f is the negative of the function we have minimized. The function value is found as
amle@details$value
## [1] 7.8215
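The sign convention can be checked directly. The sketch below again uses an invented normal sample (x and nll_norm are hypothetical names, not from the text) and stats4 only; it compares the minimized objective stored in the fitted object with what logLik() reports.

```r
library(stats4)

# Hypothetical illustration data (not from the text)
x <- c(4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 4.8, 5.4)

nll_norm <- function(mu, lsig) {
    sigma2 <- exp(2 * lsig)
    0.5 * (length(x) * log(2 * pi * sigma2) + sum((x - mu)^2) / sigma2)
}

fit <- mle(nll_norm, start = list(mu = 5, lsig = -1))

min_nll <- fit@min                  # value of the function that was minimized
ll      <- as.numeric(logLik(fit))  # log likelihood, i.e. its negative

print(c(min_negloglik = min_nll, loglik = ll,
        optim_value = fit@details$value))
```

The two numbers differ only in sign, so care is needed when comparing output from tools that minimize with output from tools that report a log likelihood.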
The package maxLik offers a somewhat different set of tools for maximum likelihood estimation. Its optimizers (some of which are built on the optim() function) are all maximizers, in contrast to almost all the methods in this book, which by default minimize functions. Unfortunately, one quirk of this package is that it does not respect the quietly=TRUE directive of the require() function, so I had to be a bit more aggressive in suppressing the messages that did not fit the formatting of this page. Note that the objective function is not only specified as the log likelihood (rather than its negative), but its parameters are also supplied as a single vector. Personally, I prefer to supply parameters as a vector, but other users may find that individual named parameters help to link the computations to their particular problems.
suppressMessages(require(maxLik))
llh <- function(xaug, y) {
    Asym <- xaug[1]
    b2 <- xaug[2]
    b3 <- xaug[3]
    lsig <- xaug[4]
    val <- (-1) * lhobbs2.lik(Asym, b2, b3, lsig, y)
}
aml <- maxLik(llh, start = c(1, 1, 1, 1), y = y0)
aml
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 15 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -7.8215 (4 free parameter(s))
## Estimate(s): 5.2791 3.8937 -1.1597 -0.76715
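For readers who want a maximizer without installing another package, optim() itself can be asked to maximize by setting a negative fnscale in its control list. The following is a small sketch on invented data (dat and loglik_norm are hypothetical names); it is not the maxLik algorithm, only the same convention of a log likelihood taking a parameter vector.

```r
# Hypothetical illustration data (not from the text)
dat <- c(4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 4.8, 5.4)

# Log likelihood (not its negative) with parameters supplied as a vector
loglik_norm <- function(p, y) {
    mu <- p[1]
    sigma2 <- exp(2 * p[2])
    -0.5 * (length(y) * log(2 * pi * sigma2) + sum((y - mu)^2) / sigma2)
}

# Maximize directly: fnscale = -1 makes optim() maximize fn
fit_max <- optim(c(mu = 5, lsig = -1), loglik_norm, y = dat,
                 method = "BFGS", control = list(fnscale = -1))

# Equivalent minimization of the negative log likelihood
fit_min <- optim(c(mu = 5, lsig = -1), function(p, y) -loglik_norm(p, y),
                 y = dat, method = "BFGS")

print(rbind(maximize = fit_max$par, minimize = fit_min$par))
print(c(loglik = fit_max$value, negloglik = fit_min$value))
```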
The package likelihood (Murphy, 2012) uses simulated annealing to maximize the likelihood function and is more at home in Chapter 15, where it has been mentioned. Moreover, this package uses a very different structure for maximum likelihood estimation than the other packages discussed in this section, so I will not pursue its use further here.
There are always generalizations of any style of model or computation, and R developers have not been slow to pursue such possibilities.
Package nlme is a very large software collection for linear and nonlinear mixed effect models. We have already seen how various packages are available for nonlinear least squares in Chapter 6, but those tools assume we should minimize the sum of squared residuals. The residuals can be explicitly adjusted by weights, or these can be passed to the computation by the weights argument to the nls() function. However, it is assumed that the residuals are uncorrelated. When they have a variance/covariance structure, we need to modify the objective function. In the nlme function gnls(), this is precoded for us.
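To make the phrase "modify the objective function" concrete, here is a minimal sketch of what such a modification looks like when done by hand. It does not use gnls(). Everything in it is an assumption made for illustration: the exponential decay model, the invented data with a small deterministic perturbation, and an AR(1) correlation with a known value rho = 0.5. The ordinary sum of squares r'r is replaced by the generalized form r' V^{-1} r.

```r
tt  <- 1:15
rho <- 0.5                               # assumed known correlation parameter
V    <- rho^abs(outer(tt, tt, "-"))      # AR(1) correlation: V[i, j] = rho^|i - j|
Vinv <- solve(V)

# Hypothetical data: exponential decay plus a small deterministic perturbation
y <- 10 * exp(-0.3 * tt) + 0.05 * sin(3 * tt)

resid_fn <- function(p) p[1] * exp(-p[2] * tt) - y

# Ordinary least squares objective: residuals treated as uncorrelated
ols_obj <- function(p) sum(resid_fn(p)^2)

# Generalized least squares objective: residuals weighted by V^{-1}
gls_obj <- function(p) {
    r <- resid_fn(p)
    as.numeric(t(r) %*% Vinv %*% r)
}

start   <- c(a = 8, b = 0.2)
fit_ols <- optim(start, ols_obj, method = "BFGS")
fit_gls <- optim(start, gls_obj, method = "BFGS")

print(rbind(OLS = fit_ols$par, GLS = fit_gls$par))
```

In practice the correlation parameter is rarely known and must be estimated along with the model parameters, which is part of what gnls() takes care of.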
Package gnm (Turner and Firth, 2012) aims to allow the fitting of models that are rather different, and specified in a different way, from those we have presented in the rest of the book. In particular, gnm considers overparameterized representations, where we have more parameters and modeling terms than we actually need. Our goal is to determine which are useful and generate an effective model with a subset of the model features. (This description is my own; the package authors would probably express things otherwise.) There is both a manual and a quite extensive vignette for the package. We can run the Bates version 16.7 of the logistic model with gnm and put the nlxb() solution after for comparison.
require(gnm, quietly = TRUE)
y0 <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443)
y0 <- c(y0, 38.558, 50.156, 62.948, 75.995, 91.972)
t0 <- 1:12
Hdta <- data.frame(y = y0, x = t0)
formula <- y ~ -1 + Mult(1, Inv(Const(1) + Exp(Mult(1 + offset(-x), Inv(1)))))
st <- c(Asym = 200, xmid = 10, scal = 3)
ans <- gnm(formula, start = st, data = Hdta)
## Running main iterations.....
## Done
ans
##
## Call:
## gnm(formula = formula, data = Hdta, start = st)
##
## Coefficients:
## Mult(., Inv(Exp(Mult(1 + offset(-x), Inv(1))) + Const(1))).
##                                                      196.19
## Mult(1, Inv(Exp(Mult(. + offset(-x), Inv(1))) + Const(1))).
##                                                       12.42
## Mult(1, Inv(Exp(Mult(1 + offset(-x), Inv(.))) + Const(1))).
##                                                        3.19
##
## Deviance: 2.5873
## Pearson chi-squared: 2.5873
## Residual df: 9
require(nlmrt, quietly = TRUE)
anls <- nlxb(y ~ Asym/(1 + exp((xmid - x)/scal)), start = st, data = Hdta)
anls
## nlmrt class object: x
## residual sumsquares = 2.5873 on 12 observations
## after 5 Jacobian and 6 function evaluations
##   name     coeff      SE    tstat       pval    gradient  JSingval
##   Asym   196.186   11.31    17.35  3.167e-08  -3.726e-09     44.93
##   xmid   12.4173  0.3346    37.11  3.715e-11  -1.492e-08      15.6
##   scal   3.18908  0.0698    45.69  5.767e-12  -2.818e-08    0.0474
In this section, I have deliberately left out some packages that are distributed privately and do not satisfy the checks of CRAN. Unfortunately, while such software may contain useful resources, I do not want to suggest the use of tools that may introduce conflicts with the mainstream R packages.
R has two packages that I consider well developed for estimating models that are specified by several equations: sem (Fox et al. 2013) and systemfit (Henningsen and Hamann, 2007).
The package systemfit (Henningsen and Hamann, 2007) is, in my view, more suited to multiequation models from econometrics. These are mostly specified as sets of linear models with exogenous and endogenous variables. While there is a form of nonlinearity introduced by the multiple equation nature of the models, the nonlinear optimization tools of this book are generally not used by practitioners in this field. Indeed, while the package includes a function called nlsystemfit(), the authors caution that it is still under development.
Most of the applications of sem are also likely to be collections of linear models. There is a useful vignette (Fox, 2006).
The packages systemfit and sem use slightly different ways of specifying the model equations to their estimating functions. While both allow equations to be provided as R expressions, sem also allows, via a specifyModel() function, for a form of shorthand for the model. Different optimizers can be specified to fit the model, and I found that I was able to create a modified sem and reinstall it with a modified optimizer. In a problem sent to me by John Fox to investigate a convergence issue, I changed one of the optimization tools from the optim() method CG (a code based on my own work but which I do not feel should now be used) to method BFGS.
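Switching between these two optim() methods is only a matter of changing the method argument. The sketch below is not the sem problem; it uses the Rosenbrock test function (with its analytic gradient) purely as an illustrative stand-in, and fr and grr are names chosen here. How the two methods compare will vary from problem to problem, so the counts and values it prints should be read as one example only.

```r
# Rosenbrock test function and its gradient (illustrative stand-in problem)
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
grr <- function(x) {
    c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
      200 * (x[2] - x[1]^2))
}
x0 <- c(-1.2, 1)

# Same problem, same start; only the method differs
res_cg   <- optim(x0, fr, grr, method = "CG")
res_bfgs <- optim(x0, fr, grr, method = "BFGS")

cmp <- rbind(
    CG   = c(res_cg$counts, value = res_cg$value,
             convergence = res_cg$convergence),
    BFGS = c(res_bfgs$counts, value = res_bfgs$value,
             convergence = res_bfgs$convergence)
)
print(cmp)
```

A nonzero convergence code indicates that the method stopped for a reason other than meeting its tolerance, for example by reaching the iteration limit.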
While the central engines for solving nonlinear least squares tasks have been dealt with in Chapter 6, R has several tools to extend that functionality. We conclude with a brief mention of some of these.
The package nlstools (Baty and Delignette-Muller, 2013) contains a mixed bag of tools and data that reflect the agricultural research background of its authors. There are several functions for working with growth curves, as well as some aids for computing confidence intervals of parameters.
The function fitdistr() from package MASS (distributed with R) is designed to allow for fitting of common univariate distributions by maximum likelihood. Here the package developers have done the work for us of writing the objective function and, in some cases, of providing the starting values for optimization. Bounds can be specified on the parameters. The package fitdistrplus (Delignette-Muller et al. 2013) extends this by allowing for censored data and by permitting other criteria such as moment matching to be used for estimating the distribution parameters. (This, of course, is moving away from optimization.)
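A short sketch of fitdistr() in use follows. The simulated gamma sample and the seed are assumptions made for illustration. For the normal distribution the estimates are available in closed form, whereas the gamma fit goes through optim(), with starting values supplied by the package and a lower bound passed through to the optimizer.

```r
library(MASS)

set.seed(42)                             # arbitrary seed, for reproducibility
x <- rgamma(200, shape = 3, rate = 2)    # simulated data with known parameters

# Normal: closed-form maximum likelihood estimates (sd uses divisor n)
fit_norm <- fitdistr(x, "normal")

# Gamma: numerical optimization; start values are chosen by fitdistr(),
# and the lower bound keeps both parameters positive
fit_gam <- fitdistr(x, "gamma", lower = 0.001)

print(fit_norm)
print(fit_gam)
```

The printed objects show the estimates with their standard errors in parentheses; the fitted log likelihood is available as the loglik component.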
In a similar manner, the authors of package grofit (Kahm et al. 2010) have provided some precanned routines for fitting a variety of growth curves that are common in biological models. Like other packages of this type, the features provided are intended for a particular audience and are not necessarily the easiest for others to employ. For example, we need to find the sum of squares buried in the object nls within the returned solution. Here is an example using the familiar Hobbs data. The summary() output for the returned object (called ah here) is, to my eyes, not very helpful. Therefore, I examined what the object contains and found the nls() component, which is displayed in two ways. Also given is the model component that tells us a logistic model was found to be “best” by the function gcFitModel(), which actually tries several forms and compares them using the Akaike information criterion (AIC). This is defined as

\[ \mathrm{AIC} = 2k + 2L \]

where $k$ is the number of parameters in the statistical model and $L$ is the minimized value of the negative log-likelihood function for the estimated model. Smaller is better.
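As a check on the definition, the AIC can be computed by hand as twice the number of parameters plus twice the minimized negative log likelihood, and compared with what R's AIC() returns. The sketch below does this for an nls() fit of the Hobbs data in the Bates form of the logistic, started near the solution reported earlier in this chapter. It is not what grofit does internally. Note that for nls() fits R counts the residual variance as a parameter, so the count is 4 rather than 3.

```r
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443,
       38.558, 50.156, 62.948, 75.995, 91.972)
x <- 1:12

# Start close to the solution already reported in this chapter
fit <- nls(y ~ Asym / (1 + exp((xmid - x) / scal)),
           start = c(Asym = 196, xmid = 12.4, scal = 3.2))

ll    <- logLik(fit)
k     <- attr(ll, "df")        # 3 model parameters + 1 for the variance
negll <- -as.numeric(ll)       # minimized negative log likelihood

aic_manual <- 2 * k + 2 * negll
print(c(manual = aic_manual, builtin = AIC(fit)))
```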
## tgrofit.R -- Use Hobbs problem to test grofit
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443)
y <- c(y, 38.558, 50.156, 62.948, 75.995, 91.972)
tt <- 1:12
require(grofit, quietly = TRUE)
ah <- gcFitModel(time = tt, data = y)
## --> Try to fit model logistic
## ....... OK
## --> Try to fit model richards
## ....... ERROR in nls(). For further information see help(gcFitModel)
## --> Try to fit model gompertz
## ....... OK
## --> Try to fit model gompertz.exp
## ... OK
print(summary(ah))
##   mu.model lambda.model A.model integral.model stdmu.model stdlambda.model
## 1    15.38       6.0391  196.19          376.9      0.5832         0.20539
##   stdA.model ci90.mu.model.lo ci90.mu.model.up ci90.lambda.model.lo
## 1     11.307            14.42           16.339               5.7013
##   ci90.lambda.model.up ci90.A.model.lo ci90.A.model.up ci95.mu.model.lo
## 1                6.377          177.59          214.79           14.236
##   ci95.mu.model.up ci95.lambda.model.lo ci95.lambda.model.up ci95.A.model.lo
## 1           16.523               5.6366               6.4417          174.02
##   ci95.A.model.up
## 1          218.35
ah$model
## [1] "logistic"
summary(ah$nls)
##
## Formula: data ~ logistic(time, A, mu, lambda, addpar)
##
## Parameters:
##        Estimate Std. Error t value Pr(>|t|)
## A       196.186     11.307    17.4  3.2e-08 ***
## mu       15.380      0.583    26.4  7.8e-10 ***
## lambda    6.039      0.205    29.4  3.0e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.536 on 9 degrees of freedom
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 3.07e-07
ah$nls
## Nonlinear regression model
##   model: data ~ logistic(time, A, mu, lambda, addpar)
##    data: parent.frame()
##      A     mu lambda
## 196.19  15.38   6.04
##  residual sum-of-squares: 2.59
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 3.07e-07
A rather different tool is provided in nnls (Mullen and van Stokkum, 2012). Here we wish to solve a least squares problem that looks like a usual linear least squares calculation. However, we require the resulting parameters to be nonnegative and possibly that their sum be scaled. Such problems arise in decoding spectra, in various imaging calculations (where a pixel cannot have a negative intensity), and in some other domains.
To keep the presentation simple, we will fabricate a small example. Suppose we have three substances that have known spectral signals given as Sig1, Sig2, and Sig3 below. These form our dictionary matrix. We have a mixture of these, with proportions or concentrations 0.23, 0.4, and 0.1. We will not impose a sum constraint but simply assume the combined signal is given by the linear combination of the columns of the dictionary matrix. We can measure the signal (spectrum) of the combination, but there is some measurement error or “noise,” which we add to simulate a real problem.
Sig1 <- c(0, 0, 1, 2, 3, 4, 3, 2, 1, 0, 0, 2, 4, 8, 16, 8, 5, 1, 0)
Sig2 <- c(2, 3, 4, 2, 1, 1, 3, 5, 7, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0)
Sig3 <- c(0, 0, 0, 0, 0, 4, 4, 4, 1, 0, 0, 1, 14, 18, 16, 18, 15, 1, 10)
C <- cbind(Sig1, Sig2, Sig3)
bb <- C %*% as.matrix(c(0.23, 0.4, 0.1))
scale <- 0.15
require(setRNG)
## Loading required package: setRNG
setRNG(kind = "Wichmann-Hill", seed = c(979, 1479, 1542), normal.kind = "Box-Muller")
d <- bb + scale * rnorm(19)
Now that we have our problem, let us solve it with nnls.
require(nnls, quietly = TRUE)
aCd <- nnls(C, d)
aCd
## Nonnegative least squares model
## x estimates: 0.23301 0.40724 0.10194
## residual sum-of-squares: 0.55
## reason terminated: The solution has been computed sucessfully.
However, we can use other methods, including nonlinear least squares and minimization methods that include bounds constraints. First, let us define our residual, Jacobian, sum-of-squares, and gradient functions. And then, as a reminder that it is always a good idea to do so, we check them, but here I have commented out the printout of results.
############# another example ############
resfn <- function(x, matvec, A) {
    x <- as.matrix(x)
    res <- (A %*% x) - matvec
}
jacfn <- function(x, matvec = NULL, A) {
    A
}
ssfun <- function(x, matvec, A) {
    rr <- resfn(x, matvec, A)
    val <- sum(rr^2)
}
ggfun <- function(x, matvec, A) {
    rr <- resfn(x, matvec, A)
    JJ <- jacfn(x, matvec, A)
    gg <- 2 * as.numeric(t(JJ) %*% rr)
}
We could check the functions by running the following code.
# Check functions:
xx <- rep(1, 3)
# resfn
print(resfn(xx, d, C))
# jacfn
print(jacfn(xx, d, C))
# ssfun
print(ssfun(xx, d, C))
# ggfun
print(ggfun(xx, d, C))
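Printing the values only shows that the functions run. A stronger check compares the analytic gradient with a finite-difference approximation. The sketch below does this in base R for a small made-up problem; the matrix A, the vector b, and the helper num_grad are hypothetical names introduced here, and the sum-of-squares and gradient functions simply follow the same pattern as ssfun and ggfun.

```r
# Small made-up linear least squares problem (not the spectra example)
A <- cbind(c(1, 2, 3, 4), c(0, 1, 0, 2), c(2, 0, 1, 1))
b <- c(1.0, 2.5, 1.5, 4.0)

ss_fn <- function(x, matvec, A) sum((A %*% x - matvec)^2)
gr_fn <- function(x, matvec, A) 2 * as.numeric(t(A) %*% (A %*% x - matvec))

# Hypothetical helper: central-difference approximation to the gradient
num_grad <- function(fn, x, ..., h = 1e-5) {
    g <- numeric(length(x))
    for (i in seq_along(x)) {
        e <- rep(0, length(x))
        e[i] <- h
        g[i] <- (fn(x + e, ...) - fn(x - e, ...)) / (2 * h)
    }
    g
}

xx <- rep(1, 3)
g_analytic <- gr_fn(xx, b, A)
g_numeric  <- num_grad(ss_fn, xx, b, A)
max_diff   <- max(abs(g_analytic - g_numeric))
print(rbind(analytic = g_analytic, numeric = g_numeric))
print(max_diff)
```

A large discrepancy usually points to an error in the analytic gradient, which is one of the most common causes of optimizer failure.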
Using nlfb() from package nlmrt is straightforward.
require(nlmrt, quietly = TRUE)
strt <- c(p1 = 0, p2 = 0, p3 = 0)
aCdnlfb <- nlfb(strt, resfn, jacfn, lower = 0, matvec = d, A = C)
aCdnlfb
## nlmrt class object: x
## residual sumsquares = 0.55031 on 19 observations
## after 3 Jacobian and 4 function evaluations
##   name     coeff        SE   tstat       pval    gradient  JSingval
##   p1    0.233008   0.01645   14.16  1.803e-10   -3.98e-11     43.11
##   p2    0.407237   0.01602   25.41  2.309e-14  -2.976e-12     11.57
##   p3    0.101936  0.009332   10.92  7.932e-09   6.769e-11     10.07
Finally, let us try some optimization tools. We note that starting on the bounds does not yield a solution for bobyqa() and nmkb(). The latter code specifically says NOT to start on the bound. No such warning is given for bobyqa(), although the outcome is not totally surprising to me, given that this code uses some heuristics that might not always work. We get solutions when we alter the start. Also note how we would call the optimizers without a gradient function.
require(optimx, quietly = TRUE)
strt <- c(p1 = 0, p2 = 0, p3 = 0)
strt2 <- c(p1 = 0.1, p2 = 0.1, p3 = 0.1)
lo <- c(0, 0, 0)
aop <- optimx(strt, ssfun, ggfun, method = "all", lower = lo, matvec = d, A = C)
## Loading required package: numDeriv
##
## Attaching package: 'numDeriv'
##
## The following object is masked from 'package:maxLik':
##
##     hessian
## Warning: no non-missing arguments to max; returning -Inf
## Warning: no non-missing arguments to min; returning Inf
## Warning: nmkb() cannot be started if any parameter on a bound
summary(aop, order = value)
##               p1      p2      p3       value fevals gevals niter convcode  kkt1
## Rcgmin   0.23301 0.40724 0.10194  5.5031e-01     12      4    NA        0  TRUE
## Rvmmin   0.23301 0.40724 0.10194  5.5031e-01    170     21    NA        0  TRUE
## L-BFGS-B 0.23301 0.40724 0.10194  5.5031e-01      9      9    NA        0  TRUE
## nlminb   0.23301 0.40724 0.10194  5.5031e-01     25     20    19        0  TRUE
## spg      0.23301 0.40724 0.10194  5.5031e-01     15     NA    13        0  TRUE
## hjkb     0.23302 0.40724 0.10193  5.5031e-01    338     NA    19        0 FALSE
## bobyqa        NA      NA      NA 8.9885e+307     NA     NA    NA     9999    NA
## nmkb          NA      NA      NA 8.9885e+307     NA     NA    NA     9999    NA
##          kkt2 xtimes
## Rcgmin   TRUE  0.004
## Rvmmin   TRUE  0.016
## L-BFGS-B TRUE  0.000
## nlminb   TRUE  0.000
## spg      TRUE  0.160
## hjkb     TRUE  0.016
## bobyqa     NA  0.000
## nmkb       NA  0.000
aop2 <- optimx(strt2, ssfun, ggfun, method = "all", lower = lo, matvec = d, A = C)
summary(aop2, order = value)
##               p1      p2      p3   value fevals gevals niter convcode  kkt1
## Rcgmin   0.23301 0.40724 0.10194 0.55031     12      4    NA        0  TRUE
## Rvmmin   0.23301 0.40724 0.10194 0.55031     18      8    NA        0  TRUE
## nlminb   0.23301 0.40724 0.10194 0.55031     26     20    19        0  TRUE
## L-BFGS-B 0.23301 0.40724 0.10194 0.55031      9      9    NA        0  TRUE
## spg      0.23301 0.40724 0.10194 0.55031     18     NA    14        0  TRUE
## bobyqa   0.23301 0.40724 0.10194 0.55031     89     NA    NA        0  TRUE
## hjkb     0.23301 0.40724 0.10194 0.55031    373     NA    19        0 FALSE
## nmkb     0.23304 0.40729 0.10191 0.55031    109     NA    NA        0 FALSE
##          kkt2 xtimes
## Rcgmin   TRUE  0.004
## Rvmmin   TRUE  0.000
## nlminb   TRUE  0.000
## L-BFGS-B TRUE  0.000
## spg      TRUE  0.164
## bobyqa   TRUE  0.004
## hjkb     TRUE  0.012
## nmkb     TRUE  0.016
## No gradient function is needed -- example not run
## aopn <- optimx(strt, ssfun, method='all', control=list(trace=0),
##     lower=c(0,0,0), matvec=d, A=C)
## summary(aopn, order=value)
## aop2n <- optimx(strt2, ssfun, method='all', lower=lo,
##     matvec=d, A=C)
## summary(aop2n, order=value)
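The same kind of gradient-free call can be tried with base R alone. The sketch below uses optim() with method L-BFGS-B, which then approximates the gradient by finite differences. To keep the answer known exactly, it uses the dictionary matrix from above with the noise-free combined signal, so the true concentrations 0.23, 0.4, and 0.1 should be recovered. The names Cmat, sig_exact, and ss_nograd are introduced here for illustration.

```r
Sig1 <- c(0, 0, 1, 2, 3, 4, 3, 2, 1, 0, 0, 2, 4, 8, 16, 8, 5, 1, 0)
Sig2 <- c(2, 3, 4, 2, 1, 1, 3, 5, 7, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0)
Sig3 <- c(0, 0, 0, 0, 0, 4, 4, 4, 1, 0, 0, 1, 14, 18, 16, 18, 15, 1, 10)
Cmat <- cbind(Sig1, Sig2, Sig3)

# Noise-free combined signal, so the exact solution is known
sig_exact <- as.numeric(Cmat %*% c(0.23, 0.4, 0.1))

# Sum of squares only; no gradient function is supplied
ss_nograd <- function(x, matvec, A) sum((A %*% x - matvec)^2)

# Start in the interior of the feasible region, with lower bounds at zero
fit_ng <- optim(c(p1 = 0.1, p2 = 0.1, p3 = 0.1), ss_nograd,
                method = "L-BFGS-B", lower = c(0, 0, 0),
                matvec = sig_exact, A = Cmat)
print(fit_ng$par)
print(fit_ng$counts)
```

With numerical gradients, each reported gradient evaluation costs several extra function evaluations, which is worth keeping in mind when the objective is expensive.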
In Section 1.4.4, we mentioned that some optimization problems have objective functions that can only be imprecisely computed. The obvious example is some process where the value of this “function” is actually measured, such as the time taken for vehicles to complete journeys under some settings of parameters believed to control that measurement. A more statistical example arises when we must integrate over a distribution function to compute a log likelihood or similar objective function. By using Monte-Carlo methods, we can do the integration quickly but imprecisely. Parameter values apart from the estimates are of limited interest, so the question then arises as to whether it is better to compute the integral precisely and optimize the resulting function, or to use an optimization method that is tolerant of imprecise objective function values. This was explored by Joe and Nash (2003), and we produced a Fortran code as well as a description of it.
The Joe–Nash program is of the “model and descend” type. We sample the function (we referred to it as a response surface) at a number of points, estimate a quadratic model or paraboloid, and move to the minimum of the model. Then we repeat this process. Of course, the details are very messy.
While this procedure should be relatively straightforward in R compared to Fortran, we have yet to build a package to do this. See Deng and Ferris (2006) for an approach based on Powell's UOBYQA optimizer. As both the minqa and nloptr packages have versions of Powell's codes, it might be possible to adopt this strategy for use in R, but I have not yet seen attempts to do this.
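The bare idea of "model and descend" can be sketched in a few lines of base R. This is emphatically not the Joe–Nash program and omits all of its safeguards. The noisy objective noisy_f, the function model_and_descend, the fixed sampling half-width, and the number of sample points are all assumptions chosen for illustration. Each cycle samples the noisy function around the current point, fits a full quadratic by linear least squares, and moves to the minimum of the fitted paraboloid when it is convex.

```r
set.seed(123)   # arbitrary seed; the objective below is deliberately noisy

# Hypothetical imprecise objective: quadratic bowl with minimum at (1, -0.5)
noisy_f <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 0.5)^2 + rnorm(1, sd = 0.01)

model_and_descend <- function(f, x0, halfwidth = 1, npts = 30, ncycles = 4) {
    x <- x0
    for (cycle in seq_len(ncycles)) {
        # 1. Sample the response surface in a box around the current point
        P <- cbind(x1 = x[1] + runif(npts, -halfwidth, halfwidth),
                   x2 = x[2] + runif(npts, -halfwidth, halfwidth))
        fv <- apply(P, 1, f)
        dat <- data.frame(P, fv = fv)
        # 2. Fit f ~ c0 + g'x + 0.5 x'Hx by linear least squares
        cf <- unname(coef(lm(fv ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2),
                             data = dat)))
        g <- cf[2:3]
        H <- matrix(c(2 * cf[4], cf[6], cf[6], 2 * cf[5]), 2, 2)
        # 3. Move to the model minimum if the paraboloid is convex;
        #    otherwise fall back to the best sampled point
        if (all(eigen(H, symmetric = TRUE)$values > 0)) {
            x <- as.numeric(-solve(H, g))
        } else {
            x <- as.numeric(P[which.min(fv), ])
        }
    }
    x
}

xmin <- model_and_descend(noisy_f, x0 = c(-1, 2))
print(xmin)
```

A practical code must also decide how many points to sample, how to size and shrink the sampling region relative to the noise level, and when to stop, which is where the messiness arises.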
This concludes my treatment of nonlinear parameter optimization with R. The story is not, of course, ended. As I have been writing, there have been discussions with others about many developments in this field. Unfortunately, some are not sufficiently stable to be sensibly discussed yet.
In the next few years, I anticipate that there will be a serious debate about the optimization tools in R and how they are structured. This is, of course, a common issue for open source software projects, and R is one of the largest and most successful of these. For R, the debate will need to deal with the difficulty of shifting away from legacy tools that are no longer being developed and that have limitations that are awkward to work around. An example is nls(), part of the base distribution. Its writers have largely dropped out of R activities, and it has, as we have seen, a tendency to fail more often than it needs to. Similarly, optim() is showing its age. As we have also seen, there are replacements that address some of the weaknesses, but readers should expect these to have their own cycle of adoption and then replacement with yet other packages.
The more positive message of such ongoing change is that needs do generate new packages. The process is a little messy, in that the new offerings may work wonderfully for some problems and be weak with others. Over time, however, the collective experience advances the system.