In this chapter, we consider constraints that are neither bounds nor masks. If there are many constraints, the problem is usually considered one of mathematical programming (see Chapter 14), and the constraints are then often linear. There may also be other features, such as variables that must all take integer values, or a mix of integer and real parameters. Such problems have generally been less common in statistics, but there appears to be growing interest in them.
In this chapter, however, we consider problems with just a few constraints, so that the main focus of attention is still the objective function.
Equality constraints are, in my experience, usually more troublesome than inequality constraints. This might seem paradoxical. After all, each equality constraint should reduce the dimension of the problem by one parameter. The difficulty is in choosing WHICH parameter to eliminate.
Nash (1979, p. 184) presents a problem (with seemingly erroneous results reported) that is apparently a linear regression except that four of the coefficients are related. We have six variables and want a model

one = b1 + b2 * two + b3 * three + b4 * four + b5 * five + b6 * six + (error)

However, there is a constraint so that

b3 * b6 - b4 * b5 = 0, that is, b3 * b6 = b4 * b5

In this example, it is very obvious that we can solve the constraint for any one of the four parameters b3, b4, b5, or b6. The data for the problem are read from a comma-separated value (CSV) file. From the origin of the data set and problem, I call this the Hassan problem.
First, let us find the unconstrained solution and check the violation of the constraint.
hn <- read.csv("hassan182.csv", header = TRUE)
convi <- function(x, dta) {  # dta not used
    h <- rep(NA, 1)  # vector value
    h[1] <- x[3] * x[6] - x[4] * x[5]
    h
}
uncmod <- lm(one ~ two + three + four + five + six, data = hn)
uncmod
##
## Call:
## lm(formula = one ~ two + three + four + five + six, data = hn)
##
## Coefficients:
## (Intercept)          two        three         four
##     -1.1352       0.9460       0.0997      -0.0276
##        five          six
##    -31.4896      56.1636
bu <- coef(uncmod)
cat("Constraint violation = b3*b6-b4*b5 =", as.numeric(convi(bu)), "\n")
## Constraint violation = b3*b6-b4*b5 = 4.7287
cat("Sum of squared residuals = ", as.numeric(crossprod(uncmod$residuals)), "\n")
## Sum of squared residuals =  891.99
A minor warning. The nonlinear least squares codes nls(), nlxb(), and nlsLM() take model expressions similar to BUT NOT THE SAME AS those used by lm() to build uncmod. As linear models are a special case of nonlinear ones, we could have used these functions to perform the calculations above. To do this, we need to explicitly add our coefficients, namely,
uncmod2 <- nls(one ~ b1 + two * b2 + three * b3 + four * b4 + five * b5 + six * b6,
    data = hn)
## Warning: No starting values specified for some parameters.
## Initializing ‘b1’, ‘b2’, ‘b3’, ‘b4’, ‘b5’, ‘b6’ to ‘1.’.
## Consider specifying ‘start’ or using a selfStart model
summary(uncmod2)
##
## Formula: one ~ b1 + two * b2 + three * b3 + four * b4 + five * b5 + six *
##     b6
##
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)
## b1  -1.1352    41.3043   -0.03    0.978
## b2   0.9460     0.1917    4.93    8e-05 ***
## b3   0.0997     0.0576    1.73    0.099 .
## b4  -0.0276     0.0305   -0.91    0.376
## b5 -31.4895    34.9621   -0.90    0.378
## b6  56.1637    37.3244    1.50    0.148
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 6.68 on 20 degrees of freedom
##
## Number of iterations to convergence: 3
## Achieved convergence tolerance: 4.18e-07
There are three approaches to a solution of the constrained problem that we will examine here: eliminating one parameter using the constraint, applying general constrained optimizers, and using a penalty function. Among the general constrained optimization packages, alabama and Rsolnp seem to be the most promising. For the present problem, which has rather nasty properties, we will see that it is very easy to get answers uncomfortably far from the best possible.
One difficulty with “solving” for one of the parameters in this problem is that we have the possibility of a zero divide. For example, if we solve for b3, so that b3 = b4 * b5 / b6, and b6 gets close to zero, we can expect our algorithms and programs to get into trouble. Despite this possibility, we can still try to approach the problem by solving for one parameter in terms of the rest and using nonlinear least squares on the resulting model, or else minimizing the sum of squares of these residuals with an optimizer.
res3 <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    # b3 <- x[3]
    b4 <- x[3]
    b5 <- x[4]
    b6 <- x[5]
    res <- dta$two * b2 + dta$three * b4 * b5/b6 + dta$four * b4 + dta$five *
        b5 + dta$six * b6 + b1 - dta$one
    return(-res)
}
ss3 <- function(x, dta) {
    res <- res3(x, dta)
    sum(res * res)
}
require(nlmrt)
## Loading required package: nlmrt
require(optimx)
## Loading required package: optimx
st3 <- setNames(bu[-3], c("b1", "b2", "b4", "b5", "b6"))
anlfb3 <- try(nlfb(start = st3, resfn = res3, dta = hn))
if (class(anlfb3) != "try-error") {
    cnb3 <- coef(anlfb3)
    print(cnb3)
    cat("RSS=", crossprod(anlfb3$resid), "\n")
} else {
    cat("Error - failed\n")
}
##         b1         b2         b4         b5         b6
## -28.680156   1.036184  -0.042352 -41.310598  79.207696
## attr(,"pkgname")
## [1] "nlmrt"
## RSS= 985.52
## derived_b3
cnb3["b4"] * cnb3["b5"]/cnb3["b6"]
##       b4
## 0.022089
## Try optimx
aop3 <- optimx(st3, fn = ss3, gr = "grnd", dta = hn, control = list(all.methods = TRUE))
## Warning: Replacing NULL gr with ‘grnd’ approximation
## Loading required package: numDeriv
## Warning: Parameters or bounds appear to have different scalings.
##   This can cause poor performance in optimization.
##   It is important for derivative free methods like BOBYQA, UOBYQA, NEWUOA.
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: Function evaluation limit exceeded -- may not converge.
summary(aop3, order = value)
##                   b1      b2        b4      b5     b6
## ucminf      -28.6803 1.03618 -0.042352 -41.311 79.208
## Rvmmin      -28.6803 1.03618 -0.042352 -41.311 79.208
## nlminb      -28.6803 1.03618 -0.042352 -41.311 79.208
## Rcgmin      -28.6807 1.03619 -0.042352 -41.310 79.207
## nmkb        -28.6803 1.03618 -0.042354 -41.308 79.209
## BFGS        -25.3920 1.01751 -0.043002 -35.760 82.639
## CG          -25.3920 1.01751 -0.043002 -35.760 82.639
## Nelder-Mead -25.3920 1.01751 -0.043002 -35.760 82.639
## L-BFGS-B    -25.3920 1.01751 -0.043002 -35.760 82.639
## hjkb        -13.2112 0.96400 -0.031632 -39.492 71.582
## bobyqa       -1.1530 0.91213 -0.018279 -31.496 57.114
## newuoa       -1.1312 0.91182 -0.018244 -31.399 57.140
## spg          -1.1713 0.91260 -0.017788 -31.549 56.304
## nlm          -1.1354 0.91246 -0.017685 -31.490 56.164
##               value fevals gevals niter convcode  kkt1
## ucminf       985.52     27     27    NA        0  TRUE
## Rvmmin       985.52     68     23    NA        0  TRUE
## nlminb       985.52     45     24    23        0  TRUE
## Rcgmin       985.52   1228    310    NA        1  TRUE
## nmkb         985.52   1050     NA    NA        0 FALSE
## BFGS         986.67     51     51    NA        0 FALSE
## CG           986.67     51     51    NA        0 FALSE
## Nelder-Mead  986.67     51     51    NA        0 FALSE
## L-BFGS-B     986.67     51     51    NA        0 FALSE
## hjkb         995.02  10007     NA    16        1 FALSE
## bobyqa      1039.36    235     NA    NA        0 FALSE
## newuoa      1039.43    217     NA    NA        0 FALSE
## spg         1041.92   2764     NA  1501        1 FALSE
## nlm         1042.51     NA     NA     9        0 FALSE
##              kkt2 xtimes
## ucminf      FALSE  0.080
## Rvmmin      FALSE  0.068
## nlminb      FALSE  0.064
## Rcgmin      FALSE  0.888
## nmkb        FALSE  0.116
## BFGS        FALSE  0.132
## CG          FALSE  0.136
## Nelder-Mead FALSE  0.136
## L-BFGS-B    FALSE  0.132
## hjkb        FALSE  0.465
## bobyqa      FALSE  0.012
## newuoa      FALSE  0.008
## spg         FALSE  4.280
## nlm         FALSE  0.000
The reader may rightly be dismayed by the variety of results with optimx. However, there is general agreement between the most successful methods (those with the smallest sum of squared residuals) and the nonlinear least squares solution.
We chose to eliminate b3 in the computations just completed. Is there any reason to choose this over any of b4, b5, or b6? As it turns out, the choice can be important. In particular, when b6 is eliminated, the best nonlinear least squares solution by nlfb() or optimx has a residual sum of squares of 1008.334. This is clearly not close to the optimal value of 985.5175 found when any of the other parameters is eliminated. Trying a starting vector of all 1's does allow two of the optimx methods to find the lower value, but nlfb() still ends up at the higher sum of squares. The unconstrained parameter b6 is the largest numerically, and in the elimination of b3, it appears as a divisor. When we eliminate b6, the rather small b3 is our divisor, and unfortunate steps during optimization could risk a zero divide. However, this issue does not seem to affect the elimination of b5, which is also quite large in magnitude.
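To make the role of the divisor concrete, here is a minimal sketch (my own illustration, mirroring res3() above rather than code reproduced from the original text) of the residual function when b6 is eliminated instead, so that b6 = b4 * b5 / b3 and the small coefficient b3 becomes the denominator.

## Sketch: residuals with b6 eliminated via b6 = b4*b5/b3 (illustrative only;
## here the small coefficient b3 is the divisor)
res6 <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    b3 <- x[3]
    b4 <- x[4]
    b5 <- x[5]
    res <- dta$two * b2 + dta$three * b3 + dta$four * b4 + dta$five * b5 +
        dta$six * b4 * b5/b3 + b1 - dta$one
    return(-res)
}
st6 <- setNames(bu[-6], c("b1", "b2", "b3", "b4", "b5"))  # drop b6 from the lm() start
anlfb6 <- try(nlfb(start = st6, resfn = res6, dta = hn))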
In the process of developing this example, I explored the conventional advice of centering the variables, that is, subtracting their means. For linear regression, this often stabilizes the computations and reduces the dimensionality of the normal equations (or their equivalent) by 1. One can also divide by the standard deviation to standardize the variables.
In the present example, however, centering or standardizing the variables introduces a change that complicates the constraint. Unless such stabilization offers a very large advantage, I would avoid it. Simple scalings by powers of 10 are, however, useful without the necessity of algebraic manipulations. I found that all four choices of parameter to eliminate will give at least one satisfactory solution with optimx if I scale the columns of the data by multiplying them by suitable powers of 10. However, elimination of b6 still failed to get a nonlinear least squares solution. A small sketch of one such rescaling is given below.
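As an illustration (the particular factors here are my own choice, not necessarily those used originally), scaling columns three and four by 0.1 and columns five and six by 10 brings the fitted coefficients closer to a common magnitude. Because the factors satisfy s3 * s6 = s4 * s5, the constraint b3 * b6 = b4 * b5 keeps exactly the same form for the rescaled coefficients.

## Illustrative power-of-10 rescaling of the data columns (factors are
## hypothetical). If column j is multiplied by s_j, its fitted coefficient
## becomes b_j / s_j, so the constraint is unchanged whenever s3*s6 == s4*s5.
sc <- c(one = 1, two = 1, three = 0.1, four = 0.1, five = 10, six = 10)
hns <- hn
for (nm in names(sc)) hns[[nm]] <- sc[[nm]] * hn[[nm]]
unc.s <- lm(one ~ two + three + four + five + six, data = hns)
cat("Constraint value for the rescaled coefficients:", convi(coef(unc.s)), "\n")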
Let us try using the constrOptim.nl() and auglag() augmented Lagrangian methods of package alabama and the solnp() function of Rsolnp. All these handle inequality constraints too, but for now, we will use just the equality constraint capabilities. Once again, we must caution that the calling syntax of different functions is not standardized; in particular, note the eqB argument for solnp(). We include the Jacobian calculation in the code but do not necessarily use it. Note that there is one other package, Rdonlp2, which is an interface to Peter Spellucci's DONLP2 nonlinear programming code. However, there are licensing restrictions on this R-forge package, for which reason I am reluctant to pursue its use here. Here is the setup for the other codes.
resall <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    b3 <- x[3]
    b4 <- x[4]
    b5 <- x[5]
    b6 <- x[6]
    res <- dta$two * b2 + dta$three * b3 + dta$four * b4 + dta$five * b5 +
        dta$six * b6 + b1 - dta$one
    return(-res)
}
jacall <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    b3 <- x[3]
    b4 <- x[4]
    b5 <- x[5]
    b6 <- x[6]
    nobs <- dim(dta)[1]
    JJ <- cbind(rep(-1, nobs), -dta$two, -dta$three, -dta$four, -dta$five,
        -dta$six)
}
ssall <- function(x, dta) {
    res <- resall(x, dta)
    sum(res * res)
}
stones <- rep(1, 6)  # trivial start
names(stones) <- c("b1", "b2", "b3", "b4", "b5", "b6")
stunc <- coef(uncmod)  # unconstrained parameter start
We will not present all the output, which is quite voluminous. The calling sequences for constrOptim.nl(), auglag(), and solnp() are as follows for start stones.
require(alabama)
aalabco <- constrOptim.nl(par = stones, fn = ssall, heq = convi, dta = hn)
aauglag <- auglag(stones, fn = ssall, heq = convi, dta = hn)
require(Rsolnp)
asolnp <- solnp(stones, fun = ssall, eqfun = convi, eqB = c(0), dta = hn)
## Loading required package: alabama
## Start from all ones -- constrOptim.nl
## b1 b2 b3 b4 b5
## -21.673745 1.005383 0.016668 -0.051264 -30.666798
## b6
## 94.321144
## constrOptim.nl: sumsquares = 1003.8 constr. violation= 7.8447e-08
## Start from unconstrained model -- constrOptim.nl
## (Intercept)         two       three        four        five
##  -16.404246    0.979924    0.018891   -0.043194  -37.585187
##         six
##   85.940314
## constrOptim.nl: sumsquares = 997.18  constr. violation= 9.2954e-08
## Start from all ones -- auglag
##         b1         b2         b3         b4         b5
## -13.684176   0.968233   0.014537  -0.045940 -28.667509
##         b6
##  90.596584
## auglag: sumsquares = 1005.3  constr. violation= 1.9809e-11
## Start from unconstrained model -- auglag
## (Intercept)         two       three        four        five
##  -18.188436    0.982143    0.018380   -0.042621  -37.150893
##         six
##   86.147218
## auglag: sumsquares = 993.04  constr. violation= 1.3769e-08
## Loading required package: Rsolnp
## Loading required package: truncnorm
## Loading required package: parallel
## Start from all ones -- solnp
##         b1         b2         b3         b4         b5
## 15.3698499  0.9193906  0.1548982  0.0078678 -55.1476935
##         b6
## -2.8011496
## solnp: sumsquares = 1008.3  constr. violation= -8.043e-13
## Start from unconstrained model -- solnp
## (Intercept)         two       three        four        five
##  -28.728953    1.036451    0.022111   -0.042383  -41.326619
##         six
##   79.215658
## solnp: sumsquares = 985.52  constr. violation= 2.5975e-12
Again the results are not consistent. Note that auglag() can take starting parameters that are not feasible, while constrOptim.nl() requires a start that satisfies the constraint(s).
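It is easy to check the feasibility of a proposed start directly. The all-ones start happens to satisfy the equality constraint exactly (1*1 - 1*1 = 0), while the unconstrained least squares coefficients do not (the violation of 4.7287 computed earlier). A one-line check, using the objects defined above, is:

## Evaluate the equality constraint at each proposed start
cat("convi(stones) =", convi(stones, dta = hn),
    "   convi(stunc) =", convi(stunc, dta = hn), "\n")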
Penalty functions are another way to try to impose constraints, albeit approximately. We construct an objective function from the original one by adding some weighted multiple of a positive function of the constraint violation. In our case, the constraint is an equality, so we square its value. We scale this with the factor pen, which has a default value of 100 but is an argument to the modified objective function that follows.
To automate the process, we start with pen=10 and increase it successively until the residual sum of squares does not change from iteration to iteration. This is NOT a guaranteed procedure, of course.
hassanp.f <- function(x, pen = 100, dta) {
    mod <- x[1] + x[2] * dta$two + x[3] * dta$three + x[4] * dta$four +
        x[5] * dta$five + x[6] * dta$six
    res <- mod - dta$one
    pfn <- x[3] * x[6] - x[4] * x[5]
    fun <- sum(res * res) + pen * pfn * pfn  # return(fun)
}
xx <- coef(uncmod)
bestRSS <- 1e+300
rss <- 1e+299  # make it different from bestRSS to start loop
pen <- 10
while (rss != bestRSS) {
    bestRSS <- rss
    pen = 10 * pen
    anmp <- optim(par = xx, fn = hassanp.f, dta = hn, pen = pen)
    xx <- anmp$par  # for next iteration
    rss <- ssall(xx, dta = hn)
    cat("RSS = ", rss, " for pen=", pen, "\n")
    cat(" Constraint violation =", convi(xx, dta = hn), "\n")
}
## RSS =  985.43  for pen= 100
##  Constraint violation = 0.13551
## RSS =  985.54  for pen= 1000
##  Constraint violation = 0.0142
## RSS =  985.54  for pen= 10000
##  Constraint violation = 0.00050568
## RSS =  985.52  for pen= 1e+05
##  Constraint violation = 0.00016767
## RSS =  985.52  for pen= 1e+06
##  Constraint violation = 1.3581e-05
## RSS =  985.52  for pen= 1e+07
##  Constraint violation = 1.6515e-06
## RSS =  985.52  for pen= 1e+08
##  Constraint violation = 4.3183e-07
## RSS =  985.52  for pen= 1e+09
##  Constraint violation = 4.1873e-08
## RSS =  985.52  for pen= 1e+10
##  Constraint violation = 1.7881e-08
## RSS =  985.52  for pen= 1e+11
##  Constraint violation = 3.9851e-09
## RSS =  985.52  for pen= 1e+12
##  Constraint violation = -1.6715e-09
## RSS =  985.52  for pen= 1e+13
##  Constraint violation = 4.4855e-10
## RSS =  985.52  for pen= 1e+14
##  Constraint violation = 1.375e-11
## RSS =  985.52  for pen= 1e+15
##  Constraint violation = -2.5066e-11
## RSS =  985.52  for pen= 1e+16
##  Constraint violation = 1.2307e-11
## RSS =  985.52  for pen= 1e+17
##  Constraint violation = -3.7594e-12
## RSS =  985.52  for pen= 1e+18
##  Constraint violation = 4.4031e-13
## RSS =  985.52  for pen= 1e+19
##  Constraint violation = -1.8341e-13
## RSS =  985.52  for pen= 1e+20
##  Constraint violation = 1.5277e-13
## RSS =  985.52  for pen= 1e+21
##  Constraint violation = -2.2871e-14
## RSS =  985.52  for pen= 1e+22
##  Constraint violation = -5.5511e-15
## RSS =  985.52  for pen= 1e+23
##  Constraint violation = 8.8818e-16
## RSS =  985.52  for pen= 1e+24
##  Constraint violation = 8.8818e-16
## Best parameters
print(xx)
## (Intercept)         two       three        four        five
##  -28.564318    1.035479    0.021894   -0.042405  -40.985170
##         six
##   79.382823
Note that there is a sensitivity to the initial penalty and the factor by which it is increased, so that it is possible to converge to a solution with the objective above 1000 (possibly a plateau or a local minimum, because similar proposed answers are seen elsewhere).
Penalty methods like this are sometimes helpful to get a quick handle on a problem. Note that we used the default Nelder–Mead optimizer to do the work. This approach would, therefore, make sense if we were blocked from installing software that is more suited to the problem, for example, if we were at a location without network service.
A particular class of optimization problem that arises quite often is one that may be called a sumscale problem. Here we wish to maximize or minimize an objective function whose parameters are constrained by some scaling, so that scale(x) = 1, where this scale function involves a sum of the parameters, their squares, or some similar simple function of them.
There are several examples of this type of objective function; the Rayleigh quotient of a matrix eigenvalue problem, mentioned later in this section, is one. For the moment, let us consider a basic example, which is to maximize the product of the parameters subject to the constraint that they sum to 1, that is,

maximize x1 * x2 * ... * xn   subject to   x1 + x2 + ... + xn = 1

This is a very simplified version of the multinomial maximum likelihood problem, suggested by Gabor Grothendieck as a test problem of this type. One approach to solving this is to eliminate the last parameter, setting xn = 1 - (x1 + ... + x(n-1)), and optimize over the other n - 1 parameters, where n is the number of parameters. Let us try it for a few values of n. We turn off some tests and warnings to keep the output tidy.
require(optimx, quietly = TRUE)  # load a range of optimization methods
pr <- function(y) {
    -prod(y) * (1 - sum(y))
}
pr.g <- function(x) {
    g <- -prod(x) * (1 - sum(x))/x + prod(x)
}
n <- 2
cat(" n", " ", " x1", " ", " 1e5*(x1-1/n)\n")
options(digits = 6)
while (n <= 6) {
    st <- 1:(n - 1)/(n * n)
    ctrl <- list(starttests = FALSE, trace = 0, dowarn = FALSE)
    ans <- optimx(st, pr, pr.g, method = "Rvmmin", control = ctrl)
    cc <- coef(ans)
    cat(n, " ", (1/n), "\n")
    print(cc)
    n <- n + 1
}
##  n    x1    1e5*(x1-1/n)
## 2   0.5
##      [,1]
## [1,] 0.35
## 3   0.333333
##              p1       p2
## Rvmmin 0.333333 0.333333
## 4   0.25
##            p1    p2     p3
## Rvmmin 0.0625 0.125 0.1875
## 5   0.2
##          p1   p2   p3   p4
## Rvmmin 0.04 0.08 0.12 0.16
## 6   0.166667
##               p1        p2        p3       p4       p5
## Rvmmin 0.0277778 0.0555556 0.0833333 0.111111 0.138889
The “solution” is, as we expect, simply xi = 1/n for all i. For very small n, this approach works fine. However, we see that very soon there are computational problems as the size of the problem increases. As the sum of the parameters is constrained to be equal to 1, the parameters are of the order of 1/n, so the function is therefore of the order of n^(-n), which underflows in R once n is somewhat above 140. In reality, the rot sets in much sooner.
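As a quick illustration of this scale difficulty (my own check, not from the original text), the product of n parameters each equal to 1/n can be examined directly:

## Quick check: (1/n)^n falls below the smallest normal double
## (.Machine$double.xmin, about 2.2e-308) once n reaches the low 140s,
## and soon after prints as exactly 0.
cat(".Machine$double.xmin =", .Machine$double.xmin, "\n")
for (n in c(50, 100, 142, 143, 150)) cat("n =", n, "  (1/n)^n =", (1/n)^n, "\n")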
We can do much better if we work with the log of the objective function. The following calculation handles very large numbers of parameters, and we only display the analytic answer along with the first five parameter estimates. Note that we change to using Rcgmin, which uses a method adapted for large numbers of parameters.
nll <- function(y) {
    if ((any(y <= 10 * .Machine$double.xmin)) || (sum(y) > 1 - .Machine$double.eps))
        .Machine$double.xmax else -sum(log(y)) - log(1 - sum(y))
}
nll.g <- function(y) {
    -1/y + 1/(1 - sum(y))
}  # so far not safeguarded
n <- 1000
cat(" n", " ", " x1", " ", " 1e5*(x1-1/n)\n")
##  n    x1    1e5*(x1-1/n)
options(digits = 6)
while (n <= 10000) {
    st <- 1:(n - 1)/(n * n)
    ctrl <- list(starttests = FALSE, trace = 0, dowarn = FALSE)
    ans <- optimx(st, nll, nll.g, method = "Rcgmin", control = ctrl)
    cc <- coef(ans)
    cat(n, " ", (1/n), "\n")
    print(cc[1:5])
    n <- n + 1000
}
## 1000   0.001
## [1] 0.001 0.001 0.001 0.001 0.001
## 2000   5e-04
## [1] 0.000500000 0.000500000 0.000499999 0.000500000
## [5] 0.000500000
## 3000   0.000333333
## [1] 0.000333333 0.000333334 0.000333333 0.000333333
## [5] 0.000333333
## 4000   0.00025
## [1] 0.000250002 0.000250000 0.000250000 0.000250000
## [5] 0.000250000
## 5000   2e-04
## [1] 0.009956240 0.000244116 0.000446017 0.000329520
## [5] 0.000181000
## 6000   0.000166667
## [1] 0.03565142 0.01697512 0.01075478 0.00764869 0.00578853
## 7000   0.000142857
## [1] 0.04500462 0.01959146 0.01113446 0.00691942 0.00440483
## 8000   0.000125
## [1] 0.00916996 0.00419875 0.00254975 0.00173191 0.00124712
## 9000   0.000111111
## [1] 0.000111111 0.000111111 0.000111111 0.000111111
## [5] 0.000111111
## 10000   1e-04
## [1] 0.012336191 0.004579923 0.002023620 0.000785113
## [5] 0.000124854
We see that we are doing quite well up to n = 4000 but thereafter get uneven results. The result for n = 9000 is surprisingly good, but optimization methods do occasionally take a lucky step. They can equally be “unlucky.” Consider the case n = 5 with several optimizers, where we first do not specify the gradient, then we give the analytic gradient, and finally the gradient and bounds on the parameters.
require(optimx, quietly = TRUE)
n <- 5
mset <- c("L-BFGS-B", "BFGS", "CG", "spg", "ucminf", "nlm", "nlminb", "Rvmmin",
    "Rcgmin")
a5 <- optimx(2:n/n^2, nll, method = mset, control = list(dowarn = FALSE))
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
a5g <- optimx(2:n/n^2, nll, nll.g, method = mset, control = list(dowarn = FALSE))
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
a5gb <- optimx(2:n/n^2, nll, nll.g, lower = 0, upper = 1, method = mset,
    control = list(dowarn = FALSE))
summary(a5, order = value)
##           p1       p2       p3       p4        value fevals
## Rcgmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     51
## spg      0.2 0.200000 0.200000 0.200000  8.04719e+00     17
## Rvmmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     22
## ucminf   0.2 0.200000 0.200000 0.200000  8.04719e+00     14
## nlm      0.2 0.199999 0.200000 0.200000  8.04719e+00     NA
## nlminb   0.2 0.199999 0.199999 0.199999  8.04719e+00     23
## L-BFGS-B  NA       NA       NA       NA 8.98847e+307     NA
## BFGS      NA       NA       NA       NA 8.98847e+307     NA
## CG        NA       NA       NA       NA 8.98847e+307     NA
##          gevals niter convcode  kkt1  kkt2 xtimes
## Rcgmin       18    NA        0    NA    NA  0.004
## spg          NA    13        0  TRUE  TRUE  0.056
## Rvmmin       12    NA        2    NA    NA  0.004
## ucminf       14    NA        0  TRUE  TRUE  0.000
## nlm          NA    11        0  TRUE  TRUE  0.000
## nlminb       66    12        0  TRUE  TRUE  0.004
## L-BFGS-B     NA    NA     9999    NA    NA  0.000
## BFGS         NA    NA     9999    NA    NA  0.000
## CG           NA    NA     9999    NA    NA  0.004
summary(a5g, order = value)
##           p1       p2       p3       p4        value fevals
## ucminf   0.2 0.200000 0.200000 0.200000  8.04719e+00     14
## Rvmmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     43
## spg      0.2 0.200000 0.200000 0.200000  8.04719e+00     17
## Rcgmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     28
## nlm      0.2 0.199999 0.200000 0.200000  8.04719e+00     NA
## nlminb   0.2 0.199999 0.199999 0.199999  8.04719e+00     23
## L-BFGS-B  NA       NA       NA       NA 8.98847e+307     NA
## BFGS      NA       NA       NA       NA 8.98847e+307     NA
## CG        NA       NA       NA       NA 8.98847e+307     NA
##          gevals niter convcode kkt1 kkt2 xtimes
## ucminf       14    NA        0 TRUE TRUE  0.000
## Rvmmin       12    NA        0 TRUE TRUE  0.004
## spg          NA    13        0 TRUE TRUE  0.052
## Rcgmin       12    NA        0 TRUE TRUE  0.000
## nlm          NA    11        0 TRUE TRUE  0.000
## nlminb       12    12        0 TRUE TRUE  0.000
## L-BFGS-B     NA    NA     9999   NA   NA  0.000
## BFGS         NA    NA     9999   NA   NA  0.004
## CG           NA    NA     9999   NA   NA  0.000
summary(a5gb, order = value)
##           p1       p2       p3       p4        value fevals
## Rvmmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     38
## Rcgmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     18
## spg      0.2 0.200000 0.200000 0.200000  8.04719e+00     18
## nlminb   0.2 0.199999 0.199999 0.199999  8.04719e+00     23
## L-BFGS-B  NA       NA       NA       NA 8.98847e+307     NA
##          gevals niter convcode kkt1 kkt2 xtimes
## Rvmmin       14    NA        0 TRUE TRUE  0.004
## Rcgmin       10    NA        0 TRUE TRUE  0.004
## spg          NA    13        0 TRUE TRUE  0.056
## nlminb       12    12        0 TRUE TRUE  0.000
## L-BFGS-B     NA    NA     9999   NA   NA  0.000
The particular L-BFGS-B failure is due to the optimization method trying to compute the gradient where sum(x) is greater than 1, although why a set of parameters like that has been generated has not been investigated. It seems likely the failure of Rvmmin in the bounded case has a similar origin.
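One possible guard (my own sketch, and I have not verified that it rescues L-BFGS-B here) is to make the gradient as defensive as nll() already is, by pulling the argument back inside the feasible region before differentiating:

## Sketch of a safeguarded gradient for nll(): clip components to be positive
## and shrink the vector if its sum exceeds 1, so the log terms stay defined.
nll.gsafe <- function(y) {
    eps <- 1e-10
    y <- pmax(y, eps)                                 # keep components positive
    if (sum(y) > 1 - eps) y <- y * (1 - eps)/sum(y)   # pull back inside the unit-sum region
    -1/y + 1/(1 - sum(y))
}
a5gs <- optimx(2:5/25, nll, nll.gsafe, method = mset, control = list(dowarn = FALSE))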
Objective functions defined by the product of the parameters or by the negative sum of their logarithms will change with the scale of the parameters. Moreover, the constraint effectively imposes the scaling sum(x) = 1. The optimizer spg() from package BB allows us to project our search direction to satisfy constraints. Thus, we could use the following approach.
require(BB, quietly = TRUE)
nllrv <- function(x) {
    -sum(log(x))
}
nllrv.g <- function(x) {
    -1/x
}
proj <- function(x) {
    x/sum(x)
}
n <- 5
aspg <- spg(par = (1:n)/n^2, fn = nllrv, gr = nllrv.g, project = proj)
## iter:  0  f-value:  11.3069  pgrad:  0.360757
aspgn <- spg(par = (1:n)/n^2, fn = nllrv, project = proj)
## iter:  0  f-value:  11.3069  pgrad:  0.133333
cat("F_optimal: with gradient=", aspg$value, "  num. approx.=", aspgn$value, "\n")
## F_optimal: with gradient= 8.04719   num. approx.= 8.04719
pbest <- rep(1/n, n)
cat("fbest = ", nllrv(pbest), "  when all parameters = ", pbest[1], "\n")
## fbest =  8.04719   when all parameters =  0.2
cat("deviations: with gradient=", max(abs(aspg$par - pbest)), "  num. approx.=",
    max(abs(aspg$par - pbest)), "\n")
## deviations: with gradient= 3.81244e-06   num. approx.= 3.81244e-06
Here the projection proj is the key to the success of method spg. Other methods do not have the flexibility to impose the projection directly; we would need to carefully build the projection into the objective function(s) and/or the method codes. This was done by Geradin (1971) for the Rayleigh quotient problem, but it requires a number of changes to the program code.
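For instance, here is a minimal sketch (my own illustration, not Geradin's code) of folding the projection into the objective itself, so that an ordinary unconstrained minimizer sees a scale-invariant function and the constrained estimates are recovered afterwards by projecting the final iterate:

## Sketch: build the projection x -> x/sum(x) into the objective (illustrative)
nll.proj <- function(x) {
    if (any(x <= 0)) return(.Machine$double.xmax)  # crude guard: keep the search in x > 0
    p <- x/sum(x)                                  # project onto the unit-sum scaling
    -sum(log(p))
}
aproj <- optim(par = (1:5)/25, fn = nll.proj)      # default Nelder-Mead
aproj$par/sum(aproj$par)                           # recover the constrained estimates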
Kovalchik et al. (2013) present an interesting and readable approach to estimating what they call a LEXPIT model. This attempts to explain the variation of a binary response variable y, for example, one that is 0 for a patient who does not have a particular disease and 1 for one who does. The model uses additive and multiplicative explanatory variables, which are put in two matrices, one for each part of the model. For explanatory purposes, I will simplify to a single additive covariate (sugar) and a single multiplicative covariate (age). The true but unknown y is assumed to be distributed as a Bernoulli (i.e., zero-one) variate with probability parameter pi (a function of sugar and age), where this probability is clearly bounded by 0 and 1. In fact, this probability should not actually take on those end-of-range values, or we have no variation. The model for the probability is

pi = beta0 * sugar + expit(gam0 + gam1 * age)        (13.4)

where

expit(q) = exp(q) / (1 + exp(q))

is the inverse logit function. The logit function is

logit(p) = log(p / (1 - p))

To illustrate the model and its estimation, let us suppose that we have patients whose age, sugar consumption, and health status according to some 0/1 indicator are known. We now want to model the probability that the health (or disease) indicator is 1 by equation (13.4).
We have (actually, we generated) 500 observations on status (y), age, and sugar, and now want to fit the model by maximum likelihood. We will minimize the negative log-likelihood, which is provided by the following code.
expit <- function(q) {
    zz <- exp(q)
    zz/(1 + zz)
}
logit <- function(p) {
    log(p/(1 - p))
}
negll <- function(par, data) {
    yy <- data$y
    ss <- data$sugar
    aa <- data$age
    beta0 <- par[1]
    gam0 <- par[2]
    gam1 <- par[3]
    ppi <- beta0 * ss + expit(gam0 + gam1 * aa)
    ## need to watch out for ppi out of bounds
    if (any(ppi >= 1) || any(ppi <= 0)) {
        val <- 1e+200
    } else {
        val <- -sum(log(1 - ppi)) - sum(yy * logit(ppi))
    }
    val
}
Kovalchik et al. (2013) suggest that it is useful to provide starting values derived in some way from the data. Here we will take the very crude approach of setting the additive coefficient beta0 = 0. This implies that we have the simpler logistic regression model, which we can estimate as follows:
tglm <- glm(y ~ age, family = binomial(logit), data = mydata)
pars <- c(0, coef(tglm))
names(pars) <- c("beta0", "gam0", "gam1")
pars
##      beta0       gam0       gam1
##  0.0000000 -4.6583342  0.0504476
Now we can proceed to fit the LEXPIT model. We will try all the optimx methods.
require(optimx)
aoptimx <- optimx(pars, negll, "grnd", method = "all", control = list(trace = 0),
    data = mydata)
## Warning: Replacing NULL gr with ‘grnd’ approximation
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
print(summary(aoptimx, order = value))
##                 beta0     gam0      gam1        value
## Rcgmin      0.1566348 -5.79545 0.0632420  1.99324e+02
## ucminf      0.1566348 -5.79545 0.0632420  1.99324e+02
## Rvmmin      0.1566348 -5.79545 0.0632420  1.99324e+02
## nlminb      0.1566348 -5.79545 0.0632420  1.99324e+02
## nmkb        0.1566367 -5.79532 0.0632398  1.99324e+02
## nlm         0.1565503 -5.79397 0.0632233  1.99324e+02
## spg         0.1565252 -5.79387 0.0632229  1.99324e+02
## hjkb        0.1563568 -5.79060 0.0631810  1.99324e+02
## BFGS        0.0815319 -4.67534 0.0494269  2.00107e+02
## CG          0.0815319 -4.67534 0.0494269  2.00107e+02
## Nelder-Mead 0.0815319 -4.67534 0.0494269  2.00107e+02
## L-BFGS-B    0.0815319 -4.67534 0.0494269  2.00107e+02
## bobyqa      0.0206010 -4.65833 0.0504476  2.00391e+02
## newuoa             NA       NA        NA 8.98847e+307
##             fevals gevals niter convcode  kkt1 kkt2 xtimes
## Rcgmin        1007    150    NA        1  TRUE TRUE  0.736
## ucminf          18     18    NA        0  TRUE TRUE  0.072
## Rvmmin          71     12    NA        0  TRUE TRUE  0.060
## nlminb          29     14    13        0  TRUE TRUE  0.060
## nmkb           217     NA    NA        0  TRUE TRUE  0.040
## nlm             NA     NA    16        0  TRUE TRUE  0.008
## spg           2342     NA  1304        0  TRUE TRUE  5.660
## hjkb          2132     NA    19        0 FALSE TRUE  0.292
## BFGS            13     13    NA        0 FALSE TRUE  0.044
## CG              13     13    NA        0 FALSE TRUE  0.048
## Nelder-Mead     13     13    NA        0 FALSE TRUE  0.052
## L-BFGS-B        13     13    NA        0 FALSE TRUE  0.048
## bobyqa          21     NA    NA        0 FALSE TRUE  0.004
## newuoa          NA     NA    NA     9999    NA   NA  0.004
Here we see that most of the methods have found a reasonable solution.
We can also try the constrained nonlinear optimization packages alabama and Rsolnp. First, we need to define our inequality constraints.
pibnd <- function(par, data) {
    # inequality constraints
    yy <- data$y
    nobs <- length(yy)
    ss <- data$sugar
    aa <- data$age
    beta0 <- par[1]
    gam0 <- par[2]
    gam1 <- par[3]
    ppi <- beta0 * ss + expit(gam0 + gam1 * aa)
    hh <- rep(NA, 2)
    # Note probability bounded at both 0 and 1
    hh <- c((ppi - 1e-04), (1 - 1e-04 - ppi))
    hh
}
yy <- mydata$y
nobs <- length(yy)
lo <- rep(1e-04, nobs)
up <- 1 - lo
require(alabama)
coo <- list(trace = FALSE)
aaug <- auglag(par = pars, fn = negll, hin = pibnd, data = mydata, control.outer = coo)
aaug$par
##     beta0      gam0      gam1
## 0.1501910 -5.6903499 0.0619483
aaug$value
## [1] 199.33
aaug$convergence
## [1] 0
aco <- constrOptim.nl(par = pars, fn = negll, hin = pibnd, data = mydata,
    control.outer = coo)
aco$par
##     beta0      gam0      gam1
## 0.1501612 -5.6901151 0.0619458
aco$value
## [1] 199.33
aco$convergence
## [1] 0
Last in this section, we try Rsolnp.
require(Rsolnp)
pineq <- function(par, data) {
    # we will compute functions that obey the ineqLB and ineqUB constraints
    yy <- data$y
    nobs <- length(yy)
    ss <- data$sugar
    aa <- data$age
    beta0 <- par[1]
    gam0 <- par[2]
    gam1 <- par[3]
    ppi <- beta0 * ss + expit(gam0 + gam1 * aa)
}
asolnp <- solnp(pars = pars, fun = negll, ineqfun = pineq, ineqLB = lo, ineqUB = up,
    data = mydata)
##
## Iter: 1 fn: 199.3254  Pars:  0.15280 -5.74331 0.06263
## Iter: 2 fn: 199.3239  Pars:  0.15663 -5.79529 0.06324
## Iter: 3 fn: 199.3239  Pars:  0.15663 -5.79529 0.06324
## solnp--> Completed in 3 iterations
asolnp$pars
##     beta0      gam0      gam1
##  0.156626 -5.795292  0.063240
negll(asolnp$pars, mydata)
## [1] 199.324
While we see that several methods have quite satisfactorily solved this estimation problem, I would caution that an important step is that of providing reasonable starting values for the parameters. Constrained problems seem to require better starts than their unconstrained counterparts, but this is an opinion based on limited experience rather than exhaustive evidence.
The logarithmic barrier function is generally used to make the objective function grow very large as the parameters approach a constraint, so that they are deflected away from it. An alternative is to penalize violation of the constraint. This actually allows the parameters to take on values that are infeasible, that is, outside the bounds. Increasing a weight or control on the penalty allows us to make it more stringent, and generally the parameters become “less infeasible” in such approaches, if we can tolerate such linguistic torture.
An annoyance with penalty methods is that we may introduce very large numbers as we try to force feasibility, and this may degrade the performance of our (unconstrained) optimization methods. In my own work, I have found penalty methods more useful as a trial to learn about problems than as an ongoing tool for solving a family of optimization problems. Barrier functions, as we have seen, can be used if we start and finish inside the feasible region.
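As a small illustration of the barrier idea applied to the LEXPIT bounds 0 < pi < 1 (my own sketch, with an arbitrary barrier weight mu and shrink schedule, not code from the text), we can add logarithmic terms to the negative log-likelihood and reduce their weight over a few outer cycles, warm-starting each cycle from the previous solution:

## Log-barrier sketch for the LEXPIT bounds (illustrative only)
negll.bar <- function(par, data, mu = 0.1) {
    ppi <- par[1] * data$sugar + expit(par[2] + par[3] * data$age)
    if (any(ppi <= 0) || any(ppi >= 1)) return(1e+200)  # barrier needs a feasible start
    -sum(log(1 - ppi)) - sum(data$y * logit(ppi)) -
        mu * sum(log(ppi) + log(1 - ppi))
}
xb <- pars  # feasible start from the logistic fit above
for (mu in c(1, 0.1, 0.01, 0.001)) {
    ab <- optim(par = xb, fn = negll.bar, data = mydata, mu = mu)
    xb <- ab$par  # warm start the next, weaker barrier
}
xb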
In this chapter, we have seen that there are several ways that R can be used to tackle nonlinear optimization problems with general constraints. What are the lessons to take away? Here are my views: