Chapter 13
Handling general constraints

In this chapter, we consider constraints that are neither bounds nor masks. If there are many constraints, the problem is usually considered one of mathematical programming (see Chapter 14), and the constraints are then often linear. There may also be other features, such as variables that are all integers, or a mix of integer and real parameters. Such problems have generally been less common in statistics, but there appears to be growing interest.

In this chapter, however, we consider problems with just a few constraints, so that the main focus of attention is still the objective function.

13.1 Equality constraints

Equality constraints are, in my experience, usually more troublesome than inequality constraints. This might seem paradoxical. After all, each equality constraint should reduce the dimension of the problem by one parameter. The difficulty is in choosing WHICH parameter to eliminate.

Nash (1979, p. 184) presents a problem (with seemingly erroneous results reported) that is apparently a linear regression except that four of the coefficients are related. We have six variables, named one to six in the data file, and want a model

(13.1)   $\mathrm{one} = b_1 + b_2\,\mathrm{two} + b_3\,\mathrm{three} + b_4\,\mathrm{four} + b_5\,\mathrm{five} + b_6\,\mathrm{six}$

However, there is a constraint so that

(13.2)   $b_3\, b_6 - b_4\, b_5 = 0$

In this example, it is very obvious that we can solve for any one of the four parameters. The data for the problem is read from a comma-separated value (CSV) file. From the origin of the data set and problem, I call this the Hassan problem.

First, let us find the unconstrained solution and check the violation of the constraint.

hn <- read.csv("hassan182.csv", header = TRUE)
convi <- function(x, dta) {
    # dta not used
    h <- rep(NA, 1)  # vector value
    h[1] <- x[3] * x[6] - x[4] * x[5]
    h
}
uncmod <- lm(one ~ two + three + four + five + six, data = hn)
uncmod
##
## Call:
## lm(formula = one ∼ two + three + four + five + six, data = hn)
##
## Coefficients:
## (Intercept)          two        three         four
##     -1.1352       0.9460       0.0997      -0.0276
##        five          six
##    -31.4896      56.1636
bu <- coef(uncmod)
cat("Constraint violation = b3*b6-b4*b5 =", as.numeric(convi(bu)),
    "
")
## Constraint violation = b3*b6-b4*b5 = 4.7287
cat("Sum of squared residuals = ", as.numeric(crossprod(uncmod$residuals)),
    "
")
## Sum of squared residuals =  891.99

A minor warning. The nonlinear least squares codes nls(), nlxb(), and nlsLM() take model expressions similar to BUT NOT THE SAME AS those used by lm() to build uncmod. As linear models are a special case of nonlinear ones, we could have used these functions to perform the calculations above. To do this, we need to explicitly add our coefficients, namely,

uncmod2 <- nls(one ~ b1 + two * b2 + three * b3 + four * b4 +
    five * b5 + six * b6, data = hn)
## Warning: No starting values specified for some parameters.
## Initializing ‘b1’, ‘b2’, ‘b3’, ‘b4’, ‘b5’, ‘b6’ to ‘1.’.
## Consider specifying ‘start’ or using a selfStart model
summary(uncmod2)
##
## Formula: one ∼ b1 + two * b2 + three * b3 + four * b4 + five * b5 + six *
##     b6
##
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)
## b1  -1.1352    41.3043   -0.03    0.978
## b2   0.9460     0.1917    4.93    8e-05 ***
## b3   0.0997     0.0576    1.73    0.099 .
## b4  -0.0276     0.0305   -0.91    0.376
## b5 -31.4895    34.9621   -0.90    0.378
## b6  56.1637    37.3244    1.50    0.148
## - - -
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 6.68 on 20 degrees of freedom
##
## Number of iterations to convergence: 3
## Achieved convergence tolerance: 4.18e-07

There are three approaches to a solution of the constrained problem that we will examine here.

  • Solve the constraint equation for one of the parameters in terms of the others and substitute this into the model above.
  • Use tools for nonlinear optimization subject to nonlinear equality (or also inequality) constraints, of which alabama and Rsolnp seem to be the most promising.
  • Define a penalized objective function where we add a positive factor times the sum of squares of the constraint violation(s) and solve sequentially, with increasing values of the penalty factor. Parameters from each solution are fed in as starting values for the next solution with stronger penalty.

For the present problem, which has rather nasty properties, we will see that it is very easy to get answers uncomfortably far from the best possible.

13.1.1 Parameter elimination

One difficulty with “solving” for one of the parameters in this problem is that we have the possibility of a zero divide. For example, if we solve for

(13.3)   $b_3 = b_4\, b_5 / b_6$

and $b_6$ gets close to zero, we can expect our algorithms and programs to get into trouble. Despite this possibility, we can still try to approach the problem by solving for one parameter in terms of the rest and using nonlinear least squares on the resulting model, or else minimizing the sum of squares of these residuals with an optimizer.

res3 <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    # b3 is eliminated via the constraint: b3 = b4*b5/b6
    b4 <- x[3]
    b5 <- x[4]
    b6 <- x[5]
    res <- dta$two * b2 + dta$three * b4 * b5/b6 + dta$four *
        b4 + dta$five * b5 + dta$six * b6 + b1 - dta$one
    return(-res)
}
ss3 <- function(x, dta) {
    res <- res3(x, dta)
    sum(res * res)
}
require(nlmrt)
## Loading required package: nlmrt
require(optimx)
## Loading required package: optimx
st3 <- setNames(bu[-3], c("b1", "b2", "b4", "b5", "b6"))
anlfb3 <- try(nlfb(start = st3, resfn = res3, dta = hn))
if (class(anlfb3) != "try-error") {
    cnb3 <- coef(anlfb3)
    print(cnb3)
    cat("RSS=", crossprod(anlfb3$resid), "
")
} else {
    cat("Error - failed
")
}
##         b1         b2         b4         b5         b6
## -28.680156   1.036184  -0.042352 -41.310598  79.207696
## attr(,"pkgname")
## [1] "nlmrt"
## RSS= 985.52
## derived_b3
cnb3["b4"] * cnb3["b5"]/cnb3["b6"]
##       b4
## 0.022089
## Try optimx
aop3 <- optimx(st3, fn = ss3, gr = "grnd", dta = hn, control = list(all.methods = TRUE))
## Warning: Replacing NULL gr with ‘grnd’ approximation
## Loading required package: numDeriv
## Warning: Parameters or bounds appear to have different scalings.
##  This can cause poor performance in optimization.
##  It is important for derivative free methods like BOBYQA, UOBYQA, NEWUOA.
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: Function evaluation limit exceeded - - may not converge.
summary(aop3, order = value)
##                   b1      b2        b4      b5     b6
## ucminf      -28.6803 1.03618 -0.042352 -41.311 79.208
## Rvmmin      -28.6803 1.03618 -0.042352 -41.311 79.208
## nlminb      -28.6803 1.03618 -0.042352 -41.311 79.208
## Rcgmin      -28.6807 1.03619 -0.042352 -41.310 79.207
## nmkb        -28.6803 1.03618 -0.042354 -41.308 79.209
## BFGS        -25.3920 1.01751 -0.043002 -35.760 82.639
## CG          -25.3920 1.01751 -0.043002 -35.760 82.639
## Nelder-Mead -25.3920 1.01751 -0.043002 -35.760 82.639
## L-BFGS-B    -25.3920 1.01751 -0.043002 -35.760 82.639
## hjkb        -13.2112 0.96400 -0.031632 -39.492 71.582
## bobyqa       -1.1530 0.91213 -0.018279 -31.496 57.114
## newuoa       -1.1312 0.91182 -0.018244 -31.399 57.140
## spg          -1.1713 0.91260 -0.017788 -31.549 56.304
## nlm          -1.1354 0.91246 -0.017685 -31.490 56.164
##               value fevals gevals niter convcode  kkt1
## ucminf       985.52     27     27    NA        0  TRUE
## Rvmmin       985.52     68     23    NA        0  TRUE
## nlminb       985.52     45     24    23        0  TRUE
## Rcgmin       985.52   1228    310    NA        1  TRUE
## nmkb         985.52   1050     NA    NA        0 FALSE
## BFGS         986.67     51     51    NA        0 FALSE
## CG           986.67     51     51    NA        0 FALSE
## Nelder-Mead  986.67     51     51    NA        0 FALSE
## L-BFGS-B     986.67     51     51    NA        0 FALSE
## hjkb         995.02  10007     NA    16        1 FALSE
## bobyqa      1039.36    235     NA    NA        0 FALSE
## newuoa      1039.43    217     NA    NA        0 FALSE
## spg         1041.92   2764     NA  1501        1 FALSE
## nlm         1042.51     NA     NA     9        0 FALSE
##              kkt2 xtimes
## ucminf      FALSE  0.080
## Rvmmin      FALSE  0.068
## nlminb      FALSE  0.064
## Rcgmin      FALSE  0.888
## nmkb        FALSE  0.116
## BFGS        FALSE  0.132
## CG          FALSE  0.136
## Nelder-Mead FALSE  0.136
## L-BFGS-B    FALSE  0.132
## hjkb        FALSE  0.465
## bobyqa      FALSE  0.012
## newuoa      FALSE  0.008
## spg         FALSE  4.280
## nlm         FALSE  0.000

The reader may rightly be dismayed by the variety of results with optimx. However, there is general agreement between the most successful methods (those with the smallest sum of squared residuals) and the nonlinear least squares solution.

13.1.2 Which parameter to eliminate?

We chose to eliminate $b_3$ in the computations just completed. Is there any reason to choose this over any of $b_4$, $b_5$, or $b_6$? As it turns out, the choice can be important. In particular, when $b_6$ is eliminated, the best nonlinear least squares solution by nlfb() or optimx has a residual sum of squares of 1008.334. This is clearly not close to the optimal value of 985.5175 found when any of the other parameters is eliminated. Trying a starting vector of all 1's does allow two of the optimx methods to find the lower value, but nlfb() still ends up at the higher sum of squares. The unconstrained parameter $b_6$ is the largest numerically, and in the elimination of $b_3$, it appears as a divisor. When we eliminate $b_6$, the rather small $b_3$ is our divisor, and unfortunate steps during optimization could risk a zero divide. However, this issue does not seem to affect the elimination of $b_5$, which is also quite large in magnitude.
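
For readers who wish to repeat the comparison, here is a minimal sketch (not run above) of the residual function with $b_6$ eliminated, so that $b_3$ becomes the divisor. The names res6, st6, and anlfb6 are illustrative and not used elsewhere in this chapter.

# Sketch only: eliminate b6 = b4*b5/b3 using the constraint b3*b6 = b4*b5
res6 <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    b3 <- x[3]
    b4 <- x[4]
    b5 <- x[5]
    # b6 is eliminated: b6 = b4 * b5 / b3
    res <- dta$two * b2 + dta$three * b3 + dta$four * b4 + dta$five *
        b5 + dta$six * b4 * b5/b3 + b1 - dta$one
    return(-res)
}
st6 <- setNames(bu[-6], c("b1", "b2", "b3", "b4", "b5"))  # drop the 'six' coefficient
anlfb6 <- try(nlfb(start = st6, resfn = res6, dta = hn))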

13.1.3 Scaling and centering?

In the process of developing this example, I explored the conventional advice of centering the variables, that is, subtracting their means. For linear regression, this often stabilizes the computations and reduces the dimensionality of the normal equations (or their equivalent) by 1. One can also divide by the standard deviation to standardize the variables.
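
As a reminder of what such a transformation involves, here is a small sketch (not used in the computations of this chapter; hnc is an illustrative name) that centers the explanatory columns with scale(). The constraint $b_3 b_6 = b_4 b_5$ is stated for the coefficients of the original variables, so after centering it would have to be re-expressed algebraically.

# Sketch only: center (but do not rescale) the explanatory columns
hnc <- hn
hnc[, c("two", "three", "four", "five", "six")] <-
    scale(hn[, c("two", "three", "four", "five", "six")],
        center = TRUE, scale = FALSE)  # scale = TRUE would also standardize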

In the present example, however, centering or standardizing the variables introduces a change that complicates the constraint. Unless such stabilization offers a very large advantage, I would avoid it. Simple scalings by powers of 10 are, however, useful without the necessity of algebraic manipulations. I found that all four choices of parameter to eliminate will give at least one satisfactory solution with optimx if I scale the columns of the data by multiplying them respectively with

equation

However, elimination of $b_6$ still failed to get a nonlinear least squares solution.
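
To experiment with such scalings, a sketch of the column multiplication is given below, assuming the columns of hn are one to six in order. The factors shown are purely illustrative (the factors actually used are not reproduced here); the point is that if an explanatory column is multiplied by a power of 10, its fitted coefficient is divided by the same power, so the constraint should be checked in the original scaling.

# Sketch only: multiply each column of hn by a (hypothetical) power of 10
sfac <- c(1, 1, 10, 10, 0.1, 0.1)  # illustrative factors for columns one..six
hns <- hn
for (j in seq_along(sfac)) hns[[j]] <- hn[[j]] * sfac[j]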

13.1.4 Nonlinear programming packages

Let us try using the constrOptim.nl() and auglag() augmented Lagrangian methods of package alabama and the solnp() function of Rsolnp. All of these handle inequality constraints too, but for now, we will use just the equality constraint capabilities. Once again, we must caution that the calling syntax of different functions is not standardized; in particular, note the eqB argument of solnp(). We include the Jacobian calculation in the code but do not necessarily use it. Note that there is one other package, Rdonlp2, which is an interface to Peter Spellucci's DONLP2 nonlinear programming code. However, there are licensing restrictions on this R-forge package, for which reason I am reluctant to pursue its use here. Here is the setup for the other codes.

resall <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    b3 <- x[3]
    b4 <- x[4]
    b5 <- x[5]
    b6 <- x[6]
    res <- dta$two * b2 + dta$three * b3 + dta$four * b4 + dta$five *
        b5 + dta$six * b6 + b1 - dta$one
    return(-res)
}
jacall <- function(x, dta) {
    b1 <- x[1]
    b2 <- x[2]
    b3 <- x[3]
    b4 <- x[4]
    b5 <- x[5]
    b6 <- x[6]
    nobs <- dim(dta)[1]
    JJ <- cbind(rep(-1, nobs), -dta$two, -dta$three, -dta$four,
        -dta$five, -dta$six)
    JJ
}
ssall <- function(x, dta) {
    res <- resall(x, dta)
    sum(res * res)
}
stones <- rep(1, 6)  # trivial start
names(stones) <- c("b1", "b2", "b3", "b4", "b5", "b6")
stunc <- coef(uncmod)  # unconstrained parameter start

We will not present all the output, which is quite voluminous. The calling sequences for constrOptim.nl(), auglag(), and solnp() are as follows for start stones.

require(alabama)
aalabco <- constrOptim.nl(par = stones, fn = ssall, heq = convi,
    dta = hn)
aauglag <- auglag(stones, fn = ssall, heq = convi, dta = hn)
require(Rsolnp)
asolnp <- solnp(stones, fun = ssall, eqfun = convi, eqB = c(0),
    dta = hn)
## Loading required package: alabama
## Start from all ones - - constrOptim.nl
##         b1         b2         b3         b4         b5
## -21.673745   1.005383   0.016668  -0.051264 -30.666798
##         b6
##  94.321144
## constrOptim.nl: sumsquares = 1003.8    constr. violation= 7.8447e-08
## Start from unconstrained model - - constrOptim.nl
## (Intercept)         two       three        four        five
##  -16.404246    0.979924    0.018891   -0.043194  -37.585187
##         six
##   85.940314
## constrOptim.nl: sumsquares = 997.18    constr. violation= 9.2954e-08
## Start from all ones - - auglag
##         b1         b2         b3         b4         b5
## -13.684176   0.968233   0.014537  -0.045940 -28.667509
##         b6
##  90.596584
## auglag: sumsquares = 1005.3    constr. violation= 1.9809e-11
## Start from unconstrained model - - auglag
## (Intercept)         two       three        four        five
##  -18.188436    0.982143    0.018380   -0.042621  -37.150893
##         six
##   86.147218
## auglag: sumsquares = 993.04    constr. violation= 1.3769e-08
## Loading required package: Rsolnp
## Loading required package: truncnorm
## Loading required package: parallel
## Start from all ones - - solnp
##          b1          b2          b3          b4          b5
##  15.3698499   0.9193906   0.1548982   0.0078678 -55.1476935
##          b6
##  -2.8011496
## solnp: sumsquares = 1008.3    constr. violation= -8.043e-13
## Start from unconstrained model - - solnp
## (Intercept)         two       three        four        five
##  -28.728953    1.036451    0.022111   -0.042383  -41.326619
##         six
##   79.215658
## solnp: sumsquares = 985.52    constr. violation= 2.5975e-12

Again the results are not consistent. Note that auglag() can take starting parameters that are not feasible, while constrOptim.nl() requires a start that satisfies the constraint(s).
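
If a start that satisfies the equality constraint is wanted, one simple device (a sketch only; stfeas is an illustrative name) is to adjust one parameter of the unconstrained solution so that the constraint holds exactly.

# Sketch only: repair the unconstrained start so that b3*b6 - b4*b5 = 0
stfeas <- stunc
names(stfeas) <- names(stones)
stfeas["b3"] <- stfeas["b4"] * stfeas["b5"]/stfeas["b6"]
convi(stfeas, dta = hn)  # should now be essentially zero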

13.1.5 Sequential application of an increasing penalty

Penalty functions are another way to try to impose constraints, albeit approximately. We construct an objective function from the original one and then add some weighted multiple of a positive function of the constraint violation. In our case, the constraint is an equality, so we square its value. We scale this with the factor pen that has a default value of 100 but which is an argument to our following modified objective function.

To automate the process, we start with pen=10 and increase it successively until the residual sum of squares does not change iteration to iteration. This is NOT a guaranteed procedure, of course.

hassanp.f <- function(x, pen = 100, dta) {
    mod <- x[1] + x[2] * dta$two + x[3] * dta$three + x[4] *
        dta$four + x[5] * dta$five + x[6] * dta$six
    res <- mod - dta$one
    pfn <- x[3] * x[6] - x[4] * x[5]
    sum(res * res) + pen * pfn * pfn  # penalized sum of squares is the return value
}
xx <- coef(uncmod)
bestRSS <- 1e+300
rss <- 1e+299  # make it different from bestRSS to start loop
pen <- 10
while (rss != bestRSS) {
    bestRSS <- rss
    pen = 10 * pen
    anmp <- optim(par = xx, fn = hassanp.f, dta = hn, pen = pen)
    xx <- anmp$par  # for next iteration
    rss <- ssall(xx, dta = hn)
    cat("RSS = ", rss, "  for pen=", pen, "
")
    cat("    Constraint violation =", convi(xx, dta = hn), "
")
}
## RSS =  985.43   for pen= 100
##     Constraint violation = 0.13551
## RSS =  985.54   for pen= 1000
##     Constraint violation = 0.0142
## RSS =  985.54   for pen= 10000
##     Constraint violation = 0.00050568
## RSS =  985.52   for pen= 1e+05
##     Constraint violation = 0.00016767
## RSS =  985.52   for pen= 1e+06
##     Constraint violation = 1.3581e-05
## RSS =  985.52   for pen= 1e+07
##     Constraint violation = 1.6515e-06
## RSS =  985.52   for pen= 1e+08
##     Constraint violation = 4.3183e-07
## RSS =  985.52   for pen= 1e+09
##     Constraint violation = 4.1873e-08
## RSS =  985.52   for pen= 1e+10
##     Constraint violation = 1.7881e-08
## RSS =  985.52   for pen= 1e+11
##     Constraint violation = 3.9851e-09
## RSS =  985.52   for pen= 1e+12
##     Constraint violation = -1.6715e-09
## RSS =  985.52   for pen= 1e+13
##     Constraint violation = 4.4855e-10
## RSS =  985.52   for pen= 1e+14
##     Constraint violation = 1.375e-11
## RSS =  985.52   for pen= 1e+15
##     Constraint violation = -2.5066e-11
## RSS =  985.52   for pen= 1e+16
##     Constraint violation = 1.2307e-11
## RSS =  985.52   for pen= 1e+17
##     Constraint violation = -3.7594e-12
## RSS =  985.52   for pen= 1e+18
##     Constraint violation = 4.4031e-13
## RSS =  985.52   for pen= 1e+19
##     Constraint violation = -1.8341e-13
## RSS =  985.52   for pen= 1e+20
##     Constraint violation = 1.5277e-13
## RSS =  985.52   for pen= 1e+21
##     Constraint violation = -2.2871e-14
## RSS =  985.52   for pen= 1e+22
##     Constraint violation = -5.5511e-15
## RSS =  985.52   for pen= 1e+23
##     Constraint violation = 8.8818e-16
## RSS =  985.52   for pen= 1e+24
##     Constraint violation = 8.8818e-16
## Best parameters
print(xx)
## (Intercept)         two       three        four        five
##  -28.564318    1.035479    0.021894   -0.042405  -40.985170
##         six
##   79.382823

Note that there is a sensitivity to the initial penalty and the factor by which it is increased, so that it is possible to converge to a solution with the objective above 1000 (possibly a plateau or a local minimum, because similar proposed answers are seen elsewhere).

Penalty methods like this are sometimes helpful to get a quick handle on a problem. Note that we used the default Nelder–Mead optimizer to do the work. This approach would, therefore, make sense if we were blocked from installing software that is more suited to the problem, for example, if we were at a location without network service.

13.2 Sumscale problems

A particular class of optimization problem that arises quite often is one that may be called a sumscale problem. Here we wish to maximize or minimize an objective function $f(x)$ whose parameters are constrained by some scaling, so that $S(x) = \text{constant}$, where the function $S(x)$ involves a sum of the parameters, their squares, or some similar simple function of them.

The following are examples of this type of objective function.

  • The maximum volume of a regular polyhedron where the sum of the lengths of the sides is fixed.
  • The minimum negative log likelihood for a multinomial model.
  • The Rayleigh quotient for the maximal or minimal eigensolutions of a matrix, where the eigenvectors should be normalized so the square norm of the vector is 1.
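
As a small illustration of the last item, the Rayleigh quotient of a symmetric matrix can be minimized with an ordinary unconstrained optimizer, because the quotient form already absorbs the normalization. This is only a sketch; the matrix A and the names rq and arq are illustrative and not used elsewhere in this chapter.

# Sketch only: smallest eigenvalue of a symmetric matrix via the Rayleigh quotient
A <- matrix(c(4, 1, 0, 1, 3, 1, 0, 1, 2), nrow = 3)  # small symmetric test matrix
rq <- function(x, A) as.numeric(crossprod(x, A %*% x)/crossprod(x))
arq <- optim(c(1, 1, 1), rq, A = A, method = "BFGS")
arq$value                       # close to min(eigen(A)$values)
arq$par/sqrt(sum(arq$par^2))    # normalized estimate of the eigenvector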

For the moment, let us consider a basic example, which is

$\max_x \ \prod_{i=1}^{n} x_i \quad \text{subject to} \quad \sum_{i=1}^{n} x_i = 1$

This is a very simplified version of the multinomial maximum likelihood problem, suggested by Gabor Grothendieck as a test problem of this type. One approach to solving this is to eliminate the last parameter and optimize over the other $n - 1$, where $n$ is the number of parameters. Let us try it for a few values of $n$. We turn off some tests and warnings to keep the output tidy.

require(optimx, quietly = TRUE)
# load a range of optimization methods
pr <- function(y) {
    -prod(y) * (1 - sum(y))
}
pr.g <- function(x) {
    g <- -prod(x) * (1 - sum(x))/x + prod(x)
    g
}
n <- 2
cat("  n", "	", "  x1", "			", " 1e5*(x1-1/n)
")
options(digits = 6)
while (n <= 6) {
    st <- 1:(n - 1)/(n * n)
    ctrl <- list(starttests = FALSE, trace = 0, dowarn = FALSE)
    ans <- optimx(st, pr, pr.g, method = "Rvmmin", control = ctrl)
    cc <- coef(ans)
    cat(n, "	", (1/n), "
")
    print(cc)
    n <- n + 1
}
##   n     x1      1e5*(x1-1/n)
## 2   0.5
##      [,1]
## [1,] 0.35
## 3   0.333333
##              p1       p2
## Rvmmin 0.333333 0.333333
## 4   0.25
##            p1    p2     p3
## Rvmmin 0.0625 0.125 0.1875
## 5   0.2
##          p1   p2   p3   p4
## Rvmmin 0.04 0.08 0.12 0.16
## 6   0.166667
##               p1        p2        p3       p4       p5
## Rvmmin 0.0277778 0.0555556 0.0833333 0.111111 0.138889

The “solution” is, as we expect, simply $x_i = 1/n$ for all $i$. For very small $n$, this approach works fine. However, we see that very soon there are computational problems as the size of the problem increases. As the sum of the parameters is constrained to equal 1, the parameters are of the order of $1/n$, so the function is of the order of $n^{-n}$, which underflows in R around $n = 144$. In reality, the rot sets in much sooner.

We can do much better if we work with the log of the objective function. The following calculation handles very large numbers of parameters, and we only display the analytic answer along with the first five parameter estimates. Note that we change to using Rcgmin which uses a method adapted for large numbers of parameters.

nll <- function(y) {
    if ((any(y <= 10 * .Machine$double.xmin)) || (sum(y) > 1 -
        .Machine$double.eps))
        .Machine$double.xmax else -sum(log(y)) - log(1 - sum(y))
}
nll.g <- function(y) {
    -1/y + 1/(1 - sum(y))
}  # so far not safeguarded
n <- 1000
cat("  n", "	", "  x1", "			", " 1e5*(x1-1/n)
")
##   n     x1      1e5*(x1-1/n)
options(digits = 6)
while (n <= 10000) {
    st <- 1:(n - 1)/(n * n)
    ctrl <- list(starttests = FALSE, trace = 0, dowarn = FALSE)
    ans <- optimx(st, nll, nll.g, method = "Rcgmin", control = ctrl)
    cc <- coef(ans)
    cat(n, "	", (1/n), "
")
    print(cc[1:5])
    n <- n + 1000
}
## 1000   0.001
## [1] 0.001 0.001 0.001 0.001 0.001
## 2000   5e-04
## [1] 0.000500000 0.000500000 0.000499999 0.000500000
## [5] 0.000500000
## 3000   0.000333333
## [1] 0.000333333 0.000333334 0.000333333 0.000333333
## [5] 0.000333333
## 4000   0.00025
## [1] 0.000250002 0.000250000 0.000250000 0.000250000
## [5] 0.000250000
## 5000   2e-04
## [1] 0.009956240 0.000244116 0.000446017 0.000329520
## [5] 0.000181000
## 6000   0.000166667
## [1] 0.03565142 0.01697512 0.01075478 0.00764869 0.00578853
## 7000   0.000142857
## [1] 0.04500462 0.01959146 0.01113446 0.00691942 0.00440483
## 8000   0.000125
## [1] 0.00916996 0.00419875 0.00254975 0.00173191 0.00124712
## 9000   0.000111111
## [1] 0.000111111 0.000111111 0.000111111 0.000111111
## [5] 0.000111111
## 10000   1e-04
## [1] 0.012336191 0.004579923 0.002023620 0.000785113
## [5] 0.000124854

We see that we are doing quite well up to $n = 4000$ but thereafter get uneven results. The $n = 9000$ result is surprisingly good, but optimization methods do occasionally take a lucky step. They can equally be “unlucky.” Consider the $n = 5$ case with several optimizers, where we first do not specify the gradient, then we give the analytic gradient, and finally the gradient and bounds on the parameters.

require(optimx, quietly = TRUE)
n <- 5
mset <- c("L-BFGS-B", "BFGS", "CG", "spg", "ucminf", "nlm", "nlminb",
    "Rvmmin", "Rcgmin")
a5 <- optimx(2:n/n^2, nll, method = mset, control = list(dowarn = FALSE))
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
a5g <- optimx(2:n/n^2, nll, nll.g, method = mset, control = list(dowarn = FALSE))
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
a5gb <- optimx(2:n/n^2, nll, nll.g, lower = 0, upper = 1, method = mset,
    control = list(dowarn = FALSE))
summary(a5, order = value)
##           p1       p2       p3       p4        value fevals
## Rcgmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     51
## spg      0.2 0.200000 0.200000 0.200000  8.04719e+00     17
## Rvmmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     22
## ucminf   0.2 0.200000 0.200000 0.200000  8.04719e+00     14
## nlm      0.2 0.199999 0.200000 0.200000  8.04719e+00     NA
## nlminb   0.2 0.199999 0.199999 0.199999  8.04719e+00     23
## L-BFGS-B  NA       NA       NA       NA 8.98847e+307     NA
## BFGS      NA       NA       NA       NA 8.98847e+307     NA
## CG        NA       NA       NA       NA 8.98847e+307     NA
##          gevals niter convcode kkt1 kkt2 xtimes
## Rcgmin       18    NA        0   NA   NA  0.004
## spg          NA    13        0 TRUE TRUE  0.056
## Rvmmin       12    NA        2   NA   NA  0.004
## ucminf       14    NA        0 TRUE TRUE  0.000
## nlm          NA    11        0 TRUE TRUE  0.000
## nlminb       66    12        0 TRUE TRUE  0.004
## L-BFGS-B     NA    NA     9999   NA   NA  0.000
## BFGS         NA    NA     9999   NA   NA  0.000
## CG           NA    NA     9999   NA   NA  0.004
summary(a5g, order = value)
##           p1       p2       p3       p4        value fevals
## ucminf   0.2 0.200000 0.200000 0.200000  8.04719e+00     14
## Rvmmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     43
## spg      0.2 0.200000 0.200000 0.200000  8.04719e+00     17
## Rcgmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     28
## nlm      0.2 0.199999 0.200000 0.200000  8.04719e+00     NA
## nlminb   0.2 0.199999 0.199999 0.199999  8.04719e+00     23
## L-BFGS-B  NA       NA       NA       NA 8.98847e+307     NA
## BFGS      NA       NA       NA       NA 8.98847e+307     NA
## CG        NA       NA       NA       NA 8.98847e+307     NA
##          gevals niter convcode kkt1 kkt2 xtimes
## ucminf       14    NA        0 TRUE TRUE  0.000
## Rvmmin       12    NA        0 TRUE TRUE  0.004
## spg          NA    13        0 TRUE TRUE  0.052
## Rcgmin       12    NA        0 TRUE TRUE  0.000
## nlm          NA    11        0 TRUE TRUE  0.000
## nlminb       12    12        0 TRUE TRUE  0.000
## L-BFGS-B     NA    NA     9999   NA   NA  0.000
## BFGS         NA    NA     9999   NA   NA  0.004
## CG           NA    NA     9999   NA   NA  0.000
summary(a5gb, order = value)
##           p1       p2       p3       p4        value fevals
## Rvmmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     38
## Rcgmin   0.2 0.200000 0.200000 0.200000  8.04719e+00     18
## spg      0.2 0.200000 0.200000 0.200000  8.04719e+00     18
## nlminb   0.2 0.199999 0.199999 0.199999  8.04719e+00     23
## L-BFGS-B  NA       NA       NA       NA 8.98847e+307     NA
##          gevals niter convcode kkt1 kkt2 xtimes
## Rvmmin       14    NA        0 TRUE TRUE  0.004
## Rcgmin       10    NA        0 TRUE TRUE  0.004
## spg          NA    13        0 TRUE TRUE  0.056
## nlminb       12    12        0 TRUE TRUE  0.000
## L-BFGS-B     NA    NA     9999   NA   NA  0.000

The particular L-BFGS-B failure is due to the optimization method trying to compute the gradient where sum(x) is greater than 1, although why such a set of parameters was generated has not been investigated. It seems likely that the failure of Rvmmin in the bounded case has a similar origin.
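
A minimal sketch of the difficulty (xbad is an illustrative name): at a trial point whose elements sum to more than 1, nll() returns the huge flag value while nll.g() quietly evaluates 1/(1 - sum(y)) with a negative denominator, so the gradient no longer describes the function the optimizer is actually seeing.

# Sketch only: behaviour of the objective and gradient at an infeasible point
xbad <- rep(0.3, 5)  # sums to 1.5, so 1 - sum(xbad) is negative
nll(xbad)            # returns .Machine$double.xmax, the flag value
nll.g(xbad)          # finite but meaningless for the flagged objective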

13.2.1 Using a projection

Objective functions defined by $\prod_i x_i$ or $\sum_i \log(x_i)$ will change with the scale of the parameters. Moreover, the constraint $\sum_i x_i = 1$ effectively imposes the scaling $x \rightarrow x / \sum_i x_i$. The optimizer spg() from package BB allows us to project our search direction to satisfy constraints. Thus, we could use the following approach.

require(BB, quietly = TRUE)
nllrv <- function(x) {
    -sum(log(x))
}
nllrv.g <- function(x) {
    -1/x
}
proj <- function(x) {
    x/sum(x)
}
n <- 5
aspg <- spg(par = (1:n)/n^2, fn = nllrv, gr = nllrv.g, project = proj)
## iter:  0  f-value:  11.3069  pgrad:  0.360757
aspgn <- spg(par = (1:n)/n^2, fn = nllrv, project = proj)
## iter:  0  f-value:  11.3069  pgrad:  0.133333
cat("F_optimal: with gradient=", aspg$value, "  num. approx.=",
    aspgn$value, "
")
## F_optimal: with gradient= 8.04719   num. approx.= 8.04719
pbest <- rep(1/n, n)
cat("fbest = ", nllrv(pbest), "  when all parameters = ", pbest[1],
    "
")
## fbest =  8.04719   when all parameters =  0.2
cat("deviations:  with gradient=", max(abs(aspg$par - pbest)),
    "   num. approx.=", max(abs(aspg$par - pbest)), "
")
## deviations:  with gradient= 3.81244e-06    num. approx.= 3.81244e-06

Here the projection proj is the key to the success of the spg method. Other methods do not have the flexibility to impose the projection directly. We would need to carefully build the projection into the function(s) and/or the method codes. This was done by Geradin (1971) for the Rayleigh quotient problem but requires a number of changes to the program code.
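
One way of building the projection into the objective itself, so that any unconstrained minimizer can be applied, is sketched below. This is not the Geradin approach, merely a rescaling inside the function; nllproj and aproj are illustrative names.

# Sketch only: normalize the parameters inside the objective function
nllproj <- function(x) {
    if (any(x <= 0)) return(.Machine$double.xmax)  # guard against nonpositive values
    p <- x/sum(x)    # projection onto sum(p) = 1
    -sum(log(p))
}
aproj <- optim(par = (1:5)/25, fn = nllproj, method = "BFGS")
aproj$par/sum(aproj$par)  # scaled solution, ideally all equal to 1/5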

13.3 Inequality constraints

Kovalchik et al. (2013) present an interesting and readable approach to estimating what they call a LEXPIT model. This attempts to explain the variation of a binary response variable $y$, for example, one that is 0 for a patient who does not have a particular disease and 1 when they do. The model uses additive and multiplicative explanatory variables, which are put into separate matrices. For explanatory purposes, I will simplify to a single additive covariate $x$ and a single multiplicative covariate $z$. The true but unknown $y$ is assumed to be distributed as a Bernoulli (i.e., zero–one) variate with probability parameter (a function of $x$ and $z$)

(13.4)   $\pi(x, z) = \Pr(y = 1 \mid x, z)$

where this probability is clearly bounded by 0 and 1. In fact, this probability should not actually take on those end-of-range values, or we have no variation. The LEXPIT model is

(13.5)   $\pi(x, z) = \beta_0\, x + \mathrm{expit}(\gamma_0 + \gamma_1 z)$

where

(13.6)   $\mathrm{expit}(q) = \dfrac{\exp(q)}{1 + \exp(q)}$

is the inverse logit function. The logit function is

(13.7)   $\mathrm{logit}(p) = \log\!\left(\dfrac{p}{1 - p}\right)$

To illustrate the model and its estimation, let us suppose that we have patients whose age, sugar consumption, and health status according to some 0/1 indicator are known. We now want to model the probability the health (or disease) indicator is 1 by equation (13.4).

We have (actually we generated) 500 observations on the status $y$, age, and sugar and now want to fit the model by maximum likelihood. We will minimize the negative log likelihood, which is provided by the following code.

expit <- function(q) {
    zz <- exp(q)
    zz/(1 + zz)
}
logit <- function(p) {
    log(p/(1 - p))
}
negll <- function(par, data) {
    yy <- data$y
    ss <- data$sugar
    aa <- data$age
    beta0 <- par[1]
    gam0 <- par[2]
    gam1 <- par[3]
    ppi <- beta0 * ss + expit(gam0 + gam1 * aa)
    ## need to watch out for ppi out of bounds
    if (any(ppi >= 1) || any(ppi <= 0)) {
        val <- 1e+200
    } else {
        val <- -sum(log(1 - ppi)) - sum(yy * logit(ppi))
    }
    val
}

Kovalchik et al. (2013) suggest that it is useful to provide starting values derived in some way from the data. Here we will take a very crude approach of setting $\beta_0 = 0$. This implies that we have the simpler logistic regression model, which we can estimate as follows:

tglm <- glm(y ~ age, family = binomial(logit), data = mydata)
pars <- c(0, coef(tglm))
names(pars) <- c("beta0", "gam0", "gam1")
pars
##      beta0       gam0       gam1
##  0.0000000 -4.6583342  0.0504476

Now we can proceed to fit the LEXPIT model. We will try all the optimx methods.

require(optimx)
aoptimx <- optimx(pars, negll, "grnd", method = "all", control = list(trace = 0),
    data = mydata)
## Warning: Replacing NULL gr with ‘grnd’ approximation
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
## Warning: bounds can only be used with method L-BFGS-B (or Brent)
print(summary(aoptimx, order = value))
##                 beta0     gam0      gam1        value
## Rcgmin      0.1566348 -5.79545 0.0632420  1.99324e+02
## ucminf      0.1566348 -5.79545 0.0632420  1.99324e+02
## Rvmmin      0.1566348 -5.79545 0.0632420  1.99324e+02
## nlminb      0.1566348 -5.79545 0.0632420  1.99324e+02
## nmkb        0.1566367 -5.79532 0.0632398  1.99324e+02
## nlm         0.1565503 -5.79397 0.0632233  1.99324e+02
## spg         0.1565252 -5.79387 0.0632229  1.99324e+02
## hjkb        0.1563568 -5.79060 0.0631810  1.99324e+02
## BFGS        0.0815319 -4.67534 0.0494269  2.00107e+02
## CG          0.0815319 -4.67534 0.0494269  2.00107e+02
## Nelder-Mead 0.0815319 -4.67534 0.0494269  2.00107e+02
## L-BFGS-B    0.0815319 -4.67534 0.0494269  2.00107e+02
## bobyqa      0.0206010 -4.65833 0.0504476  2.00391e+02
## newuoa             NA       NA        NA 8.98847e+307
##             fevals gevals niter convcode  kkt1 kkt2 xtimes
## Rcgmin        1007    150    NA        1  TRUE TRUE  0.736
## ucminf          18     18    NA        0  TRUE TRUE  0.072
## Rvmmin          71     12    NA        0  TRUE TRUE  0.060
## nlminb          29     14    13        0  TRUE TRUE  0.060
## nmkb           217     NA    NA        0  TRUE TRUE  0.040
## nlm             NA     NA    16        0  TRUE TRUE  0.008
## spg           2342     NA  1304        0  TRUE TRUE  5.660
## hjkb          2132     NA    19        0 FALSE TRUE  0.292
## BFGS            13     13    NA        0 FALSE TRUE  0.044
## CG              13     13    NA        0 FALSE TRUE  0.048
## Nelder-Mead     13     13    NA        0 FALSE TRUE  0.052
## L-BFGS-B        13     13    NA        0 FALSE TRUE  0.048
## bobyqa          21     NA    NA        0 FALSE TRUE  0.004
## newuoa          NA     NA    NA     9999    NA   NA  0.004

Here we see most of the methods have found a reasonable solution.

We can also try the constrained nonlinear optimization packages alabama and Rsolnp. First, we need to define our inequality constraints.

pibnd <- function(par, data) {
    # inequality constraints
    yy <- data$y
    nobs <- length(yy)
    ss <- data$sugar
    aa <- data$age
    beta0 <- par[1]
    gam0 <- par[2]
    gam1 <- par[3]
    ppi <- beta0 * ss + expit(gam0 + gam1 * aa)
    # Note probability bounded away from both 0 and 1
    hh <- c((ppi - 1e-04), (1 - 1e-04 - ppi))
    hh
}
yy <- mydata$y
nobs <- length(yy)
lo <- rep(1e-04, nobs)
up <- 1 - lo
require(alabama)
coo <- list(trace = FALSE)
aaug <- auglag(par = pars, fn = negll, hin = pibnd, data = mydata,
    control.outer = coo)
aaug$par
##      beta0       gam0       gam1
##  0.1501910 -5.6903499  0.0619483
aaug$value
## [1] 199.33
aaug$convergence
## [1] 0
aco <- constrOptim.nl(par = pars, fn = negll, hin = pibnd, data = mydata,
    control.outer = coo)
aco$par
##      beta0       gam0       gam1
##  0.1501612 -5.6901151  0.0619458
aco$value
## [1] 199.33
aco$convergence
## [1] 0

Last in this section, we try Rsolnp.

require(Rsolnp)
pineq <- function(par, data) {
    # we will compute functions that obey the ineqLB and ineqUB
    # constraints
    yy <- data$y
    nobs <- length(yy)
    ss <- data$sugar
    aa <- data$age
    beta0 <- par[1]
    gam0 <- par[2]
    gam1 <- par[3]
    ppi <- beta0 * ss + expit(gam0 + gam1 * aa)
    ppi
}
asolnp <- solnp(pars = pars, fun = negll, ineqfun = pineq, ineqLB = lo,
    ineqUB = up, data = mydata)
##
## Iter: 1 fn: 199.3254  Pars:   0.15280 -5.74331  0.06263
## Iter: 2 fn: 199.3239  Pars:   0.15663 -5.79529  0.06324
## Iter: 3 fn: 199.3239  Pars:   0.15663 -5.79529  0.06324
## solnp- - > Completed in 3 iterations
asolnp$pars
##     beta0      gam0      gam1
##  0.156626 -5.795292  0.063240
negll(asolnp$pars, mydata)
## [1] 199.324

While we see that several methods have quite satisfactorily solved this estimation problem, I would caution that an important step is that of providing reasonable starting values for the parameters. Constrained problems seem to require better starts than their unconstrained counterparts, but this is an opinion based on limited experience rather than exhaustive evidence.

13.4 A perspective on penalty function ideas

The logarithmic barrier function is generally used to make the objective function so large that the parameters are deflected away from a constraint. An alternative is to penalize violation of the constraint. This actually allows the parameters to take on values that are infeasible, that is, outside the bounds. Increasing a weight or control on the penalty allows us to make it more stringent, and generally the parameters become “less infeasible” in such approaches, if we can tolerate such linguistic torture.
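
As a tiny illustration of the distinction for a single lower bound $x \ge 0$ (barrier_term and penalty_term are illustrative names), a logarithmic barrier blows up as the boundary is approached from inside, while a quadratic penalty is zero inside the feasible region and grows only once the bound is violated.

# Sketch only: log barrier versus quadratic penalty for the bound x >= 0
barrier_term <- function(x, mu = 0.1) ifelse(x > 0, -mu * log(x), Inf)
penalty_term <- function(x, pen = 100) pen * pmin(x, 0)^2
barrier_term(c(1, 0.01))    # grows as x approaches 0 from inside
penalty_term(c(1, -0.05))   # zero when feasible, positive when infeasible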

An annoyance with penalty methods is that we may introduce very large numbers as we try to force feasibility, and this may degrade the performance of our (unconstrained) optimization methods. In my own work, I have found penalty methods more useful as a trial to learn about problems than as an ongoing tool for solving a family of optimization problems. Barrier functions, as we have seen, can be used if we start and finish inside the feasible region.

13.5 Assessment

In this chapter, we have seen that there are several ways that R can be used to tackle nonlinear optimization problems with general constraints. What are the lessons to take away? Here are my views:

  • As a first try, it is best to stick with the problem as stated and verify the objective and constraints.
  • Keep any scaling simple, such as powers of 10. This can help to render the problem less difficult for the optimization software, and it does not involve complicated algebra which may be error prone.
  • Equality constraints can be surprisingly difficult to deal with. I prefer, if possible, to eliminate one or more parameters, but as we have seen, one needs to consider which parameter(s) to eliminate.
  • Penalty methods are simple and useful for exploring a problem, especially if infeasible parameters can be tolerated. Otherwise, barrier methods will keep parameters feasible, but we do need an initial feasible set of parameters.
  • Good starting parameters are important to success.
  • As a continuing check, evaluate the objective and constraints in the original specification of the problem.

References

  1. Geradin M 1971 The computational efficiency of a new minimization algorithm for eigenvalue analysis. Journal of Sound and Vibration 19, 319–331.
  2. Kovalchik SA, Varadhan R, Fetterman B, Poitras NE, Wacholder S, and Katki HA 2013 A general binomial regression model to estimate standardized risk differences from binary response data. Statistics in Medicine 32(5), 808–821.
  3. Nash JC 1979 Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation. Adam Hilger, Bristol. Second Edition, 1990, Bristol: Institute of Physics Publications.