In this introductory chapter we look at the classes of problems for which we will discuss solution tools. We also consider the interrelationships between different problem classes as well as among the solution methods. The treatment is quite general; R is only incidental to this chapter except for some examples. In effect, this chapter sets out our list of things to do.
The general constrained optimization problem can be stated as follows:

   minimize f(x) with respect to the parameters x,
   subject to the constraints c(x) >= 0.

Note that f(x) is a scalar function but x is a vector. There may or may not be constraints on the values of x, and these are expressed formally in the vector of functions c(x). While these functions are general, many problems have much simpler constraints, such as requirements that the values of x be no less than some lower bounds or no greater than some upper bounds, as we shall discuss in the following text.
We have specified the problem as a minimization, but maximization problems can be transformed to minimizations by multiplying the objective function by -1.
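As a small illustration of this transformation in base R (the quadratic objective here is an invented example), a maximum can be found by minimizing the negated function:

```r
# Maximization via minimization: maximize f(x) = 5 - (x - 2)^2,
# which has its maximum value 5 at x = 2
f <- function(x) 5 - (x - 2)^2

# Minimize the negated objective with the base-R 1-D optimizer
res <- optimize(function(x) -f(x), interval = c(-10, 10))

res$minimum    # the maximizing argument of f, approximately 2
-res$objective # the maximum value of f, approximately 5
```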
Note also that we have asked for the set of arguments x that minimize the objective, which essentially implies the global minimum. However, many—if not most—of the numerical methods in optimization are able to find only local minima and quite a few problems are such that there may be many local minima and possibly even more than one global minimum. That is, the global minimum may occur at more than one set of parameters x and may occur on a line or surface.
While there do exist methods for tackling the general optimization problem, almost all the “real” work of optimization in problems related to statistics and modeling tends to be done by more specialized methods that work on problems that are restricted in some ways by the nature of the objective or the constraints (or lack thereof). Indeed, for a number of particular problems, there are very specialized packages expressly designed to solve them. Unfortunately, the user often has to work quite hard to decide if his or her problem actually matches the design considerations of the specialized package. Seemingly small changes—for example, a condition that parameters must be positive—can render the specialized package useless. On the other hand, a very general tool may be quite tedious for the user to apply easily, because objective functions and constraints may require a very large amount of program code in some cases.
In the real world, the objective function and the constraints are not only functions of the parameters x but also depend on data; in fact, they may depend on vast arrays of data, particularly in statistical problems involving large systems.
To illustrate, consider the following examples, which, while “small,” illustrate some of the issues we will encounter.
The Cobb–Douglas production function (Nash and Walker-Smith, 1987, p. 375) predicts the quantity of production q of a commodity as a function of the inputs of capital K (it appears traditional to use a K for this variable) and labor L used, namely,

   q = b1 * K^b2 * L^b3        (1.1)
A traditional approach to this problem is to take logarithms to get

   log(q) = log(b1) + b2 * log(K) + b3 * log(L)        (1.2)
However, the two forms imply very different ways in which errors are assumed to exist between the model and real-world data. Let us assume (almost certainly dangerously) that data for K and L are known precisely, but there may be errors in the data for q. Let us use the name qd for the observed data. In particular, if we use additive errors of the form

   log(qd) = log(b1) + b2 * log(K) + b3 * log(L) + e

then we have

   qd = b1 * K^b2 * L^b3 * exp(e)

where we have given these errors a particular name e. This means that the errors are actually multiplicative in the real scale of the data.
If we estimate the model using the log form, we can sometimes get quite different estimates of the parameters than when using the direct form. The "errors" have different weights in the different scales, and this alters the estimates. If we really believe that the errors are distributed around the direct model with constant variance, then we should not be using the log form, because the log form implies that the relative errors are distributed with constant variance.
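To make the contrast concrete, here is a sketch in base R with synthetic data; the sample size, error scale, and "true" parameter values are invented for illustration, and the names b1, b2, b3 follow the notation assumed for the production function:

```r
# Generate synthetic Cobb-Douglas data with multiplicative errors,
# then estimate the parameters from the log form by linear least squares
set.seed(1)
K <- runif(50, 1, 10)
L <- runif(50, 1, 10)
b <- c(2, 0.7, 0.3)                         # "true" b1, b2, b3
q <- b[1] * K^b[2] * L^b[3] * exp(rnorm(50, sd = 0.05))

fit <- lm(log(q) ~ log(K) + log(L))         # linear in log(b1), b2, b3
exp(coef(fit)[[1]])                         # estimate of b1 (near 2)
coef(fit)[2:3]                              # estimates of b2, b3
```

Fitting the direct form instead (for example, with nls()) weights the discrepancies on the raw scale of q and can give noticeably different estimates when the errors are large.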
This problem is also a nonlinear least squares problem. As we shall see later, it demonstrates a number of computational issues. The problem came across my desk sometime in 1974 when I was working on the development of a program to solve nonlinear least squares estimation problems. I had written several variants of Gauss–Newton methods in BASIC for a Data General NOVA system. This early minicomputer offered a very limited environment: a 10-character-per-second teletype with paper tape reader and punch that allowed access to a maximum 8K byte (actually 4K word) segment of the machine. Arithmetic was particularly horrible in that floating point used six hexadecimal digits in the mantissa with no guard digit.
The problem was supplied by Mr. Dave Hobbs of Agriculture Canada. As I was told, the observations (y) are weed densities per unit area over 12 growing periods. I was never given the actual units of the observations. Here are the data (Figure 1.1).
# draw the data
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558,
       50.156, 62.948, 75.995, 91.972)
t <- 1:12
plot(t, y)
title(main = "Hobbs' weed infestation data", font.main = 4)
It was suggested that the appropriate model was a 3-parameter logistic, that is,

   y approx b1 / (1 + b2 * exp(-b3 * t))

where y is the modeled weed density, t is the growing period, and b1, b2, and b3 are the parameters to be estimated. We shall see later that there are other forms for the model that may give better computational properties.
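As a sketch, the logistic model can be fitted to the Hobbs data with the base-R nls() function; the data are repeated here for completeness, and the starting values are assumptions chosen near the solution, since nls() can fail from poor starts on this problem:

```r
# Fit the 3-parameter logistic y ~ b1/(1 + b2*exp(-b3*t)) to the Hobbs data
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558,
       50.156, 62.948, 75.995, 91.972)
t <- 1:12

fit <- nls(y ~ b1 / (1 + b2 * exp(-b3 * t)),
           start = list(b1 = 200, b2 = 50, b3 = 0.3))
coef(fit)  # roughly b1 = 196.2, b2 = 49.1, b3 = 0.314
```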
What do we mean by “nonlinear?” The word clearly implies “not a straight line,” and many researchers take this to apply to the model they are trying to estimate. However, for the process of estimation, which generally involves minimizing a loss function such as a sum of squared deviations or maximizing a likelihood function, the key issue is that of solving a set of equations to find the result.
When we minimize the sum of squares for a model that is linear in the parameters, such as the log form of the Cobb–Douglas function (1.2) above, where log(b1), b2, and b3 appear only to the first power, we can apply standard calculus to arrive at the normal equations. These are a set of linear equations. However, when we want to minimize the sum of squares from the original model (1.1), it is generally necessary to use an iterative method from some starting set of the parameters b1, b2, and b3.
For the purposes of this book, "nonlinear" will refer to the process of finding a solution, implying that there is no method that arrives at a solution via a predetermined set of solutions of linear equations. That is, while we use a lot of linear algebra in finding solutions to the problems of interest in this book, we cannot specify in advance how many such linear subproblems are needed.
There are some particular forms of the objective function that lead to specialized, but quite common, solution methods. This gives us one dimension or axis by which to categorize the optimization methods we shall consider later.
If the objective function is a sum of squared terms, we can use a method for solving nonlinear least squares problems. Clearly, the estimation of the Cobb–Douglas production model above by minimizing the sum of squared residuals is a problem of this type.
We note that the Cobb–Douglas problem is linear in the parameters in the case of the log-form model. The linear least squares problem is so pervasive that it is worth noting how it may be solved because some approaches to nonlinear problems can be viewed as solving sequences of linear problems.
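The linear case can be sketched in base R; the design matrix and coefficients below are invented for illustration. The QR-based route avoids explicitly forming the normal equations, which is numerically safer:

```r
# Linear least squares: solve min ||X %*% beta - yobs||^2 two classic ways
set.seed(42)
X <- cbind(1, rnorm(20), rnorm(20))            # design matrix with intercept
beta_true <- c(1, 2, -0.5)
yobs <- X %*% beta_true + rnorm(20, sd = 0.01)

beta_qr <- qr.solve(X, yobs)                   # QR decomposition route
beta_ne <- solve(t(X) %*% X, t(X) %*% yobs)    # normal equations (less stable)

max(abs(beta_qr - beta_ne))                    # the two agree closely here
```

For well-conditioned problems like this one the two answers coincide to many digits; for nearly collinear columns of X the normal equations can lose roughly twice as many digits as the QR route.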
It is sometimes important to have an upper bound on the deviation of a model from "data." We therefore wish to find the set of parameters in a model that minimizes the maximum deviation, hence a minimax problem. In particular, consider that there may be relatively simple approximations to some specialized and awkward-to-compute special functions. This sort of approximation problem is less familiar to statistical workers than sums-of-squares problems. Moreover, the small residuals may render some traditional methods, such as the R function nls(), ill-suited to their solution.
Possibly as a result of the general mathematical education most of us receive, our view of functions is that they have a “nice” shape where there is just one minimum or maximum. Reality is much less kind. Functions often have multiple local extrema, and most of our methods—often based on traditional calculus ideas—are designed to find local minima or maxima efficiently. Knowing that there may be multiple local, and possibly even multiple global, solutions is important if we are to find the right solution or even an acceptable solution.
Sometimes an objective function is not precisely determined. On one occasion, I considered optimizing racing-car parameters such as fuel–air ratio, braking settings, and so on by “computing” the objective via measurement of the time taken to complete a circuit of the track. This is clearly stochastic. There are other such problems where integrals are estimated by Monte-Carlo methods. See, for example, http://en.wikipedia.org/wiki/Monte_Carlo_integration for a general overview.
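As a minimal sketch of such a stochastic computation, here is a Monte-Carlo estimate of a simple integral in base R; the integrand and sample size are arbitrary choices for illustration:

```r
# Monte-Carlo estimate of integral of sin(x) on [0, pi] (exact value is 2)
set.seed(123)
n <- 1e5
x <- runif(n, 0, pi)           # uniform samples over the interval
est <- pi * mean(sin(x))       # interval length times mean integrand value
est                            # close to 2, but varies from run to run
```

An objective computed this way returns a slightly different value on each evaluation, which defeats methods that rely on smooth, repeatable function values.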
Another categorization of optimization problems and their solution methods is via the constraints that are imposed on the parameters.
The most obvious case is the unconstrained problem, that is, no constraints at all. Truthfully, real problems always have constraints, but many can be solved most easily and efficiently without the need to include them explicitly.
The next simplest is bounds or box constraints. Here we impose lower and upper bounds on the parameters. If we have lower bounds in a vector lo and upper bounds in a vector up, then the parameters are in the n-dimensional box defined by the inequalities

   lo_i <= x_i <= up_i,   for i = 1, 2, ..., n
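In R, for example, the base optim() function supports such bounds via method "L-BFGS-B"; here is a minimal sketch with an invented quadratic objective whose unconstrained minimum lies outside the box:

```r
# Box constraints with optim(): the unconstrained minimum of fb is at (3, 3),
# outside the box [0, 2] x [0, 2], so the solution lands on the bounds
fb <- function(x) sum((x - 3)^2)

res <- optim(par = c(0, 0), fn = fb, method = "L-BFGS-B",
             lower = c(0, 0), upper = c(2, 2))
res$par   # c(2, 2): both parameters pushed to their upper bounds
```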
Linear constraints take the form

   A1 x = b1

or

   A2 x >= b2

where the matrices A1 and A2 and the vectors b1 and b2 contain fixed coefficients.
If all constraints are linear (which includes bounds), there may be specialized methods. In particular, if additionally the objective function is linear in the parameters x, we have the special case of linear programming. An objective that is quadratic in x with linear constraints gives us quadratic programming. The historical use of the word "programming" is unfortunate, but we will use it because it has a large literature. The term nonlinear programming is, as far as I can determine, a synonym for the general optimization problem of nonlinear objective with nonlinear constraints.
Equality constraints can be specified as a vector of (possibly) nonlinear functions

   e(x) = 0
We can formally replace each such equation e_i(x) = 0 with the two inequalities e_i(x) >= 0 and -e_i(x) >= 0 and thereby subsume the set of functions e(x) into the expanded set of constraint functions c(x), even if this is not the approach we would choose in seeking a solution. However, feasible equality constraints imply that we can solve for some of the parameters in terms of the rest. For the most part, I prefer to assume that such a solution has been found and that we are working with a reduced problem that does NOT have equality constraints.
The solution of the equations that give rise to equality constraints is not quite the same as solving a complete set of n (nonlinear) equations in n unknowns. The constraint equations generally leave us with some parameters still to be determined, and general methods may be difficult to discover and apply.
For solving complete sets of equations, we want to solve for x in

   f(x) = 0

where f is now a vector of n functions of the n parameters. A general vector x will not give 0 but instead a nonzero vector r of "residuals." Thus we can attempt a solution using nonlinear least squares methods. If our resulting value of the objective, which is

   sum over i of r_i^2

turns out to be zero, then we have found a solution. Here I will tread very lightly on the issue of whether this solution is unique and similarly consider the question of existence of a solution to be proven by finding such a solution. A failure to find a zero sum of squares is not, however, a guarantee that there is no such solution.
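A sketch of this approach in base R, using an invented two-equation system whose roots are known:

```r
# Solve a small nonlinear system f(x) = 0 by minimizing the sum of squared
# residuals. The (assumed, illustrative) system is:
#   x1^2 + x2^2 - 2 = 0
#   x1 - x2        = 0     roots at (1, 1) and (-1, -1)
ssq <- function(x) {
  r <- c(x[1]^2 + x[2]^2 - 2, x[1] - x[2])
  sum(r^2)
}

res <- optim(c(2, 0), ssq, method = "BFGS")
res$par     # near (1, 1) from this start
res$value   # essentially zero, so a root has been found
```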
In Chapter 7, we discuss nonlinear equations. As we have just noted, these could be solved by nonlinear least squares if the solution has zero residuals. This is usually thought to be a poor approach, but I have sometimes found it expedient. In the other direction, solving a set of gradient equations can solve optimization and nonlinear least squares problems if the solution can be found and can be shown to be an optimum (minimum for minimization, maximum for maximization) and not a saddle point or on a flat area. Moreover, we may want a global optimum where multiple local optima exist. Thus, I generally avoid this approach to optimization because it demands too much checking of the “solution.”
It is helpful at this early stage to mention that the meaning of "optimize" is open to more than one interpretation. It is possible to take a very pure mathematical viewpoint, but that inevitably fails to help scientists, economists, and engineers who need to provide reasonable answers to difficult questions.
For our purposes, optimality means that we want to find the values of the parameters of some function describing a system of interest that cause that function to be a minimum. To find a maximum, we can find the minimum of the negative of the function.
Beyond the concept of a minimum, we may have a whole plethora of assumptions, many of which will be unstated. These assumptions give rise to constraints, some of which could be incorporated into the optimization methods, although others may be simply of the form “that set of parameters is not of interest.” This could be because there are mathematical possibilities that have no relevance to practical application, for example, solutions to a medical infection that also kills the patient.
Of particular interest in this book will be the optimality conditions, sometimes called the Karush–Kuhn–Tucker conditions (Karush, 1939; Kuhn and Tucker, 1951). These essentially say that, except at a constraint boundary, a minimum of a function will have a zero gradient and a positive-definite Hessian. We return to these ideas several times within the book.
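As a sketch, the unconstrained part of these conditions can be checked numerically. The finite-difference helpers, step sizes, and quadratic test function below are all assumptions for illustration; packages such as numDeriv provide more careful derivative approximations:

```r
# Check optimality numerically: gradient near zero and Hessian positive
# definite at a proposed minimum, using central differences
numgrad <- function(f, x, h = 1e-5) {
  sapply(seq_along(x), function(i) {
    e <- rep(0, length(x)); e[i] <- h
    (f(x + e) - f(x - e)) / (2 * h)
  })
}
numhess <- function(f, x, h = 1e-4) {
  n <- length(x)
  H <- matrix(0, n, n)
  for (i in 1:n) {
    e <- rep(0, n); e[i] <- h
    H[i, ] <- (numgrad(f, x + e) - numgrad(f, x - e)) / (2 * h)
  }
  (H + t(H)) / 2  # symmetrize to clean up rounding error
}

fq <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 0.5)^2  # minimum at (1, -0.5)
xstar <- c(1, -0.5)
max(abs(numgrad(fq, xstar)))        # essentially zero
eigen(numhess(fq, xstar))$values    # all positive: positive definite
```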
Solution methods sometimes dictate the way we approach problems. The old adage that for a man with a hammer everything looks like a nail is far too close to reality to be comfortable. Because there are many specialized computational methods that solve particular optimization-related problems very efficiently, there is often a temptation to cast problems so these methods can be applied. An important aspect of this book is to provide some perspective so that the user can decide when and if this is sensible.