The optimization problem is best described as the search for a local maximum or minimum value of a scalar-valued function f(x). This search can be performed over all possible input values in the domain of f (in which case we refer to the problem as unconstrained optimization), or over a specific subset of the domain that is expressible by a finite set of equalities and inequalities (in which case we speak of constrained optimization). In this section, we are going to explore both modalities in several settings.
We focus on the search for the local minima of a function f(x) in an interval [a, b] (the search for local maxima can then be regarded as the search for the local minima of the function −f(x) in the same interval). For this task, we have the routine minimize_scalar in the module scipy.optimize. It accepts as obligatory input a univariate function f(x), together with a search method.
Most search methods are based on the idea of bracketing that we used for root finding, although the concept of bracket is a bit different in this setting. In this case, a good bracket is a triple x < y < z where f(y) is less than both f(x) and f(z). If the function is continuous, its graph presents a U-shape on a bracket. This guarantees the existence of a minimum inside of the subinterval [x, z]. A successful bracketing method will look, on each successive step, for the target extremum in either [x, y] or [y, z].
Let's construct a simple bracketing method for testing purposes. Assume we have an initial bracket a < c < b. By quadratic interpolation, we construct a parabola through the points (a, f(a)), (c, f(c)), and (b, f(b)). Because of the U-shape condition, the interpolating parabola has a minimum (easily computable), say (d, f(d)). It is not hard to prove that the value d lies between the midpoints of the subintervals [a, c] and [c, b]. We will use this point d for our next bracketing step. For example, if it happens that c < d, then the next bracket will be either c < d < b, or a < c < d. Easy enough! Let's implement this method:
In [1]: import numpy as np;
   ...: from scipy.interpolate import lagrange;
   ...: from scipy.optimize import OptimizeResult, minimize_scalar

In [2]: def good_bracket(func, bracket):
   ...:     a, c, b = bracket
   ...:     return (func(a) > func(c)) and (func(b) > func(c))
   ...:

In [3]: def parabolic_step(f, args, bracket, **options):
   ...:     stop = False
   ...:     funcalls = 0
   ...:     niter = 0
   ...:     while not stop:
   ...:         niter += 1
   ...:         interpolator = lagrange(np.array(bracket),
   ...:                                 f(np.array(bracket)))
   ...:         funcalls += 3
   ...:         a, b, c = interpolator.coeffs
   ...:         d = -0.5*b/a
   ...:         if np.allclose(bracket[1], d):
   ...:             minima = d
   ...:             stop = True
   ...:         elif bracket[1] < d:
   ...:             newbracket = [bracket[1], d, bracket[2]]
   ...:             if good_bracket(f, newbracket):
   ...:                 bracket = newbracket
   ...:             else:
   ...:                 bracket = [bracket[0], bracket[1], d]
   ...:         else:
   ...:             newbracket = [d, bracket[1], bracket[2]]
   ...:             if good_bracket(f, newbracket):
   ...:                 bracket = newbracket
   ...:             else:
   ...:                 bracket = [bracket[0], d, bracket[1]]
   ...:     return OptimizeResult(fun=f(minima), x=minima,
   ...:                           nit=niter, nfev=funcalls)
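For reference, the vertex abscissa that the lagrange interpolator computes implicitly has a closed form. The following is a minimal sketch (parabola_vertex is a hypothetical helper, not part of SciPy), assuming a good bracket a < c < b:

def parabola_vertex(f, a, c, b):
    # abscissa of the vertex of the parabola through
    # (a, f(a)), (c, f(c)), and (b, f(b))
    fa, fc, fb = f(a), f(c), f(b)
    num = (c - a)**2 * (fc - fb) - (c - b)**2 * (fc - fa)
    den = (c - a) * (fc - fb) - (c - b) * (fc - fa)
    return c - 0.5 * num / den

For instance, for f(x) = x**2 with the bracket (-1, 0.5, 1), this returns 0, the exact minimum, in a single step.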
The output of any minimizing method must be an OptimizeResult object, with at least the attribute x (the solution to the optimization problem). In the example we have just run, the attributes coded in this method are x, fun (the evaluation of f at that solution), nit (the number of iterations), and nfev (the number of function evaluations needed).
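As a quick illustration (the values below are made up), an OptimizeResult behaves like a dictionary with attribute access, so both access styles are valid:

from scipy.optimize import OptimizeResult

result = OptimizeResult(x=0.5, fun=1.25, nit=3, nfev=9)
print(result.x)        # attribute access
print(result['fun'])   # dictionary access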
Let's run this method over a few examples:
In [4]: f = np.vectorize(lambda x: max(1-x, 2+x))

In [5]: def g(x): return -np.exp(-x)*np.sin(x)

In [6]: good_bracket(f, [-1, -0.5, 1])
Out[6]: True

In [7]: minimize_scalar(f, bracket=[-1, -0.5, 1],
   ...:                 method=parabolic_step)
Out[7]: fun: array(1.5000021457670878)
        nfev: 33
        nit: 11
        x: -0.50000214576708779

In [8]: good_bracket(g, [0, 1.2, 1.5])
Out[8]: True

In [9]: minimize_scalar(g, bracket=[0, 1.2, 1.5],
   ...:                 method=parabolic_step)
Out[9]: fun: -0.32239694192707441
        nfev: 54
        nit: 18
        x: 0.78540558550495643
There are two methods already coded for univariate scalar minimization: golden, using a golden section search, and brent, following an algorithm by Brent and Dekker:
In [10]: minimize_scalar(f, method='brent', bracket=[-1, -0.5, 1])
Out[10]: fun: array(1.5)
         nfev: 22
         nit: 21
         x: -0.5

In [11]: minimize_scalar(f, method='golden', bracket=[-1, -0.5, 1])
Out[11]: fun: array(1.5)
         x: -0.5
         nfev: 44

In [12]: minimize_scalar(g, method='brent', bracket=[0, 1.2, 1.5])
Out[12]: fun: -0.32239694194483443
         nfev: 11
         nit: 10
         x: 0.78539817180087257

In [13]: minimize_scalar(g, method='golden', bracket=[0, 1.2, 1.5])
Out[13]: fun: -0.32239694194483448
         x: 0.7853981573284226
         nfev: 43
Although the bracket fed to the routine minimize_scalar already places a constraint on the function, it is feasible to force the search for a true minimum inside a suitable interval for which no bracket can be easily found; we do so with method='bounded' and the bounds parameter:
In [14]: minimize_scalar(g, method='bounded', bounds=(0, 1.5))
Out[14]: status: 0
         nfev: 10
         success: True
         fun: -0.32239694194483415
         x: 0.78539813414299553
         message: 'Solution found.'
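A sketch of tightening the tolerance of the bounded search through the options parameter, assuming a recent SciPy where the bounded method accepts the 'xatol' option (older releases called it 'xtol'); g is the test function defined above:

result = minimize_scalar(g, method='bounded', bounds=(0, 1.5),
                         options={'xatol': 1e-12})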
Except in the cases of minimization by brute force or by basin hopping, we can perform all other searches with the common routine minimize from the module scipy.optimize. The parameter method, as with its univariate counterpart, takes care of selecting the algorithm employed to achieve the extremum. There are several well-known algorithms already coded, but we also have the possibility of implementing our own ideas via a suitable custom-made method.
In this section, we will focus on the description and usage of the coded implementations. The same technique that we employed in the construction of custom methods for minimize_scalar is valid here, with the obvious challenges that the extra dimensions bring.
To compare all the different methods, we are going to run them against a particularly challenging function: Rosenbrock's parabolic valley (also informally referred to as the banana function). The module scipy.optimize has NumPy versions of this function, as well as of its Jacobian and Hessian:
In [15]: from scipy.optimize import rosen;
    ....: from sympy import var, Matrix, solve, pprint

In [16]: var('x y')
Out[16]: (x, y)

In [17]: F = Matrix([rosen([x, y])]);
    ....: pprint(F)
[(-x + 1)² + 100.0⋅(-x² + y)²]

In [18]: X, Y = np.mgrid[-1.25:1.25:100j, -1.25:1.25:100j]

In [19]: def f(x,y): return rosen([x, y])

In [20]: import matplotlib.pyplot as plt, matplotlib.cm as cm;
    ....: from mpl_toolkits.mplot3d.axes3d import Axes3D

In [21]: plt.figure();
    ....: plt.subplot(121, aspect='equal');
    ....: plt.contourf(X, Y, f(X,Y), levels=np.linspace(0,800,16),
    ....:              cmap=cm.Greys)
    ....: plt.colorbar(orientation='horizontal')
    ....: plt.title('Contour plot')
    ....: ax = plt.subplot(122, projection='3d', aspect='equal')
    ....: ax.plot_surface(X, Y, f(X,Y), cmap=cm.Greys, alpha=0.75)
    ....: plt.colorbar(orientation='horizontal')
    ....: plt.title('Surface plot')
    ....: plt.show()
The figure shows a large area (in the shape of a banana) that might contain local minima. Techniques of multivariate calculus help us locate all of the critical points precisely, instead of relying on intuition. We need first to compute the Jacobian and Hessian of the function:
In [22]: JacF = F.jacobian([x, y]);
    ....: pprint(JacF)
[- 400.0⋅x⋅(- x² + y) + 2⋅x - 2    - 200.0⋅x² + 200.0⋅y]

In [23]: HesF = JacF.jacobian([x, y]);
    ....: pprint(HesF)
⎡1200.0⋅x² - 400.0⋅y + 2    -400.0⋅x⎤
⎣       -400.0⋅x             200.0  ⎦

In [24]: solve(JacF)
Out[24]: [{x: 1.00000000000000, y: 1.00000000000000}]

In [25]: HesF.subs({x: 1.0, y: 1.0})
Out[25]: Matrix([
[ 802.0, -400.0],
[-400.0,  200.0]])

In [26]: _.det()
Out[26]: 400.000000000000
These computations show that there is only one critical point, at (1, 1). Since the Hessian at that point has a positive determinant and a positive upper-left entry, it is positive definite; unequivocally, the function presents a local minimum at that location.
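As a complementary check (a sketch reusing the symbols HesF, x, and y from the session above), the eigenvalues of the Hessian at (1, 1) are both strictly positive:

# both eigenvalues of the Hessian at the critical point are positive,
# confirming that it is positive definite
print(HesF.subs({x: 1.0, y: 1.0}).eigenvals())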
Trying to compute the critical points with this technique for a Rosenbrock function in higher dimensions, while doable, is computationally intense. Moving to four dimensions, for example, takes a decent computer about half a minute:
In [27]: var('x:4');
    ....: X = [x0, x1, x2, x3];
    ....: F = Matrix([rosen(X)])

In [28]: %time solve(F.jacobian(X))
CPU times: user 36.6 s, sys: 171 ms, total: 36.8 s
Wall time: 36.7 s
Out[28]: [{x0: 1.0, x1: 1.0, x2: 1.0, x3: 1.0}]
For large dimensions, the search for global minima can be done by brute force; not very elegant, but it gets the job done. A brute-force algorithm is able to track global minima (or approximate them to satisfactory precision). We can call this method with the routine brute in the module scipy.optimize. The obligatory parameters are the function to be minimized, together with a description of the domain where we will apply the optimization. This domain is best coded as a tuple of slices. For example, to search for a global minimum of the Rosenbrock function in four variables, where each variable is bounded in absolute value by three, we could issue this command:
In [29]: from scipy.optimize import brute

In [30]: interval = slice(-3, 3, 0.25);
    ....: box = [interval] * 4

In [31]: %time brute(rosen, box)
CPU times: user 13.7 s, sys: 6 ms, total: 13.7 s
Wall time: 13.7 s
Out[31]: array([ 1.,  1.,  1.,  1.])
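Note that, by default, brute polishes the best grid point with a downhill simplex search through its finish parameter. A minimal sketch of disabling that final step to inspect the raw grid winner; the exhaustive grid evaluation still dominates the running time:

from scipy.optimize import brute, rosen

# keep the best point of the grid itself, skipping the default
# fmin polishing stage
box = (slice(-3, 3, 0.25),) * 4
raw = brute(rosen, box, finish=None)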
Still quite a slow process! To achieve speed, it is always better to use iterative methods. The search for minima in this setting is achieved according to several schemes (and combinations of these):

- Stochastic methods: These are suitable for the search for genuine global minima. In the module scipy.optimize, we have two exponents of this category:
  - The basin-hopping algorithm, called with the routine basinhopping in the module scipy.optimize. The implementation has an acceptance test given by the Metropolis criterion of a standard Monte Carlo simulation.
  - Simulated annealing, called with method='Anneal'. This is a variation of a Monte Carlo simulation. It is useful for optimization problems where the search space is discrete and large.
- Deterministic algorithms that employ exclusively evaluations of the function: In the module scipy.optimize, we have two methods complying with this philosophy:
  - Powell's method, called with method='Powell'.
  - The downhill simplex method of Nelder and Mead, called with method='Nelder-Mead'.
- Deterministic algorithms that exploit derivatives: In the module scipy.optimize, we have the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno (BFGS), which uses exclusively first derivatives. We call it with method='BFGS'. Related methods include:
  - The conjugate gradient method, called with method='CG'.
  - The truncated Newton method, called with method='Newton-CG', which requires the Jacobian.
  - Two trust-region methods, called with method='dogleg' and method='trust-ncg', which additionally require the Hessian.

Let's browse through these methods.
Let's find the global minimum of the Rosenbrock function of nine variables using the technique of basin hopping:
In [32]: from scipy.optimize import minimize, basinhopping

In [33]: %time basinhopping(rosen, np.zeros(9))
CPU times: user 4.59 s, sys: 7.17 ms, total: 4.6 s
Wall time: 4.6 s
Out[33]: nfev: 75633
         minimization_failures: 52
         fun: 2.5483642615054407e-11
         x: array([ 0.99999992, 0.99999994, 0.99999992, 0.99999981,
                    0.99999962, 0.99999928, 0.99999865, 0.9999972 ,
                    0.99999405])
         message: ['requested number of basinhopping iterations completed
                    successfully']
         njev: 6820
         nit: 100
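The stochastic stage can be combined with any of the local minimizers discussed below. A sketch, assuming we want the local searches performed by 'L-BFGS-B' and more hopping iterations than the default:

import numpy as np
from scipy.optimize import basinhopping, rosen

# minimizer_kwargs is forwarded to the underlying call to minimize
result = basinhopping(rosen, np.zeros(9), niter=200,
                      minimizer_kwargs={'method': 'L-BFGS-B'})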
Let's compare this to the behavior of the (now deprecated) simulated annealing method:
In [34]: minimize(rosen, np.zeros(9), method='Anneal')
Out[34]: status: 5
         success: False
         accept: 19
         nfev: 651
         T: 1130372817.0369582
         fun: 707171392.44894326
         x: array([ 11.63666756, -24.41186725,  48.26727994,
                     3.97730959, -31.52658563,  18.00560694,
                     1.22589971,  21.97577333, -43.9967434 ])
         message: 'Final point not the minimum amongst encountered points'
         nit: 12
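A sketch of a stochastic alternative that later SciPy releases offer in place of 'Anneal' (assuming SciPy 0.15 or newer, where scipy.optimize.differential_evolution was introduced):

from scipy.optimize import differential_evolution, rosen

# each variable constrained to [-3, 3], as in the brute-force run
result = differential_evolution(rosen, [(-3, 3)] * 9)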
Let's compare the results of the Powell method with the downhill simplex method:
In [35]: minimize(rosen, np.zeros(9), method='Powell')
Out[35]: status: 0
         success: True
         direc: array([[ -9.72738085e-06,  2.08442100e-05,  2.06470355e-05,
                          4.39487337e-05,  1.29109966e-04,  1.98333214e-04,
                          3.66992711e-04,  7.00645878e-04,  1.38618490e-03],
                       [ -6.95913466e-06, -7.25642357e-07, -2.39771165e-06,
                          4.10148947e-06, -6.17293950e-06, -6.53887928e-06,
                         -1.06472130e-05, -5.23030557e-06, -2.28609232e-06],
                       [  0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
                          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
                          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
                       [  1.23259262e-06,  9.30817407e-07,  2.48075497e-07,
                         -7.07907998e-07, -2.01233105e-07, -1.10513430e-06,
                         -2.57164619e-06, -2.58316828e-06, -3.89962665e-06],
                       [  6.07328675e-02,  8.51817777e-02,  1.30174960e-01,
                          1.71511253e-01,  9.72602622e-02,  1.47866889e-02,
                          1.12376083e-03,  5.35386263e-04,  2.04473740e-04],
                       [  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
                          0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
                          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
                       [  3.88222708e-04,  8.26386166e-04,  5.56913200e-04,
                          3.08319925e-04,  4.45122275e-04,  2.66513914e-03,
                          6.31410713e-03,  1.24763367e-02,  2.45489699e-02],
                       [  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
                          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
                          0.00000000e+00,  1.00000000e+00,  0.00000000e+00],
                       [ -2.82599868e-13,  2.33068676e-13, -4.23850631e-13,
                         -1.23391999e-12, -2.41224441e-12, -5.08909225e-12,
                         -9.92053051e-12, -2.07685498e-11, -4.10004188e-11]])
         nfev: 6027
         fun: 3.1358222279861171e-21
         x: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
         message: 'Optimization terminated successfully.'
         nit: 56

In [36]: minimize(rosen, np.zeros(9), method='Nelder-Mead')
Out[36]: status: 1
         nfev: 1800
         success: False
         fun: 4.9724099905503065
         x: array([ 0.85460488,  0.70911132,  0.50139591,  0.24591886,
                    0.06234451, -0.01112426,  0.02048509,  0.03266785,
                   -0.01790827])
         message: 'Maximum number of function evaluations has been exceeded.'
         nit: 1287
Let's observe the behavior of the BFGS algorithm on our running example:
In [37]: minimize(rosen, np.zeros(9), method='BFGS')
Out[37]: status: 0
         success: True
         njev: 83
         nfev: 913
         hess_inv: array([[ 1.77874730e-03, 1.01281617e-03, 5.05884211e-04,
                            3.17367120e-04, 4.42590321e-04, 7.92168518e-04,
                            1.52710497e-03, 3.06357905e-03, 6.12991619e-03],
                          [ 1.01281617e-03, 1.91057841e-03, 1.01489866e-03,
                            7.31748268e-04, 8.86058826e-04, 1.44758106e-03,
                            2.76339393e-03, 5.60288875e-03, 1.12489189e-02],
                          [ 5.05884211e-04, 1.01489866e-03, 2.01221575e-03,
                            1.57845668e-03, 1.87831124e-03, 3.06835450e-03,
                            5.86489711e-03, 1.17764144e-02, 2.35964978e-02],
                          [ 3.17367120e-04, 7.31748268e-04, 1.57845668e-03,
                            3.13024681e-03, 3.69430791e-03, 6.16910056e-03,
                            1.17339522e-02, 2.33374859e-02, 4.66640347e-02],
                          [ 4.42590321e-04, 8.86058826e-04, 1.87831124e-03,
                            3.69430791e-03, 7.37204979e-03, 1.23036988e-02,
                            2.33709766e-02, 4.62512165e-02, 9.24474614e-02],
                          [ 7.92168518e-04, 1.44758106e-03, 3.06835450e-03,
                            6.16910056e-03, 1.23036988e-02, 2.48336778e-02,
                            4.71369608e-02, 9.29927375e-02, 1.85683729e-01],
                          [ 1.52710497e-03, 2.76339393e-03, 5.86489711e-03,
                            1.17339522e-02, 2.33709766e-02, 4.71369608e-02,
                            9.44348689e-02, 1.86490477e-01, 3.72360210e-01],
                          [ 3.06357905e-03, 5.60288875e-03, 1.17764144e-02,
                            2.33374859e-02, 4.62512165e-02, 9.29927375e-02,
                            1.86490477e-01, 3.73949424e-01, 7.46959044e-01],
                          [ 6.12991619e-03, 1.12489189e-02, 2.35964978e-02,
                            4.66640347e-02, 9.24474614e-02, 1.85683729e-01,
                            3.72360210e-01, 7.46959044e-01, 1.49726446e+00]])
         fun: 6.00817150312557e-11
         x: array([ 0.99999993,  0.99999986,  0.99999976,  0.99999955,
                    0.99999913,  0.99999832,  0.99999666,  0.99999334,
                    0.99998667])
         message: 'Optimization terminated successfully.'
         jac: array([  5.23788826e-06,  -5.45925187e-06,  -1.35362172e-06,
                       8.75480656e-08,  -9.45374358e-06,   7.31889131e-06,
                       3.34352248e-07,  -7.24984749e-07,   2.02705630e-08])
The pure conjugate gradient method works best with functions that have a single, clear critical point and slopes that do not vary over too wide a range. Multiple stationary points tend to confuse the iterations, and very steep slopes (larger than 1,000) result in terrible rounding errors.
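A small sketch illustrating the first caveat: with several stationary points, the minimum that 'CG' reports depends entirely on the initial guess (wiggly is a made-up test function, not part of SciPy):

import numpy as np
from scipy.optimize import minimize

wiggly = lambda x: np.sin(5 * x[0]) + 0.1 * x[0]**2

# different starting points are attracted to different local minima
print(minimize(wiggly, [0.0], method='CG').x)
print(minimize(wiggly, [2.0], method='CG').x)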
If we do not offer an expression for the Jacobian, the algorithm computes a decent approximation of this operator to obtain the first derivatives:
In [38]: minimize(rosen, np.zeros(9), method='CG')
Out[38]: status: 0
         success: True
         njev: 326
         nfev: 3586
         fun: 1.5035665428352255e-10
         x: array([ 0.9999999 ,  0.99999981,  0.99999964,  0.99999931,
                    0.99999865,  0.99999733,  0.9999947 ,  0.99998941,
                    0.99997879])
         message: 'Optimization terminated successfully.'
         jac: array([ -1.48359492e-06,   2.95867756e-06,   1.71067556e-06,
                      -1.83617409e-07,  -2.47616618e-06,  -5.34951641e-06,
                       2.50389338e-06,  -2.37918319e-06,  -3.86667920e-06])
Including an actual Jacobian improves matters greatly. Note the improvement in the evaluation of the found minimum (fun):
In [39]: from scipy.optimize import rosen_der

In [40]: minimize(rosen, np.zeros(9), method='CG', jac=rosen_der)
Out[40]: status: 0
         success: True
         njev: 406
         nfev: 406
         fun: 8.486856765134401e-12
         x: array([ 0.99999998,  0.99999996,  0.99999994,  0.99999986,
                    0.99999969,  0.99999938,  0.99999875,  0.9999975 ,
                    0.999995  ])
         message: 'Optimization terminated successfully.'
         jac: array([  1.37934336e-06,  -9.03688875e-06,   8.53289049e-06,
                       9.77779178e-06,  -2.63022111e-06,  -1.02087919e-06,
                      -6.55712127e-06,  -1.71887373e-06,  -9.12268328e-07])
The truncated Newton method requires a precise Jacobian to work:
In [41]: minimize(rosen, np.zeros(9), method='Newton-CG')
ValueError: Jacobian is required for Newton-CG method

In [42]: minimize(rosen, np.zeros(9), method='Newton-CG',
    ....:          jac=rosen_der)
Out[42]: status: 0
         success: True
         njev: 503
         nfev: 53
         fun: 5.231613200425767e-08
         x: array([ 0.99999873,  0.99999683,  0.99999378,  0.99998772,
                    0.99997551,  0.99995067,  0.99990115,  0.99980214,
                    0.99960333])
         message: 'Optimization terminated successfully.'
         nhev: 0
         jac: array([  6.67155399e-06,   2.50927306e-05,   1.03398234e-04,
                       4.09953321e-04,   1.63524314e-03,   6.48667316e-03,
                      -1.91779902e-03,  -2.81972861e-04,  -5.67500380e-04])
The methods using trust regions require an exact expression for the Hessian:
In [43]: from scipy.optimize import rosen_hess

In [44]: minimize(rosen, np.zeros(9), method='dogleg',
    ....:          jac=rosen_der, hess=rosen_hess)
Out[44]: status: 0
         success: True
         njev: 25
         nfev: 29
         fun: 9.559277795967234e-19
         x: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
         message: 'Optimization terminated successfully.'
         nhev: 24
         jac: array([  3.84137166e-14,   3.00870439e-13,   1.10489395e-12,
                       4.32831548e-12,   1.72455383e-11,   6.77315980e-11,
                       2.48459919e-10,   6.62723207e-10,  -1.52775570e-09])
         nit: 28

In [45]: minimize(rosen, np.zeros(9), method='trust-ncg',
    ....:          jac=rosen_der, hess=rosen_hess)
Out[45]: status: 0
         success: True
         njev: 56
         nfev: 67
         fun: 3.8939669818289621e-18
         x: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
         message: 'Optimization terminated successfully.'
         nhev: 55
         jac: array([  2.20490293e-13,   5.57109914e-13,   1.77013959e-12,
                      -9.03965791e-12,  -3.05174774e-10,   3.03425818e-09,
                       1.49134067e-08,   6.32240935e-08,  -3.64210218e-08])
         nit: 66
Note the huge improvement in terms of accuracy, iterations, and function evaluations over the previous methods! The obvious drawback is that quite often it is very challenging to obtain good representations of the Jacobian or Hessian operators.
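When a full Hessian is too expensive to code or store, the Newton-CG and trust-region methods also accept a Hessian-vector product through the hessp parameter; a sketch using the product that scipy.optimize already supplies for the Rosenbrock function:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess_prod

# rosen_hess_prod(x, p) returns the Hessian at x applied to the vector p,
# so the full 9 x 9 Hessian is never formed explicitly
result = minimize(rosen, np.zeros(9), method='trust-ncg',
                  jac=rosen_der, hessp=rosen_hess_prod)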
Let's now turn to constrained optimization. Take, for example, the minimization of the plane function f(x, y) = 5x − 2y + 4 over the circle x² + y² = 4. Using SymPy, we can implement the technique of Lagrange multipliers:
In [46]: F = Matrix([5*x - 2*y + 4]);
    ....: G = Matrix([x**2 + y**2 - 4])    # constraint

In [47]: var('z');
    ....: solve(F.jacobian([x, y]) - z * G.jacobian([x, y]))
Out[47]: [{x: 5/(2*z), y: -1/z}]

In [48]: soln = _[0];
    ....: solve(G.subs(soln))
Out[48]: [{z: -sqrt(29)/4}, {z: sqrt(29)/4}]

In [49]: zees = _;
    ....: [(soln[x].subs(item), soln[y].subs(item)) for item in zees]
Out[49]: [(-10*sqrt(29)/29, 4*sqrt(29)/29), (10*sqrt(29)/29, -4*sqrt(29)/29)]
Not too bad! On top of this constraint, we can further impose another condition in the form of an inequality. Think of the same problem as before, but constrained to the upper half of the circle instead: y > 0. This being the case, the new result will be only the point with coordinates x = −10√(29)/29 = −1.8569533817705186 and y = 4√(29)/29 = 0.74278135270820744.
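A quick symbolic check (a sketch) confirms the value of f at the surviving point, which we will meet again in the numerical runs below:

from sympy import sqrt, simplify

x0, y0 = -10*sqrt(29)/29, 4*sqrt(29)/29
print(simplify(5*x0 - 2*y0 + 4))   # 4 - 2*sqrt(29), about -6.7703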
It is, of course, possible to address this problem numerically. In the module scipy.optimize, we have basically four methods, all of which can be called from the common routine minimize:
- A limited-memory BFGS algorithm with bound constraints (we call it with method='L-BFGS-B'). The implementation is actually a wrapper for a FORTRAN routine with the same name, written by Ciyou Zhu, Richard Byrd, and Jorge Nocedal (for details, see, for example, R. H. Byrd, P. Lu, and J. Nocedal, A Limited Memory Algorithm for Bound Constrained Optimization (1995), SIAM Journal on Scientific and Statistical Computing, 16(5), pp. 1190-1208).
- A truncated Newton algorithm with bound constraints (method='TNC'). This implementation is similar to the one we called with method='Newton-CG', except this version is a wrapper for a C routine.
- Constrained optimization by linear approximation (method='COBYLA'). This implementation wraps a FORTRAN routine with the same name.
- A sequential least squares programming method (method='SLSQP'). The implementation is a wrapper for a FORTRAN routine with the same name, written by Dieter Kraft.

Let's use our running example to illustrate how to input different constraints. We implement these as a dictionary or a tuple of dictionaries: each entry in the tuple represents either an equality ('eq') or an inequality ('ineq'), together with a functional expression (in the form of an ndarray when necessary) and the corresponding derivative of it:
In [50]: def f(x): return 5*x[0] - 2*x[1] + 4

In [51]: def jacf(x): return np.array([5.0, -2.0])

In [52]: circle = {'type': 'eq',
    ....:           'fun': lambda x: x[0]**2 + x[1]**2 - 4.0,
    ....:           'jac': lambda x: np.array([2.0 * x[0], 2.0 * x[1]])}

In [53]: semicircle = ({'type': 'eq',
    ....:                'fun': lambda x: x[0]**2 + x[1]**2 - 4.0,
    ....:                'jac': lambda x: np.array([2.0 * x[0],
    ....:                                           2.0 * x[1]])},
    ....:               {'type': 'ineq',
    ....:                'fun': lambda x: x[1],
    ....:                'jac': lambda x: np.array([0.0, 1.0])})
The constraints are fed to the routine minimize through the parameter constraints. The initial guess should also be reasonably close to the feasible set (note that [2, 2] does not lie exactly on the circle, yet the search below succeeds); if it sits too far away, the algorithm may fail to converge to anything meaningful:
In [54]: minimize(f, [2, 2], jac=jacf, method='SLSQP', constraints=circle)
Out[54]: status: 0
         success: True
         njev: 11
         nfev: 13
         fun: -6.7703296142789693
         x: array([-1.85695338,  0.74278135])
         message: 'Optimization terminated successfully.'
         jac: array([ 5., -2.,  0.])
         nit: 11

In [55]: minimize(f, [2, 2], jac=jacf, method='SLSQP', constraints=semicircle)
Out[55]: status: 0
         success: True
         njev: 11
         nfev: 13
         fun: -6.7703296142789693
         x: array([-1.85695338,  0.74278135])
         message: 'Optimization terminated successfully.'
         jac: array([ 5., -2.,  0.])
         nit: 11
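For simple box bounds, rather than functional constraints, the first two methods in the preceding list are more natural choices; a sketch with hypothetical bounds on each variable, reusing f and jacf from the session above:

# minimize the same linear function on the box [-2, 2] x [0, 2]
result = minimize(f, [2, 2], jac=jacf, method='L-BFGS-B',
                  bounds=[(-2, 2), (0, 2)])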