Chapter 2

Cross-Entropy Method

2.1 Introduction

The cross-entropy (CE) method is a powerful technique for solving difficult estimation and optimization problems, based on Kullback-Leibler (or cross-entropy) minimization [15]. It was pioneered by Rubinstein in 1999 [100] as an adaptive importance sampling procedure for the estimation of rare-event probabilities. Subsequent work [101, 102] has shown that many optimization problems can be translated into rare-event estimation problems, so that the CE method can be used as a randomized algorithm for optimization. The gist of the idea is that the probability of locating an optimal or near-optimal solution using naive random search is a rare-event probability. The cross-entropy method gradually changes the sampling distribution of the random search so that this rare event becomes more likely to occur. For this purpose, using the cross-entropy or Kullback-Leibler divergence as a measure of closeness between two sampling distributions, the method estimates a sequence of parametric sampling distributions that converges to a distribution with probability mass concentrated in a region of near-optimal solutions.

To date, the CE method has been successfully applied to optimization and estimation problems. The former includes mixed-integer nonlinear programming [69], continuous optimal control problems [111, 112], continuous multiextremal optimization [71], multidimensional independent component analysis [116], optimal policy search [19], clustering [11, 17, 72], signal detection [80], DNA sequence alignment [67, 93], fiber design [27], noisy optimization problems such as optimal buffer allocation [2], resource allocation in stochastic systems [31], network reliability optimization [70], vehicle routing optimization with stochastic demands [29], power system combinatorial optimization problems [45], and neural and reinforcement learning [81, 86, 118, 128]. CE has even been used as a main engine for playing games such as Tetris, Go, and backgammon [26].

In the estimation setting, the CE method can be viewed as an adaptive importance sampling procedure. In the optimization setting, the optimization problem is first translated into a rare-event estimation problem, and then the CE method for estimation is used as an adaptive algorithm to locate the optimum.

The CE method is based on a simple iterative procedure where each iteration contains two phases: (a) generate a random data sample (trajectories, vectors, etc.) according to a specified mechanism; (b) update the parameters of the random mechanism on the basis of the data, in order to produce a “better” sample in the next iteration.

Among its two uses, optimization and estimation, CE is best known for optimization, where it is able to handle problems with sizes up to the thousands within reasonable time limits. For estimation, and also for counting problems, multilevel CE is useful only for problems of small and moderate size. The main reason is that, unlike in optimization, in estimation multilevel CE involves likelihood ratio factors for all variables. The well-known degeneracy of likelihood ratio estimators [106] causes CE to perform poorly in high dimensions. To overcome this problem, [25] proposes a variation of the CE method that uses a single iteration to determine the importance sampling density. For this approach to work, one must be able to generate samples from the optimal (zero-variance) density.

A tutorial on the CE method is given in [37]. A comprehensive treatment can be found in [107]. The CE method website (www.cemethod.org) lists several hundred publications.

The rest of this chapter is organized as follows. Section 2.2 presents a general CE algorithm for the estimation of rare event probabilities, while Section 2.3 introduces a slight modification of this algorithm for solving combinatorial optimization problems. We consider applications of the CE method to the multidimensional knapsack problem, to the Mastermind game, and to Markov decision problems. Also, we provide supportive numerical results on the performance of the algorithm for these problems. Sections 2.4 and 2.5 discuss advanced versions of CE to deal with continuous and noisy optimization, respectively.

2.2 Estimation of Rare-Event Probabilities

Consider the problem of computing a probability of the form

(2.1)  $\ell = \mathbb{P}_f\big(S(X) \ge \gamma\big) = \mathbb{E}_f\big[I_{\{S(X) \ge \gamma\}}\big]$

for some fixed level $\gamma$. In this expression, $S(X)$ is the sample performance and $X$ is a random vector with probability density function (pdf) $f(x)$, sometimes called the prior pdf. The standard or crude Monte Carlo estimation method is based on repetitive sampling from $f$ and using the sample average:

(2.2)  $\hat\ell = \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \ge \gamma\}},$

where $X_1, \ldots, X_N$ are independent and identically distributed (iid) samples drawn from $f$. However, for rare-event probabilities, say of the order $10^{-9}$ or less, this method fails for the following reasons [108]. The efficiency of the estimator is measured by its relative error (RE), which is defined as

(2.3)  $\mathrm{RE} = \frac{\sqrt{\mathrm{Var}(\hat\ell)}}{\mathbb{E}[\hat\ell]}.$

Clearly, for the standard estimator (2.2),

$\mathrm{Var}(\hat\ell) = \frac{\ell (1 - \ell)}{N}, \qquad \text{so that} \qquad \mathrm{RE} = \sqrt{\frac{1 - \ell}{N \ell}}.$

Thus, for small probabilities, meaning $\ell \to 0$, we get

$\mathrm{RE} \approx \frac{1}{\sqrt{N \ell}}.$

This says that the sample size $N$ must be of the order $1/(\ell\,\varepsilon^2)$ to obtain a relative error of $\varepsilon$. For rare-event probabilities, this sample size leads to unmanageable computation times. For instance, let $X$ be the maximum number of customers that is present during a busy period of an M/M/1 queue. For a load of 80% and a high level, the probability of exceeding that level is extremely small, and a simulation to obtain a relative error of 10% would require about $100/\ell$ samples of $X$ (busy cycles). On a 2.79 GHz (0.99 GB RAM) PC this would take about 18 days.
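To make the sample-size barrier concrete, the following Python sketch (ours, not from the text) implements the crude Monte Carlo estimator (2.2) together with its estimated relative error; the performance function used here is an arbitrary toy choice:

```python
import numpy as np

def crude_mc(perf, gamma, N, rng):
    """Crude Monte Carlo estimate (2.2) of P(S(X) >= gamma), with its RE."""
    hits = np.array([perf(rng) >= gamma for _ in range(N)])
    ell = hits.mean()
    re = np.sqrt((1.0 - ell) / (N * ell)) if ell > 0 else np.inf
    return ell, re

rng = np.random.default_rng(0)
# Toy performance function (ours): sum of five unit-mean exponentials
perf = lambda rng: rng.exponential(1.0, 5).sum()
print(crude_mc(perf, gamma=25.0, N=10**5, rng=rng))  # level deep in the tail: 0 hits
```

With this level the true probability is of order $10^{-7}$, so $10^5$ samples typically contain no hits at all and the estimator returns zero with an infinite relative error.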

Assuming asymptotic normality of the estimator, confidence intervals (CI) follow in the standard way. For example, a 95% CI for $\ell$ is given by

$\hat\ell \pm 1.96 \sqrt{\frac{\hat\ell (1 - \hat\ell)}{N}}, \qquad \text{that is,} \qquad \hat\ell\,(1 \pm 1.96\,\mathrm{RE}).$

This shows again the reciprocal relationship between sample size and rare-event probability: suppose one wishes the confidence half-width to be at most $\delta\%$ relative to the estimate $\hat\ell$. This boils down to

$1.96 \sqrt{\frac{1 - \ell}{N \ell}} \le \frac{\delta}{100}, \qquad \text{that is,} \qquad N \ge \left(\frac{196}{\delta}\right)^2 \frac{1 - \ell}{\ell}.$

For variance reduction we apply importance sampling, using a pdf $g$; that is, we draw independent and identically distributed samples $X_1, \ldots, X_N$ from $g$, calculate their likelihood ratios

$W(X_k) = \frac{f(X_k)}{g(X_k)}, \qquad k = 1, \ldots, N,$

and set the importance sampling estimator to

(2.4)  $\hat\ell = \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \ge \gamma\}}\, W(X_k).$

By writing $Z_k = I_{\{S(X_k) \ge \gamma\}} W(X_k)$, we see that this estimator is of the form $\frac{1}{N}\sum_{k=1}^{N} Z_k$. Again, we measure efficiency by the relative error, similar to (2.3), which can be estimated by $S_Z / (\hat\ell \sqrt{N})$, with

(2.5)  $S_Z^2 = \frac{1}{N-1} \sum_{k=1}^{N} (Z_k - \hat\ell)^2$

being the sample variance of the $Z_k$.

In almost all realistic problems, the prior pdf $f$ belongs to a parameterized family

(2.6)  $\mathcal{F} = \{ f(\cdot; v),\ v \in \mathcal{V} \}.$

That means that $f(\cdot) = f(\cdot; u)$ for some fixed parameter $u$. It is then natural to choose the importance sampling pdf from the same family, using another parameter $v$, so that $g(\cdot) = f(\cdot; v)$. In that case the likelihood ratio becomes

(2.7)  $W(x; u, v) = \frac{f(x; u)}{f(x; v)}.$

We refer to the original parameter $u$ as the nominal parameter of the problem, and we call any parameter $v$ used in the importance sampling a reference parameter.

Example 2.1 Stochastic Shortest Path
Our objective is to estimate efficiently the probability $\ell$ that the shortest path from node A to node B in the bridge network of Figure 2.1 has a length of at least $\gamma$. The random lengths $X_1, \ldots, X_5$ of the links are assumed to be independent and exponentially distributed with means $u_1, \ldots, u_5$, respectively.
Figure 2.1 Shortest path from A to B.
Defining $X = (X_1, \ldots, X_5)$, $u = (u_1, \ldots, u_5)$, and

$S(X) = \min\{ X_1 + X_4,\ X_1 + X_3 + X_5,\ X_2 + X_5,\ X_2 + X_3 + X_4 \},$

the problem is cast in the framework of (2.1). Note that we can estimate (2.1) via (2.4) by drawing $X_1, \ldots, X_5$ independently from exponential distributions that are possibly different from the original ones; that is, with means $v_1, \ldots, v_5$ instead of $u_1, \ldots, u_5$.

The challenging problem is how to select a vector $v = (v_1, \ldots, v_5)$ that gives the most accurate estimate of $\ell$ for a given simulation effort. To do so, we use a two-stage procedure, where both the level $\gamma$ and the reference parameters $v$ are updated simultaneously. As we shall soon see, one of the strengths of the CE method for rare-event simulation is that it provides a fast way to estimate the optimal parameter vector accurately.
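For illustration, here is a minimal Python sketch (ours) of Example 2.1: the bridge-network performance function and the importance sampling estimator (2.4) with the exponential likelihood ratio (2.7). The nominal means `u` and reference means `v` below are placeholder values, not taken from the text:

```python
import numpy as np

def shortest_path(x):
    """Length of the shortest A-to-B path in the bridge network of Figure 2.1."""
    x1, x2, x3, x4, x5 = x
    return min(x1 + x4, x1 + x3 + x5, x2 + x5, x2 + x3 + x4)

def is_estimate(u, v, gamma, N, rng):
    """Importance sampling estimator (2.4): sample link lengths with means v,
    weight by the likelihood ratio (2.7) of exponentials with means u vs. v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    X = rng.exponential(v, size=(N, 5))
    W = np.prod((v / u) * np.exp(X / v - X / u), axis=1)
    Z = np.array([shortest_path(x) >= gamma for x in X]) * W
    ell = Z.mean()
    re = Z.std(ddof=1) / (ell * np.sqrt(N)) if ell > 0 else np.inf
    return ell, re

rng = np.random.default_rng(1)
u = np.ones(5)                        # hypothetical nominal means
print(is_estimate(u, v=3 * u, gamma=10.0, N=10**5, rng=rng))
```

Choosing `v` well is exactly the problem that the CE method solves adaptively in what follows.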

The idea of the CE method is to choose the importance sampling pdf $g$ from within the parametric class of pdfs $\{f(\cdot; v)\}$ such that the Kullback-Leibler divergence between the optimal (zero-variance) importance sampling pdf $g^*$, given by [107]

$g^*(x) = \frac{f(x; u)\, I_{\{S(x) \ge \gamma\}}}{\ell},$

and $f(\cdot; v)$ is minimal. The Kullback-Leibler divergence between two pdfs $g$ and $h$ is given by

(2.8)  $\mathcal{D}(g, h) = \mathbb{E}_g\left[\ln \frac{g(X)}{h(X)}\right] = \int g(x) \ln g(x)\, dx - \int g(x) \ln h(x)\, dx.$

The CE minimization procedure then reduces to finding an optimal reference parameter vector, $v^*$ say, by cross-entropy minimization:

(2.9)  $v^* = \operatorname*{argmax}_{v} \mathbb{E}_w\left[ I_{\{S(X) \ge \gamma\}}\, W(X; u, w)\, \ln f(X; v) \right],$

where $w$ is any reference parameter and the subscript in the expectation operator indicates the density of $X$. This $v^*$ can be estimated via the stochastic counterpart of (2.9):

(2.10)  $\hat v = \operatorname*{argmax}_{v} \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \ge \gamma\}}\, W(X_k; u, w)\, \ln f(X_k; v),$

where $X_1, \ldots, X_N$ are drawn from $f(\cdot; w)$.

The optimal parameter $\hat v$ in (2.10) can often be obtained in explicit form, in particular when the class of sampling distributions forms an exponential family [107, pages 319–320]. Indeed, analytical updating formulas can be found whenever explicit expressions for the maximum likelihood estimators of the parameters are available [37, page 36].

In the other cases, we consider a multilevel approach, where we generate a sequence of reference parameters $\{\hat v_t,\ t \ge 0\}$ and a sequence of levels $\{\hat\gamma_t,\ t \ge 1\}$, while iterating in both $\gamma_t$ and $v_t$. Our ultimate goal is to have $\hat v_t$ close to $v^*$ after some number of iterations and to use $\hat v_t$ in the importance sampling density $f(\cdot; \hat v_t)$ to estimate $\ell$.

We start with $\hat v_0 = u$ and choose $\varrho$ to be a not too small number, say $\varrho = 10^{-2}$. At the first iteration, we choose $v_1$ to be the optimal parameter for estimating $\mathbb{P}_u(S(X) \ge \gamma_1)$, where $\gamma_1$ is the $(1-\varrho)$-quantile of $S(X)$ under $u$. That is, $\gamma_1$ is the largest real number for which $\mathbb{P}_u(S(X) \ge \gamma_1) \ge \varrho.$ Thus, if we simulate under $u$, then level $\gamma_1$ is reached with a reasonably high probability of around $\varrho$. In this way we can estimate the (unknown!) level $\gamma_1$ simply by generating a random sample $X_1, \ldots, X_N$ from $f(\cdot; u)$ and calculating the sample $(1-\varrho)$-quantile of the performances; that is, denoting the ordered performance values by $S_{(1)} \le \cdots \le S_{(N)}$,

$\hat\gamma_1 = S_{(\lceil (1-\varrho) N \rceil)},$

where $\lceil \cdot \rceil$ denotes rounding upward to the nearest integer. Next, the reference parameter $\hat v_1$ can be estimated via (2.10), replacing $\gamma$ with the estimate $\hat\gamma_1$. Note that we can use the same sample for estimating both $\hat\gamma_1$ and $\hat v_1$. This means that $\hat v_1$ is estimated on the basis of the $\lceil \varrho N \rceil$ best samples, that is, the samples $X_k$ for which $S(X_k)$ is greater than or equal to $\hat\gamma_1$. These form the elite samples of the first iteration. We repeat these steps in the subsequent iterations. The two updating phases, starting from $\hat v_0 = u$, are given below:

1. Adaptive updating of $\gamma_t$. For a fixed $v_{t-1}$, let $\gamma_t$ be the $(1-\varrho)$-quantile of $S(X)$ under $v_{t-1}$. To estimate $\gamma_t$, draw a random sample $X_1, \ldots, X_N$ from $f(\cdot; \hat v_{t-1})$ and evaluate the sample $(1-\varrho)$-quantile $\hat\gamma_t$.
2. Adaptive updating of $v_t$. For fixed $\gamma_t$ and $v_{t-1}$, derive $v_t$ as

(2.11)  $v_t = \operatorname*{argmax}_{v} \mathbb{E}_{v_{t-1}}\left[ I_{\{S(X) \ge \gamma_t\}}\, W(X; u, v_{t-1})\, \ln f(X; v) \right].$

The stochastic counterpart of (2.11) is as follows: for fixed $\hat\gamma_t$ and $\hat v_{t-1}$, derive $\hat v_t$ as the solution

(2.12)  $\hat v_t = \operatorname*{argmax}_{v} \frac{1}{N} \sum_{X_k \in \mathcal{E}_t} W(X_k; u, \hat v_{t-1})\, \ln f(X_k; v),$

where $\mathcal{E}_t$ is the set of elite samples in the $t$-th iteration, that is, the samples $X_k$ for which $S(X_k) \ge \hat\gamma_t$.

The procedure terminates when at some iteration $T$ a level $\hat\gamma_T$ is reached that is at least $\gamma$; at that point the original value of $\gamma$ can be used without getting too few samples. We then reset $\hat\gamma_T$ to $\gamma$, reset the corresponding elite set, and deliver the final reference parameter $\hat v_T$, using again (2.12). This $\hat v_T$ is then used in (2.4) to estimate $\ell$.

Remark 2.1 Improved Cross-Entropy Method
The multilevel approach may not be appropriate for problems that cannot be cast in the framework of (2.1). Specifically, when we deal with a high-dimensional problem, there is often no natural modeling in terms of level crossings. Furthermore, in high-dimensional settings the likelihood ratio degeneracy becomes a severe issue, causing the importance sampling estimator to be unreliable [106]. Recent enhancements of the cross-entropy method have been developed to overcome this problem. A successful approach, proposed in [25], is based on applying a Markov chain Monte Carlo method to generate samples from the zero-variance density $g^*$. With these samples, one maximizes the logarithm of the proposed parametric density with respect to the reference parameter $v$. The advantage is that this optimization program does not involve the likelihood ratio.
Remark 2.2 Smoothed Updating

Instead of updating the parameter vector $v$ directly via (2.12), we use the following smoothed version:

(2.13)  $\hat v_t = \alpha\, \tilde v_t + (1 - \alpha)\, \hat v_{t-1},$

where $\tilde v_t$ is the parameter vector given by (2.12) and $\alpha$ is called the smoothing parameter, typically with $0.7 \le \alpha \le 1$. Clearly, for $\alpha = 1$ we recover the original updating rule. The reason for using the smoothed update (2.13) instead of the original one is twofold: (a) to smooth out the values of $\hat v_t$, and (b) to reduce the probability that some component $\hat v_{t,i}$ of $\hat v_t$ will be zero or one at the first few iterations. This is particularly important when $\hat v_t$ is a vector or matrix of probabilities: for $0 < \alpha < 1$ it is always ensured that $0 < \hat v_{t,i} < 1$, while for $\alpha = 1$ one might have (even at the first iterations) that either $\hat v_{t,i} = 0$ or $\hat v_{t,i} = 1$ for some indices $i$. As a result, the algorithm could converge to a wrong solution.

The resulting CE algorithm for rare-event probability estimation can thus be written as follows.

Algorithm 2.1 CE Algorithm for Rare-Event Estimation
1. Initialize: Define $\hat v_0 = u$. Let $N^e = \lceil \varrho N \rceil$. Set $t = 1$ (iteration counter).
2. Draw: Generate a random sample $X_1, \ldots, X_N$ according to the pdf $f(\cdot; \hat v_{t-1})$.
3. Select: Calculate the performances $S(X_k)$ for all $k$, and order them from smallest to largest, $S_{(1)} \le \cdots \le S_{(N)}$. Let $\hat\gamma_t$ be the sample $(1-\varrho)$-quantile of the performances: $\hat\gamma_t = S_{(N - N^e + 1)}$. If $\hat\gamma_t > \gamma$, reset $\hat\gamma_t$ to $\gamma$.
4. Update: Use the same sample $X_1, \ldots, X_N$ to solve the stochastic program (2.12).
5. Smooth: Apply (2.13) to smooth out the vector $\tilde v_t$.
6. Iterate: If $\hat\gamma_t < \gamma$, set $t = t + 1$ and reiterate from Step 2. Else proceed with Step 7.
7. Final: Let $T$ be the final iteration counter. Generate a sample $X_1, \ldots, X_{N_1}$ according to the pdf $f(\cdot; \hat v_T)$ and estimate $\ell$ via (2.4).

Apart from specifying the family of sampling pdfs, the sample sizes $N$ and $N_1$, and the rarity parameter $\varrho$, the algorithm is completely self-tuning. The sample size $N$ for determining a good reference parameter can usually be chosen much smaller than the sample size $N_1$ for the final importance sampling estimation, say $N = 1{,}000$ versus $N_1 = 100{,}000$. Under certain technical conditions [107], the deterministic version of Algorithm 2.1 is guaranteed to terminate (reach level $\gamma$) provided that $\varrho$ is chosen small enough.

Note that Algorithm 2.1 breaks down the “hard” problem of estimating the very small probability $\ell$ into a sequence of “simple” problems, generating a sequence of pairs $\{(\hat\gamma_t, \hat v_t)\}$, depending on $\varrho$, which is called the rarity parameter. Convergence of Algorithm 2.1 is discussed in [107]. Other convergence proofs for the CE method may be found in [33] and [84].
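The following Python sketch (ours) assembles the steps of Algorithm 2.1 for the special case of independent exponential components, using the analytic exponential-family update that will appear as (2.15) in Example 2.4; all parameter values are illustrative defaults:

```python
import numpy as np

def ce_rare_event(perf, u, gamma, N=1000, N1=100_000, rho=0.1, alpha=0.7, seed=0):
    """Algorithm 2.1 for independent exponential components with nominal means u,
    using the analytic update (2.15) in Step 4. Defaults are illustrative."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u, float)
    v = u.copy()                                       # Step 1: v_0 = u
    for _ in range(100):                               # safety cap on iterations
        X = rng.exponential(v, size=(N, len(u)))       # Step 2: draw from f(.; v)
        S = np.array([perf(x) for x in X])
        gamma_t = min(np.quantile(S, 1 - rho), gamma)  # Step 3: level, capped at gamma
        W = np.prod((v / u) * np.exp(X / v - X / u), axis=1)  # likelihood ratios (2.7)
        elite = S >= gamma_t
        WE = W[elite]
        v_new = WE @ X[elite] / WE.sum()               # Step 4: weighted elite means
        v = alpha * v_new + (1 - alpha) * v            # Step 5: smoothing (2.13)
        if gamma_t >= gamma:                           # Step 6: level gamma reached
            break
    X = rng.exponential(v, size=(N1, len(u)))          # Step 7: final IS estimate (2.4)
    W = np.prod((v / u) * np.exp(X / v - X / u), axis=1)
    Z = (np.array([perf(x) for x in X]) >= gamma) * W
    return Z.mean(), v

# e.g., ce_rare_event(shortest_path, u=np.ones(5), gamma=10.0), with
# shortest_path as in the earlier sketch
```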

Remark 2.3 Maximum Likelihood Estimator

Optimization problems of the form (2.12) appear frequently in statistics. In particular, if the likelihood ratio factor $W$ is omitted — which, as will turn out, is what happens in CE optimization — one can write (2.12) also as

$\hat v_t = \operatorname*{argmax}_{v} \prod_{X_k \in \mathcal{E}_t} f(X_k; v),$

where the product is the joint density of the elite samples. Consequently, $\hat v_t$ is chosen such that the joint density of the elite samples is maximized. Viewed as a function of the parameter $v$, rather than of the data $\mathcal{E}_t$, this joint density is called the likelihood. In other words, $\hat v_t$ is the maximum likelihood estimator (it maximizes the likelihood) of $v$ based on the elite samples. When the likelihood ratio factor $W$ is present, the form of the updating formula remains similar.

To provide further insight into Algorithm 2.1, we shall follow it step-by-step in a couple of toy examples.

Example 2.2 Binomial Distribution
Assume that we want to estimate via simulation the probability $\ell = \mathbb{P}(X \ge \gamma)$, where $X \sim \mathsf{Bin}(n, u)$, $u$ is the success probability, and $\gamma \le n$ is the target threshold. Suppose that $\gamma$ is large compared to the mean $n u$, so that

$\ell = \sum_{x = \gamma}^{n} \binom{n}{x} u^x (1-u)^{n-x}$

is a rare-event probability. Assume further that $n$ is fixed and that we want to estimate the optimal reference parameter $p$ in the binomial family $\{\mathsf{Bin}(n, p)\}$ using (2.12).

In iteration $t$, we sample iid $X_1, \ldots, X_N \sim \mathsf{Bin}(n, \hat p_{t-1})$. The right-hand side of (2.12) becomes (omitting a term that does not depend on $p$)

$\max_{p} \sum_{X_k \in \mathcal{E}_t} W_k \big[ X_k \ln p + (n - X_k) \ln (1 - p) \big],$

where

$W_k = \left(\frac{u}{\hat p_{t-1}}\right)^{X_k} \left(\frac{1 - u}{1 - \hat p_{t-1}}\right)^{n - X_k}$

is the likelihood ratio of the $k$-th sample. To find the optimal $p$, we consider the first-order condition and obtain

$\sum_{X_k \in \mathcal{E}_t} W_k \left( \frac{X_k}{p} - \frac{n - X_k}{1 - p} \right) = 0.$
Solving this for $p$ yields

(2.14)  $\hat p_t = \frac{\sum_{X_k \in \mathcal{E}_t} W_k X_k}{n \sum_{X_k \in \mathcal{E}_t} W_k}.$

In other words, $\hat p_t$ is simply the sample fraction of successes among the elite samples, weighted by the likelihood ratios. Note that, without the weights $W_k$, we would obtain the maximum likelihood estimator of $p$ for the $\mathsf{Bin}(n, p)$ distribution based on the elite samples, in accordance with Remark 2.3. Note also that (2.14) can be viewed as a stochastic counterpart of the deterministic updating formula

$p_t = \frac{\mathbb{E}_u\big[ I_{\{X \ge \gamma_t\}}\, X \big]}{n\, \mathbb{E}_u\big[ I_{\{X \ge \gamma_t\}} \big]},$

where $\gamma_t$ is the $(1-\varrho)$-quantile of the $\mathsf{Bin}(n, p_{t-1})$ distribution.
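In code, one CE iteration for this binomial example might look as follows (our sketch; the driver loop at the bottom anticipates the concrete values of the next paragraph):

```python
import numpy as np

def ce_binomial_step(n, u, p_prev, gamma, N, rho, rng):
    """One iteration of Algorithm 2.1 for X ~ Bin(n, p): returns (gamma_t, p_t)."""
    X = rng.binomial(n, p_prev, size=N)
    gamma_t = min(np.quantile(X, 1 - rho), gamma)        # level, capped at gamma
    elite = X >= gamma_t
    # Likelihood ratios of the elite samples: nominal u versus sampling p_prev
    W = (u / p_prev) ** X[elite] * ((1 - u) / (1 - p_prev)) ** (n - X[elite])
    p_t = (W * X[elite]).sum() / (n * W.sum())           # update (2.14)
    return gamma_t, p_t

rng = np.random.default_rng(0)
p, gamma_hat = 1 / 6, 0.0
while gamma_hat < 50:                 # illustrative run with n=100, u=1/6, gamma=50
    gamma_hat, p = ce_binomial_step(100, 1 / 6, p, 50, 1000, 0.05, rng)
    print(gamma_hat, round(p, 4))
```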

Assume for concreteness that $n = 100$, $u = 1/6$, and $\gamma = 50$, which corresponds to a rare-event probability $\ell$ of the order $10^{-14}$. Table 2.1 presents the evolution of $\hat\gamma_t$ and $\hat p_t$, obtained with rarity parameter $\varrho = 0.05$. Note that iteration $t = 0$ corresponds to the original binomial pdf with success probability $u = 1/6$, while iterations $t = 1, \ldots, 5$ correspond to binomial pdfs with success probabilities $\hat p_1, \ldots, \hat p_5$, respectively.
Table 2.1 The evolution of $\hat\gamma_t$ and $\hat p_t$ for $t = 0, 1, \ldots, 5$, using $\varrho = 0.05$

$t$    $\hat\gamma_t$    $\hat p_t$
0      —                 0.1667
1      23                0.2530
2      33                0.3376
3      42                0.4263
4      51                0.5124
5      50                0.5028
The final step of Algorithm 2.1 now invokes the importance sampling estimator (2.4) to estimate $\ell$, using a sample size $N_1$ that is typically larger than $N$. Clearly, the level obtained in the last iteration is reset to $\gamma = 50$.
Figure 2.2 illustrates the iterative procedure. We see that Algorithm 2.1 requires five iterations to reach the final level $\gamma = 50$. Note that until iteration 4 both parameters $\hat\gamma_t$ and $\hat p_t$ increase gradually, each time “blowing up” the tail of the binomial pdf. At iteration five we reset the level to the desired $\gamma = 50$ and estimate the optimal parameter $\hat p_5$.
Figure 2.2 The five iterations of Algorithm 2.1. Iteration $t$ samples from $\mathsf{Bin}(n, \hat p_{t-1})$, estimates the level $\hat\gamma_t$, and estimates the next sampling parameter $\hat p_t$.
Example 2.3 Degeneracy of the Sampling Distribution
When $\gamma$ is the maximum of $S(x)$, no “overshooting” of $\gamma$ in Algorithm 2.1 will occur, and therefore $\hat\gamma_t$ does not need to be reset. In such cases, the sampling pdf may degenerate toward the atomic pdf that has all its mass concentrated at the points $x$ where $S(x)$ is maximal. As an example, let $\gamma = n$ be the target level in the above binomial example. In this case the final sampling distribution is $\mathsf{Bin}(n, 1)$, which corresponds to the degenerate value $p = 1$. Table 2.2 and Figure 2.3 show the evolution of the parameters in the CE algorithm.
Table 2.2 The evolution of $\hat\gamma_t$ and $\hat p_t$ for the degenerate case of Example 2.3
Figure 2.3 Degeneracy of the sampling distribution in Example 2.3.
Example 2.4
Consider again the stochastic shortest path graph of Figure 2.1. Let us take

$u = (1,\ 1,\ 0.3,\ 0.2,\ 0.1)$

as the nominal parameter and estimate the probability $\ell$ that the minimum path length is greater than $\gamma = 6$.

Crude Monte Carlo with $10^8$ samples — a very large simulation effort — gave an estimate $8.01 \cdot 10^{-6}$ with an estimated relative error of 0.035.
To apply Algorithm 2.1 to this problem, we need to establish the updating rule for the reference parameter $v = (v_1, \ldots, v_5)$. Since the components $X_1, \ldots, X_5$ are independent and their distributions form an exponential family parameterized by the means, the solution of the stochastic program (2.12) follows analytically from (2.10) as [107]

(2.15)  $\hat v_{t,i} = \frac{\sum_{X_k \in \mathcal{E}_t} W(X_k; u, \hat v_{t-1})\, X_{ki}}{\sum_{X_k \in \mathcal{E}_t} W(X_k; u, \hat v_{t-1})}, \qquad i = 1, \ldots, 5,$

where $X_{ki}$ denotes the $i$-th component of $X_k$ and $W$ is the likelihood ratio (2.7).
In all our experiments with Algorithm 2.1 we set the rarity parameter to $\varrho = 0.1$, the sample size for Steps 2–4 of the algorithm to $N = 10^3$, and the final sample size to $N_1 = 10^5$. Table 2.3 displays the results of Steps 1–5 of the CE algorithm.
Table 2.3 Convergence of the sequence $\{(\hat\gamma_t, \hat v_t)\}$ in Example 2.4
We see that after five iterations level $\gamma = 6$ is reached. As before, we reset the level to the desired $\gamma$ and estimate the optimal parameter vector $\hat v_5$; its first two components are substantially enlarged relative to the nominal values $u_1 = u_2 = 1$, while the remaining components stay close to their nominal values. Using $\hat v_5$ in the final step with $N_1 = 10^5$ samples, we obtain an estimate of $\ell$ with an estimated RE of 0.035 — the same RE as obtained by crude Monte Carlo with $10^8$ samples. Note that, whereas the standard method required more than an hour of computation time, the CE algorithm finished in only 1 second, using a Matlab implementation on a 1500 MHz computer. We see that with a minimal amount of work we have achieved a dramatic reduction of the simulation effort.

2.3 Cross-Entropy Method for Optimization

Consider the following optimization problem:

(2.16)  $S(x^*) = \gamma^* = \max_{x \in \mathcal{X}} S(x),$

having $\mathcal{X}^* \subset \mathcal{X}$ as the set of optimal solutions.

The problem is called a discrete or continuous optimization problem, depending on whether $\mathcal{X}$ is discrete or continuous. An optimization problem in which some variables are discrete and the others continuous is called a mixed optimization problem. An optimization problem with a finite state space ($|\mathcal{X}| < \infty$) is called a combinatorial optimization problem. The latter is the main focus of this section.

We will show here how the cross-entropy method can be applied to approximate a pdf that is concentrated on $\mathcal{X}^*$. In fact, the iterations of the cross-entropy algorithm generate a sequence of pdfs $f(\cdot; \hat v_0), f(\cdot; \hat v_1), \ldots$ associated with a sequence of decreasing sets converging to $\mathcal{X}^*$.

Suppose that we set the level at the optimal value, $\gamma = \gamma^*$. Then an optimal pdf $g^*$ is characterized by

$\mathrm{supp}(g^*) = \{ x \in \mathcal{X} : S(x) = \gamma^* \} = \mathcal{X}^*.$

This matches exactly the degeneracy of the cross-entropy algorithm for rare-event simulation observed in Example 2.3.

Hence, the starting point is to associate with the optimization problem (2.16) a meaningful estimation problem. To this end, we define a collection of indicator functions $\{ I_{\{S(x) \ge \gamma\}} \}$ on $\mathcal{X}$ for various levels $\gamma \in \mathbb{R}$. Next, let $\{ f(\cdot; v),\ v \in \mathcal{V} \}$ be a family of probability densities on $\mathcal{X}$, parameterized by real-valued parameters $v$. For a fixed $\gamma$ we associate with (2.16) the problem of estimating the rare-event probability

(2.17)  $\ell(\gamma) = \mathbb{P}_v\big(S(X) \ge \gamma\big) = \mathbb{E}_v\big[ I_{\{S(X) \ge \gamma\}} \big].$

We call the estimation problem (2.17) the associated stochastic problem.

It is important to understand that one of the main goals of CE in optimization is to generate a sequence of pdfs $f(\cdot; \hat v_1), f(\cdot; \hat v_2), \ldots$ converging to a pdf that is close or equal to an optimal pdf $g^*$. For this purpose we apply Algorithm 2.1 for rare-event estimation, but without fixing $\gamma$ in advance. It is plausible that, if $f(\cdot; \hat v_T)$ is close to $g^*$, then it assigns most of its probability mass to $\mathcal{X}^*$; thus any $X$ drawn from this distribution can be used as an approximation to an optimal solution $x^*$, and the corresponding function value $S(X)$ as an approximation to the true optimal value $\gamma^*$ in (2.16).

In summary, to solve a combinatorial optimization problem we employ the CE Algorithm 2.1 for rare-event estimation without fixing $\gamma$ in advance. By doing so, the CE algorithm for optimization can be viewed as a modified version of Algorithm 2.1. In particular, in analogy to Algorithm 2.1, we choose a not very small number $\varrho$, say $\varrho = 10^{-2}$, initialize the parameter vector by setting $\hat v_0 = u$, and proceed as follows:

1. Adaptive updating of $\gamma_t$. For a fixed $v_{t-1}$, let $\gamma_t$ be the $(1-\varrho)$-quantile of $S(X)$ under $v_{t-1}$. As before, an estimator $\hat\gamma_t$ of $\gamma_t$ can be obtained by drawing a random sample $X_1, \ldots, X_N$ from $f(\cdot; \hat v_{t-1})$ and then evaluating the sample $(1-\varrho)$-quantile of the performances as

(2.18)  $\hat\gamma_t = S_{(\lceil (1-\varrho) N \rceil)}.$

2. Adaptive updating of $v_t$. For fixed $\gamma_t$ and $v_{t-1}$, derive $v_t$ from the solution of the program

(2.19)  $v_t = \operatorname*{argmax}_{v} \mathbb{E}_{v_{t-1}}\left[ I_{\{S(X) \ge \gamma_t\}}\, \ln f(X; v) \right].$

The stochastic counterpart of (2.19) is as follows: for fixed $\hat\gamma_t$ and $\hat v_{t-1}$, derive $\hat v_t$ from the following program:

(2.20)  $\hat v_t = \operatorname*{argmax}_{v} \frac{1}{N} \sum_{X_k \in \mathcal{E}_t} \ln f(X_k; v),$

where $\mathcal{E}_t$ is again the set of elite samples, that is, the samples $X_k$ for which $S(X_k) \ge \hat\gamma_t$.

Note that, in contrast to the formulas (2.11) and (2.12) for the rare-event setting, formulas (2.19) and (2.20) do not contain the likelihood ratio factors $W$. The reason is that in the rare-event setting the initial (nominal) parameter $u$ is specified in advance and is an essential part of the estimation problem. In contrast, the initial reference vector $\hat v_0$ in the associated stochastic problem is quite arbitrary. In effect, by dropping the likelihood ratio factor, we can efficiently estimate at each iteration $t$ the CE optimal high-dimensional reference parameter vector $v_t$ for the rare-event probability $\mathbb{P}_{v_{t-1}}(S(X) \ge \gamma_t)$, while avoiding the degeneracy properties of the likelihood ratio.

Thus, the main CE optimization algorithm, which includes smoothed updating of the parameter vector $v$ and presents a slight modification of Algorithm 2.1, can be summarized as follows.

Algorithm 2.2 Main CE Algorithm for Optimization
1. Initialize: Choose an initial parameter vector $\hat v_0$. Set $t = 1$ (level counter).
2. Draw: Generate a sample $X_1, \ldots, X_N$ from the density $f(\cdot; \hat v_{t-1})$.
3. Select: Compute the sample $(1-\varrho)$-quantile $\hat\gamma_t$ of the performances according to (2.18).
4. Update: Use the same sample $X_1, \ldots, X_N$ and solve the stochastic program (2.20). Denote the solution by $\tilde v_t$.
5. Smooth: Apply (2.13) to smooth out the vector $\tilde v_t$.
6. Iterate: If the stopping criterion is met, stop; otherwise set $t = t + 1$, and return to Step 2.

Note that when $S(x)$ must be minimized instead of maximized, we simply change the inequalities “$\ge$” to “$\le$” and take the $\varrho$-quantile instead of the $(1-\varrho)$-quantile. Alternatively, one can just maximize $-S(x)$.
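A generic skeleton of Algorithm 2.2 in Python might look as follows (our sketch). The problem-specific ingredients — the sampler for $f(\cdot; v)$ and the solver of (2.20) — are passed in as functions; the stall test mirrors (2.21) and is meaningful mainly for integer-valued performances:

```python
import numpy as np

def ce_optimize(sample, update, score, v0, N=1000, rho=0.1, alpha=0.7,
                d=5, max_iter=200, rng=None):
    """Generic skeleton of Algorithm 2.2. `sample(v, N, rng)` draws N solutions
    from f(.; v); `update(elite)` solves program (2.20); `score(x)` is S(x)."""
    rng = rng or np.random.default_rng()
    v = np.asarray(v0, float)
    levels, best = [], -np.inf
    for _ in range(max_iter):
        X = sample(v, N, rng)                          # Step 2: draw
        S = np.array([score(x) for x in X])
        gamma_t = np.quantile(S, 1 - rho)              # Step 3: level (2.18)
        elite = [x for x, s in zip(X, S) if s >= gamma_t]
        v = alpha * update(elite) + (1 - alpha) * v    # Steps 4-5: (2.20) + (2.13)
        levels.append(gamma_t)
        best = max(best, S.max())                      # track best value, cf. (2.22)
        if len(levels) > d and len(set(levels[-(d + 1):])) == 1:
            break                                      # Step 6: stop via (2.21)
    return v, best
```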

Remark 2.4 Stopping Criteria
As a stopping criterion one can use, for example: if for some $t \ge d$, say $d = 5$,

(2.21)  $\hat\gamma_t = \hat\gamma_{t-1} = \cdots = \hat\gamma_{t-d},$

then stop.

As an alternative estimate for $\gamma^*$, one can consider the overall best performance found up to iteration $T$:

(2.22)  $\hat\gamma_T^{*} = \max_{1 \le t \le T} S_{(N)}^{(t)},$

where $S_{(N)}^{(t)}$ denotes the largest performance value sampled at iteration $t$.

Before one implements the algorithm, one must specify the initial vector $\hat v_0$, the sample size $N$, the stopping parameter $d$, and the rarity parameter $\varrho$. Apart from that, the execution of the algorithm is “self-tuning.”

Note that the estimation Step 7 of Algorithm 2.1 is omitted in Algorithm 2.2 because, in the optimization setting, we are not interested in estimating $\ell$ per se.

Remark 2.5 Class of Parametric Densities
To run the algorithm, one needs to propose a class of parametric sampling densities $\{f(\cdot; v)\}$, the initial vector $\hat v_0$, the sample size $N$, the rarity parameter $\varrho$, and a stopping criterion. Of these, the most challenging is the selection of an appropriate class of parametric sampling densities. Typically there is no unique parametric family, and the selection is guided by the following competing objectives. First, the class has to be flexible enough to include a reasonable parametric approximation to an optimal importance sampling density $g^*$ for the estimation of the associated rare-event probability $\ell(\gamma)$. Second, each density $f(\cdot; v)$ has to be simple enough to allow fast random variable generation. In many cases, these two competing objectives are reconciled by using a standard statistical model for $f(\cdot; v)$, such as the multivariate Bernoulli or Gaussian densities with independent components.

Algorithm 2.2 can, in principle, be applied to any discrete optimization problem. However, for each individual problem, two essential ingredients need to be supplied:

1. We need to specify how the samples are generated; in other words, we need to specify the family of densities $\{f(\cdot; v)\}$.
2. We need to update the parameter vector $v$ on the basis of the cross-entropy minimization program (2.20), which is the same for all optimization problems.

In general, there are many ways to generate samples from $\mathcal{X}$, and it is not always immediately clear which way of generating the sample will yield better results or easier updating formulas.

Remark 2.6 Parameter Selection
The choice of the sample size $N$ and the rarity parameter $\varrho$ depends on the size of the problem and the number of parameters in the associated stochastic problem. Typical choices are $N = c\,n$ or $N = c\,n^2$, and $\varrho = 0.1$ or $\varrho = 0.01$, where $n$ is the number of parameters that need to be estimated/updated and $c$ is a constant between 1 and 10.

Below we present several applications of the CE method to combinatorial optimization problems, namely the multidimensional 0/1 knapsack problem, the Mastermind game, and reinforcement learning in Markov decision problems. We demonstrate numerically the efficiency of the CE method and its fast convergence in several case studies. For additional applications of CE, see [107].

2.3.1 The Multidimensional 0/1 Knapsack Problem

Consider the multidimensional 0/1 knapsack problem (see Section A.1.2):

(2.23)  $\max_{x \in \{0,1\}^n} \sum_{j=1}^{n} p_j x_j \qquad \text{subject to} \qquad \sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, \ldots, m.$

In other words, the goal is to maximize the performance function $S(x) = \sum_{j=1}^{n} p_j x_j$ over the set $\mathcal{X}$ given by the knapsack constraints of (2.23).

Note that each item $j$ is either chosen, $x_j = 1$, or left out, $x_j = 0$. We model this by Bernoulli random variables $X_j \sim \mathsf{Ber}(v_j)$, $j = 1, \ldots, n$. Taking these variables independent, we get the following parameterized set

$f(x; v) = \prod_{j=1}^{n} v_j^{x_j} (1 - v_j)^{1 - x_j}$

of Bernoulli pdfs. The optimal solution $x^*$ of the knapsack problem (2.23) is characterized by the degenerate pdf $f(\cdot; v^*)$ for which all $v_j^* \in \{0, 1\}$; that is,

$v^* = x^*.$

To proceed, consider the stochastic program (2.20). We have

$\sum_{X_k \in \mathcal{E}_t} \ln f(X_k; v) = \sum_{X_k \in \mathcal{E}_t} \sum_{j=1}^{n} \big[ X_{kj} \ln v_j + (1 - X_{kj}) \ln (1 - v_j) \big],$

where, for convenience, we write $X_k = (X_{k1}, \ldots, X_{kn})$. Applying the first-order optimality conditions, we obtain

$\hat v_{t,j} = \frac{\sum_{X_k \in \mathcal{E}_t} X_{kj}}{|\mathcal{E}_t|}$

for $j = 1, \ldots, n$; that is, $\hat v_{t,j}$ is the fraction of elite samples in which item $j$ is chosen.
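Putting the pieces together, here is a sketch (ours) of Algorithm 2.2 for a small 0/1 knapsack instance; infeasible samples are simply scored $-\infty$, one of several possible ways (not specified in the text) to handle the constraints:

```python
import numpy as np

def ce_knapsack(p, A, b, N=1000, rho=0.1, alpha=0.7, iters=50, seed=0):
    """CE for max p.x subject to A x <= b, x in {0,1}^n, Bernoulli sampling."""
    rng = np.random.default_rng(seed)
    v = np.full(len(p), 0.5)                          # initial probabilities 1/2
    for _ in range(iters):
        X = (rng.random((N, len(p))) < v).astype(float)   # N Bernoulli(v) vectors
        feasible = np.all(X @ A.T <= b, axis=1)
        if not feasible.any():
            continue
        S = np.where(feasible, X @ p, -np.inf)        # infeasible samples score -inf
        gamma_t = np.quantile(S[feasible], 1 - rho)   # elite level among feasible
        elite = X[S >= gamma_t]
        v = alpha * elite.mean(axis=0) + (1 - alpha) * v  # update + smoothing (2.13)
    return (v > 0.5).astype(int), v

# Hypothetical small instance (not the OR-Library one referenced below):
p = np.array([9.0, 5.0, 6.0, 4.0])
A = np.array([[6.0, 3.0, 5.0, 2.0],
              [1.0, 2.0, 4.0, 3.0]])
b = np.array([10.0, 7.0])
print(ce_knapsack(p, A, b))
```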

As a concrete example, consider the knapsack problem available at http://people.brunel.ac.uk/∼mastjjb/jeb/orlib/mknapinfo.html from the OR-Library. The instance data consist of the profit vector $p$, the constraint matrix $A = (a_{ij})$, and the right-hand-side vector $b$, all of which are given at the above URL.

The performance of the CE algorithm is illustrated in Table 2.4, using sample size $N$, rarity parameter $\varrho$, and smoothing parameter $\alpha$, with the initial probabilities all set to $1/2$. Observe that the probability vector $\hat v_t$ quickly converges to a degenerate one, corresponding to a solution $\hat x$ whose value coincides with the best known value $\gamma^*$. Hence, the optimal solution has been found.

Table 2.4 The evolution of $\hat\gamma_t$ and $\hat v_t$ for the 0/1 knapsack example

2.3.2 Mastermind Game

In the well-known game Mastermind, the objective is to decipher a hidden “code” of colored pegs through a series of guesses. After each guess, new information on the true code is provided in the form of black and white pegs. A black peg is earned for each peg that is exactly right and a white peg for each peg that is in the solution, but in the wrong position; see Figure 2.4. To get a feel for the game, one can visit www.archimedes-lab.org/mastermind.html and play the game.

Figure 2.4 The Mastermind game.


Consider a Mastermind game with $m$ colors, numbered $1, \ldots, m$, and $n$ pegs (positions). The hidden solution and any guess can be represented by a row vector of length $n$ with entries in $\{1, \ldots, m\}$. For example, for $m = 6$ colors and $n = 5$ pegs, the solution could be $x^* = (3, 1, 4, 2, 2)$, and a possible guess $x = (3, 2, 4, 1, 5)$. Let $\mathcal{X}$ be the space of all possible guesses; clearly $|\mathcal{X}| = m^n$. On $\mathcal{X}$ we define a performance function $S(x)$ that returns for each guess $x$ the “peg score”

(2.24)  $S(x) = 2\, n_b(x) + n_w(x),$

where $n_b(x)$ and $n_w(x)$ are the numbers of black and white pegs returned after the guess $x$. We have assumed, somewhat arbitrarily, that a black peg is worth twice as much as a white peg. As an example, for the solution $x^*$ and guess $x$ above, one gets two black pegs (for the first and third pegs in the guess) and two white pegs (for the second and fourth pegs in the guess, which match the fifth and second pegs of the solution but are in the wrong place). Hence the score for this guess is $2 \cdot 2 + 2 = 6$.

There are some specially designed algorithms that efficiently solve the problem for different numbers of colors and pegs; for examples, see www.mathworks.com/contest/mastermind.cgi/home.html.

Note that, with the above performance function, the problem can be formulated as the optimization problem $\max_{x \in \mathcal{X}} S(x)$, and hence we can apply the CE method to solve it. In order to apply the CE method, we first need to generate random guesses $X$ using an $n \times m$ probability matrix $P = (p_{ij})$. Each element $p_{ij}$ of $P$ describes the probability of choosing the $j$-th color for the $i$-th peg (location). Since only one color may be assigned to each peg, we have

(2.25)  $\sum_{j=1}^{m} p_{ij} = 1, \qquad i = 1, \ldots, n.$

In each stage $t$ of the CE algorithm, we sample independently for each row (that is, each peg) a color using the probability matrix $\hat P_{t-1}$ and calculate the score $S(X)$ according to (2.24). The updating of the elements $\hat p_{t,ij}$ of the probability matrix $\hat P_t$ is performed according to

(2.26)  $\hat p_{t,ij} = \frac{\sum_{X_k \in \mathcal{E}_t} I_{\{X_{ki} = j\}}}{|\mathcal{E}_t|},$

where the samples $X_1, \ldots, X_N$ are drawn independently and identically distributed using the matrix $\hat P_{t-1}$, and $X_{ki}$ denotes the color of the $i$-th peg in the $k$-th sampled guess. Note that in this case (2.26) simply counts the fraction of times among the elites that color $j$ has been assigned to peg $i$.
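The following sketch (ours) implements the sampling and updating steps; the scoring helper and the hidden code are our own minimal stand-ins, with colors represented internally as $0, \ldots, m-1$:

```python
import numpy as np
from collections import Counter

def peg_score(guess, code):
    """Peg score (2.24): twice the number of black pegs plus the white pegs."""
    blacks = sum(g == c for g, c in zip(guess, code))
    matches = sum((Counter(guess) & Counter(code)).values())  # blacks + whites
    return 2 * blacks + (matches - blacks)

def ce_mastermind(code, m, N=1000, rho=0.03, alpha=0.7, iters=100, seed=0):
    """Algorithm 2.2 with the row-stochastic matrix P of (2.25), update (2.26)."""
    n, rng = len(code), np.random.default_rng(seed)
    P = np.full((n, m), 1.0 / m)                    # uniform start; rows sum to 1
    for _ in range(iters):
        X = np.array([[rng.choice(m, p=P[i]) for i in range(n)] for _ in range(N)])
        S = np.array([peg_score(x, code) for x in X])
        gamma_t = np.quantile(S, 1 - rho)
        elite = X[S >= gamma_t]
        # (2.26): fraction of elites assigning color j to peg i, then smooth (2.13)
        P_new = np.stack([np.bincount(elite[:, i], minlength=m) / len(elite)
                          for i in range(n)])
        P = alpha * P_new + (1 - alpha) * P
        if S.max() == 2 * n:                        # all pegs black: code deciphered
            break
    return P

hidden = [2, 4, 4, 1, 3]                            # hypothetical code, m = 6 colors
print(ce_mastermind(hidden, m=6).round(2))
```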

2.3.2.1 Simulation Results

Table 2.5 shows the dynamics of the CE algorithm for the Mastermind test problem denoted at www.mathworks.com/contest/mastermind.cgi/home.html as “problem (5.2.3),” using a probability matrix $P$ of size $n \times m$ and fixed CE parameters $N$, $\varrho$, and $\alpha$. It took 34 seconds of CPU time to find the true solution.

Table 2.5 Dynamics of Algorithm 2.2 for the Mastermind problem with $m$ colors and $n$ pegs. In each iteration, $S_{\max}$ denotes the maximal sampled score

$t$    $S_{\max}$    $\hat\gamma_t$
1      31            26
2      34            28
3      38            32
4      42            35
5      48            40
6      55            45
7      60            51
8      64            57
9      70            63
10     72            68
11     72            72

2.3.3 Markov Decision Process and Reinforcement Learning

The Markov decision process (MDP) model is standard in machine learning, operations research, and related fields. The application to reinforcement learning in this section is based on [83]. We start by reviewing briefly some of the basic definitions and concepts in MDP. For details see, for example, [95].

An MDP is defined by a tuple $(\mathcal{Z}, \mathcal{A}, P, r)$ where

1. $\mathcal{Z} = \{z_1, \ldots, z_n\}$ is a finite set of states.
2. $\mathcal{A} = \{a_1, \ldots, a_m\}$ is the set of possible actions by the decision maker. We assume it is the same for every state, to ease notation.
3. $P$ is the transition probability matrix with elements $P(z' \mid z, a)$, the transition probability from state $z$ to state $z'$ when action $a$ is chosen.
4. $r(z, a)$ is the immediate reward for performing action $a$ in state $z$ ($r(z, a)$ may be a random variable).

At each time instance $t$, the decision maker observes the current state $z_t$ and determines the action to be taken, say $a_t$. As a result, a reward $r(z_t, a_t)$ is received immediately, and a new state $z_{t+1}$ is chosen according to $P(z_{t+1} \mid z_t, a_t)$. A policy or strategy $\pi$ is a rule that determines, for each history $h_t = (z_0, a_0, \ldots, z_{t-1}, a_{t-1}, z_t)$ of states and actions, the probability distribution of the decision maker's actions at time $t$. A policy is called Markov if each action depends only on the current state $z_t$. A Markov policy is called stationary if it does not depend on the time $t$. If the decision rule of a stationary policy is nonrandomized, we say that the policy is deterministic.

In the sequel, we consider problems for which an optimal policy belongs to the class of stationary Markov policies. Associated with such an MDP are the stochastic processes of consecutive observed states $z_0, z_1, \ldots$ and actions $a_0, a_1, \ldots$. For a stationary policy we may write $\pi(z)$ for the action taken in state $z$, which is a random variable on the action set $\mathcal{A}$; in the case of a deterministic policy it degenerates to a single action. Finally, we denote the immediate reward in state $z$ under this policy by $r(z, \pi(z))$.

The goal of the decision maker is to maximize a certain reward function. The following are standard reward criteria:

1. Finite horizon reward. This applies when there exists a finite time $\tau$ (random or deterministic) at which the process terminates. The objective is to maximize the total reward

(2.27)  $V(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\tau} r(z_t, a_t) \right].$

Here, $\mathbb{E}_\pi$ denotes the expectation with respect to the probability measure induced by the strategy $\pi$.
2. Infinite horizon discounted reward. The objective is to find a strategy $\pi$ that maximizes

(2.28)  $V(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \beta^t\, r(z_t, a_t) \right],$

where $\beta \in (0, 1)$ is the discount factor.
3. Average reward. The objective is to maximize

$V(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\left[ \sum_{t=0}^{T-1} r(z_t, a_t) \right].$

Suppose that the MDP is identified; that is, the transition probabilities and the immediate rewards are specified and known. Then it is well known that there exists a deterministic stationary optimal strategy for the above three cases. Moreover, there are several efficient methods for finding it, such as value iteration and policy iteration [95].

In the next section, we will restrict our attention to the stochastic shortest path MDP, where it is assumed that the process starts from a specific initial state $z^{\text{start}}$ and terminates in an absorbing state $z^{\text{fin}}$ with zero reward. For the finite horizon reward criterion, $\tau$ is the time at which $z^{\text{fin}}$ is reached (which we will assume always happens eventually). It is well known that for the shortest path MDP there exists a stationary Markov policy that maximizes $V(\pi)$ for each of the reward functions above.

However, when the transition probabilities or the rewards of an MDP are unknown, the problem is much more difficult and is referred to as a learning problem. A well-known framework for learning algorithms is reinforcement learning (RL), where an agent learns the behavior of the system through trial and error with an unknown dynamic environment; see [61].

There are several approaches to RL, which can be roughly divided into the following three classes: model-based, model-free, and policy search. In the model-based approach, a model of the environment is first constructed. The estimated MDP is then solved using standard tools of dynamic programming (see [66]).

In the model-free approach, one learns a utility function instead of learning the model. The optimal policy is to choose at each state an action that maximizes the expected utility. The popular Q-learning algorithm (e.g., [125]) is an example of this approach.

In the policy search approach, a subspace of the policy space is searched, and the performance of policies is evaluated based on their empirical performance (e.g., [7]). An example of a gradient-based policy search method is the REINFORCE algorithm of [127]. For a direct search in the policy space approach see [98]. The CE algorithm in the next section belongs to the policy search approach.

Remark 2.7 Stochastic Approximation
Many RL algorithms are based on the classic stochastic approximation (SA) algorithm. To explain SA, assume that we need to find the unique solution $\theta^*$ of a nonlinear equation $\mathbb{E}[Y(\theta)] = 0$, where $Y(\theta)$ is a random variable with a pdf parameterized by $\theta$, such that its expectation is not available in closed form; however, assume that we can simulate an outcome of $Y(\theta)$ for any parameter $\theta$. The SA algorithm for estimating $\theta^*$ is

$\theta_{t+1} = \theta_t + \beta_t\, Y(\theta_t),$

where $\{\beta_t\}$ is a positive sequence satisfying

(2.29)  $\sum_{t=0}^{\infty} \beta_t = \infty, \qquad \sum_{t=0}^{\infty} \beta_t^2 < \infty.$

The connection between SA and Q-learning is given by [117]. This work has made an important impact on the entire field of RL. Even if $\beta_t$ remains bounded away from 0 (and thus convergence is not guaranteed), it is still required that $\beta_t$ be small in order to ensure convergence to a reasonable neighboring solution [10].
Our main goal here is to introduce a fast learning algorithm based on the CE method instead of the slow SA algorithms and to demonstrate its high efficiency.

2.3.3.1 Policy Learning via the CE Method

We consider a CE learning algorithm for the shortest path MDP, where it is assumed that the process starts from a specific initial state $z^{\text{start}}$ and that there is an absorbing state $z^{\text{fin}}$ with zero reward. The objective is given in (2.27), with $\tau$ being the stopping time at which $z^{\text{fin}}$ is reached (which we will assume always happens).

To put this problem in the CE framework, consider the maximization problem (2.27). Recall that for the shortest path MDP an optimal stationary strategy exists. We can represent each deterministic stationary strategy as a vector $\pi = (\pi(z_1), \ldots, \pi(z_n))$, with $\pi(z)$ being the action taken when visiting state $z$. Writing the expectation in (2.27) as

(2.30)  $S(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\tau} r(z_t, a_t) \right],$

we see that the optimization problem (2.27) is of the form $\max_\pi S(\pi)$. We shall also consider the case where $S(\pi)$ is measured (observed) with some noise, in which case we have a noisy optimization problem.

The idea now is to combine the random policy generation and the random trajectory generation in the following way. At each stage of the CE algorithm we generate random policies and random trajectories using an auxiliary $n \times m$ matrix $P = (p_{za})$, such that for each state $z$ we choose action $a$ with probability $p_{za}$. Once this “policy matrix” $P$ is defined, each iteration of the CE algorithm comprises the following two standard phases:

1. Generation of $N$ random trajectories $(z_0, a_0, z_1, a_1, \ldots, z_\tau)$ using the auxiliary policy matrix $P$. The performance of each trajectory is computed via

(2.31)  $S(X) = \sum_{t=0}^{\tau} r(z_t, a_t).$

2. Updating of the parameters of the policy matrix ($p_{za}$) on the basis of the data collected in the first phase.

The matrix $P$ is typically initialized to the uniform matrix ($p_{za} = 1/m$ for all $z$ and $a$). We describe both the trajectory generation and the updating procedure in more detail below. We shall show that, in calculating the associated sample performance, one can take into account the Markovian nature of the problem and speed up the Monte Carlo process.

2.3.3.2 Generating Random Trajectories

Generating random trajectories for an MDP is straightforward and is given here for convenience only. All one has to do is to start the trajectory from the initial state $z^{\text{start}}$ and follow the trajectory by generating, at each new state, an action according to the probability distribution given by the corresponding row of $P$, until the absorbing state $z^{\text{fin}}$ is reached at, say, time $\tau$.

Algorithm 2.3 Trajectory Generation for MDP
Input: an $n \times m$ auxiliary policy matrix $P$.
1. Start from the given initial state $z_0 = z^{\text{start}}$; set $t = 0$.
2. Generate an action $a_t$ according to the $z_t$-th row of $P$, collect the reward $r(z_t, a_t)$, and generate a new state $z_{t+1}$ according to $P(\cdot \mid z_t, a_t)$. Set $t = t + 1$. Repeat until $z_t = z^{\text{fin}}$.
3. Output the total reward of the trajectory, given by (2.31).
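A minimal Python version of Algorithm 2.3 (ours) could look as follows; the environment `step` function, returning a reward and a next state, is an assumed interface rather than anything prescribed by the text:

```python
import numpy as np

def generate_trajectory(policy_matrix, step, z_start, z_fin, rng, max_len=10_000):
    """Algorithm 2.3: roll out one trajectory under a random stationary policy.
    `step(z, a, rng)` is the environment: it returns (reward, next_state).
    States and actions are integer indices."""
    z, traj, total = z_start, [], 0.0
    for _ in range(max_len):
        if z == z_fin:
            break
        a = rng.choice(policy_matrix.shape[1], p=policy_matrix[z])  # sample action
        reward, z_next = step(z, a, rng)
        traj.append((z, a, reward))
        total += reward                                             # accumulate (2.31)
        z = z_next
    return traj, total
```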

2.3.3.3 Updating Rules

Let the $N$ trajectories $X_1, \ldots, X_N$, their scores $S(X_1), \ldots, S(X_N)$, and the associated sample $(1-\varrho)$-quantile $\hat\gamma_t$ of their order statistics be given. Then one can update the parameter matrix $\hat P_t$ using the CE method, namely as per

(2.32)  $\hat p_{t,za} = \frac{\sum_{k=1}^{N} I_{\{S(X_k) \ge \hat\gamma_t\}}\, I_{\{X_k \in \mathcal{X}_{za}\}}}{\sum_{k=1}^{N} I_{\{S(X_k) \ge \hat\gamma_t\}}\, I_{\{X_k \in \mathcal{X}_{z}\}}},$

where the event $\{X_k \in \mathcal{X}_z\}$ means that the trajectory $X_k$ contains a visit to state $z$, and the event $\{X_k \in \mathcal{X}_{za}\}$ means that the trajectory contains a visit to state $z$ in which action $a$ was taken.

We now explain how to take advantage of the Markovian nature of the problem. Think of a maze in which a certain trajectory starts badly — the path is inefficient at the beginning — but after some time starts moving quickly toward the goal. According to (2.32), the updates are performed in the same manner for every state in the trajectory. However, the actions taken in the states sampled near the target were successful, so one would like to “encourage” these actions. Using the Markovian property, one can substantially improve the above algorithm by considering, for each state, the part of the reward from the visit to that state onward. We therefore use the same trajectory and simultaneously calculate the performance for every state in the trajectory separately. The idea is that each choice of action in a given state affects the reward from that point on, regardless of the past.

The sampling Algorithm 2.3 does not change in Steps 1 and 2; the difference is in Step 3. Given a policy $\pi$ and a trajectory $(z_0, a_0, z_1, a_1, \ldots, z_\tau)$, we calculate the performance from every state until termination: for every state $z_t$ in the trajectory, the (estimated) performance from $z_t$ onward is $S_{z_t}(\pi) = \sum_{s=t}^{\tau} r(z_s, a_s)$. The updating formula for $\hat p_{za}$ is similar to (2.32); however, each state $z$ is updated separately, according to the (estimated) performance obtained from state $z$ onward:

(2.33)  $\hat p_{t,za} = \frac{\sum_{k=1}^{N} I_{\{S_z(X_k) \ge \hat\gamma_{t,z}\}}\, I_{\{X_k \in \mathcal{X}_{za}\}}}{\sum_{k=1}^{N} I_{\{S_z(X_k) \ge \hat\gamma_{t,z}\}}\, I_{\{X_k \in \mathcal{X}_{z}\}}}.$

It is important to understand that, in contrast to (2.32), the CE optimization is carried out for every state separately, and a different threshold parameter $\hat\gamma_{t,z}$ is used for every state $z$ at iteration $t$. This facilitates faster convergence for “easy” states where the optimal strategy is easy to find. The above trajectory sampling method can be viewed as a variance reduction method. Numerical results indicate that the CE algorithm with updating (2.33) is much faster than that with updating (2.32).
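The per-state update (2.33) can be sketched as follows (ours). For simplicity this version counts state visits rather than whole trajectories, a minor simplification of (2.33):

```python
import numpy as np

def update_policy(trajs, n_states, n_actions, rho, alpha, P):
    """Per-state CE update in the spirit of (2.33). `trajs` is a list of
    [(z, a, reward), ...] lists, e.g. produced by generate_trajectory above."""
    # Collect, for every state, (reward-to-go, action) pairs over all trajectories
    visits = [[] for _ in range(n_states)]
    for traj in trajs:
        rewards = [r for (_, _, r) in traj]
        togo = np.cumsum(rewards[::-1])[::-1]          # reward from each step onward
        for (z, a, _), s in zip(traj, togo):
            visits[z].append((s, a))
    P_new = P.copy()
    for z, pairs in enumerate(visits):
        if not pairs:
            continue                                   # state never visited: keep row
        scores = np.array([s for s, _ in pairs])
        gamma_z = np.quantile(scores, 1 - rho)         # state-specific threshold
        elite_actions = [a for s, a in pairs if s >= gamma_z]
        counts = np.bincount(elite_actions, minlength=n_actions)
        P_new[z] = alpha * counts / counts.sum() + (1 - alpha) * P[z]   # smooth (2.13)
    return P_new
```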

2.3.3.4 Numerical Results with the Maze Problem

The CE Algorithm with the updating rule (2.33) and trajectory generation according to Algorithm 2.3 was tested for a stochastic shortest path MDP and, in particular, for a maze problem, which presents a two-dimensional grid world. The agent moves on a grid in four possible directions. The goal of the agent is to move from the upper-left corner (the starting state) to the lower-right corner (the goal) in the shortest path possible. The maze contains obstacles (walls) into which movement is not allowed. These are modeled as forbidden moves for the agent.

We assume the following:

1. The moves in the grid are allowed in four possible directions with the goal to move from the upper-left corner to the lower-right corner.
2. The maze contains obstacles (“walls”) into which movement is not allowed.
3. The costs of allowed moves are random variables uniformly distributed between 0.5 and 1.5, and the costs of forbidden moves are random variables uniformly distributed between 25 and 75. Thus, the expected costs are 1 and 50 for the allowed and forbidden moves, respectively.

In addition, we introduce the following:

  • A small (failure) probability not to succeed moving in an allowed direction
  • A small probability of succeeding moving in the forbidden direction (“moving through the wall”)

In Figure 2.5 we present the results for a maze. The algorithm parameters (sample size $N$, rarity parameter $\varrho$, and smoothing parameter $\alpha$) were held fixed throughout the run.

Figure 2.5 Performance of the CE algorithm for the maze problem. Each arrow indicates a probability $p_{za}$ of going in that direction.

The initial policy was the uniformly random one. The success probabilities for the allowed and forbidden moves were taken to be 0.95 and 0.05, respectively. The arrows in Figure 2.5 indicate that at the current iteration the probability of going from $z$ to $z'$ is at least 0.01. In other words, if $a$ corresponds to the action that leads from state $z$ to state $z'$, then an arrow is plotted from $z$ to $z'$ provided $\hat p_{za} \ge 0.01$.

In all our experiments, CE found the target exactly within 5–10 iterations, and the CPU time was less than one minute (on a 500 MHz Pentium processor). Note that the successive iterations of the policy in Figure 2.5 quickly converge to the optimal policy. We have also run the algorithm for several other mazes, and the optimal policy was always found.

2.4 Continuous Optimization

When the state space is continuous — in particular, when $\mathcal{X} = \mathbb{R}^n$ — the optimization problem (2.16) is often referred to as a continuous optimization problem. The CE sampling distribution on $\mathbb{R}^n$ can be quite arbitrary and does not need to be related to the function that is being optimized. The generation of a random vector $X = (X_1, \ldots, X_n)$ in Step 2 of Algorithm 2.2 is most easily performed by drawing the $n$ coordinates independently from some two-parameter distribution. In most applications a normal (Gaussian) distribution is employed for each component. Thus, the sampling density is characterized by a vector of means $\mu$ and a vector of variances $\sigma^2$, and we may write $X \sim \mathsf{N}(\mu, \sigma^2)$ with independent components. The choice of the normal distribution is motivated by the availability of fast normal random number generators in modern statistical software and by the fact that cross-entropy minimization yields a very simple solution: at each iteration of the CE algorithm, the parameter vectors $\mu$ and $\sigma^2$ are updated to the sample means and sample variances of the elements of the set of $N^e$ best-performing vectors (that is, the elite set); see, for example, [71]. In summary, the CE method for continuous optimization with a Gaussian sampling density is as follows.

Algorithm 2.4 CE for Continuous Optimization: Normal Updating
1. Initialize: Choose $\hat\mu_0$ and $\hat\sigma_0^2$. Set $t = 1$.
2. Draw: Generate a random sample $X_1, \ldots, X_N$ from the $\mathsf{N}(\hat\mu_{t-1}, \hat\sigma_{t-1}^2)$ distribution.
3. Select: Let $\mathcal{I}$ be the indices of the $N^e$ best performing (elite) samples.
4. Update: For all $j = 1, \ldots, n$, let

(2.34)  $\tilde\mu_{t,j} = \frac{1}{N^e} \sum_{k \in \mathcal{I}} X_{kj}$

(2.35)  $\tilde\sigma_{t,j}^2 = \frac{1}{N^e} \sum_{k \in \mathcal{I}} \big( X_{kj} - \tilde\mu_{t,j} \big)^2.$

5. Smooth:

(2.36)  $\hat\mu_t = \alpha\, \tilde\mu_t + (1 - \alpha)\, \hat\mu_{t-1}, \qquad \hat\sigma_t = \alpha\, \tilde\sigma_t + (1 - \alpha)\, \hat\sigma_{t-1}.$

6. Iterate: Stop if $\max_j \hat\sigma_{t,j} < \varepsilon$, and return $\hat\mu_t$ (or the overall best solution generated by the algorithm) as the approximate solution to the optimization problem. Otherwise, increase $t$ by 1 and return to Step 2.

Smoothing, as in Step 5, is often crucial to prevent premature shrinking of the sampling distribution. Instead of using a single smoothing factor, it is often useful to use separate smoothing factors for $\hat\mu_t$ and $\hat\sigma_t$ (see (2.36)). In particular, it is suggested in [108] to use the following dynamic smoothing for $\hat\sigma_t$:

(2.37)  $\alpha_t = \beta - \beta \left( 1 - \frac{1}{t} \right)^{q},$

where $q$ is an integer (typically between 5 and 10) and $\beta$ is a smoothing constant (typically between 0.8 and 0.99). Finally, a significant speed-up can be achieved by using a parallel implementation of CE [46].

Another approach is to inject extra variance into the sampling distribution, for example by increasing the components of $\hat\sigma_t$, once the distribution has degenerated; see the examples below and [11].

Example 2.5 The Peaks Function
MATLAB's peaks function,

$S(x) = 3 (1 - x_1)^2 e^{-x_1^2 - (x_2 + 1)^2} - 10 \left( \frac{x_1}{5} - x_1^3 - x_2^5 \right) e^{-x_1^2 - x_2^2} - \frac{1}{3} e^{-(x_1 + 1)^2 - x_2^2},$

has various local maxima. In this case, implementing the CE Algorithm 2.4, we found that the global maximum of this function is approximately $\gamma^* \approx 8.1$ and is attained at $x^* \approx (-0.0093,\ 1.5814)$. The choice of the initial value for $\mu$ is not important, but the initial standard deviations should be chosen large enough to ensure an initially “uniform” sampling of the region of interest. The CE algorithm is stopped when all standard deviations of the sampling distribution are less than some small $\varepsilon$.

Figure 2.6 gives the evolution of the worst and the best of the elite samples, that is, $\hat\gamma_t$ and $S_{(N)}^{(t)}$, for each iteration $t$. We see that the values quickly converge to the optimal value $\gamma^*$.
Figure 2.6 Evolution of the CE algorithm for the peaks function.
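A compact implementation sketch (ours) of Algorithm 2.4 applied to the peaks function, using a constant smoothing parameter instead of the dynamic rule (2.37):

```python
import numpy as np

def peaks(x1, x2):
    """MATLAB-style peaks function used as the objective."""
    return (3 * (1 - x1)**2 * np.exp(-x1**2 - (x2 + 1)**2)
            - 10 * (x1 / 5 - x1**3 - x2**5) * np.exp(-x1**2 - x2**2)
            - np.exp(-(x1 + 1)**2 - x2**2) / 3)

def ce_continuous(score, mu, sigma, N=100, Ne=10, alpha=0.8, eps=1e-5, rng=None):
    """Algorithm 2.4: normal updating with constant smoothing."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    while sigma.max() >= eps:                          # Step 6: stopping rule
        X = rng.normal(mu, sigma, size=(N, len(mu)))   # Step 2: draw
        S = score(X[:, 0], X[:, 1])
        elite = X[np.argsort(S)[-Ne:]]                 # Step 3: Ne best samples
        mu = alpha * elite.mean(axis=0) + (1 - alpha) * mu        # (2.34), (2.36)
        sigma = alpha * elite.std(axis=0) + (1 - alpha) * sigma   # (2.35), (2.36)
    return mu, score(mu[0], mu[1])

print(ce_continuous(peaks, mu=[0.0, 0.0], sigma=[3.0, 3.0]))
```

With large initial standard deviations, as recommended above, the run reliably reaches the global maximizer rather than one of the lower local maxima.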
Remark 2.8 Injection
When using the CE method to solve practical optimization problems with many constraints and many local optima, it is sometimes necessary to prevent the sampling distribution from shrinking too quickly. A simple but effective approach is the following injection method [107]. Let $S_t^*$ denote the best performance found at the $t$-th iteration and (in the normal case) $\sigma_t^*$ the largest standard deviation at the $t$-th iteration. If $\sigma_t^*$ is sufficiently small and $|S_t^* - S_{t-1}^*|$ is also small, then add some small value to each standard deviation, for example a constant $\delta$ or the value $c\,|S_t^* - S_{t-1}^*|$, for some fixed $\delta$ and $c$. When using CE with injection, a possible stopping criterion is to stop when a certain number of injections is reached.

2.5 Noisy Optimization

One of the distinguishing features of the CE Algorithm 2.2 is that it can easily handle noisy optimization problems, that is, problems in which the objective function $S(x)$ is corrupted with noise. We denote such a noisy function by $\hat S(x)$ and assume that for each $x$ we can readily obtain an outcome of $\hat S(x)$, for example via generation of some additional random vector $Y$, whose distribution may depend on $x$.

A classical example of noisy optimization is simulation-based optimization [107]. A typical instance is the buffer allocation problem (BAP). The objective in the BAP is to allocate $m$ buffer spaces among the $n - 1$ “niches” (storage areas) between $n$ machines in a serial production line, so as to optimize some performance measure, such as the steady-state throughput. This performance measure is typically not available analytically and thus must be estimated via simulation. A detailed description of the BAP, with an application of CE, is given in [107].

Another example is the noisy traveling salesman problem (TSP), where, say, the cost matrix $C = (C_{ij})$, denoted now by $\hat C = (\hat C_{ij})$, is random. Think of $\hat C_{ij}$ as the “random” time to travel from city $i$ to city $j$.

The total cost of a tour $x = (x_1, \ldots, x_n)$ is given by

(2.38)  $\hat S(x) = \sum_{i=1}^{n-1} \hat C_{x_i, x_{i+1}} + \hat C_{x_n, x_1}.$

We assume that $\mathbb{E}[\hat C_{ij}] = C_{ij}$, so that $\mathbb{E}[\hat S(x)] = S(x)$, the deterministic tour cost.
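In code, an outcome of the noisy tour cost (2.38) might be generated as follows (our sketch, using the uniform noise model of Example 2.6 below):

```python
import numpy as np

def noisy_tour_cost(tour, C, rng):
    """Noisy total cost (2.38): each leg's cost is drawn uniformly on
    (0, 2*C[i, j]), so its expectation equals the deterministic cost C[i, j]."""
    legs = list(zip(tour, np.roll(tour, -1)))    # consecutive cities, wrapping around
    return sum(rng.uniform(0.0, 2.0 * C[i, j]) for i, j in legs)
```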

The main CE optimization Algorithm 2.2 for deterministic functions $S(x)$ is also valid for noisy ones $\hat S(x)$. Extensive numerical studies [107] with the noisy version of Algorithm 2.2 show that it works well, because during the course of the optimization it efficiently filters out the noise component of $\hat S(x)$.

We applied our CE Algorithm 2.2 to both deterministic and noisy TSPs. We selected a number of benchmark problems from the TSP library (www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp/). In all cases, the same set of CE parameters was chosen (sample size $N$, rarity parameter $\varrho$, and smoothing parameter $\alpha$). Table 2.6 presents the performance of the deterministic version of Algorithm 2.2 for several TSPs from this library. To study the variability in the solutions, each problem was repeated 10 times. In Table 2.6, “min,” “mean,” and “max” denote the smallest (that is, best), average, and largest of the 10 estimates of the optimal value. The best known solution is denoted by $\gamma^*$.

Table 2.6 Case studies for the TSP

The average CPU time in seconds and the average number of iterations are given in the last two columns. The size of the problem (number of nodes) is indicated in its name; for example, st70 has $n = 70$ nodes.

Example 2.6 Noisy TSP
Suppose that in the first test case of Table 2.6, burma14, some uniform noise is added to the cost matrix. In particular, suppose that the cost of traveling from $i$ to $j$ is given by $\hat C_{ij} \sim \mathsf{U}(0, 2 c_{ij})$, where $c_{ij}$ is the cost for the deterministic case. The expected cost is thus $\mathbb{E}[\hat C_{ij}] = c_{ij}$, and the total cost $\hat S(x)$ of a tour $x$ is given by (2.38). The CE algorithm for optimizing the unknown $S(x)$ remains exactly the same as in the deterministic case, except that $S(x)$ is replaced with $\hat S(x)$ and a different stopping criterion than (2.21) needs to be employed. A simple rule is to stop when the transition probabilities $\hat p_{t,ij}$ satisfy $\min\{\hat p_{t,ij},\ 1 - \hat p_{t,ij}\} < \varepsilon$ for all $i$ and $j$. An alternative stopping rule, taken from [107], is as follows.
2.5.1 Stopping Criterion
To identify the stopping time $T$, denote by $\hat\gamma_t$ and $\tilde\gamma_t$ the worst of the elite samples (the sample $(1-\varrho)$-quantiles) for the deterministic and noisy case, respectively, and consider the following moving-average process:

(2.39)  $B_t(s) = \frac{1}{s} \sum_{r = t - s + 1}^{t} \tilde\gamma_r, \qquad t = s, s+1, \ldots,$

where $s$ is fixed, say $s = 10$. Define next

(2.40)  $B_t^{-}(s, r) = \min_{t - r + 1 \le j \le t} B_j(s)$

and

(2.41)  $B_t^{+}(s, r) = \max_{t - r + 1 \le j \le t} B_j(s),$

respectively, where $r$ is fixed, say $r = 5$. Clearly, for growing $t$ and moderate $s$ and $r$, we may expect that $B_t^{+}(s, r) - B_t^{-}(s, r) \approx 0$, provided the variance of $\tilde\gamma_t$ is bounded.

With this at hand, we define the stopping criterion as

(2.42)  $T = \min\left\{ t \ge s + r : \frac{B_t^{+}(s, r) - B_t^{-}(s, r)}{B_t^{-}(s, r)} \le \varepsilon \right\},$

where $s$ and $r$ are fixed and $\varepsilon$ is a small number, say $\varepsilon = 0.01$.
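A sketch (ours) of this test in Python; it maintains the moving averages of (2.39) over the sequence of observed worst-elite levels and checks the relative fluctuation as in (2.42):

```python
import numpy as np

def stalled(levels, s=10, r=5, eps=0.01):
    """Moving-average stopping test in the spirit of (2.39)-(2.42). `levels`
    holds the worst-elite values, one per iteration, observed so far."""
    if len(levels) < s + r:
        return False
    B = [np.mean(levels[t - s + 1: t + 1]) for t in range(s - 1, len(levels))]
    window = B[-r:]                                   # the last r moving averages
    return (max(window) - min(window)) / abs(min(window)) <= eps
```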

Figure 2.7 displays the evolution of the worst of the elite samples (the sample $(1-\varrho)$-quantiles) for both the deterministic and the noisy case, denoted by $\hat\gamma_t$ and $\tilde\gamma_t$, respectively. We see in both cases a similar rapid drop in the level. It is important to note, however, that even though the algorithm converges to the optimal solution in both cases, the $\tilde\gamma_t$ for the noisy case do not converge to $\gamma^*$, in contrast to the $\hat\gamma_t$ for the deterministic case. This is because the latter eventually estimates the $(1-\varrho)$-quantile of the deterministic $S(X)$, whereas the former estimates the $(1-\varrho)$-quantile of $\hat S(X)$, which remains random. To estimate $\gamma^*$ in the noisy case, one needs to take the sample average of repeated outcomes of $\hat S(x_T)$, where $x_T$ is the solution found at the final iteration.
Figure 2.7 Evolution of the worst of the elite samples for a deterministic and noisy TSP.