Chapter 2

Cross-Entropy Method

2.1 Introduction

The cross-entropy (CE) method is a powerful technique for solving difficult estimation and optimization problems, based on Kullback-Leibler (or cross-entropy) minimization [15]. It was pioneered by Rubinstein in 1999 [100] as an adaptive importance sampling procedure for the estimation of rare-event probabilities. Subsequent work [101, 102] has shown that many optimization problems can be translated into rare-event estimation problems, so that the CE method can be used as a randomized algorithm for optimization. The gist of the idea is that the probability of locating an optimal or near-optimal solution using naive random search is a rare-event probability. The cross-entropy method gradually changes the sampling distribution of the random search so that this rare event becomes more likely to occur. For this purpose, using the cross-entropy or Kullback-Leibler divergence as a measure of closeness between two sampling distributions, the method estimates a sequence of parametric sampling distributions that converges to a distribution with probability mass concentrated in a region of near-optimal solutions.

To date, the CE method has been successfully applied to optimization and estimation problems. The former includes mixed-integer nonlinear programming [69], continuous optimal control problems [111, 112], continuous multiextremal optimization [71], multidimensional independent component analysis [116], optimal policy search [19], clustering [11, 17, 72], signal detection [80], DNA sequence alignment [67, 93], fiber design [27], noisy optimization problems such as optimal buffer allocation [2], resource allocation in stochastic systems [31], network reliability optimization [70], vehicle routing optimization with stochastic demands [29], power system combinatorial optimization problems [45], and neural and reinforcement learning [81, 86, 118, 128]. CE has even been used as a main engine for playing games such as Tetris, Go, and backgammon [26].

In the estimation setting, the CE method can be viewed as an adaptive importance sampling procedure. In the optimization setting, the optimization problem is first translated into a rare-event estimation problem, and then the CE method for estimation is used as an adaptive algorithm to locate the optimum.

The CE method is based on a simple iterative procedure where each iteration contains two phases: (a) generate a random data sample (trajectories, vectors, etc.) according to a specified mechanism; (b) update the parameters of the random mechanism on the basis of the data, in order to produce a “better” sample in the next iteration.

Among its two uses, optimization and estimation, CE is best known for optimization, where it is able to handle problems with sizes up to the thousands within reasonable time limits. For estimation, and also for counting problems, multilevel CE is useful only for problems of small and moderate size. The main reason is that, unlike in optimization, in estimation multilevel CE involves likelihood ratio factors for all variables. The well-known degeneracy of likelihood ratio estimators [106] causes CE to perform poorly in high dimensions. To overcome this problem, [25] proposes a variation of the CE method that uses a single iteration to determine the importance sampling density. For this approach to work, one must be able to generate samples from the optimal (zero-variance) density.

A tutorial on the CE method is given in [37]. A comprehensive treatment can be found in [107]. The CE method website (www.cemethod.org) lists several hundred publications.

The rest of this chapter is organized as follows. Section 2.2 presents a general CE algorithm for the estimation of rare event probabilities, while Section 2.3 introduces a slight modification of this algorithm for solving combinatorial optimization problems. We consider applications of the CE method to the multidimensional knapsack problem, to the Mastermind game, and to Markov decision problems. Also, we provide supportive numerical results on the performance of the algorithm for these problems. Sections 2.4 and 2.5 discuss advanced versions of CE to deal with continuous and noisy optimization, respectively.

2.2 Estimation of Rare-Event Probabilities

Consider the problem of computing a probability of the form

(2.1)  $\ell = \mathbb{P}_f\big(S(X) \ge \gamma\big) = \mathbb{E}_f\big[I_{\{S(X) \ge \gamma\}}\big]$

for some fixed level $\gamma$. In this expression, $S(X)$ is the sample performance and $X$ is a random vector with probability density function (pdf) $f(x)$, sometimes called the prior pdf. The standard or crude Monte Carlo estimation method is based on repetitive sampling from $f$ and using the sample average:

(2.2)  $\hat\ell = \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \ge \gamma\}},$

where $X_1, \ldots, X_N$ are independent and identically distributed (iid) samples drawn from $f$. However, for rare-event probabilities, say of the order $10^{-9}$ or less, this method fails for the following reasons [108]. The efficiency of the estimator is measured by its relative error (RE), which is defined as

(2.3)  $\mathrm{RE} = \frac{\sqrt{\mathrm{Var}(\hat\ell)}}{\mathbb{E}[\hat\ell]}.$

Clearly, for the standard estimator (2.2),

$\mathrm{Var}(\hat\ell) = \frac{\ell (1 - \ell)}{N}, \qquad \text{so that} \qquad \mathrm{RE} = \sqrt{\frac{1 - \ell}{N \ell}}.$

Thus, for small probabilities, meaning $\ell \to 0$, we get

$\mathrm{RE} \approx \frac{1}{\sqrt{N \ell}}.$

This says that the sample size $N$ must be of the order $1/(\ell\,\varepsilon^2)$ to obtain a relative error of $\varepsilon$. For rare-event probabilities, this sample size leads to unmanageable computation times. For instance, let $X$ be the maximum number of customers that is present during a busy period of an M/M/1 queue. For a load of 80% and a high level, the probability of exceeding that level is extremely small, and a simulation to obtain a relative error of 10% would require about $100/\ell$ samples of $X$ (busy cycles). On a 2.79 GHz (0.99 GB RAM) PC this would take about 18 days.
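To make the sample-size barrier concrete, the following Python sketch (ours, not from the text) implements the crude Monte Carlo estimator (2.2) together with its estimated relative error; the performance function used here is an arbitrary toy choice:

```python
import numpy as np

def crude_mc(perf, gamma, N, rng):
    """Crude Monte Carlo estimate (2.2) of P(S(X) >= gamma), with its RE."""
    hits = np.array([perf(rng) >= gamma for _ in range(N)])
    ell = hits.mean()
    re = np.sqrt((1.0 - ell) / (N * ell)) if ell > 0 else np.inf
    return ell, re

rng = np.random.default_rng(0)
# Toy performance function (ours): sum of five unit-mean exponentials
perf = lambda rng: rng.exponential(1.0, 5).sum()
print(crude_mc(perf, gamma=25.0, N=10**5, rng=rng))  # level deep in the tail: 0 hits
```

With this level the true probability is of order $10^{-7}$, so $10^5$ samples typically contain no hits at all and the estimator returns zero with an infinite relative error.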

Assuming asymptotic normality of the estimator, confidence intervals (CI) follow in the standard way. For example, a 95% CI for $\ell$ is given by

$\hat\ell \pm 1.96 \sqrt{\frac{\hat\ell (1 - \hat\ell)}{N}}, \qquad \text{that is,} \qquad \hat\ell\,(1 \pm 1.96\,\mathrm{RE}).$

This shows again the reciprocal relationship between sample size and rare-event probability: suppose one wishes the confidence half-width to be at most $\delta\%$ relative to the estimate $\hat\ell$. This boils down to

$1.96 \sqrt{\frac{1 - \ell}{N \ell}} \le \frac{\delta}{100}, \qquad \text{that is,} \qquad N \ge \left(\frac{196}{\delta}\right)^2 \frac{1 - \ell}{\ell}.$

For variance reduction we apply importance sampling, using a pdf $g$; that is, we draw independent and identically distributed samples $X_1, \ldots, X_N$ from $g$, calculate their likelihood ratios

$W(X_k) = \frac{f(X_k)}{g(X_k)}, \qquad k = 1, \ldots, N,$

and set the importance sampling estimator to

(2.4)  $\hat\ell = \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \ge \gamma\}}\, W(X_k).$

By writing $Z_k = I_{\{S(X_k) \ge \gamma\}} W(X_k)$, we see that this estimator is of the form $\frac{1}{N}\sum_{k=1}^{N} Z_k$. Again, we measure efficiency by the relative error, similar to (2.3), which can be estimated by $S_Z / (\hat\ell \sqrt{N})$, with

(2.5)  $S_Z^2 = \frac{1}{N-1} \sum_{k=1}^{N} (Z_k - \hat\ell)^2$

being the sample variance of the $Z_k$.

In almost all realistic problems, the prior pdf $f$ belongs to a parameterized family

(2.6)  $\mathcal{F} = \{ f(\cdot; v),\ v \in \mathcal{V} \}.$

That means that $f(\cdot) = f(\cdot; u)$ for some fixed parameter $u$. It is then natural to choose the importance sampling pdf from the same family, using another parameter $v$, so that $g(\cdot) = f(\cdot; v)$. In that case the likelihood ratio becomes

(2.7)  $W(x; u, v) = \frac{f(x; u)}{f(x; v)}.$

We refer to the original parameter $u$ as the nominal parameter of the problem, and we call any parameter $v$ used in the importance sampling a reference parameter.

Example 2.1 Stochastic Shortest Path
Our objective is to estimate efficiently the probability $\ell$ that the shortest path from node A to node B in the bridge network of Figure 2.1 has a length of at least $\gamma$. The random lengths $X_1, \ldots, X_5$ of the links are assumed to be independent and exponentially distributed with means $u_1, \ldots, u_5$, respectively.
Figure 2.1 Shortest path from A to B.
Defining $X = (X_1, \ldots, X_5)$, $u = (u_1, \ldots, u_5)$, and

$S(X) = \min\{ X_1 + X_4,\ X_1 + X_3 + X_5,\ X_2 + X_5,\ X_2 + X_3 + X_4 \},$

the problem is cast in the framework of (2.1). Note that we can estimate (2.1) via (2.4) by drawing $X_1, \ldots, X_5$ independently from exponential distributions that are possibly different from the original ones; that is, with means $v_1, \ldots, v_5$ instead of $u_1, \ldots, u_5$.

The challenging problem is how to select a vector $v = (v_1, \ldots, v_5)$ that gives the most accurate estimate of $\ell$ for a given simulation effort. To do so, we use a two-stage procedure, where both the level $\gamma$ and the reference parameters $v$ are updated simultaneously. As we shall soon see, one of the strengths of the CE method for rare-event simulation is that it provides a fast way to estimate the optimal parameter vector accurately.
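For illustration, here is a minimal Python sketch (ours) of Example 2.1: the bridge-network performance function and the importance sampling estimator (2.4) with the exponential likelihood ratio (2.7). The nominal means `u` and reference means `v` below are placeholder values, not taken from the text:

```python
import numpy as np

def shortest_path(x):
    """Length of the shortest A-to-B path in the bridge network of Figure 2.1."""
    x1, x2, x3, x4, x5 = x
    return min(x1 + x4, x1 + x3 + x5, x2 + x5, x2 + x3 + x4)

def is_estimate(u, v, gamma, N, rng):
    """Importance sampling estimator (2.4): sample link lengths with means v,
    weight by the likelihood ratio (2.7) of exponentials with means u vs. v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    X = rng.exponential(v, size=(N, 5))
    W = np.prod((v / u) * np.exp(X / v - X / u), axis=1)
    Z = np.array([shortest_path(x) >= gamma for x in X]) * W
    ell = Z.mean()
    re = Z.std(ddof=1) / (ell * np.sqrt(N)) if ell > 0 else np.inf
    return ell, re

rng = np.random.default_rng(1)
u = np.ones(5)                        # hypothetical nominal means
print(is_estimate(u, v=3 * u, gamma=10.0, N=10**5, rng=rng))
```

Choosing `v` well is exactly the problem that the CE method solves adaptively in what follows.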

The idea of the CE method is to choose the importance sampling pdf $g$ from within the parametric class of pdfs $\{f(\cdot; v)\}$ such that the Kullback-Leibler divergence between the optimal (zero-variance) importance sampling pdf $g^*$, given by [107]

$g^*(x) = \frac{f(x; u)\, I_{\{S(x) \ge \gamma\}}}{\ell},$

and $f(\cdot; v)$ is minimal. The Kullback-Leibler divergence between two pdfs $g$ and $h$ is given by

(2.8)  $\mathcal{D}(g, h) = \mathbb{E}_g\left[\ln \frac{g(X)}{h(X)}\right] = \int g(x) \ln g(x)\, dx - \int g(x) \ln h(x)\, dx.$

The CE minimization procedure then reduces to finding an optimal reference parameter vector, $v^*$ say, by cross-entropy minimization:

(2.9)  $v^* = \operatorname*{argmax}_{v} \mathbb{E}_w\left[ I_{\{S(X) \ge \gamma\}}\, W(X; u, w)\, \ln f(X; v) \right],$

where $w$ is any reference parameter and the subscript in the expectation operator indicates the density of $X$. This $v^*$ can be estimated via the stochastic counterpart of (2.9):

(2.10)  $\hat v = \operatorname*{argmax}_{v} \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \ge \gamma\}}\, W(X_k; u, w)\, \ln f(X_k; v),$

where $X_1, \ldots, X_N$ are drawn from $f(\cdot; w)$.

The optimal parameter $\hat v$ in (2.10) can often be obtained in explicit form, in particular when the class of sampling distributions forms an exponential family [107, pages 319–320]. Indeed, analytical updating formulas can be found whenever explicit expressions for the maximum likelihood estimators of the parameters are available [37, page 36].

In the other cases, we consider a multilevel approach, where we generate a sequence of reference parameters $\{\hat v_t,\ t \ge 0\}$ and a sequence of levels $\{\hat\gamma_t,\ t \ge 1\}$, while iterating in both $\gamma_t$ and $v_t$. Our ultimate goal is to have $\hat v_t$ close to $v^*$ after some number of iterations and to use $\hat v_t$ in the importance sampling density $f(\cdot; \hat v_t)$ to estimate $\ell$.

We start with $\hat v_0 = u$ and choose $\varrho$ to be a not too small number, say $\varrho = 10^{-2}$. At the first iteration, we choose $v_1$ to be the optimal parameter for estimating $\mathbb{P}_u(S(X) \ge \gamma_1)$, where $\gamma_1$ is the $(1-\varrho)$-quantile of $S(X)$ under $u$. That is, $\gamma_1$ is the largest real number for which $\mathbb{P}_u(S(X) \ge \gamma_1) \ge \varrho.$ Thus, if we simulate under $u$, then level $\gamma_1$ is reached with a reasonably high probability of around $\varrho$. In this way we can estimate the (unknown!) level $\gamma_1$ simply by generating a random sample $X_1, \ldots, X_N$ from $f(\cdot; u)$ and calculating the sample $(1-\varrho)$-quantile of the performances; that is, denoting the ordered performance values by $S_{(1)} \le \cdots \le S_{(N)}$,

$\hat\gamma_1 = S_{(\lceil (1-\varrho) N \rceil)},$

where $\lceil \cdot \rceil$ denotes rounding upward to the nearest integer. Next, the reference parameter $\hat v_1$ can be estimated via (2.10), replacing $\gamma$ with the estimate $\hat\gamma_1$. Note that we can use the same sample for estimating both $\hat\gamma_1$ and $\hat v_1$. This means that $\hat v_1$ is estimated on the basis of the $\lceil \varrho N \rceil$ best samples, that is, the samples $X_k$ for which $S(X_k)$ is greater than or equal to $\hat\gamma_1$. These form the elite samples of the first iteration. We repeat these steps in the subsequent iterations. The two updating phases, starting from $\hat v_0 = u$, are given below:

1. Adaptive updating of $\gamma_t$. For a fixed $v_{t-1}$, let $\gamma_t$ be the $(1-\varrho)$-quantile of $S(X)$ under $v_{t-1}$. To estimate $\gamma_t$, draw a random sample $X_1, \ldots, X_N$ from $f(\cdot; \hat v_{t-1})$ and evaluate the sample $(1-\varrho)$-quantile $\hat\gamma_t$.
2. Adaptive updating of $v_t$. For fixed $\gamma_t$ and $v_{t-1}$, derive $v_t$ as

(2.11)  $v_t = \operatorname*{argmax}_{v} \mathbb{E}_{v_{t-1}}\left[ I_{\{S(X) \ge \gamma_t\}}\, W(X; u, v_{t-1})\, \ln f(X; v) \right].$

The stochastic counterpart of (2.11) is as follows: for fixed $\hat\gamma_t$ and $\hat v_{t-1}$, derive $\hat v_t$ as the solution

(2.12)  $\hat v_t = \operatorname*{argmax}_{v} \frac{1}{N} \sum_{X_k \in \mathcal{E}_t} W(X_k; u, \hat v_{t-1})\, \ln f(X_k; v),$

where $\mathcal{E}_t$ is the set of elite samples in the $t$-th iteration, that is, the samples $X_k$ for which $S(X_k) \ge \hat\gamma_t$.

The procedure terminates when at some iteration $T$ a level $\hat\gamma_T$ is reached that is at least $\gamma$; at that point the original value of $\gamma$ can be used without getting too few samples. We then reset $\hat\gamma_T$ to $\gamma$, reset the corresponding elite set, and deliver the final reference parameter $\hat v_T$, using again (2.12). This $\hat v_T$ is then used in (2.4) to estimate $\ell$.

Remark 2.1 Improved Cross-Entropy Method
The multilevel approach may not be appropriate for problems that cannot be cast in the framework of (2.1). Specifically, when we deal with a high-dimensional problem, there is often no natural modeling in terms of level crossings. Furthermore, in high-dimensional settings the likelihood ratio degeneracy becomes a severe issue, causing the importance sampling estimator to be unreliable [106]. Recent enhancements of the cross-entropy method have been developed to overcome this problem. A successful approach, proposed in [25], is based on applying a Markov chain Monte Carlo method to generate samples from the zero-variance density $g^*$. With these samples, one maximizes the logarithm of the proposed parametric density with respect to the reference parameter $v$. The advantage is that this optimization program does not involve the likelihood ratio.
Remark 2.2 Smoothed Updating

Instead of updating the parameter vector $v$ directly via (2.12), we use the following smoothed version:

(2.13)  $\hat v_t = \alpha\, \tilde v_t + (1 - \alpha)\, \hat v_{t-1},$

where $\tilde v_t$ is the parameter vector given by (2.12) and $\alpha$ is called the smoothing parameter, typically with $0.7 \le \alpha \le 1$. Clearly, for $\alpha = 1$ we recover the original updating rule. The reason for using the smoothed update (2.13) instead of the original one is twofold: (a) to smooth out the values of $\hat v_t$, and (b) to reduce the probability that some component $\hat v_{t,i}$ of $\hat v_t$ will be zero or one at the first few iterations. This is particularly important when $\hat v_t$ is a vector or matrix of probabilities: for $0 < \alpha < 1$ it is always ensured that $0 < \hat v_{t,i} < 1$, while for $\alpha = 1$ one might have (even at the first iterations) that either $\hat v_{t,i} = 0$ or $\hat v_{t,i} = 1$ for some indices $i$. As a result, the algorithm could converge to a wrong solution.

The resulting CE algorithm for rare-event probability estimation can thus be written as follows.

Algorithm 2.1 CE Algorithm for Rare-Event Estimation
1. Initialize: Define $\hat v_0 = u$. Let $N^e = \lceil \varrho N \rceil$. Set $t = 1$ (iteration counter).
2. Draw: Generate a random sample $X_1, \ldots, X_N$ according to the pdf $f(\cdot; \hat v_{t-1})$.
3. Select: Calculate the performances $S(X_k)$ for all $k$, and order them from smallest to largest, $S_{(1)} \le \cdots \le S_{(N)}$. Let $\hat\gamma_t$ be the sample $(1-\varrho)$-quantile of the performances: $\hat\gamma_t = S_{(N - N^e + 1)}$. If $\hat\gamma_t > \gamma$, reset $\hat\gamma_t$ to $\gamma$.
4. Update: Use the same sample $X_1, \ldots, X_N$ to solve the stochastic program (2.12).
5. Smooth: Apply (2.13) to smooth out the vector $\tilde v_t$.
6. Iterate: If $\hat\gamma_t < \gamma$, set $t = t + 1$ and reiterate from Step 2. Else proceed with Step 7.
7. Final: Let $T$ be the final iteration counter. Generate a sample $X_1, \ldots, X_{N_1}$ according to the pdf $f(\cdot; \hat v_T)$ and estimate $\ell$ via (2.4).

Apart from specifying the family of sampling pdfs, the sample sizes $N$ and $N_1$, and the rarity parameter $\varrho$, the algorithm is completely self-tuning. The sample size $N$ for determining a good reference parameter can usually be chosen much smaller than the sample size $N_1$ for the final importance sampling estimation, say $N = 1{,}000$ versus $N_1 = 100{,}000$. Under certain technical conditions [107], the deterministic version of Algorithm 2.1 is guaranteed to terminate (reach level $\gamma$) provided that $\varrho$ is chosen small enough.

Note that Algorithm 2.1 breaks down the “hard” problem of estimating the very small probability $\ell$ into a sequence of “simple” problems, generating a sequence of pairs $\{(\hat\gamma_t, \hat v_t)\}$, depending on $\varrho$, which is called the rarity parameter. Convergence of Algorithm 2.1 is discussed in [107]. Other convergence proofs for the CE method may be found in [33] and [84].
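The following Python sketch (ours) assembles the steps of Algorithm 2.1 for the special case of independent exponential components, using the analytic exponential-family update that will appear as (2.15) in Example 2.4; all parameter values are illustrative defaults:

```python
import numpy as np

def ce_rare_event(perf, u, gamma, N=1000, N1=100_000, rho=0.1, alpha=0.7, seed=0):
    """Algorithm 2.1 for independent exponential components with nominal means u,
    using the analytic update (2.15) in Step 4. Defaults are illustrative."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u, float)
    v = u.copy()                                       # Step 1: v_0 = u
    for _ in range(100):                               # safety cap on iterations
        X = rng.exponential(v, size=(N, len(u)))       # Step 2: draw from f(.; v)
        S = np.array([perf(x) for x in X])
        gamma_t = min(np.quantile(S, 1 - rho), gamma)  # Step 3: level, capped at gamma
        W = np.prod((v / u) * np.exp(X / v - X / u), axis=1)  # likelihood ratios (2.7)
        elite = S >= gamma_t
        WE = W[elite]
        v_new = WE @ X[elite] / WE.sum()               # Step 4: weighted elite means
        v = alpha * v_new + (1 - alpha) * v            # Step 5: smoothing (2.13)
        if gamma_t >= gamma:                           # Step 6: level gamma reached
            break
    X = rng.exponential(v, size=(N1, len(u)))          # Step 7: final IS estimate (2.4)
    W = np.prod((v / u) * np.exp(X / v - X / u), axis=1)
    Z = (np.array([perf(x) for x in X]) >= gamma) * W
    return Z.mean(), v

# e.g., ce_rare_event(shortest_path, u=np.ones(5), gamma=10.0), with
# shortest_path as in the earlier sketch
```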

Remark 2.3 Maximum Likelihood Estimator

Optimization problems of the form (2.12) appear frequently in statistics. In particular, if the likelihood ratio factor $W$ is omitted — which, as will turn out, is what happens in CE optimization — one can write (2.12) also as

$\hat v_t = \operatorname*{argmax}_{v} \prod_{X_k \in \mathcal{E}_t} f(X_k; v),$

where the product is the joint density of the elite samples. Consequently, $\hat v_t$ is chosen such that the joint density of the elite samples is maximized. Viewed as a function of the parameter $v$, rather than of the data $\mathcal{E}_t$, this joint density is called the likelihood. In other words, $\hat v_t$ is the maximum likelihood estimator (it maximizes the likelihood) of $v$ based on the elite samples. When the likelihood ratio factor $W$ is present, the form of the updating formula remains similar.

To provide further insight into Algorithm 2.1, we shall follow it step-by-step in a couple of toy examples.

Example 2.2 Binomial Distribution
Assume that we want to estimate via simulation the probability $\ell = \mathbb{P}(X \ge \gamma)$, where $X \sim \mathsf{Bin}(n, u)$, $u$ is the success probability, and $\gamma \le n$ is the target threshold. Suppose that $\gamma$ is large compared to the mean $n u$, so that

$\ell = \sum_{x = \gamma}^{n} \binom{n}{x} u^x (1-u)^{n-x}$

is a rare-event probability. Assume further that $n$ is fixed and that we want to estimate the optimal reference parameter $p$ in the binomial family $\{\mathsf{Bin}(n, p)\}$ using (2.12).

In iteration $t$, we sample iid $X_1, \ldots, X_N \sim \mathsf{Bin}(n, \hat p_{t-1})$. The right-hand side of (2.12) becomes (omitting a term that does not depend on $p$)

$\max_{p} \sum_{X_k \in \mathcal{E}_t} W_k \big[ X_k \ln p + (n - X_k) \ln (1 - p) \big],$

where

$W_k = \left(\frac{u}{\hat p_{t-1}}\right)^{X_k} \left(\frac{1 - u}{1 - \hat p_{t-1}}\right)^{n - X_k}$

is the likelihood ratio of the $k$-th sample. To find the optimal $p$, we consider the first-order condition and obtain

$\sum_{X_k \in \mathcal{E}_t} W_k \left( \frac{X_k}{p} - \frac{n - X_k}{1 - p} \right) = 0.$
Solving this for $p$ yields

(2.14)  $\hat p_t = \frac{\sum_{X_k \in \mathcal{E}_t} W_k X_k}{n \sum_{X_k \in \mathcal{E}_t} W_k}.$

In other words, $\hat p_t$ is simply the sample fraction of successes among the elite samples, weighted by the likelihood ratios. Note that, without the weights $W_k$, we would obtain the maximum likelihood estimator of $p$ for the $\mathsf{Bin}(n, p)$ distribution based on the elite samples, in accordance with Remark 2.3. Note also that (2.14) can be viewed as a stochastic counterpart of the deterministic updating formula

$p_t = \frac{\mathbb{E}_u\big[ I_{\{X \ge \gamma_t\}}\, X \big]}{n\, \mathbb{E}_u\big[ I_{\{X \ge \gamma_t\}} \big]},$

where $\gamma_t$ is the $(1-\varrho)$-quantile of the $\mathsf{Bin}(n, p_{t-1})$ distribution.
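In code, one CE iteration for this binomial example might look as follows (our sketch; the driver loop at the bottom anticipates the concrete values of the next paragraph):

```python
import numpy as np

def ce_binomial_step(n, u, p_prev, gamma, N, rho, rng):
    """One iteration of Algorithm 2.1 for X ~ Bin(n, p): returns (gamma_t, p_t)."""
    X = rng.binomial(n, p_prev, size=N)
    gamma_t = min(np.quantile(X, 1 - rho), gamma)        # level, capped at gamma
    elite = X >= gamma_t
    # Likelihood ratios of the elite samples: nominal u versus sampling p_prev
    W = (u / p_prev) ** X[elite] * ((1 - u) / (1 - p_prev)) ** (n - X[elite])
    p_t = (W * X[elite]).sum() / (n * W.sum())           # update (2.14)
    return gamma_t, p_t

rng = np.random.default_rng(0)
p, gamma_hat = 1 / 6, 0.0
while gamma_hat < 50:                 # illustrative run with n=100, u=1/6, gamma=50
    gamma_hat, p = ce_binomial_step(100, 1 / 6, p, 50, 1000, 0.05, rng)
    print(gamma_hat, round(p, 4))
```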

Assume for concreteness that $n = 100$, $u = 1/6$, and $\gamma = 50$, which corresponds to a rare-event probability $\ell$ of the order $10^{-14}$. Table 2.1 presents the evolution of $\hat\gamma_t$ and $\hat p_t$, obtained with rarity parameter $\varrho = 0.05$. Note that iteration $t = 0$ corresponds to the original binomial pdf with success probability $u = 1/6$, while iterations $t = 1, \ldots, 5$ correspond to binomial pdfs with success probabilities $\hat p_1, \ldots, \hat p_5$, respectively.
Table 2.1 The evolution of $\hat\gamma_t$ and $\hat p_t$ for $t = 0, 1, \ldots, 5$, using $\varrho = 0.05$

$t$    $\hat\gamma_t$    $\hat p_t$
0      —                 0.1667
1      23                0.2530
2      33                0.3376
3      42                0.4263
4      51                0.5124
5      50                0.5028
The final step of Algorithm 2.1 now invokes the importance sampling estimator (2.4) to estimate $\ell$, using a sample size $N_1$ that is typically larger than $N$. Clearly, the level obtained in the last iteration is reset to $\gamma = 50$.
Figure 2.2 illustrates the iterative procedure. We see that Algorithm 2.1 requires five iterations to reach the final level $\gamma = 50$. Note that until iteration 4 both parameters $\hat\gamma_t$ and $\hat p_t$ increase gradually, each time “blowing up” the tail of the binomial pdf. At iteration five we reset the level to the desired $\gamma = 50$ and estimate the optimal parameter $\hat p_5$.
Figure 2.2 The five iterations of Algorithm 2.1. Iteration $t$ samples from $\mathsf{Bin}(n, \hat p_{t-1})$, estimates the level $\hat\gamma_t$, and estimates the next sampling parameter $\hat p_t$.
Example 2.3 Degeneracy of the Sampling Distribution
When $\gamma$ is the maximum of $S(x)$, no “overshooting” of $\gamma$ in Algorithm 2.1 will occur, and therefore $\hat\gamma_t$ does not need to be reset. In such cases, the sampling pdf may degenerate toward the atomic pdf that has all its mass concentrated at the points $x$ where $S(x)$ is maximal. As an example, let $\gamma = n$ be the target level in the above binomial example. In this case the final sampling distribution is $\mathsf{Bin}(n, 1)$, which corresponds to the degenerate value $p = 1$. Table 2.2 and Figure 2.3 show the evolution of the parameters in the CE algorithm.
Table 2.2 The evolution of $\hat\gamma_t$ and $\hat p_t$ for the degenerate case of Example 2.3
Figure 2.3 Degeneracy of the sampling distribution in Example 2.3.
Example 2.4
Consider again the stochastic shortest path graph of Figure 2.1. Let us take

$u = (1,\ 1,\ 0.3,\ 0.2,\ 0.1)$

as the nominal parameter and estimate the probability $\ell$ that the minimum path length is greater than $\gamma = 6$.

Crude Monte Carlo with $10^8$ samples — a very large simulation effort — gave an estimate $8.01 \cdot 10^{-6}$ with an estimated relative error of 0.035.
To apply Algorithm 2.1 to this problem, we need to establish the updating rule for the reference parameter $v = (v_1, \ldots, v_5)$. Since the components $X_1, \ldots, X_5$ are independent and their distributions form an exponential family parameterized by the means, the solution of the stochastic program (2.12) follows analytically from (2.10) as [107]

(2.15)  $\hat v_{t,i} = \frac{\sum_{X_k \in \mathcal{E}_t} W(X_k; u, \hat v_{t-1})\, X_{ki}}{\sum_{X_k \in \mathcal{E}_t} W(X_k; u, \hat v_{t-1})}, \qquad i = 1, \ldots, 5,$

where $X_{ki}$ denotes the $i$-th component of $X_k$ and $W$ is the likelihood ratio (2.7).
In all our experiments with Algorithm 2.1 we set the rarity parameter to $\varrho = 0.1$, the sample size for Steps 2–4 of the algorithm to $N = 10^3$, and the final sample size to $N_1 = 10^5$. Table 2.3 displays the results of Steps 1–5 of the CE algorithm.
Table 2.3 Convergence of the sequence $\{(\hat\gamma_t, \hat v_t)\}$ in Example 2.4
We see that after five iterations level $\gamma = 6$ is reached. As before, we reset the level to the desired $\gamma$ and estimate the optimal parameter vector $\hat v_5$; its first two components are substantially enlarged relative to the nominal values $u_1 = u_2 = 1$, while the remaining components stay close to their nominal values. Using $\hat v_5$ in the final step with $N_1 = 10^5$ samples, we obtain an estimate of $\ell$ with an estimated RE of 0.035 — the same RE as obtained by crude Monte Carlo with $10^8$ samples. Note that, whereas the standard method required more than an hour of computation time, the CE algorithm finished in only 1 second, using a Matlab implementation on a 1500 MHz computer. We see that with a minimal amount of work we have achieved a dramatic reduction of the simulation effort.

2.3 Cross-Entropy Method for Optimization

Consider the following optimization problem:

(2.16)  $S(x^*) = \gamma^* = \max_{x \in \mathcal{X}} S(x),$

having $\mathcal{X}^* \subset \mathcal{X}$ as the set of optimal solutions.

The problem is called a discrete or continuous optimization problem, depending on whether $\mathcal{X}$ is discrete or continuous. An optimization problem in which some variables are discrete and the others continuous is called a mixed optimization problem. An optimization problem with a finite state space ($|\mathcal{X}| < \infty$) is called a combinatorial optimization problem. The latter is the main focus of this section.

We will show here how the cross-entropy method can be applied to approximate a pdf that is concentrated on $\mathcal{X}^*$. In fact, the iterations of the cross-entropy algorithm generate a sequence of pdfs $f(\cdot; \hat v_0), f(\cdot; \hat v_1), \ldots$ associated with a sequence of decreasing sets converging to $\mathcal{X}^*$.

Suppose that we set the level at the optimal value, $\gamma = \gamma^*$. Then an optimal pdf $g^*$ is characterized by

$\mathrm{supp}(g^*) = \{ x \in \mathcal{X} : S(x) = \gamma^* \} = \mathcal{X}^*.$

This matches exactly the degeneracy of the cross-entropy algorithm for rare-event simulation observed in Example 2.3.

Hence, the starting point is to associate with the optimization problem (2.16) a meaningful estimation problem. To this end, we define a collection of indicator functions $\{ I_{\{S(x) \ge \gamma\}} \}$ on $\mathcal{X}$ for various levels $\gamma \in \mathbb{R}$. Next, let $\{ f(\cdot; v),\ v \in \mathcal{V} \}$ be a family of probability densities on $\mathcal{X}$, parameterized by real-valued parameters $v$. For a fixed $\gamma$ we associate with (2.16) the problem of estimating the rare-event probability

(2.17)  $\ell(\gamma) = \mathbb{P}_v\big(S(X) \ge \gamma\big) = \mathbb{E}_v\big[ I_{\{S(X) \ge \gamma\}} \big].$

We call the estimation problem (2.17) the associated stochastic problem.

It is important to understand that one of the main goals of CE in optimization is to generate a sequence of pdfs $f(\cdot; \hat v_1), f(\cdot; \hat v_2), \ldots$ converging to a pdf that is close or equal to an optimal pdf $g^*$. For this purpose we apply Algorithm 2.1 for rare-event estimation, but without fixing $\gamma$ in advance. It is plausible that, if $f(\cdot; \hat v_T)$ is close to $g^*$, then it assigns most of its probability mass to $\mathcal{X}^*$; thus any $X$ drawn from this distribution can be used as an approximation to an optimal solution $x^*$, and the corresponding function value $S(X)$ as an approximation to the true optimal value $\gamma^*$ in (2.16).

In summary, to solve a combinatorial optimization problem we employ the CE Algorithm 2.1 for rare-event estimation without fixing $\gamma$ in advance. By doing so, the CE algorithm for optimization can be viewed as a modified version of Algorithm 2.1. In particular, in analogy to Algorithm 2.1, we choose a not very small number $\varrho$, say $\varrho = 10^{-2}$, initialize the parameter vector by setting $\hat v_0 = u$, and proceed as follows:

1. Adaptive updating of $\gamma_t$. For a fixed $v_{t-1}$, let $\gamma_t$ be the $(1-\varrho)$-quantile of $S(X)$ under $v_{t-1}$. As before, an estimator $\hat\gamma_t$ of $\gamma_t$ can be obtained by drawing a random sample $X_1, \ldots, X_N$ from $f(\cdot; \hat v_{t-1})$ and then evaluating the sample $(1-\varrho)$-quantile of the performances as

(2.18)  $\hat\gamma_t = S_{(\lceil (1-\varrho) N \rceil)}.$

2. Adaptive updating of $v_t$. For fixed $\gamma_t$ and $v_{t-1}$, derive $v_t$ from the solution of the program

(2.19)  $v_t = \operatorname*{argmax}_{v} \mathbb{E}_{v_{t-1}}\left[ I_{\{S(X) \ge \gamma_t\}}\, \ln f(X; v) \right].$

The stochastic counterpart of (2.19) is as follows: for fixed $\hat\gamma_t$ and $\hat v_{t-1}$, derive $\hat v_t$ from the following program:

(2.20)  $\hat v_t = \operatorname*{argmax}_{v} \frac{1}{N} \sum_{X_k \in \mathcal{E}_t} \ln f(X_k; v),$

where $\mathcal{E}_t$ is again the set of elite samples, that is, the samples $X_k$ for which $S(X_k) \ge \hat\gamma_t$.

Note that, in contrast to the formulas (2.11) and (2.12) for the rare-event setting, formulas (2.19) and (2.20) do not contain the likelihood ratio factors $W$. The reason is that in the rare-event setting the initial (nominal) parameter $u$ is specified in advance and is an essential part of the estimation problem. In contrast, the initial reference vector $\hat v_0$ in the associated stochastic problem is quite arbitrary. In effect, by dropping the likelihood ratio factor, we can efficiently estimate at each iteration $t$ the CE optimal high-dimensional reference parameter vector $v_t$ for the rare-event probability $\mathbb{P}_{v_{t-1}}(S(X) \ge \gamma_t)$, while avoiding the degeneracy properties of the likelihood ratio.

Thus, the main CE optimization algorithm, which includes smoothed updating of the parameter vector $v$ and presents a slight modification of Algorithm 2.1, can be summarized as follows.

Algorithm 2.2 Main CE Algorithm for Optimization
1. Initialize: Choose an initial parameter vector $\hat v_0$. Set $t = 1$ (level counter).
2. Draw: Generate a sample $X_1, \ldots, X_N$ from the density $f(\cdot; \hat v_{t-1})$.
3. Select: Compute the sample $(1-\varrho)$-quantile $\hat\gamma_t$ of the performances according to (2.18).
4. Update: Use the same sample $X_1, \ldots, X_N$ and solve the stochastic program (2.20). Denote the solution by $\tilde v_t$.
5. Smooth: Apply (2.13) to smooth out the vector $\tilde v_t$.
6. Iterate: If the stopping criterion is met, stop; otherwise set $t = t + 1$, and return to Step 2.

Note that when $S(x)$ must be minimized instead of maximized, we simply change the inequalities “$\ge$” to “$\le$” and take the $\varrho$-quantile instead of the $(1-\varrho)$-quantile. Alternatively, one can just maximize $-S(x)$.
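A generic skeleton of Algorithm 2.2 in Python might look as follows (our sketch). The problem-specific ingredients — the sampler for $f(\cdot; v)$ and the solver of (2.20) — are passed in as functions; the stall test mirrors (2.21) and is meaningful mainly for integer-valued performances:

```python
import numpy as np

def ce_optimize(sample, update, score, v0, N=1000, rho=0.1, alpha=0.7,
                d=5, max_iter=200, rng=None):
    """Generic skeleton of Algorithm 2.2. `sample(v, N, rng)` draws N solutions
    from f(.; v); `update(elite)` solves program (2.20); `score(x)` is S(x)."""
    rng = rng or np.random.default_rng()
    v = np.asarray(v0, float)
    levels, best = [], -np.inf
    for _ in range(max_iter):
        X = sample(v, N, rng)                          # Step 2: draw
        S = np.array([score(x) for x in X])
        gamma_t = np.quantile(S, 1 - rho)              # Step 3: level (2.18)
        elite = [x for x, s in zip(X, S) if s >= gamma_t]
        v = alpha * update(elite) + (1 - alpha) * v    # Steps 4-5: (2.20) + (2.13)
        levels.append(gamma_t)
        best = max(best, S.max())                      # track best value, cf. (2.22)
        if len(levels) > d and len(set(levels[-(d + 1):])) == 1:
            break                                      # Step 6: stop via (2.21)
    return v, best
```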

Remark 2.4 Stopping Criteria
As a stopping criterion one can use, for example: if for some $t \ge d$, say $d = 5$,

(2.21)  $\hat\gamma_t = \hat\gamma_{t-1} = \cdots = \hat\gamma_{t-d},$

then stop.

As an alternative estimate for $\gamma^*$, one can consider the overall best performance found up to iteration $T$:

(2.22)  $\hat\gamma_T^{*} = \max_{1 \le t \le T} S_{(N)}^{(t)},$

where $S_{(N)}^{(t)}$ denotes the largest performance value sampled at iteration $t$.

Before one implements the algorithm, one must specify the initial vector $\hat v_0$, the sample size $N$, the stopping parameter $d$, and the rarity parameter $\varrho$. Apart from that, the execution of the algorithm is “self-tuning.”

Note that the estimation Step 7 of Algorithm 2.1 is omitted in Algorithm 2.2 because, in the optimization setting, we are not interested in estimating $\ell$ per se.

Remark 2.5 Class of Parametric Densities
To run the algorithm, one needs to propose a class of parametric sampling densities $\{f(\cdot; v)\}$, the initial vector $\hat v_0$, the sample size $N$, the rarity parameter $\varrho$, and a stopping criterion. Of these, the most challenging is the selection of an appropriate class of parametric sampling densities. Typically there is no unique parametric family, and the selection is guided by the following competing objectives. First, the class has to be flexible enough to include a reasonable parametric approximation to an optimal importance sampling density $g^*$ for the estimation of the associated rare-event probability $\ell(\gamma)$. Second, each density $f(\cdot; v)$ has to be simple enough to allow fast random variable generation. In many cases, these two competing objectives are reconciled by using a standard statistical model for $f(\cdot; v)$, such as the multivariate Bernoulli or Gaussian densities with independent components.

Algorithm 2.2 can, in principle, be applied to any discrete optimization problem. However, for each individual problem, two essential ingredients need to be supplied:

1. We need to specify how the samples are generated; in other words, we need to specify the family of densities $\{f(\cdot; v)\}$.
2. We need to update the parameter vector $v$ on the basis of the cross-entropy minimization program (2.20), which is the same for all optimization problems.

In general, there are many ways to generate samples from $\mathcal{X}$, and it is not always immediately clear which way of generating the sample will yield better results or easier updating formulas.

Remark 2.6 Parameter Selection
The choice of the sample size $N$ and the rarity parameter $\varrho$ depends on the size of the problem and the number of parameters in the associated stochastic problem. Typical choices are $N = c\,n$ or $N = c\,n^2$, and $\varrho = 0.1$ or $\varrho = 0.01$, where $n$ is the number of parameters that need to be estimated/updated and $c$ is a constant between 1 and 10.

Below we present several applications of the CE method to combinatorial optimization problems, namely the multidimensional 0/1 knapsack problem, the Mastermind game, and reinforcement learning in Markov decision problems. We demonstrate numerically the efficiency of the CE method and its fast convergence in several case studies. For additional applications of CE, see [107].

2.3.1 The Multidimensional 0/1 Knapsack Problem

Consider the multidimensional 0/1 knapsack problem (see Section A.1.2):

(2.23)  $\max_{x \in \{0,1\}^n} \sum_{j=1}^{n} p_j x_j \qquad \text{subject to} \qquad \sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, \ldots, m.$

In other words, the goal is to maximize the performance function $S(x) = \sum_{j=1}^{n} p_j x_j$ over the set $\mathcal{X}$ given by the knapsack constraints of (2.23).

Note that each item $j$ is either chosen, $x_j = 1$, or left out, $x_j = 0$. We model this by Bernoulli random variables $X_j \sim \mathsf{Ber}(v_j)$, $j = 1, \ldots, n$. Taking these variables independent, we get the following parameterized set

$f(x; v) = \prod_{j=1}^{n} v_j^{x_j} (1 - v_j)^{1 - x_j}$

of Bernoulli pdfs. The optimal solution $x^*$ of the knapsack problem (2.23) is characterized by the degenerate pdf $f(\cdot; v^*)$ for which all $v_j^* \in \{0, 1\}$; that is,

$v^* = x^*.$

To proceed, consider the stochastic program (2.20). We have

$\sum_{X_k \in \mathcal{E}_t} \ln f(X_k; v) = \sum_{X_k \in \mathcal{E}_t} \sum_{j=1}^{n} \big[ X_{kj} \ln v_j + (1 - X_{kj}) \ln (1 - v_j) \big],$

where, for convenience, we write $X_k = (X_{k1}, \ldots, X_{kn})$. Applying the first-order optimality conditions, we obtain

$\hat v_{t,j} = \frac{\sum_{X_k \in \mathcal{E}_t} X_{kj}}{|\mathcal{E}_t|}$

for $j = 1, \ldots, n$; that is, $\hat v_{t,j}$ is the fraction of elite samples in which item $j$ is chosen.
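Putting the pieces together, here is a sketch (ours) of Algorithm 2.2 for a small 0/1 knapsack instance; infeasible samples are simply scored $-\infty$, one of several possible ways (not specified in the text) to handle the constraints:

```python
import numpy as np

def ce_knapsack(p, A, b, N=1000, rho=0.1, alpha=0.7, iters=50, seed=0):
    """CE for max p.x subject to A x <= b, x in {0,1}^n, Bernoulli sampling."""
    rng = np.random.default_rng(seed)
    v = np.full(len(p), 0.5)                          # initial probabilities 1/2
    for _ in range(iters):
        X = (rng.random((N, len(p))) < v).astype(float)   # N Bernoulli(v) vectors
        feasible = np.all(X @ A.T <= b, axis=1)
        if not feasible.any():
            continue
        S = np.where(feasible, X @ p, -np.inf)        # infeasible samples score -inf
        gamma_t = np.quantile(S[feasible], 1 - rho)   # elite level among feasible
        elite = X[S >= gamma_t]
        v = alpha * elite.mean(axis=0) + (1 - alpha) * v  # update + smoothing (2.13)
    return (v > 0.5).astype(int), v

# Hypothetical small instance (not the OR-Library one referenced below):
p = np.array([9.0, 5.0, 6.0, 4.0])
A = np.array([[6.0, 3.0, 5.0, 2.0],
              [1.0, 2.0, 4.0, 3.0]])
b = np.array([10.0, 7.0])
print(ce_knapsack(p, A, b))
```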

As a concrete example, consider the knapsack problem available at http://people.brunel.ac.uk/∼mastjjb/jeb/orlib/mknapinfo.html from the OR-Library. The instance data consist of the profit vector $p$, the constraint matrix $A = (a_{ij})$, and the right-hand-side vector $b$, all of which are given at the above URL.

The performance of the CE algorithm is illustrated in Table 2.4, using sample size $N$, rarity parameter $\varrho$, and smoothing parameter $\alpha$, with the initial probabilities all set to $1/2$. Observe that the probability vector $\hat v_t$ quickly converges to a degenerate one, corresponding to a solution $\hat x$ whose value coincides with the best known value $\gamma^*$. Hence, the optimal solution has been found.

Table 2.4 The evolution of $\hat\gamma_t$ and $\hat v_t$ for the 0/1 knapsack example

2.3.2 Mastermind Game

In the well-known game Mastermind, the objective is to decipher a hidden “code” of colored pegs through a series of guesses. After each guess, new information on the true code is provided in the form of black and white pegs. A black peg is earned for each peg that is exactly right and a white peg for each peg that is in the solution, but in the wrong position; see Figure 2.4. To get a feel for the game, one can visit www.archimedes-lab.org/mastermind.html and play the game.

Figure 2.4 The Mastermind game.


Consider a Mastermind game with $m$ colors, numbered $1, \ldots, m$, and $n$ pegs (positions). The hidden solution and any guess can be represented by a row vector of length $n$ with entries in $\{1, \ldots, m\}$. For example, for $m = 6$ colors and $n = 5$ pegs, the solution could be $x^* = (3, 1, 4, 2, 2)$, and a possible guess $x = (3, 2, 4, 1, 5)$. Let $\mathcal{X}$ be the space of all possible guesses; clearly $|\mathcal{X}| = m^n$. On $\mathcal{X}$ we define a performance function $S(x)$ that returns for each guess $x$ the “peg score”

(2.24)  $S(x) = 2\, n_b(x) + n_w(x),$

where $n_b(x)$ and $n_w(x)$ are the numbers of black and white pegs returned after the guess $x$. We have assumed, somewhat arbitrarily, that a black peg is worth twice as much as a white peg. As an example, for the solution $x^*$ and guess $x$ above, one gets two black pegs (for the first and third pegs in the guess) and two white pegs (for the second and fourth pegs in the guess, which match the fifth and second pegs of the solution but are in the wrong place). Hence the score for this guess is $2 \cdot 2 + 2 = 6$.

There are some specially designed algorithms that efficiently solve the problem for different numbers of colors and pegs; for examples, see www.mathworks.com/contest/mastermind.cgi/home.html.

Note that, with the above performance function, the problem can be formulated as the optimization problem $\max_{x \in \mathcal{X}} S(x)$, and hence we can apply the CE method to solve it. In order to apply the CE method, we first need to generate random guesses $X$ using an $n \times m$ probability matrix $P = (p_{ij})$. Each element $p_{ij}$ of $P$ describes the probability of choosing the $j$-th color for the $i$-th peg (location). Since only one color may be assigned to each peg, we have

(2.25)  $\sum_{j=1}^{m} p_{ij} = 1, \qquad i = 1, \ldots, n.$

In each stage $t$ of the CE algorithm, we sample independently for each row (that is, each peg) a color using the probability matrix $\hat P_{t-1}$ and calculate the score $S(X)$ according to (2.24). The updating of the elements $\hat p_{t,ij}$ of the probability matrix $\hat P_t$ is performed according to

(2.26)  $\hat p_{t,ij} = \frac{\sum_{X_k \in \mathcal{E}_t} I_{\{X_{ki} = j\}}}{|\mathcal{E}_t|},$

where the samples $X_1, \ldots, X_N$ are drawn independently and identically distributed using the matrix $\hat P_{t-1}$, and $X_{ki}$ denotes the color of the $i$-th peg in the $k$-th sampled guess. Note that in this case (2.26) simply counts the fraction of times among the elites that color $j$ has been assigned to peg $i$.
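The following sketch (ours) implements the sampling and updating steps; the scoring helper and the hidden code are our own minimal stand-ins, with colors represented internally as $0, \ldots, m-1$:

```python
import numpy as np
from collections import Counter

def peg_score(guess, code):
    """Peg score (2.24): twice the number of black pegs plus the white pegs."""
    blacks = sum(g == c for g, c in zip(guess, code))
    matches = sum((Counter(guess) & Counter(code)).values())  # blacks + whites
    return 2 * blacks + (matches - blacks)

def ce_mastermind(code, m, N=1000, rho=0.03, alpha=0.7, iters=100, seed=0):
    """Algorithm 2.2 with the row-stochastic matrix P of (2.25), update (2.26)."""
    n, rng = len(code), np.random.default_rng(seed)
    P = np.full((n, m), 1.0 / m)                    # uniform start; rows sum to 1
    for _ in range(iters):
        X = np.array([[rng.choice(m, p=P[i]) for i in range(n)] for _ in range(N)])
        S = np.array([peg_score(x, code) for x in X])
        gamma_t = np.quantile(S, 1 - rho)
        elite = X[S >= gamma_t]
        # (2.26): fraction of elites assigning color j to peg i, then smooth (2.13)
        P_new = np.stack([np.bincount(elite[:, i], minlength=m) / len(elite)
                          for i in range(n)])
        P = alpha * P_new + (1 - alpha) * P
        if S.max() == 2 * n:                        # all pegs black: code deciphered
            break
    return P

hidden = [2, 4, 4, 1, 3]                            # hypothetical code, m = 6 colors
print(ce_mastermind(hidden, m=6).round(2))
```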

2.3.2.1 Simulation Results

Table 2.5 shows the dynamics of the CE algorithm for the Mastermind test problem denoted at www.mathworks.com/contest/mastermind.cgi/home.html as “problem (5.2.3),” using a probability matrix $P$ of size $n \times m$ and fixed CE parameters $N$, $\varrho$, and $\alpha$. It took 34 seconds of CPU time to find the true solution.

Table 2.5 Dynamics of Algorithm 2.2 for the Mastermind problem with $m$ colors and $n$ pegs. In each iteration, $S_{\max}$ denotes the maximal sampled score

$t$    $S_{\max}$    $\hat\gamma_t$
1      31            26
2      34            28
3      38            32
4      42            35
5      48            40
6      55            45
7      60            51
8      64            57
9      70            63
10     72            68
11     72            72

2.3.3 Markov Decision Process and Reinforcement Learning

The Markov decision process (MDP) model is standard in machine learning, operations research, and related fields. The application to reinforcement learning in this section is based on [83]. We start by reviewing briefly some of the basic definitions and concepts in MDP. For details see, for example, [95].

An MDP is defined by a tuple $(\mathcal{Z}, \mathcal{A}, P, r)$ where

1. $\mathcal{Z} = \{z_1, \ldots, z_n\}$ is a finite set of states.
2. $\mathcal{A} = \{a_1, \ldots, a_m\}$ is the set of possible actions by the decision maker. We assume it is the same for every state, to ease notation.
3. $P$ is the transition probability matrix with elements $P(z' \mid z, a)$, the transition probability from state $z$ to state $z'$ when action $a$ is chosen.
4. $r(z, a)$ is the immediate reward for performing action $a$ in state $z$ ($r(z, a)$ may be a random variable).

At each time instance $t$, the decision maker observes the current state $z_t$ and determines the action to be taken, say $a_t$. As a result, a reward $r(z_t, a_t)$ is received immediately, and a new state $z_{t+1}$ is chosen according to $P(z_{t+1} \mid z_t, a_t)$. A policy or strategy $\pi$ is a rule that determines, for each history $h_t = (z_0, a_0, \ldots, z_{t-1}, a_{t-1}, z_t)$ of states and actions, the probability distribution of the decision maker's actions at time $t$. A policy is called Markov if each action depends only on the current state $z_t$. A Markov policy is called stationary if it does not depend on the time $t$. If the decision rule of a stationary policy is nonrandomized, we say that the policy is deterministic.

In the sequel, we consider problems for which an optimal policy belongs to the class of stationary Markov policies. Associated with such an MDP are the stochastic processes of consecutive observed states $z_0, z_1, \ldots$ and actions $a_0, a_1, \ldots$. For a stationary policy we may write $\pi(z)$ for the action taken in state $z$, which is a random variable on the action set $\mathcal{A}$; in the case of a deterministic policy it degenerates to a single action. Finally, we denote the immediate reward in state $z$ under this policy by $r(z, \pi(z))$.

The goal of the decision maker is to maximize a certain reward function. The following are standard reward criteria:

1. Finite horizon reward. This applies when there exists a finite time $\tau$ (random or deterministic) at which the process terminates. The objective is to maximize the total reward

(2.27)  $V(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\tau} r(z_t, a_t) \right].$

Here, $\mathbb{E}_\pi$ denotes the expectation with respect to the probability measure induced by the strategy $\pi$.
2. Infinite horizon discounted reward. The objective is to find a strategy $\pi$ that maximizes

(2.28)  $V(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \beta^t\, r(z_t, a_t) \right],$

where $\beta \in (0, 1)$ is the discount factor.
3. Average reward. The objective is to maximize

$V(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\left[ \sum_{t=0}^{T-1} r(z_t, a_t) \right].$

Suppose that the MDP is identified; that is, the transition probabilities and the immediate rewards are specified and known. Then it is well known that there exists a deterministic stationary optimal strategy for the above three cases. Moreover, there are several efficient methods for finding it, such as value iteration and policy iteration [95].

In the next section, we will restrict our attention to the stochastic shortest path MDP, where it is assumed that the process starts from a specific initial state $z^{\text{start}}$ and terminates in an absorbing state $z^{\text{fin}}$ with zero reward. For the finite horizon reward criterion, $\tau$ is the time at which $z^{\text{fin}}$ is reached (which we will assume always happens eventually). It is well known that for the shortest path MDP there exists a stationary Markov policy that maximizes $V(\pi)$ for each of the reward functions above.

However, when the transition probabilities or the rewards of an MDP are unknown, the problem is much more difficult and is referred to as a learning problem. A well-known framework for learning algorithms is reinforcement learning (RL), where an agent learns the behavior of the system through trial and error with an unknown dynamic environment; see [61].

There are several approaches to RL, which can be roughly divided into the following three classes: model-based, model-free, and policy search. In the model-based approach, a model of the environment is first constructed. The estimated MDP is then solved using standard tools of dynamic programming (see [66]).

In the model-free approach, one learns a utility function instead of learning the model. The optimal policy is to choose at each state an action that maximizes the expected utility. The popular Q-learning algorithm (e.g., [125]) is an example of this approach.

In the policy search approach, a subspace of the policy space is searched, and the performance of policies is evaluated based on their empirical performance (e.g., [7]). An example of a gradient-based policy search method is the REINFORCE algorithm of [127]. For a direct search in the policy space approach see [98]. The CE algorithm in the next section belongs to the policy search approach.

Remark 2.7 Stochastic Approximation
Many RL algorithms are based on the classic stochastic approximation (SA) algorithm. To explain SA, assume that we need to find the unique solution $\theta^*$ of a nonlinear equation $\mathbb{E}[Y(\theta)] = 0$, where $Y(\theta)$ is a random variable with a pdf parameterized by $\theta$, such that its expectation is not available in closed form; however, assume that we can simulate an outcome of $Y(\theta)$ for any parameter $\theta$. The SA algorithm for estimating $\theta^*$ is

$\theta_{t+1} = \theta_t + \beta_t\, Y(\theta_t),$

where $\{\beta_t\}$ is a positive sequence satisfying

(2.29)  $\sum_{t=0}^{\infty} \beta_t = \infty, \qquad \sum_{t=0}^{\infty} \beta_t^2 < \infty.$

The connection between SA and Q-learning is given by [117]. This work has made an important impact on the entire field of RL. Even if $\beta_t$ remains bounded away from 0 (and thus convergence is not guaranteed), it is still required that $\beta_t$ be small in order to ensure convergence to a reasonable neighboring solution [10].
Our main goal here is to introduce a fast learning algorithm based on the CE method instead of the slow SA algorithms and to demonstrate its high efficiency.

2.3.3.1 Policy Learning via the CE Method

We consider a CE learning algorithm for the shortest path MDP, where it is assumed that the process starts from a specific initial state $z^{\text{start}}$ and that there is an absorbing state $z^{\text{fin}}$ with zero reward. The objective is given in (2.27), with $\tau$ being the stopping time at which $z^{\text{fin}}$ is reached (which we will assume always happens).

To put this problem in the CE framework, consider the maximization problem (2.27). Recall that for the shortest path MDP an optimal stationary strategy exists. We can represent each deterministic stationary strategy as a vector $\pi = (\pi(z_1), \ldots, \pi(z_n))$, with $\pi(z)$ being the action taken when visiting state $z$. Writing the expectation in (2.27) as

(2.30)  $S(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\tau} r(z_t, a_t) \right],$

we see that the optimization problem (2.27) is of the form $\max_\pi S(\pi)$. We shall also consider the case where $S(\pi)$ is measured (observed) with some noise, in which case we have a noisy optimization problem.

The idea now is to combine the random policy generation and the random trajectory generation in the following way. At each stage of the CE algorithm we generate random policies and random trajectories using an auxiliary $n \times m$ matrix $P = (p_{za})$, such that for each state $z$ we choose action $a$ with probability $p_{za}$. Once this “policy matrix” $P$ is defined, each iteration of the CE algorithm comprises the following two standard phases:

1. Generation of $N$ random trajectories $(z_0, a_0, z_1, a_1, \ldots, z_\tau)$ using the auxiliary policy matrix $P$. The performance of each trajectory is computed via

(2.31)  $S(X) = \sum_{t=0}^{\tau} r(z_t, a_t).$

2. Updating of the parameters of the policy matrix ($p_{za}$) on the basis of the data collected in the first phase.

The matrix $P$ is typically initialized to the uniform matrix ($p_{za} = 1/m$ for all $z$ and $a$). We describe both the trajectory generation and the updating procedure in more detail below. We shall show that, in calculating the associated sample performance, one can take into account the Markovian nature of the problem and speed up the Monte Carlo process.

2.3.3.2 Generating Random Trajectories

Generating random trajectories for an MDP is straightforward and is given here for convenience only. All one has to do is to start the trajectory from the initial state $z^{\text{start}}$ and follow the trajectory by generating, at each new state, an action according to the probability distribution given by the corresponding row of $P$, until the absorbing state $z^{\text{fin}}$ is reached at, say, time $\tau$.

Algorithm 2.3 Trajectory Generation for MDP
Input: an $n \times m$ auxiliary policy matrix $P$.
1. Start from the given initial state $z_0 = z^{\text{start}}$; set $t = 0$.
2. Generate an action $a_t$ according to the $z_t$-th row of $P$, collect the reward $r(z_t, a_t)$, and generate a new state $z_{t+1}$ according to $P(\cdot \mid z_t, a_t)$. Set $t = t + 1$. Repeat until $z_t = z^{\text{fin}}$.
3. Output the total reward of the trajectory, given by (2.31).
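A minimal Python version of Algorithm 2.3 (ours) could look as follows; the environment `step` function, returning a reward and a next state, is an assumed interface rather than anything prescribed by the text:

```python
import numpy as np

def generate_trajectory(policy_matrix, step, z_start, z_fin, rng, max_len=10_000):
    """Algorithm 2.3: roll out one trajectory under a random stationary policy.
    `step(z, a, rng)` is the environment: it returns (reward, next_state).
    States and actions are integer indices."""
    z, traj, total = z_start, [], 0.0
    for _ in range(max_len):
        if z == z_fin:
            break
        a = rng.choice(policy_matrix.shape[1], p=policy_matrix[z])  # sample action
        reward, z_next = step(z, a, rng)
        traj.append((z, a, reward))
        total += reward                                             # accumulate (2.31)
        z = z_next
    return traj, total
```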

2.3.3.3 Updating Rules

Let the $N$ trajectories $X_1, \ldots, X_N$, their scores $S(X_1), \ldots, S(X_N)$, and the associated sample $(1-\varrho)$-quantile $\hat\gamma_t$ of their order statistics be given. Then one can update the parameter matrix $\hat P_t$ using the CE method, namely as per

(2.32)  $\hat p_{t,za} = \frac{\sum_{k=1}^{N} I_{\{S(X_k) \ge \hat\gamma_t\}}\, I_{\{X_k \in \mathcal{X}_{za}\}}}{\sum_{k=1}^{N} I_{\{S(X_k) \ge \hat\gamma_t\}}\, I_{\{X_k \in \mathcal{X}_{z}\}}},$

where the event $\{X_k \in \mathcal{X}_z\}$ means that the trajectory $X_k$ contains a visit to state $z$, and the event $\{X_k \in \mathcal{X}_{za}\}$ means that the trajectory contains a visit to state $z$ in which action $a$ was taken.

We now explain how to take advantage of the Markovian nature of the problem. Think of a maze in which a certain trajectory starts badly — the path is inefficient at the beginning — but after some time starts moving quickly toward the goal. According to (2.32), the updates are performed in the same manner for every state in the trajectory. However, the actions taken in the states sampled near the target were successful, so one would like to “encourage” these actions. Using the Markovian property, one can substantially improve the above algorithm by considering, for each state, the part of the reward from the visit to that state onward. We therefore use the same trajectory and simultaneously calculate the performance for every state in the trajectory separately. The idea is that each choice of action in a given state affects the reward from that point on, regardless of the past.

The sampling Algorithm 2.3 does not change in Steps 1 and 2; the difference is in Step 3. Given a policy $\pi$ and a trajectory $(z_0, a_0, z_1, a_1, \ldots, z_\tau)$, we calculate the performance from every state until termination: for every state $z_t$ in the trajectory, the (estimated) performance from $z_t$ onward is $S_{z_t}(\pi) = \sum_{s=t}^{\tau} r(z_s, a_s)$. The updating formula for $\hat p_{za}$ is similar to (2.32); however, each state $z$ is updated separately, according to the (estimated) performance obtained from state $z$ onward:

(2.33)  $\hat p_{t,za} = \frac{\sum_{k=1}^{N} I_{\{S_z(X_k) \ge \hat\gamma_{t,z}\}}\, I_{\{X_k \in \mathcal{X}_{za}\}}}{\sum_{k=1}^{N} I_{\{S_z(X_k) \ge \hat\gamma_{t,z}\}}\, I_{\{X_k \in \mathcal{X}_{z}\}}}.$

It is important to understand that, in contrast to (2.32), the CE optimization is carried out for every state separately, and a different threshold parameter $\hat\gamma_{t,z}$ is used for every state $z$ at iteration $t$. This facilitates faster convergence for “easy” states where the optimal strategy is easy to find. The above trajectory sampling method can be viewed as a variance reduction method. Numerical results indicate that the CE algorithm with updating (2.33) is much faster than that with updating (2.32).
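The per-state update (2.33) can be sketched as follows (ours). For simplicity this version counts state visits rather than whole trajectories, a minor simplification of (2.33):

```python
import numpy as np

def update_policy(trajs, n_states, n_actions, rho, alpha, P):
    """Per-state CE update in the spirit of (2.33). `trajs` is a list of
    [(z, a, reward), ...] lists, e.g. produced by generate_trajectory above."""
    # Collect, for every state, (reward-to-go, action) pairs over all trajectories
    visits = [[] for _ in range(n_states)]
    for traj in trajs:
        rewards = [r for (_, _, r) in traj]
        togo = np.cumsum(rewards[::-1])[::-1]          # reward from each step onward
        for (z, a, _), s in zip(traj, togo):
            visits[z].append((s, a))
    P_new = P.copy()
    for z, pairs in enumerate(visits):
        if not pairs:
            continue                                   # state never visited: keep row
        scores = np.array([s for s, _ in pairs])
        gamma_z = np.quantile(scores, 1 - rho)         # state-specific threshold
        elite_actions = [a for s, a in pairs if s >= gamma_z]
        counts = np.bincount(elite_actions, minlength=n_actions)
        P_new[z] = alpha * counts / counts.sum() + (1 - alpha) * P[z]   # smooth (2.13)
    return P_new
```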

2.3.3.4 Numerical Results with the Maze Problem

The CE Algorithm with the updating rule (2.33) and trajectory generation according to Algorithm 2.3 was tested for a stochastic shortest path MDP and, in particular, for a maze problem, which presents a two-dimensional grid world. The agent moves on a grid in four possible directions. The goal of the agent is to move from the upper-left corner (the starting state) to the lower-right corner (the goal) in the shortest path possible. The maze contains obstacles (walls) into which movement is not allowed. These are modeled as forbidden moves for the agent.

We assume the following:

1. The moves in the grid are allowed in four possible directions with the goal to move from the upper-left corner to the lower-right corner.
2. The maze contains obstacles (“walls”) into which movement is not allowed.
3. The costs of allowed moves are random variables uniformly distributed between 0.5 and 1.5, and the costs of forbidden moves are random variables uniformly distributed between 25 and 75. Thus, the expected costs are 1 and 50 for the allowed and forbidden moves, respectively.

In addition, we introduce the following:

  • A small (failure) probability not to succeed moving in an allowed direction
  • A small probability of succeeding moving in the forbidden direction (“moving through the wall”)

In Figure 2.5 we present the results for a maze. The algorithm parameters (sample size $N$, rarity parameter $\varrho$, and smoothing parameter $\alpha$) were held fixed throughout the run.

Figure 2.5 Performance of the CE algorithm for the maze problem. Each arrow indicates a probability $p_{za}$ of going in that direction.

The initial policy was the uniformly random one. The success probabilities for the allowed and forbidden moves were taken to be 0.95 and 0.05, respectively. The arrows in Figure 2.5 indicate that at the current iteration the probability of going from $z$ to $z'$ is at least 0.01. In other words, if $a$ corresponds to the action that leads from state $z$ to state $z'$, then an arrow is plotted from $z$ to $z'$ provided $\hat p_{za} \ge 0.01$.

In all our experiments, CE found the target exactly within 5–10 iterations, and the CPU time was less than one minute (on a 500 MHz Pentium processor). Note that the successive iterations of the policy in Figure 2.5 quickly converge to the optimal policy. We have also run the algorithm for several other mazes, and the optimal policy was always found.

2.4 Continuous Optimization

When the state space is continuous — in particular, when $\mathcal{X} = \mathbb{R}^n$ — the optimization problem (2.16) is often referred to as a continuous optimization problem. The CE sampling distribution on $\mathbb{R}^n$ can be quite arbitrary and does not need to be related to the function that is being optimized. The generation of a random vector $X = (X_1, \ldots, X_n)$ in Step 2 of Algorithm 2.2 is most easily performed by drawing the $n$ coordinates independently from some two-parameter distribution. In most applications a normal (Gaussian) distribution is employed for each component. Thus, the sampling density is characterized by a vector of means $\mu$ and a vector of variances $\sigma^2$, and we may write $X \sim \mathsf{N}(\mu, \sigma^2)$ with independent components. The choice of the normal distribution is motivated by the availability of fast normal random number generators in modern statistical software and by the fact that cross-entropy minimization yields a very simple solution: at each iteration of the CE algorithm, the parameter vectors $\mu$ and $\sigma^2$ are updated to the sample means and sample variances of the elements of the set of $N^e$ best-performing vectors (that is, the elite set); see, for example, [71]. In summary, the CE method for continuous optimization with a Gaussian sampling density is as follows.

Algorithm 2.4 CE for Continuous Optimization: Normal Updating
1. Initialize: Choose $\hat\mu_0$ and $\hat\sigma_0^2$. Set $t = 1$.
2. Draw: Generate a random sample $X_1, \ldots, X_N$ from the $\mathsf{N}(\hat\mu_{t-1}, \hat\sigma_{t-1}^2)$ distribution.
3. Select: Let $\mathcal{I}$ be the indices of the $N^e$ best performing (elite) samples.
4. Update: For all $j = 1, \ldots, n$, let

(2.34)  $\tilde\mu_{t,j} = \frac{1}{N^e} \sum_{k \in \mathcal{I}} X_{kj}$

(2.35)  $\tilde\sigma_{t,j}^2 = \frac{1}{N^e} \sum_{k \in \mathcal{I}} \big( X_{kj} - \tilde\mu_{t,j} \big)^2.$

5. Smooth:

(2.36)  $\hat\mu_t = \alpha\, \tilde\mu_t + (1 - \alpha)\, \hat\mu_{t-1}, \qquad \hat\sigma_t = \alpha\, \tilde\sigma_t + (1 - \alpha)\, \hat\sigma_{t-1}.$

6. Iterate: Stop if $\max_j \hat\sigma_{t,j} < \varepsilon$, and return $\hat\mu_t$ (or the overall best solution generated by the algorithm) as the approximate solution to the optimization problem. Otherwise, increase $t$ by 1 and return to Step 2.

Smoothing, as in Step 5, is often crucial to prevent premature shrinking of the sampling distribution. Instead of using a single smoothing factor, it is often useful to use separate smoothing factors for $\hat\mu_t$ and $\hat\sigma_t$ (see (2.36)). In particular, it is suggested in [108] to use the following dynamic smoothing for $\hat\sigma_t$:

(2.37)  $\alpha_t = \beta - \beta \left( 1 - \frac{1}{t} \right)^{q},$

where $q$ is an integer (typically between 5 and 10) and $\beta$ is a smoothing constant (typically between 0.8 and 0.99). Finally, a significant speed-up can be achieved by using a parallel implementation of CE [46].

Another approach is to inject extra variance into the sampling distribution, for example by increasing the components of $\hat\sigma_t$, once the distribution has degenerated; see the examples below and [11].

Example 2.5 The Peaks Function
MATLAB's peaks function,

$S(x) = 3 (1 - x_1)^2 e^{-x_1^2 - (x_2 + 1)^2} - 10 \left( \frac{x_1}{5} - x_1^3 - x_2^5 \right) e^{-x_1^2 - x_2^2} - \frac{1}{3} e^{-(x_1 + 1)^2 - x_2^2},$

has various local maxima. In this case, implementing the CE Algorithm 2.4, we found that the global maximum of this function is approximately $\gamma^* \approx 8.1$ and is attained at $x^* \approx (-0.0093,\ 1.5814)$. The choice of the initial value for $\mu$ is not important, but the initial standard deviations should be chosen large enough to ensure an initially “uniform” sampling of the region of interest. The CE algorithm is stopped when all standard deviations of the sampling distribution are less than some small $\varepsilon$.

Figure 2.6 gives the evolution of the worst and the best of the elite samples, that is, $\hat\gamma_t$ and $S_{(N)}^{(t)}$, for each iteration $t$. We see that the values quickly converge to the optimal value $\gamma^*$.
Figure 2.6 Evolution of the CE algorithm for the peaks function.
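A compact implementation sketch (ours) of Algorithm 2.4 applied to the peaks function, using a constant smoothing parameter instead of the dynamic rule (2.37):

```python
import numpy as np

def peaks(x1, x2):
    """MATLAB-style peaks function used as the objective."""
    return (3 * (1 - x1)**2 * np.exp(-x1**2 - (x2 + 1)**2)
            - 10 * (x1 / 5 - x1**3 - x2**5) * np.exp(-x1**2 - x2**2)
            - np.exp(-(x1 + 1)**2 - x2**2) / 3)

def ce_continuous(score, mu, sigma, N=100, Ne=10, alpha=0.8, eps=1e-5, rng=None):
    """Algorithm 2.4: normal updating with constant smoothing."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    while sigma.max() >= eps:                          # Step 6: stopping rule
        X = rng.normal(mu, sigma, size=(N, len(mu)))   # Step 2: draw
        S = score(X[:, 0], X[:, 1])
        elite = X[np.argsort(S)[-Ne:]]                 # Step 3: Ne best samples
        mu = alpha * elite.mean(axis=0) + (1 - alpha) * mu        # (2.34), (2.36)
        sigma = alpha * elite.std(axis=0) + (1 - alpha) * sigma   # (2.35), (2.36)
    return mu, score(mu[0], mu[1])

print(ce_continuous(peaks, mu=[0.0, 0.0], sigma=[3.0, 3.0]))
```

With large initial standard deviations, as recommended above, the run reliably reaches the global maximizer rather than one of the lower local maxima.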
Remark 2.8 Injection
When using the CE method to solve practical optimization problems with many constraints and many local optima, it is sometimes necessary to prevent the sampling distribution from shrinking too quickly. A simple but effective approach is the following injection method [107]. Let $S_t^*$ denote the best performance found at the $t$-th iteration and (in the normal case) $\sigma_t^*$ the largest standard deviation at the $t$-th iteration. If $\sigma_t^*$ is sufficiently small and $|S_t^* - S_{t-1}^*|$ is also small, then add some small value to each standard deviation, for example a constant $\delta$ or the value $c\,|S_t^* - S_{t-1}^*|$, for some fixed $\delta$ and $c$. When using CE with injection, a possible stopping criterion is to stop when a certain number of injections is reached.

2.5 Noisy Optimization

One of the distinguishing features of the CE Algorithm 2.2 is that it can easily handle noisy optimization problems, that is, problems in which the objective function $S(x)$ is corrupted with noise. We denote such a noisy function by $\hat S(x)$ and assume that for each $x$ we can readily obtain an outcome of $\hat S(x)$, for example via generation of some additional random vector $Y$, whose distribution may depend on $x$.

A classical example of noisy optimization is simulation-based optimization [107]. A typical instance is the buffer allocation problem (BAP). The objective in the BAP is to allocate $m$ buffer spaces among the $n - 1$ “niches” (storage areas) between $n$ machines in a serial production line, so as to optimize some performance measure, such as the steady-state throughput. This performance measure is typically not available analytically and thus must be estimated via simulation. A detailed description of the BAP, with an application of CE, is given in [107].

Another example is the noisy traveling salesman problem (TSP), where, say, the cost matrix $C = (C_{ij})$, denoted now by $\hat C = (\hat C_{ij})$, is random. Think of $\hat C_{ij}$ as the “random” time to travel from city $i$ to city $j$.

The total cost of a tour $x = (x_1, \ldots, x_n)$ is given by

(2.38)  $\hat S(x) = \sum_{i=1}^{n-1} \hat C_{x_i, x_{i+1}} + \hat C_{x_n, x_1}.$

We assume that $\mathbb{E}[\hat C_{ij}] = C_{ij}$, so that $\mathbb{E}[\hat S(x)] = S(x)$, the deterministic tour cost.
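In code, an outcome of the noisy tour cost (2.38) might be generated as follows (our sketch, using the uniform noise model of Example 2.6 below):

```python
import numpy as np

def noisy_tour_cost(tour, C, rng):
    """Noisy total cost (2.38): each leg's cost is drawn uniformly on
    (0, 2*C[i, j]), so its expectation equals the deterministic cost C[i, j]."""
    legs = list(zip(tour, np.roll(tour, -1)))    # consecutive cities, wrapping around
    return sum(rng.uniform(0.0, 2.0 * C[i, j]) for i, j in legs)
```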

The main CE optimization Algorithm 2.2 for deterministic functions $S(x)$ is also valid for noisy ones $\hat S(x)$. Extensive numerical studies [107] with the noisy version of Algorithm 2.2 show that it works well, because during the course of the optimization it efficiently filters out the noise component of $\hat S(x)$.

We applied our CE Algorithm 2.2 to both deterministic and noisy TSPs. We selected a number of benchmark problems from the TSP library (www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp/). In all cases, the same set of CE parameters was chosen (sample size $N$, rarity parameter $\varrho$, and smoothing parameter $\alpha$). Table 2.6 presents the performance of the deterministic version of Algorithm 2.2 for several TSPs from this library. To study the variability in the solutions, each problem was repeated 10 times. In Table 2.6, “min,” “mean,” and “max” denote the smallest (that is, best), average, and largest of the 10 estimates of the optimal value. The best known solution is denoted by $\gamma^*$.

Table 2.6 Case studies for the TSP

The average CPU time in seconds and the average number of iterations are given in the last two columns. The size of the problem (number of nodes) is indicated in its name; for example, st70 has $n = 70$ nodes.

Example 2.6 Noisy TSP
Suppose that in the first test case of Table 2.6, burma14, some uniform noise is added to the cost matrix. In particular, suppose that the cost of traveling from $i$ to $j$ is given by $\hat C_{ij} \sim \mathsf{U}(0, 2 c_{ij})$, where $c_{ij}$ is the cost for the deterministic case. The expected cost is thus $\mathbb{E}[\hat C_{ij}] = c_{ij}$, and the total cost $\hat S(x)$ of a tour $x$ is given by (2.38). The CE algorithm for optimizing the unknown $S(x)$ remains exactly the same as in the deterministic case, except that $S(x)$ is replaced with $\hat S(x)$ and a different stopping criterion than (2.21) needs to be employed. A simple rule is to stop when the transition probabilities $\hat p_{t,ij}$ satisfy $\min\{\hat p_{t,ij},\ 1 - \hat p_{t,ij}\} < \varepsilon$ for all $i$ and $j$. An alternative stopping rule, taken from [107], is as follows.
2.5.1 Stopping Criterion
To identify the stopping time $T$, denote by $\hat\gamma_t$ and $\tilde\gamma_t$ the worst of the elite samples (the sample $(1-\varrho)$-quantiles) for the deterministic and noisy case, respectively, and consider the following moving-average process:

(2.39)  $B_t(s) = \frac{1}{s} \sum_{r = t - s + 1}^{t} \tilde\gamma_r, \qquad t = s, s+1, \ldots,$

where $s$ is fixed, say $s = 10$. Define next

(2.40)  $B_t^{-}(s, r) = \min_{t - r + 1 \le j \le t} B_j(s)$

and

(2.41)  $B_t^{+}(s, r) = \max_{t - r + 1 \le j \le t} B_j(s),$

respectively, where $r$ is fixed, say $r = 5$. Clearly, for growing $t$ and moderate $s$ and $r$, we may expect that $B_t^{+}(s, r) - B_t^{-}(s, r) \approx 0$, provided the variance of $\tilde\gamma_t$ is bounded.

With this at hand, we define the stopping criterion as

(2.42)  $T = \min\left\{ t \ge s + r : \frac{B_t^{+}(s, r) - B_t^{-}(s, r)}{B_t^{-}(s, r)} \le \varepsilon \right\},$

where $s$ and $r$ are fixed and $\varepsilon$ is a small number, say $\varepsilon = 0.01$.
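A sketch (ours) of this test in Python; it maintains the moving averages of (2.39) over the sequence of observed worst-elite levels and checks the relative fluctuation as in (2.42):

```python
import numpy as np

def stalled(levels, s=10, r=5, eps=0.01):
    """Moving-average stopping test in the spirit of (2.39)-(2.42). `levels`
    holds the worst-elite values, one per iteration, observed so far."""
    if len(levels) < s + r:
        return False
    B = [np.mean(levels[t - s + 1: t + 1]) for t in range(s - 1, len(levels))]
    window = B[-r:]                                   # the last r moving averages
    return (max(window) - min(window)) / abs(min(window)) <= eps
```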

Figure 2.7 displays the evolution of the worst of the elite samples (the sample $(1-\varrho)$-quantiles) for both the deterministic and the noisy case, denoted by $\hat\gamma_t$ and $\tilde\gamma_t$, respectively. We see in both cases a similar rapid drop in the level. It is important to note, however, that even though the algorithm converges to the optimal solution in both cases, the $\tilde\gamma_t$ for the noisy case do not converge to $\gamma^*$, in contrast to the $\hat\gamma_t$ for the deterministic case. This is because the latter eventually estimates the $(1-\varrho)$-quantile of the deterministic $S(X)$, whereas the former estimates the $(1-\varrho)$-quantile of $\hat S(X)$, which remains random. To estimate $\gamma^*$ in the noisy case, one needs to take the sample average of repeated outcomes of $\hat S(x_T)$, where $x_T$ is the solution found at the final iteration.
Figure 2.7 Evolution of the worst of the elite samples for a deterministic and noisy TSP.