Agent-centered search methods have to ensure that they do not cycle without making progress toward a goal state. This is a potential problem because they execute actions before their consequences are completely known. Agent-centered search methods also have to ensure both that it remains possible to achieve the goal and that they eventually do so. The goal remains achievable if no actions exist whose execution makes it impossible to reach a goal state, if the agent-centered search methods can avoid executing such actions in case they do exist, or if the agent-centered search methods have the ability to reset the agent into the start state. Real-time search methods are agent-centered search methods that store a value, called the h-value, in memory for each state that they encounter during the search and update these values as the search progresses, both to focus the search and to avoid cycling; this updating accounts for a large chunk of their search time per search episode.
11.1. LRTA*
Learning real-time A* (LRTA*) is probably the most popular real-time search method, and we relate all real-time search methods in this chapter to it. The h-values of LRTA* approximate the goal distances of the states. They can be initialized using a heuristic function; they can be all zero if more informed h-values are not available.
Figure 11.1 illustrates the behavior of LRTA* using a simplified goal-directed navigation task in known terrain without uncertainty about the start cell. The robot can move one cell to the north, east, south, or west (unless that cell is blocked). All action costs are one. The task of the robot is to navigate to the given goal cell and then stop. In this case, the states correspond to cells, and the current state corresponds to the current cell of the robot. We assume that there is no uncertainty in actuation and sensing. The h-values are initialized with the Manhattan distances. A robot under time pressure could reason as follows: Its current cell C1 is not a goal cell. Thus, it needs to move to one of the cells adjacent to its current cell to get to a goal cell. The cost-to-go of an action is its action cost plus the estimated goal distance of the successor state reached when executing the action, as given by the h-value of the successor state. If it moves to cell B1, then the action cost is one and the estimated goal distance from there is four, as given by the h-value of cell B1. The cost-to-go of moving to cell B1 is thus five. Similarly, the cost-to-go of moving to cell C2 is three. Thus, it looks more promising to move to cell C2.
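To make the arithmetic concrete, here is a minimal Python sketch of this cost-to-go computation (the h-values are the ones from Figure 11.1; only the two cell names are assumed inputs):

```python
# Cost-to-go of an action = action cost + h-value of the successor state.
h = {"B1": 4, "C2": 2}   # h-values of the cells adjacent to C1 (Figure 11.1)
action_cost = 1          # all action costs are one

cost_to_go = {cell: action_cost + h[cell] for cell in h}
print(cost_to_go)                           # {'B1': 5, 'C2': 3}
print(min(cost_to_go, key=cost_to_go.get))  # C2 -- the more promising move
```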
Figure 11.2 visualizes the h-value surface formed by the initial h-values. However, a robot does not reach the goal cell if it always executes the move with the minimal cost-to-go and thus performs steepest descent on the initial h-value surface. It moves back and forth between cells C1 and C2 and thus cycles forever due to the local minimum of the h-value surface at cell C2. We could avoid this problem by randomizing the action-selection process slightly, possibly together with resetting the robot into a start state (random restart) after the execution cost has become large. LRTA*, however, avoids this problem by increasing the h-values to fill the local minima in the h-value surface.
Figure 11.3 shows how LRTA* performs a search around the current state of the robot to determine which action to execute next if it breaks ties among actions in the following order: north, east, south, and west. It operates as follows:
Local Search Space–Generation Step: LRTA* decides on the local search space. The local search space can be any set of nongoal states that contains the current state. We say that a local search space is minimal if and only if it contains only the current state. We say that it is maximal if and only if it contains all nongoal states. In Figure 11.1, for example, the local search spaces are minimal. In this case, LRTA* can construct a search tree around the current state. The local search space consists of all nonleaves of the search tree. Figure 11.3 shows the search tree for deciding which action to execute in the start cell C1.
Value-Update Step: LRTA* assigns each state in the local search space its correct goal distance under the assumption that the h-values of the states just outside of the local search space correspond to their correct goal distances. In other words, it assigns each state in the local search space the minimum of the execution cost for getting from it to a state just outside of the local search space, plus the estimated remaining execution cost for getting from there to a goal state, as given by the h-value of the state just outside of the local search space. The reason for this is that the local search space does not include any goal state. Thus, a minimal-cost path from a state in the local search space to a goal state has to contain a state just outside of the local search space, and the smallest estimated cost of all paths from the state via a state just outside of the local search space to a goal state is an estimate of the goal distance of the state. Since this lookahead value is a more accurate estimate of the goal distance of the state in the local search space, LRTA* stores it in memory, overwriting the existing h-value of the state.
In the example, the local search space is minimal, and LRTA* can simply update the h-value of the state in the local search space according to the rule that sets the h-value of the current state to the maximum of its previous h-value and the minimal cost-to-go of the actions that can be executed in it, provided that it ignores all actions that can leave the current state unchanged. LRTA* first assigns each leaf of the search tree the h-value of the corresponding state: the leaf that represents cell B1 is assigned an h-value of four, and the leaf that represents cell C2 is assigned an h-value of two. This step is marked ① in Figure 11.3. The new h-value of the root of the search tree, which represents cell C1, then is the minimum of the costs-to-go of the actions that can be executed in it, that is, the minimum of five and three (marked ②). This h-value is then stored in memory for cell C1.
Figure 11.4 shows the result of one value-update step for a different example where the local search space is nonminimal.
Action-Selection Step: LRTA* selects an action for execution that is the beginning of a path that promises to minimize the execution cost from the current state to a goal state (ties can be broken arbitrarily). In the example, LRTA* selects the action with the minimal cost-to-go. Since moving to cell B1 has cost-to-go five and moving to cell C2 has cost-to-go three, LRTA* decides to move to cell C2.
Action-Execution Step: LRTA* executes the selected action, updates the state of the robot, and repeats the process from the new state of the robot until the robot reaches a goal state. If its new state is outside of the local search space, then it needs to repeat the local search space–generation and value-update steps. Otherwise, it can repeat these steps or proceed directly to the action-selection step. Executing more actions before generating the next local search space typically results in a smaller search cost (because LRTA* needs to search less frequently); executing fewer actions typically results in a smaller execution cost (because LRTA* then selects actions based on more information).
The left column of Figure 11.1 shows the result of the first couple of steps of LRTA* for the example. The values in grey cells are the new h-values calculated by the value-update step because the corresponding cells are part of the local search space. The robot reaches the goal cell after nine action executions; that is, with execution cost nine. We assumed here that the terrain is known and the reason for using real-time search is time pressure. Another reason for using real-time search could be missing knowledge of the terrain and the resulting desire to restrict the local search spaces to the known parts of the terrain.
We now formalize LRTA*, using the following assumptions and notation for deterministic and nondeterministic search tasks, which will be used throughout this chapter: S denotes the finite set of states of the state space, s ∈ S the start state, and T ⊆ S the set of goal states. A(u) denotes the finite, nonempty set of (potentially nondeterministic) actions that can be executed in state u ∈ S. w(u, a) > 0 denotes the action cost that results from the execution of action a ∈ A(u) in state u ∈ S. We assume that all action costs are one unless stated otherwise. wmin = min_{u ∈ S, a ∈ A(u)} w(u, a) denotes the minimal action cost of any action. Succ(u, a) ⊆ S denotes the set of successor states that can result from the execution of action a ∈ A(u) in state u ∈ S. succ(u, a) ∈ Succ(u, a) denotes the state that results from an actual execution of action a ∈ A(u) in state u ∈ S. In deterministic state spaces, Succ(u, a) contains only one state, and we use succ(u, a) also to denote this state. Thus, Succ(u, a) = {succ(u, a)} in this case. An agent starts in the start state and has to move to a goal state. The agent always observes what its current state is and then has to select and execute its next action, which results in a state transition to one of the possible successor states. The search task is solved when the agent reaches a goal state. We denote the number of states by n = |S| and the number of state-action pairs (loosely called actions) by e = Σ_{u ∈ S} |A(u)|; that is, an action that is applicable in more than one state counts more than once. Moreover, δ(u, T) ≥ 0 denotes the goal distance of state u ∈ S; that is, the minimal execution cost with which a goal state can be reached from state u. The depth d of the state space is its maximal goal distance, d = max_{u ∈ S} δ(u, T). The expression arg min_{x ∈ X} f(x) returns one of the elements x ∈ X that minimize f(x); that is, one of the elements from the set {x ∈ X : f(x) = min_{x′ ∈ X} f(x′)}. We assume that all search tasks are deterministic unless stated otherwise, and discuss nondeterministic state spaces in Sections 11.5.7 and 11.6.4.
Algorithm 11.1 shows pseudo code for LRTA* in deterministic state spaces. It associates a nonnegative h-value h(u) with each state u ∈ S. In practice, the h-values are not initialized up front but rather are initialized as needed. LRTA* consists of a termination-checking step, a local search space–generation step, a value-update step, an action-selection step, and an action-execution step. LRTA* first checks whether it has already reached a goal state and thus can terminate successfully. If not, it generates the local search space Slss ⊆ S. Although we require only that u ∈ Slss and Slss ∩ T = ∅, in practice LRTA* often uses forward search (that searches from the current state to the goal states) to select a contiguous part of the state space around the current state u of the agent. LRTA* could determine the local search space, for example, with a breadth-first search around state u up to a given depth or with a (forward) A* search around state u until a given number of states have been expanded (or, in both cases, until a goal state is about to be expanded). The states in the local search space correspond to the nonleaves of the corresponding search tree and thus are all nongoal states. In Figure 11.1, a (forward) A* search that expands two states picks a local search space that consists of the expanded states C1 and C2. LRTA* then updates the h-values of all states in the local search space. Based on these h-values, LRTA* decides which action to execute next. Finally, it executes the selected action, updates its current state, and repeats the process.
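The following is a minimal Python sketch of this control loop. The callback names (generate_lss, value_update_step, actions, succ, w) are our own placeholders for the components just described, not a fixed API; value_update_step is sketched after the discussion of Algorithm 11.2 below.

```python
def lrta_star(start, goals, actions, succ, w, h, generate_lss, value_update_step):
    """Sketch of LRTA* for deterministic state spaces (Algorithm 11.1).

    h maps states to admissible h-values and is updated in place;
    generate_lss(u) returns a set of nongoal states that contains u.
    """
    u = start
    while u not in goals:                            # termination-checking step
        lss = generate_lss(u)                        # local search space-generation step
        value_update_step(lss, h, actions, succ, w)  # value-update step
        while True:                                  # repeat-until loop
            # action-selection step: minimize the cost-to-go w(u, a) + h(succ(u, a))
            a = min(actions(u), key=lambda a: w(u, a) + h[succ(u, a)])
            u = succ(u, a)                           # action-execution step
            if u not in lss:   # goal states are never part of the local search space
                break          # (break unconditionally for the version
                               #  without repeat-until loop)
    return u
```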
Algorithm 11.2 shows how LRTA* updates the h-values in the local search space using a variant of Dijkstra's algorithm. It assigns each state its goal distance under the assumption that the h-values of all states in the local search space do not overestimate the correct goal distances and the h-values of all states outside of the local search space correspond to their correct goal distances. It does this by first assigning infinity to the h-values of all states in the local search space. It then determines a state in the local search space whose h-value is still infinity and that minimizes the maximum of its previous h-value and the minimum, over all actions in the state, of the cost-to-go of the action. This value then becomes the h-value of this state, and the process repeats. The way the h-values are updated ensures that the states in the local search space are updated in order of their increasing new h-values, which ensures that the h-value of each state in the local search space is updated at most once. The method terminates when either the h-value of each state in the local search space has been assigned a finite value or an h-value would be assigned the value infinity. In the latter case, the h-values of all remaining states in the local search space would be assigned the value infinity as well, which is already their current value.
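The sketch below mirrors this description directly; it is a quadratic-time version for readability, whereas a practical implementation would use a priority queue as in Dijkstra's algorithm. The h-value table is assumed to also hold values for the successor states just outside of the local search space.

```python
import math

def value_update_step(lss, h, actions, succ, w):
    """Dijkstra-like value-update step of LRTA* (sketch of Algorithm 11.2)."""
    h_old = {u: h[u] for u in lss}
    for u in lss:
        h[u] = math.inf          # states in the local search space start at infinity
    while True:
        best_u, best_val = None, math.inf
        for u in lss:
            if h[u] < math.inf:
                continue         # each h-value is updated at most once
            # maximum of the previous h-value and the minimal cost-to-go
            val = max(h_old[u],
                      min(w(u, a) + h[succ(u, a)] for a in actions(u)))
            if val < best_val:
                best_u, best_val = u, val
        if best_u is None or best_val == math.inf:
            break                # remaining states would keep the value infinity
        h[best_u] = best_val     # states are assigned in order of increasing h-value
```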
Lemma 11.1
For all times t = 0, 1, 2, … (until termination): Consider the (t + 1)st value-update step of LRTA* (i.e., call to procedure Value-Update-Step in Algorithm 11.1). Let Slss refer to its local search space. Let h and h′ refer to the h-values immediately before and after, respectively, the value-update step (with h′(v) = h(v) for all states v ∉ Slss). Then, for all states u ∈ Slss, the value-update step terminates with h′(u) = max(h(u), min_{a ∈ A(u)} (w(u, a) + h′(succ(u, a)))).
Proof
By induction on the number of iterations within the value-update step (see Exercises).
Lemma 11.2
Admissible initial h-values (that is, initial h-values that are lower bounds on the corresponding goal distances) remain admissible after every value-update step of LRTA* and are monotonically nondecreasing. Similarly, consistent initial h-values (i.e., initial h-values that satisfy the triangle inequality) remain consistent after every value-update step of LRTA* and are monotonically nondecreasing.
Proof
By induction on the number of value-update steps, using Lemma 11.1 (see Exercises).
After the value-update step has updated the h-values, LRTA* greedily chooses the action a = arg min_{a ∈ A(u)} (w(u, a) + h(succ(u, a))) in the current nongoal state u with the minimal cost-to-go w(u, a) + h(succ(u, a)) for execution to minimize the estimated execution cost to a goal state. Ties can be broken arbitrarily, although we explain later that tie breaking can be important. Then, LRTA* has a choice. It can generate another local search space, update the h-values of all states that it contains, and select another action for execution. This version of LRTA* results when the repeat-until loop in Algorithm 11.1 is executed only once, which is why we refer to this version as LRTA* without repeat-until loop. If the new state is still part of the local search space (the one that was used to determine the action whose execution resulted in the new state), LRTA* can also select another action for execution based on the current h-values. This version of LRTA* results when the repeat-until loop in Algorithm 11.1 is executed until the new state is no longer part of the local search space (as shown in the pseudo code), which is why we refer to this version as LRTA* with repeat-until loop. It searches less often and thus tends to have smaller search costs but larger execution costs. We analyze LRTA* without repeat-until loop, which is possible because LRTA* with repeat-until loop is a special case of LRTA* without repeat-until loop: After LRTA* has used the value-update step on some local search space, the h-values do not change if LRTA* uses the value-update step again on the same local search space or a subset thereof. Whenever LRTA* repeats the body of the repeat-until loop, the new current state is still part of the local search space Slss and thus not a goal state. Consequently, LRTA* could now search a subset of Slss that includes the new current state, for example, a minimal local search space, which does not change the h-values and thus can be skipped.
11.2. LRTA* with Lookahead One
We now state a variant of LRTA* with minimal local search spaces, the way it is often stated in the literature. Algorithm 11.3 shows the pseudo code of LRTA* with lookahead one. Its action-selection step and value-update step can be explained as follows. The action-selection step greedily chooses the action a = arg min_{a ∈ A(u)} (w(u, a) + h(succ(u, a))) in the current nongoal state u with the minimal cost-to-go w(u, a) + h(succ(u, a)) for execution to minimize the estimated execution cost to a goal state. (The cost-to-go w(u, a) + h(succ(u, a)) is basically the f-value of state succ(u, a) of an A* search from u toward the goal states.) The value-update step replaces the h-value of the current state with a more accurate estimate of the goal distance of the state based on the costs-to-go of the actions that can be executed in it, which is similar to temporal difference learning in reinforcement learning. If all h-values are admissible, then both h(u) and the minimum of the costs-to-go of the actions that can be executed in state u are lower bounds on its goal distance, and the larger of these two values is the more accurate estimate. The value-update step then replaces the h-value of state u with this value, h(u) ← max(h(u), min_{a ∈ A(u)} (w(u, a) + h(succ(u, a)))). The value-update step of LRTA* is sometimes stated as h(u) ← min_{a ∈ A(u)} (w(u, a) + h(succ(u, a))). Our slightly more complex variant guarantees that the h-values are nondecreasing. Since the h-values remain admissible and larger admissible h-values tend to guide the search better than smaller admissible h-values, there is no reason to decrease them. If the h-values are consistent, then both value-update steps are equivalent. Thus, LRTA* approximates Bellman's optimality condition for minimal-cost paths in a way similar to Dijkstra's algorithm. However, the h-values of LRTA* with admissible initial h-values approach the goal distances from below and are monotonically nondecreasing, while the corresponding values of Dijkstra's algorithm approach the goal distances from above and are monotonically nonincreasing.
LRTA* with lookahead one and LRTA* with minimal local search spaces behave identically in state spaces where the execution of every action in a nongoal state necessarily results in a state change, that is, cannot leave the current state unchanged. In general, actions that are not guaranteed to result in a state change can safely be deleted from state spaces because there always exists a path that does not include them (such as the minimal-cost path), if there exists a path at all. LRTA* with arbitrary local search spaces, including minimal and maximal local search spaces, never executes actions whose execution can leave the current state unchanged, but LRTA* with lookahead one can execute them. To make the two versions identical, one can change the action-selection step of LRTA* with lookahead one to a = arg min_{a ∈ A(u), succ(u, a) ≠ u} (w(u, a) + h(succ(u, a))), that is, to consider only those actions whose execution is guaranteed to change the current state. However, this is seldom done in the literature.
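A compact sketch of this variant, including the modified action-selection step that considers only state-changing actions (the function names follow the sketches above):

```python
def lrta_star_lookahead_one(start, goals, actions, succ, w, h):
    """Sketch of LRTA* with lookahead one (Algorithm 11.3)."""
    u = start
    while u not in goals:
        # value-update step: h(u) <- max(h(u), min_a (w(u, a) + h(succ(u, a))))
        h[u] = max(h[u], min(w(u, a) + h[succ(u, a)] for a in actions(u)))
        # modified action-selection step: skip actions that cannot change the state
        candidates = [a for a in actions(u) if succ(u, a) != u]
        a = min(candidates, key=lambda a: w(u, a) + h[succ(u, a)])
        u = succ(u, a)           # action-execution step
    return u
```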
In the following, we refer to LRTA* as shown in Algorithm 11.1, and not to LRTA* with lookahead one as shown in Algorithm 11.3, when we analyze the execution cost of LRTA*.
11.3. Analysis of the Execution Cost of LRTA*
A disadvantage of LRTA* is that it cannot solve all search tasks. This is so because it interleaves searches and action executions. All search methods can solve only search tasks for which the goal distance of the start state is finite. Interleaving searches and action executions limits the solvable search tasks further because actions are executed before their complete consequences are known. Thus, even if the goal distance of the start state is finite, it is possible that LRTA* accidentally executes actions that lead to a state with infinite goal distance, such as actions that "blow up the world," at which point the search task becomes unsolvable for the agent. However, LRTA* is guaranteed to solve all search tasks in safely explorable state spaces. State spaces are safely explorable if and only if the goal distances of all states are finite; that is, the depth is finite. (For safely explorable state spaces where all action costs are one, it holds that d ≤ n − 1.) To be precise: First, all states of the state space that cannot possibly be reached from the start state, or can be reached from the start state only by passing through a goal state, can be deleted. The goal distances of all remaining states have to be finite. Safely explorable state spaces guarantee that LRTA* is able to reach a goal state no matter which actions it has executed in the past. Strongly connected state spaces (where every state can be reached from every other state), for example, are safely explorable. In state spaces that are not safely explorable, LRTA* either stops in a goal state or reaches a state with goal distance infinity and then executes actions forever. We could modify LRTA* to use information from its local search spaces to detect that it can no longer reach a goal state (e.g., because the h-values have increased substantially), but this is complicated and seldom done in the literature. In the following, we assume that the state spaces are safely explorable.
LRTA* always reaches a goal state with a finite execution cost in all safely explorable state spaces, as can be shown by contradiction (the cycle argument): If LRTA* did not reach a goal state eventually, it would execute actions forever. In this case, there is a time t after which LRTA* visits only those states again that it visits infinitely often; it cycles in part of the state space. Since the state space is safely explorable, there must be some way out of the cycle. We show that LRTA* eventually executes an action that takes it out of the cycle, which is a contradiction: The h-values of the states in the cycle increase beyond any bound, since LRTA* repeatedly increases the h-value of the state with the minimal h-value in the cycle by at least the minimal action cost wmin of any action. Whenever it gets into a state u with the minimal h-value in the cycle, all of its successor states have h-values that are no smaller than its own h-value. Let h denote the h-values before the value-update step and h′ denote the h-values after the value-update step. According to Lemma 11.1, the h-value of state u is then set to h′(u) = max(h(u), min_{a ∈ A(u)} (w(u, a) + h′(succ(u, a)))) ≥ wmin + h(u). In particular, the h-values of the states in the cycle rise by arbitrarily large amounts above the h-values of all states that border the cycle. Such states exist since a goal state can be reached from every state in safely explorable state spaces. Then, however, LRTA* is forced to visit such a state after time t and leave the cycle, which is a contradiction.
The performance of LRTA* is measured by its execution cost until it reaches a goal state. The complexity of LRTA* is its worst-case execution cost over all possible topologies of state spaces of the same sizes, all possible start and goal states, and all tie-breaking rules among indistinguishable actions. We are interested in determining how the complexity scales as the state spaces get larger. We measure the sizes of state spaces as nonnegative integers and use measures x, such as x = nd, the product of the number of states and the depth. An upper complexity bound O(x) has to hold for every x that is larger than some constant. Since we are mostly interested in the general trend (but not outliers) for the lower complexity bound, a lower complexity bound Ω(x) has to hold only for infinitely many different x. Furthermore, we vary only x. If x is a product, we do not vary both of its factors independently. This is sufficient for our purposes. A tight complexity bound Θ(x) means both O(x) and Ω(x). To be able to express the complexity in terms of the number of states only, we often make the assumption that the state spaces are reasonable. Reasonable state spaces are safely explorable state spaces with e = O(n²) (or, more generally, state spaces of which the number of actions does not grow faster than the number of states squared). For example, safely explorable state spaces where the execution of different actions in the same state results in different successor states are reasonable. For reasonable state spaces where all action costs are one, it holds that nd = O(n²) and ed = O(n³). We also study Eulerian state spaces. A Eulerian state space is a state space where there are as many actions that leave a state as there are actions that enter it. For example, undirected state spaces are Eulerian. An undirected state space is a state space where every action in a state u whose execution results in a particular successor state v has a unique corresponding action in state v whose execution results in state u.
11.3.1. Upper Bound on the Execution Cost of LRTA*
In this section, we provide an upper bound on the complexity of LRTA* without repeat-until loop. Our analysis is centered around the invariant from Lemma 11.3. The time superscript t refers to the values of the variables immediately before the (t + 1)st value-update step of LRTA*. For instance, u⁰ denotes the start state s, and aᵗ denotes the action executed directly after the (t + 1)st value-update step. Similarly, hᵗ denotes the h-values before the (t + 1)st value-update step and hᵗ⁺¹ denotes the h-values after the (t + 1)st value-update step. In the following, we prove an upper bound on the execution cost after which LRTA* is guaranteed to reach a goal state in safely explorable state spaces.
Lemma 11.3
For all times t = 0, 1, 2, … (until termination) it holds that the execution cost of LRTA* with admissible initial h-values h⁰ at time t is at most Σ_{u ∈ S} hᵗ(u) − Σ_{u ∈ S} h⁰(u) + h⁰(u⁰) − hᵗ(uᵗ). (Sums have a higher precedence than other operators. For example, Σ_{u ∈ S} hᵗ(u) − hᵗ(uᵗ) means (Σ_{u ∈ S} hᵗ(u)) − hᵗ(uᵗ).)
Proof
By induction: The h-values are admissible at time t according to Lemma 11.2. Thus, they are bounded from above by the goal distances, which are finite since the state space is safely explorable. For t = 0, the execution cost and its upper bound are both zero, and the lemma thus holds. Now assume that the lemma holds at time t. The execution cost increases by w(uᵗ, aᵗ), and the upper bound increases by Σ_{u ∈ S} hᵗ⁺¹(u) − Σ_{u ∈ S} hᵗ(u) + hᵗ(uᵗ) − hᵗ⁺¹(uᵗ⁺¹) ≥ hᵗ⁺¹(uᵗ) − hᵗ(uᵗ) + hᵗ(uᵗ) − hᵗ⁺¹(uᵗ⁺¹) = hᵗ⁺¹(uᵗ) − hᵗ⁺¹(uᵗ⁺¹) ≥ w(uᵗ, aᵗ), where the first inequality holds because the h-values are monotonically nondecreasing (Lemma 11.2) and the second inequality follows from Lemma 11.1 and the action-selection step. The lemma thus continues to hold.
Theorem 11.1
(Completeness of LRTA*) LRTA* with admissible initial h-values h⁰ reaches a goal state with an execution cost of at most Σ_{u ∈ S} (δ(u, T) − h⁰(u)) + h⁰(s).
Proof
According to Lemma 11.3, the execution cost at every time t is at most Σ_{u ∈ S} hᵗ(u) − Σ_{u ∈ S} h⁰(u) + h⁰(u⁰) − hᵗ(uᵗ) ≤ Σ_{u ∈ S} (δ(u, T) − h⁰(u)) + h⁰(s), since the h-values remain admissible according to Lemma 11.2 (thus hᵗ(u) ≤ δ(u, T)), u⁰ = s, and hᵗ(uᵗ) ≥ 0.
Since the goal distances are finite in safely explorable state spaces and the minimal action cost wmin of any action is strictly positive, Theorem 11.1 shows that LRTA* with admissible initial h-values reaches a goal state after a bounded number of action executions in safely explorable state spaces; that is, it is complete. More precisely: LRTA* reaches a goal state with an execution cost of at most Σ_{u ∈ S} δ(u, T) (since the initial h-values are nonnegative), and thus after at most (Σ_{u ∈ S} δ(u, T)) / wmin action executions. Thus, it reaches a goal state with finite execution cost in safely explorable state spaces. One consequence of this result is that search tasks in state spaces where all states are clustered around the goal states are easier to solve with LRTA* than search tasks in state spaces that do not possess this property. Consider, for example, sliding-tile puzzles, which are sometimes considered to be hard search tasks because they have a small goal density. The Eight-Puzzle has 181,440 states but only one goal state. However, the average goal distance of the Eight-Puzzle (with the goal configuration where the tiles form a ring around the center) is only 21.5 and its maximal goal distance is only 30. This implies that LRTA* never moves far away from the goal state even if it makes a mistake and executes an action that does not decrease the goal distance. This makes the Eight-Puzzle easy to search with LRTA* relative to other state spaces with the same number of states.
11.3.2. Lower Bound on the Execution Cost of LRTA*
LRTA* reaches a goal state with an execution cost of at most Σ_{u ∈ S} δ(u, T), and it holds that Σ_{u ∈ S} δ(u, T) ≤ nd in safely explorable state spaces where all action costs are one. Now assume that LRTA* with minimal local search spaces is zero-initialized, which implies that it is uninformed. In the following, we show that the resulting upper complexity bound O(nd) is then tight for infinitely many n.
Figure 11.5 shows an example for which the execution cost of zero-initialized LRTA* with minimal local search spaces is Ω(n²) in the worst case until it reaches a goal state. The upper part of the figure shows the state space. All action costs are one. The states are annotated with their goal distances, their initial h-values, and their names. The lower part of the figure shows the behavior of LRTA*. On the right, the figure shows the state space with the h-values after the value-update step but before the action-execution step. The current state is shown in bold. On the left, the figure shows the searches that resulted in the h-values shown on the right. Again, the states are annotated with their h-values after the value-update step but before the action-execution step. The current state is on top. Ellipses show the local search spaces, and dashed lines show the actions that LRTA* is about to execute. For the example search task, after LRTA* has visited a state for the first time, it has to move through all previously visited states again before it is able to visit another state for the first time. Thus, the execution cost is quadratic in the number of states. If LRTA* breaks ties in favor of successor states with smaller indices, then its execution cost f(n) until it reaches the goal state satisfies the recursive equations f(1) = 0 and f(n) = f(n − 1) + n − 1. Thus, its execution cost is f(n) = n(n − 1)/2 = Ω(n²) (for n ≥ 1). The execution cost equals exactly the sum of the goal distances because LRTA* was zero-initialized and its final h-values are equal to the goal distances. For example, f(n) = 10 for n = 5. In this case, LRTA* follows the path s1, s2, s1, s3, s2, s1, s4, s3, s2, s1, and s5.
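The run can be replayed with the following sketch. The encoding of the state space (s1 has an action to every other state; every other nongoal state has a single action back to its predecessor) is our reconstruction from the figure's description, but it reproduces the stated trace and cost exactly:

```python
def lower_bound_example(n):
    """Zero-initialized LRTA* with minimal local search spaces on the
    reconstructed Figure 11.5 state space (all action costs are one)."""
    succs = {1: list(range(2, n + 1))}               # s1 reaches every other state
    succs.update({i: [i - 1] for i in range(2, n)})  # s_i falls back to s_(i-1)
    h = {i: 0 for i in range(1, n + 1)}              # zero-initialized h-values
    u, path, cost = 1, [1], 0
    while u != n:                                    # state n is the goal state
        # ties are broken in favor of successor states with smaller indices
        v = min(succs[u], key=lambda v: (1 + h[v], v))
        h[u] = max(h[u], 1 + h[v])                   # minimal local search space update
        u, cost = v, cost + 1
        path.append(u)
    return path, cost

print(lower_bound_example(5))
# ([1, 2, 1, 3, 2, 1, 4, 3, 2, 1, 5], 10) -- execution cost n(n-1)/2 for n = 5
```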
Overall, the previous section showed that the complexity of zero-initialized LRTA* with minimal local search spaces is O(nd) over all state spaces where all action costs are one, and the example in this section shows that it is Ω(nd). Thus, the complexity is Θ(nd) over all state spaces where all action costs are one and Θ(n²) over all safely explorable state spaces where all action costs are one (see Exercises).
11.6. How to Use Real-Time Search
We now discuss how to use real-time search using LRTA* as a basis. In addition to the case studies described in the following, real-time search methods have also been used in other nondeterministic state spaces from mobile robotics, including moving-target search, the task of catching moving prey.
11.6.1. Case Study: Offline Search
Many traditional state spaces from artificial intelligence are deterministic, including sliding-tile puzzles and blocks worlds. The successor states of actions can be predicted with certainty in deterministic state spaces. Real-time search methods can solve offline search tasks in these state spaces by moving a fictitious agent in the state space and thus provide alternatives to traditional search methods, such as A*. They have been applied successfully to traditional search tasks and S
trips-type search tasks. For instance, real-time search methods easily determine suboptimal solutions for the T
wenty-four P
uzzle, a sliding-tile puzzle with more than 10
24 states, and blocks worlds with more than 10
27 states.
For these search tasks, real-time search methods compete with other heuristic search methods such as greedy (best-first) search, which can determine suboptimal solutions faster than real-time search, or linear-space best-first search, which can consume less memory.
11.6.2. Case Study: Goal-Directed Navigation in A Priori Unknown Terrain
Goal-directed navigation in a priori unknown terrain requires a robot to move to a goal position in a priori unknown terrain. The robot knows the start and goal cell of the Gridworld but not which cells are blocked. It can repeatedly move one cell to the north, east, south, or west (unless that cell is blocked). All action costs are one. On-board sensors tell the robot in every cell which of the four adjacent cells (to the north, east, south, or west) are blocked. The task of the robot is to move to the given goal cell, which we assume to be possible. We study this search task again in Chapter 19. The robot uses a navigation strategy from robotics: It maintains the blockage status of the cells that it knows so far. It always moves along a presumed unblocked path from its current cell toward the goal cell until it visits the goal cell. A presumed unblocked path is a sequence of adjacent cells that does not contain cells that the robot knows to be blocked. The robot moves on the presumed unblocked path toward the goal cell but immediately repeats the process if it observes a blocked cell on the planned path or reaches the end of the path. Thus, it has to search repeatedly.
The states of the state space correspond to the cells, and the actions correspond to moving from a cell to an adjacent cell. The action costs can increase between searches. Initially, they are all one. When the robot observes that a cell is blocked, then it removes all actions that enter or leave the cell, which corresponds to setting their action costs to infinity. LRTA* and RTAA* apply to this scenario since consistent (or admissible) h-values remain consistent (or admissible, respectively) if action costs increase.
Figure 11.12 shows a simple goal-directed navigation task in a priori unknown terrain that we use to illustrate the behavior of repeated (forward) A* (that repeatedly searches from the current cell of the robot to the goal cell), LRTA* with repeat-until loop, and RTAA* with repeat-until loop. Black cells are blocked. All cells are labeled with their initial h-values, which are the Manhattan distances. All search methods start a new search episode (i.e., run another search) when the action cost of an action on their current path increases or, for LRTA* and RTAA*, when the new state is outside of their local search spaces. They break ties among cells with the same f-values in favor of cells with larger g-values, and remaining ties in order from highest to lowest priority: east, south, west, and north. Figures 11.13, 11.14, 11.15, and 11.16 show the robot as a small black circle. The arrows show the found paths from the current cell of the robot to the goal cell, which is in the bottom-right corner. Cells that the robot has already observed to be blocked are black. All other cells have their h-value in the bottom-left corner. Generated cells have their g-value in the top-left corner and their f-value in the top-right corner. Expanded cells are gray and, for RTAA* and LRTA*, have their updated h-values in the bottom-right corner, which makes it easy to compare them to the h-values before the update in the bottom-left corner.
Repeated (forward) A*, RTAA* with maximal local search spaces (i.e., with lookahead = ∞), and LRTA* with the same local search spaces follow the same paths if they break ties in the same way. They differ only in the number of cell expansions until the robot reaches the goal cell, which is larger for repeated (forward) A* (23) than for RTAA* with maximal local search spaces (20), and larger for RTAA* with maximal local search spaces (20) than for LRTA* with maximal local search spaces (19). The first property is due to RTAA* and LRTA* updating the h-values while repeated (forward) A* does not. Thus, repeated (forward) A* falls prey to the local minimum in the h-value surface and expands the three left-most cells in the bottom row a second time, whereas RTAA* and LRTA* avoid these cell expansions. The second property is due to some updated h-values of LRTA* being larger than the ones of RTAA*. However, most updated h-values are identical, although this is not guaranteed in general. We also compare RTAA* with lookahead = 4 to RTAA* with maximal local search spaces, that is, lookahead = ∞. Figures 11.14 and 11.16 show that smaller local search spaces increase the execution cost (from 10 to 12) but decrease the number of cell expansions (from 20 to 17) because smaller local search spaces imply that less information is used during each search episode. (The last search episode of RTAA* with lookahead = 4 expands only one cell since the goal cell is about to be expanded next.)
For goal-directed navigation in unknown terrain, smaller local search spaces tend to increase the execution cost of LRTA* and RTAA* but initially tend to decrease the number of cell expansions and the search cost. Increasing the execution cost tends to increase the number of search episodes. As the local search spaces become smaller, eventually the rate with which the number of search episodes increases tends to be larger than the rate with which the lookahead and the time per search episode decrease, so that the number of cell expansions and the search cost increase again. The number of cell expansions and the search cost tend to be larger for repeated (forward) A* than RTAA*, and larger for RTAA* than LRTA*. RTAA* with local search spaces that are neither minimal nor maximal tends to increase the h-values less per update than LRTA* with the same local search spaces. Consequently, its execution cost and number of cell expansions tend to be larger than the ones of LRTA* with the same local search spaces. However, it tends to update the h-values much faster than LRTA* with the same local search spaces, resulting in smaller search costs. Overall, the execution cost of RTAA* tends to be smaller than the one of LRTA* for given time limits per search episode because it tends to update the h-values much faster than LRTA*, which allows it to use larger local search spaces and overcompensate for its slightly less informed h-values.
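The speed difference comes from the value-update steps. After each A* search, RTAA* updates all expanded states with one subtraction per state, using the f-value of the frontier state that A* was about to expand next, instead of the Dijkstra-style sweep of LRTA*. A sketch of this update (our paraphrase; the max() keeps the h-values nondecreasing, matching the convention used for LRTA* in this chapter):

```python
def rtaa_star_update(h, g, expanded, s_bar):
    """Value-update step of RTAA* (sketch): update all states expanded by
    the preceding A* search, where s_bar is the best frontier state."""
    f_bar = g[s_bar] + h[s_bar]        # f-value of the best frontier state
    for u in expanded:
        h[u] = max(h[u], f_bar - g[u])
    return h
```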
11.6.3. Case Study: Coverage
Coverage requires a robot to visit each location in its terrain once or repeatedly. Consider a strongly connected state space without any goal states. Strongly connected state spaces guarantee that real-time search methods are always able to reach all states no matter which actions they have executed in the past. Then, LRTA* with lookahead one, RTA* with lookahead one, and node counting visit all states repeatedly, as can be shown with a cycle argument in a way similar to the proof that LRTA* always reaches a goal state with a finite execution cost. In fact, the worst-case cover time (where the cover time is measured as the execution cost until all states have been visited at least once) is the same as the worst-case execution cost until a goal state has been reached if an adversary can pick the goal state, for example, the state that was visited last when covering the state space. Thus, the earlier complexity results carry over to coverage.
The property of real-time search methods to visit all states of strongly connected state spaces repeatedly has been used to build ant robots: simple robots with limited sensing and computational capabilities. Ant robots have the advantage that they are easy to program and cheap to build. This makes it feasible to deploy groups of ant robots and take advantage of the resulting fault tolerance and parallelism. Ant robots cannot use conventional search methods due to their limited sensing and computational capabilities. To overcome these limitations, ant robots can use real-time search methods to leave markings in the state space that can be read by the other ant robots, similar to what real ants do when they use chemical (pheromone) traces to guide their navigation. For example, ant robots that all use LRTA* only have to sense the markings (in the form of the h-values) of their adjacent states and change the marking of their current state. They cover all states once or repeatedly even if they move asynchronously, do not communicate with each other except via the markings, do not have any kind of memory, do not know and cannot learn the topology of the state space, and cannot determine complete paths. The ant robots do not even need to be localized, which completely eliminates solving difficult and time-consuming localization tasks, and they robustly cover all states even if they are moved without realizing that they have been moved (say, by people running into them), some ant robots fail, and some markings get destroyed.
Many of the real-time search methods discussed in this chapter could be used to implement ant robots. For example, the following properties are known if edge counting is used to implement ant robots, in addition to it executing all actions repeatedly. A Eulerian cycle is a sequence of action executions that executes each action in each state exactly once and returns to the start state. Consider edge ant walk, the variant of edge counting that repeatedly executes all actions in a state in turn according to a fixed order. In other words, edge counting executes the first action according to the given order when it visits some state for the first time, it executes the second action according to the given order when it visits the same state the next time, and so on. It turns out that this variant of edge counting repeatedly executes a Eulerian cycle after at most O(ed′) action executions in Eulerian state spaces without goal states, where the diameter d′ of a state space is the maximal distance between any pair of its states. Furthermore, consider k ant robots that each execute one action in each time step so that they repeatedly execute all actions in a state in turn according to a fixed order. In other words, the first ant robot that executes an action in some state executes the first action according to the given order, the next ant robot that executes an action in the same state executes the second action according to the given order, and so on. It turns out that the numbers of times that any two actions have been executed differ by at most a factor of two after a bounded number of time steps in Eulerian state spaces without goal states.
Overall, it is probably easiest to use node counting to implement ant robots in Gridworlds since each ant robot then marks cells and always increases the h-value of its current cell by the same amount. Figure 11.17 shows a simplified example. Each ant robot can repeatedly move one cell to the north, east, south, or west (unless that cell is blocked). We assume that there is no uncertainty in actuation and sensing. The ant robots move in a given sequential order (although this is not necessary in general). If a cell contains an ant robot, one of its corners is marked. Different corners represent different ant robots. Figure 11.17 (top) demonstrates how a single ant robot covers the Gridworld, and Figure 11.17 (bottom) demonstrates how three ant robots cover it. Ant robots that leave more information in each cell (e.g., complete maps) tend to cover terrain even faster.
11.6.4. Case Study: Localization and Goal-Directed Navigation under Pose Uncertainty
Consider the goal-directed navigation task with pose uncertainty shown in Figure 11.18. The robot knows the Gridworld, but is uncertain about its start pose, where a pose is a cell and an orientation (north, east, south, or west). It can move one cell forward (unless that cell is blocked), turn left 90 degrees, or turn right 90 degrees. All action costs are one. On-board sensors tell the robot in every pose which of the four adjacent cells (front, left, behind, or right) are blocked. We assume that there is no uncertainty in actuation and sensing, and consider two navigation tasks. Localization requires the robot to gain certainty about its pose and then stop. We study this search task again in Chapter 19. Goal-directed navigation with pose uncertainty requires the robot to navigate to any of the given goal poses and then stop. Since there might be many poses that produce the same sensor reports as the goal poses, this navigation task includes localizing the robot sufficiently so that it knows that it is in a goal pose when it stops. We require that the Gridworlds be strongly connected (every pose can be reached from every other pose) and not completely symmetrical (localization is possible). This modest assumption makes all robot navigation tasks solvable, since the robot can always first localize itself and then, for goal-directed navigation tasks with pose uncertainty, move to a goal pose.
We now formally describe the navigation tasks, using the following notation: P is the finite set of possible robot poses. A(p) is the finite set of possible actions that the robot can execute in pose p ∈ P: left, right, and possibly forward. succ(p, a) is the pose that results from the execution of action a ∈ A(p) in pose p ∈ P. o(p) is the observation that the robot makes in pose p ∈ P: whether or not there are blocked cells immediately adjacent to it in the four directions (front, left, behind, and right). The robot starts in pose pstart ∈ P and then repeatedly makes an observation and executes an action until it decides to stop. It knows the Gridworld, but is uncertain about its start pose. It could be in any pose p ∈ b(s), where b(s) denotes its start belief. We require only that o(p) = o(p′) for all p, p′ ∈ b(s), which automatically holds after the first observation, and pstart ∈ b(s), which automatically holds for b(s) = {p ∈ P : o(p) = o(pstart)}.
Since the robot does not know its start pose, the navigation tasks cannot be formulated as search tasks in small deterministic state spaces of which the states are the poses (pose space). Rather, the robot has to maintain a belief about its current pose. Analytical results about the execution cost of search methods are often about their worst-case execution cost (here, the execution cost for the worst possible start pose) rather than their average-case execution cost, especially if the robot cannot associate probabilities or other likelihood estimates with the poses. Then, all it can do is maintain a belief in the form of a set of possible poses, namely the poses that it could possibly be in. Thus, its beliefs are sets of poses, and their number could be exponential in the number of poses. The beliefs of the robot depend on its observations, which the robot cannot predict with certainty since it is uncertain about its pose. For example, it cannot predict whether the cell in front of it will be blocked after it moves forward for the goal-directed navigation task with pose uncertainty from Figure 11.18. The navigation tasks are therefore search tasks in large nondeterministic state spaces of which the states are the beliefs of the robot (belief space). The robot will usually be uncertain about its current pose but can always determine its current belief for sure. For example, if the robot has no knowledge of its start pose for the goal-directed navigation task with pose uncertainty from Figure 11.18 but observes blocked cells all around it except in its front, then its start belief contains the following seven poses: A1→, A6↓, A8↓, A10→, D1→, D5↑, and D8↑.
We now formally describe the state spaces of the navigation tasks, using the following notation: B denotes the set of beliefs and b(s) the start belief. A(b) denotes the set of actions that can be executed when the belief is b. O(a, b) denotes the set of possible observations that can be made after the execution of action a when the belief was b. succ(b, a, o) denotes the successor belief that results if observation o is made after the execution of action a when the belief was b. Then, for all b ∈ B, a ∈ A(b), and o ∈ O(a, b),

A(b) = A(p) for any p ∈ b,
O(a, b) = {o(succ(p, a)) : p ∈ b},
succ(b, a, o) = {succ(p, a) : p ∈ b and o(succ(p, a)) = o}.

To understand the definition of A(b), notice that A(p) = A(p′) for all p, p′ ∈ b after the preceding observation since the observation determines the actions that can be executed.
For goal-directed navigation tasks with pose uncertainty, the robot has to navigate to any pose in the given set of goal poses G ⊆ P and stop. In this case, we define the set of goal beliefs as T = {b ∈ B : b ⊆ G}. To understand this definition, notice that the robot knows that it is in a goal pose if and only if its belief is a subset of G. If the belief contains more than one pose, however, the robot does not know which goal pose it is in. If it is important that the robot knows which goal pose it is in, we use T = {b ∈ B : b ⊆ G and |b| = 1}. For localization tasks, we use T = {b ∈ B : |b| = 1}.
The belief space is then given by the set of beliefs B, the start belief b(s), the set of goal beliefs T, the action sets A(b), and the successor beliefs succ(b, a, o). It is safely explorable since our assumptions imply that all navigation tasks are solvable.
For goal-directed navigation tasks with pose uncertainty, we can use the admissible goal-distance heuristic h(b) = max_{p ∈ b} δ(p, G). (Thus, the robot determines for each pose in the belief state how many actions it would have to execute to reach a goal pose if it knew that it was currently in that pose. The maximum of these values then is an approximation of the minimax goal distance of the belief state.) For example, the goal-distance heuristic is 18 for the start belief state used earlier, namely the maximum of 18 for A1→, 12 for A6↓, 10 for A8↓, 1 for A10→, 17 for D1→, 12 for D5↑, and 9 for D8↑. The calculation of δ(p, G) involves no pose uncertainty and can be done efficiently without interleaving searches and action executions, by using traditional search methods in the pose space. This is possible because the pose space is deterministic and small. The h-values are admissible because the robot needs an execution cost of at least δ(p, G) in the worst case to solve the goal-directed navigation task with pose uncertainty from pose p, even if it knows that it starts in that pose. The h-values are often only partially informed because they do not take into account that the robot might not know its pose and then might have to execute additional localization actions to overcome its pose uncertainty. For localization tasks, on the other hand, it is difficult to obtain more informed initial h-values than zero-initialized ones.
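Since the pose space is small and deterministic, the goal distances δ(p, G) can be precomputed with a single backward breadth-first search from the goal poses, after which evaluating a belief is just a maximization. A sketch, assuming unit action costs and a hypothetical pred(p) callback that returns the poses with an action leading to p:

```python
from collections import deque

def goal_distances(goal_poses, pred):
    """Backward BFS in pose space: delta[p] = minimal number of actions
    needed to reach a goal pose from pose p (unit action costs assumed)."""
    delta = {p: 0 for p in goal_poses}
    queue = deque(goal_poses)
    while queue:
        p = queue.popleft()
        for q in pred(p):                # poses with an action leading to p
            if q not in delta:
                delta[q] = delta[p] + 1
                queue.append(q)
    return delta

def h_belief(belief, delta):
    """Admissible heuristic of a belief: worst goal distance over its poses."""
    return max(delta[p] for p in belief)
```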
Figure 11.19 (excluding the dashed part) shows how min-max LRTA* with lookahead one performs a minimax search around the current belief state of the robot to determine which action to execute next. The local search space consists of all nonleaves of the minimax tree where it is the robot's turn to move (labeled "Agent" in the figure). Min-max LRTA* first assigns all leaves of the minimax tree the h-value determined by the heuristic function for the corresponding belief state ①. Min-max LRTA* then backs up these h-values toward the root of the minimax tree. The h-value of a node in the minimax tree where it is nature's turn to move (labeled "Nature" in the figure) is the maximum of the h-values of its children, since nature chooses successor states that maximize the minimax goal distance ②. The h-value of a node in the minimax tree where it is the robot's turn to move is the maximum of its previous h-value and the minimum, over all actions in the node, of the action cost plus the h-value of the corresponding child, since min-max LRTA* chooses actions that minimize the minimax cost-to-go. Finally, min-max LRTA* selects the action in the root that minimizes the sum of the action cost and the h-value of the corresponding child. Consequently, it decides to move forward. Min-max LRTA* then executes the selected action (possibly already searching for action sequences in response to the possible observations it can make next), makes an observation, updates the belief state of the robot based on this observation, and repeats the process from the new belief state of the robot until the navigation task is solved.
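With minimal local search spaces, the backup just described collapses to a single rule per decision: nature maximizes over observations, and the robot minimizes over actions. A sketch follows (observations and succ_belief correspond to the belief-space functions O(a, b) and succ(b, a, o) defined earlier; h is a lookup that falls back to the heuristic for belief states without a stored value):

```python
def minmax_lrta_step(b, actions, observations, succ_belief, w, h):
    """One decision of min-max LRTA* with lookahead one (sketch)."""
    def cost_to_go(a):
        # nature picks the observation that maximizes the estimated cost
        return w(b, a) + max(h(succ_belief(b, a, o)) for o in observations(b, a))
    a = min(actions(b), key=cost_to_go)       # action-selection step
    h_new = max(h(b), cost_to_go(a))          # value-update step
    return a, h_new   # the caller stores h_new for belief b and executes a
```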
Min-max LRTA* has to ensure that it does not cycle forever. It can use one of the following two approaches to gain information between action executions and thus guarantee progress:
Direct Information Gain: If min-max LRTA* uses sufficiently large local search spaces, then it achieves a gain in information in the following sense: After executing the selected actions, the robot has either solved the navigation task or at least reduced the number of poses it can be in. This way, min-max LRTA* guarantees progress toward a solution. For example, moving forward reduces the number of possible poses from seven to at most two for the goal-directed navigation task with pose uncertainty from Figure 11.18. We study a search method with direct information gain, called greedy localization, in more detail in Chapter 19 since it performs agent-centered search but not real-time search.
Indirect Information Gain: Min-max LRTA* with direct information gain does not apply to all search tasks. Even if it applies, the local search spaces, and thus the search cost that it needs to guarantee a direct information gain, can be large. To operate with smaller local search spaces, min-max LRTA* can use real-time search. It then operates as before, with the following two changes. First, when min-max LRTA* needs the h-value of a belief state just outside of the local search space (i.e., the h-value of a leaf of the minimax tree) in the value-update step, it now checks first whether it has already stored an h-value for this belief state in memory. If so, then it uses this h-value. If not, then it calculates the h-value using the heuristic function, as before. Second, after min-max LRTA* has calculated the h-value of a belief state in the local search space where it is the turn of the robot to move, it now stores it in memory, overwriting any existing h-value of the corresponding belief state. Figure 11.19 (including the dashed part) summarizes the steps of min-max LRTA* with indirect information gain before it decides to move forward. The increase of the potential Σ_{u ∈ S} h(u) (see the proof of Lemma 11.3) can be interpreted as an indirect information gain that guarantees that min-max LRTA* reaches a goal belief state in safely explorable state spaces. A disadvantage of min-max LRTA* with indirect information gain over min-max LRTA* with direct information gain is that the robot potentially has to store one h-value in memory for each belief state it has encountered during its searches. In practice, however, the memory requirements of real-time search methods often seem to be small, especially if the initial h-values are well informed and thus focus the searches well, which prevents them from encountering a large number of different belief states. Furthermore, real-time search methods only need to store in memory the h-values of those belief states that differ from the initial h-values; if the h-values are the same, then they can be automatically regenerated when they are not found in memory. For the example from Figure 11.19, for instance, it is unnecessary to store the calculated h-value 18 of the start belief state in memory.
An advantage of min-max LRTA* with indirect information gain over min-max LRTA* with direct information gain is that it is able to operate with smaller local search spaces, even local search spaces that contain only the current belief state. Another advantage is that it improves its worst-case execution cost, although not necessarily monotonically, until it converges if it maintains its h-values across a series of localization tasks in the same terrain or goal-directed navigation tasks with pose uncertainty with the same set of goal poses in the same terrain. The actual start poses or the beliefs of the robot about its start poses do not need to be identical.
So far, we have assumed that the robot can recover from the execution of each action. If this is not the case, then the robot has to guarantee that the execution of each action does not make it impossible to reach a goal belief state, which is often possible by increasing the local search spaces of real-time search methods. For example, if min-max LRTA* is applied to goal-directed navigation tasks with pose uncertainty in the presence of irreversible actions and always executes actions so that the resulting belief state is guaranteed to contain either only goal poses, only poses that are part of the current belief state of the robot, or only poses that are part of the start belief state, then either the goal-directed navigation task remains solvable in the worst case or it was not solvable in the worst case to begin with. We have also assumed that there is no actuator or sensor noise. Search tasks with actuator but no sensor noise can be modeled with MDPs, and search tasks with actuator and sensor noise can be modeled with partially observable MDPs (POMDPs). We have already shown that MDPs can be solved with probabilistic LRTA*. A POMDP can be expressed as an MDP of which the state space is the set of probability distributions over all states of the POMDP. Thus, the state space of the resulting MDP is continuous and needs to get discretized before we can use probabilistic LRTA* on it.
11.7. Summary
In this chapter, we illustrated the concept of real-time search, described which kinds of search tasks it is suitable for, and discussed the design and the properties of some real-time search methods. We learned that real-time search methods have been applied to a variety of search tasks, including traditional off-line search, Strips-type planning, moving-target search, search of MDPs and POMDPs, reinforcement learning, and robot navigation (such as goal-directed navigation in unknown terrain, coverage and localization). We learned that real-time search methods have several advantages. First, different from the many existing ad hoc search and planning methods that interleave searches and action executions, they have a solid theoretical foundation and are state space independent. Second, they allow for fine-grained control over how much search to do between action executions. Third, they can use heuristic knowledge to guide their search, which can reduce the search cost without increasing the execution cost. Fourth, they can be interrupted at any state and resume execution at a different state. In other words, other control programs can take over control at arbitrary times if desired. Fifth, they amortize learning over several search episodes, which allows them to determine a solution with a suboptimal execution cost fast and then improve the execution cost as they solve similar search tasks, until the execution cost is satisficing or optimal. Thus, they still asymptotically minimize the execution cost in the long run in case similar search tasks unexpectedly repeat. Sixth, several agents can often solve search tasks cooperatively by performing an individual real-time search each but sharing the search information, thereby reducing the execution cost. For example, off-line search tasks can be solved on several processors in parallel by running a real-time search method on each processor and letting all real-time search methods share their h-values.
Although these properties can make real-time search methods the search methods of choice, we also learned that they are not appropriate for every search task. For example, real-time search methods execute actions before their consequences are completely known and thus cannot guarantee a small execution cost when they solve a search task for the first time. If a small execution cost is important, we might have to perform complete searches before starting to execute actions. Furthermore, real-time search methods trade off the search and execution costs but do not reason about the trade-off explicitly. In particular, it can sometimes be beneficial to update h-values of states that are far away from the current state, and repeated forward searches might not be able to determine these states efficiently. Time-bounded A*, for example, performs only one complete (forward) A* search and executes actions greedily until this search is complete. Finally, real-time search methods have to store an h-value in memory for each state encountered during their searches and thus can have large memory requirements if the initial h-values do not focus their searches well.
Table 11.1 provides an overview of the real-time search methods introduced in this chapter, together with the kind of their state spaces, the sizes of their local search spaces, and whether they are complete and converge. We discussed LRTA* as a prototypical real-time search method and analyzed its properties, including its completeness, execution cost, and convergence. We then discussed several variants of LRTA*, including variants with maximal local search spaces, minimal local search spaces, lookahead one (similar to minimal local search spaces), and lookahead zero (called minimal lookahead, where all information is local to the current state). We discussed variants of LRTA* that update their h-values faster than LRTA* (namely RTAA*), that detect convergence, that speed up convergence (including FALCONS), and that do not converge (including RTA*, node counting, and edge counting). We also discussed variants of LRTA* for nondeterministic state spaces (min-max LRTA*) and probabilistic state spaces (probabilistic LRTA*).
Table 11.1 Overview of real-time search methods.

| Real-Time Search Method | Algorithm | State Space | Local Search Spaces or Lookaheads | h-Values | Completeness | Convergence |
|---|---|---|---|---|---|---|
| LRTA* | 11.1/11.2 | deterministic | one to infinity | admissible | yes | yes |
| LRTA* with lookahead one | 11.3 | deterministic | one | admissible | yes | yes |
| Min-LRTA* | 11.4 | deterministic | zero | admissible | yes | yes |
| RTAA* | 11.5 | deterministic | one to infinity | consistent | yes | yes |
| FALCONS | 11.7 | deterministic | one | admissible | yes | yes |
| RTA* | — | deterministic | one (to infinity) | admissible | yes | no |
| Node counting | — | deterministic | one | zero | yes | no |
| Edge counting | — | deterministic | zero | zero | yes | no |
| Min-Max LRTA* | — | nondeterministic | one (to infinity) | admissible | yes | yes |
| Probabilistic LRTA* | — | probabilistic | one (to infinity) | admissible | yes | yes |
11.9. Bibliographic Notes
The term agent-centered search has been coined by Koenig (2001b); the term real-time heuristic search has been coined by Korf (1990); and the term trial-based real-time dynamic programming has been coined by Barto, Bradtke, and Singh (1995). Korf (1990) used real-time search originally for finding suboptimal solutions to large off-line search tasks, such as the Ninety-Nine-Puzzle.
Real-time search is related to pebble algorithms that agents can use to cover state spaces by marking otherwise indistinguishable states with colored pebbles (Blum, Raghavan, and Schieber, 1991). Real-time search is also related to plan-envelope methods that operate on MDPs and, like probabilistic LRTA*, reduce the search cost by searching only small local search spaces (plan envelopes). If the local search space is left during execution, then they repeat the process from the new state, until they reach a goal state (Dean, Kaelbling, Kirman, and Nicholson, 1995). However, they search all the way from the start state to a goal state, using local search spaces that usually border at least one goal state and are likely not to be left during execution. Finally, real-time search is also similar to reinforcement learning (see work by Thrun, 1992, and Koenig and Simmons, 1996a) and to incremental search (see work by Koenig, 2004). Real-time search has been applied to traditional search tasks by Korf (1990) and Korf (1993); Strips-type planning tasks by Bonet, Loerincs, and Geffner (1997); moving-target search by Ishida (1997) and Koenig and Simmons (1995); coverage with ant robots by Wagner, Lindenbaum, and Bruckstein (1999), Koenig, Szymanski, and Liu (2001a), and Svennebring and Koenig (2004); localization by Koenig (2001a); MDPs by Barto, Bradtke, and Singh (1995); and POMDPs by Bonet and Geffner (2000).
Variants of LRTA* have been studied by Russell and Wefald (1991); variants of min-max LRTA* have been studied by Bonet and Geffner (2000) and Hernández and Meseguer (2005, 2007a, 2007b); variants of node counting have been studied by Balch and Arkin (1993), Pirzadeh and Snyder (1990), and Thrun (1992); and variants of edge counting have been studied by Sutherland (1969). Improvements on these real-time search methods have been reported by Edelkamp and Eckerle (1997), Ishida (1997), Moore and Atkeson (1993), Pemberton and Korf (1992, 1994), Russell and Wefald (1991), Sutton (1991), Thorpe (1994), and Bulitko, Sturtevant, and Kazakevich (2005). Some of these improvements have been unified by Bulitko and Lee (2006).