CHAPTER 4

Quantum Agent Models

Agent models are often used in classical robotics to model a robot planning in a world with uncertainty. Classical environments can evolve stochastically from time step to time step and can have a large, dynamic state space that is difficult to predict. The dynamic nature of many real-world environments means robots have to reason over a large range of possible future scenarios in planning their actions. To make matters more challenging, the robot may be uncertain of the state of the environment (as well as its own state), as it has to infer these properties from noisy, imperfect sensor data. Frameworks such as the Markov Decision Process (MDP) [Bellman, 1957] and Partially Observable Markov Decision Process (POMDP) [Kaelbling et al., 1998] have traditionally been used to model these “planning with uncertainty” scenarios.

Planning with uncertainty is a central challenge in many areas of robotics, and MDPs and POMDPs have widespread use. They have been used to model grasping and manipulation tasks [Glashan et al., 2007, Horowitz and Burdick, 2013, Hsiao et al., 2007, Pajarinen and Kyrki, 2015], robot navigation [Koenig and Simmons, 1998, Simmons and Koenig, 1995], path planning [Cortes et al., 2008, Spaan and Vlassis, 2004], target tracking [Hsu et al., 2008], high-level behavior control [Pineau and Thrun, 2002], and many other applications [Cassandra, 1998].

Oftentimes, robots are designed with a goal in mind to solve a particular problem. As a robot performs actions, it may be able to improve its performance on its design goal over time by learning. Frameworks such as Reinforcement Learning (RL) [Sutton and Barto, 1998], Online Learning [Bottou, 1998], and Active Learning [Settles, 2010] are common paradigms for incorporating learning mechanisms into robot action, so a robot can improve using past experience.

While the field of quantum agents is still in its infancy, some work has been done to generalize MDPs and POMDPs to the quantum world and to formulate quantum speedups for agent learning. One direction of research is to develop agent algorithms that can operate in quantum-scale environments. Attempts have been made to formulate qMDPs [Ying and Ying, 2014] and QOMDPs [Barry et al., 2014] for agents attempting to operate within the unitary dynamics of a quantum environment.

Another area of active research is to develop agents that operate in classical environments but get a potential quantum speedup in their learning capability. Reinforcement Learning that uses quantum random walks for speedup has been proposed [Paparo et al., 2014]. Simple agent decision-making problems (such as multi-armed bandits) have also been addressed with optical and photonic implementations [Mahajan and Teneketzis, 2008, Naruse et al., 2013].

4.1 CLASSICAL MARKOV DECISION PROCESSES

A Markov Decision Process (MDP) [Bellman, 1957] is a model of an agent operating in an uncertain but observable world. Formally, an MDP is a 5-tuple (S, A, T, R, γ):

Markov Decision Process (MDP) (S, A, T, R, γ) Definition

S is a set of robot-environment states.

A is a set of actions the agent can take.

T(si, a, sj) : S × A × S → [0, 1] is a state transition function that gives the probability that taking action a in state si results in state sj.

R(si, a) : S × A → ℝ is a reward function that quantifies the reward the agent gets for taking action a in state si.

γ ∈ [0, 1) is a discount factor that trades off between immediate reward and potential future reward.

In the MDP model, the world is in one state at any given time. The agent fully observes the true state of the world and itself. At each time step, the agent chooses and executes an action from its set of actions. The agent receives some reward based on the reward function for its action choice in that state. The world then probabilistically transitions to a new state according to the state transition function. The objective of the MDP agent is to act in a way that allows it to maximize its future expected reward.

The solution to an MDP is known as a policy, π(si): S → A, a function that maps states to actions. The optimal policy over an infinite horizon is one inducing the value function:

V*(si) = max_a∈A [ R(si, a) + γ ∑_sj∈S T(si, a, sj) V*(sj) ]        (4.1)

This famous equation, known as the “Bellman Equation,” quantifies the value of being in a state based on immediate reward as well as discounted expected future reward. The future reward is based on what the agent can do from its next state and the reward of future states it may transition to, discounted more as time increases. The recursion specifies a dynamic program which breaks the optimization into subproblems that are solved sequentially in a recurrent fashion.

The Bellman Equation is amenable to solution via the Value Iteration algorithm [Bellman, 1957]. In general, there is a unique finite solution for V* when γ < 1. Finding an ε-optimal policy (i.e., one whose value is within ε of that of the optimal policy) for an MDP can be done in time polynomial in |S| and |A|.
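
To make the dynamic program concrete, the following is a minimal Python sketch of Value Iteration for a small, explicitly enumerated MDP; the dictionary-based representations of T and R, the stopping tolerance, and the function names are illustrative choices rather than anything prescribed by the definition above.

```python
def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    """Compute an approximately optimal value function and a greedy policy.

    T[s][a] is a dict mapping next state -> probability; R[s][a] is a scalar.
    (These container choices are illustrative, not taken from the text.)
    """
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: immediate reward plus discounted expected future value
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # Extract a greedy policy from the converged value function
    pi = {s: max(A, key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in S}
    return V, pi
```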

4.2 CLASSICAL PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

Partially Observable Markov Decision Processes (POMDPs) [Russell and Norvig, 2003] generalize MDPs to the case where the world is not fully observable. In this setting, the true world state is hidden to the agent, though the agent still receives observations of the world that help it infer the world state. POMDPs have been useful for modeling robots because imperfect robot sensors and actuators provide only an inherently limited view of the world.

Formally, a POMDP is an 8-tuple (S, A, T, R, γ, Ω, O, b0). It has all five components of an MDP and, in addition:

Partially Observable Markov Decision Process (POMDP) (S, A, T, R, γ, Ω, O, b0) Definition

• Ω, a set of possible observations of the world.

O(si, aj, ok) : S × A × Ω → [0, 1], an observation model indicating the probability of making observation ok given that action aj was taken and the system ended up in state si.

b0, a probability distribution over possible initial states.

The world in the POMDP, like in the MDP, transitions probabilistically according to T. However, unlike in the MDP, the agent never knows exactly the state in which it resides. Instead, the agent maintains a probability distribution over possible states and attempts to statistically infer the system state from the observation data it receives. In a POMDP, the agent maintains, at each time step, a vector b, a probability distribution (called the “belief state”) over possible states. For si ∈ S, b(si) is the probability that the system is in state si. Thus, b(si) ≥ 0 for all si ∈ S and ∑i b(si) = 1.
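
As an illustration of how the belief state evolves, the following Python sketch implements the standard Bayes-filter belief update for a POMDP; the container choices and the convention of passing T and O as callables are assumptions made for the example.

```python
def belief_update(b, a, o, S, T, O):
    """Bayes-filter update of the belief state after taking action a and observing o.

    b: dict mapping state -> probability; T(si, a, sj) and O(sj, a, o) are callables
    matching the transition and observation models defined above (illustrative API).
    """
    b_new = {}
    for sj in S:
        pred = sum(T(si, a, sj) * b[si] for si in S)   # prediction step
        b_new[sj] = O(sj, a, o) * pred                  # correction by observation likelihood
    norm = sum(b_new.values())
    if norm == 0:
        raise ValueError("Observation has zero probability under the current belief.")
    return {s: p / norm for s, p in b_new.items()}
```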

Interestingly, POMDPs can be viewed as MDPs in belief space [Kaelbling et al., 1998]. Given a POMDP, one can define a corresponding “belief MDP” where the state space is defined as the set of all possible belief states of the POMDP. The optimal solution to the belief MDP is also the optimal solution to the POMDP. This observation doesn’t make finding a solution easier since the state space of the belief state MDP is continuous and all known algorithms for solving MDPs optimally in polynomial time are polynomial in the size of the state space.

As with MDPs, the solution for a POMDP is a policy which dictates what action to take in any state of the robot-environment system. While efforts have been made to solve POMDPs using a myriad of techniques [Braziunas, 2003, Murphy, 2000, Roy et al., 2005, Shani et al., 2013], classical POMDPs do not have favorable theoretical properties. The policy existence problem for POMDPs is PSPACE-hard [Papadimitriou and Tsitsiklis, 1987] and undecidable in the infinite-horizon case [Madani et al., 2003].

4.3 QUANTUM SUPEROPERATORS

Efforts have been made to generalize MDPs to quantum MDPs (qMDPs) [Ying and Ying, 2014] and POMDPs to quantum POMDPs (QOMDPs) [Barry et al., 2014] to model control systems operating on quantum-scale environments. In both frameworks, quantum superoperators formalize an agent’s capability to affect its quantum world and the expected quantum state evolution that results from the action.

A quantum superoperator is a linear operator acting on a vector space of linear operators. In the quantum world, superoperators can be used to evolve density matrices of quantum states. Formally, a quantum superoperator S = {K1, … , Kκ} acting on states of dimension d is defined by a set of κ d × d Kraus matrices. A set of matrices is a set of Kraus matrices if it satisfies ∑i Ki†Ki = I.

When a quantum superoperator operates on a density matrix ρ, there are κ possible next states for ρ. The next state is sampled from the probability distribution of possible next states: the next state is ρi = Ki ρ Ki† / Tr(Ki ρ Ki†), which occurs with probability pi = Tr(Ki ρ Ki†). The superoperator returns observation i if the ith Kraus matrix is applied.
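
A short numerical sketch of applying a superoperator may be helpful; the NumPy-based function below and the single-qubit bit-flip channel used to exercise it are illustrative choices, not examples taken from the literature discussed here.

```python
import numpy as np

def apply_superoperator(kraus_ops, rho, rng=np.random.default_rng()):
    """Apply a superoperator {K_1, ..., K_kappa} to the density matrix rho.

    Samples which Kraus matrix 'fires', returning the normalized next state
    and the index i of the observed outcome (a minimal illustrative sketch).
    """
    probs = [float(np.real(np.trace(K @ rho @ K.conj().T))) for K in kraus_ops]
    i = rng.choice(len(kraus_ops), p=np.array(probs) / sum(probs))
    K = kraus_ops[i]
    rho_next = K @ rho @ K.conj().T / probs[i]
    return rho_next, i

# Example: a single-qubit bit-flip channel with flip probability p = 0.1
p = 0.1
K0 = np.sqrt(1 - p) * np.eye(2)
K1 = np.sqrt(p) * np.array([[0, 1], [1, 0]])
assert np.allclose(K0.conj().T @ K0 + K1.conj().T @ K1, np.eye(2))  # Kraus condition
rho = np.array([[1, 0], [0, 0]], dtype=complex)                     # |0><0|
rho_next, outcome = apply_superoperator([K0, K1], rho)
```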

4.4 QUANTUM MDPS

A quantum MDP (qMDP) [Ying and Ying, 2014] is a 4-tuple (H, Act, ℳ, Q):

Quantum MDP (qMDP) (H, Act, ℳ, Q) Definition

H is the set of possible quantum states of the system. In quantum mechanics, H is a Hilbert space, and states are density matrices over H that can be either mixed or pure.

Act is a finite set of actions. Each α ∈ Act has a corresponding superoperator ℰα that can be used to evolve the system.

ℳ is a finite set of quantum measurements. The set of all possible observations is Ω = {OM,m : M ∈ ℳ and m is a possible outcome of M}. OM,m indicates that we perform the measurement M and obtain the outcome m.

Q : Act ∪ ℳ → 2^(Act ∪ ℳ) is a mapping from the current action or measurement to the set of actions and measurements available after that action or measurement is performed.

For the qMDP, the world is quantum-scale and thus governed by unitary dynamics. The possible states of the quantum system are described by the Hilbert space H. The quantum state may be a superposition of states, with the true underlying state unknown. At each time step, the agent can choose to perform an action or a measurement (to try to deduce the underlying state).

With a qMDP, a series of measurements is introduced to infer the state and help select the next action, based on the outcomes of said measurements. Obtaining information about a quantum system in superposition without decoherence is non-trivial. One method is to use indirect measurements where the main quantum system interacts with an ancilla quantum system. The coupled ancilla system is subject to measurement to obtain indirect information about the underlying system, though the main system remains unmeasured. This measurement methodology is further described in Section 6.1.2.

The algorithm that utilizes the obtained measurement information for decision making is called a scheduler. A scheduler for a qMDP is a function 𝔖 : (Act ∪ Ω)* → Act ∪ ℳ that selects the next action or measurement based on the outcomes of previously performed measurements and actions. A key difference between MDPs and qMDPs is that with MDPs, the scheduler can use the exact state of the system because the state of the classical world is known in the MDP. In a quantum environment, the true world state is not known and must be inferred from partial measurements of the system that give information about the system but do not collapse the superposition.

Classically, for an MDP, it can be shown that a memory-less scheduler exists that is optimal for all initial states [Baier and Katoen, 2008]. In the quantum case, however, there is no guarantee that an optimal scheduler exists for even a fixed initial state. The challenge lies in knowing the current state of the quantum system. A quantum system in a superposition of multiple states does not have a definite current state and thus does not provide enough information for a scheduler to choose a next best action.

Even if one knows the superposition weights and possible states, the uncertain knowledge of the true state makes the qMDP a very different mathematical object from the classical MDP. The reachability problem in MDPs is to determine whether a given scheduler will reach a particular state. Ying and Ying [2014] provides some results for the reachability properties of qMDPs. For the finite horizon case, the reachability problem for a qMDP is undecidable. For the infinite horizon case, it is EXPTIME-hard to determine whether the supremum reachability probability of a qMDP is 1.

An invariant subspace is a subspace of H that is invariant under the superoperators ℰα for all α ∈ Act and invariant under all measurements M ∈ ℳ. An invariant subspace can thus be viewed as a type of stationary region for a qMDP. It is proven by Ying and Ying [2014] that a qMDP has an optimal scheduler for reaching an invariant subspace of its Hilbert space H if and only if the orthocomplement of the target subspace contains no invariant subspaces. Using this property, Ying and Ying [2014] develops an algorithm for finding an optimal scheduler for a qMDP, when one exists.

4.5 QOMDPS

As discussed in the previous section, the qMDP is dependent on measurements to infer environment state, something that is not necessary in classical MDPs. When measurements are needed to infer environment state, the environment is classically, by definition, partially observable. The partial observability of the true underlying state of a quantum system in superposition may suggest POMDPs as a potentially better model of quantum systems than MDPs.

A quantum POMDP (QOMDP) [Barry et al., 2014] is a 6-tuple (S, Ω, 𝒜, R, γ, ρ0):

Quantum POMDP (QOMDP) (S, Ω, 𝒜, R, γ, ρ0) Definition

S, a set of possible quantum states of the system. In quantum mechanics, S is a Hilbert space, and states are density matrices over S that can be either mixed or pure.

• Ω = {o1, … , o|Ω|}, a set of possible observations.

𝒜 = {A^1, … , A^|𝒜|}, a set of superoperators that allow the agent to manipulate the quantum system. Each superoperator A^a ∈ 𝒜 has |Ω| Kraus matrices {A^a_1, … , A^a_|Ω|}. Each superoperator returns the same set of possible observations; the return of oi means the ith Kraus matrix was applied. Taking action a in state ρ returns observation oi with probability Tr(A^a_i ρ (A^a_i)†). If oi is observed after taking action a in state ρ, the next state is N(ρ, a, oi) = A^a_i ρ (A^a_i)† / Tr(A^a_i ρ (A^a_i)†).

R = {R1, … , R|𝒜|}, a set of reward operators, one per action. The reward associated with taking action a in state ρ is R(ρ, a) = Tr(ρRa).

• γ ∈ [0, 1), the discount factor.

ρ0S, the starting state.

The agent workflow in QOMDPs is analogous to that of classical POMDPs. For a QOMDP, the world is a quantum system that begins in state ρ0 and then evolves based on the agent’s actions.

Quantum superoperators formalize an agent’s capability to take an action in a quantum environment and receive a percept. In the QOMDP, acting and sensing in the environment are coupled via superoperators. At each time step, the agent chooses a superoperator to apply from its set of superoperators. The system may then be observed (often via indirect measurement methodologies), and the agent receives an observation from Ω according to the laws of quantum mechanics. The agent also receives a reward given by R based on the state of the system after the superoperator is applied.

In a classical POMDP, the environment is always in one underlying state which is hidden from the agent. The POMDP agent maintains a belief state over the states of the environment. In a QOMDP, the true underlying state may be a superposition. The QOMDP agent, however, knows the density matrix of the system exactly at each step, since the update rule N(ρ, a, oi) determines the post-observation state.
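
The following NumPy sketch puts the pieces of the QOMDP definition together for a single time step. The data layout and function name are assumptions made for illustration; the reward is computed as Tr(ρRa), following the tuple definition above.

```python
import numpy as np

def qomdp_step(rho, kraus_ops, R_a, rng=np.random.default_rng()):
    """One QOMDP time step for a chosen action a (minimal illustrative sketch).

    kraus_ops: list of the |Omega| Kraus matrices of the superoperator A^a.
    R_a: reward operator for action a; reward = Tr(rho R_a) per the definition.
    Returns (next density matrix, observation index, reward).
    """
    reward = float(np.real(np.trace(rho @ R_a)))
    # Probability of observation o_i is Tr(A^a_i rho (A^a_i)^dagger)
    probs = np.array([float(np.real(np.trace(K @ rho @ K.conj().T))) for K in kraus_ops])
    i = rng.choice(len(kraus_ops), p=probs / probs.sum())
    K = kraus_ops[i]
    rho_next = K @ rho @ K.conj().T / probs[i]   # N(rho, a, o_i)
    return rho_next, i, reward
```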

As with POMDPs, a policy for a QOMDP is a function π : S → 𝒜, a mapping of quantum states to actions. The value of the policy is:

Vπ(ρ) = R(ρ, π(ρ)) + γ ∑_oi∈Ω Tr(A^π(ρ)_i ρ (A^π(ρ)_i)†) Vπ(N(ρ, π(ρ), oi))        (4.2)

Equation 4.2 is the QOMDP version of the classical Bellman equation. However, POMDPs and QOMDPs are quite different mathematical objects with different theoretical properties. Because classical POMDPs can be simulated within the framework of the QOMDP, the QOMDP inherits the hardness properties of the POMDP [Barry et al., 2014].
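
For intuition, Equation 4.2 can be approximated over a finite horizon by expanding every observation branch. The recursion below is an illustrative NumPy sketch under the data layout assumed in the previous example; it is not an algorithm from Barry et al. [2014].

```python
import numpy as np

def qomdp_policy_value(rho, policy, superops, rewards, gamma, horizon):
    """Finite-horizon approximation of Equation 4.2 by expanding all observation branches.

    policy: callable mapping a density matrix to an action index (illustrative).
    superops[a]: list of Kraus matrices of A^a; rewards[a]: reward operator R_a.
    """
    if horizon == 0:
        return 0.0
    a = policy(rho)
    value = float(np.real(np.trace(rho @ rewards[a])))         # R(rho, a) = Tr(rho R_a)
    for K in superops[a]:
        p = float(np.real(np.trace(K @ rho @ K.conj().T)))     # probability of observation o_i
        if p > 1e-12:
            rho_next = K @ rho @ K.conj().T / p                # N(rho, a, o_i)
            value += gamma * p * qomdp_policy_value(rho_next, policy, superops,
                                                    rewards, gamma, horizon - 1)
    return value
```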

The policy existence problem is to decide, given a decision process D, a starting state s, horizon h, and value V, whether there is a policy of horizon h that achieves value at least V for s in D. If h is infinite, the policy existence problem is undecidable for both POMDPs and QOMDPs [Barry et al., 2014]. If h is polynomial in the size of the input, the policy existence problem is in PSPACE [Barry et al., 2014].

The goal-state reachability problem for POMDPs is, given a goal decision process D and a starting state s, to determine whether there exists a policy that reaches the goal state from s in a finite number of steps with probability 1. For POMDPs, the goal-state reachability problem is decidable. However, for QOMDPs, Barry et al. [2014] proves that goal-state reachability is undecidable.

4.6 CLASSICAL REINFORCEMENT LEARNING MODELS

Classical Reinforcement Learning (RL) in robotics is used to incorporate learning capability into the planning mechanisms of MDPs and POMDPs. RL models an embodied agent situated and operating in a dynamic environment. The robotic agent receives sensory percepts to understand the current world state and has mechanisms for performing actions to affect the physical environment. The robotic agent typically has an objective or goal to accomplish and memory to allow for learning to improve its performance over time with respect to the goal.

Key to the RL model is that the environment (or other feedback mechanism) provides some kind of reward signal after the agent acts to help it learn. The agent uses this reward as a feedback signal, helping it know whether it is performing well or poorly in accomplishing its task. Each time step for the agent consists of a percept-action-reward cycle. After each time step, the agent learns from experience to update its internal processing to perform better in the future. The agent can, over time, learn which actions to use in what states to maximize performance with respect to its objective.

A simple RL agent can be described by a 6-tuple (S, A, Λ, I, D, U):

Reinforcement Learning Agent (S, A, Λ, I, D, U) Definition

S = {s1, s2, … , sm}, a set of possible environment percepts

A = {a1, a2, … , an}, a set of possible actions

• Λ = {0, 1}, a reward set which describes environment reward. This is the very simple case of reward being either “0” for negative reinforcement or “1” for positive reinforcement.

I = {i1, i2, … , ip}, a set of internal states of the agent.

D : S × IA, a function which maps percepts and internal state to action.

U : S × A × Λ × II, a function which updates the internal state via the reward of the last percept-action-reward sequence.

This mathematical framework is a very simple template for a reinforcement learning agent, of which there are many concrete variations. Some classes of reinforcement learning agents include Projection Simulation (PS) and Reflective Projection Simulation (RPS) agents [Briegel and De las Cuevas, 2012].
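
To ground the (S, A, Λ, I, D, U) template before turning to PS and RPS agents, here is a minimal Python sketch of one possible instantiation, in which the internal state I is a table of action-value estimates and the decision function D is ε-greedy; these concrete choices are illustrative assumptions, not part of the formal definition.

```python
import random

class SimpleRLAgent:
    """Minimal agent matching the (S, A, Lambda, I, D, U) template (illustrative)."""

    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        # Internal state I: here simply a table of estimated action values per percept
        self.values = {}

    def decide(self, percept):
        """D: map percept and internal state to an action (epsilon-greedy here)."""
        q = self.values.setdefault(percept, {a: 0.0 for a in self.actions})
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(q, key=q.get)

    def update(self, percept, action, reward, lr=0.1):
        """U: update the internal state from the last percept-action-reward triple."""
        q = self.values.setdefault(percept, {a: 0.0 for a in self.actions})
        q[action] += lr * (reward - q[action])
```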

4.6.1 PROJECTION SIMULATION AGENTS

Projection Simulation (PS) agents are agents whose internal states represent Episodic and Compositional Memory (ECM), consisting of directed, weighted graphs. The vertices of the graphs are “clips” that represent fragments of episodic experience (internal representation of received percepts and possible actions). Edges in the graphs represent the transition probability between clips of episodic experience. Collectively, the graphical structures represent the agent’s “brain.”

Deliberation consists of association-driven hops between the memory sequence clips in the graphs. The elementary process can be described by a Markov chain with the clips (i.e., graph vertices) and transition probabilities (i.e., graph edge weights). Given a new percept, the corresponding clip in the ECM is excited, and hopping between clips occurs according to the specified transition probabilities. In the simplest model, the hopping process terminates once a unit-length action clip is encountered. This unit-length action clip is then sent to the robot actuators. The ECM thus functions as a type of associative memory where, when the agent receives similar percepts, it traverses similar memory clips in the graphical model and acts according to the best policy it has learned thus far.

Learning involves changing the edge weights of the graph to upweight or downweight the probability of transitioning to particular clips (i.e., percepts and/or action sequences). Learning in the PS agent occurs according to the following update rules for h, the matrix containing the edge weights of the graph. At each time step the PS agent receives a percept, takes an action, and receives a reinforcement signal. If the action the agent takes is rewarded positively with rt ∈ Λ and clips ci and cj are traversed in the hopping process to output that action, the connection between the clips is strengthened:

h(t+1)(ci, cj) = h(t)(ci, cj) − γ (h(t)(ci, cj) − 1) + rt

where rt > 0 is a positive reward and 0 ≤ γ ≤ 1 is a “forgetfulness” parameter that trades off between the current reward and the previously learned model. If the action was not rewarded positively or clips ci and cj were not played in the hopping process:

h(t+1)(ci, cj) = h(t)(ci, cj) − γ (h(t)(ci, cj) − 1)

The learning rules strengthen clip connections that help obtain positive reward while down-weighting ones that are detracting from or are not useful toward obtaining positive reinforcement. The weights in h are normalized to form probabilities after each update.

With a large ECM, traversing edges between clips to get to an action can be slow. To speed up learning, the PS agent also maintains F = {f(s) | f(s) ⊆ A, s ∈ S}, a non-empty set of flagged actions for each percept. This set represents the agent’s short-term memory. If, given a percept, an action is taken but not rewarded, the action is removed from the set of flagged actions.
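
The two update rules above can be written compactly; the following NumPy sketch assumes h is stored as a dense matrix over clips and that rewards come from the Λ = {0, 1} set defined earlier (the helper names are illustrative).

```python
import numpy as np

def ps_update(h, traversed_edges, reward, gamma=0.1):
    """Update the PS edge-weight matrix h after one percept-action-reward step.

    traversed_edges: set of (i, j) index pairs for clip transitions used this step.
    Implements the two update rules above; a sketch, assuming h is a dense array.
    """
    h_new = h - gamma * (h - 1.0)          # "forgetfulness" applied to every edge
    for (i, j) in traversed_edges:
        h_new[i, j] += reward              # strengthen edges that produced positive reward
    return h_new

def transition_probabilities(h):
    """Row-normalize the weights to obtain hopping probabilities between clips."""
    return h / h.sum(axis=1, keepdims=True)
```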

4.6.2 REFLECTIVE PROJECTION SIMULATION AGENTS

Reflective Projection Simulation (RPS) agents are PS agents that repeat their diffusion process many times to attempt to find the stationary distribution of their Markov chains. They approximate the complete mixing of their Markov chains, simulating infinite deliberation times, as opposed to the finite deliberation of PS agents. Once the mixing converges, an RPS agent samples from the stationary distribution over the clip space until an action clip is reached.

To implement this design, RPS agents utilize a random walk algorithm on Markov chains. Formally, a Markov chain is specified by a transition probability matrix P with entries Pji, the probability of moving from clip j to clip i in the graph. For an irreducible Markov chain, there exists a stationary distribution π such that Pπ = π. For an irreducible and aperiodic Markov chain, one can approximate this distribution by Ptπ0 (i.e., applying P to the initial distribution π0 to simulate the random state evolution after t time steps), where t is at least the mixing time of the chain.

The mixing time, tmix, satisfies:

tmix ∈ O(1/(δ ∊′))

where δ is the spectral gap of the Markov chain, defined as δ = 1 − |λ2| (λ2 is the second largest eigenvalue of P), and ∊′ is the probability of sampling a particular action from the stationary distribution.

The internal state of an RPS agent is an irreducible, aperiodic, and reversible Markov chain over subsets of the clip space. For a percept s, let πs be the stationary distribution of the current Markov chain Ps and f(s) be the set of flagged actions (i.e., the short-term memory) for that percept. The RPS agent computes the stationary distribution πs of Ps and outputs an action according to:

π̃s(i) = πs(i) / ∑j∈f(s) πs(j)   if i ∈ f(s),   and   π̃s(i) = 0   otherwise,

which is the renormalized stationary distribution with support constrained to only flagged actions. The agent prepares this probability distribution and samples clips using it until a flagged action is hit.
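
A classical sketch of the RPS deliberation step may make this concrete; it assumes the chain P is stored as a column-stochastic NumPy array (so that P @ pi propagates a distribution), and the tolerances and helper names are illustrative.

```python
import numpy as np

def stationary_distribution(P, tol=1e-10, max_iter=100000):
    """Approximate the stationary distribution of a column-stochastic P by power iteration.

    Mirrors repeatedly applying P to an initial distribution pi_0 (illustrative sketch).
    """
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        pi_next = P @ pi
        if np.linalg.norm(pi_next - pi, 1) < tol:
            return pi_next
        pi = pi_next
    return pi

def sample_flagged_action(pi, flagged, rng=np.random.default_rng()):
    """Sample from the stationary distribution renormalized over flagged clips."""
    mask = np.zeros_like(pi)
    mask[list(flagged)] = pi[list(flagged)]
    return rng.choice(len(pi), p=mask / mask.sum())
```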

The RPS, by approximating the complete mixed distribution, simulates infinite deliberation time. The main computational requirement is running the algorithm for enough iterations to produce a convergent sampling distribution. This is where agent learning deployed on quantum computers may provide a speedup compared to classical implementations of RPS agents.

4.7 QUANTUM AGENT LEARNING

The model of quantum agent learning in Paparo et al. [2014] preserves the classical robot learning context. The robotic agent is operating in a physical environment that is, for practical purposes, entirely classical. The robotic agent thus cannot query the environment or try actions in quantum superposition since that would violate the laws of classical physics. The quantum RPS agent, however, can potentially obtain a speedup in its internal deliberation processes. By being able to deal with larger problem instances, a quantum robotic agent could theoretically operate in more unstructured and complicated environments than classical robotic agents.

The quantum RPS agent leverages search via quantum random walks on Markov chains [Magniez et al., 2011, Szegedy, 2004]. The quantum RPS agent uses several standard quantum operators, such as the quantum diffusion operator and the quantum-walk operator, to construct two key pieces of mathematical machinery: the approximate reflection operator subroutine R(Ps)(q, k) and the reflection over the set of flagged actions ref[f(s)]. The quantum active learning algorithm (shown in Algorithm 4.1) is essentially an application of these two components.

The approximate reflection operator subroutine R(Ps)(q, k), as its name suggests, approximates the reflection operator 2|πs⟩⟨πs| − I, where |πs⟩ = ∑i √(πs(i)) |i⟩ is the encoding of the stationary distribution πs. The parameters q and k control how accurate the approximation is and how many times the quantum-walk operator needs to be applied. The distance between the approximate subroutine and the ideal reflection operator is upper bounded by 2^(1−k) under a suitable metric. The fidelity of the operator approaches unity exponentially quickly in k, making it a good approximation. ref[f(s)] is the operator for reflections over flagged actions. This is a Grover-like reflection which, after it has been applied some number of times, finds a flagged action upon measurement. The operator is used with the search algorithm described in Magniez et al. [2011].

Algorithm 4.1 Quantum Active Learning Algorithm

Input : Initial state

Output : Flagged action to take

 

1: Initialize register to quantum state |πinit⟩ = |πs⟩ ⊗ |0⟩

2: Agent performs sequence of reflections over flagged actions ref[f(s)] interlaced with applications of the approximate reflection operator R(Ps)(q, k).

3: The resulting quantum state is measured and the output clip is checked. If the output clip is not a flagged action, go back to step 1.

4: return Flagged action

The probability of not hitting a flagged action decreases exponentially with the number of algorithm iterations. To find and sample from the stationary distribution, the total number of calls to the quantum diffusion operator is on the order of 1/√(δsεs), where δs = 1 − |λ2| (the “spectral gap” of Ps), λ2 is the second largest eigenvalue of Ps, and εs = ∑i∈f(s) πs(i) (the probability of sampling a flagged action from the stationary distribution πs). The total required number of reflections over flagged actions is on the order of 1/√εs.

A classical agent and a quantum agent that perceive the same percepts and can produce the same actions are proved to be ε-equal in terms of produced behavior [Paparo et al., 2014]. The behavioral equivalence of classical and quantum RPS agents means the solutions they learn are not intrinsically different. However, the quantum robotic agent is significantly faster in finding the solution, producing, in theory, a quadratic speedup over the classical robotic agent [Paparo et al., 2014].
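
As a rough worked example of what this quadratic speedup means, take the classical deliberation cost to scale on the order of 1/(δε) and the quantum cost on the order of 1/√(δε); the numbers below are purely illustrative.

```python
# Purely illustrative numbers for the scaling comparison discussed above.
delta, eps = 1e-4, 1e-3                       # assumed spectral gap and flagged-action probability
classical_steps = 1.0 / (delta * eps)         # ~ 1/(delta * eps) = 1e7
quantum_calls = 1.0 / (delta * eps) ** 0.5    # ~ 1/sqrt(delta * eps) ~ 3.2e3
print(f"classical ~ {classical_steps:.1e} steps, quantum ~ {quantum_calls:.1e} calls")
```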

Some mechanisms for implementing quantum RPS agents are suggested by Paparo et al. [2014]. Quantum random walks and related processes can be naturally implemented in linear optics setups by, for instance, arrays of polarizing beam splitters [Aspuru-Guzik and Walther, 2012]. Internal states of trapped ions [Blatt and Roos, 2012] could also be used. Finally, condensed matter systems are another potential implementation strategy: the proposed Markov chains could be realized through cooling or relaxation processes toward target distributions that encode the agent’s state of belief [Schaetz et al., 2013].

4.8 MULTI-ARMED BANDIT PROBLEM AND SINGLE PHOTON DECISION MAKER

The multi-armed bandit problem (defined in Mahajan and Teneketzis [2008]) describes the task of a gambler who has the choice of spinning 1 of N slot machines at each time step. The slot machines differ in their expected payoffs, which are unknown to the gambler a priori.

To improve profits over time, the gambler can learn from the experience of winning and losing from playing the slot machines. A slot machine that contributes to the gambler’s total gain is a natural choice for the next run, though the gambler must identify those favorable slot machines before he/she can exploit them. Thus, the complexity of optimally sequentially playing the slot machines is associated with an initial “exploration” phase where the gambler tries different slot machines to learn their distribution (which may lose some reward) and a subsequent “exploitation” phase where the most profitable slot machines are played repeatedly to maximize profit.

The multi-armed bandit problem is an analogy for many decision-making settings where there are many options but the expected rewards are uncertain. In robotics, multi-armed bandit problem formulations have been proposed for exploring unknown environments [Si et al., 2007], robot foraging [Srivastava et al., 2013], and task partitioning in swarm robotics [Pini et al., 2012]. There are many classical approaches to tackling these problems, but here we discuss an exciting quantum implementation.
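
As a point of comparison with the classical approaches just mentioned, the following is a minimal ε-greedy baseline for the two-armed case in Python; the payout probabilities, step count, and helper name are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(payout_probs, steps=1000, epsilon=0.1, rng=random.Random(0)):
    """Classical epsilon-greedy baseline for the N-armed bandit (illustrative sketch).

    payout_probs: win probabilities of each slot machine, unknown to the gambler.
    Balances exploration (random arm with prob. epsilon) and exploitation (best arm so far).
    """
    n = len(payout_probs)
    wins = [0] * n
    plays = [0] * n
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon or sum(plays) == 0:
            arm = rng.randrange(n)                                   # explore
        else:
            arm = max(range(n),
                      key=lambda i: wins[i] / plays[i] if plays[i] else 0.0)  # exploit
        reward = 1 if rng.random() < payout_probs[arm] else 0
        wins[arm] += reward
        plays[arm] += 1
        total += reward
    return total, plays

# Two machines (N = 2), as in the photonic demonstration discussed next
total_reward, play_counts = epsilon_greedy_bandit([0.4, 0.6])
```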

Efforts have been made to solve multi-armed bandit problems using photonic implementations. For the N = 2 case of the multi-armed bandit problem, Naruse et al. [2015] demonstrate a physical implementation that uses a single photon as a decision maker. The system implementation diagram is shown in Figure 4.1.

In the implementation, a laser excites a nitrogen-vacancy center in a diamond crystal, and the emitted single photon passes through an immersion objective (IO), a polarization adjuster apparatus (PAA), and a polarization beam splitter (PBS). The slot machine played at each time step (either the right or the left slot machine) depends on whether the single photon is detected in the reflected or the transmitted beam of the PBS, respectively. A winner-loser detector closes the feedback loop by adjusting the PAA to favor slot machines that increase the total gain and to suppress the choice of a slot machine that did not reward the gambler.

The proposed physical implementation of multi-armed bandit decision making in optics benefits from the quantum nature of the photon. Decision making could feasibly occur at the speed of light without additional computational effort. Note, however, that the entire optical system currently has additional latencies and can take several seconds to read out a decision. The current implementation is thus not yet a real-time decision-making system. In addition, much work remains to be done to scale the system beyond the N = 2 case of the multi-armed bandit problem.


Figure 4.1: Illustration of single photon decision maker. (Adapted from “Single-photon decision maker,” by Naruse et al. [2015], Scientific Reports 5: 13253. Adapted with permission.)

4.9 CHAPTER SUMMARY

Chapter Key Points

• Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) are classical frameworks for modeling robotic planning with uncertainty.

• Quantum MDPs (qMDPs) and Quantum POMDPs (QOMDPs) attempt to model an agent tasked with manipulating quantum phenomena in a quantum environment. The models have different complexity properties from their classical counterparts.

• Reinforcement Learning is a classical framework for modeling a robot that learns to improve its performance in a dynamic environment using reward feedback.

• Attempts have been made to formulate Quantum Agent Learning algorithms that may help a robot making decisions in a classical environment improve its behavior over time. In theory, a quantum agent may be able to learn faster than a classical agent to reach optimal behavior.

• Attempts have been made to solve simple multi-armed bandit instances with a photon decision maker.

Table 4.1: Summary of quantum operating principles in discussed quantum agent models

