CHAPTER 5

Machine Learning Mechanisms for Quantum Robotics

Applications of machine learning are ubiquitous in robotics today. Agent learning (previously covered) is only one of the forms of machine learning commonly used in classical robotics. We briefly summarize a few areas of classical robotics that have recently used machine learning:

Automated analysis and data mining of sensor data for perception: A robot needs to be able to robustly understand its sensor data, mine it for patterns, and understand possible percepts (e.g., obstacles) in its world. Machine learning approaches are commonly being researched for computer vision [Cipolla et al., 2013], lidar [Lodha et al., 2006, Zhao et al., 2011], sonar [Dietterich, 1997], force sensing [Edsinger-Gonzales and Weber, 2004], and other such perception technologies.

Robot localization: Robots need to be able to self-localize in an environment. Modern localization technologies (e.g., inertial navigation [Bagnell et al., 2010], wi-fi localization [Biswas and Veloso, 2010], range-sensor-based localization [Thrun et al., 2001]) all make extensive use of machine learning methods.

Learning robot controllers and planners: Learning robot controllers from simulation or actual data are key techniques being employed in many domains involving robots. High-level robot planners also benefit from learning, such as learning retrospectively from experience. Online Learning [Gaskett and Cheng, 2003, Hadsell et al., 2007, Hagras, 2001], Reinforcement Learning [Kaelbling et al., 1996], and Deep Learning [Bengio, 2009, LeCun et al., 2010, Lenz et al., 2015] are key approaches being heavily researched by groups all over the world.

Learning human-robot interaction: Data-driven approaches are being used to understand how humans and robots interact [Hiraki, 1995]. Machine learning approaches are being used to model this interaction and recognize user affect [Rani et al., 2006].

Given the extensive and varied ways that classical machine learning is currently being used in robotics, one expects that fundamental advances in machine learning will be impactful. Quantum machine learning is a rapidly emerging research discipline. For excellent surveys of the quantum machine learning literature, we direct the reader to Wittek [2014] and Schuld et al. [2015]. In our exposition in quantum robotics, we are indebted to the literature summaries provided by these and strive to build upon their work by also incorporating research released after their publication. One of our further contributions to the literature is in presenting quantum algorithms in comparison to classical robotic algorithms and highlighting the ways quantum robotics allows a speedup, a representation advantage, or, in some cases, a fundamental departure from classical robotic methods.

5.1 QUANTUM OPERATING PRINCIPLES IN QUANTUM MACHINE LEARNING

Quantum machine learning algorithms can be fairly nuanced in their mathematics. To make the concepts easier to understand, we boil down these algorithms to their key Quantum Operating Principles (QOPs) in this section. Recall from Section 1.3, QOPs is a presentation style we introduce to help explain the basic principles by which many quantum machine learning algorithms obtain benefits over classical approaches.

5.1.1 QUANTUM MEMORY

The first key QOP is the use of qubit representation in a quantum memory. By employing qubits in superposition, quantum computers are expected to have substantial representational power gains over classical methods of representing data. While N classical bits can only represent one N-bit number, N qubits can represent all possible 2N N-bit numbers.

This capability suggests a future world of “Big Quantum Data,” a term coined by Lloyd et al. [2013]. The International Data Corporation (IDC) estimates that by 2020, humans will have produced 40 zettabytes of electronic data [Gantz and Reinsel, 2012]. This entire data set could fit in about 79 qubits. It is estimated that there are 1090 particles in the universe. This number of possibilities can be represented with about 300 qubits. To put things into perspective, it is estimated that the human brain stores only 2.5 petabytes of binary data [Juels and Wong, 2014]. This can be stored in 55 qubits.

There is asymptotically modest overhead in converting between classical and quantum data. Classical data represented as N-dimensional complex vectors can be mapped onto quantum states over log2 (N) qubits. Description of quantum memory (QRAM) is given in Lloyd et al. [2013]. The mapping takes O(log2 (N)) steps, but, once in quantum form, quantum complexities apply for algorithms. Using Grover’s algorithm or the Adiabatic Theorem, many classical algorithms are estimated to obtain exponential or quadratic speedups.

5.1.2 QUANTUM INNER PRODUCTS AND DISTANCES

Quantum computing theorizes faster computation of the inner product between two vectors [Schuld et al., 2015]. Since many machine learning algorithms make use of inner products, these algorithms are expected to benefit from quantum approaches.

Lloyd’s algorithm [Lloyd et al., 2013] for computing quantum inner products makes use of the polarization identity for dot products:

image

Define Z = ||xi||2 + ||xj||2 so image. The first step of the algorithm involves generating the quantum states |ψimage and |ϕimage:

image

|ψimage can be generated by addressing the QRAM, while the other state can be estimated using the subsequent steps. By evolving the state image (|0image − |1image) image |0image with the Hamiltonian H = (||xi|| |0image image0| + ||xj|| |1image image1|) image σx, (where σx = image, a Pauli matrix), the following state is obtained:

image

The parameter t is chosen subject to the constraint that ||xi|| << 1 and ||xj|| << 1. By measuring the result, the state |ϕimage is returned with probability image. Repeating the experiment multiple times allows estimation of Z. If the desired accuracy of estimation is , the complexity of constructing |ϕimage and Z is O(−1). After a swap operation is applied, the resulting state can be used to obtain an estimate of the dot product:

image

With QRAM accesses, the overall complexity of a single dot product of an N-dimensional vector with Lloyd’s dot product algorithm is O(−1 log N). Classically, computing a dot product is O(N).

The quantum dot product algorithm generalizes to nonlinear metrics. Harrow et al. [2009] shows that an arbitrary distance function can be approximated by q-th order polynomials. While classical non-linearity can be exponentially hard, the expected time complexity for this approximation on quantum computers is 0(−1q log(N)).

5.1.3 HAMILTONIAN SIMULATION

Machine learning algorithms that can be cast within a Hamiltonian framework can be simulated on a quantum computer [Berry et al., 2007]. The Lie Product Formula states that for arbitrary n × n matrices X and Y,

image

To simulate a Hamiltonian H that is a sum of m component Hamiltonians Hj:

image

one can calculate:

image

where i = image. If the desired accuracy of the simulation is , it will be bounded by:

image

where dj is the numeric dimensionality of the Hilbert space that Hamiltonian Hj acts on. The time complexity of Hamiltonian Simulation is linear in the number of Hamiltonian steps (i.e., number of exponentials).

5.1.4 QOPS SUMMARY FOR QUANTUM MACHINE LEARNING

Table 5.1 highlights the Quantum Operating Principles (QOPs) used by specific machine learning algorithms. In the subsequent sections, we survey each of these algorithms in the context of classical robotics and how the algorithm is expected to change in the quantum world. We also hypothesize the application of the quantum algorithm to key computational problems in robotics.

5.2 QUANTUM PRINCIPAL COMPONENT ANALYSIS (PCA)

Principal Component Analysis (PCA) constructs a set of features (i.e., “principal components”), which are linear combinations of the original features, that minimize the reconstruction error of the original data. PCA has two major uses: first, as a method of data exploration for identifying the key linear directions of variance in a data set and, second, as an approach to dimensionality reduction. Generally, keeping fewer of the new (principal) components is sufficient for low reconstruction error on the original data, facilitating reduction in dimensionality of the data and, at the same time, explanation of its variance.

5.2.1 CLASSICAL PCA ANALYSIS

The input to PCA is a data matrix X (N data points × D dimensions per point). Classical PCA calculates the eigendecomposition of XTX using Singular Value Decomposition (SVD) as XTX = W∑WT where W is a matrix of eigenvectors and is a diagonal matrix containing the eigenvalues in decreasing order. The projection of the data onto its top k eigenvectors or “principal components” can be calculated via Tk = XWk. When the data is unprojected and the reconstruction error computed via R = ||Ximage, high reconstruction error is indicative of anomalous behavior to those extracted linear directions of maximal variance. The method can thus help identify the key linear directions of variance in data and distinguish data points that are anomalous to general trends.

5.2.2 QUANTUM PCA ANALYSIS

Quantum PCA [Lloyd et al., 2014] potentially offers an exponential factor speedup when compared to classical PCA. The method computes the eigendecomposition of the data matrix using a quantum state built via simulating Hamiltonians. The key insight is that a density matrix ρ can be used as a Hamiltonian on another density matrix. n copies of the density matrix ρ are used to perform the unitary transform eiρt where i = image. Applying a quantum eigenvalue estimation on eiρt yields the following quantum state:

image

where χk are the eigenvectors of ρ and image are its eigenvalues. Sampling from this state reveals the key principal components [Lloyd et al., 2014]. Classical PCA takes time polynomial in the dimension of system to compute the principal components, while Quantum PCA takes time logarithmic in the dimension of the system.

Quantum PCA also has an interesting interpretation as a method for “quantum self-tomography.” In conventional sensing with robots, the environment is passive: it is there to be measured. In Quantum PCA, the quantum state ρ plays an active role in its own analysis. The quantum state ρ itself is used to implement the unitary operator eiρt, a Hamiltonian that acts on copies of itself to reveal its own variance properties. This remarkable introspective capability, where the quantum particles effectively self-analyze, is unseen in classical systems.

5.2.3 POTENTIAL IMPACT OF QUANTUM PCA ON ROBOTICS

In robotics, PCA has been used for belief compression in POMDP planning [Roy and Gordon, 2003], simultaneous mapping and localization models [Brunskill and Roy, 2005], and robot environment modeling [Vlassis and Krose, 1999]. These are areas that may expect a speedup in computational processes. In addition, the self-analysis aspects of Quantum PCA may lead to some interesting development of sensor technology.

5.3 QUANTUM REGRESSION

Regression analysis aims to understand the relationships between statistical variables {(x1, y1), … , (xn, yn)}. The goal is to approximate the statistical relation between the variables using a function f such that {(x1, f(x1)), … , (xn, f(xn))} resembles the original training data. With the approximation of f, values outside the training data such as (z, f(z)) can be extrapolated.

Regression methods are ubiquitous in robotics since so much of real-world physics can be modeled with continuous statistical processes. Large regression problems with many data points and variables may be solved exponentially faster using quantum methods, allowing robots to handle more complicated environments.

5.3.1 LEAST SQUARES FITTING

Least squares fitting is a common algorithm for classical regression. A function of the form:

image

is fit where f(x, λ) : image and λi is a component of the weight vector λ. The fitting is done with the criterion of minimizing the magnitude of the least squares error:

image

The fit parameters are found by applying the Moore-Penrose pseudoinverse of F (denoted F+). The solution is given by:

image

5.3.2 QUANTUM APPROACHES TO CURVE FITTING

HHL (named after its co-inventors: Aram Harrow, Avinatan Hassidim, and Seth Lloyd) [Harrow et al., 2009] is a quantum algorithm for efficiently analyzing a linear system of the form:

image

where F is a sparse Hermitian matrix, x is a vector of predictor variables, and b is the N × 1 vector of predicted variables. HLL computes the expectation value < x |M|x > of an arbitrary poly-size Hermitian operator M in Oimage steps where κ is the matrix F’s condition number, s is the maximum number of nonzero elements in any row or column of F, and is the acceptable error distance to the optimal solution.

The key insight of Wiebe et al. [2012] is that HLL can be adapted to the problem of curve fitting, so that one can efficiently obtain the quality of the fit without having to learn the fit parameters. The HLL algorithm is used to construct the quantum state:

image

where I is an isometry superoperator which maps a matrix X to another matrix of the following form:

image

The time complexity of implementing the quantum state in Equation 5.14 within error is Oimage with high probability. A swap test is then used to efficiently determine the accuracy of the fit. The swap test can distinguish |yimage from I(F) |λimage and return “1” if the states are different. The overlap of the two quantum states (that determines the fit quality) can be statistically measured by repeating the swap test multiple times.

Limitations of Wiebe et al. [2012] are the assumptions of F being sparse and Hermitian and that the full fitted curve is not determined efficiently. Wang [2014] attempts to provide a quantum curve fitting algorithm that can work for the general case, though their running time is poly-logarithmic in the number of samples but not in the dimensionality of the data (a fundamental tradeoff between the two contrasting approaches). The approach in Wang [2014] leverages the density matrix exponentiation technique developed by Lloyd et al. [2014] that can simulate general non-sparse Hamiltonians. Wang [2014] proposes simulating the Hamiltonian image (where i = image) to create an algorithm that is theoretically exponentially faster than classical curve-fitting. It remains to be experimentally verified whether this complexity result is true in practice with real capability to simulate non-sparse Hamiltonians.

5.3.3 POTENTIAL IMPACT OF QUANTUM REGRESSION ON ROBOTICS

Given the prevalence of regression methods in robotics, there are numerous applications. Since so many computational modeling problems are continuous in robotics, a major benefit would be given to large regression problems with many data points and variables.

5.4 QUANTUM CLUSTERING

Clustering is another type of exploratory statistical analysis. In clustering, a set of objects is divided into several subgroups (clusters) such that each contains similar objects. Quantum clustering represents an exponential speedup on classical equivalents (in some cases), enabling the clustering of significantly larger datasets. For example, quantum cluster analysis may one day scale clustering to run over multi-exabyte data sets of perceptual data.

5.4.1 CLASSICAL CLUSTER ANALYSIS

The aim of clustering is to identify a set of K clusters (or partitions) C = {C1, C2, … CK} which coherently divide a set of input data points x = {x1, x2, … , xn} such that each data point is assigned to one or more clusters. A generic clustering algorithm is described by the procedure in Algorithm 5.1. A variety of clustering methods exist which employ different measures of similarity and distance used to assign data points to their “closest” cluster.

Algorithm 5.1 Generic Clustering Algorithm

Input : Data points x = {x1, x2, … , xn}, A set of initial K cluster centroids Z = {Z1, Z2, … , ZK} (generally cluster guesses, often generated at random)

Output : K stable cluster centroids

 

1: repeat

2:Form K clusters by assigning each point to the “closest” cluster Zk

3:Recompute the centroid of each cluster based on the new assignment of data points.

4: until Centroids do not change.

5: return K stable cluster centroids Z = {Z1, Z2, … , ZK}

One common method, k-means clustering, assigns data points to clusters by minimizing the Within-Cluster Sum of Squares (WCSS).

image

This method calculates the sum of squares within each cluster individually, by computing the squared distance from each data point x to its cluster’s geometric mean μi. The individual data point squared distances are then summed together. The algorithm attempts to find an assignment of data points to clusters to minimize this cost.

Another method, k-medians clustering, optimizes data to cluster assignment using the geometric median instead. k-medians uses the cost function:

image

The median of cluster Ci is denoted as mi. |xmi| is the measure of Manhattan distance from each data point to the geometric median of the cluster to which it belongs. The cost function aims to find the minimum cost solution across all possible clusterings of the points.

5.4.2 QUANTUM CLUSTER ANALYSIS

Attempts have been made to extend classical clustering to quantum clustering. One method of quantum clustering utilizes Grover’s algorithm in conjunction with a designed oracle function [Aïmeur et al., 2013]. Consider all pairs of points xi and xj within cluster Ck. While Aïmeur et al. [2013] did not specify the oracle’s formula or quantum circuit implementation, their approach can generically be understood as applying the Hadamard gate on the superposition of all the pairs of indexes. The oracle would then output a superposition including all the triples |i, j, Dist(xi, xj)image as quantum states which allow for speeding up the calculation of distances between pairs of points using quantum subroutines.

For example, Grover’s algorithm can be used to calculate the minimum value of a subset of distance values. If the subset is the list of centroids, this quantum subroutine can be used to compute Step 2 in Algorithm 5.1. If the subset is the list of data points assigned to a centroid, this quantum subroutine can be used to compute Step 3 in Algorithm 5.1. Overall, these quantum subroutines are hypothesized to result in a quadratic speedup for quantum clustering.

Another approach to quantum clustering utilizes adiabatic computation [Lloyd et al., 2013]. This formulation of a quantum k-means algorithm constructs the quantum state:

image

where c is a cluster, j represents a point in cluster c, and M is the number of data points. This represents the output of k-means clustering in quantum state form and can be measured to produce the solution. To construct this state, Lloyd et al. [2013] begins by initializing the basic quantum state:

image

where k is the number of clusters, M is the number of data points, and |cimage |jimage represents an assignment initialization guess. d copies of the seed state image allow computing data point distances to cluster centers in superposition. Let image be a data point. Let image be a cluster center. Then, using Hamiltonians, one can compute all image in superposition.

Two Hamiltonians are applied in succession. The first, H1, evaluates distances and corresponds to Step 2 in Algorithm 5.1.

image

The second, Hf, performs the cluster assignment. This corresponds to Step 3 in Algorithm 5.1.

image

Both taken together construct the quantum state |ψimage. The construction of the quantum state potentially allows for an exponential speedup over classical approaches.

5.4.3 POTENTIAL IMPACT OF QUANTUM CLUSTERING ON ROBOTICS

In robotics, classical clustering has been used to understand image data [Martínez and Vitria, 2001], trajectory data [Aleotti and Caselli, 2005], and robot experience [Oates et al., 2000]. These are all domains that will receive a speedup from quantum clustering. A robot using quantum clustering may be able to understand more complex percepts and patterns in noisy sensor data.

5.5 QUANTUM SUPPORT VECTOR MACHINES

The Support Vector Machine (SVM) is a powerful classification tool that takes in data with supervised labels and learns to classify previously unseen instances. In robotics, SVM has been used for learning robotic grasping models [Pelossof et al., 2004], place recognition models [Pronobis et al., 2008], data fusion methods [Xizhe et al., 2004], and object recognition [Ji et al., 2012].

5.5.1 CLASSICAL SVM ANALYSIS

The simplest Support Vector Machine model takes as input a set of training instances {(x1, y1), (x2, y2), … , (xN, yN)} where xi are feature vectors and yi ∈ {−1, +1} are binary classes to which data points belong. An SVM attempts to fit a linear separating hyperplane to the data that maximally separates the data from the two classes.

Formally, a hyperplane in image is wT x − b = 0 where w is the normal vector to the hyperplane and b is a bias parameter that helps determine offset from the origin. The margin equation for the two classes is given by yi(wT xib) ≥ 1 for i = 1, … , N. The distance between the two classes is given by image, and the goal is to minimize w to get the maximum margin between the two classes. This optimization problem is represented by image

Since most classification problems are not precisely linearly separable, soft margin SVMs are commonly used to find a solution. A soft margin relaxes the assumption that positive and negative training examples need to be perfectly linearly separable by a hard margin. In the soft margin SVM, the introduction of nonnegative slack variables i into the objective function that measure the degree to which a data point deviates from the margin allows element densities to overlap. The soft margin SVM objective is given by min image

Kernel functions can be used to automatically map data from an input space to a higher dimensional space. The key idea is to transform the data using a function ϕ(x) that retains many properties of the inner product but is nonlinear. The Kernel function nonlinearly maps the input data to a higher dimensional space where a linear separator can be fit with greater ease.

Given all these constraints, the following optimization problem is obtained (using the Lagrangian of the dual problem):

image

where K(xi, xj) = ϕ(xi)T ϕ(xj).

image

Figure 5.1: Example support vector machine and fit separating max-margin hyperplane on two class data.

The αi are learned weights for points. The few αi that are non-zero correspond to the “support vectors” that define the margin. The C constraint is a form of regularization that follows from the derivation of the slack variables, trading off between maximizing the margin and allowing more softness on the margin. The Kernel function K(xi, xj) allows the SVM to first transform the data to a nonlinear space before fitting the hyperplane.

For an arbitrary new data point x,

image

assigns the new point to its category (i.e., either the + class or the — class, based on the sign of the expression).

The SVM is classically useful on small to medium-size data sets but challenging to scale to larger data sets for real-time robotics. In addition, convex loss functions must often be used with the SVM (such as the well-known hinge loss) for the optimization to be tractable. However, convex functions can be sensitive to noise, especially class label (i.e., +/−) outliers [Anguita et al., 2003].

5.5.2 QUANTUM SVM ANALYSIS

Several quantum variants of the SVM have been proposed. In theory, a quantum SVM could scale to much larger data sets than the classical SVM. The quantum SVM would also not have to make convex approximations, potentially finding more robust decision boundaries that increase classification accuracy.

Anguita et al. [2003] proposes a Grover-style algorithm to perform an exhaustive search in the parameter and cost space for any arbitrary SVM objective function. Even non-convex SVM objective functions (such as the famous “0/1” loss function) can be used in the optimization. While generally this type of discrete optimization would be intractable classically, running Monte Carlo Random Search on a quantum computer could be useful with speedups from quantum parallelism.

Assume that the minimization problem has M different solutions. In a search space with N possible configurations, the probability of success of a Monte Carlo search after r iterations is: image = 1 − (1 − M/N)r. The quantum approach exhibits an advantage whenever 22.5M < image [Anguita et al., 2003].

Neven et al. [2008] proposes an adiabatic optimization approach to the SVM where the SVM objective function is cast as a Quadratic Unconstrained Binary Program (QUBO). QUBO structures are well-optimizable by adiabatic optimization. The paper reports better generalization error may be possible for many problems.

Rebentrost et al. [2014] illustrates that an exponential speedup is possible when using the least squares formulation of the SVM. By using Hamiltonian simulation on an adiabatic processor as well as expressing the problem in the eigenbasis obtained by Quantum PCA [Lloyd et al., 2014], a quantum variant of the least squares SVM can be fit efficiently:

image

where K is the kernel matrix, γ is a regularization parameter, I is the identity matrix, and 1 is a vector of ones. By simulating the matrix exponential of F, an exponential speedup in SVM training is forecasted by this model [Rebentrost et al., 2014].

5.5.3 POTENTIAL IMPACT OF QUANTUM SVMS ON ROBOTICS

In robotics, SVM is used for problems such as learning robotic grasping models [Pelossof et al., 2004], place recognition models [Pronobis et al., 2008], data fusion methods [Xizhe et al., 2004], and object recognition [Ji et al., 2012]. One may expect to be able to build a classifier on a larger amount of data with a quantum SVM. Thus one may expect to be able to grasp objects that are more complex and navigate environments with more varied structure.

5.6 QUANTUM BAYESIAN NETWORKS

A Bayesian Network is a compact graphical representation of a complicated joint probability distribution function P(X1, … , Xn). Often times, real-world probability distributions have a conditional independence structure that allows the factorization of the distribution into a few product terms. Bayesian Networks are one method of concisely representing such a factorization.

image

Figure 5.2: Example Bayesian network for diagnosing the health of a sonar range sensor.

Figure 5.2 shows an example Bayesian Network. The nodes represent variables, and the links between variables express conditional independence relations. A node in a Bayesian Network is conditionally independent of other nodes given its parent nodes. A Bayesian Network thus factorizes according to the following rule:

image

In the example shown in Figure 5.2, imagine a robot with a sonar range sensor. The sonar sensor allows the robot to detect obstacles in the environment, returning a spike in sensor signal if an obstacle is detected. However, the sonar sensor may also break (e.g., due to the robot falling on itself) and, if it is broken, the sonar sensor may spike spuriously. If a sensor spike is received, one may want to try to infer the likely cause. This situation can be modeled using a Bayesian Network.

Let B represent the event “Broken Sensor,” O represent the event “Obstacle Sensed,” and S represent the event “Sensor Spike.” In our example, the network shows that either having a broken sensor and/or sensing an obstacle in the environment may lead to a spike in the sensor’s signal. These factors can be used to predict the probability of getting a sensor spike. If one is obtained empirically, one can try to infer the likely causes.

The factorization of the probability distribution function shown in the Bayes Net is: P(B, O, S) = P(S|B, O) P(B) P(O). The conditional probability tables (CPTs) express the parameters of the probability distribution. Notice that the factorization makes the probability distribution more succinct, requiring less memory to store the joint distribution.

A Bayesian Network, with defined structure and CPT parameters, can be used for inferring unobserved variables. Any joint probability in the overall probability distribution function can be calculated via the factorization and CPT parameters (for algorithms regarding this capability, see Koller and Friedman [2009]).

In robotics, Bayesian Networks have been applied to particle filtering [Doucet et al., 2000], structure learning in robot grasping [Song et al., 2011], 3D reconstruction [Delage et al., 2006], and robot localization [Zhou and Sakane, 2007], amongst other problems.

5.6.1 CLASSICAL BAYESIAN NETWORK STRUCTURE LEARNING

The problem of Bayesian Network Structure Learning (BNSL) is to learn both the network conditional independence structure and CPT parameters from data. Classically, there are many ways to tackle this problem. Methods include using independence tests between pairs of variables to establish variable relations, scoring the entire graph structure (using maximum likelihood estimation) and searching over structures, using Bayesian scores and a prior to bias the search to particular structures, and Bayesian model averaging with top models to fuse multiple models to find the optimal fit. For an in-depth treatment of classical BNSL, see Koller and Friedman [2009].

Classically, BNSL is an exponentially hard computational problem. While this does not change in the quantum world, one does expect a significant speedup in solving instances of BNSL and potential capability to learn larger Bayesian Network structures from data.

5.6.2 BAYESIAN NETWORK STRUCTURE LEARNING USING ADIABATIC OPTIMIZATION

O’Gorman et al. [2015] propose an algorithm for using quantum adiabatic optimization to perform BNSL. The general strategy is as follows:

Step 1 of the algorithm involves creating a bit string representation of a candidate graph structure. Define: d = (dij | 1 ≤ i < jn, i image j as the string of bits of possible edges where dij = 1 indicates an edge from vertex Xi to vertex Xj, dij = 0 indicates no edge, and dii = 0. G(d) specifies an encoded candidate directed graph.

Algorithm 5.2 Quantum Adiabatic Optimization for Bayesian Network Structure Learning

Input : n variables

Output : QUBO instance for BNSL problem (solvable via adiabatic optimization)

 

1: Encode all digraphs using a set of Boolean variables, each of which indicates presence or absence of a directed edge.

2: Define a quadratic pseudo-Boolean function on those variables that scores encoded digraphs.

3: Encode additional resource and graphical constraints for optimization.

4: return Quadratic Unconstrained Binary Optimization (QUBO) instance that is defined over O(n2) variables when mapped from a BNSL instance with n variables.

Step 2 of the algorithm involves defining a quadratic pseudo-Boolean Hamiltonian function Hscore (d) such that Hscore (d) = s(G(d)), the likelihood score for the proposed graphical structure given the data. A pseudo-Boolean function is a function f: Bbimage where B = {0, 1} is a Boolean domain and b is a nonnegative integer called the arity of the function.

Any pseudo-Boolean function has a unique multinomial form, so one can write:

image

Additional considerations of the optimization include incorporating a resource constraint on the maximum connectivity and a penalty on cycles in the graphical structure. The resource constraint is implemented since quantum adiabatic hardware is currently very limited in total qubit connections. For this constraint, the function image is defined whose value is zero if variable Xi has at most m parents and positive otherwise. Cycles in the graph are also penalized using a image term. This constraint leverages a theoretical computer science result that, if a tournament has any cycle, it has a cycle of length three. Thus, to penalize cycles, it is sufficient to penalize directed 3-cycles. The overall Hamiltonian is given by:

image

The Hamiltonian ensures that the ground state of the system encodes the highest likelihood scoring directed acyclic graph with a maximum parent set size that can represent the Bayesian Network.

The algorithm described in O’Gorman et al. [2015] has been implemented on a D-Wave Two chip system installed at NASA Ames Research Center that allows problems up to seven Bayesian Network variables. Note that the implementation is not yet competitive, as classical methods are able to deal with many tens of Bayesian Network variables. A polynomial speedup on BNSL is expected with quantum adiabatic optimization, though is yet to be demonstrated with current devices.

5.6.3 POTENTIAL IMPACT OF QUANTUM BAYESIAN NETWORKS ON ROBOTICS

Many classical robotic representations are Bayesian. Problems such as robotic localization and mapping have been cast within a Bayesian framework with much practical success. Bayesian Networks have been applied to particle filtering [Doucet et al., 2000], structure learning in robot grasping [Song et al., 2011], 3D reconstruction [Delage et al., 2006], and robot localization [Zhou and Sakane, 2007], amongst other problems. Current Bayesian Network methods are limited by the capability to learn complex graphical structures. If Quantum Bayesian Networks are able to scale to larger graphical structures, a richer representation of robotic interactions with an environment could be captured.

5.7 QUANTUM ARTIFICIAL NEURAL NETWORKS

Artificial Neural networks (ANNs) are a model of computation loosely inspired by the structure of the brain. ANNs consist of connected neurons that, with enough layers, can serve as a powerful non-linear classifier or feature extractor. In robotics, ANNs have helped mobile robot perception [Pomerleau, 1993] and control of robot manipulators [Lewis et al., 1998], amongst a growing number of applications.

5.7.1 CLASSICAL ARTIFICIAL NEURAL NETWORKS

An artificial neural network is a network of units (“neurons”) connected by weighted edges (“synapses”). The network receives an input, applies a complex transformation to the input based on its connections, and produces a response called the “activation.”

The simplest case, called a “Perceptron,” consists of only a single neuron [Rosenblatt, 1958]. A single neuron works as a binary classifier. The activation function models whether the neuron fires or not:

image

where x is the input vector in imaged, w is a weight vector, and b is the bias term. Training of the classical perceptron is facilitated via the “delta rule,” a form of gradient descent where each weight wj is updated in proportion to the error it produces in prediction:

image

where yi is the true label of the data point and γ is a predefined learning rate that can be varied throughout training.

The perceptron model suffers from the “XOR” problem [Minsky and Papert, 1969]. Training will not converge if the training examples are not linearly separable with respect to their classes. The solution is to use multiple layers in the neural network.

Feedforward networks, perhaps the most common type of multi-layer artificial neural network, assume connections only go in one direction with no directed cycles. Typical feedforward networks have an input layer, some number of neurons in a hidden layer, and an output layer for classification. Feedforward networks can be trained using the famous backpropagation algorithm which updates weights in the network in proportion to the error they produce [Rumelhart et al., 1988].

A Hopfield Network [Hopfield, 1982], another type of multi-layer artificial neural network, is an “associative memory” which recalls a canonical pattern previously generalized from training examples. Given an input pattern, it automatically generalizes to the output without requiring explicit addressing of the training examples.

Structurally, a Hopfield Network is an assembly of fully connected perceptrons with no self-loops. K perceptrons have K(K − 1) interconnections. The perceptrons send signals around to each other until convergence in either a synchronous or asynchronous function. In the simplest model, the weights are assumed to be symmetric with the state of a single neuron si being either +1 or − 1 determined by:

image

where θi is a predefined neuron firing threshold. The “Energy” of the Hopfield Network is determined as:

image

Interestingly, updating weights in the network to learn a pattern always means the energy quantity remains the same or decreases. This dissipative nature of neural computation makes implementation of quantum artificial neural networks challenging.

Boltzmann Machines [Hinton and Sejnowski, 1986] can be viewed as a stochastic variant of the Hopfield network. A Boltzmann Machine consists of two sets of units: visible units (v) which encode the input/output of the network and hidden units (h) which capture complex correlations between input and output. Like Hopfield units, these units can be “on” (1) or “off” (0). The probability density function of a Boltzmann Machine for a given configuration of visible and hidden nodes is given by:

image

where E(v, h) is the network’s energy function, b and d are bias vectors that provide a penalty for a unit taking a value of 1, and wvh, wv, and wh are tunable weight vectors. Z is a normalization factor, also known as a partition function.

Restricted Boltzmann Machines (RBMs) [Smolensky, 1986] are a class of Boltzmann machines where visible units are not directly connected to other visible units, and hidden units are not directly connected to other hidden units. Thus, the visible units and hidden units form a bipartite graph and wh = wv = 0.

Long training time served as a major hurdle for multi-layer artificial neural networks. Deep Learning [Bengio, 2009, Bengio et al., 2007, Hinton et al., 2006, LeCun et al., 2010] leverages computational tricks such as learning by layers, feed-forward approximations, parallel architectures, and GPU acceleration to speed up computation. For example, Boltzmann machines have traditionally been hard to train since computing the partition function is #P-hard. Deep Restricted Boltzmann Machines (dRBMs) [Le Roux and Bengio, 2008, Salakhutdinov and Hinton, 2009] require that each layer of hidden units is connected only to units in adjacent layers. This allows for the use of a contrastive divergence approximation to efficiently train the network in a greedy layer-by-layer fashion. The classical approximation can lose accuracy, which may be improvable by quantum approaches.

5.7.2 QUANTUM APPROACHES TO ARTIFICIAL NEURAL NETWORKS

Quantum variants of the Hopfield Network (called Quantum Associative Memories), Perceptrons, and Feedforward networks have been suggested. Quantum Artificial Neural Networks have several potential advantages to classical artificial neural networks. First, quantum artificial neural networks can likely have a more powerful representation capability by storing network weights in superposition. Second, speedups in optimization suggest potentially faster training times for backpropagation. Finally, quantum artificial neural networks may more easily find the optimum of a complicated non-convex objective function than classical approaches.

Quantum Associative Memories

Quantum Associative Memories [Ventura and Martinez, 2000] are the quantum extension to Hopfield Networks. Quantum Associative Memories store patterns in superposition. While classical Hopfield Networks provide O(d) memory storage using d neurons, Quantum Associative Memories provide O(2d) storage using d neurons. Quantum Associative Memories store their N patterns in an equally-weighted superposition:

image

Since measurement can only return one value, the memory |Mimage has to be periodically cloned [Duan and Guo, 1998]. To avoid contradicting the non-cloning theorem, the cloning of memory is approximate and thus imperfect.

Assuming the memory can be maintained, there are two key approaches to pattern matching and classification of an unseen data point. The first is a modified Grover’s algorithm [Ventura and Martinez, 2000], augmented with extra rotations to allow efficient search for patterns over the database when not all patterns are represented. The second (known as “Probabilistic Quantum Memories” [Trugenberger, 2001]) is to calculate the Hamming distance between the input pattern and the instances in memory. The following Hamiltonian measures the number of bits that the input pattern and a stored memory instance differ:

image

where σz is the Pauli Z matrix, mk is a memory instance, n is the length of a memory instance numeric vector, and c is a control bit. By applying the Hamiltonian to the input pattern and memory instances, the probability of obtaining a pattern xk for input i is:

image

which peaks around the patterns with the smallest Hamming distance (|cimage is a control bit). This strategy helps identify the pattern closest to the input.

The energy function of the Hopfield network [Equation (5.31)] resembles the Hamiltonian of a spin glass or Ising model in the quantum world. This makes adiabatic quantum computing a prime target paradigm for implementation of quantum artificial neural networks. Adiabatic quantum computing has also been used to implement quantum associative memories. Neigovzen et al. [2009] demonstrate a quantum associative memory with two qubits in a liquid-state nuclear magnetic resonance system. This approach is further studied by Seddiqi and Humble [2014].

Quantum Perceptron and Feedforward Networks

A key challenge with building a quantum perceptron is incorporating nonlinearity. Activation functions in the classical perceptron create nonlinearities in the mathematical model. Individual quantum states, however, can only evolve according to linear, unitary, and reversible dynamics. The literature suggests two key approaches for incorporating nonlinearity into quantum perceptrons.

Lewenstein [1994] and Zak and Williams [1998] suggest using quantum measurement to play the role of the threshold function in the quantum perceptron. In quantum mechanics, a measurement collapses a superposition to one state from a distribution of possible states. This serves as an effective thresholding mechanism:

image

From this mechanism, one can define an update rule on quantum states that is similar to the classical neuron firing rule:

image

where U is a unitary operator, at is a previous state, and the function σ(x) performs a measurement and reinitializes the quantum state at+1. Narayanan and Menneer [2000] suggests that the principle of alternating classical and quantum components can be used not just in a perceptron but can be scaled to a multilayer network of neurons.

Another approach to incorporating nonlinearity into quantum mechanics is via the Feynman Path Integral [Faber and Giraldi, 2002]:

image

where |ψ(x0, 0)image is the initial state of the quantum system at time t = 0, |ψ(xf , T)image is the system at time t = T, ℏ is Plank’s constant, m is the mass, and V is the potential energy. Nonlinearity can be adjusted through model training by using the V(x) function as well as via the coupling of the quantum system to its environment. The environment surrounding the quantum system is a potential design choice that can introduce nonlinearity. Behrman et al. [1996] demonstrate a possible implementation strategy for this approach via solid-state quantum dot hardware, which is discussed further in Section 7.4.

Quantum Deep Learning

Training Boltzmann Machines (BMs) generally requires approximation since calculating the gradients of the objective function (which involves evaluation of the notorious partition function Z) is intractable. Practical applications of Deep Restricted Boltzmann Machines (dRBMs), for example, often utilize mean field [Bengio, 2009] or contrastive divergence [Hinton, 2002] approximations to the gradients which allow some sizes of dRBMs to be trained on classical computers.

Wiebe et al. [2014] explores whether quantum strategies can be used to enable efficient training of BMs and dRBMs. They provide some alternative approaches to the approximations previously posited for these networks by employing quantum sampling. By optimizing their procedure with quantum amplitude estimation (a variant of Grover’s algorithm), they show BMs can be trained for achieving a fixed sampling error in modeling with quadratically fewer operations than classical algorithms. In addition, their quantum approach shows improved accuracy (in simulation) for learning higher quality models, especially for large BMs on noisy data.

Interestingly, the quantum deep learning approach circumvents the need for a dRBM topology of the network. While classically, the dRBM is perhaps one of the most scalable of BM-style models, it is also a fairly restrictive model of data. In the quantum world, approaches may scale to training true BMs with fewer required assumptions. For instance, the contrastive divergence approximation for gradients may not be necessary with quantum annealing.

Attempts have been made to evaluate the feasibility of RBMs that use quantum sampling on the D-Wave hardware [Adachi and Henderson, 2015, Denil and De Freitas, 2011, Dumoulin et al., 2013]. Implementation is challenging since performance can be constrained by the limited connectivity of qubits in the machine and faulty qubits from imperfections introduced during the manufacturing process [Dumoulin et al., 2013]. Preliminary results from these studies suggest that quantum sampling for Deep Neural Networks can obtain similar or better accuracy with fewer iterations than the classical contrastive divergence / dRBM training procedure on data sets like MNIST [Adachi and Henderson, 2015]. More tests are needed to determine whether the results are due to quantum effects, whether the quantum annealing approach can improve results on other data sets, and to determine how the approach scales relative to the classical training procedure.

Further Reading

The literature surrounding quantum artificial neural network and its theoretical building blocks is vast, and there are multiple excellent literature summaries of quantum artificial neural networks. For further reading on the topic, view Benedetti et al. [2015], Ezhov and Ventura [2000], Faber and Giraldi [2002], Schuld et al. [2014], Wittek [2014].

5.7.3 POTENTIAL IMPACT OF QUANTUM ARTIFICIAL NEURAL NETWORKS TO ROBOTICS

Artificial neural network (especially Deep Learning) approaches are currently popular in robotics for problem domains including computer vision [LeCun et al., 2010] and robotic reinforcement learning [Mnih et al., 2015]. Quantum Artificial Neural Networks may provide speedup to deal with more complex representations in these and other robotic problems.

5.8 MANIFOLD LEARNING AND QUANTUM SPEEDUPS

High dimensional data can be extremely difficult to interpret. Manifold learning leverages the idea that oftentimes data set dimensionality is only artificially high. Data may lie along a low-dimensional manifold embedded in a high-dimensional space.

An example of a manifold is the famous swiss roll (shown in Figure 5.3). The swiss roll data is 3D but only artificially so, as it is a 2D plane that has been wrapped into a spiral. The data is actually a 2D manifold in a 3D space. The goal of manifold learning is to find the lower-dimensional representation of the higher-dimensional data.

image

Figure 5.3: Example “Swiss Roll” manifold: A 2D plane embedded in a 3D space.

5.8.1 CLASSICAL MANIFOLD LEARNING

Isometric Feature Mapping (Isomap) [Tenenbaum et al., 2000] is a popular algorithm for manifold learning. The two steps of the algorithm are shown in Algorithm 5.3.

Algorithm 5.3 Isomap Algorithm

Input : Data points as a N × d0 data matrix, k (integer number of nearest neighbors to use in estimating geodesic distances)

Output : Lower-dimensional representation of N input points

 

1: Estimate the geodesic distances (distances along a manifold) between the data points in the input using shortest-path distances on the data set’s k-nearest neighbor graph.

2: Use Multi-Dimensional Scaling (MDS) to find points in a low-dimensional Euclidean space whose inter-point distances match the distances found in step 1.

3: return Lower-dimensional representation of N data points as a N × d data matrix where the new dimensionality d ≤ d0.

Isomap makes the local linearity assumption that, if the manifold is smooth enough, the Euclidean distance between nearby points in the high-dimensional space is a good approximation to distances along the manifold. The Isomap algorithm first constructs a k-nearest neighbor graph that is weighted by Euclidean distances. A shortest path algorithm (such as Dijkstra’s or Floyd’s) is used to output estimates for all other inter-point distances. The result is a dissimilarity matrix Dimagen×n of distances between all points.

Afterward, Multidimensional Scaling (MDS) is used to construct a set of points in a lower-dimensional space whose inter-point distances match those in the higher-dimensional space. MDS solves the problem: given a matrix Dimagen×n of dissimilarities, construct a set of points whose inter-point Euclidean distances match those in D closely. The output is a lower-dimensional representation of the higher-dimensional data.

Other forms of manifold learning include Local Linear Embedding [Saul and Roweis, 2003] which attempts to capture local geometric features of points, Laplacian Eigenmaps [Belkin and Niyogi, 2001] which leverages the graph Laplacian to reveal information about the data, and Semidefinite Embedding [Weinberger et al., 2004] which uses semidefinite programming to approximate Gram matrices to convert between high- and low-dimensional spaces. For a more in-depth treatment of manifold learning techniques, the reader is encouraged to browse [Cayton, 2005].

Classical manifold learning has scalability challenges. Building and analyzing the neighborhood graph and calculating inter-point distances can be computationally demanding. Many manifold learning techniques can only handle hundreds to low thousands of data points without parallelization. Much work is done to optimize these algorithms to run on larger data sets. Furthermore, many manifold learning methods assume convex parameter spaces, though this is not always true of manifolds in the real world.

5.8.2 QUANTUM SPEEDUPS FOR MANIFOLD LEARNING

Quantum methods may help scale manifold learning to larger data sets. Dürr et al. [2004] develops a quantum subroutine quant_find_smallest_values which, with high probability using a variant of Grover’s algorithm, can find the k closest neighbors of a point in image time if there are n points in the database. This hypothetically allows the calculation of inter-point distances faster on a quantum computer than classically possible.

With high probability, the algorithm in Aïmeur et al. [2013] constructs the neighborhood graph of a set of n points in image time. Classically, if one uses an arbitrary metric and if the only information available is the distance between pairs of points, Ω(n2) time is required. With access to all the D features in the point, a KD-tree approach can provide a speedup to O((k + D)n log n). However, the fastest classical result is not as theoretically fast as the quantum result.

5.8.3 POTENTIAL IMPACT OF QUANTUM MANIFOLD LEARNING ON ROBOTICS

The theory of how classical manifold learning affects robotics is still in its infancy. Thus it may be premature to think about the application of quantum manifold learning to robotics. However, classical manifold learning has proven useful for problems in robot navigation [Keeratipranon et al., 2006], dynamic visual tracking [Qiao et al., 2010], motion synthesis [Havoutis and Ramamoorthy, 2010], and is being tried on an increasing number of problems in robotics. These are all important robotic problem domains that will get an important speedup from quantum approaches.

5.9 QUANTUM BOOSTING

The idea behind boosting is to create a “strong” classifier out of many simpler “weak” classifiers. Boosting is motivated by the observation that simple classifiers can do unexpectedly well for many tasks. Having a diversity of simple classifiers and fusing them in a way that explicitly seeks out models that complement each other in reducing classification error can be a powerful strategy. In robotics, boosting has been applied to structured prediction for imitation learning [Ratliff et al., 2007], improving accuracy for semantic place classification [Rottmann et al., 2005], and learning loop closures [Walls and Wolcott, 2011].

5.9.1 CLASSICAL BOOSTING ANALYSIS

A popular boosting algorithm, Adaboost [Freund et al., 1999], works by trying to iteratively improve classification accuracy on previously incorrectly classified training instances (Algorithm 5.4). Adaboost can be a powerful algorithm to improve classification accuracy, though care must be taken to avoid overfitting to the training data. A key result is that the error rate (on training data) of the combined “strong” classifier approaches zero exponentially fast in the number of iterations if the weak classifiers do at least significantly better than random guessing [Freund et al., 1999].

Adaboost attempts to find a function:

image

where f(x) weights the individual weak classifier predictions hs(x) to come up with the optimal predictor. To do so tractably, Adaboost minimizes a convex loss function over a convex set of functions. The loss Adaboost minimizes is:

image

The convexity assumptions can limit the modeling power of classical Adaboost. In addition, a key challenge with applying boosting style approaches in robotics has been getting ensemble methods to work well in real time. As more classifiers are used in an ensemble, the potential modeling power of Adaboost is improved for complicated world percepts. However, for a robotic system, it may be computationally intensive to evaluate a large number of classifiers on embedded hardware in real time. Quantum approaches to boosting may help alleviate some of these challenges.

Algorithm 5.4 Adaboost Algorithm

Input : N training instances {(x1, y1), … , (xN , yN)}

Output : “Strong” classifier from many “weak” classifiers

 

1: Initially assign a weight wi = image to each of N training instances.

2: At each iteration, a subset of training samples is drawn (with replacement). The selection probability of a training example is equal to its weight.

3: Train classifier ensemble on sample. Each classifier ht (at iteration t) makes a prediction for each data point, and the measurement error of the classifier is computed. The error of one of the classifiers at iteration t is:

image

where 1(z) is the indicator function.

4: Adjust weights of instances so that incorrectly classified examples have higher weight and correctly classified examples have lower weight. A weighting rule often used is: set wi,t = wi,t−1 × image for correctly classified instances, while keeping the weight of incorrectly classified instances the same. All weights are then re-normalized to form a probability distribution. The classifiers are also assigned weights: image which upweights the output of simple classifiers in the ensemble that have low error rate when making predictions.

5: Repeat steps 2-4 for some number T of iterations or some error convergence criterion.

6: return Weighted combined classifier that can be used to make predictions on new data points.

5.9.2 QBOOST

Quantum Boosting (QBoost) [Neven et al., 2008] allows both the regularization term and loss function to be non-convex. By using adiabatic optimization and discrete search over the parameter space, the convexity assumption can be avoided. Note that the loss function needs to be quadratic (since adiabatic optimization is based on QUBOs) but can be non-convex.

Given a set of weak learners {hs|hs : image → {−1, 1}, s = 1, 2, … , S}, boosting can be written as:

image

where λ ||w||0 is a regularization term. Equation (5.41) can be expanded as:

image

Term 2 in Equation (5.42) upweights weak learners correlated with the correct output label. Term 1 in Equation (5.42) downweights weak learners correlated with each other. The resulting QUBO equation can be discretized and solved using quantum annealing.

5.9.3 POTENTIAL IMPACT OF QUANTUM BOOSTING ON ROBOTICS

In robotics, boosting has been applied to structured prediction for imitation learning [Ratliff et al., 2007], improving accuracy for semantic place classification [Rottmann et al., 2005], and learning loop closures [Walls and Wolcott, 2011]. These are all domains that may benefit from a quantum boosting algorithm.

5.10 CHAPTER SUMMARY

Chapter Key Points

• Quantum approaches to machine learning utilize a variety of quantum operating principles such as quantum inner products, Hamiltonian simulation, adiabatic optimization, and Grover’s search to obtain benefits (in theory) over classical machine learning approaches.

• Many machine learning methods obtain speedups. Some methods (such as Quantum PCA) obtain new capabilities. Others (such as quantum SVM or quantum deep learning) may become more general in their modeling power.

• Developed quantum approaches, if and when they work, may be useful for a broad array of robotic problem domains and applications.

Table 5.1: Summary of quantum operating principles in quantum machine learning

image

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.137.183.210