Chapter 14

Adaptive $H_\infty$ Tracking Control of Nonlinear Systems Using Reinforcement Learning

Hamidreza Modares∗, Bahare Kiumarsi†, Kyriakos G. Vamvoudakis‡, Frank L. Lewis§
∗Missouri University of Science and Technology, Rolla, MO, United States
†UTA Research Institute, University of Texas at Arlington, Fort Worth, TX, United States
‡Virginia Tech, Blacksburg, VA, United States
§State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Abstract

This chapter presents online solutions to the optimal $H_\infty$ tracking control of nonlinear systems to attenuate the effect of disturbances on the performance of the systems. To obviate the requirement of complete knowledge of the system dynamics, reinforcement learning (RL) is used to learn the solutions to the Hamilton–Jacobi–Isaacs equations arising from the $H_\infty$ tracking problem. Off-policy RL algorithms are designed for continuous-time systems, which allows the reuse of data for learning and consequently leads to data-efficient RL algorithms. A solution is first presented for the $H_\infty$ optimal tracking control of affine nonlinear systems. It is then extended to a special class of nonlinear nonaffine systems. It is shown that for nonaffine systems the existence of a stabilizing solution depends on the performance function. A performance function is designed to assure the existence of the solution for a class of nonaffine systems, while taking into account the input constraints.

Keywords

$H_\infty$ control; Optimal tracking; Reinforcement learning

Chapter Points

  •  An online, data-based solution to the $H_\infty$ tracking control problem is designed.
  •  Reinforcement learning is employed to learn the solution to the $H_\infty$ tracking problem in real time and without requiring knowledge of the system dynamics.

14.1 Introduction

Reinforcement learning (RL) [1–3], inspired by learning mechanisms observed in animals, is concerned with how an agent or decision maker takes actions so as to optimize a cost of its long-term interactions with the environment. The cost function is prescribed and captures some desired system behaviors, such as minimizing the transient error and minimizing the control effort for achieving a specific goal. The agent learns an optimal policy so that, by taking actions produced based on this policy, the long-term cost function is optimized. Similar to RL, optimal control involves finding an optimal policy by optimizing a long-term performance criterion. Strong connections between RL and optimal control have prompted a major effort towards introducing and developing online and model-free RL algorithms to learn the solution to optimal control problems [4–6].

RL methods have been successfully used to solve optimal regulation problems by learning the solution to the so-called Hamilton–Jacobi equations arising from both optimal $H_2$ [7–18] and $H_\infty$ [19–30] regulation problems. For continuous-time (CT) systems, [8,9] proposed a promising RL algorithm, called integral RL (IRL), to learn the solution to the Hamilton–Jacobi–Bellman (HJB) equations using only partial knowledge about the system dynamics. They used an iterative online policy iteration [31] procedure to implement their IRL algorithm. The original IRL algorithm and many of its extensions are on-policy algorithms. That is, the policy that is applied to the system to generate data for learning (the behavior policy) is the same as the policy that is being updated and learned about (the target policy). The work [15] presented an off-policy RL algorithm for CT systems in which the behavior policy can be different from the target policy. This algorithm does not require any knowledge of the system dynamics and is data efficient because it reuses the data generated by the behavior policy to learn as many target policies as required. Many variants and extensions of off-policy RL algorithms are presented in the literature. Other than the IRL-based PI algorithms and off-policy RL algorithms, efficient synchronous PI algorithms with guaranteed closed-loop stability were proposed for CT systems in [7,11,12] to learn the solution to the HJB equation. Synchronous IRL algorithms were also presented for solving the HJB equation in [23,32].

Although RL algorithms have been widely used to solve optimal regulation problems, few results have considered solving the optimal tracking control problem (OTCP) for either discrete-time [33–36] or continuous-time systems [6,37]. Moreover, existing methods for continuous-time systems require exact knowledge of the system dynamics a priori, because the feedforward part of the control input is found using either the dynamic inversion concept or the solution of the output regulator equations [39–41]. While the importance of RL algorithms is well understood for solving optimal regulation problems for uncertain systems, the requirement of exact knowledge of the system dynamics for finding the steady-state part of the control input in the existing OTCP formulation does not allow a direct extension of the IRL algorithm to the OTCP.

In this chapter, we develop adaptive optimal controllers based on RL techniques to learn the optimal $H_\infty$ tracking control solutions for nonlinear continuous-time systems without knowing the system dynamics or the command generator dynamics. An augmented system is first constructed from the tracking error dynamics and the command generator dynamics to introduce a new discounted performance function for the OTCP. The tracking Hamilton–Jacobi–Isaacs (HJI) equation is then derived to solve the $H_\infty$ OTCP. Off-policy RL algorithms, implemented on an actor-critic structure, are used to find the solution to the tracking HJI equation online using only measured data along the augmented system trajectories. These algorithms are developed for both affine and nonaffine nonlinear systems. Therefore, they can be employed in control of many real-world applications, including robot manipulators, mobile robots, unmanned aerial vehicles (UAVs), power systems and human–robot interaction systems.

14.2 $H_\infty$ Optimal Tracking Control for Nonlinear Affine Systems

Existing solutions to the $H_\infty$ tracking problem are composed of two steps [38–41]. In the first step, a feedforward control input is designed to guarantee perfect tracking, using either dynamic inversion or the solution of the so-called output regulator equations. In the second step, a feedback control input is designed by solving an HJI equation to stabilize the tracking error dynamics. In these methods, the procedures for computing the feedback and feedforward terms are based on offline solution methods which require complete knowledge of the system dynamics. In this section, a new formulation of the $H_\infty$ tracking problem is presented which allows developing model-free RL solutions.

Consider the nonlinear time-invariant system given as

$$\dot{x}(t) = f(x(t)) + g(x(t))u(t) + k(x(t))w(t), \tag{14.1}$$

where $x(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^m$ and $w(t) \in \mathbb{R}^p$ represent the state of the system, the control input and the external disturbance of the system, respectively. The drift dynamics is represented by $f(x(t)) \in \mathbb{R}^n$, $g(x(t)) \in \mathbb{R}^{n \times m}$ is the input dynamics and $k(x(t)) \in \mathbb{R}^{n \times p}$ is the disturbance dynamics. It is assumed that $f(0) = 0$, that $f(x(t))$, $g(x(t))$ and $k(x(t))$ are unknown Lipschitz functions, and that the system is stabilizable.

Assumption 1

Let $r(t)$ be the bounded reference trajectory and assume that there exists a Lipschitz continuous command generator function $h_d(r(t)) \in \mathbb{R}^n$ with $h_d(0) = 0$ such that

$$\dot{r}(t) = h_d(r(t)). \tag{14.2}$$

Define the tracking error

$$e_d(t) \triangleq x(t) - r(t). \tag{14.3}$$

Using (14.1)–(14.3), the tracking error dynamics is given by

$$\dot{e}_d(t) = f(x(t)) + g(x(t))u(t) + k(x(t))w(t) - h_d(r(t)). \tag{14.4}$$

The performance output to be controlled is defined such that it satisfies

$$\|z(t)\|^2 = e_d^T Q\, e_d + u^T R\, u. \tag{14.5}$$

The goal of the $H_\infty$ tracking problem is to attenuate the effect of the disturbance input $w$ on the performance output $z$. Before defining the $H_\infty$ tracking control problem, we define the following general $L_2$-gain or disturbance attenuation condition.

Definition 1

Bounded $L_2$-gain or disturbance attenuation

The nonlinear system (14.1) is said to have $L_2$-gain less than or equal to $\gamma$ if the following disturbance attenuation condition is satisfied for all $w \in L_2[0, \infty)$:

$$\int_t^{\infty} e^{-\alpha(\tau - t)} \|z(\tau)\|^2\, d\tau \le \gamma^2 \int_t^{\infty} e^{-\alpha(\tau - t)} \|w(\tau)\|^2\, d\tau, \tag{14.6}$$

where $\alpha > 0$ is the discount factor and $\gamma$ represents the amount of attenuation from the disturbance input $w(t)$ to the defined performance output variable $z(t)$.

The disturbance attenuation condition (14.6) implies that the effect of the disturbance input to the desired performance output is attenuated by a degree at least equal to γ. The desired performance output represents a meaningful cost in the sense that it includes a positive penalty on the tracking error and a positive penalty on the control effort. The use of the discount factor is essential. This is because the feedforward part of the control input does not converge to zero in general and thus penalizing the control input in the performance function without a discount factor makes the performance function unbounded.
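For instance, if the control input converges to a constant feedforward value $u_{ss} \neq 0$, the undiscounted control penalty diverges while the discounted one remains finite:

$$\int_t^{\infty} u_{ss}^T R\, u_{ss}\, d\tau = \infty, \qquad \int_t^{\infty} e^{-\alpha(\tau - t)}\, u_{ss}^T R\, u_{ss}\, d\tau = \frac{u_{ss}^T R\, u_{ss}}{\alpha} < \infty.$$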

Using (14.5) in (14.6) one has

$$\int_t^{\infty} e^{-\alpha(\tau - t)}\left(e_d^T Q\, e_d + u^T R\, u\right) d\tau \le \gamma^2 \int_t^{\infty} e^{-\alpha(\tau - t)} w^T w\, d\tau. \tag{14.7}$$

Definition 2

$H_\infty$ optimal tracking

The $H_\infty$ tracking control problem is to find a control policy $u = \beta(e_d, r)$, for some smooth function $\beta$ depending on the tracking error $e_d$ and the reference trajectory $r$, such that:

(i) The closed-loop system $\dot{x} = f(x) + g(x)\beta(e_d, r) + k(x)w$ satisfies the attenuation condition (14.7).

(ii) The tracking error dynamics (14.4) with $w = 0$ is locally asymptotically stable.

The main difference between Definition 2 and the standard definition of the $H_\infty$ tracking control problem (see [38], Definition 5.2.1) is that a more general disturbance attenuation condition is defined here. Previous work on $H_\infty$ optimal tracking divides the control input into feedback and feedforward parts. The feedforward part is first obtained separately without considering any optimality criterion. Then, the problem of optimal design of the feedback part is reduced to an $H_\infty$ optimal regulation problem. In contrast, in the new formulation, both feedback and feedforward parts of the control input are obtained simultaneously and optimally as a result of the $L_2$-gain condition with discount factor defined in (14.7).

14.2.1 HJI Equation for $H_\infty$ Optimal Tracking

In this section, it is first shown that the $H_\infty$ tracking problem can be transformed into a min–max optimization problem subject to an augmented system composed of the tracking error dynamics and the command generator dynamics. A tracking HJI equation is then developed which gives the solution to the min–max optimization problem. The stability and $L_2$-gain boundedness of the tracking HJI control solution are discussed.

Define the augmented system state

$$X(t) = \left[e_d(t)^T \;\; r(t)^T\right]^T \in \mathbb{R}^{2n},$$

where $e_d(t)$ is the tracking error defined in (14.3) and $r(t)$ is the reference trajectory.

Using (14.2) and (14.4), define the augmented system

$$\dot{X}(t) = F(X(t)) + G(X(t))u(t) + K(X(t))w(t), \tag{14.8}$$

where $u(t) = u(X(t))$ and

$$F(X) = \begin{bmatrix} f(e_d + r) - h_d(r) \\ h_d(r) \end{bmatrix}, \quad G(X) = \begin{bmatrix} g(e_d + r) \\ 0 \end{bmatrix}, \quad K(X) = \begin{bmatrix} k(e_d + r) \\ 0 \end{bmatrix}.$$
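As a minimal illustration of this construction (a sketch only, with $f$, $g$, $k$ and $h_d$ supplied by the user as placeholder callables), the augmented dynamics $F$, $G$, $K$ of (14.8) can be assembled by stacking the tracking error dynamics on top of the command generator dynamics:

```python
import numpy as np

def make_augmented_dynamics(f, g, k, h_d, n):
    """Build F, G, K of the augmented system (14.8).

    The augmented state is X = [e_d; r] with e_d, r in R^n, so the plant
    state is recovered as x = e_d + r."""
    def F(X):
        e_d, r = X[:n], X[n:]
        x = e_d + r
        return np.concatenate([f(x) - h_d(r), h_d(r)])   # [f(x) - h_d(r); h_d(r)]

    def G(X):
        e_d, r = X[:n], X[n:]
        gx = np.atleast_2d(g(e_d + r))
        return np.vstack([gx, np.zeros_like(gx)])         # [g(x); 0]

    def K(X):
        e_d, r = X[:n], X[n:]
        kx = np.atleast_2d(k(e_d + r))
        return np.vstack([kx, np.zeros_like(kx)])         # [k(x); 0]

    return F, G, K
```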

The disturbance attenuation condition (14.7) using the augmented state becomes

$$\int_t^{\infty} e^{-\alpha(\tau - t)}\left(X^T Q_T X + u^T R\, u\right) d\tau \le \gamma^2 \int_t^{\infty} e^{-\alpha(\tau - t)} w^T w\, d\tau, \tag{14.9}$$

where

$$Q_T = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}.$$

Based on (14.9), define the performance function

$$J(u, w) = \int_t^{\infty} e^{-\alpha(\tau - t)}\left(X^T Q_T X + u^T R\, u - \gamma^2 w^T w\right) d\tau. \tag{14.10}$$

Solvability of the $H_\infty$ control problem is equivalent to solvability of the following zero-sum game [42]:

$$V^*(X(t)) = J(u^*, w^*) = \min_u \max_w J(u, w), \tag{14.11}$$

where $J$ is defined in (14.10) and $V^*(X(t))$ is defined as the optimal value function. This two-player zero-sum game control problem has a unique solution if a game theoretic saddle point exists, i.e., if the following Nash condition holds:

$$V^*(X(t)) = \min_u \max_w J(u, w) = \max_w \min_u J(u, w).$$

Differentiating (14.10) and noting that $V(X(t)) = J(u(t), w(t))$ gives the following Bellman equation:

$$H(V, u, w) \triangleq X^T Q_T X + u^T R\, u - \gamma^2 w^T w - \alpha V + V_X^T\left(F + G u + K w\right) = 0, \tag{14.12}$$

where $F \equiv F(X)$, $G \equiv G(X)$, $K \equiv K(X)$ and $V_X = \partial V / \partial X$.

Applying the stationarity conditions $\partial H(V, u, w)/\partial u = 0$ and $\partial H(V, u, w)/\partial w = 0$ [43] gives the optimal control and disturbance inputs as

$$u^* = -\frac{1}{2} R^{-1} G^T V_X^*, \tag{14.13}$$

$$w^* = \frac{1}{2\gamma^2} K^T V_X^*, \tag{14.14}$$

where $V^*$ is the optimal value function defined in (14.11). Substituting the control input (14.13) and the disturbance (14.14) into (14.12), the following tracking HJI equation is obtained:

$$H(V^*, u^*, w^*) \equiv X^T Q_T X + V_X^{*T} F - \alpha V^* - \frac{1}{4} V_X^{*T} G R^{-1} G^T V_X^* + \frac{1}{4\gamma^2} V_X^{*T} K K^T V_X^* = 0. \tag{14.15}$$

It is shown in [44] that the control solution (14.13)–(14.15) satisfies the disturbance attenuation condition (14.9) (part (i) of Definition 2) and that it guarantees the stability of the tracking error dynamics (14.4) without the disturbance (part (ii) of Definition 2), provided the discount factor is less than an upper bound.
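As a minimal numerical sketch of the policy expressions (14.13)–(14.14) (not of the learning algorithm itself), the control and disturbance can be evaluated pointwise once an approximate value gradient is available; the functions `grad_V`, `G` and `K` below are hypothetical placeholders supplied by the user:

```python
import numpy as np

def optimal_policies(X, grad_V, G, K, R, gamma):
    """Evaluate u* and w* of (14.13)-(14.14) at a state X, given a value
    gradient grad_V(X), e.g. from a critic approximator."""
    VX = grad_V(X)                                    # dV/dX at X
    u = -0.5 * np.linalg.solve(R, G(X).T @ VX)        # u* = -(1/2) R^{-1} G^T V_X
    w = (1.0 / (2.0 * gamma**2)) * (K(X).T @ VX)      # w* = (1/(2 gamma^2)) K^T V_X
    return u, w
```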

14.2.2 Off-Policy IRL for Learning the Tracking HJI Equation

In this section, an off-policy RL algorithm is first given to learn this control solution online and without requiring any knowledge of the system dynamics.

The Bellman equation (14.12) is linear in the cost function $V$, while the HJI equation (14.15) is nonlinear in the value function $V^*$. Therefore, solving the Bellman equation for $V$ is easier than solving the HJI equation for $V^*$. Instead of directly solving for $V^*$, a policy iteration (PI) algorithm iterates on both control and disturbance players to break the HJI equation into a sequence of differential equations linear in the cost. An offline PI algorithm for solving the $H_\infty$ optimal tracking problem is given as follows.

Algorithm 1 extends the results of the simultaneous RL algorithm in [27] to the tracking problem. The convergence of this algorithm to the minimal nonnegative solution of the HJI equation was shown in [27]. In fact, similar to [27], the convergence of Algorithm 1 can be established by proving that iteration on (14.16) is essentially a Newton iterative sequence which converges to the unique solution of the HJI equation (14.15).

Algorithm 1 Offline RL algorithm.

Algorithm 1 requires complete knowledge of the system dynamics. In the following, the off-policy IRL algorithm, which was presented in [14,15] for solving the $H_2$ optimal regulation problem, is extended here to solve the $H_\infty$ optimal tracking problem for systems with completely unknown dynamics. To this end, the system dynamics (14.8) is first written as

$$\dot{X} = F + G u^j + K w^j + G\left(u - u^j\right) + K\left(w - w^j\right), \tag{14.19}$$

where $u^j \in \mathbb{R}^m$ and $w^j \in \mathbb{R}^q$ are the policies to be updated. In this equation, the control input $u$ is the behavior policy which is applied to the system to generate data for learning, while $u^j$ is the target policy which is evaluated and updated using data generated by the behavior policy. The fixed control policy $u$ should be a stabilizing and exploring control policy. Moreover, the disturbance input $w$ is the actual external disturbance that comes from an external source and is not under our control, whereas $w^j$ is the disturbance policy that is evaluated and updated. One advantage of this off-policy IRL Bellman equation is that, in contrast to on-policy RL-based methods, the disturbance input that is applied to the system does not need to be adjustable.

Differentiating $V^j(X)$ along the trajectories of the system dynamics (14.19) and using (14.16)–(14.18) gives

$$\begin{aligned} \dot{V}^j &= \left(V_X^j\right)^T\left(F + G u^j + K w^j\right) + \left(V_X^j\right)^T G\left(u - u^j\right) + \left(V_X^j\right)^T K\left(w - w^j\right) \\ &= \alpha V^j - X^T Q_T X - \left(u^j\right)^T R\, u^j + \gamma^2 \left(w^j\right)^T w^j - 2\left(u^{j+1}\right)^T R\left(u - u^j\right) + 2\gamma^2\left(w^{j+1}\right)^T\left(w - w^j\right). \end{aligned} \tag{14.20}$$

Multiplying both sides of (14.20) by $e^{-\alpha(\tau - t)}$ and integrating over the interval $[t, t+T]$ yields the following off-policy IRL Bellman equation:

$$\begin{aligned} e^{-\alpha T} V^j(X(t+T)) - V^j(X(t)) = {} & -\int_t^{t+T} e^{-\alpha(\tau - t)}\left(X^T Q_T X + \left(u^j\right)^T R\, u^j - \gamma^2\left(w^j\right)^T w^j\right) d\tau \\ & + \int_t^{t+T} e^{-\alpha(\tau - t)}\left(-2\left(u^{j+1}\right)^T R\left(u - u^j\right) + 2\gamma^2\left(w^{j+1}\right)^T\left(w - w^j\right)\right) d\tau. \end{aligned} \tag{14.21}$$

Note that, for a fixed control policy $u$ (the policy that is applied to the system) and a given disturbance $w$ (the actual disturbance that is applied to the system), Eq. (14.21) can be solved for the value function $V^j$ and the updated policies $u^{j+1}$ and $w^{j+1}$ simultaneously.

Lemma 1

The off-policy IRL equation (14.21) gives the same solution for the value function as the Bellman equation (14.16), and the same updated control and disturbance policies as (14.18) and (14.17).

Proof

See [44]. □

The following algorithm uses the off-policy tracking Bellman equation (14.21) to iteratively solve the HJI equation (14.15) without requiring any knowledge of the system dynamics. The implementation of this algorithm is discussed in the next subsection, where it is shown how the data collected from a fixed control policy $u$ are reused to evaluate many updated control policies $u^j$ sequentially, until convergence to the optimal solution is achieved.

Inspired by the off-policy algorithm in [14], Algorithm 2 has two separate phases. First, a fixed initial exploratory control policy $u$ is applied and the system information is recorded over the time interval $T$. Second, without requiring any knowledge of the system dynamics, the information collected in phase 1 is repeatedly used to find a sequence of updated policies $u^j$ and $w^j$ converging to $u^*$ and $w^*$. Note that Eq. (14.23) is a scalar equation and can be solved in a least squares sense after collecting enough data samples from the system. It is shown in the following section how to collect the required information in phase 1 and reuse it in phase 2, in a least squares sense, to solve (14.23) for $V^j$, $u^{j+1}$ and $w^{j+1}$ simultaneously. After the learning is done and the optimal control policy $u^*$ is found, it can be applied to the system.

Algorithm 2 Online off-policy RL algorithm for solving the tracking HJI equation.
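A skeleton of the two-phase procedure just described is sketched below, assuming hypothetical helpers `collect_data` (phase 1) and `evaluate_policies` (one least-squares solve of (14.23), detailed in the next subsection); it illustrates the data reuse only and is not a definitive implementation:

```python
import numpy as np

def off_policy_tracking_rl(collect_data, evaluate_policies, W0, tol=1e-4, max_iters=50):
    """Two-phase off-policy RL in the spirit of Algorithm 2.

    Phase 1: apply a fixed exploratory behavior policy once and record data.
    Phase 2: reuse the same recorded data to solve the off-policy Bellman
    equation repeatedly, updating the critic/actor/disturber weights until
    the critic weights converge."""
    data = collect_data()                          # phase 1: recorded (X, u, w) samples
    W1, W2, W3 = W0                                # initial critic, actor, disturber weights
    for j in range(max_iters):                     # phase 2: iterate on stored data only
        W1_new, W2, W3 = evaluate_policies(data, W2, W3)
        if np.linalg.norm(W1_new - W1) < tol:      # convergence of the value weights
            W1 = W1_new
            break
        W1 = W1_new
    return W1, W2, W3
```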

Theorem 1

Convergence of Algorithm 2

The off-policy Algorithm 2 converges to the optimal control and disturbance solutions given by (14.13) and (14.14) where the value function satisfies the tracking HJI equation (14.15).

Proof

See [44]. □

14.2.3 Implementing Algorithm 2 Using Neural Networks

In order to implement the off-policy RL Algorithm 2, the information collected by applying a fixed control policy $u$ to the system must be reused to solve Eq. (14.23) for $V^j$, $u^{j+1}$ and $w^{j+1}$ iteratively. Three neural networks (NNs), i.e., the actor NN, the critic NN and the disturber NN, are used here to approximate the value function and the updated control and disturbance policies in the Bellman equation (14.23). That is, the solution $V^j$, $u^{j+1}$ and $w^{j+1}$ of the Bellman equation (14.23) is approximated by three NNs as

$$\hat{V}^j(X) = \hat{W}_1^T \sigma(X), \tag{14.24}$$

$$\hat{u}^{j+1}(X) = \hat{W}_2^T \phi(X), \tag{14.25}$$

$$\hat{w}^{j+1}(X) = \hat{W}_3^T \varphi(X), \tag{14.26}$$

where $\sigma = [\sigma_1, \ldots, \sigma_{l_1}]^T \in \mathbb{R}^{l_1}$, $\phi = [\phi_1, \ldots, \phi_{l_2}]^T \in \mathbb{R}^{l_2}$ and $\varphi = [\varphi_1, \ldots, \varphi_{l_3}]^T \in \mathbb{R}^{l_3}$ provide suitable basis function vectors, $\hat{W}_1 \in \mathbb{R}^{l_1}$, $\hat{W}_2 \in \mathbb{R}^{l_2 \times m}$ and $\hat{W}_3 \in \mathbb{R}^{l_3 \times q}$ are constant weight vectors and matrices, and $l_1$, $l_2$ and $l_3$ are the numbers of neurons. Define $v_1 = [v_1^1, \ldots, v_1^m]^T = u - u^j$, $v_2 = [v_2^1, \ldots, v_2^q]^T = w - w^j$ and assume $R = \operatorname{diag}(r_1, \ldots, r_m)$. Then, substituting (14.24)–(14.26) in (14.23) yields

$$\begin{aligned} e(t) = {} & \hat{W}_1^T\left(e^{-\alpha T}\sigma(X(t+T)) - \sigma(X(t))\right) + \int_t^{t+T} e^{-\alpha(\tau - t)}\left(X^T Q_T X + \left(u^j\right)^T R\, u^j - \gamma^2\left(w^j\right)^T w^j\right) d\tau \\ & + 2\sum_{l=1}^{m} r_l \int_t^{t+T} e^{-\alpha(\tau - t)} \hat{W}_{2,l}^T \phi(X(\tau))\, v_1^l\, d\tau - 2\gamma^2 \sum_{k=1}^{q} \int_t^{t+T} e^{-\alpha(\tau - t)} \hat{W}_{3,k}^T \varphi(X(\tau))\, v_2^k\, d\tau, \end{aligned} \tag{14.27}$$

where $e(t)$ is the Bellman approximation error, $\hat{W}_{2,l}$ is the $l$th column of $\hat{W}_2$ and $\hat{W}_{3,k}$ is the $k$th column of $\hat{W}_3$. The Bellman approximation error is the continuous-time counterpart of the temporal difference (TD) error [10]. In order to bring the TD error to its minimum value, the least squares method is used. To this end, rewrite Eq. (14.27) as

$$y(t) + e(t) = \hat{W}^T h(t), \tag{14.28}$$

where

$$\hat{W} = \left[\hat{W}_1^T, \hat{W}_{2,1}^T, \ldots, \hat{W}_{2,m}^T, \hat{W}_{3,1}^T, \ldots, \hat{W}_{3,q}^T\right]^T \in \mathbb{R}^{l_1 + m l_2 + q l_3}, \qquad h(t) = \begin{bmatrix} e^{-\alpha T}\sigma(X(t+T)) - \sigma(X(t)) \\ 2 r_1 \int_t^{t+T} e^{-\alpha(\tau - t)} \phi(X(\tau))\, v_1^1\, d\tau \\ \vdots \\ 2 r_m \int_t^{t+T} e^{-\alpha(\tau - t)} \phi(X(\tau))\, v_1^m\, d\tau \\ -2\gamma^2 \int_t^{t+T} e^{-\alpha(\tau - t)} \varphi(X(\tau))\, v_2^1\, d\tau \\ \vdots \\ -2\gamma^2 \int_t^{t+T} e^{-\alpha(\tau - t)} \varphi(X(\tau))\, v_2^q\, d\tau \end{bmatrix}, \tag{14.29}$$

$$y(t) = -\int_t^{t+T} e^{-\alpha(\tau - t)}\left(X^T Q_T X + \left(u^j\right)^T R\, u^j - \gamma^2\left(w^j\right)^T w^j\right) d\tau. \tag{14.30}$$

The parameter vector $\hat{W}$, which gives the approximated value function, actor and disturbance (14.24)–(14.26), is found by minimizing the Bellman approximation error in the least squares sense. Assume that the system state, input and disturbance information are collected at $N \ge l_1 + m l_2 + q l_3$ (the number of independent elements in $\hat{W}$) time points $t_1$ to $t_N$, over the same time interval $T$, in phase 1. Then, for given $u^j$ and $w^j$, one can use this information to evaluate (14.29) and (14.30) at the $N$ points to form

$$H = \left[h(t_1), \ldots, h(t_N)\right], \qquad Y = \left[y(t_1), \ldots, y(t_N)\right]^T.$$

The least squares solution to (14.28) is then equal to

$$\hat{W} = \left(H H^T\right)^{-1} H Y,$$

which gives $V^j$, $u^{j+1}$ and $w^{j+1}$. Note that although $X(t+T)$ appears in Eq. (14.27), this equation is solved in a least squares sense after observing $N$ samples $X(t)$, $X(t+T)$, …, $X(t+NT)$. Therefore, knowledge of the system dynamics is not required to predict the future state $X(t+T)$ at time $t$ in order to solve (14.27).
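A minimal sketch of this batch least-squares step is given below; it assumes the integrals in (14.29)–(14.30) have already been evaluated from the recorded data (the lists `h_list` and `y_list` are supplied by the caller), and a small ridge term is added as a practical safeguard that is not part of the text:

```python
import numpy as np

def solve_weights(h_list, y_list, ridge=1e-8):
    """Least-squares solution of (14.28): W_hat = (H H^T)^{-1} H Y.

    h_list: N vectors h(t_i) of length l1 + m*l2 + q*l3.
    y_list: N scalars y(t_i)."""
    H = np.column_stack(h_list)                       # (l1 + m*l2 + q*l3) x N
    Y = np.asarray(y_list, dtype=float)               # N
    A = H @ H.T + ridge * np.eye(H.shape[0])          # regularized normal matrix
    W_hat = np.linalg.solve(A, H @ Y)
    return W_hat                                      # stacked [W1; W2 columns; W3 columns]
```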

14.3 $H_\infty$ Optimal Tracking Control for a Class of Nonlinear Nonaffine Systems

This section considers the design of an RL-based optimal tracking control solution for a class of nonaffine systems.

14.3.1 A Class of Nonaffine Dynamical Systems

A special class of nonaffine systems can be described as

$$\dot{X}(t) = f(X(t)) + g(X(t)) L(u) + D w(t), \tag{14.31}$$

where $X(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^m$ and $w(t) \in \mathbb{R}^p$ are the state of the system, the control input and the external disturbance input, respectively. The functions $f(X(t))$ and $g(X(t))$ are Lipschitz functions. This system is affine in a nonlinear function $L(\cdot)$ of the control input $u(t)$. This class of nonaffine systems allows the definition of a new performance function for the optimal $H_\infty$ problem such that the existence of the constrained optimal control is assured (if any exists).

The following example shows that a UAV, as a real-world application, can be represented in the form of (14.31).

Example 1

A general class of nonlinear nonaffine UAV systems has the following well-known form:

$$\begin{aligned} \dot{x}_1 &= V\cos\gamma\cos\psi + d_1 w_1, \\ \dot{x}_2 &= V\cos\gamma\sin\psi + d_2 w_2, \\ \dot{x}_3 &= V\sin\gamma + d_3 w_3, \\ \dot{V} &= -\alpha_2 V^2 - g\sin\gamma + \alpha_1\bar{T} - \alpha_3 n_z - \alpha_4 n_z^2 V^{-2}, \\ \dot{\gamma} &= \frac{g}{V}\left(n_z\cos\phi - \cos\gamma\right), \\ \dot{\psi} &= \frac{g}{V\cos\gamma}\, n_z\sin\phi, \end{aligned} \tag{14.32}$$

with

$$n_x = \frac{\bar{T}\,\bar{T}_{\max}\cos\alpha - D}{mg}, \qquad n_z = \frac{\bar{T}\,\bar{T}_{\max}\sin\alpha + K}{mg},$$

where $x_1$, $x_2$, $x_3$ are the UAV location coordinates, $\gamma$ is the pitch angle, $\psi$ is the heading angle, $\phi$ is the bank angle, $V$ is the UAV velocity and $m$ is the mass of the UAV. The terms $n_x$ and $n_z$ denote the longitudinal and normal components of the load factor, which depend on the current thrust $\bar{T}$, drag force $D$ and lift force $K$ ($g$ is the acceleration due to gravity) [45].

Define the state of the UAV as

$$X = \left[x_1, x_2, x_3, V, \gamma, \psi\right]^T \tag{14.33}$$

and the control input and the disturbance input (wind velocity) as $u(t) = [\bar{T}, n_z, \phi]^T = [u_1\; u_2\; u_3]^T$ and $w(t)$, respectively. The constraints on the control input are as follows:

$$|u_1| \le \bar{u}_1, \qquad |u_2| \le \bar{u}_2. \tag{14.34}$$

Using (14.32) and (14.33), the UAV dynamics can be written as a nonlinear nonaffine CT system as

$$\dot{X}(t) = M(X(t), u(t)) + D w(t), \tag{14.35}$$

with

$$D = \begin{bmatrix} d_1 & d_2 & d_3 & 0 & 0 & 0 \end{bmatrix}^T, \qquad M(X, u) = \begin{bmatrix} x_4\cos(x_5)\cos(x_6) \\ x_4\cos(x_5)\sin(x_6) \\ x_4\sin(x_5) \\ -\alpha_2 x_4^2 - g\sin(x_5) + \alpha_1 u_1 - \alpha_3 u_2 - \alpha_4 u_2^2 x_4^{-2} \\ \dfrac{g}{x_4}\left(u_2\cos(u_3) - \cos(x_5)\right) \\ \dfrac{g}{x_4\cos(x_5)}\, u_2\sin(u_3) \end{bmatrix}.$$

The UAV dynamics (14.35) can be written in the form of (14.31) with

$$f(X(t)) = \begin{bmatrix} x_4\cos(x_5)\cos(x_6) \\ x_4\cos(x_5)\sin(x_6) \\ x_4\sin(x_5) \\ -\alpha_2 x_4^2 - g\sin(x_5) \\ -\dfrac{g}{x_4}\cos(x_5) \\ 0 \end{bmatrix}, \qquad g(X(t)) = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \alpha_1 & -\alpha_3 & -\alpha_4 x_4^{-2} & 0 & 0 \\ 0 & 0 & 0 & \dfrac{g}{x_4} & 0 \\ 0 & 0 & 0 & 0 & \dfrac{g}{x_4\cos(x_5)} \end{bmatrix},$$

$$L(u(t)) = \begin{bmatrix} L_1 \\ L_2 \\ L_3 \\ L_4 \\ L_5 \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ u_2^2 \\ u_2\cos(u_3) \\ u_2\sin(u_3) \end{bmatrix}.$$

Eq. (14.31) represents a large class of nonaffine systems far larger than the systems that are affine in the control itself. In fact, most aircraft dynamics can be expressed in the form of (14.31) if the lift equation satisfies certain assumptions [45].
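To make the transformation in Example 1 concrete, the short sketch below (an illustration only, using the element ordering of $L(u)$ shown above) maps a control vector $u = [\bar{T}, n_z, \phi]^T$ to $L(u)$ and checks the dependency among its elements that is exploited later:

```python
import numpy as np

def L_of_u(u):
    """Map the UAV control u = [T_bar, n_z, phi] to L(u) of Example 1."""
    u1, u2, u3 = u
    return np.array([u1,                  # L1 = u1
                     u2,                  # L2 = u2
                     u2 ** 2,             # L3 = u2^2
                     u2 * np.cos(u3),     # L4 = u2 cos(u3)
                     u2 * np.sin(u3)])    # L5 = u2 sin(u3)

L = L_of_u([0.6, 1.2, 0.3])
# Dependent elements: L3 = L2^2 = L4^2 + L5^2
assert np.isclose(L[2], L[1] ** 2) and np.isclose(L[2], L[3] ** 2 + L[4] ** 2)
```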

14.3.2 Performance Function and $H_\infty$ Tracking Control for Nonaffine Systems

It is shown in [46] that the existence of an admissible optimal control solution for nonaffine systems depends on how the utility function $r(X, u)$ is defined. Moreover, to deal with the input constraints, a nonquadratic performance index needs to be defined as follows.

Let the reference trajectory be generated by the command generator dynamics (14.2). The performance or control output $z(t)$ is defined such that it satisfies

$$\|z(t)\|^2 = (X - r)^T Q (X - r) + W(L(u)), \tag{14.36}$$

where $Q \ge 0$ and $W(L(u))$ is a positive definite nonquadratic function of $L(u)$ which penalizes the control effort and is chosen as follows to ensure that the control effort remains within its constraints:

$$W(L(u)) = \int_0^{L(u)} w(s)\, ds = \sum_{j=1}^{l}\left(\int_0^{L_j(u)} w_j(s_j)\, ds_j\right),$$

where $w(s) = \tanh^{-1}(\bar{L}^{-1} s) = [w_1(s_1), \ldots, w_l(s_l)]^T$ and $\bar{L}$ is the constant diagonal matrix $\bar{L} = \operatorname{diag}(\bar{L}_1, \ldots, \bar{L}_l)$, which determines the bounds on $L(u)$. Note that the bounds are originally given for the control input $u(t)$ itself; however, one can transform these bounds into bounds on $L(u)$.
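The nonquadratic penalty can be evaluated numerically; the sketch below integrates each component of $w(s) = \tanh^{-1}(\bar{L}^{-1} s)$ by quadrature (valid for $|L_j(u)| < \bar{L}_j$), with the vectors `L_u` and `L_bar` supplied by the caller:

```python
import numpy as np
from scipy.integrate import quad

def control_penalty(L_u, L_bar):
    """W(L(u)) = sum_j  integral_0^{L_j(u)} arctanh(s_j / L_bar_j) ds_j,
    the nonquadratic control-effort penalty used in (14.36)."""
    total = 0.0
    for Lj, Lbarj in zip(L_u, L_bar):
        val, _ = quad(lambda s, b=Lbarj: np.arctanh(s / b), 0.0, Lj)
        total += val
    return total
```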

The $H_\infty$ control problem is to find a control input such that (1) the system (14.31) with $w = 0$ is asymptotically stable and (2) the $L_2$-gain condition (14.6), with $z(t)$ defined in (14.36), is satisfied in the presence of $w \in L_2[0, \infty)$.

The disturbance attenuation condition is satisfied if the following cost function is nonpositive:

$$J(X) = \int_t^{\infty} e^{-\alpha(\tau - t)}\left[(X - r)^T Q (X - r) + W(L(u)) - \gamma^2 w^T w\right] d\tau. \tag{14.37}$$

14.3.3 Solution of the $H_\infty$ Tracking Control Problem for Nonaffine Systems

Define the tracking error as in (14.3). Then, using (14.2) and (14.31), the tracking error dynamics becomes

$$\dot{e}_d(t) = \dot{X}(t) - \dot{r}(t) = f(X(t)) + g(X(t)) L(u) + D w(t) - h_d(r(t)). \tag{14.38}$$

Based on (14.2) and (14.38), an augmented system can be constructed in terms of the tracking error $e_d(t)$ and the reference trajectory $r(t)$ as

$$\dot{Z}(t) = \begin{bmatrix} \dot{e}_d(t) \\ \dot{r}(t) \end{bmatrix} = \begin{bmatrix} f(e_d(t) + r(t)) - h_d(r(t)) \\ h_d(r(t)) \end{bmatrix} + \begin{bmatrix} g(e_d(t) + r(t)) \\ 0 \end{bmatrix} L(u) + \begin{bmatrix} D \\ 0 \end{bmatrix} w(t) \triangleq F(Z(t)) + G(Z(t)) L(u) + K w(t), \tag{14.39}$$

where the augmented state is

$$Z(t) = \begin{bmatrix} e_d(t) \\ r(t) \end{bmatrix}.$$

The performance index (14.37) can be rewritten as

$$J(L(u), w) = \int_t^{\infty} e^{-\alpha(\tau - t)}\left(Z^T(\tau) Q_1 Z(\tau) + W(L(u)) - \gamma^2 w^T w\right) d\tau, \tag{14.40}$$

with $Q_1 = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}$.

The $H_\infty$ control problem can be expressed as a two-player zero-sum differential game in which the control effort policy player $L(u)$ seeks to minimize the value function, while the disturbance policy player $w(t)$ desires to maximize it. The goal is to find the feedback saddle point $(L(u)^*, w^*)$ such that [42]

$$V^*(Z(t)) = \min_{L(u)} \max_{w} J(L(u), w). \tag{14.41}$$

On the basis of (14.40), and noting that $V(Z(t)) = J(L(u), w)$, the $H_\infty$ tracking Bellman equation is

$$Z^T Q_1 Z + W(L(u)) - \gamma^2 w^T w - \alpha V(Z) + \dot{V}(Z) = 0 \tag{14.42}$$

and the Hamiltonian is given by

$$H(Z, L(u), w, V_Z) = Z^T Q_1 Z + W(L(u)) - \gamma^2 w^T w - \alpha V(Z) + V_Z^T\left(F(Z) + G(Z) L(u) + K w\right).$$

Then the optimal control effort $L(u)^*$ and disturbance input $w^*(t)$ for the given problem are obtained by employing the stationarity conditions

$$L(u)^* = \arg\min_{L(u)} H(Z, L(u), w, V) \;\Longrightarrow\; \frac{d\left[Z^T Q_1 Z + W(L(u)) - \gamma^2 w^T w - \alpha V + V_Z^T \dot{Z}\right]}{d L(u)} = 0,$$

$$w^* = \arg\max_{w} H(Z, L(u), w, V) \;\Longrightarrow\; \frac{d\left[Z^T Q_1 Z + W(L(u)) - \gamma^2 w^T w - \alpha V + V_Z^T \dot{Z}\right]}{d w} = 0,$$

which give

$$L(u)^* = -\bar{L}\tanh^T(v^*), \tag{14.43}$$

$$w^* = \frac{1}{2\gamma^2} K^T V_Z^*, \tag{14.44}$$

where

$$v^* = \left(V_Z^*\right)^T G. \tag{14.45}$$

Substituting (14.43) and (14.44) into the Bellman equation (14.42) yields the HJI equation

$$Z^T Q_1 Z + W(L(u)^*) - \gamma^2 (w^*)^T w^* - \alpha V^*(Z) + \dot{V}^*(Z) = 0. \tag{14.46}$$

To find the optimal control solution, the tracking HJI equation (14.46) is first solved for $V^*$ and the control effort $L(u)^*$ is then given by (14.43).

Note that the minimization problem (14.41) is defined in terms of $L(u)$. Under certain conditions, this is equivalent to minimization in terms of $u(t)$.

Lemma 2

We have $\min_u H(Z, L(u), w, V_Z) = \min_{L(u)} H(Z, L(u), w, V_Z)$ if the elements of $L(u)$ are independent.

Proof

The first-order condition for minimizing $H(Z, L(u), w, V_Z)$ with respect to $u$ is

$$\frac{\partial H(Z, L(u), w, V_Z)}{\partial u} = \left(\frac{\partial L(u)}{\partial u}\right)^T \frac{\partial H(Z, L(u), w, V_Z)}{\partial L(u)} = 0 \tag{14.47}$$

and the first-order condition for minimizing $H(Z, L(u), w, V_Z)$ with respect to $L(u)$ is

$$\frac{d H(Z, L(u), w, V_Z)}{d L(u)} = 0. \tag{14.48}$$

Eqs. (14.47) and (14.48) are equivalent if and only if $J = dL(u)/du$ is a nonsingular matrix, which guarantees that the elements of $L(u)$ are independent [46]. □

Note that if the elements of $L(u)$ are independent, then the optimal control is given by

$$u^* = L^{-1}\left(-\bar{L}\tanh^T(v^*)\right), \tag{14.49}$$

and thus $L(u^*) = L(u)^*$. Otherwise, it is shown in the subsequent sections how to use (14.43) to find $v^*$, and consequently $u^*$, so as to assure $L(u^*) = L(u)^*$. The next result holds for both independent and dependent elements of $L(u)$.

Theorem 2

Solution to the bounded $L_2$-gain problem

Assume that there exists a continuous positive semidefinite solution $V^*(Z)$ to the tracking HJI equation (14.46). Let $L(u)^*$ be given by (14.43). Then $L(u)^*$ applied in (14.31) makes the $L_2$-gain from the disturbance to the performance output less than or equal to $\gamma$.

Proof

See [46]. □

If the elements of $L(u)^*$ are independent, then there exists a $u^*$ such that $L(u^*) = L(u)^*$, and this $u^*$ makes the $L_2$-gain less than or equal to $\gamma$. On the other hand, if the elements of $L(u)^*$ are dependent, a method of solution is suggested in the subsequent sections.

14.3.4 Off-Policy Reinforcement Learning for Nonaffine Systems

In this section, an off-policy RL method is presented to solve the optimal $H_\infty$ control of nonaffine nonlinear systems. In the proposed method, no knowledge about the system dynamics and the reference trajectory dynamics is needed. Moreover, it does not require an adjustable disturbance input, and it avoids bias in finding the value function. Two algorithms are developed for two different cases: (1) nonaffine systems with independent elements in $L(u)$ and (2) nonaffine systems with dependent elements in $L(u)$. Then the implementation of these two algorithms is given.

The system dynamics (14.39) can be rewritten as

$$\dot{Z}(t) = F(Z(t)) + G(Z(t)) L^j(u) + K w^j + G(Z(t))\left(L(u) - L^j(u)\right) + K\left(w - w^j\right), \tag{14.50}$$

where $L^j(u)$ and $w^j(t)$ are the policies that are updated. By contrast, $L(u)$ and $w(t)$ are the policies that are applied to the system to collect the data.

By definition, it is easy to see that

$$e^{-\alpha(t_k - t_{k-1})} V^{j+1}(Z(t_k)) - V^{j+1}(Z(t_{k-1})) = \int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\left(\left(V_Z^{j+1}\right)^T \dot{Z}(\tau) - \alpha V^{j+1}\right) d\tau. \tag{14.51}$$

Substituting (14.50) into (14.51) yields

$$e^{-\alpha(t_k - t_{k-1})} V^{j+1}(Z(t_k)) - V^{j+1}(Z(t_{k-1})) = \int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\left(\left(V_Z^{j+1}\right)^T\left[F(Z(\tau)) + G(Z(\tau)) L^j(u) + K w^j + G(Z(\tau))\left(L(u) - L^j(u)\right) + K\left(w - w^j\right)\right] - \alpha V^{j+1}\right) d\tau. \tag{14.52}$$

On the other hand, one has

$$\left(V_Z^{j+1}\right)^T\left[F(Z) + G(Z) L^j(u) + K w^j\right] = \alpha V^{j+1} - r_a\left(Z(t), L^j(u), w^j\right), \tag{14.53}$$

where

$$r_a\left(Z(t), L^j(u), w^j\right) = Z^T Q_1 Z + W\left(L^j(u)\right) - \gamma^2\left(w^j\right)^T w^j.$$

Substituting (14.53) into (14.52) yields

$$e^{-\alpha(t_k - t_{k-1})} V^{j+1}(Z(t_k)) - V^{j+1}(Z(t_{k-1})) = \int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\left(\left(V_Z^{j+1}\right)^T\left[G(Z(\tau))\left(L(u) - L^j(u)\right) + K\left(w - w^j\right)\right] - r_a\left(Z(\tau), L^j(u), w^j\right)\right) d\tau. \tag{14.54}$$

Using (14.43)–(14.45) in (14.54) yields the following off-policy $H_\infty$ Bellman equation:

$$e^{-\alpha(t_k - t_{k-1})} V^{j+1}(Z(t_k)) - V^{j+1}(Z(t_{k-1})) = \int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\left(v^{j+1}\bar{L}\left(\tanh^T(v^j) - \tanh^T(v)\right) + 2\gamma^2\left(w^{j+1}\right)^T\left(w - w^j\right) - r_a\left(Z(\tau), v^j, w^j\right)\right) d\tau. \tag{14.55}$$

Note that if $v^j$ and $w^j$ are given, the unknown functions $V^{j+1}(Z)$, $v^{j+1}$ and $w^{j+1}$ can be approximated using (14.55). Then $L^{j+1}(u)$ is found from $v^{j+1}$.

The elements of $L^{j+1}(u)$ can be either dependent or independent. If the elements of $L^{j+1}(u)$ are independent, then the Bellman equation (14.55) can be solved iteratively, using stored data, to find $L(u)^*$ and consequently the optimal control policy $u^*$. The following algorithm shows how to iterate on (14.55) to find the optimal control policy in this case.

Algorithm 3 gives $L^{j+1}(u)$ and, if the condition of Lemma 2 is satisfied, then the control input is $u^{j+1} = L^{-1}\left(-\bar{L}\tanh^T(v^{j+1})\right)$. However, if the elements of $L^{j+1}(u)$ are dependent, then the dependency of its elements must be taken into account by encoding equality constraints while solving Eq. (14.55) for $v^{j+1}$.

Algorithm 3 Online off-policy RL algorithm for nonaffine systems with independent elements in L(u).

To see the form of the constraints on $L(u)$ when it has dependent elements, consider the UAV system in Example 1 with

$$L(u(t)) = \begin{bmatrix} L_1 \\ L_2 \\ L_3 \\ L_4 \\ L_5 \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ u_2^2 \\ u_2\cos(u_3) \\ u_2\sin(u_3) \end{bmatrix}.$$

Then, the dependency among the elements of $L(u)$ becomes

$$L_3 = L_2^2 = L_4^2 + L_5^2.$$

This gives the following equality constraints:

$$\bar{L}_3\tanh(v_3) = \left(\bar{L}_2\tanh(v_2)\right)^2 = \left(\bar{L}_4\tanh(v_4)\right)^2 + \left(\bar{L}_5\tanh(v_5)\right)^2.$$

In general, one has a vector of equality constraint functions

$$f(L) = \left[f_1(L), \ldots, f_p(L)\right]^T = 0, \tag{14.57}$$

with $p$ being the number of dependent elements in $L(u)$. For example, for the UAV system one has $f_1 = \bar{L}_3\tanh(v_3) - \left(\bar{L}_2\tanh(v_2)\right)^2$, $f_2 = \left(\bar{L}_2\tanh(v_2)\right)^2 - \left(\bar{L}_4\tanh(v_4)\right)^2 - \left(\bar{L}_5\tanh(v_5)\right)^2$ and $f_3 = \bar{L}_3\tanh(v_3) - \left(\bar{L}_4\tanh(v_4)\right)^2 - \left(\bar{L}_5\tanh(v_5)\right)^2$. This constraint must be taken into account when solving (14.55) for $v$ using NNs.
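A minimal encoding of (14.57) for the UAV example is sketched below; it is written directly in terms of the elements of $L$ (so it does not depend on how $v$ parameterizes $L$) and can be handed to a constrained solver as an equality constraint:

```python
import numpy as np

def uav_equality_constraints(L):
    """f(L) = 0 for the UAV of Example 1, i.e. L3 = L2^2 = L4^2 + L5^2."""
    L1, L2, L3, L4, L5 = L
    return np.array([L3 - L2 ** 2,                  # f1
                     L2 ** 2 - L4 ** 2 - L5 ** 2,   # f2
                     L3 - L4 ** 2 - L5 ** 2])       # f3
```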

The following algorithm shows how to find the optimal control solution for the cases where $L(u)$ has dependent elements. The details of solving (14.55) for $v$ while considering the constraints imposed by the dependency of the elements of $L(u)$ are presented in the next subsection.

Before proceeding, $\bar{H}$ is defined as

$$\bar{H} = \int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\left(v^{j+1}\bar{L}\left(\tanh^T(v^j) - \tanh^T(v)\right) + 2\gamma^2\left(w^{j+1}\right)^T\left(w - w^j\right) - r_a\left(Z(\tau), v^j, w^j\right)\right) d\tau - e^{-\alpha(t_k - t_{k-1})} V^{j+1}(Z(t_k)) + V^{j+1}(Z(t_{k-1})).$$

The minimum value of $\bar{H}$ in Algorithm 4, when the constraint (14.57) is not considered, is zero. If this algorithm terminates, so that $\bar{H} = 0$, then by Theorem 2 the $L_2$-gain problem is solved and there exists a $u^*$ such that $L(u^*) = L(u)^*$.

Algorithm 4 Online off-policy RL algorithm for nonaffine systems with dependent elements in L(u).

The following subsection shows how to use NNs, along with linear and nonlinear least squares, respectively, to implement Algorithms 3 and 4.

14.3.5 Neural Networks for Implementation of Off-Policy RL Algorithms

In this subsection, the solution of the off-policy $H_\infty$ Bellman equations (14.56) and (14.58) in Algorithms 3 and 4 using three NNs is presented. The unknown functions $V^{j+1}(Z)$, $v^{j+1}$ and $w^{j+1}$ can be approximated by three NNs as

$$\hat{V}^{j+1}(Z) = \sum_{i=1}^{N_1} \hat{c}_i^{j+1}\phi_i(Z) = \hat{C}^{j+1}\phi(Z), \tag{14.59}$$

$$\hat{v}_i^{j+1} = \sum_{k=1}^{N_2} \hat{p}_{i,k}^{j+1}\sigma_{i,k}(Z) = \hat{P}_i^{j+1}\sigma_i(Z), \tag{14.60}$$

$$\hat{w}_i^{j+1} = \sum_{k=1}^{N_3} \hat{q}_{i,k}^{j+1}\rho_{i,k}(Z) = \hat{Q}_i^{j+1}\rho_i(Z), \tag{14.61}$$

where $\hat{v}^{j+1} = [\hat{v}_1^{j+1}, \ldots, \hat{v}_l^{j+1}]$ and $\hat{w}^{j+1} = [\hat{w}_1^{j+1}, \ldots, \hat{w}_q^{j+1}]$. The terms $\phi(Z) = [\phi_1, \ldots, \phi_{N_1}]$, $\sigma_i(Z) = [\sigma_{i,1}, \ldots, \sigma_{i,N_2}]$ and $\rho_i(Z) = [\rho_{i,1}, \ldots, \rho_{i,N_3}]$ are basis function vectors, $\hat{C}^{j+1}$, $\hat{P}_i^{j+1}$ and $\hat{Q}_i^{j+1}$ are constant weight vectors and $N_1$, $N_2$ and $N_3$ are the numbers of neurons. Substituting (14.59)–(14.61) into the off-policy $H_\infty$ Bellman equation (14.55) yields

$$\begin{aligned} & \hat{C}^{j+1}\left[e^{-\alpha(t_k - t_{k-1})}\phi(Z(t_k)) - \phi(Z(t_{k-1}))\right] \\ &\quad = \int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\left(\sum_{i=1}^{l}\hat{P}_i^{j+1}\sigma_i(Z)\bar{L}_i\left(\tanh(\hat{v}_i^j) - \tanh(v_i)\right) + 2\gamma^2\sum_{i=1}^{q}\hat{Q}_i^{j+1}\rho_i(Z)\left(w_i - w_i^j\right) - r_a\left(Z(\tau), \hat{v}^j, \hat{w}^j\right)\right) d\tau. \end{aligned} \tag{14.62}$$

By defining $\hat{P} = [\hat{P}_1, \ldots, \hat{P}_l]$ and $\hat{Q} = [\hat{Q}_1, \ldots, \hat{Q}_q]$, Eq. (14.62) can be rewritten as

$$\hat{W}^T h(t_k) = y(t_k), \tag{14.63}$$

where

$$\hat{W} = \left[\left(\hat{C}^{j+1}\right)^T \; \left(\hat{P}^{j+1}\right)^T \; \left(\hat{Q}^{j+1}\right)^T\right]^T, \qquad h(t_k) = \begin{bmatrix} e^{-\alpha(t_k - t_{k-1})}\phi(Z(t_k)) - \phi(Z(t_{k-1})) \\ -\int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\sigma_1(Z)\bar{L}_1\left(\tanh(\hat{v}_1^j) - \tanh(v_1)\right) d\tau \\ \vdots \\ -\int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\sigma_l(Z)\bar{L}_l\left(\tanh(\hat{v}_l^j) - \tanh(v_l)\right) d\tau \\ -2\gamma^2\int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\rho_1(Z)\left(w_1 - w_1^j\right) d\tau \\ \vdots \\ -2\gamma^2\int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})}\rho_q(Z)\left(w_q - w_q^j\right) d\tau \end{bmatrix}, \qquad y(t_k) = -\int_{t_{k-1}}^{t_k} e^{-\alpha(\tau - t_{k-1})} r_a\left(Z(\tau), \hat{v}^j, \hat{w}^j\right) d\tau.$$

Case 1: Independent Elements of $L(u)$.  Eq. (14.63) can be solved for the parameter vector $\hat{W}$ using the least squares method. Then the approximated value function and disturbance input are given by (14.59) and (14.61), respectively, and the control input $\hat{u}^{j+1}$ is found by determining $L(\hat{u}^{j+1})$ from (14.60), using the relation between $L(u)$ and $v$. The number of unknown parameters in $\hat{W}$ is $N_1 + N_2 + N_3$. Thus, at least $N \ge N_1 + N_2 + N_3$ data samples, collected at times $t_1$ to $t_N$, should be gathered before solving (14.63) in the least squares sense with

$$Y = \left[y(t_1), \ldots, y(t_N)\right]^T, \qquad H = \left[h(t_1), \ldots, h(t_N)\right].$$

The least squares solution is obtained as

$$\hat{W} = \left(H H^T\right)^{-1} H Y.$$

Case 2: Dependent Elements of $L(u)$.  If the elements of $L(u)$ are dependent, one has to solve a constrained nonlinear least squares problem to take into account the equality constraints imposed by the dependency of the elements of $L(u)$. To show this, consider the case of the UAV in Example 1. The following constraints are considered when finding the weights of the NNs:

$$\bar{L}_3\tanh\left(\hat{P}_3^{j+1}\sigma_3(Z)\right) = \left(\bar{L}_2\tanh\left(\hat{P}_2^{j+1}\sigma_2(Z)\right)\right)^2 = \left(\bar{L}_4\tanh\left(\hat{P}_4^{j+1}\sigma_4(Z)\right)\right)^2 + \left(\bar{L}_5\tanh\left(\hat{P}_5^{j+1}\sigma_5(Z)\right)\right)^2.$$

This constraint is nonlinear in the NN weights and thus requires the use of a nonlinear least squares method. In general, (14.58) becomes

$$\arg\min_{\hat{W}}\left\|\hat{W}^T H - Y\right\|^2 \quad \text{s.t.} \quad f\left(\hat{P}^{j+1}, \sigma_1, \ldots, \sigma_l\right) = 0,$$

where the function $f$ is defined in (14.57) and depends on how the elements of $L(u)$, and consequently the NN weights, are related.
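As a sketch of this constrained fit (with the data matrices `H`, `Y`, an initial weight guess `W0` and a constraint callable assumed to be available from the preceding construction), SciPy's SLSQP solver can be used to minimize the residual of (14.63) subject to the equality constraints:

```python
import numpy as np
from scipy.optimize import minimize

def solve_weights_constrained(H, Y, eq_constraint, W0):
    """Constrained nonlinear least squares for the NN weights:
    minimize ||W^T H - Y||^2  subject to  eq_constraint(W) = 0,
    where eq_constraint encodes the dependency among the elements of L(u)."""
    def residual(W):
        r = W @ H - Y                    # W^T H - Y, with W as a flat weight vector
        return float(r @ r)
    cons = [{"type": "eq", "fun": eq_constraint}]
    sol = minimize(residual, W0, method="SLSQP", constraints=cons)
    return sol.x
```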

References

[1] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, vol. 1. Cambridge: MIT Press; 2017.

[2] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming. MA: Athena Scientific; 1996.

[3] W.B. Powell, Approximate Dynamic Programming. Hoboken, NJ: Wiley; 2007.

[4] F.L. Lewis, D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley; 2013.

[5] D. Vrabie, K.G. Vamvoudakis, F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. Institution of Engineering and Technology; 2013.

[6] H. Zhang, L. Cui, X. Zhang, X. Luo, Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method, IEEE Transactions on Neural Networks 2011;22:2226–2236.

[7] K.G. Vamvoudakis, F.L. Lewis, Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 2010;46(5):878–888.

[8] D. Vrabie, F.L. Lewis, Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems, Neural Networks 2009;22(3):237–246.

[9] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, F.L. Lewis, Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica 2009;45(2):477–484.

[10] R. Song, W. Xiao, H. Zhang, C. Sun, Adaptive dynamic programming for a class of complex-valued nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems 2014;25(9):1733–1739.

[11] H. Modares, F.L. Lewis, M.B. Naghibi-Sistani, Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks, IEEE Transactions on Neural Networks and Learning Systems 2013;24(10):1513–1525.

[12] S. Bhasin, R. Kamalapurkar, M. Johnson, K.G. Vamvoudakis, F.L. Lewis, W.E. Dixon, A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems, Automatica 2013;49(1):82–92.

[13] T. Bian, Y. Jiang, Z.-P. Jiang, Adaptive dynamic programming and optimal control of nonlinear nonaffine systems, Automatica 2014;50(10):2624–2632.

[14] Y. Jiang, Z.-P. Jiang, Robust adaptive dynamic programming and feedback stabilization of nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems 2014;25(5):882–893.

[15] Y. Jiang, Z.-P. Jiang, Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica 2012;48(10):2699–2704.

[16] D. Liu, H. Li, D. Wang, Error bounds of adaptive dynamic programming algorithms for solving undiscounted optimal control problems, IEEE Transactions on Neural Networks and Learning Systems 2015;26(6):1323–1334.

[17] B. Luo, H.-N. Wu, T. Huang, D. Liu, Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design, Automatica 2014;50(12):3281–3290.

[18] B. Luo, H.-N. Wu, H.-X. Li, Adaptive optimal control of highly dissipative nonlinear spatially distributed processes with neuro-dynamic programming, IEEE Transactions on Neural Networks and Learning Systems 2015;26(4):684–696.

[19] M. Abu-Khalaf, F.L. Lewis, J. Huang, Neurodynamic programming and zero-sum games for constrained control systems, IEEE Transactions on Neural Networks 2008;19(7):1243–1252.

[20] H. Zhang, Q. Wei, D. Liu, An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games, Automatica 2011;47(1):207–214.

[21] K.G. Vamvoudakis, F.L. Lewis, Online solution of nonlinear two-player zero-sum games using synchronous policy iteration, International Journal of Robust and Nonlinear Control 2012;22(13):1460–1483.

[22] K.G. Vamvoudakis, F.L. Lewis, Online Gaming: Real Time Solution of Nonlinear Two-Player Zero-Sum Games Using Synchronous Policy Iteration. INTECH Open Access Publisher; 2011.

[23] H. Modares, F.L. Lewis, M.-B. Naghibi-Sistani, Online solution of nonquadratic two-player zero-sum games arising in the $H_\infty$ control of constrained input systems, International Journal of Adaptive Control and Signal Processing 2014;28(3–5):232–254.

[24] H. Zhang, C. Qin, B. Jiang, Y. Luo, Online adaptive policy learning algorithm for $H_\infty$ state feedback control of unknown affine nonlinear discrete-time systems, IEEE Transactions on Cybernetics 2014;44(12):2706–2718.

[25] H.-N. Wu, B. Luo, Simultaneous policy update algorithms for learning the solution of linear continuous-time $H_\infty$ state feedback control, Information Sciences 2013;222:472–485.

[26] H.-N. Wu, B. Luo, Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear $H_\infty$ control, IEEE Transactions on Neural Networks and Learning Systems 2012;23(12):1884–1895.

[27] B. Luo, H.-N. Wu, Computationally efficient simultaneous policy update algorithm for nonlinear $H_\infty$ state feedback control with Galerkin's method, International Journal of Robust and Nonlinear Control 2013;23(9):991–1012.

[28] D. Vrabie, F.L. Lewis, Adaptive dynamic programming for online solution of a zero-sum differential game, Journal of Control Theory and Applications 2011;9(3):353–360.

[29] H. Li, D. Liu, D. Wang, Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics, IEEE Transactions on Automation Science and Engineering 2014;11(3):706–714.

[30] B. Luo, H.-N. Wu, T. Huang, Off-policy reinforcement learning for $H_\infty$ control design, IEEE Transactions on Cybernetics 2015;45(1):65–76.

[31] R.A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press; 1960.

[32] K.G. Vamvoudakis, D. Vrabie, F.L. Lewis, Online adaptive algorithm for optimal control with integral reinforcement learning, International Journal of Robust and Nonlinear Control 2014;24(17):2686–2710.

[33] B. Kiumarsi, F.L. Lewis, H. Modares, A. Karimpour, M.-B. Naghibi-Sistani, Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics, Automatica 2014;50(4):1167–1175.

[34] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2008;38:937–942.

[35] D. Wang, D. Liu, Q. Wei, Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach, Neurocomputing 2012;78:14–22.

[36] T. Dierks, S. Jagannathan, Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics, in: Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 2009 28th Chinese Control Conference (CDC/CCC), 2009, pp. 6750–6755.

[37] T. Dierks, S. Jagannathan, Optimal control of affine nonlinear continuous-time systems, in: Proceedings of the American Control Conference (ACC), 2009, pp. 6750–6755.

[38] M.D.S. Aliyu, Nonlinear $H_\infty$ Control, Hamiltonian Systems and Hamilton–Jacobi Equations. CRC Press; 2017.

[39] S. Devasia, D. Chen, B. Paden, Nonlinear inversion-based output tracking, IEEE Transactions on Automatic Control 1996;41(7):930–942.

[40] G.J. Toussaint, T. Basar, F. Bullo, $H_\infty$-optimal tracking control techniques for nonlinear underactuated systems, in: Proceedings of the 39th IEEE Conference on Decision and Control, vol. 3, 2000, pp. 2078–2083.

[41] J.A. Ball, P. Kachroo, A.J. Krener, $H_\infty$ tracking control for a class of nonlinear systems, IEEE Transactions on Automatic Control 1999;44(6):1202–1206.

[42] T. Basar, P. Bernhard, $H_\infty$ Optimal Control and Related Minimax Design Problems. Boston, MA: Birkhäuser; 1995.

[43] F.L. Lewis, D. Vrabie, V. Syrmos, Optimal Control. 3rd edition Wiley; 2012.

[44] H. Modares, F.L. Lewis, Z.-P. Jiang, $H_\infty$ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems 2015;26:2550–2562.

[45] B.L. Stevens, F.L. Lewis, E.N. Johnson, Aircraft Control and Simulation. third edition Wiley-Blackwell; 2015.

[46] B. Kiumarsi, W. Kang, F.L. Lewis, $H_\infty$ control of non-affine aerial systems using off-policy reinforcement learning, Unmanned Systems 2016;4(1):51–60.
