The agent would like to obtain the different reward values $(\overrightarrow{r}_i^t, \overleftarrow{r}_i^t)\ (1 \le i \le m)$ and the final reward value set $(\overrightarrow{r}^t, \overleftarrow{r}^t) = \{(\overrightarrow{r}_1^t, \overleftarrow{r}_1^t), (\overrightarrow{r}_2^t, \overleftarrow{r}_2^t), \ldots, (\overrightarrow{r}_m^t, \overleftarrow{r}_m^t)\}$ by taking different actions $a_i^t\ (1 \le i \le m)$. Thus, the agent will select the peak value in the set, defined as $(\overrightarrow{r}_{\max}^t, \overleftarrow{r}_{\max}^t)\ (1 \le \max \le m)$. Based on the peak reward value, the agent will obtain the corresponding action $a_{\max}^t\ (1 \le \max \le m)$. Therefore, the current state of the agent will be converted from $(\overrightarrow{S}^t, \overleftarrow{S}^t)$ to $(\overrightarrow{S}^{t+1}, \overleftarrow{S}^{t+1})$. Overall, the learning process of the agent can be formulated as $(\overrightarrow{R}^t, \overleftarrow{R}^t): (\overrightarrow{S}^{t-1}, \overleftarrow{S}^{t-1}) \times A^t \to (\overrightarrow{S}^t, \overleftarrow{S}^t)$.

Immediate-return single-agent learning is the simplest agent learning approach. The method does not need to consider whether follow-up actions influence the current learning step. Its key idea is to select the maximal reward value at each step of the learning process, which gives it low computational complexity. It turns out, however, that if we always select the maximal reward value in every step, the final accumulated reward value will not necessarily be optimal.
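The one-step selection rule can be made concrete with a minimal Python sketch, assuming a dynamic fuzzy reward is represented as a (forward, backward) tuple, pairs are compared lexicographically, and action labels such as '1->2' are written this way purely for illustration:

```python
# A minimal sketch of the immediate-return (greedy) selection step.
# A dynamic fuzzy (DF) reward is modelled as a (forward, backward) tuple;
# comparing such tuples lexicographically is an illustrative assumption.

def select_peak_action(rewards):
    """rewards: dict mapping action -> DF reward pair, e.g. {'1->2': (9, 9), ...}.
    Returns (a_max, (r_max, r_max)), the peak reward and its action."""
    best_action = max(rewards, key=lambda a: rewards[a])
    return best_action, rewards[best_action]

# One greedy step from state 1 of Example 7.2 below:
step_1 = {'1->2': (9, 9), '1->3': (7, 7), '1->4': (3, 3), '1->5': (2, 2)}
print(select_peak_action(step_1))   # ('1->2', (9, 9))
```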

Fig. 7.2: Example 7.2.

Example 7.2 (See Fig. 7.2).

Let point 1 be the initial state S and point 12 be the target state T. Points 2–11 represent the accessible intermediate states, and the edge weights are the reward values.

Starting from state S, we obtain the following learning process under the computational method described by Definition 7.3.

(1) $S = \{1\}$, $a^1 = \{1 \to 2, 1 \to 3, 1 \to 4, 1 \to 5\}$, $(\overrightarrow{r}^1, \overleftarrow{r}^1) = \{(9, 9), (7, 7), (3, 3), (2, 2)\}$. We know that $(\overrightarrow{R}^1, \overleftarrow{R}^1) = (9, 9)$ and $A^1 = 1 \to 2$. As $(\overrightarrow{R}, \overleftarrow{R}) = \{(9, 9)\}$, $A = \{1 \to 2\}$ does not allow the agent to reach the target state; the learning process continues.

(2) $S = \{1, 2\}$, $(\overrightarrow{r}^2, \overleftarrow{r}^2) = \{(4, 4), (2, 2), (1, 1)\}$.
We know that $(\overrightarrow{R}^2, \overleftarrow{R}^2) = (4, 4)$ and $A^2 = 2 \to 6$. As $(\overrightarrow{R}, \overleftarrow{R}) = \{(9, 9), (4, 4)\}$, $A = \{1 \to 2, 2 \to 6\}$ does not reach the target state; the learning process continues.

(3) $S = \{1, 2, 6\}$, $a^3 = \{6 \to 9, 6 \to 10\}$, $(\overrightarrow{r}^3, \overleftarrow{r}^3) = \{(6, 6), (5, 5)\}$.
We know that $(\overrightarrow{R}^3, \overleftarrow{R}^3) = (6, 6)$ and $A^3 = 6 \to 9$. As $(\overrightarrow{R}, \overleftarrow{R}) = \{(9, 9), (4, 4), (6, 6)\}$, $A = \{1 \to 2, 2 \to 6, 6 \to 9\}$ does not reach the target state; the learning process continues.

(4) $S = \{1, 2, 6, 9\}$, $a^4 = \{9 \to 12\}$, $(\overrightarrow{r}^4, \overleftarrow{r}^4) = \{(4, 4)\}$.

We know that $(\overrightarrow{R}^4, \overleftarrow{R}^4) = (4, 4)$ and $A^4 = 9 \to 12$. As $(\overrightarrow{R}, \overleftarrow{R}) = \{(9, 9), (4, 4), (6, 6), (4, 4)\}$, $A = \{1 \to 2, 2 \to 6, 6 \to 9, 9 \to 12\}$ has reached the target state; the learning process is terminated.

Therefore, the final strategy of the agent is $\pi = A = \{1 \to 2, 2 \to 6, 6 \to 9, 9 \to 12\}$. If the accumulated reward value of this example is defined as the sum of the reward values from the initial state to the target state, the result is $(9, 9) + (4, 4) + (6, 6) + (4, 4) = (23, 23)$.

However, in Fig. 7.2, we can see that the accumulated reward value $(23, 23)$ obtained from $\pi = \{1 \to 2, 2 \to 6, 6 \to 9, 9 \to 12\}$ is not the maximal value. There is an action set $\{1 \to 4, 4 \to 8, 8 \to 11, 11 \to 12\}$ whose corresponding reward value is $(3, 3) + (11, 11) + (6, 6) + (5, 5) = (25, 25) > (23, 23)$.
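As a quick check on the arithmetic, the following sketch adds the DF reward pairs componentwise for the greedy strategy and for the alternative action set (the reward values are taken directly from this example; componentwise addition of DF pairs is an illustrative assumption):

```python
# Recompute the two accumulated DF rewards discussed above.

def accumulate(path_rewards):
    """Add DF reward pairs componentwise along a path."""
    return (sum(r[0] for r in path_rewards), sum(r[1] for r in path_rewards))

greedy_path = [(9, 9), (4, 4), (6, 6), (4, 4)]    # 1->2, 2->6, 6->9, 9->12
better_path = [(3, 3), (11, 11), (6, 6), (5, 5)]  # 1->4, 4->8, 8->11, 11->12

print(accumulate(greedy_path))   # (23, 23)
print(accumulate(better_path))   # (25, 25) > (23, 23)
```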

How can we find the optimal strategy? The enumeration method is obviously clumsy. Bellman proposed the principle of optimality, and the resulting recurrence relation provides a better method. With it, we can easily obtain the optimal sequence of strategies and decrease the problem size. The principle of optimality states that the optimal sequence of strategies of the learning process has a specific property: whatever the initial state and initial decision are, the remaining strategies must constitute an optimal strategy with respect to the state resulting from that initial decision.

What follows is a non-immediate-return reinforcement learning algorithm, Q-learning. This algorithm can obtain the optimal control strategy from delayed rewards, even without prior knowledge.

7.3.3 Q-learning function based on DFL

Definition 7.4 The evaluation function $Q((\overrightarrow{s}, \overleftarrow{s}), a)$ of Q-learning based on DFL is defined as follows: the value of Q is the reward accumulated by conducting action $a$ in state $(\overrightarrow{s}, \overleftarrow{s})$, to which the value of following the optimal strategy thereafter is added. That is,

$Q((\overrightarrow{s}, \overleftarrow{s}), a) \equiv (\overrightarrow{R}, \overleftarrow{R})((\overrightarrow{s}, \overleftarrow{s}), a) + \gamma V^*((\overrightarrow{s}', \overleftarrow{s}')).$ (7.2)

Here, $(\overrightarrow{R}, \overleftarrow{R})((\overrightarrow{s}, \overleftarrow{s}), a)$ is the immediate return value after the agent has taken action $a$ in state $(\overrightarrow{s}, \overleftarrow{s})$, and $(\overrightarrow{s}', \overleftarrow{s}')$ is the subsequent state after action $a$ has been taken.

Now, we have a problem: if the Q function corresponds to the optimal strategy, how should the agent obtain the Q function? Let us first make the following analysis. To begin with, the following relationship holds between Q and $V^*$:

$V^*((\overrightarrow{s}, \overleftarrow{s})) = \max_{a'} Q((\overrightarrow{s}, \overleftarrow{s}), a').$ (7.3)

Through a simple mathematical derivation, we have

$Q((\overrightarrow{s}, \overleftarrow{s}), a) = (\overrightarrow{R}, \overleftarrow{R})((\overrightarrow{s}, \overleftarrow{s}), a) + \gamma \max_{a'} Q((\overrightarrow{s}', \overleftarrow{s}'), a').$ (7.4)

Now we know that the Q function of the current state is determined by the Q function of the subsequent state and the immediate return function.
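The recursion in (7.4) is a one-step backup: the Q value of a state-action pair is its immediate DF return plus the discounted maximum Q value over the actions available in the successor state. A minimal sketch of that single backup, assuming DF values are (forward, backward) tuples that are added and scaled componentwise:

```python
# One backup of eq. (7.4): Q(s, a) <- R(s, a) + gamma * max_{a'} Q(s', a').
# DF values are modelled as (forward, backward) tuples (an illustrative assumption).

def df_add(x, y):
    return (x[0] + y[0], x[1] + y[1])

def df_scale(gamma, x):
    return (gamma * x[0], gamma * x[1])

def backup(immediate_reward, successor_q_values, gamma=1.0):
    """immediate_reward: DF pair R((s,s), a); successor_q_values: list of DF pairs
    Q((s',s'), a') over the actions a' available in the successor state."""
    best_next = max(successor_q_values) if successor_q_values else (0, 0)
    return df_add(immediate_reward, df_scale(gamma, best_next))

# e.g. Q(6, 6->9) = (6,6) + max{Q(9, 9->12)} = (6,6) + (4,4) = (10,10) with gamma = 1
print(backup((6, 6), [(4, 4)]))   # (10, 10)
```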

7.3.4 Q-learning algorithm based on DFL

We now present Algorithm 7.7 (the Q-learning algorithm based on DFL) and give an example.

Example 7.3 We have six lattices, each representing a state that the agent can access (See Fig. 7.3). To express this conveniently, we set the serial numbers of the lattices to run from 1 to 6 from left to right, top to bottom. The central lattice is number 3 and is the target state.

Algorithm 7.7 The Q-learning algorithm based on DFL

Input: the state of the agent is $(\overrightarrow{s}, \overleftarrow{s})$; the action of the agent is $a$; the value of the Q function $Q((\overrightarrow{s}, \overleftarrow{s}), a)$ is initialised to $(0, 0)$.

Repeat: observe the current state $(\overrightarrow{s}, \overleftarrow{s})$.

Select an action $a$ from the current action list and conduct it.

When the agent conducts the action, it receives the immediate reward value $(\overrightarrow{r}, \overleftarrow{r})$.

Observe the new state $(\overrightarrow{s}', \overleftarrow{s}')$.

Update the table entry $Q((\overrightarrow{s}, \overleftarrow{s}), a)$ as follows:

$Q((\overrightarrow{s}, \overleftarrow{s}), a) \leftarrow (\overrightarrow{R}, \overleftarrow{R})((\overrightarrow{s}, \overleftarrow{s}), a) + \gamma \max_{a'} Q((\overrightarrow{s}', \overleftarrow{s}'), a')$

Set the current state $(\overrightarrow{s}, \overleftarrow{s}) \leftarrow (\overrightarrow{s}', \overleftarrow{s}')$.
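To show how this table-update loop behaves on a concrete problem, here is a sketch that runs it on the portion of the Fig. 7.2 graph whose reward values are stated explicitly in Examples 7.2 and 7.4 (edges not mentioned in the text are omitted); modelling DF rewards as (forward, backward) tuples, using componentwise arithmetic, exploring at random, and taking a discount of 1 are all illustrative assumptions, not the book's prescriptions:

```python
import random

# A sketch of Algorithm 7.7 on the part of the Fig. 7.2 graph whose rewards are
# given explicitly in Examples 7.2 and 7.4 (all other edges are omitted here).
# DF rewards are (forward, backward) tuples; gamma = 1 as in the worked example.

EDGES = {  # state -> {next_state (the action): DF reward}
    1: {2: (9, 9), 3: (7, 7), 4: (3, 3), 5: (2, 2)},
    2: {6: (4, 4)},
    4: {8: (11, 11)},
    6: {9: (6, 6), 10: (5, 5)},
    7: {9: (4, 4), 10: (3, 3)},
    8: {10: (5, 5), 11: (6, 6)},
    9: {12: (4, 4)},
    10: {12: (2, 2)},
    11: {12: (5, 5)},
}
TARGET = 12

def q_learning(episodes=500, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): (0, 0) for s, acts in EDGES.items() for a in acts}
    for _ in range(episodes):
        s = rng.choice(list(EDGES))              # start each episode at a random state
        while s != TARGET and s in EDGES:
            a = rng.choice(list(EDGES[s]))       # explore: pick any available action
            r = EDGES[s][a]
            successors = [Q[(a, a2)] for a2 in EDGES.get(a, {})]
            best = max(successors) if successors else (0, 0)
            Q[(s, a)] = (r[0] + gamma * best[0], r[1] + gamma * best[1])
            s = a                                # the action label is the next state
    return Q

Q = q_learning()
print(max(Q[(1, a)] for a in EDGES[1]))          # (25, 25), matching Q(1, *) in Example 7.4
```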

Because the state of the agent is a dynamic fuzzy variable, we assume the agent will obtain an enhanced reward value with reduced ability when its state changes from number 6 to number 3. However, the agent will obtain a subdued reward value with enhanced ability when its state changes from number 5 to number 2.

Therefore, we assume that the immediate reward value of every state in the learning process of the agent will change. Then, under the Q-learning algorithm based on DFL, we obtain the corresponding state values $Q((\overrightarrow{s}, \overleftarrow{s}), a)$ shown in Fig. 7.3. If the initial state of the agent is number 4, the optimal path to reach the target state is 4 → 5 → 2 → 3.

Fig. 7.3: The example of Algorithm 7.7.

Example 7.4 We use the Q-learning algorithm based on DFL to solve Example 7.2. Let the maximal accumulated reward value of the agent when starting from state $n$ be $Q(n, *)$. Then we have $Q(9, 9 \to 12) = (4, 4)$, so $Q(9, *) = Q(9, 9 \to 12) = (4, 4)$, and $Q(10, 10 \to 12) = (2, 2)$.

Thus, $Q(10, *) = Q(10, 10 \to 12) = (2, 2)$ and $Q(11, 11 \to 12) = (5, 5)$.

Then, we have

$Q(11, *) = Q(11, 11 \to 12) = (5, 5),$ (7.5)

$Q(6, 6 \to 9) = (6, 6) + Q(9, *) = (6, 6) + (4, 4) = (10, 10),$ (7.6)

$Q(6, 6 \to 10) = (5, 5) + Q(10, *) = (5, 5) + (2, 2) = (7, 7).$ (7.7)

Therefore, we know

$Q(6, *) = \max\{Q(6, 6 \to 9), Q(6, 6 \to 10)\} = (10, 10),$ (7.8)

$Q(7, 7 \to 9) = (4, 4) + Q(9, *) = (4, 4) + (4, 4) = (8, 8),$ (7.9)

$Q(7, 7 \to 10) = (3, 3) + Q(10, *) = (3, 3) + (2, 2) = (5, 5).$ (7.10)

Hence, in the next step, we have

$Q(7, *) = \max\{Q(7, 7 \to 9), Q(7, 7 \to 10)\} = (8, 8),$ (7.11)

$Q(8, 8 \to 10) = (5, 5) + Q(10, *) = (5, 5) + (2, 2) = (7, 7),$ (7.12)

$Q(8, 8 \to 11) = (6, 6) + Q(11, *) = (6, 6) + (5, 5) = (11, 11).$ (7.13)

Finally, $Q(8, *) = \max\{Q(8, 8 \to 10), Q(8, 8 \to 11)\} = (11, 11)$.

Proceeding in the same way, we obtain $Q(2, *) = (14, 14)$, $Q(3, *) = (15, 15)$, $Q(4, *) = (22, 22)$, and $Q(5, *) = (19, 19)$.

Thus, we get

$Q(1, 1 \to 2) = (9, 9) + Q(2, *) = (9, 9) + (14, 14) = (23, 23),$ (7.14)

$Q(1, 1 \to 3) = (7, 7) + Q(3, *) = (7, 7) + (15, 15) = (22, 22),$ (7.15)

$Q(1, 1 \to 4) = (3, 3) + Q(4, *) = (3, 3) + (22, 22) = (25, 25),$ (7.16)

$Q(1, 1 \to 5) = (2, 2) + Q(5, *) = (2, 2) + (19, 19) = (21, 21).$ (7.17)

Here, $Q(1, *) = (25, 25)$, so the optimal strategy is $\pi = \{1 \to 4, 4 \to 8, 8 \to 11, 11 \to 12\}$, and the maximal accumulated reward value is $(25, 25)$, i.e. $Q(1, *) = (25, 25)$.
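The backward computation in Example 7.4 is a dynamic-programming recursion on $Q(n, *)$. The short sketch below reproduces it on the same partial edge set used in the earlier sketch (only edges whose rewards appear in the text; componentwise DF arithmetic remains an illustrative assumption) and recovers both $Q(1, *)$ and the optimal strategy:

```python
from functools import lru_cache

# Backward recursion Q(n, *) = max_a [ r(n, a) + Q(next(a), *) ] on the partial edge set.
EDGES = {
    1: {2: (9, 9), 3: (7, 7), 4: (3, 3), 5: (2, 2)},
    2: {6: (4, 4)}, 4: {8: (11, 11)},
    6: {9: (6, 6), 10: (5, 5)}, 7: {9: (4, 4), 10: (3, 3)},
    8: {10: (5, 5), 11: (6, 6)},
    9: {12: (4, 4)}, 10: {12: (2, 2)}, 11: {12: (5, 5)},
}
TARGET = 12

@lru_cache(maxsize=None)
def q_star(n):
    """Maximal accumulated DF reward from state n to the target state."""
    if n == TARGET or n not in EDGES:
        return (0, 0)
    return max((r[0] + q_star(m)[0], r[1] + q_star(m)[1]) for m, r in EDGES[n].items())

def best_path(n):
    """Follow the argmax of the recursion to recover the optimal strategy."""
    path = []
    while n in EDGES and n != TARGET:
        n = max(EDGES[n], key=lambda m: (EDGES[n][m][0] + q_star(m)[0],
                                         EDGES[n][m][1] + q_star(m)[1]))
        path.append(n)
    return path

print(q_star(1))      # (25, 25)
print(best_path(1))   # [4, 8, 11, 12], i.e. pi = {1->4, 4->8, 8->11, 11->12}
```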

7.4 Multi-agent learning algorithm based on DFL

In the previous section, we developed single-agent learning algorithms based on DFL. These single-agent learning algorithms build a foundation for multi-agent learning. In the multi-agent learning model, each agent learns according to a particular single-agent learning algorithm, based on its own position in the model.

7.4.1 Multi-agent learning model based on DFL

Definition 7.5 In the multi-agent learning model, each circular symbol is an agent (see Fig. 7.4). Some agents form an agent group. Each agent in a group cooperates with the others in the same group, while different groups compete with each other.

We now introduce a cooperative multi-agent learning algorithm and a competitive multi-agent learning algorithm.

7.4.2 Cooperative multi-agent learning algorithm based on DFL

In the real world, when a team is tasked with achieving a goal, there are various ways of proceeding. For example, each person in the team can provide an idea, and the team as a whole makes a comprehensive assessment of all the ideas. Alternatively, the mission could be divided into several parts, each of which is assigned to a different team member.
