7 Multi-agent learning model based on dynamic fuzzy logic

This chapter is divided into five sections. Section 7.1 gives a brief introduction to this chapter. We present the agent mental model based on DFL in Section 7.2 and the single-agent learning algorithm based on DFL in Section 7.3. In Section 7.4, we introduce the algorithm of the multi-agent learning model based on DFL. The last section is the summary of the chapter.

7.1 Introduction

By using a learning method, an agent can enhance its knowledge and abilities; this is known as agent learning. In other words, the knowledge base of the agent is supplemented, and its ability to execute tasks is improved or better fitted to the interests and habits of its users.

7.1.1 Strategic classification of the agent learning method

Recently, many agent learning strategies have been developed, such as Rote Learning, Learning from Instructions and by Advice Taking, Learning from Examples, Inductive Learning, Learning by Analogy, Learning by Observation and Discovery, Case-Based Learning, Explanation-Based Learning, Completing Learning, Cooperating Learning, Game-theoretic Learning, Adaptive Learning (through the adaptive adjustment of all kinds of objects to stabilize a target neural network, which is then combined with a Genetic Algorithm), and Reinforcement Learning.

7.1.2 Characteristics of agent learning

The agent is an intelligent entity that has the capacity for self-learning, as well as the following characteristics:

Self-control: during the learning procedure, the agent does not need external guidance and has the ability to control its own learning actions.

Reactivity: during the learning procedure, the agent will respond to changes in the outside world.

Inferential: based on its current knowledge, that is, the information stored in its repository, the agent carries out continued rational inference and updates the repository accordingly.

Learnability: from unknown to known, the agent learns from everything around, much like a conscious human. After obtaining the learning results, the agent will store them in the repository. Thus, the agent can directly recall knowledge from the repository when faced with a similar situation.

According to these characteristics, agent techniques have been applied in multiple areas, e.g. game development [1, 2], mail routing [3], spoken dialogue [4], and robot soccer [5].

7.1.3 Related work

A typical case of single-agent learning is OpenML, a software development tool published by Intel Corporation in 2004. This tool enables systems built by software engineers to “learn” from the application. The key to this learning process is to use previous data to improve the accuracy and durability of the system. Moreover, this tool can predict the probability of future events based on those that have already happened. OpenML is based on Bayes’ theorem. Its core philosophy is to study the frequency of an event that has taken place and predict the probability of this event happening in the future. An interesting application of OpenML is an audio/video speech recognition system created by Intel researchers. This recognition system employs video cameras to detect a speaker’s facial expressions and mouth movements. This “read lips” method helps to improve the accuracy of speech recognition in noisy environments such as airports or supermarkets [6].

The ADE project at the University of Southern California is an adaptive learning system geared toward education, with problem solving at its core. At Stanford University, an adaptive system called “Fab” provides a web page recommendation service. All of these applications are representative of agent learning technology and are collectively referred to as interface agent techniques.

A typical case of a learning agent has been developed by Chen and Sycara. Their WebMate can automatically extract resources online according to the user’s interests and can satisfy demands for the retrieval of related domain knowledge. Yang and Ning conducted a deep analysis of reinforcement learning. They proposed an ensemble method based on AODE that employs the BDI model and reinforcement learning to guide the learning process of the agent. Moreover, they studied the deviation of motivated learning. Using planning rules, they were able to overcome the shortcomings of reinforcement learning to some extent and improve the learning efficiency. The machine learning research group led by Professor Fanzhang at Soochow University has conducted a lot of theoretical work in the field of agent learning. For example, they have studied an agent self-learning model based on DFL [7] and an automated reasoning platform [8].

The ILS system of GTE [9] integrates heterogeneous agents through a centralized controller. In this way, the agents are able to judge one another. The MALE system enables different agents to cooperate on an interactive board, similar to a blackboard. Guorui proposed an emotional body system that employs the features of the genetic system, nervous system, and endocrine system in its construction. In this system, the researchers proposed a novel action learning method using emotions. The machine learning research group led by Professor Li has also reported great achievements in this area, such as a multi-agent problem-solving system based on DFL and a combinatorial mathematics learning system.

7.2 Agent mental model based on DFL

The agent mental model is the foundation of agent learning, and an efficient and reasonable mental model is needed to construct the learning system [10]. In previous work, the agent system has been studied as a mental system [11–13], and many scholars believe that this is reasonable and useful. Hu and Shi [14] described a representative system called A-BDI with Agent-BDI logic. In this section, we construct the model structure and axiomatic system of the agent mental model based on DFL.

7.2.1 Model structure

Definition 7.1 The agent mental model based on DFL consists of an 11-tuple (B, C, I, A, S, Sty, R, Brf, Acf, Irf, Stf), where B (Belief) represents the understanding of information and basic views of the environmental world within the range of the cognitive ability of the agent model. $B(P_i,(\vec{x}_i,\overleftarrow{x}_i))$ represents the agent's belief $(\vec{x}_i,\overleftarrow{x}_i)$ in event $P_i$ at a given moment. B is the set of current beliefs, and $(\vec{B},\overleftarrow{B})=\{B(P_1,(\vec{x}_1,\overleftarrow{x}_1)),B(P_2,(\vec{x}_2,\overleftarrow{x}_2)),\ldots,B(P_m,(\vec{x}_m,\overleftarrow{x}_m))\}$, assuming the current number of events is m.

From the definition of Belief, we know that an event is actually a pseudo-proposition; therefore, the meaning of $B(P_i,(\vec{x}_i,\overleftarrow{x}_i))$ is the extent to which the agent believes that proposition $P_i$ is true, which we represent by $(\vec{x}_i,\overleftarrow{x}_i)$. The structure and updating algorithm of the belief database are shown in Tab. 7.1 and Algorithm 7.1. In this model structure, the belief database is updated whenever the agent performs operations relevant to it, as follows: first, compare the beliefs arising in the current learning step against the database to check whether information about each belief already exists. If the belief already exists and its value is unchanged, do nothing. If the value of the belief has changed, make the corresponding modification. If the belief does not yet exist, create it and insert it at the appropriate location.

Tab. 7.1: Structure of belief database.

C (Capability) represents the prerequisites for performing a task, i.e. the capacity of the current agent. $C(M_i,(\vec{x}_i,\overleftarrow{x}_i))$ represents the agent's capacity $(\vec{x}_i,\overleftarrow{x}_i)$ to handle problem $M_i$ at a given moment. C is the set of capacities after the learning process, and $(\vec{C},\overleftarrow{C})=\{C(M_1,(\vec{x}_1,\overleftarrow{x}_1)),C(M_2,(\vec{x}_2,\overleftarrow{x}_2)),\ldots,C(M_m,(\vec{x}_m,\overleftarrow{x}_m))\}$, assuming the current number of events is m.

Algorithm 7.1 The update strategy of the belief database

Repeat:
    Query the belief database.
    If the belief already exists:
        If the value of the belief is unchanged, take no action and move to the next belief.
        Else, modify the value of the belief in the database and move to the next belief.
    Else, insert the belief into the database and move to the next belief.
Until all the beliefs have been processed.
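The following is a minimal Python sketch of Algorithm 7.1, assuming beliefs are keyed by proposition name and that a dynamic fuzzy belief value is represented as a simple (degree, trend) pair; the BeliefDB class and its method names are illustrative, not part of the model.

```python
from typing import Dict, Iterable, Tuple

# A dynamic fuzzy value is sketched here as (membership degree, trend),
# where trend is "+" (increasing) or "-" (decreasing). This representation
# is an assumption made for illustration only.
DFValue = Tuple[float, str]


class BeliefDB:
    """Illustrative belief database keyed by proposition name."""

    def __init__(self) -> None:
        self.beliefs: Dict[str, DFValue] = {}

    def update(self, new_beliefs: Iterable[Tuple[str, DFValue]]) -> None:
        """Algorithm 7.1: insert new beliefs, modify changed ones, skip unchanged ones."""
        for proposition, value in new_beliefs:
            current = self.beliefs.get(proposition)
            if current is None:
                # Belief does not exist yet: insert it.
                self.beliefs[proposition] = value
            elif current != value:
                # Belief exists but its value changed: modify it.
                self.beliefs[proposition] = value
            # Otherwise the belief is unchanged: do nothing.


# Usage sketch
db = BeliefDB()
db.update([("P1", (0.7, "+")), ("P2", (0.4, "-"))])
db.update([("P1", (0.7, "+")), ("P2", (0.5, "+")), ("P3", (0.9, "+"))])
print(db.beliefs)
```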

The learning process of the agent gives rise to different problems that need to be solved. Initially, the agent's capacity to solve a given problem may be zero. However, as learning proceeds, this capacity is enhanced. When the agent completes one step of the learning process, one of the following cases holds:

1.For a certain problem, the agent’s capacity to solve it has not changed.

2.For a certain problem, the agent’s capacity to solve it has changed.

3.For a certain problem, the agent’s capacity to solve it does not exist in the database.

The structure of the capacity database is analogous to that of the belief database, and we present the updating strategy in Algorithm 7.2.

Algorithm 7.2 The update strategy of the capacity database

Repeat:
    Query the capacity database.
    If the capacity to solve the problem already exists:
        If the capacity value of the agent is unchanged, take no action and move to the next problem.
        Else, modify the value in the database and move to the next problem.
    Else, insert the capacity to solve the problem into the database and move to the next problem.
Until all the problems have been processed.

Tab. 7.2: Structure of intention database.

I (Intention) represents the prearrangement of one's behaviour for future events and represents the action orientation of the agent toward future events. In the i-th step of the learning process, we assume the number of strategies executable by the agent at the next time is n, i.e. the strategy set is $\{sty_1,sty_2,\ldots,sty_n\}$. $I_i(sty_j,(\vec{x}_{ij},\overleftarrow{x}_{ij}))$ represents the intention $(\vec{x}_{ij},\overleftarrow{x}_{ij})$ toward the executable strategy $sty_j$ in the i-th learning step. $I_i$ is the intention set over the optional strategies of the agent in the i-th learning step, where the number of current optional strategies is n, i.e. $(\vec{I}_i,\overleftarrow{I}_i)=\{I_i(sty_1,(\vec{x}_{i1},\overleftarrow{x}_{i1})),I_i(sty_2,(\vec{x}_{i2},\overleftarrow{x}_{i2})),\ldots,I_i(sty_n,(\vec{x}_{in},\overleftarrow{x}_{in}))\}$. Moreover, the whole intention set can be represented by $(\vec{I},\overleftarrow{I})=\{(\vec{I}_1,\overleftarrow{I}_1),(\vec{I}_2,\overleftarrow{I}_2),\ldots,(\vec{I}_m,\overleftarrow{I}_m)\}$.

According to the definition of intention, if the agent learns according to the specified strategies and no backtracking occurs, the intention database simply waits to receive new intentions. However, if backtracking occurs, the intention database is modified. The structure and updating algorithm of the intention database are shown in Tab. 7.2 and Algorithm 7.3.

Algorithm 7.3 The update strategy of the intention database

Repeat:
    If the current strategy is effective:
        Insert the intention defined by the current strategy into the intention database.
    Else, trace back to the previous step and delete the corresponding intention information.
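Algorithms 7.3–7.6 all follow the same pattern: append the entry defined by an effective strategy, otherwise trace back one step and delete the corresponding entry. Below is a minimal Python sketch of this shared pattern, assuming the database is kept as a per-step list; the SteppedDB class and its method names are illustrative.

```python
from typing import Any, List


class SteppedDB:
    """Illustrative per-step database (intention, action, state, or reward)."""

    def __init__(self) -> None:
        self.steps: List[Any] = []  # entry k corresponds to learning step k + 1

    def update(self, strategy_effective: bool, entry: Any = None) -> None:
        """Shared pattern of Algorithms 7.3-7.6: insert on success, backtrack on failure."""
        if strategy_effective:
            # The current strategy is effective: record the entry it defines.
            self.steps.append(entry)
        elif self.steps:
            # Trace back to the previous step: delete the corresponding entry.
            self.steps.pop()


# Usage sketch: two effective steps followed by one backtracking step.
intentions = SteppedDB()
intentions.update(True, ("sty1", (0.8, "+")))
intentions.update(True, ("sty2", (0.6, "+")))
intentions.update(False)  # backtrack: removes the sty2 entry
print(intentions.steps)
```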

A (Action) stores the information of all actions in every step of the learning process. Suppose the agent moves into the i-th step of the learning process; the current optional action sequence can be represented by $A_i=\{A_{i1},A_{i2},\ldots,A_{im}\}$, where the number of current optional actions is m. When the agent moves into the n-th step of the learning process, the set of optional actions is defined by $A=\{A_1,A_2,\ldots,A_n\}$. The structure and updating algorithm of the action database are shown in Tab. 7.3 and Algorithm 7.4.

Tab. 7.3: Structure of action database.

Algorithm 7.4 The update strategy of the action database

Repeat:
    If the current strategy is effective:
        Insert the action defined by the current strategy into the action database.
    Else, trace back to the previous step and delete the corresponding action information.

S (State) is the set of current states of the agent. $(\vec{S}_i,\overleftarrow{S}_i)$ represents the state in the i-th learning step. $(\vec{S},\overleftarrow{S})$ represents the set of states that have occurred after n learning steps, i.e. $(\vec{S},\overleftarrow{S})=\{(\vec{S}_1,\overleftarrow{S}_1),(\vec{S}_2,\overleftarrow{S}_2),\ldots,(\vec{S}_n,\overleftarrow{S}_n)\}$. The structure and updating algorithm of the state database are shown in Tab. 7.4 and Algorithm 7.5.

Sty (Strategy) records the learning path defined by the learning rules. These learning paths are recorded during the process of learning, and this set is actually a subset of the action set.

Tab. 7.4: Structure of state database.

Algorithm 7.5 The update strategy of the state database

Repeat:
    If the current strategy is effective:
        Insert the state defined by the current strategy into the state database.
    Else, trace back to the previous step and delete the corresponding state information.

Tab. 7.5: Action database of Example 7.1.

Tab. 7.6: Strategy database of Example 7.1.

Example 7.1 After the agent has gone through three steps of the learning process, the action database is as shown in Tab. 7.5.

Based on the target, the capability database, and the intention database of the agent, combined with the rewards and punishments provided by the surrounding environment, we choose the path Left → Down → Right: $A_1 = A_{11} = \text{Left}$, $A_2 = A_{12} = \text{Down}$, $A_3 = A_{23} = \text{Right}$. Therefore, we obtain the strategy database shown in Tab. 7.6.

During the learning process, we may find that the current strategy database is not optimal for further learning. In that case, we trace back to the previous step, delete the outdated strategies, and insert the new ones.

Theorem 7.1 Suppose the current strategy is $Sty_i=\{A_{1i_1},A_{2i_2},\ldots,A_{ni_n}\}$ and that, during the learning process, a better strategy $Sty_j=\{A_{1j_1},A_{2j_2},\ldots,A_{nj_n}\}$ has been found. Deleting $Sty_i$ will not affect the optimal learning process of the agent.

Proof: The choice of strategy during the learning process follows certain rules. As the superiority of the new strategy has been demonstrated, the outdated strategy $Sty_i$ is clearly not the best, and deleting it will not cause an error.
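As a small illustration of Example 7.1 and Theorem 7.1, the sketch below keeps each strategy as the sequence of actions chosen at the successive steps and simply replaces an outdated strategy when a better one is found; the StrategyDB class, the strategy name "Sty1", and the alternative path are all hypothetical.

```python
from typing import Dict, List


class StrategyDB:
    """Illustrative strategy database: each strategy is a sequence of actions."""

    def __init__(self) -> None:
        self.strategies: Dict[str, List[str]] = {}

    def replace(self, name: str, better: List[str]) -> None:
        """Theorem 7.1: deleting the outdated strategy and inserting the better
        one does not affect the optimal learning process."""
        self.strategies[name] = better


db = StrategyDB()
# Example 7.1: the path chosen after three learning steps.
db.strategies["Sty1"] = ["Left", "Down", "Right"]
# A better path (hypothetical) is found during further learning,
# so the outdated strategy is deleted and the new one inserted.
db.replace("Sty1", ["Down", "Down", "Right"])
print(db.strategies)
```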

R (Reward): After the agent acts on the environment, the environment produces a value, called the reward. If the environment returns a positive value when the agent takes an action, the probability that this action is taken again in the future increases, and vice versa. By the definition of the reward value, each action corresponds to a value. Thus, the reward value in the i-th learning step of the agent is $(\vec{R}_i,\overleftarrow{R}_i)=\{(\vec{R}_{i1},\overleftarrow{R}_{i1}),(\vec{R}_{i2},\overleftarrow{R}_{i2}),\ldots,(\vec{R}_{im},\overleftarrow{R}_{im})\}$. When the learning process moves into the n-th step, the set of reward values is defined by $(\vec{R},\overleftarrow{R})=\{(\vec{R}_1,\overleftarrow{R}_1),(\vec{R}_2,\overleftarrow{R}_2),\ldots,(\vec{R}_n,\overleftarrow{R}_n)\}$.

Tab. 7.7: Structure of reward database.

The structure and updating algorithm of the reward database are shown in Tab. 7.7 and Algorithm 7.6.

Algorithm 7.6 The update strategy of the reward database

Repeat:
    If the current strategy is effective:
        Insert the reward value defined by the current strategy into the reward database.
    Else, trace back to the previous step and delete the corresponding reward information.

Brf (Belief Revision Function): As the environment changes, the state of the agent also changes. This kind of change means the beliefs of the agent are constantly revised, and the belief revision function is $Brf:(\vec{b}_i,\overleftarrow{b}_i)\times(\vec{s}_i,\overleftarrow{s}_i)\to(\vec{b}_{i+1},\overleftarrow{b}_{i+1})$.

Acf (Action Choice Function): Each action performed by the agent depends on the agent’s belief, capacity, intention, and its own external environment. Thus, the action choice function is $Acf:(\vec{b}_i,\overleftarrow{b}_i)\times(\vec{c}_i,\overleftarrow{c}_i)\times(\vec{i}_i,\overleftarrow{i}_i)\times(\vec{s}_i,\overleftarrow{s}_i)\to a_i$.

Irf (Intention Revision Function): As the environment changes, the state of the agent also constantly changes. This results in the intentions of the agent being constantly modified, and the intention revision function is $Irf:(\vec{i}_i,\overleftarrow{i}_i)\times(\vec{s}_i,\overleftarrow{s}_i)\to(\vec{i}_{i+1},\overleftarrow{i}_{i+1})$.

Stf (State Transition Function): This function maps the Cartesian product of the action set and the state set to the state set. Thus, the state transition function is $Stf:a_i\times(\vec{s}_i,\overleftarrow{s}_i)\to(\vec{s}_{i+1},\overleftarrow{s}_{i+1})$.
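To make the four mappings concrete, the following is a minimal Python skeleton of the model's interfaces; the type aliases, the (value, trend) representation of dynamic fuzzy quantities, and the class name are assumptions for illustration, and the method bodies are placeholders rather than the model's actual update rules.

```python
from typing import Tuple

# Dynamic fuzzy quantities are sketched as (value, trend) pairs; this
# representation is assumed for illustration only.
DF = Tuple[float, str]
Belief = DF
Capacity = DF
Intention = DF
State = DF
Action = str


class AgentMentalModel:
    """Skeleton of the four mappings of the DFL agent mental model."""

    def brf(self, b: Belief, s: State) -> Belief:
        """Belief revision function: Brf: b_i x s_i -> b_{i+1}."""
        raise NotImplementedError

    def acf(self, b: Belief, c: Capacity, i: Intention, s: State) -> Action:
        """Action choice function: Acf: b_i x c_i x i_i x s_i -> a_i."""
        raise NotImplementedError

    def irf(self, i: Intention, s: State) -> Intention:
        """Intention revision function: Irf: i_i x s_i -> i_{i+1}."""
        raise NotImplementedError

    def stf(self, a: Action, s: State) -> State:
        """State transition function: Stf: a_i x s_i -> s_{i+1}."""
        raise NotImplementedError
```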

7.2.2 Related axioms

There are relationships among the mental states of an agent, which we call the mutual generation and restriction principle. Mutual generation refers to the ability of mental states to promote one another, and mutual restriction refers to their ability to restrict one another.

a) Belief:

Axiom 7.1 $B(P_i,\overline{(\vec{x}_i,\overleftarrow{x}_i)})=\overline{B(P_i,(\vec{x}_i,\overleftarrow{x}_i))}$

Axiom 7.2 $B(P_i,(\vec{x}_{i1},\overleftarrow{x}_{i1}))\wedge B(P_i,(\vec{x}_{i2},\overleftarrow{x}_{i2}))=B(P_i,(\vec{x}_{i1},\overleftarrow{x}_{i1})\wedge(\vec{x}_{i2},\overleftarrow{x}_{i2}))$

Axiom 7.3 $B(P_i,(\vec{x}_{i1},\overleftarrow{x}_{i1}))\vee B(P_i,(\vec{x}_{i2},\overleftarrow{x}_{i2}))=B(P_i,(\vec{x}_{i1},\overleftarrow{x}_{i1})\vee(\vec{x}_{i2},\overleftarrow{x}_{i2}))$

b) Capacity:

Axiom 7.4 $C(M_i,\overline{(\vec{x}_i,\overleftarrow{x}_i)})=\overline{C(M_i,(\vec{x}_i,\overleftarrow{x}_i))}$

Axiom 7.5 $C(M_i,(\vec{x}_{i1},\overleftarrow{x}_{i1}))\wedge C(M_i,(\vec{x}_{i2},\overleftarrow{x}_{i2}))=C(M_i,(\vec{x}_{i1},\overleftarrow{x}_{i1})\wedge(\vec{x}_{i2},\overleftarrow{x}_{i2}))$

Axiom 7.6 $C(M_i,(\vec{x}_{i1},\overleftarrow{x}_{i1}))\vee C(M_i,(\vec{x}_{i2},\overleftarrow{x}_{i2}))=C(M_i,(\vec{x}_{i1},\overleftarrow{x}_{i1})\vee(\vec{x}_{i2},\overleftarrow{x}_{i2}))$

c) Intention:

Axiom 7.7 $I_i(sty_j,\overline{(\vec{x}_{ij},\overleftarrow{x}_{ij})})=\overline{I_i(sty_j,(\vec{x}_{ij},\overleftarrow{x}_{ij}))}$

Axiom 7.8 $I_i(sty_j,(\vec{x}_{ij1},\overleftarrow{x}_{ij1}))\wedge I_i(sty_j,(\vec{x}_{ij2},\overleftarrow{x}_{ij2}))=I_i(sty_j,(\vec{x}_{ij1},\overleftarrow{x}_{ij1})\wedge(\vec{x}_{ij2},\overleftarrow{x}_{ij2}))$

Axiom 7.9 $I_i(sty_j,(\vec{x}_{ij1},\overleftarrow{x}_{ij1}))\vee I_i(sty_j,(\vec{x}_{ij2},\overleftarrow{x}_{ij2}))=I_i(sty_j,(\vec{x}_{ij1},\overleftarrow{x}_{ij1})\vee(\vec{x}_{ij2},\overleftarrow{x}_{ij2}))$
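For concreteness, Axiom 7.2, for example, says that holding two beliefs about the same proposition $P_i$ is equivalent to holding a single belief whose dynamic fuzzy degree is the conjunction of the two degrees. Under the common fuzzy-logic assumption that $\wedge$ acts as the componentwise minimum on dynamic fuzzy values (the exact operators are those fixed in the DFL preliminaries of earlier chapters, so the numbers below are purely illustrative), a numerical instance reads:

```latex
% Illustrative instance of Axiom 7.2, assuming \wedge is the componentwise
% minimum on dynamic fuzzy membership values (assumption for illustration).
\[
B\bigl(P_i,(\vec{0.6},\overleftarrow{0.6})\bigr)\wedge
B\bigl(P_i,(\vec{0.8},\overleftarrow{0.8})\bigr)
 = B\bigl(P_i,(\vec{0.6},\overleftarrow{0.6})\wedge(\vec{0.8},\overleftarrow{0.8})\bigr)
 = B\bigl(P_i,(\vec{0.6},\overleftarrow{0.6})\bigr).
\]
```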

The mental states of the agent are not independent but are correlated and mutually restrictive. From the two mapping relations $Acf:(\vec{b}_i,\overleftarrow{b}_i)\times(\vec{c}_i,\overleftarrow{c}_i)\times(\vec{i}_i,\overleftarrow{i}_i)\times(\vec{s}_i,\overleftarrow{s}_i)\to a_i$ and $Stf:a_i\times(\vec{s}_i,\overleftarrow{s}_i)\to(\vec{s}_{i+1},\overleftarrow{s}_{i+1})$, we can see that there are relationships between action $a_i$ and state $(\vec{s}_i,\overleftarrow{s}_i)$. If the action moves the state toward the target state, the relationship between them is one of facilitation; otherwise, the relationship may be restrictive. Obviously, in a certain state $(\vec{s}_i,\overleftarrow{s}_i)$, the action of the agent is determined by belief $(\vec{b}_i,\overleftarrow{b}_i)$, capacity $(\vec{c}_i,\overleftarrow{c}_i)$, and intention $(\vec{i}_i,\overleftarrow{i}_i)$. Consequently, there are certain relationships between each kind of attribute.

7.2.3 Working mechanism

Based on the above model, we now introduce the main working mechanism. Under DFL, the agent can be divided into a responding layer, a learning layer, and a thinking layer. The responding layer mainly realizes the exchange of information between the agent and the external environment. The learning layer handles the learning process of the agent. The thinking layer includes two parts: one designs the overall learning process of the agent, and the other considers collaboration with other agents. The concrete working mechanism is shown in Fig. 7.1.

Fig. 7.1: The structure of the learning process of the agent.

In a constantly changing environment, the agent is an entity with the following basic working mechanism. The agent perceives changes in the environment and transmits this information to the perception layer.

The perception layer conducts new perceptual behaviour based on belief and the result from the previous behaviour layer and transmits the new perceptual result to the current behaviour layer. Then, it updates the belief database.

The action taken by the behaviour layer is dependent on the current capacity, state, and intention of the agent and the result of the learning layer. When the behaviour layer takes a new action, it transmits the result of the new action to the learning layer and the perception layer. Then, it updates the corresponding capacity database, state database, and intention database.

The learning layer conducts new learning steps based on the new action transmitted from the behaviour layer and the new management plans transmitted from the management layer.

The management layer makes plans for the next learning step based on the result from the learning layer, cooperation information from the cooperation layer, and the strategies in the strategy database. If the management layer has made a new plan, it will transmit the new plan to the learning layer and cooperation layer at the same time and update the corresponding strategy database.

The cooperation layer decides how to collaborate with other agents based on the plan from the management layer and the current intention from the intention database and updates the intention database.
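As a schematic illustration of this message flow, the following Python sketch wires the layers together in one working step; the layer objects and their method names (perceive, act, learn, plan, coordinate) are illustrative placeholders, not part of the model's specification.

```python
class AgentWorkingMechanism:
    """Schematic message flow between the layers described above."""

    def __init__(self, perception, behaviour, learning, management, cooperation):
        # Each argument is a layer object providing the methods used below;
        # their concrete implementations are outside the scope of this sketch.
        self.perception = perception
        self.behaviour = behaviour
        self.learning = learning
        self.management = management
        self.cooperation = cooperation

    def step(self, env_change):
        # 1. The perception layer perceives the environment change, uses the
        #    previous behaviour result, and updates the belief database.
        percept = self.perception.perceive(env_change)

        # 2. The behaviour layer chooses an action from capacity, state,
        #    intention, and the last learning result, updating those databases.
        action = self.behaviour.act(percept)

        # 3. The learning layer learns from the new action and the current plan.
        learning_result = self.learning.learn(action)

        # 4. The management layer makes the next plan from the learning result,
        #    cooperation information, and the strategy database.
        plan = self.management.plan(learning_result)

        # 5. The cooperation layer decides how to collaborate with other agents
        #    and updates the intention database.
        self.cooperation.coordinate(plan)
        return plan
```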

7.3 Single-agent learning algorithm based on DFL

This section discusses the single-agent learning algorithm based on the agent mental model described in the previous section.

7.3.1 Learning task

The learning task of the agent is defined as follows: the agent learns a strategy π : S → A that, for the current state, selects the next action to take, guided by the reward value r.

Definition 7.2 Starting from an arbitrary state $s_i$ and following a strategy π, the cumulative value $V^{\pi}(s_i)$ is obtained as

$$V^{\pi}(s_i)=r_i+\gamma r_{i+1}+\gamma^{2} r_{i+2}+\cdots=\sum_{k=0}^{\infty}\gamma^{k} r_{i+k},\qquad(7.1)$$

where γ is a constant in the range [0, 1]. If γ = 0, only the immediate return of the agent is considered; as γ approaches 1, future returns are weighted more heavily relative to the immediate return. This section starts by considering the simplest case of an immediate return. Thus, we first present the immediate-return single-agent learning algorithm.
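A minimal Python sketch of eq. (7.1), assuming the reward sequence observed from step i onward is finite so that the infinite sum is truncated at the available horizon:

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute V^pi(s_i) = sum_k gamma^k * r_{i+k} over an observed reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# gamma = 0 keeps only the immediate reward; values near 1 weight future rewards heavily.
print(discounted_return([1.0, 0.5, 0.25], gamma=0.0))  # 1.0
print(discounted_return([1.0, 0.5, 0.25], gamma=0.9))  # 1.0 + 0.45 + 0.2025 = 1.6525
```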

7.3.2 Immediate return single-agent learning algorithm based on DFL

Definition 7.3 $(\vec{S},\overleftarrow{S})$ is the set of states $(\vec{S}_i,\overleftarrow{S}_i)$ of the agent, A is the action set of the agent with elements $A_i$, and $(\vec{R},\overleftarrow{R})$ is the reward set of the agent with elements $(\vec{R}_i,\overleftarrow{R}_i)$. At a particular time t, if the agent is in a certain state $(\vec{S}_t,\overleftarrow{S}_t)$ and the actions that can be taken by the agent are represented by $a_t=\{a_{t1},a_{t2},\ldots,a_{tm}\},\ m\geq 1$, the learning process of the agent is defined as follows:
