5 Semi-supervised multi-task learning based on dynamic fuzzy sets

This chapter is divided into six sections. Section 5.1 gives a brief introduction to this chapter. We present the semi-supervised multi-task learning model in Section 5.2 and the semi-supervised multi-task learning model based on DFS in Section 5.3. In Section 5.4, we introduce the dynamic fuzzy semi-supervised multi-task matching algorithm. The DFSSMTAL algorithm is described in Section 5.5. The last section summarizes the chapter.

5.1 Introduction

5.1.1 Review of semi-supervised multi-task learning

It is well known that supervised learning is an effective technique for learning classifiers when there are sufficient labelled data. However, obtaining labels for data is often expensive, so it can be difficult to collect sufficient labelled data. Moreover, even if the classification error rate of a supervised classifier trained from a small amount of labelled data is zero, the classifier may still lack good generalization performance. Ensuring the effective use of both labelled and unlabelled data for learning has therefore become a major concern.

1. Overview of semi-supervised learning

In many machine learning problems, we need labelled data to train a model. However, the labelled data are sometimes limited, because manual labelling is often tedious or the cost of acquiring labels is high. As a result, the trained models are unsatisfactory. In recent years, numerous scholars have attempted to improve the generalization performance of classifiers using information sources other than the labelled data. These studies can be divided into two categories according to the information source: semi-supervised learning [1–5] and multi-task learning [6–10]. The additional information for the former comes from the data samples themselves, since plenty of sample information is contained in the unlabelled data; the latter uses related tasks to obtain additional information. We can regard semi-supervised learning as an extension of traditional supervised learning; that is, semi-supervised learning is mainly concerned with obtaining good generalization ability and performance when part of the label information is missing from the training data. Semi-supervised learning has aroused the interest of scholars because it builds better classifiers from large amounts of unlabelled data, reduces manual labelling effort and material cost, and improves the performance of machine learning. Current semi-supervised learning methods include generative models, self-training, co-training, multi-view learning, low-density separation (such as the transductive SVM), and graph-based approaches.

These methods have different advantages and disadvantages. For example, generative models have the longest history, but the mixture model must be identifiable and correct; otherwise, learning will not proceed smoothly and the performance of the classifier will be damaged to a certain extent. Self-training, another familiar semi-supervised learning algorithm, is the simplest wrapper algorithm but suffers from the drawbacks of the wrapper approach: its performance relies heavily on the supervised method it wraps, and errors made early in the training stage are reinforced in later stages, leading to a decrease in performance. Overall, good learning performance in semi-supervised learning requires model assumptions that are consistent with the problem structure, and compared with general supervised learning, this approach often requires more effort to design features or similarity functions.
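As a concrete illustration of the self-training wrapper just described, the following is a minimal sketch, assuming scikit-learn is available; the logistic base classifier, confidence threshold, and round limit are illustrative choices rather than a prescribed algorithm.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Iteratively add confidently pseudo-labelled points to the labelled set (numpy arrays in)."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)                       # train the wrapped supervised method
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold  # keep only high-confidence predictions
        if not confident.any():
            break                                   # nothing confident enough: stop early
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]               # note: early mistakes propagate, as noted above
    return clf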

2. Overview of multi-task learning

At present, most machine learning techniques are designed for a single problem or task. However, many problems are highly interrelated. For example, a tennis player can easily learn other sports such as badminton and table tennis; the future performance of students can be predicted according to their existing results; robots can be trained to identify the different characteristics of the face; and so on. In the field of machine learning, this is known as multi-task learning. Multi-task learning considers related tasks and shares information from relevant tasks to improve the learning effect of target tasks.

This way of learning is closely related to human learning. When people learn new knowledge, they consciously or unconsciously draw on similar past learning experiences. Psychologists call this learning transfer. It is possible to improve the performance of a learning system by using valuable information from related tasks, which is the main reason that multi-task learning has become an important branch of machine learning.

Over the past decade, many multi-task learning algorithms have been proposed. For example, the multi-task feature learning in [11] uses regularization to learn the same representation for all tasks. The regularized multi-task SVM method in [12] requires the SVM parameters of all tasks to be similar and successfully introduces the traditional SVM method into the multi-task learning domain. The task clustering methods in [13, 14] divide all tasks into different clusters and then learn similar or common representations for all tasks in the same cluster. The multi-task learning methods in [15–18] are based on Gaussian processes, with the Gaussian process serving as the basic model for multi-task learning.

Most existing multi-task learning methods assume that tasks are interrelated, so that a subset of tasks has similar or identical data characteristics and model parameters. For example, both the neural network-based approaches and multi-task feature learning assume that all tasks share the same data characteristics, while the regularized multi-task SVM [12] and the method in [14] assume that all tasks, or the tasks in the same cluster, have similar model parameters. The sketch below illustrates the task-clustering view of this assumption.
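The following numpy sketch illustrates the task-clustering assumption of [13, 14] in its simplest possible form: each task is fitted independently by least squares, and tasks whose parameter vectors are similar are grouped by k-means. The least-squares fit and plain k-means are illustrative assumptions, not the actual algorithms of [13, 14].

import numpy as np

def cluster_tasks(tasks, n_clusters=2, n_iter=50, seed=0):
    """tasks: list of (X_m, y_m); returns per-task least-squares weights and a cluster id per task."""
    W = np.array([np.linalg.lstsq(X, y, rcond=None)[0] for X, y in tasks])
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), n_clusters, replace=False)]
    for _ in range(n_iter):                         # plain k-means on the task parameter vectors
        labels = np.argmin(((W[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.array([W[labels == k].mean(axis=0) if (labels == k).any() else centers[k]
                            for k in range(n_clusters)])
    return W, labels                                # tasks in the same cluster have similar parameters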

3. Overview of semi-supervised multi-task learning

The common goal of semi-supervised learning and multi-task learning is to use additional information to improve the learning performance of supervised learning tasks. This extra information may come from unlabelled data or other related tasks. Therefore, it is very reasonable to integrate both types of learning to constitute a new learning framework.

In fact, in recent years, some scholars have conducted studies in this regard. Although the method proposed by Ando and Zhang [19] is mainly intended to improve the performance of semi-supervised learning, it laid the foundations for the study of the relationship between semi-supervised learning and multi-task learning. Starting from a single target task and unlabelled data, they use the unlabelled data to construct additional auxiliary tasks to assist in learning the target task. Liu et al. proposed a semi-supervised multi-task learning framework [20–22], which can be regarded as the first successful integration of semi-supervised learning and multi-task learning into one learning framework. The semi-supervised learning method used in this new framework is parametric neighbourhood classification; the framework is composed of M semi-supervised classifiers and a joint prior distribution over all classifier parameters.

Each classification task has a corresponding classifier. In this integrated framework, M tasks are classified at the same time.

The semi-supervised multi-task learning framework proposed by Liu et al. has led to a boom in semi-supervised multi-task learning research in the field of machine learning. Current semi-supervised multi-task research falls into two main categories: parameter estimation methods and cost minimization methods.

1) Parameter estimation method

The parameter estimation method is described in [20–24]. Liu et al. constructed a parametric classifier (called the basic classifier) for each classification task in [20, 21]. They use the joint distribution of these parameters and maximum a posteriori estimation to estimate the corresponding classifier parameters and thus complete the multi-task learning. Because this classifier is linear, [22] extends the basic classifier (mainly using the kernel method) to make it suitable for nonlinear classification. Applying the Dirichlet process (DP) in the context of multi-task learning is complex, so a simple form of DP is selected whose solution is obtained using the EM algorithm. This multi-task framework effectively integrates semi-supervised learning. However, this method selects the labelled data randomly, without validation. Building on this idea, active learning is applied to the semi-supervised multi-task learning framework in [23], whereby active learning is employed to select the most informative data to label.

The general idea of the semi-supervised multi-task learning algorithm based on the parameter estimation method can be summarized as Algorithm 5.1.

Algorithm 5.1 Semi-supervised multi-task learning based on the parameter estimation method

Input: training sets of the M tasks (including labelled data and unlabelled data);

Output: the parameters of the parametric neighbourhood basic classifier for each task; denote the parameters of the m-th basic classifier by $\theta_m$.

(1) With $\theta$ as the parameter of a task, denote the probability that the class of sample point $x_i$ is $y_i$ by

$p^*(y_i \mid x_i, \theta)$.

(2) Denote the t-step neighbourhood of the sample point $x_i$ by $N_t(x_i)$; then

$p(y_i \mid N_t(x_i), \theta) = \sum_{j=1}^{n} b_{ij}^{(t)}\, p^*(y_i \mid x_j, \theta),$

where $b_{ij}^{(t)}$ is the probability that $x_j$ lies in the t-step neighbourhood of $x_i$.

(3) The joint likelihood function of the M tasks is

$p\big(\{y_i^m, i \in L_m\}_{m=1}^{M} \mid \{N_t(x_i^m), i \in L_m\}_{m=1}^{M}, \{\theta_m\}_{m=1}^{M}\big) = \prod_{m=1}^{M} p\big(\{y_i^m, i \in L_m\} \mid \{N_t(x_i^m), i \in L_m\}, \theta_m\big) = \prod_{m=1}^{M} \prod_{i \in L_m} \sum_{j=1}^{n_m} b_{ij}^{(t_m)}\, p^*(y_i^m \mid x_j^m, \theta_m),$

where $L_m$ is the index set of the labelled data in the m-th task.

(4) Use maximum a posteriori estimation to solve for the M parameter vectors $\theta_1, \dots, \theta_M$.
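The following numpy sketch illustrates steps (1)–(3) of Algorithm 5.1 for a single task, assuming a logistic basic classifier $p^*(y \mid x, \theta)$, binary labels, a given neighbourhood matrix $B = [b_{ij}]$, and a flat prior, so that the MAP step (4) reduces to maximum likelihood by gradient ascent. The joint prior coupling the M tasks in [20–22] is omitted, so this is only an illustration of the single-task likelihood, not the full method.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_neighbourhood_classifier(X, y, labelled_idx, B, lr=0.1, epochs=300):
    """X: (n, d) data; y: labels in {0, 1} (only entries in labelled_idx are used);
    B: (n, n) row-stochastic matrix with B[i, j] = prob. that x_j lies in the t-step
    neighbourhood of x_i. Maximizes prod_{i in L} sum_j B[i, j] * p*(y_i | x_j, theta)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        p1 = sigmoid(X @ theta)                        # p*(y = 1 | x_j, theta) for every j
        grad = np.zeros(d)
        for i in labelled_idx:
            p_star = p1 if y[i] == 1 else 1.0 - p1     # p*(y_i | x_j, theta)
            mix = B[i] @ p_star                        # p(y_i | N_t(x_i), theta)
            dp = p1 * (1 - p1) * (1.0 if y[i] == 1 else -1.0)
            grad += (B[i] * dp) @ X / max(mix, 1e-12)  # gradient of log p(y_i | N_t(x_i), theta)
        theta += lr * grad / len(labelled_idx)
    return theta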

The classification problems considered in [20–23] do not involve regression. In fact, many applications can be posed as semi-supervised multi-task regression problems. For instance, in posture prediction, each task corresponds to posture prediction for one individual, and each person has a large number of pictures whose posture information is unknown. Accordingly, semi-supervised regression and multi-task regression are integrated in [24] to form a semi-supervised multi-task regression framework. Following the parameter estimation approach, a supervised multi-task regression framework is first constructed based on Gaussian processes, with the kernel parameters of all tasks assumed to share the same Gaussian prior. The Gaussian prior kernel function is then modified into a data-dependent kernel function; combined with the use of unlabelled data, this ultimately forms the semi-supervised multi-task regression framework.

2) Cost minimization method

Loeff et al. [25] and Wang et al. [26] described typical applications of cost minimization. In [25], multi-task learning and semi-supervised learning are integrated into an extended semi-supervised multi-task learning model through the addition of a regularization term. On one hand, the new regularization term encourages the sharing of characteristics between the classifiers, thus improving generalization performance; on the other hand, it helps propagate neighbourhood label information and makes effective use of unlabelled samples for learning. The new regularization term leaves the properties of the original model unchanged: the cost function is still convex. Therefore, the global optimum can be obtained using the gradient descent method.

In general, multi-task learning studies assume that tasks are interrelated. In [26], the authors propose a different approach: they assume that tasks form different groups and that tasks in each group should have similar classification rules. In fact, [13] and [27] describe multi-task research under this assumption, but those are supervised rather than semi-supervised approaches. In [26], a linear classifier is constructed for each task according to the data manifold of the underlying task, and all the classification vectors are then coupled through a K-means-style inter-cluster regularization term, forming the semi-supervised multi-task learning system summarized in Algorithm 5.2.

Algorithm 5.2 Semi-supervised multi-task learning system

Input: training sets of the M tasks (including labelled data and unlabelled data);

Output: the weight matrix W formed by the classifier weights of the M tasks. Assume the classifier of the t-th task is $f_t(x)$ with weight vector $w_t$; the algorithm mainly needs to determine W.

(1) Write down the cost function of the problem. In general, the cost function consists of two parts: the general cost and the regularization term. The general cost can be a common loss function, while the regularization term is designed to accomplish the semi-supervised multi-task learning. Thus, the cost function is

$C(W) = L(W) + R(W),$ where $L(W)$ is the general cost function and $R(W)$ is the regularization term.

(2) The whole semi-supervised multi-task problem then requires solving

$\arg\min_{W} C(W).$

Because the cost function is convex, this optimization problem can be solved by conventional methods such as gradient descent.
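As an illustration of Algorithm 5.2, the sketch below minimizes a cost of the form C(W) = L(W) + R(W) by gradient descent, assuming a squared loss on the labelled points, a graph-Laplacian smoothness term over all points as the semi-supervised part of R(W), and a mean-coupling term across tasks as the multi-task part. These particular choices of L and R are assumptions made for illustration, not the terms used in [25] or [26].

import numpy as np

def fit_cost_min_mtl(tasks, laplacians, gamma=1.0, mu=1.0, lr=0.01, epochs=500):
    """tasks: list of (X_m, y_m, labelled_idx_m); laplacians: list of graph Laplacians built
    over all (labelled + unlabelled) points of each task.
    C(W) = sum_m ||X_m[l] w_m - y_m||^2                 (general cost, labelled points only)
         + gamma * sum_m (X_m w_m)^T L_m (X_m w_m)      (semi-supervised smoothness term)
         + mu * sum_m ||w_m - w_bar||^2                 (couples the M task classifiers)."""
    d = tasks[0][0].shape[1]
    W = np.zeros((len(tasks), d))
    for _ in range(epochs):
        w_bar = W.mean(axis=0)
        for m, (X, y, lab) in enumerate(tasks):
            f = X @ W[m]
            grad = 2 * X[lab].T @ (f[lab] - y)              # gradient of the labelled loss
            grad += 2 * gamma * X.T @ (laplacians[m] @ f)   # gradient of the smoothness term
            grad += 2 * mu * (W[m] - w_bar)                 # gradient of the task-coupling term
            W[m] -= lr * grad / len(X)
    return W                                                # convexity ensures convergence for small lr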

3) Other semi-supervised multi-task learning research

In addition to the above two categories, the study of semi-supervised multi-task learning has also made new progress in both theory and application in recent years. On the application side, Chen et al. [28] went beyond the usual supervised or unsupervised learning: they integrated a multi-task learning method – the alternating structure optimization algorithm – into a semi-supervised model and successfully applied semi-supervised multi-task learning to identifying semantic relations between phrases. On the theoretical side, Michelangelo et al. [29] studied semi-supervised kernel machines built on a multi-task learning pattern: each task is associated with an individual predicate in the feature space, and deeper abstractions are composed of first-order logic clauses over these predicates, so that the semi-supervised multi-task learning framework successfully integrates first-order logic clause theory with kernel machines.

5.1.2 Problem statement

In engineering applications, we often need to model a system, and we require a reliable model to describe the correct relationship between the input and output of the application system, for purposes such as performance monitoring, error detection, optimization, and multi-group classification. Time series prediction is an extremely important and practical problem related to economics, business decision making, signal processing, and control. Prediction is a dynamic fuzzy process, and the object to be predicted can be regarded as a dynamic fuzzy object. Dynamic fuzzy data (DFD) are an important feature of this type of problem.

In contrast, the current accepted concept of machine learning is still Simon’s explanation of learning: “If a system can improve its performance by performing a certain process, that is learning”. This statement implies that learning is a process, that learning occurs in terms of a system, and that learning can change the performance of the system. “Process”, “system”, and “change performance” are the three main points of learning. If we analyse this carefully, we can see that these three points have dynamic ambiguity: the nature of the learning process has a dynamic fuzzy character; system changes may be good or bad, and so are essentially dynamic and fuzzy; and changes in system performance, results, etc., have dynamic fuzzy characteristics.

Therefore, it can be seen that dynamic fuzzy information is common in practical applications and theoretical research. To establish a machine learning method that offers substantial progress and can meet the main points of Simon’s argument, the key problem is to solve the dynamic ambiguity problem arising from learning activities effectively.

Although adaptive models can predict the performance of a dynamic system, they cannot handle dynamic fuzzy problems. Rough set theory is based on fuzzy sets and can effectively solve fuzzy problems, but it cannot handle dynamic fuzzy problems. Statistical learning theory is based on small-sample statistics, which solves static problems but is insufficient for dynamic problems. Reinforcement learning, based on Markov process theory, can solve dynamic problems but is unsuitable for fuzzy problems. Therefore, there is an urgent need for a theory that effectively solves dynamic fuzzy problems in the field of machine learning.

To solve dynamic fuzzy problems in engineering applications and machine learning activities, dynamic machine learning models based on dynamic fuzzy sets and related algorithms have been developed [30, 31]. Further work on this basis has proposed a dynamic fuzzy machine learning method, a dynamic fuzzy machine learning model, and related algorithms from the perspectives of algebra and geometry [32–34]. In [35], dynamic fuzzy set theory is applied to the semi-supervised multi-task learning domain. A dynamic fuzzy semi-supervised multi-task learning (DFSSMTL) algorithm provides a new theoretical method for solving dynamic fuzzy problems. The main work of this chapter is to apply the theory of dynamic fuzzy sets to the semi-supervised multi-task learning domain. We will elaborate on the following aspects:

(1) For dynamic fuzzy problems, how can we design a reasonable and effective semi-supervised multi-task learning model using dynamic fuzzy set theory?

(2) Based on the theory of dynamic fuzzy sets, how can we integrate the existing semi-supervised approaches and multi-task methods more effectively to form a new semi-supervised multi-task learning framework?

5.2 Semi-supervised multi-task learning model

5.2.1 Semi-supervised learning

Traditional machine learning techniques typically use only labelled sample sets (supervised learning) or only unlabelled sample sets (unsupervised learning) for learning. In most practical problems, however, labelled samples coexist with unlabelled samples. To make better use of these data, semi-supervised learning was developed. In recent years, semi-supervised learning technology has become a hot topic of research in the machine learning field.

1. Concept of semi-supervised learning

Traditional machine learning techniques can be divided into two categories: unsupervised learning and supervised learning.

General unsupervised learning data are as follows:

Suppose the data sample set X contains n sample points, X = {x1, x2, ..., xn}, where $x_i \in \mathcal{X}$ for every $i \in \{1, \dots, n\}$. It is generally assumed that the data are independent and identically distributed. The purpose of unsupervised learning is to estimate the density distribution of X according to these sample points. Clustering and dimensionality reduction are typical representative techniques.

Supervised learning differs from unsupervised learning in that the data contains not only sample points but also category labels that can be used to describe a sample point (x, y), where y is the class label of sample point x. The purpose of supervised learning is to learn a mapping from sample points to category labels from these data samples. When the category label is a real number, the supervised learning technique is called regression; otherwise, it is classification. SVM is a typical supervised learning technology.

Semi-supervised learning is a learning method for studying how computers and natural systems (e.g. people) learn from both unlabelled and labelled data. It is a learning framework between supervised learning and unsupervised learning. Its dataset X = {x1, x2, ..., xn} (n = l + u) is divided into two blocks: one is the labelled data Xl = {(x1, y1), ..., (xl, yl)}, where yi is the class label of sample point xi; the other consists of the unlabelled data Xu = {xl+1, ..., xl+u}, whose class labels are unknown. There are generally far more unlabelled data than labelled data. Semi-supervised learning is widely studied, partly because of its practical value in constructing computer algorithms and partly because of its theoretical value in helping people understand how machines and humans learn.
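As a concrete illustration of this data layout (assuming numeric feature vectors and integer class labels, with synthetic data standing in for a real problem), the following sketch builds such a semi-supervised dataset with l labelled and u unlabelled points:

import numpy as np

rng = np.random.default_rng(0)
l, u, d = 20, 200, 5                     # few labelled points, many unlabelled points (u >> l)
X = rng.normal(size=(l + u, d))          # all sample points x_1, ..., x_n with n = l + u
y_l = rng.integers(0, 2, size=l)         # class labels are known only for the first l points

X_l, X_u = X[:l], X[l:]                  # labelled block X_l and unlabelled block X_u
labelled = list(zip(X_l, y_l))           # pairs (x_i, y_i), i = 1, ..., l
print(len(labelled), X_u.shape)          # -> 20 (200, 5)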

2. Unlabelled data in semi-supervised learning

We can indeed learn useful information from unlabelled data. This may sound magical, but it is no miracle, provided the model assumptions are a good match for the structure of the problem. Although semi-supervised learning spares people much of the effort of labelling the training data, it is important that good and reasonable models, features, kernels, and similarity functions are provided. This effort is particularly critical in compensating for the lack of labelled data compared to supervised learning.

However, there is no free lunch: unlabelled data are not always useful. Elworthy observed that training HMMs with unlabelled data reduces the model accuracy under certain initial conditions [38]; Cozman analysed the semi-supervised learning performance of a mixture model and found that unlabelled data sometimes increase the classification error – this degradation phenomenon was analysed from a mathematical point of view [39]. The reason such unlabelled data are unhelpful may lie in the fact that the model assumptions are not well matched with the problem structure. For example, many semi-supervised learning methods assume that the decision boundary should be far away from high-density regions, such as transductive SVMs (TSVMs), information regularization, noiseless Gaussian processes, and graph-based approaches (where the weights of the graph are determined by the distances between data points). If the data come from two heavily overlapping Gaussian distributions, the decision boundary will pass through the densest region, and learning by the above methods will give poor results. Other semi-supervised learning algorithms, such as the EM-trained mixture model, can easily handle such problems, but detecting a poor match in advance is difficult, and this area still contains many open problems.

Semi-supervised learning methods use unlabelled data primarily to modify or re-rank the hypotheses obtained from the labelled data. So, how does semi-supervised learning use unlabelled data? To facilitate the description, we use a probabilistic formulation. In a generative model, p(y|x) and the marginal p(x) estimated from the unlabelled data share parameters through the joint distribution p(x, y). Obviously, p(x) can then affect p(y|x); the EM mixture model belongs to this category, and to some extent so does self-training. Many other methods are discriminative, such as TSVMs, information regularization, noiseless Gaussian processes, and graph-based approaches. The original discriminative training methods cannot be used for semi-supervised learning because they ignore p(x) when estimating p(y|x). To this end, a dependency on p(x) is often introduced into the objective function, and these objective functions often assume that p(y|x) and p(x) share parameters.

3. Basic assumption in semi-supervised learning

Most current machine learning techniques are based on the independent and identical distribution hypothesis; that is, the data samples must be independently sampled from the same distribution. Obviously, it is not possible to generalize from a finite training set to an unknown test set without making certain assumptions.

The following are the most common assumptions in semi-supervised learning.

1) Semi-supervised smoothness assumption

The semi-supervised smoothness assumption states that the label function is smoother in high-density regions than in low-density regions. That is, if two sample points in a high-density region are very similar, their corresponding labels are also very similar. This means that if two sample points are connected by a high-density path (for example, the two points belong to the same cluster), then their output labels are likely to be similar. Conversely, if they are separated by a low-density region, their outputs need not be close.

2) Cluster assumption

The cluster assumption states that sample points in the same cluster are likely to have the same class label. However, this does not imply that each class forms a single, compact cluster; it simply means that no two samples from different classes appear in the same cluster. In fact, if a cluster is regarded as a set of sample points that are connected by short curves passing through high-density regions, it can be said that the cluster assumption is a special case of the semi-supervised smoothness assumption.

An equivalent argument to the cluster assumption is known as low-density separation, which argues that the decision boundary should be located in a sparsely populated data area. This is because, if the decision boundary crosses an area with more data points, it is possible to classify the sample points in a cluster into different categories.

3) Manifold assumption

A different but related assumption is the manifold assumption, which forms the basis of many semi-supervised learning algorithms. It states that the (high-dimensional) data lie approximately on a low-dimensional manifold.

The manifold assumption effectively avoids the curse of dimensionality suffered by many statistical methods and learning algorithms. As the dimension increases, the volume of the space grows exponentially, and the number of samples required for statistical problems such as density estimation increases exponentially. A related problem in high dimensions is that the distances between data points tend to become similar and lose their expressiveness, which severely affects discriminative methods. If the data have low-dimensional structure, the learning algorithm can operate in the corresponding low-dimensional space, thus avoiding the curse of dimensionality.

In fact, we can see from previous analysis that algorithms based on the manifold assumption can be regarded as an approximate realization of the semi-supervised smoothness assumption, because these algorithms use the manifold metric to calculate the geodesic distance. Moreover, if we consider manifolds as approximations of high-density regions, the semi-supervised smoothing assumption at this point degrades to the smoothing assumption applied to manifolds.

4. Common methods of semi-supervised learning

Many approaches incorporate one or more of the above assumptions. According to the underlying hypothesis, semi-supervised learning methods can be divided into the following categories: generative models, low-density separation, and graph-based approaches.

1) Generative models

Generative models are a common approach to semi-supervised learning. They estimate the class-conditional probability p(x|y), assuming for example that p(x|y) is a Gaussian distribution. The EM algorithm can then be used to find the Gaussian parameters for each category. In fact, the only difference between this EM algorithm and the standard EM algorithm for clustering is that the "hidden variables" of the labelled samples are not unobserved but are known: they are the class labels of those samples. As each cluster is assumed to belong to only one class, this implements the cluster assumption.
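A minimal numpy sketch of this idea is given below, assuming two classes, spherical Gaussians with unit variance, and equal class priors; the labelled points keep their known responsibilities, while the unlabelled points receive soft responsibilities in the E-step. These modelling choices are illustrative simplifications.

import numpy as np

def ss_gmm_em(X_l, y_l, X_u, n_iter=50):
    """Semi-supervised EM for a two-class Gaussian mixture (spherical, unit variance, equal priors).
    y_l must contain both classes 0 and 1. Returns the class means and all responsibilities."""
    X = np.vstack([X_l, X_u])
    R = np.zeros((len(X), 2))
    R[np.arange(len(X_l)), y_l] = 1.0                     # clamp labelled responsibilities
    mu = np.array([X_l[y_l == k].mean(axis=0) for k in (0, 1)])
    for _ in range(n_iter):
        # E-step (unlabelled points only): responsibility proportional to exp(-||x - mu_k||^2 / 2)
        d2 = ((X_u[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        p = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)))   # shift for numerical stability
        R[len(X_l):] = p / p.sum(axis=1, keepdims=True)
        # M-step: update the means from all (labelled + unlabelled) points
        mu = (R.T @ X) / R.sum(axis=0)[:, None]
    return mu, R
# Predicted class of the j-th unlabelled point: R[len(X_l) + j].argmax()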

One advantage of generative models is that they can incorporate structural knowledge of the problem or data through modelling. However, studies have found that, when the model assumptions are wrong, unlabelled data can reduce the prediction accuracy. In statistical learning, a class of functions or a prior is chosen before inference and can be selected according to prior knowledge of the problem. In semi-supervised learning, if some information about the objective function is known in advance, the prior can be chosen more accurately with the help of the unlabelled data. In general, higher prior probabilities can be assigned to functions that fit the cluster assumption. In theory, this is a natural way for semi-supervised learning to obtain bounds.

2) Low-density separation

Some methods directly implement the low-density separation assumption by pushing the decision boundary away from the unlabelled data. The most common way to achieve this is to use a maximum-margin algorithm such as an SVM. The method that maximizes the margin over both the labelled and the unlabelled data is the TSVM. However, the resulting problem is non-convex, so it is difficult to optimize.

Chapter 6 of Reference [1] provides an optimization algorithm for the TSVM. An SVM is first trained on the labelled data and used to label the unlabelled data; the SVM is then retrained on all data samples, repeatedly and with a gradually increasing weight on the unlabelled data. Another approach uses a semi-definite relaxation [1, Chapter 7]. Compared with the TSVM, this method seems to implement the low-density separation hypothesis more directly, because in addition to the standard quadratic regularization term it exploits an extra term reflecting the data density near the decision boundary. For details, please see Chapter 10 of Reference [1].
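The following is a rough sketch of the iterative scheme just described (train on the labelled data, pseudo-label the unlabelled data, retrain with a growing weight on the pseudo-labels), assuming scikit-learn's SVC; the linear kernel and the weighting schedule are illustrative choices, not the exact algorithm of [1, Chapter 6].

import numpy as np
from sklearn.svm import SVC

def tsvm_like(X_l, y_l, X_u, n_rounds=5):
    """Iterative pseudo-labelling with an SVM, gradually up-weighting the unlabelled points."""
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_l, y_l)                                   # 1. train on the labelled data only
    for r in range(1, n_rounds + 1):
        y_u = clf.predict(X_u)                          # 2. pseudo-label the unlabelled data
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        # 3. retrain on all data, increasing the weight of the pseudo-labelled points each round
        w = np.concatenate([np.ones(len(X_l)), np.full(len(X_u), r / n_rounds)])
        clf.fit(X_all, y_all, sample_weight=w)
    return clf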

3) Graph-based approaches

In recent years, graph-based approaches have become a hot topic in the field of semi-supervised learning. They use graph nodes to represent the data, with the edges of the graph representing the distances between nodes (where there is no edge between two nodes, the distance is infinite). The distance between two data samples is computed as the minimum path length over all paths between the two points. From the viewpoint of data sampled from a manifold, this can be seen as the geodesic distance between the two data points. Therefore, graph-based methods can be said to be built on the manifold assumption.
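One common graph-based scheme is iterative label propagation over a k-nearest-neighbour graph with Gaussian edge weights, sketched below; the graph construction, the clamping of labelled nodes, and the parameter values are standard but assumed choices, not a specific algorithm from this section.

import numpy as np

def label_propagation(X, y, labelled_idx, n_classes, k=10, sigma=1.0, n_iter=200):
    """X: (n, d) data; y: integer labels (only entries at labelled_idx are used).
    Builds a Gaussian-weighted kNN graph and propagates labels along its edges."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)         # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    keep = np.argsort(-W, axis=1)[:, :k]                        # k nearest neighbours per node
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(n)[:, None], keep] = True
    W = W * (mask | mask.T)                                     # symmetrized kNN graph
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)              # row-normalized transition matrix
    F = np.zeros((n, n_classes))
    F[labelled_idx, y[labelled_idx]] = 1.0
    clamp = F[labelled_idx].copy()
    for _ in range(n_iter):
        F = P @ F                                               # propagate labels along the edges
        F[labelled_idx] = clamp                                 # clamp the labelled nodes
    return F.argmax(axis=1)                                     # predicted label for every node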

4) Other methods

There are other methods that are not semi-supervised in nature and generally proceed in two steps.

First, an unsupervised learning step is performed on all the data, ignoring the existing labels. This yields a change of representation, a new metric, or a new kernel.

Second, the unlabelled data are then set aside, and the new representation, distance, or kernel is used for ordinary supervised learning. These methods can be said to implement the semi-supervised smoothness assumption directly, because the change of representation tends to shrink distances within high-density regions.

Obviously, choosing among the many semi-supervised methods can be confusing, and there is no definitive way to make this choice. Because the labelled data are limited, semi-supervised learning methods must make strong model assumptions.

In the ideal scenario, the model underlying the method should fit the structure of the problem, but this is not easy to verify in practice. Nevertheless, the following rules of thumb help determine which semi-supervised approach to use. If the data within each class are clustered, the EM-trained mixture model is a good choice; if the features can be naturally divided into two sets, then co-training is a good choice; if two data samples with similar features tend to be in the same class, then a graph-based approach may be used; if you already use an SVM, then the TSVM is preferred; and if the existing supervised classifier is complex and difficult to modify, then self-training is a good wrapper method.

5. Inductive semi-supervised learning and transductive semi-supervised learning

There are many semi-supervised learning methods based on different hypotheses, but they can be broadly divided into two blocks: inductive semi-supervised learning and transductive semi-supervised learning. In supervised classification, the training samples are all labelled, so people are mainly concerned with the classification performance of the model on the test data. In semi-supervised classification, the training sample contains unlabelled data, so there are two different goals: predicting unseen data or predicting only the given unlabelled data. We call the former inductive semi-supervised learning and the latter transductive semi-supervised learning, or simply transductive learning.

1) Inductive semi-supervised learning

For a given training set $\{(x_i, y_i)\}_{i=1}^{l} \cup \{x_j\}_{j=l+1}^{l+u}$, inductive semi-supervised learning involves learning a function $f: \mathcal{X} \to \mathcal{Y}$ that can predict unseen data beyond $\{x_j\}_{j=l+1}^{l+u}$. Similar to supervised learning, the performance on unseen data is estimated using test samples $\{(x_k, y_k)\}_{k=1}^{m}$ that were not used for training.

2) Transductive learning

For a given training set $\{(x_i, y_i)\}_{i=1}^{l} \cup \{x_j\}_{j=l+1}^{l+u}$, transductive learning involves learning a function $f: \mathcal{X}^{l+u} \to \mathcal{Y}^{l+u}$ such that $f$ gives good predictions for the unlabelled data $\{x_j\}_{j=l+1}^{l+u}$. Note that $f$ is only defined on the given training sample and does not predict new data; it is simply a function on a finite set.

To facilitate a better understanding, inductive semi-supervised learning can be analogized into classroom tests. The test questions are unknown in advance, so the students need to prepare for all possible problems. Conversely, transductive learning is analogous to homework, where students know the test questions and do not need to prepare for other questions.
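A tiny sketch of this distinction, assuming a 1-nearest-neighbour rule purely for illustration: the inductive learner returns a function that can label arbitrary new points, whereas the transductive learner only returns labels for the specific unlabelled points it was given.

import numpy as np

def inductive_learner(X_l, y_l):
    """Returns a function f that can label *any* new point x (the classroom-test setting)."""
    def f(x):
        return y_l[np.argmin(((X_l - x) ** 2).sum(axis=1))]
    return f

def transductive_learner(X_l, y_l, X_u):
    """Returns labels only for the given unlabelled points X_u (the homework setting)."""
    return np.array([y_l[np.argmin(((X_l - x) ** 2).sum(axis=1))] for x in X_u])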

5.2.2 Multi-task learning

Traditional machine learning methods learn only one task at a time. If the problem is complicated, it can be split into several relatively simple and logically independent subproblems; these subproblems can be learnt separately and then integrated to form the solution to the original problem. Sometimes, however, this approach backfires, because it ignores the rich information contained in real problems – information that is included in the training signals of other tasks in the same field. In 1997, Caruana's paper "Multitask Learning" (published in Machine Learning) made multi-task learning a hot issue in machine learning.

1. Related tasks

A formal or heuristic description of what constitutes related tasks remains the most important open problem in the field of inductive transfer. Because of the lack of an accurate definition of task relatedness, the theory of inductive transfer is progressing slowly. However, some characteristics of task relatedness are very clear.

(1) If two tasks share the same underlying function but the noise added to each task is independent, then the two tasks are related.

(2) If two tasks predict different aspects of the same person's health, they are certainly more related than tasks that predict aspects of different people's health.

It should be noted that tasks should not be deemed related simply because they cooperate well when trained at the same time. For example, noise is sometimes added as an extra output of a backpropagation network; acting on the hidden layer, these outputs serve as a regularization term and improve generalization performance, but this does not mean that the noise task is related to the other tasks.

2. Concept of multi-task learning

The standard machine learning method is to learn one task at a time. If the scale of the problem is large, the problem is divided into a number of separate, simple subproblems; these are learnt individually and then merged to solve the complex problem. However, this practice does not always work, because it ignores the rich information contained in the training signals of other problems in the same field.

Considering the number of available training patterns and the training time, a network with a 1000 × 1000 input retina cannot learn to recognize complex objects in the real world. If it learns many things at the same time, however, the learning effect improves. If tasks are allowed to share what they have learned, learning together is much easier than learning alone. Therefore, if we train a network to recognize contours, shapes, edges, regions, sub-regions, textures, reflections, highlights, shadows, orientations, sizes, and distances of objects, the trained model is more likely to recognize complex objects in the real world. This approach is multi-task learning. Multi-task learning is an inductive transfer method that improves generalization performance by using the domain information contained in the training signals of related tasks as an inductive bias. It improves generalization by learning multiple tasks in parallel, and these tasks share the same representation, that is, they are related. What each task learns helps the other tasks to be trained better. In practice, the training signals of the other tasks act as an inductive bias.

In short, the core idea of multi-task learning is that a core issue is accompanied by some related issues that help the core results of the problem.

In general, typical information transfer mechanisms include sharing a hidden layer in neural networks [6, 40, 41], setting the same priors in a hierarchical Bayesian model [16, 42–44], sharing parameters in a Gaussian process [15], learning the best distance metric in KNN [45], sharing the same structure in the prediction space [19], and structural regularization in kernel methods [46].

3. Common applications in multi-task learning

In the nine fields discussed below, it is often possible to obtain training signals for useful additional tasks. In general, most real problems belong to one or more of these nine areas [6]. Surprisingly few of the benchmark problems in machine learning are posed as multi-task learning problems; many problems in the field are handled as single-task learning, which reduces the opportunities to use multi-task learning.

1) Use future information to predict the current

Some important features only become available after the prediction has to be made. Because these features cannot be obtained at runtime, they cannot be used as inputs. During training, however, we can collect these features for the training set and use them as additional tasks in multi-task learning. When the system is used, the predictions made by the learner for these additional tasks can simply be ignored; their primary role is to provide additional information to the learner during training.

One application that uses future information for learning is medical crisis prediction, such as predicting a pneumonia crisis [6]. In this problem, laboratory results from the training set are used as additional output tasks. In fact, these laboratory results are not available when making predictions for patients. The useful information contained in these future measurements biases the network towards hidden-layer representations that better predict crises from the features that are available at runtime.

2) Multiple representations and metrics

Sometimes it is difficult to capture all the important information in a single error metric or output representation. Multi-task learning can benefit from alternative metrics or output representations when they capture different, useful aspects of the problem.

3) Time sequence prediction

This type of application is a subclass of using future information to predict the present. The future task is the same as the current task but occurs later. The simplest way to use multi-task learning for time series prediction is to use a single network with multiple outputs, each corresponding to the same task at a different time. Figure 5.1 shows a multi-task learning network with four outputs. If output k represents the prediction at time Tk, then the network makes predictions for the same task at four different times. The output normally used for prediction is the one in the middle, so that tasks both earlier and later in time are also trained on the network. As the input features slide through time across the network's input, a sequence of predictions of the same quantity is produced, and these outputs can be collected and combined.
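A minimal numpy sketch of the network in Fig. 5.1 is given below: a single shared hidden layer with four output heads, each predicting the same quantity at a different time offset and trained jointly. The architecture sizes, the tanh activation, and the squared loss are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def train_multi_output_net(X, Y, hidden=16, lr=0.01, epochs=2000):
    """X: (n, d) inputs; Y: (n, 4) targets, where column k is the series value at time T_k."""
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))            # shared hidden layer
    W2 = rng.normal(scale=0.1, size=(hidden, Y.shape[1]))   # one output head per time offset
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        err = H @ W2 - Y                                    # squared-loss error, all heads jointly
        W2 -= lr * H.T @ err / n
        W1 -= lr * X.T @ ((err @ W2.T) * (1 - H ** 2)) / n  # backpropagate through the shared layer
    return W1, W2
# At prediction time, only the middle output would normally be used; the other heads are
# auxiliary tasks whose training signals shape the shared hidden layer.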

4) Use an inoperable feature

It is not practical to use certain features at runtime, either because they are computationally expensive or because they require human assistance that cannot be provided in time. Obtaining more features is often costly. Because the training set is comparatively small, however, it is practical to compute these otherwise impractical feature values for the training set, where they can be used as outputs for multi-task learning.

As in scene analysis, humans are often required to mark important features. Generally, when the learning system is deployed, no manual annotation is available, but this does not mean that the features marked during data collection cannot be used. If these markers are included in the training set, they can serve as additional tasks for the learner; they are not required when the system is used.

Fig. 5.1: A simple multi-output network.

5) Use additional tasks to focus attention

Learners often learn to rely on large and common input patterns while ignoring smaller or less common inputs that may also be useful. Multi-task learning can be used to force the learner to focus attention on input patterns that would otherwise be ignored, by adding tasks that depend strictly on those patterns and thereby forcing the learner to build internal representations that support them. In the road-following domain [6], for example, single-task networks learning to drive often ignore road markings, because road markings are usually a small part of the image and do not always exist.

If the learning-to-drive network is also required to identify road boundaries as an additional output task, it will learn to pay attention to the road boundaries in the image. Because road boundaries are learnable, the network learns an internal representation that supports them. As the network also learns to drive with the same hidden layer, the driving task can use whatever parts of the hidden representation of the road boundaries are useful for driving.

6) Sequential transfer

Sometimes we have already learned related tasks, but the data used to train them are no longer available. Multi-task learning can still benefit from this prior learning without the original training data: the previously learned model can be used to generate synthetic data, and the training signals in the synthetic data can be used for multi-task learning. This approach to sequential transfer avoids catastrophic interference (i.e. forgetting the old task when learning a new one). It can also be used where the analytical domain theories required by other sequential transfer methods are not available.

7) Multiple tasks appear naturally

Some related tasks arise naturally in everyday life. Splitting them into separate problems and training each one alone, as is traditional, is often counterproductive; if related tasks are trained together, they can benefit from each other. An early and accidental use of multi-task learning in a backpropagation network was NETtalk, which learned phonemes and stresses so that a speech synthesizer could pronounce input words. NETtalk uses a network with multiple outputs, partly because the goal was to control the synthesizer, which requires both phonemes and stresses. Although the researchers did not analyse the contribution of multi-task transfer to NETtalk, there is evidence that using separate learning networks may be more difficult.

8) Quantitative smoothing

There is a lot of quantized information in everyday life. For example, training signals may be given only as a grade (excellent, good, medium, or poor), or some natural process may quantize a potentially smoother function, as when detailed predictive physical measurements are reduced to the patient's final outcome – live or die. Although quantization sometimes makes a problem appear easier, it can make learning more difficult.

These additional training signals may be used as extra tasks if the primary task is more heavily quantized than the additional training signals, or if the two use different quantization methods. It is useful to learn a heavily quantized task together with a less quantized extra task because the latter is sometimes easier to learn (because it is smoother). Extra tasks that are not smoother but use a different quantization process are also sometimes beneficial, because combining them with the main task may smooth out the roughness of both. In fact, each task can make up for the limitations that quantization imposes on the other tasks.

9) Some of the inputs are better as output items

Multi-task learning is useful in many areas where it is impractical to include certain features as inputs. It provides a way to benefit from these features by using them as additional tasks rather than ignoring them. Using certain features as outputs may even be more useful than using them as inputs, and it is entirely possible to construct problems where this is the case. For further details on this point, see [6].

4. Transfer learning and multi-task learning

Inductive transfer, or transfer learning [47, 48], is widely studied in the machine learning field. It focuses on how information stored from solved problems can be used to solve different but related problems. For example, learning to walk helps in learning to run. Research in this area is, of course, associated with the study of transfer of learning in psychology, although the formal connection between the two is limited. Multi-task learning is also studied in the field of machine learning: it learns a target problem while simultaneously learning other related problems that share the same representation. Thus, multi-task learning is a form of inductive transfer.

Table 5.1 summarizes the relationship between traditional machine learning and the various types of transfer learning. According to the setting, transfer learning is divided into three categories: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning. Figure 5.2 summarizes the different settings of transfer learning [47].

Tab. 5.1: Relationship between traditional machine learning and various transfers.

Fig. 5.2: The classification overview of different settings in transfer learning.

5. Make predictions for multiple tasks

Multi-task learning involves training multiple tasks in parallel within one model, but that does not mean the model is used to predict the outcomes of multiple tasks. The reason multiple tasks are trained in one model is that each task can utilize the useful information contained in the training signals of the other tasks, not that it reduces the number of models that must be trained. Good overall performance across all tasks must be traded off against good performance on the task of interest; in general, it is best to optimize the performance of one task at a time, even if the performance of the other tasks declines. When predictions are needed for several tasks, it is therefore particularly important to train a separate multi-task learning model for each task of interest.

6. Parallel transfer and sequential transfer

Multi-task learning is a type of parallel transfer. Although learning tasks in parallel may seem more difficult than learning them sequentially, this is not the case. Parallel transfer has the following advantages:

(1) Because all tasks are learned at the same time, each task has access to the full details of the other tasks as they are being learned.

(2) In many applications, additional tasks can be applied in parallel with the main task. Parallel transfer does not require a training order to be defined, whereas in sequential transfer different training orders give different results.

(3) Tasks can benefit from each other in a way that is impossible with a linear order. For example, if Task 1 is learned before Task 2, Task 2 cannot help Task 1 to learn. This not only reduces the performance of Task 1 but also weakens the ability of Task 1 to assist the learning of Task 2.

When tasks occur naturally and sequentially, using parallel transfer is still easier than sequential transfer. If the training data can be stored, existing tasks can be kept for multi-task learning and relearned when new tasks arrive. If the training data cannot be stored, the learned model can be used to generate simulated data as a prior. It is interesting to note that, while it is easy to perform sequential transfer by means of parallel transfer, it is difficult to implement parallel transfer by means of sequential transfer. It is also worth mentioning that performing sequential transfer and parallel transfer at the same time is feasible.

5.3 Semi-supervised multi-task learning model based on DFS

Although semi-supervised multi-task learning is currently a hot topic and can deal with static problems effectively, it still cannot solve dynamic fuzzy problems. The concept of fuzzy sets proposed by Zadeh in 1965 is a static concept and therefore cannot objectively describe dynamic fuzzy phenomena, dynamic fuzzy events, and dynamic fuzzy concepts. Dynamic fuzzy set theory, however, can describe dynamic fuzzy problems well, so it is natural to propose a semi-supervised multi-task learning model based on dynamic fuzzy sets.

5.3.1 Dynamic fuzzy machine learning model

In [34], dynamic fuzzy set theory was introduced into the machine learning domain, and a dynamic fuzzy machine learning model and related theories were proposed.

Definition 5.1 Dynamic fuzzy machine learning space: The space used to describe the learning process composed of all the dynamic fuzzy machine learning elements is called the dynamic fuzzy machine learning space. It consists of {learning sample, learning algorithm, input data, output data, representation theory} and is written as follows:

$(\overleftarrow{S}, \overrightarrow{S}) = \{(\overleftarrow{Ex}, \overrightarrow{Ex}), ER, (\overleftarrow{X}, \overrightarrow{X}), (\overleftarrow{Y}, \overrightarrow{Y}), ET\}.$

Definition 5.2 Dynamic fuzzy machine learning: Dynamic fuzzy machine learning $(\overleftarrow{l}, \overrightarrow{l})$ refers to the mapping of an input dataset $(\overleftarrow{X}, \overrightarrow{X})$ to an output dataset $(\overleftarrow{Y}, \overrightarrow{Y})$ in a dynamic fuzzy machine learning space $(\overleftarrow{S}, \overrightarrow{S})$, denoted by

$(\overleftarrow{l}, \overrightarrow{l}): (\overleftarrow{X}, \overrightarrow{X}) \to (\overleftarrow{Y}, \overrightarrow{Y}).$

Definition 5.3 Dynamic fuzzy machine learning system: The five elements of the dynamic fuzzy machine learning space, combined with a certain learning mechanism, give a computer system the ability to learn; such a system is known as a dynamic fuzzy machine learning system.

Definition 5.4 Dynamic fuzzy machine learning model: $\mathrm{DFMLM} = \{(\overleftarrow{S}, \overrightarrow{S}), (\overleftarrow{L}, \overrightarrow{L}), (\overleftarrow{u}, \overrightarrow{u}), (\overleftarrow{y}, \overrightarrow{y}), (\overleftarrow{p}, \overrightarrow{p}), (\overleftarrow{I}, \overrightarrow{I}), (\overleftarrow{o}, \overrightarrow{o})\}$, where $(\overleftarrow{S}, \overrightarrow{S})$ is the learning part (dynamic environment / dynamic fuzzy learning space), $(\overleftarrow{L}, \overrightarrow{L})$ is the dynamic learning part, $(\overleftarrow{u}, \overrightarrow{u})$ is the output of $(\overleftarrow{S}, \overrightarrow{S})$ to $(\overleftarrow{L}, \overrightarrow{L})$, $(\overleftarrow{y}, \overrightarrow{y})$ is the dynamic feedback of $(\overleftarrow{L}, \overrightarrow{L})$ to $(\overleftarrow{S}, \overrightarrow{S})$, $(\overleftarrow{p}, \overrightarrow{p})$ is the system learning performance index, $(\overleftarrow{I}, \overrightarrow{I})$ is the input from the external environment to the dynamic fuzzy learning system, and $(\overleftarrow{o}, \overrightarrow{o})$ is the system output to the outside world. See Fig. 5.3 for a schematic representation.

Fig. 5.3: Dynamic fuzzy machine learning model block diagram.

5.3.2 Dynamic fuzzy semi-supervised learning model

Definition 5.5 Dynamic fuzzy semi-supervised learning (DFSSL): Consider a set of samples $(\overleftarrow{X}, \overrightarrow{X}) = (\overleftarrow{L}, \overrightarrow{L}) \cup (\overleftarrow{U}, \overrightarrow{U})$ drawn from an unknown distribution, where $(\overleftarrow{L}, \overrightarrow{L}) = \{((\overleftarrow{x}_1, \overrightarrow{x}_1), (\overleftarrow{y}_1, \overrightarrow{y}_1)), ((\overleftarrow{x}_2, \overrightarrow{x}_2), (\overleftarrow{y}_2, \overrightarrow{y}_2)), \dots, ((\overleftarrow{x}_{|(\overleftarrow{L}, \overrightarrow{L})|}, \overrightarrow{x}_{|(\overleftarrow{L}, \overrightarrow{L})|}), (\overleftarrow{y}_{|(\overleftarrow{L}, \overrightarrow{L})|}, \overrightarrow{y}_{|(\overleftarrow{L}, \overrightarrow{L})|}))\}$ is the labelled sample set and $(\overleftarrow{U}, \overrightarrow{U}) = \{(\overleftarrow{x}'_1, \overrightarrow{x}'_1), (\overleftarrow{x}'_2, \overrightarrow{x}'_2), \dots, (\overleftarrow{x}'_{|(\overleftarrow{U}, \overrightarrow{U})|}, \overrightarrow{x}'_{|(\overleftarrow{U}, \overrightarrow{U})|})\}$ is the unlabelled sample set.
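To make the dynamic fuzzy pair notation above concrete, the following sketch encodes each dynamic fuzzy value as a (membership value, direction) pair and splits a DF sample set into labelled and unlabelled subsets as in Definition 5.5. This data structure is only an illustrative assumption about how DF samples might be represented in code, not the formal DFS machinery of [30–35].

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DFNumber:
    """A dynamic fuzzy number: a membership value in [0, 1] plus a direction of change."""
    value: float          # fuzzy membership degree
    direction: int        # +1 for increasing (→), -1 for decreasing (←)

@dataclass
class DFSample:
    features: Tuple[DFNumber, ...]        # dynamic fuzzy feature vector
    label: Optional[DFNumber] = None      # dynamic fuzzy label; None if unlabelled

def split_dfssl(samples: List[DFSample]) -> Tuple[List[DFSample], List[DFSample]]:
    """Split a DF sample set into the labelled subset L and the unlabelled subset U."""
    labelled = [s for s in samples if s.label is not None]
    unlabelled = [s for s in samples if s.label is None]
    return labelled, unlabelled

# Usage sketch:
data = [DFSample((DFNumber(0.7, +1), DFNumber(0.2, -1)), DFNumber(0.9, +1)),
        DFSample((DFNumber(0.4, -1), DFNumber(0.6, +1)))]
L, U = split_dfssl(data)       # one labelled sample, one unlabelled sample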
