Chapter 7

Multitask Learning for the Diagnosis of Machine Fleet 1


7.1. Introduction

Nowadays, in manufacturing or production systems, condition-based maintenance is preferred over preventive maintenance because it is technically achievable and financially advantageous [CAM 09]. It relies on monitoring the state of the system under consideration from information collected on it, i.e. on the ability to detect and diagnose its dysfunctions so that maintenance actions can be planned for the "equipment" involved. Since the beginning of the 2000s, with the rise of information and communication technologies (Web, mobile, etc.), new forms of maintenance such as remote condition-based maintenance (e-CBM) have emerged [MUL 08, HIG 04].

At the industrial level, this evolution has translated into the development of new solutions and new services, in particular the possibility to monitor a fleet of equipment (fleet-wide monitoring). The main idea is, through an overall vision of the operation of the set of machines, to improve the efficiency of the monitoring of the machines of this fleet. Examples of these applications are found in the domain of transport: fleets of cars [YOU 05], planes, or boats; and in the energy sector (see, for instance, the websites of the companies Smartsignal1 or EPRI2): fleets of thermal, hydraulic, or nuclear power plants [KUN 03], of wind turbines, etc. Many of these solutions mainly offer a hardware and software architecture enabling communication between the different levels or agents defining the remote diagnostic center. Others detail the "surveillance" function of their solutions a bit more. However, we can note from the publications or patents associated with these solutions that the diagnostic methods used are quite elementary (threshold crossing of a signal, for instance) and do not integrate recent developments from scientific research in the area. Furthermore, none of them present methods enabling us to develop these monitoring systems by integrating the concept of the fleet [FAM 03, HAS 04, ABU 07, YU 10]. Yet this aspect appears to be the main key to this approach.

Overall, we can expect several benefits from this fleet-level diagnosis. On the one hand, it is clear that the redundancy inherent in a fleet of similar machines or systems gives us access to much more information and many more measurements relative to the operating modes and dysfunctions of the systems considered. On the other hand, the similarity of the systems should, or at least could, also facilitate the design and implementation of a diagnosis system on any system of a similar nature.

Initially, we are going to define the issue at stake more precisely. We consider a set of equipment or machines assumed to be identical, for instance, a fleet of nuclear power plant pumps, a fleet of rolling mill electric motors, or a fleet of aircraft turbines. Without much loss of generality, we will assume that this equipment can go through the same states of operation and dysfunction, and that the set of observed variables enabling us to characterize these states is identical on each piece of equipment throughout the fleet. Even though assumed to be identical, the pieces of equipment can behave differently, in particular because of different operating conditions and disturbances. For instance, the same turbine is not subjected to identical operating conditions depending on whether it equips a medium-haul plane covering flights in Africa or a long-haul plane covering flights between Nordic countries. It follows that the values of the monitoring variables coming from a piece of equipment, and their distribution in the observation space, depend not only on its operating state, but also on the equipment itself and possibly on the history of its use.

Generally, the statistical models of the observations are not known and each system is described only by samples. Hence, we assume that we have at our disposal only observation samples. Each observation is represented by a vector of attributes extracted from a piece of equipment of the fleet; the predefined attributes are able to characterize the states of the equipment. However, the size of the collected samples can vary significantly from one piece of equipment to another, and the states corresponding to the available observations are not necessarily the same on each piece of equipment. Furthermore, these observations most often correspond to measurements made in normal operating conditions, while few, if any, describe abnormal situations.

Among the approaches to the diagnosis of operating systems, we were particularly interested in approaches without an a priori model, based on data (pattern recognition). The implementation of classification methods for learning requires a labeled data set. All the data are used, according to different modalities, to determine a decision rule and to estimate the performances of that rule. The rule is then applied to new observations coming from the same system in order to label them. The performances of the rule depend both on the available learning set and on the learning method.

In the literature, this kind of problem is sometimes tackled as a two-class classification problem [KAS 09]:

– a target class also referred to as a positive class, whose data are available;

– another class whose data are difficult or impossible to obtain; this class corresponds to the failure class or negative class. Several methods have been suggested within this framework. The general principle consists of artificially generating data for the negative class, and then employing conventional two-class learning algorithms. These methods rely on the hypothesis that the proportion of positive examples in the non-labeled set is very small, which is not always correct. The extraction of negative examples is thus unreliable and requires careful attention. Besides, the non-labeled set must be sufficiently large in order to obtain the best classifier.

The classification restricted to one class is another approach to this kind of problem; in this case, learning only takes into account positive examples, the main idea being to learn the normal class by adjusting the model to the positive data. New observations are classified by comparing their score, given by the model, to a decision threshold. There exist two categories of methods of this kind [CAM 08]: statistical methods and methods relying on the optimization of a decision function within a class of functions, for instance neural networks or approaches based on support vector machines (SVMs). The 1-SVM method, suggested by B. Schölkopf, is one of them [SCH 01, MAN 01].
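As an illustration of this one-class setting, the short sketch below (our own example, not taken from the chapter) trains a one-class classifier on "healthy" observations only and scores new observations against a decision threshold. It uses scikit-learn's OneClassSVM, which implements the 1-SVM of [SCH 01]; the toy data and parameter values are arbitrary.

```python
# Minimal one-class classification sketch (illustrative data and parameters).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, (200, 4))           # "healthy" observations only
X_new = np.vstack([rng.normal(0.0, 1.0, (5, 4)),   # observations close to the normal class
                   rng.normal(6.0, 1.0, (5, 4))])  # observations far from it

# nu is the fraction of training data allowed outside the description of the
# positive class; gamma parameterizes the Gaussian kernel (gamma = 1 / (2 * sigma**2)).
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.125).fit(X_train)

print(clf.predict(X_new))            # +1 = normal, -1 = outlier
print(clf.decision_function(X_new))  # signed score compared to the decision threshold
```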

Learning methods have proved very efficient for inferring monitoring rules adapted to a given piece of equipment. However, in several practical applications, we must either design a common monitoring system for several similar systems or rebuild a decision rule for each system. This is the case, for instance, for diagnosis problems where the system comprises several a priori identical pieces of equipment that work in different conditions. Here, the term "equipment" can refer to a simple machine (an engine, a pump, etc.), a system (a car, a plane, etc.), and by extension an industrial installation (a nuclear power plant, etc.). The vehicles of a rental company are an example of this. In the industrial sphere, it is common to find several a priori identical operational sites in the domains of construction or energy production. In all these cases, the learning of a monitoring rule can be viewed as a task, and it seems relevant to transfer or pool the information coming from similar tasks [PAN 10]. These reasons undoubtedly explain the growing interest, over the last several years, in multitask learning (MTL) [BI 08, ZHE 08, GU 09, BIR 10].

While most learning methods focus on the processing of a single task, multitask methods aim to improve the generalization performances of the monitoring system by taking advantage of the links between the tasks. The main idea is to share what is learned from the different tasks (for instance, a common representation space or model parameters that are close to each other) while the tasks are learned in parallel [CAR 97]. Previous works showed that MTL can lead to more efficient learning models with better performances [CAR 97, HES 00, BEN 03, EVG 05, BEN 08] than those obtained with single-task learning techniques.

In recent years, SVMs (support vector machines) [BOS 92, VAP 95] have been successfully used for MTL [EVG 04, JEB 04, EVG 05, WID 10, YAN 10]. The SVM method was initially developed for the classification of data from two different classes. When the data are linearly separable, SVMs determine the separating hyperplane that maximizes the distance to the closest observations (the maximum margin hyperplane). When the data are not linearly separable, the "kernel trick" is used: the basic idea is to map the original data into a space of greater dimension (the transformed space) and then to solve a linear problem in that space. The good properties of kernel functions can also be exploited within the framework of MTL.

In this chapter, we present a new approach to MTL that relies on one-class support vector machines (1-SVMs). 1-SVMs, suggested by [SCH 01], are suited to solving problems of novelty or outlier detection. These problems arise whenever knowledge of the system to be monitored is restricted to one class, or knowledge of the other classes is too limited. In fact, in several applications, it is very difficult to collect a sufficiently large number of observations corresponding to all the abnormal behaviors of the system. The main advantage of the 1-SVM, with respect to other one-class classification methods [TAR 95, RIT 97, ESK 00, SIN 04], is that it focuses only on the estimation of the envelope of a region containing the samples of the target class, the envelope being sufficient for classification. The estimation of this envelope is done by separating a certain proportion of the target samples from the origin by a maximum margin hyperplane (in general, in a transformed space).

Recently, Yang et al. [YAN 10] suggested a multitask approach for learning one-class classifiers. The basic idea is to constrain the classifiers of similar tasks to be close to each other. However, they solved this problem by resorting to conic programming [KEM 08], which happens to be complex. In this chapter, inspired by the work of [EVG 04], we introduce a simpler MTL framework that relies on the 1-SVM method. We hypothesize, as in [EVG 04], that the parameter values of the models of the different tasks considered are close to a certain mean value. This hypothesis is reasonable because, the tasks being similar to one another, their models are usually quite close. We then determine a classifier for each task, all the 1-SVMs being learned simultaneously. The suggested multitask approach is simple to implement because it only requires a minor modification of the formulation of the optimization problem solved by the 1-SVM. Experimental results demonstrate its efficiency.

This chapter is organized as follows. In section 7.2, the formulation of a one-class SVM algorithm is briefly described. The suggested MTL method is then described in section 7.3. Section 7.4 presents the experimental results. We conclude the chapter with a few comments and suggestions for future work in section 7.5.

7.2. Single-task learning of one-class SVM classifier

Different versions of SVM-type methods have been specially adapted to classification in the case of a single class [SCH 01, MAN 01]. The 1-SVM approach introduced by Schölkopf et al. [SCH 01] relies on the transformation of the original positive data from the input space to a space of greater dimension by means of a kernel function. In the transformed space, the positive data are separated from the origin with a maximum margin. Besides the original parameters of the SVM, the 1-SVM requires setting a priori the value of a parameter indicating the percentage of positive data allowed to lie outside the description of the positive class. This makes the 1-SVM more tolerant to noise (i.e. outliers) in the positive data. However, the choice of an appropriate value for this parameter is not always intuitive and has a major influence on the performance of the 1-SVM.

Briefly, the solution of 1-SVM corresponds to the estimation of the border of a region that includes most of the learning samples. If a new observation is a member of this region, it is classified as normal; otherwise, it is recognized as an outlier.

Let Am = {xi}, i = 1, …, m, be the learning set, with m the total number of observations. Each xi is an observation in the space X ⊆ ℝ^d of dimension d. Let φ be a nonlinear transformation of the attributes such that the scalar product of transformed observations can be obtained indirectly through a simple kernel K (a positive semi-definite kernel on ℝ^d fulfilling Mercer's conditions [BOS 92, VAP 95]):

[7.1] $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$

for instance the Gaussian kernel:

[7.2] $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

where σ is the kernel bandwidth.
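The short helper below (our own sketch, not part of the chapter) computes the Gram matrix of the Gaussian kernel [7.2] between two sample sets; the function name is ours.

```python
# Gram matrix of the Gaussian kernel K(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * sigma^2)).
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Return the kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))
```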

The data are projected into the space induced by the kernel, and a proportion 1 − ν of them must be separated from the origin by the hyperplane furthest away from the origin, referred to as the maximum margin hyperplane. For a new observation x, the value of the decision function is determined by evaluating on which side of this hyperplane the observation lies in the transformed space. To determine the hyperplane, we must find its normal vector w and a threshold ρ, which are solutions to the following problem:

[7.3] $\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i - \rho \quad \text{subject to} \quad \langle w, \phi(x_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, m$

where ‖·‖ is the norm associated with the scalar product 〈·, ·〉, w is the vector normal to the separating hyperplane, ρ is the offset parameter, and the ξi are the slack variables. As a result, the distance between the hyperplane and the origin is ρ/‖w‖.

It is emphasized in [SCH 01] that parameter ν ∈ [0,1] is an upper bound on the rate of anomalies (the learning observations that lie outside the estimated positive region). The slack variables ξi, associated with each observation, enable us, in the case where the data are not linearly separable in the transformed space, to leave some learning examples misclassified when the maximum margin from the origin is obtained, thus giving some robustness to the method. The optimization therefore consists of finding a compromise between the maximization of the margin ρ/‖w‖ and the minimization of the average of the slack variables. The distance between an anomaly and the hyperplane is ξi/‖w‖. The learning observations that lie on the wrong side of the separating hyperplane, those with ξi > 0, are classified as anomalies. When the Gaussian kernel [7.2] is used, the observations are projected onto the surface of a hypersphere in the space transformed by φ. Figure 7.1 shows a two-dimensional view in which the hypersphere is a circle around the origin. In this case, the 1-SVM is equivalent to the support vector data description method [TAX 99], which consists of looking for the sphere of minimum radius R including most or all of the positive data.
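The property that ν upper-bounds the training anomaly rate can be checked empirically; the sketch below (illustrative only, with toy data) counts the training observations classified as anomalies for a few values of ν.

```python
# Empirical check that nu upper-bounds the fraction of training anomalies.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))  # toy "healthy" training data

for nu in (0.01, 0.05, 0.2):
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=0.5).fit(X)
    outlier_rate = np.mean(clf.predict(X) == -1)
    # The observed training outlier rate should not exceed nu (up to numerical effects).
    print(f"nu = {nu:.2f}  ->  training outlier rate = {outlier_rate:.3f}")
```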

Let αi, βi ≥ 0 be the Lagrange multipliers associated with problem [7.3]. The Lagrangian is written as:

[7.4] $L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i - \rho - \sum_{i=1}^{m}\alpha_i\left(\langle w, \phi(x_i)\rangle - \rho + \xi_i\right) - \sum_{i=1}^{m}\beta_i\,\xi_i$

By setting the derivatives of the Lagrangian with respect to the primal variables w, ξi, ρ to zero, we obtain:

[7.5] $w = \sum_{i=1}^{m}\alpha_i\,\phi(x_i)$

[7.6] $\alpha_i = \frac{1}{\nu m} - \beta_i \leq \frac{1}{\nu m}, \qquad \sum_{i=1}^{m}\alpha_i = 1$

Figure 7.1. View in two dimensions of 1-SVM classifier and its parameters

ch7-fig7.1.jpg

Given these expressions, the values of α are solutions to the dual problem given by:

[7.7] $\min_{\alpha}\ \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j\,\langle\phi(x_i), \phi(x_j)\rangle \quad \text{subject to} \quad 0 \leq \alpha_i \leq \frac{1}{\nu m}, \quad \sum_{i=1}^{m}\alpha_i = 1$

It is not necessary to explicitly calculate the nonlinear transformation for each observation, thanks to the kernel trick. The latter consists of directly defining a scalar product using a function fulfilling Mercer's conditions [VAP 95]:

[7.8] $K(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle$

By introducing this kernel when solving the dual problem, we show that the final decision function is expressed only as a function of kernel evaluations and that it is not necessary to make the transformation φ(x) explicit:

[7.9] $f(x) = \operatorname{sign}\left(\sum_{i=1}^{m}\alpha_i\,K(x_i, x) - \rho\right)$

We can cite as an example the Gaussian kernel $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2/2\sigma^2\right)$ of [7.2], which is widely used.

7.3. Multitask learning of 1-SVM classifiers

In this section, we extend the 1-SVM method to the MTL setting. In this context, we consider T tasks defined on the same space X ⊆ ℝ^d. For each task t, we have at our disposal m samples {x1t, x2t, …, xmt}. The goal is to learn a decision function (a hyperplane) ft(x) = sign(〈wt, φ(x)〉 − ρt) for each task t. Inspired by the method suggested by [EVG 04], we hypothesize that, when the tasks are linked to one another, the vector wt normal to the hyperplane of task t can be represented as the sum of a mean vector w0 and a vector vt specific to the task:

[7.10] $w_t = w_0 + v_t$

7.3.1. Formulation of the problem

By taking into account the hypothesis introduced above, we can generalize the 1-SVM method to the MTL problem. The primal optimization problem can be described as follows:

[7.11] $\min_{w_0,\,v_t,\,\xi_{it},\,\rho_t}\ \frac{\mu}{2}\|w_0\|^2 + \frac{1}{2}\sum_{t=1}^{T}\|v_t\|^2 + \sum_{t=1}^{T}\frac{1}{\nu_t m}\sum_{i=1}^{m}\xi_{it} - \sum_{t=1}^{T}\rho_t$

for all i ∈ {1, 2, …, m} and t ∈ {1, 2, …, T}, under the constraints:

[7.12] $\langle w_0 + v_t, \phi(x_{it})\rangle \geq \rho_t - \xi_{it}, \qquad \xi_{it} \geq 0$

where the ξit are slack variables relaxing the constraint associated with each sample and νt ∈ [0,1] is the 1-SVM parameter specific to each task. To control the level of similarity between the tasks, we introduce a positive regularization parameter μ in the primal optimization problem. With this formulation, a large value of μ tends to force the system to learn the tasks independently from one another, while a small value of μ compels the solutions for the different tasks to converge toward a common solution. As in the case of the conventional 1-SVM, the Lagrangian is defined as follows:

[7.13] $L = \frac{\mu}{2}\|w_0\|^2 + \frac{1}{2}\sum_{t=1}^{T}\|v_t\|^2 + \sum_{t=1}^{T}\frac{1}{\nu_t m}\sum_{i=1}^{m}\xi_{it} - \sum_{t=1}^{T}\rho_t - \sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it}\left(\langle w_0 + v_t, \phi(x_{it})\rangle - \rho_t + \xi_{it}\right) - \sum_{t=1}^{T}\sum_{i=1}^{m}\beta_{it}\,\xi_{it}$

where αit, βit ≥ 0 are the Lagrange multipliers. At the optimum, the partial derivatives of the Lagrangian with respect to the variables are equal to zero and we obtain the following equations:

[7.14a] $\frac{\partial L}{\partial w_0} = 0 \ \Rightarrow\ w_0 = \frac{1}{\mu}\sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it}\,\phi(x_{it})$

[7.14b] $\frac{\partial L}{\partial v_t} = 0 \ \Rightarrow\ v_t = \sum_{i=1}^{m}\alpha_{it}\,\phi(x_{it})$

[7.14c] $\frac{\partial L}{\partial \xi_{it}} = 0 \ \Rightarrow\ \alpha_{it} = \frac{1}{\nu_t m} - \beta_{it} \leq \frac{1}{\nu_t m}$

[7.14d] $\frac{\partial L}{\partial \rho_t} = 0 \ \Rightarrow\ \sum_{i=1}^{m}\alpha_{it} = 1$

By combining equations [7.10], [7.14a], and [7.14b], we obtain the following:

[7.15] $w_0 = \frac{1}{\mu}\sum_{t=1}^{T} v_t = \frac{1}{\mu + T}\sum_{t=1}^{T} w_t$

[7.16] $v_t = w_t - \frac{1}{\mu + T}\sum_{r=1}^{T} w_r$

With these equations, the primal optimization problem can be rewritten as a function of only wt according to:

[7.17] $\min_{w_t,\,\xi_{it},\,\rho_t}\ \frac{\mu}{2}\|w_0\|^2 + \frac{1}{2}\sum_{t=1}^{T}\|w_t - w_0\|^2 + \sum_{t=1}^{T}\frac{1}{\nu_t m}\sum_{i=1}^{m}\xi_{it} - \sum_{t=1}^{T}\rho_t \quad \text{subject to} \quad \langle w_t, \phi(x_{it})\rangle \geq \rho_t - \xi_{it}, \quad \xi_{it} \geq 0$

with:

[7.18] $w_0 = \frac{1}{\mu + T}\sum_{t=1}^{T} w_t$

This new expression of the original primal problem [7.11] enables us to emphasize that, in the context of MTL, the solution is obtained by making compromises between obtaining a 1-SVM model specific to each task and obtaining a unique model for all the tasks.

7.3.2. Dual problem

The primal problem [7.11] can be solved more easily by using its Lagrangian dual, which is expressed as:

[7.19] $\min_{\alpha}\ \frac{1}{2}\sum_{t=1}^{T}\sum_{r=1}^{T}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{it}\,\alpha_{jr}\left(\frac{1}{\mu} + \delta_{rt}\right)\langle\phi(x_{it}), \phi(x_{jr})\rangle$

with the constraints:

[7.20] $0 \leq \alpha_{it} \leq \frac{1}{\nu_t m}, \qquad \sum_{i=1}^{m}\alpha_{it} = 1, \qquad t = 1, \dots, T$

where δrt is Kronecker’s symbol:

[7.21] $\delta_{rt} = \begin{cases} 1 & \text{if } r = t \\ 0 & \text{otherwise} \end{cases}$

The main difference between this dual problem [7.19] and that of the conventional 1-SVM [7.7] is the introduction of the coupling factor (1/μ + δrt) within the framework of MTL. Let us assume that we define a kernel function such as the one defined by equation [7.8]:

[7.22] $k(x_{it}, x_{jr}) = \langle\phi(x_{it}), \phi(x_{jr})\rangle$

where r and t are the task indices associated with each sample. Given the properties of the kernels used, the product of two kernels δrt k(xit, xjr) is a valid kernel. Therefore, the following function:

$G_{rt}(x_{it}, x_{jr}) = \frac{1}{\mu}\,k(x_{it}, x_{jr}) + \delta_{rt}\,k(x_{it}, x_{jr}) = \left(\frac{1}{\mu} + \delta_{rt}\right)k(x_{it}, x_{jr})$

is a linear combination of two valid kernels with positive coefficients 1/μ and 1. We can thus solve the optimization problem associated with MTL [7.11] like that of a conventional 1-SVM with a new kernel function Grt(xit, xjr). The decision function for each task is given by:

[7.23] $f_t(x) = \operatorname{sign}\left(\sum_{r=1}^{T}\sum_{j=1}^{m}\alpha_{jr}\,G_{rt}(x_{jr}, x) - \rho_t\right)$
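The sketch below (our own illustration, with toy data and helper names of our choosing) shows how such a multitask Gram matrix, assuming the coupling form (1/μ + δrt) reconstructed above, can be fed to an off-the-shelf 1-SVM solver through a precomputed kernel. Note that a standard solver uses a single offset and a single sum constraint on the multipliers rather than one per task, so this is only an approximation of the formulation above.

```python
# Multitask one-class learning via a precomputed (1/mu + delta_rt) * k Gram matrix.
import numpy as np
from sklearn.svm import OneClassSVM

def gaussian_kernel(X, Y, sigma):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def multitask_gram(X, tasks, Z, tasks_z, sigma, mu):
    """G[(i,t),(j,r)] = (1/mu + 1[t == r]) * k(x_it, x_jr)."""
    same_task = (tasks[:, None] == tasks_z[None, :]).astype(float)
    return (1.0 / mu + same_task) * gaussian_kernel(X, Z, sigma)

# Stack the samples of all tasks and keep track of their task index.
T, m, d = 4, 200, 4
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(size=(m, d)) for _ in range(T)])  # toy "healthy" data
task_id = np.repeat(np.arange(T), m)

G_train = multitask_gram(X_train, task_id, X_train, task_id, sigma=0.5, mu=1.0)
clf = OneClassSVM(kernel="precomputed", nu=0.01).fit(G_train)

# Scoring new observations of, say, task 2 requires the Gram matrix between the
# new points and all training points (with their task labels).
X_new, task_new = rng.normal(size=(10, d)), np.full(10, 2)
G_new = multitask_gram(X_new, task_new, X_train, task_id, sigma=0.5, mu=1.0)
print(clf.predict(G_new))  # +1 = normal, -1 = outlier
```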

7.4. Experimental results

In this section, we present results obtained with this approach on two examples: an academic example of small dimension and the analysis of textured images. To evaluate the performances of the suggested method (MTL-OSVM), we compare it with two other methods. The first consists of learning a 1-SVM classifier independently for each task (denoted T-OSVM), and the second consists of learning a single 1-SVM classifier for all the tasks by considering them as a single task (denoted 1-OSVM). To obtain reliable results in terms of estimation, all the tests were repeated 20 times using different sets drawn at random.

In the different tests, the Gaussian kernel [7.2] was used for the methods T-OSVM and 1-OSVM. For the suggested method MTL-OSVM, the kernel is built from this Gaussian kernel as specified in equation [7.23].

The MTL-OSVM method requires us to tune several parameters: the parameters νt and σt (t = 1, …, T) of the 1-SVM classifier for each task and the regularization parameter μ. There are, in principle, 2T + 1 parameters to tune but, in order to avoid a combinatorial explosion in the number of combinations (νt, σt, μ) to be examined, we assumed that all the tasks share the same pair (ν, σ). We then suggest a two-stage procedure for the determination of the parameters (a rough sketch is given after the list):

– Determination of the values of the two parameters ν and σ of the 1-SVM classifier by cross-validation, considering the set of tasks as a single task. The cross-validation procedure consists of learning the classifier for different combinations of these two parameters defined a priori and choosing the combination that gives the smallest classification error rate on a validation set [WU 09]. This classification error rate is the sum of the non-detection rate for "abnormal" observations and the false alarm rate ("healthy" observations classified as "abnormal").

– Determination of the regularization parameter μ also by cross-validation, the values of the two parameters ν and σ being fixed.
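The following sketch (illustrative only, not the authors' protocol; the data, grids, and error function are toy placeholders) outlines this two-stage selection.

```python
# Two-stage hyperparameter selection: (nu, sigma) first, then mu.
import itertools
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
healthy_train = rng.normal(0.0, 1.0, (400, 4))   # all tasks pooled as a single task
healthy_val = rng.normal(0.0, 1.0, (200, 4))
abnormal_val = rng.normal(4.0, 1.0, (200, 4))

def total_error(nu, gamma):
    """Non-detection rate + false alarm rate on the validation sets."""
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(healthy_train)
    false_alarm = np.mean(clf.predict(healthy_val) == -1)
    non_detection = np.mean(clf.predict(abnormal_val) == +1)
    return false_alarm + non_detection

# Stage 1: choose (nu, sigma) with all tasks considered as a single task
# (gamma = 1 / (2 * sigma**2) plays the role of sigma here).
grid = itertools.product([0.01, 0.05, 0.1], [0.1, 0.5, 2.0])
nu_opt, gamma_opt = min(grid, key=lambda p: total_error(*p))
print(nu_opt, gamma_opt)

# Stage 2 (not shown): with (nu, sigma) fixed, sweep mu for the multitask kernel
# of section 7.3.2 and keep the value giving the smallest validation error.
```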

7.4.1. Academic nonlinear example

We initially tested the suggested method on an example with four tasks. The data sets are obtained as follows. For the first task, each observation xi = [xi,1 xi,2 xi,3 xi,4]T, i = 1, …, m, is generated from the following model:

[7.24] images

The data samples corresponding to the three other tasks are generated by adding Gaussian white noise of different magnitudes to the data set of the first task. The added noise is weak for the second task, with an amplitude of 1% of the range of the first data set, moderate for the third task, with an amplitude of 8%, and high for the fourth task, with an amplitude of 15%. To evaluate the non-detection rate, we have also generated a set of negative samples composed of four uniformly distributed variables [7.24]. The training set of each task contains only "healthy" data (m = 200), whereas the test set contains both "healthy" and "abnormal" data (200 observations of each kind).
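A small sketch of how such noisy task copies could be produced is given below. The base data X1 are an arbitrary placeholder, since the generating model [7.24] is not reproduced here, and "amplitude" is interpreted as the noise standard deviation expressed as a fraction of each variable's range (our assumption); the helper name is ours.

```python
# Generate the three noisy task copies from a base data set.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 4))  # placeholder for the task-1 data generated by [7.24]

def noisy_copy(X, fraction, rng):
    """Add Gaussian white noise whose std is a fraction of each variable's range (assumption)."""
    amplitude = fraction * (X.max(axis=0) - X.min(axis=0))
    return X + rng.normal(scale=amplitude, size=X.shape)

X2 = noisy_copy(X1, 0.01, rng)  # task 2: weak noise (1% of the range)
X3 = noisy_copy(X1, 0.08, rng)  # task 3: moderate noise (8%)
X4 = noisy_copy(X1, 0.15, rng)  # task 4: high noise (15%)
```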

The optimal parameters obtained for the 1-SVM are (ν, σ) = (0.01, 0.5) for this example. Figures 7.2–7.5 represent, for each of the four tasks, the variations of the average false positive, false negative, and total error rates for the three methods (MTL-OSVM, T-OSVM, and 1-OSVM) as a function of the regularization parameter μ. We observe that for small values of μ, the performances of the MTL-OSVM method coincide with those of the 1-OSVM method, which is consistent since all the tasks are then considered as a single task. When the value of μ is very high, the performances of the MTL-OSVM method are comparable to those of the method learning the tasks independently, T-OSVM. When the value of μ increases, the behavior of the MTL-OSVM method is similar for the first three tasks: its false alarm rate decreases as μ increases while its non-detection rate increases. The MTL-OSVM method clearly gives better results than the other two methods for a large range of values of the regularization parameter within the interval [0.05, 1]. However, for the fourth task (Figure 7.5), the results obtained are very different from the previous three. In fact, the non-detection and false alarm rates of the MTL-OSVM method no longer lie between those of the two other methods. This result is certainly due to the high level of noise added to the initial data set. Furthermore, contrary to the three other tasks, the performances of the MTL-OSVM method, while remaining better, are no longer as markedly different from those of the two other methods.

7.4.2. Analysis of textured images

We have also tested the suggested approach on images that contain Markovian textures [SMO 97]. Analyzing an image in terms of texture consists of classifying the pixels of the image by labeling them. The representation space X is of dimension d = 25: the attribute vector characterizing a pixel to be classified is made up of the gray levels of the sites belonging to its 5 × 5 square neighborhood, the central pixel (the pixel to be classified) included. As for the previous example of section 7.4.1, four tasks have been generated from an initial textured image. The data set associated with task 1 is made up of samples of dimension 25 that are randomly selected from the textured source image. The samples for the three other tasks are also selected from the same source image as task 1, but they are contaminated by noise of, respectively, weak (task 2), moderate (task 3), and high magnitude (task 4). "Abnormal" observations generated from a different textured image are also added to the test set. Figure 7.6 represents the different textured images used to generate the data sets. For each test, the learning set of each task is made up of 200 "healthy" observations and the test set contains, besides 200 "healthy" observations, 200 "abnormal" observations. The common values for the parameters of the 1-SVM classifier are (ν, σ) = (0.01, 300).
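The short sketch below (our own illustration; the textured image here is a random placeholder, not the Markovian textures of [SMO 97]) shows how 25-dimensional attribute vectors can be extracted from the 5 × 5 neighborhood of randomly selected pixels.

```python
# Extract 25-dimensional gray-level attribute vectors from 5x5 pixel neighborhoods.
import numpy as np

def sample_patches(image, n_samples, rng, half=2):
    """Draw n_samples random 5x5 patches and flatten each into a 25-dimensional vector."""
    h, w = image.shape
    rows = rng.integers(half, h - half, n_samples)
    cols = rng.integers(half, w - half, n_samples)
    return np.stack([image[r - half:r + half + 1, c - half:c + half + 1].ravel()
                     for r, c in zip(rows, cols)])

rng = np.random.default_rng(0)
texture = rng.integers(0, 256, size=(256, 256)).astype(float)  # placeholder textured image
X_task1 = sample_patches(texture, 200, rng)                     # 200 observations, d = 25
```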

Figure 7.2. Variation of the error rates (a) non-detection, (b) false alarm, and (c) total classification error for task 1 according to the value of regularization parameter μ (nonlinear example)

ch7-fig7.2.jpg

Figure 7.3. Variation of the error rates (a) non-detection, (b) false alarm, and (c) total classification error for task 2 according to the value of regularization parameter μ (nonlinear example)

ch7-fig7.3.jpg

Figure 7.4. Variation of the error rates (a) non-detection, (b) false alarm, and (c) total classification error for task 3 according to the value of regularization parameter μ (nonlinear example)

ch7-fig7.4.jpg

Figure 7.5. Variation of the error rates (a) non-detection, (b) false alarm, and (c) total classification error for task 4 according to the value of regularization parameter μ (nonlinear example)

ch7-fig7.5.jpg

Figure 7.6. Textured images used to generate the different data sets. (a) Textured source image for task 1. (b) Textured image for task 2 (weak noise). (c) Textured image for task 3 (moderate noise). (d) Textured image for task 4 (high noise). (e) Textured source image used to generate the "abnormal" observations

ch7-fig7.6.jpg

Table 7.1 summarizes the results obtained in this example for the three methods. We observe that the individual learning method T-OSVM generates the smallest non-detection rates for "abnormal" observations, at the cost of the highest false alarm rates. Conversely, the method 1-OSVM, which learns all the tasks as one, gives the smallest false alarm rates but the highest non-detection rates. The suggested MTL method MTL-OSVM gives better overall performances by finding a compromise between non-detection and false alarm rates.

Table 7.1. Classification error rate (%) of three methods for each task made up of textured images. ND: non-detection rate, FA: false alarm rate, Total: total classification error rate

ch7-tab7.1.gif

Figures 7.7–7.10 represent, for each of the four tasks, the variations in the average false positive, false negative, and total error rates for the three methods (MTL-OSVM, T-OSVM, and 1-OSVM) according to the regularization parameter μ. Overall, we find the same behavior of the three methods as in the previous example. The suggested method MTL-OSVM shows better performances than the two other methods for all the tasks. We thus note the advantage of the MTL-OSVM method, which balances individual learning of the tasks against learning all the tasks gathered as a single task, this compromise being determined from the data for each example.

Figure 7.7. Variation of classification error rates (a) non-detection, (b) false alarm, and (c) total classification error rate for task 1 according to the value of regularization parameter μ (textured images)

ch7-fig7.7.jpg

Figure 7.8. Variation of classification error rates (a) non-detection, (b) false alarm, and (c) total classification error rate for task 2 according to the value of regularization parameter μ (textured images)

ch7-fig7.8.jpg

Figure 7.9. Variation of classification error rates (a) non-detection, (b) false alarm, and (c) total classification error rate for task 3 according to the value of regularization parameter μ (textured images)

ch7-fig7.9.jpg

Figure 7.10. Variation of classification error rates (a) non-detection, (b) false alarm, and (c) total classification error rate for task 4 according to the value of regularization parameter μ (textured images)

ch7-fig7.10.jpg

The determination of the regularization parameter μ is thus essential. As for the previous example, by analyzing the curves of the classification error rates according to this parameter, we notice two important points. First, the quite similar tasks (tasks 1–3) have optimal values of μ very close to each other, which is not the case for task 4, for which the value of this parameter is higher. This behavior, which is quite consistent (the higher the value of μ, the more the learning tends toward individual learning), reveals the tasks that differ from the others. Second, we also observe that there is not a single optimal value but an interval of values for which the performances are comparable. The method thus shows some robustness with respect to the setting of this parameter.

7.5. Conclusion

In this chapter, we introduced the one-class SVM within the framework of MTL by hypothesizing that the solutions obtained for related tasks are close to a mean value. A regularization parameter was used in the optimization process in order to control the balance between the maximization of the margin for each 1-SVM model and the proximity of each 1-SVM to the mean model. The design of new kernels within the multitask framework, based on the kernel properties, significantly facilitates the implementation of the suggested method. The experimental validation was carried out on artificially generated one-class data sets. The results show that the simultaneous learning of several tasks enables us to improve the performances with respect to learning each task independently.

We used the same hyperparameters for all the tasks. A conceivable improvement consists of using different hyperparameter values for the different tasks. From this point of view, the composition properties of the kernels should lead to a large array of additional possibilities for the definition of new kernels within the framework of MTL. A more fundamental aspect concerns the use of previously established classification models to characterize a new machine. More generally, this MTL problem can find many applications, for instance in the healthcare sector and in the environmental domain.

7.6. Acknowledgments

This project was carried out within the framework of the PARDI project with the following partners: EDF, ANDRA, CRAN, and ICD.

7.7. Bibliography

[ABU 07] ABU-EL-ZEET Z.H., PATEL V., Method of condition monitoring, US Patent 7275018, 2007.

[BEN 03] BEN-DAVID S., SCHULLER R., “Exploiting task relatedness for multiple task learning”, Proceedings of the 16th Annual Conference on Computational Learning Theory and the 7th Kernel Workshop, Washington, DC, pp. 567–580, 2003.

[BEN 08] BEN-DAVID S., BORBELY R.S., “A notion of task relatedness yielding provable multiple-task learning guarantees”, Machine Learning, vol. 73, pp. 273–287, 2008.

[BI 08] BI J., XIONG T., YU S., DUNDAR M., RAO R.B., “An improved multi-task learning approach with applications in medical diagnosis”, Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases — Part I, Antwerp, Belgium, pp. 117–132, 2008.

[BIR 10] BIRLUTIU A., GROOT P., HESKES T., “Multi-task preference learning with an application to hearing aid personalization”, Neurocomputing, vol. 73, pp. 1177–1185, 2010.

[BOS 92] BOSER B.E., GUYON I.M., VAPNIK V.N., “A training algorithm for optimal margin classifiers”, Proceedings of the 5th Annual Workshop on Computational Learning Theory, Pittsburgh, PA, pp. 144–152, 1992.

[CAM 08] CAMCI F., CHINNAM R., “General support vector representation machine (GSVRM) for stationary and non-stationary classes”, Pattern Recognition, vol. 41, pp. 3021–3034, 2008.

[CAM 09] CAMPOS J., “Development in the application of ICT in condition monitoring and maintenance”, Computers in Industry, vol. 60, no. 1, pp. 1–20, 2009.

[CAR 97] CARUANA R., “Multitask learning”, Machine Learning, vol. 28, pp. 41–75, 1997.

[ESK 00] ESKIN E., “Anomaly detection over noisy data using learned probability distributions”, Proceedings of the 17th International Conference on Machine Learning, San Francisco, CA, pp. 255–262, 2000.

[EVG 04] EVGENIOU T., PONTIL M., “Regularized multi-task learning”, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, pp. 109–117, 2004.

[EVG 05] EVGENIOU T., MICCHELLI C.A., PONTIL M., “Learning multiple tasks with kernel methods”, Journal of Machine Learning Research, vol. 6, pp. 615–637, 2005.

[FAM 03] FAMILI A., LETOURNEAU S., O’BRIEN C., Method of identifying abnormal behaviour in a fleet of vehicles, US Patent 2003/0149550, 2003.

[GU 09] GU Q., ZHOU J., “Learning the shared subspace for multi-task clustering and transductive transfer classification”, Proceedings of the 9th IEEE International Conference on Data Mining, Miami, FL, pp. 159–168, 2009.

[HAS 04] HASIEWICZ J., HERZOG J., MARCELL R., Equipment health monitoring architecture for fleets of assets, US Patent 2004/0243636, 2004.

[HES 00] HESKES T., “Empirical bayes for learning to learn”, Proceedings of the 17th International Conference on Machine Learning, San Francisco, CA, pp. 367–374, 2000.

[HIG 04] HIGGS P.A., PARKIN R., JACKSON M., “A survey of condition monitoring systems in industry”, Proceedings of the 7th Biennal ASME Conference Engineering Systems Design and Analysis, Manchester, UK, pp. 163–178, 2004.

[JEB 04] JEBARA T., “Multi-task feature and kernel selection for SVMs”, Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, 2004.

[KAS 09] KASSAB R., ALEXANDRE F., “Incremental data-driven learning of a novelty detection model for one-class classification with application to high-dimensional noisy data”, Machine Learning, vol. 74, no. 2, pp. 191–234, 2009.

[KEM 08] KEMP C., GOODMAN N., TENENBAUM J., “Learning and using relational theories”, Advances in Neural Information Processing Systems 20, pp. 753–760, Cambridge, MA, 2008.

[KUN 03] KUNZE U., “Condition telemonitoring and diagnosis of power plants using web technology”, Progress in Nuclear Energy, vol. 43, nos. 1–4, pp. 129–136, 2003.

[MAN 01] MANEVITZ L., YOUSEF M., “One-class SVMs for document classification”, Journal of Machine Learning Research, vol. 2, pp. 139–154, 2001.

[MUL 08] MULLER A., MARQUEZ A.C., IUNG B., “On the concept of e-maintenance: review and current research”, Reliability Engineering & System Safety, vol. 93, no. 8, pp. 1165–1187, 2008.

[PAN 10] PAN S.J., YANG Q., “A survey on transfer learning”, IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

[RIT 97] RITTER G., GALLEGOS M.T., “Outliers in statistical pattern recognition and an application to automatic chromosome classification”, Pattern Recognition Letters, vol. 18, pp. 525–539, 1997.

[SCH 01] SCHÖLKOPF B., PLATT J.C., SHAWE-TAYLOR J., SMOLA A.J., WILLIAMSON R.C., “Estimating the support of a high-dimensional distribution”, Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.

[SIN 04] SINGH S., MARKOU M., “An approach to novelty detection applied to the classification of image regions”, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 396–407, 2004.

[SMO 97] SMOLARZ A., “Etude qualitative du modèle Auto-Binomial appliqué à la synthèse de texture”, Actes des XXIXèmes Journées de Statistique, Carcassonne, France, pp. 712–715, 1997.

[TAR 95] TARASSENKO L., HAYTON P., CERNEAZ N., BRADY M., “Novelty detection for the identification of masses in mammograms”, Proceedings of the 4th IEE International Conference on Artificial Neural Networks, Cambridge, UK, pp. 442–447, 1995.

[TAX 99] TAX D.M., DUIN R.P., “Support vector domain description”, Pattern Recognition Letters, vol. 20, pp. 1191–1199, 1999.

[VAP 95] VAPNIK V.N., The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

[WID 10] WIDMER C., TOUSSAINT N., ALTUN Y., RATSCH G., “Inferring latent task structure for multitask learning by multiple kernel learning”, BMC Bioinformatics, vol. 11, no. 8, p. S5, 2010.

[WU 09] WU R.-S., CHUNG W.-H., “Ensemble one-class support vector machines for content-based image retrieval”, Expert Systems with Applications, vol. 36, no. 3, pp. 4451–4459, 2009.

[YAN 10] YANG H., KING I., LYU M.R., “Multi-task learning for one-class classification”, Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, pp. 1–8, 2010.

[YOU 05] YOU S., KRAGE M., JALICS L., “Overview of remote diagnosis and maintenance for automotive systems”, Proceedings of the SAE World Congress, Detroit, MI, 2005.

[YU 10] YU L., CLEARY D., OSBORNE M.D., Method and system for diagnosing faults in a particular device within a fleet of devices, US Patent 7826943, 2010.

[ZHE 08] ZHENG V.W., PAN S.J., YANG Q., PAN J.J., “Transferring multi-device localization models using latent multi-task learning”, Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1427–1432, 2008.

 

 

1 Chapter written by Xiyan HE, Gilles MOUROT, Didier MAQUIN, José RAGOT, Pierre BEAUSEROY, André SMOLARZ and Edith GRALL-MAËS.

1 http://www.smartsignal.com.

2 http://my.epri.com.
