CHAPTER 8
Multi-Label Classification
8.1 Introduction
With the advent of affordable digital devices and the rapid development
of the Internet, a huge amount of data has been generated, and the volume is
still increasing explosively. In most cases, such data needs to be analysed
so that the potential knowledge hidden behind it can be extracted
and summarized. Classification, also known as supervised learning, is a
fundamental technique for data analysis. In the framework of classification,
each object is described as an instance, which is usually a feature vector
that characterizes the object from different aspects. Moreover, each instance
is associated with one or more labels indicating its categories. Generally
speaking, the process of classification consists of two main steps: the first is
training a classifier based on a set of labelled instances; the second is using
the classifier to predict the labels of unseen instances.
It is usually assumed that each instance has only one label. Let X denote
the instance space and L be a set of labels; a single-label training set can
then be denoted as D = {(x_1, l_1), (x_2, l_2), . . . , (x_n, l_n)}, where x_i is the ith
instance and l_i is its relevant label taken from L. The objective of
classification can be viewed as learning a mapping f : X → Y from the
instance space to the label space, based on the training dataset D. Generally,
this kind of learning is also called single-label classification [1].
However, instances might be assigned multiple labels simultaneously,
and problems of this type are ubiquitous in many modern applications.
For instance, Fig. 8.1.1 gives two examples of multi-label objects: a film
could be tagged as both action and adventure, and an outdoor scene could
also be viewed as a sea scene. Learning from this kind of instance is
accordingly called multi-label classification.
Recently, there has been a considerable amount of research concerned
with multi-label problems, and many state-of-the-art
methods have already been proposed. Multi-label classification has also
been applied to many practical applications, including text classification
[2, 3], gene function prediction [4], music emotion analysis [5], semantic
annotation of video [6], tag recommendation [7, 8] and so forth. Motivated
by these growing applications, multi-label classification is becoming a
hotspot, attracting more and more academic researchers and industrial
practitioners.
In this chapter, a comprehensive and systematic study of multi-label
classification is carried out, in order to give a clear description of what
multi-label classification is and to highlight the basic and representative
methods. The chapter is organized as follows. First, we introduce the
formal concept of multi-label classification and several definitions from
different perspectives. Second, we divide the multi-label methods into
several types, and provide a thorough description of the state-of-the-art
methods of each type. Subsequently, we present a number of frequently-used
benchmark datasets and evaluation metrics, which are used to examine
and compare the performance of different methods. Finally, conclusions
and remaining challenges are given in the last section.
Figure 8.1.1: Two examples of multi-label instances: (a) a movie with multiple genres; (b) a scene with several meanings
8.2 What is Multi-label Classification
To begin with, let us give the formal definition of multi-label classification,
in order to gain a better comprehension of it and to make the analysis and
comparison of the following algorithms easier.
Let X denote the instance space and L = {l_1, l_2, . . . , l_m} be a set of labels,
so that D = {(x_1, C_1), (x_2, C_2), . . . , (x_n, C_n)} can be used to denote a set of
multi-label instances, where x_k ∈ X is an instance and C_k ⊆ L is the subset
of L containing x_k's true labels. The target of multi-label classification is
thus to learn a classifier f : X → 2^L, a mapping from the instance space to
the space consisting of all possible subsets of L, where 2^L is the power
set of L.
C_k can also be represented as a Boolean vector y_k = (b_k1, b_k2, . . . , b_km),
where b_kj = 1 indicates that label l_j is one of x_k's true labels (l_j ∈ C_k),
whereas b_kj = 0 indicates the opposite.
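To make the notation concrete, here is a minimal Python sketch (the function name and the toy data are our own illustration, not from the chapter) that encodes each label set C_k as a Boolean indicator vector y_k:

```python
# A minimal sketch: encode each instance's label set C_k as a Boolean
# vector y_k over L. Toy data and names are illustrative.

labels = ["l1", "l2", "l3", "l4"]        # the label set L, so m = 4

# Label sets C_k for four instances x_1 .. x_4.
label_sets = [
    {"l1", "l3"},
    {"l1", "l2", "l4"},
    {"l2"},
    {"l2", "l4"},
]

def to_boolean_vector(C, labels):
    """Return y_k = (b_k1, ..., b_km) with b_kj = 1 iff l_j is in C_k."""
    return [1 if l in C else 0 for l in labels]

Y = [to_boolean_vector(C, labels) for C in label_sets]
# Y == [[1, 0, 1, 0], [1, 1, 0, 1], [0, 1, 0, 0], [0, 1, 0, 1]]
```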
From the viewpoint of probability, the process of multi-label classification
can also be viewed as calculating the posterior joint probability of each
possible label vector. If x is an unlabelled instance and Y is the set of
possible label vectors, then the most reasonable label vector for instance x
should be the one with the greatest posterior probability, as illustrated by
Formula 8.2.1:
y_x = \arg\max_{y \in Y} P(y \mid x)    (8.2.1)
However, calculating the posterior probability directly using Formula 8.2.1
is an NP-hard problem, making it very difficult and time-consuming. One
practicable alternative is to learn the relationships between labels and
characterize the joint probability as a Bayesian network. For each label y_k,
let parent(y_k) denote the set of labels that y_k depends on; P(y|x) can then
be transformed as in Formula 8.2.2:
P(y \mid x) = \prod_{k=1}^{m} P(y_k \mid \mathrm{parent}(y_k), x)    (8.2.2)
where y_k denotes the kth label. Hence we can obtain the label vector's
posterior probability by computing each label's conditional probability
separately and then multiplying them together. Clearly, how to find the
dependent labels for each label emerges as a critical issue. Indeed, learning
appropriate dependencies among labels in order to improve learning
performance is becoming more and more important; the following sections
present various methods that try to incorporate label dependencies into
the learning process, and most of them indeed bring benefits.
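As a much simplified illustration of this factorization, the sketch below assumes a linear chain in which parent(y_k) = {y_1, . . . , y_{k-1}}, so the model for label k conditions on x together with the previously decoded labels, and the per-label probabilities are multiplied as in Formula 8.2.2. The scikit-learn estimator used is standard, but the chain wiring and greedy decoding are our own sketch, not the chapter's algorithm:

```python
# A minimal sketch of the chain factorization, assuming the simplified
# dependency structure parent(y_k) = {y_1, ..., y_{k-1}}. Greedy decoding
# approximates the arg max of Formula 8.2.1; it is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one binary model per label; model k sees x plus labels 1..k-1.
    Assumes every label column of Y contains both classes 0 and 1."""
    models = []
    for k in range(Y.shape[1]):
        Xk = np.hstack([X, Y[:, :k]])          # augment x with parent labels
        models.append(LogisticRegression().fit(Xk, Y[:, k]))
    return models

def predict_chain(models, x):
    """Return (label vector, product of per-label posteriors)."""
    x = np.asarray(x, dtype=float)
    preds, joint = [], 1.0
    for model in models:
        xk = np.hstack([x, preds]).reshape(1, -1)
        p1 = model.predict_proba(xk)[0, 1]     # P(y_k = 1 | parents, x)
        y_k = int(p1 >= 0.5)
        joint *= p1 if y_k else 1.0 - p1       # multiply as in Formula 8.2.2
        preds.append(y_k)
    return preds, joint
```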
In many cases the prediction can also be a vector of real values rather than
a Boolean label vector, where each value indicates the probability or
confidence of a label being one of the instance's true labels. In such cases,
the task of multi-label classification can be transformed into learning a
function f : X × L → R, where f(x, l) outputs a real value indicating the
confidence of label l being one of instance x's true labels. In fact, many
multi-label algorithms learn such a function for each label, i.e.,
f = {f_1, f_2, . . . , f_m}. The prediction for instance x then takes the form
shown in Formula 8.2.3:
f(x) = \{f_1(x), f_2(x), \ldots, f_m(x)\}    (8.2.3)
The multi-label classification problem can thus be tackled by solving a
label ranking (LR) problem, in which the labels are sorted in descending
order of their predicted values, and then a threshold t is learned
to separate the relevant labels from the irrelevant ones. Label l_i is
predicted as relevant if f_i(x) ≥ t; otherwise it is predicted as irrelevant.
LR is a very important issue in multi-label learning, not only because it is
an effective means of solving multi-label problems, but also because it is
more suitable for some real applications. For example, users of a tag
recommendation system might prefer to see the most interesting tags at the
top of the tag list, instead of just an unordered set of tags. LR therefore
offers a very interesting and useful way of solving multi-label classification.
LR models can also be learned for single-label instances to solve multi-class
problems.
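A small sketch of this ranking-and-thresholding step follows (the function name and the fixed threshold t = 0.5 are illustrative assumptions; in practice t would be learned):

```python
# A minimal sketch of label ranking: sort labels by f_i(x) in descending
# order, then apply a threshold t to split relevant from irrelevant labels.

def rank_and_select(scores, labels, t=0.5):
    """Return labels ranked by score and the subset predicted relevant."""
    ranked = sorted(zip(labels, scores), key=lambda p: p[1], reverse=True)
    relevant = [l for l, s in ranked if s >= t]
    return ranked, relevant

scores = [0.9, 0.2, 0.7, 0.4]                 # f_1(x), ..., f_m(x)
ranked, relevant = rank_and_select(scores, ["l1", "l2", "l3", "l4"])
# ranked   -> [('l1', 0.9), ('l3', 0.7), ('l4', 0.4), ('l2', 0.2)]
# relevant -> ['l1', 'l3']
```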
Nowadays, researchers have proposed many effective and efficient methods
for multi-label classification, and they mainly fall into two categories:
(1) algorithm adaptation and (2) problem transformation [9]. Algorithm
adaptation methods extend traditional single-label algorithms, such as kNN,
decision trees, Naive Bayes, etc., in order to enable them to handle
multi-label data directly [4, 10, 11]. By contrast, problem transformation
methods decompose each multi-label instance into one or several single-label
instances, so that existing methods can be used without modification
[12, 13, 14]. In other words, the algorithm adaptation strategy fits the
algorithms to the data, whereas the problem transformation strategy fits the
data to the algorithms. The primary difference between these two strategies
is that algorithm adaptation is algorithm-specific, so the strategy used in
one method usually cannot be applied to another; problem transformation, by
contrast, is algorithm-independent, so it is more flexible and can be used
with any existing model. The following sections elaborate on the methods in
each category.
8.3 Problem Transformation
As mentioned above, problem transformation is a fundamental strategy for
tackling multi-label problems. It enables most existing methods to work with
little or no modification, and has therefore gained much popularity among
researchers; various methods based on it have been proposed
[12, 13, 14, 15, 16]. Several simple methods of this kind are
All Label Assignment (ALA), No Label Assignment (NLA), Largest Label
Assignment (LLA) and Smallest Label Assignment (SLA), as summarized
by Chen et al. and used for multi-label document transformation [17]. Let us
explain these methods through the multi-label dataset shown in Table 8.1.
Table 8.1: An example of a multi-label dataset

instances    labels
x_1          l_1, l_3
x_2          l_1, l_2, l_4
x_3          l_2
x_4          l_2, l_4
Suppose that we have the set of instances {x_1, x_2, x_3, x_4} shown in
Table 8.1. For each instance, the ALA approach makes a copy of the instance
for each of its true labels, assigning one label to each copy. Namely, ALA
replaces an instance (x_i, C_i) with |C_i| instances, each associated with
one label in C_i; thus instance x_i appears |C_i| times in the transformed
dataset. The NLA approach keeps only the instances with a single label and
discards those with more than one. Different from the previous two
approaches, which either keep or delete all the labels of a multi-label
instance, LLA and SLA make a compromise by keeping only one label as the
instance's final label and discarding the remaining labels: LLA selects the
label that is most frequent among all instances, whereas SLA selects the
label that is least frequent. The results of transforming the example
dataset using these four approaches are shown in Figure 8.3.1. Chen et al.
also proposed a novel approach called entropy-based weighted label
assignment (ELA), which assigns a weight to each instance generated by the
ALA approach [17].
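The four transformations can be sketched in a few lines of Python on the Table 8.1 data (the function names are our own, and ties in label frequency are broken arbitrarily):

```python
# A minimal sketch of the ALA / NLA / LLA / SLA transformations on the
# Table 8.1 dataset. Frequency ties are broken arbitrarily by max()/min().
from collections import Counter

data = [("x1", {"l1", "l3"}),
        ("x2", {"l1", "l2", "l4"}),
        ("x3", {"l2"}),
        ("x4", {"l2", "l4"})]

def ala(data):
    """All Label Assignment: one copy of x_i per label in C_i."""
    return [(x, l) for x, C in data for l in C]

def nla(data):
    """No Label Assignment: keep only single-label instances."""
    return [(x, next(iter(C))) for x, C in data if len(C) == 1]

def label_freq(data):
    return Counter(l for _, C in data for l in C)  # here: l2:3, l1:2, l4:2, l3:1

def lla(data):
    """Largest Label Assignment: keep each instance's most frequent label."""
    freq = label_freq(data)
    return [(x, max(C, key=lambda l: freq[l])) for x, C in data]

def sla(data):
    """Smallest Label Assignment: keep each instance's least frequent label."""
    freq = label_freq(data)
    return [(x, min(C, key=lambda l: freq[l])) for x, C in data]

# e.g. lla(data) -> [('x1', 'l1'), ('x2', 'l2'), ('x3', 'l2'), ('x4', 'l2')]
#      sla(data) -> [('x1', 'l3'), ('x2', 'l1' or 'l4'), ('x3', 'l2'), ('x4', 'l4')]
```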
Generally speaking, the aforementioned approaches are straightforward,
but they ignore much of the potential information in the discarded instances
and labels, which could have been used to enhance learning performance.
Currently, the two most commonly used strategies for problem transformation
are the label powerset (LP) method, which treats the full set of true labels
of each instance as a new single label, and the binary relevance (BR) method,
which predicts each label separately [13, 9]. A considerable number of
methods based on these two strategies have been proposed; some of them are
inspected in the following parts.
8.3.1 Binary Relevance and Label Powerset
The BR method transforms a multi-label dataset D into m single-label
datasets {D_1, D_2, . . . , D_m}, one for each label. Dataset D_i contains the
same instances as D, but each instance's label is changed to 1 if l_i is
among its true labels in D, and to 0 otherwise. The datasets generated from
the dataset in Table 8.1 using the BR method are shown in Fig. 8.3.2.
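A minimal sketch of the BR transformation, reusing the data list from the previous sketch (the function name is our own):

```python
# A minimal sketch of Binary Relevance: build one binary dataset D_i per
# label l_i; a separate binary classifier is then trained on each D_i.

def binary_relevance(data, labels):
    """Return {l_i: [(x_k, 0 or 1), ...]} - one binary dataset per label."""
    return {l: [(x, int(l in C)) for x, C in data] for l in labels}

datasets = binary_relevance(data, ["l1", "l2", "l3", "l4"])
# datasets["l1"] -> [('x1', 1), ('x2', 1), ('x3', 0), ('x4', 0)]
```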