CHAPTER 8
Multi-Label Classification
8.1 Introduction
With the advent of affordable digital devices and the rapid development
of the Internet, a huge amount of data has been generated, and the volume is
still increasing explosively. In most cases, such data needs to be analysed
so that the potential knowledge hidden behind it can be extracted
and summarized. Classification, also known as supervised learning, is a
fundamental technique for data analysis. In the framework of classification,
each object is described as an instance, which is usually a feature vector
that characterizes the object from different aspects. Moreover, each instance
is associated with one or more labels indicating its categories. Generally
speaking, the process of classification consists of two main steps: the first is
training a classifier based on a set of labelled instances; the second is using
the classifier to predict the labels of unseen instances.
It is usually assumed that each instance has only one label. Let X denote
the instance space and L be a set of labels; a single-label training set can
then be denoted as D = {(x_1, l_1), (x_2, l_2), . . . , (x_n, l_n)}, where x_i is the ith
instance and l_i is its relevant label taken from L. The objective of
classification can be viewed as learning a mapping f : X → Y from the
instance space to the label space, based on the training dataset D. Generally,
this kind of learning is also called single-label classification [1].
However, instances might be assigned multiple labels simultaneously,
and problems of this type are ubiquitous in many modern applications.
For instance, Fig. 8.1.1 gives two examples of multi-label objects: a film
could be tagged as both action and adventure, and an outdoor scene could
also be viewed as a sea scene. Learning from this kind of instance is
accordingly called multi-label classification.
Recently, there has been a considerable amount of research concerned
with multi-label problems, and many state-of-the-art
methods have already been proposed. Multi-label classification has also
been applied to many practical applications, including text classification
[2, 3], gene function prediction [4], music emotion analysis [5], semantic
annotation of video [6], tag recommendation [7, 8] and so forth. Motivated
by these growing applications, multi-label classification is becoming a
hotspot, attracting more and more academic researchers and industrial
practitioners.
In this chapter, a comprehensive and systematic study of multi-label
classification is carried out, in order to give a clear description of what
multi-label classification is and to highlight the basic and representative
methods. The chapter is organized as follows. First, we introduce the
formal concept of multi-label classification and several definitions from
different perspectives. Second, we divide the multi-label methods into
several types, and provide a thorough description of the state-of-the-art
methods of each type. Subsequently, we present a number of frequently-used
benchmark datasets and evaluation metrics, which are used to examine
and compare the performance of different methods. Finally, conclusions
and remaining challenges are given in the last section.
Figure 8.1.1: Two examples of multi-label instances: (a) a movie with multiple genres; (b) a scene with several meanings
8.2 What is Multi-label Classification
To begin with, let us give the formal definition of multi-label classification,
in order to gain a better comprehension of it and to make the analysis and
comparison of the following algorithms easier.
Let X denote the instance space and L = {l_1, l_2, . . . , l_m} be a set of labels,
so that D = {(x_1, C_1), (x_2, C_2), . . . , (x_n, C_n)} can be used to denote a set of
multi-label instances, where x_k ∈ X is an instance and C_k ⊆ L is the subset
of L containing x_k's true labels. The target of multi-label classification is
thus to learn a classifier f : X → 2^L, a mapping from the instance space to
the space consisting of all possible subsets of L, where 2^L is the power
set of L.
C_k can also be represented as a Boolean vector y_k = (b_k1, b_k2, . . . , b_km),
where b_kj = 1 indicates that label l_j is one of x_k's true labels (l_j ∈ C_k),
whereas b_kj = 0 indicates the opposite.
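To make the notation concrete, here is a minimal Python sketch (the function name and the toy data are our own illustration, not from the chapter) that encodes each label set C_k as a Boolean indicator vector y_k:

```python
# A minimal sketch: encode each instance's label set C_k as a Boolean
# vector y_k over L. Toy data and names are illustrative.

labels = ["l1", "l2", "l3", "l4"]        # the label set L, so m = 4

# Label sets C_k for four instances x_1 .. x_4.
label_sets = [
    {"l1", "l3"},
    {"l1", "l2", "l4"},
    {"l2"},
    {"l2", "l4"},
]

def to_boolean_vector(C, labels):
    """Return y_k = (b_k1, ..., b_km) with b_kj = 1 iff l_j is in C_k."""
    return [1 if l in C else 0 for l in labels]

Y = [to_boolean_vector(C, labels) for C in label_sets]
# Y == [[1, 0, 1, 0], [1, 1, 0, 1], [0, 1, 0, 0], [0, 1, 0, 1]]
```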
From the viewpoint of probability, the process of multi-label classification
can also be viewed as calculating the posterior joint probability of each
possible label vector. If x is an unlabelled instance and Y is the set of
possible label vectors, then the most reasonable label vector for instance x
should be the one with the greatest posterior probability, as illustrated by
Formula 8.2.1:
y_x = \arg\max_{y \in Y} P(y \mid x)    (8.2.1)
However, calculating the posterior probability directly using Formula 8.2.1
is an NP-hard problem, making it very difficult and time-consuming. One
practicable alternative is to learn the relationships between labels and
characterize the joint probability as a Bayesian network. For each label y_k,
let parent(y_k) denote the set of labels that y_k depends on; P(y|x) can then
be transformed as in Formula 8.2.2:
P(y \mid x) = \prod_{k=1}^{m} P(y_k \mid \mathrm{parent}(y_k), x)    (8.2.2)
where y_k denotes the kth label. Hence we can obtain the label vector's
posterior probability by computing each label's conditional probability
separately and then multiplying them together. Clearly, how to find the
dependent labels for each label emerges as a critical issue. Indeed, learning
appropriate dependencies among labels in order to improve learning
performance is becoming more and more important; the following sections
present various methods that try to incorporate label dependencies into
the learning process, and most of them indeed bring benefits.
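As a much simplified illustration of this factorization, the sketch below assumes a linear chain in which parent(y_k) = {y_1, . . . , y_{k-1}}, so the model for label k conditions on x together with the previously decoded labels, and the per-label probabilities are multiplied as in Formula 8.2.2. The scikit-learn estimator used is standard, but the chain wiring and greedy decoding are our own sketch, not the chapter's algorithm:

```python
# A minimal sketch of the chain factorization, assuming the simplified
# dependency structure parent(y_k) = {y_1, ..., y_{k-1}}. Greedy decoding
# approximates the arg max of Formula 8.2.1; it is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one binary model per label; model k sees x plus labels 1..k-1.
    Assumes every label column of Y contains both classes 0 and 1."""
    models = []
    for k in range(Y.shape[1]):
        Xk = np.hstack([X, Y[:, :k]])          # augment x with parent labels
        models.append(LogisticRegression().fit(Xk, Y[:, k]))
    return models

def predict_chain(models, x):
    """Return (label vector, product of per-label posteriors)."""
    x = np.asarray(x, dtype=float)
    preds, joint = [], 1.0
    for model in models:
        xk = np.hstack([x, preds]).reshape(1, -1)
        p1 = model.predict_proba(xk)[0, 1]     # P(y_k = 1 | parents, x)
        y_k = int(p1 >= 0.5)
        joint *= p1 if y_k else 1.0 - p1       # multiply as in Formula 8.2.2
        preds.append(y_k)
    return preds, joint
```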
In many cases the prediction can also be a vector of real values rather than
a Boolean label vector, where each value indicates the probability or
confidence of a label being one of the instance's true labels. In such cases,
the task of multi-label classification can be transformed into learning a
function f : X × L → R, where f(x, l) outputs a real value indicating the
confidence of label l being one of instance x's true labels. In fact, many
multi-label algorithms learn such a function for each label, i.e.,
f = {f_1, f_2, . . . , f_m}. The prediction for instance x then takes the form
shown in Formula 8.2.3:
f(x) = \{f_1(x), f_2(x), \ldots, f_m(x)\}    (8.2.3)
The multi-label classification problem can thus be tackled by solving a
label ranking (LR) problem, in which the labels are sorted in descending
order of their predicted values, and then a threshold t is learned
to separate the relevant labels from the irrelevant ones. Label l_i is
predicted as relevant if f_i(x) ≥ t; otherwise it is predicted as irrelevant.
LR is a very important issue in multi-label learning, not only because it is
an effective means of solving multi-label problems, but also because it is
more suitable for some real applications. For example, users of a tag
recommendation system might prefer to see the most interesting tags at the
top of the tag list, instead of just an unordered set of tags. LR therefore
offers a very interesting and useful way of solving multi-label classification.
LR models can also be learned for single-label instances to solve multi-class
problems.
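A small sketch of this ranking-and-thresholding step follows (the function name and the fixed threshold t = 0.5 are illustrative assumptions; in practice t would be learned):

```python
# A minimal sketch of label ranking: sort labels by f_i(x) in descending
# order, then apply a threshold t to split relevant from irrelevant labels.

def rank_and_select(scores, labels, t=0.5):
    """Return labels ranked by score and the subset predicted relevant."""
    ranked = sorted(zip(labels, scores), key=lambda p: p[1], reverse=True)
    relevant = [l for l, s in ranked if s >= t]
    return ranked, relevant

scores = [0.9, 0.2, 0.7, 0.4]                 # f_1(x), ..., f_m(x)
ranked, relevant = rank_and_select(scores, ["l1", "l2", "l3", "l4"])
# ranked   -> [('l1', 0.9), ('l3', 0.7), ('l4', 0.4), ('l2', 0.2)]
# relevant -> ['l1', 'l3']
```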
Nowadays, researchers have proposed many effective and efficient methods
for multi-label classification, and they mainly fall into two categories:
(1) algorithm adaptation and (2) problem transformation [9]. Algorithm
adaptation methods extend traditional single-label algorithms, such as kNN,
decision trees, Naive Bayes, etc., in order to enable them to handle
multi-label data directly [4, 10, 11]. By contrast, problem transformation
methods decompose each multi-label instance into one or several single-label
instances, so that existing methods can be used without modification
[12, 13, 14]. In other words, the algorithm adaptation strategy fits the
algorithms to the data, whereas the problem transformation strategy fits the
data to the algorithms. The primary difference between these two strategies
is that algorithm adaptation is algorithm-specific, so the strategy used in
one method usually cannot be applied to another; problem transformation, by
contrast, is algorithm-independent, so it is more flexible and can be used
with any existing model. The following sections elaborate on the methods in
each category.
8.3 Problem Transformation
As mentioned above, problem transformation is a fundamental strategy for
tackling multi-label problems. It enables most existing methods to work with
little or no modification, and has therefore gained much popularity among
researchers; various methods based on it have been proposed
[12, 13, 14, 15, 16]. Several simple methods of this kind are
All Label Assignment (ALA), No Label Assignment (NLA), Largest Label
Assignment (LLA) and Smallest Label Assignment (SLA), as summarized
by Chen et al. and used for multi-label document transformation [17]. Let us
explain these methods through the multi-label dataset shown in Table 8.1.
Table 8.1: An example of a multi-label dataset

instances    labels
x_1          l_1, l_3
x_2          l_1, l_2, l_4
x_3          l_2
x_4          l_2, l_4
Suppose that we have the set of instances {x_1, x_2, x_3, x_4} shown in
Table 8.1. For each instance, the ALA approach makes a copy of the instance
for each of its true labels, assigning one label to each copy. Namely, ALA
replaces an instance (x_i, C_i) with |C_i| instances, each associated with
one label in C_i; thus instance x_i appears |C_i| times in the transformed
dataset. The NLA approach keeps only the instances with a single label and
discards those with more than one. Different from the previous two
approaches, which either keep or delete all the labels of a multi-label
instance, LLA and SLA make a compromise by keeping only one label as the
instance's final label and discarding the remaining labels: LLA selects the
label that is most frequent among all instances, whereas SLA selects the
label that is least frequent. The results of transforming the example
dataset using these four approaches are shown in Figure 8.3.1. Chen et al.
also proposed a novel approach called entropy-based weighted label
assignment (ELA), which assigns a weight to each instance generated by the
ALA approach [17].
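The four transformations can be sketched in a few lines of Python on the Table 8.1 data (the function names are our own, and ties in label frequency are broken arbitrarily):

```python
# A minimal sketch of the ALA / NLA / LLA / SLA transformations on the
# Table 8.1 dataset. Frequency ties are broken arbitrarily by max()/min().
from collections import Counter

data = [("x1", {"l1", "l3"}),
        ("x2", {"l1", "l2", "l4"}),
        ("x3", {"l2"}),
        ("x4", {"l2", "l4"})]

def ala(data):
    """All Label Assignment: one copy of x_i per label in C_i."""
    return [(x, l) for x, C in data for l in C]

def nla(data):
    """No Label Assignment: keep only single-label instances."""
    return [(x, next(iter(C))) for x, C in data if len(C) == 1]

def label_freq(data):
    return Counter(l for _, C in data for l in C)  # here: l2:3, l1:2, l4:2, l3:1

def lla(data):
    """Largest Label Assignment: keep each instance's most frequent label."""
    freq = label_freq(data)
    return [(x, max(C, key=lambda l: freq[l])) for x, C in data]

def sla(data):
    """Smallest Label Assignment: keep each instance's least frequent label."""
    freq = label_freq(data)
    return [(x, min(C, key=lambda l: freq[l])) for x, C in data]

# e.g. lla(data) -> [('x1', 'l1'), ('x2', 'l2'), ('x3', 'l2'), ('x4', 'l2')]
#      sla(data) -> [('x1', 'l3'), ('x2', 'l1' or 'l4'), ('x3', 'l2'), ('x4', 'l4')]
```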
Generally speaking, the aforementioned approaches are straightforward,
but they ignore much of the potential information in the discarded instances
and labels, which could have been used to enhance learning performance.
Currently, the two most commonly used strategies for problem transformation
are the label powerset (LP) method, which treats the full set of true labels
of each instance as a new single label, and the binary relevance (BR) method,
which predicts each label separately [13, 9]. A considerable number of
methods based on these two strategies have been proposed; some of them are
inspected in the following parts.
8.3.1 Binary Relevance and Label Powerset
The BR method transforms a multi-label dataset D into m single-label
datasets {D_1, D_2, . . . , D_m}, one for each label. Dataset D_i contains the
same instances as D, but each instance's label is changed to 1 if l_i is
among its true labels in D, and to 0 otherwise. The datasets generated from
the dataset in Table 8.1 using the BR method are shown in Fig. 8.3.2.
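A minimal sketch of the BR transformation, reusing the data list from the previous sketch (the function name is our own):

```python
# A minimal sketch of Binary Relevance: build one binary dataset D_i per
# label l_i; a separate binary classifier is then trained on each D_i.

def binary_relevance(data, labels):
    """Return {l_i: [(x_k, 0 or 1), ...]} - one binary dataset per label."""
    return {l: [(x, int(l in C)) for x, C in data] for l in labels}

datasets = binary_relevance(data, ["l1", "l2", "l3", "l4"])
# datasets["l1"] -> [('x1', 1), ('x2', 1), ('x3', 0), ('x4', 0)]
```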