Let $D = \{(x_1, C_1), (x_2, C_2), \ldots, (x_n, C_n)\}$ be a dataset, where $x_i$ is its ith instance and $C_i \subseteq L$ is its set of true labels. Given a classifier $f$ and a test instance $x_i$, $Y_i$ denotes the set of labels predicted for $x_i$ by $f$. $rank(x_i)$ or $rank_i$ denotes the predicted rank of the labels, and $rank(x_i, l)$ denotes label $l$'s position in that rank.
Bipartition-based metrics simply focus on whether the labels are correctly predicted or not, without considering to what extent the labels are correctly predicted. Thus predictions with different confidences will get the same evaluation value according to these metrics. Several well-known metrics of this type are as follows:
(1) Subset accuracy. It is similar to the accuracy used in single-label learning, and computes the ratio of instances for which the predicted set of labels matches the true set of labels exactly. Its definition is given by Equation 8.5.1.
$$\text{SubsetAccuracy}(f, D) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i = C_i) \qquad (8.5.1)$$
where $I(\text{true}) = 1$ and $I(\text{false}) = 0$. Obviously it is a very strict measure, since a prediction is still counted as totally wrong even if only a few of the labels are predicted incorrectly and the rest are predicted correctly. The greater this measure is, the better the classifier's performance is, and the optimal value could be 1, which indicates that all instances' label sets are predicted exactly. However, it should be noted that it is very difficult to predict all labels correctly when there is a huge number of labels, so the actual value tends to be very small in most cases.
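To make Equation 8.5.1 concrete, here is a minimal Python sketch; representing each $Y_i$ and $C_i$ as a Python set and the function name `subset_accuracy` are illustrative assumptions rather than anything prescribed by the text.

```python
def subset_accuracy(predicted, true):
    """Fraction of instances whose predicted label set equals the true set (Eq. 8.5.1)."""
    n = len(true)
    # I(Y_i = C_i) contributes 1 only when the two label sets match exactly.
    return sum(1 for y, c in zip(predicted, true) if set(y) == set(c)) / n

# Toy example: only the first of the two instances is an exact match.
Y = [{"sports", "news"}, {"music"}]
C = [{"sports", "news"}, {"music", "art"}]
print(subset_accuracy(Y, C))  # 0.5
```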
(2) Hamming Loss. This measure was proposed by Schapire and Singer [3], and its definition is given in Equation 8.5.2.
$$\text{H-Loss}(f, D) = \frac{1}{nm}\sum_{i=1}^{n} |Y_i \,\Delta\, C_i| \qquad (8.5.2)$$
where the operator $\Delta$ calculates the symmetric difference of two sets of labels, so $|Y_i \,\Delta\, C_i|$ returns the number of misclassified labels for an instance. We can see that it is a more reasonable measure compared with Subset accuracy, since it gives a better evaluation to a classifier that predicts the majority of the labels correctly. The smaller this measure is, the better the classifier's performance, and the optimal value could be 0 when the labels of all instances are predicted correctly. Hamming Loss can also be viewed as a kind of label-based metric, since it can be decomposed by labels, and for each label the evaluation is the same as the traditional accuracy for single-label learning.
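A corresponding sketch of Equation 8.5.2, under the same assumed set-based representation (the helper name `hamming_loss` and the toy data are ours):

```python
def hamming_loss(predicted, true, m):
    """Average fraction of misclassified labels per instance (Eq. 8.5.2); m is the label-set size."""
    n = len(true)
    # The symmetric difference Y_i Δ C_i contains labels that appear in exactly one of the two sets.
    return sum(len(set(y) ^ set(c)) for y, c in zip(predicted, true)) / (n * m)

Y = [{"sports", "news"}, {"music"}]
C = [{"sports"}, {"music", "art"}]
print(hamming_loss(Y, C, m=4))  # (1 + 1) / (2 * 4) = 0.25
```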
(3) Precision. It calculates the ratio between the intersection of the two label sets and the set of true labels, as depicted in Equation 8.5.3.
$$\text{Precision}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap C_i|}{|C_i|} \qquad (8.5.3)$$
As shown, this metric computes the portion of an instance's set of true labels that is predicted correctly. The difference between it and Hamming Loss is that it only concerns the predictions of an instance's set of true labels and does not care whether the remaining labels are predicted correctly or not, while Hamming Loss takes all the labels' predictions into consideration. The greater this measure is, the better the classifier's performance is, and the optimal value could be 1 when all the instances' true labels get correct predictions.
(4) Recall. Different from Precision, this metric calculates the ratio between the intersection of the two label sets and the set of predicted labels, as shown in Equation 8.5.4.
$$\text{Recall}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap C_i|}{|Y_i|} \qquad (8.5.4)$$
It evaluates the classifier's performance from the perspective of the predicted labels, since it only focuses on the portion of an instance's set of predicted labels that are its true labels. The greater this measure is, the better the classifier's performance, and the optimal value could be 1 when all the predicted labels are true labels of the instance, even if some of its true labels are still predicted wrongly.
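The matching sketch for Equation 8.5.4, which divides by the predicted label set $Y_i$ (again, the name and representation are illustrative assumptions; every instance is assumed to receive at least one predicted label):

```python
def recall(predicted, true):
    """Average |Y_i ∩ C_i| / |Y_i| over all instances, as in Eq. 8.5.4."""
    n = len(true)
    # Assumes each Y_i is non-empty; otherwise the per-instance term is undefined.
    return sum(len(set(y) & set(c)) / len(y) for y, c in zip(predicted, true)) / n
```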
(5) $F_1$ measure. Since precision and recall evaluate a classifier from different perspectives, optimizing one of them may make the other decline. Therefore, the $F_1$ measure is introduced to make a trade-off between them and obtain a more balanced result. It is described by Equation 8.5.5.
$$F_1(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{2\,|Y_i \cap C_i|}{|Y_i| + |C_i|} \qquad (8.5.5)$$
The greater this measure is, the better the classifier's performance is, and the optimal value could be 1. The aforementioned three metrics are heavily used in information retrieval to evaluate the documents returned for an ad hoc query.
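A sketch of Equation 8.5.5 under the same assumptions as the previous snippets:

```python
def f1_measure(predicted, true):
    """Average 2|Y_i ∩ C_i| / (|Y_i| + |C_i|) over all instances, as in Eq. 8.5.5."""
    n = len(true)
    return sum(2 * len(set(y) & set(c)) / (len(y) + len(c))
               for y, c in zip(predicted, true)) / n
```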
(6) Accuracy. It evaluates the average ratio of the intersection of the two
sets of labels and the union of the two sets of labels, as depicted in
Formula 8.5.6.
$$\text{Accuracy}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap C_i|}{|Y_i \cup C_i|} \qquad (8.5.6)$$
The greater this measure is, the better the classifier's performance, and the optimal value could be 1. We can see that it is a stricter metric
compared with precision and recall, since the optimal value can be reached only when the predicted label set matches the true label set exactly.
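Equation 8.5.6 is the per-instance Jaccard similarity averaged over the dataset; a minimal sketch under the same assumed representation:

```python
def accuracy(predicted, true):
    """Average Jaccard similarity |Y_i ∩ C_i| / |Y_i ∪ C_i|, as in Eq. 8.5.6."""
    n = len(true)
    return sum(len(set(y) & set(c)) / len(set(y) | set(c))
               for y, c in zip(predicted, true)) / n
```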
All the above are bipartition-based metrics. There is another kind of metric, the rank-based metrics. While the former are based on binary predictions of the labels, the latter are based on a predicted ranking of the labels instead of explicit yes/no predictions. The following are several commonly used rank-based metrics.
(1) One-error. It evaluates how many times the top-ranked label is not in
the instance’s set of true labels, as given in Equation 8.5.7.
$$\text{One-error}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \delta\Big(\arg\min_{l \in Y_i} rank(x_i, l)\Big) \qquad (8.5.7)$$
where $\delta(l) = 1$ when $l$ is not one of $x_i$'s true labels, and $\delta(l) = 0$ otherwise. One-error is not a very rigorous metric, since it only concerns whether the top-ranked label is a true label or not, while ignoring the predictions of the remaining true labels. So it might not be able to give a reasonable evaluation of the classifier's performance. The smaller this measure is, the better the classifier's performance, and the optimal value could be 0 when the top-ranked label is a true label for all instances.
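A sketch of Equation 8.5.7, assuming the ranking for each instance is given as a list of labels ordered from most to least relevant, so that the first element is the top-ranked label (this list representation, the function name, and the toy data are our assumptions):

```python
def one_error(rankings, true):
    """Fraction of instances whose top-ranked label is not a true label (Eq. 8.5.7)."""
    n = len(true)
    # r[0] is the top-ranked label, i.e. the label at rank position 1.
    return sum(1 for r, c in zip(rankings, true) if r[0] not in c) / n

# The top-ranked label of the second instance ("rock") is not one of its true labels.
ranks = [["sports", "news", "art"], ["rock", "music", "art"]]
C = [{"sports"}, {"music", "art"}]
print(one_error(ranks, C))  # 0.5
```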
(2) Coverage. It computes how far, on average, one needs to go down the ranked list of labels in order to cover all of an instance's true labels, as given in Formula 8.5.8.
$$\text{Coverage}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \Big(\max_{l \in C_i} rank(x_i, l) - 1\Big) \qquad (8.5.8)$$
A smaller value of this measure means that more true labels are ranked before the false labels, and thus the classifier's performance is better.
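A sketch of Equation 8.5.8 under the same ranked-list representation assumed above:

```python
def coverage(rankings, true):
    """Average depth of the ranked list needed to cover all true labels, minus one (Eq. 8.5.8)."""
    n = len(true)
    # Rank positions are 1-based, so list index + 1 gives rank(x_i, l).
    return sum(max(r.index(l) + 1 for l in c) - 1 for r, c in zip(rankings, true)) / n
```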
(3) Ranking Loss. It computes the average fraction of label pairs in which a false label is ranked before a true label, as given in Equation 8.5.9.
$$\text{R-Loss}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|C_i|\,|\overline{C}_i|} \Big|\{(l_a, l_b) : rank(x_i, l_a) > rank(x_i, l_b),\ (l_a, l_b) \in C_i \times \overline{C}_i\}\Big| \qquad (8.5.9)$$
where $\overline{C}_i$ is the set of false labels of instance $x_i$. This metric examines every pair of labels in which one is a true label and the other is a false label. The smaller this measure is, the better the classifier's performance, and the optimal value could be 0 when every true label is ranked before all the false labels.
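A sketch of Equation 8.5.9; it assumes each instance's ranking orders the complete label set, and the parameter `all_labels` (the full label set $L$) is our addition for deriving the false-label set $\overline{C}_i$:

```python
def ranking_loss(rankings, true, all_labels):
    """Average fraction of (true, false) label pairs that are ordered wrongly (Eq. 8.5.9)."""
    total = 0.0
    for r, c in zip(rankings, true):
        false_labels = set(all_labels) - set(c)      # the complement set of C_i
        pos = {l: i + 1 for i, l in enumerate(r)}    # label -> 1-based rank position
        bad = sum(1 for a in c for b in false_labels if pos[a] > pos[b])
        total += bad / (len(c) * len(false_labels))
    return total / len(true)
```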
(4) Average Precision. This metric was first used in the field of information retrieval to evaluate the ranking of the documents returned for a specific query. For each true label, it computes the fraction of labels ranked above it that are also true labels, averaged over all true labels and all instances. The definition is given in Formula 8.5.10.
$$\text{AvePrec}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|C_i|} \sum_{l \in C_i} \frac{|\{l' \in C_i : rank(x_i, l') \le rank(x_i, l)\}|}{rank(x_i, l)} \qquad (8.5.10)$$
The greater this measure is, the better the classifier's performance is, and the optimal value could be 1.
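A sketch of Equation 8.5.10, again assuming the per-instance ranking is a list of labels ordered from most to least relevant:

```python
def average_precision(rankings, true):
    """Average precision over true labels and instances, as in Eq. 8.5.10."""
    total = 0.0
    for r, c in zip(rankings, true):
        pos = {l: i + 1 for i, l in enumerate(r)}    # label -> 1-based rank position
        per_label = sum(
            sum(1 for lp in c if pos[lp] <= pos[l]) / pos[l]   # true labels ranked at or above l
            for l in c
        )
        total += per_label / len(c)
    return total / len(true)
```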
Roughly speaking, all the preceding bipartition-based and rank-based metrics evaluate the performance of classifiers from different aspects. Therefore, there may be no single classification model that performs well on all metrics, and there is also no general metric that can be used to evaluate every kind of classifier. Selecting appropriate metrics depends on the particular problem and the objectives of learning. Moreover, there might be potential relationships between different metrics, which remain implicit and need further investigation.
8.5.2 Benchmark Datasets and the Statistics
In this subsection, some representative datasets used extensively by researchers are presented. Since different characteristics of multi-label data might have different impacts on the learning methods, explicit definitions of various characteristics should be given first, to assist in observing the relationship between datasets and classifiers' performances. The following are a number of fundamental statistics for characterizing multi-label datasets.
(1) Label Cardinality. It calculates the average number of labels for each instance, which is defined as follows.
$$LC(D) = \frac{1}{n}\sum_{i=1}^{n} |C_i| \qquad (8.5.11)$$
where $|C_i|$ is the number of the ith instance's true labels.
(2) Label Density. It is calculated by dividing the label cardinality by m, the size of the original label set, which is defined as follows.
$$LD(D) = \frac{1}{nm}\sum_{i=1}^{n} |C_i| \qquad (8.5.12)$$
(3) Distinct Label Sets. It counts the number of distinct label vectors that appear in the dataset, which is defined as follows:
$$DLS(D) = |\{C \mid (x, C) \in D\}| \qquad (8.5.13)$$
(4) Proportion of Distinct Label Sets. It normalizes DLS(D) by the number of instances, which is defined as follows.
$$PDLS(D) = \frac{DLS(D)}{n} \qquad (8.5.14)$$
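These four statistics are straightforward to compute from the list of per-instance label sets; the following sketch does so (the function name and the toy data are illustrative):

```python
def label_statistics(label_sets, m):
    """Compute LC, LD, DLS and PDLS (Eqs. 8.5.11-8.5.14) for a list of per-instance label sets."""
    n = len(label_sets)
    lc = sum(len(c) for c in label_sets) / n         # label cardinality
    ld = lc / m                                      # label density
    dls = len({frozenset(c) for c in label_sets})    # number of distinct label sets
    return lc, ld, dls, dls / n                      # the last value is PDLS

# Toy dataset with m = 4 possible labels.
C = [{"a", "b"}, {"a"}, {"a", "b"}]
print(label_statistics(C, m=4))  # (1.666..., 0.4166..., 2, 0.666...)
```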
Table 8.1: Description of representative multi-label benchmark datasets
name instances attributes labels LC LD DLS PDLS
bibtex[7] 7395 1836 159 2.402 0.015 2856 0.386
bookmarks[7] 87856 2150 208 2.028 0.010 18716 0.213
CAL500[31] 502 68 174 26.044 0.150 502 1
corel5k[32] 5000 499 374 3.522 0.009 3175 0.635
delicious[5] 16105 500 983 19.020 0.019 15806 0.981
emotions[33] 593 72 6 1.869 0.311 27 0.045
enron 1702 1001 53 3.378 0.064 753 0.442
genbase[34] 662 1186 27 1.252 0.046 32 0.516
mediamill[35] 43907 120 101 4.376 0.043 6555 0.149
medical 978 1449 45 1.245 0.028 94 0.096
rcv1v2[36] 6000 47236 101 2.880 0.029 1028 0.171
tmc2007[37] 28596 49060 22 2.158 0.098 1341 0.046
scene[12] 2407 294 6 1.074 0.179 15 0.006
yeast[38] 2417 103 14 4.237 0.303 198 0.081
As mentioned at the beginning, multi-label data are ubiquitous in real applications, including text analysis, image classification, prediction of gene functions, etc. Researchers have extracted multiple benchmark datasets from these practical problems and used them to examine and compare the performance of various multi-label methods. Tsoumakas et al. have summarized a number of commonly used datasets, with corresponding information including the source reference, the number of instances, features and labels, and other statistics [9]. Table 8.1 gives the detailed descriptions of the datasets, and all of them are available for download at the homepage of Mulan, an open source platform for multi-label learning [30].
8.6 Chapter Summary
So far we have introduced the definition of multi-label classification and various types of algorithms for it. The typical evaluation metrics and datasets used for experiments have also been given. Although a great deal has been achieved, the problem is actually still not tackled very well. The following issues deserve sufficient attention and need further research.
The first one is the instance sparseness problem, since the number of possible label vectors grows explosively with the size m of the original label space. For example, the size of the possible label space would be $2^{20}$ even if m is only 20. Consequently, the number of positive