Let $D = \{(x_1, C_1), (x_2, C_2), \ldots, (x_n, C_n)\}$ be a dataset, where $x_i$ is its ith instance and $C_i \subseteq L$ is its set of true labels. Given a classifier $f$ and a test instance $x_i$, $Y_i$ denotes the set of labels predicted for $x_i$ by $f$. $rank(x_i)$ or $rank_i$ denotes the predicted rank of the labels, and $rank(x_i, l)$ denotes label $l$'s position in that rank.
Bipartition-based metrics simply focus on whether the labels are correctly predicted or not, without considering to what extent the labels are correctly predicted. Thus predictions with different confidences will get the same evaluation value according to these metrics. Several well-known metrics of this type are as follows:
(1) Subset accuracy. It is similar to the accuracy used in single-label learning, and computes the ratio of instances for which the predicted set of labels matches the true set of labels exactly. Its definition is given by Equation 8.5.1.
$$\text{SubsetAccuracy}(f, D) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i = C_i) \qquad (8.5.1)$$
where $I(\text{true}) = 1$ and $I(\text{false}) = 0$. Obviously it is a very strict measure, since a prediction is still counted as totally wrong even if only a few of the labels are predicted incorrectly and the rest are predicted correctly. The greater this measure is, the better the classifier's performance is, and the optimal value could be 1, which indicates that all instances' label sets are predicted exactly. However, it should be noted that it is very difficult to predict all labels correctly when there is a huge number of labels, so the actual value tends to be very small in most cases.
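To make Equation 8.5.1 concrete, here is a minimal Python sketch; representing each $Y_i$ and $C_i$ as a Python set and the function name `subset_accuracy` are illustrative assumptions rather than anything prescribed by the text.

```python
def subset_accuracy(predicted, true):
    """Fraction of instances whose predicted label set equals the true set (Eq. 8.5.1)."""
    n = len(true)
    # I(Y_i = C_i) contributes 1 only when the two label sets match exactly.
    return sum(1 for y, c in zip(predicted, true) if set(y) == set(c)) / n

# Toy example: only the first of the two instances is an exact match.
Y = [{"sports", "news"}, {"music"}]
C = [{"sports", "news"}, {"music", "art"}]
print(subset_accuracy(Y, C))  # 0.5
```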
(2) Hamming Loss. This measure was proposed by Schapire and Singer [3], and its definition is given in Equation 8.5.2.
$$\text{H-Loss}(f, D) = \frac{1}{nm}\sum_{i=1}^{n} |Y_i \,\Delta\, C_i| \qquad (8.5.2)$$
where the operator $\Delta$ calculates the symmetric difference of two sets of labels, so $|Y_i \,\Delta\, C_i|$ returns the number of misclassified labels for an instance. We can see that it is a more reasonable measure compared with Subset accuracy, since it gives a better evaluation to a classifier that predicts the majority of the labels correctly. The smaller this measure is, the better the classifier's performance, and the optimal value could be 0 when the labels of all instances are predicted correctly. Hamming Loss can also be viewed as a kind of label-based metric, since it can be decomposed by labels, and for each label the evaluation is the same as the traditional accuracy for single-label learning.
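A corresponding sketch of Equation 8.5.2, under the same assumed set-based representation (the helper name `hamming_loss` and the toy data are ours):

```python
def hamming_loss(predicted, true, m):
    """Average fraction of misclassified labels per instance (Eq. 8.5.2); m is the label-set size."""
    n = len(true)
    # The symmetric difference Y_i Δ C_i contains labels that appear in exactly one of the two sets.
    return sum(len(set(y) ^ set(c)) for y, c in zip(predicted, true)) / (n * m)

Y = [{"sports", "news"}, {"music"}]
C = [{"sports"}, {"music", "art"}]
print(hamming_loss(Y, C, m=4))  # (1 + 1) / (2 * 4) = 0.25
```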
(3) Precision. It calculates the ratio between the intersection of the two label sets and the set of true labels, as depicted in Equation 8.5.3.
$$\text{Precision}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap C_i|}{|C_i|} \qquad (8.5.3)$$
As shown, this metric computes the portion of an instance's set of true labels that is predicted correctly. The difference between it and Hamming Loss is that it only concerns the predictions of an instance's set of true labels and does not care whether the remaining labels are predicted correctly or not, while Hamming Loss takes all the labels' predictions into consideration. The greater this measure is, the better the classifier's performance is, and the optimal value could be 1 when all the instances' true labels get correct predictions.
(4) Recall. Different from Precision, this metric calculates the ratio between the intersection of the two label sets and the set of predicted labels, as shown in Equation 8.5.4.
$$\text{Recall}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap C_i|}{|Y_i|} \qquad (8.5.4)$$
It evaluates the classifier's performance from the perspective of the predicted labels, since it only focuses on the portion of an instance's set of predicted labels that are its true labels. The greater this measure is, the better the classifier's performance, and the optimal value could be 1 when all the predicted labels are true labels of the instance, even if some of its true labels are still predicted wrongly.
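The matching sketch for Equation 8.5.4, which divides by the predicted label set $Y_i$ (again, the name and representation are illustrative assumptions; every instance is assumed to receive at least one predicted label):

```python
def recall(predicted, true):
    """Average |Y_i ∩ C_i| / |Y_i| over all instances, as in Eq. 8.5.4."""
    n = len(true)
    # Assumes each Y_i is non-empty; otherwise the per-instance term is undefined.
    return sum(len(set(y) & set(c)) / len(y) for y, c in zip(predicted, true)) / n
```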
(5) $F_1$ measure. Since precision and recall evaluate a classifier from different perspectives, optimizing one of them may make the other decline. Therefore, the $F_1$ measure is introduced to make a trade-off between them and obtain a more balanced result. It is described by Equation 8.5.5.
$$F_1(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{2\,|Y_i \cap C_i|}{|Y_i| + |C_i|} \qquad (8.5.5)$$
The greater this measure is, the better the classifier's performance is, and the optimal value could be 1. The aforementioned three metrics are heavily used in information retrieval to evaluate the documents returned for an ad hoc query.
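A sketch of Equation 8.5.5 under the same assumptions as the previous snippets:

```python
def f1_measure(predicted, true):
    """Average 2|Y_i ∩ C_i| / (|Y_i| + |C_i|) over all instances, as in Eq. 8.5.5."""
    n = len(true)
    return sum(2 * len(set(y) & set(c)) / (len(y) + len(c))
               for y, c in zip(predicted, true)) / n
```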
(6) Accuracy. It evaluates the average ratio of the intersection of the two
sets of labels and the union of the two sets of labels, as depicted in
Formula 8.5.6.
$$\text{Accuracy}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap C_i|}{|Y_i \cup C_i|} \qquad (8.5.6)$$
The greater this measure is, the better the classifier's performance, and the optimal value could be 1. We can see that it is a stricter metric
compared with precision and recall, since the optimal value can be reached only when the predicted label set matches the true label set exactly.
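Equation 8.5.6 is the per-instance Jaccard similarity averaged over the dataset; a minimal sketch under the same assumed representation:

```python
def accuracy(predicted, true):
    """Average Jaccard similarity |Y_i ∩ C_i| / |Y_i ∪ C_i|, as in Eq. 8.5.6."""
    n = len(true)
    return sum(len(set(y) & set(c)) / len(set(y) | set(c))
               for y, c in zip(predicted, true)) / n
```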
All the above are bipartition-based metrics. There is another kind of metric, the rank-based metrics. While the former are based on binary predictions of the labels, the latter are based on a predicted ranking of the labels instead of explicit yes/no predictions. The following are several commonly used rank-based metrics.
(1) One-error. It evaluates how many times the top-ranked label is not in
the instance’s set of true labels, as given in Equation 8.5.7.
$$\text{One-error}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \delta\Big(\arg\min_{l \in Y_i} rank(x_i, l)\Big) \qquad (8.5.7)$$
where $\delta(l) = 1$ when $l$ is not one of $x_i$'s true labels, and $\delta(l) = 0$ otherwise. One-error is not a very rigorous metric, since it only concerns whether the top-ranked label is a true label or not, while ignoring the predictions of the remaining true labels. So it might not be able to give a reasonable evaluation of the classifier's performance. The smaller this measure is, the better the classifier's performance, and the optimal value could be 0 when the top-ranked label is a true label for all instances.
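A sketch of Equation 8.5.7, assuming the ranking for each instance is given as a list of labels ordered from most to least relevant, so that the first element is the top-ranked label (this list representation, the function name, and the toy data are our assumptions):

```python
def one_error(rankings, true):
    """Fraction of instances whose top-ranked label is not a true label (Eq. 8.5.7)."""
    n = len(true)
    # r[0] is the top-ranked label, i.e. the label at rank position 1.
    return sum(1 for r, c in zip(rankings, true) if r[0] not in c) / n

# The top-ranked label of the second instance ("rock") is not one of its true labels.
ranks = [["sports", "news", "art"], ["rock", "music", "art"]]
C = [{"sports"}, {"music", "art"}]
print(one_error(ranks, C))  # 0.5
```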
(2) Coverage. It computes how far, on average, one needs to go down the ranked list of labels in order to cover all of an instance's true labels, as given in Formula 8.5.8.
$$\text{Coverage}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \Big(\max_{l \in C_i} rank(x_i, l) - 1\Big) \qquad (8.5.8)$$
A smaller value of this measure means that more true labels are ranked before the false labels, and thus the classifier's performance is better.
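A sketch of Equation 8.5.8 under the same ranked-list representation assumed above:

```python
def coverage(rankings, true):
    """Average depth of the ranked list needed to cover all true labels, minus one (Eq. 8.5.8)."""
    n = len(true)
    # Rank positions are 1-based, so list index + 1 gives rank(x_i, l).
    return sum(max(r.index(l) + 1 for l in c) - 1 for r, c in zip(rankings, true)) / n
```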
(3) Ranking Loss. It computes the average fraction of label pairs in which a false label is ranked before a true label, as given in Equation 8.5.9.
$$\text{R-Loss}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|C_i|\,|\overline{C}_i|} \Big|\{(l_a, l_b) : rank(x_i, l_a) > rank(x_i, l_b),\ (l_a, l_b) \in C_i \times \overline{C}_i\}\Big| \qquad (8.5.9)$$
where $\overline{C}_i$ is the set of false labels of instance $x_i$. This metric examines every pair of labels in which one is a true label and the other is a false label. The smaller this measure is, the better the classifier's performance, and the optimal value could be 0 when every true label is ranked before all the false labels.
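A sketch of Equation 8.5.9; it assumes each instance's ranking orders the complete label set, and the parameter `all_labels` (the full label set $L$) is our addition for deriving the false-label set $\overline{C}_i$:

```python
def ranking_loss(rankings, true, all_labels):
    """Average fraction of (true, false) label pairs that are ordered wrongly (Eq. 8.5.9)."""
    total = 0.0
    for r, c in zip(rankings, true):
        false_labels = set(all_labels) - set(c)      # the complement set of C_i
        pos = {l: i + 1 for i, l in enumerate(r)}    # label -> 1-based rank position
        bad = sum(1 for a in c for b in false_labels if pos[a] > pos[b])
        total += bad / (len(c) * len(false_labels))
    return total / len(true)
```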
(4) Average Precision. This metric was first used in the field of information retrieval to evaluate the ranking of the documents returned for a specific query. For each true label, it computes the fraction of labels ranked above it that are also true labels, averaged over all true labels and all instances. The definition is given in Formula 8.5.10.
$$\text{AvePrec}(f, D) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|C_i|} \sum_{l \in C_i} \frac{|\{l' \in C_i : rank(x_i, l') \le rank(x_i, l)\}|}{rank(x_i, l)} \qquad (8.5.10)$$
The greater this measure is, the better the classifier's performance is, and the optimal value could be 1.
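A sketch of Equation 8.5.10, again assuming the per-instance ranking is a list of labels ordered from most to least relevant:

```python
def average_precision(rankings, true):
    """Average precision over true labels and instances, as in Eq. 8.5.10."""
    total = 0.0
    for r, c in zip(rankings, true):
        pos = {l: i + 1 for i, l in enumerate(r)}    # label -> 1-based rank position
        per_label = sum(
            sum(1 for lp in c if pos[lp] <= pos[l]) / pos[l]   # true labels ranked at or above l
            for l in c
        )
        total += per_label / len(c)
    return total / len(true)
```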
Roughly speaking, all the preceding bipartition-based and rank-based metrics evaluate the performance of classifiers from different aspects. Therefore, there may be no single classification model that performs well on all metrics, and there is also no general metric that can be used to evaluate every kind of classifier. Selecting appropriate metrics depends on the particular problem and the objectives of learning. Moreover, there might be potential relationships between different metrics, which remain implicit and need further investigation.
8.5.2 Benchmark Datasets and the Statistics
In this subsection, some representative datasets used extensively by researchers are presented. Since different characteristics of multi-label data might have different impacts on the learning methods, explicit definitions of various characteristics should be given first, to assist in observing the relationship between datasets and classifiers' performances. The following are a number of fundamental statistics for characterizing multi-label datasets.
(1) Label Cardinality. It calculates the average number of labels for each instance, which is defined as follows.
$$LC(D) = \frac{1}{n}\sum_{i=1}^{n} |C_i| \qquad (8.5.11)$$
where $|C_i|$ is the number of the ith instance's true labels.
(2) Label Density. It is calculated by dividing the label cardinality by m, the size of the original label set, which is defined as follows.
$$LD(D) = \frac{1}{nm}\sum_{i=1}^{n} |C_i| \qquad (8.5.12)$$
(3) Distinct Label Sets. It counts the number of distinct label vectors that appear in the dataset, which is defined as follows:
$$DLS(D) = |\{C \mid (x, C) \in D\}| \qquad (8.5.13)$$
(4) Proportion of Distinct Label Sets. It normalizes DLS(D) by the number of instances, which is defined as follows.
$$PDLS(D) = \frac{DLS(D)}{n} \qquad (8.5.14)$$
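These four statistics are straightforward to compute from the list of per-instance label sets; the following sketch does so (the function name and the toy data are illustrative):

```python
def label_statistics(label_sets, m):
    """Compute LC, LD, DLS and PDLS (Eqs. 8.5.11-8.5.14) for a list of per-instance label sets."""
    n = len(label_sets)
    lc = sum(len(c) for c in label_sets) / n         # label cardinality
    ld = lc / m                                      # label density
    dls = len({frozenset(c) for c in label_sets})    # number of distinct label sets
    return lc, ld, dls, dls / n                      # the last value is PDLS

# Toy dataset with m = 4 possible labels.
C = [{"a", "b"}, {"a"}, {"a", "b"}]
print(label_statistics(C, m=4))  # (1.666..., 0.4166..., 2, 0.666...)
```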
Table 8.1: Description of representative multi-label benchmark datasets
name instances attributes labels LC LD DLS PDLS
bibtex[7] 7395 1836 159 2.402 0.015 2856 0.386
bookmarks[7] 87856 2150 208 2.028 0.010 18716 0.213
CAL500[31] 502 68 174 26.044 0.150 502 1
corel5k[32] 5000 499 374 3.522 0.009 3175 0.635
delicious[5] 16105 500 983 19.020 0.019 15806 0.981
emotions[33] 593 72 6 1.869 0.311 27 0.045
enron 1702 1001 53 3.378 0.064 753 0.442
genbase[34] 662 1186 27 1.252 0.046 32 0.516
mediamill[35] 43907 120 101 4.376 0.043 6555 0.149
medical 978 1449 45 1.245 0.028 94 0.096
rcv1v2[36] 6000 47236 101 2.880 0.029 1028 0.171
tmc2007[37] 28596 49060 22 2.158 0.098 1341 0.046
scene[12] 2407 294 6 1.074 0.179 15 0.006
yeast[38] 2417 103 14 4.237 0.303 198 0.081
As mentioned at the beginning, multi-label data are ubiquitous in real applications, including text analysis, image classification, prediction of gene functions, etc. Researchers have extracted multiple benchmark datasets from these practical problems and used them to examine and compare the performance of various multi-label methods. Tsoumakas et al. have summarized a number of commonly used datasets, with corresponding information including the source reference, the number of instances, features and labels, and other statistics [9]. Table 8.1 gives the detailed descriptions of the datasets, and all of them are available for download at the homepage of Mulan, an open source platform for multi-label learning [30].
8.6 Chapter Summary
So far we have introduced the definition of multi-label classification and various types of algorithms for it. The typical evaluation metrics and datasets used for experiments have also been given. Although a great deal has been achieved, the problem is actually still not tackled very well. The following issues deserve sufficient attention and need further research.
The first one is the instance sparseness problem, since the number of possible label vectors grows explosively with the size m of the original label space. For example, the size of the possible label space would be $2^{20}$ even if m is only 20. Consequently, the number of positive