200 Applied Data Mining
(4) Proportion of Distinct Label Sets. It normalizes DLS(D) by the number
of instances, which is defined as follows:

PDLS(D) = DLS(D) / n        (8.5.14)
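The statistics reported in Table 8.1 can be computed directly from a binary label matrix. The sketch below uses a small hypothetical dataset (not one of the benchmarks) to illustrate label cardinality (LC), label density (LD), the number of distinct label sets (DLS), and PDLS as defined in Eq. (8.5.14):

```python
import numpy as np

# Toy multi-label dataset (hypothetical): each row is an instance,
# each column a label; 1 means the label is relevant to the instance.
Y = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

n, m = Y.shape                        # n instances, m labels
LC = Y.sum() / n                      # label cardinality: avg. labels per instance
LD = LC / m                           # label density: LC normalized by label count
DLS = len({tuple(row) for row in Y})  # number of distinct label sets
PDLS = DLS / n                        # Eq. (8.5.14): DLS normalized by n

print(LC, LD, DLS, PDLS)              # 2.0 0.5 3 0.75
```

Note that PDLS close to 1 (as for CAL500 in Table 8.1, where every instance has a unique label set) indicates that almost no label combination repeats, which makes label-combination methods such as label powerset difficult to apply.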
Table 8.1: Description of representative multi-label benchmark datasets
name insts atts labels LC LD DLS PDLS
bibtex[7] 7395 1836 159 2.402 0.015 2856 0.386
bookmarks[7] 87856 2150 208 2.028 0.010 18716 0.213
CAL500[31] 502 68 174 26.044 0.150 502 1
corel5k[32] 5000 499 374 3.522 0.009 3175 0.635
delicious[5] 16105 500 983 19.020 0.019 15806 0.981
emotions[33] 593 72 6 1.869 0.311 27 0.045
enron 1702 1001 53 3.378 0.064 753 0.442
genbase[34] 662 1186 27 1.252 0.046 32 0.516
mediamill[35] 43907 120 101 4.376 0.043 6555 0.149
medical 978 1449 45 1.245 0.028 94 0.096
rcv1v2[36] 6000 47236 101 2.880 0.029 1028 0.171
tmc2007[37] 28596 49060 22 2.158 0.098 1341 0.046
scene[12] 2407 294 6 1.074 0.179 15 0.006
yeast[38] 2417 103 14 4.237 0.303 198 0.081
As mentioned at the beginning, multi-label data are ubiquitous in
real applications, including text analysis, image classification, prediction
of gene functions, etc. Researchers have extracted multiple benchmark
datasets from these practical problems and used them to examine and
compare the performance of various multi-label methods. Tsoumakas et al.
have summarized a number of commonly used datasets, with corresponding
information including the source reference, the numbers of instances, features,
and labels, and other statistics [9]. Table 8.1 gives detailed descriptions
of the datasets; all of them are available for download from the homepage
of Mulan, an open-source platform for multi-label learning [30].
8.6 Chapter Summary
So far we have introduced the definition of multi-label classification and
various types of algorithms for it. The typical evaluation metrics and datasets
used in experiments have also been presented. Although a great deal has
been achieved, the problem is actually not yet tackled very well. The following
issues still deserve attention and require further research.
The first one is the instance sparseness problem: the number of
possible label sets grows exponentially with the size of the
original label space m. For example, even if m is only 20, there are
2^20 (more than one million) possible label sets. Consequently, the number of positive