1.1.3.4 Classification and Prediction
Classification is a typical predictive method. The aim of classification is to
determine the class (or category) label of data objects based on a trained
model (sometimes also called a classifier). It is hard to completely separate
prediction from classification. In the data mining community,
the commonly agreed view is that classification mainly focuses on
determining a categorical attribute of data objects, while prediction
focuses on continuous-valued attributes, i.e., it is used to predict
continuous (analog) values for data objects. Because model learning and prediction are
performed with prior knowledge of the data (e.g., the known labels), this
kind of method is also known as supervised learning.
Figure 1.1.4 presents an example of supervised learning based on prior
knowledge in the form of labels, where the positive and negative objects are marked
by round and cross symbols respectively. The aim of classification is to
construct a dividing line that separates the positive points from the negative points
using the existing labels. A number of classification algorithms have been
well studied in the data mining and machine learning domains; common
and widely used approaches include Decision Trees (e.g., C4.5), Rule-based Induction,
Genetic Algorithms, Neural Networks, Bayesian Networks, Support Vector
Machines (SVM) and so on. Figure 1.1.5 shows a decision tree constructed
from observations of whether it is appropriate to play tennis depending
on weather conditions such as sunny, rainy, windy and humid
conditions. In this example, the classification rules are expressed as a set of
If-Then clauses. Apart from decision trees, classifiers that learn a decision
boundary are another important classification model. Depending on the
classification requirements, various classifiers can be trained under
supervision; e.g., Fig. 1.1.6 demonstrates a linear and a nonlinear classifier
for the debt-income example above.
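As a rough illustration of learning a dividing line from labeled points, the following sketch trains a perceptron, a simple linear classifier chosen here only for illustration (the text does not prescribe a specific algorithm). The income and debt values, the labels, and the function names are all hypothetical.

```python
# Hypothetical (income, debt) pairs in arbitrary units; +1 = positive, -1 = negative.
points = [(8, 1), (9, 2), (7, 1), (10, 3), (2, 6), (3, 8), (1, 5), (4, 9)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]


def predict(w, b, point):
    """Return +1 or -1 depending on which side of the line w.x + b = 0 the point lies."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score > 0 else -1


def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Learn a dividing line with the perceptron update rule.

    Training stops early once a full pass makes no mistakes, which is
    guaranteed to happen eventually only if the data are linearly separable.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for point, label in zip(points, labels):
            if predict(w, b, point) != label:
                # Move the line toward classifying this point correctly.
                w[0] += lr * label * point[0]
                w[1] += lr * label * point[1]
                b += lr * label
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


w, b = train_perceptron(points, labels)
print("weights:", w, "bias:", b)
print("unseen object (9, 1) ->", predict(w, b, (9, 1)))
```

A nonlinear classifier would replace the straight line with a curved boundary; the training-then-prediction workflow stays the same.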
Figure 1.1.3: Example of unsupervised learning [axes: Income, Debt]
Introduction 13
14 Applied Data Mining
Figure 1.1.4: Example of supervised learning [axes: Income, Debt]
Figure 1.1.5: Example of decision tree
Figure 1.1.6: Linear and nonlinear classification [axes: Income, Debt]
1.1.3.5 Advanced Data Mining Algorithms
Despite the great success of data mining techniques applied in different
areas and settings, there is an increasing demand for new data
mining algorithms and for improvements to state-of-the-art approaches that can handle
more complicated and dynamic problems. Meanwhile, as data mining becomes
prevalent and is deployed in real applications, new
research questions and emerging research directions have been raised in
response to advances and breakthroughs in the theory and technology of
data mining. Consequently, applied data mining is becoming an active and
fast-progressing topic that has opened up a large algorithmic space with
considerable potential for development. Here we list some interesting topics, which will be
described in subsequent chapters.
1. High-Dimensional Clustering. In general, data objects to be clustered are
described by points in a high-dimensional space, where each dimension
corresponds to an attribute/feature. A distance measure between
any two points is used to measure their similarity. Research has
shown that increasing dimensionality results in a loss of contrast
in distances between data objects. Thus, clustering algorithms that
measure the similarity between data objects based on all attributes/features
tend to degrade in high-dimensional data spaces. In addition,
the widely used distance measures usually perform effectively
only on particular subsets of attributes, where the data objects
are densely distributed. In other words, dense and reasonable clusters
of data objects are more likely to form in a low-dimensional subspace.
Recently, several algorithms for discovering clusters of data objects in
subsets of attributes have been proposed, and they can be classified
into two categories: subspace clustering and projective clustering [8].
2. Multi-Label Classification. In the framework of classification, each
object is described as an instance, which is usually a feature vector
that characterizes the object from different aspects. Moreover, each
instance is associated with one or more labels indicating its categories.
Generally speaking, the process of classification consists of two main
steps: the first is training a classifier or model on a given set of labeled
instances, and the second is using the learned classifier to predict the
labels of unseen instances. However, an instance may be assigned
multiple labels simultaneously, and problems of this type are
ubiquitous in many modern applications. Recently, there has been a
considerable amount of research on multi-label
problems, and many state-of-the-art methods have already been
proposed [3]. Multi-label classification has also been applied in many practical settings,
including text classification, gene function prediction, music emotion
analysis, semantic annotation of video, tag recommendation, etc.
3. Stream Data Mining. Data stream mining is an important issue because
it is the basis for many applications, such as network traffic, web
searches, sensor network processing, etc. The purpose of data stream
mining is to discover patterns or structures in continuous
data, which may be used later to infer events that could happen. The
distinguishing characteristic of stream data is its dynamic nature: typically,
stream data can be read only once. This property rules out many
traditional strategies for analyzing data, because those methods
assume that the entire data set can be stored in limited storage.
In other words, stream data mining can be thought of as computation
on very large (potentially unbounded) data.
4. Recommender Systems. These are important applications because they
are essential for many business models. The purpose of recommender
systems is to suggest good items to people based on their
preferences and purchase history. The basic idea of these
systems is that if users shared the same interests in the past, they
will, with high probability, behave similarly in the future. The
historical data that reflects users' preferences may consist of explicit
ratings, web click logs, or tags [6]. Clearly, personalization
plays a critical role in an effective recommendation system [7].
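The basic idea behind recommender systems described above, that users with similar past preferences tend to behave similarly, can be sketched with user-based collaborative filtering. This is a minimal illustration rather than a method prescribed by the text: the ratings, user names, item names, and function names are hypothetical, and cosine similarity is just one assumed choice of similarity measure.

```python
import math

# Hypothetical explicit ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 5, "book_b": 5, "book_c": 1, "book_d": 5},
    "carol": {"book_a": 1, "book_b": 2, "book_c": 5, "book_d": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with positive similarity has rated it.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        weighted_sum += sim * other_ratings[item]
        total_weight += sim
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


print(predict_rating(ratings, "alice", "book_d"))
```

In this toy data the prediction for "alice" on "book_d" is pulled toward the rating given by "bob", whose past ratings resemble hers more closely than those of "carol". Cosine similarity on raw ratings is always non-negative, so practical systems often mean-center the ratings first.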
1.2 Organization of the Book
This book is structured into three parts: Part 1: Fundamentals, Part 2:
Advanced Data Mining and Part 3: Emerging Applications. In Part 1, we
mainly introduce and review the fundamental concepts and mathematical
models which are commonly used in data mining. Starting from various data
types, we introduce the basic measures and data preprocessing techniques
applied in data mining. This part includes five chapters, which lay
a solid foundation and provide the skills and approaches needed for
understanding the subsequent chapters. Part 2 comprises three chapters and
addresses the topics of advanced clustering, multi-label classification and
stream data mining, which are all hot topics in applied data mining. In
Part 3, we report some recently emerging application directions in applied
data mining. In particular, we discuss privacy preservation,
recommender systems and social tagging annotation systems,
structuring the contents as a sequence of theoretical background,
state-of-the-art techniques, application cases and future research questions. We
also aim to highlight the applied potential of these challenging topics.
1.2.1 Part 1: Fundamentals
1.2.1.1 Chapter 2
Mathematics plays an important role in data mining. Because this handbook
covers a variety of research topics from related disciplines, it is
necessary to introduce some basic but crucial concepts and background so that
readers can easily proceed to the following chapters. This chapter forms an
essential and solid base for the whole book.
1.2.1.2 Chapter 3
Data preparation is the beginning of the data mining process. Data mining
results depend heavily on the quality of the data prepared before the
mining process. This chapter discusses topics related to data
preparation, covering attribute selection, data cleaning and integrity, data
federation and integration, etc.
1.2.1.3 Chapter 4
Cluster analysis forms the topic of Chapter 4. In this chapter, we classify the
proposed clustering algorithms into four categories: traditional clustering
algorithms, high-dimensional clustering algorithms, constraint-based
clustering algorithms, and consensus clustering algorithms. The traditional
data clustering approaches include partitioning methods, hierarchical
methods, density-based methods, grid-based methods, and model-based
methods. Two different kinds of high-dimensional clustering algorithms are
also described. In the constraint-based clustering subsection, the
concept is defined, the algorithms are described, and comparisons of different
algorithms are presented. The consensus clustering algorithm builds on
existing clustering results and is a new way to find robust clusterings.
1.2.1.4 Chapter 5
Chapter 5 describes the methods for data classification, including decision
tree induction, Bayesian network classification, rule-based classification,
neural network technique of back-propagation, support vector machines,
associative classification, k-nearest neighbor classifiers, case-based
reasoning, genetic algorithms, rough set theory, and fuzzy set approaches.
Issues regarding accuracy and how to choose the best classifier are also
discussed.