Chapter 1: Introduction (3/4)

1.1.3.4 Classification and Prediction

Classification is a typical predictive method. The aim of classification is to determine the class (or category) label of data objects using a trained model (sometimes also called a classifier). It is hard to separate prediction completely from classification. In the data mining community, the commonly agreed view is that classification focuses on determining a categorical attribute of data objects, while prediction focuses on continuous-valued attributes, i.e., it is used to predict numeric values of data objects. Because model learning and prediction are performed with prior knowledge of the data (e.g., the known labels), methods of this kind are also called supervised learning approaches.

Figure 1.1.4 presents an example of supervised learning based on prior knowledge (the labels), where the positive and negative objects are marked by round and cross symbols, respectively. The aim of classification is to construct a dividing line that separates the positive points from the negative points on the basis of the existing labels. A number of classification algorithms have been well studied in the data mining and machine learning domains; common and widely used approaches include Decision Trees (e.g., C4.5), Rule-based Induction, Genetic Algorithms, Neural Networks, Bayesian Networks, Support Vector Machines (SVM), and so on. Figure 1.1.5 shows a decision tree constructed from observations of whether it is appropriate to play tennis under given weather conditions, such as sunny, rainy, windy, or humid. In this example, the classification rules are expressed as a set of If-Then clauses. Apart from decision trees, classifiers that learn a decision boundary are another important kind of classification model. Depending on the classification requirements, various classifiers can be trained under supervision; e.g., Fig. 1.1.6 shows a linear and a nonlinear classifier for the debt-income example above.
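To make the idea of learning a dividing line from labeled points more concrete, the following is a minimal sketch using the perceptron rule, which is one simple way to fit a linear classifier and is not necessarily the method behind Figure 1.1.6. The (income, debt) values, the meaning of the +1/-1 labels, and the function names are all illustrative assumptions.

```python
# Hypothetical (income, debt) points; +1 = positive, -1 = negative.
# The values and the meaning of the labels are illustrative assumptions.
TRAINING_DATA = [
    ((6.0, 1.0), +1), ((2.0, 5.0), -1),
    ((7.0, 2.0), +1), ((3.0, 6.0), -1),
    ((8.0, 1.0), +1), ((1.0, 4.0), -1),
    ((5.0, 1.0), +1), ((2.0, 7.0), -1),
]


def train_perceptron(data, max_epochs=100):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), label in data:
            score = w[0] * x1 + w[1] * x2 + b
            if label * score <= 0:  # wrong side of (or on) the line: adjust it
                w[0] += label * x1
                w[1] += label * x2
                b += label
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def classify(w, b, point):
    """Return +1 or -1 depending on which side of the line the point falls."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score > 0 else -1


w, b = train_perceptron(TRAINING_DATA)
print("weights:", w, "bias:", b)
print("new object (9, 1) ->", classify(w, b, (9.0, 1.0)))
```

The training step uses the known labels (the supervision); the learned line is then applied to an unseen object. A nonlinear classifier would replace the straight line with a curved boundary.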

Figure 1.1.3: Example of unsupervised learning (axes: Income, Debt)

14 Applied Data Mining

Figure 1.1.4: Example of supervised learning

Figure 1.1.5: Example of decision tree

Figure 1.1.6: Linear and nonlinear classification (two panels; axes: Income, Debt)

1.1.3.5 Advanced Data Mining Algorithms

Despite the great success of data mining techniques in different areas and settings, there is an increasing demand for new data mining algorithms and for improvements to state-of-the-art approaches so that they can handle more complicated and dynamic problems. Meanwhile, as data mining is deployed more widely in real applications, new research questions and research directions have emerged in response to advances and breakthroughs in data mining theory and technology. Consequently, applied data mining is becoming an active and fast-progressing topic that has opened up a large algorithmic space with considerable potential for development. Here we list some interesting topics, which will be described in subsequent chapters.

1. High-Dimensional Clustering. In general, the data objects to be clustered are described by points in a high-dimensional space, where each dimension corresponds to an attribute/feature. A distance measure between any two points is used to quantify their similarity. Research has shown that increasing dimensionality results in a loss of contrast in the distances between data objects. Thus, clustering algorithms that measure the similarity between data objects on the basis of all attributes/features tend to degrade in high-dimensional data spaces. In addition, the widely used distance measures usually perform effectively only on particular subsets of attributes, where the data objects are densely distributed. In other words, dense and reasonable clusters of data objects are more likely to form in a low-dimensional subspace. Recently, several algorithms for discovering clusters of data objects in subsets of attributes have been proposed, and they can be classified into two categories: subspace clustering and projective clustering [8].

2. Multi-Label Classification. In the framework of classification, each object is described as an instance, usually a feature vector that characterizes the object from different aspects. Moreover, each instance is associated with one or more labels indicating its categories. Generally speaking, classification consists of two main steps: first, training a classifier or model on a given set of labeled instances; second, using the learned classifier to predict the labels of unseen instances. Traditional classification assumes that each instance carries exactly one label. In many modern applications, however, an instance may be assigned multiple labels simultaneously, and problems of this type are ubiquitous. Recently, a considerable amount of research has addressed multi-label problems, and many state-of-the-art methods have been proposed [3]. Multi-label classification has also been applied in many practical areas, including text classification, gene function prediction, music emotion analysis, semantic annotation of video, and tag recommendation.

3. Stream Data Mining. Data stream mining is an important issue because it is the basis for many applications, such as network traffic analysis, web search, and sensor network processing. The purpose of data stream mining is to discover patterns or structures in continuous data, which may be used later to infer events that could happen. The special characteristic of stream data is its dynamic nature: stream data can commonly be read only once. This property rules out many traditional analysis strategies, because they assume that the whole data set can be kept in limited storage. In other words, stream data mining can be thought of as computation on very large (potentially unbounded) data.

4. Recommender Systems. These are important applications because they are essential for many business models. The purpose of a recommender system is to suggest good items to people based on their preferences and purchase history. The basic idea of these systems is that if users shared the same interests in the past, they will, with high probability, behave similarly in the future. The historical data that reflects users' preferences may consist of explicit ratings, web click logs, or tags [6]. Clearly, personalization plays a critical role in an effective recommender system [7].
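The loss of distance contrast mentioned under High-Dimensional Clustering (item 1) can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not a result from the book: it draws uniformly random points, uses Euclidean distance, and reports a "relative contrast" (farthest minus nearest distance, divided by the nearest). The function name, sample size, and seed are hypothetical choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

As the dimension grows, the nearest and farthest points end up at almost the same distance from the query, which is why similarity computed over all attributes becomes less informative.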
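For Multi-Label Classification (item 2), one simple way to handle several labels per instance is to train one yes/no classifier per label, an approach usually called binary relevance. The text does not say which method the book uses, so the sketch below is only an assumed illustration: the feature vectors, the label names, and the 1-nearest-neighbor base learner are all hypothetical.

```python
# Hypothetical instances: (sports-word count, politics-word count) per document.
X = [(5.0, 0.0), (0.0, 5.0), (4.0, 4.0), (0.0, 0.0)]
# Each instance may carry zero, one, or several labels.
Y = [{"sports"}, {"politics"}, {"sports", "politics"}, set()]
LABELS = ["sports", "politics"]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_binary_relevance(instances, label_sets, labels):
    """Build one binary training set per label: 1 if the label applies, else 0."""
    return {
        label: [(x, int(label in ys)) for x, ys in zip(instances, label_sets)]
        for label in labels
    }


def predict_1nn(binary_training_set, x):
    """Copy the 0/1 target of the closest training instance."""
    nearest = min(binary_training_set, key=lambda pair: squared_distance(pair[0], x))
    return nearest[1]


def predict_labels(models, x):
    """Ask every per-label classifier and collect the labels that answer 1."""
    return {label for label, data in models.items() if predict_1nn(data, x) == 1}


models = fit_binary_relevance(X, Y, LABELS)
print(predict_labels(models, (5.0, 4.0)))  # an instance close to the two-label example
print(predict_labels(models, (6.0, 1.0)))  # an instance close to the sports-only example
```

Binary relevance ignores any dependence between labels; it is shown here only because it is the shortest route from single-label to multi-label prediction.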
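For Stream Data Mining (item 3), the constraint that data can be read only once means that summaries have to be updated incrementally in constant memory. A minimal sketch of this idea is the running mean and variance below, computed with Welford's online update; the class name and the sample readings are hypothetical, and the technique is chosen for illustration rather than taken from the book.

```python
class RunningStats:
    """Mean and variance of a stream, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


def sensor_stream():
    """Stand-in for an unbounded source; each reading is yielded exactly once."""
    for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield reading


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # the reading is not stored anywhere after this call

print(stats.count, stats.mean, stats.variance)
```

Only three numbers are kept regardless of how long the stream runs, which is the kind of bounded-memory behavior stream algorithms need.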
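For Recommender Systems (item 4), the idea that users with shared past interests will behave similarly can be sketched as a tiny user-based collaborative filter. Everything in the example is an assumption made for illustration: the user names, items, ratings, the cosine similarity over co-rated items, and the threshold of 4 for a "good" rating.

```python
import math

# Hypothetical explicit ratings on a 1-5 scale: user -> {item: rating}.
RATINGS = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 4},
    "carol": {"A": 1, "C": 5, "E": 5},
}


def similarity(ratings_u, ratings_v):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, min_rating=4):
    """Suggest items the most similar user rated highly and `user` has not rated."""
    others = [u for u in ratings if u != user]
    neighbor = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    return sorted(
        item
        for item, rating in ratings[neighbor].items()
        if rating >= min_rating and item not in ratings[user]
    )


print(recommend(RATINGS, "alice"))
```

Here alice and bob agree on the items they both rated, so bob's other highly rated item is suggested to alice. Practical systems combine many neighbors and handle users with few co-rated items more carefully.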

1.2 Organization of the Book

This book is structured into three parts: Part 1, Fundamentals; Part 2, Advanced Data Mining; and Part 3, Emerging Applications. In Part 1, we introduce and review the fundamental concepts and mathematical models that are commonly used in data mining. Starting from various data types, we introduce the basic measures and data preprocessing techniques applied in data mining. This part includes five chapters, which lay a solid base and provide the skills and approaches needed to understand the subsequent chapters. Part 2 covers three chapters and addresses advanced clustering, multi-label classification, and stream data mining, which are all hot topics in applied data mining. In addition, Part 3 reports some recently emerging application directions in applied data mining. In particular, we discuss privacy preserving, recommender systems, and social tagging annotation systems, structuring the contents as theoretical background, state-of-the-art techniques, application cases, and future research questions. We also aim to highlight the applied potential of these challenging topics.

1.2.1 Part 1: Fundamentals

1.2.1.1 Chapter 2

Mathematics plays an important role in data mining. Because this handbook covers a variety of research topics from related disciplines, it is necessary to introduce some basic but crucial concepts and background so that readers can proceed easily to the following chapters. This chapter forms an essential and solid base for the whole book.

1.2.1.2 Chapter 3

Data preparation is the beginning of the data mining process. Data mining results depend heavily on the quality of the data prepared before mining. This chapter discusses topics related to data preparation, covering attribute selection, data cleaning and integrity, data federation and integration, etc.

1.2.1.3 Chapter 4

Cluster analysis forms the topic of Chapter 4. In this chapter, we classify the proposed clustering algorithms into four categories: traditional clustering algorithms, high-dimensional clustering algorithms, constraint-based clustering algorithms, and consensus clustering algorithms. The traditional data clustering approaches include partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Two different kinds of high-dimensional clustering algorithms are also described. In the subsection on constraint-based clustering, the concept is defined, the algorithms are described, and a comparison of the different algorithms is presented. Consensus clustering is based on existing clustering results and is a new way to find robust clusterings.

1.2.1.4 Chapter 5

Chapter 5 describes methods for data classification, including decision tree induction, Bayesian network classification, rule-based classification, back-propagation neural networks, support vector machines, associative classification, k-nearest neighbor classifiers, case-based reasoning, genetic algorithms, rough set theory, and fuzzy set approaches. Issues regarding accuracy and how to choose the best classifier are also discussed.
