Chapter 1: Introduction (2/4)


8 Applied Data Mining

remotely. Thus, data federation and consolidation are often necessary steps for dealing with the heterogeneity and homogeneity of multiple databases. All these operations constitute the data preparation and preprocessing stage of data mining.

1.1.2.3 Exploratory Analysis

Exploratory analysis is the process of examining the basic statistical properties of the data involved. The aim of this preliminary analysis is to transform the original data distribution into a new, visual form that can be understood more easily. This step provides a starting point for choosing appropriate data mining algorithms, since the suitability of an algorithm depends largely on the integrity and coherence of the data. Exploratory analysis can also identify anomalous data (entries that exhibit a distinctive distribution or occurrence, sometimes called outliers) as well as missing data. Such findings can trigger additional preprocessing operations to ensure data integrity and quality. Another purpose of this step is to signal the need to extract additional data when the data obtained is not rich enough to conduct the desired tasks. In short, this stage serves as a prerequisite that connects the analytical aims with the data mining algorithms, facilitating the analytical tasks and reducing the computational overhead of algorithm design and refinement.
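
The checks described above can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed procedure: the `explore` helper, the sample `incomes` column and the 1.5 × IQR outlier rule are assumptions introduced here, with `None` standing in for a missing entry.

```python
import statistics

def explore(values):
    """Summarize one numeric column; None marks a missing entry (assumed convention)."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Quartiles of the observed values; the middle cut point is the median.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1

    # Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {
        "count": len(present),
        "missing": missing,
        "mean": statistics.mean(present),
        "median": median,
        "stdev": statistics.stdev(present),
        "outliers": outliers,
    }

# Hypothetical column with two missing entries and one suspicious value.
incomes = [32, 35, 31, None, 36, 34, 33, 250, None, 30]
summary = explore(incomes)
print(summary)
```

A summary of this kind indicates whether further preprocessing (imputing the missing entries, inspecting the flagged value) is needed before an algorithm is chosen.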

1.1.2.4 Algorithm Design and Implementation

Algorithm design and implementation is the most important part of the whole data mining process. As discussed above, the selection of appropriate analytical algorithms is closely related to the analytical purposes, the organization of the data, the model of the analysis task and the initial exploratory analysis of the constructed data source. A wide spectrum of data mining algorithms can be used to tackle the requested tasks, so it is essential to select the appropriate ones carefully. The choice depends mainly on the data used and the nature of the analytical task. Benefiting from advances in related research communities, such as machine learning, computational intelligence and statistics, many practical and effective paradigms have been devised and successfully employed in a variety of applications. We can categorize these methods into the following approaches:

• Descriptive approach: This kind of approach aims to give a descriptive statement about the data being analyzed. To do this, we have to look closely at the distribution of the data, reveal the mutual relations among the objects, and capture the common characteristics of the data distribution via machine intelligence methods. For example, clustering analysis partitions data objects into groups that are unknown beforehand, based on the mutual distance or similarity between them. The criterion for such a partition is that objects within the same group are close to each other, while objects from different groups are sufficiently far apart. Topic modeling is a more recent descriptive learning method for detecting topical coherence within the observations. By adjusting the statistical model chosen for learning and comparing the observations with what the model derives, we can identify the hidden topic distribution underlying the observations and the associations between the topics and the data objects. In this way all objects are treated equally, and an overall statistical description is derived from the machine learning process. Because these methods rely mainly on the computational power of machines without human interaction, they are sometimes also called unsupervised approaches.

• Predictive approach: This kind of approach aims to derive operational rules for prediction. By generalizing the linkage between the outcome and the observed variables, we can induce rules or patterns for classification and prediction. These rules help us predict the unknown status of new target objects or the occurrence of specific results. To accomplish this, we have to collect sufficient data samples in advance that have already been labeled, for example, positive or negative in a pathological examination, or accept or reject in a bank credit assessment. These approaches, such as the Support Vector Machine (SVM) and the decision tree, have mainly been developed in the domain of machine learning. The results learned by such approaches are represented as a set of reasoning conditions and stored as rules to guide future prediction and judgment. One distinct feature of these approaches is that labeled samples are available beforehand and the classifier is trained on this training data, so they are also called supervised approaches (i.e., with prior knowledge and human supervision). Predictive approaches account for the majority of analytical tasks in real applications because of their value for future prediction.

• Evolutionary approach: The above two kinds of approaches are often used to deal with static data, i.e., data collected within a specific time frame. However, with the huge influx of massive data available in distributed and networked environments, dynamics has become a challenging characteristic in data mining research. This calls for evolutionary data mining algorithms that deal with changes in the temporal and spatial data within the database. Representative methods and applications include sequential pattern mining and data stream mining. The former determines the significant patterns in sequential data observations, such as customer behavior in online shopping, whereas the latter was proposed to tackle the difficulties of data stream applications, such as RFID signal sampling and processing. The main difference between this and the other approaches is the capability to deal with continuously generated signals and process them in real time at an affordable computational cost, such as limited memory and CPU usage. Such approaches have recently emerged as an active and promising trend within data mining research.

• Detective approach: The descriptive and predictive approaches focus on exploring the global properties of the data rather than local information. Sometimes analysis at a finer granularity provides more informative findings than an overall description or prediction. Detective approaches help us uncover local mutual relations at a lower level. In data mining, association rule mining and sequential pattern mining can fulfill this requirement within a specific application domain, such as business transaction data or online shopping data.
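
To make the resource constraint mentioned under the evolutionary approach concrete, the following sketch processes a stream one reading at a time while keeping only three numbers in memory. It uses Welford's online algorithm for the running mean and variance, chosen here purely as an illustration; the `RunningStats` class and the `sensor_readings` generator (a stand-in for a source such as sampled RFID signals) are hypothetical names, not part of any particular stream mining method.

```python
class RunningStats:
    """Running mean and sample variance in constant memory (Welford's algorithm)."""

    def __init__(self):
        self.n = 0        # number of readings seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Each reading is folded into the summary and can then be discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def sensor_readings():
    """Hypothetical stream; in practice the readings would arrive continuously."""
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield x


stats = RunningStats()
for reading in sensor_readings():
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

The memory used does not grow with the length of the stream, which is the property that distinguishes this style of processing from algorithms that need the whole data set at once.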

Although four categories are presented from the perspectives of data objects and analysis aims, it is worth noting that the dividing lines between these approaches are blurred and that the approaches overlap one another. In real applications, we often use a mixture of these approaches to satisfy the requirements of complexity and practicality. Often, however, existing approaches or a mixture of them fall far short of what analytical tasks in real applications require, which creates the need to design new algorithms and implement them in real scenarios with satisfactory performance. This inspires researchers from different communities to make further efforts and to fully utilize the findings from relevant areas.

Another significant issue attracting our attention is the increasing popularity of data mining in almost every aspect of business, industry and society. Real analytical questions have raised many new challenges and opportunities for researchers to work together on applied data mining, which provides a solid foundation and a real motivation for this book.

1.1.3 Data Mining Algorithms

1.1.3.1 Descriptive and Predictive

Owing to the broad applications and the unique analytical capability of data mining, a great deal of research effort has been invested, and a wide spectrum of algorithms and techniques has been developed [5]. In general, from the perspective of data mining aims, data mining algorithms can be categorized into two main streams: descriptive and predictive algorithms. Descriptive approaches aim to reveal the characteristic structure hidden in the data collection, while predictive methods build prediction models to forecast the potential attributes of new data subjects.

Various descriptive data mining approaches have been devised in the past decades, such as data characterization, discrimination, association rule mining and clustering. What these approaches have in common is the ability to present the properties of the data and describe the data distribution in a mathematical manner, revealing what is not easily seen in a surface analysis. Clustering is a typical descriptive algorithm that indicates the aggregation behavior of data objects. By defining a specific distance or similarity measure, we are able to capture the mutual distance or similarity between different data points (as shown in Fig. 1.1.1). In contrast, predictive approaches mainly exploit prior knowledge, such as known labels or categories, to derive a prediction “model” that best describes and differentiates the data classes. Because the model is learned from the available dataset using machine learning approaches, the process is also called model training, and the dataset used is called the training data (i.e., data objects whose class labels are known). After the model is trained, it is used to predict the class label of new data subjects based on their actual attributes.

Figure 1.1.1: Cluster analysis
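
The train-then-predict cycle described above can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule, chosen only for brevity and not because the text prescribes it; the `train` and `predict` functions, the (income, debt) features and the sample values are all hypothetical, while the accept/reject labels follow the credit assessment example mentioned earlier.

```python
from math import dist

def train(samples):
    """Learn one centroid per class label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    # The "model" is simply the mean feature vector of each class.
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def predict(model, features):
    """Assign the label whose centroid is closest to the new data subject."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training data: (income, debt) in arbitrary units, with known labels.
training_data = [
    ((60.0, 5.0), "accept"),
    ((80.0, 10.0), "accept"),
    ((70.0, 15.0), "accept"),
    ((20.0, 30.0), "reject"),
    ((30.0, 40.0), "reject"),
    ((25.0, 35.0), "reject"),
]

model = train(training_data)
label = predict(model, (65.0, 12.0))
print(model, label)
```

The labeled samples play the role of the training data, and the dictionary returned by `train` is the learned model that is then applied to a data subject whose label is unknown.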

1.1.3.2 Association Rule and Frequent Pattern Mining

Association rule mining [1] is one of the most important techniques in the data mining domain. Its purpose is to reveal the co-occurrence relationships of activities or observations in a large database or data repository. Suppose that, in a traditional e-marketing application, the joint purchase of “milk” and “bread” is a commonly observed pattern in supermarket transactions; this results in the generation of the association rule <bread, milk>. Of course, a huge transaction database may contain a large number of association rules, depending on the threshold (such as the confidence threshold) that a rule must satisfy. The association rule mining algorithm is thus designed to extract such rules hidden in massive data according to the analyst’s targets. Figure 1.1.2 gives a typical association rule set in a market-basket transaction campaign. Here you can observe the common co-occurrence of various items in supermarket transaction records, which can be used to improve profit by adjusting the arrangement of items on shelves in daily supermarket management. Frequent pattern mining is one of the most fundamental research issues in data mining and aims to mine useful information from huge volumes of data [4]. In the supermarket example, the purpose of searching for such frequent patterns (i.e., association rules) is to explore the historical transaction data and thereby discover customer behavior based on the purchased items.

Figure 1.1.2: An example of association rules
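
As a small worked illustration, the sketch below measures how strongly “bread” and “milk” co-occur in the five market-basket transactions listed in this section. It uses the standard definitions of support (the fraction of transactions containing an itemset) and confidence (the fraction of transactions containing the antecedent that also contain the consequent); the function names are assumptions made for this example, and no rule mining algorithm such as candidate generation is attempted.

```python
# The five example transactions from the market-basket listing.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

bread_milk_support = support({"Bread", "Milk"}, transactions)
bread_to_milk = confidence({"Bread"}, {"Milk"}, transactions)

# Three of the five transactions contain both items; four contain bread.
print(bread_milk_support, bread_to_milk)
```

A rule would be reported only if these values reach the thresholds chosen by the analyst, which is why the number of rules found depends on the threshold setting.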

1.1.3.3 Clustering

Clustering is an approach that reveals the group coherence of data points and captures their partition [2]. The outcome of a clustering operation is a set of clusters in which the data points within the same cluster have a minimal mutual distance, while data points belonging to different clusters are sufficiently separated from each other. Since clustering relies on the data distribution itself, i.e., the mutual distance, and not on any other prior knowledge, it is also called an unsupervised algorithm. Figure 1.1.3 depicts an example of cluster analysis of debt-income relationships.
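
One common way to obtain such a partition is the k-means procedure, sketched below on a handful of made-up (income, debt) points. The text does not name a specific clustering algorithm, so k-means is an assumption made for illustration, as are the `kmeans` function, the sample points and the naive choice of the first k points as initial centroids.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and recomputing centroids."""
    centroids = list(points[:k])  # naive initialization (assumption): first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (income, debt) observations in arbitrary units.
points = [(10, 2), (50, 30), (12, 3), (11, 1), (52, 28), (48, 32)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied at any point; the grouping emerges from the mutual distances alone, which is what makes the procedure unsupervised.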

[Figure content: market-basket transactions]
Bread, Milk
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
