8 Applied Data Mining
remotely. Thus data federation and consolidation are often necessary steps
to deal with the heterogeneity and homogeneity of multiple databases.
Together, these operations make up the data preparation and preprocessing
stage of data mining.
1.1.2.3 Exploratory Analysis
Exploratory analysis of the data is the process of exploring the basic
statistical properties of the data involved. The aim of this preliminary
analysis is to transform the original data distribution into a new, visual
form that can be better understood. This step is the starting point for
choosing appropriate data mining algorithms, since the suitability of an
algorithm depends largely on the integrity and coherence of the data.
Exploratory analysis can also identify anomalous data (entries that exhibit
a distinctive distribution or occurrence, sometimes called outliers) and
missing data. Such findings can trigger additional data preprocessing
operations to assure data integrity and quality. Another purpose of this
step is to indicate when additional data must be extracted because the
obtained data is not rich enough for the desired tasks. In short, this
stage works as a prerequisite that connects the analytical aims with the
data mining algorithms, facilitating the analytical tasks and reducing the
computational overhead of algorithm design and refinement.
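As a minimal sketch of this stage, the following Python fragment summarizes
one numeric attribute, counts its missing entries and flags outliers. The
function name explore, the sample income values and the use of the common
1.5 x IQR fence as the outlier criterion are illustrative assumptions; the
text above does not prescribe a particular method.

```python
import statistics

def explore(values):
    """Summarise one numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Quartiles of the observed values (default "exclusive" method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Assumed outlier rule: anything outside the 1.5 * IQR fences.
    outliers = [v for v in present if v < low or v > high]

    return {
        "count": len(present),
        "missing": missing,
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "outliers": outliers,
    }

# Hypothetical income column with two missing entries and one extreme value.
incomes = [32, 35, 31, None, 38, 36, 34, 250, 33, None, 37]
summary = explore(incomes)
print(summary)
```

A report of this kind is what would prompt the additional preprocessing
mentioned above, for instance imputing the missing entries or reviewing
the flagged value before an algorithm is chosen.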
1.1.2.4 Algorithm Design and Implementation
Data mining algorithm design and implementation is the central part of the
whole data mining process. As discussed above, the selection of appropriate
analytical algorithms is closely related to the analytical purposes, the
organization of the data, the model of the analysis task and the initial
exploratory analysis of the constructed data source. A wide spectrum of
data mining algorithms can be used to tackle the requested tasks, so it is
essential to select the appropriate ones carefully. The choice depends
mainly on the data used and on the nature of the analytical task.
Benefiting from advances in related research communities, such as machine
learning, computational intelligence and statistics, many practical and
effective paradigms have been devised and employed successfully in a
variety of applications. We can categorize these methods into the
following approaches:
Descriptive approach: This kind of approach aims to give a descriptive
statement of the data being analyzed. To do this, we look closely at the
distribution of the data, reveal the mutual relations among the objects,
and capture the common characteristics of the data distribution via
machine intelligence methods. For example, clustering analysis partitions
data objects into groups that are unknown beforehand, based on the mutual
distance or similarity between the objects. The criterion for such a
partition is that objects within the same group are close to each other,
while objects from different groups are separated far enough. Topic
modeling is a more recent descriptive learning method for detecting
topical coherence within the observations. By adjusting the statistical
model chosen for learning and comparing the observations with what the
model derives, we can identify the hidden topic distribution underlying
the observations and the associations between the topics and the data
objects. In this way all objects are treated equally, and an overall
statistical description is derived from the machine learning process.
Because these methods rely mainly on the computational power of machines
without human interaction, they are sometimes also called unsupervised
approaches.
Predictive approach: This kind of approach aims to derive operational
rules for prediction. By generalizing the linkage between the outcome and
the observed variables, we can induce rules or patterns for classification
and prediction. These rules help us to predict the unknown status of new
target objects or the occurrence of specific results. To accomplish this,
we have to collect in advance sufficient data samples that have already
been labeled with specific labels, for example, positive or negative in a
pathological examination, or accept or reject in a bank credit
assessment. These approaches were mainly developed in the machine learning
domain and include the Support Vector Machine (SVM), the decision tree
and so on. The results learned by such approaches are represented as a set
of reasoning conditions and stored as rules to guide future prediction and
judgment. One distinct feature of this kind of approach is that labeled
samples are available beforehand and the classifier is trained on this
training data, so these methods are also called supervised approaches
(i.e., with prior knowledge and human supervision). Predictive approaches
account for the majority of analytical tasks in real applications because
of their value for future prediction.
Evolutionary approach: The above two kinds of approaches are often
used to deal with static data, i.e., data collected within a specific
time frame. However, with the huge influx of massive data available in
distributed and networked environments, the dynamics of data have
become a challenging characteristic in data mining research. This
calls for evolutionary data mining algorithms that deal with changes
in temporal and spatial data within the database. The representative
methods and applications include sequential pattern mining and data
stream mining. The former determines the significant patterns in
sequential data observations, such as customer behavior in online
shopping, whereas the latter was proposed to tackle the difficulties
within data stream applications, such as RFID signal sampling and
processing. The main difference from other approaches is the capability
to handle continuously generated signals in real time at an affordable
computational cost, for example with limited memory and CPU usage.
Recently, such approaches have become an active and promising trend
within data mining research.
Detective approach: The descriptive and predictive approaches focus on
exploring the global properties of the data rather than local
information. Sometimes analysis at a smaller granularity provides more
informative findings than the overall description or prediction.
Detective approaches are the means to uncover the local mutual relations
at a lower level. In data mining, association rule mining and sequential
pattern mining are able to fulfill such a requirement within a specific
application domain, such as business transaction data or online shopping
data.
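The bounded-memory requirement mentioned under the evolutionary approach can
be illustrated with a one-pass summary of a stream. The sketch below uses
Welford's running mean and variance, chosen here only as a simple example of
processing readings one at a time without storing them; the class name
RunningStats and the sample readings are hypothetical, and this is not
presented as a specific data stream mining algorithm from the text.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method), constant memory."""

    def __init__(self):
        self.n = 0        # number of readings seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new reading into the summary; the reading is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the readings seen so far."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical sensor readings arriving one at a time.
stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept no matter how long the stream runs, which is
the kind of memory and CPU budget the paragraph on data stream mining
refers to.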
Although four categories are presented from the perspectives of data
objects and analysis aims, it is worth noting that the dividing lines
between these approaches are blurred and that they overlap one another. In
real applications, we often use a mixture of these approaches to satisfy
the requirements of complexity and practicality. More often, using the
existing approaches or a mixture of them falls well short of what
analytical tasks in real applications require, which creates the need to
design new algorithms and implement them in real scenarios with
satisfactory performance. This inspires researchers from different
communities to make more efforts and fully utilize the findings from
relevant areas.
Another significant issue attracting our attention is the increasing
popularity of data mining in almost every aspect of business, industry and
society. Real analytical questions have raised a number of new challenges
and opportunities for researchers to work together on applied data mining,
which provides a solid foundation and a real motivation for this book.
1.1.3 Data Mining Algorithms
1.1.3.1 Descriptive and Predictive
Due to the broad applications and unique intelligent capability of data
mining, a great deal of research effort has been invested and a wide
spectrum of algorithms and techniques has been developed [5]. In general,
from the perspective of data mining aims, data mining algorithms can be
categorized into two main streams: descriptive and predictive algorithms.
Descriptive approaches aim to reveal the characteristic structure hidden
in the data collection, while predictive methods build prediction models
to forecast the potential attributes of new data subjects.
Various descriptive data mining approaches have been devised in the past
decades, such as data characterization, discrimination, association rule
mining, clustering and so on. The common capability of such approaches is
to present the properties of the data and describe the data distribution
in a mathematical manner, which is not easily seen in a surface analysis.
Clustering is a typical descriptive algorithm, indicating the aggregation
behavior of data objects. By defining a specific distance or similarity
measure, we are able to capture the mutual distance or similarity between
different data points (as shown in Fig. 1.1.1). In contrast, predictive
approaches mainly exploit prior knowledge, such as known labels or
categories, to derive a prediction “model” that best describes and
differentiates data classes. As the model is learned from the available
dataset by using machine learning approaches, the process is also called
model training, and the dataset used is named training data (i.e., data
objects whose class label is known). After the model is trained, it is
used to predict the class label of new data subjects based on their
actual attributes.
Figure 1.1.1: Cluster analysis
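The train-then-predict cycle described above can be sketched with a very
small classifier. The example below uses a nearest-centroid rule, picked
only because it fits in a few lines; it is not one of the methods named in
the text (such as SVM or decision trees). The function names train and
predict, the two attributes (income, debt) and all sample values are
hypothetical.

```python
import math
from collections import defaultdict

def train(samples):
    """Learn one centroid per class from (features, label) pairs."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, x in enumerate(features):
            sums[label][i] += x
        counts[label] += 1
    # The "model" is simply the mean feature vector of each class.
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(model, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical training data: (income, debt) with a known credit decision.
training_data = [
    ((60, 5), "accept"), ((75, 10), "accept"), ((80, 8), "accept"),
    ((25, 30), "reject"), ((30, 40), "reject"), ((20, 35), "reject"),
]
model = train(training_data)

# New data subjects whose class label is unknown.
print(predict(model, (70, 12)))
print(predict(model, (28, 33)))
```

The labeled pairs play the role of the training data, and the returned
centroids are the trained model that is then applied to new subjects.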
1.1.3.2 Association Rule and Frequent Pattern Mining
Association rule mining [1] is one of the most important techniques in the
data mining domain. It reveals the co-occurrence relationships of
activities or observations in a large database or data repository. Suppose in a
traditional e-marketing application, the joint purchase of “milk” and
“bread” is a commonly observed pattern in supermarket transactions, which
gives rise to the association rule ⟨bread, milk⟩. Of course, a huge
transaction database may contain a large number of association rules,
depending on the threshold (for example, on confidence) that a rule must
satisfy. The algorithm of association rule mining is thus designed to
extract such rules hidden in the massive data, based on the analyst’s
targets. Figure 1.1.2 gives a typical association rule set for
market-basket transactions. Here you can observe the common occurrence of
various items in supermarket transaction records, which can be used to
improve profit by adjusting the item-shelf arrangement in daily
supermarket management. Frequent pattern mining is one of the most
fundamental research issues in data mining, which aims to mine useful
information from huge volumes of data [4]. In the supermarket setting,
the purpose of searching for such frequent patterns (i.e., association
rules) in historical transaction data is to discover customer behavior
from the purchased items.
Figure 1.1.2: An example of association rules
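To make the role of the thresholds concrete, the sketch below computes
support and confidence over the five market-basket transactions that
accompany Figure 1.1.2 in the source (the pairing of item sets with
transaction numbers 1 to 5 is read from that listing). The function names,
the restriction to single-item rules and the threshold values 3/5 and 3/4
are assumptions for illustration; this is a brute-force enumeration, not a
full association rule mining algorithm.

```python
from fractions import Fraction
from itertools import permutations

# Transactions as listed with Figure 1.1.2.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in TRANSACTIONS if itemset <= t)
    return Fraction(hits, len(TRANSACTIONS))

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the share that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

def pair_rules(min_support, min_confidence):
    """Enumerate rules a -> b between single items that pass both thresholds."""
    items = sorted(set().union(*TRANSACTIONS))
    rules = []
    for a, b in permutations(items, 2):
        s = support({a, b})
        if s == 0:
            continue  # the pair never occurs together
        c = confidence({a}, {b})
        if s >= min_support and c >= min_confidence:
            rules.append((a, b, s, c))
    return rules

# Assumed thresholds, chosen only for this illustration.
rules = pair_rules(Fraction(3, 5), Fraction(3, 4))
for a, b, s, c in rules:
    print(f"{a} -> {b}: support={s}, confidence={c}")
```

Raising either threshold shrinks the rule set, which is why the number of
rules found in a large database depends so strongly on how the thresholds
are set.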
1.1.3.3 Clustering
Clustering is an approach to reveal the group coherence of data points and
capture the partition of data points [2]. The outcome of a clustering
operation is a set of clusters, in which the data points within the same
cluster are close to one another, while the data points belonging to
different clusters are sufficiently separated from each other. Since
clustering relies on the data distribution itself, i.e., the mutual
distance, and not on any other prior knowledge, it is also called an
unsupervised algorithm. Figure 1.1.3 depicts an example of cluster
analysis of debt-income relationships.
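One way to obtain such a partition is sketched below with a plain k-means
loop. The text does not name a specific clustering algorithm, so k-means is
an assumed choice, as are the function name kmeans, the naive
initialization from the first k points and the (income, debt) sample
values.

```python
import math

def kmeans(points, k, iterations=20):
    """Partition points into k clusters by alternating assignment and centroid update."""
    # Naive, deterministic initialization: the first k points (an assumption).
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)]
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids

    return centroids, clusters

# Hypothetical (income, debt) observations forming two visible groups.
points = [(20, 5), (80, 60), (22, 8), (85, 55), (18, 6), (78, 62)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied at any point; the grouping comes only from the
mutual distances, which is what makes the method unsupervised.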
Market-basket transactions (Figure 1.1.2):
1   Bread, Milk
2   Bread, Diaper, Beer, Eggs
3   Milk, Diaper, Beer, Coke
4   Bread, Milk, Diaper, Beer
5   Bread, Milk, Diaper, Coke