CHAPTER 1
Introduction
In the last couple of decades, we have witnessed a significant increase in
the volume of data in our daily life—there is data available for almost all
aspects of life. Almost every individual, company and organization has
created, and can access, a large amount of data and information recording
their historical activities as they interact with the
surrounding world. This kind of data and information provides the
analytical sources to reveal the evolution of important objects or trends,
which can greatly help the growth and development of business and the
economy. However, due to bottlenecks in technological advances and
applications, this potential has yet to be fully addressed and exploited in
theory as well as in real-world applications. Undoubtedly, data mining has been a
very important and active topic since the term was coined in the 1990s, and many
algorithmic and theoretical breakthroughs have been achieved as a result of
the combined efforts of multiple domains, such as databases, machine learning,
statistics, information retrieval and information systems. Recently, there has
been a shift of focus in data mining from algorithmic innovations
to application- and market-driven issues; that is, due to the increasing
demand from industry and business, more and more people are paying attention
to applied data mining. This book aims at creating a bridge between data
mining algorithms and applications, especially the newly emerging topics of
applied data mining. In this chapter, we first review the related concepts and
techniques involved in data mining research and applications. The layout
of this book is then described from three perspectives—fundamentals,
advanced data mining and emerging applications. Finally, the readership
of this book and its purpose are discussed.
1.1 Background
We are often overwhelmed with various kinds of data which comes from the
pervasive use of electronic equipment and computing facilities, and whose
4 Applied Data Mining
size is continuously increasing. Personal computing devices are becoming
cheap and convenient, so it is easy to use them in almost every aspect of our
daily life, ranging from entertainment and communication to education and
political life. Falling prices of electronic storage drives allow
us to purchase disks and easily save information that previously had to be
discarded because of the expense. Nowadays database and information
systems have been widely deployed in industry and business, and they
have the capability to record the interactions between users and systems,
such as online shopping, banking transactions, financial decisions and so
on. The interactions between users and database systems form an important
data source for business analysis and business intelligence. To deal with the
overload of information, search engines have been invented as a useful tool
to help us locate and retrieve the needed information over the Internet. The
users’ navigational and retrieval activities recorded in Web server
logs can undoubtedly convey the browsing behavior and hidden
intent of users, which remain unseen without in-depth analysis. Thus,
the widespread use of high-speed telecommunication infrastructures, the
easy affordability of data storage equipment, the ubiquitous deployment
of information systems and advanced data analysis techniques have placed us
in an unprecedented data-intensive and data-centric world. We are
facing an urgent challenge in dealing with the growing gap between data
generation and our understanding capability. Because the capacity of the
human brain is limited, an individual’s ability to reason, summarize and analyze
is limited as well. Moreover, as data volume increases, the proportion
of data that people can understand decreases. These two facts create a real
demand to tackle a practical problem in the current information society: it is
almost impossible to rely on human labor alone to accomplish the data
analysis, so more scalable and intelligent computational methods are urgently
called for. Data mining is emerging as one such technical solution
to address these challenges and demands.
1.1.1 Data Mining—Definitions and Concepts
Data mining is actually an analytical process to reveal the patterns or
trends hidden in the vast ocean of data via cutting-edge computational
intelligence paradigms [5]. The original meaning of “mining” represents
the operation of extracting precious resources such as oil or gold from
the earth. The combination of mining with the word “data” reflects the
in-depth analysis of data to reveal the knowledge “nuggets” that are not
exposed explicitly in the mass of data. As the undiscovered knowledge is
statistical in nature and is revealed via statistical means, data mining is
sometimes called statistical analysis, or multivariate statistical analysis
owing to its multivariate nature.
From the perspective of scientific research, data mining is closely related
to many other disciplines, such as machine learning, databases, statistics,
data analytics, operational research, decision support, information systems,
information retrieval and so on. For example, from the viewpoint of the data
itself, data mining is a variant discipline of database systems, following
research directions such as data warehousing (storage and retrieval) and
clustering (data coherence and performance). In terms of methodologies
and tools, data mining could be considered a sub-stream of machine
learning and statistics, revealing the statistical characteristics of data
occurrences and distributions via computational or artificial intelligence
paradigms.
Thus data mining is defined as the process of using one or more
computational learning techniques to analyze and extract useful knowledge
from data in databases. The aim of data mining is to reveal trends and
patterns hidden in data. From this viewpoint, the procedure is closely
related to Pattern Recognition, which is a traditional and active
topic in Artificial Intelligence. The emergence of data mining is closely related
to research advances in database systems in computer science, especially
the evolution and organization of databases, which later incorporated more
computational learning approaches. The very basic database operations
such as query and reporting can be seen as the very early stages of data mining.
Query and reporting are functional tools that help us locate and identify
the requested data records within the database at various levels of granularity,
and present more informative characteristics of the identified data, such
as statistical results. The operations can be performed locally or remotely:
the former are executed on the local end-user side, while the latter run over
a distributed network environment, such as an intranet or the Internet. Data
retrieval, similar to data mining, extracts the needed data and information
from databases. In order to filter the needed data out of the whole
data repository, database administrators or end-users need to define
beforehand a set of constraints or filters that will be employed at a later
stage. A typical example is a marketing investigation of the customer group
that has bought two products consecutively, in which the “and”
operator is used to form a filter that identifies the specific customer group. This
is viewed as one of the simplest business tools in a marketing campaign. Evidently,
the database itself offers only somewhat superficial methods for data analysis and
business intelligence, which fall far short of real business requirements such as
customer behavioral modeling and product targeting.
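To make the predefined-filter idea concrete, the following is a minimal sketch in Python using the standard sqlite3 module. The table layout, the product names and the function name customers_who_bought_both are assumptions made purely for illustration; they do not come from any particular system. The point is that the two conditions joined with AND must be fixed in advance by whoever writes the query.

```python
import sqlite3

# Hypothetical purchase records: (customer, product).
PURCHASES = [
    ("alice", "printer"),
    ("alice", "ink"),
    ("bob", "printer"),
    ("carol", "ink"),
    ("dave", "printer"),
    ("dave", "ink"),
    ("dave", "paper"),
]


def customers_who_bought_both(rows, first, second):
    """Return the sorted customers who bought both `first` and `second`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        # The filter is defined beforehand: two conditions joined with AND.
        query = """
            SELECT DISTINCT a.customer
            FROM purchases AS a
            JOIN purchases AS b ON a.customer = b.customer
            WHERE a.product = ? AND b.product = ?
            ORDER BY a.customer
        """
        return [row[0] for row in conn.execute(query, (first, second))]
    finally:
        conn.close()


if __name__ == "__main__":
    print(customers_who_bought_both(PURCHASES, "printer", "ink"))
```

Such a query answers only the question that was asked; it cannot suggest which product pairs are worth asking about, which is the kind of gap data mining is meant to fill.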
Data mining is different from data query and retrieval because it drills
down into the in-depth associations and coherences among the data occurrences
within the repository, which cannot be known beforehand or
uncovered by basic data manipulation. Instead of query and retrieval operations,
data mining usually utilizes more complicated and intelligent data analysis
approaches, which are “borrowed” from the relevant research domains
such as machine learning and artificial intelligence. Additionally, it
supports decision making based on judgments about the data itself
and on the knowledge patterns derived. A similar data analytical method
is called Online Analytical Processing (OLAP), which is actually a graphical
data reporting tool to visualize the multidimensional structure within
the database. OLAP is used to summarize and demonstrate the relations
between available variables in the form of a two-dimensional table. Unlike
OLAP, data mining brings together all the attributes and treats them
in a unified manner, revealing the underlying models or patterns for real
applications, such as business analytics. In short, OLAP is more like
a visualization instrument, whereas data mining reflects the analytical
capability for more intelligent use. Although data query, retrieval,
OLAP and data mining have much in common, data mining
is distinguished from its counterparts by its stronger capability
for analysis.
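As a rough illustration of the two-dimensional summary table that OLAP-style reporting produces, the sketch below aggregates a handful of sales records by two variables. The records, the variable names (region and product) and the helper functions cross_tab and format_table are hypothetical; a real OLAP tool works against a multidimensional database rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, amount).
SALES = [
    ("north", "printer", 200),
    ("north", "ink", 50),
    ("south", "printer", 150),
    ("south", "ink", 70),
    ("north", "ink", 30),
]


def cross_tab(records):
    """Sum amounts into a region-by-product table (a dict of dicts)."""
    table = defaultdict(lambda: defaultdict(int))
    for region, product, amount in records:
        table[region][product] += amount
    return {region: dict(cols) for region, cols in table.items()}


def format_table(table):
    """Render the table as aligned text, one row per region."""
    products = sorted({p for cols in table.values() for p in cols})
    lines = ["region".ljust(8) + "".join(p.rjust(10) for p in products)]
    for region in sorted(table):
        cells = "".join(str(table[region].get(p, 0)).rjust(10) for p in products)
        lines.append(region.ljust(8) + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(cross_tab(SALES)))
```

The table relates only the two variables chosen by the analyst. It summarizes and displays, but it does not by itself propose a model or pattern across all attributes.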
Knowledge Discovery in Databases (KDD) is a name frequently used
interchangeably with data mining. In fact, data mining has a
broader coverage of applicability, while KDD is more focused on the
extension of scientific methods in data mining. In addition to performing
data mining, a typical KDD process also includes the stages of data
collection, data preprocessing and knowledge utilization, which form a
whole cycle of data preparation, data mining or knowledge discovery, and
knowledge utilization. However, it is hard to draw a clear border that
differentiates the two disciplines, since there is a large overlap
between them from the perspectives of not only the research targets
and approaches, but also the research communities and publications.
More theoretically, data mining is more about the data objects and algorithms
involved, while KDD is a synergy of the knowledge discovery process and
the learning approaches used. In this book, we mainly focus our description
on data mining, presenting a generic and broad landscape to bridge the
gap between theory and application.
1.1.2 Data Mining Process
The key components of a data mining task consist of the following
subtasks:
• Definition of the data analytical purposes and application domain.
• Data organization and design structure, data preparation, consolidation
  and integration.
• Exploratory analysis of the data and summarization of the preliminary
  results.
• Choice and design of the computational learning approach based on
  the data analytical purposes.
• The data mining process itself, using the above approaches.
• Knowledge representation of results in the form of models or
  patterns.
• Interpretation of knowledge patterns and their subsequent utilization
  in decision support.
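The subtasks listed above can be sketched end to end on a toy problem. Everything in the following Python example is an assumption chosen for illustration: the records, the single income feature, the function names, and the use of a one-feature threshold rule (a decision stump) as the learning approach. It is meant to show how the steps connect, not to prescribe a method.

```python
from statistics import mean

# Step 1 (aim): predict whether a customer defaults, from income alone.
# Hypothetical raw records from two sources: (customer_id, income, defaulted).
SOURCE_A = [(1, 20, True), (2, 35, True), (3, 60, False)]
SOURCE_B = [(4, 80, False), (5, 25, True), (6, None, False), (7, 55, False)]


def prepare(*sources):
    """Step 2: integrate the sources and drop records with missing income."""
    return [r for src in sources for r in src if r[1] is not None]


def explore(records):
    """Step 3: summarize mean income per class as a preliminary result."""
    return {
        label: mean(r[1] for r in records if r[2] is label)
        for label in (True, False)
    }


def mine_threshold(records):
    """Steps 4-5: fit a one-feature threshold rule (a decision stump).

    Tries each midpoint between consecutive sorted incomes and keeps the one
    with the fewest errors for the rule "income < threshold => default".
    """
    incomes = sorted({r[1] for r in records})
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        return sum((r[1] < threshold) != r[2] for r in records)

    return min(candidates, key=errors)


def represent(threshold):
    """Step 6: express the result as a readable pattern."""
    return f"IF income < {threshold} THEN default ELSE no default"


def interpret(threshold, income):
    """Step 7: use the pattern to support a decision on a new customer."""
    return "review manually" if income < threshold else "approve"


if __name__ == "__main__":
    data = prepare(SOURCE_A, SOURCE_B)
    print(explore(data))
    rule = mine_threshold(data)
    print(represent(rule))
    print(interpret(rule, 30))
```

In practice each step is far more involved, and the choice of learning approach in steps 4 and 5 depends on the aims defined in step 1.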
1.1.2.1 Definition of Aims
Definition of aims is to clearly specify the analytical purpose of data mining,
i.e., what kinds of data mining tasks are intended to be conducted, what
major outcomes are expected to be discovered, what the application domain of the
data mining task is, and how the findings are to be interpreted based on domain
expertise. A clear statement of the problem and the aims to be achieved is
the prerequisite for setting up the mining task correctly and the key to
fulfilling the aims successfully. The definition of the analytical aims also
provides guidance for the data organization and the data mining
approaches engaged in the subsequent subtasks.
1.1.2.2 Design of Data Schema
This step is to design the data organization upon which the data analysis
will be performed. Normally in a data analysis task, there are a handful of
features involved, and these features can be accommodated into various
data models. Hence choosing an appropriate data schema and selecting
the related attributes in the chosen schema are also crucial to
the success of data mining. Mathematically, there exist some well-studied
models to choose from, such as the Vector Space Model (VSM) and the graph model.
We need to choose a practical model to reflect and accommodate the
engaged features. Features are another important consideration in data
mining; they are used to describe the data objects and characterize the
individual properties of the data. For example, given a scenario of customer
credit assessment in banking applications, the considered attributes could
include customers’ age, education background, salary income, asset
amount, historic default records and so on. To induce the practical credit
assessment rules or patterns, we need to carefully select the possibly relevant
attributes to form the features of the chosen model. There are a number of
feature selection algorithms developed in past studies of data mining and
machine learning. An additional concern is that the data may reside in
multiple databases, owing to the current distributed computing environment
and the popularization of internal and external networking. In other words, the
selected data attributes are distributed in different databases locally and