Chapter 1: Introduction (1/4)

CHAPTER 1

Introduction

In the last couple of decades, we have witnessed a significant increase in
the volume of data in our daily life; data is now available for almost all
aspects of life. Almost every individual, company and organization has
created, and can access, a large amount of data and information recording
its historical activities as it interacts with the surrounding world. Such
data and information provide the analytical sources needed to reveal the
evolution of important objects or trends, which in turn can greatly help the
growth and development of business and the economy. However, due to
bottlenecks in technological advances and their application, this potential
has yet to be fully addressed and exploited, in theory as well as in
real-world applications. Data mining has been a very important and active
topic since the term was coined in the 1990s, and many algorithmic and
theoretical breakthroughs have been achieved through the combined efforts of
multiple domains, such as databases, machine learning, statistics,
information retrieval and information systems. Recently, the focus in data
mining has increasingly shifted from algorithmic innovations to
application-driven and market-driven issues; that is, due to increasing
demand from industry and business, more and more people are paying attention
to applied data mining. This book aims to create a bridge between data
mining algorithms and applications, especially the newly emerging topics of
applied data mining. In this chapter, we first review the related concepts
and techniques involved in data mining research and applications. The layout
of this book is then described from three perspectives: fundamentals,
advanced data mining and emerging applications. Finally, the readership of
this book and its purpose are discussed.

1.1 Background

We are often overwhelmed with various kinds of data which comes from the

pervasive use of electronic equipment and computing facilities, and whose

4 Applied Data Mining

size is continuously increasing. Personal computing devices are becoming
cheap and convenient, so it is easy to use them in almost every aspect of
our daily life, ranging from entertainment and communication to education
and political life. The falling prices of electronic storage drives allow us
to purchase disks and easily save information that previously had to be
discarded because of the expense. Nowadays database and information systems
are widely deployed in industry and business, and they can record the
interactions between users and systems, such as online shopping, banking
transactions, financial decisions and so on. These interactions form an
important data source for business analysis and business intelligence. To
deal with the overload of information, search engines have been invented as
a useful tool to help us locate and retrieve the needed information over the
Internet. The user navigation and retrieval activities recorded in Web
server logs can convey the browsing behavior and hidden intent of users,
which are not explicitly visible without in-depth analysis. Thus, the
widespread use of high-speed telecommunication infrastructure, the
affordability of data storage equipment, the ubiquitous deployment of
information systems and advanced data analysis techniques have placed us in
an unprecedented data-intensive and data-centric world. We face an urgent
challenge in dealing with the growing gap between data generation and our
capability to understand it. Because the human brain has a limited number of
cells, an individual’s ability to reason, summarize and analyze is limited
as well. Moreover, as the data volume increases, the proportion of data that
people can understand decreases. These two facts create a real demand to
tackle a practical problem in the current information society: it is almost
impossible to rely on human labor alone to accomplish the data analysis, so
more scalable and intelligent computational methods are urgently called for.
Data mining is emerging as one such technical solution to address these
challenges and demands.

1.1.1 Data Mining: Definitions and Concepts

Data mining is an analytical process that reveals the patterns or trends
hidden in the vast ocean of data via cutting-edge computational intelligence
paradigms [5]. The original meaning of “mining” is the operation of
extracting precious resources such as oil or gold from the earth. Combining
“mining” with the word “data” reflects the in-depth analysis of data to
reveal the knowledge “nuggets” that are not exposed explicitly in the mass
of data. Because the undiscovered knowledge is statistical in nature and is
uncovered by statistical means, data mining is sometimes called statistical
analysis, or multivariate statistical analysis due to its multivariate
nature. From the perspective of scientific research, data mining is closely
related to many other disciplines, such as machine learning, databases,
statistics, data analytics, operational research, decision support,
information systems, information retrieval and so on. For example, from the
viewpoint of the data itself, data mining is a variant discipline of
database systems, following research directions such as data warehousing (on
storage and retrieval) and clustering (data coherence and performance). In
terms of methodologies and tools, data mining can be considered a sub-stream
of machine learning and statistics, revealing the statistical
characteristics of data occurrences and distributions via computational or
artificial intelligence paradigms.

Thus data mining is defined as the process of using one or more
computational learning techniques to analyze and extract useful knowledge
from data in databases. The aim of data mining is to reveal trends and
patterns hidden in data. From this viewpoint, the procedure is closely
related to Pattern Recognition, a traditional and active topic in Artificial
Intelligence. The emergence of data mining is closely tied to research
advances in database systems in computer science, especially the evolution
and organization of databases, with more computational learning approaches
incorporated later. Very basic database operations such as query and
reporting resemble the earliest stages of data mining. Query and reporting
are functional tools that help us locate and identify the requested data
records within the database at various levels of granularity, and present
more informative characteristics of the identified data, such as statistical
results. These operations can be performed locally or remotely: the former
are executed on the local end-user side, the latter over a distributed
network environment such as an intranet or the Internet. Data retrieval,
like data mining, extracts the needed data and information from databases.
To filter the needed data out of the whole data repository, database
administrators or end users need to define beforehand a set of constraints
or filters that will be applied at a later stage. A typical example is a
marketing investigation of customers who have bought two particular
products, where the “and” operator joins the two conditions into a filter
that identifies the specific customer group. This is regarded as one of the
simplest business techniques in a marketing campaign. Clearly, the database
itself offers only rather superficial methods for data analysis and business
intelligence, which fall far short of real business requirements such as
customer behavioral modeling and product targeting.
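
The “and” filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase records, the product names and the helper `customers_who_bought_both` are hypothetical and do not come from any particular database system.

```python
# Hypothetical purchase records: (customer_id, product) pairs.
purchases = [
    ("c1", "printer"),
    ("c1", "ink"),
    ("c2", "printer"),
    ("c3", "ink"),
    ("c4", "printer"),
    ("c4", "ink"),
    ("c4", "paper"),
]


def customers_who_bought_both(records, product_a, product_b):
    """Return the sorted ids of customers who bought product_a AND product_b."""
    bought = {}
    for customer, product in records:
        # Collect the set of products bought by each customer.
        bought.setdefault(customer, set()).add(product)
    return sorted(
        customer
        for customer, products in bought.items()
        # The filter is predefined: both conditions must hold.
        if product_a in products and product_b in products
    )


target_group = customers_who_bought_both(purchases, "printer", "ink")
print(target_group)
```

The filter only returns what the analyst already knew to ask for; it cannot suggest which product pairs are worth asking about in the first place.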

Data mining is different from data query and retrieval because it drills
down into the in-depth associations and coherences between the data
occurrences within the repository, which cannot be known beforehand or found
through basic data manipulation. Instead of query and retrieval operations,
data mining usually uses more complicated and intelligent data analysis
approaches, which are “borrowed” from relevant research domains


such as machine learning and artificial intelligence. Additionally, it
supports decision making based on judgment of the data itself and of the
knowledge patterns derived from it. A similar data analysis method is Online
Analytical Processing (OLAP), which is essentially a graphical data
reporting tool for visualizing the multidimensional structure within the
database. OLAP is used to summarize and display the relations between
available variables in the form of a two-dimensional table. Unlike OLAP,
data mining brings together all the attributes and treats them in a unified
manner, revealing the underlying models or patterns for real applications,
such as business analytics. In short, OLAP is more like a visualization
instrument, whereas data mining provides the analytical capability for more
intelligent use. Although data query, retrieval, OLAP and data mining have a
lot in common, data mining is distinguished from its counterparts by its
superior analytical capability.
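
To make the two-dimensional summary concrete, the following minimal Python sketch cross-tabulates two variables. It is an assumption-laden illustration rather than how any OLAP product works: the sales records, the `region` and `product` variables and the helper `cross_tabulate` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, amount).
sales = [
    ("north", "printer", 200),
    ("north", "ink", 40),
    ("south", "printer", 150),
    ("south", "ink", 60),
    ("north", "ink", 20),
]


def cross_tabulate(records):
    """Sum amounts into a two-dimensional table keyed by region, then product."""
    table = defaultdict(lambda: defaultdict(int))
    for region, product, amount in records:
        table[region][product] += amount
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {region: dict(row) for region, row in table.items()}


summary = cross_tabulate(sales)
products = sorted({product for _, product, _ in sales})

# Print one row per region and one column per product.
print("region".ljust(8) + "".join(p.rjust(9) for p in products))
for region in sorted(summary):
    cells = "".join(str(summary[region].get(p, 0)).rjust(9) for p in products)
    print(region.ljust(8) + cells)
```

The table relates only the two variables chosen in advance, which is the limitation the comparison with data mining points at.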

Knowledge Discovery in Databases (KDD) is a name frequently used
interchangeably with data mining. In fact, data mining has a broader
coverage of applicability, while KDD is more focused on the extension of
scientific methods in data mining. In addition to performing data mining, a
typical KDD process also includes the stages of data collection, data
preprocessing and knowledge utilization, which form a whole cycle of data
preparation, data mining or knowledge discovery, and knowledge utilization.
However, it is hard to draw a clear border between the two disciplines,
since they overlap heavily in research targets and approaches as well as in
research communities and publications. More theoretically, data mining is
more about the data objects and algorithms involved, while KDD is a synergy
of the knowledge discovery process and the learning approaches used. In this
book, we mainly focus our description on data mining, presenting a generic
and broad landscape to bridge the gap between theory and application.

1.1.2 Data Mining Process

The key components of a data mining task consist of the following subtasks:

• Definition of the data analytical purposes and application domain.
• Data organization and design structure, data preparation, consolidation
  and integration.
• Exploratory analysis of the data and summarization of the preliminary
  results.
• Choosing and devising the computational learning approach based on the
  data analytical purposes.
• Data mining process using the above approaches.
• Knowledge representation of results in the form of models or patterns.
• Interpretation of knowledge patterns and the subsequent utilization in
  decision support.
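
The subtasks above can be traced on a toy example. The sketch below is hypothetical throughout: the shopping baskets, the function names and the choice of simple co-occurrence counting as the mining approach are illustrative assumptions, and the `support` field is just the fraction of baskets that contain the pair.

```python
from collections import Counter
from itertools import combinations

# Aim (assumed for illustration): find the product pair most often
# bought together.
# Hypothetical raw baskets with inconsistent case and duplicate items.
raw_baskets = [
    ["Printer", "ink", "ink"],
    ["printer", "Ink", "paper"],
    ["paper"],
    ["ink", "printer"],
]


def prepare(baskets):
    """Data preparation: normalize case, drop duplicates within a basket."""
    return [sorted({item.lower() for item in basket}) for basket in baskets]


def explore(baskets):
    """Exploratory analysis: count how often each item occurs."""
    return Counter(item for basket in baskets for item in basket)


def mine_pairs(baskets):
    """Mining: the chosen approach here is plain co-occurrence counting."""
    return Counter(
        pair for basket in baskets for pair in combinations(basket, 2)
    )


def represent(pair_counts, n_baskets):
    """Knowledge representation: express the strongest pattern as a record."""
    pair, count = pair_counts.most_common(1)[0]
    return {"pair": pair, "support": count / n_baskets}


baskets = prepare(raw_baskets)
item_counts = explore(baskets)
pattern = represent(mine_pairs(baskets), len(baskets))

# Interpretation is left to the analyst; the sketch only prints the results.
print(item_counts)
print(pattern)
```

Each function maps to one subtask, so a different mining approach can replace `mine_pairs` without touching the preparation or representation steps.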

1.1.2.1 Definition of Aims

The definition of aims clearly specifies the analytical purpose of data
mining, i.e., what kinds of data mining tasks are to be conducted, what
major outcomes are expected to be discovered, what the application domain of
the data mining task is, and how the findings are to be interpreted based on
domain expertise. A clear statement of the problem and of the aims to be
achieved is the prerequisite for setting up the mining task correctly and
the key to fulfilling the aims successfully. The definition of the
analytical aims also guides the data organization and the data mining
approaches used in the subtasks that follow.

1.1.2.2 Design of Data Schema

This step designs the data organization upon which the data analysis will be
performed. A data analysis task normally involves a handful of features, and
these features can be accommodated in various data models. Hence choosing an
appropriate data schema and selecting the related attributes in the chosen
schema is also crucial to the success of data mining. Mathematically, there
are some well-studied models to choose from, such as the Vector Space Model
(VSM) and the graph model. We need to choose a practical model that reflects
and accommodates the features involved. Features are another important
consideration in data mining; they are used to describe the data objects and
characterize the individual properties of the data. For example, in a
customer credit assessment scenario in banking, the attributes considered
could include customers’ age, educational background, salary income, asset
amount, historic default records and so on. To induce practical credit
assessment rules or patterns, we need to carefully select the possibly
relevant attributes to form the features of the chosen model. A number of
feature selection algorithms have been developed in past studies of data
mining and machine learning. An additional concern is that data may reside
in multiple databases, owing to the current distributed computing
environment and the popularization of internal and external networking. In
other words, the selected data attributes are distributed in different
databases locally and
