Equal-width binning divides the range of possible values into N subranges of the same size, where bin width = (max value - min value)/N.
For example, if the observed values are all between 0 and 100, we could create 5 bins as follows:
(1) width = (100 - 0)/5 = 20
(2) bins: [0–20], (20–40], (40–60], (60–80], (80–100], where "[" or "]" means the endpoint is included and "(" or ")" means the endpoint is excluded
(3) typically, the first and last bins are extended to allow for values outside the range of observed values: (-infinity, 20], (20, 40], (40, 60], (60, 80], (80, +infinity).
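A minimal Python sketch of equal-width binning follows; the function name and the decision to return plain bin edges (rather than the extended first and last bins) are illustrative choices, not part of the text.

def equal_width_bins(values, n_bins):
    """Return n_bins + 1 bin edges of equal width over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # In practice the first and last bins are often extended to
    # -infinity and +infinity to accommodate unseen values.
    return [lo + i * width for i in range(n_bins + 1)]

values = list(range(0, 101))         # observed values between 0 and 100
print(equal_width_bins(values, 5))   # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]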
Equal-frequency or equal-height binning divides the range of possible values into N bins, each of which holds the same number of training instances.
For example, suppose we have 10 training examples with the following values for the attribute that we are discretizing: 5, 7, 12, 35, 65, 82, 84, 88, 90, 95. To create 5 bins, we would divide up the range of values so that each bin holds 2 of the training examples: 5–7, 12–35, 65–82, 84–88, 90–95. To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary. For example: (7 + 12)/2 = 9.5 and (35 + 65)/2 = 50.
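The same example can be reproduced with a short equal-frequency binning sketch; the function name is again an illustrative choice.

def equal_frequency_bins(values, n_bins):
    """Return bin boundaries so that each bin holds the same number of values."""
    values = sorted(values)
    per_bin = len(values) // n_bins
    boundaries = []
    for i in range(1, n_bins):
        left = values[i * per_bin - 1]   # last example of the bin on the left
        right = values[i * per_bin]      # first example of the bin on the right
        boundaries.append((left + right) / 2)
    return boundaries

data = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_frequency_bins(data, 5))   # [9.5, 50.0, 83.0, 89.0]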
3.2 Data Cleaning and Integrity
3.2.1 Missing Values
Imagine that you need to analyze All Electronics sales and customer data.
You note that many tuples have no recorded value for several attributes,
such as customer income. How can you go about filling in the missing values
for this attribute? Let’s look at the following methods [9, 8]:
(1) Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
(2) Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
(3) Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown”. If missing values are replaced by, say, “Unknown”, then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely “Unknown”. Hence, although this method is simple, it is not recommended.
(4) Use the attribute mean to fill in the missing value: For example, suppose that the average income of All Electronics customers is $28,000. Use this value to replace the missing value for income.
(5) Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit_risk, replace the missing value with the average income value for customers in the same credit_risk category as that of the given tuple.
(6) Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
Methods 3 to 6 bias the data, and the filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values.
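To make methods 4 and 5 concrete, the following pandas sketch fills missing income values first with the overall mean and then with the class-conditional mean; the data frame and column names are invented for illustration.

import pandas as pd

# Hypothetical customer data with missing income values (None becomes NaN).
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [30000, None, 22000, None, 34000],
})

# Method 4: replace a missing income by the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: replace it by the mean income of customers in the same
# credit_risk class as the given tuple.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)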
3.2.2 Detecting Anomalies
Anomaly detection, also referred to as outlier detection [2], is the detection of patterns in a given data set that do not conform to an established normal behavior. The patterns thus detected are called anomalies and often translate to critical and actionable information in several application domains. Anomalies are also referred to as outliers, change, deviation, surprise, aberrant behavior, peculiarity, intrusion, etc. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts of activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the
inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the learnt model.
3.2.3 Applications
Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances. It is often used in preprocessing to remove anomalous data from the dataset. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy.
(1) Popular techniques
Several anomaly detection techniques have been proposed in literature.
Some of the popular techniques are:
Distance based techniques (k-nearest neighbor, Local Outlier Factor)
One class support vector machines
Replicator neural networks
Cluster analysis based outlier detection
Pointing at records that deviate from association rules
Conditional anomaly concept.
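As an illustration of the first technique in this list, each point can be scored by its average distance to its k nearest neighbours, so that points far from the rest of the data receive large scores; the function below is a generic sketch of this idea, not a specific published algorithm.

import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by its mean distance to its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # ignore the distance to self
    nearest = np.sort(dists, axis=1)[:, :k]   # k smallest distances per point
    return nearest.mean(axis=1)               # large score = likely outlier

X = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.8], [5.0, 5.0]]
print(knn_outlier_scores(X, k=3))   # the last point receives by far the largest score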
(2) Application to data security
Anomaly detection was proposed for Intrusion Detection Systems (IDS)
by Dorothy Denning in 1986. Anomaly detection for IDS is normally
accomplished with thresholds and statistics, but can also be done with soft
computing and inductive learning. Types of statistics proposed by 1999
included profiles of users, workstations, networks, remote hosts, groups of
users, and programs based on frequencies, means, variances, covariances,
and standard deviations. The counterpart of Anomaly Detection in Intrusion
Detection is Misuse Detection.
(3) Time series outlier detection
Parametric tests to find outliers in time series are implemented in almost all statistical packages: Demetra+, for example, uses the most popular ones. One way to detect anomalies in time series is a simple non-parametric method called washer. It uses a non-parametric test to find one or more outliers in a group of even very short time series. The series in the group must have similar behaviour, as explained more fully below. An example is that of the municipalities cited in the work of Dahlberg and Johanssen (2000). Swedish municipalities' expenditures between 1979 and 1987 represent 256 time series. If you consider three years, for example 1981, 1982 and 1983, you have 256 simple polygonal chains, each made of two line segments. Every couple of segments can approximate a straight line or a convex downward (or convex upward) simple polygonal chain. The idea is to find outliers among the couples of segments that behave too differently from the other couples. In the washer procedure every couple of segments is represented by an index, and a non-parametric test (the Sprent test) is applied to the unknown distribution of those indices. To apply the washer methodology you can download an open-source R (programming language) function with a simple numeric example.
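The index and test used by washer are specific to that method, but the underlying idea can be hinted at with a deliberately simplified sketch: each three-year series contributes one number describing how its two segments bend, and series whose number lies far from the group's median are flagged. The index and the median-based cutoff below are illustrative assumptions, not the actual washer procedure or the Sprent test.

import statistics

def bend_index(y1, y2, y3):
    """How strongly the two segments (y1 -> y2, y2 -> y3) bend.
    Zero for a straight line, positive for convex upward, negative for convex downward.
    This is only an illustrative index, not the one used by washer."""
    return (y3 - y2) - (y2 - y1)

def flag_outlying_series(series, cutoff=3.0):
    """Flag series whose bend index deviates strongly from the group median."""
    idx = [bend_index(*s) for s in series]
    med = statistics.median(idx)
    mad = statistics.median(abs(i - med) for i in idx) or 1.0  # guard against zero spread
    return [abs(i - med) / mad > cutoff for i in idx]

# Hypothetical expenditures of four municipalities over three consecutive years.
series = [(10, 12, 14), (20, 22, 25), (15, 17, 18), (30, 31, 50)]
print(flag_outlying_series(series))   # only the last series bends much more than the others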
3.3 Multiple Model Integration
3.3.1 Data Federation
Data federation is a relatively new approach to integrating data from many disparate sources. Organizations and companies store their data in different ways: transactional databases, data warehouses, business intelligence systems, legacy systems and so on. The problem arises when someone needs to access data from several of these sources [8, 4, 3]. There is no easy way to retrieve the data, because every storage system has its own way of accessing it. Several approaches exist to integrate data from many sources, and the most advanced of them is data federation. In traditional integration, the data had to be copied and moved, because the integrated data needed to be kept together. This had its defects, such as the time needed to copy and move the data, and possible copyright infringements during copying. The data also occupied more disk space than actually needed, because it was kept in several copies. There were also problems with keeping the data fresh: if there was more than one instance of the data, only the modified instance was up to date, so all other instances had to be refreshed, which slowed down the integration system. In response to these problems, a new data integration approach called data federation was created. The idea of data federation is to integrate data from many individual sources and make access to them as easy as possible, without moving or copying the data. In fact, the data sources can be in any location; they only have to be online. Every data source can also be built with a different technology, standard and architecture. To the end user it feels like one big data storage system. Data federation supports many data storage standards: from SQL relational databases such as MySQL, PostgreSQL, InterBase, IBM DB2, Firebird and Oracle, through directory services such as LDAP and OpenLDAP and object-based databases, to data warehouses
and business intelligence systems. The goal is to make the data federation system work with every standard used to store data in companies and other organizations. The data integrated with the data federation system is called a federated database or a virtual database. A federated database allows users to read and write the data without even knowing that it comes from many different sources. The user does not need to know how to use each database system or how to access data in a directory service; all he needs to know is how to use the unified front-end of the data federation system. In many cases data federation may be the best way to unify data kept in different places and in many different ways: it is simple, easy for end users, and an efficient solution that makes accessing the data much easier.
3.3.2 Bagging and Boosting
(1) Bagging
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT and CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small datasets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: the final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the boosting procedure.
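A small sketch of bagging by voting is given below; the iris data set, the number of trees and scikit-learn's decision tree stand in for the small learning set and the C&RT/CHAID-style trees mentioned in the text.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Grow one tree per bootstrap sample (sub-sampling with replacement).
n_models = 25
trees = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    """Majority vote over the predictions of all bootstrap trees."""
    votes = [int(t.predict([x])[0]) for t in trees]
    return max(set(votes), key=votes.count)

print(bagged_predict(X[0]))   # the class predicted most often by the 25 trees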
(2) Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging). A simple