Equal-width binning divides the range of possible values into N subranges of the same size, where bin width = (max value - min value)/N.
For example, if the observed values are all between 0 and 100, we could create 5 bins as follows:
(1) width = (100 - 0)/5 = 20
(2) bins: [0–20], (20–40], (40–60], (60–80], (80–100], where "[" or "]" means the endpoint is included and "(" or ")" means the endpoint is excluded
(3) typically, the first and last bins are extended to allow for values outside the range of observed values: (-infinity, 20], (20, 40], (40, 60], (60, 80], (80, +infinity).
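A minimal Python sketch of equal-width binning follows; the function name and the decision to return plain bin edges (rather than the extended first and last bins) are illustrative choices, not part of the text.

def equal_width_bins(values, n_bins):
    """Return n_bins + 1 bin edges of equal width over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # In practice the first and last bins are often extended to
    # -infinity and +infinity to accommodate unseen values.
    return [lo + i * width for i in range(n_bins + 1)]

values = list(range(0, 101))         # observed values between 0 and 100
print(equal_width_bins(values, 5))   # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]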
Equal-frequency or equal-height binning divides the range of possible values into N bins, each of which holds the same number of training instances.
For example, suppose we have 10 training examples with the following values for the attribute that we are discretizing: 5, 7, 12, 35, 65, 82, 84, 88, 90, 95. To create 5 bins, we would divide up the range of values so that each bin holds 2 of the training examples: 5–7, 12–35, 65–82, 84–88, 90–95. To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary. For example: (7 + 12)/2 = 9.5 and (35 + 65)/2 = 50.
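The same example can be reproduced with a short equal-frequency binning sketch; the function name is again an illustrative choice.

def equal_frequency_bins(values, n_bins):
    """Return bin boundaries so that each bin holds the same number of values."""
    values = sorted(values)
    per_bin = len(values) // n_bins
    boundaries = []
    for i in range(1, n_bins):
        left = values[i * per_bin - 1]   # last example of the bin on the left
        right = values[i * per_bin]      # first example of the bin on the right
        boundaries.append((left + right) / 2)
    return boundaries

data = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_frequency_bins(data, 5))   # [9.5, 50.0, 83.0, 89.0]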
3.2 Data Cleaning and Integrity
3.2.1 Missing Values
Imagine that you need to analyze All Electronics sales and customer data.
You note that many tuples have no recorded value for several attributes,
such as customer income. How can you go about filling in the missing values
for this attribute? Let’s look at the following methods [9, 8]:
(1) Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
(2) Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
(3) Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown”. If missing values are replaced by, say, “Unknown”, then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely “Unknown”. Hence, although this method is simple, it is not recommended.
(4) Use the attribute mean to fill in the missing value: For example, suppose that the average income of All Electronics customers is $28,000. Use this value to replace the missing value for income.
(5) Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit_risk, replace the missing value with the average income value for customers in the same credit_risk category as that of the given tuple.
(6) Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
Methods 3 to 6 bias the data, and the filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values.
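To make methods 4 and 5 concrete, the following pandas sketch fills missing income values first with the overall mean and then with the class-conditional mean; the data frame and column names are invented for illustration.

import pandas as pd

# Hypothetical customer data with missing income values (None becomes NaN).
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [30000, None, 22000, None, 34000],
})

# Method 4: replace a missing income by the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: replace it by the mean income of customers in the same
# credit_risk class as the given tuple.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)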
3.2.2 Detecting Anomalies
Anomaly detection, also referred to as outlier detection [2], is the detection of patterns in a given data set that do not conform to an established normal behavior. The patterns thus detected are called anomalies and often translate to critical and actionable information in several application domains. Anomalies are also referred to as outliers, change, deviation, surprise, aberrant behavior, peculiarity, intrusion, etc. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts of activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the
inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the learnt model.
3.2.3 Applications
Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances. It is often used in preprocessing to remove anomalous data from the dataset. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy.
(1) Popular techniques
Several anomaly detection techniques have been proposed in literature.
Some of the popular techniques are:
Distance based techniques (k-nearest neighbor, Local Outlier Factor)
One class support vector machines
Replicator neural networks
Cluster analysis based outlier detection
Pointing at records that deviate from association rules
Conditional anomaly concept.
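As an illustration of the first technique in this list, each point can be scored by its average distance to its k nearest neighbours, so that points far from the rest of the data receive large scores; the function below is a generic sketch of this idea, not a specific published algorithm.

import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by its mean distance to its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # ignore the distance to self
    nearest = np.sort(dists, axis=1)[:, :k]   # k smallest distances per point
    return nearest.mean(axis=1)               # large score = likely outlier

X = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.8], [5.0, 5.0]]
print(knn_outlier_scores(X, k=3))   # the last point receives by far the largest score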
(2) Application to data security
Anomaly detection was proposed for Intrusion Detection Systems (IDS)
by Dorothy Denning in 1986. Anomaly detection for IDS is normally
accomplished with thresholds and statistics, but can also be done with soft
computing and inductive learning. Types of statistics proposed by 1999
included profiles of users, workstations, networks, remote hosts, groups of
users, and programs based on frequencies, means, variances, covariances,
and standard deviations. The counterpart of Anomaly Detection in Intrusion
Detection is Misuse Detection.
(3) Time series outlier detection
Parametric tests to find outliers in time series are implemented in almost all statistical packages: Demetra+, for example, uses the most popular ones. One way to detect anomalies in time series is a simple non-parametric method called washer. It uses a non-parametric test to find one or more outliers in a group of even very short time series. The series in the group must have similar behaviour, as explained more fully below. An example is that of the municipalities cited in the work of Dahlberg and Johanssen (2000). Swedish municipalities' expenditures between 1979 and 1987 represent 256 time series. If you consider three years, for example 1981, 1982 and 1983, you have 256 simple polygonal chains, each made of two line segments. Every couple of segments can approximate a straight line or a convex downward (or convex upward) simple polygonal chain. The idea is to find outliers among the couples of segments that behave too differently from the other couples. In the washer procedure every couple of segments is represented by an index, and a non-parametric test (the Sprent test) is applied to the unknown distribution of those indices. To apply the washer methodology you can download an open-source R (programming language) function with a simple numeric example.
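The index and test used by washer are specific to that method, but the underlying idea can be hinted at with a deliberately simplified sketch: each three-year series contributes one number describing how its two segments bend, and series whose number lies far from the group's median are flagged. The index and the median-based cutoff below are illustrative assumptions, not the actual washer procedure or the Sprent test.

import statistics

def bend_index(y1, y2, y3):
    """How strongly the two segments (y1 -> y2, y2 -> y3) bend.
    Zero for a straight line, positive for convex upward, negative for convex downward.
    This is only an illustrative index, not the one used by washer."""
    return (y3 - y2) - (y2 - y1)

def flag_outlying_series(series, cutoff=3.0):
    """Flag series whose bend index deviates strongly from the group median."""
    idx = [bend_index(*s) for s in series]
    med = statistics.median(idx)
    mad = statistics.median(abs(i - med) for i in idx) or 1.0  # guard against zero spread
    return [abs(i - med) / mad > cutoff for i in idx]

# Hypothetical expenditures of four municipalities over three consecutive years.
series = [(10, 12, 14), (20, 22, 25), (15, 17, 18), (30, 31, 50)]
print(flag_outlying_series(series))   # only the last series bends much more than the others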
3.3 Multiple Model Integration
3.3.1 Data Federation
Data federation is a relatively new approach to integrating data from many disparate sources. Organizations and companies store their data in different ways: transactional databases, data warehouses, business intelligence systems, legacy systems and so on. The problem arises when someone needs to access data from several of these sources [8, 4, 3]. There is no easy way to retrieve the data, because every storage system has its own way of accessing it. Several approaches exist to integrate data from many sources, and the most advanced of them is data federation. In traditional integration, the data had to be copied and moved, because the integrated data needed to be kept together. This had its defects, such as the time needed to copy and move the data, and possible copyright infringements during copying. The data also occupied more disk space than actually needed, because it was kept in several copies. There were also problems with keeping the data fresh: if there was more than one instance of the data, only the modified instance was up to date, so all other instances had to be refreshed, which slowed down the integration system. In response to these problems, a new data integration approach called data federation was created. The idea of data federation is to integrate data from many individual sources and make access to them as easy as possible, without moving or copying the data. In fact, the data sources can be in any location; they only have to be online. Every data source can also be built with a different technology, standard and architecture. To the end user it feels like one big data storage system. Data federation supports many data storage standards: from SQL relational databases such as MySQL, PostgreSQL, InterBase, IBM DB2, Firebird and Oracle, through directory services such as LDAP and OpenLDAP and object-based databases, to data warehouses
and business intelligence systems. The goal is to make the data federation system work with every standard used to store data in companies and other organizations. The data integrated with the data federation system is called a federated database or a virtual database. A federated database allows users to read and write the data without even knowing that it comes from many different sources. The user does not need to know how to use each database system or how to access data in a directory service; all he needs to know is how to use the unified front-end of the data federation system. In many cases data federation may be the best way to unify data kept in different places and in many different ways: it is simple, easy for end users, and an efficient solution that makes accessing the data much easier.
3.3.2 Bagging and Boosting
(1) Bagging
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT and CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small datasets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: the final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the boosting procedure.
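A small sketch of bagging by voting is given below; the iris data set, the number of trees and scikit-learn's decision tree stand in for the small learning set and the C&RT/CHAID-style trees mentioned in the text.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Grow one tree per bootstrap sample (sub-sampling with replacement).
n_models = 25
trees = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    """Majority vote over the predictions of all bootstrap trees."""
    votes = [int(t.predict([x])[0]) for t in trees]
    return max(set(votes), key=votes.count)

print(bagged_predict(X[0]))   # the class predicted most often by the 25 trees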
(2) Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging). A simple