CHAPTER 5
Classification
Two common data mining techniques for finding hidden patterns in data are clustering and classification analysis. Although classification and clustering are often mentioned in the same breath, they are different analytical approaches. Imagine a database of customer records, where each record represents a customer's attributes. These can include identifiers such as name and address, demographic information such as gender and age, and financial attributes such as income and amount spent. Clustering is an automated process that groups related records together on the basis of their having similar values for attributes. This approach of segmenting the database via clustering analysis is often used as an exploratory technique, because the analyst does not need to specify ahead of time how records should be related. In fact, the objective of the analysis is often to discover clusters and then examine the attributes and values that define the clusters or segments. As such, interesting and surprising ways of grouping customers together can become apparent, and this in turn can be used to drive marketing and promotion strategies that target specific types of customers.

Classification is a different technique from clustering. It is similar to clustering in that it also segments customer records into distinct segments, called classes. But unlike clustering, a classification analysis requires that the analyst know ahead of time how the classes are defined. For example, classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No). Each record in the dataset used to build the classifier must already have a value for the attribute used to define the classes. Because each record has such a value, and because the end user decides on the attribute to use, classification is much less exploratory than clustering.
The objective of a classifier is not to explore the data to discover interesting segments, but rather to decide how new records should be classified; for example, is this new customer likely to default on the loan? Classification routines in data mining use a variety of algorithms, and the particular algorithm used can affect the way records are classified. A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root, through branches and nodes, to a leaf representing a class. The path a record takes through a decision tree can then be represented as a rule, for example: If Income < $30,000 and Age < 25 and Debt = High, then Default Class = Yes. However, the sequential way in which a decision tree splits records (the most discriminative attribute values, e.g., Income, appear early in the tree) can make the tree overly sensitive to its initial splits. Therefore, in evaluating the goodness of fit of a tree, it is important to examine the error rate for each leaf node (the proportion of records incorrectly classified). A nice property of decision tree classifiers is that, because paths can be expressed as rules, measures for evaluating the usefulness of rules, such as Support, Confidence, and Lift, can also be used to evaluate the usefulness of the tree; a short sketch of these measures follows. Although clustering and classification are often used for segmenting data records, they have different objectives and achieve their segmentations in different ways. Knowing which approach to use is important for decision-making.
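To make these measures concrete, the following minimal sketch (in Python, over a handful of invented customer records; all attribute names and values are hypothetical) computes Support, Confidence, and Lift for the example rule above:

    # Sketch: evaluating a decision-tree path expressed as a rule.
    # All records below are invented for illustration.
    records = [
        {"income": 25000, "age": 22, "debt": "High", "default": "Yes"},
        {"income": 28000, "age": 24, "debt": "High", "default": "Yes"},
        {"income": 55000, "age": 40, "debt": "Low",  "default": "No"},
        {"income": 42000, "age": 31, "debt": "Low",  "default": "No"},
        {"income": 27000, "age": 23, "debt": "High", "default": "No"},
    ]

    def antecedent(r):   # Income < $30,000 and Age < 25 and Debt = High
        return r["income"] < 30000 and r["age"] < 25 and r["debt"] == "High"

    def consequent(r):   # Default Class = Yes
        return r["default"] == "Yes"

    n = len(records)
    n_a = sum(antecedent(r) for r in records)                      # rule body matches
    n_ac = sum(antecedent(r) and consequent(r) for r in records)   # body and head match
    n_c = sum(consequent(r) for r in records)                      # head matches

    support = n_ac / n              # fraction of all records covered by the whole rule
    confidence = n_ac / n_a         # how often the rule is right when it applies
    lift = confidence / (n_c / n)   # improvement over predicting "Yes" blindly

    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")

A lift above 1 indicates that the rule identifies defaulters better than chance would.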
5.1 Classification Definition and Related Issues
Classification is the data analysis task in which a model, or classifier, is constructed to predict categorical labels (the class label attribute). For example, categorical labels for loan application data might include "safe" or "risky". In general, data classification is a two-step process.

Step 1: A classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels. Each tuple is assumed to belong to a predefined class, as determined by the class label attribute. Because the class label of each training tuple is provided, this step is also known as supervised learning. The first step can also be viewed as the learning of a mapping or function, y = f(X), that can predict the associated class label y of a given tuple X. Typically, this mapping is represented in the form of classification rules, decision trees, or mathematical formulae. In Step 2, the model is used for classification.
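A minimal sketch of both steps, assuming the scikit-learn library is available (the loan tuples and attribute names below are invented), trains a decision tree on labeled tuples and then inspects the learned mapping y = f(X) as rules:

    # Sketch of Step 1: learn y = f(X) from labeled training tuples.
    # Assumes scikit-learn; the loan data is purely illustrative.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Training tuples: (income in $1000s, age, debt level 0=Low/1=High).
    X = [[25, 22, 1], [60, 45, 0], [30, 28, 1], [80, 50, 0], [22, 24, 1], [55, 35, 0]]
    y = ["risky", "safe", "risky", "safe", "risky", "safe"]   # class label attribute

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)   # the learning step

    # The learned mapping can be rendered as human-readable rules.
    print(export_text(clf, feature_names=["income", "age", "debt"]))

    # Step 2: use the model to classify a new, unlabeled tuple.
    print(clf.predict([[28, 23, 1]]))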
The predictive accuracy of the classifier is very important and should be estimated first. If we were to use the training set to measure the accuracy of the classifier, this estimate would likely be optimistic, because the classifier tends to overfit the data. Therefore, a test set is used, made up of test tuples and their associated class labels. The associated class label of each test tuple is compared with the learned classifier's class prediction for that tuple. If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known. For example, the classification rules learned from the analysis of data from previous loan applications can be used to approve or reject new or future loan applicants.
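The following sketch of this evaluation, again assuming scikit-learn and using synthetic data, holds out a test set and contrasts the optimistic training accuracy with the more honest test accuracy:

    # Sketch: estimate accuracy on a held-out test set, not the training set.
    # Assumes scikit-learn; the data are synthetic.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Training accuracy is typically inflated because the tree overfits.
    print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
    # Test accuracy is the estimate to report.
    print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))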
Data preparation and the quality of the classifier are two important issues in classification. The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification process.
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise and to treat missing values. This step can help reduce confusion during learning.

Relevance analysis: Many of the attributes in the data may be redundant, and a database may also contain irrelevant attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task.

Data transformation and reduction: Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as 0.0 to 1.0.
The data can also be transformed by generalizing it to higher-level concepts; concept hierarchies may be used for this purpose. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques such as binning, histogram analysis, and clustering. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced" attribute subset, should be less than the time that would have been spent on learning from the original set of attributes. Hence, such analysis can help improve classification efficiency and scalability.
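As a sketch of two of these transformations (the income values below are invented), min-max normalization rescales an attribute into the range 0.0 to 1.0, and equal-width binning discretizes it:

    # Sketch: min-max normalization and equal-width binning (discretization).
    incomes = [18000, 25000, 40000, 62000, 90000]   # invented attribute values

    lo, hi = min(incomes), max(incomes)
    normalized = [(v - lo) / (hi - lo) for v in incomes]   # scaled into 0.0..1.0
    print(normalized)

    n_bins = 3
    width = (hi - lo) / n_bins
    # Assign each value a bin index 0..n_bins-1 (clamp the maximum value).
    bins = [min(int((v - lo) // width), n_bins - 1) for v in incomes]
    print(bins)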
Classification methods can be compared and evaluated according to the following criteria:

Accuracy: The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data. Common estimation techniques include cross-validation and bootstrapping (a sketch follows this list). Because the computed accuracy is only an estimate of how well the classifier or predictor will do on new data tuples, confidence limits can be computed to help gauge this estimate.

Speed: This refers to the computational costs involved in generating and using the given classifier.

Robustness: This is the ability of the classifier to make correct predictions given noisy data or data with missing values.

Scalability: This refers to the ability to construct the classifier efficiently given large amounts of data.

Interpretability: This refers to the level of understanding and insight that is provided by the classifier.
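For the accuracy criterion, the sketch below (assuming scikit-learn, with synthetic data; the normal-approximation confidence limits are one simple choice among several) estimates accuracy by 10-fold cross-validation:

    # Sketch: cross-validated accuracy with rough confidence limits.
    # Assumes scikit-learn; the data are synthetic.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=10, random_state=1)
    scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=10)

    mean, sd = scores.mean(), scores.std(ddof=1)
    # Approximate 95% limits on the mean fold accuracy (normal approximation).
    half = 1.96 * sd / (len(scores) ** 0.5)
    print(f"accuracy = {mean:.3f} +/- {half:.3f}")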
5.2 Decision Tree and Classification
This section first introduces decision trees, and then discusses a decision tree classifier.
5.2.1 Decision Tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. If, in practice, decisions have to be taken online with no recall and under incomplete knowledge, a decision tree should be paralleled by a probability model, such as a best-choice model or online selection algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities. In general, a decision tree is used as a visual and analytical decision support tool in which the expected values (or expected utilities) of competing alternatives are calculated. A decision tree consists of three types of nodes:

Decision nodes: commonly represented by squares.
Chance nodes: represented by circles.
End nodes: represented by triangles.

Commonly, a decision tree is drawn using flowchart symbols, as many people find these easier to read and understand; Figure 5.2.1 shows a decision tree drawn this way. A decision tree has only burst nodes (splitting paths) but no sink nodes (converging paths). Therefore, when drawn manually, trees can grow very big and are often hard to draw fully by hand. Traditionally, decision trees have been created manually, as the example in Figure 5.2.1 shows, although increasingly specialized software is employed. A sketch of the expected value calculation behind such a tree follows.
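The sketch below shows how such expected values can be calculated by folding chance nodes back into probability-weighted averages; all probabilities and payoffs are invented for illustration and do not correspond to Figure 5.2.1:

    # Sketch: "folding back" a decision tree by computing expected values.
    # All probabilities and payoffs below are invented for illustration.

    def expected_value(outcomes):
        """outcomes: list of (probability, payoff) pairs at a chance node."""
        return sum(p * v for p, v in outcomes)

    # Alternative 1: settle now for a certain payoff.
    settle = 30_000

    # Alternative 2: proceed, pay costs, then face a win/lose chance node.
    costs = 50_000
    damages = expected_value([(0.6, 150_000), (0.4, 0)])   # damages chance node
    proceed = -costs + damages

    # A decision node takes the best of its competing alternatives.
    choice = "settle" if settle >= proceed else "proceed"
    print(f"settle={settle}, proceed={proceed}, choose {choice}")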
Decision trees have several advantages:

They are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.

They have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.

Possible scenarios can be added.

Worst, best, and expected values can be determined for different scenarios.

They use a white box model: if a given result is provided by the model, the explanation for the result is easily replicated.

They can be combined with other decision techniques.
Like other methods, decision trees also have some disadvantages. These include:

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels [10] (see the sketch after this list).

Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
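This level-count bias can be demonstrated directly. In the sketch below (toy, invented data), a unique record identifier achieves maximal information gain even though it is predictively useless:

    # Sketch: information gain favors attributes with many distinct values.
    # Toy data: 8 records; "ids" is unique per record and predictively useless.
    import math
    from collections import Counter, defaultdict

    labels = ["Y", "Y", "Y", "N", "N", "N", "Y", "N"]
    debt   = ["H", "H", "L", "L", "L", "H", "H", "L"]   # two-level attribute
    ids    = list(range(8))                             # one value per record

    def entropy(ys):
        n = len(ys)
        return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

    def info_gain(attr, ys):
        groups = defaultdict(list)
        for a, label in zip(attr, ys):
            groups[a].append(label)
        remainder = sum(len(g) / len(ys) * entropy(g) for g in groups.values())
        return entropy(ys) - remainder

    print("gain(debt):", info_gain(debt, labels))   # modest, genuine signal
    print("gain(ids): ", info_gain(ids, labels))    # maximal gain, pure overfitting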
Figure 5.2.1: An example of a decision tree, drawn using flowchart symbols. The tree compares accepting a settlement offer of $30K against proceeding with a case, with chance nodes for winning (40%) or losing (60%) and for the damages involved ($500K, $100K, $50K, or zero); the figure's expected value calculation for proceeding reads -$50K + 40%×(80%×(-$100K)) + 60%×(80%×$100K + 5%×$500K + 55%×$50K) = -$2.5K.