CHAPTER 5
Classification
Two common data mining techniques for finding hidden patterns in data are clustering and classification analysis. Although classification and clustering are often mentioned in the same breath, they are different analytical approaches. Imagine a database of customer records, where each record represents a customer's attributes. These can include identifiers such as name and address, demographic information such as gender and age, and financial attributes such as income and amount spent. Clustering is an automated process that groups related records together on the basis of their having similar values for attributes. This approach of segmenting the database via clustering analysis is often used as an exploratory technique, because the analyst does not need to specify ahead of time how records should be related. In fact, the objective of the analysis is often to discover clusters and then examine the attributes and values that define the clusters or segments. As such, interesting and surprising ways of grouping customers together can become apparent, and this in turn can be used to drive marketing and promotion strategies that target specific types of customers.

Classification is a different technique from clustering. It is similar to clustering in that it also segments customer records into distinct segments, called classes. But unlike clustering, a classification analysis requires that the analyst know ahead of time how the classes are defined. For example, classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No). Each record in the dataset used to build the classifier must already have a value for the attribute used to define the classes. Because each record has such a value, and because the end user decides on the attribute to use, classification is much less exploratory than clustering.
The objective of a classifier is not to explore the data to discover interesting segments, but rather to decide how new records should be classified; for example, is this new customer likely to default on the loan? Classification routines in data mining use a variety of algorithms, and the particular algorithm used can affect the way records are classified. A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root, through branches and nodes, to a leaf representing a class. The path a record takes through a decision tree can then be represented as a rule, for example: If Income < $30,000 and Age < 25 and Debt = High, then Default Class = Yes. However, the sequential way in which a decision tree splits records (the most discriminative attribute values, e.g., Income, appear early in the tree) can make the tree overly sensitive to its initial splits. Therefore, in evaluating the goodness of fit of a tree, it is important to examine the error rate for each leaf node (the proportion of records incorrectly classified). A nice property of decision tree classifiers is that, because paths can be expressed as rules, measures for evaluating the usefulness of rules, such as Support, Confidence, and Lift, can also be used to evaluate the usefulness of the tree; a short sketch of these measures follows. Although clustering and classification are often used for segmenting data records, they have different objectives and achieve their segmentations in different ways. Knowing which approach to use is important for decision-making.
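To make these measures concrete, the following minimal sketch (in Python, over a handful of invented customer records; all attribute names and values are hypothetical) computes Support, Confidence, and Lift for the example rule above:

    # Sketch: evaluating a decision-tree path expressed as a rule.
    # All records below are invented for illustration.
    records = [
        {"income": 25000, "age": 22, "debt": "High", "default": "Yes"},
        {"income": 28000, "age": 24, "debt": "High", "default": "Yes"},
        {"income": 55000, "age": 40, "debt": "Low",  "default": "No"},
        {"income": 42000, "age": 31, "debt": "Low",  "default": "No"},
        {"income": 27000, "age": 23, "debt": "High", "default": "No"},
    ]

    def antecedent(r):   # Income < $30,000 and Age < 25 and Debt = High
        return r["income"] < 30000 and r["age"] < 25 and r["debt"] == "High"

    def consequent(r):   # Default Class = Yes
        return r["default"] == "Yes"

    n = len(records)
    n_a = sum(antecedent(r) for r in records)                      # rule body matches
    n_ac = sum(antecedent(r) and consequent(r) for r in records)   # body and head match
    n_c = sum(consequent(r) for r in records)                      # head matches

    support = n_ac / n              # fraction of all records covered by the whole rule
    confidence = n_ac / n_a         # how often the rule is right when it applies
    lift = confidence / (n_c / n)   # improvement over predicting "Yes" blindly

    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")

A lift above 1 indicates that the rule identifies defaulters better than chance would.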
5.1 Classification Definition and Related Issues
Classification is the data analysis task in which a model, or classifier, is constructed to predict categorical labels (the class label attribute). For example, categorical labels for loan application data might include "safe" or "risky". In general, data classification is a two-step process.

Step 1: A classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels. Each tuple is assumed to belong to a predefined class, as determined by the class label attribute. Because the class label of each training tuple is provided, this step is also known as supervised learning. The first step can also be viewed as the learning of a mapping or function, y = f(X), that can predict the associated class label y of a given tuple X. Typically, this mapping is represented in the form of classification rules, decision trees, or mathematical formulae. In Step 2, the model is used for classification.
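A minimal sketch of both steps, assuming the scikit-learn library is available (the loan tuples and attribute names below are invented), trains a decision tree on labeled tuples and then inspects the learned mapping y = f(X) as rules:

    # Sketch of Step 1: learn y = f(X) from labeled training tuples.
    # Assumes scikit-learn; the loan data is purely illustrative.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Training tuples: (income in $1000s, age, debt level 0=Low/1=High).
    X = [[25, 22, 1], [60, 45, 0], [30, 28, 1], [80, 50, 0], [22, 24, 1], [55, 35, 0]]
    y = ["risky", "safe", "risky", "safe", "risky", "safe"]   # class label attribute

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)   # the learning step

    # The learned mapping can be rendered as human-readable rules.
    print(export_text(clf, feature_names=["income", "age", "debt"]))

    # Step 2: use the model to classify a new, unlabeled tuple.
    print(clf.predict([[28, 23, 1]]))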
The predictive accuracy of the classifier is very important and should be estimated first. If we were to use the training set to measure the accuracy of the classifier, this estimate would likely be optimistic, because the classifier tends to overfit the data. Therefore, a test set is used, made up of test tuples and their associated class labels. The associated class label of each test tuple is compared with the learned classifier's class prediction for that tuple. If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known. For example, the classification rules learned from the analysis of data from previous loan applications can be used to approve or reject new or future loan applicants.
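The following sketch of this evaluation, again assuming scikit-learn and using synthetic data, holds out a test set and contrasts the optimistic training accuracy with the more honest test accuracy:

    # Sketch: estimate accuracy on a held-out test set, not the training set.
    # Assumes scikit-learn; the data are synthetic.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Training accuracy is typically inflated because the tree overfits.
    print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
    # Test accuracy is the estimate to report.
    print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))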
Data preparation and the quality of the classifier are two important issues in classification. The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification process.
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise and to treat missing values. This step can help reduce confusion during learning.

Relevance analysis: Many of the attributes in the data may be redundant, and a database may also contain irrelevant attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task.

Data transformation and reduction: Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as 0.0 to 1.0.
The data can also be transformed by generalizing it to higher-level concepts; concept hierarchies may be used for this purpose. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques such as binning, histogram analysis, and clustering. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced" attribute subset, should be less than the time that would have been spent on learning from the original set of attributes. Hence, such analysis can help improve classification efficiency and scalability.
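As a sketch of two of these transformations (the income values below are invented), min-max normalization rescales an attribute into the range 0.0 to 1.0, and equal-width binning discretizes it:

    # Sketch: min-max normalization and equal-width binning (discretization).
    incomes = [18000, 25000, 40000, 62000, 90000]   # invented attribute values

    lo, hi = min(incomes), max(incomes)
    normalized = [(v - lo) / (hi - lo) for v in incomes]   # scaled into 0.0..1.0
    print(normalized)

    n_bins = 3
    width = (hi - lo) / n_bins
    # Assign each value a bin index 0..n_bins-1 (clamp the maximum value).
    bins = [min(int((v - lo) // width), n_bins - 1) for v in incomes]
    print(bins)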
Classification methods can be compared and evaluated according to the following criteria:

Accuracy: The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data. Common estimation techniques include cross-validation and bootstrapping (a sketch follows this list). Because the computed accuracy is only an estimate of how well the classifier or predictor will do on new data tuples, confidence limits can be computed to help gauge this estimate.

Speed: This refers to the computational costs involved in generating and using the given classifier.

Robustness: This is the ability of the classifier to make correct predictions given noisy data or data with missing values.

Scalability: This refers to the ability to construct the classifier efficiently given large amounts of data.

Interpretability: This refers to the level of understanding and insight that is provided by the classifier.
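For the accuracy criterion, the sketch below (assuming scikit-learn, with synthetic data; the normal-approximation confidence limits are one simple choice among several) estimates accuracy by 10-fold cross-validation:

    # Sketch: cross-validated accuracy with rough confidence limits.
    # Assumes scikit-learn; the data are synthetic.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=10, random_state=1)
    scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=10)

    mean, sd = scores.mean(), scores.std(ddof=1)
    # Approximate 95% limits on the mean fold accuracy (normal approximation).
    half = 1.96 * sd / (len(scores) ** 0.5)
    print(f"accuracy = {mean:.3f} +/- {half:.3f}")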
5.2 Decision Tree and Classification
This section first introduces decision trees, and then discusses a decision tree classifier.
5.2.1 Decision Tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. If, in practice, decisions have to be taken online with no recall and under incomplete knowledge, a decision tree should be paralleled by a probability model, such as a best-choice model or online selection algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities. In general, a decision tree is used as a visual and analytical decision support tool in which the expected values (or expected utilities) of competing alternatives are calculated. A decision tree consists of three types of nodes:

Decision nodes: commonly represented by squares.
Chance nodes: represented by circles.
End nodes: represented by triangles.

Commonly, a decision tree is drawn using flowchart symbols, as many people find these easier to read and understand; Figure 5.2.1 shows a decision tree drawn this way. A decision tree has only burst nodes (splitting paths) but no sink nodes (converging paths). Therefore, when drawn manually, trees can grow very big and are often hard to draw fully by hand. Traditionally, decision trees have been created manually, as the example in Figure 5.2.1 shows, although increasingly specialized software is employed. A sketch of the expected value calculation behind such a tree follows.
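The sketch below shows how such expected values can be calculated by folding chance nodes back into probability-weighted averages; all probabilities and payoffs are invented for illustration and do not correspond to Figure 5.2.1:

    # Sketch: "folding back" a decision tree by computing expected values.
    # All probabilities and payoffs below are invented for illustration.

    def expected_value(outcomes):
        """outcomes: list of (probability, payoff) pairs at a chance node."""
        return sum(p * v for p, v in outcomes)

    # Alternative 1: settle now for a certain payoff.
    settle = 30_000

    # Alternative 2: proceed, pay costs, then face a win/lose chance node.
    costs = 50_000
    damages = expected_value([(0.6, 150_000), (0.4, 0)])   # damages chance node
    proceed = -costs + damages

    # A decision node takes the best of its competing alternatives.
    choice = "settle" if settle >= proceed else "proceed"
    print(f"settle={settle}, proceed={proceed}, choose {choice}")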
Decision trees have several advantages:

They are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.

They have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.

Possible scenarios can be added.

Worst, best, and expected values can be determined for different scenarios.

They use a white box model: if a given result is provided by the model, the explanation for the result is easily replicated.

They can be combined with other decision techniques.
Like other methods, decision trees also have some disadvantages. These include:

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels [10] (see the sketch after this list).

Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
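This level-count bias can be demonstrated directly. In the sketch below (toy, invented data), a unique record identifier achieves maximal information gain even though it is predictively useless:

    # Sketch: information gain favors attributes with many distinct values.
    # Toy data: 8 records; "ids" is unique per record and predictively useless.
    import math
    from collections import Counter, defaultdict

    labels = ["Y", "Y", "Y", "N", "N", "N", "Y", "N"]
    debt   = ["H", "H", "L", "L", "L", "H", "H", "L"]   # two-level attribute
    ids    = list(range(8))                             # one value per record

    def entropy(ys):
        n = len(ys)
        return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

    def info_gain(attr, ys):
        groups = defaultdict(list)
        for a, label in zip(attr, ys):
            groups[a].append(label)
        remainder = sum(len(g) / len(ys) * entropy(g) for g in groups.values())
        return entropy(ys) - remainder

    print("gain(debt):", info_gain(debt, labels))   # modest, genuine signal
    print("gain(ids): ", info_gain(ids, labels))    # maximal gain, pure overfitting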
Figure 5.2.1: An example of a decision tree, drawn using flowchart symbols. The tree compares accepting a settlement offer of $30K against proceeding with a case, with chance nodes for winning (40%) or losing (60%) and for the damages involved ($500K, $100K, $50K, or zero); the figure's expected value calculation for proceeding reads -$50K + 40%×(80%×(-$100K)) + 60%×(80%×$100K + 5%×$500K + 55%×$50K) = -$2.5K.