—and the particular algorithm used can affect the way records are classified.
A common approach for classifiers is to use decision trees to partition and
segment records. New records can be classified by traversing the tree from
the root through branches and nodes, to a leaf representing a class. The path
a record takes through a decision tree can then be represented as a rule.
For example: IF Income < $30,000 AND Age < 25 AND Debt = High, THEN Default
Class = Yes. However, because a decision tree splits records sequentially
(the most discriminative attribute values, e.g., Income, appear early in
the tree), the resulting tree can be overly sensitive to its initial
splits. Therefore, in evaluating the goodness of fit of a tree, it is
important to examine the error rate at each leaf node (the proportion of
records incorrectly classified). A useful property of decision tree
classifiers is that, because paths can be expressed as rules, measures for
evaluating the usefulness of rules, such as Support, Confidence,
and Lift, can also be used to evaluate the usefulness of the tree. Although
clustering and classification are often used for purposes of segmenting data
records, they have different objectives and achieve their segmentations in
different ways. Knowing which approach to use is important for decision-making.
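The rule measures mentioned above can be computed directly from a leaf's path. A minimal sketch in Python, using hypothetical loan records and the example rule (Income < $30,000 AND Age < 25 AND Debt = High → Default = Yes); the data values are assumptions for illustration:

```python
# Hypothetical loan records: (income, age, debt, default_class).
records = [
    (25000, 22, "High", "Yes"),
    (28000, 24, "High", "Yes"),
    (26000, 23, "High", "No"),
    (45000, 30, "Low",  "No"),
    (52000, 40, "Low",  "No"),
    (31000, 27, "High", "No"),
]

def rule_matches(r):
    """Antecedent of the example leaf path."""
    income, age, debt, _ = r
    return income < 30000 and age < 25 and debt == "High"

n = len(records)
antecedent = [r for r in records if rule_matches(r)]
both = [r for r in antecedent if r[3] == "Yes"]       # antecedent AND class
class_yes = [r for r in records if r[3] == "Yes"]

support = len(both) / n                               # P(antecedent AND class)
confidence = len(both) / len(antecedent)              # P(class | antecedent)
lift = confidence / (len(class_yes) / n)              # confidence / P(class)
print(support, confidence, lift)
```

A lift above 1 indicates that the leaf's rule identifies defaulters at a higher rate than the base rate in the data, which is one way to judge whether a branch of the tree is pulling its weight.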
5.1 Classification Definition and Related Issues
Classification is the data analysis task in which a model, or classifier, is
constructed to predict categorical labels (the class label attribute). For
example, categorical labels for the loan application data include "safe"
and "risky". In general, data classification is a two-step process.
Step 1: A classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), in which a classification
algorithm builds the classifier by analyzing, or "learning from," a training
set made up of database tuples and their associated class labels. Each tuple
is assumed to belong to a predefined class, as determined by an attribute
called the class label attribute. Because the class label of each training
tuple is provided, this step is also known as supervised learning. The
first step can also be viewed as the learning of a mapping or function,
y = f(X), that can predict the associated class label y of a given tuple X.
Typically, this mapping is represented in the form of classification rules,
decision trees, or mathematical formulae.
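The learning step can be sketched with a deliberately simple classifier, a one-level tree (a decision stump) over a single categorical attribute, rather than any particular library algorithm; the training tuples below are assumptions for illustration:

```python
# Step 1 (learning): fit the mapping y = f(X) from labeled training tuples.
# Here f is a one-level decision tree ("stump") that assigns to each
# attribute value the majority class label seen for it in training.
from collections import Counter

# Hypothetical training set: (debt_level, class_label).
train = [("High", "risky"), ("High", "risky"), ("High", "safe"),
         ("Low", "safe"), ("Low", "safe"), ("Low", "safe")]

def fit_stump(data):
    """Learn a mapping from attribute value to its majority class."""
    by_value = {}
    for value, label in data:
        by_value.setdefault(value, Counter())[label] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}

f = fit_stump(train)   # the learned classifier, i.e. y = f(X)
print(f)
```

Each entry of the resulting dictionary is effectively one classification rule (e.g., IF Debt = High THEN Class = risky), which mirrors the rule representation described above.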
In step 2, the model is used for classifi cation.
The predictive accuracy of the classifier is very important and should
be estimated first. If we were to use the training set to measure the
accuracy of the classifier, this estimate would likely be optimistic, because
the classifier tends to overfit the data. Therefore, a test set is used, made
up of test tuples and their associated class labels. The associated class label
of each test tuple is compared with the learned classifier's class prediction
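The accuracy estimate in step 2 amounts to comparing each test tuple's known label with the classifier's prediction for it. A minimal sketch, assuming a learned value-to-class mapping like the stump above (model, tuples, and the fallback class are all illustrative assumptions):

```python
# Step 2 (classification): estimate accuracy on a held-out test set,
# never on the training set, to avoid an optimistic (overfit) estimate.
def predict(model, value, default="safe"):
    """Classify a tuple by its attribute value; fall back to a default
    class for values unseen during training."""
    return model.get(value, default)

model = {"High": "risky", "Low": "safe"}   # assumed learned mapping

# Hypothetical test set: (debt_level, actual_class_label).
test_set = [("High", "risky"), ("Low", "safe"),
            ("High", "safe"), ("Low", "safe")]

correct = sum(1 for x, y in test_set if predict(model, x) == y)
accuracy = correct / len(test_set)
print(accuracy)   # fraction of test tuples whose predicted label matches
```

Here three of the four assumed test tuples are classified correctly, so the estimated accuracy is 0.75; on a real problem this held-out estimate is what should be reported, not the training accuracy.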