Predicting the authenticity of banknotes

In this section, we will study the problem of predicting whether a particular banknote is genuine or forged. The banknote authentication data set is hosted at https://archive.ics.uci.edu/ml/datasets/banknote+authentication. The creators of the data set took specimens of both genuine and forged banknotes and photographed them with an industrial camera. Each resulting grayscale image was processed using a type of time-frequency transformation known as a wavelet transform. Three features, namely the variance, skewness, and kurtosis of the transformed image, are computed from this transform, and together with the entropy of the image they make up the four features for this binary classification task.

Column name    Type       Definition
waveletVar     Numerical  Variance of the wavelet-transformed image
waveletSkew    Numerical  Skewness of the wavelet-transformed image
waveletCurt    Numerical  Kurtosis of the wavelet-transformed image
entropy        Numerical  Entropy of the image
class          Binary     Authenticity (0 means genuine, 1 means forged)
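
The data file is a headerless CSV, so before doing anything else we need to read it in and name the columns ourselves. Here is a minimal loading sketch, assuming the file data_banknote_authentication.txt has been downloaded from the UCI page above; class is converted to a factor because C5.0() expects a factor outcome:

> bnote <- read.csv("data_banknote_authentication.txt", header = FALSE)
> names(bnote) <- c("waveletVar", "waveletSkew", "waveletCurt",
                    "entropy", "class")
> bnote$class <- factor(bnote$class)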

First, we will split our 1,372 observations into training and test sets:

> library(caret)
> set.seed(266)
> bnote_sampling_vector <- createDataPartition(bnote$class, p = 0.80,
                                               list = FALSE)
> bnote_train <- bnote[bnote_sampling_vector, ]
> bnote_test <- bnote[-bnote_sampling_vector, ]
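
Because createDataPartition() samples within each level of bnote$class, both sets preserve the original class balance; we can verify this with a quick check (output omitted):

> table(bnote_train$class)
> table(bnote_test$class)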

Next, we will introduce the C50 R package, which contains an implementation of the C5.0 algorithm for classification. The C5.0() function from this package also takes a formula and a data frame as its minimum required input. Just as before, we can use the summary() function to examine the resulting model. Instead of reproducing its entire output, we'll focus on just the tree that was built:

> library(C50)
> bnote_tree <- C5.0(class ~ ., data = bnote_train)
> summary(bnote_tree)
waveletVar > 0.75896:
:...waveletCurt > -1.9702: 0 (342)
:   waveletCurt <= -1.9702:
:   :...waveletSkew > 4.9228: 0 (128)
:       waveletSkew <= 4.9228:
:       :...waveletVar <= 3.4776: 1 (34)
:           waveletVar > 3.4776: 0 (2)
waveletVar <= 0.75896:
:...waveletSkew > 5.1401:
    :...waveletVar <= -3.3604: 1 (31)
    :   waveletVar > -3.3604: 0 (93/1)
    waveletSkew <= 5.1401:
    :...waveletVar > 0.30081:
        :...waveletCurt <= 0.35273: 1 (25)
        :   waveletCurt > 0.35273:
        :   :...entropy <= 0.71808: 0 (24)
        :       entropy > 0.71808: 1 (3)
        waveletVar <= 0.30081:
        :...waveletCurt <= 3.0423: 1 (241)
            waveletCurt > 3.0423:
            :...waveletSkew > -1.8624: 0 (21/1)
                waveletSkew <= -1.8624:
                :...waveletVar <= -0.69572: 1 (146)
                    waveletVar > -0.69572:
                    :...entropy <= -0.73535: 0 (2)
                        entropy > -0.73535: 1 (6)

As we can see, it is perfectly acceptable to use a feature more than once in the tree in order to make a new split. The numbers in parentheses to the right of the leaf nodes indicate how many training observations were assigned to each node; where a second number appears after a slash, as in (93/1), it counts the observations that were misclassified at that node. The vast majority of the leaf nodes are pure, meaning that only observations from one class are assigned to them.
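
To make the tree concrete, we can trace an observation through it by hand. An observation with waveletVar = 1.2 (greater than 0.75896) and waveletCurt = 0.5 (greater than -1.9702) falls into the very first leaf and is therefore predicted as class 0; these feature values are invented purely for illustration:

> new_note <- data.frame(waveletVar = 1.2, waveletSkew = 0.0,
                         waveletCurt = 0.5, entropy = 0.0)
> predict(bnote_tree, new_note)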

Only two leaf nodes, those labeled (93/1) and (21/1), contain a single misclassified observation each, so the model makes just two mistakes on the training data.
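
We can verify this by computing the accuracy on the training set itself; with only two errors, the result should be just below 1 (a quick sanity check, output omitted):

> bnote_train_predictions <- predict(bnote_tree, bnote_train)
> mean(bnote_train$class == bnote_train_predictions)

To see whether our model has overfit the data or whether it really can generalize well, we'll test it on our test set: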

> bnote_predictions <- predict(bnote_tree, bnote_test)
> mean(bnote_test$class == bnote_predictions)
[1] 0.9890511

The test accuracy is near perfect, a rare sight and the last time in this chapter that we'll be done so easily! As a final note, C5.0() also has a costs parameter, which is useful for dealing with asymmetric error costs.
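
In banknote authentication, for example, accepting a forged note as genuine is presumably far more expensive than rejecting a genuine one. Here is a minimal sketch of how a cost matrix might be supplied, with rows taken as predicted classes and columns as true classes; the 5:1 cost ratio is an arbitrary illustration:

> # Predicting 0 (genuine) when the true class is 1 (forged) costs 5;
> # the reverse error costs 1, and correct predictions cost nothing
> error_costs <- matrix(c(0, 1, 5, 0), nrow = 2,
                        dimnames = list(c("0", "1"), c("0", "1")))
> bnote_cost_tree <- C5.0(class ~ ., data = bnote_train,
                          costs = error_costs)

The resulting tree will typically trade a little overall accuracy for fewer of the costlier errors.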
