Predicting the authenticity of banknotes

In this section, we will study the problem of predicting whether a particular banknote is genuine or forged. The banknote authentication data set is hosted at https://archive.ics.uci.edu/ml/datasets/banknote+authentication. The creators of the data set took specimens of both genuine and forged banknotes and photographed them with an industrial camera. Each resulting grayscale image was processed using a type of time-frequency transformation known as a wavelet transform. Three features, namely the variance, skewness, and kurtosis of the transformed image, are computed from this transform, and together with the entropy of the image they make up the four features for this binary classification task.

Column name    Type       Definition
waveletVar     Numerical  Variance of the wavelet-transformed image
waveletSkew    Numerical  Skewness of the wavelet-transformed image
waveletCurt    Numerical  Kurtosis of the wavelet-transformed image
entropy        Numerical  Entropy of the image
class          Binary     Authenticity (0 means genuine, 1 means forged)
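
The data file is a headerless CSV, so before doing anything else we need to read it in and name the columns ourselves. Here is a minimal loading sketch, assuming the file data_banknote_authentication.txt has been downloaded from the UCI page above; class is converted to a factor because C5.0() expects a factor outcome:

> bnote <- read.csv("data_banknote_authentication.txt", header = FALSE)
> names(bnote) <- c("waveletVar", "waveletSkew", "waveletCurt",
                    "entropy", "class")
> bnote$class <- factor(bnote$class)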

First, we will split our 1,372 observations into training and test sets:

> library(caret)
> set.seed(266)
> bnote_sampling_vector <- createDataPartition(bnote$class, p = 0.80,
                                               list = FALSE)
> bnote_train <- bnote[bnote_sampling_vector, ]
> bnote_test <- bnote[-bnote_sampling_vector, ]
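
Because createDataPartition() samples within each level of bnote$class, both sets preserve the original class balance; we can verify this with a quick check (output omitted):

> table(bnote_train$class)
> table(bnote_test$class)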

Next, we will introduce the C50 R package, which contains an implementation of the C5.0 algorithm for classification. The C5.0() function from this package also takes a formula and a data frame as its minimum required input. Just as before, we can use the summary() function to examine the resulting model. Instead of reproducing its entire output, we'll focus on just the tree that was built:

> library(C50)
> bnote_tree <- C5.0(class ~ ., data = bnote_train)
> summary(bnote_tree)
waveletVar > 0.75896:
:...waveletCurt > -1.9702: 0 (342)
:   waveletCurt <= -1.9702:
:   :...waveletSkew > 4.9228: 0 (128)
:       waveletSkew <= 4.9228:
:       :...waveletVar <= 3.4776: 1 (34)
:           waveletVar > 3.4776: 0 (2)
waveletVar <= 0.75896:
:...waveletSkew > 5.1401:
    :...waveletVar <= -3.3604: 1 (31)
    :   waveletVar > -3.3604: 0 (93/1)
    waveletSkew <= 5.1401:
    :...waveletVar > 0.30081:
        :...waveletCurt <= 0.35273: 1 (25)
        :   waveletCurt > 0.35273:
        :   :...entropy <= 0.71808: 0 (24)
        :       entropy > 0.71808: 1 (3)
        waveletVar <= 0.30081:
        :...waveletCurt <= 3.0423: 1 (241)
            waveletCurt > 3.0423:
            :...waveletSkew > -1.8624: 0 (21/1)
                waveletSkew <= -1.8624:
                :...waveletVar <= -0.69572: 1 (146)
                    waveletVar > -0.69572:
                    :...entropy <= -0.73535: 0 (2)
                        entropy > -0.73535: 1 (6)

As we can see, it is perfectly acceptable to use a feature more than once in the tree in order to make a new split. The numbers in parentheses to the right of the leaf nodes indicate how many training observations were assigned to each node; where a second number appears after a slash, as in (93/1), it counts the observations that were misclassified at that node. The vast majority of the leaf nodes are pure, meaning that only observations from one class are assigned to them.
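
To make the tree concrete, we can trace an observation through it by hand. An observation with waveletVar = 1.2 (greater than 0.75896) and waveletCurt = 0.5 (greater than -1.9702) falls into the very first leaf and is therefore predicted as class 0; these feature values are invented purely for illustration:

> new_note <- data.frame(waveletVar = 1.2, waveletSkew = 0.0,
                         waveletCurt = 0.5, entropy = 0.0)
> predict(bnote_tree, new_note)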

Only two leaf nodes, those labeled (93/1) and (21/1), contain a single misclassified observation each, so the model makes just two mistakes on the training data.
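
We can verify this by computing the accuracy on the training set itself; with only two errors, the result should be just below 1 (a quick sanity check, output omitted):

> bnote_train_predictions <- predict(bnote_tree, bnote_train)
> mean(bnote_train$class == bnote_train_predictions)

To see whether our model has overfit the data or whether it really can generalize well, we'll test it on our test set: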

> bnote_predictions <- predict(bnote_tree, bnote_test)
> mean(bnote_test$class == bnote_predictions)
[1] 0.9890511

The test accuracy is near perfect, a rare sight and the last time in this chapter that we'll be done so easily! As a final note, C5.0() also has a costs parameter, which is useful for dealing with asymmetric error costs.
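
In banknote authentication, for example, accepting a forged note as genuine is presumably far more expensive than rejecting a genuine one. Here is a minimal sketch of how a cost matrix might be supplied, with rows taken as predicted classes and columns as true classes; the 5:1 cost ratio is an arbitrary illustration:

> # Predicting 0 (genuine) when the true class is 1 (forged) costs 5;
> # the reverse error costs 1, and correct predictions cost nothing
> error_costs <- matrix(c(0, 1, 5, 0), nrow = 2,
                        dimnames = list(c("0", "1"), c("0", "1")))
> bnote_cost_tree <- C5.0(class ~ ., data = bnote_train,
                          costs = error_costs)

The resulting tree will typically trade a little overall accuracy for fewer of the costlier errors.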
