Working with k-NN in R

When explaining how k-NN works, we used the same data for training and testing. The risk here is overfitting: data almost always contains noise (for instance, due to measurement error), and testing on the same dataset does not let us examine the impact of that noise on our classification. In other words, we want to make sure that our classification reflects real associations in the data rather than peculiarities of the sample.

There are several ways to solve this issue. The simplest is to use different data for training and testing. We have already seen this when discussing Naïve Bayes. Another, better, solution is to use cross-validation. In cross-validation, the data is split into a number of parts (up to the number of observations). One part is left out for testing and the rest is used for training. Training is then performed again, leaving another part of the data out for testing, but including the part that was previously used for testing. We will discuss cross-validation in more detail in Chapter 14, Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML.
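
As an illustration of the splitting step (a minimal sketch, assuming a hypothetical data frame called dat), the observations could be assigned to, say, five folds as follows:

# a minimal sketch of 5-fold splitting; dat is a hypothetical data frame
k_folds = 5
folds = sample(rep(1:k_folds, length.out = nrow(dat)))
# in the first iteration, fold 1 is left out for testing and the rest is used for training
test_part  = dat[folds == 1, ]
train_part = dat[folds != 1, ]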

Here, we will use a special case of cross-validation: leave-one-out cross-validation, as this is readily implemented in the function knn.cv(), which is also included in the class package. In leave-one-out cross-validation, each observation is iteratively left out for testing, and all other observations are used for training.

We will perform leave-one-out cross-validation using the Ozone dataset, which contains air quality data. We will now install and load the mlbench package, as it contains the data. Missing data will be omitted here:

install.packages("mlbench")   # install the package if needed
library(mlbench)
library(class)                # provides knn() and knn.cv(), if not already loaded
data(Ozone)
Oz = na.omit(Ozone)           # discard observations with missing values

The dataset originally contains 366 observations (203 remain after omitting observations with missing data) and 13 attributes. Type ?Ozone after loading the package for a description of the attributes. The first attribute is the month in which the observation was collected. For the purpose of the analysis, we will recode the month into a new attribute: data collected between April and September will be coded 1, and the rest 0:

Oz$AprilToSeptember = rep(0, nrow(Oz))     # default: coded 0
Oz$AprilToSeptember[as.numeric(Oz[[1]]) >= 4 &
   as.numeric(Oz[[1]]) <= 9] = 1           # April (4) to September (9): coded 1

There are 92 observations that were collected between April and September. The task will be to classify the observations on this attribute. In order to illustrate the importance of using cross-validation, we will first use the whole dataset for both training and testing; we will then rely on cross-validation. Notice that the first argument of knn() is the training dataset, the second is the testing dataset, the third is the class attribute, and the fourth is the number of neighbors. It is recommended to set k to odd values in order to avoid ties. We will use three neighbors here and see how to select k later:

Oz$classif = knn(Oz[2:13],Oz[2:13],Oz[[14]], 3)

We then examine the confusion matrix:

table(Oz$classif,Oz[[14]])

The following confusion matrix shows that the majority of the observations were classified correctly, but about 15 percent were misclassified ((12 + 19) / 203 ≈ 0.153). Rows correspond to the predicted class and columns to the actual class:

       0    1
  0   92   12
  1   19   80
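
The misclassification ratio can also be computed directly from the confusion matrix; one possible way, using diag() to extract the correctly classified counts:

cm = table(Oz$classif, Oz[[14]])
1 - sum(diag(cm)) / sum(cm)    # (12 + 19) / 203, about 0.153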

Remember that, in this case, all the observations are part of the training dataset. What happens if we classify observations that are not used for training? Let's find out:

Oz$classif2 = knn.cv(Oz[2:13],Oz[[14]], 3)

Note

Note that knn.cv() takes the dataset as the first argument, the class as the second argument, and k as the third argument.
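
Conceptually, knn.cv() behaves like the following loop, which leaves each observation out in turn (a sketch only; results may differ slightly between runs because ties are broken at random):

n = nrow(Oz)
manual_loocv = character(n)
for (i in 1:n) {
   # train on all observations except i, then classify observation i
   manual_loocv[i] = as.character(knn(Oz[2:13][-i, ], Oz[2:13][i, ], Oz[[14]][-i], 3))
}
# table(manual_loocv, Oz[[14]]) should be close to the knn.cv() result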

Let's examine the confusion matrix:

table(Oz$classif2,Oz[[14]])

The following confusion matrix shows that, again, most observations were correctly classified, but the proportion of incorrectly classified observations is higher ((27 + 28) / 203 ≈ 0.271):

       0    1
  0   83   27
  1   28   65
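
The same computation applied to the cross-validated predictions confirms the higher misclassification ratio:

cm2 = table(Oz$classif2, Oz[[14]])
1 - sum(diag(cm2)) / sum(cm2)    # (27 + 28) / 203, about 0.271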

The proportion of incorrectly classified observations is almost twice as high as before. This matters: if we rely on the training data for testing, we may form unrealistic expectations about how k-NN will perform on genuinely unknown data. We will discuss performance measures at the end of the chapter.

How to select k

Duda, Hart, and Stork (in their book, Pattern Classification, 2000) propose to select k as the square root of the number of observations. While such a rule of thumb has the merit of simplicity, it does not always lead to better classifications. In the case of our example, the ratio of misclassified observations under leave-one-out cross-validation rises to about 35 percent when k is set to the square root of the number of observations.
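
As a quick check of this rule of thumb (a sketch; the exact figure may vary slightly between runs because ties are broken at random):

k_rule = round(sqrt(nrow(Oz)))                    # about 14 for 203 observations
classif_rule = knn.cv(Oz[2:13], Oz[[14]], k_rule)
1 - sum(classif_rule == Oz[[14]]) / nrow(Oz)      # misclassification ratio, around 0.35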

Another way to select k is to choose the number of neighbors that maximizes some performance measure. This implies running the analysis several times with different k values until such a maximum is found. We can also set a reasonable upper limit on the number of neighbors; here, an arbitrary 10 percent of the dataset (about 20 neighbors). So let's do this with the data in our example. We will be maximizing accuracy (the complement of the misclassification ratio), but another measure (for instance, precision or recall, discussed at the end of the chapter) can be used:

Accur = rep(0, 20)
for (i in 1:20) {
   # leave-one-out cross-validation with i neighbors
   classification = knn.cv(Oz[2:13], Oz[[14]], i)
   # proportion of correctly classified observations
   Accur[i] = sum(classification == Oz[[14]]) / nrow(Oz)
}

Let's now examine which number of neighbors works best for our data:

which.max(Accur)

The output is 3, so we selected the best number initially!
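
If you want to see how accuracy varies with the number of neighbors, a simple plot of the values stored in Accur can help:

plot(1:20, Accur, type = "b", xlab = "Number of neighbors (k)",
   ylab = "Leave-one-out accuracy")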

Now that we have a working knowledge of classification using k-NN, let's see how to do it with Naïve Bayes.
