Chapter 10. Classification with k-Nearest Neighbors and Naïve Bayes

In Chapter 8, Probability Distributions, Covariance, and Correlation, we examined statistical distributions, covariance, and correlation. In the previous chapter, you learned about regression. Here, we will focus on classification using Naïve Bayes and k-Nearest Neighbors (k-NN). The problem we want to solve, when using both algorithms, is as follows:

  • We have data in which class (the attribute we want to predict) values are known. We call this training data.
  • We have data in which class values are not known (or we pretend we don't know them in order to test that our classifier works, in which case we call this testing data).
  • We want to predict unknown class values using information from data where the class is known.

For instance, imagine we have collected data about the health habits of individuals. For half of these individuals, we know whether or not they have developed a disease, say, in the following year. For the other half of our sample, we don't know if they have developed this disease. We will seek to gain knowledge about the unknown values by using the observed relationships in the data where we know the disease outcome (the class).

In this chapter, we will discover two new algorithms for data classification:

  • Naïve Bayes
  • k-Nearest Neighbors

Understanding k-NN

Remember that, in Chapter 4, Cluster Analysis, we saw that k-means uses distances to cluster data into a user-specified number of groups of homogeneous cases. k-NN uses distances to select, for each observation to classify, the user-defined number of closest observations (its neighbors). In k-NN, any attribute can be categorical or numeric, including the target. As we discuss classification in this chapter, I will limit the description to categorical targets (called class attributes).

The classification of a given observation is made by a majority vote among its neighbors, that is, the most frequent class among the k closest observations. This means that the classification will depend on the chosen number of neighbors. Let's have a look at this. The following figure shows the membership of grey-outlined circles in two class values: plain grey-lined and dotted grey-lined. Notice there is also one filled grey circle, whose class membership we don't know. The black circles and ellipses enclose its closest neighbors on two imaginary attributes, for k = 1, k = 3, and k = 5:

[Figure: Nearest neighbors with different k values]

One can see that, with k = 1, the filled circle would be classified as dotted grey-lined, as its single neighbor has this membership; with k = 3, it would be classified as plain grey-lined, as the majority of its neighbors (two out of three) have this membership. With k = 5, the filled circle would again be classified as dotted grey-lined, because three out of five of its neighbors have this membership.
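In R, the vote itself is nothing more than a frequency count over the neighbors' class labels. Here is a minimal sketch with made-up labels (the neighbor_classes vector is purely illustrative), using the same table() and which.max() idiom we will rely on later in this chapter:

# Toy illustration of the majority vote, assuming k = 3 and made-up neighbor labels
neighbor_classes <- c("dotted", "plain", "plain")
table(neighbor_classes)                      # counts the votes per class
names(which.max(table(neighbor_classes)))    # "plain" wins, two votes to one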

The neighbors are determined according to a distance measure. We discussed distances in Chapter 4, Cluster Analysis, so I'll just provide a quick reminder here.

Distances can be understood as a measure of how much observations differ from one another, overall, on all the considered attributes. There are several types of distances. For instance, the Euclidean distance is the square root of the sum of the squared differences on each attribute between two observations. Its formula is provided here:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Here, x and y are two observations and n is the number of attributes.

The Manhattan (or taxi cab) distance is the sum of the absolute differences on each attribute between two observations. Its formula is:

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

We will use the most common distance—Euclidean distance—in this chapter.
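To make these formulas concrete, here is a quick check on two toy observations with three attributes; the vectors x and y are made up for illustration, and the dist() function returns the same values as the hand computation:

# Computing both distances by hand and with dist()
x <- c(1, 2, 3)
y <- c(2, 4, 6)
sqrt(sum((x - y)^2))                      # Euclidean distance
sum(abs(x - y))                           # Manhattan (taxi cab) distance
dist(rbind(x, y))                         # Euclidean, as computed by dist()
dist(rbind(x, y), method = "manhattan")   # Manhattan, as computed by dist()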

Before going in depth into analyzing data with R, let's classify some data by hand and then check whether we did it right by relying on the knn() function.

For this exploration, we will rely on the iris dataset one more time.

We will start by computing the Euclidean distances between each pair of observations. We want the full distance matrix so that we can determine the nearest neighbors more easily (though not optimally, memory-wise):

distances = dist(iris[1:4], upper = T, diag = T)

We now need to create a data frame from the distances; this step is only needed in our manual implementation of k-Nearest Neighbors. For this purpose, we will install and load the reshape2 package. Using the melt() function, we will obtain a single attribute containing all the distances in long format, instead of the matrix of distances we created previously:

install.packages("reshape2"); library(reshape2)
distances.df <- melt(as.matrix(distances))
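If you want to see what this long format looks like, you can peek at the first rows; melt() applied to a matrix returns the columns Var1, Var2, and value, so each row now holds the distance between one pair of observations:

# One row per pair of observations, with their distance in the value column
head(distances.df)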

Now that we have our data frame, we will select the k closest observations to each observation and assign a class to each observation based on the majority class of its neighbors:

k = 5
N = length(levels(as.factor(distances.df$Var2)))   # number of observations (150)
classif = rep(0, N)                                # predicted class index for each observation
for (i in 1:N) {
  # all distances from observation i to every observation (including itself)
  temp = subset(distances.df, Var2 == i)
  # row indices of the k observations with the smallest distance to observation i
  nearest = unlist(head(temp[order(temp$Var2, temp$value), ], k)[1])
  # collect the class of each neighbor, then keep the most frequent one
  votes = iris[0, 5]
  for (j in 1:length(nearest)) {
    votes[j] = iris[5][nearest[j], ]
  }
  classif[i] = which.max(table(votes))
}
iris$Species_class[classif == 1] = "setosa"
iris$Species_class[classif == 2] = "versicolor"
iris$Species_class[classif == 3] = "virginica"
table(iris$Species, iris$Species_class)

The following output shows some differences between our classification and the correct classes of the data, but overall, our version of k-NN worked pretty well. The rows are the actual species and the columns are our predicted classes:

              setosa  versicolor  virginica
  setosa          50           0          0
  versicolor       0          47          3
  virginica        0           2         48

The following line of code allows us to identify which observations were classified incorrectly:

rownames(iris)[iris$Species != iris$Species_class]

The output shows that this was the case for observations in rows 71, 73, 84, 107, and 120.
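If you prefer a single summary number, the overall agreement rate can be computed directly; with 5 misclassified observations out of 150, it comes to roughly 96.7 percent:

# Proportion of observations that our manual k-NN classified correctly
mean(iris$Species == iris$Species_class)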

Does our version of k-NN work as well as the version implemented in R in the knn() function? Let's find out! We first install and load the class package, which contains the knn() function:

install.packages("class"); library(class)

As we did with our version of k-NN, we will use the same data as training and testing data (this is easier, but not optimal, as we will see later):

iris$knn_class = knn(iris[1:4], iris[1:4], iris[[5]], k = 5)

There are 150 observations in the iris dataset. Let's see how many are classified in the same class by our implementation of k-NN and the knn() function:

sum(iris$knn_class == iris$Species_class)

The output is 150, which means that both implementations performed equally well on this data. Nevertheless, as always, the code presented here is meant only to help you understand the algorithm, and it will only work if the training and testing data are the same. It therefore has only pedagogical applications. Be sure to always use functions available on CRAN for real data analysis.
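As a hint of what a more realistic workflow looks like, the following is a minimal sketch that uses knn() with separate training and testing sets; the random 100/50 split and the seed are illustrative choices, not something prescribed by the algorithm:

# A sketch of knn() with a proper train/test split (illustrative 100/50 split)
set.seed(1)
train_rows <- sample(nrow(iris), 100)
train      <- iris[train_rows, 1:4]
test       <- iris[-train_rows, 1:4]
predicted  <- knn(train, test, cl = iris$Species[train_rows], k = 5)
# Confusion table of true versus predicted species on the held-out observations
table(iris$Species[-train_rows], predicted)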
