How it works...

In Step 1, we first use set.seed() to make the random number generation reproducible and then scale each column of the dataset using the dplyr mutate_if() function. After the data, the first argument of mutate_if() is a condition to be tested on each column; the .funs argument is the function to be applied if the condition is true. Here, then, we're applying the scale() function to every column of the iris dataframe that is numeric and returning a dataframe we call scaled_iris. Scaling the columns is very important in kNN because the magnitude of the raw values can have a strong effect on the distances between points, so we need all columns to be on a similar scale. Next, we make a copy of the Species column from the data, as this contains the class labels, and remove it from the dataframe by assigning NULL to the column. For the next steps, the dataframe should contain only numeric data.
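
The scaling and label separation described here can be sketched as follows. This is a minimal illustration rather than the recipe's exact code: the seed value and the labels variable name are assumptions, while scaled_iris, mutate_if(), and scale() come from the text.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, assumed for illustration

# Apply scale() to every column for which is.numeric() is TRUE;
# the Species factor column is left untouched
scaled_iris <- mutate_if(iris, is.numeric, .funs = scale)

# Keep a copy of the class labels, then drop them from the dataframe
labels <- scaled_iris$Species   # `labels` is a hypothetical name
scaled_iris$Species <- NULL

# After scaling, each column should have mean ~0 and standard deviation 1
sapply(scaled_iris, function(col) c(mean = mean(col), sd = sd(col)))
```

Checking the column means and standard deviations is a quick way to confirm that no single measurement will dominate the distance calculation.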

In Step 2, we decide which rows should be in our training set and which in our test set. We use the sample() function to select from a vector of 1 to the number of rows in iris; we select 80% of the row numbers without replacement, so train_rows is a vector of integers giving the rows of scaled_iris that we will use in our training set. In the rest of this step, we use subsetting and negative subsetting to prepare the subsets of scaled_iris and of the labels that we will need.
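
A minimal sketch of this split is shown next. It rebuilds a stand-in for scaled_iris with base R so that it runs on its own; the seed and the names labels, train_set, test_set, train_labels, and test_labels are assumptions for illustration, and only train_rows and scaled_iris are named in the text.

```r
set.seed(123)  # arbitrary seed, assumed for illustration

# Stand-in for the scaled, label-free dataframe from Step 1
scaled_iris <- as.data.frame(scale(iris[, 1:4]))
labels <- iris$Species  # hypothetical name for the saved labels

# Draw 80% of the row numbers without replacement
train_rows <- sample(1:nrow(iris), size = round(0.8 * nrow(iris)), replace = FALSE)

# Positive subsetting keeps the sampled rows; negative subsetting keeps the rest
train_set    <- scaled_iris[train_rows, ]
test_set     <- scaled_iris[-train_rows, ]
train_labels <- labels[train_rows]
test_labels  <- labels[-train_rows]

nrow(train_set)  # 120 rows for training
nrow(test_set)   # 30 rows for testing
```

Because the same train_rows vector indexes both the data and the labels, each row stays paired with its own class label.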

In Step 3, we apply the kNN algorithm with the knn() function to build the model and classify the test set in a single operation. The train argument gets the portion of the data we set aside for training, the test argument the portion for testing, and the cl (class) argument gets the labels for the training set. The k argument is the number of neighbors that should be used in classifying each unknown test point. The function returns a vector of predicted classes for each row in the test data, which we save in test_set_predictions.
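
As a rough sketch of this call, assuming the knn() function from the class package and an illustrative value of k = 10 (the text does not state the value used), the classification might look like this. The seed, the labels name, and the base R stand-in for scaled_iris are likewise assumptions made so that the example runs on its own.

```r
library(class)

set.seed(123)  # arbitrary seed, assumed for illustration

# Stand-ins for the objects prepared in Steps 1 and 2
scaled_iris <- as.data.frame(scale(iris[, 1:4]))
labels <- iris$Species  # hypothetical name for the saved labels
train_rows <- sample(1:nrow(iris), size = round(0.8 * nrow(iris)), replace = FALSE)

# Training and classification happen in one call
test_set_predictions <- knn(
  train = scaled_iris[train_rows, ],   # rows used for training
  test  = scaled_iris[-train_rows, ],  # rows to classify
  cl    = labels[train_rows],          # class labels of the training rows
  k     = 10                           # assumed number of neighbors
)

# One predicted class (a factor level) per test row
length(test_set_predictions)
levels(test_set_predictions)
```

Note that cl must have exactly one label per row of train, in the same order, which is why both are subset with the same train_rows vector.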

In Step 4, we assess the predictions using the caret package function, confusionMatrix(). This takes the predicted classes and the real classes and creates a set of statistics, including the following table, which contains the Predicted labels in the rows and the Real (Reference) labels in the columns. This model incorrectly predicted one virginica row as versicolor, with all other predictions correct:

##             Reference
## Prediction   setosa versicolor virginica
##   setosa          8          0         0
##   versicolor      0          9         1
##   virginica       0          0        12
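
To make the table easier to read, the following sketch re-enters the counts shown above by hand and derives a few summary figures with base R. It is an illustration of how to interpret the table, not the output of confusionMatrix() itself; the variable names are assumptions.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the reference (real) labels
cm <- matrix(
  c(8, 0,  0,
    0, 9,  1,
    0, 0, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)   # 29 of 30 test rows

# Share of each real class that was predicted correctly
recall <- diag(cm) / colSums(cm)      # virginica: 12 of 13

# Share of each predicted class that was actually correct
precision <- diag(cm) / rowSums(cm)   # versicolor: 9 of 10

round(accuracy, 3)
round(recall, 3)
round(precision, 3)
```

The single off-diagonal count sits in the versicolor row and the virginica column, which is why it lowers the recall for virginica and the precision for versicolor.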
