For this example, we are using the Boston housing data from the UCI Machine Learning Repository. First, we load the data and assign column names:
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
summary(housing)
We reorder the data so that the key (the median housing price, MEDV) is in ascending order:
housing <- housing[order(housing$MEDV),]
Now, we can split the data into a training set and a test set:
#install.packages("caret")
library(caret)
set.seed(5557)
indices <- createDataPartition(housing$MEDV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
nrow(training)
[1] 381
nrow(testing)
[1] 125
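Note that createDataPartition draws a stratified random sample based on the outcome, so the two sets should have similar price distributions. A quick sanity check (not part of the original run, and the exact numbers will depend on the seed):

# the outcome distributions of the two sets should look alike
summary(training$MEDV)
summary(testing$MEDV)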
We build our nearest neighbor model using both sets, excluding the target column MEDV from the predictors and passing it separately as the class labels:
library(class)
# exclude the target (MEDV, column 14) from the predictors
knnModel <- knn(train=training[,-14], test=testing[,-14], cl=training$MEDV)
knnModel
10.5 9.7 7 6.3 13.1 16.3 16.1 13.3 13.3...
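Note that knn() defaults to k = 1, so each test row simply takes the price of its single nearest training row. A minimal sketch of trying a larger neighborhood (k = 5 here is an arbitrary choice, not from the original run):

# with k = 5, knn() assigns the most frequent label among the
# five nearest training rows (ties are broken at random)
knnModel5 <- knn(train=training[,-14], test=testing[,-14], cl=training$MEDV, k=5)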
Let us look at the results:
plot(knnModel)
There is a slight Poisson-like skew to the distribution, with the higher counts near the left side. I think this makes sense for natural data such as housing prices. The tails at both ends drop off sharply, running off the edges of the plot.
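Because knnModel is a factor, plot() is showing the count of predictions at each price level. A minimal sketch of inspecting the same distribution numerically (standard R calls, not part of the original walkthrough):

# tabulate how many test rows were assigned each predicted price level
head(sort(table(knnModel), decreasing=TRUE))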
What about the accuracy of this model? The predictions in knnModel are factors rather than numeric values, so here I extracted them to a flat file and then loaded them back in separately (a direct in-R conversion is sketched after the output below):
predicted <- read.table("housing-knn-predicted.csv")
colnames(predicted) <- c("predicted")
predicted
  predicted
1      10.5
2       9.7
3       7.0
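As an aside, the factor levels can be converted to numeric values directly in R, avoiding the flat-file round trip. This is a standard R idiom rather than part of the original walkthrough:

# convert the factor levels to their numeric values;
# as.numeric(knnModel) alone would return level indices, not prices
predicted <- data.frame(predicted = as.numeric(as.character(knnModel)))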
Then we can build up a results data frame:
results <- data.frame(testing$MEDV, predicted)
And compute our accuracy:
results["accuracy"] <- results['testing.MDEV'] / results['predicted'] head(results) mean(results$accuracy) 1.01794816307793
  testing.MEDV predicted  accuracy
1          5.6      10.5 0.5333333
2          7.2       9.7 0.7422680
3          8.1       7.0 1.1571429
4          8.5       6.3 1.3492063
5         10.5      13.1 0.8015267
6         10.8      16.3 0.6625767
So, the mean ratio of actual to predicted prices is about 1.02; on average, we are estimating within roughly 2% of our testing data, although the rows above show that individual predictions can be off by much more.
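A ratio of means can hide large individual errors, so it may be worth computing a standard error metric as well. A minimal sketch using root mean squared error and mean absolute error (standard definitions, not part of the original example):

# RMSE and MAE between actual and predicted prices, in $1000s
rmse <- sqrt(mean((results$testing.MEDV - results$predicted)^2))
mae  <- mean(abs(results$testing.MEDV - results$predicted))
rmse
mae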