With R we include the packages we are going to use:
install.packages("randomForest", repos="http://cran.r-project.org")
library(randomForest)
Load the data:
filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
housing <- read.table(filename)
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
Split it up:
housing <- housing[order(housing$MDEV),]
#install.packages("caret")
library(caret)
set.seed(5557)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
nrow(training)
nrow(testing)
Calculate our model:
forestFit <- randomForest(MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE
                          + DIS + RAD + TAX + PRATIO + B + LSTAT,
                          data=training)
forestFit

Call:
 randomForest(formula = MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PRATIO + B + LSTAT, data = training)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 11.16163
                    % Var explained: 87.28
This is one of the more informative model summaries we have seen: it reports that the model explains 87% of the variance in MDEV.
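Beyond the printed summary, the randomForest package can also report how much each predictor contributes to the fit via its importance() and varImpPlot() functions. A minimal sketch follows; it fits a small forest on the built-in mtcars data so it runs without the housing download (with the model above, you would call importance(forestFit) directly):

```r
# Sketch: inspecting variable importance from a randomForest fit.
# Uses the built-in mtcars data so the example is self-contained;
# the variable names and seed here are illustrative only.
library(randomForest)

set.seed(5557)
fit <- randomForest(mpg ~ ., data=mtcars)

# IncNodePurity: total decrease in node impurity (residual sum of squares,
# for regression) from splits on each variable, summed over all trees
importance(fit)

# The same information as a dot chart, most important variables at the top
#varImpPlot(fit)
```

High-importance variables are the ones the forest relies on most; this is a quick way to see which of the thirteen housing predictors are actually doing the work.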
Make our prediction:
forestPredict <- predict(forestFit, newdata=testing)

See how well the model worked:

diff <- forestPredict - testing$MDEV
sum( (diff - mean(diff) )^2 ) #sum of squares

1391.95553131418
This is one of the lowest sums of squares among the models we produced in this chapter.
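Note that the statistic above is a centered sum of squares: it measures how the residuals spread around their own mean, so a constant bias in the predictions would not be penalized. A short sketch contrasting it with the plain sum of squared errors and RMSE, using small toy vectors in place of forestPredict and testing$MDEV (which require the downloaded data):

```r
# Sketch: comparing regression error measures on illustrative toy vectors;
# the values here are made up and stand in for forestPredict and testing$MDEV.
predicted <- c(22.1, 19.8, 30.5, 15.2)
actual    <- c(21.0, 20.5, 31.0, 14.0)

diff <- predicted - actual

# Centered sum of squares, as computed above: spread of the residuals
# around their own mean
centered_ss <- sum( (diff - mean(diff))^2 )

# Plain sum of squared errors and RMSE, which also penalize any
# constant bias in the predictions
sse  <- sum(diff^2)
rmse <- sqrt(mean(diff^2))

centered_ss
sse
rmse
```

When comparing models on the same test set, it is worth computing both: two models with the same centered sum of squares can still differ in how far their predictions sit from the truth on average.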