Let's start by installing the different packages.
J48 is the implementation of C4.5 in Weka. It is readily available to R users through the RWeka package. Let's start by downloading and installing RWeka:
install.packages("RWeka"); library(RWeka)
This should work fine if the Java version (32 or 64 bit) matches the architecture of the R build that you are running. If the error "JAVA_HOME cannot be determined from the Registry" occurs, please install the correct Java version as explained at http://www.r-statistics.com/2012/08/how-to-load-the-rjava-package-after-the-error-java_home-cannot-be-determined-from-the-registry/. The package discussed there is rJava, but the advice given should work just as well for issues with RWeka.
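Before installing Java, it can help to check which architecture the current R session uses. The following is a minimal sketch using only base R; it reports the R architecture and the JAVA_HOME environment variable, and does not verify the Java installation itself.

```r
# Pointer size in bytes times 8 gives the bitness of this R session.
r_bits <- 8 * .Machine$sizeof.pointer
cat("R architecture:", R.version$arch, "-", r_bits, "bit\n")

# Sys.getenv() returns an empty string when the variable is not set.
java_home <- Sys.getenv("JAVA_HOME")
if (nzchar(java_home)) {
  cat("JAVA_HOME:", java_home, "\n")
} else {
  cat("JAVA_HOME is not set\n")
}
```

If R reports 64 bit, a 64-bit Java installation is the one to look for, and conversely for 32 bit.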
C5.0 is included in the C50 package, which we will install and load:
install.packages("C50"); library(C50)
The rpart() function of the rpart package allows us to run CART in R. Let's start by installing and loading the package:
install.packages("rpart"); library(rpart)
We'll also install the rpart.plot package, which will allow us to plot the trees:
install.packages("rpart.plot"); library(rpart.plot)
Let's install and load the randomForest package:
install.packages("randomForest"); library(randomForest)
Let's install and load the partykit package, which contains the ctree() function that we will use. We also install the Formula package, which is required by partykit:
install.packages(c("Formula","partykit"))
library(Formula); library(partykit)
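Running install.packages() again for a package that is already present simply reinstalls it. As a convenience, the sketch below lists which of the packages named so far are still missing. The helper missing_packages() is a hypothetical name introduced here for illustration; it is not part of any package.

```r
# Hypothetical helper: return the names in pkgs that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

pkgs <- c("RWeka", "C50", "rpart", "rpart.plot",
          "randomForest", "Formula", "partykit")
to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The packages still have to be loaded with library() afterwards, as shown above.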
Let's start by installing and loading arules; the package contains the AdultUCI dataset, which we will use first:
install.packages("arules")
library(arules)
In this dataset, the class (the attribute named income) is the annual salary of individuals: whether it is above (modality large) or below (modality small) $50,000. The attributes and their types can be seen in the following summary of the dataset:
data(AdultUCI)
summary(AdultUCI)
Here is the output:
We can see that there are a lot of missing values. We will therefore remove the observations containing missing values from the dataset. There are also two attributes we will not use, fnlwgt and education-num. These are attributes three and five, which we will also remove from the dataset:
ADULT = na.omit(AdultUCI)[,-c(3,5)]
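To see what this line does, here is a small sketch on a made-up data frame (the toy data and its column names are assumptions for illustration, not part of AdultUCI). na.omit() drops every row that contains at least one NA, and a negative column index drops the columns at those positions.

```r
# Toy data: row 3 has a missing age, row 4 a missing income.
toy <- data.frame(
  age    = c(39, 50, NA, 28),
  weight = c(1.2, 0.8, 1.0, 1.1),  # stands in for an attribute we do not use
  income = factor(c("small", "large", "small", NA))
)

# Remove incomplete rows, then drop the second column by position.
clean <- na.omit(toy)[, -2]
print(clean)

# Dropping by name gives the same columns and does not depend on column order.
clean_by_name <- na.omit(toy)[, setdiff(names(toy), "weight")]
print(names(clean_by_name))
```

Dropping by position, as in the text, is shorter; dropping by name is a little safer if the column order might change.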
We are now ready to generate our training and testing datasets. For this purpose, we will use the createDataPartition() function of the caret package, which we install with its dependencies:
install.packages("caret", dependencies = c("Suggests","Depends")); library(caret)
set.seed(123)
TrainCases = createDataPartition(ADULT$income, p = .5, list = FALSE)
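The split is drawn within each class so that both halves keep roughly the same class proportions. The base R sketch below illustrates that idea of a stratified split on made-up labels; it is an approximation for illustration under that assumption, not caret's actual implementation.

```r
set.seed(123)

# Made-up, unbalanced class labels: 80 "small" and 20 "large".
labels <- factor(c(rep("small", 80), rep("large", 20)))

# Group the row indices by class, then sample half of each group.
idx_by_class <- split(seq_along(labels), labels)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), length(i) %/% 2)]
}), use.names = FALSE))

# Both halves keep the 4:1 ratio of the full set.
print(table(labels[train_idx]))
print(table(labels[-train_idx]))
```

Sampling row indices without regard to class would give similar proportions on average, but not in every draw.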
Notice that the income attribute is unbalanced; there are far more small-income than large-income individuals. Hence, we will rebalance this skewed dataset by oversampling the minority class in the training dataset:
# Training half of the data, split by class
TrainTemp = ADULT[TrainCases,]
AdultTrainSmallIncome = TrainTemp[TrainTemp$income == "small",]
AdultTrainLargeIncome = TrainTemp[TrainTemp$income == "large",]
# Draw large-income rows with replacement until they match the small-income count
Oversample = sample(nrow(AdultTrainLargeIncome),
   nrow(AdultTrainSmallIncome), replace = TRUE)
AdultTrain = rbind(AdultTrainSmallIncome,
   AdultTrainLargeIncome[Oversample,])
# The remaining half is kept untouched for testing
AdultTest = ADULT[-TrainCases,]
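The same oversampling steps can be tried on a small made-up data frame to confirm that the classes end up balanced. The toy data below is an assumption for illustration; only the sampling logic mirrors the code above.

```r
set.seed(123)

# Toy training set: 8 "small" rows and 2 "large" rows.
train <- data.frame(
  x      = 1:10,
  income = factor(c(rep("small", 8), rep("large", 2)),
                  levels = c("small", "large"))
)

small <- train[train$income == "small", ]
large <- train[train$income == "large", ]

# Sample row positions of the minority class with replacement,
# as many times as there are majority-class rows.
pick <- sample(nrow(large), nrow(small), replace = TRUE)
balanced <- rbind(small, large[pick, ])

# Both classes now have the same number of rows.
print(table(balanced$income))
```

Because minority rows are duplicated, oversampling is applied to the training data only; the test set keeps its original class distribution.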