Installing the packages containing the required functions

Let's start by installing the different packages.

Installing C4.5

J48 is the implementation of C4.5 in Weka. It is readily available to R users through the RWeka package. Let's start by downloading and installing RWeka:

install.packages("RWeka"); library(RWeka)

This should work fine if the Java version (32 or 64 bit) matches the version of R that you are running. If the error "JAVA_HOME cannot be determined from the Registry" occurs, please install the correct Java version as explained at http://www.r-statistics.com/2012/08/how-to-load-the-rjava-package-after-the-error-java_home-cannot-be-determined-from-the-registry/. The post discusses the rJava package, but since RWeka depends on rJava, the advice given should also apply to issues with RWeka.
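Before troubleshooting, it can help to see which architecture the running R session uses and whether JAVA_HOME is set at all. The following is a minimal diagnostic sketch using base R only; the variable names r_arch and java_home are illustrative, and the check does not verify the Java installation itself:

```r
# Architecture of the running R session (for example "x86_64" for 64-bit R)
r_arch <- R.version$arch

# Value of the JAVA_HOME environment variable ("" if it is not set)
java_home <- Sys.getenv("JAVA_HOME")

cat("R architecture:", r_arch, "\n")
cat("JAVA_HOME:", if (nzchar(java_home)) java_home else "(not set)", "\n")
```

If the architecture reported here differs from that of the installed Java, installing the matching Java build is the likely fix.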

Installing C5.0

C5.0 is included in the C50 package, which we will install and load:

install.packages("C50"); library(C50)

Installing CART

The rpart() function of the rpart package allows us to run CART in R. Let's start by installing and loading the package:

install.packages("rpart"); library(rpart)

We'll also install the rpart.plot package, which will allow us to plot the trees:

install.packages("rpart.plot"); library(rpart.plot)

Installing random forest

Let's install and load the randomForest package:

install.packages("randomForest"); library(randomForest)

Installing conditional inference trees

Let's install and load the partykit package that contains the ctree() function, which we will use. We also install the Formula package, which is required by partykit:

install.packages(c("Formula","partykit"))
library(Formula); library(partykit)
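If some of these packages are already installed, reinstalling them each time is unnecessary. The sketch below lists only the packages that are still missing; missing_packages() is a hypothetical helper written for this illustration, not a function from any of the packages above, and the actual installation call is left commented out:

```r
# Packages used in this section
pkgs <- c("RWeka", "C50", "rpart", "rpart.plot",
          "randomForest", "Formula", "partykit")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```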

Loading and preparing the data

Let's start by loading arules; the package contains the AdultUCI dataset, which we will use first:

install.packages("arules")
library(arules)

In this dataset, the class (the attribute named income) is the annual salary of individuals: whether it is above (modality large) or below (modality small) $50,000. The attributes and their types can be seen in the following summary of the dataset:

data(AdultUCI)
summary(AdultUCI)

Here is the output:

[Figure: Summary of the AdultUCI dataset]

We can see that there are a lot of missing values. We will therefore remove the observations containing missing values from the dataset. There are also two attributes we will not use: fnlwgt and education-num. These are the third and fifth attributes, which we will also remove from the dataset:

ADULT = na.omit(AdultUCI)[,-c(3,5)]
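Dropping columns by position depends on the column order of the dataset. As a cross-check, the sketch below confirms which attributes sit at positions three and five and then removes the same two attributes by name. The objects drop_cols, ADULT_by_index, and ADULT_by_name are illustrative names introduced here, and the code assumes the arules package is installed:

```r
library(arules)
data(AdultUCI)

# Which attributes are at positions 3 and 5?
names(AdultUCI)[c(3, 5)]

# Same cleaning step as above, by position
ADULT_by_index <- na.omit(AdultUCI)[, -c(3, 5)]

# Alternative: drop the two attributes by name instead of by position
drop_cols <- c("fnlwgt", "education-num")
ADULT_by_name <- na.omit(AdultUCI)[, !(names(AdultUCI) %in% drop_cols)]

# Both approaches should keep the same attributes
identical(names(ADULT_by_index), names(ADULT_by_name))
```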

We are now ready to generate our training and testing datasets. For this purpose, we will use the createDataPartition() function of the caret package, which we install with its dependencies:

install.packages("caret", dependencies = c("Suggests", "Depends"))
library(caret)
set.seed(123)
TrainCases = createDataPartition(ADULT$income, p = .5, list = FALSE)
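When the outcome is a factor, createDataPartition() samples within each class, so the selected rows keep roughly the class proportions of the full data. The sketch below illustrates this on a small made-up factor rather than on the Adult data; the objects y and idx are illustrative and caret is assumed to be installed:

```r
library(caret)

set.seed(123)

# Toy outcome: 75 "small" and 25 "large" cases (made-up data)
y <- factor(rep(c("small", "large"), times = c(75, 25)))

# Select about half of the cases, stratified by class
idx <- createDataPartition(y, p = .5, list = FALSE)

# Class proportions in the full data and in the selected half
prop.table(table(y))
prop.table(table(y[idx]))
```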

Notice that the income attribute is unbalanced; there are far more small-income than large-income individuals. Hence, we will rebalance this skewed dataset by oversampling the minority class in the training dataset:

TrainTemp = ADULT[TrainCases,]
AdultTrainSmallIncome = TrainTemp[TrainTemp$income == "small",]
AdultTrainLargeIncome = TrainTemp[TrainTemp$income == "large",]
# Draw large-income rows with replacement until they match the small-income count
Oversample = sample(nrow(AdultTrainLargeIncome),
   nrow(AdultTrainSmallIncome), replace = TRUE)
AdultTrain = rbind(AdultTrainSmallIncome,
   AdultTrainLargeIncome[Oversample,])
# The test set is left untouched and keeps its original class distribution
AdultTest = ADULT[-TrainCases,]
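To see what the oversampling step does, the same logic can be run on a small made-up data frame and the class counts inspected afterwards. This is a sketch in base R; the objects toy, small, large, idx, and balanced are illustrative and are not part of the Adult analysis:

```r
set.seed(123)

# Toy data: 75 "small" and 25 "large" cases (made-up values)
toy <- data.frame(
  x = rnorm(100),
  income = factor(rep(c("small", "large"), times = c(75, 25)),
                  levels = c("small", "large"))
)

small <- toy[toy$income == "small", ]
large <- toy[toy$income == "large", ]

# Sample minority-class row indices with replacement,
# as many times as there are majority-class rows
idx <- sample(nrow(large), nrow(small), replace = TRUE)
balanced <- rbind(small, large[idx, ])

# Both classes now have the same number of rows
table(balanced$income)
```

Running table(AdultTrain$income) after the step above gives the corresponding check on the actual training data.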