How to do it...

This dataset has 8,124 samples of mushrooms, each of which is either poisonous or edible; the objective is to predict whether a mushroom is edible or poisonous from 22 categorical features that describe its physical characteristics.

We will show different ways of partitioning this dataset using several caret tools:

  1. First, we load the caret library and the mushroom dataset. This dataset does not contain the column names, so we need to add them after loading the data:
library(caret)
mushroom_data = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header = FALSE)
colnames(mushroom_data) = c("edible","cap-shape", "cap-surface", "cap-color","bruises","odor", "gill-attachment","gill-spacing","gill-size","gill-color","stalk-shape", "stalk-root","stalk-surface-above-ring","stalk-surface-below-ring","stalk-color-above-ring", "stalk-color-below-ring","veil-type","veil-color","ring-number","ring-type", "spore-print-color","population","habitat")
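Several of these column names contain hyphens, which are not syntactic R names, so they need backticks in expressions such as mushroom_data$`cap-shape`. One optional clean-up, sketched here on a small stand-in vector rather than on the real data, is to pass the names through base R's make.names(). The variable names raw_names and clean_names are only illustrative:

```r
# A few of the column names from the recipe, used as a stand-in example.
raw_names <- c("edible", "cap-shape", "stalk-surface-above-ring")

# make.names() replaces characters that are invalid in R names with dots.
clean_names <- make.names(raw_names)
print(clean_names)
```

If this convention suits the rest of the analysis, the same call could be applied to colnames(mushroom_data); the remaining steps of the recipe only use the edible column, so they work either way.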
  2. The first function that we will present here is createDataPartition(). Its first argument is the target variable; for classification problems the partition is stratified by this label, so the proportion of each class is approximately the same in the train and test splits. The p parameter sets the proportion of the data assigned to training (the remaining 1 - p is assigned to the testing dataset). list = FALSE tells the function to return a matrix of row indices instead of a list, and times = 1 requests a single partition. After we get the indices, we can build the train and test datasets:
trainIndex <- createDataPartition(mushroom_data$edible, p = .75,  list = FALSE,  times = 1)
traindata <- mushroom_data[trainIndex,]
testdata <- mushroom_data[-trainIndex,]
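To make the idea of a stratified split concrete, the following sketch reproduces it by hand in base R on toy labels. It is not caret's implementation, only an illustration of the principle: sample the training fraction separately within each class. The toy labels, the seed, and the variable names are assumptions made for the example:

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Toy labels standing in for mushroom_data$edible: 60 edible, 40 poisonous.
labels <- factor(c(rep("e", 60), rep("p", 40)))
p <- 0.75

# Sample 75% of the row indices separately within each class.
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(p * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions are preserved in both splits.
print(prop.table(table(labels[train_idx])))
print(prop.table(table(labels[test_idx])))
```

Because the sampling happens inside each class, neither split can end up with a noticeably different class balance from the full data, which is the property the recipe relies on.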
  3. Let's compare the proportion of edible mushrooms in the full dataset, the train dataset, and the testing dataset. Because the split is stratified, the three proportions are almost identical:
total_proportion <- nrow(mushroom_data[mushroom_data$edible=="e",])/nrow(mushroom_data)
train_proportion <- nrow(traindata[traindata$edible=="e",])/nrow(traindata)
test_proportion <- nrow(testdata[testdata$edible=="e",])/nrow(testdata)
print(paste("p of edible in data=",round(total_proportion,3),
"/p of edible in train=",round(train_proportion,3),
"/p of edible in test=",round(test_proportion,3)))

The output shows the resulting proportions for the full dataset, the training dataset, and the testing dataset.
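A more compact way to compute such proportions, shown here on a toy data frame rather than on the mushroom data, is to take the mean of a logical comparison. The toy data frame and the name prop_edible are assumptions made for the example:

```r
# Toy data frame standing in for mushroom_data.
toy <- data.frame(edible = c("e", "p", "e", "e", "p"))

# The mean of a logical vector is the fraction of TRUE values.
prop_edible <- mean(toy$edible == "e")
print(round(prop_edible, 3))

# prop.table(table(...)) gives the share of every class at once.
print(prop.table(table(toy$edible)))
```

The same expression, for example mean(traindata$edible == "e"), could replace each of the nrow() ratios used in the recipe.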

  4. Another function in the caret package is createResample(). It generates several new samples of the same size as the data using the bootstrap (sampling observations with replacement, so some of them will be repeated and some of them won’t appear in the resulting sample at all). With list = FALSE, the result is a matrix with one column of row indices per resample:
bootstrap_sample <- createResample(mushroom_data$edible,times=10,list=FALSE)
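To see what sampling with replacement does to a dataset of this size, the sketch below draws a single bootstrap sample with base R's sample.int() instead of createResample(). It only illustrates the mechanism; the seed and variable names are assumptions. On average, a bootstrap sample contains roughly 63% of the distinct original rows:

```r
set.seed(1)  # arbitrary seed for reproducibility

n <- 8124  # number of rows in the mushroom dataset

# One bootstrap sample: n row indices drawn with replacement.
boot_idx <- sample.int(n, size = n, replace = TRUE)

# Fraction of the original rows that appear at least once.
frac_present <- length(unique(boot_idx)) / n
print(round(frac_present, 3))

# Rows that never appear ("out-of-bag") can serve as a hold-out set.
oob_idx <- setdiff(seq_len(n), boot_idx)
print(length(oob_idx))
```

With the matrix returned by createResample(), each column plays the role of boot_idx here, so mushroom_data[bootstrap_sample[, 1], ] would give the first resampled dataset.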
  5. The final function that can be used to create partitions is createFolds(). We pass a vector of labels again, and we specify how many subsets of data we want and whether we want a list returned. This function is similar to createDataPartition(), with the difference that it can be used to generate more than two sets. With list = FALSE, it returns a vector giving the fold number of each row:
kfolds_results = createFolds(mushroom_data$edible, k=4,list=FALSE)
r1 = nrow(mushroom_data[kfolds_results==1 & mushroom_data$edible=="e",])/nrow(mushroom_data[kfolds_results==1,])
r2 = nrow(mushroom_data[kfolds_results==2 & mushroom_data$edible=="e",])/nrow(mushroom_data[kfolds_results==2,])
r3 = nrow(mushroom_data[kfolds_results==3 & mushroom_data$edible=="e",])/nrow(mushroom_data[kfolds_results==3,])
r4 = nrow(mushroom_data[kfolds_results==4 & mushroom_data$edible=="e",])/nrow(mushroom_data[kfolds_results==4,])
print(paste("proportion of edible in fold1=",r1,
"/proportion of edible in fold2=",r2,
"/proportion of edible in fold3=",r3,
"/proportion of edible in fold4=",r4))

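The output shows the proportion of edible mushrooms in each of the four folds; because the folds are stratified by the label, the four values should be close to one another.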
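A vector of fold numbers like the one createFolds() returns with list = FALSE is typically consumed in a cross-validation loop, where each fold is held out once and the remaining folds are used for training. The sketch below shows that loop on a hypothetical stand-in vector built with base R; unlike createFolds(), this simple shuffle is not stratified, and the sizes, seed, and names are assumptions:

```r
set.seed(7)  # arbitrary seed for reproducibility

n <- 20
k <- 4

# Hypothetical stand-in for the vector returned by
# createFolds(..., k = 4, list = FALSE): one fold number per row.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  test_rows <- which(fold_id == fold)   # held-out fold
  train_rows <- which(fold_id != fold)  # remaining k - 1 folds
  cat("fold", fold, ": train =", length(train_rows),
      "test =", length(test_rows), "\n")
}
```

With the recipe's objects, the rows for one iteration would be selected as mushroom_data[kfolds_results != fold, ] for training and mushroom_data[kfolds_results == fold, ] for testing.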
