Splitting the data into train and test sets

Every machine learning modeling exercise begins with data cleansing, as discussed earlier. The next step is to split the data into training and test sets. This is usually done by randomly selecting the rows that will be used to create the model; the rows that were not selected are then used to test the final model.

The usual split allocates between 70 and 80 percent of the data to training, with the remainder held out for testing. In an 80-20 split, 80% of the data is used to create the model and the remaining 20% is used to test the final model produced.
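The arithmetic behind an 80-20 split can be sketched in a couple of lines of R (using 768 rows, the size of our diab dataset):

```r
# Row counts for an 80-20 split of a 768-row dataset
n <- 768
train_size <- floor(0.80 * n)  # 614 rows used to build the model
test_size <- n - train_size    # 154 rows held out for testing
```

Note that floor() always rounds down, so the training set may be one row smaller than 0.80 * n when the product is not a whole number.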

We applied this in the earlier section, but we can revisit the code once again. The createDataPartition function was used with the parameter p = 0.80 in order to split the data. The training_index variable holds the training indices (of the dataset, diab) that we will use:

training_index <- createDataPartition(diab$diabetes, p = 0.80, list = FALSE, times = 1)
 
length(training_index) # Number of items that we will select for the train set 
[1] 615 
 
nrow(diab) # The total number of rows in the dataset 
[1] 768 
 
# Creating the training set, this is the data we will use to build our model 
diab_train <- diab[training_index,]
 
# Create the test set, this is the data against which we will test the performance of our model 
diab_test <- diab[-training_index,]
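A useful property of createDataPartition is that it samples within each level of the outcome variable, so the class proportions in the training set mirror those of the full data. The sketch below illustrates this on a small hypothetical data frame (df and its outcome column are stand-ins, not part of the diab dataset):

```r
library(caret)
set.seed(123)

# Hypothetical stand-in for diab: 65 "neg" and 35 "pos" outcomes
df <- data.frame(x = rnorm(100),
                 outcome = factor(rep(c("neg", "pos"), times = c(65, 35))))

# Stratified 80% partition of the row indices
idx <- createDataPartition(df$outcome, p = 0.80, list = FALSE)

prop.table(table(df$outcome))          # proportions in the full data
prop.table(table(df[idx, "outcome"]))  # nearly identical proportions in the train split
```

This stratified, per-class rounding is also why the index length reported above (615) can differ slightly from a plain floor(0.80 * nrow(diab)) calculation.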

We do not necessarily have to use the createDataPartition function; a random sample created using simple R commands, as shown here, will suffice:

# Create a set of random indices representing 80% of the data 
training_index2 <- sample(nrow(diab), floor(0.80 * nrow(diab)))
 
# Check the size of the indices just created 
length(training_index2) 
[1] 614 
 
# Create the training set 
diab_train2 <- diab[training_index2,] 
 
# Create the test set 
diab_test2 <- diab[-training_index2,]
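One caveat with sample-based splits: rerunning the code produces a different random split each time. If you need a reproducible split, call base R's set.seed before sampling, as in this minimal sketch:

```r
# Fixing the RNG seed makes a random split reproducible
set.seed(42)
idx_a <- sample(10, 8)  # first draw of 8 indices out of 10

set.seed(42)
idx_b <- sample(10, 8)  # same seed, so the same draw

identical(idx_a, idx_b)  # TRUE
```

The same applies to createDataPartition, which also relies on R's random number generator.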