Creating a training and testing set

As with most supervised learning tasks, we will split our dataset so that we can train a model on one subset and then test its ability to generalize to new data against the holdout set. For the purposes of this example, we split the data 80/20, but there is no hard rule on what the ratio for a split should be or, for that matter, how many splits there should be in the first place:

// Create Train & Test Splits 
val trainTestSplits = higgs.randomSplit(Array(0.8, 0.2)) 
val (trainingData, testData) = (trainTestSplits(0), trainTestSplits(1)) 

By creating our 80/20 split on the dataset, we are taking a random sample of roughly 8.8 million examples as our training set and the remaining 2.2 million or so as our testing set. This is often referred to as the holdout method for model validation. We could just as easily take another random 80/20 split and generate a new training set of about the same size (8.8 million examples) but with different data. Doing this type of hard splitting of our original dataset introduces a sampling bias, which means that our model will learn to fit the training data, but the training data may not be representative of "reality". Given that we are working with 11 million examples, this bias is far less prominent than it would be if our original dataset had only 100 rows, for example.
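To make the point about repeated splits concrete, here is a minimal sketch that draws two 80/20 splits with different seeds and compares them. It is not the Higgs pipeline itself: it assumes a local SparkSession and uses 10,000 integer row ids as a stand-in for the real data, and the object name SeededSplits and its split helper are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SeededSplits {
  // Split an RDD 80/20. Passing an explicit seed makes the split reproducible.
  def split[T](data: RDD[T], seed: Long): (RDD[T], RDD[T]) = {
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed)
    (train, test)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session, not the cluster setup used for Higgs.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SeededSplits")
      .getOrCreate()

    // Stand-in for the Higgs rows: 10,000 integer ids.
    val data = spark.sparkContext.parallelize(1 to 10000).cache()

    val (trainA, testA) = split(data, seed = 1L)
    val (trainB, _) = split(data, seed = 2L)

    // Sizes are close to 80/20 but not exact, because rows are sampled at random.
    println(s"split A: train=${trainA.count()}, test=${testA.count()}")
    println(s"split B: train=${trainB.count()}")

    // The two training sets overlap heavily but are not identical.
    println(s"rows in both training sets: ${trainA.intersection(trainB).count()}")

    spark.stop()
  }
}
```

Fixing the seed is a design choice worth making whenever results need to be compared across runs: with the same seed and the same input partitioning, the split is the same each time, whereas omitting the seed gives a different training set on every run.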

You can also use H2O Flow to split the data:

  1. Publish the Higgs data as an H2OFrame:
val higgsHF = h2oContext.asH2OFrame(higgs.toDF, "higgsHF") 
  2. Split the data in the Flow UI using the splitFrame command (see Figure 7).
  3. Publish the results back to an RDD.
Figure 7 - Splitting the Higgs dataset into two H2O frames representing 80 and 20 percent of the data.

In contrast to Spark's lazy evaluation, the H2O computation model is eager. That means the splitFrame invocation processes the data right away and creates two new frames, which can be accessed directly.
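The Spark half of this contrast can be observed directly. The following sketch covers only the Spark side and does not call H2O. It assumes a local SparkSession and uses an accumulator to count how many rows a map function has actually processed; the object LazyVsEager and the method processedBeforeAndAfter are hypothetical names used only for this illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyVsEager {
  // Returns the number of rows processed before and after an action is invoked.
  def processedBeforeAndAfter(sc: SparkContext): (Long, Long) = {
    // Counts rows that pass through the map function.
    // Note: if Spark retries a task, an accumulator updated inside a
    // transformation can be incremented more than once for the same row.
    val processed = sc.longAccumulator("processed")

    // Transformation only: Spark records the lineage but runs nothing yet.
    val doubled = sc.parallelize(1 to 1000).map { x => processed.add(1); x * 2 }
    val before = processed.sum

    // Action: count() triggers the job, so the map function runs now.
    doubled.count()
    val after = processed.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LazyVsEager")
      .getOrCreate()

    val (before, after) = processedBeforeAndAfter(spark.sparkContext)
    println(s"rows processed after map():   $before")
    println(s"rows processed after count(): $after")

    spark.stop()
  }
}
```

An eager model, as described above for splitFrame, would correspond to the work being done at the point where the operation is invoked rather than being deferred until a later action.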
