When I import CSV data files from HDFS into my Spark Scala H2O example code, I can filter the incoming data. The following example contains two filter lines: the first checks that a data line is not empty, while the second checks that the final column in each data row (income), which will be enumerated, is not empty:
val testRDD = rawTestData
  .filter(!_.isEmpty)
  .map(_.split(","))
  .filter(rawRow => !rawRow(14).trim.isEmpty)
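The same two-stage filter can be checked quickly on a plain Scala List in place of an RDD. This is only a sketch: the sample lines below are hypothetical stand-ins for the real 15-column data, with the income label at index 14.

```scala
// Hypothetical sample lines; the 15th column (index 14) is the income label.
val rawTestData = List(
  "39,State-gov,77516,a,b,c,d,e,f,g,h,i,j,k,<=50K",
  "",                                           // empty line: removed by the first filter
  "50,Self-emp,83311,a,b,c,d,e,f,g,h,i,j,k, "   // blank income column: removed by the second filter
)

// Same pipeline as the RDD version above, applied to a local collection.
val testRows = rawTestData
  .filter(!_.isEmpty)
  .map(_.split(","))
  .filter(rawRow => !rawRow(14).trim.isEmpty)
```

Only the first sample line survives both filters; the empty line and the line with a blank income column are dropped.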
I also needed to clean my raw data. There are two data sets, one for training and one for testing, and it is important that the values they contain, especially in the columns that will be enumerated, match between the two.
I encountered an error related to the enumerated label column, income, and the values that it contained. I found that the rows of my test data set were terminated with a full stop character ("."). When processed, this caused the training and the test data values to mismatch when enumerated.
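A minimal sketch of the clean-up, again using a plain Scala collection in place of an RDD. The sample rows and their abbreviated middle columns are hypothetical; the point is stripping the trailing full stop from the last (income) column so that the test labels enumerate to the same values as the training labels.

```scala
// Hypothetical test rows: in the real test set the income labels
// end with a full stop, e.g. "<=50K." instead of "<=50K".
val rawTestData = List(
  "39, State-gov, 77516, <=50K.",
  "",                              // empty line, filtered out
  "50, Self-emp, 83311, >50K."
)

// Filter and split as before, then strip a trailing "." from the
// final column so the test labels match the training labels.
val cleaned = rawTestData
  .filter(!_.isEmpty)
  .map(_.split(",").map(_.trim))
  .map(cols => cols.updated(cols.length - 1, cols.last.stripSuffix(".")))
```

After this step the income values in the test set ("<=50K", ">50K") carry no trailing full stop, so enumerating them produces the same categories as the training set.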
So, I think that time and effort should be spent safeguarding data quality as a preliminary step to training and testing machine learning functionality; otherwise time is lost and extra cost is incurred.