Data Quality

When I import CSV data files from HDFS into my Spark Scala H2O example code, I can filter the incoming data. The following example code contains two filter lines: the first checks that a data line is not empty, while the second checks that the final column in each data row (income), which will be enumerated, is not empty:

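val testRDD  = rawTestData
  .filter(!_.isEmpty)      // drop blank lines
  .map(_.split(","))       // split each CSV line into columns
  // keep rows whose income column (index 14) exists and is not empty; the length
  // check avoids an out-of-bounds error, as split(",") drops trailing empty fields
  .filter( rawRow => rawRow.length > 14 && ! rawRow(14).trim.isEmpty )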

I also needed to clean my raw data. There are two data sets, one for training and one for testing. It is important that the training and testing data meet the following conditions:

They have the same number of columns
They use the same data types
Any null values are allowed for in the code
Their enumerated type values match, especially for the labels
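
As a rough illustration of how the first and last of these conditions might be checked before training, here is a minimal sketch in plain Scala. The object name DataSetChecks and both helper functions are hypothetical and are not part of the original example code; they work on rows that have already been split into columns, and the sample values in the usage note are made up.

```scala
// Hypothetical helpers (not from the original example) for comparing
// a training set and a test set that have been split into columns.
object DataSetChecks {

  // Distinct column counts found in a set of rows.
  // A consistent data set yields a set with exactly one element.
  def columnCounts(rows: Seq[Array[String]]): Set[Int] =
    rows.map(row => row.length).toSet

  // Label values that occur in the test rows but never in the training rows.
  // A non-empty result means the enumerated label values will not match.
  def unseenLabels(
      train: Seq[Array[String]],
      test: Seq[Array[String]],
      labelIdx: Int): Set[String] = {
    def labels(rows: Seq[Array[String]]): Set[String] =
      rows.map(row => row(labelIdx).trim).toSet
    labels(test) -- labels(train)
  }
}
```

For example, with a two-column training set whose labels are "yes" and "no", a test set containing the label "no." would be reported by unseenLabels as Set("no."). On Spark RDDs, the same idea could be expressed with map, distinct, and collect rather than on local sequences.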

I encountered an error related to the enumerated label column, income, and the values that it contained. I found that the rows in my test data set were terminated with a full stop character ("."). When the data was processed, this caused the training and test label values to mismatch when enumerated.
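
One possible way to guard against this kind of mismatch is to normalize the label column before it is enumerated. The sketch below is an assumption about how that might look, not the code used in the original example: the LabelCleaning object and its functions are hypothetical, and only the column index 14 comes from the filter code shown earlier.

```scala
// Hypothetical cleaning helpers (not from the original example).
object LabelCleaning {

  // Trim whitespace and remove a single trailing full stop from a label value.
  def normalizeLabel(raw: String): String =
    raw.trim.stripSuffix(".").trim

  // Return a copy of the row with its label column normalized.
  // labelIdx defaults to 14, the income column used in the filter example.
  def cleanRow(row: Array[String], labelIdx: Int = 14): Array[String] =
    row.updated(labelIdx, normalizeLabel(row(labelIdx)))
}
```

Applied to both the training and test rows, for instance as an extra .map(LabelCleaning.cleanRow(_)) step after the filters, this would make a value such as "high." and a value such as "high" enumerate to the same category. The sample values here are illustrative only.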

So, I think that time and effort should be spent safeguarding data quality as a pre-step to training and testing machine learning functionality, so that time is not lost and extra cost is not incurred.
