Importing and cleaning our data

Let's go to the first bit of Python code that actually gets executed in this script.

The first thing we're going to do is load up this PastHires.csv file, and that's the same file we used in the decision tree exercise that we did earlier in this book.

Let's pause quickly to remind ourselves of the content of that file. If you remember right, we have a bunch of attributes of job candidates, and we have a field indicating whether or not we hired those people. What we're trying to do is build up a decision tree that predicts whether we would hire or not hire a person, given those attributes.

Now, let's take a quick peek at PastHires.csv, which Excel will open as a spreadsheet.

You can see that Excel actually imported this into a table, but if you were to look at the raw text you'd see that it's made up of comma-separated values.
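
To make that concrete, here are a few rows in the raw comma-separated format; these are illustrative example values, not the actual contents of the file:

Years Experience,Employed?,Previous employers,Level of Education,Top-tier school,Interned,Hired
10,Y,4,BS,N,N,Y
0,N,0,BS,Y,Y,Y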

The first line contains the actual headings of each column. So, what we have above are the number of years of prior experience, whether the candidate is currently employed, the number of previous employers, the level of education, whether they went to a top-tier school, whether they did an internship while they were in school, and finally, the target that we're trying to predict: whether or not they got a job offer at the end of the day. Now, we need to read that information into an RDD so we can do something with it.

Let's go back to our script:

rawData = sc.textFile("e:/sundog-consult/udemy/datascience/PastHires.csv") 
header = rawData.first() 
rawData = rawData.filter(lambda x:x != header) 

The first thing we need to do is read that CSV data in, and we're going to throw away that first row, because that's our header information, remember. So, here's a little trick for doing that. We start off by importing every single line from that file into an RDD called rawData (I could have called it anything I want) using sc.textFile. SparkContext has a textFile function that will take a text file and create a new RDD, where each entry, each line of the RDD, consists of one line of input.

Make sure you change the path to that file to wherever you actually installed it, otherwise it won't work.

Now, I'm going to extract the first line, the first row, from that RDD by using the first function. So, the header variable now holds just that one row of column headers as a string. Next, look what's going on in the above code: I'm calling filter on my original data that contains all of the information in that CSV file, and I'm defining a filter function that will only let lines through if that line is not equal to the contents of that initial header row. What I've done here is, I've taken my raw CSV file and I've stripped out the first line by only allowing lines that do not equal that first line to survive, and I'm returning that back to the rawData variable again. So, I'm taking rawData, filtering out that first line, and creating a new rawData that only contains the data itself. With me so far? It's not that complicated.
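
If you want to convince yourself that the header really is gone, a quick optional sanity check is to print the header we captured and peek at the first few surviving lines; this isn't part of the script itself:

print(header)
for line in rawData.take(3):
    print(line)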

Now, we're going to use a map function. What we need to do next is put some more structure into this information. Right now, every row of my RDD is just a line of comma-delimited text; it's still just one giant string, and I want to take that comma-separated value list and actually split it up into individual fields. At the end of the day, I want each RDD entry to be transformed from a line of text that has a bunch of information separated by commas into a Python list that has an actual individual field for each column of information that I have. So, that's what this lambda function does:

csvData = rawData.map(lambda x: x.split(",")) 

It calls Python's built-in split method on each row of input, which splits the string on comma characters and produces a list of the individual fields that those commas delimited.

The output of this map function, where I passed in a lambda function that just splits every line into fields based on commas, is a new RDD called csvData. And, at this point, csvData is an RDD that contains, on every row, a list where every element is a column from my source data. Now, we're getting close.
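
If you'd like to see that transformation for yourself, you could peek at the first entry of csvData; the values shown in the comment are just illustrative:

print(csvData.first())
# something like: ['10', 'Y', '4', 'BS', 'N', 'N', 'Y']

Note that every element is still a string at this point; turning those strings into numbers is exactly what comes next.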

It turns out that in order to use a decision tree with MLlib, a couple of things need to be true. First of all, the input has to be in the form of LabeledPoint data types, and it all has to be numeric in nature. So, we need to transform all of our raw data into data that can actually be consumed by MLlib, and that's what the createLabeledPoints function that we skipped past earlier does. We'll get to that in just a second; first, here's the call to it:

trainingData = csvData.map(createLabeledPoints) 

We're going to call a map on csvData, and we are going to pass it the createLabeledPoints function, which will transform every input row into something even closer to what we want at the end of the day. So, let's look at what createLabeledPoints does:

from numpy import array
from pyspark.mllib.regression import LabeledPoint

def createLabeledPoints(fields):
    # Convert every string field to a numeric value
    yearsExperience = int(fields[0])
    employed = binary(fields[1])
    previousEmployers = int(fields[2])
    educationLevel = mapEducation(fields[3])
    topTier = binary(fields[4])
    interned = binary(fields[5])
    hired = binary(fields[6])

    # The label (hired or not) comes first, then the array of features
    return LabeledPoint(hired, array([yearsExperience, employed,
        previousEmployers, educationLevel, topTier, interned]))
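
Before we run through it line by line, here is a minimal standalone sketch of what a single LabeledPoint looks like (imports included so it runs on its own); the feature values are just illustrative:

from numpy import array
from pyspark.mllib.regression import LabeledPoint

# The label (1.0 = hired) comes first, then the numeric feature vector:
# [yearsExperience, employed, previousEmployers, educationLevel, topTier, interned]
example = LabeledPoint(1.0, array([10, 1, 4, 1, 0, 0]))
print(example.label)     # 1.0
print(example.features)  # [10.0,1.0,4.0,1.0,0.0,0.0]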

Getting back to createLabeledPoints: it takes in a list of fields, and just to remind you what that looks like, let's pull up that .csv file in Excel again:

So, at this point, every RDD entry is a Python list, where the first element is the years of experience, the second element is the employment status, and so on and so forth. The problems here are that we want to convert those lists to LabeledPoints, and we want to convert everything to numerical data. So, all these yes and no answers need to be converted to ones and zeros, and the levels of education need to be converted from names of degrees to some numeric ordinal value. Maybe we'll assign the value zero to no education, one can mean BS, two can mean MS, and three can mean PhD, for example. Again, all these yes/no values need to be converted to zeros and ones, because at the end of the day, everything going into our decision tree needs to be numeric, and that's what createLabeledPoints does. Now, let's go back to the code and run through it:

def createLabeledPoints(fields): 
    yearsExperience = int(fields[0]) 
    employed = binary(fields[1]) 
    previousEmployers = int(fields[2]) 
    educationLevel = mapEducation(fields[3]) 
    topTier = binary(fields[4]) 
    interned = binary(fields[5]) 
    hired = binary(fields[6]) 
 
    return LabeledPoint(hired, array([yearsExperience, employed, 
        previousEmployers, educationLevel, topTier, interned])) 

First, it takes in our list of string fields, ready to convert them into a LabeledPoint, where the label is the target value (was this person hired or not? 0 or 1), followed by an array that consists of all the other fields that we care about. So, this is how you create a LabeledPoint that the DecisionTree MLlib class can consume. You can see in the above code that we're converting years of experience from a string to an integer value, and for all the yes/no fields, we're calling this binary function, which I defined up at the top of the code but haven't discussed yet:

def binary(YN): 
    if (YN == 'Y'): 
        return 1 
    else: 
        return 0 

All it does is convert the character 'Y' to 1; otherwise it returns 0. So, Y will become 1, and N will become 0. Similarly, I have a mapEducation function:

def mapEducation(degree): 
    if (degree == 'BS'): 
        return 1 
    elif (degree =='MS'): 
        return 2 
    elif (degree == 'PhD'): 
        return 3 
    else: 
        return 0 

As we discussed earlier, this simply converts the different types of degrees to an ordinal numeric value, in much the same spirit as our yes/no fields.
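
A couple of quick spot checks, with illustrative inputs, show how these two helpers behave:

print(binary('Y'))                  # 1
print(binary('N'))                  # 0
print(mapEducation('MS'))           # 2
print(mapEducation('High School'))  # 0 - anything unrecognized maps to 0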

As a reminder, this is the line of code that sent us running through those functions:

trainingData = csvData.map(createLabeledPoints) 

At this point, after mapping our RDD using that createLabeledPoints function, we now have a trainingData RDD, and this is exactly what MLlib wants for constructing a decision tree.
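
As a final optional check, peeking at the first entry of trainingData shows the finished form; the exact numbers will depend on your data:

print(trainingData.first())
# e.g. (1.0,[10.0,1.0,4.0,1.0,0.0,0.0])

Keep in mind that RDD transformations like map and filter are lazy, so it's only when you call an action such as first that Spark actually reads the file and runs the whole chain.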
