Creating the SparkContext

Now, we'll start by setting up our SparkContext and giving it a SparkConf, a configuration object.

conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree") 

This configuration object says I'm going to set the master node to "local", which means I'm just running on my own local desktop, not on a cluster at all, and in just one process. I'm also going to give it an app name of "SparkDecisionTree"; you can call that whatever you want, Fred, Bob, Tim, whatever floats your boat. It's just the name this job will appear under if you look at it in the Spark console later on.

And then we will create our SparkContext object using that configuration:

sc = SparkContext(conf = conf) 

That gives us an sc object we can use for creating RDDs.
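If you're typing this into a script of your own, note that SparkConf and SparkContext both live in the pyspark package. A minimal, self-contained version of this setup, assuming nothing has been imported yet, would look like:

from pyspark import SparkConf, SparkContext

# Run Spark locally in a single process, and name the job "SparkDecisionTree"
conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")
sc = SparkContext(conf = conf)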

Next, we have a bunch of functions:

# Some functions that convert our CSV input data into numerical 
# features for each job candidate 
def binary(YN): 
    if (YN == 'Y'): 
        return 1 
    else: 
        return 0 
 
def mapEducation(degree): 
    if (degree == 'BS'): 
        return 1 
    elif (degree == 'MS'): 
        return 2 
    elif (degree == 'PhD'): 
        return 3 
    else: 
        return 0 
 
# Convert a list of raw fields from our CSV file to a 
# LabeledPoint that MLlib can use. All data must be numerical... 
def createLabeledPoints(fields): 
    yearsExperience = int(fields[0]) 
    employed = binary(fields[1]) 
    previousEmployers = int(fields[2]) 
    educationLevel = mapEducation(fields[3]) 
    topTier = binary(fields[4]) 
    interned = binary(fields[5]) 
    hired = binary(fields[6]) 
 
    return LabeledPoint(hired, array([yearsExperience, employed, 
        previousEmployers, educationLevel, topTier, interned])) 

Let's just get these functions down for now, and we'll come back to them later.
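Just so you can see where these helpers are headed, here's a rough sketch of how they might eventually be wired together; the file name PastHires.csv and the header-stripping step are assumptions for illustration, and the real walkthrough follows later. Note that createLabeledPoints relies on LabeledPoint from pyspark.mllib.regression and array from NumPy:

from numpy import array
from pyspark.mllib.regression import LabeledPoint

# Hypothetical pipeline: load the CSV (file name assumed), drop the header row,
# split each line into its fields, and build a LabeledPoint for each candidate
rawData = sc.textFile("PastHires.csv")
header = rawData.first()
rawData = rawData.filter(lambda x: x != header)
csvData = rawData.map(lambda x: x.split(","))
trainingData = csvData.map(createLabeledPoints)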
