Turns out that it's easy to make decision trees; in fact it's crazy just how easy it is, with just a few lines of Python code. So let's give it a try.
I've included a PastHires.csv file with your book materials. It contains some fabricated data that I made up about people who either got a job offer or not, based on the attributes of those candidates.
import numpy as np
import pandas as pd
from sklearn import tree

input_file = "c:/spark/DataScience/PastHires.csv"
df = pd.read_csv(input_file, header=0)
Please immediately change the path I used here (c:/spark/DataScience/PastHires.csv) to wherever you've installed the materials for this book. I'm not sure where you put it, but it's almost certainly not there.
We will use pandas to read our CSV in, and create a DataFrame object out of it. Let's go ahead and run our code, and we can use the head() function on the DataFrame to print out the first few lines and make sure that it looks like it makes sense.
df.head()
So, for each candidate ID, we have their years of past experience, whether or not they were employed, their number of previous employers, their highest level of education, whether they went to a top-tier school, and whether they did an internship; and finally, in the Hired column, the answer: whether we actually extended a job offer to this person or not.
As usual, most of the work is just in massaging and preparing your data before you actually run the algorithms on it, and that's what we need to do here. Now, scikit-learn requires everything to be numerical, so we can't have Ys and Ns and BSs and MSs and PhDs. We have to convert all those things to numbers for the decision tree model to work. The way to do this is with some shorthand in pandas, which makes these things easy. For example:
d = {'Y': 1, 'N': 0}
df['Hired'] = df['Hired'].map(d)
df['Employed?'] = df['Employed?'].map(d)
df['Top-tier school'] = df['Top-tier school'].map(d)
df['Interned'] = df['Interned'].map(d)

d = {'BS': 0, 'MS': 1, 'PhD': 2}
df['Level of Education'] = df['Level of Education'].map(d)

df.head()
Basically, we're making a dictionary in Python that maps the letter Y to the number 1, and the letter N to the value 0. So, we want to convert all our Ys to 1s and Ns to 0s, where 1 will mean yes and 0 will mean no. What we do is take the Hired column from the DataFrame and call map() on it, using that dictionary. This goes through the entire Hired column in the DataFrame and uses the dictionary lookup to transform every entry in that column. It returns a new Series that I put back into the Hired column, replacing it with one that's been mapped to 1s and 0s.
We do the same thing for Employed?, Top-tier school, and Interned, so all of those get mapped using the yes/no dictionary, and their Ys and Ns become 1s and 0s as well. For Level of Education, we use the same trick: we create a dictionary that assigns BS to 0, MS to 1, and PhD to 2, and use it to remap those degree names to actual numerical values. So if I go ahead and run that and do a head() again, you can see that it worked:
All my yeses are 1s, my nos are 0s, and my Level of Education is now represented by a numerical value that has real meaning.
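If that map() call feels a little opaque, here's a tiny standalone sketch (not part of the book's code; it just uses a throwaway Series) showing exactly what it does to a single column:

import pandas as pd

# A throwaway Series, just to show what map() does to one column
s = pd.Series(['Y', 'N', 'Y', 'N'])
print(s.map({'Y': 1, 'N': 0}))
# 0    1
# 1    0
# 2    1
# 3    0
# dtype: int64

# One thing to watch out for: any value that isn't in the dictionary becomes NaN,
# so make sure your mapping covers every value that appears in the column.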
Next, we need to prepare everything to actually go into our decision tree classifier, which isn't that hard. To do that, we need to separate our feature information, which is the set of attributes that we're trying to predict from, from our target column, which contains the thing that we're trying to predict. To extract the list of feature column names, we just create a list of the first six columns. Let's go ahead and print that out.
features = list(df.columns[:6])
features
We get the following output:
Above are the column names that contain our feature information: Years Experience, Employed?, Previous employers, Level of Education, Top-tier school, and Interned. These are the attributes of candidates that we want to predict hiring on.
Next, we construct our y vector, which is assigned the thing we're trying to predict, that is, our Hired column:
y = df["Hired"] X = df[features] clf = tree.DecisionTreeClassifier() clf = clf.fit(X,y)
This code extracts the entire Hired column and calls it y. Then it takes all of our columns for the feature data and puts them in something called X. This is a collection of all of the data and all of the feature columns, and X and y are the two things that our decision tree classifier needs.
Actually creating the classifier itself takes just two lines of code: we call tree.DecisionTreeClassifier() to create our classifier, and then we fit it to our feature data (X) and the answers (y), whether or not people were hired. So, let's go ahead and run that.
Displaying graphical data is a little bit tricky, and I don't want to distract us too much with the details here, so please just consider the following boilerplate code. You don't need to get into how GraphViz works here, with dot files and all that stuff: it's not important to our journey right now. The code you need to actually display the end result of a decision tree is simply:
from IPython.display import Image
from sklearn.externals.six import StringIO
import pydot

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=features)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
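One hedged aside: recent versions of scikit-learn have removed sklearn.externals.six, and newer pydot releases return a list from graph_from_dot_data, so the boilerplate above may fail depending on your library versions. If it does, a sketch along these lines (assuming the pydotplus package is installed) should produce the same image:

from IPython.display import Image
import pydotplus

# export_graphviz with out_file=None returns the DOT source as a string
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=features)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

Either way, the end result is the same dot data rendered as a PNG.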
So let's go ahead and run this.
This is what your output should now look like:
There we have it! How cool is that?! We have an actual flow chart here.
Now, let me show you how to read it. At each stage, we have a decision. Remember that most of our data, which is yes or no, is going to be 0 or 1. So, the first decision point becomes: is Employed? less than 0.5? Meaning that if we have an employment value of 0, that is no, we're going to go left. If employment is 1, that is yes, we're going to go right.
So, were they previously employed? If not, go left; if yes, go right. It turns out that in my sample data, everyone who is currently employed got a job offer, so I can very quickly say: if you are currently employed, yes, you're worth bringing in. Now let's follow down to the second level.
So, how do you interpret this? The gini score is the measure of impurity (closely related to entropy) that the algorithm uses at each step. Remember, as we go down, the algorithm is trying to minimize that impurity. The samples value is the number of remaining samples that haven't been sectioned off by a previous decision.
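If you're curious where that number comes from, the gini score for a node is just one minus the sum of the squared class proportions at that node. Here's a quick sketch (not part of the book's code) that computes it by hand; these are the kinds of values you'll see in the nodes of the chart above:

def gini(counts):
    # counts: the number of samples in each class at a node, e.g. [no-hires, hires]
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([0, 5]))   # 0.0  - a pure node: everyone in it has the same outcome
print(gini([1, 4]))   # 0.32 - mostly one class, with a little impurity left
print(gini([5, 5]))   # 0.5  - a 50/50 split, as impure as a two-class node gets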
So, say this person was employed. The way to read the right leaf node is from its value entry, which tells you that at this point we have 0 candidates who were no-hires and 5 who were hires. So again, the way to interpret the first decision point is: if Employed? was 1, I go to the right, meaning they are currently employed, and this brings me to a world where everybody got a job offer. So, that means I should hire this person.
Now, let's say that this person doesn't currently have a job. The next thing I'm going to look at is whether they did an internship. If yes, then we're at a point where, in our training data, everybody got a job offer. So, at that point we can say our impurity is now 0 (gini=0.0000), because everyone's the same; they all got an offer. However, if we keep going down (where the person has not done an internship), we'll be at a point where the gini score is 0.32. It's getting lower and lower, and that's a good thing.
Next, we're going to look at how much experience they have: do they have less than one year of experience? If they do have some experience and they've gotten this far, they're a pretty good no-hire decision; we end up at a point where we have zero impurity, and all three remaining samples in our training set were no-hires (3 no-hires and 0 hires). But if they have less experience, they're probably fresh out of college, and they still might be worth looking at.
The final thing we're going to look at is whether or not they went to a top-tier school, and if so, they end up being a good prediction for a hire. If not, they end up being a no-hire. We end up with one candidate in that category who was a no-hire and 0 who were hires, whereas in the case where candidates did go to a top-tier school, we have 0 no-hires and 1 hire.
So, you can see we just keep going until we reach an impurity of 0, if at all possible, for every case.
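Once the tree has been fit, you can also ask it for predictions directly instead of reading them off the chart. Here's a quick sketch (not from the book's code, and the candidate's values are made up) that feeds in a hypothetical candidate in the same column order as our features list:

import pandas as pd

# A hypothetical candidate (made-up values, in the same order as our features list):
# 10 years of experience, currently employed, 4 previous employers,
# BS degree, not a top-tier school, no internship.
candidate = pd.DataFrame([[10, 1, 4, 0, 0, 0]], columns=features)
print(clf.predict(candidate))   # an output of 1 means "hire", 0 means "no hire"

Tracing that candidate through the flow chart above by hand should give you the same answer the classifier returns.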