9
Machine Learning

Now that you understand the ideas behind many fundamental algorithms, we can turn to more advanced ideas. In this chapter, we explore machine learning. Machine learning refers to a broad range of methods, but they all share the same goal: finding patterns in data and using them to make predictions. We’ll discuss a method called decision trees and then build one that can predict a person’s level of happiness based on some of their personal characteristics.

Decision Trees

Decision trees are diagrams that have a branching structure resembling a tree. We can use decision trees in the same way we use flowcharts—by answering yes/no questions, we are guided along a path that leads to a final decision, prediction, or recommendation. The process of creating a decision tree that leads to optimal decisions is a paradigmatic example of a machine learning algorithm.

Let’s consider a real-world scenario in which we might use decision trees. In emergency rooms, an important decision-maker must perform triage for every newly admitted patient. Triage simply means assigning priority: someone who is minutes from death but can be saved by a timely operation will be admitted to treatment immediately, whereas someone who has a paper cut or a mild case of sniffles will be asked to wait until more urgent cases can be cleared up.

Triage is difficult because you have to make a reasonably accurate diagnosis with very little information or time. If a 50-year-old woman comes to the emergency room and complains of bad chest pain, the person in charge of triage has to decide whether her pain is more likely to be heartburn or a heart attack. The thought process of a person who makes triage decisions is necessarily complex. They’ll take into account a number of factors: the age and sex of the patient, whether they are obese or a smoker, the symptoms they report and the way they talk about them, the expression on their face, how busy the hospital is and what other patients are waiting for treatment, and factors that they may not even be consciously aware of. In order to become good at triage, a person has to learn many patterns.

Understanding the way a triage professional makes a decision is not easy. Figure 9-1 shows a hypothetical, totally made-up triage decision process (not meant as medical advice—don’t try this at home!).

Figure 9-1: A simplified decision tree for heart attack triage

You can read this diagram from top to bottom. At the top, we can see that the heart-attack diagnosis process begins with a patient reporting chest pain. After that, the process branches out depending on the sex of the patient. If the patient is a man, the diagnosis process continues in the left branch and we determine whether he is obese. If the patient is a woman, the process continues in the right branch instead, and we determine whether she is a smoker. At each point in the process, we follow the appropriate branch until we reach the bottom of the tree, where we find the tree’s classification of whether the patient is at high risk or low risk for a heart attack. This binary branching process resembles a tree whose trunk branches into smaller offshoots until reaching the ends of the farthest branches. Accordingly, the decision process illustrated in Figure 9-1 is called a decision tree.

Every place you see text in Figure 9-1 is a node of the decision tree. A node like “Not obese” is known as a branching node because there’s at least one more branch to follow before we’re able to make a prediction. The “Not diabetic = Low risk” node is a terminal node because if we’ve arrived there, we don’t need to branch anymore and we know the decision tree’s final classification (“Low risk”).
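
To make the idea of following branches concrete, here is a minimal sketch of how a tree like this could be encoded and followed in Python. The node questions, the patient fields, and the risk labels are illustrative stand-ins of our own, not the exact contents of Figure 9-1, and certainly not medical advice.

# An illustrative triage-style decision tree encoded as nested dictionaries.
# The questions and outcomes are hypothetical stand-ins, not medical advice.
triage_tree = {
    'question': 'is_male',
    'yes': {'question': 'is_obese', 'yes': 'High risk', 'no': 'Low risk'},
    'no': {'question': 'is_smoker', 'yes': 'High risk', 'no': 'Low risk'},
}

def follow_tree(tree, patient):
    # Follow yes/no branches until we reach a terminal node (a plain string)
    while isinstance(tree, dict):
        answer = 'yes' if patient[tree['question']] else 'no'
        tree = tree[answer]
    return tree

patient = {'is_male': False, 'is_obese': False, 'is_smoker': True}
print(follow_tree(triage_tree, patient))  # prints 'High risk' for this made-up patient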

If we could design a thorough, well-researched decision tree that always led to good triage decisions, it’s possible that someone without medical training could perform triage of heart attack patients, which would save every emergency room in the world plenty of money because they would no longer need to hire and train judicious, highly educated triage professionals. A sufficiently good decision tree could even make it possible to replace human triage professionals with robots, though whether that’s a good goal is debatable. A good decision tree may even lead to better decisions than the average human would make, since it could potentially eliminate the unconscious biases that we fallible humans possess. (And in fact, this has already happened: in 1996 and 2002, separate teams of researchers published papers about their success improving triage results for patients complaining of chest pain by using decision trees.)

The branching decision steps described in a decision tree constitute an algorithm. Executing such an algorithm is very simple: just decide which of the two branches you should be on at every node, and follow the branches to the end. But don’t obey the suggestions of every decision tree you encounter. Remember that anyone can make a decision tree that prescribes any conceivable decision process, even if it leads to wrong decisions. The hard part of decision trees is not executing the decision tree algorithm but designing the decision tree so that it leads to the best possible decisions. Creating an optimal decision tree is an application of machine learning, though merely following a decision tree is not. Let’s discuss the algorithm that creates an optimal decision tree—an algorithm to generate an algorithm—and proceed through the steps of the process to generate an accurate decision tree.

Building a Decision Tree

Let’s build a decision tree that uses information about a person to predict how happy they are. Finding the secret of happiness has preoccupied millions of people for millennia, and social science researchers today spill plenty of ink (and burn through plenty of research grants) pursuing the answers. If we had a decision tree that could use a few pieces of information and reliably predict how happy a person is, it would give us important clues about what determines a person’s happiness, and maybe even some ideas about how to achieve it ourselves. By the end of this chapter, you’ll know how to build such a decision tree.

Downloading Our Dataset

Machine learning algorithms find useful patterns in data, so they require a good dataset. We’ll use data from the European Social Survey (ESS) for our decision tree. You can download the files we’ll use from http://bradfordtuckfield.com/ess.csv and http://bradfordtuckfield.com/variables.csv. (We got our files originally from https://www.kaggle.com/pascalbliem/european-social-survey-ess-8-ed21-201617, where they’re publicly available for free). The ESS is a large-scale survey of adults across Europe that is conducted every two years. It asks a wide variety of personal questions, including religious affiliation, health status, social life, and level of happiness. The files we’ll look at are stored in CSV format. The file extension .csv is short for comma-separated values, and it’s a very common and simple way to store datasets so that they can be opened by Microsoft Excel, LibreOffice Calc, text editors, and some Python modules.

The file variables.csv contains a detailed description of each question recorded in the survey. For example, in line 103 of variables.csv, we can see a description of a variable called happy. This variable records a survey-taker’s answer to the question “Taking all things together, how happy would you say you are?” The answers to this question range from 1 (not happy at all) to 10 (extremely happy). Look at the other variables in variables.csv to see the variety of information available to us. For example, the variable sclmeet records how often respondents meet socially with friends, relatives, or colleagues. The variable health records subjective general health. The variable rlgdgr records a subjective rating of how religious respondents are, and so on.
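
If you prefer to browse these descriptions in Python rather than in a spreadsheet program, you can load variables.csv with pandas as well. We make no assumptions here about the codebook's column names, so we simply peek at its first few rows:

import pandas as pd

# Load the codebook that describes each ESS question and inspect its layout
variable_info = pd.read_csv('variables.csv')
print(variable_info.shape)
print(variable_info.head())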

After seeing our data, we can start to think of hypotheses related to happiness predictions. We might reasonably suppose that people who have active social lives and good health are happier than others. Other variables—like gender, household size, and age—may be less easy to hypothesize about.

Looking at the Data

Let’s start by reading in the data. Download the data from the link and save it locally as ess.csv. Then we can use the pandas module to work with it, storing it in our Python session in a variable called ess:

import pandas as pd
ess = pd.read_csv('ess.csv')

Remember, in order to read the CSV file, you'll need to store it in the same directory that you're running Python from, or else change 'ess.csv' in the previous snippet to the full filepath where you're storing the CSV file. We can use the shape attribute of a pandas dataframe to see how many rows and columns are in our data:

print(ess.shape)

The output should be (44387, 534), indicating that our dataset has 44,387 rows (one for each respondent) and 534 columns (one for each question in the survey). We can look more closely at some of the columns that interest us by using the pandas module’s slicing functions. For example, here’s how we look at the first five answers to the “happy” question:

print(ess.loc[:,'happy'].head())

Our dataset, ess, has 534 columns, one for each question in the survey. For some purposes, we may want to work with all 534 columns at once, but here we want to look only at the happy column, not the other 533. That's why we used pandas' loc indexer, which slices the happy column out of the dataframe and ignores the rest. Then, the head() method shows us the first five rows of that column. You can see that the first five responses are 5, 5, 8, 8, and 5. We can do the same with the sclmeet variable:

print(ess.loc[:,'sclmeet'].head())

The result should be 6, 4, 4, 4, and 6. The happy responses and the sclmeet responses will line up in order. For example, the 134th element of sclmeet is a response given by the same person who gave the response in the 134th element of happy.
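
If you'd like to verify that alignment yourself, you can slice both answers for a single respondent at once. Remember that pandas numbers rows from zero, so row index 133 holds the 134th response:

# Both answers given by the 134th respondent (row index 133)
print(ess.loc[133,['sclmeet','happy']])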

The ESS staff strives to get a complete set of responses from every survey participant. However, there are some cases where responses to some survey questions are missing, sometimes because a participant either refuses to answer or doesn’t know how to answer. Missing responses in the ESS dataset are assigned codes that are much higher than the possible range of real responses. For example, on a question that asks a respondent to choose a number on a scale from 1 to 10, the ESS records a 77 response if the respondent refuses to answer. For our analysis, we’ll consider only responses that are complete, with no missing values for variables that interest us. We can restrict the ess data so that it contains only full responses for the variables we care about as follows:

# Keep only rows whose answers fall within the valid range for each variable
# we care about; larger values are codes for missing or refused responses
ess = ess.loc[ess['sclmeet'] <= 10,:].copy()
ess = ess.loc[ess['rlgdgr'] <= 10,:].copy()
ess = ess.loc[ess['hhmmb'] <= 50,:].copy()
ess = ess.loc[ess['netusoft'] <= 5,:].copy()
ess = ess.loc[ess['agea'] <= 200,:].copy()
ess = ess.loc[ess['health'] <= 5,:].copy()
ess = ess.loc[ess['happy'] <= 10,:].copy()
ess = ess.loc[ess['eduyrs'] <= 100,:].copy().reset_index(drop=True)
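
It's worth checking how much data survives this filtering. The exact count depends on the version of the dataset you downloaded, so we won't quote a number here:

# Confirm how many complete responses remain after filtering out missing codes
print(ess.shape)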

Splitting Our Data

There are many ways we could use this data to explore the relationship between someone’s social life and their happiness. One of the simplest approaches is a binary split: we compare the happiness levels of people with highly active social lives to those of people with less active social lives (Listing 9-1).

import numpy as np
social = list(ess.loc[:,'sclmeet'])
happy = list(ess.loc[:,'happy'])
low_social_happiness = [hap for soc,hap in zip(social,happy) if soc <= 5]
high_social_happiness = [hap for soc,hap in zip(social,happy) if soc > 5]
            
meanlower = np.mean(low_social_happiness)
meanhigher = np.mean(high_social_happiness)

Listing 9-1: Calculating the mean happiness levels of people with inactive and active social lives

In Listing 9-1, we imported the numpy module in order to calculate means. We defined two new variables, social and happy, by slicing them from the ess dataframe. Then, we used list comprehensions to find the happiness levels of all people with lower ratings of social activity (which we saved in the variable low_social_happiness) and the happiness levels of all people with higher ratings of social activity (which we saved in the variable high_social_happiness). Finally, we calculated the mean happiness rating of unsocial people (meanlower) and the mean happiness rating of highly social people (meanhigher). If you run print(meanlower) and print(meanhigher), you should see that people who rated themselves as highly social also rated themselves as slightly happier than their less socially active peers: about 7.8 was the mean happiness level reported by the socially active, and about 7.2 was the mean happiness level for the socially inactive.

We can draw a simple diagram of what we just did, as in Figure 9-2.

Figure 9-2: A simple decision tree predicting happiness based on frequency of social outings

This diagram of our simple binary split has already started to resemble a decision tree. This is not a coincidence: making a binary split in a dataset and comparing outcomes in each half is exactly the process at the heart of the decision tree generation algorithm. In fact, Figure 9-2 can rightfully be called a decision tree, albeit one that has only one branching node. We can use Figure 9-2 as a very simple predictor of happiness: we find out how often someone goes out socially. If their sclmeet value is 5 or less, then we can predict that their happiness is 7.2. If it is higher than 5, then we can predict that their happiness is 7.8. It will not be a perfect prediction, but it’s a start and it’s more accurate than random guessing.
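
Expressed as code, this one-branch predictor is nothing more than an if statement. Here is a minimal sketch that uses the meanlower and meanhigher values we calculated in Listing 9-1 (the function name is our own, just for illustration):

def predict_happiness_simple(sclmeet_value):
    # Our one-branch decision tree: split on sclmeet at 5 and predict the
    # mean happiness rating of the corresponding group
    if sclmeet_value <= 5:
        return(meanlower)
    else:
        return(meanhigher)

print(predict_happiness_simple(3))
print(predict_happiness_simple(7))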

We can try to use our decision tree to draw conclusions about the impact of various characteristics and lifestyle choices. For example, we see that the difference between low social happiness and high social happiness is about 0.6, and we conclude that increasing one’s level of social activity from low to high could lead to a predicted increase in happiness of about 0.6 on a 10-point scale. Of course, trying to draw these sorts of conclusions is fraught with difficulties. It could be that social activity does not cause happiness, but rather that happiness causes social activity; maybe happy people are more often in the jovial mood that leads to calling their friends and arranging social meetings. Disentangling correlation from causation is beyond the scope of this chapter, but regardless of the direction of causation, our simple decision tree has at least given us the fact of the association, which we can investigate further if we care to. As cartoonist Randall Munroe put it, “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.’”

We know how to make a simple decision tree with two branches. Now we just need to perfect how we create branches and then make many of them for a better, more complete decision tree.

Smarter Splitting

When we compared the happiness levels of people with active versus inactive social lives, we used 5 as our split point, saying that those who were rated higher than 5 had an active social life and those who were rated at 5 or below had an inactive social life. We chose 5 because it is a natural middle point for ratings that go from 1 to 10. However, remember that our goal is to build an accurate predictor of happiness. Rather than splitting based on intuitions about what a natural midpoint is or what seems like an active social life, it would be best to make our binary split in some place that leads to the best possible accuracy.

In machine learning problems, there are a few different ways to measure accuracy. The most natural way is to find the sum of our errors. In our case, the error that interests us is the difference between our prediction of someone’s happiness rating and their actual happiness rating. If our decision tree predicts that your happiness is 6 but it’s actually 8, then that tree’s error for your rating is 2. If we add up the prediction errors for every respondent in some group, we can get an error sum that measures the decision tree’s accuracy for predicting the happiness of members of that group. The closer we can get our error sum to zero, the better our tree is (but please see "The Problem of Overfitting" on page 179 for important caveats). This snippet shows a simple way to find the error sum:

lowererrors = [abs(lowhappy - meanlower) for lowhappy in low_social_happiness]
highererrors = [abs(highhappy - meanhigher) for highhappy in high_social_happiness]

total_error = sum(lowererrors) + sum(highererrors)

This code takes the sum of all prediction errors for all respondents. It defines lowererrors, a list containing the prediction error for each less social respondent, and highererrors, a list containing the prediction error for each more social respondent. Notice that we took the absolute value so that we’re adding only non-negative numbers to calculate the error sum. When we run this code, we find that our total error is about 60224. This number is much higher than zero, but if you consider that this is a sum of errors for more than 40,000 respondents whose happiness we predicted using a tree with only two branches, suddenly it doesn’t seem so bad.
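
To put that total in perspective, we can divide it by the number of respondents to get the average error per prediction:

# Average absolute prediction error per respondent for our two-branch tree
print(total_error / len(happy))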

We can try different split points to see if our error improves. For example, we can classify everyone with a social rating higher than 4 as high social and everyone with a social rating of 4 or lower as low social, and compare the resulting error rates. Or we could use 6 as our split point instead. In order to get the highest possible accuracy, we should check every possible split point in order, and choose the split point that leads to the lowest possible error. Listing 9-2 contains a function that accomplishes this.

def get_splitpoint(allvalues,predictedvalues):
    lowest_error = float('inf')
    best_split = None
    best_lowermean = np.mean(predictedvalues)
    best_highermean = np.mean(predictedvalues)
    for pctl in range(0,100):
        split_candidate = np.percentile(allvalues, pctl)
        
        loweroutcomes = [outcome for value,outcome in zip(allvalues,predictedvalues) if value <= split_candidate]
        higheroutcomes = [outcome for value,outcome in zip(allvalues,predictedvalues) if value > split_candidate]
        
        if np.min([len(loweroutcomes),len(higheroutcomes)]) > 0:
            meanlower = np.mean(loweroutcomes)
            meanhigher = np.mean(higheroutcomes)
            
            lowererrors = [abs(outcome - meanlower) for outcome in loweroutcomes]
            highererrors = [abs(outcome - meanhigher) for outcome in higheroutcomes]
            
            total_error = sum(lowererrors) + sum(highererrors)
            
            if total_error < lowest_error:
                best_split = split_candidate
                lowest_error = total_error
                best_lowermean = meanlower
                best_highermean = meanhigher
    return(best_split,lowest_error,best_lowermean,best_highermean)

Listing 9-2: A function that finds the best point at which to split a variable for a branch point of a decision tree

In this function, we use a variable called pctl (short for percentile) to loop through every number from 0 to 99. In the first line of the loop, we define a new split_candidate variable, which is the pctl-th percentile of the data. After that, we go through the same process we used in Listing 9-1. We create a list of the happiness levels of people whose values are less than or equal to the split candidate, and a list of the happiness levels of people whose values are greater than the split candidate, and we check the errors that come from using that split candidate. If the error sum from using that split candidate is smaller than the error sums from all of the previous split candidates, we redefine the best_split variable to be equal to split_candidate. After the loop completes, the best_split variable is equal to the split point that led to the highest accuracy.

We can run this function for any variable, as in the following example, where we run it for hhmmb, the variable that records the respondent’s number of household members:

allvalues = list(ess.loc[:,'hhmmb'])
predictedvalues = list(ess.loc[:,'happy'])
print(get_splitpoint(allvalues,predictedvalues))

The output here shows us the best split point as well as the predicted happiness levels for the two groups defined by that split point:

(1.0, 60860.029867951016, 6.839403436723225, 7.620055170794695)

We interpret this output to mean that the best place to split the hhmmb variable is at 1.0; we split the survey respondents into people who live alone (one household member) and those who live with others (more than one household member). We can also see the average happiness levels for those two groups: about 6.84 and about 7.62, respectively.
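
You can call the same function for any other variable. For example, here is how you could check where it would split sclmeet; we won't quote the output, but you can compare it to the intuition-based split point of 5 that we used earlier:

allvalues = list(ess.loc[:,'sclmeet'])
predictedvalues = list(ess.loc[:,'happy'])
print(get_splitpoint(allvalues,predictedvalues))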

Choosing Splitting Variables

For any variable we choose in our data, we can find the optimal place to put our split point. However, remember that in a decision tree like the one in Figure 9-1, we are not finding split points for only one variable. We split men from women, the obese from the non-obese, smokers from nonsmokers, and so on. A natural question is, how should we decide which variable to split on at each branching node? We could reorder the nodes in Figure 9-1 so that we split by weight first and sex second, or split by sex only on the left branch, or not at all. Deciding which variable to split on at each branch point is a crucial part of generating an optimal decision tree, so we should write code for that part of the process.

We’ll use the same principle we used to find optimal split points to decide on the best splitting variable: the best variable to split on is the one that leads to the split with the smallest error. To determine that, we iterate over every available variable, find the best split point for each, and check which variable yields the split with the lowest error. We can accomplish this by using Listing 9-3.

def getsplit(data,variables,outcome_variable):
    best_var = ''
    lowest_error = float('inf')
    best_split = None
    predictedvalues = list(data.loc[:,outcome_variable])
    best_lowermean = -1
    best_highermean = -1
    for var in variables:
        allvalues = list(data.loc[:,var])
        splitted = get_splitpoint(allvalues,predictedvalues)
        
        if(splitted[1] < lowest_error):
            best_split = splitted[0]
            lowest_error = splitted[1]
            best_var = var
            best_lowermean = splitted[2]
            best_highermean = splitted[3]          
    
    generated_tree = [[best_var,float('-inf'),best_split,best_lowermean],[best_var,best_split,float('inf'),best_highermean]]
    
    return(generated_tree)

Listing 9-3: A function that iterates over every variable and finds the best variable to split on

In Listing 9-3, we’ve defined a function with a for loop that iterates over all the variables in a list of variables. For each of those variables, it finds the best split point by calling the get_splitpoint() function. Each variable, split at its best split point, will lead to a certain error sum for our predictions. If a particular variable has a lower error sum than any variable we considered before it, we store that variable’s name as best_var. After looping through every variable, the function has found the variable with the lowest error sum, stored in best_var. We can run this code on a set of variables other than sclmeet as follows:

variables = ['rlgdgr','hhmmb','netusoft','agea','eduyrs']
outcome_variable = 'happy'
print(getsplit(ess,variables,outcome_variable))

In this case, we see the following output:

[['netusoft', -inf, 4.0, 7.041597337770383], ['netusoft', 4.0, inf, 7.73042471042471]]

Our getsplit() function has output a very simple “tree” in the form of a nested list. This tree has only two branches. The first branch is represented by the first nested list, and the second branch is represented by the second nested list. Each element of both nested lists tells us something about its respective branch. The first list tells us that we’re looking at a branch based on a respondent’s value of netusoft (frequency of internet usage). Specifically, the first branch corresponds to people whose value of netusoft is between -inf and 4.0, where inf stands for infinity. In other words, people in this branch report their internet usage as 4 or less on a 5-point scale. The last element of each list shows an estimated happiness rating: about 7.0 for the people in the first branch, who are not highly active internet users, and about 7.7 for those in the second branch, who report the highest level of internet use. Figure 9-3 shows a drawing of this simple tree.

Figure 9-3: The tree generated by our first call to the getsplit() function

Our function so far is telling us that people with relatively low internet use report themselves as feeling less happy, with a mean happiness rating of about 7.0, whereas people who report the highest level of internet use report happiness levels at about 7.7 on average. Again, we need to be careful about how we draw conclusions from this single fact: internet use may not be a true driver of happiness, but it may instead be correlated to happiness levels because of its strong correlations with age, wealth, health, education, and other characteristics. Machine learning alone doesn’t usually allow us to determine complex causal links with certainty, but, as it has with the simple tree in Figure 9-3, it enables us to make accurate predictions.

Adding Depth

We’ve completed everything we need to make the best possible split at each branch point and generate a tree with two branches. Next, we need to grow the tree beyond just one branching node and two terminal nodes. Look at Figure 9-1 and notice that it has more than two branches. It has what we call a depth of three because there are up to three successive branches you have to follow in order to get the final diagnosis. The final step of our decision tree generation process is to specify a depth that we want to reach, and build new branches until we reach that depth. The way we accomplish this is by making the additions to our getsplit() function shown in Listing 9-4.

maxdepth = 3
def getsplit(depth,data,variables,outcome_variable):
    --snip--
    generated_tree = [[best_var,float('-inf'),best_split,[]],[best_var,best_split,float('inf'),[]]]
    
    if depth < maxdepth:
        splitdata1=data.loc[data[best_var] <= best_split,:]
        splitdata2=data.loc[data[best_var] > best_split,:]
        if len(splitdata1.index) > 10 and len(splitdata2.index) > 10:
            generated_tree[0][3] = getsplit(depth + 1,splitdata1,variables,outcome_variable)
            generated_tree[1][3] = getsplit(depth + 1,splitdata2,variables,outcome_variable)
        else:
            depth = maxdepth + 1
            generated_tree[0][3] = best_lowermean
            generated_tree[1][3] = best_highermean
    else:
        generated_tree[0][3] = best_lowermean
        generated_tree[1][3] = best_highermean
    return(generated_tree)

Listing 9-4: A function that can generate a tree of a specified depth

In this updated function, when we define the generated_tree variable, we now add empty lists to it, instead of means. We insert means only in terminal nodes, but if we want a tree that has greater depth, we need to insert other branches within each branch (that’s what the empty lists will contain). We also added an if statement with a long chunk of code at the end of the function. If the depth of the current branch is less than the maximum depth we want in a tree, this section will recursively call the getsplit() function again to fill in another branch inside it. This process continues until the maximum depth is reached.

We can run this code to find the decision tree that leads to the lowest error in happiness predictions for our dataset:

variables = ['rlgdgr','hhmmb','netusoft','agea','eduyrs']
outcome_variable = 'happy'
maxdepth = 2
print(getsplit(0,ess,variables,outcome_variable))

When we do so, we should get the following output, which represents a tree with a depth of two:

[['netusoft', -inf, 4.0, [['hhmmb', -inf, 4.0, [['agea', -inf, 15.0, 8.035714285714286], ['agea', 15.0, inf, 6.997666564322997]]], ['hhmmb', 4.0, inf, [['eduyrs', -inf, 11.0, 7.263969171483622], ['eduyrs', 11.0, inf, 8.0]]]]], ['netusoft', 4.0, inf, [['hhmmb', -inf, 1.0, [['agea', -inf, 66.0, 7.135361428970136], ['agea', 66.0, inf, 7.621993127147766]]], ['hhmmb', 1.0, inf, [['rlgdgr', -inf, 5.0, 7.743893678160919], ['rlgdgr', 5.0, inf, 7.9873320537428025]]]]]]

Listing 9-5: A representation of a decision tree using nested lists

What you see here is a collection of lists nested within each other. These nested lists represent our full decision tree, though it’s not as easy to read as Figure 9-1. In each level of nesting, we find a variable name and its range, just like we saw with the simple tree illustrated in Figure 9-3. The first level of nesting shows us the same branch we found in Figure 9-3: a branch that represents respondents whose value of netusoft was less than or equal to 4.0. The next list, nested within the first, begins with hhmmb, -inf, 4.0. This is another branch of our decision tree that branches from the branch we just examined, and consists of people whose self-reported household size is 4 or less. If we drew the portion of a decision tree that we’ve looked at in our nested list so far, it would look like Figure 9-4.

We can continue to look at the nested lists to fill in more branches of our decision tree. Lists that are nested within other lists correspond to branches that are lower on the tree. A nested list branches from the list that contains it. The terminal nodes, instead of containing more nested lists, have an estimated happiness score.

Figure 9-4: A selection of branches from the decision tree

We’ve successfully created a decision tree that enables us to predict happiness levels with relatively low error. You can examine the output to see the relative determinants of happiness, and the happiness levels associated with each branch.
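
Because the nested lists are hard to scan by eye, a small helper function that prints the tree with indentation can make the branch structure easier to read. This is an optional sketch of our own that assumes the list format produced by getsplit(), in which a terminal node holds a number rather than another list:

def print_tree(tree,indent=0):
    # Walk the nested-list tree produced by getsplit() and print one line per
    # branch, indenting deeper branches further to the right
    for branch in tree:
        variable,lower,upper,rest = branch
        print(' ' * indent + '{}: {} to {}'.format(variable,lower,upper))
        if isinstance(rest,list):
            print_tree(rest,indent + 4)
        else:
            print(' ' * (indent + 4) + 'predicted happiness: {}'.format(rest))

print_tree(getsplit(0,ess,variables,outcome_variable))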

There is more exploring we can do with decision trees and our dataset. For example, we can try to run the same code but with a different or larger set of variables. We can also create a tree with a different maximum depth. Here is an example of running the code with a different variable list and depth:

variables = ['sclmeet','rlgdgr','hhmmb','netusoft','agea','eduyrs','health']
outcome_variable = 'happy'
maxdepth = 3
print(getsplit(0,ess,variables,outcome_variable))

When we run it with these parameters, we find a very different decision tree. You can see the output here:

[['health', -inf, 2.0, [['sclmeet', -inf, 4.0, [['health', -inf, 1.0, [['rlgdgr', -inf, 9.0, 7.9919636617749825], ['rlgdgr', 9.0, inf, 8.713414634146341]]], ['health', 1.0, inf, [['netusoft', -inf, 4.0, 7.195121951219512], ['netusoft', 4.0, inf, 7.565659008464329]]]]], ['sclmeet', 4.0, inf, [['eduyrs', -inf, 25.0, [['eduyrs', -inf, 8.0, 7.9411764705882355], ['eduyrs', 8.0, inf, 7.999169779991698]]], ['eduyrs', 25.0, inf, [['hhmmb', -inf, 1.0, 7.297872340425532], ['hhmmb', 1.0, inf, 7.9603174603174605]]]]]]], ['health', 2.0, inf, [['sclmeet', -inf, 3.0, [['health', -inf, 3.0, [['sclmeet', -inf, 2.0, 6.049427365883062], ['sclmeet', 2.0, inf, 6.70435393258427]]], ['health', 3.0, inf, [['sclmeet', -inf, 1.0, 4.135036496350365], ['sclmeet', 1.0, inf, 5.407051282051282]]]]], ['sclmeet', 3.0, inf, [['health', -inf, 4.0, [['rlgdgr', -inf, 9.0, 6.992227707173616], ['rlgdgr', 9.0, inf, 7.434662998624484]]], ['health', 4.0, inf, [['hhmmb', -inf, 1.0, 4.948717948717949], ['hhmmb', 1.0, inf, 6.132075471698113]]]]]]]]

In particular, notice that the first branch is split on the variable health instead of the variable netusoft. Other branches at lower depths are split at different points and for different variables. The flexibility of the decision tree method means that starting with the same dataset and the same end goal, two researchers can potentially reach very different conclusions, depending on the parameters they use and decisions they make about how to work with the data. This is a common characteristic of machine learning methods, and part of what makes them so difficult to master.

Evaluating Our Decision Tree

In order to generate our decision tree, we compared error rates for each potential split point and each potential splitting variable, and we always chose the variable and split point that led to the lowest error rate for a particular branch. Now that we’ve successfully generated a decision tree, it makes sense to do similar error calculations, not just for a particular branch but for the whole tree. Evaluating the error rate for the whole tree can give us a sense of how well we’ve accomplished our prediction task, and how well we’re likely to perform on future tasks (for example, future hospital patients complaining of chest pain).

If you look at the decision tree output that we’ve generated so far, you’ll notice that it’s a little hard to read all the nested lists, and there’s no natural way to determine how happy we predict someone is without painstakingly reading through the nested branches and finding the right terminal node. It will be helpful for us to write code that can determine the predicted level of happiness for a person based on what we know about them from their ESS answers. The following function, get_prediction(), can accomplish this for us:

def get_prediction(observation,tree):
    j = 0
    keepgoing = True
    prediction = -1
    while(keepgoing):
        j = j + 1
        # Each branch has the form [variable, lower bound, upper bound, subtree or mean]
        variable_tocheck = tree[0][0]
        bound1 = tree[0][1]
        bound2 = tree[0][2]
        bound3 = tree[1][2]
        # Follow the lower branch if the observation's value is below the split
        # point, and the higher branch otherwise
        if observation.loc[variable_tocheck] < bound2:
            tree = tree[0][3]
        else:
            tree = tree[1][3]
        # A terminal node holds a mean happiness value instead of another list,
        # so when we reach one we stop and use it as our prediction
        if isinstance(tree,float):
            keepgoing = False
            prediction = tree
    return(prediction)

Next, we can create a loop that goes through any portion of our dataset and gets any tree’s happiness prediction for that portion. In this case, let’s try a tree with a maximum depth of four:

predictions=[]
outcome_variable = 'happy'
maxdepth = 4
thetree = getsplit(0,ess,variables,outcome_variable)
for k in range(0,30):
    observation = ess.loc[k,:]
    predictions.append(get_prediction(observation,thetree))

print(predictions)

This code just repeatedly calls the get_prediction() function and appends the result to our predictions list. In this case, we made predictions only for the first 30 observations.

Finally, we can check how these predictions compare to the actual happiness ratings, to see what our total error rate is. Here, we’ll make predictions for our entire dataset, and calculate the absolute differences between our predictions and the recorded happiness values:

predictions = []


for k in range(0,len(ess.index)):
    observation = ess.loc[k,:]
    predictions.append(get_prediction(observation,thetree))

ess.loc[:,'predicted'] = predictions
errors = abs(ess.loc[:,'predicted'] - ess.loc[:,'happy'])

print(np.mean(errors))

When we run this, we find that the mean error made by predictions in our decision tree is 1.369. This is higher than zero but lower than it might be if we used a worse prediction method. Our decision tree seems to make reasonably good predictions so far.
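
One way to judge whether 1.369 is actually good is to compare it to a naive baseline that ignores every survey answer and simply predicts the overall mean happiness for everyone. A quick sketch of that comparison follows; we won't quote its output, but if our tree has learned anything useful, the baseline's mean error should come out higher:

# A naive baseline: predict the overall mean happiness for every respondent
baseline_prediction = np.mean(ess.loc[:,'happy'])
baseline_errors = abs(ess.loc[:,'happy'] - baseline_prediction)
print(np.mean(baseline_errors))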

The Problem of Overfitting

You may have noticed one very important way that our method for evaluating our decision tree doesn’t resemble how predictions work in real life. Remember what we did: we used the full set of survey respondents to generate our decision tree, and then we used that same set of respondents to judge the accuracy of our tree’s predictions. But it’s redundant to predict the happiness ratings of respondents who already took the survey—they took the survey, so we already know their happiness ratings and don’t need to predict them at all! This would be like getting a dataset of past heart attack patients, meticulously studying their pretreatment symptoms, and building a machine learning model that told us whether someone had a heart attack last week. By now, it’s already quite clear whether that person had a heart attack last week, and there are better ways to know than by looking at their initial triage diagnosis data. It’s easy to predict the past, but remember that true prediction is always about the future. As Wharton professor Joseph Simmons put it, “History is about what happened. Science is about what happens next.”

You may think that this isn’t a serious problem. After all, if we can make a decision tree that works well with last week’s heart attack patients, it’s reasonable to suppose that it will work well with next week’s heart attack patients. This is true to some extent. However, there is a danger that if we aren’t careful, we can encounter a common, dastardly peril called overfitting, the tendency of machine learning models to achieve very low error rates on the datasets used to create them (like data from the past) and then unexpectedly high error rates on other data (like the data that actually matters, from the future).

Consider the example of heart attack predictions. If we observe an emergency room for several days, maybe, by coincidence, every admitted patient who is wearing a blue shirt is suffering from a heart attack and every admitted patient who is wearing a green shirt is healthy. A decision tree model that included shirt color in its prediction variables would pick up this pattern and use it as a branching variable because it has such high diagnostic accuracy in our observations. However, if we then use that decision tree to predict heart attacks in another hospital, or for some future day, we’ll find that our predictions are often wrong, as many people in green shirts also suffer heart attacks and many people in blue shirts don’t. The observations we used to build our decision tree are called in-sample observations, and the observations that we then test our model on, which are not part of our decision tree generation process, are called out-of-sample observations. Overfitting means that by zealously seeking low error rates in predictions of our in-sample observations, we have caused our decision tree model to have inordinately high error rates when predicting our out-of-sample observations.

Overfitting is a serious issue in all applications of machine learning, and it trips up even the best machine learning practitioners. To avoid it, we’ll take an important step that will make our decision tree creation process better resemble the real-life prediction scenario.

Remember that real-life prediction is about the future, but when we build our decision tree we necessarily have data only from the past. We can’t possibly get data from the future, so we’ll split our dataset into two subsets: a training set, which we’ll use only to build our decision tree, and a test set, which we’ll use only to check the accuracy of our decision tree. Our test set is from the past, just like the rest of our data, but we treat it as if it’s the future; we don’t use it to create our decision tree (as if it hasn’t happened yet), but we do use it—only after completely building the decision tree—to test the accuracy of our decision tree (as if we got it later in the future).

By doing this simple training/test split, we’ve made our decision tree generation process resemble the real-life problem of predicting the unknown future; the test set is like a simulated future. The error rate that we find on the test set gives us a reasonable expectation of the error rate we’ll get from the actual future. We’ll know that we’re guilty of overfitting if the error on our training set is very low and the error on our test set is very high.

We can define training and test sets as follows:

import numpy as np
np.random.seed(518)
ess_shuffled = ess.reindex(np.random.permutation(ess.index)).reset_index(drop = True)
training_data = ess_shuffled.loc[0:37000,:]
test_data = ess_shuffled.loc[37001:,:].reset_index(drop = True)

In this snippet, we used the numpy module to shuffle the data—in other words, keeping all the data but reordering the rows randomly. We accomplished this with the dataframe’s reindex() method, passing it a random shuffling of the row numbers, which we get from the numpy module’s permutation() function. After shuffling the dataset, we select roughly the first 37,000 shuffled rows as a training dataset and the remainder of the rows as a test dataset. The command np.random.seed(518) is not necessary, but if you run it you’ll ensure that you get the same pseudo-random results that we show here.

After defining our training and test data, we generate a decision tree using only the training data:

thetree = getsplit(0,training_data,variables,outcome_variable)

Finally, we check the average error rate on the test data, which wasn’t used to train our decision tree:

predictions = []
for k in range(0,len(test_data.index)):
    observation = test_data.loc[k,:]
    predictions.append(get_prediction(observation,thetree))

test_data.loc[:,'predicted'] = predictions
errors = abs(test_data.loc[:,'predicted'] - test_data.loc[:,'happy'])
print(np.mean(errors))

We find that our mean error rate on the test data is 1.371. This is just a hair higher than the 1.369 error rate we found when we used the whole dataset for both training and testing. This indicates that our model doesn’t suffer from overfitting: it’s good at predicting the past and almost exactly as good at predicting the future. Quite often, instead of getting this good news, we get bad news—that our model is worse than we thought it was—but it’s good to learn that early, because we can still improve the model before we deploy it in a real scenario. In such cases, we need to keep refining the model until its error rate on the test set is as low as we can make it.
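
If you want to see both halves of the overfitting check side by side, you can compute the same error on the training data and compare it to the 1.371 we found on the test data. A training error far below the test error would be the warning sign described earlier. Here is a sketch:

# Error on the training data, for comparison with the test-set error
train_predictions = []
for k in range(0,len(training_data.index)):
    observation = training_data.loc[k,:]
    train_predictions.append(get_prediction(observation,thetree))

train_errors = abs(np.array(train_predictions) - np.array(training_data.loc[:,'happy']))
print(np.mean(train_errors))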

Improvements and Refinements

You may find that you’ve created a decision tree that has lower accuracy than you would like. For example, you might have worse accuracy than you should because you’re guilty of overfitting. Many of the strategies for dealing with overfitting issues boil down to some kind of simplification, since simple machine learning models are less likely to suffer from overfitting than are complex models.

The first and easiest way to simplify our decision tree models is to limit their maximum depth; since depth is a variable that we can redefine in one short line, this is easy to do. To determine the right depth, we have to check the error rates on out-of-sample data for different depths. If the depth is too high, it’s likely to cause high error because of overfitting. If the depth is too low, it is likely to cause high error because of underfitting. You can think of underfitting as something like the mirror image of overfitting. Overfitting consists of attempting to learn from patterns that are arbitrary or irrelevant—in other words, learning “too much” from noise in our training data, like whether someone is wearing a green shirt. Underfitting consists of failing to learn enough—creating models that miss crucial patterns in the data, like whether someone is obese or uses tobacco.

Overfitting tends to result from models that have too many variables or too much depth, whereas underfitting tends to result from models that have too few variables or too little depth. Just as with many situations in algorithm design, the right place to be is a happy medium between too high and too low. Choosing the right parameters for a machine learning model, including the depth of a decision tree, is often referred to as tuning, because fixing the tightness of a string on a guitar or violin also relies on finding a happy medium between a pitch that’s too high and one that’s too low.
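
In code, tuning the depth is just a loop: for each candidate depth, build a tree on the training data, measure its error on the test data, and keep whichever depth gives the lowest test error. Here is one possible sketch, reusing the functions and the training/test split defined earlier; it can take a while to run, since it rebuilds the tree several times:

# Try several maximum depths and report the test-set error for each.
# Note that getsplit() reads the global maxdepth variable.
for candidate_depth in [1,2,3,4,5]:
    maxdepth = candidate_depth
    candidate_tree = getsplit(0,training_data,variables,outcome_variable)
    depth_predictions = []
    for k in range(0,len(test_data.index)):
        observation = test_data.loc[k,:]
        depth_predictions.append(get_prediction(observation,candidate_tree))
    depth_errors = abs(np.array(depth_predictions) - np.array(test_data.loc[:,'happy']))
    print(candidate_depth, np.mean(depth_errors))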

Another way to simplify our decision tree model is to do what’s called pruning. For this, we grow a decision tree to its full depth and then find branches that we can remove from the tree without increasing our error rate by much.

Another refinement worth mentioning is using different measures to choose the right split point and the right splitting variable. In this chapter, we introduced the idea of using the sum of prediction errors to decide where to put the split point; the right split point is the one that minimizes that error sum. But there are other ways to decide on the right split point for a decision tree, including Gini impurity, entropy, information gain, and variance reduction. In practice, these other measures, especially Gini impurity and information gain for classification problems, are used far more often than a simple error rate, because their mathematical properties make them better choices in many cases. Experiment with different ways to choose a split point and splitting variable to find the one that seems to perform best for your data and your decision problem.
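
As one concrete illustration, here is how the error measure inside get_splitpoint() could be swapped for a variance-based score, in the spirit of variance reduction. This is a sketch of the general idea rather than the exact formula used by any particular library; to try it, you would compute this score in place of total_error in Listing 9-2 and keep the rest of the logic the same:

def split_score_variance(loweroutcomes,higheroutcomes):
    # Weighted average of the variance within each group; a lower score means
    # the split produces groups whose outcome values cluster more tightly
    n_low = len(loweroutcomes)
    n_high = len(higheroutcomes)
    total = n_low + n_high
    return((n_low / total) * np.var(loweroutcomes)
           + (n_high / total) * np.var(higheroutcomes))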

Everything we do in machine learning is meant to enable us to make accurate predictions on new data. When you’re trying to improve a machine learning model, you can always judge whether an action is worthwhile by checking how much it improves your error rate on test data. And feel free to be creative to find improvements—anything that improves your error rate on test data is probably worth trying.

Random Forests

Decision trees are useful and valuable, but they are not regarded as the best machine learning method by professionals. This is in part because of their reputation for overfitting and relatively high error rates, and in part because of the invention of a method called random forests, which has become popular recently and provides an unequivocal performance improvement over decision trees.

As its name suggests, a random forest model consists of a collection of decision tree models. Each decision tree in the random forest depends on some randomization. Using randomization, we get a diverse forest with many trees instead of a forest that is just one tree repeated over and over. The randomization occurs in two places. First, the training dataset is randomized: each tree is built considering only a subset of the training set, which is randomly selected and will be different for every tree. (The test set is randomly selected at the beginning of the process, but it’s not rerandomized or reselected for every tree.) Second, the variables used to build the tree are randomized: only a subset of the full set of variables is used for each tree, and the subset could be different every time as well.

After building a collection of these different, randomized trees, we have a whole random forest. To make a prediction about a particular observation, we find what each of these different decision trees predicts, and then take the average of the predictions from all the individual decision trees. Since the decision trees are randomized in both their data and their variables, taking an average of all of them helps avoid the problem of overfitting and often leads to more accurate predictions.
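
To make this concrete, here is a from-scratch sketch of a tiny random forest built out of the getsplit() and get_prediction() functions we already have. Each tree sees a bootstrap resample of the training rows and a random subset of the variables, and the forest's prediction is the average of its trees' predictions. Real random forest implementations add many refinements beyond this, so treat it only as an illustration of the idea:

import random

def build_random_forest(data,variables,outcome_variable,n_trees=10,n_vars=3):
    # Build n_trees decision trees, each from a bootstrap sample of the rows
    # and a random subset of n_vars variables. Tree depth is still controlled
    # by the global maxdepth variable that getsplit() reads.
    forest = []
    for i in range(n_trees):
        sampled = data.sample(frac=1.0,replace=True).reset_index(drop=True)
        chosen_vars = random.sample(variables,n_vars)
        forest.append(getsplit(0,sampled,chosen_vars,outcome_variable))
    return(forest)

def forest_prediction(observation,forest):
    # Average the predictions of every tree in the forest
    return(np.mean([get_prediction(observation,tree) for tree in forest]))

forest = build_random_forest(training_data,variables,outcome_variable)
print(forest_prediction(test_data.loc[0,:],forest))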

Our code in this chapter creates decision trees “from scratch,” by directly manipulating datasets and lists and loops. When you work with decision trees and random forests in the future, you can rely on existing Python modules that do much of that heavy lifting for you. But don’t let these modules become a crutch: if you understand every step of these important algorithms well enough to code them from scratch yourself, you can be much more effective in your machine learning efforts.

Summary

This chapter introduced machine learning and explored decision tree learning, a fundamental, simple, and useful machine learning method. Decision trees constitute a type of algorithm, and the generation of a decision tree is itself an algorithm, so this chapter contained an algorithm for generating an algorithm. By learning decision trees and the fundamental ideas of random forests, you have taken a big step toward becoming a machine learning expert. The knowledge you’ve gained in this chapter will be a solid foundation for other machine learning algorithms you may choose to learn, including advanced ones like neural networks. All machine learning methods attempt the type of task we tried here: prediction based on patterns in a dataset. In the next chapter, we explore artificial intelligence, one of the most advanced undertakings of our adventure.
