Implementing decision tree regression

Now that we're clear about the regression tree construction process, it's time to code it.

The node splitting utility function we define as follows is identical to the one from Chapter 6, Predicting Online Ads Click-through with Tree-Based Algorithms; it separates the samples in a node into left and right branches based on a feature and a value:

>>> def split_node(X, y, index, value):
...     """ Split data set X, y based on a feature and a value
...     Args:
...         X, y (numpy.ndarray, data set)
...         index (int, index of the feature used for splitting)
...         value (value of the feature used for splitting)
...     Returns:
...         list, list: left and right child, a child is in the
...                     format of [X, y]
...     """
...     x_index = X[:, index]
...     # if this feature is numerical
...     if type(X[0, index]) in [int, float]:
...         mask = x_index >= value
...     # if this feature is categorical
...     else:
...         mask = x_index == value
...     # split into left and right child
...     left = [X[~mask, :], y[~mask]]
...     right = [X[mask, :], y[mask]]
...     return left, right
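
To get a feel for what split_node returns, here is a quick check on a tiny toy data set (a hypothetical example, not part of the original walkthrough); numpy is assumed to have been imported earlier in the chapter, but the import is repeated here so the snippet stands on its own:

>>> import numpy as np
>>> X_toy = np.array([['semi', 3],
...                   ['detached', 2],
...                   ['detached', 3]], dtype=object)
>>> y_toy = np.array([600, 700, 800])
>>> # split on the categorical first feature with value 'detached'
>>> left, right = split_node(X_toy, y_toy, 0, 'detached')
>>> print(left[1], right[1])
[600] [700 800]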

Next, we define the greedy search function, which tries out all possible splits and returns the one with the least weighted MSE:

>>> def get_best_split(X, y):
...     """ Obtain the best splitting point and resulting children
...         for the data set X, y
...     Args:
...         X, y (numpy.ndarray, data set)
...     Returns:
...         dict {index: index of the feature, value: feature
...               value, children: left and right children}
...     """
...     best_index, best_value, best_score, children = \
...         None, None, 1e10, None
...     for index in range(len(X[0])):
...         for value in np.sort(np.unique(X[:, index])):
...             groups = split_node(X, y, index, value)
...             impurity = weighted_mse([groups[0][1],
...                                      groups[1][1]])
...             if impurity < best_score:
...                 best_index, best_value, best_score, children = \
...                     index, value, impurity, groups
...     return {'index': best_index, 'value': best_value,
...             'children': children}
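
The weighted_mse function called inside get_best_split is the weighted MSE metric introduced earlier in the chapter, which is not repeated in this section. As a reminder, a minimal sketch consistent with how it is used above (the earlier definition may differ slightly in detail) looks like this:

>>> def mse(targets):
...     # MSE of a set of targets equals their variance
...     if targets.size == 0:
...         return 0
...     return np.var(targets)
>>> def weighted_mse(groups):
...     # Average the MSE of each group of targets, weighted by group size
...     total = sum(len(group) for group in groups)
...     weighted = 0.0
...     for group in groups:
...         weighted += len(group) / float(total) * mse(group)
...     return weighted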

The preceding selection and splitting process recurs on each of the subsequent children. When a stopping criterion is met, the process stops at a node, and the mean value of the sample targets is assigned to this terminal node:

>>> def get_leaf(targets):
...     # Obtain the leaf as the mean of the targets
...     return np.mean(targets)

And finally, the recursive split function links all of the preceding together: it checks whether any stopping criterion is met, assigns a leaf node if so, and proceeds with further splitting otherwise:

>>> def split(node, max_depth, min_size, depth):
...     """ Split children of a node to construct new nodes or
...         assign them terminals
...     Args:
...         node (dict, with children info)
...         max_depth (int, maximal depth of the tree)
...         min_size (int, minimal samples required to further
...                   split a child)
...         depth (int, current depth of the node)
...     """
...     left, right = node['children']
...     del node['children']
...     if left[1].size == 0:
...         node['right'] = get_leaf(right[1])
...         return
...     if right[1].size == 0:
...         node['left'] = get_leaf(left[1])
...         return
...     # Check if the current depth exceeds the maximal depth
...     if depth >= max_depth:
...         node['left'], node['right'] = \
...             get_leaf(left[1]), get_leaf(right[1])
...         return
...     # Check if the left child has enough samples
...     if left[1].size <= min_size:
...         node['left'] = get_leaf(left[1])
...     else:
...         # It has enough samples, we further split it
...         result = get_best_split(left[0], left[1])
...         result_left, result_right = result['children']
...         if result_left[1].size == 0:
...             node['left'] = get_leaf(result_right[1])
...         elif result_right[1].size == 0:
...             node['left'] = get_leaf(result_left[1])
...         else:
...             node['left'] = result
...             split(node['left'], max_depth, min_size, depth + 1)
...     # Check if the right child has enough samples
...     if right[1].size <= min_size:
...         node['right'] = get_leaf(right[1])
...     else:
...         # It has enough samples, we further split it
...         result = get_best_split(right[0], right[1])
...         result_left, result_right = result['children']
...         if result_left[1].size == 0:
...             node['right'] = get_leaf(result_right[1])
...         elif result_right[1].size == 0:
...             node['right'] = get_leaf(result_left[1])
...         else:
...             node['right'] = result
...             split(node['right'], max_depth, min_size, depth + 1)

The entry point of the regression tree construction is as follows:

>>> def train_tree(X_train, y_train, max_depth, min_size):
...     """ Construction of a tree starts here
...     Args:
...         X_train, y_train (numpy.ndarray, training data)
...         max_depth (int, maximal depth of the tree)
...         min_size (int, minimal samples required to further
...                   split a child)
...     """
...     root = get_best_split(X_train, y_train)
...     split(root, max_depth, min_size, 1)
...     return root

Now, let's test it with the preceding hand-calculated example:

>>> X_train = np.array([['semi', 3],
... ['detached', 2],
... ['detached', 3],
... ['semi', 2],
... ['semi', 4]], dtype=object)
>>> y_train = np.array([600, 700, 800, 400, 700])
>>> tree = train_tree(X_train, y_train, 2, 2)

To verify that the trained tree is identical to what we constructed by hand, let's write a function that displays the tree:

>>> CONDITION = {'numerical': {'yes': '>=', 'no': '<'},
...              'categorical': {'yes': 'is', 'no': 'is not'}}
>>> def visualize_tree(node, depth=0):
...     if isinstance(node, dict):
...         if type(node['value']) in [int, float]:
...             condition = CONDITION['numerical']
...         else:
...             condition = CONDITION['categorical']
...         print('{}|- X{} {} {}'.format(depth * ' ',
...               node['index'] + 1, condition['no'], node['value']))
...         if 'left' in node:
...             visualize_tree(node['left'], depth + 1)
...         print('{}|- X{} {} {}'.format(depth * ' ',
...               node['index'] + 1, condition['yes'], node['value']))
...         if 'right' in node:
...             visualize_tree(node['right'], depth + 1)
...     else:
...         print('{}[{}]'.format(depth * ' ', node))

>>> visualize_tree(tree)
|- X1 is not detached
 |- X2 < 3
  [400.0]
 |- X2 >= 3
  [650.0]
|- X1 is detached
 [750.0]
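
The output matches the tree we constructed by hand. Although the original walkthrough does not need it at this point, a small traversal helper (a hypothetical addition, not from the chapter) shows how the nested dictionary can be used to predict a new sample:

>>> def predict_sample(node, x):
...     # Descend the tree dictionary until a leaf (a plain number) is reached
...     if not isinstance(node, dict):
...         return node
...     if type(node['value']) in [int, float]:
...         branch = 'right' if x[node['index']] >= node['value'] else 'left'
...     else:
...         branch = 'right' if x[node['index']] == node['value'] else 'left'
...     return predict_sample(node[branch], x)
>>> print(predict_sample(tree, np.array(['detached', 4], dtype=object)))
750.0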

Now that we have a better understanding of regression trees by implementing one from scratch, we can use the DecisionTreeRegressor class from scikit-learn directly. Let's apply it to an example of predicting Boston house prices:

>>> from sklearn import datasets
>>> boston = datasets.load_boston()
>>> num_test = 10    # the last 10 samples as testing set
>>> X_train = boston.data[:-num_test, :]
>>> y_train = boston.target[:-num_test]
>>> X_test = boston.data[-num_test:, :]
>>> y_test = boston.target[-num_test:]
>>> from sklearn.tree import DecisionTreeRegressor
>>> regressor = DecisionTreeRegressor(max_depth=10,
...                                   min_samples_split=3)
>>> regressor.fit(X_train, y_train)
>>> predictions = regressor.predict(X_test)
>>> print(predictions)
[12.7        20.9        20.9        20.2        20.9        30.8
 20.73076923 24.3        28.2        20.73076923]

Compare predictions with the ground truth as follows:

>>> print(y_test)
[ 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9 22. 11.9]
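
The predictions fall in a reasonable range of the ground truth. To quantify the agreement (a small addition beyond the original comparison), we can compute the mean squared error between the two:

>>> from sklearn.metrics import mean_squared_error
>>> print(mean_squared_error(y_test, predictions))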