Constructing a decision tree

A decision tree is constructed by partitioning the training samples into successive subsets. The partitioning process is repeated in a recursive fashion on each subset. For each partitioning at a node, a condition test is conducted based on the value of a feature of the subset. When the subset shares the same class label, or no further splitting can improve the class purity of this subset, recursive partitioning on this node is finished.

Theoretically, for a partitioning on a feature (numerical or categorical) with n different values, there are n different ways of binary splitting (yes or no to the condition test), not to mention other ways of splitting. Without considering the order of the features that partitioning takes place on, there are already n^m possible trees for an m-dimensional dataset.
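
To make this count concrete, here is a minimal, hedged sketch (the feature values and variable names are purely illustrative): each distinct value of a feature yields one candidate binary condition, and with m features of n values each there are n^m such trees to consider.

```python
# Illustrative only: count the candidate binary conditions for one toy feature,
# then the n^m figure for m such features (the values here are made up).
categorical_values = ['tech', 'sports', 'fashion']   # n = 3 distinct values
numerical_values = [2.0, 5.0, 7.0, 9.0]              # n = 4 distinct values

# One candidate binary test per distinct value:
#   categorical -> "== value", numerical -> ">= value"
print(len(set(categorical_values)))   # 3 candidate binary splits
print(len(set(numerical_values)))     # 4 candidate binary splits

n, m = 3, 10
print(n ** m)                         # 59049 possible trees for 10 such features
```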

Many algorithms have been developed to efficiently construct an accurate decision tree. Popular ones include the following:

  • Iterative Dichotomiser 3 (ID3): This algorithm uses a greedy search in a top-down manner, selecting the best attribute on which to split the dataset in each iteration, without backtracking.
  • C4.5: An improved version of ID3 that introduces backtracking; it traverses the constructed tree and replaces branches with leaf nodes if doing so improves purity.
  • Classification and Regression Tree (CART): This algorithm constructs the tree using binary splitting, which we will discuss in detail shortly.
  • CHi-squared Automatic Interaction Detector (CHAID): This algorithm is often used in direct marketing. It involves complicated statistical concepts, but basically determines the optimal way of merging predictive variables in order to best explain the outcome.

The basic idea of these algorithms is to grow the tree greedily by making a series of local optimizations when choosing the most significant feature to partition the data on. The dataset is then split based on the optimal value of that feature. We will discuss how to measure the significance of a feature and how to find its optimal splitting value in the next section.

We will now study the CART algorithm in detail, as it is the most notable decision tree algorithm, and we will implement it afterward. It constructs the tree using binary splitting, growing each node into a left and a right child. In each partition, it greedily searches for the most significant combination of a feature and a value of that feature; all possible combinations are tried and tested with a measurement function. With the selected feature and value as a splitting point, it then divides the dataset as follows (a minimal split sketch follows this list):

  • Samples with the feature of this value (for a categorical feature) or a greater value (for a numerical feature) become the right child
  • The remainder becomes the left child
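
The following is a minimal sketch of such a binary split, assuming the samples are held in NumPy arrays; the function name split_node and the convention that a numerical value equal to the splitting value goes to the right child are assumptions made here for illustration:

```python
import numpy as np

def split_node(X, y, index, value):
    """Split samples X and labels y into left and right children,
    based on the feature column `index` and the splitting `value`."""
    x_feature = X[:, index]
    if isinstance(value, (int, float)):
        # Numerical feature: samples with a value >= the splitting value go right
        mask = x_feature.astype(float) >= value
    else:
        # Categorical feature: samples matching the value go right
        mask = x_feature == value
    left = [X[~mask, :], y[~mask]]
    right = [X[mask, :], y[mask]]
    return left, right
```

For example, split_node(X, y, index=0, value='tech') would send every sample whose first feature equals 'tech' to the right child and the rest to the left.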

This partitioning process repeats, recursively dividing the input samples into two subgroups. Splitting stops at a subgroup when the samples in it become unmixed (they all share the same class label), or when either of the following two criteria is met (see the sketch after this list):

  • The minimum number of samples for a new node: When the number of samples is not greater than the minimum number of samples required for a further split, partitioning stops in order to prevent the tree from tailoring itself excessively to the training set and, as a result, overfitting.
  • The maximum depth of the tree: A node stops growing when its depth, defined as the number of partitions taking place from the top down, starting at the root node and ending at a terminal node, is not less than the maximum tree depth. Deeper trees are more specific to the training set and are more likely to overfit.
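
A hedged sketch of these two stopping checks might look as follows; the parameter names min_samples and max_depth, and their default values, are illustrative rather than prescribed:

```python
def should_stop(num_samples, depth, min_samples=2, max_depth=10):
    """Return True when a subgroup should not be split any further."""
    if num_samples <= min_samples:   # too few samples to justify another split
        return True
    if depth >= max_depth:           # the node has reached the depth limit
        return True
    return False
```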

A node with no branches becomes a leaf, and the dominant class of samples at this node is the prediction. Once all splitting processes finish, the tree is constructed and is portrayed with the assigned labels at the terminal nodes and the splitting points (feature + value) at all the internal nodes above.
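
As a simple illustration of labeling a leaf, the dominant class can be taken as the most frequent label among the samples that end up at that node (a minimal sketch, with a made-up label list):

```python
from collections import Counter

def get_leaf(labels):
    """Return the majority class among the samples at a terminal node."""
    return Counter(labels).most_common(1)[0][0]

print(get_leaf([1, 1, 0, 1, 0]))   # 1, the dominant class, becomes the prediction
```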

We will implement the CART decision tree algorithm from scratch after studying the metrics of selecting the optimal splitting feature and value, as promised.
