Understanding the decision tree classification algorithm

The distinguishing feature of decision tree classification is that it generates a human-interpretable hierarchy of rules, which is then used to predict the label at runtime. The algorithm is recursive in nature. Creating this hierarchy of rules involves the following steps:

  1. Find the most important feature: Out of all of the features, the algorithm identifies the feature that best differentiates between the data points in the training dataset with respect to the label. The calculation is based on metrics such as information gain or Gini impurity.
  2. Bifurcate: Using the most important feature identified in step 1, the algorithm creates a criterion that is used to divide the training dataset into two branches:
    • Data points that pass the criterion
    • Data points that fail the criterion
  3. Check for leaf nodes: If any resultant branch mostly contains labels of one class, the branch is made final, resulting in a leaf node.
  4. Check the stopping conditions and repeat: If the provided stopping conditions are not met, the algorithm goes back to step 1 for the next iteration. Otherwise, the model is marked as trained, and each node at the lowest level of the resultant decision tree is labeled as a leaf node. The stopping condition can be as simple as a fixed number of iterations, or the default stopping condition can be used, where the algorithm stops as soon as it reaches a certain homogeneity level for each of the leaf nodes.
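
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation, and it rests on several assumptions that are not part of the text: all features are numeric, every criterion has the form "feature value <= threshold", Gini impurity is the metric, and the names gini, best_split, build_tree, and predict, as well as the max_depth and min_purity parameters, are hypothetical choices made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Steps 1 and 2: pick the (feature, threshold) pair whose two
    branches have the lowest weighted Gini impurity."""
    best = None  # (weighted impurity, feature index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            passed = [y for r, y in zip(rows, labels) if r[f] <= t]
            failed = [y for r, y in zip(rows, labels) if r[f] > t]
            if not passed or not failed:
                continue  # a criterion must produce two non-empty branches
            score = (len(passed) * gini(passed) + len(failed) * gini(failed)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_purity=0.95):
    """Recursively grow the tree (assumed stopping rules: purity or depth)."""
    majority, count = Counter(labels).most_common(1)[0]
    # Step 3: a branch that mostly contains one class becomes a leaf.
    # Step 4: the depth limit acts as the stopping condition.
    if count / len(labels) >= min_purity or depth >= max_depth:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # no criterion separates the remaining points
        return {"leaf": majority}
    _, f, t = split
    passed = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    failed = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "pass": build_tree([r for r, _ in passed], [y for _, y in passed],
                           depth + 1, max_depth, min_purity),
        "fail": build_tree([r for r, _ in failed], [y for _, y in failed],
                           depth + 1, max_depth, min_purity),
    }

def predict(tree, row):
    """Follow the hierarchy of rules from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "pass" if row[tree["feature"]] <= tree["threshold"] else "fail"
        tree = tree[branch]
    return tree["leaf"]

# Made-up toy data: two numeric features, two classes.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [6.0, 5.0], [7.0, 4.0], [8.0, 6.0]]
labels = ["circle", "circle", "circle", "cross", "cross", "cross"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [2.5, 5.0]))
```

Because the toy data is separable on the first feature alone, a single criterion is enough here, and both branches become leaf nodes immediately. On noisier data the recursion would continue until the purity threshold or the depth limit is reached.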

The decision tree algorithm can be explained by the following diagram:

In the preceding diagram, the root contains a mix of circles and crosses. The algorithm creates a criterion that tries to separate the circles from the crosses. At each level, the decision tree creates partitions of the data, which are expected to become more and more homogeneous with each successive level below the root. A perfect classifier has leaf nodes that contain only circles or only crosses. Training perfect classifiers is usually difficult due to the inherent randomness of the training dataset.
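
To make the idea of increasingly homogeneous partitions concrete, the following sketch measures homogeneity with entropy and scores a split by its information gain, one of the metrics mentioned earlier. The counts of circles and crosses are invented for illustration and are not taken from the diagram, and the function names entropy and information_gain are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; 0 means the partition is fully homogeneous."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Assumed example: a root node with an even mix of circles and crosses.
root = ["circle"] * 5 + ["cross"] * 5

# An imperfect criterion leaves one stray point in each branch.
imperfect = (["circle"] * 4 + ["cross"] * 1, ["circle"] * 1 + ["cross"] * 4)

# A perfect criterion yields branches that contain only one class.
perfect = (["circle"] * 5, ["cross"] * 5)

print("root entropy:", entropy(root))
print("gain, imperfect split:", information_gain(root, imperfect))
print("gain, perfect split:", information_gain(root, perfect))
```

The perfect split recovers the full entropy of the root as gain, while the imperfect split recovers only part of it. This is why noisy training data tends to need more levels before the leaf nodes become homogeneous.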
