Tree splitting

There are various algorithms for splitting a tree, all of which eventually take us to a leaf node. The decision tree considers all of the available features (variables) and selects the feature that results in the purest, most homogeneous split. The splitting criterion that's used also depends on the type of target variable. Let's go through the main criteria, step by step:

  1. Gini index: This is based on the idea that if we select two items at random from a population, they should belong to the same class; the probability of this event is 1 if the population is totally pure. The Gini index only performs binary splits. Classification and regression trees (CARTs) make use of this kind of split.

The following formula is how you calculate the Gini index:

Gini = 1 - Σ p(t)^2

Here, p(t) is the proportion of observations in the node that belong to class t, and the sum runs over all classes.

For a binary target variable, the maximum Gini index value occurs when both classes are equally likely (p(t) = 1/2 for each), as follows:

= 1 - (1/2)^2 - (1/2)^2
= 1 - 2*(1/2)^2
= 1 - 2*(1/4)
= 1 - 0.5
= 0.5

A Gini score gives an idea of how good a split is based on how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst-case split leaves 50/50 classes in each group and a score of 0.5.

For a nominal variable with k levels, the maximum value of the Gini index is (1 - 1/k).
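To make these numbers concrete, here is a minimal Python sketch; the function names gini_index and gini_split are illustrative choices of my own rather than the API of any particular library. It computes the Gini index of a single group and the weighted Gini score of a binary split:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of one group: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted average Gini of the two groups created by a binary split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_index(left_labels) + \
           (len(right_labels) / n) * gini_index(right_labels)

# A pure group scores 0; a perfectly mixed 50/50 group scores 0.5.
print(gini_index([1, 1, 1, 1]))          # 0.0
print(gini_index([1, 1, 0, 0]))          # 0.5
print(gini_split([1, 1, 1], [0, 0, 1]))  # about 0.22 -- a fairly good split
```

The split with the lowest weighted Gini score is the one the tree would choose.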

  2. Information gain: Let's delve into this and find out what it is. If we have three scenarios, as shown in the following diagram, which one can be described most easily?

Since Z seems to be quite homogeneous and all of its values are similar, it is called a pure set; hence, it requires less effort to explain. However, Y needs more information to explain because it's not pure, and X turns out to be the most impure of the three. What this conveys is that randomness and disorganization add to complexity, so more information is needed to explain them. This degree of randomness is known as entropy. If the sample is completely homogeneous, the entropy is 0. If the sample is equally divided between the classes, its entropy is 1:

Entropy = -p*log2(p) - q*log2(q)

Here, p means the probability of success and q means the probability of failure.

Entropy is used with a categorical target variable. The algorithm picks the split whose resulting nodes have the lowest entropy relative to the parent node, which is equivalent to the highest information gain.

To do this, we must calculate the entropy of the parent node first. Then, we calculate the entropy of each individual node produced by the split and take the weighted average of all the subnodes; the split with the lowest weighted entropy (the highest information gain) is chosen.
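A minimal sketch of that calculation follows, assuming my own illustrative helper names entropy and information_gain (not from any particular library). It computes the entropy of a parent node, the weighted entropy of its child nodes, and the resulting information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p * log2(p)) over the class proportions in a group."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]           # equally divided -> entropy 1.0
split = [[1, 1, 1], [0, 0, 0]]        # perfectly homogeneous children
print(entropy(parent))                # 1.0
print(information_gain(parent, split))  # 1.0, the best possible gain here
```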

  3. Reduction in variance: When the target variable is continuous, reduction in variance is used. Here, variance is used to decide the best split: the split that produces the lowest weighted variance in the child nodes is chosen as the splitting criterion:

Variance = Σ (X - X̄)^2 / n

Here, X̄ is the mean of all the values, X represents the actual values, and n is the number of values.

First, the variance of each candidate node is calculated; then the weighted average of the child nodes' variances is computed, and the split with the lowest weighted variance is selected as the best one.
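The sketch below (again using illustrative function names of my own) applies the same idea to a continuous target: it computes the variance of the parent node, the weighted variance of the child nodes, and the reduction achieved by the split:

```python
def variance(values):
    """Variance = sum((x - mean)^2) / n."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_reduction(parent, children):
    """Parent variance minus the weighted average variance of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * variance(child) for child in children)
    return variance(parent) - weighted

parent = [3.0, 3.5, 4.0, 9.0, 9.5, 10.0]
split = [[3.0, 3.5, 4.0], [9.0, 9.5, 10.0]]  # groups similar values together
print(variance_reduction(parent, split))     # large reduction -> a good split
```

The candidate split with the largest reduction (equivalently, the lowest weighted child variance) would be selected.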
