How to regularize a decision tree

The following table lists the key parameters that sklearn's decision tree implementation provides for this purpose. After introducing the most important ones, we will illustrate how to use cross-validation to optimize hyperparameter settings with respect to the bias-variance tradeoff and to lower prediction errors:

| Parameter | Default | Options | Description |
|---|---|---|---|
| max_depth | None | int | Maximum number of levels: split nodes until reaching max_depth or until all leaves are pure or contain fewer than min_samples_split samples. |
| max_features | None | None: all features; int: number of features; float: fraction of features; "sqrt", "auto": sqrt(n_features); "log2": log2(n_features) | Number of features to consider when looking for the best split. |
| max_leaf_nodes | None | None: unlimited number of leaf nodes; int | Grow the tree until it contains this many leaf nodes. |
| min_impurity_decrease | 0 | float | Split a node only if the impurity decreases by at least this value. |
| min_samples_leaf | 1 | int; float (fraction of N) | Minimum number of samples required at a leaf node. A split is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. May smooth the model, especially for regression. |
| min_samples_split | 2 | int; float (fraction of N) | Minimum number of samples required to split an internal node. |
| min_weight_fraction_leaf | 0 | float | Minimum weighted fraction of the sum of all sample weights required at a leaf node. Samples have equal weight unless sample_weight is provided in the fit method. |

The max_depth parameter imposes a hard limit on the number of consecutive splits and represents the most straightforward way to cap the growth of a tree.
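
For illustration, here is a minimal sketch that compares an unconstrained tree with a depth-capped one; the synthetic dataset and the value max_depth=5 are arbitrary assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset; any (X, y) works the same way
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree splits until all leaves are pure and tends to overfit
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Capping max_depth imposes a hard limit on the number of consecutive splits
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

for name, model in [('unconstrained', unconstrained), ('max_depth=5', shallow)]:
    print(f'{name}: depth={model.get_depth()}, '
          f'train={model.score(X_train, y_train):.3f}, '
          f'test={model.score(X_test, y_test):.3f}')
```

The unconstrained tree typically scores perfectly on the training set while the shallower tree trades some training accuracy for better generalization.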

The min_samples_split and min_samples_leaf parameters are alternative, data-driven ways to limit the growth of a tree. Rather than imposing a hard limit on the number of consecutive splits, these parameters control the minimum number of samples required to further split the data. The latter guarantees a minimum number of samples per leaf, while the former can create very small leaves if a split results in a very uneven distribution of samples. Small parameter values facilitate overfitting, while high values may prevent the tree from learning the signal in the data. The default values are often quite low, and you should use cross-validation to explore a range of potential values, as in the sketch that follows. You can also pass a float to indicate a fraction of the number of samples, as opposed to an absolute number.
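To make the cross-validation advice concrete, the following sketch runs a grid search over these parameters; the synthetic data, the grid values, and the roc_auc scoring metric are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data; substitute your own (X, y)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Illustrative grid; tailor the ranges to your dataset
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 50],
    'min_samples_leaf': [1, 5, 25],  # ints; floats would mean fractions of N
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid=param_grid,
                      cv=5,               # 5-fold cross-validation
                      scoring='roc_auc',
                      n_jobs=-1)
search.fit(X, y)
print('Best parameters:', search.best_params_)
print(f'Best CV AUC: {search.best_score_:.3f}')
```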

The sklearn documentation contains additional detail on how to use the various parameters for different use cases; see the references on GitHub.
