GridSearchCV for decision trees

sklearn provides the GridSearchCV class to define ranges of values for multiple hyperparameters. It automates the process of cross-validating every combination of these parameter values to identify the best-performing configuration. Let's walk through the process of automatically tuning your model.

The first step is to instantiate a model object and define a dictionary where the keys name the hyperparameters and the values list the parameter settings to be tested:

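from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)
# Note: for classifiers, 'auto' is equivalent to 'sqrt'; it was removed in
# scikit-learn 1.3, so drop it from the list on newer versions.
param_grid = {'max_depth': range(10, 20),
              'min_samples_leaf': [250, 500, 750],
              'max_features': ['sqrt', 'auto']}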
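
To get a feel for how much work this grid implies, the following minimal sketch enumerates the combinations with the standard library only. It does not call scikit-learn; it simply reproduces the Cartesian product that a grid search iterates over, and the fold count of ten is taken from the cross-validation setup described in this section.

```python
from itertools import product

# Same grid as in the text ('auto' is kept here only for counting purposes).
param_grid = {'max_depth': range(10, 20),
              'min_samples_leaf': [250, 500, 750],
              'max_features': ['sqrt', 'auto']}

# Cartesian product of all value lists: one dict per candidate configuration.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 10                      # ten cross-validation folds, as in the text
n_candidates = len(combinations)  # 10 * 3 * 2
n_fits = n_candidates * n_folds   # one fit per candidate and fold

print(n_candidates)     # 60
print(n_fits)           # 600
print(combinations[0])  # first candidate configuration
```

With refit=True, one additional fit on the full dataset follows the search, so the cost grows quickly as more hyperparameters or values are added to the grid.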

Then, instantiate the GridSearchCV object, passing the estimator and the parameter grid, as well as a scoring method and a cross-validation scheme, to the constructor. For the cv parameter, we'll use an instance of our custom OneStepTimeSeriesSplit class initialized with ten folds, and we set scoring to the roc_auc metric. We can parallelize the search using the n_jobs parameter, and by setting refit=True we automatically obtain a model trained with the optimal hyperparameters.

With all settings in place, we can fit GridSearchCV just like any other model:

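from sklearn.model_selection import GridSearchCV

gridsearch_clf = GridSearchCV(estimator=clf,
                              param_grid=param_grid,
                              scoring='roc_auc',
                              n_jobs=-1,   # use all available CPU cores
                              cv=cv,       # custom OneStepTimeSeriesSplit
                              refit=True,  # retrain the best model on all data
                              return_train_score=True)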

gridsearch_clf.fit(X=X, y=y_binary)
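
Because the custom OneStepTimeSeriesSplit class and the dataset are not shown in this section, the following self-contained sketch substitutes scikit-learn's built-in TimeSeriesSplit and synthetic data so that the whole workflow can be run end to end. The data, the smaller grid, and the splitter are all assumptions made for illustration; they are not the setup that produced the results reported here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic, time-ordered data (hypothetical stand-in for X and y_binary).
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y_binary = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# Stand-in for the custom OneStepTimeSeriesSplit: each fold trains on past
# observations and tests on the block that immediately follows them.
cv = TimeSeriesSplit(n_splits=5)

clf = DecisionTreeClassifier(random_state=42)
# Smaller grid than in the text, sized for the small synthetic sample.
param_grid = {'max_depth': [2, 4],
              'min_samples_leaf': [10, 25],
              'max_features': ['sqrt', None]}

gridsearch_clf = GridSearchCV(estimator=clf,
                              param_grid=param_grid,
                              scoring='roc_auc',
                              n_jobs=1,  # set to -1 to use all cores
                              cv=cv,
                              refit=True,
                              return_train_score=True)
gridsearch_clf.fit(X=X, y=y_binary)

print(gridsearch_clf.best_params_)
print(round(gridsearch_clf.best_score_, 4))

# refit=True makes the tuned model available for prediction right away.
print(gridsearch_clf.best_estimator_.predict_proba(X[-3:]))
```

Any splitter that yields train and test indices in chronological order can be passed as cv in the same way, which is how the custom class plugs into the search.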

The training process produces some new attributes for our GridSearchCV object, most importantly the information about the optimal settings and the best cross-validation score (now using the proper setup that avoids lookahead bias).

Setting max_depth to 13 and min_samples_leaf to 500, and randomly selecting only a number of features corresponding to the square root of the total when deciding on a split, produces the best results, with an AUC of 0.5855:

gridsearch_clf.best_params_
{'max_depth': 13, 'max_features': 'sqrt', 'min_samples_leaf': 500}

gridsearch_clf.best_score_
0.5855

The automation is quite convenient, but we would also like to inspect how performance evolves across different parameter values. Once the search completes, the GridSearchCV object exposes detailed cross-validation results through its cv_results_ attribute.
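
As a rough illustration of what these detailed results look like, the sketch below runs a small search on synthetic data and prints the per-candidate train and test scores stored in cv_results_. The data, the single-parameter grid, and the use of TimeSeriesSplit are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                      param_grid={'max_depth': [1, 3, 6]},
                      scoring='roc_auc',
                      cv=TimeSeriesSplit(n_splits=4),
                      return_train_score=True)
search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per candidate configuration.
results = search.cv_results_

# List candidates from best to worst mean test score.
order = np.argsort(results['rank_test_score'])
for i in order:
    print(results['params'][i],
          f"train AUC={results['mean_train_score'][i]:.3f}",
          f"test AUC={results['mean_test_score'][i]:.3f}",
          f"(+/- {results['std_test_score'][i]:.3f})")
```

Comparing the train and test columns for each setting shows where deeper trees begin to overfit. The same dictionary can also be passed to pandas.DataFrame for tabular analysis.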
