Execute the following steps to run Bayesian hyperparameter optimization of a LightGBM model.
- Import the libraries:
import pandas as pd
from sklearn.model_selection import (train_test_split,
                                     cross_val_score,
                                     StratifiedKFold)
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from lightgbm import LGBMClassifier
from chapter_9_utils import performance_evaluation_report
- Define parameters for later use:
N_FOLDS = 5
MAX_EVALS = 200
- Load and prepare the data:
df = pd.read_csv('credit_card_fraud.csv')
X = df.copy()
y = X.pop('Class')
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
- Define the objective function:
def objective(params, n_folds=N_FOLDS, random_state=42):
model = LGBMClassifier(**params)
model.set_params(random_state=random_state)
k_fold = StratifiedKFold(n_folds, shuffle=True,
random_state=random_state)
metrics = cross_val_score(model, X_train, y_train,
cv=k_fold, scoring='recall')
loss = -1 * metrics.mean()
return {'loss': loss, 'params': params, 'status': STATUS_OK}
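Before handing the objective to fmin, it can be useful to call it once with a fixed set of hyperparameters to confirm it returns the expected dictionary. The sketch below is a stand-in version: it swaps in a DecisionTreeClassifier on synthetic imbalanced data (so it runs without LightGBM or the fraud dataset) but keeps the same structure, in which the loss is the negative mean recall, so fmin's minimization maximizes recall:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic, imbalanced stand-in for the fraud dataset
X_demo, y_demo = make_classification(n_samples=300, weights=[0.9, 0.1],
                                     random_state=42)

def demo_objective(params, n_folds=3, random_state=42):
    model = DecisionTreeClassifier(**params, random_state=random_state)
    k_fold = StratifiedKFold(n_folds, shuffle=True,
                             random_state=random_state)
    recalls = cross_val_score(model, X_demo, y_demo,
                              cv=k_fold, scoring='recall')
    # fmin minimizes, so return the negative mean recall;
    # hyperopt's STATUS_OK constant is the string 'ok'
    return {'loss': -recalls.mean(), 'params': params, 'status': 'ok'}

result = demo_objective({'max_depth': 3})
print(result['loss'])  # a value in [-1, 0]; lower means higher recall
```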
- Define the search space:
lgbm_param_grid = {
'boosting_type': hp.choice('boosting_type', ['gbdt', 'dart',
'goss']),
'max_depth': hp.choice('max_depth', [-1, 2, 3, 4, 5, 6, 7, 8,
9, 10]),
'n_estimators': hp.choice('n_estimators', [10, 50, 100,
300, 750, 1000]),
'is_unbalance': hp.choice('is_unbalance', [True, False]),
'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1),
'learning_rate': hp.uniform('learning_rate', 0.05, 0.3),
}
- Run the Bayesian optimization:
trials = Trials()
best_set = fmin(fn=objective,
                space=lgbm_param_grid,
                algo=tpe.suggest,
                max_evals=MAX_EVALS,
                trials=trials)
Inspecting best_set produces the following summary:
{'boosting_type': 1,
'colsample_bytree': 0.8861225641638096,
'is_unbalance': 0,
'learning_rate': 0.193440600772047,
'max_depth': 6,
'n_estimators': 0}
Hyperparameters defined with hp.choice are reported as integer indices into their lists of possible values. In the following steps, we show how to recover the original values.
- Define the dictionaries for mapping the results to hyperparameter values:
boosting_type = {0: 'gbdt', 1: 'dart', 2: 'goss'}
max_depth = {0: -1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6,
6: 7, 7: 8, 8: 9, 9: 10}
n_estimators = {0: 10, 1: 50, 2: 100, 3: 300, 4: 750, 5: 1000}
is_unbalance = {0: True, 1: False}
- Fit a model using the best hyperparameters:
best_lgbm = LGBMClassifier(
    boosting_type=boosting_type[best_set['boosting_type']],
    max_depth=max_depth[best_set['max_depth']],
    n_estimators=n_estimators[best_set['n_estimators']],
    is_unbalance=is_unbalance[best_set['is_unbalance']],
    colsample_bytree=best_set['colsample_bytree'],
    learning_rate=best_set['learning_rate']
)
best_lgbm.fit(X_train, y_train)
- Evaluate the performance of the best model on the test set:
_ = performance_evaluation_report(best_lgbm, X_test, y_test,
show_plot=True,
show_pr_curve=True)
Running the code generates the following plot:
The plot contains some of the performance evaluation metrics obtained from the custom performance_evaluation_report function.