How to create binary data formats

Each library has its own binary data format that precomputes feature statistics to accelerate the search for split points, as described previously. These formats can also be persisted to disk to speed up the start of subsequent training runs.
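
For example, XGBoost's DMatrix and LightGBM's Dataset can both be saved in binary form and reloaded later. The following is a minimal sketch, assuming the features and target objects from the preceding sections; the file names are illustrative:

import lightgbm as lgb
import xgboost as xgb

# XGBoost: save the DMatrix buffer and reload it without re-parsing the raw data
dtrain = xgb.DMatrix(data=features, label=target)
dtrain.save_binary('train.buffer')
dtrain = xgb.DMatrix('train.buffer')

# LightGBM: the binary file stores the precomputed feature bins
lgb_train = lgb.Dataset(data=features, label=target)
lgb_train.save_binary('train.bin')
lgb_train = lgb.Dataset('train.bin')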

The following code constructs binary train and validation datasets for each library from the folds produced by the OneStepTimeSeriesSplit:

import lightgbm as lgb
import xgboost as xgb
from catboost import Pool

# assumes `model`, `features`, and `target` are defined as in the preceding
# sections, and `kfold` is an instance of OneStepTimeSeriesSplit
cat_cols = ['year', 'month', 'age', 'msize', 'sector']
data = {}
for fold, (train_idx, test_idx) in enumerate(kfold.split(features)):
    print(fold, end=' ', flush=True)
    if model == 'xgboost':
        data[fold] = {'train': xgb.DMatrix(label=target.iloc[train_idx],
                                           data=features.iloc[train_idx],
                                           nthread=-1),  # use avail. threads
                      'valid': xgb.DMatrix(label=target.iloc[test_idx],
                                           data=features.iloc[test_idx],
                                           nthread=-1)}
    elif model == 'lightgbm':
        train = lgb.Dataset(label=target.iloc[train_idx],
                            data=features.iloc[train_idx],
                            categorical_feature=cat_cols,
                            free_raw_data=False)

        # align validation set histograms with the training set
        valid = train.create_valid(label=target.iloc[test_idx],
                                   data=features.iloc[test_idx])

        data[fold] = {'train': train.construct(),
                      'valid': valid.construct()}

    elif model == 'catboost':
        # CatBoost identifies categorical features by column index
        cat_cols_idx = [features.columns.get_loc(c) for c in cat_cols]
        data[fold] = {'train': Pool(label=target.iloc[train_idx],
                                    data=features.iloc[train_idx],
                                    cat_features=cat_cols_idx),
                      'valid': Pool(label=target.iloc[test_idx],
                                    data=features.iloc[test_idx],
                                    cat_features=cat_cols_idx)}

The available options vary slightly:

  • xgboost allows the use of all available threads
  • lightgbm explicitly aligns the bin boundaries (quantiles) created for the validation set with those of the training set
  • catboost requires categorical features to be identified by column index rather than by label
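
To illustrate how the constructed objects are then consumed, the following sketch trains one fold with each library; the objective settings are placeholders rather than tuned parameters:

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

# LightGBM monitors the aligned validation Dataset during training
lgb_model = lgb.train(params={'objective': 'regression'},
                      train_set=data[0]['train'],
                      valid_sets=[data[0]['valid']])

# XGBoost evaluates the validation DMatrix via the evals argument
xgb_model = xgb.train(params={'objective': 'reg:squarederror'},
                      dtrain=data[0]['train'],
                      evals=[(data[0]['valid'], 'valid')])

# CatBoost accepts the Pool objects directly in fit()
cat_model = CatBoostRegressor().fit(data[0]['train'],
                                    eval_set=data[0]['valid'],
                                    verbose=False)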