Category Encoders library

Aside from pandas and scikit-learn, we can also use a library called Category Encoders. It belongs to a set of scikit-learn-compatible libraries and provides a selection of encoders that follow the familiar fit-transform approach, which is why they can also be used together with ColumnTransformer and Pipeline.
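As a quick illustration of this compatibility, here is a minimal sketch of placing one of the encoders inside a scikit-learn Pipeline (the X_train and y_train objects, as well as the choice of LogisticRegression as the estimator, are assumed for the purpose of the example):

import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# the encoder exposes fit/transform, so it slots into a Pipeline
# just like any scikit-learn transformer
pipeline = Pipeline([
    ('encoder', ce.OneHotEncoder(use_cat_names=True)),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)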

We show two of the available encoders. The first one is an alternative implementation of the one-hot encoder.

Import the library:

import category_encoders as ce

Create the encoder object:

one_hot_encoder_ce = ce.OneHotEncoder(use_cat_names=True)

Additionally, we could specify the drop_invariant argument to indicate that we want to drop columns with zero variance, which could help reduce the number of features.
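For example, creating the encoder with both arguments might look like this:

# drop_invariant=True additionally removes zero-variance columns
one_hot_encoder_ce = ce.OneHotEncoder(use_cat_names=True, drop_invariant=True)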

Fit the encoder, and transform the data:

one_hot_encoder_ce.fit(X_train)
X_train_ce = one_hot_encoder_ce.transform(X_train)

This implementation of the one-hot encoder automatically encodes only the columns containing strings (unless we specify a subset of categorical columns by passing a list to the cols argument). By default, it also returns a pandas DataFrame with adjusted column names (as opposed to the numpy array returned by scikit-learn's implementation). The only drawback of this implementation is that it does not offer the option to drop one redundant dummy column per feature.
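As a quick check, we can inspect the adjusted column names and reuse the already fitted encoder on new data (assuming an X_test split exists alongside X_train):

# the fitted encoder can be applied to unseen data with the same columns
X_test_ce = one_hot_encoder_ce.transform(X_test)
print(X_train_ce.columns.tolist())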

The second interesting approach to encoding categorical features available in the Category Encoders library is target encoding. This approach is useful for classification tasks and relies on using the mean of the dependent variable (target) to replace categories. We can also interpret mean encoding as a probability of the target variable, conditional on each value of the feature. In the case of a simple variable with gender and a Boolean target, target encoding would replace the categories with the percentage of positive cases per gender.

target_encoder = ce.TargetEncoder(smoothing=0)
target_encoder.fit(X_train.sex, y_train)
target_encoder.transform(X_train.sex).head()

Running the code generates a preview of the encoded column.

It looks as though ~20.7% of females and ~24.4% of males defaulted in the training set. We set the smoothing argument to zero to disable regularization and keep the pure mean values.
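We can sanity-check these numbers with a simple groupby, which computes exactly the conditional means the encoder used (assuming y_train shares its index with X_train):

# mean of the target per category -- what TargetEncoder with
# smoothing=0 substitutes for each category
y_train.groupby(X_train.sex).mean()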

The benefits of using target encoding include:

  • It can improve model performance.
  • It tends to group the classes of the target together, whereas the distribution of the target is quite random across label-encoded categories.
  • It reduces the number of features, in contrast to one-hot encoding, which can produce a large number of columns. This is especially helpful with gradient-boosted trees, as they have trouble handling high-cardinality categorical features, due to the limited depth of the trees.

The biggest drawback of target encoding is its potential to cause overfitting, as it leaks information about the target into the features. As a remedy, we could apply k-fold target encoding.
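A minimal sketch of the idea follows; for each fold, the encoding is fitted on the remaining folds only, so no observation is encoded using its own target value (the column choice and fold count are illustrative):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

sex_encoded = pd.Series(np.nan, index=X_train.index)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fit_idx, transform_idx in kf.split(X_train):
    # fit on the other folds, transform only the held-out fold
    encoder = ce.TargetEncoder()
    encoder.fit(X_train.sex.iloc[fit_idx], y_train.iloc[fit_idx])
    sex_encoded.iloc[transform_idx] = (
        encoder.transform(X_train.sex.iloc[transform_idx]).iloc[:, 0].values
    )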

A warning about one-hot encoding and decision tree-based algorithms

While regression-based models can naturally handle the OR condition of one-hot encoded features, this is not as simple for decision tree-based algorithms. In theory, decision trees are capable of handling categorical features without the need for encoding.

However, the popular implementation in scikit-learn still requires all features to be numerical. Without going into too much detail, such an approach favors continuous numerical features over one-hot-encoded dummies, as a single dummy can bring only a fraction of the total feature information into the model. A possible solution is to use either a different kind of encoding (label/target encoding) or an implementation that handles categorical features natively, such as the Random Forest in the h2o library.
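To sketch the h2o route (the calls below follow the h2o Python API; the conversion of the pandas objects and the 'target' column name are assumptions made for this example):

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
# h2o handles categorical columns (factors) natively
train_hf = h2o.H2OFrame(X_train.join(y_train.rename('target')))
train_hf['sex'] = train_hf['sex'].asfactor()
train_hf['target'] = train_hf['target'].asfactor()  # classification task

rf = H2ORandomForestEstimator()
rf.train(x=X_train.columns.tolist(), y='target', training_frame=train_hf)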
