Specifying possible categories for OneHotEncoder

When creating the ColumnTransformer, we could also have provided a list of possible categories for each of the considered features. A simplified example follows:

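from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(
    categories=[['Male', 'Female', 'Unknown']],
    sparse_output=False,  # called `sparse` before scikit-learn 1.2
    handle_unknown='error',
    drop='first'
)

one_hot_transformer = ColumnTransformer(
    [("one_hot", one_hot_encoder, ['sex'])]
)

# X_train is the training DataFrame containing the 'sex' column
one_hot_transformer.fit(X_train)

# get_feature_names() was removed in scikit-learn 1.2
one_hot_transformer.get_feature_names_out()
# array(['one_hot__sex_Female', 'one_hot__sex_Unknown'], dtype=object)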

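By passing a list of lists containing the possible categories for each feature, we account for the possibility that a specific value does not appear in the training set but might appear in the test set. Without the explicit list, the encoder would learn its categories from the training set alone and, with handle_unknown='error', would raise an error when it encountered the unseen value in the test set.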

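In the preceding code block, we added an extra category called 'Unknown' to the column representing gender. As a result, we end up with an extra "dummy" column for that category. Because drop='first' removes the column for the first listed category ('Male'), the output contains columns for 'Female' and 'Unknown' only.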
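
The difference described above can be reproduced with a small sketch. The toy DataFrames below are hypothetical and are not the dataset used in this chapter. They only assume a 'sex' column in which 'Unknown' is missing from the training data. The encoder's default sparse output is converted with .toarray() so that the snippet does not depend on the name of the sparse parameter, which differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'Unknown' never appears in the training set.
X_train = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})
X_test = pd.DataFrame({"sex": ["Female", "Unknown"]})

# Case 1: categories are inferred from the training data only.
inferred = OneHotEncoder(handle_unknown="error", drop="first").fit(X_train)

try:
    inferred.transform(X_test)
    inferred_failed = False
except ValueError:
    # 'Unknown' was not seen during fit, so the encoder refuses to encode it.
    inferred_failed = True

# Case 2: categories are listed explicitly, including the value
# that is missing from X_train.
explicit = OneHotEncoder(
    categories=[["Male", "Female", "Unknown"]],
    handle_unknown="error",
    drop="first",
).fit(X_train)

# The default output is a sparse matrix; convert it to a dense array.
encoded = explicit.transform(X_test).toarray()

print(inferred_failed)
print(encoded)
```

In this sketch, the encoder with inferred categories fails on the test set, while the encoder with explicit categories produces two columns ('Female' and 'Unknown') for both rows.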
