First, we will use the scikit-learn TransformerMixin base class to create our own custom categorical imputer. This transformer (and all other custom transformers in this chapter) will work as an element in a pipeline, exposing both a fit and a transform method.

The following code block follows a pattern that will become very familiar throughout this chapter:

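from sklearn.base import TransformerMixin

class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        # Names of the categorical columns to impute
        self.cols = cols

    def transform(self, df):
        # Work on a copy so the caller's DataFrame is left untouched
        X = df.copy()
        for col in self.cols:
            # value_counts() sorts by frequency, so index[0] is the most common category
            most_common = X[col].value_counts().index[0]
            # Assign the result back rather than calling fillna(..., inplace=True) on a column selection
            X[col] = X[col].fillna(most_common)
        return X

    def fit(self, *_):
        # Nothing to learn ahead of time, so fit just returns the transformer
        return self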

There is a lot happening in this code block, so let's break it down by line:

First, we have a new import statement:

from sklearn.base import TransformerMixin

We will inherit from scikit-learn's TransformerMixin class, which provides a .fit_transform method that calls the .fit and .transform methods we will create. This lets our transformer keep the same structure as scikit-learn's own transformers. Let's define our custom class:

class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

We have now defined our custom class, along with an __init__ method that initializes its attributes. In our case, we only need to initialize one instance attribute, self.cols (the columns that we specify as a parameter). Now, we can build our fit and transform methods:

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            most_common = X[col].value_counts().index[0]
            X[col] = X[col].fillna(most_common)
        return X

Here, we have our transform method. It takes in a DataFrame, and the first step is to copy it into a new DataFrame named X so that the original is not modified. Then, we iterate over the columns specified in our cols parameter and fill the missing slots in each one with that column's most common value. The fillna portion may feel familiar, as it is the method we used in our first example. We are using the same method and setting it up so that our custom categorical imputer can work across several columns at once. After the missing values have been filled, we return our filled DataFrame. Next comes our fit method:

    def fit(self, *_):
        return self

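We have set up our fit method to simply return self, as is standard for .fit methods in scikit-learn. There is nothing to learn ahead of time here, because transform works out the fill value for each column itself.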
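To make the role of fit and the inherited fit_transform more concrete, here is a minimal sketch. The toy DataFrame and its color column are hypothetical and exist only for this illustration, and the class is restated so that the snippet runs on its own. It shows that fit hands back the same object (which is what makes chaining possible) and that fit_transform gives the same result as calling fit and then transform.

```python
import pandas as pd
from sklearn.base import TransformerMixin


class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            most_common = X[col].value_counts().index[0]
            X[col] = X[col].fillna(most_common)
        return X

    def fit(self, *_):
        return self


# Hypothetical toy data: one categorical column with a single missing value
toy = pd.DataFrame({"color": ["red", None, "red", "blue"]})

imputer = CustomCategoryImputer(cols=["color"])

# fit returns the imputer itself, so calls can be chained
fitted = imputer.fit(toy)
chained = imputer.fit(toy).transform(toy)

# fit_transform is inherited from TransformerMixin
shortcut = imputer.fit_transform(toy)

print(fitted is imputer)
print(chained.equals(shortcut))
```

Because transform works on a copy, the toy DataFrame still contains its missing value after these calls.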

Now we have a custom transformer that allows us to impute our categorical data! Let's see it in action with our two categorical columns, city and boolean:

# Implement our custom categorical imputer on our categorical columns.
cci = CustomCategoryImputer(cols=['city', 'boolean'])

We have initialized our custom categorical imputer, and we now need to call its fit_transform method on our dataset:

cci.fit_transform(X)

Our dataset now looks like this:

	boolean	city	ordinal_column	quantitative_column
0	yes	tokyo	somewhat like	1.0
1	no	tokyo	like	11.0
2	no	london	somewhat like	-0.5
3	no	seattle	like	10.0
4	no	san francisco	somewhat like	NaN
5	yes	tokyo	dislike	20.0

Great! Our city and boolean columns no longer have missing values. However, our quantitative column still has null values. Since the default imputer cannot select columns, let's make another custom one.
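Before moving on, here is a rough sketch of how this imputer might sit inside a scikit-learn pipeline, since that is where our custom transformers are meant to live. Several things in it are assumptions rather than part of the original code. The DataFrame is a stand-in rebuilt from the output table above, with the positions of the missing values guessed. The step name impute_categories is made up. The class also inherits from BaseEstimator, an addition that supplies get_params and set_params, which pipeline tooling such as cloning and grid search relies on.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


# Assumption: BaseEstimator is added here; the class above uses only TransformerMixin
class CustomCategoryImputer(TransformerMixin, BaseEstimator):
    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, *_):
        return self

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            most_common = X[col].value_counts().index[0]
            X[col] = X[col].fillna(most_common)
        return X


# Hypothetical stand-in for the chapter's dataset; where the gaps sit is a guess
X = pd.DataFrame({
    "boolean": ["yes", "no", None, "no", "no", "yes"],
    "city": ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1, 11, -0.5, 10, None, 20],
})

# A one-step pipeline; more transformers could be appended as further steps
pipe = Pipeline([
    ("impute_categories", CustomCategoryImputer(cols=["city", "boolean"])),
])

filled = pipe.fit_transform(X)

# Count the nulls left in each column after imputation
print(filled.isnull().sum())
```

Only the two categorical columns are touched by this step, so the missing value in quantitative_column remains and still needs its own imputer.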
