Custom category imputer

First, we will use the scikit-learn TransformerMixin base class to create our own custom categorical imputer. This transformer (and all other custom transformers in this chapter) will expose fit and transform methods so that it can work as a step in a scikit-learn pipeline.
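As a minimal sketch of that contract, consider a hypothetical no-op transformer named PassThrough (it is not part of this chapter). It also inherits from BaseEstimator, an addition beyond what this chapter's classes use, which supplies get_params and set_params so that scikit-learn utilities can inspect the object. The sketch only illustrates that fit_transform is inherited from TransformerMixin and that such an object can be used as a pipeline step:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class PassThrough(TransformerMixin, BaseEstimator):
    """Hypothetical no-op transformer: the smallest useful pipeline step."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets calls be chained.
        return self

    def transform(self, X):
        # Return a copy so the caller's DataFrame is never modified.
        return X.copy()


demo = pd.DataFrame({"a": [1, 2, 3]})

# fit_transform is inherited from TransformerMixin: it calls fit, then transform.
direct = PassThrough().fit_transform(demo)

# The same object also works as a step inside a scikit-learn Pipeline.
piped = Pipeline([("noop", PassThrough())]).fit_transform(demo)
```

Both calls return an unchanged copy of demo; a real transformer would do its work inside transform.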

The following code block will become very familiar throughout this chapter, so we will go over each line in detail:

from sklearn.base import TransformerMixin

class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            # Fill missing values with the column's most frequent value.
            X[col] = X[col].fillna(X[col].value_counts().index[0])
        return X

    def fit(self, *_):
        return self

There is a lot happening in this code block, so let's break it down by line: 

  1. First, we have a new import statement:
from sklearn.base import TransformerMixin
  2. We will inherit from scikit-learn's TransformerMixin class, which provides a .fit_transform method that calls the .fit and .transform methods we will write. This keeps the structure of our transformer similar to that of scikit-learn's own transformers. Let's define our custom class:
class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols
  3. We have now defined our custom class along with its __init__ method, which initializes our attributes. In our case, we only need one instance attribute, self.cols (the columns that we specify as a parameter). Now, we can build our fit and transform methods:
    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            X[col] = X[col].fillna(X[col].value_counts().index[0])
        return X
  4. Here, we have our transform method. It takes in a DataFrame, and the first step is to copy it into a new variable, X, so that the original is left untouched. Then, we iterate over the columns specified in our cols parameter and fill in their missing values. The fillna portion may feel familiar, as it is the function we employed in our first example. We are using the same function, with the most frequent value in each column (the first entry in the index of value_counts()) as the fill value, and setting it up so that our custom categorical imputer can work across several columns at once. After the missing values have been filled, we return the filled DataFrame. Next comes our fit method:
    def fit(self, *_):
        return self

We have set up our fit method to simply return self, as is the standard for .fit methods in scikit-learn.

  5. Now we have a custom transformer that allows us to impute our categorical data! Let's see it in action with our two categorical columns, city and boolean:
# Implement our custom categorical imputer on our categorical columns.

cci = CustomCategoryImputer(cols=['city', 'boolean'])
  6. We have initialized our custom categorical imputer, and we now need to call fit_transform on it with our dataset:
cci.fit_transform(X)

Our dataset now looks like this:

   boolean           city  ordinal_column  quantitative_column
0      yes          tokyo   somewhat like                  1.0
1       no          tokyo            like                 11.0
2       no         london   somewhat like                 -0.5
3       no        seattle            like                 10.0
4       no  san francisco   somewhat like                  NaN
5      yes          tokyo         dislike                 20.0

Great! Our city and boolean columns no longer have missing values. However, our quantitative column still has null values. Since the default imputer cannot select columns, let's make another custom one.
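Before moving on, one design note on CustomCategoryImputer: because its fit method learns nothing, the most frequent value is recomputed from whatever DataFrame is passed to transform. A hedged alternative is to learn the fill values once in fit and reuse them later, so that new data is filled with the categories seen during fitting. The sketch below is one possible way to do that, not this chapter's implementation. The class name FittedCategoryImputer, the fill_values_ attribute, and the BaseEstimator parent are all assumptions introduced for illustration, and the sample DataFrame is reconstructed from the table above, with the positions of the missing cells guessed:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedCategoryImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant: learn each column's most frequent value in fit."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, df, y=None):
        # value_counts() sorts by frequency and skips NaN,
        # so index[0] is the most frequent category.
        self.fill_values_ = {
            col: df[col].value_counts().index[0] for col in self.cols
        }
        return self

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            # Assign the filled column back to the copy.
            X[col] = X[col].fillna(self.fill_values_[col])
        return X


# Assumed sample data: values come from the table above; where the
# missing cells (None) sit is a guess, not taken from the original.
X = pd.DataFrame({
    "boolean": ["yes", "no", None, "no", "no", "yes"],
    "city": ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1, 11, -0.5, 10, None, 20],
})

imputer = FittedCategoryImputer(cols=["city", "boolean"]).fit(X)
filled = imputer.transform(X)

# New rows are filled with the values learned from X, not from themselves.
new_rows = pd.DataFrame({"city": [None, "paris"], "boolean": [None, None]})
filled_new = imputer.transform(new_rows)
```

On the assumed data, filled has no missing values in city or boolean, while quantitative_column keeps its single NaN. The new_rows example also shows a case the transform-time version cannot handle: a column that is entirely missing has no most frequent value to compute.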
