Imputing categorical features

Now that we have an understanding of the data we are working with, let's take a look at our missing values:

  • To do this, we can use the isnull method that pandas provides for DataFrames. This method returns a boolean object of the same shape as the DataFrame, indicating whether each value is null.
  • We will then sum these booleans column by column to see which columns have missing data:
X.isnull().sum()
>>>>
boolean                1
city                   1
ordinal_column         0
quantitative_column    1
dtype: int64
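
To try this step in isolation, the sketch below builds a small stand-in for X. The values are an assumption, chosen only to be consistent with the outputs shown in this section, so the actual dataset may differ. It also filters the counts down to the columns that have at least one missing value:

```python
import pandas as pd

# Hypothetical stand-in for X (an assumption, not the chapter's actual data)
X = pd.DataFrame({
    'city': ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', None, 'no', 'no', 'yes'],
    'ordinal_column': ['somewhat like', 'like', 'somewhat like',
                       'like', 'somewhat like', 'dislike'],
    'quantitative_column': [1, 11, -.5, 10, None, 20],
})

# Count the missing values in each column
missing_counts = X.isnull().sum()

# Keep only the names of columns with at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```

The order of the columns in the printed result depends on the order in which they were defined, so it may not match the alphabetical listing above.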

Here, we can see that three of our columns are missing values. Our course of action will be to impute these missing values.

If you recall, we used scikit-learn's Imputer class in a previous chapter to fill in numerical data. Imputer does have a categorical option, most_frequent; however, it only works on categorical data that has been encoded as integers.

We may not always want to transform our categorical data this way, as it can change how we interpret the categorical information, so we will build our own transformer. By transformer, we mean an object that takes a column and fills in its missing values.

In fact, we will build several custom transformers in this chapter, as they are quite useful for making transformations to our data, and give us options that are not readily available in pandas or scikit-learn.

Let's start with our categorical column, city. Just as we impute the mean value to fill missing rows for numerical data, we have a similar method for categorical data: fill missing rows with the most common category.

To do so, we will need to find out what the most common category is in our city column.

Note that we need to select the column we are working with before calling the value_counts method. This returns the count of each category, sorted in descending order, so that the first entry is the most frequently occurring category.

We will grab only the label of the first entry:

# Let's find out what our most common category is in our city column
X['city'].value_counts().index[0]

>>>>
'tokyo'
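
As a quick check on how value_counts behaves, the sketch below runs it on a stand-in city column (the values are an assumption, chosen to match the output above). It shows that missing values are left out of the counts by default, and that the mode method on a Series is an alternative way to get the most common category:

```python
import pandas as pd

# Hypothetical stand-in for the city column (an assumption)
city = pd.Series(
    ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
    name='city',
)

# Counts per category, sorted in descending order; missing values
# are excluded by default (dropna=True)
counts = city.value_counts()

# The index holds the category labels, so the first label is the
# most common category
most_common = counts.index[0]

# Alternative: mode() returns a Series (there can be ties), so we
# take its first element
most_common_via_mode = city.mode()[0]
```

If two categories tie for first place, the two approaches may not pick the same one, so it is worth settling on one of them and using it consistently.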

We can see that tokyo is the most common city. Now that we know which value to use to impute our missing rows, let's fill these slots. pandas provides a fillna method that allows us to specify exactly how we want to fill missing values:

# fill empty slots with most common category
X['city'].fillna(X['city'].value_counts().index[0])

The resulting city column looks like this:

0            tokyo
1            tokyo
2           london
3          seattle
4    san francisco
5            tokyo
Name: city, dtype: object
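
Keep in mind that fillna returns a new Series rather than changing X itself. The sketch below, which again uses a stand-in city column (an assumption), shows one way to keep the result by assigning it back to the column:

```python
import pandas as pd

# Hypothetical stand-in for X with only the city column (an assumption)
X = pd.DataFrame({
    'city': ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
})

most_common = X['city'].value_counts().index[0]

# fillna returns a filled copy of the column
filled = X['city'].fillna(most_common)

# The original column is untouched at this point
still_missing = X['city'].isnull().sum()  # 1

# Assign the filled copy back to make the change stick
X['city'] = filled
```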

Great, now our city column no longer has missing values. However, our other categorical column, boolean, still does. Rather than repeating the same steps, let's build a custom imputer that will be able to handle imputing all categorical data.
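
To give a rough idea of what such an imputer could look like, here is a minimal sketch. It is not necessarily the implementation this chapter goes on to build: the class name MostFrequentImputer, its cols parameter, and the stand-in data are all assumptions made for illustration. It follows the fit/transform naming convention used by scikit-learn but relies only on pandas:

```python
import pandas as pd


class MostFrequentImputer:
    """Hypothetical sketch: fill missing values in the given columns
    with each column's most common category."""

    def __init__(self, cols):
        self.cols = cols
        self.fill_values_ = {}

    def fit(self, X, y=None):
        # Learn the most common category of each column
        # (assumes each column has at least one non-missing value)
        self.fill_values_ = {
            col: X[col].value_counts().index[0] for col in self.cols
        }
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left unchanged
        X_out = X.copy()
        for col, value in self.fill_values_.items():
            X_out[col] = X_out[col].fillna(value)
        return X_out

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


# Hypothetical stand-in data (an assumption)
X = pd.DataFrame({
    'city': ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', None, 'no', 'no', 'yes'],
})

imputed = MostFrequentImputer(cols=['city', 'boolean']).fit_transform(X)
```

Separating fit from transform means the fill values learned on one dataset can be reused on another, which is the main reason to wrap the logic in a class rather than calling fillna directly.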
