Encoding at the ordinal level

Now, let's take a look at our ordinal columns. These columns still contain useful information, but we need to transform the strings into numerical data. At the ordinal level, the order of the values carries meaning, so it does not make sense to use dummy variables, which would discard that order. To maintain the order, we will use a label encoder.

By a label encoder, we mean that each label in our ordinal data will have a numerical value associated with it. In our example, this means that the ordinal column values (dislike, somewhat like, and like) will be represented as 0, 1, and 2, respectively.

In the simplest form, the code is as follows:

# set up a list of our ordinal labels, ordered so that each label's list index is its code
ordering = ['dislike', 'somewhat like', 'like']  # 0 for dislike, 1 for somewhat like, and 2 for like

# before we map our ordering to our ordinal column, let's take a look at the column
print(X['ordinal_column'])
>>>>
0 somewhat like
1 like
2 somewhat like
3 like
4 somewhat like
5 dislike
Name: ordinal_column, dtype: object

Here, we have set up a list for ordering our labels. This is key, as we will use the index of each label in the list to transform the labels into numerical data.

Next, we call the map method on our column, which lets us specify a function to apply to each value in the column. We specify this function using a construct called lambda, which allows us to create an anonymous function, that is, one that is not bound to a name:

lambda x: ordering.index(x)

This code creates a function that takes a label x and returns its index in our list called ordering. Now, we map this function to our ordinal column:

# now map our ordering to our ordinal column:
print(X['ordinal_column'].map(lambda x: ordering.index(x)))
>>>>
0 1
1 2
2 1
3 2
4 1
5 0
Name: ordinal_column, dtype: int64

Our ordinal column is now represented as labeled data. 
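As a side note, the same mapping can be written with a dictionary lookup instead of calling ordering.index for every value. The following is a minimal sketch in plain Python, not code from this chapter: the names label_to_code and values are hypothetical, and values simply mirrors the column printed earlier.

```python
# ordered labels, as defined earlier: the list index is the numeric code
ordering = ['dislike', 'somewhat like', 'like']

# build the label -> code lookup once (hypothetical helper name)
label_to_code = {label: code for code, label in enumerate(ordering)}

# stand-in for X['ordinal_column'], copied from the column printed earlier
values = ['somewhat like', 'like', 'somewhat like',
          'like', 'somewhat like', 'dislike']

# look up the code for each label
codes = [label_to_code[v] for v in values]
print(codes)  # [1, 2, 1, 2, 1, 0]
```

The map method of a pandas Series also accepts a dictionary, so X['ordinal_column'].map(label_to_code) should give the same result on the column. One difference to keep in mind is that labels missing from the dictionary become NaN rather than raising an error.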

Note that scikit-learn has a LabelEncoder, but we are not using it because it does not let us specify the order of the categories (0 for dislike, 1 for somewhat like, 2 for like) as we have done previously. Instead, it assigns codes by sorting the labels, which for strings means alphabetical order, and that is not the order we want here.
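To see why a sort-based encoding is a problem here, the sketch below imitates it in plain Python by numbering the labels in sorted order. It is only an illustration of the idea, not scikit-learn's implementation, and sorted_codes and custom_codes are hypothetical names.

```python
labels = ['dislike', 'somewhat like', 'like']

# codes assigned by alphabetical sorting, as a sort-based encoder would do
sorted_codes = {label: code for code, label in enumerate(sorted(labels))}

# codes assigned by our own ordering
custom_codes = {label: code for code, label in enumerate(labels)}

print(sorted_codes)  # {'dislike': 0, 'like': 1, 'somewhat like': 2}
print(custom_codes)  # {'dislike': 0, 'somewhat like': 1, 'like': 2}
```

With sorting, like (1) ends up below somewhat like (2), which reverses the intended order of the two labels.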

Once again, let us make a custom label encoder that will fit into our pipeline:

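from sklearn.base import TransformerMixin

class CustomEncoder(TransformerMixin):
    def __init__(self, col, ordering=None):
        self.ordering = ordering  # list of labels, from lowest to highest
        self.col = col            # name of the ordinal column to encode

    def transform(self, df):
        X = df.copy()  # work on a copy so the input DataFrame is unchanged
        # replace each label with its position in the ordering list
        X[self.col] = X[self.col].map(lambda x: self.ordering.index(x))
        return X

    def fit(self, *_):
        # nothing to learn; the ordering is supplied by the user
        return self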

We have maintained the structure of the other custom transformers in this chapter. Here, we have used the map method and the lambda function detailed previously to transform the specified column. Note the key parameter, ordering, which determines the numerical value each label will be encoded into.
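One consequence of relying on ordering.index is that any label missing from ordering raises a ValueError. The sketch below shows this behavior in plain Python. The helper encode_label is hypothetical, introduced only for illustration, and is not part of the CustomEncoder.

```python
ordering = ['dislike', 'somewhat like', 'like']

def encode_label(label, ordering):
    """Return the position of label in ordering (hypothetical helper)."""
    try:
        return ordering.index(label)
    except ValueError:
        # re-raise with a message that names the offending label
        raise ValueError("label %r is not in ordering %r" % (label, ordering))

print(encode_label('like', ordering))  # 2

try:
    encode_label('love', ordering)  # 'love' is not one of our labels
except ValueError as err:
    print(err)
```

Missing values such as None would fail in the same way, so an ordinal column with missing values would presumably need them imputed before this step.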

Let's call our custom encoder:

ce = CustomEncoder(col='ordinal_column', ordering=['dislike', 'somewhat like', 'like'])

ce.fit_transform(X)

Our dataset after these transformations looks like the following:

   boolean  city           ordinal_column  quantitative_column
0  yes      tokyo          1               1.0
1  no       None           2               11.0
2  None     london         1               -0.5
3  no       seattle        2               10.0
4  no       san francisco  1               NaN
5  yes      tokyo          0               20.0

Our ordinal column is now labeled.

Up to this point, we have transformed the following columns accordingly:

  • boolean, city: dummy encoding
  • ordinal_column: label encoding