Creating our pipeline

To review, we have transformed the columns in our dataset in the following ways thus far:

  • boolean, city: dummy encoding
  • ordinal_column: label encoding
  • quantitative_column: bucketing (binning) into ordinal-level data
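As a quick refresher, the three kinds of transformation listed above can be sketched with plain pandas calls. This is only an illustration and not the custom transformers built in this chapter: the small frame, the category ordering, the choice of three equal-width bins, and the value 8.3 standing in for the missing quantitative entry are all assumptions made for the example.

```python
import pandas as pd

# Toy frame loosely mirroring the example data, with gaps already filled in.
# The 8.3 is an assumed stand-in for the missing quantitative value.
X_demo = pd.DataFrame({
    "city": ["tokyo", "tokyo", "london", "seattle", "san francisco", "tokyo"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1.0, 11.0, -0.5, 10.0, 8.3, 20.0],
})

# Dummy encoding: one indicator column per category
dummies = pd.get_dummies(X_demo["city"], prefix="city")

# Label encoding: replace each ordered category with its rank
ordering = ["dislike", "somewhat like", "like"]  # assumed order, lowest to highest
encoded = X_demo["ordinal_column"].map(lambda value: ordering.index(value))

# Bucketing: cut the numeric range into 3 equal-width bins labeled 0, 1, 2
binned = pd.cut(X_demo["quantitative_column"], bins=3, labels=False)

print(dummies.columns.tolist())
print(encoded.tolist())
print(binned.tolist())
```

Each result keeps the row order of the input, which is what lets the steps be chained one after another in a pipeline.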

Since we now have transformations for all of our columns, let's put everything together in a pipeline. 

Start by importing the Pipeline class from scikit-learn:

from sklearn.pipeline import Pipeline

We will bring together each of the custom transformers that we have created. Here is the order we will follow in our pipeline:

  1. First, we will utilize the imputer to fill in missing values
  2. Next, we will dummify our categorical columns
  3. Then, we will encode the ordinal_column
  4. Finally, we will bucket the quantitative_column

Let's set up our pipeline as follows:

pipe = Pipeline([
    ("imputer", imputer),  # first, fill in missing values with our imputer
    ("dummify", cd),       # next, dummify the categorical columns
    ("encode", ce),        # then, encode the ordinal column
    ("cut", cc),           # finally, bucket (bin) the quantitative column
])
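The objects imputer, cd, ce, and cc were built earlier in the chapter, so the line above will not run on its own. To show how steps chain together, here is a minimal, self-contained sketch. The classes FillMostFrequent and Dummify and the names toy and toy_pipe are hypothetical stand-ins invented for this example; they are not the chapter's transformers.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class FillMostFrequent(TransformerMixin, BaseEstimator):
    """Hypothetical imputer: fills gaps in one column with its most common value."""

    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        # Learn the fill value from the data seen during fit
        self.fill_value_ = X[self.col].mode()[0]
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        X[self.col] = X[self.col].fillna(self.fill_value_)
        return X


class Dummify(TransformerMixin, BaseEstimator):
    """Hypothetical dummifier: one indicator column per category."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self  # nothing to learn in this simplified version

    def transform(self, X):
        return pd.get_dummies(X, columns=self.cols)


toy = pd.DataFrame({"city": ["tokyo", None, "london", "tokyo"]})

# Each step receives the output of the step before it
toy_pipe = Pipeline([
    ("imputer", FillMostFrequent(col="city")),
    ("dummify", Dummify(cols=["city"])),
])
result = toy_pipe.fit_transform(toy)
print(result)
```

Order matters here: if the dummify step ran before the imputer, the row with the missing city would get no indicator set at all, which is why imputing comes first.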

In order to see the full transformation of our data using our pipeline, let's take a look at our data with zero transformations:

# take a look at our data before fitting our pipeline
print(X)

This is what our data looked like in the beginning before any transformations were made:

   boolean           city  ordinal_column  quantitative_column
0      yes          tokyo   somewhat like                  1.0
1       no           None            like                 11.0
2     None         london   somewhat like                 -0.5
3       no        seattle            like                 10.0
4       no  san francisco   somewhat like                  NaN
5      yes          tokyo         dislike                 20.0

We can now fit our pipeline:

# now fit our pipeline
pipe.fit(X)

>>>>
Pipeline(memory=None, steps=[('imputer', Pipeline(memory=None, steps=[('quant', <__main__.CustomQuantitativeImputer object at 0x128bf00d0>), ('category', <__main__.CustomCategoryImputer object at 0x13666bf50>)])), ('dummify', <__main__.CustomDummifier object at 0x128bf0ed0>), ('encode', <__main__.CustomEncoder object at 0x127e145d0>), ('cut', <__main__.CustomCutter object at 0x13666bc90>)])

Now that our pipeline has been fitted, let's transform our DataFrame:

pipe.transform(X)

Here is what our final dataset looks like after undergoing all of the appropriate transformations by column:

   ordinal_column  quantitative_column  boolean_no  boolean_yes  city_london  city_san francisco  city_seattle  city_tokyo
0               1                    0           0            1            0                   0             0           1
1               2                    1           1            0            0                   0             0           1
2               1                    0           1            0            1                   0             0           0
3               2                    1           1            0            0                   0             1           0
4               1                    1           1            0            0                   1             0           0
5               0                    2           0            1            0                   0             0           1
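Fitting and transforming were done as two separate calls above. The sketch below illustrates, under stated assumptions, why the two are distinct: fit learns values from the data, and transform reuses those learned values, even on rows the pipeline has never seen. MeanFiller, demo_pipe, train, and new_rows are hypothetical names made up for this example, and filling with the mean is an assumed strategy, not necessarily the one used by the chapter's imputer.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MeanFiller(TransformerMixin, BaseEstimator):
    """Hypothetical imputer: remembers the column mean seen during fit."""

    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        self.mean_ = X[self.col].mean()  # NaN values are skipped by default
        return self

    def transform(self, X):
        X = X.copy()
        X[self.col] = X[self.col].fillna(self.mean_)
        return X


train = pd.DataFrame({"quantitative_column": [1.0, 11.0, -0.5, 10.0, np.nan, 20.0]})
new_rows = pd.DataFrame({"quantitative_column": [np.nan, 4.0]})

demo_pipe = Pipeline([("fill", MeanFiller(col="quantitative_column"))])

# fit_transform is shorthand for fit followed by transform on the same data
same_data = demo_pipe.fit_transform(train)

# transform alone reuses the mean learned from train, not from new_rows
filled_new = demo_pipe.transform(new_rows)
print(filled_new)
```

This is the property that makes a fitted pipeline reusable: the same learned values are applied consistently to any later data.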
