To review, we have transformed the columns in our dataset in the following ways thus far:
- boolean, city: dummy encoding
- ordinal_column: label encoding
- quantitative_column: binned into ordinal-level data
Since we now have transformations for all of our columns, let's put everything together in a pipeline.
Start by importing the Pipeline class from scikit-learn:
from sklearn.pipeline import Pipeline
We will bring together each of the custom transformers that we have created. Here is the order we will follow in our pipeline:
- First, we will utilize the imputer to fill in missing values
- Next, we will dummify our categorical columns
- Then, we will encode the ordinal_column
- Finally, we will bucket the quantitative_column
Let's set up our pipeline as follows:
pipe = Pipeline([
    ("imputer", imputer),  # will use our initial imputer
    ("dummify", cd),       # will dummify variables next
    ("encode", ce),        # then encode the ordinal column
    ("cut", cc),           # then bucket (bin) the quantitative column
])
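The pipeline above relies on the transformer objects (imputer, cd, ce, cc) built earlier. For readers who want to run the four steps in isolation, here is a minimal, self-contained sketch. The Sketch* classes, their parameters, and the label ordering are hypothetical stand-ins written for illustration, not the custom transformers defined earlier, and they assume mode/mean imputation and three equal-width bins.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SketchImputer(BaseEstimator, TransformerMixin):
    """Hypothetical: fill categorical gaps with the mode, numeric gaps with the mean."""

    def __init__(self, cat_cols, quant_cols):
        self.cat_cols = cat_cols
        self.quant_cols = quant_cols

    def fit(self, X, y=None):
        # Learn one fill value per column from the data passed to fit
        self.fill_ = {c: X[c].mode()[0] for c in self.cat_cols}
        self.fill_.update({c: X[c].mean() for c in self.quant_cols})
        return self

    def transform(self, X):
        return X.fillna(self.fill_)


class SketchDummifier(BaseEstimator, TransformerMixin):
    """Hypothetical: dummy-encode the given columns."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Remember the dummy columns seen at fit time
        self.columns_ = pd.get_dummies(X, columns=self.cols).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X, columns=self.cols, dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)


class SketchEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical: map ordered labels to their position in `ordering`."""

    def __init__(self, col, ordering):
        self.col = col
        self.ordering = ordering

    def fit(self, X, y=None):
        self.mapping_ = {label: i for i, label in enumerate(self.ordering)}
        return self

    def transform(self, X):
        X = X.copy()
        X[self.col] = X[self.col].map(self.mapping_)
        return X


class SketchCutter(BaseEstimator, TransformerMixin):
    """Hypothetical: bin a numeric column into equal-width buckets."""

    def __init__(self, col, bins):
        self.col = col
        self.bins = bins

    def fit(self, X, y=None):
        # Learn the bin edges once so transform reuses them
        _, self.edges_ = pd.cut(X[self.col], bins=self.bins, retbins=True)
        return self

    def transform(self, X):
        X = X.copy()
        X[self.col] = pd.cut(X[self.col], bins=self.edges_,
                             labels=False, include_lowest=True)
        return X


X = pd.DataFrame({
    "boolean": ["yes", "no", None, "no", "no", "yes"],
    "city": ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1, 11, -0.5, 10, None, 20],
})

sketch_pipe = Pipeline([
    ("imputer", SketchImputer(["boolean", "city"], ["quantitative_column"])),
    ("dummify", SketchDummifier(["boolean", "city"])),
    ("encode", SketchEncoder("ordinal_column", ["dislike", "somewhat like", "like"])),
    ("cut", SketchCutter("quantitative_column", bins=3)),
])
result = sketch_pipe.fit(X).transform(X)
print(result)
```

Each step receives the output of the step before it, which is why imputation has to come first: dummy encoding and binning both assume there are no missing values left. One design choice in this sketch is that the cutter learns its bin edges in fit, so new data passed to transform is bucketed with the same edges.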
In order to see the full transformation of our data using our pipeline, let's take a look at our data with zero transformations:
# take a look at our data before fitting our pipeline
print(X)
This is what our data looked like in the beginning before any transformations were made:
| | boolean | city | ordinal_column | quantitative_column |
|---|---|---|---|---|
| 0 | yes | tokyo | somewhat like | 1.0 |
| 1 | no | None | like | 11.0 |
| 2 | None | london | somewhat like | -0.5 |
| 3 | no | seattle | like | 10.0 |
| 4 | no | san francisco | somewhat like | NaN |
| 5 | yes | tokyo | dislike | 20.0 |
We can now fit our pipeline:
# now fit our pipeline
pipe.fit(X)
>>>>
Pipeline(memory=None, steps=[('imputer', Pipeline(memory=None, steps=[('quant', <__main__.CustomQuantitativeImputer object at 0x128bf00d0>), ('category', <__main__.CustomCategoryImputer object at 0x13666bf50>)])), ('dummify', <__main__.CustomDummifier object at 0x128bf0ed0>), ('encode', <__main__.CustomEncoder object at 0x127e145d0>), ('cut', <__main__.CustomCutter object at 0x13666bc90>)])
Now that our pipeline has been fitted, let's transform our DataFrame:
pipe.transform(X)
Here is what our final dataset looks like after undergoing all of the appropriate transformations by column:
| | ordinal_column | quantitative_column | boolean_no | boolean_yes | city_london | city_san francisco | city_seattle | city_tokyo |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 |
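Calling fit and then transform on the same data can also be written as a single fit_transform call. The sketch below illustrates this on the quantitative column only. It swaps in scikit-learn's built-in SimpleImputer and KBinsDiscretizer for our custom imputer and cutter, and it assumes mean imputation and three equal-width bins, so treat it as an approximation of the steps above and not the same code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# The quantitative column from our dataset, as a 2-D array with one feature
quant = np.array([[1.0], [11.0], [-0.5], [10.0], [np.nan], [20.0]])

builtin_pipe = Pipeline([
    # assumption: fill the missing value with the column mean
    ("impute", SimpleImputer(strategy="mean")),
    # assumption: three equal-width buckets, returned as ordinal codes
    ("cut", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),
])

# Two calls: learn the parameters, then apply them
two_step = builtin_pipe.fit(quant).transform(quant)

# One call: learn and apply in a single pass over the same data
one_step = builtin_pipe.fit_transform(quant)

print(two_step.ravel())
print(np.array_equal(two_step, one_step))
```

Under these assumptions the bucket codes agree with the quantitative_column values in the final table. On new data, call transform alone so that the fill value and bin edges learned during fit are reused.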