Implementing scikit-learn's pipelines

In the previous recipes, we showed all the steps required to build a machine learning model: loading the data, splitting it into training and test sets, imputing missing values, encoding categorical features, and, lastly, fitting a decision tree classifier.

The process requires multiple steps to be executed in a certain order, which becomes error-prone when the workflow is modified frequently along the way. That is why scikit-learn introduced Pipelines. By using Pipelines, we can sequentially apply a list of transformations to the data, and then train a given estimator (model).
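As a minimal sketch of this idea, the snippet below chains one transformation (median imputation) with a decision tree classifier. The toy arrays and the step names `imputer` and `classifier` are assumptions made for illustration only; they are not the dataset or the names used in the recipe.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): two numeric features with a few missing values
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Steps are (name, object) tuples and run in the order listed:
# first the imputer transforms the data, then the classifier is trained on it
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])

# A single call fits the imputer, transforms X, and fits the tree
pipe.fit(X, y)

# predict reuses the fitted imputer before calling the tree
preds = pipe.predict(X)
```

Calling `fit` and `predict` on the Pipeline object replaces the separate calls on each step, so the imputation fitted on the training data is applied automatically at prediction time.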

One important point to be aware of is that the intermediate steps of the Pipeline must implement both the fit and transform methods, whereas the final estimator only needs the fit method. Using Pipelines has several benefits:

The flow is much easier to read and understand: the chain of operations to be executed on given columns is clear
The order of steps is enforced by the Pipeline
Reproducibility improves, because the entire workflow lives in a single object that can be refit on new data
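To illustrate the requirement that intermediate steps expose fit and transform, here is a hedged sketch of a custom transformer used inside a Pipeline. The class `ClipOutliers`, its percentile parameters, and the toy data are hypothetical and introduced only for this example; inheriting from scikit-learn's `TransformerMixin` and `BaseEstimator` is one common way to satisfy the interface, not the only one.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to percentiles learned in fit."""

    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the per-column clipping bounds from the training data only
        X = np.asarray(X, dtype=float)
        self.lower_ = np.percentile(X, self.lower, axis=0)
        self.upper_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the bounds learned in fit
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_, self.upper_)


# Toy data (hypothetical) with one extreme value in the first column
X = np.array([
    [1.0, 100.0],
    [2.0, 110.0],
    [3.0, 120.0],
    [4.0, 130.0],
    [5.0, 140.0],
    [1000.0, 150.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[
    ("clipper", ClipOutliers()),  # intermediate step: fit + transform
    ("classifier", DecisionTreeClassifier(random_state=42)),  # final step: fit
])
pipe.fit(X, y)

# Inspect what the fitted intermediate step does to the data
clipped = pipe.named_steps["clipper"].transform(X)
```

The final `DecisionTreeClassifier` has no transform method, which is fine because it sits at the end of the chain.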

In this recipe, we show how to create the entire project's pipeline, from loading the data to training the classifier.
