Implementing pipelines in scikit-learn

The Pipeline class itself has a fit, a predict, and a score method, which all behave just like any other estimator in scikit-learn. The most common use case of the Pipeline class is to chain different preprocessing steps together with a supervised model like a classifier.

Let's return to the breast cancer dataset from Chapter 5, Using Decision Trees to Make a Medical Diagnosis. Using scikit-learn, we import the dataset and split it into training and test sets:

In [1]: from sklearn.datasets import load_breast_cancer
...     from sklearn.model_selection import train_test_split
...     import numpy as np
...     cancer = load_breast_cancer()
...     X = cancer.data.astype(np.float32)
...     y = cancer.target
In [2]: X_train, X_test, y_train, y_test = train_test_split(
...         X, y, random_state=37
...     )

Instead of the kNN algorithm, we could fit a Support Vector Machine (SVM) to the data:

In [3]: from sklearn.svm import SVC
... svm = SVC()
... svm.fit(X_train, y_train)
Out[3]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001,
verbose=False)

Without straining our brains too hard, this algorithm achieves an accuracy score of 65%:

In [4]: svm.score(X_test, y_test)
Out[4]: 0.65034965034965031

Now, if we wanted to run the algorithm again with a preprocessing step (for example, scaling the data first with MinMaxScaler), we would have to perform the preprocessing by hand and then feed the preprocessed data into the classifier's fit method.
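As a plain-Python sketch (outside the IPython session), the manual version might look like this. Note that the scaler is fit only on the training data and then reused to transform the test data, so no information leaks from the test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=37
)

# do the preprocessing by hand: fit the scaler on the training data only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# feed the preprocessed data into the classifier's fit method
svm = SVC()
svm.fit(X_train_scaled, y_train)

# reuse the same fitted scaler on the test data before scoring
X_test_scaled = scaler.transform(X_test)
score = svm.score(X_test_scaled, y_test)
```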

An alternative is to use a pipeline object. Here, we want to specify a list of processing steps, where each step is a tuple containing a name (any string of our choosing) and an instance of an estimator:

In [5]: from sklearn.pipeline import Pipeline
...     from sklearn.preprocessing import MinMaxScaler
...     pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

Here, we created two steps: the first, called "scaler", is an instance of MinMaxScaler, and the second, called "svm", is an instance of SVC.
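If we don't care about choosing the names ourselves, scikit-learn also provides the make_pipeline convenience function, which builds the same kind of object and names each step after its lowercased class name:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# make_pipeline auto-generates the step names from the class names
pipe = make_pipeline(MinMaxScaler(), SVC())
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['minmaxscaler', 'svc']
```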

Now, we can fit the pipeline like any other scikit-learn estimator:

In [6]: pipe.fit(X_train, y_train)
Out[6]: Pipeline(steps=[('scaler', MinMaxScaler(copy=True,
        feature_range=(0, 1))), ('svm', SVC(C=1.0,
        cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape=None, degree=3, gamma='auto',
        kernel='rbf', max_iter=-1, probability=False,
        random_state=None, shrinking=True, tol=0.001,
        verbose=False))])

Here, the fit method first calls fit on the first step (the scaler), then it transforms the training data using the scaler, and finally, it fits the SVM with the scaled data.
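After fitting, each fitted step remains accessible through the pipeline's named_steps attribute, keyed by the name we gave it. A small sketch, using the same breast cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(cancer.data, cancer.target)

# look up the fitted scaler by its step name
scaler = pipe.named_steps["scaler"]
# data_min_ holds the per-feature minima learned during fit,
# one entry for each of the dataset's 30 features
print(scaler.data_min_.shape)
```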

And voila! When we score the classifier on the test data, we see a drastic improvement in performance:

In [7]: pipe.score(X_test, y_test)
Out[7]: 0.95104895104895104

Calling the score method on the pipeline first transforms the test data using the scaler and then calls the score method on the SVM using the scaled test data. And scikit-learn did all of this with only four lines of code!

The main benefit of using the pipeline, however, is that we can now use this single estimator in cross_val_score or GridSearchCV.
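For example, we can grid-search over the SVM's hyperparameters while the scaling is applied correctly inside every cross-validation fold. Pipeline parameters are addressed by the step name, a double underscore, and the parameter name (the value grid below is an illustrative choice, not from the text):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=37
)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# step name + "__" + parameter name addresses a parameter inside the pipeline
param_grid = {"svm__C": [0.1, 1, 10, 100],
              "svm__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```

Because the scaler is part of the estimator being cross-validated, it is refit on each fold's training portion, which avoids leaking test-fold statistics into the preprocessing.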
