How it works...

In Step 1, we imported the required libraries. The list can look a bit daunting, but that is only because we need to combine multiple functions/classes used in the previous recipes.

In Step 2, we loaded the data from a CSV file, separated the target variable, and created a stratified train-test split. Next, we created two lists containing the names of the numerical and categorical features, as we will apply different transformations depending on the type of the feature. To select the appropriate columns, we used the select_dtypes method.
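A minimal sketch of this step might look as follows; the file name and target column are placeholders, not the recipe's actual values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# placeholder file name and target column
df = pd.read_csv("data.csv")
X = df.drop(columns=["target"])
y = df["target"]

# stratify=y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# split the feature names by dtype
num_features = X_train.select_dtypes(include="number").columns.tolist()
cat_features = X_train.select_dtypes(include="object").columns.tolist()
```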

In Step 4, we started preparing the separate Pipelines. For the numerical one, we only wanted to impute the missing values of the features using the column median. To define the Pipeline, we provided a list of tuples describing the steps, where each tuple contained the name of the step (for easier identification) and the class we wanted to use; in this case, it was SimpleImputer.
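Following that description, the numerical Pipeline could be sketched as:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# each step is a (name, transformer) tuple; the name is just a label
num_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])
```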

In Step 5, we prepared a similar Pipeline for the categorical features. This time, however, we chained two different operations: the imputer (using the most frequent value) and the one-hot encoder. For the encoder, we also specified a list of lists called cat_list, in which we enumerated all the possible categories based on X_train. We did so mostly for the sake of the next recipe, in which we introduce cross-validation, as some of the random draws may not contain all the categories.
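A sketch of the categorical Pipeline, reusing cat_features and X_train from the earlier snippet; the way cat_list is built here is one plausible approach, not necessarily the recipe's exact code:

```python
from sklearn.preprocessing import OneHotEncoder

# one list of observed categories per categorical column, collected
# from X_train so the encoder's vocabulary is fixed up front
cat_list = [
    list(X_train[col].dropna().unique()) for col in cat_features
]

cat_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(categories=cat_list)),
])
```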

In Step 6, we defined the ColumnTransformer object, which applies a given transformation to a given set of columns. Again, we passed a list of tuples, where each tuple contained a name, one of the Pipelines we defined before, and the list of columns the transformations should be applied to. We also specified remainder='drop' to drop any extra columns that no transformations were applied to. In this case, the transformations covered all the features, so no columns were dropped.

Please bear in mind that ColumnTransformer returns numpy arrays instead of pandas DataFrames!
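Putting the two Pipelines together, the ColumnTransformer from Step 6 could look like this (the names and variables are carried over from the sketches above):

```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", num_pipeline, num_features),
        ("categorical", cat_pipeline, cat_features),
    ],
    remainder="drop",  # drop any columns not listed above
)
```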

In Step 7, we once again used a Pipeline to chain the preprocessor (the previously defined ColumnTransformer object) with the decision tree classifier (for comparability, we set the random state to 42). The last two steps involved fitting the entire Pipeline to the data and using the custom function to measure the performance of the model. The performance_evaluation_report function was built in such a way that it works with any estimator or Pipeline that offers the predict and predict_proba methods for creating predictions.
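A sketch of these final steps; performance_evaluation_report is the custom function defined in an earlier recipe, so the exact call below is an assumption:

```python
from sklearn.tree import DecisionTreeClassifier

# chain preprocessing and the classifier into one estimator
tree_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])

tree_pipeline.fit(X_train, y_train)

# the function's signature here is assumed; it only requires that the
# fitted object exposes predict and predict_proba
report = performance_evaluation_report(tree_pipeline, X_test, y_test)
```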
