Creating a baseline machine learning pipeline

In previous chapters, we offered you, the reader, a single machine learning model to use throughout the chapter. In this chapter, we will first find the best machine learning model for our needs and then work to enhance that model with feature selection. We will begin by importing four different machine learning models:

  • Logistic Regression
  • K-Nearest Neighbors
  • Decision Tree
  • Random Forest

The code for importing the learning models is given as follows:

# Import four machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Once we are finished importing these modules, we will run them through our get_best_model_and_accuracy function to get a baseline of how each one handles the raw data. To do so, we first have to establish the parameter grids that the grid search will try for each model. We will use the following code to do this:

# Set up some parameters for our grid search
# We will start with four different machine learning model parameters

# Logistic Regression
lr_params = {'C':[1e-1, 1e0, 1e1, 1e2], 'penalty':['l1', 'l2']}

# KNN
knn_params = {'n_neighbors': [1, 3, 5, 7]}

# Decision Tree
tree_params = {'max_depth':[None, 1, 3, 5, 7]}

# Random Forest
forest_params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]}

If you feel uncomfortable with any of the models listed above, we recommend reading up on their documentation, or referring to the Packt book, Principles of Data Science, https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science, for a more detailed explanation of the algorithms.
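
As a reminder, the get_best_model_and_accuracy function was defined earlier in the chapter. A minimal sketch of it, assuming it simply wraps scikit-learn's GridSearchCV and reports the best mean cross-validated accuracy along with the average fit and score times, might look like this:

from sklearn.model_selection import GridSearchCV

def get_best_model_and_accuracy(model, params, X, y):
    # exhaustively try every parameter combination with cross-validation;
    # parameter sets that error out are scored as 0 rather than raising
    grid = GridSearchCV(model, params, error_score=0.0)
    grid.fit(X, y)
    # the best mean cross-validated accuracy and the parameters that achieved it
    print("Best Accuracy: {}".format(grid.best_score_))
    print("Best Parameters: {}".format(grid.best_params_))
    # the average time (in seconds) spent fitting and scoring across all candidates
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))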

Because we will be sending each model through our function, which invokes a grid search module, we need only create blank-slate models with no customized parameters set, as shown in the following code:

# instantiate the four machine learning models
lr = LogisticRegression(solver='liblinear')  # liblinear supports both the 'l1' and 'l2' penalties in our grid
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()

We are now going to run each of the four machine learning models through our evaluation function to see how well (or not) they do against our dataset. Recall that our number to beat at the moment is .7788, the baseline null accuracy. We will use the following code to run the models:

get_best_model_and_accuracy(lr, lr_params, X, y)

Best Accuracy: 0.809566666667
Best Parameters: {'penalty': 'l1', 'C': 0.1}
Average Time to Fit (s): 0.602
Average Time to Score (s): 0.002

We can see that the logistic regression has already beaten the null accuracy using the raw data and, on average, took 6/10 of a second to fit to a training set and only 2 milliseconds to score. This makes sense if we know that, to fit, a logistic regression in scikit-learn must create a large matrix in memory, but to predict, it need only multiply and add scalars to one another.
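
To make the cheap-to-predict point concrete, here is a minimal sketch (on a small synthetic dataset, purely for illustration) showing that a fitted logistic regression's predictions reduce to a dot product with the learned coefficients followed by a sigmoid:

import numpy as np
from scipy.special import expit  # the logistic sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# a small synthetic binary classification problem, purely for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
lr_demo = LogisticRegression().fit(X_demo, y_demo)

# predicting is just a linear combination of the inputs passed through a sigmoid
manual_probs = expit(X_demo.dot(lr_demo.coef_.T) + lr_demo.intercept_).ravel()
sklearn_probs = lr_demo.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_probs, sklearn_probs))  # True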

Now, let's do the same with the KNN model, using the following code:

get_best_model_and_accuracy(knn, knn_params, X, y)

Best Accuracy: 0.760233333333
Best Parameters: {'n_neighbors': 7}
Average Time to Fit (s): 0.035
Average Time to Score (s): 0.88

Our KNN model, as expected, does much better on fitting time. This is because, to fit to the data, KNN only has to store the data in such a way that it is easily retrieved at prediction time, which is where it takes the hit. It's also worth mentioning the painfully obvious fact that the accuracy does not even beat the null accuracy! You might be wondering why, and if you're saying hey, wait a minute, doesn't KNN use the Euclidean distance to make predictions, which can be thrown off by non-standardized data, a flaw that none of the other three machine learning models suffer from?, then you're 100% correct.
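
To see why scale matters for a distance-based model, consider the following tiny illustration (with made-up numbers): when one feature lives on a much larger scale than another, it completely dominates the Euclidean distance.

import numpy as np

# two observations with one small-scale feature and one large-scale feature (made-up numbers)
a = np.array([1.0, 50000.0])
b = np.array([3.0, 51000.0])

# the large-scale feature drowns out the small-scale one in the distance
print(np.linalg.norm(a - b))  # ~1000.0; the difference of 2 in the first feature barely registers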

KNN is a distance-based model, in that it uses a metric of closeness in space that assumes all features are on the same scale, which we already know our data is not. So, for KNN, we will have to construct a more complicated pipeline to more accurately assess its baseline performance, using the following code:

# bring in some familiar modules for dealing with this sort of thing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# construct pipeline parameters based on the parameters
# for KNN on its own
knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}

# KNN requires a StandardScaler because it uses Euclidean distance
# as the main equation for predicting observations
knn_pipe = Pipeline([('scale', StandardScaler()), ('classifier', knn)])

# quick to fit, very slow to predict
get_best_model_and_accuracy(knn_pipe, knn_pipe_params, X, y)

print(knn_pipe_params)  # {'classifier__n_neighbors': [1, 3, 5, 7]}

Best Accuracy: 0.8008
Best Parameters: {'classifier__n_neighbors': 7}
Average Time to Fit (s): 0.035
Average Time to Score (s): 6.723

The first thing to notice is that our modified pipeline, which now includes a StandardScaler (which z-score normalizes our features), at the very least beats the null accuracy, but it also seriously hurts our prediction time, as we have added a preprocessing step. So far, the logistic regression is in the lead with the best accuracy and the better overall pipeline timing. Let's move on to our two tree-based models and start with the simpler of the two, the decision tree, with the help of the following code:

get_best_model_and_accuracy(d_tree, tree_params, X, y)

Best Accuracy: 0.820266666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.158
Average Time to Score (s): 0.002

Amazing! Already, we have a new leader in accuracy, and the decision tree is also quick to both fit and predict. In fact, it beats the logistic regression in its time to fit and beats the KNN in its time to predict. Let's finish off our test by evaluating a random forest, using the following code:

get_best_model_and_accuracy(forest, forest_params, X, y)

Best Accuracy: 0.819566666667
Best Parameters: {'n_estimators': 50, 'max_depth': 7}
Average Time to Fit (s): 1.107
Average Time to Score (s): 0.044

The random forest does much better than either the logistic regression or the KNN, but not better than the decision tree. Let's aggregate these results to see which model we should move forward with and optimize using feature selection:

Model Name            Accuracy   Fit Time (s)   Predict Time (s)
Logistic Regression   .8096      .602           .002
KNN (with scaling)    .8008      .035           6.72
Decision Tree         .8203      .158           .002
Random Forest         .8196      1.107          .044

The decision tree comes in first for accuracy and ties with the logistic regression for first in predict time, while KNN with scaling takes the trophy for being the fastest to fit to our data. Overall, the decision tree appears to be the best model to move forward with, as it came in first on, arguably, our two most important metrics:

  • We definitely want the best accuracy to ensure that out-of-sample predictions are accurate
  • Having a fast prediction time is useful, considering that the models will be utilized in real-time production settings

The approach we are taking here selects a model before selecting any features. It is not required to work in this fashion, but we find that it generally saves the most time when working under time pressure. For your own purposes, we recommend experimenting with many models concurrently, rather than limiting yourself to a single model.
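
If you do want to evaluate several models concurrently, a simple loop over (model, parameter grid) pairs works well. The following is a minimal sketch reusing the objects defined earlier in this section:

# evaluate all four models in a single pass
models_and_params = [
    (lr, lr_params),
    (knn_pipe, knn_pipe_params),  # remember that KNN needs the scaling pipeline
    (d_tree, tree_params),
    (forest, forest_params),
]

for model, params in models_and_params:
    print(model.__class__.__name__)
    get_best_model_and_accuracy(model, params, X, y)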

Knowing that we will be using the decision tree for the remainder of this chapter, we know two more things:

  • The new baseline accuracy to beat is .8203, the accuracy the tree obtained when fitting to the entire dataset
  • We no longer have to use our StandardScaler, as decision trees are unaffected by feature scaling when it comes to model performance (a quick check of this follows below)
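
As a quick sanity check of that second point, the following sketch (on a small synthetic dataset, purely for illustration) fits the same tree with and without scaling and gets identical cross-validated accuracy:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# a small synthetic dataset, purely for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
scaled_tree = Pipeline([('scale', StandardScaler()),
                        ('classifier', DecisionTreeClassifier(max_depth=3, random_state=0))])

# the two accuracies come out the same: scaling does not affect the tree's splits
print(cross_val_score(raw_tree, X_demo, y_demo).mean())
print(cross_val_score(scaled_tree, X_demo, y_demo).mean())

With that confirmed, we can safely drop the scaling step for the rest of the chapter.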