In previous chapters, we used a single machine learning model throughout each chapter. In this chapter, we will first find the best machine learning model for our needs and then enhance that model with feature selection. We will begin by importing four different machine learning models:
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Random Forest
The code for importing the learning models is given as follows:
# Import four machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Once we have imported these models, we will run each of them through our get_best_model_and_accuracy function to get a baseline on how each one handles the raw data. To do so, we first need to set up some variables, using the following code:
# Set up some parameters for our grid search
# We will start with four different machine learning model parameters
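# Logistic Regression
# (the 'l1' penalty needs a solver that supports it, such as liblinear,
# which was the default solver in older scikit-learn releases)
lr_params = {'C':[1e-1, 1e0, 1e1, 1e2], 'penalty':['l1', 'l2']}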
# KNN
knn_params = {'n_neighbors': [1, 3, 5, 7]}
# Decision Tree
tree_params = {'max_depth':[None, 1, 3, 5, 7]}
# Random Forest
forest_params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]}
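A grid search fits one candidate model per combination of parameter values, so the size of each grid gives a rough sense of how much work each search involves. The following is a minimal sketch that counts the candidates in the four grids using only the standard library. The grid_size helper is a hypothetical name introduced for illustration, and the grids are repeated so the snippet runs on its own:

```python
from itertools import product

# The same four grids as in the chapter, repeated so this snippet is self-contained
lr_params = {'C': [1e-1, 1e0, 1e1, 1e2], 'penalty': ['l1', 'l2']}
knn_params = {'n_neighbors': [1, 3, 5, 7]}
tree_params = {'max_depth': [None, 1, 3, 5, 7]}
forest_params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]}


def grid_size(params):
    # hypothetical helper: number of parameter combinations in one grid
    return len(list(product(*params.values())))


grids = {
    'Logistic Regression': lr_params,
    'KNN': knn_params,
    'Decision Tree': tree_params,
    'Random Forest': forest_params,
}
grid_sizes = {name: grid_size(params) for name, params in grids.items()}
print(grid_sizes)
```

Each candidate is fit once per cross-validation fold, so the random forest search, which has the largest grid, does the most fitting of the four.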
Because we will be sending each model through our function, which invokes a grid search module, we need only create blank slate models with no customized parameters set, as shown in the following code:
# instantiate the four machine learning models
lr = LogisticRegression()
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
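The get_best_model_and_accuracy function comes from an earlier chapter and is not repeated in this section. As a reminder of the kind of work it does, here is a rough stand-in, assuming that the helper wraps scikit-learn's GridSearchCV and reports the best cross-validated accuracy together with average fit and score times. The name best_model_report, the returned dictionary, and the synthetic demo data are all assumptions made for illustration, not the chapter's actual implementation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def best_model_report(model, params, X, y):
    """Hypothetical stand-in for the chapter's helper (an assumption)."""
    # error_score=0.0 scores a failing parameter combination as 0 instead of raising
    grid = GridSearchCV(model, params, error_score=0.0)
    grid.fit(X, y)
    report = {
        'best_accuracy': grid.best_score_,
        'best_params': grid.best_params_,
        # timings are averaged over every candidate in the grid
        'avg_fit_time': grid.cv_results_['mean_fit_time'].mean(),
        'avg_score_time': grid.cv_results_['mean_score_time'].mean(),
    }
    print('Best Accuracy: {}'.format(report['best_accuracy']))
    print('Best Parameters: {}'.format(report['best_params']))
    print('Average Time to Fit (s): {}'.format(round(report['avg_fit_time'], 3)))
    print('Average Time to Score (s): {}'.format(round(report['avg_score_time'], 3)))
    return report


# Synthetic data, used only so that the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_report = best_model_report(
    DecisionTreeClassifier(random_state=0), {'max_depth': [None, 1, 3]}, X_demo, y_demo
)
```

Because the model is cloned and refit for every candidate, passing in an unconfigured estimator is enough; the grid supplies the parameter values.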
We are now going to run each of the four machine learning models through our evaluation function to see how well (or not) they do against our dataset. Recall that our number to beat at the moment is .7788, the baseline null accuracy. We will use the following code to run the models:
get_best_model_and_accuracy(lr, lr_params, X, y)
Best Accuracy: 0.809566666667
Best Parameters: {'penalty': 'l1', 'C': 0.1}
Average Time to Fit (s): 0.602
Average Time to Score (s): 0.002
We can see that logistic regression has already beaten the null accuracy using the raw data and, on average, took about 0.6 seconds to fit to a training set and only 2 milliseconds to score. This makes sense: to fit, a logistic regression in scikit-learn must create a large matrix in memory, but to predict, it need only multiply and add scalars to one another.
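To make the "multiply and add" point concrete, here is a minimal sketch of what predicting with an already fitted logistic regression amounts to. The coefficients and the observation are made-up values chosen for illustration, and predict_proba here is a plain function written for this sketch, not scikit-learn's method:

```python
import math


def predict_proba(weights, bias, features):
    # one multiply and one add per feature, followed by the sigmoid function
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# made-up coefficients and a single made-up observation
demo_weights = [0.8, -0.5, 0.1]
demo_bias = -0.2
demo_row = [1.0, 2.0, 3.0]

demo_probability = predict_proba(demo_weights, demo_bias, demo_row)
demo_label = int(demo_probability >= 0.5)  # threshold the probability at 0.5
print(demo_probability, demo_label)
```

The cost of a prediction grows only with the number of features, which is consistent with the very small scoring time reported above.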
Now, let's do the same with the KNN model, using the following code:
get_best_model_and_accuracy(knn, knn_params, X, y)
Best Accuracy: 0.760233333333
Best Parameters: {'n_neighbors': 7}
Average Time to Fit (s): 0.035
Average Time to Score (s): 0.88
Our KNN model, as expected, does much better on fitting time. This is because, to fit, KNN only has to store the data in a way that makes it easy to retrieve at prediction time, which is where it takes a hit on time. It is also worth pointing out that its accuracy is not even better than the null accuracy! You might be wondering why. If you are thinking, "Hey, wait a minute, doesn't KNN use Euclidean distance to make predictions, which can be thrown off by non-standardized data, a flaw that none of the other three machine learning models suffer from to the same degree?", then you are 100% correct.
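To see how unscaled features can mislead a distance-based model, consider the following small sketch. The five observations and their two features (an age-like column and a balance-like column) are invented for illustration and are not taken from our dataset. The z-scoring uses the population standard deviation, which is also what StandardScaler uses:

```python
import math
import statistics

# invented observations: (age-like feature, balance-like feature)
points = [(25, 50000), (60, 50100), (26, 51000), (45, 20000), (35, 90000)]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def z_score_columns(rows):
    # standardize each column to mean 0 and (population) standard deviation 1
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def nearest_neighbor_index(rows, query_index):
    # scan every other stored row, which is why KNN is slow at prediction time
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: euclidean(rows[i], rows[query_index]))


raw_nn = nearest_neighbor_index(points, 0)
scaled_nn = nearest_neighbor_index(z_score_columns(points), 0)
print(points[raw_nn], points[scaled_nn])
```

On the raw values, the large-scale column dominates the distance, so the nearest neighbor of the first observation is the one with an almost identical balance but a very different age. After standardizing, the nearest neighbor becomes the observation that is close on both features.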
KNN is a distance-based model: it uses a measure of closeness in space that assumes all features are on the same scale, which we already know is not true of our data. So, for KNN, we will have to construct a more complicated pipeline to assess its baseline performance more accurately, using the following code:
# bring in some familiar modules for dealing with this sort of thing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# construct pipeline parameters based on the parameters
# for KNN on its own
knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}
# KNN requires a standard scaler due to using Euclidean distance
# as the main equation for predicting observations
knn_pipe = Pipeline([('scale', StandardScaler()), ('classifier', knn)])
# quick to fit, very slow to predict
get_best_model_and_accuracy(knn_pipe, knn_pipe_params, X, y)
print(knn_pipe_params)  # {'classifier__n_neighbors': [1, 3, 5, 7]}
Best Accuracy: 0.8008
Best Parameters: {'classifier__n_neighbors': 7}
Average Time to Fit (s): 0.035
Average Time to Score (s): 6.723
The first thing to notice is that our modified pipeline, which now includes a StandardScaler (which z-score normalizes our features), at the very least now beats the null accuracy, but it also seriously hurts our prediction time, as we have added a preprocessing step. So far, the logistic regression is in the lead with the best accuracy and the better overall timing of the pipeline. Let's move on to our two tree-based models and start with the simpler of the two, the decision tree, with the help of the following code:
get_best_model_and_accuracy(d_tree, tree_params, X, y)
Amazing! With an accuracy of .8203, we already have a new leader, and the decision tree is also quick to both fit (.158 seconds on average) and predict (.002 seconds on average). In fact, it beats logistic regression in its time to fit and beats the KNN in its time to predict. Let's finish off our test by evaluating a random forest, using the following code:
get_best_model_and_accuracy(forest, forest_params, X, y)
Best Accuracy: 0.819566666667
Best Parameters: {'n_estimators': 50, 'max_depth': 7}
Average Time to Fit (s): 1.107
Average Time to Score (s): 0.044
This is better than both the logistic regression and the KNN, but just short of the decision tree. Let's aggregate these results to see which model we should move forward with when optimizing using feature selection:
Model Name | Accuracy | Fit Time (s) | Predict Time (s)
--- | --- | --- | ---
Logistic Regression | .8096 | .602 | .002
KNN (with scaling) | .8008 | .035 | 6.723
Decision Tree | .8203 | .158 | .002
Random Forest | .8196 | 1.107 | .044
The decision tree comes in first for accuracy and ties with logistic regression for the fastest predict time, while KNN with scaling takes the trophy for being the fastest to fit to our data. Overall, the decision tree appears to be the best model to move forward with, as it came in first for, arguably, our two most important metrics:
- We definitely want the best accuracy to ensure that out-of-sample predictions are accurate
- A fast prediction time is useful, considering that the models are being used in real-time production
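The same comparison can be made programmatically. The following sketch hard-codes the accuracy and timing figures reported in this section and picks out the winner for each metric; the dictionary layout and variable names are just one convenient choice for illustration:

```python
# figures reported earlier in this section
results = {
    'Logistic Regression': {'accuracy': .8096, 'fit_time': .602, 'predict_time': .002},
    'KNN (with scaling)': {'accuracy': .8008, 'fit_time': .035, 'predict_time': 6.723},
    'Decision Tree': {'accuracy': .8203, 'fit_time': .158, 'predict_time': .002},
    'Random Forest': {'accuracy': .8196, 'fit_time': 1.107, 'predict_time': .044},
}

most_accurate = max(results, key=lambda name: results[name]['accuracy'])
fastest_to_fit = min(results, key=lambda name: results[name]['fit_time'])

# more than one model can share the best predict time, so collect all of them
best_predict_time = min(r['predict_time'] for r in results.values())
fastest_to_predict = sorted(
    name for name, r in results.items() if r['predict_time'] == best_predict_time
)

print(most_accurate, fastest_to_fit, fastest_to_predict)
```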
Knowing that we will be using the decision tree for the remainder of this chapter, we know two more things:
- The new baseline accuracy to beat is .8203, the accuracy the tree obtained when fitting to the entire dataset
- We no longer have to use our StandardScaler, as decision trees are unaffected by it when it comes to model performance
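The reason scaling does not matter to a tree is that a split only asks whether a value falls above or below a threshold, and z-scoring preserves the ordering of the values. The following is a simplified, single-feature sketch of that idea using Gini impurity. It is not scikit-learn's implementation, and the feature values and labels are invented for illustration:

```python
import statistics


def gini(labels):
    # Gini impurity for binary 0/1 labels
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (indices sent left, weighted impurity) of the best threshold split."""
    best_left, best_score = None, None
    for threshold in sorted(set(values))[:-1]:
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best_score is None or score < best_score:
            best_left, best_score = frozenset(left), score
    return best_left, best_score


def z_score(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# invented feature values and binary labels
raw_values = [1200, 300, 4500, 800, 5200, 3900]
labels = [0, 0, 1, 0, 1, 1]

raw_left, raw_impurity = best_split(raw_values, labels)
scaled_left, scaled_impurity = best_split(z_score(raw_values), labels)
print(raw_left == scaled_left, raw_impurity == scaled_impurity)
```

The threshold itself changes after scaling, but the observations that end up on each side of the split, and therefore the predictions, stay the same.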