Speeding up your Python code

In the previous chapter, we talked about best practices, approaches, and ways to boost code performance. As a toy example for performance work, we'll build our own KNN model, like the one we used in Chapter 13, Training a Machine Learning Model. As a reminder, KNN is a simple ML model that predicts the target variable by identifying the K closest records in the training set and then taking the mode (for classification) or the weighted average (for regression) of their target values. Obviously, there are quite a few implementations of KNN already, so we are building our own purely as an optimization exercise.

For starters, let's write a naive implementation; even this version is fairly optimized, as it relies on vectorized NumPy operations. First, let's import the Euclidean distance function from sklearn and define a helper that finds the N closest records. Take a look at the following code:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def _closest_N(X1, X2, N=1):
    # pairwise distance matrix: rows are records of X1, columns of X2
    matrix = euclidean_distances(X1, X2)
    # for each row, sort column indices by distance and keep the N closest
    args = np.argsort(matrix, axis=1)[:, :N]
    return args

Here, we pass two datasets with the same number of features, along with the N argument. First, a matrix of pairwise distances between the two datasets is computed. Then, for each row (a data point in the first dataset), we sort the column indices (the data points in the second dataset) by distance, keep the N closest, and return their indices. This function is the main engine of the algorithm.
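As a quick sanity check, here is a tiny usage sketch (the coordinates are made up purely for illustration):

X1 = np.array([[0.0, 0.0], [10.0, 10.0]])
X2 = np.array([[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]])

_closest_N(X1, X2, N=2)
# array([[0, 2],
#        [1, 2]])  # for each row of X1, indices of the 2 closest rows of X2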

Now, we can write an estimator class that stores the X, y, and N arguments and executes the preceding function in its predict method, sklearn-style. Here is the code:

import pandas as pd

class NearestNeighbor:
    X = None
    y = None
    N = None

    def __init__(self, N=3):
        self.N = N

    def fit(self, X, y):
        # lazy learner: just memorize the training data
        self.X = X
        self.y = y

    def predict(self, X):
        # indices of the N closest training records for each test record
        closest = _closest_N(X, self.X, N=self.N)

        # average the neighbors' target values (regression)
        result = pd.Series(np.mean(np.take(self.y.values, closest),
                                   axis=1))
        result.index = X.index
        return result
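Before we can time anything, the model has to be fitted and the test data prepared. The book's data-loading code is not repeated here, so the following setup is only a sketch; the df variable and the features and target column lists are assumptions on our part:

from sklearn.model_selection import train_test_split

# df is assumed to be the 311 complaints DataFrame loaded earlier;
# features and target are placeholder column names
sample = df.sample(2500, random_state=2019)
X, y = sample[features], sample[target]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=2019)

naiveKNN = NearestNeighbor(N=3)
naiveKNN.fit(Xtrain, ytrain)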

Note that even this naive model is vectorized (the heavy lifting happens in NumPy) and uses a specialized function, euclidean_distances, from sklearn. Let's see how it performs. For this, we'll use a sample of 2,500 records from the 311 complaints dataset we used previously. Here is the measurement:

>>> %%timeit
>>> naiveKNN.predict(Xtest)
1.43 s ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The prediction takes 1.43 seconds on average, which is quite a lot!

Premature optimization is the root of all evil: it usually results in bad, fragile code. To avoid that, we need to understand which part of the code we should tinker with; it is a bad idea to optimize before you know which specific part of your code is slow. If we run lprun on the predict method, it is clear that 99.9% of the time is taken up by the _closest_N function. Therefore, we should focus on this function alone.
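The lprun magic comes from the line_profiler package, which has to be installed and loaded as an IPython extension first. Something like the following (our sketch of the invocation, not the book's code) does the job:

>>> %load_ext line_profiler
>>> %lprun -f naiveKNN.predict naiveKNN.predict(Xtest)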

Now, if we run lprun again, this time for the _closest_N function, we'll get the following:

>>> %lprun -f _closest_N naiveKNN.predict(Xtest)

Timer unit: 1e-06 s

Total time: 1.44122 s
File: <ipython-input-124-90edea23066c>
Function: _closest_N at line 4

Line #   Hits       Time    Per Hit  % Time  Line Contents
==============================================================
     4                                       def _closest_N(X1, X2, N=1):
     5      1   196149.0   196149.0    13.6      matrix = euclidean_distances(X1, X2)
     6      1  1245072.0  1245072.0    86.4      args = np.argsort(matrix, axis=1)[:, :N]
     7      1        1.0        1.0     0.0      return args

As you can see, approximately 86% of the time is taken up by sorting, while the remaining 14% is taken up by Euclidean distance computations.
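Since we only need the N smallest distances per row, not a complete ordering, one natural direction is to replace the full sort with a partial selection such as np.argpartition. The following is only a sketch of this idea (the function name is ours), not the optimization we'll settle on:

def _closest_N_partial(X1, X2, N=1):
    # same distance matrix as before
    matrix = euclidean_distances(X1, X2)
    # argpartition selects the N smallest per row in linear time;
    # the N indices are not sorted among themselves, which is fine
    # here because we only average the neighbors' targets
    return np.argpartition(matrix, N, axis=1)[:, :N]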
