Using kNN for Predictive Analytics

kNN is a non-parametric, instance-based, supervised learning algorithm. It is a robust and versatile classifier that is frequently used as a benchmark against more complex models such as Neural Networks (NNs) and Support Vector Machines (SVMs). kNN is commonly used in economic forecasting, data compression, and genetics, for example in gene expression profiling.

Working Principles of kNN

The idea of kNN is that, from a set of features x, we try to predict the labels y; thus, kNN falls into the supervised learning family of algorithms. Informally, this means that we are given a labeled dataset consisting of training observations (x, y), and the task is to model the relationship between x and y as a function f: X→Y learned from that data. The function f can then predict the label y of an unseen point z by looking at the set of its nearest neighbors.

However, the actual method of prediction depends on whether we are doing regression (continuous targets) or classification (discrete targets). For discrete classification targets, the prediction may be given by a maximum voting scheme weighted by the distance to the prediction point:

$$f(z) = \max_{j} \sum_{i=1}^{k} \varphi(d_{ij})\, I_{ij}$$

Here, our prediction, f(z), is the maximum weighted value over all classes, j, where the weight assigned to training point i is given by φ(d_ij), with d_ij denoting the distance between the prediction point and training point i. I_ij is simply an indicator function that is 1 if point i belongs to class j and 0 otherwise.
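To make the voting rule concrete, the following is a minimal NumPy sketch of distance-weighted kNN classification. The function name, the toy data, and the inverse-distance weight φ(d) = 1/d are illustrative assumptions, not part of the TensorFlow example that follows later:

    import numpy as np

    def knn_classify(z, X_train, y_train, k=3):
        d = np.abs(X_train - z).sum(axis=1)        # L1 distance d_i from z to every training point
        nn = np.argsort(d)[:k]                     # indices of the k nearest neighbors
        w = 1.0 / (d[nn] + 1e-8)                   # phi(d_ij): inverse-distance weights (assumption)
        classes = np.unique(y_train)
        votes = [(w * (y_train[nn] == c)).sum() for c in classes]  # weighted vote per class j
        return classes[np.argmax(votes)]           # f(z): class with the maximum weighted vote

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_classify(np.array([0.2, 0.1]), X_train, y_train))   # prints 0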

For continuous regression targets, the prediction is given by a weighted average of all k points nearest to the prediction:

$$f(z) = \frac{1}{k} \sum_{i=1}^{k} \varphi(d_{i})\, y_{i}$$

From the previous two equations, it is clear that the prediction is heavily dependent on the choice of the distance metric, d. Many different distance metrics can be used; the most common are the L1 (Manhattan) and L2 (Euclidean) metrics:

$$d_{L1}(x, z) = \sum_{i} |x_i - z_i| \qquad d_{L2}(x, z) = \sqrt{\sum_{i} (x_i - z_i)^2}$$

A straightforward way to weight the neighbors is by their distance to the prediction point: points that are farther away should have less influence than nearer points. The most common scheme is to weight by the normalized inverse of the distance. We will implement this method in the next section.
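The following is a minimal NumPy sketch of this idea for regression, using normalized inverse-distance weights; the names, the synthetic data, and the small epsilon guarding against division by zero are assumptions for illustration only. Note that the TensorFlow code in the next section normalizes the (negative) distances returned by top_k directly rather than their inverses:

    import numpy as np

    def knn_regress(z, X_train, y_train, k=4):
        d = np.abs(X_train - z).sum(axis=1)    # L1 distance to every training point
        nn = np.argsort(d)[:k]                 # the k nearest neighbors
        w = 1.0 / (d[nn] + 1e-8)               # inverse-distance weights
        w /= w.sum()                           # normalize so the weights sum to 1
        return np.dot(w, y_train[nn])          # weighted average of the neighbors' targets

    X_train = np.random.rand(100, 3)
    y_train = X_train.sum(axis=1)              # a simple synthetic target
    print(knn_regress(np.array([0.5, 0.5, 0.5]), X_train, y_train))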

Implementing a kNN-Based Predictive Model

To illustrate how making predictions with the nearest neighbors works in TensorFlow, we will use the 1970s Boston housing dataset, which is available through the UCI ML repository at https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data. The following table shows the basic description of the dataset:

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
NOX: nitric oxide concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population of lower socioeconomic status
MEDV: median value of owner-occupied homes in $1,000s

Here, we will predict the median neighborhood house value, that is, the last column, named MEDV, as a function of several features. Since kNN treats the training set itself as the model, we will find the k nearest neighbors of each prediction point and take a weighted average of their target values. Let's get started:

  1. Loading required libraries and packages.

    As an entry point, we import necessary libraries and packages that will be needed to do predictive analytics using kNN with TensorFlow:

    import matplotlib.pyplot as plt
    import numpy as np 
    import random
    import os
    import tensorflow as tf
    import requests
    from tensorflow.python.framework import ops
    import warnings
  2. Resetting the default graph and disabling TensorFlow warnings.

    We need to reset the default TensorFlow graph using the reset_default_graph() function. We also suppress Python warnings and reduce TensorFlow's log verbosity (for example, the warnings you get on a machine without a GPU):

    warnings.filterwarnings("ignore")
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
    ops.reset_default_graph()
  3. Loading and preprocessing the dataset.

    First, we will load and parse the dataset using the get() function from the requests package as follows:

    housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
    housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    num_features = len(housing_header)
    housing_file = requests.get(housing_url)
    housing_data = [[float(x) for x in y.split(' ') if len(x) >= 1] for y in housing_file.text.split('\n') if len(y) >= 1]

    Note

    For more information on how the previous code works, please see the documentation of the requests package at http://docs.python-requests.org/en/master/user/quickstart/.
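    Alternatively, if pandas is installed (an optional dependency that is not required for the rest of this example), the same file can be loaded in a single call; the DataFrame name housing_df is just an illustrative choice:

    import pandas as pd
    housing_df = pd.read_csv(housing_url, delim_whitespace=True, header=None, names=housing_header)
    housing_data = housing_df.values.tolist()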

    Then, we will separate the features (predictors) from the labels:

    # The last column, MEDV, is the label
    y_vals = np.transpose([np.array([y[len(housing_header)-1] for y in housing_data])])
    # Note that the condition below is always true, so every column of housing_data is kept as a feature
    x_vals = np.array([[x for i,x in enumerate(y) if housing_header[i] in housing_header] for y in housing_data])

    Now, to get some idea of the features and labels, let's print them as follows:

    print(y_vals)
    >>>
    [[ 24. ]
    [ 21.6]
    [ 34.7]
    [ 33.4]
    [ 36.2]
    [ 28.7]
    [ 22.9]
    [ 27.1]
    [ 16.5]
    [ 18.9]
    [ 15. ]
    …]

    So the labels are fine to work with, and they are continuous values. Now, let's look at the features:

    print(x_vals)
    >>>
    [[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   3.96900000e+02
        4.98000000e+00   2.40000000e+01]
     [  2.73100000e-02   0.00000000e+00   7.07000000e+00 ...,   3.96900000e+02
        9.14000000e+00   2.16000000e+01]
     [  2.72900000e-02   0.00000000e+00   7.07000000e+00 ...,   3.92830000e+02
        4.03000000e+00   3.47000000e+01]
     ..., 
     [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   3.96900000e+02
        5.64000000e+00   2.39000000e+01]
     [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   3.93450000e+02
        6.48000000e+00   2.20000000e+01]
     [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   3.96900000e+02
        7.88000000e+00   1.19000000e+01]]

    As you can see, the features are on very different scales, which is a problem for a distance-based model. We therefore apply min-max scaling, which scales and translates each feature individually so that it lies in a given range, here between zero and one. Since kNN relies entirely on distances between feature vectors, this step deserves special care. The following line of code does the min-max scaling:

    x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)
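    Equivalently, if scikit-learn is available (an optional alternative, not required for the rest of this example), the same [0, 1] scaling can be written with MinMaxScaler:

    from sklearn.preprocessing import MinMaxScaler
    x_vals = MinMaxScaler(feature_range=(0, 1)).fit_transform(x_vals)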

    Now, let's print the features again to see what has changed:

    print(x_vals)
    >>>
    [[  0.00000000e+00   1.80000000e-01   6.78152493e-02 ...,   1.00000000e+00
        8.96799117e-02   4.22222222e-01]
     [  2.35922539e-04   0.00000000e+00   2.42302053e-01 ...,   1.00000000e+00
        2.04470199e-01   3.68888889e-01]
     [  2.35697744e-04   0.00000000e+00   2.42302053e-01 ...,   9.89737254e-01
        6.34657837e-02   6.60000000e-01]
     ..., 
     [  6.11892474e-04   0.00000000e+00   4.20454545e-01 ...,   1.00000000e+00
        1.07891832e-01   4.20000000e-01]
     [  1.16072990e-03   0.00000000e+00   4.20454545e-01 ...,   9.91300620e-01
        1.31070640e-01   3.77777778e-01]
     [  4.61841693e-04   0.00000000e+00   4.20454545e-01 ...,   1.00000000e+00
        1.69701987e-01   1.53333333e-01]]
  4. Preparing the training and test set.

    Now that the features are scaled, it's time to split the data into training and test sets. We create the training set by selecting about 75% of the rows at random and leave the remaining 25% for the test set:

    train_indices = np.random.choice(len(x_vals), int(len(x_vals)*0.75), replace=False)
    test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
    x_vals_train = x_vals[train_indices]
    x_vals_test = x_vals[test_indices]
    y_vals_train = y_vals[train_indices]
    y_vals_test = y_vals[test_indices]
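    If you prefer, the same 75/25 split can be produced with scikit-learn's train_test_split (an optional alternative; the outputs shown later in this section were produced with the NumPy split above):

    from sklearn.model_selection import train_test_split
    x_vals_train, x_vals_test, y_vals_train, y_vals_test = train_test_split(x_vals, y_vals, test_size=0.25)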
  5. Preparing the placeholders for the tensors.

    First, we will declare the batch size. Here, we set it equal to the number of examples in the test set, so all test points are processed in a single batch:

    batch_size=len(x_vals_test)

    Then, we need to declare the placeholders for the TensorFlow tensors, as follows:

    x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
    x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
    y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
    y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)
  6. Defining the distance metrics.

    For this example, we are going to use the L1 distance, since the L2 distance did not give better results in my experiments (a possible L2 variant is sketched right after the following line):

    distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2)
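    If you would like to experiment with the L2 (Euclidean) distance instead, one possible formulation is the following sketch; it is not used in the remainder of this example:

    # Hypothetical L2 alternative, not used below
    distance_l2 = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), axis=2))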
  7. Implementing kNN.

    Now, it's time to implement kNN. We will find the nearest neighbors by locating the minimum-distance indices; the kNN() function below does the trick. There are several steps, as follows:

    1. Get the indices of the minimum distances.
    2. Compute the prediction. To do this, we use the tf.nn.top_k() function, which returns the values and indices of the largest entries in a tensor. Since we want the indices of the smallest distances, we instead find the k largest negative distances (a small illustration of this trick follows this list). Because we are predicting continuous values, that is, a regression task, we also declare the prediction and the Mean Squared Error (MSE) against the target values.
    3. Calculate the number of loops over the test data.
    4. Initialize the global variables.
    5. Iterate the training over the number of loops calculated in step 3.
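    To see why negating the distances works, here is a tiny standalone illustration; the constant tensor d holds made-up distance values:

    import tensorflow as tf
    d = tf.constant([[4.0, 1.0, 3.0, 2.0]])       # distances from one test point to four training points
    vals, idx = tf.nn.top_k(tf.negative(d), k=2)  # the 2 largest negative distances are the 2 smallest distances
    with tf.Session() as sess:
        print(sess.run(idx))    # [[1 3]], the indices of the two nearest training points
        print(sess.run(vals))   # [[-1. -2.]]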

    Now, here is the kNN() function. It takes the number of neighbors, k, as a parameter; although a fixed k is the usual convention, here we make it a parameter so that we can tune it later, as shown in the following code:

    def kNN(k):
        # Indices and (negative) distances of the k nearest training points for each test point
        topK_X, topK_indices = tf.nn.top_k(tf.negative(distance), k=k)
        # Normalize the negative distances so that each row of weights sums to 1
        x_sums = tf.expand_dims(tf.reduce_sum(topK_X, 1), 1)
        x_sums_repeated = tf.matmul(x_sums, tf.ones([1, k], tf.float32))
        x_val_weights = tf.expand_dims(tf.div(topK_X, x_sums_repeated), 1)
        # Gather the neighbors' target values and compute the weighted average prediction
        topK_Y = tf.gather(y_target_train, topK_indices)
        prediction = tf.squeeze(tf.matmul(x_val_weights, topK_Y), axis=[1])
        # Mean Squared Error of the predictions against the test targets
        mse = tf.div(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)
        # Number of batches needed to cover the test set
        num_loops = int(np.ceil(len(x_vals_test)/batch_size))
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            for i in range(num_loops):
                min_index = i*batch_size
                max_index = min((i+1)*batch_size, len(x_vals_test))
                x_batch = x_vals_test[min_index:max_index]
                y_batch = y_vals_test[min_index:max_index]
                predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch, y_target_train: y_vals_train, y_target_test: y_batch})
                batch_mse = sess.run(mse, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch, y_target_train: y_vals_train, y_target_test: y_batch})
        return batch_mse
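    Before tuning k, you can sanity-check the function with a single call; the value 4 here is arbitrary, and the printed mse will differ from run to run because the train/test split is random:

    print(kNN(4))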
  8. Evaluating the classification/regression.

    Note that this function does not automatically return the optimal, that is, the lowest, mse value; the result varies with k, so k is a hyperparameter to be tuned. One simple technique is to run the method for k = 2 up to 10 and keep track of the k value for which kNN() produces the lowest mse. First, we define a method that iterates over this range and returns two lists, one for the mse values and one for the corresponding k values:

    mse_list = []
    k_list = []
    def getOptimalMSE_K():
        # Try k = 2, ..., 10 and record the test MSE for each value
        for k in range(2, 11):
            mse = kNN(k)
            mse_list.append(mse)
            k_list.append(k)
        return k_list, mse_list

    Now, it's time to invoke the previous method and find the optimal k value, that is, the one for which kNN() produces the lowest mse. Upon receiving the two lists, we create a dictionary and use min() with my_dict.get as the key function to return the optimal k value, as follows:

    k_list, mse_list  = getOptimalMSE_K()
    dict_list = zip(k_list, mse_list)
    my_dict = dict(dict_list)
    print(my_dict)
    optimal_k = min(my_dict, key=my_dict.get)
    >>>
    {2: 7.6624126, 3: 10.184645, 4: 8.9112329, 5: 11.29573, 6: 13.341181, 7: 14.406253, 8: 13.923589, 9: 14.915736, 10: 13.920851}

    Now, let's print the optimal k value, that is, the one that gives the lowest mse value:

    print("Optimal K value: ", optimal_k)
    mse = min(mse_list)
    print("Minimum mean square error: ", mse)
    >>>
    Optimal K value:  2
    Minimum mean square error:  7.66241
  9. Running the best kNN.

    Now that we have the optimal k, we will compute the nearest-neighbor predictions one more time. This time, the function returns both the predicted and the actual labels:

    def bestKNN(k):
        # Same weighting scheme as in kNN(), but this time we return the predictions themselves
        topK_X, topK_indices = tf.nn.top_k(tf.negative(distance), k=k)
        x_sums = tf.expand_dims(tf.reduce_sum(topK_X, 1), 1)
        x_sums_repeated = tf.matmul(x_sums, tf.ones([1, k], tf.float32))
        x_val_weights = tf.expand_dims(tf.div(topK_X, x_sums_repeated), 1)
        topK_Y = tf.gather(y_target_train, topK_indices)
        prediction = tf.squeeze(tf.matmul(x_val_weights, topK_Y), axis=[1])
        num_loops = int(np.ceil(len(x_vals_test)/batch_size))
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            for i in range(num_loops):
                min_index = i*batch_size
                max_index = min((i+1)*batch_size, len(x_vals_test))
                x_batch = x_vals_test[min_index:max_index]
                y_batch = y_vals_test[min_index:max_index]
                # Predicted labels for the current batch of test points
                predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch, y_target_train: y_vals_train, y_target_test: y_batch})
        return predictions, y_batch
  10. Evaluating the best kNN.

    Now, we will invoke the bestKNN() method with the optimal value of k that was calculated in the previous step, as follows:

    predicted_labels, actual_labels = bestKNN(optimal_k)

    Now, I would like to measure the prediction accuracy. Are you wondering why? You're right to: there is no strong reason for calculating accuracy or precision here, since we are predicting continuous labels. Even so, let's see how it behaves:

    def getAccuracy(testSet, predictions):
        # Count a prediction as "correct" if it matches the actual value after rounding
        correct = 0
        for x in range(len(testSet)):
            if np.round(testSet[x]) == np.round(predictions[x]):
                correct += 1
        return (correct / float(len(testSet))) * 100.0
    accuracy = getAccuracy(actual_labels, predicted_labels)
    print('Accuracy: ' + repr(accuracy) + '%')
    >>>
    Accuracy: 17.322834645669293%

    The previous getAccuracy() method computes the accuracy, which is quite low. This is to be expected: rounded continuous predictions rarely match the actual values exactly, so accuracy is not an informative metric for this task. However, if you are predicting discrete values, this method will be useful; try it yourself with suitable data and the previous code.
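    For a regression task like this one, error metrics expressed in the original units are more informative than accuracy. As a small additional check (not part of the original example), you can compute the mean absolute error and the root mean squared error, both in $1,000s, from the predicted_labels and actual_labels arrays returned earlier:

    mae = np.mean(np.abs(predicted_labels - actual_labels))
    rmse = np.sqrt(np.mean(np.square(predicted_labels - actual_labels)))
    print('MAE: %.3f, RMSE: %.3f' % (mae, rmse))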

    But don't be disappointed; we have another way of looking at how our predictive model performs. We can plot a histogram of the predicted versus actual labels, that is, the predicted and actual distributions:

    bins = np.linspace(5, 50, 45)
    plt.hist(predicted_labels, bins, alpha=1.0, facecolor='red', label='Prediction')
    plt.hist(actual_labels, bins, alpha=1.0, facecolor='green', label='Actual')
    plt.title('predicted vs actual values')
    plt.xlabel('Median house price in $1,000s')
    plt.ylabel('count')
    plt.legend(loc='upper right')
    plt.show()
    >>>

    Figure 16: Predicted versus actual median prices of the houses in $1,000s
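    As an optional extra view (not part of the original figure), you can also draw a scatter plot of predicted versus actual prices; points on the dashed diagonal would be perfect predictions:

    plt.scatter(actual_labels, predicted_labels, alpha=0.6)
    plt.plot([5, 50], [5, 50], 'k--', label='Perfect prediction')
    plt.xlabel('Actual median house price in $1,000s')
    plt.ylabel('Predicted median house price in $1,000s')
    plt.legend(loc='upper left')
    plt.show()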
