k-nearest neighbors

KNN is both interpretable and fast for small and medium datasets (for large ones, there is a scalable modification: approximate KNN). It also has an important property: similar to k-means, it works on distances, and therefore captures interactions between features, which many other algorithms can't do.

The logic behind KNN is very simple: for each record it predicts, it finds the k records in the training set (neighbors, hence the name) that are most similar to it, that is, closest in the feature space, and infers the prediction from them. The algorithm can be used both for classification (in which case the most frequent class among the neighbors is taken) and for regression (where the prediction is a weighted average of the neighbors' values). The following sections explain these in detail.
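
To make this concrete, here is a minimal, self-contained sketch of the classification variant (not the sklearn implementation we'll use below), with made-up numbers and Euclidean distance; for regression, the last line would instead average the neighbors' values:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new point to every training record
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # positions of the k closest training records
    nearest = np.argsort(distances)[:k]
    # classification: take the most frequent class among the neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 9.0], [7.5, 8.5], [8.2, 9.1]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([7.9, 9.0]), k=3))  # prints 1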

Of course, KNN has its drawbacks as well:

  • It can't extrapolate beyond the training set, and, similar to k-means, it depends on the unit scales of the features.
  • It also has to store all the training data in the model object itself, so it won't be a great choice (at least in its basic version) for big datasets.
  • The algorithm cannot ignore features or treat them as less important, so adding uninformative features to the dataset may decrease performance.
  • KNN also has somewhat limited value for interpretation: it can return the neighbors, but it won't reveal any meaningful trends, nor estimate the relationship between the target value and particular features.

Let's try to use KNN to predict the outcome of each battle, using the same dataset we prepared for the clustering problem (we'll discuss scaling in step 3). Here is how it is done:

  1. Before we run the model, we first need to convert the result variable into a numeric feature as well. As some battles have no clear victor, we'll have three outcomes: -1 for an axis victory, 1 for an allies victory, and 0 for a tie. The following code replaces axis with -1 and allies with 1, and fills all null values with 0. As a sanity check, we then count the occurrences of each unique value in the column:
>>> data['result_num'] = data['result'].map({'axis':-1, 'allies':1}).fillna(0)  # 0 for tie
>>> data['result_num'].value_counts()

-1.0    93
 1.0    34
 0.0     6
Name: result_num, dtype: int64
  2. Next, we need to split the data into two parts, training and testing, as we'll need to measure the quality of the predictions. Luckily, there is a sklearn helper function for that as well. All we need to do is specify a test size (as a fraction) and random_state, to make sure that any change in the metrics is not attributed to a change in the data split:
from sklearn.model_selection import train_test_split

# drop battles with missing values in any of the feature columns
mask = data[cols].isnull().any(axis=1)
X = data.loc[~mask, cols]
y = data.loc[~mask, 'result_num']

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=2019)
  3. Whether to scale the data or not depends on the case. Scaling ensures that all features contribute equally to the distance computation; this may be important for performance, but it undermines the interpretability of the model and, especially for KNN, may decrease accuracy, since KNN cannot ignore features. In this specific case, accuracy for the unscaled dataset is better, perhaps because the points are further away from each other; a comparison sketch follows after the next step's code.
  4. The fit method, in this case, essentially memorizes the training dataset. The following code initializes the model, fits it to the training set, and predicts values for the test set:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5) # again, arbitrary number
model.fit(Xtrain, ytrain)
y_pred = model.predict(Xtest)
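
As promised in step 3, here is a minimal sketch of how one might compare the scaled and unscaled variants; it assumes sklearn's StandardScaler and make_pipeline utilities, and uses the estimator's score method, which reports accuracy for classifiers (discussed next):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# unscaled features: the model we just fitted above
print('unscaled:', model.score(Xtest, ytest))

# scaled features: standardize inside a pipeline, so the test set
# is transformed with statistics learned from the training set only
scaled_model = make_pipeline(StandardScaler(),
                             KNeighborsClassifier(n_neighbors=5))
scaled_model.fit(Xtrain, ytrain)
print('scaled:  ', scaled_model.score(Xtest, ytest))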

But how do we measure the performance of the model? For a classification model, it seems logical to start with accuracy: the ratio of correct predictions to the total number of predictions. This metric, like many others, is built into sklearn. The following code uses it to measure our model's performance:

>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(ytest, y_pred)
0.5

The result is, of course, not very impressive: we didn't perform better than a coin flip. The round number, however, gives us a hint: as we dropped all data points with missing values, our testing sample is very small. With this sample size, metrics are very volatile and hard to interpret. Still, this model can be used to further explore our dataset, as KNN can return the neighboring records in addition to the estimate.
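
To see just how small (and imbalanced) the held-out sample is, we can inspect it directly; this quick check is an addition here, and the exact counts depend on the random split:

print(len(ytest))            # number of held-out battles after dropping missing values
print(ytest.value_counts())  # class balance in the test set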

In the following code and output, we retrieve the five nearest neighbors of the first record in the test dataset, using the model's kneighbors method:

>>> Xtest.head(1)
allies_infantry axis_infantry allies_tanks axis_tanks allies_guns axis_guns
106 378000.0 100000.0 1000.0 350.0 3241.0 2000.0

>>> Xtrain.iloc[model.kneighbors(Xtest.head(1))[1][0]]

allies_infantry axis_infantry allies_tanks axis_tanks allies_guns axis_guns
55 1173500.0 1040000.0 894.0 950.0 13451.0 3000.0
66 1286000.0 300700.0 2409.0 625.0 26379.0 5500.0
98 1002200.0 900000.0 1979.0 900.0 11265.0 6300.0
126 1171800.0 270000.0 1600.0 772.0 5425.0 434.0
73 822000.0 500000.0 550.0 146.0 4600.0 2389.0

The preceding output shows the five training records most similar to our test data point. That test record, with index 106, represents the Battle of the Dukla Pass in the Carpathian Mountains. Its five most similar neighbors are Operation Uranus, Operation Kutuzov, the Lvov–Sandomierz Offensive, the Vienna Offensive, and the Leningrad–Novgorod Offensive, where similarity is measured solely in terms of troop numbers. They all share an allied advantage in the number of troops, and they were all fought in the second half of the war by the Soviets. The model didn't know any of that, but it is hidden in the values: only the Soviets on that front had armies of that scale, and only in the second part of the war. This is not hard to guess, but it is still useful for understanding the data.
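
If the dataset keeps the battle titles in a column (assumed here to be called name, which may differ in your copy of the data), the neighbor indices can be mapped back to those titles; a sketch under that assumption:

# hypothetical: assumes the frame has a 'name' column with battle titles
distances, idx = model.kneighbors(Xtest.head(1))
neighbor_labels = Xtrain.index[idx[0]]   # positional indices -> original row labels
print(data.loc[neighbor_labels, 'name'])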
