Exploring unsupervised learning

First, let's try clustering our data. Clustering is a type of unsupervised learning whose goal is to group records based solely on their features. It is often used to get a better understanding of the data before building a supervised model, or as part of exploratory analysis, but it can also be used on its own; one common task is defining the target audience for a service or product. In our case, clustering should reveal similarities between the battles across the dataset.

This task may seem simple for 1- or 2-dimensional (one- or two-column) datasets—indeed, our eyes and brains are splendid at finding clusters visually. It becomes a near-impossible task for a human, however, once the number of dimensions grows beyond three. To automate the process, we will use the k-means clustering algorithm—simple, performant, and easy to interpret and debug.

k-means is one of the most popular algorithms for the task, mainly due to its fast performance and its small set of hyperparameters—external parameters of the model that have to be chosen outside of the training process itself. The main drawbacks of this method are its inability to catch complex shapes (k-means only supports convex and isotropic clusters) and the fact that the number of clusters has to be predefined. This necessity to specify the number of clusters can be both a curse and a blessing: there are methods to find the best number of clusters (for example, the elbow method), or there can be an obvious, business-driven need for a specific number of clusters.
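If you ever need to pick the number of clusters empirically, the elbow method boils down to fitting the model for a range of values of k and watching how inertia (the within-cluster sum of squared distances) drops. Here is a minimal sketch, assuming X is a numeric feature matrix such as the one we prepare below:

from sklearn.cluster import KMeans

inertias = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=2019).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

# plot k against inertia and look for the "elbow", the point after which
# adding more clusters stops paying off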

Before we run the model, we'll need to load and prepare the dataset:

  1. First of all, let's think about which features can and should be used here. For our first attempt, let's use the raw number of soldiers, tanks, and guns on each side:
cols = [
    'allies_infantry', 'axis_infantry',
    'allies_tanks', 'axis_tanks',
    'allies_guns', 'axis_guns'
]

This choice is arbitrary but will have a direct impact on the outcome, as we'll soon see.

We didn't use any time-specific or belligerent-specific values, as those would just group records together based on their place in history, which is something we already know.
  2. As with most ML models, k-means does not itself support empty cells and can only run on numeric values. There are multiple ways to resolve both issues, depending on the specifics of the goal and other considerations. All of the features we picked are numeric already, but we'll have to take care of the missing values. For now, we'll take only records with existing infantry numbers and fill the empty cells in other columns with zeros (an alternative approach is sketched right after the code):
mask = data[['allies_infantry', 'axis_infantry']].notnull().all(1)
data_kmeans = data.loc[mask, cols].fillna(0)
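Filling the gaps with zeros is just one option. As an alternative (a sketch we won't use here), sklearn's SimpleImputer can fill missing values with, say, each column's median:

from sklearn.impute import SimpleImputer

# hypothetical alternative: impute missing tank/gun counts with column medians
imputer = SimpleImputer(strategy='median')
data_imputed = imputer.fit_transform(data.loc[mask, cols])  # returns a numpy array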
  3. Finally, we can run the clustering on this dataset.
  4. Let's cluster our data into five groups. It is, again, an arbitrary number. There are methods to define the best number of clusters in terms of particular metrics (for example, inertia, as in the elbow-method sketch earlier), but we won't do that here for the sake of simplicity. We also set a random_state seed for reproducibility—k-means is robust, but not deterministic, and can randomly swap cluster numbers:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=5, random_state=2019)
  5. After this, the algorithm is ready to spit out the labels. The following code does exactly that by running the fit_predict method on our data. Labels are just integers representing each group, starting with zero. For visualization purposes (so that there won't be a Cluster 0), we add 1 to each:
>>> labels = model.fit_predict(data_kmeans) + 1

>>> print(labels)
[1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 1 1 4 4 1 4 1 4 1 1 1 4 5 1 3 5 4 2 3 1 1 1 1 3 4 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 4 3 1 1 1 1 1 1 3 3 4 1 1 2 1]
  6. Let's now take a look at the result by visualizing the dataset with the new column:
data_kmeans['label'] = ('Cluster ' + pd.Series(labels).astype(str)).values
data_kmeans[['name', 'result', 'start']] = data.loc[mask, ['name', 'result', 'start']]

c = alt.Chart(data_kmeans).mark_point().encode(
    color=alt.Color('label:N', legend=alt.Legend(title='Cluster')),
    x='allies_infantry', y='axis_infantry', shape='result',
    tooltip=data_kmeans.columns.tolist()
).interactive()

c
  7. And here is the outcome. As you can see, there is a somewhat distinctive pattern: clusters tend to be grouped along both the x and y axes, as if only those two features were used:

Why is that so? To answer this question, let's first talk about how the algorithm works. It boils down to a few simple steps, illustrated with a small code sketch after the list:

  1. k centroids are generated randomly in the feature space (in other words, we generate k random points with the same features and within the same ranges as the dataset).
  2. For each of those centroids, the Euclidean distance to every data point in the dataset is calculated (theoretically, k-means can use other distance metrics as well, but that is quite rare).
  3. Each data point is then assigned to its closest centroid. For each resulting group, the mean of its points is computed, and the centroid moves there.
  4. From there, the cycle repeats: points are reassigned, group means are recomputed, and centroids are moved. This happens over and over again until the centroids stop moving.
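
To make those steps concrete, here is a minimal, unoptimized sketch of that loop in pure numpy—a toy illustration for intuition, not how sklearn actually implements it. It assumes X is a 2-dimensional numpy array of features and, for simplicity, initializes the centroids from randomly chosen data points:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=2019):
    rng = np.random.RandomState(seed)
    # step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # step 2: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # step 3: assign every point to its closest centroid
        labels = distances.argmin(axis=1)
        # step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels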

As a consequence of that approach, the model always operates in Euclidean space—that is, units of all features are treated as equal. At the same time, we obviously have thousands of soldiers but only dozens of tanks or guns in our dataset. Therefore, the infantry features are treated as far more important by definition.

One way to make the model pay more attention to tanks, guns, or any other feature is to standardize them: for each feature, we subtract its mean and divide the result by its standard deviation. That way, all features will be spread comparably around zero. In fact, sklearn has built-in functionality for that task.
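
Under the hood, the transformation is simple; roughly, for a numeric dataframe df, it is equivalent to the following one-liner (sklearn's scale divides by the population standard deviation, hence ddof=0):

# roughly what sklearn.preprocessing.scale does, for a numeric dataframe df
df_scaled = (df - df.mean()) / df.std(ddof=0)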

In the following code, we're using sklearn's scale function to scale multiple columns at once. The function may give you a warning if some of the columns are integers—it will convert them into floats as part of the scaling process. It also returns a numpy array rather than a dataframe, but that's okay in this case:

from sklearn.preprocessing import scale
data_to_scale = data_kmeans.drop(['label', 'name', 'start', 'result'], axis=1)
data_scaled = scale(data_to_scale)

labels_scaled = model.fit_predict(data_scaled) + 1
data_kmeans['label 2'] = ('Cluster ' + pd.Series(labels_scaled).astype(str)).values
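
Whether and how much the assignments move after scaling is easy to quantify; one quick option (a small sketch, with pandas already imported as pd) is to cross-tabulate the old and new label columns:

# rows: original clusters, columns: clusters after scaling
pd.crosstab(data_kmeans['label'], data_kmeans['label 2'])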

But does the scaling change the overall picture? Let's see. Instead of building a new chart, we can point the existing one at the updated dataframe and swap in the new label column:

c.data = data_kmeans
c.encode(color=alt.Color('label 2:N', legend=alt.Legend(title='Cluster')))

This time, the clusters are mixed along the infantry axes—clearly, infantry numbers are not the only features in play. Here is what the new clustering looks like:

But does it offer any insights? We'd argue that it does. Combined with the interactivity Altair gives us, the clusters help to highlight some internal similarities. For example, Cluster 1 clearly represents battles with small numbers on both sides. Cluster 2 represents battles with a considerably larger number of infantry on the allied side. Cluster 3 groups together battles where the allies have a lot of tanks and/or guns. Cluster 4 represents battles with few to no vehicles reported—including the battle for Voronezh and the Prague offensive; in both cases, it is clear that the number of tanks simply wasn't reported, given the sheer scale of the operations. Finally, Cluster 5 seems to represent battles with a large number of axis tanks.
