Linear regression

Finally, let's review yet another model—arguably, the most established and popular around the world. Linear regression is actually a statistical model with a long history. The idea behind linear regression is as follows.

Assuming that the variables have linear relationships, that the independent variables are not correlated with each other, and that there is some variance in the features, we can estimate the linear relationship between the independent variables and the target. As a result, our model will be represented as a set of coefficients, one per feature (independent variable), plus one for the bias:
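In standard notation, this relationship can be written as follows:

$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \varepsilon_i$$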

Here, i stands for the record index and p for the index of the feature. Epsilon represents an error that the model couldn't explain. This way, to calculate our estimate of y for the record (row) i, we simply need to multiply each feature in the row by the corresponding coefficient and add them up together with the bias (beta zero). All coefficients are universal and calculated beforehand, during the model training. Take a look at the following diagram:

The preceding is a scatterplot of the allies' casualties plotted against the number of tanks the allies had in each battle. The (red) line here represents a linear model; it is defined by a bias (its Y coordinate at X=0) and a slope, the coefficient for our feature, the number of tanks. As you can see, the correlation here is positive and the trend is upward, which is not surprising: more tanks means larger armies on both sides, and hence larger overall casualties. Note the three outlier records far beyond all the others; they significantly impact the model.

Linear models have distinctive properties, in particular:

  • Linear models are easy to interpret. Essentially, they define an interpretable coefficient of impact for each feature. Say we predict apartment prices: the model will attach a dollar value to every square foot, which is the average price of one square foot.
  • They don't require feature scaling.
  • Assuming non-collinearity (that the features are not correlated with each other), the coefficients are independent of the other features; for example, the price of a square foot will be estimated, on average, independently of the location.
  • Linear models are easy to train and very easy to infer (even manually), as inference boils down to the simple multiplication and addition of a few numbers (see the sketch after this list).
  • In contrast to KNN, linear regression generalizes to the extreme: all you get is a set of coefficients, one per feature (plus the bias constant), each answering what the average impact of that feature on the target variable is. This is useful and easy to digest, but there is no way to go deeper than that.
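To illustrate how simple manual inference is, here is a minimal sketch for the apartment-price example; the coefficient values below are purely hypothetical:

# Hypothetical coefficients from a trained apartment-price model:
# price = bias + coef_sqft * area + coef_rooms * rooms
bias = 50_000        # base price, in dollars
coef_sqft = 300      # average dollars per square foot
coef_rooms = 10_000  # average dollars per room

area, rooms = 850, 3
predicted_price = bias + coef_sqft * area + coef_rooms * rooms
print(predicted_price)  # 335000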

At the same time, due to its linear nature, this algorithm can't account for complex nuances in the data and usually performs poorly as a prediction model. It is also not robust to outliers: if there are outliers in the dataset, it is a good idea to drop them before training the model.
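For example, one simple way to do this is to filter on a quantile of the target column before training; the following is a minimal sketch, where the target column and the 99th-percentile cutoff are illustrative assumptions:

# Keep only rows where the target is below its 99th percentile
threshold = data['allies killed'].quantile(0.99)
data_trimmed = data[data['allies killed'] <= threshold]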

Let's try building a linear model on our dataset. Here, we need to predict a continuous value, so let's try to predict the number of casualties for the allies:

  1. First, let's prepare the dataset:
cols = [
    'allies_infantry', 'axis_infantry',
    'allies_tanks', 'axis_tanks',
    'allies_guns', 'axis_guns',
    'start_num'
]

# Mark rows with missing values in any of the features or the target
mask = data[cols + ['allies killed']].isnull().any(axis=1)
  2. Now, we can split the features and prepare training and testing sets:
from sklearn.model_selection import train_test_split

# Keep only complete rows; separate the target from the features
y = data.loc[~mask, 'allies killed']
X = data.loc[~mask, cols]

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=2019)
  3. Finally, we can train the model and see how it performs. In the following code, we initialize the linear regression model and train it. Lastly, we predict values for the test set and store them in the ypred variable:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error

model = LinearRegression()
model.fit(Xtrain, ytrain)

ypred = model.predict(Xtest)

But how should we measure the performance of the model? The model itself is usually fit by minimizing the mean of the squared errors, but for interpretation, we can use the median (or mean) absolute error, as it preserves the units; in our case, the number of casualties. Indeed, our model does not perform extremely well: it has a median error of 42,584 people, as shown here:

>>> median_absolute_error(ytest, ypred)
42584.419274116095

As the test dataset is very small, we can print the errors and check them manually.

Here, we're calculating the errors, as follows: 

>>> (ypred - ytest)

111    4.934710e+04
27    -3.582174e+04
42     2.148667e+04
106   -1.191980e+03
54    -1.007381e+06
49    -9.226890e+05
Name: allies killed, dtype: float64

Here, you can see the difference between the predicted and actual values of y; all but two cases underestimate the real number of casualties. Well, we didn't expect it to be perfect; despite the large errors, this linear model can give us a bird's-eye view of the trends, something that is easy to digest and discuss. Let's take a look at the coefficients representing the impact of each feature on casualties:
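To view these coefficients as a table, here is a minimal sketch using pandas and the coef_ and intercept_ attributes of the fitted model (the exact values depend on your data):

import pandas as pd

# Pair each coefficient with its feature name; intercept_ is the bias term
coefs = pd.Series(model.coef_, index=cols)
print(coefs)
print('bias:', model.intercept_)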

In this table, each coefficient represents the average number of deaths associated with one unit of the corresponding feature. For example, we can see that each allied tank is associated with roughly 25 fewer casualties (a coefficient of about -25). If that were a causal relationship, this figure would be an actionable point for generals and governments to consider. However, correlation is not causation! If you look further down the table, each axis tank decreases the number of casualties as well. Isn't it supposed to be the other way around? It is unclear, but perhaps we can guess the answer: larger tank armies could mean battles outside of the cities, and we can imagine that those usually have far less infantry involved, and hence fewer casualties. One trend that is worth discussing is that the number of casualties decreases over time.
