Descriptive statistics

Now that we have a dataframe in place, let's answer some simple questions; for example, which battles took the most lives on both sides? To answer that, we need to sum two columns into a new one, sort the dataframe by that result, from larger to smaller, and print out the first N records. Let's do it:

>>> kill_cols = ['allies killed', 'axis killed']
>>> data['killed total'] = data[kill_cols].sum(axis=1)
>>> data['killed total'].sort_values(ascending=False).head(3)

name
Battle of Stalingrad     1997993.0
Battle of Moscow         1203428.0
Battle of Kiev (1941)     661958.0
Name: killed total, dtype: float64
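The same pattern can be sketched on a tiny hypothetical dataframe (the battle names and numbers here are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature dataset mimicking the structure of `data`
df = pd.DataFrame(
    {"allies killed": [100, 250, 50], "axis killed": [80, 300, 10]},
    index=["Battle A", "Battle B", "Battle C"],
)

kill_cols = ["allies killed", "axis killed"]
df["killed total"] = df[kill_cols].sum(axis=1)  # row-wise sum

# largest totals first, top two records
top = df["killed total"].sort_values(ascending=False).head(2)
print(top)
```

Note that `axis=1` makes the sum run horizontally across columns, producing one value per row; the default, `axis=0`, would instead sum each column down all rows.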

The next question might be about the typical number of casualties for each battle. Before we calculate the statistics, we have to filter out rows with unknown (NaN) or zero values; in both cases, the records shouldn't be included. Here, we'll use the pipe operator, |, as a vectorized equivalent of or in ordinary Python.

Consider the following code. We're assigning a mask as a logical OR of two Boolean arrays (hence the pipe). The first array checks whether any of the values in the kill_cols columns we assigned previously is null. As there are multiple columns, the result will be in the form of a 2-dimensional array. To convert it into a 1-dimensional array, we further use the any method, passing axis=1 to identify the horizontal direction of the operation; in other words, the result will tell us, for each row, whether any value in this row is True.

The second operation (after the pipe) works similarly, but, instead, we check whether the value is zero. As a result, the mask variable will be a 1-dimensional array with a Boolean value for each row in the original dataset. Those values tell us whether, in each row, the value in any of the kill columns is null or equal to zero:

mask = data[kill_cols].isnull().any(axis=1) | (data[kill_cols] == 0).any(axis=1)
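To see exactly what this expression produces, here is a minimal sketch on a hypothetical four-row frame (the values are made up to cover all the cases):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: clean, NaN, zero-in-one-column, zero-in-the-other
df = pd.DataFrame({
    "allies killed": [100, np.nan, 0, 40],
    "axis killed": [80, 300, 10, 0],
})
kill_cols = ["allies killed", "axis killed"]

# True for any row where a kill column is NaN or zero
mask = df[kill_cols].isnull().any(axis=1) | (df[kill_cols] == 0).any(axis=1)
print(mask.tolist())  # [False, True, True, True]
```

Only the first row survives the check: every other row has a NaN or a zero in at least one of the two columns.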

Next, we need to drop the rows for which mask is True and keep the rest. For that, we need to invert our mask using the tilde symbol; similar to the pipe, the tilde (~) works as a vectorized not (akin to the exclamation mark in some other languages). In other words, ~mask is True exactly for the rows that have neither zero nor null values in the kill columns.

Using this inverted mask, we filter the dataset and compute the median of the killed total values:

>>> data.loc[~mask, 'killed total'].median()
37316.0
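The inversion-and-filter step can be illustrated end to end on the same kind of hypothetical toy frame (rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# Hypothetical data; only rows 0 and 3 have clean, non-zero kill counts
df = pd.DataFrame({
    "allies killed": [100, np.nan, 0, 40],
    "axis killed": [80, 300, 10, 60],
})
kill_cols = ["allies killed", "axis killed"]
df["killed total"] = df[kill_cols].sum(axis=1)

mask = df[kill_cols].isnull().any(axis=1) | (df[kill_cols] == 0).any(axis=1)

# ~mask keeps rows with no NaNs and no zeros in the kill columns
clean_total = df.loc[~mask, "killed total"]
print(clean_total.median())  # median of 180.0 and 100.0 -> 140.0
```

Rows 1 and 2 are excluded, so the median is taken over the totals 180.0 and 100.0 only.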

This mask can now be reused on many occasions. As a final example, let's compute the main statistics for the tank columns of both sides, using the describe method (here, ~mask serves as a proxy for good records with meaningful values). Many battles have casualties but no tanks lost, or none reported lost, and this seems fine:

>>> data.loc[~mask, ['allies tanks', 'axis tanks']].describe()
allies tanks axis tanks
count 79.000000 79.000000
mean 352.683544 65.911392
std 897.692848 235.066831
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 254.000000 18.000000
max 4799.000000 1500.000000

The values are interesting; as you can see, most battles don't have tank losses on either side. On average, though, the allies lost more than five times as many tanks as the axis. Furthermore, in 75% of battles, the axis lost 18 or fewer tanks, but the allies lost up to 254—an even larger ratio!

Our analysis is getting more complex. It is hard to manually read and understand more than a dozen numbers at once. To make sense of larger sets, we need to start visualizing our dataset with charts.
