Descriptive statistics

Now that we have a dataframe in place, let's answer some simple questions; for example, which battles took the most lives on both sides? To answer that, we need to sum two columns into a new one, sort the dataframe by that result, from larger to smaller, and print out the first N records. Let's do it:

>>> kill_cols = ['allies killed', 'axis killed']
>>> data['killed total'] = data[kill_cols].sum(axis=1)
>>> data['killed total'].sort_values(ascending=False).head(3)

name
Battle of Stalingrad     1997993.0
Battle of Moscow         1203428.0
Battle of Kiev (1941)     661958.0
Name: killed total, dtype: float64
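The same pattern can be sketched on a tiny hypothetical dataframe (the battle names and numbers here are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature dataset mimicking the structure of `data`
df = pd.DataFrame(
    {"allies killed": [100, 250, 50], "axis killed": [80, 300, 10]},
    index=["Battle A", "Battle B", "Battle C"],
)

kill_cols = ["allies killed", "axis killed"]
df["killed total"] = df[kill_cols].sum(axis=1)  # row-wise sum

# largest totals first, top two records
top = df["killed total"].sort_values(ascending=False).head(2)
print(top)
```

Note that `axis=1` makes the sum run horizontally across columns, producing one value per row; the default, `axis=0`, would instead sum each column down all rows.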

The next question might be about the typical number of casualties for each battle. Before we calculate the statistics, we have to filter out rows with unknown (NaN) or zero values; in both cases, the records shouldn't be included. Here, we'll use the pipe operator, |, as a vectorized equivalent of or in ordinary Python.

Consider the following code. We're assigning a mask as a logical OR of two Boolean arrays (hence the pipe). The first array checks whether any of the values in the kill_cols columns we assigned previously is null. As there are multiple columns, the result will be in the form of a 2-dimensional array. To convert it into a 1-dimensional array, we further use the any method, passing axis=1 to identify the horizontal direction of the operation; in other words, the result will tell us, for each row, whether any value in this row is True.

The second operation (after the pipe) works similarly, but, instead, we check whether the value is zero. As a result, the mask variable will be a 1-dimensional array with a Boolean value for each row in the original dataset. Those values tell us whether, in each row, the value in any of the kill columns is null or equal to zero:

mask = data[kill_cols].isnull().any(axis=1) | (data[kill_cols] == 0).any(axis=1)
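To see exactly what this expression produces, here is a minimal sketch on a hypothetical four-row frame (the values are made up to cover all the cases):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: clean, NaN, zero-in-one-column, zero-in-the-other
df = pd.DataFrame({
    "allies killed": [100, np.nan, 0, 40],
    "axis killed": [80, 300, 10, 0],
})
kill_cols = ["allies killed", "axis killed"]

# True for any row where a kill column is NaN or zero
mask = df[kill_cols].isnull().any(axis=1) | (df[kill_cols] == 0).any(axis=1)
print(mask.tolist())  # [False, True, True, True]
```

Only the first row survives the check: every other row has a NaN or a zero in at least one of the two columns.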

Next, we need to drop the rows for which mask is True and keep the rest. For that, we need to invert our mask using the tilde symbol; similar to the pipe, the tilde (~) works as a vectorized not (akin to the exclamation mark in some other languages). In other words, ~mask is True exactly for the rows that have neither zero nor null values in the kill columns.

Using this inverted mask, we filter the dataset and compute the median of the killed total values:

>>> data.loc[~mask, 'killed total'].median()
37316.0
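The inversion-and-filter step can be illustrated end to end on the same kind of hypothetical toy frame (rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# Hypothetical data; only rows 0 and 3 have clean, non-zero kill counts
df = pd.DataFrame({
    "allies killed": [100, np.nan, 0, 40],
    "axis killed": [80, 300, 10, 60],
})
kill_cols = ["allies killed", "axis killed"]
df["killed total"] = df[kill_cols].sum(axis=1)

mask = df[kill_cols].isnull().any(axis=1) | (df[kill_cols] == 0).any(axis=1)

# ~mask keeps rows with no NaNs and no zeros in the kill columns
clean_total = df.loc[~mask, "killed total"]
print(clean_total.median())  # median of 180.0 and 100.0 -> 140.0
```

Rows 1 and 2 are excluded, so the median is taken over the totals 180.0 and 100.0 only.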

This mask can now be reused on many occasions. As a final example, let's compute the main statistics for the tank columns of both sides, using the describe method (here, ~mask serves as a proxy for good records with meaningful values). Many battles have casualties but no tanks lost, or none reported lost, and this seems fine:

>>> data.loc[~mask, ['allies tanks', 'axis tanks']].describe()
allies tanks axis tanks
count 79.000000 79.000000
mean 352.683544 65.911392
std 897.692848 235.066831
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 254.000000 18.000000
max 4799.000000 1500.000000

The values are interesting; as you can see, most battles don't have tank losses on either side. On average, though, the allies lost more than five times as many tanks as the axis. Furthermore, in 75% of battles, the axis lost 18 or fewer tanks, but the allies lost up to 254—an even larger ratio!

Our analysis is getting more complex. It is hard to manually read and understand more than a dozen numbers at once. To make sense of larger sets, we need to start visualizing our dataset with charts.
