Understanding casualties

Casualties are probably the most verbose and least structured columns of the dataset. It would be extremely hard to make use of every nuance of the information here, so again, perhaps we can simplify the task and extract only the things we really want to use. Perhaps we can use keywords to extract any number preceding them; for example, ([\d|,]+)\s*dead should extract any consecutive run of digits or commas before the word 'dead'. We can define similar patterns for all types of casualties and loop over all of them, testing the patterns. There are, unfortunately, many keywords that mean the same thing ('captured', 'prisoners', and many more), so we have to make them optional alternatives, similar to the preceding month expression:

digit_pattern = r'([\d|,]+)(?:[\d+])?\s*(?:{words})'

keywords = {
    'killed': ['dead', 'killed', 'men'],
    'wounded': ['wounded', 'sick'],
    'captured': ['captured', 'prisoners'],
    'tanks': ['tanks'],
    'airplane': ['airplane'],
    'guns': ['artillery', 'gun'],
    'ships': ['warships', 'boats'],
    'submarines': ['submarines']
}
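To sanity-check the pattern before running it over the whole column, we can format it with one of the keyword groups and try it on a handmade string (the example string here is made up for illustration, not taken from the dataset):

```python
import re

# the generic pattern, formatted with the 'killed' keyword group
digit_pattern = r'([\d|,]+)(?:[\d+])?\s*(?:{words})'
pattern = digit_pattern.format(words="|".join(['dead', 'killed', 'men']))

# only the number attached to a matching keyword is captured;
# '1,500' is ignored because 'wounded' is not in this group
print(re.findall(pattern, '1,200 killed and 1,500 wounded'))  # ['1,200']
```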

Now, for each keyword, we can generate a custom regular expression and extract all matches from each cell, with multiple occurrences per cell (casualties from the different countries involved). In this case, however, we can preemptively convert them into numbers and sum them. By itself, this is easy, but before we do that, we need to remove commas, filter out empty cells, and convert strings to integers. There is probably a way to do some of that using regex, but it seems easier in this particular case to write a custom pure-Python function (note that it may or may not be the slowest part of the pipeline):

def _shy_convert_numeric(v):
    if pd.isnull(v) or v == ',':
        return 0
    return int(v.replace(',', ''))

This function can be applied to every cell via applymap. After that, we can finally sum every row. The whole computation looks as follows:
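As a quick illustration of the conversion step (the values below are made up, standing in for what extractall would produce):

```python
import pandas as pd

def _shy_convert_numeric(v):
    # missing cells and lone commas become 0; everything else is
    # stripped of thousands separators and cast to int
    if pd.isnull(v) or v == ',':
        return 0
    return int(v.replace(',', ''))

# hypothetical extraction output: strings with commas and gaps
raw = pd.DataFrame({0: ['1,200', None], 1: ['500', ',']})
numbers = raw.applymap(_shy_convert_numeric)
print(numbers.sum(1).tolist())  # [1700, 0]
```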

results = {
    'allies': pd.DataFrame(index=battles.index),  # empty dataframes with the same index
    'axis': pd.DataFrame(index=battles.index)
}

for name, edf in results.items():
    column = battles[f'Casualties and losses.{name}']

    for tp, keys in keywords.items():
        pattern = digit_pattern.format(words="|".join(keys))
        extracted = column.str.extractall(pattern).unstack()
        edf[tp] = extracted.applymap(_shy_convert_numeric).sum(1)

    results[name] = edf.fillna(0).astype(int)

Let's now see the result of results['axis'].head(5):

   killed  wounded  captured  tanks  airplane  guns  ships  submarines
0      50      150         0      0         0     0      0           0
1     500        0         0      1         0     0      0           0
2    1200     1500         0      0         0     0      0           0
3    1600      750         0      0         0     0      0           0
4       0        0         0      0         0     0      0           0


Note that there is a caveat to our casualties-parsing approach that we'll have to keep in mind: because of the pattern we use, whenever a range of casualties is stated, we take the last number mentioned. For a range, that is the maximum of the two (for example, the '100-150 killed' pattern will return 150), while for an open-ended estimate it is the minimum (for example, the '10+ tanks' pattern will return 10).
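This caveat is easy to confirm directly (the strings below are made up; the pattern is the generic one defined earlier):

```python
import re

digit_pattern = r'([\d|,]+)(?:[\d+])?\s*(?:{words})'

# a range yields its upper bound, since the run of digits right
# before the keyword is the second number...
print(re.findall(digit_pattern.format(words='killed'), '100-150 killed'))  # ['150']
# ...while an open-ended '10+' estimate yields the lower bound
print(re.findall(digit_pattern.format(words='tanks'), '10+ tanks'))  # ['10']
```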

Finally, let's reconnect both of those new dataframes to the original one. This time, let's create a multilevel column structure of our own so that we can select casualties for axis/allies without needing the long column names. We're going to use the pd.concat function, which can join dataframes either vertically or horizontally. Our allies/axis casualties data is already in the proper dictionary format; we just need to add the rest of the data to the dictionary and then join the datasets together:

results['old_metrics'] = battles
new_dataset = pd.concat(results, axis=1)
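To see why this multilevel structure is convenient, here is a small sketch with hypothetical numbers: pd.concat with a dictionary and axis=1 turns the dictionary keys into the outermost column level, so a single short key selects a whole group of columns:

```python
import pandas as pd

# toy stand-ins for the per-side casualty frames
pieces = {
    'allies': pd.DataFrame({'killed': [50, 500]}),
    'axis': pd.DataFrame({'killed': [150, 0]}),
}
combined = pd.concat(pieces, axis=1)

# the dict keys became the top level of the column MultiIndex,
# so one key selects all of that side's columns at once
print(combined['axis']['killed'].tolist())  # [150, 0]
```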

As a result, we now have a clean, numeric dataframe of casualties for both sides in the conflict, divided by the type of casualty, be it warships, planes, tanks, or soldiers.
