Defining the scope of work to be done

Before we dive into the process of data cleaning, which might be very time-consuming, it is always useful to define the scope of work—which columns and rows we actually need to clean. For this chapter, let's restrict the scope to the lowest level of the hierarchy—specific battles (level=100—pages for events with no children). We can use the equality operator to generate a Boolean mask, and then use this mask to filter the dataset:

>>> battles = data[data.level == 100]  
>>> battles.shape
(147, 23)
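
To make the masking step concrete, here is a minimal sketch on a toy frame. The frame and its two non-battle rows are hypothetical stand-ins for the scraped dataset; only the level column and the value 100 come from the text above.

```python
import pandas as pd

# Toy stand-in for the scraped dataset; the first two rows are invented
# placeholders for higher levels of the hierarchy.
data = pd.DataFrame({
    "name": ["Front (placeholder)", "Campaign (placeholder)",
             "Battle of Westerplatte", "Battle of Mokra"],
    "level": [1, 10, 100, 100],
})

mask = data.level == 100   # Boolean Series: one True/False per row
battles = data[mask]       # keeps only the rows where the mask is True

print(mask.tolist())       # [False, False, True, True]
print(battles.shape)       # (2, 2)
```

The mask is an ordinary Series, so it can be stored, inspected, or combined with other masks using & and | before it is used for filtering.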

There are many columns in the dataset—enough for pandas to omit the middle part when printing. As we'll be mostly focused on time, geolocation, names, and casualties of each side, let's define those columns of interest in a list and investigate them more closely:

columns = ['Location', 'name', 'Date', 'Result',
           'Belligerents.allies', 'Belligerents.axis',
           'Casualties and losses.allies', 'Casualties and losses.axis']
battles[columns].head(5)

As a result of this code, we'll get the following table:

	`Location`	`name`	`Date`	`Result`	`Belligerents.allies`	`Belligerents.axis`	`Casualties and losses.allies`	`Casualties and losses.axis`
`0`	`Westerplatte, harbor of Free City of Danzig54°...`	`Battle of Westerplatte`	`1–7 September 1939`	`German victory`	`Poland`	`Germany Danzig`	`15 dead at least 40 wounded Remainder captured`	`50 dead at least 150 wounded`
`1`	`Mokra, Kielce Voivodeship, Poland`	`Battle of Mokra`	`September 1, 1939`	`Polish victory`	`Germany`	`Poland`	`800 killed, missing, captured, or wounded 50 tanks`	`500 killed, missing or wounded 300 horses sever a...`
`2`	`Near Mława, Warsaw Voivodeship, Poland`	`Battle of Mlawa`	`1–3 September 1939`	`German victory`	`Germany`	`Poland`	`1,800 killed 3,000 wounded 1,000 missing 72 tanks..`	`1,200 killed 1,500 wounded`
`3`	`Near Tuchola Forest, Pomeranian Voivodeship, P...`	`Battle of Tuchola Forest`	`1–5 September 1939`	`German victory`	`Germany`	`Poland`	`506 killed 743 wounded`	`1600 killed 750 wounded Unknown number cap...`
`4`	`Jordanów, Kraków Voivodeship, Poland`	`Battle of Jordanów`	`1–3 September 1939`	`Pyrrhic German victory`	`Poland`	`Germany`	`3+ tanks`	`70+ tanks and AFVs`

Now, let's investigate the missing values in the data. If a particular column is mostly empty, it makes no sense to spend time cleaning and processing it. The best way to explore the missing values is to plot them, which is an easy task with the help of the missingno library. Take a look at the code:

import missingno as msno
msno.matrix(battles, labels=True, sparkline=False)

As a result, the following chart will be plotted:

Here, the black rectangles represent non-empty values. As you can see, a few auto-generated columns (level, name, parent, and url) have no missing values. Others, on the other hand, have just a few non-empty values (for example, all the columns related to the third party). More importantly, the missing values in some columns are clearly correlated: rows with missing data in Belligerents also seem to lack values for Date and Location. Let's first investigate those columns:

>>> mask = battles[['Date', 'Location']].isnull().all(axis=1)
>>> battles.loc[mask, ['name', 'url']]
                                    name                                                url
39   Pripyat swamps (punitive operation)  https://en.wikipedia.org/wiki/Pripyat_swamps_(...
42    Bombing of Tallinn in World War II  https://en.wikipedia.org/wiki/Bombing_of_Talli...
46                       Operation Wotan  https://en.wikipedia.org/w/index.php?title=Ope...
47                      Nevsky Pyatachok     https://en.wikipedia.org/wiki/Nevsky_Pyatachok
48            Operation Nordlicht (1942)  https://en.wikipedia.org/wiki/Operation_Nordli...
61                      Operation Büffel  https://en.wikipedia.org/wiki/Operation_B%C3%B...
67                     Operation Kremlin    https://en.wikipedia.org/wiki/Operation_Kremlin
68                Operation Braunschweig  https://en.wikipedia.org/wiki/Operation_Brauns...
70                         Malaya Zemlya        https://en.wikipedia.org/wiki/Malaya_Zemlya
96                   Concert (operation)  https://en.wikipedia.org/wiki/Concert_(operation)
97          Zhitomir–Berdichev Offensive  https://en.wikipedia.org/wiki/Zhitomir%E2%80%9...
152      Operation Nordlicht (1944-1945)  https://en.wikipedia.org/wiki/Operation_Nordli...
157                     Operation Konrad     https://en.wikipedia.org/wiki/Operation_Konrad
175                 Operation Margarethe  https://en.wikipedia.org/wiki/Operation_Margar...

From the outcome, it seems that the web pages actually lack this kind of information. Moreover, many of them are not exactly standard battle pages, so perhaps we'd be better off without them—let's throw them out for good:

battles = battles.dropna(subset=['Date', 'Location'])
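
One detail is worth spelling out: the mask used for inspection selects rows where both columns are missing, while dropna with a subset drops a row if any of the listed columns is missing (the default how='any'). The following sketch shows the difference on a small, made-up frame; the rows and values are illustrative assumptions, not records from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows, chosen so that each missing-value pattern appears once.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Date": ["1 September 1939", np.nan, np.nan, "1943"],
    "Location": ["Mokra", np.nan, "Tallinn", np.nan],
})

# Share of missing values per column: a numeric complement to the plot.
missing_share = toy.isnull().mean()

# Rows where BOTH columns are missing (the mask used for inspection).
both_missing = toy[["Date", "Location"]].isnull().all(axis=1)

# Default how='any': a row is dropped if EITHER column is missing.
kept_any = toy.dropna(subset=["Date", "Location"])

# how='all': a row is dropped only if BOTH columns are missing.
kept_all = toy.dropna(subset=["Date", "Location"], how="all")

print(missing_share["Date"])        # 0.5
print(both_missing.tolist())        # [False, True, False, False]
print(kept_any["name"].tolist())    # ['A']
print(kept_all["name"].tolist())    # ['A', 'C', 'D']
```

If rows with only one of the two values should survive, how='all' is the stricter match to the inspection mask; the default is the more aggressive cleanup.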

Now that we're done with missing values, let's get back to the table we printed. As you can see, there are a few serious issues, including swapped Axis and Allies belligerents (refer to the rows with index 1 to 3 in the preceding table, where Germany appears under Belligerents.allies), and Date, Location, and Casualties (among others) values stored in an unstructured way. Those issues have to be taken care of before we can move on to analysis. In other words, we need to correct the sides, parse dates, convert locations into coordinates, and parse multiple types of casualties as numbers. Unfortunately, there is no silver bullet here. Processing all those records accurately would require a lot of time. Usually, our time is limited, so we'll have to find some sort of compromise, depending on our end goals.

In this section, we explored the dataset in general, which allowed us to throw away what we won't use, and identify issues with the data that we'll have to fix in the next sections.

But first, how do we even approach data cleaning and parsing? The former is simple: just use masks, filters, and/or imputation strategies. The latter, however, will require us to use yet another technological trick: regular expressions.
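
As a small preview of what such parsing might look like, here is a rough sketch that pulls numbers out of casualty strings like those in the table above. The pattern, the list of categories, and the parse_casualties helper are illustrative assumptions, not the approach developed in the following sections.

```python
import re

# A number with optional thousands separators, followed by one known category.
# The category list is an assumption based on the sample rows shown earlier.
PATTERN = re.compile(r"(\d[\d,]*)\s+(killed|wounded|missing|dead)")

def parse_casualties(text):
    """Return {category: count} for every 'number category' pair in text."""
    return {
        kind: int(number.replace(",", ""))
        for number, kind in PATTERN.findall(text)
    }

print(parse_casualties("1,200 killed 1,500 wounded"))
# {'killed': 1200, 'wounded': 1500}
print(parse_casualties("15 dead at least 40 wounded Remainder captured"))
# {'dead': 15, 'wounded': 40}
```

Note the limits of such a naive pattern: for a string like "800 killed, missing, captured, or wounded 50 tanks", it would attribute all 800 to killed, even though the figure is a combined total. This is exactly the kind of compromise between accuracy and effort mentioned above.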
