Defining the scope of work to be done

Before we dive into the process of data cleaning, which might be very time-consuming, it is always useful to define the scope of work—which columns and rows we actually need to clean. For this chapter, let's restrict the scope to the lowest level of the hierarchy—specific battles (level=100—pages for events with no children). We can use the equality operator to generate a Boolean mask, and then use this mask to filter the dataset:

>>> battles = data[data.level == 100]  
>>> battles.shape
(147, 23)
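To make the masking step concrete, here is a minimal sketch on a toy frame. The `toy` DataFrame and its values are made up for illustration; only the `level` and `name` column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `data`: three pages with made-up level values
toy = pd.DataFrame({
    'name': ['Page A', 'Page B', 'Battle of Mokra'],
    'level': [1, 50, 100],
})

# Comparing a column with a scalar yields a Boolean Series (the mask)
mask = toy.level == 100

# Indexing with the mask keeps only the rows where it is True
toy_battles = toy[mask]
print(toy_battles.shape)
```

Keeping the mask in its own variable is optional, but it makes it easy to inspect how many rows pass the filter (for example, with `mask.sum()`) before applying it.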

There are many columns in the dataset—enough for pandas to omit the middle part when printing. As we'll be mostly focused on time, geolocation, names, and casualties of each side, let's define those columns of interest in a list and investigate them more closely:

columns = ['Location', 'name', 'Date', 'Result', 'Belligerents.allies', 'Belligerents.axis',
           'Casualties and losses.allies', 'Casualties and losses.axis']
battles[columns].head()

As a result of this code, we'll get the following table:

  | Location | name | Date | Result | Belligerents.allies | Belligerents.axis | Casualties and losses.allies | Casualties and losses.axis
0 | Westerplatte, harbor of Free City of Danzig54°... | Battle of Westerplatte | 1–7 September 1939 | German victory | Poland | Germany Danzig | 15 dead at least 40 wounded Remainder captured | 50 dead at least 150 wounded
1 | Mokra, Kielce Voivodeship, Poland | Battle of Mokra | September 1, 1939 | Polish victory | Germany | Poland | 800 killed, missing, captured, or wounded 50 tanks | 500 killed, missing or wounded 300 horses sever a...
2 | Near Mława, Warsaw Voivodeship, Poland | Battle of Mlawa | 1–3 September 1939 | German victory | Germany | Poland | 1,800 killed 3,000 wounded 1,000 missing 72 tanks.. | 1,200 killed 1,500 wounded
3 | Near Tuchola Forest, Pomeranian Voivodeship, P... | Battle of Tuchola Forest | 1–5 September 1939 | German victory | Germany | Poland | 506 killed 743 wounded | 1600 killed 750 wounded Unknown number cap...
4 | Jordanów, Kraków Voivodeship, Poland | Battle of Jordanów | 1–3 September 1939 | Pyrrhic German victory | Poland | Germany | 3+ tanks | 70+ tanks and AFVs

Now, let's investigate the missing values in the data: if a particular column is mostly empty, it makes no sense to spend time cleaning and processing it. The best way to explore the missing values is to plot them, which is an easy task with the help of the missingno library. Take a look at the code:

import missingno as msno
msno.matrix(battles, labels=True, sparkline=False)

As a result, the following chart will be plotted:

Here, the black rectangles represent non-empty values. As you can see, a few auto-generated columns (level, name, parent, and url) have no missing values. Some others, on the other hand, have only a few non-empty values (for example, all the columns related to the third party). More importantly, the missing values are clearly correlated across some of the columns: it seems that rows with missing data in Belligerents also lack values for Date and Location. Let's first investigate those columns:

>>> mask = battles[['Date', 'Location']].isnull().all(axis=1)
>>> battles.loc[mask, ['name', 'url']]
     name                                 url
39   Pripyat swamps (punitive operation)  https://en.wikipedia.org/wiki/Pripyat_swamps_(...
42   Bombing of Tallinn in World War II   https://en.wikipedia.org/wiki/Bombing_of_Talli...
46   Operation Wotan                      https://en.wikipedia.org/w/index.php?title=Ope...
47   Nevsky Pyatachok                     https://en.wikipedia.org/wiki/Nevsky_Pyatachok
48   Operation Nordlicht (1942)           https://en.wikipedia.org/wiki/Operation_Nordli...
61   Operation Büffel                     https://en.wikipedia.org/wiki/Operation_B%C3%B...
67   Operation Kremlin                    https://en.wikipedia.org/wiki/Operation_Kremlin
68   Operation Braunschweig               https://en.wikipedia.org/wiki/Operation_Brauns...
70   Malaya Zemlya                        https://en.wikipedia.org/wiki/Malaya_Zemlya
96   Concert (operation)                  https://en.wikipedia.org/wiki/Concert_(operation)
97   Zhitomir–Berdichev Offensive         https://en.wikipedia.org/wiki/Zhitomir%E2%80%9...
152  Operation Nordlicht (1944-1945)      https://en.wikipedia.org/wiki/Operation_Nordli...
157  Operation Konrad                     https://en.wikipedia.org/wiki/Operation_Konrad
175  Operation Margarethe                 https://en.wikipedia.org/wiki/Operation_Margar...

From the outcome, it seems that the web pages actually lack this kind of information. Moreover, many of them are not exactly standard battle pages, so perhaps we'd be better off without them—let's throw them out for good:

battles = battles.dropna(subset=['Date', 'Location'])
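One detail worth keeping in mind: with its default `how='any'`, `dropna` removes a row if either of the listed columns is missing, which is slightly stricter than the mask used earlier (rows where both are missing). The following is a minimal sketch of the difference on made-up rows; the `toy` frame and its values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: complete, Date missing, both Date and Location missing
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'Date': ['1 September 1939', np.nan, np.nan],
    'Location': ['Mokra', 'Tallinn', np.nan],
})

# Rows where BOTH columns are missing (same logic as the mask used earlier)
both_missing = toy[['Date', 'Location']].isnull().all(axis=1)

# Default how='any': drop a row if EITHER column is missing
strict = toy.dropna(subset=['Date', 'Location'])

# how='all': drop a row only if BOTH columns are missing
lenient = toy.dropna(subset=['Date', 'Location'], how='all')

print(strict['name'].tolist(), lenient['name'].tolist())
```

For this chapter the stricter default is arguably what we want, since both Date and Location are needed for the analysis that follows.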

Now that we're done with missing values, let's get back to the table we printed. As you can see, there are a few serious issues: the axis and allies belligerents are incorrectly assigned in some rows (refer to rows 3 and 4 of the preceding example), and the Date, Location, and Casualties values (among others) are stored in an unstructured way. Those issues have to be taken care of before we can move on to analysis. In other words, we need to correct the sides, parse dates, convert locations into coordinates, and parse multiple types of casualties as numbers. Unfortunately, there is no silver bullet here. Processing all those records accurately would require a lot of time. Our time is usually limited, so we'll have to find some sort of compromise, depending on our end goals.

In this section, we explored the dataset in general, which allowed us to throw away what we won't use, and identify issues with the data that we'll have to fix in the next sections.

But first, how do we even approach data cleaning and parsing? The former is simple: just use masks, filters, and/or imputation strategies. The latter, however, will require us to use yet another tool: regular expressions.
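As a small preview of what such parsing might look like, here is a hedged sketch that pulls numbers out of a casualties string. The sample text is adapted from the table printed earlier; the pattern and the set of keywords are assumptions chosen for illustration, not the approach developed in the following sections.

```python
import re

# Sample string adapted from the casualties values shown earlier
text = '1,800 killed 3,000 wounded 1,000 missing'

# A number (optionally with thousands separators) followed by one keyword;
# the keyword list is an illustrative assumption
pattern = re.compile(r'(\d[\d,]*)\s+(killed|wounded|missing)')

# Map each keyword to its number, stripping the thousands separators
casualties = {
    kind: int(number.replace(',', ''))
    for number, kind in pattern.findall(text)
}
print(casualties)
```

Real values in the dataset are messier (for example, "at least 40 wounded" or "3+ tanks"), so a pattern this simple would only cover part of the records.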
