Quality control

As we have already mentioned, this data has plenty of issues: web pages differ widely in structure, and they offer different sets of information in different formats. Cleaning all of it would take another chapter (and indeed, that's what we'll do in Chapter 11, Data Cleaning and Manipulation). It is good practice, however, to perform some basic quality control right away, verifying that every page has a minimal set of required properties and that those properties are not null. We can also add softer checks, for example, counting how many pages have the additional fields filled in, to confirm that they are present for a significant share of the pages.

The approach we'll be using is two-fold. First, we'll define a list of values that we assume are required for each record. Second, we already know that some information will be missing for some of the pages, so let's at least count how often that happens. To do so, we define a single dictionary to store all the counters; at the start, it contains only zeros. Consider the following example. Here, we track the number of battles checked (battles_checked) and the number of records with a missing location, result, or territorial section. In addition, under total, we count the records whose Casualties and losses, Commanders and leaders, or Strength section contains only a total value. Similarly, under none, we count the records that lack each of those sections altogether:

STATISTICS = {
    'battles_checked': 0,
    'location_null': 0,
    'result_null': 0,
    'territorial_null': 0,
    'total': {
        'Casualties and losses': 0,
        'Commanders and leaders': 0,
        'Strength': 0
    },
    'none': {
        'Casualties and losses': 0,
        'Commanders and leaders': 0,
        'Strength': 0
    }
}
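Because these counters are updated in place, rerunning a check in the same session (for example, in a notebook) would add to the previous counts. One possible way around that, sketched here with a hypothetical helper named new_statistics that is not part of the original code, is to build a fresh, zeroed structure on demand:

```python
SECTIONS = ('Casualties and losses', 'Commanders and leaders', 'Strength')


def new_statistics():
    """Return a fresh, zeroed counter structure (hypothetical helper)."""
    return {
        'battles_checked': 0,
        'location_null': 0,
        'result_null': 0,
        'territorial_null': 0,
        # Each comprehension builds its own dict, so the two groups
        # never share state with each other or with earlier calls.
        'total': {section: 0 for section in SECTIONS},
        'none': {section: 0 for section in SECTIONS},
    }


# Rebinding the name before each run starts the counts from zero again.
STATISTICS = new_statistics()
```

Calling new_statistics() again before each run keeps the reported numbers tied to a single pass over the data.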

Once the data structure is defined, let's write a checking function. It is a rather simple one, in keeping with the others. It is recursive, as it calls itself on all the children of a given record.

Note that we check for required values at the outset. In the end, we only kept the level value as a required attribute:

def qa(battle, name='Unknown'):
    # Attributes every record must have; 'Location' and 'url' were relaxed.
    required = (
        # 'Location',
        # 'url',
        'level',
    )

    for el in required:
        assert el in battle and battle[el] is not None, (name, el)

    STATISTICS['battles_checked'] += 1

    # Count records whose top-level fields are absent or null.
    for el in 'Location', 'Result', 'Territorial':
        if el not in battle or battle[el] is None:
            STATISTICS[f'{el.lower()}_null'] += 1

    for el in ('Casualties and losses', 'Commanders and leaders',
               'Strength'):
        if el not in battle:
            STATISTICS['none'][el] += 1
            continue

        if 'total' in battle[el]:
            STATISTICS['total'][el] += 1

    # Recurse into nested records.
    if 'children' in battle:
        for name, child in battle['children'].items():
            qa(child, name)

 With this function, we can now loop over the records and check our statistics:

for _, front in campaigns_parsed.items():
    for name, campaign in front.items():
        qa(campaign, name)

The preceding function, as defined, passes all the tests. But why did we remove url and Location from required? It turns out that some records are missing them: for example, Battle of Lang Son does not have a link at all, while a few others (for example, the French West Africa page, https://en.wikipedia.org/wiki/French_West_Africa_in_World_War_II) are missing Location and Date. In this case, we decided to relax our requirements but to make a note of those missing records. Feel free to modify the test; doing so will give you an insight into the different types of issues we'll have to mitigate with this dataset in the future.
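If you would rather list the offending records than stop at the first failed assertion, one option is to collect them. The following is a minimal sketch, not the approach used in this chapter: find_missing and the sample record are hypothetical, and the sketch assumes that children, when present, is a dictionary, as in the checking function:

```python
def find_missing(battle, name='Unknown',
                 required=('level', 'url', 'Location')):
    """Return (name, field) pairs for required fields that are absent or None."""
    missing = [
        (name, field) for field in required
        if battle.get(field) is None
    ]
    # Walk nested records the same way the checking function does.
    for child_name, child in battle.get('children', {}).items():
        missing.extend(find_missing(child, child_name, required))
    return missing


# Hypothetical toy record, not taken from the real dataset.
sample = {
    'level': 1,
    'url': 'https://example.org/campaign',
    'Location': 'Somewhere',
    'children': {
        'Battle A': {'level': 2, 'url': None, 'Location': 'Elsewhere'},
        'Battle B': {'level': 2, 'url': 'https://example.org/b'},
    },
}

print(find_missing(sample, 'Campaign'))
# [('Battle A', 'url'), ('Battle B', 'Location')]
```

Returning a list instead of raising makes it easy to see every gap at once and to decide afterwards which fields are realistic to require.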

Once the test is over, we can check the statistics:

>>> STATISTICS
{'battles_checked': 624,
 'location_null': 37,
 'result_null': 40,
 'territorial_null': 553,
 'total': {'Casualties and losses': 7,
           'Commanders and leaders': 3,
           'Strength': 2},
 'none': {'Casualties and losses': 83,
          'Commanders and leaders': 44,
          'Strength': 109}}
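Raw counts are easier to judge as shares of the total. As a small sketch, using the counts reported above and a hypothetical helper named share, we could express them as percentages of all checked battles:

```python
# Counts copied from the statistics reported in the text.
stats = {
    'battles_checked': 624,
    'location_null': 37,
    'result_null': 40,
    'territorial_null': 553,
    'none': {
        'Casualties and losses': 83,
        'Commanders and leaders': 44,
        'Strength': 109,
    },
}


def share(count, total):
    """Percentage of total, rounded to one decimal place (hypothetical helper)."""
    return round(100 * count / total, 1)


total = stats['battles_checked']
for key in ('location_null', 'result_null', 'territorial_null'):
    print(f'{key}: {share(stats[key], total)}%')
for section, count in stats['none'].items():
    print(f'no {section}: {share(count, total)}%')
# location_null: 5.9%
# result_null: 6.4%
# territorial_null: 88.6%
# no Casualties and losses: 13.3%
# no Commanders and leaders: 7.1%
# no Strength: 17.5%
```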

Most records are missing a territorial section (553 of 624), and quite a few (109) have no information on overall strength. Again, it is a good idea to collect this information for future use. For now, let's store the dataset we obtained in another JSON file:

with open('_all_battles_parsed.json', 'w') as f:
    json.dump(campaigns_parsed, f)
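It can be worth confirming that the stored file loads back into the same structure. The following sketch uses a small, hypothetical stand-in for campaigns_parsed and a temporary directory, so it does not touch the real output file:

```python
import json
import os
import tempfile

# Hypothetical stand-in for campaigns_parsed; the real structure is much larger.
data = {
    'Front': {
        'Campaign': {'level': 1, 'children': {'Battle': {'level': 2}}},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '_all_battles_parsed.json')
    with open(path, 'w') as f:
        json.dump(data, f)
    with open(path) as f:
        restored = json.load(f)

# Equality holds here because every key is a string and every value is
# JSON-native; non-string keys or tuples would not round-trip unchanged.
assert restored == data
```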

And we're done! The three steps in this section will help you scrape the data and store it in a structured form.
