Additional information

There is a lot of additional information on the page that we'd want to have in the dataset – sections such as belligerents, leaders, strengths, and casualties (refer to the lower rectangle on the screenshot in the Step 2 – Scraping information from the Wiki page section, labeled Supplemental info). As you can see, each header here is a row in the table, and all the actual information sits beneath it – similar to the lists after headers in the previous task. You will also notice that most of the sections are split horizontally into two cells, the first covering the Allies and the second the Axis. In some cases (refer to the Vilnius Offensive page), Wikipedia adds a third column for a third party involved in the event. For Operation Skorpion, some sections have only one column, representing overall outcomes (for example, the total number of casualties), while on other pages (Operation Goodwood, https://en.wikipedia.org/wiki/Operation_Goodwood), casualties are split between the two sides involved. This needs to be addressed as well.

All titles appear to be consistent across the pages. Thus, we can search for specific headers and, if they are present, grab the corresponding data from the section beneath each. Here is a code snippet that does exactly that: it looks for a header row and, if there is one, returns the row that follows it:

def _find_row_by_header(table, string):
    '''find a header row in the infobox and return the row that follows it'''
    header = table.tbody.find('tr', text=string)

    if header is not None:
        return header.next_sibling
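
To try the helper out, here is a quick sketch, assuming the requests/BeautifulSoup stack used throughout this chapter (the table lookup is the same one we'll use in parse_battle_page later):

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Operation_Goodwood'
dom = BeautifulSoup(requests.get(url).content, 'html.parser')

table = dom.find('table', 'infobox vevent')  # the battle infobox
row = _find_row_by_header(table, 'Casualties and losses')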

Now that we know how to get each section, let's parse them. The following code checks the number of cells in the row. If there is only one, its value is stored as total. If there are two or more, they are mapped to the corresponding columns – allies, axis, and third party:

def _parse_row(row, names=('allies', 'axis', 'third party')):
    '''parse a secondary info row
    as a dictionary of data points
    '''
    cells = row.find_all('td', recursive=False)  # direct children only
    if len(cells) == 1:
        return {'total': cells[0].get_text().strip()}

    return {name: cell.get_text().strip() for name, cell in zip(names, cells)}
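
To see what this returns, we can feed it a hand-built row (a hypothetical snippet that mimics the infobox markup; the numbers are made up):

from bs4 import BeautifulSoup

two_sided = BeautifulSoup(
    '<tr><td>10,000 casualties</td><td>8,000 casualties</td></tr>',
    'html.parser').tr

print(_parse_row(two_sided))
# {'allies': '10,000 casualties', 'axis': '8,000 casualties'}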

The order of belligerents is assumed to be consistent within the same page. Unfortunately, the order of the sides differs from page to page, and we could not find any rationale for it. Thus, the column that we're calling axis may actually refer to the Allied side – we will have to fix that during the data cleaning process in Chapter 11, Data Cleaning and Manipulation.

Let's now collect all the additional information together, using a predefined set of headers to look for:

def _additional(table):

    keywords = (
        'Belligerents',
        'Commanders and leaders',
        'Strength',
        'Casualties and losses',
    )

    result = {}
    for keyword in keywords:
        try:
            data = _find_row_by_header(table, keyword)
            if data:
                result[keyword] = _parse_row(data)
        except Exception as e:
            raise Exception(keyword, e)

    return result

Note that the exception is re-raised with the keyword attached, so that we can see which header caused the issue, thereby facilitating debugging.
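
The result is a nested dictionary, keyed by header. Schematically (the values below are placeholders, not real data), it looks like this:

{
    'Belligerents': {'allies': '...', 'axis': '...'},
    'Strength': {'allies': '...', 'axis': '...'},
    'Casualties and losses': {'total': '...'},
}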

Finally, let's wrap all the code for the page into one function that we'll run on all links:

import warnings


def parse_battle_page(url):
    '''main function to parse battle urls from wikipedia
    '''
    try:
        dom = _default_collect(url)  # get the page's DOM
    except Exception as e:
        warnings.warn(str(e))
        return {}

    table = dom.find('table', 'infobox vevent')  # info table
    if table is None:  # some campaigns don't have an infobox
        return {}

    data = _get_main_info(table)
    data['url'] = url

    additional = _additional(table)
    data.update(additional)
    return data

The preceding try/except clause catches the exception that is raised when a page does not exist – we have one broken link in our database. You can find it on the initial page; it is highlighted in red.
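
Both _default_collect and _get_main_info come from earlier in the chapter. In case you don't have them at hand, here is a minimal stand-in for the former – a sketch, assuming the same requests/BeautifulSoup stack; raise_for_status is what turns a broken link into the exception we catch above:

import requests
from bs4 import BeautifulSoup

def _default_collect(url):
    '''minimal stand-in: fetch a page and return its parsed DOM'''
    response = requests.get(url)
    response.raise_for_status()  # broken links raise here (4xx/5xx)
    return BeautifulSoup(response.content, 'html.parser')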

Now, as an added bonus, all the operation and campaign pages share the same structure, so we can collect all the information from them using the same code. Nice!

To ensure that we have addressed most of the issues, test the code on a set of links, ideally the most diverse and exotic ones. As the preceding code will be used in another notebook, it makes sense to copy it to a dedicated .py file.
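
For example, if you save the functions to a file called wiki.py (a hypothetical name) next to your notebooks, reusing them boils down to a single import:

from wiki import parse_battle_page

battle = parse_battle_page('https://en.wikipedia.org/wiki/Operation_Goodwood')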
