Step 3 – Scraping data as a whole

Finally, we have all the links and the code to scrape information from each of them.

The code for this subsection can be found in the C_Scraping_part3 notebook.

First, let's import all the code we will require, and read the file with links:

import json
from wiki import parse_battle_page
import time

with open('./all_battles.json', 'r') as f:
campaigns = json.load(f)
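Before scraping, it helps to keep the expected shape of this data in mind. The recursive parser below assumes a nesting of front → campaign → event, where each event may carry a `url` and a `children` dictionary. The following sketch is purely illustrative (the `sample` dictionary is invented); with the real data you would iterate over `campaigns` loaded above instead:

```python
# Hypothetical peek: the scraper expects front -> campaign -> event,
# each event optionally holding 'url' and 'children' keys.
# 'sample' mimics that shape for illustration only.
sample = {
    'Eastern Front': {
        'Operation Barbarossa': {
            'url': '/wiki/Operation_Barbarossa',
            'children': {},
        }
    }
}

for front, cps in sample.items():
    print(front, '->', list(cps))  # -> Eastern Front -> ['Operation Barbarossa']
```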

Once again, we need to write a recursive function that will scrape data for a given event and call itself for all nested events:

def _parse_in_depth(element, name):
    '''attempts to scrape data for every
    element with url attribute – and all the children
    if there are any'''
    if 'children' in element:
        for k, child in element['children'].items():
            parsed = _parse_in_depth(child, k)
            element['children'][k].update(parsed)

    if element.get('url', None):
        try:
            element.update(parse_battle_page(element['url']))
        except Exception as e:
            raise Exception(name, e)

    time.sleep(.1)  # let's be good citizens!
    return element

Note that we actively use the try/except clause, as it helps us identify the specific page that fails to parse: by wrapping the original error together with the element's name, the traceback points straight at the problematic page. Often, looking at the actual page is enough to understand the problem.
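To see how the wrapped exception surfaces the failing element, here is a self-contained sketch. Everything in it is invented for illustration (`fake_parse`, `parse_in_depth`, and the sample tree stand in for the real parser and data), but the error-wrapping logic mirrors the function above:

```python
# Hypothetical illustration of the error-wrapping pattern:
# the name of the failing element travels up with the error.
def fake_parse(url):
    # stand-in for parse_battle_page: fails for one "bad" page
    if url == 'bad':
        raise ValueError('page removed')
    return {'parsed': True}

def parse_in_depth(element, name):
    if 'children' in element:
        for k, child in element['children'].items():
            element['children'][k].update(parse_in_depth(child, k))
    if element.get('url'):
        try:
            element.update(fake_parse(element['url']))
        except Exception as e:
            raise Exception(name, e)  # attach the element's name
    return element

tree = {'url': 'ok', 'children': {'Battle X': {'url': 'bad'}}}
try:
    parse_in_depth(tree, 'Campaign Y')
except Exception as e:
    print(e.args[0])  # -> Battle X  (the element to inspect)
```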

Fronts and campaigns do not have links themselves. In order to pull information on all battles, we need to use a double loop. In the following code snippet, we loop over all fronts. For each front, we add a new key to the dictionary and start looping over the campaign links for this front. From here, our battle page parser can take over:

campaigns_parsed = {}

for fr_name, front in campaigns.items():
    print(fr_name)
    campaigns_parsed[fr_name] = {}

    for cp_name, campaign in front.items():
        print(f'    {cp_name}')
        campaigns_parsed[fr_name][cp_name] = _parse_in_depth(campaign, cp_name)

This process may take some time, mostly because we asked Python to sleep on each element. Note that scraping failed for one link, Operation Wotan. It seems that this operation is fictional, and the Wiki community decided to remove the page, which is fair enough. The good part is that this failure did not cause our code to stop, so all the other pages were collected.
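Since the `raise` inside `_parse_in_depth` propagates upward, one way to get the behaviour described here (one failed link, everything else still collected) is to catch failures per campaign in the outer loop. The following is a minimal sketch under that assumption; `scrape_all` and `stub_parse` are hypothetical names, with the stub standing in for the real parser:

```python
# Hypothetical sketch: catch per-campaign failures so one missing
# page (e.g. a deleted article) does not halt the whole run.
def scrape_all(campaigns, parse):
    parsed, failed = {}, []
    for fr_name, front in campaigns.items():
        parsed[fr_name] = {}
        for cp_name, campaign in front.items():
            try:
                parsed[fr_name][cp_name] = parse(campaign, cp_name)
            except Exception as e:
                failed.append((fr_name, cp_name, e))  # record, don't stop
    return parsed, failed

def stub_parse(campaign, name):
    # stand-in for _parse_in_depth; fails for the removed page
    if name == 'Operation Wotan':
        raise ValueError('page removed')
    return dict(campaign, parsed=True)

data = {'Eastern Front': {'Operation Wotan': {'url': 'x'},
                          'Barbarossa': {'url': 'y'}}}
parsed, failed = scrape_all(data, stub_parse)
print([name for _, name, _ in failed])  # -> ['Operation Wotan']
```

Keeping the failures in a list also makes it easy to retry or inspect them after the run finishes.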

Once it's done, let's check the overall quality of the data we just pulled. More on that in the next section.
