Step 1 – Scraping the list of battles

Let's start with scraping the main page. For that, let's go through a few steps.

First, we need to collect the main page as a string using the requests library – the same way we pulled information from Nominatim, with an HTTP GET request via the library's get method:

import requests as rq

base_url = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
response = rq.get(base_url)

We can access the raw content of the page via response.content.

Next, we need to parse this string into a Python representation of the page using BS4:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

Perfect! To structure the code, we create a new function, get_dom (DOM stands for Document Object Model), which encloses all the preceding code:

def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')

Next, once we have a parsed DOM, it is a good idea to narrow down the area of the page we'll be working in – the main content element. Using the Chrome Developer Tools, we can observe that the main text is stored within the div element with the mw-parser-output class, which is itself wrapped in another div element with the mw-content-text ID. Let's now use that information to get it in Python, using BS4.

Beautiful Soup has three separate ways to search for an element – find, find_all, and select. The first two expect you to pass an object type and, optionally, attributes; their recursive argument defines whether the search should go deeper than one level. The difference between them is subtle – find only retrieves the first occurrence, while find_all always returns a list with all matching elements. Finally, select also returns a list, but expects a single CSS selector string, which makes it easier to specify a nested element to retrieve:

content = soup.select('div#mw-content-text > div.mw-parser-output', limit=1)[0]

The objects they return are similar to the root BS4 object, but only cover the corresponding section of the page.
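To see the three search methods side by side, here is a small self-contained sketch; the HTML snippet is made up purely for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="content">
  <div class="inner">
    <p class="lead">first</p>
    <p>second</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching element only (or None if nothing matches)
first_p = soup.find('p')
print(first_p.text)  # first

# find_all always returns a list of all matches
all_p = soup.find_all('p')
print(len(all_p))  # 2

# select takes a CSS selector and also returns a list;
# nesting is easy to express in a single string
lead = soup.select('div#content > div.inner > p.lead')
print(lead[0].text)  # first
```

The select call mirrors the nested-element query we use on the Wikipedia page above.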

In the next step, we need to collect the corresponding elements for each front, in our case, h2 headers. All fronts are organized as sections – each section has a title (the h2 element), but hierarchically, titles are not nested within the sections; all section content sits just below the corresponding title. There is also the last title that we don't care about – citations and notes. One way to filter would be just to drop the last element. Alternatively, we could use a CSS Selector trick, using the following predicate: :not(:last-of-type). To keep things simple and readable, we'll just drop the last element in the list in this case:

fronts = content.select('div.mw-parser-output>h2')[:-1]

Here, we are searching for all h2 headers that are direct children of the content section (that is what the > combinator in the selector means). Did we collect all the correct titles? Let's check! To remove the [edit] suffix from each title, we simply drop the final six characters:

>>> for el in fronts:
...     print(el.text[:-6])
African Front
Mediterranean Front
Western Front
Atlantic Ocean
Eastern Front
Indian Ocean
Pacific Theatre
China Front
Southeast Asia Front

This looks correct. Now we have all the fronts!
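As a side note, both cleanup steps have slightly more explicit alternatives, sketched below on toy data: Python 3.9+ strings have a removesuffix method that only strips [edit] when it is actually present, and Beautiful Soup's select understands the :not(:last-of-type) trick mentioned earlier:

```python
from bs4 import BeautifulSoup

# removesuffix (Python 3.9+) only strips '[edit]' when it is present
print('African Front[edit]'.removesuffix('[edit]'))  # African Front
print('African Front'.removesuffix('[edit]'))        # African Front

# the :not(:last-of-type) selector trick, shown on a made-up document
html = '<div><h2>A</h2><h2>B</h2><h2>C</h2></div>'
soup = BeautifulSoup(html, 'html.parser')
titles = [h.text for h in soup.select('h2:not(:last-of-type)')]
print(titles)  # ['A', 'B'] – the last h2 is filtered out
```

Either variant would make the intent a little clearer at the cost of a few extra characters.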

But how can we get the corresponding ul lists for each header? This may seem like a problem – we need to retrieve elements that are siblings of the headers rather than their children. Luckily, BS4 can take care of that using the find_next_siblings method. It works like find_all, except that it won't look within the element – instead, it searches the elements that follow it on the same level of the document. In other words, BS4 objects don't only store information on a given HTML element – they are aware of the HTML document as a whole.
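Before applying it to the real page, here is a minimal, self-contained sketch of that behavior; the HTML mimics the flat header-plus-list layout described above, but the markup itself is made up:

```python
from bs4 import BeautifulSoup

# headers and list containers are siblings, not nested,
# just like on the Wikipedia page
html = """
<div>
  <h2>African Front</h2>
  <div class="div-col"><ul><li>Battle A</li></ul></div>
  <h2>Western Front</h2>
  <div class="div-col"><ul><li>Battle B</li></ul></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

header = soup.find('h2')  # the first header
# find_next_siblings looks *after* the header, not inside it
sibling_divs = header.find_next_siblings('div', 'div-col')
print(len(sibling_divs))           # 2 – every later container is a sibling too
print(sibling_divs[0].ul.li.text)  # Battle A – only the first belongs to this header
```

Note that the method returns every matching sibling that follows, which is exactly why we'll take only the first one below.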

Take a look at the following code snippet. We take the first title by using 0 as an index, and then run the method, passing the query properties – it should be a ul element within the first div with three classes: div-col, columns, and column-width:

>>> fronts[0].find_next_siblings("div", "div-col columns column-width")[0].ul
<ul><li><a href="/wiki/North_African_Campaign" title="North African Campaign">North African Campaign</a>.....

It worked! We asked BS4 to find all the div elements with the div-col, columns, and column-width classes that appear after the header. There are many of them, but only the first div element is related to this particular header. Thus, we need to retrieve the first one and obtain the underlying ul element. And that is what we're going to do for every front in the next section.
