CSS and XPath selectors

As mentioned previously, Beautiful Soup parses HTML from a string into a Python object. Even once parsed, this structure is not easy to navigate. This is especially true for bulk retrieval, when we operate on multiple pages at once, because of the dynamic nature of the web. Even the same page can change constantly, with elements being added or removed, let alone different pages, even those with apparently the same structure. This is the moment when you'll start appreciating well-defined and stable APIs!

To navigate HTML document structures, also known as Document Object Models (DOMs), two common and widely adopted techniques are used. The first one, CSS Selectors, is a pattern language built to work with HTML and identify elements using a combination of the element type, class, and ID properties, their nested structure, and a number of other options. Here is a full example of a CSS Selector for the main image on an arbitrary Wikipedia page:

body div#content.mw-body div#bodyContent.mw-body-content [email protected] div.mw-parser-output table.infobox.vevent tbody tr td a.image img

As you can see, it is a very specific sequence of HTML elements: several div tags, a table, its body, a row, a cell, a link, and finally an image.

Many of those elements have IDs, which identify one exact element. In a selector, the ID is written after the element type, prefixed by the # symbol. An element can have only one ID, and IDs are supposed to be unique within a page, although a duplicate ID won't break the page.

Some elements have one or more classes, which are non-unique identifiers. Because classes are not unique, many elements can share the same class, and a single element can have any number of classes at the same time. In a selector, each class is prefixed by a dot (.).

The sequence of elements represents the required nesting structure. For example, an element that fits the preceding pattern needs to be an image, nested within an a element with the image class, which in turn is nested within a td element.
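To make the ID, class, and nesting rules concrete, here is a minimal sketch using Beautiful Soup's select and select_one methods. The markup is a hypothetical, heavily simplified stand-in for a Wikipedia page, not the real document, and the variable names are chosen for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; a real Wikipedia page is far larger.
HTML = """
<body>
  <div id="content" class="mw-body">
    <table class="infobox vevent">
      <tbody><tr><td>
        <a class="image" href="/wiki/File:Example.jpg"><img src="main.jpg"></a>
      </td></tr></tbody>
    </table>
    <table class="infobox">
      <tbody><tr><td>
        <a class="image" href="/wiki/File:Other.jpg"><img src="other.jpg"></a>
      </td></tr></tbody>
    </table>
  </div>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")

# '#' matches the id attribute and '.' matches one class; they can be chained.
content = soup.select_one("div#content.mw-body")

# Both classes are required, so the second table (only 'infobox') is skipped.
tables = soup.select("table.infobox.vevent")

# Space-separated parts describe nesting: img inside a.image inside td, and so on.
images = soup.select("table.infobox.vevent tbody tr td a.image img")
image_sources = [img["src"] for img in images]
print(image_sources)  # ['main.jpg']
```

Note that the second table is ignored only because it lacks the vevent class; chaining classes narrows the match rather than widening it.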

This makes the preceding query extremely specific for scraping purposes: any change to this structure will break the retrieval. We need to design a query that is as simple and general as possible so that it doesn't break, but specific enough to pull the correct information. Defining such queries is almost an art in itself. An arguably better query for the same element could be something like the following:

 table.infobox.vevent a.image img

This query is much shorter! The trick here is that the sequence does not require nesting to be direct or complete (for direct nesting, a > symbol can be used). We start by specifying that we're only interested in a table with two classes, infobox and vevent. In that table, we're looking for an a element with the image class, and pulling an image (img element) from it. Of course, there is a catch: there may be more than one matching image inside that table. We can either pull all of them or just the first occurrence, by using the corresponding retrieval command in Python or, in simple cases, by adding a pseudo-class such as :first-of-type to the CSS. There are many other properties and tricks for querying with CSS. To learn more, check out the Mozilla CSS documentation (https://developer.mozilla.org/en-US/docs/Web/CSS).
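The difference between loose nesting, direct nesting, and retrieving all matches versus only the first one can be sketched as follows. This is an illustration on hypothetical markup (the file names and structure are made up), using Beautiful Soup's select and select_one:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second image is wrapped in an extra span.
HTML = """
<table class="infobox vevent">
  <tbody>
    <tr><td><a class="image" href="#"><img src="first.jpg"></a></td></tr>
    <tr><td><a class="image" href="#"><span><img src="second.jpg"></span></a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A space means "nested at any depth", so both images match.
all_sources = [
    img["src"] for img in soup.select("table.infobox.vevent a.image img")
]

# '>' requires direct nesting, so the image inside the span is skipped.
direct_sources = [
    img["src"] for img in soup.select("table.infobox.vevent a.image > img")
]

# select_one returns only the first match in document order (or None).
first = soup.select_one("table.infobox.vevent a.image img")
first_source = first["src"] if first is not None else None

print(all_sources)     # ['first.jpg', 'second.jpg']
print(direct_sources)  # ['first.jpg']
print(first_source)    # first.jpg
```

Choosing between select and select_one in Python is often easier to reason about than encoding "first match" in the selector itself, since select_one simply takes the first element in document order.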

CSS attributes are at the core of navigating and querying HTML, but we don't have to use a CSS Selector to run a query. While CSS queries are usually short and readable, they don't provide advanced tooling for moving around the structure of the document, for example, going up the document tree or moving freely between sibling elements. An alternative tool, one that can be used beyond HTML, is XPath, the XML path language. XPath offers a lot of flexibility and power when it comes to locating objects.

XPath can sometimes look like a filesystem path: direct nesting is represented by a slash, while a double slash means nesting at any depth (one or more levels inside). Elements can be indexed with square brackets (similar to Python iterables, although XPath positions start at 1 rather than 0). The existence of a certain attribute, or the matching of certain criteria (predicates), can also be specified within square brackets (we'll see a similar approach in Python, in Chapter 11, Data Cleaning and Manipulation). For example, here is the same query as earlier, but using XPath:

//table[contains(@class, 'infobox') and contains(@class, 'vevent')]//a[contains(@class, 'image')]//img

As you can see, this path is much longer! In fact, it has some problems as well: contains only checks for a substring, so theoretically, contains(@class, 'vevent') will also match a class name such as veventTest. This may be a problem in some cases, but not in ours. Despite being verbose, on many other occasions XPath works better than CSS, especially for the bulk retrieval of values. If you want more information on XPath, MDN (https://developer.mozilla.org/en-US/docs/Web/XPath) has you covered again.
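The substring pitfall, along with indexing and going up the tree, can be sketched with lxml, which supports XPath 1.0. The markup below is hypothetical, and the has_class helper is an illustrative name introduced here, not part of lxml; it builds a common XPath idiom that matches a whole class name rather than a substring:

```python
from lxml import html

# Hypothetical markup with a decoy table whose class merely contains 'vevent'.
PAGE = """
<html><body>
  <table class="infobox vevent">
    <tr><td><a class="image" href="#"><img src="main.jpg"></a></td></tr>
  </table>
  <table class="infobox veventTest">
    <tr><td><a class="image" href="#"><img src="decoy.jpg"></a></td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)

# Same query as in the text, with /@src appended to pull the attribute value.
LOOSE = (
    "//table[contains(@class, 'infobox') and contains(@class, 'vevent')]"
    "//a[contains(@class, 'image')]//img/@src"
)
loose_sources = tree.xpath(LOOSE)


def has_class(name):
    """Build a predicate matching a whole class token (illustrative helper)."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


STRICT = (
    f"//table[{has_class('infobox')} and {has_class('vevent')}]"
    f"//a[{has_class('image')}]//img/@src"
)
strict_sources = tree.xpath(STRICT)

# Positions start at 1; the ancestor axis walks up the tree, which CSS cannot do.
first_img = tree.xpath("(//img)[1]")[0]
parent_table = first_img.xpath("ancestor::table")[0]

print(loose_sources)              # ['main.jpg', 'decoy.jpg']
print(strict_sources)             # ['main.jpg']
print(parent_table.get("class"))  # infobox vevent
```

The stricter predicate is even more verbose, which is part of the trade-off: XPath can express almost anything, but readable class matching is where CSS Selectors tend to be more convenient.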

Both CSS Selectors and XPath are capable of retrieving almost any set of elements from a page. In addition, Beautiful Soup itself has quite a few tricks of its own. Here, we're using only CSS Selectors, as Beautiful Soup does not support XPath (lxml does, however).

Building a proper scraper requires a lot of testing and manual sifting through the web page to get the structure of HTML and CSS attributes just right. An invaluable tool at our disposal to this end is a browser developer console; both Chrome and Firefox have one. Let's have a look.
