When Michelangelo was asked how he could sculpt a work of art as masterful as his David, he is famously reported to have said: “It is easy. You just chip away the stone that doesn’t look like David.”
Although web scraping is unlike marble sculpting in most other respects, we must take a similar attitude when it comes to extracting the information we’re seeking from complicated web pages. There are many techniques to chip away the content that doesn’t look like the content we’re searching for, until we arrive at the information we’re seeking. In this chapter, we’ll take a look at parsing complicated HTML pages in order to extract only the information we’re looking for.
It can be tempting, when faced with a Gordian Knot of tags, to dive right in and use multiline statements to try to extract your information. However, keep in mind that layering the techniques used in this section with reckless abandon can lead to code that is difficult to debug, fragile, or both. Before getting started, let’s take a look at some of the ways you can avoid altogether the need for advanced HTML parsing!
Let’s say you have some target content. Maybe it’s a name, statistic, or block of text. Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML attributes to be found. Let’s say you dive right in and write something like the following line to attempt extraction:
```python
bsObj.findAll("table")[4].findAll("tr")[2].find("td").findAll("div")[1].find("a")
```
That doesn’t look so great. In addition to the aesthetics of the line, even the slightest change to the website by a site administrator might break your web scraper altogether. So what are your options?
Especially when faced with buried or poorly formatted data, it’s important not to just start digging. Take a deep breath and think of alternatives. If you’re certain no alternatives exist, the rest of this chapter is for you.
In Chapter 1, we took a quick look at installing and running BeautifulSoup, as well as selecting objects one at a time. In this section, we’ll discuss searching for tags by attributes, working with lists of tags, and navigating parse trees.
Nearly every website you encounter contains stylesheets. Although you might think that a layer of styling designed specifically for browser and human interpretation would be a bad thing for web scrapers, the advent of CSS is actually a boon for them. CSS relies on the differentiation of HTML elements that might otherwise have the exact same markup in order to style them differently. That is, some tags might look like this:
```html
<span class="green"></span>
```
while others look like this:
```html
<span class="red"></span>
```
Web scrapers can easily separate these two different tags based on their class; for example, they might use BeautifulSoup to grab all of the red text but none of the green text. Because CSS relies on these identifying attributes to style sites appropriately, you are almost guaranteed that these class and ID attributes will be plentiful on most modern websites.
Let’s create an example web scraper that scrapes the page located at http://bit.ly/1Ge96Rw.
On this page, the lines spoken by characters in the story are in red, whereas the names of the characters themselves are in green. You can see the span tags, which reference the appropriate CSS classes, in the following sample of the page’s source code:
```html
"<span class="red">Heavens! what a virulent attack!</span>" replied
<span class="green">the prince</span>, not in the least disconcerted
by this reception.
```
We can grab the entire page and create a BeautifulSoup object with it using a program similar to the one used in Chapter 1:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)
```
Using this BeautifulSoup object, we can use the findAll function to extract a Python list of proper nouns found by selecting only the text within <span class="green"></span> tags (findAll is an extremely flexible function we’ll be using a lot later in this book):
```python
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
```
When run, it should list all the proper nouns in the text, in the order they appear in War and Peace. So what’s going on here? Previously, we’ve called bsObj.tagName in order to get the first occurrence of that tag on the page. Now, we’re calling bsObj.findAll(tagName, tagAttributes) in order to get a list of all of the tags on the page, rather than just the first.

After getting a list of names, the program iterates through all names in the list, and prints name.get_text() in order to separate the content from the tags.

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.
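As a quick illustration of that flattening, here is a minimal sketch using a small invented HTML snippet (not taken from the example site; the parser name is passed explicitly to keep the snippet self-contained):

```python
from bs4 import BeautifulSoup

# A small invented snippet, not from the example site
html = '<div><p>Some <a href="/link">linked</a> text.</p></div>'
bsObj = BeautifulSoup(html, "html.parser")

# All tags are stripped away; only the text content remains
text = bsObj.div.get_text()
print(text)  # Some linked text.
```

Notice that the hyperlink markup disappears entirely, which is exactly why you want to hold off on calling it until the end.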
BeautifulSoup’s find() and findAll() are the two functions you will likely use the most. With them, you can easily filter HTML pages to find lists of desired tags, or a single tag, based on their various attributes.

The two functions are extremely similar, as evidenced by their definitions in the BeautifulSoup documentation:
```python
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
```
In all likelihood, 95% of the time you will find yourself only needing to use the first two arguments: tag and attributes. However, let’s take a look at all of the arguments in greater detail.

The tag argument is one that we’ve seen before; you can pass a string name of a tag or even a Python list of string tag names. For example, the following will return a list of all the header tags in a document:1
```python
.findAll({"h1", "h2", "h3", "h4", "h5", "h6"})
```
The attributes argument takes a Python dictionary of attributes and matches tags that contain any one of those attributes. For example, the following function would return both the green and red span tags in the HTML document:
```python
.findAll("span", {"class": {"green", "red"}})
```

(Note that a single attribute can be mapped to a set of acceptable values, as shown here. A dictionary literal that repeats the key, such as {"class": "green", "class": "red"}, would silently keep only the last value and match only the red tags.)
The recursive argument is a boolean. How deeply into the document do you want to go? If recursive is set to True, the findAll function looks into children, and children’s children, for tags that match your parameters. If it is set to False, it will look only at the top-level tags in your document. By default, findAll works recursively (recursive is set to True); it’s generally a good idea to leave this as is, unless you really know what you need to do and performance is an issue.
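A minimal sketch of the difference, using an invented nested snippet:

```python
from bs4 import BeautifulSoup

# Invented HTML: one span nested inside another
html = "<div><span>outer<span>inner</span></span></div>"
bsObj = BeautifulSoup(html, "html.parser")

# Default (recursive=True): matches spans at any depth below the div
all_spans = bsObj.div.findAll("span")
# recursive=False: examines only the div's direct children
top_spans = bsObj.div.findAll("span", recursive=False)
print(len(all_spans), len(top_spans))  # 2 1
```

The inner span is found only when the search is allowed to recurse.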
The text argument is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves. For instance, if we want to find the number of times “the prince” was surrounded by tags on the example page, we could replace our .findAll() function in the previous example with the following lines:
```python
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
```
The output of this is “7.”
The limit argument, of course, is only used in the findAll method; find is equivalent to the same findAll call, with a limit of 1. You might set this if you’re only interested in retrieving the first x items from the page. Be aware, however, that this gives you the first items on the page in the order that they occur, not necessarily the first ones that you want.
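A short sketch of limit in action, on an invented snippet with three paragraph tags:

```python
from bs4 import BeautifulSoup

# Invented snippet with three paragraph tags
html = "<p>first</p><p>second</p><p>third</p>"
bsObj = BeautifulSoup(html, "html.parser")

# limit=2 returns only the first two matches, in document order
first_two = bsObj.findAll("p", limit=2)
print([p.get_text() for p in first_two])  # ['first', 'second']
```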
The keyword argument allows you to select tags that contain a particular attribute. For example:
```python
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
```
The keyword argument can be very helpful in some situations. However, it is technically redundant as a BeautifulSoup feature. Keep in mind that anything that can be done with keyword can also be accomplished using techniques we will discuss later in this chapter (see Regular Expressions and Lambda Expressions).

For instance, the following two lines are identical:
```python
bsObj.findAll(id="text")
bsObj.findAll("", {"id": "text"})
```
In addition, you might occasionally run into problems using keyword, most notably when searching for elements by their class attribute, because class is a protected keyword in Python. That is, class is a reserved word in Python that cannot be used as a variable or argument name (no relation to the BeautifulSoup.findAll() keyword argument, previously discussed).2 For example, if you try the following call, you’ll get a syntax error due to the nonstandard use of class:
```python
bsObj.findAll(class="green")
```
Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:
```python
bsObj.findAll(class_="green")
```
Alternatively, you can enclose class in quotes:
```python
bsObj.findAll("", {"class": "green"})
```
At this point, you might be asking yourself, “But wait, don’t I already know how to get a list of tags by attribute, by passing attributes to the function in a dictionary?”

Recall that passing a list of tags to .findAll() via the attributes list acts as an “or” filter (i.e., it selects a list of all tags that have tag1 or tag2 or tag3...). If you have a lengthy list of tags, you can end up with a lot of stuff you don’t want. The keyword argument allows you to add an additional “and” filter to this.
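For instance, a sketch of that “and” behavior (the HTML and the id value here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Two green spans; only one carries the (invented) id we want
html = ('<span class="green" id="author">the prince</span>'
        '<span class="green">Anna Pavlovna</span>')
bsObj = BeautifulSoup(html, "html.parser")

by_class = bsObj.findAll("span", {"class": "green"})              # matches both
by_both = bsObj.findAll("span", {"class": "green"}, id="author")  # "and" filter
print(len(by_class), len(by_both))  # 2 1
```

The keyword argument narrows the attribute match rather than widening it.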
So far in the book, you’ve seen two types of objects in the BeautifulSoup library:

BeautifulSoup objects
    Instances seen in previous code examples as the variable bsObj

Tag objects
    Retrieved in lists, or individually, by calling find and findAll on a BeautifulSoup object, or drilling down, as in:

```python
bsObj.div.h1
```

However, there are two more objects in the library that, although less commonly used, are still important to know about:

NavigableString objects
    Used to represent text within tags, rather than the tags themselves (some functions operate on, and produce, NavigableStrings, rather than tag objects)

The Comment object
    Used to find HTML comments in comment tags, <!--like this one-->

These four objects are the only objects you will ever encounter (as of the time of this writing) in the BeautifulSoup library.
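You can verify the type of each object directly. A short sketch, using an invented snippet that contains both text and a comment:

```python
from bs4 import BeautifulSoup

# Invented snippet containing a text node and an HTML comment
html = "<p>text<!--a comment--></p>"
bsObj = BeautifulSoup(html, "html.parser")

p = bsObj.p
print(type(bsObj).__name__)          # BeautifulSoup
print(type(p).__name__)              # Tag
print(type(p.contents[0]).__name__)  # NavigableString
print(type(p.contents[1]).__name__)  # Comment
```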
The findAll function is responsible for finding tags based on their name and attribute. But what if you need to find a tag based on its location in a document? That’s where tree navigation comes in handy. In Chapter 1, we looked at navigating a BeautifulSoup tree in a single direction:
```python
bsObj.tag.subTag.anotherSubTag
```
Now let’s look at navigating up, across, and diagonally through HTML trees using our highly questionable online shopping site http://bit.ly/1KGe2Qk as an example page for scraping (see Figure 2-1).
The HTML for this page, mapped out as a tree (with some tags omitted for brevity), looks like:
```
html
    body
        div.wrapper
            table#giftList
                tr
                tr.gift#gift1
                    td
                    td
```
We will use this same HTML structure as an example in the next few sections.
In computer science and some branches of mathematics, you often hear about horrible things done to children: moving them, storing them, removing them, and even killing them. Fortunately, in BeautifulSoup, children are treated differently.
In the BeautifulSoup library, as well as many other libraries, there is a distinction drawn between children and descendants: much like in a human family tree, children are always exactly one tag below a parent, whereas descendants can be at any level in the tree below a parent. For example, the tr tags are children of the table tag, whereas tr, th, td, img, and span are all descendants of the table tag (at least in our example page). All children are descendants, but not all descendants are children.
In general, BeautifulSoup functions will always deal with the descendants of the current tag selected. For instance, bsObj.body.h1 selects the first h1 tag that is a descendant of the body tag. It will not find tags located outside of the body.

Similarly, bsObj.div.findAll("img") will find the first div tag in the document, then retrieve a list of all img tags that are descendants of that div tag.
If you want to find only descendants that are children, you can use the .children attribute:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)
```
This code prints out the list of product rows in the giftList table. If you were to write it using the .descendants attribute instead of .children, about two dozen tags would be found within the table and printed, including img tags, span tags, and individual td tags. It’s definitely important to differentiate between children and descendants!
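The difference is easy to see on a stripped-down stand-in for the gift table (this snippet is invented, echoing the structure of the example page, with whitespace omitted so no stray text nodes appear between tags):

```python
from bs4 import BeautifulSoup

# A miniature invented stand-in for the giftList table
html = "<table id='giftList'><tr><td><span>Gift</span></td></tr></table>"
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

children = list(table.children)        # just the tr
descendants = list(table.descendants)  # tr, td, span, and the text node
print(len(children), len(descendants))  # 1 4
```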
The BeautifulSoup next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
```
This code prints all rows of products from the product table, except for the first title row. Why does the title row get skipped? Two reasons: first, objects cannot be siblings with themselves. Any time you get siblings of an object, the object itself will not be included in the list. Second, next_siblings returns subsequent siblings only. If we were to select a row in the middle of the list, for example, and call next_siblings on it, only the subsequent (next) siblings would be returned. So, by selecting the title row and calling next_siblings, we can select all the rows in the table, without selecting the title row itself.
Note that the preceding code would work just as well if we selected bsObj.table.tr, or even just bsObj.tr, to get the first row of the table. However, in the code, I go through all of the trouble of writing everything out in a longer form:
```python
bsObj.find("table", {"id": "giftList"}).tr
```
Even if it looks like there’s just one table (or other target tag) on the page, it’s easy to miss things. In addition, page layouts change all the time. What was once the first of its kind on the page, might someday be the second or third tag of that type found on the page. To make your scrapers more robust, it’s best to be as specific as possible when making tag selections. Take advantage of tag attributes when they are available.
As a complement to next_siblings, the previous_siblings attribute can often be helpful if there is an easily selectable tag at the end of a list of sibling tags that you would like to get.

And, of course, there are next_sibling and previous_sibling, which perform nearly the same function as next_siblings and previous_siblings, except they return a single tag rather than a list of them.
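One thing to watch for with the singular versions: whitespace between tags is itself a NavigableString, and it counts as a sibling. A small invented example:

```python
from bs4 import BeautifulSoup

# The space between the spans becomes a text-node sibling
html = "<div><span>one</span> <span>two</span></div>"
bsObj = BeautifulSoup(html, "html.parser")

first = bsObj.find("span")
print(repr(first.next_sibling))                    # ' ' (the text node)
print(first.find_next_sibling("span").get_text())  # two
```

When you want the next sibling tag specifically, find_next_sibling("span") skips over intervening text nodes.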
When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. Typically, when we look at HTML pages with the goal of crawling them, we start by looking at the top layer of tags, and then figure out how to drill our way down into the exact piece of data that we want. Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, .parent and .parents. For example:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
          .parent.previous_sibling.get_text())
```
This code will print out the price of the object represented by the image at the location ../img/gifts/img1.jpg (in this case, the price is “$15.00”).
How does this work? The following diagram represents the tree structure of the portion of the HTML page we are working with, with numbered steps:

```
<tr>
    <td>
    <td>
    <td> (3)
        "$15.00" (4)
    <td> (2)
        <img src="../img/gifts/img1.jpg"> (1)
```

First, the image tag is selected (1). Next, we select its parent, the <td> tag that contains it (2). Then we select the previous_sibling of that <td>, the <td> tag containing the price (3), and call get_text() on it to retrieve “$15.00” (4).
As the old computer-science joke goes: “Let’s say you have a problem, and you decide to solve it with regular expressions. Well, now you have two problems.”
Unfortunately, regular expressions (often shortened to regex) are often taught using large tables of random symbols, strung together to look like a lot of nonsense. This tends to drive people away, and later they get out into the workforce and write needlessly complicated searching and filtering functions, when all they needed was a one-line regular expression in the first place!
Fortunately for you, regular expressions are not all that difficult to get up and running with quickly, and can easily be learned by looking at and experimenting with a few simple examples.
Regular expressions are so called because they are used to identify regular strings; that is, they can definitively say, “Yes, this string you’ve given me follows the rules, and I’ll return it,” or “This string does not follow the rules, and I’ll discard it.” This can be exceptionally handy for quickly scanning large documents to look for strings that look like phone numbers or email addresses.
Notice that I used the phrase regular string. What is a regular string? It’s any string that can be generated by a series of linear rules,3 such as:

1. Write the letter “a” at least once.
2. Append to this the letter “b” exactly five times.
3. Append to this the letter “c” any even number of times.
4. Optionally, write the letter “d” at the end.

Strings that follow these rules are: “aaaabbbbbccccd,” “aabbbbbcc,” and so on (there are an infinite number of variations).
Regular expressions are merely a shorthand way of expressing these sets of rules. For instance, here’s the regular expression for the series of steps just described:
```
aa*bbbbb(cc)*(d|)
```
This string might seem a little daunting at first, but it becomes clearer when we break it down into its components:

aa*
    The letter a is written, followed by a* (the letter a, written any number of times, including zero), guaranteeing that the letter a is written at least once.
bbbbb
    No special effects here, just the letter b written five times in a row.
(cc)*
    Any even number of things can be grouped into pairs, so to enforce the rule about even numbers of c’s, you can write two c’s, surround them in parentheses, and follow them with an asterisk, meaning any number of pairs of c’s (including zero).
(d|)
    Adding a bar between two expressions means “this thing or that thing.” Here we are saying “add a d, or add nothing,” guaranteeing there is at most one d.
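You can check the whole pattern against Python’s re module; re.fullmatch requires the entire candidate string to match, and the candidate strings below are just made-up test cases:

```python
import re

pattern = re.compile(r"aa*bbbbb(cc)*(d|)")

# The first three follow the rules; "aabbbb" has only four b's
for candidate in ["aaaabbbbbccccd", "aabbbbbcc", "abbbbb", "aabbbb"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```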
When learning how to write regular expressions, it’s critical to play around with them and get a feel for how they work.
If you don’t feel like firing up a code editor, writing a few lines, and running your program in order to see if a regular expression works as expected, you can go to a website such as RegexPal and test your regular expressions on the fly.
One classic example of regular expressions can be found in the practice of identifying email addresses. Although the exact rules governing email addresses vary slightly from mail server to mail server, we can create a few general rules. The corresponding regular expression for each of these rules is shown in the second column:
| Rule | Regular expression |
| --- | --- |
| 1. The first part of the address contains at least one of the following: uppercase letters, lowercase letters, the numbers 0–9, periods (.), plus signs (+), or underscores (_) | [A-Za-z0-9._+]+ |
| 2. After that, the address contains the @ symbol | @ |
| 3. The domain then must contain at least one uppercase or lowercase letter | [A-Za-z]+ |
| 4. This is followed by a period (.) | \. |
| 5. Finally, the address ends with com, org, edu, or net | (com\|org\|edu\|net) |
By concatenating all of the rules, we arrive at the regular expression:
```
[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)
```
When attempting to write any regular expression from scratch, it’s best to first make a list of steps that concretely outlines what your target string looks like. Pay attention to edge cases. For instance, if you’re identifying phone numbers, are you considering country codes and extensions?
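A quick sketch of the finished email pattern in action (the address strings are invented test cases, and fullmatch checks the whole string):

```python
import re

emailPattern = re.compile(r"[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)")

# Invented candidates: two valid addresses and one non-address
for candidate in ["ryan@oreilly.com", "user+tag@example.org", "not-an-email"]:
    print(candidate, bool(emailPattern.fullmatch(candidate)))
```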
Table 2-1 lists some commonly used regular expression symbols, with a brief explanation and example. This list is by no means complete, and as mentioned before, you might encounter slight variations from language to language. However, these 12 symbols are the most commonly used regular expressions in Python, and can be used to find and collect almost any type of string.
| Symbol(s) | Meaning | Example | Example matches |
| --- | --- | --- | --- |
| * | Matches the preceding character, subexpression, or bracketed character, 0 or more times | a*b* | aaaaaaaa, aaabbbbb, bbbbbb |
| + | Matches the preceding character, subexpression, or bracketed character, 1 or more times | a+b+ | aaaaaaaab, aaabbbbb, abbbbbb |
| [] | Matches any character within the brackets (i.e., “pick any one of these things”) | [A-Z]* | APPLE, CAPITALS, QWERTY |
| () | A grouped subexpression (these are evaluated first, in the “order of operations” of regular expressions) | (a*b)* | aaabaab, abaaab, ababaaaaab |
| {m,n} | Matches the preceding character, subexpression, or bracketed character between m and n times (inclusive) | a{2,3}b{2,3} | aabbb, aaabbb, aabb |
| [^] | Matches any single character that is not in the brackets | [^A-Z]* | apple, lowercase, qwerty |
| \| | Matches any character, string of characters, or subexpression, separated by the “\|” (note that this is a vertical bar, or “pipe,” not a capital “i”) | b(a\|i\|e)d | bad, bid, bed |
| . | Matches any single character (including symbols, numbers, a space, etc.) | b.d | bad, bzd, b$d, b d |
| ^ | Indicates that a character or subexpression occurs at the beginning of a string | ^a | apple, asdf, a |
| \ | An escape character (this allows you to use “special” characters with their literal meaning) | \.\| | .\| |
| $ | Often used at the end of a regular expression, it means “match this up to the end of the string.” Without it, every regular expression has a de facto “.*” at the end of it, accepting strings where only the first part of the string matches. This can be thought of as analogous to the ^ symbol. | [A-Z]*[a-z]*$ | ABCabc, zzzyx, Bob |
| ?! | “Does not contain.” This odd pairing of symbols, immediately preceding a character (or regular expression), indicates that that character should not be found in that specific place in the larger string. This can be tricky to use; after all, the character might be found in a different part of the string. If trying to eliminate a character entirely, use in conjunction with a ^ and $ at either end. | ^((?![A-Z]).)*$ | no-caps-here, $ymb0ls a4e f!ne |
The standard version of regular expressions (the one we are covering in this book, and that is used by Python and BeautifulSoup) is based on syntax used by Perl. Most modern programming languages use this or one very similar to it. Be aware, however, that if you are using regular expressions in another language, you might encounter problems. Even some modern languages, such as Java, have slight differences in the way they handle regular expressions. When in doubt, read the docs!
If the previous section on regular expressions seemed a little disjointed from the mission of this book, here’s where it all ties together. BeautifulSoup and regular expressions go hand in hand when it comes to scraping the Web. In fact, most functions that take in a string argument (e.g., find(id="aTagIdHere")) will also take in a regular expression just as well.
Let’s take a look at some examples, scraping the page found at http://bit.ly/1KGe2Qk.
Notice that there are many product images on the site—they take the following form:
```html
<img src="../img/gifts/img3.jpg">
```
If we wanted to grab URLs to all of the product images, it might seem fairly straightforward at first: just grab all the image tags using .findAll("img"), right? But there’s a problem. In addition to the obvious “extra” images (e.g., logos), modern websites often have hidden images, blank images used for spacing and aligning elements, and other random image tags you might not be aware of. Certainly, you can’t count on the only images on the page being product images.
Let’s also assume that the layout of the page might change, or that, for whatever reason, we don’t want to depend on the position of the image in the page in order to find the correct tag. This might be the case when you are trying to grab specific elements or pieces of data that are scattered randomly throughout a website. For instance, there might be a featured product image in a special layout at the top of some pages, but not others.
The solution is to look for something identifying about the tag itself. In this case, we can look at the file path of the product images:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
```
This prints out only the relative image paths that start with ../img/gifts/img and end in .jpg, the output of which is the following:
```
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
```
A regular expression can be inserted as any argument in a BeautifulSoup expression, allowing you a great deal of flexibility in finding target elements.
So far, we’ve looked at how to access and filter tags and access content within them. However, very often in web scraping you’re not looking for the content of a tag; you’re looking for its attributes. This becomes especially useful for tags such as <a>, where the URL it is pointing to is contained within the href attribute, or the <img> tag, where the target image is contained within the src attribute.
With tag objects, the full set of attributes can be automatically accessed by calling:

```python
myTag.attrs
```

Keep in mind that this literally returns a Python dictionary object, which makes retrieval and manipulation of these attributes trivial. The source location for an image, for example, can be found using the following line:

```python
myImgTag.attrs['src']
```
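For example, a small self-contained sketch (the img tag below is invented, mirroring the product images on the example page):

```python
from bs4 import BeautifulSoup

# Invented tag, echoing the example page's product images
html = '<img src="../img/gifts/img1.jpg" style="float:left">'
myImgTag = BeautifulSoup(html, "html.parser").img

print(myImgTag.attrs)         # a plain dict of the tag's attributes
print(myImgTag.attrs["src"])  # ../img/gifts/img1.jpg
```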
If you have a formal education in computer science, you probably learned about lambda expressions once in school and then never used them again. If you don’t, they might be unfamiliar to you (or familiar only as “that thing I’ve been meaning to learn at some point”). In this section, we won’t go deeply into these extremely useful functions, but we will look at a few examples of how they can be useful in web scraping.
Essentially, a lambda expression is a function that is passed into another function as a variable; that is, instead of defining a function as f(x, y), you may define a function as f(g(x), y), or even f(g(x), h(x)).
BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to “true” are returned while the rest are discarded.
For example, the following retrieves all tags that have exactly two attributes:
soup.findAll(lambda tag: len(tag.attrs) == 2)
That is, it will find tags such as the following:

```html
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
```
If you’re comfortable with writing a little code, lambda selectors in BeautifulSoup can act as a great substitute for regular expressions.
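For instance, the product-image filter we wrote earlier with re.compile could be sketched as a lambda instead (the HTML here is invented, echoing the example page):

```python
from bs4 import BeautifulSoup

# Invented snippet: one gift image and one unrelated logo
html = ('<img src="../img/gifts/img1.jpg">'
        '<img src="../img/logo.jpg">')
bsObj = BeautifulSoup(html, "html.parser")

# Keep only img tags whose src sits in the gifts directory
gifts = bsObj.findAll(lambda tag: tag.name == "img" and
                      tag.get("src", "").startswith("../img/gifts/"))
print([tag["src"] for tag in gifts])  # ['../img/gifts/img1.jpg']
```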
Although BeautifulSoup is used throughout this book (and is one of the most popular HTML libraries available for Python), keep in mind that it’s not the only option. If BeautifulSoup does not meet your needs, check out these other widely used libraries:
1 If you’re looking to get a list of all h<some_level> tags in the document, there are more succinct ways of writing this code to accomplish the same thing. We’ll take a look at other ways of approaching these types of problems in the section BeautifulSoup and regular expressions.
2 The Python Language Reference provides a complete list of protected keywords.
3 You might be asking yourself, “Are there ‘irregular’ expressions?” Nonregular expressions are beyond the scope of this book, but they encompass strings such as “write a prime number of a’s, followed by exactly twice that number of b’s” or “write a palindrome.” It’s impossible to identify strings of this type with a regular expression. Fortunately, I’ve never been in a situation where my web scraper needed to identify these kinds of strings.