When Michelangelo was asked how he could sculpt a work of art as masterful as his David, he is famously reported to have said: “It is easy. You just chip away the stone that doesn’t look like David.”
Although web scraping is unlike marble sculpting in most other respects, we must take a similar attitude when it comes to extracting the information we’re seeking from complicated web pages. There are many techniques to chip away the content that doesn’t look like the content we’re searching for, until we arrive at the information we’re seeking. In this chapter, we’ll take a look at parsing complicated HTML pages in order to extract only the information we’re looking for.
It can be tempting, when faced with a Gordian Knot of tags, to dive right in and use multiline statements to try to extract your information. However, keep in mind that layering the techniques used in this section with reckless abandon can lead to code that is difficult to debug, fragile, or both. Before getting started, let’s take a look at some of the ways you can avoid altogether the need for advanced HTML parsing!
Let’s say you have some target content. Maybe it’s a name, statistic, or block of text. Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML attributes to be found. Let’s say you dive right in and write something like the following line to attempt extraction:
```python
bsObj.findAll("table")[4].findAll("tr")[2].find("td").findAll("div")[1].find("a")
```
That doesn’t look so great. In addition to the aesthetics of the line, even the slightest change to the website by a site administrator might break your web scraper altogether. So what are your options?
Especially when faced with buried or poorly formatted data, it’s important not to just start digging. Take a deep breath and think of alternatives. If you’re certain no alternatives exist, the rest of this chapter is for you.
In Chapter 1, we took a quick look at installing and running BeautifulSoup, as well as selecting objects one at a time. In this section, we’ll discuss searching for tags by attributes, working with lists of tags, and navigating parse trees.
Nearly every website you encounter contains stylesheets. Although you might think that a layer of styling designed specifically for browser and human interpretation would be a bad thing for web scrapers, the advent of CSS is actually a boon for them. CSS relies on the differentiation of HTML elements that might otherwise have the exact same markup in order to style them differently. That is, some tags might look like this:
```html
<span class="green"></span>
```
while others look like this:
```html
<span class="red"></span>
```
Web scrapers can easily separate these two different tags based on their class; for example, they might use BeautifulSoup to grab all of the red text but none of the green text. Because CSS relies on these identifying attributes to style sites appropriately, you are almost guaranteed that these class and ID attributes will be plentiful on most modern websites.
Let’s create an example web scraper that scrapes the page located at http://bit.ly/1Ge96Rw.
On this page, the lines spoken by characters in the story are in red, whereas the names of the characters themselves are in green. You can see the span tags, which reference the appropriate CSS classes, in the following sample of the page’s source code:
```html
"<span class="red">Heavens! what a virulent attack!</span>" replied
<span class="green">the prince</span>, not in the least disconcerted
by this reception.
```
We can grab the entire page and create a BeautifulSoup object with it using a program similar to the one used in Chapter 1:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)
```
Using this BeautifulSoup object, we can use the findAll function to extract a Python list of proper nouns found by selecting only the text within <span class="green"></span> tags (findAll is an extremely flexible function we’ll be using a lot later in this book):
```python
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
```
When run, it should list all the proper nouns in the text, in the order they appear in War and Peace. So what’s going on here? Previously, we’ve called bsObj.tagName in order to get the first occurrence of that tag on the page. Now, we’re calling bsObj.findAll(tagName, tagAttributes) in order to get a list of all of the tags on the page, rather than just the first.

After getting a list of names, the program iterates through all names in the list, and prints name.get_text() in order to separate the content from the tags.

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.
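As a quick illustration of that flattening, here is a minimal sketch using a small invented HTML snippet (not taken from the example site; the parser name is passed explicitly to keep the snippet self-contained):

```python
from bs4 import BeautifulSoup

# A small invented snippet, not from the example site
html = '<div><p>Some <a href="/link">linked</a> text.</p></div>'
bsObj = BeautifulSoup(html, "html.parser")

# All tags are stripped away; only the text content remains
text = bsObj.div.get_text()
print(text)  # Some linked text.
```

Notice that the hyperlink markup disappears entirely, which is exactly why you want to hold off on calling it until the end.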
BeautifulSoup’s find() and findAll() are the two functions you will likely use the most. With them, you can easily filter HTML pages to find lists of desired tags, or a single tag, based on their various attributes.

The two functions are extremely similar, as evidenced by their definitions in the BeautifulSoup documentation:
```python
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
```
In all likelihood, 95% of the time you will find yourself only needing to use the first two arguments: tag and attributes. However, let’s take a look at all of the arguments in greater detail.

The tag argument is one that we’ve seen before; you can pass a string name of a tag or even a Python list of string tag names. For example, the following will return a list of all the header tags in a document:1
```python
.findAll({"h1", "h2", "h3", "h4", "h5", "h6"})
```
The attributes argument takes a Python dictionary of attributes and matches tags that contain any one of those attributes. For example, the following function would return both the green and red span tags in the HTML document:
```python
.findAll("span", {"class": {"green", "red"}})
```

(Note that a single attribute can be mapped to a set of acceptable values, as shown here. A dictionary literal that repeats the key, such as {"class": "green", "class": "red"}, would silently keep only the last value and match only the red tags.)
The recursive argument is a boolean. How deeply into the document do you want to go? If recursive is set to True, the findAll function looks into children, and children’s children, for tags that match your parameters. If it is set to False, it will look only at the top-level tags in your document. By default, findAll works recursively (recursive is set to True); it’s generally a good idea to leave this as is, unless you really know what you need to do and performance is an issue.
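A minimal sketch of the difference, using an invented nested snippet:

```python
from bs4 import BeautifulSoup

# Invented HTML: one span nested inside another
html = "<div><span>outer<span>inner</span></span></div>"
bsObj = BeautifulSoup(html, "html.parser")

# Default (recursive=True): matches spans at any depth below the div
all_spans = bsObj.div.findAll("span")
# recursive=False: examines only the div's direct children
top_spans = bsObj.div.findAll("span", recursive=False)
print(len(all_spans), len(top_spans))  # 2 1
```

The inner span is found only when the search is allowed to recurse.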
The text argument is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves. For instance, if we want to find the number of times “the prince” was surrounded by tags on the example page, we could replace our .findAll() function in the previous example with the following lines:
```python
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
```
The output of this is “7.”
The limit argument, of course, is only used in the findAll method; find is equivalent to the same findAll call, with a limit of 1. You might set this if you’re only interested in retrieving the first x items from the page. Be aware, however, that this gives you the first items on the page in the order that they occur, not necessarily the first ones that you want.
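A short sketch of limit in action, on an invented snippet with three paragraph tags:

```python
from bs4 import BeautifulSoup

# Invented snippet with three paragraph tags
html = "<p>first</p><p>second</p><p>third</p>"
bsObj = BeautifulSoup(html, "html.parser")

# limit=2 returns only the first two matches, in document order
first_two = bsObj.findAll("p", limit=2)
print([p.get_text() for p in first_two])  # ['first', 'second']
```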
The keyword argument allows you to select tags that contain a particular attribute. For example:
```python
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
```
The keyword argument can be very helpful in some situations. However, it is technically redundant as a BeautifulSoup feature. Keep in mind that anything that can be done with keyword can also be accomplished using techniques we will discuss later in this chapter (see Regular Expressions and Lambda Expressions).

For instance, the following two lines are identical:
```python
bsObj.findAll(id="text")
bsObj.findAll("", {"id": "text"})
```
In addition, you might occasionally run into problems using keyword, most notably when searching for elements by their class attribute, because class is a protected keyword in Python. That is, class is a reserved word in Python that cannot be used as a variable or argument name (no relation to the BeautifulSoup.findAll() keyword argument, previously discussed).2 For example, if you try the following call, you’ll get a syntax error due to the nonstandard use of class:
```python
bsObj.findAll(class="green")
```
Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:
```python
bsObj.findAll(class_="green")
```
Alternatively, you can enclose class in quotes:
```python
bsObj.findAll("", {"class": "green"})
```
At this point, you might be asking yourself, “But wait, don’t I already know how to get a list of tags by attribute, by passing attributes to the function in a dictionary?”

Recall that passing a list of tags to .findAll() via the attributes list acts as an “or” filter (i.e., it selects a list of all tags that have tag1 or tag2 or tag3...). If you have a lengthy list of tags, you can end up with a lot of stuff you don’t want. The keyword argument allows you to add an additional “and” filter to this.
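For instance, a sketch of that “and” behavior (the HTML and the id value here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Two green spans; only one carries the (invented) id we want
html = ('<span class="green" id="author">the prince</span>'
        '<span class="green">Anna Pavlovna</span>')
bsObj = BeautifulSoup(html, "html.parser")

by_class = bsObj.findAll("span", {"class": "green"})              # matches both
by_both = bsObj.findAll("span", {"class": "green"}, id="author")  # "and" filter
print(len(by_class), len(by_both))  # 2 1
```

The keyword argument narrows the attribute match rather than widening it.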
So far in the book, you’ve seen two types of objects in the BeautifulSoup library:

BeautifulSoup objects
    Instances seen in previous code examples as the variable bsObj

Tag objects
    Retrieved in lists, or individually, by calling find and findAll on a BeautifulSoup object, or drilling down, as in:

```python
bsObj.div.h1
```

However, there are two more objects in the library that, although less commonly used, are still important to know about:

NavigableString objects
    Used to represent text within tags, rather than the tags themselves (some functions operate on, and produce, NavigableStrings, rather than tag objects)

The Comment object
    Used to find HTML comments in comment tags, <!--like this one-->

These four objects are the only objects you will ever encounter (as of the time of this writing) in the BeautifulSoup library.
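You can verify the type of each object directly. A short sketch, using an invented snippet that contains both text and a comment:

```python
from bs4 import BeautifulSoup

# Invented snippet containing a text node and an HTML comment
html = "<p>text<!--a comment--></p>"
bsObj = BeautifulSoup(html, "html.parser")

p = bsObj.p
print(type(bsObj).__name__)          # BeautifulSoup
print(type(p).__name__)              # Tag
print(type(p.contents[0]).__name__)  # NavigableString
print(type(p.contents[1]).__name__)  # Comment
```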
The findAll function is responsible for finding tags based on their name and attribute. But what if you need to find a tag based on its location in a document? That’s where tree navigation comes in handy. In Chapter 1, we looked at navigating a BeautifulSoup tree in a single direction:
```python
bsObj.tag.subTag.anotherSubTag
```
Now let’s look at navigating up, across, and diagonally through HTML trees using our highly questionable online shopping site http://bit.ly/1KGe2Qk as an example page for scraping (see Figure 2-1).
The HTML for this page, mapped out as a tree (with some tags omitted for brevity), looks like:
```
html
    body
        div.wrapper
            table#giftList
                tr
                tr.gift#gift1
                    td
                    td
```
We will use this same HTML structure as an example in the next few sections.
In computer science and some branches of mathematics, you often hear about horrible things done to children: moving them, storing them, removing them, and even killing them. Fortunately, in BeautifulSoup, children are treated differently.
In the BeautifulSoup library, as well as many other libraries, there is a distinction drawn between children and descendants: much like in a human family tree, children are always exactly one tag below a parent, whereas descendants can be at any level in the tree below a parent. For example, the tr tags are children of the table tag, whereas tr, th, td, img, and span are all descendants of the table tag (at least in our example page). All children are descendants, but not all descendants are children.
In general, BeautifulSoup functions will always deal with the descendants of the current tag selected. For instance, bsObj.body.h1 selects the first h1 tag that is a descendant of the body tag. It will not find tags located outside of the body.

Similarly, bsObj.div.findAll("img") will find the first div tag in the document, then retrieve a list of all img tags that are descendants of that div tag.
If you want to find only descendants that are children, you can use the .children attribute:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)
```
This code prints out the list of product rows in the giftList table. If you were to write it using the .descendants attribute instead of .children, about two dozen tags would be found within the table and printed, including img tags, span tags, and individual td tags. It’s definitely important to differentiate between children and descendants!
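The difference is easy to see on a stripped-down stand-in for the gift table (this snippet is invented, echoing the structure of the example page, with whitespace omitted so no stray text nodes appear between tags):

```python
from bs4 import BeautifulSoup

# A miniature invented stand-in for the giftList table
html = "<table id='giftList'><tr><td><span>Gift</span></td></tr></table>"
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

children = list(table.children)        # just the tr
descendants = list(table.descendants)  # tr, td, span, and the text node
print(len(children), len(descendants))  # 1 4
```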
The BeautifulSoup next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
```
This code prints all rows of products from the product table, except for the first title row. Why does the title row get skipped? Two reasons: first, objects cannot be siblings with themselves. Any time you get siblings of an object, the object itself will not be included in the list. Second, next_siblings returns subsequent siblings only. If we were to select a row in the middle of the list, for example, and call next_siblings on it, only the subsequent (next) siblings would be returned. So, by selecting the title row and calling next_siblings, we can select all the rows in the table, without selecting the title row itself.
Note that the preceding code would work just as well if we selected bsObj.table.tr, or even just bsObj.tr, to get the first row of the table. However, in the code, I go through all of the trouble of writing everything out in a longer form:
```python
bsObj.find("table", {"id": "giftList"}).tr
```
Even if it looks like there’s just one table (or other target tag) on the page, it’s easy to miss things. In addition, page layouts change all the time. What was once the first of its kind on the page, might someday be the second or third tag of that type found on the page. To make your scrapers more robust, it’s best to be as specific as possible when making tag selections. Take advantage of tag attributes when they are available.
As a complement to next_siblings, the previous_siblings attribute can often be helpful if there is an easily selectable tag at the end of a list of sibling tags that you would like to get.

And, of course, there are next_sibling and previous_sibling, which perform nearly the same function as next_siblings and previous_siblings, except they return a single tag rather than a list of them.
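One thing to watch for with the singular versions: whitespace between tags is itself a NavigableString, and it counts as a sibling. A small invented example:

```python
from bs4 import BeautifulSoup

# The space between the spans becomes a text-node sibling
html = "<div><span>one</span> <span>two</span></div>"
bsObj = BeautifulSoup(html, "html.parser")

first = bsObj.find("span")
print(repr(first.next_sibling))                    # ' ' (the text node)
print(first.find_next_sibling("span").get_text())  # two
```

When you want the next sibling tag specifically, find_next_sibling("span") skips over intervening text nodes.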
When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. Typically, when we look at HTML pages with the goal of crawling them, we start by looking at the top layer of tags, and then figure out how to drill our way down into the exact piece of data that we want. Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, .parent and .parents. For example:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
          .parent.previous_sibling.get_text())
```
This code will print out the price of the object represented by the image at the location ../img/gifts/img1.jpg (in this case, the price is “$15.00”).
How does this work? The following diagram represents the tree structure of the portion of the HTML page we are working with, with numbered steps:

```
<tr>
    <td>
    <td>
    <td> (3)
        "$15.00" (4)
    <td> (2)
        <img src="../img/gifts/img1.jpg"> (1)
```

First, the image tag is selected (1). Next, we select its parent, the <td> tag that contains it (2). Then we select the previous_sibling of that <td>, the <td> tag containing the price (3), and call get_text() on it to retrieve “$15.00” (4).
As the old computer-science joke goes: “Let’s say you have a problem, and you decide to solve it with regular expressions. Well, now you have two problems.”
Unfortunately, regular expressions (often shortened to regex) are often taught using large tables of random symbols, strung together to look like a lot of nonsense. This tends to drive people away, and later they get out into the workforce and write needlessly complicated searching and filtering functions, when all they needed was a one-line regular expression in the first place!
Fortunately for you, regular expressions are not all that difficult to get up and running with quickly, and can easily be learned by looking at and experimenting with a few simple examples.
Regular expressions are so called because they are used to identify regular strings; that is, they can definitively say, “Yes, this string you’ve given me follows the rules, and I’ll return it,” or “This string does not follow the rules, and I’ll discard it.” This can be exceptionally handy for quickly scanning large documents to look for strings that look like phone numbers or email addresses.
Notice that I used the phrase regular string. What is a regular string? It’s any string that can be generated by a series of linear rules,3 such as:

1. Write the letter “a” at least once.
2. Append to this the letter “b” exactly five times.
3. Append to this the letter “c” any even number of times.
4. Optionally, write the letter “d” at the end.

Strings that follow these rules are: “aaaabbbbbccccd,” “aabbbbbcc,” and so on (there are an infinite number of variations).
Regular expressions are merely a shorthand way of expressing these sets of rules. For instance, here’s the regular expression for the series of steps just described:
```
aa*bbbbb(cc)*(d|)
```
This string might seem a little daunting at first, but it becomes clearer when we break it down into its components:

aa*
    The letter a is written, followed by a* (the letter a, written any number of times, including zero), guaranteeing that the letter a is written at least once.
bbbbb
    No special effects here, just the letter b written five times in a row.
(cc)*
    Any even number of things can be grouped into pairs, so to enforce the rule about even numbers of c’s, you can write two c’s, surround them in parentheses, and follow them with an asterisk, meaning any number of pairs of c’s (including zero).
(d|)
    Adding a bar between two expressions means “this thing or that thing.” Here we are saying “add a d, or add nothing,” guaranteeing there is at most one d.
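You can check the whole pattern against Python’s re module; re.fullmatch requires the entire candidate string to match, and the candidate strings below are just made-up test cases:

```python
import re

pattern = re.compile(r"aa*bbbbb(cc)*(d|)")

# The first three follow the rules; "aabbbb" has only four b's
for candidate in ["aaaabbbbbccccd", "aabbbbbcc", "abbbbb", "aabbbb"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```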
When learning how to write regular expressions, it’s critical to play around with them and get a feel for how they work.
If you don’t feel like firing up a code editor, writing a few lines, and running your program in order to see if a regular expression works as expected, you can go to a website such as RegexPal and test your regular expressions on the fly.
One classic example of regular expressions can be found in the practice of identifying email addresses. Although the exact rules governing email addresses vary slightly from mail server to mail server, we can create a few general rules. The corresponding regular expression for each of these rules is shown in the second column:
| Rule | Regular expression |
| --- | --- |
| 1. The first part of the address contains at least one of the following: uppercase letters, lowercase letters, the numbers 0–9, periods (.), plus signs (+), or underscores (_) | [A-Za-z0-9._+]+ |
| 2. After that, the address contains the @ symbol | @ |
| 3. The domain then must contain at least one uppercase or lowercase letter | [A-Za-z]+ |
| 4. This is followed by a period (.) | \. |
| 5. Finally, the address ends with com, org, edu, or net | (com\|org\|edu\|net) |
By concatenating all of the rules, we arrive at the regular expression:
```
[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)
```
When attempting to write any regular expression from scratch, it’s best to first make a list of steps that concretely outlines what your target string looks like. Pay attention to edge cases. For instance, if you’re identifying phone numbers, are you considering country codes and extensions?
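A quick sketch of the finished email pattern in action (the address strings are invented test cases, and fullmatch checks the whole string):

```python
import re

emailPattern = re.compile(r"[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)")

# Invented candidates: two valid addresses and one non-address
for candidate in ["ryan@oreilly.com", "user+tag@example.org", "not-an-email"]:
    print(candidate, bool(emailPattern.fullmatch(candidate)))
```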
Table 2-1 lists some commonly used regular expression symbols, with a brief explanation and example. This list is by no means complete, and as mentioned before, you might encounter slight variations from language to language. However, these 12 symbols are the most commonly used regular expressions in Python, and can be used to find and collect almost any type of string.
| Symbol(s) | Meaning | Example | Example matches |
| --- | --- | --- | --- |
| * | Matches the preceding character, subexpression, or bracketed character, 0 or more times | a*b* | aaaaaaaa, aaabbbbb, bbbbbb |
| + | Matches the preceding character, subexpression, or bracketed character, 1 or more times | a+b+ | aaaaaaaab, aaabbbbb, abbbbbb |
| [] | Matches any character within the brackets (i.e., “pick any one of these things”) | [A-Z]* | APPLE, CAPITALS, QWERTY |
| () | A grouped subexpression (these are evaluated first, in the “order of operations” of regular expressions) | (a*b)* | aaabaab, abaaab, ababaaaaab |
| {m,n} | Matches the preceding character, subexpression, or bracketed character between m and n times (inclusive) | a{2,3}b{2,3} | aabbb, aaabbb, aabb |
| [^] | Matches any single character that is not in the brackets | [^A-Z]* | apple, lowercase, qwerty |
| \| | Matches any character, string of characters, or subexpression, separated by the “\|” (note that this is a vertical bar, or “pipe,” not a capital “i”) | b(a\|i\|e)d | bad, bid, bed |
| . | Matches any single character (including symbols, numbers, a space, etc.) | b.d | bad, bzd, b$d, b d |
| ^ | Indicates that a character or subexpression occurs at the beginning of a string | ^a | apple, asdf, a |
| \ | An escape character (this allows you to use “special” characters with their literal meaning) | \.\| | .\| |
| $ | Often used at the end of a regular expression, it means “match this up to the end of the string.” Without it, every regular expression has a de facto “.*” at the end of it, accepting strings where only the first part of the string matches. This can be thought of as analogous to the ^ symbol. | [A-Z]*[a-z]*$ | ABCabc, zzzyx, Bob |
| ?! | “Does not contain.” This odd pairing of symbols, immediately preceding a character (or regular expression), indicates that that character should not be found in that specific place in the larger string. This can be tricky to use; after all, the character might be found in a different part of the string. If trying to eliminate a character entirely, use in conjunction with a ^ and $ at either end. | ^((?![A-Z]).)*$ | no-caps-here, $ymb0ls a4e f!ne |
The standard version of regular expressions (the one we are covering in this book, and that is used by Python and BeautifulSoup) is based on syntax used by Perl. Most modern programming languages use this or one very similar to it. Be aware, however, that if you are using regular expressions in another language, you might encounter problems. Even some modern languages, such as Java, have slight differences in the way they handle regular expressions. When in doubt, read the docs!
If the previous section on regular expressions seemed a little disjointed from the mission of this book, here’s where it all ties together. BeautifulSoup and regular expressions go hand in hand when it comes to scraping the Web. In fact, most functions that take in a string argument (e.g., find(id="aTagIdHere")) will also take in a regular expression just as well.
Let’s take a look at some examples, scraping the page found at http://bit.ly/1KGe2Qk.
Notice that there are many product images on the site—they take the following form:
```html
<img src="../img/gifts/img3.jpg">
```
If we wanted to grab URLs to all of the product images, it might seem fairly straightforward at first: just grab all the image tags using .findAll("img"), right? But there’s a problem. In addition to the obvious “extra” images (e.g., logos), modern websites often have hidden images, blank images used for spacing and aligning elements, and other random image tags you might not be aware of. Certainly, you can’t count on the only images on the page being product images.
Let’s also assume that the layout of the page might change, or that, for whatever reason, we don’t want to depend on the position of the image in the page in order to find the correct tag. This might be the case when you are trying to grab specific elements or pieces of data that are scattered randomly throughout a website. For instance, there might be a featured product image in a special layout at the top of some pages, but not others.
The solution is to look for something identifying about the tag itself. In this case, we can look at the file path of the product images:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)

images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
```
This prints out only the relative image paths that start with ../img/gifts/img and end in .jpg, the output of which is the following:
```
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
```
A regular expression can be inserted as any argument in a BeautifulSoup expression, allowing you a great deal of flexibility in finding target elements.
So far, we’ve looked at how to access and filter tags and access content within them. However, very often in web scraping you’re not looking for the content of a tag; you’re looking for its attributes. This becomes especially useful for tags such as <a>, where the URL it is pointing to is contained within the href attribute, or the <img> tag, where the target image is contained within the src attribute.
With tag objects, the full set of attributes can be automatically accessed by calling:

```python
myTag.attrs
```

Keep in mind that this literally returns a Python dictionary object, which makes retrieval and manipulation of these attributes trivial. The source location for an image, for example, can be found using the following line:

```python
myImgTag.attrs['src']
```
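For example, a small self-contained sketch (the img tag below is invented, mirroring the product images on the example page):

```python
from bs4 import BeautifulSoup

# Invented tag, echoing the example page's product images
html = '<img src="../img/gifts/img1.jpg" style="float:left">'
myImgTag = BeautifulSoup(html, "html.parser").img

print(myImgTag.attrs)         # a plain dict of the tag's attributes
print(myImgTag.attrs["src"])  # ../img/gifts/img1.jpg
```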
If you have a formal education in computer science, you probably learned about lambda expressions once in school and then never used them again. If you don’t, they might be unfamiliar to you (or familiar only as “that thing I’ve been meaning to learn at some point”). In this section, we won’t go deeply into these extremely useful functions, but we will look at a few examples of how they can be useful in web scraping.
Essentially, a lambda expression is a function that is passed into another function as a variable; that is, instead of defining a function as f(x, y), you may define a function as f(g(x), y), or even f(g(x), h(x)).
BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to “true” are returned while the rest are discarded.
For example, the following retrieves all tags that have exactly two attributes:
soup.findAll(lambda tag: len(tag.attrs) == 2)
That is, it will find tags such as the following:

```html
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
```
If you’re comfortable with writing a little code, lambda selectors in BeautifulSoup can act as a great substitute for regular expressions.
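For instance, the product-image filter we wrote earlier with re.compile could be sketched as a lambda instead (the HTML here is invented, echoing the example page):

```python
from bs4 import BeautifulSoup

# Invented snippet: one gift image and one unrelated logo
html = ('<img src="../img/gifts/img1.jpg">'
        '<img src="../img/logo.jpg">')
bsObj = BeautifulSoup(html, "html.parser")

# Keep only img tags whose src sits in the gifts directory
gifts = bsObj.findAll(lambda tag: tag.name == "img" and
                      tag.get("src", "").startswith("../img/gifts/"))
print([tag["src"] for tag in gifts])  # ['../img/gifts/img1.jpg']
```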
Although BeautifulSoup is used throughout this book (and is one of the most popular HTML libraries available for Python), keep in mind that it’s not the only option. If BeautifulSoup does not meet your needs, check out these other widely used libraries:
1 If you’re looking to get a list of all h<some_level> tags in the document, there are more succinct ways of writing this code to accomplish the same thing. We’ll take a look at other ways of approaching these types of problems in the section BeautifulSoup and regular expressions.
2 The Python Language Reference provides a complete list of protected keywords.
3 You might be asking yourself, “Are there ‘irregular’ expressions?” Nonregular expressions are beyond the scope of this book, but they encompass strings such as “write a prime number of a’s, followed by exactly twice that number of b’s” or “write a palindrome.” It’s impossible to identify strings of this type with a regular expression. Fortunately, I’ve never been in a situation where my web scraper needed to identify these kinds of strings.