Extracting Data from a Web Page

The Internet has become a vast source of freely available data, no further away than the browser window on your home computer. While some resources on the Web are formatted for easy consumption by computer programs, the majority of content is intended for human readers using a browser application, with formatting done using HTML markup tags.

Sometimes you have your own Python script that needs to use tabular or reference data from a web page. If the data has not already been converted to easily processed comma-separated values or some other digestible format, you will need to write a parser that "reads around" the HTML tags and gets the actual text data.

It is very common to see postings on Usenet from people trying to use regular expressions for this task. For instance, someone trying to extract image reference tags from a web page might try matching the tag pattern "<img src=quoted_string>". Unfortunately, since HTML tags can contain many optional attributes, and since web browsers are very forgiving in processing sloppy HTML tags, HTML retrieved from the wild can be full of surprises to the unwary web page scraper. Here are some typical "gotchas" when trying to find HTML tags:

Tags with extra whitespace or of varying upper-/lowercase

<img src="sphinx.jpeg">, <IMG SRC="sphinx.jpeg">, and <img src = "sphinx.jpeg" > are all equivalent tags.

Tags with unexpected attributes

The IMG tag will often contain optional attributes, such as align, alt, id, vspace, hspace, height, width, etc.

Tag attributes in varying order

If the matching pattern is expanded to detect the attributes src, align, and alt, as in the tag <img src="sphinx.jpeg" align="top" alt="The Great Sphinx">, the attributes can appear in the tag in any order.

Tag attributes may or may not be enclosed in quotes

<img src="sphinx.jpeg"> can also be represented as <img src='sphinx.jpeg'> or <img src=sphinx.jpeg>.
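To see this fragility in action, here is a small sketch using Python's re module. The pattern and tag variants are taken from the examples above; a naive regular expression accepts only one of four equivalent tags:

```python
import re

# A naive pattern: lowercase tag, double-quoted src, no extra whitespace
naive = re.compile(r'<img src="([^"]+)">')

# All four of these are equivalent as far as a browser is concerned
tags = [
    '<img src="sphinx.jpeg">',
    '<IMG SRC="sphinx.jpeg">',
    '<img src = "sphinx.jpeg" >',
    "<img src='sphinx.jpeg'>",
]

# only the first variant survives the naive pattern
matches = [t for t in tags if naive.match(t)]
print(matches)
```

Handling every variant with regular expressions means anticipating case, whitespace, quoting style, and attribute order all at once, which is exactly the work pyparsing's helpers do for you.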

Pyparsing includes the helper method makeHTMLTags to make short work of defining standard expressions for opening and closing tags. To use this method, your program calls makeHTMLTags with the tag name as its argument, and makeHTMLTags returns pyparsing expressions for matching the opening and closing tags for the given tag name. But makeHTMLTags("X") goes far beyond simply returning the expressions Literal("<X>") and Literal("</X>"):

  • Tags may be upper- or lowercase.

  • Whitespace may appear anywhere in the tag.

  • Any number of attributes can be included, in any order.

  • Attribute values can be single-quoted, double-quoted, or unquoted strings.

  • Opening tags may include a terminating /, indicating no body text and no closing tag (specified by using the results name 'empty').

  • Tag and attribute names can include namespace references.

But perhaps the most powerful feature of the expressions returned by makeHTMLTags is that the parsed results include the opening tag's HTML attributes as named results, with the results names created dynamically while parsing.

Here is a short script that searches a web page for image references, printing a list of images and any provided alternate text:

from pyparsing import makeHTMLTags
import urllib

# read data from web page
url = ("https://www.cia.gov/library/"
       "publications/the-world-"
       "factbook/docs/refmaps.html")
html = urllib.urlopen(url).read()

# define expression for <img> tag
imgTag,endImgTag = makeHTMLTags("img")

# search for matching tags, and
# print key attributes
for img in imgTag.searchString(html):
    print "'%(alt)s' : %(src)s" % img

Notice that instead of using parseString, this script searches for matching text with searchString. For each match returned by searchString, the script prints the values of the alt and src tag attributes just as if they were attributes of the parsed tokens returned by the img expression.
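The dynamically named attribute results also tolerate the "gotchas" listed earlier. Here is a sketch (assuming the pyparsing package is installed; the markup is invented for illustration) that feeds deliberately sloppy tags to the same img expression:

```python
from pyparsing import makeHTMLTags

imgTag, endImgTag = makeHTMLTags("img")

# deliberately sloppy markup, echoing the "gotchas" above:
# uppercase tag/attribute names, and an unquoted value with stray spaces
html = ('<IMG SRC="sphinx.jpeg" alt="The Great Sphinx">'
        ' <img src = sphinx2.jpeg >')

for img in imgTag.searchString(html):
    print(img.src)
```

Note that attribute names are normalized to lowercase in the results, so img.src works even when the source HTML says SRC.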

This script just lists out images from the initial page of maps included in the online CIA Factbook. The output contains information on each map image reference, like this excerpt:

'Africa Map' : ../reference_maps/thumbnails/africa.jpg
'Antarctic Region Map' : ../reference_maps/thumbnails/antarctic.jpg
'Arctic Region Map' : ../reference_maps/thumbnails/arctic.jpg
'Asia Map' : ../reference_maps/thumbnails/asia.jpg
'Central America and Caribbean Map' : ../reference_maps/thumbnails/central_america.jpg
'Europe Map' : ../reference_maps/thumbnails/europe.jpg
...

The CIA Factbook web site also includes a more complicated web page, which lists the conversion factors for many common units of measure used around the world. Here are some sample data rows from this table:

ares                       square meters        100
ares                       square yards         119.599
barrels, US beer           gallons              31
barrels, US beer           liters               117.347 77
barrels, US petroleum      gallons (British)    34.97
barrels, US petroleum      gallons (US)         42
barrels, US petroleum      liters               158.987 29
barrels, US proof spirits  gallons              40
barrels, US proof spirits  liters               151.416 47
bushels (US)               bushels (British)    0.968 9
bushels (US)               cubic feet           1.244 456
bushels (US)               cubic inches         2,150.42

The corresponding HTML source for these rows is of the form:

<TR align="left" valign="top" bgcolor="#FFFFFF">
    <td width=33% valign=top class="Normal">ares </TD>
    <td width=33% valign=top class="Normal">square meters </TD>
    <td width=33% valign=top class="Normal">100 </TD>
  </TR>
  <TR align="left" valign="top" bgcolor="#CCCCCC">
    <td width=33% valign=top class="Normal">ares </TD>
    <td width=33% valign=top class="Normal">square yards </TD>
    <td width=33% valign=top class="Normal">119.599 </TD>
  </TR>
  <TR align="left" valign="top" bgcolor="#FFFFFF">
    <td width=33% valign=top class="Normal">barrels, US beer </TD>
    <td width=33% valign=top class="Normal">gallons </TD>
    <td width=33% valign=top class="Normal">31 </TD>
  </TR>
  <TR align="left" valign="top" bgcolor="#CCCCCC">
    <td width=33% valign=top class="Normal">barrels, US beer </TD>
    <td width=33% valign=top class="Normal">liters </TD>
    <td width=33% valign=top class="Normal">117.347 77 </TD>
  </TR>
...

Since we have some sample HTML to use as a template, we can create a simple BNF using shortcuts for opening and closing tags (meaning the results from makeHTMLTags, with the corresponding support for HTML attributes):

entry ::= <tr> conversionLabel conversionLabel conversionValue </tr>
conversionLabel ::= <td> text </td>
conversionValue ::= <td> readableNumber </td>

Note that the conversion factors are formatted for easy reading (by humans, that is):

  • Integer part is comma-separated on the thousands

  • Decimal part is space-separated on the thousandths

We can plan to include a parse action to reformat this text before calling float() to convert to a floating-point number. We will also need to post-process the text of the conversion labels; as we will find, these can contain embedded <BR> tags for explicit line breaks.

From a purely mechanical point of view, our script must begin by extracting the source text for the given URL. I usually find the Python urllib module to be sufficient for this task:

import urllib
url = ("https://www.cia.gov/library/publications/"
       "the-world-factbook/appendix/appendix-g.html")
html = urllib.urlopen(url).read()

At this point we have retrieved the web page's source HTML into our Python variable html as a single string. We will use this string later to scan for conversion factors.

But we've gotten a little ahead of ourselves—we need to set up our parser's grammar first! Let's start with the real numbers. Looking through this web page, there are numbers such as:

200
0.032 808 40
1,728
0.028 316 846 592
3,785.411 784

Here is an expression to match these numbers:

decimalNumber = Word(nums, nums+",") + Optional("." + OneOrMore(Word(nums)))

Notice that we are using a new form of the Word constructor, with two arguments instead of just one. When using this form, Word will use the first argument as the set of valid starting characters, and the second argument as the set of valid body characters. The given expression will match 1,000, but not ,456. This two-argument form of Word is useful when defining expressions for parsing identifiers from programming source code, such as this definition for a Python variable name:

Word(alphas+"_", alphanums+"_")
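A quick sketch of that identifier expression in action (assuming the pyparsing package is installed; the sample strings are invented):

```python
from pyparsing import Word, alphas, alphanums, ParseException

# first argument: valid starting characters; second: valid body characters
identifier = Word(alphas + "_", alphanums + "_")

print(identifier.parseString("_count2"))  # leading underscore is allowed

try:
    identifier.parseString("2nd_try")     # cannot start with a digit
except ParseException:
    print("not a valid identifier")
```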

Since decimalNumber is a working parser all by itself, we can test it in isolation before including it into a larger, more complicated expression.
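Such an isolation test can be a short loop over the sample numbers shown earlier (a sketch, assuming the pyparsing package is installed):

```python
from pyparsing import Word, nums, Optional, OneOrMore

decimalNumber = Word(nums, nums + ",") + Optional("." + OneOrMore(Word(nums)))

# the sample numbers taken from the web page
tests = ["200", "0.032 808 40", "1,728", "0.028 316 846 592", "3,785.411 784"]
for t in tests:
    print(decimalNumber.parseString(t))
```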

Using the list of sampled numbers, we get these results:

['200']
['0', '.', '032', '808', '40']
['1,728']
['0', '.', '028', '316', '846', '592']
['3,785', '.', '411', '784']

In order to convert these into floating-point values, we need to:

  • Join the individual token pieces together

  • Strip out the commas in the integer part

While these two steps could be combined into a single expression, I want to create two parse actions to show how parse actions can be chained together.

The first parse action will be called joinTokens, and can be performed by a lambda:

joinTokens = lambda tokens : "".join(tokens)

The next parse action will be called stripCommas. Being the next parse action in the chain, stripCommas will receive a single string (the output of joinTokens), so we will only need to work with the 0th element of the supplied tokens:

stripCommas = lambda tokens : tokens[0].replace(",", "")

And of course, we need a final parse action to do the conversion to float:

convertToFloat = lambda tokens : float(tokens[0])

Now, to assign multiple parse actions to an expression, we can use the pair of methods, setParseAction and addParseAction:

decimalNumber.setParseAction( joinTokens )
decimalNumber.addParseAction( stripCommas )
decimalNumber.addParseAction( convertToFloat )

Or, we can just call setParseAction listing multiple parse actions as separate arguments, and these will be defined as a chain of parse actions to be executed in the same order that they are given:

decimalNumber.setParseAction( joinTokens, stripCommas, convertToFloat )
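The chain can be traced by hand in plain Python (no pyparsing required), simulating how each action's return value becomes tokens[0] for the next action:

```python
joinTokens = lambda tokens: "".join(tokens)
stripCommas = lambda tokens: tokens[0].replace(",", "")
convertToFloat = lambda tokens: float(tokens[0])

tokens = ['3,785', '.', '411', '784']  # raw match for "3,785.411 784"
step1 = joinTokens(tokens)             # '3,785.411784'
step2 = stripCommas([step1])           # '3785.411784'
value = convertToFloat([step2])        # 3785.411784
print(value)
```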

Next, let's do a more thorough test by creating the expression that uses decimalNumber and scanning the complete HTML source.

tdStart,tdEnd = makeHTMLTags("td")
conversionValue = tdStart + decimalNumber + tdEnd

for tokens,start,end in conversionValue.scanString(html):
    print tokens

scanString is another parsing method that is especially useful when testing grammar fragments. While parseString works only with a complete grammar, beginning with the start of the input string and working until the grammar is completely matched, scanString scans through the input text, looking for bits of the text that match the grammar. Also, scanString is a generator function, which means it will return tokens as they are found rather than parsing all of the input text, so your program begins to report matching tokens right away. From the code sample, you can see that scanString returns the tokens and starting and ending locations for each match.
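scanString's behavior is easy to see on a toy grammar (a sketch, assuming the pyparsing package is installed; the input string is made up):

```python
from pyparsing import Word, nums

integer = Word(nums)
source = "abc 12 def 345 ghi"

# scanString is a generator: matches are yielded one at a time,
# along with their start and end locations in the input string
for tokens, start, end in integer.scanString(source):
    print(tokens, source[start:end])
```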

Here are the initial results from using scanString to test out the conversionValue expression:

['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 40.468564223999998, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 0.40468564223999998, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 43560.0, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 0.0040468564224000001, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 4046.8564224000002, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 0.0015625000000000001, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 4840.0, '</td>']
...

Well, all those parsed tokens from the attributes of the <TD> tags are certainly distracting. We should clean things up by adding a results name to the decimalNumber expression and just printing out that part:

conversionValue = tdStart + decimalNumber.setResultsName("factor") + tdEnd

for tokens,start,end in conversionValue.scanString(html):
    print tokens.factor

Now our output is plainly:

40.468564224
0.40468564224
43560.0
0.0040468564224
4046.8564224
0.0015625
4840.0
100.0
...

Also note from the absence of quotation marks that these are not strings, but converted floats. On to the remaining elements!

We've developed the expression to extract the conversion factors themselves, but these are of little use without knowing the "from" and "to" units. To parse these, we'll use an expression very similar to the one for extracting the conversion factors:

fromUnit = tdStart + units.setResultsName("fromUnit") + tdEnd
toUnit   = tdStart + units.setResultsName("toUnit") + tdEnd

But how will we define the units expression itself? Looking through the web page, this text doesn't show much of a recognizable pattern. We could try something like OneOrMore(Word(alphas)), but that would fail when trying to match units of "barrels, US petroleum" or "gallons (British)." Trying to add in punctuation marks sets us up for errors when we overlook a little-used mark and unknowingly skip over a valid conversion factor.

One thing we do know is that the units text ends when we reach the closing </TD> tag. With this knowledge, we can avoid trying to exhaustively define the pattern for units, and use a helpful pyparsing class, SkipTo. SkipTo collects all the intervening text, from the current parsing position to the location of the target expression, into a single string. Using SkipTo, we can define units simply as:

units = SkipTo( tdEnd )

We may end up having to do some post-processing on this text, such as trimming leading or trailing whitespace, but at least we won't omit some valid units, and we won't read past any closing </TD> tags.
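The behavior of SkipTo can be sketched on a single table cell (assuming the pyparsing package is installed; here a plain Literal stands in for the closing-tag expression that makeHTMLTags would produce):

```python
from pyparsing import SkipTo, Literal

tdEnd = Literal("</td>")
units = SkipTo(tdEnd)

# sample cell text from the rows shown earlier
cell = "barrels, US petroleum </td>"
print(units.parseString(cell))
```

Note that the skipped text is returned verbatim, trailing whitespace and all, which is why some post-processing is still needed.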

We are just about ready to complete our expression for extracting unit conversions; all that remains is to add the "from" and "to" unit expressions:

conversion = trStart + fromUnit + toUnit + conversionValue + trEnd

Repeating the test scanning process, we get the following:

for tokens,start,end in conversion.scanString(html):
    print "%(fromUnit)s : %(toUnit)s : %(factor)f" % tokens

acres  : ares  : 40.468564
acres  : hectares  : 0.404686
acres  : square feet  : 43560.000000
acres  : square kilometers  : 0.004047
acres  : square meters  : 4046.856422
...

This doesn't seem too bad, but further down the list, there are some formatting problems:

barrels, US petroleum  : liters  : 158.987290
barrels, US proof
                      spirits  : gallons  : 40.000000
barrels, US proof
                      spirits  : liters  : 151.416470
bushels (US)  : bushels (British)  : 0.968900

And even further down, we find these entries:

tons, net register  : cubic feet of permanently enclosed space <br>
                      for cargo and passengers  : 100.000000
tons, net register  : cubic meters of permanently enclosed space <br>
                      for cargo and passengers  : 2.831685

So, to clean up the units of measure, we need to strip out newlines and extra spaces, and remove embedded <br> tags. As you may have guessed, we'll use a parse action to do the job.

Our parse action has two tasks:

  • Remove <br> tags

  • Collapse whitespace and newlines

The simplest way in Python to collapse repeated whitespace is to use the str type's methods split followed by join. To remove the <br> tags, we will just use str.replace("<br>"," "). A single lambda for both these tasks will get a little difficult to follow, so this time we'll create an actual Python method, and attach it to the units expression:

def htmlCleanup(t):
    unitText = t[0]
    unitText = unitText.replace("<br>"," ")
    unitText = " ".join(unitText.split())
    return unitText

units.setParseAction(htmlCleanup)
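A quick check of htmlCleanup against one of the problem strings shown above (plain Python, no pyparsing required; the parse action receives the SkipTo text as t[0]):

```python
def htmlCleanup(t):
    unitText = t[0]
    unitText = unitText.replace("<br>", " ")       # drop embedded line breaks
    unitText = " ".join(unitText.split())          # collapse runs of whitespace
    return unitText

raw = ("cubic feet of permanently enclosed space <br>\n"
       "                      for cargo and passengers ")
cleaned = htmlCleanup([raw])
print(cleaned)
```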

With these changes, our conversion factor extractor can collect the unit conversion information. We can load it into a Python dict variable or a local database for further use by our program.

Here is the complete conversion factor extraction program:

import urllib
from pyparsing import *

url = ("https://www.cia.gov/library/"
       "publications/the-world-factbook/"
       "appendix/appendix-g.html")
page = urllib.urlopen(url)
html = page.read()
page.close()

tdStart,tdEnd = makeHTMLTags("td")
trStart,trEnd = makeHTMLTags("tr")
decimalNumber = Word(nums, nums+",") + Optional("." + OneOrMore(Word(nums)))
joinTokens = lambda tokens : "".join(tokens)
stripCommas = lambda tokens: tokens[0].replace(",","")
convertToFloat = lambda tokens: float(tokens[0])
decimalNumber.setParseAction( joinTokens, stripCommas, convertToFloat )

conversionValue = tdStart + decimalNumber.setResultsName("factor") + tdEnd

units = SkipTo(tdEnd)
def htmlCleanup(t):
    unitText = t[0]
    unitText = unitText.replace("<br>"," ")
    unitText = " ".join(unitText.split())
    return unitText
units.setParseAction(htmlCleanup)

fromUnit = tdStart + units.setResultsName("fromUnit") + tdEnd
toUnit   = tdStart + units.setResultsName("toUnit") + tdEnd
conversion = trStart + fromUnit + toUnit + conversionValue + trEnd

for tokens,start,end in conversion.scanString(html):
    print "%(fromUnit)s : %(toUnit)s : %(factor)s" % tokens