The Internet has become a vast source of freely available data, no further away than the browser window on your home computer. While some resources on the Web are formatted for easy consumption by computer programs, the majority of content is intended for human readers using a browser application, with formatting done using HTML markup tags.
Sometimes you have your own Python script that needs to use tabular or reference data from a web page. If the data has not already been converted to easily processed comma-separated values or some other digestible format, you will need to write a parser that "reads around" the HTML tags and gets the actual text data.
It is very common to see postings on Usenet from people trying to use regular expressions for this task. For instance, someone trying to extract image reference tags from a web page might try matching the tag pattern "<img src=quoted_string>".
. Unfortunately, since HTML tags can contain many optional attributes, and since web browsers are very forgiving in processing sloppy HTML tags, HTML retrieved from the wild can be full of surprises to the unwary web page scraper. Here are some typical "gotchas" when trying to find HTML tags:
<img src="sphinx.jpeg">, <IMG SRC="sphinx.jpeg">, and <img src = "sphinx.jpeg" > are all equivalent tags.
The IMG tag will often contain optional attributes, such as align, alt, id, vspace, hspace, height, width, etc.
If the matching pattern is expanded to detect the attributes src, align, and alt, as in the tag <img src="sphinx.jpeg" align="top" alt="The Great Sphinx">, the attributes can appear in the tag in any order.
<img src="sphinx.jpeg"> can also be represented as <img src='sphinx.jpeg'> or <img src=sphinx.jpeg>.
Pyparsing includes the helper method makeHTMLTags to make short work of defining standard expressions for opening and closing tags. To use this method, your program calls makeHTMLTags with the tag name as its argument, and makeHTMLTags returns pyparsing expressions for matching the opening and closing tags for the given tag name. But makeHTMLTags("X") goes far beyond simply returning the expressions Literal("<X>") and Literal("</X>"):
Tags may be upper- or lowercase.
Whitespace may appear anywhere in the tag.
Any number of attributes can be included, in any order.
Attribute values can be single-quoted, double-quoted, or unquoted strings.
Opening tags may include a terminating /, indicating no body text and no closing tag (specified by using the results name 'empty').
Tag and attribute names can include namespace references.
But perhaps the most powerful feature of the expressions returned by makeHTMLTags is that the parsed results include the opening tag's HTML attributes as named results, with the results names created dynamically during parsing.
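As a quick illustration of this feature, here is a minimal sketch (assuming the pyparsing package is installed) showing how the attributes of a sloppy, mixed-case tag become named results:

```python
from pyparsing import makeHTMLTags

# build expressions for <img> opening and closing tags
imgTag, imgEnd = makeHTMLTags("img")

# a mixed-case tag with single-quoted and unquoted attribute values,
# terminated with / to indicate an empty tag
result = imgTag.parseString("<IMG src='sphinx.jpeg' alt=Sphinx />")

# each HTML attribute becomes a dynamically created results name
assert result.src == "sphinx.jpeg"
assert result.alt == "Sphinx"

# the trailing / sets the 'empty' results name to True
assert result.empty
```

No attribute names were listed when defining imgTag; the results names src and alt were created while parsing.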
Here is a short script that searches a web page for image references, printing a list of images and any provided alternate text:
from pyparsing import makeHTMLTags
import urllib

# read data from web page
url = "https://www.cia.gov/library/" \
      "publications/the-world-" \
      "factbook/docs/refmaps.html"
html = urllib.urlopen(url).read()

# define expression for <img> tag
imgTag,endImgTag = makeHTMLTags("img")

# search for matching tags, and
# print key attributes
for img in imgTag.searchString(html):
    print "'%(alt)s' : %(src)s" % img
Notice that instead of using parseString, this script searches for matching text with searchString. For each match returned by searchString, the script prints the values of the alt and src tag attributes just as if they were attributes of the parsed tokens returned by the img expression.
This script just lists out images from the initial page of maps included in the online CIA Factbook. The output contains information on each map image reference, like this excerpt:
'Africa Map' : ../reference_maps/thumbnails/africa.jpg
'Antarctic Region Map' : ../reference_maps/thumbnails/antarctic.jpg
'Arctic Region Map' : ../reference_maps/thumbnails/arctic.jpg
'Asia Map' : ../reference_maps/thumbnails/asia.jpg
'Central America and Caribbean Map' : ../reference_maps/thumbnails/central_america.jpg
'Europe Map' : ../reference_maps/thumbnails/europe.jpg
...
The CIA Factbook web site also includes a more complicated web page, which lists the conversion factors for many common units of measure used around the world. Here are some sample data rows from this table:
ares | square meters | 100 |
ares | square yards | 119.599 |
barrels, US beer | gallons | 31 |
barrels, US beer | liters | 117.347 77 |
barrels, US petroleum | gallons (British) | 34.97 |
barrels, US petroleum | gallons (US) | 42 |
barrels, US petroleum | liters | 158.987 29 |
barrels, US proof spirits | gallons | 40 |
barrels, US proof spirits | liters | 151.416 47 |
bushels (US) | bushels (British) | 0.968 9 |
bushels (US) | cubic feet | 1.244 456 |
bushels (US) | cubic inches | 2,150.42 |
The corresponding HTML source for these rows is of the form:
<TR align="left" valign="top" bgcolor="#FFFFFF">
  <td width=33% valign=top class="Normal">ares </TD>
  <td width=33% valign=top class="Normal">square meters </TD>
  <td width=33% valign=top class="Normal">100 </TD>
</TR>
<TR align="left" valign="top" bgcolor="#CCCCCC">
  <td width=33% valign=top class="Normal">ares </TD>
  <td width=33% valign=top class="Normal">square yards </TD>
  <td width=33% valign=top class="Normal">119.599 </TD>
</TR>
<TR align="left" valign="top" bgcolor="#FFFFFF">
  <td width=33% valign=top class="Normal">barrels, US beer </TD>
  <td width=33% valign=top class="Normal">gallons </TD>
  <td width=33% valign=top class="Normal">31 </TD>
</TR>
<TR align="left" valign="top" bgcolor="#CCCCCC">
  <td width=33% valign=top class="Normal">barrels, US beer </TD>
  <td width=33% valign=top class="Normal">liters </TD>
  <td width=33% valign=top class="Normal">117.347 77 </TD>
</TR>
...
Since we have some sample HTML to use as a template, we can create a simple BNF using shortcuts for opening and closing tags (meaning the results from makeHTMLTags, with the corresponding support for HTML attributes):
entry ::= <tr> conversionLabel conversionLabel conversionValue </tr>
conversionLabel ::= <td> text </td>
conversionValue ::= <td> readableNumber </td>
Note that the conversion factors are formatted for easy reading (by humans, that is):
Integer part is comma-separated on the thousands
Decimal part is space-separated on the thousandths
We can plan to include a parse action to reformat this text before calling float() to convert to a floating-point number. We will also need to post-process the text of the conversion labels; as we will find, these can contain embedded <BR> tags for explicit line breaks.
From a purely mechanical point of view, our script must begin by extracting the source text for the given URL. I usually find the Python urllib module to be sufficient for this task:
import urllib

url = "https://www.cia.gov/library/publications/" \
      "the-world-factbook/appendix/appendix-g.html"
html = urllib.urlopen(url).read()
At this point we have retrieved the web page's source HTML into our Python variable html as a single string. We will use this string later to scan for conversion factors.
But we've gotten a little ahead of ourselves—we need to set up our parser's grammar first! Let's start with the real numbers. Looking through this web page, there are numbers such as:
200
0.032 808 40
1,728
0.028 316 846 592
3,785.411 784
Here is an expression to match these numbers:
decimalNumber = Word(nums, nums+",") + Optional("." + OneOrMore(Word(nums)))
Notice that we are using a new form of the Word constructor, with two arguments instead of just one. When using this form, Word will use the first argument as the set of valid starting characters, and the second argument as the set of valid body characters. The given expression will match 1,000, but not ,456. This two-argument form of Word is useful when defining expressions for parsing identifiers from programming source code, such as this definition for a Python variable name: Word(alphas+"_", alphanums+"_").
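To see the difference the two-argument form makes, here is a small sketch (assuming pyparsing is available):

```python
from pyparsing import Word, ParseException, nums, alphas, alphanums

# first argument: valid leading characters; second: valid body characters
commaNumber = Word(nums, nums + ",")
identifier = Word(alphas + "_", alphanums + "_")

# a digit may start the number, and commas may appear in the body...
assert commaNumber.parseString("1,000")[0] == "1,000"

# ...but a leading comma is rejected
try:
    commaNumber.parseString(",456")
    assert False, "should not have matched"
except ParseException:
    pass

# identifiers start with a letter or underscore; digits only in the body
assert identifier.parseString("_count2")[0] == "_count2"
```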
Since decimalNumber is a working parser all by itself, we can test it in isolation before including it in a larger, more complicated expression.
Using the list of sampled numbers, we get these results:
['200']
['0', '.', '032', '808', '40']
['1,728']
['0', '.', '028', '316', '846', '592']
['3,785', '.', '411', '784']
In order to convert these into floating-point values, we need to:
Join the individual token pieces together
Strip out the commas in the integer part
While these two steps could be combined into a single expression, I want to create two parse actions to show how parse actions can be chained together.
The first parse action will be called joinTokens, and can be performed by a lambda:
joinTokens = lambda tokens : "".join(tokens)
The next parse action will be called stripCommas. Being the next parse action in the chain, stripCommas will receive a single string (the output of joinTokens), so we will only need to work with the 0th element of the supplied tokens:
stripCommas = lambda tokens : tokens[0].replace(",", "")
And of course, we need a final parse action to do the conversion to float:
convertToFloat = lambda tokens : float(tokens[0])
Now, to assign multiple parse actions to an expression, we can use the pair of methods setParseAction and addParseAction:
decimalNumber.setParseAction( joinTokens )
decimalNumber.addParseAction( stripCommas )
decimalNumber.addParseAction( convertToFloat )
Or, we can just call setParseAction, listing the multiple parse actions as separate arguments; these will be defined as a chain of parse actions, executed in the same order that they are given:
decimalNumber.setParseAction( joinTokens, stripCommas, convertToFloat )
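Putting the whole chain together on one of our sample numbers gives the following sketch (assuming pyparsing is installed; the sample value is taken from the list above):

```python
from pyparsing import Word, Optional, OneOrMore, nums

decimalNumber = Word(nums, nums + ",") + Optional("." + OneOrMore(Word(nums)))

joinTokens = lambda tokens: "".join(tokens)
stripCommas = lambda tokens: tokens[0].replace(",", "")
convertToFloat = lambda tokens: float(tokens[0])

decimalNumber.setParseAction(joinTokens, stripCommas, convertToFloat)

# '3,785.411 784' -> '3,785.411784' -> '3785.411784' -> 3785.411784
result = decimalNumber.parseString("3,785.411 784")[0]
assert result == 3785.411784
```

Each parse action sees the tokens produced by the previous one, so the final token is already a Python float.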
Next, let's do a more thorough test by creating the expression that uses decimalNumber and scanning the complete HTML source.
tdStart,tdEnd = makeHTMLTags("td")
conversionValue = tdStart + decimalNumber + tdEnd
for tokens,start,end in conversionValue.scanString(html):
    print tokens
scanString is another parsing method that is especially useful when testing grammar fragments. While parseString works only with a complete grammar, beginning with the start of the input string and working until the grammar is completely matched, scanString scans through the input text, looking for bits of the text that match the grammar. Also, scanString is a generator function, which means it will return tokens as they are found rather than parsing all of the input text first, so your program begins to report matching tokens right away. From the code sample, you can see that scanString returns the tokens and the starting and ending locations for each match.
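Here is scanString in miniature on a throwaway string (a sketch, assuming pyparsing is installed), showing the per-match locations:

```python
from pyparsing import Word, nums

integer = Word(nums)

# scanString yields (tokens, startLoc, endLoc) for each match in the input
source = "ab 12 cd 345"
matches = [(t[0], s, e) for t, s, e in integer.scanString(source)]

assert [m[0] for m in matches] == ["12", "345"]

# the start/end locations index back into the original string
for text, s, e in matches:
    assert source[s:e] == text
```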
Here are the initial results from using scanString to test out the conversionValue expression:
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 40.468564223999998, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 0.40468564223999998, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 43560.0, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 0.0040468564224000001, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 4046.8564224000002, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 0.0015625000000000001, '</td>']
['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 4840.0, '</td>']
...
Well, all those parsed tokens from the attributes of the <TD> tags are certainly distracting. We should clean things up by adding a results name to the decimalNumber expression and just printing out that part:
conversionValue = tdStart + decimalNumber.setResultsName("factor") + tdEnd
for tokens,start,end in conversionValue.scanString(html):
    print tokens.factor
Now our output is plainly:
40.468564224
0.40468564224
43560.0
0.0040468564224
4046.8564224
0.0015625
4840.0
100.0
...
Also note from the absence of quotation marks that these are not strings, but converted floats. On to the remaining elements!
We've developed the expression to extract the conversion factors themselves, but these are of little use without knowing the "from" and "to" units. To parse these, we'll use an expression very similar to the one for extracting the conversion factors:
fromUnit = tdStart + units.setResultsName("fromUnit") + tdEnd
toUnit = tdStart + units.setResultsName("toUnit") + tdEnd
But how will we define the units expression itself? Looking through the web page, this text doesn't show much of a recognizable pattern. We could try something like OneOrMore(Word(alphas)), but that would fail when trying to match units of "barrels, US petroleum" or "gallons (British)." Trying to add in punctuation marks sets us up for errors when we overlook a little-used mark and unknowingly skip over a valid conversion factor.
One thing we do know is that the units text ends when we reach the closing </TD> tag. With this knowledge, we can avoid trying to exhaustively define the pattern for units, and use a helpful pyparsing class, SkipTo. SkipTo collects all the intervening text, from the current parsing position to the location of the target expression, into a single string. Using SkipTo, we can define units simply as:
units = SkipTo( tdEnd )
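A quick check of SkipTo on a single table cell (a sketch assuming pyparsing is installed; the sample cell is taken from the HTML rows shown earlier):

```python
from pyparsing import makeHTMLTags, SkipTo

tdStart, tdEnd = makeHTMLTags("td")

# collect everything between the opening and closing tags as one string
units = SkipTo(tdEnd)
cell = tdStart + units.setResultsName("unitText") + tdEnd

t = cell.parseString(
    '<td width=33% valign=top class="Normal">barrels, US petroleum </td>')

# SkipTo keeps the raw intervening text, including trailing whitespace
assert t.unitText.strip() == "barrels, US petroleum"
```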
We may end up having to do some post-processing on this text, such as trimming leading or trailing whitespace, but at least we won't omit some valid units, and we won't read past any closing </TD> tags.
We are just about ready to complete our expression for extracting unit conversions, combining the "from" and "to" unit expressions with the conversion factor:
conversion = trStart + fromUnit + toUnit + conversionValue + trEnd
Repeating the test scanning process, we get the following:
for tokens,start,end in conversion.scanString(html):
    print "%(fromUnit)s : %(toUnit)s : %(factor)f" % tokens

acres : ares : 40.468564
acres : hectares : 0.404686
acres : square feet : 43560.000000
acres : square kilometers : 0.004047
acres : square meters : 4046.856422
...
This doesn't seem too bad, but further down the list, there are some formatting problems:
barrels, US petroleum : liters : 158.987290
barrels, US proof spirits : gallons : 40.000000
barrels, US proof spirits : liters : 151.416470
bushels (US) : bushels (British) : 0.968900
And even further down, we find these entries:
tons, net register : cubic feet of permanently enclosed space <br> for cargo and passengers : 100.000000
tons, net register : cubic meters of permanently enclosed space <br> for cargo and passengers : 2.831685
So, to clean up the units of measure, we need to strip out newlines and extra spaces, and remove embedded <br> tags. As you may have guessed, we'll use a parse action to do the job.
Our parse action has two tasks:
Remove <br> tags
Collapse whitespace and newlines
The simplest way in Python to collapse repeated whitespace is to use the str type's methods split followed by join. To remove the <br> tags, we will just use str.replace("<br>", " "). A single lambda for both of these tasks would get a little difficult to follow, so this time we'll create an actual Python method and attach it to the units expression:
def htmlCleanup(t):
    unitText = t[0]
    unitText = unitText.replace("<br>"," ")
    unitText = " ".join(unitText.split())
    return unitText

units.setParseAction(htmlCleanup)
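As a quick sanity check of this cleanup step, here it is applied to one of the problem unit strings from the net-register rows (pure string handling, no pyparsing needed; the function is repeated so the sketch is self-contained):

```python
def htmlCleanup(t):
    unitText = t[0]
    # replace <br> with a space, then collapse all runs of whitespace
    unitText = unitText.replace("<br>", " ")
    unitText = " ".join(unitText.split())
    return unitText

# a unit label with an embedded <br>, a newline, and extra indentation
raw = "cubic feet of permanently enclosed space <br>\n          for cargo and passengers "
cleaned = htmlCleanup([raw])
assert cleaned == "cubic feet of permanently enclosed space for cargo and passengers"
```

Note that the <br> replacement happens before the whitespace collapse, so the space it leaves behind is folded into its neighbors.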
With these changes, our conversion factor extractor can collect the unit conversion information. We can load it into a Python dict variable or a local database for further use by our program.
Here is the complete conversion factor extraction program:
import urllib
from pyparsing import *

url = "https://www.cia.gov/library/" \
      "publications/the-world-factbook/" \
      "appendix/appendix-g.html"
page = urllib.urlopen(url)
html = page.read()
page.close()

tdStart,tdEnd = makeHTMLTags("td")
trStart,trEnd = makeHTMLTags("tr")

decimalNumber = Word(nums, nums+",") + Optional("." + OneOrMore(Word(nums)))
joinTokens = lambda tokens : "".join(tokens)
stripCommas = lambda tokens : tokens[0].replace(",","")
convertToFloat = lambda tokens : float(tokens[0])
decimalNumber.setParseAction( joinTokens, stripCommas, convertToFloat )
conversionValue = tdStart + decimalNumber.setResultsName("factor") + tdEnd

units = SkipTo(tdEnd)
def htmlCleanup(t):
    unitText = t[0]
    unitText = unitText.replace("<br>"," ")
    unitText = " ".join(unitText.split())
    return unitText
units.setParseAction(htmlCleanup)

fromUnit = tdStart + units.setResultsName("fromUnit") + tdEnd
toUnit = tdStart + units.setResultsName("toUnit") + tdEnd
conversion = trStart + fromUnit + toUnit + conversionValue + trEnd

for tokens,start,end in conversion.scanString(html):
    print "%(fromUnit)s : %(toUnit)s : %(factor)s" % tokens