This article turned out to be so popular that I ended up writing a whole book, Perl & LWP (O’Reilly), which goes into great detail about the many ways of pulling data out of markup languages like HTML.
In the previous article, Ken MacFarlane describes how the HTML::Parser module scans HTML source as a stream of start tags, end tags, text, comments, and so on. In another issue of TPJ (and republished in Computer Science & Perl Programming: Best of the Perl Journal), I described tree data structures. Now I’ll tie it together by discussing trees of HTML.
The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser extracts, and builds a parse tree—a tree-shaped network of objects representing the structured content of an HTML document. Once the document is parsed as a tree, you’ll find the common tasks of extracting data from that HTML document/tree to be quite straightforward.
HTML::TreeBuilder can construct a parse tree out of an HTML source file simply by saying:
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file('foo.html');
$tree now contains a parse tree built from the HTML in foo.html. The parse tree is represented as a network of objects: $tree is the root, an element with tag name html. Its children typically include head and body elements, and so on. Each element in the tree is an object of the class HTML::Element.
If you take this source:
<html><head><title>Doc 1</title></head> <body> Stuff <hr> 2000-08-17 </body></html>
and feed it to HTML::TreeBuilder, it’ll return a tree of objects that looks like this:
           html
         /      \
     head        body
     /          /  |  \
  title  "Stuff"  hr  "2000-08-17"
    |
 "Doc 1"
This is a pretty simple document. If it were any more complex, it’d be a bit hard to draw in that style, since it sprawls left and right. The same tree can be represented a bit more easily sideways, with indenting:
• html
  • head
    • title
      • "Doc 1"
  • body
    • "Stuff"
    • hr
    • "2000-08-17"
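As a rough sketch of how such an indented outline could be produced, the following walks the tree recursively using only the tag and content_list methods. The helper name outline is made up for this illustration, and new_from_content (a TreeBuilder constructor that parses a string instead of a file) is used here only to keep the example self-contained.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $source = '<html><head><title>Doc 1</title></head> '
           . '<body> Stuff <hr> 2000-08-17 </body></html>';

# new_from_content parses a string and finishes the parse for us.
my $tree = HTML::TreeBuilder->new_from_content($source);

# outline() is a hypothetical helper: it returns one line per node,
# indented two spaces per level of depth.
sub outline {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;

    # Text nodes are plain strings, not objects.
    return qq{$indent"$node"\n} unless ref $node;

    return join '', $indent . $node->tag . "\n",
                    map { outline($_, $depth + 1) } $node->content_list;
}

my $text = outline($tree, 0);
print $text;
$tree->delete;
```

The exact whitespace inside the quoted text nodes depends on how TreeBuilder compacts spaces, so treat the printed strings as approximate.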
Both representations express the same structure. The root node is an object of the class HTML::Element (actually, of HTML::TreeBuilder, but that’s just a subclass of HTML::Element) with the tag name html, and with two children: HTML::Element objects whose tag names are head and body. Each of those elements has children of its own, and so on down. Not all elements have children (the hr element doesn’t, for instance). And not all nodes in the tree are elements: the text nodes (Doc 1, Stuff, and 2000-08-17) are just strings.
Objects of the class HTML::Element have three noteworthy attributes:
_tag
Best accessed as $element->tag. The element’s tag name, lowercased (e.g., em for an EM element).[1]
_parent
Best accessed as $element->parent. The element that is this element’s parent, or undef if this element is the root.
_content
Best accessed as $element->content_list. The list of nodes (i.e., elements or text segments) that are the element’s children.
Moreover, if an element has any attributes, those are readable as $element->attr('name'). For example, with the object built from <a id='foo'>bar</a>, the method call $element->attr('id') returns the string foo. Furthermore, $element->tag on that object returns the string a, $element->content_list returns a list consisting of just the single scalar bar, and $element->parent returns the parent of this node, which might be, for example, a <p> element.
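The accessors just described can be tried on that same fragment. This is a minimal sketch: wrapping the link in a <p> element is an assumption made for the illustration, and the look_down call used to find the link is covered later in this article.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A small fragment; TreeBuilder supplies the implicit html and body elements.
my $tree = HTML::TreeBuilder->new_from_content('<p><a id="foo">bar</a></p>');
my $link = $tree->look_down('_tag', 'a');

my $tag      = $link->tag;            # tag name, lowercased
my $id       = $link->attr('id');     # value of the id attribute
my @children = $link->content_list;   # child nodes (here, a single string)
my $parent   = $link->parent->tag;    # tag name of the enclosing element

print "tag=$tag id=$id children=@children parent=$parent\n";
$tree->delete;
```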
And that’s all that there is to it: you throw HTML source at TreeBuilder, and it returns a tree of HTML::Element objects and some text strings.
However, what do you do with a tree of objects? People code information into HTML trees not for the fun of arranging elements, but to represent the structure of specific text and images: some text is in this li element, some other text is in that heading, some images are in this table cell with those attributes, and so on.
Now, it may happen that you’re rendering that whole HTML tree into some layout format. Or you could be trying to make some systematic change to the HTML tree before dumping it out as HTML source again. But in my experience, the most common programming task that Perl programmers face with HTML is trying to extract some piece of information from a larger document. Since that’s so common (and also since it involves concepts required for more complex tasks), that is what the rest of this article will be about.
Suppose you have a thousand HTML documents, each of them a press release. They all start out:
[...lots of leading images and junk...]
<h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
of world conquest, Rock Feldspar, announced today the opening of a
new office in Ouagadougou, the capital city of Burkina Faso, gateway
to the bustling "Silicon Sahara" of Africa...
[...etc...]
For each document, you’ve got to copy whatever text is in the h1 element, so that you can, for example, make a table of contents of the press releases. There are three ways to do this:
You can just use a regex to scan the file for a text pattern. For simple tasks, this will be fine. Many HTML documents are, in practice, very consistently formatted with respect to placement of linebreaks and whitespace, so you could just get away with scanning the file like so:
sub get_heading {
  my $filename = $_[0];
  # Lexical filehandle with three-argument open
  open(my $fh, '<', $filename) or die "Couldn't open $filename: $!";
  my $heading;
 Line:
  while (<$fh>) {
    if( m{<h1>(.*?)</h1>}i ) {
      $heading = $1;
      last Line;
    }
  }
  close($fh);
  warn "No heading in $filename?" unless defined $heading;
  return $heading;
}
This is quick and fast, but fragile: if there’s a newline in the middle of a heading’s text, it won’t match the above regex, and you’ll get a warning and no heading. The regex will also fail if the h1 element’s start tag has any attributes. If you have to adapt your code to fit more kinds of start tags, you’ll end up basically reinventing part of HTML::Parser, at which point you should probably just stop and use HTML::Parser itself.
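To make that fragility concrete, here is a small sketch that applies the same line-by-line regex to three strings instead of a file. The helper name heading_by_regex and the sample strings are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the same line-by-line match as get_heading,
# but applied to a string rather than a file.
sub heading_by_regex {
    my ($html) = @_;
    for my $line (split /\n/, $html) {
        return $1 if $line =~ m{<h1>(.*?)</h1>}i;
    }
    return;    # no heading found
}

my $plain   = "<h1>Office Opens</h1>\n";
my $attrs   = qq{<h1 class="big">Office Opens</h1>\n};
my $wrapped = "<h1>Office\nOpens</h1>\n";

for my $case ([plain => $plain], [attrs => $attrs], [wrapped => $wrapped]) {
    my ($name, $html) = @$case;
    my $got = heading_by_regex($html);
    print "$name: ", (defined $got ? $got : "(no match)"), "\n";
}
```

Only the first string matches. The start tag with an attribute and the heading split across two lines both slip past the pattern.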
You can use HTML::Parser to scan the file for an h1 start tag token and capture all the text tokens until the h1 end tag. This approach is extensively covered in the previous article. (A variant of this approach is to use HTML::TokeParser, which presents a different and handier interface to the tokens that HTML::Parser extracts.)
Using HTML::Parser is less fragile than our first approach, since it is insensitive to the exact internal formatting of the start tag (much less whether it’s split across two lines). However, when you need more information about the context of the h1 element, or if you’re having to deal with tricky HTML bits like tables, you’ll find that the flat list of tokens returned by HTML::Parser isn’t immediately useful. To get something useful out of those tokens, you’ll need to write code that knows which elements take no content (as with hr elements), and that </p> end tags are optional, so a <p> ends any currently open paragraph. You’re well on your way to pointlessly reinventing much of the code in HTML::TreeBuilder, and as the person who last rewrote that module, I can attest that it wasn’t terribly easy to get right! Never underestimate the perversity of people creating HTML. At this point you should probably just stop and use HTML::TreeBuilder itself.
You can use HTML::TreeBuilder and scan the tree of elements it creates. This last approach is diametrically opposed to the first approach, which involves just elementary Perl and one regex. The TreeBuilder approach involves being comfortable with the concept of tree-shaped data structures and modules with object-oriented interfaces, as well as with the particular interfaces that HTML::TreeBuilder and HTML::Element provide.
However, the TreeBuilder approach is the most robust, because it involves dealing with HTML in its “native” format—the tree structure that HTML code represents, without any consideration of how the source is coded and with what tags are omitted.
To extract the text from the h1 elements of an HTML document with HTML::TreeBuilder, you’d do this:
sub get_heading {
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_file($_[0]);
  my $heading;
  my $h1 = $tree->look_down('_tag', 'h1');
  if ($h1) {
    $heading = $h1->as_text;
  } else {
    warn "No heading in $_[0]?";
  }
  $tree->delete;  # clear memory
  return $heading;
}
This uses some unfamiliar methods. The parse_file method, which we’ve seen before, builds a tree based on source from the file given. The delete method marks a tree’s contents as available for garbage collection when you’re done. The as_text method returns a string that contains all the text bits that are children (or otherwise descendants) of the given node. To get the text content of the $h1 object, we could just say:
$heading = join '', $h1->content_list;
but that will work only if we’re sure that the h1 element’s children will be only text bits. If the document contained this:
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
then the subtree would be:
• h1
  • "Local Man Sees "
  • cite
    • "Blade"
  • " Again"
so join '', $h1->content_list will result in something like this:
Local Man Sees HTML::Element=HASH(0x15424040) Again
Meanwhile, $h1->as_text would yield:
Local Man Sees Blade Again
Depending on what you’re doing with the heading text, you might want the as_HTML method instead. It returns the subtree represented as HTML source. $h1->as_HTML would yield:
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
However, if you wanted the contents of $h1 as HTML, but not the $h1 itself, you could say:
join '', map( ref($_) ? $_->as_HTML : $_, $h1->content_list )
This map iterates over the nodes in $h1’s list of children. For each node that’s only a text bit (like Local Man Sees is), it just passes through that string value. For each node that’s an actual object (causing ref to be true), as_HTML is used instead of the string value of the object itself (which would be something quite useless, as most object values are). So for the cite element, as_HTML will be the string <cite>Blade</cite>. And then, finally, join just combines all the strings that the map returns into one string.
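The three ways of pulling content out of $h1 can be compared side by side. This is a sketch under the assumption that the heading is parsed from a string with new_from_content. The exact text that as_HTML produces (entity encoding, a possible trailing newline) can vary between versions of HTML::Element, so no output is claimed for it here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    '<h1>Local Man Sees <cite>Blade</cite> Again</h1>'
);
my $h1 = $tree->look_down('_tag', 'h1');

# All descendant text, with the markup dropped.
my $text = $h1->as_text;

# The children rendered as HTML source, without the enclosing h1 tags.
my $inner = join '', map { ref($_) ? $_->as_HTML : $_ } $h1->content_list;

print "text:  $text\n";
print "inner: $inner\n";
$tree->delete;
```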
Finally, the most important method in our get_heading subroutine is the look_down method. This method looks down at the subtree starting at the given object (here, $tree), retrieving elements that meet criteria you provide.
The criteria are specified in the method’s argument list. A criterion can consist of two scalars, a key and a value, meaning that you want elements whose attribute named by the key has the given value. The key might be _tag or src, and the value might be something like h1. Or the criterion can be a reference to a subroutine that, when called on an element, returns true if it’s a node you’re looking for. If you specify several criteria, that means you want all the elements that satisfy all the criteria. (In other words, there’s an implicit “and.”)
And finally, there’s a bit of an optimization. If you call the look_down method in a scalar context, you get just the first matching node (or undef if there is none). In fact, once look_down finds that first matching element, it doesn’t bother looking any further. So the example:
$h1 = $tree->look_down('_tag', 'h1');
returns the first element at or under $tree whose _tag attribute has the value h1.
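Here is a small sketch of the different forms a look_down call can take: key/value criteria, a subroutine criterion, and scalar versus list context. The sample markup and the class attribute values are invented for this illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample markup.
my $tree = HTML::TreeBuilder->new_from_content(<<'END_HTML');
<h1>Main Title</h1>
<p class="lead">Opening paragraph</p>
<p>Second paragraph</p>
<p class="lead">Another lead paragraph</p>
END_HTML

# Scalar context: the first matching element, or undef if there is none.
my $first = $tree->look_down('_tag', 'p');
my $none  = $tree->look_down('_tag', 'blockquote');

# List context: every matching element.
my @paras = $tree->look_down('_tag', 'p');

# Two key/value criteria: both must hold (the implicit "and").
my @leads = $tree->look_down('_tag', 'p', 'class', 'lead');

# A key/value criterion combined with a subroutine criterion.
my @second = $tree->look_down('_tag', 'p',
                              sub { $_[0]->as_text =~ /Second/ });

print "first p: ", $first->as_text, "\n";
print scalar(@paras), " paragraphs, ", scalar(@leads), " with class lead\n";
```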
Now, the above look_down code looks like a lot of bother, with barely more benefit than just grepping the file! But consider a situation in which your criteria are more complicated. Suppose you found that some of your press releases had several h1 elements, possibly before or after the one you actually want. For example:
<h1><center>Visit Our Corporate Partner
 <br><a href="/dyna/clickthru">
 <img src="/dyna/vend_ad"></a>
</center>
</h1>
<h1><center>ConGlomCo President Schreck to Visit Regional HQ
 <br><a href="/photos/Schreck_visit_large.jpg">
 <img src="/photos/Schreck_visit.jpg"></a>
</center></h1>
Here, you want to ignore the first h1 element because it contains an ad, and you want the text from the second h1. The problem is how to formalize what’s an ad and what’s not. Since ad banners are always entreating you to “visit” the sponsoring site, you could exclude h1 elements that contain the word “visit” under them:
my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub { $_[0]->as_text !~ m/visit/i }
);
The first criterion looks for h1 elements, and the second criterion limits those to only the ones with text that doesn’t match m/visit/. Unfortunately, that won’t work for our example, since the second h1 mentions “ConGlomCo President Schreck to Visit Regional HQ”.
Instead, you could try looking for the first h1 element that doesn’t contain an image:
my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub { not $_[0]->look_down('_tag', 'img') }
);
This criterion subroutine might seem a bit odd, since it calls look_down as part of a larger look_down operation, but that’s fine. Note that if there’s no matching element at or under the given element, look_down returns false (specifically, undef) in a scalar context. If there are matching elements, it returns the first. So the following subroutine means “return true only if this element has no img element among its descendants and isn’t an img element itself”:
sub { not $_[0]->look_down('_tag', 'img') }
This correctly filters out the first h1, which contains the ad, but it also incorrectly filters out the second h1, which contains a non-advertisement photo near the headline text you want.
There clearly are detectable differences between the first and second h1 elements. Only the second one contains the string “Schreck”, and we can just test for that:
my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub { $_[0]->as_text =~ m{Schreck} }
);
And that works fine for this one example, but unless all thousand of your press releases have “Schreck” in the headline, it’s not generic enough. However, if all the ads in h1s involve a link with a URL that includes /dyna/, you can use that:
my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub {
    my $link = $_[0]->look_down('_tag', 'a');
    # No link means it's fine
    return 1 unless $link;
    # A link to there is bad
    return 0 if $link->attr('href') =~ m{/dyna/};
    return 1; # Otherwise okay
  }
);
Or you can look at it another way, and say that you want the first h1 element that either contains no images, or else contains an image whose src attribute value contains /photos/:
my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub {
    my $img = $_[0]->look_down('_tag', 'img');
    # No image means it's fine
    return 1 unless $img;
    # Good if a photo
    return 1 if $img->attr('src') =~ m{/photos/};
    return 0; # Otherwise bad
  }
);
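A criterion like these can be checked against a small sample before it is let loose on a thousand files. The sketch below uses a simplified version of the two headings (the center and br elements are left out), and accept_h1 is a made-up name for the link-based criterion. It also guards against a link with no href attribute, which the versions above do not.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Simplified, made-up markup: an ad heading followed by the real one.
my $tree = HTML::TreeBuilder->new_from_content(<<'END_HTML');
<h1>Visit Our Corporate Partner
<a href="/dyna/clickthru"><img src="/dyna/vend_ad"></a></h1>
<h1>ConGlomCo President Schreck to Visit Regional HQ
<a href="/photos/Schreck_visit_large.jpg"><img src="/photos/Schreck_visit.jpg"></a></h1>
END_HTML

# accept_h1() is a hypothetical name for the link-based criterion.
sub accept_h1 {
    my $link = $_[0]->look_down('_tag', 'a');
    return 1 unless $link;                               # no link: fine
    my $href = $link->attr('href');
    return 0 if defined $href and $href =~ m{/dyna/};    # ad link: reject
    return 1;                                            # otherwise okay
}

my @all     = $tree->look_down('_tag', 'h1');
my $real    = $tree->look_down('_tag', 'h1', \&accept_h1);
my $heading = $real ? $real->as_text : undef;

print scalar(@all), " h1 elements; chosen: ",
      (defined $heading ? $heading : "(none)"), "\n";
$tree->delete;
```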
Recall that this use of look_down in a scalar context returns the first element at or under $tree matching all the criteria. But if you can formulate criteria that match several possible h1 elements, with the last one being the one you want, you can use look_down in a list context, and ignore all but the last element of the returned list:
my @h1s = $tree->look_down(
  '_tag', 'h1',
  ...maybe more criteria...
);
die "What, no h1s here?" unless @h1s;
my $real_h1 = $h1s[-1]; # last or only element
The above (somewhat contrived) case involves extracting data from a bunch of pre-existing HTML files. In such situations, it’s easy to know when your code works, since the data it handles won’t change or grow, and you typically need to run the program only once.
The other kind of situation faced in many data extraction tasks is one in which the program is run repeatedly to handle new data, such as from ever-changing web pages. As a real-world example of this, consider a program that you could use to extract headline links from subsections of Yahoo! News (http://dailynews.yahoo.com/). Yahoo! News has several subsections, such as:
http://dailynews.yahoo.com/h/tc/ for technology news
http://dailynews.yahoo.com/h/sc/ for science news
http://dailynews.yahoo.com/h/hl/ for health news
http://dailynews.yahoo.com/h/wl/ for world news
http://dailynews.yahoo.com/h/en/ for entertainment news
All of them are built on the same basic HTML template, and a scarily complicated template it is, especially when you look at it with an eye toward identifying the real headline links and screening out the links to everything else. You’ll need to puzzle over the HTML source, and scrutinize the output of $tree->dump on the parse tree of that HTML.
Sometimes the only way to pin down what you’re after is by position in the tree. For example, headlines of interest may be in the third column of the second row of the second table element in a page:
my $table = ( $tree->look_down('_tag', 'table') )[1];
my $row2  = ( $table->look_down('_tag', 'tr') )[1];
my $col3  = ( $row2->look_down('_tag', 'td') )[2];
...then do things with $col3...
Or they might be all the links in a <p> element with more than two <br> elements as children:
my $p = $tree->look_down(
  '_tag', 'p',
  sub {
    2 < grep { ref($_) and $_->tag eq 'br' } $_[0]->content_list
  }
);
@links = $p->look_down('_tag', 'a');
But almost always, you can get away with looking for properties of the thing itself, rather than just looking for contexts. If you’re lucky, the document you’re looking through has clear semantic tagging, perhaps tailored for CSS (Cascading Style Sheets):
<a href="...long_news_url..." class="headlinelink">Elvis seen in tortilla</a>
If you find anything like that, you could leap right in and select links with:
@links = $tree->look_down('class', 'headlinelink');
Regrettably, your chances of observing such semantic markup principles in real-life HTML are pretty slim. (In fact, your chances of finding a page that is simply free of HTML errors are even slimmer. And surprisingly, the quality of the code at sites like Amazon or Yahoo! is typically worse than at personal sites whose entire production cycle involves simply being saved and uploaded from Netscape Composer.)
The code may be “accidentally semantic,” however. For example, in a set of pages I was scanning recently, I found that looking for td elements with a width attribute value of 375 got me exactly what I wanted. No one designing that page ever conceived of width=375 as meaning “this is a headline,” but if you take it to mean that, it works.
An approach like this happens to work for the Yahoo! News code, because the headline links are distinguished by the fact that they (and they alone) contain a b element:
<a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
Or, diagrammed as a part of the parse tree:
• a [href="...long_news_url..."]
  • b
    • "Elvis seen in tortilla"
A rule that matches these can be formalized as “look for any a element that has only one daughter node, which must be a b element.” And this is what it looks like when cooked up as a look_down expression and prefaced with a bit of code to retrieve the Yahoo! News page and feed it to TreeBuilder:
use strict;
use HTML::TreeBuilder 3;
use LWP 5.64;

sub get_headlines {
  my $url = $_[0] || die "What URL?";

  my $response = LWP::UserAgent->new->get($url);
  unless ($response->is_success) {
    warn "Couldn't get $url: ", $response->status_line, "\n";
    return;
  }

  my $tree = HTML::TreeBuilder->new();
  $tree->parse($response->content);
  $tree->eof;

  my @out;
  foreach my $link (
    $tree->look_down(
      '_tag', 'a',
      sub {
        # Keep only real links whose single child is a b element
        return unless $_[0]->attr('href');
        my @c = $_[0]->content_list;
        @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
      }
    )
  ) {
    push @out, [ $link->attr('href'), $link->as_text ];
  }

  warn "Odd, fewer than 6 stories in $url!" if @out < 6;
  $tree->delete;
  return @out;
}
And we add a bit of code to call get_headlines and display the results:
foreach my $section (qw[tc sc hl wl en]) {
  my @links = get_headlines(
    "http://dailynews.yahoo.com/h/$section/"
  );
  print
    $section, ": ", scalar(@links), " stories\n",
    map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
    "\n";
}
Now we have our own headline extractor service! By itself, it isn’t amazingly useful (since if you want to see the headlines, you can just look at the Yahoo! News pages), but it could easily be the basis for features like filtering the headlines for particular topics of interest.
One of these days, Yahoo! News will change its HTML template. When this happens, the above program finds no links meeting our criteria—or, less likely, dozens of erroneous links that meet the criteria. In either case, the criteria will have to be changed for the new template; they may just need adjustment, or you may need to scrap them and start over.
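Because the criteria will need retuning whenever the template changes, it can help to exercise them on a saved snippet rather than on the live site. The following is a sketch along those lines: is_headline_link is a made-up name for the same criterion used in get_headlines, and the markup and URLs are an invented stand-in for a saved page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# is_headline_link() is a hypothetical name for the criterion:
# a real link whose only child is a b element.
sub is_headline_link {
    return unless $_[0]->attr('href');
    my @c = $_[0]->content_list;
    return @c == 1 && ref($c[0]) && $c[0]->tag eq 'b';
}

# Made-up stand-in for a saved page.
my $saved = <<'END_HTML';
<p><a href="/h/tc/story1.html"><b>Elvis seen in tortilla</b></a></p>
<p><a href="/help.html">Help</a></p>
<p><a name="top"><b>Not a link</b></a></p>
<p><a href="/h/tc/story2.html"><b>Second</b> headline</a></p>
END_HTML

my $tree  = HTML::TreeBuilder->new_from_content($saved);
my @found = map { [ $_->attr('href'), $_->as_text ] }
            $tree->look_down('_tag', 'a', \&is_headline_link);
$tree->delete;

printf "%s : %s\n", @$_ for @found;
```

Only the first link passes. The second has no b element, the third has no href, and the fourth has two child nodes.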
It’s often a challenge to write criteria that match the desired parts of an HTML parse tree. Very often you can pull it off with a simple $tree->look_down('_tag', 'h1'), but sometimes you have to keep adding and refining criteria, until you end up with complex filters like I’ve shown in this article. The benefit of HTML parse trees is that one main search tool, the look_down method, can do most of the work, making simple things easy while keeping hard things possible.
[1] Yes, this is misnamed. In proper SGML lingo, this is instead called a GI (short for “generic identifier”) and the term “tag” is used for a token of SGML source that represents either the start of an element (a start tag like <em lang='fr'>) or the end of an element (an end tag like </em>). However, since more people claim to have been abducted by aliens than to have ever seen the SGML standard, and since both encounters typically involve a feeling of “missing time,” it’s not surprising that the terminology of the SGML standard is not closely followed.