Chapter 13. HTML::Parser

Ken MacFarlane

Tip

Since the original publication of this article, the HTML::Parser module has evolved (Version 3.25 as of this update), enabling you to develop powerful parsing tools with a minimum of coding. For readers who are using this wonderful tool for the first time, the examples here should provide a feel for basic HTML parsing techniques, which can then be extended to meet your needs. This article may also be useful for those new to object-oriented programming (I once was myself!), as it covers the concept of subclassing.

Perl is often used to manipulate the HTML files constituting web pages. For instance, one common task is removing tags from an HTML file to extract the plain text. Many solutions for such tasks use regular expressions, which often end up complicated, unattractive, and incomplete (or wrong). The alternative, described here, is to use the HTML::Parser module available on CPAN. HTML::Parser is an object-oriented module, and so it requires some extra explanation for casual users.

HTML::Parser works by scanning HTML input and breaking it up into segments according to how a browser would interpret the text. For instance, this input:

<A HREF="index.html">This is a link</A>

would be broken up into three segments: a start tag (<A HREF="index.html">), text (This is a link), and an end tag (</A>).

As each segment is detected, the parser passes it to an appropriate subroutine. There’s a subroutine for start tags, one for end tags, and another for plain text. There are subroutines for comments and declarations as well.
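
The segment-by-segment dispatch described above can be seen with a small sketch. It assumes the subclassing interface used throughout this article; the package name EventLogger and the @events array are hypothetical names chosen for illustration, not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subclass: records which subroutine each segment was passed to.
package EventLogger;
use base "HTML::Parser";

my @events;   # collected "kind: content" strings (illustrative only)

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    push @events, "start: $origtext";
}

sub end {
    my ($self, $tag, $origtext) = @_;
    push @events, "end: $origtext";
}

sub text {
    my ($self, $text) = @_;
    push @events, "text: $text";
}

package main;

my $p = EventLogger->new;
$p->parse('<A HREF="index.html">This is a link</A>');
$p->eof;    # flush anything still buffered

print "$_\n" for @events;
```

Each printed line names the kind of segment followed by the original text the parser handed to that subroutine.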

In this article, I’ll first give a simple example on how to read and print out all the information found by HTML::Parser. Next, I’ll demonstrate differences in the events triggered by the parser. Finally, I’ll show how to access specific information passed along by the parser.

As of this writing, there are two major versions of HTML::Parser available. Both Version 2 and Version 3 can be used by subclassing the module. For this article, I will concentrate mostly on the subclassing method, because it works with both major versions and is a bit easier to understand for those not overly familiar with some of Perl’s finer details. Version 3 places more emphasis on references, anonymous subroutines, and similar topics; for advanced users who may be interested, there is a brief example at the end of this article.

Getting Started

The first thing to be aware of when using HTML::Parser is that, unlike other modules, it appears to do absolutely nothing. When I first attempted to use this module, I used code similar to this:

#!/usr/bin/perl -w

use strict;
use HTML::Parser;

my $p = new HTML::Parser;
$p->parse_file("index.html");

No output whatsoever. If you look at the source code to the module, you’ll see why:

sub text
{
# my($self, $text) = @_;
}

sub declaration
{
# my($self, $decl) = @_;
}

sub comment
{
# my($self, $comment) = @_;
}

sub start
{
# my($self, $tag, $attr, $attrseq, $origtext) = @_;
# $attr is reference to a HASH, $attrseq is reference to an ARRAY

}

sub end
{
# my($self, $tag, $origtext) = @_;
}

The whole idea of the parser is that as it chugs along through the HTML, it calls these subroutines whenever it finds an appropriate snippet (start tag, end tag, and so on). However, these subroutines do nothing. My program works, and the HTML is being parsed—but I never instructed the program to do anything with the parse results.

The Identity Parser

The following is an example of how HTML::Parser can be subclassed, and its methods overridden, to produce meaningful output. This example simply prints out the original HTML file, unmodified:

 1  #!/usr/bin/perl -w
 2
 3  use strict;
 4
 5  # Define the subclass
 6  package IdentityParse;
 7  use base "HTML::Parser";
 8
 9  sub text {
10      my ($self, $text) = @_;
11      # Just print out the original text
12      print $text;
13  }
14
15  sub comment {
16      my ($self, $comment) = @_;
17      # Print out original text with comment marker
18      print "<!--", $comment, "-->";
19  }
20
21  sub start {
22      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
23      # Print out original text
24      print $origtext;
25  }
26
27  sub end {
28      my ($self, $tag, $origtext) = @_;
29      # Print out original text
30      print $origtext;
31  }
32
33  my $p = new IdentityParse;
34  $p->parse_file("index.html");

Lines 6 and 7 declare the IdentityParse package, having it inherit from HTML::Parser. (Type perldoc perltoot for more information on inheritance.) We then override the text, comment, start, and end subroutines so that they print their original values. The result is a script which reads an HTML file, parses it, and prints it to standard output in its original form.
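
The listing above leaves the declaration subroutine untouched, so a <!DOCTYPE ...> line would be silently dropped from the output. The following sketch adds that override. It assumes the Version 2 style callback, in which declaration receives the text without the leading <! and trailing >; the package name IdentityWithDoctype and the $out buffer are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of the identity parser that also keeps declarations.
package IdentityWithDoctype;
use base "HTML::Parser";

my $out = "";   # output is collected here so it can be inspected afterwards

sub declaration {
    my ($self, $decl) = @_;
    # Assumption: $decl arrives without its "<!" and ">" delimiters
    $out .= "<!$decl>";
}

sub text    { my ($self, $text)    = @_; $out .= $text; }
sub comment { my ($self, $comment) = @_; $out .= "<!--$comment-->"; }

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    $out .= $origtext;
}

sub end {
    my ($self, $tag, $origtext) = @_;
    $out .= $origtext;
}

package main;

my $html = "<!DOCTYPE html>\n<p>Hi <!-- note --> there</p>\n";
my $p = IdentityWithDoctype->new;
$p->parse($html);
$p->eof;

print $out;
```

Collecting the output in a string rather than printing from each subroutine is only a convenience for checking that the result matches the input.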

The HTML Tag Stripper

Our next example strips all the tags from the HTML file and prints just the text:

 1  #!/usr/bin/perl -w
 2
 3  use strict;
 4
 5  package HTMLStrip;
 6  use base "HTML::Parser";
 7
 8  sub text {
 9      my ($self, $text) = @_;
10      print $text;
11  }
12
13  my $p = new HTMLStrip;
14  # Parse line-by-line, rather than the whole file at once
15  while (<>) {
16      $p->parse($_);
17  }
18  # Flush and parse remaining unparsed HTML
19  $p->eof;

Since we’re interested only in the text and not in the HTML tags, we override only the text subroutine. Also note that in lines 15–17, we invoke the parse method instead of parse_file. This lets us read files provided on the command line. When using parse instead of parse_file, we must also call the eof method (line 19); this flushes HTML::Parser’s internal buffer so that any remaining unparsed HTML is processed.
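
To see why eof matters, here is a minimal sketch that feeds the parser chunks that split a tag and a word in awkward places. The package name ChunkStrip, the $plain buffer, and the chunk boundaries are all hypothetical choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tag stripper that collects text instead of printing it.
package ChunkStrip;
use base "HTML::Parser";

my $plain = "";

sub text {
    my ($self, $text) = @_;
    $plain .= $text;
}

package main;

my $p = ChunkStrip->new;

# The end tag "</B>" is deliberately split across two chunks; the parser
# has to hold on to the partial tag until the rest arrives.
$p->parse($_) for ('<B>Hel', 'lo</', 'B> wor', 'ld');

# Tell the parser no more input is coming so it releases buffered text
$p->eof;

print "$plain\n";
```

The text subroutine may be called several times for what looks like one run of text, which is why the sketch appends to $plain rather than assigning to it.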

Another Example: HTML Summaries

Suppose you’ve hand-crafted your own search engine for your web site, and you want to be able to generate summaries for each hit. You could use the HTML::Summary module described in Chapter 22, Summarizing Web Pages with HTML::Summary, but we’ll describe a simpler solution here. We’ll assume that some (but not all) of your site’s pages use a <META> tag to describe the content:

<META NAME="DESCRIPTION" CONTENT="description of file">

When a page has a <META> tag, your search engine should use the CONTENT for the summary. Otherwise, the summary should be the first H1 tag if one exists. And if that fails, we’ll use the TITLE. Our third example generates such a summary:

 1  #!/usr/bin/perl -w
 2
 3  use strict;
 4
 5  package GetSummary;
 6  use base "HTML::Parser";
 7
 8  my $meta_contents;
 9  my $h1    = "";
10  my $title = "";
11
12  # Set state flags
13  my $h1_flag    = 0;
14  my $title_flag = 0;
15
16  sub start {
17      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
18
19      if ($tag =~ /^meta$/i && ($attr->{'name'} || '') =~ /^description$/i) {
20          # Set if we find META NAME="DESCRIPTION"
21          $meta_contents = $attr->{'content'};
22      } elsif ($tag =~ /^h1$/i && ! $h1) {
23          # Set state if we find <H1> or <TITLE>
24          $h1_flag = 1;
25      } elsif ($tag =~ /^title$/i && ! $title) {
26          $title_flag = 1;
27      }
28  }
29
30  sub text {
31      my ($self, $text) = @_;
32      # If we're in <H1>...</H1> or <TITLE>...</TITLE>, save text
33      if ($h1_flag)    { $h1    .= $text; }
34      if ($title_flag) { $title .= $text; }
35  }
36
37  sub end {
38      my ($self, $tag, $origtext) = @_;
39
40      # Reset appropriate flag if we see </H1> or </TITLE>
41      if ($tag =~ /^h1$/i)    { $h1_flag = 0; }
42      if ($tag =~ /^title$/i) { $title_flag = 0; }
43  }
44
45  my $p = new GetSummary;
46  while (<>) {
47      $p->parse($_);
48  }
49  $p->eof;
50
51  print "Summary information: ", $meta_contents ||
52      $h1 || $title || "No summary information found.", "\n";

The magic happens in lines 19–27. The variable $attr contains a reference to a hash where the tag attributes are represented with key/value pairs. The keys are lowercased by the module, which is a code-saver; otherwise, we’d need to check for all casing possibilities (name, NAME, Name, and so on).

Lines 19–21 check to see if the current tag is a META tag and has a field NAME set to DESCRIPTION; if so, the variable $meta_contents is set to the value of the CONTENT field. Lines 22–27 likewise check for an H1 or TITLE tag. In these cases, the information we want is in the text between the start and end tags, and not the tag itself. Furthermore, when the text subroutine is called, it has no way of knowing which tags (if any) its text is between. This is why we set a flag in start (where the tag name is known) and check the flag in text (where it isn’t). Lines 22 and 25 also check whether or not $h1 and $title have been set; since we only want the first match, subsequent matches are ignored.
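
The flag technique does not have to rely on file-level variables. As a sketch of one alternative, the state can live in the parser object itself; this assumes the object is a plain hash in which a subclass may store its own keys (the module's documentation reserves only keys beginning with _hparser). The package name TitleGrabber and the keys in_title, title_buf, and title are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subclass that keeps its state flags inside $self.
package TitleGrabber;
use base "HTML::Parser";

sub start {
    my ($self, $tag) = @_;
    # Only the first <TITLE> counts; tag names arrive lowercased
    $self->{in_title} = 1 if $tag eq 'title' && !defined $self->{title};
}

sub text {
    my ($self, $text) = @_;
    # text() cannot see the enclosing tag, so it consults the flag
    $self->{title_buf} .= $text if $self->{in_title};
}

sub end {
    my ($self, $tag) = @_;
    if ($tag eq 'title' && $self->{in_title}) {
        $self->{in_title} = 0;
        $self->{title}    = $self->{title_buf};
    }
}

package main;

my $p = TitleGrabber->new;
$p->parse('<html><head><title>First</title></head>'
        . '<body><title>Second</title></body></html>');
$p->eof;

print "Title: $p->{title}\n";
```

Keeping the flags in the object means two parsers running in the same program would not interfere with each other, at the cost of slightly noisier code.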

Another Fictional Example

Your company has been running a successful product site, http://www.bar.com/foo/. However, the web marketing team decides that http://foo.bar.com/ looks better in the company’s advertising materials, so a redirect is set up from the new address to the old.

Fast forward to Friday, 4:45 in the afternoon, when the phone rings. The frantic voice on the other end says, “foo.bar.com just crashed! We need to change all the links back to the old location!” Just when you thought a simple search-and-replace would suffice, the voice adds: “And marketing says we can’t change the text of the web pages, only the links.”

“No problem,” you respond, and quickly hack together a program that changes the links in A HREF tags, and nowhere else.

 1  #!/usr/bin/perl -w -i.bak
 2
 3  use strict;
 4
 5  package ChangeLinks;
 6  use base "HTML::Parser";
 7
 8  sub start {
 9      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
10
11      # We're only interested in changing <A ...> tags
12      unless ($tag =~ /^a$/) {
13          print $origtext;
14          return;
15      }
16
17      if (defined $attr->{'href'}) {
18          $attr->{'href'} =~ s[foo\.bar\.com][www.bar.com/foo];
19      }
20
21      print "<A ";
22      # Print each attribute of the <A ...> tag
23      foreach my $i (@$attrseq) {
24          print $i, qq(="$attr->{$i}" );
25      }
26      print ">";
27  }
28
29  sub text {
30      my ($self, $text) = @_;
31      print $text;
32  }
33
34  sub comment {
35      my ($self, $comment) = @_;
36      print "<!--", $comment, "-->";
37  }
38
39  sub end {
40      my ($self, $tag, $origtext) = @_;
41      print $origtext;
42  }
43
44  my $p = new ChangeLinks;
45  while (<>) {
46      $p->parse($_);
47  }
48  $p->eof;

Line 1 specifies that the files will be edited in place, with the original files being renamed with a .bak extension. The real fun is in the start subroutine, lines 8–27. First, in lines 12–15, we check for an A tag; if that’s not what we have, we simply print the original tag and return. Lines 17–19 check for the HREF and make the desired substitution.

$attrseq appears in line 23. This variable is a reference to an array with the tag attributes in their original order of appearance. If the attribute order needs to be preserved, this array is necessary to reconstruct the original order, since the hash $attr will jumble them up. Here, we dereference $attrseq and then recreate each tag. The attribute names will appear lowercase regardless of how they originally appeared. If you’d prefer uppercase, change the first $i in line 24 to uc($i).
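
The rebuilding step can be isolated in a small helper, sketched below. The function name rebuild_tag and the sample attributes are hypothetical; the sketch assumes attribute values contain no double quotes, which a production version would need to escape.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rebuild a start tag from the pieces passed to start().
# $attr is a hash reference, $attrseq an array reference of attribute names
# in their original order.
sub rebuild_tag {
    my ($tag, $attr, $attrseq) = @_;

    # Walk the names in original order, looking each value up in the hash
    my @pairs = map { qq($_="$attr->{$_}") } @$attrseq;

    # join() avoids the trailing space left by printing each pair separately
    return "<" . join(" ", uc($tag), @pairs) . ">";
}

my %attr    = (href => "http://www.bar.com/foo/", class => "nav");
my @attrseq = ("href", "class");

print rebuild_tag("a", \%attr, \@attrseq), "\n";
```

Iterating over keys %attr instead of @attrseq would produce the same attributes, but in whatever order the hash happens to return them.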

Using HTML::Parser Version 3

Version 3 of the module provides more flexibility in how the handlers are invoked. One big change is that you no longer have to use subclassing; rather, event handlers can be specified when the HTML::Parser constructor is called. The following example is equivalent to the previous program but uses some of the Version 3 features:

 1  #!/usr/bin/perl -w -i.bak
 2
 3  use strict;
 4  use HTML::Parser;
 5
 6  # Specify events here rather than in a subclass
 7  my $p = HTML::Parser->new( api_version => 3,
 8                                 start_h => [\&start,
 9                                             "tagname, attr, attrseq, text"],
10                               default_h => [sub { print shift }, "text"],
11                           );
12  sub start {
13      my ($tag, $attr, $attrseq, $origtext) = @_;
14
15      unless ($tag =~ /^a$/) {
16          print $origtext;
17          return;
18      }
19
20      if (defined $attr->{'href'}) {
21          $attr->{'href'} =~ s[foo\.bar\.com][www.bar.com/foo];
22      }
23
24      print "<A ";
25      foreach my $i (@$attrseq) {
26          print $i, qq(="$attr->{$i}" );
27      }
28      print ">";
29  }
30
31  while (<>) {
32      $p->parse($_);
33  }
34  $p->eof;

The key changes are in lines 7–10. In line 8, we specify that the start event is to be handled by the start subroutine. Another important change is in line 10; Version 3 of HTML::Parser supports the notion of a default handler. In the previous example, we needed to specify separate handlers for text, end tags, and comments; here, we use default_h as a catch-all. This turns out to be a code saver as well.

Take a closer look at line 9, and compare it to line 9 of the previous example. Note that $self hasn’t been passed. In Version 3 of HTML::Parser, the list of attributes that can be passed along to the handler subroutine is configurable. If our program only needed the tag name and text, we could change the string tagname, attr, attrseq, text to simply tagname, text and then change the start subroutine to take only two parameters. Also, handlers are not limited to subroutines. If we changed the default handler like this, the text that would have been printed would instead be pushed onto @lines (as a reference to a one-element array for each event):

my @lines;
my $p = HTML::Parser->new( api_version => 3,
                               start_h => [\&start,
                                           "tagname, attr, attrseq, text"],
                             default_h => [\@lines, "text"],
                         );
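
As a minimal sketch of reading the accumulated events back, the following uses only a default handler that targets an array. The variable names @lines and $rebuilt and the sample input are illustrative; the sketch relies on each event being stored as a reference to an array holding the requested arguments, here just the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @lines;   # one array reference is appended here per event

my $p = HTML::Parser->new( api_version => 3,
                             default_h => [\@lines, "text"],
                         );

$p->parse("<p>Hello <b>world</b></p>");
$p->eof;

# Each element looks like [ $text ]; guard against events with no text
my $rebuilt = join "", map { defined $_->[0] ? $_->[0] : "" } @lines;

print "$rebuilt\n";
```

Because no other handlers are installed, every segment falls through to the default handler, so joining the stored text reproduces the input.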

Version 3 of HTML::Parser also adds some new features; notably, one can now set options to recognize and act upon XML constructs, such as <TAG/> and <?TAG?>. There are also multiple methods of accessing tag information, instead of the $attr hash. Rather than go into further detail, I encourage you to explore the flexibility and power of this module on your own.

Acknowledgments

The HTML::Parser module was written by Gisle Aas and Michael A. Chase. Excerpts of code and documentation from the module are used here with the authors’ permission.
