Chapter 15. A Web Spider in One Line

Tkil

One day, someone on the IRC #perl channel was asking some confused questions. We finally managed to figure out that he was trying to write a web robot, or “spider,” in Perl. Which is a grand idea, except that:

  1. Perfectly good spiders have already been written and are freely available at http://info.webcrawler.com/mak/projects/robots/robots.html.

  2. A Perl-based web spider is probably not an ideal project for novice Perl programmers. They should work their way up to it.

Having said that, I immediately pictured a one-line Perl robot. It wouldn’t do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I’ve split it onto separate lines for easier reading.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
    $ua = LWP::UserAgent->new;
    while (my $link = shift @ARGV) {
        print STDERR "working on $link";
        HTML::LinkExtor->new(
            sub {
                my ($t, %a) = @_;
                my @links = map { url($_, $link)->abs() }
                            grep { defined } @a{qw/href img/};
                print STDERR "+ $_" foreach @links;
                push @ARGV, @links;
            }
        )->parse(
            do {
                my $r = $ua->simple_request
                    (HTTP::Request->new("GET", $link));
                $r->content_type eq "text/html" ? $r->content : "";
            }
        )
    }' http://slinky.scrye.com/~tkil/

I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn’t that much of a terror. Here’s the one-line version.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe
'$ua = LWP::UserAgent->new; while (my $link = shift @ARGV) {
print STDERR "working on $link"; HTML::LinkExtor->new( sub
{ my ($t, %a) = @_; my @links = map { url($_, $link)->abs()
} grep { defined } @a{qw/href img/}; print STDERR "+ $_"
foreach @links; push @ARGV, @links } )->parse(do { my $r =
$ua->simple_request(HTTP::Request->new("GET", $link));
$r->content_type eq "text/html" ? $r->content : ""; } )
}' http://slinky.scrye.com/~tkil/

After getting an ego-raising chorus of groans from the hapless onlookers in #perl, I thought I’d try to identify some cute things I did with this code that might actually be instructive to TPJ readers.

Callbacks and Closures

Many modules are designed to do grunt work. In this case, HTML::LinkExtor (a specialized version of HTML::Parser) knows how to look through an HTML document and find links. Once it finds them, however, it needs to know what to do with them.

This is where callbacks come in. They’re well known in GUI circles, since interfaces need to know what to do when a button is pressed or a menu item selected. Here, HTML::LinkExtor needs to know what to do with links (all tags, actually) when it finds them.

My callback is an anonymous subroutine reference:

sub {
    my ($t, %a) = @_;
    my @links = map { url($_, $link)->abs() }
                grep { defined } @a{qw/href img/};
    print STDERR "+ $_" foreach @links;
    push @ARGV, @links;
}

I didn’t notice until later that $link is actually scoped just outside of this subroutine (in the while loop), making this subroutine look almost like a closure. It’s not a classical closure (it doesn’t define its own storage), but it does use a lexical variable declared well outside its own body.
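The capture that the callback relies on can be demonstrated without any of the LWP machinery. In this sketch (the URLs and variable names are invented for illustration), each anonymous sub remembers the lexical $link that was in scope when the sub was created:

```perl
use strict;
use warnings;

# Build one callback per link; each sub captures the per-iteration
# lexical $link, so the two subs remember two different values.
my @callbacks;
foreach my $link ('http://a.example/', 'http://b.example/') {
    push @callbacks, sub {
        my ($path) = @_;
        return "$link$path";    # uses the captured $link
    };
}

print $callbacks[0]->("index.html"), "\n";   # http://a.example/index.html
print $callbacks[1]->("index.html"), "\n";   # http://b.example/index.html
```

Because `foreach my $link` creates a fresh lexical on every iteration, each closure holds on to its own copy rather than sharing one variable.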

Cascading Arrows

It’s amusing to note that, aside from the debugging output, the while loop consists of a single statement. The arrow operator (->) cares only about the value of its left-hand side. This is the heart of the Perl/Tk idiom:

my $button = $main->Button( ... )->pack();

We use a similar approach, except we don’t keep a copy of the created reference (which is stored in $button above):

HTML::LinkExtor->new(...)->parse(...);

This is a nice shortcut to use whenever you want to create an object for a single use.
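The chaining works with any constructor, not just Perl/Tk or HTML::LinkExtor: since new() returns the object, a method can be called on it immediately and the reference discarded. A toy sketch (the package and method names here are invented for illustration):

```perl
use strict;
use warnings;

# A throwaway class: new() takes a callback, parse() invokes it
# once per whitespace-separated word, mimicking the
# HTML::LinkExtor->new(...)->parse(...) shape.
package Toy::Counter;

sub new {
    my ($class, $cb) = @_;
    return bless { cb => $cb }, $class;
}

sub parse {
    my ($self, $text) = @_;
    $self->{cb}->($_) foreach split ' ', $text;
    return $self;
}

package main;

my @words;
# Create, use, and discard the object in a single statement.
Toy::Counter->new(sub { push @words, shift })->parse("one two three");
print "@words\n";   # one two three
```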

Using Modules with One-Liners

When I first thought of this one-liner, I knew I’d be using modules from the libwww-perl (LWP) library. The first few iterations of this “one-liner” used LWP::Simple, which explicitly states that it is ideal for one-liners. The -M flag loads a module from the command line, just as a use statement does in a script. LWP::Simple fetched the files just fine; I used something like this:

HTML::LinkExtor->new(...)->parse( get $link );

Here, get is a function provided by LWP::Simple that returns the contents of a given URL.

Unfortunately, I needed to check the Content-Type of the returned data. The first version merrily tried to parse .tar.gz files and got confused:

working on ./dist/irchat/irchat-3.03.tar.gz
Use of uninitialized value at
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 104.
Use of uninitialized value at
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 107.
Use of uninitialized value at
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 82.

Oops.

Switching to the “industrial strength” LWP::UserAgent module allowed me to check the Content-Type of the fetched page. Using this information, together with the HTTP::Response module and a quick ?: construct, I could parse either the HTML content or an empty string.
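The guard itself is just the conditional at the end of the do block. Factored out into a helper (the name html_or_empty is mine, not part of the one-liner), the logic looks like this:

```perl
use strict;
use warnings;

# Return the body only when the media type says it is HTML;
# anything else "parses" as the empty string, so HTML::LinkExtor
# has nothing to choke on. Mirrors the ?: guard in the one-liner.
sub html_or_empty {
    my ($content_type, $content) = @_;
    return $content_type eq "text/html" ? $content : "";
}

print html_or_empty("text/html", "<p>hi</p>"), "\n";              # <p>hi</p>
print length html_or_empty("application/x-gzip", "binary"), "\n"; # 0
```

In the real one-liner, $r->content_type supplies the media type and $r->content supplies the body.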

The End

Whenever I write a one-liner, I find it interesting to think about it in different ways. While I was writing it, I was mostly thinking from the bottom up; some of the complex nesting is a result of this. For example, the callback routine is fairly hairy, but once I had it written, I could change the data source from LWP::Simple::get to LWP::UserAgent and HTTP::Response::content quite easily.

Obviously, this spider does nothing more than visit HTML pages and try to grab all the links off each one. It could be more polite (as the LWP::RobotUA module is) and it could be smarter about which links to visit. In particular, there’s no sense of which pages have already been visited; a tied DBM of visited pages would solve that nicely.
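A minimal version of that bookkeeping is a %seen hash consulted before queueing; tie the hash to a DBM file (with SDBM_File or DB_File, say) and the visited set would persist across runs. A sketch with an in-memory hash (the dedup_links name is my own):

```perl
use strict;
use warnings;

# Skip links we have already queued. Tie %seen to a DBM file
# to make the visited set survive between spider runs.
my %seen;
sub dedup_links {
    return grep { !$seen{$_}++ } @_;
}

my @queue;
push @queue, dedup_links(qw(http://a.example/ http://b.example/));
push @queue, dedup_links(qw(http://a.example/ http://c.example/));
print "@queue\n";  # http://a.example/ http://b.example/ http://c.example/
```

In the one-liner, the same filter would sit in front of the `push @ARGV, @links` step.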

Even with these limitations, I’m impressed at the power expressed by that “one” line. Kudos for that go to Gisle Aas (the author of LWP) and to Larry Wall, for making a language that does all the boring stuff for us. Thanks Gisle and Larry!
