Chapter 16. Webpluck

Ed Hill

The promises of smart little web agents that run around the web and grab things of interest have gone unfulfilled. Like me, you probably have a handful of web pages that you check on a regular basis, and if you had the time, you’d check many more.

Listed below are a few of the bookmarks that I check on a regular basis. Each of these pages has content that changes every day, and it is, of course, the content of these pages that I am interested in—not their layout, nor the advertising that appears on the pages.

Dilbert (of course)
CNN U.S. News
Astronomy Picture of the Day
C|Net’s News.com
The local paper (The Daily Iowan)
ESPNET Sportszone

These pages are great sources of information. My problem is that I don’t have time to check each one every day to see what is there or if the page has been updated. What I want is my own personal newspaper built from the sources listed above.

Similar Tools

This is not an original idea, but after spending many hours searching for a tool to do what I wanted, I gave up. Here are the contenders I considered, and why they didn’t do quite what I wanted.

First there is the “smart” agent, a little gremlin that roams the net trying to guess what you want to see using some AI technique. Firefly was an example; you indicate interest in a particular topic and it points you at a list of sites that others have ranked. When I first looked at Firefly, it suggested that since I was interested in “computers and the internet,” I should check out the Amazing Clickable Beavis (Figure 16-1).

Figure 16-1. The Amazing Clickable Beavis

This is why I don’t have much confidence in agents. Besides, I know what I want to see. I have the URLs in hand. I just don’t have the time to go and check all the pages every day.

The second type of technology is the “custom newspaper,” which comes in two basic flavors. CRAYON (Create Your Own Newspaper, headquartered at http://www.crayon.net/) is one of them. CRAYON is little more than a page full of links to other pages that change every day. For me, CRAYON just adds to the problem, listing tons of pages that I wish I had time to check out; I was still stuck clicking through lists of links to visit all the different pages.

Then there are sites like My Yahoo (http://my.yahoo.com/), a single page whose content changes every day. This is very close to what I wanted—a single site with all of the information I need. My Yahoo combines resources from a variety of different sources and shows a one-line summary of each article; if something looks interesting, I can click on the link to read more about it. The only problem with My Yahoo is that it’s restricted to a small set of content providers, and I want resources other than what Yahoo provides.

Since these tools didn’t do exactly what I wanted, I decided to write my own. I figured that with Perl, the LWP library, and a weekend, I could throw together exactly what I wanted. Thus webpluck was born. My goal was to write a generic tool that would automatically grab data from any web page and build a personalized newspaper much like My Yahoo. I decided the best approach was to define a regular expression for each web page of interest. webpluck uses the LWP library to retrieve the web page, extracts the content with the regular expression tailored to that page, and saves the content to a local cache for later display. Once it has done this for every source, I use a template to generate my personal newspaper.

How to Use webpluck

I don’t want this article to turn into a manual page (since one already exists), but here’s a brief summary of how to use webpluck. You first create a configuration file containing a list of targets that define which pages you want to read, and the regular expression to match against the contents of that page. Here is an example of a target definition that retrieves headlines from the CNN U.S. web page.

name cnn-us
url http://www.cnn.com/US/
regex <h2>([^<]+)</h2>.*?<a href="([^"]+)"
fields title:url

A target defines the following: the name of the file that holds the data retrieved from the web page; the URL of the page (if you point at a page containing frames, you need to determine the URL of the page that actually contains the content); the Perl regular expression used to extract data from the page; and the names of the fields captured by that regular expression. The first pair of parentheses in the regex matches the first field, the second pair matches the second field, and so on. For the configuration shown, ([^<]+) is tagged as the title and ([^"]+) is tagged as the url. That url field is the link to the actual content; it is distinct from the url line in the target definition, which names the page the regular expression is applied to.
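
If the pairing of field names with parentheses seems abstract, here is a tiny illustration of the idea. This is not webpluck itself, and the HTML snippet is invented:

use strict;

# Pair the names from a "fields" line with the regex's capture groups.
my $fields = "title:url";
my $regex  = '<h2>([^<]+)</h2>.*?<a href="([^"]+)"';
my $html   = '<h2>Sample headline</h2> <a href="/US/9703/25/sample/index.html">';

my @fields = split(':', $fields);
if ($html =~ /$regex/is) {
    my @values = ($1, $2);
    foreach my $i (0 .. $#fields) {
        print "$fields[$i]:$values[$i]\n";
    }
}

# Prints:
#   title:Sample headline
#   url:/US/9703/25/sample/index.html

The output uses the same field:value layout that webpluck itself writes to its cache files, as shown next.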

Running webpluck with the target definition above creates a file called cnn-us in a cache directory that you define. Here’s the file from March 25, 1997:

title:Oklahoma bombing judge to let 'impact witnesses' see trial
url:http://www.cnn.com/US/9703/25/okc/index.html

title:Simpson's attorneys ask for a new trial and lower damages
url:http://www.cnn.com/US/9703/25/simpson.newtrial/index.html

title:U.S. playing low-key role in latest Mideast crisis
url:http://www.cnn.com/WORLD/9703/25/us.israel/index.html

title:George Bush parachutes -- just for fun
url:http://www.cnn.com/US/9703/25/bush.jump.ap/index.html

As you might expect, everything depends on the regular expression, which must be tailored for each source. Not everyone, myself included, feels comfortable with regular expressions; if you want to get the most use out of webpluck, and you feel that your regular expression skills are soft, I recommend Jeffrey Friedl’s book Mastering Regular Expressions (O’Reilly).

The second problem with regular expressions is that, as powerful as they are, they can only match data they expect to see. So if the publisher of the web page you are after changes the format, you’ll have to update your regular expression to match. webpluck notifies you when it can’t match anything, which is usually a good indication that the format of the target web page has changed.

Once all the content has been collected, webpluck takes those raw data files and a template file that you provide, and combines them to create your “dynamic” HTML document.

webpluck looks for any <clip> tags in your template file, replacing them with webplucked data. Everything else in the template file is passed through as is. Here is an example of a segment in my daily template file (again using the CNN U.S. headlines as an example):

<clip name="cnn-us">
<li><a href="url">title</a>
</clip>

This is replaced with the following HTML (the lines have been split to make them more readable):

<li><a href="http://www.cnn.com/US/9703/25/okc/index.html">
       Oklahoma bombing judge to let 'impact witnesses' see trial
    </a>

<li><a href="http://www.cnn.com/US/9703/25/simpson.newtrial/index.html">
       Simpson's attorneys ask for a new trial and lower damages
    </a>

<li><a href="http://www.cnn.com/WORLD/9703/25/us.israel/index.html">
       U.S. playing low-key role in latest Mideast crisis
    </a>

<li><a href="http://www.cnn.com/US/9703/25/bush.jump.ap/index.html">
       George Bush parachutes -- just for fun
    </a>
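
The substitution itself is easy to picture. Here is a stripped-down sketch of the idea; it is not webpluck’s actual code, and the cache directory and template name are invented. It assumes the cache-file format shown earlier: field:value lines, with a blank line between records.

use strict;

my $cachedir = "/home/ed/.webpluck";      # invented location
my $template = shift || "daily.html";     # invented template name

# Read one cache file into a list of records (hash references).
sub read_records {
    my ($name) = @_;
    my (@records, %record);
    open(CACHE, "$cachedir/$name") or return ();
    while (<CACHE>) {
        chomp;
        if    (/^(\w+):(.*)/) { $record{$1} = $2 }
        elsif (%record)       { push @records, {%record}; %record = () }
    }
    push @records, {%record} if %record;
    close(CACHE);
    return @records;
}

# Slurp the template, then expand every <clip> block: one copy of the
# body per record, with field names replaced by the record's values.
open(TMPL, $template) or die "can't open $template: $!";
my $page = join('', <TMPL>);
close(TMPL);

$page =~ s{<clip\s+name="([^"]+)">(.*?)</clip>}{
    my ($name, $body) = ($1, $2);
    my $out = '';
    foreach my $rec (read_records($name)) {
        my $copy = $body;
        $copy =~ s/(\w+)/exists $rec->{$1} ? $rec->{$1} : $1/ge;
        $out .= $copy;
    }
    $out;
}sge;

print $page;

Fed the template fragment above and the cnn-us cache file, a sketch like this produces HTML along the lines of the listing you just saw.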

I personally use webpluck by running one cron job every morning and one during lunch to re-create my “daily” page. I realize webpluck could be used for a lot more than this; that’s left as an exercise for the reader.

How webpluck Works

Now on to the technical goodies. For those who don’t know what the LWP library is—learn! LWP is a great collection of Perl objects that allows you to fetch documents from the web. What the CGI library does for people writing web server code, LWP does for people writing web client code. You can download LWP from CPAN.

webpluck is a simple program. Most of the code takes care of processing command-line arguments, reading the configuration file, and checking for errors. The guts rely on the LWP library and Perl’s powerful regular expressions. What follows is part of the main loop; I’ve removed some error checking to make it smaller, but the real guts are intact.

use LWP;
use URI::URL;

# $main::ua is an LWP::UserAgent created during initialization.
my $req = HTTP::Request->new( GET => $self->{'url'} );
$req->header( Accept => "text/html, */*;q=0.1" );
my $res = $main::ua->request( $req );

if ($res->is_success()) {
    my (@fields) = split( ':', $self->{'fields'} );
    my $content  = $res->content();
    my $regex    = $self->{'regex'};

    # Each successful match is one record; its capture groups become
    # the values of the fields named in the configuration file.
    while ($content =~ /$regex/isg) {
        my @values   = ($1, $2, $3, $4, $5, $6, $7, $8);
        my @datalist = ();

        # URLs are special fields; they might be relative, so check for that
        for (my $i = 0; $i <= $#fields; $i++) {
            if ($fields[$i] eq "url") {
                my $urlobj = URI::URL->new($values[$i], $self->{'url'});
                $values[$i] = $urlobj->abs()->as_string();
            }
            push(@datalist, $fields[$i] . ":" . $values[$i]);
        }
        push( @{$self->{'_data'}}, @datalist );
    }
}

The use statements pull in the LWP module, which takes care of all the web-related tasks (fetching documents, parsing URLs, and parsing robot rules), and URI::URL, which is used a few lines later to manipulate URLs. The next three lines are all it takes to grab a web page using LWP.

Assuming webpluck’s attempt to retrieve the page is successful, it saves the document as one long string. It then iterates over the string, trying to match the regular expression defined for this target. The following statement merits some scrutiny:

while ( $content =~ /$regex/isg ) {

The /i modifier of the above regular expression indicates that it should be a case-insensitive match. The /s modifier treats the entire document as if it were a single line, so the . metacharacter matches newlines and your regular expression can span multiple lines. The /g modifier lets you go through the entire document, grabbing data each time the regular expression matches, instead of stopping at the first match.
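
Here is a small, self-contained demonstration of what the three modifiers buy you (the HTML is invented):

my $content = qq{<H2>First story</H2>\n<a HREF="/first.html">more</a>\n}
            . qq{<h2>Second story</h2>\n<a href="/second.html">more</a>\n};

# /i matches <H2> and HREF despite the case differences, /s lets .*? cross
# the newline between a headline and its link, and /g keeps matching after
# the first story instead of stopping there.
while ($content =~ /<h2>([^<]+)<\/h2>.*?<a href="([^"]+)"/isg) {
    print "$1 -> $2\n";
}

# Prints:
#   First story -> /first.html
#   Second story -> /second.html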

For each match webpluck finds, it examines the fields defined by the user. If one of the fields is url, it’s turned into an absolute URL—specifically, a URI::URL object. I let that object translate itself from a relative URL to an absolute URL that can be used outside of the web site from where it was retrieved. This is the only data from the target page that gets massaged.
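
For instance, given the base URL of the CNN target and a relative link (both illustrative here), the conversion looks like this:

use URI::URL;

my $urlobj = URI::URL->new("9703/25/okc/index.html", "http://www.cnn.com/US/");
print $urlobj->abs()->as_string(), "\n";
# Prints: http://www.cnn.com/US/9703/25/okc/index.html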

Lastly, I take the field names and the data that corresponds to each field and save that information. Once all the data from each matched regular expression is collected, it’s run through some additional error checking and saved to a local file.

The Dark Side of the Force

Like any tool, webpluck has both good and bad uses. The program is a sort of web robot, which raises some concerns for me and for users. A detailed list of the considerations can be found on the Web Robots Page at http://www.robotstxt.org/wc/robots.html, but a few points from the Web Robot Guide to Etiquette stand out.

Identify Yourself

webpluck identifies itself as webpluck/2.0 to the remote web server. This isn’t a problem now, since few people use webpluck, but it could become one if sites decide to block the program.

Don’t Overload a Site

Since webpluck only checks a finite set of web pages that you explicitly define—that is, it doesn’t tree-walk sites—this isn’t a problem. Just to be safe, webpluck pauses for a small time period between retrieving documents. It should only be run once or twice a day—don’t launch it every five minutes to ensure that you constantly have the latest and greatest information.

Obey Robot Exclusion Rules

This is the toughest rule to follow. Since webpluck is technically a robot, I should be following the rules set forth by a web site’s /robots.txt file. However, since the data that I am after typically changes every day, some sites have set up specific rules telling robots not to index their pages.

In my opinion, webpluck isn’t a typical robot. I consider it more like an average web client. I’m not building an index, which I think is the reason that these sites tell robots not to retrieve the pages. If webpluck followed the letter of the law, it wouldn’t be very useful since it wouldn’t be able to access many pages that change their content. For example, CNN has this in their robot rules file:

User-agent: *
Disallow: /

If webpluck were law-abiding, it wouldn’t be able to retrieve any information from CNN, one of the main sites I check for news. So what to do? After reading the Robot Exclusion Standard (http://www.robotstxt.org/wc/norobots.html), I believe webpluck doesn’t cause any of the problems meant to be prevented by the standard. Your interpretation may differ; I encourage you to read it and decide for yourself. webpluck has two options (--naughty and --nice) that instruct it whether to obey the robot exclusion rules found on remote servers. (This is my way of deferring the decision to you.)
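
Incidentally, the LWP distribution includes LWP::RobotUA, a drop-in replacement for LWP::UserAgent that consults each site’s /robots.txt and paces its requests. This isn’t necessarily how webpluck’s --nice option is implemented, but a by-the-book fetch looks something like the sketch below (the contact address is a placeholder):

use LWP::RobotUA;
use HTTP::Request;

# A polite user agent: checks robots.txt and enforces a minimum delay
# between requests to the same host.
my $ua = LWP::RobotUA->new('webpluck/2.0', 'you@example.com');
$ua->delay(1);    # wait at least one minute between requests to a host

my $req = HTTP::Request->new(GET => 'http://www.cnn.com/US/');
my $res = $ua->request($req);
# Given the CNN rules quoted above, this comes back
# "403 Forbidden by robots.txt" without the request ever being sent.
print $res->status_line, "\n";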

Just playing nice as a web robot is only part of the equation. Another consideration is what you do with the data once you get it. There are obvious copyright considerations. Copyright on the web is a broad issue. I’m just going to mention a few quandaries raised by webpluck; I don’t have the answers.

  1. Is it okay to extract the URL from the Cool Site of the Day home page and jump straight to the cool site? The Cool Site folks don’t own the URL, but they would certainly prefer that you visit their site first.

  2. Is it okay to retrieve headlines from CNN? What about URLs for the articles?

  3. How about grabbing the actual articles from the CNN site and redisplaying them with your own layout?

  4. And for all of these tasks, does it matter whether the result is kept for your own personal use, shown to a friend, or redistributed more widely?

Obviously, people have different opinions of what is right and what is wrong. I personally don’t have the background, knowledge, or desire to try to tell you what to do. I merely want to raise the issues so you can think about them and make your own decisions.

For a final example of a potential problem, let’s take a look at Dilbert. Here’s the target I have defined for Dilbert at the time of this writing.

name dilbert
url http://www.unitedmedia.com/comics/dilbert/
regex SRC="?([^>]*?/comics/dilbert/archive.*?\.gif)"?\s+
fields url

The cartoon on the Dilbert page changes every day, and instead of just having a link to the latest cartoon (todays-dilbert.gif), they generate a new URL every day and include the cartoon in their web page. They do this because they don’t want people setting up links directly to the cartoon. They want people to read their main page—after all, that’s where the advertising is. Every morning I find out where today’s Dilbert cartoon is located, bypassing all of United Media’s advertising. If enough people do this, United Media will probably initiate countermeasures. There are at least three things that would prevent webpluck (as it currently works) from allowing me to go directly to today’s comic.

  • A CGI program that stands between me and the comic strip. The program would then take all kinds of steps to see if I should have access to the image (e.g., checking Referer headers, or planting a cookie on me). But almost any such countermeasure can be circumvented with a clever enough webpluck.

  • The advertising could be embedded in the same image as the cartoon. That’ll work for Dilbert since it’s a graphic, but not for pages where the content is plain HTML.

  • The site could move away from HTML to another display format such as VRML or Java that takes over an entire web page with a single view. This approach makes the content far harder for robots to retrieve.

Most funding for web technology exists to solve the needs of content providers, not users. If tools like webpluck are considered a serious problem by content providers, steps will be taken to shut them down, or make them harder to operate.

It isn’t my intent to distribute a tool to filter web advertising or steal information from web pages so that I can redistribute it myself, but I’m not so naïve as to think this can’t be done. Obviously, anyone intent on doing these things can do so; webpluck just makes it easier. Do what you think is right.

You can find more information about webpluck at http://www.edsgarage.com/ed/webpluck/. The program is also on this book’s web site at http://www.oreilly.com/catalog/tpj2.
