The promises of smart little web agents that run around the web and grab things of interest have gone unfulfilled. Like me, you probably have a handful of web pages that you check on a regular basis, and if you had the time, you’d check many more.
Listed below are a few of the bookmarks that I check on a regular basis. Each of these pages has content that changes every day, and it is, of course, the content of these pages that I am interested in—not their layout, nor the advertising that appears on the pages.
Dilbert (of course)
CNN U.S. News
Astronomy Picture of the Day
C|Net’s News.com
The local paper (The Daily Iowan)
ESPNET Sportszone
These pages are great sources of information. My problem is that I don’t have time to check each one every day to see what is there or if the page has been updated. What I want is my own personal newspaper built from the sources listed above.
This is not an original idea, but after spending many hours searching for a tool to do what I wanted, I gave up. Here are the contenders I considered, and why they didn’t do quite what I wanted.
First there is the “smart” agent, a little gremlin that roams the net trying to guess what you want to see using some AI technique. Firefly was an example; you indicate interest in a particular topic and it points you at a list of sites that others have ranked. When I first looked at Firefly, it suggested that since I was interested in “computers and the internet,” I should check out the sites shown in Figure 16-1.
This is why I don’t have much confidence in agents. Besides, I know what I want to see. I have the URLs in hand. I just don’t have the time to go and check all the pages every day.
The second type of technology is the “custom newspaper,” which comes in two basic flavors. CRAYON (Create Your Own Newspaper, headquartered at http://www.crayon.net/) is one flavor of personalized newspaper. CRAYON is little more than a page full of links to other pages that change every day. For me, CRAYON just adds to the problem, listing tons of pages that I wish I had time to check out; I was still stuck clicking through lists of links to visit all the different pages.
Then there are sites like My Yahoo (http://my.yahoo.com/), a single page whose content changes every day. This is very close to what I wanted—a single site with all of the information I need. My Yahoo combines resources from a variety of different sources. It shows a one-line summary of each article; if it’s something that I find interesting, I can click on the link to read more about it. The only problem with My Yahoo is that it’s restricted to a small set of content providers. I want resources other than what Yahoo provides.
Since these tools didn’t do exactly what I wanted, I decided to write my own. I figured with Perl, the LWP library, and a weekend, I could throw together exactly what I wanted. Thus webpluck was born. My goal was to write a generic tool that would automatically grab data from any web page and create a personalized newspaper exactly like My Yahoo. I decided the best approach was to define a regular expression for each web page of interest. webpluck uses the LWP library to retrieve the web page, extracts the content with a regular expression tailored to each, and saves it to a local cache for display later. Once it has done this for all the sources, I use a template to generate my personal newspaper.
I don’t want this article to turn into a manual page (since one already exists), but here’s a brief summary of how to use webpluck. You first create a configuration file containing a list of targets that define which pages you want to read, and the regular expression to match against the contents of each page. Here is an example of a target definition that retrieves headlines from the CNN U.S. web page:
name    cnn-us
url     http://www.cnn.com/US/
regex   <h2>([^<]+)</h2>.*?<a href="([^"]+)"
fields  title:url
This definition specifies the following: the name of the file to hold the data retrieved from the web page; the URL of the page (if you point at a page containing frames, you need to determine the URL of the page that actually contains the content); the Perl regular expression used to extract data from the web page; and the names of the fields matched in the regular expression that you just defined. The first pair of parentheses in the regex field matches the first field, the second pair matches the second, and so on. For the configuration shown, ([^<]+) is tagged as the title and ([^"]+) is tagged as the url. That url is the link to the actual content, distinct from the url definition on the second line of the target, which is the page that webpluck retrieves and matches against.
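webpluck’s own configuration parser isn’t shown in this article, but purely as an illustration of the format, here is a minimal hypothetical sketch of how such target definitions could be read into Perl hashes (the file name webpluck.conf and the one-keyword-per-line layout are assumptions based on the example above):

# Hypothetical sketch, not webpluck's actual parser: read "keyword value"
# lines into one hash per target, where a "name" line starts a new target.
use strict;

my (@targets, $current);
open( my $cfg, '<', 'webpluck.conf' ) or die "webpluck.conf: $!";
while (<$cfg>) {
    next if /^\s*(#|$)/;                      # skip comments and blank lines
    my ($key, $value) = /^\s*(\S+)\s+(.*\S)/ or next;
    if ($key eq 'name') {
        $current = { name => $value };        # start a new target
        push @targets, $current;
    } else {
        $current->{$key} = $value;            # url, regex, fields, ...
    }
}
close $cfg;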
Running webpluck with the target definition above creates a file called cnn-us in a cache directory that you define. Here’s the file from March 25, 1997:
title:Oklahoma bombing judge to let 'impact witnesses' see trial
url:http://www.cnn.com/US/9703/25/okc/index.html
title:Simpson's attorneys ask for a new trial and lower damages
url:http://www.cnn.com/US/9703/25/simpson.newtrial/index.html
title:U.S. playing low-key role in latest Mideast crisis
url:http://www.cnn.com/WORLD/9703/25/us.israel/index.html
title:George Bush parachutes -- just for fun
url:http://www.cnn.com/US/9703/25/bush.jump.ap/index.html
As you might expect, everything depends on the regular expression, which must be tailored for each source. Not everyone, myself included, feels comfortable with regular expressions; if you want to get the most use out of webpluck and you feel that your regular expression skills are soft, I recommend Jeffrey Friedl’s book Mastering Regular Expressions (O’Reilly).

The second problem with regular expressions is that, as powerful as they are, they can only match data they expect to see. So if the publisher of the web page you are after changes his or her format, you’ll have to update your regular expression. webpluck notifies you if it couldn’t match anything, which is usually a good indication that the format of the target web page has changed.
Once all the content has been collected, webpluck takes those raw data files and a template file that you provide, and combines them to create your “dynamic” HTML document. webpluck looks for any <clip> tags in your template file, replacing them with webplucked data. Everything else in the template file is passed through as is. Here is an example of a segment in my daily template file (again using the CNN U.S. headlines as an example):
<clip name="cnn-us">
<li><a href="url">title</a>
</clip>
This is replaced with the following HTML (the lines have been split to make them more readable):
<li><a href="http://www.cnn.com/US/9703/25/okc/index.html">
    Oklahoma bombing judge to let 'impact witnesses' see trial</a>
<li><a href="http://www.cnn.com/US/9703/25/simpson.newtrial/index.html">
    Simpson's attorneys ask for a new trial and lower damages</a>
<li><a href="http://www.cnn.com/WORLD/9703/25/us.israel/index.html">
    U.S. playing low-key role in latest Mideast crisis</a>
<li><a href="http://www.cnn.com/US/9703/25/bush.jump.ap/index.html">
    George Bush parachutes -- just for fun</a>
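The article doesn’t show the substitution code itself, but the mechanics are easy to picture. Here is a hypothetical sketch (not webpluck’s actual implementation) that reads a template from standard input, pulls records from field:value cache files like the one shown earlier, and expands each <clip> block; the cache directory path is only an assumption:

# Hypothetical sketch of the <clip> expansion step, not webpluck's real code.
# Assumes cache files hold "field:value" lines, and that a repeated field
# name (e.g., a second "title") marks the start of a new record.
use strict;

my $cache_dir = "$ENV{HOME}/.webpluck";        # assumed cache location

sub fill_in {                                  # replace bare field names in the clip body
    my ($body, $record) = @_;
    $body =~ s/\b\Q$_\E\b/$record->{$_}/g for keys %$record;
    return $body;
}

sub expand_clip {
    my ($name, $body) = @_;
    my ($html, %record) = ('');
    open( my $fh, '<', "$cache_dir/$name" ) or return "<!-- no data for $name -->";
    while (<$fh>) {
        chomp;
        my ($field, $value) = split /:/, $_, 2;
        if (exists $record{$field}) {          # repeated field: flush this record
            $html .= fill_in($body, \%record);
            %record = ();
        }
        $record{$field} = $value;
    }
    $html .= fill_in($body, \%record) if %record;
    return $html;
}

local $/;                                      # slurp the whole template
my $template = <STDIN>;
$template =~ s{<clip name="([^"]+)">(.*?)</clip>}{ expand_clip($1, $2) }gise;
print $template;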
I personally use webpluck by running one cron job every morning and one during lunch to re-create my “daily” page. I realize webpluck could be used for a lot more than this; that’s left as an exercise for the reader.
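As a purely illustrative example, those two runs could be scheduled with crontab entries along these lines; the program path is an assumption, and --nice is the option described later in this article:

# illustrative crontab entries: rebuild the page at 6:30 a.m. and again at noon
30 6  * * *  /home/ed/bin/webpluck --nice
0  12 * * *  /home/ed/bin/webpluck --nice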
Now on to the technical goodies. For those who don’t know what the LWP library is—learn! LWP is a great collection of Perl objects that allows you to fetch documents from the web. What the CGI library does for people writing web server code, LWP does for people writing web client code. You can download LWP from CPAN.
webpluck is a simple program. Most of the code takes care of processing command-line arguments, reading the configuration file, and checking for errors. The guts rely on the LWP library and Perl’s powerful regular expressions. The following is part of the main loop in webpluck. I’ve removed some error checking to make it smaller, but the real guts are shown below.
use LWP;
use URI::URL;

$req = HTTP::Request->new( GET => $self->{'url'} );
$req->header( Accept => "text/html, */*;q=0.1" );
$res = $main::ua->request( $req );

if ($res->is_success()) {
    my @fields  = split( ':', $self->{'fields'} );
    my $content = $res->content();
    my $regex   = $self->{'regex'};

    while ($content =~ /$regex/isg) {
        my @values   = ($1, $2, $3, $4, $5, $6, $7, $8);
        my @datalist = ();

        # URLs are special fields; they might be relative, so check for that
        for (my $i = 0; $i <= $#fields; $i++) {
            if ($fields[$i] eq "url") {
                my $urlobj = URI::URL->new( $values[$i], $self->{'url'} );
                $values[$i] = $urlobj->abs()->as_string();
            }
            push( @datalist, $fields[$i] . ":" . $values[$i] );
        }
        push( @{ $self->{'_data'} }, @datalist );
    }
}
The use LWP statement imports the LWP module, which takes care of all the web-related tasks (fetching documents, parsing URLs, and parsing robot rules). The next three lines are all it takes to grab a web page using LWP.
Assuming webpluck’s attempt to retrieve the page is successful, it saves the document as one long string. It then iterates over the string, trying to match the regular expression defined for this target. The following statement merits some scrutiny:
while ( $content =~ /$regex/isg ) {
The /i modifier of the above regular expression indicates a case-insensitive match. The /s modifier treats the entire document as a single line, letting . match newlines, so your regular expression can span multiple lines. /g allows you to go through the entire document and grab data each time the regular expression matches, instead of stopping after the first match.
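To make the combination concrete, here is a small stand-alone demonstration (not taken from webpluck) using the same CNN-style regular expression:

# Demonstrates /isg: case-insensitive, dot-matches-newline, repeated matching.
my $content = qq{<H2>First story</H2>\n<a href="/one.html">\n} .
              qq{<h2>Second story</h2>\n<a href="/two.html">\n};
while ($content =~ /<h2>([^<]+)<\/h2>.*?<a href="([^"]+)"/isg) {
    print "title=$1 url=$2\n";   # prints both stories despite case and newlines
}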
For each match webpluck finds, it examines the fields defined by the user. If one of the fields is url, it’s turned into an absolute URL—specifically, a URI::URL object. I let that object translate itself from a relative URL to an absolute URL that can be used outside of the web site from where it was retrieved. This is the only data from the target page that gets massaged.
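For instance, using the same URI::URL calls as the code above (the relative path here is just an example):

use URI::URL;
my $urlobj = URI::URL->new( "/US/9703/25/okc/index.html", "http://www.cnn.com/US/" );
print $urlobj->abs()->as_string(), "\n";
# prints: http://www.cnn.com/US/9703/25/okc/index.html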
Lastly, I take the field names and the data that corresponds to each field and save that information. Once all the data from each matched regular expression is collected, it’s run through some additional error checking and saved to a local file.
Like any tool, webpluck has both good and bad uses. The program is a sort of web robot, which raises some concerns for me and for users. A detailed list of the considerations can be found on the Web Robots Page at http://www.robotstxt.org/wc/robots.html, but a few points from the Web Robot Guide to Etiquette stand out.
webpluck identifies itself as webpluck/2.0 to the remote web server. This isn’t a problem now, since few people use webpluck, but it could become one if sites decide to block my program.
Hogging a remote server’s resources isn’t a problem either, since webpluck only checks a finite set of web pages that you explicitly define—that is, it doesn’t tree-walk sites. Just to be safe, webpluck pauses for a small time period between retrieving documents, and it should only be run once or twice a day—don’t launch it every five minutes just to make sure you constantly have the latest and greatest information.
This is the toughest rule to follow. Since webpluck is technically a robot, it should obey the rules set forth in a web site’s /robots.txt file. The catch is that the data I am after typically changes every day, and sites with that kind of content are often exactly the ones that tell robots not to index their pages.
In my opinion, webpluck isn’t a typical robot. I consider it more like an average web client. I’m not building an index, which I think is the reason that these sites tell robots not to retrieve the pages. If webpluck followed the letter of the law, it wouldn’t be very useful, since it wouldn’t be able to access many pages that change their content. For example, CNN has this in their robot rules file:
User-agent: *
Disallow: /
If webpluck were law-abiding, it wouldn’t be able to retrieve any information from CNN, one of the main sites I check for news. So what to do? After reading the Robot Exclusion Standard (http://www.robotstxt.org/wc/norobots.html), I believe webpluck doesn’t cause any of the problems meant to be prevented by the standard. Your interpretation may differ; I encourage you to read it and decide for yourself.
webpluck has two options (--naughty and --nice) that instruct it whether to obey the robot exclusion rules found on remote servers. (This is my way of deferring the decision to you.)
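The article doesn’t show how the --nice behavior works internally; one plausible sketch, using the WWW::RobotRules module that ships with libwww-perl (hypothetical code, with CNN’s URL reused from the example above), looks like this:

# Hypothetical sketch of "--nice": check a site's /robots.txt before fetching.
use LWP::Simple qw(get);
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('webpluck/2.0');   # the agent name webpluck reports
my $robots_url = 'http://www.cnn.com/robots.txt';
$rules->parse( $robots_url, get($robots_url) || '' );

my $target = 'http://www.cnn.com/US/';
if ($rules->allowed($target)) {
    print "fetching $target\n";
} else {
    print "robots.txt disallows $target; skipping (--nice)\n";
}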
Just playing nice as a web robot is only part of the equation. Another consideration is what you do with the data once you get it. There are obvious copyright considerations. Copyright on the web is a broad issue; I’m just going to mention a few quandaries raised by webpluck. I don’t have the answers.
Is it okay to extract the URL from the Cool Site of the Day home page and jump straight to the cool site? The Cool Site folks don’t own the URL, but they would certainly prefer that you visit their site first.
Is it okay to retrieve headlines from CNN? What about URLs for the articles?
How about grabbing the actual articles from the CNN site and redisplaying them with your own layout?
And for all of these tasks, does it matter whether the result is for your own personal use, shown to a friend, or redistributed more widely?
Obviously, people have different opinions of what is right and what is wrong. I personally don’t have the background, knowledge, or desire to try to tell you what to do. I merely want to raise the issues so you can think about them and make your own decisions.
For a final example of a potential problem, let’s take a look at Dilbert. Here’s the target I have defined for Dilbert at the time of this writing.
name    dilbert
url     http://www.unitedmedia.com/comics/dilbert/
regex   SRC="?([^>]?/comics/dilbert/archive.*?\.gif)"?\s+
fields  url
The cartoon on the Dilbert page changes every day, and instead of just having a link to the latest cartoon (todays-dilbert.gif), they generate a new URL every day and include the cartoon in their web page. They do this because they don’t want people setting up links directly to the cartoon. They want people to read their main page—after all, that’s where the advertising is. Every morning I find out where today’s Dilbert cartoon is located, bypassing all of United Media’s advertising. If enough people do this, United Media will probably initiate countermeasures. There are at least three things that would prevent webpluck (as it currently works) from allowing me to go directly to today’s comic:
A CGI program that stands between me and the comic strip. The program would then take all kinds of steps to see if I should have access to the image (e.g., checking Referer headers, or planting a cookie on me). But almost any such countermeasure can be circumvented with a clever enough webpluck.
The advertising could be embedded in the same image as the cartoon. That’ll work for Dilbert since it’s a graphic, but not for pages where the content is plain HTML.
The site could move away from HTML to another display format such as VRML or Java that takes over an entire web page with a single view. This approach makes the content far harder for robots to retrieve.
Most funding for web technology exists to solve the needs of content providers, not users. If tools like webpluck are considered a serious problem by content providers, steps will be taken to shut them down, or make them harder to operate.
It isn’t my intent to distribute a tool to filter web advertising or steal information from web pages so that I can redistribute it myself, but I’m not so naïve as to think this can’t be done. Obviously, anyone intent on doing these things can do so; webpluck just makes it easier. Do what you think is right.
You can find more information about webpluck at http://www.edsgarage.com/ed/webpluck/. The program is also on this book’s web site at http://www.oreilly.com/catalog/tpj2.