Sometimes it’s nice to visit web sites without being in front of your computer. Maybe you’d prefer to have the text of web pages mailed to you, or be notified when a web page changes. Or maybe you’d like to download a lot of information from a huge number of web pages (as in the article webpluck), and you don’t want to open them all one by one. Or maybe you’d like to write a robot that scours the web for information. Enter the LWP bundle (sometimes called libwww-perl), which contains two modules that can download web pages for you: LWP::Simple and LWP::UserAgent. LWP is available on CPAN and is introduced in Scripting the Web with LWP.
Dan Gruhl submitted five tiny but exquisite programs to TPJ, all using LWP to automatically download information from a web service. Instead of sprinkling these around various issues as one-liners, I’ve collected all five here with a bit of explanation for each.
The first thing to notice is that all five programs look alike. Each uses an LWP module (LWP::Simple in the first three, LWP::UserAgent in the last two) to store the HTML from a web page in Perl’s default scalar variable $_. Then they use a series of s/// substitutions to discard the extraneous HTML. The remaining text—the part we’re interested in—is displayed on the screen, although it could nearly as easily have been sent as email with the various Mail modules on CPAN.
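As a sketch of that last idea, here is one way a script like these could mail its result instead of printing it, using the core Net::SMTP module. The mail host name and both addresses are placeholders invented for illustration, not anything from the original programs:

```perl
#!/usr/bin/perl
# Sketch: mail the cleaned-up text instead of printing it.
# "mailhost" and both addresses are hypothetical placeholders.
use Net::SMTP;

my $body = "the text left over after the s/// substitutions";
my $msg  = join "\n",
    'From: lwp-robot@example.com',
    'To: you@example.com',
    'Subject: Your page, by mail',
    '',
    $body;

# Quietly does nothing if no SMTP server answers at "mailhost"
if (my $smtp = Net::SMTP->new('mailhost', Timeout => 2)) {
    $smtp->mail('lwp-robot@example.com');
    $smtp->to('you@example.com');
    $smtp->data($msg);
    $smtp->quit;
}
```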
The currency.pl program converts money from one currency into another, using the exchange rates on www.oanda.com. Here’s how to find out what $17.39 is worth in Euros:

$ currency 17.39 USD EUR
--> 17.39 US Dollar = 20.00069 Euro
The LWP::Simple module has a function that makes retrieving web pages easy: get. When given a URL, get returns the text of that web page as one long string. In currency.pl, get is fed a URL for oanda.com containing the three arguments provided to the program: $ARGV[0], $ARGV[1], and $ARGV[2], which correspond to 17.39, USD, and EUR in the sample run above. The resulting web page is stored in $_, after which four s/// substitutions discard unwanted data.
#!/usr/bin/perl -w
# Currency converter.
# Usage: currency.pl [amount] [from curr] [to curr]

use LWP::Simple;

$_ = get("http://www.oanda.com/convert/classic?value=$ARGV[0]&exch=$ARGV[1]&expr=$ARGV[2]");

# Remove the text we don't care about
s/^.*<!-- conversion result starts//s;
s/<!-- conversion result ends.*$//s;
s/<[^>]+>//g;
s/\s+/ /gm;

print $_, "\n";
The first s/// removes all text before the HTML comment <!-- conversion result starts; the tail of that comment (-->) becomes the arrow that you see in the output. The second s/// removes all text after the conversion result. The third s/// dumbly removes all tags in the text that remains, and the final s/// replaces consecutive spaces and newlines with a single space each.
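The four substitutions can be tried offline on a canned fragment. The HTML below is a made-up stand-in for oanda.com’s page, not its real markup:

```perl
#!/usr/bin/perl
# The four currency.pl substitutions, run on a fake page fragment
$_ = "header junk<!-- conversion result starts-->\n" .
     "<b>17.39 US Dollar</b> = <b>20.00069 Euro</b>\n" .
     "<!-- conversion result ends-->footer junk";

s/^.*<!-- conversion result starts//s;  # keep only what follows the comment
s/<!-- conversion result ends.*$//s;    # ...and what precedes its partner
s/<[^>]+>//g;                           # strip the tags that remain
s/\s+/ /gm;                             # squeeze whitespace to single spaces

print $_, "\n";  # prints: --> 17.39 US Dollar = 20.00069 Euro (plus a trailing space)
```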
Weather information is downloaded from www.intellicast.com in much the same way as currency information is downloaded from www.oanda.com. The URL is different, some of the s/// substitutions are different, but the basic operation is the same. As an added treat, weather.pl uses the Text::Wrap module to format the output to 76 columns. Here’s the gloomy forecast for Boston in February:
$ weather bos
Wednesday: Overcast. High near 49F. Winds SSE 15 to 20 mph. Wednesday
night: Rain showers early becoming steady overnite. Low near
43F. Winds S 10 to 15 mph. Rainfall around a quarter of an inch.
Thursday: A steady rain in the morning. Showers continuing in the
afternoon. High near 55F. Winds SW 15 to 20 mph. Chance of precip
80%. Rainfall around a quarter of an inch. Thursday night: A few clouds
from time to time. Low around 38F. Winds W 10 to 15 mph.
Friday: More clouds than sun. Highs in the low 50s and lows in the low 30s.
Saturday: Mostly cloudy. Highs in the mid 40s and lows in the low 30s.
Sunday: More clouds than sun. Highs in the upper 40s and lows in the upper
30s.
Monday: Occasional showers. Highs in the upper 40s and lows in the upper 20s.
Tuesday: Showers possible. Highs in the low 40s and lows in the upper 20s.
Wednesday: Considerable cloudiness. Highs in the low 40s and lows in the
upper 20s.
Thursday: Considerable cloudiness. Highs in the low 40s and lows in the
upper 20s.
Friday: Partly Cloudy
Here’s weather.pl:
#!/usr/bin/perl
# Prints the weather for a given airport code
#
# Examples: weather.pl bos
#           weather.pl sfo

use LWP::Simple;
use Text::Wrap;

$_ = get("http://intellicast.com/Local/USLocalStd.asp?loc=k" . $ARGV[0] .
         "&seg=LocalWeather&prodgrp=Forecasts&product=Forecast&prodnav=none&pid=none");

# Remove the text we don't care about
s/Click Here for Averages and Records/ /gim;
s/<[^>]+>//gm;
s/Trip Ahead.*$//sim;
s/&nbsp;/ /gm;
s/^(?!\w+day:).*?$//gm;
s/^\s+$//gm;

print wrap('', '', $_);   # Format and print the weather report
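Text::Wrap’s behavior is easy to check in isolation: wrap() refills its input so no line exceeds $Text::Wrap::columns characters, and 76 is the module’s default, which is why weather.pl gets 76-column output without setting anything. A quick stand-alone check:

```perl
#!/usr/bin/perl
# Text::Wrap refills long text; 76 columns is the module's default width.
use Text::Wrap;

$Text::Wrap::columns = 76;    # stating the default explicitly
my $wrapped = wrap('', '', "Mostly cloudy. " x 20);
for my $line (split /\n/, $wrapped) {
    die "line too long" if length($line) > 76;
}
print "every line fits in 76 columns\n";
```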
The CNN home page displays the top news story; our cnn.pl program formats and displays it using Text::Wrap. I sandwiched Dan’s code in a while loop that sleeps for 5 minutes (300 seconds) and retrieves the top story again. If the new story (as usual, stored in $_) is different from the old story ($old), it’s printed.
#!/usr/bin/perl
#
# cnn.pl: continuously display the top story on CNN

use LWP::Simple;
use Text::Wrap;

$| = 1;

while (1) {                         # Run forever
    $_ = get("http://www.cnn.com");

    s/FULL STORY.*//sm;
    s/\A.*Updated.*?$//sm;
    s/<[^>]+>//gm;
    s/ +/ /gm;

    if ($old ne $_) {               # If it's a new story,
        print wrap('', '', $_);     # Format and print it
        $old = $_;                  # ...and remember it
    }
    sleep 300;                      # Sleep for five minutes
}
Back in 1999, there was a TPJ subscriber in Cambridge who wasn’t getting his issues. When each issue went to press, Jon FTP’d the TPJ mailing list to a professional mail house for presorting and bagging and labeling that the U.S. Post Office requires (an improvement over the days when Jon addressed every issue himself in a cloud of Glu-Stik vapors).
The problem was that the mail house fixed addresses that seemed incorrect. “Albequerque” became “Albuquerque,” and “Somervile” became “Somerville”. Which is great, as long as the rules for correcting addresses—developed by the Post Office—work. They usually do, but occasionally a correct address is “fixed” to an incorrect address. That’s what happened to this subscriber, and here’s how Jon found out.
The address.pl program pretends to be a user typing information into the fields of the post office’s web page at http://www.usps.com/ncsc/. That page asks for six fields: company (left blank for residential addresses), urbanization (valid only for Puerto Rico), street, city, state, and zip.
You need to provide the street, and either the zip or the city and state. Regardless of which information you provide, the site responds with a complete address and mail route:
$ address company "O'Really" urbanization "" street "90 Shirman" city "Cambridge" state "MA" zip ""

90 SHERMAN ST
CAMBRIDGE MA 02140-3233

Carrier Route : C074
County : MIDDLESEX
Delivery Point : 90
Check Digit : 3
Note that I deliberately inserted two spelling errors: O’Really and Shirman. The post office’s database is reasonably resilient.
One inconvenience of address.pl is that you have to supply placeholders for all the fields, even the ones you’re leaving blank, like urbanization and zip above.
This program is trickier than the three you’ve seen. It doesn’t use LWP::Simple, but two other modules from the LWP bundle: LWP::UserAgent and HTTP::Request::Common. That’s because LWP::Simple can handle only HTTP GET queries. This web site uses a POST query, and so Dan used the more sophisticated LWP::UserAgent module, which has an object-oriented interface.
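The practical difference is where the form data travels: in a POST, the name=value pairs are URL-encoded into the request body rather than tacked onto the URL as in a GET. HTTP::Request::Common does this encoding for you; here is a hand-rolled sketch of it using only core Perl, with field names invented for illustration:

```perl
#!/usr/bin/perl
# URL-encode form pairs the way a POST body carries them.
# Field names here are illustrative, not the post office's actual fields.
my %form = (street => '90 Sherman', city => 'Cambridge');

my $body = join '&', map {
    my ($k, $v) = ($_, $form{$_});
    for ($k, $v) {
        s/([^A-Za-z0-9_.~-])/sprintf '%%%02X', ord $1/ge;  # escape unsafe bytes
    }
    "$k=$v";
} sort keys %form;

print "$body\n";   # city=Cambridge&street=90%20Sherman
```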
First, an LWP::UserAgent object, $ua, is created with new and its request method invoked to POST the address data to the web page. If the POST was successful, the is_success method returns true, and the page contents can then be found in the _content attribute of the response object, $resp. The address is extracted as the _content is being stored in $_, and two more s/// substitutions remove unneeded data.
#!/usr/bin/perl -w
# Need *either* state *or* zip

use LWP::UserAgent;
use HTTP::Request::Common;

# Create a new UserAgent object and invoke its request() method
$ua = new LWP::UserAgent;
$resp = $ua->request(POST 'http://www.usps.com/cgi-bin/zip4/zip4inq2', [@ARGV]);

exit -1 unless $resp->is_success;

# Remove the text we don't care about
($_ = $resp->{_content}) =~ s/^.*address is:<p> //si;
s/Version .*$//s;
s/<[^>]+>//g;

print;
You can use address.pl to determine the zip code given an address, or to find out your own nine-digit zip code, or even to find out who’s on the same mail carrier route as you. If you type in the address of the White House, you’ll learn that the First Lady has her own zip code, 20500-0002.
Salomon Smith Barney’s web site is one of many with free 15-minute delayed stock quotes. To find the stock price for Yahoo, you’d provide stock.pl with its ticker symbol, yhoo:

$ stock.pl YHOO
$17.30
Like address.pl, stock.pl needs the LWP::UserAgent module since it’s making a POST query.

Just because LWP::UserAgent has an OO interface doesn’t mean the program has to spend an entire line creating an object and explicitly storing it ($object = new Class), although that’s undoubtedly what Gisle Aas envisioned when he wrote the interface. Here, Dan’s preoccupation with brevity shows, as he invokes an object’s method in the same statement that creates the object: (new LWP::UserAgent)->request(…).
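Nothing about that idiom is LWP-specific; it works with any class. Here is a toy class, invented purely for illustration, showing both styles side by side:

```perl
#!/usr/bin/perl
# "(new Class)->method" versus the two-line style, with a made-up class.
package Greeter;
sub new   { bless {}, shift }
sub greet { return "hello" }

package main;

# The expansive style: one line to construct, one to use
my $object   = Greeter->new;
my $longhand = $object->greet;

# Dan's compressed style: construct and call in one statement
my $shorthand = (new Greeter)->greet;

print "$shorthand\n" if $shorthand eq $longhand;
```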
#!/usr/bin/perl # Pulls a stock quote from Salomon Smith Barney's web site. # # Usage: stock.pl ibm # # or whatever stock ticker symbol you like. use LWP::UserAgent; use HTTP::Request::Common; $response = (new LWP::UserAgent)->request(POST 'http://www.salomonsmithbarney.com/cgi-bin/benchopen/sb_quote', [ search_type => "1", search_string => "$ARGV[0]" ]); exit -1 unless $response->is_success; $_ = $response->{_content}; m/ Price.*?($d+.?d+)/gsm; print $1;
These aren’t robust programs. They were dashed off in a couple of minutes for one person’s pleasure, and they most certainly will break as the companies in charge of these pages change the web page formats or the URLs needed to access them.
We don’t care. When that happens, these scripts will break, we’ll notice that, and we’ll amend them accordingly. Sure, each of these programs could be made much more flexible. They could be primed to adapt to changes in the HTML, the way a human would if the information were moved around on the web page. When the s/// expressions failed, the programs could expend some effort trying to understand the HTML using a more intelligent parsing scheme, perhaps using the HTML::Parse or Parse::RecDescent modules. If the URL became invalid, the scripts might start at the site home page and pretend to be a naive user looking for his weather or news or stock fix. A smart enough script could start at Yahoo and follow links until it found what it was looking for, but so far no one has written a script like that.
All five of these programs worked as originally printed in TPJ #13, but as one would expect, all five of them broke in the two years between publication of the magazine and publication of this book. Since the template of the programs was sound, it took only a few minutes to update each, and the programs you see here all work perfectly as of December 2002.
The next article, Downloading Web Pages Through a Proxy Server, shows how to adapt these programs for use in computing environments with firewalls.