Chapter 11. Five Quick Hacks: Downloading Web Pages

Jon Orwant

Dan Gruhl

Sometimes it’s nice to visit web sites without being in front of your computer. Maybe you’d prefer to have the text of web pages mailed to you, or be notified when a web page changes. Or maybe you’d like to download a lot of information from a huge number of web pages (as in the article webpluck), and you don’t want to open them all one by one. Or maybe you’d like to write a robot that scours the web for information. Enter the LWP bundle (sometimes called libwww-perl), which contains two modules that can download web pages for you: LWP::Simple and LWP::UserAgent. LWP is available on CPAN and is introduced in Scripting the Web with LWP.

Dan Gruhl submitted five tiny but exquisite programs to TPJ, all using LWP to automatically download information from a web service. Instead of sprinkling these around various issues as one-liners, I’ve collected all five here with a bit of explanation for each.

The first thing to notice is that all five programs look alike. Each uses an LWP module (LWP::Simple in the first three, LWP::UserAgent in the last two) to store the HTML from a web page in Perl’s default scalar variable $_. Then they use a series of s/// substitutions to discard the extraneous HTML. The remaining text—the part we’re interested in—is displayed on the screen, although it could nearly as easily have been sent as email with the various Mail modules on CPAN.
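That shared template can be sketched without touching the network. The snippet below substitutes a canned HTML string for the result of get; the markers and page content are made up for illustration, but the shape (fetch into $_, whittle with s///, print) is exactly the one all five programs follow.

```perl
#!/usr/bin/perl -w
# A sketch of the template the five programs share, run against a
# canned HTML string instead of a live get() so it works offline.
# The comment markers and page text here are invented for illustration.

$_ = '<html><body>junk <!-- start -->  42 Widgets  <!-- end --> junk</body></html>';

s/^.*<!-- start -->//s;     # discard everything before the opening marker
s/<!-- end -->.*$//s;       # discard everything after the closing marker
s/<[^>]+>//g;               # strip any tags that remain
s/\s+/ /g;                  # collapse runs of whitespace to single spaces

print "$_\n";
```

In the real programs the first line is `$_ = get($url)` and the markers are whatever landmarks the target page happens to contain.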

Downloading Currency Exchange Rates

The currency.pl program converts money from one currency into another, using the exchange rates on www.oanda.com. Here’s how to find out what $17.39 is worth in Euros:

$ currency 17.39 USD EUR
--> 17.39 US Dollar = 20.00069 Euro

The LWP::Simple module has a function that makes retrieving web pages easy: get. When given a URL, get returns the text of that web page as one long string. In currency.pl, get is fed a URL for oanda.com containing the three arguments provided to the program: $ARGV[0], $ARGV[1], and $ARGV[2], which correspond to 17.39, USD, and EUR in the sample run above. The resulting web page is stored in $_, after which four s/// substitutions discard unwanted data.

#!/usr/bin/perl -w

# Currency converter.
# Usage: currency.pl [amount] [from curr] [to curr]

use LWP::Simple;

$_ = get("http://www.oanda.com/convert/classic" .
         "?value=$ARGV[0]&exch=$ARGV[1]&expr=$ARGV[2]");

# Remove the text we don't care about
s/^.*<!-- conversion result starts//s;
s/<!-- conversion result ends.*$//s;
s/<[^>]+>//g;
s/\s+/ /gm;

print $_, "\n";

The first s/// removes all text before the HTML comment <!-- conversion result starts; the tail of that comment (-->) becomes the arrow that you see in the output. The second s/// removes all text after the conversion result. The third s/// dumbly removes all tags in the text that remains, and the final s/// replaces consecutive spaces and newlines with a single space each.

Downloading Weather Information

Weather information is downloaded from www.intellicast.com in much the same way as currency information is downloaded from www.oanda.com. The URL is different, some of the s/// substitutions are different, but the basic operation is the same. As an added treat, weather.pl uses the Text::Wrap module to format the output to 76 columns. Here’s the gloomy forecast for Boston in February:

$ weather bos
Wednesday: Overcast. High near 49F. Winds SSE 15 to 20 mph. Wednesday
night: Rain showers early becoming steady overnite. Low near
43F. Winds S 10 to 15 mph. Rainfall around a quarter of an inch.

Thursday: A steady rain in the morning. Showers continuing in the
afternoon. High near 55F. Winds SW 15 to 20 mph. Chance of precip
80%. Rainfall around a quarter of an inch. Thursday night: A few clouds
from time to time. Low around 38F. Winds W 10 to 15 mph.

Friday: More clouds than sun. Highs in the low 50s and lows in the low 30s.

Saturday: Mostly cloudy. Highs in the mid 40s and lows in the low 30s.

Sunday: More clouds than sun. Highs in the upper 40s and lows in the upper
30s.

Monday: Occasional showers. Highs in the upper 40s and lows in the upper 20s.

Tuesday: Showers possible. Highs in the low 40s and lows in the upper 20s.

Wednesday: Considerable cloudiness. Highs in the low 40s and lows in the
upper 20s.

Thursday: Considerable cloudiness. Highs in the low 40s and lows in the
upper 20s.

Friday: Partly Cloudy

Here’s weather.pl:

#!/usr/bin/perl

# Prints the weather for a given airport code
#
# Examples: weather.pl bos
#           weather.pl sfo

use LWP::Simple;
use Text::Wrap;

$_ = get("http://intellicast.com/Local/USLocalStd.asp?loc=k" . $ARGV[0] .
         "&seg=LocalWeather&prodgrp=Forecasts&product=Forecast" .
         "&prodnav=none&pid=none");

# Remove the text we don't care about
s/Click Here for Averages and Records/\n/gim;
s/<[^>]+>//gm;
s/Trip Ahead.*$//sim;
s/&nbsp;/ /gm;
s/^(?!\w+day:).*?$//gm;
s/^\s+$//gm;

print wrap('', '', $_);     # Format and print the weather report

Downloading News Stories

The CNN home page displays the top news story; our cnn.pl program formats and displays it using Text::Wrap. I sandwiched Dan’s code in a while loop that sleeps for 5 minutes (300 seconds) and retrieves the top story again. If the new story (as usual, stored in $_) is different from the old story ($old), it’s printed.

#!/usr/bin/perl
#
# cnn.pl: continuously display the top story on CNN

use LWP::Simple;
use Text::Wrap;

$| = 1;

while (1) {                            # Run forever
    $_ = get("http://www.cnn.com");
    s/FULL STORY.*//sm;
    s/A.*Updated.*?$//sm;
    s/<[^>]+>//gm;
    s/\n\n+/\n\n/gm;
    if ($old ne $_) {                  # If it's a new story,
        print wrap('', '', $_);        # Format and print it
        $old = $_;                     # ...and remember it
     }
    sleep 300;                         # Sleep for five minutes
}

Completing U.S. Postal Addresses

Back in 1999, there was a TPJ subscriber in Cambridge who wasn’t getting his issues. When each issue went to press, Jon FTP’d the TPJ mailing list to a professional mail house for presorting and bagging and labeling that the U.S. Post Office requires (an improvement over the days when Jon addressed every issue himself in a cloud of Glu-Stik vapors).

The problem was that the mail house fixed addresses that seemed incorrect. “Albequerque” became “Albuquerque,” and “Somervile” became “Somerville”. Which is great, as long as the rules for correcting addresses—developed by the Post Office—work. They usually do, but occasionally a correct address is “fixed” to an incorrect address. That’s what happened to this subscriber, and here’s how Jon found out.

The address.pl program pretends to be a user typing information into the fields of the post office’s web page at http://www.usps.com/ncsc/. That page asks for six fields: company (left blank for residential addresses), urbanization (valid only for Puerto Rico), street, city, state, and zip. You need to provide the street, and either the zip or the city and state. Regardless of which information you provide, the site responds with a complete address and mail route:

$ address company "O'Really" urbanization "" street "90 Shirman" \
          city "Cambridge" state "MA" zip ""

90 SHERMAN ST
CAMBRIDGE MA 02140-3233
Carrier Route : C074
County : MIDDLESEX
Delivery Point : 90
Check Digit : 3

Note that I deliberately inserted spelling errors: O’Really and Shirman. The post office’s database is reasonably resilient.

One inconvenience of address.pl is that you have to supply placeholders for all the fields, even the ones you’re leaving blank, like urbanization and zip above.
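A small front end could supply those placeholders for you. The sketch below (the helper name and defaults are mine, not part of address.pl) fills in an empty string for every field you omit, producing a list ready to hand to the POST call:

```perl
#!/usr/bin/perl -w
# Sketch: fill in empty placeholders for any address fields the user
# omits, so the full six needn't be typed. The field names match the
# ones address.pl sends; complete_args() is an invented helper.
use strict;

my @fields = qw(company urbanization street city state zip);

sub complete_args {
    my %given = @_;
    # Every known field appears in the result; missing ones become "".
    return map { $_ => (defined $given{$_} ? $given{$_} : "") } @fields;
}

my %args = complete_args(street => "90 Sherman", city => "Cambridge",
                         state  => "MA");
# %args is now suitable for: $ua->request(POST $url, [ %args ]);
print map { "$_ => '$args{$_}'\n" } @fields;
```

With this in place, a run could name only the street, city, and state, and the blanks for company, urbanization, and zip would be supplied automatically.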

This program is trickier than the three you’ve seen. It doesn’t use LWP::Simple, but two other modules from the LWP bundle: LWP::UserAgent and HTTP::Request::Common. That’s because LWP::Simple can handle only HTTP GET queries. This web site uses a POST query, and so Dan used the more sophisticated LWP::UserAgent module, which has an object-oriented interface.

First, an LWP::UserAgent object, $ua, is created with new and its request method invoked to POST the address data to the web page. If the POST was successful, the is_success method returns true, and the page contents can then be found in the _content attribute of the response object, $resp. The address is extracted as the _content is being stored in $_, and two more s/// substitutions remove unneeded data.

#!/usr/bin/perl -w
# Need *either* state *or* zip

use LWP::UserAgent;
use HTTP::Request::Common;

# Create a new UserAgent object and invoke its request() method
$ua = new LWP::UserAgent;
$resp = $ua->request(POST 'http://www.usps.com/cgi-bin/zip4/zip4inq2', [@ARGV]);

exit -1 unless $resp->is_success;

# Remove the text we don't care about
($_ = $resp->{_content}) =~ s/^.*address is:<p>\n//si;
s/Version .*$//s;
s/<[^>]+>//g;

print;

You can use address.pl to determine the zip code given an address, or to find out your own nine-digit zip code, or even to find out who’s on the same mail carrier route as you. If you type in the address of the White House, you’ll learn that the First Lady has her own zip code, 20500-0002.

Downloading Stock Quotes

Salomon Smith Barney’s web site is one of many with free 15-minute delayed stock quotes. To find the stock price for Yahoo, you’d provide stock.pl with its ticker symbol, yhoo:

$ stock.pl YHOO
$17.30

Like address.pl, stock.pl needs the LWP::UserAgent module since it’s making a POST query.

Just because LWP::UserAgent has an OO interface doesn’t mean the program has to spend an entire line creating an object and explicitly storing it ($object = new Class), although that’s undoubtedly what Gisle Aas envisioned when he wrote the interface. Here, Dan’s preoccupation with brevity shows, as he invokes an object’s method in the same statement that creates the object: (new LWP::UserAgent)->request(…).
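The idiom itself has nothing to do with LWP; any class works. The toy package below (invented for illustration, so it runs offline) shows that the construct-and-call one-liner produces exactly the same result as the two-statement version:

```perl
#!/usr/bin/perl -w
# The construct-and-call idiom, shown with a toy class so no network
# access is needed. Greeter is invented purely for this demonstration.
package Greeter;
sub new   { bless {}, shift }
sub greet { "hello, $_[1]" }

package main;

# Two-statement form: create the object, then call the method.
my $obj    = Greeter->new;
my $first  = $obj->greet("world");

# Dan's one-statement form: create and call in a single expression.
my $second = (new Greeter)->greet("world");

print "$first\n$second\n";
```

Both print the same greeting; the one-liner simply never bothers to keep the object around.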

#!/usr/bin/perl

# Pulls a stock quote from Salomon Smith Barney's web site.
#
# Usage:       stock.pl ibm
#
# or whatever stock ticker symbol you like.

use LWP::UserAgent;
use HTTP::Request::Common;

$response = (new LWP::UserAgent)->request(POST
      'http://www.salomonsmithbarney.com/cgi-bin/benchopen/sb_quote',
              [ search_type => "1",
              search_string => "$ARGV[0]" ]);

exit -1 unless $response->is_success;
$_ = $response->{_content};
m/ Price.*?(\$\d+\.?\d+)/gsm;
print $1;

Conclusion

These aren’t robust programs. They were dashed off in a couple of minutes for one person’s pleasure, and they most certainly will break as the companies in charge of these pages change the web page formats or the URLs needed to access them.

We don’t care. When that happens, these scripts will break, we’ll notice that, and we’ll amend them accordingly. Sure, each of these programs could be made much more flexible. They could be primed to adapt to changes in the HTML, the way a human would if the information were moved around on the web page. Then the s/// expressions would fail, and the programs could expend some effort trying to understand the HTML using a more intelligent parsing scheme, perhaps using the HTML::Parse or Parse::RecDescent modules. If the URL became invalid, the scripts might start at the site home page and pretend to be a naive user looking for his weather or news or stock fix. A smart enough script could start at Yahoo and follow links until it found what it was looking for, but so far no one has written a script like that.
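One such fallback might look like the sketch below, which uses HTML::Parser (the module underlying much of the LWP toolchain) instead of hand-rolled s/// expressions. The page string here is a stand-in for a fetched quote page; rather than anchoring on exact markup, the sketch collects all of the page’s decoded text and then searches that, so it survives tags being rearranged:

```perl
#!/usr/bin/perl -w
# Sketch of a more tag-tolerant extractor using HTML::Parser.
# The $html string stands in for a page fetched with LWP; the markup
# around the price can change without breaking the extraction.
use HTML::Parser;

my $html = '<html><b>Price</b>: <i>$17.30</i></html>';

my $text = "";
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, "dtext" ],  # gather text, entities decoded
);
$p->parse($html);
$p->eof;

print "$1\n" if $text =~ /Price.*?(\$\d+\.?\d*)/s;
```

Run against the sample string, this prints the captured $17.30 no matter which tags surround the label and the number.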

Of course, the time needed to create and test such programs would be much longer than making quick, brittle, and incremental changes to the code already written. No, it’s not rocket science—it’s not even computer science—but it gets the job done.

Afterword

All five of these programs worked as originally printed in TPJ #13, but as one would expect, all five of them broke in the two years between publication of the magazine and publication of this book. Since the template of the programs was sound, it took only a few minutes to update each, and the programs you see here all work perfectly as of December 2002.

The next article, Downloading Web Pages Through a Proxy Server, shows how to adapt these programs for use in computing environments with firewalls.
