Christmas is coming, and Santa has outsourced his deliveries to Federal Express. Lucky us, as that means we can use FedEx’s online shipment tracker to watch our parcels wend their merry way here. Except the tiresome chore of refreshing the FedEx site is just too much to handle. Let’s let a nice script elf take care of it and create a feed for every parcel.
Although FedEx and its rivals do provide APIs, we won’t be using them here. FedEx’s page is easy enough to scrape by brute force, and it’s fun to do so. Of course, when it next changes its page layout, this script will need rejigging. It’s easy to see how to do that when it happens.
So, starting with the usual Perl standards of warnings, strict, XML::RSS, and CGI, let’s use LWP::Simple to retrieve the page and the marvellous HTML::TokeParser to do the dirty work. More on that anon.
use warnings;
use strict;
use XML::RSS;
use CGI qw(:standard);
use LWP::Simple 'get';
use HTML::TokeParser;
Now let’s set up some variables to use later, then fire up the CGI module and grab the tracking number from the query string. To use this script, therefore, you need to request:
http://www.example.org/fedextracker.cgi?track=123456789
where 123456789 is the tracking number of the parcel:
my ( $tag, $headline, $url, $date_line );
my $last_good_date;
my $table_end_check;

my $cgi = CGI->new();
my $tracking_number = $cgi->param('track');
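Because the tracking number comes straight from the query string and is later interpolated into a URL, it may be worth checking it first. The following is only a sketch: clean_tracking_number is a hypothetical helper, and the rule that a tracking number is a plain run of digits is an assumption, not something FedEx documents here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value unchanged if it consists only of
# digits, or undef otherwise. The digits-only rule is an assumption.
sub clean_tracking_number {
    my ($raw) = @_;
    return unless defined $raw && $raw =~ /\A\d+\z/;
    return $raw;
}

# Invented sample inputs
for my $candidate ( '123456789', '1234&cntry_code=xx', '' ) {
    my $clean = clean_tracking_number($candidate);
    printf "%-22s => %s\n", "'$candidate'", defined $clean ? 'ok' : 'rejected';
}
```

In the CGI script, the same check could wrap the value returned by $cgi->param('track') before it goes anywhere near the FedEx URL.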
Now we’re ready to jingle. Using the get function from LWP::Simple, pull down the page from the FedEx site. FedEx, bless them, employ openly understandable URLs, so this is easy to set up. Once that’s downloaded, throw it into a new instance of the HTML::TokeParser module, and we’re ready for scraping:
my $tracking_page = get(
    "http://fedex.com/Tracking?action=track&tracknumber_list=$tracking_number&cntry_code=us"
);

# Pass a reference so the string is parsed as HTML rather than opened as a filename
my $stream = HTML::TokeParser->new( \$tracking_page );
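LWP::Simple’s get returns undef when the download fails, so it can be worth checking the result before handing it to the parser. A minimal sketch, in which parser_for is a hypothetical helper and the page content is invented so that nothing is fetched over the network:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: build a parser only if the download produced content.
sub parser_for {
    my ($page) = @_;
    return unless defined $page && length $page;

    # A scalar reference means "parse this string"; a plain string would be
    # treated as a filename.
    return HTML::TokeParser->new( \$page );
}

# Simulate a failed and a successful download with invented values.
my $failed = parser_for(undef);
my $ok     = parser_for('<html><body><b>In transit</b></body></html>');

print defined $failed ? "parser built\n" : "no page, nothing to parse\n";

$ok->get_tag('b');
my $text = $ok->get_trimmed_text('/b');
print "$text\n";
```

In the tracker script, a failed fetch could then produce an error response instead of an empty feed.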
Now is as good a time as any to start off XML::RSS and fill in some channel details:
my $rss = XML::RSS->new();

$rss->channel(
    title => "FedEx Tracking: $tracking_number",
    link  => "http://fedex.com/Tracking?action=track&tracknumber_list=$tracking_number&cntry_code=us"
);
From now on, we’re using the HTML::TokeParser module, skipping from tag to tag until we get to the section of the HTML to scrape. The inline comments say what we’re up to.
# Go to the right part of the page, skipping 13 tables (!!!)
$stream->get_tag("table") for 1 .. 13;

# Now go inside the tracking details table
$stream->get_tag("table");
$stream->get_tag("tr");
$stream->get_tag("/tr");
$stream->get_tag("tr");
$stream->get_tag("/tr");
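The tag-skipping idea is easier to see on a small page. The sketch below is an illustration only: skip_tags is a hypothetical helper and the three-table HTML is invented, standing in for the much larger FedEx page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: advance the parser past $count opening tags named
# $name. Returns how many were actually found, which is fewer than $count
# if the document runs out first.
sub skip_tags {
    my ( $stream, $name, $count ) = @_;
    my $found = 0;
    for ( 1 .. $count ) {
        last unless $stream->get_tag($name);
        $found++;
    }
    return $found;
}

# Invented sample page: two layout tables, then the table we care about.
my $html = '<table><tr><td>nav</td></tr></table>'
         . '<table><tr><td>ads</td></tr></table>'
         . '<table><tr><td>Delivered</td></tr></table>';

my $stream = HTML::TokeParser->new( \$html );

my $skipped = skip_tags( $stream, 'table', 2 );

# Step into the third table and read its first cell.
$stream->get_tag('table');
$stream->get_tag('td');
my $cell = $stream->get_trimmed_text('/td');

print "skipped $skipped tables, then read: $cell\n";
```

Checking the returned count is one way to notice that the page layout has changed: if fewer tables were found than expected, the scraper is probably looking at the wrong part of the page.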
By this point, you’re at the table to parse, so loop through it, getting the dates and locations. You need to stop at the bottom of the table, so test for a closing /table tag. You can do so with a named loop and a last...if statement.
You’ll notice that in this section, we use those mysterious variables from earlier. Because the table is displayed with the date mentioned only once per day, no matter how many stops the parcel makes on that day, you need to keep track of it.
PARSE: while ( $tag = $stream->get_tag('tr') ) {
    $stream->get_tag("td");
    $stream->get_tag("/td");

    # Test here for the closing /tr. If it exists, we're done.

    # Now get date text
    $stream->get_tag("td");
    $stream->get_tag("b");
    my $date_text = $stream->get_trimmed_text("/b");

    # The page only mentions the date once, so we need to fill in any blanks
    # that might occur. A blank cell holds a lone non-breaking space (\xa0).
    if ( $date_text eq "\xa0" ) {
        $date_text = $last_good_date;
    }
    else {
        $last_good_date = $date_text;
    }

    # Now get the time text
    $stream->get_tag("/b");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $time_text = $stream->get_trimmed_text("/td");
    $time_text =~ s/\xa0//g;

    # Now get the status
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $status = $stream->get_trimmed_text("/td");
    $status =~ s/\xa0//g;

    # Now get the location
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $location = $stream->get_trimmed_text("/td");
    $location =~ s/\xa0//g;

    # Now get the comment
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $comment = $stream->get_trimmed_text("/td");
    $comment =~ s/\xa0//g;

    # Now go to the end of the block
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/tr");

    # OK, now we have the details, we need to put them into a feed
    # Do what you want with the info:
Still inside the loop, create an item for the RSS feed:
    if ($status) {
        $rss->add_item(
            title       => "$status $location $date_text $time_text",
            link        => "http://fedex.com/us/tracking/?action=track&tracknumber_list=$tracking_number",
            description => "Package number $tracking_number was last seen in $location at $time_text on $date_text, with the status, $status. $comment Godspeed, little parcel! Onward, tiny package!"
        );
    }

    # Stop parsing after the pickup line.
    last PARSE if ( $status eq "Picked up " );
}
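The date carry-forward inside the loop can be tried in isolation. This is a sketch under two assumptions: that a blank date cell holds a single non-breaking space (character 0xA0), and that the sample dates, which are invented, arrive in page order. fill_dates is a hypothetical helper, not part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper: given the date cells in page order, replace each
# blank cell (a lone non-breaking space) with the last real date seen.
sub fill_dates {
    my @cells = @_;
    my $last_good_date;
    my @filled;
    for my $cell (@cells) {
        if ( $cell eq "\xa0" ) {
            push @filled, $last_good_date;
        }
        else {
            $last_good_date = $cell;
            push @filled, $cell;
        }
    }
    return @filled;
}

# Invented sample data: two stops share the first date, one shares the second.
my @dates = fill_dates( "Dec 23, 2004", "\xa0", "\xa0", "Dec 22, 2004", "\xa0" );
print join( " | ", @dates ), "\n";
```

Every row ends up with a usable date, which is what lets each feed item carry its own date in the title and description.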
All that done, you can serve it up nice and festive:
print header('application/rss+xml');
print $rss->as_string;
#!/usr/bin/perl
use warnings;
use strict;
use XML::RSS;
use CGI qw(:standard);
use LWP::Simple 'get';
use HTML::TokeParser;

my ( $tag, $headline, $url, $date_line );
my $last_good_date;
my $table_end_check;

my $cgi = CGI->new();
my $tracking_number = $cgi->param('track');

my $tracking_page = get(
    "http://fedex.com/Tracking?action=track&tracknumber_list=$tracking_number&cntry_code=us"
);

# Pass a reference so the string is parsed as HTML rather than opened as a filename
my $stream = HTML::TokeParser->new( \$tracking_page );

my $rss = XML::RSS->new();

$rss->channel(
    title => "FedEx Tracking: $tracking_number",
    link  => "http://fedex.com/Tracking?action=track&tracknumber_list=$tracking_number&cntry_code=us"
);

# Go to the right part of the page, skipping 13 tables (!!!)
$stream->get_tag("table") for 1 .. 13;

# Now go inside the tracking details table
$stream->get_tag("table");
$stream->get_tag("tr");
$stream->get_tag("/tr");
$stream->get_tag("tr");
$stream->get_tag("/tr");

PARSE: while ( $tag = $stream->get_tag('tr') ) {
    $stream->get_tag("td");
    $stream->get_tag("/td");

    # Test here for the closing /tr. If it exists, we're done.

    # Now get date text
    $stream->get_tag("td");
    $stream->get_tag("b");
    my $date_text = $stream->get_trimmed_text("/b");

    # The page only mentions the date once, so we need to fill in any blanks
    # that might occur. A blank cell holds a lone non-breaking space (\xa0).
    if ( $date_text eq "\xa0" ) {
        $date_text = $last_good_date;
    }
    else {
        $last_good_date = $date_text;
    }

    # Now get the time text
    $stream->get_tag("/b");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $time_text = $stream->get_trimmed_text("/td");
    $time_text =~ s/\xa0//g;

    # Now get the status
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $status = $stream->get_trimmed_text("/td");
    $status =~ s/\xa0//g;

    # Now get the location
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $location = $stream->get_trimmed_text("/td");
    $location =~ s/\xa0//g;

    # Now get the comment
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("td");
    my $comment = $stream->get_trimmed_text("/td");
    $comment =~ s/\xa0//g;

    # Now go to the end of the block
    $stream->get_tag("/td");
    $stream->get_tag("/td");
    $stream->get_tag("/tr");

    # OK, now we have the details, we need to put them into a feed
    # Do what you want with the info:
    if ($status) {
        $rss->add_item(
            title       => "$status $location $date_text $time_text",
            link        => "http://fedex.com/us/tracking/?action=track&tracknumber_list=$tracking_number",
            description => "Package number $tracking_number was last seen in $location at $time_text on $date_text, with the status, $status. $comment Godspeed, little parcel! Onward, tiny package!"
        );
    }

    # Stop parsing after the pickup line.
    last PARSE if ( $status eq "Picked up " );
}

print header('application/rss+xml');
print $rss->as_string;