The beginning of the script is simple, since we have already decided which modules to use. In the opening lines, we make sure that we have access to the features of those modules:
#!/usr/local/bin/perl

use MIME::Parser;    # The MIME-Tools Parser (reads/writes
                     # MIME entities).
use Term::ReadLine;  # ReadLine library (used for user input).
use Getopt::Std;     # Used to get command line arguments.
Determining which modules are available can be more difficult than using them. A periodic review of the CPAN, perhaps with the excellent regular expression-based search features of the CPAN Perl module (try perl -MCPAN -e shell), is useful.
Naturally, the first line is needed only under Unix. Next, we will need to set some initial variables. This is good practice, although not strictly required in a scripting language like Perl. We will need to have some flags to tell us where we are in the parsing process and what the user has requested (such as a flag for whether or not we are compressing the output file). We will also need to set up object instances for each of the modules that we are using, since they use Perl’s object-oriented interface. Many modules have their own requirements; refer to their perldoc documentation to see initialization and usage examples in detail.
In general, MIME-tools is used by creating an object instance of MIME::Parser, in order to read a message. Later we will see how to get objects for each MIME entity. MIME::Parser requires that a temporary directory be given, since it creates many temporary files in the course of its operation. It first creates a temporary file for a message, then another for each of its MIME entities. We have to be sure to delete them, which we will do along the way.
Term::ReadLine is a bit esoteric but simple to use nonetheless. A terminal object is created, and a file handle is associated with it for its output. This file handle is probably (but not necessarily) STDOUT. Every time we ask for input from the user, we have to provide a prompt. In the following code, we define three prompts, which together cover all that we need to ask.
Finally, we will need to set a message for use in creating replacement MIME attachments for the ones that we delete.
# Set initial variables.
$debug = 0;                   # Set to non-zero to show debugging info.
$archive = 0;                 # Set to non-zero if compressing output.
$archiver = '/bin/gzip';      # Program to use for file compression.
$archive_subscript = '.gz';   # File extension for compressed files.
$allowed_length = 0;          # Size in bytes above which MIME entities
                              # are checked.
$mbox_file = "";              # Mailbox filename.

# Variables used with the Term::ReadLine module to get user input:
$term = Term::ReadLine->new($0);   # The ReadLine object (new needs a name).
$length_prompt = "Enter the maximum size in bytes of attachments " .
                 "you wish to keep. [#/q] ";
$file_prompt = "Please enter the filename of your mailbox. [#/q] ";
$attach_prompt = "Do you wish to delete this attachment? [y/N/q] ";
$OUT = $term->OUT || \*STDOUT;     # Where the terminal output goes.

# Variables related to the current message.
$msg = "";                    # Current message contents.
$msg_num = 0;                 # The (linear) number of the current
                              # message.
$from_ = "";                  # The From_ line for the current message.
$date = localtime(time);      # The current time/date.

# The message created when an attachment is deleted.
$info_start = "This attachment has been deleted by $0 on $date.\n" .
              "The original attachment's headers were:\n\n";

# Variables used with MIME-Tools.
my($parser) = new MIME::Parser;   # The Parser object.
$parser->output_dir("/tmp");      # The temporary directory to use
                                  # when creating temporary files for
                                  # messages and MIME entities.
The first useful thing this script will need to do is to check its own command-line arguments (if any) and act accordingly. The Getopt::Std module makes this task easy. The getopts line in the following code takes any command-line arguments and associates them with variables prefixed with $opt_. So if we call the script with a -u option, the variable $opt_u will be set to 1; otherwise it is left undefined, which Perl treats as false. The colons in the syntax indicate that the -l and -f options take arguments; the appropriate variables will be set to those arguments.
The second thing to do is to return the usage statement, if the user gave us a -u flag. This should probably be in a subroutine to keep the code clean, but this script is so short we won’t worry about it. Note that if the usage statement is given, the script exits.
getopts('ul:f:ad');
# Return a usage statement if asked.
if ( $opt_u ) {
print<<ENDOFUSAGE;
$0 - an interactive script which allows you to scrub
mbox-style mailboxes of unwanted MIME attachments.
Synopsis: $0 [-u] [-a] [-d] [-l <length>] [-f <file>]
where <length> is the maximum size of attachments to ignore; and
<file> is the path and filename of an mbox-style mailbox.
ENDOFUSAGE
exit(0);
}
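To see what getopts actually produces, the following is a minimal sketch that uses only the core Getopt::Std module. It parses into a hash rather than the $opt_ globals that the script relies on; the parse_args helper and the sample arguments are our own illustrative assumptions, not part of the script.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper (not part of the script): parse an argument list
# with the same option string, but into a hash instead of $opt_* globals.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;    # getopts() always reads from @ARGV
    my %opts;
    getopts('ul:f:ad', \%opts) or die "Bad command-line options\n";
    return \%opts;
}

# Sample arguments, chosen only for illustration.
my $opts = parse_args('-a', '-l', '50000', '-f', 'inbox');

print "archive: ", ($opts->{a} ? "yes" : "no"), "\n";
print "debug:   ", (defined $opts->{d} ? "yes" : "no (undefined)"), "\n";
print "length:  $opts->{l}\n";
print "file:    $opts->{f}\n";
```

Flags that were given are set to 1, options with a colon receive their argument, and anything not given on the command line is simply absent.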
If we haven’t exited by this point, we are off and running. We will need to check to see which of the other command-line options are set and take appropriate action. In the case of the debugging (-d) and archive (-a) flags, we just need to note their presence or lack of it. In the case of the length (-l) and filename (-f) flags, we will need to ask the user for the information if it is not provided on the command line.
Using the Term::ReadLine module to get and operate on user input is easy; three lines are all that is needed to read a line (with the readline method), warn if there are errors (reported in $@), and update the terminal’s command-line history (with the addhistory method). In each case, we check to see if the user wants to quit (indicated by typing a response beginning with a letter q) before proceeding.
print "\nWelcome to $0!\n\n";

# Turn on debugging reports, if we have been requested to.
if ( $opt_d ) {
    $debug = 1;
}

# If we don't have an allowed attachment length, ask for it.
$allowed_length = $opt_l;
unless ( $allowed_length ) {
    $answer = $term->readline($length_prompt);
    warn $@ if $@;
    $term->addhistory($answer) if $answer =~ /\S/;
    if ( $answer =~ /^q/i ) {
        print "ABORTED by user\n";
        exit(0);
    } else {
        $allowed_length = $answer;
    }
}

# If we don't have a mailbox file, ask for it.
$mbox_file = $opt_f;
unless ( $mbox_file ) {
    $answer = $term->readline($file_prompt);
    warn $@ if $@;
    $term->addhistory($answer) if $answer =~ /\S/;
    if ( $answer =~ /^q/i ) {
        print "ABORTED by user\n";
        exit(0);
    } else {
        $mbox_file = $answer;
        $mbox_file =~ s/\s+$//;    # Strip trailing whitespace.
    }
}

# Check to see if we can read the input file.
# NOTE that Win32 may place pwd in perl dir...
unless ( -r $mbox_file ) {
    print "File $mbox_file is not readable: $!\nExiting...\n";
    exit(1);
}

if ( $opt_a ) {
    $archive = 1;
}
We’re done with the initial housekeeping and user interface issues and can now turn our attention to the task itself: reading and parsing an mbox-style mailbox. We actually need to open two file handles. The first is for the input mailbox (for reading), and the second is the output file (to be opened for writing in mbox format). At the end of the script, we will inform the user of the name of the output file, which we will name by appending “_scrubbed” to the input filename.
The core of this script is a simple while loop. We read the input mailbox until we have placed one entire message in a buffer, then we act on that message before reading further. We can clear the buffer and put the next message into it and repeat this process until the mailbox is completely processed. If we didn’t do something like this, we would use as much memory as the number of bytes in the input mailbox file. Since mailboxes can be quite large, this would be a huge waste of system resources. Processing a 50 MB file would use 50 MB just for one variable! Obviously, we should do better.
Reading the mailbox, we expect that the first line is a From_ line, as described in Chapter 7, Mailbox Formats. We need some way to note that the first time we encounter one of these lines we are at the top of a message and all of the subsequent times we have reached the end of the preceding message. We will use a simple flag, $msg_started, to denote this. We will also keep track of the number of the message ($msg_num) that we are operating on, so we can report it to the user for reference. The rest of this while loop is to put message lines into the current message buffer ($msg) and determine when we have a full message in the buffer and are ready to operate on it. We will do the actual operation (parsing and potentially deleting MIME attachments) in a subroutine called check_message, to be discussed later.
When the while loop finishes, we will have the entire last message in the buffer, but we will not have operated on it yet! So, we will need to call the check_message subroutine one more time.
# Open a mailbox and step through it, one message at a time.
open(MBOX, '<', $mbox_file) or die "Can't open $mbox_file: $!";

# Open a new mailbox file, so that we can write to it.
$newfile = $mbox_file . "_scrubbed";
open(NEWBOX, '>', $newfile) or die "Can't open $newfile: $!";

$msg_started = 0;
while (<MBOX>) {
    # Read in one message, then write it to a temporary
    # file and operate on it.

    # The current line.
    $line = $_;

    if ( ($line =~ /^From /) && (! $msg_started) ) {
        # Note that we've found the start of the first message.
        $msg_started = 1;
        # Keep the From_ line.
        $from_ = $line;

    } elsif ( ($line =~ /^From /) && ($msg_started) ) {
        # We found the end of a message and the start of
        # a new one. We now operate on the completed message.

        # Increment the message number.
        $msg_num++;

        # Operate on current message.
        &check_message();

        # Set the From_ line for the next message.
        $from_ = $line;

    } else {
        # We're in the middle of reading a message, so just append
        # to the current message.
        $msg .= $line;
    }
} # end while reading MBOX.

# Need to handle the last message read!
$msg_num++;
&check_message();
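The buffering logic is easier to experiment with in isolation. The following is a minimal sketch of the same idea, assuming a hypothetical split_mbox helper that takes a callback in place of check_message and reads from an in-memory file handle; the sample mailbox text is invented for illustration. It also assumes that any body line starting with "From " has already been quoted by the program that wrote the mailbox.

```perl
use strict;
use warnings;

# Hypothetical helper: read mbox text from $fh and call $callback with
# (From_ line, message text) once per message. Returns the message count.
sub split_mbox {
    my ($fh, $callback) = @_;
    my ($from, $msg, $count, $started) = ('', '', 0, 0);
    while (my $line = <$fh>) {
        if ($line =~ /^From /) {
            # A From_ line ends the previous message (if any) ...
            if ($started) {
                $callback->($from, $msg);
                $count++;
            }
            # ... and starts the next one with an empty buffer.
            ($from, $msg, $started) = ($line, '', 1);
        } elsif ($started) {
            $msg .= $line;
        }
    }
    # The final message is still in the buffer when the loop ends.
    if ($started) {
        $callback->($from, $msg);
        $count++;
    }
    return $count;
}

# Invented sample mailbox holding two short messages.
my $mbox = join '',
    "From alice\@example.com Mon Jan  1 00:00:00 2001\n",
    "Subject: one\n\nfirst body\n\n",
    "From bob\@example.com Tue Jan  2 00:00:00 2001\n",
    "Subject: two\n\nsecond body\n\n";

open(my $fh, '<', \$mbox) or die "Can't open in-memory mailbox: $!";
my @subjects;
my $count = split_mbox($fh, sub {
    my ($from, $msg) = @_;
    push @subjects, $1 if $msg =~ /^Subject: (.*)$/m;
});
close($fh);

print "Found $count messages: @subjects\n";
```

Only one message is held in memory at a time, which is the property the while loop in the script is designed to preserve.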
The check_message subroutine will be responsible for clearing the message buffer when it is done with it.
The last thing the script needs to do is to determine if the output file should be compressed. If so, it must close the open file handle and run the archiving program on the filename. The script can then exit. We will handle various bits and pieces of this process in subroutines; one here for closing file handles and another for the cleanup on exit. This is because, as you shall see, the same actions will need to be taken from the central check_message subroutine as well.
# Archive (compress) the output mailbox if the user so requested.
if ( $archive ) {
    &close_fhs();
    system($archiver, $newfile);
    $newfile .= $archive_subscript;
}

# Tell the user the filename that was created.
print "\nYour scrubbed mailbox is in the file " . $newfile . ".\n";
&exit();
Let’s now turn our attention to the check_message subroutine. It is there that we will parse each message that has come from the mailbox. This subroutine is the longest part of this script, but it has a simple structure. In HPC,[24] the structure would look something like this:
Get message as a MIME entity object
For each message part {
    Is this part 'too big'?
    If so ...
        Ask the user if the attachment should be deleted
        and handle the response appropriately.
    If not, just keep the part.
}
Write the message to the output file.
In order to get the message into a form that the MIME-tools modules can work with, we need to create a MIME::Entity object for the message on which we are operating. Once we have done that, we can extract some header information from the message to use as context information for the user. We can also break the message into an array of MIME parts, each of which is in its turn a MIME::Entity object.
This is where MIME-tools starts creating heaps of temporary files. We’ll return to this later, when we look at the cleanup subroutines. Another small note: MIME-tools provides a nice utility method, sync_headers, that (among other things) allows for Content-Length headers to be calculated and inserted into each MIME attachment. This is necessary if we are going to easily determine the length of the parts!
The code for this first bit looks like this:
sub check_message {
    # Operate on a given message, asking the user
    # for each large attachment found.

    # DBG
    print "Message #" . $msg_num . ":\n" if $debug;

    $entity = $parser->parse_data($msg)
        or die "ERROR: Couldn't parse MIME stream: $!";

    # Compute Content-Length headers for each body part.
    $entity->sync_headers(Length=>'COMPUTE');

    # Prepare some explanatory information.
    # (Each header value returned by get ends with a newline.)
    $summarized = 0;
    $head = $entity->head;
    $msg_summary = "\nMessage Summary for message #" . $msg_num . ":\n" .
        "  From   : " . $head->get('From') .
        "  To     : " . $head->get('To') .
        "  Subject: " . $head->get('Subject') .
        "  Date   : " . $head->get('Date') . "\n";

    # DBG
    print "Dumping skeleton for message #" . $msg_num . ":\n" if $debug;
    $entity->dump_skeleton if $debug;

    @parts = $entity->parts;
The astute reader may be asking at this point, “What happens if the input message is not in MIME format but is an older RFC 822 message?” The answer is that the headers will still be available, as will the body, but the parts array will be empty! We will use this bit of trivia later to retain any RFC 822 messages that we find.
Now that we have an array of MIME::Entity objects representing message body parts, we can operate on each of the parts to see if it is too big. If the part is small enough, we just keep it for later so that it can be written into the output file. If it is big, we must ask the user, using the ReadLine module to get the information and act on it accordingly. For each part, we will print out the part’s MIME headers so that the user can make an informed decision whether to delete it.
Parts that we keep are put into an array, @parts_to_keep. This array will be used later to reconstruct the message in the output file. Where an attachment is deleted, we create a new one (using MIME-tools to create a new MIME::Entity object) of type text/plain. Into the new part goes some explanatory text and the headers of the original attachment. These new parts are also put on the @parts_to_keep array, so that they may be output in their turn.
    $i = 0;
    foreach $part (@parts) {
        $part_head = $part->head;

        # DBG
        print "PART $i:\n" if $debug;
        print $part_head->stringify . "\n" if $debug;

        # Print out info only if the length of the attachment
        # is greater than the set value.
        $length = $part_head->get('Content-length');
        if ( $length > $allowed_length ) {

            # Print the summary information for this message,
            # if we haven't already done it.
            unless ($summarized) {
                print $msg_summary;
                $summarized = 1;
            }

            print "Found a large attachment. Its headers are:\n";
            print "_" x 60 . "\n";
            $part_head->print;
            print "_" x 60 . "\n";

            $answer = $term->readline($attach_prompt);
            warn $@ if $@;
            $term->addhistory($answer) if $answer =~ /\S/;

            # DBG
            print "The user's answer was: $answer\n" if $debug;

            if ( $answer =~ /^y/i ) {
                # The user has requested that the
                # attachment be deleted.
                print "The attachment will be deleted.\n";

                # Create a new attachment of text/plain
                # and put info into it. Data takes a reference
                # to the array of lines.
                $part_headers = $part_head->stringify;
                @new_info = ($info_start, $part_headers);
                $new_part = MIME::Entity->build(Type => "text/plain",
                                                Data => \@new_info);
                push @parts_to_keep, $new_part;

            } elsif ( $answer =~ /^q/i ) {
                # The user has decided to quit early.
                &abort();

            } else {
                print "Keeping the attachment.\n";
                push @parts_to_keep, $part;
            }

        } else {
            # This attachment is not too big, so keep it.
            push @parts_to_keep, $part;
        }
        $i++;
    }
If you were really paying attention, you might have figured out that the preceding code does not recurse into MIME parts! We could do that simply by extracting the block of code that operates on a large part and putting it into a subroutine, then calling that subroutine recursively. Although the current form is less capable, it is much more clearly structured as an example.
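The shape of such a recursive subroutine can be sketched without MIME-tools at all. In the following illustration, each part is a plain hash with hypothetical size and parts keys standing in for a MIME::Entity object; the find_large_parts name and the sample data are likewise our own assumptions. A real version would call the parts method of each entity and consult its Content-Length header instead.

```perl
use strict;
use warnings;

# Stand-in data model (an assumption for illustration only): a part is
# a hash with either a 'size' in bytes or a 'parts' list of sub-parts.
# Returns the dotted position of every leaf part larger than $limit.
sub find_large_parts {
    my ($part, $limit, $path) = @_;
    $path = '1' unless defined $path;
    my @found;
    my @children = @{ $part->{parts} || [] };
    if (@children) {
        # A multipart container: recurse into each child in turn.
        my $n = 0;
        for my $child (@children) {
            $n++;
            push @found, find_large_parts($child, $limit, "$path.$n");
        }
    } elsif ($part->{size} > $limit) {
        # A leaf part that is over the limit.
        push @found, $path;
    }
    return @found;
}

# Invented message: three top-level parts, the second one nested.
my $message = {
    parts => [
        { size => 200 },
        { parts => [ { size => 90_000 }, { size => 50 } ] },
        { size => 120_000 },
    ],
};

my @large = find_large_parts($message, 10_000);
print "Large parts: @large\n";
```

The nested attachment is found as well as the top-level one, which is exactly what the non-recursive loop above would miss.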
The last bit of real work is to reconstruct our message and write it to the output file. You may recall that we have already opened the file handle earlier. However, we are reconstructing each message. We have not retained the message in its original form. Therefore, we have to be careful to do the job faithfully. In particular, we have not yet extracted the MIME separator used (it was used internally by MIME-tools), nor have we accessed the so-called preamble (that part of a MIME message outside of the MIME structure, used to explain the message to non-MIME compliant MUAs). We will need to do both of those things in order to completely reconstruct a message.
Unfortunately, the version of MIME-tools that was used does not provide a clean manner of extracting the MIME separator. Here, we have parsed it from the Content-Type header. Just in case we can’t, we have also allowed for the creation of a unique separator to use.
That covers MIME messages, but we still need to deal with RFC 822 messages. In that case, we will just include the entire message body without protest.
Since each message is being written into an mbox formatted output file, we will also need to write a preceding From_ line and two closing blank lines. You might note that we said in Chapter 7 that there are many minor variations to the mbox format, so it is possible that minor variations will be introduced into the output mailbox file. The added complexity is not as bad as you might think. First, we use the same From_ line as the one found in the original mailbox, so there will be no problem there. Second, MIME-tools internally copes with the various ways of quoting From_ lines in message bodies. Third, if your specific mailbox format does not place blank lines at the bottom of messages, our introduction of them is unlikely to do harm. To be ultra conservative, we have created a totally new output file, not munged the original.
    # At this point, we should have a complete message, in
    # the form that we wish to keep it. Write this message
    # to the output file.

    # Extract MIME boundary from the original message.
    $boundary = $head->get('Content-Type');
    chomp($boundary);
    if ( $boundary =~ /boundary\s*=\s*"([^"]*)"/ ) {
        $boundary = $1;
    } else {
        # Assign a default boundary, since we couldn't parse one.
        $boundary = "$$--$0--$msg_num";
    }

    # Print the newly formed message to the new mbox.
    # Start with the From_ line.
    print NEWBOX "$from_";

    # Next come the headers.
    $head->print(\*NEWBOX);
    print NEWBOX "\n";

    # Then the message body.
    if ( $#parts_to_keep == -1 ) {
        # This is not a MIME message. Need to keep the original body.
        $entity->print_body(\*NEWBOX);
    } else {
        # This _is_ a MIME message, so write out all of its parts.

        # Print the preamble.
        $preamble = $entity->preamble;
        foreach $pline (@$preamble) {
            print NEWBOX $pline, "\n";
        }
        #print NEWBOX "The Preamble is: $preamble\n";

        foreach $part (@parts_to_keep) {
            print NEWBOX "--$boundary\n";
            $part->print(\*NEWBOX);
            #print NEWBOX "\n";
        }
        print NEWBOX "\n--$boundary--\n";
    }
    print NEWBOX "\n\n";

    &clean();

} # end sub check_message
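The boundary match is worth trying on its own, since Content-Type headers in the wild do not always quote the boundary parameter. The following is a minimal sketch, assuming a hypothetical extract_boundary helper that accepts both quoted and unquoted forms and matches the parameter name case-insensitively; the sample header values are invented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): return the boundary
# parameter from a Content-Type header value, or nothing if absent.
sub extract_boundary {
    my ($content_type) = @_;
    return unless defined $content_type;
    # Quoted form: boundary="..."
    if ($content_type =~ /boundary\s*=\s*"([^"]+)"/i) {
        return $1;
    }
    # Unquoted form: runs up to whitespace or the next semicolon.
    if ($content_type =~ /boundary\s*=\s*([^\s;"]+)/i) {
        return $1;
    }
    return;
}

# Invented header values for illustration.
my $quoted   = extract_boundary('multipart/mixed; boundary="----=_Part_42"');
my $unquoted = extract_boundary('multipart/mixed; boundary=simple-boundary; charset=us-ascii');
my $missing  = extract_boundary('text/plain; charset=us-ascii');

print "quoted:   $quoted\n";
print "unquoted: $unquoted\n";
print "missing:  ", (defined $missing ? $missing : "(none)"), "\n";
```

When nothing is returned, the caller can fall back to a generated separator, as the script does.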
Along the way, we have called some small utility subroutines: abort, clean, close_fhs, and exit. abort takes care of cleanly shutting down the script if the user aborts the run. clean is called at the end of each message parsing; it removes temporary files created by MIME-tools and resets buffers. close_fhs closes file handles, and exit should be pretty obvious. Here they are:
sub abort {
    print "\nABORTED by user.\n";
    # Delete output file!
    &close_fhs();
    unlink($newfile);
    &clean();
    &exit();
} # end sub abort

sub clean {
    # Clean up any temporary files left on disk.
    $entity->purge;

    # Forget about the current message.
    undef($entity);
    undef(@parts_to_keep);
    undef($msg);
} # end sub clean

sub close_fhs {
    # Close the mailboxes.
    close(MBOX);
    close(NEWBOX);
} # end sub close_fhs

sub exit {
    &close_fhs();
    # Say Goodbye.
    print "Finished.\n";
    exit(0);    # Without the &, this calls Perl's built-in exit.
} # end sub exit