Creating mboxscrub.pl

The beginning of the script is simple, since we have already decided which modules to use. In the opening lines, we make sure that we have access to the features of those modules:

#!/usr/local/bin/perl

use MIME::Parser;                 # The MIME-Tools Parser (reads/writes
                                  # MIME entities).
use Term::ReadLine;               # ReadLine library (used for user input).
use Getopt::Std;                  # Used to get command line arguments.

Note

Determining which modules are available can be more difficult than using them. A periodic review of the CPAN, perhaps with the excellent regular expression-based search features of the CPAN Perl module (try perl -MCPAN -e shell), is useful.

Naturally, the first line is needed only under Unix. Next, we will need to set some initial variables. This is good practice, although not strictly required in a scripting language like Perl. We will need to have some flags to tell us where we are in the parsing process and what the user has requested (such as a flag for whether or not we are compressing the output file). We will also need to set up object instances for each of the modules that we are using, since they use Perl’s object-oriented interface. Many modules have their own requirements; refer to their perldoc documentation to see initialization and usage examples in detail.

In general, MIME-tools is used by creating an object instance of MIME::Parser, in order to read a message. Later we will see how to get objects for each MIME entity. MIME::Parser requires that a temporary directory be given, since it creates many temporary files in the course of its operation. It first creates a temporary file for a message, then another for each of its MIME entities. We have to be sure to delete them, which we will do along the way.

Term::ReadLine is a bit esoteric but simple to use nonetheless. A terminal object is created, and a file handle is associated with it for its output. This file handle is probably (but not necessarily) STDOUT. Every time we ask for input from the user, we have to provide a prompt. In the following code, we define three prompts, which together cover everything that we need to ask.

Finally, we will need to set a message for use in creating replacement MIME attachments for the ones that we delete.

# Set initial variables.

$debug = 0;                               # Set to non-zero to show debugging info.
$archive = 0;                             # Set to non-zero if compressing output.
$archiver = '/bin/gzip';                  # Program to use for file compression.
$archive_subscript = '.gz';               # File extension for compressed files.

$allowed_length = 0;                      # Size in bytes above which MIME entities
                                          # are checked.
$mbox_file = "";                          # Mailbox filename.

# Variables used with the Term::ReadLine module to get user input:
$term = new Term::ReadLine 'mboxscrub';   # The ReadLine object (new requires
                                          # an application name).
$length_prompt = "Enter the maximum size in bytes of attachments " .
                 "you wish to keep. [#/q] ";
$file_prompt = "Please enter the filename of your mailbox. [#/q] ";
$attach_prompt = "Do you wish to delete this attachment? [y/N/q] ";
$OUT = $term->OUT || \*STDOUT;            # Where the terminal output goes.

# Variables related to the current message.
$msg = "";                                # Current message contents
$msg_num = 0;                             # The (linear) number of the current
                                          # message.
$from_ = "";                              # The From_ line for the current message.
$date = localtime (time);                 # The current time/date.
# The message created when an attachment is deleted.
$info_start = "This attachment has been deleted by $0 on $date. " .
                "The original attachment's headers were:\n\n";

# Variables used with MIME-Tools.
my($parser) = new MIME::Parser;           # The Parser object.
$parser->output_dir("/tmp");           # The temporary directory to use
                                          # when creating temporary files for
                                          # messages and MIME entities.

The first useful thing this script needs to do is check its own command-line arguments (if any) and act accordingly. The Getopt::Std module makes this task easy. The getopts line in the following code takes any command-line arguments and associates them with variables whose names begin with $opt_. So if we call the script with a -u option, the variable $opt_u will be set to 1; otherwise it is left undefined, which Perl treats as false. The colons in the option string indicate that the -l and -f options take arguments; the corresponding variables will be set to those arguments.

The second thing to do is to return the usage statement, if the user gave us a -u flag. This should probably be in a subroutine to keep the code clean, but this script is so short we won’t worry about it. Note that if the usage statement is given, the script exits.

getopts('ul:f:ad');

# Return a usage statement if asked.
if ( $opt_u ) {
        print<<ENDOFUSAGE;

$0 - an interactive script which allows you to scrub
                mbox-style mailboxes of unwanted MIME attachments.

     Synopsis:  $0 [-u] [-a] [-d] [-l <length>] [-f <file>]
     where <length> is the maximum size of attachments to ignore; and
           <file> is the path and filename of an mbox-style mailbox.

ENDOFUSAGE

      exit(0);

}
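The option handling above can be tried in isolation. The following is a minimal sketch, not part of the original script: it fills @ARGV by hand (an assumption made only so that the example runs without a shell) and collects the options in a hash, which Getopt::Std supports as an alternative to the global $opt_ variables.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line; in a real run the shell fills in @ARGV.
local @ARGV = ('-d', '-l', '50000', '-f', 'inbox.mbox');

# Collect the options in a hash instead of the global $opt_* variables.
my %opt;
getopts('ul:f:ad', \%opt)
    or die "Usage: $0 [-u] [-a] [-d] [-l <length>] [-f <file>]\n";

# Boolean flags are 1 when present and undefined when absent.
my $debug   = $opt{d} ? 1 : 0;
my $archive = $opt{a} ? 1 : 0;

# Options declared with a colon carry their argument.
my $allowed_length = $opt{l};
my $mbox_file      = $opt{f};

print "debug=$debug archive=$archive length=$allowed_length file=$mbox_file\n";
```

Using a hash keeps the option values out of the global namespace, which matters more as a script grows.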

If we haven’t exited by this point, we are off and running. We will need to check to see which of the other command-line options are set and take appropriate action. In the case of the debugging (-d) and archive (-a) flags, we just need to note their presence or lack of it. In the case of the length (-l) and filename (-f) flags, we will need to ask the user for the information if it is not provided on the command line.

Using the Term::ReadLine module to get and operate on user input is easy; three lines are all that is needed to read a line (with the readline method), warn if there are errors (reported in $@), and update the terminal’s command-line history (with the addhistory method). In each case, we check to see if the user wants to quit (indicated by typing a response beginning with the letter q) before proceeding.

print "\nWelcome to $0!\n\n";

# Turn on debugging reports, if we have been requested to.
if ( $opt_d ) { $debug = 1; }

# If we don't have an allowed attachment length, ask for it.
$allowed_length = $opt_l;
unless ( $allowed_length ) {

        $answer = $term->readline($length_prompt);
        warn $@ if $@;
        $term->addhistory($answer) if $answer =~ /\S/;
        if ( $answer =~ /^q/i ) {
                print "ABORTED by user\n";
                exit (0);
        } else {
                $allowed_length = $answer;
        }
}

# If we don't have a mailbox file, ask for it.
$mbox_file = $opt_f;
unless ( $mbox_file ) {

        $answer = $term->readline($file_prompt);
        warn $@ if $@;
        $term->addhistory($answer) if $answer =~ /\S/;
        if ( $answer =~ /^q/i ) {
                print "ABORTED by user\n";
                exit (0) ;
        } else {
                $mbox_file = $answer;
                $mbox_file =~ s/\s+$//;     # Strip trailing whitespace.
        }
}

# Check to see if we can read the input file.
# NOTE that Win32 may place pwd in perl dir...
unless ( -r $mbox_file ) {
        print "File $mbox_file is not readable: $! Exiting...\n";
        exit(0);
}

if ( $opt_a ) { $archive = 1; }
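The code above accepts whatever the user types as the attachment size. As a hedged sketch of how the reply could be checked before it is used, the hypothetical helper parse_length_reply below (not part of the original script) separates a request to quit, a usable number, and anything else. It takes a plain string, so it can be fed the value returned by readline.

```perl
use strict;
use warnings;

# Hypothetical helper: interpret a reply to the attachment-size prompt.
# Returns ('quit'), ('ok', $bytes), or ('retry') for anything else.
sub parse_length_reply {
    my ($reply) = @_;
    return ('retry') unless defined $reply;          # e.g. end of input
    return ('quit')  if $reply =~ /^\s*q/i;          # reply starts with q
    return ('ok', $1 + 0) if $reply =~ /^\s*(\d+)\s*$/;
    return ('retry');
}

for my $reply ('50000', ' 2048 ', 'quit', 'lots') {
    my ($status, $bytes) = parse_length_reply($reply);
    print "'$reply' -> $status",
          (defined $bytes ? " ($bytes bytes)" : ''), "\n";
}
```

A caller would loop on 'retry', re-issuing the prompt until it receives 'ok' or 'quit'.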

We’re done with the initial housekeeping and user interface issues and can now turn our attention to the task itself: reading and parsing an mbox-style mailbox. We actually need to open two file handles. The first is for the input mailbox (for reading), and the second is the output file (to be opened for writing in mbox format). At the end of the script, we will inform the user of the name of the output file, which we will name by appending “_scrubbed” to the input filename.

The core of this script is a simple while loop. We read the input mailbox until we have placed one entire message in a buffer, then we act on that message before reading further. We can clear the buffer and put the next message into it and repeat this process until the mailbox is completely processed. If we didn’t do something like this, we would use as much memory as the number of bytes in the input mailbox file. Since mailboxes can be quite large, this would be a huge waste of system resources. Processing a 50 MB file would use 50 MB just for one variable! Obviously, we should do better.

Reading the mailbox, we expect that the first line is a From_ line, as described in Chapter 7, Mailbox Formats. We need some way to note that the first time we encounter one of these lines we are at the top of a message and all of the subsequent times we have reached the end of the preceding message. We will use a simple flag, $msg_started, to denote this. We will also keep track of the number of the message ($msg_num) that we are operating on, so we can report it to the user for reference. The rest of this while loop is to put message lines into the current message buffer ($msg) and determine when we have a full message in the buffer and are ready to operate on it. We will do the actual operation (parsing and potentially deleting MIME attachments) in a subroutine called check_message, to be discussed later.

When the while loop finishes, we will have the entire last message in the buffer, but we will not have operated on it yet! So, we will need to call the check_message subroutine one more time.

# Open a mailbox and step through it, one message at a time.
open (MBOX, $mbox_file) or die "Can't open $mbox_file: $!";

# Open a new mailbox file, so that we can write to it.
$newfile = $mbox_file . "_scrubbed";
open (NEWBOX, ">$newfile") or die "Can't open $newfile: $!";

$msg_started = 0;
while(<MBOX>) {

            # Read in one message, then write it to a temporary
            # file and operate on it.

            # The current line.
            $line = $_;

            if ( ($line =~ /^From /) && (! $msg_started) ) {

                    # Note that we've found the start of the first message.
                    $msg_started = 1;

                    # Keep the From_ line.
                    $from_ = $line;

            } elsif ( ($line =~ /^From /) && ($msg_started) ) {

                    # We found the end of a message and the start of
                    # a new one. We now operate on the completed message.

                    # increment the message number.
                    $msg_num++;

                    # Operate on current message.
                    &check_message();

                    # Set the From_ line for the next message.
                    $from_ = $line;

            } else {
                    # We're in the middle of reading a message, so just append
                    # to the current message.
                    $msg .= $line;
            }

} # end while reading MBOX.

# Need to handle the last message read!
$msg_num++;
&check_message();

The check_message subroutine will be responsible for clearing the message buffer when it is done with it.
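The buffering logic can be exercised without a mailbox file. The sketch below is an illustration under stated assumptions: it reads a two-message mailbox from an in-memory string, and the hypothetical handle_message subroutine stands in for check_message, recording only the From_ line and the size of each message so that a single message is held in memory at a time.

```perl
use strict;
use warnings;

# A tiny two-message mailbox, held in a string for the demonstration.
my $mbox = <<'END_MBOX';
From alice@example.com Mon Jan  1 10:00:00 2001
Subject: first

Hello.

From bob@example.com Tue Jan  2 11:00:00 2001
Subject: second

World.

END_MBOX

open(my $in, '<', \$mbox) or die "Can't open in-memory mailbox: $!";

my @seen;                          # one summary per processed message

# Hypothetical stand-in for check_message: note the sender line and size.
sub handle_message {
    my ($from_line, $text) = @_;
    push @seen, { from => $from_line, bytes => length $text };
}

my ($from_, $msg, $msg_started) = ('', '', 0);
while (my $line = <$in>) {
    if ($line =~ /^From /) {
        # A From_ line closes the previous message, if there was one.
        handle_message($from_, $msg) if $msg_started;
        ($from_, $msg, $msg_started) = ($line, '', 1);
    } else {
        $msg .= $line;             # still inside the current message
    }
}
# The last message is still in the buffer when the loop ends.
handle_message($from_, $msg) if $msg_started;
close($in);

printf "%d messages processed\n", scalar @seen;
```

Testing for the From_ line once and then branching on the flag avoids repeating the pattern in two places.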

The last thing the script needs to do is to determine if the output file should be compressed. If so, it must close the open file handle and run the archiving program on the filename. The script can then exit. We will handle various bits and pieces of this process in subroutines; one here for closing file handles and another for the cleanup on exit. This is because, as you shall see, the same actions will need to be taken from the central check_message subroutine as well.

# Archive (compress) the output mailbox if the user so requested.
if ( $archive ) {
       &close_fhs();
       `$archiver $newfile`;
       $newfile .= $archive_subscript;
}

# Tell the user the filename that was created.
print "\nYour scrubbed mailbox is in the file " . $newfile . ".\n";

&exit ();

Let’s now turn our attention to the check_message subroutine. It is there that we will parse each message that has come from the mailbox. This subroutine is the longest part of this script, but it has a simple structure. In HPC, [24] the structure would look something like this:

Get message as a MIME entity object
For each message part {
     Is this part 'too big'?  If so ...
          Ask the user if the attachment should be deleted
          and handle the response appropriately.
    If not, just keep the part.
}
Write the message to the output file.

In order to get the message into a form that the MIME-tools modules can work with, we need to create a MIME::Entity object for the message on which we are operating. Once we have done that, we can extract some header information from the message to use as context information for the user. We can also break the message into an array of MIME parts, each of which is in its turn a MIME::Entity object.

This is where MIME-tools starts creating heaps of temporary files. We’ll return to this later, when we look at the cleanup subroutines. Another small note: MIME-tools provides a nice utility method, sync_headers, which (among other things) allows for Content-Length headers to be calculated and inserted into each MIME attachment. This is necessary if we are going to easily determine the length of the parts!

The code for this first bit looks like this:

sub check_message {
        # Operate on a given message, asking the user
        # for each large attachment found.

        # DBG
        print "Message #" . $msg_num . ":\n" if $debug;

        $entity = $parser->parse_data($msg) or die
                "ERROR:  Couldn't parse MIME stream: $!";

        # Compute Content-Length headers for each body part.
         $entity->sync_headers(Length=>'COMPUTE');

         # Prepare some explanatory information.
         $summarized = 0;
         $head = $entity->head;
         $msg_summary = "     Message Summary for message #" .
                 $msg_num . ":\n" .
                 "     From   : " . $head->get('From') .
                 "     To     : " . $head->get('To') .
                 "     Subject: " . $head->get('Subject') .
                 "     Date   : " . $head->get('Date') . "\n";

         # DBG
         print "Dumping skeleton for message #" . $msg_num . ":\n" if $debug;
         $entity->dump_skeleton if $debug;

         @parts = $entity->parts;

The astute reader may be asking at this point, “What happens if the input message is not in MIME format but is an older RFC 822 message?” The answer is that the headers will still be available, as will the body, but the parts array will be empty! We will use this bit of trivia later to retain any RFC 822 messages that we find.

Now that we have an array of MIME::Entity objects representing message body parts, we can operate on each of the parts to see if it is too big. If the part is small enough, we just keep it for later so that it can be written into the output file. If it is big, we must ask the user, using the ReadLine module to get the information and act on it accordingly. For each part, we will print out the part’s MIME headers so that the user can make an informed decision whether to delete it.

Parts that we keep are put into an array, @parts_to_keep. This array will be used later to reconstruct the message in the output file. Where an attachment is deleted, we create a new one (using MIME-tools to create a new MIME::Entity object) of type text/plain. Into the new part goes some explanatory text and the headers of the original attachment. These new parts are also put on the @parts_to_keep array, so that they may be output in their turn.

$i = 0;
foreach $part (@parts) {

        $part_head = $part->head;

        # DBG
        print "PART $i:\n\n" if $debug;
        print $part_head->stringify . "\n\n" if $debug;
        # Print out info only if the length of the attachment
        # is greater than the set value.
        $length = $part_head->get('Content-length');
         if ( $length > $allowed_length ) {

                # Print the summary information for this message,
                # if we haven't already done it.
                unless ($summarized) {
                        print $msg_summary;
                        $summarized = 1;
                }
                print "Found a large attachment. Its headers are:\n";
                print "_" x 60 . "\n";
                $part_head->print;
                print "_" x 60 . "\n";

                $answer = $term->readline($attach_prompt);
                warn $@ if $@;
                $term->addhistory($answer) if $answer =~ /\S/;

                # DBG
                print "The user's answer was: $answer\n" if $debug;

                if ( $answer =~ /^y/i ) {
                        # The user has requested that the
                        # attachment be deleted.
                        print "The attachment will be deleted.\n\n";

                        # Create a new attachment of text/plain
                        # and put info into it.
                        $part_headers = $part_head->stringify;
                        @new_info = ($info_start, $part_headers);

                        $new_part = build MIME::Entity
                                        (Type       => "text/plain",
                                         Data       => \@new_info);
                        push @parts_to_keep, $new_part;

                } elsif ( $answer =~/^q/i ) {
                        # The user has decided to quit early.
                        &abort();
                } else {
                        print "Keeping the attachment.\n\n";
                        push @parts_to_keep, $part;
                }
        } else {
                # This attachment is not too big, so keep it.
                push @parts_to_keep, $part;
        }

        $i++;
}

If you were really paying attention, you might have figured out that the preceding code does not recurse into MIME parts! We could do that simply by extracting the block of code that operates on a large part and putting it into a subroutine, then calling that subroutine recursively. Although the current form is less capable, it is much more clearly structured as an example.
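To illustrate the recursive approach without requiring MIME-tools, the sketch below models a parsed message as nested hashes. The type, length, and parts keys and the find_large_parts subroutine are assumptions made for this example only; with real MIME::Entity objects, the parts method would supply the children.

```perl
use strict;
use warnings;

# Stand-in for a parsed message: each part has a type and either a length
# (a leaf) or a list of child parts (a multipart container).
my $message = {
    type  => 'multipart/mixed',
    parts => [
        { type => 'text/plain', length => 120 },
        { type  => 'multipart/alternative',
          parts => [
              { type => 'text/html',  length => 4_000 },
              { type => 'image/jpeg', length => 250_000 },
          ] },
        { type => 'application/zip', length => 900_000 },
    ],
};

# Walk the tree depth-first and collect the type of every leaf whose
# length exceeds the limit.
sub find_large_parts {
    my ($part, $limit, $found) = @_;
    if ($part->{parts}) {
        find_large_parts($_, $limit, $found) for @{ $part->{parts} };
    } elsif ($part->{length} > $limit) {
        push @$found, $part->{type};
    }
    return $found;
}

my $large = find_large_parts($message, 100_000, []);
print "Large parts: @$large\n";
```

The same shape would apply to the script: the body of the foreach loop becomes the leaf case, and a container part triggers a recursive call on its children.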

The last bit of real work is to reconstruct our message and write it to the output file. You may recall that we have already opened the file handle earlier. However, we are reconstructing each message. We have not retained the message in its original form. Therefore, we have to be careful to do the job faithfully. In particular, we have not yet extracted the MIME separator used (it was used internally by MIME-tools), nor have we accessed the so-called preamble (that part of a MIME message outside of the MIME structure, used to explain the message to non-MIME compliant MUAs). We will need to do both of those things in order to completely reconstruct a message.

Unfortunately, the version of MIME-tools that was used does not provide a clean manner of extracting the MIME separator. Here, we have parsed it from the Content-Type header. Just in case we can’t, we have also allowed for the creation of a unique separator to use.

That covers MIME messages, but we still need to deal with RFC 822 messages. In that case, we will just include the entire message body without protest.

Since each message is being written into an mbox-formatted output file, we will also need to write a preceding From_ line and two closing blank lines. You might note that we said in Chapter 7 that there are many minor variations to the mbox format, so it is possible that minor variations will be introduced into an output mailbox file. The added complexity is not as bad as you might think. First, we use the same From_ line as the one found in the original mailbox, so there will be no problem there. Second, MIME-tools internally handles the various ways of quoting From_ lines in message bodies. Third, if your specific mailbox format does not place blank lines at the bottom of messages, our introduction of them is unlikely to do harm. To be ultra conservative, we have created a totally new output file, not munged the original.

# At this point, we should have a complete message, in
# the form that we wish to keep it. Write this message
# to the output file.

# Extract MIME boundary from the original message.
$boundary = $head->get('Content-Type');

chomp($boundary);
if ( $boundary =~ /boundary\s*=\s*"([^"]*)"/ ) {
        $boundary = $1;
} else {
          # Assign a default boundary, since we couldn't parse one.
          $boundary = "$$--$0--$msg_num";
}

# Print the newly formed message to the new mbox.

# Start with the FROM_ line.
print NEWBOX "$from_";

# Next comes the headers.
$head->print(*NEWBOX);
print NEWBOX "\n";

# Then the message body.
if ( $#parts_to_keep == -1 ) {
        # This is not a MIME message. Need to keep the original body.
        $entity->print_body(*NEWBOX);
} else {
        # This _is_ a MIME message, so write out all of its parts.

        # Print the preamble
        $preamble = $entity->preamble;
        foreach $pline (@$preamble) {
                print NEWBOX $pline, "\n";
        }
        #print NEWBOX "The Preamble is: $preamble\n\n";

        foreach $part (@parts_to_keep) {
                print NEWBOX "--$boundary\n";
                $part->print(*NEWBOX);
                #print NEWBOX "\n";
        }
        print NEWBOX "\n--$boundary--\n";
}
print NEWBOX "\n\n";

&clean() ;

} # end sub check_message
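The boundary extraction used above can be tested on its own against sample header values. This is a sketch only: extract_boundary is a hypothetical helper, the sample headers are invented, and the handling of an unquoted boundary parameter is an addition that the original script does not make.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the boundary parameter out of a Content-Type
# header value, or build a fallback separator when none is present.
sub extract_boundary {
    my ($content_type, $msg_num) = @_;
    $content_type = '' unless defined $content_type;
    return $1 if $content_type =~ /boundary\s*=\s*"([^"]+)"/i;    # quoted
    return $1 if $content_type =~ /boundary\s*=\s*([^\s;"]+)/i;   # unquoted
    return "$$--fallback--$msg_num";     # process ID plus message number
}

my $quoted = qq{multipart/mixed; boundary="----=_Part_42"; charset=us-ascii\n};
print extract_boundary($quoted, 1), "\n";
print extract_boundary("multipart/mixed; boundary=simple-sep\n", 2), "\n";
print extract_boundary("text/plain\n", 3), "\n";
```

Matching up to the next double quote, rather than greedily to the last one on the line, keeps a later quoted parameter from being swallowed into the boundary.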

Along the way, we have called some small utility subroutines: abort, clean, close_fhs, and exit. abort takes care of cleanly shutting down the script if the user aborts the run. clean is called at the end of each message parsing; it removes temporary files created by MIME-tools and resets buffers. close_fhs closes file handles, and exit should be pretty obvious. Here they are:

sub abort {

       print "\nABORTED by user.\n";

       # Delete output file!
       &close_fhs();
       unlink($newfile);
       &clean();
       &exit ();

} # end sub abort

sub clean {

        # Clean up any temporary files left on disk.
        $entity->purge;

        # Forget about the current message.
        undef($entity);
        undef(@parts_to_keep);
        undef($msg);

} # end sub clean

sub close_fhs {

        # Close the mailboxes.
        close(MBOX) ;
        close(NEWBOX) ;

} # end sub close_fhs

sub exit {

        &close_fhs();

        # Say Goodbye.
        print "Finished.\n\n";
        exit(0);

} # end sub exit


[24] Hideous Pseudocode.
