The Enron mail data makes for great illustrations in a chapter on mail analysis, but you’ll almost certainly want to take a closer look at your own mail data. Fortunately, many popular mail clients provide an “export to mbox” option, which makes it pretty simple to get your mail data into a format that lends itself to analysis by the techniques described in this chapter. For example, in Apple Mail, you can select some number of messages, pick “Save As…” from the File menu, and then choose “Raw Message Source” as the formatting option to export the messages as an mbox file (see Figure 3-7). A little bit of searching should turn up results for how to do this in most other major clients.
If you exclusively use an online mail client, you could opt to pull
your data down into a mail client and export it, but you might prefer to
fully automate the creation of an mbox file by pulling the data directly
from the server. Just about any online mail service will support POP3 (Post
Office Protocol version 3), most also support IMAP (Internet Message
Access Protocol), and Python scripts for pulling down your mail aren’t
very hard to whip up. One particularly robust command-line tool that you
can use to pull mail data from just about anywhere is getmail,
which turns out to be written in Python. Two modules included in Python’s
standard library, poplib and imaplib, provide a terrific foundation, so
you’re also likely to run across lots of useful scripts if you do a bit of
searching online. getmail
is particularly easy to get up and running. To slurp down your Gmail inbox
data, for example, you just download and install it, then set up a
getmailrc file with a few basic options. Example 3-22 demonstrates some
settings for a *nix environment. Windows users would need to change the
[destination] path and [options] message_log values to valid paths.
Example 3-22. Sample getmail settings for a *nix environment
[retriever]
type = SimpleIMAPSSLRetriever
server = imap.gmail.com
username = ptwobrussell
password = blarty-blar-blar

[destination]
type = Mboxrd
path = /tmp/gmail.mbox

[options]
verbose = 2
message_log = ~/.getmail/gmail.log
With a configuration in place, simply invoking getmail
from a terminal does the rest. Once you have a local mbox on hand, you can
analyze it using the techniques you’ve learned in this chapter:
$ getmail
getmail version 4.20.0
Copyright (C) 1998-2009 Charles Cazabon. Licensed under the GNU GPL version 2.
SimpleIMAPSSLRetriever:[email protected]:993:
msg 1/10972 (4227 bytes) from ... delivered to Mboxrd /tmp/gmail.mbox
msg 2/10972 (3219 bytes) from ... delivered to Mboxrd /tmp/gmail.mbox
...
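Once getmail has finished, a quick sanity check of the resulting mbox can be sketched with the mailbox module from Python's standard library. The snippet below is a minimal illustration, not code from this chapter: the function name count_senders is hypothetical, and it assumes Python 3 and an mbox at the /tmp/gmail.mbox path used in Example 3-22.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_senders(mbox_path):
    """Tally messages per sender address in an mbox file.

    count_senders is a hypothetical helper for illustration only.
    """
    counts = Counter()
    # create=False raises an error if the file is missing instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # parseaddr splits "Name <addr>" into (name, addr).
            _, address = parseaddr(str(message.get("From", "")))
            if address:
                counts[address.lower()] += 1
    finally:
        box.close()
    return counts


# Example usage (assumes the mbox from Example 3-22 exists):
#   for address, n in count_senders("/tmp/gmail.mbox").most_common(10):
#       print(n, address)
```

Lowercasing the address before counting is a design choice that merges senders whose address appears with different capitalization; drop it if you need the raw values.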
“Tapping into Your Gmail” investigates using imaplib to slurp down your
Gmail data and analyze it, as one part of the exercises in Chapter 7,
which focuses on Google technologies.
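As a rough idea of what a hand-rolled alternative to getmail might look like, the following sketch pulls raw messages over IMAP with imaplib and appends them to a local mbox with the mailbox module. It is not the code from Chapter 7: the function names fetch_raw_messages and append_to_mbox are hypothetical, it assumes Python 3, and it assumes the server accepts a plain username and password login (some providers may instead require an app-specific password or OAuth).

```python
import imaplib
import mailbox


def fetch_raw_messages(host, username, password, folder="INBOX"):
    """Yield each message in an IMAP folder as raw RFC 822 bytes.

    Hypothetical helper; assumes plain username/password login works.
    """
    conn = imaplib.IMAP4_SSL(host)  # TLS connection, port 993 by default
    try:
        conn.login(username, password)
        conn.select(folder, readonly=True)  # read-only: leave flags alone
        status, data = conn.search(None, "ALL")
        if status != "OK":
            raise RuntimeError("IMAP search failed")
        for num in data[0].split():
            status, parts = conn.fetch(num, "(RFC822)")
            # A successful fetch returns a (header, body) tuple first.
            if status == "OK" and parts and isinstance(parts[0], tuple):
                yield parts[0][1]
    finally:
        conn.logout()


def append_to_mbox(raw_messages, mbox_path):
    """Append raw messages to an mbox file and return how many were added."""
    box = mailbox.mbox(mbox_path)  # the file is created if it is missing
    count = 0
    try:
        box.lock()
        for raw in raw_messages:
            box.add(raw)
            count += 1
        box.flush()
    finally:
        box.close()  # close() also releases the lock
    return count


# Example usage (credentials are placeholders):
#   messages = fetch_raw_messages("imap.gmail.com", "username", "password")
#   append_to_mbox(messages, "/tmp/gmail.mbox")
```

Keeping the network fetch and the mbox writing in separate functions means the writing half can be exercised against canned messages without touching a mail server.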
There are several useful toolkits floating around that analyze webmail, and one of the most promising to emerge recently is the Graph Your Inbox Chrome Extension. To use this extension, you just install it, authorize it to access your mail data, run some Gmail queries, and let it take care of the rest. You can search for keywords like “pizza,” time values such as “2010,” or run more advanced queries such as “from:[email protected]” and “label:Strata”. It’s highly likely that this extension is only going to keep getting better, given that it’s new and has been so well received thus far. Figure 3-8 shows a sample screenshot.
“Tapping into Your Gmail” provides an overview of how to use Python’s
imaplib module to tap into your Gmail account (or any other mail account
that speaks IMAP) and mine the textual information in messages. Be sure
to check it out when you’re ready to move beyond mail header information
and dig into text mining.