Google+ is a great source of clean textual data that you can mine, but it’s just one of many starting points. Since this chapter showcases Google technology, this section provides a brief overview of how to tap into your Gmail data so that you can mine the text of what may be many thousands of messages in your inbox. If you haven’t read Chapter 3 yet, note that it’s devoted to mail analysis but focuses primarily on the structured data features of mail messages, such as the participants in mail threads. The techniques outlined in that chapter can easily be applied to Gmail messages, and vice versa.
In early 2010, Google announced OAuth access to IMAP and SMTP in Gmail. This was a significant announcement because it officially opened the door to “Gmail as a platform,” enabling third-party developers to build apps that can access your Gmail data without you needing to give them your username and password. This section won’t get into the nuances of how Xoauth, Google’s implementation of OAuth, works (see “No, You Can’t Have My Password” for a terse introduction to OAuth). Instead, it focuses on getting you up and running so that you can access your Gmail data, which involves just a few simple steps:
Select the “Enable IMAP” option under the “Forwarding and POP/IMAP” tab in your Gmail Account Settings.
Visit the Google Mail Xoauth Tools wiki page, download the xoauth.py command-line utility, and follow the instructions to generate an OAuth token and secret for an “anonymous” consumer.[49]
Install python-oauth2 via easy_install oauth2 and use the template in Example 7-11 to establish a connection.
Example 7-11. A template for connecting to IMAP using OAuth (plus__gmail_template.py)
# -*- coding: utf-8 -*-

import sys
import oauth2 as oauth
import oauth2.clients.imap as imaplib

# See http://code.google.com/p/google-mail-xoauth-tools/wiki/
#     XoauthDotPyRunThrough for details on xoauth.py

OAUTH_TOKEN = sys.argv[1]  # obtained with xoauth.py
OAUTH_TOKEN_SECRET = sys.argv[2]  # obtained with xoauth.py
GMAIL_ACCOUNT = sys.argv[3]  # your full Gmail address

url = 'https://mail.google.com/mail/b/%s/imap/' % (GMAIL_ACCOUNT, )

# Standard values for Gmail's Xoauth
consumer = oauth.Consumer('anonymous', 'anonymous')
token = oauth.Token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

conn = imaplib.IMAP4_SSL('imap.googlemail.com')
conn.debug = 4  # set to the desired debug level
conn.authenticate(url, consumer, token)

conn.select('INBOX')

# access your INBOX data
Once you’re able to access your mail data, the next step is to fetch and parse some message data.
The IMAP protocol is a fairly finicky and complex beast, but the good news is that you don’t have to know much of it to search and fetch mail messages. imaplib-compliant examples are readily available online, and one of the more common operations you’ll want to do is search for messages. There are various ways that you can construct an IMAP query. An example of how you’d search for messages from a particular user is conn.search(None, '(FROM "me")'), where None is an optional parameter for the character set and '(FROM "me")' is a search command to find messages that you’ve sent yourself (Gmail recognizes “me” as the authenticated user). A command to search for messages containing “foo” in the subject would be '(SUBJECT "foo")', and there are many additional possibilities that you can read about in Section 6.4.4 of RFC 3501, which defines the IMAP specification. imaplib returns a search response as a tuple that consists of a status code and a string of space-separated message IDs wrapped in a list, such as ('OK', ['506 527 566']). You can parse out these ID values to fetch RFC 822-compliant mail messages, but alas, there’s additional work involved to parse the content of the mail messages into a usable form. Fortunately, with some minimal adaptation, we can reuse the code from Example 3-3, which used the email module to parse messages into a more readily usable form, to take care of the uninteresting email-parsing cruft that’s necessary to get usable text from each message. Example 7-12 illustrates.
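Before turning to the full listing, the search-criteria strings and the response tuple described above can be tried out in isolation, without a live connection. The following is a minimal sketch; the helper names build_search_criteria and parse_search_response are hypothetical and are not part of imaplib or python-oauth2. It assumes only the response shape quoted above, ('OK', ['506 527 566']).

```python
def build_search_criteria(field, value):
    """Build a one-key IMAP search criterion such as (SUBJECT "foo").

    Hypothetical helper for illustration; not part of imaplib.
    """
    # Escape backslashes and double quotes so that the value remains a
    # valid IMAP quoted string
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return '(%s "%s")' % (field.upper(), escaped)


def parse_search_response(response):
    """Turn a (status, data) search response into a list of message IDs."""
    status, data = response
    if status != 'OK':
        raise ValueError('IMAP search failed with status %r' % (status, ))
    # data is a list holding one space-separated string of IDs; it may be
    # a byte string, so normalize it to text before splitting
    raw = data[0] if data and data[0] else ''
    if isinstance(raw, bytes):
        raw = raw.decode('ascii')
    return raw.split()


criteria = build_search_criteria('subject', 'foo')
ids = parse_search_response(('OK', ['506 527 566']))
```

Each ID in the resulting list is what you would pass, one at a time, to a fetch call such as the conn.fetch(i, '(RFC822)') used in Example 7-12.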
Example 7-12. A simple workflow for extracting the bodies of Gmail messages returned from a search (plus__search_and_parse_mail.py)
# -*- coding: utf-8 -*-

import oauth2 as oauth
import oauth2.clients.imap as imaplib

import os
import sys
import email
import quopri
import json
from BeautifulSoup import BeautifulSoup

# See http://code.google.com/p/google-mail-xoauth-tools/wiki/
#     XoauthDotPyRunThrough for details on xoauth.py

OAUTH_TOKEN = sys.argv[1]  # obtained with xoauth.py
OAUTH_TOKEN_SECRET = sys.argv[2]  # obtained with xoauth.py
GMAIL_ACCOUNT = sys.argv[3]  # your full Gmail address
Q = sys.argv[4]  # text to search for in message subjects

url = 'https://mail.google.com/mail/b/%s/imap/' % (GMAIL_ACCOUNT, )

# Authenticate with OAuth
# Standard values for Gmail's xoauth implementation
consumer = oauth.Consumer('anonymous', 'anonymous')
token = oauth.Token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

conn = imaplib.IMAP4_SSL('imap.googlemail.com')
conn.debug = 4
conn.authenticate(url, consumer, token)

# Select a folder of interest
conn.select('INBOX')

# Repurpose scripts from "Mailboxes: Oldies but Goodies"

def cleanContent(msg):

    # Decode message from "quoted printable" format
    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present
    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, CC, and Bcc fields, if present, could have multiple items
    # Note that not all of these fields are necessarily defined
    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        # Strip newlines, tabs, carriage returns, and spaces before splitting
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '') \
            .replace('\r', '').replace(' ', '') \
            .decode('utf-8', 'ignore').split(',')

    try:
        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue
            json_part['contentType'] = part.get_content_type()
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)
            json_msg['parts'].append(json_part)
    except Exception, e:
        sys.stderr.write('Skipping message - error encountered (%s)' % (str(e), ))
    finally:
        return json_msg


# Consume a query from the user. This example illustrates searching by subject
(status, data) = conn.search(None, '(SUBJECT "%s")' % (Q, ))
ids = data[0].split()

messages = []
for i in ids:
    try:
        (status, data) = conn.fetch(i, '(RFC822)')
        messages.append(email.message_from_string(data[0][1]))
    except Exception, e:
        print >> sys.stderr, 'Error fetching message %s. Skipping it.' % (i, )

jsonified_messages = [jsonifyMessage(m) for m in messages]

# Separate out the text content from each message so that it can be analyzed
content = [p['content'] for m in jsonified_messages for p in m['parts']]

# Note: Content can still be quite messy and contain lots of line breaks and other quirks

if not os.path.isdir('out'):
    os.mkdir('out')

filename = os.path.join('out', GMAIL_ACCOUNT.split("@")[0] + '.gmail.json')
f = open(filename, 'w')
f.write(json.dumps(jsonified_messages))
f.close()

print >> sys.stderr, "Data written out to", f.name
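The message-parsing step can also be exercised offline against a hand-built message, which is handy when you want to experiment without a Gmail connection. The sketch below is a variation on the approach above, not the listing's own method: it is written for Python 3, uses only the standard library email package, and lets get_payload(decode=True) undo the transfer encoding instead of calling quopri directly. RAW_MESSAGE and extract_message are hypothetical names, the addresses are placeholders, and HTML tags are deliberately left in place.

```python
import email
from email.utils import getaddresses

# A hand-built multipart message standing in for data fetched over IMAP
RAW_MESSAGE = "\n".join([
    "From: Alice <alice@example.com>",
    "To: bob@example.com, carol@example.com",
    "Subject: Lunch",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="BOUNDARY"',
    "",
    "--BOUNDARY",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "Caf=C3=A9 at noon?",
    "--BOUNDARY",
    'Content-Type: text/html; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "<p>Caf=C3=A9 at <b>noon</b>?</p>",
    "--BOUNDARY--",
    "",
])


def extract_message(raw_message):
    """Return the recipients and decoded leaf parts of one message."""
    msg = email.message_from_string(raw_message)

    # getaddresses splits a comma-separated header into (name, address) pairs
    recipients = [addr for (_, addr) in getaddresses(msg.get_all('To', []))]

    parts = []
    for part in msg.walk():
        # Multipart containers hold no text of their own
        if part.get_content_maintype() == 'multipart':
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        charset = part.get_content_charset() or 'utf-8'
        parts.append({
            'contentType': part.get_content_type(),
            'content': payload.decode(charset, 'ignore'),
        })
    return {'To': recipients, 'parts': parts}


parsed = extract_message(RAW_MESSAGE)
```

The returned dictionary mirrors the 'parts' structure built by jsonifyMessage, so the same list comprehension that collects p['content'] values would apply to it.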
Once you’ve successfully parsed out the text from the body of a Gmail message, some additional work will be required to cleanse the text to the point that it’s suitable for a nice display or advanced NLP, as illustrated in Chapter 8, but not much effort is required to get it to the point where it’s clean enough for collocation analysis. In fact, the results of Example 7-12 can be fed almost directly into Example 7-9 to produce a list of collocations from the search results. A very interesting visualization exercise would be to create a graph plotting the strength of linkages between messages based on the number of bigrams they have in common, as determined by a custom metric.
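As a starting point for that exercise, here is a minimal sketch of scoring the linkage between every pair of messages by their shared bigrams. The choice of metric is an assumption made for illustration: it uses the Jaccard index (shared bigrams divided by all distinct bigrams in the pair), and tokenization is plain whitespace splitting. The function names bigrams and bigram_linkages are hypothetical.

```python
from itertools import combinations


def bigrams(text):
    """Return the set of adjacent word pairs in a piece of text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def bigram_linkages(texts):
    """Map each pair of message indexes (i, j) to a linkage score.

    The score is the Jaccard index of the two bigram sets, an assumed
    metric; pairs with no bigrams in common are omitted.
    """
    bigram_sets = [bigrams(t) for t in texts]
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(bigram_sets), 2):
        shared = a & b
        if shared:
            edges[(i, j)] = len(shared) / float(len(a | b))
    return edges


texts = [
    "the quick brown fox",
    "a quick brown dog",
    "completely unrelated words here",
]
edges = bigram_linkages(texts)
```

The keys of edges are the graph's edges and the values are their weights, so the dictionary can be handed to whatever graphing tool you prefer. The content list built in Example 7-12 could take the place of texts.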
[49] If you’re just hacking your own Gmail data, using the anonymous consumer credentials generated from xoauth.py is just fine; you can always register and create a “trusted” client application when it becomes appropriate to do so.