Tapping into Your Gmail

Google+ is a great source of clean textual data that you can mine, but it’s just one of many starting points. Since this chapter showcases Google technology, this section provides a brief overview of how to tap into your Gmail data so that you can mine the text of what may be many thousands of messages in your inbox. If you haven’t read Chapter 3 yet, note that it’s devoted to mail analysis but focuses primarily on the structured data features of mail messages, such as the participants in mail threads. The techniques outlined in that chapter can easily be applied to Gmail messages, and vice versa.

Accessing Gmail with OAuth

In early 2010, Google announced OAuth access to IMAP and SMTP in Gmail. This was a significant announcement because it officially opened the door to “Gmail as a platform,” enabling third-party developers to build apps that can access your Gmail data without you needing to give them your username and password. This section won’t get into the nuances of how Xoauth, Google’s implementation of OAuth, works (see “No, You Can’t Have My Password” for a terse introduction to OAuth). Instead, it focuses on getting you up and running so that you can access your Gmail data, which involves just a few simple steps:

  • Select the “Enable IMAP” option under the “Forwarding and POP/IMAP” tab in your Gmail Account Settings.

  • Visit the Google Mail Xoauth Tools wiki page, download the xoauth.py command-line utility, and follow the instructions to generate an OAuth token and secret for an “anonymous” consumer.[49]

  • Install python-oauth2 via easy_install oauth2 and use the template in Example 7-11 to establish a connection.

    Example 7-11. A template for connecting to IMAP using OAuth (plus__gmail_template.py)

    # -*- coding: utf-8 -*-
    
    import sys
    import oauth2 as oauth
    import oauth2.clients.imap as imaplib
    
    # See http://code.google.com/p/google-mail-xoauth-tools/wiki/
    #     XoauthDotPyRunThrough for details on xoauth.py
    
    OAUTH_TOKEN = sys.argv[1]  # obtained with xoauth.py
    OAUTH_TOKEN_SECRET = sys.argv[2]  # obtained with xoauth.py
    GMAIL_ACCOUNT = sys.argv[3]  # your full Gmail address
    
    url = 'https://mail.google.com/mail/b/%s/imap/' % (GMAIL_ACCOUNT, )
    
    # Standard values for Gmail's Xoauth
    consumer = oauth.Consumer('anonymous', 'anonymous')  
    token = oauth.Token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    
    conn = imaplib.IMAP4_SSL('imap.googlemail.com')
    conn.debug = 4  # set to the desired debug level
    conn.authenticate(url, consumer, token)
    
    conn.select('INBOX')
    
    # access your INBOX data

Once you’re able to access your mail data, the next step is to fetch and parse some message data.

Fetching and Parsing Email Messages

IMAP is a fairly finicky and complex protocol, but the good news is that you don’t have to know much of it to search and fetch mail messages. imaplib-compliant examples are readily available online, and one of the more common operations you’ll want to perform is searching for messages. There are various ways to construct an IMAP query. For example, you’d search for messages from a particular user with conn.search(None, '(FROM "me")'), where None stands in for the character set and '(FROM "me")' is a search command to find messages that you’ve sent yourself (Gmail recognizes “me” as the authenticated user). A command to search for messages containing “foo” in the subject would be '(SUBJECT "foo")', and there are many additional possibilities that you can read about in Section 6.4.4 of RFC 3501, which defines the IMAP specification.

imaplib returns a search response as a tuple that consists of a status code and a string of space-separated message IDs wrapped in a list, such as ('OK', ['506 527 566']). You can parse out these ID values to fetch RFC 822-compliant mail messages, but alas, there’s additional work involved in parsing the content of those messages into a usable form. Fortunately, with some minimal adaptation, we can reuse the code from Example 3-3, which used the email module to take care of the uninteresting email-parsing cruft that’s necessary to get usable text from each message. Example 7-12 illustrates.
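Before turning to the full listing, it may help to see the search response format in isolation. The following is a minimal sketch, not part of the book's code: the helper name parse_search_response is hypothetical, and the function is written to run under Python 3 (where imaplib hands back bytes) as well as against the string form shown above.

```python
def parse_search_response(response):
    """Return the message IDs from an imaplib-style search response.

    `response` is assumed to be a (status, data) tuple such as
    ('OK', ['506 527 566']). This helper is illustrative only.
    """
    status, data = response
    if status != 'OK':
        raise ValueError('search failed with status %r' % (status,))

    # data is a one-element list holding the space-separated IDs;
    # an empty result arrives as an empty string (or empty bytes)
    raw_ids = data[0] if data and data[0] else ''

    # Python 3's imaplib returns bytes, so normalize to text first
    if isinstance(raw_ids, bytes):
        raw_ids = raw_ids.decode('ascii')

    return raw_ids.split()


if __name__ == '__main__':
    print(parse_search_response(('OK', ['506 527 566'])))
```

Each ID in the returned list can then be passed to conn.fetch, as Example 7-12 does in its fetch loop.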

Example 7-12. A simple workflow for extracting the bodies of Gmail messages returned from a search (plus__search_and_parse_mail.py)

# -*- coding: utf-8 -*-

import oauth2 as oauth
import oauth2.clients.imap as imaplib

import os
import sys
import email
import quopri
import json
from BeautifulSoup import BeautifulSoup

# See http://code.google.com/p/google-mail-xoauth-tools/wiki/
#     XoauthDotPyRunThrough for details on xoauth.py

OAUTH_TOKEN = sys.argv[1]  # obtained with xoauth.py
OAUTH_TOKEN_SECRET = sys.argv[2]  # obtained with xoauth.py
GMAIL_ACCOUNT = sys.argv[3]  # your full Gmail address
Q = sys.argv[4]

url = 'https://mail.google.com/mail/b/%s/imap/' % (GMAIL_ACCOUNT, )

# Authenticate with OAuth

# Standard values for Gmail's xoauth implementation
consumer = oauth.Consumer('anonymous', 'anonymous')  
token = oauth.Token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
conn = imaplib.IMAP4_SSL('imap.googlemail.com')
conn.debug = 4
conn.authenticate(url, consumer, token)

# Select a folder of interest

conn.select('INBOX')

# Repurpose scripts from "Mailboxes: Oldies but Goodies"


def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present

    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, CC, and Bcc fields, if present, could have multiple items
    # Note that not all of these fields are necessarily defined

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        # Strip whitespace and split on commas (values were already decoded above)
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '') \
            .replace('\r', '').replace(' ', '').split(',')

    try:
        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue
            json_part['contentType'] = part.get_content_type()
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)

            json_msg['parts'].append(json_part)
    except Exception, e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    finally:
        return json_msg


# Consume a query from the user. This example illustrates searching by subject

(status, data) = conn.search(None, '(SUBJECT "%s")' % (Q, ))
ids = data[0].split()

messages = []
for i in ids:
    try:
        (status, data) = conn.fetch(i, '(RFC822)')
        messages.append(email.message_from_string(data[0][1]))
    except Exception, e:
        print >> sys.stderr, 'Error fetching message %s. Skipping it.' % (i, )

jsonified_messages = [jsonifyMessage(m) for m in messages]

# Separate out the text content from each message so that it can be analyzed

content = [p['content'] for m in jsonified_messages for p in m['parts']]

# Note: Content can still be quite messy and contain lots of line breaks and other quirks

if not os.path.isdir('out'):
    os.mkdir('out')

filename = os.path.join('out', GMAIL_ACCOUNT.split("@")[0] + '.gmail.json')
f = open(filename, 'w')
f.write(json.dumps(jsonified_messages))
f.close()

print >> sys.stderr, "Data written out to", f.name
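
If you'd like to see what the parsing helpers are working with before connecting to a live mailbox, the following sketch runs the same walk-and-decode steps over a hand-built message. It is an illustration under stated assumptions rather than the book's code: the sample message and addresses are made up, the function name extract_text_parts is hypothetical, and the code targets Python 3's standard library instead of the Python 2 listing above.

```python
import email
import quopri

# A made-up RFC 822-style message with a quoted-printable body
# (addresses and content are illustrative only)
RAW_MESSAGE = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: foo\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Caf=C3=A9 meeting at noon=\n"
    " tomorrow.\n"
)


def extract_text_parts(raw):
    """Parse a raw message and return (message, list of decoded parts)."""
    msg = email.message_from_string(raw)
    parts = []
    for part in msg.walk():
        # Multipart containers hold other parts but no text of their own
        if part.get_content_maintype() == 'multipart':
            continue
        payload = part.get_payload(decode=False)
        # Undo the quoted-printable encoding, then decode the bytes as UTF-8
        text = quopri.decodestring(payload.encode('ascii')).decode('utf-8', 'ignore')
        parts.append({'contentType': part.get_content_type(), 'content': text})
    return msg, parts


if __name__ == '__main__':
    msg, parts = extract_text_parts(RAW_MESSAGE)
    recipients = [addr.strip() for addr in msg['To'].split(',')]
    print(recipients)
    print(parts)
```

The soft line break (the trailing = sign) in the sample body disappears after decoding, which is the kind of cleanup cleanContent performs before any HTML is stripped.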

Once you’ve successfully parsed out the text from the body of a Gmail message, some additional work will be required to cleanse the text to the point that it’s suitable for a nice display or advanced NLP, as illustrated in Chapter 8, but not much effort is required to get it to the point where it’s clean enough for collocation analysis. In fact, the results of Example 7-12 can be fed almost directly into Example 7-9 to produce a list of collocations from the search results. A very interesting visualization exercise would be to create a graph plotting the strength of linkages between messages based on the number of bigrams they have in common, as determined by a custom metric.
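
As a starting point for that exercise, here is a minimal sketch of one possible linkage metric: the raw count of distinct bigrams two messages share. The metric, the crude tokenizer, the function names, and the sample texts are all assumptions made for illustration; a real analysis would more likely build on the collocation tooling from Example 7-9 and weight or filter the bigrams first.

```python
import itertools
import re


def bigrams(text):
    """Return the set of adjacent word pairs in text (crude tokenization)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return set(zip(tokens, tokens[1:]))


def shared_bigram_edges(texts):
    """Return (i, j, weight) edges for each pair of texts sharing bigrams.

    The weight is simply the number of distinct bigrams in common.
    """
    bigram_sets = [bigrams(t) for t in texts]
    edges = []
    for i, j in itertools.combinations(range(len(texts)), 2):
        weight = len(bigram_sets[i] & bigram_sets[j])
        if weight:
            edges.append((i, j, weight))
    return edges


if __name__ == '__main__':
    # Made-up message bodies standing in for the parsed 'content' values
    sample = [
        "the quarterly report is due friday",
        "please send the quarterly report today",
        "lunch on friday?",
    ]
    print(shared_bigram_edges(sample))  # [(0, 1, 2)]
```

The resulting edge list can be handed to whatever graphing tool you prefer, with the weight driving edge thickness or proximity.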



[49] If you’re just hacking your own Gmail data, using the anonymous consumer credentials generated from xoauth.py is just fine; you can always register and create a “trusted” client application when it becomes appropriate to do so.
