Data refactoring

The from field contains more information than we need: all we want from it is the email address. Let's do some refactoring:

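First, import the regular expression package, along with NumPy, which the function below needs for np.nan (skip the second import if NumPy is already loaded):

import re
import numpy as np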

Next, let's create a function that takes an entire string from any column and extracts an email address:

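def extract_email_ID(string):
  # Prefer an address wrapped in angle brackets, as in 'Name <address>'
  email = re.findall(r'<(.+?)>', string)
  if not email:
    # Otherwise, fall back to any whitespace-separated token containing '@'
    email = list(filter(lambda y: '@' in y, string.split()))
  # Return the first match, or NaN if nothing was found
  return email[0] if email else np.nan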

The preceding function is pretty straightforward. It first uses a regular expression to look for an address enclosed in angle brackets. If it finds none, it falls back to the first whitespace-separated token that contains an @ sign. If there is still no match, we populate the field with NaN. If you are not sure about regular expressions, don't worry. Just read the Appendix.
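To see how the two branches and the NaN fallback behave, here is a minimal sketch that runs the same logic on three made-up header values. The sample strings and addresses are assumptions for illustration only, and float('nan') stands in for np.nan so that the snippet runs with the standard library alone:

```python
import re

def extract_email_ID(string):
    # Same logic as the function above; float('nan') replaces np.nan here
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = [tok for tok in string.split() if '@' in tok]
    return email[0] if email else float('nan')

# Hypothetical header values, made up for illustration
with_brackets = extract_email_ID('Jane Doe <jane@example.com>')
bare_address = extract_email_ID('jane@example.com')
no_address = extract_email_ID('Mail Delivery Subsystem')

print(with_brackets)  # jane@example.com
print(bare_address)   # jane@example.com
print(no_address)     # nan
```

Note that the fallback accepts any token containing an @ sign, so it does not validate that the token is a well-formed address.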

Next, let's apply the function to the from column:

dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))

The lambda function passes each value in the column to extract_email_ID. Passing the function directly, as in dfs['from'].apply(extract_email_ID), is equivalent.
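One thing to watch for: re.findall raises a TypeError when it receives something that is not a string, which is what happens if a value in the column is missing (pandas represents it as a float NaN). The text does not say whether the from column contains such values, so the following is only a sketch of a defensive variant. The name extract_email_safe is hypothetical, and float('nan') again stands in for np.nan:

```python
import re

def extract_email_safe(value):
    # Hypothetical variant: return NaN for anything that is not a string
    if not isinstance(value, str):
        return float('nan')
    found = re.findall(r'<(.+?)>', value)
    if not found:
        found = [tok for tok in value.split() if '@' in tok]
    return found[0] if found else float('nan')

# A missing value passes through as NaN instead of raising TypeError
missing = extract_email_safe(float('nan'))
present = extract_email_safe('Jane Doe <jane@example.com>')  # made-up value
print(missing, present)
```

If the column is known to contain only strings, the original function is sufficient.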

Next, we are going to refactor the label field. The logic is simple: if an email is from your own address, it is a sent email. Otherwise, it is a received email, that is, an inbox email:

myemail = '[email protected]'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x==myemail else 'inbox')

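The preceding code compares each extracted address with myemail and labels the row sent on a match and inbox otherwise, so it should run after the from column has been cleaned.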
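As a quick check of the labeling rule, the sketch below applies it to a tiny DataFrame. The addresses and the three rows are placeholders invented for this example, not data from the dataset. It also shows a vectorized alternative with np.where, which is a swapped-in technique and not the approach used in the text:

```python
import numpy as np
import pandas as pd

# Hypothetical data: these addresses are placeholders for illustration
dfs = pd.DataFrame({'from': ['me@example.com',
                             'friend@example.org',
                             'me@example.com']})
myemail = 'me@example.com'

# Same rule as in the text: 'sent' if the sender is you, else 'inbox'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
labels = dfs['label'].tolist()
print(labels)  # ['sent', 'inbox', 'sent']

# Vectorized alternative that produces the same labels
labels_vec = np.where(dfs['from'] == myemail, 'sent', 'inbox').tolist()
```

Rows whose from value is NaN never equal myemail, so under this rule they are labeled inbox.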
