Chapter 9

Email: The lifeblood of modern communication

Abstract

Email is one of the oldest, yet most commonly used, forms of communication. Email networks connect email addresses to one another based on the number of messages sent and received. Three types of email networks exist: personal, organizational, and community. Personal networks, such as the one examined in this chapter, reveal insights about the person and relationships between those they communicate with. Corporate networks, such as the ABC network, can show connections between corporate units, helping identify redundancies, missed opportunities, and critical units based on actual communication patterns rather than official organizational charts. An examination of the Enron email network illustrates the value of email network analysis for forensic investigations. NodeXL includes an email network importer that can import data from email clients (e.g., Microsoft Outlook) and email servers (e.g., Microsoft Exchange). Email addresses of the same person must be combined in a process called deduplication.

Keywords

Email; Email network; Corporate email; Microsoft Exchange; Deduplication; Privacy; Enron; Email forensics; Importer

9.1 Introduction

Email has permeated society more than any other form of social media. It is hard to remember a time when inboxes sat on desks and spam was only a processed meat. Today, an estimated 3.8 billion worldwide email users send over 200 billion emails per day.1 Unfortunately, about half of them are considered spam.1 Email is the de facto form of communication for many corporations, nonprofits, and government agencies. Email and email lists are used to keep extended families in touch, coordinate neighborhood activities, support medical patients, share cutting-edge research, solve technical problems, and perform a host of other activities.

In April 2017, 91% of U.S. Internet users sent or received an email message, making it the most common Internet activity of all.2 Over 80% of email users check email at least once a day.3 Unlike many social media tools, email is widely used among nearly every demographic group. Furthermore, many social media sites optionally send email notifications because of email's continued ubiquity.

The integration of email into everyday life makes email networks the most accessible and in many cases most accurate source of data for mapping actual social and work relationships. Analyzing one's personal email collection is a lot like looking in a mirror. Prototype systems like PostHistory demonstrated the interest people have in seeing representations of their social media and show how maps of social connections and activity can promote sustained engagement and storytelling around important events [1]. Network visualizations provide an objective representation of one's social ties, encouraging self-reflection and providing a guide to social hygiene. These maps and reports may help you realize unappreciated or forgotten relationships, or identify a past working group that could be rekindled for a current project. They can help us overcome some of our memory biases such as weighing recent events more or remembering things we've initiated more than those initiated by others. A visualization of your personal email network can be shared with other people to explain your social world. For example, new employees could benefit from a summary of important social ties and collaborative groupings related to a job role or position. Personal email collections are of increasing importance to historians, researchers, archivists, and lawyers who are engaged in the discovery and preservation of electronic records.

Analyzing organizational email networks and email lists can provide a wealth of social information that can inform important decisions and support novel interventions. Organizations can identify unique social roles, individuals who span the gaps between organizational silos, internal influencers, and employees in need of creating more connections. This information can be used as one input to help inform personnel hiring and promotion, improve retention, and spread important messages through a company. Analysis of organic employee clusters, rather than formal organizational charts, can be used to inform the formation of communities of practice and organizational restructuring, and it can help integrate relationships after mergers. Expertise networks can be identified by the use of keywords to infer topics, leading to more intelligent workgroup formation and information sharing. The analysis of an internal company or public email list can help identify experts on a topic, monitor the health of the community over time, and identify potential candidates for leadership roles in the list. Because it is based on actual behavior instead of potentially biased self-reports [2], its validity is high.

Working with email poses several ethical challenges. Although company email is far from private and, unless encrypted, far from secure, many users don't realize just how public their email is. A 2007 survey found that nearly half of the 304 U.S. companies surveyed monitored email use.4 More than a quarter of these companies had fired workers for email misuse. A related survey from 2006 found that 24% of employers had email subpoenaed by courts and regulators. Employers must walk a fine line between controlling the risks of litigation and security breaches by employees and not coming across as Big Brother. In such an environment, they must carefully consider the risks and rewards associated with analyzing company email collections. Researchers must also be careful to obtain proper approval from list owners, managers, and members; use pseudonyms when advisable;5 and protect members' privacy. For corporations and researchers, transparency is needed when articulating the goals of the analysis, the procedures for assuring confidentiality of message content, and the decisions that will be informed by the analysis. Options for employees or research subjects to opt out or prefilter email may be desirable. Although an analysis of email collections poses some risks, social network analysis can be less intrusive than many other methods for understanding social interaction. It provides an interesting midway point for those willing to share who they talk to but not what they say.

9.2 History and definition of email

Electronic mail, or email, is an electronic message transmitted over a communications network, typically as a text file with optional attachments. Email is older than the Internet itself. In the 1960s, email-like messages were sent between users of the same mainframe computer. Those who accessed the same mainframe or host computer through terminals could exchange messages. In the 1960s and 1970s, many companies used this approach to allow employees to contact other employees located throughout the world in different branch offices or subsidiaries. Email became the “killer app” of ARPANET, the computer network developed by the United States Department of Defense that evolved into the Internet. In 1971, Ray Tomlinson sent the first network email, using an “@” symbol to separate the user's name and the host computer's name. By 1973, approximately three-fourths of all ARPANET data traffic was email. Over time, email became more standardized and increasingly interoperable between different computer and network systems. Applications for interacting with email became more feature rich and usable. Email services became essentially free from an end-user perspective with popular web services such as Hotmail, Yahoo mail, and Gmail. Although email services may have access fees, each additional email sent or received does not typically impose an additional cost to the user.

Most readers are familiar with email as everyday users. Some important technical characteristics make email particularly powerful:

  •  Flexible form. Email can be a simple plain-text message, a richly formatted newsletter, or even an interactive survey. With attachments, nearly any type of digital content can be sent, provided it is not too large a file. This flexibility allows email to support informal banter, semiformal memos, and formal letters. This flexibility also leads to email overload in which the same channel is used to host conversations, store files, track tasks, and manage transactions.
  •  Asynchronous. The asynchronous nature of email allows people to send and receive messages on their own time, without interrupting others. The lack of immediate feedback in a text-only medium can lead to misunderstandings, but it can also encourage more careful and thorough contributions. The standard reverse-chronological order, where the newest messages are shown at the top of the inbox, makes the asynchronous nature of email more manageable by helping people distinguish new messages from old.
  •  Broadcast. Emails can be sent to any number of people simultaneously. Ad hoc groups can be created on the fly by sending to multiple email addresses and using the common Reply to All feature. Listserv and other email list software tools allow thousands of users to communicate and listen in on the conversations of others.
  •  Push technology. Email is considered a push technology; the sender decides what shows up in the receiver's inbox without any action on the receiver's end. This is great when trying to get someone's attention, but is also the reason so much unwanted email spam gets sent around.
  •  Threaded conversation. Email messages are often organized into a threaded pattern consisting of messages, replies to messages, replies to replies, and so forth. This pattern mimics the natural turn-taking of spoken conversations, albeit with less frequent turnover. Threads also extend the structure of spoken interaction by enabling the creation of multiple parallel lines of conversation. If threaded properly, related messages are grouped together into a single linked collection of related messages within their context.

Services such as Usenet and discussion forums, also known as bulletin boards or web boards, share many of these characteristics, making them close cousins of email lists (often called Listservs). We discuss these in detail in Chapter 10.

9.3 Email networks

In a standard email network, vertices represent email addresses or corresponding people. Edges or ties are created when a message is sent from one email address to another. Edges are directed because messages are transferred from a sender to a receiver. These ties are weighted by the number of messages sent between two individuals. Table 9.1 shows a summary of the information found in seven messages pulled from Derek's personal email collection. These relationships are converted into an “edge list” and visually represented in Figure 9.1.

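Table 9.1

An email network edge list.

| From  | To        | Cc   | Subject          |
|-------|-----------|------|------------------|
| Derek | Ben       |      | HCIL Brownbag    |
| Derek | Marc, Ben |      | Travel Plans     |
| Derek | Marc      | Anna | Registration     |
| Marc  | Derek     |      | Re: Travel Plans |
| Carol | Derek     |      | Tuesday meeting  |
| Marc  | Derek     | Anna | Re: Registration |
| Marc  | Derek     |      | Next Steps       |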

This email network edge list contains seven messages from Derek's personal email collection. Counting both To and Cc recipients, the messages produce ten sender-receiver pairs, which collapse into six unique edges among five vertices.

Figure 9.1 A simple email network visualized in NodeXL. Arrows point from the sender to the receiver(s). Edge thickness (i.e., Edge Weight) ranges from 2 to 4 and is based on number of messages exchanged. Edge opacity is set to 70. Vertex size (3 to 40) is based on Out-Degree or number of messages sent.

Notice that the six rows in the Edges tab are a re-representation of the seven individual email messages shown in Table 9.1. The new edges are helpful for understanding sender-receiver relationships that are not as obvious when seen in the form of a standard list of email messages. All people in the To and the Cc email fields are counted as receivers when tallying up the Edge Weight. For example, two messages were sent from Derek to Ben; the first was sent only to Ben and the second was also sent to Marc. Derek's message to Marc copied in Anna, so an edge is created between Derek and Anna. The power of this representation is that tens of thousands of email messages among a group of people can be captured in just a few hundred rows. Alternate ways of handling the data are discussed next.
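The conversion from messages to weighted edges can be sketched in a few lines of Python. This is only an illustration of the counting rule described above, not NodeXL's own implementation; the function name build_edge_list and the tuple layout are assumptions made for the example.

```python
from collections import Counter

# The seven messages from Table 9.1 as (sender, To recipients, Cc recipients).
messages = [
    ("Derek", ["Ben"], []),
    ("Derek", ["Marc", "Ben"], []),
    ("Derek", ["Marc"], ["Anna"]),
    ("Marc", ["Derek"], []),
    ("Carol", ["Derek"], []),
    ("Marc", ["Derek"], ["Anna"]),
    ("Marc", ["Derek"], []),
]

def build_edge_list(msgs):
    """Count one directed (sender, receiver) tie per recipient of each message."""
    weights = Counter()
    for sender, to, cc in msgs:
        for receiver in to + cc:  # To and Cc recipients both count as receivers
            weights[(sender, receiver)] += 1
    return weights

edges = build_edge_list(messages)
for (sender, receiver), weight in sorted(edges.items()):
    print(f"{sender} -> {receiver}: {weight}")
```

The seven messages collapse into six directed edges; for instance, Derek to Ben has a weight of 2 and Marc to Derek has a weight of 3.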

A standard email network can be aggregated to create networks that show the connections between different social groupings. For example, vertices can represent company work groups, organizational departments, local branches, or entire organizations. Edges can represent the aggregate number of messages sent between people associated with different groups (i.e., vertices). For example, a directed edge pointing from the marketing department to the development department with a weight of 100 would suggest that marketing employees sent 100 messages to development employees. The use of organization elements in host names (e.g., @umd.edu versus @cs.umd.edu) as part of email addresses can facilitate this type of analysis by identifying people from different departments, although the frequent use of web mail (e.g., @gmail.com) makes this technique problematic for studying broader populations. Alternatively, edges may represent the number of unique individuals who have sent emails from one department to another. For example, in our prior scenario there may have only been five people that sent the 100 messages, resulting in an Edge Weight of 5. A graph based on these networks provides an overview of the departmental relationships within an organization, highlighting the most connected departments and the most socially isolated ones. Section 9.8 provides an example of a summarized network showing connections between workgroups within a large technology company.
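As a rough sketch of this aggregation, the Python below rolls person-level edges up to group-level edges, using the host name of each address as the group. The addresses, the counts, and the choice to skip within-group messages are all hypothetical assumptions for illustration. It computes both weighting schemes: total messages and number of unique senders.

```python
from collections import defaultdict

# Hypothetical person-level edges: (sender address, receiver address, message count).
person_edges = [
    ("ann@marketing.example.com", "bob@dev.example.com", 60),
    ("cat@marketing.example.com", "bob@dev.example.com", 30),
    ("cat@marketing.example.com", "dan@dev.example.com", 10),
    ("bob@dev.example.com", "ann@marketing.example.com", 5),
]

def group_of(address):
    """Use the host name after '@' as the group label."""
    return address.split("@", 1)[1].lower()

def aggregate(edges):
    """Return two group-level edge lists: by message total and by unique senders."""
    message_totals = defaultdict(int)
    unique_senders = defaultdict(set)
    for sender, receiver, count in edges:
        pair = (group_of(sender), group_of(receiver))
        if pair[0] == pair[1]:
            continue  # assumption: ignore messages that stay inside one group
        message_totals[pair] += count
        unique_senders[pair].add(sender)
    sender_counts = {pair: len(people) for pair, people in unique_senders.items()}
    return dict(message_totals), sender_counts

by_messages, by_senders = aggregate(person_edges)
```

Here the marketing-to-development edge has a weight of 100 when counting messages but only 2 when counting unique senders, which shows how much the choice of weighting can change the picture.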

9.4 What questions can be answered by analyzing email networks?

Email messages can be analyzed as part of a larger corpus. Table 9.2 shows the three main types of email collections (personal, organizational, and community), each of which may be analyzed by a current participant or an outside observer.6 Personal email collections include messages sent or received by an individual. Organizational email collections include messages sent and received by members of an organization. More generally, they are the aggregate of several individuals' personal email collections. Community email collections include messages sent to an email list address that get forwarded to a group of subscribed members. Email lists may be public, where anyone can participate and view prior messages, semi-public, where anyone who registers can participate and see the archive, or private, where only invited or approved members can participate and view prior messages.

Table 9.2

Types of analysis for email collections with different scales and observers.

|                     | Personal                                   | Organizational                                   | Community                                                                                        |
|---------------------|--------------------------------------------|--------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Current Participant | Region A: Analyzing your own email         | Region B: Analyzing your organization's email    | Region C: Analyzing ongoing conversations in a community email list in which you participate     |
| Outside Observer    | Region D: Analyzing another person's email | Region E: Analyzing another organization's email | Region F: Analyzing a community email list archive in which you do not participate               |

The goals and process of the analysis are different for each of the regions specified in Table 9.2. Outside observers such as lawyers, historians, and researchers analyze email collections for historical, research, or legal reasons. In contrast, current participants such as managers, community administrators, list owners, and members analyze email collections to help inform decisions. Outside observers can benefit considerably from overviews that provide context before delving into specifics. In contrast, current participants typically understand the overall context and can delve into specifics quickly, although they may be biased in their perceptions. There are fewer privacy concerns when analyzing one's own email (Region A) or public community email lists (many communities in Regions C and F) than when analyzing organizational email (Regions B and E) or another person's email archive (Region D).

We examine personal and organizational email collections in this chapter and discuss community collections in Chapter 10, since they are similar to other community-based threaded conversation tools like discussion forums. We discuss preparing, cleaning, and importing email data in this chapter.

9.4.1 Personal email network questions

Several questions can be asked about personal email network datasets:

  •  Individuals. Who are important individuals within the network? For example, who are boundary spanners who link across clusters of contacts? Who is contacted most often? Who are the most active discussants of a particular topic? Who are unwanted or troublesome correspondents? Who are unresponsive recipients?
  •  Groups. What natural subgroups exist? What collaborative activities are individuals engaged in? What are the relationships between subgroups?
  •  Temporal comparisons. How have relationships changed over time? How did an event (e.g., move to a new location) affect the network? What inactive groups exist with whom I may benefit from re-establishing contact? What projects or people have I neglected?
  •  Structural patterns. Are there common social roles that occur among contacts (e.g., informant, decision maker, boundary spanner)? Are there types of subgroups that occur (e.g., cliques, fans)?

9.4.2 Organizational email network questions

Several different questions can be asked about organizational email network datasets:

  •  Individuals. Who are the important individuals within an organization? For example, who are the boundary spanners who link across organizational silos? Who are the influencers or topical experts? Who is not well connected and could benefit from more social ties? Who would be a good replacement for an individual? Who fills a unique niche? Who was in-the-know about an important decision?
  •  Groups. How do email-based groupings differ from organizational structures? How is the “org-chart” different from the chart of the flow of email? How are groups interconnected? Which groups should be better connected? Is there a core competency that is not discussed in a particular branch or office?
  •  Temporal comparisons. How does information flow through the organization? How do connections among individuals and subgroups evolve over time? How are social relations affected by a major event such as a merger or opening of a new office?
  •  Structural patterns. What network properties are related to success? Can we identify up-and-coming stars or unique social roles based on their network structure? How is information on a particular topic distributed throughout the organization?

9.5 Working with email data

From a user's perspective the components of an email message are relatively simple. The email header includes the From, To, Cc, Bcc, Date, and Subject fields. The email body includes the message content and any attachments. Despite this apparent simplicity, there can be a great deal of hidden complexity. A full treatment of email protocols and formats is beyond the scope of this book. Instead, we list a few key terms and facts that can serve as starting points for those needing to learn more before accessing and analyzing email collections:

  •  Email is transmitted through the Internet via the Simple Mail Transfer Protocol (SMTP).
  •  Email uses the Multipurpose Internet Mail Extensions (MIME) format to allow character sets other than ASCII and non-text attachments to be included and transported via email.
  •  Email client applications such as Microsoft Outlook or Apple Mail retrieve or cache messages from a mail server using Post Office Protocol (POP) or Internet Message Access Protocol (IMAP). Corporate email is typically retrieved through proprietary protocols specific to Microsoft Exchange Servers or competitors.
  •  Email messages are stored in a variety of formats for different email clients. Some email clients store each message as a separate file; others save them in a database format. Some common formats include .eml (Microsoft Outlook, Mozilla Thunderbird), .emlx (Apple Mail), .msg and .pst (Microsoft Outlook or Microsoft Exchange), and .mbox (Mozilla Thunderbird, Gmail backup files, and many email list archives).
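For readers who want to inspect these header fields programmatically, Python's standard library email package can parse a raw message. The sketch below uses a made-up message with placeholder example.com addresses; the helper name addresses is an assumption for the example.

```python
from email import message_from_string, policy
from email.utils import getaddresses

# A made-up message with the header fields discussed above.
raw_message = """\
From: Derek <derek@example.com>
To: Marc <marc@example.com>, ben@example.com
Cc: Anna <anna@example.com>
Subject: Travel Plans
Date: Mon, 05 Mar 2018 09:30:00 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

See you at the conference.
"""

msg = message_from_string(raw_message, policy=policy.default)

def addresses(message, field):
    """Return the lower-cased email addresses listed in one header field."""
    values = [str(v) for v in message.get_all(field, [])]
    return [addr.lower() for _name, addr in getaddresses(values) if addr]

sender = addresses(msg, "From")[0]
receivers = addresses(msg, "To") + addresses(msg, "Cc")
sent = msg["Date"].datetime          # parsed into a datetime by the default policy
body = msg.get_content().strip()     # decoded text of a single-part message
```

The sender and receivers extracted this way are the raw material for an edge list; display names are discarded so that only addresses are compared.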

Working with email poses technical challenges that often require preprocessing data to create useful results. The large potential size of email networks can be problematic and may require specialized programs to manage large data volumes. In practice, email will likely need to be filtered before analysis to reduce the dataset based on time ranges, people, and topics of interest. Another major challenge is the use of multiple email addresses for the same individual. In most cases, analysts are interested in social relationships between individuals, not the relationships between email accounts. The problem of combining different aliases (email addresses) for the same entity (person) is called “entity resolution,” “identity resolution,” “deduplication,” or “record linkage.” A range of tools, such as Marketo or the open source Python library Dedupe, provide deduplication services. Another set of tools extracts entities (e.g., names or places mentioned in email messages), which can be used to create networks that consider personal names or places mentioned in email texts rather than the sender and receiver of a message. Searching for tools that perform “named-entity recognition,” “entity identification,” “entity extraction,” and “entity chunking” reveals tools such as the spaCy Python library, Stanford NER, and commercial APIs such as Lexalytics, TextRazor, ParallelDots, and Aylien.

9.5.1 Preparing email

Most email clients do not export data in a format amenable to network analysis. Furthermore, the email you'd like to analyze may be stored in different formats and reside on different computers or web mail servers. As a result, you may need to prepare your email before it is ready to import into network analysis tools such as NodeXL.

The easiest way to transform email messages into network relationships (i.e., an edge list) is to use NodeXL's Import from Email Network feature. This feature relies on the Windows built-in indexing functionality on recent versions of Windows (e.g., Windows 10). By default, email files in certain formats will be indexed by Windows. You can view and change which filetypes are indexed and check indexing progress in the Indexing Options dialog accessible via the Control Panel.

You may not have the email you want to analyze on a local or shared machine. For example, you may exclusively rely on a web mail service such as Gmail or Hotmail. Nearly all web mail services allow you to download local copies of your messages via POP or IMAP to an email client such as Thunderbird or Outlook, or create an archive of the files for backup. However, in some cases you may need to purchase backup software to export into a file that can be indexed. If you are using IMAP, make sure to download the complete email message files, not just the header information. Otherwise the Windows indexing service will not be able to index the content of the messages, and you will not be able to import them using NodeXL (as described later). You can typically choose not to download attachments if there are space limitations. If you use IMAP you can also restrict the download by folder. For example, you may want to only download recent messages (e.g., those sent in 2018) rather than years of data. After downloading messages, it may take Windows some time to index all of the files. If you have subscribed to an email list and retained all of the messages you want to analyze, you can place them in a folder and use IMAP to download just those messages.

Advanced topic

Working with large email collections

You may want to create networks based on email archives that are not in a format that Windows understands. For example, mbox and maildir are common formats found in Linux and Apple system mail clients. Maildir stores one text file per message in a directory hierarchy that matches the user's mail client, whereas mbox stores all messages in a single file.

One strategy for dealing with this issue is to use a specialized program like Aid4Mail that can aggregate email stored in multiple devices or formats, perform and store advanced searches, and export emails into a range of formats. For example, you can use Aid4Mail to open email list archive files (in .mbox format) and convert them to a format such as “eml” files that can be indexed by Windows. The tool can handle hundreds of thousands of emails with reasonable performance on a standard machine.
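For small archives, a similar mbox-to-eml conversion can be approximated with Python's standard mailbox module. The sketch below is not a substitute for a dedicated tool and makes no claim about how Aid4Mail works; the function name mbox_to_eml, the file naming scheme, and the two-message sample archive are assumptions for illustration.

```python
import mailbox
import tempfile
from pathlib import Path

def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an mbox file to its own numbered .eml file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(mbox_path), create=False)
    written = []
    try:
        for index, message in enumerate(box):
            target = out_dir / f"message_{index:06d}.eml"
            target.write_bytes(message.as_bytes())  # headers and body, no mbox "From " line
            written.append(target)
    finally:
        box.close()
    return written

# Demonstration on a tiny throwaway archive with two made-up messages.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "list-archive.mbox"
sample.write_text(
    "From sender@example.com Mon Mar  5 09:30:00 2018\n"
    "From: sender@example.com\n"
    "To: list@example.com\n"
    "Subject: [air-l] Hello\n"
    "\n"
    "First message.\n"
    "\n"
    "From other@example.com Tue Mar  6 10:00:00 2018\n"
    "From: other@example.com\n"
    "To: list@example.com\n"
    "Subject: Re: [air-l] Hello\n"
    "\n"
    "Second message.\n"
)
eml_files = mbox_to_eml(sample, workdir / "eml")
```

Each resulting file holds a single message, which is the one-message-per-file layout that the indexing approach described earlier expects.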

Another strategy is to create a database of the email messages that can be queried in multiple ways. This allows you to apply language processing and text mining approaches not available in the NodeXL import wizard. Some email programs like Aid4Mail will create a database for you. Alternatively, you can convert emails into XML and then use Excel's built-in XML maps feature to populate the Excel fields based on the XML database content.
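To make the database strategy concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The two-table schema, the column names, and the sample rows are assumptions chosen for the example rather than a format produced by any particular tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to keep the database
conn.executescript("""
CREATE TABLE messages (
    id      INTEGER PRIMARY KEY,
    sender  TEXT NOT NULL,
    sent_on TEXT NOT NULL,   -- ISO 8601 date, so text comparison sorts correctly
    subject TEXT,
    body    TEXT
);
CREATE TABLE recipients (
    message_id INTEGER REFERENCES messages(id),
    address    TEXT NOT NULL,
    field      TEXT CHECK (field IN ('To', 'Cc', 'Bcc'))
);
""")
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", [
    (1, "derek@example.com", "2018-03-05", "Travel Plans", "Flights are booked."),
    (2, "marc@example.com", "2018-03-06", "Re: Travel Plans", "Thanks!"),
    (3, "derek@example.com", "2017-11-02", "Registration", "Please register."),
])
conn.executemany("INSERT INTO recipients VALUES (?, ?, ?)", [
    (1, "marc@example.com", "To"), (1, "ben@example.com", "To"),
    (2, "derek@example.com", "To"),
    (3, "marc@example.com", "To"), (3, "anna@example.com", "Cc"),
])

# One query yields a weighted edge list restricted to messages sent in 2018.
rows = conn.execute("""
    SELECT m.sender, r.address, COUNT(*) AS weight
    FROM messages AS m
    JOIN recipients AS r ON r.message_id = m.id
    WHERE m.sent_on BETWEEN '2018-01-01' AND '2018-12-31'
    GROUP BY m.sender, r.address
    ORDER BY m.sender, r.address
""").fetchall()
```

Keeping recipients in their own table means the same data can be re-queried with different date ranges, recipient fields, or keywords without re-importing anything.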

9.5.2 Importing email networks into NodeXL

Once Windows has indexed the email you want to analyze, you are ready to import the data directly into NodeXL. Select the From Email Network option from the Import drop-down on the NodeXL ribbon to open the importer shown in Figure 9.2.

Figure 9.2 NodeXL Import from Email Network dialog filtered to only include messages with the term “cybersecurity” sent between 1/1/2018 and 11/30/2018.

The enormous size of many email collections often requires filtering out messages before analysis. Even when email collections are of manageable size, filtering messages can home in on a specific subset of messages ideally suited for addressing a question of interest. There are several ways of filtering:

  •  Filter based on time. Include all messages sent and received during a specific time period. In NodeXL, the Date Range fields allow you to specify a time window in which messages must appear in order to be included in results. Filtering based on time can be used to slice data into networks to facilitate comparison over time or measure the impact of an important event. In Figure 9.2, only messages sent between 1/1/2018 and 11/30/2018 are included.
  •  Filter based on sender and receiver(s). Include only messages sent or received by certain people. These people may be part of a group (e.g., department, workgroup), share some characteristic in common (e.g., senior managers, located in Maryland), or have a certain relationship to another person (e.g., all those who have received a Bcc message from John). In NodeXL it is possible to specify email addresses to be found in the From, To, Cc, and Bcc fields. The default setting is a Boolean OR relationship, so if you include multiple addresses it will pull all messages with any of the addresses included. It is also possible in NodeXL to restrict messages to those that include (or don't include) any addresses in the Cc or Bcc fields through the checkboxes available on the right-hand side of Figure 9.2. Filtering based on the sender and receiver(s) is ideal for focusing in on a subgroup of important people for further analysis.
  •  Filter based on content. Include only messages that share some characteristic or content of interest. In NodeXL, messages with or without attachments, messages within a certain size range, and messages or subjects with specified text can be filtered. The text search feature can be powerful when combined with standard naming conventions. For example, all messages from the Association of Internet Researchers' email list can be selected by searching for the text string “[air-l]” in the Subject search box. In the example shown in Figure 9.2, only messages with the word “cybersecurity” in the message body are included. This was chosen because a new Cybersecurity major was started at BYU in 2018, allowing a more focused analysis of this topic.
  •  Filter based on folders and labels. Include only messages that are found in a specified folder or messages with a certain label (e.g., in Gmail). People often organize or label (i.e., tag) email into meaningful collections based on subject matter or projects. In NodeXL, messages can be restricted to those found within a certain folder. Filtering based on folders is ideal for capturing interactions about a specific project or topic that may not be identifiable from a simple keyword. The “Sample folders” link shown in Figure 9.2 shows examples of pathnames that can be entered. If you do not know the name of your email account, you can open up Outlook, right-click on the folder you are interested in, and choose Properties. For example, if I were to restrict my search to all messages in the Sent Items folder, I would type in /[email protected]/Sent Items.
  •  Filter based on a combination of features. The various filtering options can be combined in intricate ways to find, for example, messages from a subset of key people sent during an important time period with a certain keyword. If more advanced filtering is needed, you can use an advanced email management tool like Aid4Mail to create a folder of desired messages and restrict the import to that folder. This enables the use of advanced search queries (e.g., regular expressions) for identifying messages (see Advanced topic: Working with large email collections).
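The filtering options above can be mimicked outside NodeXL when messages are already in memory. The sketch below is an assumption-laden illustration, not the importer's logic: the Message class, the keep function, and the sample inbox are all hypothetical. It combines time, participant, and content filters, with the participant filter using the same Boolean OR behavior described above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Message:
    sender: str
    to: list
    sent_on: date
    subject: str
    body: str
    cc: list = field(default_factory=list)

def keep(msg, start=None, end=None, people=None, subject_term=None, body_term=None):
    """Return True when a message passes every filter that was supplied."""
    if start is not None and msg.sent_on < start:
        return False
    if end is not None and msg.sent_on > end:
        return False
    if people:
        participants = {msg.sender, *msg.to, *msg.cc}
        # Boolean OR: any one of the listed addresses is enough.
        if participants.isdisjoint(people):
            return False
    if subject_term and subject_term.lower() not in msg.subject.lower():
        return False
    if body_term and body_term.lower() not in msg.body.lower():
        return False
    return True

inbox = [
    Message("derek@example.com", ["marc@example.com"], date(2018, 3, 5),
            "New major", "The cybersecurity major starts this fall."),
    Message("marc@example.com", ["derek@example.com"], date(2017, 12, 20),
            "Planning", "Cybersecurity curriculum draft attached."),
    Message("carol@example.com", ["derek@example.com"], date(2018, 6, 1),
            "[air-l] Call for papers", "Submissions are open."),
]

# Same combination as Figure 9.2: a date window plus a body keyword.
selected = [m for m in inbox
            if keep(m, start=date(2018, 1, 1), end=date(2018, 11, 30),
                    body_term="cybersecurity")]
```

Only the first message survives this combined filter: the second mentions the keyword but falls outside the date window, and the third is in range but lacks the keyword.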

In addition to filtering the messages included in the social network dataset, NodeXL lets you specify how the edge weight is calculated: based only on addresses in the To field, or also including those in the Cc or Bcc fields. This setting is independent of filtering. By default, only those addresses in the To field are counted. In the example displayed in Figure 9.2, the Cc field is included when calculating edge weights, but not the Bcc field.
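The effect of this setting can be illustrated with a small Python sketch. The function edge_weights and its include_cc and include_bcc flags are hypothetical names that only mirror the idea of the option; they are not part of NodeXL.

```python
from collections import Counter

def edge_weights(messages, include_cc=False, include_bcc=False):
    """Tally directed edge weights, counting To always and Cc/Bcc only on request."""
    weights = Counter()
    for msg in messages:
        receivers = list(msg.get("to", []))
        if include_cc:
            receivers += msg.get("cc", [])
        if include_bcc:
            receivers += msg.get("bcc", [])
        for receiver in receivers:
            weights[(msg["from"], receiver)] += 1
    return weights

# Two made-up messages from the same sender.
messages = [
    {"from": "derek", "to": ["marc"], "cc": ["anna"], "bcc": ["ben"]},
    {"from": "derek", "to": ["marc"]},
]

to_only = edge_weights(messages)                   # the default described above
with_cc = edge_weights(messages, include_cc=True)  # the setting used in Figure 9.2
```

With the default, Anna and Ben do not appear in the network at all; including Cc adds the Derek-to-Anna edge while Ben remains hidden unless Bcc is also included.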

9.6 Cleaning email data in NodeXL

After importing email data into NodeXL, you will likely need to clean it to remove duplicate email addresses for the same individuals, as well as self-referring loops created when people reply to their own messages.

9.6.1 Remove duplicate email addresses for the same individual

If your focus is on connections between people, as opposed to specific email accounts, you will want to combine multiple email accounts from the same person into a single one. Unless you are using an advanced entity resolution software program to do this, this is likely to be a somewhat manual process.

The simplest approach is to use the Find and Replace tool familiar to most Excel and Word users. Start by choosing Show Graph and then navigate to the Vertices worksheet. Sort the Vertex column from A to Z so that email addresses that start with the same name will be next to one another (e.g., [email protected] and [email protected]). Then press Control + F to open the Find and Replace window and enter the appropriate email addresses (Figure 9.3 presents an example). Navigate to the Edges worksheet and choose Replace All for data in the Vertex1 and Vertex2 columns. There will likely be duplicates that don't start with the same username (e.g., [email protected] is my work email address, while [email protected] is my personal email address). To find the important people based on the frequency of interaction, you can sort columns by edge weight and make sure that those with a high edge weight are not duplicates. The most important duplicates to remove are your own.

Figure 9.3 Excel Find and Replace dialog used to help combine different email addresses in the NodeXL Vertices and Edges worksheets.
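The sorting step used to spot addresses that start with the same name can also be approximated in code. The sketch below groups addresses by the part before the “@” sign and reports groups with more than one member. It only flags candidates for manual review under the assumption that a shared username suggests the same person, and all of the addresses shown are made up.

```python
from collections import defaultdict

def duplicate_candidates(addresses):
    """Group addresses by their lower-cased username and keep groups of two or more."""
    groups = defaultdict(list)
    for address in addresses:
        local_part = address.split("@", 1)[0].lower()
        groups[local_part].append(address)
    return {name: sorted(found) for name, found in groups.items() if len(found) > 1}

vertices = [
    "jsmith@work.example.com",
    "jsmith@mail.example.net",
    "anna@example.org",
    "JSmith@example.edu",
]
candidates = duplicate_candidates(vertices)
```

As the text notes, this heuristic misses duplicates with different usernames, so high-weight vertices still deserve a manual check.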

After replacing the fields in the Edges worksheet, you should delete all of the rows that have data in the Vertices worksheet. Then click Show Graph, which will generate a new list of Vertices on the Vertices worksheet. If you fail to do this, you will have duplicates in the Vertices worksheet, which can cause problems later.

The problem with using Find and Replace is that it must be repeated each time the data is re-imported or updated, even if the email addresses are the same. There is also no trace of the changes once they are made, making it hard to audit mistakes. A more time-intensive, but more careful, approach is to use a lookup table as described in the Advanced topic: Performing the lookup table strategy to count and merge duplicate email addresses.

Advanced topic

Performing the lookup table strategy to count and merge duplicate email addresses

Once you have imported email data into the Edges worksheet and chosen Show Graph, you can navigate to the Vertices worksheet where there is now a list of all of the unique email addresses. Copy that list to a new worksheet and title the column Original_Addresses. Create a new column next to it called New_Addresses. When you find duplicate addresses for the same person in the Original_Addresses column, repeat the desired address in the New_Addresses column. An example is provided in the Lookup_Addresses table shown in Figure 9.4.

Figure 9.4
Figure 9.4 A lookup table and the Excel =VLOOKUP() formula are used to combine multiple email addresses associated with the same person in NodeXL.

You can now use a VLOOKUP function to look up the New_Addresses that correspond with the Original_Addresses to create a new edge list with no duplicate email addresses. Copy the original edge list from the Edges worksheet (columns D and E) and create two new columns for a new edge list (columns F and G). The New Edge List columns will be automatically populated by the results of a VLOOKUP function. Figure 9.4 shows an example in cell F9. The VLOOKUP function looks for the original Vertex1 address found in cell D9 ([email protected] shown in the blue outlined rectangle) within the first column of the Lookup_Addresses table (cells $A$3:$B$7 shown in a red outlined rectangle). In this example, it finds an exact match in cell A7. It then returns the value of the second column in the Lookup_Addresses table (because the number 2 was entered into the VLOOKUP formula), which is [email protected] found in cell B7. The FALSE in the VLOOKUP formula specifies that the value that is looked up must exactly match a value in the first column of the Lookup_Addresses table. An error is returned if an exact match is not found. Once the New Edge List is created, you can copy the new Vertex1 and Vertex2 columns (F and G) and use Excel's Paste Special feature to paste their values into a new workbook. You should also copy and paste the Edge Weights column from the original file after ensuring that the Vertex1 and Vertex2 columns are in exactly the same order as the original data.

VLOOKUP functions can use considerable computational resources when working with large files, so once they are calculated you may want to copy them, choose Paste Special, and select Values so they do not need to be recalculated each time changes to the workbook are made.
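For readers who prefer to script this step, the lookup table strategy can be sketched in Python. This is only an illustration under stated assumptions: the addresses, the LOOKUP dictionary, and the function names are hypothetical and are not part of NodeXL. Unlike a VLOOKUP with FALSE, which returns an error when no exact match is found, this sketch passes unmatched addresses through unchanged, so the lookup table needs to list only the addresses that should be replaced.

```python
# Hypothetical lookup table: original address -> preferred address.
# All addresses are made-up placeholders, not data from the chapter.
LOOKUP = {
    "jane.doe@work.example": "jane@home.example",
    "jdoe@work.example": "jane@home.example",
}


def canonical(address, lookup):
    """Return the preferred address, or the cleaned address if unmapped."""
    key = address.strip().lower()  # match case-insensitively, as VLOOKUP does
    return lookup.get(key, key)


def remap_edges(edges, lookup):
    """Apply the lookup to both endpoints of (vertex1, vertex2, weight) rows.

    Row order is preserved, so edge weights stay aligned with their rows.
    """
    return [(canonical(v1, lookup), canonical(v2, lookup), w)
            for v1, v2, w in edges]


edges = [
    ("jane.doe@work.example", "bob@corp.example", 3),
    ("bob@corp.example", "JDoe@work.example", 1),
    ("jdoe@work.example", "jane@home.example", 2),
]
new_edges = remap_edges(edges, LOOKUP)
```

Because the mapping lives in one table, it can be reapplied whenever the data is re-imported, and the table itself serves as a record of the changes made. Note that the third row becomes a self-loop once both of Jane's addresses are combined.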

9.6.2 Count and merge duplicate edges

Once you have updated the email addresses in the Edges worksheet so that different addresses for the same person are replaced with a single address, you will likely have duplicate edges (i.e., more than one row with the same values in the Vertex1 and Vertex2 columns). It can be useful to “roll up” these duplicate edges, replacing multiple connections between a pair of email addresses with a single edge. The rolled up edge has a weight equal to the total number of exchanged messages found in the data. It is important to roll up the data so that network metrics can be accurately calculated, as some of them assume that edges connecting any pair of vertices are unique. To prepare your email network for analysis, you can roll up repeated email messages from the same pair of people using the Count and Merge Duplicate Edges feature in the Prepare Data section of the NodeXL Ribbon. This will merge the duplicate edges and sum up the Edge Weights so the total Edge Weight remains the same. Before doing this, make sure the network type is set to Directed, or else the merge will remove the directed nature of the graph. Figure 9.5 shows the New Edge List from Figure 9.4 in Columns A and B and a merged version of it in Columns E and F, shown here to illustrate the results of the Count and Merge Duplicate Edges feature. Notice that the total of the Edge Weight column is the same.

Figure 9.5
Figure 9.5 Effects of NodeXL's Count and Merge Duplicate Edges feature after combining duplicate addresses. A self-loop is shown in red.
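The roll-up logic can be illustrated with a short Python sketch. It is not NodeXL's implementation; the function name and sample rows are hypothetical. It shows why the Directed setting matters: in a directed merge, (a, b) and (b, a) remain separate edges, while an undirected merge combines them.

```python
from collections import defaultdict


def merge_duplicate_edges(edges, directed=True):
    """Sum the weights of rows that share the same pair of vertices."""
    totals = defaultdict(int)
    for v1, v2, weight in edges:
        # In an undirected merge the order of the endpoints is ignored.
        key = (v1, v2) if directed else tuple(sorted((v1, v2)))
        totals[key] += weight
    return [(v1, v2, weight) for (v1, v2), weight in totals.items()]


edges = [("a", "b", 3), ("a", "b", 1), ("b", "a", 2), ("a", "a", 2)]
merged_directed = merge_duplicate_edges(edges, directed=True)
merged_undirected = merge_duplicate_edges(edges, directed=False)

# The total edge weight should be unchanged by the merge.
total_before = sum(w for _, _, w in edges)
total_after = sum(w for _, _, w in merged_directed)
```

Comparing total_before and total_after mirrors the check recommended above: if the two sums differ, data was lost during the merge.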

Advanced topic

Automatically identifying self-loops

To easily identify self-loops, you can create a new column on the Edges worksheet called Self-Loop and populate it with the function =Edges[[#This Row],[Vertex 1]]=Edges[[#This Row],[Vertex 2]]. If the Vertex1 and Vertex2 columns are the same for a given row, a TRUE will be returned. Otherwise a FALSE will be returned. Once calculated, you can sort on the column to find the self-loops. If desired, you can delete them or choose to skip them using the Visibility column.

Sometimes people send email messages to themselves as a reminder, as a To Do list, or to share a file between computers. This results in a row with the same address in the Vertex1 and Vertex2 columns on the Edges worksheet and is called a self-loop. The use of multiple email addresses and the removal of duplicate addresses can also cause self-loops. Row 9 of Figure 9.4 is an example. For many analyses these self-loops are not important and can be distracting when visualizing data or calculating network metrics. You may want to remove self-loop edges such as the red pair in Figure 9.5 as an additional step after you have counted and merged duplicate edges. See Advanced topic: Automatically identifying self-loops.
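As a minimal sketch of the same test outside Excel, the Python below separates self-loops from the remaining edges. The function name and sample rows are hypothetical.

```python
def split_self_loops(edges):
    """Separate rows whose two endpoints are identical from all other rows."""
    loops = [edge for edge in edges if edge[0] == edge[1]]
    others = [edge for edge in edges if edge[0] != edge[1]]
    return loops, others


edges = [("a", "b", 4), ("b", "a", 2), ("a", "a", 2)]
self_loops, cleaned_edges = split_self_loops(edges)
```

Keeping the self-loops in a separate list, rather than discarding them, leaves the option of reporting how many messages each person sent to themselves.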

9.7 Analyzing personal email networks

This section presents two projects that serve as examples of how to analyze personal email collections. They are both based on the following scenario.

Scenario: You have a new employee coming to work with you next week whom you will supervise. He doesn't know you well and is new to the organization. To help him smoothly transition into his new job, you want to provide him with an overview of the people you work with and their relationships to each other. You decide to create two network visualizations, one that provides an overview of all of your contacts and another that provides more detail on the workgroup that he will work with most closely.

In the following examples, the new employee is a new faculty member coming to work with Derek Hansen, Associate Professor at Brigham Young University's IT and Cybersecurity program. The faculty member will be focused on the area of cybersecurity. For privacy reasons, the email networks analyzed in this section are not made public. You are encouraged to analyze your own email data with a similar scenario in mind. This section assumes that Windows has already indexed your emails as described in prior sections.

9.7.1 Creating an email overview visualization

Step 1: Import data into NodeXL

Import all of your email sent within the past month. Although some people you know may not have contacted you in the prior month, this time period lets you collect a broad set of your active email contacts. Use the Import From Email Network feature in the Data menu and filter based on your chosen date range (e.g., 11/1/2018 to 11/30/2018). Check the Use Cc line when calculating edge weights box to be more inclusive. For Derek's dataset, a total of 1977 edges are created with a total edge weight of 7572.

Step 2: Clean data

Next, combine email addresses as described in the Advanced topic: Performing the lookup table strategy to count and merge duplicate email addresses, and run the Count and Merge Duplicate Edges function. For Derek's dataset, this collapsed the 1977 edges into 1837 (140 duplicate rows were merged). To make sure no data was lost, check that the sum of the Edge Weight column is the same as it was before the merge.

Step 3: Filter data

To more clearly focus on the key relationships, it is desirable to remove infrequent email exchanges. Sort the Edge Weight column from largest to smallest. It is likely that the values will have a skewed distribution with many connections with very low edge weights and relatively few connections with a high edge weight. Remove the least common connections by choosing a cutoff point for deletion. You may want to use the Dynamic Filters feature discussed in Chapter 7 to find an appropriate cutoff. For example, in Derek's data when all of the connections with an edge weight of < 5 are removed, the key individuals are still retained and the total number of edges is reduced to 304 edges. You can manually delete the rows, in which case your data will be a more manageable size, but will lose some data that may be needed in your later analysis. For example, if you calculate the total number of messages8 a person sends (see Advanced topic: Calculating total sent and received edges), the number would be incomplete if all of the infrequent connections were deleted. Alternatively, you can use the Autofill Columns feature to Skip edges with an edge weight that falls below the cutoff point. This will keep the data in the workbook, but not use it in the display of graphs or the calculation of network metrics (discussed in Chapter 6).
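The trade-off between deleting and skipping low-weight edges can be sketched in Python. This is an illustration only: the function names and sample data are hypothetical, and the "Show" and "Skip" labels simply mimic the values of the Visibility column. The first function counts how many edges would survive each candidate cutoff, and the second marks edges without deleting them.

```python
def surviving_edge_counts(edges, cutoffs):
    """For each candidate cutoff, count edges whose weight meets it."""
    return {cutoff: sum(1 for _, _, w in edges if w >= cutoff)
            for cutoff in cutoffs}


def mark_visibility(edges, cutoff):
    """Keep every row, but label rows below the cutoff so they are skipped."""
    return [(v1, v2, w, "Show" if w >= cutoff else "Skip")
            for v1, v2, w in edges]


edges = [("a", "b", 12), ("b", "c", 5), ("c", "a", 4), ("a", "d", 1)]
counts = surviving_edge_counts(edges, cutoffs=[2, 5, 10])
marked = mark_visibility(edges, cutoff=5)
```

Because mark_visibility keeps every row, totals such as the number of messages a person sent can still be computed from the complete data.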

Advanced topic

Calculating total sent and received edges

The following formulas will aggregate the count of all sent (or received) edges for each individual in the Vertices worksheet based on the data in the Edge Weight column of the Edges worksheet:

  •  Sent7: Calculates the total number of sent messages8 for each person:

     =SUMIF(Edges[Vertex 1],Vertices[[#This Row],[Vertex]],Edges[Edge Weight])

  •  Received: Calculates the total number of received messages8 for each person:

     =SUMIF(Edges[Vertex 2],Vertices[[#This Row],[Vertex]],Edges[Edge Weight])

Additional formulas can be used to calculate the total number of messages sent or received by an individual and the percentage of messages that are sent:

  •  Total: Calculates the total number of messages sent or received for each person:

     =Vertices[[#This Row],[Sent]]+Vertices[[#This Row],[Received]]

  •  %_Sent: Calculates the percentage of a person's total messages that were sent (rather than received):

     =Vertices[[#This Row],[Sent]]/Vertices[[#This Row],[Total]]


7 If the "Use cc: line when calculating edge weights" (or use bcc: line) is chosen when importing email data, then the Sent formula gives the number of messages you sent that were received by others. In other words, if you sent one message to two people (one of whom was cc'ed in), your total sent would be 2, not 1 even though you only authored 1 message.

8 Note that by “messages” here and later in the chapter, we do not mean unique email messages (as counted by an email client). Instead, we mean unique edges. They are different because if I send a message to 2 people, 2 unique edges are created, even though only 1 unique message was drafted. Each edge can be thought of as a copy of a physical letter (i.e., message) that is sent to each person. Thus, throughout this chapter and the next chapter, when we refer to the number of “messages” received or sent, we mean copies of messages.

Step 4: Compute graph metrics and add new columns

Next, select Show Graph, which will populate the Vertices worksheet with data about each vertex and display a preliminary email social network graph. The next step in the data analysis is to compute all of the relevant graph metrics (see Chapter 6). You can insert additional columns indicating people's attributes such as their relationship to you, their location, or affiliation. You can also use formulas to calculate the total number of messages8 sent or received by an individual as is described in the Advanced topic: Calculating total sent and received edges.
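The sent and received totals can also be sketched in Python, mirroring the SUMIF approach of summing edge weights by sender and by receiver. The function name and sample edges are hypothetical, and the result keys simply reuse the column names from the Advanced topic.

```python
from collections import defaultdict


def sent_received_totals(edges):
    """Aggregate edge weights per vertex into Sent, Received, Total, %_Sent."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for v1, v2, weight in edges:
        sent[v1] += weight      # weight counted for the sender (Vertex1)
        received[v2] += weight  # weight counted for the receiver (Vertex2)

    rows = {}
    for vertex in set(sent) | set(received):
        s = sent.get(vertex, 0)
        r = received.get(vertex, 0)
        total = s + r
        rows[vertex] = {
            "Sent": s,
            "Received": r,
            "Total": total,
            "%_Sent": s / total if total else 0.0,
        }
    return rows


edges = [("a", "b", 3), ("b", "a", 1), ("a", "c", 2)]
totals = sent_received_totals(edges)
```

As in the worksheet formulas, these are counts of message copies (edges), not unique messages, and a self-loop contributes to both the Sent and Received totals of the same person.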

Step 5: Visualize the email social network

The next step is to map the metrics and new columns onto display attributes in the visualization (see Chapter 5). Many display attributes like color, transparency (“opacity”), edge width, and location can be mapped to data attributes about messages, relationships, and authors. Selecting the optimal mapping of data attributes to display attributes will inevitably require some trial and error. You may want to look at the social network graph with and without the vertex that represents your own email address by manually setting the Visibility column for the row with your email address on the Vertices worksheet to Skip. Figure 9.6 shows Derek's network after using the Harel-Koren Fast Multiscale layout and Group in a Box feature (see Chapter 7).

Figure 9.6
Figure 9.6 Derek Hansen's email network for November 2018, showing only connections with five or more messages (Edge visibility > 4). Size is based on the total Sent messages ignoring outliers (1.5 to 15). Opacity is based on total Received messages ignoring outliers (50 to 100). Edge Weight is mapped to Edge width (1 to 3 ignoring outliers) and Edge opacity (30 to 70 ignoring outliers). Vertices are grouped by Connected Component, and the Group in a Box layout with group labels is used.

Derek's email address and its connections are not shown, which makes the graph less cluttered (because the “Derek” vertex had been connected to every other vertex). Distinct groups can also be seen more clearly. However, removing Derek from the network hides information about whom he communicates with most often and the direction of his communications. To deal with this, opacity and size of vertices have been used to indicate the number of messages sent and received. Email addresses and names of most individuals have not been displayed for privacy reasons. Derek could set the tooltip to display email addresses and provide a file to the new faculty member so that he could map vertices to email addresses. Alternatively, a printed version with selected individuals that the new faculty member is likely to work with could be created.

Step 6: Understand social network visualizations and metrics data

Analysis of the graph and accompanying data helps answer many of the questions offered earlier in the chapter:

  •  Individuals. Who are important individuals within the network? The graph helps clearly identify individuals that play unique roles in Derek's social network, including those who send a large amount of email (larger vertices), those who receive a considerable amount of email (darker vertices), those who span different subgroups (e.g., Beth who bridges several related research groups), and those who are a part of a team (e.g., Ben, Marc, and Itai who are co-authors on this book). Using Excel's built-in Sort feature on the spreadsheet data is an effective way of identifying important individuals. There are also many individuals without any connections in this graph because their only frequent exchanges are with Derek directly, whose vertex is not shown. Those that are large and light colored (e.g., BYU List) are potential spammers because they send out many messages and don't receive any. In contrast, those that are small and dark may be unresponsive recipients (or people not expected to reply) since they receive many messages but rarely send them out. Large, dark circles (e.g., Justin, Brady, Rachel, Mattie) represent those who are highly engaged with Derek or other colleagues shown in the graph.
  •  Groups. What natural subgroups exist? The original graph included most of the nodes as part of a large connected component. However, filtering out edges with fewer than 5 messages revealed the distinct groups that are labeled in Figure 9.6. Each group is distinct in its own way. The Playable Case Study Research group is really a collection of related research projects that include different collaborators at Brigham Young University (those to the right of Beth) and the University of Maryland (Beth and those to her left). Seeing the relatively low density of this group suggests that there may be opportunities for the individuals to work more closely together in the future. In contrast, other groups such as the Innovation Space group, the NodeXL group, and most of the people in the Faculty and Staff group message each other frequently. For example, the seven faculty and two primary staff members in Derek's program at Brigham Young University collaborate very closely together, with a few people branching off of those with unique roles (e.g., Barry who is the School of Technology director; Rachel who is a student counselor).
  •  Temporal comparisons. This social network graph represents only a single month snapshot of email connection activity. Comparison with similar social network graphs from other time periods can show interesting patterns of change and stability. For example, new email addresses can show up linked to vertices in the Playable Case Study Research group as new members join the team, connections between individuals that are currently isolated can be formed, and the send/reply structure may shift depending on the nature of activities that are under way.
  •  Structural patterns. At a high level, this graph represents the life of a professor, one filled with numerous one-on-one relationships (only 10 of which are included here) with colleagues and students and a collection of active working groups. A comparable graph of an analyst or accountant in a large firm may share a handful of densely connected groups but likely not as many individual connections [3]. Sometimes structural patterns show up, such as bridge spanners like Beth and Mattie who receive many messages from a number of people. Cliques, such as the Innovation Space people, also emerge, where all of the people communicate with the others regularly.

9.7.2 Creating an expertise network email graph

Step 1: Import email social network data into NodeXL

In this case, we want to include only those messages that mention a particular topic. This gives us some idea of who knows about this topic and how they are related to one another. For this example, we are interested in finding individuals with whom Derek exchanges emails that refer to “Cybersecurity.” Figure 9.2 shows the Import From Email Network window set to include only messages with the text “Cybersecurity” exchanged during November 2018. This is a subset of the graph of all emails examined in the prior section.

Step 2: Clean data

Use one of the previously specified methods for joining duplicate addresses for the same person and merge duplicate edges after making sure the graph type is Directed. In Derek's dataset, the original 432 unique edges collapsed down to and 404 unique vertices (i.e., email addresses after removing duplicates).

Step 3: Compute graph metrics and add new columns

When there are relatively few connections as in this example, it is feasible to calculate the metrics and add columns before filtering the data. Calculate all of the metrics and add the same new columns to the vertices worksheet as in the prior example.

Step 4: Filter data

It is reasonable to use the Dynamic Filters to determine a good cutoff point when there are few connections (see Chapter 7). To focus in on those who communicate the most about the topic (Cybersecurity), filter out those with a low edge weight or those who send few messages. To home in on the cluster of tightly connected vertices, filter out those with a low in- and out-degree, since those who are densely clustered send and receive messages from others that are part of the group. Figure 9.7 shows Derek's Cybersecurity email network before and after the dynamic filtering.

Figure 9.7
Figure 9.7 Derek Hansen's email network for November 2018, limited to messages that include the word “cybersecurity.” All edges are shown in the left-side image, while only selected vertices are shown after dynamic filters were applied in the right-side image. In both images, size is based on the total Sent messages and opacity is based on the total Received messages. Edge width and opacity are based on Edge Weight.

Step 5: Visualize network

Figure 9.7 uses a similar mapping of data to visual properties as was used in Figure 9.6 with a few minor changes in minimum and maximum values. The key difference is the inclusion of Derek and his connections and the focus on only messages that include “cybersecurity.” Including Derek in the network adds clutter to the graph, but also adds valuable information about whom he worked with most closely during the 1-month period. For example, the thickest lines coming to and from Derek are with faculty and staff most closely associated with the new Cybersecurity program at BYU.

Step 6: Understanding the network visualization and data

The analysis of the first graph is similar to that of Figure 9.6, except that connections are solely based on messages containing the text “cybersecurity.” Thus, some important individuals from Figure 9.6 do not appear in Figure 9.7 (e.g., Marc, Itai, and Ben), whereas others become comparatively more important than in the prior graph. Even with this change, there are some apparent similarities. There is a densely connected group of faculty and staff who remain in both networks because they are all affiliated with the new Cybersecurity program. Additionally, some of the playable case study researchers remain, since Derek was working on a project and a new grant related to the development of a cybersecurity playable case study during this time. In short, these images provide a view into Derek's work related to the cybersecurity topic.

9.8 Creating a living org-chart with an organizational email network

Enterprises rely on their communication networks to function. A combination of phone, email, calendars, discussion forums, blogs, wikis, group messaging, texts, and file sharing is often used in concert to share ideas, documents, schedules, and data. Analyzing the patterns of connection within these collections can reveal important insights into the structure and dynamics of an organization. When an employee, for example, emails another employee, a link is formed that connects not only the two individuals, but also their organizational groups and divisions. These connections often crosscut the branches commonly represented in an “org-chart.” Most enterprises and institutions are organized hierarchically, with people in a group reporting to a single manager who, in turn, reports to another manager. These connections repeat until they connect to the single most senior part of the company, creating a tree or pyramid of branching, nested connections known as the traditional org-chart. But “leaf” groups at the ends of these branches often connect directly to other groups, without passing messages up and down the chain of command. A map of the network of connections among groups in an enterprise is an alternative vision to the org-chart that reveals information about the flows of information and connections through the organization.

The extraction of enterprise social media network data is not trivial and requires the coordination of several parts of a typical business. Support from the managers of enterprise email systems is essential to access records of email exchanges. Data about the organizational structure of the business are often stored in a separate corporate directory system that contains information about each employee, such as one's job title, physical location, level, and reporting structure (i.e., to whom the person reports). Coordinating the extraction of data from these two systems can be a challenge for organizations accustomed to managing these functions separately, although unique employee identifiers and email addresses can often be used to join the separate datasets that must be integrated. Privacy, security, and legal concerns arise and must be addressed, potentially for multiple jurisdictions. Although it may be nice to match performance data with network information, it is often not feasible because of potential privacy concerns. Data from multiple information systems is rarely available in a form that is immediately useful for network analysis such as an edge list, and it must be scrubbed to remove errors or inconsistencies. Despite these challenges, a number of companies have begun to create social network data that combine corporate email network data and corporate directory information, giving them a live window into their corporate communication patterns.

9.8.1 TechABC's organizational unit email network

In this section you will analyze a sample of email traffic from a large global technology company we'll call TechABC. The company has > 100,000 employees in dozens of countries and hundreds of locations. Employees are aggregated into roughly 10,000 organizational units that have an average of 15 members. Organizational names in the visualizations have been anonymized. For privacy reasons we cannot provide the dataset.

People in each organizational unit exchange email with people within their own unit as well as with people in other units. These events were logged in the corporate email server and were extracted for a weeklong period. An edge list of events in which an employee sent an email to another employee in the To, Cc, or Bcc fields was created. Data about each employee were then removed and replaced with the name of the organizational unit of which they were a member, helping address individual privacy concerns. Data were then aggregated (see Section 9.6.2), creating an edge weight that represents the number of messages sent from one unit to another. This process rolls messages exchanged between members of the same unit into self-loops, where the sending unit and receiving unit are the same. The total number of internally exchanged messages can be useful, but is best treated as attribute data on the Vertices worksheet rather than captured in the edge list.

9.8.2 Normalizing and filtering TechABC's data

Whole graph maps of enterprise networks are likely to be too large and dense to be informative. For example, TechABC's raw sent email network includes > 1.3 million edges and around 10,000 vertices. A process of filtering and selective display is required to peel away parts of the network that obscure structures of interest (see Chapter 7). When working with large datasets such as TechABC's, you may want to perform the first round of filtering using a database program like Microsoft Access because of size limitations in Excel.

A common edge filtering technique is to remove all connections below a threshold, helping whittle away infrequent ties to reveal the strong core skeletal structures of the company. The easiest threshold to use is the raw number of messages sent between units. However, because organizational units differ in size, this approach disadvantages smaller units with fewer members contributing to the number of messages. To account for this discrepancy, you can normalize the data by creating a new edge variable based on the number of messages sent per employee (e.g., per full-time equivalent or FTE). You'll need to decide if you want to use the number of FTEs from the sending unit, receiving unit, or some combination of the two. For the graph shown in Figure 9.8, we removed edges with fewer than 50 messages per FTE sent in a week, where we used the minimum of the sender and receiver FTE values as the denominator. This approach keeps an edge if it is important (i.e., a high number of emails per FTE) to either the sending or receiving unit (see the U.S. Senate co-voting example in Chapter 7 for another illustration of a similar technique). The resulting, filtered TechABC network includes 2303 edges and 2267 vertices. Figure 9.9 uses a similar approach, but because it focuses on a subset of units (only research units), the threshold was lowered to 10 messages per FTE sent in a week.
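The per-FTE normalization described above can be sketched in Python. The unit names, FTE counts, and function names are hypothetical; only the rule itself (divide by the smaller of the two units' FTE values, then apply a threshold) comes from the text.

```python
def messages_per_fte(weight, sender_fte, receiver_fte):
    """Normalize a raw message count by the smaller of the two units."""
    return weight / min(sender_fte, receiver_fte)


def filter_by_fte(edges, fte, threshold=50):
    """Keep edges whose messages-per-FTE rate meets the threshold."""
    kept = []
    for sender, receiver, weight in edges:
        rate = messages_per_fte(weight, fte[sender], fte[receiver])
        if rate >= threshold:
            kept.append((sender, receiver, weight, rate))
    return kept


# Hypothetical FTE counts and weekly message totals between units.
fte = {"Unit A": 10, "Unit B": 100, "Unit C": 200}
edges = [("Unit A", "Unit B", 600), ("Unit B", "Unit C", 600)]
backbone = filter_by_fte(edges, fte, threshold=50)
```

Both edges carry 600 messages, but only the first survives: 600 messages are 60 per FTE for the 10-person Unit A, whereas the same volume is only 6 per FTE for the smaller unit in the second pair. Dividing by the minimum FTE value yields the larger of the two per-FTE rates, which is why an edge is kept when it matters to either unit.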

Figure 9.8
Figure 9.8 TechABC's organizational unit email network “backbone,” focusing on high-traffic connections between units (i.e., those who exchange > 50 messages per FTE). Color is mapped to betweenness centrality with green vertices playing important roles as bridge spanners. Edge opacity is mapped to messages per FTE. Dynamic filters were used to exclude those with low closeness centrality, which is a trick for filtering out all vertices that are not part of the large component.

Figure 9.9
Figure 9.9 TechABC's organizational unit network including research units (maroon squares) and non-research units connected to them (blue disks) through the exchange of email. Only edges with 10 or more messages sent per FTE are included. Edge width (1–2) and opacity (40–100) are based on raw sent messages. Vertex size is based on the number of group members (FTEs). Notice some obvious disconnects (e.g., Market Research 1 and 2), as well as some of the important bridge spanning research groups (e.g., Specific 6) and non-research groups (e.g., Blue disk just above General 17).

You could also normalize the data by calculating the number of messages sent from one unit to another unit as a percentage of all messages sent from the unit. This approach accounts for differences in a unit's overall email usage patterns, which can be desirable in some cases. For example, it would remove edges representing company announcements from a single unit (e.g., the human resources or information technology department) because the messages sent to any one unit would be a small percentage of the sending unit's overall sent messages. As with the prior example, you will need to decide if you want to use the sending or receiving unit's total message count as the denominator, or some combination of the two (e.g., maximum, minimum, average).
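As a rough sketch of this alternative, the Python below expresses each edge as a share of the sending unit's total sent messages. The unit names and function name are hypothetical, and the sender's total is used as the denominator only as one of the choices mentioned above.

```python
from collections import defaultdict


def share_of_sent(edges):
    """Express each edge weight as a fraction of the sender's total sent."""
    total_sent = defaultdict(int)
    for sender, _, weight in edges:
        total_sent[sender] += weight
    return [
        (sender, receiver, weight,
         weight / total_sent[sender] if total_sent[sender] else 0.0)
        for sender, receiver, weight in edges
    ]


# A hypothetical HR unit announces equally to four units, while Unit X
# concentrates most of its email on Unit Y.
edges = [
    ("HR", "Unit W", 10), ("HR", "Unit X", 10),
    ("HR", "Unit Y", 10), ("HR", "Unit Z", 10),
    ("Unit X", "Unit Y", 30), ("Unit X", "Unit Z", 10),
]
shares = share_of_sent(edges)
strong = [row for row in shares if row[3] >= 0.5]
```

Each announcement edge from HR accounts for only a quarter of HR's sent messages, so a threshold of 50% removes them while keeping the concentrated tie from Unit X to Unit Y.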

Other strategies for filtering data can lead to other insights. For example, showing only weak ties (edges with between 3 and 10 messages per FTE) can highlight lesser-known connections that might guide management efforts to improve connections across gaps in the company. Attributes of organizational units can be used to filter the network graph as well, helping to zoom into subsections of the larger graph. For example, you could remove all but the most central groups to reveal the network of core groups while hiding more peripheral groups. You can also focus on units within a particular department, geographic location, or similar mission. You will see this approach used in our second example (Figure 9.9), which looks at connections between research units of TechABC.

Figure 9.10
Figure 9.10 Enron Corporation's network of email including messages with the word “FERC” exchanged between employees. Vertex size is based on in-degree. Greener vertices received many FERC messages but did not send many out. The Harel-Koren layout was initially used, followed by Fruchterman-Reingold, to push the more peripheral vertices to the edges. Tim Belden, who pleaded guilty and testified against other top Enron executives, is labeled.

9.8.3 Creating an overview of TechABC's communication patterns

You may want to create an overview graph of an organization's email communication before moving into more detailed analyses of specific departments or groups. Overview graphs can be difficult to read because of their size. However, they are excellent for dynamic exploration: sorting on metric properties to identify important units, highlighting vertices of interest and seeing their connections on the graph, and using dynamic filters to further home in on specific areas of interest. A highly filtered overview of TechABC is shown in Figure 9.8. Only edges with > 50 messages per FTE are shown, with additional filtering to show only the main component. You can think of this as the backbone of the company.

This graph and the accompanying data tell interesting stories about the company. Overall, the graph is sparse, largely because of the high filtering threshold we have chosen. The graph density is very low, suggesting that most units only communicate heavily with one or two other units. The average geodesic distance is 10.2 and maximum geodesic distance (i.e., diameter) is 29, both of which are quite high. If high numbers existed at a lower threshold, it would suggest that units may not be well connected with certain other units on the other “side” of the company. Increasing connections between otherwise disconnected groups may be a goal for an organization. For example, many organizations have created “communities of practice” consisting of people with similar skills who are scattered throughout different organizational units. An initiative to increase connections throughout the company could be evaluated by looking for increases in the network density and decreases in the diameter over time.
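To make the two distance metrics concrete, the Python sketch below computes the average and maximum geodesic distance with a breadth-first search. It rests on assumptions that should be stated plainly: edges are treated as undirected, self-loops are ignored, and only pairs of vertices that can reach one another are counted. NodeXL's own calculation may differ in detail, and the function name is hypothetical.

```python
from collections import deque


def geodesic_stats(edges):
    """Return (average geodesic distance, diameter) over reachable pairs."""
    adjacency = {}
    for v1, v2 in edges:
        if v1 == v2:
            continue  # self-loops do not affect shortest paths
        adjacency.setdefault(v1, set()).add(v2)
        adjacency.setdefault(v2, set()).add(v1)

    total = pairs = diameter = 0
    for start in adjacency:
        # Breadth-first search gives shortest path lengths from start.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        for node, d in distance.items():
            if node != start:
                total += d
                pairs += 1
                diameter = max(diameter, d)

    average = total / pairs if pairs else 0.0
    return average, diameter


# A chain of four units: a - b - c - d.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
average_distance, diameter = geodesic_stats(chain)

# Adding a bridge between the two ends shortens the longest path.
bridged_average, bridged_diameter = geodesic_stats(chain + [("a", "d")])
```

In the chain, the two end units are three steps apart; adding a single bridging connection between them reduces the diameter to two, which is the kind of change an initiative to connect distant groups would aim for.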

In addition to looking at global trends, it is possible to look at the role of individual units in the company. Even with the highly filtered graph shown in Figure 9.8, it is possible to identify several hubs (with high out-degree), some densely connected clusters, and units that act as bridges between other units. Some of these occupy critical locations in the network and derive unique value from their network position. Organizational units that are connected to many other units (i.e., the hubs) perform services like IT management or library services that touch many parts of the company. Groups that are less connected but have a high betweenness centrality likely have coordination functions, bridging information between multiple groups such as specific geographical units within a larger region. Isolated groups and clusters of groups are likely specialists that perform a function for one or a few other groups to consume. Analyzing networks like the one visualized in Figure 9.8 also allows you to compare units that serve a similar function on various metrics, helping to identify those that could benefit from additional connections.
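The distinction between hubs and bridges can be illustrated with a short Python sketch. It computes out-degree from directed edges and an unnormalized betweenness centrality using Brandes' algorithm. As a simplifying assumption, betweenness treats the edges as undirected. The unit names and edges are hypothetical, and the values are not meant to match those reported by NodeXL.

```python
from collections import deque

def out_degree(edges):
    """Count the distinct units each unit sends messages to."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)
        targets.setdefault(dst, set())
    return {node: len(t) for node, t in targets.items()}

def betweenness(edges):
    """Unnormalized betweenness (Brandes' algorithm), treating edges as undirected."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    score = {node: 0.0 for node in neighbors}
    for source in neighbors:
        order = []                                 # nodes by non-decreasing distance
        preds = {node: [] for node in neighbors}   # predecessors on shortest paths
        paths = {node: 0 for node in neighbors}    # number of shortest paths from source
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
                if dist[nxt] == dist[node] + 1:
                    paths[nxt] += paths[node]
                    preds[nxt].append(node)
        # Accumulate dependencies from the farthest nodes back toward the source.
        delta = {node: 0.0 for node in neighbors}
        for node in reversed(order):
            for p in preds[node]:
                delta[p] += paths[p] / paths[node] * (1 + delta[node])
            if node != source:
                score[node] += delta[node]
    # Every unordered pair was counted twice, once from each endpoint.
    return {node: s / 2 for node, s in score.items()}

# Hypothetical units: "IT" is a hub, "Coord" links unit "X" to the rest.
sample = [("IT", "A"), ("IT", "B"), ("IT", "C"), ("C", "Coord"), ("Coord", "X")]
degrees = out_degree(sample)
between = betweenness(sample)
print(degrees)
print(between)
```

In this toy network, IT has the highest out-degree, while Coord sends to only one unit yet still lies on every shortest path to X, giving it a nonzero betweenness despite its few connections.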

9.8.4 Examining TechABC's research division

Although overview maps like Figure 9.8 can be helpful, they can also be cluttered and may filter out too much of the detail for large companies. To gain actionable insights, you will typically need to focus on subsections of the network, such as units that serve a similar purpose (e.g., IT, marketing, research). In this section you will explore the organizational units within TechABC that have a research mission. They were identified by looking for organizational unit names with the word “research” in them using Microsoft Excel's Search function (a non-case-sensitive function similar to the Find function).

Although you can restrict the network analysis to only the core units that meet your criteria (i.e., research units), it is often insightful to include all units connected to the core units. For example, Figure 9.9 includes all research units (maroon boxes), as well as all units with which they exchanged messages (blue disks). A cutoff point of 10 messages per FTE was used in order to account for unit size differences. Because the focus is on the research units, connections between the non-research units are not shown. The result is a collection of all of the 1.0-degree networks of the research units. To create a similar graph, we added a new column called Research to the Edges worksheet that is a 1 if either Vertex1 or Vertex2 is a research unit and a 0 otherwise. Using the Autofill Columns feature, Edge Visibility can then be set to show only edges with a value equal to 1 in the Research column, which excludes all other edges.
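The logic behind the Research column can be expressed in a few lines of Python. This sketch assumes each edge is a dictionary with Vertex1 and Vertex2 keys, mirroring the Edges worksheet, and that research units can be recognized by a case-insensitive match on the word "research" in the unit name. The unit names below are invented for illustration.

```python
def flag_research_edges(edges, keyword="research"):
    """Add a Research flag (1 or 0) to each edge row.

    edges: list of dicts with "Vertex1" and "Vertex2" keys (hypothetical layout).
    """
    def is_research(name):
        return keyword in name.lower()  # case-insensitive match

    flagged = []
    for row in edges:
        flag = 1 if is_research(row["Vertex1"]) or is_research(row["Vertex2"]) else 0
        flagged.append({**row, "Research": flag})
    return flagged

# Invented unit names for illustration only.
sample_edges = [
    {"Vertex1": "Speech Research", "Vertex2": "Finance"},
    {"Vertex1": "Finance", "Vertex2": "IT Support"},
    {"Vertex1": "IT Support", "Vertex2": "RESEARCH Labs"},
]
flagged = flag_research_edges(sample_edges)
visible = [row for row in flagged if row["Research"] == 1]
print(visible)
```

Only the first and third edges remain visible, because each touches at least one research unit. The edge between the two non-research units is hidden.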

The network highlights several important bridge-spanning units, as well as some disconnected units that may need to be connected. For example, research unit Specific 6 plays an important role in connecting several other research groups either directly or indirectly. The organizational unit Specific 10 is important because it is the only path connecting the large Specific 2 unit to the other research units (albeit indirectly). There are also several non-research groups that play pivotal bridge spanning roles, such as the very small unit just above General 17 that is connected to six different research groups, none of which is directly connected to another. This small unit likely plays an important role and its small size may make it vulnerable to employee turnover, suggesting that the company may consider if additional resources are needed to support the group's function. In contrast, the network shows Market Research 1 and 2 in completely different components, not even connected indirectly. More generally, few research units are directly connected to each other, suggesting that there may be potential for increased exchanges through employee swaps, internships, or other shared projects. This assumes there would be benefits from interdisciplinary projects, which content experts would need to determine. Although many of the actionable insights require knowledge about the organization, Figure 9.9 gives you some idea of the potential benefits of this type of analysis.

9.9 Historical and legal analysis of Enron email

In the prior section you explored an organizational network from the perspective of an insider who knows the company. In this section you will explore an organizational network from the perspective of an outsider trying to make sense of an email corpus collected as part of a lawsuit. Specifically, you will explore a subset of email messages sent and received by Enron employees. The original, publicly available dataset included approximately a half-million messages and was made public by the Federal Energy Regulatory Commission (FERC) during the investigation of Enron. It was later cleaned and made permanently accessible by researchers at MIT, CMU, and SRI (see www-2.cs.cmu.edu/~enron for details). The analysis in this chapter is based on a subset of 1700 messages coded by students and researchers at the University of California at Berkeley, filtered to only include messages that are work related. It focuses on business-related messages occurring later in the collection and includes discussions of the California Energy Crisis (see http://bailando.sims.berkeley.edu/enron_email.html for a complete description and compressed file of the individual messages). Messages were downloaded, indexed, and imported into NodeXL using the process described earlier in this chapter. You can download the NodeXL files that correspond to the images shown in this section from https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/. The analysis is inspired by Jeffrey Heer's work [4].

9.9.1 Identifying key individuals using content networks

One problem historians and lawyers face is identifying individuals who played key roles in important events. For employees that use email frequently, email networks provide a quick sense of who communicates with whom. Filtering email collections to include only those that use a particular keyword or set of words is a useful method for finding people related to some event.

You can see an example of this by analyzing the Enron email network of messages that include the term “FERC,” the commonly used acronym for the Federal Energy Regulatory Commission, an “independent agency that regulates the interstate transmission of natural gas, oil, and electricity” (see www.ferc.gov). To create this FERC network, you can use the NodeXL import tools, making sure to filter messages to include only those that have “FERC” in the body of the message. The resulting network includes 370 vertices representing employee email addresses and 672 weighted edges. It is a subset of the larger Enron message network tagged by UC Berkeley students, which includes 1803 edges and 1102 vertices. The total sent and received FERC messages8 are included on the Vertices worksheet (see Advanced topic: Calculating total sent and received edges), along with a column called %_Received, which equals Received/(Sent + Received).
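The keyword filter and the %_Received calculation can be sketched in Python as follows. The message layout (dictionaries with sender, recipients, and body keys) and the addresses are hypothetical. The sketch also assumes a case-sensitive match on the keyword and counts one sent and one received message per sender-recipient pair, which may differ from how the NodeXL importer counts them.

```python
from collections import Counter

def keyword_network(messages, keyword="FERC"):
    """Build weighted edges and %_Received values from messages containing keyword.

    messages: list of dicts with "sender", "recipients", and "body" keys
    (a hypothetical layout, not the NodeXL import format).
    """
    edge_weights = Counter()
    sent, received = Counter(), Counter()
    for msg in messages:
        if keyword not in msg["body"]:  # case-sensitive match on the acronym
            continue
        for recipient in msg["recipients"]:
            edge_weights[(msg["sender"], recipient)] += 1
            sent[msg["sender"]] += 1
            received[recipient] += 1

    pct_received = {}
    for address in set(sent) | set(received):
        total = sent[address] + received[address]  # always at least 1 here
        pct_received[address] = received[address] / total
    return edge_weights, pct_received

# Made-up messages for illustration only.
sample_messages = [
    {"sender": "a@example.com", "recipients": ["b@example.com", "c@example.com"],
     "body": "FERC hearing update"},
    {"sender": "a@example.com", "recipients": ["b@example.com"],
     "body": "Notes on the FERC order"},
    {"sender": "b@example.com", "recipients": ["a@example.com"],
     "body": "Lunch?"},
    {"sender": "c@example.com", "recipients": ["b@example.com"],
     "body": "Re: FERC hearing"},
]
edges, pct = keyword_network(sample_messages)
print(dict(edges))
print(pct)
```

Here the address that only sends keyword messages has a %_Received of 0, the address that only receives them has a value of 1, and the address that does both in equal measure has a value of 0.5.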

Once you calculate the graph metrics, you can use them to create a graph such as Figure 9.10 designed to highlight important individuals. The graph sets the size of each vertex based on in-degree, because those receiving FERC messages from many different individuals are likely “go to” people. Vertex color is based on the %_Received data, with greener vertices indicating that the individual received many messages but did not send out many. Individuals with something to hide may not send out messages, suggesting that focusing on the large green vertices may lead to potential violators. Indeed, one of these vertices represents Tim Belden, the head of trading in Enron Energy Services considered by many to be the mastermind of Enron's scheme to drive up energy prices in California. Belden pleaded guilty to one count of conspiracy to commit wire fraud as part of a plea bargain and ended up serving as a key witness against many top Enron executives.
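As a rough illustration of how the two visual properties single out candidates, the following Python sketch ranks addresses by in-degree (the vertex size in Figure 9.10) and then by the share of messages received (the vertex color). The names, edges, and %_Received values are invented, and the ranking rule is an assumption made for this example rather than a NodeXL feature.

```python
def rank_vertices(edges, pct_received):
    """Sort addresses by in-degree, then by share received (both descending)."""
    senders_by_recipient = {}
    for sender, recipient in edges:
        senders_by_recipient.setdefault(recipient, set()).add(sender)
        senders_by_recipient.setdefault(sender, set())
    # In-degree counts distinct senders, not the number of messages.
    in_degree = {addr: len(s) for addr, s in senders_by_recipient.items()}
    ranked = sorted(
        in_degree,
        key=lambda a: (-in_degree[a], -pct_received.get(a, 0.0), a),
    )
    return [(a, in_degree[a], pct_received.get(a, 0.0)) for a in ranked]

# Invented roles and values for illustration only.
sample_edges = [
    ("analyst", "trader"), ("lobbyist", "trader"), ("lawyer", "trader"),
    ("lobbyist", "lawyer"), ("analyst", "lawyer"),
]
sample_pct = {"trader": 1.0, "lawyer": 2 / 3, "analyst": 0.0, "lobbyist": 0.0}
ranked = rank_vertices(sample_edges, sample_pct)
for address, degree, share in ranked:
    print(address, degree, round(share, 2))
```

The address at the top of the list corresponds to a large green vertex: it hears from many different people and sends nothing in return.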

Although visualizations like Figure 9.10 can help identify individuals worth following up on, they should be used cautiously. In this particular example, there are no messages sent from Tim Belden in the dataset, making it unclear if his high received ratio is due to his actual email usage patterns, purposefully deleted messages, or limitations with the original dataset. Even if the data accurately reflects actual email patterns, Figure 9.10 is imperfect in that it emphasizes many individuals aside from Tim Belden who were not accused of illegal activities. Furthermore, many of those found guilty of crimes were not included in this graph at all, perhaps because they recognized the liability of using email for sensitive communication or perhaps because of limitations in the dataset. Clearly, reading the content of the messages is of utmost importance. However, viewing the network can help identify individuals and messages of interest. Once an individual is known to be involved, mining email is an effective way to identify people with whom the suspect frequently interacts. For example, Figure 9.10 shows a strong connection from John Shelk to Tim Belden (and many other recipients), which is explained by the fact that John Shelk often reported on congressional meetings but rarely received replies to his reports. Integrating the content with network visualization tools can provide a powerful exploratory platform, as has been done with the Enron network dataset [4].

9.10 Practitioner's summary

Email networks provide an intimate look into individuals' social and work relationships, making them of interest to managers, community analysts, historians, researchers, and legal professionals. Because email is frequently and widely used in professional contexts, it reliably captures important aspects of many professional relationships. There are three main types of email collections: personal, organizational, and community. An analyst's existing experience with a collection is also important and affects the types of questions asked and the amount of detail needed.

Working with email networks can be challenging. Large collections must often be filtered to a manageable size. Filtering can be based on time, sender/receiver, messages' content, folders or labels, or any combination. Combining duplicate email addresses for the same individual can be time intensive but is often necessary. Integrating email networks with corporate personnel data can be challenging and poses ethical considerations, but when done responsibly can provide new insights.

Personal and organizational email networks can be explored for insights or shared with others to provide an overview. These networks may be based on individuals and their connections or on organizational units and their connections. Analysis can uncover important individuals and relationships such as boundary spanners, central members, broadcasters, and unresponsive recipients. Tightly connected subgroups can be identified and their relationship to one another can be mapped. The impact of interventions or external shocks on the network can be tracked over time, and common structural patterns such as recurring social roles or types of subgroups (e.g., cliques, fans) can be identified. These analyses can lead to actionable insights by identifying people or departments that need more cross-fertilization, helping newcomers get an overview of the social structure they are entering into, evaluating the effectiveness of a new community of practice initiative, and much more.

9.11 Researcher's agenda

The widespread use of email has fostered a growing community of researchers whose goals are to understand usage patterns so as to improve user interfaces and management tools. Researchers have focused largely on individual usage of email [5, 6], but they increasingly work on forensic tools to analyze another person's email or a group's email [3, 7]. A popular theme has been to improve the strategies for finding relevant documents in a large email collection [8, 9]. Exploration tools have built on the traditional keyword or key phrase search strategies, but increased attention to visualization tools has enabled users to get an overview of temporal patterns, relationships with individuals, or the social structure within groups [8, 10–13].

The many opportunities to improve on email analysis systems are generating increased research on these issues and an increasing demand for such tools from corporate human resources staff, forensic investigators, legal analysts, and social scientists. The ability to detect temporal changes, such as sharp increases/decreases in communication among certain people or about certain topics, is a valuable guide to analysts. Temporal changes might be visualized by simple timelines or by animated changes to network diagrams, assuming stable layouts are used. The formation and dissolution of subgroups signal important changes that are useful in applications as diverse as detecting rumor spreading (gossip), corporate reorganizations, or antecedents of important events. Tying email to geographical position or even location in office buildings could help us to understand social processes in organizations.

References

[1] Viégas F.B., Boyd D., Nguyen D.H., Potter J., Donath J. Digital artifacts for remembering and storytelling: postHistory and social network fragments. In: Proceedings of Hawaii International Conference on System Sciences (HICSS). 2004:105–111.

[2] Donaldson S.I., Grant-Vallone E.J. Understanding self-report bias in organizational behavior research. J. Bus. Psychol. 2002;17:245–260.

[3] Leuski A. Email is a stage: Discovering people roles from email archives. In: Proc SIGIR 2004. New York: ACM Press; 2004:502–503.

[4] Heer J. Exploring Enron: Visualizing ANLP results. Available online at: http://hci.stanford.edu/jheer/projects/enron/v1.

[5] Ducheneaut N., Bellotti V. Email as habitat: An exploration of embedded personal information management. Interactions. 2001;8(5):30–38.

[6] Whittaker S. Personal information management: from information consumption to curation. Ann. Rev. Inform. Sci. Technol. 2013;1–62.

[7] Tyler J.R., Wilkinson D.M., Huberman B.A. E-mail as spectroscopy: Automated discovery of community structure within organizations. Informat. Soc.: Int. J. 2005;21:143.

[8] Tang G., Pei J., Luk W.S. Email mining: tasks, common techniques, and tools. Knowl. Inform. Syst. 2014;41(1):1–31.

[9] Elsweiler D., Baillie M., Ruthven I. Exploring memory in email refinding. ACM Trans. Information Systems. 2008;26(4):1–36.

[10] Perer A., Shneiderman B., Oard D.W. Using rhythms of relationships to understand e-mail archives. J. Am. Soc. Inf. Sci. Technol. 2006;57(14):1936–1948.

[11] Perer A., Smith M.A. Contrasting portraits of email practices: Visual approaches to reflection and analysis. In: Proceedings International Conference on Advanced Visual Interfaces (AVI 2006); 2006:389–395.

[12] Luo S.J., Huang L.T., Chen B.Y., Shen H.W. Emailmap: visualizing event evolution and contact interaction within email archives. In: 2014 IEEE Pacific Visualization Symposium; IEEE; 2014:320–324.

[13] Viégas F.B., Golder S., Donath J. Visualizing email content: Portraying relationships from conversational histories. In: Proceedings CHI 2006. New York: ACM Press; 2006:979–988.


1 https://www.lifewire.com/how-many-emails-are-sent-every-day-1171210.

2 https://www.statista.com/statistics/183910/internet-activities-of-us-users/.

3 https://www.statista.com/topics/4295/e-mail-usage-in-the-united-states/.

4 https://www.amanet.org/training/articles/the-latest-on-workplace-monitoring-and-surveillance.aspx.

5 Changing the user names and personal identifiers included in a network dataset is a common form of anonymization. However, it is ineffective in some situations in which a version of the original graph is available for analysis. The often-unique patterns found around each vertex in a network can be used to re-identify some anonymized entities.

6 Table 9.2 is loosely based on a similar figure provided by Perer, Shneiderman, and Oard who characterized the types of interactions people have with current and archival email collections [10].
