Email is one of the oldest, yet most commonly used, forms of communication. Email networks connect email addresses to one another based on the number of messages sent and received. Three types of email networks exist: personal, organizational, and community. Personal networks, such as the one examined in this chapter, reveal insights about the person and relationships between those they communicate with. Corporate networks, such as the ABC network, can show connections between corporate units, helping identify redundancies, missed opportunities, and critical units based on actual communication patterns rather than official organizational charts. An examination of the Enron email network illustrates the value of email network analysis for forensic investigations. NodeXL includes an email network importer that can import data from email clients (e.g., Microsoft Outlook) and email servers (e.g., Microsoft Exchange). Email addresses of the same person must be combined in a process called deduplication.
Email; Email network; Corporate email; Microsoft Exchange; Deduplication; Privacy; Enron; Email forensics; Importer
Email has permeated society more than any other form of social media. It is hard to remember a time when inboxes sat on desks and spam was only a processed meat. Today, an estimated 3.8 billion worldwide email users send over 200 billion emails per day.1 Unfortunately, about half of them are considered spam.1 Email is the de facto form of communication for many corporations, nonprofits, and government agencies. Email and email lists are used to keep extended families in touch, coordinate neighborhood activities, support medical patients, share cutting-edge research, solve technical problems, and perform a host of other activities.
In April 2017, 91% of U.S. Internet users sent or received an email message, making it the most common Internet activity of all.2 Over 80% of email users check email at least once a day.3 Unlike many social media tools, email is widely used among nearly every demographic group. Furthermore, many social media sites optionally send email notifications due to its continued ubiquity.
The integration of email into everyday life makes email networks the most accessible and in many cases most accurate source of data for mapping actual social and work relationships. Analyzing one's personal email collection is a lot like looking in a mirror. Prototype systems like PostHistory demonstrated the interest people have in seeing representations of their social media and show how maps of social connections and activity can promote sustained engagement and storytelling around important events [1]. Network visualizations provide an objective representation of one's social ties, encouraging self-reflection and providing a guide to social hygiene. These maps and reports may help you realize unappreciated or forgotten relationships, or identify a past working group that could be rekindled for a current project. They can help us overcome some of our memory biases such as weighing recent events more or remembering things we've initiated more than those initiated by others. A visualization of your personal email network can be shared with other people to explain your social world. For example, new employees could benefit from a summary of important social ties and collaborative groupings related to a job role or position. Personal email collections are of increasing importance to historians, researchers, archivists, and lawyers who are engaged in the discovery and preservation of electronic records.
Analyzing organizational email networks and email lists can provide a wealth of social information that can inform important decisions and support novel interventions. Organizations can identify unique social roles, individuals who span the gaps between organizational silos, internal influencers, and employees in need of creating more connections. This information can be used as one input to help inform personnel hiring and promotion, improve retention, and spread important messages through a company. Analysis of organic employee clusters, rather than formal organizational charts, can be used to inform the formation of communities of practice and organizational restructuring, and it can help integrate relationships after mergers. Expertise networks can be identified by the use of keywords to infer topics, leading to more intelligent workgroup formation and information sharing. The analysis of an internal company or public email list can help identify experts on a topic, monitor the health of the community over time, and identify potential candidates for leadership roles in the list. Because it is based on actual behavior instead of potentially biased self-reports [2], its validity is high.
Working with email poses several ethical challenges. Although company email is far from private and, unless encrypted, far from secure, many users don't realize just how public their email is. A 2007 survey found that nearly half of the 304 U.S. companies surveyed monitored email use.4 More than a quarter of these companies had fired workers for email misuse. A related survey from 2006 found that 24% of employers had email subpoenaed by courts and regulators. Employers must walk a fine line between controlling the risks of litigation and security breaches by employees and not coming across as Big Brother. In such an environment, they must carefully consider the risks and rewards associated with analyzing company email collections. Researchers must also be careful to obtain proper approval from list owners, managers, and members; use pseudonyms when advisable;5 and protect members' privacy. For corporations and researchers, transparency is needed when articulating the goals of the analysis, the procedures for assuring confidentiality of message content, and the decisions that will be informed by the analysis. Options for employees or research subjects to opt out or prefilter email may be desirable. Although an analysis of email collections poses some risks, social network analysis can be less intrusive than many other methods for understanding social interaction. It provides an interesting midway point for those willing to share who they talk to but not what they say.
Electronic mail, or email, is an electronic message transmitted over a communications network, typically as a text file with optional attachments. Email is older than the Internet itself. In the 1960s, email-like messages were sent between users of the same mainframe computer. Those who accessed the same mainframe or host computer through terminals could exchange messages. In the 1960s and 1970s, many companies used this approach to allow employees to contact other employees located throughout the world in different branch offices or subsidiaries. Email became the “killer app” of ARPANET, the computer network developed by the United States Department of Defense that evolved into the Internet. In 1971, Ray Tomlinson sent the first network email, using an “@” symbol to separate the user's name and the host computer's name. By 1973, approximately three-fourths of all ARPANET data traffic was email. Over time, email became more standardized and increasingly interoperable between different computer and network systems. Applications for interacting with email became more feature rich and usable. Email services became essentially free from an end-user perspective with popular web services such as Hotmail, Yahoo mail, and Gmail. Although email services may have access fees, each additional email sent or received does not typically impose an additional cost to the user.
Most readers are familiar with email as everyday users. Some important technical characteristics make email particularly powerful:
Services such as Usenet and discussion forums, also known as bulletin boards or web boards, share many of these characteristics, making them close cousins to email lists, which are often called Listservs. We discuss these in detail in Chapter 10.
In a standard email network, vertices represent email addresses or corresponding people. Edges or ties are created when a message is sent from one email address to another. Edges are directed because messages are transferred from a sender to a receiver. These ties are weighted by the number of messages sent between two individuals. Table 9.1 shows a summary of the information found in seven messages pulled from Derek's personal email collection. These relationships are converted into an “edge list” and visually represented in Figure 9.1.
Table 9.1
From | To | Cc | Subject |
---|---|---|---|
Derek | Ben | | HCIL Brownbag |
Derek | Marc, Ben | | Travel Plans |
Derek | Marc | Anna | Registration |
Marc | Derek | | Re: Travel Plans |
Carol | Derek | | Tuesday meeting |
Marc | Derek | Anna | Re: Registration |
Marc | Derek | | Next Steps |
This email network edge list summarizes seven messages from Derek's personal email collection as six unique edges with a total edge weight of ten (counting both To and Cc recipients) among five vertices.
Notice that the six rows in the Edges tab are a re-representation of the seven individual email messages shown in Table 9.1. The new edges are helpful for understanding sender-receiver relationships that are not as obvious when seen in the form of a standard list of email messages. All people in the To and the Cc email fields are counted as receivers when tallying up the Edge Weight. For example, two messages were sent from Derek to Ben; the first was sent only to Ben and the second was also sent to Marc. Derek's message to Marc copied in Anna, so an edge is created between Derek and Anna. The power of this representation is that tens of thousands of email messages among a group of people can be captured in just a few hundred rows. Alternate ways of handling the data are discussed next.
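The conversion from individual messages to a weighted, directed edge list can be sketched in a few lines of Python. This is a minimal illustration built from the seven messages in Table 9.1, not NodeXL's own implementation; the `messages` structure and the `build_edge_list` helper are hypothetical names introduced for this example.

```python
from collections import Counter

# The seven messages summarized in Table 9.1: (sender, To recipients, Cc recipients).
messages = [
    ("Derek", ["Ben"], []),
    ("Derek", ["Marc", "Ben"], []),
    ("Derek", ["Marc"], ["Anna"]),
    ("Marc", ["Derek"], []),
    ("Carol", ["Derek"], []),
    ("Marc", ["Derek"], ["Anna"]),
    ("Marc", ["Derek"], []),
]

def build_edge_list(messages, include_cc=True):
    """Count one directed edge per (sender, receiver) pair in each message."""
    weights = Counter()
    for sender, to, cc in messages:
        receivers = to + (cc if include_cc else [])
        for receiver in receivers:
            weights[(sender, receiver)] += 1
    return weights

edges = build_edge_list(messages)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight}")
```

With Cc recipients included, the seven messages reduce to six directed edges whose weights sum to ten. Passing `include_cc=False` counts only the To field, which drops the two edges that point to Anna.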
A standard email network can be aggregated to create networks that show the connections between different social groupings. For example, vertices can represent company work groups, organizational departments, local branches, or entire organizations. Edges can represent the aggregate number of messages sent between people associated with different groups (i.e., vertices). For example, a directed edge pointing from the marketing department to the development department with a weight of 100 would suggest that marketing employees sent 100 messages to development employees. The use of organization elements in host names (e.g., @umd.edu versus @cs.umd.edu) as part of email addresses can facilitate this type of analysis by identifying people from different departments, although the frequent use of web mail (e.g., @gmail.com) makes this technique problematic for studying broader populations. Alternatively, edges may represent the number of unique individuals who have sent emails from one department to another. For example, in our prior scenario there may have only been five people that sent the 100 messages, resulting in an Edge Weight of 5. A graph based on these networks provides an overview of the departmental relationships within an organization, highlighting the most connected departments and the most socially isolated ones. Section 9.8 provides an example of a summarized network showing connections between workgroups within a large technology company.
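The two weighting schemes described above can be compared with a short sketch. The addresses, the message counts, and the assumption that the first label of the host name identifies a department are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical person-level edges: (sender, receiver, number of messages).
person_edges = [
    ("ann@mkt.example.com", "raj@dev.example.com", 60),
    ("bob@mkt.example.com", "raj@dev.example.com", 30),
    ("bob@mkt.example.com", "li@dev.example.com", 10),
    ("raj@dev.example.com", "ann@mkt.example.com", 5),
]

def department(address):
    """Assume the first label of the host name identifies the department."""
    return address.split("@", 1)[1].split(".", 1)[0]

message_totals = defaultdict(int)   # weight = messages sent between groups
unique_senders = defaultdict(set)   # weight = distinct people who sent them

for sender, receiver, count in person_edges:
    pair = (department(sender), department(receiver))
    message_totals[pair] += count
    unique_senders[pair].add(sender)

sender_counts = {pair: len(people) for pair, people in unique_senders.items()}
print(dict(message_totals))
print(sender_counts)
```

In this toy data the marketing-to-development edge has a weight of 100 when messages are counted, but only 2 when unique senders are counted, which mirrors the contrast drawn in the scenario above.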
Email messages can be analyzed as part of a larger corpus. Table 9.2 shows the three main types of email collections (personal, organizational, and community), each of which may be analyzed by a current participant or an outside observer.6 Personal email collections include messages sent or received by an individual. Organizational email collections include messages sent and received by members of an organization. More generally, they are the aggregate of several individuals' personal email collections. Community email collections include messages sent to an email list address that get forwarded to a group of subscribed members. Email lists may be public, where anyone can participate and view prior messages, semi-public, where anyone who registers can participate and see the archive, or private, where only invited or approved members can participate and view prior messages.
Table 9.2
Analyst | Personal | Organizational | Community |
---|---|---|---|
Current Participant | Region A: Analyzing your own email | Region B: Analyzing your organization's email | Region C: Analyzing ongoing conversations in a community email list in which you participate |
Outside Observer | Region D: Analyzing another person's email | Region E: Analyzing another organization's email | Region F: Analyzing a community email list archive in which you do not participate |
The goals and process of the analysis are different for each of the regions specified in Table 9.2. Outside observers such as lawyers, historians, and researchers analyze email collections for historical, research, or legal reasons. In contrast, current participants such as managers, community administrators, list owners, and members analyze email collections to help inform decisions. Outside observers can benefit considerably from overviews that provide context before delving into specifics. In contrast, current participants typically understand the overall context and can delve into specifics quickly, although they may be biased in their perceptions. There are fewer privacy concerns when analyzing one's own email (Region A) or public community email lists (many communities in Regions C and F) than when analyzing organizational email (Regions B and E) or another person's email archive (Region D).
We examine personal and organizational email collections in this chapter and discuss community collections in Chapter 10, since they are similar to other community-based threaded conversation tools like discussion forums. We discuss preparing, cleaning, and importing email data in this chapter.
Several questions can be asked about personal email network datasets:
Several different questions can be asked about organizational email network datasets:
From a user's perspective the components of an email message are relatively simple. The email header includes the From, To, Cc, Bcc, Date, and Subject fields. The email body includes the message content and any attachments. Despite this apparent simplicity, there can be a great deal of hidden complexity. A full treatment of email protocols and formats is beyond the scope of this book. Instead, we list a few key terms and facts that can serve as starting points for those needing to learn more before accessing and analyzing email collections:
Working with email poses technical challenges that often require preprocessing data to create useful results. The large potential size of email networks can be problematic and may require specialized programs to manage large data volumes. In practice, email will likely need to be filtered before analysis to reduce the dataset based on time ranges, people, and topics of interest. Another major challenge is the use of multiple email addresses for the same individual. In most cases, analysts are interested in social relationships between individuals, not the relationships between email accounts. The problem of combining different aliases (email addresses) for the same entity (person) is called “entity resolution,” “identity resolution,” “deduplication,” or “record linkage.” A range of tools provide deduplication services such as Marketo or the open source Python library Dedupe. Another set of tools extracts entities (e.g., names or places mentioned in email messages) which can be used to create networks that consider personal names or places mentioned in email texts rather than the sender and receiver of a message. Searching for tools that perform “named-entity recognition,” “entity identification,” “entity extraction” and “entity chunking” reveals tools such as the spaCy Python library, Stanford NER, and commercial APIs such as Lexalytics, TextRazor, ParallelDots, and Aylien.
Most email clients do not export data in a format amenable to network analysis. Furthermore, the email you'd like to analyze may be stored in different formats and reside on different computers or web mail servers. As a result, you may need to prepare your email before it is ready to import into network analysis tools such as NodeXL.
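For readers who prepare data outside of NodeXL, the sketch below shows one way the header fields of a single raw message could be turned into sender-receiver pairs using Python's standard `email` package. It is an illustrative alternative, not the NodeXL workflow described in this chapter; the sample message, the addresses, and the `message_edges` helper are hypothetical.

```python
from email import message_from_string, policy
from email.utils import getaddresses

# A hypothetical raw message in standard header-plus-body form.
raw_message = """\
From: Derek <derek@example.org>
To: Marc <marc@example.org>
Cc: Anna <anna@example.org>
Subject: Registration

See you there.
"""

def message_edges(raw, fields=("To", "Cc")):
    """Return (sender, receiver) address pairs for one raw message."""
    msg = message_from_string(raw, policy=policy.default)
    senders = getaddresses([str(v) for v in msg.get_all("From", [])])
    receivers = getaddresses(
        [str(v) for field in fields for v in msg.get_all(field, [])]
    )
    # Lowercase the addresses so that case differences do not create duplicates.
    return [
        (s_addr.lower(), r_addr.lower())
        for _, s_addr in senders
        for _, r_addr in receivers
    ]

print(message_edges(raw_message))
```

Restricting `fields` to `("To",)` would ignore Cc recipients, which corresponds to counting only the To field when edges are created.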
The easiest way to transform email messages into network relationships (i.e., an edge list) is to use NodeXL's Import from Email Network feature. This feature relies on the Windows built-in indexing functionality on recent versions of Windows (e.g., Windows 10). By default, email files in certain formats will be indexed by Windows. You can view and change which filetypes are indexed and check indexing progress in the Indexing Options dialog accessible via the Control Panel.
You may not have the email you want to analyze on a local or shared machine. For example, you may exclusively rely on a web mail service such as Gmail or Hotmail. Nearly all web mail services allow you to download local copies of your messages via POP or IMAP to an email client such as Thunderbird or Outlook, or create an archive of the files for backup. However, in some cases you may need to purchase backup software to export into a file that can be indexed. If you are using IMAP, make sure to download the complete email message files, not just the header information. Otherwise, the Windows indexing service will not be able to index the content of the messages, and you will not be able to import them using NodeXL (as described later). You can typically choose not to download attachments if there are space limitations. If you use IMAP you can also restrict the download by folder. For example, you may want to only download recent messages (e.g., those sent in 2018) rather than years of data. After downloading messages, it may take Windows some time to index all of the files. If you have subscribed to an email list and retained all of the messages you want to analyze, you can place them in a folder and use IMAP to download just those messages.
Once Windows has indexed the email you want to analyze, you are ready to import the data directly into NodeXL. Select the From Email Network option from the Import drop-down on the NodeXL ribbon to open the importer shown in Figure 9.2.
The enormous size of many email collections often requires filtering out messages before analysis. Even when email collections are of manageable size, filtering messages can home in on a specific subset of messages ideally suited for addressing a question of interest. There are several ways of filtering:
In addition to filtering the messages included in the social network dataset, NodeXL allows the way the edge weight is calculated to be specified either based only on addresses in the To field or including those in the Cc or Bcc fields as well. This is independent of filtering. By default only those addresses in the To field are counted. In the example displayed in Figure 9.2, the Cc field is included when calculating edge weights, but not the Bcc field.
After importing email data into NodeXL, you will likely need to clean it to remove duplicate email addresses for the same individuals, as well as self-referring loops created when people reply to their own messages.
If your focus is on connections between people, as opposed to specific email accounts, you will want to combine multiple email accounts from the same person into a single one. Unless you are using an advanced entity resolution software program to do this, this is likely to be a somewhat manual process.
The simplest approach is to use the Find and Replace tool familiar to most Excel and Word users. Start by choosing Show Graph and then navigate to the Vertices worksheet. Sort the Vertex column from A to Z so that email addresses that start with the same name will be next to one another (e.g., [email protected] and [email protected]). Then click on Control + F to open the Find and Replace window and enter the appropriate email addresses (Figure 9.3 presents an example). Navigate to the Edges worksheet and choose Replace All for data in the Vertex1 and Vertex2 columns. There will likely be duplicates that don't start with the same username (e.g., [email protected] is my work email address, while [email protected] is my personal email address). To find the important people based on the frequency of interaction you can sort columns by edge weight and make sure that those with a high edge weight are not duplicates. The most important duplicates to remove are your own.
After replacing the fields in the Edges worksheet, you should delete all of the rows that have data in the Vertices worksheet. Then click Show Graph, which will generate a new list of Vertices on the Vertices worksheet. If you fail to do this, you will have duplicates in the Vertices worksheet, which can cause problems later.
The problem with using Find and Replace is that it must be repeated each time the data is re-imported or updated, even if the email addresses are the same. There is also no trace of the changes once they are made, making it hard to audit mistakes. A more time intensive, but careful, approach is to use a Lookup Table as described in the Advanced topic: Performing the lookup table strategy to count and merge duplicate email addresses.
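The idea behind a lookup table can be illustrated outside of Excel with a short sketch. It is not the procedure from the Advanced topic, only an analogous approach: the table is kept separately from the data, so it can be reapplied after every import and audited later. All addresses and the `canonical` helper are hypothetical.

```python
# Hypothetical lookup table: each known alias maps to one canonical address.
alias_lookup = {
    "d.hansen@work.example.com": "derek@example.org",
    "derek.personal@mail.example.net": "derek@example.org",
}

# Hypothetical imported edges: (Vertex1, Vertex2, Edge Weight).
edges = [
    ("d.hansen@work.example.com", "marc@example.org", 4),
    ("derek.personal@mail.example.net", "marc@example.org", 2),
    ("marc@example.org", "derek@example.org", 3),
]

def canonical(address, lookup):
    """Return the canonical address, leaving unknown addresses unchanged."""
    address = address.strip().lower()
    return lookup.get(address, address)

# Build a new list so the original import is preserved for auditing.
resolved = [
    (canonical(v1, alias_lookup), canonical(v2, alias_lookup), w)
    for v1, v2, w in edges
]
for row in resolved:
    print(row)
```

After resolution, the first two rows share the same pair of vertices, which is why a merge of duplicate edges is normally the next step.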
Once you have updated the email addresses in the Edges worksheet so that different addresses for the same person are replaced with a single address, you will likely have duplicate edges (e.g., more than one row that have the same values in the Vertex1 and Vertex2 columns). It can be useful to “roll up” these duplicate edges, replacing multiple connections between a pair of email addresses with a single edge. The rolled up edge has a weight equal to the total number of exchanged messages found in the data. It is important to roll up the data so that network metrics can be accurately calculated, as some of them assume that edges connecting any pair of vertices are unique. To prepare your email network for analysis, you can roll up repeated email messages from the same pair of people using the Count and Merge Duplicate Edges feature in the Prepare Data section of the NodeXL Ribbon. This will merge the duplicate edges and sum up the Edge Weights so the total Edge Weight remains the same. Before doing this, make sure the network type is set to Directed, or else it will remove the directed nature of the graph. Figure 9.5 shows the New Edge List from Figure 9.4 in Columns A and B and a merged version of it in Columns E and F shown here to illustrate the results of the Count and Merge Duplicate Edges feature. Notice that the total of the Edge Weight column is the same.
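The effect of rolling up duplicate edges, and the reason the Directed setting matters, can be seen in a small sketch. This approximates the behavior described above and is not NodeXL's code; the edge data and the `merge_duplicate_edges` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge list after alias replacement: (Vertex1, Vertex2, Edge Weight).
edges = [
    ("derek", "marc", 4),
    ("derek", "marc", 2),
    ("marc", "derek", 3),
    ("derek", "ben", 1),
]

def merge_duplicate_edges(edges, directed=True):
    """Sum the weights of rows that share the same pair of vertices."""
    merged = defaultdict(int)
    for v1, v2, weight in edges:
        # In an undirected graph (A, B) and (B, A) are the same edge.
        key = (v1, v2) if directed else tuple(sorted((v1, v2)))
        merged[key] += weight
    return dict(merged)

directed = merge_duplicate_edges(edges, directed=True)
undirected = merge_duplicate_edges(edges, directed=False)

# The total edge weight should be unchanged by the merge.
assert sum(directed.values()) == sum(w for _, _, w in edges)
print(directed)
print(undirected)
```

In the directed result, messages from Derek to Marc and from Marc to Derek remain separate edges. In the undirected result they collapse into one, which loses the information about who initiated the communication.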
Sometimes people send email messages to themselves as a reminder, as a To Do list, or to share a file between computers. This results in a row with the same address in the Vertex1 and Vertex2 columns on the Edges worksheet and is called a self-loop. The use of multiple email addresses and the removal of duplicate addresses can also cause self-loops. Row 9 of Figure 9.4 is an example. For many analyses these self-loops are not important and can be distracting when visualizing data or calculating network metrics. You may want to remove self-loop edges such as the red pair in Figure 9.5 as an additional step after you have counted and merged duplicate edges. See Advanced Topic: Automatically identifying self-loops.
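Identifying self-loops amounts to comparing the two vertex values in each row. The sketch below shows the test on a hypothetical merged edge list; it is an illustration of the idea rather than the spreadsheet formula from the Advanced Topic.

```python
# Hypothetical merged edge list: (Vertex1, Vertex2) -> Edge Weight.
merged_edges = {
    ("derek", "marc"): 6,
    ("derek", "derek"): 2,   # self-loop, e.g., a reminder sent to oneself
    ("marc", "derek"): 3,
}

# A self-loop is any edge whose source and target are the same vertex.
self_loops = {pair: w for pair, w in merged_edges.items() if pair[0] == pair[1]}
without_loops = {pair: w for pair, w in merged_edges.items() if pair[0] != pair[1]}

print("Self-loops found:", self_loops)
print("Edges kept:", without_loops)
```

Keeping the removed rows in a separate structure, as done here, makes it easy to review what was dropped before discarding it.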
This section presents two projects that serve as examples of how to analyze personal email collections. They are both based on the following scenario.
Scenario: You have a new employee coming to work with you next week whom you will supervise. He doesn't know you well and is new to the organization. To help him smoothly transition into his new job, you want to provide him with an overview of the people you work with and their relationships to each other. You decide to create two network visualizations, one that provides an overview of all of your contacts and another that provides more detail on the workgroup that he will work with most closely.
In the following examples, the new employee is a new faculty member coming to work with Derek Hansen, Associate Professor at Brigham Young University's IT and Cybersecurity program. The faculty member will be focused on the area of cybersecurity. For privacy reasons, the email networks analyzed in this section are not made public. You are encouraged to analyze your own email data with a similar scenario in mind. This section assumes that Windows has already indexed your emails as described in prior sections.
Import all of your email sent within the past month. Although some people you know may not have contacted you in the prior month, this time period lets you collect a broad set of your active email contacts. Use the Import From Email Network feature in the Data menu and filter based on your chosen date range (e.g., 11/1/2018 to 11/30/2018). Check the Use Cc line when calculating edge weights box to be more inclusive. For Derek's dataset, a total of 1977 edges are created with a total edge weight of 7572.
Next, combine email addresses as described in the Advanced topic: Performing the lookup table strategy to merge duplicate email addresses, and run the Count and Merge Duplicate Edges function. For Derek's dataset, this collapsed the 1977 edges into 1837 (140 pairs were merged). To make sure no data was lost, check that the sum of the Edge Weight column is the same as it was before the merge.
To more clearly focus on the key relationships, it is desirable to remove infrequent email exchanges. Sort the Edge Weight column from largest to smallest. It is likely that the values will have a skewed distribution with many connections with very low edge weights and relatively few connections with a high edge weight. Remove the least common connections by choosing a cutoff point for deletion. You may want to use the Dynamic Filters feature discussed in Chapter 7 to find an appropriate cutoff. For example, in Derek's data when all of the connections with an edge weight of < 5 are removed, the key individuals are still retained and the total number of edges is reduced to 304 edges. You can manually delete the rows, in which case your data will be a more manageable size, but will lose some data that may be needed in your later analysis. For example, if you calculate the total number of messages8 a person sends (see Advanced topic: Calculating total sent and received edges), the number would be incomplete if all of the infrequent connections were deleted. Alternatively, you can use the Autofill Columns feature to Skip edges with an edge weight that falls below the cutoff point. This will keep the data in the workbook, but not use it in the display of graphs or the calculation of network metrics (discussed in Chapter 6).
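The difference between deleting infrequent edges and merely skipping them can be sketched as follows. The edge data and the cutoff value are hypothetical, and the "Show"/"Skip" labels only imitate the idea of a visibility setting; this is not how NodeXL stores the value internally.

```python
# Hypothetical merged edges: (Vertex1, Vertex2, Edge Weight).
edges = [
    ("derek", "marc", 42),
    ("marc", "derek", 37),
    ("derek", "ben", 6),
    ("carol", "derek", 2),
    ("derek", "anna", 1),
]

CUTOFF = 5  # edges with a weight below this value are hidden, not deleted

# Mark each edge as "Show" or "Skip" instead of removing rows, so totals
# computed later still see the infrequent connections.
annotated = [
    (v1, v2, w, "Show" if w >= CUTOFF else "Skip") for v1, v2, w in edges
]

visible = [row for row in annotated if row[3] == "Show"]
total_weight = sum(row[2] for row in annotated)

print(f"{len(visible)} of {len(annotated)} edges shown")
print(f"Total edge weight retained in the data: {total_weight}")
```

Because the skipped rows remain in the data, a later calculation of messages sent per person still reflects every connection.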
Next, select Show Graph, which will populate the Vertices worksheet with data about each vertex and display a preliminary email social network graph. The next step in the data analysis is to compute all of the relevant graph metrics (see Chapter 6). You can insert additional columns indicating people's attributes such as their relationship to you, their location, or affiliation. You can also use formulas to calculate the total number of messages8 sent or received by an individual as is described in the Advanced topic: Calculating total sent and received edges.
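Totals of sent and received edge weight per person can be derived directly from the edge list. The following sketch uses hypothetical data and is a simplified stand-in for the spreadsheet formulas in the Advanced topic.

```python
from collections import defaultdict

# Hypothetical merged, directed edges: (Vertex1, Vertex2, Edge Weight).
edges = [
    ("derek", "marc", 6),
    ("derek", "ben", 2),
    ("marc", "derek", 3),
    ("carol", "derek", 1),
]

sent = defaultdict(int)
received = defaultdict(int)
for v1, v2, weight in edges:
    sent[v1] += weight       # weight leaving the sender
    received[v2] += weight   # weight arriving at the receiver

for vertex in sorted(set(sent) | set(received)):
    print(f"{vertex}: sent={sent[vertex]}, received={received[vertex]}")
```

Note that these totals sum edge weights. Because every recipient of a message adds to an edge weight, a single message addressed to two people contributes two to the sender's total.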
The next step is to map the metrics and new columns onto display attributes in the visualization (see Chapter 5). Many display attributes like color, transparency (“opacity”), edge width, and location can be mapped to data attributes about messages, relationships, and authors. Selecting the optimal mapping of data attributes to display attributes will inevitably require some trial and error. You may want to look at the social network graph with and without the vertex that represents your own email address by manually setting the Visibility column for the row with your email address on the Vertices worksheet to Skip. Figure 9.6 shows Derek's network after using the Harel-Koren Fast Multiscale layout and Group in a Box feature (see Chapter 7).
Derek's email address and its connections are not shown, which makes the graph less cluttered (because the “Derek” vertex had been connected to every other vertex). Distinct groups can also be seen more clearly. However, removing Derek from the network hides information about whom he communicates with most often and the direction of his communications. To deal with this, opacity and size of vertices have been used to indicate the number of messages sent and received. Email addresses and names of most individuals have not been displayed for privacy reasons. Derek could set the tooltip to display email addresses and provide a file to the new faculty member so that he could map vertices to email addresses. Alternatively, a printed version with selected individuals that the new faculty member is likely to work with could be created.
Analysis of the graph and accompanying data helps answer many of the questions offered earlier in the chapter:
In this case, we want to include only those messages that mention a particular topic. This gives us some idea of who knows about this topic and how they are related to one another. For this example, we are interested in finding individuals with whom Derek exchanges emails that refer to “Cybersecurity.” Figure 9.2 shows the Import From Email Network window set to include only messages with the text “Cybersecurity” exchanged during November 2018. This is a subset of the graph of all emails examined in the prior section.
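The combination of a date range and a text filter can be expressed as a simple predicate over messages. The sketch below uses hypothetical messages and assumes case-insensitive substring matching; the Windows index that NodeXL relies on may match text differently.

```python
from datetime import date

# Hypothetical messages: (sent date, sender, receivers, subject and body text).
messages = [
    (date(2018, 11, 5), "derek", ["marc"], "Cybersecurity curriculum draft"),
    (date(2018, 11, 12), "marc", ["derek"], "Lunch on Friday?"),
    (date(2018, 10, 30), "derek", ["ben"], "cybersecurity grant budget"),
    (date(2018, 11, 20), "carol", ["derek", "marc"], "Re: cybersecurity lab space"),
]

def matches(message, keyword, start, end):
    """Keep messages in the date range whose text mentions the keyword."""
    sent_on, _, _, text = message
    return start <= sent_on <= end and keyword.lower() in text.lower()

selected = [
    m for m in messages
    if matches(m, "Cybersecurity", date(2018, 11, 1), date(2018, 11, 30))
]
for sent_on, sender, receivers, text in selected:
    print(sent_on, sender, "->", ", ".join(receivers))
```

Only the selected messages would then be converted into edges, so the resulting network is a subset of the full November network.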
Use one of the previously specified methods for joining duplicate addresses for the same person and merge duplicate edges after making sure the graph type is Directed. In Derek's dataset, the original 432 unique edges collapsed down to and 404 unique vertices (i.e., email addresses after removing duplicates).
When there are relatively few connections as in this example, it is feasible to calculate the metrics and add columns before filtering the data. Calculate all of the metrics and add the same new columns to the vertices worksheet as in the prior example.
It is reasonable to use the Dynamic Filters to determine a good cutoff point when there are few connections (see Chapter 7). To focus in on those who communicate the most about the topic (Cybersecurity), filter out those with a low edge weight or those who send few messages. To home in on the cluster of tightly connected vertices, filter out those with a low in- and out-degree, since those who are densely clustered send and receive messages from others that are part of the group. Figure 9.7 shows Derek's Cybersecurity email network before and after the dynamic filtering.
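The logic of filtering on in- and out-degree can be sketched on a toy network. The vertex names and the cutoff are hypothetical, and the filter is applied in a single pass based on degrees in the full network, which is a simplification of interactive filtering.

```python
from collections import defaultdict

# Hypothetical directed edges from a topic-filtered network.
edges = [
    ("derek", "amy"), ("amy", "derek"),
    ("derek", "raj"), ("raj", "derek"),
    ("amy", "raj"), ("raj", "amy"),
    ("derek", "vendor"),          # peripheral contact, receives only
    ("newsletter", "derek"),      # peripheral contact, sends only
]

def degrees(edges):
    """Return in-degree and out-degree counts for every vertex."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg

MIN_DEGREE = 2  # assumed cutoff; in practice it is chosen by inspection
in_deg, out_deg = degrees(edges)
vertices = {v for edge in edges for v in edge}
core = {v for v in vertices if in_deg[v] >= MIN_DEGREE and out_deg[v] >= MIN_DEGREE}
core_edges = [(s, t) for s, t in edges if s in core and t in core]

print(sorted(core))
print(len(core_edges))
```

Vertices that only send or only receive drop out, leaving the group whose members exchange messages with one another in both directions.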
Figure 9.7 uses a similar mapping of data to visual properties as was used in Figure 9.6 with a few minor changes in minimum and maximum values. The key difference is the inclusion of Derek and his connections and the focus on only messages that include “cybersecurity.” Including Derek in the network adds clutter to the graph, but also adds valuable information about whom he worked with most closely during this period. For example, the thickest lines coming to and from Derek are with faculty and staff most closely associated with the new Cybersecurity program at BYU.
The analysis of the first graph is similar to that of Figure 9.6, except that connections are solely based on messages containing the text “cybersecurity.” Thus, some important individuals from Figure 9.6 do not appear in Figure 9.7 (e.g., Marc, Itai, and Ben), whereas others become comparatively more important than in the prior graph. Even with this change, there are some apparent similarities. There is a densely connected group of faculty and staff who remain in both networks because they are all affiliated with the new Cybersecurity program. Additionally, some of the playable case study researchers remain, since Derek was working on a project and a new grant related to the development of a cybersecurity playable case study during this time. In short, these images provide a view into Derek's work related to the cybersecurity topic.
Enterprises rely on their communication networks to function. A combination of phone, email, calendars, discussion forums, blogs, wikis, group messaging, texts, and file sharing is often used in concert to share ideas, documents, schedules, and data. Analyzing the patterns of connection within these collections can reveal important insights into the structure and dynamics of an organization. When an employee, for example, emails another employee, a link is formed that connects not only the two individuals but also their organizational groups and divisions. These connections often crosscut the branches commonly represented in an “org-chart.” Most enterprises and institutions are organized hierarchically with people in a group reporting to a single manager who, in turn, reports to a manager. These connections repeat until they connect to the single most senior part of the company, creating a tree or pyramid of branching, nested connections known as the traditional org-chart. But “leaf” groups at the ends of these branches often connect directly to other groups, without passing messages up and down the chain of command. A map of the network of connections among groups in an enterprise is an alternative vision to the org-chart that reveals information about the flows of information and connections through the organization.
The extraction of enterprise social media network data is not trivial and requires the coordination of several parts of a typical business. Support from the managers of enterprise email systems is essential to access records of email exchanges. Data about the organizational structure of the business are often stored in a separate corporate directory system that contains information about each employee, such as one's job title, physical location, level, and reporting structure (i.e., to whom the person reports). Coordinating the extraction of data from these two systems can be a challenge for organizations accustomed to managing these functions separately, although unique employee identifiers and email addresses can often be used to join the separate datasets that must be integrated. Privacy, security, and legal concerns arise and must be addressed, potentially for multiple jurisdictions. Although it may be nice to match performance data with network information, it is often not feasible because of potential privacy concerns. Data from multiple information systems are rarely available in a form that is immediately useful for network analysis, such as an edge list, and they must be scrubbed to remove errors or inconsistencies. Despite these challenges, a number of companies have begun to create social network data that combine corporate email network data and corporate directory information, giving them a live window into their corporate communication patterns.
In this section you will analyze a sample of email traffic from a large global technology company we'll call TechABC. The company has > 100,000 employees in dozens of countries and hundreds of locations. Employees are aggregated into roughly 10,000 organizational units that have an average of 15 members. Organizational names in the visualizations have been anonymized. For privacy reasons we cannot provide the dataset.
People in each organizational unit exchange email with people within their own unit as well as with people in other units. These events were logged in the corporate email server and were extracted for a weeklong period. An edge list of events in which an employee sent an email to another employee in the To, Cc, or Bcc fields was created. Data about each employee were then removed and replaced with the name of the organizational unit of which they were a member, helping address individual privacy concerns. Data were then aggregated (see Section 9.6.2), creating an edge weight that represents the number of messages sent from one unit to another. This process rolls messages exchanged between members of the same unit into self-loops, where the sending unit and receiving unit are the same. The total number of internally exchanged messages can be useful, but is best treated as attribute data on the Vertices worksheet rather than captured in the edge list.
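The aggregation step can be illustrated with a short sketch. The employee names, unit names, and message log below are hypothetical, and the function is only one plausible way to perform the roll-up; it keeps within-unit traffic as a separate per-unit count rather than as self-loop edges, as recommended above.

```python
from collections import Counter

# Hypothetical message log: one (sender, recipient) pair per To/Cc/Bcc recipient.
messages = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("bob", "ann"),
    ("cat", "dan"),
    ("cat", "dan"),
    ("dan", "ann"),
]

# Hypothetical directory lookup from employee to organizational unit.
unit_of = {"ann": "Unit A", "bob": "Unit A", "cat": "Unit B", "dan": "Unit C"}


def aggregate_by_unit(messages, unit_of):
    """Return (edge_weights, internal_counts) for unit-level email traffic."""
    edge_weights = Counter()     # (sending unit, receiving unit) -> messages
    internal_counts = Counter()  # unit -> messages exchanged inside the unit
    for sender, recipient in messages:
        src, dst = unit_of[sender], unit_of[recipient]
        if src == dst:
            internal_counts[src] += 1  # vertex attribute, not an edge
        else:
            edge_weights[(src, dst)] += 1
    return edge_weights, internal_counts


edge_weights, internal_counts = aggregate_by_unit(messages, unit_of)
```

Because employees are replaced by unit names before counting, no individual address appears in the resulting edge list.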
Whole graph maps of enterprise networks are likely to be too large and dense to be informative. For example, TechABC's raw sent email network includes > 1.3 million edges and around 10,000 vertices. A process of filtering and selective display is required to peel away parts of the network that obscure structures of interest (see Chapter 7). When working with large datasets such as TechABC's, you may want to perform the first round of filtering using a database program like Microsoft Access because of size limitations in Excel.
A common edge filtering technique is to remove all connections below a threshold, helping whittle away infrequent ties to reveal the strong core skeletal structures of the company. The easiest threshold to use is the raw number of messages sent between units. However, because organizational units differ in size, this approach disadvantages smaller units with fewer members contributing to the number of messages. To account for this discrepancy, you can normalize the data by creating a new edge variable based on the number of messages sent per employee (e.g., per full-time equivalent or FTE). You'll need to decide if you want to use the number of FTEs from the sending unit, receiving unit, or some combination of the two. For the graph shown in Figure 9.8, we removed edges with fewer than 50 messages per FTE sent in a week, where we used the minimum of the sender and receiver FTE values as the denominator. This approach keeps an edge if it is important (i.e., a high number of emails per FTE) to either the sending or receiving unit (see the U.S. Senate co-voting example in Chapter 7 for another illustration of a similar technique). The resulting, filtered TechABC network includes 2303 edges and 2267 vertices. Figure 9.9 uses a similar approach, but because it focuses on a subset of units (only research units), the threshold was lowered to 10 messages per FTE sent in a week.
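As a rough sketch of this normalization, the following computes messages per FTE using the minimum of the sending and receiving units' FTE counts as the denominator, then keeps edges at or above a threshold. The unit names, message counts, and FTE values are made up for illustration, and the `combine` parameter is a hypothetical convenience for swapping in a different denominator.

```python
# Hypothetical unit-to-unit weekly message counts and unit sizes (FTEs).
edge_weights = {
    ("Large", "Small"): 600,
    ("Large", "Medium"): 900,
    ("Medium", "Large"): 3000,
}
fte = {"Large": 100, "Medium": 40, "Small": 10}


def messages_per_fte(edge_weights, fte, combine=min):
    """Divide each edge weight by a combination of sender and receiver FTEs."""
    return {
        (src, dst): count / combine(fte[src], fte[dst])
        for (src, dst), count in edge_weights.items()
    }


def filter_edges(rates, threshold):
    """Drop edges with fewer than `threshold` messages per FTE."""
    return {edge: rate for edge, rate in rates.items() if rate >= threshold}


rates = messages_per_fte(edge_weights, fte)   # uses min(sender, receiver)
kept = filter_edges(rates, threshold=50)
```

Here the Large-to-Small edge survives (600 / 10 = 60 per FTE) even though its raw count is the lowest, because it matters to the small unit. Passing `combine=max` would instead require the edge to be heavy relative to the larger unit.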
You could also normalize the data by calculating the number of messages sent from one unit to another unit as a percentage of all messages sent from the unit. This approach accounts for differences in a unit's overall email usage patterns, which can be desirable in some cases. For example, it would remove edges representing company announcements from a single unit (e.g., the human resources or information technology department) because the messages sent to any one unit would be a small percentage of the sending unit's overall sent messages. As with the prior example, you will need to decide if you want to use the sending or receiving unit's total message count as the denominator, or some combination of the two (e.g., maximum, minimum, average).
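The percentage-based alternative might look like the sketch below, which uses the sending unit's total as the denominator. The units and counts are hypothetical, and the 50% cutoff is chosen only to make the effect visible.

```python
from collections import defaultdict

# Hypothetical counts: HR broadcasts evenly, Unit A mostly talks to Unit B.
edge_weights = {
    ("HR", "A"): 100, ("HR", "B"): 100, ("HR", "C"): 100, ("HR", "D"): 100,
    ("A", "B"): 90, ("A", "HR"): 10,
}


def share_of_sent(edge_weights):
    """Express each edge as a fraction of all messages sent by its sender."""
    total_sent = defaultdict(int)
    for (src, _dst), count in edge_weights.items():
        total_sent[src] += count
    return {
        (src, dst): count / total_sent[src]
        for (src, dst), count in edge_weights.items()
    }


shares = share_of_sent(edge_weights)
# Assumed cutoff: keep edges carrying at least half of the sender's messages.
kept = {edge for edge, share in shares.items() if share >= 0.5}
```

Each HR edge carries only a quarter of HR's outgoing mail and is filtered out, while the A-to-B edge, although smaller in raw count, is retained.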
Other strategies for filtering data can lead to other insights. For example, showing only weak ties (edges with between 3 and 10 messages per FTE) can highlight lesser-known connections that might guide management efforts to improve connections across gaps in the company. Attributes of organizational units can be used to filter the network graph as well, helping to zoom into subsections of the larger graph. For example, you could remove all but the most central groups to reveal the network of core groups while hiding more peripheral groups. You can also focus on units within a particular department, geographic location, or similar mission. You will see this approach used in our second example (Figure 9.9), which looks at connections between research units of TechABC.
You may want to create an overview graph of an organization's email communication before moving into more detailed analyses of specific departments or groups. Overview graphs can be difficult to read because of their size. However, they are excellent for dynamic exploration: sorting on metric properties to identify important units, highlighting vertices of interest to see their connections on the graph, and using dynamic filters to home in on specific areas of interest. A highly filtered overview of TechABC is shown in Figure 9.8. Only edges with > 50 messages per FTE are shown, with additional filtering to show only the main component. You can think of this as the backbone of the company.
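Restricting a graph to its main component can be sketched as a simple graph traversal. The example below treats edges as undirected for the purpose of finding components, which is an assumption made here for simplicity; the unit labels are hypothetical.

```python
# Hypothetical filtered edge list with one large and one small component.
edges = [("U1", "U2"), ("U2", "U3"), ("U3", "U1"), ("U8", "U9")]


def largest_component(edges):
    """Return the edges belonging to the largest connected component."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    # Both endpoints of an edge share a component, so checking one suffices.
    return [(a, b) for a, b in edges if a in best]


main_edges = largest_component(edges)
```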
This graph and the accompanying data tell interesting stories about the company. Overall, the graph is sparse, largely because of the high filtering threshold we have chosen. The graph density is very low, suggesting that most units only communicate heavily with one or two other units. The average geodesic distance is 10.2 and maximum geodesic distance (i.e., diameter) is 29, both of which are quite high. If high numbers existed at a lower threshold, it would suggest that units may not be well connected with certain other units on the other “side” of the company. Increasing connections between otherwise disconnected groups may be a goal for an organization. For example, many organizations have created “communities of practice” consisting of people with similar skills who are scattered throughout different organizational units. An initiative to increase connections throughout the company could be evaluated by looking for increases in the network density and decreases in the diameter over time.
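To make these whole-graph metrics concrete, the sketch below computes density, average geodesic distance, and diameter for a small undirected graph using breadth-first search. It assumes edges are treated as undirected and that averages are taken over connected pairs only; NodeXL's exact conventions may differ, and the toy graphs are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, start):
    """Shortest hop counts from `start` to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def graph_summary(edges):
    """Return (density, average geodesic distance, diameter)."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    n = len(adjacency)
    m = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    distances = [
        d
        for node in adjacency
        for other, d in bfs_distances(adjacency, node).items()
        if other != node
    ]
    average = sum(distances) / len(distances) if distances else 0.0
    return density, average, max(distances, default=0)


# A chain of four units, before and after adding one cross-cutting tie.
before = graph_summary([("A", "B"), ("B", "C"), ("C", "D")])
after = graph_summary([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")])
```

Adding the single A-to-D tie raises the density and shortens the diameter from 3 to 2, which is the kind of change an initiative to increase connections would aim to produce.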
In addition to looking at global trends, it is possible to look at the role of individual units in the company. Even with the highly filtered graph shown in Figure 9.8, it is possible to identify several hubs (with high out-degree), some densely connected clusters, and units that act as bridges between other units. Some of these fill critical locations in the network, demonstrating their unique value that results from their network position. Organizational units that are connected to many other units (i.e., the hubs) perform services like IT management or library services that touch many parts of the company. Groups that are less connected but have a high betweenness centrality likely have coordination functions, bridging information between multiple groups such as specific geographical units within a larger region. Isolated groups and clusters of groups are likely specialists that perform a function for one or a few other groups to consume. Analyzing networks like the one visualized in Figure 9.8 also allows you to compare units that serve a similar function to see how they compare on various metrics, helping to identify those that could benefit from additional connections.
Although overview maps like Figure 9.8 can be helpful, they can also be cluttered and may filter out too much of the detail for large companies. To gain actionable insights, you will typically need to focus on subsections of the network, such as units that serve a similar purpose (e.g., IT, marketing, research). In this section you will explore the organizational units within TechABC that have a research mission. They were identified by looking for organizational unit names with the word “research” in them using Microsoft Excel's Search function (a non-case-sensitive function similar to the Find function).
Although you can restrict the network analysis to only the core units that meet your criteria (i.e., research units), it is often insightful to include all units connected to the core units. For example, Figure 9.9 includes all research units (maroon boxes), as well as all units they sent messages to or received messages from (blue disks). A cutoff point of 10 messages per FTE was used in order to account for unit size differences. Because the focus is on the research units, connections between the non-research units are not shown. The result is a collection of all of the 1.0-degree networks of the research units. To create a similar graph, we added a new column called Research to the Edges worksheet that is a 1 if either Vertex1 or Vertex2 is a research unit and a 0 otherwise. Using the Autofill Columns feature, Edge Visibility can then be set to show only edges whose value in the Research column is equal to 1, excluding all other edges.
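The logic behind the Research column can be sketched as follows. The unit names are hypothetical stand-ins (the real names were anonymized), and the case-insensitive substring match is meant to mimic the behavior of Excel's Search function rather than reproduce it exactly.

```python
# Hypothetical edge list of (Vertex1, Vertex2) unit names.
edges = [
    ("Market Research 1", "Sales East"),
    ("Sales East", "Finance 3"),
    ("Finance 3", "RESEARCH Lab 2"),
]


def is_research(unit_name):
    """Case-insensitive check for the word 'research' in a unit name."""
    return "research" in unit_name.lower()


def research_flag(vertex1, vertex2):
    """1 if either endpoint is a research unit, 0 otherwise."""
    return 1 if is_research(vertex1) or is_research(vertex2) else 0


flags = [research_flag(v1, v2) for v1, v2 in edges]
visible = [edge for edge, flag in zip(edges, flags) if flag == 1]
```

The edge between the two non-research units receives a 0 and is hidden, which is why connections among non-research units do not appear in the resulting graph.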
The network highlights several important bridge-spanning units, as well as some disconnected units that may need to be connected. For example, research unit Specific 6 plays an important role in connecting several other research groups either directly or indirectly. The organizational unit Specific 10 is important because it is the only path connecting the large Specific 2 unit to the other research units (albeit indirectly). There are also several non-research groups that play pivotal bridge spanning roles, such as the very small unit just above General 17 that is connected to six different research groups, none of which is directly connected to another. This small unit likely plays an important role and its small size may make it vulnerable to employee turnover, suggesting that the company may consider if additional resources are needed to support the group's function. In contrast, the network shows Market Research 1 and 2 in completely different components, not even connected indirectly. More generally, few research units are directly connected to each other, suggesting that there may be potential for increased exchanges through employee swaps, internships, or other shared projects. This assumes there would be benefits from interdisciplinary projects, which content experts would need to determine. Although many of the actionable insights require knowledge about the organization, Figure 9.9 gives you some idea of the potential benefits of this type of analysis.
In the prior section you explored an organizational network from the perspective of an insider who knows the company. In this section you will explore an organizational network from the perspective of an outsider trying to make sense of an email corpus collected as part of a lawsuit. Specifically, you will explore a subset of email messages sent and received by Enron employees. The original, publicly available dataset included approximately a half-million messages and was made public by the Federal Energy Regulatory Commission (FERC) during the investigation of Enron. It was later cleaned and made permanently accessible by researchers at MIT, CMU, and SRI (see www-2.cs.cmu.edu/~enron for details). The analysis in this chapter is based on a subset of 1700 messages coded by students and researchers at the University of California at Berkeley, filtered to only include messages that are work related. It focuses on business-related messages occurring later in the collection and includes discussions of the California Energy Crisis (see http://bailando.sims.berkeley.edu/enron_email.html for a complete description and compressed file of the individual messages). Messages were downloaded, indexed, and imported into NodeXL using the process described earlier in this chapter. You can download the NodeXL files that correspond to the images shown in this section from https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/. The analysis is inspired by Jeffrey Heer's work [4].
One problem historians and lawyers face is identifying individuals who played key roles in important events. For employees that use email frequently, email networks provide a quick sense of who communicates with whom. Filtering email collections to include only those that use a particular keyword or set of words is a useful method for finding people related to some event.
You can see an example of this by analyzing the Enron email network of messages that include the term “FERC,” the commonly used acronym for the Federal Energy Regulatory Commission, an “independent agency that regulates the interstate transmission of natural gas, oil, and electricity” (see www.ferc.gov). To create this FERC network, you can use the NodeXL import tools, making sure to filter messages to include only those that have “FERC” in the body of the message. The resulting network includes 370 vertices representing employee email addresses and 672 weighted edges. This is a subset of the larger Enron message network tagged by UC Berkeley students, which includes 1803 edges and 1102 vertices. The total sent and received FERC messages8 are included on the Vertices worksheet (see Advanced topic: Calculating total sent and received edges), along with a column called %_Received, which equals Received/(Sent + Received).
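The Sent, Received, and %_Received columns can be derived from a weighted edge list as in the sketch below. The addresses and weights are invented for illustration, and the guard against a zero denominator is an added assumption for edge lists that might contain zero-weight rows.

```python
from collections import defaultdict

# Hypothetical weighted edges: (sender, recipient, number of messages).
weighted_edges = [
    ("a@example.com", "b@example.com", 3),
    ("c@example.com", "b@example.com", 2),
    ("b@example.com", "a@example.com", 1),
]


def sent_received_totals(weighted_edges):
    """Return per-address Sent, Received, and %_Received values."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for sender, recipient, weight in weighted_edges:
        sent[sender] += weight
        received[recipient] += weight

    rows = {}
    for address in set(sent) | set(received):
        s, r = sent[address], received[address]
        total = s + r
        rows[address] = {
            "Sent": s,
            "Received": r,
            # %_Received = Received / (Sent + Received)
            "%_Received": r / total if total else 0.0,
        }
    return rows


rows = sent_received_totals(weighted_edges)
```

An address that only sends messages has a %_Received of 0, while one that only receives has a value of 1.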
Once you calculate the graph metrics, you can use them to create a graph such as Figure 9.10 designed to highlight important individuals. The graph sets the size of each vertex based on in-degree, because those receiving FERC messages from many different individuals are likely “go to” people. Vertex color is based on the %_Received data, with greener vertices indicating that the individual received many messages but did not send out many. Individuals with something to hide may not send out messages, suggesting that focusing on the large green vertices may lead to potential violators. Indeed, one of these vertices represents Tim Belden, the head of trading in Enron Energy Services considered by many to be the mastermind of Enron's scheme to drive up energy prices in California. Belden pleaded guilty to one count of conspiracy to commit wire fraud as part of a plea bargain and ended up serving as a key witness against many top Enron executives.
Although visualizations like Figure 9.10 can help identify individuals worth following up on, they should be used cautiously. In this particular example, there are no messages sent from Tim Belden in the dataset, making it unclear if his high received ratio is due to his actual email usage patterns, purposefully deleted messages, or limitations of the original dataset. Even if the data accurately reflect actual email patterns, Figure 9.10 is imperfect in that it emphasizes many individuals aside from Tim Belden who were not accused of illegal activities. Furthermore, many of those found guilty of crimes were not included in this graph at all, perhaps because they recognized the liability of using email for sensitive communication or perhaps because of limitations in the dataset. Clearly, reading the content of the messages is of utmost importance. However, viewing the network can help identify individuals and messages of interest. Once an individual is known to be involved, mining email is an effective way to identify people with whom the suspect frequently interacts. For example, Figure 9.10 shows a strong connection from John Shelk to Tim Belden (and many other recipients), which is explained by the fact that John Shelk often reported on congressional meetings but rarely received replies to his reports. Integrating the content with network visualization tools can provide a powerful exploratory platform, as has been done with the Enron network dataset [4].
Email networks provide an intimate look into individuals' social and work relationships, making them of interest to managers, community analysts, historians, researchers, and legal professionals. Because email is frequently and widely used in professional contexts, it reliably captures important aspects of many professional relationships. There are three main types of email collections: personal, organizational, and community. An analyst's existing experience with a collection is also important and impacts the types of questions asked and amount of detail needed.
Working with email networks can be challenging. Large collections must often be filtered to a manageable size. Filtering can be based on time, sender/receiver, messages' content, folders or labels, or any combination. Combining duplicate email addresses for the same individual can be time intensive but is often necessary. Integrating email networks with corporate personnel data can be challenging and poses ethical considerations, but when done responsibly can provide new insights.
Personal and organizational email networks can be explored for insights or shared with others to provide an overview. These networks may be based on individuals and their connections or on organizational units and their connections. Analysis can uncover important individuals and relationships such as boundary spanners, central members, broadcasters, and unresponsive recipients. Tightly connected subgroups can be identified and their relationship to one another can be mapped. The impact of interventions or external shocks on the network can be tracked over time, and common structural patterns such as recurring social roles or types of subgroups (e.g., cliques, fans) can be identified. These analyses can lead to actionable insights by identifying people or departments that need more cross-fertilization, helping newcomers get an overview of the social structure they are entering into, evaluating the effectiveness of a new community of practice initiative, and much more.
The widespread use of email has fostered a growing community of researchers whose goals are to understand usage patterns so as to improve user interfaces and management tools. Researchers have focused largely on individual usage of email [5, 6], but they increasingly work on forensic tools to analyze another person's email or a group's email [3, 7]. A popular theme has been to improve the strategies for finding relevant documents in a large email collection [8, 9]. Exploration tools have built on the traditional keyword or key phrase search strategies, but increased attention to visualization tools has enabled users to get an overview of temporal patterns, relationships with individuals, or the social structure within groups [8, 10–13].
The many opportunities to improve on email analysis systems are generating increased research on these issues and an increasing demand for such tools from corporate human resources staff, forensic investigators, legal analysts, and social scientists. The ability to detect temporal changes, such as sharp increases/decreases in communication among certain people or about certain topics, is a valuable guide to analysts. Temporal changes might be visualized by simple timelines or by animated changes to network diagrams, assuming stable layouts are used. The formation and dissolution of subgroups signal important changes that are useful in applications as diverse as detecting rumor spreading (gossip), corporate reorganizations, or antecedents of important events. Tying email to geographical position or even location in office buildings could help us to understand social processes in organizations.