Chapter 6

Calculating and visualizing network metrics

Abstract

NodeXL provides access to a powerful set of quantitative network metrics. Aggregate network metrics characterize the network as a whole and include graph density, diameter, reciprocated vertex pair ratio, number of connected components, etc. Vertex metrics, also called centrality metrics, identify important and unique individuals. They include degree, in-degree, out-degree, betweenness centrality, eigenvector centrality, closeness centrality, PageRank, and clustering coefficient. These metrics, along with other attribute data (i.e., data describing the people or connections), can be mapped onto visual properties to create meaningful network visualizations. NodeXL also provides text analysis features and time series analyses, and identifies top items such as those who are followed the most on Twitter. Network metrics are illustrated using the ABCD network, the Marvel Cinematic Universe network connecting movies to characters, and a Twitter network surrounding the CSCW conference. Instructions on creating affiliation networks from a bimodal network are also provided.

Keywords

Centrality metrics; Degree; In-degree; Out-degree; Betweenness centrality; Eigenvector centrality; PageRank; Closeness centrality; Clustering coefficient; Density; Network diameter; Connected component; Vertex pair ratio; Time series

6.1 Introduction

When trying to understand networks, analysts often want to identify important vertices, locate subgroups, or get a sense of how interconnected a network is compared to other networks. Although visualization itself can help do this, it is often helpful to use the rich set of quantitative network metrics, also called network graph metrics, which have been developed by social network analysis researchers (Chapter 3).

Network graph metrics can describe an entire network, subgroups, or specific actors within a single network. Aggregate graph metrics such as network density can be used to systematically compare communities, helping analysts decide which communities are highly connected and which are sparse. Tracking aggregate graph metrics over time can determine the effectiveness of interventions on the network as a whole. For example, you would expect the total number of edges to grow, increasing the density of the graph, after a photo sharing activity designed to introduce people to those they don’t know.

Individual person-level metrics provide insights about a person’s position within the network, helping to identify important or “central” people. For example, network graph metrics help identify people who are bridge spanners or who are popular in a network. Once identified, analysts and managers can better know who to contact or influence or bring to the table when trying to implement new programs or gain broader understanding. Metrics can also be used to identify cliques or persistent social roles that show up in many communities. Understanding the mix of social roles that exist within a particular network can help analysts determine if they have a healthy mix of social types or who may be a good candidate to replace an outgoing leader.

NodeXL calculates several network graph metrics. Once calculated, you can use these metrics to change the visual display of your network graphs in powerful ways as shown in this chapter. You can also filter out vertices or edges based on network metrics as discussed later (Chapter 7).

6.2 ABCD network example

To better understand the meaning of each graph metric, start by opening the ABCD network visualized in the last chapter (or download it from https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/). This network was designed specifically to illustrate the differences between several key metrics. If you download the version online, the layout shown in the book will be reproduced since the vertex positions are locked in place - the Locked? column values are set to Yes (see Advanced topic: Using hidden layout columns in Chapter 4). It should also be set to an undirected network type. Instructions on importing a NodeXL file made on another device can be found at the end of Chapter 4.

6.3 Computing graph metrics

To calculate graph metrics, first click on the Graph Metrics button on the Analysis section of the NodeXL ribbon. This opens the Graph Metrics dialog (Figure 6.1). Select the metrics you want to calculate by checking the boxes next to them. Details about the metric that is selected (e.g., Vertex clustering coefficient) are shown in the box below (see Figure 6.1). Some metrics allow you to customize various options by clicking on the Options… button on the right-hand side. Check the boxes next to the metrics shown in Figure 6.1 and then click Calculate Metrics. Some of the graph metrics can take a while to calculate when working with large networks, so a status bar is used to show progress. NodeXL will create a new Overall Metrics worksheet and take you there to show summary information for the entire network. It also populates a set of Graph Metrics columns on the Vertices worksheet that shows vertex-specific metrics, such as centrality metrics.

Figure 6.1 The Graph Metrics dialog with checks next to relevant metrics.

6.3.1 Overall graph metrics

Navigate to the Overall Metrics worksheet, which summarizes some of the key properties of the entire network (Figure 6.2). These metrics include the following:

Figure 6.2 Part of the Overall Metrics worksheet showing data for the ABCD network.
  •  Graph type. Undirected or directed.
  •  Vertices. The number of total vertices (i.e., rows on the Vertices worksheet).
  •  Unique edges. The number of unique edges found in the Edges worksheet.
  •  Edges with duplicates. The number of repeated vertex pairs on the Edges worksheet. Duplicate vertex pairs may occur, for example, in a Twitter network when person A mentions person B in multiple tweets. Duplicate vertex pairs are treated as a single edge for most metrics, since NodeXL metrics currently do not support weighted networks. The ABCD network does not include any duplicate edges.
  •  Total edges. The number of total edges (i.e., rows on the Edges worksheet), which is the sum of the Unique Edges and Edges With Duplicates.
  •  Self-loops. The number of edges that connect a vertex with itself. A self-loop occurs when the edge list includes the same exact name in the vertex 1 and vertex 2 columns on the Edges tab (i.e., a person is connected to themselves). This may happen when, for example, in an email list a person replies to his or her own email. Self-loops are represented visually in the graph pane by a circular edge that comes out of a vertex and returns to that same vertex. Metrics do not include self-loops in their calculations unless otherwise stated in the description of the metric (e.g., see degree, in-degree, and out-degree).
  •  Reciprocated vertex pair ratio. This is only applicable for directed networks. It represents the proportion of connected vertex pairs that are connected by directed edges pointing in both directions. It is calculated as: (the number of vertex pairs that are connected to each other in both directions)/(the number of vertex pairs that are connected by one or two edges). In a Twitter network, where edges represent Mentions, a high Reciprocated Vertex Pair Ratio indicates that most of the time when a user mentions a second user, the second user also mentions the first user.
  •  Reciprocated edge ratio. This is also only applicable for directed networks. It represents the percentage of edges that are reciprocated (i.e., an edge that has a companion edge pointing in the opposite direction between the same two vertices). It is calculated as: (the number of edges that are reciprocated)/(the number of total edges). While this value is correlated with the Reciprocated Vertex Pair Ratio, it is not the same. It will typically be higher because a single reciprocated vertex pair consists of two reciprocated edges, making the numerator twice as high for the reciprocated edge ratio. While the denominator is also higher for the reciprocated edges calculation, it won’t be double that of the reciprocated vertex pair ratio.
  •  Connected components. The number of connected components (i.e., clusters of vertices that are connected to each other but separate from other vertices in the graph). In the ABCD network there is only one connected component because you can get from one vertex to all other vertices. If Ava was not connected to Ethan, then there would be two connected components (see Figure 6.3).
    Figure 6.3 The ABCD network showing graph metrics for each vertex. Degree is mapped to size (2–100), betweenness centrality is mapped to opacity (65–100), eigenvector centrality is a tooltip, and Shared_Connections are mapped to edge weight (1.5–5) and edge opacity (50–100).
  •  Single-vertex connected components. The number of isolated vertices that are not connected to any other vertices in the graph. There are no isolated vertices in the ABCD network. If Dmitri was not connected to Ava, he would not be connected to anyone in the network and would then become a single-vertex connected component.
  •  Maximum vertices in a connected component. The number of vertices in the connected component with the most vertices. This is equal to the number of vertices (13) in the ABCD network, because they are all part of the only connected component.
  •  Maximum edges in a connected component. The number of edges in the connected component with the most edges. This is equal to the number of edges (18) in the ABCD network, because they are all part of the only connected component.
  •  Maximum geodesic distance (diameter). The geodesic distance is the length of the shortest path between two people. If you think of the edges as roads and the vertices as houses, the geodesic distance would be the number of roads someone must take to get from one house to another, assuming that the person is traveling on the shortest path possible. The maximum geodesic distance, or diameter of a network, is the largest geodesic distance in the network, or the distance between the two vertices that are farthest from each other. In the ABCD network, this value is 4. For example, the shortest path between Liu and Kate is 4; similarly the shortest path from Camila to Ji-yoo, Hassan, Matt, and Kate is also 4. All other geodesic distances are smaller. For example, the shortest path between Gabe and Fay is 1.
  •  Average geodesic distance. This is the average length of the geodesic distances between all pairs of vertices. It gives a sense of how “close” community members are to one another. For example, in the ABCD network the value is 2.22, suggesting that many people in the network know others directly or through a friend of a friend. Interestingly, many large social networks retain a relatively small average geodesic distance. For example, the Facebook network showed an average geodesic distance of only 4.57 with over 1.59 billion Facebook users in 2016.1
  •  Graph density. The graph density is a number between 0 and 1 that indicates the proportion of possible edges that are realized. It is a measure of how interconnected the vertices are in the network. The specific formula for calculating graph density is: (number of actual edges in the network)/(number of possible edges in the network). For undirected graphs the numerator is multiplied by 2. The number of possible edges in the network is based on the total number of vertices (n) in the network. Specifically, it is n*(n − 1). For example, the numerator for the ABCD network is 36 (i.e., 2*18) since it is an undirected network with 18 edges. The denominator is 156 (i.e., 13*12) since there are 13 vertices. Thus, the Graph Density is 0.23 (36/156). If more of the 13 employees became connected, this number would increase. Larger social networks tend to have lower graph densities, all things being equal, so be careful comparing this metric across different networks.
  •  Modularity. The modularity metric is only calculated when working with subgroups, which are discussed in Chapter 7. Modularity measures how distinct different subgroups of vertices are from the rest of the network [1]. It can be used to measure the “quality” of the separation of vertices into subgroups. Networks with high modularity have dense connections between vertices that are part of the same subgroup (i.e., module), but sparse connections between vertices that are part of different subgroups. More specifically, modularity is the fraction of within-group edges minus the expected fraction of edges if edges were distributed at random. It will be a value between −1 and 1, where positive values indicate that the number of within-group edges is higher than the expected number of edges based on chance.
  •  NodeXL version. Indicates the version of NodeXL in use when metrics were calculated.
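
Several of the definitions in the list above can be checked on a small example. The following Python sketch computes graph density, the number of connected components, the diameter, and the average geodesic distance for an undirected edge list using breadth-first search. The six-vertex edge list and all function names are hypothetical and chosen only for illustration. This is not NodeXL's implementation, and averaging geodesic distances only over pairs that can reach each other is an assumption. The graph_density helper also reproduces the ABCD arithmetic from the text (13 vertices, 18 edges).

```python
from collections import deque


def graph_density(n_vertices, n_edges, directed=False):
    """Density = realized edges / possible edges, as defined in the text."""
    numerator = n_edges if directed else 2 * n_edges
    return numerator / (n_vertices * (n_vertices - 1))


def build_adjacency(edges):
    """Undirected adjacency sets; duplicate edges and self-loops are ignored."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def bfs_distances(adj, source):
    """Geodesic distance from source to every vertex it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def overall_metrics(edges):
    adj = build_adjacency(edges)
    unique_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    components, seen, lengths = [], set(), []
    for v in adj:
        dist = bfs_distances(adj, v)
        # Each reachable pair is visited from both ends; the average is unaffected.
        lengths.extend(d for d in dist.values() if d > 0)
        if v not in seen:
            components.append(set(dist))
            seen |= set(dist)
    return {
        "vertices": len(adj),
        "unique_edges": unique_edges,
        "connected_components": len(components),
        "diameter": max(lengths),
        "average_geodesic_distance": sum(lengths) / len(lengths),
        "density": graph_density(len(adj), unique_edges),
    }


# Hypothetical network: one four-person component and one separate pair.
edges = [("P", "Q"), ("Q", "R"), ("R", "S"), ("P", "R"), ("T", "U")]
metrics = overall_metrics(edges)
abcd_density = graph_density(13, 18)  # ABCD values taken from the text
print(metrics)
print(round(abcd_density, 2))
```

Because the hypothetical network has two components, the diameter reported here is the largest distance found inside any single component.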

In addition, a frequency chart is created for each of the possible vertex-specific graph metrics. These frequency charts are particularly helpful when analyzing large networks. Some basic statistics about the metric distributions are shown under the charts (e.g., minimum degree, maximum degree, average degree, and median degree). These help characterize the entire network and allow for comparisons over time or across networks.

Advanced topic

Calculating and importing additional graph metrics

Numerous network metrics exist in addition to those calculated by NodeXL. Furthermore, new metrics are constantly being developed. Additional aggregate metrics can be calculated using Excel’s built-in functions. For example, some analysts like to look at the variance of the degree, which can be calculated by using the function: =VAR.P(Vertices[Degree])
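
For readers who export the Vertices worksheet and work outside Excel, the same calculation can be sketched in Python. Excel's VAR.P is a population variance, which corresponds to statistics.pvariance in the Python standard library. The degree values below are hypothetical stand-ins for the Degree column, not the ABCD data.

```python
import statistics

# Hypothetical degree values standing in for the Vertices[Degree] column.
degrees = [1, 2, 2, 3, 6]

# Population variance, the counterpart of Excel's VAR.P.
degree_variance = statistics.pvariance(degrees)

# The same summary statistics NodeXL lists under its frequency charts.
summary = {
    "minimum": min(degrees),
    "maximum": max(degrees),
    "average": statistics.mean(degrees),
    "median": statistics.median(degrees),
}
print(degree_variance, summary)
```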

Other aggregate metrics capture aspects of the degree distribution. For example, network centralization measures how much the network depends on key people for its connectivity. Many of these aggregate metrics can be calculated in NodeXL using formulas.

Some specialized graph metrics are not currently calculated in NodeXL, such as centrality metrics that use edge weights (see Newman’s thoughtful review [2] for a comprehensive discussion of network metrics). Researchers who need such metrics may use other network analysis tools such as Pajek or UCINET to calculate them and import them into NodeXL as additional columns. This allows all of the advanced visualization features of NodeXL while still providing more network metrics.

6.3.2 Vertex-specific metrics

The different vertex-specific metrics, also called centrality metrics, help identify who is “important” or “central” to a network. Of course, people are important in different ways. Some may have the most direct connections, while others may be important bridge spanners who connect otherwise disparate parts of the network. Each centrality metric captures a different aspect of importance as described below.

To see the vertex-specific metrics navigate to the Vertices worksheet. You will see the new Graph Metrics columns, which can be hidden later if desired by unchecking Graph Metrics from the Workbook Columns button on the NodeXL ribbon. Each value relates directly to one of the vertices. For example, row 4 shows the graph metrics that are specific to Ava (Figure 6.3).

Vertex metrics can be mapped onto visual attributes (Figure 6.3), which you can recreate by using the Autofill Columns feature found in the NodeXL Visual Properties menu ribbon. The graph legend shows that degree (1–6) is mapped to size and betweenness centrality is mapped to opacity. Edge weight and opacity are also mapped to Shared_Connections (see Chapter 5). In addition, eigenvector centrality is mapped to the tooltip (see Ethan’s score in Figure 6.3) and the labels are set to Vertex and positioned so they don’t cross edges. A description of each metric and how it relates to the ABCD network is provided below.

It is often useful to sort the spreadsheet columns based on graph metrics. For example, the rows in Figure 6.3 are sorted based upon data in the Degree column as indicated by the downward pointing arrow inside the Degree drop-down menu.

Degree

The degree of a vertex (sometimes called degree centrality) is a count of the number of unique edges that are connected to it. Fay has a degree of 6 because she is directly connected to 6 other individuals. In comparison, Kate has a degree of only 1 because she is connected to only one other person. If the edges represented strong friendship connections between employees at ABCD, we might say that Fay is the most popular person in the network and Kate is one of the least popular. If you were analyzing a directed graph, the single degree metric would be split into two metrics: (1) In-degree, which measures the number of edges that point toward the vertex of interest (i.e., the number of people who have endorsed the person), and (2) Out-degree, which measures the number of edges that the vertex of interest points toward (i.e., the number of people the person has endorsed). In the ABCD network, NodeXL only calculates degree since the network is specified as containing undirected ties.
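
To make the directed case concrete, the following Python sketch counts in-degree and out-degree from a directed edge list and, from the same list, computes the two reciprocity ratios defined in Section 6.3.1. The edge list and function name are hypothetical. Dropping duplicate edges and self-loops before counting is a simplifying assumption and may differ from how NodeXL treats them.

```python
from collections import Counter


def directed_metrics(edges):
    # Keep each (source, target) pair once and drop self-loops (an assumption).
    unique = {(s, t) for s, t in edges if s != t}
    in_degree = Counter(t for _, t in unique)
    out_degree = Counter(s for s, _ in unique)

    # An edge is reciprocated if its mirror image is also present.
    reciprocated_edges = {e for e in unique if (e[1], e[0]) in unique}
    connected_pairs = {frozenset(e) for e in unique}
    reciprocated_pairs = {frozenset(e) for e in reciprocated_edges}

    return {
        "in_degree": dict(in_degree),
        "out_degree": dict(out_degree),
        "reciprocated_vertex_pair_ratio": len(reciprocated_pairs) / len(connected_pairs),
        "reciprocated_edge_ratio": len(reciprocated_edges) / len(unique),
    }


# Hypothetical "mentions" network; the last row repeats the first.
edges = [
    ("A", "B"), ("B", "A"),   # A and B mention each other
    ("A", "C"), ("B", "C"),   # one-way mentions of C
    ("C", "D"), ("D", "C"),   # C and D mention each other
    ("A", "B"),               # duplicate edge
]
result = directed_metrics(edges)
print(result)
```

In this example two of the four connected vertex pairs are reciprocated (0.5), while four of the six unique edges are reciprocated (about 0.67), which illustrates why the edge ratio is typically the higher of the two.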

Betweenness centrality

Although popularity is important, it is not everything. Betweenness centrality is a measure that captures a completely different type of importance: the extent to which a certain vertex lies on the shortest paths between other vertices. In other words, it helps identify individuals who play a “bridge spanning” role in a network. Consider Ethan in the ABCD network. He is directly connected to only four people (i.e., he has a degree of 4). Despite his relatively low degree, his position as a “bridge” between Ava (and indirectly all those who Ava is connected to) and the rest of the group may be of utmost importance. If, for example, information were passed from one person to another, Ethan and Ava would be vital for ensuring that Dmitri, Liu, Camila, and Ben could communicate with the rest of the group. In fact, if either Ethan or Ava were removed from the network, those four individuals would be entirely disconnected from the other employees. Thus, Ava and Ethan have high betweenness centrality. In contrast, Dmitri and others on the edge of the network have a betweenness centrality of 0. Even Gabe, who has a degree of 5 and is in the center of the graph, has a relatively low betweenness centrality (6.5) because so many of his edges connect people who are already connected through others. In NodeXL, betweenness centrality scores are doubled for directed networks, though the “shortest paths” do not consider directionality in the calculation.
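
The idea of counting shortest paths that pass through a vertex can be sketched in Python. The example below uses Brandes' algorithm, a standard technique for unweighted graphs; the text does not say how NodeXL computes the metric, so this should be read as an illustration rather than NodeXL's method. The six-vertex network is hypothetical: A and E act as bridges in a way loosely similar to Ava and Ethan. Scores are unnormalized, and each pair of vertices is counted once, which is an assumption.

```python
from collections import deque


def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # vertices in order of distance from s
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest vertices inward.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every pair was processed from both endpoints, so halve the totals.
    return {v: value / 2 for v, value in score.items()}


# Hypothetical network: B and C hang off A; E links A to the F-G pair.
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
betweenness = betweenness_centrality(adjacency(edges))
print(betweenness)
```

Here A lies on the shortest path of seven vertex pairs and E on six, while the peripheral vertices score 0, mirroring the pattern described for Ava, Ethan, and Dmitri.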

Closeness centrality

Another characteristic you may care about is how close each person is to the other people in the network. If information needed to flow through the network, some people would be able to get a message to all the other people relatively quickly (i.e., in few steps), whereas others may require many steps. Closeness centrality is a measure of the average shortest distance from each vertex to each other vertex. Specifically, it is the inverse of the average shortest distance between the vertex and all other vertices in the network. The formula is 1/(average distance to all other vertices). The inverse is used so that a higher closeness centrality indicates a more desirable centrality score (i.e., a shorter average distance to other vertices). For example, in the ABCD network, Ethan has the highest closeness centrality score, because he sits right in the “middle” of the network, not too far away from those in the top half of the network and not too far away from those in the bottom half of the network. In contrast, Dmitri, Liu, Camila, and Ben have the lowest closeness since they are so far removed from the majority of the other vertices. In NodeXL, closeness centrality assumes an undirected network, though it shows the same results for directed networks.
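
The formula 1/(average distance to all other vertices) can be sketched in Python as follows. The network and function names are hypothetical, and the sketch assumes a connected, undirected graph so that every vertex can reach every other vertex.

```python
from collections import deque


def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(adj):
    """1 / (average geodesic distance to all other vertices)."""
    result = {}
    for v in adj:
        distances = [d for u, d in bfs_distances(adj, v).items() if u != v]
        result[v] = len(distances) / sum(distances)
    return result


# Hypothetical connected network.
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
closeness = closeness_centrality(adjacency(edges))
print(closeness)
```

Vertices A and E, which sit in the middle of this small network, receive the highest scores, while the peripheral vertices B and C receive the lowest.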

Eigenvector centrality

In many cases, a connection to a popular individual is more important than a connection to a lone individual. The eigenvector centrality network metric takes into consideration not only how many connections a vertex has (i.e., its degree), but also the centrality of the vertices that it is connected to. Intuitively, it considers not just “how many people you know,” but also “who you know.” For example, in the ABCD network, Gabe has the highest eigenvector centrality (0.169) because his degree is relatively high (5), but also because those he connects to have high eigenvector centrality scores (e.g., Fay, Ji-yoo, Ethan, Hassan, and Ishita). In contrast, Ava has the same degree (i.e., number of connections) as Gabe, but those she connects with don’t have high eigenvector centrality scores since they have so few connections. As a result, Ava has a low eigenvector centrality score (0.043). In NodeXL, eigenvector centrality assumes an undirected network, though it shows the same results for directed networks.
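
The "who you know" idea can be illustrated with power iteration, a common way to approximate eigenvector centrality. The sketch below is an assumption-laden illustration, not NodeXL's implementation: the network is hypothetical, the scores are scaled to sum to 1 (a normalization chosen here for readability), and each vertex's own score is added at every step so the iteration also settles on bipartite graphs.

```python
def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    scores = {v: 1.0 / len(adj) for v in adj}
    for _ in range(max_iter):
        # A vertex's new score is its own score plus its neighbors' scores.
        updated = {v: scores[v] + sum(scores[u] for u in adj[v]) for v in adj}
        total = sum(updated.values())
        updated = {v: s / total for v, s in updated.items()}
        if max(abs(updated[v] - scores[v]) for v in adj) < tol:
            return updated
        scores = updated
    return scores


# Hypothetical network: A and E both have degree 3, but E's neighbors
# are better connected than A's.
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
eigenvector = eigenvector_centrality(adjacency(edges))
print({v: round(s, 3) for v, s in sorted(eigenvector.items())})
```

A and E have the same degree, yet E scores higher because its neighbors F and G are themselves connected, whereas A's neighbors B and C have no other ties. This parallels the contrast between Gabe and Ava in the ABCD network.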

PageRank

The PageRank centrality metric is best known as the core metric behind Google’s search engine [3]. It is related to eigenvector centrality, but is designed for directed networks such as the World Wide Web. PageRank includes three distinct factors that determine the ultimate values for each vertex: (1) the number of vertices that link to the target, (2) the PageRank centrality of the linking vertices, and (3) the link propensity of the linking vertices. Consider a specific vertex representing a webpage called PageX. According to factor 1, the PageRank of PageX will increase if more vertices (i.e., websites) link toward it (i.e., it has a high in-degree). According to factor 2, the PageRank of PageX will increase if those who link to it have high PageRank themselves. This means that links are not all created equal. On the web, a link from cnn.com will increase a webpage’s PageRank far more than a link from a local blogger with a small following. According to factor 3, the PageRank of PageX will increase if those who link to it don’t link to many other vertices. In other words, links coming from “selective” linkers (those who only link to a small number of vertices) are more valuable than those coming from “frequent” linkers (those who link to a large number of vertices). In NodeXL, PageRank assumes a directed network, though it shows the same results for both directed and undirected networks. This network metric is not useful for the undirected ABCD network, but is useful for other networks such as the Wikipedia page-to-page directed network (Chapter 14).
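
Factor 3 is the least intuitive of the three, so the following Python sketch isolates it. It uses the textbook power-iteration form of PageRank with a damping factor of 0.85; the damping value, the handling of vertices without outgoing links, and all vertex names are assumptions made for illustration, and the text does not state which settings NodeXL uses.

```python
def pagerank(edges, damping=0.85, max_iter=200, tol=1e-12):
    nodes = sorted({v for edge in edges for v in edge})
    n = len(nodes)
    out_links = {v: set() for v in nodes}
    for source, target in edges:
        if source != target:
            out_links[source].add(target)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by vertices with no outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        updated = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for source in nodes:
            for target in out_links[source]:
                # A linker splits its rank equally among the pages it links to.
                updated[target] += damping * rank[source] / len(out_links[source])
        if max(abs(updated[v] - rank[v]) for v in nodes) < tol:
            return updated
        rank = updated
    return rank


# Hypothetical web: a "selective" linker points only to PageX, while a
# "frequent" linker points to PageY and two other pages.
edges = [
    ("Selective", "PageX"),
    ("Frequent", "PageY"),
    ("Frequent", "Other1"),
    ("Frequent", "Other2"),
]
ranks = pagerank(edges)
print({v: round(r, 3) for v, r in ranks.items()})
```

PageX and PageY each receive exactly one link from equally ranked sources, yet PageX ends up with the higher score because its linker does not divide its rank among several targets.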

Clustering coefficient

In some cases, a person’s friends may be friends with each other. For example, Hassan's three friends Ji-yoo, Gabe, and Fay are all directly connected to one another, creating a clique. More generally, a clique or complete graph occurs when all vertices in a group are directly connected to each other. In other cases, a person’s friends may not be friends with one another. For example, none of Ava’s friends are connected to each other. The clustering coefficient measures how connected a vertex’s neighbors are to one another. More specifically, it is calculated as: (the number of edges connecting a vertex’s neighbors)/(the total number of possible edges between the vertex’s neighbors). For example, Ishita’s neighbors include Ethan, Gabe, and Ji-yoo. There are two edges connecting those three individuals (Ji-yoo to Gabe; Gabe to Ethan). However, there are three possible edges between them (those mentioned plus Ethan to Ji-yoo). This results in a clustering coefficient of 2/3 or 0.667 as shown in Figure 6.3. The value will always be between 0 and 1, since it is the proportion of possible edges that are realized. It is the same formula as the overall network density, but only calculated on a subset of vertices.
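
Ishita's example can be reproduced with a few lines of Python. The sketch below uses only the five ABCD edges named in this section (Ishita's three ties plus the two ties among her neighbors), which is all that her clustering coefficient depends on. The function name is hypothetical, and returning 0 for vertices with fewer than two neighbors is an assumption.

```python
from itertools import combinations


def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj, vertex):
    """Edges among a vertex's neighbors / possible edges among them."""
    neighbors = adj[vertex]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    realized = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    possible = k * (k - 1) / 2
    return realized / possible


# Fragment of the ABCD network described in the text.
edges = [
    ("Ishita", "Ethan"), ("Ishita", "Gabe"), ("Ishita", "Ji-yoo"),
    ("Ji-yoo", "Gabe"), ("Gabe", "Ethan"),
]
ishita = clustering_coefficient(adjacency(edges), "Ishita")
print(round(ishita, 3))
```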

6.4 Marvel cinematic universe network example

Network metrics must be interpreted differently depending upon the nature of the network. So far, we have examined a traditional network where the vertices represent people, and the edges represent direct connections between those people. However, many interesting networks connect people to things they are affiliated with (e.g., clubs, wiki pages they have edited, Facebook groups, classes). To better understand these “affiliation networks” and the meaning of network metrics associated with them, you will explore the Marvel Cinematic Universe affiliation network. Download the raw data in the file named Marvel_Movie_to_Character_Raw.xlsx found at https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/. The network connects Marvel Universe movies to key characters that were in the movies. Data for the network was culled from the Marvel Cinematic Universe Wiki.2 Appearances in post-credit or deleted scenes are not included. As you will see, this bimodal network can be transformed into two unimodal networks: a character-to-character and a movie-to-movie network (see Advanced topic: Transforming a bimodal affiliation network into two unimodal networks).

6.4.1 Visualizing and interpreting metrics in a bimodal network

Start by looking at the available data on the Edges and Vertices worksheets. On the Edges worksheet, Vertex 1 includes the names of movies and Vertex 2 includes the names of key characters that appeared in those movies. The Vertices worksheet includes additional data for each vertex (see Figure 6.4) including the Type and corresponding Type_Code (1 is a character and 0 is a movie), Release_Date, Phase (a phase is a set of movies related to each other thematically and chronologically), IMDB_Score (average rating out of 10), Metascore (average rating out of 100), US_Opening (millions of dollars made in the opening weekend in the United States), Worldwide_Opening (millions of dollars made in the opening weekend worldwide), and URL (link to the IMDB page for the movie). Calculate the same Graph Metrics as you did for the ABCD network (Figure 6.1) since this network is also an undirected network.

Figure 6.4 Marvel Cinematic Universe network connecting Marvel movies to key characters that appear in those movies.

Advanced topic

Transforming a bimodal affiliation network into two unimodal networks

Bimodal affiliation networks like the Marvel Cinematic Universe network can be transformed into two single-mode networks: a person-to-person network (i.e., character-to-character network) and an affiliation-to-affiliation network (i.e., movie-to-movie network). The size of the new networks will depend on the number of people or affiliations. For example, in the Marvel network, there are 41 different characters, but only 17 movies.

A person-to-person network created from affiliation data represents an indirect relationship between people. When the Marvel dataset is transformed into a character-to-character network, two characters will share an edge if they have been in a movie together. Furthermore, the edges will be weighted based on the number of movies they have been in together. For example, Iron Man and Pepper Potts have been in six movies together, so they have an edge weight of 6. You may call this the co-appearance network. Similar networks can be created from social media channels, such as a co-author network connecting Wikipedia editors who have co-authored the same pages (see Chapter 14); or a co-commenter network connecting YouTube commenters who commented on the same videos (see Chapter 13).

Affiliation-to-affiliation networks also create weighted networks. When the Marvel dataset is transformed into a movie-to-movie network, the weighted edges connecting movies are based on the number of shared characters in those movies. For example, Avengers: Infinity War shares 10 key characters with Captain America: Civil War. Other comparable networks connect Wikipedia pages based on the number of people who have edited both of them (see Chapter 14); or YouTube videos based on the number of people who have commented on both of them (see Chapter 13). The topic of the Wikipedia pages or YouTube videos may be completely different, but the people who contribute to them are the same. Thus, connections can link content together based on social structures, not direct linking between content. This type of inferred relationship is what serves as a basis for recommender systems such as Amazon's “Customers Who Bought This Item Also Bought” feature that relates books to other books based on the number of people who were “affiliated” with (i.e., purchased) both books.

Transforming a bimodal network into a person-to-person or affiliation-to-affiliation network typically requires the use of matrices or complex SQL queries. Some network packages, such as UCINET, will do this conversion for you [4]. Alternatively, it can be done for reasonably sized networks using Excel's built-in Pivot Table feature and SumProduct function. An example file called Marvel_Affiliation_Matric_Example.xlsx shows how to create the character-to-character and movie-to-movie networks from the bimodal Marvel Cinematic Universe dataset. It is available at https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/.
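
For readers comfortable with scripting, the same transformation can be sketched in Python without matrices. The example counts, for every pair of people, how many affiliations they share, and for every pair of affiliations, how many people they share. The movie and character names below are hypothetical placeholders rather than rows from the Marvel dataset, and the function name is illustrative.

```python
from collections import defaultdict
from itertools import combinations


def project_bimodal(edges):
    """edges holds (affiliation, person) pairs, like Vertex 1 and Vertex 2.

    Returns two weighted edge dictionaries: person-to-person and
    affiliation-to-affiliation, keyed by alphabetically ordered pairs.
    """
    members = defaultdict(set)       # affiliation -> people in it
    memberships = defaultdict(set)   # person -> affiliations joined
    for affiliation, person in edges:
        members[affiliation].add(person)
        memberships[person].add(affiliation)

    def co_occurrence(groups):
        weights = defaultdict(int)
        for items in groups.values():
            for a, b in combinations(sorted(items), 2):
                weights[(a, b)] += 1
        return dict(weights)

    # Pairs of people within each affiliation; pairs of affiliations per person.
    return co_occurrence(members), co_occurrence(memberships)


# Hypothetical movie-to-character edge list.
edges = [
    ("Movie 1", "Hero A"), ("Movie 1", "Hero B"),
    ("Movie 2", "Hero A"), ("Movie 2", "Hero B"), ("Movie 2", "Hero C"),
    ("Movie 3", "Hero C"),
]
character_network, movie_network = project_bimodal(edges)
print(character_network)
print(movie_network)
```

Hero A and Hero B appear together in two movies, so their co-appearance edge has a weight of 2. Movie 1 and Movie 2 share two characters, so their edge also has a weight of 2.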

Before interpreting the graph metrics, create a more informative network visualization, such as the one shown in Figure 6.5. Because this is a bimodal network, it is important to make movies and characters easily distinguishable. To address this, use the Autofill Columns feature with values shown in Figure 6.5. This will make movies solid squares, while characters remain disks. Additionally, set the color based on Phase (i.e., clusters of related movies), size based on eigenvector centrality, and add vertex labels (see Figure 6.5). After you reposition the vertices for graph readability, the bimodal network can be understood much better.

Figure 6.5 Completed Marvel Cinematic Universe network visualization with Phase mapped to color, Type_Code mapped to Vertex shape, eigenvector centrality mapped to Vertex size, and labels shown.

Even with a clear visualization, examining the graph metrics can help highlight important vertices. You can sort on different network metrics to identify important characters or movies. Although the metrics are calculated in the same manner as they are calculated for unimodal networks, because it is a bimodal network, the interpretation is different.

Degree has the most intuitive interpretation. For example, Ant-Man (see row 3 in Figure 6.4) has a degree of 3, which means he is in 3 movies. In contrast, Avengers: Infinity War has a degree of 24 (see row 7 of Figure 6.4), indicating that there were 24 key characters in that movie. Sorting on Degree is an easy way to examine which movies include the most key characters (Avengers: Infinity War; Captain America: Civil War; Avengers: Age of Ultron; Avengers), and which characters were in the most movies (Iron Man; Black Widow; Captain America; War Machine; Thor; Pepper Potts).

Other metrics draw attention to different ways that movies or characters are important. As expected, Avengers: Infinity War has the highest betweenness, closeness, and eigenvector centrality given that it was designed to include all key characters from prior movies. However, these centrality metrics also highlight unique positions in the network. For example, Ant-Man shows up with a very high betweenness centrality, because he is the only connection to the Ant-Man 1 and Ant-Man and the Wasp movies and their corresponding characters. Closeness centrality and eigenvector centrality both rank Iron Man and Black Widow highly because they are connected to so many of the key movies with many characters. Interestingly, several movies with six or more key characters (e.g., Black Panther, Guardians of the Galaxy, Thor: The Dark World) have fairly low eigenvector and closeness centrality scores because they are connected to characters who do not show up in many other movies. Because this is an affiliation network, no movies directly connect to other movies, and no characters directly connect to other characters. As a result, the clustering coefficient is equal to 0 for all vertices.

6.4.2 Mapping graph metrics to X and Y coordinates

In most layouts, the exact location of the vertices is not meaningful; only their position relative to one another has meaning. However, you may want to map network graph metrics, or other attribute data, to X and Y coordinates to visualize how two metrics interact with one another. Other metrics can be used to adjust visual properties, making it possible to display additional dimensions. For example, Figure 6.6 maps the movies onto the X and Y coordinates based on the degree and betweenness centrality respectively, using color and size to indicate Phase and IMDB score.

Figure 6.6 Marvel movies mapping Degree to the X axis (logarithmic mapping), Betweenness Centrality to the Y axis (logarithmic mapping), Phase to color, and IMDB_Score to size (1.5–20). Axes and Legend are shown.

To recreate Figure 6.6, use the Autofill Columns feature. First, set the Vertex Label to Vertex so that the name of each vertex will be shown. Next, set Vertex X to Degree and Vertex Y to Betweenness Centrality, making sure to check the box that says Use a logarithmic mapping in the Options for both metrics. Set the Vertex Size to IMDB_Score (range from 1.5 to 20). Set Vertex Shape to Type_Code similar to the prior graph, so movies show up as a square. To only show movies, and not characters, make the Vertex Visibility based on Type_Code with the Options as shown in Figure 6.6. Finally, navigate to the Edges worksheet and enter Hide into all of the Edge Visibility cells. This will hide all of the edges from the graph, making it far more readable. You can make the Legend and Axes visible using the Graph Elements drop-down menu in the NodeXL Ribbon. Try creating a similar network showing the characters by changing the Vertex Visibility options to display characters instead of movies.
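To see why a logarithmic mapping spreads out skewed metrics, it can help to sketch the idea in code. The function below is an illustrative assumption about how such a mapping could work; it is not taken from NodeXL, whose exact formula and handling of zero values (common for betweenness centrality) may differ. The name log_map and the ranges used are hypothetical.

```python
import math


def log_map(value, src_min, src_max, dst_min, dst_max):
    """Map value from [src_min, src_max] to [dst_min, dst_max] on a log scale."""
    if value <= 0 or src_min <= 0:
        # A plain logarithm is undefined for zero or negative metric values.
        raise ValueError("logarithmic mapping needs positive values")
    if src_max == src_min:
        return dst_min
    fraction = (math.log(value) - math.log(src_min)) / (
        math.log(src_max) - math.log(src_min)
    )
    return dst_min + fraction * (dst_max - dst_min)


# Hypothetical metric range 1..100 mapped onto a coordinate range 0..1000.
low = log_map(1, 1, 100, 0, 1000)
middle = log_map(10, 1, 100, 0, 1000)
high = log_map(100, 1, 100, 0, 1000)
```

With a linear mapping, a value of 10 on a 1 to 100 scale would sit close to the low end of the axis; on the logarithmic scale it lands near the midpoint, which keeps the many low-scoring vertices from piling up in one corner.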

6.5 CSCW 2018 conference Twitter network example

The final example from this chapter will illustrate some of the metrics associated with directed networks that include textual data. Specifically, you will be analyzing the network of Twitter users who posted at least one tweet that included “cscw” (an acronym that currently stands for a community of researchers studying Computer-Supported Cooperative Work and Social Media). Tweets were gathered from September 1, 2018 until November 23, 2018, which included time both before and after the 2018 CSCW conference, which occurred November 3–7 in New York City. Even though tweets were not gathered until September 1, some older tweets are included, since they were retweeted or replied to during the data collection timeframe. In Chapter 11, you will learn how to import your own Twitter networks. For now, you can download the file CSCW_2018_Twitter_Raw.xlsx from https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/.

The CSCW network is a good example of an EventGraph, or a “social media network diagram of conversations related to events, such as conferences” [5]. Such graphs can help make sense of the conversations around an event, helping to identify key individuals, subgroups, and general properties of the network compared to others. You will explore the graph metrics as part of this chapter and then use the same network in Chapter 7 to illustrate the value of filtering and grouping to bring clarity to a large network.

The CSCW network is also a good example of a multiplex network (Chapter 3), since it includes three types of edges: Mentions, Replies to, and Tweet. After importing the network, browse through the Edges worksheet. Notice the Relationship column, which specifies the type of edge. If a user Mentions another Twitter user, then Vertex1 will include the sender and Vertex2 will include the user who was mentioned. If a user Replies to another user's tweet, then Vertex1 will include the sender and Vertex2 will include the person being replied to. These are directed edges that “point” from Vertex1 to Vertex2. Otherwise, if a person posts a Tweet that is not connected to any other tweets, the same username will show up in the Vertex1 and Vertex2 columns. Graphically, this creates a self-loop, which is a loop that starts and ends at the user's vertex. Additional columns on the Edges worksheet indicate the text of the tweet (Tweet), Tweet Date (UTC), Imported ID (a unique identifier for each tweet), and Edge Weight. It is important to remember that each row in the Edges worksheet does not necessarily map to a single tweet. For example, if UserA mentions UserB and UserC in a single tweet, then two rows will be added to the Edges worksheet. The Imported ID can be used to count the total number of unique tweets, as discussed later.
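The relationship between tweets and rows on the Edges worksheet can be sketched as follows. This is a simplified illustration of the rules described above, not the importer's actual logic; the function tweet_to_edges, the tweet identifiers, and the assumption that a reply that also mentions users produces both kinds of rows are all hypothetical.

```python
def tweet_to_edges(tweet_id, author, mentions=(), reply_to=None):
    """Return (Vertex1, Vertex2, Relationship, Imported ID) rows for one tweet."""
    rows = []
    if reply_to is not None:
        rows.append((author, reply_to, "Replies to", tweet_id))
    for user in mentions:
        rows.append((author, user, "Mentions", tweet_id))
    if not rows:
        # A tweet not connected to anyone becomes a self-loop.
        rows.append((author, author, "Tweet", tweet_id))
    return rows


edges = []
# One tweet that mentions two users yields two rows with the same ID.
edges += tweet_to_edges("t1", "UserA", mentions=["UserB", "UserC"])
# A standalone tweet yields one self-loop row.
edges += tweet_to_edges("t2", "UserD")

edge_count = len(edges)
unique_tweet_count = len({row[3] for row in edges})
```

Here three edge rows come from only two tweets, which is why counting unique Imported ID values, rather than rows, gives the number of tweets.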

The Vertices worksheet includes a row for each Twitter user in the network, a profile image, and details about the Twitter user such as the number of Followers they have on Twitter as a whole. All of the extra data is pulled in from the Twitter API when using the Twitter importers described in Chapter 11. Additional worksheets and metrics that have been calculated are explained below.

6.5.1 Calculating and interpreting directed network metrics

In contrast to the small networks you have examined so far, many social media networks include hundreds or thousands of Vertices. In such cases, graph metrics become particularly important since initial visualizations obscure much of the data. Furthermore, network metrics can be used to filter out less important people as described in Chapter 7. The symbiotic relationship of network metrics and network visualization is extremely powerful, though it is often not used to its full potential.

Because it can take a long time to calculate graph metrics for a network of this size, the relevant metrics have already been calculated in the file that you downloaded. Figure 6.7 shows the Graph Metrics settings that were chosen. The Options dialogs are displayed to indicate specific settings that were added. For example, the Overall Metrics options dialog was used to add Relationship as an edge type (Figure 6.7). This, in turn, creates new totals for each edge type (Mentions, Replies to, and Tweet) on the Overall Metrics worksheet (Figure 6.8) with counts of the number of edges for each. Notice that the Overall Metrics worksheet also includes many metrics that we haven't yet seen. For example, it shows the Number of Edge Types as 3 (i.e., Mentions, Replies to, Tweet). It also shows the Reciprocated Vertex Pair Ratio and the Reciprocated Edge Ratio since it is a directed network. There are 278 self-loops (the same number as there are Tweet edges, as expected). There are also many connected components (106), most of which are single-vertex connected components (65).
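Several of these overall metrics are simple to compute by hand on a small example. The sketch below counts self-loops, connected components, and single-vertex connected components for a hypothetical toy edge list; it ignores edge direction when finding components, which is an assumption about how such components are usually defined rather than a statement of NodeXL's internals.

```python
# Hypothetical directed edge list; ("c", "c") and ("f", "f") are self-loops.
edges = [("a", "b"), ("b", "a"), ("c", "c"), ("d", "e"), ("f", "f")]

vertices = {v for edge in edges for v in edge}
self_loops = sum(1 for u, v in edges if u == v)

# Union-find: every vertex starts as its own component.
parent = {v: v for v in vertices}


def find(v):
    """Return the representative of v's component, compressing the path."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


# Merge the endpoints of every edge, ignoring direction.
for u, v in edges:
    parent[find(u)] = find(v)

components = {}
for v in vertices:
    components.setdefault(find(v), set()).add(v)

component_count = len(components)
single_vertex_components = sum(1 for c in components.values() if len(c) == 1)
```

In this toy network the two users who only posted standalone tweets each form a single-vertex component, mirroring how the many isolated tweeters in the CSCW network inflate its component count.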

Figure 6.7 CSCW 2018 Network with directed graph metrics, time series, words and word pairs, and top items selected.
Figure 6.8 CSCW 2018 Network Overall Metrics results.

Because it is a directed network, all directed network metrics are chosen. Additionally, some undirected network metrics are chosen, such as eigenvector, betweenness, and closeness centrality. It is common for analysts to calculate such metrics, even for directed networks, but the interpretation of them is not exact. For example, betweenness centrality will still identify “bridge spanners,” but they may play that role because many disparate users mention them, or because they mention many disparate users. If you are looking for influencers, then identifying people with high In-Degree or PageRank, both of which are directed metrics, is more useful than identifying people with high Out-Degree or non-directed metrics such as Betweenness Centrality that may be driven by a person’s outbound links.

Two metrics not yet examined are edge reciprocation and the vertex-level reciprocated vertex pair ratio. Notice on the Edges worksheet there is a column called Reciprocated?. If the value is Yes, then another edge exists with the two users’ positions flipped. For example, there is a row where acm_cscw Mentions fcalefato. There is also a row where fcalefato Mentions acm_cscw. As a result, in both of those rows, the Reciprocated? column shows a Yes. A related metric on the Vertices worksheet is shown in the Reciprocated column. This shows the proportion of a user's vertex pairs that are reciprocated. It helps identify individuals who are primarily involved in conversations, since the people they reply to or mention also reply to or mention them. For example, some users such as farbandish (0.366), niloufar_s (0.361), and morganklauss (0.409) all had high values because they were actively participating in conversations. Not surprisingly, they also had both high In-Degree and Out-Degree. In contrast, individuals such as katestarbird (0.005) and snaglee2401 (0.027) have low reciprocation values, in this case because they were mentioned by many users (i.e., had high In-Degree), but mentioned or replied to relatively few (i.e., had low Out-Degree). In general, the more popular someone is, the more difficult it is to have high reciprocation scores. A good example of this is the acm_cscw account, which worked hard to mention and reply to 77 different users (i.e., Out-Degree is 77). However, because they were mentioned or replied to 380 different times, their reciprocation ratio is relatively low (0.129). This illustrates the importance of looking holistically at different metrics to fully understand a network.
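Both reciprocation measures can be illustrated with a few lines of code. The sketch below assumes that a vertex's ratio is the number of neighbors connected in both directions divided by the number of distinct neighbors in either direction, and that self-loops are ignored; these are assumptions for illustration and may not match NodeXL's exact calculation. The user names are hypothetical.

```python
# Hypothetical directed edges (Vertex1 -> Vertex2).
edges = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u4", "u1")]

# Collapse duplicate rows and drop self-loops.
pairs = {(u, v) for u, v in edges if u != v}


def is_reciprocated(u, v):
    """True if an edge also exists with the two users' positions flipped."""
    return u != v and (v, u) in pairs


def reciprocated_vertex_pair_ratio(vertex):
    """Share of a vertex's neighbors that are linked in both directions."""
    out_neighbors = {v for u, v in pairs if u == vertex}
    in_neighbors = {u for u, v in pairs if v == vertex}
    all_neighbors = out_neighbors | in_neighbors
    if not all_neighbors:
        return 0.0
    return len(out_neighbors & in_neighbors) / len(all_neighbors)


ratio_u1 = reciprocated_vertex_pair_ratio("u1")
ratio_u2 = reciprocated_vertex_pair_ratio("u2")
```

In this example u1 has three neighbors but a two-way connection with only one of them, so its ratio is one third, while u2's only neighbor reciprocates, giving a ratio of 1. This is the same effect that keeps popular accounts from reaching high reciprocation scores.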

6.5.2 Examining top items output

When Top Items metrics were calculated, a new worksheet called Top Items was created (Figure 6.9). This includes the metrics that were indicated in the Top Item Metrics Options dialog (Figure 6.7). The first list indicates the users in the network with the highest number of overall Twitter Followers. These can be thought of as global influencers. The second and third lists show the top 10 individuals based on In-Degree and PageRank, which help identify the local influencers, or people that the CSCW network is frequently mentioning and retweeting. This includes the official CSCW account (acm_cscw), researchers (e.g., katestarbird, informor), academic departments (vt_cs), and topics of discussion (warcraft, the account for the game World of Warcraft, which was the topic of a presentation). Additional lists can be created in the options dialog (Figure 6.7) for items such as the most common hashtags, URLs, words, people replied to or mentioned, or most active tweeters.
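A top items list based on In-Degree amounts to counting distinct incoming neighbors and sorting. The following sketch uses a hypothetical edge list and a hypothetical helper named top_by_in_degree; excluding self-loops from the count is a choice made for this illustration, not a documented NodeXL behavior.

```python
from collections import Counter

# Hypothetical directed edges; the duplicate row counts only once.
edges = [
    ("user1", "hub"), ("user2", "hub"), ("user2", "hub"), ("user3", "hub"),
    ("hub", "user1"), ("user3", "user1"), ("user4", "user4"),
]


def top_by_in_degree(edge_rows, n=10):
    """Return the n vertices with the most distinct incoming neighbors."""
    unique_pairs = {(src, dst) for src, dst in edge_rows if src != dst}
    in_degree = Counter(dst for _, dst in unique_pairs)
    return in_degree.most_common(n)


top_items = top_by_in_degree(edges, n=2)
```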

Figure 6.9 Top Items worksheet showing the top 10 users based on Followers, In-Degree, and PageRank in the CSCW 2018 Network.

6.5.3 Examining time series output

When Time Series metrics were calculated, the results were put into a new Time Series worksheet that includes a pivot table and associated graph (Figure 6.10). The graph shows the total number of unique tweets after they have been bucketed into days (since Days was chosen in the Time Series option box shown in Figure 6.7). Remember the total number of unique tweets is not the same as the total number of edges, since edges are often duplicates (e.g., if user1 mentions user2 and user3 in the same tweet, then two edges are created). However, since the Unique edges by this column was set to Imported ID (Figure 6.7), the graph and corresponding data represent unique tweet counts as desired.
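The effect of setting Unique edges by this column to Imported ID can be sketched as a deduplicate-then-bucket step. The rows, identifiers, and date format below are hypothetical, and the code is only an illustration of the idea, not the way NodeXL builds its pivot table.

```python
from collections import Counter
from datetime import datetime

# Hypothetical edge rows: Vertex1, Vertex2, Relationship, Tweet Date (UTC), Imported ID.
rows = [
    ("user1", "user2", "Mentions", "2018-11-03 14:05:00", "id1"),
    ("user1", "user3", "Mentions", "2018-11-03 14:05:00", "id1"),
    ("user4", "user4", "Tweet", "2018-11-03 18:20:00", "id2"),
    ("user2", "user1", "Replies to", "2018-11-04 09:30:00", "id3"),
]

seen_ids = set()
tweets_per_day = Counter()
for vertex1, vertex2, relationship, date_text, imported_id in rows:
    if imported_id in seen_ids:
        continue  # Same tweet already counted via another edge row.
    seen_ids.add(imported_id)
    day = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S").date()
    tweets_per_day[day] += 1

first_day = datetime(2018, 11, 3).date()
second_day = datetime(2018, 11, 4).date()
```

Four edge rows collapse to three unique tweets here: two on the first day and one on the second.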

Figure 6.10 Time Series worksheet showing a graph and pivot table grouped into Days with an additional slicer that allows you to filter what is displayed based on the Relationship column (Mentions, Replies to, Tweet).

Since Add a slicer for was set to the Relationship column (Figure 6.7), a filtering box is displayed. It allows you to filter based on Mentions, Replies to, and Tweets (Figure 6.10). If you click on one of the types of Relationship, it will filter the graph to only show tweets of that type. You can also choose multiple types. You can add different slicers (Figure 6.7) to examine other factors, such as location or time zone.

Advanced topic

Working with textual data

NodeXL includes the ability to perform text analysis, such as sentiment analysis (see Chapter 8 for a more detailed explanation). In Graph Metrics, you can choose Words and word pairs, and use the Options dialog to set up the type of analysis you desire (e.g., see Figure 6.7). For example, in the CSCW Network, a Sentiment Analysis was performed based on the default NodeXL settings. Sentiment analysis examines text to identify how “positive” or “negative” messages are. For example, positive messages may include words like “abundant,” “keen,” “lawful,” and “super.” Negative messages may include words like “abuse,” “mediocre,” “wrong,” and “rant.” Many frequent words, such as “a,” “the,” “ever,” can be skipped from the analysis (Figure 6.7). For the CSCW Network, the Tweet message content was used. However, you could run it on other data, such as emails, YouTube comments, Wikipedia page content, etc. Additionally, this tool can be used to measure things other than sentiment. Simply replace the words in each of the three categories on the Options page with other sets of words. For example, words could be classified into buckets that indicate different product lines or companies by adding words such as the names of Nike shoes vs the names of Adidas shoes.
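The word-list approach described above can be sketched in a few lines. The lists below contain only the example words quoted in this section, not NodeXL's full default lists, and the tokenization rule and the function name categorize are assumptions made for illustration.

```python
import re

# Example words quoted in the text; the real default lists are much longer.
positive_words = {"abundant", "keen", "lawful", "super"}
negative_words = {"abuse", "mediocre", "wrong", "rant"}
skip_words = {"a", "the", "ever"}


def categorize(message):
    """Count positive, negative, and non-categorized words in a message."""
    words = re.findall(r"[a-z']+", message.lower())
    counts = {"positive": 0, "negative": 0, "non-categorized": 0}
    for word in words:
        if word in skip_words:
            continue  # Frequent words are left out of the analysis.
        if word in positive_words:
            counts["positive"] += 1
        elif word in negative_words:
            counts["negative"] += 1
        else:
            counts["non-categorized"] += 1
    return counts


# Hypothetical message, not taken from the CSCW dataset.
result = categorize("The keynote was super but a mediocre room")
```

Swapping the two sets for other word lists, such as product names, would turn the same counting logic into the kind of custom classification described above.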

Running these metrics creates a new worksheet called Words and another called Word Pairs. The Words worksheet includes all words that occur more than once (if indicated in the Options), along with the number of times they occur (Count), the Salience (a measure of how often the word occurs compared to other words), and a TRUE or FALSE statement indicating if the word is in one of the predetermined lists in the Options dialog (i.e., positive or negative word lists). The top includes summary data for all Positive, Negative, and Non-categorized words. The Word Pairs worksheet includes similar data, but for pairs of words that show up in the same message (e.g., tweet). It also includes a Mutual Information column that indicates the strength of the connection between the two terms.

In addition to creating the new worksheets, the Words and Word Pairs feature also adds new columns to the far right of the Edges and Vertices worksheets. On the Edges worksheet, the new columns identify the number and percentage of words in each category (i.e., positive, negative, non-categorized), as well as a total word count (Edge Content Word Count). They are only calculated for unique messages (i.e., tweets), which explains why many values are blank in the CSCW Network. The Vertices worksheet reports similar content, but it is based on all messages associated with a user. Similar data can also be calculated for each group (Chapter 7).

A related but separate option is the Edge creation by shared content similarity feature. This feature allows you to create edges based on the similarity of the textual content used by two vertices. Instead of direct connections such as Mentions, these edges connect people who use similar words. Explanations for the use of this tool and details of the calculations are provided in the explanatory section of the Graph Metrics dialog.

6.6 Practitioner’s summary

Social network analysis provides a set of powerful quantitative graph metrics for understanding networks and the individuals and groups within them. These include aggregate network metrics such as graph density, diameter, reciprocated vertex pair ratio, and number of connected components, which characterize the network as a whole. They also include vertex metrics related to networks such as degree, in-degree, out-degree, betweenness centrality, eigenvector centrality, closeness centrality, PageRank, and clustering coefficient that can be used to identify unique or important people within a network. These metrics can be mapped onto visual properties such as size and opacity to help more easily make sense of the data, as was shown for the ABCD network. Affiliation networks, such as the Marvel Cinematic Universe network connecting movies-to-characters, have unique properties and their metrics must be interpreted carefully. Visualizations can combine calculated metrics (e.g., degree, betweenness centrality) with other attribute data (e.g., movie ratings; opening weekend proceeds) to gain insights into networks. NodeXL also provides text analysis features, time series analyses, and identifies top items when working with rich datasets such as the CSCW Twitter Network.

6.7 Researcher’s agenda

The network metrics in NodeXL are widely used because they reveal important properties of individuals in a network [6–10]. However, their computation can be slow, so research efforts on improved algorithms (e.g., [11]), parallelization of execution using multiple processors, and the use of specialized graphic co-processors to speed computation are important. Improved centrality metrics for different types of graphs, such as bimodal and weighted graphs, are also being actively explored (see [10] for initial attempts at some of these). The combination of natural language processing (i.e., text analysis) and social network analysis is providing promising results (e.g., [12]). Other metrics are regularly being created to help discover important vertices, edges, motifs, cycles, and other structural features, such as triangles, cliques, near-cliques, chains, holes, and more. Some are specific to certain platforms (e.g., Twitter [13]), while others are more generic.

References

[1] Newman M.E.J., Girvan M. Finding and evaluating community structure in networks. Phys. Rev. E. 2004;69:026113.

[2] Newman M.E.J. Mathematics of networks. In: Blume L.E., Durlauf S.N., eds. The New Palgrave Encyclopedia of Economics. second ed. Basingstoke: Palgrave Macmillan; 2008.

[3] Page L., Brin S., Motwani R., Winograd T. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab; 1999.

[4] Hanneman R., Riddle M. Chapter 17: Two Mode Networks. Introduction to Social Network Methods. Riverside, CA: University of California, Riverside; 2005 (published in digital form at http://faculty.ucr.edu/~hanneman).

[5] Hansen D.L., Smith M.A., Shneiderman B. EventGraphs: charting collections of conference connections. In: 2011 44th Hawaii International Conference on System Sciences; 2011:1–10.

[6] Bonacich P. Power and centrality: a family of measures. Am. J. Sociol. 1987;92(5):1170–1182.

[7] Freeman L.C. A set of measures of centrality based on betweenness. Sociometry. 1977;40:35–41.

[8] Freeman L.C. Centrality in social networks: conceptual clarification. Soc. Networks. 1979;1(3):215–239.

[9] Koschützki D., Lehmann K.A., Peeters L., Richter S., Tenfelde-Podehl D., Zlotowski O. Centrality indices. In: Brandes U., Erlebach T., eds. Network Analysis: Methodological Foundations. Springer-Verlag; 2005:16–61 LNCS 3418.

[10] Wasserman S., Faust K. Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences), Chapter 5. Cambridge, UK: Cambridge University Press; 1994.

[11] Riondato M., Kornaropoulos E.M. Fast approximation of betweenness centrality through sampling. Data Min. Knowl. Disc. 2016;30(2):438–475.

[12] Bermingham A., Conway M., McInerney L., O'Hare N., Smeaton A.F. Combining social network analysis and sentiment analysis to explore the potential for online radicalisation. In: Social Network Analysis and Mining, 2009. ASONAM'09; 2009:231–236.

[13] Bruns A., Stieglitz S. Towards more systematic Twitter analysis: metrics for tweeting activities. Int. J. Social Res. Methodol. 2013;16(2):91–108.

