Chapter 7

Grouping and filtering

Abstract

Making sense of large and complex networks requires filtering and grouping, as illustrated with the U.S. Senate co-voting network and CSCW Twitter network. Users can filter based on edge values, vertex metrics, and attribute data (e.g., Twitter followers). Even groups can be filtered out using the Visibility column. The Dynamic Filters feature allows you to interactively determine what is displayed on the graph by setting the position of sliders representing starting and ending ranges. Grouping vertices into subgroups (i.e., groups, clusters, communities, subgraphs) helps reveal important structures such as the Republican and Democratic divide. Subgroups can be created manually (e.g., using attribute data) or automatically (using one of NodeXL’s built-in community detection algorithms). They can be visualized using unique vertex color and shape combinations, positioned using the group-in-a-box technique, or collapsed into single units or summary network motifs. Subgraph images can be created for each vertex.

Keywords

Filtering; Dynamic filters; Group; Cluster; Clique; Subgroup; Community detection algorithm; Visibility; Subgraph images; Group in a box layout; Network motif

7.1 Introduction

Most real-world social media networks are large and messy, much like the CSCW 2018 Network you began to examine last chapter. Visualizing and making sense of large networks can be challenging, particularly if they are densely connected. In this chapter, you will learn several different strategies for analyzing and visualizing large network datasets.

One strategy for understanding large networks is to filter out information. Many criteria can be used to filter out vertices and/or edges. For example, vertices that have low centrality scores can be filtered out, leaving in only those that are most important in the network. Other data associated with vertices, such as age, country of origin, time zone, or number of Twitter followers, can be used to filter vertices. In this chapter you will filter out vertices to gain insights into the most important individuals in the CSCW 2018 Twitter network. Filtering can also be applied to edges. For example, if edges represent the number of email messages exchanged between two people, a network of “strongly” connected individuals may filter out those who have sent fewer than 10 messages to one another (see Chapter 9). In this chapter you will filter edges based on co-voting percentages between U.S. senators in the 115th Congress (2017–2018). Finally, filtering can also be an excellent method for exploring networks, particularly when using dynamic filtering tools such as those provided in NodeXL.

Another strategy for understanding large networks is examining groups of vertices. Many large networks are a complex combination of smaller groups or subgraphs. High school networks in the United States consist of subgroups of jocks, nerds, goths, and the like. Facebook networks are made up of clusters of family members, schoolmates, work colleagues, and other forms of association. Legislative bodies like the U.S. Congress contain two main political parties and numerous smaller coalitions. Identifying groups within a network and mapping their relationship to one another can be essential to making intelligent strategic decisions. Network analysis can help identify competing or complementary groups, potential allies to form a powerful group, and individuals who can connect you to a new group.

Social network analysis provides a set of tools for identifying and understanding groups, also called clusters or communities by network researchers. In the language of network analysis, clusters are pockets of densely connected vertices that are only sparsely connected to other pockets [1]. For example, Figure 7.1 shows a network consisting of three densely connected pockets (within the dotted circles) that are loosely connected to each other by only a few ties. One way to create groups is to associate vertices that have a shared attribute (e.g., people from California vs those from Maryland). However, often the most interesting groups are those that emerge from network connections, not formal group membership. Several algorithms exist that create these organic clusters based solely on network ties. For example, an analysis of a corporate email network (Chapter 9) can provide an authentic grouping of individuals based on communication patterns rather than formal reporting hierarchies. In this chapter you will learn to identify and visualize groups from several networks.

Figure 7.1 A network of three densely connected clusters (i.e., groups), each shown inside a dashed circle. Ties between clusters are rare and less dense.

7.2 U.S. Senate voting analysis

In this section you will analyze the voting patterns of U.S. senators, identifying clusters of senators connected together based on similar voting patterns. Chris Wilson of Slate magazine provided the original voting network data from 2007, which inspired us to create up-to-date versions for each of the 110th–115th Congresses, covering the years 2007 through 2018. All datasets are available on the book website. In this chapter, we will use the 115th Congress data, which covers the period from January 4, 2017 to December 21, 2018 (the last votes in the dataset when we pulled it) and includes a total of 599 roll call votes. You can download the source file named Senate115.xlsx at https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/. The source data was gathered from https://voteview.com [2]. The original data is not in a network format. Instead, the original data files include information on how each member of the Senate voted on each roll call. We transformed the data into a co-voting network as described below.

The Senate co-voting network is created from data that connects senators to one another based on the number of times they vote the same way (i.e., both voted yea or both voted nay on a bill; we do not draw an edge if they both abstained or were absent). For example, Senators Alexander (R, Tennessee) and Baldwin (D, Wisconsin) voted the same only 226 times (39.5% of the time), whereas Senators Alexander (R, Tennessee) and Barrasso (R, Wyoming) voted the same 538 times (94.1% of the time) (Figure 7.2). Clearly the two Republicans have a stronger connection to one another than the Democratic and Republican senators do. The network is undirected and is weighted based on the percent of similar votes (see the Voted_Same column in Figure 7.2). Using the raw number of similar votes can be problematic in cases where senators were frequently absent for votes (e.g., while campaigning). For this reason, the dataset also includes the total number of votes cast by each senator (Vertex1_Total & Vertex2_Total) and the percentage agreement (Percent_Agreement), which is calculated using the lowest of the two senators’ total votes as the denominator. The Vertices worksheet includes data about each senator including the senator’s party affiliation, the state the senator represents, the total number of votes the senator cast, and their unique ICPSR number (i.e., a unique identifier created by the Inter-university Consortium for Political and Social Research for each member of Congress) [2].
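The transformation from roll call records to co-voting edges can be sketched in Python. This is a minimal illustration, not the procedure used to build Senate115.xlsx: the vote lists are made up, the function name `co_voting_edges` is hypothetical, and only the column names and the rule of dividing by the lower of the two senators’ vote totals come from the description above.

```python
from itertools import combinations

# Made-up roll call records: "Y" = yea, "N" = nay, None = absent or abstained.
votes = {
    "Alexander": ["Y", "Y", "N", "Y", None],
    "Baldwin":   ["N", "Y", "Y", "N", "Y"],
    "Barrasso":  ["Y", "Y", "N", "Y", "Y"],
}

def co_voting_edges(votes):
    """Return one undirected edge per pair of senators who ever voted the same."""
    totals = {s: sum(v is not None for v in cast) for s, cast in votes.items()}
    edges = []
    for a, b in combinations(sorted(votes), 2):
        # Count roll calls where both cast the same yea or nay vote.
        same = sum(1 for x, y in zip(votes[a], votes[b])
                   if x is not None and x == y)
        if same == 0:
            continue  # no shared yea/nay vote, so no edge is drawn
        edges.append({
            "Vertex1": a,
            "Vertex2": b,
            "Voted_Same": same,
            "Vertex1_Total": totals[a],
            "Vertex2_Total": totals[b],
            # The denominator is the lower of the two senators' vote totals.
            "Percent_Agreement": same / min(totals[a], totals[b]),
        })
    return edges

edges = co_voting_edges(votes)
for e in edges:
    print(e["Vertex1"], e["Vertex2"], e["Voted_Same"], round(e["Percent_Agreement"], 3))
```

Using the lower vote total as the denominator keeps a frequently absent senator from appearing to disagree with everyone simply because there were fewer chances to agree.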

Figure 7.2 Unfiltered 2017–18 U.S. 115th Senate co-voting network showing all senators connected to one another. Other columns in the NodeXL Edges worksheet show the number of times each pair of senators voted the same and their percent agreement. “Raw” visualizations like this require refinement to display useful insights.

7.2.1 Filtering edges to identify groups within a network

When working with weighted networks such as the 115th Senate network, it is often necessary to filter out some of the edges to identify subgroups. Because every senator voted the same as every other senator at least once, choosing Show Graph results in an uninformative dark mass of connections (Figure 7.2). To make the graph more meaningful, you will want to show only the edges between senators who have a high level of agreement. In other words, you want to filter out edges where the Percent_Agreement is below a certain threshold (e.g., 64%).

Advanced topic

Visibility column options

Like other visual attribute columns, the Visibility column can be filled using Autofill Columns, populated using formulas, or manually by typing in the desired option. The following options are available:

  •  Show if in an edge. If the vertex is connected to anything else via an edge, show it. Otherwise, ignore the vertex row. This is the default setting.
  •  Skip. Skip the vertex row and any of its edges. Or, if on the Edges worksheet, skip the edge row that is selected. Do not read them into the graph. This essentially pretends the data is not in the spreadsheet. For example, if you choose to calculate Graph Metrics, all skipped vertices are excluded from the calculations. They will also not be part of Groups and will not affect the layout of other vertices.
  •  Hide. Hide the vertex and its edges from showing up in the graph pane. Or, if on the Edges worksheet, hide the specific edge that is selected. This is what the dynamic filters do. Unlike skipped vertices, hidden vertices and edges are included when network graph metrics are calculated and they affect the layout of all vertices in the graph pane, even if you can’t see them.
  •  Show. Show the vertex regardless of whether it is part of an edge. Or, if on the Edges worksheet, show the specific edge that is selected.
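The practical difference between Skip and Hide can be sketched in a few lines of Python. This illustrates the behavior described in the list above and is not NodeXL code; the edge rows and the helper names `read_into_graph`, `drawn`, and `degree` are hypothetical.

```python
from collections import Counter

# Hypothetical Edges worksheet rows with a Visibility value on each row.
edges = [
    {"Vertex 1": "A", "Vertex 2": "B", "Visibility": "Show"},
    {"Vertex 1": "A", "Vertex 2": "C", "Visibility": "Hide"},
    {"Vertex 1": "A", "Vertex 2": "D", "Visibility": "Skip"},
]

def read_into_graph(rows):
    """Skipped rows are never read, so they cannot affect metrics or layout."""
    return [r for r in rows if r["Visibility"] != "Skip"]

def drawn(rows):
    """Hidden rows are read into the graph but are not drawn in the graph pane."""
    return [r for r in read_into_graph(rows) if r["Visibility"] != "Hide"]

def degree(rows):
    """Count how many of the given edge rows touch each vertex."""
    counts = Counter()
    for r in rows:
        counts[r["Vertex 1"]] += 1
        counts[r["Vertex 2"]] += 1
    return counts

metrics = degree(read_into_graph(edges))
print(metrics["A"])        # counts the hidden edge but not the skipped one
print(len(drawn(edges)))   # number of edges that are actually displayed
```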

Begin by using the Autofill Columns feature to set the Vertex Label to the Vertex column (i.e., the senator’s last name). Set the Vertex Tooltip to State. Then navigate to the Shape column on the Vertices worksheet and change the values for all vertices to Label. This will make all senators’ names clearly visible and allow you to see their State on mouseover. Use the Edges tab on the Autofill Columns feature to set the Edge Opacity and Edge Visibility to Percent_Agreement. You’ll need to use the Options for each of them. Set the Edge Visibility Options so that the edge only shows up If the source column number is: Greater than 0.64 (Figure 7.3). Using the default settings, this will only show edges between two people who have an agreement level greater than 0.64, which is the average percentage agreement between all pairs of senators. See the Advanced topic: Visibility column options for more details. Set the Edge Opacity Options to those described in Figure 7.3. Finally, change the Color of the edges on the Edges worksheet to something more distinct than gray (e.g., 128, 128, 192). After refreshing the graph and positioning the vertices (see Section 4.4.1), the result should look like Figure 7.3. The largely two-party system in the U.S. Senate becomes very apparent, with a cluster of conservative senators, a cluster of liberal senators, and a few senators in the middle.
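The effect of these two Autofill settings can be approximated in Python. The sketch below assumes that opacity is scaled linearly between the threshold (opacity 10) and the largest value in the column (opacity 100); NodeXL’s exact mapping may differ. The function name `edge_display` and the third edge row are hypothetical.

```python
def edge_display(edges, threshold=0.64, min_opacity=10, max_opacity=100):
    """Keep edges above the threshold and scale their opacity linearly."""
    top = max(e["Percent_Agreement"] for e in edges)
    span = top - threshold
    shown = []
    for e in edges:
        pct = e["Percent_Agreement"]
        if pct <= threshold:
            continue  # "Greater than 0.64" excludes the threshold itself
        fraction = (pct - threshold) / span if span > 0 else 1.0
        opacity = min_opacity + fraction * (max_opacity - min_opacity)
        shown.append({**e, "Opacity": round(opacity, 1)})
    return shown

edges = [
    {"Vertex1": "Alexander", "Vertex2": "Baldwin", "Percent_Agreement": 0.395},
    {"Vertex1": "Alexander", "Vertex2": "Barrasso", "Percent_Agreement": 0.941},
    # Hypothetical pair, included only to show an intermediate opacity.
    {"Vertex1": "Senator_X", "Vertex2": "Senator_Y", "Percent_Agreement": 0.79},
]

visible = edge_display(edges)
for e in visible:
    print(e["Vertex1"], e["Vertex2"], e["Opacity"])
```

The Alexander and Baldwin edge falls below the threshold and is dropped, while the strongest remaining edge is drawn fully opaque.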

Figure 7.3 Filtered 2017–18 U.S. 115th Senate co-voting network after using the NodeXL Autofill Columns window with Edge Visibility Options set above 0.64 and Edge Opacity Options set to a range of 0.64 (edge opacity 10) to the largest number in the column (edge opacity 100). Labels are set to the Vertex column and the Vertex Shape is set to Label.

7.2.2 Using dynamic filters

NodeXL allows you to dynamically filter out edges or vertices based on any data fields on the Edges and Vertices worksheets. This can be an excellent way of exploring a dataset without making any permanent decisions about how to display it. Click on the Dynamic Filters button above the graph (see Figure 7.4) and you will be presented with a new window that lets you set the minimum and maximum values for each variable. Find the frequency table and sliders associated with Percent_Agreement. Notice from the graphic that there is a bimodal distribution (i.e., a group of senator pairs with low agreement toward the middle-left side and a group of senator pairs with high agreement on the right-hand side). Use the sliders associated with the Percent_Agreement data to explore the dataset. For example, drag the left slider to the right of the current visibility threshold (0.64) to exclude any edges below, say, 0.95 (i.e., 95% agreement). This will hide the edges and vertices from being displayed, although they will still affect the layout. You can faintly show the hidden vertices by setting the Filter Opacity to 5% as shown in Figure 7.4. This type of dynamic analysis can help identify differences between the political parties, as well as subcliques within them. For example, Figure 7.4 makes clear that the Republicans (the top cluster) voted as a bloc more often than the Democrats (the bottom cluster). Comparing the different Senate datasets available on the book website can reveal trends over time. Once you are done with your explorations, you can revert to the original image by clicking on Reset Filters and closing the dialog.

Figure 7.4 Dynamic Filters window used to “hide” edges with Percent_Agreement below 0.95. Filter opacity is set to 5% in order to faintly show the hidden connections.

7.2.3 Creating groups based on vertex attribute

Sometimes you will have data that describes attributes of the people in your network. For example, our Senate dataset includes information on the political party of each senator (see Party column on the Vertices worksheet). Values include D (Democrat), R (Republican), and I (Independent). To visually display this information, click on the Groups drop-down menu in the NodeXL ribbon and choose Group by Vertex Attribute. Choose Party from the first drop-down menu, since this is the attribute you want to group based upon. Then choose Categories from the second drop-down menu, since the data is categorical in nature. Notice, though, that groups can be created based on numerical data or date and time data. When you click OK, NodeXL will create two new worksheets. The Group Vertices worksheet includes a table showing each unique Vertex, alongside its Group and unique Vertex ID as shown in Figure 7.5. Additionally, the Groups worksheet is created, which includes one row for each unique group, alongside information on how to visually display the group, labels, and metrics, as shown in the left-hand side of Figure 7.6.

Figure 7.5 The Group Vertices worksheet that maps each vertex to exactly one cluster and a unique Vertex ID.
Figure 7.6 The Senate co-voting network after applying the Group by Vertex Attribute based on political Party. Notice that the labels disappeared because the Shape information is now pulled from the Groups worksheet. The Vertex Colors were automatically populated and are not good choices for this context.

The image in Figure 7.6 is problematic in a couple of ways, both of which can be fixed. You will notice that the vertices that are part of each group have been assigned different colors. This is good. However, the automatically assigned colors are not appropriate for this context, where red is typically associated with Republicans and blue is typically associated with Democrats. This can be easily fixed by changing the Vertex Color values next to R, D, and I on the Groups worksheet to Red, Blue, and Green, respectively (see Figure 7.7). Second, the Vertex Shape is now set to Disk (on the Groups worksheet), instead of Label (on the Vertices worksheet). To fix this, open the Group Options dialog (via the Groups drop-down menu in the NodeXL ribbon) and choose the settings shown in Figure 7.7. The result is a visualization that clearly shows each senator, their political party, and their network position.
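Grouping by a categorical attribute and then assigning party colors amounts to a simple lookup, sketched below in Python. The function name `group_by_attribute`, the table layout, and the placeholder independent senator are hypothetical; the party codes and colors follow the text.

```python
from collections import defaultdict

# A few Vertices worksheet rows; the last one is a hypothetical placeholder.
vertices = [
    {"Vertex": "Alexander", "Party": "R"},
    {"Vertex": "Baldwin", "Party": "D"},
    {"Vertex": "Barrasso", "Party": "R"},
    {"Vertex": "Example_Independent", "Party": "I"},
]

# Colors chosen to match common U.S. political conventions.
PARTY_COLORS = {"R": "Red", "D": "Blue", "I": "Green"}

def group_by_attribute(rows, attribute):
    """Map each distinct attribute value to the list of vertices that have it."""
    members = defaultdict(list)
    for row in rows:
        members[row[attribute]].append(row["Vertex"])
    return dict(members)

# Analogous to the Group Vertices worksheet: which vertex is in which group.
group_vertices = group_by_attribute(vertices, "Party")

# Analogous to the Groups worksheet: one row per group with display settings.
groups = [
    {"Group": g, "Vertex Color": PARTY_COLORS.get(g, "Gray"), "Vertices": len(m)}
    for g, m in sorted(group_vertices.items())
]

for row in groups:
    print(row)
```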

Figure 7.7 The Senate co-voting network after using the Group Options dialog to pull Vertex Color from the Groups worksheet and the Vertex Shape from the Vertices worksheet.

7.3 CSCW 2018 Twitter network analysis

An analysis of the CSCW 2018 Twitter network, introduced in Chapter 6, will help introduce several additional features of NodeXL that relate to filtering and grouping. Like most networks pulled from social media, the original network provides few insights without further refinement (see Figure 7.8). In this section, you will learn how to bring clarity and understanding to this type of real-world network.

Figure 7.8 CSCW 2018 Twitter network before it is filtered and grouped. The formula shown in the top bar sets the Visibility column to Skip if it is a Tweet, and Show if it is a Mentions or Replies To.

7.3.1 Filtering out self-loops using the edge visibility column

As described in Chapter 6, Twitter networks include three types of edges in the Relationship column: Mentions, Replies to, and Tweet. In order to focus in on the relationships between people, you can Skip the messages that only connect to one person—namely the Tweet messages. These are visually displayed as an edge that circles back and points at the same vertex it originated at. Because the data in the Relationship column is not numerical, you cannot use Autofill Columns to filter out the edges. Instead, write the formula shown in Figure 7.8 to set the Visibility column to Skip if the message type is Tweet, or, if not, set it to Show. After you Refresh Graph, there should not be any self-loops remaining. This method is preferred over using Excel's built-in filtering tool (e.g., accessible via the drop-down menu in the Relationship column header). While using the filtering tool will not necessarily break anything, it will hide the rows, which may cause inadvertent mistakes to be made later.
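The logic of that Visibility formula can be expressed as a short Python sketch. It is not the spreadsheet formula from Figure 7.8; the function name `visibility_for` and the usernames are hypothetical.

```python
# Hypothetical Edges worksheet rows from a Twitter network.
edges = [
    {"Vertex 1": "user_a", "Vertex 2": "user_a", "Relationship": "Tweet"},
    {"Vertex 1": "user_a", "Vertex 2": "user_b", "Relationship": "Mentions"},
    {"Vertex 1": "user_c", "Vertex 2": "user_a", "Relationship": "Replies to"},
]

def visibility_for(edge):
    """Skip plain tweets (self-loops); show mentions and replies."""
    return "Skip" if edge["Relationship"] == "Tweet" else "Show"

for edge in edges:
    edge["Visibility"] = visibility_for(edge)

# Confirm that no visible edge points back at its own source vertex.
remaining_self_loops = [
    e for e in edges
    if e["Visibility"] == "Show" and e["Vertex 1"] == e["Vertex 2"]
]
print([e["Visibility"] for e in edges], len(remaining_self_loops))
```

Because the rows are only marked rather than deleted or hidden in the spreadsheet, the tweet content stays available for later analysis.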

Advanced topic

Count and merge duplicate edges

Many networks, like the CSCW 2018 Twitter network, have duplicate edges. For example, as shown in Figure 7.8, dlh_30x replied to w_cscw two times (see rows 18 and 19). The Overall Metrics worksheet reveals that there are 1593 duplicate edges in the CSCW 2018 Twitter network. Keeping duplicates can be useful in many contexts. For example, in this case, the content of each tweet is preserved. Unfortunately, duplicates can also cause some problems. For example, you may want to visualize the number of tweets as a weighted line. To do this, you will need to count how many duplicates there are for each row and represent that in a column (e.g., Edge Weight). Sometimes there is no need to preserve each individual row, and the duplicates can be merged into a single unique row to simplify the dataset.

NodeXL allows you to do these things by using the Prepare Data dropdown menu found in the NodeXL Ribbon and choosing the Count and Merge Duplicate Edges option. This will open a new dialog like the one shown in Figure 7.9. The first option indicates that after the duplicate edges are merged, a new Edge Weight column will be added to the Edges worksheet and populated with the number of edges that were combined. For example, the two rows where dlh_30x replied to w_cscw would have an Edge Weight of 2 appear. The second option will collapse the duplicate edges into a single row. For example, if you were to check the Merge duplicate edges box, the number of rows in the Edges worksheet would be reduced to the number of unique edges. In our example, the two dlh_30x replied to w_cscw rows would be collapsed into a single row. In this network, you may NOT want to do this, since the tweet content of one of the collapsed edges would be permanently erased. However, this feature can come in handy in many instances.

Figure 7.9 Count and Merge Duplicate Edges dialog with settings that will merge all duplicate edges (i.e., rows) that contain the same values in the Vertex 1, Vertex 2, and Relationship columns.

The section titled Columns that determine whether edges are duplicates allows you to account for cases where you have different edge types, as in the CSCW 2018 Twitter network. If you were to select the first option, then all edges from one person to another would be counted or merged, even if they were of different types (e.g., replies to, mentions, and tweet). This may be desirable, but in many cases it is not. The example in Figure 7.9 shows the Relationship column also being used to determine uniqueness. With this box checked, the counts and/or merged edges do NOT roll up Replies to and Mentions (values in the Relationship column). For example, even if you merge edges, there will be two rows that show caylery in Vertex 1 and farbandish in Vertex 2. One would be of Relationship type Mentions (with an Edge Weight of 2) and the other would be of Relationship type Replies to (with an Edge Weight of 3).
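The counting and merging behavior described in this sidebar can be sketched in Python. The function `count_and_merge` and its `use_relationship` flag are hypothetical stand-ins for the dialog options, and the rows simply reproduce the caylery and farbandish example above.

```python
from collections import Counter

# Five rows from caylery to farbandish: two Mentions and three Replies to.
edges = (
    [{"Vertex 1": "caylery", "Vertex 2": "farbandish", "Relationship": "Mentions"}] * 2
    + [{"Vertex 1": "caylery", "Vertex 2": "farbandish", "Relationship": "Replies to"}] * 3
)

def count_and_merge(rows, use_relationship=True):
    """Collapse duplicate rows and record how many rows each one replaced."""
    def key(row):
        base = (row["Vertex 1"], row["Vertex 2"])
        return base + (row["Relationship"],) if use_relationship else base

    counts = Counter(key(r) for r in rows)
    merged, seen = [], set()
    for r in rows:
        k = key(r)
        if k in seen:
            continue  # a duplicate of a row that was already kept
        seen.add(k)
        merged.append({**r, "Edge Weight": counts[k]})
    return merged

by_type = count_and_merge(edges, use_relationship=True)
any_type = count_and_merge(edges, use_relationship=False)
print([(r["Relationship"], r["Edge Weight"]) for r in by_type])
print([r["Edge Weight"] for r in any_type])
```

In this sketch, merging without the Relationship column keeps only the first row’s other values, which mirrors the warning above that information such as tweet content from the collapsed rows is lost.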

7.3.2 Grouping and visualizing connected components

Perhaps the simplest method to identify groups within a network is to group together vertices that are part of different connected components. In other words, find the groups of connected vertices that are completely separated from other groups of connected vertices. To do this, choose Group by Connected Component from the Groups dropdown menu. This will populate the Groups worksheet with a different row for each connected component in the graph. They will be sorted in order from largest to smallest. In many cases, there will be one large connected component and many smaller components. To make sure they do not overlap one another, open the Layout Options dialog (found in the Layout dropdown menu above the graph), and check the box shown in Figure 7.10 that lays out the smaller components in boxes at the bottom of the graph pane. You can customize the maximum size of the connected components (10) and the size of the invisible boxes (25). You may also want to update the Fruchterman-Reingold layout settings to those shown in Figure 7.10 to achieve a better layout.
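Finding connected components is a standard graph traversal. The sketch below uses a breadth-first search over a small made-up edge list, ignoring edge direction, and sorts the components from largest to smallest as the Groups worksheet does. The function name and data are hypothetical, and this is not NodeXL’s own implementation.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return components (as sorted vertex lists), largest first."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)   # direction is ignored, so add both ways
        neighbors[b].add(a)

    seen, components = set(), []
    for start in list(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            vertex = queue.popleft()
            component.append(vertex)
            for other in neighbors[vertex]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))

    components.sort(key=len, reverse=True)
    return components

# Made-up edges: one chain of three, one pair, and one isolated self-loop.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]
components = connected_components(edges)
print(components)
```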

Figure 7.10 CSCW 2018 Twitter network after grouping by Connected Component and laying out the smaller connected components in boxes across the bottom.

The graph showing the entire network in Figure 7.10 provides a good sense of the overall scope and structure of the network. In this case, there seem to be many people who are connected directly or indirectly in the large blue connected component. There are also a number of smaller connected components that have not mentioned or replied to anyone in the larger network. It is also apparent that there seems to be a core group of people in the center of the large component, as well as a few key people that many people mention or reply to (e.g., the individual toward the right-hand side with a fan of individuals pointing toward her).

7.3.3 Using dynamic filters to filter based on time

One particularly useful way to use Dynamic Filters is to examine changes in a network over time. Since Twitter data has timestamps on the tweets that were sent, you can use the sliders to “play back” the discussion over time. Open Dynamic Filters and look at the first two Edge Filters called Relationship Date (UTC) and Tweet Date (UTC) (see Figure 7.11). For this network, these columns are identical, so you can use either one. Update the values in the Relationship Date (UTC) fields so the range covers only the time period of the conference as shown in Figure 7.11. Choose the Lay Out Visible Vertices Again option from the Lay Out Again drop-down menu as shown in Figure 7.11. This ensures that the hidden vertices do not affect the layout of those that are shown.

Figure 7.11 Dynamic Filter window with the Relationship Date (UTC) set to show only tweets that occurred during the conference dates of Nov. 3, 2018 through Nov. 7, 2018. The Lay Out Visible Vertices Again feature is used so the hidden vertices don’t affect the layout of the visible vertices.

You can explore additional ways of playing with the Relationship Date (UTC) fields. Reset the filters back to their starting point by clicking the Reset Filters button (Figure 7.11). Then drag the right-hand slider all the way to the left and slowly slide it to the right. This will show a dynamic view of the network as it unfolded over time. A more precise way of doing this is to click on the day portion of the right-hand side date value and use the up or down arrow keys to increase or decrease the day. You will likely notice significant bursts of activity, such as a post by katestarbird on Sep. 4, 2018 that was widely retweeted. Once you are done exploring, click the Reset Filters button (Figure 7.11) and close out.
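Both uses of the date sliders, restricting the view to a fixed window and playing the network back day by day, can be sketched in Python. The edge rows, usernames, and function names below are hypothetical; only the column name and the conference dates come from the text.

```python
from datetime import datetime, timedelta

DATE = "Relationship Date (UTC)"

# Hypothetical edges with timestamps before, during, and after the conference.
edges = [
    {"Vertex 1": "user_a", "Vertex 2": "user_b", DATE: datetime(2018, 9, 4, 15, 30)},
    {"Vertex 1": "user_b", "Vertex 2": "user_c", DATE: datetime(2018, 11, 3, 9, 0)},
    {"Vertex 1": "user_c", "Vertex 2": "user_a", DATE: datetime(2018, 11, 5, 18, 45)},
    {"Vertex 1": "user_d", "Vertex 2": "user_a", DATE: datetime(2018, 11, 9, 12, 0)},
]

def edges_in_window(rows, start, end):
    """Keep edges whose timestamp falls between the two slider positions."""
    return [r for r in rows if start <= r[DATE] <= end]

def playback(rows, first_day, last_day):
    """Yield (day, visible edge count) as the right-hand slider moves forward."""
    day = first_day
    while day <= last_day:
        next_day = day + timedelta(days=1)
        yield day.date(), sum(1 for r in rows if r[DATE] < next_day)
        day = next_day

conference = edges_in_window(
    edges, datetime(2018, 11, 3), datetime(2018, 11, 7, 23, 59, 59)
)
frames = list(playback(edges, datetime(2018, 11, 3), datetime(2018, 11, 5)))
print(len(conference), frames)
```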

7.3.4 Filtering based on vertex metrics and the visibility column

While the full network visualization is a nice overview, it is hard to focus in on groups within the large component or identify individuals who are important within it. For that analysis, it is necessary to filter out the less important vertices as determined by network metrics. Navigate to the Metrics columns on the Vertices worksheet and use the Sort Largest to Smallest feature (e.g., on the In-Degree column title). This can help you identify important individuals, as well as cut-off points for filtering. For example, after sorting on In-Degree, it becomes apparent that there are some outliers, such as the acm_cscw account that has an in-degree of 380 (i.e., 380 unique individuals who have mentioned or replied to them). You may also notice that the majority of user accounts have an in-degree of 0, 1, or 2. This power-law distribution is typical of real-world social networks.

Because in-degree is a useful measure of local importance within a particular network, you can use it to filter out those with a low score. Use Autofill Columns to set the Visibility of vertices with an In-Degree of 2 or less to Skip, as shown in the Figure 7.12 Vertex Visibility Options dialog. You can also set the Size of the vertices to be based on In-Degree (checking the Use a logarithmic mapping box) and set the Vertex Tooltip to be based on the Vertex column (i.e., where usernames are recorded). The resulting image shown in Figure 7.12 is a bit less overwhelming than the unfiltered version, but is still somewhat cluttered. This is largely because of the prominent position of the acm_cscw user, whose connections overwhelm some of the other structures, making them less apparent. Try removing the user by navigating to the Vertices worksheet and manually choosing Skip in the Visibility column on the acm_cscw row as shown in Figure 7.13. This will remove acm_cscw from the graph, though you should be careful, since running Autofill Columns again will overwrite this manual change. The resulting graph (see Figure 7.13) helps you realize that the network is a bit more spread out than it might have otherwise appeared.
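The in-degree filter and the manual override can be sketched in Python. The sketch assumes, following the definition given earlier, that in-degree counts the unique other accounts pointing at a vertex; the edge list, usernames, and function names are hypothetical, and the Show value for retained vertices is an assumption about the resulting Visibility setting.

```python
from collections import defaultdict

# Hypothetical directed edges (source, target), including a duplicate and a self-loop.
edges = [
    ("user_a", "hub"), ("user_b", "hub"), ("user_c", "hub"), ("user_a", "hub"),
    ("hub", "user_a"), ("user_b", "user_a"), ("hub", "hub"),
]

def in_degrees(rows):
    """Count the unique other vertices that point at each vertex."""
    sources = defaultdict(set)
    for src, dst in rows:
        sources[src]  # ensure every vertex appears, even with in-degree 0
        if src != dst:
            sources[dst].add(src)
    return {vertex: len(found) for vertex, found in sources.items()}

def vertex_visibility(degrees, cutoff=2, manual_skip=()):
    """Show vertices whose in-degree exceeds the cutoff unless skipped by hand."""
    visibility = {}
    for vertex, value in degrees.items():
        keep = value > cutoff and vertex not in manual_skip
        visibility[vertex] = "Show" if keep else "Skip"
    return visibility

degrees = in_degrees(edges)
filtered = vertex_visibility(degrees)
without_hub = vertex_visibility(degrees, manual_skip={"hub"})
print(degrees)
print(filtered)
print(without_hub)
```

Passing the manually skipped accounts as a parameter keeps the override from being lost when the cutoff rule is re-run, which is the risk the text notes for re-running Autofill Columns.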

Figure 7.12 CSCW 2018 Twitter network after using Autofill Columns to filter out (i.e., Skip) vertices whose In-Degree is not greater than 2.
Figure 7.13 CSCW 2018 Twitter network after removing acm_cscw from the graph (right) by manually setting the user’s Visibility to Skip (left).

7.3.5 Automatically identifying groups based on network clustering algorithms

NodeXL can automatically identify groups within a network based solely on network structure. In contrast to the approach of using existing data about the attributes as used in Section 7.2.3, this approach is based solely on who is connected to whom. A number of different network “clustering” (also known as “community detection”) algorithms exist, which help find subgroups of highly inter-connected vertices within a network. NodeXL includes three such algorithms: Clauset-Newman-Moore, Wakita-Tsurumi [3], and Girvan-Newman (which can take a long time to run on large graphs). In all of these algorithms, the number of clusters is not predetermined; instead the algorithm dynamically determines the number it thinks is best. Each vertex is assigned to exactly one cluster, meaning that clusters do not overlap. The number of vertices in each cluster can vary significantly. In some cases, a single cluster can encompass all vertices, whereas in other cases, a cluster can consist of a single vertex. See Newman [4] for background on some of these and other community identifying algorithms.

There is no “right” or “wrong” algorithm to use; instead, it is often useful to try out different ones and see which ones you believe provide the best results given your network. For example, in this network, the Clauset-Newman-Moore algorithm results in fewer, larger groups than the other algorithms, which provide more groups of a smaller size. Try applying the Wakita-Tsurumi clustering algorithm by clicking on the Groups dropdown menu in the NodeXL ribbon, choosing Group by Cluster, and then checking the appropriate selector as shown in Figure 7.14. Notice that the data on the Groups worksheet is now updated to reflect the new groups.
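To give a feel for how structure-based clustering works, the sketch below implements a naive greedy modularity agglomeration in Python. It is written in the spirit of the Clauset-Newman-Moore approach but is not NodeXL’s implementation and is far too slow for large graphs. It treats the network as undirected and unweighted, and the function name and example edges are hypothetical.

```python
from collections import defaultdict

def greedy_modularity_groups(edges):
    """Merge communities greedily while each merge still increases modularity."""
    # Undirected, unweighted view: drop self-loops, direction, and duplicates.
    pairs = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    m = len(pairs)
    members = {}                   # community id -> set of vertices
    degree_sum = defaultdict(int)  # community id -> total degree of its vertices
    for a, b in pairs:
        for v in (a, b):
            members.setdefault(v, {v})
            degree_sum[v] += 1
    links = {pair: 1 for pair in pairs}  # edge counts between communities

    while True:
        best_gain, best_pair = 0.0, None
        for (a, b), count in sorted(links.items()):
            # Change in modularity if communities a and b were merged.
            gain = count / m - degree_sum[a] * degree_sum[b] / (2 * m * m)
            if gain > best_gain + 1e-12:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break  # no merge improves modularity, so stop
        keep, gone = best_pair
        members[keep] |= members.pop(gone)
        degree_sum[keep] += degree_sum.pop(gone)
        merged_links = defaultdict(int)
        for pair, count in links.items():
            renamed = tuple(sorted(keep if c == gone else c for c in pair))
            if renamed[0] != renamed[1]:
                merged_links[renamed] += count
        links = dict(merged_links)

    return sorted((sorted(g) for g in members.values()), key=lambda g: (-len(g), g))

# Two triangles joined by a single bridge between "c" and "d".
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
groups = greedy_modularity_groups(edges)
print(groups)
```

As in the algorithms described above, the number of groups is not specified in advance; merging simply stops when no further merge improves the score, and every vertex ends up in exactly one group.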

Figure 7.14 CSCW 2018 Twitter network after applying the Wakita-Tsurumi clustering algorithm.

Advanced topic

Additional clustering algorithms

There are a large and growing number of network clustering algorithms, also called community detection algorithms (see Newman and Girvan [1] for an overview). Many community clustering algorithms don’t scale well for large networks, forcing a tradeoff between quality and speed. Most algorithms, including the ones used by NodeXL, are based on undirected, unweighted networks. Although you can apply them to more complex networks and often get reasonable results, you may need to use a more specialized community detection tool that includes a range of algorithms that take into consideration the specific properties of your network data. Most tools will output data into a format that can easily be pasted into the Groups and Group Vertices worksheets, allowing you to take advantage of NodeXL’s rich visual features.

In addition, there are many non-network clustering algorithms that can operate on collections of vertex attributes [5]. These include k-means, hierarchical agglomerative, hierarchical divisive, and many more. For example, a non-network clustering algorithm could be used to cluster people into groups that have similar participation patterns (e.g., those who use a certain collection of features similarly). This could then be represented on the network graph to see how those who use the system in a similar way are connected to one another.

7.3.6 Group properties and metrics

The Groups worksheet includes many fields that can be useful in analyzing and visualizing networks. If you click on one of the Groups worksheet rows, it will highlight all vertices associated with that row’s group. The Vertex Color and Vertex Shape columns have already been introduced. Remember that you may need to use the Group Options dialog to determine whether vertex color and/or shape should be pulled from the Groups worksheet or the Vertices worksheet (see Figure 7.7). The Visibility column can be used to Show, Skip, or Hide each group. Choose Skip for the bottom 7 groups, which are all separated from the large component (see Figure 7.15). This will filter them out of the graph, and metrics will not be calculated for them. Choosing Yes in the Collapsed? column lets you collapse an entire group into a single shape (whatever is specified in the Vertex Shape column) with a plus sign in the middle of it. The size of the shape depends on the number of vertices in the group. However, collapsing groups may hide important information, so keep them set to the default (i.e., No) for now.

Figure 7.15 CSCW 2018 Twitter network Visual Properties and Graph Metrics columns after calculating the Group metrics using the Graph Metrics feature.

Calculate metrics for each group by opening the Graph Metrics dialog (see Chapter 6) and checking the box next to Group metrics (Figure 7.15). This will populate the Graph Metrics columns on the Groups worksheet, except for the groups that have Visibility set to Skip. Metrics include many of the metrics found on the Overall Metrics worksheet, such as the number of Vertices, Unique Edges, Total Edges, Graph Density, etc. Sorting on those columns can help identify key differences between the groups. For example, the group G5 has a very low Graph Density compared to most other groups, largely because there is one key person mentioning many other people.
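The sketch below shows how per-group counts and density might be computed in Python. It uses one common definition of density for directed graphs (distinct non-loop edges divided by n(n-1)), which may not match NodeXL’s calculation exactly. The group names, usernames, and the function `group_metrics` are hypothetical.

```python
def group_metrics(edges, membership):
    """Compute simple metrics from the edges that stay inside each group."""
    stats = {}
    for vertex, group in membership.items():
        entry = stats.setdefault(group, {"Vertices": 0, "Total Edges": 0, "distinct": set()})
        entry["Vertices"] += 1
    for src, dst in edges:
        group = membership.get(src)
        if group is None or membership.get(dst) != group:
            continue  # edges that cross group boundaries are ignored here
        stats[group]["Total Edges"] += 1
        if src != dst:
            stats[group]["distinct"].add((src, dst))
    results = {}
    for group, entry in stats.items():
        n = entry["Vertices"]
        possible = n * (n - 1)  # possible directed edges without self-loops
        density = len(entry["distinct"]) / possible if possible else 0.0
        results[group] = {
            "Vertices": n,
            "Total Edges": entry["Total Edges"],
            "Graph Density": density,
        }
    return results

# Hypothetical groups: a star where one person mentions everyone, and a tight triad.
membership = {"hub": "G1", "a": "G1", "b": "G1", "c": "G1", "d": "G1",
              "x": "G2", "y": "G2", "z": "G2"}
edges = [("hub", "a"), ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
         ("x", "y"), ("y", "x"), ("y", "z"), ("z", "y"), ("x", "z"), ("z", "x"),
         ("hub", "x")]
metrics = group_metrics(edges, membership)
print(metrics)
```

The star-shaped group has a much lower density than the fully connected triad, which is the same pattern the text describes for a group dominated by one person mentioning many others.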

7.3.7 Group layout and labels

At times, it can be useful to visualize groups separately from one another. NodeXL allows you to do this using the Layout Options… dialog, as shown in Figure 7.16. Choose the option Lay out each of the graph’s groups in its own box and set the Intergroup edges: to Hide. The resulting graph, shown in Figure 7.16, is missing data (i.e., the hidden edge connections between groups), but this allows for a more focused comparative analysis of groups. To learn more about this novel group-in-a-box layout technique, see Ref. [6]. To better identify important individuals within the graph, enter the names of important users (based on metrics) in the Labels column on the Vertices worksheet. Additionally, add group labels by entering the group name in the Labels column on the Groups worksheet (see Figure 7.16). To place the group labels in the upper-left corner, as shown in Figure 7.16, use the Graph Options dialog, choose the Other tab, choose Labels… and set the Position: to Top Left in the Group box labels section.

Figure 7.16 CSCW 2018 Twitter network using the Layout Options to display each group in its own box and hiding the edges between vertices in different groups. Group labels are also shown.

Advanced topic

Group by Motif

NodeXL is able to group by different network motifs, or in other words, different common visual patterns of connections. See Ref. [7] for a full description of this novel visualization technique. To access this feature, choose Group by Motif in the Groups drop-down menu to open the Group by Motif dialog box (Figure 7.17). As with all of the different grouping options, using this will overwrite all other groups that were previously calculated.
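
To give a feel for what a motif is, the sketch below detects one simple pattern, the fan: a head vertex attached to several leaf vertices that have no other neighbors. This is an illustrative Python sketch under that assumed definition, not NodeXL's implementation, and the `min_leaves` threshold is a hypothetical parameter.

```python
from collections import defaultdict

def find_fans(edges, min_leaves=2):
    """Return {head: [leaves]} for heads with at least `min_leaves` degree-1 neighbors."""
    neighbors = defaultdict(set)
    for s, t in edges:
        if s != t:  # ignore self-loops
            neighbors[s].add(t)
            neighbors[t].add(s)
    fans = defaultdict(list)
    for vertex, nbrs in neighbors.items():
        if len(nbrs) == 1:  # a leaf has exactly one neighbor, its head
            (head,) = nbrs
            fans[head].append(vertex)
    return {h: sorted(leaves) for h, leaves in fans.items() if len(leaves) >= min_leaves}

# Hypothetical edge list: "hub" has three leaves plus one neighbor that leads elsewhere.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("d", "e"), ("e", "f")]
fans = find_fans(edges)
print(fans)
```

Each detected fan could be drawn as a single glyph in place of its leaves, which is the kind of simplification the dialog offers.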

Figure 7.17 The Group by Motif dialog box that allows you to determine which type of network motif to collapse in the visualization.

7.3.8 Creating subgraph images

Another useful way to understand complex networks is to view individual sections, or subgraphs, of the larger graph. NodeXL allows you to create Subgraph Images for each vertex in the network. Network scientists call these egocentric networks (see Chapter 3). They provide a personalized view of the network from the perspective of an individual vertex and are useful when comparing vertices to one another. For example, vertices with a similar structure (e.g., a hub and spoke) may play similar social roles such as a question answerer in a discussion forum (see Chapter 10).

To create network subgraph images, click on the Subgraph Images button on the Analysis section of the NodeXL ribbon (top-right of Figure 7.18). The Subgraph Images dialog box will appear (Figure 7.18). The first option allows you to choose the levels of adjacent vertices to include in each subgraph. For example, the default of 1.5 will show edges connecting the source vertex with its direct neighbors, as well as any edges that connect the neighbors to one another. Choosing 2.0 will show all of those edges, plus edges connecting the source vertex’s neighbors with all of their neighbors. If the data were from a social networking site such as Facebook, a 2.0 setting would show your friends, which of your friends know one another, and all of your friends’ friends (FOAF). For now, replicate Figure 7.18 by using the default settings. This will generate a new column called Subgraph as shown in Figure 7.18. Subgraph images are positioned relative to each other based on the currently selected layout algorithm, so make sure you use an appropriate one (e.g., Fruchterman-Reingold) for your data.
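
The difference between the 1.0, 1.5, and 2.0 settings can be sketched in a few lines of Python. This is an illustration of the levels as described above, not NodeXL's own code, and the vertex names are hypothetical.

```python
def ego_edges(edges, ego, level=1.5):
    """Return the edges in the egocentric subgraph of `ego` at level 1.0, 1.5, or 2.0."""
    # Direct neighbors of the ego, regardless of edge direction.
    neighbors = {t for s, t in edges if s == ego} | {s for s, t in edges if t == ego}
    neighbors.discard(ego)
    kept = []
    for s, t in edges:
        if ego in (s, t):
            kept.append((s, t))  # level 1.0: edges between ego and its neighbors
        elif level >= 1.5 and s in neighbors and t in neighbors:
            kept.append((s, t))  # level 1.5: edges among the neighbors themselves
        elif level >= 2.0 and (s in neighbors or t in neighbors):
            kept.append((s, t))  # level 2.0: edges from neighbors to their neighbors
    return kept

# Hypothetical friendship edges.
edges = [("me", "ann"), ("me", "bob"), ("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
level_1 = ego_edges(edges, "me", 1.0)
level_1_5 = ego_edges(edges, "me", 1.5)
level_2 = ego_edges(edges, "me", 2.0)
print(level_1, level_1_5, level_2, sep="\n")
```

In this example the edge between cat and dan is never included, because neither endpoint is a direct neighbor of the source vertex.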

Figure 7.18 Subgraph images dialog box with default settings and the resulting subgraph images inserted into the Subgraph column on the Vertices worksheet.

Subgraphs highlight important differences between vertices. To illustrate this point, compare the subgraph images of individuals. For example, clifflampe and asbruckman are part of a densely connected active discussion group. In contrast, stbridgetathena and warcraft are mentioned or replied to by people who are otherwise not connected to one another. While the Clustering Coefficient metric leads to similar insights, a visual representation using subgraphs can often illuminate more nuances, such as the fact that hypotext is always mentioned alongside someone else.
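
For comparison with the visual approach, the local clustering coefficient can be sketched as the fraction of a vertex's neighbor pairs that are themselves connected. The Python sketch below treats edges as undirected, which is a simplifying assumption, and uses hypothetical vertex names rather than the accounts discussed above.

```python
from itertools import combinations

def clustering_coefficient(edges, vertex):
    """Fraction of pairs of `vertex`'s neighbors that are directly connected."""
    neighbors_of = {}
    for s, t in edges:
        if s != t:  # ignore self-loops; treat every edge as undirected
            neighbors_of.setdefault(s, set()).add(t)
            neighbors_of.setdefault(t, set()).add(s)
    nbrs = neighbors_of.get(vertex, set())
    if len(nbrs) < 2:
        return 0.0  # no neighbor pairs to compare
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for a, b in pairs if b in neighbors_of[a])
    return linked / len(pairs)

# Hypothetical data: "p" sits in a tight triangle, "star" is mentioned by strangers.
edges = [("p", "q"), ("p", "r"), ("q", "r"),
         ("x", "star"), ("y", "star"), ("z", "star")]
cc_p = clustering_coefficient(edges, "p")
cc_star = clustering_coefficient(edges, "star")
print(cc_p, cc_star)
```

The single number separates the two cases, but it cannot show who appears alongside whom, which is where the subgraph images add detail.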

7.4 Federal Communications Commission (FCC) lobbying coalition network

The power of social network analysis and visualization is best realized when combining the approaches discussed so far (Chapters 4 through 7). In this section you will see a network visualization created in NodeXL by Pierre de Vries, a Research Fellow at the University of Washington’s Economic Policy Research Center. This example is based on his submission to the first annual Journal of Social Structure Visualization Symposium held in 2010.

The social network in Figure 7.19 shows the relationships between organizations that lobbied the FCC on just one of the hundreds of issues before it: Docket 01-92 on intercarrier compensation. This proceeding is a battle over the fees that telephone companies pay each other when phone calls move between them. Legislation requires that most interactions between organizations and the FCC are publicly documented. This network data was extracted from metadata about these filings reported via the FCC’s Electronic Comment Filing System.

Figure 7.19 Lobbying Coalition network connecting organizations (vertices) that have jointly filed comments on U.S. Federal Communications Commission policies (edges). Vertex size represents number of filings and color represents eigenvector centrality (pink = higher). Darker edges connect organizations with many joint filings. Vertices were originally positioned using Fruchterman-Reingold and then hand-positioned to respect clusters identified by NodeXL’s Find Clusters algorithm.

The network captures all links between lobbying organizations over the duration of the proceeding (2001–2008). Vertices represent organizations that filed. Edges connect organizations that filed jointly, with edge weight representing the number of joint filings. Darker edges reflect higher edge weights.
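
A minimal sketch of how such an edge list could be derived from filing records is shown below. It assumes each filing is available as a list of the organizations that signed it; the organization names are hypothetical, and the actual extraction from the FCC system is not reproduced here. Filing counts per organization are tallied as well, since the figure sizes vertices by them.

```python
from collections import Counter
from itertools import combinations

def cofiling_network(filings):
    """Return (filings per organization, joint filings per organization pair)."""
    filing_counts = Counter()
    edge_weights = Counter()
    for orgs in filings:
        unique = sorted(set(orgs))  # sort so each pair has one canonical order
        filing_counts.update(unique)
        # Every pair of co-signers gains one joint filing.
        edge_weights.update(combinations(unique, 2))
    return filing_counts, edge_weights

# Hypothetical filings, each listing the organizations that signed it.
filings = [
    ["A Telecom", "B Rural"],
    ["A Telecom", "B Rural", "C Assoc"],
    ["C Assoc"],
]
filing_counts, edge_weights = cofiling_network(filings)
print(dict(filing_counts))
print(dict(edge_weights))
```

The pair counts could be pasted into an edge weight column on the Edges worksheet, and the per-organization counts into a vertex attribute column.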

Vertex size is proportional to the total number of filings that an organization made and is a proxy for lobbying investment. Well-funded companies and trade associations are prominent, although they are not necessarily well connected because they don’t need to be. Influence, measured by eigenvector centrality, is represented by the color of the nodes: the pinker the node, the better connected it is. Small companies can gain influence by linking different coalitions of local exchange carriers. Some organizations that hardly filed at all may be influential (i.e., small and pink), thanks to their many, straddling connections.
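
The following sketch illustrates why a rarely filing but well-placed organization can score highly. It estimates eigenvector centrality by power iteration on an undirected, weighted edge list. The organization names and weights are hypothetical, and adding each vertex's own score at every step is an implementation choice made here to keep the iteration stable; it is not a description of how NodeXL computes the metric.

```python
def eigenvector_centrality(edge_weights, iterations=200, tol=1e-9):
    """Approximate eigenvector centrality, scaled so the top vertex scores 1.0."""
    adj = {}
    for (a, b), w in edge_weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    if not adj:
        return {}
    scores = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # Each vertex collects its neighbors' scores, weighted by edge weight.
        # Adding the vertex's own score leaves the dominant eigenvector unchanged
        # but prevents oscillation on bipartite graphs.
        nxt = {v: scores[v] + sum(w * scores[u] for u, w in adj[v].items()) for v in adj}
        top = max(nxt.values())
        nxt = {v: s / top for v, s in nxt.items()}
        if max(abs(nxt[v] - scores[v]) for v in adj) < tol:
            return nxt
        scores = nxt
    return scores

# Hypothetical joint-filing weights: "small" links several partners, "big" has one.
weights = {("big", "x"): 1, ("small", "x"): 1, ("small", "y"): 1,
           ("small", "z"): 1, ("x", "y"): 1}
scores = eigenvector_centrality(weights)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

However many filings the organization labeled big makes on its own, its score stays below that of small, because the measure rewards connections to well-connected partners rather than filing volume.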

The Fruchterman-Reingold algorithm was used to prepare a preliminary network layout for these data. Next, the Group by Cluster feature was used to identify distinct clusters, which were used to guide the manual placement of vertices into visually intelligible positions. Once clusters were represented by their placement, the cluster colors were pulled from the Vertices worksheet instead of the Groups worksheet so that color could be used to represent eigenvector centrality. The file was exported as a high-resolution image, making it possible to zoom in on different sections of the graph and still read the vertex labels.

Interviews with policy practitioners confirm that the graph clusters correspond to real-world coalitions and alliances. For example, the clusters in the top right and bottom right are rural telephone companies; a loose coalition of rural trade associations and rural competitive local exchange carriers can be seen at the bottom middle of the illustration. The heavily connected cloud in the center of the chart shows competitive local exchange carriers who band together at various times in various permutations to gain strength in numbers.

Because graph clusters and evolution represent real-world behavior, they can be used to improve public understanding and lobbying effectiveness. Insiders can use graphs to identify potential collaborators or defectors (e.g., by looking for coalition members who are bridges between groups). They can also use changes in connectedness to track the emergence or breakdown of consensus in a proceeding. Outsiders can grasp the overall structure and evolution of the proceeding without having to understand the entire record.

7.5 Practitioner’s summary

Making sense of large networks can be challenging, particularly if they are densely connected. Several techniques used to simplify and clarify networks relate to filtering. For example, edges can be filtered based on associated data, as you saw in the Senate co-voting network (edges with a low co-voting percent were removed) and Twitter network (self-loops were removed). Vertices can also be filtered based on associated data or metrics, such as in-degree. Even groups can be filtered out using the Visibility column on the Groups worksheet. The Dynamic Filters feature allows you to interactively determine what is displayed on the graph by setting the position of sliders representing starting and ending ranges. These ranges can use data from graph metrics, timestamped data, or other attribute data associated with the network.

Another technique for making sense of large, complex graphs is to break them into groups, which are also called clusters, communities, or subgraphs. Groups are pockets of densely connected vertices that are only sparsely connected to other pockets. Groups can be automatically identified using community detection algorithms such as those behind NodeXL’s Group by Cluster feature (e.g., Ref. [3]). If group memberships are known (e.g., Republican and Democrat party affiliations), vertices can be manually assigned to groups. Whether groups are automatically or manually created, they can be visualized in NodeXL with unique vertex color and shape combinations that indicate membership in different clusters. Users can visualize groups in many ways, including the group-in-a-box technique [6], collapsing each group into a single unit, using summary network motifs [7], or creating subgraph images for each vertex. Combining the strategies discussed so far in Part II can facilitate the creation of insightful and visually appealing graphs that can be the basis for understanding, explanation, decision making, and persuasion.

7.6 Researcher’s agenda

Strategies for dynamic filtering of complex network graphs have been around for decades, but still allow room for improvement [8–10]. Applying filters in an orderly process that arrives at a successful outcome requires skill and creative problem solving. As network researchers gain experience, they may develop more systematic approaches to choosing and setting filters so as to emphasize important features and remove distractions [11]. Being able to save a sequence of actions and then replay it on a fresh set of data would be a useful improvement. Developing standard process models (sequences of actions) to ensure complete exploration could dramatically advance the state of the art for social media network analysis. Such a process model would be systematic yet flexible, smoothly integrating statistics and visualization [12, 13] and guiding users effectively while enabling them to explore interesting possibilities.

Rapid progress in the past decade has turned the esoteric topic of clustering algorithms into a hot research area [14]. Newman’s work (see Additional resources) produced substantially improved strategies for organizing and presenting complex networks in meaningful ways, which stimulated further work on weighted, directed [15], and multiplex networks [16]. Because most algorithms run slowly, the research community has sought algorithms that can be adapted to run on multicore and specialized graphics processors that are increasingly embedded in modern computers [17]. The next step is to compute network clusters efficiently using parallel computers and cloud computing techniques. Most techniques create clusters based on edge connections, but an alternate strategy is to cluster nodes by attribute values of the nodes—for example, all people who graduated from the same university are in the same cluster [18]. This introduces the visual challenge of dealing with multiple memberships.

References

[1] Newman M.E.J., Girvan M. Finding and evaluating community structure in networks. Phys. Rev. E. 2004;69:026113.

[2] Lewis J.B., Poole K., Rosenthal H., Boche A., Rudkin A., Sonnet L. Voteview: Congressional Roll-Call Votes Database. https://voteview.com/. 2019.

[3] Wakita K., Tsurumi T. Finding community structure in mega-scale social networks: [extended abstract]. In: Proceedings of the 16th International Conference on World Wide Web (Banff, Alberta, Canada, May 08–12, 2007). WWW ‘07. ACM, New York; 2007:1275–1276.

[4] Newman M.E.J. Detecting community structure in networks. Eur. Phys. J. B. 2004;38:321–330.

[5] Witten I.H., Frank E., Hall M.A., Pal C.J. Data Mining: Practical Machine Learning Tools and Techniques. fourth ed. Cambridge, MA: Morgan Kaufmann; 2016.

[6] Rodrigues E.M., Milic-Frayling N., Smith M., Shneiderman B., Hansen D. Group-in-a-Box layout for multi-faceted analysis of communities. In: Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on IEEE; 2011:354–361.

[7] Dunne C., Shneiderman B. Motif simplification: improving network visualization readability with fan, connector, and clique glyphs. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM; 2013:3247–3256.

[8] Li Q., North C. Empirical comparison of dynamic query sliders and brushing histograms. Proc. IEEE Symp. Inform. Vis. 2003;2003:147–154.

[9] Tweedie L., Spence B., Williams D., Bhogal R. The attribute explorer. In: Proceedings of the CHI ‘94 Conference Companion on Human Factors in Computing Systems, ACM Press, New York; 1994:435–436.

[10] Wittenburg K., Lanning T., Heinrichs M., Stanton M. Parallel Bargrams for consumer-based information exploration and choice. In: Proceedings 14th Annual ACM Symposium on User Interface Software and Technology, ACM Press, New York; 2001:51–60.

[11] Perer A., Shneiderman B. Systematic yet flexible discovery: guiding domain experts through exploratory data analysis. In: Proceedings ACM 13th International Conference on Intelligent User Interfaces, New York, NY; 2008:109–118.

[12] Perer A., Shneiderman B. Integrating statistics and visualization: case studies of gaining clarity during exploratory data analysis. In: CHI ‘08: Proceedings SIGCHI Conference on Human Factors in Computing Systems, ACM, New York, NY; 2008:265–274.

[13] Shneiderman B. Inventing discovery tools: combining information visualization with data mining. Inf. Vis. 2002;1(1):5–12.

[14] Fortunato S., Hric D. Community detection in networks: a user guide. Phys. Rep. 2016;659:1–44.

[15] Malliaros F.D., Vazirgiannis M. Clustering and community detection in directed networks: a survey. Phys. Rep. 2013;533(4):95–142.

[16] Loe C.W., Jensen H.J. Comparison of communities detection algorithms for multiplex. Phys. A Stat. Mech. Appl. 2015;431:29–45.

[17] Blondel V.D., Guillaume J.-L., Lambiotte R., Lefebvre E. Fast unfolding of community hierarchies in large networks. arXiv 2008. 0803.0476. Available at: http://works.bepress.com/lambiotte/4.

[18] Aris A., Shneiderman B. Designing semantic substrates for visual network exploration. Inf. Visual. J. 2007;6(4):1–20.

Additional resources

[Bedi and Sharma, 2016] Bedi P., Sharma C. Community detection in social networks. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2016;6(3):115–135.

[Fortunato and Hric, 2016] Fortunato S., Hric D. Community detection in networks: a user guide. Phys. Rep. 2016;659:1–44.

[Girvan and Newman, 2002] Girvan M., Newman M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A. 2002;99:7821–7826.

[Newman, 2006] Newman M. Modularity and community structure in networks. Proc. Natl. Acad. Sci. U. S. A. 2006;103:8577–8582.
