Chapter 6. Graph Statistics

While much of the work we do with graph analysis is visually oriented, there are still some critical ways to assess networks that rely more heavily on calculated measures. These measures can help us to add some hard numbers to our existing graph and begin to compare elements both within a network as well as across networks.

Graph statistics come in many flavors and can be used to assess multiple aspects of a network, giving us additional support beyond the visual analysis we might already have done. In this chapter, we'll take a tour through many of the statistical measures available in Gephi in the following order:

  • We'll start with a general overview of graph statistics and learn how and why they are critical to understanding the structure of our network.
  • Next, we'll focus on interpreting many of the statistical measures using the Gephi Statistics tab.
  • Finally, we'll spend a significant portion of the chapter working with some network data, using the statistics to illustrate structural differences between networks. For this section, we will again use the primary school network data used in Chapter 5, Working with Filters.

Let's begin with an overview of some of the primary methods used within the field, with a focus on those that are available within Gephi.

Overview of graph statistics

There are a number of directions we can head toward when we seek to understand the patterns and behaviors within a network. Here are a few of the key questions we might wish to understand:

  • What does the overall structure of the network look like? Is it densely connected, or is the network rather sparse?
  • Does our network have a high level of randomness, exemplified by few distinct patterns? Or are there very pronounced groups within the network that exhibit advanced levels of clustering?
  • Do we see evidence of high degree hubs in the network, as in the case of the Web, with many smaller nodes surrounding the hubs?
  • Are certain nodes critical within the structure of the network, perhaps acting as bridges between otherwise disconnected groups?

Each of these questions (and more) can be answered using graph statistics. Our job in this first section of the chapter is to better acquaint ourselves with which statistics to use and when to measure the various behaviors and structures of a network.

Network measures

Gephi provides many statistics that help us to assess the overall structure of our network, including measures for density, path length, hub detection, and overall connectedness. These measures can provide tangible numbers to help support an initial visual assessment of the network, and can also lead us to explore facets of the network that might have been less than obvious from a visual perspective.

We'll walk through several statistics and explore how you can use them to explore your own networks at a relatively macro level. Later sections will then delve into the patterns within the network, as measured by various centrality and clustering statistics.

Diameter

The concept of diameter in network analysis is very simple; it refers to the maximum number of connections required to traverse the graph. Another way to look at it is knowing how many steps it takes for the two most distant nodes in the network to reach one another. In practice, there will be more than a single pair of nodes that fit this description, but the concept remains the same.

Understanding the diameter of the graph helps us begin to comprehend the structure of a network; a graph with a diameter of three will often be less complex than one with a diameter of seven, although this is not always true. Similarly, if we know that a single node has an eccentricity level of four in a network where the diameter is seven, it lets us know that the node is likely quite central to the network.

Eccentricity

Eccentricity refers to the number of steps required for an individual node to cross the network. This number is limited by the diameter of the graph; for example, a graph with a diameter of seven will have nodes with a maximum eccentricity level of seven.

When used to compare nodes, eccentricity can help provide some context to assess the relative position and influence of nodes within a network. While not a substitute for the various centrality measures, eccentricity can nonetheless provide some clues into the relative importance of individual nodes within the network.

Graph density

Graph density is a measure of the level of connected edges within a network relative to the total possible value and is returned as a decimal value between zero and one. Graphs with values closer to one are typically considered to be dense graphs, while those closer to zero are termed as sparse graphs. What constitutes dense versus sparse will vary depending on the type of network data being studied, so it might not be appropriate to compare numbers between very different network subjects. Nonetheless, this is an important measure to develop an understanding for how an individual network is structured, and might help identify gaps or holes within the graph.

Average path length

The average path length (abbreviated as Avg. Path Length in Gephi) provides a measure of communication efficiency for an entire network, by measuring the shortest possible path between all nodes in the network. An overall number is calculated for the entire network, with lower numbers giving an indication that the network is relatively more efficient, and with high average numbers signifying a relatively inefficient graph for information flow.

This number will necessarily be less than the network diameter, as that value represents the maximum path length between nodes.

Connected components

When we have a network where more than one component exists, the Connected components tool can be used to provide us with a simple read on the number of distinct components within the network. When our network is fully connected, a value of 1 will be returned, so there is little need for this calculation. However, in very large networks it might be difficult to visually determine whether the network is fully connected, so we can use this function to ascertain the number of components.

Erdos number

The original Erdos number definition was based on the number of connections other network researchers had with Paul Erdos, the most prolific (published) mathematician in history, with more than 1,500 papers to his credit. The number that bears his name provides an indication of the collaborative distance between other mathematicians and Erdos. For example, an Erdos number of 1 indicates a direct collaboration with Erdos, while a value of 2 indicates that someone collaborated with one of Erdos' own collaborators, and so on.

In Gephi, this statistic is applied more generally, by specifying a base node (an Erdos proxy), and then determining the Erdos number for all other nodes in the network. Those directly connected to the base node have a value of 1, followed by the nodes which are connected to those with a 1, which have a value of 2, and so on. To work with Erdos numbers in Gephi, be sure to install and activate the Social Network Analysis plugin.

HITS

The Hyperlink-Induced Topic Search (HITS) function is based on the work by Kleinberg, and it calculates two separate values for each node. The first, termed as Authority, provides a measure for how valuable information stored by a particular node is, while the Hub number measures the quality of the links to and from that particular node. These measures can help to identify or confirm the roles played by critical members within the network.

Edge betweenness

The edge betweenness statistic provides a glimpse into how often specific edges reside within shortest paths between network nodes. This can help us to illustrate the most efficient paths for traversing a network by identifying the most frequently used paths. As with many of the statistics, the number can be returned in raw nominal form (say 10 or 12), or as a normalized value between zero and one. Normalized values allow us to make comparisons between networks to gauge relative edge betweenness importance levels, while non-normalized numbers are highly dependent on the size and structure of each network.

To work with this measure in Gephi, be certain that you have the Edge Betweenness Metric plugin installed and activated.

Centrality measures

The role of centrality within a network and its member nodes is part of what we generally attempt to understand. There are many ways to measure centrality, each with its own algorithm. Generally speaking, each of these approaches will help us understand a specific type of centrality, as opposed to offering competing versions of the same measurement.

Think of the various ways in which we might wish to understand centrality in a network:

  • We might begin by measuring how central a specific node is relative to the entire network. Is the node in question highly connected and centrally positioned within the network, or is it on the fringes with just a few connections? This is probably the first and most obvious of the centrality measures, but there are several other options.
  • Is a node surrounded by highly connected neighbors, without necessarily being directly linked to large portions of the network?
  • Is a given node most likely to be connected to other key members of a network, thus affirming its status as an important player, or are most of the connections with members on the fringe of the network, thus reducing the influence level?
  • Does a node form a bridge between otherwise unconnected portions of the network, even though it might otherwise appear to be unimportant to the network structure? This can be measured by how often a node appears on the shortest path between other nodes.

These are the four critical questions to address when we are applying centrality measures, and each has one or more methods in Gephi to help supply definitive answers. Let's take a look at these methods, first in theory, and then through some actual applications later in the chapter. We will include the standard statistical calculations used for each measure, but it is not critical that you commit these to memory. The definitions are provided for those of you who wish to further understand or explore the mathematical constructs further, but we will not spend additional time on the formulas within this book.

Degree centrality (undirected graphs)

Perhaps the simplest approach to measure centrality is found in this form, where we assess importance through the number of direct connections (degrees) one node has to other nodes. This approach is applicable in undirected networks only, but does have corollary measures (In-degree and Out-degree) designed for directed networks.

The assumption with degree centrality is that the number of connections is a key measure of importance or influence within the network. In an undirected network we do not have the luxury of determining whether one node exerts more or less influence in a relationship; we merely see that they are in fact connected and as such are weighted equally. As anyone familiar with the structure of the Web is aware, this is not always the most accurate way to determine influence, but it likely provides a good proxy for relative importance in a network structure.

Note that there are two ways in which we can measure each of these centrality methods:

  • The first, and most obvious way is to use the nominal degree figure, such as 49, 32, or 18 degrees. While this is informative within the context of a single network, it doesn't allow for comparisons between networks. Larger networks are quite likely to have more nodes with a higher degree measure, yet these nodes might not be as important as a node with a lower degree level in a smaller network.
  • The second, slightly more informative approach, albeit more complex, is to normalize the number based on the maximum potential degree level. Using the same three numbers, and assuming a network of 200 possible direct connections, we can then calculate the relative centrality of each node. This then provides us with the numbers that are better suited for comparison to other networks, as all nodes will have a decimal value between 0 and 1.

In-degree centrality (directed graphs)

If the graph has directed edges, we have the ability to go beyond the simpler degree measure and split it into In-degree and Out-degree measurements. These two measures can inform us whether a network is highly reciprocal (In-degree and Out-degree measures are quite similar) or not. Referring back to the Web example, a site such as Google has a very high In-degree measure, as millions of external sites link to Google. Most of these sites will have very few inbound links of their own and will therefore have very low levels of In-degree centrality. This will lead us to conclude that the Web is primarily a unidirectional network, characterized by a small proportion of hubs with high levels of In-degree connections, while the bulk of sites are weighted far more heavily to Out-degree links.

Out-degree centrality (directed graphs)

Another important measure for directed graphs is the measure of the number of Out-degrees, defined as connections flowing from a selected node to a range of other network members. We have already noted that a high level of Out-degrees relative to In-degrees will often tell us that a particular node is not thought of as a direct information source. This same node can however provide paths to a variety of critical sources, serving as an information aggregator or hub for others.

Closeness centrality

Closeness centrality represents an interesting case, wherein the selected node might actually be poorly connected in a direct sense, yet is still highly influential due to the proximity of well-connected neighbors. Consider the case of someone who is seen as a gatekeeper to an important and influential person; this individual might have relatively few first degree connections, but is still well positioned due to the presence of a high proportion of highly connected nodes as direct connections.

Eigenvector centrality

When nodes are highly connected to other nodes with high levels of influence, the result will be a high-level of Eigenvector centrality. In this case, it is not simply being connected to many other nodes that is critical, but being connected to the most highly influential nodes is paramount. Google's page rank, the famous algorithm used by Google Search to rank websites in their search engine results is a variation of this approach, taking into consideration the connections between sites and weighting their relative importance in terms of web searches, page hits, and other criteria.

Betweenness centrality

The betweenness centrality presents us with a rather unique case, identifying nodes that might well be poorly connected as defined by other centrality measures. In cases where these nodes offer the most direct path between otherwise disconnected clusters, we have what is often termed as a bridge. Being a bridge is not a necessary precondition to have a high betweenness centrality score, but it is often the case that these nodes will rank as critically important using this measure.

Clustering and neighborhood measures

Another key aspect of network graph structure is the level at which nodes group together into various clusters or groups, defined by some shared characteristics or behaviors. These common characteristics might come about as the result of a wide variety of social or geographic factors, including education levels, race, gender, profession, and so on. Gephi provides a few key statistics to help us assess the relative levels of clustering within a network that we'll discuss in the following sections.

Clustering coefficient

With the Clustering coefficient, Gephi provides us the ability to measure the level at which the nodes are grouped together, as opposed to being equally or randomly connected across the network. Scores on this measure will have an inverse correlation with other statistics, including several of the centrality calculations, particularly when we are speaking at the global level (the entire graph). We can also measure this statistic at a local-level, to understand the influence of a single node within its own neighborhood.

This calculation is based on calculating the number of closed triangles (triplets) relative to the potential number of triangles (triplets) available in the network. When we speak of triangles in network graph terminology, we are referencing the connections between the nodes that form complete triangles; in essence, this measures the degree to which your own friends are also friends with one another. In network science language, this could be termed a clique of size 3. High clustering coefficient scores reflect a network where this is most likely true. One might anticipate a high score from tightly knit social groups or clubs (church membership for example), whereas geographically remote networks might be expected to produce lower scores.

At the local level, the measurement determines how likely a single node's neighbors are to form a complete graph (clique). If all possible edge connections are made, the local clustering coefficient would have a score of 1.0, reflecting a graph where all available connections have been created.

Number of triangles

Another way to determine the connectedness and clustering patterns within a network is by measuring the number of triangles present, a measure that provides the nominal count of triangles that are part of the clustering coefficient calculation. This is simply a summation of all possible triplets (connections between three nodes with two or more nodes connected) in an individual node's network, regardless of whether we have a closed triplet (with three edges) or an open triplet (two of three edges are present). To calculate the clustering coefficient, we need to understand what proportions of the triplets are closed (complete).

Modularity

One more approach to measure clustering in a network is through the application of the modularity statistic, which attempts to assess the number of distinct groupings within a network. This can be done simply by using this statistic, or through the use of one of the Gephi plugins geared to parse nodes into distinct groups. The Chinese Clusters tool can be used for this purpose, as well as the MCL, MCODE, and Girvan Newman plugins. The end goal for any of these algorithms is to group nodes based on the strength of their relationships. Nodes that are highly connected are likely to wind up in a common cluster, regardless of which algorithm is employed. Yet each approach will return slightly different results depending on the size and structure of the network as well as the statistical thresholds employed.

Link Communities

The Link Communities function can be used to understand how individual nodes connect to multiple communities or neighborhoods within the graph. This gives us a slightly different view of the network compared to the most clustering or partition approaches, which force each node into a single grouping. In the communities' case, each edge connection is identified with one or more communities, creating a more complex but potentially more insightful mapping of the network and its relationships.

Neighborhood overlap and embeddedness

The idea of embeddedness and overlap are part of a larger area of network analysis termed as social cohesion. While these are not exactly synonymous with traditional clustering, they are certainly related concepts. The basic premise is that rather than neatly defined clusters or partitions, most networks will have nodes with connections into multiple groups or neighborhoods. We can then measure the embeddedness level of specific edges to help determine the neighborhoods where a given node has the strongest associations.

Note

Note that a node can still be connected to other more distant nodes and their respective neighborhoods, but if this is a random or singular connection, the resulting level of embeddedness will be lower.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.236.27