Interpreting graph statistics

Now that we have discussed all of the measures available in Gephi and provided a brief overview of what they can do, the time has come to understand what the raw numbers signify. Each of these measures provides statistical output that needs to be interpreted before we can understand an individual network or make accurate comparisons between networks.

Fortunately, many of these measures provide normalized output, that is, values that reside on a zero to one continuum, facilitating easy comparisons between different networks. We'll spend a little time on the interpretation basics before moving on to the actual application of many of these statistics to our working network.

Interpreting network measures

When we are examining the various network statistics in Gephi, our goal is to understand the general structure of the network. There are a few simple questions we need to answer using these functions:

  • How large is our network, measured by how many steps it takes for the two most distant nodes to connect to one another?
  • On average, is our network closely connected, with perhaps a few outliers, or do a high percentage of the members reside many steps away from other nodes?
  • Is the graph relatively dense (that is, many connections between nodes) or would it be considered sparse, with few connections relative to the size of the network?
  • Does the network have large hubs with many smaller surrounding nodes, or is there a more random structure to the graph?
  • Do we see clearly defined groupings, or is the network highly connected in many directions?

These are all key questions to address as we begin to more fully understand the structure of the network, and they can all be answered using some of the statistical tools we just reviewed.

Before we take this any further, it is critical to understand that all statistical measures are calculated relative to the network we are working on. This means that we cannot literally compare statistics from one network to another, although we might be able to make some of these comparisons in a relative sense. All statistics will be subject to external variables such as the industry or situation being studied, the definition of connectedness (are we measuring collaboration or simple interaction?), the time span being studied (if applicable), and a host of other considerations.

So with those caveats in mind, let's move on to a basic understanding of what we are looking for as we interpret each of the available statistics.

One of the simplest measures to understand is diameter, which is merely a measure of the distance between the two most distant nodes in the network. However, even this rudimentary statistic has some subjectivity involved in its interpretation. Let's consider two networks, each with a diameter of six. Simple, right? At least until I inform you that Network A has 200 nodes, while Network B has 2,000. This begins to alter our perception about the structure of each network relative to the other. Network A suddenly seems inefficient compared to Network B, which should at least pique our interest as to why two networks of such different sizes require the same number of steps to traverse fully.
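
To see how the same diameter can arise from very different network sizes, here is a minimal sketch using Python's networkx library rather than Gephi itself; the two randomly generated graphs and their edge probabilities are purely illustrative stand-ins for Network A and Network B.

    import networkx as nx

    # Illustrative stand-ins for Network A (200 nodes) and Network B
    # (2,000 nodes); the edge probabilities are arbitrary choices.
    net_a = nx.erdos_renyi_graph(200, 0.02, seed=1)
    net_b = nx.erdos_renyi_graph(2000, 0.004, seed=1)

    for name, g in [("Network A", net_a), ("Network B", net_b)]:
        # Diameter is only defined on a connected graph, so measure the
        # largest connected component.
        core = g.subgraph(max(nx.connected_components(g), key=len))
        print(name, core.number_of_nodes(), "nodes, diameter =", nx.diameter(core))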

Eccentricity is closely related to diameter, as it essentially measures the diameter for each node within the network, rather than providing a single network-level number. Every node in the network (assuming they are all connected) will have a value between 1 and the diameter, with 1 possible only for a node directly connected to every other member. In the case where each of our two networks has a diameter of 6, it would be very interesting to learn how many individual nodes actually reach that value. If there are just a few nodes at the edges of the graph with an eccentricity value equal to 6, then there is perhaps little to be gleaned from this information. If, however, 10 percent of nodes in Network A have an eccentricity measure equal to the graph diameter, compared to just 3 percent in Network B, then we can safely assume that the relative structures of the two networks likely differ to a considerable degree.
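
A quick way to ask the same question outside Gephi is to compute every node's eccentricity and count how many sit at the periphery; the sketch below uses networkx with a built-in sample graph standing in for our own networks.

    import networkx as nx

    g = nx.les_miserables_graph()      # any connected sample graph will do
    ecc = nx.eccentricity(g)           # eccentricity for every node
    diameter = max(ecc.values())       # the diameter is the largest eccentricity

    peripheral = [n for n, e in ecc.items() if e == diameter]
    share = 100 * len(peripheral) / g.number_of_nodes()
    print(f"diameter = {diameter}; {share:.1f}% of nodes sit at the periphery")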

If we wish to understand the role played by a single node in the network, we can apply the Erdos number calculation to see how all other nodes connect to this single member. In a sense, this is related to the eccentricity calculation we just discussed, although we are now looking at the inverse—how many steps does it take for all other nodes to connect to a single selected member?
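
In shortest-path terms, an Erdos-style calculation is simply the distance from one chosen member to every other node. The hypothetical sketch below uses networkx on a sample graph, with an arbitrary focal node.

    import networkx as nx

    g = nx.les_miserables_graph()
    # Shortest-path distance ("Erdos number") from one chosen member to
    # every other node; "Valjean" is an arbitrary focal node in this sample.
    distances = nx.single_source_shortest_path_length(g, "Valjean")

    # Show the five members closest to the focal node (skipping the node itself).
    for node, steps in sorted(distances.items(), key=lambda kv: kv[1])[1:6]:
        print(node, steps)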

The graph density statistic gives us critical feedback on the structure of the graph as measured by the proportion of edge connections relative to the total available number. As you would expect, this will return a value between 0 and 1, with higher values indicative of a dense graph where a majority of available connections have been completed. This is another measure where there is some room for subjective judgment, even though the returned value is based on a mathematical calculation. Take the earlier example of a collaboration network versus a simple interaction network (Facebook, for example). The collaboration network requires that two or more individuals work together on a paper or project; this is a more challenging criterion than the simple acknowledgement that James and his friends are friends or acquaintances. Therefore, we would anticipate a lower graph density statistic for the collaboration network, given the more demanding connection criteria.

The most appropriate situations in which to compare graph density measures are when the types of connections are similar, as in the case of Collaboration Network A versus Collaboration Network B, or Facebook connections versus Twitter connections. Either of these scenarios would furnish us with an apples-to-apples comparison where the various statistics could be fairly compared.
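
Graph density itself is a simple ratio, so it is easy to verify by hand; the sketch below, again using networkx on a sample graph, compares the manual calculation with the library's built-in function.

    import networkx as nx

    g = nx.karate_club_graph()
    n, m = g.number_of_nodes(), g.number_of_edges()

    # For an undirected graph, density = actual edges / possible edges.
    manual = 2 * m / (n * (n - 1))
    print(round(manual, 3), round(nx.density(g), 3))   # the two values agree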

Average path length can provide insight into the general structure and connectedness of a network. If we reference the average path length for a graph, and find it to be very close to the network diameter, then this is an indication that the graph is inefficient from a communication standpoint. If this is the case, it may well be due to the presence of multiple clusters in the network (homophily) that add friction to the process of traversing the graph.

On the other hand, if the Average Path Length measure is well below the diameter, then it is likely that the network is efficient, and information can easily flow across the graph. We do need to be cautious in making this assessment, particularly if certain paths are heavily dependent on one or more bridge nodes to facilitate connections. In this case, the removal of a single pivotal node may have significant negative consequences on the network as a whole.
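
The comparison described here, average path length against diameter, can be sketched in a few lines of networkx; the sample graph is only a placeholder for our own data.

    import networkx as nx

    g = nx.karate_club_graph()
    apl = nx.average_shortest_path_length(g)
    dia = nx.diameter(g)

    # An average path length well below the diameter suggests that most
    # pairs of nodes connect far more efficiently than the worst case.
    print(f"average path length = {apl:.2f}, diameter = {dia}")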

The HITS algorithm, as noted earlier, creates both the Authority and Hub statistics, which might or might not provide different values for each node. In the case of our primary school network dataset used in Chapter 5, Working with Filters, identical values are returned using this approach—an indication that no individual member is considered more authoritative than any other. In other words, the weighting is equal for all nodes, leading to values that are aligned with the hub statistics. If we had a directed network, the results would almost certainly be different. We still see differentiation within the hub statistic—a finding that is consistent with the degree measure produced earlier—but we have no way of determining the direction of the links, so the Authority number adds no additional value.

HITS is similar to the PageRank method for assessing web pages: higher authority scores indicate a page that is linked to by many hubs (In-degree connections), while the hub calculation is based on the number of outbound links (Out-degree connections).
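
A tiny directed example makes the hub/authority split easy to see; the sketch below builds a hypothetical graph in networkx where one node does most of the linking out and another receives most of the links.

    import networkx as nx

    # A hypothetical directed graph: node 1 links out to several others,
    # while nodes 1-3 all link in to node 4.
    g = nx.DiGraph([(1, 4), (2, 4), (3, 4), (1, 2), (1, 3), (1, 5)])

    hubs, authorities = nx.hits(g)
    print("top hub:      ", max(hubs, key=hubs.get))                  # node 1 (outbound links)
    print("top authority:", max(authorities, key=authorities.get))    # node 4 (inbound links)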

Edge betweenness is an approach that helps to sort out the paths that are most often taken to minimize travel across the network. In the normalized version, this measure will reside somewhere between 0 and 1, with a score of 1 indicating that every shortest path to a desired node or nodes must travel via that particular edge. Conversely, scores near 0 signify that there are other, more efficient paths available for making the shortest journey between two or more points. In cases where a small number of edges carry very high values (that is, near 1), there is a risk that if these connections are lost, the network will become more fragmented and communications across the graph will suffer.
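
A barbell-shaped graph, two dense clusters joined by a single bridge, illustrates the extreme case; the networkx sketch below shows the bridging edges carrying nearly all of the shortest paths.

    import networkx as nx

    # Two dense clusters joined by a single bridge node, so the bridging
    # edges carry nearly all shortest paths between the clusters.
    g = nx.barbell_graph(5, 1)

    eb = nx.edge_betweenness_centrality(g, normalized=True)
    bridge = max(eb, key=eb.get)
    print("highest edge betweenness:", bridge, round(eb[bridge], 3))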

Now that we have a more complete understanding of how to interpret and apply generalized graph statistics, we can move on to the various node-based centrality measures and understand how to interpret their output.

Interpreting centrality statistics

Centrality statistics provide the framework to compare the roles played by various nodes within a single network. Given that every node is resident in the same graph, we are able to judge these statistics on equal footing, as all nodes (at least in theory) have the same opportunities. In practice, as we have already seen from our school network, some members might be better positioned than others based on constructs such as grade level, physical location, and so on. If these structures influence the network significantly, we still have the ability to compare results within smaller groups such as individual classes.

Let's take a look at how we might construe the meaning of the statistical results provided by each of the centrality algorithms.

Degree centrality

Degree centrality, as noted earlier, is the simplest of the centrality measures, as it calculates a result based only on the number of connections possessed by a given node. This number does not inform us about the strength of each connection, nor does it tell us the importance of any connections. So when we look at results and see that Node 1551 has 98 direct connections, and Node 1609 has merely 18, we can probably assume that this indicates a dramatic difference in their respective levels of influence in the network. However, in cases where the difference is of a much lower magnitude, we would not be so comfortable making this claim and would certainly want to support it through some of the more sophisticated measures to follow.
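
Counting connections is the whole calculation here; a minimal networkx sketch on a sample graph is shown below, with the node IDs purely illustrative.

    import networkx as nx

    g = nx.karate_club_graph()

    # Degree centrality is just the raw count of connections per node.
    degrees = dict(g.degree())
    for node in sorted(degrees, key=degrees.get, reverse=True)[:3]:
        print("node", node, "has", degrees[node], "direct connections")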

In-degree centrality

When we have a directed graph, In-degree centrality provides a more robust measure of influence compared to a simple degree measure. This measure tells us how likely other nodes are to seek out a single node, whatever the reason might be—charisma, money, perceived influence, knowledge, or some other attribute that makes this member an attractive target for others. Let's assume from our earlier case that we have a directed graph, and we discover that Node 1551 has an In-degree measure of just 18 (implying an Out-degree number equal to 80). Let's also assume that Node 1552, with a total degree of 90, has an In-degree value of 50. Now who seems to play a more influential role in the network? A simple undirected degree measure would have told us that 1551 was marginally more important, while our In-degree number portrays a vastly different story. In this case, 1551 would appear to be a social butterfly who seeks out others, while 1552 is perceived as someone who provides greater influence, power, or prestige.
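
The same contrast can be reproduced with a hypothetical directed graph in networkx, where node "a" does the reaching out and node "b" is the one being sought.

    import networkx as nx

    # A hypothetical directed graph in the spirit of the 1551 vs. 1552
    # comparison: "a" reaches out to many others, "b" is sought out by many.
    g = nx.DiGraph()
    g.add_edges_from(("a", target) for target in ["c", "d", "e", "f", "g"])
    g.add_edges_from((source, "b") for source in ["c", "d", "e", "f", "g"])

    print("a: in-degree =", g.in_degree("a"), "out-degree =", g.out_degree("a"))
    print("b: in-degree =", g.in_degree("b"), "out-degree =", g.out_degree("b"))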

Out-degree centrality

Out-degree centrality, as we just touched on, might well indicate someone who reaches out to others for more information, influence, or prestige; or in the case of the Web, it could be an indication that many small sites have high connection levels to hubs such as Google or Yahoo!. This is especially the case when there is a significant imbalance between the Out-degree and In-degree measures. Certainly, it is possible for individuals to have relatively even profiles in this respect, although this is generally not the case when networks are dominated by a small number of hubs, be they individuals, websites, or some other entities.

Closeness centrality

With Closeness centrality, we begin moving toward a more sophisticated measurement of centrality; in this case defined by the effort required (as measured by path length) to reach all other nodes in the network. As noted earlier, we might well have a number of graph members who score low on other measures such as degree centrality, but who still exercise influence based on their position within the network. It is often the case that a single node scores well on both measures: an individual with enough direct connections will likely have fewer long paths to other nodes, resulting in a lower (that is, more connected) score on this statistic.

In terms of output, each node will have a measure of 1 or higher, with lower values indicating greater centrality. Members of the network located at the outer edges are highly likely to have high relative scores, as they require more steps to reach all other nodes in the network. A final note on Closeness centrality: the network must be completely connected in order to run this measure; otherwise, unreachable nodes produce an infinite result.
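
Since the output described here is an average distance (lower meaning more central), the sketch below computes that average directly with networkx; note that networkx's own closeness_centrality() function reports the reciprocal instead, so it is not used here.

    import networkx as nx

    g = nx.karate_club_graph()
    n = g.number_of_nodes()

    # Average shortest-path distance from each node to all others
    # (lower value = more central), assuming a fully connected graph.
    avg_distance = {
        node: sum(lengths.values()) / (n - 1)
        for node, lengths in nx.shortest_path_length(g)
    }
    most_central = min(avg_distance, key=avg_distance.get)
    print("most central node:", most_central, round(avg_distance[most_central], 2))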

Eigenvector centrality

Eigenvector centrality takes Closeness centrality a step further. Instead of being most concerned by the number of steps required to reach all other members of the network, our focus is on the importance of the connections. This is measured by understanding whether individual nodes have a high proportion of their connections to influential members, which boosts their own standing. This is calculated on a 0 to 1 continuum, with values near 1 indicating the highest levels of centrality. In our school classroom network, values range from 0.1 to 1, indicating vastly different levels of influence within the network.
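
The sketch below computes eigenvector centrality with networkx on a sample graph and rescales the scores so the most central node sits at 1, matching the 0-to-1 continuum described above (the rescaling step is an assumption about how the values are presented, not part of the library's default output).

    import networkx as nx

    g = nx.karate_club_graph()
    ev = nx.eigenvector_centrality(g)

    # Rescale so the most central node scores exactly 1, giving the
    # 0-to-1 continuum described above.
    top = max(ev.values())
    scaled = {node: score / top for node, score in ev.items()}
    print(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:3])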

Eigenvector centrality is likely to show a positive correlation with Degree centrality and Closeness centrality, although the relative correlations will of course change depending on the network structure.

Betweenness centrality

Betweenness centrality plays a unique role in understanding how information, influence, and even diseases flow within networks. Unlike the other centrality measures, which are largely predicated on direct connections and/or proximity to others with many connections, betweenness measures how important an individual node is for others traversing the network. Specifically, this measure tells us how often a given node lies on the shortest path between two other nodes.

It is quite possible, and often likely, that a member with a very high betweenness centrality score has very low numbers on all of the other centrality calculations. This will be highly dependent on the network structure and is far more likely to be true in networks with higher levels of clustering.
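
The broker situation described here is easy to reproduce: in the networkx sketch below, the node joining two dense clusters has only two connections yet the highest betweenness score.

    import networkx as nx

    # Two dense clusters joined through a single broker node: the broker
    # has a modest degree but the highest betweenness score.
    g = nx.barbell_graph(5, 1)

    bc = nx.betweenness_centrality(g, normalized=True)
    broker = max(bc, key=bc.get)
    print("broker:", broker, "betweenness =", round(bc[broker], 3),
          "degree =", g.degree(broker))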

From here, let's transition to a discussion on the meanings of the various clustering statistics, and how we can develop a better understanding of the aggregation patterns within the network. These measures will help to bridge the gap between statistics that examine the network in its entirety and the centrality calculations, which were focused on the role of individual members within the network.

Interpreting clustering statistics

Clustering statistics are crucial to our understanding and interpretation of network behavior, and specifically, how individual nodes interact and form groups, or clusters. This is the third and final of our series of graph statistics. The first series began with the concept of overall network structure, as embodied by measures such as diameter, path length, eccentricity, and edge betweenness. Our second series was focused on centrality, and how individual nodes are related to the network as a whole. When we look at clustering behaviors, we are synthesizing the first two series through the study of how individual nodes work together to shape the internal structure of the network. These measures will help close the loop, explaining why certain networks have very high (or low) diameters relative to the number of members, and how individual nodes manifest high (or low) centrality scores.

We'll begin with a discussion of how to interpret the clustering coefficient score, followed by brief overviews to understand triangles, modularity, Link Communities, and finally embeddedness.

Interpreting clustering coefficients

The global clustering coefficient is based on the concept of triplets: sets of three nodes connected by either two or three edges. In the case of two connections, the triplet is referred to as an open triplet, while if all three nodes are connected to one another, we have a closed triplet, also referred to as a triangle. The statistic is then calculated by dividing the number of closed triplets by the total number of triplets (open and closed), resulting in a coefficient value between 0 and 1.
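
The calculation maps directly onto networkx's transitivity() function (the global coefficient) and clustering() function (the per-node version used in the following paragraphs); the sample graph below is a placeholder for our own network.

    import networkx as nx

    g = nx.karate_club_graph()

    # Global clustering coefficient = closed triplets / all triplets.
    # networkx calls this "transitivity"; each triangle counts as three
    # closed triplets, one centred on each of its nodes.
    print("global clustering coefficient:", round(nx.transitivity(g), 3))

    # Per-node (local) clustering coefficients, as discussed below.
    local = nx.clustering(g)
    print("node 0 local coefficient:", round(local[0], 3))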

An interesting observation can be made about clustering coefficients—they are often inversely correlated with the number of degree connections possessed by a single node. Part of this is easily explained: nodes with many connections spanning the network tend to link to many others whose own contacts do not know one another (most networks show a wide range of degree values, from low to high). This is certainly the case in our school network, where the most highly connected members have rather low clustering coefficients, while those nodes that tend to stay closer to home with their links sport much higher values. The idea of having your own friends connect to one another (thus forming closed triplets) is more likely to happen in smaller, more cohesive groups, and this is ably demonstrated by the school network example.

Number of triangles

The Number of triangles statistic is simply a count of all the triangles formed by a node's direct connections. As you might have anticipated, this is strongly and positively correlated with the number of degree connections, although the literal rank order might change slightly. Used in tandem with the clustering coefficient value, we can back into the number of closed triplets each node has in their personal network, allowing for some rounding error.
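
The back-of-the-envelope relationship works as follows: a node's triangle count equals its clustering coefficient multiplied by the number of possible pairs among its neighbors. A short networkx check on a sample graph:

    import networkx as nx

    g = nx.karate_club_graph()
    triangles = nx.triangles(g)
    clustering = nx.clustering(g)

    # triangles(v) = clustering(v) * (degree(v) choose 2), allowing for
    # rounding; node 0 is just an arbitrary example.
    node = 0
    possible_pairs = g.degree(node) * (g.degree(node) - 1) / 2
    print(triangles[node], round(clustering[node] * possible_pairs))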

Modularity

The Modularity statistic places individual nodes into an aggregated group or cluster based on shared characteristics. This is a simpler approach compared to plugins expressly designed for clustering and partitioning, but it can nonetheless help in understanding the network and, combined with filtering methods, provides a very easy means to do so. Output for this function is simply an integer community ID for each node, starting at 0.
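
Gephi's Modularity statistic is based on the Louvain method; the sketch below substitutes networkx's greedy modularity algorithm, a different but related approach, simply to show the kind of output produced: a modularity score for the partition and an integer community ID per node.

    import networkx as nx
    from networkx.algorithms import community

    g = nx.karate_club_graph()

    # A greedy modularity-based partition (not Gephi's Louvain method),
    # assigning each node an integer community ID starting at 0.
    groups = community.greedy_modularity_communities(g)
    membership = {node: i for i, members in enumerate(groups) for node in members}

    print("modularity of the partition:", round(community.modularity(g, groups), 3))
    print("node 0 belongs to community", membership[0])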

Link Communities

With the Link Communities statistic, we can begin to understand deeper levels within the network that might go undetected if we rely on traditional clustering or partitioning approaches. This statistic will group edges into communities based on similar patterns, applying a single number to every edge in the network. These communities can encompass a single edge or can be composed of dozens of edges that follow similar paths.

Embeddedness

Embeddedness refers to the level at which specific edges tie nodes into neighborhoods across the network. If a given edge connects two nodes across the network in a solitary fashion, then we will expect the embeddedness score to be quite low, perhaps approaching zero. This suggests that these two nodes have relatively little in common, despite their direct connection, which might be simply incidental. The fact that the two nodes have been positioned far apart from one another by a layout algorithm tends to confirm this belief.

Higher embeddedness scores can help direct us toward an understanding of secondary neighborhoods that a given node is linked with, which might be far less evident through a simple inspection of the network layout. We would still anticipate the highest scores to be strongly correlated to the close physical proximity of similar nodes; however, some moderately high scores might point us to other associations or neighborhoods that are important to an individual node.
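
One common way to quantify this idea (it may differ from the exact formula used by Gephi's plugin) is to score each edge by the number of neighbors its two endpoints share: a solitary connection scores 0, while an edge buried inside a tight neighborhood scores highly. A networkx sketch:

    import networkx as nx

    g = nx.karate_club_graph()

    # Score each edge by the number of neighbours its endpoints share; an
    # edge with no shared neighbours is a purely "solitary" connection.
    embeddedness = {
        (u, v): len(list(nx.common_neighbors(g, u, v))) for u, v in g.edges()
    }
    most_embedded = max(embeddedness, key=embeddedness.get)
    print("most embedded edge:", most_embedded, embeddedness[most_embedded])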
