© Ervin Varga 2019
E. Varga Practical Data Science with Python 3https://doi.org/10.1007/978-1-4842-4859-1_10

10. Graph Analysis

Ervin Varga1 
(1)
Kikinda, Serbia
 

A graph is the primary data structure for representing different types of networks (for example, directed, undirected, weighted, signed, and bipartite). Networks are most naturally represented as a set of nodes with links between them. Social networks are one very prominent class of networks. The Internet is the biggest real-world network, easily codified as a graph. Furthermore, road networks and related algorithms to calculate various routes are again based on graphs. This chapter introduces the basic elements of a graph, shows how to manipulate graphs using a powerful Python framework, NetworkX, and exemplifies ways to transform various problems as graph problems. This chapter also unravels pertinent metrics to evaluate properties of graphs and associated networks. Special attention will be given to bipartite graphs and corresponding algorithms like graph projections. We will also cover some methods to generate artificial networks with predefined properties.

The main reason for representing a real-world phenomenon as an abstract network (graph) is to answer questions about the target domain by analyzing the structure of that network. For example, if we represent cities as nodes and roads between them as links, then we can readily use many algorithms to calculate various routes between cities. The criteria could be to find the shortest, fastest, or cheapest routes. In social networks, it is interesting to find the most influential person, which can be mapped to the notion of centrality of a node inside a graph. The ability to associate a problem space with a matching graph structure is already half of the solution.

Usage Matrix As a Graph Problem

The best way to understand the types and elements of a graph is to see things in action using NetworkX ( https://networkx.github.io ). Figure 10-1 shows the major use cases (UCs) of a generic Internet of Things (IoT) platform. Suppose that you want to optimize the system using an analytical framework called Usage Matrix. It provides guidance to pinpoint opportunities for tuning the code base. This technique is applicable in many contexts, including data science products. Table 10-1 depicts the template of a usage matrix. Columns denote individual use cases or functions. Rows represent resources or data nodes. Table 10-2 shows one instance of the usage matrix (the values are made up just to illustrate the technique).

Summing down column j designates the relative volume of accesses against all resources caused by that use case. Summing across row i designates the relative volume of accesses against that resource caused by all use cases. Outliers mark the set of columns/rows that are candidates for optimization; that is, the corresponding use cases and/or resources should be prioritized for performance tuning according to the law of diminishing returns. A particular cell’s value is calculated as aij = freqijnijpj, where freqij denotes the use case j’s frequency regarding resource i (how many times use case j visits this resource per unit of time), nij shows the number of accesses per one visit, and pj is the relative priority of use case j.

We can reformulate our optimization task as a graph problem. Nodes (vertices) are use cases and resources, while links (edges) are weighted according to the impact of use cases on resources. The “impact” relationships are unidirectional (from use cases toward resources). The graph also shows other relationships, too. For example, there are directional edges denoting specific include and extend UML constructs for use cases. Finally, actors are also associated with use cases using the “interact” bidirectional relationships. Overall, a graph where one node may have parallel links is called a multigraph ; our graph is a weighted directed multigraph. A graph where all links are bidirectional and have some default weight is an example of an undirected graph.

Now, it is easy to calculate the equivalents of column/row sums. Namely, the sum of links incident to a resource node denotes its row sum (as similar reasoning holds for the use case nodes). We can ask many interesting questions, too. Does actor 1’s usage of a system have higher impact on resources than actor 2’s? Which actor most intensely utilizes the resources? Which actor interacts with most use cases (directly or indirectly)? All answers can be acquired by simply processing the underlying graph using established graph algorithms.
../images/473947_1_En_10_Chapter/473947_1_En_10_Fig1_HTML.jpg
Figure 10-1

The major use cases and actors in an IoT platform presented as a UML use case diagram (see reference [1] for more information). The usage matrix focuses on use cases and their impact on resources.

Table 10-1

Generic Template of a Usage Matrix

 

UC 1

UCj   

UCn   

R 1

   

R i

 

a ij

 

R m

   
Table 10-2

Sample Instance of Usage Matrix Template

 

Row Total

Communicate

Manage Dev.

Exec. Data Analytics

Use Bus. Int.

Use Op. Int.

Network

615

120*1*5=600

5*3*1=15

0

0

0

Messaging

700

120*1*5=600

0

20*2*3=100

0

0

Database

2660

0

0

30*10*3=900

40*10*2=800

80*3*4=960

Column Total

 

1200

15

1000

800

960

The concrete numbers in Table 10-2 are made up and not important. The Send/Receive use case isn’t included explicitly in Table 10-2 because it is part of other use cases. Of course, you may structure the table differently. At any rate, this UC will be associated with the other UC via the dedicated relationship type. Note that the UML extend mechanism isn’t the same as subclassing in Python and we don’t talk here about inheritance. Based on this table, we see that improving the database (possibly using separate technologies for real-time data acquisition and analytics) is most beneficial.

Listing 10-1 shows the code for producing the matching graph using NetworkX. Notice that I have omitted edges with zero weight. The script is quite simple to showcase the ease with which you can construct networks with NetworkX. It is a usual convention to name variables referencing graphs with uppercase letters. You will need to install pydot before executing Listing 10-1 (run conda install pydot from the shell to deploy pydot). Figure 10-2 displays the resulting graph (see also Exercise 10-1).
import networkx as nx
G = nx.MultiDiGraph()
# Add all nodes with their role and label. You can immediately work with labels, but having
# short node identifiers keeps your code uncluttered.
G.add_node(0, role='use-case', label="Communicate")
G.add_node(1, role='use-case', label='Manage Dev.')
G.add_node(2, role='use-case', label='Exec. Data Analytics')
G.add_node(3, role='use-case', label='Use Bus. Int.')
G.add_node(4, role='use-case', label='Use Op. Int.')
G.add_node(5, role='use-case', label='Send/Receive Data')
G.add_node(6, role="resource", label="Network")
G.add_node(7, role="resource", label="Messaging")
G.add_node(8, role="resource", label="Database")
G.add_node(9, role="actor", label="Device")
G.add_node(10, role="actor", label="Application")
G.add_node(11, role="actor", label="User")
# Add edges for the 'impact' relationship.
G.add_edge(0, 6, weight=600, relation="impact")
G.add_edge(0, 7, weight=600, relation="impact")
G.add_edge(1, 6, weight=15, relation="impact")
G.add_edge(2, 7, weight=100, relation="impact")
G.add_edge(2, 8, weight=900, relation="impact")
G.add_edge(3, 8, weight=800, relation="impact")
G.add_edge(4, 8, weight=960, relation="impact")
# Add edges for the 'include' relationship.
G.add_edge(0, 5, relation="include")
G.add_edge(1, 5, relation="include")
G.add_edge(2, 5, relation="include")
# Add edges for the 'extend' relationship.
G.add_edge(3, 2, relation="extend")
G.add_edge(4, 2, relation="extend")
# Add edges for the 'interact' relationship.
G.add_edge(9, 0, relation="interact")
G.add_edge(0, 9, relation="interact")
G.add_edge(10, 1, relation="interact")
G.add_edge(1, 10, relation="interact")
G.add_edge(10, 2, relation="interact")
G.add_edge(2, 10, relation="interact")
G.add_edge(11, 3, relation="interact")
G.add_edge(3, 11, relation="interact")
G.add_edge(11, 4, relation="interact")
G.add_edge(4, 11, relation="interact")
# Visualize the resulting graph using pydot and Graphviz.
from networkx.drawing.nx_pydot import write_dot
# By default NetworkX returns a deep copy of the source graph.
H = G.copy()
# Set some display properties for specific nodes and extract labels.
node_labels = {}
for node_id in H.nodes():
    node_labels[node_id] = H.node[node_id]['label']
    role = H.node[node_id]['role']
    if role == 'resource':
        H.node[node_id]['style'] = 'filled'
        H.node[node_id]['fillcolor'] = 'cyan'
        H.node[node_id]['shape'] = 'component'
        H.node[node_id]['fixedsize'] = 'shape'
    elif role == 'use-case':
        H.node[node_id]['shape'] = 'oval'
    elif role == 'actor':
        H.node[node_id]['style'] = 'rounded'
        H.node[node_id]['shape'] = 'box'
H.node[5]['style'] = 'dashed'
nx.relabel_nodes(H, node_labels, copy=False)
pos = nx.nx_pydot.graphviz_layout(H)
nx.draw(H, pos=pos, with_labels=True, font_weight="bold")
write_dot(H, 'usage_matrix.dot')
Listing 10-1

Content of the usage_matrix.py Module

The script creates the 'usage_matrix.dot' file inside the current directory. It contains instructions for using Graphviz ( https://www.graphviz.org ) to render the graph. To generate the final graph, you will need to install this tool. Issue the following command from the shell window:
dot usage_matrix.dot -Tpng -Gsplines=ortho > usage_matrix.png

Caution

I advise you to make a copy of the graph before adding visual effects and relabeling nodes (use the copy parameter to perform the operation in place). If you are not careful with the latter, then you may waste lots of time trying to decipher what is going on with the rest of your code (written to use the previous labels). Keep your original graph tidy, as it should serve as a source of truth.

../images/473947_1_En_10_Chapter/473947_1_En_10_Fig2_HTML.jpg
Figure 10-2

Imagine how much time you would need to spend on this drawing if you weren’t able to utilize NetworkX and Graphviz

For basic drawings, you may rely on the built-in matplotlib engine. However, you shouldn’t use it for more complex drawings because, for example, it is quite tedious to control the sizes and shapes of node boxes in matplotlib. Moreover, as you will encounter later, Graphviz allows you to visualize graphs and run algorithms for controlling layouts. Finding out which resource is the most impacted boils down to calculating the degree of resource nodes by using the in_degree method on a graph (keep in mind the directionality of edges). The only caveat is that you must specify the weight argument (in our case, it would point to the custom weight property of associated edges). In an undirected graph, you would simply use the degree method.

Opposite Quality Attributes

It is a known fact that you cannot attain all quality attributes in a product (for a full list, read the ISO/IEC 25010:2011 quality model at bit.ly/iso-25010 ). Therefore, any optimization effort must find the balance among competing forces. For example, maintainability supports reliability, and vice versa. Performance competes with almost all quality characteristics (which is the reason you should avoid premature optimization and the consequent unnecessary accidental complexity). It is handy to map these relations using a special graph called a signed graph, a graph whose edges are denoted with a plus sign or minus sign, depending on whether the matching nodes support or oppose each other. NetworkX allows you to attach arbitrary properties to nodes and edges (we have already seen examples of this, like role, label, relation, etc.). Listing 10-2 shows how you might model the interrelationships between maintainability, reliability, and performance (note that we are dealing here with an undirected graph).
import networkx as nx
G = nx.Graph()
G.add_node(0, role='quality-attribute', label="Maintainability")
G.add_node(1, role='quality-attribute', label="Reliability")
G.add_node(2, role='quality-attribute', label="Performance")
G.add_edge(0, 1, sign='+')
G.add_edge(0, 2, sign='-')
G.add_edge(1, 2, sign='-')
Listing 10-2

Module signed_graph.py Models the Relationships Between Quality Attributes (see Exercise 10-2)

Partitioning the Model into a Bipartite Graph

An important category of graphs is bipartite graphs, where nodes can be segregated into two disjunct sets L(eft) and R(ight). Edges are only allowed between nodes from disparate sets; that is, no edge may connect two nodes in set L or R. For example, in a recommender system, a set of users and a set of movies form a bipartite graph for the rated relation . It makes no sense for two users or two movies to rate each other. Therefore, there is only an edge between a user and a movie, if that user has rated that movie. The weight of the edge would be the actual rating.

Let’s add a new edge to our first multigraph G between the Communicate use case and the Application actor (see Listing 10-1):
G.add_edge(0, 10, relation="interact")
G.add_edge(10, 0, relation="interact")
This models the fact that an IoT platform may behave like a virtual device in a federated IoT network. Namely, it can act as an aggregator for a set of subordinate devices and behave as a new device from the perspective of higher-level subsystems. The set of actors and use cases should form a bipartite graph. Listing 10-3 illustrates the procedure to form this type of graph using NetworkX. Here, bipartite.py creates a bipartite graph from our original multigraph by selecting only pertinent nodes and edges. (Note that this code assumes that you have already run usage_matrix.py so that everything is set up.)
from networkx.algorithms import bipartite
# Add two new edges as previously described.
G.add_edge(0, 10, relation="interact")
G.add_edge(10, 0, relation="interact")
# Select all nodes and edges from G that participate in 'interact' relation and
# create an undirected graph from them.
H = nx.Graph()
H.add_edges_from((u, v) for u, v, r in G.edges(data='relation') if r == 'interact')
# Attach a marker to specify which nodes belong to what group.
for node_id in H.nodes():
    H.node[node_id]['bipartite'] = G.node[node_id]['role'] == 'actor'
nx.relabel_nodes(H, {n: G.node[n]['label'].replace(' ', ' ') for n in H.nodes()}, copy=False)
print("Validating that H is bipartite: ", bipartite.is_bipartite(H))
# This is a graph projection operation. Here, we seek to find out what use cases
# have common actors. The weights represent the commonality factor.
W = bipartite.weighted_projected_graph(H, [n for n, r in H.nodes(data='bipartite') if r == 0])
# Draw the graph using matplotlib under the hood.
pos = nx.shell_layout(W)
nx.draw(W, pos=pos, with_labels=True, node_size=800, font_size=12)
nx.draw_networkx_edge_labels(W, pos=pos,
                             edge_labels={(u, v): d['weight']
                                          for u, v, d in W.edges(data=True)})
Listing 10-3

Module bipartite.py Creating a Bipartite Graph from Our Original Multigraph

NetworkX doesn’t use a separate graph type for a bipartite graph. Instead, it uses an algorithm that works on usual graph types. Figure 10-3 shows the weighted projected graph.
../images/473947_1_En_10_Chapter/473947_1_En_10_Fig3_HTML.jpg
Figure 10-3

The Application actor uses all three use cases depicted as a triangle. Therefore, they all have a single common actor. The other two use cases are accessed by the User actor.

Scalable Graph Loading

Thus far, we have constructed our graphs purely by code. This is OK for small examples, but in the real world your network is going to be huge. A better approach is to directly load a graph definition from an external source (such as a text file or database). In this section we will examine how to combine Pandas and NetworkX to produce a graph whose structure is kept inside a CSV file (see also the sidebar “Graph Serialization Formats”).

Graph Serialization Formats

Graphs are usually serialized using the following major formats:
  • Adjacency (incidence) matrix: This is very similar in structure to our usage matrix, where rows and columns denote nodes. At each intersection of row i and column j there is a value 0 (edge doesn’t exist between nodes i and j) or 1 (edge exists between nodes i and j). You can pass this matrix as an initialization argument when creating a new graph. This matrix isn’t good to specify additional node/edge properties and is very memory hungry. Graphs are typically sparse, so you will have lots of zeros occupying space.

  • Adjacency list: A separate row exists for each node. Inside a row there are node identifiers of neighbors. This format preserves space, since only existing edges are listed. An isolated node would have an empty row. You cannot define additional properties with this format. NetworkX has a method read_adjlist to read a graph saved as an adjacency list. You can specify the nodetype parameter to convert nodes to a designated type. The create_using parameter is a pointer to the desired factory (for example, you can force NetworkX to load the graph as a multigraph). Text files may contain comments, and these can be skipped by passing the value for the comments parameter.

  • Edge list: Each row denotes an edge with source and sink node identifiers. You can attach additional items as edge properties (like weight, relation, etc.). It is possible to specify the data types for these properties inside the dedicated read_edgelist method. Note that this format alone cannot represent isolated nodes.

  • GraphML: This is a comprehensive markup language to express all sorts of graphs (visit http://graphml.graphdrawing.org ). NetworkX has a method read_graphml to read graphs saved in this file format.

We will prepare two files: nodes.csv to store the nodes (see Listing 10-4), and edges.edgelist to describe edges (see Listing 10-5). The first is a CSV file that we will read using Pandas to attach properties to nodes. The other text file is an edge list with the corresponding edge attributes. Listing 10-6 shows the complete code to load our graph from these files (see the module load_graph.py). NetworkX has a method named from_pandas_dataframe, but you can’t use this to assign node properties (it only supports edge attributes).
Id,Role,Label
0,use-case,Communicate
1,use-case,Manage Dev.
2,use-case,Exec. Data Analytics
3,use-case,Use Bus. Int.
4,use-case,Use Op. Int.
5,use-case,Send/Receive Data
6,resource,Network
7,resource,Messaging
8,resource,Database
9,actor,Device
10,actor,Application
11,actor,User
Listing 10-4

Content of the nodes.csv File with a Proper Header Row

# Add edges for the 'impact' relationship.
0 6 600 impact
0 7 600 impact
1 6 15 impact
2 7 100 impact
2 8 900 impact
3 8 800 impact
4 8 960 impact
# Add edges for the 'include' relationship. A weight of 1 is assigned as a placeholder.
0 5 1 include
1 5 1 include
2 5 1 include
# Add edges for the 'extend' relationship .
3 2 1 extend
4 2 1 extend
# Add edges for the 'interact' relationship.
9 0 1 interact
0 9 1 interact
10 1 1 interact
1 10 1 interact
10 2 1 interact
2 10 1 interact
11 3 1 interact
3 11 1 interact
11 4 1 interact
4 11 1 interact
0 10 1 interact
10 0 1 interact
Listing 10-5

edges.edgelist Stores Edges in the Standard Edge List Format

import networkx as nx
import pandas as pd
G = nx.read_edgelist('edges.edgelist',
                     create_using=nx.MultiDiGraph,
                     nodetype=int,
                     data=(('weight', int), ('relation', str)))
df = pd.read_csv('nodes.csv', index_col=0)
for row in df.itertuples():
    G.node[row.Index]['role'] = row.Role
    G.node[row.Index]['label'] = row.Label
# Make a small report.
print("Nodes: ", G.nodes(data=True), sep=")
print("-" * 20, " Edges: ", G.edges(data=True), sep=")
Listing 10-6

Complete Code to Load Graph from nodes.csv and edges.edgelist

Running this file produces the following report, assuring us that everything has been properly loaded:
Nodes:
[(0, {'role': 'use-case', 'label': 'Communicate'}), (6, {'role': 'resource', 'label': 'Network'}), (7, {'role': 'resource', 'label': 'Messaging'}), (1, {'role': 'use-case', 'label': 'Manage Dev.'}), (2, {'role': 'use-case', 'label': 'Exec. Data Analytics'}), (8, {'role': 'resource', 'label': 'Database'}), (3, {'role': 'use-case', 'label': 'Use Bus. Int.'}), (4, {'role': 'use-case', 'label': 'Use Op. Int.'}), (5, {'role': 'use-case', 'label': 'Send/Receive Data'}), (9, {'role': 'actor', 'label': 'Device'}), (10, {'role': 'actor', 'label': 'Application'}), (11, {'role': 'actor', 'label': 'User'})]
--------------------
Edges:
[(0, 6, {'weight': 600, 'relation': 'impact'}), (0, 7, {'weight': 600, 'relation': 'impact'}), (0, 5, {'weight': 1, 'relation': 'include'}), (0, 9, {'weight': 1, 'relation': 'interact'}), (0, 10, {'weight': 1, 'relation': 'interact'}), (1, 6, {'weight': 15, 'relation': 'impact'}), (1, 5, {'weight': 1, 'relation': 'include'}), (1, 10, {'weight': 1, 'relation': 'interact'}), (2, 7, {'weight': 100, 'relation': 'impact'}), (2, 8, {'weight': 900, 'relation': 'impact'}), (2, 5, {'weight': 1, 'relation': 'include'}), (2, 10, {'weight': 1, 'relation': 'interact'}), (3, 8, {'weight': 800, 'relation': 'impact'}), (3, 2, {'weight': 1, 'relation': 'extend'}), (3, 11, {'weight': 1, 'relation': 'interact'}), (4, 8, {'weight': 960, 'relation': 'impact'}), (4, 2, {'weight': 1, 'relation': 'extend'}), (4, 11, {'weight': 1, 'relation': 'interact'}), (9, 0, {'weight': 1, 'relation': 'interact'}), (10, 1, {'weight': 1, 'relation': 'interact'}), (10, 2, {'weight': 1, 'relation': 'interact'}), (10, 0, {'weight': 1, 'relation': 'interact'}), (11, 3, {'weight': 1, 'relation': 'interact'}), (11, 4, {'weight': 1, 'relation': 'interact'})]

You can also iterate over nonexistent edges in the graph G by invoking nx.non_edges(G). A similar function that returns the non-neighbors of the node n in the graph G is nx.non_neighbors(G, n).

Social Networks

There are many ways to describe social networks, but we will assume in this section that social network means any network of people where both individual characteristics and relationships are important to understand the real world’s dynamics. In data science, we want to produce actionable insights from data. The story always begins with a compelling question. To reason about reality and answer that question, we need a model. It turns out that representing people as nodes and their relationships as edges in a graph is a viable model. There are lots of metrics to evaluate various properties (both local and global) of a graph. From these we can infer necessary rules and synthesize knowledge. Moreover, measures of a graph may be used as features in machine learning systems (for example, to predict new link creations based on known facts from a training dataset).

Listing 10-7 shows a simple script to visualize an artificial graph from the atlas of graphs (see reference [2]) and present some distance and centrality measures. For thorough coverage of graph theory, read references [3] and [4]. NetworkX’s documentation is an excellent source for additional information about metrics (see the section about algorithms, where you will find subsections about various measures).
import pandas as pd
import networkx as nx
from networkx.generators.atlas import graph_atlas
import matplotlib.pyplot as plt
G = graph_atlas(681)
# Visualize the graph using the draw_networkx routine.
plt.figure(figsize=(5,5))
nx.draw_networkx(G, node_size=700)
plt.axis('off')
plt.tight_layout();
plt.show()
# Create a data frame to store various centrality measures.
df = pd.DataFrame(index=G.nodes())
df['lcc'] = pd.Series(nx.clustering(G))
df['eccent'] = pd.Series(nx.eccentricity(G))
df['deg-cent'] = pd.Series(nx.degree_centrality(G))
df['clos-cent'] = pd.Series(nx.closeness_centrality(G))
df['btwn-cent'] = pd.Series(nx.betweenness_centrality(G, normalized=True, endpoints=False))
print(df)
print(' Average clustering coefficient:',
      nx.average_clustering(G))
print('Transitivity:', nx.transitivity(G))
print('Radius:', nx.radius(G))
print('Diameter:', nx.diameter(G))
print('Peripherial nodes:', nx.periphery(G))
print('Central nodes:', nx.center(G))
Listing 10-7

Module artificial_network.py, Which Demonstrates Some Local and Global Graph Metrics

The following is the text output produced by this program, and Figure 10-4 shows the resulting graph:
        lcc  eccent  deg-cent  clos-cent  btwn-cent
0  1.000000       3  0.333333       0.50   0.000000
1  1.000000       3  0.333333       0.50   0.000000
2  0.333333       2  0.666667       0.75   0.533333
3  0.500000       2  0.666667       0.75   0.233333
4  0.500000       2  0.666667       0.75   0.233333
5  1.000000       3  0.333333       0.50   0.000000
6  1.000000       3  0.333333       0.50   0.000000
Average clustering coefficient: 0.761904761904762
Transitivity: 0.5454545454545454
Radius: 2
Diameter: 3
Peripherial nodes: [0, 1, 5, 6]
Central nodes: [2, 3, 4]
../images/473947_1_En_10_Chapter/473947_1_En_10_Fig4_HTML.jpg
Figure 10-4

Apparently, you cannot rely on a single metric to decide about some aspect of a graph

For example, nodes 2, 3, and 4 all have the same closeness centrality. Nonetheless, the betweenness centrality gives precedence to node 2, as it is on more shortest paths in the network. By judiciously selecting and combining different metrics, you may come up with good indicators of a network.

The local clustering coefficient (LCC) of a node represents the fraction of pairs of the node’s neighbors who are connected. Node 2 has an LCC of $$ frac{# of pairs of node's neighbors who are connected}{# of possible connected pairs}=frac{2}{frac{d_2left({d}_2-1
ight)}{2}} $$ $$ =frac{2}{frac{4ast 3}{2}}=frac{1}{3},mathrm{where} {d}_2 $$ is the degree of node 2. LCC embodies people’s tendency to form clusters; in other words, if two persons share a connection, then they tend to become connected, too (a.k.a. triadic closure).

Nodes 2, 3, and 4 are considered central because the maximum shortest path (in our case, measured as the number of hops in a path) to any other node is equal to the radius of the network. The radius of a graph is the minimum eccentricity for all nodes. The eccentricity of a node is the largest distance between this node and all other nodes. The diameter of a graph is the maximum distance between any pair of nodes. Peripheral nodes have eccentricity equal to the diameter. Note that these measures are very susceptible to outliers, in the same way as is the average. By adding a chain of nodes to a network’s periphery, we can easily move the central nodes toward those items. This is why social network are better described with different centrality measures.

The average clustering coefficient is simply a mean of individual clustering coefficients of nodes. The transitivity is another global clustering flag, which is calculated as the ratio of triangles and open triads in a network. For example, the set of nodes 2, 5, and 4 forms an open triad (there is a missing link between nodes 5 and 4 for a triangle). Transitivity puts more weight on high-degree nodes, which is why it is lower than the average clustering coefficient (nodes 2, 3, and 4 as central are heavier penalized for not forming triangles).

The centrality measures try to indicate the importance of nodes in a network. Degree centrality measures the number of connections a node has to other nodes; influential nodes are heavily connected. The closeness centrality says that important nodes are close to other nodes. The betweenness centrality defines a node’s importance by the ratio of shortest paths in a network including the node. There is a similar notion of betweenness centrality for an edge.

Listing 10-8 introduces kind of a “hello world” for social networks, which is the karate club network (it is so famous that NetworkX already includes it as part of the standard library), as shown in Figure 10-5.
import operator
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
G = nx.karate_club_graph()
node_colors = ['orange' if props['club'] == 'Officer' else 'blue'
               for _, props in G.nodes(data=True)]
node_sizes = [180 * G.degree(u) for u in G]
plt.figure(figsize=(10, 10))
pos = nx.kamada_kawai_layout(G)
nx.draw_networkx(G, pos,
                 node_size=node_sizes,
                 node_color=node_colors, alpha=0.8,
                 with_labels=False,
                 edge_color='.6')
main_conns = nx.edge_betweenness_centrality(G, normalized=True)
main_conns = sorted(main_conns.items(), key=operator.itemgetter(1), reverse=True)[:5]
main_conns = tuple(map(operator.itemgetter(0), main_conns))
nx.draw_networkx_edges(G, pos, edgelist=main_conns, edge_color="green", alpha=0.5, width=6)
nx.draw_networkx_labels(G, pos,
                        labels={0: G.node[0]['club'], 33: G.node[33]['club']},
                        font_size=15, font_color="white")
candidate_edges = ((8, 15), (30, 21), (29, 28), (1, 6))
nx.draw_networkx_edges(G, pos, edgelist=candidate_edges,
                       edge_color='blue', alpha=0.5, width=2, style="dashed")
nx.draw_networkx_labels(G, pos,
                        labels={u: u for t in candidate_edges for u in t},
                        font_size=13, font_weight="bold", font_color="yellow")
plt.axis('off')
plt.tight_layout();
plt.show()
# Create a data frame to store various centrality measures.
df = pd.DataFrame(index=candidate_edges)
# Add generic and community aware edge features for potential machine learning classification.
df['pref-att'] = list(map(operator.itemgetter(2),
                          nx.preferential_attachment(G, candidate_edges)))
df['jaccard-c'] = list(map(operator.itemgetter(2),
                          nx.jaccard_coefficient(G, candidate_edges)))
df['aa-idx'] = list(map(operator.itemgetter(2),
                          nx.adamic_adar_index(G, candidate_edges)))
df['ccn'] = list(map(operator.itemgetter(2),
                          nx.cn_soundarajan_hopcroft(G, candidate_edges, 'club')))
df['cra'] = list(map(operator.itemgetter(2),
                          nx.ra_index_soundarajan_hopcroft(G, candidate_edges, 'club')))
print(df)
Listing 10-8

karate-network.py Module Exemplifying the Well-Known Karate Club Network (see Exercise 10-3)

../images/473947_1_En_10_Chapter/473947_1_En_10_Fig5_HTML.jpg
Figure 10-5

The karate club social network, which clearly emphasizes the split of a club into two parts. The sizes of nodes reflect their degree. The five most important connections are highlighted in green. The dashed lines denote candidate edges. The nodes are colored according to what club they belong to.

After running this program, you will get the following report:
       pref-att  jaccard-c    aa-idx  ccn       cra
8  15        10   0.400000  0.755386    2  0.000000
30 21         8   0.200000  0.455120    1  0.000000
29 28        12   0.166667  0.352956    2  0.058824
1  6         36   0.083333  0.360674    2  0.062500
The column names designate the following link prediction measures (in the same order from right to left):
  • Preferential Attachment: Favors nodes with high degree centrality. This is why the score for nodes 1 and 6 is much higher than for the others.

  • Jaccard Coefficient: The number of common neighbors normalized by the total number of neighbors.

  • Adamic-Adar Index: Has a logarithm in the denominator that is compared to the Resource Allocation index. For nodes A and B, this index indicates the information flow dispersal level through common neighbors. In a sense, nodes that would like to join tend to have direct communication. If common neighbors have high degree centrality, then such a flow is going to be dissolved.

  • Community Common Neighbors: The number of common neighbors, with extra points for neighbors from the same community.

  • Community Resource Allocation: Similar to the Resource Allocation index, but only counts nodes inside the same community. Notice that the candidate edges (9, 15) and (30, 21) have received zero community resources, since they are from different karate clubs.

There are three more real social networks available in NetworkX (as of the time of writing). You may want to experiment with them as we did here. The visualization part can be easily reused.

When you study a social network, it is an interesting option to generate realistic network models. These networks aim to mimic pertinent properties of real-world networks under consideration. There are many network generators available inside NetworkX, including the popular preferential model (which obeys the power law) and the small-world network (which has a relatively high clustering coefficient and short average path length). The latter tries to mimic the six degrees of separation idea. Once you have a model, then you can compare it with observations (samples) from the real world. If the estimated parameters of a model (like the exponent α and constant C in a preferential model) work well to produce networks close to those from the real world, then real-world phenomena may be analyzed by using graph theory and computation. For example, you can train a classifier using network features (like our metrics) to predict probabilities about future connections.

Exercise 10-1. Beautify The Displayed Network, Part 1

NetworkX coupled with Graphviz gives you tremendous options to customize the visual appearance of a graph. In Figure 10-2 there is no differentiation between relation types or weights. It would be beneficial to present these in a meaningful fashion. Here is a snippet of code to iterate over edges in a directed multigraph G:
for i, j, k in G.edges(keys=True):
    print(G.adj[i][j][k])

Experiment by setting visual properties of edges (we have set some for nodes). For example, you may color relations differently and set the line widths to the weights of relations. As a bonus exercise, try to attach a custom image for actor nodes. This can be done by mixing in HTML content and referencing an image using the img tag.

Exercise 10-2. Beautify The Displayed Network, Part 2

Build on the experience you’ve gained from Exercise 10-1 to visualize a signed graph (like the one we used to model relationships between quality attributes). You may want to display a plus sign or minus sign for each edge as well as color them differently.

You can traverse the edges of an undirected graph G by using the G.edges(data=True) call, which returns all the edges with properties. To display the sign, use the G.edge[node_1][node_2]['sign'] expression, where node_1 and node_2 are node identifiers of this edge.

Exercise 10-3. Generate An Erdös-RÉnyi Network

Read the documentation for the nx.generators.random_graphs.erdos_renyi_graph function to see how to produce this random graph. This model chooses each of the possible edges with probability p, which is a synonym for a binomial graph (since it follows Bernoulli’s probability distribution).

The generated network doesn’t resemble real-world networks (as a matter of fact, the small-world network model doesn’t do a great job in this respect either). Nonetheless, the model is extremely simple, and binomial graphs are used as the basis in graph theory proofs.

Summary

Real-world networks are humungous, so they require more scalable approaches than what has been presented here. One plausible choice is to utilize the Stanford Network Analysis Project (SNAP) framework (visit http://snap.stanford.edu ), which has a Python API, too. Storing huge graphs is also an issue. There are specialized database systems for storing graphs and many more graph formats (most of them are documented and supported by NetworkX). One possible selection for a graph database is Neo4j (see https://neo4j.com/developer/python ).

Graphs are used to judge the robustness of real-world networks (such as the Web, transportation infrastructures, electrical power grids, etc.). Robustness is a quality attribute that indicates how well a network is capable of retaining its core structural properties under failures or attacks (when nodes and/or edges disappear). The basic structural property is connectivity. NetworkX has functions for this purpose (for evaluating both nodes and edges): nx.node_connectivity, nx.minimum_node_cut, nx.edge_connectivity, nx.minimum_edge_cut, and nx.all_simple_paths. The functions are applicable for both undirected and directed graphs. Graphs are also the basis for all sorts of search problems in artificial intelligence.

References

  1. 1.

    Ervin Varga, Draško Drašković, and Dejan Mijić, Scalable Architecture for the Internet of Things: An Introduction to Data-Driven Computing Platforms, O’Reilly, 2018.

     
  2. 2.

    Ronald C. Read and Robin J. Wilson, An Atlas of Graphs, Oxford University Press, 1999.

     
  3. 3.

    David Easley and Jon Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge University Press, 2010.

     
  4. 4.

    Michel Rigo, Advanced Graph Theory and Combinatorics, John Wiley & Sons, 2016.

     
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.119.104.160