Although the conventional approach to analyzing data works in the majority of cases, there are times when a different kind of modeling works better. This may be due to the nature of the data, its dimensionality, or even the interpretability factor (which is crucial in many cases). There is an abundance of information on this topic, and a lot to learn to master it. For the purposes of our book, we will limit ourselves to only some of the applications of graph analysis, along with the tools for implementing them in Julia.
We’ll be making use of two packages throughout this chapter, Graphs and LightGraphs, so make sure that you have both installed in Julia before we get started. We won’t be delving too deep into the LightGraphs package; this introduction is mainly for the purpose of acquainting you with its unique feature of saving a graph object in a compact format as a file.
We’ll also be providing you with a couple of auxiliary functions for converting between the two corresponding graph types (see the IJulia notebook). Also, be sure to load the normalize.jl script into memory as we’ll be using the normalize() function.
There is enough material in graph theory and Julia’s implementation of it for a whole additional book. If you want to learn more about it, please refer to the references. In this chapter we will cover the following topics:
The importance of graphs
A custom dataset for graph analysis
Statistics of a graph
Cycle detection
Connected components
Cliques
The shortest path in a graph
Minimum spanning trees
Saving and loading graphs from a file
Importance of Graphs
Although graph theory has been around for many years, only recently has it become popular in the data analytics sphere. This could be because graphs are heavy on math, particularly of the abstract kind. Also, database technology has traditionally focused on table-based data structures, and graphs are completely different from this paradigm. Fortunately, people are coming around to the value of graphs, and are gradually developing more tools for them (particularly on the database side). Also, databases like Neo4j have come out of the shadows and entered the mainstream, even being featured in some online courses. Before that, they were esoteric and known only to the few specialists in the graph analysis subfield.
We have seen in the previous chapters how dimensionality can cause major performance issues, especially when distances are used. Also, the diversity of features (the variety of data types, such as numeric or nominal) doesn’t make things easier, and the computational overhead of a large feature space is not a trivial challenge. Needless to say, none of this is intuitive, since we, as human beings, are accustomed to more direct ways of associating things.
Graphs follow this intuitive approach, simplifying things through the use of an abstract representation of all the connections (and lack thereof) among the various elements of a dataset. These elements can be data points, features, classes (in the case of classification), or anything else that makes sense when modeling the problem at hand.
Believe it or not, you have already used graphs during this book (unless you skipped the most juicy chapters!). Graphs are everywhere in data science, even if they are not made apparent. In fact, it is hard to imagine data analytics in general without graphs. Just to clarify, the graphs we are talking about are graphical representations of relationships, not plots like the ones we explored in Chapter 7. Yet, graphs can be plotted so that we understand them better (in fact, that’s one of their perks: easy two-dimensional plotting). You can view a typical graph in Figure 12.1.
Figure 12.1 An example of a typical graph.
For now, all you need to know about graphs is the following:
A graph is a collection of nodes (also called vertices) joined by connections called edges.
Edges can be directed or undirected, and they can carry weights expressing the strength of each connection.
A graph can be stored compactly as a list of its edges. For the graph in Figure 12.1, that list would be:
A, B
A, E
A, F
B, C
E, F
If there were weights corresponding to these connections, they would be part of the set as well, making the whole data structure a dictionary of sorts. This representation of a graph is particularly useful for graphs having few connections among their nodes. That’s it! All of the fundamentals of this incredibly complicated theory are distilled into the above points. If you want to learn more about the math behind these concepts, we recommend checking out the comprehensive tutorial created by Professor Harju of the University of Turku, Finland: http://bit.ly/29kRzJx. This tutorial may also be useful in understanding the corresponding algorithms, which we will be using in this chapter (although we won’t be describing them in any detail). Due to the vastness of the topic, in this chapter we’ll be focusing on undirected graphs only.
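To make this concrete, here is how the edge list above could be coded in Julia, along with some hypothetical weights (the names E, w, and G, as well as the weight values, are our own):

```julia
# Edge list of the graph in Figure 12.1, coding A = 1, B = 2, ..., F = 6
E = [(1,2), (1,5), (1,6), (2,3), (5,6)]

# Hypothetical weights, one per edge (made-up numbers for illustration)
w = [0.5, 0.8, 0.3, 0.9, 0.4]

# Combining the two yields the "dictionary of sorts" mentioned above
G = Dict(zip(E, w))
G[(1,2)]    # yields 0.5, the weight of the A -- B edge
```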
Custom Dataset
Unlike the previous chapters where all your code could be set up with just a couple of lines, here we’ll have to do some extra work (nothing too challenging). The reason is that the original datasets are not well-suited to illustrating the points we need to make in this chapter, which should get you acquainted with the set of techniques related to graph analysis.
So, for starters we’ll be introducing a new dataset that’s simple enough to draw on a piece of (electronic) paper. This way there will be no confusion about what’s happening in each algorithm, while examining the edges of the corresponding graph won’t be time-consuming.
Without further ado, here is the dataset. It describes the relationships among a group of friends: Alice, Bob, Christine, Dan, Eve, Frank, and George. Imagine that they all use a social medium of your choice (perhaps the website from the startup in the case study of Chapter 5) and that this website has established some kind of reliable metric that expresses how close they are to each other, in terms of interests, background, affinity for the Julia language, etc. This metric takes values between 0 and 1. The data we have looks something like this:
Alice -- Bob, 0.00956345
Alice -- Christine, 0.952837
Alice -- Eve, 0.775323
Alice -- George, 0.0875062
Bob -- Christine, 0.546019
Bob -- Eve, 0.573757
Bob -- Frank, 0.43756
Bob -- George, 0.665684
Christine -- Dan, 0.445205
Christine -- Frank, 0.248718
Christine -- George, 0.977219
Dan -- Frank, 0.732508
Dan -- George, 0.149959
Eve -- Frank, 0.284119
Frank -- George, 0.990848
This can be easily coded in two ways: either as an array of edges (relationships between pairs of friends) with a corresponding weights array (similarity weight), or as a 7 x 7 matrix with a weight value wherever there is a connection between two friends. Here we will use the first option as it’s more straightforward and user-friendly. Feel free to try out the other approach if you want.
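For reference, the matrix approach would look something like the following sketch (with the friends numbered 1 through 7 in the order they are listed):

```julia
# 7 x 7 weight matrix; a zero means no connection between two friends
W = zeros(7, 7)
W[1,2] = W[2,1] = 0.00956345    # Alice -- Bob
W[1,3] = W[3,1] = 0.952837      # Alice -- Christine
W[1,5] = W[5,1] = 0.775323      # Alice -- Eve
# ...and so on for the rest of the edges in the list above
```

Since the graph is undirected, the matrix is symmetric. Notice also how many of its cells remain zero, which is why the edge list approach is more economical for sparsely connected graphs like this one.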
Whatever the case, we have a couple of functions that allow us to translate the data structures of one approach to that of the other. So, if we were to assign numbers to the various friends (according to the order they are listed), and get rid of the “--” string, we would end up with something like this:
1, 2, 0.00956345
1, 3, 0.952837
1, 5, 0.775323
1, 7, 0.0875062
2, 3, 0.546019
2, 5, 0.573757
2, 6, 0.43756
2, 7, 0.665684
3, 4, 0.445205
3, 6, 0.248718
3, 7, 0.977219
4, 6, 0.732508
4, 7, 0.149959
5, 6, 0.284119
6, 7, 0.990848
We can now split this numeric array into two arrays: one for the indexes of the people in the group, and one for the weights signifying the strength of their bonds. The first array will include the first two columns and the second array the last column.
In order to avoid any confusion, it is important to code the two arrays with the appropriate type. Although Julia will not create any fuss about it during the coding phase, the functions of the graph analysis packages most certainly will, since they expect the data in a specific way. So, make sure that the indexes are coded as integers while the weights as floats.
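Assuming the numeric data above is stored in an array called `data` (a name of our choosing), the split could be done like this:

```julia
# First three rows of the dataset, for brevity
data = [1 2 0.00956345;
        1 3 0.952837;
        1 5 0.775323]

E = round(Int64, data[:, 1:2])   # node indexes, coded as integers
w = data[:, 3]                   # bond strengths, coded as floats
```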
Also, keep in mind that these weights represent similarity in this example (in reality, they can represent whatever you want). If we were to examine how different one member of the group is from another, we would need to use some other metric (such as the inverse of w, or any other function that grows as w diminishes).
Statistics of a Graph
This is the graphical counterpart of data exploration, but it’s much more straightforward. Since a graph is a simple data structure, the main things we can learn about a graph are:
The number of its nodes (vertices) and edges
The nodes and edges themselves
Whether the graph is directed or not
The degree of each node (the number of edges stemming from it), along with its neighbors
The degree of the graph as a whole (the largest node degree)
Let’s see how we can calculate these on our simple graph (g) of Alice’s social circle, using the Graphs package:
In[1]: num_vertices(g)
Out[1]: 7
In[2]: num_edges(g)
Out[2]: 15
In[3]: vertices(g)
Out[3]: 1:7
In[4]: edges(g)
Out[4]: 15-element Array{Graphs.Edge{Int64},1}:
edge [1]: 1 -- 2
edge [2]: 1 -- 3
edge [3]: 1 -- 5
edge [4]: 1 -- 7
There are a few more of these, which we are omitting here as we don’t want to fill the pages with stuff you can view on the IJulia notebook.
In[5]: Graphs.is_directed(g)
Out[5]: false
In[6]: out_degree(1, g)
Out[6]: 4
This basically means that there are four edges stemming from node 1 (Alice).
In[7]: Graphs.out_neighbors(1, g)
This yields a somewhat elaborate TargetIterator object, which contains all the edges that correspond to node 1. These are basically all the neighbors of the node.
To find the degree of a graph, we calculate the degrees of all of its nodes and take the largest value:
In[8]: n = num_vertices(g)
d = Array(Int64, n)
for i = 1:n
d[i] = out_degree(i, g)
end
dg = maximum(d)
Out[8]: 5
Cycle Detection
Cycles are paths comprising three or more nodes that are connected to each other in a ring fashion. The nodes of a cycle need not be connected to all of the other members of the cycle, though there sometimes exist multiple connections among a cycle’s elements.
Take for example your closest friends. Some of these may be friends with other members of your group, while others may know just you and one other person. As long as you can trace a ring-like connection among you and your friends, this connection constitutes a cycle in a social network. This is actually how Facebook knows who to recommend to you as potential friends. You may be friends with Tracy, who is friends with Joe, who is Mike’s best buddy, who is also your friend, yet you are not connected to Joe directly. Facebook sees this and based on your connections (and probably other factors too), recommends that you reach out to Joe, while it does the same for Joe regarding you.
If a graph doesn’t have any cycles, it is called acyclic. A special case of acyclic graphs are those that are also directed. These graphs are often referred to as DAGs (directed acyclic graphs) and are common in graph theory. So, finding out whether there are cycles in a graph is a simple way to deduce whether it’s a DAG (since we usually already know if it’s directed).
Cycles are interesting in the context of social networks, but they are also extremely valuable artifacts in feature analysis. If you represent all your dataset’s features as nodes in a graph and connect them through a similarity metric (the value of this metric can be the weight of the corresponding edge), then you have a clear representation of how everything is connected in your feature set.
Cycles in this graph represent redundancy in your feature set (i.e. three or more features being similar to each other) and therefore it would be best if we eliminate them. If feature A is closely connected to feature B, while both are closely connected to feature C, then one of them could go. Of course you would want to take other things into account, such as each feature’s connectivity to the target variable (i.e. through its DID score in the case of classification or its cosine similarity in the case of regression). For now we’ll just look at the cycle aspect of the graph, using Julia to sort things out.
Julia the cycle detective
Let’s look at how Julia can help simplify the feature selection process so that you can explain it to everyone in a straightforward manner. As before, we’ll be making use of the Graphs package. Before we look into feature analysis, though, let’s see how the cycle detection function works on a simple graph:
In[9]: test_cyclic_by_dfs(g)
Out[9]: true
It gets even easier than that. Just plug in the graph object to the test_cyclic_by_dfs() function and you’ll quickly know whether there is a cycle in the graph or not. Now let’s see how we can apply this in a more useful scenario. First, let’s work the data of the OnlineNewsPopularity dataset and obtain the correlation matrix of its numeric features:
In[10]: ONP = readcsv("D:\\data\\OnlineNewsPopularity\\OnlineNewsPopularity.csv");
It is useful to have the number of features in a variable, so:
In[11]: nd = size(ONP, 2) - 2
We can opt to normalize the whole thing in order to calculate the similarity (in case we’ll go with an alternative metric later on):
In[12]: ONP = normalize(map(Float64, ONP[2:end,2:end]), "stat")
In[13]: F = ONP[:,1:(end-1)];
In[14]: CM = cor(F)
Now we can go ahead and build the graph corresponding to the feature correlations. To do this, we’ll need the connectivity matrix, which we can initialize as follows:
In[15]: M = Array(Bool, nd, nd)
Of course, we’ll need to populate it. We’ll be using the threshold of 0.7 to signify a strong similarity, although other thresholds can also be used:
In[16]: for i = 1:nd
M[i,i] = true
for j = (i+1):nd
M[i,j] = M[j,i] = (abs(CM[i,j]) > 0.7)
end
end
Next, we can get the edge list going, using the custom CM2EL() function:
In[17]: E = CM2EL(M)
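CM2EL() is one of the custom helper functions from the notebook. A minimal implementation consistent with how it is used here (scanning the upper triangle of the Boolean connectivity matrix and collecting the connected pairs) could look like this:

```julia
# One possible sketch of the custom CM2EL() helper: it converts a Boolean
# connectivity matrix into an edge list with one (i, j) pair per row.
function CM2EL(M::Array{Bool,2})
    n = size(M, 1)
    E = Array(Int64, 0, 2)          # empty edge list to be grown below
    for i = 1:(n-1)
        for j = (i+1):n
            if M[i,j]
                E = vcat(E, [i j])  # record the connection as an edge
            end
        end
    end
    return E
end
```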
Finally, we can build the actual graph:
In[18]: g2 = simple_graph(nd, is_directed=false)
In[19]: for i = 1:size(E, 1)
e = E[i,:]
Graphs.add_edge!(g2, e[1], e[2])
end
Like in the previous example, we can perform some detective work using the corresponding function that identifies the cycles in a graph:
In[20]: test_cyclic_by_dfs(g2)
Out[20]: true
So, this analysis verifies the fact that there is redundancy in the features, as we realized during our data exploration in the early part of the book.
By the way, if you feel that this cycle detection function is a let-down, that’s because it is! On its own, there is limited value it can provide. However, it can be useful in that if you know that there is a cycle in your graph, you will have a reason to perform the next kind of analysis: seeking out the connected components of the graph.
Connected Components
A connected component is a fancy term for a set of nodes in a graph, where you can get from node A to node B, for any A and B belonging to that set of nodes. So, if you were to draw the whole graph, a component of it would look like a bunch of nodes that are linked to each other in some kind of fashion. Obviously, you cannot traverse from any A to B if that A and B belong to different components. If a graph has exactly one connected component, then every node of it is accessible from any other node in the graph (i.e. it is connected to them directly or indirectly), and is therefore called a connected graph.
Basically, connected components are sub-graphs, each of which is connected internally but completely disconnected from the others. One such sub-graph may be composed of a single node. Needless to say, it’s great to see lots of these micro-graphs when analyzing the feature similarity of a dataset.
Once Julia detects a cycle in a graph, it is time to connect the dots and go deeper into the graph itself, using the connected_components() function. Since this function exists in both graph packages, it is necessary to specify which one you want to use:
In[21]: Graphs.connected_components(g)
Out[21]: 1-element Array{Array{Int64,1},1}:
[1,2,3,5,7,6,4]
In this case it looks like all of the nodes are jumbled up (so it wouldn’t be easy to infiltrate Alice’s group of friends!). This makes the graph a connected one. Let’s see how our features graph from the OnlineNewsPopularity dataset fares:
In[22]: Graphs.connected_components(g2)
Out[22]: 43-element Array{Array{Int64,1},1}:
[1]
[2]
[3]
[4,5,6]
[7]
In this case we have a lot of connected components, due to the fact that most of the features are unrelated to the rest of the feature set (or we could have set up a very high threshold). Most of these are omitted here, but if you check the notebook file, you’ll be able to verify this. Also, towards the end of the list of connected components lies [56, 59], denoting a connection between these two features. Let’s take a closer look using the correlation matrix array we calculated previously:
In[23]: CM[56,59]
Out[23]: 0.714527589349792
Cliques
Cliques are closely related to cycles, as they are sets of nodes in a graph that are all connected to each other. So, when investigating this property of a graph, the two are usually examined together. Cliques are common in social networks and logistics problems; we are generally interested in the largest one of them. This is because there are often many cliques, and the smaller ones don’t reveal much. The largest ones, however, are unique, containing more information related to their members.
So, if we have determined that there is a cycle in a graph, the next logical step would be to check for the actual nodes that make cycles possible, namely the clique(s) present in the graph. For example, if there is a clique that Alice belongs to, this means that we can identify her closest friends (or the people who are most similar to her, and are also similar to each other) by examining this clique. Clearly there is some marketing opportunity in this, as well as a more personalized experience for her given that all of the members of her group are on our website.
Let’s now check our simple graph for cliques and try to make sense of Alice’s social circle. To do this we’ll need to use the maximal_cliques() function, specifying the package as a prefix, like before:
In[24]: Graphs.maximal_cliques(g)
Out[24]: 5-element Array{Array{Int64,N},1}:
[4,7,3,6]
[2,7,3,6]
[2,7,3,1]
[2,5,6]
[2,5,1]
What does all this tell us? Well, if Alice wanted to go out with as many of her friends as possible, without making the communication among them awkward, she would have to pick the third clique (since it is tied for the most members and includes her). Moreover, she would be sure that each person would be friends with at least two other people, so it would not be weird for them whenever she would go to the restroom! As for Bob, he can do the same with two groups of friends (the first two cliques), so if he is sick and tired of Alice, he can hang out with other members of the gang!
Although finding the maximal clique in a graph is useful overall, it doesn’t yield any interesting information when applied to a feature graph, so we won’t be examining how it applies to our other graph example.
Shortest Path in a Graph
More often than not, the edges of a graph have weights, so finding the shortest route from A to B becomes a different kind of problem. Now, the objective is not just getting there, but getting there quickly. To make this happen we will make use of the Dijkstra algorithm, which is popular in computer science among other fields. Apart from graph theory, it has applications in finding the best way to schedule your tasks (project management) and getting you to the closest Julia event in your city (or any city for that matter) through a GPS-based app. There are other minimal distance calculation algorithms out there, the most important of which have been implemented in the Graphs package. We encourage you to experiment with them, once you are comfortable with the Dijkstra algorithm.
We won’t go into depth on the math of this technique, but we encourage you to investigate it through a specialized document such as Melissa Yan’s presentation at MIT: http://bit.ly/29zFogk.
When calculating the distances involved in the shortest path investigation, we interpret the weight of an edge as the distance between the corresponding nodes. So, if our weights represent similarity, we’ll need to adjust them for the shortest path algorithms to work properly.
Let’s now see how we can calculate the distance of the shortest path between two points in a graph using Julia. First of all, let’s invert the weights to get a measure of how dissimilar the various members of the group are:
In[25]: ww = 1 - w
We could have used any number of functions to accomplish this. We refrained from using the obvious one (1/w) since this would yield extreme results in cases of nodes having a very low weight. In general, if you want to preserve the scale of the weights when inverting them, you can apply the function c + maximum(w) - w, where c is a positive constant of your choice. Also, if the weights are very dispersed, you can apply a sigmoid to them first and then reverse them. Be creative!
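These alternatives can be sketched as follows, using the w array of similarity weights from earlier (the value of c and the exact sigmoid form are our own choices):

```julia
w = [0.00956345, 0.952837, 0.775323]   # a few of the similarity weights

# Scale-preserving inversion: the strongest bond gets distance c
c = 0.01
ww_alt = c + maximum(w) - w

# For very dispersed weights: squash with a sigmoid, then reverse
sigmoid(x) = 1 ./ (1 + exp(-x))
ww_sig = 1 - sigmoid(w)
```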
Once we have all the weights sorted out, we need to define a starting point. In this case, we’ll make Alice the starting point (node 1). To find the shortest path to Alice from the various parts of the graph, we’ll apply the dijkstra_shortest_paths() function, after specifying that it’s from the Graphs package:
In[26]: z = Graphs.dijkstra_shortest_paths(g, ww, 1)
The result is a DijkstraStates object that’s loaded with information about the shortest paths to Alice’s node (set as the last input of the function). To get some more comprehensive information from all this, let’s access the two most relevant parts of this data structure, parents and dists:
In[27]: z.parents
Out[27]: 7-element Array{Int64,1}:
1
7
1
6
1
7
3
Hopefully you were not expecting a list of the names of each person’s actual parents, otherwise you’d be disappointed right now. Clearly, the term “parent” has a different meaning here; it refers to the node that is right before the examined node, as we travel from the source node (in this case, 1) to that node. Also note that the source node is a parent of itself.
So, if you were Bob (node 2) and wanted to get a message across to Alice as swiftly as possible, your best option would be to go through node 7, since apparently Alice and Bob are not close (i.e. the distance between them is high, as shown by the corresponding element of the ww array). To find the actual smallest distance between anyone in the graph and the source node, we’ll use the dists part of the aforementioned data structure:
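To recover the full route rather than just the next hop, we can walk the parents array back to the source with a small helper of our own (trace_path() is not part of the Graphs package):

```julia
parents = [1, 7, 1, 6, 1, 7, 3]    # the z.parents array shown above

# Walk the parents array from a node back to the source node
function trace_path(parents, node)
    path = [node]
    while parents[node] != node    # the source node is its own parent
        node = parents[node]
        push!(path, node)
    end
    return path
end

trace_path(parents, 2)    # Bob's route to Alice: [2, 7, 3, 1]
```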
In[28]: z.dists
Out[28]: 7-element Array{Float64,1}:
0.0
0.40426
0.047163
0.346588
0.224677
0.079096
0.069944
From this it becomes clear that the closest person to Alice (apart from herself) is Christine (node 3). This is to be expected, as there is a direct link between the two (edge 1---3), while the similarity weight is very high. On the other hand, the person furthest from Alice is Bob (node 2), since the direct link between them has a low similarity weight and a high distance weight. To go from one to the other, you would be best served by going through other people in the group.
Minimum Spanning Trees
A tree is basically a connected acyclic graph designed for a specific task. We have already seen classification trees that emulate the decision process for associating a data point to a particular class. If we were to do the same but using other data points instead of a class, we would end up with a spanning tree. If we were to minimize the total (or the average) weight of the connections of this tree, we would have a minimum spanning tree. So, fancy terminology aside, a minimum spanning tree (MST) is an optimal graph connecting all the data points of a dataset, where the overall weight is as small as possible.
One important application of MSTs is in logistics. If you wanted to create a road network with an exceptionally small budget, but you had to abide by the requirement that all the major buildings must be connected somehow (i.e. you could drive from one to any other, even if you had to take a scenic route), an MST would solve your problem.
In this case, the nodes would be the buildings you want to connect, and the edges would be the actual roads you would construct. The people in the area may not be too happy about this, since they might have to drive a lot and there might be just one route to get from the library to the courthouse, but the overall cost of the project would be minimal. This example may be a bit extreme, but it would make sense for a new subway project where you want to cover as many places as possible, without having any redundancy among the train lines used (at least for the initial phase of the project).
MSTs can also be useful in other applications, including classification. MSTs are one of the most intuitive parts of graph theory; as such, they hold a lot of potential for the development of future techniques in data science. We encourage you to delve deeper into this topic, through reliable references such as this web page by Princeton University: http://bit.ly/29pnIol.
Finally, there also exist graphs called maximum spanning trees. However, since any maximum spanning tree problem can be expressed as an MST problem (e.g. by reversing the weights of the edges), we won’t be examining those in this chapter. A maximum spanning tree would be applicable if the weights of a graph expressed distance and the nodes represented the features of a dataset, in which case you would want to perform a feature selection that yields a set of features as different as possible. Such a selection would be a connected component of the maximum spanning tree of the features graph.
There are two ways to calculate the MST of a graph. These are expressed through two efficient algorithms: Prim’s and Kruskal’s. Both are relatively easy (almost intuitive) and have been implemented in Julia through the Graphs package.
Julia the MST botanist
Julia can also be useful for creating MSTs out of graphs. Going back to Alice’s group, it may be the case that someone in the group wants to make a certain message known to everyone else at a time when it so happens that social media is down. So, how would that person go about communicating this message, keeping the awkwardness to a bare minimum (i.e. connecting everyone in the group through the shortest dissimilarity on average)? That’s where Julia’s MST functions come in handy. Here we’ll be using the kruskal_minimum_spantree() function, along with the distance weights, since this MST algorithm is easier to use:
In[29]: E, W = kruskal_minimum_spantree(g, ww)
You can examine the actual tree on your own by checking out which edges of the original graph are connected (array E). Right now we’ll focus on the average weights (array W), as this is a common metric associated with MSTs:
In[30]: mean(W)
Out[30]: 0.15093016666666667
So, even though there are some people in Alice’s group who can’t stand each other, the average distance (awkwardness) of this whole communique can be fairly small through the use of an MST that connects everyone.
Saving and loading graphs from a file
We can make the graph we crafted available for future analytics. The best part is that it can be compressed, so it won’t take much space. To make this happen we’ll use the save() and load() functions from the LightGraphs package. First, we’ll need to transform the graph to a type that this package can recognize:
In[31]: gg = G2LG(g)
Now we can proceed with the actual saving operation, using the save() function:
In[32]: save("graph data.jgz", gg, "AliceGang", compress=true)
Out[32]: 1
The output denotes the number of graphs saved. Also, the name for the graph (AliceGang) is not essential, so you can omit it if you want (if you do, it will be named something generic like “graph”). However, if you plan to store more graphs in a file, you may want to give them descriptive names so that you can distinguish them from each other after you load them. Speaking of loading a graph data file:
In[33]: gg_loaded = load("graph data.jgz", "AliceGang")
Since we knew the name of the graph in the data file, we asked for that particular graph. If we didn’t, we could just load everything into a temporary variable and access the graph afterwards. This is relatively easy, since the object that the load() command yields when no name is given as a second parameter is a dictionary.
Also keep in mind that the loaded graph is going to be a LightGraphs object, so if you want to use it with functions from the Graphs package, you’ll need to convert it, using the custom LG2G() function:
In[34]: g_loaded = LG2G(gg_loaded)
Should you wish to compare the original graph with the loaded one, you can check the edges using something like this:
In[35]: hcat(edges(g_loaded), edges(g))
Graph Analysis and Julia’s Role in it
Although the field of graph analysis is still relatively new and Julia even newer as a technology, it is clear that there is a good match between the two. There are certain signs (or features, if you will) that clearly point toward the possibility of a powerful synergy. Specifically:
Beyond these points, it becomes clear to anyone who uses Julia for this kind of task that even though other data science languages can fudge their way into this field through the use of C under the hood, it is the more powerful languages that can truly handle it and allow for innovation in the tools. So, for those people who are not particularly keen on writing C, C++ or Java code, Julia is pretty much the best option out there for graph analysis applications.
It is difficult to overstate how empowering this is, since a robust language allows programmers to experiment more freely, without having to rely on a bunch of third-party scripts that you may never have a chance to examine and understand fully. This may lead to a more fast-paced evolution of the graph analytics field.
Summary
Chapter Challenge