CHAPTER 12: Graph Analysis


Although the conventional approach to analyzing data works in the majority of cases, there are times when a different kind of modeling works better. This may be due to the nature of the data, its dimensionality, or even the interpretability factor (which is crucial in many cases). There is an abundance of information on this topic, and a lot to learn to master it. For the purposes of our book, we will limit ourselves to only some of the applications of graph analysis, along with the tools for implementing them in Julia.

We’ll be making use of two packages throughout this chapter, Graphs and LightGraphs, so make sure you have both installed before we get started. We won’t be delving too deeply into the LightGraphs package; this introduction is mainly for the purpose of acquainting you with its feature of saving a graph object to a file in a compact format.

We’ll also be providing you with a couple of auxiliary functions for converting between the two corresponding graph types (see the IJulia notebook). Also, be sure to load the normalize.jl script into memory as we’ll be using the normalize() function.

There is enough material in graph theory and Julia’s implementation of it for a whole additional book. If you want to learn more about it, please refer to the references. In this chapter we will cover the following topics:

  • Importance of graphs
  • Overview of graph theory
  • Graph statistics
  • Shortest distance in a graph
  • Detection of cycles and maximal cliques in a graph
  • Connected components of a graph
  • Minimum spanning trees (special graphs used in machine learning).

Importance of Graphs

Although graph theory has been around for many years, only recently has it become popular in the data analytics sphere. This could be because graphs are heavy on math, particularly of the abstract kind. Also, database technology has traditionally focused on table-based data structures, and graphs are a complete departure from that paradigm. Fortunately, people are coming around to the value of graphs, and are gradually developing more tools for them (particularly on the database side). Graph databases like Neo4j have come out of the shadows and entered the mainstream, even being featured in some online courses. Before that, they were esoteric and known only to the very few specialists in the graph analysis subfield.

We have seen in the previous chapters how dimensionality can cause major performance issues, especially when distances are used. Also, the diversity (variety of the types of data, such as numeric or nominal) of features doesn’t make things easier, while the computational overhead of the feature space is not a trivial challenge. Needless to say, this isn’t intuitive, since we, as human beings, are accustomed to more direct ways of associating things.

Graphs follow this intuitive approach, simplifying things through the use of an abstract representation of all the connections (and lack thereof) among the various elements of a dataset. These elements can be data points, features, classes (in the case of classification), or anything else that makes sense when modeling the problem at hand.

Believe it or not, you have already used graphs during this book (unless you skipped the most juicy chapters!). Graphs are everywhere in data science, even if they are not made apparent. In fact, it is hard to imagine data analytics in general without graphs. Just to clarify, the graphs we are talking about are graphical representations of relationships, not plots like the ones we explored in Chapter 7. Yet, graphs can be plotted so that we understand them better (in fact, that’s one of their perks: easy two-dimensional plotting). You can view a typical graph in Figure 12.1.


Figure 12.1 An example of a typical graph.

For now, all you need to know about graphs is the following:

  • Graphs (also called networks) are abstract representations of objects and the connections among them. They are composed of nodes (also called vertices) and edges.
  • Nodes are the “stop points” of a graph, often represented by circles, and depicting data points, features, etc. In Figure 12.1 they are points A, B, C, etc.
  • Edges are the (usually) straight lines that connect nodes. They depict various types of relationships: distances, similarities, etc. In Figure 12.1 they are the lines connecting A, B, C, E, and F.
  • Nodes and edges have associated properties including name, some numeric variable, and more.
  • There are numerous ways of plotting a graph, since you can move nodes around as you wish, as long as you maintain the connections within the graph. For large graphs, the nodes are typically located at points on the periphery of a circle, with all the edges as curves connecting the corresponding points.
  • A graph is described as “connected” if there is a path between every pair of its nodes (i.e. you can get from any node to any other node).
  • Edges can be directed or undirected (i.e. one-way vs. two-way streets).
  • You can represent a graph in either a connection or adjacency matrix, or as a set of edges and a list of the corresponding weights.
  • A connection or adjacency matrix is a square matrix of size N, where N is the total number of nodes. Its elements represent the strength of the connection between each pair of nodes, which is termed the “weight.” If a weight is zero, then the corresponding nodes are not connected to each other. This matrix is usually sparse (i.e. mostly zeros).
  • The set of edges shows which nodes are connected. In the example of Figure 12.1, the set of edges looks something like this:

      A, B

      A, E

      A, F

      B, C

      E, F

If there were weights corresponding to these connections, they would be part of the set as well, making the whole data structure a dictionary of sorts. This representation of a graph is particularly useful for graphs having few connections among their nodes.

That’s it! All of the fundamentals of this incredibly complicated theory are distilled into the above points. If you want to learn more about the math behind these concepts, we recommend checking out the comprehensive tutorial created by Professor Harju of the University of Turku, Finland: http://bit.ly/29kRzJx. This tutorial may also be useful in understanding the corresponding algorithms, which we will be using in this chapter (although we won’t be describing them in any detail). Due to the vastness of the topic, in this chapter we’ll be focusing on undirected graphs only.
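Before moving on, here is a minimal sketch that makes the two representations concrete, using only the five connected nodes of Figure 12.1 (A, B, C, E, and F, numbered 1 to 5 here purely for illustration):

    # Edge list: one row per connection (A-B, A-E, A-F, B-C, E-F)
    E_list = [1 2; 1 4; 1 5; 2 3; 4 5]

    # Equivalent adjacency (connection) matrix: 1 wherever two nodes are linked
    n = 5
    AM = zeros(Int64, n, n)
    for i = 1:size(E_list, 1)
        u, v = E_list[i,1], E_list[i,2]
        AM[u,v] = AM[v,u] = 1    # undirected graph, so the matrix is symmetric
    end

If there were weights, the 1s would simply be replaced by the corresponding weight values.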

Custom Dataset

Unlike the previous chapters, where all your code could be set up with just a couple of lines, here we’ll have to do some extra work (nothing too challenging). The reason is that the original datasets are not well-suited to illustrating the points we need to make in this chapter in order to get you acquainted with the set of techniques related to graph analysis.

So, for starters we’ll be introducing a new dataset that’s simple enough to draw on a piece of (electronic) paper. This way there will be no confusion about what’s happening in each algorithm, while examining the edges of the corresponding graph won’t be time-consuming.

Without further ado, here is the dataset. It describes the relationships among a group of friends: Alice, Bob, Christine, Dan, Eve, Frank, and George. Imagine that they all use a social medium of your choice (perhaps the website from the startup in the case study of Chapter 5) and that this website has established some kind of reliable metric that expresses how close they are to each other, in terms of interests, background, affinity for the Julia language, etc. This metric takes values between 0 and 1. The data we have looks something like this:

Alice -- Bob, 0.00956345

Alice -- Christine, 0.952837

Alice -- Eve, 0.775323

Alice -- George, 0.0875062

Bob -- Christine, 0.546019

Bob -- Eve, 0.573757

Bob -- Frank, 0.43756

Bob -- George, 0.665684

Christine -- Dan, 0.445205

Christine -- Frank, 0.248718

Christine -- George, 0.977219

Dan -- Frank, 0.732508

Dan -- George, 0.149959

Eve -- Frank, 0.284119

Frank -- George, 0.990848

This can be easily coded in two ways: either as an array of edges (relationships between pairs of friends) with a corresponding weights array (similarity weight), or as a 7 x 7 matrix with a weight value wherever there is a connection between two friends. Here we will use the first option as it’s more straightforward and user-friendly. Feel free to try out the other approach if you want.

Whatever the case, we have a couple of functions that allow us to translate the data structures of one approach to that of the other. So, if we were to assign numbers to the various friends (according to the order they are listed), and get rid of the “--” string, we would end up with something like this:

1, 2, 0.00956345

1, 3, 0.952837

1, 5, 0.775323

1, 7, 0.0875062

2, 3, 0.546019

2, 5, 0.573757

2, 6, 0.43756

2, 7, 0.665684

3, 4, 0.445205

3, 6, 0.248718

3, 7, 0.977219

4, 6, 0.732508

4, 7, 0.149959

5, 6, 0.284119

6, 7, 0.990848

We can now split this numeric array into two arrays: one for the indexes of the people in the group, and one for the weights signifying the strength of their bonds. The first array will include the first two columns and the second array the last column.

In order to avoid any confusion, it is important to code the two arrays with the appropriate type. Although Julia will not create any fuss about it during the coding phase, the functions of the graph analysis packages most certainly will, since they expect the data in a specific way. So, make sure that the indexes are coded as integers and the weights as floats.

Also, keep in mind that these weights represent similarity in this example (in reality, they can represent whatever you want). If we were to examine how different one member of the group is from another, we would need to use some other metric (such as the inverse of w, or any other function that grows as w diminishes).
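For reference, here is one way the graph g and the weight vector w used in the rest of the chapter could be put together (a minimal sketch, assuming the Graphs package; the notebook may build them somewhat differently):

    using Graphs

    # Edge list (integers) and similarity weights (floats), in the same order
    EL = [1 2; 1 3; 1 5; 1 7; 2 3; 2 5; 2 6; 2 7; 3 4; 3 6; 3 7; 4 6; 4 7; 5 6; 6 7]
    w = [0.00956345, 0.952837, 0.775323, 0.0875062, 0.546019, 0.573757, 0.43756,
         0.665684, 0.445205, 0.248718, 0.977219, 0.732508, 0.149959, 0.284119, 0.990848]

    g = simple_graph(7, is_directed=false)    # 7 nodes: Alice = 1, ..., George = 7
    for i = 1:size(EL, 1)
        Graphs.add_edge!(g, EL[i,1], EL[i,2]) # the weights stay in the separate w array
    end

Note that the order of the weights in w follows the order in which the edges are added, which is what the functions that take a weight vector (like the shortest path one later on) generally expect.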

Statistics of a Graph

This is the graphical counterpart of data exploration, but it’s much more straightforward. Since a graph is a simple data structure, the main things we can learn about it are:

  • The number of nodes (aka vertices) and edges
  • The identities of the nodes and edges
  • The nature of the graph (directed or not)
  • The number of neighbors of a node (termed the degree of the node)
  • The identity of each node’s neighbors
  • The degree of a graph
  • Other characteristics which lend themselves to more specialized applications.

Let’s see how we can calculate these on our simple graph (g) of Alice’s social circle, using the Graphs package:

In[1]:  num_vertices(g)

Out[1]: 7

In[2]:  num_edges(g)

Out[2]: 15

In[3]:  vertices(g)

Out[3]: 1:7

In[4]:  edges(g)

Out[4]: 15-element Array{Graphs.Edge{Int64},1}:

  edge [1]: 1 -- 2

  edge [2]: 1 -- 3

  edge [3]: 1 -- 5

  edge [4]: 1 -- 7

There are a few more of these, which we are omitting here as we don’t want to fill the pages with stuff you can view on the IJulia notebook.

In[5]:  Graphs.is_directed(g)

Out[5]: false

In[6]:  out_degree(1, g)

Out[6]: 4

This basically means that there are four edges stemming from node 1 (Alice).

In[7]: Graphs.out_neighbors(1, g)

This yields a somewhat elaborate TargetIterator object, which contains all the edges that correspond to node 1. These are basically all the neighbors of the node.
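If you prefer to have the neighbors as a plain array of node indexes, you can usually just collect the iterator, e.g. collect(Graphs.out_neighbors(1, g)), which should yield something like [2, 3, 5, 7], assuming that iterating over this object yields the neighboring vertices.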

To find the degree of a graph, we calculate the degrees of all of its nodes and take the largest value:

In[8]:  n = num_vertices(g)
        d = Array(Int64, n)
        for i = 1:n
            d[i] = out_degree(i, g)
        end
        dg = maximum(d)

Out[8]: 5

Cycle Detection

Cycles are paths comprising three or more nodes that are connected to each other in a ring fashion. The nodes of a cycle need not be connected to all of the other members of the cycle, though there sometimes exist multiple connections among a cycle’s elements.

Take for example your closest friends. Some of these may be friends with other members of your group, while others may know just you and one other person. As long as you can trace a ring-like connection among you and your friends, this connection constitutes a cycle in a social network. This is actually how Facebook knows who to recommend to you as potential friends. You may be friends with Tracy, who is friends with Joe, who is Mike’s best buddy, who is also your friend, yet you are not connected to Joe directly. Facebook sees this and based on your connections (and probably other factors too), recommends that you reach out to Joe, while it does the same for Joe regarding you.

If a graph doesn’t have any cycles, it is called acyclical. A special case of acyclical graphs are the ones that are also directed graphs. These graphs are often referred to as DAGs (directed acyclical graphs) and are common in graph theory. So, finding out whether there are cycles in a graph is a simple way to deduce whether it’s a DAG (since we usually already know if it’s directed).

Cycles are interesting in the context of social networks, but they are also extremely valuable artifacts in feature analysis. If you represent all your dataset’s features as nodes in a graph and connect them through a similarity metric (the value of this metric can be the weight of the corresponding edge), then you have a clear representation of how everything is connected in your feature set.

Cycles in this graph represent redundancy in your feature set (i.e. three or more features being similar to each other) and therefore it would be best if we eliminate them. If feature A is closely connected to feature B, while both are closely connected to feature C, then one of them could go. Of course you would want to take other things into account, such as each feature’s connectivity to the target variable (i.e. through its DID score in the case of classification or its cosine similarity in the case of regression). For now we’ll just look at the cycle aspect of the graph, using Julia to sort things out.

Julia the cycle detective

Let’s look at how Julia can help simplify the feature selection process so that you can explain to everyone in a straightforward manner. As before, we’ll be making use of the Graphs package. Before we look into feature analysis, though, let’s see how the cycle detection function works on a simple graph:

In[9]: test_cyclic_by_dfs(g)

Out[9]: true

It doesn’t get much easier than that: just plug the graph object into the test_cyclic_by_dfs() function and you’ll quickly know whether there is a cycle in the graph or not. Now let’s see how we can apply this in a more useful scenario. First, let’s load the data of the OnlineNewsPopularity dataset and obtain the correlation matrix of its numeric features:

In[10]: ONP = readcsv("D:\\data\\OnlineNewsPopularity\\OnlineNewsPopularity.csv");

It is useful to have the number of features in a variable, so:

In[11]: nd = size(ONP, 2) - 2

We can opt to normalize the whole thing in order to calculate the similarity (in case we’ll go with an alternative metric later on):

In[12]: ONP = normalize(map(Float64,ONP[2:end,2:end]), "stat")

In[13]: F = ONP[:,1:(end-1)];

In[14]: CM = cor(F)

Now we can go ahead and build the graph corresponding to the feature correlations. To do this, we’ll need the connectivity matrix, which we can initialize as follows:

In[15]: M = Array(Bool, nd, nd)

Of course, we’ll need to populate it. We’ll be using the threshold of 0.7 to signify a strong similarity, although other thresholds can also be used:

In[16]: for i = 1:nd
            M[i,i] = true
            for j = (i+1):nd
                M[i,j] = M[j,i] = (abs(CM[i,j]) > 0.7)
            end
        end

Next, we can get the edge list going, using the custom CM2EL() function:

In[17]: E = CM2EL(M)
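In case you don’t have the notebook handy, here is a minimal sketch of what a function like CM2EL() could look like, assuming it takes a square Boolean connectivity matrix and returns a two-column array of the connected node pairs (ignoring the diagonal); the actual implementation in the notebook may differ:

    function CM2EL(M::Array{Bool,2})
        n = size(M, 1)
        E = Array(Int64, 0, 2)          # empty edge list: one row per connection
        for i = 1:(n-1)
            for j = (i+1):n
                if M[i,j]
                    E = vcat(E, [i j])  # keep only the upper triangle (undirected)
                end
            end
        end
        return E
    end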

Finally, we can build the actual graph:

In[18]: g2 = simple_graph(nd, is_directed=false)

In[19]: for i = 1:size(E, 1)
            e = E[i,:]
            Graphs.add_edge!(g2, e[1], e[2])
        end

Like in the previous example, we can perform some detective work using the corresponding function that identifies the cycles in a graph:

In[20]: test_cyclic_by_dfs(g2)

Out[20]: true

So, this analysis verifies the fact that there is redundancy in the features, as we realized during our data exploration in the early part of the book.

By the way, if you feel that this cycle detection function is a let-down, that’s because it is! On its own, there is limited value it can provide. However, it is useful in one respect: if you know that there is a cycle in your graph, you have a reason to perform the next kind of analysis, seeking out the connected components of the graph.

Connected Components

A connected component is a fancy term for a set of nodes in a graph, where you can get from node A to node B, for any A and B belonging to that set of nodes. So, if you were to draw the whole graph, a component of it would look like a bunch of nodes that are linked to each other in some kind of fashion. Obviously, you cannot traverse from any A to B if that A and B belong to different components. If a graph has exactly one connected component, then every node of it is accessible from any other node in the graph (i.e. it is connected to them directly or indirectly), and is therefore called a connected graph.

Basically, connected components are sub-graphs, each of which is connected internally but has no connections to the other components. One such sub-graph may be composed of a single node. Needless to say, it’s great to see lots of these micro-graphs when analyzing the feature similarity of a dataset.

Once Julia detects a cycle in a graph, it is time to connect the dots and go deeper into the graph itself, using the connected_components() function. Since this function exists in both graph packages, it is necessary to specify which one you want to use:

In[21]:  Graphs.connected_components(g)

Out[21]: 1-element Array{Array{Int64,1},1}:

  [1,2,3,5,7,6,4]

In this case it looks like all of the nodes are jumbled up (so it wouldn’t be easy to infiltrate Alice’s group of friends!). This makes it a connected graph. Let’s see how our features graph from the OnlineNewsPopularity dataset fares:

In[22]:  Graphs.connected_components(g2)

Out[22]: 43-element Array{Array{Int64,1},1}:

  [1]  

  [2]  

  [3]  

  [4,5,6]

  [7]  

In this case we have a lot of connected components, due to the fact that most of the features are unrelated to the rest of the feature set (or it could be that the threshold we used was quite high). Most of these are omitted here, but if you check the notebook file, you’ll be able to verify this. Also, towards the end of the list of connected components lies [56, 59], denoting a connection between these two features. Let’s take a closer look using the correlation matrix array we calculated previously:

In[23]:  CM[56,59]

Out[23]: 0.714527589349792

Cliques

Cliques are closely related to cycles, as they are sets of nodes in a graph that are all connected to each other. So, when investigating this property of a graph, the two are usually examined together. Cliques are common in social networks and logistics problems; we are generally interested in the largest ones. This is because there are often many cliques, and the smaller ones don’t reveal much. The largest ones, however, are more distinctive, containing more information related to their members.

So, if we have determined that there is a cycle in a graph, the next logical step would be to check for the actual nodes that make cycles possible, namely the clique(s) present in the graph. For example, if there is a clique that Alice belongs to, this means that we can identify her closest friends (or the people who are most similar to her, and are also similar to each other) by examining this clique. Clearly there is some marketing opportunity in this, as well as a more personalized experience for her given that all of the members of her group are on our website.

Let’s now check our simple graph for cliques and try to make sense of Alice’s social circle. To do this we’ll need to use the maximal_cliques() function, specifying the package as a prefix, like before:

In[24]:  Graphs.maximal_cliques(g)

Out[24]: 5-element Array{Array{Int64,N},1}:

  [4,7,3,6]

  [2,7,3,6]

  [2,7,3,1]

  [2,5,6]

  [2,5,1]

What does all this tell us? Well, if Alice wanted to go out with as many of her friends as possible, without making the communication among them awkward, she would have to pick the third clique (since it is the largest one that includes her). Moreover, she could be sure that everyone in it would be friends with everyone else, so it would not be weird for them whenever she went to the restroom! As for Bob, he can do the same with two groups of friends (the first two cliques), so if he is sick and tired of Alice, he can hang out with other members of the gang!

Although finding the maximal clique in a graph is useful overall, it doesn’t yield any interesting information when applied to a feature graph, so we won’t be examining how it applies to our other graph example.

Shortest Path in a Graph

More often than not, the edges of a graph have weights, so finding the shortest route from A to B becomes a different kind of problem. Now the objective is not just getting there, but getting there quickly. To make this happen we will make use of Dijkstra’s algorithm, which is popular in computer science among other fields. Apart from graph theory, it has applications in finding the best way to schedule your tasks (project management) and in getting you to the closest Julia event in your city (or any city, for that matter) through a GPS-based app. There are other minimal distance calculation algorithms out there, the most important of which have been implemented in the Graphs package. We encourage you to experiment with them once you are comfortable with Dijkstra’s algorithm.

We won’t go into depth on the math of this technique, but we encourage you to investigate it through a specialized document such as Melissa Yan’s presentation at MIT: http://bit.ly/29zFogk.

When calculating the distances involved in the shortest path investigation, we interpret the weight of an edge as the distance between the corresponding nodes. So, if our weights represent similarity, we’ll need to adjust them for the shortest path algorithms to work properly.

Let’s now see how we can calculate the distance of the shortest path between two points in a graph using Julia. First of all, let’s invert the weights to get a measure of how dissimilar the various members of the group are:

In[25]: ww = 1 - w

We could have used any number of functions to accomplish this. We refrained from using the obvious one (1/w) since this would yield extreme results for nodes having a very low weight. In general, if you want to preserve the scale of the weights when inverting them, you can apply the function c + maximum(w) - w, where c is a positive constant of your choice. Also, if the weights are very dispersed, you can apply a sigmoid to them first and then reverse them. Be creative!
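For instance, the scale-preserving inversion mentioned above could look like this (a sketch, with the constant c picked arbitrarily):

    c = 0.01                      # any positive constant will do
    ww_alt = c + maximum(w) - w   # the strongest bond becomes the shortest distance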

Once we have all the weights sorted out, we need to define a starting point. In this case, we’ll make Alice the starting point (node 1). To find the shortest path to Alice from the various parts of the graph, we’ll apply the dijkstra_shortest_paths() function, after specifying that it’s from the Graphs package:

In[26]: z = Graphs.dijkstra_shortest_paths(g, ww, 1)

The result is a DijkstraStates object that’s loaded with information about the shortest paths to Alice’s node (set as the last input of the function). To get some more comprehensive information from all this, let’s access the two most relevant parts of this data structure, parents and dists:

In[27]:  z.parents

Out[27]: 7-element Array{Int64,1}:

  1

  7

  1

  6

  1

  7

  3

Hopefully you were not expecting a list of the names of each person’s actual parents, otherwise you’d be disappointed right now. Clearly, the term “parent” has a different meaning here; it refers to the node that is right before the examined node, as we travel from the source node (in this case, 1) to that node. Also note that the source node is a parent of itself.

So, if you were Bob (node 2) and wanted to get a message across to Alice as swiftly as possible, your best option would be to go through node 7, since apparently Alice and Bob are not close (i.e. the distance between them is high, as shown by the corresponding element of the ww array). To find the actual smallest distance between anyone in the graph and the source node, we’ll use the dists part of the aforementioned data structure:

In[28]:  z.dists

Out[28]: 7-element Array{Float64,1}:

  0.0  

  0.40426

  0.047163

  0.346588

  0.224677

  0.079096

  0.069944

From this it becomes clear that the closest person to Alice (apart from herself) is Christine (node 3). This is to be expected, as there is a direct link between the two (edge 1---3), while the similarity weight is very high. On the other hand, the person furthest from Alice is Bob (node 2), since the direct link between them has a low similarity weight and a high distance weight. To go from one to the other, you would be best served by going through other people in the group.

Minimum Spanning Trees

A tree is basically an acyclical graph designed for a specific task. We have already seen classification trees that emulate the decision process for associating a data point to a particular class. If we were to do the same but using other data points instead of a class, we would end up with a spanning tree. If we were to minimize the total (or the average) weight of the connections of this tree, we would have a minimum spanning tree. So, fancy terminology aside, a minimum spanning tree (MST) is an optimal graph connecting all the data points of a dataset, where the overall weight is as small as possible.

One important application of MSTs is in logistics. If you wanted to create a road network with an exceptionally small budget, but you had to abide by the requirement that all the major buildings must be connected somehow (i.e. you could drive from one to any other, even if you had to take a scenic route), an MST would solve your problem.

In this case, the nodes would be the buildings you want to connect, and the edges would be the actual roads you would construct. The people in the area may not be too happy about this, since they might have to drive a lot and there might be just one route to get from the library to the courthouse, but the overall cost of the project would be minimal. This example may be a bit extreme, but it would make sense for a new subway project where you want to cover as many places as possible, without having any redundancy among the train lines used (at least for the initial phase of the project).

MSTs can also be useful in other applications, including classification. MSTs are one of the most intuitive parts of graph theory; as such, they hold a lot of potential for the development of future techniques in data science. We encourage you to delve deeper into this topic, through reliable references such as this web page by Princeton University: http://bit.ly/29pnIol.

Finally, there also exist graphs called maximum spanning trees. However, since any maximum spanning tree problem can be expressed as an MST problem (e.g. by reversing the weights of the edges), we won’t be examining those in this chapter. A maximum spanning tree would be applicable if the weights of a graph expressed distance and the nodes represented features of a dataset, in which case you would want to perform a feature selection that yields a set of features that are as different as possible. Such a selection would be a connected component of the maximum spanning tree of the features graph.

There are two ways to calculate the MST of a graph. These are expressed through two efficient algorithms: Prim’s and Kruskal’s. Both are relatively easy (almost intuitive) and have been implemented in Julia through the Graphs package.

Julia the MST botanist

Julia can also be useful for creating MSTs out of graphs. Going back to Alice’s group, it may be the case that someone in the group wants to make a certain message known to everyone else at a time when it so happens that social media is down. So, how would that person go about communicating this message, keeping the awkwardness to a bare minimum (i.e. connecting everyone in the group through the shortest dissimilarity on average)? That’s where Julia’s MST functions come in handy. Here we’ll be using the kruskal_minimum_spantree() function, along with the distance weights, since this MST algorithm is easier to use:

In[29]: E, W = kruskal_minimum_spantree(g, ww)

You can examine the actual tree on your own by checking out which edges of the original graph are connected (array E). Right now we’ll focus on the average weights (array W), as this is a common metric associated with MSTs:

In[30]:  mean(W)

Out[30]: 0.15093016666666667

So, even though there are some people in Alice’s group who can’t stand each other, the average distance (awkwardness) of this whole communique can be fairly small through the use of an MST that connects everyone.
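If you’d like to see which connections made it into the tree, you can iterate over the E array and translate the node numbers back to names. Here is a small sketch, assuming that each element of E is an edge object with source and target fields, and using a names array introduced here purely for readability:

    people = ["Alice", "Bob", "Christine", "Dan", "Eve", "Frank", "George"]
    for e in E
        println(people[e.source], " -- ", people[e.target])
    end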

Saving and loading graphs from a file

We can make the graph we crafted available for future analyses. The best part is that it can be compressed, so it won’t take up much space. To make this happen we’ll use the save() and load() functions from the LightGraphs package. First, we’ll need to transform the graph to a type that this package can recognize:

In[31]: gg = G2LG(g)

Now we can proceed with the actual saving operation, using the save() function:

In[32]: save("graph data.jgz", gg, "AliceGang", compress=true)

Out[32]: 1

The output denotes the number of graphs saved. Also, the name for the graph (AliceGang) is not essential, so you can omit it if you want (if you do, it will be named something generic like “graph”). However, if you plan to store more graphs in a file, you may want to give them descriptive names so that you can distinguish them from each other after you load them. Speaking of loading a graph data file:

In[33]: gg_loaded = load("graph data.jgz", "AliceGang")

Since we knew the name of the graph in the data file, we asked for that particular graph. If we didn’t, we could just load everything into a temporary variable and access the graph afterwards. This is relatively easy, since the object that the load() command yields when no name is given as a second parameter is a dictionary.
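In other words, something like this should also work (a sketch, assuming the dictionary is keyed by the graph names):

    gd = load("graph data.jgz")    # no name given, so we get a dictionary back
    gg_loaded = gd["AliceGang"]    # pick the graph we want by its name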

Also keep in mind that the loaded graph is going to be a LightGraphs object, so if you want to use it with functions from the Graphs package, you’ll need to convert it, using the custom LG2G() function:

In[34]: g_loaded = LG2G(gg_loaded)

Should you wish to compare the original graph with the loaded one, you can check the edges using something like this:

In[35]: hcat(edges(g_loaded), edges(g))
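In case you’re curious about what the conversion helpers do, here is a minimal sketch of a G2LG()-style function for undirected graphs. This is an assumption-laden illustration: it presumes that LightGraphs offers a Graph(n) constructor and an add_edge! function, and that the edges of a Graphs graph expose source and target fields; the notebook versions may differ, and note that edge weights are not carried over:

    function G2LG(g)
        lg = LightGraphs.Graph(Graphs.num_vertices(g))   # same number of nodes
        for e in Graphs.edges(g)
            LightGraphs.add_edge!(lg, e.source, e.target)
        end
        return lg
    end

LG2G() would do the reverse, walking over the edges of the LightGraphs object and adding them to a simple_graph.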

Graph Analysis and Julia’s Role in it

Although the field of graph analysis is still relatively new and Julia even newer as a technology, it is clear that there is a good match between the two. There are certain signs (or features, if you will) that clearly point toward the possibility of a powerful synergy. Specifically:

  • As more and more people become aware of the value of networks of all sorts, the availability of graph datasets will grow, making graph analysis more relevant. This is good news for Julia, since it is particularly good at working with complex datasets and finding insightful connections before other analytics platforms can even finish loading the data.
  • Graph analysis generally attracts competent data scientists, the ones that are usually head-hunted by Facebook, Twitter, Microsoft, and other well-known companies. If just 1% of these professionals decide to give Julia a chance, the language will grow significantly more robust, as the packages developed will be made by people who are adept at writing clean code and documenting it well.
  • As Julia’s popularity grows in the graph analysis field, database developers may take notice and develop IDEs for them. We expect in the near term that forward-thinking companies like Turi will take notice, because Julia can handle extremely novel data structures. This is due to Julia’s intimate connection with C-based languages (e.g. C++, through the Cxx package).

Beyond these points, it becomes clear to anyone who uses Julia for this kind of task that even though other data science languages can fudge their way into this field through the use of C under the hood, it is the more powerful languages that can truly handle it and allow for innovation in the tools. So, for those people who are not particularly keen on writing C, C++ or Java code, Julia is pretty much the best option out there for graph analysis applications.

It is difficult to overstate how empowering this is, since a robust language allows programmers to experiment more freely, without having to rely on a bunch of third-party scripts that you may never have a chance to examine and understand fully. This may lead to a faster-paced evolution of the graph analytics field.

Summary

  • Graphs are useful for modeling certain kinds of problems, although they could be used for a variety of datasets.
  • Graphs are dimensionless, as they constitute abstract representations of data, with emphasis on the relationships in the data. The relationships can be among data points, features, classes, or any other aspects of the data at hand.
  • Graphs can be directed or undirected. These correspond to cases where the relationships among the nodes are one-way or two-way, respectively.
  • Graphs are defined by their list of edges (this includes the direction of each relationship) and the corresponding weights (if any), or by the connectivity (or adjacency) matrix.
  • There are several statistics we can calculate for a given graph g, the most important of which are the following:
      • The number of nodes – num_vertices(g)
      • The number of edges – num_edges(g)
      • The nodes themselves – vertices(g)
      • The edges themselves – edges(g)
      • Whether a graph is directed – Graphs.is_directed(g)
      • The degree of a node x – out_degree(x, g)
      • The neighbors of a node x – Graphs.out_neighbors(x, g)
      • The degree of a graph – the maximum value of the list of the degrees of all the nodes of the graph.
  • Cycle detection involves finding whether there is a path starting from a node and ending up in the same node. This is important for feature selection, among other things. You can apply cycle detection on a graph g using the function: test_cyclic_by_dfs(g).
  • Graphs that don’t have any cycles and are directed in nature are called directed acyclical graphs, or DAGs for short.
  • A connected component is a set of nodes that are all accessible from each other (i.e. there is a path through which you can go from A to B, for any A and B in the set of nodes). You can uncover all of the connected components of a graph g through use of the function Graphs.connected_components(g).
  • Cliques are sets of nodes in a graph that are all connected to each other. As there are often several such sets in a graph (particularly a social one), we are usually interested in the largest ones.
  • A maximal clique is a clique that cannot be extended by adding any other node of the graph. Usually there are several maximal cliques, depending on what part of the graph we are looking at. We can obtain a whole list of the maximal cliques of a graph g through the function Graphs.maximal_cliques(g).
  • The shortest path in a graph, connecting node x to some other node, is often useful as it allows for efficient traversing through the graph. You can identify these paths and their distances in graph g for node x through the use of Dijkstra’s algorithm, implemented by the function Graphs.dijkstra_shortest_paths(g, ww, x), where ww is the vector of the weights corresponding to the various edges of the graph, expressing some kind of dissimilarity or distance. The resulting object contains several pieces of information related to the shortest paths to node x. The most important components of this object are:
      • parents: a list of the parent of each node, in relation to x. This is the closest node to the node examined that you would have to go through in order to reach x. Remember that node x is a parent of itself.
      • dists: the list of the distances corresponding to each path.
  • Minimum Spanning Trees (or MSTs) are acyclical graphs that connect all the nodes of a given graph, minimizing the overall weight. You can calculate the MST of a graph using one of two algorithms, Prim’s or Kruskal’s. The latter is the easier to work with in Julia and is implemented through the function: kruskal_minimum_spantree(g, ww). The outputs of this are two objects, E and W, representing the list of edges and the corresponding weights of the MST, respectively.
  • You can save and load graphs using the LightGraphs package as follows:
      • Saving a graph g: save(fn, g, gn), where fn is the filename of the data file where the graph g is going to be saved, and gn (an optional parameter) is the name to be used for referring to that graph in the data file.
      • Loading a graph g: load(fn, gn), where fn is the filename of the data file where the graph is stored, and gn (an optional parameter) is the name of the graph to be retrieved. The output of this is a LightGraphs graph object. If you don’t use that parameter, you’ll end up with a dictionary containing all the graphs stored in that data file.
  • Transforming a graph from one type to another is helpful if you want your graph to be accessible from both the Graphs and the LightGraphs packages. This is possible through the G2LG() and LG2G() custom functions, which convert a Graphs graph to a LightGraphs one, and vice versa.
  • Julia can play an important role in the graph analytics field, due to its high performance and the fact that it allows for easy access and understanding of the algorithms used for each task.

Chapter Challenge

  1. Why are graphs useful for data science?
  2. How would you use graph analysis to establish the credibility of a feature set?
  3. Can everything be modeled and analyzed as a graph? Why?
  4. Is it possible to use MSTs as a classification system? Explain.
  5. Can you use graph analysis right off the shelf on a dataset? Why?
  6. Write a program that creates the maximum spanning tree of a given graph. (Hint: if you use a function from a graph package as a dependency, the required program is going to be very small.)
  7. Does the data file of the saved graph (gg) contain all of the information of the graph? Why?