In the last section, we made a lot of headway in graph analytics and discussed an interesting retweet graph. Before we dive into more complicated operations, let's take a step back and consider other options for constructing graphs with GraphX. After this interlude, we will take a quick look at visualization tools and then turn to the more involved applications.
In fact, we have already seen two ways to create GraphX graphs: one was to explicitly build the vertex and edge RDDs ourselves and construct a graph from them; the other was to use Graph.fromEdges. Another very handy possibility is to load a so-called edge list file. An example of this format is the following:
1 3
5 3
4 2
3 2
1 5
So, an edge list file is a text file with one pair of vertex IDs per row, separated by a space. Assuming that we store the preceding data as edge_list.txt in the current working directory, we can load a graph object from it in one line, using the GraphLoader interface:
import org.apache.spark.graphx.GraphLoader
val edgeListGraph = GraphLoader.edgeListFile(sc, "./edge_list.txt")
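Since an edge list carries no attribute information, GraphLoader simply sets both vertex and edge attributes to the integer 1. A quick, minimal inspection of the result could look like this:
// Inspect a few triplets; vertex and edge attributes default to 1.
edgeListGraph.triplets.take(3).foreach(println)
println(s"vertices: ${edgeListGraph.numVertices}, edges: ${edgeListGraph.numEdges}")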
This represents a very convenient entry point, given that we have data provided in the right format. Additional vertex and edge data has to be joined to the resulting graph after loading the edge list file, though. A similar approach to constructing a graph from the preceding data is to use the fromEdgeTuples method provided by the Graph object, which can be used as shown in the following code snippet:
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Graph, VertexId}

// Parse each line of the edge list into a pair of vertex IDs.
val rawEdges: RDD[(VertexId, VertexId)] = sc.textFile("./edge_list.txt").map {
  line =>
    val field = line.split(" ")
    (field(0).toLong, field(1).toLong)
}
val edgeTupleGraph = Graph.fromEdgeTuples(
  rawEdges = rawEdges, defaultValue = "")
The difference from the previous construction is that we create a raw-edge RDD, containing pairs of vertex IDs, which, together with a default value for vertex data, feeds into the construction of the graph.
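As mentioned previously, any actual vertex data has to be joined onto such a graph after construction. The following is a minimal sketch of how this could be done with outerJoinVertices, assuming a hypothetical RDD of user names keyed by vertex ID:
// Hypothetical vertex data: user names keyed by vertex ID.
val names: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Anne"), (2L, "Bob"), (3L, "Chris"), (4L, "Dana"), (5L, "Eve")
))

// Replace the empty default vertex values with names; unmatched vertices keep their old value.
val namedGraph = edgeTupleGraph.outerJoinVertices(names) {
  (id, oldValue, nameOpt) => nameOpt.getOrElse(oldValue)
}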
With this last example, we have essentially seen every single way currently supported in GraphX to load a graph from the given data. There is, however, also the possibility of generating random and deterministic graphs, which is very helpful for tests, quick sanity checks, and demonstrations. To this end, we import the following class:
import org.apache.spark.graphx.util.GraphGenerators
This class has a lot of functionality to offer. The two deterministic graph construction methods build star and grid graphs. A star graph consists of a single central vertex and several other vertices, each connected to the central one by a single edge. Here is how to create a star graph with ten vertices connecting to the central vertex (the argument specifies the total number of vertices, including the center):
val starGraph = GraphGenerators.starGraph(sc, 11)
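As a quick sanity check, we can confirm that all ten edges indeed point to a single hub; in the current GraphX implementation, vertex 0 is the central vertex:
// Only the central vertex has incoming edges, so this yields Array((0,10)).
starGraph.inDegrees.collect()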
The following image is a graphical representation of a star graph:
The other deterministic method for graph creation builds a grid, meaning that the vertices are organised in a matrix and each vertex connects to its direct neighbours both vertically and horizontally. In a grid graph with n rows and m columns, there are precisely n(m-1) + m(n-1) edges: the first term covers all the horizontal connections and the second all the vertical ones. This is how to build a 5 by 5 grid with 40 edges in GraphX:
val gridGraph = GraphGenerators.gridGraph(sc, 5, 5)
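To verify the edge count formula from above, we can simply count the vertices and edges of the generated graph:
// A 5 by 5 grid has 25 vertices and 5*(5-1) + 5*(5-1) = 40 edges.
println(gridGraph.numVertices)  // 25
println(gridGraph.numEdges)     // 40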
As far as random graphs are concerned, we will cover one creation method that approximately reflects the structure of many real-world graphs, namely log-normal graphs. Many structures found in real life follow a power law, in which one quantity varies as a power of another. A concrete example of this is the Pareto principle, often called the 80/20 principle, which states that 80% of the wealth is possessed by 20% of the people, that is, most wealth is concentrated among a few. A variant of this, called Zipf's law, applies to our scenario: a few vertices have a very high degree, while most have very few connections. In the context of a social graph, very few people tend to have a lot of followers, while the majority have very few. This leads to a distribution of vertex degrees that follows a log-normal distribution. The star graph in Figure 10 is an extreme variant of this behavior, in which all the edges are centered around one vertex.
Creating a log-normal graph with 20 vertices in GraphX is simply done as follows:
val logNormalGraph = GraphGenerators.logNormalGraph(
  sc, numVertices = 20, mu = 1, sigma = 3
)
In the preceding code snippet, we also impose a mean of one out-degree per vertex and a standard deviation of three. Let's see if we can confirm the log-normal distribution of the vertex out-degrees:
logNormalGraph.outDegrees.map(_._2).collect().sorted
This will produce a Scala array that should look as follows.
Note that you might get different results, since the graph is randomly generated. Next, let's see how to visualize some of the graphs that we have constructed so far.