GraphX

As shown in the preceding section, we can model many real-life use cases as graphs, with a set of vertices and a set of edges linking the vertices. We also wrote simple code to implement some basic graph operations and queries, such as "Is X a friend of Y?" However, as we explore further, the algorithms become more complicated along with the use cases, and the graphs grow far larger than a single machine can handle.

It is not possible to fit one billion Facebook users, along with all their friendship relations, on one machine or even a few machines.

We need to look beyond a single machine, or a few machines thrown together, and consider highly scalable architectures that can run complex graph algorithms while handling both the volume of data and the complex interconnections between the data elements. We have already seen an introduction to Spark and how it solves some of the challenges of distributed computing and big data analytics. We also saw real-time stream processing and Spark SQL, along with DataFrames and RDDs. Can Spark also solve the challenges of graph algorithms? The answer is GraphX, which ships with Apache Spark and, like the other libraries, sits on top of Spark Core.

GraphX extends the Spark RDD by providing a graph abstraction on top of the RDD concept. Graphs in GraphX are built from vertices (or nodes), which represent objects, and edges (or links), which describe the relations between objects. GraphX provides the means to realize many use cases that suit the graph processing paradigm. In this section, we will learn about GraphX and how to create vertices, edges, and graphs comprising vertices and edges. We will also write code to learn, by example, some techniques surrounding graph algorithms and processing.

To get started, you will need to import some packages, as listed here:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.GraphOps

The fundamental data structure of GraphX is the Graph, which abstractly represents a graph with arbitrary objects associated with its vertices and edges. The Graph provides basic operations to access and manipulate the data associated with vertices and edges, as well as the underlying structure. Like Spark RDDs, the Graph is a functional data structure in which mutating operations return new graphs. This immutable nature of the Graph object makes it possible to do large-scale parallel computations without the risk of running into synchronization problems.
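
To make the immutability point concrete, here is a minimal sketch written to be pasted into spark-shell or run as a Scala script. It assumes a local Spark session, and the user names and the "follows" relation are made-up illustration data, not part of the original example set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Reuse an existing session (for example, in spark-shell) or start a local one.
val spark = SparkSession.builder().master("local[*]").appName("graph-immutability-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical data: three users and two "follows" relations.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph: Graph[String, String] = Graph(users, follows)

// mapVertices does not change `graph`; it returns a new Graph.
val upper: Graph[String, String] = graph.mapVertices((id, name) => name.toUpperCase)

// Collect the vertex attributes of both graphs for comparison.
val originalNames = graph.vertices.collect().map(_._2).sorted
val upperNames = upper.vertices.collect().map(_._2).sorted
```

After this runs, originalNames still holds the lowercase names, while upperNames holds the transformed copies, so both graphs can be used side by side without any locking.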

Concurrent updates to shared objects are a primary reason why many programs need complex multithreaded code.

The Graph class defines the basic data structure, and a helper class, GraphOps, contains additional convenience operations and graph algorithms.
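
As an illustration of the convenience operations that GraphOps adds, the following sketch calls a few of them directly on a Graph (GraphX makes them available through an implicit conversion). It assumes a local Spark session, and the three-edge graph is hypothetical sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.Graph

// Reuse an existing session (for example, in spark-shell) or start a local one.
val spark = SparkSession.builder().master("local[*]").appName("graphops-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical edges given as (source id, destination id) pairs.
val rawEdges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))

// Build a graph from the pairs; every vertex gets the default attribute 1.
val graph = Graph.fromEdgeTuples(rawEdges, 1)

// These members are defined on GraphOps, not on Graph itself.
val vertexCount = graph.numVertices
val edgeCount = graph.numEdges

// inDegrees only lists vertices that have at least one incoming edge.
val inDegreeById = graph.inDegrees.collect().toMap
```

In this sample, vertex 1 has no incoming edges, so it does not appear in inDegreeById at all.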

A graph is defined as a class parameterized by two types, VD and ED, which specify the data types of the two pieces that make up the graph, namely the attributes attached to the vertices and to the edges:

class Graph[VD: ClassTag, ED: ClassTag] 

As we already discussed, a graph consists of vertices and edges. The set of vertices is held in a specialized RDD known as VertexRDD. Similarly, the set of edges is held in a specialized RDD known as EdgeRDD. Together, the vertices and edges form the graph, and all the subsequent operations can be done using these two data structures.
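
The following is a small sketch of how the two structures can be obtained from a graph and queried like ordinary RDDs. It assumes a local Spark session, and the people, ages, and relation labels are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Reuse an existing session (for example, in spark-shell) or start a local one.
val spark = SparkSession.builder().master("local[*]").appName("vertex-edge-rdd-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical vertices: (id, (name, age)).
val people: RDD[(VertexId, (String, Int))] = sc.parallelize(Seq(
  (1L, ("alice", 34)), (2L, ("bob", 27)), (3L, ("carol", 41))))

// Hypothetical edges: source id, destination id, relation label.
val relations: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"), Edge(2L, 3L, "colleague"), Edge(1L, 3L, "friend")))

val graph: Graph[(String, Int), String] = Graph(people, relations)

// The graph exposes its two building blocks as specialized RDDs.
val vertexRdd: VertexRDD[(String, Int)] = graph.vertices
val edgeRdd: EdgeRDD[String] = graph.edges

// Both support the usual RDD operations such as filter and count.
val over30 = vertexRdd.filter { case (_, (_, age)) => age > 30 }.count()
val friendEdges = edgeRdd.filter(e => e.attr == "friend").count()
```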

So, the declaration of the class Graph looks like this:

abstract class Graph[VD, ED] {
  // An RDD containing the vertices and their associated attributes.
  val vertices: VertexRDD[VD]

  // An RDD containing the edges and their associated attributes.
  // The entries in the RDD contain just the source id and target id
  // along with the edge data.
  val edges: EdgeRDD[ED]

  // An RDD containing the edge triplets, which are edges along with the
  // vertex data associated with the adjacent vertices.
  val triplets: RDD[EdgeTriplet[VD, ED]]
}
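
To show what the triplets view adds over the plain edges, here is a minimal sketch that turns each triplet into a readable sentence. It assumes a local Spark session, and the names and the "follows" label are hypothetical sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Reuse an existing session (for example, in spark-shell) or start a local one.
val spark = SparkSession.builder().master("local[*]").appName("triplets-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical data: three users and two "follows" relations.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph: Graph[String, String] = Graph(users, follows)

// Each EdgeTriplet carries the edge attribute plus the attributes of the
// source and destination vertices, so no explicit join is needed here.
val facts = graph.triplets
  .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
  .collect()
  .sorted
```

With the sample data above, facts contains "alice follows bob" and "bob follows carol".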

Now, let's look at the two main components of the Graph class: the VertexRDD and the EdgeRDD.
