Chapter 5. Apache Spark GraphX

In this chapter, I want to examine the Apache Spark GraphX module and graph processing in general. I also want to briefly examine graph-based storage by looking at the graph database called Neo4j. So, this chapter will cover the following topics:

  • GraphX coding
  • Mazerunner for Neo4j

The GraphX coding section, written in Scala, will provide a series of graph coding examples. The work carried out on the experimental Mazerunner product by Kenny Bastani, which I will also examine, ties the two topics together in one practical example. It provides a Docker-based example prototype that replicates data between Apache Spark GraphX and Neo4j storage.

Before writing code in Scala to use the Spark GraphX module, I think it would be useful to provide an overview of what a graph actually is in terms of graph processing. The following section provides a brief introduction using a couple of simple graphs as examples.

Overview

A graph can be considered to be a data structure that consists of a group of vertices and the edges that connect them. The vertices, or nodes, in the graph can be objects, or perhaps people, and the edges are the relationships between them. The edges can be directional, meaning that the relationship operates from one node to the next. For instance, node A is the father of node B.
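As a rough illustration of this structure, the following plain Scala sketch models vertices and directed edges without any Spark dependency. The names `Vertex`, `Edge`, and `outgoing` are my own choices for this sketch and are not part of GraphX or any other library.

```scala
// Minimal directed-graph sketch in plain Scala (no Spark dependency).
// All names here are illustrative assumptions, not a library API.
object DirectedGraphSketch {
  // A vertex (node) has an identifier and a label.
  final case class Vertex(id: Long, name: String)

  // A directed edge runs from `src` to `dst` and names the relationship.
  final case class Edge(src: Long, dst: Long, relationship: String)

  val vertices: List[Vertex] = List(Vertex(1L, "A"), Vertex(2L, "B"))

  // "Node A is the father of node B": the edge points from A to B.
  val edges: List[Edge] = List(Edge(1L, 2L, "father"))

  // Names of the vertices reached by following outgoing edges from `id`.
  def outgoing(id: Long): List[String] = {
    val names = vertices.map(v => v.id -> v.name).toMap
    edges.filter(_.src == id).map(e => names(e.dst))
  }

  def main(args: Array[String]): Unit = {
    println(outgoing(1L)) // the vertices that A points to
    println(outgoing(2L)) // B has no outgoing edges in this sketch
  }
}
```

Because the edge is stored with a distinct source and destination, following it from B back to A yields nothing, which is what directional means here.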

In the following diagram, the circles represent the vertices or nodes (A to D), whereas the thick lines represent the edges, or relationships, between them (E1 to E6). Each node or edge may have properties, and these values are represented by the associated grey squares (P1 to P7).

So, if a graph represented a physical route map for route finding, then the edges might represent minor roads or motorways. The nodes would be motorway junctions or road intersections. The node and edge properties might be the road type, speed limit, distance, cost, and grid location.
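The route map idea can be sketched in plain Scala by attaching properties to both nodes and edges. Everything below is hypothetical: the `Junction` and `Road` types, the field names, and the sample values are assumptions made purely for illustration.

```scala
// Route-map sketch: junctions are nodes, roads are edges with properties.
// Types, field names, and sample data are hypothetical.
object RouteMapSketch {
  // A node: a motorway junction or road intersection with a grid location.
  final case class Junction(id: Long, name: String, gridRef: String)

  // An edge: a road between two junctions, carrying its own properties.
  final case class Road(src: Long, dst: Long, roadType: String,
                        speedLimitKmh: Double, distanceKm: Double)

  val junctions: List[Junction] = List(
    Junction(1L, "J1", "A1"),
    Junction(2L, "J2", "B3"),
    Junction(3L, "J3", "C2"))

  val roads: List[Road] = List(
    Road(1L, 2L, "motorway", 100.0, 100.0),
    Road(2L, 3L, "minor", 50.0, 25.0))

  // Time to traverse one road at its speed limit, in hours.
  def travelTimeHours(road: Road): Double =
    road.distanceKm / road.speedLimitKmh

  // Total time for a route given as a sequence of roads.
  def totalTimeHours(route: List[Road]): Double =
    route.map(travelTimeHours).sum

  def main(args: Array[String]): Unit = {
    println(totalTimeHours(roads)) // hours for J1 -> J2 -> J3
    println(roads.filter(_.roadType == "motorway").map(_.distanceKm).sum)
  }
}
```

A route finder would combine edge properties such as these into a cost for each candidate path; this sketch only shows where the properties live.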

Graphs are applied in many areas; some examples are fraud modeling, financial currency transaction modeling, social modeling (as in friend-to-friend connections on Facebook), map processing, web processing, and page ranking.


The previous diagram shows a generic example of a graph with associated properties. It also shows that the edge relationships can be directional, that is, the E2 edge acts from node B to node C. The following example, by contrast, uses family members and the relationships between them to create a graph. Note that there can be multiple edges between two nodes or vertices, for instance, the husband-and-wife relationships between Mike and Sarah. It is also possible for a node or edge to carry multiple properties.
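A small plain Scala sketch of part of the family graph shows how two nodes can share more than one edge. Mike as node 1 and Flo as node 6 follow the family example; Sarah's identifier, the direction chosen for the Husband and Wife edges, and the `Person`, `Relationship`, and `edgesBetween` names are assumptions made for this sketch.

```scala
// Family-graph sketch with multiple edges between the same pair of nodes.
// Sarah's id (2) and the Husband/Wife edge directions are assumptions.
object FamilyGraphSketch {
  final case class Person(id: Long, name: String)
  final case class Relationship(src: Long, dst: Long, kind: String)

  val people: List[Person] = List(
    Person(1L, "Mike"),
    Person(2L, "Sarah"),
    Person(6L, "Flo"))

  val relationships: List[Relationship] = List(
    Relationship(1L, 2L, "Husband"), // Mike -> Sarah
    Relationship(2L, 1L, "Wife"),    // Sarah -> Mike
    Relationship(6L, 1L, "Sister"))  // Flo -> Mike

  // All edges joining two people, regardless of direction.
  def edgesBetween(a: Long, b: Long): List[Relationship] =
    relationships.filter(r => Set(r.src, r.dst) == Set(a, b))

  def main(args: Array[String]): Unit = {
    println(edgesBetween(1L, 2L).map(_.kind)) // two edges: Mike and Sarah
    println(edgesBetween(6L, 1L).map(_.kind)) // one edge: Flo and Mike
  }
}
```

Keeping the edges in a list, rather than in a map keyed by node pair, is what allows several relationships between the same two people.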


So, in the family graph, the Sister relationship acts from node 6, Flo, to node 1, Mike. These are simple graphs, intended to explain the structure of a graph and the nature of its elements. Real graph applications can reach extreme sizes, and require both distributed processing and distributed storage to enable them to be manipulated. Facebook is able to process graphs containing over 1 trillion edges using Apache Giraph (source: Avery Ching, Facebook). Giraph is an Apache Hadoop ecosystem tool for graph processing, which has historically based its processing on MapReduce, but now uses TinkerPop, which will be introduced in Chapter 6, Graph-based Storage. Although this book concentrates on Apache Spark, that edge count gives a very impressive indication of the size that a graph can reach.

In the next section, I will examine the use of the Apache Spark GraphX module using Scala.
