Chapter 9. Graph Analytics with GraphX

Graph analytics finds relationship patterns in data. This chapter introduces graph processing techniques that are commonly used in page ranking, search engines, finding relationships in social networks, finding the shortest path between two places, recommending products, and many other applications.

In this chapter, we will cover the following topics:

  • Introducing graph processing
  • Getting started with GraphX
  • Analyzing flight data using GraphX
  • Introducing GraphFrames

Introducing graph processing

As the number of users grows into the millions in large organizations, relational database performance degrades when finding relationships between these users. For example, finding the direct relationship between two friends takes a simple SQL join. But to find a friend-of-a-friend relationship six levels deep, you have to join the tables six times in a single SQL query, which leads to poor performance. Graph processing systems are designed to traverse relationships directly, so this kind of query degrades far less as the graph grows. In relational databases, relationships are established only by joining tables. In graph databases, relationships are first-class citizens. Let's understand what a graph is and how graphs are created and processed.
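To make the friend-of-a-friend idea concrete, here is a minimal sketch in plain Python (not the GraphX API) that walks a friendship graph breadth-first up to a given number of hops. The names come from Figure 9.1 plus a hypothetical fourth person, Noah; the friendships themselves are assumed for illustration.

```python
from collections import deque

# Hypothetical friendship data stored as an adjacency list.
FRIENDS = {
    "Jacob": ["Jessica"],
    "Jessica": ["Jacob", "Emily"],
    "Emily": ["Jessica", "Noah"],
    "Noah": ["Emily"],
}


def within_hops(graph, start, max_hops):
    """Return {person: hop count} for everyone reachable in at most max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, []):
            if friend not in hops:  # first visit is the shortest hop count
                hops[friend] = hops[person] + 1
                queue.append(friend)
    del hops[start]  # the starting person is not their own friend
    return hops


print(within_hops(FRIENDS, "Jacob", 2))  # {'Jessica': 1, 'Emily': 2}
```

Each person is visited at most once, so going six levels deep only changes the `max_hops` argument; it does not require six nested joins.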

What is a graph?

A graph is a collection of vertices connected to each other by edges, as shown in Figure 9.1. Vertex is a synonym for node; a vertex can be a place or a person, and its relationships are expressed as edges. Jacob, Jessica, and Emily in the figure are vertices and their relationships are edges. This is a simple example of a social graph. Now imagine expressing the whole of Facebook as a social graph: finding relationships between its users can be a very complex task. In Chapter 3, Deep Dive into Apache Spark, we learned about the Directed Acyclic Graph (DAG) that Spark creates for jobs, in which Resilient Distributed Datasets (RDDs) are the vertices and transformations are the edges. Graphs also model routes: there are multiple ways to go from one place to another, and if we represent places as vertices and roads as edges, graph processing can find the shortest path between two places:

Figure 9.1: Representation of a family relationship graph

A weight attached to an edge expresses its strength or cost, for example the distance between two cities. Graphs can be directed or undirected. An RDD's DAG is directed because each edge goes in only one direction. Graphs can also be cyclic or acyclic. A social network is cyclic: you can start with one person and follow friendships through different friends to get back to the original person.
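The role of weights and direction can be illustrated with a shortest-path search. The following is a minimal sketch of Dijkstra's algorithm in plain Python, not the GraphX implementation; the places `A` to `D` and their distances are hypothetical.

```python
import heapq

# Hypothetical one-way roads: ROADS[src][dst] is the distance from src to dst.
ROADS = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 6},
    "D": {},
}


def shortest_path(graph, source, target):
    """Return (total distance, list of places), or (None, []) if unreachable."""
    best = {source: 0}      # best known distance to each place
    previous = {}           # place visited just before each place on its best route
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in best:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


print(shortest_path(ROADS, "A", "D"))  # (4, ['A', 'C', 'B', 'D'])
```

The direct road from `A` to `B` has weight 4, but the detour through `C` costs only 3, so the weighted shortest path differs from the path with the fewest edges. Because the edges are directed, there is no route from `D` back to `A`.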

Graphs are everywhere, and almost everything can be represented as a graph. Graph processing is applied in telecoms, aviation, bioinformatics, social networks, and many more fields.

Graph databases versus graph processing systems

There are two types of technologies in graph processing: graph databases and graph processing systems. Graph databases can be seen as OLTP databases that provide transactions, updates, deletes, and a query language. Graph processing systems can be seen as OLAP systems that provide offline analytic capabilities. The following table shows the important graph databases and graph processing systems:

| Graph databases | Graph processing systems |
| --- | --- |
| Neo4j, Titan, OrientDB, AllegroGraph, GraphBase, and Oracle Spatial and Graph. NoSQL databases such as HBase and Cassandra can also be used for storing graphs. | GraphX, Apache Giraph, GraphLab, Apache Hama, and Microsoft's Graph Engine. |

The GraphLab framework implements the Message Passing Interface (MPI) model to run complex graph algorithms using data in HDFS. Apache Giraph and Apache Hama are based on the Bulk Synchronous Parallel (BSP) model inspired by Google's Pregel project. GraphX is built on top of Spark and supports a variant of the Pregel API as well. This chapter focuses on graph processing systems and especially GraphX.
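The BSP model behind Pregel can be sketched in a few lines. The following plain Python example is an illustration of the idea, not the Giraph or GraphX API: in every superstep each active vertex sends its label to its neighbors, a barrier follows, and then each vertex keeps the smallest label it has seen. When no label changes, every vertex holds the smallest vertex ID of its connected component. The function name `pregel_min_label` and the sample data are hypothetical.

```python
def pregel_min_label(vertices, edges, max_supersteps=20):
    """Label each vertex with the smallest vertex ID in its connected component."""
    # Treat edges as undirected for connectivity purposes.
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    labels = {v: v for v in vertices}  # every vertex starts with its own ID
    active = set(vertices)
    for _ in range(max_supersteps):
        if not active:
            break  # no vertex changed in the last superstep; computation halts
        # Message phase: active vertices send their current label to neighbors.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox.setdefault(n, []).append(labels[v])
        # Barrier, then compute phase: labels are updated only after all
        # messages of this superstep have been collected.
        active = set()
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < labels[v]:
                labels[v] = smallest
                active.add(v)  # changed vertices send messages next superstep
    return labels


print(pregel_min_label([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

On a single machine the barrier is implicit in the loop structure; in a distributed system it is the synchronization point at which all workers wait before starting the next superstep.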

Apache Giraph, which originated at Yahoo!, is a stable system that can process a trillion edges on top of Hadoop, but it is not supported by all major Hadoop vendors. If your use case is a pure graph problem, Apache Giraph is a robust and stable choice, but if graph processing is just one part of the solution, GraphX provides a unified solution with the power of Spark's core capabilities.

Introducing GraphX

GraphX is a graph processing system built on top of Apache Spark that can be easily integrated with other Spark modules to unify ETL, exploratory analytics, and graph analytics. GraphX represents graphs using specialized RDD types, VertexRDD and EdgeRDD. Spark's fast iterative processing and in-memory capabilities remove the MapReduce limitations that systems such as Apache Giraph and Pregel were designed to work around. GraphX stores a graph's edges in one collection and its vertices in another. This allows graph algorithms implemented in GraphX to view the data either as a graph or as collections of edges or vertices, and to efficiently combine multiple processing types within the same program.
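The idea of keeping vertices and edges in separate collections, and viewing them either individually or joined as a graph, can be sketched in plain Python. This is only an analogy for the data model described above, not GraphX code; the helper names and the relationships between the people are hypothetical.

```python
# Vertex collection: (vertex ID, property). Names are taken from Figure 9.1.
vertices = [(1, "Jacob"), (2, "Jessica"), (3, "Emily")]

# Edge collection: (source ID, destination ID, property). Relationships are assumed.
edges = [(1, 2, "friend"), (2, 3, "friend"), (1, 3, "sibling")]


def triplets(vertices, edges):
    """Graph view: join each edge with the properties of both of its endpoints."""
    names = dict(vertices)
    return [(names[src], relation, names[dst]) for src, dst, relation in edges]


def out_degrees(edges):
    """Collection view: count outgoing edges per vertex using only the edge list."""
    counts = {}
    for src, _dst, _relation in edges:
        counts[src] = counts.get(src, 0) + 1
    return counts


print(triplets(vertices, edges))
# [('Jacob', 'friend', 'Jessica'), ('Jessica', 'friend', 'Emily'), ('Jacob', 'sibling', 'Emily')]
print(out_degrees(edges))
# {1: 2, 2: 1}
```

The degree count never touches the vertex collection, while the triplet view needs both. Being able to switch between these views without copying data into a different system is the point of the design.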

Large graphs typically do not fit on a single machine and must be stored on distributed storage systems such as HDFS or S3.

Graph algorithms

GraphX provides the following algorithms out of the box in version 2.0. Some of these algorithms are applied in practice in the next section of this chapter:

  • Connected components
  • Label propagation
  • PageRank
  • SVD++
  • Shortest paths
  • Strongly connected components
  • Triangle count
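To give a feel for one of the listed algorithms, here is a minimal PageRank sketch in plain Python. It uses the textbook formulation in which ranks sum to 1 and the rank of vertices without outgoing edges is spread evenly; GraphX's own implementation may normalize and handle such vertices differently. The damping factor of 0.85, the iteration count, and the sample edges are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {vertex: rank} after a fixed number of power iterations."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}  # start from a uniform rank
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is shared by everyone.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        # Every vertex splits its damped rank evenly among its out-neighbors.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical link graph: A, B, and C form a cycle, and D also links to C.
ranks = pagerank([("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")])
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

In this sample, `C` ends up with the highest rank because it receives links from two vertices, and `D` ends up with the lowest because nothing links to it.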