In this chapter, we have seen how to put large-scale graph analytics into practice using Spark GraphX. Modeling entity relationships as graphs with vertices and edges is a powerful paradigm for approaching many interesting problems.
In GraphX, graphs are finite, directed property graphs, potentially with multiple edges and loops. GraphX performs graph analytics on highly optimized versions of vertex and edge RDDs, which allows you to combine data-parallel and graph-parallel computation. We have seen how such graphs can be created either by loading them from an edge list file with edgeListFile or by constructing them from other RDDs. On top of that, we have seen how easy it is to create both random and deterministic graph data for quick experiments. Using just the rich built-in functionality of the Graph model, we have shown how to investigate the core properties of a graph. To visualize more complex graphs, we introduced Gephi and an interface to it, which helps in gaining intuition about the structure of the graph at hand.
Among the many other possibilities that Spark GraphX has to offer, we introduced two powerful graph analytics tools, namely aggregateMessages and the Pregel API. Most of GraphX’s built-in algorithms are written using one of the two, and we have seen how to write our own algorithms with each of them. We also gave a brief overview of the GraphFrames package, which builds on top of DataFrames and comes equipped with an elegant query language that is not available in plain GraphX, making it a handy addition for graph analytics.
In terms of practical applications, we have worked with a retweet graph as well as a Hollywood movie actor graph. We carefully explained and applied Google’s PageRank algorithm, studied connected and strongly connected components of graphs, and counted triangles as a measure of clustering. We finished by discussing the relationship between Spark MLlib and GraphX for advanced machine learning applications.