For this application section, in which we will discuss triangle counting, (strongly) connected components, PageRank, and other algorithms available in GraphX, we will load another interesting graph dataset from http://networkrepository.com/. This time, please download the data from http://networkrepository.com/ca-hollywood-2009.php, which consists of an undirected graph whose vertices represent actors who appeared in movies. Each line of the file contains two vertex IDs representing an edge, meaning that the two actors appeared together in a movie.
The dataset consists of about 1.1 million vertices and 56.3 million edges. Although the file size, even after unzipping, is not particularly large, a graph of this size is a real challenge for a graph processing engine. Since we assume you work with Spark's standalone mode locally, the full graph will likely not fit into your computer's memory and would crash the Spark application. To prevent this, let's restrict the data to a subset of the edges, which also gives us the chance to clean up the file header. We assume you have unpacked ca-hollywood-2009.mtx and stored it in your current working directory. We use the Unix tools tail and head to delete the first two lines and then keep only the first million edges:
tail -n +3 ca-hollywood-2009.mtx | head -n 1000000 > ca-hollywood-2009.txt
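To see what this pipeline does before running it on the full file, the following minimal sketch applies the same two tools to a tiny made-up sample. The file names sample.mtx and sample.txt, the temporary directory, and the sample data are all hypothetical, and we assume the header has the same two-line shape as described above (a banner line followed by a size line).

```shell
# Build a tiny stand-in for the real .mtx file (hypothetical sample data):
# a banner line, a size line, then one edge per line.
workdir=$(mktemp -d)
cat > "$workdir/sample.mtx" <<'EOF'
%%MatrixMarket matrix coordinate pattern symmetric
5 5 4
2 1
3 1
4 2
5 3
EOF

# tail -n +3 starts output at line 3, which skips the two header lines;
# head -n 3 then keeps at most the first three edges.
tail -n +3 "$workdir/sample.mtx" | head -n 3 > "$workdir/sample.txt"

# Sanity checks: how many edges were kept, and what the first one is.
# tr strips the padding that some wc implementations add.
edge_count=$(wc -l < "$workdir/sample.txt" | tr -d ' ')
first_edge=$(head -n 1 "$workdir/sample.txt")
echo "edges kept: $edge_count"
echo "first edge: $first_edge"
```

The same two checks, wc -l and head -n 1, can be run against ca-hollywood-2009.txt to verify that the header is gone and that the file has the expected number of lines.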
If these tools are not available to you, any other method will do, including editing the file manually. Given the structure described previously, we can simply use the edgeListFile functionality to load the graph into Spark and confirm that it indeed has a million edges:
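import org.apache.spark.graphx.GraphLoader
// Load the edge list from the trimmed file; sc is the active SparkContext
val actorGraph = GraphLoader.edgeListFile(sc, "./ca-hollywood-2009.txt")
actorGraph.edges.count() // number of edges loaded, one per line of the file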
Next, let's see what we can do with GraphX to analyze this graph.