Creating a graph

This section explains the generic Scala code used up to the point of creating a GraphX graph from data. This saves time, as the same code is reused in every example; once it has been explained, we will concentrate on the actual graph-based manipulation in each code example.

  1. The generic code starts by importing Spark context, GraphX, and RDD functionality for use in the Scala code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
  2. Then an application object is defined, which extends the Scala App trait. The application name changes for each example, from graph1 to graph5. This application name will be used when running the application using spark-submit:
object graph1 extends App {
  3. As already mentioned, there are two data files that contain vertex and edge information:
val vertexFile = "graph1_vertex.csv"
val edgeFile = "graph1_edges.csv"
  4. The Spark master URL is defined, as is the application name, which will appear in the Spark user interface when the application runs. A new Spark configuration object is created, and the master URL and name are assigned to it:
val sparkMaster = "spark://localhost:7077"
val appName = "Graph 1"
val conf = new SparkConf()
conf.setMaster(sparkMaster)
conf.setAppName(appName)
  5. A new Spark context is created using the configuration that was just defined:
val sparkCxt = new SparkContext(conf)
  6. The vertex information is then loaded from the file into an RDD-based structure called vertices using the sparkCxt.textFile method. The data is stored as a Long VertexId and two strings representing the person's name and age. The data lines are split by commas as this is CSV-based data:
val vertices: RDD[(VertexId, (String, String))] =
  sparkCxt.textFile(vertexFile).map { line =>
    val fields = line.split(",")
    (fields(0).toLong, (fields(1), fields(2)))
  }
  7. Similarly, the edge data is loaded into an RDD-based data structure called edges. The CSV-based data is again split by comma values. The first two data values are converted to Long values, as they represent the source and destination vertex IDs. The final value, representing the relationship of the edge, is left as a String. Note that each record in the RDD structure edges is actually now an Edge record:
val edges: RDD[Edge[String]] =
  sparkCxt.textFile(edgeFile).map { line =>
    val fields = line.split(",")
    Edge(fields(0).toLong, fields(1).toLong, fields(2))
  }
  8. A default value is defined in case a connection or vertex is missing; the graph is then constructed from the RDD-based structures vertices and edges and the default record:
val default = ("Unknown", "Missing")
val graph = Graph(vertices, edges, default)
  9. This creates a GraphX-based structure called graph, which can now be used for each of the examples. Remember that, although these data samples might be small, you could create extremely large graphs using this approach.
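As a quick sanity check, the constructed graph can be queried directly. The snippet below is a minimal sketch, assuming that sparkCxt, vertices, edges, and graph have been created exactly as in the steps above; the counts printed depend entirely on the contents of the two CSV files:

```scala
// Minimal sketch: basic queries against the graph built above.
// Assumes sparkCxt and graph already exist as defined in this section.
println("vertices : " + graph.vertices.count)
println("edges    : " + graph.edges.count)

// Each triplet combines an edge with its source and destination vertex
// attributes, so the person names and the relationship can be printed
// together. The first element of each vertex attribute tuple is the name.
graph.triplets.collect().foreach { triplet =>
  println(triplet.srcAttr._1 + " is the " + triplet.attr +
          " of " + triplet.dstAttr._1)
}
```

The vertices, edges, and triplets members are standard views onto a GraphX Graph; collect() pulls the triplets back to the driver, which is only sensible for small sample data such as this.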

Many of these algorithms are iterative applications, for instance, PageRank and triangle count. As a result, the programs will generate many iterative Spark jobs.
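Both PageRank and triangle count are available directly on the Graph object. The following is a brief sketch of how they might be invoked on the graph built above; the tolerance value 0.0001 is an illustrative choice, not a requirement:

```scala
// PageRank iterates until the ranks change by less than the given
// tolerance, generating one or more Spark jobs per iteration.
val ranks = graph.pageRank(0.0001).vertices

// Triangle count assigns each vertex the number of triangles it is part of.
// Note that GraphX expects edges in canonical orientation (srcId < dstId)
// for this algorithm, which may require pre-processing the edge data.
val triangles = graph.triangleCount().vertices

// Join the ranks back to the vertex attributes for readable output;
// the first element of each vertex attribute tuple is the person's name.
ranks.join(vertices).collect().foreach { case (id, (rank, (name, _))) =>
  println(name + " has rank " + rank)
}
```

Because each iteration spawns further Spark jobs, these calls are where the many iterative jobs mentioned above come from.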
