Chapter 4. Creating a SparkContext

This chapter covers how to create a SparkContext for your cluster. The SparkContext class represents the connection to a Spark cluster and is the entry point for interacting with Spark. We need a SparkContext instance so that we can interact with Spark and distribute our jobs. In Chapter 2, Using the Spark Shell, we interacted with Spark through the Spark shell, which created a SparkContext for us. With a SparkContext you can create RDDs, broadcast variables, accumulators (counters), and so on, and actually do fun things with your data. The Spark shell itself is an example of interacting with a Spark cluster through a SparkContext; its implementation lives in ./repl/src/main/scala/spark/repl/SparkILoop.scala.

The following code snippet, taken from the Spark shell, creates a SparkContext instance named "Spark shell" using the MASTER environment variable (or local, if it is not set) and doesn't specify any dependencies. This is because the Spark shell is built into Spark and, as such, doesn't have any JAR files that need to be distributed.

def createSparkContext(): SparkContext = {
  val master = this.master match {
    case Some(m) => m
    case None => {
      val prop = System.getenv("MASTER")
      if (prop != null) prop else "local"
    }
  }
  sparkContext = new SparkContext(master, "Spark shell")
  sparkContext
}
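In practice, this means that launching the shell with the MASTER environment variable set (for example, to spark://<host>:7077, the default port of a standalone master) attaches it to that cluster, while leaving MASTER unset runs everything locally.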

For a client to establish a connection to the Spark cluster, the SparkContext object needs some basic information, as follows:

  • master: The master URL can be in one of the following formats:
    • local[n]: to run locally with n worker threads (use local for a single thread)
    • spark://[masterhost]:[port]: to connect to a standalone Spark cluster
    • mesos://[host]:[port]: to connect to a Mesos cluster, if you are running one
  • application name: This is the human-readable name of your application
  • sparkHome: This is the path to Spark on the master/worker machines
  • jars: This is the list of paths to the JAR files required by your job

Scala

In a Scala program, you can create a SparkContext instance using the following code:

val sparkContext = new SparkContext(master_path, "application name", ["optional spark home path"], ["optional list of jars"])
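The value of master_path determines where the job runs. The following is a minimal sketch of the master URL formats listed earlier; the host name and paths (masterhost, /opt/spark, /path/to/myapp.jar) are placeholders for your own cluster:

// Run locally with four worker threads
val localContext = new SparkContext("local[4]", "my app")

// Connect to a standalone Spark cluster (7077 is the default master port),
// shipping one JAR with the job; a mesos://[host]:[port] URL is used the same way
val clusterContext = new SparkContext("spark://masterhost:7077", "my app",
  "/opt/spark", Seq("/path/to/myapp.jar"))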

While you can hardcode all of these values, it's better to read them from the environment with reasonable defaults. This approach provides maximum flexibility to run the code in a changing environment without having to recompile the code. Using local as the default value for the master machine makes it easy to launch your application locally in a test environment. By carefully selecting the defaults, you can avoid having to over-specify them. An example would be as follows:

import spark.SparkContext
import spark.SparkContext._
import scala.util.Properties

// Read the connection settings from the environment, with sensible defaults
val master = Properties.envOrElse("MASTER", "local")
val sparkHome = Properties.envOrElse("SPARK_HOME", null)
// Wrap the optional JARS environment variable so that an unset value becomes an empty Seq
val myJars = Option(System.getenv("JARS")).toSeq
val sparkContext = new SparkContext(master, "my app", sparkHome, myJars)
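
Once the SparkContext is created, you can start working with your data. As a quick sanity check, the following minimal sketch (using made-up sample data) parallelizes a local collection into an RDD and counts its elements:

// Distribute a small local collection as an RDD and run a simple action on it
val numbers = sparkContext.parallelize(1 to 100)
println(numbers.count())  // prints 100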