Spark applications

Let's understand the difference between the Spark shell and Spark applications, and how applications are created and submitted.

Spark Shell versus Spark applications

Spark lets you access your datasets through a simple, yet specialized, Spark shell for Scala, Python, R, and SQL. Users do not need to create a full application to explore data; they can start with interactive commands that can be converted into programs later, which provides higher developer productivity. A Spark application, in contrast, is a complete program that creates its own SparkContext and is submitted with the spark-submit command.
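
For instance, in the PySpark shell the context is already created and exposed as sc, so a dataset can be explored with a couple of lines typed at the prompt (a minimal sketch; the numbers are just illustrative data):

# Typed at the pyspark shell prompt; the shell has already created `sc`
rdd = sc.parallelize(range(1, 101))               # a small in-memory dataset
print(rdd.filter(lambda x: x % 2 == 0).count())   # prints 50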

Scala programs are generally written in Scala IDE or IntelliJ IDEA and compiled with SBT. Java programs are generally written in Eclipse and compiled with Maven. Python and R programs can be written in any text editor or in IDEs such as Eclipse. Once Scala and Java programs are written, they are compiled into a JAR and launched with the spark-submit command, as shown in the following sections. Since Python and R are interpreted languages, their scripts are executed directly with the spark-submit command. Spark 2.0 is built with Scala 2.11, so Scala 2.11 is needed to build Spark applications in Scala.
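
For example, a standalone Python script (here a hypothetical my_app.py) can be submitted directly, with no compile step:

[cloudera@quickstart ~]$ spark-submit --master local[4] my_app.py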

Creating a Spark context

The first step in any Spark program is to create a SparkContext, which provides the entry point to the Spark API. Configuration properties are set by passing a SparkConf object to SparkContext, as shown in the following Python code:

from pyspark import SparkConf, SparkContext
# Build the configuration: cluster master, application name, executor memory
conf = (SparkConf()
  .setMaster("spark://masterhostname:7077")
  .setAppName("My Analytical Application")
  .set("spark.executor.memory", "2g"))
# Create the SparkContext, the entry point to the Spark API
sc = SparkContext(conf=conf)
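
Once the context exists, it is the handle for everything else the application does; the following lines, continuing the snippet above, build a small RDD, run an action on it, and shut the context down (a minimal sketch):

# Use the context created above to build and act on an RDD
data = sc.parallelize([1, 2, 3, 4, 5])
print(data.map(lambda x: x * x).sum())   # prints 55
sc.stop()                                # release resources when done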

SparkConf

SparkConf is the primary configuration mechanism in Spark, and an instance is required when creating a new SparkContext. A SparkConf instance contains string key/value pairs for the configuration options that the user wants to use to override the defaults. SparkConf settings can be hardcoded in the application code, passed from the command line, or read from configuration files, as shown in the following code:

from pyspark import SparkConf, SparkContext
# Construct a conf
conf = SparkConf()
conf.set("spark.app.name", "My Spark App")
conf.set("spark.master", "local[4]")
conf.set("spark.ui.port", "36000")  # Override the default port
# Create a SparkContext with this configuration
sc = SparkContext(conf=conf)
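
The effective settings can be inspected after the context is created, which is useful for confirming what a given option resolved to (a small sketch using the context built above):

# Inspect the configuration the running context actually uses
print(sc.getConf().get("spark.app.name"))   # My Spark App
print(sc.getConf().get("spark.ui.port"))    # 36000
for key, value in sc.getConf().getAll():    # dump all explicitly set pairs
    print(key, "=", value)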

Tip

The SparkConf associated with a given application is immutable once it is passed to the SparkContext constructor. That means that all configuration decisions must be made before a SparkContext is instantiated.
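
In practice, changing a setting therefore means stopping the existing context and creating a new one with a fresh SparkConf, roughly as follows (a sketch; the values shown are only examples):

from pyspark import SparkConf, SparkContext
sc.stop()   # the old context and its SparkConf cannot be modified in place
# Build a fresh configuration with the changed setting and a new context
new_conf = (SparkConf()
  .setAppName("My Analytical Application")
  .set("spark.executor.memory", "4g"))
sc = SparkContext(conf=new_conf)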

SparkSubmit

The spark-submit script is used to launch Spark applications on a cluster with any of the supported cluster resource managers.

spark-submit allows configuration values to be set dynamically; they are injected into the environment when the application is launched and picked up automatically when a new SparkConf is constructed. When spark-submit is used, the user application can therefore simply construct an 'empty' SparkConf and pass it directly to the SparkContext constructor. spark-submit provides built-in flags for the most common Spark configuration parameters, plus a generic --conf flag that accepts any Spark configuration value, as shown in the following example:

[cloudera@quickstart ~]$ spark-submit \
  --class com.example.loganalytics.MyApp \
  --master yarn \
  --name "Log Analytics Application" \
  --executor-memory 2G \
  --num-executors 50 \
  --conf spark.shuffle.spill=false \
  myApp.jar \
  /data/input \
  /data/output
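
The application side of this pattern stays small: the code constructs an empty SparkConf and lets spark-submit supply the master, application name, memory settings, and so on. The example above is a packaged Scala application; the following is a minimal Python sketch of the same pattern, with the input and output paths taken from the command-line arguments:

# A minimal PySpark application that leaves all configuration to spark-submit
import sys
from pyspark import SparkConf, SparkContext
if __name__ == "__main__":
    conf = SparkConf()              # empty: master, app name, memory, and so on
    sc = SparkContext(conf=conf)    # are injected by spark-submit
    input_path, output_path = sys.argv[1], sys.argv[2]
    lines = sc.textFile(input_path)
    lines.saveAsTextFile(output_path)   # transformations would normally go here
    sc.stop()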

When there are many configuration parameters, put them all in a properties file and pass the file to the application using --properties-file:

[cloudera@quickstart ~]$ spark-submit \
   --class com.example.MyApp \
   --properties-file my-config-file.conf \
   myApp.jar

## Contents of my-config-file.conf ##
spark.master spark://5.6.7.8:7077
spark.app.name "My Spark App"
spark.ui.port 36000
spark.executor.memory 2g
spark.serializer org.apache.spark.serializer.KryoSerializer

Application dependency JARs included with the --jars option are automatically shipped to the worker nodes. For Python, the equivalent --py-files option can be used to distribute .egg, .zip, and .py libraries to the executors. Note that JARs and files are copied to the working directory of each SparkContext on the executor nodes. It is usually better to bundle all code dependencies into a single assembly JAR when building the application, which is easy to do with Maven or SBT.
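
For example, a Python application with its own library dependencies might be submitted as follows (the file names are hypothetical):

[cloudera@quickstart ~]$ spark-submit \
   --master yarn \
   --py-files dependencies.zip,helpers.py \
   my_app.py /data/input /data/output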

To get a complete list of options for spark-submit, use the following command:

[cloudera@quickstart ~]$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

SparkConf precedence order

Spark configuration precedence, from higher to lower, is as follows:

  1. Configurations declared explicitly in the user's code using the set() function on a SparkConf object.
  2. Flags passed to spark-submit or spark-shell.
  3. Values in the spark-defaults.conf properties file.
  4. Default values of Spark.
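
For example, if an application hardcodes the application name with set() and a different name is also passed to spark-submit with --name, the value set in code wins (a minimal sketch; the names are only illustrative):

# precedence_demo.py (hypothetical): set() in code beats spark-submit flags
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.app.name", "Name Set In Code")
sc = SparkContext(conf=conf)
print(sc.appName)   # prints "Name Set In Code" even if --name was passed
sc.stop()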

Important application configurations

Some of the important configuration parameters for submitting applications are listed below, along with the equivalent configuration property, default value, and meaning of each:

--master
  Equivalent property: spark.master
  Default: None; if this parameter is not specified, local mode is chosen.
  Meaning: Spark's master URL. Options are local, local[*], local[N], spark://masterhostname:7077, yarn-client, yarn-cluster, and mesos://host:port.

--class
  Equivalent property: none
  Default: none
  Meaning: The application's main class.

--deploy-mode
  Equivalent property: none
  Default: client mode
  Meaning: Deploy the application in client or cluster mode.

--conf
  Equivalent property: none
  Default: none
  Meaning: Pass arbitrary configuration in key=value format.

--py-files
  Equivalent property: none
  Default: none
  Meaning: Add Python dependencies.

--supervise
  Equivalent property: none
  Default: none
  Meaning: Restart the driver if it fails.

--driver-memory
  Equivalent property: spark.driver.memory
  Default: 1G
  Meaning: Memory for the driver.

--executor-memory
  Equivalent property: spark.executor.memory
  Default: 1G
  Meaning: Memory for each executor.

--total-executor-cores
  Equivalent property: spark.cores.max
  Default: None; on Spark's standalone cluster manager, spark.deploy.defaultCores is used.
  Meaning: Total executor cores; used in Spark standalone mode or Mesos coarse-grained mode only.

--num-executors
  Equivalent property: spark.executor.instances
  Default: 2
  Meaning: Number of executors in YARN mode.

--executor-cores
  Equivalent property: spark.executor.cores
  Default: 1 in YARN mode; all available cores on the worker in standalone mode.
  Meaning: Number of cores on each executor.
