Launching and deploying Spark programs

A Spark program can run by itself or over cluster managers. The first option is similar to running a program locally with multiple threads, where each thread is considered one Spark job worker. Of course, there is no cluster-level parallelism, but it is a quick and easy way to launch a Spark application, and we will deploy applications in this mode, by way of demonstration, throughout the chapter. For example, we can run the following script to launch a Spark application:

./bin/spark-submit examples/src/main/python/pi.py

This is precisely what we did in the previous section. Alternatively, we can specify the number of threads:

./bin/spark-submit --master local[4] examples/src/main/python/pi.py

In the preceding command, Spark runs locally with four worker threads. To use as many worker threads as there are cores on the machine, run the following command:

./bin/spark-submit --master local[*] examples/src/main/python/pi.py
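The bundled pi.py script estimates π by Monte Carlo sampling: it throws random points into the unit square and counts how many land inside the quarter circle. As a minimal sketch of what such a submitted script contains (the filename my_pi.py and the sample count here are our own illustrative choices, not part of the bundled example), consider:

import random
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PiSketch").getOrCreate()
    n = 1000000  # number of random points to sample (illustrative choice)

    def inside(_):
        # Draw a point in the unit square and test whether it
        # falls inside the quarter circle of radius 1
        x, y = random.random(), random.random()
        return x * x + y * y < 1

    # Count the hits in parallel, then scale by 4 to estimate pi
    count = spark.sparkContext.parallelize(range(n)).filter(inside).count()
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()

Saving this as my_pi.py and submitting it with any of the preceding commands prints an approximation of π.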

Similarly, we can launch the interactive shell by replacing spark-submit with pyspark. Note that the shell takes no script argument:

./bin/pyspark --master local[2]
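Once the shell is up, a SparkContext (sc) and a SparkSession (spark) are already created for us, bound to the master we chose. As a quick sanity check, a toy session might look like this (the sum of the integers 0 through 99 is 4,950):

>>> rdd = sc.parallelize(range(100))
>>> rdd.sum()
4950
>>> sc.master
'local[2]'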

As for cluster mode, Spark (as of version 2.3.2) currently supports the following approaches:

  • Standalone: This is the simplest way to launch a Spark application over a cluster. It uses the cluster manager that ships with Spark itself: we start a master process and one or more workers, which can all sit on a single machine or be spread across several. Details of how to launch a Spark application in standalone cluster mode can be found at the following link: https://spark.apache.org/docs/latest/spark-standalone.html (see the example commands after this list).
  • Apache Mesos: As a centralized and fault-tolerant cluster manager, Mesos is designed for managing distributed computing environments. In Spark, when a driver submits tasks for scheduling, Mesos determines which machines handle which tasks. Refer to https://spark.apache.org/docs/latest/running-on-mesos.html for further details.
  • Apache Hadoop YARN: The task scheduler in this approach becomes YARN, as opposed to Mesos in the previous one. YARN, which is short for Yet Another Resource Negotiator, is the resource manager in Hadoop. With YARN, Spark can be integrated into the Hadoop ecosystem (such as MapReduce, Hive, and HDFS) more easily. For more information, please go to the following link: https://spark.apache.org/docs/latest/running-on-yarn.html.
  • Kubernetes: This is an open-source system providing container-centric infrastructure. It helps automate job deployment and management, and has gained popularity in recent years. Spark on Kubernetes is still relatively new but, if you are interested, feel free to read more at the following link: https://spark.apache.org/docs/latest/running-on-kubernetes.html.
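To target one of these cluster managers, we point --master at the cluster instead of at local. As an illustrative sketch, assuming a standalone master listening on its default port 7077 (replace <master-host> with your own hostname) or a YARN cluster whose configuration is visible to Spark, the submissions would look like this:

./bin/spark-submit --master spark://<master-host>:7077 examples/src/main/python/pi.py

./bin/spark-submit --master yarn --deploy-mode cluster examples/src/main/python/pi.py

With --deploy-mode cluster, the driver program itself runs inside the cluster; the default, --deploy-mode client, keeps the driver on the machine we submit from.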