Spark standalone

Spark standalone uses a built-in scheduler, without depending on any external scheduler such as YARN or Mesos. To install Spark in standalone mode, you copy the Spark binary package onto every machine in the cluster.

In standalone mode, the client interacts with the cluster either through spark-submit or the Spark shell. In either case, the Driver communicates with the Spark Master node to obtain the Worker nodes, where executors can be started for the application.

Multiple clients interacting with the cluster create their own executors on the Worker nodes, and each client has its own Driver component.
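To make this flow concrete, the following is a minimal sketch of a client application whose Driver registers with a standalone Master, which then launches executors for it on the Worker nodes. The master URL, application name, and resource values are placeholders, not part of this setup:

    import org.apache.spark.sql.SparkSession

    // Hypothetical client application: its Driver registers with the standalone
    // Master, which launches executors for it on the Worker nodes.
    object StandaloneClientSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("StandaloneClientSketch")      // placeholder application name
          .master("spark://master-host:7077")     // placeholder Master URL
          .config("spark.executor.memory", "2g")  // illustrative executor memory
          .config("spark.cores.max", "4")         // illustrative core cap (standalone)
          .getOrCreate()

        // A trivial computation, just to exercise the executors
        println(spark.range(0, 1000).count())

        spark.stop()
      }
    }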

The following figure shows the standalone deployment of Spark, using the Master node and Worker nodes:

Let's now download and install Spark in standalone mode on a Linux or Mac machine:

  1. Download Apache Spark from the link http://spark.apache.org/downloads.html.
  2. Extract the package in your local directory:
      tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz
  3. Change directory to the newly created directory:
      cd spark-2.2.0-bin-hadoop2.7
  4. Set the JAVA_HOME and SPARK_HOME environment variables by implementing the following steps:
    1. JAVA_HOME should point to where you have Java installed. On my Mac terminal, this is set as:
            export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home/
    2. SPARK_HOME should point to the newly extracted folder. On my Mac terminal, this is set as:
            export SPARK_HOME=/Users/myuser/spark-2.2.0-bin-hadoop2.7
  5. Run the Spark shell to see if this works. If it does not work, check the JAVA_HOME and SPARK_HOME environment variables:
      ./bin/spark-shell
  6. You will now see the shell as shown in the following screenshot:

  7. You will see the Scala prompt of the Spark shell at the end, and you are now ready to interact with the Spark cluster, as the short example after this list shows:
      scala>
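As a quick sanity check, you can type a couple of lines at the scala> prompt. This is just an illustration, using the sc (SparkContext) that spark-shell pre-creates for you:

      scala> val numbers = sc.parallelize(1 to 100)
      scala> println(numbers.sum())   // prints 5050.0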

We now have a Spark shell connected to an automatically set up local cluster running Spark. This is the quickest way to launch Spark on a local machine. However, you can still control the workers/executors, as well as connect to any cluster (standalone/YARN/Mesos). This is the power of Spark, enabling you to quickly move from interactive testing to testing on a cluster, and subsequently to deploying your jobs on a large cluster. This seamless integration offers a lot of benefits, which you cannot realize using Hadoop and other technologies.

You can refer to the official documentation at http://spark.apache.org/docs/latest/ in case you want to understand all the settings.
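If you just want to see which settings your running shell has actually picked up, a one-liner like the following, typed at the scala> prompt, prints them all; nothing in it is specific to this setup:

      scala> sc.getConf.getAll.foreach { case (key, value) => println(s"$key = $value") }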

There are several ways to start the Spark shell, as shown in the following snippets; we will see more options in a later section, which covers the Spark shell in more detail. Once the shell is up, you can confirm which master it connected to, as the snippet after this list shows:

  • Default shell on local machine automatically picks local machine as master:
    ./bin/spark-shell
  • Default shell on local machine specifying local machine as master with n threads:
    ./bin/spark-shell --master local[n]
  • Default shell on local machine connecting to a specified Spark master:
    ./bin/spark-shell --master spark://<IP>:<Port>
  • Default shell on local machine connecting to a YARN cluster using client mode:
    ./bin/spark-shell --master yarn --deploy-mode client
  • Submitting an application to a YARN cluster using cluster mode (note that the interactive shell runs only in client mode, so cluster mode is used with spark-submit rather than spark-shell):
    ./bin/spark-submit --master yarn --deploy-mode cluster <application-jar>
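Whichever way you start the shell, you can confirm from inside it which master it actually connected to. These are standard SparkContext fields, and the printed values will differ depending on how you launched the shell:

      scala> println(sc.master)              // e.g. local[*] or spark://<IP>:<Port>
      scala> println(sc.appName)             // Spark shell
      scala> println(sc.defaultParallelism)  // number of cores available to this shell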

The Spark Driver also has a Web UI, which helps you understand everything about the Spark cluster: the running executors, the jobs and tasks, environment variables, and the cache. Its most important use, of course, is monitoring jobs.
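The Jobs tab is only interesting once something has run, so before opening the UI you can trigger a small job from the shell. The numbers here are arbitrary, and spark is the SparkSession that spark-shell pre-creates:

      scala> spark.range(0, 1000000).selectExpr("sum(id)").show()

Refresh the Jobs tab after it completes and the query shows up as a finished job.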

Launch the Web UI for the local Spark cluster at http://127.0.0.1:4040/jobs/

The following is the Jobs tab in the Web UI:

The following is the Executors tab, showing all the executors of the cluster:
