Running the Spark Job on YARN (local)

Hadoop has a long history, and in most cases, organizations have already invested in Hadoop infrastructure before they move their MR jobs to Spark. Unlike the Spark standalone cluster manager, which can run only Spark jobs, and Mesos, which can run a variety of applications, YARN treats Hadoop jobs as first-class citizens while being able to run Spark jobs as well. This means that when a team decides to replace some of its MR jobs with Spark jobs, it can use the same cluster manager to run them. In this recipe, we'll see how to deploy our Spark application on the YARN cluster manager.

How to do it...

Running a Spark job on YARN is very similar to running it against a Spark standalone cluster. It involves the following steps:

  1. Installing the Hadoop cluster.
  2. Starting HDFS and YARN.
  3. Pushing the Spark assembly and dataset to HDFS.
  4. Running the Spark job in yarn-client mode.
  5. Running the Spark job in yarn-cluster mode.

Installing the Hadoop cluster

While the setup of the cluster itself is beyond the scope of this recipe, for the sake of completeness, let's quickly look at the relevant *-site.xml configurations used while setting up a single-node pseudo-distributed cluster on a local machine. Refer to http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php for the complete details on how to set up a local YARN/HDFS cluster:

The core-site.xml file:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

The mapred-site.xml file:

<configuration>
   <property>
       <name>mapred.job.tracker</name>
       <value>localhost:54311</value>
   </property>
</configuration>

The hdfs-site.xml file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
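
Depending on the Hadoop 2.x distribution, a minimal yarn-site.xml is also typically required so that the NodeManager exposes the MapReduce shuffle service. A sketch of the commonly used entry (adjust to your installation):

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>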

Starting HDFS and YARN

Once the setup of the cluster is done, let's format HDFS and start the cluster (dfs and yarn):

Format namenode:

hdfs namenode -format

Start both HDFS and YARN:

sbin/start-all.sh
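
Note that start-all.sh is deprecated in Hadoop 2.x; the equivalent is to start the two services separately:

sbin/start-dfs.sh
sbin/start-yarn.sh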

Let's confirm that the services are running through jps, and this is what we should see:

[Screenshot: jps output listing the HDFS and YARN daemons]
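
On a healthy single-node setup, the jps listing typically includes the following daemons (the process IDs preceding each name will differ from machine to machine):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps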

Pushing Spark assembly and dataset to HDFS

Ideally, when we run spark-submit, YARN should be able to pick up our spark-assembly JAR (or uber JAR) and upload it to HDFS. However, this doesn't happen correctly and results in the following error:

Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

In order to work around this issue, let's upload our spark-assembly JAR manually to HDFS and change our conf/spark-env.sh to reflect the location. The Hadoop config directory should also be specified in spark-env.sh:

Uploading the Spark assembly to HDFS:

hadoop fs -mkdir /sparkbinary
hadoop fs -put /Users/Gabriel/Apps/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar /sparkbinary/
hadoop fs -ls /sparkbinary

Uploading the Spam dataset to HDFS:

hadoop fs -mkdir /scalada
hadoop fs -put ~/SMSSpamCollection /scalada/
hadoop fs -ls /scalada
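
As a quick, optional check that the file landed where we expect it, we can print its last kilobyte (hadoop fs -tail writes the last 1 KB of a file to stdout):

hadoop fs -tail /scalada/SMSSpamCollection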

Entries in spark-env.sh:

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
SPARK_EXECUTOR_URI=hdfs://localhost:9000/sparkbinary/spark-assembly-1.4.1-hadoop2.6.0.jar

Before we submit our Spark job to the YARN cluster, let's confirm that our setup is fine using the Spark shell. The Spark shell is a wrapper around the Scala REPL, with the Spark libraries set in the classpath. Configuring HADOOP_CONF_DIR to point to the Hadoop config directory ensures that Spark will now use YARN to run its jobs. There are two modes in which we can run a Spark job on YARN, namely yarn-client and yarn-cluster, and we'll explore both of them in this recipe. But before we do that, to validate our configuration, we'll launch the Spark shell with the master set to yarn-client. After a rain of logs, we should be able to see a Scala prompt, which confirms that our configuration is good:

bin/spark-shell --master yarn-client
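
As an additional sanity check from the prompt (a minimal sketch, assuming the dataset has been uploaded to /scalada as shown earlier), we can confirm which master the shell is attached to and read the dataset through the default filesystem configured in core-site.xml:

// Should print the master the shell was launched with, that is, yarn-client
sc.master

// Read the spam dataset from HDFS (the path resolves against fs.default.name) and peek at a few lines
val spam = sc.textFile("/scalada/SMSSpamCollection")
spam.take(2).foreach(println)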

Running a Spark job in yarn-client mode

Now that we have confirmed that the shell loads up fine against the YARN master, let's move on to deploying our Spark job on YARN.

As we discussed earlier, there are two modes in which we can run a Spark application on YARN: the yarn-client mode and the yarn-cluster mode. In yarn-client mode, the driver program resides on the client side and the YARN worker nodes are used only to execute the job. All of the brains of the application reside in the client JVM, which polls the ApplicationMaster for the status. The ApplicationMaster does little more than watch for failures of the executor nodes, report them, and request resources from the ResourceManager accordingly. This also means that the client (our driver JVM) needs to keep running for as long as the application executes:

./bin/spark-submit \
  --class com.packt.scalada.learning.BinaryClassificationSpamYarn \
  --master yarn-client \
  --executor-memory 1G \
  ~/scalada-learning-assembly.jar

As we see from the YARN console, our job is running fine. Here is a screenshot that shows this:

[Screenshot: YARN console showing the application running]

Finally, we can see the output on the client JVM (the driver) itself:

[Screenshot: the job's output printed in the client (driver) JVM]

Running a Spark job in yarn-cluster mode

In yarn-cluster mode, the client JVM does very little: it merely submits the application and then polls the ApplicationMaster for its status. The driver program itself runs inside the ApplicationMaster, which now holds all the brains of the program. Unlike in yarn-client mode, the user logs won't be displayed on the client JVM, because the driver, which consolidates the results, executes inside the YARN cluster:

./bin/spark-submit \
  --class com.packt.scalada.learning.BinaryClassificationSpamYarn \
  --master yarn-cluster \
  --executor-memory 1g \
  ~/scalada-learning-assembly.jar

As expected, the client JVM indicates that the job has run successfully. It doesn't, however, show the user logs.

[Screenshot: client JVM output indicating successful completion, with no user logs]

The following screenshot shows the final status of our client and the cluster mode runs:

[Screenshot: YARN console showing the final status of the yarn-client and yarn-cluster runs]

The actual output of this program is inside the Hadoop user logs. We can either go to Hadoop's logs directory, or view them from the Hadoop console itself by clicking on the application link and then on the logs link in the console.
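
If log aggregation is enabled on the cluster (yarn.log-aggregation-enable), the same logs can also be fetched from the command line by passing the application ID shown in the YARN console (a sketch; substitute your own application ID):

yarn logs -applicationId <applicationId>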

As you can see in the following screenshot, the stdout file shows our embarrassing println commands:

[Screenshot: the stdout log showing the job's println output]

In this chapter, we took an example Spark application and deployed it on a Spark standalone cluster manager, YARN, and Mesos. Along the way, we touched upon the internals of these cluster managers.
