Advanced job submissions in a YARN cluster

If you opt for the more advanced way of submitting Spark jobs to your YARN cluster, you can specify additional parameters. For example, if you want to enable dynamic resource allocation, set the spark.dynamicAllocation.enabled parameter to true. To do so, you also need to specify spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors, as shown in the following example. Dynamic allocation also relies on the external shuffle service, which you enable by setting spark.shuffle.service.enabled to true. Finally, you can specify how many executor instances should run using the spark.executor.instances parameter.

Now, to make the preceding discussion more concrete, you can refer to the following submission command:

$SPARK_HOME/bin/spark-submit \
--class "com.chapter13.Clustering.KMeansDemo" \
--master yarn \
--deploy-mode cluster \
--driver-memory 16g \
--executor-memory 4g \
--executor-cores 4 \
--queue the_queue \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=4 \
--conf spark.dynamicAllocation.initialExecutors=4 \
--conf spark.executor.instances=4 \
KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
Saratoga_NY_Homes.txt
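If you prefer to keep such settings in code rather than on the command line, the dynamic allocation properties can also be set when the SparkSession is built. The following is a minimal sketch, assuming the Spark 2.x SparkSession API; the object name KMeansDemoConf is illustrative, and settings such as the master, deploy mode, and driver memory still have to be supplied through spark-submit:

import org.apache.spark.sql.SparkSession

object KMeansDemoConf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KMeansDemo")
      // Dynamic allocation and its bounds, equivalent to the --conf flags above
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "4")
      .config("spark.dynamicAllocation.initialExecutors", "4")
      .getOrCreate()

    // ... the K-means clustering code goes here ...

    spark.stop()
  }
}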

However, the behavior that results from the preceding job submission script is complex and sometimes nondeterministic. In my experience, if you increase both the number of partitions from code (see the sketch that follows) and the number of executors, the application finishes faster, which is what you would expect. But if you increase only the executor cores, the finish time stays roughly the same, even though you might expect it to drop below the initial time. Second, if you launch the preceding job twice, you might expect both jobs to finish in, say, 60 seconds, but that also might not happen; often, both jobs finish after 120 seconds instead. This seems a bit strange, but the following explanation will help you understand the scenario.
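As a reference for what "increasing the number of partitions from code" means, here is a minimal Scala sketch. The input file name is taken from the submission command above, and the partition count of 16 is purely illustrative:

import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionDemo")
      .getOrCreate()

    // Read the input with an explicit minimum number of partitions ...
    val data = spark.sparkContext.textFile("Saratoga_NY_Homes.txt", 16)

    // ... or repartition an existing RDD so that more tasks can run
    // in parallel across the allocated executor cores.
    val repartitioned = data.repartition(16)

    println(s"Number of partitions: ${repartitioned.getNumPartitions}")
    spark.stop()
  }
}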

Suppose you have 16 cores and 8 GB of memory on your machine. Now, if you use four executors with one core each, what will happen? Well, when you request an executor, Spark reserves it from YARN, and YARN allocates a container with the requested number of cores (one, in our case) and the required memory. However, the memory actually allocated is more than what you asked for, because YARN adds an off-heap overhead on top of the executor memory. If you ask for 1 GB, it will in fact allocate almost 1.5 GB, with roughly 500 MB of overhead. In addition, it will allocate a container for the driver, with probably 1,024 MB (that is, 1 GB) of memory usage.
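To see roughly where the "almost 1.5 GB" comes from, here is a back-of-the-envelope calculation as a Scala sketch. It assumes the default overhead used by Spark on YARN, max(384 MB, 10% of the executor memory), configurable via spark.yarn.executor.memoryOverhead, and it assumes that YARN rounds the container size up to a 512 MB allocation increment; the actual increment depends on your scheduler configuration:

object ContainerSizeEstimate {
  def main(args: Array[String]): Unit = {
    val executorMemoryMb = 1024                      // what we ask for: 1 GB
    // Default overhead: max(384 MB, 10% of the executor memory)
    val overheadMb = math.max(384, (executorMemoryMb * 0.10).toInt)
    val requestedMb = executorMemoryMb + overheadMb  // 1024 + 384 = 1408 MB

    // Assumption: YARN rounds the request up to a multiple of a 512 MB
    // allocation increment (governed by the scheduler configuration).
    val incrementMb = 512
    val containerMb =
      math.ceil(requestedMb.toDouble / incrementMb).toInt * incrementMb

    println(s"Requested: $requestedMb MB, allocated container: $containerMb MB")
    // Prints ~1536 MB, that is, about 1.5 GB per executor container
  }
}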

Sometimes, it doesn't matter how much memory your Spark job needs, but how much it reserves. In the preceding example, each executor will not take just the 50 MB needed for the test data, but around 1.5 GB (including the overhead). We will discuss how to configure a Spark cluster on AWS later in this chapter.
