Level of parallelism

Although you can control the number of map tasks through an optional parameter to SparkContext.textFile, Spark sets it automatically for each file according to its size. For distributed reduce operations such as groupByKey and reduceByKey, Spark uses the number of partitions of the largest parent RDD. A common mistake, however, is not using the full computing resources of the nodes in the cluster: unless you set the level of parallelism for your Spark job explicitly, the available cores may not be fully exploited. Therefore, you should pass the desired level of parallelism as the second argument to these operations.
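For example, here is a minimal Scala sketch of both options; the file path and the partition count of 64 are illustrative assumptions, not values from the text:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ParallelismExample")
val sc = new SparkContext(conf)

// textFile accepts an optional minimum number of partitions as its second argument.
val lines = sc.textFile("hdfs:///data/input.txt", 64)

// Distributed reduce operations also accept an explicit partition count,
// overriding the default inherited from the largest parent RDD.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 64)   // run 64 reduce tasks instead of the inherited default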

Alternatively, you can change the default by setting the configuration property spark.default.parallelism. For operations such as parallelize, which have no parent RDDs, the level of parallelism depends on the cluster manager, that is, standalone, Mesos, or YARN. In local mode, it equals the number of cores on the local machine; in Mesos fine-grained mode, it is 8; otherwise, it is the total number of cores on all executor nodes or 2, whichever is larger. In general, 2-3 tasks per CPU core in your cluster is recommended.
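The following sketch shows how the property can be set programmatically; the value 16 is only an assumption for a cluster with roughly 8 cores in total (2 tasks per core, as recommended above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DefaultParallelismExample")
  .set("spark.default.parallelism", "16")   // assumed value for illustration
val sc = new SparkContext(conf)

// parallelize has no parent RDD, so it falls back to spark.default.parallelism
// when no explicit partition count is given.
val rdd = sc.parallelize(1 to 1000000)
println(rdd.getNumPartitions)   // prints 16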
