Level of parallelism

Although you can control the number of map tasks through an optional parameter to SparkContext.textFile, Spark sets it automatically for each file according to its size. For distributed reduce operations such as groupByKey and reduceByKey, Spark uses the number of partitions of the largest parent RDD. A common mistake, however, is not using the full computing resources of the nodes in the cluster: unless you set the level of parallelism for your Spark job explicitly, the available cores may not be fully exploited. Therefore, you should pass the desired level of parallelism as the second argument to these operations.
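For example, here is a minimal Scala sketch of both options; the file path and the partition count of 64 are illustrative assumptions, not values from the text:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ParallelismExample")
val sc = new SparkContext(conf)

// textFile accepts an optional minimum number of partitions as its second argument.
val lines = sc.textFile("hdfs:///data/input.txt", 64)

// Distributed reduce operations also accept an explicit partition count,
// overriding the default inherited from the largest parent RDD.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 64)   // run 64 reduce tasks instead of the inherited default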

Alternatively, you can change the default by setting the configuration property spark.default.parallelism. For operations such as parallelize, which have no parent RDDs, the level of parallelism depends on the cluster manager, that is, standalone, Mesos, or YARN. In local mode, it equals the number of cores on the local machine; in Mesos fine-grained mode, it is 8; otherwise, it is the total number of cores on all executor nodes or 2, whichever is larger. In general, 2-3 tasks per CPU core in your cluster is recommended.
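The following sketch shows how the property can be set programmatically; the value 16 is only an assumption for a cluster with roughly 8 cores in total (2 tasks per core, as recommended above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DefaultParallelismExample")
  .set("spark.default.parallelism", "16")   // assumed value for illustration
val sc = new SparkContext(conf)

// parallelize has no parent RDD, so it falls back to spark.default.parallelism
// when no explicit partition count is given.
val rdd = sc.parallelize(1 to 1000000)
println(rdd.getNumPartitions)   // prints 16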
