Data locality

Data locality refers to how close the data is to the code that processes it. Data locality can have a significant impact on the performance of a Spark job, whether it runs locally or in cluster mode: if the data and the code that processes it stay together, computation tends to be much faster. It is usually much faster to ship the serialized code from the driver to an executor than to move a chunk of data to the code, because the code is far smaller than the data.

In Spark application development and job execution, there are several levels of locality. In order from closest to farthest, the level depends on where the data currently sits relative to the code that has to process it:

Data locality   | Meaning                                                                | Special notes
PROCESS_LOCAL   | Data and code are in the same JVM process                              | Best possible locality
NODE_LOCAL      | Data and code are on the same node, for example, data stored in HDFS on that node | A bit slower than PROCESS_LOCAL because the data has to travel between processes on the same node
NO_PREF         | Data is accessed equally quickly from anywhere                         | No locality preference
RACK_LOCAL      | Data is on a different server in the same rack                         | The data has to be sent over the network, typically through a single switch
ANY             | Data is elsewhere on the network and not in the same rack              | The slowest level; used only when no other option is available
Table 2: Data locality and Spark
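
To see which locality levels can even apply to a job, you can ask an RDD where it would prefer each of its partitions to run. The following Scala sketch (assuming a cluster with HDFS and a hypothetical file path) prints the preferred locations that the scheduler consults when it tries to achieve NODE_LOCAL or better placement:

```scala
import org.apache.spark.sql.SparkSession

object LocalityInspector {
  def main(args: Array[String]): Unit = {
    // A local SparkSession is enough to run the sketch; on a real cluster
    // the preferred locations would be HDFS DataNode hostnames.
    val spark = SparkSession.builder()
      .appName("LocalityInspector")
      .master("local[*]")            // remove when submitting to a cluster
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path -- replace with a file that exists in your HDFS.
    val rdd = sc.textFile("hdfs:///data/events.log")

    // For each partition, ask where the scheduler would prefer to run the
    // corresponding task; these hostnames are what make NODE_LOCAL possible.
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p)
      println(s"partition ${p.index} -> " +
        (if (hosts.isEmpty) "NO_PREF" else hosts.mkString(", ")))
    }

    spark.stop()
  }
}
```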

Spark prefers to schedule all tasks at the best possible locality level, but this is not guaranteed and not always possible. When the computing resources at the preferred level are busy, Spark falls back to lower locality levels. In that situation, there are two options (the trade-off between them is tunable, as shown in the sketch after this list):

  • Wait until a busy CPU frees up so that a task can start on the same node as its data
  • Immediately start a task at a lower locality level, which requires moving the data there
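
Spark resolves this trade-off with a wait timeout: it waits briefly for a CPU at the current locality level to free up and, once the timeout expires, falls back to the next lower level. A minimal Scala sketch of how that timeout can be tuned is shown below; the durations are illustrative values, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: how long Spark waits at each locality level before
// falling back to the next one. The durations below are illustrative only.
val spark = SparkSession.builder()
  .appName("LocalityWaitTuning")
  // Default wait applied to every level unless overridden below.
  .config("spark.locality.wait", "3s")
  // Per-level overrides: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY.
  .config("spark.locality.wait.process", "2s")
  .config("spark.locality.wait.node", "3s")
  .config("spark.locality.wait.rack", "5s")
  .getOrCreate()

// Setting the waits to 0 makes Spark start tasks immediately at whatever
// level is available (the second option above); larger values favor waiting
// for better locality (the first option above).
```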