Data locality

Data locality refers to how close the data is to the code that processes it. Data locality can have a significant impact on the performance of a Spark job, whether it runs locally or in cluster mode: if the data and the code that processes it stay together, computation tends to be much faster. It is usually much faster to ship the serialized code from the driver to an executor than to move a chunk of data to the code, because the code is far smaller than the data.

In Spark application development and job execution, there are several levels of locality. In order from closest to farthest, the level depends on where the data currently sits relative to the code that has to process it:

Data locality   | Meaning                                                                | Special notes
PROCESS_LOCAL   | Data and code are in the same JVM process                              | Best possible locality
NODE_LOCAL      | Data and code are on the same node, for example, data stored in HDFS on that node | A bit slower than PROCESS_LOCAL because the data has to travel between processes on the same node
NO_PREF         | Data is accessed equally quickly from anywhere                         | No locality preference
RACK_LOCAL      | Data is on a different server in the same rack                         | The data has to be sent over the network, typically through a single switch
ANY             | Data is elsewhere on the network and not in the same rack              | The slowest level; used only when no other option is available
Table 2: Data locality and Spark
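
To see which locality levels can even apply to a job, you can ask an RDD where it would prefer each of its partitions to run. The following Scala sketch (assuming a cluster with HDFS and a hypothetical file path) prints the preferred locations that the scheduler consults when it tries to achieve NODE_LOCAL or better placement:

```scala
import org.apache.spark.sql.SparkSession

object LocalityInspector {
  def main(args: Array[String]): Unit = {
    // A local SparkSession is enough to run the sketch; on a real cluster
    // the preferred locations would be HDFS DataNode hostnames.
    val spark = SparkSession.builder()
      .appName("LocalityInspector")
      .master("local[*]")            // remove when submitting to a cluster
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path -- replace with a file that exists in your HDFS.
    val rdd = sc.textFile("hdfs:///data/events.log")

    // For each partition, ask where the scheduler would prefer to run the
    // corresponding task; these hostnames are what make NODE_LOCAL possible.
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p)
      println(s"partition ${p.index} -> " +
        (if (hosts.isEmpty) "NO_PREF" else hosts.mkString(", ")))
    }

    spark.stop()
  }
}
```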

Spark prefers to schedule all tasks at the best possible locality level, but this is not guaranteed and not always possible. When the computing resources at the preferred level are busy, Spark falls back to lower locality levels. In that situation, there are two options (the trade-off between them is tunable, as shown in the sketch after this list):

  • Wait until a busy CPU frees up so that a task can start on the same node as its data
  • Immediately start a task at a lower locality level, which requires moving the data there
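
Spark resolves this trade-off with a wait timeout: it waits briefly for a CPU at the current locality level to free up and, once the timeout expires, falls back to the next lower level. A minimal Scala sketch of how that timeout can be tuned is shown below; the durations are illustrative values, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: how long Spark waits at each locality level before
// falling back to the next one. The durations below are illustrative only.
val spark = SparkSession.builder()
  .appName("LocalityWaitTuning")
  // Default wait applied to every level unless overridden below.
  .config("spark.locality.wait", "3s")
  // Per-level overrides: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY.
  .config("spark.locality.wait.process", "2s")
  .config("spark.locality.wait.node", "3s")
  .config("spark.locality.wait.rack", "5s")
  .getOrCreate()

// Setting the waits to 0 makes Spark start tasks immediately at whatever
// level is available (the second option above); larger values favor waiting
// for better locality (the first option above).
```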