Slow jobs or unresponsiveness

If the SparkContext cannot connect to a Spark standalone master, the driver may display errors such as the following:

02/05/17 12:44:45 ERROR AppClient$ClientActor: All masters are unresponsive! Giving up. 
02/05/17 12:45:31 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
02/05/17 12:45:35 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Spark cluster looks down

At other times, the driver is able to connect to the master node, but the master is unable to communicate back to the driver. In that case, the master's log directory may record multiple connection attempts even though the driver reports that it could not connect.

Furthermore, you might often experience very slow performance and progress in your Spark jobs. This usually happens when the driver program is slow to compute and schedule your jobs. As discussed earlier, a particular stage may take longer than usual because it involves a shuffle, map, join, or aggregation operation. You may also experience these issues if the machine is running out of disk space or main memory. For example, if your master node does not respond, or the computing nodes are unresponsive for a certain period of time, you might think that your Spark job has halted and become stuck at a certain stage:

Figure 24: An example log for executor/driver unresponsiveness

There are several potential solutions, including the following:

  1. Check to make sure that workers and drivers are correctly configured to connect to the Spark master on the exact address listed in the Spark master web UI/logs. Then, explicitly supply the Spark cluster's master URL when starting your Spark shell:
      $ bin/spark-shell --master spark://master-ip:7077
  2. Set SPARK_LOCAL_IP to a cluster-addressable hostname for the driver, master, and worker processes (a programmatic sketch of pinning the master URL and the driver host follows this list).
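As a minimal sketch of these two points (the master URL, the driver hostname, and the ConnectivityCheck object are placeholders, not values from a real cluster), the same settings can also be pinned programmatically when building the SparkSession:

    import org.apache.spark.sql.SparkSession

    object ConnectivityCheck {
      def main(args: Array[String]): Unit = {
        // spark://master-ip:7077 and driver-hostname are placeholders; use the exact
        // values shown in your Spark master web UI/logs.
        val spark = SparkSession.builder()
          .appName("ConnectivityCheck")
          .master("spark://master-ip:7077")               // exact master URL
          .config("spark.driver.host", "driver-hostname") // cluster-addressable driver host
          .getOrCreate()

        // A trivial job to verify that driver, master, and workers can all talk to each other
        println(spark.sparkContext.parallelize(1 to 100).sum())

        spark.stop()
      }
    }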

Sometimes, we experience issues due to hardware failure. For example, if the filesystem on a computing node closes unexpectedly, that is, it throws an I/O exception, your Spark job will eventually fail too, because the job can no longer write its resulting RDDs or data to the local filesystem or HDFS. This also implies that the remaining DAG operations cannot be performed, due to the stage failures.

Sometimes, this I/O exception occurs due to an underlying disk failure or other hardware failures. This often produces logs such as the following:

Figure 25: An example filesystem closed

Nevertheless, you may often experience slow job performance because the Java GC is busy or cannot complete garbage collection quickly enough. For example, the following figure shows that for task 0, the GC took 10 hours to finish! I experienced this issue in 2014, when I was new to Spark. Control of these types of issues, however, is not really in our hands. Therefore, our recommendation is to free up the JVM (for example, by giving the executors enough memory or tuning the GC) and try submitting the job again.

Figure 26: An example where GC stalled in between
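If you suspect GC pressure, one option is to make GC activity visible in the executor logs and then check the GC time reported per task in the Spark web UI. The following is only a sketch; the flags are illustrative Java 8 options passed through Spark's spark.executor.extraJavaOptions setting, not tuned recommendations:

    import org.apache.spark.sql.SparkSession

    // A sketch only: the GC flags below are illustrative (Java 8 syntax), not tuned recommendations.
    val spark = SparkSession.builder()
      .appName("GcDiagnostics")
      // Surface GC activity in the executor logs so long pauses become visible
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      .getOrCreate()

With this in place, the "GC Time" column in the Spark web UI and the executor logs show whether garbage collection is what is dominating a slow task.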

The fourth factor behind slow responses or slow job performance could be the lack of data serialization; this will be discussed in the next section. The fifth factor could be a memory leak in your code, which tends to make the application consume more and more memory while leaving files or logical devices open. Therefore, make sure there is nothing in your code that could leak memory. For example, it is good practice to finish your Spark application by calling sc.stop() or spark.stop(); this ensures that the SparkContext is shut down cleanly rather than left open and active, otherwise you might get unwanted exceptions or issues. The sixth issue is keeping too many files open, which sometimes causes a FileNotFoundException in the shuffle or merge stage.
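As a small sketch of the sc.stop()/spark.stop() advice above (the application name and the job itself are just placeholders), wrapping the job in try/finally guarantees the context is shut down even when the job fails:

    import org.apache.spark.sql.SparkSession

    // Sketch: always stop the SparkSession/SparkContext, even if the job throws,
    // so no stale context or open files are left behind.
    val spark = SparkSession.builder().appName("CleanShutdown").getOrCreate()
    try {
      val result = spark.sparkContext.parallelize(1 to 1000).map(_ * 2).sum()
      println(s"Sum: $result")
    } finally {
      spark.stop() // releases the context and its resources
    }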
