Debugging Spark jobs running in local and standalone modes

While debugging your Spark application in local or standalone mode, you should know that debugging the driver program and debugging one of the executors are different tasks, since they require different parameters to be passed to spark-submit. Throughout this section, I'll use port 4000 as the debug address. For example, if you want to debug the driver program, you can add the following to your spark-submit command:

--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000

After that, you should point your remote debugger at the node from which you submitted the driver program. In the preceding case, port 4000 was specified. However, if something (that is, other Spark jobs, other applications or services, and so on) is already running on that port, you may need to change the port number.
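
Any JDWP-capable remote debugger will do here. As a minimal illustration, assuming the driver was submitted from the host ubuntu (the master host used later in this section), you could attach the jdb command-line debugger that ships with the JDK:

$ jdb -attach ubuntu:4000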

On the other hand, connecting to an executor is similar to the preceding option, except for the address option. More specifically, you will have to replace the address with your local machine's address (IP address or host name, plus the port number). Since the actual computation occurs on the Spark cluster, it is always good practice to first verify that the cluster can reach your local machine (a quick check is shown after the options below). For example, you can add the following options to your spark-submit command to enable the debugging environment:

--num-executors 1
--executor-cores 1
--conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:4000,suspend=n"

In summary, use the following command to submit your Spark jobs (the KMeansDemo application in this case):

$ $SPARK_HOME/bin/spark-submit \
  --class "com.chapter13.Clustering.KMeansDemo" \
  --master spark://ubuntu:7077 \
  --num-executors 1 \
  --executor-cores 1 \
  --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=host_name_to_your_computer.org:5005,suspend=n" \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000 \
  KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  Saratoga_NY_Homes.txt

Now, start your local debugger in listening mode so that the executor can attach to it, and then start your Spark program. Since the driver was launched with server=y and suspend=y, it will wait for your debugger to attach, and you will observe the following message on your terminal:

Listening for transport dt_socket at address: 4000 
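
For the executor side (launched with server=n), the debugger must already be listening before the executor starts. As a rough sketch using jdb, with the port matching the one given in spark.executor.extraJavaOptions:

$ jdb -listen 5005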

It is important to set the number of executors to 1 only; with multiple executors, they would all try to connect to your debugger and eventually create some weird problems. It is to be noted that sometimes setting SPARK_JAVA_OPTS helps in debugging Spark applications that are running in local or standalone mode. The command is as follows:

$ export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,address=4000,suspend=y,onuncaught=n

However, since Spark release 1.0.0, SPARK_JAVA_OPTS has been deprecated and replaced by spark-defaults.conf and command-line arguments to spark-submit or spark-shell. It is also to be noted that setting spark.driver.extraJavaOptions and spark.executor.extraJavaOptions, which we saw in the previous section, in spark-defaults.conf is not a replacement for SPARK_JAVA_OPTS. That said, SPARK_JAVA_OPTS still works pretty well, and you can try it as well.
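
For instance, the same debug agents can be configured once in $SPARK_HOME/conf/spark-defaults.conf instead of on every spark-submit invocation; a minimal sketch, with the host name and ports as placeholders as before:

spark.driver.extraJavaOptions   -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
spark.executor.extraJavaOptions -agentlib:jdwp=transport=dt_socket,server=n,address=host_name_to_your_computer.org:5005,suspend=n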
