Step 3: Running Spark jobs on the AWS cluster

Now your master and worker nodes are active and running, which means you can submit your Spark job to them for computation. Before that, however, you need to log in to the remote nodes over SSH. To do so, execute the following command to SSH into the remote Spark cluster:

$SPARK_HOME/spark-ec2 \
--key-pair=<name_of_the_key_pair> \
--identity-file=<path_of_the_key_pair> \
--region=<region> \
--zone=<zone> \
login <cluster-name>

In our case, it should look something like the following:

$SPARK_HOME/spark-ec2 \
--key-pair=my-key-pair \
--identity-file=/usr/local/key/aws-key-pair.pem \
--region=eu-west-1 \
--zone=eu-west-1a \
login ec2-spark-cluster-1
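
Alternatively, if you already know the public DNS name of the master instance, you can log in to it directly with plain SSH. The following is a minimal sketch; the hostname is the master instance used throughout this example:

$ ssh -i /usr/local/key/aws-key-pair.pem [email protected]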

Now copy your application, that is, the JAR file (or Python/R script), to the remote instance (ec2-52-48-119-121.eu-west-1.compute.amazonaws.com in our case) by executing the following command (in a new terminal):

$ scp -i /usr/local/key/aws-key-pair.pem /usr/local/code/KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar [email protected]:/home/ec2-user/

Then you need to copy your data (/usr/local/data/Saratoga_NY_Homes.txt, in our case) to the same remote instance by executing the following command:

$ scp -i /usr/local/key/aws-key-pair.pem /usr/local/data/Saratoga_NY_Homes.txt [email protected]:/home/ec2-user/

Note that if you have already configured HDFS on your remote machine and placed your code and data files there, you don't need to copy the JAR and data files to the slaves; the master will distribute them automatically.
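
If you do want to stage the files on HDFS yourself, the standard hdfs dfs commands can be used from the master node. The following is a minimal sketch; it assumes the hdfs binary is on the PATH of the remote machine and that the NameNode runs at localhost:9000, matching the URIs used later in this section:

$ hdfs dfs -put /home/ec2-user/KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar /
$ hdfs dfs -put /home/ec2-user/Saratoga_NY_Homes.txt /
$ hdfs dfs -ls /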

Well done! You are almost there. Now, finally, you have to submit your Spark job to be computed by the slaves or worker nodes. To do so, just execute the following command:

$SPARK_HOME/bin/spark-submit \
--class com.chapter13.Clustering.KMeansDemo \
--master spark://ec2-52-48-119-121.eu-west-1.compute.amazonaws.com:7077 \
file:///home/ec2-user/KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
file:///home/ec2-user/Saratoga_NY_Homes.txt

Note that if HDFS is not set up on your machine, keep your input file on the local filesystem and refer to it with the file:/// prefix, as shown in the preceding command.
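
Depending on the size of your cluster, you may also want to cap the resources the job consumes. The flags below are standard spark-submit options for standalone mode; the values shown are illustrative assumptions, not requirements:

$SPARK_HOME/bin/spark-submit \
--class com.chapter13.Clustering.KMeansDemo \
--master spark://ec2-52-48-119-121.eu-west-1.compute.amazonaws.com:7077 \
--executor-memory 1g \
--total-executor-cores 2 \
file:///home/ec2-user/KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
file:///home/ec2-user/Saratoga_NY_Homes.txt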

If you have already put your data on HDFS, you should issue the submit command as follows:

$SPARK_HOME/bin/spark-submit \
--class com.chapter13.Clustering.KMeansDemo \
--master spark://ec2-52-48-119-121.eu-west-1.compute.amazonaws.com:7077 \
hdfs://localhost:9000/KMeans-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
hdfs://localhost:9000/Saratoga_NY_Homes.txt

Upon successful completion of the job, you should see the status and related statistics of your job on the Spark web UI at port 8080 of the master node.
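
As a quick sanity check from the terminal, you can fetch the master's web UI with curl; port 8080 is the default for the standalone master UI, and the hostname is the master instance used throughout this example:

$ curl http://ec2-52-48-119-121.eu-west-1.compute.amazonaws.com:8080

While the job is still running, per-application details are also available on port 4040 of the driver node.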
