Up to the Spark 1.6.3 release, the Spark distribution shipped a shell script called spark-ec2 (under $SPARK_HOME/ec2) for launching a Spark cluster on EC2 instances from your local machine. It helps in launching, managing, and shutting down the Spark cluster that you will be using on AWS. However, since Spark 2.x, the script has been moved to AMPLab so that bugs can be fixed and the script itself maintained separately.
The script can be accessed and used from the GitHub repo at https://github.com/amplab/spark-ec2.
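One way to obtain the script is to clone that repository. The following is a minimal sketch; cloning into an ec2 directory under Spark home is an assumption to match the layout used in this chapter, and any local directory would work equally well:

```shell
# Sketch: fetch the AMPLab spark-ec2 script by cloning its GitHub repo
# into an ec2 directory under Spark home (target path is an assumption).
SPARK_HOME=${SPARK_HOME:-$HOME/spark}   # hypothetical default if unset
git clone https://github.com/amplab/spark-ec2.git "$SPARK_HOME/ec2"
```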
You also need to create an IAM instance profile for your Amazon EC2 instances (via the console). For details, refer to http://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-iam-instance-profile.html. For simplicity, let's download the script and place it in a directory named ec2 under Spark home ($SPARK_HOME/ec2). Once you execute the following command to launch a new cluster, it sets up Spark, HDFS, and other dependencies on the cluster automatically:
$ $SPARK_HOME/ec2/spark-ec2 \
--key-pair=<name_of_the_key_pair> \
--identity-file=<path_of_the_key_pair> \
--instance-type=<AWS_instance_type> \
--region=<region> --zone=<zone> \
--slaves=<number_of_slaves> \
--hadoop-major-version=<Hadoop_version> \
--spark-version=<spark_version> \
--instance-profile-name=<profile_name> \
launch <cluster-name>
We believe these parameters are self-explanatory; for more details, please refer to https://github.com/amplab/spark-ec2#readme.
Suppose that you have already created and configured an instance profile. Now you are ready to launch the EC2 cluster. For our case, it would be something like the following:
$ $SPARK_HOME/ec2/spark-ec2 \
--key-pair=aws_key_pair \
--identity-file=/usr/local/aws_key_pair.pem \
--instance-type=m3.2xlarge \
--region=eu-west-1 --zone=eu-west-1a --slaves=2 \
--hadoop-major-version=yarn \
--spark-version=2.1.0 \
--instance-profile-name=rezacsedu_aws \
launch ec2-spark-cluster-1
The following figure shows your Spark home on AWS:
After successful completion, the Spark cluster will be instantiated with two worker (slave) nodes on your EC2 account. This task, however, might take approximately half an hour, depending on your Internet speed and hardware configuration; so you might want to take a coffee break. Upon successful completion of the cluster setup, you will get the URL of the Spark cluster on the terminal. To make sure the cluster is really running, open http://<master-hostname>:8080 in your browser, where master-hostname is the URL you received on the terminal. If everything is okay, you will find your cluster running; see the cluster home in Figure 18.
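Once you are finished, remember to shut the cluster down so you are not billed for idle EC2 instances. A minimal sketch of the script's lifecycle subcommands (stop, start, destroy), assuming the script lives under $SPARK_HOME/ec2 and reusing the hypothetical cluster name and region from the example above:

```shell
# Path assumes the script was placed under $SPARK_HOME/ec2 as described earlier.
SPARK_EC2="$SPARK_HOME/ec2/spark-ec2"

# Pause the cluster (instances are stopped, not terminated, and can be resumed):
"$SPARK_EC2" --region=eu-west-1 stop ec2-spark-cluster-1

# Resume a previously stopped cluster:
"$SPARK_EC2" --region=eu-west-1 \
  --identity-file=/usr/local/aws_key_pair.pem --key-pair=aws_key_pair \
  start ec2-spark-cluster-1

# Permanently terminate all instances when you are done, to avoid further charges:
"$SPARK_EC2" --region=eu-west-1 destroy ec2-spark-cluster-1
```

Note that destroy is irreversible; stop/start is the safer choice while you are still experimenting.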