Step 2: Configuring Spark cluster on EC2

Up to the Spark 1.6.3 release, the Spark distribution (that is, $SPARK_HOME/ec2) provided a shell script called spark-ec2 for launching a Spark cluster on EC2 instances from your local machine. It helps in launching, managing, and shutting down the Spark cluster that you will be using on AWS. However, since Spark 2.x, the same script has been moved to AMPLab so that it is easier to fix bugs and maintain the script itself separately.

The script can be accessed and used from the GitHub repo at https://github.com/amplab/spark-ec2.

Starting and using a cluster on AWS will cost money. Therefore, it is always a good practice to stop or destroy a cluster when the computation is done; otherwise, it will incur additional costs. For more about AWS pricing, please refer to https://aws.amazon.com/ec2/pricing/.
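Both operations can be performed with the same spark-ec2 script. As a minimal sketch, assuming a cluster named ec2-spark-cluster-1 launched in the eu-west-1 region (the names used in the example later in this section):

# Stop the cluster (instances are stopped but not terminated)
$SPARK_HOME/ec2/spark-ec2 --region=eu-west-1 stop ec2-spark-cluster-1

# Destroy the cluster permanently (all instances are terminated)
$SPARK_HOME/ec2/spark-ec2 --region=eu-west-1 destroy ec2-spark-cluster-1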

You also need to create an IAM instance profile for your Amazon EC2 instances (via the console); for details, refer to http://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-iam-instance-profile.html (a command-line sketch follows the parameter reference below). For simplicity, let's download the script and place it under an ec2 directory in Spark home ($SPARK_HOME/ec2). Once you execute the following command to launch a new instance, it sets up Spark, HDFS, and other dependencies on the cluster automatically:

$SPARK_HOME/ec2/spark-ec2 \
--key-pair=<name_of_the_key_pair> \
--identity-file=<path_of_the_key_pair> \
--instance-type=<AWS_instance_type> \
--region=<region> --zone=<zone> \
--slaves=<number_of_slaves> \
--hadoop-major-version=<Hadoop_version> \
--spark-version=<spark_version> \
--instance-profile-name=<profile_name> \
launch <cluster-name>

We believe that these parameters are self-explanatory. Alternatively, for more details, please refer to https://github.com/amplab/spark-ec2#readme.
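If you prefer the command line over the console for creating the instance profile mentioned earlier, the following is a minimal sketch using the AWS CLI. The role name (spark-ec2-role), the profile name (rezacsedu_aws), and the trust policy file (ec2-trust-policy.json, which must allow ec2.amazonaws.com to assume the role) are illustrative assumptions:

# Create a role that EC2 instances are allowed to assume
aws iam create-role --role-name spark-ec2-role --assume-role-policy-document file://ec2-trust-policy.json

# Create the instance profile and attach the role to it
aws iam create-instance-profile --instance-profile-name rezacsedu_aws
aws iam add-role-to-instance-profile --instance-profile-name rezacsedu_aws --role-name spark-ec2-role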

If you already have a Hadoop cluster and want to deploy Spark on it: if you are using Hadoop YARN (or even Apache Mesos), running a Spark job is relatively easy. Even if you use neither, Spark can run in standalone mode. Spark runs a driver program, which, in turn, invokes Spark executors. This means that you need to tell Spark the nodes where you want your Spark daemons to run (in terms of master/slaves). In your spark/conf directory, you can see a file called slaves. Update it to mention all the machines you want to use. You can set up Spark from source or use a binary from the website. You should always use Fully Qualified Domain Names (FQDN) for all your nodes, and make sure that each of those machines is accessible via passwordless SSH from your master node, as sketched below.
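As a minimal sketch (the worker hostnames are illustrative), the conf/slaves file simply lists one worker FQDN per line:

# $SPARK_HOME/conf/slaves -- one worker FQDN per line
worker1.example.com
worker2.example.com

Passwordless SSH from the master to each worker can then be set up with standard tooling:

# On the master node: generate a key once, then copy it to every worker
ssh-keygen -t rsa -P ""
ssh-copy-id worker1.example.com
ssh-copy-id worker2.example.com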

Suppose that you have already created and configured an instance profile. Now you are ready to launch the EC2 cluster. For our case, it would be something like the following:

$SPARK_HOME/ec2/spark-ec2 \
--key-pair=aws_key_pair \
--identity-file=/usr/local/aws_key_pair.pem \
--instance-type=m3.2xlarge \
--region=eu-west-1 --zone=eu-west-1a --slaves=2 \
--hadoop-major-version=yarn \
--spark-version=2.1.0 \
--instance-profile-name=rezacsedu_aws \
launch ec2-spark-cluster-1

The following figure shows your Spark home on AWS:

Figure 18: Cluster home on AWS

After successful completion, the Spark cluster will be instantiated with two worker (slave) nodes on your EC2 account. This task, however, might take approximately half an hour, depending on your Internet speed and hardware configuration, so it is a good time for a coffee break. Upon successful completion of the cluster setup, you will get the URL of the Spark cluster on the terminal. To make sure that the cluster is really running, check http://<master-hostname>:8080 in your browser, where master-hostname is the URL you receive on the terminal. If everything is okay, you will find your cluster running; see the cluster home in Figure 18.
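Alternatively, you can verify the cluster from the command line by logging in to the master node with the script's login action; the key pair, region, and cluster name below match the illustrative launch command used earlier, and the curl check simply confirms that the master web UI answers on port 8080:

# Log in to the master node of the running cluster over SSH
$SPARK_HOME/ec2/spark-ec2 --key-pair=aws_key_pair --identity-file=/usr/local/aws_key_pair.pem --region=eu-west-1 login ec2-spark-cluster-1

# Once logged in, check that the master web UI responds
curl -I http://localhost:8080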
