Running Spark on EC2

The ec2 directory contains a number of handy scripts for running Spark on EC2. These scripts can be used to launch multiple Spark clusters and can even run on spot instances. Spark can also be run on Elastic MapReduce (EMR), Amazon's managed solution for MapReduce cluster management, which gives you more flexibility around scaling instances.

Running Spark on EC2 with the scripts

To get started, make sure that you have EC2 enabled on your account by signing up for it at https://portal.aws.amazon.com/gp/aws/manageYourAccount. It is a good idea to generate a separate access key pair for your Spark cluster, which you can do at https://portal.aws.amazon.com/gp/aws/securityCredentials. You will also need to create an EC2 key pair so that the Spark script can SSH to the launched machines; this can be done at https://console.aws.amazon.com/ec2/home by selecting Key Pairs under Network & Security. Remember that key pairs are created per region, so you need to create your key pair in the same region in which you intend to run your Spark instances. Make sure to give it a name that you can remember (we will use spark-keypair as the example key pair name in this chapter), as you will need it for the scripts. You can also choose to upload your public SSH key instead of generating a new one. These keys are sensitive, so make sure that you keep them private. You also need to set AWS_ACCESS_KEY and AWS_SECRET_KEY as environment variables for the Amazon EC2 scripts:

chmod 400 spark-keypair.pem
export AWS_ACCESS_KEY="..."
export AWS_SECRET_KEY="..."

You will find it useful to download the EC2 scripts provided by Amazon from http://aws.amazon.com/developertools/Amazon-EC2/351. Once you unzip the resulting ZIP file, you can add the bin folder to your PATH variable in a similar manner to what you did with the Spark bin folder:

wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
unzip ec2-api-tools.zip
cd ec2-api-tools-*
export EC2_HOME=`pwd`
export PATH=$PATH:`pwd`/bin

The Spark EC2 script automatically creates a separate security group and firewall rules for the running Spark cluster. By default, your Spark cluster will be universally accessible on port 8080, which is poor form. Sadly, the spark_ec2.py script does not currently provide an easy way to restrict access to just your host. If you have a static IP address, I strongly recommend limiting access in spark_ec2.py; simply replace all instances of 0.0.0.0/0 with [yourip]/32. This will not affect intra-cluster communication, as all machines within a security group can talk to one another by default.
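As a sketch of one way to do this replacement in bulk, assuming for illustration that your static IP address is 203.0.113.5 and that the script lives at ec2/spark_ec2.py in your Spark checkout (the -i.bak option keeps a backup copy you can restore from):

sed -i.bak 's|0\.0\.0\.0/0|203.0.113.5/32|g' ec2/spark_ec2.py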

Next, try to launch a cluster on EC2:

./ec2/spark-ec2 -k spark-keypair -i pk-[....].pem -s 1 launch myfirstcluster

Note

If you get an error message such as "The requested Availability Zone is currently constrained and....", you can specify a different zone by passing in the --zone flag.
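For example (the zone name here is only an illustration; pick any zone that is available in your region):

./ec2/spark-ec2 -k spark-keypair -i pk-[....].pem -s 1 --zone us-east-1b launch myfirstcluster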

If you get an error about not being able to SSH to the master, make sure that only you have permission to read the private key, otherwise SSH will refuse to use it.

You may also encounter this error because of a race condition in which the hosts report themselves as alive before the spark-ec2 script can SSH to them. A fix for this issue is pending in https://github.com/mesos/spark/pull/555. Until that fix is available in the version of Spark you are using, a temporary workaround is to have the script sleep an extra 120 seconds at the start of setup_cluster.

If you do get a transient error when launching a cluster, you can finish the launch process using the resume feature by running:

./ec2/spark-ec2 -i ~/spark-keypair.pem launch myfirstsparkcluster --resume

If everything goes ok, you should see something like the following screenshot:

[Screenshot: output of the spark-ec2 launch script, ending with the hostname of the master]

This will give you a bare-bones cluster with one master and one worker, using all the defaults and the default instance size. Next, verify that it has started up, and that your firewall rules were applied, by going to the master's web UI on port 8080. As you can see in the preceding screenshot, the hostname of the master is printed at the end of the script.
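As a quick command-line check (a sketch only: [master-hostname] stands for the hostname printed by the launch script, and the get-master action, if your version of the script includes it, will print it again):

./ec2/spark-ec2 get-master myfirstcluster
curl -I http://[master-hostname]:8080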

Try running one of the example jobs on your new cluster to make sure everything is OK:

sparkuser@h-d-n:~/repos/spark$ ssh -i ~/spark-keypair.pem root@[master-hostname]
Last login: Sun Apr  7 03:00:20 2013 from 50-197-136-90-static.hfc.comcastbusiness.net
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2012.03-release-notes/
There are 32 security update(s) out of 272 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2013.03 is available.
[root@domU-12-31-39-16-B6-08 ~]# ls
ephemeral-hdfs  hive-0.9.0-bin  mesos  mesos-ec2  persistent-hdfs  scala-2.9.2  shark-0.2  spark  spark-ec2
[root@domU-12-31-39-16-B6-08 ~]# cd spark
[root@domU-12-31-39-16-B6-08 spark]# ./run spark.examples.GroupByTest spark://`hostname`:7077
13/04/07 03:11:38 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
13/04/07 03:11:39 INFO storage.BlockManagerMaster: Registered BlockManagerMaster Actor
....
13/04/07 03:11:50 INFO spark.SparkContext: Job finished: count at GroupByTest.scala:35, took 1.100294766 s
2000

Now that you've run a simple job on your EC2 cluster, it's time to configure the cluster for your Spark jobs. There are a number of options you can set with the spark-ec2 script.

First, consider what instance types you may need. EC2 offers an ever-growing collection of instance types, and you can choose a different instance type for the master and the workers. The instance type has the most obvious impact on the performance of your Spark cluster. If your work needs a lot of RAM, choose an instance type with more RAM. You can specify the instance type with --instance-type=(name of instance type). By default, the same instance type is used for both the master and the workers; this can be wasteful if your computations are particularly intensive and the master isn't being heavily utilized. You can specify a different master instance type with --master-instance-type=(name of instance), as shown in the example below.
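Here is a sketch of a launch command that uses a larger worker instance type and a smaller master; the instance type names are illustrative only, and the available EC2 instance families change over time:

./ec2/spark-ec2 -k spark-keypair -i ~/spark-keypair.pem -s 2 \
  --instance-type=m1.large --master-instance-type=m1.small \
  launch myfirstcluster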

EC2 also has GPU instance types that can be useful for workers, but would be completely wasted on the master. This text will cover working with Spark and GPUs later on; however, it is important to note that EC2 GPU performance may be lower than what you get while testing locally, due to the higher I/O overhead imposed by the hypervisor.

Spark's EC2 scripts use AMIs (Amazon Machine Images) provided by the Spark team. These AMIs may not always be up-to-date with the latest version of Spark, and if you have custom patches for Spark (such as using a different version of HDFS), they will not be included in the machine image. At present, the AMIs are also only available in the U.S. East region, so if you want to run in a different region, you will need to copy the AMIs or make your own AMIs in that region.

To use Spark's EC2 scripts, you need to have an AMI available in your region. To copy the default Spark EC2 AMI to a new region, first figure out what the latest Spark AMI is by looking at the spark_ec2.py script, seeing which URL the LATEST_AMI_URL variable points to, and fetching it. For Spark 0.7, run the following command to get the latest AMI ID:

curl https://s3.amazonaws.com/mesos-images/ids/latest-spark-0.7
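If you want to capture the returned AMI ID for reuse in later commands, a small sketch (the variable name is just a convention):

LATEST_SPARK_AMI=$(curl -s https://s3.amazonaws.com/mesos-images/ids/latest-spark-0.7)
echo $LATEST_SPARK_AMI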

There is an ec2-copy-image command that you might hope would copy the image, but sadly it doesn't work on images that you don't own. So you will need to launch an instance of the preceding AMI and snapshot it yourself. You can describe the current image by running:

ec2-describe-images ami-a60193cf

This should show you that it is an EBS-backed (Elastic Block Store) image, so you will need to follow EC2's instructions for working with EBS-backed instances. Since you already have a script that launches this AMI, you can simply start a cluster with it and then snapshot one of its instances. You can list the instances you are running with:

ec2-describe-instances -H

Copy the i-[string] instance ID and save it for later use.
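If the output is long, you can narrow it down to just the instance lines; this is a rough filter, as the exact columns vary between versions of the API tools:

ec2-describe-instances -H | grep ^INSTANCE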

If you want to use a custom version of Spark, or install any other tools or dependencies and have them available as part of your AMI, you should do so (or at least update the instance) before snapshotting:

ssh -i ~/spark-keypair.pem root@[hostname] "yum update"
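For example, if your jobs need extra packages baked into the image, you might also run something like the following before snapshotting (the git package here is purely illustrative):

ssh -i ~/spark-keypair.pem root@[hostname] "yum install -y git"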

Once you have your updates installed and any other customizations you want, you can go ahead and snapshot your instance with:

ec2-create-image -n "My customized Spark Instance" i-[instancename]

With the AMI ID returned by the preceding command, you can launch your customized version of Spark by specifying the --ami command-line argument. You can also copy this image to another region for use there:

ec2-copy-image -r [source-region] -s [ami] --region [target region]

This will give you a new AMI ID, which you can use for launching your EC2 tasks. To use a specific AMI, simply specify --ami [aminame].
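Putting it together, a sketch of a launch command that uses the custom image (the AMI ID and cluster name are placeholders; substitute the value returned by ec2-create-image or ec2-copy-image):

./ec2/spark-ec2 -k spark-keypair -i ~/spark-keypair.pem -s 1 --ami [aminame] launch mycustomcluster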

Note

At the time of this writing, there is an issue with the default AMI and HDFS. You may need to update the version of Hadoop on the AMI, as it does not match the version that Spark was compiled against. You can refer to https://spark-project.atlassian.net/browse/SPARK-683 for details.
