Cluster management

The following diagram, borrowed from the spark.apache.org website, demonstrates the role of the Apache Spark cluster manager in terms of the master, slave (worker), executor, and Spark client applications:

(Figure: Cluster management, showing the master, slave (worker) nodes, executors, and Spark client applications)

The Spark context, as you will see from many of the examples in this book, can be defined via a Spark configuration object and a Spark URL. The Spark context connects to the Spark cluster manager, which then allocates executors across the worker nodes for the application, copies the application JAR file to the workers, and finally allocates tasks.
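As a minimal sketch in Scala (the application name and master URL here are illustrative placeholders, not values from this book's examples):

import org.apache.spark.{SparkConf, SparkContext}

// Build a Spark configuration with an application name and a master (Spark URL).
// Both values are placeholders; the master URL options are described in the
// subsections that follow.
val conf = new SparkConf()
  .setAppName("example-app")
  .setMaster("spark://hostname:7077")

// The Spark context connects to the cluster manager named in the master URL.
val sc = new SparkContext(conf)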

The following subsections describe the possible Apache Spark cluster manager options available at this time.

Local

By specifying a Spark configuration local URL, it is possible to have the application run locally. By specifying local[n], it is possible to have Spark use n threads to run the application locally. This is a useful development and test option.
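For instance, a minimal Scala sketch (the thread count of 4 and the application name are just illustrations):

import org.apache.spark.{SparkConf, SparkContext}

// Run the driver and executors inside a single local JVM, using 4 worker threads.
// Plain "local" uses a single thread; "local[*]" uses one thread per core.
val conf = new SparkConf()
  .setAppName("local-test")
  .setMaster("local[4]")

val sc = new SparkContext(conf)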

Standalone

Standalone mode uses a basic cluster manager that is supplied with Apache Spark. The Spark master URL will be as follows:

spark://<hostname>:7077

Here, <hostname> is the name of the host on which the Spark master is running. I have specified 7077 as the port, which is the default value, but it is configurable. This simple cluster manager currently only supports FIFO (first in, first out) scheduling. You can contrive to allow concurrent application scheduling by setting the resource configuration options for each application; for instance, setting spark.cores.max to limit the cores that each application takes, so that cores can be shared between applications.
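As a sketch in Scala, both the standalone master URL and the core limit can be set on the Spark configuration (the host name, application name, and core count are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Connect to a standalone Spark master; the host name is a placeholder.
val conf = new SparkConf()
  .setAppName("standalone-app")
  .setMaster("spark://hostname:7077")
  // Cap the cores this application may take, leaving cores free for other applications.
  .set("spark.cores.max", "4")

val sc = new SparkContext(conf)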

Apache YARN

At a larger scale, when integrating with Hadoop YARN, the Apache Spark cluster manager can be YARN, and the application can run in one of two modes. If the Spark master value is set to yarn-cluster, then the application can be submitted to the cluster and the submitting client can then terminate; the cluster takes care of allocating resources and running tasks. However, if the Spark master value is set to yarn-client, then the application stays alive for the life cycle of its processing, and requests resources from YARN as it runs.
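As a sketch in Scala (the application name is a placeholder), yarn-client mode can be selected from the application itself; yarn-cluster mode is instead chosen when the application is submitted to the cluster, for example via spark-submit with --master yarn-cluster:

import org.apache.spark.{SparkConf, SparkContext}

// yarn-client mode: the driver runs in this client process and stays alive,
// requesting executor resources from YARN as the application runs.
val conf = new SparkConf()
  .setAppName("yarn-client-app")   // placeholder application name
  .setMaster("yarn-client")

// yarn-cluster mode cannot be selected here, because in that mode the driver
// itself runs inside the cluster; it is chosen at submission time instead.
val sc = new SparkContext(conf)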

Apache Mesos

Apache Mesos is an open source system for resource sharing across a cluster. It allows multiple frameworks to share a cluster by managing and scheduling resources. It is a cluster manager that provides isolation using Linux containers, allowing multiple systems, such as Hadoop, Spark, Kafka, Storm, and more, to share a cluster safely. It is highly scalable to thousands of nodes. It is a master/slave-based system, and is fault tolerant, using Zookeeper for configuration management.

For a single master node Mesos cluster, the Spark master URL will be in this form:

mesos://<hostname>:5050

Here, <hostname> is the host name of the Mesos master server, and the port is 5050, which is the default Mesos master port (this is configurable). If there are multiple Mesos master servers in a large-scale, high-availability Mesos cluster, then the Spark master URL will look like this:

mesos://zk://<hostname>:2181

In this case, the election of the Mesos master server is controlled by Zookeeper, and <hostname> is the name of a host in the Zookeeper quorum. Also, the port number 2181 is the default master port for Zookeeper.
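As a sketch in Scala, either form of the Mesos master URL can be supplied to the Spark configuration (the host names and application name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Single Mesos master; the host name is a placeholder.
val singleMaster = "mesos://hostname:5050"

// Multi-master, high-availability Mesos cluster; the host name is that of a
// member of the Zookeeper quorum that elects the active Mesos master.
val zkMaster = "mesos://zk://hostname:2181"

val conf = new SparkConf()
  .setAppName("mesos-app")
  .setMaster(singleMaster)   // or zkMaster for a high-availability cluster

val sc = new SparkContext(conf)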

Amazon EC2

The Apache Spark release contains scripts for running Spark in the cloud against Amazon AWS EC2-based servers. The following listing, as an example, shows Spark 1.3.1 installed on a Linux CentOS server, under the directory /usr/local/spark/. The EC2 resources are available in the ec2 subdirectory of the Spark release:

[hadoop@hc2nn ec2]$ pwd
/usr/local/spark/ec2
[hadoop@hc2nn ec2]$ ls
deploy.generic  README  spark-ec2  spark_ec2.py

In order to use Apache Spark on EC2, you will need to set up an Amazon AWS account. You can set up an initial free account to try it out here: http://aws.amazon.com/free/.

If you take a look at Chapter 8, Spark Databricks, you will see that such an account has been set up, and is used to access https://databricks.com/. The next thing that you will need to do is access your AWS IAM Console, and select the Users option. You either create or select a user. Select the User Actions option, and then select Manage Access Keys. Then, select Create Access Key, and then Download Credentials. Make sure that your downloaded key file is secure; assuming that you are on Linux, chmod the file to permissions 600 for user-only access.

You will now have your Access Key ID, Secret Access Key, key file, and key pair name. You can now create a Spark EC2 cluster using the spark-ec2 script as follows:

export AWS_ACCESS_KEY_ID="QQpl8Exxx"
export AWS_SECRET_ACCESS_KEY="0HFzqt4xxx"

./spark-ec2 \
    --key-pair=pairname \
    --identity-file=awskey.pem \
    --region=us-west-1 \
    --zone=us-west-1a \
    launch cluster1

Here, pairname is the key pair name that you gave when your access details were created, awskey.pem is the identity file that you downloaded, and cluster1 is the name of the cluster that you are going to create. The region chosen here is in the western USA, us-west-1. If you live in the Pacific, as I do, it might be wiser to choose a nearer region, such as ap-southeast-2. However, if you encounter allowance access issues, then you will need to try another zone. Remember also that using cloud-based Spark clustering like this will have higher latency and poorer I/O in general: you share your cluster hosts with multiple users, and your cluster may be in a remote region.

You can add a series of options to this basic command to configure the cloud-based Spark cluster that you create. The -s option can be used:

-s <slaves>

This allows you to define how many worker nodes to create in your Spark EC2 cluster, that is, -s 5 for a six-node cluster: one master and five slave workers. You can also define the version of Spark that your cluster runs, rather than the default latest version. The following option starts a cluster with Spark version 1.3.1:

--spark-version=1.3.1

The instance type used to create the cluster will define how much memory is used, and how many cores are available. For instance, the following option will set the instance type to be m3.large:

--instance-type=m3.large

The current instance types for Amazon AWS can be found at: http://aws.amazon.com/ec2/instance-types/.

The following figure shows the current (as of July 2015) AWS M3 instance types, model details, cores, memory, and storage. There are many instance types available at this time; for instance, T2, M4, M3, C4, C3, R3, and more. Examine the current availability and choose appropriately:

(Figure: Amazon EC2 M3 instance types, showing model, cores, memory, and storage)

Pricing is also very important. The current AWS EC2 instance type prices can be found at: http://aws.amazon.com/ec2/pricing/.

The prices are shown by region via a drop-down menu, with a price per hour. Remember that each instance type is defined by cores, memory, and physical storage. The prices are also defined by operating system type, that is, Linux, RHEL, and Windows. Just select the OS via a top-level menu.

The following figure shows an example of pricing at the time of writing (July 2015); it is just provided to give an idea. Prices will differ over time, and by service provider. They will differ by the size of storage that you need, and the length of time that you are willing to commit to.

Be aware also of the costs of moving your data off of any storage platform. Try to think long term. Check whether you will need to move all, or some of your cloud-based data to the next system in, say, five years. Check the process to move data, and include that cost in your planning.

(Figure: Amazon EC2 pricing by operating system, region, instance type, and hour)

As described, the preceding figure shows AWS EC2 costs by operating system, region, instance type, and hour. The costs are measured per unit hour, so systems such as https://databricks.com/ do not terminate EC2 instances until a full hour has elapsed. These costs will change with time, and need to be monitored via the AWS billing console.

You may also have problems when you want to resize your Spark EC2 cluster, so you will need to be sure of the master/slave configuration before you start. Be sure how many workers you are going to require, and how much memory you need. If you feel that your requirements are going to change over time, and you definitely wish to work with Spark in the cloud, then you might consider using https://databricks.com/. Go to Chapter 8, Spark Databricks, to see how you can set up and use https://databricks.com/.

In the next section, I will examine Apache Spark cluster performance, and the issues that might impact it.
