Using Amazon EC2 Spot and Auto Scaling

Using spot instances is a great way to save money compared to using on-demand instances. However, be more careful in an SLA-driven environment where you cannot tolerate failures. In most cases the odds of a failure are fairly low, and Hadoop itself can handle several node failures, so even if some nodes are taken away you may still be fine; you should run task nodes, which hold no HDFS data, so that losing them does not impact HDFS. Still, failures do happen, and you could lose enough nodes that Spark cannot recompute a DataFrame. Having logic that simply kills the cluster and creates a new one on demand for that one job is more expensive, but over a period of time the savings still make this a worthwhile approach. Templatizing the architecture so that you can recover quickly when something happens is both doable and recommended when you are using spot instances for SLA-bound workloads.
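As a rough sketch of such a template, the boto3 call below launches a transient cluster with the master and core nodes on on-demand capacity (so HDFS never sits on spot) and only the task nodes on spot. It assumes the default EMR IAM roles already exist; the cluster name, release label, instance types, and counts are illustrative, not prescriptive.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Transient, templated cluster: master and core on on-demand so HDFS
# never depends on spot capacity; task nodes on spot for the savings.
response = emr.run_job_flow(
    Name="nightly-etl",                       # illustrative name
    ReleaseLabel="emr-5.30.0",                # illustrative release
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes hold no HDFS data, so a spot interruption here
            # does not put the filesystem at risk.
            {"Name": "Task", "InstanceRole": "TASK",
             "Market": "SPOT", "InstanceType": "m5.xlarge", "InstanceCount": 8},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the job finishes
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

Because the whole request lives in code, recovering from a lost cluster is just a matter of running the template again.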

Using instance fleets is similar to using spot fleets; however, you do not specify a single instance type to use. Instead, you specify a list of instance types, and Amazon provisions the best combination it can based on the availability of instances. This solves the following three problems (see the sketch after this list):

  • Provision across instance types. A requested instance type is sometimes unavailable; with instance fleets, AWS tries to reach the capacity you need from whatever is available in the list of instance types you specify. You give four or five different choices, and AWS provisions some combination of them to hit the target capacity.
  • You can also specify a list of Availability Zones, and EMR will select the optimal EC2 AZ in which to launch the cluster.
  • If spot instances are not available but the job still needs to run, the on-demand fallback can be used. For example, if the required spot capacity cannot be provisioned within 30 minutes, switch to on-demand.
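The following is a minimal boto3 sketch of an instance fleet request covering all three points; it assumes the default EMR roles exist, and the subnet IDs, instance types, weights, and capacities are illustrative. The core fleet lists several acceptable instance types and tells EMR to switch to on-demand if the spot capacity cannot be filled within 30 minutes.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

emr.run_job_flow(
    Name="fleet-example",                     # illustrative name
    ReleaseLabel="emr-5.30.0",                # illustrative release
    Instances={
        # EMR picks the subnet (and therefore the AZ) from this list that
        # can best satisfy the fleets.
        "Ec2SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders
        "InstanceFleets": [
            {"Name": "MasterFleet", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"Name": "CoreFleet", "InstanceFleetType": "CORE",
             "TargetSpotCapacity": 8,
             # Several interchangeable types; EMR provisions whatever
             # combination is available to reach the target capacity.
             "InstanceTypeConfigs": [
                 {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                 {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
                 {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
                 {"InstanceType": "c5.2xlarge", "WeightedCapacity": 2},
             ],
             "LaunchSpecifications": {
                 "SpotSpecification": {
                     # If spot capacity cannot be filled in 30 minutes,
                     # fall back to on-demand instead of stalling the job.
                     "TimeoutDurationMinutes": 30,
                     "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                 }
             }},
        ],
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)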

EMR auto-scaling, whether for long-running clusters or, to some degree, for transient workloads, uses some of the features of application auto-scaling under the hood, but AWS has stitched things together for you: you specify what you need and AWS does the rest (there is no need to wire up the CloudWatch alarms yourself). You can scale on a variety of CloudWatch metrics, such as available YARN memory or the ratio of pending containers to active containers, which is effectively a proxy for the question "if the cluster had one more node, would there be YARN containers waiting to run on it?" If the answer is yes, add another node.
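A minimal sketch of attaching such a scale-out rule with boto3 is shown below. It assumes the cluster was launched with an EMR auto-scaling role; the cluster ID, instance group ID, capacities, threshold, and cooldown are placeholders to adjust for your workload.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXX",        # placeholder cluster ID
    InstanceGroupId="ig-XXXXXXXXXXXX", # placeholder task/core instance group ID
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "ScaleOutOnPendingContainers",
                "Description": "Add a node when containers are waiting for capacity",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,   # add one node at a time
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        # ContainerPendingRatio: pending containers divided by
                        # allocated containers; a high value means a new node
                        # would immediately have work to do.
                        "MetricName": "ContainerPendingRatio",
                        "Namespace": "AWS/ElasticMapReduce",
                        "ComparisonOperator": "GREATER_THAN",
                        "Threshold": 0.75,        # illustrative threshold
                        "EvaluationPeriods": 1,
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Unit": "COUNT",
                    }
                },
            }
        ],
    },
)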

There are a number of built-in CloudWatch metrics that can be used, but you can also use custom metrics. For example, there is currently no built-in metric for aggregate CPU across the cluster, but if you have installed Ganglia that measurement is available, so pump it into CloudWatch and scale based on it.
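A minimal sketch of pushing such a custom metric with boto3 follows. The namespace, metric name, and the read_ganglia_cpu helper are illustrative assumptions, not part of any AWS or Ganglia API; substitute whatever you use to scrape Ganglia on the master node.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

def publish_cluster_cpu(cluster_id: str, aggregate_cpu_percent: float) -> None:
    """Push an aggregate-CPU reading (scraped from Ganglia) to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace="Custom/EMR",  # illustrative namespace
        MetricData=[{
            "MetricName": "AggregateCPUUtilization",
            "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
            "Value": aggregate_cpu_percent,
            "Unit": "Percent",
        }],
    )

# read_ganglia_cpu() is a hypothetical helper that parses the gmetad output:
# publish_cluster_cpu("j-XXXXXXXXXXXX", read_ganglia_cpu())

Once the metric is in CloudWatch, it can be referenced from an auto-scaling rule just like the built-in EMR metrics.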

A few points about auto-scaling: EMR scales in at YARN task completion, and it selectively removes nodes that have no running tasks (that is, no running YARN containers). The behavior is governed by yarn.resourcemanager.decommissioning.timeout (the default is one hour). If a node selected for removal is still running a YARN container, EMR waits up to this timeout; if no other node frees up, it then takes the node away and kills the YARN container. If you have a use case that involves caching a large Spark DataFrame and all the YARN containers are required, you can set this parameter to an arbitrarily high value to prevent a node from being taken away while it is in use. Ultimately, the right value depends on your specific use case.
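One way to raise the timeout is through an EMR configuration classification passed at cluster launch; a short sketch follows, where the value (in seconds) is illustrative only.

# Raise the decommissioning timeout so nodes holding cached Spark
# partitions are not reclaimed while their containers are still needed.
# The value is in seconds; the EMR default corresponds to one hour (3600).
yarn_site_override = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.resourcemanager.decommissioning.timeout": "172800"  # two days, illustrative
        },
    }
]

# Pass it when launching the cluster, for example:
# emr.run_job_flow(..., Configurations=yarn_site_override, ...)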
