Best practices for using Amazon EMR

Amazon has made working with Hadoop a lot easier. You can launch an EMR cluster in minutes for big data processing, machine learning, and real-time stream processing with the Apache Hadoop ecosystem. Using the AWS Management Console or the command line, you can start anywhere from a few nodes to a thousand with equal ease.
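For instance, here is a minimal sketch of launching a small cluster programmatically with boto3. The cluster name, region, instance types, and key pair are illustrative placeholders, and the default IAM roles assume you have run `aws emr create-default-roles` once:

    import boto3

    # Placeholder region; pick the one you work in.
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="example-cluster",                 # hypothetical name
        ReleaseLabel="emr-5.20.0",              # pick the EMR release you need
        Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2KeyName": "my-key-pair",        # assumed existing EC2 key pair
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster ID:", response["JobFlowId"])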

Like EC2, EMR pricing has changed from per-hour to per-second billing. When you terminate your cluster, you stop paying immediately. This lowers costs, and you no longer have to plan around the hourly boundary.

EMR makes current versions of a wide range of open source software available to you; at present, 19 open source projects are included. New EMR releases ship every four to six weeks, so the latest versions of these projects become available quickly. This is especially useful for rapidly evolving projects such as Apache Spark, where each release contains critical bug fixes and features. You are never forced to upgrade; each new release is simply made available if you choose to use it. With EMR, you can spin up a fleet of instances and process massive volumes of data residing on S3 at a reasonable cost.
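As a minimal illustration of that S3-centric pattern, a PySpark job on EMR can read s3:// paths directly through EMRFS. The bucket, prefix, and event_type column below are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-aggregation").getOrCreate()

    # Hypothetical bucket and prefix; substitute a dataset you own.
    df = spark.read.json("s3://my-bucket/events/2018/")

    # A simple aggregation over the hypothetical event_type column.
    df.groupBy("event_type").count().show()

    spark.stop()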

A variety of cluster management options are supported, including YARN. You can run HBase, Presto (a low-latency, distributed SQL engine), Spark, and Tez, along with frontend tools such as Ganglia, Zeppelin notebooks, and SQL editors. Connectors to a variety of AWS services are also available. For example, you can use Spark to load Redshift through the Redshift connector, which under the hood stages data in S3 and issues Redshift commands to achieve good throughput. You can access DynamoDB for analytics applications, use Sqoop to access relational data, and so on.
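Here is a sketch of that Redshift pattern using the open source spark-redshift connector; the JDBC URL, target table, staging path, and IAM role are all placeholders. The connector writes rows to the S3 staging directory and then issues a Redshift COPY, which is where the throughput comes from:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("redshift-load").getOrCreate()

    # A tiny stand-in DataFrame; in practice this is your processed data.
    df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event_type"])

    (df.write
        .format("com.databricks.spark.redshift")
        .option("url", "jdbc:redshift://example:5439/db?user=u&password=p")
        .option("dbtable", "analytics.events")
        .option("tempdir", "s3://my-bucket/redshift-staging/")
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")
        .mode("append")
        .save())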

One particularly interesting connector is AWS Glue. AWS Glue comprises three main components:

  • ETL service: This lets you visually compose serverless ETL pipelines by dragging components around
  • AWS Glue Data Catalog: This is a fully managed, Hive metastore-compatible service. Previously, you would run an external Hive metastore database in RDS or Aurora. That worked well: if you shut down your cluster, all your metadata was persisted, so you didn't have to recreate your tables, and you got extra durability and availability in case something happened to a MySQL metastore running on the master node. With Glue, all of that is fully managed. You get an intelligent metastore: you don't have to write DDL to create a table; you can simply have Glue crawl your data, infer the schema, and create the tables for you. It can also add partitions, which is otherwise painful: if you are constantly updating your Hive tables, you need a process to kick off and load each new partition, and the Glue catalog can do that for you. It supports a variety of complex data types as well.
  • Crawlers: The crawlers let you crawl data to infer its schema (see the sketch after this list)
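As a minimal example of driving a crawler from code, here is a boto3 sketch; the crawler name, IAM role, catalog database, and S3 path are placeholders:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # The role must allow Glue to read the S3 path; DatabaseName is the
    # catalog database that will hold the inferred tables and partitions.
    glue.create_crawler(
        Name="events-crawler",
        Role="AWSGlueServiceRole-example",
        DatabaseName="analytics",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/events/"}]},
    )
    glue.start_crawler(Name="events-crawler")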

EMR is a managed service, so you spend less time monitoring; the service is also responsible for replacing unhealthy nodes and for auto-scaling, and it is easy to enable security options. At the same time, it allows full customization and control. You don't have to waste time creating and configuring the cluster, and in most cases the default settings are good enough; but if you want to change them or install custom components, you have root access on all the boxes, so you can make any changes you need.
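To illustrate how little effort those overrides take, the same run_job_flow call shown earlier accepts a Configurations list for changing application defaults and a BootstrapActions list for installing custom components on every node. The Spark setting and bootstrap script path below are placeholders:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="tuned-cluster",
        ReleaseLabel="emr-5.20.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Override a default application setting without touching any box.
        Configurations=[{
            "Classification": "spark-defaults",
            "Properties": {"spark.executor.memory": "4g"},
        }],
        # Run a custom script on every node at startup.
        BootstrapActions=[{
            "Name": "install-custom-libs",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"},
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )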
