Understanding common EMR use cases

Using HBase for random access at a massive scale involves a lot of customers who are running HBase with HDFS. Now there is support for HBase using S3 object store for HFiles. Also, there is the ability to use Read Replica HBase cluster in another AZ. Shifting to S3 can save you 50% or higher on storage costs. Instead of sizing the cluster for HDFS, they can now size it for the amount of processing power required for the HBase Region Servers. The S3 option is also good for load balancing and disaster recovery across AZs. As S3 is available across a region, you don’t have to replicate the data twice, that is, you don’t need two full HDFS clusters. Now you can set up a smaller cluster for the Read Replicas that point to the same HFiles and you can drive the read traffic through there.

Real-time and batch processing involves utilizing EMR; you can use Kinesis for pushing data to Spark. Use Spark Streaming for real-time analytics or processing data on-the-fly and then dump that data into S3. If you don’t have real-time processing use cases, then Kinesis Firehose is a great alternative too. The data can be cataloged in the Glue Data Catalog and then you can have the data accessible via a variety of different analytical engines. EMR supports several analytical engines including Hive, Tez, and Spark. Once the data is in the Data Catalog on S3, you can use Athena (serverless SQL queries), Glue ETL (serverless ETL), and Redshift Spectrum.

Data exploration with Spark using Zeppelin or Jupyter notebook. This allows you to arm your data scientists with a way to explore large amounts of data (instead of using one node, now you can spread the data across the cluster). It also makes it easy to move it to production.

There is a big rise in the use of Presto for ad hoc SQL queries (in combination with Athena). They approach the same thing from two different angles. Presto gives you advanced configurations and a way to build exactly what you need for your use case but you have to deal with the cluster management versus Athena where you just go to the console and start writing SQL. Now many BI tools support Presto as well for supporting low latency dashboards. You can also do traditional batch processing workloads using Spark.

Deep learning with GPU instances is where you can launch GPU hardware for EMR. There’s support for MxNet. You can do end-to-end data engineering work. Support for TensorFlow is coming.

Typical ML projects implement a multi-step process, including ETL, feature engineering, model training, model evaluation, model deployment, and model scoring and updates. Such pipelines need to support batch model training and real-time ML model serving. Using Apache Spark for implementing ML pipelines is very popular as it supports each step in a ML pipeline, scales for small and large jobs, good ML libraries, and has an active user base.

There are several options for deploying Spark on AWS. For example, you can use EC2 as it can support for batch/streaming, integrates with tooling, spin up/down clusters, larger/smaller clusters. Additionally, it also has support for different versions of Hadoop and Spark. However, using EC2 for Spark deployment places a huge management burden on us. Hence, EMR can be a simpler and better alternative here. It is simple to provision and you can use a wizard (and then generate the commands for the command line from it, if required). You can create tags for cost management and send logs to S3.

Table of Contents for Understanding common EMR use cases

Create new playlist

Sign In

Sign Up

Table of Contents for
Understanding common EMR use cases