Chapter 7. Exploring Data in the Cloud

Hadoop in the cloud is a new deployment option that allows organizations to create and customize Hadoop clusters on virtual machines, using the computing resources of virtual instances and deployment scripts. As with the fully custom on-premise option, this gives businesses full control of the cluster. It also brings flexibility and several advantages, for example capacity on demand, decreased staff costs, storage services, and technical support. Finally, it offers fast time to value: we can deploy our infrastructure in the Amazon cloud and start analyzing our data very quickly, because we don't need to set up hardware and software and we don't need many technical resources. One of the most popular Hadoop cloud services is Amazon Elastic MapReduce (EMR).

With Hunk we can interactively explore, analyze, and visualize data stored in Amazon EMR and Amazon S3. The integrated offering lets AWS and Splunk customers:

  • Unlock the business value of data: Preview search results before MapReduce jobs finish and conduct sophisticated analytics—all from an integrated analytics platform that's fast and easy for everyone to use.
  • Gain insights: Hunk lets us explore, analyze, and visualize Amazon EMR and Amazon S3 data on a massive scale, all with just a few clicks.
  • Easily provision and deploy Hunk when we need it and for only as long as we need it, charged by the hour.

In this chapter, the reader will learn how to run Amazon EMR and deploy Hunk on top of it. In addition, the reader will create virtual indexes and use the Amazon S3 file system.

An introduction to Amazon EMR and S3

In this section, we will learn about Amazon EMR and Simple Storage Service (S3). We will also try out these services by creating an EMR cluster and an S3 bucket.

Amazon EMR

Amazon EMR is a Hadoop framework in the cloud offered as a managed service. It is used by thousands of customers, who run millions of EMR clusters across a variety of big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. EMR lets us process any type of big data without maintaining our own big data infrastructure.

As with any other Amazon service, EMR is easy to launch by filling in a few option forms: we enter the cluster name, the size, and the types of node in the cluster, and in about two minutes AWS creates a fully running EMR cluster that is ready to process data. This removes the headache of maintaining clusters and version compatibility, because Amazon takes care of all the tasks involved in running and supporting Hadoop.

Setting up an Amazon EMR cluster

Let's start an EMR cluster so that we can connect Hunk to it:

  1. Go to http://aws.amazon.com and sign up or sign in.
  2. Go to the AWS Management Console and choose Amazon EMR.
  3. Click Create Cluster and choose the appropriate parameters.
  4. Type in the Cluster name, for example emr-cluster-packtpub. In addition, we can switch off Logging and Termination protection.
  5. Choose a Hadoop distribution.
  6. Choose an EC2 instance type and the number of nodes.

    Note

    We can learn more about how to plan the capacity for EMR here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-instances.html.

    If we want to pay for Hunk by the hour, we should add Hunk as an additional application.
  7. For Security and Access, select your EC2 key pair and IAM user access settings.
  8. For IAM Roles, select your EMR Role setting and EC2 instance profile for your cluster. Hunk requires that you run your EMR cluster using IAM roles.
  9. Click Create Cluster. AWS takes some time to provision and start the new cluster.
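
The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 library and AWS credentials are available; it is not the procedure this chapter follows. The helper names, the instance type, the key pair name, and the release label are illustrative placeholders, and the two role names are the AWS default EMR roles, which may differ in your account. Adding Hunk as an additional application is not covered here.

```python
def build_cluster_request(name, release_label, instance_type, node_count, key_pair):
    """Build keyword arguments in the shape boto3's EMR run_job_flow call accepts.

    All values are supplied by the caller; nothing here contacts AWS.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # placeholder: pick a release your account offers
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            "Ec2KeyName": key_pair,
            # Keep the cluster alive so that Hunk can connect to it later.
            "KeepJobFlowAliveWhenNoSteps": True,
            # Mirrors switching off Termination protection in step 4.
            "TerminationProtected": False,
        },
        # Omitting LogUri mirrors switching off Logging in step 4.
        # Assumption: the default EMR roles exist in the account.
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "VisibleToAllUsers": True,
    }


def create_cluster(request, region="us-east-1"):
    """Submit the request to AWS. Needs boto3, credentials, and network access."""
    import boto3  # imported lazily so the request can be built without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]


# Hypothetical values for illustration only.
request = build_cluster_request(
    name="emr-cluster-packtpub",
    release_label="emr-4.2.0",
    instance_type="m3.xlarge",
    node_count=3,
    key_pair="my-key-pair",
)
print(request["Name"], request["Instances"]["InstanceCount"])
```

Calling create_cluster(request) would start a billable cluster, so the sketch only builds and prints the request.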

Amazon S3

Amazon S3 provides developers and IT teams with secure, durable, and highly scalable object storage. It is easy to use, with a simple web service interface to store and retrieve any amount of data from anywhere on the Web. Objects can range in size from a few bytes to terabytes, and we can work with an unlimited number of them. Finally, the service is backed by the AWS SLA.

S3 as a data provider for Hunk

Amazon S3 can be a data provider for Hunk. Let's create a bucket and upload two files containing two weeks' worth of all HTTP requests to the ClarkNet WWW server. ClarkNet is a full Internet access provider for the Metro Baltimore-Washington DC area. The files can be found in the attachment to this chapter or downloaded directly from http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html:

  1. Go to the AWS Management Console and choose Amazon S3.
  2. Create a bucket called my-web-logs.
  3. Choose Actions and click Upload in order to upload clarknet_access_log_Aug28 and clarknet_access_log_Sep4.
  4. When we set up Hunk, we will use this bucket as a data source for a virtual index.
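
The bucket creation and upload steps above could be scripted along these lines. This is a sketch that assumes the boto3 library, AWS credentials, and that the two log files sit in a local directory; the function names and the local directory are hypothetical, while the bucket and file names come from the steps above.

```python
import posixpath

BUCKET = "my-web-logs"
LOG_FILES = ["clarknet_access_log_Aug28", "clarknet_access_log_Sep4"]


def plan_uploads(local_dir, bucket, filenames):
    """Return (local_path, bucket, key) tuples; each object key is the file name."""
    return [(posixpath.join(local_dir, name), bucket, name) for name in filenames]


def upload_all(plan, region="us-east-1"):
    """Create the bucket and upload every planned file. Needs boto3 and credentials."""
    import boto3  # imported lazily so the plan can be inspected without boto3

    s3 = boto3.client("s3", region_name=region)
    for bucket in {entry[1] for entry in plan}:
        # In regions other than us-east-1, create_bucket also needs
        # CreateBucketConfiguration={"LocationConstraint": region}.
        s3.create_bucket(Bucket=bucket)
    for local_path, bucket, key in plan:
        s3.upload_file(local_path, bucket, key)


# "./logs" is a hypothetical local directory holding the downloaded files.
plan = plan_uploads("./logs", BUCKET, LOG_FILES)
for local_path, bucket, key in plan:
    print(f"{local_path} -> s3://{bucket}/{key}")
```

Bucket names are globally unique in S3, so my-web-logs may already be taken; in that case a different name has to be used here and later in the virtual index configuration.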

The advantages of EMR and S3

Amazon Elastic MapReduce can be seen as both a complement and a competitor to Hadoop. EMR can run on top of a Hadoop HDFS cluster, but it can also run directly on top of AWS S3. There are several advantages to using S3 and EMR together. First, EMR has native support for accessing data in S3: S3 works as the data store, in effect a full distributed file system, and EMR runs on top of it. In addition, this combination lets us avoid the complexity of Hadoop and HDFS management, which is not easy to maintain when we run Hadoop on-premise.

Moreover, EMR is elastic; it is easy to grow clusters dynamically on demand. Finally, it uses a pay-for-what-you-use model, so the cost depends on how we run the cluster. For example:

  • Long running versus Transient
  • Spot versus Reserved Instances

Furthermore, EMR and S3 are very popular, with thousands of customers, a big ecosystem, and a very large community.
