Integrating Hunk with EMR and S3

Integrating Hunk with EMR and S3 is a sensible proposition. By connecting the vast amounts of data we store in HDFS or S3 with the rich capabilities of Hunk, we can build a complete cloud analytics solution for data of any type and size.

Fundamentally, we have a three-tier architecture. The first tier is data storage based on HDFS or S3. The next one is the compute or processing framework, provided by EMR. Finally, the visualization, data discovery, analytics, and app development framework is provided by Hunk.

The traditional method of hosting Hunk in the cloud is to buy a standard license and then provision a virtual machine in much the same way you would on-site. The instance then has to be manually configured to point to the correct Hadoop or EMR cluster. This method is known as Bring Your Own License (BYOL).

On the other hand, Splunk and Amazon offer another method, in which Hunk instances can be provisioned automatically in AWS. This includes automatic discovery of EMR data sources, which allows instances to be brought online in a matter of minutes. Instances provisioned this way are billed at an hourly rate. Let's try both methods.

Method 1: BYOL

We already have an EMR cluster running. In addition, we need data loaded into S3 or HDFS.

Setting up the Hunk AMI

Let's find and run the Hunk AMI:

  1. Go to the AWS Marketplace—https://aws.amazon.com/marketplace—and find the Hunk AMI.

    Tip

    There is detailed information about Amazon Machine Images at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html.

  2. Choose the appropriate instance type, depending on your workload.

    Tip

    It is important to create an instance with enough resources to accommodate the expected workload and search concurrency. For example, a c3.2xlarge (high-CPU instance) is a good starting point.

    It is important that the Hunk instance can communicate with all EMR cluster nodes. To do this, we have to edit Security Groups in the EC2 Management page to make sure traffic is allowed to and from all ports.

    If we are using S3 for our data directories, we have to set up a Hunk working directory in HDFS. This improves processing speed and allows us to keep our directory read-only, if desired.

  3. Configure the IAM Role; choose EC2_EMR_DefaultRole.
  4. Configure the Security Group. Hunk needs to be part of ElasticMapReduce-master. In addition, we should attach Hunk-Group in order to access the web interface.
  5. Click Launch Instance; in the window that opens, create a Key Pair or choose an existing one in order to connect to the instance over SSH.

As a result, we have EMR and Hunk running with security configured. From the EC2 console, choose our Hunk instance and copy its Public DNS. In addition, we can copy the Instance ID, which serves as the default password for Hunk.

Then paste the Public DNS into a browser and append :8000, the default port, to reach the Hunk web interface.

Adding a license

We chose the BYOL model, so we need to add a license file. Go to Settings | Licensing and click Add license to upload the license file.

By default, we can use the trial license, which is valid for 60 days.

Configuring the data provider

Let's configure the data provider in order to create a virtual index and start to explore our log file, based on S3:

  1. Go to Settings | Virtual Indexes | Providers tab and click New Provider.
  2. Complete the following parameters:
    • Name: Type any name.
    • Java Home: The AMI's default is /opt/java/latest/.
    • Hadoop Home: The AMI ships several Hadoop versions under /opt/hadoop/apache/hadoop-X.X.X.
    • Hadoop Version: Choose the appropriate one.
    • Job Tracker: Use the Private IP of the master node.
    • File System: For S3, use s3n://<AWS key>:<AWS secret>@<s3 bucket path>.

  3. Click Save.

Configuring a virtual index

After the provider, we need a new virtual index. On the Virtual Indexes tab, click New Virtual Index. Add a unique name and the full S3 path to the logs (optionally, we can use a Whitelist if there are many log types under that path), and then click Save.

Setting up a provider and virtual index in the configuration file

We can connect to the Hunk instance via SSH using our key pair and set up a data provider via configuration files. For a step-by-step guide, see: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Setupavirtualindex.

In order to connect the instance via the Terminal, we can use the following command:

ssh -i <private key> ec2-user@<public DNS>
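The provider and virtual index created through the UI above can equally be declared in indexes.conf. The following sketch uses hypothetical values throughout (the names emr-provider and emr_logs, the HDFS working directory, the log path, and the job tracker port 9001 are all placeholders); the vix.* setting names are described in the Hunk documentation linked above, and every value must be adjusted to your own cluster.

```ini
# $SPLUNK_HOME/etc/system/local/indexes.conf -- illustrative values only

[provider:emr-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /opt/java/latest
vix.env.HADOOP_HOME = /opt/hadoop/apache/hadoop-X.X.X
# For S3-backed data; for HDFS, use an hdfs:// URI instead
vix.fs.default.name = s3n://<AWS key>:<AWS secret>@<s3 bucket path>
vix.mapred.job.tracker = <master node private IP>:9001
# HDFS working directory for intermediate results (see the earlier tip)
vix.splunk.home.hdfs = /user/hunk/workdir

[emr_logs]
vix.provider = emr-provider
# Path to the logs within the file system configured above
vix.input.1.path = /logs/...
# Optional whitelist: search only files matching this regex
vix.input.1.accept = \.log$
```

After editing the file, restart Hunk for the changes to take effect.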

Exploring data

When we have successfully created a virtual index, we can start to explore the data:

Exploring data

Using SPL (the Search Processing Language), we can build whatever reports we need.
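For example, a simple SPL search against the new virtual index can chart event volume over time (emr_logs here is a hypothetical placeholder for the virtual index name chosen earlier):

```
index=emr_logs | timechart count
```

From there, the usual reporting commands (stats, top, chart, and so on) work against a virtual index just as they do against a native one.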

Method 2: Hunk with hourly pricing

If we don't have a Hunk license, we can use Hunk on a pay-as-you-go basis. In order to use this method, we should add Hunk as an additional application during the configuration of EMR clusters (see the Setting up an Amazon EMR cluster section).

In addition, we have two options for provisioning Hunk.

Provisioning a Hunk instance using a CloudFormation template

We can go to http://aws.amazon.com/cloudformation/ and create a new stack. Then, we should configure Hunk as usual.

Provisioning a Hunk instance using the EC2 Console

Another option is using the EC2 console. We should find the appropriate Hunk AMI and run it. When the cluster is ready, we can create a virtual index that points to a data set location of our choice (HDFS or S3).
