Chapter 2. Explore Hadoop Data with Hunk

Hadoop has become an enterprise standard for large organizations implementing big data strategies, and its use at scale is becoming the norm for practical, results-driven data mining. However, extracting data from Hadoop in order to explore it and find business insights remains a challenging task. Hadoop provides cheap storage for any kind of data but, unfortunately, it is inflexible for data analytics. There are plenty of tools that add flexibility and interactivity to analytics tasks, but they come with many restrictions.

Hunk avoids the main drawbacks of big data analytics and offers rich functionality and interactivity for analytics.

In this chapter you will learn how to deploy Hunk on top of Hadoop. We will then load data into Hadoop and explore it through Hunk using the Splunk Processing Language (SPL). Finally, we will learn about Hunk security.

Setting up Hunk

In order to start exploring Hadoop data, we have to install Hunk on top of our Hadoop Cluster. Hunk is easy to install and configure. Let's learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It's assumed that your VM is up and running.

Extracting Hunk to a VM

  1. Open the console application.
  2. Run ls -la to see the list of files in your home directory:
    [cloudera@quickstart ~]$ cd ~
    [cloudera@quickstart ~]$ ls -la | grep hunk
    -rw-r--r--   1 root     root     113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz
    
  3. Unpack the archive:
    cd /opt
    sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt
    

Setting up Hunk variables and configuration files

  1. It's time to set the SPLUNK_HOME environment variable. On the demo VM this variable has already been added to the profile:
    export SPLUNK_HOME=/opt/hunk
    
  2. Use the default splunk-launch.conf. This is the basic properties file used by the Hunk service. We don't have to change anything special, so let's use the default settings:
    sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf
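
If your shell session does not already have the variable set, a quick way to persist it is the snippet below (a minimal sketch; /opt/hunk matches where we extracted the archive, and ~/.bash_profile is the usual profile file on the CDH quickstart VM):

```shell
# Export SPLUNK_HOME for the current shell; /opt/hunk matches the extract step.
export SPLUNK_HOME=/opt/hunk

# Persist it for future sessions, but only if it is not in the profile already.
grep -q 'SPLUNK_HOME' ~/.bash_profile 2>/dev/null || \
  echo 'export SPLUNK_HOME=/opt/hunk' >> ~/.bash_profile

# Confirm the value.
echo "$SPLUNK_HOME"
```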
    

Running Hunk for the first time

Run Hunk using the following command:

sudo /opt/hunk/bin/splunk start --accept-license

Here is the sample output from the first run:

This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus

[... some output lines omitted to shorten the log ...]

Waiting for web server at http://127.0.0.1:8000 to be available.... Done


If you get stuck, we're here to help.  
Look for answers here: http://docs.splunk.com

The Splunk web interface is at http://vm-cluster-node1.localdomain:8000

Now you can access the Hunk UI using http://localhost:8000 in the browser on your virtual machine.

Setting up a data provider and virtual index for CDR data

We need to accomplish two tasks: providing a technical connector to underlying data storage and creating a virtual index for data on this storage.

Log in to http://localhost:8000. The system will ask you to change the default admin user password. I have set it to admin.


Setting up a connection to Hadoop

Right now we are ready to set up integration between Hadoop and Hunk. First we need to specify the way Hunk connects to the current Hadoop installation. We are using the most recent way: YARN with MR2. Then we have to point the virtual indexes to data stored in Hadoop:

  1. Click on Explore Data.
  2. Click on Create a provider on the next screen.

Let's fill in the form to create a data provider. The data provider component is used to interact with external frameworks such as Hadoop. You should set the necessary properties to make sure the provider correctly reads data from the underlying datasource. We will also create a data provider for MongoDB later in this book. You don't have to install anything special: the Cloudera VM used as the base for this example already carries all the necessary software, and Java JDK 1.7 is on board.

Property name                Value
Name                         hadoop-hunk-provider
Java home                    /usr/java/jdk1.7.0_67-cloudera
Hadoop home                  /usr/lib/hadoop
Hadoop version               Hadoop 2.x, (Yarn)
Filesystem                   hdfs://quickstart.cloudera:8020
Resource Manager Address     quickstart.cloudera:8032
Resource Scheduler Address   quickstart.cloudera:8030
HDFS Working Directory       /user/hunk
Job Queue                    default
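
Under the hood, the UI stores these settings as a provider stanza in $SPLUNK_HOME/etc/system/local/indexes.conf. The sketch below shows roughly what the generated stanza looks like, using Hunk's vix.* property convention; treat the exact key names as an approximation and verify them against the file Hunk actually writes:

```
[provider:hadoop-hunk-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/java/jdk1.7.0_67-cloudera
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://quickstart.cloudera:8020
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = quickstart.cloudera:8032
vix.yarn.resourcemanager.scheduler.address = quickstart.cloudera:8030
vix.splunk.home.hdfs = /user/hunk
vix.mapreduce.job.queuename = default
```

Editing this file by hand (followed by a restart) is equivalent to filling in the form, which is handy for scripted deployments.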

You don't have to modify any other properties. The HDFS working directory has been created for you in advance; if you need to create it yourself, use this command:

sudo -u hdfs hadoop fs -mkdir -p /user/hunk

You should see the provider configuration screen if you did everything correctly.

Let's discuss briefly what we have done:

  • We told Hunk where Hadoop home and Java are. Hunk uses Hadoop streaming internally, so it needs to know how to call Java and Hadoop streaming. If you inspect the jobs submitted by Hunk (discussed later), you will see lines like this:
    /opt/hunk/bin/jars/sudobash /usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"
    
  • This is the MapReduce JAR that Hunk submits. We also need to tell Hunk where the YARN resource manager and scheduler are located; these services let us request cluster resources and run jobs.
  • Job queues could be useful in a production environment. You could have several queues for cluster resource distribution in real life. We will set the queue name as default since we are not discussing cluster utilization and load balancing.

Setting up a virtual index for data stored in Hadoop

Now it's time to create a virtual index. We are going to add a dataset with AVRO files to the virtual index as example data. We will work with that index later in Chapter 6, Discovering Hunk Integration Apps.

  1. Click on Explore Data and then click on Create a virtual index on the next screen.
  2. You'll get a message to the effect that there are no indexes yet.
  3. Just click on New Virtual Index.

    A virtual index is metadata; it tells Hunk where data is located and what provider should be used to read that data. The virtual index goal is to declare access to data. That data could be structured or unstructured. A virtual index is immutable; you can only read data through that type of index. The data provider tells Hunk how to read data and the virtual index declares the data properties.

    Property name           Value
    Name                    milano_cdr_aggregated_10_min_activity
    Path to data in HDFS    /masterdata/stream/milano_cdr

  4. Here is an example of the screen you should see after creating the first virtual index.
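Like the provider, the virtual index ends up as a stanza in indexes.conf. A rough sketch of what the generated stanza looks like (key names approximate; check the file Hunk writes on your system):

```
[milano_cdr_aggregated_10_min_activity]
vix.provider = hadoop-hunk-provider
vix.input.1.path = /masterdata/stream/milano_cdr/...
```

Note how the stanza only declares where the data lives and which provider reads it, which matches the "virtual index is metadata" description above.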

Accessing data through a virtual index

  1. Click on Explore Data and select a provider and virtual index.
  2. Select part-m-00000.avro by clicking on it. The Next button will be activated after you pick a file.
  3. Preview the data at the Preview data step of the wizard. You should see how Hunk automatically formats the timestamp from our CDR data.

    Pay attention to the Time column and the field named time_interval in the Event column. The time_interval field holds the timestamp of the record. Hunk should automatically use it as the time field, which lets you run search queries over a time range; this is a typical pattern for reading time series data.

  4. Save the source type by clicking on Save as, then click on Next.
  5. At the Enter Context Settings step, choose the Application Context; then, under Sharing context, choose All apps and click on Next.
  6. The last step allows you to review what you've done.
  7. Click on Finish to complete the wizard.
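
With the virtual index in place, the time field extracted above makes time-bounded searches possible. A hypothetical SPL search against the new index might look like the following (the time range and span here are arbitrary examples, not values from the dataset):

```
index=milano_cdr_aggregated_10_min_activity earliest=-7d latest=now
| timechart span=10m count
```

This counts events in 10-minute buckets over the last seven days, relying on Hunk having mapped time_interval to the event timestamp as described in the preview step.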