Hunk report acceleration

Hunk can accelerate our searches, which matters when reports have to come back quickly. The idea behind acceleration is simple: the same search over the same data always gives the same result, so Hunk caches the results and returns them on demand. We can also choose the time range that a particular data summary covers. If the data changes because a fresh portion of events arrives, the accelerated report rebuilds the summary so that it still covers the chosen range. Technically, Hunk caches the output of the map phase in HDFS, and when we run the accelerated search it returns the cached results directly instead of recomputing them. There are four main steps in running an accelerated search:

  1. A scheduled job builds the cache.
  2. The search finds cache hits.
  3. The cached results are streamed to the search head.
  4. The search head runs the reduce phase.
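
The four steps can be pictured with a small Python sketch. This is only an illustration of the caching idea, not Hunk's actual implementation: the function names, the in-memory dictionary standing in for HDFS, and the sample events are all assumptions made for the example.

```python
import hashlib

def cache_key(search, split):
    """Same search + same data split -> same key (hypothetical keying scheme)."""
    return hashlib.sha256(f"{search}\0{split}".encode("utf-8")).hexdigest()

def run_map(events):
    """Stand-in for the map phase: count events per category."""
    counts = {}
    for event in events:
        counts[event["category"]] = counts.get(event["category"], 0) + 1
    return counts

def build_cache(search, splits, cache):
    """Step 1: a scheduled job runs the map phase and stores its output."""
    for split, events in splits.items():
        cache[cache_key(search, split)] = run_map(events)

def accelerated_search(search, splits, cache):
    """Steps 2-4: find cache hits, stream them, reduce on the search head."""
    mappers_run = 0
    partials = []
    for split, events in splits.items():
        key = cache_key(search, split)
        if key in cache:                 # step 2: cache hit, no mapper needed
            partial = cache[key]
        else:                            # cache miss: the mapper has to run
            partial = run_map(events)
            mappers_run += 1
        partials.append(partial)         # step 3: stream to the search head
    totals = {}
    for partial in partials:             # step 4: reduce on the search head
        for category, count in partial.items():
            totals[category] = totals.get(category, 0) + count
    return totals, mappers_run

# Hypothetical data: two splits of an orders data set.
SEARCH = 'index="orders" | top category'
splits = {
    "part-0": [{"category": "books"}, {"category": "toys"}],
    "part-1": [{"category": "books"}],
}
cache = {}  # a dict standing in for the cache directory in HDFS

cold_result, cold_mappers = accelerated_search(SEARCH, splits, cache)
build_cache(SEARCH, splits, cache)
warm_result, warm_mappers = accelerated_search(SEARCH, splits, cache)
```

Both runs return the same totals, but after the cache is built no mapper has to run, which is where the saving comes from.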

Tip

There is more information about search heads at: http://docs.splunk.com/Splexicon:Searchhead.

The acceleration feature gives us fast reports on both structured and unstructured data, and fast dashboards built on those reports, which improves the user experience. Moreover, when the mapper results are already cached, Hunk streams them straight to the search head and no mappers need to run, which saves cluster resources.

Creating a virtual index

Let's create a new index based on fake data in order to learn about report acceleration:

  1. Go to Settings | Virtual indexes.
  2. Click on New virtual index.
  3. Enter orders in the Name field.
  4. Enter /staging/orders in the Path to data in HDFS field and click on Save:
    [Screenshot: Creating a virtual index]
  5. Then we can start to explore the data; click on Explore Data.
  6. Select hadoop-Hunk-provider and orders in the Provider and Virtual index fields respectively. Click on Next.
  7. Select the orders.txt file and click on Next.
  8. In Preview data, we should choose the appropriate source type or create our own. By default, Hunk can't find the timestamp in our data set, so we need to help it identify one.
  9. Click on Timestamp and enter timestamp in the Timestamp prefix field; Hunk will then recognize the events in our data set. Click on Save As and enter onlineorders as the source type name. In addition, choose Search and Reporting as the default app. Then click on Next.
  10. In the application context, choose Search, click on Next, and then click Finish.
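
The timestamp prefix from step 9 tells Hunk where in each event to start looking for the timestamp. The following Python sketch mimics that idea only; the event layout and the timestamp format are assumptions (the contents of orders.txt are not shown here), and Hunk's own timestamp recognition handles many more formats.

```python
import re
from datetime import datetime

TIME_PREFIX = "timestamp"  # the value typed into the Timestamp prefix field

# Hypothetical event; the real layout of orders.txt may differ.
event = '{"order_id": 1, "timestamp": "2015-04-29 14:05:24", "items": []}'

def extract_timestamp(raw, prefix=TIME_PREFIX):
    """Find the prefix, then parse the first timestamp that follows it."""
    match = re.search(prefix, raw)
    if match is None:
        return None  # no prefix, so no timestamp can be located
    tail = raw[match.end():]
    # Assumed format: YYYY-MM-DD HH:MM:SS
    stamp = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", tail)
    if stamp is None:
        return None
    return datetime.strptime(stamp.group(0), "%Y-%m-%d %H:%M:%S")

event_time = extract_timestamp(event)
```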

Hunk displays a message confirming that our new configuration has been saved. Now we can start to explore our data.

Streaming mode

Let's look at events from our new index. Just run a search: index="orders".

Hunk starts returning events before the MapReduce job finishes. This is very useful because we save a lot of time and can spot incorrect data right at the beginning. The following screenshot shows this:

[Screenshot: Streaming mode]

At the beginning, Hunk returns 12,347 of the 118,544 events and then continues to return the rest. That's how streaming mode works, but it still takes some time to get the complete result. Let's build a new query and accelerate it in order to see the difference.
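
Streaming can be sketched in Python with a generator that hands over each batch as soon as it is ready. The event counts come from the text above; using the size of the first partial result as a fixed batch size is an assumption made purely for illustration, and the placeholder events are not real order data.

```python
def stream_events(events, batch_size):
    """Yield batches as soon as they are ready instead of waiting for all of them."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]

TOTAL_EVENTS = 118_544   # total number of events, from the text
FIRST_BATCH = 12_347     # size of the first partial result, from the text

events = list(range(TOTAL_EVENTS))  # placeholder events

seen = 0
progress = []  # running total after each batch reaches the "search head"
for batch in stream_events(events, FIRST_BATCH):
    seen += len(batch)
    progress.append(seen)
```

The consumer can already inspect the first batch while the remaining ones are still being produced, which is what lets us notice bad data early.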

Creating an acceleration search

Let's create another search that returns the top categories from the orders index:

index="orders" | top "items{}.category" | rename "items{}.category" as Category | sort -Category | eval message = "Hello World"

It looks silly, but it takes more than a minute to run on my laptop. Save it as a report: click on Save As | Report and name it Top Category.
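
To make it clearer what this pipeline computes, here is a rough Python equivalent. It is a sketch under assumptions: the category values are made up, and only the behavior visible in the search string is modeled (top returns the 10 most frequent values with a count and a percent, sort -Category sorts in descending order, and eval adds a constant field).

```python
from collections import Counter

def top(values, limit=10):
    """Most frequent values with their count and share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        {"Category": value, "count": count,
         "percent": round(100.0 * count / total, 2)}
        for value, count in counts.most_common(limit)
    ]

# Hypothetical category values extracted from the orders events.
categories = ["books", "toys", "books", "games", "books", "toys"]

rows = top(categories)                                 # | top ... | rename ... as Category
rows.sort(key=lambda r: r["Category"], reverse=True)   # | sort -Category
for row in rows:
    row["message"] = "Hello World"                     # | eval message = "Hello World"
```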

Tip

In order to accelerate a report, a report must have an underlying search that uses a transforming command (such as chart, timechart, stats, and top). In addition, any search commands before the first transforming command in the search string need to be streaming commands. (Nonstreaming commands are allowed after the first transforming command.)
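
The rule in this tip can be expressed as a small check. The sketch below is a simplification: the transforming set contains only the commands named in the tip, the streaming set is a short illustrative list rather than a complete one, and the parser splits on every pipe character without handling pipes inside quotes.

```python
TRANSFORMING = {"chart", "timechart", "stats", "top"}               # examples from the tip
STREAMING = {"search", "eval", "rename", "fields", "where", "rex"}  # illustrative subset

def command_names(search):
    """Split on pipes and take the first word of each segment (ignores quoting)."""
    names = []
    for position, segment in enumerate(search.split("|")):
        segment = segment.strip()
        if not segment:
            continue
        first = segment.split()[0]
        # A leading segment such as index="orders" is an implicit `search`.
        if position == 0 and first not in TRANSFORMING | STREAMING:
            first = "search"
        names.append(first)
    return names

def can_accelerate(search):
    """True if a transforming command exists and only streaming commands precede it."""
    names = command_names(search)
    for position, name in enumerate(names):
        if name in TRANSFORMING:
            return all(n in STREAMING for n in names[:position])
    return False

report = ('index="orders" | top "items{}.category" '
          '| rename "items{}.category" as Category '
          '| sort -Category | eval message = "Hello World"')
```

Our Top Category search passes the check because the only command before top is the implicit search; the sort that follows top does not matter.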

Now we can accelerate this report:

  1. Go to Reports in the Search app.
  2. Click on Edit in the Actions column and click on Edit Acceleration:
    [Screenshot: Creating an acceleration search]
  3. A new window, Edit Acceleration, will appear.
  4. Check Accelerate Report, choose Summary Range | All Time, and click on Save:
    [Screenshot: Creating an acceleration search]

Tip

When we enable acceleration for our report, Hunk will begin to build a report acceleration summary for it.

We have enabled acceleration for our report, and Hunk needs some time to build the summary. Once it is ready, let's run the accelerated report; it takes around 7 seconds on my laptop. We can go to the job monitor and compare the results. Go to Settings | Job and compare the run times:

[Screenshot: Creating an acceleration search]

In my case, it is 7 seconds versus 1 minute and 27 seconds.
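
The arithmetic behind this comparison is simple, as the following lines show. The two timings are the ones measured on my laptop; yours will differ.

```python
baseline_seconds = 1 * 60 + 27   # unaccelerated run: 1 minute 27 seconds
accelerated_seconds = 7          # accelerated run

speedup = baseline_seconds / accelerated_seconds
time_saved = baseline_seconds - accelerated_seconds

print(f"{speedup:.1f}x faster, {time_saved} s saved per run")
# prints: 12.4x faster, 80 s saved per run
```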

It's an amazing result. Let's see how to manage report acceleration and review our accelerated reports.

What's going on in Hadoop?

As we learnt before, a cache is created in HDFS after the map phase. To be more precise, it is stored in <vix.splunk.home.hdfs>/cache, and we can use Hue to look at it in the file system. To see the job that builds the cache, open the Hue job browser (https://quickstart.cloudera:8888/jobbrowser/) and find the job whose name begins with SPLK_quickstart.cloudera_scheduler__admin__search. In its logs we can find a trace of the cache being built, as follows:

2015-04-29 14:05:24,256 INFO [main] com.splunk.mr.SplunkBaseMapper: Pushed records=118544, bytes=116267468 to search process ...
2015-04-29 14:05:25,245 INFO [main] com.splunk.mr.SplunkSearchMapper: Processed records=118544, bytes=116267468, return_code=0, elapsed_ms=10078

In other words, all of the work is done in Hadoop.
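
The counters in the SplunkSearchMapper log line are easy to pull out if we want to compare mapper runs. The Python sketch below parses that one line and derives the throughput; the regular expression simply collects every key=number pair and is not tied to any official log format.

```python
import re

# The second log line from the job shown above.
LOG_LINE = (
    "2015-04-29 14:05:25,245 INFO [main] com.splunk.mr.SplunkSearchMapper: "
    "Processed records=118544, bytes=116267468, return_code=0, elapsed_ms=10078"
)

def parse_counters(line):
    """Collect every key=integer pair on the line into a dictionary."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}

counters = parse_counters(LOG_LINE)

seconds = counters["elapsed_ms"] / 1000.0
records_per_second = counters["records"] / seconds
megabytes_per_second = counters["bytes"] / seconds / 1_000_000
```

For this run the mapper processed all 118,544 events in roughly 10 seconds.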

Report acceleration summaries

Once we have created an accelerated search, we can open the page that lists our accelerated searches and their statistics. To find it, click on Settings | Report Acceleration Summaries:

[Screenshot: Report acceleration summaries]

Here we can see general information about the report summaries. Let's learn more about the columns:

| Name of Column | Definition |
| Summary ID, Normalized Summary ID | The unique hashes that Hunk assigns to summaries. The IDs are derived from the remote search string for the report. They are used as part of the directory name that Hunk creates for the summary files. |
| Report Using Summary | The name of the report. |
| Summarization Load | This is calculated by dividing the number of seconds it takes to run the populating report by the interval of the populating report. |
| Access Count | This shows whether the summary is rarely used or hasn't been used in a long time. |
| Summary Status | This is the general state of the summary and tells you when it was last updated with new data. Possible status values are complete, pending, suspended, not enough data to summarize, or the percentage of the summary that is complete at the moment. |

If we need summary details or want to perform summary management actions, we can click on a summary ID or normalized summary ID.
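
The Summarization Load formula from the table is a plain ratio, as this Python sketch shows. The input numbers are assumptions chosen for illustration: an 87-second populating report that is scheduled every 10 minutes.

```python
def summarization_load(run_seconds, interval_seconds):
    """Seconds spent running the populating report divided by its interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return run_seconds / interval_seconds

# Hypothetical values: the report takes 87 s and runs every 600 s.
load = summarization_load(87, 600)
```

A value close to 0 means the summary is cheap to maintain, while a value near 1 means the populating report is running almost all the time.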

Reviewing summary details

Let's click on one of the IDs in our summary for the top categories:

[Screenshot: Reviewing summary details]

Under Summary Status, we can see the status information for the summary. It shows the same information as the previous page, plus the verification status of the summary. We can also bring a summary up to date by clicking Update, which kicks off a new summary-populating report.

Tip

If the Size value stays at 0.00 MB, it means that Hunk is not currently generating this summary, either because the reports associated with it don't have enough events (at least 100k hot bucket events are required) or because the projected summary size is over 10 percent of the bucket with which the report is associated. Hunk will periodically check this report and automatically create a summary for it when it meets the criteria for summary creation. In any case, our search now runs faster.
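
The two criteria in this tip can be written down as a simple predicate. This is only a sketch of the rule as stated above, not Hunk's internal check; the function name and the sample sizes are assumptions.

```python
MIN_EVENTS = 100_000        # at least 100k events are required
MAX_SUMMARY_RATIO = 0.10    # summary may be at most 10 percent of the bucket

def summary_will_build(event_count, projected_summary_bytes, bucket_bytes):
    """Apply both criteria from the tip to one report."""
    enough_events = event_count >= MIN_EVENTS
    small_enough = projected_summary_bytes <= MAX_SUMMARY_RATIO * bucket_bytes
    return enough_events and small_enough

# Event and byte counts for the orders data; the projected sizes are hypothetical.
ok = summary_will_build(118_544, 5_000_000, 116_267_468)
too_few = summary_will_build(50_000, 5_000_000, 116_267_468)
too_big = summary_will_build(118_544, 20_000_000, 116_267_468)
```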

Managing report accelerations

The report acceleration files for the Hadoop ERP are stored in HDFS. By default, we can find the files in <vix.splunk.home.hdfs>/cache. Let's check this via Hue.

Go to Hue (http://quickstart.cloudera:8888/filebrowser/) and click on File Browser. Go to user | hunk | cache:

[Screenshot: Managing report accelerations]

Here we can find a file that has information about the cache. Go to orders and find info.json, which has information about our summary:

{"index":"orders","search":"search (index=orders) | addinfo type=count label=prereport_events | fields keepcolorder=t \"cvp_reserved_count\" \"items{}.category\" | pretop 10 \"items{}.category\"","summary_id":"D2DB9B94-7D5E-47B2-B9F9-D419A67CAC05_search_admin_NS6c17c90570bf72ce"}
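
If we want to inspect this file programmatically, a few lines of Python are enough. The sketch assumes that the quotation marks inside the search string are escaped with backslashes in the actual file, as valid JSON requires, and it reads the content from a string rather than from HDFS.

```python
import json

# Content of info.json, with the inner quotes escaped (assumption).
INFO_JSON = r'''{"index":"orders","search":"search (index=orders) | addinfo type=count label=prereport_events | fields keepcolorder=t \"cvp_reserved_count\" \"items{}.category\" | pretop 10 \"items{}.category\"","summary_id":"D2DB9B94-7D5E-47B2-B9F9-D419A67CAC05_search_admin_NS6c17c90570bf72ce"}'''

info = json.loads(INFO_JSON)

# First word of every pipeline stage in the cached search.
commands = [part.strip().split()[0] for part in info["search"].split("|")]

print(info["index"])        # prints: orders
print(commands)             # prints: ['search', 'addinfo', 'fields', 'pretop']
```

The search stored here is not the one we typed: it ends with pretop rather than top, which suggests that it is the part of the report that runs in the map phase.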