Chapter 4. Aggregations for Analytics

Elasticsearch is a search engine at its core, but much of its usefulness comes from its ability to perform complex data analytics in a simple way. Data volumes are growing rapidly, and companies want to analyze their data in real time. Whether the data comes from logs, real-time streams, or static datasets, Elasticsearch can summarize it through its aggregation capabilities.

In this chapter, we will cover the following topics:

  • Introducing the aggregation framework
  • Metric and bucket aggregations
  • Combining search, buckets, and metrics
  • Memory pressure and implications

Introducing the aggregation framework

The aggregation functionality is completely different from search and enables you to ask sophisticated questions of the data. Its use cases range from building analytical reports to analyzing data in real time and acting on the results quickly.

Despite being different in functionality, aggregations can run alongside regular search requests. You can search or filter your data and, in the same request, aggregate over the documents matched by the search or filter criteria. A simple example is finding the maximum number of hashtags used by users in tweets that have crime in the text field. Aggregations let you calculate and summarize data about the current query on the fly, and they can be used for all sorts of tasks, from dynamic counting of result values to building a histogram.
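As a minimal sketch of the hashtag example above, the following Python snippet builds a request body that combines a query with an aggregation. The field names text and hashtag_count and the aggregation name max_hashtags are assumptions made for illustration; the snippet only builds and prints the JSON body and does not contact a cluster.

```python
import json

def build_crime_hashtag_request():
    """Build a search body that matches tweets and aggregates over the matches."""
    return {
        # Only tweets whose (assumed) text field matches "crime" are aggregated.
        "query": {"match": {"text": "crime"}},
        "aggs": {
            # "max_hashtags" is a user-chosen name; "hashtag_count" is a
            # hypothetical numeric field holding the hashtags per tweet.
            "max_hashtags": {"max": {"field": "hashtag_count"}}
        },
    }

crime_request = build_crime_hashtag_request()
print(json.dumps(crime_request, indent=2))
```

Because the query and the aggregation travel in the same body, the max metric is computed only over the documents that the query matches.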

Aggregations come in two flavors: metrics and buckets.

  • Metrics: Metrics compute statistics, such as min, max, or average, on a field of the documents that meet certain criteria. An example of a metric is finding the highest follower count among all users.
  • Buckets: Buckets are simply groups of documents that meet certain criteria. They are used to categorize documents, for example:
    • Loans can fall into the buckets of home loan or personal loan
    • Employees can fall into the buckets of male or female

Elasticsearch offers a wide variety of buckets to categorize documents in many ways such as by days, age range, popular terms, or locations. However, all of them work on the same principle: document categorization based on some criteria.

The most interesting part is that bucket aggregations can be nested within each other. This means that a bucket can contain other buckets within it. Since each of the buckets defines a set of documents, one can create another aggregation on that bucket, which will be executed in the context of its parent bucket. For example, a country-wise bucket can include a state-wise bucket, which can further include a city-wise bucket.
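The nesting idea can be illustrated outside Elasticsearch with a few lines of plain Python. This is only a conceptual sketch with hypothetical sample documents and field names, not how Elasticsearch implements buckets: each level groups the documents of its parent bucket, and the innermost level reports a document count.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
docs = [
    {"country": "India", "state": "Karnataka", "city": "Bengaluru"},
    {"country": "India", "state": "Karnataka", "city": "Mysuru"},
    {"country": "India", "state": "Kerala", "city": "Kochi"},
    {"country": "USA", "state": "Texas", "city": "Austin"},
]

def bucket_by(documents, field):
    """Group documents into buckets keyed by the value of `field`."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[field]].append(doc)
    return dict(buckets)

def nested_buckets(documents, fields):
    """Bucket by the first field, then bucket each bucket's documents by the rest."""
    if not fields:
        return len(documents)  # leaf: document count of the innermost bucket
    first, rest = fields[0], fields[1:]
    return {
        key: nested_buckets(members, rest)
        for key, members in bucket_by(documents, first).items()
    }

tree = nested_buckets(docs, ["country", "state", "city"])
print(tree)
```

Each inner call sees only the documents of its parent bucket, which mirrors how a sub-aggregation is executed in the context of the bucket that contains it.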

Note

In SQL terms, metrics are simply functions such as MIN(), MAX(), SUM(), COUNT(), and AVG(), whereas buckets group the results much as a GROUP BY clause does.

Aggregation syntax

Aggregations use the following syntax:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

Let's understand how the preceding structure works:

  • aggregations: The aggregations object (the key can also be shortened to aggs) in the preceding structure holds the aggregations that have to be computed. There can be more than one aggregation inside this object.
  • <aggregation_name>: This is a user-defined logical name for an aggregation held by the aggregations object (for example, if you want to compute the average age of users in the index, it makes sense to name the aggregation avg_age). These logical names are also used to uniquely identify the aggregations in the response.
  • <aggregation_type>: Each aggregation has a specific type, for example, terms, sum, avg, min, and so on.
  • <aggregation_body>: Each type of aggregation defines its own body depending on the nature of the aggregation (for example, an avg aggregation on a specific field will define the field on which the average will be calculated).
  • <sub_aggregation>: The sub aggregations are defined on the bucketing aggregation level and are computed for all the buckets built by the bucket aggregation. For example, if you define a set of aggregations under the range aggregation, the sub aggregations will be computed for the range buckets that are defined.

Look at the following JSON structure for a simpler view of how aggregations are laid out:

{
  "aggs": {
    "NAME1": {
      "AGG_TYPE": {},
      "aggs": {
        "NAME": {
          "AGG_TYPE": {}
        }
      }
    },
    "NAME2": {
      "AGG_TYPE": {}
    }
  }
}
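To make the placeholders concrete, here is a small Python sketch that fills in the skeleton with real aggregation types. The field names gender and age and the aggregation names are hypothetical; the snippet only builds and prints the request body.

```python
import json

def build_example_aggs():
    """Fill the NAME / AGG_TYPE skeleton with concrete aggregations."""
    return {
        "aggs": {
            "by_gender": {                      # <aggregation_name> (bucket)
                "terms": {"field": "gender"},   # <aggregation_type> with its body
                "aggs": {                       # sub-aggregation, run per bucket
                    "avg_age": {"avg": {"field": "age"}}
                },
            },
            "max_age": {                        # sibling aggregation at top level
                "max": {"field": "age"}
            },
        }
    }

example_aggs = build_example_aggs()
print(json.dumps(example_aggs, indent=2))
```

Here avg_age would be computed once for every bucket produced by by_gender, while max_age is computed once over all matched documents.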

Extracting values

Aggregations typically work on values extracted from the aggregated document set. These values can be extracted either from a specific field, using the field key inside the aggregation body, or by using a script.

While it's easy to define a field to be used to aggregate data, the syntax of using scripts needs some special understanding. The benefit of using scripts is that one can combine the values from more than one field to use as a single value inside an aggregation.

Note

Scripts require considerably more computation than plain field access and can slow down aggregations on large datasets.

The following are examples of extracting values using a script:

Extracting a value from a single field:

{ "script" : "doc['field_name'].value" }

Extracting and combining values from more than one field:

{ "script" : "doc['author.first_name'].value + ' ' + doc['author.last_name'].value" }

Scripts also support parameters, which are passed using the params key. For example:

{
  "avg": {
    "field": "price",
    "script": {
      "inline": "_value * correction",
      "params": {
        "correction": 1.5
      }
    }
  }
}

The preceding aggregation calculates the average price after multiplying each value of the price field by 1.5, which is passed to the inline script as the correction parameter.
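The arithmetic behind this aggregation can be sketched in plain Python. The price values below are made up for illustration, and the function only mimics what the value script does conceptually: each value is multiplied by the correction before the average is taken.

```python
def scripted_avg(values, correction):
    """Average the values after applying `_value * correction` to each one."""
    adjusted = [value * correction for value in values]
    return sum(adjusted) / len(adjusted)

# Hypothetical values of the price field across three documents.
prices = [10.0, 20.0, 30.0]
corrected_avg = scripted_avg(prices, correction=1.5)
print(corrected_avg)  # 30.0, that is (15.0 + 30.0 + 45.0) / 3
```

Without the script, the same three documents would yield an average of 20.0, so the correction scales the result by the same factor of 1.5.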

Returning only aggregation results

When no query is specified, Elasticsearch computes aggregations over the complete set of documents using the match_all query. By default, it also returns 10 documents along with the aggregation results.

If you do not want the documents in the response, set the size parameter to 0 in your request. Note that you do not need to use the from parameter in this case. This is a very useful setting because it avoids document relevancy calculation and the inclusion of documents in the response, so only the aggregated data is returned.
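A minimal sketch of such a request body, built in Python, is shown next. The helper name aggregation_only and the field name price are assumptions for illustration; the snippet only constructs and prints the body.

```python
import json

def aggregation_only(aggs, query=None):
    """Build a request body that returns aggregation results but no documents."""
    body = {"size": 0, "aggs": aggs}  # size 0 suppresses the document hits
    if query is not None:
        body["query"] = query         # optional: restrict the aggregated set
    return body

agg_only_request = aggregation_only({"avg_price": {"avg": {"field": "price"}}})
print(json.dumps(agg_only_request, indent=2))
```

When the query is omitted, as in this usage, the aggregation runs over all documents, which matches the default match_all behavior described earlier.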
