Bucket aggregations

Like metric aggregations, bucket aggregations come in two forms: single-bucket aggregations, which return exactly one bucket in the response, and multi-bucket aggregations, which return more than one bucket.

The following are the most important aggregations that are used to create buckets:

  • Multi bucket aggregations
    • Terms aggregation
    • Range aggregation
    • Date range aggregation
    • Histogram aggregation
    • Date histogram aggregation
  • Single bucket aggregation
    • Filter-based aggregation

    Note

    We will cover a few more aggregations, such as nested and geo aggregations, in subsequent chapters.

The response format of a bucket aggregation differs from that of a metric aggregation. The response of a bucket aggregation usually comes in the following format:

"aggregations": {

      "aggregation_name": {
         "buckets": [
            {
               "key": value,
               "doc_count": value
            },
            ......
         ]
      }
   }
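Because every multi-bucket aggregation shares this shape, a response can be read with one small helper. The following is a minimal sketch, not part of any client library: iter_buckets and sample_response are hypothetical names, and the sample data is made up to mirror the format above.

```python
def iter_buckets(response, agg_name):
    """Yield (key, doc_count) pairs for one bucket aggregation in a response."""
    agg = response["aggregations"][agg_name]
    for bucket in agg["buckets"]:
        yield bucket["key"], bucket["doc_count"]


# Hypothetical response, shaped like the generic format shown above.
sample_response = {
    "aggregations": {
        "aggregation_name": {
            "buckets": [
                {"key": "a", "doc_count": 3},
                {"key": "b", "doc_count": 1},
            ]
        }
    }
}

pairs = list(iter_buckets(sample_response, "aggregation_name"))
```

The same helper should work on the dictionary returned by es.search() for any aggregation that returns a buckets list.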

Note

All the bucket aggregations can be created in Java using the AggregationBuilder and AggregationBuilders classes. To use them, import the following classes in your code:

import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;

Also, all the aggregation queries can be executed with the following code snippet:

SearchResponse response = client.prepareSearch(indexName).setTypes(docType)
  .setQuery(QueryBuilders.matchAllQuery())
  .addAggregation(aggregation)
  .execute().actionGet();

The setQuery() method can take any type of Elasticsearch query, whereas the addAggregation() method takes the aggregation built using AggregationBuilder.

Terms aggregation

The terms aggregation is the most widely used aggregation type. It builds buckets dynamically, one per unique value of the field.

Let's see how to find the top 10 hashtags used in our Twitter index in descending order.

Python example

query = {
  "aggs": {
    "top_hashtags": {
      "terms": {
        "field": "entities.hashtags.text",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

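In the preceding example, the size parameter controls how many buckets are returned (defaults to 10) and the order parameter controls how the buckets are sorted. Buckets can be ordered by document count (_count) or alphabetically by term (_term); when order is omitted, they are sorted by document count in descending order: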

res = es.search(index='twitter', doc_type='tweets', body=query)

The response would look like this:

 "aggregations": {
      "top_hashtags": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 44,
         "buckets": [
            {
               "key": "politics",
               "doc_count": 2
            },
            ….............
         ]
      }
   }
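To make the roles of size and sum_other_doc_count concrete, here is a small in-memory analogue written with collections.Counter. It is only a sketch under simplifying assumptions: terms_buckets and the hashtags list are hypothetical, ties are broken alphabetically, and none of the approximations Elasticsearch applies across shards are modelled.

```python
from collections import Counter


def terms_buckets(values, size=10):
    """Group values into one bucket per unique term, most frequent first."""
    counts = Counter(values)
    # Sort by count descending; break ties by term ascending (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    buckets = [{"key": term, "doc_count": count} for term, count in ordered[:size]]
    # Documents that belong to buckets cut off by the size limit.
    sum_other_doc_count = sum(count for _, count in ordered[size:])
    return buckets, sum_other_doc_count


# Hypothetical hashtag values, one per document.
hashtags = ["politics", "sports", "politics", "music", "sports", "politics"]
buckets, other = terms_buckets(hashtags, size=2)
```

With size set to 2, only the two most frequent hashtags are returned, and the single music document is reported through the leftover count.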

Java example

Terms aggregation can be built as follows:

AggregationBuilder aggregation =
        AggregationBuilders.terms("agg").field(fieldName)
        .size(10);

Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed.

Parsing the response:

To parse the terms aggregation response, you need to import the following class:

import org.elasticsearch.search.aggregations.bucket.terms.Terms;

Then, the response can be parsed with the following code snippet:

Terms screen_names = response.getAggregations().get("agg");
for (Terms.Bucket entry : screen_names.getBuckets()) {
  entry.getKey();      // Term
  entry.getDocCount(); // Doc count
}

Range aggregation

With the range aggregation, a user can specify a set of ranges, where each range represents a bucket. Elasticsearch places each document into the matching bucket by extracting the value from the document and comparing it against the specified ranges.

Python example

query = {
  "aggs": {
    "status_count_ranges": {
      "range": {
        "field": "user.statuses_count",
        "ranges": [
          {
            "to": 50
          },
          {
            "from": 50,
            "to": 100
          }
        ]
      }
    }
  },"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)

Note

For each range, the range aggregation includes the from value and excludes the to value.

The response for the preceding query request would look like this:

  "aggregations": {
      "status_count_ranges": {
         "buckets": [
            {
               "key": "*-50.0",
               "to": 50,
               "to_as_string": "50.0",
               "doc_count": 3
            },
            {
               "key": "50.0-100.0",
               "from": 50,
               "from_as_string": "50.0",
               "to": 100,
               "to_as_string": "100.0",
               "doc_count": 3
            }
         ]
      }
   }
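The from-inclusive, to-exclusive rule can be illustrated with plain Python. This is a sketch only: range_buckets and the statuses list are hypothetical, and the function mimics the bucketing rule rather than the real response format.

```python
def range_buckets(values, ranges):
    """Count values per range, including 'from' and excluding 'to'."""
    buckets = []
    for spec in ranges:
        lower = spec.get("from")  # None means unbounded below
        upper = spec.get("to")    # None means unbounded above
        count = sum(
            1
            for value in values
            if (lower is None or value >= lower)
            and (upper is None or value < upper)
        )
        buckets.append({"from": lower, "to": upper, "doc_count": count})
    return buckets


# Hypothetical user.statuses_count values, one per document.
statuses = [10, 49, 50, 99, 100]
result = range_buckets(statuses, [{"to": 50}, {"from": 50, "to": 100}])
```

Here the value 50 lands in the second bucket, and 100 matches neither range because the upper bound is excluded.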

Java example

Building range aggregation:

AggregationBuilder aggregation =
  AggregationBuilders
  .range("agg")
  .field(fieldName)
  .addUnboundedTo(1)    // from -infinity to 1 (excluded)
  .addRange(1, 100)  // from 1 to 100 (excluded)
  .addUnboundedFrom(100); // from 100 to +infinity

Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The addUnboundedTo method is used when you do not specify the from parameter and the addUnboundedFrom method is used when you don't specify the to parameter.

Parsing the response

To parse the range aggregation response, you need to import the following class:

import org.elasticsearch.search.aggregations.bucket.range.Range;

Then, the response can be parsed with the following code snippet:

Range agg = response.getAggregations().get("agg");
for (Range.Bucket entry : agg.getBuckets()) {
    String key = entry.getKeyAsString();       // Range as key
    Number from = (Number) entry.getFrom();   // Bucket from
    Number to = (Number) entry.getTo();       // Bucket to
    long docCount = entry.getDocCount();      // Doc count
}

Date range aggregation

The date range aggregation is dedicated to date fields and is similar to the range aggregation. The only difference between the two is that the date range aggregation allows you to use date math expressions inside the from and to fields. The following table shows examples of date math expressions in Elasticsearch. The supported time units for the math operations are: y (year), M (month), w (week), d (day), h (hour), m (minute), and s (second):

Operation            Description
now                  Current time
now+1h               Current time plus 1 hour
now-1M               Current time minus 1 month
now+1h+1m            Current time plus 1 hour plus 1 minute
now+1h/d             Current time plus 1 hour, rounded down to the nearest day
2016-01-01||+1M/d    2016-01-01 plus 1 month, rounded down to the nearest day
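The arithmetic behind a few of these expressions can be reproduced with Python's datetime module. This is only an illustration of the semantics, not how Elasticsearch evaluates date math: round_down_to_day is a hypothetical helper, now is pinned to a fixed instant so that the results are reproducible, and rounding down is assumed for the /d suffix.

```python
from datetime import datetime, timedelta


def round_down_to_day(moment):
    """Truncate a datetime to midnight of the same day (the assumed /d rounding)."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


# A fixed stand-in for "now", chosen so that adding an hour crosses midnight.
now = datetime(2016, 1, 1, 23, 30)

plus_1h = now + timedelta(hours=1)                 # now+1h
plus_1h_1m = now + timedelta(hours=1, minutes=1)   # now+1h+1m
plus_1h_day = round_down_to_day(plus_1h)           # now+1h/d
```

Note that the rounding is applied after the addition, so now+1h/d ends up on the following day in this example.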

Python example

query = {
  "aggs": {
    "tweets_creation_interval": {
      "date_range": {
        "field": "created_at",
        "format": "yyyy",
        "ranges": [
          {
            "to": "2000"
          },
          {
            "from": "2000",
            "to": "2005"
          },
          {
            "from": "2005"
          }
        ]
      }
    }
  },"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
print(res)

Java example

Building date range aggregation:

AggregationBuilder aggregation =
  AggregationBuilders
  .dateRange("agg")
  .field(fieldName)
  .format("yyyy")
  .addUnboundedTo("2000")   // from -infinity to 2000 (excluded)
  .addRange("2000", "2005")  // from 2000 to 2005 (excluded)
  .addUnboundedFrom("2005"); // from 2005 to +infinity

Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The addUnboundedTo method is used when you do not specify the from parameter and the addUnboundedFrom method is used when you don't specify the to parameter.

Parsing the response:

To parse the date range aggregation response, you need to import the following classes:

import org.elasticsearch.search.aggregations.bucket.range.Range;
import org.joda.time.DateTime;

Then, the response can be parsed with the following code snippet:

Range agg = response.getAggregations().get("agg");
for (Range.Bucket entry : agg.getBuckets()) {
  String key = entry.getKeyAsString();   // Date range as key
  DateTime fromAsDate = (DateTime) entry.getFrom(); // Date bucket from as a Date
  DateTime toAsDate = (DateTime) entry.getTo(); // Date bucket to as a Date
  long docCount = entry.getDocCount();         // Doc count
}

Histogram aggregation

A histogram aggregation works on numeric values extracted from documents and creates fixed-size buckets based on those values. Let's look at an example that creates buckets from users' favorite tweet counts:

Python example

query = {
  "aggs": {
    "favorite_tweets": {
      "histogram": {
        "field": "user.favourites_count",
        "interval": 20000
      }
    }
  },"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
for bucket in res['aggregations']['favorite_tweets']['buckets']:
    print(bucket['key'], bucket['doc_count'])

The response for the preceding query will look like the following. It says that 114 documents have a favorite count from 0 up to, but not including, 20000, and 8 documents fall into the next bucket, which starts at 20000:

"aggregations": {
      "favorite_tweets": {
         "buckets": [
            {
               "key": 0,
               "doc_count": 114
            },
            {
               "key": 20000,
               "doc_count": 8
            }
         ]
      }
   }

Note

While executing the histogram aggregation, each document's value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key; for example, if the favorite tweet count is 72 and the interval is set to 5, the document falls into the bucket with the key 70.
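The rounding described in this note amounts to a one-line formula. The sketch below uses a hypothetical helper, histogram_key, and assumes no offset is configured on the aggregation.

```python
import math


def histogram_key(value, interval):
    """Return the key of the bucket that a value falls into."""
    # Round down to the nearest multiple of the interval.
    return math.floor(value / interval) * interval


key_for_72 = histogram_key(72, 5)
key_for_74 = histogram_key(74, 5)
key_for_19999 = histogram_key(19999, 20000)
key_for_20000 = histogram_key(20000, 20000)
```

Both 72 and 74 map to the key 70, which shows that values are rounded down rather than moved to the closest bucket boundary.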

Java example

Building histogram aggregation:

AggregationBuilder aggregation =
        AggregationBuilders
        .histogram("agg")
        .field(fieldName)
        .interval(5);

Here, agg is the aggregation bucket name and fieldName is the field on which aggregation is performed. The interval method is used to pass the interval for generating the buckets.

Parsing the response:

To parse the histogram aggregation response, you need to import the following class:

import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;

Then, the response can be parsed with the following code snippet:

Histogram agg = response.getAggregations().get("agg");
for (Histogram.Bucket entry : agg.getBuckets()) {
  Long key = (Long) entry.getKey();       // Key
  long docCount = entry.getDocCount();    // Doc count
}

Date histogram aggregation

Date histogram is similar to the histogram aggregation but it can only be applied to date fields. The difference between the two is that date histogram allows you to specify intervals using date/time expressions.

The following values can be used for intervals:

  • year, quarter, month, week, day, hour, minute, and second

You can also specify intervals using time-unit abbreviations, such as 1h (1 hour), 1m (1 minute), and so on.

Date histograms are mostly used to generate time-series graphs.

Python example

query = {
      "aggs": {
        "tweet_histogram": {
          "date_histogram": {
            "field": "created_at",
            "interval": "hour"
          }
        }
      }, "size": 0
    }

The preceding aggregation will generate an hourly-based tweet timeline on the field, created_at:

res = es.search(index='twitter', doc_type='tweets', body=query)
for bucket in res['aggregations']['tweet_histogram']['buckets']:
    print(bucket['key'], bucket['key_as_string'], bucket['doc_count'])
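The hourly bucketing itself can be imitated in memory by truncating each timestamp to the start of its hour. This is a rough sketch: hourly_buckets and the tweets list are hypothetical, and only non-empty hours are emitted, whereas Elasticsearch may also return empty buckets depending on the min_doc_count setting.

```python
from collections import Counter
from datetime import datetime


def hourly_buckets(timestamps):
    """Count timestamps per hour, returning buckets in chronological order."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return [
        {"key_as_string": hour.isoformat(), "doc_count": count}
        for hour, count in sorted(counts.items())
    ]


# Hypothetical created_at values, one per tweet.
tweets = [
    datetime(2016, 1, 1, 10, 5),
    datetime(2016, 1, 1, 10, 59),
    datetime(2016, 1, 1, 12, 0),
]
timeline = hourly_buckets(tweets)
```

The two tweets posted during the 10 o'clock hour share one bucket, and the resulting list can be plotted directly as a time series.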

Java example

Building date histogram aggregation:

AggregationBuilder aggregation =
        AggregationBuilders
        .dateHistogram("agg")
        .field(fieldName)
        .interval(DateHistogramInterval.YEAR);

Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The interval method is used to pass the interval to generate buckets. For an interval in days, you can use DateHistogramInterval.days(10).

Parsing the response:

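To build and parse the date histogram aggregation, you need to import the following classes:

import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;
import org.joda.time.DateTime;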

The response can be parsed with this code snippet:

Histogram agg = response.getAggregations().get("agg");
for (Histogram.Bucket entry : agg.getBuckets()) {
  DateTime key = (DateTime) entry.getKey();    // Key
  String keyAsString = entry.getKeyAsString(); // Key as String
  long docCount = entry.getDocCount();         // Doc count
}

Filter-based aggregation

Elasticsearch allows filters to be used as aggregations too. Filters keep their behavior in the aggregation context and are usually used to narrow down the current aggregation context to a specific set of documents. You can use any filter, such as range, term, geo, and so on.

To get the count of all the tweets posted by the user d_bharvi, use the following code:

Python example

query = {
  "aggs": {
    "screename_filter": {
      "filter": {
        "term": {
          "user.screen_name": "d_bharvi"
        }
      }
    }
  },"size": 0
}

In the preceding request, we have used a term filter to narrow down the bucket to the tweets posted by a particular user:

res = es.search(index='twitter', doc_type='tweets', body=query)
print(res['aggregations']['screename_filter']['doc_count'])

The response would look like this:

 "aggregations": {
      "screename_filter": {
         "doc_count": 100
      }
   }
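Conceptually, a filter aggregation is a count over the documents that match the filter, which is why the response has a single doc_count and no buckets list. The sketch below imitates this in memory. It is an assumption-laden illustration: get_path, filter_aggregation, and the docs list are hypothetical, and matching is a plain equality check that ignores how Elasticsearch analyzes and indexes field values.

```python
def get_path(doc, dotted_field):
    """Resolve a dotted field name such as 'user.screen_name' in a nested dict."""
    current = doc
    for part in dotted_field.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def filter_aggregation(docs, field, value):
    """Return a single bucket holding the number of documents that match."""
    matching = [doc for doc in docs if get_path(doc, field) == value]
    return {"doc_count": len(matching)}


# Hypothetical tweet documents.
docs = [
    {"user": {"screen_name": "d_bharvi"}},
    {"user": {"screen_name": "someone_else"}},
    {"user": {"screen_name": "d_bharvi"}},
]
single_bucket = filter_aggregation(docs, "user.screen_name", "d_bharvi")
```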

Java example

Building filter-based aggregation:

AggregationBuilder aggregation =
  AggregationBuilders
  .filter("agg")
  .filter(QueryBuilders.termQuery("user.screen_name", "d_bharvi"));

Here, the first filter method takes agg, the aggregation bucket name, and the second filter method takes the query to apply as the filter.

Parsing the response:

To parse a filter-based aggregation response, you need to import the following class:

import org.elasticsearch.search.aggregations.bucket.filter.Filter;

The response can be parsed with the following code snippet:

Filter agg = response.getAggregations().get("agg");
agg.getDocCount(); // Doc count