Terms aggregation

Terms aggregation is probably the most widely used aggregation. It is useful for segmenting or grouping the data by a given field's distinct values. Suppose that, in the network traffic data example that we have loaded, we have the following question:

Which are the top categories, that is, categories that are surfed the most by users?

We are interested in the most surfed categories – not in terms of the bandwidth used, but just in terms of counts (record counts). In a relational database, we could write a query like the following:

SELECT category, count(*) FROM usageReport GROUP BY category ORDER BY count(*) DESC;
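
To make the grouping logic concrete, the following is a minimal Python sketch of what that SQL statement computes. The records list and the count_by_category helper are made up for illustration; they are not part of the dataset or of any Elasticsearch API.

```python
from collections import Counter

# Hypothetical sample records; only the "category" field matters here.
records = [
    {"category": "Chat"},
    {"category": "News"},
    {"category": "Chat"},
    {"category": "Email"},
    {"category": "Chat"},
    {"category": "News"},
]


def count_by_category(rows):
    """Group rows by category and return (category, count) pairs,
    ordered by count from highest to lowest."""
    counts = Counter(row["category"] for row in rows)
    return counts.most_common()


for category, count in count_by_category(records):
    print(category, count)
```

Each (category, count) pair corresponds to one row of the SQL result, and, as we will see, to one bucket of a terms aggregation.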

The Elasticsearch aggregation query, which would do a similar job, can be written as follows:

GET /bigginsight/_search
{
  "aggs": { (1)
    "byCategory": { (2)
      "terms": { (3)
        "field": "category" (4)
      }
    }
  },
  "size": 0 (5)
}

Let's look at the terms of the aggregation query here. Notice the numbers that refer to different parts of the query:

  • The aggs or aggregations element at the top level should wrap any aggregation.
  • Give a name to the aggregation. Here, we are doing terms aggregation by the category field, and hence, the name we chose is byCategory.
  • We are doing a terms aggregation, and hence, we have the terms element.
  • We want to do a terms aggregation on the category field.
  • Specify "size": 0 to prevent raw search results from being returned. Since we haven't specified any top-level query element, the request matches all documents, but in this case we want only the aggregation results and not the raw documents (search hits).

The response looks like the following:

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000, (1)
      "relation": "gte"
    },
    "max_score": 0,
    "hits": [] (2)
  },
  "aggregations": { (3)
    "byCategory": { (4)
      "doc_count_error_upper_bound": 0, (5)
      "sum_other_doc_count": 0, (6)
      "buckets": [ (7)
        {
          "key": "Chat", (8)
          "doc_count": 52277 (9)
        },
        {
          "key": "File Sharing",
          "doc_count": 46912
        },
        {
          "key": "Other HTTP",
          "doc_count": 38535
        },
        {
          "key": "News",
          "doc_count": 25784
        },
        {
          "key": "Email",
          "doc_count": 21003
        },
        {
          "key": "Gaming",
          "doc_count": 19578
        },
        {
          "key": "Jobs",
          "doc_count": 19429
        },
        {
          "key": "Blogging",
          "doc_count": 19317
        }
      ]
    }
  }
}

Please note the following in the response, and notice the numbers that are annotated as well:

  • The total element under hits (we will refer to this as hits.total, navigating the path from the top JSON element) reports a value of 10000 with the relation gte, which means that at least 10,000 documents matched. This is a lower bound rather than the exact count; the aggregation itself still runs over every matching document. As we mentioned previously, if you want the exact total hits to be returned, you need to pass an extra parameter in the request (track_total_hits set to true).
  • The hits.hits array is empty. This is because we specified "size": 0, so as to not include any search hits here. What we were interested in was the aggregations, and not the search results.
  • The aggregations element at the top level in the JSON response contains all the aggregation results.
  • The name of the aggregation is byCategory. This is the name that we gave to this terms aggregation. The name helps us to relate the response to the request, since a single request can contain several aggregations.
  • doc_count_error_upper_bound is an upper bound on the error in the reported document counts. Data is distributed across shards; if each shard sent counts for all of its bucket keys, too much data would be sent across the network. Instead, each shard sends only its top buckets, and the number it sends is derived from the size parameter of the bucket aggregation. A term that ranks low on one shard may therefore be left out of that shard's list, which makes the merged counts approximate. A value of 0, as here, means that the counts are exact. We will look at the bucket aggregation's size parameter later in this chapter.
  • sum_other_doc_count is the total count of documents that are not included in the buckets that are returned. By default, the terms aggregation returns the top 10 buckets if there are more than 10 distinct buckets. The documents that fall outside these 10 buckets are counted, and the total is returned in this field. In this case, there are only eight categories, and hence, this field is set to zero.
  • buckets is the list of buckets returned by the aggregation.
  • key is the key of one of the buckets, that is, the category of Chat.
  • doc_count is the count of documents in that bucket.

As you can see, there are only eight distinct buckets in the results of the query. 
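
Because hits.total is only a lower bound here, a quick way to see how many documents the buckets actually cover is to add up their counts. The following Python sketch does that for the byCategory response shown above. The summarize_terms helper is a made-up name for illustration, and the total equals the number of documents only under the assumption that every document has exactly one category value.

```python
# Aggregation part of the byCategory response shown above.
response = {
    "aggregations": {
        "byCategory": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "Chat", "doc_count": 52277},
                {"key": "File Sharing", "doc_count": 46912},
                {"key": "Other HTTP", "doc_count": 38535},
                {"key": "News", "doc_count": 25784},
                {"key": "Email", "doc_count": 21003},
                {"key": "Gaming", "doc_count": 19578},
                {"key": "Jobs", "doc_count": 19429},
                {"key": "Blogging", "doc_count": 19317},
            ],
        }
    }
}


def summarize_terms(resp, agg_name):
    """Return the total document count covered by a terms aggregation
    and a list of (key, doc_count, percentage share) rows."""
    agg = resp["aggregations"][agg_name]
    # Documents in the returned buckets plus those in buckets left out.
    total = sum(b["doc_count"] for b in agg["buckets"]) + agg["sum_other_doc_count"]
    rows = [
        (b["key"], b["doc_count"], 100 * b["doc_count"] / total)
        for b in agg["buckets"]
    ]
    return total, rows


total, rows = summarize_terms(response, "byCategory")
print("total documents:", total)  # 242835
for key, count, share in rows:
    print(f"{key:<14}{count:>7}{share:>7.1f}%")
```

The per-bucket shares computed this way are the kind of figures that a pie chart of categories would display.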

Next, we want to find out the top applications, ranked by the number of records for each application:

GET /bigginsight/_search?size=0
{
  "aggs": {
    "byApplication": {
      "terms": {
        "field": "application"
      }
    }
  }
}

Note that this time we have passed size=0 as a request parameter in the URL itself, rather than in the request body.

This returns a response like the following:

{
  ...,
  "aggregations": {
    "byApplication": {
      "doc_count_error_upper_bound": 6339,
      "sum_other_doc_count": 129191,
      "buckets": [
        {
          "key": "Skype",
          "doc_count": 26115
        },
        ...
}

Note that sum_other_doc_count has a large value, 129191. This is large relative to the size of the index: adding up the bucket counts from the previous query gives roughly 242,000 documents. The reason is that the terms aggregation returns only 10 buckets by default. With the default settings, the 10 buckets with the highest document counts are returned in descending order of count. Documents that fall outside those 10 buckets are counted in sum_other_doc_count. There are actually 30 different applications for which we have network traffic data, so the number in sum_other_doc_count is the sum of the counts for the remaining 20 applications that were not included in the buckets list.
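
The response above has non-zero values for both sum_other_doc_count and doc_count_error_upper_bound. The following Python sketch is a simplified model of where such values can come from; it is not Elasticsearch's actual implementation. Each simulated shard returns only its top shard_size terms, the lists are merged, and the error bound is taken as the sum of the last count returned by every shard that had to cut its list. The function names, the tiny shard_size, and all application names other than Skype are assumptions chosen to force truncation on toy data.

```python
from collections import Counter


def shard_top_terms(values, shard_size):
    """Return one shard's top terms and the count of the last term it returned.

    The last count is 0 when the shard returned every term it holds,
    because nothing was cut off in that case."""
    ranked = Counter(values).most_common()
    top = ranked[:shard_size]
    truncated = len(ranked) > shard_size
    # Any term the shard left out has at most as many documents
    # as the last term that it did return.
    last_count = top[-1][1] if truncated and top else 0
    return top, last_count


def terms_aggregation(shards, size, shard_size):
    """Merge per-shard top terms into a final, possibly approximate, result."""
    merged = Counter()
    error_upper_bound = 0
    for values in shards:
        top, last_count = shard_top_terms(values, shard_size)
        merged.update(dict(top))
        error_upper_bound += last_count
    buckets = merged.most_common(size)
    total_docs = sum(len(values) for values in shards)
    return {
        "doc_count_error_upper_bound": error_upper_bound,
        "sum_other_doc_count": total_docs - sum(count for _, count in buckets),
        "buckets": buckets,
    }


# Toy data: each list holds the application value of every document on a shard.
shards = [
    ["Skype"] * 5 + ["Chrome"] * 3 + ["Slack"] * 2 + ["Zoom"],
    ["Chrome"] * 4 + ["Slack"] * 2 + ["Skype"],
]

result = terms_aggregation(shards, size=2, shard_size=2)
print(result["buckets"])                      # [('Chrome', 7), ('Skype', 5)]
print(result["doc_count_error_upper_bound"])  # 5
print(result["sum_other_doc_count"])          # 6
```

In this toy run, Skype really occurs six times, but the second shard did not include it in its top two terms, so the merged count is five. The difference of one is within the reported error bound of five.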

To get the top 15 buckets instead of the default 10, we can use the size parameter inside the terms aggregation:

GET /bigginsight/_search?size=0
{
  "aggs": {
    "byApplication": {
      "terms": {
        "field": "application",
        "size": 15
      }
    }
  }
}

Notice that this size (specified inside the terms aggregation) is different from the size specified at the top level. At the top level, the size parameter controls how many search hits are returned (here 0, so none), whereas the size parameter inside the terms aggregation sets the maximum number of term buckets to be returned.
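
If you build request bodies in code, giving the two sizes distinct names can help to keep them apart. The following Python sketch assembles the same request body as a dictionary. The build_terms_request function and its hit_count and bucket_count parameters are made-up names for illustration; only the resulting JSON structure comes from the queries shown in this section. Here, size=0 is placed in the body, as in the first query, instead of in the URL.

```python
import json


def build_terms_request(agg_name, field, bucket_count=10, hit_count=0):
    """Build a search body with one terms aggregation.

    hit_count    -> top-level "size": how many search hits to return.
    bucket_count -> "size" inside "terms": maximum number of term buckets.
    """
    return {
        "size": hit_count,
        "aggs": {
            agg_name: {
                "terms": {
                    "field": field,
                    "size": bucket_count,
                }
            }
        },
    }


body = build_terms_request("byApplication", "application", bucket_count=15)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /bigginsight/_search request.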

Terms aggregation is very useful for generating data for pie charts or bar charts, where we may want to analyze the relative counts of string-typed fields in a set of documents. In Chapter 7, Visualizing Data with Kibana, you will learn how Kibana uses the terms aggregation to generate pie and bar charts.

Next, we will look at how to do bucketing on numerical types of fields.
