Calculating the cumulative sum of usage over time

While discussing the Date Histogram aggregation, in the Focusing on a specific day and changing intervals section, we looked at the aggregation that's used to compute hourly bandwidth usage for one particular day. After completing that exercise, we had data for September 24, with hourly consumption from 12:00 am to 1:00 am, from 1:00 am to 2:00 am, and so on. Using the cumulative sum aggregation, we can also compute the cumulative bandwidth usage at the end of every hour of the day. Let's look at the query and try to understand it:

GET /bigginsight/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {"term": {"customer": "Linkedin"}}, 
        {"range": {"time": {"gte": 1506277800000}}}
      ]
    }
  },
  "aggs": {
    "counts_over_time": {
      "date_histogram": {
        "field": "time",
        "interval": "1h",
        "time_zone": "+05:30"
      },
      "aggs": {
        "hourly_usage": {
          "sum": { "field": "usage" }
        },
        "cumulative_hourly_usage": {            1
          "cumulative_sum": {                   2
              "buckets_path": "hourly_usage"    3 
          }
        }
      }
    }
  }
}

Only the cumulative_hourly_usage block, whose lines are annotated with the numbers 1 to 3, is new compared with the query that we saw previously. What we want is to calculate the cumulative sum over the buckets generated by the Date Histogram aggregation. Let's go over the newly added code, following the annotation numbers:

1. This gives an easy-to-understand name to this aggregation and places it inside the parent Date Histogram aggregation, which is the bucket aggregation containing this aggregation.
2. We are using the cumulative sum aggregation, and hence, we refer to its name, cumulative_sum, here.
3. The buckets_path element refers to the metric over which we want to do the cumulative sum. In our case, we want to sum over the hourly_usage metric that was created previously.
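
To make the role of buckets_path more concrete, here is a minimal Python sketch of the idea. It is not how Elasticsearch implements the aggregation; the function name apply_cumulative_sum and the sample bucket values are hypothetical and exist only for illustration. The sketch walks the buckets in order, looks up the metric named by buckets_path in each one, and stores the running total under the new aggregation's name:

```python
from copy import deepcopy


def apply_cumulative_sum(buckets, name, buckets_path):
    """Return a copy of `buckets` with a running total added to each bucket.

    Hypothetical helper: it only mimics the idea behind cumulative_sum.
    """
    result = deepcopy(buckets)  # leave the caller's buckets untouched
    running_total = 0
    for bucket in result:  # buckets are assumed to be in time order
        # buckets_path names the sibling metric to read from each bucket
        running_total += bucket[buckets_path]["value"]
        bucket[name] = {"value": running_total}
    return result


# Hypothetical hourly buckets; the values are made up for illustration.
hourly_buckets = [
    {"key_as_string": "00:00", "hourly_usage": {"value": 10}},
    {"key_as_string": "01:00", "hourly_usage": {"value": 20}},
    {"key_as_string": "02:00", "hourly_usage": {"value": 5}},
]

with_totals = apply_cumulative_sum(
    hourly_buckets, "cumulative_hourly_usage", "hourly_usage"
)
totals = [b["cumulative_hourly_usage"]["value"] for b in with_totals]
print(totals)  # [10, 30, 35]
```

Note that the running total is written next to the metric it reads from, inside the same bucket, which mirrors where cumulative_hourly_usage sits in the query.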

The response should look as follows. It has been truncated for brevity:

{
  ...,
  "aggregations": {
    "counts_over_time": {
      "buckets": [
        {
          "key_as_string": "2017-09-25T00:00:00.000+05:30",
          "key": 1506277800000,
          "doc_count": 465,
          "hourly_usage": {
            "value": 1385524
          },
          "cumulative_hourly_usage": {
            "value": 1385524
          }
        },
        {
          "key_as_string": "2017-09-25T01:00:00.000+05:30",
          "key": 1506281400000,
          "doc_count": 478,
          "hourly_usage": {
            "value": 1432123
          },
          "cumulative_hourly_usage": {
            "value": 2817647
          }
        },
        ...
      ]
    }
  }
}

As you can see, cumulative_hourly_usage contains the running total of hourly_usage so far. In the first bucket, the hourly usage and the cumulative hourly usage are the same. From the second bucket onward, the cumulative hourly usage is the sum of the hourly usage of all the buckets up to and including the current one.
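
As a quick sanity check, we can redo the arithmetic for the two buckets shown in the truncated response. The following short Python sketch uses only the hourly_usage values copied from that response; the variable names are ours:

```python
from itertools import accumulate

# hourly_usage values copied from the two buckets in the truncated response
hourly_usage = [1385524, 1432123]

# accumulate() yields the running total after each element
cumulative_hourly_usage = list(accumulate(hourly_usage))
print(cumulative_hourly_usage)  # [1385524, 2817647]
```

The second value, 2817647, matches the cumulative_hourly_usage reported for the 1:00 am bucket.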

Pipeline aggregations are powerful. Besides cumulative sums, they can compute derivatives and moving averages, as well as the average, min, max, and so on over the buckets produced by previously calculated aggregations.
