We could spend the next few pages defining the various aggregations and their syntax, but aggregations are truly best learned by example. Once you learn how to think about aggregations, and how to nest them appropriately, the syntax is fairly trivial.
A complete list of aggregation buckets and metrics can be found at the online reference documentation. We’ll cover many of them in this chapter, but glance over it after finishing so you are familiar with the full range of capabilities.
So let’s just dive in and start with an example. We are going to build some aggregations that might be useful to a car dealer. Our data will be about car transactions: the car model, manufacturer, sale price, when it sold, and more.
First we will bulk-index some data to work with:
POST /cars/transactions/_bulk
{ "index": {}}
{ "price": 10000, "color": "red", "make": "honda", "sold": "2014-10-28" }
{ "index": {}}
{ "price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05" }
{ "index": {}}
{ "price": 30000, "color": "green", "make": "ford", "sold": "2014-05-18" }
{ "index": {}}
{ "price": 15000, "color": "blue", "make": "toyota", "sold": "2014-07-02" }
{ "index": {}}
{ "price": 12000, "color": "green", "make": "toyota", "sold": "2014-08-19" }
{ "index": {}}
{ "price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05" }
{ "index": {}}
{ "price": 80000, "color": "red", "make": "bmw", "sold": "2014-01-01" }
{ "index": {}}
{ "price": 25000, "color": "blue", "make": "ford", "sold": "2014-02-12" }
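If you would rather generate the bulk request from application code than type it by hand, the body is just newline-delimited JSON: an action line followed by the document itself, repeated for every document. Here is a minimal Python sketch of that layout. The helper name build_bulk_body is hypothetical, and the sketch only builds the request body as a string; sending it to the /cars/transactions/_bulk endpoint is left to whatever HTTP client you use.

```python
import json

# Sample documents copied from the bulk request above.
cars = [
    {"price": 10000, "color": "red", "make": "honda", "sold": "2014-10-28"},
    {"price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05"},
    {"price": 30000, "color": "green", "make": "ford", "sold": "2014-05-18"},
    {"price": 15000, "color": "blue", "make": "toyota", "sold": "2014-07-02"},
    {"price": 12000, "color": "green", "make": "toyota", "sold": "2014-08-19"},
    {"price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05"},
    {"price": 80000, "color": "red", "make": "bmw", "sold": "2014-01-01"},
    {"price": 25000, "color": "blue", "make": "ford", "sold": "2014-02-12"},
]


def build_bulk_body(docs):
    """Hypothetical helper: return a newline-delimited bulk body for docs."""
    lines = []
    for doc in docs:
        # Action line: the metadata is empty because the index and type
        # come from the request URL (/cars/transactions/_bulk).
        lines.append(json.dumps({"index": {}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(cars)
print(body)
```

Each document must stay on a single line, which is why the sketch serializes with json.dumps and no indentation.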
Now that we have some data, let’s construct our first aggregation. A car dealer may want to know which color car sells the best. This is easily accomplished using a simple aggregation. We will do this using a terms bucket:
GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         }
      }
   }
}
Aggregations are placed under the top-level aggs parameter (the longer aggregations will also work if you prefer that).
We then name the aggregation whatever we want: colors, in this example.
Finally, we define a single bucket of type terms.
Aggregations are executed in the context of search results, which means an aggregation is just another top-level parameter in a search request (for example, using the /_search endpoint). Aggregations can be paired with queries, but we’ll tackle that later in Chapter 29.
You’ll notice that we used the count search_type. Because we don’t care about the search results, only the aggregation totals, the count search_type will be faster because it omits the fetch phase.
Next we define a name for our aggregation. Naming is up to you; the response will be labeled with the name you provide so that your application can parse the results later.
Next we define the aggregation itself. For this example, we are defining a single terms bucket. The terms bucket will dynamically create a new bucket for every unique term it encounters. Since we are telling it to use the color field, it will create one bucket for each color.
Let’s execute that aggregation and take a look at the results:
{
...
   "hits": {
      "hits": []
   },
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "red",
               "doc_count": 4
            },
            {
               "key": "blue",
               "doc_count": 2
            },
            {
               "key": "green",
               "doc_count": 2
            }
         ]
      }
   }
}
No search hits are returned because we used the search_type=count parameter.
Our colors aggregation is returned as part of the aggregations field.
The key to each bucket corresponds to a unique term found in the color field. Each bucket also always includes doc_count, which tells us the number of docs containing the term.
The count of each bucket represents the number of documents with this color.
The response contains a list of buckets, each corresponding to a unique color (for example, red or green). Each bucket also includes a count of the number of documents that “fell into” that particular bucket. For example, there are four red cars.
The preceding example is operating entirely in real time: if the documents are searchable, they can be aggregated. This means you can take the aggregation results and pipe them straight into a graphing library to generate real-time dashboards. As soon as you sell a silver car, your graphs would dynamically update to include statistics about silver cars.
Voila! Your first aggregation!
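To make the bucketing concrete, here is roughly what the terms bucket did with the color field, imitated in plain Python over the eight sample documents. This is only an illustration of the idea, not how Elasticsearch computes it internally. The function name terms_buckets is hypothetical, and breaking ties alphabetically is an assumption made so that the output lines up with the response shown earlier.

```python
from collections import Counter

# The color of each sample transaction, in the order it was indexed.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]


def terms_buckets(values):
    """Hypothetical helper: one bucket per unique value, with its count."""
    counts = Counter(values)
    # Largest buckets first. Ties are broken alphabetically here, which is
    # an assumption chosen to match the ordering of the response above.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


buckets = terms_buckets(colors)
for bucket in buckets:
    print(bucket)
```

Notice that nothing in the sketch lists the colors up front: a new bucket appears the first time a new value is seen, which is what "dynamically create" means for the terms bucket.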
The previous example told us the number of documents in each bucket, which is useful. But often, our applications require more-sophisticated metrics about the documents. For example, what is the average price of cars in each bucket?
To get this information, we need to tell Elasticsearch which metrics to calculate, and on which fields. This requires nesting metrics inside the buckets. Metrics will calculate mathematical statistics based on the values of documents within a bucket.
Let’s go ahead and add an average metric to our car example:
GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": {
               "avg": {
                  "field": "price"
               }
            }
         }
      }
   }
}
We add a new aggs level to hold the metric.
We then give the metric a name: avg_price.
And finally, we define it as an avg metric over the price field.
As you can see, we took the previous example and tacked on a new aggs level. This new aggregation level allows us to nest the avg metric inside the terms bucket. Effectively, this means we will generate an average for each color.
Just like the colors example, we need to name our metric (avg_price) so we can retrieve the values later. Finally, we specify the metric itself (avg) and what field we want the average to be calculated on (price):
{
...
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "red",
               "doc_count": 4,
               "avg_price": {
                  "value": 32500
               }
            },
            {
               "key": "blue",
               "doc_count": 2,
               "avg_price": {
                  "value": 20000
               }
            },
            {
               "key": "green",
               "doc_count": 2,
               "avg_price": {
                  "value": 21000
               }
            }
         ]
      }
   }
...
}
Although the response has changed minimally, the data we get out of it has grown substantially. Before, we knew there were four red cars. Now we know that the average price of red cars is $32,500. This is something that you can plug directly into reports or graphs.
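If it helps to see the arithmetic behind those averages, the nesting of a metric inside a bucket can be imitated in plain Python: first group the prices by color (the bucket), then average each group (the metric). This is a sketch of the concept under that simplification, not Elasticsearch's implementation, and avg_price_by_color is a hypothetical name.

```python
from collections import defaultdict

# (color, price) for each sample transaction, taken from the bulk request.
sales = [
    ("red", 10000), ("red", 20000), ("green", 30000), ("blue", 15000),
    ("green", 12000), ("red", 20000), ("red", 80000), ("blue", 25000),
]


def avg_price_by_color(rows):
    """Hypothetical helper: bucket prices by color, then average each bucket."""
    prices = defaultdict(list)
    for color, price in rows:
        prices[color].append(price)  # the bucketing step
    return {
        color: {
            "doc_count": len(values),
            # The metric step: computed only over documents in this bucket.
            "avg_price": {"value": sum(values) / len(values)},
        }
        for color, values in prices.items()
    }


result = avg_price_by_color(sales)
for color, stats in result.items():
    print(color, stats)
```

For red, the four prices 10000, 20000, 20000, and 80000 sum to 130000, which gives the 32500 average reported in the response.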
The true power of aggregations becomes apparent once you start playing with different nesting schemes. In the previous examples, we saw how you could nest a metric inside a bucket, which is already quite powerful.
But the really exciting analytics come from nesting buckets inside other buckets. This time, we want to find out the distribution of car manufacturers for each color:
GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": {
               "avg": {
                  "field": "price"
               }
            },
            "make": {
               "terms": {
                  "field": "make"
               }
            }
         }
      }
   }
}
Notice that we can leave the previous avg_price metric in place.
Another aggregation named make is added to the color bucket.
This aggregation is a terms bucket and will generate unique buckets for each car make.
A few interesting things happened here. First, you’ll notice that the previous avg_price metric is left entirely intact. Each level of an aggregation can have many metrics or buckets. The avg_price metric tells us the average price for each car color. This is independent of other buckets and metrics that are also being built.
This is important for your application, since there are often many related, but entirely distinct, metrics that you need to collect. Aggregations allow you to collect all of them in a single pass over the data.
The other important thing to note is that the aggregation we added, make, is a terms bucket (nested inside the colors terms bucket). This means we will generate a (color, make) tuple for every unique combination in your dataset.
Let’s take a look at the response (truncated for brevity, since it is now growing quite long):
{
...
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "red",
               "doc_count": 4,
               "make": {
                  "buckets": [
                     {
                        "key": "honda",
                        "doc_count": 3
                     },
                     {
                        "key": "bmw",
                        "doc_count": 1
                     }
                  ]
               },
               "avg_price": {
                  "value": 32500
               }
            },
...
}
Our new aggregation is nested under each color bucket, as expected.
We now see a breakdown of car makes for each color.
Finally, you can see that our previous avg_price metric is still intact.
The response tells us the following:
There are four red cars.
The average price of a red car is $32,500.
Three of the red cars are made by Honda, and one is a BMW.
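The (color, make) breakdown can be imitated in plain Python as a counter nested inside a dictionary: the outer key plays the role of the colors bucket and the inner counter plays the role of the make bucket. As before, this is a conceptual sketch rather than Elasticsearch's implementation, and makes_by_color is a hypothetical name.

```python
from collections import Counter, defaultdict

# (color, make) for each sample transaction, taken from the bulk request.
sales = [
    ("red", "honda"), ("red", "honda"), ("green", "ford"), ("blue", "toyota"),
    ("green", "toyota"), ("red", "honda"), ("red", "bmw"), ("blue", "ford"),
]


def makes_by_color(rows):
    """Hypothetical helper: count makes separately inside each color bucket."""
    nested = defaultdict(Counter)
    for color, make in rows:
        # Outer bucket: color. Inner bucket: make, counted within that color.
        nested[color][make] += 1
    return nested


breakdown = makes_by_color(sales)
for color, makes in breakdown.items():
    print(color, dict(makes))
```

Only combinations that actually occur in the data get an inner bucket, so there is no (red, ford) entry, for example.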
Just to drive the point home, let’s make one final modification to our example before moving on to new topics. Let’s add two metrics to calculate the min and max price for each make:
GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": {
               "avg": {
                  "field": "price"
               }
            },
            "make": {
               "terms": {
                  "field": "make"
               },
               "aggs": {
                  "min_price": {
                     "min": {
                        "field": "price"
                     }
                  },
                  "max_price": {
                     "max": {
                        "field": "price"
                     }
                  }
               }
            }
         }
      }
   }
}
Which gives us the following output (again, truncated):
{
...
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "red",
               "doc_count": 4,
               "make": {
                  "buckets": [
                     {
                        "key": "honda",
                        "doc_count": 3,
                        "min_price": {
                           "value": 10000
                        },
                        "max_price": {
                           "value": 20000
                        }
                     },
                     {
                        "key": "bmw",
                        "doc_count": 1,
                        "min_price": {
                           "value": 80000
                        },
                        "max_price": {
                           "value": 80000
                        }
                     }
                  ]
               },
               "avg_price": {
                  "value": 32500
               }
            },
...
With those two metrics, we’ve expanded the information derived from this query to include the following:
There are four red cars.
The average price of a red car is $32,500.
Three of the red cars are made by Honda, and one is a BMW.
The cheapest red Honda is $10,000.
The most expensive red Honda is $20,000.
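Because every aggregation is labeled with the name we gave it, an application can recover these facts by walking the response with those same names. The sketch below does that in Python for the red bucket shown in the truncated output. The other color buckets are elided in that output, so they are left out here as well. The summarize function and the wording of its sentences are hypothetical.

```python
# The red bucket as it appears in the truncated response above. The other
# color buckets are elided there, so they are not reproduced here.
response = {
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "make": {
                        "buckets": [
                            {"key": "honda", "doc_count": 3,
                             "min_price": {"value": 10000},
                             "max_price": {"value": 20000}},
                            {"key": "bmw", "doc_count": 1,
                             "min_price": {"value": 80000},
                             "max_price": {"value": 80000}},
                        ]
                    },
                    "avg_price": {"value": 32500},
                }
            ]
        }
    }
}


def summarize(resp):
    """Hypothetical helper: turn the nested buckets into flat sentences."""
    facts = []
    # "colors", "make", "avg_price", "min_price", and "max_price" are the
    # aggregation names chosen in the request.
    for color in resp["aggregations"]["colors"]["buckets"]:
        facts.append(
            f"{color['doc_count']} {color['key']} cars, "
            f"average price {color['avg_price']['value']}"
        )
        for make in color["make"]["buckets"]:
            facts.append(
                f"{color['key']} {make['key']}: {make['doc_count']} sold, "
                f"min {make['min_price']['value']}, "
                f"max {make['max_price']['value']}"
            )
    return facts


facts = summarize(response)
for fact in facts:
    print(fact)
```

The shape of the loops mirrors the shape of the request: one loop per bucket level, with the metrics read directly off each bucket.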