This kind of aggregation is an evolution of the previous one in that it's able to cover several scenarios such as:
In these scenarios, the result cannot be as simple as that of the previous terms aggregations; it must be computed as a variance between a foreground set (generally the query) and a background one (a large bulk of data).
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To correctly execute the following command, you need an index populated with the script (chapter_08/populate_aggregations.sh) available in the online code.
To execute a significant terms aggregation, we will perform the following steps:
We want to compute the significant terms of the tag field, given some tags. The REST call should be as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true&size=0' -d '{
  "query" : {
    "terms" : { "tag" : [ "ullam", "in", "ex" ] }
  },
  "aggs" : {
    "significant_tags" : {
      "significant_terms" : { "field" : "tag" }
    }
  }
}'
The result returned by Elasticsearch should be as follows:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : { ...truncated... },
  "hits" : { ...truncated... },
  "aggregations" : {
    "significant_tags" : {
      "doc_count" : 45,
      "buckets" : [
        { "key" : "ullam", "doc_count" : 17, "score" : 8.017283950617283, "bg_count" : 17 },
        { "key" : "in", "doc_count" : 15, "score" : 7.0740740740740735, "bg_count" : 15 },
        { "key" : "ex", "doc_count" : 14, "score" : 6.602469135802469, "bg_count" : 14 },
        { "key" : "vitae", "doc_count" : 3, "score" : 0.674074074074074, "bg_count" : 6 },
        { "key" : "necessitatibus", "doc_count" : 3, "score" : 0.3373737373737374, "bg_count" : 11 }
      ]
    }
  }
}
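The same request can also be assembled programmatically. The following is a minimal Python sketch that uses only the standard library; the helper names significant_terms_body and search are hypothetical and are not part of any Elasticsearch client. The URL is the one used in the curl call and assumes a local node with the populated test-index.

```python
import json
import urllib.request

# Endpoint taken from the curl call in this recipe (assumes a local node).
SEARCH_URL = ("http://127.0.0.1:9200/test-index/test-type/_search"
              "?pretty=true&size=0")


def significant_terms_body(field, tags, agg_name="significant_tags"):
    """Build the request body: a terms query plus a significant_terms aggregation.

    Hypothetical helper, written for illustration only.
    """
    return {
        "query": {"terms": {field: list(tags)}},
        "aggs": {agg_name: {"significant_terms": {"field": field}}},
    }


def search(body, url=SEARCH_URL):
    """Send the body with a GET request, as curl -XGET ... -d does.

    Not called here, because it needs a running Elasticsearch node.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = significant_terms_body("tag", ["ullam", "in", "ex"])
print(json.dumps(body, indent=2))
```

Calling search(body) against a populated index should return a response shaped like the one shown in this recipe; only the body construction is exercised here.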
The aggregation result is composed of several buckets with:
key: the term used to populate the bucket
doc_count: the number of results with the key term
score: the score for this bucket
bg_count: the number of background documents that contain the key term
The execution of the aggregation is similar to the previous ones. Internally, two terms aggregations are computed: one on the documents matched by the query or parent aggregation, and one on all the documents in the knowledge base. The two result sets are then scored to compute the significant results.
Due to the large cardinality of the terms involved and the cost of the significance computation, this kind of aggregation is very CPU-intensive.
The significant terms aggregation returns terms that are evaluated as significant for the current query.
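How the foreground and background counts might be combined into a score can be sketched in Python. The sketch assumes the JLH heuristic, which is the Elasticsearch default for significant_terms; this recipe does not name the heuristic in use. The background size of 1000 documents is also an assumption, chosen because it is consistent with the scores in the response shown earlier. The function name jlh_score is hypothetical.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Sketch of a JLH-style score: absolute change times relative change.

    fg_* describe the foreground set (documents matched by the query),
    bg_* describe the background set (all documents in the index).
    """
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    if fg_pct <= bg_pct:
        # The term is not more frequent in the foreground: not significant.
        return 0.0
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


FG_TOTAL = 45    # doc_count of the significant_tags aggregation in the response
BG_TOTAL = 1000  # ASSUMPTION: index size, inferred from the returned scores

# (key, doc_count, bg_count) copied from the response buckets
buckets = [
    ("ullam", 17, 17),
    ("in", 15, 15),
    ("ex", 14, 14),
    ("vitae", 3, 6),
    ("necessitatibus", 3, 11),
]

scores = {key: jlh_score(fg, FG_TOTAL, bg, BG_TOTAL) for key, fg, bg in buckets}
for key, value in scores.items():
    print(key, round(value, 4))
```

Under these assumptions, the computed values agree with the score fields in the response. The sketch also suggests why vitae outranks necessitatibus: both match 3 foreground documents, but vitae is rarer in the background (6 documents versus 11).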
To compare the results of the significant terms aggregation with those of the plain terms aggregation, we can execute the same query with a terms aggregation as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true&size=0' -d '{
  "query" : {
    "terms" : { "tag" : [ "ullam", "in", "ex" ] }
  },
  "aggs" : {
    "tags" : {
      "terms" : { "field" : "tag" }
    }
  }
}'
The returned results will be as follows:
{
  ...truncated...
  "aggregations" : {
    "tags" : {
      "doc_count_error_upper_bound" : 2,
      "sum_other_doc_count" : 96,
      "buckets" : [
        {"key" : "ullam", "doc_count" : 17},
        {"key" : "in", "doc_count" : 15},
        {"key" : "ex", "doc_count" : 14},
        {"key" : "necessitatibus", "doc_count" : 3},
        {"key" : "vitae", "doc_count" : 3},
        {"key" : "architecto", "doc_count" : 2},
        {"key" : "debitis", "doc_count" : 2},
        {"key" : "dicta", "doc_count" : 2},
        {"key" : "error", "doc_count" : 2},
        {"key" : "excepturi", "doc_count" : 2}
      ]
    }
  }
}
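The difference between the two responses can be made explicit with a short Python sketch. The bucket values below are copied from the two results in this recipe; the variable names are illustrative only.

```python
# Buckets copied from the significant_terms response.
significant = [
    {"key": "ullam", "doc_count": 17, "score": 8.017283950617283, "bg_count": 17},
    {"key": "in", "doc_count": 15, "score": 7.0740740740740735, "bg_count": 15},
    {"key": "ex", "doc_count": 14, "score": 6.602469135802469, "bg_count": 14},
    {"key": "vitae", "doc_count": 3, "score": 0.674074074074074, "bg_count": 6},
    {"key": "necessitatibus", "doc_count": 3, "score": 0.3373737373737374, "bg_count": 11},
]

# Buckets copied from the plain terms response (ordered by doc_count).
terms = [
    {"key": "ullam", "doc_count": 17},
    {"key": "in", "doc_count": 15},
    {"key": "ex", "doc_count": 14},
    {"key": "necessitatibus", "doc_count": 3},
    {"key": "vitae", "doc_count": 3},
    {"key": "architecto", "doc_count": 2},
    {"key": "debitis", "doc_count": 2},
    {"key": "dicta", "doc_count": 2},
    {"key": "error", "doc_count": 2},
    {"key": "excepturi", "doc_count": 2},
]

# Ranking by significance score versus ranking by raw frequency.
significant_keys = [
    b["key"] for b in sorted(significant, key=lambda b: b["score"], reverse=True)
]
terms_keys = [b["key"] for b in terms]

# Terms that are frequent enough to be listed, but are not in the significant buckets.
only_in_terms = [key for key in terms_keys if key not in significant_keys]

print("by score:     ", significant_keys)
print("by doc_count: ", terms_keys)
print("only in terms:", only_in_terms)
```

The plain terms aggregation lists every frequent tag in the matched documents, while the significant terms response keeps only five of them and ranks vitae above necessitatibus, even though both have the same doc_count.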