Executing significant terms aggregation

This kind of aggregation is an evolution of the previous one in that it's able to cover several scenarios such as:

  • Suggesting relevant terms related to current query text
  • Discovering relations of terms
  • Discovering common patterns in text

In these scenarios, the result cannot be as simple as that of the previous terms aggregation; it must be computed as a measure of the difference in term frequency between a foreground set (generally the documents matched by the query) and a background set (a large bulk of data).

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via the command line, you need to install curl for your operating system.

To correctly execute the following command, you need an index populated with the script (chapter_08/populate_aggregations.sh) available in the online code.

How to do it...

To execute a significant terms aggregation, we will perform the following steps:

  1. We want to find the significant values of the tag field for the documents that match a given set of tags. The REST call should be as follows:
            curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true&size=0' -d '{
                "query" : {
                    "terms" : {"tag" : [ "ullam", "in", "ex" ]}
                },
                "aggs": {
                    "significant_tags": {
                        "significant_terms": {
                            "field": "tag"
                        }
                    }
                }
            }'
  2. The result returned by Elasticsearch, if everything is okay, should be as follows:
            {
              "took" : 6,
              "timed_out" : false,
              "_shards" : { ...truncated... },
              "hits" : { ...truncated... },
              "aggregations" : {
                "significant_tags" : {
                  "doc_count" : 45,
                  "buckets" : [
                    {
                      "key" : "ullam",
                      "doc_count" : 17,
                      "score" : 8.017283950617283,
                      "bg_count" : 17
                    },
                    {
                      "key" : "in",
                      "doc_count" : 15,
                      "score" : 7.0740740740740735,
                      "bg_count" : 15
                    },
                    {
                      "key" : "ex",
                      "doc_count" : 14,
                      "score" : 6.602469135802469,
                      "bg_count" : 14
                    },
                    {
                      "key" : "vitae",
                      "doc_count" : 3,
                      "score" : 0.674074074074074,
                      "bg_count" : 6
                    },
                    {
                      "key" : "necessitatibus",
                      "doc_count" : 3,
                      "score" : 0.3373737373737374,
                      "bg_count" : 11
                    }
                   ]
                 }
              }
            }

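The aggregation result is composed of several buckets, each with the following fields:

  • key: the term used to populate the bucket
  • doc_count: the number of documents in the foreground set (here, the 45 documents matched by the query) that contain the key term
  • score: the significance score of the bucket; a higher value means the term is more over-represented in the foreground set relative to the background
  • bg_count: the number of background documents that contain the key term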
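
As a rough check of how score relates to doc_count and bg_count, the following Python sketch applies the JLH heuristic, the default scoring used by the significant terms aggregation, to the buckets shown above. The foreground size of 45 is the doc_count of the aggregation; the background size of 1000 is an assumption (the response does not report it), chosen because it reproduces the scores shown. The name jlh_score is illustrative and not part of any Elasticsearch API.

```python
def jlh_score(doc_count, fg_total, bg_count, bg_total):
    """JLH heuristic: (fg% - bg%) * (fg% / bg%).

    Returns 0.0 when the term is not more frequent in the
    foreground than in the background.
    """
    fg_pct = doc_count / fg_total
    bg_pct = bg_count / bg_total
    if bg_pct == 0 or fg_pct <= bg_pct:
        return 0.0
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


FG_TOTAL = 45    # "doc_count" of the significant_tags aggregation
BG_TOTAL = 1000  # ASSUMPTION: background size, not shown in the response

# (key, doc_count, bg_count) copied from the response above
buckets = [
    ("ullam", 17, 17),
    ("in", 15, 15),
    ("ex", 14, 14),
    ("vitae", 3, 6),
    ("necessitatibus", 3, 11),
]

for key, doc_count, bg_count in buckets:
    score = jlh_score(doc_count, FG_TOTAL, bg_count, BG_TOTAL)
    print(f"{key}: {score:.4f}")
```

Under these assumptions, vitae scores higher than necessitatibus even though both have a doc_count of 3, because vitae is rarer in the background (6 documents against 11).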

How it works...

The aggregation is executed in a similar way to the previous ones. Internally, two terms aggregations are computed: one on the documents matched by the query or parent aggregation (the foreground set), and one on all the documents in the index (the background set). The two result sets are then compared and scored to compute the significant terms.
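
The two-pass idea can be sketched in plain Python on a toy dataset. This is only an illustration under stated assumptions, not how Elasticsearch implements it: the documents, the function names (term_doc_counts, significant_terms), and the choice of the JLH formula for scoring are all assumptions made for the example.

```python
from collections import Counter


def term_doc_counts(docs):
    """Count, for each term, how many documents contain it."""
    counts = Counter()
    for tags in docs:
        counts.update(set(tags))  # a term counts once per document
    return counts


def significant_terms(foreground, background):
    """Score foreground terms against the background, best first."""
    fg_counts = term_doc_counts(foreground)
    bg_counts = term_doc_counts(background)
    fg_total, bg_total = len(foreground), len(background)

    buckets = []
    for term, doc_count in fg_counts.items():
        bg_count = bg_counts[term]
        if bg_count == 0:
            continue  # foreground is expected to be a subset of background
        fg_pct = doc_count / fg_total
        bg_pct = bg_count / bg_total
        if fg_pct <= bg_pct:
            continue  # not over-represented in the foreground
        score = (fg_pct - bg_pct) * (fg_pct / bg_pct)
        buckets.append({"key": term, "doc_count": doc_count,
                        "score": score, "bg_count": bg_count})
    buckets.sort(key=lambda b: b["score"], reverse=True)
    return buckets


# Hypothetical background set: each document is a list of tags.
background = [
    ["ex", "vitae"], ["ex", "vitae"], ["ex", "error"], ["error"],
    ["error", "dicta"], ["error"], ["dicta"], ["error"],
]
# Foreground set: the documents a query on the tag "ex" would match.
foreground = [doc for doc in background if "ex" in doc]

result = significant_terms(foreground, background)
for bucket in result:
    print(bucket)
```

In this toy data, error appears in the foreground but is dropped from the result because it is even more common in the background, which is the behavior that distinguishes this aggregation from a plain terms count.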

Because term fields can have a very high cardinality and the significance computation is costly, this kind of aggregation is very CPU intensive.

The significant aggregation returns terms that are evaluated as significant for the current query.

To compare the results of the significant terms aggregation with those of the plain terms aggregation, we can run the same query with a terms aggregation as follows:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true&size=0' -d '{
    "query" : {
        "terms" : {"tag" : [ "ullam", "in", "ex" ]}
    },
    "aggs": {
        "tags": {
            "terms": {
                "field": "tag"
            }
        }
    }
}'

The returned results will be as follows:

{
  ...truncated...
  "aggregations" : {
    "tags" : {
      "doc_count_error_upper_bound" : 2,
      "sum_other_doc_count" : 96,
      "buckets" : [
        {"key" : "ullam", "doc_count" : 17},
        {"key" : "in", "doc_count" : 15},
        {"key" : "ex", "doc_count" : 14},
        {"key" : "necessitatibus", "doc_count" : 3},
        {"key" : "vitae", "doc_count" : 3},
        {"key" : "architecto", "doc_count" : 2},
        {"key" : "debitis", "doc_count" : 2 },
        {"key" : "dicta", "doc_count" : 2},
        {"key" : "error", "doc_count" : 2},
        {"key" : "excepturi", "doc_count" : 2}
       ]
    }
  }
}
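
To make the difference between the two responses easier to see, the following Python sketch compares the bucket order of each one. The values are copied from the two responses shown in this recipe; the variable names are illustrative.

```python
# key -> score, copied from the significant terms response
significant_scores = {
    "ullam": 8.017283950617283,
    "in": 7.0740740740740735,
    "ex": 6.602469135802469,
    "vitae": 0.674074074074074,
    "necessitatibus": 0.3373737373737374,
}

# (key, doc_count) pairs from the terms response, in the order returned
terms_buckets = [
    ("ullam", 17), ("in", 15), ("ex", 14),
    ("necessitatibus", 3), ("vitae", 3),
    ("architecto", 2), ("debitis", 2), ("dicta", 2),
    ("error", 2), ("excepturi", 2),
]

# Significant terms are ranked by score, highest first.
significant_order = sorted(significant_scores,
                           key=significant_scores.get, reverse=True)
terms_order = [key for key, _ in terms_buckets]

# Terms present in both responses, in terms-aggregation order.
shared = [key for key in terms_order if key in significant_scores]
# Terms returned only by the plain terms aggregation.
only_in_terms = [key for key in terms_order if key not in significant_scores]

print("significant order:", significant_order)
print("terms order (shared):", shared)
print("only in terms:", only_in_terms)
```

The two orders differ for vitae and necessitatibus: both have a doc_count of 3, but vitae is rarer in the background (bg_count of 6 against 11), so it is ranked as more significant. The terms with a doc_count of 2 appear only in the terms response, which is consistent with the significant terms aggregation's default min_doc_count of 3.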