Working with nested objects

Nested objects look similar to plain objects but they differ in mapping and the way they are stored internally in Elasticsearch.

We will work with the same Twitter data but this time we will index it in a nested structure. We will have a user as our root object and every user can have multiple tweets as nested documents. Indexing this kind of data without using nested mapping will lead to problems, as shown in the following example:

PUT /twitter/tweet/1
{
  "user": {
    "screen_name": "d_bharvi",
    "followers_count": "2000",
    "created_at": "2012-06-05"
  },
  "tweets": [
    {
      "id": "121223221",
      "text": "understanding nested relationships",
      "created_at": "2015-09-05"
    },
    {
      "id": "121223222",
      "text": "NoSQL databases are awesome",
      "created_at": "2015-06-05"
    }
  ]
}

PUT /twitter/tweet/2
{
  "user": {
    "screen_name": "d_bharvi",
    "followers_count": "2000",
    "created_at": "2012-06-05"
  },
  "tweets": [
    {
      "id": "121223223",
      "text": "understanding nested relationships",
      "created_at": "2015-09-05"
    },
    {
      "id": "121223224",
      "text": "NoSQL databases are awesome",
      "created_at": "2015-09-05"
    }
  ]
}

Now, if we want to query all the tweets that are about NoSQL and have been created on 2015-09-05, we would use the following code:

GET twitter/tweet/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "tweets.text": "NoSQL"
          }
        },
        {
          "term": {
            "tweets.created_at": "2015-09-05"
          }
        }
      ]
    }
  }
}

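The preceding query returns both documents, even though only the second one contains a tweet that is about NoSQL and was created on 2015-09-05. The reason is that Elasticsearch internally stores arrays of objects in a flattened form, along the following lines: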

{tweets.id : ["121223221","121223222","121223223","121223224"],
tweets.text : ["understanding nested relationships",........],
tweets.created_at : ["2015-09-05","2015-06-05","2015-09-05","2015-09-05"]}

All the fields of the tweet objects are flattened into arrays, which loses the association between each tweet's text and its creation date. Because of this, the previous query returned the wrong results.
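The effect of flattening can be reproduced in plain Python. The following is only a simplified model of the behavior described above, not Elasticsearch's actual storage code: the helper names (flatten, matches_flattened) are made up for illustration, and text analysis is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# The two sample documents from this section (only the tweets are needed here).
docs = {
    1: {"tweets": [
        {"id": "121223221", "text": "understanding nested relationships", "created_at": "2015-09-05"},
        {"id": "121223222", "text": "NoSQL databases are awesome", "created_at": "2015-06-05"},
    ]},
    2: {"tweets": [
        {"id": "121223223", "text": "understanding nested relationships", "created_at": "2015-09-05"},
        {"id": "121223224", "text": "NoSQL databases are awesome", "created_at": "2015-09-05"},
    ]},
}


def flatten(doc):
    """Collapse the array of tweet objects into one multi-valued list per field."""
    flat = defaultdict(list)
    for tweet in doc["tweets"]:
        for field, value in tweet.items():
            flat["tweets." + field].append(value)
    return dict(flat)


def matches_flattened(doc, word, date):
    """Each clause is checked against the whole list, not against a single tweet."""
    flat = flatten(doc)
    has_word = any(word.lower() in text.lower().split() for text in flat["tweets.text"])
    has_date = date in flat["tweets.created_at"]
    return has_word and has_date


flat_hits = [doc_id for doc_id, doc in docs.items()
             if matches_flattened(doc, "NoSQL", "2015-09-05")]
print(flat_hits)  # [1, 2]
```

Document 1 matches only because its NoSQL tweet and its 2015-09-05 tweet are two different tweets, and the flattened form cannot tell them apart.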

Creating nested mappings

The mapping for nested objects can be defined in the following way:

PUT twitter_nested/users/_mapping
{
  "properties": {
    "user": {
      "type": "object",
      "properties": {
        "screen_name": {
          "type": "string"
        },
        "followers_count": {
          "type": "integer"
        },
        "created_at": {
          "type": "date"
        }
      }
    },
    "tweets": {
      "type": "nested",
      "properties": {
        "id": {
          "type": "string"
        },
        "text": {
          "type": "string"
        },
        "created_at": {
          "type": "date"
        }
      }
    }
  }
}

In the previous mapping, user is a simple object field, whereas the tweets field is defined with the nested type and has id, text, and created_at as its properties.
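When a mapping grows, it can be handy to list which fields are nested, because those are the values you will later pass as the path of a nested query or aggregation. The following sketch walks the mapping above as a plain Python dictionary; nested_paths is a hypothetical helper written for this illustration and is not part of any Elasticsearch client.

```python
# The mapping body from above, as a Python dictionary.
mapping = {
    "properties": {
        "user": {
            "type": "object",
            "properties": {
                "screen_name": {"type": "string"},
                "followers_count": {"type": "integer"},
                "created_at": {"type": "date"},
            },
        },
        "tweets": {
            "type": "nested",
            "properties": {
                "id": {"type": "string"},
                "text": {"type": "string"},
                "created_at": {"type": "date"},
            },
        },
    }
}


def nested_paths(properties, prefix=""):
    """Return the dotted paths of all fields whose type is nested."""
    paths = []
    for name, spec in properties.items():
        path = prefix + name
        if spec.get("type") == "nested":
            paths.append(path)
        # Descend into object and nested fields alike.
        paths.extend(nested_paths(spec.get("properties", {}), path + "."))
    return paths


paths = nested_paths(mapping["properties"])
print(paths)  # ['tweets']
```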

Indexing nested data

You can use the same JSON documents that we used in the previous section to index users and their tweets, as indexing nested fields is similar to indexing object fields and does not require any extra effort in the code. However, Elasticsearch considers all the nested documents as separate documents and stores them internally in the following format, which preserves the relationships between tweet texts and dates:

{tweets.id : "121223221", tweets.text : "understanding nested relationships", tweets.created_at : "2015-09-05"}
{tweets.id : "121223222", tweets.text : "NoSQL databases are awesome", tweets.created_at : "2015-06-05"}
{tweets.id : "121223223", tweets.text : "understanding nested relationships", tweets.created_at : "2015-09-05"}
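To see why keeping each tweet as a separate unit fixes the earlier problem, here is a small pure-Python model of nested matching. It is an illustration under simplifying assumptions (whitespace tokenization, a made-up matches_nested helper), not a description of how Elasticsearch executes the query internally.

```python
# The two sample documents from the previous section (only the tweets are needed here).
docs = {
    1: {"tweets": [
        {"id": "121223221", "text": "understanding nested relationships", "created_at": "2015-09-05"},
        {"id": "121223222", "text": "NoSQL databases are awesome", "created_at": "2015-06-05"},
    ]},
    2: {"tweets": [
        {"id": "121223223", "text": "understanding nested relationships", "created_at": "2015-09-05"},
        {"id": "121223224", "text": "NoSQL databases are awesome", "created_at": "2015-09-05"},
    ]},
}


def matches_nested(doc, word, date):
    """Both conditions must hold for the same tweet, not merely somewhere in the document."""
    return any(
        word.lower() in tweet["text"].lower().split() and tweet["created_at"] == date
        for tweet in doc["tweets"]
    )


nested_hits = [doc_id for doc_id, doc in docs.items()
               if matches_nested(doc, "NoSQL", "2015-09-05")]
print(nested_hits)  # [2]
```

Only document 2 qualifies, because it is the only one in which a single tweet satisfies both conditions.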

Querying nested type data

To query a nested field, Elasticsearch offers a nested query, which has the following syntax:

"query": {
    "nested": {
      "path": "path_to_nested_doc",
      "query": {}
    }
  }

Let's understand the nested query syntax:

  • The topmost query parameter wraps all the queries inside it.
  • The nested parameter tells Elasticsearch that this query is of the nested type.
  • The path parameter specifies the path of the nested field.
  • The inner query object can contain any query supported by Elasticsearch.
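The four parts listed above map directly onto a small Python helper that assembles the request body. This is a minimal sketch: nested_query is a hypothetical convenience function for this illustration, not something provided by the Elasticsearch Python client.

```python
import json


def nested_query(path, inner_query):
    """Wrap any query so that it runs against the nested documents under path."""
    return {
        "query": {                       # topmost wrapper
            "nested": {                  # marks the query as a nested query
                "path": path,            # path of the nested field
                "query": inner_query,    # any query supported by Elasticsearch
            }
        }
    }


body = nested_query("tweets", {"match": {"tweets.text": "NoSQL"}})
print(json.dumps(body, indent=2))
```

Note that fields inside the inner query are still referenced by their full path (tweets.text), not relative to the nested path.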

Now let's run the nested query to search all the tweets that are about NoSQL and have been created on 2015-09-05.

Python example

query = {
  "query": {
    "nested": {
      "path": "tweets",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "tweets.text": "NoSQL"
              }
            },
            {
              "term": {
                "tweets.created_at": "2015-09-05"
              }
            }
          ]
        }
      }
    }
  }
}
res = es.search(index='twitter_nested', doc_type='users', body=query)

Java example

String nestedField = "tweets"; // path of the nested field

SearchResponse response = client.prepareSearch("twitter_nested")
  .setTypes("users")
  .setQuery(QueryBuilders.nestedQuery(nestedField,
    QueryBuilders.boolQuery()
      .must(QueryBuilders.matchQuery("tweets.text", "NoSQL"))
      .must(QueryBuilders.termQuery("tweets.created_at", "2015-09-05"))))
  .execute().actionGet();

The response object contains the output returned from Elasticsearch. This time, only one document matches.

Nested aggregations

Nested aggregations allow you to perform aggregations on nested fields. There are two types of nested aggregations available in Elasticsearch. The first one (nested aggregation) allows you to aggregate the nested fields, whereas the second one (reverse nested aggregation) allows you to aggregate the fields that fall outside the nested scope.

Nested aggregation

A nested aggregation allows you to perform all the aggregations on the fields inside a nested object. The syntax is as follows:

{
  "aggs": {
    "NAME": {
      "nested": {
        "path": "path_to_nested_field"
      },
      "aggs": {}
    }
  }
}

Understanding nested aggregation syntax:

The syntax of a nested aggregation is similar to that of other aggregations, but here we need to specify the path of the topmost nested field, as we did for nested queries. Once the path is specified, you can perform any aggregation on the nested documents using the inner aggs object. Let's see an example of how to do it:

Python example

query = {
  "aggs": {
    "NESTED_DOCS": {
      "nested": {
        "path": "tweets"
      },
      "aggs": {
        "TWEET_TIMELINE": {
          "date_histogram": {
            "field": "tweets.created_at",
            "interval": "day"
          }
        }
      }
    }
  }
}
res = es.search(index='twitter_nested', doc_type='users', body=query, size=0)

The preceding aggregation query creates a single nested bucket, which in turn contains the date histogram of tweets (the number of tweets created per day). Note that nested aggregations can be combined with full-text search queries in the same way as the aggregations we saw in Chapter 4, Aggregations for Analytics.
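As a rough illustration of combining the two, the request body can be assembled in Python and given a query key next to aggs, so that the aggregation only sees the documents matched by the query. The nested_agg helper and the particular match query on user.screen_name are assumptions made for this sketch.

```python
def nested_agg(name, path, sub_aggs):
    """Wrap sub-aggregations so that they run on the nested documents under path."""
    return {"aggs": {name: {"nested": {"path": path}, "aggs": sub_aggs}}}


timeline = {
    "TWEET_TIMELINE": {
        "date_histogram": {"field": "tweets.created_at", "interval": "day"}
    }
}

body = nested_agg("NESTED_DOCS", "tweets", timeline)

# Restrict the aggregation to the root documents matched by a full-text query.
# The query below is only an example; any query can be used here.
body["query"] = {"match": {"user.screen_name": "d_bharvi"}}

print(sorted(body.keys()))  # ['aggs', 'query']
```

The resulting dictionary can be passed as the body argument of es.search in the same way as in the preceding Python example.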

Java example

The following example requires this extra import in your code:

import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;

You can build the aggregation in the following way:

SearchResponse response = client.prepareSearch("twitter_nested")
  .setTypes("users")
  .addAggregation(AggregationBuilders.nested("NESTED_DOCS")
    .path(nestedField)
    .subAggregation(AggregationBuilders.dateHistogram("TWEET_TIMELINE")
      .field("tweets.created_at")
      .interval(DateHistogramInterval.DAY)))
  .setSize(0)
  .execute().actionGet();

Note

The DateHistogramInterval class offers static final constants (DAY in our example) to define the bucket intervals. The possible values are SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, and YEAR.

The output for the preceding query will look like the following:

"aggregations" : {
    "NESTED_DOCS" : {
      "doc_count" : 2,
      "TWEET_TIMELINE" : {
        "buckets" : [ {
          "key_as_string" : "2015-09-05T00:00:00.000Z",
          "key" : 1441411200000,
          "doc_count" : 2
        } ]
      }
    }
  }

In the output, NESTED_DOCS is the name of our nested aggregation. It shows a doc_count of 2 because our document contains an array of two nested tweet documents. The TWEET_TIMELINE histogram has a single bucket with a doc_count of 2 because both tweets in that document were created on the same day.
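Reading this output from Python is a matter of walking the dictionary by aggregation name. The sketch below works on a copy of the output shown above rather than on a live response; tweets_per_day is a hypothetical helper. It also shows that the bucket key is the bucket's start time in milliseconds since the epoch.

```python
from datetime import datetime, timezone

# A copy of the aggregation output shown above.
response = {
    "aggregations": {
        "NESTED_DOCS": {
            "doc_count": 2,
            "TWEET_TIMELINE": {
                "buckets": [
                    {
                        "key_as_string": "2015-09-05T00:00:00.000Z",
                        "key": 1441411200000,
                        "doc_count": 2,
                    }
                ]
            },
        }
    }
}


def tweets_per_day(resp):
    """Map each histogram bucket to an ISO date and its number of nested tweets."""
    buckets = resp["aggregations"]["NESTED_DOCS"]["TWEET_TIMELINE"]["buckets"]
    result = {}
    for bucket in buckets:
        # The key is in milliseconds; datetime expects seconds.
        day = datetime.fromtimestamp(bucket["key"] / 1000, tz=timezone.utc).date()
        result[day.isoformat()] = bucket["doc_count"]
    return result


per_day = tweets_per_day(response)
print(per_day)  # {'2015-09-05': 2}
```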

Reverse nested aggregation

Nested aggregation has the limitation that it can only access the fields within the nested scope. Reverse nested aggregations overcome this limitation and allow you to look beyond the nested scope and go back to the root document or to other nested documents.

For example, we can find all the unique users who have tweeted in a particular date range with the following reverse nested aggregation:

Python example

query = {
  "aggs": {
    "NESTED_DOCS": {
      "nested": {
        "path": "tweets"
      },
      "aggs": {
        "TWEET_TIMELINE": {
          "date_histogram": {
            "field": "tweets.created_at",
            "interval": "day"
          },
          "aggs": {
            "USERS": {
              "reverse_nested": {},
              "aggs": {
                "UNIQUE_USERS": {
                  "cardinality": {
                    "field": "user.screen_name"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
resp = es.search(index='twitter_nested', doc_type='users', body=query, size=0)

Java example

SearchResponse response = client.prepareSearch("twitter_nested")
  .setTypes("users")
  .addAggregation(AggregationBuilders.nested("NESTED_DOCS")
    .path(nestedField)
    .subAggregation(AggregationBuilders.dateHistogram("TWEET_TIMELINE")
      .field("tweets.created_at")
      .interval(DateHistogramInterval.DAY)
      .subAggregation(AggregationBuilders.reverseNested("USERS")
        .subAggregation(AggregationBuilders.cardinality("UNIQUE_USERS")
          .field("user.screen_name")))))
  .setSize(0)
  .execute().actionGet();

The output for the preceding aggregation will be as follows:

{
  "aggregations": {
    "NESTED_DOCS": {
      "doc_count": 2,
      "TWEET_TIMELINE": {
        "buckets": [
          {
            "key_as_string": "2015-09-05T00:00:00.000Z",
            "key": 1441411200000,
            "doc_count": 2,
            "USERS": {
              "doc_count": 1,
              "UNIQUE_USERS": {
                "value": 1
              }
            }
          }
        ]
      }
    }
  }
}

The preceding output shows the nested docs count as 2, whereas the USERS key specifies that there is only one root document that exists in the given time range. UNIQUE_USERS shows the cardinality aggregation output for the unique users in the index.
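The step from a nested tweet back to its root document can be mimicked in plain Python. The sketch below is only a model of the idea, with a made-up reverse_nested_summary helper. It runs over both sample documents from the start of this section, so its numbers differ from the single-document output above. It also counts users exactly with a set, whereas the cardinality aggregation returns an approximate count.

```python
from collections import defaultdict

# Both sample users from the beginning of this section, reduced to the fields used here.
docs = [
    {"user": {"screen_name": "d_bharvi"}, "tweets": [
        {"id": "121223221", "created_at": "2015-09-05"},
        {"id": "121223222", "created_at": "2015-06-05"},
    ]},
    {"user": {"screen_name": "d_bharvi"}, "tweets": [
        {"id": "121223223", "created_at": "2015-09-05"},
        {"id": "121223224", "created_at": "2015-09-05"},
    ]},
]


def reverse_nested_summary(documents):
    """Bucket nested tweets by day, then step back to the root document of each tweet."""
    tweets = defaultdict(int)   # nested scope: tweets per day
    roots = defaultdict(set)    # root scope: root documents per day
    users = defaultdict(set)    # root scope: distinct screen names per day
    for root_id, doc in enumerate(documents):
        for tweet in doc["tweets"]:
            day = tweet["created_at"]
            tweets[day] += 1
            roots[day].add(root_id)
            users[day].add(doc["user"]["screen_name"])
    return {
        day: {
            "tweets": tweets[day],
            "root_docs": len(roots[day]),
            "unique_users": len(users[day]),
        }
        for day in sorted(tweets)
    }


summary = reverse_nested_summary(docs)
for day, counts in summary.items():
    print(day, counts)
```

For 2015-09-05, this model reports three tweets from two root documents but only one unique user, which is the same distinction the USERS and UNIQUE_USERS keys draw in the aggregation output.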
