Understanding Query-DSL parameters

  • query: The query object contains all the queries that need to be passed to Elasticsearch. For example, the query to find all the documents that belong to a search category can be written as follows:
    GET index_name/doc_type/_search
    {
      "query": {
        "query_string": {
          "default_field": "category",
          "query": "search"
        }
      }
    }
  • from and size: These parameters control pagination and the number of results returned. The from parameter specifies the offset of the first document to return and defaults to 0. The size parameter, which defaults to 10, specifies how many of the top matching documents are returned.
  • _source: This optional parameter takes an array of field names that are to be returned in the query results. By default, all the fields stored inside the _source field are returned. If you do not want any field returned, you can specify _source: false.
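
As a rough illustration of how these three parameters fit together, the following Python sketch assembles a request body for a given page of results. The helper name build_search_body and its page and per_page arguments are hypothetical conveniences and not part of Elasticsearch; only query, from, size, and _source come from the description above.

```python
import json

def build_search_body(query, page=0, per_page=10, source_fields=None):
    """Assemble a search request body (hypothetical helper, not an Elasticsearch API)."""
    body = {
        "query": query,
        "from": page * per_page,  # offset of the first hit to return
        "size": per_page,         # number of hits to return
    }
    if source_fields is not None:
        # A list restricts the returned fields; False suppresses _source entirely.
        body["_source"] = source_fields
    return body

category_query = {"query_string": {"default_field": "category", "query": "search"}}

# Third page (pages counted from 0) of ten results, returning two fields only.
body = build_search_body(category_query, page=2, per_page=10,
                         source_fields=["text", "user.name"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request like the one shown for the query parameter.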

Elasticsearch queries fall into two broad categories:

  • Basic Queries: These queries include normal keyword searching inside indexes.
  • Compound Queries: These queries combine multiple basic queries together with Boolean clauses.

    Note

    We will be using our Twitter dataset to perform all the queries in this and the upcoming chapters.

Query types

At the abstract level, there are two major categories of queries in Elasticsearch:

  • Full-Text Search Queries: These are the queries that usually run over text fields like a tweet text. These queries understand the field mapping, and depending on the field type and analyzer used for that field and query, the text goes through an analysis phase (similar to indexing) to find the relevant documents.
  • Term-based search queries: Unlike full-text queries, term-based queries do not go through an analysis process. These queries are used to match the exact terms stored inside an inverted index.

    Note

    There exist a few other categories of queries such as Compound Queries, Geo Queries, and Relational Queries. We will cover Compound Queries in this chapter and the rest will be covered in the subsequent chapters.

Full-text search queries

The most important queries in this category are the following:

  • match_all
  • match
  • match_phrase
  • multi_match
  • query_string

match_all

The simplest query in Elasticsearch is the match_all query, which matches all the documents. It gives a _score of 1.0 to each document in the index. The syntax of the match_all query is as follows:

{
   "query": {
      "match_all": {}
   }
}

match query

The text passed inside a match query goes through the analysis phase and, depending on the operator (which defaults to OR), documents are matched. For example:

{
  "query": {
    "match": {
      "text": {
        "query": "Build Great Web Apps",
        "operator": "and"
      }
    }
  }
}

The preceding query will match the documents that contain the Build, Great, Web, and Apps terms in the text field. If we had used the OR operator, it would have matched the documents containing any of these terms.
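
The difference between the two operators can be sketched in a few lines of Python. This is only a simplified model: the tokens function below is a hypothetical stand-in for a real analyzer (it just lowercases and splits on whitespace), and the sketch ignores scoring altogether.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def matches(doc_text, query_text, operator="or"):
    """Return True if the document satisfies the query under the given operator."""
    doc_terms = set(tokens(doc_text))
    query_terms = tokens(query_text)
    if operator == "and":
        return all(term in doc_terms for term in query_terms)
    return any(term in doc_terms for term in query_terms)

doc_a = "Build great web apps with AngularJS"
doc_b = "Great weather today"

for doc in (doc_a, doc_b):
    print(doc, "| and:", matches(doc, "Build Great Web Apps", "and"),
          "| or:", matches(doc, "Build Great Web Apps", "or"))
```

Only the first document contains all four terms, so it is the only one that satisfies the and operator, while both documents satisfy the default or operator.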

Tip

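If you want exact matches, the text must be treated as a single phrase rather than being broken into independent tokens. The following form wraps the text in escaped quotes for that purpose. Because the handling of quotes depends on how the query text is parsed and analyzed, verify the behavior against your own mapping, and prefer the phrase form of the match query described in the next section when in doubt: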

{
  "query": {
    "match": {
      "text": "\"Build Great Web Apps\""
    }
  }
}

The preceding query will match the documents in which Build Great Web Apps appear together exactly in the same order.

Phrase search

The match query provides an option to search for phrases with the type parameter, in the following way (the match_phrase query listed earlier is the equivalent dedicated form):

{
  "query": {
    "match": {
      "text": {
        "query": "Build Great Web Apps",
        "type": "phrase"
      }
    }
  }
}
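
What makes a phrase search stricter than a search with the and operator is that term positions matter: the terms must be adjacent and in the same order. The following Python sketch illustrates the idea under a strong simplification, using a hypothetical whitespace-and-lowercase tokens function instead of a real analyzer.

```python
def tokens(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def contains_phrase(doc_text, phrase):
    """True if the phrase tokens occur consecutively and in order in the document."""
    doc, wanted = tokens(doc_text), tokens(phrase)
    n = len(wanted)
    return n > 0 and any(doc[i:i + n] == wanted for i in range(len(doc) - n + 1))

in_order = "How to build great web apps fast"
shuffled = "Great apps build the web"

print(contains_phrase(in_order, "Build Great Web Apps"))
print(contains_phrase(shuffled, "Build Great Web Apps"))
```

The second document contains every term and would satisfy the and operator, but it does not contain the phrase.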

multi_match

The multi_match query is similar to the match query, but it provides the option to search for the same terms in more than one field in one go. For example:

{
  "query": {
    "multi_match": {
      "query": "Build Great Web Apps",
      "fields": ["text", "retweeted_status.text"]
    }
  }
}

The preceding query will search the words Build, Great, Web, and Apps inside the two fields text and retweeted_status.text, and will return the relevant results in a sorted order based on the score each document gets. If you want to match only those documents in which all the terms are present, then use the and keyword in the operator parameter.
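
As a minimal sketch of that last point, the following Python helper builds a multi_match request body with the operator parameter set explicitly. The function name multi_match is a hypothetical convenience for this example; the query, fields, and operator keys are the ones discussed above.

```python
import json

def multi_match(text, fields, operator="or"):
    """Build a multi_match request body (hypothetical helper for illustration)."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "operator": operator,  # "and" requires every term to be present
            }
        }
    }

body = multi_match("Build Great Web Apps",
                   ["text", "retweeted_status.text"],
                   operator="and")
print(json.dumps(body, indent=2))
```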

query_string

Unlike the other queries available in Elasticsearch, the query_string query supports the full Lucene query syntax. It uses a query parser to construct an actual query out of the provided text. Like the match query, the text also goes through the analysis phase. The following is the syntax for query_string:

{
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "text:analytics^2 +text:data -user.name:d_bharvi"
    }
  }
}

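In the preceding query, ^2 doubles the boost of the analytics term, + marks the data term as required, and - excludes the documents whose user.name is d_bharvi. The match query that we used in the previous section can be written using a query string in the following way: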

{
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "Build Great Web Apps"
    }
  }
}

Term-based search queries

The most important queries in this category are the following:

  • Term query
  • Terms query
  • Range query
  • Exists query
  • Missing query

Term query

The term query does an exact term matching in a given field. So, you need to provide the exact term to get the correct results. For example, if you have used a lowercase filter while indexing, you need to pass the terms in lowercase while querying with the term query.

The syntax for a term query is as follows:

{
  "query": {
    "term": {
      "text": "elasticsearch"
    }
  }
}
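
The lowercase caveat is easier to see with a toy inverted index. The following Python sketch is a simplified model and not how Lucene stores data: the analyze function is a hypothetical analyzer (whitespace tokenizer plus lowercase filter), and term_lookup mimics a term query by looking up the given term without analyzing it.

```python
from collections import defaultdict

def analyze(text):
    """Hypothetical analyzer: whitespace tokenizer followed by a lowercase filter."""
    return text.lower().split()

def build_index(docs):
    """Map each analyzed term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def term_lookup(index, term):
    """Look the term up exactly as given, with no analysis (like a term query)."""
    return index.get(term, set())

docs = {1: "Learning Elasticsearch", 2: "Elasticsearch and Lucene"}
index = build_index(docs)

print(term_lookup(index, "elasticsearch"))  # lowercase term is in the index
print(term_lookup(index, "Elasticsearch"))  # capitalized term was never indexed
```

Because only lowercased terms were written to the index, the capitalized lookup finds nothing even though both documents contain the word.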

Terms query

If you want to search for more than one term in a single field, you can use the terms query. For example, to search for all the tweets in which the hashtags used are either bomb or blast, you can write a query like this:

{
  "query": {
    "terms": {
      "entities.hashtags": [
        "bomb",
        "blast"
      ],
      "minimum_match": 1
    }
  }
}

The minimum_match parameter specifies the minimum number of terms that should match in each document. This parameter is optional.

Range queries

Range queries are used to find data within a certain range. The syntax of a range query is as follows and is the same for date fields as well as number fields such as integer, long, and so on:

{
  "query": {
    "range": {
      "user.followers_count": {
        "gte": 100,
        "lte": 200
      }
    }
  }
}

The preceding query will find all the tweets created by users whose follower count is between 100 and 200. The parameters supported in the range queries are: gt, lt, gte, and lte.

Note

Please note that if you use range queries on string fields, you will get unexpected results, because string ranges are calculated lexicographically (alphabetically). A string stored as 50 will therefore be less than 6. In addition, range queries on strings are heavier operations than range queries on numbers.
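
The effect is not specific to Elasticsearch; any lexicographic comparison behaves this way. The following Python sketch compares the same three values stored as strings and as numbers. The values and the 10 to 60 range are made up for illustration.

```python
as_strings = ["6", "50", "100"]
as_numbers = [6, 50, 100]

# Lexicographic order compares character by character, so "100" < "50" < "6".
string_order = sorted(as_strings)
number_order = sorted(as_numbers)

# The same 10..60 range selects very different values in each representation.
in_string_range = [v for v in as_strings if "10" <= v <= "60"]
in_number_range = [v for v in as_numbers if 10 <= v <= 60]

print(string_order, number_order)
print(in_string_range, in_number_range)
```

As strings, all three values fall inside the range, whereas only 50 does when the values are compared as numbers.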

Range queries on dates allow date math operations. So, for example, if you want to find all the tweets from the last hour, you can use the following query:

{
  "query": {
    "range": {
      "created_at": {
        "gt": "now-1h"
      }
    }
  }
}

Similarly, months (M), minutes (m), years (y), and seconds (s) are allowed in the query.
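
To make the meaning of an expression such as now-1h concrete, the following Python sketch resolves a small subset of date math against a fixed reference time. It is a hypothetical helper and not Elasticsearch's parser: it handles only fixed-length units (s, m, h, d, w) and deliberately rejects calendar units such as M and y, whose length depends on the date.

```python
import re
from datetime import datetime, timedelta

# Fixed-length units only; months (M) and years (y) need calendar arithmetic.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

def resolve(expr, now):
    """Resolve 'now', 'now-<n><unit>' or 'now+<n><unit>' relative to the given time."""
    match = re.fullmatch(r"now(?:([+-])(\d+)([smhdw]))?", expr)
    if match is None:
        raise ValueError("unsupported date math expression: " + expr)
    sign, amount, unit = match.groups()
    if sign is None:
        return now
    delta = timedelta(**{UNITS[unit]: int(amount)})
    return now - delta if sign == "-" else now + delta

reference = datetime(2015, 10, 1, 12, 0, 0)
print(resolve("now-1h", reference))
print(resolve("now+30m", reference))
```

A range clause with "gt": "now-1h" therefore keeps the documents whose created_at is later than the first printed timestamp, when the query runs at the reference time.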

Exists queries

The exists query matches documents that have at least one non-null value in a given field. For example, the following query will find all the tweets that are replies to any other tweet:

{
  "query":{
    "constant_score":{
      "filter":{
        "exists":{"field":"in_reply_to_user_id"}
      }
    }
  }
}

Missing queries

Missing queries are the opposite of exists queries. They are used to find the documents in which a field has no value or contains only null values. For instance, the following query finds all the tweets that do not contain any hashtags:

{
  "query":{
    "constant_score":{
      "filter":{
        "missing":{"field":"hashtags"}
      }
    }
  }
}
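
The two conditions are complements of each other, which the following Python sketch models on plain dictionaries. The has_value function and the sample tweets are assumptions made for illustration; the sketch treats a missing key, a null, an empty array, and an array of nulls as having no value.

```python
def has_value(doc, field):
    """True if the field holds at least one non-null value (models the exists query)."""
    value = doc.get(field)
    if isinstance(value, list):
        return any(item is not None for item in value)
    return value is not None

tweets = [
    {"id": 1, "in_reply_to_user_id": 42, "hashtags": ["search"]},
    {"id": 2, "in_reply_to_user_id": None, "hashtags": []},
    {"id": 3, "hashtags": [None]},
]

replies = [t["id"] for t in tweets if has_value(t, "in_reply_to_user_id")]
without_hashtags = [t["id"] for t in tweets if not has_value(t, "hashtags")]

print(replies)
print(without_hashtags)
```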

Note

The story of filters

Before version 2.0.0, Elasticsearch used to have two different objects for querying data: Queries and Filters. Both used to differ in functionality and performance.

Queries were used to find out how relevant a document was to a particular query by calculating a score for each document, whereas filters were used to match certain criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch used to cache the matching documents in memory as a bitset, so that they could be fetched quickly if the same filter was executed again.

However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, things have completely changed and both the queries and filters are now the same internal object. Users need not worry about caching and performance anymore, as Elasticsearch will take care of it. However, one must be aware of the contextual difference between a query and a filter that was listed in the previous paragraph.

Put the queries that ask how relevant a document is, and therefore need score calculation, in the query context. Put the queries that only need to answer a simple yes/no question in the filter context.

If you have been using an Elasticsearch version below 2.0.0, please go through the breaking changes here: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/breaking-changes-2.0.html, and migrate your application code accordingly since there have been a lot of changes, including the removal of various filters.

Compound queries

Elasticsearch offers compound queries to connect multiple simple queries together. A compound query clause can combine any number of queries, including other compound queries, which allows you to write very complex logic for your searches. You will need them at every step while creating a search application.

Note

In the previous chapter, we saw how Lucene calculates a score based on the TF/IDF formula. This score is calculated for each and every query we send to Elasticsearch. Thus, when we combine queries in a compound form, the scores of all the queries are combined to calculate the overall score of the document.

The primary compound queries are as follows:

  • Bool query
  • Not query
  • Function score query (will be discussed in Chapter 8, Controlling Relevancy)

Bool queries

Bool queries allow us to wrap many query clauses together, including other bool clauses. The documents are matched based on the combination of these Boolean clauses, which are listed as follows:

  • must: The queries that are written inside this clause must match in order to return the documents.
  • should: The queries written inside the should clause may or may not have a match but if the bool query has no must clause inside it, then at least one should condition needs to be matched in order to return the documents.
  • must_not: The queries wrapped inside this clause must not appear in the matching documents.
  • filter: A query wrapped inside this clause must appear in the matching documents. However, this does not contribute to scoring.

The structure of bool queries is as follows:

{
  "query": {
    "bool": {
      "must": [{}],
      "should": [{}],
      "must_not": [{}],
      "filter": [{}]
    }
  }
}
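
To show the skeleton filled in, the following Python sketch builds a bool query from the four clause types. The bool_query helper is hypothetical, and the clause contents are illustrative choices that reuse field names from earlier examples in this chapter. Empty clauses are simply left out of the body.

```python
import json

def bool_query(must=(), should=(), must_not=(), filters=()):
    """Build a bool query body, omitting clauses that are empty (hypothetical helper)."""
    clauses = {
        "must": list(must),
        "should": list(should),
        "must_not": list(must_not),
        "filter": list(filters),
    }
    return {"query": {"bool": {name: qs for name, qs in clauses.items() if qs}}}

body = bool_query(
    must=[{"match": {"text": "elasticsearch"}}],             # required, scored
    should=[{"term": {"entities.hashtags": "search"}}],      # optional, raises the score
    must_not=[{"term": {"user.name": "d_bharvi"}}],          # excluded
    filters=[{"range": {"user.followers_count": {"gte": 100, "lte": 200}}}],  # required, unscored
)
print(json.dumps(body, indent=2))
```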

There are some additional parameters supported by bool queries that are listed here:

  • boost: This parameter controls the score of each query, which is wrapped inside the must or should clause.
  • minimum_should_match: This is only used for the should clauses. Using this, we can specify how many should clauses must match in order to return a document.
  • disable_coord: The bool queries by default use query coordination for all the should clauses; it is a good thing to have since the more clauses get matched, the higher the score a document will get. However, look at the following example where we may need to disable this:
    {
      "query": {
        "bool": {
          "disable_coord": true,
          "should": [
            {"term": {"text": {"value": "turmoil"}}},
            {"term": {"text": {"value": "riot"}}}
          ]
        }
      }
    }

In the preceding example, inside the text field, we are looking for the terms turmoil and riot, which are synonyms of each other. In these cases, we do not care how many synonyms are present in the document since all have the same meaning. In these kinds of scenarios, we can disable query coordination by setting disable_coord to true, so that similar clauses do not impact the score factor computation.
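
The effect of query coordination can be sketched numerically. In the following simplified Python model, the score of a bool query with only should clauses is the sum of the matching clause scores, multiplied by a coordination factor of matched clauses over total clauses. The clause scores are made-up numbers, and real scoring involves further factors, such as query normalization, that are ignored here.

```python
def bool_should_score(clause_scores, disable_coord=False):
    """Combine should-clause scores; None marks a clause that did not match."""
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    total = sum(matched)
    if disable_coord:
        return total
    # Coordination factor: the fraction of clauses that matched.
    return total * len(matched) / len(clause_scores)

only_turmoil = [1.0, None]  # document contains turmoil but not riot
both_terms = [1.0, 1.0]     # document contains both synonyms

print(bool_should_score(only_turmoil), bool_should_score(both_terms))
print(bool_should_score(only_turmoil, disable_coord=True),
      bool_should_score(both_terms, disable_coord=True))
```

With coordination, the document that contains a single synonym is penalized for the clause it did not match. With disable_coord set to true, its score is just the score of the clause that matched.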

Not queries

The not query is used to filter out the documents that match the query. For example, we can use the following to get the tweets that are not created within a certain range of time:

{
  "query": {
    "not": {
      "filter": {
        "range": {
          "created_at": {
            "from": "2015-10-01",
            "to": "2015-10-30"
          }
        }
      }
    }
  }
}

Please note that any filter can be used inside bool queries with the must, must_not, or should blocks.
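
As a sketch of that note, the same kind of exclusion can be written with the must_not block of a bool query. The following Python helper builds such a body. The function name exclude_range is hypothetical, and the dates are example values only.

```python
import json

def exclude_range(field, start, end):
    """Build a bool query that excludes documents whose field lies in [start, end]."""
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"range": {field: {"gte": start, "lte": end}}}
                ]
            }
        }
    }

body = exclude_range("created_at", "2015-10-01", "2015-10-30")
print(json.dumps(body, indent=2))
```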
