Improving the relevancy of search results

Elasticsearch is generally used for searching, although it is also a data analysis tool. In this respect, improving query relevance is an important issue. Searching means both querying and scoring, and scoring is a very important part of querying in Apache Lucene as well. We can use the re-scoring mechanism to improve a query's relevance. In addition to the document scoring capabilities of the Apache Lucene library, Elasticsearch provides different query types to manipulate the score of the results returned by our queries. In this section, you will find several tips on this issue.

Boosting the query

The boosting query allows us to effectively demote results that match a given query. This feature is very useful because it lets us push less relevant records toward the end of the result set. For example, suppose we have an index that stores the skills of developers and we're looking for developers who know the Java language. We could use a query such as the following:

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "fields": ["age", "skills", "education_status"],
  "query": {
    "match": {
      "skills": "java"
    }
  }
}'
...
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERYloLvXHAFW5Vn9ct",
            "_score": 0.30685282,
            "fields": {
               "skills": [
                  "c++",
                  "ruby",
                  "java",
                  "scala",
                  "python"
               ],
               "education_status": [
                  "graduated"
               ],
               "age": [
                  26
               ]
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERZkNpvXHAFW5Vn9jo",
            "_score": 0.30685282,
            "fields": {
               "skills": [
                  "java",
                  "jsf",
                  "wicket",
                  "scala",
                  "python",
                  "play",
                  "spring"
               ],
               "education_status": [
                  "student"
               ],
               "age": [
                  22
               ]
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERXyjCvXHAFW5Vn9W9",
            "_score": 0.30685282,
            "fields": {
               "skills": [
                  "c",
                  "java",
                  "spring",
                  "spring mvc",
                  "node.js"
               ],
               "education_status": [
                  "graduated"
               ],
               "age": [
                  27
               ]
            }
         }

What can we do if some of the returned documents matter less to us than others, and how can we make sure that we see the most relevant records first while browsing through the data? For example, suppose we want to prioritize students. Reducing the score of documents that contain unwanted terms could be a solution. You can specify such negative rules in a boosting query. In this case, the documents containing unwanted terms are still returned, but their overall scores are reduced. To send such a query to Elasticsearch, we will use the following command:

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "fields": ["age", "skills", "education_status"],
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "skills": "java"
        }
      },
      "negative": {
        "match": {
          "education_status": "graduated"
        }
      },
      "negative_boost": 0.2
    }
  }
}'
...
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERZkNpvXHAFW5Vn9jo",
            "_score": 0.30685282,
            "_source": {
               "age": 22,
               "skills": [
                  "java",
                  "jsf",
                  "wicket",
                  "scala",
                  "python",
                  "play",
                  "spring"
               ],
               "education_status": "student"
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERYloLvXHAFW5Vn9ct",
            "_score": 0.061370563,
            "_source": { 
               "age": 26,
               "skills": [
                  "c++",
                  "ruby",
                  "java",
                  "scala",
                  "python"
               ],
               "education_status": "graduated"
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERXyjCvXHAFW5Vn9W9",
            "_score": 0.061370563,
            "_source": {
               "age": 27,
               "skills": [
                  "c",
                  "java",
                  "spring",
                  "spring mvc",
                  "node.js"
               ],
               "education_status": "graduated"
            }
         }

As you can see, the score of the document whose education_status field value is student is the same as in the previous query result, but the scores of the last two documents have decreased by 80%. This is because both of these documents match the negative query, so their scores were multiplied by the negative_boost value, which we set to 0.2 in the preceding command (0.30685282 × 0.2 ≈ 0.061370563).
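
The effect of negative_boost on the scores above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic, not Elasticsearch code; the function name apply_negative_boost is hypothetical, and the scores are copied from the two responses shown earlier.

```python
def apply_negative_boost(positive_score, matches_negative, negative_boost):
    """Return the final score of a document under a boosting query.

    Hypothetical helper: it only mirrors the arithmetic visible in the
    responses above, it is not part of any Elasticsearch client.
    """
    # A document that matches the negative query is still returned,
    # but its score is multiplied by negative_boost.
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


original_score = 0.30685282  # score of every hit in the plain match query

student_score = apply_negative_boost(original_score, False, 0.2)
graduate_score = apply_negative_boost(original_score, True, 0.2)

print("student: ", student_score)
print("graduate:", graduate_score)
```

The student document keeps its original score, while the graduated documents end up with one fifth of it, which agrees with the scores in the second response up to floating-point rounding.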

Bool query

The bool query allows us to combine nested queries with Boolean logic. It provides a should occurrence type whose clauses are optional: when the query also contains a must clause, none of the should clauses has to match (of course, this behavior can be changed by setting the minimum_should_match parameter), but each matching should clause increases the document score. This feature is very useful when you want to move some results to the front of the result set. For example, suppose we have an index that stores technical articles and we're looking for articles written about Docker. We could use a query like the following:

curl -XGET localhost:9200/my_index/_search -d '{
  "query": {
   "multi_match": {
     "query": "docker",
     "fields": ["_all"]
   }
  }
}'
...
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETmMSTOCXTx0WbQQh1",
            "_score": 0.13005449,
            "_source": {
               "title": "9 Open Source DevOps Tools We Love",
               "content": "We have configured Jenkins to build code, create Docker containers..."
            }
         },
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETl_kKOCXTx0WbQQga",
            "_score": 0.111475274,
            "_source": {
               "title": "Using Docker Volume Plugins with Amazon ECS-Optimized AMI",
               "content": "Amazon EC2 Container Service (ECS) is a highly scalable, high performance container management services..."
            }
         }
...

As you can see, the first document seems less relevant to docker than the second document. In this case, we can use a should clause together with the boost parameter to improve the relevancy of our search results. The boost parameter allows us to increase the weight of the given fields. Thus, it tells Elasticsearch that some fields are more important than others when performing term matching. It is reasonable to assume that a document is relevant if its title field contains the term that we're looking for. Therefore, in our example, the important field is title. We could run the following command:

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "docker"
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": {
              "query": "docker",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}'

Okay, let's now look at the example response:

...
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETl_kKOCXTx0WbQQga",
            "_score": 0.33130926,
            "_source": {
               "title": "Using Docker Volume Plugins with Amazon ECS-Optimized AMI",
               "content": "Amazon EC2 Container Service (ECS) is a highly scalable, high performance container management services..."
            }
         },
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETmMSTOCXTx0WbQQh1",
            "_score": 0.018529123,
            "_source": {
               "title": "9 Open Source DevOps Tools We Love",
               "content": "We have configured Jenkins to build code, create Docker containers..."
            }
         }
...

As you can see, thanks to the should clause and the boost parameter, the more relevant document, whose title contains Docker, is now returned first.
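
If the same pattern is needed for different search terms, the request body can be generated instead of being written by hand. The following Python sketch builds the same bool query as a dictionary and prints it as JSON; the helper name title_boosted_query and its title_boost argument are our own assumptions, and the printed body would still have to be sent to the _search endpoint as in the curl command above.

```python
import json


def title_boosted_query(term, title_boost=2):
    """Build a bool query body: required match on _all, optional boosted match on title.

    Hypothetical helper for illustration; it only assembles the JSON body.
    """
    return {
        "query": {
            "bool": {
                # The must clause decides which documents are returned.
                "must": [{"match": {"_all": term}}],
                # The should clause is optional and only raises the score
                # of documents whose title also matches.
                "should": [
                    {"match": {"title": {"query": term, "boost": title_boost}}}
                ],
            }
        }
    }


body = title_boosted_query("docker")
print(json.dumps(body, indent=2))
```

Keeping the must and should clauses separate in this way makes it easy to experiment with different boost values without touching the part of the query that selects the documents.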

Synonyms

We talked about subtle analysis in the Introduction to Analysis section in Chapter 4, Analysis and Analyzers. Recall what you learned about the topic: TR relates to Turkey, and a search for Jeffrey Jacob Abrams also relates to J.J. Abrams. The simpler and more subtle the changes, the easier it is for human beings to notice this similarity. Machines, however, need assistance here. Synonyms allow us to ensure that documents are found by terms with the same or a similar meaning. In other words, they are used to broaden the scope of what is considered a matching document. Now let's examine the following example:

curl -XPUT localhost:9200/travel -d '{
  "settings": {
    "analysis": {
      "filter": {
        "tr_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "tr,turkey"
          ]
        }
      },
      "analyzer": {
        "tr_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "tr_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "city": {
      "properties": {
        "city": {
          "type": "string", "analyzer": "tr_synonyms"
        },
        "description": {
          "type": "string", "analyzer": "tr_synonyms"
        }
      }
    }
  }
}'

We created a travel index that uses the tr_synonyms analyzer. The analyzer is configured with a synonym token filter named tr_synonym_filter, which handles synonyms during the analysis process. Its synonyms parameter accepts an array of synonym rules that we provide. The only element of the array says that tr is a synonym of turkey and vice versa. Now let's add a document to the index:

curl -XPOST localhost:9200/travel/city -d '{
  "city": "Istanbul",
  "description": "Istanbul is the most populous city in Turkey."
}'
{"_index":"travel","_type":"city","_id":"AVEXOA_xXNtV9WrYCpuZ","_version":1,"created":true}

Now, let's search for the term tr on the travel index:

curl -XGET 'localhost:9200/travel/_search?pretty' -d '{
  "query": {
    "match": {
      "description": "tr"
    }
  }
}'
{
   "took": 12,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13561106,
      "hits": [
         {
            "_index": "travel",
            "_type": "city",
            "_id": "AVEXOA_xXNtV9WrYCpuZ",
            "_score": 0.13561106,
            "_source": {
               "city": "Istanbul",
               "description": "Istanbul is the most populous city in Turkey."
            }
         }
      ]
   }
}

As you can see, the document that we're looking for was returned to us, even though its description contains Turkey rather than tr, because the tr_synonym_filter treats the two terms as synonyms according to the rule that we defined.
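
To see why the query for tr finds this document, it can help to imitate the analysis step in a few lines of Python. The following is a deliberately simplified toy model, not the real Elasticsearch analysis chain: the analyze and matches functions are hypothetical, the tokenizer is a plain regular expression, and synonym handling is reduced to adding every member of a synonym group to the set of terms.

```python
import re

# Toy counterpart of the "tr,turkey" rule from the index settings.
SYNONYM_GROUPS = [{"tr", "turkey"}]


def analyze(text, use_synonyms=True):
    """Lowercase the text, split it into terms, and optionally expand synonyms."""
    terms = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        terms.add(token)
        if use_synonyms:
            for group in SYNONYM_GROUPS:
                if token in group:
                    # Add every synonym of the token as an extra term.
                    terms |= group
    return terms


def matches(query, field_text, use_synonyms=True):
    """A field matches when the query and the field share at least one term."""
    return bool(analyze(query, use_synonyms) & analyze(field_text, use_synonyms))


description = "Istanbul is the most populous city in Turkey."

print(matches("tr", description))                      # with the synonym rule
print(matches("tr", description, use_synonyms=False))  # without the synonym rule
```

With the synonym rule, the terms produced for the description include both turkey and tr, so the query matches; without it, the only term produced for the query is tr, which never occurs in the text.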

Be careful about the _all field

We talked about the _all field in the _all section in Chapter 3, Basic Concepts of Mapping. To remind you briefly, Elasticsearch allows you to search in all the fields of a document. This facility is provided by the _all field, which takes the text of the other fields within the indexed document and concatenates it into one big string. This feature is very useful when you want to use a full-text search. However, due to the structure of the field, searching on it may not produce the expected results. For example, let's change the query that we used in our previous example so that it runs on the _all field:

curl -XGET 'localhost:9200/travel/_search?pretty' -d '{
  "query": {
    "match": {
      "_all": "tr"
    }
  }
}'
{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

As you can see, no document was returned to us in the query results. This is because the _all field combines the original values from each field of the document as a string. In our previous example, the _all field only included these terms: [istanbul, is, the, most, populous, city, in, turkey].

So, the synonym tr did not appear in this field, because the tr_synonyms analyzer that we configured for the city and description fields is not applied to the _all field. Another important point to note is that the _all field is of the type string. This means that the values of fields of other types are also stored as strings. For example, if we have a date field whose value is 2002-11-03 00:00:00 UTC, the _all field will contain the terms [2002, 11, and 03].
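
Both points can be illustrated with a small Python sketch. It is only a rough approximation under stated assumptions: the standard_like_terms function is a hypothetical stand-in for the default analyzer that simply lowercases the text and splits it on non-alphanumeric characters, and the date is assumed to have been indexed from the string 2002-11-03.

```python
import re


def standard_like_terms(text):
    """Rough stand-in for the default analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


doc = {
    "city": "Istanbul",
    "description": "Istanbul is the most populous city in Turkey.",
}

# The _all field is modelled here as the field values joined into one string
# and analyzed without the tr_synonyms analyzer.
all_terms = standard_like_terms(" ".join(str(value) for value in doc.values()))

print(sorted(set(all_terms)))
print("tr" in all_terms)

# A date value is treated as plain text and ends up as separate string terms.
print(standard_like_terms("2002-11-03"))
```

Because no synonym expansion takes place in this model, the term tr is absent from the list, which is the same reason why the query on the _all field returned no hits.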
