Rescoring query results

Imagine a situation in which your score calculation involves numerous function queries, which makes it very CPU-intensive. This is not a problem for small result sets, but it is for larger ones. Starting from Solr 4.9, this great search engine gives us the possibility of reranking results. This means that Solr takes the top results of our initial query and applies another query only to those results. The query that is applied modifies the score of the documents. This recipe will show you how this can be done.

How to do it...

Let's say that we have a use case where we want to show the latest books added to our index and boost them on the basis of some additional query. To do this, we will need to take the following steps:

  1. Let's start with a simple index structure. Our index will consist of three fields that look as follows (please add the following entries to the schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="added" type="date" indexed="true" stored="true" />
  2. Next, we index our example data that looks as follows:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Solr 4.0 cookbook</field>
      <field name="added">2012-01-12T23:59:59Z</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="title">Solr 3.1 cookbook</field>
      <field name="added">2011-07-01T23:59:59Z</field>
     </doc>
     <doc>
      <field name="id">3</field>
      <field name="title">Elasticsearch Server</field>
      <field name="added">2012-03-01T23:59:59Z</field>
     </doc>
     <doc>
      <field name="id">4</field>
      <field name="title">Elasticsearch Server second edition</field>
      <field name="added">2014-04-01T23:59:59Z</field>
     </doc>
     <doc>
      <field name="id">5</field>
      <field name="title">Mastering Elasticsearch</field>
      <field name="added">2013-11-01T23:59:59Z</field>
     </doc>
    </add>
  3. Let's assume that our standard query looks as follows:
    http://localhost:8983/solr/cookbook/select?q={!boost%20b=recip(ms(NOW,added),3.16e-11,1,1)}*:*+OR+_query_:"title:solr+title:cookbook"&fl=*,score
  4. The results returned by Solr look like this (note that your scores might be different because the boost depends on the current time):
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">3</int>
      <lst name="params">
        <str name="q">{!boost b=recip(ms(NOW,added),3.16e-11,1,1)}*:* OR _query_:"title:solr title:cookbook"</str>
        <str name="fl">*,score</str>
      </lst>
    </lst>
    <result name="response" numFound="5" start="0" maxScore="0.35659808">
      <doc>
        <str name="id">1</str>
        <str name="title">Solr 4.0 cookbook</str>
        <date name="added">2012-01-12T23:59:59Z</date>
        <long name="_version_">1487144228514430976</long>
        <float name="score">0.35659808</float></doc>
      <doc>
        <str name="id">2</str>
        <str name="title">Solr 3.1 cookbook</str>
        <date name="added">2011-07-01T23:59:59Z</date>
        <long name="_version_">1487144228588879872</long>
        <float name="score">0.31378558</float></doc>
      <doc>
        <str name="id">4</str>
        <str name="title">Elasticsearch Server second edition</str>
        <date name="added">2014-04-01T23:59:59Z</date>
        <long name="_version_">1487144228589928448</long>
        <float name="score">0.12536839</float></doc>
      <doc>
        <str name="id">5</str>
        <str name="title">Mastering Elasticsearch</str>
        <date name="added">2013-11-01T23:59:59Z</date>
        <long name="_version_">1487144228590977024</long>
        <float name="score">0.10079001</float></doc>
      <doc>
        <str name="id">3</str>
        <str name="title">Elasticsearch Server</str>
        <date name="added">2012-03-01T23:59:59Z</date>
        <long name="_version_">1487144228588879873</long>
        <float name="score">0.056244835</float></doc>
    </result>
    </response>
  5. Of course, this is only an example. To see the actual difference in query execution time, we would need far more documents indexed than the five shown here. However, let's assume that the query takes a long time to run. The modified query that rescores only the top documents looks as follows:
    http://localhost:8983/solr/cookbook/select?q=*:*&rq={!rerank%20reRankQuery=$rerankQuery%20reRankDocs=100%20reRankWeight=10}&rerankQuery=title:solr+title:cookbook&sort=added+desc&fl=score,*
  6. The results are as follows:
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="rerankQuery">title:solr title:cookbook</str>
       <str name="fl">score,*</str>
       <str name="sort">added desc</str>
       <str name="rq">{!rerank reRankQuery=$rerankQuery reRankDocs=100 reRankWeight=10}</str>
      </lst>
     </lst>
     <result name="response" numFound="5" start="0" maxScore="11.68315">
      <doc>
       <str name="id">1</str>
       <str name="title">Solr 4.0 cookbook</str>
       <date name="added">2012-01-12T23:59:59Z</date>
       <long name="_version_">1471421442477260800</long>
       <float name="score">11.68315</float></doc>
      <doc>
       <str name="id">2</str>
       <str name="title">Solr 3.1 cookbook</str>
       <date name="added">2011-07-01T23:59:59Z</date>
       <long name="_version_">1471421442543321088</long>
       <float name="score">11.68315</float></doc>
      <doc>
       <str name="id">3</str>
       <str name="title">Elasticsearch Server</str>
       <date name="added">2012-03-01T23:59:59Z</date>
       <long name="_version_">1471421442544369664</long>
       <float name="score">1.0</float></doc>
      <doc>
       <str name="id">4</str>
       <str name="title">Elasticsearch Server second edition</str>
       <date name="added">2014-04-01T23:59:59Z</date>
       <long name="_version_">1471421442544369665</long>
       <float name="score">1.0</float></doc>
      <doc>
       <str name="id">5</str>
       <str name="title">Mastering Elasticsearch</str>
       <date name="added">2013-11-01T23:59:59Z</date>
       <long name="_version_">1471421442545418240</long>
       <float name="score">1.0</float></doc>
     </result>
    </response>

As we can see, the results changed, and believe me, so did the execution time. Now, let's see how it works.

How it works...

Our index structure is very simple; it contains the following:

  • The book identifier (the id field)
  • The book title (the title field)
  • The date the book was added into our application (the added field)

The example data is also very simple, so let's skip discussing it.

Our initial query asks for all documents in the index and boosts the documents that were added recently ({!boost%20b=recip(ms(NOW,added),3.16e-11,1,1)}*:*). We also boost the documents by adding an OR query (_query_:"title:solr+title:cookbook"). The results returned by Solr show that the query works as it should.

Note

recip(x, m, a, b) is a reciprocal function that implements a/(m*x+b), where m, a, and b are constants, and x is a field value or the result of another function (in our query, ms(NOW,added), which is the age of the document in milliseconds). For a description of available functions, refer to the official Solr documentation available at https://cwiki.apache.org/confluence/display/solr/Function+Queries.
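To get a feeling for how the reciprocal boost from our initial query behaves, here is a minimal Python sketch that evaluates a/(m*x+b) for the example books. It runs outside Solr and only mirrors the arithmetic. The date_boost helper and the fixed "now" value are assumptions made for illustration, since the real query uses the server clock at request time. The constant 3.16e-11 is roughly one divided by the number of milliseconds in a year, so a one-year-old document gets a boost of about 0.5.

```python
from datetime import datetime, timezone


def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(added, now, m=3.16e-11, a=1.0, b=1.0):
    """Hypothetical helper: boost for a document added at `added`, evaluated at `now`."""
    age_ms = (now - added).total_seconds() * 1000.0  # mirrors ms(NOW,added)
    return recip(age_ms, m, a, b)


# Assumed evaluation time; Solr would use its own NOW.
now = datetime(2014, 12, 1, tzinfo=timezone.utc)

# The "added" dates of the five example books, keyed by id.
added_dates = {
    "1": datetime(2012, 1, 12, 23, 59, 59, tzinfo=timezone.utc),
    "2": datetime(2011, 7, 1, 23, 59, 59, tzinfo=timezone.utc),
    "3": datetime(2012, 3, 1, 23, 59, 59, tzinfo=timezone.utc),
    "4": datetime(2014, 4, 1, 23, 59, 59, tzinfo=timezone.utc),
    "5": datetime(2013, 11, 1, 23, 59, 59, tzinfo=timezone.utc),
}

boosts = {doc_id: date_boost(added, now) for doc_id, added in added_dates.items()}

# Print the books from the strongest to the weakest boost.
for doc_id, boost in sorted(boosts.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, round(boost, 4))
```

The newer the book, the closer its boost is to 1; the value falls toward 0 as the document ages.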

The problem is that we calculate the score for every document that matches the query, and for some use cases, this might be too resource-heavy. This is why we modified our query. It also queries for all the documents (q=*:*); however, it first sorts the documents by the date they were added (sort=added+desc). This way, the newest documents are at the top of the result set, so we can be sure that they are the ones whose scores will be recalculated by reranking.

Instead of calculating the score for all the documents, we decided to use the Solr rerank functionality. We specified the query that should be used for boosting (rerankQuery=title:solr+title:cookbook) and included the rerank functionality. To do this, we used the rq parameter and the rerank query parser ({!rerank}). It allows us to specify the rerank query by dereferencing a request parameter; we said that Solr should take the query stored in the rerankQuery parameter (reRankQuery=$rerankQuery). We also said that we only want the score to be recalculated for the top 100 documents returned by our query (reRankDocs=100) and the rerank weight to be set to 10 (reRankWeight=10); the score of the rerank query is multiplied by this weight and added to the original score. The best thing about the second query is that the boosting query is only evaluated for the top 100 documents because of the reRankDocs property. If you look at the results, you can see that the score was properly calculated.
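Typing such a request by hand makes it easy to lose a space or forget to encode a character, so it can help to assemble it in code. The following is a minimal Python sketch that builds the rerank request from step 5 with the standard library. It does not contact Solr. The build_rerank_url helper and its argument names are hypothetical, and the host and core name are simply the ones used in this recipe.

```python
from urllib.parse import urlencode

# Host and core name as used in this recipe; adjust them to your setup.
BASE_URL = "http://localhost:8983/solr/cookbook/select"


def build_rerank_url(rerank_query, rerank_docs=100, rerank_weight=10):
    """Hypothetical helper: return the select URL with the rerank parameters."""
    params = {
        "q": "*:*",
        "sort": "added desc",
        # $rerankQuery tells the rerank parser to read the rerankQuery parameter.
        "rq": "{!rerank reRankQuery=$rerankQuery reRankDocs=%d reRankWeight=%d}"
              % (rerank_docs, rerank_weight),
        "rerankQuery": rerank_query,
        "fl": "score,*",
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


url = build_rerank_url("title:solr title:cookbook")
print(url)
```

Keeping reRankDocs and reRankWeight as arguments makes it easy to experiment with different window sizes and weights.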

Keep in mind that this method can't be used for every query and every use case. If you need to score all the documents and show only the top ones among them, you can't use this method. In our case, we were able to replace date boosting with date sorting because we are only interested in the newest documents, but remember that this is not always the case.
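To see why the size of the rerank window matters, here is a simplified Python simulation of reranking. It is a sketch under stated assumptions, not Solr's implementation: the combined score is assumed to be the original score plus reRankWeight times the rerank query score, the first pass returns the five example books sorted by the added field (newest first) with a score of 1.0 each, and the rerank score of 1.068315 for the two Solr books is the value implied by the results in step 6 under that assumption. The order of documents with equal scores in real Solr may differ.

```python
def rerank(docs, rerank_scores, rerank_docs, rerank_weight):
    """Simplified model of reranking.

    docs: list of (doc_id, original_score) in first-pass order.
    rerank_scores: doc_id -> rerank query score (missing means no match).
    Only the first `rerank_docs` entries are rescored and reordered.
    """
    window, rest = docs[:rerank_docs], docs[rerank_docs:]
    rescored = [
        (doc_id, score + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, score in window
    ]
    # Python's sort is stable, so ties keep their first-pass order.
    rescored.sort(key=lambda doc: doc[1], reverse=True)
    # Documents outside the window are returned untouched.
    return rescored + rest


# First pass: q=*:* sorted by added desc (ids 4, 5, 3, 1, 2), score 1.0 each.
first_pass = [("4", 1.0), ("5", 1.0), ("3", 1.0), ("1", 1.0), ("2", 1.0)]

# Assumed rerank query scores: only the two Solr books match the title query.
rerank_scores = {"1": 1.068315, "2": 1.068315}

full = rerank(first_pass, rerank_scores, rerank_docs=100, rerank_weight=10)
narrow = rerank(first_pass, rerank_scores, rerank_docs=3, rerank_weight=10)

print("window of 100:", full)
print("window of 3:  ", narrow)
```

With a window of 100, both Solr books are inside the window and move to the top. With a window of 3, they are never rescored and stay at the bottom, which is exactly the situation in which this method can't replace scoring all the documents.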
