Imagine a situation in which your score calculation is affected by numerous function queries, which makes it very CPU-intensive. This is not a problem for small result sets, but it is for larger ones. Starting from Solr 4.9, this great search engine gives us the ability to rerank results. This means that Solr will take the top results from our initial query and apply another query only to those results. The query that is applied modifies the score of those documents. This recipe will show you how this can be done.
Let's say that we have a use case where we want to show the latest books added to our index and boost them on the basis of some additional query. To do this, we will need to take the following steps:
First, define the index structure (add the following fields to the schema.xml file):
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="added" type="date" indexed="true" stored="true" />
Next, index the following example data:
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr 4.0 cookbook</field>
    <field name="added">2012-01-12T23:59:59Z</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">Solr 3.1 cookbook</field>
    <field name="added">2011-07-01T23:59:59Z</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">Elasticsearch Server</field>
    <field name="added">2012-03-01T23:59:59Z</field>
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="title">Elasticsearch Server second edition</field>
    <field name="added">2014-04-01T23:59:59Z</field>
  </doc>
  <doc>
    <field name="id">5</field>
    <field name="title">Mastering Elasticsearch</field>
    <field name="added">2013-11-01T23:59:59Z</field>
  </doc>
</add>
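Now, let's send the initial query, which calculates the date-based boost for every matching document:
http://localhost:8983/solr/cookbook/select?q={!boost%20b=recip(ms(NOW,added),3.16e-11,1,1)}*:*+OR+_query_:"title:solr+title:cookbook"&fl=*,score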
The results returned by Solr are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
      <str name="q">{!boost b=recip(ms(NOW,added),3.16e-11,1,1)}*:* OR _query_:"title:solr title:cookbook"</str>
      <str name="fl">*,score</str>
    </lst>
  </lst>
  <result name="response" numFound="5" start="0" maxScore="0.35659808">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr 4.0 cookbook</str>
      <date name="added">2012-01-12T23:59:59Z</date>
      <long name="_version_">1487144228514430976</long>
      <float name="score">0.35659808</float>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="title">Solr 3.1 cookbook</str>
      <date name="added">2011-07-01T23:59:59Z</date>
      <long name="_version_">1487144228588879872</long>
      <float name="score">0.31378558</float>
    </doc>
    <doc>
      <str name="id">4</str>
      <str name="title">Elasticsearch Server second edition</str>
      <date name="added">2014-04-01T23:59:59Z</date>
      <long name="_version_">1487144228589928448</long>
      <float name="score">0.12536839</float>
    </doc>
    <doc>
      <str name="id">5</str>
      <str name="title">Mastering Elasticsearch</str>
      <date name="added">2013-11-01T23:59:59Z</date>
      <long name="_version_">1487144228590977024</long>
      <float name="score">0.10079001</float>
    </doc>
    <doc>
      <str name="id">3</str>
      <str name="title">Elasticsearch Server</str>
      <date name="added">2012-03-01T23:59:59Z</date>
      <long name="_version_">1487144228588879873</long>
      <float name="score">0.056244835</float>
    </doc>
  </result>
</response>
Now, let's change the query so that it sorts by date and uses the rerank functionality:
http://localhost:8983/solr/cookbook/select?q=*:*&rq={!rerank reRankQuery=$rerankQuery reRankDocs=100 reRankWeight=10}&rerankQuery=title:solr+title:cookbook&sort=added+desc&fl=score,*
This time, the results returned by Solr are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="rerankQuery">title:solr title:cookbook</str>
      <str name="fl">score,*</str>
      <str name="sort">added desc</str>
      <str name="rq">{!rerank reRankQuery=$rerankQuery reRankDocs=100 reRankWeight=10}</str>
    </lst>
  </lst>
  <result name="response" numFound="5" start="0" maxScore="11.68315">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr 4.0 cookbook</str>
      <date name="added">2012-01-12T23:59:59Z</date>
      <long name="_version_">1471421442477260800</long>
      <float name="score">11.68315</float>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="title">Solr 3.1 cookbook</str>
      <date name="added">2011-07-01T23:59:59Z</date>
      <long name="_version_">1471421442543321088</long>
      <float name="score">11.68315</float>
    </doc>
    <doc>
      <str name="id">3</str>
      <str name="title">Elasticsearch Server</str>
      <date name="added">2012-03-01T23:59:59Z</date>
      <long name="_version_">1471421442544369664</long>
      <float name="score">1.0</float>
    </doc>
    <doc>
      <str name="id">4</str>
      <str name="title">Elasticsearch Server second edition</str>
      <date name="added">2014-04-01T23:59:59Z</date>
      <long name="_version_">1471421442544369665</long>
      <float name="score">1.0</float>
    </doc>
    <doc>
      <str name="id">5</str>
      <str name="title">Mastering Elasticsearch</str>
      <date name="added">2013-11-01T23:59:59Z</date>
      <long name="_version_">1471421442545418240</long>
      <float name="score">1.0</float>
    </doc>
  </result>
</response>
As we can see, the results changed, and believe me, so did the execution time. Now, let's see how it works.
Our index structure is very simple; it contains the following:
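The identifier of the document (the id field)
The title of the book (the title field)
The date the book was added to the index (the added field)
The example data is also very simple, so let's skip discussing it.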
Our initial query asks for all documents in the index and boosts the documents that were added recently ({!boost%20b=recip(ms(NOW,added),3.16e-11,1,1)}*:*). We also boost the documents whose titles match the terms solr or cookbook by adding an OR clause (_query_:"title:solr+title:cookbook"). The results returned by Solr show that the query works as it should.
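The initial query contains characters such as braces, spaces, and quotes that have to be URL-encoded when it is sent from code. The following is a minimal Python sketch of one way to build that request URL. The host, port, and cookbook core name are taken from the recipe; the build_boost_url helper is a hypothetical name used only for illustration.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "cookbook", as in the recipe.
SOLR_SELECT = "http://localhost:8983/solr/cookbook/select"


def build_boost_url(base=SOLR_SELECT):
    """Build the URL of the initial query that boosts every match by recency."""
    params = {
        # Boost all documents by recency and OR in the title clause.
        "q": '{!boost b=recip(ms(NOW,added),3.16e-11,1,1)}*:* '
             'OR _query_:"title:solr title:cookbook"',
        # Return all stored fields together with the score.
        "fl": "*,score",
    }
    # urlencode percent-encodes the special characters and joins the pairs with "&".
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_boost_url())
```

Letting the library do the encoding avoids hand-escaping mistakes in the local parameters part of the query.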
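The recip(field_name, m, a, b) function is a reciprocal function that implements a/(m*x+b), where m, a, and b are constants, and x is the value stored in field_name or, as in our query, the value returned by a nested function such as ms(NOW,added), which gives the number of milliseconds between the current time and the date the document was added. For a description of available functions, refer to the official Solr documentation available at https://cwiki.apache.org/confluence/display/solr/Function+Queries.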
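To get a feel for what the constants in recip(ms(NOW,added),3.16e-11,1,1) do, the formula can be evaluated outside Solr. The following Python sketch only reimplements the a/(m*x+b) arithmetic; the recency_boost helper and the MS_PER_YEAR constant are illustrative names and are not part of Solr.

```python
# Approximate number of milliseconds in a year (assumes 365.25 days).
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Plain reimplementation of the reciprocal formula a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    """Boost for a document of the given age, using the constants from the recipe."""
    return recip(age_ms, 3.16e-11, 1, 1)


if __name__ == "__main__":
    for years in (0, 1, 2, 5):
        # The boost starts at 1.0 for a brand-new document and decays with age.
        print(years, "year(s):", round(recency_boost(years * MS_PER_YEAR), 3))
```

Because 3.16e-11 is roughly one divided by the number of milliseconds in a year, a document added a year ago gets a boost of about one half, and the boost keeps shrinking as the document gets older.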
The thing is that we are calculating the boosted score for every document that matches the query, and for some use cases this is not the best approach because it might be too resource-heavy. This is why we modified our query. It also queries for all the documents (q=*:*); however, it first sorts the documents on the basis of the date they were added (sort=added+desc). In this way, we have the newest documents at the top of the result set, so we can be sure that they are the ones whose scores are recalculated by our reranking.
Instead of calculating the score for all the documents, we decided to use the Solr rerank functionality. We specified the query that should be used for boosting (rerankQuery=title:solr+title:cookbook) and included the rerank functionality. To do this, we used the rq parameter and the rerank query parser (!rerank). It allows us to specify the rerank query by dereferencing the query itself; we said that Solr should take the query stored in the rerankQuery parameter (reRankQuery=$rerankQuery). We also said that we only want the score to be recalculated for the top 100 documents returned by our query (reRankDocs=100) and the rerank weight to be set to 10 (reRankWeight=10), which means that the score of the rerank query is multiplied by 10 and added to the original score of the document. The best thing about the second query is that the boosting query is only scored against the top 100 documents because of the reRankDocs property. If you look at the results, you can see that the score was properly calculated.
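The effect of reRankDocs and reRankWeight can be illustrated with a small model written outside Solr. The following Python sketch is a simplification and not Solr's actual implementation: it assumes that the final score of each of the top reRankDocs documents is the original score plus reRankWeight times the rerank query score, and that the remaining documents keep their original order. The rerank function and the sample scores are hypothetical.

```python
def rerank(docs, rerank_score, rerank_docs=100, rerank_weight=10.0):
    """Simplified model of reranking.

    docs: list of (doc_id, original_score), already ordered by the main query.
    rerank_score: function returning the rerank query score of a document id
                  (0.0 when the document does not match the rerank query).
    """
    # Only the first rerank_docs documents are rescored.
    top, rest = docs[:rerank_docs], docs[rerank_docs:]
    rescored = [
        (doc_id, score + rerank_weight * rerank_score(doc_id))
        for doc_id, score in top
    ]
    # Python's sort is stable, so ties keep the order produced by the main query.
    rescored.sort(key=lambda doc: doc[1], reverse=True)
    return rescored + rest


if __name__ == "__main__":
    # Documents ordered by date added (newest first), each with a score of 1.0 from *:*.
    docs = [("4", 1.0), ("5", 1.0), ("3", 1.0), ("1", 1.0), ("2", 1.0)]
    # Illustrative rerank scores for the two titles that match solr and cookbook.
    scores = {"1": 1.068315, "2": 1.068315}
    print(rerank(docs, lambda doc_id: scores.get(doc_id, 0.0)))
    # With rerank_docs=3, documents 1 and 2 fall outside the window and are not boosted.
    print(rerank(docs, lambda doc_id: scores.get(doc_id, 0.0), rerank_docs=3))
```

The second call shows why the window size matters: a document that matches the rerank query but is not among the top reRankDocs results of the main query never gets its boost.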
The thing to keep in mind is that this method can't be used for every query and every use case. If you need to score all the documents and show only the top ones among them, you can't use this method. In our case, we were able to replace the date-based boosting with date sorting because we are only interested in the newest documents, but remember that this is not always the case.