Handling deep paging efficiently

In most cases, the top results returned to the user are what they are looking for: they are the most relevant ones and the ones we want to show. However, there are use cases where this is not enough. Sometimes we want to get all the results; in the worst case, we want to fetch every document stored in the collection and do something with it. When you request pages deep in the result set, you will see performance start to suffer. This is because Solr needs to build the results list for each request and discard the first N entries to get to the requested page. There are better ways to handle such cases, and this recipe discusses the one that Solr provides: cursor-based paging.

How to do it...

  1. The actual index structure doesn't matter, but for the purpose of this recipe, let's assume that we have the following index structure (only the field definitions from the schema.xml file are shown):
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
  2. Of course, we also need some data for this example to work. It would be perfect if we had hundreds of thousands of documents to illustrate the performance gains, but for the purpose of the book, we will be using only a few documents to illustrate how cursor-based paging works. Our example data looks as follows:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Solr 4.0 cookbook</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="title">Solr 3.1 cookbook</field>
     </doc>
     <doc>
      <field name="id">3</field>
      <field name="title">ElasticSearch Server</field>
     </doc>
     <doc>
      <field name="id">4</field>
      <field name="title">Mastering Elasticsearch</field>
     </doc>
     <doc>
      <field name="id">5</field>
      <field name="title">Elasticsearch Server Second Edition</field>
     </doc>
    </add>
  3. Now we can start sending the queries. Assume that we want to scroll through the results from the first one to the last, with two documents per page of results. We start by sending the following query:
    q=*:*&rows=2&sort=score+desc,id+asc&cursorMark=*

    The results returned by Solr are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="cursorMark">*</str>
       <str name="sort">score desc,id asc</str>
       <str name="rows">2</str>
      </lst>
     </lst>
     <result name="response" numFound="5" start="0">
      <doc>
       <str name="id">1</str>
       <str name="title">Solr 4.0 cookbook</str>
       <long name="_version_">1475631480903303168</long></doc>
      <doc>
       <str name="id">2</str>
       <str name="title">Solr 3.1 cookbook</str>
       <long name="_version_">1475631480954683392</long></doc>
     </result>
     <str name="nextCursorMark">AoIIP4AAACEy</str>
    </response>

    We got the documents we wanted, but they are not the only thing we are interested in. We should also look at the value of nextCursorMark that Solr returned along with the results. In our case, its value is AoIIP4AAACEy, and we will use this value in the next query, which will give us the next page of results.

  4. The query to give us the second page of results looks as follows:
    q=*:*&rows=2&sort=score+desc,id+asc&cursorMark=AoIIP4AAACEy

    The results returned by Solr are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">3</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="cursorMark">AoIIP4AAACEy</str>
       <str name="sort">score desc,id asc</str>
       <str name="rows">2</str>
      </lst>
     </lst>
     <result name="response" numFound="5" start="0">
      <doc>
       <str name="id">3</str>
       <str name="title">ElasticSearch Server</str>
       <long name="_version_">1475631480954683393</long></doc>
      <doc>
       <str name="id">4</str>
       <str name="title">Mastering Elasticsearch</str>
       <long name="_version_">1475631480955731968</long></doc>
     </result>
     <str name="nextCursorMark">AoIIP4AAACE0</str>
    </response>

    We got the next two results and a new value of the nextCursorMark property.

  5. Finally, to get the last page of the results, we will run the following query:
    q=*:*&rows=2&sort=score+desc,id+asc&cursorMark=AoIIP4AAACE0

    The results are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="cursorMark">AoIIP4AAACE0</str>
       <str name="sort">score desc,id asc</str>
       <str name="rows">2</str>
      </lst>
     </lst>
     <result name="response" numFound="5" start="0">
      <doc>
       <str name="id">5</str>
       <str name="title">Elasticsearch Server Second Edition</str>
       <long name="_version_">1475631480956780544</long></doc>
     </result>
     <str name="nextCursorMark">AoIIP4AAACE1</str>
    </response>
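
The responses shown in the preceding steps can also be read programmatically. The following is a minimal Python sketch, using only the standard library, that pulls the document identifiers and the nextCursorMark value out of a response shaped like the ones in this recipe. The parse_page function name and the trimmed sample response are our own illustration and are not part of Solr.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first response from this recipe (illustrative input).
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
 <result name="response" numFound="5" start="0">
  <doc>
   <str name="id">1</str>
   <str name="title">Solr 4.0 cookbook</str>
  </doc>
  <doc>
   <str name="id">2</str>
   <str name="title">Solr 3.1 cookbook</str>
  </doc>
 </result>
 <str name="nextCursorMark">AoIIP4AAACEy</str>
</response>
"""


def parse_page(xml_text):
    """Return (list of ids, nextCursorMark) from a Solr XML response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    ids = [
        field.text
        for field in root.findall(
            "./result[@name='response']/doc/str[@name='id']"
        )
    ]
    # nextCursorMark is a direct child of <response>, next to <result>.
    mark = root.find("./str[@name='nextCursorMark']")
    return ids, (mark.text if mark is not None else None)


ids, next_mark = parse_page(SAMPLE_RESPONSE)
print(ids, next_mark)
```

The value returned as next_mark is what would be sent as cursorMark in the following request.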

Now let's take a look at how it works.

How it works...

I'll skip discussing the index structure and the data itself, because it is very simple and really doesn't matter in this recipe. They are just here so that we are able to query Solr and get results back.

Before we go into the details of how the cursor paging method works, remember that Solr is almost stateless when it comes to querying. There are some caches, but Solr still builds the result set from scratch for almost every request. When sending a query with start=0 and rows=10, Solr needs to sort all the documents matching the query and return the top 10. Now imagine that we pass start=1000000 and rows=10. Solr needs to sort all the documents, discard the first 1,000,000, and return the ones at positions 1,000,001 to 1,000,010. This doesn't sound too efficient, and it isn't. The cursor paging method lets you overcome this by giving Solr the query state information in an encoded value passed in the cursorMark parameter. The downside of this approach is that pages must be fetched one after another; we cannot jump to an arbitrary page.
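
To make the difference concrete, here is a toy Python model of the two approaches. It is a simplified sketch and not Solr's actual implementation: the documents are just integers sorted in ascending order, and the offset_page and cursor_page names are hypothetical. The point it illustrates is that offset paging has to collect start + rows entries to return rows of them, while cursor paging only has to keep the entries that sort after the last one already seen.

```python
import heapq


def offset_page(ids, start, rows):
    """Offset paging: collect the top start+rows entries, then drop `start`."""
    top = heapq.nsmallest(start + rows, ids)  # must hold start+rows entries
    return top[start:], len(top)              # (page, entries collected)


def cursor_page(ids, after, rows):
    """Cursor paging: only entries sorting after `after` compete for the page."""
    candidates = (i for i in ids if after is None or i > after)
    page = heapq.nsmallest(rows, candidates)  # must hold only `rows` entries
    return page, len(page)


ids = list(range(1, 1001))  # 1,000 fake document ids
deep, collected_offset = offset_page(ids, start=990, rows=10)
same, collected_cursor = cursor_page(ids, after=990, rows=10)
print(deep == same, collected_offset, collected_cursor)
```

Both calls return the same ten entries, but the offset version had to collect 1,000 entries to do so, while the cursor version collected only 10.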

Let's start with our first query. We said that we want all documents to be matched (q=*:*), we want Solr to return two documents per page of results (rows=2), and we want the results to be sorted by score, with the document identifier as a tie-breaker (sort=score+desc,id+asc). Finally, we have the cursorMark parameter. Because this is the first page of results, we pass * as its value.
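
If you build the query string in code rather than by hand, the parameters described above can be assembled as in the following Python sketch. The cursor_query function name is our own. The sketch assumes the same parameter values as in this recipe and uses only urlencode from the standard library.

```python
from urllib.parse import urlencode


def cursor_query(cursor_mark="*", rows=2):
    """Build the query string for one page of cursor-based paging."""
    params = {
        "q": "*:*",
        "rows": rows,
        "sort": "score desc,id asc",  # id acts as the tie-breaker
        "cursorMark": cursor_mark,
    }
    # Keep '*', ':' and ',' readable; everything else is percent-encoded,
    # which matters because a cursor mark may contain '+', '/' or '='.
    return urlencode(params, safe="*:,")


print(cursor_query())
# prints q=*:*&rows=2&sort=score+desc,id+asc&cursorMark=*
```

Letting the library encode the cursor mark is the safer choice, because the value is an opaque encoded string and should be sent back exactly as it was received.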

Note

Note that, as of Solr 4.10, the sort must explicitly include the document identifier (the unique key field) so that Solr can properly order the results when using the cursor paging method.
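
A client can guard against forgetting the identifier in the sort with a small check before sending the request. The following Python sketch is deliberately naive: the function name is hypothetical, it assumes the unique key field is called id as in our schema, and it only understands simple "field direction" clauses separated by commas, not function-based sorts.

```python
def sort_includes_unique_key(sort, unique_key="id"):
    """Return True if a sort such as 'score desc,id asc' names the unique key."""
    for clause in sort.split(","):
        parts = clause.split()
        if (
            len(parts) == 2
            and parts[0] == unique_key
            and parts[1].lower() in ("asc", "desc")
        ):
            return True
    return False


print(sort_includes_unique_key("score desc,id asc"))
print(sort_includes_unique_key("score desc"))
```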

As you can see, in addition to the standard results, Solr returned one more thing: the nextCursorMark property. We take the value of this property and use it as the value of the cursorMark parameter in the next query. This is needed to get to the next page of results. In our case, the cursorMark parameter should be set to AoIIP4AAACEy. You can expect the value of the nextCursorMark property to be different after each page of results.

The second query is almost the same as the first one, with one change: the value of the cursorMark parameter. We set it to the value returned in the nextCursorMark property of the first response, and Solr returned the second page of results. We did exactly the same for the third query, this time setting the cursorMark parameter to the value of the nextCursorMark property from the second page of results (which was AoIIP4AAACE0 in our case).
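
The three queries follow a pattern that is easy to turn into a loop: send the mark, read the page, and repeat with the returned mark. Solr signals the end of the results by returning a nextCursorMark equal to the cursorMark that was sent. The following Python sketch shows that loop. Both function names are hypothetical, and fetch_page stands for whatever code actually calls Solr. To keep the example runnable without a server, fake_fetch imitates Solr over our five documents and uses the last returned id as its mark, which is not how real cursor marks are encoded.

```python
def iterate_all(fetch_page, rows=2):
    """Yield every document id by following nextCursorMark until it repeats."""
    cursor_mark = "*"
    while True:
        ids, next_mark = fetch_page(cursor_mark, rows)
        yield from ids
        if next_mark == cursor_mark:  # same mark returned: no more results
            break
        cursor_mark = next_mark


def fake_fetch(cursor_mark, rows, _ids=("1", "2", "3", "4", "5")):
    """Stand-in for Solr: the 'mark' is simply the last id already returned."""
    remaining = [i for i in _ids if cursor_mark == "*" or i > cursor_mark]
    page = remaining[:rows]
    return page, (page[-1] if page else cursor_mark)


print(list(iterate_all(fake_fetch)))
# prints ['1', '2', '3', '4', '5']
```

With this stand-in, the loop makes four requests: three that return documents and a final one that returns an empty page with an unchanged mark.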

See also
