Cursor pagination

Pagination with the start and rows parameters is simple and straightforward. With very large data volumes, however, these parameters alone are not enough to paginate efficiently. Consider the following:

  • During query processing, Solr cannot jump straight to an offset. It has to collect and sort the first start + rows matching documents in memory, and only then can it skip the first start documents and return the remaining rows. The deeper the requested page, the more memory and time each request costs.
  • With large volumes of data, a request for start=0&rows=1000000 forces Solr to maintain and sort a collection of 1 million documents in memory.
  • A request for start=999000&rows=1000 creates the same problem. To reach the document in position 999,001, Solr has to work through the first 999,000 documents.
  • The problem is worse with SolrCloud, where the index is distributed. With 10 shards, Solr retrieves 10 million documents (1 million from each shard) and then sorts them to find the 1,000 documents that belong on the requested page.
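
The arithmetic behind these bullets can be sketched in a few lines of Java. This is only a rough cost model built from the figures quoted above, not a measurement of Solr itself; the class and method names are hypothetical.

```java
// Rough cost model for start/rows paging, using the figures from the text.
// Illustration only: the names here are hypothetical and nothing calls Solr.
final class DeepPagingCost {

    // A node has to collect the top (start + rows) sorted entries
    // before it knows which rows belong on the requested page.
    static long entriesPerShard(long start, long rows) {
        return start + rows;
    }

    // In a distributed request, the candidates from every shard are merged.
    static long entriesMerged(int shards, long start, long rows) {
        return shards * entriesPerShard(start, rows);
    }

    public static void main(String[] args) {
        // start=999000&rows=1000 on a single node
        System.out.println(entriesPerShard(999_000, 1_000));   // prints 1000000
        // the same request spread over 10 shards
        System.out.println(entriesMerged(10, 999_000, 1_000)); // prints 10000000
    }
}
```

The cost grows with start, so the last pages of a large result set are the most expensive ones to fetch.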

To solve these problems, Solr introduced cursor-based pagination. A cursor does not keep any state cached on the server; it simply marks the point at which the last document was returned. That mark, called cursorMark, is supplied as a parameter in subsequent requests to tell Solr where to continue from.

To paginate with a cursor, we send the first request with the cursorMark parameter set to * and a sort that includes the uniqueKey field. Solr returns the top N sorted results (where N is specified by the rows parameter) along with an encoded string named nextCursorMark in the response. In each subsequent request, the nextCursorMark value is passed as the cursorMark parameter. The process is repeated until enough results have been retrieved or the value of nextCursorMark matches the cursorMark that was sent (which means that there are no more results).
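
The request and response cycle described above can be modelled with a toy in-memory cursor. The sketch below is an assumption-laden illustration, not Solr's implementation: real cursorMark values are opaque encoded strings, whereas this model simply uses the last id returned as the mark. The ToyCursor class and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of the cursorMark protocol.
// NOT how Solr encodes or resolves cursor marks; every name here is hypothetical.
final class ToyCursor {
    static final String START = "*";

    // One page of results plus the mark to send with the next request.
    static final class Page {
        final List<String> docs;
        final String nextCursorMark;

        Page(List<String> docs, String nextCursorMark) {
            this.docs = docs;
            this.nextCursorMark = nextCursorMark;
        }
    }

    // sortedIds must be unique and ascending, standing in for "sort=id asc".
    static Page fetch(List<String> sortedIds, String cursorMark, int rows) {
        List<String> docs = new ArrayList<>();
        for (String id : sortedIds) {
            if (docs.size() == rows) {
                break;
            }
            // Only ids that sort strictly after the mark belong on this page.
            if (START.equals(cursorMark) || id.compareTo(cursorMark) > 0) {
                docs.add(id);
            }
        }
        // An empty page hands back the mark it was given: the signal to stop.
        String next = docs.isEmpty() ? cursorMark : docs.get(docs.size() - 1);
        return new Page(docs, next);
    }

    // Walks the whole result set by feeding each nextCursorMark back in.
    static List<String> fetchAll(List<String> sortedIds, int rows) {
        List<String> all = new ArrayList<>();
        String cursorMark = START;
        while (true) {
            Page page = fetch(sortedIds, cursorMark, rows);
            all.addAll(page.docs);
            if (cursorMark.equals(page.nextCursorMark)) {
                break; // mark did not move, so there are no more results
            }
            cursorMark = page.nextCursorMark;
        }
        return all;
    }
}
```

Note that each call to fetch depends only on the mark it is given, which mirrors the point that no cursor state has to be kept on the server between requests.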

Examples: Fetching all documents.

Using pseudo-code:

$params = [ q => $some_query, sort => 'id asc', rows => $r, cursorMark => '*' ]
$done = false
while (not $done) {
  $results = fetch_solr($params)
  // do something with $results
  if ($params[cursorMark] == $results[nextCursorMark]) {
    $done = true
  }
  $params[cursorMark] = $results[nextCursorMark]
}

Using SolrJ:

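import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

// Sort on the uniqueKey field (id) so that the result order is deterministic.
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
boolean done = false;
while (!done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  doCustomProcessingOfResults(rsp); // application-specific handling of the page
  // An unchanged mark means that there are no more results.
  if (cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}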

Using curl:

$ curl '...&rows=10&sort=id+asc&cursorMark=*'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 docs here ...
]},
"nextCursorMark":"AoEjR0JQ"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEjR0JQ'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEpVkRCREIxQTE2"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpVkRCREIxQTE2'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEmbWF4dG9y"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEmbWF4dG9y'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 2 docs here because we've reached the end.
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpdmlld3Nvbmlj'
{
"response":{"numFound":32,"start":0,"docs":[
// no more docs here, and note that the nextCursorMark
// matches the cursorMark param we used
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}

Examples: Fetching N documents.

Using SolrJ:

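// Same setup as in the fetch-all example.
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  // Application-specific: returns true once enough documents have been collected.
  boolean hadEnough = doCustomProcessingOfResults(rsp);
  // Stop early when we have enough, or when the cursor has reached the end.
  if (hadEnough || cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}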

When implementing cursor pagination, we need to take care of a few things:

  • The start and cursorMark parameters are mutually exclusive. A request that uses cursorMark must either omit the start parameter or specify it with a value of 0.
  • The sort parameter must include the uniqueKey field. It can be the only sort clause, as in sort=id asc, or a final tie-breaker after other clauses.
  • The cursorMark values are calculated from the sort values of each document. If several documents had identical sort values, they would produce identical cursorMark values, and Solr could not tell which document a subsequent request should continue from. This is why the uniqueKey field is required as a sort clause: it guarantees that documents are returned in a deterministic order, so each cursorMark identifies a unique position.
  • When documents are sorted by a date function that uses NOW, the cursor can break, because NOW evaluates to a different value in each request and so every document gets a new sort value each time. This can result in a cursor that never ends and keeps returning the same documents. To avoid this, pass the same fixed value for the NOW parameter in all requests.
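
To make these points concrete, here is a minimal Java sketch that builds the query string for one cursor request using only the standard library. It rests on several assumptions: the uniqueKey field is named id, the CursorRequest class is hypothetical, the price field in the usage note is just an example, and the NOW parameter is given as milliseconds since the epoch, which is how Solr is understood to accept it. Check these against your own schema and Solr version.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles the parameters of one cursor request.
final class CursorRequest {

    // primarySort is any sort clause, for example "price desc".
    // fixedNowMillis should be chosen once and reused for every page.
    static String queryString(String q, String primarySort, int rows,
                              String cursorMark, long fixedNowMillis) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("rows", Integer.toString(rows));
        // Assumption: "id" is the uniqueKey field, appended as the tie-breaker.
        params.put("sort", primarySort + ",id asc");
        params.put("cursorMark", cursorMark);
        // The same NOW value in every request keeps date-based sort values stable.
        params.put("NOW", Long.toString(fixedNowMillis));
        // No start parameter is added: it cannot be combined with cursorMark.

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

For example, CursorRequest.queryString("*:*", "price desc", 10, "*", 1700000000000L) produces a query string whose sort is price desc followed by id asc, with cursorMark=* and a fixed NOW. For the following pages, only the cursorMark argument changes.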