Exporting whole query results

Users of search engines such as Solr frequently ask how to pull data out of the search engine in bulk. I'm not talking about a few hundred results returned by a query, but about all the documents that are indexed in a particular core or collection. Recent Solr releases let us scroll through the results, and with some effort, we can export all of them that way. However, Solr 4.10 also introduced the ability to export fully sorted query results at once. This recipe will show you how to do that.

How to do it...

Let's assume that we have a large index that contains book names and the number of votes users have given to those books. We would like to export all the books matching a particular query, along with the number of votes they have, to a separate file. The results of such a query can be massive. The steps are as follows:

  1. We start with our index structure that contains the following fields (we just put the following entries into the schema.xml file):
    <field name="id" type="int" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="votes" type="int" indexed="false" stored="false" docValues="true" />
    <field name="name_export" type="string" indexed="false" stored="false" docValues="true" />
  2. We also need to define a copy field that we also put into the schema.xml file:
    <copyField source="name" dest="name_export" />
  3. The example data we will use is small and looks as follows (this will only serve the purpose of showing the export functionality):
    <add>
     <doc>
      <field name="id">1</field>
      <field name="name">Solr cookbook</field>
      <field name="votes">5</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="name">Mechanics cookbook</field>
      <field name="votes">12</field>
     </doc>
     <doc>
      <field name="id">3</field>
      <field name="name">Other cookbook</field>
      <field name="votes">1</field>
     </doc>
     <doc>
      <field name="id">4</field>
      <field name="name">Yet another cookbook</field>
      <field name="votes">0</field>
     </doc>
    </add>
  4. Now, let's take a look at the configuration of Solr. First, we need to add the following request handler definition to our solrconfig.xml file:
    <requestHandler name="/export_books" class="solr.SearchHandler">
      <lst name="invariants">
        <str name="rq">{!xport}</str>
        <str name="wt">xsort</str>
        <str name="distrib">false</str>
      </lst>
      <lst name="defaults">
        <str name="df">name</str>
      </lst>
    </requestHandler>
  5. Finally, we need to disable lazy field loading by putting the following entry into our solrconfig.xml file (or modifying the existing configuration):
    <enableLazyFieldLoading>false</enableLazyFieldLoading>

    Now, we can export our data by running the following command:

    curl 'localhost:8983/solr/cookbook/export_books?q=cookbook&sort=votes+asc&fl=name_export,votes'
    

    The result returned by Solr for our example is as follows:

    {"numFound":4, "docs":[{"name_export":"Yet another cookbook","votes":0},{"name_export":"Other cookbook","votes":1},{"name_export":"Solr cookbook","votes":5},{"name_export":"Mechanics cookbook","votes":12}]}
    

As we can see, our data was exported. Let's take a look at how it works now.
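
Since the goal was to get the results into a separate file, the same request can also be scripted. The following is a minimal Python sketch and not part of the original recipe: the function names build_export_url and export_to_file and the books_export.json filename are hypothetical, and it assumes the same host, core, and handler path as the curl command above.

```python
import shutil
import urllib.parse
import urllib.request


def build_export_url(base_url, query, sort, fields):
    """Build the export URL; base_url points at the export handler."""
    params = urllib.parse.urlencode({
        "q": query,
        "sort": sort,              # for example "votes asc"
        "fl": ",".join(fields),    # every field here must use doc values
    })
    return base_url + "?" + params


def export_to_file(url, path):
    """Stream the response to disk without holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


url = build_export_url(
    "http://localhost:8983/solr/cookbook/export_books",
    "cookbook", "votes asc", ["name_export", "votes"],
)
print(url)
# export_to_file(url, "books_export.json")  # needs a running Solr instance
```

The comma in the fl value is percent-encoded as %2C by urlencode, which is equivalent to the literal comma used in the curl command.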

How it works...

Before we start, remember that the feature you are reading about was introduced in Solr 4.10 and is in a very simple form. It uses a stream sorting technique that lets Solr start sending results within milliseconds of receiving the request. In Solr 4.10, it also requires that the fields used for sorting and the fields returned in the export use doc values. These limitations can change in the future, so keep an eye on Solr release notes and sites such as http://solr.pl for more information about this.

We start with the index structure, which, as in most of the recipes, is pretty simple. It contains four fields: one to hold the unique identifier of the document (the id field), one to hold the name of the book (the name field), one to hold the number of votes the book was given (the votes field), and finally, the name_export field, which we will use for exporting. Solr export functionality allows you to export and sort only on those fields that have doc values enabled. Because of this, we need to set the docValues property to true for the votes field and create a new field called name_export, because doc values can't be turned on for analyzed text fields such as name. We also added a copy field definition to tell Solr to automatically copy the contents of the name field into the name_export field, so we don't have to send that data ourselves.
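
Because a request that names a field without doc values will not work, it can help to check the schema before exporting. The sketch below is only an illustration: the fields_without_doc_values helper is hypothetical, and the field definitions from the recipe are wrapped in a fields element (an assumption) so that the snippet parses as a standalone XML document.

```python
import xml.etree.ElementTree as ET

# Field definitions from the recipe, wrapped in a root element so they parse.
SCHEMA_FIELDS = """
<fields>
  <field name="id" type="int" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />
  <field name="votes" type="int" indexed="false" stored="false" docValues="true" />
  <field name="name_export" type="string" indexed="false" stored="false" docValues="true" />
</fields>
"""


def fields_without_doc_values(schema_xml, wanted):
    """Return the wanted field names that lack docValues="true"."""
    root = ET.fromstring(schema_xml.strip())
    enabled = {
        field.get("name") for field in root.iter("field")
        if field.get("docValues") == "true"
    }
    return [name for name in wanted if name not in enabled]


print(fields_without_doc_values(SCHEMA_FIELDS, ["name_export", "votes"]))  # []
print(fields_without_doc_values(SCHEMA_FIELDS, ["name", "votes"]))         # ['name']
```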

Now, let's get to our /export_books request handler definition. As you can see, it is based on the standard solr.SearchHandler handler, but it contains some additional properties that we haven't seen so far. To use Solr export functionality, we need to provide three parameters. First, we set the rq parameter, which stands for re-ranking query, to {!xport}; otherwise, the export won't work. Second, the wt parameter specifies the response writer, and for export functionality, it needs to be set to xsort. Finally, we need to set the distrib parameter to false so that the request is not propagated to other shards and is only executed locally.

Note

Note that initially, Solr export functionality didn't support distributed operations. Exporting data for collections that are built of more than a single primary shard needs to be done manually, shard by shard. This is going to change in future Solr releases.
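
If each shard is exported separately with the same sort, the per-shard results can be combined on the client side. The following is a rough sketch of one way to do it, assuming each shard's documents are already sorted in ascending order on the sort field. The merge_shard_exports helper and the split of the example books across two shards are made up for illustration.

```python
import heapq


def merge_shard_exports(shard_docs, sort_field):
    """Merge per-shard lists that are each sorted ascending by sort_field."""
    return list(heapq.merge(*shard_docs, key=lambda doc: doc[sort_field]))


# Hypothetical distribution of the example documents over two shards.
shard_one = [{"name_export": "Yet another cookbook", "votes": 0},
             {"name_export": "Solr cookbook", "votes": 5}]
shard_two = [{"name_export": "Other cookbook", "votes": 1},
             {"name_export": "Mechanics cookbook", "votes": 12}]

merged = merge_shard_exports([shard_one, shard_two], "votes")
print([doc["votes"] for doc in merged])  # [0, 1, 5, 12]
```

heapq.merge only looks at the head of each list at a time, so the same idea carries over to streamed input.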

As you can see, the three mentioned parameters were placed into the invariants section of the request handler definition so that the user can't overwrite them by providing the same parameter during a query. We also defined the default search field using the df property.
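
The precedence between the two sections can be pictured with a small model. This is a simplified sketch of the behavior described above, not Solr's actual implementation, and the effective_params function is hypothetical: defaults apply only when the request does not supply a value, while invariants always win.

```python
def effective_params(defaults, request, invariants):
    """Model the precedence: defaults < request parameters < invariants."""
    merged = dict(defaults)
    merged.update(request)      # the request may override defaults
    merged.update(invariants)   # invariants override whatever the request sent
    return merged


defaults = {"df": "name"}
invariants = {"rq": "{!xport}", "wt": "xsort", "distrib": "false"}

# A request that tries to change the response writer and the default field.
request = {"q": "cookbook", "wt": "json", "df": "name_export"}

params = effective_params(defaults, request, invariants)
print(params["wt"])  # xsort
print(params["df"])  # name_export
```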

One more thing we did was set the enableLazyFieldLoading property in solrconfig.xml to false. This is needed because the initial implementation of Solr export functionality contains a bug that results in query failures when the enableLazyFieldLoading property is set to true.

After this, we are ready to export our results. In the example, we exported all the documents matching the cookbook query (sent against the default search field, which is name in our case). The export functionality in Solr allows you to provide two parameters in addition to the query: sort and fl. The sort parameter can hold up to four fields and defines how the documents in the export should be sorted. In our case, we want the documents to be sorted in ascending order based on the number of votes. The fl parameter defines which fields should be exported for each document; in our case, these are name_export and votes. Remember that each field used in the sort or fl parameter has to use doc values. This is a requirement for now.
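
A client can check the sort parameter before sending the request. The sketch below assumes the comma-separated "field direction" form used in the example and the four-field limit mentioned above. The parse_sort function is a hypothetical helper, not part of Solr.

```python
MAX_SORT_FIELDS = 4  # limit stated in the text above


def parse_sort(sort_spec):
    """Split 'votes asc,name_export desc' into (field, direction) pairs."""
    clauses = []
    for clause in sort_spec.split(","):
        parts = clause.split()
        if len(parts) != 2 or parts[1] not in ("asc", "desc"):
            raise ValueError("bad sort clause: %r" % clause)
        clauses.append((parts[0], parts[1]))
    if len(clauses) > MAX_SORT_FIELDS:
        raise ValueError("export sort accepts at most %d fields" % MAX_SORT_FIELDS)
    return clauses


print(parse_sort("votes asc"))  # [('votes', 'asc')]
```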

As you can see, the exported data is what we actually wanted and was exported in JSON. This is the only format supported for now. The good thing about this functionality is that we can export even massive datasets without putting too much pressure on Solr, so you can use it whenever you need to export large amounts of data.
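
Since JSON is the only output format, converting the export to something else is left to the client. As an illustration, the sketch below turns the sample response shown earlier into CSV. The export_to_csv function is hypothetical, and it loads the whole response with json.loads, so a truly massive export would need a streaming parser instead.

```python
import csv
import io
import json

# The sample response from the recipe.
EXPORTED = (
    '{"numFound":4, "docs":['
    '{"name_export":"Yet another cookbook","votes":0},'
    '{"name_export":"Other cookbook","votes":1},'
    '{"name_export":"Solr cookbook","votes":5},'
    '{"name_export":"Mechanics cookbook","votes":12}]}'
)


def export_to_csv(exported_json, fields):
    """Convert the docs array of an export response into CSV text."""
    data = json.loads(exported_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(data["docs"])
    return buffer.getvalue()


print(export_to_csv(EXPORTED, ["name_export", "votes"]))
```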
