One of the features of search engines such as Solr that users frequently ask about is the ability to pull data out of the search engine in some form. I'm not talking about a few hundred results returned by a query, but about all the documents that are indexed in a particular core or collection. Recent releases of Solr let us scroll through the results, so with some effort we can export all of them. However, with the release of Solr 4.10, we also gained the ability to export fully sorted query results at once. This recipe will show you how to do that.
Let's assume that we have an index that contains book names and the number of votes users have given to those books, and that our hypothetical index is large. What we would like to do is export all the books matching a particular query, along with the number of votes they have, to a separate file. The results of such a query can be massive.
Add the following field definitions to the schema.xml file:
<field name="id" type="int" indexed="true" stored="true" required="true" />
<field name="name" type="text_general" indexed="true" stored="true" />
<field name="votes" type="int" indexed="false" stored="false" docValues="true" />
<field name="name_export" type="string" indexed="false" stored="false" docValues="true" />
Add the following copy field definition to the schema.xml file:
<copyField source="name" dest="name_export" />
Index the following example data:
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Solr cookbook</field>
    <field name="votes">5</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Mechanics cookbook</field>
    <field name="votes">12</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="name">Other cookbook</field>
    <field name="votes">1</field>
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="name">Yet another cookbook</field>
    <field name="votes">0</field>
  </doc>
</add>
Add the following request handler definition to the solrconfig.xml file:
<requestHandler name="/export_books" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="rq">{!xport}</str>
    <str name="wt">xsort</str>
    <str name="distrib">false</str>
  </lst>
  <lst name="defaults">
    <str name="df">name</str>
  </lst>
</requestHandler>
Disable lazy field loading by adding the following to the solrconfig.xml file (or modifying the existing configuration):
<enableLazyFieldLoading>false</enableLazyFieldLoading>
Now, we can export our data by running the following command:
curl 'localhost:8983/solr/cookbook/export_books?q=cookbook&sort=votes+asc&fl=name_export,votes'
The result returned by Solr for our example is as follows:
{"numFound":4, "docs":[{"name_export":"Yet another cookbook","votes":0},{"name_export":"Other cookbook","votes":1},{"name_export":"Solr cookbook","votes":5},{"name_export":"Mechanics cookbook","votes":12}]}
As we can see, our data was exported. Let's take a look at how it works now.
Before we start, remember that the feature you are reading about was introduced in Solr 4.10 and is still in a very simple form. For example, in Solr 4.10, it requires that the fields used for sorting and for output during export use doc values. It uses a stream sorting technique that lets Solr start sending results within milliseconds of receiving the request. This can change in the future, so keep an eye on Solr release notes and sites such as http://solr.pl for more information about this.
We start with the index structure, which, as in most of the recipes, is pretty simple. It contains four fields: one to hold the unique identifier of the document (the id field), one to hold the name of the book (the name field), one to hold the number of votes the book was given (the votes field), and finally the name_export field, which we will use for exporting. Solr export functionality allows you to export and sort only on those fields that have doc values enabled. Because of this, we need to set the docValues property to true for the votes field and create a new field called name_export, because doc values can't be turned on for analyzed fields. We also introduced a copy field definition to tell Solr to automatically copy the contents of the name field into the name_export field, so we don't have to worry about that.
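The doc values requirement can be checked before an export is attempted. The following is a minimal sketch, not part of Solr: it parses the field definitions from this recipe (wrapped in a hypothetical `<fields>` root element so that they form a valid XML document) and reports which requested fields lack `docValues="true"`. The function name `fields_without_doc_values` is an assumption made for illustration, and the check ignores any doc values defaults that may be set on the field type.

```python
import xml.etree.ElementTree as ET

# Field definitions from the recipe, wrapped in a hypothetical <fields> root.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="int" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />
  <field name="votes" type="int" indexed="false" stored="false" docValues="true" />
  <field name="name_export" type="string" indexed="false" stored="false" docValues="true" />
</fields>
"""


def fields_without_doc_values(schema_xml, requested):
    """Return the requested field names that are undefined or lack docValues="true"."""
    root = ET.fromstring(schema_xml)
    # Map each field name to True only when docValues is explicitly enabled.
    doc_values = {
        field.get("name"): field.get("docValues", "false") == "true"
        for field in root.iter("field")
    }
    return [name for name in requested if not doc_values.get(name, False)]


# Fields used in the recipe's export request: nothing is reported.
ok_fields = fields_without_doc_values(SCHEMA_FRAGMENT, ["name_export", "votes"])
# The analyzed name field has no doc values, so it is reported.
bad_fields = fields_without_doc_values(SCHEMA_FRAGMENT, ["name", "votes"])
print(ok_fields)
print(bad_fields)
```

Running the sketch against the recipe's fields shows why `name_export` is needed: asking for `name` directly is flagged, while `name_export` and `votes` pass.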
Now, let's get to our /export_books request handler definition. As you can see, it is based on the standard solr.SearchHandler handler, but it contains some additional properties that we haven't seen so far. To use Solr export functionality, we need to provide three properties. First, we set the rq parameter, which stands for re-ranking query, to {!xport}; otherwise, the export won't work. The second parameter, wt, specifies the response writer, and for export functionality it needs to be set to xsort. Finally, we need to set the distrib parameter to false so that the request is not propagated to other shards and is only executed locally.
As you can see, the three mentioned parameters were placed into the invariants section of the request handler definition so that the user can't overwrite them by providing the same parameter during a query. We also defined the default search field using the df property.
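The difference between the invariants and defaults sections can be pictured as a merge with a fixed precedence. The sketch below is a simplified model of that behavior, not Solr's actual code: defaults apply only when the request does not supply a value, and invariants always win. The `resolve_params` helper and the `title` field in the sample request are hypothetical.

```python
def resolve_params(defaults, request, invariants):
    """Merge parameters with the precedence: defaults < request < invariants."""
    merged = dict(defaults)    # lowest priority: used only when nothing else is given
    merged.update(request)     # request parameters override defaults
    merged.update(invariants)  # invariants override whatever the request sent
    return merged


# Values taken from the recipe's request handler definition.
defaults = {"df": "name"}
invariants = {"rq": "{!xport}", "wt": "xsort", "distrib": "false"}

# A hypothetical request that tries to change the writer and the default field.
request = {"q": "cookbook", "wt": "json", "df": "title"}

effective = resolve_params(defaults, request, invariants)
print(effective["wt"])  # the invariant wins, so this stays xsort
print(effective["df"])  # a default can be overridden, so this becomes title
```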
One more thing that we did was set the enableLazyFieldLoading property in solrconfig.xml to false. This is needed because the initial implementation of Solr export functionality contains a bug that will result in query failures when the enableLazyFieldLoading property is set to true.
After this, we are ready to export our results. In the example, we exported all the documents matching the cookbook query (sent against the default search field, which is name in our case). The export functionality in Solr allows you to provide two properties in addition to the query: the sort property and the fl property. The sort property can hold up to four fields and defines how the documents in the export should be sorted. In our case, we want the documents to be sorted in ascending order based on the number of votes. The fl property defines which fields should be exported for each document; in our case, these are name_export and votes. Remember that each field used in the sort property or the fl property has to use doc values. This is a requirement for now.
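The request from the example can also be assembled in code. The following is a minimal sketch using only the Python standard library: it builds the same URL as the curl command shown earlier and rejects a sort with more than four fields, the limit mentioned above. The `build_export_url` helper is hypothetical, and the host, core, and handler names are the ones used in this recipe.

```python
from urllib.parse import urlencode

MAX_SORT_FIELDS = 4  # limit on sort fields described in the recipe


def build_export_url(base_url, query, sort, fields):
    """Build an export URL.

    sort is a list of (field, direction) tuples; fields is a list of field names.
    """
    if len(sort) > MAX_SORT_FIELDS:
        raise ValueError("sort can hold at most %d fields" % MAX_SORT_FIELDS)
    params = {
        "q": query,
        "sort": ",".join("%s %s" % (field, direction) for field, direction in sort),
        "fl": ",".join(fields),
    }
    # Keep commas readable; spaces are encoded as '+', as in the curl example.
    return base_url + "?" + urlencode(params, safe=",")


url = build_export_url(
    "http://localhost:8983/solr/cookbook/export_books",
    "cookbook",
    [("votes", "asc")],
    ["name_export", "votes"],
)
print(url)
```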
As you can see, the exported data is what we actually wanted and was exported in JSON. This is the only format supported for now. The good thing about this functionality is that we can export even massive datasets without putting too much pressure on Solr, so you can use it whenever you need to export large amounts of data.
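Since the goal of the recipe is to get the exported data into a separate file, here is a minimal sketch of that last step. It parses a response in the shape shown earlier and writes it as CSV; the `export_to_csv` helper is hypothetical, and an in-memory buffer stands in for a real file. Note that the sketch loads the whole response into memory, so a truly massive export would call for an incremental JSON parser instead.

```python
import csv
import io
import json

# The response from the recipe, as returned for the example data.
SAMPLE_RESPONSE = (
    '{"numFound":4, "docs":['
    '{"name_export":"Yet another cookbook","votes":0},'
    '{"name_export":"Other cookbook","votes":1},'
    '{"name_export":"Solr cookbook","votes":5},'
    '{"name_export":"Mechanics cookbook","votes":12}]}'
)


def export_to_csv(response_text, fields, out):
    """Write the docs of an export response to out as CSV; return numFound."""
    data = json.loads(response_text)
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(data["docs"])  # rows keep the order Solr sorted them in
    return data["numFound"]


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
count = export_to_csv(SAMPLE_RESPONSE, ["name_export", "votes"], buffer)
lines = buffer.getvalue().splitlines()
print(count)
print(lines)
```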