Let's assume that we are running a bookstore and want to sort our books by publication date and run faceting on the number of likes each book gets. However, we get all our data in XML, and the values are not in the format Solr expects for such fields. The good thing is that we can tell Solr to parse our data properly so that we don't have to change what we already have. This recipe will show you how to do this.
Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to updating the request processor configuration.
Let's look at the steps we need to take to make data parsing work.
Add the following fields to the schema.xml file:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="published" type="date" indexed="true" stored="true" />
<field name="likes" type="long" indexed="true" stored="true" />
Add the following update request processor chain to the solrconfig.xml file:
<updateRequestProcessorChain name="parse">
  <processor class="solr.ParseLongFieldUpdateProcessorFactory">
    <str name="fieldName">likes</str>
  </processor>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="fieldName">published</str>
    <arr name="format">
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Alter the /update request handler configuration by adding the following section to our solrconfig.xml file:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">parse</str>
  </lst>
</requestHandler>
Index the following data:
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr Cookbook 4</field>
    <field name="published">2013-01-10</field>
    <field name="likes">10</field>
  </doc>
</add>
Run the following query, which sorts on the published field and facets on the likes field:
http://localhost:8983/solr/cookbook/select?q=*:*&sort=published+desc&facet=true&facet.field=likes
The response from Solr looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">106</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="facet.field">likes</str>
      <str name="sort">published desc</str>
      <str name="facet">true</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr Cookbook 4</str>
      <date name="published">2013-01-10T00:00:00Z</date>
      <long name="likes">10</long>
      <long name="_version_">1468068127952601088</long>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="likes">
        <int name="10">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
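To check such a response programmatically rather than by eye, a small script can read the element tags (which carry the stored types) and the facet counts. The following is only a sketch using Python's standard library; the function name read_response is hypothetical, and the embedded XML is a trimmed copy of the response shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from this recipe (only the parts read below).
RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr Cookbook 4</str>
      <date name="published">2013-01-10T00:00:00Z</date>
      <long name="likes">10</long>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="likes">
        <int name="10">1</int>
      </lst>
    </lst>
  </lst>
</response>"""

# Path to the facet buckets of the likes field inside the response.
LIKES_FACET_PATH = (
    "./lst[@name='facet_counts']"
    "/lst[@name='facet_fields']"
    "/lst[@name='likes']/int"
)


def read_response(xml_text):
    """Return (documents, likes facet counts) from a Solr XML response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    docs = []
    for doc in root.findall("./result[@name='response']/doc"):
        # Each field maps to (element tag, text); the tag (str, date, long)
        # tells us which type the value was stored as.
        docs.append({field.get("name"): (field.tag, field.text) for field in doc})
    facets = {}
    for bucket in root.findall(LIKES_FACET_PATH):
        facets[bucket.get("name")] = int(bucket.text)
    return docs, facets


docs, facets = read_response(RESPONSE)
print(docs[0]["published"])  # tag and value of the published field
print(facets)                # facet value -> number of documents
```

The published value comes back inside a date element and likes inside a long element, which is what we would expect if both parsing processors were applied.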
As you can see, the data was properly parsed, the sorting works, and faceting also works, so let's see how it was possible.
Our data is very simple. Each book is described with its identifier (the id field), the title (the title field), the publication date (the published field), and the number of likes (the likes field). The published field is of the date type for proper date-based sorting, and the likes field is of the long type.
Our update request processor chain contains two processors that we are not yet familiar with. The first one, solr.ParseLongFieldUpdateProcessorFactory, is responsible for parsing data to the long type. It takes the field defined in the fieldName property from the document sent for indexing and parses its value. The second processor is solr.ParseDateFieldUpdateProcessorFactory, which we already talked about in the Using Solr in a schemaless mode recipe in Chapter 1, Apache Solr Configuration, but let's recap. It takes the field defined in the fieldName property from the document sent for indexing and tries to parse its value using the date formats listed in the format array. We only defined a single format, but you can list multiple formats if you need to.
For a description of the possible formats, refer to http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html.
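The behavior of the two parsing steps can be illustrated outside of Solr. The following Python sketch is not Solr's implementation; it only mimics the idea of trying each configured format in turn. The function names are hypothetical, Python's strptime directives stand in for the Joda-Time patterns (%Y-%m-%d plays the role of yyyy-MM-dd), and treating values without a time zone as UTC and returning unparseable values unchanged are assumptions of this sketch.

```python
from datetime import datetime, timezone


def parse_long(value):
    """Sketch of the long-parsing step: a string goes in, an integer comes out."""
    return int(value.strip())


def parse_date(value, formats):
    """Try each format in order and return a Solr-style UTC timestamp string.

    If no format matches, the value is returned unchanged
    (an assumption made for this sketch).
    """
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        # Assumption: values without a time zone are interpreted as UTC.
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return value


print(parse_long("10"))
print(parse_date("2013-01-10", ["%Y-%m-%d"]))
print(parse_date("10/01/2013", ["%Y-%m-%d", "%d/%m/%Y"]))
```

The last call shows why listing several formats can be useful: the first pattern fails on the value, so the second one is tried.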
We also defined the solr.UpdateRequestHandler configuration and altered its default behavior by adding the defaults section and setting the update.chain property to parse (our update request processor chain name). This means that our update request processor chain will be used with every indexing request sent to the /update handler.
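To make this concrete, the following Python sketch builds an indexing request for the sample book. It assumes the core is reachable at http://localhost:8983/solr/cookbook (the address used by the query in this recipe); the function name build_update_request is hypothetical. With the defaults section in place no extra parameter is needed, but the chain can also be named explicitly through the update.chain request parameter, which the optional chain argument shows.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Core address taken from the query used in this recipe.
SOLR_CORE = "http://localhost:8983/solr/cookbook"

# The sample document, with published and likes sent as plain strings.
BOOK = """<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr Cookbook 4</field>
    <field name="published">2013-01-10</field>
    <field name="likes">10</field>
  </doc>
</add>"""


def build_update_request(xml_body, chain=None):
    """Build (but do not send) a POST request to the /update handler."""
    params = {"commit": "true"}
    if chain is not None:
        # Explicitly select the update request processor chain by name.
        params["update.chain"] = chain
    url = SOLR_CORE + "/update?" + urlencode(params)
    return Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_update_request(BOOK)
print(request.full_url)
print(build_update_request(BOOK, chain="parse").full_url)
# Sending requires a running Solr instance:
#   from urllib.request import urlopen
#   urlopen(request)
```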
After indexing our data and running a query, we will see that our data has the proper field types. We will also see that sorting works on the published field, which was parsed into a date, although the original content of our published field was not in a format understandable by Solr.
For more parsing options, refer to the Javadoc of solr.FieldMutatingUpdateProcessorFactory, available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html. The classes extending this class provide a nice description of the additional possibilities.