Using parsing update processors to parse data

Let's assume that we are running a bookstore and want to sort our books by publication date and run faceting on the number of likes each book gets. However, we receive all our data in XML, and the values are not in the format Solr expects. The good news is that we can tell Solr to parse our data properly so that we don't have to change what we already have. This recipe will show you how to do this.

Getting ready

Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to updating the request processor configuration.

How to do it...

Let's look at the steps we need to take to make data parsing work.

  1. First, we need to prepare our index structure, so we add the following section to the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="published" type="date" indexed="true" stored="true" />
    <field name="likes" type="long" indexed="true" stored="true" />
  2. In addition to this, we need a custom update request processor chain defined. To do this, we add the following section to the solrconfig.xml file:
    <updateRequestProcessorChain name="parse">
     <processor class="solr.ParseLongFieldUpdateProcessorFactory">
      <str name="fieldName">likes</str>
     </processor>
     <processor class="solr.ParseDateFieldUpdateProcessorFactory">
      <str name="fieldName">published</str>
      <arr name="format">
       <str>yyyy-MM-dd</str>
      </arr>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  3. The third step is to alter the /update request handler configuration by adding the following section to our solrconfig.xml file:
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
      <str name="update.chain">parse</str>
     </lst>
    </requestHandler>
  4. Now, we can index our data, which looks like this:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Solr Cookbook 4</field>
      <field name="published">2013-01-10</field>
      <field name="likes">10</field>
     </doc>
    </add>
  5. After we send our data, we can check a simple query like this:
    http://localhost:8983/solr/cookbook/select?q=*:*&sort=published+desc&facet=true&facet.field=likes

    The response from Solr looks as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">106</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="facet.field">likes</str>
       <str name="sort">published desc</str>
       <str name="facet">true</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <str name="id">1</str>
       <str name="title">Solr Cookbook 4</str>
       <date name="published">2013-01-10T00:00:00Z</date>
       <long name="likes">10</long>
       <long name="_version_">1468068127952601088</long>
      </doc>
     </result>
     <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
       <lst name="likes">
        <int name="10">1</int>
       </lst>
      </lst>
      <lst name="facet_dates"/>
      <lst name="facet_ranges"/>
     </lst>
    </response>

As you can see, the data was parsed properly, and both sorting and faceting work. Let's see how this was possible.

How it works...

Our data is very simple. Each book is described by its identifier (the id field), the title (the title field), the publication date (the published field), and the number of likes (the likes field). The published field is of the date type for proper date-based sorting, and the likes field is of the long type.
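
The date and long names used above refer to field types that must be defined in the same schema.xml file. The recipe does not show them, so what follows is only a sketch, assuming a Solr 4.x setup whose types follow the stock example schema. The classes and attribute values are taken from that example and may differ in your installation; the text_general type is left out because its analyzer definition is longer.

```xml
<!-- Assumed definitions, modeled on the Solr 4.x example schema.xml;
     check your own schema before copying them. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" />
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0" />
```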

Our update request processor chain contains two processors that we have not used so far. The first one, solr.ParseLongFieldUpdateProcessorFactory, is responsible for parsing data to the long type. It takes the field named in the fieldName property from the document sent for indexing and parses its value. The second processor is solr.ParseDateFieldUpdateProcessorFactory, which we already talked about in the Using Solr in a schemaless mode recipe in Chapter 1, Apache Solr Configuration, but let's recap. It takes the field named in the fieldName property from the document sent for indexing and tries to parse its value using the date formats listed in the format array. We defined only a single format, but you can list multiple formats if you need to.
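
To illustrate the last point, here is a sketch of the date parsing processor configured with several formats. The extra patterns are illustrative choices and not part of the recipe. As far as I know, Solr tries the formats in the order in which they are listed and uses the first one that matches, so it is safest to put the more specific patterns first.

```xml
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
 <str name="fieldName">published</str>
 <arr name="format">
  <!-- Illustrative patterns; adjust them to the data you actually receive -->
  <str>yyyy-MM-dd'T'HH:mm:ss</str>
  <str>yyyy-MM-dd HH:mm</str>
  <str>yyyy-MM-dd</str>
 </arr>
</processor>
```

With such a configuration, values like 2013-01-10 and 2013-01-10 14:30 in the published field should both be accepted.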

Note

For a description of the possible formats, refer to http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html.

We also defined the solr.UpdateRequestHandler configuration, altering its default configuration by adding the defaults section with the update.chain property set to parse (the name of our update request processor chain). This means that our update request processor chain will be used with every indexing request sent to that handler.
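
If every update should go through the chain regardless of the handler used, there is an alternative worth knowing: the chain itself can be marked as the default one with the default attribute. The following is a sketch of that variant, not the configuration used in this recipe. In this case, the /update handler does not need the update.chain entry, and a single request can still choose another chain by passing update.chain as a request parameter.

```xml
<!-- Variant: default="true" makes this chain apply when no update.chain is given -->
<updateRequestProcessorChain name="parse" default="true">
 <processor class="solr.ParseLongFieldUpdateProcessorFactory">
  <str name="fieldName">likes</str>
 </processor>
 <processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <str name="fieldName">published</str>
  <arr name="format">
   <str>yyyy-MM-dd</str>
  </arr>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```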

After indexing our data and running a query, we can see that our data has the proper field types. We can also see that sorting works on the published field, which was parsed into a date even though the value we sent (2013-01-10) was not in the full timestamp format that Solr date fields expect (2013-01-10T00:00:00Z).

See also
