Counting the number of fields

Imagine that we index simple documents with titles and tags into Solr. We want to distinguish the premium documents, those that have more tag values, because they are more valuable to our business. Of course, we could count the tags ourselves before indexing, but why not let Solr do it? This recipe shows how.

How to do it...

Let's look at the steps we need to take to count the number of field values.

  1. We start with the index structure. What we need to do is put the following section in the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="tags_count" type="int" indexed="true" stored="true"/>
  2. The next thing is our test data, which looks as follows:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Solr Cookbook 4</field>
      <field name="tags">solr</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="title">Solr Cookbook 4 second edition</field>
      <field name="tags">search</field>
      <field name="tags">solr</field>
      <field name="tags">cookbook</field>
     </doc>
    </add>
  3. In addition to this, we need to alter our solrconfig.xml file. First, we add the following update request processor chain to the file:
    <updateRequestProcessorChain name="count">
     <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">tags</str>
      <str name="dest">tags_count</str>
     </processor>
     <processor class="solr.CountFieldValuesUpdateProcessorFactory">
      <str name="fieldName">tags_count</str>
     </processor>
     <processor class="solr.DefaultValueUpdateProcessorFactory">
      <str name="fieldName">tags_count</str>
      <int name="value">0</int>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  4. We also want our update request processor chain to be used with every indexing request, so we change the /update handler in the solrconfig.xml file so that it looks like this:
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
      <str name="update.chain">count</str>
     </lst>
    </requestHandler>
  5. Now, if we want to use the count information Solr automatically added, we will send the following query:
    http://localhost:8983/solr/cookbook/select?q=title:cookbook&bf=field(tags_count)&defType=edismax
  6. Solr will position the document with more tags at the top of the result list:
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">title:cookbook</str>
       <str name="defType">edismax</str>
       <str name="bf">field(tags_count)</str>
      </lst>
     </lst>
     <result name="response" numFound="2" start="0">
      <doc>
       <str name="id">2</str>
       <str name="title">Solr Cookbook 4 second edition</str>
       <arr name="tags">
        <str>search</str>
        <str>solr</str>
        <str>cookbook</str>
       </arr>
       <int name="tags_count">3</int>
       <long name="_version_">1467535763434373120</long></doc>
      <doc>
       <str name="id">1</str>
       <str name="title">Solr Cookbook 4</str>
       <arr name="tags">
        <str>solr</str>
       </arr>
       <int name="tags_count">1</int>
       <long name="_version_">1467535763382992896</long></doc>
      </result>
    </response>

Now, let's see how it works.

How it works...

The index structure is quite simple. It contains a unique identifier field, a title, a field holding tags, and a field holding the count of tags. As you can see in the example data, we provide the identifier of the document, its title, and its tags. What we don't provide is the number of tags; Solr calculates it during indexing.

We also defined a new update request processor chain called count. It contains five update processors.

The first update processor, solr.CloneFieldUpdateProcessorFactory, copies the values of the field defined by the source property to the field defined by the dest property. The second update processor, solr.CountFieldValuesUpdateProcessorFactory, replaces the values of the field defined by the fieldName property with the number of those values. Because the values are replaced, we run solr.CloneFieldUpdateProcessorFactory first; otherwise, we would lose the original content of the tags field. The third update processor, solr.DefaultValueUpdateProcessorFactory, sets the default value (defined by the value property) for the field defined by the fieldName property when the document has no value in that field. We need it because a document without any tags has nothing to clone or count, so its tags_count field would otherwise be missing. The other two processors are responsible for logging the request information and running the update.

By defining this chain, we tell Solr to clone the tags field into tags_count first, then replace the content of tags_count with the number of its values, and finally set tags_count to 0 if it still has no value.
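
The order of the three processors can be illustrated with a small Python sketch. It is only a toy model of the behavior described above, not Solr code: the function names are made up for this illustration, and the third, untagged document is a hypothetical addition to the recipe's test data that shows why the default value is needed.

```python
# Toy model of the "count" chain. Each function mimics one processor;
# the function names are hypothetical and this is not Solr code.

def clone_field(doc, source, dest):
    """Mimic CloneFieldUpdateProcessorFactory: copy source values into dest."""
    if source in doc:
        doc.setdefault(dest, []).extend(doc[source])
    return doc

def count_field_values(doc, field_name):
    """Mimic CountFieldValuesUpdateProcessorFactory: replace the values with their count."""
    if field_name in doc:
        doc[field_name] = len(doc[field_name])
    return doc

def default_value(doc, field_name, value):
    """Mimic DefaultValueUpdateProcessorFactory: set the value only if the field is absent."""
    doc.setdefault(field_name, value)
    return doc

def run_count_chain(doc):
    """Apply the three steps in the same order as the chain in solrconfig.xml."""
    doc = dict(doc)  # work on a copy so the input document is left untouched
    clone_field(doc, "tags", "tags_count")
    count_field_values(doc, "tags_count")
    default_value(doc, "tags_count", 0)
    return doc

docs = [
    {"id": "1", "title": "Solr Cookbook 4", "tags": ["solr"]},
    {"id": "2", "title": "Solr Cookbook 4 second edition",
     "tags": ["search", "solr", "cookbook"]},
    {"id": "3", "title": "Untagged document"},  # hypothetical, not in the test data
]

counts = {doc["id"]: run_count_chain(doc)["tags_count"] for doc in docs}
print(counts)  # {'1': 1, '2': 3, '3': 0}
```

Swapping the first two steps in run_count_chain would leave tags_count at 0 for every document, because there would be nothing to count yet.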

We also define the solr.UpdateRequestHandler configuration and alter its default configuration by adding the defaults section and setting the update.chain property to count (the name of our update request processor chain). This means that our update request processor chain will be used with every indexing request sent to the /update handler.
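
If we prefer not to change the handler defaults, Solr also accepts update.chain as a request parameter, so the chain can be selected for a single indexing request. The following Python sketch only builds the request URL and the XML payload; the helper names are hypothetical, and the host, port, and core name are taken from the query URL used in this recipe.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL of the core, assumed from the query URL used in this recipe.
SOLR_CORE_URL = "http://localhost:8983/solr/cookbook"

def build_add_payload(docs):
    """Build an <add> XML message; list values become repeated <field> elements."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for single_value in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(single_value)
    return ET.tostring(add, encoding="unicode")

def build_update_url(chain, commit=True):
    """Build an /update URL that selects the given chain for this request only."""
    params = {"update.chain": chain}
    if commit:
        params["commit"] = "true"
    return SOLR_CORE_URL + "/update?" + urlencode(params)

payload = build_add_payload(
    [{"id": "1", "title": "Solr Cookbook 4", "tags": ["solr"]}]
)
url = build_update_url("count")
print(url)
print(payload)
```

To actually index the document, the payload would be sent to that URL as an HTTP POST with the Content-Type header set to text/xml, for example with urllib.request or curl.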

Our query searches for every document that includes the cookbook term in the title field. We use the edismax query parser (defType=edismax) and include a simple additive boost function that boosts documents by the value of their tags_count field (bf=field(tags_count)). As the results show, the document with three tags is returned ahead of the document with one tag, which is what we wanted to achieve.
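
When the query is sent from application code rather than typed into a browser, the colon and parentheses in the parameter values should be percent-encoded. A minimal Python sketch of building the same request follows; the function name is hypothetical, and the base URL is the one used in this recipe.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_boosted_query_url(base_url, query, boost_function):
    """Build a /select URL that uses edismax with an additive boost function."""
    params = {
        "q": query,
        "defType": "edismax",   # the bf parameter is handled by the (e)dismax parsers
        "bf": boost_function,   # function whose value boosts each matching document
    }
    # urlencode percent-encodes characters such as ':' and '(' in the values.
    return base_url + "/select?" + urlencode(params)

url = build_boosted_query_url(
    "http://localhost:8983/solr/cookbook",
    "title:cookbook",
    "field(tags_count)",
)
print(url)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["bf"])  # ['field(tags_count)']
```

The decoded parameters match the params section that Solr echoes in the response header shown earlier.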
