Using scripting update processors to modify documents

Sometimes we need to modify documents during indexing, but we don't want to do this on the indexing application side. For example, suppose we have documents describing Internet sites, and we want to filter the sites by the protocol they use, for example, http or https. We don't have this information as a separate field; we only have the whole URL. Let's see how we can achieve this with Solr.

Getting ready

Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get familiar with update request processor configuration.

How to do it...

The following steps will take you through the process of achieving our goal:

  1. First, we start with the index structure, putting the following section in the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="url" type="text_general" indexed="true" stored="true"/>
    <field name="protocol" type="string" indexed="true" stored="true" />
  2. The next step is configuring Solr by adding a new update request processor chain called script. We do this by adding the following section to our solrconfig.xml file:
    <updateRequestProcessorChain name="script">
     <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">script.js</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  3. The third step is to alter the /update request handler configuration by adding the following section to our solrconfig.xml file:
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
      <str name="update.chain">script</str>
     </lst>
    </requestHandler>
  4. Finally, we need the script mentioned in the update request processor chain configuration, which we called script.js and stored in the conf directory (the same directory where the schema.xml file is placed). The content of the script.js file looks as follows:
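    function processAdd(cmd) {
      // The document being added is exposed on the command object
      doc = cmd.solrDoc;
      url = doc.getFieldValue("url");
      if (url != null) {
        // Everything before the first colon is treated as the protocol
        parts = url.split(":");
        if (parts != null && parts.length > 0) {
          doc.setField("protocol", parts[0]);
        }
      }
    }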
    
    function processDelete(cmd) {
    }
    
    function processMergeIndexes(cmd) {
    }
    
    function processCommit(cmd) {
    }
    
    function processRollback(cmd) {
    }
    
    function finish() {
    }

    Our example data looks as follows:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="url">http://solr.pl/</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="url">https://drive.google.com/</field>
     </doc>
    </add>
  5. After indexing our data, we can try our script out by running the following query:
    http://localhost:8983/solr/cookbook/select?q=*:*&fq=protocol:http

    The response from Solr should be similar to the following:

    <?xml version="1.0" encoding="UTF-8"?>
     <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="fq">protocol:http</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <str name="id">1</str>
       <str name="url">http://solr.pl/</str>
       <str name="protocol">http</str>
       <long name="_version_">1468022030035058688</long>
      </doc>
     </result>
    </response>

As you can see, everything works as it should, so now let's see how it worked.

How it works...

Our data is very simple. Each document is described with its identifier (the id field), the URL (the url field), and the field holding the protocol (the protocol field). The first two fields will be passed in the data; the protocol field will be filled automatically by our update request processor chain.

The next thing is to configure our update request processor chain. We already described most of the configuration details in the Counting the number of fields recipe of this chapter. The new thing is the solr.StatelessScriptUpdateProcessorFactory processor. It allows us to define a script (using the script property) that will be used to process our documents. In our case, this script is called script.js. Solr will load this script and use it for each document passed through the update request processor chain.

We also defined the solr.UpdateRequestHandler configuration and altered its default behavior by adding the defaults section and setting the update.chain property to script (the name of our update request processor chain). This means that our update request processor chain will be used by default for the indexing requests sent to the /update handler.
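Since update.chain is a request parameter that the defaults section merely pre-fills, it should also be possible to name the chain on an individual request instead. The following sketch only builds such a URL; the host, port, and cookbook core name are taken from the query used in this recipe, and the commit parameter is added as an assumption for convenience:

```javascript
// Assumed local Solr instance and core name, matching the query in step 5
var base = "http://localhost:8983/solr/cookbook/update";

// Name the chain explicitly for this single request
var params = new URLSearchParams({ "update.chain": "script", commit: "true" });

var updateUrl = base + "?" + params.toString();
console.log(updateUrl);
// http://localhost:8983/solr/cookbook/update?update.chain=script&commit=true
```

Keeping the chain in the handler defaults, as the recipe does, means that indexing clients don't need to know about it at all.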

Finally, we come to the juicy part of the recipe, the script.js script. The solr.StatelessScriptUpdateProcessorFactory processor allows us to alter Solr behavior using the following script functions:

  • processAdd: This function is executed when a document is added to the index. In our case, we will put our code in this function.
  • processDelete: This function is executed when a delete operation is sent to Solr.
  • processMergeIndexes: This function is executed when the index merge command is sent to Solr.
  • processCommit: This function is executed when the commit command is sent to Solr.
  • processRollback: This function is executed when the rollback command is sent to Solr.
  • finish: This function holds any code that should run after the script has finished executing.

Apart from the finish function, all the other functions have a single argument that represents the command sent to Solr.
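To make the calling convention concrete, here is a minimal sketch that mimics it outside Solr. The dispatch helper, the hooks object, and the shape of the mock command objects are all assumptions made for illustration; inside Solr, the processor itself calls the functions defined in script.js and passes real command objects:

```javascript
// Records which hooks ran, in order
var calls = [];

// Hypothetical stand-ins for the functions a script would define
var hooks = {
  processAdd: function (cmd) { calls.push("add:" + cmd.solrDoc.id); },
  processDelete: function (cmd) { calls.push("delete:" + cmd.id); },
  processCommit: function (cmd) { calls.push("commit"); },
  finish: function () { calls.push("finish"); }
};

// Hypothetical dispatcher: every hook except finish receives the command
function dispatch(hookName, cmd) {
  var hook = hooks[hookName];
  if (typeof hook !== "function") {
    throw new Error("Unknown hook: " + hookName);
  }
  return hookName === "finish" ? hook() : hook(cmd);
}

dispatch("processAdd", { solrDoc: { id: "1" } });
dispatch("processDelete", { id: "1" });
dispatch("processCommit", {});
dispatch("finish");

console.log(calls.join(", "));
```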

As already mentioned, we only need to provide logic in the processAdd function. We start by retrieving the Solr document from the command (the cmd object) and storing it in the doc variable (doc = cmd.solrDoc;). Next, we get the value of the url field of the document (url = doc.getFieldValue("url");). We check whether the field is defined (if (url != null)); if it is, we split the URL on the : character. This means that for the http://solr.pl URL, we should get an array containing two parts: http and //solr.pl. We are interested in the first value. We check whether the parts variable, which was returned by the split function, is defined and whether it has any elements (if (parts != null && parts.length > 0)). If the condition is true, we set a new field, protocol, to the first element of the parts array, which contains the protocol.
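The logic described above can be tried outside Solr with a small stand-in for the document object. The makeMockDoc helper is hypothetical and mimics only the two methods the script uses; it is a sketch for local experimentation, not how Solr represents documents:

```javascript
// Hypothetical stand-in for a Solr document (assumption for local testing)
function makeMockDoc(fields) {
  return {
    getFieldValue: function (name) {
      return name in fields ? fields[name] : null;
    },
    setField: function (name, value) {
      fields[name] = value;
    }
  };
}

// Same logic as processAdd in script.js
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var url = doc.getFieldValue("url");
  if (url != null) {
    var parts = url.split(":");
    if (parts != null && parts.length > 0) {
      doc.setField("protocol", parts[0]);
    }
  }
}

var sample = makeMockDoc({ id: "1", url: "http://solr.pl" });
processAdd({ solrDoc: sample });

console.log("http://solr.pl".split(":"));      // [ 'http', '//solr.pl' ]
console.log(sample.getFieldValue("protocol")); // http
```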

After indexing our data and running a query that filters the documents to only those that have the http protocol, we see that we did the job right.
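One thing to keep in mind is that splitting on the colon assumes every URL starts with a protocol. For a value such as localhost:8983/solr, the text before the first colon is a host name, not a protocol. A stricter variant might look like the sketch below; the extractProtocol helper is hypothetical and not part of the recipe, and the String(url) conversion is an assumption that may or may not be needed depending on the script engine Solr runs the script with:

```javascript
// Hypothetical helper: returns the lowercased protocol only when the value
// starts with "<scheme>://", and null otherwise
function extractProtocol(url) {
  if (url == null) {
    return null;
  }
  var match = /^([A-Za-z][A-Za-z0-9+.\-]*):\/\//.exec(String(url));
  return match ? match[1].toLowerCase() : null;
}

console.log(extractProtocol("https://drive.google.com/")); // https
console.log(extractProtocol("HTTP://solr.pl/"));           // http
console.log(extractProtocol("localhost:8983/solr"));       // null
console.log("localhost:8983/solr".split(":")[0]);          // localhost
```

Inside processAdd, the protocol field would then be set only when the helper returns a non-null value.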
