Extending and customizing the index process

As we saw before, the Solr index chain is highly customizable at different points. This section will give you some hints and examples to create your own extension in order to customize the indexing phase.

Changing the stored value of fields

One of the most frequent needs that I encounter while I'm indexing bibliographic data is to correct or change the headings (labels) belonging to the incoming records (documents).

Note

This has nothing to do with the text analysis we have previously seen. Here, we are dealing with unwanted (wrong) values, diacritics that need to be replaced, or in general, labels in the original record that we want to change and show to the end users. In Solr terms, we want to change the stored value of a field before it gets indexed.

Suppose a library has a lot of records and wants to publish them in an OPAC. Unfortunately, many of those records have titles with a trailing underscore, which has a special meaning for librarians. While this is not a problem for the cataloguing software (because librarians are aware of that convention), it is not acceptable to end users, and it will surely be seen as a typo. So if we have records with titles such as "A good old story_" or "This is another title_" in our application, we want to show "A good old story" and "This is another title" without underscores when the user searches for those records.

Remember that analyzers and tokenizers declared in your schema only act on the indexed value of a given field. The stored value is copied verbatim as it arrives, so text analysis gives you no opportunity to modify it.

In these cases, an UpdateRequestProcessor perfectly fits our needs. The example project associated with this chapter contains several examples of custom UpdateRequestProcessors. Here, we are interested in RemoveTrailingUnderscoreProcessor, which can be found in the src/main/java folder, within the org.gazzax.labs.solr.ase.ch2.urp package.

As you can see, writing an UpdateRequestProcessor requires two classes to be implemented:

  • Factory: A class that extends org.apache.solr.update.processor.UpdateRequestProcessorFactory
  • Processor: A class that extends org.apache.solr.update.processor.UpdateRequestProcessor

The first is a factory that creates concrete instances of your processor and can be configured with a set of custom parameters in solrconfig.xml:

<processor class="org.gazzax.labs.solr.ase.ch2.urp.RemoveTrailingUnderscoreProcessorFactory">
  <arr name="fields">
    <str name="fields">title</str>
    <str name="fields">author</str>
  </arr>
</processor>

In this case, instead of hardcoding the name of the fields that we want to check, we define an array parameter called fields. That parameter is retrieved in the factory, specifically in the init() method, which will be called by Solr when the factory is instantiated:

private String [] fields;

@Override
public void init (NamedList args) {
  SolrParams parameters = SolrParams.toSolrParams(args);
  this.fields = parameters.getParams("fields");
}

The other relevant section of the factory is in the getInstance method, where a new instance of the processor is created:

@Override
public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse res, UpdateRequestProcessor next) {
  return new RemoveTrailingUnderscoreProcessor(next, fields);
}

A new processor instance is created with the next processor in the chain and the list of target fields we configured. Now the processor receives those parameters and can add its contribution to the index phase. In this case, we want to put some logic before the add phase:

@Override
public void processAdd(final AddUpdateCommand command) throws IOException {
  // 1. Retrieve the Solr (Input) Document
  SolrInputDocument document = command.getSolrInputDocument();

  // 2. Loop through the target fields
  for (String name : fields) {
    // 3. Get the field value
    // (we assume target fields are single-valued strings, for simplicity)
    String value = (String) document.getFieldValue(name);

    // 4. Check the value and change it if necessary
    if (value != null && value.endsWith("_")) {
      String newValue = value.substring(0, value.length() - 1);
      document.setField(name, newValue);
    }
  }

  // 5. IMPORTANT: forward to the next processor in the chain
  super.processAdd(command);
}
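
The processor above assumes single-valued fields. As a rough sketch of how the same check could be extended to multivalued fields, here is a plain-Java helper that applies the rule to one value or to a collection of values. It uses no Solr classes, and the names TrailingUnderscores, strip, and stripAll are hypothetical: they are not part of the example project. Inside a processor, one could probably feed stripAll with the result of document.getFieldValues(name), assuming the values are strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of the example project.
final class TrailingUnderscores {

  private TrailingUnderscores() {
  }

  // Removes a single trailing underscore, if present; null is returned unchanged.
  static String strip(String value) {
    if (value != null && value.endsWith("_")) {
      return value.substring(0, value.length() - 1);
    }
    return value;
  }

  // Applies strip() to every String in the collection; other objects are kept as they are.
  static List<Object> stripAll(Collection<?> values) {
    List<Object> result = new ArrayList<>();
    if (values == null) {
      return result;
    }
    for (Object value : values) {
      result.add(value instanceof String ? strip((String) value) : value);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(strip("A good old story_"));
    System.out.println(stripAll(Arrays.asList("This is another title_", "Plain title")));
  }
}
```

Keeping the rule in a small static method also makes it easy to unit test without starting Solr.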

Tip

You can find the source code of the whole example under the org.gazzax.labs.solr.ase.ch2.urp package of the source folder in the project associated with this chapter. The package contains additional examples of UpdateRequestProcessor.
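
To see why step 5 of processAdd matters, consider the following simplified model of a processor chain. It is only a sketch in plain Java: Step, StripUnderscoreStep, IndexingStep, and ChainDemo are hypothetical stand-ins (they are not Solr classes), a document is reduced to a map of strings, and the last step just records what it receives instead of writing to an index. The forward flag exists only to show what happens when a step forgets to call the next one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a processor in a chain (NOT Solr's UpdateRequestProcessor).
abstract class Step {
  private final Step next;

  Step(Step next) {
    this.next = next;
  }

  // Default behaviour: hand the document to the next step, if there is one.
  void processAdd(Map<String, String> document) {
    if (next != null) {
      next.processAdd(document);
    }
  }
}

// Strips a trailing underscore from the title and, optionally, forwards the document.
final class StripUnderscoreStep extends Step {
  private final boolean forward;

  StripUnderscoreStep(Step next, boolean forward) {
    super(next);
    this.forward = forward;
  }

  @Override
  void processAdd(Map<String, String> document) {
    String title = document.get("title");
    if (title != null && title.endsWith("_")) {
      document.put("title", title.substring(0, title.length() - 1));
    }
    if (forward) {
      super.processAdd(document);
    }
  }
}

// Last step of the chain: records the documents that reach it.
final class IndexingStep extends Step {
  final List<Map<String, String>> indexed = new ArrayList<>();

  IndexingStep() {
    super(null);
  }

  @Override
  void processAdd(Map<String, String> document) {
    indexed.add(document);
  }
}

final class ChainDemo {
  // Sends one document through the chain and counts how many reached the last step.
  static int indexedCount(boolean forward) {
    IndexingStep sink = new IndexingStep();
    Step chain = new StripUnderscoreStep(sink, forward);
    Map<String, String> document = new HashMap<>();
    document.put("title", "A good old story_");
    chain.processAdd(document);
    return sink.indexed.size();
  }

  public static void main(String[] args) {
    System.out.println("forwarding: " + indexedCount(true));
    System.out.println("not forwarding: " + indexedCount(false));
  }
}
```

In this model, a step that does not forward silently drops the document, which is the reason the call to super.processAdd(command) is flagged as important in the real processor.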

Indexing custom data

The default UpdateRequestHandler is very powerful because it supports the most common data formats. However, there are cases where the data is only available in a legacy format, and we need to do some extra work before Solr can accept it.

In this example, I will use a flat file, that is, a simple text file that typically describes records with fields of data defined by fixed positions. Flat files are very popular in integration projects between banks and ERP systems (just to give you a concrete context).

Tip

In the example project associated with this chapter, you can find an example of such a file describing books under the src/solr/solr-homes/flatIndexer/example-input-data folder.

Here, each line has a fixed length of 107 characters and represents a book, with the following format:

Field     Position
Id        0 to 8
ISBN      8 to 22
Title     22 to 67
Author    67 to 106

There are two approaches in this scenario. The first moves the responsibility to the client side: a custom indexer client reads the data in whatever format it comes and converts it into one of the formats that Solr supports. We won't cover this scenario right now, as we will discuss client APIs in a later chapter.

Another approach could be a custom extension of the UpdateRequestHandler. In this case, we want to have a new content type (text/plain) and a corresponding custom handler to load that kind of data. There are two things we need to implement. The first is a subclass of the existing UpdateRequestHandler:

public class FlatDataUpdate extends UpdateRequestHandler {
  @Override
  protected Map<String, ContentStreamLoader> createDefaultLoaders(NamedList n) {
    Map<String, ContentStreamLoader> registry = new HashMap<String, ContentStreamLoader>();
    registry.put("text/plain", new FlatDataLoader());
    return registry;
  }
}

Here, we simply override the method that builds the content type registry (the registry in the superclass cannot be modified) so that it contains our content type, together with a corresponding loader called FlatDataLoader. Note that the map built here contains only the text/plain entry. The FlatDataLoader class extends ContentStreamLoader and implements the parsing logic of the flat data:

public class FlatDataLoader extends ContentStreamLoader

The custom loader must provide a load(…) method to implement the stream parsing logic:

@Override
public void load(
    SolrQueryRequest req,
    SolrQueryResponse rsp,
    ContentStream stream,
    UpdateRequestProcessor processor) throws Exception {

  // 1. Get a reader associated with the content stream
  BufferedReader reader = null;
  try {
    reader = new BufferedReader(stream.getReader());
    String actLine = null;
    while ((actLine = reader.readLine()) != null) {

      // 2. Sanity check: skip lines that don't have the expected length
      if (actLine.length() != 107) {
        continue;
      }

      // 3. Parse the line, create the document, and send it to the chain
      SolrInputDocument document = new SolrInputDocument();
      document.setField("id", actLine.substring(0, 8));
      document.setField("isbn", actLine.substring(8, 22));
      document.setField("title", actLine.substring(22, 67));
      document.setField("author", actLine.substring(67));
      AddUpdateCommand command = getAddCommand(req);
      command.solrDoc = document;
      processor.processAdd(command);
    }
  } finally {
    // Close the reader
    … 
  }
}
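
The parsing step can also be tried outside Solr. The following is a minimal stand-alone sketch that mirrors the offsets used in load(); FlatLineParser and its methods are hypothetical names, not classes of the example project. Two things are assumptions of mine: the call to trim(), which presumes that values are padded with spaces, and the sample record built in main(), whose values are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone parser; it mirrors the offsets used in load().
final class FlatLineParser {

  static final int LINE_LENGTH = 107;

  // Parses one fixed-width line; returns null when the length is not the expected one.
  static Map<String, String> parseLine(String line) {
    if (line == null || line.length() != LINE_LENGTH) {
      return null;
    }
    Map<String, String> fields = new LinkedHashMap<>();
    // trim() is an assumption: it presumes values are padded with spaces.
    fields.put("id", line.substring(0, 8).trim());
    fields.put("isbn", line.substring(8, 22).trim());
    fields.put("title", line.substring(22, 67).trim());
    fields.put("author", line.substring(67).trim());
    return fields;
  }

  // Reads all lines from the source and silently skips those with an unexpected length.
  static List<Map<String, String>> parse(Reader source) {
    List<Map<String, String>> records = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        Map<String, String> record = parseLine(line);
        if (record != null) {
          records.add(record);
        }
      }
    } catch (IOException exception) {
      throw new UncheckedIOException(exception);
    }
    return records;
  }

  // Pads (or truncates) a value to a fixed width; used only to build sample data.
  static String pad(String value, int width) {
    return String.format("%-" + width + "." + width + "s", value);
  }

  public static void main(String[] args) {
    // Made-up sample record: 8 + 14 + 45 + 40 = 107 characters.
    String line = pad("1", 8) + pad("9780000000000", 14)
        + pad("A good old story", 45) + pad("An Author", 40);
    System.out.println(parse(new StringReader(line + "\nthis line is too short\n")));
  }
}
```

Like the loader, this sketch drops malformed lines without reporting them; in a real integration you would probably want to log or count the skipped lines.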

If you want to view this example, just open the command line in the folder of the project associated with this chapter, and run the following command:

# mvn cargo:run -P flatIndexer

Tip

You can do the same with Eclipse by creating a new Maven launch as previously described. In that case, you will also be able to put debug breakpoints in the source code (your source code and the Solr source code) and proceed step by step in the Solr index process.

Once Solr has started, open another shell, change the directory to go to the project folder, and run the following command:

# curl http://127.0.0.1:8983/solr/flatIndexer/update?commit=true -H "Content-type:text/plain" --data-binary @src/solr/solr-homes/flatIndexer/example-input-data/books.flat

You should see something like this in the console:

[UpdateHandler] start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

[SolrCore] SolrDeletionPolicy.onCommit: commits: num=2
[SolrCore] newest commit generation = 4
[SolrIndexSearcher] Opening Searcher@77ee04bb[flatIndexer] main
[UpdateHandler] end_commit_flush

Now open the administration console at http://127.0.0.1:8983/solr/#/flatIndexer/query, and click on the Execute Query button. You should see three documents in the right pane.

Tip

You can find the source code of the entire example under the org.gazzax.labs.solr.ase.ch2.handler package of the source folder in the project associated with this chapter.
