Time for action – indexing incrementally using delta imports

Once we understand the basic elements for handling entities when building a document structure to be posted to Solr over JDBC, we can, for example, write our own script to run this kind of operation at scheduled intervals (think of a scheduled task on a Windows machine or a crontab entry on *nix machines). A scheduled script will use the command=delta-import parameter, as in the following command:

>> curl -H 'Content-Type:application/json; charset=UTF-8' -X POST 'http://localhost:8983/solr/wgarts/dih?command=delta-import&commit=true&wt=json&indent=true'
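As a minimal sketch of such a scheduled script, the same request can be issued from Python's standard library. The host, core name (wgarts), and handler path (dih) are copied from the curl command above; the function names and the timeout value are our own illustrative choices, not part of Solr.

```python
import json
import urllib.parse
import urllib.request

# Assumption: same host, core and handler path as in the curl command above.
DIH_URL = "http://localhost:8983/solr/wgarts/dih"


def build_delta_import_url(base_url=DIH_URL):
    """Return the handler URL carrying the delta-import command."""
    params = {
        "command": "delta-import",
        "commit": "true",
        "wt": "json",
        "indent": "true",
    }
    return base_url + "?" + urllib.parse.urlencode(params)


def trigger_delta_import(base_url=DIH_URL, timeout=30):
    """POST the command (empty body) and return Solr's decoded JSON response.

    Hypothetical helper: it needs a running Solr instance to succeed.
    """
    request = urllib.request.Request(
        build_delta_import_url(base_url), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A script that simply calls trigger_delta_import() can then be registered with cron or the Windows Task Scheduler at whatever interval suits the data source.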

The delta import also relies on the following entity definition to pick up where the last import stopped:

<document name="painting_document_delta">
  <entity name="painting_delta" pk="uri" transformer="..." 
    query="SELECT * FROM authors AS A JOIN paintings AS P ON (A.ID=P.AUTHOR_ID) WHERE (P.FORM='painting')"
    deltaImportQuery="SELECT * FROM authors AS A JOIN paintings AS P ON (A.ID=P.AUTHOR_ID) WHERE (P.FORM='painting' AND P.ID='${dih.delta.id}')" 
    deltaQuery="SELECT P.ID FROM paintings AS P WHERE (last_update &gt; '${dih.last_index_time}')" >
    
    <field column="entity_type" template="painting" />
    <field column="entity_source" template="Web Gallery Of Art" />
  </entity>
</document>

In this case, to improve readability, we have omitted minor details that have not changed from the previous examples.

What just happened?

This example should be read mostly as a draft: a suggestion on how to add a delta import definition to JDBC indexing (if you want to look at a complete example, you will find it at /SolrStarterBook/solr-app/chp08/wgarts_delta).

The first interesting element here is the deltaQuery="SELECT P.ID FROM..." statement by itself. This query projects only the unique ID of the rows whose last_update value is more recent than the last import, that is, the newly added or changed resources. Note that we use the internal ${dih.last_index_time} variable, which holds the time of the last import. Moreover, because the query lives inside an XML attribute, the > comparator is written as &gt; (a < comparator would likewise have to be written as &lt;).
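The escaping rule can be checked with a short Python sketch. It uses quoteattr from the standard library to produce the attribute value and then parses the element back to confirm that the XML parser restores the original comparator. The entity element here is reduced to two attributes purely for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# The delta query as we would write it in plain SQL.
delta_query = (
    "SELECT P.ID FROM paintings AS P "
    "WHERE (last_update > '${dih.last_index_time}')"
)

# quoteattr escapes &, < and > and wraps the value in quotes,
# so the result can be dropped straight into an attribute.
entity_xml = '<entity name="painting_delta" deltaQuery=%s />' % quoteattr(delta_query)

# Parsing the element back yields the original, unescaped SQL.
parsed = ET.fromstring(entity_xml)
round_trip_ok = parsed.get("deltaQuery") == delta_query

print(entity_xml)
print(round_trip_ok)
```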

Once the deltaQuery has found the IDs of the rows to be indexed, the deltaImportQuery is triggered for each of them. This query will generally contain a clause that uses the ID value to select only the corresponding row; otherwise, we would index every single row, every time. Note how the P.ID='${dih.delta.id}' condition selects rows by the ID emitted by the time-based deltaQuery.
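The two-step flow can be simulated outside Solr with an in-memory SQLite database. This is only a sketch of the logic, not of how the DataImportHandler is implemented: the table columns beyond those named in the queries (NAME, TITLE), the sample rows, and the helper function names are all invented for illustration, and bound parameters stand in for the ${dih...} variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE paintings (ID INTEGER PRIMARY KEY, AUTHOR_ID INTEGER,
                            TITLE TEXT, FORM TEXT, last_update TEXT);
    -- Invented sample data.
    INSERT INTO authors VALUES (1, 'Author A');
    INSERT INTO paintings VALUES (10, 1, 'Old painting', 'painting', '2013-08-01 10:00:00');
    INSERT INTO paintings VALUES (11, 1, 'New painting', 'painting', '2013-09-02 09:00:00');
    INSERT INTO paintings VALUES (12, 1, 'New sculpture', 'sculpture', '2013-09-02 09:30:00');
""")


def delta_ids(conn, last_index_time):
    """Step 1 (deltaQuery): IDs of rows updated after the last import."""
    rows = conn.execute(
        "SELECT P.ID FROM paintings AS P WHERE (last_update > ?)",
        (last_index_time,),
    )
    return sorted(row[0] for row in rows)


def import_row(conn, delta_id):
    """Step 2 (deltaImportQuery): fetch the full joined row for one ID."""
    return conn.execute(
        "SELECT P.ID, P.TITLE, A.NAME FROM authors AS A "
        "JOIN paintings AS P ON (A.ID=P.AUTHOR_ID) "
        "WHERE (P.FORM='painting' AND P.ID=?)",
        (delta_id,),
    ).fetchall()


changed = delta_ids(conn, "2013-09-01 19:32:03")
documents = [row for delta_id in changed for row in import_row(conn, delta_id)]

print(changed)    # IDs reported by the delta query
print(documents)  # rows that would actually be re-indexed
```

In this sample, the delta query reports both recently updated rows, but only the one with FORM='painting' survives the second query, because the form filter appears only in the deltaImportQuery.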

When indexing is performed using the DataImportHandler, a properties file is produced. This is a plain text file named dih.properties, with a simple internal structure similar to the following:

#Mon Sep 01 18:32:11 CEST 2013
f.last_index_time=2013-09-01 19:24:38
last_index_time=2013-09-01 19:32:03
files.last_index_time=2013-09-01 19:32:03

Note that this file can be edited manually (if needed) to provide starting values for the delta import process. It will be overwritten with updated values when a new import runs.
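When inspecting or editing this file from a script, a small parser is enough. The following sketch reads the key=value lines shown above and converts a last_index_time entry into a datetime. The function names are hypothetical, and the handling of backslash-escaped colons is an assumption based on how Java properties files are normally written; the sample shown above has no such escapes.

```python
from datetime import datetime

SAMPLE = """#Mon Sep 01 18:32:11 CEST 2013
f.last_index_time=2013-09-01 19:24:38
last_index_time=2013-09-01 19:32:03
files.last_index_time=2013-09-01 19:32:03
"""


def parse_dih_properties(text):
    """Return the key/value pairs of a properties file as a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the timestamp comment
        key, _, value = line.partition("=")
        # Assumption: colons may be stored as '\:' by the Java writer.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def last_index_time(props, entity=None):
    """Return the global (or per-entity) last index time as a datetime."""
    key = "last_index_time" if entity is None else entity + ".last_index_time"
    return datetime.strptime(props[key], "%Y-%m-%d %H:%M:%S")


props = parse_dih_properties(SAMPLE)
print(last_index_time(props))
print(last_index_time(props, entity="f"))
```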
