Incremental imports with DIH

In most use cases, indexing all the data from scratch during every indexation doesn't make sense. Why index all 100,000 of your documents when only 1,000 were modified or added? This is where the Solr Data Import Handler delta queries come in handy. Using them, we can index our data incrementally. This recipe will show you how to set up the Data Import Handler to use delta queries and index data in an incremental way.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter to get to know the basics of the Data Import Handler configuration. I assume that Solr is set up according to the description given in the mentioned recipe.

How to do it...

We will reuse parts of the configuration shown in the Indexing data from a database using Data Import Handler recipe in this chapter, and we will modify it. Execute the following steps:

  1. The first thing you should do is add an additional column to the tables you use, a column that specifies the last modification date of each record. So, in our case, let's assume that we added a column named last_modified (which should be a timestamp-based column); a sample statement for adding it is sketched after these steps. Now, our db-data-config.xml will look like this:
    <dataConfig>
     <dataSource driver="org.postgresql.Driver"
      url="jdbc:postgresql://localhost:5432/users"
      user="users" password="secret" />
     <document>
      <entity name="user"
       query="SELECT user_id, user_name FROM users"
       deltaImportQuery="SELECT user_id, user_name FROM users WHERE user_id = '${dih.delta.user_id}'"
       deltaQuery="SELECT user_id FROM users WHERE last_modified &gt; '${dih.last_index_time}'">
       <field column="user_id" name="id" />
       <field column="user_name" name="name" />
       <entity name="user_desc"
        query="SELECT desc FROM users_description WHERE user_id = ${user.user_id}">
        <field column="desc" name="description" />
       </entity>
      </entity>
     </document>
    </dataConfig>
  2. After this, we start the delta import by sending the delta-import command to the handler:
    http://localhost:8983/solr/cookbook/dataimport?command=delta-import
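
The exact statement for the schema change mentioned in the first step depends on your database. Assuming the PostgreSQL database used in this recipe, a minimal sketch could look like the following:

    -- a hypothetical statement; adjust the default value and constraints to your needs
    ALTER TABLE users ADD COLUMN last_modified timestamp NOT NULL DEFAULT now();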

How it works...

First, we modified our database table to include a column named last_modified. We need to ensure that this column contains the last modification date of the record it corresponds to. Solr will not modify the database, so you have to ensure that your application (or the database itself) does this; one possible approach is sketched below.
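
For example, assuming you are free to add triggers to the PostgreSQL database used in this recipe, the column can be kept up to date by the database itself. The following is only a sketch of this idea; your application can just as well set the value explicitly in its INSERT and UPDATE statements:

    -- a hypothetical trigger that stamps every inserted or updated row with the current time
    CREATE OR REPLACE FUNCTION set_last_modified() RETURNS trigger AS $$
    BEGIN
      NEW.last_modified := now();
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_set_last_modified
     BEFORE INSERT OR UPDATE ON users
     FOR EACH ROW EXECUTE PROCEDURE set_last_modified();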

When running a delta import, the Data Import Handler will start by reading a file named dataimport.properties from the Solr configuration directory. If the file is not present, the Data Import Handler will assume that no indexing was ever performed. Solr uses this file to store information about the last indexation time, and the file is updated or created after indexation finishes. The last index time is stored as a timestamp. As you can guess, the Data Import Handler uses this timestamp to determine which data has changed since then. It can be used in a query through a special variable, ${dih.last_index_time}.
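To give you an idea of what this file contains, the following is an illustrative (not taken from an actual run) dataimport.properties written after an import; note the per-entity property next to the global one:

    #Tue Jul 01 10:15:00 UTC 2014
    last_index_time=2014-07-01 10\:15\:00
    user.last_index_time=2014-07-01 10\:15\:00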

You might already have noticed the two differences: two additional attributes in the entity named user, deltaQuery and deltaImportQuery. The deltaQuery attribute is responsible for getting the information about the users that were modified since the last import. Actually, it only fetches the users' unique identifiers, using the last_modified column we added to decide which users have changed. The deltaImportQuery attribute then fetches all the needed information about each user with the appropriate unique identifier (which was returned by deltaQuery). One thing worth noticing is the way the user identifier is referenced in the deltaImportQuery attribute: ${dih.delta.user_id}. We used the dih.delta variable with its user_id property (which is the same as the table column name) to refer to the user identifier. An illustration of how these two queries resolve at import time follows.
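To make the flow concrete, with a purely illustrative last index time and an identifier of 5 returned by the first query, the two statements would resolve to roughly the following:

    -- deltaQuery: find the identifiers of the users changed since the last import
    SELECT user_id FROM users WHERE last_modified > '2014-07-01 10:15:00';

    -- deltaImportQuery: executed once for every identifier returned above
    SELECT user_id, user_name FROM users WHERE user_id = '5';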

You might notice that I left the query attribute in the entity definition. It's left on purpose; you might need to index the full data once again so that the configuration will be useful for full as well as partial imports.

Next, we have a query that shows how to run the delta import. You might notice that compared to the full import, we didn't use the full-import command; we sent the delta-import command instead.

The statuses that are returned by Solr are the same as those returned during the full import, so refer to the appropriate recipe to see what information they carry.
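If you want to check how the import went, you can also ask the same handler for its current status; this is standard Data Import Handler behavior rather than something specific to delta imports:

    http://localhost:8983/solr/cookbook/dataimport?command=status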

One more thing: the delta queries are only supported by the default SqlEntityProcessor. This means that you can only use them with JDBC data sources.

See also
