Transforming data when using DIH

Data that is stored in our data source is not always in a form we would like it to be indexed in our Solr cluster. For example, imagine that you want to split the first and second names into two fields during indexing because these two reside in a single column in the database and are separated by a whitespace character. Of course, we can modify our database, but in most cases this is not possible. Can we do this? Of course we can, we just need to add some more configuration details to the Data Import Handler configuration. This recipe will show you how to do this.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter.

How to do it...

We will reuse the data from the Indexing data from a database using Data Import Handler recipe in this chapter. So, to select users from our table, we use the following SQL query:

SELECT user_id, user_name FROM users

The response in the text client looks as follows:

| user_id | user_name     |
| 1       | John Kowalski |
| 2       | Amanda Looks  |

Our task is to split the first and second names from the user_name column and place it in the two fields firstname and secondname. The steps we need to take are as follows:

  1. First, we need to change the index structure so that our field definitions in the schema.xml file look as follows:
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="firstname" type="string" indexed="true" stored="true"/>
    <field name="secondname" type="string" indexed="true" stored="true"/>
    <field name="description" type="text" indexed="true" stored="true"/>
  2. Now, we need to modify our db-data-config.xml file (the one we created earlier) so that it looks as follows:
    <dataConfig>
     <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/users" user="users" password="secret" />
    <script><![CDATA[
      function splitName(row) {
       var nameTable = row.get('user_name').split(' '),
       row.put('firstname', nameTable[0]);
       row.put('secondname', nameTable[1]);
       row.remove('user_name'),
       return row;
      }
     ]]></script>
     <document>
      <entity name="user" transformer="script:splitName" query="SELECT user_id, user_name, description FROM users">
       <field column="user_id" name="id" />
       <field column="firstname" />
       <field column="secondname" />
       <entity name="user_desc" query="SELECT desc FROM users_description WHERE user_id=${user.user_id}">
        <field column="desc" name="description" />
       </entity>
      </entity>
     </document>
    </dataConfig>
  3. Now, you can follow the normal indexing procedure that was discussed in the Indexing data from a database using Data Import Handler recipe in this chapter.

How it works...

The first two listings are the sample SQL query and result given by a database. Next, we have a field definition part of a schema.xml file that defines four fields. Look at the example database rows once again. See the difference? We have four fields in our index structure, while our database rows have only two columns. We must split the contents of the user_name column into the two index fields firstname and secondname. To do this, we will use the JavaScript language and script transformer functionality of the Data Import Handler.

The solrconfig.xml file is the same as the one discussed in the Indexing data from a database using Data Import Handler recipe in this chapter, so I'll skip this as well.

Next, we have the updated contents of the db-data-config.xml file, which we use to define the behavior of the Data Import Handler. The first and biggest difference is the script tag that holds our scripts and alters the data. The scripts should be held in the CDATA section. I defined a simple function called splitName that takes one parameter, the database row (remember that the functions that operate on the entity data should always take one parameter, which actually is an instance of the Map<String, Object> Java object). The first thing in the function is to get the contents of the user_name column, split it with the space character, and assign it into a JavaScript array. Then, we create two additional columns in the processed rows firstname and secondname. The contents of these rows come from the JavaScript table we created. Then, we remove the user_name column because we don't want it to be indexed. The last operation is the returning of the processed row.

To enable script processing, you must add one additional attribute to the entity definition; this is the transformer attribute with contents similar to that of script:functionName. In our example, it looks like transformer="script:splitName". It tells the Data Import Handler to use the defined function name for every row returned by the query.

This is how it works. The rest is the usual indexing process described in the Indexing data from a database using Data Import Handler recipe in this chapter, so I'll skip this as well.

There's more...

There is one more thing I want to mention; check out the following section for more information.

Using scripts other than JavaScript

If you want to use a language other than JavaScript, then you have to specify it in the language attribute of the <script> tag. Just remember that the scripting language that you want to use must be supported by Java. The example definition will look like this:

<script language="ECMAScript">…</script>
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.32.67