The DataImportHandler framework

Solr includes a very popular contrib module for importing data known as the DataImportHandler. It's a data processing pipeline built specifically for Solr. Here's a list of the notable capabilities:

  • It imports data from databases through JDBC (Java Database Connectivity). This supports importing only changed records, assuming a last-updated date column
  • It imports data from a URL (HTTP GET)
  • It imports data from files (that is, it crawls files)
  • It imports e-mail from an IMAP server, including attachments
  • It supports combining data from different sources
  • It extracts text and metadata from rich document formats
  • It applies XSLT transformations and XPath extraction on XML data
  • It includes a diagnostic/development tool

Furthermore, once you see how the existing data sources and transformation steps are coded, you can write your own.

Tip

Consider DIH alternatives

The DIH's capabilities really have little to do with Solr itself, yet the DIH is tied to Solr (to a Solr core, to be precise). Consider alternative data pipelines such as those referenced at http://wiki.apache.org/solr/SolrEcosystem, which includes building your own. Alternatives can run on another machine to reduce the load on Solr when there is significant processing involved. And because they are agnostic of where the data is delivered, your investment in them can be reused for other purposes independent of Solr. With that said, the DIH is a strong choice because it is integrated with Solr and it has a lot of capabilities.

The complete reference documentation for the DIH is available at https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler. It's rather thorough. In this chapter, we'll demonstrate some of its features, but you'll need to turn to the wiki for further details.

Configuring the DataImportHandler framework

The DIH is not considered a core part of Solr, even though it comes with the Solr download. Consequently, you must add its Java JAR files to your Solr setup in order to use it. If this isn't done, you'll eventually see a ClassNotFoundException error. The DIH's JAR files are located in Solr's dist directory: solr-dataimporthandler-4.x.x.jar and solr-dataimporthandler-extras-4.x.x.jar. The easiest way to add JAR files to a Solr configuration is to copy them to the <solr_home>/lib directory; you may need to create it. Another method is to reference them from solrconfig.xml via <lib/> tags—see Solr's example configuration for examples of that. You will probably need some additional JAR files as well. If you'll be communicating with a database, you'll need its JDBC driver. If you'll be extracting text from various document formats, you'll need to add the JARs in /contrib/extraction/lib. Finally, if you'll be indexing e-mail, you'll need to add the JARs in /contrib/dataimporthandler/lib.
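For reference, the <lib/> approach can look like the following fragment of solrconfig.xml. This is only a sketch; the dir values are assumptions that depend on where your core's instance directory sits relative to the Solr download, so adjust them to your layout:

<!-- the DIH itself, from Solr's dist directory -->
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Tika and related JARs, for extracting text from rich documents -->
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<!-- JavaMail and related JARs, for indexing e-mail -->
<lib dir="../../../contrib/dataimporthandler/lib" regex=".*\.jar" />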

The DIH needs to be registered with Solr in solrconfig.xml as follows:

<requestHandler name="/dih_artists_jdbc" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">mb-dih-artists-jdbc.xml</str>
    </lst>
</requestHandler>

The referenced file, mb-dih-artists-jdbc.xml, is located in <solr_home>/conf and specifies the details of a data import process. We'll get to that file in a bit.

The development console

Before describing a DIH configuration file, we're going to take a look at the DIH development console. Visit the URL http://localhost:8983/solr/#/mbartists/dataimport/ (modifications may be needed for your host, port, core, and so on).

If there is more than one request handler registered, then you'll see a simple page listing them with links to get to the development console for that handler. The development console looks like the following screenshot:

The development console

The screen is divided into two panes: on the left is the DIH control form and on the right is the command output as JSON.

Note

The editable configuration, via the Debug-Mode option highlighted in the preceding screenshot, is not saved to disk! It is purely for live trial-and-error debugging. Once you are satisfied with any changes, you'll need to save them back to the file yourself, get Solr to reload the configuration (for example, by clicking on the Reload button), and then refresh the page to see the changes on screen.

The last section on the DIH in this chapter goes into more detail on submitting a command to the DIH.

Writing a DIH configuration file

The key pieces of a DIH configuration file include a data source, an entity, some transformers, and a list of fields; some of these can be omitted. At first, we'll list the various types of these DIH components with a simple description. Each has further details on usage, for which you'll need to see the Solr Reference Guide. Then we'll show you a few sample configuration files to give you a sense of how it all comes together.
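Before diving into the component types, here is a minimal skeleton showing how the pieces nest inside a configuration file. The values are just placeholders, not a working example:

<dataConfig>
  <dataSource type="..."/>                 <!-- where the data comes from -->
  <document>
    <entity name="..." transformer="...">  <!-- produces Solr documents -->
      <field column="..." name="..."/>     <!-- maps source data to Solr fields -->
    </entity>
  </document>
</dataConfig>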

Data sources

A <dataSource/> tag specifies, as you might guess, the source of data referenced by an entity. This is the simplest piece of the configuration. The type attribute specifies the type, which defaults to JdbcDataSource. Depending on the type, there are further configuration attributes (not listed here). There can be multiple data sources, though usually there is just one. Furthermore, with the exception of JdbcDataSource, each type handles either binary or text data but not both. The following is a listing of the available data source types; they all have names ending with DataSource.

  • JdbcDataSource: This is a reference to a database via JDBC, usually a relational one.
  • FieldStreamDataSource and FieldReaderDataSource: These extract binary or character data from a column of a JdbcDataSource.
  • BinFileDataSource and FileDataSource: These specify a path to a binary or text file.
  • URLDataSource: This specifies a URL to a text resource.
  • BinContentStreamDataSource and ContentStreamDataSource: These receive binary or text data posted to the DIH instead of the DIH pulling it from somewhere.

    Tip

    ContentStreamDataSource is very interesting because it lets you use the DIH for asynchronous, on-demand data processing instead of the typical scheduled batch-process mode. It could be used for many things, even a Web Hook: http://www.webhooks.org/.

If you were looking for a MailDataSource, then there isn't one. The MailEntityProcessor was coded to fetch the e-mail itself instead of decoupling that function to a data source.

Entity processors

Following the data sources is a <document/> element, which contains one or more <entity/> elements referencing an entity processor via the processor attribute; the default is SqlEntityProcessor. An entity processor produces documents when it is executed. The data to produce the documents typically comes from a referenced data source. An entity that is an immediate child of <document/> is a root entity by default, which means its documents are indexed by Solr. If the rootEntity attribute is explicitly set to false, then the DIH recursively traverses down until it finds an entity that isn't marked this way. There can be sub-entities, which execute once for each parent document and which usually reference the parent document to narrow a query. Documents from a sub-entity are merged into its root entity's document, producing multivalued fields when the sub-entity produces more than one document with the same field.

Tip

This explanation is surely quite confusing without having seen several examples. You might want to read this again once you get to some examples.
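In the meantime, here is a deliberately minimal, hypothetical sketch (the table and column names are made up) of a root entity with a sub-entity whose rows get merged into the parent document:

<document>
  <entity name="product" query="select id, name from product">
    <!-- runs once per product; when it returns several rows,
         'tag' becomes a multivalued field on the product document -->
    <entity name="tag" query="select tag from product_tag where product_id = '${product.id}'"/>
  </entity>
</document>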

The entity processors have some common configuration attributes and some that are unique to each one.

Entity processors all have a name ending with EntityProcessor. The following is a list of them:

  • SqlEntityProcessor: This references a JdbcDataSource and executes a specified SQL query. The columns in the result set map to fields by the same name. This processor uniquely supports delta import.
  • CachedSqlEntityProcessor: This is like SqlEntityProcessor, but caches every record in memory for future lookups instead of running the query each time. This is only an option for sub-entities of a root entity.
  • XPathEntityProcessor: This processes XML from a text data source. It splits the XML into separate documents according to an XPath expression. The fields reference a part of the XML via an XPath expression.
  • PlainTextEntityProcessor: This takes the text from a text data source and puts it into a single field.
  • LineEntityProcessor: This takes each line of text from a text data source, creating a document from each one. A suggested use is for an input file of URLs that are referenced by a sub-entity, such as one using Tika.
  • FileListEntityProcessor: This finds all files meeting the specified criteria, creating a document from each one with the file path in a field. A sub-entity, such as one using Tika, could then extract text from the file.
  • TikaEntityProcessor: This extracts text from a binary data source, using Apache Tika. Tika supports many file types such as HTML, PDF, and Microsoft Office documents. Recent Tika versions allow specifying that the HTML not be stripped out (it is by default). From Solr 4.3, you can specify this via IdentityHtmlMapper in the DIH configuration. This is an alternative approach to Solr Cell, which is described later.
  • MailEntityProcessor: This fetches e-mail from an IMAP mail server, including attachments processed with Tika. It doesn't use a data source. You can specify a starting date, but, unfortunately, it doesn't support DIH's delta import.

    Note

    Solr supports a pluggable cache for DIH so that any entity can be made cacheable by adding the cacheImpl parameter. For additional information, check SOLR-2382.
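    As a rough sketch of what that can look like, a sub-entity might be made cacheable as follows; SortedMapBackedCache is the in-memory implementation that ships with the DIH, and the cacheKey/cacheLookup attribute usage here is our reading of the usual pattern, so double-check it against the reference documentation:

    <entity name="tag" query="select product_id, tag from product_tag"
            cacheImpl="SortedMapBackedCache"
            cacheKey="product_id" cacheLookup="product.id"/>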

Fields and transformers

Within an <entity/> tag are <field/> elements that declare how the columns in the query map to Solr. A field element must have a column attribute that matches the corresponding named column in the SQL query. Its name attribute is the Solr schema field name that the column is going into. If it is not specified, then it defaults to the column name. When a column in the result can be placed directly into Solr without further processing, there is no need to specify the field declaration, because it is implied.

Tip

When importing from a database, use the SQL AS keyword to use the same names as the Solr schema instead of the database schema. This reduces the number of <field/> elements and shortens the existing ones.

An attribute of the entity declaration that we didn't mention yet is transformer. This declares a comma-separated list of transformers that create, modify, and delete fields and even entire documents. The transformers are evaluated in order, which can be significant. Usually, a transformer looks for attributes specific to it on a given field to trigger some action, whether that is splitting the field into multiple values or formatting it. The following is a list of transformers:

  • TemplateTransformer: This overwrites or modifies a value based on a string template. The template can contain references to other fields and DIH variables.
  • RegexTransformer: This either performs a string substitution, splits the field into multiple values, or splits the field into separately named fields. This transformer is very useful!
  • DateFormatTransformer: This parses a date-time format according to a specified pattern. The output format is Solr's date format.
  • NumberFormatTransformer: This parses a number according to a specified locale and style (that is, number, percent, integer, or currency). The output format is a plain number suitable for one of Solr's numeric fields.
  • HTMLStripTransformer: This removes the HTML markup according to HTMLStripCharFilter (documented in the previous chapter). By performing this step here instead of a text analysis component, the stored value will also be cleansed, not just the indexed (that is, searchable) data.
  • ClobTransformer: This transforms a CLOB value from a database into a plain string.
  • LogTransformer: This logs a string for diagnostic purposes, using a string template like that of TemplateTransformer. Unlike most transformers, this is configured at the entity level since it is evaluated once per output document, not per field.
  • ScriptTransformer: This invokes user-defined code in-line that is defined in a <script/> element. This transformer is specified differently within the transformers attribute—use …,script:myFunctionName,… where myFunctionName is a named function in the provided code. The code is written in JavaScript by default, but most other languages that run on the JVM can be used too.
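
As an illustrative sketch of ScriptTransformer, a hypothetical function that derives a field might be declared as follows; the entity would then list transformer="script:addDisplayName" (the field names are made up):

<dataConfig>
  <script><![CDATA[
    // called once per row; 'row' behaves like a map of column names to values
    function addDisplayName(row) {
      row.put('displayName', row.get('firstName') + ' ' + row.get('lastName'));
      return row;
    }
  ]]></script>
  <!-- dataSource, document, and entity definitions follow as usual -->
</dataConfig>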

By the way, DIH transformers are similar to Solr UpdateRequestProcessors described at the end of this chapter. The former operates strictly within the DIH framework, whereas the latter is applicable to any importing mechanism.

Example DIH configurations

A DIH configuration file tends to look different depending on whether the source is a database, the content is XML, or text is being extracted from rich documents.

Note

It's important to understand that the various data sources, data formats, and transformers are mostly independent. The next few examples pick combinations to illustrate a variety of possibilities. You should pick the pieces that you need.

Importing from databases

The following is the mb-dih-artists-jdbc.xml file with a rather long SQL query:

<dataConfig>
  <dataSource name="jdbc" driver="org.postgresql.Driver" url="jdbc:postgresql://localhost/musicbrainz_db" user="musicbrainz" readOnly="true" autoCommit="false" />
  <document>
    <entity name="artist" dataSource="jdbc" pk="id" query="
      select
        a.id as id,
        a.name as a_name, a.sortname as a_name_sort,
        a.begindate as a_begin_date, a.enddate as a_end_date,
        a.type as a_type,
        array_to_string(array(select aa.name from artistalias aa where aa.ref = a.id ),
          '|') as a_alias,
        array_to_string(
          array(select am.name from v_artist_members am where am.band = a.id order by am.id), '|') as a_member_name,
        array_to_string(array(select am.id from v_artist_members am where am.band = a.id order by am.id),
          '|') as a_member_id,
        (select re.releasedate from release re inner join album r on re.album = r.id where r.artist = a.id order by releasedate desc limit 1)
          as a_release_date_latest
      from artist a"
      transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
      <field column = "id" template="Artist:${artist.id}" />
      <field column = "type" template="Artist" />
      <field column = "a_begin_date"
        dateTimeFormat="yyyy-MM-dd" />
      <field column = "a_end_date"
        dateTimeFormat="yyyy-MM-dd" />
      <field column = "a_alias" splitBy="|" />
      <field column = "a_member_name" splitBy="|"/>
      <field column = "a_member_id" splitBy="|" />
    </entity>
  </document>
</dataConfig>

If the type attribute on dataSource is not specified (it isn't here), then it defaults to JdbcDataSource. Those familiar with JDBC should find the attributes in this example familiar, and there are also others available. For a reference to all of them, see the Solr Reference Guide.

Tip

Efficient JDBC configuration

Many database drivers in the default configurations (including those for PostgreSQL and MySQL) fetch all of the query results into memory instead of on-demand or using a batch/fetch size! This may work well for typical database usage, in which a relatively small amount of data needs to be fetched quickly, but is completely unworkable for ETL (Extract, Transform, and Load) usage such as this. Configuring the driver to stream the data will sometimes require driver-specific configuration settings. Settings for some specific databases are at http://wiki.apache.org/solr/DataImportHandlerFaq.

The main piece of an <entity/> used with a database is the query attribute, which is the SQL query to be evaluated. You'll notice that this query involves some subqueries, which are made into arrays and then transformed into strings joined by a pipe character. The particular functions used to do these sorts of things are generally database specific. This is done to shoehorn multivalued data into a single row in the results. It may make for a more complicated query, but it means that the database does all of the heavy lifting so that all of the data Solr needs for an artist is in one row.

Tip

Sub-entities

There are numerous examples on the DIH wiki depicting entities within entities (assuming the parent entity is a root entity). This is an approach to the problem of getting multiple values for the same Solr field. It's also an approach for spanning different data sources. We caution against this approach because it generates a separate query for each source record, which is very inefficient. It can be told to cache just one query to be used for future lookups, but that is only applicable to data shared across records that can also fit in memory. If all the required data is in your database, we recommend the approach illustrated earlier instead.
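For contrast, here is a rough sketch of what the sub-entity approach could look like for the artist aliases, simplified from the schema used above; each artist row triggers an additional alias query, which is the inefficiency being warned about:

<entity name="artist" dataSource="jdbc" query="select a.id as id, a.name as a_name from artist a">
  <!-- executed once per artist row -->
  <entity name="alias" dataSource="jdbc"
          query="select aa.name as a_alias from artistalias aa where aa.ref = '${artist.id}'"/>
</entity>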

Importing XML from a file with XSLT

In this example, we're going to import an XML file from the disk and use XSLT to do most of the work instead of DIH transformers.

Tip

Solr supports using XSLT to process input XML without requiring use of the DIH, as we show in this simple example. The following command would have the same effect:

curl 'http://localhost:8983/solr/mbartists/update/xslt?tr=artists.xsl&commit=true' -H 'Content-type:text/xml' --data-binary @downloads/artists_veryshort.xml

Here is the DIH configuration for this example:

<dataConfig>
 <dataSource name="artists" type="FileDataSource" encoding="UTF-8"/>
 <document name="artists">
   <entity name="artist" dataSource="artists" url="downloads/artists_veryshort.xml" processor="XPathEntityProcessor" xsl="cores/mbtype/conf/xslt/artists.xsl"
     useSolrAddSchema="true">
   </entity>
 </document>
</dataConfig>

This dataSource of type FileDataSource is for text files. The entity's url is relative to the baseUrl on the data source; since that isn't specified here, it defaults to the current working directory of the server. To see the referenced XSLT file, download the code supplement for the book.

An interesting thing about this example is not just the use of XSLT, but the use of useSolrAddSchema, which signifies that the resulting XML follows Solr's <add><doc><field name=… update XML structure. Our input file is an HTML table and the XSLT file transforms it. These two options are best used together.
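To be explicit, after the XSLT transformation the DIH expects the data to already be in Solr's update XML format, roughly like this sketch (the field names are carried over from the artist example and are only illustrative):

<add>
  <doc>
    <field name="id">Artist:1</field>
    <field name="a_name">Some Artist</field>
  </doc>
</add>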

Tip

There are some other examples at the DIH wiki illustrating XML processing. One of them shows how to process a Wikipedia XML file dump, which is rather interesting.

Importing multiple rich document files – crawling

In this example, we have a configuration that crawls all PDF files in a directory and then extracts text and metadata from them:

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="/my/file/path" fileName=".*pdf"
      recursive="true">
      <entity name="tika-test" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The FileListEntityProcessor is the piece that does the file crawling. It doesn't actually use a data source, but the attribute is still required, hence dataSource="null". Because this entity is not a root entity, thanks to rootEntity="false", it's the sub-entity within it that is the root entity, which corresponds to a Solr document. The entity is named f, and the sub-entity tika-test refers to the path provided by f via f.fileAbsolutePath in its url attribute. This example uses the variable substitution syntax ${…}.

Note

Speaking of which, there are a variety of variables that the DIH makes available for substitution, including those defined in solr.xml and solrconfig.xml. Again, see the DIH wiki for further details.

The TikaEntityProcessor part is relatively straightforward. Tika makes a variety of metadata available about documents; this example just uses two.

Importing commands

The DIH is issued one of several different commands to do different things. Importing all data is called a full import, in contrast to a delta import that will be described shortly. Commands are given to the DIH request handler with the command attribute. We could tell the DIH to do a full import just by going to this URL: http://localhost:8983/solr/mbartists/dataimport?command=full-import. On the command line, we will use the following code:

curl http://localhost:8983/solr/mbartists/dataimport -F command=full-import

It uses HTTP POST, which is more appropriate than GET, as discussed earlier.

Unlike the other importing mechanisms, the DIH returns an HTTP response immediately while the import continues asynchronously. To get the current status of the DIH, go to this URL: http://localhost:8983/solr/mbartists/dataimport, and you'll get an output like the following:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">15</int>
    </lst>
    <lst name="initArgs">
        <lst name="defaults">
            <str name="config">mb-dih-artists-jdbc.xml</str>
        </lst>
    </lst>
    <str name="status">idle</str>
    <str name="importResponse"/>
    <lst name="statusMessages"/>
    <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

The command attribute defaults to status, which is what this output shows. When an import is in progress, it shows statistics on that progress along with a status state of busy.

Other Boolean parameters named clean, commit, and optimize may accompany the command. clean is specific to the DIH: it deletes all documents before running the import. To customize exactly which documents are deleted, you can specify a preImportDeleteQuery attribute on a root entity. You can even specify documents to be deleted after an import by using the postImportDeleteQuery attribute. The query syntax is documented in Chapter 5, Searching.
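For example, a root entity could limit the deletion to its own documents with a fragment like the following; the type field and Artist value are assumptions carried over from the earlier artist example:

<entity name="artist" pk="id" preImportDeleteQuery="type:Artist" query="...">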

Tip

Beware that these defaults are inconsistent with other Solr importing mechanisms. No other importing mechanism will delete all documents first, and none will commit or optimize by default.

Two other useful commands are reload-config and abort. The first will reload the DIH configuration file, which is useful for picking up changes without having to restart Solr. The second will cancel any existing imports in progress.

Delta imports

The DIH supports what it calls a delta import, which is a mode of operation in which only data that has changed since the last import is retrieved. A delta import is only supported by the SqlEntityProcessor and it assumes that your data is time-stamped. The official DIH approach to this is prominently documented on the wiki. It uses a deltaImportQuery and deltaQuery pair of attributes on the entity, and a delta-import command. That approach is verbose, hard to maintain, and slow compared to a novel alternative documented at http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport.

Essentially, what you can do is introduce a timestamp check in your SQL's WHERE clause using variable substitution, along with another check on whether the clean parameter was given to the DIH, in order to control whether a delta or a full import happens. Here is a concise <entity/> definition on a fictitious schema and dataset showing the relevant WHERE clause:

<entity name="item" pk="ID"
  query="SELECT * FROM item
    WHERE '${dataimporter.request.clean}' != 'false'
      OR last_modified > '${dataimporter.last_index_time}'">

Notice the ${…} variable substitution syntax. To issue a full import, use the full-import command with clean enabled: /dataimport?command=full-import&clean=true. And for a delta import, we still use the full-import command, but we set clean to false: /dataimport?command=full-import&clean=false&optimize=false. We also disabled the index optimization since it's not likely that this is desired for a delta import.
