Time for action – indexing rich documents (for example, PDF)

So, here we are. We have seen several simple yet different ways to index documents in Solr: starting with the initial examples (direct indexing of PDF files using Tika extraction), followed by "manual" indexing using cURL over HTTP (the most widely used approach), and finally adopting one or more specific DataImportHandler configurations. We can now close the circle by seeing how to index rich documents (PDFs in particular, but others as well) using Tika again for metadata and text extraction, this time through a DataImportHandler processor.

We will define a new core in /SolrStarterBook/solr-app/chp08/tika, which will be almost identical to the previous one. First of all, we have to verify that the required libraries are correctly added in our solrconfig.xml configuration file:

<lib dir="../../../solr/dist/" regex="solr-dataimporthandler-.*.jar" />
<lib dir="../../../solr/contrib/extraction/lib" />
<lib dir="../../../solr/dist/" regex="solr-cell-\d.*\.jar" />

Once the imports are in place, we have to configure a binary data source for the import handler. We will handle the binary files with a dedicated TikaEntityProcessor, as follows:

<dataSource name="bin" type="BinFileDataSource" />
<document>
  <entity name="files" dataSource="null" 
    rootEntity="false" 
    processor="FileListEntityProcessor" 
    baseDir="${solr.solr.home}/../../resources/arts_pdf" 
    fileName=".*" 
    onError="skip" 
    transformer="script:console" 
    recursive="true">
    <field column="fileAbsolutePath" name="path" />
    <entity pk="uri" name="file" 
      dataSource="bin" 
      processor="TikaEntityProcessor" 
      url="${files.fileAbsolutePath}" 
      format="text" 
      rootEntity="true" 
      onError="skip">
      <field column="Author" name="author" meta="true" />
      <field column="title" name="title" meta="true" />
      <field column="text" name="text" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
    </entity>
  </entity>
</document>
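The transformer="script:console" attribute in the configuration above expects a JavaScript function named console to be declared in the same data-config file. As a hedged sketch (the body shown here is only an assumption for debugging purposes, and relies on the JVM's built-in JavaScript engine), it could look like this:

```xml
<!-- Hypothetical debugging transformer: prints every row emitted by the
     entity, so we can inspect which metadata fields Tika produces.
     It must be declared inside <dataConfig>, next to <document>. -->
<script><![CDATA[
  function console(row) {
    // row is a java.util.Map of column name -> value
    java.lang.System.out.println(row);
    return row; // return the row unchanged so indexing proceeds
  }
]]></script>
```

With the core in place, a full import can then be triggered through the usual DataImportHandler endpoint (for example, http://localhost:8983/solr/tika/dataimport?command=full-import, assuming the default port and the core name used here).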

In this example we use some common fields, but we can obviously map any metadata field emitted by the Tika processor. If you are not familiar with Tika, you can define a script transformer that prints all the emitted fields, to decide which ones you want and whether you need to manipulate their values during the indexing phase.

What just happened?

This processor uses Tika to extract common metadata (based on Dublin Core and other schemas) from rich documents. The metadata we obtain depends on the type of binary file indexed: a MIDI file can offer musical notation and lyrics, an image can contain its own EXIF data, and a PDF file will have at least an author, a title, and the raw, unstructured, unparsed plain text. The latter is always emitted in a specific text field.

format="text" is the default, and it produces plain text containing all the content of, say, an MS Word or PDF file. Beware that this output can be very noisy, so it is sometimes better to request a format="xml" version of the same data, which can then be handled further with an XPathEntityProcessor.
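As a sketch of that second approach (the field name and XPath expressions here are assumptions; TikaEntityProcessor with format="xml" emits an XHTML document), a nested XPathEntityProcessor can parse the parent entity's output through its dataField attribute instead of fetching a URL:

```xml
<!-- Hypothetical sketch: the outer entity asks Tika for XHTML output,
     and the nested XPathEntityProcessor parses the parent's "text"
     column rather than reading from a URL of its own. -->
<entity name="file" dataSource="bin"
    processor="TikaEntityProcessor"
    url="${files.fileAbsolutePath}"
    format="xml" rootEntity="false" onError="skip">
  <entity name="xhtml" processor="XPathEntityProcessor"
      forEach="/html" dataField="file.text" rootEntity="true">
    <field column="title" xpath="/html/head/title" />
  </entity>
</entity>
```

This keeps the extraction and the structural parsing as two separate, composable steps, at the cost of a slightly more verbose configuration.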
