Time for action – using Tika and cURL to extract text from PDFs

Solr integrates many powerful components; for example, SolrCell can be used to extract text from rich documents (even PDF files) and automatically index it, as we will see shortly. SolrCell is built on the Tika library, which is designed to perform the following functions:

  • Extract text from rich documents such as PDF, doc, and others
  • Extract metadata from a file
  • Identify the MIME type of a file
  • Automatically identify the language used in a document

We can play with Tika to extract text from one of the PDFs in the /SolrStarterBook/resources/pdfs/ directory. We will now see how it's possible to wire Tika into Solr in order to index a PDF with a simple HTTP POST request:

  1. First of all, we need to register the SolrCell component so that it can be executed by a specific Solr request handler. All we have to do is add the following configuration to solrconfig.xml:
    <config>
      <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
    ...
      <lib dir='${solr.core.instanceDir}/../lib' />
    ...
      <requestHandler name='/update' class='solr.UpdateRequestHandler' />
    
      <requestHandler name='/update/extract' class='solr.extraction.ExtractingRequestHandler'>
        <lst name='defaults'>
          <str name='captureAttr'>true</str>
          <str name='lowernames'>true</str>
          <str name='overwrite'>true</str>
          <str name='literalsOverride'>true</str>
          <str name='fmap.a'>link</str>
        </lst>
      </requestHandler>
    ...
    </config>
  2. The new request handler also contains some interesting parameters, which instruct Tika and the handler on how to map the metadata extracted from the binary PDF file to index fields. For example, fmap.a maps the content of every <a> tag to a field named link, lowernames normalizes the extracted field names to lowercase, and captureAttr indexes the attributes of the XHTML elements emitted by Tika into separate fields. These defaults can also be overridden on a per-request basis, as shown in the sketch below.
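  A minimal sketch of such an override (the target fields text and source are hypothetical; they must be defined in your schema for the request to succeed):

    # hypothetical fields: 'text' and 'source' must exist in schema.xml
    >> curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?fmap.content=text&literal.source=manual&commit=true' -F '[email protected]'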

What just happened?

We have defined the imports for the needed Tika libraries; in most solrconfig.xml files, these <lib/> directives have to be placed in the first part of the file. In our example we are importing all the JARs contained in a lib directory that sits next to the core's instance directory. This configuration produces the same result we would obtain by declaring a shared library directory in the solr.xml file:

<str name='sharedLib'>lib</str>

The libraries we need to copy and put into the lib directory are:

  • SOLR_DIST/dist/: solr-cell-4.5.0.jar
  • SOLR_DIST/contrib/extraction/lib/: tika-core-1.4.jar, tika-parsers-1.4.jar, pdfbox-1.8.1.jar, fontbox-1.8.1.jar, xercesImpl-2.9.2.jar

Note that we imported some libraries specific to PDF parsing (pdfbox and fontbox). If we need to parse metadata from different sources, we have to add the specific library able to parse that format.
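As a quick sketch of this setup (the paths here are placeholders: SOLR_DIST stands for your Solr distribution directory, and /SolrStarterBook/solr-app/lib for the shared lib directory that sits next to the core's instance directory), copying the libraries could look like this:

# placeholder paths; adapt them to your own layout
>> mkdir -p /SolrStarterBook/solr-app/lib
>> cp SOLR_DIST/dist/solr-cell-4.5.0.jar /SolrStarterBook/solr-app/lib/
>> cp SOLR_DIST/contrib/extraction/lib/*.jar /SolrStarterBook/solr-app/lib/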

We have defined a new requestHandler at the path /update/extract. This is the address where we have to post our PDF file in order to obtain the text extracted from it by Tika.

Navigate to the directory where you have your PDF, for example:

>> cd /SolrStarterBook/resources/pdfs/

Now we will extract data from the PDF using Tika through the /update/extract API:

>> curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?extractOnly=true&extractFormat=text' -F '[email protected]'

The extracted output will be structured text, containing both the metadata parsed by Tika and the plain text of the document. We can set extractFormat=xml to obtain the extracted content as structured XML instead; note that the serialization of the Solr response itself (XML, JSON, and so on) is controlled by the usual wt parameter. The extractOnly parameter asks Solr to call Tika for the extraction only, without sending the extracted metadata and text to the update handler, so nothing gets indexed.
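For example, to ask Tika for an XML-structured extraction while reading the whole response serialized as JSON (wt and indent are the standard Solr response parameters):

>> curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?extractOnly=true&extractFormat=xml&wt=json&indent=true' -F '[email protected]'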

You can read more specific information on the SolrCell component at http://wiki.apache.org/solr/ExtractingRequestHandler.

The Tika component exposes several metadata fields from rich documents, depending on the type of the document. It is possible to index metadata even for MP3s or images, so we can, for example, search over the EXIF values of a photo or over the album description stored in the ID3 tags of an MP3.

Using cURL to index some PDF data

Using cURL, we can send any kind of request to a Solr core; we can even post files to be indexed through SolrCell. Now that we have seen how Tika is able to extract metadata and text from files, we can finally start indexing them into Solr:

>> curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?extractFormat=text&literal.annotation=The+Wikipedia+Page+About+Apache+Lucene&commit=true' -F '[email protected]'

In the preceding example we have omitted the extractOnly parameter (which is equivalent to writing extractOnly=false), so the extracted metadata and text are not returned as output but are sent directly to the update handler for indexing. Then commit=true ensures that the indexed data is saved and made available for searches. A last note: literal.annotation can be used to add a custom annotation value to the document during the extract/post phase.
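To verify that the document has actually been indexed, we can run a simple query (a minimal sketch, assuming the annotation field is defined as a tokenized text field in the schema, so that a search for a single term of the literal value matches):

# searches the custom field populated by literal.annotation above
>> curl 'http://localhost:8983/solr/pdfs/select?q=annotation:Lucene&wt=json&indent=true'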
