Content Extraction Library

The Content Extraction Library (also known as SolrCell) integrates the popular Apache Tika framework to detect and extract metadata and text from a large variety of file types, such as PDF, Microsoft Office, LibreOffice, and OpenOffice documents.

Apache Tika provides a façade parser interface on top of several low-level frameworks that manage and manipulate specific file types (for example, PDFBox for PDFs and Apache POI for Microsoft documents). Its simple interface also provides automatic MIME type detection, so the framework itself can determine which parser needs to be applied to a given file.

On the Solr side, a dedicated ExtractingRequestHandler receives the input data (files) sent by clients and uses Tika to extract their metadata and text.

The configuration of ExtractingRequestHandler follows the same procedure that we saw for the other handlers. Specifically, it has to be declared in solrconfig.xml, as follows:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    …
  </lst>
</requestHandler>

SolrCell has several options that can be configured to fine-tune its behavior. Most of them are related to metadata handling, field name mapping, and custom Tika configuration.
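
As an illustration of the kind of options involved, the defaults section of the handler might contain parameters like the ones below. This is only a sketch, not the configuration of the example project: lowernames, uprefix, and fmap.content are standard SolrCell parameters, but the values shown here, the attr_ prefix (which assumes a matching attr_* dynamic field in the schema), and the text target field are assumptions made for the example.

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Metadata handling: normalize metadata names to lowercase with underscores -->
    <str name="lowernames">true</str>
    <!-- Metadata handling: prefix fields that are not in the schema (assumes an attr_* dynamic field) -->
    <str name="uprefix">attr_</str>
    <!-- Field name mapping: store the text extracted by Tika in the (assumed) text field -->
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>

The same parameters can usually be passed as request parameters on a single call instead, in which case they override the defaults declared in the handler.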

The src/solr/solr-home/example-data folder in the example project contains a document that can be sent to SolrCell. Open a shell and type the following (replace the PROJECT_HOME placeholder with the local directory that contains your ch7 project):

# curl "http://localhost:8983/solr/example/update/extract?commit=true" -F data=@PROJECT_HOME/ch7/src/solr/solr-home/example-data/libreoffice-writer.odt

Wait for a moment, and then you should see a response like this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">572</int>
  </lst>
</response>

The document (a LibreOffice file in this case, but you can also try other files) has been indexed. To verify this, open a browser and go to http://127.0.0.1:8983/solr/example/select?q=stream_name:libreoffice-writer.odt&indent=true. The XML response shows the extracted text (under the text field) and all the metadata fields that were detected for that document.