Configuring Apache Tika in Solr

Let's go ahead and create a new core called tika-example in our Solr instance. To make things easier, you can copy the core from the Chapter 6 folder of the ZIP file that comes with this book. After creating the core, we'll need to configure solrconfig.xml.

We need to load the extraction libraries that are available in the %SOLR_HOME%/contrib/extraction/lib folder, along with the solr-cell library, by adding the following lines to solrconfig.xml:

<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar"/>

We can then configure ExtractingRequestHandler in solrconfig.xml:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">false</str>
  </lst>
</requestHandler>

The values in the defaults list are applied to every request sent to ExtractingRequestHandler unless the request supplies its own values. The fmap.content key sets the field that receives the body text of the extracted file; here, it is mapped to the content field. Setting lowernames to true lowercases the field names that Tika produces, and uprefix adds the attr_ prefix to any extracted field that is not defined in the schema. Finally, captureAttr tells ExtractingRequestHandler to index the attributes of the XHTML elements that Tika generates into separate fields. To keep this example simple, we'll set captureAttr to false.
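
For comparison, here is a rough sketch of what the defaults list might look like with captureAttr switched on. It is modeled on the sample configuration that ships with Solr rather than on this chapter's core. The links field name is an assumption: a field with that name would have to exist in the schema for the mapping to work.

```xml
<!-- Sketch only: a variant of the handler with attribute capture enabled. -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <!-- Index the attributes of Tika's XHTML elements as separate fields -->
    <str name="captureAttr">true</str>
    <!-- Map values captured from <a> elements (for example, href) to a
         hypothetical "links" field; this field is not part of the example -->
    <str name="fmap.a">links</str>
  </lst>
</requestHandler>
```

With this variant, link targets found in an HTML document would be stored in their own field instead of being discarded, at the cost of a larger index.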

The solrconfig.xml configuration file will look as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>4.10.1</luceneMatchVersion>
  <lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar"/>
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

  <lib dir="../../../contrib/extraction/lib" regex=".*\.jar"/>
  <lib dir="../../../dist/" regex="solr-cell-\d.*\.jar"/>

  <dataDir>${solr.data.dir:}</dataDir>

  <requestDispatcher handleSelect="false">
    <httpCaching never304="true"/>
  </requestDispatcher>

  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy"/>

  <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.content">content</str>
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">false</str>
    </lst>
  </requestHandler>
</config>
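
The handler configured above writes to a field named content and to fields that start with attr_, so the core's schema.xml has to be able to accept them. The following fragment is a minimal sketch of what that could look like. It assumes the schema defines a text_general field type, as Solr's sample schema does; the schema of the core copied from the Chapter 6 folder may already contain equivalent definitions or use different types.

```xml
<!-- schema.xml fragment (sketch); the text_general type is an assumption -->

<!-- Receives the extracted body text (the target of fmap.content) -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Catches extracted fields that are not defined in the schema,
     which the handler prefixes with attr_ (the uprefix setting) -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

Without a matching field or dynamic field, Solr would reject documents that contain unknown attr_ fields.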

After making the configuration changes, let's go ahead and index some documents in Solr.
