Let's go ahead and create a new core called tika-example in our Solr instance. To make things easier, you can copy the core from the Chapter 6 folder of the ZIP file that comes with this book. After creating the core, we'll need to configure solrconfig.xml.
We need to load the extraction libraries from the %SOLR_HOME/contrib/extraction/lib folder, along with the solr-cell library, by adding the following lib directives to solrconfig.xml:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar"/>
We can then configure ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">false</str>
  </lst>
</requestHandler>
We can override the default values used by ExtractingRequestHandler by passing them in the defaults list. ExtractingRequestHandler will, by default, put the content of the extracted file into the text field, but we can override that using the fmap.content key. Also, captureAttr tells ExtractingRequestHandler whether to index the attributes of the XHTML elements that Tika produces into separate fields. To keep this example simple, we'll set captureAttr to false.
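The handler defaults above only work if the schema has somewhere to put the extracted values. The following schema.xml fragment is a minimal sketch of the fields this configuration would need, not necessarily what the tika-example core in the book's ZIP file contains. It assumes a field type named text_general is already defined in the schema; the field names content and attr_* are taken from the fmap.content and uprefix values configured above.

```xml
<!-- Sketch only: assumes the text_general field type already exists in schema.xml -->

<!-- Receives the extracted body text, as mapped by fmap.content=content -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Catches any extracted field not declared in the schema, which uprefix=attr_ renames to attr_<name> -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

With lowernames set to true, field names coming from Tika are lowercased (and non-alphanumeric characters replaced with underscores) before the uprefix is applied, so an unknown metadata field such as Content-Type would land in attr_content_type.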
The solrconfig.xml configuration file will look as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>4.10.1</luceneMatchVersion>

  <lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar"/>
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
  <lib dir="../../../contrib/extraction/lib" regex=".*\.jar"/>
  <lib dir="../../../dist/" regex="solr-cell-\d.*\.jar"/>

  <dataDir>${solr.data.dir:}</dataDir>

  <requestDispatcher handleSelect="false">
    <httpCaching never304="true"/>
  </requestDispatcher>

  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy"/>

  <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.content">content</str>
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">false</str>
    </lst>
  </requestHandler>
</config>
After making the configuration changes, let's go ahead and index some documents in Solr.