Solr integrates many powerful components. For example, SolrCell can be used to extract text from rich documents, including PDF files, and to index it automatically, as we will see shortly. Under the hood, SolrCell relies on the Apache Tika library to detect document types and extract their text and metadata.
We can play with Tika to extract text from one of the PDFs in the /SolrStarterBook/resources/pdfs/ directory. Let's now see how to wire Tika into Solr so that we can index a PDF with a simple HTTP POST request:
solrconfig.xml
<config>
  <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
  ...
  <lib dir='${solr.core.instanceDir}/../lib' />
  ...
  <requestHandler name='/update' class='solr.UpdateRequestHandler' />
  <requestHandler name='/update/extract' class='solr.extraction.ExtractingRequestHandler'>
    <lst name='defaults'>
      <str name='captureAttr'>true</str>
      <str name='lowernames'>true</str>
      <str name='overwrite'>true</str>
      <str name='literalsOverride'>true</str>
      <str name='fmap.a'>link</str>
    </lst>
  </requestHandler>
  ...
</config>
We have defined the imports for the needed Tika libraries; in most solrconfig.xml files these <lib /> directives are placed near the top of the file. In our example we are importing all the JARs contained in a lib directory that sits next to the core folder. This configuration produces the same result that we would obtain by declaring the library in the solr.xml file:
<str name='sharedLib'>lib</str>
The libraries we need to copy into the lib directory are:
SOLR_DIST/dist/
: solr-cell-4.5.0.jar
SOLR_DIST/contrib/extraction/lib/
: tika-core-1.4.jar, tika-parsers-1.4.jar, pdfbox-1.8.1.jar, fontbox-1.8.1.jar, xercesImpl-2.9.2.jar
Note that we imported some libraries specific to PDF handling. If we need to extract metadata from different source formats, we have to add the corresponding parser libraries.
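Assuming a standard Solr 4.5 distribution layout, the copy step can be sketched as follows (the SOLR_DIST and core lib paths are illustrative; adjust them to your own installation):

```shell
# Illustrative paths: adjust to your own installation
SOLR_DIST=/opt/solr-4.5.0
CORE_LIB=/path/to/solr/home/lib   # the lib directory referenced by solrconfig.xml

mkdir -p "$CORE_LIB"
cp "$SOLR_DIST/dist/solr-cell-4.5.0.jar" "$CORE_LIB/"
cp "$SOLR_DIST"/contrib/extraction/lib/{tika-core-1.4,tika-parsers-1.4,pdfbox-1.8.1,fontbox-1.8.1,xercesImpl-2.9.2}.jar "$CORE_LIB/"
```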
We have also defined a new requestHandler at the path /update/extract. This is the address to which we post our PDF file to have Tika extract its text.
Navigate to the directory where you have your PDF, for example:
>> cd /SolrStarterBook/resources/pdfs/
Now we will extract data from the PDF using Tika via the /update/extract API:
>> curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?extractOnly=true&extractFormat=text' -F '[email protected]'
The extracted output will be structured text containing the metadata parsed by Tika along with the plain text. We can also set extractFormat=xml to obtain the content as XHTML; the format of the response itself is controlled by the usual wt parameter (for example, wt=json). The extractOnly parameter asks Solr to run the Tika extraction and return the results in the response, without sending the extracted content to the update handler.
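For instance, to preview the extraction result as JSON without indexing anything, we can combine extractOnly with the standard wt parameter (the file name here is just an example):

```shell
curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?extractOnly=true&extractFormat=text&wt=json' \
  -F '[email protected]'
```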
You can read more specific information on the SolrCell component at http://wiki.apache.org/solr/ExtractingRequestHandler.
The Tika component exposes various metadata from rich documents, depending on the document type. It is possible to index metadata even for MP3s or images, so we can, for example, search over EXIF values or over an album description in the ID3 tags.
Using cURL we can send every kind of request to a Solr core, and we can even post files to be indexed using SolrCell. Now that we have seen how Tika extracts metadata and text from files, we can finally start indexing them into Solr:
>> curl -X POST 'http://localhost:8983/solr/pdfs/update/extract?extractFormat=text&literal.annotation=The+Wikipedia+Page+About+Apache+Lucene&commit=true' -F '[email protected]'
In the preceding example we have dropped the extractOnly parameter (which is equivalent to writing extractOnly=false), so the extracted content is no longer returned as output but is sent directly to the update handler and indexed. The commit=true parameter ensures that the indexed data is saved and made available for searches. One last note: the literal.annotation parameter can be used to add custom metadata during the extract/post phase.
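Once the document has been indexed, the custom annotation field should be searchable like any other field (assuming the schema accepts a field with that name); a quick check might look like this:

```shell
curl 'http://localhost:8983/solr/pdfs/select?q=annotation:Lucene&wt=json&indent=true'
```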