Solr Cell basics

As we saw earlier, the Solr Cell framework builds on the Apache Tika framework. Let's look at some of its basic concepts.

Specify the MIME type explicitly with the stream.type parameter to tell Tika the document type. If this parameter is omitted, Tika determines the document type on its own.
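As a minimal sketch of how stream.type might be passed, the following Python snippet builds an extract request URL. The host, the core name mycore, the handler path /update/extract, and the helper build_extract_url are assumptions for illustration; adjust them to your own setup.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, mime_type=None):
    """Build a Solr Cell extract URL, optionally forcing the MIME type.

    base_url is assumed to point at a core, and /update/extract is assumed
    to be the path where the extracting request handler is registered.
    """
    params = {"literal.id": doc_id, "commit": "true"}
    if mime_type is not None:
        # stream.type tells Tika the document type instead of letting it guess.
        params["stream.type"] = mime_type
    return f"{base_url}/update/extract?{urlencode(params)}"


# Hypothetical host and core name, used only for illustration.
url = build_extract_url("http://localhost:8983/solr/mycore", "doc1", "application/pdf")
print(url)
```

The document itself would then be sent as the body of a POST request to this URL. Leaving out mime_type produces a URL without stream.type, in which case Tika falls back to its own detection.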

Tika also produces metadata of its own, such as Title, Author, and Subject, following the Dublin Core standard. Some of the file types from which metadata can be extracted are as follows:

HTML
XML and derived formats such as XHTML, OOXML and ODF
Microsoft Office document formats
OpenDocument (ODF)
iWork document formats
PDF
Email formats
Crypto formats
Rich Text Format (RTF)
Electronic publication formats such as EPUB
Packaging and compression formats such as .tar, .zip, and .7z files
Text formats
Help formats
Feed and syndication formats (RSS and atom feeds)
Audio formats
JARs and Java class files
Video formats
CAD formats
Scientific formats
EXE programs and libraries
Image formats
Source code
Font formats

The text extracted from any of these formats is mapped to the content field. In addition, Tika's metadata fields can be mapped to Solr fields.
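The mapping described above is typically expressed through request parameters. Here is a small sketch, assuming the fmap.<source>=<target> and uprefix parameters of Solr Cell; the target field names text and author_s, the prefix attr_, and the helper mapping_params are hypothetical and depend on your schema.

```python
from urllib.parse import urlencode


def mapping_params(field_map, unknown_prefix=None):
    """Turn a {tika_field: solr_field} dict into Solr Cell request parameters."""
    # Each fmap.<source>=<target> pair renames one extracted field.
    params = {f"fmap.{source}": target for source, target in field_map.items()}
    if unknown_prefix:
        # uprefix is prepended to extracted fields that the schema does not define.
        params["uprefix"] = unknown_prefix
    return params


# Hypothetical schema fields: "text" for the body, "author_s" for the author.
params = mapping_params({"content": "text", "author": "author_s"}, unknown_prefix="attr_")
query = urlencode(params)
print(query)
```

The exact metadata names that Tika emits vary with the file type and Tika version, so it is worth inspecting the extracted fields before settling on a mapping.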

Tika first produces an XHTML stream, which is passed to a SAX ContentHandler. Solr then reacts to the various SAX events and finally creates the fields to index. Because this intermediate output is XHTML, we can apply an XPath expression to it to filter the content.
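To illustrate the XPath filtering, the sketch below encodes an expression into the xpath request parameter. The expression selects the nodes under div elements of the XHTML body; the document ID doc1 is a placeholder, and the range of XPath syntax that Tika accepts is limited, so treat the expression as an assumption to verify against your version.

```python
from urllib.parse import parse_qs, urlencode

# Keep only the content found under <div> elements of the XHTML body.
XHTML_DIVS = "/xhtml:html/xhtml:body/xhtml:div//node()"

# doc1 is a placeholder ID used only for illustration.
params = {"literal.id": "doc1", "xpath": XHTML_DIVS}

# urlencode escapes the slashes and colons so the expression survives in a URL.
query = urlencode(params)

# Decoding the query string gives back the original expression.
decoded = parse_qs(query)
print(decoded["xpath"][0])
```

Encoding the expression matters here because characters such as / and : would otherwise be ambiguous inside the query string.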
