Solr Cell basics

As we saw earlier, the Solr Cell framework builds on the Tika framework. Let's look at some of its basic concepts.

To tell Tika the document type, specify the MIME type explicitly with the stream.type parameter; otherwise, Tika determines the document type on its own.
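As a minimal sketch of how stream.type might be passed, the following Python snippet builds a request URL for the extracting handler. The host, the core name (mycore), and the helper name build_extract_url are assumptions made for illustration; only the stream.type, literal.id, and commit parameters come from Solr Cell itself.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust for your installation.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"


def build_extract_url(doc_id, mime_type=None):
    """Return an extract URL, optionally pinning the MIME type via stream.type."""
    params = {"literal.id": doc_id, "commit": "true"}
    if mime_type is not None:
        # Without stream.type, Tika detects the document type on its own.
        params["stream.type"] = mime_type
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


print(build_extract_url("doc1", "application/pdf"))
```

Leaving mime_type unset produces a URL without stream.type, which corresponds to letting Tika detect the type.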

Tika creates some additional metadata on its own, such as Title, Author, and Subject, following the Dublin Core standard. Some of the file types from which metadata can be extracted are as follows:

  • HTML
  • XML and derived formats such as XHTML, OOXML and ODF
  • Microsoft Office document formats
  • OpenDocument (ODF)
  • iWorks document formats
  • PDF
  • Email formats
  • Crypto formats
  • Rich Text Format (RTF)
  • Electronic publication format (EPUB)
  • Packaging and compression formats such as .tar, .zip, and .7z files
  • Text format
  • Help formats
  • Feed and syndication formats (RSS and Atom feeds)
  • Audio formats
  • JARs and Java class files
  • Video formats
  • CAD formats
  • Scientific formats
  • EXE programs and libraries
  • Image formats
  • Source code
  • Font formats

All text extracted from any of these formats is mapped to the content field. In addition, Tika's metadata fields can be mapped to Solr fields.
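The field mapping can be sketched with Solr Cell's fmap.<source>=<target> and uprefix request parameters. In this Python snippet the helper mapping_params and the target field names (text, author_s, attr_) are assumptions chosen for illustration; they would have to match your own schema.

```python
from urllib.parse import urlencode


def mapping_params(field_map, unknown_prefix=None):
    """Build fmap.* parameters that rename Tika fields to Solr fields."""
    params = {f"fmap.{source}": target for source, target in field_map.items()}
    if unknown_prefix is not None:
        # uprefix is prepended to fields that are not defined in the schema.
        params["uprefix"] = unknown_prefix
    return params


# Hypothetical target fields: "text" and "author_s" must exist in your schema.
params = mapping_params({"content": "text", "author": "author_s"},
                        unknown_prefix="attr_")
print(urlencode(params))
```

The resulting query string can be appended to an extract request so that the extracted body lands in text and the author metadata in author_s.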

First, Tika produces an XHTML stream, which is passed to a SAX ContentHandler. Solr then acts on the SAX events and finally creates the fields to index. Because the intermediate output is XHTML, we can apply an XPath expression to it to filter the content.
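To make the SAX step concrete, here is a rough Python analogy rather than Solr's actual Java implementation. A handler reacts to element events on a small, made-up XHTML sample and keeps only the text inside div elements, which is roughly the effect of passing an expression such as /xhtml:html/xhtml:body/xhtml:div//node() in Solr Cell's xpath parameter. The sample document and the class name DivTextHandler are assumptions.

```python
import xml.sax

# Small XHTML sample standing in for Tika's output (made up for illustration).
XHTML = b"""<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body><div><p>First paragraph.</p></div><p>Second paragraph.</p></body>
</html>"""


class DivTextHandler(xml.sax.ContentHandler):
    """Collect character data that occurs inside <div> elements only."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # how many <div> elements are currently open
        self.chunks = []     # text fragments seen while inside a <div>

    def startElement(self, name, attrs):
        if name == "div":
            self.div_depth += 1

    def endElement(self, name):
        if name == "div":
            self.div_depth -= 1

    def characters(self, content):
        if self.div_depth > 0:
            self.chunks.append(content)


handler = DivTextHandler()
xml.sax.parseString(XHTML, handler)
div_text = "".join(handler.chunks).strip()
print(div_text)
```

Text outside the div, such as the title and the second paragraph, is ignored, which mirrors how a filter restricts what reaches the indexed fields.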
