Indexing documents with Solr Cell

While most of this book assumes that the content you want to index in Solr is in a neatly structured data format of some kind, such as in a database table, a selection of XML files, or CSV, the reality is that we also store information in the much messier world of binary formats such as PDF, Microsoft Office, or even images and music files.

One of the coauthors of this book, Eric Pugh, first became involved with the Solr community when he needed to ingest the thousands of PDF and Microsoft Word documents that a client had produced over the years. The outgrowth of that early effort is Solr Cell, which provides a powerful yet simple framework for indexing rich document formats.

Tip

Solr Cell is technically called the ExtractingRequestHandler. The current name came about as a derivation of Content Extraction Library, which appeared more fitting to its author, Grant Ingersoll. Perhaps a name including Tika would have been most appropriate considering that this capability is a small adapter to Tika. You may have noticed that the DIH includes this capability via the appropriately named TikaEntityProcessor.

The complete reference material for Solr Cell is available at https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.

Extracting text and metadata from files

Every file format is different and all of them provide different types of metadata, as well as different methods of extracting content. The heavy lifting of providing a single API to an ever-expanding list of formats is delegated to Apache Tika.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Tika supports a wide variety of formats, from the predictable to the unexpected. Among the most commonly used formats are Adobe PDF and Microsoft Office, including Word, Excel, PowerPoint, Visio, and Outlook. Tika can also extract metadata from images such as JPG, GIF, and PNG, as well as from various audio formats such as MP3, MIDI, and Wave audio. Tika itself does not attempt to parse the individual document formats. Instead, it delegates the parsing to various third-party libraries, while providing a high-level stream of XML SAX events as the documents are parsed. A full list of the document formats supported by the 1.5 version, which is used by Solr 4.8, is available at http://tika.apache.org/1.5/formats.html.

Solr Cell is a fairly thin adapter to Tika consisting of a SAX ContentHandler that consumes the SAX events and builds the input document from the fields that are specified for extraction.

Some not-so-obvious things to keep in mind when indexing binary documents are:

  • You can supply any kind of supported document to Tika, and Tika will attempt to discover the correct MIME type of the document in order to use the correct parser. If you know the correct MIME type, you can specify it via the stream.type parameter.
  • The default SolrContentHandler that is used by Solr Cell is fairly simplistic. You may find that you need to perform extra massaging of the data being indexed beyond what Solr Cell offers to reduce the junk data being indexed. One approach is to implement a custom Solr UpdateRequestProcessor, described later in this chapter. Another is to subclass ExtractingRequestHandler and override createFactory() to provide a custom SolrContentHandler.
  • Remember that during indexing, you are potentially sending large binary files over the wire that must then be parsed by Solr, which can be very slow. If you are looking to only index metadata, then it may make sense to write your own parser using Tika directly, extract the metadata, and post that across to the server. See the Indexing with SolrJ section in Chapter 9, Integrating Solr for an example of parsing out metadata from an archive of a website and posting the data through SolrJ.

You can learn more about the Tika project at http://tika.apache.org/.

Configuring Solr

A sample request handler for parsing binary documents, in solrconfig.xml, looks like the following code:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="map.Last-Modified">last_modified</str>
      <str name="uprefix">metadata_</str>
    </lst>
</requestHandler>

Here, we can see that the Tika metadata attribute Last-Modified is being mapped to the Solr field last_modified, assuming we are provided that Tika attribute. The uprefix parameter specifies the prefix to use when storing any Tika fields that don't have a corresponding matching Solr field.
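With the handler configured, a document can be posted to it over HTTP. The following curl sketch assumes a default single-core Solr 4.x installation on port 8983; the file name my-file.pdf and the literal.id value doc1 are hypothetical placeholders:

```shell
# Post a local PDF to the /update/extract handler defined above.
# literal.id supplies the unique key; commit=true makes the document
# visible to searches immediately (fine for experimentation).
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@my-file.pdf"
```

The -F flag makes curl send the file as a multipart POST, which is the simplest way to stream a binary document to Solr Cell.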

Solr Cell is distributed as a contrib module and is made up of the solr-cell-4.x.x.jar and roughly 25 more JARs that support parsing individual document formats. In order to use Solr Cell, you will need to place the Solr Cell JAR and supporting JARs in the lib directory for the core, as it is not included by default in solr.war. To share these libs across multiple cores, you would add them to ./examples/cores/lib/.
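Instead of copying JARs around, you can also reference the contrib module from solrconfig.xml with <lib/> directives. The relative paths below are only a sketch; they depend on where the Solr distribution's contrib and dist directories sit relative to your core:

```
<!-- Load the Tika parser JARs and the Solr Cell JAR itself;
     adjust the dir paths to match your installation layout. -->
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
```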

Solr Cell parameters

Before jumping into examples, we'll review Solr Cell's configuration parameters, all of which are optional. They are organized here and are ordered roughly by their sequence of use internally.

At first, Solr Cell (or, more specifically, Tika) determines the format of the document. It generally makes good guesses, but it can be assisted with these parameters:

  • resource.name: This is an optional parameter for specifying the name of the file. This assists Tika in determining the correct MIME type.
  • stream.type: This optional parameter allows you to explicitly specify the MIME type of the document being extracted to Tika, taking precedence over Tika guessing.
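Both parameters are simply appended to the request URL. In this sketch, the file name report.pdf and the id value are illustrative; stream.type overrides Tika's own detection:

```shell
# Tell Tika the MIME type explicitly rather than letting it guess,
# and pass the original file name to assist type detection as a fallback.
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&stream.type=application/pdf&resource.name=report.pdf&commit=true" \
  -F "file=@report.pdf"
```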

Tika converts all input documents into a basic XHTML document, including metadata in the head section. The metadata becomes fields and all text within the body goes into the content field. The following parameters further refine this:

  • capture: This is the XHTML element name (for example, "p") to be copied into its own field; it can be set multiple times.
  • captureAttr: This is set to true to capture XHTML attributes into fields named after the attribute. A common example is for Tika to extract href attributes from all the <a/> anchor tags for indexing into a separate field.
  • xpath: This allows you to specify an XPath query to filter which element's text is put into the content field. To return only the metadata, and discard all the body content of the XHTML, you would use xpath=/xhtml:html/xhtml:head/descendant:node(). Notice the use of the xhtml: namespace prefix for each element. Note that only a limited subset of the XPath specification is supported. See http://tika.apache.org/0.8/api/org/apache/tika/sax/xpath/XPathParser.html. The API fails to mention that it also supports /descendant:node().
  • literal.[fieldname]: This allows you to supply the specified value for this field, for example, for the unique key field.
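The capture parameters combine naturally when indexing HTML. The sketch below captures anchor text into its own field and, via captureAttr, the href attributes into a field named after the attribute; the file name and id are placeholders, and the target fields (a, href) must be defined in your schema, for example as dynamic or multivalued text fields:

```shell
# Capture <a> element text into an "a" field and href attributes
# into an "href" field, alongside the usual content extraction.
curl "http://localhost:8983/solr/update/extract?literal.id=doc3&capture=a&captureAttr=true&commit=true" \
  -F "file=@page.html"
```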

At this point each resulting field name is potentially renamed in order to map into the schema. These parameters control this process:

  • lowernames: This is set to true to lowercase the field names and convert nonalphanumeric characters to an underscore. For example, Last-Updated becomes last_updated.
  • fmap.[tikaFieldName]: This maps a source field name to a target field name. For example, fmap.last_modified=timestamp maps the metadata field last_modified generated by Tika to be recorded in the timestamp field defined in the Solr schema.
  • uprefix: This prefix is applied to the field name, if the unprefixed name doesn't match an existing field. It is used in conjunction with a dynamic field for mapping individual metadata fields separately:
    uprefix=meta_
    <dynamicField name="meta_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
  • defaultField: This is a field to use if uprefix isn't specified, and no existing fields match. This can be used to map all the metadata fields into one multivalued field:
    defaultField=meta
    <field name="meta" type="text_general" indexed="true" stored="true" multiValued="true"/>
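Rather than passing the mapping parameters on every request, they can be fixed as handler defaults in solrconfig.xml. The following is a sketch; the timestamp target field is an example and must exist in your schema:

```
<!-- Mapping parameters set once as defaults for all extract requests. -->
<requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.last_modified">timestamp</str>
    <str name="uprefix">metadata_</str>
  </lst>
</requestHandler>
```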

    Tip

    Ignoring metadata fields

    If you don't want to index unknown metadata fields, you can throw them away by mapping them to the ignored_ dynamic field by setting uprefix="ignored_" and using the ignored field type: <dynamicField name="ignored_*" type="ignored" multiValued="true"/>.

The other miscellaneous parameters:

  • boost.[fieldname]: This boosts the specified field by the given factor, a float value, to affect scoring, for example, boost.title=2.5. The default value is 1.0.
  • extractOnly: Set this to true to return the XHTML structure of the document as parsed by Tika without indexing the document. This is typically done in conjunction with wt=json&indent=true to make the XHTML easier to read. The purpose of this option is to aid in debugging.
  • extractFormat: This defaults to xml (when extractOnly=true) to produce the XHTML structure. It can be set to text to return the raw text extracted from the document.
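A quick way to see what Tika makes of a document before committing to a schema is an extract-only request. The file name below is a placeholder:

```shell
# Return the parsed XHTML without indexing anything; wt=json&indent=true
# wraps the result in readable JSON for debugging.
curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" \
  -F "file=@my-file.pdf"
```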