Using the Solr 1045 patch – map-side indexing

The Apache Solr 1045 patch provides Solr users a way to build Solr indexes using the MapReduce framework of Apache Hadoop. Once created, this index can be pushed to Solr storage. The following diagram depicts the mapper and reducer in Hadoop:

[Figure: the mapper and reducer in Hadoop]

Each Apache Hadoop mapper transforms its input records into a set of (key, value) pairs, which are then transformed into SolrInputDocument objects. The mapper task ends up creating an index from these SolrInputDocument objects.

The reducer's job is to de-duplicate the different indexes and merge them if needed. Once the indexes are created, you can load them into your Solr instance and use them for searching. You can read more about this patch at https://issues.apache.org/jira/browse/SOLR-1045.
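The flow described above can be modelled in a few lines of plain Python, without Hadoop or Solr. This is only a conceptual sketch, not the patch's code: the function names (map_to_index, reduce_merge), the dict standing in for an index, and the use of an id field as the de-duplication key are all assumptions made for illustration.

```python
def map_to_index(records):
    """Mapper sketch: turn (key, value) records into documents and index them.

    The returned dict stands in for one mapper's index: unique id -> document.
    """
    index = {}
    for key, value in records:
        doc = {"id": key, "text": value}  # stand-in for a SolrInputDocument
        index[doc["id"]] = doc
    return index


def reduce_merge(indexes):
    """Reducer sketch: merge per-mapper indexes, keeping one document per id."""
    merged = {}
    for index in indexes:
        for doc_id, doc in index.items():
            # De-duplication: a later index overwrites an earlier one with the same id.
            merged[doc_id] = doc
    return merged


# Two input splits, each handled by its own mapper; "doc2" appears in both.
split_a = [("doc1", "hadoop basics"), ("doc2", "solr basics")]
split_b = [("doc2", "solr basics, revised"), ("doc3", "map-side indexing")]

merged_index = reduce_merge([map_to_index(split_a), map_to_index(split_b)])
print(sorted(merged_index))  # ['doc1', 'doc2', 'doc3']
```

Note that the merge step only sees finished per-mapper indexes; it has no per-document ranking information to work with, only the key it de-duplicates on.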

The patch is applied through the standard SVN patching process. To apply it to your Solr instance, you first need to build Solr from source, using a source version that the Solr 1045 patch supports. Next, download the patch from the Apache JIRA site (https://issues.apache.org/jira/secure/attachment/12401278/SOLR-1045.0.patch). Before applying the patch, do a dry run, which checks the patch against your source tree without actually changing any files. You can do this with the following commands:

cd <solr-trunk-dir>
svn patch <name-of-patch> --dry-run

If the dry run completes without any failures, you can apply the patch. You can also perform the dry run using the plain patch command:

patch -p0 -i <name-of-patch> --dry-run

If it is successful, you can run the same command without the --dry-run option to apply the patch. On Windows, you can apply the patch with a right-click:

[Figure: applying the patch on Windows from the right-click menu]

On Linux, you can use the svn patch command as shown in the previous example. Let's look at some of the important classes in the patch. The SolrIndexUpdateMapper class is responsible for creating new indexes from the input documents. The SolrXMLDocRecordReader class reads Solr input XML files for indexing. The SolrIndexUpdater class is responsible for creating a MapReduce job, running it to read the documents, and updating the Solr instance.
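To make the record reader's role more concrete, here is a minimal Python sketch of reading Solr input XML, where documents are wrapped in add, doc, and field elements. It is not the SolrXMLDocRecordReader implementation from the patch; the function name read_solr_xml_docs and the sample documents are hypothetical and only illustrate the kind of parsing such a reader has to do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input in Solr's XML update format.
SAMPLE_XML = """<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">Map-side indexing</field>
  </doc>
  <doc>
    <field name="id">doc2</field>
    <field name="title">Reducer-side merge</field>
  </doc>
</add>"""


def read_solr_xml_docs(xml_text):
    """Yield one dict per <doc> element, mapping field names to lists of values."""
    root = ET.fromstring(xml_text)
    for doc_el in root.iter("doc"):
        doc = {}
        for field_el in doc_el.findall("field"):
            # A field name may repeat (multi-valued fields), so collect values in a list.
            doc.setdefault(field_el.get("name"), []).append(field_el.text or "")
        yield doc


docs = list(read_solr_xml_docs(SAMPLE_XML))
print(len(docs))  # 2
```

Each yielded dict corresponds to one record that a mapper would then turn into a document for indexing.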

Note

Although the Apache Solr 1045 patch provides an excellent parallel mapper and reducer, when the indexing is done on the map side, all the <key, value> pairs received by the reducer carry equal weight/importance. So, it is difficult to use this patch with data that carries ranking/weight information.

This patch also provides a way for users to merge the indexes in the reducer phase. The patch is not yet part of a Solr release, but it is targeted for the Solr 4.9/5.0 release.
