Using Solr 1045 Patch – map-side indexing

The Apache SOLR-1045 patch gives Solr users a way to build Solr indexes with the MapReduce framework of Apache Hadoop. Once created, the index can be pushed to Solr storage. The following diagram depicts the Mapper and Reducer in Hadoop:

[Diagram: the Mapper and Reducer in Hadoop]

Each Apache Hadoop mapper transforms its input records into a set of (key, value) pairs, which are then converted into SolrInputDocument objects. The mapper task then builds an index from these documents.
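The map-side flow can be sketched in plain Python. This is only a conceptual illustration: it does not use the Hadoop or Solr APIs, and the names to_document, build_index, and map_records, as well as the dict-based inverted index, are hypothetical stand-ins for SolrInputDocument and the index that the real mapper builds.

```python
from collections import defaultdict


def to_document(pairs):
    """Turn (key, value) pairs into a dict, a stand-in for SolrInputDocument."""
    doc = {}
    for key, value in pairs:
        doc[key] = value
    return doc


def build_index(documents):
    """Build a tiny inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc in documents:
        doc_id = doc["id"]
        for field, value in doc.items():
            if field == "id":
                continue  # the id identifies the document; it is not indexed as text
            for term in str(value).lower().split():
                index[term].add(doc_id)
    return dict(index)


def map_records(records):
    """One mapper task: input records -> documents -> a local index."""
    documents = [to_document(record) for record in records]
    return build_index(documents)


# Two hypothetical input records, each already split into (key, value) pairs.
records = [
    [("id", "1"), ("title", "Hadoop MapReduce")],
    [("id", "2"), ("title", "Solr indexing with Hadoop")],
]
shard = map_records(records)
```

Each mapper would produce one such local index from its own slice of the input, which is why a later merge step is needed.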

The reducer de-duplicates the indexes produced by the mappers and merges them if needed. Once the indexes are created, you can load them on your Solr instance and use them for searching. You can read more about this patch at https://issues.apache.org/jira/browse/SOLR-1045.
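The reducer's de-duplicate-and-merge role can be illustrated with a small Python sketch. It assumes each mapper's output is represented as a dict of document id to document, and that the first copy of a duplicated id wins; both are assumptions made for illustration, not the patch's actual behavior. The function name merge_indexes is hypothetical.

```python
def merge_indexes(shards):
    """Merge per-mapper outputs (id -> document), skipping duplicate ids.

    Returns the merged mapping and the list of ids that were seen more than once.
    """
    merged = {}
    duplicates = []
    for shard in shards:
        for doc_id, doc in shard.items():
            if doc_id in merged:
                # Assumption: keep the first copy and drop later ones.
                duplicates.append(doc_id)
                continue
            merged[doc_id] = doc
    return merged, duplicates


# Two hypothetical mapper outputs that both contain document "2".
shard_a = {"1": {"title": "Hadoop MapReduce"}, "2": {"title": "Solr indexing"}}
shard_b = {"2": {"title": "Solr indexing"}, "3": {"title": "Index merging"}}

merged, duplicates = merge_indexes([shard_a, shard_b])
```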

The patch is applied through the standard svn (Subversion) patching process. To apply it to your Solr instance, you first need to build Solr from source, using a source revision that the SOLR-1045 patch supports. Then download the patch from the Apache JIRA site (https://issues.apache.org/jira/secure/attachment/12401278/SOLR-1045.0.patch). Before applying the patch, do a dry run, which checks the patch without changing any files. You can do this with the following command:

cd <Solr-trunk-dir>
svn patch <name-of-patch> --dry-run

If the dry run works without any failure, you can apply the patch directly. You can also perform the dry run by using a simple patch command:

patch -p0 --dry-run -i <name-of-patch>
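The dry-run-then-apply sequence can also be scripted. The following is a minimal sketch, assuming GNU patch is on the PATH, that the patch file path is given relative to the source directory, and that the patch uses paths relative to that directory (hence -p0). The function name apply_patch and its run parameter are hypothetical; run exists only so the command runner can be swapped out.

```python
import subprocess


def apply_patch(patch_file, source_dir, run=subprocess.run):
    """Apply patch_file inside source_dir only if a dry run succeeds.

    Returns True when the patch was applied, False otherwise.
    """
    base = ["patch", "-p0", "-i", patch_file]

    # Step 1: dry run. GNU patch exits non-zero if any hunk cannot be applied.
    dry = run(base + ["--dry-run"], cwd=source_dir)
    if dry.returncode != 0:
        return False

    # Step 2: the same command without --dry-run actually modifies the files.
    real = run(base, cwd=source_dir)
    return real.returncode == 0


# Example call (placeholder paths, not executed here):
# apply_patch("SOLR-1045.0.patch", "/path/to/solr-trunk")
```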

Once the dry run succeeds, run the same command without the --dry-run option to apply the patch. On Windows, you can apply the patch with the right click:

[Screenshot: applying the patch from the right-click menu on Windows]

On Linux, you can use svn patch as shown in the previous example.

Let's look at some of the important classes in the patch. The SolrIndexUpdateMapper class is responsible for creating new indexes from the input documents. The SolrXMLDocRecordReader class reads Solr input XML files for indexing. The SolrIndexUpdater class is responsible for creating a MapReduce job and running it to read the documents and update the Solr instance.
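To make the record-reading step concrete, here is a small Python sketch of what reading a Solr input XML file involves. It is not the implementation of SolrXMLDocRecordReader; it assumes the standard Solr XML update format (an add element containing doc elements with named field elements), and the function name read_solr_xml and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the Solr XML update format.
SAMPLE = """<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Hadoop MapReduce</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">Solr indexing</field>
    <field name="tag">search</field>
    <field name="tag">hadoop</field>
  </doc>
</add>"""


def read_solr_xml(xml_text):
    """Parse Solr update XML into a list of documents.

    Each document maps a field name to a list of values, because a
    field name may appear more than once in a doc element.
    """
    root = ET.fromstring(xml_text)
    documents = []
    for doc_el in root.iter("doc"):
        doc = {}
        for field_el in doc_el.findall("field"):
            name = field_el.get("name")
            doc.setdefault(name, []).append(field_el.text or "")
        documents.append(doc)
    return documents


docs = read_solr_xml(SAMPLE)
```

Each parsed document corresponds to one input record that a mapper would then turn into an indexable document.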

Note

Although the SOLR-1045 patch provides an excellent parallel mapper and reducer, when the indexing is done on the map side, all the <key, value> pairs received by the reducer carry equal weight/importance. So, it is difficult to use this patch with data that carries ranking/weight information.

The patch also lets users merge the indexes in the reducer phase. It is not yet part of a Solr release, but it is targeted for the Solr 4.9/5.0 release.
