Using Solr 1301 Patch – reduce-side indexing

The Solr 1301 patch generates a Solr index using the Apache Hadoop MapReduce framework. The patch was merged into Solr 4.7, so it ships with the code line of Apache Solr 4.7 and later versions. It is similar to the previously discussed patch (SOLR-1045), with one difference: the indexes generated with Solr 1301 are built in the reduce phase of Apache Hadoop's MapReduce rather than in the map phase. Once the indexes are generated, they can be loaded into Solr or SolrCloud for further processing and application searching. The following diagram depicts the overall flow:

Figure: Reduce-side indexing flow with the Solr 1301 patch

In the case of Solr 1301, a map task is responsible for converting the input records into <key, value> pairs, which are then passed to the reducer. The reducer converts these pairs into SolrInputDocument objects, which are then transformed into Solr indexes. The indexes are persisted directly on HDFS and can later be exported to a Solr instance. In recent Solr releases, this patch is part of the contrib module in the $SOLR_HOME/contrib/map-reduce folder. This contrib module provides a MapReduce job that allows a user to build Solr indexes and, optionally, merge them into the Solr cluster.
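Conceptually, the per-record conversion on the reduce side looks something like the following illustrative sketch; the field names id and text are assumptions and not the patch's actual mapping:

import org.apache.solr.common.SolrInputDocument;

// Illustrative only: shows the kind of per-record conversion performed on the reduce side.
public class KeyValueToSolrDocument {

    // Build a SolrInputDocument from one (key, value) pair emitted by the mapper.
    public static SolrInputDocument toDocument(String key, String value) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", key);     // assumed unique key field
        doc.addField("text", value); // assumed content field
        return doc;
    }
}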

You will require a Hadoop cluster to run the Solr 1301 patch. Since the patch is merged in Solr version 4.7, it is already part of the Solr contrib modules. Once Hadoop is set up, you can run the following command:

$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR \
  jar $SOLR_HOME/contrib/dist/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --morphline-file readAvroContainer.conf \
  --zk-host 127.0.0.1:9983 \
  --output-dir hdfs://127.0.0.1:8020/outdir \
  --collection collection1 \
  --log4j log4j.properties \
  --go-live --verbose \
  "hdfs://127.0.0.1:8020/indir"

In this command, the --config parameter takes the configuration folder path of the Hadoop setup, and the mapred.child.java.opts property passes JVM options to the child MapReduce tasks. The --zk-host parameter points to an Apache ZooKeeper instance, --output-dir is the HDFS directory where the output of this program is stored, and --collection points to the target collection in Apache Solr. The --log4j parameter points to the logging configuration, the --go-live option enables merging of the output shards of the previous phase into a set of live, customer-facing Solr servers, and --morphline-file specifies the morphline configuration that defines the Avro-based processing pipeline.

This will run the mapper and the reducer to generate a Solr index. Once the index is created through the Hadoop patch, it can be provisioned to the Solr server. The patch contains a default converter for CSV files. Let's look at some of the important classes that are a part of this patch:

  - CSVDocumentConverter: This class takes care of converting the output of the mapper, that is, (key, value) pairs, into SolrInputDocument objects.
  - CSVReducer: This class provides the reducer implementation for the Hadoop reduce phase.
  - CSVIndexer: This class has to be called from the command line to create indexes using MapReduce; it requires additional parameters, such as the input paths and the output path for storing the index shards.
  - CSVMapper: This class parses the CSV input and extracts the key-value pairs from it.
  - SolrDocumentConverter: This class is responsible for transforming custom objects into SolrInputDocument; it converts (key, value) pairs into data that resides on HDFS or locally.
  - SolrRecordWriter: This class extends the MapReduce RecordWriter; it takes the (key, value) pairs and transforms them into the SolrInputDocument form.
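To give a concrete idea of the converter contract, the following is a hedged sketch of a custom converter. It assumes that the patch's SolrDocumentConverter is a generic abstract class exposing a convert(key, value) method that returns a collection of SolrInputDocument objects (mirroring what CSVDocumentConverter does); the package names, the field names, and the tab-separated record layout are illustrative assumptions:

import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;
// SolrDocumentConverter is one of the Solr 1301 patch classes; adjust its import
// to the package shipped with your Solr distribution.

public class MyDocumentConverter extends SolrDocumentConverter<LongWritable, Text> {

    // Turn one (key, value) pair emitted by the mapper into a SolrInputDocument.
    @Override
    public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
        SolrInputDocument doc = new SolrInputDocument();
        String[] columns = value.toString().split("\t"); // assumed tab-separated record
        doc.addField("id", key.get());                   // record offset as the unique key
        doc.addField("title", columns[0]);               // first column as the title field
        if (columns.length > 1) {
            doc.addField("body", columns[1]);            // second column as the body field
        }
        return Collections.singletonList(doc);
    }
}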

Follow these steps to run this patch:

  1. Create a local folder containing a conf folder with the Solr configuration (solrconfig.xml, schema.xml) and a lib folder with the required libraries.
  2. SolrDocumentConverter is an abstract class for writing your own converters. Create your own converter class extending SolrDocumentConverter; it will be used by SolrOutputFormat to convert output records into Solr documents. If required, replace the OutputFormat class provided by the patch with your own extension.
  3. Write a simple Hadoop MapReduce job and set up the output format and the converter in the job configuration (a fuller driver sketch follows after these steps):
    // Ship the zipped Solr configuration directory to the tasks via the distributed cache
    SolrOutputFormat.setupSolrHomeCache(new File(solrConfigDir), conf);
    // Write the reduce output through SolrOutputFormat instead of a plain file format
    conf.setOutputFormat(SolrOutputFormat.class);
    // Register the converter that turns (key, value) pairs into SolrInputDocuments
    SolrDocumentConverter.setSolrDocumentConverter(<your classname>.class, conf);
  4. Zip your configuration and load it into HDFS. The ZIP file should be named solr.zip (unless you change the patch code).
  5. Now, run the patch. Each job will instantiate EmbeddedSolrInstance, which will in turn perform the conversion, and finally, the SolrInputDocument objects will be stored through the output format.
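Putting the preceding steps together, the following is a hedged sketch of what such a driver might look like. It assumes the older org.apache.hadoop.mapred API (JobConf) that the snippet in step 3 uses, and that the patch classes (SolrOutputFormat, SolrDocumentConverter, CSVMapper, CSVReducer) are on the classpath; MyDocumentConverter is the hypothetical converter from step 2, and all paths are placeholders:

import java.io.File;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
// Patch classes (SolrOutputFormat, SolrDocumentConverter, CSVMapper, CSVReducer) are
// assumed to be on the classpath; adjust their imports to your Solr distribution.

public class ReduceSideIndexingDriver {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReduceSideIndexingDriver.class);
        conf.setJobName("solr-1301-reduce-side-indexing");

        // Mapper and reducer shipped with the patch for CSV input; replace with your own.
        conf.setMapperClass(CSVMapper.class);
        conf.setReducerClass(CSVReducer.class);

        // Ship the local Solr configuration folder (conf/ and lib/) to the task nodes (step 1).
        SolrOutputFormat.setupSolrHomeCache(new File("/path/to/local/solr/config"), conf);

        // Write the reduce output as a Solr index rather than plain files (step 3).
        conf.setOutputFormat(SolrOutputFormat.class);

        // Register the custom converter created in step 2 (hypothetical class name).
        SolrDocumentConverter.setSolrDocumentConverter(MyDocumentConverter.class, conf);

        // Input data on HDFS and the output directory where the index shards will be stored.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}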

With reduce-side index generation, it is possible to preserve the weights of documents, which can contribute to the prioritization performed during a search query.

Unlike with Solr 1045, merging of indexes is not possible here because the indexes are created in the reduce phase. The reducer therefore becomes the crucial component of the system, as the major tasks are performed in it.
