The SOLR-1301 patch is responsible for generating an index using the Apache Hadoop MapReduce framework. This patch was merged into Solr version 4.7 and is available in the code line if you take Apache Solr version 4.7 or later. It is similar to the previously discussed patch (SOLR-1045), but the difference is that the indexes generated using SOLR-1301 are created in the reduce phase, not in the map phase, of Apache Hadoop's MapReduce. Once the indexes are generated, they can be loaded into Solr or SolrCloud for further processing and application search. The following diagram depicts the overall flow:
In the case of SOLR-1301, a map task is responsible for converting the input records into <key, value> pairs, which are then passed to the reducer. The reducer is responsible for converting and publishing SolrInputDocument, which is then transformed into Solr indexes. The indexes are persisted directly on HDFS and can later be exported to a Solr instance. In the latest Solr releases, this patch is part of the contrib module in the $SOLR_HOME/contrib/map-reduce folder, which provides a MapReduce job that allows a user to build Solr indexes and, optionally, merge them into a live Solr cluster.
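Before looking at the command, it may help to see the flow in code. The following is a minimal, purely illustrative Java sketch of the two phases using Hadoop's old mapred API; the CSV layout and field names are assumptions, and in the actual patch the conversion to SolrInputDocument is performed by a SolrDocumentConverter plugged into SolrOutputFormat (described later) rather than written by hand in the reducer:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.common.SolrInputDocument;

// Map phase: turn each input record into a <key, value> pair.
class SketchMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line, OutputCollector<Text, Text> out,
      Reporter reporter) throws IOException {
    String[] fields = line.toString().split(",");   // assumed CSV layout: id,title
    out.collect(new Text(fields[0]), new Text(fields[1]));
  }
}

// Reduce phase: build a SolrInputDocument per key; the output format
// persists the documents as a Lucene/Solr index on HDFS.
class SketchReducer extends MapReduceBase implements Reducer<Text, Text, Text, SolrInputDocument> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, SolrInputDocument> out, Reporter reporter) throws IOException {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());             // "id" is an assumed unique key field
    while (values.hasNext()) {
      doc.addField("title", values.next().toString());
    }
    out.collect(key, doc);
  }
}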
You will require a Hadoop cluster to run the SOLR-1301 patch. The patch is merged into Solr version 4.7 and is already part of Solr contrib. Once Hadoop is set up, you can run the following command:
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar $SOLR_HOME/contrib/dist/solr-map-reduce-*.jar -D 'mapred.child.java.opts=-Xmx500m' --morphline-file readAvroContainer.conf --zk-host 127.0.0.1:9983 --output-dir hdfs://127.0.0.1:8020/outdir --collection collection1 --log4j log4j.properties --go-live --verbose "hdfs://127.0.0.1:8020/indir"
In this command, the --config parameter takes the path of the Hadoop setup's configuration folder, and the mapred.child.java.opts property passes JVM options through to the MapReduce tasks. The --zk-host parameter points to an Apache ZooKeeper instance, --output-dir is where the output of this program should be stored, --collection points to the target collection in Apache Solr, and --log4j points to the logging configuration. The --go-live option enables merging of the output shards of the previous phase into a set of live, customer-facing Solr servers, and --morphline-file provides the configuration of the Avro-based extraction pipeline.
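The morphline file itself is a HOCON configuration that defines the extraction pipeline. The following is a minimal sketch of what a file such as readAvroContainer.conf might contain; the Avro paths, field names, and import patterns are assumptions and can differ across Kite Morphlines and Solr versions:

# Locator picked up by loadSolr; the values mirror the command-line flags above.
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:9983"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Parse each input file as an Avro container and emit one record per datum.
      { readAvroContainer {} }
      # Copy selected Avro fields into flat record fields (paths are assumptions).
      { extractAvroPaths { paths : { id : /id, text : /text } } }
      # Hand each record to Solr using the locator defined above.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]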
This will run the mapper and the reducer to generate a Solr index. Once the index is created through the Hadoop patch, it can be provisioned to the Solr server. The patch contains a default converter for CSV files. Let's look at some of the important classes that are a part of this patch. The CSVDocumentConverter class takes care of converting the output of the mapper (key-value pairs) to SolrInputDocument. The CSVReducer class provides the reducer implementation for Hadoop's reduce phase. The CSVIndexer class is called from the command line to create indexes using MapReduce; it requires additional parameters, such as the input paths and the output path for storing shards. Similarly, the CSVMapper class parses the CSV input and extracts the key-value pairs from it. The SolrDocumentConverter class is responsible for transforming custom objects into SolrInputDocument; it transforms (key, value) pairs into data that resides in HDFS or locally. The SolrRecordWriter class provides an extension of the MapReduce record writer; it receives the data as (key, value) pairs, which are then transformed into the SolrInputDocument form.
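To make the converter contract concrete, here is a minimal, hypothetical sketch of a custom converter. The exact abstract method signature varies between patch versions, so treat the generics and the return type as assumptions based on the description above:

import java.util.Collection;
import java.util.Collections;
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter: turns one (key, value) pair into one SolrInputDocument.
class SimpleDocumentConverter extends SolrDocumentConverter<Text, Text> {
  @Override
  public Collection<SolrInputDocument> convert(Text key, Text value) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());     // assumed unique key field from schema.xml
    doc.addField("text", value.toString()); // assumed content field
    return Collections.singletonList(doc);
  }
}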
Follow these steps to run this patch:

1. Create a local folder with a conf folder containing the Solr configuration (solrconfig.xml and schema.xml) and a lib folder, which contains the required libraries.
2. SolrDocumentConverter provides an abstract class for writing your own converters. Create your own converter class implementing SolrDocumentConverter (as in the sketch above); this will be used by SolrOutputFormat to convert output records to Solr documents. If required, override the OutputFormat class provided in Solr with your own extension.
3. Wire the converter and the output format into the job configuration:
SolrOutputFormat.setupSolrHomeCache(new File(solrConfigDir), conf);
conf.setOutputFormat(SolrOutputFormat.class);
SolrDocumentConverter.setSolrDocumentConverter(<your classname>.class, conf);
4. Zip your configuration folder as solr.zip (unless you change the patch code).
5. Run the job; each task will create an EmbeddedSolrInstance, which will in turn do the conversion, and finally, the SolrOutputDocument will be stored in the output format.

With reduce-side index generation, it is possible to preserve the weights of documents, which can contribute to the prioritization performed during a search query.
Merging of indexes is not possible as it is in SOLR-1045, because the indexes are created in the reduce phase. The reducer thus becomes the crucial component of the system, as the major tasks are performed in it.
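If you do not use the --go-live option, one possible way to provision a generated shard by hand is to copy it out of HDFS into a core's data directory and reload the core. The output layout (results/part-00000/data/index) and the local paths below are assumptions that can vary by version and setup:

$HADOOP_HOME/bin/hadoop fs -copyToLocal hdfs://127.0.0.1:8020/outdir/results/part-00000/data/index /var/solr/data/collection1/data/
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1'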