The Apache Solr architecture

An Apache Solr instance follows a client-server model and can run as a single core or as multiple cores. A Solr core is a running instance of a Solr index together with its configuration. Earlier versions of Apache Solr supported only a single core, which limited consumers to running Solr for one application, with a single schema and configuration file. Support for multiple cores was added later. With it, one Solr instance can serve multiple schemas and configurations under unified administration. You can run Solr in multicore mode with the following command:

java -Dsolr.solr.home=multicore -jar start.jar

Apache Solr is composed of multiple modules, some of them being separate projects in themselves. Let's understand the different components of the Apache Solr architecture. The following diagram depicts the Apache Solr conceptual architecture:

Figure: The Apache Solr architecture

Apache Solr can run in a master-slave mode. The index replicator is responsible for distributing indexes across multiple slaves. The master server handles index updates, and the slaves communicate with the master to replicate those indexes for high availability. The Apache Lucene core is packaged as a library with the Apache Solr application. It provides the core functionality for Solr, such as indexing, query processing, searching data, ranking matched results, and returning them.

Apache Lucene comes with a variety of query implementations. Query Parser is responsible for parsing the search string passed by the end user. Lucene provides TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, SpanQuery, and so on as query implementations.
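To make the idea of query implementations concrete, here is a minimal Python sketch of three of them. The class names mirror Lucene's TermQuery, PrefixQuery, and BooleanQuery, but the code is a toy model written for illustration, not Lucene's API; the `matches` method, the clause fields, and the sample documents are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class TermQuery:
    """Toy query: matches a document that contains one exact term."""
    term: str

    def matches(self, tokens: FrozenSet[str]) -> bool:
        return self.term in tokens


@dataclass(frozen=True)
class PrefixQuery:
    """Toy query: matches a document with any term starting with the prefix."""
    prefix: str

    def matches(self, tokens: FrozenSet[str]) -> bool:
        return any(token.startswith(self.prefix) for token in tokens)


@dataclass(frozen=True)
class BooleanQuery:
    """Toy query: combines other queries as required, optional, or prohibited."""
    must: Tuple = ()
    should: Tuple = ()
    must_not: Tuple = ()

    def matches(self, tokens: FrozenSet[str]) -> bool:
        if any(q.matches(tokens) for q in self.must_not):
            return False
        if not all(q.matches(tokens) for q in self.must):
            return False
        if not self.must:
            # Without required clauses, at least one optional clause must match.
            return any(q.matches(tokens) for q in self.should)
        return True


# Hypothetical documents, already reduced to sets of terms.
docs = {
    1: frozenset({"apache", "solr", "search"}),
    2: frozenset({"apache", "lucene", "index"}),
    3: frozenset({"solrcloud", "cluster"}),
}

query = BooleanQuery(must=(TermQuery("apache"),), must_not=(TermQuery("lucene"),))
hits = sorted(doc_id for doc_id, tokens in docs.items() if query.matches(tokens))
prefix_hits = sorted(d for d, tokens in docs.items() if PrefixQuery("solr").matches(tokens))
print(hits, prefix_hits)
```

Each query type is a small object that knows how to test a document, which is why a parser can build arbitrarily nested queries out of them.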

Index Searcher is a basic component of Solr search, with a default base searcher class. This class is responsible for returning the matched results for the searched keyword, ordered by their computed score. Index Reader provides access to indexes stored in the file system. It can be used for searching an index. Complementing Index Searcher, Index Writer allows you to create and maintain indexes in Apache Lucene.
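The division of labor between writer, reader, and searcher can be sketched with a small in-memory inverted index. This is an illustrative model only: the `ToyIndex` class and its method names are hypothetical, and raw term frequency stands in for Lucene's real scoring, which is considerably more involved.

```python
from collections import Counter, defaultdict


class ToyIndex:
    """In-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        """Writer role: create or extend the index."""
        for term, freq in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = freq

    def term_docs(self, term):
        """Reader role: give access to what is stored for a term."""
        return self.postings.get(term, {})

    def search(self, keyword):
        """Searcher role: return (doc_id, score) pairs, best score first."""
        hits = self.term_docs(keyword.lower())
        # Assumption: the score is simply the term frequency; ties break by doc_id.
        return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


index = ToyIndex()
index.add_document(1, "solr search server")
index.add_document(2, "solr solr index")
index.add_document(3, "lucene library")

results = index.search("Solr")
print(results)
```

The searcher never touches the stored data directly; it goes through the reader method, which mirrors the layering described above.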

Analyzer is responsible for examining the fields and generating tokens. Tokenizer breaks field data into lexical units or tokens. Filter examines a stream of tokens from Tokenizer and either keeps them, transforms them, discards them, or creates new ones. Tokenizers and Filters together form a chain or pipeline of Analyzers. There can only be one Tokenizer per Analyzer. The output of one stage in the chain is fed to the next. Solr uses the Analyzer process for indexing as well as for querying. Analyzers play an important role in speeding up query and index time and in finding the right set of matches; they also reduce the amount of data generated by these operations. You can define your own custom Analyzers depending upon your use case.
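The tokenizer-plus-filters chain described above can be sketched in a few lines of Python. This is a simplified model rather than Solr's implementation: the function names, the stop word list, and the synonym table are assumptions chosen to show one filter that transforms tokens, one that discards them, and one that creates new ones.

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}   # hypothetical stop word list
SYNONYMS = {"quick": ["fast"]}          # hypothetical synonym table


def simple_tokenizer(text):
    """Tokenizer: break field data into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase_filter(tokens):
    """Filter that transforms every token."""
    return (token.lower() for token in tokens)


def stop_filter(tokens):
    """Filter that discards some tokens."""
    return (token for token in tokens if token not in STOP_WORDS)


def synonym_filter(tokens):
    """Filter that creates new tokens next to the originals."""
    for token in tokens:
        yield token
        yield from SYNONYMS.get(token, [])


class Analyzer:
    """Exactly one tokenizer followed by an ordered chain of filters."""

    def __init__(self, tokenizer, filters):
        self.tokenizer = tokenizer
        self.filters = list(filters)

    def analyze(self, text):
        tokens = self.tokenizer(text)
        for token_filter in self.filters:
            tokens = token_filter(tokens)  # output of one stage feeds the next
        return list(tokens)


analyzer = Analyzer(simple_tokenizer, [lowercase_filter, stop_filter, synonym_filter])
tokens = analyzer.analyze("The Quick search of Solr!")
print(tokens)
```

Running the same analyzer over both the indexed text and the query string is what lets a query for "quick" find a document that was indexed with "Quick".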

Query Parser is responsible for parsing the queries and converting them into Lucene Query objects. Different types of parsers are available, such as lucene, DisMax, and edismax. Each parser offers different functionality and can be chosen on the basis of particular requirements. Once a query is parsed, the parser hands it over to the index searcher, which runs it against the index store through the index reader, gathers the results, and sends them to the response writer. Response Writer is responsible for responding to the client; it formats the query response on the basis of the search outcomes from the Lucene engine.
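From the client side, the choice of query parser and response writer shows up as request parameters. The sketch below only builds a search URL and does not contact a server. The `q`, `defType`, `qf`, and `wt` parameters are standard Solr request parameters; the host, the core name `collection1`, the field names, and the helper function itself are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_select_url(core_url, user_query, parser="edismax",
                     fields=("title", "body"), response_format="json"):
    """Build a Solr search URL that selects a query parser and a response writer."""
    params = {
        "q": user_query,        # the search string to be parsed
        "defType": parser,      # query parser: lucene, dismax, or edismax
        "wt": response_format,  # response writer, for example json or xml
    }
    if parser in ("dismax", "edismax"):
        # These parsers search the fields listed in qf (hypothetical field names).
        params["qf"] = " ".join(fields)
    return f"{core_url}/select?{urlencode(params)}"


# Hypothetical local core URL.
url = build_select_url("http://localhost:8983/solr/collection1", "apache solr")
print(url)
```

Switching `parser` to `lucene` drops the `qf` parameter, since that parser takes its field names from the query string itself or from a default field.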

Index Handler is a type of update handler that handles adding, updating, and deleting documents for indexing. Apache Solr supports updates through the index handler in the JSON, XML, and plain-text formats.
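As an illustration of the JSON format, the following sketch builds an update message that adds one document, deletes another by ID, and commits. The `add`, `delete`, and `commit` keys follow Solr's JSON update syntax; the helper function, the document fields, and the IDs are hypothetical, and the code only prepares the message without sending it.

```python
import json


def build_update_message(doc, delete_id=None, commit=True):
    """Build a Solr JSON update message as a string."""
    message = {"add": {"doc": doc}}
    if delete_id is not None:
        message["delete"] = {"id": delete_id}
    if commit:
        message["commit"] = {}
    return json.dumps(message)


# Hypothetical document and ID values.
payload = build_update_message({"id": "book-1", "title": "Apache Solr"}, delete_id="book-0")
print(payload)
```

Adding a document whose unique key already exists in the index replaces the stored document, which is how an update is expressed in this format.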

Data Import Handler (DIH) provides a mechanism for integrating different data sources with Apache Solr for indexing. The data sources could be relational databases or web-based sources (for example, RSS, ATOM feeds, and e-mails).

Although DIH is developed as part of Solr, the default installation does not include it in the Solr application; it must be included in the application explicitly. We will be looking at Apache Tika in detail in the following sections.
