Understanding SolrCloud

SolrCloud provides a new way to enable distributed enterprise search with Apache Solr. Previously, with the standard distributed Solr support, steps such as configuring solrconfig.xml to talk with shards and adding documents to the right shards had to be done manually; SolrCloud automates this kind of work. Unlike the traditional master/slave approach to distributed Solr, SolrCloud is built around a leader-replica model. SolrCloud runs on top of Apache Zookeeper, so let's first understand Zookeeper.

Why Zookeeper?

SolrCloud consists of a cluster of nodes that talk with one another through Apache Zookeeper. Apache Zookeeper is responsible for maintaining coordination among the various nodes. Besides coordinating among nodes, it also maintains configuration information and provides group services to the distributed system. Because it manages this information in memory, it offers distributed coordination at high speed.

Note

Apache Zookeeper itself is replicated over a set of nodes called an ensemble, which together form the Zookeeper service. The data that Zookeeper stores is organized as a hierarchy of nodes, each of which is called a znode.

Each Zookeeper ensemble has one leader and many followers. The leader is elected when the Zookeeper cluster starts. The Apache Zookeeper nodes contain information related to the distributed cluster: changes in the data, timestamps, ACLs (Access Control Lists), as well as client-uploaded information. Zookeeper organizes this metadata in a hierarchical namespace, much like a conventional UNIX filesystem. The following figure depicts the structure of Zookeeper in a distributed environment:

[Figure: Structure of Zookeeper in a distributed environment]

When the cluster is started, one of the nodes is elected as the leader; all the others become followers. Each follower keeps a read-only copy of the leader's metadata and stays in sync with the leader by listening to the leader's atomic broadcast messages. Once the leader broadcasts a change, it waits until a majority of the followers acknowledge it before committing the change and informing the client that the transaction is complete. This means that Apache Zookeeper ensures eventual consistency. Clients are allowed to upload their own information to Zookeeper and have it distributed across the cluster, and they can connect to followers to read that information. Zookeeper keeps a sequential record of updates in its transaction logs; hence, it guarantees that updates from different clients are applied in the order in which the leader receives them.
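
To make this concrete, here is a minimal sketch using the standard Zookeeper Java client: it uploads a small piece of client information as a znode and then reads it back. The connection string, znode path, and payload are placeholder assumptions, not values required by Solr or Zookeeper.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the host:port values here are placeholders.
        ZooKeeper zk = new ZooKeeper("m1:2181,m2:2181,m3:2181", 3000, event -> {});

        // Upload a small piece of client information as a znode.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "replicas=2".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // A read may be served by any follower that has replicated the leader's data.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}

If the read is served by a follower that has not yet applied the latest broadcast, the client may briefly see stale data, which is the eventual-consistency behaviour described above.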

Note

Running Zookeeper in standalone mode is convenient for development and testing, but in production you should run Zookeeper in replicated mode. A replicated group of servers in the same application is called a quorum.

In case the leader fails, the next leader is chosen and the clients are expected to connect to the new leader. Apache Solr utilizes Zookeeper to enable its distributed capabilities, and it ships with an embedded Zookeeper in its default installation. Apache Zookeeper has been used by many distributed systems, including Apache Hadoop.

SolrCloud architecture

We have already seen the concepts of shards and indexing in the earlier chapter, but it is important to understand some of the terminology used in SolrCloud. Like Apache Zookeeper, SolrCloud has a concept of leaders and replicas. Let's assume that we have to create a SolrCloud for a document database. Right now, the document database has a total of three documents, which are as follows:

Document [1] = "what are you eating"
Document [2] = "are you eating pie"
Document [3] = "I like apple pie"

The inverted index for these documents will be as follows:

what(1,1), are(1,2)(2,1), you(1,3)(2,2), eating(1,4)(2,3), pie(2,4)(3,4), I(3,1), like(3,2), apple(3,3)
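
As a quick cross-check, the posting lists above can be reproduced with a few lines of Java. This is only an illustrative sketch of term-to-(docId, position) postings, not how Lucene actually stores its index internally.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = {
            "what are you eating",   // Document [1]
            "are you eating pie",    // Document [2]
            "I like apple pie"       // Document [3]
        };

        // term -> list of (docId, position) postings; positions start at 1
        Map<String, List<int[]>> index = new LinkedHashMap<>();
        for (int docId = 1; docId <= docs.length; docId++) {
            String[] terms = docs[docId - 1].split(" ");
            for (int pos = 1; pos <= terms.length; pos++) {
                index.computeIfAbsent(terms[pos - 1], t -> new ArrayList<>())
                     .add(new int[]{docId, pos});
            }
        }

        // Prints entries such as are(1,2)(2,1) and pie(2,4)(3,4)
        index.forEach((term, postings) -> {
            StringBuilder line = new StringBuilder(term);
            for (int[] p : postings) {
                line.append("(").append(p[0]).append(",").append(p[1]).append(")");
            }
            System.out.println(line);
        });
    }
}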

A collection is a complete set of indices in the SolrCloud cluster of nodes; in this case, it will be as follows:

what(1,1), are(1,2)(2,1), you(1,3)(2,2), eating(1,4)(2,3), pie(2,4)(3,4), I(3,1), like(3,2), apple(3,3)

A shard, in this case, is a piece of the complete index. The shard leader holds the primary copy of that piece, and a shard replica contains a copy of the same shard. Together, the shard leader and its replica form a complete shard index, or slice. Let's say we divide the index into three shards; they will look like the following:

Shard1: what(1,1), are(1,2)(2,1), you(1,3)(2,2)
Shard2: eating(1,4)(2,3), pie(2,4)(3,4)
Shard3: I(3,1), like(3,2), apple(3,3)

If we assume that all the shards are replicated across three machines, each node participating in SolrCloud will contain one or more shards or shard replicas of the index, and the setup will look similar to the one shown in the following table:

Machine/VM    Solr instance – Port*
M1            M1:8983/solr/: Solr Shard1
              M1:9983: Zookeeper Leader
              M1:8883/solr/: Solr Shard3-Replica
M2            M2:8983/solr/: Solr Shard2
              M2:9983: Zookeeper Follower
              M2:8883/solr/: Solr Shard1-Replica
M3            M3:8983/solr/: Solr Shard2-Replica
              M3:8883/solr/: Solr Shard3

* The follower/replica is decided automatically by Apache Zookeeper and Solr, by default.

A Solr core represents a single index together with the complete configuration (files such as solrconfig.xml, the schema, stop words, and other essentials) required to run it. In the preceding table, we can see a total of six Solr cores, with each machine running two different cores.

The organization and interaction between multiple Solr cores and Zookeeper can be seen in the following system context diagram:

[Figure: System context diagram of multiple Solr cores and Zookeeper in SolrCloud]

SolrCloud lets you create a cluster of Solr nodes, each of them running one or more collections. A collection holds one or more shards, which are hosted on one or more nodes (in the case of replication). Updates made to any node participating in SolrCloud are synced with the rest of the nodes. SolrCloud uses Apache Zookeeper to bring distributed coordination and configuration to multiple nodes, and this active syncing of indexes is what enables near real-time searching on SolrCloud. Apache Zookeeper loads all the configuration files of Apache Solr from the filesystem into its own repository and lets the nodes access them in a distributed manner. With this, even if an instance goes away, the configuration remains accessible to all the other nodes. When a new core is introduced into SolrCloud, it registers with a Zookeeper server by sharing information about the core. SolrCloud may run one or more collections.
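
As an example of how a client works against this setup, the following SolrJ sketch connects through the Zookeeper ensemble (rather than to any single Solr node) and creates a three-shard collection from a configset that is assumed to have already been uploaded to Zookeeper. The collection name, configset name, and ensemble addresses are placeholder assumptions, and the builder signature shown is the one from recent SolrJ releases.

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollectionSketch {
    public static void main(String[] args) throws Exception {
        // Talk to the cluster through Zookeeper; the addresses are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("m1:9983", "m2:9983"), Optional.empty()).build()) {

            // Create a collection with 3 shards and 1 replica per shard, using a
            // configset ("documents_config") assumed to be stored in Zookeeper.
            CollectionAdminRequest.createCollection("documents", "documents_config", 3, 1)
                    .process(client);
        }
    }
}

Because the configuration lives in Zookeeper, every node that hosts a shard of this collection reads the same solrconfig.xml and schema, even if the node that originally uploaded them goes away.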

SolrCloud distributes index updates to the appropriate shard, and it also takes care of distributing searches across multiple shards. Searching is possible in near real time, once a document is committed. Zookeeper provides load balancing and failover to the Solr cluster, making the overall setup more robust. Index partitioning can be done in the following ways using Apache Solr:

  • Simple: This is done using a hashing function over a fixed number of shards.
  • Prefix based: This involves partitioning based on a prefix in the document ID, for example, Red!12345 and White!22321, where Red and White are the prefixes used for partitioning (see the sketch after this list).
  • Custom: This is based on custom-defined partitioning, such as document creation time.
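
The sketch below shows what prefix-based (composite ID) routing looks like from a SolrJ client, again using the same placeholder collection and Zookeeper address and assuming the collection uses the default compositeId router: documents whose IDs share the Red! prefix are routed to the same shard.

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PrefixRoutingSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("m1:9983"), Optional.empty()).build()) {
            client.setDefaultCollection("documents");

            // The "Red!" prefix is hashed to pick the shard, so every document
            // whose ID starts with "Red!" lands on the same shard.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "Red!12345");
            doc.addField("text", "what are you eating");
            client.add(doc);
            client.commit();
        }
    }
}

At query time, the same prefix can be passed in the _route_ parameter so that only the shard holding the Red! documents is searched.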