Chapter 7. ZooKeeper in Action

In the last six chapters, we learned the essentials of Apache ZooKeeper, which has given us a firm grasp of its background concepts, usage, and administration. We read about community projects such as Curator, which makes programming with ZooKeeper much simpler and also adds more functionality on top of it. We saw how Exhibitor makes managing ZooKeeper instances easier in a production environment. ZooKeeper is one of the most successful projects of the Apache Software Foundation, and it has gained wide adoption by other projects and organizations in their production platforms.

In this chapter, we are going to look at how ZooKeeper is used by software projects to meet their distributed coordination needs. We will particularly look at the following projects that use and depend on ZooKeeper for some of their functionality:

  • Apache BookKeeper
  • Apache Hadoop
  • Apache HBase
  • Apache Helix
  • OpenStack Nova

We will also look at how organizations such as Yahoo!, Netflix, Facebook, Twitter, and eBay use ZooKeeper in their production platforms to achieve distributed coordination and synchronization. In addition to this, we will learn how companies such as Nutanix and VMware leverage ZooKeeper in their enterprise-grade compute and storage appliances.

Note

The use of ZooKeeper by various organizations has been compiled from blogs and tech notes available in the public domain. The author or the publisher of this book bears no responsibility for the authenticity of the information.

Projects powered by ZooKeeper

In this section, we will delve into the details of how various open source software projects use Apache ZooKeeper to implement some of their functionality. As it's beyond the scope of this book to give a full functional description of the individual projects, readers are advised to go through the respective project pages to learn more about their design and architectural details.

Apache BookKeeper

Apache BookKeeper (http://zookeeper.apache.org/bookkeeper/) is a subproject of ZooKeeper. BookKeeper is a highly available and reliable distributed logging service. Hedwig is a topic-based distributed publish/subscribe system built on BookKeeper. In this section, we will take a sneak peek at Apache BookKeeper.

BookKeeper can be used to reliably log streams of records. It achieves high availability through replication. Applications that need to log operations or transactions reliably, so that crash recovery can be performed in case of failure, can use BookKeeper. It is highly scalable, fault-tolerant, and performant. In a nutshell, BookKeeper comprises the following components:

  • Ledger: A ledger is a stream of log entries, each of which is a sequence of bytes. Entries are written sequentially to a ledger with append-only semantics, following the write-ahead logging (WAL) protocol.
  • BookKeeper client: A BookKeeper client creates ledgers. It runs on the same machine as the application and enables the application to write to the ledgers.
  • Bookie: Bookies are BookKeeper storage servers that store and manage the ledgers.
  • Metadata storage service: The information related to ledgers and bookies is stored in this service.

BookKeeper uses ZooKeeper for its metadata storage service. Whenever the application creates a ledger with the BookKeeper client, the metadata about the ledger is stored in the metadata storage service backed by a ZooKeeper instance. Clients use ZooKeeper coordination to ensure that only a single client is writing to a ledger. The writer has to close the ledger before any other client issues a read operation on that ledger. BookKeeper ensures that after the ledger has been closed, all other clients see the same content when reading from it. The closing of a ledger is done by creating a close znode for the ledger, and the use of ZooKeeper prevents any race conditions.
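
To give a feel for how an application talks to BookKeeper (and, indirectly, to ZooKeeper as the metadata store), here is a minimal Java sketch that writes a few entries to a ledger, closes it, and reads it back. It assumes a running BookKeeper ensemble whose metadata is kept in a ZooKeeper instance at localhost:2181; the digest type, password, and entry contents are illustrative only:

    import java.util.Enumeration;

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.LedgerEntry;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class BookKeeperSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the BookKeeper ensemble; ledger metadata lives in ZooKeeper
            BookKeeper bk = new BookKeeper("localhost:2181");

            // Creating a ledger also creates its metadata in ZooKeeper
            LedgerHandle writer = bk.createLedger(BookKeeper.DigestType.MAC,
                                                  "secret".getBytes());
            long ledgerId = writer.getId();
            for (int i = 0; i < 3; i++) {
                writer.addEntry(("entry-" + i).getBytes());
            }
            // Closing the ledger records the close marker; from now on every
            // reader sees the same, immutable contents
            writer.close();

            // Re-open the closed ledger and read all confirmed entries back
            LedgerHandle reader = bk.openLedger(ledgerId, BookKeeper.DigestType.MAC,
                                                "secret".getBytes());
            Enumeration<LedgerEntry> entries =
                    reader.readEntries(0, reader.getLastAddConfirmed());
            while (entries.hasMoreElements()) {
                System.out.println(new String(entries.nextElement().getEntry()));
            }
            reader.close();
            bk.close();
        }
    }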

Apache Hadoop

The Apache Hadoop project (http://hadoop.apache.org/) is an umbrella of related projects, and ZooKeeper itself started out as one of its subprojects.

Apache Hadoop is a distributed framework that processes large datasets using clusters that span a large number of nodes. It uses simple programming models such as MapReduce to process the datasets, and it scales up to thousands of servers. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer.

The project includes the following components:

  • Hadoop Common: This includes the libraries and utilities used by other Hadoop modules
  • Hadoop Distributed File System (HDFS): This is a scalable and fault-tolerant distributed filesystem that stores large datasets and provides high-throughput access to them
  • Hadoop YARN (Yet Another Resource Negotiator): YARN is a distributed framework that provides job scheduling and cluster resource management
  • Hadoop MapReduce: This is a YARN-based software framework that performs parallel processing of large datasets

Apache ZooKeeper is used in Hadoop to implement high availability in YARN and HDFS.

YARN (http://bit.ly/1xDJG8r) is the next-generation compute and resource-management framework in Apache Hadoop. It consists of a global ResourceManager (RM) and a per-application ApplicationMaster:

  • The RM mediates resource allocation among all the applications in the system. It has two components: the Scheduler and the ApplicationsManager. The Scheduler is responsible for allocating resources to the applications, and the ApplicationsManager manages job- and application-specific details. The RM uses a per-node daemon called the NodeManager to monitor resource usage (CPU, memory, disk, and network) and report it back to the RM.
  • The ApplicationMaster is a library that is specific to every application. It runs along with the application in the Hadoop cluster, negotiates resources with the RM, and works with the NodeManager(s) to execute and monitor the tasks.

The RM coordinates the running tasks in a YARN cluster. However, in a Hadoop cluster, only one instance of the RM runs, making it a single point of failure. A ZooKeeper-based solution provides high availability (HA) for the RM, allowing a failover of the RM to another machine when the active instance crashes.

The solution works by storing the current internal state of the RM in ZooKeeper. Since ZooKeeper itself is a highly available store for small amounts of data, it makes the RM state highly available too. Whenever an RM resumes after a restart or a crash, it loads its internal state from ZooKeeper.

An extension to this solution that provides failover capability is to run multiple RMs, of which one takes the active role and the others remain standbys. When the active RM goes down, a leader election can be performed with ZooKeeper to elect a new active RM. The use of ZooKeeper also prevents more than one node from claiming the active role (fencing).
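
The failover described above essentially boils down to a leader election built on ZooKeeper's ephemeral znodes. The following simplified Java sketch illustrates the idea with the plain ZooKeeper API: each RM candidate tries to create the same ephemeral znode, the one that succeeds becomes active, and the others watch the znode and contend again when it disappears. The znode path, host name, and session timeout are arbitrary values chosen for illustration; the actual YARN implementation uses its own elector and fencing logic:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class RmLeaderElectionSketch implements Watcher {
        // Illustrative znode path; assumes the parent /rm-election already exists
        private static final String LOCK_PATH = "/rm-election/active";
        private final ZooKeeper zk;

        public RmLeaderElectionSketch(String connectString) throws Exception {
            zk = new ZooKeeper(connectString, 15000, this);
        }

        // Try to become the active RM; otherwise stand by and watch the active one
        public void tryToBecomeActive() throws KeeperException, InterruptedException {
            try {
                // An ephemeral znode disappears automatically if our session dies
                zk.create(LOCK_PATH, "rm-host-1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("Became the active ResourceManager");
            } catch (KeeperException.NodeExistsException e) {
                // Another RM is active; watch its znode to learn when it goes away
                Stat active = zk.exists(LOCK_PATH, true);
                if (active == null) {
                    tryToBecomeActive();   // the active died between create() and exists()
                } else {
                    System.out.println("Running as a standby ResourceManager");
                }
            }
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDeleted
                    && LOCK_PATH.equals(event.getPath())) {
                try {
                    tryToBecomeActive();   // the previous active is gone; contend again
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }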

Tip

More details on YARN HA with ZooKeeper can be found on the blog written by Karthik Kambatla, Wing Yew Poon, and Vikram Srivastava at http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/.

ZooKeeper is also used in Hadoop to achieve high availability for HDFS. The metadata node, or NameNode (NN), which stores the metadata information of the whole filesystem, is a single point of failure (SPOF) in HDFS. The NN being a SPOF was a problem until Hadoop 1.x. However, in Hadoop 2.x, an automatic failover mechanism is available in HDFS that provides a fast failover of the NN role from the active node to another node in the event of a crash.

The problem is solved in a manner similar to the approach used for the YARN RM. Multiple NNs are set up, of which only one assumes the active role while the others remain in standby mode. All client filesystem operations go to the active NN in the cluster, while the standbys act as slaves. Each standby NN maintains enough state about the filesystem namespace to provide a fast failover. Each of the NNs (active as well as standby) runs a ZKFailoverController (ZKFC). The ZKFC maintains a heartbeat with the ZooKeeper service. The ZKFC on the active NN holds a special "lock" znode, implemented as an ephemeral znode, in the ZooKeeper tree. In the event of a failure of the current active NN, its session with the ZooKeeper service expires, triggering an election for the next active NN. One of the standby NNs wins the election and acquires the active NN role.
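
From an operator's point of view, automatic NN failover is switched on through the Hadoop configuration files, where the ZooKeeper ensemble used by the ZKFCs is also specified. The following excerpt is only a minimal illustration; the nameservice name and host names are placeholders:

    <!-- hdfs-site.xml (excerpt) -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml (excerpt): the ZooKeeper ensemble used by the ZKFCs -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>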

Apache HBase

Apache HBase (http://hbase.apache.org/) is a distributed, scalable, big data store. It's a non-relational database on top of HDFS.

In the HBase architecture, there is a master server called HMaster and a number of slave servers called RegionServers. The HMaster monitors the RegionServers, which store and manage the regions. Regions are contiguous ranges of rows stored together. The data is stored in persistent storage files called HFiles.

HBase uses ZooKeeper for distributed coordination. Every RegionServer creates its own ephemeral znode in ZooKeeper, which the HMaster uses to discover available servers. HBase also uses ZooKeeper to ensure that only one HMaster is running and to store the location of the root region for region discovery. ZooKeeper is an essential component in HBase, without which HBase can't operate.
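
The following Java sketch illustrates the kind of discovery the HMaster performs: it lists the children of the znode under which RegionServers register their ephemeral znodes and leaves a watch so that it is notified when a server joins or disappears. The path /hbase/rs matches HBase's default layout, but treat the whole snippet as an illustration of the mechanism rather than HBase's actual implementation:

    import java.util.List;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class RegionServerTracker {
        // Default parent znode under which RegionServers register ephemeral znodes
        private static final String RS_PARENT = "/hbase/rs";

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // NodeChildrenChanged fires when a RegionServer joins or dies
                    System.out.println("Cluster membership event: " + event);
                }
            });

            // List the live RegionServers and leave a watch for membership changes
            List<String> liveServers = zk.getChildren(RS_PARENT, true);
            System.out.println("Live RegionServers: " + liveServers);

            Thread.sleep(60000);   // keep the session open to receive watch events
            zk.close();
        }
    }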

Tip

For more details on HBase architecture, refer to http://hbase.apache.org/book.html.

Apache Helix

Apache Helix (http://helix.apache.org/) is a cluster management framework. It provides a generic way of automatically managing the resources in a cluster. Helix acts as a decision subsystem for the cluster and is responsible for the following tasks, among many others:

  • Automating the reassignment of resources to the nodes
  • Handling node failure detection and recovery
  • Dynamic cluster reconfiguration (node and resource addition/deletion)
  • Scheduling of maintenance tasks (backups, index rebuilds, and so on)
  • Maintaining load balancing and flow control in the cluster

In order to store the current cluster state, Helix needs a distributed and highly available configuration or cluster metadata store, for which it uses ZooKeeper.

ZooKeeper provides Helix with the following capabilities:

  • A PERSISTENT state representation, which remains until it is explicitly removed
  • A TRANSIENT/EPHEMERAL state representation, which goes away when the process that created it leaves the cluster
  • Notifications to the subsystem whenever the PERSISTENT or EPHEMERAL state of the cluster changes

Helix also allows simple lookups of task assignments through the configuration store built on top of ZooKeeper, which lets clients look up where tasks are currently assigned. In this way, Helix can also provide a service discovery registry.
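
The PERSISTENT and EPHEMERAL states listed above map directly onto ZooKeeper's znode types, and the change notifications map onto ZooKeeper watches. The following sketch uses the plain ZooKeeper Java API, rather than Helix's own abstractions, to show these three primitives side by side; the znode paths and data are illustrative and assumed not to exist yet:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class HelixStylePrimitives {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

            // PERSISTENT znodes: they remain until explicitly removed
            zk.create("/cluster", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/cluster/config", "idealstate".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // EPHEMERAL znode: removed automatically when this session ends,
            // that is, when the process leaves the cluster
            zk.create("/cluster/live-node1", "node1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // Notification: a watch on the persistent znode fires when it changes
            // (in a real subsystem the session would stay open to receive the event)
            zk.getData("/cluster/config",
                       event -> System.out.println("Cluster state changed: " + event),
                       null);

            zk.close();
        }
    }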

OpenStack Nova

OpenStack (http://www.openstack.org/) is an open source software stack for the creation and management of private and public clouds. It is designed to manage pools of compute, storage, and networking resources in data centers, allowing these resources to be managed through a consolidated dashboard and flexible APIs.

Nova is the software component in OpenStack that is responsible for managing the compute resources on which virtual machines (VMs) are hosted in a cloud computing environment. It is also known as the OpenStack Compute Service. OpenStack Nova provides a cloud computing fabric controller that supports a wide variety of virtualization technologies such as KVM, Xen, VMware, and many more. In addition to its native API, it also includes compatibility with the Amazon EC2 and S3 APIs.

For its proper operation, Nova depends on up-to-date information about the availability of the various compute nodes and the services that run on them. For example, the virtual machine placement operation requires knowledge of the currently available compute nodes and their current state.

Nova uses ZooKeeper to implement an efficient membership service, which monitors the availability of registered services. This is done through the ZooKeeper ServiceGroup driver, which works by using ZooKeeper's ephemeral znodes. Each service registers itself by creating an ephemeral znode on startup. When the service dies, ZooKeeper automatically deletes the corresponding ephemeral znode, and the removal of this znode can be used to trigger the corresponding recovery logic.

For example, when a compute node crashes, the nova-compute service running on that node also dies. This causes its session with the ZooKeeper service to expire, and as a result, ZooKeeper deletes the ephemeral znode created by the nova-compute service. If the cloud controller keeps a watch on this node-deletion event, it learns about the compute node crash and can trigger a migration procedure to evacuate all the VMs running on the failed compute node to other nodes. This way, high availability of the VMs can be ensured in real time.
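
Although Nova itself is written in Python, the pattern it relies on is easy to demonstrate with ZooKeeper's Java client API: a service registers an ephemeral znode, and a controller watches that znode and reacts to its NodeDeleted event. The znode path and the recovery action in the following sketch are purely illustrative:

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ComputeNodeMonitor {
        public static void main(String[] args) throws Exception {
            // Hypothetical znode that a nova-compute service registers on startup
            final String servicePath = "/servicegroups/compute/node-1";

            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    // The ephemeral znode vanished: the compute node's session expired
                    System.out.println("Compute node failed: " + event.getPath());
                    // Here the cloud controller would trigger VM evacuation/migration
                }
            });

            // Watch the service's ephemeral znode; when the service dies, ZooKeeper
            // deletes the znode and the watcher above receives a NodeDeleted event
            if (zk.exists(servicePath, true) == null) {
                System.out.println("Service is not currently registered");
            }

            Thread.sleep(Long.MAX_VALUE);   // keep watching (sketch only)
        }
    }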

ZooKeeper is also being considered for the following use cases in OpenStack Nova:

  • Storing the configuration metadata (nova.conf)
  • Maintaining high availability of services and automatic failover using leader election