Chapter 10. Cluster Monitoring

Once you have your HBase cluster up and running, it is essential to continuously ensure that it is operating as expected. This chapter explains how to monitor the status of the cluster with a variety of tools.

Introduction

Just as it is vital to monitor production systems, which typically expose a large number of metrics that provide details regarding their current status, it is vital that you monitor HBase.

HBase actually inherits its monitoring APIs from Hadoop. But while Hadoop is a batch-oriented system, and therefore often is not immediately user-facing, HBase is user-facing, as it serves random access requests to, for example, drive a website. The response times of these requests should stay within specific limits to guarantee a positive user experience—also commonly referred to as a service-level agreement (SLA).

With distributed systems the administrator is facing the difficult task of making sense of the overall status of the system, while looking at each server separately. And even with a single server system it is difficult to know what is going on when all you have to go by is a handful of raw logfiles. When disaster strikes it would be good to see where—and when—it all started. But digging through mega-, giga-, or even terabytes of text-based files to find the needle in the haystack, so to speak, is something only a few people have mastered. And even if you have mad log-reading skills, it will take time to draw and test hypotheses to eventually arrive at the cause of the disruption.

This is obviously not something new, and viable solutions have been around for years. These solutions fall into the groups of graphing and monitoring—with some tools covering only one of these groups, while others cover both. Graphing captures the exposed metrics of a system and displays them in visual charts, typically with a range of time filters—for example, daily, monthly, and yearly time frames. This is good, as it can quickly show you what your system has been doing lately—like they say, a picture speaks a thousand words.

The graphs are good for historical, quantitative data, but with a rather large time granularity it is also difficult to see what a system is doing right now. This is where qualitative data is needed, which is handled by the monitoring kind of support systems. They keep an ear out on your behalf to verify that each data point, or metric, exposed is within a specified range. Often, the support tools already supply a significant set of checks, so you only have to tweak them for your own purposes. Checks that are missing can be added in the form of plug-ins, or simple script-based extensions. You can also fine-tune how often the checks are run, which can range from seconds to days.

Whenever a check indicates a problem, or outright failure, evasive actions could be taken automatically: servers could be decommissioned, restarted, or otherwise repaired. When a problem persists there are rules to escalate the issue to, for example, the administrators to handle it manually. This could be done by sending out emails to various recipients, or SMS messages to telephones.

While there are many possible support systems you can choose from, the Java-based nature of HBase, and its affinity to Hadoop, narrow down your choices to a more limited set of systems, which also have been proven to work reliably in combination. For graphing, the system supported natively by HBase is Ganglia. For monitoring, you need a system that can handle the JMX[109]-based metrics API as exposed by the HBase processes. A common example in this category is Nagios.

Note

You should set up the complete support system framework that you want to use in production, even when prototyping a solution, or working on a proof-of-concept study based on HBase. That way you have a head start in making sense of the numbers and configuring the system checks accordingly. Using a cluster without monitoring and metrics is the same as driving a car while blindfolded.

It is great to run load tests against your HBase cluster, but you need to correlate the cluster’s performance with what the system is doing under the hood. Graphing the performance lets you line up events across machines and subsystems, which is invaluable when it comes to understanding test results.

The Metrics Framework

Every HBase process, including the master and region servers, exposes a specific set of metrics. These are subsequently made available to the various monitoring APIs and tools, including JMX and Ganglia. For each kind of server there are multiple groups of metrics, usually pertaining to a subsystem within each server. For example, one group of metrics is provided by the Java Virtual Machine (JVM) itself, giving insight into many interesting details of the current process, such as garbage collection statistics and memory usage.

Contexts, Records, and Metrics

HBase employs the Hadoop metrics framework, inheriting all of its classes and features. This framework is based on the MetricsContext interface to handle the generation of data points for monitoring and graphing. Here is a list of available implementations:

GangliaContext

Used to push metrics to Ganglia; see Ganglia for details.

FileContext

Writes the metrics to a file on disk.

TimeStampingFileContext

Also writes the metrics to a file on disk, but adds a timestamp prefix to each metric emitted. This results in a more log-like formatting inside the file.

CompositeContext

Allows you to emit metrics to more than one context. You can specify, for example, a Ganglia and file context at the same time.

NullContext

The Off switch for the metrics framework. When using this context, nothing is emitted, nor aggregated, at all.

NullContextWithUpdateThread

Does not emit any metrics, but starts the aggregation thread. This is needed when retrieving the metrics through JMX. See JMX for details.

Each context has a unique name, specified in the external configuration file (see HBase-related steps), which is also used to define various properties and the actual implementing class of the MetricsContext interface.

Note

Another artifact of HBase inheriting the metrics framework from Hadoop is that it uses the supplied ContextFactory, which loads the various context classes. The configuration filename is hardcoded in this class to hadoop-metrics.properties—which is the reason HBase uses the exact same filename as Hadoop, as opposed to the more intuitive hbase-metrics.properties you might have expected.

Multiple metrics are grouped into a MetricsRecord, which describes, for example, one specific subsystem. HBase uses these groups to keep the statistics for the master, region server, and so on. Each group also has a unique name, which is combined with the context and the actual metric name to form the fully qualified metric:

<context-name>.<record-name>.<metric-name>

The contexts have a built-in timer that triggers the push of the metrics on regular intervals to whatever the target is—which can be a file, Ganglia, or your own custom solution if you choose to build one. The configuration file enabling the context has a period property per context that is used to specify the interval period in seconds for the context to push its updates. Specific context implementations might have additional properties that control their behavior. Figure 10-1 shows a sequence diagram with all the involved classes.
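For illustration, a hadoop-metrics.properties fragment that writes the metrics of the hbase context to a timestamped file every 10 seconds could look like the following sketch. The class and property names, especially fileName, arity, and the subN entries for the composite variant, should be verified against the metrics packages shipped with your Hadoop and HBase versions:

# sketch only: emit the "hbase" context to a local file every 10 seconds
hbase.class=org.apache.hadoop.hbase.metrics.file.TimeStampingFileContext
hbase.period=10
hbase.fileName=/tmp/hbase-metrics.log

# alternatively, fan the same metrics out to two contexts at once
#hbase.class=org.apache.hadoop.metrics.spi.CompositeContext
#hbase.arity=2
#hbase.sub1.class=org.apache.hadoop.metrics.ganglia.GangliaContext
#hbase.sub2.class=org.apache.hadoop.metrics.file.FileContext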

Sequence diagram of the classes involved in preparing the metrics
Figure 10-1. Sequence diagram of the classes involved in preparing the metrics

The metrics are internally tracked by container classes, based on MetricsBase, which have various update and/or increment methods that are called when an event occurs. The framework, in turn, tracks the number of events for every known metric and correlates it to the time elapsed since it was last polled.

The following list summarizes the available metric types in the Hadoop and HBase metrics framework, associating abbreviations with each. These are referenced in the remainder of this chapter.

Integer value (IV)

Tracks an integer counter. The metric is only updated when the value changes.

Long value (LV)

Tracks a long counter. The metric is only updated when the value changes.

Rate (R)

A float value representing a rate, that is, the number of operations/events per second. It provides an increment method that is called to track the number of operations. It also has a last polled timestamp that is used to track the elapsed time. When the metric is polled, the following happens (see the sketch after this list):

  1. The rate is calculated as number of operations / elapsed time in seconds.

  2. The rate is stored in the previous value field.

  3. The internal counter is reset to zero.

  4. The last polled timestamp is set to the current time.

  5. The computed rate is returned to the caller.
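The following minimal Java sketch mirrors this polling behavior. The class and member names are illustrative only and do not correspond to the actual Hadoop metrics classes:

// Illustrative only: models how a rate (R) metric behaves between polls.
public class RateMetricSketch {
  private int operations;                               // events since the last poll
  private long lastPolled = System.currentTimeMillis(); // last polled timestamp
  private float previousValue;                          // rate computed at the last poll

  public synchronized void inc() {
    operations++;                                       // called once per tracked event
  }

  public synchronized float poll() {
    long now = System.currentTimeMillis();
    float elapsedSecs = Math.max((now - lastPolled) / 1000f, 0.001f);
    previousValue = operations / elapsedSecs;           // step 1: compute the rate
    operations = 0;                                     // step 3: reset the counter
    lastPolled = now;                                   // step 4: remember the poll time
    return previousValue;                               // steps 2 and 5: store and return
  }
}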

String (S)

A metric type for static, text-based information. It is used to report the HBase version number, build date, and so on. It is never reset nor changed—once set, it remains the same while the process is running.

Time varying integer (TVI)

A metric type in which the context keeps aggregating the value, making it a monotonically increasing counter. The metric has a simple increment method that is used by the framework to count various kinds of events. When the value is polled it returns the accrued integer value, and resets to zero, until it is polled again.

Time varying long (TVL)

Same as TVI, but operates on a long value for faster-incrementing counters that could otherwise exceed the maximum integer value. Also resets upon its retrieval.

Time varying rate (TVR)

Tracks the number of operations or events and the time they required to complete. This is used to compute the average time for an operation to finish. The metric also tracks the minimum and maximum time per operation observed. Table 10-1 shows how the values are exported under the same name, but with different postfixes.

The values in the Short column are postfixes that are attached to the actual metric name. For instance, when you retrieve the metric for the increment() calls, as provided by HTable, you will see four values, named incrementNumOps, incrementMinTime, incrementMaxTime, and incrementAvgTime.

This is not evident in all places, though. For example, the context-based metrics only expose the AvgTime and NumOps values, while JMX gives access to all four.

Note that the values for operation count and time accrued are reset once the metric is polled. The number of operations is aggregated by the polling context, though, making it a monotonically increasing counter. In contrast, the average time is set as an absolute value. It is computed when the metric is retrieved at the end of a polling interval.

The minimum and maximum observed times per operation are not reset and are kept until the resetMinMax() call is invoked. This can be done through JMX (see JMX), or, for some metrics, it is triggered implicitly by the extended period property.

Persistent time varying rate (PTVR)

An extension to the TVR. This metric adds the necessary support for the extended period metrics: since these long-running metrics are not reset for every poll they need to be reported differently.

Table 10-1. Values exposed by metrics based on time varying rate
Value name          Short     Description
Number Operations   NumOps    The actual number of events since the last poll.
Minimum Time        MinTime   The shortest time reported for an event to complete.
Maximum Time        MaxTime   The longest time reported for an event to complete.
Average Time        AvgTime   The average time for completing events; this is computed as the sum of the reported times per event, divided by the number of events.

When we subsequently discuss the different metrics provided by HBase you will find the type abbreviation next to it for reference, in case you are writing your own support tool. Keep in mind that these metrics behave differently when they are retrieved through a metrics context, or via JMX.

Some of the metrics—for example, the time varying ones—are reset once they are polled, but the containing context aggregates them as monotonically increasing counters. Accessing the same values through JMX will reveal their reset behavior, since JMX accesses the values directly, not through a metric context.

A prominent example is the NumOps component of a TVR metric. Reading it through a metric context gives you an ever increasing value, while JMX would only give you the absolute number of the last poll period.

Other metrics only emit data when the value has changed since the last update. Again, this is evident when using the contexts, but not when using JMX. The latter will simply retrieve the values from the last poll. If you do not set a poll period, the JMX values will never change. More on this in JMX. Figure 10-2 shows how, over each metric period, the different metric types are updated and emitted. JMX always accesses the raw metrics, which results in a different behavior compared to context-based aggregation.

Various metric types collected and (optionally) reset differently
Figure 10-2. Various metric types collected and (optionally) reset differently

HBase also has some exceptional rate metrics that span across specific time frames, overriding the usual update intervals.

Note

There are a few long-running processes in HBase that require some metrics to be kept until the process has completed. This is controlled by the hbase.extendedperiod property, specified in seconds. The default is no expiration, but the supplied configuration sets it to a moderate 3600 seconds, or one hour.

Currently, this extended period is applied to the time and size rate metrics for compactions, flushes, and splits for the region servers and master, respectively. On the region server it also triggers a reset of all other rate-based metrics, including the read, write, and sync latencies.

Master Metrics

The master process exposes all metrics relating to its role in a cluster. Since the master is relatively lightweight and only involved in a few cluster-wide operations, it exposes only a limited set of information (in comparison to the region server, for example). Table 10-2 lists them.

Table 10-2. Metrics exposed by the master
Metric                  Description
Cluster requests (R)    The total number of requests to the cluster, aggregated across all region servers
Split time (PTVR)       The time it took to split the write-ahead log files after a restart
Split size (PTVR)       The total size of the write-ahead log files that were split

Region Server Metrics

The region servers are part of the actual data read and write path, and therefore collect a substantial number of metrics. These include details about different parts of the overall architecture inside the server—for example, the block cache and in-memory store.

Instead of listing all possible metrics, we will discuss them in groups, since it is more important to understand their meaning than each separate data point. Within each group the meaning is quite obvious and needs only a few additional notes, if any.

Block cache metrics

The block cache holds the loaded storage blocks from the low-level HFiles, read from HDFS. Given that you have allowed for a block to be cached, it is kept in memory until there is no more room, at which point it is evicted.

The count (LV) metric reflects the number of blocks currently in the cache, while the size (LV) is the occupied Java heap space. The free (LV) metric is the remaining heap for the cache, and evicted (LV) counts the number of blocks that had to be removed because of heap size constraints.

The block cache keeps track of the cache hit (LV) and miss (LV) counts, as well as the hit ratio (IV), which is the number of cache hits in relation to the total number of requests to the cache.

Finally, the more ominous hit caching count is similar to the hit ratio, but only takes into account the requests and hits of operations that explicitly requested the block cache to be used (see, e.g., the setCacheBlocks() method in Single Gets).

Note

All read operations will try to use the cache, regardless of whether retaining the block in the cache has been requested. Use of setCacheBlocks() only influences the retention policy of the request.
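For example, the following sketch issues a single get that still consults the block cache, but asks for the loaded block not to be retained. The table name, row key, and class name are made up for illustration; the client API calls are those of the HBase version discussed in this book:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CacheBlocksExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");   // hypothetical table name
    Get get = new Get(Bytes.toBytes("row-1"));      // hypothetical row key
    // the read still checks the block cache, but the loaded block is not retained
    get.setCacheBlocks(false);
    Result result = table.get(get);
    System.out.println("Cells found: " + result.size());
    table.close();
  }
}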

Compaction metrics

When the region server has to perform the asynchronous (or manually invoked) housekeeping task of compacting the storage files, it reports its status in a different metric. The compaction size (PTVR) and compaction time (PTVR) give details regarding the total size (in bytes) of the storage files that have been compacted, and how long that operation took, respectively. Note that this is reported after a completed compaction run, because only then are both values known.

The compaction queue size (IV) can be used to check how many files a region server has queued up for compaction currently.

Note

The compaction queue size is another recommended early indicator of trouble that should be closely monitored. Usually the number is quite low, and varies between zero and somewhere in the low tens. When you have I/O issues, you usually see this number rise sharply. See Figure 10-5 for an example.

Keep in mind that major compactions will also cause a sharp rise as they queue up all storage files. You need to account for this when looking at the graphs.

Memstore metrics

Mutations are kept in the memstore on the region server, and will subsequently be written to disk via a flush. The memstore metrics expose the memstore size MB metric (IV), which is the total heap space occupied by all memstores for the server in megabytes. It is the sum of all memstores across all online regions.

The flush queue size (IV) is the number of enqueued regions that are being flushed next. The flush size (PTVR) and flush time (PTVR) give details regarding the total size (in bytes) of the memstore that has been flushed, and the time it took to do so, respectively.

Just as with the compaction metrics, these last two metrics are updated after the flush has completed. So the reported values slightly trail the actual values, as they do not include what is currently in progress.

Note

Similar to the compaction queue you will see a sharp rise in count for the flush queue when, for example, your servers are under I/O duress. Monitor the value to find the usual range—which should be a fairly low number as well—and set sensible limits to trigger warnings when it rises above these thresholds.

Store metrics

The store files (IV) metric states the total number of storage files, spread across all stores—and therefore regions—managed by the current server. The stores (IV) metric gives you the total number of stores for the server, across all regions it currently serves. The store file index size MB metric (IV) is the sum of the block index, and optional meta index, for all store files in megabytes.

I/O metrics

The region server keeps track of I/O performance with three latency metrics, all of them keeping their numbers in milliseconds. The fs read latency (TVR) reports the filesystem read latency—for example, the time it takes to load a block from the storage files. The fs write latency (TVR) is the same for write operations, but combined for all writers, including the storage files and write-ahead log.

Finally, the fs sync latency (TVR) measures the latency to sync the write-ahead log records to the filesystem. The latency metrics provide information about the low-level I/O performance and should be closely monitored.

Miscellaneous metrics

In addition to the preceding metrics, the region servers also provide global counters, exposed as metrics. The read request count (LV) and write request count (LV) report the total number of read (such as get()) and write (such as put()) operations, respectively, summed up for all online regions this server hosts.

The requests (R) metric is the actual request rate per second encountered since it was last polled. Finally, the regions (IV) metric gives the number of regions that are currently online and hosted by this region server.

RPC Metrics

Both the master and region servers also provide metrics from the RPC subsystem. The subsystem automatically tracks every operation possible between the different servers and clients. This includes the master RPCs, as well as those exposed by region servers.

The RPC metrics for the master and region servers are shared—in other words, you will see the same metrics exposed on either server type. The difference is that the servers update the metrics for the operations the process invokes. On the master, for example, you will not see updates to the metrics for increment() operations, since those are related to the region server. On the other hand, you do see all the metrics for all of the administrative calls, like enableTable or compactRegion.

Since the metrics relate directly to the client and administrative APIs, you can infer their meaning from the corresponding API calls. The naming is not completely consistent with the API, though, in order to avoid ambiguity. A notable pattern is the addition of the Region postfix to the region-related API calls—for example, the split() call provided by HBaseAdmin maps to the splitRegion metric. Only a handful of metrics have no API counterpart, and these are listed in Table 10-3. These are metrics provided by the RPC subsystem itself.
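As a quick illustration of this mapping, the following sketch triggers a split through the administrative API; on the servers the call is counted under the splitRegion metric. The table name and class name are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // the client-side call is named split(), but the RPC metric is splitRegion
    admin.split("testtable");   // hypothetical table name
  }
}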

Table 10-3. Non-API metrics exposed by the RPC subsystem
Metric               Description
RPC Processing Time  This is the time it took to process the RPCs on the server side. As this spans all possible RPC calls, it averages across them.
RPC Queue Time       Since RPC employs a queuing system that lines up calls to be processed, there might be a delay between the time the call arrived and when it is actually processed, which is the queue time.

Monitoring the queue time is a good idea, as it indicates the load on the server. You could use thresholds to trigger warnings if this number goes over a certain limit. These are early indicators of future problems.

The remaining metrics are from the RPC API between the master and the region servers, including regionServerStartup() and regionServerReport(). They are invoked when a region server initially reports for duty at its assigned master node, and for regular status reports, respectively.

JVM Metrics

When it comes to optimizing your HBase setup, tuning the JVM settings requires expert skills. You will learn how to do this in Garbage Collection Tuning. This section discusses what you can retrieve from each server process using the metrics framework. Every HBase process collects and exposes JVM-related details that are helpful to correlate, for example, server performance with underlying JVM internals. This information, in turn, is used when tuning your HBase cluster setup.

The provided metrics can be grouped into related categories:

Memory usage metrics

You can retrieve the used memory and the committed memory[110] in megabytes for both heap and nonheap usage. The former is the space that is maintained by the JVM on your behalf and garbage-collected at regular intervals. The latter is memory required for JVM internal purposes.

Garbage collection metrics

The JVM is maintaining the heap on your behalf by running garbage collections. The gc count metric is the number of garbage collections, and the gc time millis is the accumulated time spent in garbage collection since the last poll.

Certain steps in the garbage collection process cause so-called stop-the-world pauses, which are inherently difficult to handle when a system is bound by tight SLAs.

Usually these pauses are only a few milliseconds in length, but sometimes they can increase to multiple seconds. Problems arise when these pauses approach the multiminute range, because this can cause a region server to miss its ZooKeeper lease renewal—forcing the master to take evasive actions.[111]

Use the garbage collection metric to track what the server is currently doing and how long the collections take. As soon as you see a sharp increase, be prepared to investigate. Any pause that is greater than the zookeeper.session.timeout configuration value should be considered a fault.
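For reference, the session timeout is configured in hbase-site.xml. The value below is only an example and should be chosen in line with the garbage collection behavior you observe:

<property>
  <name>zookeeper.session.timeout</name>
  <!-- value is in milliseconds; a GC pause approaching it risks the
       region server being declared dead by the master -->
  <value>60000</value>
</property>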

Thread metrics

This group of metrics reports a variety of numbers related to Java threads. You can see the count for each possible thread state, including new, runnable, blocked, and so on.

System event metrics

Finally, the events group contains metrics that are collected from the logging subsystem, but are subsumed under the JVM metrics category (for lack of a better place). System event metrics provide counts for various log-level events. For example, the log error metric provides the number of log events that occurred on the error level, since the last time the metric was polled. In fact, all log event counters show you the counts accumulated during the last poll period.

Using these metrics, you are able to feed support systems that either graph the values over time, or trigger warnings based on definable thresholds. It is really important to understand the values and their usual ranges so that you can make use of them in production.

Info Metrics

The HBase processes also expose a group of metrics called info metrics. They contain rather fixed information about the processes, and are provided so that you can check these values in an automated fashion. Table 10-4 lists these metrics and provides a description of each. Note that these metrics are only accessible through JMX.

Table 10-4. Metrics exposed by the info group
Metric        Description
date          The date HBase was built
version       The HBase version
revision      The repository revision used for the build
url           The repository URL
user          The user that built HBase
hdfsDate      The date HDFS was built
hdfsVersion   The HDFS version currently in use
hdfsRevision  The repository revision used to build HDFS
hdfsUrl       The HDFS repository URL
hdfsUser      The user that built HDFS

HDFS refers to the hadoop-core-<X.Y-nnnn>.jar file that is currently in use by HBase. This usually is the supplied JAR file, but it could be a custom file, depending on your installation. The values returned could look like this:

date:Wed May 18 15:29:52 CEST 2011
version:0.91.0-SNAPSHOT
revision:1100427
url:https://svn.apache.org/repos/asf/hbase/trunk
user:larsgeorge

hdfsDate:Wed Feb  9 22:25:52 PST 2011
hdfsVersion:0.20-append-r1057313
hdfsRevision:1057313
hdfsUrl:http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append
hdfsUser:Stack

The values are obviously not useful for graphing, but they can be used by an administrator to verify the running configuration.

Ganglia

HBase inherits its native support for Ganglia[112] directly from Hadoop, providing a context that can push the metrics directly to it.

Note

As of this writing, HBase only supports the 3.0.x line of Ganglia versions. This is due to the changes in the network protocol used by the newer 3.1.x releases. The GangliaContext class is therefore not compatible with the 3.1.x Ganglia releases. This was addressed in HADOOP-4675 and committed in Hadoop 0.22.0. In other words, future versions of HBase will support the newly introduced GangliaContext31 and work with the newer Ganglia releases.

Advanced users also have the option to apply the patch themselves and replace the stock Hadoop JAR with their own. Some distributions for Hadoop—for example, CDH3 from Cloudera—have this patch already applied.

Ganglia consists of three components:

Ganglia monitoring daemon (gmond)

The monitoring daemon needs to run on every machine that is monitored. It collects the local data and prepares the statistics to be polled by other systems. It actively monitors the host for changes, which it will announce using uni- or multicast network messages. If configured in multicast mode, each monitoring daemon holds the complete cluster state of all servers sharing the same multicast address.

Ganglia meta daemon (gmetad)

The meta daemon is installed on a central node and acts as the federation node to the entire cluster. The meta daemon polls from one or more monitoring daemons to receive the current cluster status, and saves it in a round-robin, time-series database, using RRDtool.[113] The data is made available in XML format to other clients—for example, the web frontend.

Ganglia also supports a hierarchy of reporting daemons, where at each node of the hierarchy tree a meta daemon aggregates the results of its assigned monitoring daemons. The meta daemons on a higher level then aggregate the statistics for multiple clusters, polling the status from their assigned, lower-level meta daemons.

Ganglia PHP web frontend

The web frontend, supplied by Ganglia, retrieves the combined statistics from the meta daemon and presents them as HTML. It uses RRDtool to render the stored time-series data in graphs.

Installation

Ganglia setup requires two steps: first you need to set up and configure Ganglia itself, and then have HBase send the metrics to it.

Ganglia-related steps

You should try to install prebuilt binary packages for the operating system distribution of your choice. If this is not possible, you can download the source from the project website and build it locally. For example, on a Debian-based system you could perform the following steps.

Ganglia monitoring daemon

Perform the following on all nodes you want to monitor.

Add a dedicated user account:

$ sudo adduser --disabled-login --no-create-home ganglia

Download the source tarball from the website, and unpack it into a common location:

$ wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz
$ tar -xzvf ganglia-3.0.7.tar.gz -C /opt
$ rm ganglia-3.0.7.tar.gz

Install the dependencies:

$ sudo apt-get -y install build-essential libapr1-dev \
   libconfuse-dev libexpat1-dev python-dev

Now you can build and install the binaries like so:

$ cd /opt/ganglia-3.0.7
$ ./configure
$ make
$ sudo make install

The next step is to set up the configuration. This can be fast-tracked by generating a default file:

$ gmond --default_config > /etc/gmond.conf

Change the following in the /etc/gmond.conf file:

globals {
  user = ganglia
}

cluster {
  name = HBase
  owner = "Foo Company"
  url = "http://foo.com/"
}

The globals section defines the user account created earlier. The cluster section defines details about your cluster. By default, Ganglia is configured to use multicast UDP messages with the IP address 239.2.11.71 to communicate—which is a good choice for clusters of fewer than ~120 nodes.
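If multicast is not an option on your network, gmond can be switched to unicast instead. The following fragment is only a sketch; the hostname is an example and the exact directives should be verified against your Ganglia 3.0.x documentation:

udp_send_channel {
  # send all metrics to one dedicated gmond instead of the multicast group
  host = host0.foo.com
  port = 8649
}

udp_recv_channel {
  # only the dedicated gmond needs to listen for the incoming metrics
  port = 8649
}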

Start the monitoring daemon with:

$ sudo gmond

Note

Test the daemon by connecting to it locally:

$ nc localhost 8649

This should print out the raw XML based cluster status. Stopping the daemon is accomplished by using the kill command.

Ganglia meta daemon

Perform the following on all nodes you want to use as meta daemon servers, aggregating the downstream monitoring statistics. Usually this is only one machine for clusters less than 100 nodes. Note that the server has to create the graphs, and therefore needs some decent processing capabilities.

Add a dedicated user account:

$ sudo adduser --disabled-login --no-create-home ganglia

Download the source tarball from the website, and unpack it into a common location:

$ wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz
$ tar -xzvf ganglia-3.0.7.tar.gz -C /opt
$ rm ganglia-3.0.7.tar.gz

Install the dependencies:

$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev \
   libexpat1-dev python-dev librrd2-dev

Now you can build and install the binaries like so:

$ cd /opt/ganglia-3.0.7
$ ./configure --with-gmetad
$ make
$ sudo make install

Note the extra --with-gmetad, which is required to build the binary we will need. The next step is to set up the configuration, copying the supplied default gmetad.conf file like so:

$ cp /opt/ganglia-3.0.7/gmetad/gmetad.conf /etc/gmetad.conf

Change the following in /etc/gmetad.conf:

setuid_username "ganglia"
data_source "HBase" host0.foo.com
gridname "<Your-Grid-Name>"

The data_source line must contain the hostname or IP address of one or more gmonds.

Note

When you are using unicast mode you need to point your data_source to the server that acts as the dedicated gmond server. If you have more than one, you can list them all, which adds failover safety.
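For example, a data_source line polling two dedicated gmond servers for the same cluster could look like this (the hostnames are placeholders):

data_source "HBase" host0.foo.com:8649 host1.foo.com:8649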

Now create the required directories. These are used to store the collected data in round-robin databases.

$ mkdir -p /var/lib/ganglia/rrds/
$ chown -R ganglia:ganglia /var/lib/ganglia/

Now start the daemon:

$ gmetad

Stopping the daemon requires the use of the kill command.

Ganglia web frontend

The last part of the setup concerns the web-based frontend. A common scenario is to install it on the same machine that runs the gmetad process. At a minimum, it needs to have access to the round-robin, time-series database created by gmetad.

First install the required libraries:

$ sudo apt-get -y install rrdtool apache2 php5-mysql libapache2-mod-php5 php5-gd

Ganglia comes fully equipped with all the required PHP files. You can copy them in place like so:

$ cp -r /opt/ganglia-3.0.7/web /var/www/ganglia

Now restart Apache:

$ sudo /etc/init.d/apache2 restart

You should now be able to browse the web frontend using http://ganglia.foo.com/ganglia—assuming you have pointed the ganglia subdomain name to the host running gmetad first. You will only see the basic graph of the servers, since you still need to set up HBase to push its metrics to Ganglia, which is discussed next.

HBase-related steps

The central part of HBase and Ganglia integration is provided by the GangliaContext class, which sends the metrics collected in each server process to the Ganglia monitoring daemons. In addition, there is the hadoop-metrics.properties configuration file, located in the conf/ directory, which needs to be amended to enable the context. Edit the file like so:

# HBase-specific configuration to reset long-running stats 
# (e.g. compactions). If this variable is left out, then the default 
# is no expiration.
hbase.extendedperiod = 3600

# Configuration of the "hbase" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
#hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=239.2.11.71:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
#jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
#rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=239.2.11.71:8649

Note

I mentioned that HBase currently (as of version 0.91.x) only supports Ganglia 3.0.x, so why is there a choice between GangliaContext and GangliaContext31? Some repackaged versions of HBase already include patches to support Ganglia 3.1.x. Use this context only if you are certain that your version of HBase supports it (CDH3 does, for example).

When you are using unicast messages, the 239.2.11.71 default multicast address needs to be changed to the dedicated gmond hostname or IP address. For example:

...
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
hbase.period=10
hbase.servers=host0.yourcompany.com:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=host0.yourcompany.com:8649

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.period=10
rpc.servers=host0.yourcompany.com:8649

Once you have edited the configuration file you need to restart the HBase cluster processes. No further changes are required. Ganglia will automatically pick up all the metrics.

Usage

Once you refresh the web-based UI frontend you should see the Ganglia home page, shown in Figure 10-3.

The Ganglia web-based frontend that gives access to all graphs
Figure 10-3. The Ganglia web-based frontend that gives access to all graphs

You can change the metric, time span, and sorting on that page; it will reload automatically. On an underpowered machine, you might have to wait a little bit for all the graphs to be rendered. Figure 10-4 shows the drop-down selection for the available metrics.

The drop-down box that provides access to the list of metrics
Figure 10-4. The drop-down box that provides access to the list of metrics

Finally, Figure 10-5 shows an example of how the metrics can be correlated to find root causes of problems. The graphs show how, at around midnight, the garbage collection time sharply rose for a heavily loaded server. This caused the compaction queue to increase significantly as well.

Note

It seems obvious that write-heavy loads cause a lot of I/O churn, but keep in mind that you can see the same behavior (though not as often) for more read-heavy access patterns. For example, major compactions that run in the background could have accrued many storage files that all have to be rewritten. This can have an adverse effect on read latencies without an explicit write load from the clients.

Ganglia and its graphs are a great tool to go back in time and find what caused a problem. However, they are only helpful when dealing with quantitative data—for example, for performing postmortem analysis of a cluster problem. In the next section, you will see how to complement the graphing with a qualitative support system.

Graphs that can help align problems with related events
Figure 10-5. Graphs that can help align problems with related events

JMX

The Java Management Extensions technology is the standard for Java applications to export their status. In addition to what we have discussed so far regarding Ganglia and the metrics context, JMX also has the ability to provide operations. These allow you to remotely trigger functionality on any JMX-enabled Java process.

Before you can access HBase processes using JMX, you need to enable it. This is accomplished in the $HBASE_HOME/conf/hbase-env.sh configuration file by uncommenting—and amending—the following lines:

# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to 
# configure remote password access. More details at:
# http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
#
export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false"
export HBASE_MASTER_OPTS="$HBASE_JMX_BASE \
  -Dcom.sun.management.jmxremote.port=10101"
export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE \
  -Dcom.sun.management.jmxremote.port=10102"
export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE \
  -Dcom.sun.management.jmxremote.port=10103"
export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE \
  -Dcom.sun.management.jmxremote.port=10104"

This enables JMX with remote access support, but with no security credentials. It is assumed that, in most cases, the HBase cluster servers are not accessible outside a firewall anyway, and therefore no authentication is needed. You can enable authentication if you want to, which makes the setup only slightly more complex.[114] You also need to restart HBase for these changes to become active.

When a server starts, it not only registers its metrics with the appropriate context, it also exports them as so-called JMX attributes. I mentioned already that when you want to use JMX to access the metrics, you need to at least enable the NullContextWithUpdateThread with an appropriate value for period—for example, a minimal hadoop-metrics.properties file could contain:

hbase.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
hbase.period=60

jvm.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
jvm.period=60

rpc.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
rpc.period=60

This would ensure that all metrics are updated every 60 seconds, matching the configured period, and therefore would be retrievable as JMX attributes. Failing to do so would render all JMX attributes useless. You could still use the JMX operations, though. Obviously, if you already have another context enabled—for example, the GangliaContext—this is adequate.

JMX uses the notion of managed beans, or MBeans, which expose a specific set of attributes and operations. There is a loose overlap between the metric context, as provided by the metrics framework, and the MBeans exposed over JMX. These MBeans are addressed in the form:

hadoop:service=<service-name>,name=<mbean-name>

The following MBeans are provided by the various HBase processes:

hadoop:service=Master,name=MasterStatistics

Provides access to the master metrics, as described in Master Metrics.

hadoop:service=RegionServer,name=RegionServerStatistics

Provides access to the region server metrics, as described in Region Server Metrics.

hadoop:service=HBase,name=RPCStatistics- <port>

Provides access to the RPC metrics, as described in RPC Metrics. Note that the port in the name is dynamic and may change when you reconfigure where the master, or region server, binds to.

hadoop:service=HBase,name=Info

Provides access to the info metrics, as described in Info Metrics.

The MasterStatistics, RegionServerStatistics, and RPCStatistics MBeans also provide one operation: resetAllMinMax. Use this operation to reset the minimum and maximum observed completion times of the time varying rate (TVR) metrics.

You have a few options to access the JMX attributes and operations, two of which are described next.

JConsole

Java ships with a helper application called JConsole, which can be used to connect to local and remote Java processes. Given that you have the $JAVA_HOME directory in your search path, you can start it like so:

$ jconsole

Once the application opens, it shows you a dialog that lets you choose whether to connect to a local or a remote process. Figure 10-6 shows the dialog.

Connecting to local or remote processes when JConsole starts
Figure 10-6. Connecting to local or remote processes when JConsole starts

Since you have configured all HBase processes to listen to specific ports, it is advisable to use those and treat them as remote processes—one advantage is that you can reconnect to a server, even when the process ID has changed. With the local connection method this is not possible, as it is ultimately bound to said ID.

Connecting to a remote HBase process is accomplished by using JMX Service URLs, which follow this format:

service:jmx:rmi:///jndi/rmi://<server-address>:<port>/jmxrmi

This uses the Java Naming and Directory Interface (JNDI) registry to look up the required details. Adjust the <port> to the process you want to connect to. In some cases, you may have multiple Java processes running on the same physical machine—for example, the Hadoop name node and the HBase Master—so that each of them requires a unique port assignment. See the hbase-env.sh file contents shown earlier, which sets a port for every process. The master, for example, listens on port 10101, the region server on port 10102, and so on. Since you can only run one region server per physical machine, it is valid to use the same port for all of them, as in this case, the <server-address>—which is the hostname or IP address—changes to form a unique address:port pair.

Once you connect to the process, you will see a tabbed window with various details in it. Figure 10-7 shows the initial screen after you have connected to a process. The constantly updated graphs are especially useful for seeing what a server is currently up to.

The JConsole application, which provides insight into a running Java process
Figure 10-7. The JConsole application, which provides insight into a running Java process

Figure 10-8 is a screenshot of the MBeans tab that allows you to access the attributes and operations exposed by the registered managed beans. Here you see the compactionQueueSize metric.

See the official documentation for all the possible options, and an explanation of each tab with its content.

The MBeans tab, from which you can access any HBase process metric.
Figure 10-8. The MBeans tab, from which you can access any HBase process metric.

JMX Remote API

Another way to get the same information is the JMX Remote API, using remote method invocation or RMI.[115] Many tools are available that implement a client to access the remote managed Java processes. Even the Hadoop project is working on adding some basic support for it.[116]
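Before turning to the toolkit, the following sketch shows the same kind of access using nothing but the standard javax.management classes that ship with Java. The hostname and port are examples and have to match the hbase-env.sh settings shown earlier:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxQueryExample {
  public static void main(String[] args) throws Exception {
    // example address: a region server with JMX enabled on port 10102
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://regionserver1.foo.com:10102/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url, null);
    try {
      MBeanServerConnection mbsc = connector.getMBeanServerConnection();
      ObjectName name = new ObjectName(
          "hadoop:service=RegionServer,name=RegionServerStatistics");
      // read a single attribute ...
      Object queueSize = mbsc.getAttribute(name, "compactionQueueSize");
      System.out.println("compactionQueueSize: " + queueSize);
      // ... and invoke the provided operation to reset the min/max times
      mbsc.invoke(name, "resetAllMinMax", new Object[0], new String[0]);
    } finally {
      connector.close();
    }
  }
}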

As an example, we are going to use the JMXToolkit, also available in source code online (https://github.com/larsgeorge/jmxtoolkit). You will need the git command-line tools, and Apache Ant. Clone the repository and build the tool:

$ git clone git://github.com/larsgeorge/jmxtoolkit.git
Initialized empty Git repository in jmxtoolkit/.git/
...
$ cd jmxtoolkit
$ ant
Buildfile: jmxtoolkit/build.xml
...
jar:
      [jar] Building jar: /private/tmp/jmxtoolkit/build/hbase-jmxtoolkit.jar

BUILD SUCCESSFUL
Total time: 2 seconds

After the building process is complete (and successful), you can see the provided options by invoking the -h switch like so:

$ java -cp build/hbase-jmxtoolkit.jar \
  org.apache.hadoop.hbase.jmxtoolkit.JMXToolkit -h

Usage: JMXToolkit [-a <action>] [-c <user>] [-p <password>] 
  [-u url] [-f <config>] [-o <object>] [-e regexp] 
  [-i <extends>] [-q <attr-oper>] [-w <check>] 
  [-m <message>] [-x] [-l] [-v] [-h]

  -a <action>     Action to perform, can be one of the following 
                        (default: query)

          create  Scan a JMX object for available attributes
          query   Query a set of attributes from the given objects
          check   Checks a given value to be in a valid range (see -w below)
          encode  Helps creating the encoded messages (see -m and -w below)
          walk    Walk the entire remote object list
...
  -h              Prints this help

You can use the JMXToolkit to walk, or print, the entire collection of available attributes and operations. You do have to know the exact names of the MBean and the attribute or operation you want to get. Since this is not an easy task, because you do not have this list yet, it makes sense to set up a basic configuration file that will help in subsequently retrieving the full list. Create a properties file with the following content:

$ vim hbase.properties
$ cat hbase.properties 
; HBase Master
[hbaseMasterStatistics]
@object=hadoop:name=MasterStatistics,service=Master
@url=service:jmx:rmi:///jndi/rmi://${HOSTNAME1|localhost}:10101/jmxrmi
@user=${USER|controlRole}
@password=${PASSWORD|password}
[hbaseRPCMaster]
@object=hadoop:name=RPCStatistics-60000,service=HBase
@url=service:jmx:rmi:///jndi/rmi://${HOSTNAME1|localhost}:10101/jmxrmi
@user=${USER|controlRole}
@password=${PASSWORD|password}

; HBase RegionServer
[hbaseRegionServerStatistics]
@object=hadoop:name=RegionServerStatistics,service=RegionServer
@url=service:jmx:rmi:///jndi/rmi://${HOSTNAME2|localhost}:10102/jmxrmi
@user=${USER|controlRole}
@password=${PASSWORD|password}
[hbaseRPCRegionServer]
@object=hadoop:name=RPCStatistics-60020,service=HBase
@url=service:jmx:rmi:///jndi/rmi://${HOSTNAME2|localhost}:10102/jmxrmi
@user=${USER|controlRole}
@password=${PASSWORD|password}

; HBase Info 
[hbaseInfo]
@object=hadoop:name=Info,service=HBase
@url=service:jmx:rmi:///jndi/rmi://${HOSTNAME1|localhost}:10101/jmxrmi
@user=${USER|controlRole}
@password=${PASSWORD|password}

; EOF

This configuration can be fed into the tool to retrieve all the attributes and operations of the listed MBeans. The result is saved in myjmx.properties:

$ java -cp build/hbase-jmxtoolkit.jar \
  org.apache.hadoop.hbase.jmxtoolkit.JMXToolkit \
  -f hbase.properties -a create -x > myjmx.properties

$ cat myjmx.properties 
[hbaseMasterStatistics]
@object=hadoop:name=MasterStatistics,service=Master
@url=service:jmx:rmi:///jndi/rmi://${HOSTNAME1|localhost}:10101/jmxrmi
@user=${USER|controlRole}
@password=${PASSWORD|password}
splitTimeNumOps=INTEGER
splitTimeAvgTime=LONG
splitTimeMinTime=LONG
splitTimeMaxTime=LONG
splitSizeNumOps=INTEGER
splitSizeAvgTime=LONG
splitSizeMinTime=LONG
splitSizeMaxTime=LONG
cluster_requests=FLOAT
*resetAllMinMax=VOID
...

Note

These commands assume you are running them against a pseudodistributed, local HBase instance. When you need to run them against a remote set of servers, simply set the variables included in the template properties file. For example, adding the following lines to the earlier command will specify the hostnames (or IP addresses) for the master and a slave node:

-DHOSTNAME1=master.foo.com -DHOSTNAME2=slave1.foo.com

When you look into the newly created myjmx.properties file you will see all the metrics you have seen already. The operations are prefixed with a * (i.e., the star character).

You can now start requesting metric values on the command line using the toolkit and the populated properties file. The first query is for an attribute value, while the second is triggering an operation (which in this case does not return a value):

$ java -cp build/hbase-jmxtoolkit.jar \
   org.apache.hadoop.hbase.jmxtoolkit.JMXToolkit \
   -f myjmx.properties -o hbaseRegionServerStatistics -q compactionQueueSize
compactionQueueSize:0

$ java -cp build/hbase-jmxtoolkit.jar \
   org.apache.hadoop.hbase.jmxtoolkit.JMXToolkit \
   -f myjmx.properties -o hbaseRegionServerStatistics -q *resetAllMinMax

Once you have created the properties files, you can retrieve a single value, all values of an entire MBean, trigger operations, and so on. The toolkit is great for quickly scanning a managed process and documenting all the available information, thereby taking the guesswork out of querying JMX MBeans.

Nagios

Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds. Once a threshold is exceeded, it starts evasive actions, ranging from sending out emails or SMS messages to telephones, all the way to triggering scripts, or even physically rebooting the server when necessary.

Typical checks in Nagios are either the supplied ones, those added as plug-ins, or custom scripts that have to return a specific exit code and print the outcome to the standard output. Integrating Nagios with HBase is typically done using JMX. There are many choices for doing so, including the already discussed JMXToolkit.

The advantage of JMXToolkit is that once you have built your properties file with all the attributes and operations in it, you can add Nagios thresholds to it. (You can also use a different monitoring tool if you’d like, so long as it uses the same exit code and/or standard output message approach as Nagios.) These checks are subsequently executed, and changing a check to, for example, use different threshold values is just a matter of editing the properties file. For example:

attributeXYZ=INTEGER|0:OK%3A%20%7B0%7D|2:WARN%3A%20%7B0%7D:80:<|1:FAILED%3A%20%7B0%7D:95:<
*operationABC=FLOAT|0|2::0.1:>=|1::0.5:>

You can follow the same steps described earlier in the Cacti install. You can then wire the Nagios checks to the supplied JMXToolkit script. If you have checks defined in the properties file, you only specify the object and attribute or operation to query. If not, you can specify the check within Nagios like so:

$ bin/jmxtknagios-hbase.sh host0.foo.com hbaseRegionServerStatistics \
   compactionQueueSize "0:OK%3A%20%7B0%7D|2:WARN%3A%20%7B0%7D:10:>=|1:FAIL%3A%20%7B0%7D:100:>"
OK: 0

Note that JMXToolkit also comes with an action to encode text into the appropriate format.
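To give an idea of the wiring on the Nagios side, the following object definitions are a sketch only. The script path and host are placeholders, and the threshold definition is assumed to live in the JMXToolkit properties file, so the check only passes the object and attribute names:

# the path to the toolkit wrapper script is an example
define command {
  command_name  check_hbase_jmx
  command_line  /opt/jmxtoolkit/bin/jmxtknagios-hbase.sh $HOSTADDRESS$ $ARG1$ $ARG2$
}

# the check thresholds are assumed to be defined in the JMXToolkit properties file
define service {
  use                  generic-service
  host_name            host0.foo.com
  service_description  HBase compaction queue
  check_command        check_hbase_jmx!hbaseRegionServerStatistics!compactionQueueSize
}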

Obviously, using JMXToolkit is only one of many choices. The crucial point, though, is that monitoring and graphing are essential to not only maintain a cluster, but also be able to track down issues much more easily. It is highly recommended that you implement both monitoring and graphing early in your project. It is also vital that you test your system with a load that reflects your real workload, because then you can become familiar with the graphs, and how to read them. Set thresholds and find sensible upper and lower limits—it may save you a lot of grief when going into production later on.



[109] JMX is an acronym for Java Management Extensions, a Java-based technology that helps in building solutions to monitor and manage applications. See the project’s website for more details, and JMX.

[110] See the official documentation on MemoryUsage for details on what used versus committed memory means.

[111] The HBase development team has affectionately dubbed this scenario a Juliet Pause—the master (Romeo) presumes the region server (Juliet) is dead when it’s really just sleeping, and thus takes some drastic action (recovery). When the server wakes up, it sees that a great mistake has been made and takes its own life. Makes for a good play, but a pretty awful failure scenario! (http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/)

[112] Ganglia is a distributed, scalable monitoring system suitable for large cluster systems. See its project website for more details on its history and goals.

[113] See the RRDtool project website for details.

[114] The HBase metrics page has information on how to add the password and access credentials files.

[115] See the official documentation for details.

[116] See HADOOP-4756 for details.

[117] As of this writing, the templates are slightly outdated, but should work for newer versions of HBase.
