Chapter 17. Timeouts and Garbage Collection

Garbage collection (GC) is the process that reclaims memory occupied by Java objects that are no longer referenced. This is done to free up memory and make it available again to the application. Garbage collections are a normal, unavoidable occurrence in Java, but they can be managed. Depending on the garbage collection algorithm used, there might be one or more garbage collection events occurring at any given time. The most impactful garbage collection event for an HBase server is when the Java Virtual Machine needs to perform a full garbage collection (full GC). Such an operation is known as a “stop-the-world pause” and will pause the JVM while it cleans enough memory to allocate the required objects. During this operation, the currently running threads are in a holding pattern until the JVM completes its GC. Long-running GC pauses are usually the main cause of timeouts on an HBase server, but we will also look at a few other risks.

Consequences

As stated before, full GCs are “stop-the-world pauses,” meaning any operation running on an HBase server will be paused until the GC completes. These pauses can manifest themselves externally as large, visible spikes in read and write latencies, or even, in the worst cases, as false server failures. Other issues can produce similar performance spikes on the RegionServer, so the source of a spike should always be clearly identified before starting to treat GC issues. While the JVM is paused performing the garbage collection cleanup, all read and write operations are queued by the client, which must wait for the server to become responsive again. Garbage collection should always complete before the configured timeout values are reached. Once it completes, the queued reads and writes resume and are processed by the server. However, if the Java tunings are not properly configured, the garbage collection can last longer than the configured timeouts. A few different things can happen when garbage collection exceeds the configured timeout values. If the timeout occurs on the client side before the operations were processed by the server, the client receives an exception and the operation is not processed:

Tue Mar 17 12:23:03 EDT 2015, null, java.net.SocketTimeoutException:
callTimeout=1, callDuration=2307: row '' on table 'TestTable' at
region=TestTable,,1426607781080.c1adf2b53088bef7db148b893c2fd4da.,
hostname=t430s,45896,1426609305849, seqNum=1276
    at o.a.h.a.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException
    (RpcRetryingCallerWithReadReplicas.java:264)
    at o.a.h.a.client.ScannerCallableWithReplicas.call
    (ScannerCallableWithReplicas.java:199)
    at o.a.h.a.client.ScannerCallableWithReplicas.call
    (ScannerCallableWithReplicas.java:56)
    at o.a.h.a.client.RpcRetryingCaller.callWithoutRetries
    (RpcRetryingCaller.java:200)
    at o.a.h.a.client.ClientScanner.call(ClientScanner.java:287)
    at o.a.h.a.client.ClientScanner.next(ClientScanner.java:367)
    at DeleteGetAndTest.main(DeleteGetAndTest.java:110)
Caused by: java.net.SocketTimeoutException: callTimeout=1, callDuration=2307:
row '' on table 'TestTable' at region=TestTable,,1426607781080.
c1adf2b53088bef7db148b893c2fd4da., hostname=t430s,45896,
1426609305849, seqNum=1276
    at o.a.h.a.client.RpcRetryingCaller.callWithRetries
    (RpcRetryingCaller.java:159)
    at o.a.h.a.client.ScannerCallableWithReplicas$RetryingRPC.call
    (ScannerCallableWithReplicas.java:294)
    at o.a.h.a.client.ScannerCallableWithReplicas$RetryingRPC.call
    (ScannerCallableWithReplicas.java:275)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker
    (ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run
    (ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to t430s/10.32.0.23:45896 failed on local
exception: o.a.h.a.ipc.CallTimeoutException: Call id=2, waitTime=2001,
operationTimeout=2000 expired.
    at o.a.h.a.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1235)
    at o.a.h.a.ipc.RpcClientImpl.call(RpcClientImpl.java:1203)
    at o.a.h.a.ipc.AbstractRpcClient.callBlockingMethod
    (AbstractRpcClient.java:216)
    at o.a.h.a.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.
    callBlockingMethod(AbstractRpcClient.java:300)
    at o.a.h.a.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan
    (ClientProtos.java:31751)
    at o.a.h.a.client.ScannerCallable.call(ScannerCallable.java:199)
    at o.a.h.a.client.ScannerCallable.call(ScannerCallable.java:62)
    at o.a.h.a.client.RpcRetryingCaller.callWithRetries
    (RpcRetryingCaller.java:126)
    ... 6 more
Caused by: o.a.h.a.ipc.CallTimeoutException: Call id=2, waitTime=2001,
operationTimeout=2000 expired.
    at o.a.h.a.ipc.Call.checkAndSetTimeout(Call.java:70)
    at o.a.h.a.ipc.RpcClientImpl.call(RpcClientImpl.java:1177)
    ... 12 more

Before timing out, operations will be retried a few times with an increasing delay between each of the retries.

The timeout can also occur on the server side, which can cause false RegionServer failures. From time to time, HBase RegionServers need to report back to ZooKeeper to confirm they are still alive. By default, they must do so every 40 seconds when using an external ZooKeeper, or every 90 seconds when HBase manages the ZooKeeper service itself. Timeouts and retries can be configured using multiple parameters such as zookeeper.session.timeout, hbase.rpc.timeout, hbase.client.retries.number, and hbase.client.pause.
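
As an illustration only, the client-side parameters listed above can be set on the Configuration used to create the connection. The property names are the standard HBase client settings, but the values shown here are placeholders to be tuned against your longest expected GC pause, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClientTimeoutSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Illustrative values only; tune them to your workload and GC behavior.
    conf.setInt("hbase.rpc.timeout", 60000);         // maximum duration of a single RPC (ms)
    conf.setInt("hbase.client.retries.number", 10);  // number of retries before giving up
    conf.setLong("hbase.client.pause", 100);         // base sleep between retries (ms)
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Any Table obtained from this connection inherits the timeouts above.
    }
  }
}

The same properties can also be placed in the hbase-site.xml on the client classpath; values set programmatically after HBaseConfiguration.create() take precedence over the file.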

When a server is too slow to report, it can miss the heartbeat and report to ZooKeeper too late. When a RegionServer misses a heartbeat, it is considered lost by ZooKeeper and is reported to the HBase Master, because it is not possible to determine whether the RegionServer is simply responding too slowly or has actually crashed. Assuming a crash, the HBase Master reassigns all the regions previously assigned to that server to the other RegionServers. While reassigning the regions, the previously pending operations are replayed to guarantee consistency. When the slow RegionServer recovers from the pause and finally reports to ZooKeeper, it is informed that it has been considered dead, throws a YouAreDeadException, and terminates itself.

Here’s an example of a YouAreDeadException caused by a long server pause:

2015-03-18 12:29:32,664 WARN  [t430s,16201,1426696058785-HeapMemoryTunerChore]
      util.Sleeper: We slept 109666ms instead of 60000ms, this is likely due
      to a long garbage collecting pause and it's usually bad, see
      http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-03-18 12:29:32,682 FATAL [regionserver/t430s/10.32.0.23:16201] regionserver.
      HRegionServer: ABORTING RegionServer t430s,16201,1426696058785:
      o.a.h.h.YouAreDeadException: Server REPORT rejected; currently
      processing t430s,16201,1426696058785 as dead server
    at o.a.h.h.m.ServerManager.checkIsDead(ServerManager.java:382)
    at o.a.h.h.m.ServerManager.regionServerReport(ServerManager.java:287)
    at o.a.h.h.m.MasterRpcServices.regionServerReport(MasterRpcServices.java:278)
    at o.a.h.h.p.g.RegionServerStatusProtos$RegionServerStatusService$2.
               callBlockingMethod(RegionServerStatusProtos.java:7912)
    at o.a.h.h.i.RpcServer.call(RpcServer.java:2031)
    at o.a.h.h.i.CallRunner.run(CallRunner.java:107)
    at o.a.h.h.i.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at o.a.h.h.ic.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)

Causes

When dealing with Java-based platforms, memory fragmentation is typically the main cause of full garbage collections and pauses. In a perfect world, all objects would be the same size, and it would be very easy to track, find, and allocate free memory without causing heavy memory fragmentation. Unfortunately, in the real world this is not the case: the puts received by the servers are almost never the same size, and the compressed HFile blocks and many other HBase objects are almost never the same size either. These factors fragment the available memory. The more the JVM memory is fragmented, the more often it will need to run garbage collection to release unused memory. Typically, the larger the heap, the longer a full GC will take to complete. This means raising the heap too high without proper tuning (and sometimes even with it) can lead to timeouts.

Garbage collection is not the only contributing factor to timeouts and pauses. If memory is overallocated (giving HBase or other applications more memory than the available physical memory), the operating system will need to swap some memory pages to disk. This is one of the least desirable situations for HBase. Swapping is guaranteed to cause issues with the cluster, which in turn will create timeouts and drastically impact latencies.

The final risk is improper hardware behavior or outright failure. Network failures and hard drive failures are good examples of such risks.

Storage Failure

For multiple reasons, a storage drive can fail and become unresponsive. When this occurs, the operating system will retry multiple times before reporting the error to the application. This can create delays and general slowness, which again will impact latency and might trigger timeouts or failures. HDFS is designed to handle such failures and no data will be lost; however, a failure of the operating system drive may impact the availability of the entire host.

Power-Saving Features

To reduce energy consumption, some disks have energy-saving features. When such a feature is activated and a disk is not accessed for a certain period of time, it goes to sleep and stops spinning. As soon as the system needs data stored on that disk, it first has to spin the disk back up, wait for it to reach full speed, and only then start reading from it. When HBase makes a request to the I/O layer, the total time it takes energy-saving drives to spin back up can contribute to timeouts and latency spikes. In certain circumstances, those timeouts can even cause server failures.

Network Failure

On big clusters, servers are grouped into racks, connected together using top-of-rack switches. Such switches are powerful and come with many features, but like any other hardware they can experience small failures and pause because of them. Those pauses will cause communication between the servers to fail for a certain period of time. Here again, if this period of time extends beyond the configured timeout, HBase will consider those servers lost and will kill them if they come back online, with all the impact on latencies that you can imagine.

Solutions

When a server crashes or has been shut down because of a timeout issue, the only solution is to restart it. The source of such a failure can be a network, configuration, or hardware issue (or something else entirely). Whatever the source is, it needs to be identified and addressed to avoid similar issues in the future. If the cause of a failure is not addressed, the chances of it occurring again on your cluster are very high and other failures are to be expected. The best place to start identifying issues is the server logs, both on the HBase Master and on the culprit RegionServer; however, it is also good to look at the system logs (/var/log/messages), the network interface statistics (in particular, the dropped packets), and the switch monitoring interface. It is also a very good practice to validate your network bandwidth between multiple nodes at the same time, across different racks, using tools like iperf. Testing network performance and stability will allow you to detect bottlenecks as well as misconfigured interfaces (for example, an interface negotiated at 1 Gbps instead of 10 Gbps).
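
As a minimal sketch (the hostname is a placeholder), iperf can be run as a server on one node and as a client on another to measure the effective bandwidth across racks:

# On the receiving node (for example, a server in rack 1):
iperf -s

# On the sending node (for example, a server in rack 2), run 4 parallel
# streams for 30 seconds against the receiver:
iperf -c rack1-node.example.com -P 4 -t 30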

Hardware duplication and new generation network management tools like Arista products can help to avoid network and hardware issues and will be looked at in the following section.

Last, Cloudera Manager and other cluster management applications can automatically restart services when they exit (RegionServers, Masters, ThriftServers, etc.). However, it might be better not to activate such features, as you want to identify the root cause of a failure before restarting the service.

There are multiple ways to prevent such situations from occurring, and the next section will review them one by one.

Prevention

As we have just seen, timeouts and garbage collection can occur for multiple different reasons. In this section, we identify what those situations might be, and provide ways to prevent them from happening.

Reduce Heap Size

The bigger the heap size, the longer it takes for the JVM to perform a garbage collection (GC). Even though more memory is nice for increasing the cache size and improving latency and performance, it also exposes you to long GC pauses. With the JVM's default garbage collection algorithms, a full garbage collection freezes the application and is highly likely to cause a ZooKeeper timeout, which will result in the RegionServer being killed. There is very little risk of facing such GC pauses on the HBase Master server. Using the default HBase settings, it is recommended to keep the HBase heap size under 20 GB. Garbage collection can be configured inside the conf/hbase-env.sh script. The default setting is to use the Java CMS (Concurrent Mark Sweep) algorithm: export HBASE_OPTS="-XX:+UseConcMarkSweepGC".
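
As a minimal sketch of what this looks like in conf/hbase-env.sh (the heap value below is illustrative, not a recommendation; HBASE_HEAPSIZE is expressed in megabytes in older releases):

# Illustrative conf/hbase-env.sh excerpt: keep the heap modest to bound full GC time.
export HBASE_HEAPSIZE=8192                       # heap size, here 8 GB expressed in MB
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"      # CMS, the default in the example config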

Additional debugging information for the GC process is available in the configuration file and is ready to be enabled. In the following example, we are enabling only one of these options (we recommend taking a look at the configuration file to understand all of the available options):

# This enables basic GC logging to its own file with automatic log rolling.
# Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in
# the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
# -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1
# -XX:GCLogFileSize=512M"

Starting with Oracle JDK 1.7, the JVM comes with a new garbage collection algorithm called G1, for “Garbage First.” Details about G1 can be found in “Using the G1GC Algorithm”.

Off-Heap BlockCache

HBase 1.0 and later allow you to configure an off-heap BlockCache using what is called the BucketCache. Using off-heap memory allows HBase to manage the fragmentation on its own. Because all HBase blocks are supposed to be the same size (unless specifically defined differently at the table level), it is easy to manage the related memory fragmentation. This reduces heap memory usage and fragmentation and, therefore, the required garbage collection. When using an off-heap cache, you can reduce the default heap size, which reduces the time required for the JVM to run a full garbage collection. Also, if HBase is handling a mixed load of reads and writes, because only the BlockCache is moved off heap, you can shrink the on-heap BlockCache to give more memory to the memstore using the hbase.regionserver.global.memstore.size and hfile.block.cache.size properties. When servers have a limited amount of memory, it is also possible to configure the BucketCache to use specific devices, such as SSD drives, as the storage for the cache. A minimal configuration sketch follows.
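
The exact sizing depends on your hardware, so treat the following hbase-site.xml excerpt purely as an illustrative sketch; the off-heap allocation itself must also be permitted on the JVM side (for example, through HBASE_OFFHEAPSIZE in conf/hbase-env.sh):

<!-- Illustrative off-heap BucketCache settings; sizes are placeholders. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- Off-heap cache capacity, expressed here in MB (4 GB). -->
  <value>4096</value>
</property>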

Using the G1GC Algorithm

The default JVM garbage collection algorithm in Java 6, 7, and 8 tends to struggle with full GCs on large heaps. These full GCs pause the JVM until memory is cleaned, and as seen in the previous section, this can negatively impact server availability. Java 1.7u60+ is the first release in which the G1GC algorithm is considered stable. G1GC stands for “garbage first” garbage collection. G1 breaks memory into 1–32 MB regions, depending on the heap size. When running a cleanup, it attempts to collect the regions containing the most garbage first, freeing up more memory without having to perform long stop-the-world pauses. G1GC also heavily leverages mixed collections that occur on the fly, again helping to avoid the stop-the-world pauses that can be detrimental to HBase latency.

Tuning the G1GC collector can be daunting and requires some trial and error. Once properly configured, we have used the G1GC collector with heaps as large as 176 GB. Heaps that large are beneficial to more than just read performance; if you are looking strictly for read performance, off-heap caching will offer the performance bump without the tuning headaches. A benefit of having large heaps with G1GC is that it allows HBase to scale better horizontally by offering a large memstore pool for regions to share.

Must-use parameters

  • -XX:+UseG1GC

  • -XX:+PrintFlagsFinal

  • -XX:+PrintGCDetails

  • -XX:+PrintGCDateStamps

  • -XX:+PrintGCTimeStamps

  • -XX:+PrintAdaptiveSizePolicy

  • -XX:+PrintReferenceGC

Additional HBase settings when using heaps exceeding 100 GB

-XX:-ResizePLAB

Disables resizing of promotion local allocation buffers (PLABs). This avoids the large communication cost among GC threads, as well as size variations from one GC to the next.

-XX:+ParallelRefProcEnabled

When this flag is turned on, the GC uses multiple threads to process the growing number of references during young and mixed GCs.

-XX:+AlwaysPreTouch

Pre-touch the Java heap during JVM initialization. That means every page of the heap is demand-zeroed during initialization rather than incrementally during application execution.

-XX:MaxGCPauseMillis=100

Max pause will attempt to keep GC pauses below the referenced time. This becomes critical for proper tuning. G1GC will do its best to hold to the number, but will pause longer if necessary.

Other interesting parameters

-XX:ParallelGCThreads=X

Formula: 8 + (logical processors – 8) × 5/8. For example, a machine with 40 logical processors would use 8 + (32 × 5/8) = 28 parallel GC threads.

-XX:G1NewSizePercent=X

Eden size is calculated by (heap size * G1NewSizePercent). The default value of G1NewSizePercent is 5 (5% of total heap). Lowering this can change the total time young GC pauses take to complete.

-XX:+UnlockExperimentalVMOptions

Unlocks the use of experimental flags such as G1NewSizePercent and G1MixedGCLiveThresholdPercent, referenced in this list.

-XX:G1HeapWastePercent=5

Sets the percentage of heap that you are willing to waste. The Java HotSpot VM does not initiate the mixed garbage collection cycle when the reclaimable percentage is less than the heap waste percentage. The default is 10%.

-XX:G1MixedGCLiveThresholdPercent=75

Sets the occupancy threshold for an old region to be included in a mixed garbage collection cycle. The default occupancy is 65%. This is an experimental flag.

The primary goal of this tuning is to run many mixed GC collections and avoid full GCs. When properly tuned, you should see very repeatable GC behavior, assuming the workload stays relatively uniform. It is important to note that overprovisioning the heap can help avoid full GCs by leaving headroom for the mixed GCs to continue in the background. For further reading, check out Oracle’s comprehensive documentation.
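
To tie the flags together, here is an illustrative sketch of what the RegionServer JVM options might look like in conf/hbase-env.sh; the heap size, pause target, and flag combination are placeholders to be adjusted through the trial-and-error tuning described above, not a recommended configuration:

# Illustrative G1GC starting point (adjust heap size and pause target to your workload):
export HBASE_REGIONSERVER_OPTS="-Xms32g -Xmx32g \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=100 \
  -XX:+ParallelRefProcEnabled \
  -XX:+PrintFlagsFinal -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC"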

Configure Swappiness to 0 or 1

HBase might not be the only application running in your environment, and at some point memory might have been overallocated. By default, Linux will try to anticipate a lack of memory and will start to swap some memory pages to disk. This allows the operating system to avoid running out of memory; however, writing such pages to disk takes time and will, here again, impact latency. When a lot of memory needs to be written to disk this way, a significant pause can occur in the operating system, which might cause the HBase server to miss the ZooKeeper heartbeat and die. To avoid this, it is recommended to reduce the swappiness to its minimum. Depending on the version of your kernel, this value will need to be set to 0 or 1; starting with kernel version 3.5-rc1, vm.swappiness=1 behaves like vm.swappiness=0 did in earlier kernels. It is also important not to overcommit memory: for each of the applications running on your server, note the allocated memory, including off-heap usage (if any), and sum the values. This total needs to stay under the available physical memory, after subtracting some memory reserved for the operating system.
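
As a minimal sketch, the setting is usually checked and applied on Linux as follows (run as root; the value 1 assumes a kernel of 3.5-rc1 or newer):

# Check the current value
cat /proc/sys/vm/swappiness

# Apply the new value to the running system
sysctl -w vm.swappiness=1

# Make it persistent across reboots
echo "vm.swappiness=1" >> /etc/sysctl.conf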

Disable Environment-Friendly Features

As we discussed earlier in this chapter, environment-friendly features such as power saving can impact your server’s performance and stability. It is recommended to disable all of those features. They can usually be disabled from the BIOS; however, sometimes (as with some disks) it is simply not possible to deactivate them. If that’s the case, it is recommended to replace those disks. In a production environment, you will need to be very careful when ordering hardware, to avoid anything that includes such features.

Hardware Duplication

It is a common practice to duplicate hardware to avoid downtime. Primary examples include running two OS drives in RAID 1, leveraging dual power supplies, and bonding network interfaces in a failover setup. These duplications remove numerous single points of failure, creating a cluster with better uptime and less operational maintenance. The same can be applied to the network switches and all the other hardware present in the cluster.
