Chapter 1. Introducing HBase

 

This chapter covers

  • The origins of Hadoop, HBase, and NoSQL
  • Common use cases for HBase
  • A basic HBase installation
  • Storing and querying data with HBase

 

HBase is a database: the Hadoop database. It’s often described as a sparse, distributed, persistent, multidimensional sorted map, which is indexed by rowkey, column key, and timestamp. You’ll hear people refer to it as a key-value store, a column-family-oriented database, and sometimes a database storing versioned maps of maps. All these descriptions are correct. But fundamentally, it’s a platform for storing and retrieving data with random access, meaning you can write data as you like and read it back again as you need it. HBase stores structured and semistructured data naturally, so you can load it with tweets and parsed log files and a catalog of all your products right along with their customer reviews. It can store unstructured data too, as long as it’s not too large. It doesn’t care about types and allows for a dynamic and flexible data model that doesn’t constrain the kind of data you store.
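The “sorted map indexed by rowkey, column key, and timestamp” description is easier to grasp with a sketch. The following is a conceptual model only, not the real HBase client API: a table modeled as nested maps, where each cell keeps multiple timestamped versions and rows are read back in sorted rowkey order.

```python
from collections import defaultdict

class SortedMapTable:
    """Toy model of HBase's data model: rowkey -> column -> timestamp -> value."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, rowkey, column, value, timestamp):
        # Writes never overwrite; each carries a version (timestamp).
        self.rows[rowkey][column][timestamp] = value

    def get(self, rowkey, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = self.rows.get(rowkey, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Yield rows in sorted rowkey order, as a scan would."""
        for rowkey in sorted(self.rows):
            yield rowkey, self.rows[rowkey]

table = SortedMapTable()
table.put("row1", "info:name", "old value", timestamp=100)
table.put("row1", "info:name", "new value", timestamp=200)
# get returns the latest version: "new value"
```

Note how nothing constrains what a value is or which columns a row has; that flexibility is exactly the “sparse” and “schemaless” quality described above.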

HBase isn’t a relational database like the ones to which you’re likely accustomed. It doesn’t speak SQL or enforce relationships within your data. It doesn’t allow interrow transactions, and it doesn’t mind storing an integer in one row and a string in another for the same column.

HBase is designed to run on a cluster of computers instead of a single computer. The cluster can be built using commodity hardware; HBase scales horizontally as you add more machines to the cluster. Each node in the cluster provides a bit of storage, a bit of cache, and a bit of computation as well. This makes HBase incredibly flexible and forgiving. No node is unique, so if one of those machines breaks down, you simply replace it with another. This adds up to a powerful, scalable approach to data that, until now, hasn’t been commonly available to mere mortals.

 

Join the community

Unfortunately, no official public numbers specify the largest HBase clusters running in production. This kind of information easily falls under the realm of business confidential and isn’t often shared. For now, the curious must rely on footnotes in publications, bullets in presentations, and the friendly, unofficial chatter you’ll find at user groups, meet-ups, and conferences.

So participate! It’s good for you, and it’s how we became involved as well. HBase is an open source project in an extremely specialized space. It has well-financed competition from some of the largest software companies on the planet. It’s the community that created HBase, and it’s the community that keeps it competitive and innovative. Plus, it’s an intelligent, friendly group. The best way to get started is to join the mailing lists.[1] You can follow the features, enhancements, and bugs currently being worked on at the JIRA site.[2] It’s open source and collaborative, and users like yourself drive the project’s direction and development.

1 HBase project mailing lists: http://hbase.apache.org/mail-lists.html.

2 HBase JIRA site: https://issues.apache.org/jira/browse/HBASE.

Step up, say hello, and tell them we sent you!

 

Given that HBase has a different design and different goals from traditional database systems, building applications with HBase involves a different approach as well. This book is geared toward teaching you how to effectively use the features HBase offers when building applications that must work with large amounts of data. Before you set out on the journey of learning how to use HBase, let’s get some historical perspective on how HBase came into being and the motivations behind it. We’ll then touch on use cases people have successfully solved using HBase. If you’re like us, you’ll want to play with HBase before going much further. We’ll wrap up by walking through installing HBase on your laptop, tossing in some data, and pulling it out. Context is important, so let’s start at the beginning.

1.1. Data-management systems: a crash course

Relational database systems have been around for a few decades and have been hugely successful in solving data storage, serving, and processing problems over the years. Many large companies have built their systems on relational databases, from online transactional systems to back-end analytics applications.

Online transaction processing (OLTP) systems are used by applications to record transactional information in real time. They’re expected to return responses quickly, typically in milliseconds. For instance, the cash registers in retail stores record purchases and payments in real time as customers make them. Banks have large OLTP systems that they use to record transactions between users, such as transferring funds. OLTP systems aren’t limited to money transactions. Web companies like LinkedIn also have such applications—for instance, when users connect with other users. The term transaction in OLTP refers to transactions in the context of databases, not financial transactions.

Online analytical processing (OLAP) systems are used to answer analytical queries about the data stored in them. In the context of retailers, this means systems that generate daily, weekly, and monthly sales reports and slice and dice the information to allow analysis from several different perspectives. OLAP falls in the domain of business intelligence, where data is explored, processed, and analyzed to glean information that can drive business decisions. For a company like LinkedIn, where establishing connections counts as transactions, analyzing the connectedness of the graph and generating reports on things like the average number of connections per user falls in the category of business intelligence; this kind of processing would likely be done using OLAP systems.

Relational databases, both open source and proprietary, have been successfully used at scale to solve both these kinds of use cases. This is clearly highlighted by the balance sheets of companies like Oracle, Vertica, Teradata, and others. Microsoft and IBM have their share of the pie too. All such systems provide full ACID[3] guarantees. Some scale better than others; some are open source, and others require you to pay steep licensing fees.

3 For those who don’t know (or don’t remember), ACID is an acronym standing for atomicity, consistency, isolation, and durability. These are fundamental principles used to reason about data systems. See http://en.wikipedia.org/wiki/ACID for an introduction.

The internal design of relational databases is driven by relational math, and these systems require an up-front definition of schemas and types that the data will thereafter adhere to. Over time, SQL became the standard way of interacting with these systems, and it has been widely used for several years. SQL is arguably a lot easier to write and takes far less time than coding up custom access code in programming languages. But it might not be the best way to express the access patterns in every situation, and that’s where issues like object-relational mismatch arose.

Any problem in computer science can be solved with another level of indirection. Solving problems like object-relational mismatch was no different and led to frameworks being built to alleviate the pain.

1.1.1. Hello, Big Data

Let’s take a closer look at the term Big Data. To be honest, it’s become something of a loaded term, especially now that enterprise marketing engines have gotten hold of it. We’ll keep this discussion as grounded as possible.

What is Big Data? Several definitions are floating around, and we don’t believe that any of them explains the term clearly. Some definitions say that Big Data means the data is large enough that you have to think about its size in order to gain insights from it. Others say it’s Big Data when it stops fitting on a single machine. These definitions are accurate in their own right but not necessarily complete. Big Data, in our opinion, is a fundamentally different way of thinking about data and how it’s used to drive business value. Traditionally, there were transaction recording (OLTP) and analytics (OLAP) on the recorded data. But not much was done to understand the reasons behind the transactions, what factors contributed to business taking place the way it did, or to come up with insights that could drive the customer’s behavior directly. In the context of the earlier LinkedIn example, this could translate into finding missing connections based on user attributes, second-degree connections, and browsing behavior, and then prompting users to connect with people they may know. Effectively pursuing such initiatives typically requires working with a large amount of varied data.

This new approach to data was pioneered by web companies like Google and Amazon, followed by Yahoo! and Facebook. These companies also wanted to work with different kinds of data, and it was often unstructured or semistructured (such as logs of users’ interactions with the website). This required the system to process several orders of magnitude more data. Traditional relational databases were able to scale up to a great extent for some use cases, but doing so often meant expensive licensing and/or complex application logic. Moreover, owing to the data models they provided, they didn’t do a good job of working with evolving datasets that didn’t adhere to the schemas defined up front. There was a need for systems that could work with different kinds of data formats and sources without requiring strict schema definitions up front, and do it at scale. The requirements were different enough that going back to the drawing board made sense to some of the internet pioneers, and that’s what they did. This was the dawn of the world of Big Data systems and NoSQL. (Some might argue that it happened much later, but that’s not the point. This did mark the beginning of a different way of thinking about data.)

As part of this innovation in data management systems, several new technologies were built. Each solved different use cases and had a different set of design assumptions and features. They had different data models, too.

How did we get to HBase? What fueled the creation of such a system? That’s up next.

1.1.2. Data innovation

As we now know, many prominent internet companies, most notably Google, Amazon, Yahoo!, and Facebook, were on the forefront of this explosion of data. Some generated their own data, and others collected what was freely available; but managing these vastly different kinds of datasets became core to doing business. They all started by building on the technology available at the time, but the limitations of this technology became limitations on the continued growth and success of these businesses. Although data management technology wasn’t core to the businesses, it became essential for doing business. The ensuing internal investment in technical research resulted in many new experiments in data technology.

Although many companies kept their research closely guarded, Google chose to talk about its successes. The publications that shook things up were the Google File System[4] and MapReduce papers.[5] Taken together, these papers represented a novel approach to the storage and processing of data. Shortly thereafter, Google published the Bigtable paper,[6] which provided a complement to the storage paradigm provided by its file system. Other companies built on this momentum, both the ideas and the habit of publishing their successful experiments. As Google’s publications provided insight into indexing the internet, Amazon published Dynamo,[7] demystifying a fundamental component of the company’s shopping cart.

4 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” Google Research Publications, http://research.google.com/archive/gfs.html.

5 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Google Research Publications, http://research.google.com/archive/mapreduce.html.

6 Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Google Research Publications, http://research.google.com/archive/bigtable.html.

7 Werner Vogels, “Amazon’s Dynamo,” All Things Distributed, www.allthingsdistributed.com/2007/10/amazons_dynamo.html.

It didn’t take long for all these new ideas to begin condensing into open source implementations. In the years following, the data management space has come to host all manner of projects. Some focus on fast key-value stores, whereas others provide native data structures or document-based abstractions. Equally diverse are the intended access patterns and data volumes these technologies support. Some forego writing data to disk, sacrificing immediate persistence for performance. Most of these technologies don’t hold ACID guarantees as sacred. Although proprietary products do exist, the vast majority of the technologies are open source projects. Thus, these technologies as a collection have come to be known as NoSQL.

Where does HBase fit in? HBase does qualify as a NoSQL store. It provides a key-value API, although with a twist not common in other key-value stores. It promises strong consistency so clients can see data immediately after it’s written. HBase runs on multiple nodes in a cluster instead of on a single machine. It doesn’t expose this detail to its clients. Your application code doesn’t know if it’s talking to 1 node or 100, which makes things simpler for everyone. HBase is designed for terabytes to petabytes of data, so it optimizes for this use case. It’s a part of the Hadoop ecosystem and depends on some key features, such as data redundancy and batch processing, to be provided by other parts of Hadoop.

Now that you have some context for the environment at large, let’s consider specifically the beginnings of HBase.

1.1.3. The rise of HBase

Pretend that you’re working on an open source project for searching the web by crawling websites and indexing them. You have an implementation that works on a small cluster of machines but requires a lot of manual steps. Pretend too that you’re working on this project around the same time Google publishes papers about its data-storage and -processing frameworks. Clearly, you would jump on these publications and spearhead an open source implementation based on them. Okay, maybe you wouldn’t, and we surely didn’t; but Doug Cutting and Mike Cafarella did.

Built on Apache Lucene, Nutch was their open source web-search project and the motivation for the first implementation of Hadoop.[8] Hadoop soon began to receive lots of attention from Yahoo!, which hired Cutting and others to work on it full time. Hadoop was then extracted out of Nutch and eventually became an Apache top-level project. With Hadoop well underway and the Bigtable paper published, the groundwork existed to implement an open source Bigtable on top of Hadoop. In 2007, Cafarella released code for an experimental, open source Bigtable. He called it HBase. The startup Powerset decided to dedicate Jim Kellerman and Michael Stack to work on this Bigtable analog as a way of contributing back to the open source community on which it relied.[9]

8 A short historical summary was published by Doug Cutting at http://cutting.wordpress.com/2009/08/10/joining-cloudera/.

9 See Jim Kellerman’s blog post at http://mng.bz/St47.

HBase proved to be a powerful tool, especially in places where Hadoop was already in use. Even in its infancy, it quickly found production deployment and developer support from other companies. Today, HBase is a top-level Apache project with thriving developer and user communities. It has become a core infrastructure component and is being run in production at scale worldwide in companies like StumbleUpon, Trend Micro, Facebook, Twitter, Salesforce, and Adobe.

HBase isn’t a cure-all of data management problems, and you might include another technology in your stack at a later point for a different use case. Let’s look at how HBase is being used today and the types of applications people have built using it. Through this discussion, you’ll gain a feel for the kinds of data problems HBase can solve and has been used to tackle.

1.2. HBase use cases and success stories

Sometimes the best way to understand a software product is to look at how it’s used. The kinds of problems it solves and how those solutions fit into a larger application architecture can tell you a lot about a product. Because HBase has seen a number of publicized production deployments, we can do just that. This section elaborates on some of the more common use cases that people have successfully used HBase to solve.

 

Note

Don’t limit yourself to thinking that HBase can solve only these kinds of use cases. It’s a nascent technology, and innovation in terms of use cases is what drives the development of the system. If you have a new idea that you think can benefit from the features HBase offers, try it. The community would love to help you during the process and also learn from your experiences. That’s the spirit of open source software.

 

HBase is modeled after Google’s Bigtable, so we’ll start our exploration with the canonical Bigtable problem: storing the internet.

1.2.1. The canonical web-search problem: the reason for Bigtable’s invention

Search is the act of locating information you care about: for example, searching for pages in a textbook that contain the topic you want to read about, or for web pages that have the information you’re looking for. Searching for documents containing particular terms requires looking up indexes that map terms to the documents that contain them. To enable search, you have to build these indexes. This is precisely what Google and other search engines do. Their document corpus is the entire internet; the search terms are whatever you type in the search box.

Bigtable, and by extension HBase, provides storage for this corpus of documents. Bigtable supports row-level access so crawlers can insert and update documents individually. The search index can be generated efficiently via MapReduce directly against Bigtable. Individual document results can be retrieved directly. Support for all these access patterns was key in influencing the design of Bigtable. Figure 1.1 illustrates the critical role of Bigtable in the web-search application.
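The index-generation step described above follows a simple pattern: a map phase emits (term, document) pairs from the stored corpus, and a reduce phase groups them into an inverted index. Here is a toy, single-process illustration of that pattern; the document contents and IDs are invented, and a real deployment would run this as a MapReduce job over the table.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Build a term -> set-of-doc-ids index, MapReduce style."""
    # "Map" phase: emit one (term, doc_id) pair per word occurrence.
    pairs = []
    for doc_id, text in documents.items():
        for term in text.lower().split():
            pairs.append((term, doc_id))
    # "Reduce" phase: group document ids by term.
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return dict(index)

docs = {
    "page1": "hbase stores big data",
    "page2": "hadoop stores data",
}
index = build_inverted_index(docs)
# index["stores"] == {"page1", "page2"}
```

Serving a search result is then a lookup in this index followed by row-level fetches of the matching documents, which is exactly the random-access pattern Bigtable was designed for.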

Figure 1.1. Providing web-search results using Bigtable, simplified. The crawlers—applications collecting web pages—store their data in Bigtable. A MapReduce process scans the table to produce the search index. Search results are queried from Bigtable to display to the user.

 

Note

In the interest of brevity, this look at Bigtable doesn’t do the original authors justice. We highly recommend the three papers on Google File System, MapReduce, and Bigtable as required reading for anyone curious about these technologies. You won’t be disappointed.

 

With the canonical HBase example covered, let’s look at other places where HBase has found purchase. The adoption of HBase has grown rapidly over the last couple of years. This has been fueled by the system becoming more reliable and performant, due in large part to the engineering effort invested by the various companies backing and using it. As more commercial vendors provide support, users are increasingly confident in using the system for critical applications. A technology designed to store a continuously updated copy of the internet turns out to be pretty good at other things internet-related. HBase has found a home filling a variety of roles in and around social-networking companies. From storing communications between individuals to communication analytics, HBase has become a critical infrastructure at Facebook, Twitter, and StumbleUpon, to name a few.

HBase has been used in three major types of use cases, but it’s not limited to those. In the interest of keeping this chapter short and sweet, we’ll cover the major use cases here.

1.2.2. Capturing incremental data

Data often trickles in and is added to an existing data store for further usage, such as analytics, processing, and serving. Many HBase use cases fall in this category—using HBase as the data store that captures incremental data coming in from various data sources. These data sources can be, for example, web crawls (the canonical Bigtable use case that we talked about), advertisement impression data containing information about which user saw what advertisement and for how long, or time series data generated from recording metrics of various kinds. Let’s talk about a few successful use cases and the companies that are behind these projects.

Capturing Metrics: OpenTSDB

Web-based products serving millions of users typically have hundreds or thousands of servers in their back-end infrastructure. These servers spread across various functions—serving traffic, capturing logs, storing data, processing data, and so on. To keep the products up and running, it’s critical to monitor the health of the servers as well as the software running on these servers (from the OS right up to the application the user is interacting with). Monitoring the entire stack at scale requires systems that can collect and store metrics of all kinds from these different sources. Every company has its own way of achieving this. Some use proprietary tools to collect and visualize metrics; others use open source frameworks.

StumbleUpon built an open source framework that allows the company to collect metrics of all kinds into a single system. Metrics collected in this manner are essentially time-series data: data recorded repeatedly over time and keyed by time. The framework that StumbleUpon built is called OpenTSDB, which stands for Open Time Series Database. This framework uses HBase at its core to store and access the collected metrics. The intention of building this framework was to have an extensible metrics-collection system that could store metrics and make them available for access over a long period of time, as well as allow all sorts of new metrics to be added as more features are added to the product. StumbleUpon uses OpenTSDB to monitor all of its infrastructure and software, including its HBase clusters. We cover OpenTSDB in detail in chapter 7 as a sample application built on top of HBase.
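Why does HBase suit time-series data so well? Because rows are stored in sorted rowkey order, a rowkey that leads with the metric name and a fixed-width timestamp keeps all points for one metric contiguous, so a time-range read becomes a short sequential scan. The sketch below is only in the spirit of OpenTSDB (the real format packs binary metric and tag IDs; this string form is an illustration):

```python
def make_rowkey(metric, epoch_seconds, tags):
    """Illustrative time-series rowkey: metric, then time, then tags."""
    tag_part = "/".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Zero-pad the timestamp so lexicographic order matches time order.
    return f"{metric}:{epoch_seconds:010d}:{tag_part}"

k1 = make_rowkey("sys.cpu.user", 1700000000, {"host": "web01"})
k2 = make_rowkey("sys.cpu.user", 1700000060, {"host": "web01"})
# k1 sorts before k2, so "last hour of sys.cpu.user" is one range scan
```

Chapter 7 covers the real OpenTSDB schema, which refines this idea considerably.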

Capturing User-Interaction Data: Facebook and StumbleUpon

Metrics captured for monitoring are one category. There are also metrics about user interaction with a product. How do you keep track of the site activity of millions of people? How do you know which site features are most popular? How do you use one page view to directly influence the next? For example, who saw what, and how many times was a particular button clicked? Remember the Like button in Facebook and the Stumble and +1 buttons in StumbleUpon? These are all counting problems: a counter is incremented every time a user likes a particular topic.

StumbleUpon had its start with MySQL, but as the service became more popular, that technology choice failed it. The online demand from this growing user base was too much for the MySQL clusters, and ultimately StumbleUpon chose HBase to replace them. At the time, HBase didn’t directly support the necessary features. StumbleUpon implemented atomic increment in HBase and contributed it back to the project.

Facebook uses counters in HBase to count the number of times people like a particular page. Content creators and page owners can get near real-time metrics about how many users like their pages. This allows them to make more informed decisions about what content to generate. Facebook built a system called Facebook Insights, which needs to be backed by a scalable storage system. The company looked at various options, including RDBMS, in-memory counters, and Cassandra, before settling on HBase. This way, Facebook can scale horizontally and provide the service to millions of users as well as use its existing experience in running large-scale HBase clusters. The system handles tens of billions of events per day and records hundreds of metrics.
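What makes an increment “atomic”? A naive read-modify-write from the client can lose updates when two callers race; HBase instead applies the increment server-side as a single operation. The class name and lock below are our own stand-ins for that server-side guarantee, a conceptual model rather than the real HBase client API:

```python
import threading

class CounterTable:
    """Toy model of atomic counters keyed by (rowkey, column)."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def increment(self, rowkey, column, amount=1):
        # The lock stands in for HBase applying the increment
        # server-side in one step, so no update is ever lost.
        with self._lock:
            key = (rowkey, column)
            self._counts[key] = self._counts.get(key, 0) + amount
            return self._counts[key]

likes = CounterTable()
for _ in range(3):
    likes.increment("page:acme", "metrics:likes")
# increment returns the post-increment value, so a caller can
# read the running total without a separate get
```

Without that single-operation guarantee, two concurrent “likes” could both read 41 and both write 42, silently dropping one click.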

Telemetry: Mozilla and Trend Micro

Operational and software-quality data includes more than just metrics. Crash reports are an example of useful software-operational data that can be used to gain insights into the quality of the software and plan the development roadmap. This isn’t necessarily related to web servers serving applications. HBase has been successfully used to capture and store crash reports that are generated from software crashes on users’ computers.

The Mozilla Foundation is responsible for the Firefox web browser and Thunderbird email client. These tools are installed on millions of computers worldwide and run on a wide variety of OSs. When one of these tools crashes, it may send a crash report back to Mozilla in the form of a bug report. How does Mozilla collect these reports? What use are they once collected? The reports are collected via a system called Socorro and are used to direct development efforts toward more stable products. Socorro’s data storage and analytics are built on HBase.[10]

10 Laura Thomson, “Moving Socorro to HBase,” Mozilla WebDev, http://mng.bz/L2k9.

The introduction of HBase enabled basic analysis over far more data than was previously possible. This analysis was used to direct Mozilla’s developer focus to great effect, resulting in the most bug-free release ever.

Trend Micro provides internet security and threat-management services to corporate clients. A key aspect of security is awareness, and log collection and analysis are critical for providing that awareness in computer systems. Trend Micro uses HBase to manage its web reputation database, which requires both row-level updates and support for batch processing with MapReduce. Much like Mozilla’s Socorro, HBase is also used to collect and analyze log activity, collecting billions of records every day. The flexible schema in HBase allows data to easily evolve over time, and Trend Micro can add new attributes as analysis processes are refined.

Advertisement Impressions and Clickstream

Over the last decade or so, online advertisements have become a major source of revenue for web-based products. The model has been to provide free services to users but have ads linked to them that are targeted to the user using the service at the time. This kind of targeting requires detailed capturing and analysis of user-interaction data to understand the user’s profile. The ad to be displayed is then selected based on that profile. Fine-grained user-interaction data can lead to building better models, which in turn leads to better ad targeting and hence more revenue. But this kind of data has two properties: it comes in the form of a continuous stream, and it can be easily partitioned based on the user. In an ideal world, this data should be available to use as soon as it’s generated, so the user-profile models can be improved continuously without delay—that is, in an online fashion.

 

Online vs. offline systems

The terms online and offline have come up a couple of times. For the uninitiated, these terms describe the conditions under which a software system is expected to perform. Online systems have low-latency requirements. In some cases, it’s better for these systems to respond with no answer than to take too long producing the correct answer. You can think of a system as online if there’s a user at the other end impatiently tapping their foot. Offline systems don’t have this low-latency requirement. There may still be a user waiting for an answer, but the response isn’t expected immediately.

Whether a system is intended to be online or offline influences many technology decisions when implementing an application. HBase is an online system. Its tight integration with Hadoop MapReduce makes it equally capable of offline access as well.

 

These factors make collecting user-interaction data a perfect fit for HBase, and HBase has been successfully used to capture raw clickstream and user-interaction data incrementally and then process it (clean it, enrich it, use it) using different processing mechanisms (MapReduce being one of them). If you look for companies that do this, you’ll find plenty of examples.

1.2.3. Content serving

One of the major use cases of databases traditionally has been that of serving content to users. Applications that are geared toward serving different types of content are backed by databases of all shapes, sizes, and colors. These applications have evolved over the years, and so have the databases they’re built on top of. A vast amount of content of varied kinds is available that users want to consume and interact with. In addition, accessibility to such applications has grown, owing to this burgeoning thing called the internet and an even more rapidly growing set of devices that can connect to it. The various kinds of devices lead to another requirement: different devices need the same content in different formats.

That’s all about users consuming content. In another entirely different use case, users generate content: tweets, Facebook posts, Instagram pictures, and micro blogs are just a few examples.

The bottom line is that users consume and generate a lot of content. HBase is being used to back applications that allow large numbers of users to interact with them, whether to consume or to generate content.

A content management system (CMS) allows for storing and serving content, as well as managing everything from a central location. More users and more content being generated translates into a requirement for a more scalable CMS solution. Lily,[11] a scalable CMS, uses HBase as its back end, along with other open source frameworks such as Solr to provide a rich set of functionality.

11 Lily Content Management System: www.lilyproject.org.

Salesforce provides a hosted CRM product that exposes rich relational database functionality to customers through a web browser interface. Long before Google was publishing papers about its proto-NoSQL systems, the most reasonable choice to run a large, carefully scrutinized database in production was a commercial RDBMS. Over the years, Salesforce has scaled that approach to do hundreds of millions of transactions per day, through a combination of database sharding and cutting-edge performance engineering.

When looking for ways to expand its database arsenal to include distributed database systems, Salesforce evaluated the full spectrum of NoSQL technologies before deciding to implement HBase.[12] The primary factor in the choice was consistency. Bigtable-style systems are the only architectural approach that combines seamless horizontal scalability with strong record-level consistency. Additionally, Salesforce already used Hadoop for doing large offline batch processing, so the company was able to take advantage of in-house expertise in running and administering systems on the Hadoop stack.

12 This statement is based on personal conversations with some of the engineers at Salesforce.

URL Shorteners

URL shorteners have gained a lot of popularity in recent years, and many of them have cropped up. StumbleUpon has its own, called su.pr. Su.pr uses HBase as its back end, which allows it to scale: shortening URLs and storing tons of short URLs along with their mappings to the longer versions.

Serving User Models

Often, the content being served out of HBase isn’t consumed directly by users, but is instead used to make decisions about what should be served. It’s metadata that is used to enrich the user’s interaction.

Remember the user profiles we talked about earlier in the context of ad serving? Those profiles (or models) can also be served out of HBase. Such models can be of various kinds and can be used for several different use cases, from deciding what ad to serve to a particular user, to deciding price offers in real time when users shop on an e-commerce portal, to adding context to user interaction and serving back information the user asked for while searching for something on a search engine. There are probably many such use cases that aren’t publicly talked about, and mentioning them could get us into trouble.

Runa[13] serves user models that are used to make real-time pricing decisions and offers to users during their engagement with an e-commerce portal. The models are fine-tuned continuously as new user data comes in.

13www.runa.com.

1.2.4. Information exchange

The world is becoming more connected by the day, with all sorts of social networks cropping up.[14] One of the key aspects of these social networks is the fact that users can interact using them. Sometimes these interactions are in groups (small and large alike); other times, the interaction is between two individuals. Think of hundreds of millions of people having conversations over these networks. They aren’t happy with just the ability to communicate with people far away; they also want to look at the history of all their communication with others. Luckily for social network companies, storage is cheap, and innovations in Big Data systems allow them to use the cheap storage to their advantage.[15]

14 Some might argue that connecting via social networks doesn’t mean being more social. That’s a philosophical discussion, and we’ll stay out of it. It has nothing to do with HBase, right?

15 Plus all those ad dollars need to be put to use somehow.

One such use case that is often publicly discussed and is probably a big driver of HBase development is Facebook messages. If you’re on Facebook, you’ve likely sent or received messages from your Facebook friends at some point. That feature of Facebook is entirely backed by HBase. All messages that users write or read are stored in HBase.[16] The system supporting Facebook messages needs to deliver high write throughput, extremely large tables, and strong consistency within a datacenter. In addition to messages, other applications had requirements that influenced the decision to use HBase: read throughput, counter throughput, and automatic sharding are necessary features. The engineers found HBase to be an ideal solution because it supports all these features, it has an active user community, and Facebook’s operations teams had experience with other Hadoop deployments. In “Hadoop goes realtime at Facebook,”[17] Facebook engineers provide deeper insight into the reasoning behind this decision and their experience using Hadoop and HBase in an online system.

16 Kannan Muthukkaruppan, “The Underlying Technology of Messages,” Facebook, https://www.facebook.com/note.php?note_id=454991608919.

17 Dhruba Borthakur et al., “Apache Hadoop goes realtime at Facebook,” ACM Digital Library, http://dl.acm.org/citation.cfm?id=1989438.

Facebook engineers shared some interesting scale numbers at HBaseCon 2012. Billions of messages are exchanged every day on this platform, translating to about 75 billion operations per day. At peak time, this can involve up to 1.5 million operations per second on Facebook’s HBase clusters. From a data-size perspective, Facebook is adding 250 TB of new data to its clusters every month.[18] This is likely one of the largest known HBase deployments out there, both in terms of the number of servers and the number of users the application in front of it serves.

18 This statistic was shared during a keynote at HBaseCon 2012. We don’t have a document to cite, but you can do a search for more info.

These are just a few examples of how HBase is solving interesting problems new and old. You may have noticed a common thread: using HBase for both online services and offline processing over the same data. This is a role for which HBase is particularly well suited. Now that you have an idea how HBase can be used, let’s get started using it.

1.3. Hello HBase

HBase is built on top of Apache Hadoop and Apache ZooKeeper. Like the rest of the Hadoop ecosystem components, it’s written in Java. HBase can run in three different modes: standalone, pseudo-distributed, and fully distributed. Standalone mode is what we’ll work with throughout the book: all of HBase runs in just one Java process. This is how you’ll interact with HBase for exploration and local development. Pseudo-distributed mode runs many Java processes on a single machine, while fully distributed mode spreads them across a cluster of machines. The distributed modes require dependency packages to be installed and HBase to be configured properly; these topics are covered in chapter 9.

HBase is designed to run on *nix systems, and the code and hence the commands throughout the book are designed for *nix systems. If you’re running Windows, the best bet is to get a Linux VM.

 

A note about Java

HBase is essentially written in Java, barring a couple of components, and the only language that is currently supported as a first-class citizen is Java. If you aren’t a Java developer, you’ll need to learn some Java skills along with learning about HBase. The intention of this book is to teach you how to use HBase effectively, and a big part of that is learning how to use the API, which is all Java. So, brace up.

 

1.3.1. Quick install

To run HBase in standalone mode, you don’t need to do a lot. You’ll work with the Apache 0.92.1 release and install it using the tarball. Chapter 9 talks more about the various distributions. If you’d like to work with a distribution other than stock Apache 0.92.1, feel free to install that instead. The examples are based on 0.92.1 (and Cloudera’s CDH4), and any version that has compatible APIs should work fine.

HBase needs the Java Runtime Environment (JRE) to be installed and available on the system. Oracle’s Java is the recommended package for use in production systems. The Hadoop and HBase communities have tested some of the JRE versions, and the recommended one for HBase 0.92.1 or CDH4 at the time of writing this manuscript is Java 1.6.0_31.[19] Java 7 hasn’t been tested with HBase so far and therefore isn’t recommended. Go ahead and install Java on your system before diving into the installation of HBase.

19 Installing the recommended Java versions: http://mng.bz/Namq.

Download the tarball from the Apache HBase website’s download section (http://hbase.apache.org/):

$ mkdir hbase-install
$ cd hbase-install
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
$ tar xvfz hbase-0.92.1.tar.gz

These steps download and untar the HBase tarball from the Apache mirror. As a convenience, create an environment variable pointing at this location; it’ll make life easier later. Put it in your environment file so you don’t have to set it every time you open a new shell. You’ll use HBASE_HOME later in the book:

$ export HBASE_HOME=`pwd`/hbase-0.92.1

Once that’s done, you can spin up HBase using the provided scripts:

$ $HBASE_HOME/bin/start-hbase.sh
starting master, logging to .../hbase-0.92.1/bin/../logs/...-master.out

If you want, you can also put $HBASE_HOME/bin in your PATH so you can simply run hbase rather than $HBASE_HOME/bin/hbase next time.

That’s all there is to it. You just installed HBase in standalone mode. The configurations for HBase primarily go into two files: hbase-env.sh and hbase-site.xml. For the tarball install you just did, these live in the $HBASE_HOME/conf/ directory (package-based installs typically place them in /etc/hbase/conf/ instead). By default in standalone mode, HBase writes data into /tmp, which isn’t the most durable place to write to. You can edit the hbase-site.xml file and put the following configuration, inside its <configuration> element, to change that location to a directory of your choice:

<property>
    <name>hbase.rootdir</name>
    <value>file:///home/user/myhbasedirectory/</value>
</property>

Your HBase install has a management console of sorts running on http://localhost:60010. It looks something like the screen in figure 1.2.

Figure 1.2. The HBase Master status page. From this interface, you can get a general sense of the health of your installation. It also allows you to explore the distribution of data and perform basic administrative tasks, but most administration isn’t done through this interface. You’ll learn more about HBase operations in chapter 10.

Now that you have everything installed and HBase fired up, let’s start playing with it.

1.3.2. Interacting with the HBase shell

You use the HBase shell to interact with HBase from the command line. This works the same way for both local and cluster installations. The HBase shell is a JRuby application wrapping the Java client API. You can run it in either interactive or batch mode. Interactive mode is for casual inspection of an HBase installation; batch mode is great for programmatic interaction via shell scripts, or even for loading small files. For this section, we’ll stick to interactive mode.

 

JRuby and JVM languages

Those of you unfamiliar with Java may be confused by this JRuby concept. JRuby is an implementation of the Ruby programming language on top of the Java runtime. In addition to the usual Ruby syntax, JRuby provides support for interacting with Java objects and libraries. Java and Ruby aren’t the only languages available on the JVM. Jython is an implementation of Python on the JVM, and there are entirely unique languages like Clojure and Scala as well. All of these languages can interact with HBase via the Java client API.

 

Let’s start with interactive mode. Launch the shell from your terminal using hbase shell. The shell provides you with tab-completion of your commands and inline access to command documentation:

$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1-cdh4.0.0, rUnknown, Mon Jun  4 17:27:36 PDT 2012

hbase(main):001:0>

If you’ve made it this far, you’ve confirmed the installation of both Java and the HBase libraries. For the final validation, let’s ask for a listing of registered tables. Doing so makes a full-circle request from this client application to the HBase server infrastructure and back again. At the shell prompt, type list and press Enter. You should see zero results found as a confirmation and again be greeted with a prompt:

hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds

hbase(main):002:0>

With your installation complete and verified, let’s create a table and store some data.

1.3.3. Storing data

HBase uses the table as the top-level structure for storing data. To write data into HBase, you need a table to write it into. To begin, create a table called mytable with a single column family. Yes, column family. (Don’t worry; we’ll explain later.) Create your table now:

hbase(main):003:0> create 'mytable', 'cf'
0 row(s) in 1.0730 seconds

Writing Data

With a table created, you can now write some data. Let’s add the string hello HBase to the table. In HBase parlance, we say, “Put the bytes 'hello HBase' to a cell in 'mytable' in the 'first' row at the 'cf:message' column.” Catch all that? Again, we’ll cover all this terminology in the next chapter. For now, perform the write:

hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
0 row(s) in 0.2070 seconds

That was easy. HBase stores numbers as well as strings. Go ahead and add a couple more values, like so:

hbase(main):005:0> put 'mytable', 'second', 'cf:foo', 0x0
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'mytable', 'third', 'cf:bar', 3.14159
0 row(s) in 0.0080 seconds
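The puts above can be pictured as writes into the “sorted map of maps” described at the start of the chapter. Here’s a toy sketch of that logical model in plain Java; this is not the HBase API or its implementation, just an illustration of how rows and columns relate, with all names (SchemalessSketch and so on) invented for the example:

```java
import java.util.TreeMap;

// Toy model of HBase's logical layout: a sorted map of rowkeys to a
// sorted map of column names to values. Columns come into existence on
// first write -- nothing is declared up front.
public class SchemalessSketch {
    static TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();

    static void put(String row, String column, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    public static void main(String[] args) {
        // Mirrors the three shell puts: different columns per row,
        // no prior schema definition.
        put("first",  "cf:message", "hello HBase");
        put("second", "cf:foo",     "0");
        put("third",  "cf:bar",     "3.14159");

        System.out.println(table.get("first").get("cf:message")); // hello HBase
        System.out.println(table.keySet()); // rows kept sorted by rowkey
    }
}
```

Note that the outer map stays sorted by rowkey, which is exactly the ordering you’ll see when you scan the table shortly.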

You now have three cells in three rows in your table. Notice that you didn’t define the columns before you used them. Nor did you specify what type of data you stored in each column. This is what the NoSQL crowd means when they say HBase is a schema-less database. But what good is writing data if you can’t read it? No good at all. It’s time to read your data back.

Reading Data

HBase gives you two ways to read data: get and scan. As you no doubt noticed, the command you gave HBase to store the cells was put. get is the complement of put, reading back a single row. Remember when we mentioned HBase having a key-value API, but with a twist? scan is that twist. Chapter 2 will explain how scan works and why it’s important. In the meantime, focus on using it.

Start with get:

hbase(main):007:0> get 'mytable', 'first'
COLUMN                CELL
 cf:message           timestamp=1323483954406, value=hello HBase
1 row(s) in 0.0250 seconds

Just like that, you pulled out your first row. The shell shows you all the cells in the row, organized by column, with the value associated at each timestamp. HBase can store multiple versions of each cell. The default number of versions stored is three, but it’s configurable. At read time, only the latest version is returned unless otherwise specified. If you don’t want to keep multiple versions, you can configure the column family to retain just one; the versioning mechanism itself, with its per-cell timestamps, can’t be turned off.
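Conceptually, each cell maps timestamps to values, and a read hands back the entry with the highest timestamp unless you ask for more. A minimal sketch of that idea in plain Java (again a toy, not the HBase API; the second timestamp is invented for illustration):

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a versioned cell: values keyed by timestamp, newest first.
public class VersionedCellSketch {
    public static void main(String[] args) {
        NavigableMap<Long, String> cell =
            new TreeMap<>(Comparator.reverseOrder());
        cell.put(1323483954406L, "hello HBase"); // older write
        cell.put(1323483999999L, "hello again"); // newer write

        // A plain read returns only the latest version...
        System.out.println(cell.firstEntry().getValue()); // hello again

        // ...but older versions are still there if you ask for them.
        System.out.println(cell.size()); // 2
    }
}
```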

Use scan when you want values from multiple rows. Be careful! Unless you specify otherwise, it returns all rows in the table. Don’t say we didn’t warn you. Try it:

hbase(main):008:0> scan 'mytable'
ROW                 COLUMN+CELL
 first              column=cf:message, timestamp=1323483954406, value=hell
                    o HBase
 second             column=cf:foo, timestamp=1323483964825, value=0
 third              column=cf:bar, timestamp=1323483997138, value=3.14159
3 row(s) in 0.0240 seconds

All your data came back. Notice the order in which HBase returns rows. They’re ordered by the row name; HBase calls this the rowkey. HBase has a couple other tricks up its sleeve, but everything else is built on the basic concepts you’ve just used. Let’s wrap it up.
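One practical note on that ordering before we do: HBase compares rowkeys as raw bytes, so for ASCII strings the order is lexicographic, not numeric. That detail matters when you design rowkeys, and a toy sketch in plain Java (illustrative only; the rowkey names are invented) shows the classic surprise:

```java
import java.util.TreeSet;

// Rowkeys sort as raw bytes, i.e. lexicographically for ASCII strings.
public class RowkeyOrderSketch {
    public static void main(String[] args) {
        TreeSet<String> rowkeys = new TreeSet<>();
        rowkeys.add("row-2");
        rowkeys.add("row-10");
        rowkeys.add("row-1");
        // "row-10" sorts before "row-2" -- '1' < '2' byte by byte.
        System.out.println(rowkeys); // [row-1, row-10, row-2]

        // Zero-padding the numeric part restores the intended order.
        TreeSet<String> padded = new TreeSet<>();
        padded.add("row-02");
        padded.add("row-10");
        padded.add("row-01");
        System.out.println(padded); // [row-01, row-02, row-10]
    }
}
```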

1.4. Summary

We covered quite a bit of material for an introductory chapter. Knowing where a technology comes from is always helpful when you’re learning. By now you should understand the roots of HBase and have some context for the NoSQL phenomenon. You also understand the basics of the problem HBase is designed for as well as some of the problems HBase has solved. Not only that, but you’re running HBase right now and using it to store some of your most precious “hello world” data.

No doubt we’ve raised more questions for you. Why is strong consistency important? How does the client find the right node for a read? What’s so fancy about scans? What other tricks does HBase have waiting? We’ll answer all these and more in the coming chapters. Chapter 2 will get you started on the path of building your own application that uses HBase as its back-end data store.
