Preface

I got my start with HBase in the fall of 2008. It was a young project then, released only in the preceding year. As early releases go, it was quite capable, although not without its fair share of embarrassing warts. Not bad for an Apache subproject with fewer than 10 active committers to its name! That was the dawn of the NoSQL hype. The term NoSQL hadn’t even been coined yet but would come into common parlance over the next year. No one could articulate why the idea was important—only that it was important—and everyone in the open source data community was obsessed with this concept. The community was polarized, with people either bashing relational databases for their foolish rigidity or mocking the new technologies for their lack of sophistication.

The people exploring this new idea were mostly in internet companies, and I came to work for such a company—a startup interested in the analysis of social media content. Facebook still enforced its privacy policies then, and Twitter wasn’t big enough to know what a Fail Whale was yet. Our interest at the time was mostly in blogs. I had come from a company where I’d spent the better part of three years working on a hierarchical database engine. There we made extensive use of Berkeley DB, so I was familiar with data technologies that didn’t have a SQL engine. I joined a small team tasked with building a new data-management platform. We had an MS SQL database stuffed to the gills with blog posts and comments. When our daily analysis jobs crossed the 18-hour mark, we knew the current system’s days were numbered.

After cataloging a basic set of requirements, we set out to find a new data technology. We were a small team and spent months evaluating different options while maintaining our current systems. We experimented with different approaches and learned firsthand the pains of manually partitioning data. We studied the CAP theorem and eventual consistency, and the tradeoffs they imply. Despite its warts, we decided on HBase, and we convinced our manager that the potential benefits outweighed the risks he saw in open source technology.

I’d played a bit with Hadoop at home but had never written a real MapReduce job. I’d heard of HBase but wasn’t particularly interested in it until I found myself in this new position. With the clock ticking, there was nothing to do but jump in. We scrounged up a couple of spare machines and a bit of rack space, and then we were off and running. It was a .NET shop, and we had no operational help, so we learned to combine bash with rsync and managed the cluster ourselves.

I joined the mailing lists and the IRC channel and started asking questions. Around this time, I met Amandeep. He was working on his master’s thesis, hacking up HBase to run on systems other than Hadoop. Soon he finished school, joined Amazon, and moved to Seattle. We were among the very few HBase-ers in this extremely Microsoft-centric city. Fast-forward another two years...

The idea of HBase in Action was first proposed to us in the fall of 2010. From my perspective, the project was laughable. Why should we, two community members, write a book about HBase? Internally, it’s a complex beast. The Definitive Guide was still a work in progress, but we both knew its author, a committer, and were well aware of the challenge before him. From the outside, I thought, it’s just a “simple key-value store.” The API has only five concepts, none of which is complex. We weren’t going to write another internals book, and I wasn’t convinced there was enough going on from the application developer’s perspective to justify an entire book.

We started brainstorming the project, and it quickly became clear that I was wrong. Not only was there enough material for a user’s guide, but our position as community members made us ideal candidates to write such a book. We set out to catalog the useful bits of knowledge we’d each accumulated over the couple of years we’d used the technology. That effort—this book—is the distillation of our eight years of combined HBase experience. It’s targeted at those brand new to HBase, and it provides guidance over the stumbling blocks we encountered during our own journeys. We’ve collected and codified as much as we could of the tribal knowledge floating around the community. Wherever possible, we prefer concrete direction to vague advice. Far more than a simple FAQ, we hope you’ll find this book to be a complete manual for getting off the ground with HBase.

HBase is now stabilizing. Most of the warts we encountered when we began with the project have been cleaned up, patched, or completely re-architected. HBase is approaching its 1.0 release, and we’re proud to be part of the community as it reaches this milestone. We’re proud to present this manuscript in hopes that it will encourage and enable the next generation of HBase users. The single strongest component of HBase is its thriving community—we hope you’ll join us in that community and help it continue to innovate in this new era of data systems.

NICK DIMIDUK

If you’re reading this, you’re presumably interested in knowing how I got involved with HBase. Let me start by saying thank you for choosing this book as your means to learn about HBase and how to build applications that use HBase as their underlying storage system. I hope you’ll find the text useful and learn some neat tricks that will help you build better applications and enable you to succeed.

I was pursuing graduate studies in computer science at UC Santa Cruz, specializing in distributed systems, when I started working at Cisco as a part-time researcher. The team I was working with was trying to build a data-integration framework that could integrate, index, and allow exploration of data residing in hundreds of heterogeneous data stores, including but not limited to large RDBMSs. We started looking for systems and solutions that would help us solve the problems at hand. We evaluated many different systems, from object databases to graph databases, and we considered building a custom distributed data-storage layer backed by Berkeley DB. It was clear that one of the key requirements was scalability, and we didn’t want to build a full-fledged distributed system. If you’re ever in a situation where you think you need to build a custom distributed database or file system, think again: see whether an existing solution can solve at least part of your problem.

Following that principle, we decided that building out a new system wasn’t the best approach and to use an existing technology instead. That was when I started playing with the Hadoop ecosystem, getting my hands dirty with the different components in the stack and going on to build a proof-of-concept for the data-integration system on top of HBase. It actually worked and scaled well! HBase was well-suited to the problem, but these were young projects at the time—and one of the things that ensured our success was the community. HBase has one of the most welcoming and vibrant open source communities; it was much smaller at the time, but the key principles were the same then as now.

The data-integration project later became my master’s thesis. The project used HBase at its core, and I became more involved with the community as I built it out. I asked questions, and, with time, answered questions others asked, on both the mailing lists and the IRC channel. This is when I met Nick and got to know what he was working on. With each day that I worked on this project, my interest and love for the technology and the open source community grew, and I wanted to stay involved.

After finishing grad school, I joined Amazon in Seattle to work on back-end distributed systems projects. Much of my time was spent with the Elastic MapReduce team, building the first versions of their hosted HBase offering. Nick also lived in Seattle, and we met often and talked about the projects we were working on. Toward the end of 2010, the idea of writing HBase in Action for Manning came up. We initially scoffed at the thought of writing a book on HBase, and I remember saying to Nick, “It’s gets, puts, and scans—there’s not a lot more to HBase from the client side. Do you want to write a book about three API calls?”

But the more we thought about this, the more we realized that building applications with HBase was challenging and there wasn’t enough material to help people get off the ground. That limited the adoption of the project. We decided that more material on how to effectively use HBase would help users of the system build the applications they need. It took a while for the idea to materialize; in fall 2011, we finally got started.

Around this time, I moved to San Francisco to join Cloudera and was exposed to many applications built on top of HBase and the Hadoop stack. I brought what I knew, combined it with what I had learned over the last couple of years working with HBase and pursuing my master’s, and distilled that into concepts that became part of the manuscript for the book you’re now reading. HBase has come a long way in the last couple of years and has seen many big players adopt it as a core part of their stack. It’s more stable, faster, and easier to operate than it has ever been, and the project is fast approaching its 1.0 release.

Our intention in writing this book was to make learning HBase more approachable, easier, and more fun. As you learn more about the system, we encourage you to get involved with the community and to learn beyond what the book has to offer—to write blog posts, contribute code, and share your experiences to help drive this great open source project forward in every way possible. Flip open the book, start reading, and welcome to HBaseland!

AMANDEEP KHURANA
