About this Book

HBase sits at the top of a stack of complex distributed systems including Apache Hadoop and Apache ZooKeeper. You need not be an expert in all these technologies to make effective use of HBase, but it helps to understand these foundational layers in order to take full advantage of HBase. These technologies were inspired by papers published by Google. They’re open source clones of the technologies described in these publications. Reading these academic papers isn’t a prerequisite for using HBase or these other technologies, but when you’re learning a technology, it can be helpful to understand the problems that inspired its invention. This book doesn’t assume you’re familiar with these technologies, nor does it assume you’ve read the associated papers.

HBase in Action is a user’s guide to HBase, nothing more and nothing less. It doesn’t venture into the bowels of the internal HBase implementation. It doesn’t cover the broad range of topics necessary for understanding the Hadoop ecosystem. HBase in Action maintains a singular focus on using HBase. It aims to educate you enough that you can build an application on top of HBase and launch that application into production. Along the way, you’ll learn some of those HBase implementation details. You’ll also become familiar with other parts of Hadoop. You’ll learn enough to understand why HBase behaves the way it does, and you’ll be able to ask intelligent questions. This book won’t turn you into an HBase committer. It will give you a practical introduction to HBase.

Roadmap

HBase in Action is organized into four parts. The first two are about using HBase. In these six chapters, you’ll go from HBase novice to fluent in writing applications on HBase. Along the way, you’ll learn about the basics, schema design, and how to use the most advanced features of HBase. Most important, you’ll learn how to think in HBase. The two chapters in part 3 move beyond sample applications and give you a taste of HBase in real applications. Part 4 is aimed at taking your HBase application from a development prototype to a full-fledged production system.

Chapter 1 introduces the origins of Hadoop, HBase, and NoSQL in general. We explain what HBase is and isn’t, contrast HBase with other NoSQL databases, and describe some common use cases. We’ll help you decide if HBase is the right technology choice for your project and organization. Chapter 1 concludes with a simple HBase install and gets you started with storing data.

Chapter 2 kicks off a running sample application. Through this example, we explore the foundations of using HBase. Creating tables, storing and retrieving data, and the HBase data model are all covered. We also explore enough HBase internals to understand how data is organized in HBase and how you can take advantage of that knowledge in your own applications.

Chapter 3 re-introduces HBase as a distributed system. This chapter explores the relationship between HBase, Hadoop, and ZooKeeper. You’ll learn about the distributed architecture of HBase and how that translates into a powerful distributed data system. The use cases for using HBase with Hadoop MapReduce are explored with hands-on examples.

Chapter 4 is dedicated to HBase schema design. This complex topic is explained using the example application. You’ll see how table design decisions affect the application and how to avoid common mistakes. We’ll map any existing relational database knowledge you have into the HBase world. You’ll also see how to work around an imperfect schema design using server-side filters. This chapter also covers the advanced physical configuration options exposed by HBase.

Chapter 5 introduces coprocessors, a mechanism for pushing computation out to your HBase cluster. You’ll extend the sample application in two different ways, building new application features into the cluster itself.

Chapter 6 is a whirlwind tour of alternative HBase clients. HBase is written in Java, but that doesn’t mean your application must be. You’ll interact with the sample application from a variety of languages and over a number of different network protocols.

Part 3 starts with chapter 7, which opens a real-world, production-ready application. You’ll learn a bit about the problem domain and the specific challenges the application solves. Then we dive deep into the implementation and don’t skimp on the technical details. If ever there was a front-to-back exploration of an application built on HBase, this is it.

Chapter 8 shows you how to map HBase onto a new problem domain. We get you up to speed on that domain, GIS, and then show you how to tackle domain-specific challenges in a scalable way with HBase. The focus is on a domain-specific schema design and making maximum use of scans and filters. No previous GIS experience is expected, but be prepared to use most of what you’ve learned in the previous chapters.

In part 4, chapter 9 bootstraps your HBase cluster. Starting from a blank slate, we show you how to tackle your HBase deployment. What kind of hardware, how much hardware, and how to allocate that hardware are all fair game in this chapter. Considering the cloud? We cover that too. With hardware determined, we show you how to configure your cluster for a basic deployment and how to get everything up and running.

Chapter 10 rolls your deployment into production. We show you how to keep an eye on your cluster through metrics and monitoring tools. You’ll see how to further tune your cluster for performance, based on your application workloads. We show you how to administer the needs of your cluster, keep it healthy, diagnose and fix it when it’s sick, and upgrade it when the time comes. You’ll learn to use the bundled tools for managing data backups and restoration, and how to configure multi-cluster replication.

Intended audience

This book is a hands-on user’s guide to a database. As such, its primary audience is application developers and technology architects interested in coming up to speed on HBase. It’s more practical than theoretical and more about consumption than internals. It’s probably more useful as a developer’s companion than a student’s textbook. It also covers the basics of deployment and operations, so it will be a useful starting point for operations engineers. (Honestly, though, the book for that crowd, as pertains to HBase, hasn’t been written yet.)

HBase is written in Java and runs on the JVM. We expect you to be comfortable with the Java programming language and with JVM concepts such as class files and JARs. We also assume a basic familiarity with some of the tooling around the JVM, particularly Maven, as it pertains to the source code used in the book. Hadoop and HBase run on Linux and UNIX systems, so experience with UNIX basics such as the terminal is expected. The Windows operating systems aren’t supported by HBase and aren’t supported by this book. Hadoop experience is helpful, although not mandatory. Relational databases are ubiquitous, so familiarity with concepts from those technologies is also assumed.

HBase is a distributed system and participates in distributed, parallel computation. We expect you to understand the basic concepts of concurrent programs, both multi-threaded programs and concurrent processes. We don’t expect you to know how to write a concurrent program, but you should be comfortable with the idea of multiple simultaneous threads of execution. This book isn’t heavy in algorithmic theory, but anyone working with terabytes or petabytes of data should be familiar with asymptotic computational complexity. Big-O notation does play a role in the schema design chapter.

Code conventions

In line with our aim of producing a practical book, you’ll find that we freely mix text and code. Sometimes as little as two lines of code are presented between paragraphs. The idea is to present as little as necessary before showing you how to use the API; then we provide additional detail. Those code snippets evolve and grow over the course of a section or chapter. We always conclude a chapter that contains code with a complete listing that provides the full context. We occasionally employ pseudo-code in a Python-like style to assist with an explanation. This is done primarily where the pure Java contains so much boilerplate or other language noise that it confuses the intended point. Pseudo-code is always followed by the real Java implementation.

Because this is a hands-on book, we also include many commands necessary to demonstrate aspects of the system. These commands include both what you type into the terminal and the output you can expect from the system. Software changes over time, so it’s entirely possible that this output has changed since we captured it. Still, it should be enough to orient you to the expected behavior.

In commands and source code, we make extensive use of bold text, and annotations draw your attention to the important aspects of listings. Some of the command output, particularly when we get into the HBase shell, can be dense; use the bold text and annotations as your guide. Code terms used in the body of the text appear in a monospace font like this.

Code downloads

All of our source code, both small scripts and full applications, is available and open source. We’ve released it under the Apache License, Version 2.0—the same as HBase. You can find the source code on the GitHub organization dedicated to this book at www.github.com/hbaseinaction. Each project contained therein is a complete, self-contained application. You can also download the code from the publisher’s website at www.manning.com/HBaseinAction.

In the spirit of open source, we hope you’ll find our example code useful in your applications. We encourage you to play with it, modify it, fork it, and share it with others. If you find bugs, please let us know in the form of issues, or, better still, pull requests. As they often say in the open source community: patches welcome.

Author Online

Purchase of HBase in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, go to www.manning.com/HBaseinAction. This page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It’s not a commitment to any specific amount of participation on the part of the authors, whose contribution to the book’s forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions, lest their interest stray!

The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
