Chapter 14

Choosing Among NoSQL Flavors

WHAT’S IN THIS CHAPTER?

  • Understanding the strengths and weaknesses of NoSQL products
  • Comparing and contrasting the available NoSQL products
  • Evaluating NoSQL products based on performance benchmarks

Not all NoSQL databases are alike, nor were they built to solve the same problems, so ranking them against one another in the abstract is probably a fruitless exercise. However, understanding which database is appropriate for a given situation and context is important. This chapter presents facts and opinions to help you compare and contrast the available NoSQL choices. It uses feature-, performance-, and context-based criteria to classify the NoSQL databases and weigh them against each other.

The evolution of NoSQL and databases beyond the RDBMS can be compared to the rapid proliferation of programming languages in the past few years. The availability of multiple programming languages allows the use of the right language for the right task, often leading a single developer to have more than one language in his or her repertoire. A person who speaks multiple natural languages is a polyglot, able to communicate effectively in situations where knowing only one language would be an impediment. By analogy, the adoption of multiple programming languages is termed polyglot programming, and is often seen as a smart practice in which an appropriate language is chosen for the task at hand. Along the same lines, it's becoming evident that one database does not fit all needs, and that knowledge and adoption of more than one database is a wise strategy. The knowledge and use of multiple database products and methodologies is now popularly called polyglot persistence.

NoSQL databases come in many shapes, sizes, and forms, so feature-based comparison is the first way to logically group them. Often, the solutions to many problems map easily to desired features.

COMPARING NOSQL PRODUCTS

This section compares and contrasts the NoSQL choices on the basis of the following features:

  • Scalability
  • Transactional integrity and consistency
  • Data modeling
  • Query support
  • Access and interface availability

Scalability

Although all NoSQL databases promise horizontal scalability, they do not rise to the challenge equally. The Bigtable clones, HBase and Hypertable, stand in front; in-memory stores like Membase or Redis and document databases like MongoDB and Couchbase Server lag behind. This difference is amplified as the data size becomes very large, especially if it grows beyond a few petabytes.

In the past several chapters, you gained a deep understanding of the storage architecture of most mainstream NoSQL database types. Bigtable and its clones promote the storage of large individual data points and large collections of data. The Bigtable model supports a large number of columns and an immensely large number of rows. The data can be sparse where many columns have no value. The Bigtable model, of course, does not waste space and simply doesn’t store cells that have no value.


The number of columns and rows in an HBase cluster is theoretically unbounded. The number of column-families, however, is restricted to about 100. The number of rows can keep growing as long as newer nodes are available to store the data. The number of columns is rarely more than a few hundred; too many columns can pose logical challenges in manipulating the data set.
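The sparse storage described above can be sketched in a few lines of Python. This is a toy illustration of the idea, not HBase's actual implementation: cells are addressed by row key, column-family, and column qualifier, and a cell that was never written simply does not exist in the store, so empty columns cost nothing.

```python
# Toy sketch of sparse, column-family-centric storage: only cells
# that hold a value are stored, so an immense number of empty
# columns occupies no space at all.

class SparseColumnStore:
    def __init__(self, column_families):
        # Column families are fixed up front, as in Bigtable-style stores.
        self.column_families = set(column_families)
        self.rows = {}  # row_key -> {family -> {qualifier -> value}}

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError("unknown column family: %s" % family)
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier, default=None):
        # Missing rows, families, or qualifiers fall through to the default.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier, default)

store = SparseColumnStore(["anchor", "contents"])
store.put("com.example.www", "anchor", "cnnsi.com", "CNN")
print(store.get("com.example.www", "anchor", "cnnsi.com"))  # CNN
print(store.get("com.example.www", "anchor", "nosuch"))     # None
```

The row key, family, and qualifier names here are made up for illustration; the point is that the second lookup costs nothing to represent.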

Google led the column-family-centric data store revolution to store the large and ever-growing web index its crawlers brought home. The Web has been growing in unbounded ways for the past several years. Google needed a store to grow with the expanding index. Therefore, Bigtable and its clones were built to scale out, limited only by the hardware available to spin off newer nodes in the cluster. Over the past few years, Google has successfully used the Bigtable model to store and retrieve a variety of data that is also very large in volume.

The HBase wiki lists a number of users on its Powered By page (http://wiki.apache.org/hadoop/Hbase/PoweredBy). Some users listed clearly testify to HBase’s capability to scale.


Although the next paragraph or two demonstrate HBase's capabilities, Hypertable, as another Google Bigtable clone, delivers on the same promise.

Meetup (www.meetup.com) is a popular site that facilitates user groups and interest groups to organize local events and meetings. Meetup has grown from a small, unknown site in 2001 to 8 million members in 100 countries, 65,000+ organizers, 80,000+ meetup groups, and 50,000 meetups each week (http://online.wsj.com/article/SB10001424052748704170404575624733792905708.html). Meetup is an HBase user. All group activity is directly written to HBase and is indexed per member. A member’s custom feed is directly served from HBase.

Facebook is another big user of HBase. Facebook messaging is built on HBase. Facebook was the number one destination site on the Internet in 2010. It has grown to more than 500 million active users (www.facebook.com/press/info.php?statistics) and is the largest software application in terms of the number of users. Facebook messaging is a robust infrastructure that integrates chat, SMS, and e-mail. Hundreds of billions of messages are sent every month through this messaging infrastructure. The engineering team at Facebook shared a few notes on using HBase for their messaging infrastructure. Read the notes online at www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919.

HBase has some inherent advantages when it comes to scaling systems. HBase supports automatic load balancing, failover, compression, and multiple shards per server. HBase works well with the Hadoop distributed filesystem (a.k.a. HDFS), a massively scalable distributed filesystem. You know from earlier chapters that HDFS replicates and automatically re-balances data to easily accommodate large files that span multiple servers. Facebook chose HBase to leverage many of these features; HBase is a necessity for handling the number of messages and users they serve. The Facebook engineering notes also mention that the messages in their infrastructure are short, volatile, and temporal and are rarely accessed later. HBase, and Bigtable clones in general, are particularly suitable when ad-hoc querying of data is not important. From earlier chapters, you know that HBase supports the querying of data sets but is a weak replacement for an RDBMS as far as querying capabilities are concerned. Infrastructures like Google App Engine (GAE) successfully expose a data modeling API, with advanced querying capabilities, on top of Bigtable. More information on querying is covered in the section titled "Querying Support," later in this chapter.

So it seems clear that column-family-centric NoSQL databases are a good choice if extreme scalability is a requirement. However, such databases may not be the best choice for all types of systems, especially those that involve real-time transaction processing. An RDBMS often makes a better choice than any NoSQL flavor when transactional integrity is very important. Eventually consistent NoSQL options, like Cassandra or Riak, may be workable if weaker consistency is acceptable. Amazon has demonstrated that massively scalable e-commerce operations may be a use case for eventually consistent data stores, but examples beyond Amazon where such models apply well are hard to find. Databases like Cassandra follow the Amazon Dynamo paradigm and support eventual consistency. Cassandra promises incredibly fast read and write speeds and also supports Bigtable-like column-family-centric data modeling. Amazon Dynamo also inspired Riak, which offers a document store abstraction in addition to being an eventually consistent store. Both Cassandra and Riak scale well in horizontal clusters, but if scalability is of paramount importance, my vote goes in favor of HBase or Hypertable over the eventually consistent stores. Where eventually consistent stores perhaps fare better than sorted ordered column-family stores is where write throughput and latency matter most. Therefore, if both horizontal scalability and high write throughput are required, consider Cassandra or Riak. Even in these cases, consider a hybrid approach in which you logically partition the data write process from access and analytics and use a separate database for each of the tasks.
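The consistency trade-off that Dynamo-style stores such as Cassandra and Riak expose is commonly described by three tunable numbers: N replicas per item, W replica acknowledgments per write, and R replicas consulted per read. A read is guaranteed to overlap at least one up-to-date replica whenever R + W > N; smaller quorums trade that guarantee for speed. A minimal sketch of the rule:

```python
# Dynamo-style quorum tuning: with N replicas, a write acknowledged
# by W nodes and a read consulting R nodes always intersect on at
# least one current replica when R + W > N. Otherwise the system is
# only eventually consistent.

def is_strongly_consistent(n, r, w):
    """True if every read quorum intersects every write quorum."""
    return r + w > n

# Typical configurations for a 3-replica cluster:
assert is_strongly_consistent(n=3, r=2, w=2)      # quorum reads and writes
assert is_strongly_consistent(n=3, r=1, w=3)      # write-all, read-one
assert not is_strongly_consistent(n=3, r=1, w=1)  # fastest, eventually consistent
```

The exact parameter names (N, R, W) follow the Dynamo paper's convention; individual products expose them under their own configuration names.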

If scalability implies large data arriving at an incredibly fast pace, for example stock market tick data or advertisement click-tracking data, then column-family stores alone may not provide a complete solution. It's prudent to store the massively growing data in these stores and manipulate it using MapReduce operations for batch querying and data mining, but you may need something more nimble for fast writes and real-time manipulation. Nothing is faster than manipulating data in memory, so NoSQL options that keep data in memory and flush it to disk as it fills the available capacity are probably good choices. Both MongoDB and Redis follow this strategy: currently, MongoDB uses mmap and Redis implements a custom mapping from memory to disk, although both have actively been re-engineering their memory-mapping features and things will continue to evolve. Using MongoDB or Redis with HBase or Hypertable makes a good combination for a system that needs fast real-time data manipulation and a store for extensive data mining. Memcached and Membase can be used in place of MongoDB or Redis; they act as a layer of fast and efficient cache and therefore supplement column-family stores well. Membase has been used effectively with Hadoop-based systems for such use cases. With the merger of Membase and CouchDB, a well-integrated NoSQL product with both fast cache-centric features and distributed scalable storage-centric features is likely to emerge.
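The keep-in-memory, flush-to-disk pattern can be sketched as a tiny write-behind store. This is an assumption-laden illustration of the general strategy, not how MongoDB's mmap or Redis's persistence actually work: writes land in an in-memory map and are written out to a file once the in-memory set crosses a capacity threshold.

```python
import json, os, tempfile

# Toy write-behind store: fast in-memory writes, periodic flush to disk.
class MemoryFirstStore:
    def __init__(self, path, capacity=2):
        self.path = path
        self.capacity = capacity
        self.memory = {}   # hot, unflushed writes

    def put(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.capacity:
            self.flush()

    def flush(self):
        # Merge the in-memory set into whatever is already on disk.
        on_disk = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                on_disk = json.load(f)
        on_disk.update(self.memory)
        with open(self.path, "w") as f:
            json.dump(on_disk, f)
        self.memory.clear()

path = os.path.join(tempfile.mkdtemp(), "data.json")
store = MemoryFirstStore(path, capacity=2)
store.put("a", 1)   # stays in memory
store.put("b", 2)   # hits capacity; flushed to disk
```

The class name, file format, and threshold are all invented for the sketch; real products flush on timers, journal writes, or memory pressure rather than a simple item count.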

Although scalability is very important if your data grows to the size of Google's or Facebook's, not all applications become that large. Scalable systems are relevant at scales much smaller than these, but an attempt to make everything scalable can become an exercise in over-engineering. You certainly want to avoid unnecessary complexity.

In many systems, data integrity and transactional consistency are more important than any other requirements. Is NoSQL an option for such systems?

Transactional Integrity and Consistency

Transactional integrity is relevant only when data is modified, updated, created, and deleted. Therefore, the question of transactional integrity is not pertinent in pure data warehousing and mining contexts. This means that batch-centric Hadoop-based analytics on warehoused data is also not subject to transactional requirements.

Many data sets like web traffic log files, social networking status updates (including tweets or buzz), advertisement click-through imprints, road-traffic data, stock market tick data, and game scores are primarily, if not completely, written once and read multiple times. Data sets that are written once and read multiple times have limited or no transactional requirements.

Some data sets are updated and deleted, but often these modifications are limited to a single item and not a range within the data set. Sometimes, updates are frequent and involve a range operation. If range operations are common and integrity of updates is required, an RDBMS is the best choice. If atomicity at an individual item level is sufficient, then column-family databases, document databases, and a few distributed key/value stores can guarantee that. If a system needs transactional integrity but could accommodate a window of inconsistency, eventual consistency is a possibility.
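Item-level atomicity of the kind just described is commonly exposed as a check-and-set (sometimes called compare-and-swap) operation: an update succeeds only if the item is unchanged since the caller read it. This is a generic sketch of the idea, not any particular product's API; note that it protects a single item only, which is exactly why range updates need an RDBMS-style transaction instead.

```python
# Check-and-set: the single-item guarantee that column-family stores,
# document stores, and several key/value stores can offer.

def check_and_set(store, key, expected, new_value):
    """Update key only if its current value matches what we last read."""
    if store.get(key) != expected:
        return False  # someone else changed it; caller must re-read and retry
    store[key] = new_value
    return True

store = {"counter": 1}
assert check_and_set(store, "counter", 1, 2)       # succeeds
assert not check_and_set(store, "counter", 1, 3)   # stale expectation, rejected
assert store["counter"] == 2
```

In a real distributed store the comparison and write happen atomically on the server; the plain dictionary here merely illustrates the contract.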


Opponents of NoSQL cite the lack of ACID support in many scalable and robust non-relational databases as a major weakness. However, many data sets need little or no transactional support; such data sets gain immediately from the scalable and fluid architecture of the NoSQL options. The power of scalable parallel processing using MapReduce operations on these NoSQL databases can help manipulate and mine large data sets effectively. Don't let unwarranted worry about transactional integrity hold you back.

HBase and Hypertable offer row-level atomic updates and maintain consistent state with the help of Paxos-style coordination services (ZooKeeper in HBase's case, Hyperspace in Hypertable's). MongoDB offers document-level atomic updates. NoSQL databases that follow a master-slave replication model, by routing writes through a single master, also avoid many of the consistency conflicts that multi-master setups introduce.

Data Modeling

An RDBMS offers a consistent way of modeling data: relational algebra underlies the data model, the theory is well established, and the implementation is standardized. Therefore, consistent ways of modeling and normalizing data are well understood and documented. In the NoSQL world there is no such standardized and well-defined data model, because the NoSQL products neither intend to solve the same problems nor share the same architecture.

If you need an RDBMS-centric data model for storage and querying and cannot under any circumstances step outside those definitions, just don’t use NoSQL. If, however, you are happy with SQL-type querying but can accommodate non-relational storage models, you have a few NoSQL options to choose from.

Document databases, like MongoDB, provide a gradual adoption path from formal RDBMS models to loose document-centric models. MongoDB supports SQL-like querying, rudimentary relational references, and database objects that draw a lot of inspiration from the standard table-and-column model. If a relaxed schema is your primary reason for using NoSQL, then MongoDB is a great option for getting started with NoSQL.

MongoDB is used by many web-centric businesses. Foursquare is perhaps its most celebrated user; Shutterfly, bit.ly, etsy, and sourceforge are a few others that add feathers to MongoDB's cap. In many of these cases MongoDB is preferred because it supports a flexible data model and offers speedy reads and writes. Web applications often evolve rapidly, and it gets cumbersome for developers to continuously change the underlying RDBMS models, especially when the changes are frequent and at times drastic. Added to the schema-change challenges are the issues relating to data migration.

MongoDB has good support for web framework integration. Rails, one of the most popular web application frameworks, can be used effectively with MongoDB. The data from Rails applications can be persisted via an object mapper. Therefore, MongoDB can easily be used in place of an RDBMS. Read about Rails 3 integration at www.mongodb.org/display/DOCS/Rails+3+-+Getting+Started.

For Java web developers, Spring offers first-class support for MongoDB via its Spring Data project. Read more about the Spring Data Document release that supports MongoDB at www.springsource.org/node/3032. The Spring Data project, in fact, adds support for a number of NoSQL products, not just MongoDB: it integrates Spring with Redis, Riak, CouchDB, Neo4j, and Hadoop. Get more details online at the Spring Data project homepage, www.springsource.org/spring-data.

MongoDB acts like a persistent cache, where data is kept in memory and flushed to disk as required. Therefore, MongoDB can also be thought of as an intermediate option between an RDBMS and an in-memory store or a flat file structure. Many web applications, such as real-time analytics, comment systems, ratings storage, content management software, user data systems, and event logging, benefit from the fluid schema that MongoDB offers. Added to that, such applications enjoy MongoDB's RDBMS-like querying capabilities and its ability to segregate data into collections that resemble tables.
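The value of a fluid schema is easy to see in miniature: documents in one collection need not share fields, and queries simply skip documents that lack the queried field. The sketch below is a pure-Python stand-in for this behavior (the data and the `find` helper are invented for illustration, not MongoDB's query engine), showing that adding a new field to some documents requires no migration of the rest.

```python
# A "collection" of comment documents with differing shapes: no
# ALTER TABLE was needed to add rating or flagged to some of them.
comments = [
    {"author": "ann", "text": "Nice post", "rating": 5},
    {"author": "bob", "text": "Agreed"},                      # no rating field
    {"author": "cal", "text": "Meh", "rating": 2, "flagged": True},
]

def find(collection, **criteria):
    """Return documents whose fields match all criteria; documents
    missing a queried field simply don't match."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

assert find(comments, rating=5)[0]["author"] == "ann"
assert find(comments, flagged=True)[0]["author"] == "cal"
assert find(comments, rating=99) == []
```

MongoDB's actual query documents are richer (operators like `$gt`, nested fields), but the schema-free matching shown here is the essential idea.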

Apache CouchDB is a document database alternative to MongoDB. Apache CouchDB is now available as Couchbase server, with the primary creators of CouchDB having recently merged their company, CouchOne, with Membase, Inc. Couchbase offers a packaged version of Apache CouchDB with GeoCouch and support in the form of Couchbase Server.

Couchbase Server epitomizes adherence to web standards. Couchbase’s primary interface to the data store is through RESTful HTTP interactions and is more web-centric than any database has ever been. Couchbase includes a web server as a part of the data store. It is built on top of Erlang OTP. This means you could effectively create an entire application using Couchbase. Future versions of Couchbase will be adding access to the data store through the Memcached protocol, gaining from Membase’s ability to manage speed and throughput with a working set. Couchbase also plans to scale up, growing from Membase’s elastic capabilities to seamlessly span across more nodes. Although Couchbase is very powerful and feature-rich, it has a very small footprint. Its nimble nature makes it appropriate for installation on a smartphone or an embedded device. Read more about mobile Couchbase at www.couchbase.com/products-and-services/mobile-couchbase.

Couchbase's model supports REST-style data management. A database in CouchDB can contain documents in JSON format, with additional metadata or supporting artifacts as attachments. All operations on data (create, retrieve, update, and delete) are performed via RESTful HTTP requests. Long-running complex queries across replicated Couchbase servers leverage MapReduce.
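CouchDB expresses those MapReduce queries as views: a map function is applied to every document and emits key/value pairs, and an optional reduce function aggregates the values for each key. Real views are typically written in JavaScript and stored in design documents; the Python stand-in below (with made-up documents) only sketches the computation model.

```python
from collections import defaultdict

docs = [
    {"type": "post", "tags": ["nosql", "couchdb"]},
    {"type": "post", "tags": ["nosql"]},
    {"type": "comment"},                      # no tags; map emits nothing
]

def map_fn(doc):
    # Emit one (key, value) pair per tag, like CouchDB's emit(tag, 1).
    for tag in doc.get("tags", []):
        yield (tag, 1)

def reduce_fn(values):
    return sum(values)

# Apply map to every document, group by key, reduce each group.
groups = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        groups[key].append(value)
counts = {k: reduce_fn(v) for k, v in groups.items()}
assert counts == {"nosql": 2, "couchdb": 1}
```

In CouchDB the view result is computed incrementally and persisted in a B-tree, so repeated queries don't re-run the map over every document; the sketch recomputes from scratch for clarity.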


REST, which stands for Representational State Transfer, represents a style of software architecture suitable for distributed hypermedia systems like the world wide web. The term REST was introduced and defined by Roy Fielding as a part of his PhD thesis. Read more about REST at www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.

Not Just a Map

In typical in-memory databases and caches, the most well-known data structure is a map or hash, which stores key/value pairs and allows fast, easy access to data. In-memory NoSQL stores provide filesystem-backed persistence of in-memory data, which means stored data survives a system reboot. Many in-memory NoSQL databases also support data structures beyond maps, making them attractive for much more than simple caching.

At the most basic level, Berkeley DB stores binary key/value pairs. The underlying store itself does not attach any metadata to the stored pairs. Layers on top of the basic storage, like the persistence API or the object wrappers, allow higher-level abstractions to be persisted to a Berkeley DB store.

Membase, on the other hand, supports the Memcached protocol, both text and binary, and adds features around distributed replica management and consistent hashing on top of the basic key/value store. Membase also adds the ability to grow and shrink the number of servers as part of a cluster without interrupting client access. Redis takes a slightly different approach. It supports most popular data structures out of the box. In fact, it is defined as a “data structure” server. Redis supports lists, sets, sorted sets, and strings in addition to maps. Redis has even added transaction-like capabilities to specify atomicity across a number of discrete operations.
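The consistent hashing that Membase-style clusters rely on can be sketched compactly. The idea is to place both nodes and keys on a hash ring so that each key deterministically maps to a node, and adding or removing a node remaps only a fraction of the keys. This is a generic textbook sketch (Membase actually uses a vbucket scheme rather than a raw ring), with virtual-node replication to smooth the distribution.

```python
import bisect
import hashlib

def _hash(value):
    # Any stable hash works; md5 gives a wide, well-spread integer.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=100):
        # Each physical node appears `replicas` times on the ring
        # (virtual nodes) to even out the key distribution.
        self.ring = sorted((_hash("%s:%d" % (node, i)), node)
                           for node in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key):
        # Walk clockwise to the first ring position at or past the key's hash.
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
assert ring.node_for("user:42") == ring.node_for("user:42")  # deterministic
```

The node names and replica count are illustrative; the property that matters is that removing "node-b" would reassign only the keys that hashed to its ring positions, leaving the rest untouched.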

If your use case gains from using a file-backed in-memory NoSQL product, consider the supported data models to make a choice on the best fit. In many cases, a key/value storage is enough, but if you need more than that look at Berkeley DB, Membase, and Redis. If you need a powerful and stable distributed key/value store to support large user and activity load, you are not likely to go wrong with Membase.

What about HBase and Hypertable?

In the previous section on scalability, I gave my entire vote in favor of the column-family stores. When it comes to supporting rich data models, though, they are generally not the most favorable choices. The upfront choice of row-keys for lookup, and metadata support limited to column-families, is usually considered inadequate. With a powerful abstraction layer on top of a column-family store, however, a lot becomes possible.

Google started the column-family store revolution. Google also created the data modeling abstraction on top of its column-family store for its very popular app engine. The GAE data modeling support provides rich data modeling using Python and Java. (Chapter 10 has details on this topic.) With the DataNucleus JDO and JPA support, you can use the popular object modeling abstractions in Java to persist data to HBase and Hypertable. You can also draw inspiration from the non-relational support in Django that works well on the app engine.

Querying Support

Storage is one part of the puzzle. The other is querying the stored data. Easily and effectively querying data is almost mandatory for any database to be considered seriously. It can be especially important when building the operational data store for applications with which people are interacting. An RDBMS thrives on SQL support, which makes accessing and querying data easy. Standardized syntax and semantics make it an attractive choice. The first chapter in this book talks about the quest for a SQL-like query language in the world of NoSQL and the subsequent chapters show how it is implemented.

Among document databases, MongoDB provides the best querying capabilities. Best is a relative term, and developers argue about what they consider superior, but I base my judgment on three factors: similarity to SQL, an easy syntax, and an easy learning curve. CouchDB's querying capabilities are equally powerful and rather straightforward once you understand the concepts of views and design documents. However, the concept of views as CouchDB defines it is new and can pose initial challenges to developers.

For key/value pairs and in-memory stores, nothing is more feature-rich than Redis as far as querying capabilities go. Redis has one of the most exhaustive sets of methods available for querying the data structures it stores. To add icing to the cake, it is all nicely documented. Read about the access methods at http://redis.io/commands.

Column-family stores like HBase have little to offer as far as rich querying capabilities go. However, an associated project called Hive makes it possible to query HBase using SQL-like syntax and semantics. Chapter 12 covers Hive. Hypertable defines a query language called HQL and also supports Hive.

Bringing Hive into the mix raises the question of manipulating data for operational usage versus accessing it for batch processing and business intelligence. Hive is not an interactive tool in the way SQL is for an RDBMS. Hive resembles SQL in syntax but is really a way to abstract MapReduce-style manipulations: it allows you to use SQL-like, predicate-driven syntax instead of map and reduce function definitions to carry out batch data manipulation operations on the data set.
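To make the abstraction concrete, consider what a Hive-style statement such as `SELECT page, COUNT(*) FROM hits GROUP BY page` actually compiles down to: a map phase that emits a key per record, a shuffle that groups by key, and a reduce that aggregates each group. The Python below (with made-up hit records) spells out those three steps explicitly; Hive's value is that you write only the one-line query.

```python
hits = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]

# map phase: emit (page, 1) for every record
mapped = [(h["page"], 1) for h in hits]

# shuffle phase: group emitted pairs by key
shuffled = {}
for page, one in mapped:
    shuffled.setdefault(page, []).append(one)

# reduce phase: aggregate each group, here a COUNT(*)
counts = {page: sum(ones) for page, ones in shuffled.items()}
assert counts == {"/home": 2, "/about": 1}
```

On a real cluster each phase runs distributed across many nodes and the shuffle moves data over the network, which is why these jobs are batch-oriented rather than interactive.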

Access and Interface Availability

MongoDB has the notion of drivers. Drivers for most mainstream libraries are available for interfacing and interacting with MongoDB. CouchDB uses web-standard ways of interaction and so you can connect to it using any programming language that supports the web idiom of communication. Wrappers for some languages make communication to CouchDB work like drivers for MongoDB, though CouchDB always has the RESTful HTTP interface available.

Redis, Membase, Riak, HBase, Hypertable, Cassandra, and Voldemort have language bindings for most mainstream languages. Many of these wrappers use language-independent service layers like Thrift or serialization mechanisms like Avro under the hood, so it becomes important to understand the performance characteristics of the various serialization formats.

One good benchmark that provides insight into the performance characteristics of serialization formats on the JVM is the jvm-serializers project at https://github.com/eishay/jvm-serializers/wiki/. The performance measures from this project cover a number of data formats.

The performance runs are on a JVM, but the results may be relevant to other platforms as well. The results show that protobuf, protostuff, Kryo, and manual serialization are among the most efficient for serialization and de-serialization. Kryo and Avro are among the formats that are most efficient in terms of serialized size and compressed size.
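Serialized size is easy to measure yourself for any format you have at hand. Protobuf, Avro, and Kryo are not in the Python standard library, so the quick comparison below uses JSON and pickle as stand-ins to show the measurement technique itself, on a made-up record:

```python
import json
import pickle

record = {"user_id": 12345, "name": "alice", "tags": ["a", "b", "c"]}

# Serialize the same record in two formats and compare byte counts.
json_bytes = json.dumps(record).encode("utf-8")
pickle_bytes = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

print("json:", len(json_bytes), "bytes")
print("pickle:", len(pickle_bytes), "bytes")

# Sanity check: the serialized form must round-trip to the original.
assert json.loads(json_bytes.decode("utf-8")) == record
assert pickle.loads(pickle_bytes) == record
```

The same loop-and-measure approach, applied with a representative record and your candidate formats, gives numbers directly relevant to your own wire and storage costs.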

Having gained a view into the performance of formats, the next section segues into benchmarks of NoSQL products themselves.

BENCHMARKING PERFORMANCE

The Yahoo! Cloud Serving Benchmark (YCSB) is the best-known benchmarking infrastructure for comparing NoSQL products. It's not without limitations, but it does provide well-rounded insight into how the different NoSQL products stack up. The YCSB toolkit contains two important utilities:

  • A workload generator
  • Sample load that the workload generator uses

The YCSB project is online at https://github.com/brianfrankcooper/YCSB. Yahoo! has run tests on a number of popular NoSQL products as a part of the benchmark. The last published results include the following:

  • Sherpa/PNUTS
  • Bigtable-like systems (HBase, Hypertable, HTable, Megastore)
  • Azure
  • Apache Cassandra
  • Amazon Web Services (S3, SimpleDB, EBS)
  • CouchDB
  • Voldemort
  • Dynomite
  • Tokyo Cabinet
  • Redis
  • MongoDB

The tests are carried out in a tiered manner, measuring latency and throughput at each tier. Tier 1 focuses on performance by maximizing workload on fixed hardware: the hardware is kept constant and the workload is increased until the hardware is saturated. Tier 2 focuses on scalability: hardware is added as the workload increases, and these benchmarks measure latency as workload and hardware availability are scaled up proportionally.

Workloads have different configurations for measuring performance and scalability in a balanced and exhaustive manner. The popular test cases are illustrated next.

50/50 Read and Update

A 50/50 read and update scenario can be considered an update-heavy test case. Results show that under this test case Apache Cassandra outperforms the competition on both read and update latencies. HBase comes close but stays behind Cassandra. Cassandra is able to perform more than 10,000 operations (50/50 read and update) per second with an average read latency of around 25 milliseconds. Updates fare even better than reads, with an average latency of just over 10 milliseconds for the same workload of more than 10,000 operations per second. YCSB includes MySQL in addition to the NoSQL products. Although I ignore the RDBMS versus NoSQL benchmarks in this chapter, it's interesting to see that MySQL's read and update latencies are comparable until around 4,000 operations per second, but latency tends to increase quickly as the numbers grow to more than 5,000 operations per second.
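The shape of such a test is easy to see in a toy driver: issue a mix of reads and updates against a store and record per-operation latency. The real YCSB workload generator is far richer (key distributions, record sizes, target throughput throttling), so the sketch below, with its invented key space and plain dictionary store, only illustrates the structure of a 50/50 run:

```python
import random
import time

def run_workload(store, operations=1000, read_fraction=0.5, seed=7):
    """Issue a read/update mix against `store` and return mean latency (s)."""
    rng = random.Random(seed)     # seeded so runs are repeatable
    latencies = []
    for i in range(operations):
        key = "user%d" % rng.randrange(100)
        start = time.perf_counter()
        if rng.random() < read_fraction:
            store.get(key)        # read path
        else:
            store[key] = i        # update path
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

avg = run_workload({})
assert avg >= 0.0
```

Swapping the dictionary for a real database client (and the in-process call for a network round trip) is what turns this skeleton into a meaningful benchmark; the latency being measured must include everything the application would actually wait for.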

95/5 Read and Update

A 95/5 read and update test case is a read-heavy case. This test case concurs with a few of the theories stated in this book, such as the claim that sorted ordered column-family stores perform best for contiguous range reads. HBase seems to deliver consistent read performance irrespective of the number of operations per second, and the 5 percent updates in HBase have practically no latency. MySQL delivers the best performance for read-only cases, possibly because data is returned from cache. Combining HBase with a distributed cache like Memcached or Membase could match MySQL read performance and scale better with increased workloads. Cassandra demonstrates impressive performance in the read-heavy case as well, outperforming HBase in the tests. Remember, though, that Cassandra follows an eventual consistency model and all writes are appended to the commit log.

Scans

HBase is meant to outperform other databases for scans, both short scans of 1–100 records and range scans, and the test confirms that. Cassandra shows unpredictable performance as far as scans go.

Scalability Test

As workloads increase and hardware is added, performance is fairly consistent for Cassandra and HBase, although some results show HBase being unstable with fewer than three nodes. An important aspect of adding hardware is elasticity, which measures how data gets rebalanced as additional nodes are added. Cassandra seems to perform poorly here and can take long periods of time to stabilize. HBase shows consistent performance, though rebalancing is affected by compaction.

As mentioned earlier, the performance tests tell a story but basing decisions solely on the tests is possibly misleading. Also, products are continuously evolving and tests run on different versions of a product produce different results. Combining performance measures with feature-based comparison is likely a more prudent choice than depending on either alone.


The Hypertable tests are not part of the YCSB tests; the two are separate and independent. YCSB tests are broad-based and apply to a number of NoSQL and RDBMS products, whereas the Hypertable tests focus on testing the performance of sorted ordered column-family stores.

Hypertable Tests

The Hypertable team carried out a set of tests to compare and contrast HBase and Hypertable, two Google Bigtable clones. The tests provide interesting insights. The tests carried out were in line with what the research paper on Google Bigtable proposed. Read section 7 of the Bigtable research paper, available online at http://labs.google.com/papers/bigtable.html, to understand the tests.

The results consistently demonstrated that Hypertable outperformed HBase in most measures. You can access details on the tests and the results at www.hypertable.com/pub/perfeval/test1/. Some significant findings are explained next.

Hypertable dynamically adjusts how much memory it allocates to each subsystem depending on the workload. For read-intensive cases, Hypertable allocates most of the memory to the block cache, whereas HBase has a fixed cache allocation of 20 percent of the Java heap. Viewed from the latency standpoint, it becomes clear that Hypertable consistently shows less latency than HBase does, and the difference is stark when the data size is smaller. In the lower limit case of only 2 GB of data, all data can be loaded into the cache.

The results of tests that compared Hypertable and HBase for random write, sequential read, and scan performance also showed that Hypertable performed better in each of these cases. When you run a clustered data store to manage your large data, performance differences like these can have cost ramifications: better performance can translate to lower compute-cycle and resource consumption, which means greater cost savings.

Numerous other benchmarks from various product vendors are also available.

CONTEXTUAL COMPARISON

The previous two sections compared the NoSQL options on the basis of features and benchmarks. This section provides contextual information that relates a few NoSQL products to the conditions that led to their creation and evolution.

Not all NoSQL products are equal. Not all NoSQL products stack up equally either in terms of features or benchmarks. However, each NoSQL product has its own history, motivation, use case, and unique value proposition. Aligning yourself with these viewpoints, and especially with the product’s history and evolution, will help you understand which NoSQL product is suitable for the job at hand.

For the popular document databases, explore the following online resources:

  • CouchDB — Watch a video (www.erlang-factory.com/conference/SFBayAreaErlangFactory2009/speakers/DamienKatz) from an Erlang Factory’s 2009 session, where CouchDB founder Damien Katz talks about the History of CouchDB development from a very personal point of view. He talks about the inspirations for CouchDB and why he decided to move his wife and kids to a cheaper place and live off savings to build the database. He talks about the decision to switch to Erlang and the transition to joining the Apache Foundation. The video brings to light the motivations and reasons for the product’s existence.
  • MongoDB — Read the unofficial history of MongoDB that Kristina Chodorow wrote on her blog: www.snailinaturtleneck.com/blog/2010/08/23/history-of-mongodb/.

Similar histories exist for the established key/value-centric databases.

The history of Bigtable and Dynamo clones — HBase, Hypertable, Cassandra, and Riak — is primarily that of an attempt to copy the success of Google and Amazon. Studying the initial history of evolution of these clones does not reveal anything substantially more than a quest to copy the good ideas that emerged at Google and Amazon. Certainly, copying the ideas wasn’t easy and involved a process of discovery and innovation. As users leverage these products for newer and varied use cases, the products continue to rapidly evolve. Evolution of these products is likely to introduce many newer features beyond those implemented as a part of the original inspiration.

NoSQL is a young and emerging domain and although understanding the context of a product’s evolution in the domain is beneficial, a lot of the history of many of the NoSQL products is still being written.

SUMMARY

This chapter provided a terse yet meaningful comparison of the popular NoSQL products. It does not claim to be exhaustive or to promise a panacea for all problems. Adoption of a NoSQL product needs to be done with care and only after understanding the product's features, performance characteristics, and history.

The chapter did not explain every feature or provide a model for choosing a product. Instead, it built on what was covered in the previous chapters of this book, highlighting a few important facts and summarizing essential viewpoints. The decision, as always, should and must be yours.
