Chapter 2. Cassandra Architecture

This chapter puts the evolution of the NoSQL paradigm into perspective. It starts with a discussion of the common problems that an average developer faces when an application starts to scale up and the software components cannot keep up with it. Then we will look at what can be taken as a rule of thumb in the NoSQL world: the CAP theorem, which states that a distributed system can provide only two out of consistency, availability, and partition tolerance. As we discuss it, we will see that it is often more important to serve customers (availability) than to be correct (consistency) all the time. However, we cannot afford to be wrong (inconsistent) for long: customers would not like to see that items are in stock, only to have the checkout fail. This is where Cassandra comes into the picture with its tunable consistency.
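As a taste of what tunable consistency looks like in practice, here is a minimal sketch using the DataStax Python driver. The contact point, keyspace, and table names are made up for illustration; the point is only the per-request consistency-level mechanism.

# A minimal sketch of tunable consistency with the DataStax Python driver
# (pip install cassandra-driver). Keyspace and table names are illustrative.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('store')   # hypothetical keyspace

# A write that must be acknowledged by a majority of replicas.
insert = SimpleStatement(
    "INSERT INTO stock (item_id, quantity) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, ('sku-42', 100))

# A read satisfied by any single replica: fast and highly available,
# but possibly slightly stale.
select = SimpleStatement(
    "SELECT quantity FROM stock WHERE item_id = %s",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(select, ('sku-42',)).one().quantity)

cluster.shutdown()

Because a QUORUM write and a QUORUM read are guaranteed to overlap on at least one replica, raising or lowering these levels per request is how Cassandra lets you trade latency and availability against how stale a read is allowed to be.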

We then take a quick look at everything that goes on when a read or a write (mutation) happens, which leaves us with a lot of fancy terms. Next, we examine those terms in full, with explanations, as we walk through the various parts of Cassandra's design. We will also see how close, and yet how far, Cassandra is from its precursors and inspirations, Google's BigTable and Amazon's Dynamo. Along the way, we meet some modern and efficient data structures, such as Bloom filters, Merkle trees, and log-structured merge trees, and algorithms such as the gossip protocol and phi accrual failure detection.

Problems in the RDBMS world

RDBMS is a great approach. It keeps data consistent, is good for OLTP (http://en.wikipedia.org/wiki/Online_transaction_processing), and provides a well-understood grammar (SQL) for accessing and manipulating data that is supported by all the popular programming languages. It has been tremendously successful for the last 40 years (the relational data model in its first avatar: Codd, E.F. (1970), A Relational Model of Data for Large Shared Data Banks). However, in the early 2000s, big companies such as Google (BigTable, http://research.google.com/archive/bigtable.html) and Amazon, whose databases had to serve gigantic loads, started to feel bottlenecked by RDBMS.

If you have ever used an RDBMS for a non-trivial web application, you have probably faced problems such as slow queries due to complex joins, expensive vertical scaling, and difficulties in scaling horizontally. As the data grows, indexing and index maintenance take longer and longer. At some point you choose to replicate the data, but writes still involve locking, and this hurts availability: under heavy load, lock contention causes the user experience to deteriorate.

Although replication gives some relief, a busy slave may not be able to catch up with the master (or there may be a connectivity glitch between the master and the slave), so the consistency of such systems cannot be guaranteed (replication in MySQL is described at http://www.databasejournal.com/features/mysql/article.php/3355201/Database-Replication-in-MySQL.htm). As the application grows, the demand to scale the backend becomes more pressing, and the developer teams decide to add a caching layer (such as Memcached) on top of the database. This takes some load off the database, but now the developers need to maintain object state in two places: the database and the caching layer. Although some ORMs provide a built-in caching mechanism, they have their own issues: a larger memory footprint and application code polluted with mapping logic. To squeeze more out of the RDBMS, we start to denormalize the database to avoid joins and keep precomputed aggregates in columns to avoid statistical queries.
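To make the two-places-to-maintain-state problem concrete, here is a hypothetical cache-aside sketch; the cache and db handles and their methods are illustrative stand-ins for Memcached and the RDBMS, not any particular library's API.

# Hypothetical cache-aside pattern; the cache/db handles are illustrative
# stand-ins, not a specific Memcached client or ORM API.
import json

CACHE_TTL = 300  # seconds

def get_user(cache, db, user_id):
    # Read path: try the cache first, fall back to the database.
    cached = cache.get("user:%s" % user_id)
    if cached is not None:
        return json.loads(cached)
    row = db.query_one("SELECT id, name, email FROM users WHERE id = %s",
                       (user_id,))
    cache.set("user:%s" % user_id, json.dumps(row), expire=CACHE_TTL)
    return row

def update_user(cache, db, user_id, name, email):
    # Write path: the application must now update BOTH places; forgetting
    # the cache invalidation leaves stale reads until the TTL expires.
    db.execute("UPDATE users SET name = %s, email = %s WHERE id = %s",
               (name, email, user_id))
    cache.delete("user:%s" % user_id)

Every new query path repeats this bookkeeping, which is exactly the duplication of state described above.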

Sharding, or horizontal scaling, is another way to distribute the load. Sharding in itself is a good idea, but it adds a lot of manual work, and knowledge of the sharding scheme creeps into the application code. Sharded databases also make operational tasks (backup, schema alteration, and adding indexes) difficult (the hardships of sharding are discussed at http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/).
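To illustrate how the sharding scheme leaks into application code, here is a hypothetical routing sketch; the shard list, hashing scheme, and connect helper are assumptions made for the example.

# Hypothetical application-level shard routing; the DSN list and hashing
# scheme are illustrative assumptions, not a real library's API.
import zlib

SHARD_DSNS = [
    "mysql://shard0.db.internal/app",
    "mysql://shard1.db.internal/app",
    "mysql://shard2.db.internal/app",
]

def shard_for(user_id):
    # The application, not the database, decides where a row lives.
    return zlib.crc32(str(user_id).encode()) % len(SHARD_DSNS)

def fetch_orders(connect, user_id):
    # Every query path must route to the right shard; cross-shard work
    # (reports, joins) has to fan out to all shards and merge by hand.
    conn = connect(SHARD_DSNS[shard_for(user_id)])
    return conn.query("SELECT * FROM orders WHERE user_id = %s", (user_id,))

Note that adding a shard changes what shard_for returns for existing keys, so rows have to be migrated by hand, which is part of the operational burden mentioned above.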

There are ways to loosen consistency in an RDBMS by choosing among various isolation levels, but concurrency is only one part of the problem. Maintaining relational integrity across machines, managing data that cannot fit on a single machine, and painful recovery all made traditional database systems hard to accept in the rapidly growing Big Data world. Companies needed a tool that could reliably support hundreds of terabytes of data on ever-failing commodity hardware.
