Chapter 1. Instant Apache Cassandra for Developers Starter

Welcome to Instant Apache Cassandra for Developers Starter. This book consolidates the information you need to adopt and work with Cassandra. It covers the basics required to understand Cassandra, walks you through building your first application, and shares some tips and tricks for tuning Cassandra.

This document contains the following sections:

So, what is Cassandra? explains what Cassandra is all about, discusses its offerings, and highlights the use cases for which Cassandra is the best fit.

Installation shows how to download and install Cassandra quickly and how to configure it for your project's requirements and specifications.

Quick start – Creating your first Java application describes how to perform the core tasks of Cassandra. It also discusses Cassandra internals and use case implementation.

Top features you’ll want to know about discusses the important Cassandra features as well as performance tweaks.

People and places you should get to know provides you with many useful links to the project page and forums, as well as a number of helpful articles, tutorials, and blogs.

So, what is Cassandra?

In recent years, "Big data" has become quite a buzzword. The relational database has dominated application development for the last 25 to 30 years. Massive data growth, together with the need to scale and to perform analytics over large data sets, is the reason behind the emergence of NoSQL (Not Only SQL).

Cassandra is a column-oriented distributed database management system (DBMS). It was open-sourced by Facebook in 2008, entered the Apache Incubator in 2009, and became a top-level Apache project in 2010. It is designed to handle large data volumes distributed across multiple machines (commodity servers), and ensures high data availability with no Single Point of Failure (SPOF). Cassandra's data model is inspired by Google's Bigtable, and its distributed design, including the gossip-based communication protocol, by Amazon's Dynamo.

Cassandra offerings

Cassandra offers a rich feature set, including the following:

  • Evolving schema: The schema does not have to be defined up front and can evolve as you process the data. A Cassandra column family might look similar to a table in a traditional relational DBMS (RDBMS), but columns within a column family can be created dynamically.
  • No single point of failure: Cassandra is decentralized and data is distributed across data nodes. All data nodes are equal and there is no master-slave configuration. So, even if one data node goes down, subsequent read/write requests can be served by other nodes in the ring.
  • High availability: This means that data is available all the time with minimum downtime. The replication factor sets how many nodes hold a copy of each piece of data (data is replicated across the nodes). So, even if one node goes down, another one is ready to serve the requests.
  • Data partitioning: One important factor for a distributed DBMS is data partitioning. Data nodes in Cassandra are arranged in a ring (based on ring topology), where the data range is divided among the nodes according to a selected partitioning scheme. Cassandra offers two types of partitioning schemes, namely, Random data partitioning and Ordered partitioning. We will cover them in the Cassandra storage architecture.
  • Configurable consistency: Though Cassandra prefers availability and partition tolerance over consistency, it provides a mechanism for the client application to tune/configure consistency on read/write requests. We will discuss this in detail in the Cassandra storage architecture.
  • Scalability: Cassandra prefers "scale out" over "scale up." Scale out, or horizontal scaling, means adding more nodes to a cluster. Because Cassandra requires a bare minimum of administration, you can add commodity servers to increase throughput.
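
The partitioning, replication, and consistency ideas above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the ToyRing class, the token_for and overlaps helpers, the node names, and the evenly spaced tokens are all hypothetical names and choices made for illustration. Keys are hashed with MD5 to mimic random partitioning, each key is placed on its owner node plus the next nodes clockwise until the replication factor is met, and a read is guaranteed to touch a replica that saw the latest write when the number of replicas read plus the number written exceeds the replication factor.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space of this toy ring (an assumption of the sketch)


def token_for(key):
    """Hash a row key to a position on the ring (random-partitioning style)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class ToyRing:
    """Hypothetical ring: each node owns the token range ending at its own token."""

    def __init__(self, node_names, replication_factor):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        self.nodes = list(node_names)
        # Space the node tokens evenly so every node owns an equal range.
        step = RING_SIZE // len(self.nodes)
        self.tokens = [i * step for i in range(len(self.nodes))]

    def replicas_for(self, key, down=()):
        """Return the live replicas for key: the owner plus the next nodes clockwise."""
        # First node whose token is >= the key's token; wrap around past the last node.
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.nodes)
        replicas = [self.nodes[(start + i) % len(self.nodes)]
                    for i in range(self.replication_factor)]
        return [node for node in replicas if node not in down]


def overlaps(read_replicas, write_replicas, replication_factor):
    """True when the replicas read and the replicas written must share a node."""
    return read_replicas + write_replicas > replication_factor


# Usage: four equal nodes, three copies of every row.
ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas_for("user:42")
print(owners)                                          # three distinct nodes
print(ring.replicas_for("user:42", down={owners[0]}))  # two replicas still serve the key
print(overlaps(2, 2, 3))  # True: 2 of 3 replicas on both reads and writes
print(overlaps(1, 1, 3))  # False: one replica each way may miss the latest write
```

Losing one node leaves two replicas able to serve the key, which is the "no single point of failure" property in miniature. The overlaps check shows why consistency is tunable: the client trades latency for stronger guarantees by choosing how many replicas must answer each request.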

Cassandra use cases

Cassandra is the best fit in the following cases:

  • The application's data growth is massive and scalability is an issue
  • Data access over the distributed database does not provide the desired performance
  • Processing real-time analytics over large schemaless/unorganized data sets (for example, application and activity logs)
  • Scalability is preferred over a rich relational schema design
  • High availability of data is a requirement