Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems (RDBMSs).
Cassandra first started as an Incubator project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and steadily made releases up to the milestone 3.0 release. Since 2017, the project has been led by Apache Cassandra Project Chair Nate McCall, producing releases 3.1 through the latest 4.0 release. Cassandra is being used in production by some of the biggest companies on the Web, including Facebook, Twitter, and Netflix.
Its popularity is due in large part to the outstanding technical features it provides. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a data model based on the Cassandra Query Language (CQL).
This book is intended for a variety of audiences. It should be useful to you if you are:
A developer working with large-scale, high-volume applications, such as Web 2.0 social applications, ecommerce sites, financial services, or sensor-based Internet of Things (IoT) systems
An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores
A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store
A manager who wants to understand the advantages (and disadvantages) of Cassandra to help make decisions about technology strategy
A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options
This book is a technical guide. In many ways, Cassandra and other NoSQL databases represent a new way of thinking about data. Many developers who gained their professional chops in the last 15–20 years have become well versed in thinking about data in purely relational or object-oriented terms. Cassandra’s data model is different and can be difficult to wrap your mind around at first, especially for those of us with entrenched ideas about what a database is (and should be).
Using Cassandra does not mean that you have to be a Java developer. However, Cassandra is written in Java, so if you’re going to dive into the source code, a solid understanding of Java is crucial. Many of the examples in this book are in Java, but Cassandra drivers are available in a wide variety of languages, including Java, Node.js, Python, C#, PHP, Ruby, and Go.
Finally, it is assumed that you have a good understanding of how the Web works, can use an integrated development environment (IDE), and are somewhat familiar with the typical concerns of data-driven applications. You might be a well-seasoned developer or administrator but still, on occasion, encounter tools used in the Cassandra world that you’re not familiar with. For example, Apache Ant is used to build Cassandra, and the Cassandra source code is available via Git. In cases where we speculate that you’ll need to do a little setup of your own in order to work with the examples, we try to support that.
This book is designed with the chapters acting, to a reasonable extent, as standalone guides. This is important for a book on Cassandra, which has a variety of audiences in different job roles and industries. To borrow from the software world, the book is designed to be modular. If you’re new to Cassandra, it makes sense to read the book in order; if you’ve passed the introductory stages, you will still find value in later chapters, which you can read as standalone guides.
Here is how the book is organized:
This chapter reviews the history of the enormously successful relational database and the rise of non-relational database technologies like Cassandra.
This chapter introduces Cassandra and discusses what’s exciting and different about it, where it came from, and what its advantages are.
This chapter walks you through installing Cassandra, getting it running, and trying out some of its basic features.
Here we look at Cassandra’s data model, highlighting how it differs from the traditional relational model. We also explore how this data model is expressed in the Cassandra Query Language (CQL).
This chapter introduces principles and processes for data modeling in Cassandra. We analyze a well-understood domain to produce a working schema.
This chapter helps you understand what happens during read and write operations and how the database accomplishes some of its notable aspects, such as durability and high availability. We go under the hood to understand some of the more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees, and more.
In order to help make some of Cassandra’s architecture concepts more concrete, we’ll explore some of the common ways in which Cassandra figures into the architecture and design of modern cloud applications.
There are a variety of drivers available for different languages, including Java, Node.js, Python, Ruby, C#, and PHP, in order to abstract Cassandra’s lower-level API. We help you understand how to use common driver features to develop applications with Cassandra.
We build on the previous chapters to learn how Cassandra works “under the covers” to read and write data. We’ll also discuss concepts such as batches, lightweight transactions, and paging.
This chapter shows you how to specify partitioners, replica placement strategies, and snitches. We set up a cluster and see the implications of different configuration choices. We’ll discuss how to plan your cluster deployments, including hybrid and multi-cloud deployments using providers such as Amazon, Microsoft, and Google, as well as deploying and managing clusters using Docker and Kubernetes.
Once your cluster is up and running, you’ll want to monitor its usage, memory patterns, and thread patterns, and understand its general activity. Cassandra has a rich Java Management Extensions (JMX) interface baked in, which we put to use to monitor all of these and more.
The ongoing maintenance of a Cassandra cluster is made somewhat easier by some tools that ship with the server. We see how to decommission a node, load balance the cluster, get statistics, and perform other routine operational tasks.
One of Cassandra’s most notable features is its speed—it’s very fast. But there are a number of things, including memory settings, data storage, hardware choices, caching, and buffer sizes, that you can tune to squeeze out even more performance.
NoSQL technologies are often slighted as being weak on security. Thankfully, Cassandra provides authentication, authorization, and encryption features, which we’ll learn how to configure in this chapter.
We close the book with a summary of the steps involved in bringing Cassandra into your enterprise, from the perspective of migrating from a relational database to Cassandra. We’ll look at the implications for data modeling, application development, and deployment as well as how Cassandra integrates with other popular technologies, including:
streaming systems such as Apache Kafka
search engines such as Apache Lucene, Apache Solr, and ElasticSearch
analytics platforms such as Apache Spark
This book was developed using Apache Cassandra 4.0 and the DataStax Java Driver version 4.1. The formatting and content of tool output, log files, configuration files, and error messages are as they appear in the 4.0 release, and may change in future releases.
When discussing features added in releases 2.0 and later, we cite the release in which the feature was added for readers who may be using earlier versions and are considering whether to upgrade.
The first edition of Cassandra: The Definitive Guide was the first book published on Cassandra, and has remained highly regarded over the years. However, the Cassandra landscape has changed significantly since 2010, both in terms of the technology itself and the community that develops and supports that technology. Here’s a summary of the key updates we’ve made to bring the book up to date:
The first edition was written against the 0.7 release in 2010. As of 2016, we’re up to the 3.X series. The most significant change has been the introduction of CQL and deprecation of the old Thrift API. Other new architectural features include secondary indexes, materialized views, and lightweight transactions. We provide a summary release history in Chapter 2 to help guide you through the changes. As we introduce new features throughout the text, we frequently cite the releases in which these features were added.
Development and testing with Cassandra has changed a lot over the years, with the introduction of the CQL shell (cqlsh) and the gradual replacement of community-developed clients with the drivers provided by DataStax. We give in-depth treatment to cqlsh in Chapter 3 and Chapter 4, and the drivers in Chapter 8 and Chapter 9. We also provide an expanded description of Cassandra’s read path and write path in Chapter 9 to enhance your understanding of the internals and help you understand the impact of decisions.
As more and more individuals and organizations have deployed Cassandra in production environments, the knowledge base of production challenges and best practices to meet those challenges has increased. We’ve added entirely new chapters on security (Chapter 14) and integration (Chapter 15), and greatly expanded the monitoring, maintenance, and performance tuning chapters (Chapter 11 through Chapter 13) in order to relate this collected wisdom.
For this third edition, there is not quite as much of a time gap to cover as there was between the first and second editions, but there have been several key changes we’d like to note:
The conventional wisdom in the software engineering community has been that it takes 5–10 years for a new database engine to fully mature. Thankfully, Cassandra has reached this maturity milestone, and while the 4.0 release certainly has some stability and availability improvements, the bulk of the new features are focused on making the database easier to understand and maintain. This edition covers new 4.0 features including virtual tables (covered in Chapter 11), audit logging (covered in Chapter 14), and change data capture (CDC) (covered in Chapter 15).
The types of applications in which Cassandra is used continue to increase. To help bridge the gap between concept and reality, we’ve added a new chapter (Chapter 7) on how Cassandra figures into the architecture and design of modern cloud applications. We’ve also updated Chapter 15 to include discussion of several patterns for using Kafka and Cassandra together.
When the second edition was published, Docker had already become a popular choice for application deployment, but the verdict was still out on running databases on Docker. Since then, there have been sufficient advances that we now feel comfortable recommending deployment of Cassandra on Docker. Kubernetes has emerged as the key technology for orchestrating the deployment and maintenance of containers across clusters of machines. In this edition we’ve updated Chapter 10 with new guidance on deployment of Cassandra to Docker and added coverage of Kubernetes to reflect the changing landscape.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
The code examples found in this book are available for download at https://github.com/jeffreyscarpenter/cassandra-guide and https://github.com/jeffreyscarpenter/reservation-service.
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Cassandra: The Definitive Guide, Third Edition, by Jeff Carpenter (O’Reilly). Copyright 2020 Jeff Carpenter, 978-1-098-11516-6.”
If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at [email protected].
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/cassandra3e.
To comment or ask technical questions about this book, send email to [email protected].
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
There are many wonderful people to whom we are grateful for helping bring this book to life.
Thank you to our technical reviewers: Stu Hood, Robert Schneider, and Gary Dusbabek contributed thoughtful reviews to the first edition, while Andrew Baker, Ewan Elliot, Kirk Damron, Corey Cole, Jeff Jirsa, Chris Judson, and Patrick McFadin reviewed the second edition.
Thank you to Jonathan Ellis and Patrick McFadin for writing forewords for the first and second editions, respectively, and to Nate McCall for the third edition foreword. Thanks also to Patrick for his contributions to the Spark integration section in Chapter 15.
Thanks to our editors, Mike Loukides, Marie Beaugureau, and Nicole Tache, for their constant support and for making this a better book.
Jeff would like to thank Eben for entrusting him with the opportunity to update such a well-regarded, foundational text, and for Eben’s encouragement from start to finish.
Finally, we’ve been inspired by the many terrific developers who have contributed to Cassandra. Hats off for making such an elegant and powerful database.