Before we start with the hands-on exercises, let's discuss a bit more about the Cassandra storage architecture and its different components, followed by the Cassandra query language (CQL) command-line exercises.
In the previous sections, we learned about Cassandra basics. Cassandra's storage architecture is designed to manage large data volumes and revolves around some important factors:
Decentralization means extracting maximum throughput from each node. Cassandra achieves it by keeping every node identically configured: there is no master-slave relationship between nodes. Data is spread across nodes, and each node is capable of serving read/write requests with the same efficiency.
Cassandra replicates data across the nodes based on the configured replication. If the replication factor is 1, one copy of the dataset will be available on one node only. If the replication factor is 2, two copies of each dataset will be made available on different nodes in the cluster. Still, Cassandra ensures data transparency, as for an end user data is served from one logical cluster. Cassandra offers two types of replication strategies.
Simple strategy is best suited for clusters involving a single data center, where data is replicated across different nodes based on the replication factor in a clockwise direction. With a replication factor of 3, two more copies of each row will be copied to nearby nodes in a clockwise direction:
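This clockwise placement can be sketched in plain Java. This is a simplified model (a real cluster hashes row keys to tokens and accounts for downed nodes); the node names and token values here are made up for illustration:

```java
import java.util.*;

public class SimpleStrategyDemo {
    // Nodes on the ring, keyed by their (sorted) token. Walks clockwise
    // from the node owning the token until rf distinct replicas are found.
    // rf must not exceed the number of nodes in the ring.
    static List<String> replicasFor(TreeMap<Long, String> ring, long rowToken, int rf) {
        List<String> replicas = new ArrayList<>();
        // First replica: the node whose range the token falls into.
        Iterator<String> it = ring.tailMap(rowToken).values().iterator();
        while (replicas.size() < rf) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around the ring
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "nodeA");
        ring.put(100L, "nodeB");
        ring.put(200L, "nodeC");
        ring.put(300L, "nodeD");
        // A row with token 150 is owned by nodeC; with a replication factor
        // of 3, the two extra copies go clockwise to nodeD and nodeA.
        System.out.println(replicasFor(ring, 150L, 3)); // [nodeC, nodeD, nodeA]
    }
}
```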
Network topology strategy (NTS) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas to be placed within each data center. Data colocation and no single point of failure are two important factors that we need to prioritize while configuring the replication factor and consistency level. NTS identifies the first node based on the selected schema partitioning and then looks for a node in a different rack (in the same data center). In case there is no such node, data replicas will be placed on different nodes within the same rack. In this way, data colocation can be guaranteed by keeping a replica of the dataset in the same data center (to serve read requests locally), which also minimizes the risk of network latency. NTS depends on the snitch configuration for proper data replica placement across different data centers.
A snitch relies upon the node IP address for grouping nodes within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch and PropertyFileSnitch. We can configure the snitch in cassandra.yaml (the configuration file).
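A minimal PropertyFileSnitch setup might look like the following; the addresses, data center, and rack names here are illustrative, not taken from any real cluster:

```
# cassandra.yaml
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch

# cassandra-topology.properties (node IP = data center : rack)
192.168.1.10=DC1:RAC1
192.168.1.11=DC1:RAC2
192.168.2.10=DC2:RAC1
# fallback for nodes not listed above
default=DC1:RAC1
```

With this mapping in place, NTS can place replicas per data center (for example, strategy_options = {DC1:2, DC2:1}).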
A data partitioning strategy is required to select the node for a given data read/write request. Cassandra offers two types of partitioning strategies.
Random partitioning is the recommended partitioning scheme for Cassandra. Each node is assigned a 128-bit token value (the initial_token for a node is defined in cassandra.yaml) generated by a one-way hashing (MD5) algorithm. Each node's initial token determines its position in the ring, and a data range is assigned to the node. If the token value (generated from a row key value) of a read/write request lies within the range assigned to a node, then that particular node is responsible for serving the request. The following diagram is a common graphical representation of a number of nodes placed in a circle, or ring, with the data range evenly distributed between these nodes:
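The token generation can be illustrated with plain Java. This is only a sketch of the idea behind MD5-based tokens (a nonnegative 128-bit number derived from the row key), not Cassandra's exact implementation:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TokenDemo {
    // A RandomPartitioner-style token: the absolute value of the
    // 128-bit MD5 digest of the row key, interpreted as an integer.
    static BigInteger token(String rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        // The same key always hashes to the same token, so the same node
        // always serves it; different keys scatter uniformly over the ring.
        System.out.println(token("user1"));
        System.out.println(token("user1").equals(token("user1"))); // true
    }
}
```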
Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, key distribution might be uneven and may require load balancing administration. It is certainly possible that the data for multiple column families may get unevenly distributed and the token range may vary from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required.
Here, we will discuss how the Cassandra process writes a request and stores it on a disk:
As we have mentioned earlier, all nodes in Cassandra are peers and there is no master-slave configuration. Hence, on receiving a write request, a client can select any node to serve as the coordinator. The coordinator node is responsible for delegating the write request to eligible nodes based on the cluster's partitioning strategy and replication factor. The write is first appended to a commit log and then applied to the corresponding memtable (see the preceding diagram). A memtable is an in-memory table, which serves subsequent read requests without any lookup on disk. For each column family, there is one memtable. Once a memtable is full, data is flushed to disk asynchronously in the form of SSTables. Once all the segments are flushed onto the disk, they are recycled. Periodically, Cassandra performs compaction over SSTables (sorted by row keys) and reclaims unused segments. In case of a data node restart (in unwanted scenarios such as failover), a commit log replay will happen to recover any previously incomplete write requests.
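The write path above can be sketched as a toy model. The class, field names, and flush threshold here are purely illustrative, and the model ignores real concerns such as fsync, per-column-family memtables, and compaction:

```java
import java.util.*;

// A toy write path: append to a commit log, apply to an in-memory
// memtable (sorted by row key), and flush to an immutable "SSTable"
// when the memtable reaches a size threshold.
public class WritePathDemo {
    final List<String> commitLog = new ArrayList<>();
    final TreeMap<String, String> memtable = new TreeMap<>();
    final List<SortedMap<String, String>> sstables = new ArrayList<>();
    final int flushThreshold;

    WritePathDemo(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String rowKey, String value) {
        commitLog.add(rowKey + "=" + value);      // 1. durable append first
        memtable.put(rowKey, value);              // 2. then apply to the memtable
        if (memtable.size() >= flushThreshold) {  // 3. flush when full
            sstables.add(new TreeMap<>(memtable)); // immutable, sorted by row key
            memtable.clear();
            commitLog.clear();                    // flushed segments are recycled
        }
    }

    public static void main(String[] args) {
        WritePathDemo node = new WritePathDemo(2);
        node.write("1", "user1");
        node.write("2", "user2"); // second write triggers a flush
        System.out.println(node.sstables.size());    // 1
        System.out.println(node.memtable.isEmpty()); // true
    }
}
```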
Cassandra provides a default command-line interface, located at CASSANDRA_HOME/bin/cassandra-cli.sh on Linux and CASSANDRA_HOME/bin/cassandra-cli.bat on Windows. Before we proceed with the sample exercise, let's have a look at the Cassandra schema:
A super column family additionally requires a subcomparator. Super columns do have some limitations: secondary indexes over super columns are not possible, and it is not possible to read a particular super column without deserializing the wrapped subcolumns. Because of such limitations, usage of super columns is highly discouraged within the Cassandra community; using composite columns we can achieve the same functionality. In the next sections, we will cover composite columns in detail. A counter column is a sort of 64-bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations:
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
There are still some unresolved issues (https://issues.apache.org/jira/browse/CASSANDRA-4775), and it is recommended to consider the preceding limitations before opting for counter columns.
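As a quick illustration, a counter column family can be created and incremented from cassandra-cli as follows; the PageViews column family and the home/hits names are made up for this example:

```
create column family PageViews
    with default_validation_class = CounterColumnType
    and key_validation_class = UTF8Type
    and comparator = UTF8Type;

incr PageViews[home][hits];
incr PageViews[home][hits] by 10;
get PageViews[home];
```

The first incr adds one to the hits counter for the home row key, the second adds ten, and get reads the accumulated value back.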
You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/cassandra. If started in local mode, there is only one node. Once it has successfully started, you should see logs on your console, as follows:
Cassandra also provides a command-line client (cassandra-cli), which can be used for basic DDL/DML operations; you can connect to a local/remote Cassandra server instance by specifying the host and port options, as follows:

$CASSANDRA_HOME/bin/cassandra-cli -host localhost -port 9160
A keyspace can be created using the create keyspace command, as follows:

create keyspace cassandraSample with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};

This operation will create a keyspace cassandraSample with the node placement strategy as SimpleStrategy and a replication factor of one. By default, if you don't specify placement_strategy and strategy_options, it will opt for NTS, where replication will be on one data center.
We can look for available keyspaces by running the following command:
show keyspaces;
This will result in the following output:
To update the replication factor to 2 for cassandraSample, you simply need to execute the following command:

update keyspace cassandraSample with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:2};
To switch to the network topology strategy with one replica in datacenter1, run the following command:

update keyspace cassandraSample with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = {datacenter1:1};
use cassandraSample;
Use the following command to create a column family users within the cassandraSample keyspace:
create column family users with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and default_validation_class = 'UTF8Type';
To create a super column family suysers, you need to run the following command:
create column family suysers with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and subcomparator='UTF8Type' and default_validation_class = 'UTF8Type' and column_type='Super' and column_metadata=[{column_name: name, validation_class: UTF8Type}];
Rows and columns can be stored using the set method, as follows:

// create a column named "username", with a value of "user1" for row key 1
set users[1][username] = user1;

// create a column named "password", with a value of "password1" for row key 1
set users[1][password] = password1;

// create a column named "username", with a value of "user2" for row key 2
set users[2][username] = user2;

// create a column named "password", with a value of "password2" for row key 2
set users[2][password] = password2;
// to list all persisted rows within a column family
list users;

// to fetch a row from the users column family having row key value "1"
get users[1];
// to delete the column "username" for row key 1
del users[1][username];
If you want to change key_validation_class from UTF8Type to BytesType, and validation_class for the password column from UTF8Type to BytesType, then type the following command:
update column family users with key_validation_class=BytesType and comparator=UTF8Type and column_metadata = [{column_name:password, validation_class:BytesType}];
You can truncate the column family users, as follows:

truncate users;
You can drop the column family, as follows:

drop column family users;
These are some basic operations that should give you a brief idea about how to create/manage the Cassandra schema.
Cassandra is schemaless, but CQL is useful when we need data modeling with the traditional RDBMS flavor. Two variants of CQL (2.0 and 3.0) are provided by Cassandra. We will use CQL 3.0 for a quick exercise, walking through exercises similar to those we performed with the cassandra-cli interface.
The syntax to connect using cqlsh is as follows:

$CASSANDRA_HOME/bin/cqlsh host port cqlversion

For example, we can connect over localhost and the 9160 port by executing the following command:

$CASSANDRA_HOME/bin/cqlsh localhost 9160 -3
Create the keyspace:

create keyspace cassandrasample with strategy_class='SimpleStrategy' and strategy_options:replication_factor=1;

Update the keyspace:

alter keyspace cassandrasample with strategy_class='NetworkTopologyStrategy' and strategy_options:datacenter=1;
(Note that CQL folds unquoted identifiers to lowercase, so the keyspace appears as cassandrasample rather than cassandraSample.) We can authorize to a keyspace as follows:

use cassandrasample;
We can use the describe keyspace command to look at the column families it contains and its configuration settings. We can describe a keyspace as follows:

describe keyspace cassandrasample;
Next, we will create a users column family with user_id as the row key, and username and password as columns. To create a column family, such as users, use the following command:

create columnfamily users(user_id varchar PRIMARY KEY, username varchar, password varchar);
To insert a record in the users column family for row key value 1, we will run the following CQL query:

insert into users(user_id,username,password) values('1','user1','password1');
To retrieve all rows from the users column family, we need to execute the following CQL query:

select * from users;
CQL also supports the delete operation. The following command-line scripts perform the deletion of a complete row, and of the column age, from the users column family, respectively:

// delete complete row for user_id=1
delete from users where user_id='1';

// delete age column from users for row key 1
delete age from users where user_id='1';
Here are a few examples:
// add a new column
alter columnfamily users add age int;

// update column metadata
alter columnfamily users alter password type blob;
truncate users;
drop columnfamily users;
drop keyspace cassandrasample;
Cassandra's data model and storage architecture are inspired by Google's Bigtable, which is what enables schemaless development in Cassandra. However, over time there has been demand for schema support, which led Cassandra to introduce datatypes and CQL. Currently, CQL is mature enough to design and build schema-based applications, and it is often referred to as the native language for Cassandra nowadays. For example, CQL 3.0 provides support for composite keys, which is not possible with previous CQL versions or the Thrift API. I would prefer a combination of both to build my application, which Cassandra essentially offers. If column families and columns do not need to be modeled upfront, it is preferable to use the Thrift protocol to process the dataset; at the same time, you can create/manage a schema for other column families in the same keyspace using CQL. It's great to have an RDBMS-like structured schema flavor on top of Cassandra, offering scalability and availability at the same time!
Over a period of time, Cassandra has become very popular and is preferred as a solution for NoSQL-specific problems. A number of language APIs, including Java, are available to serve as high-level clients. A few of these clients are mentioned in the following subsections.
Hector is a popular open source, high-level Java client built on top of the Cassandra Thrift API. It is considered to be the most stable client and has been available for use since the inception of Cassandra. More information on Hector can be found at http://hector-client.github.com/hector/build/html/index.html.
Astyanax is loosely inspired by Hector and was developed by Netflix. It was later open sourced and made available for use; Netflix applications currently use it in production. More information on Astyanax can be found at https://github.com/Netflix/astyanax.
Kundera is a Java Persistence API (JPA) 2.0-compliant, open source, high-level Java client, which currently supports Cassandra, HBase, and MongoDB. It comes in very handy when we need a strong JPA-like feature set (for example, associations, JPQL, and so on) and quick development. Using the JPA standard, Kundera simply hides the complexity involved in NoSQL development.
More information on Kundera can be found at https://github.com/impetus-opensource/Kundera.
There are a number of high-level clients available; you can refer to http://wiki.apache.org/cassandra/ClientOptions for other clients.
JPA is a standard specification and has been widely used in the industry for a long time. Kundera is an open source, JPA 2.0-compliant object-datastore mapping library for NoSQL datastores, such as Cassandra, HBase, and MongoDB. The latest stable Kundera release is 2.2.
In this section, we will cover how to implement Create, Read, Update, and Delete (CRUD) operations and run JPA queries over Cassandra using Kundera. You can refer to http://jcp.org/aboutJava/communityprocess/final/jsr317/index.html for a basic JPA understanding. Before we start, please refer to README.txt (attached with the jpaExamples.zip source code) for basic source code setup.
Create a new maven-based project:
mvn archetype:generate -DgroupId=com.packtpub.cassandra -DartifactId=jpaexample
Edit pom.xml to add the Kundera dependency, as follows:
<dependency>
  <groupId>com.impetus.client</groupId>
  <artifactId>kundera-cassandra</artifactId>
  <version>2.2</version>
</dependency>
Create a subfolder named resources/META-INF within the jpaExample/src/main folder and edit persistence.xml with the following configuration:
<persistence xmlns="http://java.sun.com/xml/ns/persistence"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/persistence https://raw.github.com/impetus-opensource/Kundera/Kundera-2.0.4/kundera-core/src/test/resources/META-INF/persistence_2_0.xsd"
    version="2.0">
  <persistence-unit name="cassandra_pu">
    <provider>com.impetus.kundera.KunderaPersistence</provider>
    <properties>
      <property name="kundera.nodes" value="localhost" />
      <property name="kundera.port" value="9160" />
      <property name="kundera.keyspace" value="jpaExamples" />
      <property name="kundera.dialect" value="cassandra" />
      <property name="kundera.client.lookup.class" value="com.impetus.client.cassandra.thrift.ThriftClientFactory"/>
      <property name="kundera.cache.provider.class" value="com.impetus.kundera.cache.ehcache.EhCacheProvider" />
      <property name="kundera.cache.config.resource" value="/ehcache-test.xml" />
    </properties>
  </persistence-unit>
</persistence>
Some other important configurations (defined in persistence.xml) worth mentioning are:
<property name="kundera.nodes" value="localhost" />
<property name="kundera.port" value="9160" />
<property name="kundera.keyspace" value="jpaExamples" />
The following property configures the low-level Thrift client:

<property name="kundera.client.lookup.class" value="com.impetus.client.cassandra.thrift.ThriftClientFactory"/>
We will create a keyspace and column family using the cassandra-cli command-line client:
create keyspace jpaExamples with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use jpaExamples;
create column family AUTHOR with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and column_metadata=[{column_name:AUTHOR_NAME, validation_class: UTF8Type, index_type:KEYS},{column_name:AGE, validation_class: IntegerType}, {column_name:SEX, validation_class: UTF8Type}];
To map the Author entity with the Cassandra database, you need to annotate it with the @Table annotation, as follows:

@Table(name = "AUTHOR", schema = "jpaExamples@cassandra_pu")
Here, jpaExamples is the keyspace name and cassandra_pu is the persistence unit name (see persistence.xml). Attributes annotated with @Id will be mapped as the Cassandra row key.
Now, let's discuss how to perform CRUD operations over the Author entity. First, we build an EntityManagerFactory for the cassandra_pu persistence unit. To store an author entity, we perform the JPA persist operation. To modify or remove a stored entity, we use the JPA merge or remove operations; for example, we can update the age column value for the author object having a row key value of 1, or delete the entity altogether. To fetch the Author entity, we can execute a JPA query; we can also query over authorName column values. The column family AUTHOR contains an index over the author name (please refer to Step 4 – Database script).
In the previous sections, we discussed the schema configuration, storage architecture, and usage of Cassandra. We will now cover how to build a simple Java application using a high-level Java client: TripPlanner, a simple web-based application that we will build over Cassandra. The basic functionalities we will be covering are:
You may be wondering why Cassandra is used for such web-based applications. Don't such applications exist in the relational world as well? Consider that the TripPlanner website is generating data with a growth rate of 100 percent and receives around one million hits per day. Soon, the relational schema's performance will take a hit and scalability will become an issue; that's where databases such as Cassandra are needed.
To develop the TripPlanner application, we will use Hector as a high-level Java API. Please refer to the Java and other available language APIs section for more details on Hector.
On the database side, we need one keyspace and four column families:
User: The User column family will hold user details. It will also enable secondary indexes over the first name and country column values. Create the User column family:
create column family User with key_validation_class = 'UUIDType' and comparator = 'UTF8Type' and column_metadata=[{column_name:firstname, validation_class: UTF8Type, index_type:KEYS},{column_name:lastname, validation_class: UTF8Type},{column_name:password, validation_class: UTF8Type},{column_name:email, validation_class: UTF8Type},{column_name:country, validation_class: UTF8Type, index_type:KEYS}];
Hotel: The Hotel column family will hold hotel details. It will hold secondary indexes over the hotel name and category values. Create the Hotel column family:
create column family Hotel with key_validation_class = 'UUIDType' and comparator = 'UTF8Type' and column_metadata=[{column_name:hotelname, validation_class: UTF8Type, index_type:KEYS},{column_name:category, validation_class: UTF8Type,index_type:KEYS},{column_name:location, validation_class: UTF8Type},{column_name:email, validation_class: UTF8Type},{column_name:contactno, validation_class: UTF8Type}];
Review: Reviews written by users will be stored in the Review column family. The column family will hold a reference to the row key value for users (userid) and a reference to the row key for hotels (hotelid). Create the Review column family:
create column family Review with key_validation_class = 'UUIDType' and comparator = 'UTF8Type' and column_metadata=[{column_name:userid, validation_class: UTF8Type, index_type:KEYS},{column_name:hotelid, validation_class: UTF8Type,index_type:KEYS},{column_name:rating, validation_class: UTF8Type},{column_name:comments, validation_class: UTF8Type}];
AverageRating: Whenever a user publishes a review or rates a specific hotel, the rating counter column is incremented by the provided rating value and the number of reviews is incremented by one. The purpose of this column family is to accumulate the rating values and review count while reviews are being stored. To calculate the average rating for a specific hotel, you need a counter column family with two counter columns: the rating and the number of reviews. Create the AverageRating column family:
CREATE COLUMN FAMILY AverageRating WITH default_validation_class=CounterColumnType AND key_validation_class=UTF8Type AND comparator=UTF8Type;
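For instance, when a user submits a review with a rating of 4, the application can bump both counters from cassandra-cli like this (the hotel1 row key and numberofreviews column name are illustrative, not taken from the application code):

```
incr AverageRating[hotel1][rating] by 4;
incr AverageRating[hotel1][numberofreviews];
```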
Let's have a brief discussion of the code snippets and different types of modules used to build the application:
HectorClient.java is a high-level client, which is built using the Hector API and holds definitions for methods to cater to the various functional requirements of the application. Let's discuss the methods implemented in HectorClient.java:

addUser: To persist a record in Cassandra using the Hector API, we need to create an instance of Mutator and add each column name/value tuple using userid as the row key. This method takes the user dto (data transfer object) as an input parameter and stores the username, password, and email column values using the Hector Mutator API.

findUserIdByName: In order to fetch the row key for a firstname value from the User column family, use rangeslicequery and setReturnKeysOnly to return the row key value and exclude any other columns for the matching records.

findHotels: To find hotels for a given column name and column value, we perform rangeslicequery and populate Hotel (data transfer objects) from the fetched records.

findAverageRating: To retrieve the total number of reviews and the summed rating value from the AverageRating column family, we need to perform rangeslicecounterquery.
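The final average is then just the summed rating divided by the review count. A minimal sketch of that computation; the class and method names here are assumed for illustration and are not part of the Hector API:

```java
// Computing the average rating from the two counters read from the
// AverageRating column family: the summed rating and the review count.
public class AverageRatingDemo {
    static double averageRating(long summedRating, long numberOfReviews) {
        if (numberOfReviews == 0) return 0.0; // no reviews yet
        return (double) summedRating / numberOfReviews;
    }

    public static void main(String[] args) {
        // e.g. three reviews rating a hotel 4, 5 and 3 leave the counters
        // at summedRating = 12 and numberOfReviews = 3.
        System.out.println(averageRating(12, 3)); // 3.0
    }
}
```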