The production version of the Hadoop and Cassandra combination needs to go into a separate cluster. The first obvious issue is that you probably do not want Hadoop constantly polling the Cassandra nodes, degrading Cassandra's performance for end users. The general pattern to avoid this is to split the ring into two data centers. Cassandra automatically replicates changes between data centers, so the two stay in sync (eventually consistent, and in practice usually within moments). What's more, you can designate one data center as transactional, with a higher replication factor, and the other as an analytical data center with a replication factor of 1. Hadoop then works against the analytical data center without affecting the transactional one.
Now, you do not really need two physically separate data centers to make this configuration work. Remember NetworkTopologyStrategy? (Refer to Chapter 3, Effective CQL.) You can make Cassandra think there are two data centers simply by assigning the nodes you want to use for analytics to a different data center. To do this, you need to use PropertyFileSnitch and specify the data center details in the cassandra-topology.properties file. So, your keyspace creation looks something like this:
create keyspace myKeyspace
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {TX_DC : 2, HDP_DC : 1};
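The statement above uses the legacy cassandra-cli syntax. If you are working in cqlsh instead (see Chapter 3, Effective CQL), the equivalent keyspace definition would look roughly like this sketch:

```sql
-- Equivalent CQL form of the same keyspace definition:
-- two replicas in the transactional data center, one in the analytical one.
CREATE KEYSPACE "myKeyspace"
  WITH replication = {
    'class'  : 'NetworkTopologyStrategy',
    'TX_DC'  : 2,
    'HDP_DC' : 1
  };
```

Note that the data center names in the replication map must match the names used in cassandra-topology.properties exactly.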
The previous statement defines two data centers: TX_DC for transactional purposes and HDP_DC for analytics in Hadoop. Each node's snitch is then configured with a topology file like this:
# Transactional data center
192.168.1.1=TX_DC:RAC1
192.168.1.2=TX_DC:RAC1
192.168.2.1=TX_DC:RAC2

# Analytics data center
192.168.1.3=HDP_DC:RAC1
192.168.2.2=HDP_DC:RAC2
192.168.2.3=HDP_DC:RAC2

# For new/unknown/unassigned nodes
default=TX_DC:RAC1
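For this topology file to be honored, each node must also be told to use the property-file snitch in its cassandra.yaml (the default conf/ location is assumed here):

```yaml
# conf/cassandra.yaml
endpoint_snitch: PropertyFileSnitch
```

After restarting the nodes, nodetool ring should show the nodes grouped under the TX_DC and HDP_DC data centers, confirming that the topology took effect.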
We are mostly done setting up the machines. One thing to remember: make sure that conf/hadoop-env.sh includes, as part of the HADOOP_CLASSPATH variable, all the JAR files needed to execute the MapReduce program. With all these configurations, your Cassandra cluster should be ready to serve analytical results to all the concerned people.
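As an illustration, the classpath addition in conf/hadoop-env.sh might look like the following (the /opt/cassandra install path is an assumption; adjust it to your layout):

```shell
# conf/hadoop-env.sh -- make Cassandra's client JARs visible to MapReduce tasks.
# /opt/cassandra is an illustrative install path; substitute your own.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:/opt/cassandra/lib/*"
```

The quoted glob is left unexpanded in the variable; Hadoop expands classpath wildcards itself when it launches task JVMs.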
Configuring Hadoop backed by Cassandra may give the illusion that we are replacing HDFS, because we take data from Cassandra and dump the results back into it. That is not true. Hadoop still needs the NameNode and DataNodes for various activities such as storing intermediate results, JAR files, and static data. So, essentially, your database has no single point of failure (SPOF), but you are still bound by SPOFs such as the NameNode and JobTracker.
DataStax, a leading company in professional support for Cassandra, provides a solution to this. Its enterprise offering, DataStax Enterprise, has a built-in Cassandra File System (CFS), which is HDFS-compatible. CFS smartly uses Cassandra as the underlying storage. What this gives the end user is simplicity of configuration and no need to run the DataNode, NameNode, and secondary NameNode processes.
A detailed discussion of CFS is beyond the scope of this book. You may read more about it in the DataStax blog post, Cassandra File System Design, at http://www.datastax.com/dev/blog/cassandra-file-system-design.