The production version of the Hadoop and Cassandra combination needs to go into a separate cluster. The first obvious issue is that you probably wouldn't want Hadoop constantly hitting the Cassandra nodes, hampering Cassandra's performance for end users. The general pattern to avoid this is to split the ring into two data centers. Since Cassandra automatically and immediately replicates changes between data centers, they will always be in sync. What's more, you can designate one of the data centers as a transactional data center with a higher replication factor, and the other as an analytical data center with a replication factor of 1. The analytical data center is the one used by Hadoop, without affecting the transactional data center at all.
Now, you do not really need two physically separate data centers to make this configuration work. Remember NetworkTopologyStrategy? (Refer to the NetworkTopologyStrategy section in Chapter 4, Deploying a Cluster.) You can trick Cassandra into thinking there are two data centers by simply assigning the nodes that you want to use for analytics to a different data center. You may need to use PropertyFileSnitch and specify the details of the data centers in a cassandra-topology.properties file. So, your keyspace creation looks something like this:
create keyspace myKeyspace
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {TX_DC : 2, HDP_DC : 1};
The previous statement defines two data centers: TX_DC for transactional purposes and HDP_DC for Hadoop to do analytics. The data center and rack membership of each node is specified in cassandra-topology.properties like this:
# Transactional Data Center
192.168.1.1=TX_DC:RAC1
192.168.1.2=TX_DC:RAC1
192.168.2.1=TX_DC:RAC2

# Analytics Data Center
192.168.1.3=HDP_DC:RAC1
192.168.2.2=HDP_DC:RAC2
192.168.2.3=HDP_DC:RAC2

# For new/unknown/unassigned nodes
default=TX_DC:RAC1
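For reference, the same keyspace definition can be expressed in modern CQL; this is a sketch using the keyspace and data center names from the example above:

```sql
CREATE KEYSPACE myKeyspace
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'TX_DC': 2,
    'HDP_DC': 1
  };
```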
We are mostly done setting up the machines. A couple of things to remember:

- Make sure conf/hadoop-env.sh includes, in the HADOOP_CLASSPATH variable, all the jar files needed to execute the MapReduce program.
- Configuring Hadoop backed by Cassandra may give the illusion that we are replacing HDFS, because we take data from Cassandra and dump the results back into it. This is not true. Hadoop still needs the NameNode and DataNodes for various activities, such as storing intermediate results, jar files, and static data. So, essentially, your database has no single point of failure (SPOF), but you are still bound by SPOFs such as the NameNode and JobTracker.
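As a sketch, the classpath setting in conf/hadoop-env.sh might look like the following; the jar paths are illustrative assumptions and should be adjusted to your installation:

```shell
# conf/hadoop-env.sh -- illustrative paths, not a definitive layout
# Add the Cassandra client jars so MapReduce tasks can reach Cassandra
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
```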
DataStax, a leading company providing professional support for Cassandra, offers a solution to this. Its enterprise offering, the DataStax Enterprise product, has a built-in Cassandra File System (CFS), which is HDFS compatible. CFS smartly uses Cassandra as the underlying storage. What this gives an end user is simplicity in configuration: there is no need to have a DataNode, NameNode, or secondary NameNode running.
Further detail about CFS is beyond the scope of this book. You can read more about CFS in the DataStax blog post, Cassandra File System Design, at http://www.datastax.com/dev/blog/cassandra-file-system-design.