Chapter 8. Integration

Big Data is the latest trend in the technical community and in the industry in general. Cassandra and many other NoSQL solutions solve a major part of the problem: storing large amounts of data in a scalable manner while keeping mutations and retrieval queries extremely fast. But this is just half the picture; the other major part is processing. A database that integrates well with analytical tools such as Apache Hadoop, Twitter Storm, Pig, Spark, and other platforms will be a preferable choice.

Cassandra provides native support for Hadoop MapReduce, Pig, Hive, and Oozie. It takes only tiny configuration changes to get the Hadoop family up and working with Cassandra. There are also a couple of independently developed projects that use Cassandra as storage. One of the most popular is Solandra, which integrates Cassandra with Solr. However, it does not magically enable text search in Cassandra; Cassandra serves just as a backend.
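To give a sense of how small those changes are, the following is a minimal sketch of a Hadoop job driver that reads its input from Cassandra rather than from HDFS. It uses the ConfigHelper and ColumnFamilyInputFormat classes from Cassandra's Hadoop support (the org.apache.cassandra.hadoop package); the host, port, keyspace, and column family names are placeholder values that you would adapt to your cluster:

import java.nio.ByteBuffer;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-backed-job");
        job.setJarByClass(CassandraJobDriver.class);

        // Read the job's input rows from a Cassandra column family
        // instead of from files in HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                "myKeyspace", "myColumnFamily"); // placeholder names

        // An empty slice range asks for every column of each row.
        SliceRange range = new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, Integer.MAX_VALUE);
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(),
                new SlicePredicate().setSlice_range(range));

        // Mapper, reducer, and output settings would follow here, written
        // exactly as they would be for an ordinary HDFS-backed job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the input format configured this way, each mapper receives Cassandra rows (a row key plus its columns) instead of file splits; the rest of the job is plain Hadoop code.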

Third-party support for Hadoop and Solr has taken Cassandra to the next step in terms of integration. Proprietary tooling such as the DataStax Enterprise edition of Cassandra makes it easy to work with Hadoop and to actually run text searches over Cassandra data using Solr.

Cassandra is a very powerful database engine, and so far we have seen its salient features as a single software entity. In this chapter, we will see how Cassandra can be used as a data store for third-party software such as Hadoop MapReduce, Pig, and Apache Solr.

Note

This chapter does not cover Cassandra integration with Hive and Oozie. To learn about Cassandra integration with Oozie, refer to:

http://wiki.apache.org/cassandra/HadoopSupport#Oozie

Hive can be integrated with Cassandra via independent projects such as https://github.com/riptano/hive_old on GitHub, but that project appears to be deprecated, so it is best avoided. There are ongoing efforts to make Hive integration a native part of Cassandra. If you are planning to use Cassandra with Hive, keep an eye on this issue:

https://issues.apache.org/jira/browse/CASSANDRA-4131

DataStax Enterprise editions have built-in Cassandra-enabled Hive MapReduce clients; you may want to check them out at:

http://www.datastax.com/docs/datastax_enterprise3.0/solutions/about_hive

Using Hadoop

Hadoop is for data processing. So are MATLAB, R, Octave, Python (with NLTK and many other libraries for data analysis), and SAS, you may say. They are great tools, but they are good for data that fits in memory. This means you can churn through a couple of gigabytes, perhaps tens of gigabytes, and the rate of processing depends on the CPU of that one machine, maybe 16 cores. This poses a big restriction: at Internet scale, data no longer stays within gigabyte limits. In the age of billions of cell phones (an estimated 6.8 billion cell users at the end of 2012; source: http://mobithinking.com/mobile-marketing-tools/latest-mobile-stats/a#subscribers), we generate a humongous amount of data every second (Twitter reported a record of 143,199 Tweets per second; source: https://blog.twitter.com/2013/new-tweets-per-second-record-and-how) by checking in to places, tagging photos, uploading videos, commenting, messaging, purchasing, dining, and running (fitness apps monitor our activities); we literally record events everywhere.

It does not stop at organic data generation. A lot of data, far more than the organic kind, is generated by machines (http://en.wikipedia.org/wiki/Machine-generated_data): web logs, financial market data, data from various sensors (including the ones in your cell phone), machine part data, and many more. Health, genomics, and medical science have some of the most interesting Big Data corpora ready to be analyzed. To get a glimpse of how big genetic data can be, consider the 1000 Genomes Project (http://www.1000genomes.org/). This data is available for free (apart from storage charges) to anyone. The genome data for (only) 1,700 individuals makes a corpus of 200 terabytes. It is doubtful that any conventional in-memory computation tool such as R or MATLAB can handle that. Hadoop helps you process data at that scale.

Hadoop is an example of distributed computing, so you can scale beyond a single computer. Hadoop virtualizes the storage and the processors, which means you can roughly treat a 10-machine Hadoop cluster as one machine with 10 times the processing power and 10 times the storage capacity of a single machine. With multiple machines processing the data in parallel, Hadoop is the best fit for large unstructured data sets. It can help you clean data (data munging) and perform data transformations too, and HDFS provides redundant, distributed data storage. Effectively, it can work as your extract, transform, and load (ETL) platform.
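To make the programming model concrete, the following is a minimal sketch of the canonical word-count MapReduce job (the class name and argument paths are illustrative). Hadoop splits the input across the cluster, runs the map function on each split in parallel, and the reduce phase aggregates the per-word counts:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Each mapper processes one split of the input in parallel with the
    // others and emits a (word, 1) pair for every token it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reducer receives all the counts emitted for a given word and
    // sums them into the final tally.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same code runs unchanged whether the cluster has one node or a hundred; Hadoop takes care of splitting the input, scheduling the map and reduce tasks, and shuffling the intermediate pairs between them.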
