CHAPTER 11: Resources, References, and Tools

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 11

Resources, References, and Tools

There is a plethora of big data products available from large and small vendors. Some of these products cater to niche areas like social media analytics and NoSQL databases, while some have a Hadoop ecosystem with a combination of infrastructure, visualization, and analytical capabilities. This chapter gives a broad overview of many of the products you will need to implement the architecture and patterns described in this book.

Big Data Product Catalog

Problem

List the main big data product areas and the associated vendors of those products.

Solution

Table 11-1 lists tools you can use as a solution, as well as the vendors providing those tools.

Table 11-1. Distributed and Clustered Flume Taxonomy

Tools	Vendors
Hadoop Distributions	Cloudera Hortonworks MapR IBM BigInsights Pivotal HD Microsoft Windows Azure cloud platform, HDInsight
In-Memory Hadoop	Intel
Hadoop Alternatives	HPCC Systems from LexisNexis
Hadoop SQL Interfaces	Apache Hive Cloudera Impala EMC HAWQ Hortonworks Stinger
Ingestion Tools	Flume Storm S4 Sqoop
Map Reduce Alternatives	Spark Nokia Disco
Cloud Options	AWS EMR Microsoft Azure Google BigQuery
NoSQL Databases	IBM Netezza HP Vertica Aster Teradata Google BigTable EMC Greenplum
In-Memory Database Management Systems	SAP HANA Oracle Exalytics
Visualization	Tableau QLikView Tibco Spotfire MicroStrategy SAS VA
Search	Solr
Analytics	SAS Revolution Analytics Pega
Integration Tools	Talend Informatica
Operational Intelligence Tools	Splunk
Graph Databases	Neo4J OpenLink
Document Store Database Management Systems	MongoDB Cloudant MarkLogic Couchbase
Datasets	InfoChimps
Social Media Integrator	Clarabridge radian6 SAS Informatica PowerExchange
Archive Infrastructure	EMC Ipsilon
Data Discovery	IBM Vivisimo Oracle Endeca MarkLogic
Table-Style Database Management Services	Cassandra HP Vertica DataStax Teradata

Hadoop Distributions

Problem

Apache Hadoop does not come integrated with all the components required for an enterprise-scale big data system. Do I have any better options to save the time and effort to configure multiple frameworks?

Solution

Because the Hadoop ecosystem is made up of multiple entities (Hive, Pig, HDFS, Ambari, and others), with each entity maturing individually and coming up with newer versions, there are chances of version incompatibility, security, and performance-tuning issues. Vendors like Cloudera, MapR and Hortonworks do a good job to package it all together into one distribution and manage the incompatibility, performance, and security issues within the packaged distribution. This greatly helps because the maintenance support for these open source entities is only through forums.

These vendors are coming up with their own SQL interfaces and monitoring dashboards and contributing back to the Apache Hadoop community, thereby enriching the open source community. Here are some examples:

MapR stands out as a unique file system. Unlike the HDFS file systems in Apache Hadoop, MapR allows you to mount the cluster as an NFS volume. “HDFS” is replaced by “Direct Access NFS.” Further details can be found at http://www.mapr.com/Download-document/4-MapR-M3-Datasheet .
Hortonworks stands out for its offering on Windows operating systems. Hortonworks is the only vendor providing Hadoop on Windows OS.
Intel provides Hadoop storage that is in-memory.

A typical packaged distribution covers all of the open source Hadoop entities.

In-memory Hadoop

Intel provides optimization for solid state disks and cache acceleration. See www.intel.com/bigdata for information on Intel’s big data resources in general.

Over and above the core open source Hadoop entities, vendors provide additional services such as those shown in Table 11-2.

Table 11-2. Vendor-Specific Hadoop Services

Hadoop Alternatives

Problem

Is Apache Hadoop the only option to implement big data, map reduce, and a distributed file system?

Solution

The nearest open source alternative to Hadoop is the HPCC system. Refer to http://hpccsystems.com/Why-HPCC/How-it-works .

Unlike Hadoop, HPCC provides massively parallel processing and a shared nothing architecture not based on any type of key-value NoSQL databases. See http://hpccsystems.com/Why-HPCC/case-studies/lexisnexis for a case study of HPCC implementation in LexisNexis document management.

Hadoop SQL Interfaces

Problem

How can I improve the performance of my Hive queries?

Solution

Apache Hive is the most widely used open source SQL interface. Apache Hive can run over HDFS or over the NoSQL HBase columnar database. Apache Hive was developed to make a SQL developer’s life easier. Instead of forcing developers to learn a new language or learning new CLI commands to run MapReduce code, Apache Hive provides a SQL-like language to trigger map reduce jobs.

Cloudera Impala, EMC HAWQ, and Hortonworks Stinger are some of the products available that can overcome the performance issues encountered while using Apache Hive. Hortonworks Stinger is relatively new to the market. New aggregate functions, optimized query, and optimized Hive runtime are some of the features added in these products.

There are many resources you can consult to find comparisons of EMC HAWQ with Cloudera Impala and other such products. One that you may find useful is at: https://www.informationweek.com/software/information-management/cloudera-impala-brings-sql-querying-to-h/240153861 . You find more information going to http://www.cloudera.com/content/cloudera/en/products/cdh/impala.html and http://hortonworks.com/blog/100x-faster-hive .

Ingestion tools

Problem

What are the essential tools/frameworks required in your big data ingestion layer?

Solution

There are many product options to facilitate batch-processing-based ingestion. Here are some major frameworks available:

Apache Sqoop: A tool used for transferring bulk data from RDBMS to Apache Hadoop and vice-versa. It offers two-way replication, with both snapshots and incremental updates.
Chukwa: Chukwa is a Hadoop subproject that is designed for efficient log processing. It provides a scalable distributed system for monitoring and analyzing log-based data. It supports appending to existing files and can be configured to monitor and process logs that are generated incrementally across many machines.
Apache Kafka: A distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that’s scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize I/O performance and mirroring to improve availability and scalability and to optimize performance in multiple-cluster scenarios. It can be used as the framework between the router and Hadoop in the multi-destination pattern implementation.
Flume: A distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is based on streaming data flows. Flume provides extensibility for online analytic applications. However, Flume requires a fair amount of configuration, which can become complex for very large systems.
Storm: Supports event-stream processing and can respond to individual events within a reasonable time frame. Storm is a general-purpose, event-processing system that uses a cluster of services for scalability and reliability. In Storm terminology, you create a topology that runs continuously over a stream of incoming data. The data sources for the topology are called spouts, and each processing node is called a bolt. Bolts can perform sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
InfoSphere Streams: Performs complex analytics of heterogeneous data types. InfoSphere Streams can support all data types. It can perform real-time and look-ahead analysis of regularly generated data, using digital filtering, pattern/correlation analysis, and decomposition, as well as geospacial analysis.
Apache S4: A real-time data ingestion tool used for processing continuous streams of data. Client programs that send and receive events can be written in any programming language. S4 is designed as a highly distributed system. Throughput can be increased linearly by adding nodes into a cluster.

Map Reduce alternatives

Problem

For multinode parallel processing, is MapReduce the only algorithm option?

Solution

Spark and Nokia DISCO are some of the alternatives to MapReduce. A fair comparison can be found at http://www.bytemining.com/2011/08/Hadoop-fatigue-alternatives-to-Hadoop/ .

Because most vendor products and enhancements are focused on MapReduce jobs, it makes sense to stick to MapReduce unless there is a pressing need to look for a massive parallel processing option.

Cloud Options

Problem

Buying inexpensive hardware for large big data implementations can still be a very large capital expense. Are there any pay-as-you-go cloud options?

Solution

Amazon EMR is a public cloud based web service that provides huge data computing power. Amazon EMR uses Amazon S3 for storage unlike the Hadoop HSFs storage. Amazon EMR data ingestion, analysis and import/export concerns are different from a typical HDFS based Hadoop system.

Amazon EMR data ingestion, analysis, and import/export concerns are discussed at http://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf .

Table-Style Database Management Services

Problem

My existing database and analytics experts have RDBMS and SQL skills. How can I quickly make them big data users?

Solution

Cassandra, HP Vertica, DataStax, Oracle Exadata, and Aster Teradata are some table-style, database-management services that run over Hadoop and provide the abstraction needed to reduce latency and improve performance.

Compared to NoSQL DBMSs, table-style DBMSs bring vendor lock-in, hardware dependencies, and complicated clustering. Table-style DBMSs also hog memory and network, and hence have consistency issues across clusters. Data integrity and consistency suffer as performance improves.

Some table-style DBMSs come packaged as a combo of hardware, platform, and software. Hence, the eventual cost and ROI need to be thoroughly investigated before opting for it.

You can find information concerning the total cost of ownership (TCO) for moving to Oracle Exadata at the following web site: http://www.zdnet.com/reproducing-youtube-on-oracle-exadata-1339318266/ .

NoSQL Databases

Problem

Is there a “one solution fits all” NoSQL database?

Solution

NoSQL databases (Figure 11-1) are very use-case-centric, unlike RDBMS, which are generic and cater to multiple system needs. Maintenance of data integrity and consistency can become a concern in the long run.

Figure 11-1. NoSQL databases

Also, there are multiple key-value pair, graph, and document databases that have very different implementations. Moving from one to another might raise issues.

In-Memory Big Data Management Systems

Problem

Real-time big data analysis requires in-memory processing capabilities. Which are the leading products?

Solution

SAP HANA and Oracle Exalytics are some of the leading products for in-memory processing. Oracle and SAP HANA provide their own set of adapters to connect to various other products, as well. Though different as far as hardware and platform, both products provide comparable features. The only concern is vendor lock-in and the inability to integrate with any other NoSQL database. Implementing these vendor products might require significant changes to a customer’s existing high-availability and disaster-recovery processes.

DataSets

Problem

Are there any large data sets available in the public domain that I can use for my big data pilot projects?

Solution

Data.gov is an official US government web site. You can visit http://catalog.data.gov/dataset to see about 100,000 datasets belonging to different categories.

Data Discovery

Data discovery and search capabilities have gained more importance because of the new non-enterprise and social-media data that has started feeding in to an organization’s decision-making systems. Discovering insights from unstructured data (videos, call center audios, blogs, twitter feeds, Facebook posts) has been a challenge for existing decision-making systems and has given birth to new products like IBM Vivisimo, Oracle Endeca, and others.

You can find more information at the following web sites: https://wikis.oracle.com/display/endecainformationdiscovery/Home;jsessionid=7EF303D9FADB3215001F27A4F4DACFE5 and http://www.ndm.net/datawarehouse/IBM/infosphere-dataexplorer-vivisimo .

Visualization

Problem

There are so many cool visualization tools coming up every day, how do I select the appropriate tool for my enterprise?

Solution

High-volume, real-time analytics has brought in newer products in the market like the following ones:

Tableau
QLikView
Tibco Spotfire
MicroStrategy
SAS VA

From a market presence perspective, QLikView has been around for a while. Comparatively, Tableau is a new entrant into the market. SAS has also jumped into the fray with its SAS Visual Analytics offering. MicroStrategy and Tibco Spotfire also have a substantial market presence.

Though these are in-memory visualization tools that are highly integrated with the Hadoop ecosystem, issues such as the following exist with them:

Data is expected in a certain format for better performance.
Integration with Apache Hive is at times not of very high performance.
Apache Hive has to be modified to provide data in an aggregated manner to get high performance.

Analytics Tools

Here are some of the available analytics tools:

Revolution R Analytics
SAS

Though Revolution R Analytics is a new entrant into the market, it has made an immense impact. Because SAS is the old horse, it has an advantage over Revolution Analytics. However, it might be too costly an affair in the long run.

Data Integration Tools

Problem

I already invested in business intelligence (BI) tools like Talend, Informatica, and others. Can I use them for big data integration?

Solution

Talend, Informatica, Pentaho, and IBM DataStage are some of the tools that can act as ETL tools, as well as scheduling tools.

Some tools are still not mature and do not have all the adapters and accelerators needed to connect to all the Hadoop ecosystem entities. There might be restrictions on which versions of the entities these might work with.

Not all the tools are compatible with different distributions of Hadoop (like Cloudera, Hortonworks, MapR, Intel, and IBM BigInsights).

You can read about the Talend features at http://www.talend.com/products/big-data/matrix .

You can find additional information about Informatica PowerExchange at http://www.informatica.com/us/products/enterprise-data-integration/powerexchange/ .

Summary

Since the big data industry is still evolving, there might be more products that will emerge as leaders in the field. As I am writing this, Oracle has launched Big Data X4-2 appliance, Pivotal has launched a Platform as a Service (PaaS) called “Pivotal One,” and AWS has upgraded its Elastic MapReduce offering to Hadoop 2.2 with the YARN framework. More updates and upgrades with better performance will follow. Architects should keep themselves abreast of the latest developments so that they can recommend the right products to their customers.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for CHAPTER 11: Resources, References, and Tools

Create new playlist

Sign In

Sign Up

Table of Contents for
CHAPTER 11: Resources, References, and Tools