7.1 Introduction

Over the past 10 years or so, numerous big data management and analytics systems have emerged, various cloud service providers have implemented big data solutions, and infrastructures/platforms for hosting big data systems have been developed. Notable among the big data systems are MongoDB, Google’s BigQuery, and Apache Hive. Big data solutions are being developed by cloud providers including Amazon, IBM, Google, and Microsoft, while infrastructures/platforms have been built around products such as Apache’s Hadoop, Spark, and Storm.

Selecting the products to discuss is a difficult task, because almost every database, cloud computing, and analytics tools vendor is now marketing products as big data management and analytics (BDMA) solutions. When the products offered by all of these vendors are combined with the open-source products, there are hundreds of products to discuss. Therefore, we have selected the service providers, products, and frameworks that we are most familiar with, either from discussing them in the courses we teach or from using them in our experimentation. Describing all of the service providers, products, and frameworks is beyond the scope of this book. Furthermore, we are not endorsing any product in this book.

The organization of this chapter is as follows. In Section 7.2, we will describe the various infrastructure products that host big data systems. Examples of big data systems are discussed in Section 7.3. Section 7.4 discusses the big data solutions provided by some cloud service providers. This chapter is summarized in Section 7.5. Figure 7.1 illustrates the concepts discussed in this chapter.

Figure 7.1 Big data management and analytics systems and tools.

7.2 Infrastructure Tools to Host BDMA Systems

In this section, we will discuss various infrastructure products that host BDMA systems. These are Apache’s Hadoop, Spark, Pig, Storm, Flink, and Kafka, as well as the MapReduce programming model.

Apache Hadoop: Hadoop is an open-source distributed framework for processing large amounts of data. It uses the MapReduce programming model that we will discuss next. Its storage system, the Hadoop Distributed File System (HDFS), is hosted on clusters of machines. A file consists of multiple blocks, and the blocks of a file are replicated across machines for availability. Hadoop supports the parallel processing of the data for performance. JobTracker is the part of Hadoop that tracks the MapReduce jobs submitted by client applications. Many big data systems as well as cloud applications are hosted on Hadoop. More details of Hadoop can be found in [HADO].

MapReduce: MapReduce is a programming model and an associated implementation that takes client requests and transforms them into MapReduce jobs, which are then executed by Hadoop. The model expresses a computation in two phases: (i) map and (ii) reduce. As stated in [MAPR], “a MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executes the failed tasks.” More details of the MapReduce model are given in [MAPR].
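
To make the two phases concrete, the following is a minimal word-count sketch in plain Python that mimics the map, shuffle/sort, and reduce steps in a single process; in a real deployment, equivalent mapper and reducer programs would run in parallel over data in HDFS (e.g., via Hadoop Streaming). All names in the sketch are illustrative.

    from collections import defaultdict

    def map_task(document):
        # Map: emit a (word, 1) pair for every word in the input chunk.
        for word in document.split():
            yield (word, 1)

    def shuffle(pairs):
        # The framework's shuffle/sort: group emitted values by key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_task(key, values):
        # Reduce: aggregate all values emitted for one key.
        return (key, sum(values))

    documents = ["the quick brown fox", "the lazy dog"]
    pairs = [pair for doc in documents for pair in map_task(doc)]
    print([reduce_task(k, v) for k, v in shuffle(pairs).items()])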

Apache Spark: Apache Spark is an open-source distributed computing framework for processing massive amounts of data. Application programmers use Spark through an interface built around a data structure called the resilient distributed dataset (RDD). Spark was developed to overcome the limitations of the MapReduce programming model; the RDD data structure provides support for a restricted form of distributed shared memory. Due to its in-memory processing capabilities, Spark offers good performance. Spark has interfaces to various NoSQL-based big data systems such as Cassandra, as well as to Amazon’s cloud platform, and it supports SQL capabilities with Spark SQL. More details on Spark can be found in [SPAR].
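
As a brief illustration of the RDD interface, the following PySpark sketch builds an RDD and runs a word count on it; it assumes a local Spark installation with the pyspark package available, and the data is illustrative.

    from pyspark import SparkContext

    # Run Spark locally; on a cluster, the master URL would differ.
    sc = SparkContext("local[*]", "rdd-word-count")

    lines = sc.parallelize(["the quick brown fox", "the lazy dog"])
    counts = (lines.flatMap(lambda line: line.split())  # one record per word
                   .map(lambda word: (word, 1))         # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))    # aggregate per word

    print(counts.collect())
    sc.stop()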

Apache Pig: Apache Pig is a scripting platform for analyzing and processing large datasets. It enables Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig converts a Pig Latin script into MapReduce jobs, which are then executed by Hadoop on the data stored in HDFS. Pig Latin programming is similar to specifying a query execution plan; that is, a Pig Latin script can be regarded as an execution plan, which makes it simpler for programmers to carry out their tasks. More details on Pig can be found in [PIG].

Apache Storm: Apache Storm is an open-source distributed real-time computation system for processing massive amounts of streaming data and carrying out real-time analytics. It can be integrated with HDFS, and it provides features such as scalability, reliability, and fault tolerance. The latest version of Storm supports streaming SQL, predictive modeling, and integration with systems such as Kafka. In summary, Storm is for real-time processing, whereas Hadoop is for batch processing. More details on Storm can be found in [STOR].
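
Storm structures a computation as a topology: spouts act as sources of tuples, and bolts transform the streams flowing through them. The plain-Python sketch below mimics that dataflow conceptually; it does not use Storm’s actual API (real topologies are typically written in Java), and all names are illustrative.

    def sentence_spout():
        # Acts like a spout: an unbounded source of tuples.
        for sentence in ["the cat sat", "the dog ran"]:
            yield sentence

    def split_bolt(stream):
        # Acts like a bolt: transforms each incoming tuple.
        for sentence in stream:
            for word in sentence.split():
                yield word

    def count_bolt(stream):
        # A second bolt that maintains running counts per word.
        counts = {}
        for word in stream:
            counts[word] = counts.get(word, 0) + 1
            yield (word, counts[word])

    # Wiring spout -> bolt -> bolt mimics a topology graph.
    for word, count in count_bolt(split_bolt(sentence_spout())):
        print(word, count)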

Apache Flink: Flink is an open-source scalable stream processing framework. As stated in [FLIN], Flink (i) provides results that are accurate, even in the case of out-of-order or late-arriving data, (ii) is stateful and fault tolerant and can seamlessly recover from failures while maintaining exactly-once application state, and (iii) performs at large scale, running on thousands of nodes with very good throughput and latency characteristics. Flink is essentially a distributed data flow engine implemented in Scala and Java. It executes programs in a parallel and pipelined manner and supports Java, Python, and SQL programming environments. While it does not have its own data storage, it integrates with systems such as HDFS, Kafka, and Cassandra.

Apache Kafka: Kafka was initially developed by LinkedIn and then further developed as an open-source Apache project. It is also implemented in Scala and Java and is a distributed stream processing system. It is highly scalable and handles massive amounts of streaming data. Its storage layer is based on a publish/subscribe (pub/sub) message queue architecture, and its design is essentially based on distributed transaction logs, which are used in database systems to recover from transaction failures. More details on Kafka can be found in [KAFK].
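
As a small illustration of the pub/sub model, the sketch below publishes a message to a topic and then consumes it using the third-party kafka-python client; the broker address and topic name are assumptions, and a Kafka broker must be running for the code to work.

    from kafka import KafkaProducer, KafkaConsumer

    # Assumed broker address and topic name; adjust for your deployment.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-readings", b'{"device": 42, "temp": 21.5}')
    producer.flush()

    consumer = KafkaConsumer("sensor-readings",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.offset, message.value)
        break  # stop after one message for this sketch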

7.3 BDMA Systems and Tools

In this section, we will discuss the various BDMA systems and tools. We first describe big data systems that are based on SQL: Apache Hive and Google BigQuery. Then we discuss NoSQL (non-SQL) databases in general. This is followed by a discussion of examples of NoSQL systems such as Google BigTable, HBase, MongoDB, Cassandra, CouchDB, and the Oracle NoSQL Database. Finally, we discuss two data mining/machine learning systems for big data: Weka and Apache Mahout.

7.3.1 Apache Hive

Apache Hive is an open-source SQL-like database/data warehouse that is implemented on top of the Hadoop/MapReduce platform. It was initially developed by Facebook to manage its data and later became an open-source project and a trademark of Apache. Hive manages very large datasets and functions on top of the Hadoop/MapReduce storage model. It provides an SQL-like query language called HiveQL. Since Hadoop is implemented in Java, queries would otherwise have to be written against a low-level Java application programming interface (API); HiveQL removes that need. The Hive engine converts the SQL-like queries into MapReduce jobs that are then executed by Hadoop. More details on Apache Hive can be found in [HIVE].
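
The sketch below submits a HiveQL query from Python using PyHive, one of several third-party clients for HiveServer2; the host, port, and table name are assumptions.

    from pyhive import hive  # third-party client for HiveServer2

    # Assumed host/port; a table named "users" must exist.
    connection = hive.connect(host="localhost", port=10000)
    cursor = connection.cursor()

    # HiveQL looks like SQL; the Hive engine compiles this query into
    # MapReduce jobs that run over the data stored in HDFS.
    cursor.execute("""
        SELECT country, COUNT(*) AS num_users
        FROM users
        GROUP BY country
    """)
    for row in cursor.fetchall():
        print(row)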

7.3.2 Google BigQuery

BigQuery is essentially a data warehouse that manages petabyte-scale data. It runs on Google’s infrastructure and can process SQL queries and carry out analytics extremely fast; for example, terabytes of data can be scanned in seconds and petabytes in minutes. BigQuery data is stored in different types of tables: native tables hold data managed by BigQuery itself, views are virtual tables defined by queries, and external tables reference data stored outside BigQuery. BigQuery can be accessed in many ways, such as command line tools, a RESTful interface, a web user interface, and client libraries (e.g., Java, .NET, and Python). More details on BigQuery can be found in [BIGQ].
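
The following sketch runs an SQL query through the google-cloud-bigquery Python client library; it assumes Google Cloud credentials and a default project are configured in the environment, and it queries one of Google’s public sample datasets.

    from google.cloud import bigquery

    # Assumes credentials and a default project are configured.
    client = bigquery.Client()

    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """
    # The query executes on Google's infrastructure; only results return.
    for row in client.query(query).result():
        print(row.name, row.total)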

7.3.3 NoSQL Database

NoSQL database is a generic term for an essentially nonrelational database designed for high performance and scalability on the web. The data models for NoSQL databases may include graphs, document structures, and key–value pairs. It can be argued that the databases developed in the 1960s, such as IBM’s Information Management System (IMS) and those based on the network data model, are NoSQL databases. Furthermore, the object-oriented data models developed in the 1990s led the way to the NoSQL databases of the 2000s. What distinguishes NoSQL databases from the older hierarchical, network, and object databases is that NoSQL databases have been designed with the web in mind; that is, the goal is to rapidly access massive amounts of data on the web.

The most popular NoSQL database model is the key–value pair. While a relational database consists of a collection of relations, each with a set of labeled attributes declared in the schema, a key–value NoSQL database has tables with just two columns: key and value. The key could be anything, such as a person’s name or the ticker symbol of a stock, while the value could be a collection of attributes, such as the name of the stock, its price, and other information such as whether to buy the stock and, if so, the recommended quantity. Therefore, all the information pertaining to a stock can be retrieved without having to perform many joins. Some of the popular NoSQL databases will be discussed in this section (e.g., MongoDB and HBase). For a detailed discussion of NoSQL databases, we refer the reader to [NOSQ].
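
The stock example above can be sketched with an ordinary Python dictionary, which is itself an in-memory key–value store; the field names are illustrative. Note that, unlike relational attributes, the values need not share a schema.

    # Key: ticker symbol. Value: an arbitrary collection of attributes
    # stored together, so no joins are needed to assemble the record.
    stocks = {
        "ACME": {
            "name": "Acme Corp.",
            "price": 102.5,
            "recommendation": "buy",
            "quantity": 100,
        },
        "GLOBEX": {
            "name": "Globex Inc.",
            "price": 58.3,
            "recommendation": "hold",  # no quantity: schema-free values
        },
    }

    # A single lookup by key retrieves everything known about the stock.
    print(stocks["ACME"]["recommendation"], stocks["ACME"].get("quantity"))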

7.3.4 Google BigTable

BigTable is one of the early NoSQL databases, running on top of the Google File System (GFS). It is now provided as a service in the cloud. BigTable maps a row key and a column, together with a timestamp, into a byte array; that is, it is essentially a NoSQL database based on the key–value pair model. It was designed to handle petabyte-sized data and uses compression algorithms when the data gets too large. Each table in BigTable has many dimensions and may be divided into what are called tablets to work with GFS. BigTable is used by many applications, including Google’s YouTube, Google Maps, Google Earth, and Gmail. More details on BigTable can be found in [BIGT].
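
A rough sketch of the (row key, column, timestamp) → byte array model, written against the google-cloud-bigtable Python client for the hosted Cloud Bigtable service; the project, instance, table, and column-family names are all assumptions and must already be provisioned.

    from google.cloud import bigtable

    # Assumed identifiers; the table needs a column family named "stats".
    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("page-views")

    # Write: (row key, family:qualifier, implicit timestamp) -> bytes.
    row = table.direct_row(b"com.example/index.html")
    row.set_cell("stats", b"views", b"1024")
    row.commit()

    # Read the cells back by row key.
    print(table.read_row(b"com.example/index.html"))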

7.3.5 Apache HBase

HBase is an open-source nonrelational distributed database that was developed for the Hadoop/MapReduce platform. That is, HBase is a NoSQL database based on a column-oriented key–value data store model. It is implemented in Java, and queries are executed as MapReduce jobs. It is somewhat similar to Google’s BigTable and uses compression for in-memory storage. HBase is scalable and handles billions of rows with millions of columns. It also integrates multiple data stores in different formats and facilitates the storage of sparse data. More details on HBase can be found in [HBAS].
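
The sketch below writes and reads a row through happybase, a widely used third-party Python client that talks to HBase via its Thrift gateway; the host, table, and column-family names are assumptions, and the Thrift server must be running.

    import happybase  # third-party HBase client (uses the Thrift gateway)

    # Assumed host and an existing table with column family "info".
    connection = happybase.Connection("localhost")
    table = connection.table("users")

    # Columns are addressed as b"family:qualifier"; values are byte arrays.
    table.put(b"user-001", {b"info:name": b"Alice", b"info:city": b"Dallas"})

    row = table.row(b"user-001")
    print(row[b"info:name"])  # b'Alice'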

7.3.6 MongoDB

MongoDB is a NoSQL database. It is a cross-platform, open-source, distributed database that is mainly document-oriented: the documents are stored in a JSON-like format. It supports field and range queries as well as regular expression-based searches. It supports data replication, and load balancing occurs through horizontal scaling. Batch processing of data as well as aggregation operations can be carried out through MapReduce. More details of MongoDB can be found at [MONG].
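
The following sketch inserts a document and issues a combined range and regular-expression query using the pymongo client; the server address, database, and collection names are assumptions.

    from pymongo import MongoClient

    # Assumes a MongoDB server on the default local port; the database
    # and collection are illustrative and created on first use.
    client = MongoClient("mongodb://localhost:27017")
    collection = client.marketdb.stocks

    collection.insert_one({"symbol": "ACME", "price": 102.5,
                           "exchange": "NYSE"})

    # A range query combined with a regular-expression search.
    for doc in collection.find({"price": {"$gt": 100},
                                "symbol": {"$regex": "^AC"}}):
        print(doc)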

7.3.7 Apache Cassandra

Cassandra is a NoSQL distributed database. It was first developed at Facebook to power Facebook applications and later became an open-source Apache project. It was designed with no single point of failure in mind: it supports clusters that span multiple data centers, and all the nodes in a cluster perform the same function. It supports replication, is highly scalable and fault tolerant, and is integrated with Hadoop with support for MapReduce. The query language supported by Cassandra is the Cassandra Query Language (CQL), an alternative to SQL. It can also be accessed from programs written in languages such as Java, C++, and Python. More details of Cassandra can be found at [CASS].
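
The sketch below creates a keyspace and table and runs CQL statements through the DataStax cassandra-driver for Python; the node address and the keyspace/table names are assumptions.

    from cassandra.cluster import Cluster  # DataStax Python driver

    # Assumes a Cassandra node running on localhost.
    session = Cluster(["127.0.0.1"]).connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.stocks (
            symbol text PRIMARY KEY, price double)
    """)

    # CQL reads much like SQL; %s placeholders bind the parameters.
    session.execute("INSERT INTO demo.stocks (symbol, price) VALUES (%s, %s)",
                    ("ACME", 102.5))
    for row in session.execute("SELECT symbol, price FROM demo.stocks"):
        print(row.symbol, row.price)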

7.3.8 Apache CouchDB

As stated in [COUC], CouchDB enables one to access data by implementing the Couch replication protocol, which has been implemented by numerous platforms, from clusters to the web to mobile phones. CouchDB is a NoSQL database implemented in Erlang, a language designed for concurrency. It uses JSON to store the data and JavaScript as its query language. More details on CouchDB can be found in [COUC].
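
Because CouchDB exposes its functionality over HTTP with JSON payloads, a plain HTTP client is enough to use it. The sketch below uses the third-party requests library against an assumed local server; the credentials and the database/document names are illustrative.

    import requests  # CouchDB speaks plain HTTP + JSON

    base = "http://admin:password@localhost:5984"  # assumed server/credentials

    # Create a database, then create a document with a chosen id.
    requests.put(f"{base}/stocks")
    requests.put(f"{base}/stocks/acme",
                 json={"name": "Acme Corp.", "price": 102.5})

    # Fetch the document back; CouchDB adds _id and _rev fields.
    print(requests.get(f"{base}/stocks/acme").json())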

7.3.9 Oracle NoSQL Database

Oracle is one of the premier relational database vendors and has marketed relational database products since the late 1970s. It offered object-relational database products in the 1990s and, more recently, the Oracle NoSQL Database. The NoSQL database is based on the key–value pair model: each row has a unique key and a value that is of arbitrary length and is interpreted by the application. The Oracle NoSQL Database is a shared-nothing system and is distributed across what are called multiple shards in a cluster. The data is replicated across the storage nodes within a shard for availability. The data can be accessed via programs written in languages such as Java, C, and Python, as well as via RESTful web services. More details on the Oracle NoSQL Database can be found at [ORAC].

7.3.10 Weka

Weka is an open-source software product that implements a collection of data mining techniques, from association rule mining to classification to clustering. It has been designed, developed, and maintained at the University of Waikato in New Zealand. Weka 3, a version of Weka, operates on big datasets. While earlier versions of Weka required the entire dataset to be loaded into memory to carry out, say, classification, the big data version carries out incremental loading and classification. Weka 3 also supports distributed data mining with map and reduce tasks and provides wrappers for Hadoop and Spark. More details on Weka can be found in [WEKA].

7.3.11 Apache Mahout

Apache Mahout provides an environment for implementing a collection of machine learning systems. These systems are scalable and implemented on top of Hadoop. The machine learning algorithms include classification and clustering. The instructions of the machine learning algorithms are transformed into MapReduce jobs and executed on Hadoop. Mahout provides Java libraries that implement the mathematical and statistical techniques involved in machine learning algorithms. The goal is for the machine learning algorithms to operate on very large datasets. In summary, as stated in [MAHO], Mahout provides the following three features: (i) a simple and extensible programming environment and framework for building scalable algorithms; (ii) a wide variety of premade algorithms for Scala + Apache Spark, H2O, and Apache Flink; and (iii) Samsara, a vector math experimentation environment with R-like syntax which works at scale. More details on Mahout can be found at [MAHO].

7.4 Cloud Platforms

In this section, we will discuss how some cloud service providers are supporting big data management. In particular, we discuss Amazon’s DynamoDB and Microsoft’s Cosmos DB. Then we discuss the cloud-based big data solutions provided by IBM and Google.

7.4.1 Amazon Web Services’ DynamoDB

DynamoDB is a NoSQL database that is part of the Amazon Web Services (AWS) product portfolio. It supports both document and key-value store models. High performance and high throughput are the goals of DynamoDB, and its storage can expand and shrink as needed by the applications. It also has an in-memory cache, Amazon DynamoDB Accelerator (DAX), that can provide microsecond response times for millions of requests per second. More details on DynamoDB can be found at [DYNA].
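
The sketch below writes and reads an item with boto3, the AWS SDK for Python; it assumes AWS credentials are configured and that a table named "Stocks" with partition key "symbol" already exists.

    import boto3  # AWS SDK for Python

    # Assumes configured credentials and an existing "Stocks" table
    # whose partition key is "symbol".
    table = boto3.resource("dynamodb").Table("Stocks")

    # Key-value/document style: each item is keyed by "symbol" and can
    # carry arbitrary additional attributes.
    table.put_item(Item={"symbol": "ACME", "price": "102.5",
                         "recommendation": "buy"})

    response = table.get_item(Key={"symbol": "ACME"})
    print(response.get("Item"))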

7.4.2 Microsoft Azure’s Cosmos DB

Cosmos DB is a database that runs on Microsoft’s cloud platform, Azure. It was developed with scalability and high performance in mind. It has a distributed model with replication for availability, and its scalable architecture enables support for multiple data models and programming languages. As stated in [COSM], “the core type system of Azure Cosmos DB’s database engine is atom-record-sequence based. Atoms consist of a small set of primitive types, for example, string, Boolean, number, and so on. Records are structs and sequences are arrays consisting of atoms, records, or sequences.” Developers use Cosmos DB by provisioning a database account. The notion of a container is used to store items as well as stored procedures, triggers, and user-defined functions. The entities under a database account, including the databases, containers, and permissions, are called resources. Data in containers is horizontally partitioned. More details of Cosmos DB can be found in [COSM].
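
The sketch below upserts and queries an item using the azure-cosmos Python SDK; the account endpoint and key are placeholders, and the database and container are assumed to already exist under the provisioned account.

    from azure.cosmos import CosmosClient  # Azure Cosmos DB Python SDK

    # Placeholder endpoint and key for a provisioned database account.
    client = CosmosClient("https://myaccount.documents.azure.com:443/",
                          credential="<account-key>")
    container = (client.get_database_client("marketdb")
                       .get_container_client("stocks"))

    # Items in a container are JSON documents keyed by "id".
    container.upsert_item({"id": "acme", "symbol": "ACME", "price": 102.5})

    for item in container.query_items(
            query="SELECT * FROM c WHERE c.symbol = 'ACME'",
            enable_cross_partition_query=True):
        print(item)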

7.4.3 IBM’s Cloud-Based Big Data Solutions

IBM is a leader in cloud computing, including managing big data in the cloud. It has developed an architecture using a collection of hardware and software that can host a number of big data systems. IBM offers database-as-a-service and provides support for data management as well as data analytics. For example, Cloudant is a NoSQL data layer, dashDB is a cloud-based data warehousing and analytics system, and ElephantSQL offers the open-source PostgreSQL database as a service in the cloud. In addition, the IBM cloud provides support for a number of data management capabilities, including stream computing and content management. More details on IBM’s cloud big data solutions can be found in [IBM].

7.4.4 Google’s Cloud-Based Big Data Solutions

The Google cloud platform is a comprehensive collection of hardware and software that enables users to obtain various services from Google. The cloud supports both the Hadoop/MapReduce and the Spark platforms. In addition to letting users access the data in BigTable and BigQuery, Google also provides solutions for carrying out analytics as well as for accessing systems such as YouTube. More details on Google’s cloud platform can be found in [GOOG].

7.5 Summary and Directions

In this chapter, we have discussed three types of big data systems. First, we discussed what we call infrastructures (which we also call frameworks). These are essentially massive data processing platforms such as Apache Hadoop, Spark, Storm, and Flink. Then we discussed various big data management systems, including SQL- and NoSQL-based systems. This was followed by a discussion of big data analytics systems. Finally, we discussed cloud platforms that provide capabilities for managing massive amounts of data.

As we have mentioned, we have experimented with several of these systems. They include Apache Hadoop, Storm, MapReduce, Hive, MongoDB, Weka, and Amazon AWS. Some of the experimental systems we have developed will be discussed in Part IV of this book. We have also developed lectures on all of the products discussed in this chapter for our course on Big Data Analytics.

We believe that BDMA technologies have exploded over the past decade. Many of the systems we have discussed in this chapter did not exist 15 years ago. We expect the technologies to continue to explode. Therefore, it is important for us to not only keep up with the literature, but also experiment with the big data technologies. As progress is made, we will have a better idea as to what system to use when, where, why, and how.

References

[HADO]. http://hadoop.apache.org/

[SPAR]. http://spark.apache.org/

[PIG]. https://pig.apache.org/

[HIVE]. https://hive.apache.org/

[STOR]. http://storm.apache.org/

[MONG]. https://www.mongodb.com/

[CASS]. http://cassandra.apache.org/

[BIGT]. https://cloud.google.com/bigtable/

[BIGQ]. https://cloud.google.com/bigquery/

[WEKA]. http://www.cs.waikato.ac.nz/ml/weka/bigdata.html

[ORAC]. https://www.oracle.com/big-data/index.html

[IBM]. https://www-01.ibm.com/software/data/bigdata/

[GOOG]. https://cloud.google.com/solutions/big-data/

[COSM]. https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/

[DYNA]. https://aws.amazon.com/dynamodb/

[KAFK]. http://kafka.apache.org/

[FLIN]. https://flink.apache.org/

[MAHO]. http://mahout.apache.org/

[NOSQ]. http://nosql-database.org/

[COUC]. http://couchdb.apache.org/

[MAPR]. https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

[HBAS]. https://hbase.apache.org/
