The Hadoop ecosystem

This section could more accurately be titled the Apache ecosystem. Hadoop, like all the other projects discussed in this section, is an Apache project. Apache is used loosely as shorthand for the open source projects supported by the Apache Software Foundation. The foundation has its roots in the development of the Apache HTTP Server in the mid-1990s, and today it is a collaborative global initiative made up entirely of volunteers who release open source software to the global technical community.

Hadoop started out as, and still is, one of the projects in the Apache ecosystem. Due to its popularity, many other Apache projects have become linked to Hadoop, directly or indirectly, because they support key functionality in a Hadoop environment. That said, it is important to bear in mind that most of these projects can exist as independent products that function without Hadoop. Whether they would provide optimal functionality on their own is a separate topic.

In this section, we'll go over some of the Apache projects that have strongly influenced the growth and usability of Hadoop as a standard enterprise IT solution, as detailed in the following table:

Product

Functionality

Apache Pig

Apache Pig is a platform whose language, Pig Latin, is specifically designed to express MapReduce programs through concise statements that define workflows. Coding MapReduce programs directly, for example in Java, can be quite complex, and Pig provides a simpler abstraction for expressing MapReduce workflows and complex Extract-Transform-Load (ETL) jobs. Pig programs can be run interactively in the Grunt shell or submitted as scripts.

Apache HBase

Apache HBase is a distributed column-oriented database that sits on top of HDFS. It is modelled on Google's Bigtable, in which data is organized by columns grouped into column families. HBase supports low-latency reads and writes across tables with billions of rows and is well suited to tasks that require direct random access to data. More concretely, HBase indexes data in three dimensions: row, column, and timestamp. It can also represent data with an arbitrary number of columns, because column values are stored as key-value pairs within the cells of an HBase table.

Apache Hive

Apache Hive provides a SQL-like dialect to query data stored in HDFS. Hive stores table data as files, in plain-text or serialized binary formats, in a folder-like structure in HDFS. As in traditional database management systems, the data is organized into tables, and each table can be partitioned on user-selected attributes. A table corresponds to a higher-level directory, and its partitions are subfolders of that directory. A third level of abstraction is provided by buckets, which correspond to files within the partitions of a Hive table.

Apache Sqoop

Sqoop is used to transfer data in bulk between traditional relational databases and HDFS. Large enterprises that have data stored in relational database management systems can thus use Sqoop to move data from their data warehouse into a Hadoop implementation.

Apache Flume

Flume is used to collect, aggregate, and move large volumes of log data, typically into HDFS.

Apache Kafka

Kafka is a distributed publish/subscribe messaging system that can be used to process streaming data in real time and subsequently persist it (for example, in HDFS).

Apache Oozie

Oozie is a workflow management system designed to schedule Hadoop jobs. It represents each workflow as a directed acyclic graph (DAG) of actions, a key concept that will be discussed in our section on Spark.

Apache Spark

Spark is one of the most significant projects in Apache and was designed to address some of the shortcomings of the HDFS-MapReduce model. It started as a relatively small project at UC Berkeley and evolved rapidly to become one of the most prominent alternatives to Hadoop MapReduce for analytical tasks. Spark has seen widespread adoption across the industry and comprises various subprojects that provide additional capabilities, such as machine learning and streaming analytics.
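
The HBase entry in the table above describes data indexed by row, column, and timestamp, with columns stored as key-value pairs. The following is a minimal in-memory sketch of that idea in Python, not the HBase client API. The class name ToyVersionedTable, its methods, and the sample row and column names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ToyVersionedTable:
    """Toy stand-in for a table indexed by (row, column, timestamp).

    This is an illustration of the data model only; it is not HBase
    and does not use any HBase API.
    """

    def __init__(self):
        # row key -> column name -> {timestamp: value}
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._cells[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp`.

        With timestamp=None, the latest version is returned.
        Returns None when no matching version exists.
        """
        versions = self._cells.get(row, {}).get(column, {})
        candidates = [
            ts for ts in versions if timestamp is None or ts <= timestamp
        ]
        if not candidates:
            return None
        return versions[max(candidates)]

    def columns(self, row):
        """List the columns present in a row; rows need not share columns."""
        return sorted(self._cells.get(row, {}))


# Hypothetical sample data: two rows with different sets of columns.
table = ToyVersionedTable()
table.put("user1", "info:name", "Ada", timestamp=1)
table.put("user1", "info:name", "Ada L.", timestamp=2)
table.put("user2", "info:email", "x@example.com", timestamp=1)

latest_name = table.get("user1", "info:name")
older_name = table.get("user1", "info:name", timestamp=1)
```

Because each row holds its own mapping of column names to values, the two rows in the sample carry different columns without any schema change, which is the flexibility the table entry refers to.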

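The Hive entry in the table above describes tables as directories, partitions as subfolders, and buckets as files within a partition. The sketch below shows how such a layout could be derived in Python. The key=value naming of partition directories follows Hive's convention, but the table root, the bucket file naming, and the CRC32 hash are assumptions made for illustration; Hive uses its own hash function and file names.

```python
import zlib
from pathlib import PurePosixPath


def partition_dir(table_root, partition_values):
    """Build table_root/key1=value1/key2=value2 from an ordered mapping."""
    path = PurePosixPath(table_root)
    for key, value in partition_values.items():
        path = path / f"{key}={value}"
    return path


def bucket_index(bucket_key, num_buckets):
    """Map a bucketing column value to a bucket number.

    CRC32 is a deterministic stand-in here, not Hive's actual hash.
    """
    return zlib.crc32(str(bucket_key).encode("utf-8")) % num_buckets


def bucket_file(table_root, partition_values, bucket_key, num_buckets):
    """Return the (hypothetical) file that would hold a given row."""
    index = bucket_index(bucket_key, num_buckets)
    return partition_dir(table_root, partition_values) / f"bucket_{index:05d}"


# Hypothetical table partitioned by year and country, bucketed by customer id.
sales_partition = partition_dir(
    "/warehouse/sales", {"year": 2017, "country": "US"}
)
sales_file = bucket_file(
    "/warehouse/sales", {"year": 2017, "country": "US"}, "customer-42", 8
)
```

A query that filters on year and country only needs to read the files under the matching subfolder, which is the practical benefit of partitioning on attributes that are commonly used in filters.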