Apache Hadoop

The first, and probably still most widely used, framework for big data processing is Apache Hadoop. Its foundation is the Hadoop Distributed File System (HDFS). Developed at Yahoo! in the 2000s, HDFS originally served as an open source alternative to the Google File System (GFS), the distributed file system Google built to store its search index across many machines.

Hadoop also implemented an open source alternative to Google's proprietary MapReduce system: Hadoop MapReduce. Together with HDFS, it constitutes a framework for distributed storage and computation. Written in Java, with bindings for most programming languages and an ecosystem of projects that provide simpler, higher-level interfaces (some based on SQL querying, such as Apache Hive), it is a system that can reliably store and process terabytes or even petabytes of data.
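To give a feel for the MapReduce programming model, the following is a minimal sketch of the canonical word-count job written against Hadoop's Java API (the org.apache.hadoop.mapreduce package). The input and output paths here are assumed to arrive as command-line arguments; this is an illustrative sketch, not production code.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: for every token in an input line, emit the pair (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum all counts emitted for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner to pre-aggregate counts on the map side.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Compiled into a JAR and submitted with the hadoop jar command, such a job leaves the heavy lifting to the framework: splitting the input across mappers, shuffling the intermediate (word, count) pairs to reducers by key, and writing the results back to HDFS.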

In later versions, Hadoop became more modular with the introduction of Yet Another Resource Negotiator (YARN), which separates cluster resource management from data processing and provides an abstraction for building applications on top of Hadoop. This has enabled a range of applications to run on Hadoop clusters, such as Storm, Tez, Open MPI, Giraph, and, of course, Apache Spark, as we will see in the next sections.

Hadoop MapReduce is a batch-oriented system: each job reads its full input, processes it in batches, and writes its full output before completing, so it is not designed for real-time or low-latency use cases.
