Big Data technologies – Hadoop and Spark

Data management at scale, that is, at hundreds of gigabytes and beyond, requires the use of multiple machines that form a cluster, so that read, write, and compute operations run in parallel, distributed across the machines.

The Hadoop ecosystem emerged as an open-source software framework for distributed storage and processing of big data using the MapReduce programming model developed at Google. The ecosystem has diversified under the roof of the Apache Software Foundation and today includes numerous projects that cover different aspects of data management at scale. Key tools within the Hadoop ecosystem include:

  • Apache Pig: A data processing language, developed at Yahoo, for implementing large-scale extract-transform-load (ETL) pipelines using MapReduce
  • Apache Hive: The de facto standard for interactive SQL queries over petabytes of data, developed at Facebook (see the query sketch after this list)
  • Apache HBase: A NoSQL database for real-time read/write access that scales linearly to billions of rows and millions of columns, and combines data sources that use a variety of different schemas
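To illustrate the interactive SQL workflow that Hive popularized, the following minimal sketch runs a Hive-style aggregation through PySpark's Hive integration; the trades table and its columns (symbol, ts, volume) are hypothetical placeholders, not part of the source.

    from pyspark.sql import SparkSession

    # Start a session with Hive support so SQL runs against the Hive metastore
    spark = (SparkSession.builder
             .appName("hive_query_sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical table `trades`: aggregate daily volume per symbol in plain SQL
    daily_volume = spark.sql("""
        SELECT symbol,
               to_date(ts) AS trade_date,
               SUM(volume) AS total_volume
        FROM   trades
        GROUP  BY symbol, to_date(ts)
    """)
    daily_volume.show(10)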

Apache Spark has become the most popular platform for interactive analytics on a cluster. The MapReduce framework allowed for parallel computation but required repeated read/write operations to disk to persist intermediate results and ensure fault tolerance. Spark has dramatically accelerated computation at scale thanks to its Resilient Distributed Dataset (RDD) abstraction, which allows for highly optimized in-memory computation. This includes the iterative computation required for optimization, for example, gradient descent for numerous ML algorithms.
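To make the benefit of in-memory iteration concrete, here is a minimal sketch of batch gradient descent for linear regression over a cached RDD; the synthetic data, learning rate, and iteration count are illustrative choices, not part of the source.

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd_gradient_descent").getOrCreate()
    sc = spark.sparkContext

    # Illustrative data: 1,000 points drawn from y = 2*x1 + 3*x2 plus noise
    rng = np.random.default_rng(seed=42)
    X = rng.normal(size=(1000, 2))
    y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=1000)

    # Cache the RDD so every iteration reuses the in-memory partitions
    # instead of re-reading the data from disk, as MapReduce would
    points = sc.parallelize(list(zip(X, y))).cache()
    n = points.count()

    w = np.zeros(2)          # weight vector to learn
    learning_rate = 0.1
    for _ in range(50):
        # One full-batch gradient step; map/reduce run in parallel on the cluster
        gradient = points.map(lambda p: (p[0] @ w - p[1]) * p[0]) \
                         .reduce(lambda a, b: a + b)
        w -= learning_rate * gradient / n

    print(w)  # converges toward [2.0, 3.0]

Because the RDD is cached, each of the 50 passes reuses the in-memory partitions; an equivalent MapReduce job would write and re-read intermediate results from disk on every iteration.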
