Big Data technologies – Hadoop and Spark

Data management at scale, that is, at hundreds of gigabytes and beyond, requires the use of multiple machines that form a cluster, so that read, write, and compute operations run in parallel, distributed across the machines.

The Hadoop ecosystem emerged as an open-source software framework for distributed storage and processing of big data using the MapReduce programming model developed at Google. The ecosystem has diversified under the roof of the Apache Software Foundation and today includes numerous projects that cover different aspects of data management at scale. Key tools within the Hadoop ecosystem include:

  • Apache Pig: A data processing language, developed at Yahoo, for implementing large-scale extract-transform-load (ETL) pipelines using MapReduce
  • Apache Hive: The de facto standard for interactive SQL queries over petabytes of data, developed at Facebook (see the query sketch after this list)
  • Apache HBase: A NoSQL database for real-time read/write access that scales linearly to billions of rows and millions of columns, and combines data sources that use a variety of different schemas
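To illustrate the interactive SQL workflow that Hive popularized, the following minimal sketch runs a Hive-style aggregation through PySpark's Hive integration; the trades table and its columns (symbol, ts, volume) are hypothetical placeholders, not part of the source.

    from pyspark.sql import SparkSession

    # Start a session with Hive support so SQL runs against the Hive metastore
    spark = (SparkSession.builder
             .appName("hive_query_sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical table `trades`: aggregate daily volume per symbol in plain SQL
    daily_volume = spark.sql("""
        SELECT symbol,
               to_date(ts) AS trade_date,
               SUM(volume) AS total_volume
        FROM   trades
        GROUP  BY symbol, to_date(ts)
    """)
    daily_volume.show(10)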

Apache Spark has become the most popular platform for interactive analytics on a cluster. The MapReduce framework allowed for parallel computation but required repeated read/write operations to disk to persist intermediate results and ensure fault tolerance. Spark has dramatically accelerated computation at scale thanks to its Resilient Distributed Dataset (RDD) abstraction, which allows for highly optimized in-memory computation. This includes the iterative computation required for optimization, for example, gradient descent for numerous ML algorithms.
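To make the benefit of in-memory iteration concrete, here is a minimal sketch of batch gradient descent for linear regression over a cached RDD; the synthetic data, learning rate, and iteration count are illustrative choices, not part of the source.

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd_gradient_descent").getOrCreate()
    sc = spark.sparkContext

    # Illustrative data: 1,000 points drawn from y = 2*x1 + 3*x2 plus noise
    rng = np.random.default_rng(seed=42)
    X = rng.normal(size=(1000, 2))
    y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=1000)

    # Cache the RDD so every iteration reuses the in-memory partitions
    # instead of re-reading the data from disk, as MapReduce would
    points = sc.parallelize(list(zip(X, y))).cache()
    n = points.count()

    w = np.zeros(2)          # weight vector to learn
    learning_rate = 0.1
    for _ in range(50):
        # One full-batch gradient step; map/reduce run in parallel on the cluster
        gradient = points.map(lambda p: (p[0] @ w - p[1]) * p[0]) \
                         .reduce(lambda a, b: a + b)
        w -= learning_rate * gradient / n

    print(w)  # converges toward [2.0, 3.0]

Because the RDD is cached, each of the 50 passes reuses the in-memory partitions; an equivalent MapReduce job would write and re-read intermediate results from disk on every iteration.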
