Distributed computing using Apache Hadoop

Our world is filled with devices: smart refrigerators, smart watches, phones, tablets, laptops, kiosks at the airport, ATMs dispensing cash, and many more. We are able to do things we could not even imagine just a few years ago. Instagram, Snapchat, Gmail, Facebook, Twitter, and Pinterest are a few of the applications we are now so used to that it is difficult to imagine a day without access to them.

With the advent of cloud computing, a few clicks are enough to launch hundreds, if not thousands, of machines on AWS, Microsoft Azure, or Google Cloud, among others, and apply immense resources to business goals of all sorts.

Cloud computing has introduced us to the concepts of IaaS, PaaS, and SaaS, which give us the ability to build and operate scalable infrastructures serving all types of use cases and business needs.

IaaS (Infrastructure as a Service) - Reliably managed hardware is provided, without the need for a data center, power cords, air conditioning, and so on.
PaaS (Platform as a Service) - On top of IaaS, managed platforms such as Windows, Linux, databases, and so on are provided.
SaaS (Software as a Service) - On top of PaaS, managed applications such as Salesforce, Kayak.com, and so on are provided to end users.

Behind the scenes is the world of highly scalable distributed computing, which makes it possible to store and process petabytes (PB) of data.

1 exabyte (EB) = 1,024 petabytes (~50 million Blu-ray movies)
1 petabyte (PB) = 1,024 terabytes (~50,000 Blu-ray movies)
1 terabyte (TB) = 1,024 gigabytes (~50 Blu-ray movies)
The average size of a Blu-ray disc for a movie is ~20 GB; for example, 1 PB ≈ 1,048,576 GB ÷ 20 GB ≈ 52,000 movies, which is where the rounded figures above come from.

Now, the paradigm of distributed computing is not genuinely new; it has been pursued in some shape or form for decades, primarily at research facilities as well as by a few commercial product companies. Massively Parallel Processing (MPP) was in use decades ago in areas such as oceanography, earthquake monitoring, and space exploration, and companies such as Teradata implemented MPP platforms and provided commercial products and applications. Eventually, technology companies such as Google and Amazon, among others, pushed the niche area of scalable distributed computing to a new stage of evolution, which eventually led to the creation of Apache Spark at the University of California, Berkeley.

Google published papers on MapReduce (MR) and the Google File System (GFS), which brought the principles of distributed computing to a wide audience. Of course, due credit needs to be given to Doug Cutting, who implemented the concepts described in those papers and introduced the world to Hadoop.

The Apache Hadoop framework is an open source software framework written in Java. The two main areas it provides are storage and processing. For storage, the framework uses the Hadoop Distributed File System (HDFS), which is based on the Google File System paper released in October 2003. For processing, the framework depends on MapReduce, which is based on the Google MapReduce paper released in December 2004.
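
To make this division of labor concrete, the following is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API: the mapper emits (word, 1) pairs, the reducer sums them, and the input and output live on HDFS. The class name and the input/output paths are illustrative, not taken from the text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are HDFS paths passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}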


The MapReduce framework evolved from V1 (based on the JobTracker and TaskTracker daemons) to V2 (based on YARN, which separates cluster resource management from job scheduling), as sketched below.
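
From the job author's point of view, little changes between the two versions: the same code from the word-count sketch above can run locally or be submitted to a YARN cluster, selected by the mapreduce.framework.name property, which is normally set cluster-wide in mapred-site.xml. The following is a minimal sketch assuming a Hadoop 2.x (V2) client; the identity map and reduce steps simply copy records through to show the submission path, and the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitOnYarn {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On a MapReduce V2 cluster this is usually set in mapred-site.xml;
    // "local" runs the job in-process, "yarn" submits it to the ResourceManager.
    conf.set("mapreduce.framework.name", "yarn");

    // No mapper or reducer is set, so the identity map and reduce are used:
    // input records are copied through unchanged, which is enough to show submission.
    Job job = Job.getInstance(conf, "yarn submission demo");
    job.setJarByClass(SubmitOnYarn.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}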