This section will not provide in-depth knowledge of the Hadoop architecture, but only a high-level overview so we can understand the following chapters without much difficulty.
Before Hadoop Version 2.2, MapReduce was referred to as MapReduce V1 and had a different architecture:
There are two main types of nodes, classified as HDFS nodes (the NameNode and the DataNodes) or MapReduce nodes (the JobTracker and the TaskTrackers).
MapReduce V1 dominated the big data landscape for many years, but there were a few limitations, such as:
MapReduce V2 introduced architectural changes to address these limitations. The new architecture is also called Yet Another Resource Negotiator (YARN). However, it is not mandatory to run YARN on MapReduce V2.
The new architecture retains the DataNodes but has no TaskTrackers and no JobTracker.
MapReduce V1 is supported on V2, as V1 is still widely used by organizations across the world.
YARN works on two fundamentals:
As shown in the following figure, when an application gets invoked by the client, an Application Master gets started on a NodeManager. The Application Master is then responsible for negotiating resources with the ResourceManager. These resources are assigned to containers on each slave node and the tasks are run in the containers.
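To make this flow concrete, here is a minimal toy simulation in Python. It is only a sketch of the sequence described above, not the YARN API: the class names mirror the YARN components, but the methods (`allocate`, `launch`, `run`) and the function `submit_application` are hypothetical, and resources are reduced to a simple count of container slots per node.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """A worker node offering a fixed number of container slots (toy model)."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def launch(self, task: str) -> str:
        # Run one task in one container on this node.
        self.free_slots -= 1
        self.running.append(task)
        return self.name


class ResourceManager:
    """Tracks free capacity across the cluster and grants containers."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self):
        # Grant a container on the node with the most free slots, if any.
        node = max(self.nodes, key=lambda n: n.free_slots)
        return node if node.free_slots > 0 else None


class ApplicationMaster:
    """Per-application coordinator that negotiates containers for its tasks."""

    def __init__(self, rm: ResourceManager):
        self.rm = rm

    def run(self, tasks):
        placement = {}
        for task in tasks:
            node = self.rm.allocate()
            if node is None:
                raise RuntimeError("no free containers in the cluster")
            placement[task] = node.launch(task)
        return placement


def submit_application(rm: ResourceManager, tasks):
    # The Application Master itself is started in a container on a NodeManager.
    am_node = rm.allocate()
    if am_node is None:
        raise RuntimeError("no free container for the Application Master")
    am_node.launch("application-master")
    # From here on, the Application Master negotiates with the ResourceManager.
    return ApplicationMaster(rm).run(tasks)


if __name__ == "__main__":
    cluster = [NodeManager("node1", 2), NodeManager("node2", 2)]
    print(submit_application(ResourceManager(cluster), ["map-0", "map-1", "reduce-0"]))
```

The point of the sketch is the division of labor: the ResourceManager only knows about capacity, while the per-application Application Master decides which tasks to run in the containers it is granted.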
The Hadoop architecture has been modified to support high availability of the NameNode, which is a key requirement for any business-critical application. There are now two NameNodes, one active and one standby.
As shown in the following figure, there are JournalNodes. For a basic setup of one active and one standby NameNode, there are three JournalNodes. As expected, only one of the NameNodes is active at a time. The JournalNodes hold the shared edit log and accept writes from only one NameNode at a time, which ensures that only one NameNode can act as the active one. If, for some reason, the active NameNode goes down, the standby NameNode takes over.
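The reason for using three JournalNodes is that decisions are made by majority, so the group keeps working when one of the three is down. The following Python sketch illustrates that idea with a simplified model. It is not the actual HDFS protocol: the classes `JournalNode` and `NameNodeWriter`, their methods, and the integer `epoch` used to shut out a previously active NameNode are illustrative assumptions.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


class JournalNode:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.promised_epoch = 0  # highest writer epoch accepted so far
        self.edits = []

    def new_epoch(self, epoch: int) -> bool:
        # Promise to reject any writer holding an older epoch.
        if not self.up or epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def write(self, epoch: int, edit: str) -> bool:
        # Accept the edit only from the most recently promised writer.
        if not self.up or epoch < self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


class NameNodeWriter:
    def __init__(self, name: str, journals):
        self.name = name
        self.journals = journals
        self.epoch = None  # None means this NameNode is in standby

    def become_active(self, epoch: int) -> bool:
        acks = sum(j.new_epoch(epoch) for j in self.journals)
        self.epoch = epoch if acks >= majority(len(self.journals)) else None
        return self.epoch is not None

    def log_edit(self, edit: str) -> bool:
        if self.epoch is None:
            return False
        acks = sum(j.write(self.epoch, edit) for j in self.journals)
        # The edit counts as durable only if a majority acknowledged it.
        return acks >= majority(len(self.journals))


if __name__ == "__main__":
    jns = [JournalNode(f"jn{i}") for i in range(3)]
    nn1, nn2 = NameNodeWriter("nn1", jns), NameNodeWriter("nn2", jns)
    nn1.become_active(1)
    jns[2].up = False                      # one JournalNode fails
    print(nn1.log_edit("mkdir /a"))        # still succeeds with 2 of 3
    nn2.become_active(2)                   # standby takes over
    print(nn1.log_edit("mkdir /b"))        # old active is now rejected
```

In this model, three JournalNodes tolerate one failure, and a NameNode that has been superseded can no longer get its edits accepted, so two NameNodes cannot both behave as the active one.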
Hadoop has been improved further to provide extreme scalability. There are multiple NameNodes acting independently. Each NameNode has its own namespace and therefore has control over its own set of files. However, they share all of the DataNodes.
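A small Python model may help picture this arrangement. It is a sketch under simplifying assumptions, not HDFS code: the classes, the `create_file` helper, the round-robin block placement, and the example paths are all hypothetical. Each NameNode keeps its own file table, while every DataNode stores blocks on behalf of all NameNodes.

```python
class NameNode:
    """Owns one independent namespace: file path -> block locations."""

    def __init__(self, name: str):
        self.name = name
        self.files = {}


class DataNode:
    """Shared storage: holds blocks for every NameNode, keyed by owner."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}  # (namenode name, block id) -> payload


def create_file(namenode, datanodes, path, payloads):
    """Record the file in one namespace and spread its blocks over all DataNodes."""
    locations = []
    for i, payload in enumerate(payloads):
        block_id = f"{path}#{i}"
        dn = datanodes[i % len(datanodes)]  # naive round-robin placement
        dn.blocks[(namenode.name, block_id)] = payload
        locations.append((dn.name, block_id))
    namenode.files[path] = locations


if __name__ == "__main__":
    datanodes = [DataNode("dn1"), DataNode("dn2")]
    sales, logs = NameNode("nn-sales"), NameNode("nn-logs")
    create_file(sales, datanodes, "/sales/q1.csv", [b"a", b"b"])
    create_file(logs, datanodes, "/logs/app.log", [b"c", b"d"])
    print(sorted(sales.files), sorted(logs.files))
    print({dn.name: sorted(owner for owner, _ in dn.blocks) for dn in datanodes})
```

Neither NameNode knows about the other's files, yet both rely on the same pool of DataNodes, which is what lets the namespace scale out without partitioning the storage.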
MapReduce V2, like V1, is aware of the network topology. When rack awareness is configured for your cluster, Hadoop will always try to run a task on the node with the highest-bandwidth access to the data.
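One way to picture "highest bandwidth access" is as the shortest distance in a tree of racks and nodes: the node that holds the data is at distance 0, another node in the same rack is at distance 2, and a node in a different rack is at distance 4. The Python sketch below applies that idea. The functions `distance` and `pick_node` and the `/rack/node` path strings are illustrative assumptions, not Hadoop's scheduler code.

```python
def distance(a: str, b: str) -> int:
    """Hops between two nodes given topology paths such as '/rack1/node1'."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Steps up from each node to their closest common ancestor.
    return (len(pa) - common) + (len(pb) - common)


def pick_node(replica_locations, candidate_nodes):
    """Choose the candidate that is closest to any replica of the data."""
    return min(
        candidate_nodes,
        key=lambda node: min(distance(node, r) for r in replica_locations),
    )


if __name__ == "__main__":
    replicas = ["/rack1/node1", "/rack2/node4"]
    free_nodes = ["/rack3/node7", "/rack1/node2"]
    print(pick_node(replicas, free_nodes))
```

Here the task goes to `/rack1/node2`, because it shares a rack with one of the replicas, whereas `/rack3/node7` would have to pull the data across racks.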