Hadoop YARN

YARN is a module that was introduced in Hadoop 2. In Hadoop 1, job management and monitoring were handled by processes known as the JobTracker and TaskTrackers. The NameNode, which ran the JobTracker daemon (process), would submit jobs to the DataNodes, which ran TaskTracker daemons (processes).

The JobTracker was responsible for co-ordinating all MapReduce jobs and served as the central administrator: managing processes, handling server failures, re-allocating tasks to new DataNodes, and so on. Each TaskTracker monitored the execution of the jobs running locally on its own DataNode and reported their status back to the JobTracker, as shown in the following:

JobTracker and TaskTrackers

This design worked well for a long time, but as Hadoop evolved, the demand for more sophisticated and dynamic functionality grew with it. In Hadoop 1, the NameNode, and consequently the JobTracker process, managed both job scheduling and resource monitoring. In the event the NameNode failed, all activity in the cluster would cease immediately, making it a single point of failure. Lastly, all jobs had to be expressed in MapReduce terms; that is, all code had to be written against the MapReduce framework in order to be executed.

Hadoop 2 alleviated all these concerns:

  • The process of job management, scheduling, and resource monitoring was decoupled and delegated to a new framework/module called YARN
  • A standby NameNode could be defined that would take over from the active NameNode in the event of failure
  • Further, Hadoop 2.0 would accommodate frameworks beyond MapReduce
  • Instead of fixed map and reduce slots, Hadoop 2 would leverage containers
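To make the container model concrete, the resource limits that govern containers are configured in yarn-site.xml. The property names below are the standard YARN settings; the values shown are placeholder examples only and would need to be tuned for a real cluster:

```xml
<!-- yarn-site.xml: example values only; tune for your cluster -->
<configuration>
  <!-- Total memory on each NodeManager that can be given to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <!-- Total virtual cores on each NodeManager available to containers -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <!-- Smallest and largest container the scheduler will grant -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
</configuration>
```

Because containers are sized by memory and vcores rather than fixed map/reduce slots, a single node can run a mix of small and large tasks from different frameworks at once.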

In MapReduce, all data had to be read from disk. This was fine for operations on large datasets, but it was not optimal for operations on smaller ones. In fact, any tasks that required very fast processing (low latency), were interactive in nature, or had multiple iterations (thus requiring multiple reads of the same data from disk) would be extremely slow.

By removing these dependencies, Hadoop 2 allowed developers to implement new programming frameworks that would support jobs with diverse performance requirements, such as low latency and interactive real-time querying, iterative processing required for machine learning, different topologies such as the processing of streaming data, optimizations such as in-memory data caching/processing, and so on.

A few new terms became prominent:

  • ApplicationMaster: Responsible for managing the resources needed by applications. For example, if a certain job required more memory, the ApplicationMaster would be responsible for securing the required resource. An application in this context refers to application execution frameworks such as MapReduce, Spark, and so on.
  • Containers: The unit of resource allocation (for example, 1 GB of memory and four CPU cores). An application may require several such containers to execute. The ResourceManager allocates containers for executing tasks. Once the allocation is complete, the ApplicationMaster requests the NodeManagers on the DataNodes to launch the allocated containers and takes over the management of the containers.
  • ResourceManager: A component of YARN that had the primary role of allocating resources to applications and functioned as a replacement for the JobTracker. The ResourceManager process ran on the NameNode just as the JobTracker did.
  • NodeManagers: A replacement for TaskTracker, NodeManagers were responsible for reporting the status of jobs to the ResourceManager (RM) and monitoring the resource utilization of containers.
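The interplay between these components can be sketched as a toy model. The following is a deliberately simplified, hypothetical Python sketch, not the real YARN scheduler or its API: a `ResourceManager` stand-in tracks free capacity per node and grants containers when an application asks for them, mirroring the allocation flow described above.

```python
# Toy model of YARN-style container allocation (illustrative only; the
# class names are simplified stand-ins, not the actual Hadoop classes).

class Container:
    """A unit of resource allocation: some memory (MB) and some vcores."""
    def __init__(self, node, memory_mb, vcores):
        self.node = node
        self.memory_mb = memory_mb
        self.vcores = vcores

class ResourceManager:
    """Tracks free capacity per node and hands out containers on request."""
    def __init__(self, nodes):
        # nodes: {node_name: (free_memory_mb, free_vcores)}
        self.free = dict(nodes)

    def allocate(self, memory_mb, vcores):
        """Return a Container on the first node with enough headroom, else None."""
        for node, (mem, cpu) in self.free.items():
            if mem >= memory_mb and cpu >= vcores:
                self.free[node] = (mem - memory_mb, cpu - vcores)
                return Container(node, memory_mb, vcores)
        return None

# An ApplicationMaster would ask the RM for the containers its tasks need:
rm = ResourceManager({"node1": (4096, 4), "node2": (2048, 2)})
c1 = rm.allocate(2048, 2)   # fits on node1
c2 = rm.allocate(2048, 2)   # exhausts node1
c3 = rm.allocate(4096, 4)   # no node has this much left -> None
print(c1.node, c2.node, c3)
```

In real YARN the ApplicationMaster negotiates these requests with the ResourceManager over a heartbeat protocol, and the NodeManagers then launch and monitor the granted containers; this sketch only captures the bookkeeping idea.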

The following image shows a high-level view of the ResourceManager and NodeManagers in Hadoop 2.0:

Hadoop 2.0
The prominent concepts inherent in Hadoop 2 are illustrated in the next image:
Hadoop 2.0 Concepts